
[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off #9

Closed · yantosca opened this issue Dec 12, 2018 · 6 comments

@yantosca (Contributor):

I have been looking at this issue.

I ran this on the Amazon cloud on an r5.2xlarge instance with AMI GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66).

In HISTORY.rc I turned on only these collections (a sketch of the corresponding HISTORY.rc entries follows the list):
(1) SpeciesConc_avg: archived only SpeciesConc_NO
(2) SpeciesConc_inst: archived only SpeciesConc_NO
(3) StateMet_avg: archived only Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: archived only Met_AD
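
For reference, here is a minimal sketch of what the corresponding HISTORY.rc entries might look like. The collection and field names come from the list above; the attribute values (template, format, frequency, mode) and the gridded-component name 'GIGCchem' follow the generic MAPL History format and are assumptions rather than copies from the tutorial run directory:

  COLLECTIONS: 'SpeciesConc_avg',
               'SpeciesConc_inst',
               'StateMet_avg',
               'StateMet_inst',
  ::

  # Example: the instantaneous SpeciesConc collection reduced to a single field.
  # Attribute values and the gridcomp name 'GIGCchem' are illustrative assumptions.
  SpeciesConc_inst.template:   '%y4%m2%d2_%h2%n2z.nc4',
  SpeciesConc_inst.format:     'CFIO',
  SpeciesConc_inst.frequency:  010000,
  SpeciesConc_inst.mode:       'instantaneous',
  SpeciesConc_inst.fields:     'SpeciesConc_NO', 'GIGCchem',
  ::

A collection is turned off by commenting it out of (or removing it from) the COLLECTIONS list; its attribute block can remain in the file.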

This 1-hour run on 6 cores finished and printed all timing information:

GIGCenv total:    0.346
GIGCchem total: 123.970
Dynamics total:  18.741
GCHP total:     140.931
HIST total:       0.264
EXTDATA total:  133.351

So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Maybe it is hitting the memory limits of the instance? I don't know.

This AMI was built with mpich2. It may be worth trying OpenMPI on the cloud.
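
If it helps to pin down which MPI a given environment is actually providing, two generic checks (not GCHP-specific; the executable name geos is an assumption):

  mpirun --version          # Open MPI prints "mpirun (Open MPI) x.y.z"; MPICH's Hydra launcher prints "HYDRA build details"
  ldd ./geos | grep -i mpi  # lists the MPI shared libraries the executable is linked against (executable name is an assumption)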

Also note: this run finished without dropping a core file (as currently happens on Odyssey), so the core-file behavior appears to be an Odyssey-specific environment problem.

However, if I run with no diagnostic collections turned on, the run dies at the 10-minute mark:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     - RDAER: Using online SO4 NH4 NIT!
     - RDAER: Using online BCPI OCPI BCPO OCPO!
     - RDAER: Using online SALA SALC
     - DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!
  
 Setting history variable pointers to GC and Export States:
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.638E+03  4.409E+03      2.223E+03  2.601E+03  3.258E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.823E+04  0.000E+00
MAPL_ExtDataInterpField                       3300
EXTDATA::Run_                                 1471
MAPL_Cap                                       777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0

From the traceback it looks as if the run is failing in MAPL_ExtDataInterpField, i.e., while interpolating a field in ExtData.

@yantosca (Contributor, Author):

Jiawei Zhuang wrote:

The run also crashes at 00:10 if I save only one collection, SpeciesConc_inst, containing only the two species SpeciesConc_NO and SpeciesConc_O3:

  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!

 Setting history variable pointers to GC and Export States:
 SpeciesConc_NO
 SpeciesConc_O3
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.723E+03  4.494E+03  2.306E+03  2.684E+03  3.260E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.852E+04  0.000E+00
 offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3                     54
DYNAMICSRun                                    703
GCHP::Run                                      407
MAPL_Cap                                       792

But with two collections, SpeciesConc_avg and SpeciesConc_inst, each containing only the two species SpeciesConc_NO and SpeciesConc_O3, the run is able to finish and print full timing information:

 Writing:    144 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
 Writing:    144 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4


  Times for GIGCenv
TOTAL                   :       2.252
INITIALIZE              :       0.000
RUN                     :       2.250
GenInitTot              :       0.004
--GenInitMine           :       0.003
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.000
--GenFinalMine          :       0.000
GenRecordTot            :       0.001
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000

HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
   Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc

  Times for GIGCchem
TOTAL                   :     505.760
INITIALIZE              :       3.617
RUN                     :     498.376
FINALIZE                :       0.000
DO_CHEM                 :     488.864
CP_BFRE                 :       0.121
CP_AFTR                 :       4.080
GC_CONV                 :      36.070
GC_EMIS                 :       0.000
GC_DRYDEP               :       0.119
GC_FLUXES               :       0.000
GC_TURB                 :      17.966
GC_CHEM                 :     403.528
GC_WETDEP               :      19.443
GC_DIAGN                :       0.000
GenInitTot              :       2.719
--GenInitMine           :       2.719
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.963
--GenFinalMine          :       0.963
GenRecordTot            :       0.000
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000

   -----------------------------------------------------
      Block          User time  System Time   Total Time
   -----------------------------------------------------
   TOTAL                      815.4433       0.0000     815.4433
   COMM_TOTAL                   3.3098       0.0000       3.3098
   COMM_TRAC                    3.3097       0.0000       3.3097
   FV_TP_2D                    90.1448       0.0000      90.1448


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3126 RUNNING AT ip-172-31-0-74
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

@lizziel (Contributor) commented Dec 12, 2018:

This issue is not reproducible on the Harvard Odyssey cluster. If you repeat the same tests multiple times, do you always get the same result? Do you bypass the issue if transport is turned off (turned off in runConfig.sh, not input.geos; see the sketch below)?
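
For reference, a minimal sketch of the transport switch in runConfig.sh; the exact variable name (Turn_on_Transport here) is recalled from GCHP 12-era run directories and may differ between versions, so treat it as an assumption:

  # In runConfig.sh: disable advection/transport for the next run
  # (variable name is an assumption; check your own runConfig.sh)
  Turn_on_Transport="F"

runConfig.sh is sourced by the run script and overwrites the corresponding settings in input.geos and the other .rc files, which is presumably why the change should be made there rather than in input.geos directly.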

@yantosca (Contributor, Author):

On the AWS cloud, I can reliably reproduce the run dying at 00:10 when all collections are turned off in HISTORY.rc.

With all collections turned off AND with transport turned off, the run still fails at 00:10.

@JiaweiZhuang (Contributor) commented Dec 12, 2018:

Using OpenMPI 2.1 instead of MPICH 3.3 fixes this problem (see #10).
But the run then hits a different problem: it is not able to save diagnostics.
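
For anyone retracing this: switching MPI implementations generally means rebuilding the whole stack (ESMF, MAPL, GCHP) against the new library, not just swapping mpirun. A minimal sketch of the environment side of that switch, assuming an OpenMPI install under /usr/local/openmpi (the path and the use of MPI_ROOT and ESMF_COMM here are assumptions based on typical GCHP 12-era environment files):

  # Point the build environment at OpenMPI instead of MPICH (paths are placeholders)
  export MPI_ROOT=/usr/local/openmpi
  export PATH=$MPI_ROOT/bin:$PATH
  export LD_LIBRARY_PATH=$MPI_ROOT/lib:$LD_LIBRARY_PATH
  export ESMF_COMM=openmpi   # ESMF must be rebuilt with the matching MPI stack

  # Then rebuild ESMF, MAPL, and GCHP from scratch before rerunning.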

@lizziel (Contributor) commented Dec 12, 2018:

Upgrading to OpenMPI 3 may fix the remaining issue. We ran into the same problem on the Odyssey cluster, and switching to the newer OpenMPI fixed it.

@lizziel (Contributor) commented Dec 12, 2018:

I am closing this issue, since it is fixed by switching from MPICH 3.3 to OpenMPI 2.1.

lizziel closed this as completed on Dec 12, 2018
msulprizio changed the title from "GCHP dies when all diagnostic collections are turned off" to "[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off" on Sep 5, 2019
sdeastham pushed a commit to sdeastham/gchp_legacy that referenced this issue Jun 14, 2021
…arget-and-install

Fixes GCHPctm's build's "all" target and installation