
[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off #9

Closed · yantosca opened this issue Dec 12, 2018 · 6 comments

@yantosca (Contributor):

I have been looking at this issue.

I ran this on the Amazon cloud on an r5.2xlarge instance with AMI GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66).

In HISTORY.rc I turned on only these collections (a sketch of the corresponding HISTORY.rc entries follows the list):
(1) SpeciesConc_avg: archived only SpeciesConc_NO
(2) SpeciesConc_inst: archived only SpeciesConc_NO
(3) StateMet_avg: archived only Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: archived only Met_AD
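
For reference, here is a minimal sketch of what the corresponding HISTORY.rc entries might look like. The collection and field names come from the list above; the attribute values (template, format, frequency, mode) and the gridded-component name 'GIGCchem' follow the generic MAPL History format and are assumptions rather than copies from the tutorial run directory:

  COLLECTIONS: 'SpeciesConc_avg',
               'SpeciesConc_inst',
               'StateMet_avg',
               'StateMet_inst',
  ::

  # Example: the instantaneous SpeciesConc collection reduced to a single field.
  # Attribute values and the gridcomp name 'GIGCchem' are illustrative assumptions.
  SpeciesConc_inst.template:   '%y4%m2%d2_%h2%n2z.nc4',
  SpeciesConc_inst.format:     'CFIO',
  SpeciesConc_inst.frequency:  010000,
  SpeciesConc_inst.mode:       'instantaneous',
  SpeciesConc_inst.fields:     'SpeciesConc_NO', 'GIGCchem',
  ::

A collection is turned off by commenting it out of (or removing it from) the COLLECTIONS list; its attribute block can remain in the file.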

This 1-hour run on 6 cores finished and printed all timing information:

GIGCenv total:    0.346
GIGCchem total: 123.970
Dynamics total:  18.741
GCHP total:     140.931
HIST total:       0.264
EXTDATA total:  133.351

So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Maybe it is hitting the memory limits of the instance? I don't know.

This AMI was built with mpich2. It may be worth trying OpenMPI on the cloud.
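
If it helps to pin down which MPI a given environment is actually providing, two generic checks (not GCHP-specific; the executable name geos is an assumption):

  mpirun --version          # Open MPI prints "mpirun (Open MPI) x.y.z"; MPICH's Hydra launcher prints "HYDRA build details"
  ldd ./geos | grep -i mpi  # lists the MPI shared libraries the executable is linked against (executable name is an assumption)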

Also note: this run finished without dropping a core file (as currently happens on Odyssey), so the core-file behavior appears to be an Odyssey-specific environment problem.

However, if I run with no diagnostic collections turned on, the run dies at the 10-minute mark:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     - RDAER: Using online SO4 NH4 NIT!
     - RDAER: Using online BCPI OCPI BCPO OCPO!
     - RDAER: Using online SALA SALC
     - DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!
  
 Setting history variable pointers to GC and Export States:
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.638E+03  4.409E+03      2.223E+03  2.601E+03  3.258E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.823E+04  0.000E+00
MAPL_ExtDataInterpField                       3300
EXTDATA::Run_                                 1471
MAPL_Cap                                       777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0

From the traceback it looks as if the run is failing in MAPL_ExtDataInterpField, i.e., while interpolating a field in ExtData.

@yantosca (Contributor, Author):

Jiawei Zhuang wrote:

The run also crashes at 00:10 if I save only one collection, SpeciesConc_inst, containing only the two species SpeciesConc_NO and SpeciesConc_O3:

  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!

 Setting history variable pointers to GC and Export States:
 SpeciesConc_NO
 SpeciesConc_O3
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.723E+03  4.494E+03  2.306E+03  2.684E+03  3.260E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.852E+04  0.000E+00
 offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3                     54
DYNAMICSRun                                    703
GCHP::Run                                      407
MAPL_Cap                                       792

But with two collections, SpeciesConc_avg and SpeciesConc_inst, each containing only the two species SpeciesConc_NO and SpeciesConc_O3, the run is able to finish and print full timing information:

 Writing:    144 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
 Writing:    144 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4


  Times for GIGCenv
TOTAL                   :       2.252
INITIALIZE              :       0.000
RUN                     :       2.250
GenInitTot              :       0.004
--GenInitMine           :       0.003
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.000
--GenFinalMine          :       0.000
GenRecordTot            :       0.001
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000

HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
   Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc

  Times for GIGCchem
TOTAL                   :     505.760
INITIALIZE              :       3.617
RUN                     :     498.376
FINALIZE                :       0.000
DO_CHEM                 :     488.864
CP_BFRE                 :       0.121
CP_AFTR                 :       4.080
GC_CONV                 :      36.070
GC_EMIS                 :       0.000
GC_DRYDEP               :       0.119
GC_FLUXES               :       0.000
GC_TURB                 :      17.966
GC_CHEM                 :     403.528
GC_WETDEP               :      19.443
GC_DIAGN                :       0.000
GenInitTot              :       2.719
--GenInitMine           :       2.719
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.963
--GenFinalMine          :       0.963
GenRecordTot            :       0.000
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000

   -----------------------------------------------------
      Block          User time  System Time   Total Time
   -----------------------------------------------------
   TOTAL                      815.4433       0.0000     815.4433
   COMM_TOTAL                   3.3098       0.0000       3.3098
   COMM_TRAC                    3.3097       0.0000       3.3097
   FV_TP_2D                    90.1448       0.0000      90.1448


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3126 RUNNING AT ip-172-31-0-74
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

@lizziel (Contributor) commented Dec 12, 2018:

This issue is not reproducible on the Harvard Odyssey cluster. If you repeat the same tests multiple times, do you always get the same result? Do you bypass the issue if transport is turned off (turned off in runConfig.sh, not input.geos; see the sketch below)?
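
For reference, a minimal sketch of the transport switch in runConfig.sh; the exact variable name (Turn_on_Transport here) is recalled from GCHP 12-era run directories and may differ between versions, so treat it as an assumption:

  # In runConfig.sh: disable advection/transport for the next run
  # (variable name is an assumption; check your own runConfig.sh)
  Turn_on_Transport="F"

runConfig.sh is sourced by the run script and overwrites the corresponding settings in input.geos and the other .rc files, which is presumably why the change should be made there rather than in input.geos directly.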

@yantosca (Contributor, Author):

On the AWS cloud, I can reliably reproduce the run dying at 00:10 when all collections are turned off in HISTORY.rc.

With all collections turned off AND with transport turned off, the run still fails at 00:10.

@JiaweiZhuang (Contributor) commented Dec 12, 2018:

Using OpenMPI 2.1 instead of MPICH 3.3 fixes this problem (see #10).
But the run then hits a different problem: it is not able to save diagnostics.
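
For anyone retracing this: switching MPI implementations generally means rebuilding the whole stack (ESMF, MAPL, GCHP) against the new library, not just swapping mpirun. A minimal sketch of the environment side of that switch, assuming an OpenMPI install under /usr/local/openmpi (the path and the use of MPI_ROOT and ESMF_COMM here are assumptions based on typical GCHP 12-era environment files):

  # Point the build environment at OpenMPI instead of MPICH (paths are placeholders)
  export MPI_ROOT=/usr/local/openmpi
  export PATH=$MPI_ROOT/bin:$PATH
  export LD_LIBRARY_PATH=$MPI_ROOT/lib:$LD_LIBRARY_PATH
  export ESMF_COMM=openmpi   # ESMF must be rebuilt with the matching MPI stack

  # Then rebuild ESMF, MAPL, and GCHP from scratch before rerunning.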

@lizziel (Contributor) commented Dec 12, 2018:

Upgrading to OpenMPI 3 may fix the remaining issue. We ran into the same problem on the Odyssey cluster, and switching to the newer OpenMPI fixed it.

@lizziel (Contributor) commented Dec 12, 2018:

I am closing this issue, since it is fixed by switching from MPICH 3.3 to OpenMPI 2.1.

lizziel closed this as completed on Dec 12, 2018
msulprizio changed the title from "GCHP dies when all diagnostic collections are turned off" to "[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off" on Sep 5, 2019
sdeastham pushed a commit to sdeastham/gchp_legacy that referenced this issue Jun 14, 2021
…arget-and-install

Fixes GCHPctm's build's "all" target and installation