[BUG/ISSUE] GCHP c48 run with gfortran 8.2 on Odyssey hangs before end of run #13

Closed
yantosca opened this issue Dec 19, 2018 · 8 comments

Comments

@yantosca
Contributor

yantosca commented Dec 19, 2018

I tried running a GCHP C48 run on Odyssey but the job hung right after printing out the GIGCenv timer results.

AGCM Date: 2016/07/01  Time: 01:00:00
 
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:      OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
 Writing:     72 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
 Writing:     72 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4

  Times for GIGCenv
TOTAL                   :       1.069
INITIALIZE              :       0.000
RUN                     :       0.418
etc.

HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc

The script I used to submit the job is:
gchp.run.txt

And here is the full log:
gchp.log.txt

A similar run (done by @lizziel) with ifort 17.0.4 instead of gfortran 8.2 finished OK. I am wondering whether gfortran is not fully compatible with MAPL, or at least whether it exposes issues that we don't see when using ifort.

@lizziel
Contributor

lizziel commented Dec 19, 2018 via email

@yantosca
Contributor Author

yantosca commented Dec 19, 2018

This is gfortran 8.2; I did not test earlier versions.

I think the restart files were written OK.

    256980 2018-12-19 15:41 gcchem_internal_checkpoint_c48.nc
2059258836 2018-12-19 15:38 gcchem_internal_checkpoint_c48.nc.20160701_0000z.bin

@lizziel
Contributor

lizziel commented Dec 19, 2018 via email

@yantosca
Contributor Author

The restart file doesn't have any coordinates:

data:

 lon = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;

 lat = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;

 lev = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;

 time = _ ;

So it looks like something is going wrong when the restart file is written: the coordinate variables contain only fill values. An out-of-bounds error elsewhere in the code might be the cause.

@lizziel
Contributor

lizziel commented Dec 19, 2018

Have you tried compiling with debug flags? I think the setting to change is BOPT in GCHP/Shared/Config/ESMA_base.mk; set it to 'g'.

@lizziel
Contributor

lizziel commented Dec 19, 2018

I also think you can add extra flags to the Fortran flags section of GIGC.mk in the run directory. Some ideas for what to use are at https://stackoverflow.com/questions/3676322/what-flags-do-you-set-for-your-gfortran-debugger-compiler-to-catch-faulty-code. Maybe this will help with geoschem/GCHP#11 and geoschem/GCHP#14 as well.
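As a rough illustration of what run-time bounds checking buys you, here is a minimal standalone sketch. It is not GCHP code: the program name, the helper `first_negative`, and the build line in the comment are all made up for the example. The flags themselves (`-g`, `-O0`, `-fcheck=bounds`, `-fbacktrace`) are standard gfortran options.

```fortran
! Hypothetical build line for this sketch (not the GCHP build system):
!   gfortran -g -O0 -fcheck=bounds -fbacktrace bounds_demo.f90 -o bounds_demo
program bounds_demo
  implicit none
  integer :: a(4)
  integer :: idx

  a = [10, 20, 30, 40]

  ! first_negative returns 0 when no element is negative, so idx is an
  ! invalid subscript for an array whose lower bound is 1.
  idx = first_negative(a)

  ! Built with -fcheck=bounds, the program stops here with a run-time
  ! error that names the array and the offending index. Built without
  ! the flag, the write lands outside the array and the behavior is
  ! undefined, so the symptom can show up far from the actual bug.
  a(idx) = a(idx) + 1

  print *, a

contains

  ! Return the index of the first negative element, or 0 if there is none.
  integer function first_negative(v) result(i)
    integer, intent(in) :: v(:)
    integer :: k
    i = 0
    do k = 1, size(v)
      if (v(k) < 0) then
        i = k
        exit
      end if
    end do
  end function first_negative

end program bounds_demo
```

A hang or a corrupted output file late in a run is the kind of delayed symptom an unchecked out-of-bounds write can produce, which is why a debug build is worth trying first.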

@yantosca
Contributor Author

yantosca commented Dec 20, 2018

This issue (and also #14) appears to have been caused by an out-of-bounds error in the Olson landmap module. The variable maxFracInd was zero but should not have been. I added a quick fix in the GEOS-Chem "Classic" repo in GeosCore/olson_landmap_mod.F90:

       ! Get IUSE type index with maximum coverage [mil]
       ! NOTE: MaxFracInd is a vector of size 1!
       maxFracInd  = MAXLOC(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))

!-------------------------------------------------------------------------------
! Prior to 12/20/18:
! Rewrite IF statement to avoid out-of-bounds error (bmy, 12/20/18)
!       ! Force IUSE to sum to 1000 by updating max value if necessary
!       sumIUSE =  SUM(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))
!       IF ( sumIUSE /= 1000 ) THEN
!          State_Met%IUSE(I,J,maxFracInd) = State_Met%IUSE(I,J,maxFracInd) &
!                                           + ( 1000 - sumIUSE )
!       ENDIF
!-------------------------------------------------------------------------------

       ! Force IUSE to sum to 1000 by updating max value if necessary
       ! Also put an error trap on maxFracInd to avoid out-of-bounds errors
       ! (bmy, 12/20/18)
       sumIUSE =  SUM(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))
       IF ( sumIUSE /= 1000 .and. maxFracInd(1) > 0 ) THEN
          State_Met%IUSE(I,J,maxFracInd(1)) =                             &
          State_Met%IUSE(I,J,maxFracInd(1)) + ( 1000 - sumIUSE )
       ENDIF

With this fix, a C48 simulation finished properly on Odyssey, printing out all timing info.
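For readers wondering how maxFracInd can end up as zero: per the Fortran standard, MAXLOC returns zero when the array it is given has size zero. The sketch below reproduces that in isolation. It assumes the zero comes from a land-type count of zero (a stand-in for State_Met%IREG(I,J)), which is a guess consistent with the land map not being read in, not something confirmed in this thread. The names `iuse` and `ireg` are local stand-ins, not the real State_Met fields.

```fortran
program maxloc_guard_demo
  implicit none
  integer :: iuse(5)        ! stand-in for State_Met%IUSE(I,J,:)
  integer :: ireg           ! stand-in for State_Met%IREG(I,J)
  integer :: maxFracInd(1)  ! MAXLOC on a rank-1 array returns a size-1 vector
  integer :: sumIUSE

  ! Case 1 (assumed failure mode): no land types were read in, so the
  ! section iuse(1:ireg) has size zero and MAXLOC returns 0.
  iuse = 0
  ireg = 0
  call normalize()

  ! Case 2: valid data whose fractions do not quite sum to 1000.
  iuse = [400, 350, 240, 0, 0]
  ireg = 3
  call normalize()

contains

  ! Same logic as the patched block: only touch IUSE when the
  ! index returned by MAXLOC is a valid subscript.
  subroutine normalize()
    maxFracInd = maxloc(iuse(1:ireg))
    sumIUSE    = sum(iuse(1:ireg))
    if (sumIUSE /= 1000 .and. maxFracInd(1) > 0) then
       iuse(maxFracInd(1)) = iuse(maxFracInd(1)) + (1000 - sumIUSE)
    end if
    print *, 'ireg =', ireg, ' maxFracInd =', maxFracInd(1), &
             ' sum =', sum(iuse(1:ireg))
  end subroutine normalize

end program maxloc_guard_demo
```

The guard only keeps the bad index from being used. In the first case the land fractions are still left empty, so it avoids the out-of-bounds write without repairing the missing input.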

It appears the Olson land map data is not being read in properly, which is the root cause of this issue. I am investigating this.

@yantosca
Contributor Author

I am closing this thread because the root cause is #15. Fixing #15 will also fix this issue.

lizziel added a commit that referenced this issue Aug 1, 2019
This bug fix was submitted for use in GEOS CTM by Kyle Gerheiser (GMAO)
and is relevant to GCHP. See GEOS-ESM/GEOSgcm_GridComp issue #13:
GEOS-ESM/GEOSgcm_GridComp#13

Without this fix concentrations in GCHP will blow up in advection at
our previous default timesteps. Until the bug was fixed we decreased the
default dynamic timesteps as work-around. Updating the default timesteps
in the run directory creation template will be in a later commit.

Signed-off-by: Lizzie Lundgren <elundgren@seas.harvard.edu>
@msulprizio changed the title from "GCHP c48 run with gfortran 8.2 on Odyssey hangs before end of run" to "[BUG/ISSUE] GCHP c48 run with gfortran 8.2 on Odyssey hangs before end of run" on Sep 5, 2019