
Default GCHP run crashes almost immediately in MAPL_CapGridComp.F90 #8

Closed · LiamBindle opened this issue Dec 17, 2019 · 2 comments

LiamBindle (Contributor) commented Dec 17, 2019

Hi everyone,

I'm just submitting this for the archive of issues on GitHub.

Relevant Information

  • ESMF was built with Spack
  • Using Intel MPI with Intel 19 compilers
  • ESMF was unintentionally built with ESMF_COMM=mpiuni

What happened

Yesterday I tried running the default 6-core, 1-node, 1-hour GCHP simulation and it crashed almost immediately. This happened with GCHP_CTM 13.0.0-alpha.1, but it could happen with any version that uses MAPL 2.0+. Below is the full output. The important parts to pick out are:

  1. It failed almost immediately (very little output).
  2. The "Abort(XXXXXX) on node Y" lines report that GCHP is running on different nodes, despite this being a 6-core, single-node simulation.
  3. GCHP crashed after the assertion on line 250 of MAPL_CapGridComp.F90 failed (permalink here).

Failed run output:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6
 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
pe=00001 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00001 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 1
pe=00002 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00002 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00029    GEOSChem.F90                             <status=1>
pe=00003 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00003 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00003 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00029    GEOSChem.F90                             <status=1>
pe=00000 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00000 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 2
Abort(262146) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 3
pe=00004 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00004 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00004 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 4
pe=00005 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00005 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 5
Abort(262146) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 0

The Problem

The issue was that ESMF was built with ESMF_COMM=mpiuni. This appears to have happened because the spack install spec wasn't quite right, but I didn't build ESMF myself so I can't be sure.

How do I check which ESMF_COMM my ESMF was built with?

The build-time value of ESMF_COMM is written to esmf.mk, which sits beside your ESMF libraries. You can see it with the following command:

grep 'ESMF_COMM' $(spack location -i esmf)/lib/esmf.mk

or

grep 'ESMF_COMM' /path/to/ESMF/libraries/esmf.mk
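
If the build was configured correctly, the reported value should be your MPI flavor rather than mpiuni. The exact layout of esmf.mk varies by ESMF version, but the matching line should look roughly like this (hypothetical output):

    ESMF_COMM: intelmpi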

Solution

Rebuild ESMF and make sure ESMF_COMM is set to the appropriate MPI flavor.
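
For anyone rebuilding ESMF manually, here is a minimal sketch of the environment setup, assuming Intel compilers with Intel MPI. The paths are placeholders and the exact set of ESMF_* variables you need depends on your ESMF version and system, so check the ESMF user guide before copying this verbatim.

    # Placeholder paths -- adjust for your system
    export ESMF_DIR=/path/to/esmf/source
    export ESMF_COMPILER=intel
    export ESMF_COMM=intelmpi      # or openmpi, mvapich2, etc. -- not mpiuni
    export ESMF_INSTALL_PREFIX=/path/to/esmf/install
    cd $ESMF_DIR
    make -j 6 && make install

After installing, re-run the grep above on the new esmf.mk to confirm that ESMF_COMM reports the expected MPI flavor.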

lizziel (Contributor) commented Dec 17, 2019

Thanks @LiamBindle. Do you know if using mpiuni caused MPI to automatically use one core per node? Does this implementation exist for some specific purpose?

LiamBindle (Contributor, Author) commented Dec 17, 2019

@lizziel According to this, mpiuni is "a single-processor MPI-bypass library". I don't fully understand what that means, but it seems consistent with what I was seeing (each process thinking it was the root).

Actually, last night I noticed that ESMF_COMM wasn't "intelmpi" when I thought it should be, so I rebuilt ESMF and that fixed my problem. I tried to generalize this issue for the purpose of the issue archive. This issue is really "ESMF was built with the wrong ESMF_COMM" and isn't caused by an ESMF + Intel MPI compatibility problem.

Here is the part of spack that sets ESMF_COMM. It looks like ESMF_COMM=intelmpi is used only if both +mpi and ^intel-parallel-studio+mpi are in the spack install spec. Our compute1 sysadmin was having trouble getting spack to concretize this spec, though, and ultimately opted to build ESMF manually rather than with spack. I suspect someone familiar with spack could advise on how to do this properly, but rebuilding ESMF manually was easy enough.
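
For reference, a hedged sketch of what such a spec might look like; the exact package and variant names depend on your spack version and package repository, so treat this as an assumption rather than a verified recipe:

    # Hypothetical spec -- confirm available variants with `spack info esmf`
    spack install esmf +mpi ^intel-parallel-studio+mpi

    # Then verify the resulting build
    grep 'ESMF_COMM' $(spack location -i esmf)/lib/esmf.mk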

lizziel mentioned this issue Aug 2, 2022
yantosca added a commit that referenced this issue Mar 28, 2024
This merge brings PR #395 (Expand GCHP advection diagnostics, by
@lizziel) into the GCHP "no-diff-to-benchmark" development stream.

This PR contains the changelog updates for GCHP advection diagnostic
updates going into FV3 and GEOS-Chem submodules. It also fixes a bug in
setting the UpwardsMassFlux flux diagnostic that was causing it to be all zeros.

This PR should be merged at the same time as geos-chem PR #2199 and
FVdycoreCubed_GridComp #8.

Signed-off-by: Bob Yantosca <yantosca@seas.harvard.edu>
yantosca added a commit that referenced this issue Mar 28, 2024
This commit informs the GCHP superproject about the following
commits that were pushed to the GitHub geoschem/geos-chem repository:

af42462 Merge PR #8 (Add PLEadv diagnostic for offline advection in GCHP)

This PR adds the PLEadv diagnostic export to FvDyCore_Cubed.

Signed-off-by: Bob Yantosca <yantosca@seas.harvard.edu>