Include PET number in error messages if using ESMF #925

Merged
lizziel merged 1 commit into dev from feature/make_error_handling_robust_for_mpi on Sep 30, 2021


@lizziel lizziel commented Sep 29, 2021

This is a companion PR to HEMCO updates that make error handling clearer and more robust when using MPI. However, this update can also be applied without those HEMCO updates. The GEOS-Chem change simply writes the CPU (PET) number when writing error messages. If all processes print the same traceback via GC_ERROR, the many messages can be parsed more easily by isolating the prints for a single CPU. If the error occurs on only one process, or on a subset of processes, the number makes it easier to identify which.
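
As a rough illustration of the idea (not the code in this PR), a minimal Fortran sketch of prefixing an error message with a PET number might look like the following. The module and function names (`error_prefix_sketch`, `tag_with_pet`) are hypothetical, and the PET number is simply passed in as an argument here rather than queried from ESMF/MAPL.

```fortran
module error_prefix_sketch
  implicit none
contains

  ! Hypothetical helper, NOT the actual GC_Error routine:
  ! prepend a PET (CPU) number to an error message so that output
  ! from many MPI processes can be filtered per process.
  function tag_with_pet(msg, pet) result(tagged)
    character(len=*), intent(in)  :: msg     ! original error message
    integer,          intent(in)  :: pet     ! PET / CPU number (assumed known)
    character(len=:), allocatable :: tagged  ! message with PET prefix
    character(len=16)             :: pet_str

    ! Convert the integer to text without leading blanks
    write(pet_str, '(i0)') pet
    tagged = '[PET ' // trim(pet_str) // '] ' // msg
  end function tag_with_pet

end module error_prefix_sketch

program demo
  use error_prefix_sketch
  implicit none

  ! Illustrative message only; the text is made up for this sketch
  print '(a)', tag_with_pet('GEOS-Chem ERROR: example message', 12)
end program demo
```

With a prefix like this, the output for one process could be isolated with a simple text search on the tag.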

Signed-off-by: Lizzie Lundgren <elundgren@seas.harvard.edu>
@lizziel lizziel added the topic: GCHP Related to GCHP only label Sep 29, 2021
@lizziel lizziel added this to the 13.3.0 milestone Sep 29, 2021

lizziel commented Sep 29, 2021

@jimmielin, this PR prints the core number when GEOS-Chem encounters an error, but my implementation is only for MAPL/ESMF. Would it be helpful to make this usable by other MPI models too, e.g. WRF?

@jimmielin

Hi Lizzie,

Yes, I can add this functionality for WRF and CESM. What would be the best way to add some code to a PR I don’t own? Sorry, I’m a bit unfamiliar with this situation…

Another note is that I vaguely remember that Input_Opt stores the CPU ID in a GC run with MPI. It would be nice to pull the ID from there, as then ESMF/WRF/CESM specific code can be abstracted away from GC_ERROR and put in the coupler which feeds the CPU ID into Input_Opt. Would this be possible? I remember passing in this parameter to GEOS-Chem when coupling to WRF, but it was a long time ago.

Thanks!

@lizziel

lizziel commented Sep 29, 2021

Thanks for the extremely fast response! Yes, I thought of passing Input_Opt but wanted to get in a quick fix for GEOS. Passing in Input_Opt is more involved since it would need to be added to the argument list everywhere, so I just extracted the CPU number locally, the same way it is set for MAPL/ESMF. How are you setting it for WRF? Since GC_ERROR is only called when there is a problem, it seems preferable to extract the number only when GC_ERROR is called rather than add arguments everywhere, but I'm open to a counter-opinion on that.

@lizziel

lizziel commented Sep 29, 2021

Thinking more about this, maybe we could use pure MPI code rather than ESMF/MAPL wrapper code that probably does this under the hood.

@jimmielin

I'm setting this in WRF using WRF's own routines. I think CESM does the same. So it might be useful to find a pure MPI approach, then decide on the code path based on the MODEL_... C preprocessor macros.

WRF:

      use module_dm

      ! WRF DM (MPI) Parallel Information - is master process?
      logical              :: Am_I_Root
      integer              :: WRF_DM_MyProc, WRF_DM_NProc, WRF_DM_Comm

      ! wrf_dm_on_monitor() is .true. only on the master (monitor) process
      Am_I_Root = wrf_dm_on_monitor()

      call wrf_get_nproc(WRF_DM_NProc)
      call wrf_get_myproc(WRF_DM_MyProc)
      call wrf_get_dm_communicator(WRF_DM_Comm)

      ! Pass some HPC Information to Input_Opt...
      Global_Input_Opt%isMPI   = .true.
      Global_Input_Opt%amIRoot = Am_I_Root
      Global_Input_Opt%thisCPU = WRF_DM_MyProc
      Global_Input_Opt%numCPUs = WRF_DM_NProc
      Global_Input_Opt%MPIComm = WRF_DM_Comm

CESM-GC:

    use spmd_utils,          only : MasterProc, myCPU=>Iam, nCPUs=>npes

    Input_Opt%thisCPU  = myCPU
    Input_Opt%amIRoot  = MasterProc

@lizziel

lizziel commented Sep 30, 2021

Thanks for including those. I'm going to open a feature request to extend the PET number in error messages to cases that use MPI without ESMF.

@lizziel lizziel removed the request for review from LiamBindle September 30, 2021 15:08
@lizziel lizziel merged commit 0757813 into dev Sep 30, 2021
@msulprizio msulprizio deleted the feature/make_error_handling_robust_for_mpi branch October 4, 2021 16:45