Make error handling robust for mpi #111

Merged
merged 2 commits into from Sep 30, 2021

Contributor

@lizziel lizziel commented Sep 29, 2021

This update sends all HCO_ERROR messages to stderr and writes them from all threads. It also prints the core number in HEMCO error messages when using MAPL/ESMF. It is a quick fix for the inconsistent writing of HEMCO error messages to the log, which was hampering troubleshooting in GEOS. Several issues are now addressed:

1. HCO_ERROR was sometimes called to pass error messages to HEMCO.log and
   other times called to pass error messages to stdout. This leads to
   confusion about where to look for error messages.

2. If the error message was passed to HEMCO.log, it was only written
   by the root thread. Because HCO_ERROR is called right before early
   program termination, the root thread would sometimes not get to write
   the message before the exit. This was especially true for MPI
   applications with many cores.

3. Passing the error message to HEMCO.log requires that HEMCO.log is
   open for writing. But the HEMCO.log file is only opened by the root
   thread, and other processes can get ahead of that if using many cores.

The subroutine that HCO_ERROR invokes when passing messages to HEMCO.log is still present in the code but will be reassessed in the future. For now, writing messages to stdout from all threads on early termination is the most straightforward and consistent way to alert users to the source of the error.
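
For readers who want the idea in miniature, here is a rough sketch of the pattern described above. It is not the HEMCO code (which is Fortran); the function names `format_error` and `report_error`, the zero-padded `[0007]` prefix, and the choice of stderr as the default stream are all assumptions made for illustration. The point is only that every process writes its own message, tagged with its process number when known, without checking for root and without needing a log file to be open.

```python
import sys


def format_error(message, pet=None):
    """Build the error line. `pet` is the process (core) number, if known.

    The prefix layout is a made-up format for this sketch, not HEMCO's.
    """
    prefix = "HEMCO ERROR" if pet is None else f"HEMCO ERROR [{pet:04d}]"
    return f"{prefix}: {message}"


def report_error(message, pet=None, stream=None):
    """Write the error from the calling process, whatever its rank.

    There is no root-only check and no dependence on a log file being
    open. The stream is flushed right away so the message is not lost
    if the program terminates immediately afterwards.
    """
    stream = sys.stderr if stream is None else stream
    stream.write(format_error(message, pet) + "\n")
    stream.flush()


if __name__ == "__main__":
    # Simulate three processes hitting the same error; each one reports it.
    for pet in range(3):
        report_error("Cannot read emissions file", pet=pet)
```

Passing `pet=None` gives the untagged form, which corresponds to the case where no process number is available.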

Signed-off-by: Lizzie Lundgren <elundgren@seas.harvard.edu>
Collaborator

@jimmielin jimmielin left a comment

Thanks Lizzie. This looks good to me. As we also discussed with the GC error handling, it's probably a good idea to get the CPU ID from regular MPI code down the line. Not sure if this would fight with whatever decomposition ESMF uses. But this looks good, and it works for WRF and CESM since they already automatically pipe output from different cores to different log files.

@yantosca yantosca added the "topic: ESMF or MPI" label (Related to issues in the ESMF and/or MPI environments) Sep 29, 2021
Contributor Author

lizziel commented Sep 30, 2021

Thanks @jimmielin. Yes, I'll make a companion feature request to expand inclusion of the PET number to MPI applications that do not use ESMF.

@lizziel lizziel merged commit 2e5b703 into dev Sep 30, 2021
@msulprizio msulprizio added this to the 3.2.0 milestone Oct 1, 2021
@msulprizio msulprizio deleted the feature/make_error_handling_robust_for_mpi branch October 1, 2021 17:44
Labels: "category: Feature Request" (New feature or request), "topic: ESMF or MPI" (Related to issues in the ESMF and/or MPI environments)