Make error handling robust for mpi #111
This update is a quick fix for inconsistent writing of HEMCO error messages to the log. There were several issues:

1. HCO_ERROR was sometimes called to pass error messages to HEMCO.log and other times called to pass them to stdout. This led to confusion about where to look for error messages.
2. If the error message was passed to HEMCO.log, it was only written by the root thread. Because HCO_ERROR is called right before early program termination, the root thread would sometimes not get to write the message before early exit. This was especially true for MPI applications with many cores.
3. Passing the error message to HEMCO.log requires that HEMCO.log is open for writing. But the HEMCO.log file is only opened by the root thread, and other processes can get ahead of that when using many cores.

The subroutine that HCO_ERROR invokes when passing messages to HEMCO.log is still present in the code but will be reassessed in the future. For now, writing messages to stdout from all threads when there is early termination is the most straightforward and consistent way to alert users to the source of the error.

Signed-off-by: Lizzie Lundgren <elundgren@seas.harvard.edu>
Thanks Lizzie. This looks good to me. As we also discussed for the GC error handling, it is probably a good idea to get the CPU ID from regular MPI code down the line. I'm not sure if this would fight with whatever decomposition ESMF uses. But this looks good, and it works for WRF and CESM since they already automatically pipe output from different cores to different log files.
Thanks @jimmielin. Yes, I'll make a companion feature request to expand inclusion of the PET number to MPI applications that do not use ESMF.
This update sends all HCO_ERROR messages to stderr and writes them from all threads. It also prints the core number in HEMCO error messages when using MAPL/ESMF. This update is a quick fix for inconsistent writing of HEMCO error messages to the log, which was impacting troubleshooting in GEOS. The issues listed in the description above are now addressed.
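The pattern this update describes (every process writes its own error message straight to a standard stream, with no root-only check and no dependence on a log file being open) can be sketched roughly as follows. This is an illustrative Python sketch, not the HEMCO code, which is Fortran. The function name `report_error`, its parameters, and the message layout are all hypothetical, and the `pet` argument stands in for the core number that would come from MAPL/ESMF when available.

```python
import sys


def report_error(message, location=None, pet=None, stream=None):
    """Write an error message immediately from the calling process.

    Hypothetical helper for illustration only; the name, arguments,
    and message layout are assumptions, not HEMCO's actual interface.
    """
    # Default to stderr, which every process can write to without
    # first opening a shared log file.
    if stream is None:
        stream = sys.stderr

    prefix = "HEMCO ERROR"
    if pet is not None:
        # Include the core (PET) number when the caller knows it.
        prefix += f" [PET {pet}]"

    line = f"{prefix}: {message}"
    if location is not None:
        line += f" (in {location})"

    # No "am I root?" check: every process reports its own error,
    # and the stream is flushed before any early termination.
    stream.write(line + "\n")
    stream.flush()
    return line
```

A caller would invoke `report_error("...", location="...", pet=rank)` on whichever process hits the error, just before returning a failure code. Because nothing here waits on the root process, the message is not lost if other processes reach termination first.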