[BUG/ISSUE] Crash in OpenMP loop in hcox_lightnox_mod #75

Closed
sdeastham opened this issue Jan 14, 2021 · 14 comments
Labels: category: Bug (Something isn't working), stale (No recent activity on this issue)

Comments

@sdeastham
Contributor

Description of the problem

When building GCHP 13.0.0 with Intel 18.3 and the HPE MPT (an MPI implementation) on NASA's Pleiades cluster, I found that the code would fail (with no stack trace) immediately after the HEMCO (VOLCANO) printout which reports read-in of the volcano emissions file. Based on some hand-debugging, the error appears to originate within an OpenMP loop in hcox_lightnox_mod.F90. This bug is resolved by either removing the relevant OpenMP directives or modifying the root CMakeLists.txt file so that OMP=OFF.

There are already several reasons to want to disable OpenMP when building GCHP (see e.g. geoschem/HEMCO#57), but @lizziel noted in commit 7bf5fdd that compiling GCHP with Intel compilers resulted in a segfault if OpenMP was disabled. Since that does not seem to be the case on Pleiades, and OpenMP continues to cause issues, I'd like to suggest re-opening the investigation into whether and why OpenMP is needed for Intel compilers.

NB: The lack of a stack trace appears to be a semi-reliable indication that the issue is originating in an OpenMP loop.

GEOS-Chem version

GEOS-Chem v13.0.0

Description of code modifications

No modifications were necessary to produce this bug. The code was compiled with the CMake directive -DCMAKE_BUILD_TYPE=Debug.

Software versions

  • Compilers (Intel or GNU, and version): Intel 18.3.222
  • NetCDF version: NetCDF v4.7.3, NetCDF-Fortran v4.5.2
  • MPI version: HPE MPT 2.23
  • ESMF version: 8.1.0 beta (snapshot_08-8-g3d0334a5db)
@sdeastham sdeastham added the category: Bug Something isn't working label Jan 14, 2021
@LiamBindle
Contributor

Thanks Seb, this is good info to have. Perhaps the thing to do in the interim is to disable OMP in HEMCOBuildProperties and GEOSChemBuildProperties, but leave it enabled in MAPL since IIRC that's where the segfault with OMP=OFF was happening.

Does that seem reasonable to you? If so I will check that it doesn't break GCHP+Intel runs on Compute1. Would you be able to try the update on Pleiades?

Note: The recommended way to turn OMP off is running cmake . -DOMP=OFF (ref: https://gchp.readthedocs.io/en/latest/user-guide/compiling.html#gchp-build-options)

@sdeastham
Contributor Author

Happy to give that a shot on Pleiades! Will update ASAP.

@lizziel
Contributor

lizziel commented Jan 14, 2021

For what it's worth, I previously found an issue in the lightning NOx HEMCO extension file when compiling with Intel debug flag to check pointers. I narrowed it down to an OpenMP issue. Not sure if it's related to what you are seeing @sdeastham, but it may indeed point to the culprit. See HEMCO issue geoschem/HEMCO#50.

@LiamBindle
Contributor

I pushed b9a70c4 on a new branch bugfix/75. It adds ${OMP_HEMCO} which lets you override the value of ${OMP} in HEMCO's scope. For Intel builds OMP_HEMCO defaults to OFF (so OpenMP is disabled in HEMCO for Intel builds by default).

Could you git fetch origin, git cherry-pick b9a70c4, and then try recompiling+running?
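For anyone repeating these steps, they could be wrapped in a small shell function along the following lines. This is only a sketch under assumptions: the function name apply_fix, the RUN dry-run variable, and the build directory argument are hypothetical, and it assumes you are in the GCHP source root with a build directory that has already been configured once.

```shell
#!/bin/sh
# Sketch: fetch, cherry-pick one commit, then reconfigure and rebuild.
# apply_fix and RUN are hypothetical names, not part of GCHP.
# Set RUN=echo to print the commands instead of executing them.
apply_fix() {
    commit="$1"      # commit to cherry-pick, e.g. b9a70c4
    build_dir="$2"   # existing, already-configured build directory
    run="${RUN:-}"   # empty means "really run the commands"
    $run git fetch origin &&
    $run git cherry-pick "$commit" &&
    $run cmake "$build_dir" &&
    $run make -C "$build_dir"
}

# Example (dry run, prints the four commands without executing them):
#   RUN=echo; apply_fix b9a70c4 build
```

Chaining with && stops at the first failing step, so a cherry-pick conflict does not lead to a rebuild of a half-applied tree.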

@LiamBindle
Contributor

It looks like Intel MPI isn't working on Compute1 today, so I'll have to follow up on my test tomorrow.

@sdeastham
Contributor Author

I can confirm that, with the given cherry-pick, the code compiles and runs as-is (defaulting to OMP_HEMCO=OFF) and fails in lightnox if OMP_HEMCO=ON.

@LiamBindle
Contributor

LiamBindle commented Jan 15, 2021

It must be something in MAPL then that's causing the crash. Thanks for checking.

Here's the issue that Lizzie found: geoschem/GCST-internal#5. I think we need to dig a bit deeper since Intel+Debug mode+OMP=ON on Pleiades crashes while Intel+Release mode+OMP=OFF crashes on Cannon.

I will check on Compute1 whether

  • GCHP built with Intel compilers with OMP=OFF crashes
  • GCHP built with Intel compilers with OMP=ON and CMAKE_BUILD_TYPE=Debug crashes

@WilliamDowns, would you be able to try these cases on Cannon?

@sdeastham Have you tried the default build in Release mode? Does that run okay on Pleiades with the default value of OMP=ON? I'm just wondering if we can confirm it's related to the runtime checks from the debug flags.
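The cases above differ only in two CMake options, so one way to keep them straight is a small helper that prints the configure command for each case. This is a sketch rather than anything shipped with GCHP: the function name gchp_configure_cmd is hypothetical, and it assumes each command is run from inside an existing GCHP build directory, as with cmake . -DOMP=OFF.

```shell
#!/bin/sh
# Sketch: print the cmake configure command for one test case.
# gchp_configure_cmd is a hypothetical helper, not part of GCHP.
gchp_configure_cmd() {
    omp="$1"          # ON or OFF
    build_type="$2"   # Debug or Release
    printf 'cmake . -DOMP=%s -DCMAKE_BUILD_TYPE=%s\n' "$omp" "$build_type"
}

# The configurations discussed in this thread.
gchp_configure_cmd OFF Release   # OpenMP disabled
gchp_configure_cmd ON Debug      # OpenMP on, debug runtime checks
gchp_configure_cmd ON Release    # default OMP=ON in Release mode
```

Printing the commands rather than running them keeps the helper safe to use on a login node; paste or eval the output in the build directory to configure for real.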

@WilliamDowns
Contributor

WilliamDowns commented Jan 15, 2021

A quick 1-hour test of those two setups with ifort 18 and OpenMPI on Cannon yields no crash for GCHP compiled with -DOMP=OFF, but yields the same observed crash in lightnox with -DCMAKE_BUILD_TYPE=Debug.

@LiamBindle
Contributor

Thanks for checking that @WilliamDowns. I tried both cases on Compute1, and both ran successfully (with Intel 2020.0 compilers).

Since @sdeastham (Pleiades), @WilliamDowns (Cannon), and my (Compute1) tests of GCHP built with Intel compilers and OMP=OFF had no runtime errors, I propose we make the default OMP=OFF for both compilers. Does that seem reasonable?

@lizziel
Contributor

lizziel commented Feb 19, 2021

Sounds reasonable to me. My understanding is using OpenMP in GCHP should not improve performance and the only reason it was on was to avoid an error associated with GMAO libraries that I previously ran into. If that is no longer an issue, and my performance assumption is correct, then it makes sense to have OpenMP off by default.

@LiamBindle
Contributor

Any concerns about me making OMP=OFF the default on GCHP/main today?

@sdeastham
Contributor Author

Works for me!

@lizziel
Contributor

lizziel commented Mar 5, 2021

Is this issue all set to close? I still have the HEMCO issue open where HEMCO crashes in the lightning NOx extension when using Intel compilers and the -check-pointers debug flag. But if that was the cause of this issue then it is no longer relevant to GCHP.

@stale

stale bot commented Apr 4, 2021

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

@stale stale bot added the stale No recent activity on this issue label Apr 4, 2021
@lizziel lizziel closed this as completed Apr 8, 2021