[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off #9
Comments
Jiawei Zhuang wrote: The run also crashes at 00:10 if I only save one collection
But with two collections
This issue is not reproducible on the Harvard Odyssey cluster. If you repeat the same tests multiple times, do you always get the same result? Does the issue go away if transport is turned off (turn it off in runConfig.sh, not input.geos)?
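One way to answer the repeatability question is to script the repeated runs and tally how each one ends. The sketch below is only an illustration: `repeat_run` is a hypothetical helper, and the command you pass to it (your own GCHP launch command) is an assumption, not something taken from this thread.

```python
import subprocess
import sys
from collections import Counter


def repeat_run(cmd, n_runs=5, timeout=None):
    """Run `cmd` n_runs times and tally how each run ended.

    Keys of the returned Counter are exit codes, or the string
    "timeout" when a run exceeded `timeout` seconds (useful when a
    run hangs rather than crashes).
    """
    outcomes = Counter()
    for _ in range(n_runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            outcomes[result.returncode] += 1
        except subprocess.TimeoutExpired:
            outcomes["timeout"] += 1
    return outcomes


# Example call (placeholder command; substitute your own run script):
# repeat_run(["./my_gchp_run_script.sh"], n_runs=5, timeout=3600)
```

If every run ends with the same outcome, the failure is deterministic for that configuration; a mix of outcomes would point toward something environment- or timing-dependent.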
On the AWS cloud, I can reliably reproduce the run dying at 00:10 when all collections are turned off in HISTORY.rc. With all collections turned off AND with transport turned off, the run still fails at 00:10.
Using OpenMPI 2.1 instead of MPICH 3.3 fixes this problem (see #10).
Upgrading to OpenMPI 3 may fix the remaining issue. We ran into this on the Odyssey cluster and switching to the new OpenMPI fixed it.
I am closing this issue since it is fixed by switching to OpenMPI 2.1 from MPICH 3.3.
Have been looking at this issue.
Ran on the Amazon cloud on an r5.2xlarge instance with AMI ID: GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66)
In HISTORY.rc I turned on only these collections:
(1) SpeciesConc_avg: only archived SpeciesConc_NO
(2) SpeciesConc_inst: only archived SpeciesConc_NO
(3) StateMet_avg: only archived Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: only archived Met_AD
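When comparing runs like this, it can help to confirm which collections are actually active before launching. The sketch below assumes a HISTORY.rc layout in which collection names appear as quoted strings after a `COLLECTIONS:` label, the list ends at a `::` line, and a collection is turned off by commenting it out with `#`. Both the layout and the `active_collections` helper are assumptions for illustration, so check them against your own file.

```python
import re


def active_collections(history_text):
    """Return the quoted names in the COLLECTIONS: list that are not commented out.

    Assumes the list starts on the line beginning with 'COLLECTIONS:'
    and ends at the first line beginning with '::'.
    """
    names = []
    in_block = False
    for raw in history_text.splitlines():
        line = raw.strip()
        if line.startswith("COLLECTIONS:"):
            in_block = True
            line = line[len("COLLECTIONS:"):]
        elif not in_block:
            continue
        if line.strip().startswith("::"):
            break
        # Drop anything after '#', which also drops fully commented-out entries.
        line = line.split("#", 1)[0]
        names.extend(re.findall(r"'([^']+)'", line))
    return names


# Illustrative text only; not copied from a real HISTORY.rc.
sample = """
COLLECTIONS: 'SpeciesConc_avg',
             'SpeciesConc_inst',
             #'StateMet_avg',
             'StateMet_inst',
::
"""
print(active_collections(sample))
```

An empty list from this check would correspond to the "all collections turned off" case discussed in this issue.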
This run (1 hour) on 6 cores finished with all timing information:
GIGCenv: total 0.346
GIGCchem total: 123.970
Dynamics total: 18.741
GCHP total: 140.931
HIST total: 0.264
EXTDATA total: 133.351
So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Maybe it's the memory limits of the instance? I don't know.
This AMI was built with mpich2 MPI. Maybe worth trying with OpenMPI on the cloud?
Also note: This run finished without dropping a core file, unlike what currently happens on Odyssey. So the core file appears to be an Odyssey-specific environment problem.
But if I run with no diagnostics turned on, then the run dies at 10 minutes.
From the traceback it looks as if it's hanging while interpolating a field in ExtData.