[BUG/ISSUE] Timing results are incomplete #6
Comments
We have run into this problem before. It is an issue with the cleanup of state_met_mod (State_Met Finalize) and only occurs when the model is built with gfortran 6+. Which version of gfortran are you using? I will test with it on Odyssey in our dev branch. |
I have tested with gfortran 7.2.1/7.3.0/7.3.1; all have the same issue |
I successfully compiled and ran to completion with gfortran 7.1.0 on Odyssey. Are you sure you are using 12.1.0 and not 12.0.0? There was an issue in 12.0.0 that caused this symptom and it was fixed in 12.0.1 with geoschem/geos-chem@8084db0. |
Yes I am using 12.1.0. I will build a public GCHP AMI today and you can use it for debugging. Does this sound good? |
Thanks, one of us will take a look. For what it's worth I also tried 8.2.0 on Odyssey with success. |
OK I've made a public AMI: Use at least After login, simply
I am still writing formal documentation but this should be enough for testing. |
Interestingly, after I fix the restart file issue by geoschem/GCHP#8 (comment), even
Full log files (only differ at the last few lines) Use Update: this is not just about incomplete timing info. The run simply crashes at hour 1. See geoschem/GCHP#8 (comment) |
This new issue (crash after 1hr run) will be addressed in geoschem/GCHP#8. To restrict the problem to the cleanup_state_met issue you can turn off all diagnostics collections in HISTORY.rc. |
I have been looking at this issue. I ran on the Amazon cloud on an r5.2xlarge instance with AMI ID GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66). In HISTORY.rc I turned on only these collections

This run (1 hour) on 6 cores finished with all timing information: GIGCenv: total 0.346

So I am wondering whether this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Maybe it is the memory limit of the instance? I don't know. This AMI was built with mpich2 MPI. Maybe it is worth trying OpenMPI on the cloud?

Also note: this run finished without dropping a core file, unlike what currently happens on Odyssey. So the core dump appears to be an Odyssey-specific environment problem. |
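For comparing logs from different runs, a small script can report which components have a timing line and which do not. This is only a sketch: the line pattern is an assumption inferred from the single quoted line `GIGCenv: total 0.346`, and any component name other than GIGCenv passed in `expected` is a placeholder that has to be replaced with the names from a log that finished cleanly.

```python
import re

# Assumed line shape, inferred only from the quoted "GIGCenv: total 0.346".
# Real timing reports may use a different layout; adjust the pattern if so.
TIMING_LINE = re.compile(r"^\s*(\w+)\s*:\s*total\s+([0-9]*\.?[0-9]+)")


def timed_components(log_text):
    """Return {component: total} for every timing line found in the log."""
    found = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line)
        if match:
            found[match.group(1)] = float(match.group(2))
    return found


def missing_components(log_text, expected):
    """Return the expected component names that have no timing line."""
    found = timed_components(log_text)
    return [name for name in expected if name not in found]
```

Calling `missing_components(open(path).read(), expected)` on the log of a truncated run and on the log of a complete run shows where the report stops.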
Also, if I run with no diagnostics turned on then the run dies at 10 minutes
From the traceback it looks as if it's hanging in interpolating a field in ExtData. |
The run also crashes at 00:10 if I only save one collection
But with two collections
|
Based on these reports, I think the incomplete timing report is not related to cleaning up the State_Met array. I ran into similar symptoms intermittently on the Harvard Odyssey cluster when there was an issue writing out diagnostic files. I found that if I reduced the total memory across all diagnostics, the run went to completion. Sometimes when the problem occurred, the log appeared to have trouble writing the timing. This issue started on the Odyssey cluster after the operating system upgrade from CentOS6 to CentOS7. We were using OpenMPI 1.10.3 when the switch happened. Upgrading to OpenMPI 2 did not fix the issue. Upgrading to OpenMPI 3 (either 3.0 or 3.1) then corrected the problem for unknown reasons. I therefore think this issue is related to OS and MPI version compatibility, although we never figured out the cause, so this is just a theory. |
Tested with OpenMPI 2.2.1, still cannot print the full timing (#10) |
This is choking in Cleanup_State_Met. |
I suggest checking whether this issue goes away after upgrading to OpenMPI 3. If it does not, adding print statements throughout cleanup_state_met to see where it dies should show whether cleanup of a specific met field is the problem. Many State_Met fields point to MAPL imports, so proper deallocation/nullification can be an issue depending on the order in which State_Met and the imports are cleaned up. |
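Once such print statements are in place, the last one that reached the log marks the step just before the crash. A minimal sketch for pulling that out of a log follows; the `CLEANUP_DEBUG:` prefix is hypothetical and only works if the added print statements actually write it.

```python
def last_checkpoint(log_text, prefix="CLEANUP_DEBUG:"):
    """Return the text of the last debug marker in the log, or None.

    The prefix is a hypothetical convention for the added print
    statements; it is not something GCHP writes on its own.
    """
    last = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # Keep only the part after the prefix, e.g. a field name.
            last = stripped[len(prefix):].strip()
    return last
```

If output is buffered, the last marker in the log can lag behind the real crash point, so it is safer to flush after each print statement.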
Tested with gcc 7.3.0 + OpenMPI 3.1.3, the timing info is still incomplete. Again stops at run_openmpi3_default_diags.log Can be reproduced in |
I cannot use the tricks for MPICH3 (saving It always terminates at |
I discovered that we were not deallocating State_Diag at the end of a GCHP run. So I now pass State_Diag from Chem_GridCompMod to routine GIGC_CHUNK_FINAL. Then within GIGC_CHUNK_FINAL, we call Cleanup_State_Diag. This now prints out all timing info. But the run still drops core at the end as in #11. Here is a snippet of the log file at the last timestep of the run:
|
It seems that the code is exiting the Finalize_ routine of Chem_GridCompMod.F90 normally, since the end of PET00000.GEOSCHEMchem.log is written:
I also put some debug output in GIGC_GridCompMod::Finalize_, and it appears that it is calling GIGC_CHUNK_FINAL and destroying the state objects properly. So wherever the core dump occurs, it is probably a level or two higher. In the meantime, I'll push my edits for deallocating State_Diag to a bugfix branch of GCHP so that we can use it going forward. |
Thanks for the fix. Should the cloud now have exactly the same behavior as Odyssey? |
Yes, I think the cloud will behave the same. One clarification: I updated the GCC and GCHP folders to 12.1.1, then rebuilt with "make clean_gc; make compile_standard". |
Pushed the fix to deallocate State_Diag to the bugfix/GCHP_issues branch. This will be merged into the next GCHP version. |
Made gFTL a submodule
I've successfully run GCHP 12.1.0 on AWS, with both Ubuntu (gcc 7.3.0) and CentOS (gcc 7.3.1).
CentOS was already working before; I am glad that Ubuntu now also works (again with lots of dirty fixes). I strongly prefer Ubuntu as it has a lot more pre-packaged libraries and the environment is a lot faster to build (no need to compile libraries from source).
Full logs for record:
However, on both OSes, the log files only show the time for GIGCenv and not for any other component.
Does the same issue happen on Odyssey?