[BUG/ISSUE] Differences in output when splitting a run into consecutive shorter runs #57
It has been a known issue for a long time that GCHP does not give exactly the same final result for a single long run as for the identical run split into shorter segments. This has been true for both the transport tracer and the full chemistry simulations.

This is especially problematic for GCHP because currently the only way to output monthly mean diagnostics is to break a run into 1-month segments (a sketch of this segmenting follows below). A monthly mean capability was supposed to be included in MAPL for the 13.0.0 release, but that update is not yet available in a MAPL release. Since we output monthly means in GCHP 1-year benchmarks, I have been looking more closely at this issue to find fixes before we do the 13.0.0 benchmark.

Recent updates going into GEOS-Chem 13.0.0 correct this problem for the transport tracers. Bug fixes in the GEOS-Chem and HEMCO submodules resolved the issue, and the simulation now gives zero diffs regardless of how the run is split up. See the related posts in the GEOS-Chem and HEMCO repositories on GitHub for more information on these updates.

Differences persist in the full chemistry simulation and I am actively looking into them.
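A minimal sketch of that month-by-month segmenting, assuming MAPL's `cap_restart` mechanism for the segment start time; the dates, the 1-month duration setting, and the `./gchp` launch command are placeholders rather than the actual run scripts:

```python
import subprocess

# Start date of each 1-month segment of a 1-year run (placeholder year)
starts = [f"2019{m:02d}01" for m in range(1, 13)]

for start in starts:
    # MAPL reads the segment start time from cap_restart; the 1-month
    # segment duration is assumed to be set in the run configuration
    with open("cap_restart", "w") as f:
        f.write(f"{start} 000000\n")
    # Placeholder launch command; real runs typically go through a scheduler
    subprocess.run(["./gchp"], check=True)
```

Each segment picks up from the restart/checkpoint written by the previous one, which is exactly where any lack of restart fidelity shows up as differences.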
Following recent updates, the only multi vs single run differences in the GCHP full chemistry simulation occur in chemistry. There is currently a parallelization bug that is preventing further progress in identifying the source of the differences.
Great to see this progress, and I'm excited at the prospect of finally getting parity between single and multi-segment runs! Can you clarify the parallelization bug, though? Is something only being initialized on the root CPU and not propagated to the others?
My understanding is the parallelization bug is this: geoschem/geos-chem#392. @msulprizio has been looking at it more closely lately.
Got it. That would only affect GC-Classic though, right (being an OpenMP parallelization bug)?
Actually, we have OpenMP on by default when using Intel compilers. And due to the way the CMake files were written, OpenMP has been on for a while for all compilers, regardless of the build settings. I plan on revisiting the diffs in chemistry after spending some time on 13.0 documentation, to make sure I can get rid of the parallelization bug signature by turning off OpenMP. I'm also curious whether what I am seeing goes away with het chem off.
I assume, though, that even with OpenMP on we are still only using one OpenMP thread, so we shouldn't be affected by OpenMP parallelization errors? That said, it sounds like something even weirder than a pure parallelization error is going on in there, so I see your point about waiting to see what happens there first!
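For reference, a minimal way to force a single OpenMP thread when launching a run, to help rule OpenMP races in or out; the `./gchp` command is a placeholder:

```python
import os
import subprocess

# Force a single OpenMP thread in the child process so that any remaining
# multi vs single run differences cannot come from OpenMP race conditions
env = dict(os.environ, OMP_NUM_THREADS="1")
subprocess.run(["./gchp"], check=True, env=env)  # placeholder launch command
```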
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.
Quick update on this. The current dev branch for GC-Classic still has a parallelization bug. It is not present in GCHP, so it is almost certainly an OpenMP issue. However, I did a quick multi vs single run test for GCHP 13.0 (a 2-hr run versus two 1-hr runs, compared as in the sketch below), and there are still differences for the full chemistry simulation when chemistry is turned on. I am keeping this issue open to keep it on the radar. Fortunately, we will soon switch over to using MAPL History monthly collections, so we will no longer be running 1-year benchmarks in 1-month segments.
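For anyone who wants to reproduce this kind of check, here is a minimal sketch of the end-of-run comparison, assuming NetCDF output readable by xarray; the file paths are placeholders for the final files from the single 2-hr run and from the second of the two chained 1-hr runs:

```python
import xarray as xr

# Placeholder paths: the end-of-run restart (or diagnostic) file from each case
single = xr.open_dataset("run_2hr/GEOSChem.Restart.20190101_0200z.nc4")
split = xr.open_dataset("run_2x1hr/GEOSChem.Restart.20190101_0200z.nc4")

# Report the max absolute difference for each variable present in both files;
# a fully restart-faithful model should print nothing here
for name in sorted(set(single.data_vars) & set(split.data_vars)):
    diff = float(abs(single[name] - split[name]).max())
    if diff > 0.0:
        print(f"{name}: max abs diff = {diff:.3e}")
```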
Also see this issue: geoschem/HEMCO#78. I discovered what might be a problem in the HEMCO MEGAN extension; I'm not sure yet whether it is related. I am going to look into this parallelization issue and should hopefully solve it soon.
I don't think this is related. The differences only come in when chemistry is turned on; running with HEMCO on but chemistry off shows no differences.
This issue has not been looked into for a while, but I am keeping it open to bring attention to it as a long-standing problem. It will be revisited in the future.
I plan to re-assess this issue in 14.0.
I am revisiting this issue following work by @christophkeller to eliminate differences seen in GEOS due to GEOS-Chem. He added 60+ variables to the internal state, mostly State_Chm arrays used in ISORROPIA, and reported that this fixed the issue in 13.3 but not in 13.4. Zero diff across runs, regardless of how the runs are split up in time, is a requirement for GEOS.

I am doing tests with 13.4 using the additional internal state variables to home in on the remaining source of differences, presumably missing internal state variables. The differences only appear when chemistry is turned on. I am finding a remaining bias near the surface when comparing a 2-hr run versus two 1-hr runs, for example a negative bias in the ozone zonal mean (a sketch of this kind of comparison is at the end of this comment).

It is also not good to have 60+ additional 3D arrays in the internal state. This significantly increases the memory requirement and is particularly costly at high resolutions. I will look into whether we can adjust the order of operations so that carrying the ISORROPIA fields across timesteps is unnecessary, or at least minimized.

In summary, the to-do list for this work is:
Items 1 and 2 are motivated by strict requirements in GEOS, with benefit to GCHP. All items will improve GEOS-Chem 1-year full-chemistry benchmark accuracy, since we currently break the 1-year benchmark runs into month-long segments. Fixes and updates related to this will go into 14.0.
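To illustrate the kind of diagnostic behind the ozone comparison above, here is a minimal sketch of a zonal-mean difference, assuming lat-lon (e.g., regridded) output with a time dimension and the conventional `SpeciesConc_O3` variable name; the paths are placeholders:

```python
import xarray as xr

# Placeholder paths for the species concentration output from each case
ref = xr.open_dataset("run_2hr/GEOSChem.SpeciesConc.20190101_0200z.nc4")
dev = xr.open_dataset("run_2x1hr/GEOSChem.SpeciesConc.20190101_0200z.nc4")

# Zonal mean: average over time and longitude, leaving (level, latitude);
# native cubed-sphere GCHP output would need regridding to lat-lon first
zm_ref = ref["SpeciesConc_O3"].mean(dim=["time", "lon"])
zm_dev = dev["SpeciesConc_O3"].mean(dim=["time", "lon"])

bias = zm_dev - zm_ref  # negative values: split runs lower than single run
print("min/max zonal-mean O3 bias:", float(bias.min()), float(bias.max()))
```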
I found and fixed a problem in wet scavenging limited to H2O2. See geoschem/geos-chem#1178. The fix is going into 13.4.
All remaining differences when splitting up a GCHP benchmark simulation will be removed in geoschem/geos-chem#1229, which is going into 14.0.0.