
[BUG/ISSUE] Differences in output when splitting a run into consecutive shorter runs #57

Closed
lizziel opened this issue Nov 4, 2020 · 15 comments
Assignees: lizziel
Labels: category: Bug, never stale
Milestone: 14.0.0

@lizziel (Contributor) commented Nov 4, 2020

It has been a known issue for a long time that GCHP does not give exactly the same final result for a single long run and the identical run split up into shorter segments. This has been true for both the transport tracer and full chemistry simulations.

This is especially problematic for GCHP because currently the only way to output monthly mean diagnostics is to break a run up into 1-month segments. A monthly mean capability was planned for inclusion in MAPL for the 13.0.0 release, but that update is not yet available in a MAPL release. Since we output monthly means in GCHP 1-year benchmarks, I have been looking more closely at this issue to find fixes before we do the 13.0.0 benchmark.

Recent updates going into GEOS-Chem 13.0.0 correct this problem for transport tracers. Bug fixes in the GEOS-Chem and HEMCO submodules resolved the issue, and that simulation now gives zero diffs regardless of how the run is split up. See the related posts on GitHub for more information on these updates.

Differences persist in the full chemistry simulation and I am actively looking into them.

lizziel added the "category: Bug" label Nov 4, 2020
lizziel self-assigned this Nov 4, 2020
@lizziel (Contributor, Author) commented Nov 4, 2020

One issue I have found is that the input.geos option to initialize stratospheric H2O is set to True by default for all runs. This introduces small changes to the meteorology in subroutine SET_H2O_TRAC. When splitting a run into multiple segments, the init strat H2O setting should be True only for the very first segment. This is now fixed in GCHP commit #4d100d4, which is a change to the GCHP run directory in the GEOS-Chem submodule.

Note that this input.geos setting should also be updated if GEOS-Chem Classic runs are split up. We do not have scripts to automate job submission and config file updates for consecutive GEOS-Chem Classic runs, so no changes are needed in the GEOS-Chem repository for that case. A sketch of how a driver script might toggle the flag is shown below.
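
For anyone scripting consecutive segments by hand, here is a minimal sketch of how such a script might flip the flag between segments. This is not part of any GEOS-Chem run directory, and the exact key text in input.geos is an assumption, so the pattern below would need to be adjusted to match the actual line.

```python
# Sketch: set the stratospheric-H2O initialization flag to T only for the
# first segment of a multi-segment run. The key text matched by the regex
# is a placeholder; check your own input.geos for the real wording.
import re
from pathlib import Path

def set_init_strat_h2o(rundir, first_segment):
    """Write T for the first segment, F for all later segments."""
    cfg = Path(rundir) / "input.geos"
    text = cfg.read_text()
    flag = "T" if first_segment else "F"
    # Hypothetical pattern: a line like "Initialize strat. H2O? : T"
    new_text, n = re.subn(
        r"(?im)^(.*strat.*H2O.*?:\s*)[TF]\s*$",
        lambda m: m.group(1) + flag,
        text,
    )
    if n != 1:
        raise RuntimeError(f"Expected exactly one matching line, found {n}")
    cfg.write_text(new_text)

# Example: consecutive 1-month segments of a year-long run
# set_init_strat_h2o("run_202001", first_segment=True)
# set_init_strat_h2o("run_202002", first_segment=False)
```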

Following this update, the only multi-run vs. single-run differences in the GCHP full chemistry simulation occur in chemistry. There is currently a parallelization bug that is preventing further progress in identifying the source of the differences.

@sdeastham (Contributor) commented

Great to see this progress, and I'm excited at the prospect of finally getting parity between single-segment and multi-segment runs! Can you clarify the parallelization bug, though? Is something only being initialized on the root CPU and not propagating to the others?

@lizziel (Contributor, Author) commented Nov 5, 2020

My understanding is that the parallelization bug is this one: geoschem/geos-chem#392. @msulprizio has been looking at it more closely lately.

@sdeastham (Contributor) commented

Got it. That would only affect GC-Classic though, right (being an OpenMP parallelization bug)?

@lizziel (Contributor, Author) commented Nov 5, 2020

Actually, we have OpenMP on by default if using Intel compilers. Due to the way the CMake files were written, OpenMP has been on for a while for all compilers despite building with OMP=n. As of a very recent 13.0 commit, OpenMP is now off when using gfortran, to avoid a slow-down with that compiler. It is still on with Intel due to a separate bug. Both of these problems are on the to-do list for 13.0.0. We should in theory be able to switch between OMP=y and OMP=n without issue, but that is not yet the case.
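
As a side note, one rough way to spot-check whether a given executable was actually built with OpenMP (independent of the OMP=y/n setting passed to CMake) is to look for an OpenMP runtime among its linked libraries. This is just a hedged sketch for Linux systems, not part of the build system, and the executable path is a placeholder.

```python
# Rough check: does the built executable link an OpenMP runtime library?
# The path below is a placeholder, not the actual build layout.
import subprocess

def links_openmp(executable: str) -> bool:
    """Return True if `ldd` reports a known OpenMP runtime for the binary."""
    out = subprocess.run(["ldd", executable], capture_output=True, text=True).stdout
    return any(lib in out for lib in ("libgomp", "libiomp5", "libomp"))

if __name__ == "__main__":
    print(links_openmp("./build/bin/gcclassic"))
```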

I plan to revisit the diffs in chemistry after spending some time on 13.0 documentation, to make sure I can get rid of the parallelization bug signature by turning off OpenMP. I'm also curious whether what I am seeing goes away with heterogeneous chemistry off.

@sdeastham (Contributor) commented

I assume, though, that even with OpenMP on we are still only using one OpenMP thread, such that we shouldn't be affected by OpenMP parallelization errors? That having been said, it sounds like there's something even weirder than a pure parallelization error going on in there, so I see your point about waiting to see what happens there first!

stale bot commented Mar 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

stale bot added the "stale" label Mar 19, 2021
lizziel added the "never stale" label and removed the "stale" label Mar 19, 2021
@lizziel (Contributor, Author) commented Apr 8, 2021

Quick update on this. The current dev branch for GC-Classic still has a parallelization bug. It is not present in GCHP, so it is most likely an OpenMP issue. However, I did a quick multi vs. single run test for GCHP 13.0 (a 2-hr run vs. two 1-hr runs) and there are still differences in the full chemistry simulation when chemistry is turned on. I am keeping this issue open to keep it on the radar. Fortunately we will soon switch over to using MAPL History monthly collections, so we will no longer need to run 1-year benchmarks in 1-month segments.
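
For context, the multi vs. single run check boils down to comparing the end state of the two workflows field by field. Here is a minimal sketch of that kind of zero-diff comparison; the file paths are placeholders and do not reflect the actual test setup.

```python
# Sketch: compare the final restart from a single 2-hr run against the restart
# produced by two chained 1-hr runs. Paths/filenames are placeholders.
import numpy as np
import xarray as xr

ref = xr.open_dataset("single_2hr/GEOSChem.Restart.20190701_0200z.nc4")
dev = xr.open_dataset("two_1hr/GEOSChem.Restart.20190701_0200z.nc4")

# Collect variables that differ (or are missing from the dev restart)
nonzero_diff = []
for name in ref.data_vars:
    if name in dev.data_vars and np.array_equal(ref[name].values, dev[name].values):
        continue
    nonzero_diff.append(name)

if nonzero_diff:
    print(f"{len(nonzero_diff)} variables differ, e.g. {nonzero_diff[:5]}")
else:
    print("Zero diffs: restarts are bit-for-bit identical")
```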

@yantosca (Contributor) commented Apr 8, 2021

Also see issue geoschem/HEMCO#78. I discovered what might be an issue in the HEMCO MEGAN extension; I am not sure yet whether it is related. I am going to be looking into this parallelization issue and hope to solve it soon.

@lizziel (Contributor, Author) commented Apr 8, 2021

I don't think this is related. The differences only appear when chemistry is turned on; running with HEMCO on but chemistry off shows no differences.

@lizziel (Contributor, Author) commented Dec 15, 2021

This issue has not been looked into for a while, but I am keeping it open to bring attention to it as a long-standing problem. It will be revisited in the future.

@lizziel (Contributor, Author) commented Jan 31, 2022

I plan to re-assess this issue in 14.0.

lizziel added this to the 14.0.0 milestone Feb 15, 2022
@lizziel (Contributor, Author) commented Mar 15, 2022

I am revisiting this issue after work by @christophkeller to try to eliminate differences seen in GEOS due to GEOS-Chem. He added 60+ variables to the internal state, mostly State_Chm arrays used in ISORROPIA, and reported that this fixed the issue in 13.3 but not in 13.4. Having zero diffs across runs, regardless of how they are split up in time, is a requirement for GEOS.

I am doing tests with 13.4 using the additional internal state variables to home in on the remaining source of differences, presumably further missing internal state variables. The differences only appear when chemistry is turned on. I am finding a remaining bias near the surface when comparing a 2-hr run against two 1-hr runs. For example, there is a negative bias in the ozone zonal mean:
[Figure: O3_dev — zonal mean ozone difference (Dev minus Ref) showing a negative bias near the surface]
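
For reference, a zonal-mean difference plot like the one above can be sketched with xarray, assuming lat-lon (e.g., regridded) SpeciesConc output; the file names below are placeholders, and SpeciesConc_O3 follows standard GEOS-Chem diagnostic naming.

```python
# Minimal sketch of a Dev-minus-Ref zonal mean plot for ozone.
# File paths are placeholders; output is assumed to be on a lat-lon grid.
import xarray as xr
import matplotlib.pyplot as plt

ref = xr.open_dataset("single_2hr/GEOSChem.SpeciesConc.20190701_0200z.nc4")
dev = xr.open_dataset("two_1hr/GEOSChem.SpeciesConc.20190701_0200z.nc4")

# Average over longitude to get the zonal mean difference (lev x lat)
diff = (dev["SpeciesConc_O3"] - ref["SpeciesConc_O3"]).mean(dim="lon").squeeze()

diff.plot(x="lat", y="lev")
plt.title("O3 zonal mean, Dev (two 1-hr runs) minus Ref (single 2-hr run)")
# Flip the vertical axis if needed so the surface appears at the bottom
plt.show()
```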

It is not good to have 60+ additional 3D arrays in the internal state. This significantly increases the memory requirement and is particularly costly at high resolutions. I will look into whether we can adjust the order of operations so that carrying the ISORROPIA fields across timesteps is not necessary, or is at least minimized.

In summary, the to do list for this work is:

  1. Find and fix remaining differences in chemistry when splitting up a run.
  2. Examine order of operations to reduce required set of arrays that need to be carried across timesteps and thus stored in the internal state and included in the restart file.
  3. Apply the GCHP updates to GC-Classic once GC-Classic no longer has parallelization problems, and assess whether differences are still seen when splitting up runs into multiple segments.

Items 1 and 2 are motivated by strict requirements in GEOS with benefit to GCHP. All items will improve GEOS-Chem 1-year fullchem benchmark accuracy since we currently break up the 1-year benchmark runs into multiple months.

Fixes and updates related to this will go into 14.0.

@lizziel (Contributor, Author) commented Mar 18, 2022

I found and fixed a problem in wet scavenging limited to H2O2. See geoschem/geos-chem#1178. The fix is going into 13.4.

@lizziel (Contributor, Author) commented May 11, 2022

All remaining differences when splitting up a GCHP benchmark simulation will be removed by geoschem/geos-chem#1229, which is going into 14.0.0.

lizziel closed this as completed May 25, 2022