[BUG/ISSUE] GCHP crash on reading in lightning NOx when trying to start a simulation in February 2016 and crash when trying to do a leap day #61
Comments
@jmoch1214 I think I've actually run into this as well. However, I ran into this when I was setting up a new ExtData entry, so I thought the error had to do with one of my input files. It sounds like there might be a bigger issue that's causing sims starting in leap-year Februaries to crash. Does the simulation crash if you start in February 2020 as well? What about a non-leap year like 2017? Sorry, I would try it myself, but our cluster is packed right now.
I am able to reproduce this for Feb 2016 and also Feb 2017. Jan and Mar 2016 are fine. Indeed, my Feb run crashes no matter when in the month I start. Looking at the ExtData log with debug prints on, it appears to be grabbing lightning from a completely different year and month. I'll report back when I figure out what is going on.
Here is an example of the strange behavior, in this case when starting a run on 20170201 000000. Notice it is using the file for Nov 2019 despite the target time being in Feb 2017.
Offline lightning reads for February in years that do not fail are also wonky. For example, starting a run on Feb 1, 2019 shows this for the offline lightning left bracket:
Notable here are the initial file date of 2022-01-01 and ultimate settling on reading 2018-02-01 for a simulation that starts on 2019-02-01. The code that determines file time has some warnings that need investigation:
@sdeastham, do you remember anything about the motivation for your updates to this piece of code, and whether it might have anything to do with the issue seen in lightning? For reference, the offline lightning data is in units of minutes since start of year and is stored in monthly data files. I have found that this issue is not just in the new v2020-03 data directory, but also in the previous one we used (v2019-01). The file times ultimately used per year are seemingly random: 2018-02 for a 2019 sim, 2017-02 for a 2018 sim, 2019-11 for a 2017 sim, 2017-08 for a 2016 sim, etc., but they are reproducible across repeated runs of the same sim year.
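(As an aside, for anyone unfamiliar with that time convention, here is a minimal sketch of how a "minutes since start of year" value maps back to a calendar date. The subroutine below is purely illustrative; it is not part of HEMCO, MAPL, or the lightning data processing.)

    ! Illustrative only: convert "minutes since start of year" to month/day/hour.
    ! Assumes the time axis starts at Jan 1, 00:00 of the given year.
    subroutine minutes_to_date(year, minutes, mm, dd, hh)
      implicit none
      integer, intent(in)  :: year, minutes
      integer, intent(out) :: mm, dd, hh
      integer :: days_in_month(12), day_of_year, rem
      logical :: leap

      leap = (mod(year,4) == 0 .and. mod(year,100) /= 0) .or. mod(year,400) == 0
      days_in_month = (/ 31,28,31,30,31,30,31,31,30,31,30,31 /)
      if (leap) days_in_month(2) = 29

      day_of_year = minutes / 1440              ! whole days since Jan 1, 00:00
      rem         = minutes - day_of_year*1440  ! leftover minutes within the day
      hh          = rem / 60                    ! hour of day

      mm = 1                                    ! walk forward month by month
      do while (day_of_year >= days_in_month(mm))
         day_of_year = day_of_year - days_in_month(mm)
         mm = mm + 1
      end do
      dd = day_of_year + 1
    end subroutine minutes_to_date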
The first place it seems to be messing up is this line: n is returning 444, which corresponds to 37 years given a 1-month frequency. Added to the reference time of 1985, that gives 2022:
cTime is the same as the target time, so somehow (2019-02 minus 1985-01) / 1-month is yielding 37 years.
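For what it's worth, here is one arithmetic path that lands on exactly 444. This is only a guess at how the interval division is being evaluated, not something verified against ESMF internals: if the nominal 1-month frequency were effectively treated as a fixed 28-day interval, then

    2019-02-01 minus 1985-01-01       = 12449 days
    floor(12449 / 28)                 = 444   (the n we see; 1985-01 + 444 months = 2022-01)
    calendar months actually elapsed  = 409   (34 years + 1 month)

which would also line up with the 2022-01-01 initial file date noted above.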
Argh, I basically wasn't thinking about files which are specified once per month when I wrote that. My focus was on daily files. Months are just a terrible unit of time because they are completely unreliable - 30 days, or 31 days, or 28 days, or 29 days... useless. Unfortunately the easiest hack might be to do something like replacing the use of
I'm thinking an even easier quick fix would be to generate daily lightning files from the monthly ones. This would bypass having to update MAPL, and thus avoid the potential to mess up the read of other input files.
Also please note I have not yet looked into the leap day issue originally reported by @jmoch1214. The discussion thus far has centered on reading lightning data for runs starting on any day in February.
See also geoschem/geos-chem#160.
It looks like this bug also affects simulations starting in June.
@LiamBindle, have you checked other months as well? And other years? I haven't seen a definite pattern yet.
No, I just had to rerun a timing test that started in June. Just posting here that I saw it crash in June as well.
I generated daily files from all of the v2020-03 offline lightning data yesterday and preliminary tests show it fixes the issue. If you want to try using those instead (just edit path and filenames in ExtData.rc), they are located here: http://ftp.as.harvard.edu/gcgrid/data/ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-11/.
Here is an example of the change when switching between using monthly and daily files. This is the average total flash rate for standard simulation GCHP runs started on February 1, 2019. Unlike 2016, which crashes when starting in February, 2019 does not crash but does use the wrong data. I have not yet figured out why 2016 crashes (as do 2015 and 2017) while 2018 and 2019 do not. Regardless, using daily files corrects both the incorrect file read issue and the crashing. I will push an update to 13.0.0 dev so that GCHP uses the daily files in time for the first 13.0.0 1-year benchmark for GCHP. We will continue to look into a robust solution for MAPL input file handling so that we can then revert to using the same offline lightning files in both GCHP and GEOS-Chem Classic.
Could you also generate the daily files for GFED4 and for TZOMPASOSA_C2H6? It looks like those are the other monthly files used by default. I added the lightning NOx fix manually in HEMCO_Config.rc and ExtData.rc, but then it crashed on GFED with the same type of bracketing error. Did that not happen to you in your tests?
Yikes, this is getting unwieldy. @LiamBindle do you have any thoughts on how we might more permanently fix this?
The TZOMPASOSA_C2H6 files are monthly climatology files, so they would not trigger the logic in ExtData that we are discussing as problematic for the offline lightning files. The GFED files are not monthly climatology, but they are single-time monthly files, not daily. If there are issues with these files for 2016 then it may be a separate issue. I did not have any problems with them in my short runs for 2016 or in my month-long test run in 2019, but I also did not look at the ExtData debug output to check whether the correct months were read.
As an update on this, totally out-of-the-box GCHP (dev/gchp_13.0.0 branch) also crashes on GFED4 when trying to start in February 2016. I just downloaded the code as is (so it includes Lizzie's recent updates for the lightning NOx files). There is the same bracketing data error as previously: "ExtData could not find bracketing data from file template". I compiled with debug and RRTMG set to yes, again using Lizzie's ifort 18 environment. I started an hour-long single-run simulation for February 1, 2016. The added information in the error message from the debug flag also points to MAPL, but some MPI issues are appearing as well.
I was able to reproduce this issue for GFED4 and verify it is indeed similar to the lightning issue in that some years crash (2016, 2017) while others read successfully but from the wrong file (2018). This means the issue is not high-resolution data in monthly files, but monthly files in general. Climatology files should be unaffected since they go through separate logic, although it seems ExtData in general needs a thorough check. I checked a July simulation for both 2016 and 2019 and the correct files were read both times, so this does not impact the 1-month benchmark. (Earlier I did the same for the original offline lightning files and they were also fine, but I don't think I reported on that here.) I think the simplest way to diagnose all issues would be to do a 1-year run and then scrutinize the emissions and inventory outputs in a comparison with GEOS-Chem Classic. This would catch other months with incorrect time reads if average monthly plots or tables are created, and would give a better assessment of the full scope of the issue(s).
I also recommend having ExtData debug on for that 1-year run so that the year/month used per import per timestep is available in a log.
One last thing: you can speed up the test run by turning off all GEOS-Chem components other than emissions. No need to do huge computation for these tests.
@jmoch1214, regarding a quick fix for these other problematic files, I'd recommend putting the data into annual files rather than daily. You can adapt my script for offline lightning, which is located on gcgrid in the subdirectory OFFLINE_LIGHTNING/v2020-11/scripts. After adapting the time loops, simply use cdo mergetime rather than cdo selday.
I think the problem is the division by the (possibly variable-length) frequency in this block:

n = max(0,floor((cTime-item%reff_time)/item%frequency))
if (n>0) fTime = fTime + (n*item%frequency)

IIUC, division by an interval is only valid if the interval is constant. I think we could solve this by instead dividing by a constant interval like 1 day if the export's frequency is >= 1 month (i.e. the frequency is variable), i.e., replacing that block with

if (item%frequency_is_constant) then
   ! constant-length frequency: dividing by the interval is well defined
   n = max(0,floor((cTime-item%reff_time)/item%frequency))
   if (n>0) fTime = fTime + (n*item%frequency)
else
   ! variable-length frequency (e.g. monthly): step in whole days instead
   call ESMF_TimeIntervalSet(interval_1day,d=1)
   n = max(0,floor((cTime-item%reff_time)/interval_1day))
   if (n>0) fTime = fTime + (n*interval_1day)
end if

where interval_1day is a constant 1-day interval and frequency_is_constant flags whether the export's frequency has a fixed length. For me, this appears to fix:
I will push a fix tomorrow morning. In the meantime, here's a diff with the fix for MAPL that you can apply locally. What do you think @sdeastham? I haven't verified whether this also fixes the leap day problem.
I agree that this gets to the heart of the problem, and it seems like a good solution. IIRC this particular chunk of code is just trying to figure out "what would the next file be to grab, based on the file template, if all files existed". It (and the previous code) would fail if the given reference time did not provide a valid file. It may also deal with leap years fine - most of the leap year logic is in the later code anyway, and this just needs to find the first valid file template which is after the current time (it's always testing the next time to see if this one is OK). My big remaining worry here is performance. If it ends up being slow, a clunkier - but robust - solution would be to work explicitly with
@LiamBindle I applied the diff and the issue of the model crashing in February 2016 on lightning NOx (or other monthly files) seems to be fixed! But the model still seems to crash at 00:00 on Feb 29, 2016. I tried an out-of-the-box GCHP run (compiled as before, meaning with ifort 18, RRTMG on, and debug on) and tried a 2-hour simulation starting at 23:00 on Feb 28, but the simulation keeps crashing at 00:00 on February 29. The error messages indicate it is another MAPL issue, but I can't tell from the messages exactly what. The log file is attached.
@jmoch1214 I think #62 should fix the issues. When you have a chance, could you let me know if you're still getting anything weird?
Just following up on this. Thanks again @jmoch1214 for trying #62. It looks like the RRTMG simulation is now crashing at a different part of ExtData on the leap day. Specifically, it's crashing on the final line of this block:

! apply the timestamp template
call ESMF_TimeGet(time, yy=yy, mm=mm, dd=dd, h=hs, m=ms, s=ss, __RC__)
i = scan(str_yy, '%'); if (i == 0) read (str_yy, '(I4)') yy
i = scan(str_mm, '%'); if (i == 0) read (str_mm, '(I2)') mm
i = scan(str_dd, '%'); if (i == 0) read (str_dd, '(I2)') dd
i = scan(str_hs, '%'); if (i == 0) read (str_hs, '(I2)') hs
i = scan(str_ms, '%'); if (i == 0) read (str_ms, '(I2)') ms
i = scan(str_ss, '%'); if (i == 0) read (str_ss, '(I2)') ss
call ESMF_TimeSet(timestamp_, yy=yy, mm=mm, dd=dd, h=hs, m=ms, s=ss, __RC__)

@sdeastham wrote:
Just following up on this. @jmoch1214 and I have been in communication offline, and it looks like adding this

! Fix searching for Feb 29 of non-leap year
if ((mod(yy - 1960, 4) /= 0).and.(mm==2).and.(dd==29)) then
! note: "mod(yy - 1960, 4) /= 0" is TRUE if it is not a leap year
   dd=28
endif

immediately before the ESMF_TimeSet call above works around the leap day crash.
Although... won't this fail for 2000 (not a leap year)?
@sdeastham thanks for pointing that out! I think 2000 actually is a leap year though, since it's divisible by 400, isn't it? This would misclassify 1900 and 2100 as leap years though. That being said, I wouldn't propose this as a robust solution, but rather as a quick, kludgy workaround. Once the upstream ExtData updates are merged, it would probably be worthwhile to revise our calendar jumping and implement some unit tests for it. It sounds like Ben has made a lot of progress on this front.
@LiamBindle - this is why I shouldn't try to science on too little sleep! You're 100% right. I was getting my "special cases" the wrong way round, forgetting that 2000 is unusual because it is a leap year when usually a century isn't. Agreed then that this solution is fine for now. Thanks!
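(For reference, here is a minimal sketch of the full Gregorian leap-year test being described above; it is purely illustrative and not code from MAPL or from the workaround earlier in this thread.)

    ! Gregorian rule: divisible by 4, except century years,
    ! unless the century is divisible by 400 (so 2000 is a leap year,
    ! while 1900 and 2100 are not).
    logical function is_leap_year(yy)
      implicit none
      integer, intent(in) :: yy
      is_leap_year = (mod(yy,4) == 0 .and. mod(yy,100) /= 0) .or. (mod(yy,400) == 0)
    end function is_leap_year

Something along these lines could also back the unit tests mentioned above once the calendar-jumping code is revisited.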
I'm going to close this issue. Feel free to reopen at any time.
Describe the bug:
GCHP crashes and says there is an error reading in lightning NOx when I try to restart the multirun set of simulations. GCHP also crashes on a leap day with a MAPL error. I don't know if the two errors are related.
Expected behavior:
GCHP reads in lightning NOx, proceeds with the simulation, and doesn't crash in the first place when getting to the leap day.
Actual behavior:
GCHP crashes and says there is an error reading in lightning NOx when I try to restart a multirun set of simulations, and crashes on a leap day.
Steps to reproduce the bug:
Start a single-run simulation on 20160229 000000 (or 20160207 000000, or seemingly any time in February 2016).
Or attempt to re-start a multirun simulation set that previously crashed by using an existing cap_restart and the last restart file from the multirun (restarting on February 1).
For the leap day simulations I've now had multiple simulations crash for the month of February when getting to 00:00 on Feb 29. See the log file below for an example.
Compilation commands
I used CMake and ifort 18, with the standard environment used by Lizzie Lundgren, and with RRTMG on.
Run commands
Used the gchp.run script.
Error messages
For the lightning NOx crash the .out file says:
ExtData could not find bracketing data from file template
./HcoDir/OFFLINE_LIGHTNING/v2020-03/GEOSFP/%y4/FLASH_CTH_GEOSFP_0.25x0.3125_%y4
_%m2.nc4 for side L
The .err files for both types of crashes have lots of MAPL errors and MPI abort errors. See the relevant log files listed below.
HEMCO.log didn't have anything specific for either of the two errors.
Required information:
Your GCHP version and runtime environment:
Input and log files to attach
See here on Cannon for all the above files: /n/holyscratch01/jacob_lab/jmoch/geoE_rdirs/GCHP_13.0.0_geoE_off_vtest3
The relevant log files are: slurm-7116496.out and slurm-7116496.err for the initial crash, and slurm-7180457.out and slurm-7180457.err for the crash when I try to restart it and get a lightning NOx error.
Additional context