[BUG/ISSUE] GCHP crash on reading in lightning NOx when trying to start a simulation in February 2016 and crash when trying to do a leap day #61

Closed
jmoch1214 opened this issue Nov 9, 2020 · 33 comments
Assignees
Labels
category: Bug Something isn't working

Comments

@jmoch1214

Describe the bug:

GCHP crashes and says there is an error reading in lightning NOx when I try to restart the multirun set of simulations. GCHP also crashes on a leap day with a MAPL error. I don't know if the two errors are related.

Expected behavior:

GCHP reads in lightning NOx, proceeds with the simulation, and does not crash when it reaches the leap day.

Actual behavior:

GCHP crashes and says there is an error reading in lightning NOx when I try to restart a multirun set of simulations, and it also crashes on a leap day.

Steps to reproduce the bug:

Start a single-run simulation on 20160229 000000 (or 20160207 000000, or seemingly any time in February 2016).

Or attempt to restart a multirun simulation set that previously crashed by using an existing cap_restart and the last restart file from the multirun (restarting on February 1).

For the leap day simulations, I've now had multiple simulations crash for the month of February when getting to 00:00 on Feb 29. See the log file below for an example.

Compilation commands
I used CMake and ifort 18 (the standard environment used by Lizzie Lundgren), with RRTMG on.

Run commands
Used the gchp.run script.

Error messages

For the lightning NOx crash the .out file says:
ExtData could not find bracketing data from file template
./HcoDir/OFFLINE_LIGHTNING/v2020-03/GEOSFP/%y4/FLASH_CTH_GEOSFP_0.25x0.3125_%y4
_%m2.nc4 for side L

The .err files for both types of crashes have lots of MAPL errors and MPI abort errors. See the relevant log files listed below.

HEMCO.log didn't have anything specific for either of the two errors.

Required information:

Your GCHP version and runtime environment:

  • GCHPctm version (can be last commit hash): dev.gchp_13.0.0
  • MPI type and version: __
  • Fortran compiler type and version: ifort 18
  • netCDF version: __
  • Are you using GCHP "out of the box" (i.e. unmodified): __
    • If you have modified GCHP, please list what was changed: I added full column diagnostics for RRTMG and read in some stratospheric aerosol properties. These all work fine for the first month, but the simulations keep crashing when starting Feb 29, 2016. When I then try to restart them on Feb 1, 2016, I get the lightning NOx error.

Input and log files to attach

  • runConfig.sh: __
  • input.geos: __
  • HEMCO_Config.rc: __
  • ExtData.rc: __
  • HISTORY.rc: __
  • GCHP compile log file: __
  • GCHP run log file: __
  • HEMCO.log: __
  • slurm.out or any other error messages from your scheduler: __
  • Any other error messages: __

See here on Cannon for all the above files: /n/holyscratch01/jacob_lab/jmoch/geoE_rdirs/GCHP_13.0.0_geoE_off_vtest3
The relevant log files are slurm-7116496.out and slurm-7116496.err for the initial crash, and slurm-7180457.out and slurm-7180457.out for the crash when I try to restart it and get a lightning NOx error.

Additional context

@jmoch1214 jmoch1214 added the category: Bug Something isn't working label Nov 9, 2020
@LiamBindle
Contributor

LiamBindle commented Nov 10, 2020

@jmoch1214 I think I've actually run into this as well. However, I ran into this when I was setting up a new ExtData, so I thought the error had to do with one of my input files. It sounds like there might be a bigger issue that's causing sims starting in leap-year Februaries to crash.

Does the simulation crash if you start in February 2020 as well? What about a non-leap year like 2017? Sorry, I would try it myself, but our cluster is packed right now.

@lizziel
Contributor

lizziel commented Nov 10, 2020

I am able to reproduce this for Feb 2016 and also Feb 2017. Jan and Mar 2016 are fine. Indeed, my Feb run crashes no matter when in the month I start. Looking at the ExtData log with debug prints on it appears to be grabbing lightning from a completely different year and month. I'll report back when I figure out what is going on.

@lizziel lizziel self-assigned this Nov 10, 2020
@lizziel
Contributor

lizziel commented Nov 10, 2020

Here is an example of the strange behavior that is happening, in this case when starting a run on 20170201 000000. Notice it is using the file for Nov 2019 despite the target time being in Feb 2017.

ExtData Run_: READ_LOOP: variable 068 of 737: LDENS
    ==> file: 
 ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%
 m2.nc4
    ==> cyclic: n
    ==> isConst:  F
    ExtData Run_: DO_UPDATE: Start. doUpdate_ is true.
       ExtData Run_: HAS_RUN: Start. hasRun is false. Update time.
       ExtData Run_: HAS_RUN: NotSingle is true. Update left time (bracket L)
            UpdateBracketTime: Scanning template ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4 for side L
            UpdateBracketTime: Target time   : 2017-02-01 01:30:00
            UpdateBracketTime: Reference time: 1985-01-01 00:00:00
            UpdateBracketTime: Untemplating ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4
            ==> Target time   : 2017-02-01 01:30:00
            ==> File time     : 2019-11-01 00:00:00
            ==> item%frequency     : 0000-01-00 00:00:00
            ==> # iterations until found:   418
 DEBUG: **Target file** for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4 found and is ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4
            UpdateBracketTime: Making metadata for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4
 DEBUG: Retrieving formatter for: ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4
               GetBracketTimeOnFile: (L) called for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4
               GetBracketTimeOnFile: Year offset of  0 applied while scanning ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4 to give target time 2017-02-01 01:30:00
            UpdateBracketTime: Found status of ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4: F
            UpdateBracketTime: Scanning for bracket L of ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_11.nc4. RSide: F
 DEBUG: Retrieving formatter for: ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_10.nc4
               GetBracketTimeOnFile: (L) called for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_10.nc4
               GetBracketTimeOnFile: Year offset of  0 applied while scanning ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2019/FLASH_CTH_MERRA2_0.5x0.625_2019_10.nc4 to give target time 2017-02-01 01:30:00
 ExtData could not find bracketing data from file template 
 ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%
 m2.nc4 for side L
===> Run ended at Tue Nov 10 16:50:41 EST 2020

@lizziel
Contributor

lizziel commented Nov 10, 2020

Offline lightning reads for February in years that do not fail are also wonky. For example, starting a run on Feb 1, 2019 shows this for the offline lightning left bracket:

ExtData Run_: READ_LOOP: variable 068 of 737: LDENS
    ==> file: 
 ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%
 m2.nc4
    ==> cyclic: n
    ==> isConst:  F
    ExtData Run_: DO_UPDATE: Start. doUpdate_ is true.
       ExtData Run_: HAS_RUN: Start. hasRun is false. Update time.
       ExtData Run_: HAS_RUN: NotSingle is true. Update left time (bracket L)
            UpdateBracketTime: Scanning template ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4 for side L
            UpdateBracketTime: Target time   : 2019-02-01 01:30:00
            UpdateBracketTime: Reference time: 1985-01-01 00:00:00
            UpdateBracketTime: Untemplating ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4
            ==> Target time   : 2019-02-01 01:30:00
            ==> File time     : 2022-01-01 00:00:00
            ==> item%frequency     : 0000-01-00 00:00:00
            ==> # iterations until found:   444
            UpdateBracketTime: Target file not found: 
./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4
            ==> Propagating forwards in file from reference time
            ==> Reference time: 1985-01-01 00:00:00
 DEBUG: Retrieving formatter for: ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/1985/FLASH_CTH_MERRA2_0.5x0.625_1985_01.nc4
            UpdateBracketTime: Extrapolating FORWARD for bracket L for file ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4
            UpdateBracketTime: Making metadata for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4
 DEBUG: Retrieving formatter for: ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4
               GetBracketTimeOnFile: (L) called for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4
               GetBracketTimeOnFile: Year offset of -1 applied while scanning ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4 to give target time 2018-02-01 01:30:00
               GetBracketTimeOnFile:: Data from time 2018-02-01 01:30 set for bracket L of file ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4
               GetBracketTimeOnFile: ==> Mapped to: 2019-02-01 01:30 Offset:-01
            UpdateBracketTime: Found status of ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4: T
            UpdateBracketTime: Updated bracket L for ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/2018/FLASH_CTH_MERRA2_0.5x0.625_2018_02.nc4
            ==> (L) Time requested: 2019-02-01 01:30:00
            ==> (L) Record time   : 2018-02-01 01:30:00
            ==> (L) Effective time: 2019-02-01 01:30:00

Notable here are the initial file date of 2022-01-01 and ultimate settling on reading 2018-02-01 for a simulation that starts on 2019-02-01.

The code that determines file time has some warnings that need investigation:

           else
              yrOffset = 0
              if (item%reff_time > cTime) then
                 _ASSERT(.False.,'Reference time for file ' // trim(item%file) // ' is too late')
              end if
              ! This approach causes a problem if cTime and item%reff_time are too far
              ! apart - do it the hard way instead... 
              ftime = item%reff_time
              n = 0
              ! SDE DEBUG: This caused problems in the past but the
              ! alternative is far too slow... need to keep an eye 
              ! on this but the Max(0,...) should help.
              n = max(0,floor((cTime-item%reff_time)/item%frequency))
              if (n>0) fTime = fTime + (n*item%frequency)
              do while (.not.found)
                 ! SDE: This needs to be ">"
                 found = ((ftime + item%frequency) > ctime)
                 if (.not.found) then
                    n = n + 1
                    ftime = fTime+item%frequency
                 end if
              end do

@lizziel
Contributor

lizziel commented Nov 11, 2020

@sdeastham, do you remember anything about the motivation for your updates to this piece of code, and whether it might have anything to do with the issue seen in lightning? For reference, for offline lightning in ExtData.rc we have this:

FLASH_DENS 1 N Y F0;013000 none none LDENS ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4\
/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4

The data is in units of minutes since start of year and is stored in monthly data files. I have found that this issue is not just in the new v2020-03 data directory, but also the previous one we used (v2019-01). The file times ultimately used per year are seemingly random: 2018-02 for 2019 sim, 2017-02 for 2018 sim, 2019-11 for 2017 sim, 2017-08 for 2016 sim, etc., but they are reproducible across repeated sim year runs.

@lizziel
Contributor

lizziel commented Nov 11, 2020

The first place it seems to be messing up is this line:
n = max(0,floor((cTime-item%reff_time)/item%frequency))

n is returning 444, which corresponds to 37 years given a 1-month frequency. Added to ref time 1985 that gives 2022:

            UpdateBracketTime: Scanning template ./HcoDir/OFFLINE_LIGHTNING/v2020-03/MERRA2/%y4/FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4 for side L
            UpdateBracketTime: Target time   : 2019-02-01 01:30:00
            UpdateBracketTime: Reference time: 1985-01-01 00:00:00
            ==> Initial file time equal to item%reff_time     : 1985-01-01 00:00:00
            ==> file time after incremented by n*item%frequency, n=0444, 2022-01-01 00:00:00

cTime is the same as target time, so somehow (2019-02 minus 1985-01) / 1-month is yielding 37 years.
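One reading that reproduces both logged counts is that the one-month interval is evaluated as a fixed 28 days when the division is carried out. This is inferred from the numbers in the logs above, not confirmed from the MAPL or ESMF source. A minimal Python sketch of that arithmetic (all helper names are illustrative, and Python's datetime stands in for the ESMF time types):

```python
from datetime import datetime

def months_after(ref, n):
    """Return the first of the month that lies n calendar months after ref."""
    total = ref.year * 12 + (ref.month - 1) + n
    return datetime(total // 12, total % 12 + 1, 1)

def calendar_month_count(ref, target):
    """Whole calendar months between ref and target, using year/month fields."""
    return (target.year - ref.year) * 12 + (target.month - ref.month)

def fixed_length_count(ref, target, days_per_month):
    """floor(elapsed / interval) when a month is ASSUMED to be a fixed day count."""
    elapsed_days = (target - ref).total_seconds() / 86400.0
    return int(elapsed_days // days_per_month)

ref = datetime(1985, 1, 1)            # reference time from the log
t2019 = datetime(2019, 2, 1, 1, 30)   # target time of the 2019 run
t2017 = datetime(2017, 2, 1, 1, 30)   # target time of the 2017 run

n2019 = fixed_length_count(ref, t2019, 28)
n2017 = fixed_length_count(ref, t2017, 28)
print(n2019, months_after(ref, n2019))  # count and file time for the 2019 run
print(n2017, months_after(ref, n2017))  # count and file time for the 2017 run
print(calendar_month_count(ref, t2019), calendar_month_count(ref, t2017))
```

Under the 28-day assumption the counts come out as 444 (file time 2022-01) and 418 (file time 2019-11), matching both logs, while the calendar counts are 409 and 385. Because the loop that follows only steps forward, an overshoot like this cannot be corrected afterwards.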

@sdeastham
Contributor

Argh, I basically wasn't thinking about files which are specified once per month when I wrote that. My focus was on daily files. Months are just a terrible unit of time because they are completely unreliable - 30 days, or 31 days, or 28 days, or 29 days... useless. Unfortunately the easiest hack might be to do something like replacing the use of item%frequency with something (pseudocode here, not sure what it would actually be) along the lines of freq = min(item%frequency,timedelta(days=1)) - basically, forcing the "check" frequency to be something robust. That could do with some more careful thought though.

@lizziel
Contributor

lizziel commented Nov 12, 2020

I'm thinking an even easier quick fix would be to generate daily lightning files from the monthly. This would bypass having to update MAPL and thus the potential to mess up the read of other input files.

@lizziel
Contributor

lizziel commented Nov 12, 2020

Also please note I have not yet looked into the leap day issue originally reported by @jmoch1214. The discussion thus far has centered on reading lightning data in February at any day.

@lizziel
Contributor

lizziel commented Nov 12, 2020

See also geoschem/geos-chem#160.

@LiamBindle
Contributor

It looks like this bug also affects simulations starting in June.

@lizziel
Contributor

lizziel commented Nov 13, 2020

@LiamBindle, have you checked other months as well? And other years? I haven't seen a definite pattern yet.

@LiamBindle
Contributor

No, I just had to rerun a timing test that started in June. Just posting here that I saw it crash in June as well.

@lizziel
Contributor

lizziel commented Nov 13, 2020

I generated daily files from all of the v2020-03 offline lightning data yesterday and preliminary tests show it fixes the issue. If you want to try using those instead (just edit path and filenames in ExtData.rc), they are located here: http://ftp.as.harvard.edu/gcgrid/data/ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-11/.

@lizziel
Contributor

lizziel commented Nov 13, 2020

Here is an example of the change from switching from monthly to daily files. This is the average total flash rate for standard simulation GCHP runs started on February 1, 2019. Unlike 2016, which crashes when starting in February, 2019 does not crash but does use the wrong data.

[Image: Screen Shot 2020-11-13 at 1.42.39 PM]

I have not yet figured out why 2016 crashes (and 2015 and 2017), but 2018 and 2019 do not. Regardless, using daily files corrects both the incorrect file read issue and the crashing.

I will push an update to 13.0.0 dev so that GCHP uses the daily files in time for the first 13.0.0 1-year benchmark for GCHP. We will continue to look into a robust solution for MAPL input file handling so that we can then revert to using the same offline lightning files in both GCHP and GEOS-Chem Classic.

@lizziel lizziel assigned LiamBindle and unassigned lizziel Nov 16, 2020
@jmoch1214
Author

jmoch1214 commented Nov 16, 2020

Could you also generate the daily files for GFED4 and for TZOMPASOSA_C2H6? It looks like those are the other monthly files used by default. I added the lightning NOx fix manually in HEMCO_Config.rc and ExtData.rc, but then it crashed on GFED with the same type of bracketing error. Did that not happen to you in your tests?

@sdeastham
Contributor

Yikes, this is getting unwieldy. @LiamBindle do you have any thoughts on how we might more permanently fix this?

@lizziel
Contributor

lizziel commented Nov 16, 2020

The TZOMPASOSA_C2H6 files are monthly climatology files, so they would not trigger the logic in ExtData that we are discussing as problematic for offline lightning files. The GFED files are not monthly climatology, but they are single-time monthly files, not daily. If there are issues with these files for 2016, then it may be a separate issue. I did not have any problems with them in my short runs for 2016 or in my month-long test run in 2019, but I also did not look at the ExtData debug output to check whether the correct months were read.

@jmoch1214
Author

As an update on this, totally out-of-the-box GCHP (dev/gchp_13.0.0 branch) also crashes on GFED4 when trying to start in February 2016. I just downloaded the code as is (so it includes Lizzie's recent updates for the lightning NOx files). There is the same bracketing data error as previously:

" ExtData could not find bracketing data from file template
./HcoDir/GFED4/v2020-02/%y4/GFED4_gen.025x025.%y4%m2.nc for side L"

I compiled with debug and RRTMG set to yes, again using Lizzie's ifort 18 environment. I started an hour-long single-run simulation for February 1, 2016. The added information in the error message from the debug flag also points to MAPL, but some MPI issues are appearing as well.

@lizziel
Contributor

lizziel commented Nov 17, 2020

I was able to reproduce this issue for GFED4, and verify it is indeed similar to the lightning issue in that some years crash (2016,2017) while others read successfully but from the wrong file (2018). This means the issue is not high resolution data in monthly files, but monthly files in general. Climatology files should be unaffected since they go through separate logic, although it seems ExtData in general needs a thorough check.

I checked a July simulation for both 2016 and 2019 and the correct files were read both times, so this does not impact the 1-month benchmark. (Earlier I did the same for the original offline lightning files too and they were also fine, but I don't think I reported on that here).

I think the simplest way to diagnose all issues would be to do a 1-year run and then scrutinize the emissions and inventory outputs in a comparison with GEOS-Chem Classic. This would catch other months with incorrect time read if average monthly plots or tables are created. This would then give a better assessment of full scope of the issue(s).

@lizziel
Contributor

lizziel commented Nov 17, 2020

I also recommend having ExtData debug on for that 1-year run so that the year/month used per import per timestep is available in a log.
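As a rough aid for checking such a log, the sketch below scans ExtData debug output for bracket updates where the record time falls in a different year or month than the requested time. The line patterns are taken only from the log excerpts earlier in this thread and may differ in other MAPL versions. The function and regex names are hypothetical.

```python
import re
from datetime import datetime

# Patterns ASSUMED from the debug excerpts quoted in this thread.
STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
VAR_RE = re.compile(r"READ_LOOP: variable \d+ of \d+: (\S+)")
REQ_RE = re.compile(r"\((L|R)\) Time requested\s*: " + STAMP)
REC_RE = re.compile(r"\((L|R)\) Record time\s*: " + STAMP)

def parse_time(text):
    """Parse a 'YYYY-MM-DD hh:mm:ss' stamp as printed in the log."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def find_mismatches(lines):
    """Return (variable, side, requested, record) where year or month differ."""
    variable, requested, mismatches = None, {}, []
    for line in lines:
        match = VAR_RE.search(line)
        if match:
            variable, requested = match.group(1), {}
            continue
        match = REQ_RE.search(line)
        if match:
            requested[match.group(1)] = parse_time(match.group(2))
            continue
        match = REC_RE.search(line)
        if match and match.group(1) in requested:
            req, rec = requested[match.group(1)], parse_time(match.group(2))
            if (req.year, req.month) != (rec.year, rec.month):
                mismatches.append((variable, match.group(1), req, rec))
    return mismatches

sample = [
    "ExtData Run_: READ_LOOP: variable 068 of 737: LDENS",
    "            ==> (L) Time requested: 2019-02-01 01:30:00",
    "            ==> (L) Record time   : 2018-02-01 01:30:00",
]
for hit in find_mismatches(sample):
    print(hit)
```

Climatology imports legitimately read records from another year, so hits for those variables would need to be filtered out by hand.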

@lizziel
Contributor

lizziel commented Nov 17, 2020

One last thing, you can speed up the test run by turning off all GEOS-Chem components other than emissions. No need to do huge computation for these tests.

@lizziel
Contributor

lizziel commented Nov 17, 2020

@jmoch1214, regarding a quick fix for these other problematic files, I'd recommend putting them into annual rather than daily files. You can adapt my script for offline lightning, which is located on gcgrid in the subdirectory OFFLINE_LIGHTNING/v2020-11/scripts. After adapting the time loops, simply use cdo mergetime rather than cdo selday.

@LiamBindle
Contributor

LiamBindle commented Nov 18, 2020

@sdeastham wrote: Yikes, this is getting unwieldy. @LiamBindle do you have any thoughts on how we might more permanently fix this?

I think the problem is the division by item%frequency here:

              n = max(0,floor((cTime-item%reff_time)/item%frequency))
              if (n>0) fTime = fTime + (n*item%frequency)

IIUC division by an interval is only valid if the interval is constant. I think we could solve this by instead dividing by a constant interval like 1-day if the export's frequency is >=1 month (i.e. the frequency is variable).

i.e., replacing that block with

              if (item%frequency_is_constant) then
                n = max(0,floor((cTime-item%reff_time)/item%frequency))
                if (n>0) fTime = fTime + (n*item%frequency)
              else
                call ESMF_TimeIntervalSet(interval_1day,d=1)
                n = max(0,floor((cTime-item%reff_time)/interval_1day))
                if (n>0) fTime = fTime + (n*interval_1day)
              end if

where item%frequency_is_constant is set in CreateTimeInterval(). This way the division is by an interval with a constant value. I think the rest of UpdateBracketTime() is okay, since it only does addition with time intervals (addition is valid for monthly or yearly intervals).
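To illustrate the idea outside of MAPL, here is a minimal Python sketch of the non-constant branch: a coarse jump in whole days followed by the same forward test as the original loop. Python's datetime stands in for the ESMF types, the function names are hypothetical, and clamping the day when adding a month is an assumption of this sketch rather than documented ESMF behavior.

```python
import calendar
from datetime import datetime, timedelta

def add_month(t):
    """Advance t by one calendar month, clamping the day to the month length."""
    year, month = (t.year, t.month + 1) if t.month < 12 else (t.year + 1, 1)
    day = min(t.day, calendar.monthrange(year, month)[1])
    return t.replace(year=year, month=month, day=day)

def left_bracket_time(ref, target):
    """Jump by a constant 1-day interval, then scan forward month by month."""
    one_day = timedelta(days=1)
    n = max(0, (target - ref) // one_day)  # division by a constant interval
    ftime = ref + n * one_day
    # Same test as the original loop: stop once the next step passes target.
    while not (add_month(ftime) > target):
        ftime = add_month(ftime)
    return ftime

def untemplate(template, t):
    """Fill the %y4 and %m2 tokens of an ExtData-style file template."""
    return template.replace("%y4", f"{t.year:04d}").replace("%m2", f"{t.month:02d}")

ref = datetime(1985, 1, 1)
template = "FLASH_CTH_MERRA2_0.5x0.625_%y4_%m2.nc4"
for target in (datetime(2019, 2, 1, 1, 30), datetime(2016, 2, 29), datetime(2019, 6, 1)):
    print(target, untemplate(template, left_bracket_time(ref, target)))
```

In this sketch the day-sized jump never lands past the target, so the file resolved from the template always falls in the target's own year and month.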

For me, this appears to fix:

  • Starting the simulation in February 2019
  • The OFFLINE_LIGHTNING error (with monthly files)
  • The GFED4 and other monthly emissions being read in at weird times (e.g. in 2022)

I will push a fix tomorrow morning. In the meantime, here's a diff with the fix for MAPL (you can apply it with git apply in the MAPL submodule): fix.diff.zip.

What do you think @sdeastham?

I haven't verified if this also fixes the leap day problem.

@sdeastham
Contributor

I agree that this gets to the heart of the problem, and it seems like a good solution. IIRC this particular chunk of code is just trying to figure out "what would the next file be to grab, based on the file template, if all files existed". It (and the previous code) would fail if the given reference time did not provide a valid file. It may also deal with leap years fine - most of the leap year logic is in the later code anyway, and this just needs to find the first valid file template which is after the current time (it's always testing the next time to see if this one is OK).

My big remaining worry here is performance. If it ends up being slow, a clunkier - but robust - solution would be to work explicitly with ESMF_TimeGet and MAPL_PackTime, explicitly incrementing the year or month based on the template.

@jmoch1214
Author

@LiamBindle I applied the diff and the issue of the model crashing in February 2016 on lightning NOx (or other monthly files) seems to be fixed! But the model still seems to crash at 00:00 on Feb 29, 2016.

I tried an out of the box GCHP run (compiled as before, meaning with ifort18 and RRTMG on and debug on) and tried a 2 hour simulation starting at 23:00 on Feb 28, but the simulation keeps crashing at 00:00 on February 29.

The error messages indicate it is another MAPL issue, but I can't tell from the messages exactly what. The log file is attached.
slurm-8292456.out.txt. There didn't appear to be any fail messages in gchp.log or HEMCO.log, but those are also attached.
gchp.log
HEMCO.log

LiamBindle added a commit that referenced this issue Nov 19, 2020
LiamBindle added a commit that referenced this issue Nov 19, 2020
@LiamBindle
Contributor

@jmoch1214 I think #62 should fix the issues. When you have a chance, could you let me know if you're still getting anything weird?

@LiamBindle
Contributor

Just following up on this. Thanks again @jmoch1214 for trying #62. It looks like the RRTMG simulation is now crashing at a different part of ExtData on the leap day. Specifically, it's crashing on the final line of

           ! apply the timestamp template
           call ESMF_TimeGet(time, yy=yy, mm=mm, dd=dd, h=hs, m=ms, s=ss, __RC__)

           i = scan(str_yy, '%'); if (i == 0) read (str_yy, '(I4)') yy
           i = scan(str_mm, '%'); if (i == 0) read (str_mm, '(I2)') mm
           i = scan(str_dd, '%'); if (i == 0) read (str_dd, '(I2)') dd
           i = scan(str_hs, '%'); if (i == 0) read (str_hs, '(I2)') hs
           i = scan(str_ms, '%'); if (i == 0) read (str_ms, '(I2)') ms
           i = scan(str_ss, '%'); if (i == 0) read (str_ss, '(I2)') ss

           call ESMF_TimeSet(timestamp_, yy=yy, mm=mm, dd=dd, h=hs, m=ms, s=ss, __RC__)

@sdeastham wrote:

Ah, damn. OK - some logic needs to be added there to deal with the situation when we want to extrapolate leap day information from non-leap-year files. I have (had?) logic elsewhere in ExtData which deals with that, but I'm guessing that when reading one of the RRTMG input files (maybe the MODIS surface information or TES profiles) it is using some slightly different logic to do the extrapolation - CheckUpdate is, for example, new (I think) since I wrote my klunky older logic. I'm guessing that an invalid date (e.g. "2002-02-29") is being passed to ESMF_TimeSet, because it's taking today and adding a simple year offset.

@LiamBindle
Contributor

Just following up on this. @jmoch1214 and I have been in communication offline and it looks like this

! Fix searching for Feb 29 of non-leap year
if ((mod(yy - 1960, 4) /= 0).and.(mm==2).and.(dd==29)) then
    ! note: "mod(yy - 1960, 4) /= 0" is TRUE if it is not a leap year
    dd=28
endif

immediately before the ESMF_TimeSet() call fixes RRTMG crashing on the leap-day.

@sdeastham
Contributor

sdeastham commented Dec 15, 2020

Although... won't this fail for 2000 (not a leap year)?
EDIT: To clarify - the algorithm will incorrectly classify 2000 as a leap year. I suspect that doesn't matter, but can't see where this code would take effect, so I'm not totally sure - would appreciate your thoughts @LiamBindle !

@LiamBindle
Contributor

LiamBindle commented Dec 15, 2020

@sdeastham thanks for pointing that out! I think 2000 actually is a leap year though, since it's divisible by 400, isn't it? This would misclassify 1900 and 2100 as leap years though. That being said, I wouldn't propose this as a robust solution, but more as a quick kludgy workaround. Once the upstream ExtData updates are merged, it would probably be worthwhile to revise our calendar jumping and implement some unit tests for it. It sounds like Ben has made a lot of progress on this front.
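For reference, a small Python sketch comparing the every-fourth-year test used in the workaround with the full Gregorian rule, plus a day clamp that does not depend on either. The function names are illustrative only and this is not the MAPL code.

```python
import calendar

def is_leap_shortcut(year):
    """The workaround's test: every fourth year counted from 1960."""
    return (year - 1960) % 4 == 0

def is_leap_gregorian(year):
    """Divisible by 4, except century years that are not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def clamp_day(year, month, day):
    """Pull an out-of-range day back to the last valid day of that month."""
    return min(day, calendar.monthrange(year, month)[1])

for year in (1900, 2000, 2002, 2016, 2100):
    print(year, is_leap_shortcut(year), is_leap_gregorian(year), clamp_day(year, 2, 29))
```

The two tests agree for every year from 1901 through 2099, which covers the simulation years discussed here. They differ only at century years such as 1900 and 2100.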

@sdeastham
Contributor

@LiamBindle - this is why I shouldn't try to science on too little sleep! You're 100% right. I was getting my "special cases" the wrong way round, forgetting that 2000 is unusual because it is a leap year when usually a century isn't. Agreed then that this solution is fine for now. Thanks!

@LiamBindle
Contributor

I'm going to close this issue. Feel free to reopen at any time.
