
[BUG/ISSUE] CH4 simulation: Infinity in DO_CLOUD_CONVECTION #954

Status: Closed
Opened by yanglibj on Oct 14, 2021 · 29 comments
Assignee: yantosca
Labels: category: Bug (Something isn't working), stale (No recent activity on this issue), topic: Input Data (Related to input data)

Comments

@yanglibj commented Oct 14, 2021

Description of the problem

Using version 12.9.3, I got the "Infinity in DO_CLOUD_CONVECTION" error starting at the 2021/04/04 19:30 time step.

     - Found all A1     met fields for 2021/04/04 18:30
     - Found all A3cld  met fields for 2021/04/04 19:30
     - Found all A3dyn  met fields for 2021/04/04 19:30
     - Found all A3mstC met fields for 2021/04/04 19:30
     - Found all A3mstE met fields for 2021/04/04 19:30
     - Found all I3     met fields for 2021/04/04 21:00
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K,IC):    8   1           NaN
K, IC, Q(K,IC):   12   1           NaN
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K,IC):    8   1           NaN
K, IC, Q(K,IC):   11   1           NaN
K, IC, Q(K,IC):    2   1           NaN
K, IC, Q(K,IC):    9   1           NaN
K, IC, Q(K,IC):    8   1           NaN
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K,IC):   10   1           NaN
K, IC, Q(K,IC):   10   1           NaN
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K,IC):    8   1           NaN

Description of troubleshooting performed

I tried an earlier fix that was added to UCX_MOD (commit e4a632b), but this didn't resolve the problem.

I turned on debugging to trace the problem but didn't find anything wrong in tpcore_fvdas_mod.F90, even though the traceback points there:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libifcoremt.so.5   00002AE02E103216  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002AE030167630  Unknown               Unknown  Unknown
geos_bpch          00000000006ED032  tpcore_fvdas_mod_        1885  tpcore_fvdas_mod.F90
geos_bpch          00000000006C9945  tpcore_fvdas_mod_         839  tpcore_fvdas_mod.F90
libiomp5.so        00002AE02FBFDC53  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00002AE02FBCD357  Unknown               Unknown  Unknown
libiomp5.so        00002AE02FBCE413  __kmp_fork_call       Unknown  Unknown
libiomp5.so        00002AE02FBA4E2A  __kmpc_fork_call      Unknown  Unknown
geos_bpch          00000000006CEED6  tpcore_fvdas_mod_         828  tpcore_fvdas_mod.F90
geos_bpch          000000000042C946  transport_mod_mp_         579  transport_mod.F
geos_bpch          000000000042B1F0  transport_mod_mp_         293  transport_mod.F
geos_bpch          0000000000644B9D  MAIN__                   1442  main.F
geos_bpch          00000000004039BE  Unknown               Unknown  Unknown
libc-2.17.so       00002AE03059A555  __libc_start_main     Unknown  Unknown
geos_bpch          00000000004038C9  Unknown               Unknown  Unknown

GEOS-Chem version

12.9.3

yanglibj added the "category: Bug" label on Oct 14, 2021
@yantosca (Contributor) commented Oct 14, 2021

Line 1885 of tpcore_fvdas_mod.F90 is the declaration of DQ1 in this code below:

  SUBROUTINE Qckxyz( dq1, J1P, J2P,  JU1_GL, J2_GL, &
                     ILO, IHI, JULO, JHI,    I1,    &
                     I2,  JU1, J2,   K1,     K2 )
!
! !INPUT PARAMETERS:
!
    ! Global latitude indices at the edges of the S/N polar caps
    ! J1P=JU1_GL+1; J2P=J2_GL-1 for a polar cap of 1 latitude band
    ! J1P=JU1_GL+2; J2P=J2_GL-2 for a polar cap of 2 latitude bands
    INTEGER, INTENT(IN)  :: J1P,    J2P

    ! Global min & max latitude (J) indices
    INTEGER, INTENT(IN)  :: JU1_GL, J2_GL

    ! Local min & max longitude (I), latitude (J), altitude (K) indices
    INTEGER, INTENT(IN)  :: I1,     I2
    INTEGER, INTENT(IN)  :: JU1,    J2
    INTEGER, INTENT(IN)  :: K1,     K2

    ! Local min & max longitude (I) and latitude (J) indices
    INTEGER, INTENT(IN)  :: ILO,    IHI
    INTEGER, INTENT(IN)  :: JULO,   JHI
!
! !INPUT/OUTPUT PARAMETERS:
!
    ! Species density [hPa]
    REAL(fp),  INTENT(INOUT) :: dq1(ILO:IHI, JULO:JHI, K1:K2)
!

One puzzling thing is that Qckxyz is called at line 927, but the traceback says line 828:

       if (FILL) then
        ! ===========
          call Qckxyz  &
        ! ===========
               (dq1, &
               j1p, j2p, 1, jm, &
               1, im, 1, jm, 1, im, 1, jm, 1, km)
       end if
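
As a quick diagnostic (a sketch only, not part of the GEOS-Chem source), one could test dq1 for non-finite values immediately before the Qckxyz call with the Fortran 2003 ieee_arithmetic intrinsics. That distinguishes bad data arriving from upstream (convection or the met input) from a genuine memory/indexing error inside Qckxyz:

       ! Diagnostic sketch only.  The USE statement belongs in the module
       ! header; it is shown here for clarity.
       use ieee_arithmetic, only : ieee_is_finite

       if (FILL) then
          ! Report any NaN/Infinity entries in dq1 before the fill step
          if ( count( .not. ieee_is_finite( dq1 ) ) > 0 ) then
             write( 6, * ) 'tpcore: non-finite dq1 entries before Qckxyz: ', &
                           count( .not. ieee_is_finite( dq1 ) )
          endif
        ! ===========
          call Qckxyz  &
        ! ===========
               (dq1, &
               j1p, j2p, 1, jm, &
               1, im, 1, jm, 1, im, 1, jm, 1, km)
       end if

If the count is nonzero, the NaNs are already present when transport receives the array, which would point back at the convection step or the input data rather than at Qckxyz itself.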

@yantosca (Contributor)

@yanglibj can you also try building with GNU Fortran (e.g. gfortran 10) with debugging enabled? That might trap the error better than ifort does.
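
If rebuilding with gfortran proves troublesome, another lightweight option (a minimal sketch, not GEOS-Chem code) is to make the existing executable halt at the first invalid floating-point operation using the standard Fortran 2003 ieee_exceptions module, placed near the top of the main program. This has roughly the same effect as gfortran's -ffpe-trap=invalid,zero flag: the traceback then points at where the NaN is first produced rather than where it is later detected.

    ! Sketch only: halt on invalid operations (NaN-producing) and on
    ! division by zero.  Requires Fortran 2003 IEEE support, which both
    ! ifort and gfortran provide.
    use ieee_exceptions, only : ieee_set_halting_mode, &
                                ieee_invalid, ieee_divide_by_zero

    call ieee_set_halting_mode( ieee_invalid,        .true. )
    call ieee_set_halting_mode( ieee_divide_by_zero, .true. )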

@yanglibj (Author) commented Oct 14, 2021 via email

@yantosca (Contributor)

@yanglibj (Author)

@yantosca I tried this environment but ran into different, odd compilation problems. Since the DO_CLOUD_CONVECTION error occurred with the original 12.9.3 code, could you please try it from your side? Thank you!

@yanglibj (Author) commented Oct 14, 2021

To provide more info: when I compiled with gfortran, I got the following error. Using ifort was fine, however...

/n/helmod/apps/centos7/MPI/intel/17.0.4-fasrc01/openmpi/2.1.0-fasrc02/netcdf-fortran/4.4.0-fasrc03/lib/libnetcdff.so: undefined reference to `iso_c_binding_mp_c_null_ptr_'
collect2: error: ld returned 1 exit status
make[8]: *** [exe] Error 1
make[8]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/HEMCO/src/Interfaces'
make[7]: *** [all] Error 2
make[7]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/HEMCO/src/Interfaces'
make[6]: *** [check] Error 2
make[6]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/HEMCO'
make[5]: *** [lib] Error 2
make[5]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/HEMCO'
make[4]: *** [libhemco] Error 2
make[4]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/GeosCore'
make[3]: *** [lib] Error 2
make[3]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/GeosCore'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3/GeosCore'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/n/home07/yanglibj/GC/Code.12.9.3'
cp -f ./CodeDir/bin/geos geos
cp: cannot stat ‘./CodeDir/bin/geos’: No such file or directory
make: *** [build] Error 1

@yantosca (Contributor)
@yanglibj did you make clean before compiling with gfortran?

@yanglibj (Author)
Yes I did. I always make clean before compiling.

@yantosca (Contributor)
Another tip: open a new terminal window and load the gfortran environment there, then run make realclean and try to compile again. Changing software modules within the same terminal session can sometimes lead to issues if leftover modules aren't purged.

@yanglibj (Author) commented Oct 14, 2021

I still couldn't get the model running with gcc.gfortran10.2_cannon.env, but I was able to use gfortran through init.gc-classic.gfortran71.centos7. However, the new simulation didn't give me any new information at all...

I also tried an earlier version, 12.7.1, but got the same NaN error. Below is the message:

libifcoremt.so.5   00002AEF268F5216  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002AEF28959630  Unknown               Unknown  Unknown
geos_bpch          00000000006ED032  tpcore_fvdas_mod_        1885  tpcore_fvdas_mod.F90
geos_bpch          00000000006C9945  tpcore_fvdas_mod_         839  tpcore_fvdas_mod.F90
libiomp5.so        00002AEF283EFC53  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00002AEF283BF357  Unknown               Unknown  Unknown
libiomp5.so        00002AEF283C0413  __kmp_fork_call       Unknown  Unknown
libiomp5.so        00002AEF28396E2A  __kmpc_fork_call      Unknown  Unknown
geos_bpch          00000000006CEED6  tpcore_fvdas_mod_         828  tpcore_fvdas_mod.F90
geos_bpch          000000000042C946  transport_mod_mp_         579  transport_mod.F
geos_bpch          000000000042B1F0  transport_mod_mp_         293  transport_mod.F
geos_bpch          0000000000644B9D  MAIN__                   1442  main.F
geos_bpch          00000000004039BE  Unknown               Unknown  Unknown
libc-2.17.so       00002AEF28D8C555  __libc_start_main     Unknown  Unknown
geos_bpch          00000000004038C9  Unknown               Unknown  Unknown

@yantosca (Contributor)
@yanglibj, where is your code & rundir on Cannon? I can try to take a look. Make sure the folders have world-readable permissions (i.e. chmod 755 for folders or executables, chmod 644 for files).

@yantosca (Contributor)
Also, @yanglibj, is your code "out-of-the-box" or does it contain any modifications?

@msulprizio (Contributor)
@yanglibj You wrote:

I managed to run v12.9.3 with gcc on the Baylor Kodiak cluster. It didn't show NaN as my student's run but the simulation stopped at the same time step.

If the run is crashing consistently at the same date and time, that may indicate a bad met field. I would recommend looking into the met fields for that date. If you see any issues, let us know and we can try to reprocess the meteorology fields.
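
For reference, here is a minimal sketch (not part of GEOS-Chem) of how a met file could be scanned for bad values with the standard netcdf-fortran API. The file name and the variable name 'T' are assumptions based on the GEOS-FP I3 file discussed later in this thread, the field is assumed to be 4-D (lon, lat, lev, time), and netCDF return codes are not checked, for brevity.

    program scan_met_field
      ! Sketch: report the range of the temperature field in a GEOS-FP I3
      ! file and count entries that are non-finite or physically implausible.
      use netcdf
      use ieee_arithmetic, only : ieee_is_finite
      implicit none
      integer           :: ncid, varid, rc, i
      integer           :: dimids(4), dims(4)
      real, allocatable :: t(:,:,:,:)

      rc = nf90_open( 'GEOSFP.20210404.I3.4x5.nc', NF90_NOWRITE, ncid )
      rc = nf90_inq_varid( ncid, 'T', varid )
      rc = nf90_inquire_variable( ncid, varid, dimids=dimids )
      do i = 1, 4
         rc = nf90_inquire_dimension( ncid, dimids(i), len=dims(i) )
      end do
      allocate( t(dims(1), dims(2), dims(3), dims(4)) )
      rc = nf90_get_var( ncid, varid, t )
      rc = nf90_close( ncid )

      write(6,*) 'min/max T [K]     : ', minval(t), maxval(t)
      write(6,*) 'non-finite values : ', count( .not. ieee_is_finite(t) )
      write(6,*) 'outside 150-350 K : ', count( t < 150.0 .or. t > 350.0 )
    end program scan_met_field

Fill or corrupted values (such as the ±3e38 values identified later in this thread) show up immediately in the min/max line.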

@yanglibj (Author)

I didn't see anything weird in the met fields for 2021-04-04, so I didn't mention it earlier; maybe I missed something. If you could reprocess the data, I can try the simulation again and see whether that fixes the problem. Also, I have shared my code and rundir on Cannon with your team in a private email. It would be helpful if you could take a look for me. Thank you!

@yanglibj (Author)

By the way, if you prefer to look at a clean code version, you can download clean v12.9.3 code; it should reproduce the same problem.

@yantosca (Contributor)

Thanks @yanglibj. From the error output you sent via direct message, I noticed this:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 
Backtrace for this error:
#0  0x2aaaabfdf3ff in ???
#1  0x1164338 in __ncdf_mod_MOD_nc_read_arr
     at /home/liy/GC/Code.12.9.3/NcdfUtil/ncdf_mod.F90:1156
#2  0xcc2d6c in __hcoio_read_std_mod_MOD_hcoio_read_std
     at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hcoio_read_std_mod.F90:748
#3  0xd7e9ea in __hcoio_dataread_mod_MOD_hcoio_dataread
     at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hcoio_dataread_mod.F90:257
#4  0xc7f844 in readlist_fill
     at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hco_readlist_mod.F90:510
#5  0xc80b1d in __hco_readlist_mod_MOD_readlist_read
     at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hco_readlist_mod.F90:327
#6  0xc5be75 in __hco_driver_mod_MOD_hco_run
     at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hco_driver_mod.F90:163
#7  0x57aba5 in __hcoi_gc_main_mod_MOD_hcoi_gc_run
     at /home/liy/GC/Code.12.9.3/GeosCore/hcoi_gc_main_mod.F90:828
#8  0x78d6d8 in __emissions_mod_MOD_emissions_run
     at /home/liy/GC/Code.12.9.3/GeosCore/emissions_mod.F90:203
#9  0x4ff54c in geos_chem
     at /home/liy/GC/Code.12.9.3/GeosCore/main.F90:990
#10  0x503e20 in main
     at /home/liy/GC/Code.12.9.3/GeosCore/main.F90:32
--------------------------------------------------------------------------
mpiexec noticed that process rank 6 with PID 36403 on node n003 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------

which would imply that you are running GEOS-Chem Classic with MPI (GEOS-Chem Classic is parallelized with OpenMP only), which may be part of the problem. Could you also send or post the run script that you are using?

@yanglibj (Author) commented Oct 18, 2021

Thanks for catching this. This is the error from running ./geos > GC_12.9.3.log, with no MPI:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2aaaac1ec3ff in ???
#1  0x1164338 in __ncdf_mod_MOD_nc_read_arr
	at /home/liy/GC/Code.12.9.3/NcdfUtil/ncdf_mod.F90:1156
#2  0xcc2d6c in __hcoio_read_std_mod_MOD_hcoio_read_std
	at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hcoio_read_std_mod.F90:748
#3  0xd7e9ea in __hcoio_dataread_mod_MOD_hcoio_dataread
	at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hcoio_dataread_mod.F90:257
#4  0xc7f844 in readlist_fill
	at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hco_readlist_mod.F90:510
#5  0xc80b1d in __hco_readlist_mod_MOD_readlist_read
	at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hco_readlist_mod.F90:327
#6  0xc5be75 in __hco_driver_mod_MOD_hco_run
	at /home/liy/GC/Code.12.9.3/HEMCO/src/Core/hco_driver_mod.F90:163
#7  0x57aba5 in __hcoi_gc_main_mod_MOD_hcoi_gc_run
	at /home/liy/GC/Code.12.9.3/GeosCore/hcoi_gc_main_mod.F90:828
#8  0x78d6d8 in __emissions_mod_MOD_emissions_run
	at /home/liy/GC/Code.12.9.3/GeosCore/emissions_mod.F90:203
#9  0x4ff54c in geos_chem
	at /home/liy/GC/Code.12.9.3/GeosCore/main.F90:990
#10  0x503e20 in main
	at /home/liy/GC/Code.12.9.3/GeosCore/main.F90:32
Floating point exception (core dumped)

Have you had a chance to try a run with clean 12.9.3 code? Did you see the same problem when running for 2021-04-04? Thank you!

@yantosca (Contributor)
Hi @yanglibj, I was finally able to confirm that the issue happens with a clean 12.9.3 CH4 run (using 2021 GEOS_FP met) as well as with a 13.2.1 run for the same time period. The run always dies at 18z on 2021-04-04:

In 12.9.3:

---> DATE: 2021/04/04  UTC: 17:50  X-HRS:     89.833336
---> DATE: 2021/04/04  UTC: 18:00  X-HRS:     90.000000
     - Found all A1     met fields for 2021/04/04 18:30
     - Found all A3cld  met fields for 2021/04/04 19:30
     - Found all A3dyn  met fields for 2021/04/04 19:30
     - Found all A3mstC met fields for 2021/04/04 19:30
     - LightNOX extension is off. Skipping FLASH_DENS and CONV_DEPTH fields in FlexGrid_Read_A3mstE.
     - Found all A3mstE met fields for 2021/04/04 19:30
     - Found all I3     met fields for 2021/04/04 21:00
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K,IC):    8   1           NaN
K, IC, Q(K,IC):    8   1           NaN

and again in 13.2.1:

---> DATE: 2021/04/04  UTC: 17:50  X-HRS:     89.833336
---> DATE: 2021/04/04  UTC: 18:00  X-HRS:     90.000000
     - Found all A1     met fields for 2021/04/04 18:30
     - Found all A3cld  met fields for 2021/04/04 19:30
     - Found all A3dyn  met fields for 2021/04/04 19:30
     - Found all A3mstC met fields for 2021/04/04 19:30
     - LightNOX extension is off. Skipping FLASH_DENS and CONV_DEPTH fields in FlexGrid_Read_A3mstE.
     - Found all A3mstE met fields for 2021/04/04 19:30
     - Found all I3     met fields for 2021/04/04 21:00
Infinity in DO_CLOUD_CONVECTION!
K, IC, Q(K,IC):    4   1           NaN        CH4
                       NaN                       NaN                       NaN                       NaN                       NaN                       NaN   300.00000000000000                            NaN

so I would lean towards a bad met field at that date & time. At 18:00z, met fields are being read in for the next 3-hr period.

I've traced it to the GEOSFP.20210404.I3.4x5.nc file. The temperature field contains invalid values spanning roughly -3e38 to +3e38 (near the single-precision limits), which are clearly unphysical.

@YanshunLi-washu, would you be able to reprocess the 2021/04/04 GEOS-FP met data?

@yanglibj (Author)
@yantosca Thank you for confirming the issue. I will try again after @YanshunLi-washu reprocesses the met data.

@YanshunLi-washu (Contributor)
@yantosca Will do!

yantosca self-assigned this on Oct 27, 2021
yantosca added the "topic: Input Data" label on Oct 27, 2021
@YanshunLi-washu (Contributor)
Hi @yanglibj and @yantosca, the 2021/04/04 data has been reprocessed! Hope that will help.

@yanglibj (Author) commented Nov 21, 2021 via email

@yanglibj (Author)
@yantosca @YanshunLi-washu Hi Bob and Yanshun, I am checking in to see whether there is any update or anything I can do from my side. Thank you!

@YanshunLi-washu (Contributor)

Hi @yanglibj, I checked the GEOSFP files for 20210405 at 4x5 resolution; all files are readable by Panoply. I'm not sure whether reprocessing the met data will help. Could you provide us with more specific error messages so that Bob (@yantosca) and I can help further?

@yanglibj (Author) commented Nov 30, 2021

Hi @YanshunLi-washu, thanks for your reply. Bob also ran the simulations and identified the same error, so the error is not related to how I use the code. Instead, it seems to be a general issue when running GEOS-Chem for recent months (April 2021).

See Bob's earlier reply:
"Hi @yanglibj, I was finally able to confirm that the issue happens with a clean 12.9.3 CH4 run (using 2021 GEOS_FP met) as well as with a 13.2.1 run for the same time period. The run always dies at 18z on 2021-04-04:"

That's why I wanted to see whether Bob (@yantosca) has tried your new met data as well.

@YanshunLi-washu (Contributor)
Hi @yanglibj, yes, that's correct. Bob (@yantosca) identified that there were invalid values in the temperature field of the GEOSFP 20210404 I3 file. This has been fixed; see the plot below, where the distribution pattern and data range of T are normal.
[Screenshot: plot of the reprocessed temperature field, Nov 30, 2021]

stale bot commented Dec 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

stale bot added the "stale" label on Dec 30, 2021
@CodeCatLZW

I also encountered this problem. I ran CH4 and tagged CO simulations, both of which stopped on 2021-04-04. These are from my GC.log:
[Screenshots of GC.log output]
I want to know what I should do now.

@CodeCatLZW

I changed the time of the restart file to 2021-04-11 and am now simulating from 2021-04-11. There seems to be no problem after that, although this loses a few days of the simulation.
