
[BUG/ISSUE] GCHP c48 runs on AWS within Docker container die within 1 hour #14

Closed
yantosca opened this issue Dec 19, 2018 · 9 comments

@yantosca
Contributor

I ran a GCHP c48 run on the AWS cloud using:

AMI         : container_geoschem_tutorial_2018121
Machine     : r4.4xlarge
Diagnostics : SpeciesConc_avg and SpeciesConc_inst 

and it died after an hour.

In runConfig.sh:

# Make sure your settings here match the resources you request on your
# cluster in your run script!!!
NUM_NODES=1
NUM_CORES_PER_NODE=12
NY=12
NX=1

# MAPL shared memory option (0: off, 1: on). Keep off unless you know what
# you are doing. Contact GCST for more information if you have memory
# problems you are unable to fix.
USE_SHMEM=0

#------------------------------------------------
#   Internal Cubed Sphere Resolution
#------------------------------------------------
CS_RES=48    # 24 ~ 4x5, 48 ~ 2x2.5, 90 ~ 1x1.25, 180 ~ 1/2 deg, 360 ~ 1/4 deg

...
Start_Time="20160701 000000"
End_Time="20160701 010000"
Duration="00000000 010000"
....
common_freq="010000"
common_dur="010000"
common_mode="'time-averaged'"

The Docker commands were:

docker pull geoschem/gchp_model
docker run --rm -it -v $HOME/ExtData:/ExtData -v $HOME/OutputDir:/OutputDir geoschem/gchp_model
mpirun -np 12 -oversubscribe --allow-run-as-root ./geos | tee gchp.log.c48
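
(For reference, a quick consistency check, assuming the variable names shown in runConfig.sh above: the -np value passed to mpirun should equal NX x NY from runConfig.sh, which is 1 x 12 = 12 here and also matches NUM_NODES x NUM_CORES_PER_NODE, so the resource settings are at least self-consistent.)

# Sketch of a consistency check between runConfig.sh and the mpirun command:
grep -E '^(NUM_NODES|NUM_CORES_PER_NODE|NX|NY)=' runConfig.sh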

Tail end of log file:

 AGCM Date: 2016/07/01  Time: 00:10:00

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 22262d174fea exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I then commented out SpeciesConc_avg from the HISTORY.rc file and re-ran.
Now the only active diagnostic was SpeciesConc_inst. This run also died at the 1-hour mark:

AGCM Date: 2016/07/01  Time: 01:00:00

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
free(): invalid next size (normal)

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0  0x7efd1c1dd2da in ???
#1  0x7efd1c1dc503 in ???
..etc..

Times for GIGCenv
TOTAL                   :       0.726
INITIALIZE              :       0.000
RUN                     :       0.723
...etc...
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 0 on node 22262d174fea exited on signal 6 (Aborted).
--------------------------------------------------------------------------

This message:

free(): invalid next size (normal)

may indicate an out-of-bounds write that corrupted the heap earlier in the run and only surfaced where we deallocate arrays (or fields of the State_* objects).
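
Heap-corruption errors like this usually surface at deallocation time even though the invalid write happened earlier in the run. One way to narrow it down, as a sketch only (it assumes valgrind can be installed inside the geoschem/gchp_model container, e.g. with apt-get on a Debian/Ubuntu base image), is to re-run the same 1-hour segment under valgrind so the bad write is reported where it occurs:

# Hypothetical debugging step: install valgrind inside the running container.
apt-get update && apt-get install -y valgrind

# Re-run the same short segment under valgrind; expect a large slowdown.
mpirun -np 12 -oversubscribe --allow-run-as-root \
  valgrind --error-exitcode=1 --track-origins=yes ./geos | tee gchp.valgrind.log.c48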

@JiaweiZhuang
Contributor

Same issue when running it natively on the AMI?

@JiaweiZhuang
Contributor

Stop the instance, change its type to r5.24xlarge, restart it, and run again. If it still dies, then it is definitely not an inadequate-memory problem...
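
For reference, the resize can also be scripted from the AWS CLI (a sketch only; the instance ID below is a placeholder):

# Hypothetical AWS CLI commands; substitute your own instance ID.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"r5.24xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0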

@yantosca
Contributor Author

So I ran again in the container on r5.24xlarge, and now I get this error:

 AGCM Date: 2016/07/01  Time: 00:10:00
At line 2731 of file /tutorial/gchp_standard/CodeDir/GCHP/ESMF/src/Superstructure/State/src/ESMF_StateAPI.F90
Fortran runtime error: End of record

Error termination. Backtrace:
#0  0x7f657849c2da in ???
#1  0x7f657849cec5 in ???
#2  0x7f657849d68d in ???

... etc...
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28387,1],0]
  Exit code:    2
--------------------------------------------------------------------------

So it would appear to be an issue internal to MAPL. Or I might have run out of disk space. I requested 500 GB though.
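
One quick way to rule the disk-space hypothesis in or out (a sketch; these are run inside the container, where /ExtData and /OutputDir are the volumes mounted in the docker run command above) is:

# Check free space on the volumes holding input data, output, and the run directory.
df -h /ExtData /OutputDir .

# Check how large the diagnostics and checkpoint have grown so far.
du -sh OutputDir gcchem_internal_checkpoint_c48.nc 2>/dev/null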

Also, I had run at c48 in the AMI itself earlier and had similar crashes to those shown above.

@JiaweiZhuang
Contributor

That's a new message, though:

> At line 2731 of file /tutorial/gchp_standard/CodeDir/GCHP/ESMF/src/Superstructure/State/src/ESMF_StateAPI.F90
> Fortran runtime error: End of record

Haven't seen this ever before...

@yantosca
Contributor Author

yantosca commented Dec 20, 2018

This issue seems to have been caused by an out-of-bounds error in the Olson landmap module, as described in geoschem/GCHP#13 (comment)

@JiaweiZhuang
Contributor

Interesting! Why is it not happening at c24? 🤔 Can c48 run on AWS now?

@yantosca
Contributor Author

So what appears to be happening is that the Olson landmap is not getting read in properly. This happens in the code where State_Met%LandTypeFrac is populated from the OLSON pointers imported from ExtData. I'm not sure why, but it may be a MAPL issue. The OLSON data is read in by the custom code in MAPL that reads in the fraction of each grid box covered by a given land type (the "F:int" feature).

So while you can run on the cloud with the quick fix, I would avoid doing that until we understand the root cause of why State_Met%LandTypeFrac is all zero.
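
A related sanity check, as a sketch only (the Olson file name and its path under ExtData are assumptions here, hence the find step), is to confirm that the Olson land map file itself is readable and non-empty before blaming the MAPL read path:

# Hypothetical check: locate the Olson land map file under the mounted ExtData.
find /ExtData -iname '*olson*' -name '*.nc*'

# Then dump its header to confirm the land-type variables are present;
# replace the path below with whatever the find command returns.
ncdump -h /ExtData/path/to/Olson_Land_Map.nc | head -n 40

If the input file looks fine, that points back to the ExtData/MAPL import (the "F:int" read) rather than the data itself.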

@yantosca
Contributor Author

I am closing this thread because the root cause is #15. Fixing #15 will fix this issue.

@msulprizio msulprizio changed the title GCHP c48 runs on AWS within Docker container die within 1 hour [BUG/ISSUE] GCHP c48 runs on AWS within Docker container die within 1 hour Sep 5, 2019