
[BUG/ISSUE] GCHP c48 runs on AWS within Docker container die within 1 hour #14

Closed
yantosca opened this issue Dec 19, 2018 · 9 comments

@yantosca
Contributor

I ran a GCHP c48 run on the AWS cloud using:

AMI         : container_geoschem_tutorial_2018121
Machine     : r4.4xlarge
Diagnostics : SpeciesConc_avg and SpeciesConc_inst 

and it died after an hour.

In runConfig.sh:

# Make sure your settings here match the resources you request on your
# cluster in your run script!!!
NUM_NODES=1
NUM_CORES_PER_NODE=12
NY=12
NX=1

# MAPL shared memory option (0: off, 1: on). Keep off unless you know what
# you are doing. Contact GCST for more information if you have memory
# problems you are unable to fix.
USE_SHMEM=0

#------------------------------------------------
#   Internal Cubed Sphere Resolution
#------------------------------------------------
CS_RES=48    # 24 ~ 4x5, 48 ~ 2x2.5, 90 ~ 1x1.25, 180 ~ 1/2 deg, 360 ~ 1/4 deg

...
Start_Time="20160701 000000"
End_Time="20160701 010000"
Duration="00000000 010000"
....
common_freq="010000"
common_dur="010000"
common_mode="'time-averaged'"

The Docker commands were:

docker pull geoschem/gchp_model
docker run --rm -it -v $HOME/ExtData:/ExtData -v $HOME/OutputDir:/OutputDir geoschem/gchp_model
mpirun -np 12 -oversubscribe --allow-run-as-root ./geos | tee gchp.log.c48
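
(For reference, a quick consistency check, assuming the variable names shown in runConfig.sh above: the -np value passed to mpirun should equal NX x NY from runConfig.sh, which is 1 x 12 = 12 here and also matches NUM_NODES x NUM_CORES_PER_NODE, so the resource settings are at least self-consistent.)

# Sketch of a consistency check between runConfig.sh and the mpirun command:
grep -E '^(NUM_NODES|NUM_CORES_PER_NODE|NX|NY)=' runConfig.sh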

Tail end of log file:

 AGCM Date: 2016/07/01  Time: 00:10:00

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 22262d174fea exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I then commented out SpeciesConc_avg from the HISTORY.rc file and re-ran.
Now the only active diagnostic was SpeciesConc_inst. This run also died at the 1-hour mark:

AGCM Date: 2016/07/01  Time: 01:00:00

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
free(): invalid next size (normal)

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0  0x7efd1c1dd2da in ???
#1  0x7efd1c1dc503 in ???
..etc..

Times for GIGCenv
TOTAL                   :       0.726
INITIALIZE              :       0.000
RUN                     :       0.723
...etc...
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 0 on node 22262d174fea exited on signal 6 (Aborted).
--------------------------------------------------------------------------

This message:

free(): invalid next size (normal)

may indicate an out-of-bounds write that corrupted the heap earlier in the run and only surfaced where we deallocate arrays (or fields of the State_* objects).
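
Heap-corruption errors like this usually surface at deallocation time even though the invalid write happened earlier in the run. One way to narrow it down, as a sketch only (it assumes valgrind can be installed inside the geoschem/gchp_model container, e.g. with apt-get on a Debian/Ubuntu base image), is to re-run the same 1-hour segment under valgrind so the bad write is reported where it occurs:

# Hypothetical debugging step: install valgrind inside the running container.
apt-get update && apt-get install -y valgrind

# Re-run the same short segment under valgrind; expect a large slowdown.
mpirun -np 12 -oversubscribe --allow-run-as-root \
  valgrind --error-exitcode=1 --track-origins=yes ./geos | tee gchp.valgrind.log.c48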

@JiaweiZhuang
Contributor

Same issue when running it natively on the AMI?

@JiaweiZhuang
Contributor

Stop the instance, change its type to r5.24xlarge, restart it, and run again. If it still dies, then it is definitely not an inadequate-memory problem...
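
For reference, the resize can also be scripted from the AWS CLI (a sketch only; the instance ID below is a placeholder):

# Hypothetical AWS CLI commands; substitute your own instance ID.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"r5.24xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0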

@yantosca
Contributor Author

So I ran again in the container on r5.24xlarge, and now I get this error:

 AGCM Date: 2016/07/01  Time: 00:10:00
At line 2731 of file /tutorial/gchp_standard/CodeDir/GCHP/ESMF/src/Superstructure/State/src/ESMF_StateAPI.F90
Fortran runtime error: End of record

Error termination. Backtrace:
#0  0x7f657849c2da in ???
#1  0x7f657849cec5 in ???
#2  0x7f657849d68d in ???

... etc...
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28387,1],0]
  Exit code:    2
--------------------------------------------------------------------------

So it would appear to be an issue internal to MAPL. Or I might have run out of disk space. I requested 500 GB though.
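
One quick way to rule the disk-space hypothesis in or out (a sketch; these are run inside the container, where /ExtData and /OutputDir are the volumes mounted in the docker run command above) is:

# Check free space on the volumes holding input data, output, and the run directory.
df -h /ExtData /OutputDir .

# Check how large the diagnostics and checkpoint have grown so far.
du -sh OutputDir gcchem_internal_checkpoint_c48.nc 2>/dev/null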

Also, I had run at c48 in the AMI itself earlier and had similar crashes to those shown above.

@JiaweiZhuang
Contributor

That's a new message, though:

> At line 2731 of file /tutorial/gchp_standard/CodeDir/GCHP/ESMF/src/Superstructure/State/src/ESMF_StateAPI.F90
> Fortran runtime error: End of record

Haven't seen this ever before...

@yantosca
Contributor Author

yantosca commented Dec 20, 2018

This issue seems to have been caused by an out-of-bounds error in the Olson landmap module, as described in geoschem/GCHP#13 (comment)

@JiaweiZhuang
Contributor

Interesting! Why is it not happening at c24? 🤔 Can c48 run on AWS now?

@yantosca
Contributor Author

So what appears to be happening is that the Olson landmap is not getting read in properly. This happens in the code where State_Met%LandTypeFrac is populated from the OLSON pointers imported from ExtData. I'm not sure why, but it may be a MAPL issue. The OLSON data is read in by the custom code in MAPL that reads in the fraction of each grid box covered by a given land type (the "F:int" feature).

So while you can run on the cloud with the quick fix, I would avoid doing that until we understand the root cause of why State_Met%LandTypeFrac is all zero.
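
A related sanity check, as a sketch only (the Olson file name and its path under ExtData are assumptions here, hence the find step), is to confirm that the Olson land map file itself is readable and non-empty before blaming the MAPL read path:

# Hypothetical check: locate the Olson land map file under the mounted ExtData.
find /ExtData -iname '*olson*' -name '*.nc*'

# Then dump its header to confirm the land-type variables are present;
# replace the path below with whatever the find command returns.
ncdump -h /ExtData/path/to/Olson_Land_Map.nc | head -n 40

If the input file looks fine, that points back to the ExtData/MAPL import (the "F:int" read) rather than the data itself.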

@yantosca
Contributor Author

I am closing this thread because the root cause is #15. Fixing #15 will fix this issue.

@msulprizio msulprizio changed the title GCHP c48 runs on AWS within Docker container die within 1 hour [BUG/ISSUE] GCHP c48 runs on AWS within Docker container die within 1 hour Sep 5, 2019