Commit ce3431b: Update ReadTheDocs page on running GCHP

Signed-off-by: Lizzie Lundgren <elundgren@seas.harvard.edu>
Committed by lizziel on Jun 15, 2022 (1 parent: b6e1b4d)

Showing 1 changed file with 57 additions and 62 deletions: docs/source/user-guide/running.rst
Running GCHP
============


.. note::
   Another useful resource for instructions on running GCHP is our `YouTube tutorial <https://www.youtube.com/watch?v=K6frcfCjpds>`_.


This page presents the basic information needed to run GCHP as well as how to verify a successful run and reuse a run directory.
The GCHP "standard" simulation run directory is configured for a 1-hr simulation at c24 resolution and is a good first test case to check that GCHP runs on your system.
A pre-run checklist is included below for easy reference; please read the rest of this page to understand each step.

Pre-run checklist
-----------------

Prior to running GCHP, always run through the following checklist to ensure everything is set up properly.

1. Start date is set in :file:`cap_restart`.
2. Executable :file:`gchp` is present.
3. All symbolic links are valid (no broken links).
4. Settings are correct in :file:`setCommonRunSettings.sh`.
5. :file:`setRestartLink.sh` runs without error (ensures the restart file is available).
6. If running via a job scheduler, the resource allocations in :file:`setCommonRunSettings.sh` and the run script are consistent (number of nodes and cores).
7. If running interactively, the resource allocation in :file:`setCommonRunSettings.sh` is available locally.
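
Several of these items can be checked quickly from the command line. The following is a sketch covering items 1, 2, 3, and 5, assuming a Linux environment and that you are in the run directory:

.. code-block:: console

   $ cat cap_restart                  # item 1: start date
   $ ls -l gchp                       # item 2: executable is present
   $ find -L . -maxdepth 1 -type l    # item 3: lists broken symbolic links, if any
   $ ./setRestartLink.sh              # item 5: verifies the restart file is available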

How to run GCHP
---------------
There is a symbolic link in the GCHP run directory called :file:`runScriptSamples` that points to a directory of example run scripts.
Each file includes extra commands that make the run process easier and less prone to user error.
These commands include:

1. Define a GCHP log file that includes the start date configured in :file:`cap_restart` in its name
2. Source the environment file symbolic link :file:`gchp.env`
3. Source config file :file:`setCommonRunSettings.sh` to update commonly changed run settings
4. Set the restart file symbolic link :file:`gchp_restart.nc4` to point to the file in the :file:`Restarts` subdirectory that matches the configured start date and grid resolution
5. Check if the run was successful based on the content of :file:`cap_restart` (it should update to the end date if successful)
6. Move the output restart file to the :file:`Restarts` subdirectory
7. Rename the output restart file to include the run start date and grid resolution (format :file:`GEOSChem.Restarts.YYYYMMDD_HHmmSSz.cN.nc4`)
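
For orientation, this sequence might look roughly like the following in a simplified shell script. This is an illustrative sketch only, not the actual template; variable names such as :literal:`nCores` are assumptions, and the post-run restart handling is abbreviated.

.. code-block:: bash

   #!/bin/bash
   # Sketch of the run sequence described above (illustrative only)

   # 1. Name the log after the start date in cap_restart ("YYYYMMDD HHmmSS")
   start=$(sed 's/ /_/' cap_restart)
   log="gchp.${start}z.log"

   # 2-3. Load the build environment and commonly changed run settings
   source gchp.env
   source setCommonRunSettings.sh

   # 4. Point gchp_restart.nc4 at the matching file in Restarts/
   ./setRestartLink.sh

   # Launch GCHP (nCores is an assumed variable holding the total core count)
   mpirun -np ${nCores} ./gchp > "${log}" 2>&1

   # 5-7. On success cap_restart now holds the end date; the output restart
   # file is then moved into Restarts/ and renamed (details omitted here)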

Run interactively
^^^^^^^^^^^^^^^^^

Copy or adapt example run script :file:`gchp.local.run` to run GCHP locally on your machine.
Before running, make sure the total number of cores configured in :file:`setCommonRunSettings.sh` is available locally.
It must be at least 6.
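
On Linux you can check how many cores are available locally with, for example:

.. code-block:: console

   $ nproc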

To run, type the following at the command prompt:

.. code-block:: console

   $ ./gchp.local.run

Standard output will be displayed on your screen in addition to being sent to a log file with filename format :file:`gchp.YYYYMMDD_HHmmSSz.log`. The HEMCO log output is also included in this file.

Run as batch job
^^^^^^^^^^^^^^^^

Batch job run scripts will vary based on what job scheduler you have available.
We offer a template batch job run script in the :file:`runScriptSamples` subdirectory called :file:`gchp.batch_job.sh`. This file contains examples for three types of job scheduler: SLURM, LSF, and PBS.
You may copy and adapt this file for your system and preferences as needed.

At the top of all batch job scripts are configurable run settings.
The most critical are the requested number of cores, number of nodes, time, and memory.
Figuring out the optimal values for your run can take some trial and error.
For a basic six-core c24 standard full-chemistry simulation on one node you should request at least 20 minutes and 32 GB of memory.
The more cores you request, the faster GCHP will run at the same grid resolution.
Configurable job scheduler settings and acceptable formats are often accessible from the command line.
For example, type :command:`man sbatch` to scroll through configurable options for SLURM, including various ways of specifying number of cores, time and memory requested.
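
As an illustration, SLURM resource requests consistent with a one-node, six-core run might look like this in the run script header (values are examples only; adjust for your system):

.. code-block:: bash

   #SBATCH -n 6          # total number of cores
   #SBATCH -N 1          # number of nodes
   #SBATCH -t 0-00:20    # requested wall time (20 minutes)
   #SBATCH --mem=32G     # requested memory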

To submit a batch job using a run script called :file:`gchp.run` and the SLURM job scheduler:

.. code-block:: console

   $ sbatch gchp.run

To submit using Grid Engine instead of SLURM:

.. code-block:: console

   $ qsub gchp.run
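
After submitting, you can check the status of your job with your scheduler's queue command, for example:

.. code-block:: console

   $ squeue -u $USER    # SLURM
   $ qstat -u $USER     # Grid Engine
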
Standard error will be sent to a file specific to your scheduler, e.g. :file:`slurm-jobid.out` if using SLURM, unless you configure your run script to do otherwise.

If your computational cluster uses a different job scheduler, check with your IT staff or search the internet for how to configure and submit batch jobs on your system.

Verify a successful run
-----------------------

Standard output and standard error will be sent to a file specific to your scheduler, e.g. :file:`slurm-jobid.out`, unless you configured your run script to send them to a different log file.
Variable :literal:`log` is defined in the template run script as :file:`gchp.YYYYMMDD_HHmmSSz.log` if you wish to use it.
The date string in the log filename is the start date of your simulation as configured in :file:`cap_restart`.
This log is automatically used if you execute the interactive run script example :file:`gchp.local.run`.

There are several ways to verify that your run was successful. Here are just a few:

1. The end of the GCHP log file contains timing information, and the log includes informational output every timestep (search for :literal:`AGCM`).
2. NetCDF files are present in the :file:`OutputDir/` subdirectory.
3. There is a restart file corresponding to your end date in the :file:`Restarts` subdirectory.
4. The start date in :file:`cap_restart` has been updated to your run end date.
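
A few of these checks can be done from the command line, as sketched below; the log filename is illustrative and will match your configured start date.

.. code-block:: console

   $ grep AGCM gchp.20190701_000000z.log | tail    # timestep and timing info
   $ ls OutputDir/                                 # diagnostic netCDF files
   $ ls Restarts/                                  # restart file for end date
   $ cat cap_restart                               # should show run end date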

If it looks like something went wrong, scan through the log files to determine where there may have been an error. Here are a few debugging tips:

* Review all of your configuration files to ensure you have proper setup, especially :file:`setCommonRunSettings.sh`.
* :literal:`MAPL_Cap` errors in the run log typically indicate an error with your start time and/or duration. Check :file:`cap_restart` and :file:`setCommonRunSettings.sh`.
* :literal:`MAPL_ExtData` errors in the run log indicate an error with your input files. Check :file:`HEMCO_Config.rc` and :file:`ExtData.rc`.
* :literal:`MAPL_HistoryGridComp` errors in the run log are related to your configured diagnostics. Check :file:`HISTORY.rc`.
* Change the warnings and verbose options in :file:`HEMCO_Config.rc` to 3 and rerun.
* Change the :literal:`root_level` setting for :literal:`CAP.ExtData` in :file:`logging.yml` to :literal:`DEBUG` and rerun.
* Recompile the model with cmake option :literal:`-DCMAKE_BUILD_TYPE=Debug` and rerun.
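
For the last tip, a rebuild with debugging flags might look like the following, assuming you rebuild from your existing build directory (paths are illustrative):

.. code-block:: console

   $ cd build
   $ cmake . -DCMAKE_BUILD_TYPE=Debug
   $ make -j
   $ make install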

If you cannot figure out where the problem is, please create a GCHP GitHub issue.

Reuse a run directory
---------------------
Archiving runs is useful for other reasons as well, including:

* Save all settings and logs for later reference after a run crashes
* Generate data from the same executable using different run-time settings for comparison, e.g. c48 versus c180
* Run short runs in quick succession for debugging comparisons

To archive a run, pass the archive script a descriptive subdirectory name where data will be archived. For example:

.. code-block:: console

   $ ./archiveRun.sh 1mo_c24_24hrdiag

All files except restart files are archived to subfolders in the new directory.
Which files are copied, and to where, is displayed on the screen.
Diagnostic files in the :file:`OutputDir/` directory are moved rather than copied so as not to duplicate large files.
You will be prompted at the command line to accept this change prior to the data move.

Clean a run directory
^^^^^^^^^^^^^^^^^^^^^

It is a good practice to clean your run directory prior to your next run if starting on the same date.
This avoids confusion about what output was generated when and with what settings.

To make run directory cleaning simple, we provide the utility shell script :file:`cleanRunDir.sh`. To clean the run directory, simply execute this script.

.. code-block:: console

   $ ./cleanRunDir.sh

All GCHP diagnostic and log files, including diagnostic files in :file:`OutputDir/`, will then be deleted.
Restart files in the :file:`Restarts` subdirectory will not be deleted.
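
For example, after cleaning you can verify that diagnostics were removed while restart files remain:

.. code-block:: console

   $ ls OutputDir/    # diagnostic files should be gone
   $ ls Restarts/     # restart files are preserved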


Recommended MPI configuration
-----------------------------
IntelMPI
^^^^^^^^
OpenMPI
^^^^^^^

At high-core counts (e.g., > ~1000 cores) it's recommended to set :literal:`WRITE_RESTART_BY_OSERVER: YES` in :file:`GCHP.rc`. This is set automatically when executing :file:`setCommonRunSettings.sh`.
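
For reference, the relevant line in :file:`GCHP.rc` is:

.. code-block:: none

   WRITE_RESTART_BY_OSERVER: YES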
