Update and streamline the reference workflow
JiaweiZhuang committed Dec 17, 2018
1 parent 2b8bc66 commit 2e18f9f
Showing 1 changed file with 32 additions and 97 deletions.

doc/source/chapter02_beginner-tutorial/research-workflow.rst
A reference workflow
====================

Here's the outline of a typical research workflow:

1. Launch EC2 instances from a pre-configured AMI. Consider spot instances for big computing. :doc:`Consider AWSCLI to simplify the above process to one shell command <../chapter03_advanced-tutorial/advanced-awscli>`
2. Tweak model configurations as needed.
3. Pull necessary input data from S3 to EC2. Put commonly used ``aws s3 cp`` commands into bash scripts.
4. Run simulations :ref:`with tmux <keep-running-label>`. Log out and go to sleep if the model runs for a long time. Re-login at any time to check progress.
5. Use Python/Jupyter to analyze output data.
6. When the EC2 instance is not needed anymore, transfer output data and customized model configurations (mostly run directories) to S3, or download them to local machines if necessary (recall that the :ref:`egress charge <data-egress-label>` is $90/TB; for several GBs the cost is negligible).

Talk is cheap. Let's actually walk through these steps.

Below are reproducible steps (copy & paste-able commands) to set up a custom model run. We use a one-month, 2x2.5 simulation as an example, but the same idea applies to other types of runs. **Most laborious steps only need to be done once**. The subsequent workflow will be much simpler.

I assume you've read all previous sections. Don't worry if you can't remember everything -- there will be links to previous sections whenever necessary.

The advanced tutorial will show you how to :doc:`use AWSCLI to simplify the above process to one shell command <../chapter03_advanced-tutorial/advanced-awscli>`.
Set up your own model configuration
-----------------------------------

Log into the instance :ref:`as in the quick start guide <login_ec2-label>`. Here you will set up your own model configuration, instead of using the pre-configured tutorial run directory. The system will still work with future releases of GEOS-Chem, unless there are big structural changes that break the compile process.

Existing GEOS-Chem users should feel quite familiar with the steps presented here. New users might need to refer to our `user guide <http://acmg.seas.harvard.edu/geos/doc/man/>`_ for a more complete explanation.

You can obtain the latest copy of the code from `GEOS-Chem's GitHub repo <https://github.com/geoschem/geos-chem>`_::

$ mkdir ~/GC # make your own folder instead of using the "tutorial" folder.
$ cd ~/GC
$ git clone https://github.com/geoschem/geos-chem Code.GC
$ git clone https://github.com/geoschem/geos-chem-unittest.git UT

You may list all versions (they are just `git tags <https://git-scm.com/book/en/v2/Git-Basics-Tagging>`_) in reverse chronological order::

$ cd Code.GC
$ git log --tags --simplify-by-decoration --pretty="format:%ci %d"
2018-12-11 08:48:25 -0500 (HEAD -> master, tag: 12.1.1, origin/master, origin/HEAD)
2018-11-21 09:07:51 -0500 (tag: 12.1.0, origin/HEMCO)
2018-10-16 16:52:11 -0400
2018-10-16 11:25:42 -0400 (tag: 12.0.3)
...

**New users had better just use the default, latest version to minimize confusion**. Experienced users might want to check out a specific version, say ``12.1.0``::

$ git checkout 12.1.0 # just the name of the tag
$ git branch
* (HEAD detached at 12.1.0)
$ git checkout master # restore the latest version if you want

You need to check out the same model version for both the source code and the unit tester.
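For example, to pin both repositories to the same tag (a sketch using the ``12.1.0`` tag from above)::

$ cd ~/GC/Code.GC && git checkout 12.1.0
$ cd ~/UT && git checkout 12.1.0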
In ``UT/perl/CopyRunDirs.input``, change the run directory copy path to::
...
COPY_PATH : {HOME}/GC

Then uncomment the run directory you want, say the global 2x2.5 standard simulation::

geosfp 2x25 - standard 2016070100 2016080100 -

In ``UT/perl/Makefile``, make sure the source code path is correct::

CODE_DIR :=$(HOME)/GC/Code.GC

Finally, generate the run directory::
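
$ cd ~/UT/perl
$ ./gcCopyRunDirs   # stock unit-tester script; a sketch -- the helper name may differ in other UT versions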

Go to the run directory and compile::
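
$ cd ~/GC/geosfp_2x25_standard
$ make -j4 mpbuild   # builds the OpenMP executable geos.mp; the exact make target may vary across versions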

Note that you should almost always execute the ``make`` command **in the run directory**. This will ensure the correct combination of compile flags for this specific run configuration. GEOS-Chem's compile flags have become so complicated that you will almost never get the right compile settings by compiling in the source code directory. See `our wiki <http://wiki.seas.harvard.edu/geos-chem/index.php/GEOS-Chem_Makefile_Structure#Compiling_in_a_run_directory>`_ for more information.

Tweak run-time configurations as needed
---------------------------------------

For example, in ``input.geos``, check that the simulation length is one month::

Start YYYYMMDD, hhmmss : 20160701 000000
End YYYYMMDD, hhmmss : 20160801 000000

You might also want to tweak ``HEMCO_Config.rc`` to select emission inventories, and ``HISTORY.rc`` to select output fields.
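For example, to save only species concentrations as hourly instantaneous fields, the relevant ``HISTORY.rc`` entries would look like the sketch below (collection names and defaults vary between model versions)::

COLLECTIONS: 'SpeciesConc',

SpeciesConc.frequency:  00000000 010000
SpeciesConc.duration:   00000000 010000
SpeciesConc.mode:       'instantaneous'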

Get more input data from S3
---------------------------

If you just run the executable ``./geos.mp``, it will probably complain about missing input data. Remember that the default ``~/ExtData`` folder only contains sample data for a demo 4x5 simulation; other data need to be retrieved from S3 using AWSCLI commands (:doc:`see here to review S3 usage <use-s3>`). In order to use AWSCLI on EC2, you need to either :ref:`configure credentials (beginner approach) <credentials-label>` or :doc:`configure IAM role (advanced approach) <../chapter03_advanced-tutorial/iam-role>`.
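With the beginner approach, this is a one-time setup; you will be prompted for the Access Key ID and Secret Access Key created earlier::

$ aws configure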

Try ``aws s3 ls`` to make sure AWSCLI is working. Then retrieve data by::

# GEOSFP 2x2.5 CN (constant) metfields
# (2011/01 path assumed from the gcgrid bucket layout)
aws s3 cp --request-payer=requester --recursive \
s3://gcgrid/GEOS_2x2.5/GEOS_FP/2011/01/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2011/01/

# GEOSFP 2x2.5 metfields for the simulated month (July 2016)
aws s3 cp --request-payer=requester --recursive \
s3://gcgrid/GEOS_2x2.5/GEOS_FP/2016/07/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2016/07/

# 2x2.5 restart file
aws s3 cp --request-payer=requester \
s3://gcgrid/GEOSCHEM_RESTARTS/v2018-11/initial_GEOSChem_rst.2x25_standard.nc ~/ExtData/GEOSCHEM_RESTARTS/v2018-11/
# fix the softlink in run directory
ln -s ~/ExtData/GEOSCHEM_RESTARTS/v2018-11/initial_GEOSChem_rst.2x25_standard.nc ~/GC/geosfp_2x25_standard/GEOSChem_restart.201607010000.nc
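
.. note::
    If you run the simulation over more months, remember to pull the corresponding metfields first. For example, metfields for the entire year can be retrieved by ``aws s3 cp --request-payer=requester --recursive s3://gcgrid/GEOS_2x2.5/GEOS_FP/2016/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2016/``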

Now the model should run without problems.

Perform long-term simulation
----------------------------

Such a long simulation can take about a day. :ref:`With tmux <keep-running-label>`, you can keep the program running after logging out.

::

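# a sketch of the tmux workflow; the log file name matches the text below
$ tmux
$ ./geos.mp | tee run.log
# press "Ctrl + b", then "d", to detach from the session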

Log out of the server (``Ctrl + d`` or just close the terminal). The model will be safely running in the background. You can re-login at any time and check the progress by looking at ``run.log``. If you need to cancel the simulation, type ``tmux a`` to resume the interactive session and then ``Ctrl + c`` to kill the program.
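To glance at progress without re-attaching to tmux, the following works (assuming you used ``tee run.log`` as above)::

$ tail -f run.log   # follow the log from the run directory; Ctrl + c stops tailing, not the model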


.. note::
    What if the model finishes at midnight? Is there a way to automatically terminate the instance so you stop incurring charges? I tried multiple auto-checking methods, but they often bring more trouble than benefit. For example, :ref:`the HPC cluster solution <hpc-overview-label>` will handle server termination for you, but that often makes the workflow more complicated, especially if you are not a heavy user. Manually examining the simulation on the next day is usually the easiest way. The cost of EC2 piles up for simulations that last many days, but for just one night it is negligible.

Analyze output data
-------------------

Output data will be generated during the simulation as specified by ``HISTORY.rc``. You can :ref:`use Jupyter notebooks <jupyter-label>` to analyze them, or simply ``ipython`` for a quick check.
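One handy trick for multi-file time series (a sketch, assuming output lands in ``OutputDir/`` as set by ``EXPID`` in your ``HISTORY.rc``) is to open them as a single object with ``xarray.open_mfdataset()``::

import xarray as xr

# open all SpeciesConc output files as one dataset, concatenated along time
ds = xr.open_mfdataset("OutputDir/GEOSChem.SpeciesConc.*.nc4")
print(ds)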

Save your files to S3
---------------------
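A minimal sketch (``s3://my-research-bucket`` is a hypothetical placeholder for your own bucket)::

# transfer the customized run directory, including output data, to your own bucket
aws s3 cp --recursive ~/GC/geosfp_2x25_standard s3://my-research-bucket/geosfp_2x25_standard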
