Skip to content

Commit

Permalink
Added instructions for Caching Input Data on Fast Drives to RTD.
Browse files Browse the repository at this point in the history
  • Loading branch information
LiamBindle committed Apr 13, 2022
1 parent b8fa2b9 commit 340be48
Show file tree
Hide file tree
Showing 2 changed files with 153 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,7 @@ with Spack, as well as how to use Spack to install GCHP's dependencies if needed
supplement/job-script-examples.rst
supplement/building-gchp-dependencies.rst
supplement/setting-up-aws-parallelcluster.rst
supplement/caching-input-data.rst
supplement/containers.rst
supplement/plotting-output.rst
stretched-grid.rst
Expand Down
152 changes: 152 additions & 0 deletions docs/source/supplement/caching-input-data.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
.. _caching_input_data:

Caching Input Data on Fast Drives
=================================

This page describes how to set up a cache of GEOS-Chem input data.
This is useful if you want to temporarily transfer a simulation's input data to a performant hard drive.
This can improve the speed of your GCHP simulation by reducing the time spent reading input data.
Caching input data is also useful if the file system that stores your GEOS-Chem input data repository has issues that are causing simulations to crash (i.e., you can transfer the data
for your simulation to more stable hard drives).


Install the bashdatacatalog
---------------------------

Install the bashdatacatalog with the following command. Follow the prompts and restart your console.

.. code-block:: console
gcuser:~$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)
.. note:: You can rerun this command to upgrade to the latest version.

Set Up the ExtDataCache Directory
---------------------------------

Next, we are going to set up the :file:`ExtDataCache` directory.
You should put this directory in the appropriate path so that desired hard drives are used.
For example, if you have performance hard drives at :file:`/scratch/`, create a directory like :file:`/scratch/ExtDataCache/`.
We are going to use :file:`ExtDataCache/` to temporarily store the input data for simulations.

In the future, the idea is that you will copy the prerequisite input data to :file:`ExtDataCache/` before you run a simulation.
Since :file:`ExtDataCache/` is temporary data, you can delete it periodically to "purge" it.
Alternatively, you can use bashdatacatalog commands to selectively remove files.
If you are running long simulations, you can keep a few years of data in :file:`ExtDataCache/`, sort of like a moving window tracking the progress of your simulation.

Create a subdirectory in :file:`ExtDataCache/` to store catalog files.
You need a set of four catalog files for each simulation:

* MeteorologicalInputs.csv -- Specifies the simulation's meteorological input data
* ChemistryInputs.csv -- Specifies the simulation's chemistry input data
* EmissionsInputs.csv -- Specifies the simulation's emissions input data
* InitialConditions.csv -- Specifies the default restart files for the simulation

A good directory structure for catalog files is :file:`ExtDataCache/CatalogFiles/SIMULATION_ID` where :literal:`SIMULATION_ID` is a placeholder for a unique identifier for your simulation.
These instructions will put a demo set of catalog files in :file:`ExtDataCache/CatalogFiles/DemoSimulation`:

.. code-block:: console
gcuser:~$ cd /scratch
gcuser:/scratch$ mkdir ExtDataCache # for storing input data for simulations
gcuser:/scratch$ mkdir ExtDataCache/CatalogFiles # for storing catalog files
gcuser:/scratch$ mkdir ExtDataCache/CatalogFiles/DemoSimulation # for storing catalog files for a specific simulation
Next, download the catalog files for the appropriate version of GEOS-Chem. You can find the GEOS-Chem catalog files `here <http://geoschemdata.wustl.edu/ExtData/DataCatalogs>`_.

.. code-block:: console
gcuser:/scratch$ cd ExtDataCache/CatalogFiles/DemoSimulation
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/MeteorologicalInputs.csv
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/ChemistryInputs.csv
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/EmissionsInputs.csv
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/InitialConditions.csv
Edit the catalog files according to your simulation configuration. You can enable/disable data collections by editing column 3 (:literal:`1` to enable a collection, :literal:`0` to disable a collection).
If you are not sure if your simulation needs a collection, it is better to err on the side of inclusion.
The meteorological data collections are the largest by volume.
Only one meteorological data collection in :file:`MeteorologicalInputs.csv` needs to be enabled.

Update the Collection URLs
--------------------------

The default collection URLs in the catalog files point to http://geoschemdata.wustl.edu/ExtData.
To copy data from your primary ExtData repository, edit column 2 of the catalog files.
For example, if your primary ExtData repository is at :file:`/storage/ExtData` you would replace :literal:`http://geoschemdata.wustl.edu/ExtData` with :literal:`file:///storage/ExtData`
in column 2 of the catalog files.
Below is a :command:`sed` command that will do the replacement.

.. code-block:: console
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ export FIND_STR="http://geoschemdata.wustl.edu/ExtData"
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ export REPLACE_STR="file:///storage/ExtData" # replace '/storage/ExtData' with the path to your ExtData
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ sed -i "s#${FIND_STR}#${REPLACE_STR}#g" *.csv # do url find/replace
Copy Data to ExtDataCache
-------------------------

Navigate to :file:`ExtDataCache/`.
One you are there, run :command:`bashdatacatalog-fetch` to fetch metadata from ExtData.
The arguments to :command:`bashdatacatalog-fetch` are catalog files.
This metadata includes the file list for each data collection, and the details to classify each file as a temporal or static file.

.. code-block:: console
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ cd ../..
gcuser:/scratch/ExtDataCache$ bashdatacatalog-fetch CatalogFiles/DemoSimulation/*.csv
Now you can run :command:`bashdatacatalog-list` commands to generate file lists.
The output of :command:`bashdatacatalog-list` is controlled using flags.
For example, add the :literal:`-s` to list "static" files (input files that are always required regardless of the simulation period).
You can list "temporal" files with the :literal:`-t` flag.
You can filter temporal files according to a date range with the :literal:`-r START,END` argument.
You can filter out files that exist using the :literal:`-m` flag (lists files that are missing).
You can specify different file list formats using the `-f FORMAT` argument.
Below is a command that lists all the files in ExtDataCache that are missing for a simulation starting on 2017-01-01 and ending on 2017-12-31.

.. code-block:: console
gcuser:/scratch/ExtDataCache$ bashdatacatalog-list -stm -r 2016-12-31,2018-01-01 CatalogFiles/DemoSimulation/*.csv
.. note::
You need to subtract/add one day to the period of your simulation.
The example above uses :literal:`-r 2016-12-31,2018-01-01` because the simulation period is 2017-01-01 to 2017-12-31.

To copy the missing files to ExtDataCache, you can use the argument :literal:`-f xargs-curl` to specify the output list should be formatted as input to :literal:`xargs curl`.
You can use a command similar to the one below to copy all the missing files for your simulation to ExtDataCache.


.. code-block:: console
gcuser:/scratch/ExtDataCache$ bashdatacatalog-list -stm -r 2016-12-31,2018-01-01 -f xargs-curl CatalogFiles/DemoSimulation/*.csv | xargs -P 4 curl
.. note::
The :literal:`-P 4` argument to :command:`xargs` allows for 4 parallel copies at a time.

Update Run Directory to use ExtDataCache
----------------------------------------

To update a run directory to use ExtDataCache, you can run the following commands.
Make sure to set :literal:`FIND_PATH` to ExtData and :literal:`REPLACE_PATH` to ExtDataCache.

.. code-block:: console
gcuser:/scratch/ExtDataCache$ cd /MyRunDirectory # cd to your run directory
gcuser:/MyRunDirectory$ export FIND_PATH=/storage/ExtData # replace path to your primary ExtData
gcuser:/MyRunDirectory$ export REPLACE_PATH=/scratch/ExtDataCache # replace with the path to your ExtDataCache
gcuser:/MyRunDirectory$ function swap_extdata_link { ln -sfn $(readlink $1 | sed "s#${FIND_PATH}/*#${REPLACE_PATH}/#") $1; }
gcuser:/MyRunDirectory$ swap_extdata_link ChemDir
gcuser:/MyRunDirectory$ swap_extdata_link HcoDir
gcuser:/MyRunDirectory$ swap_extdata_link MetDir
gcuser:/MyRunDirectory$ sed -i "s#${FIND_PATH}#${REPLACE_PATH}#g" HEMCO_Config.rc input.geos
Now your GCHP simulation will use input data from ExtDataCache.

See Also
--------

* `bashdatacatalog - Instructions for GEOS-Chem Users <https://github.com/LiamBindle/bashdatacatalog/wiki/Instructions-for-GEOS-Chem-Users>`_
* `bashdatacatalog - List of useful commands <https://github.com/LiamBindle/bashdatacatalog/wiki/3.-Useful-Commands>`_
* `GEOS-Chem Input Data Catalogs <http://geoschemdata.wustl.edu/ExtData/DataCatalogs/>`_

0 comments on commit 340be48

Please sign in to comment.