This repository accompanies the paper The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction (Martin Gauch, Juliane Mai, and Jimmy Lin, Environmental Modelling & Software, Volume 135, 2021).
(open-access version on arXiv, originally posted November 2019)
It is built upon the code from https://github.com/kratzert/ealstm_regional_modeling.
- main_gridEvaluation.py: Main Python file used to generate bash scripts (more precisely, SLURM submission scripts) that train and evaluate XGBoost and EA-LSTM models on different amounts of training data.
- main.py: Main Python file used for training and evaluating EA-LSTM models.
- main_xgboost.py: Main Python file used for training and evaluating XGBoost models (including random parameter search).
- data/: Contains the list of basins (USGS gauge ids) considered in our study, and shapefiles of the continental US (source), used for plotting.
- papercode/: Contains the entire code (besides the main*.py files in the root directory).
- notebooks/: Contains the notebook that guides through the results of our study (as well as the notebooks from the original Kratzert et al. paper).
  - notebooks/performance_gridEvaluation.ipynb: This notebook evaluates and compares the results of XGBoost and EA-LSTMs trained on different amounts of training data.
  - notebooks/performance.ipynb: This notebook evaluates and compares the LSTM architectures from Kratzert et al.
  - notebooks/ranking.ipynb: This notebook evaluates feature rankings and model robustness of the LSTM architectures from Kratzert et al.
  - notebooks/embedding.ipynb: This notebook analyzes the LSTM catchment embeddings from Kratzert et al.
Download this repository as a zip file, or clone it to your local file system by running
git clone git@github.com:gauchm/ealstm_regional_modeling.git
Within this repository, we provide two environment files (environment_cpu.yml and environment_gpu.yml) that can be used with Anaconda or Miniconda to create an environment with all needed packages.
Simply run
conda env create -f environment_cpu.yml
for the cpu-only version. Or run
conda env create -f environment_gpu.yml
if you have a CUDA-capable NVIDIA GPU. This is recommended if you want to train/evaluate the LSTM on your machine, but not strictly necessary.
In addition, you will have to install XGBoost from source (the current version on conda, 0.90, has a bug that affects training with a custom objective):
conda activate ealstm
git clone https://github.com/dmlc/xgboost --recursive
cd xgboost
git checkout 96cd7ec2bbdec1addf81b1ca2adb13c9155e32f3 # this is the version we used in our study
git submodule update --init --recursive # sync the submodules to the checked-out commit
mkdir build; cd build
cmake ..
make -j4
cd ../python-package
python setup.py install
First, you need the CAMELS data set to run any of our code. This data set can be downloaded for free here:
- CAMELS: Catchment Attributes and Meteorology for Large-sample Studies - Dataset Downloads
Make sure to download the CAMELS time series meteorology, observed flow, meta data (.zip) file, as well as the CAMELS Attributes (.zip) file. Extract the data set on your file system and make sure to put the attribute folder (camels_attributes_v2.0) inside the CAMELS main directory.
However, we trained our models with an updated version of the Maurer forcing data that is not yet officially published (CAMELS data set will be updated soon). The updated Maurer forcings contain daily minimum and maximum temperature, while the original Maurer data included in the CAMELS data set only include daily mean temperature. You can find the updated forcings temporarily here:
Download and extract the updated forcing into the basin_mean_forcing folder of the CAMELS data set and do not rename it (name should be maurer_extended). The resulting folder structure should look something like this:
|-CAMELS
|---camels_attributes_v2.0
|---basin_dataset_public_v1p2
|-----basin_mean_forcing
|-------maurer_extended
|---------01
...
|-----usgs_streamflow
|-------01
...
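Before starting a training run, it can help to confirm that the folders are where the scripts expect them. The following is a minimal sketch that only checks the sub-folders named in the structure above; the names `EXPECTED_SUBDIRS` and `missing_camels_dirs` are hypothetical helpers and are not part of this repository.

```python
from pathlib import Path

# Sub-folders taken from the folder structure above, relative to the CAMELS root.
EXPECTED_SUBDIRS = [
    "camels_attributes_v2.0",
    "basin_dataset_public_v1p2/basin_mean_forcing/maurer_extended",
    "basin_dataset_public_v1p2/usgs_streamflow",
]


def missing_camels_dirs(camels_root):
    """Return the expected sub-folders that do not exist under camels_root."""
    root = Path(camels_root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]
```

Calling `missing_camels_dirs("/path/to/CAMELS")` returns an empty list when all three folders are present, and otherwise lists the ones that still need to be extracted or moved.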
The pre-trained XGBoost and EA-LSTM models, the predictions, the SLURM submission scripts, and the traditional hydrologic benchmark models for our study are available for download at the following links:
To download pre-trained models and simulations from the Kratzert et al. paper, use the following link:
For training or evaluating any of the LSTM models a CUDA-capable NVIDIA GPU is recommended but not strictly necessary. Since we only train/use LSTM-based models, a strong multi-core CPU will work as well.
Before starting, make sure you have activated the conda environment.
conda activate ealstm
To change the input sequence length, change the value of GLOBAL_SETTINGS["seq_length"] in main.py/main_xgboost.py.
To train an LSTM model, run the following line of code from the terminal
python main.py train --camels_root /path/to/CAMELS
This would train a single EA-LSTM model with a randomly generated seed using the basin average NSE as loss function and store the results under runs/. Additionally, the following options can be passed:
- --seed NUMBER: Train a model using a fixed random seed.
- --cache_data True: Load the entire training data into memory. This will speed up training but requires approximately 50GB of RAM.
- --num_workers NUMBER: Defines the number of parallel threads that will load and preprocess inputs.
- --train_start: This date (formatted as ddmmyyyy) will be used as the training start date.
- --train_end: This date (formatted as ddmmyyyy) will be used as the training end date.
- --no_static True: If passed, will train a standard LSTM without static features. If this is not desired, don't pass False but instead remove the argument entirely.
- --concat_static True: If passed, will train a standard LSTM where the catchment attributes are concatenated at each time step to the meteorological inputs. If this is not desired, don't pass False but instead remove the argument entirely.
- --use_mse True: If passed, will train the model using the mean squared error as loss function. If this is not desired, don't pass False but instead remove the argument entirely.
- --run_dir_base: If passed, will store training data and results in a subfolder of this folder. Default is runs/.
- --run_name: If passed, will store training data and results in {run_dir_base}/{run_name}. By default, a name is generated based on the current date and time.
- --basins: If passed, will only use these basins during training (evaluation automatically only uses the basins that were used in training). Pass multiple basins separated by spaces. Default is all 531 basins.
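The ddmmyyyy date format used by the training date options is easy to get wrong (for example by swapping day and month). The sketch below shows how such a string maps to a calendar date using Python's standard library. It illustrates the format only; how main.py parses these arguments internally is not shown here, and the helper names are hypothetical.

```python
from datetime import datetime

# ddmmyyyy: two-digit day, two-digit month, four-digit year, no separators.
DATE_FORMAT = "%d%m%Y"


def to_cli_date(date):
    """Format a datetime as a ddmmyyyy string."""
    return date.strftime(DATE_FORMAT)


def parse_cli_date(text):
    """Parse a ddmmyyyy string, rejecting anything that is not exactly 8 digits."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected a date formatted as ddmmyyyy, got {text!r}")
    return datetime.strptime(text, DATE_FORMAT)
```

For instance, `parse_cli_date("01101999")` is 1 October 1999, and `to_cli_date(datetime(2008, 9, 30))` gives `"30092008"`.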
To train an XGBoost model, run the following line of code from the terminal
python main_xgboost.py train --camels_root /path/to/CAMELS
This would train a single XGBoost model with a randomly generated seed using NSE as objective and store the results under runs/. Additionally, the following options can be passed:
- --seed NUMBER: Train a model using a fixed random seed.
- --num_workers NUMBER: Defines the number of parallel threads that will load and preprocess inputs.
- --train_start: This date (formatted as ddmmyyyy) will be used as the training start date.
- --train_end: This date (formatted as ddmmyyyy) will be used as the training end date.
- --no_static True: If passed, will train a model without static features. If this is not desired, don't pass False but instead remove the argument entirely.
- --use_mse: If passed, will train the model using MSE as objective. If this is not desired, remove the argument entirely.
- --model_dir: If passed, will train an XGBoost model using the model parameters from the run in this folder (pass the directory that contains the model.pkl file). If not passed, training will include a random search for suitable parameters.
- --run_dir_base: If passed, will store training data and results in a subfolder of this folder. Default is runs/.
- --run_name: If passed, will store training data and results in {run_dir_base}/{run_name}. By default, a name is generated based on the current date and time.
- --basins: If passed, will only use these basins during training (evaluation automatically only uses the basins that were used in training). Pass multiple basins separated by spaces. Default is all 531 basins.
Once training is finished, you can use the models to run inference and generate predictions for the test period. This will calculate the discharge simulation for the validation period and store the results alongside the observed discharge for all basins that were used during training in a pickle file. The pickle file is stored in the main directory of the model run.
After inference, you can run the notebook in notebooks/performance_gridEvaluation.ipynb to evaluate the accuracy of the predictions.
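As a point of reference for what "accuracy" can mean here, the Nash-Sutcliffe efficiency (NSE), named earlier as the default EA-LSTM training loss, compares the squared prediction error to the variance of the observations. The following is a minimal sketch of the standard NSE formula and its average over basins. The function names and the input layout (a dict mapping a basin id to an observed/simulated pair) are assumptions for illustration; the notebook and the training loss in this repository may compute their metrics differently.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    numerator = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    denominator = sum((o - mean_obs) ** 2 for o in observed)
    if denominator == 0:
        raise ValueError("NSE is undefined when the observations are constant")
    return 1 - numerator / denominator


def basin_average_nse(results):
    """Average the NSE over basins; results maps basin id -> (observed, simulated)."""
    scores = [nse(obs, sim) for obs, sim in results.values()]
    return sum(scores) / len(scores)
```

An NSE of 1 means a perfect match, while a model that always predicts the mean of the observations scores 0.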
To generate predictions with an LSTM model, run the following line of code from the terminal.
python main.py evaluate --camels_root /path/to/CAMELS --run_dir path/to/model_run
To generate predictions with an XGBoost model, run the following line of code from the terminal.
python main_xgboost.py evaluate --camels_root /path/to/CAMELS --run_dir path/to/model_run
To create SLURM submission scripts that automatically train and evaluate XGBoost and EA-LSTM models on varying amounts of training data, run the following line of code from the terminal.
python main_gridEvaluation.py --camels_root /path/to/CAMELS
Additionally, the following options can be passed:
- --num_workers_ealstm: Use this option to determine the number of workers used for EA-LSTM training. Default is 12.
- --num_workers_xgb: Use this option to determine the number of workers used for XGBoost training. Default is 20.
- --use_mse: Provide this option if you want to use MSE as objective and loss function in XGBoost and EA-LSTM training.
- --user: Use this option to set the email address that SLURM job failure notifications will be sent to.
- --use_params: Use this option to reuse XGBoost parameters from the model in the specified directory, rather than performing a parameter search.
The script will generate SLURM submission scripts in a folder run_grid_ddmm_hhmm/, which you can either execute as normal bash scripts (note that you will need to make them executable through chmod +x path/to/script.sbatch) or submit to a SLURM scheduler via sbatch path/to/script.sbatch (note that you will need to adapt the account submission parameter).
The following scripts will be generated:
- There will be one EA-LSTM training script (running main.py) for each combination of basins, training years, and seed (run_ealstm_{train_start}_{train_end}_basinsample{number_of_basins}_{id_of_basinsample}_seed{seed}.sbatch).
- For XGBoost, there will be two types of scripts (both running main_xgboost.py):
  - One script to find suitable parameters in a random search (run_xgb_param_search_{train_start}_{train_end}_basinsample{number_of_basins}_{id_of_basinsample}_seed111.sbatch).
  - Additionally, there will be one XGBoost training script for each combination of basins, training years, and seed that uses the parameters from the above parameter search to train models (run_xgb_train_{train_start}_{train_end}_basinsample{number_of_basins}_{id_of_basinsample}_seed{seed}.sbatch). Because these scripts need the parameters from the xgb_param_search runs, you can only execute them after completion of the parameter search.
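To make the "one script per combination" rule concrete, the sketch below enumerates EA-LSTM script names from the file-name pattern given above. It is an illustration only and not the code in main_gridEvaluation.py: the function name is hypothetical, and the example dates, basin samples, and seeds are made-up values (the date format used inside the file names is also an assumption).

```python
from itertools import product

# File-name pattern for EA-LSTM training scripts, as listed above.
EALSTM_PATTERN = (
    "run_ealstm_{train_start}_{train_end}_basinsample{number_of_basins}"
    "_{id_of_basinsample}_seed{seed}.sbatch"
)


def ealstm_script_names(training_periods, basin_samples, seeds):
    """Return one script name per (training period, basin sample, seed) combination."""
    names = []
    for (start, end), (n_basins, sample_id), seed in product(
        training_periods, basin_samples, seeds
    ):
        names.append(
            EALSTM_PATTERN.format(
                train_start=start,
                train_end=end,
                number_of_basins=n_basins,
                id_of_basinsample=sample_id,
                seed=seed,
            )
        )
    return names


# Hypothetical grid: 1 training period x 2 basin samples x 2 seeds = 4 scripts.
example = ealstm_script_names(
    training_periods=[("01101999", "30092008")],
    basin_samples=[(13, 0), (13, 1)],
    seeds=[111, 222],
)
```

The number of scripts grows multiplicatively with each dimension of the grid, which is why the scripts are meant to be submitted to a scheduler rather than run one by one.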
To evaluate the robustness of the LSTM model against noise in the static input features, run the following line of code from the terminal.
python main.py eval_robustness --camels_root /path/to/CAMELS --run_dir path/to/model_run
This will run 265,500 model evaluations (10 levels of added random noise and 50 repetitions per noise level for 531 basins). This evaluation is only implemented for our EA-LSTM. Therefore, make sure that the model_run folder contains the results of training an EA-LSTM.
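The evaluation count follows directly from the three factors above, and the idea of perturbing static features can be sketched in a few lines. This is a hedged illustration: the source only says "added random noise", so the zero-mean Gaussian noise used here, as well as the function names, are assumptions and not necessarily what main.py does.

```python
import random

NOISE_LEVELS = 10   # levels of added random noise
REPETITIONS = 50    # repetitions per noise level
NUM_BASINS = 531    # basins in the study


def count_evaluations(noise_levels=NOISE_LEVELS, repetitions=REPETITIONS,
                      num_basins=NUM_BASINS):
    """Total number of model evaluations in the robustness experiment."""
    return noise_levels * repetitions * num_basins


def perturb(static_features, noise_std, rng):
    """Add zero-mean Gaussian noise (an assumed noise model) to each static feature."""
    return [value + rng.gauss(0.0, noise_std) for value in static_features]


total = count_evaluations()  # 10 * 50 * 531
```

Each evaluation would then feed one perturbed copy of a basin's static features to the trained EA-LSTM and compare the resulting prediction with the unperturbed one.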
In your terminal, go to the project folder and start a jupyter notebook server by running
jupyter notebook
If you use any of this code in your experiments, please make sure to cite our paper and the publication by Kratzert et al.:
Gauch, M., Mai, J., Lin, J., "The Proper Care and Feeding of CAMELS:
How Limited Training Data Affects Streamflow Prediction". Environmental Modelling & Software, Volume 135, 2021, 104926, ISSN 1364-8152.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., Nearing, G., "Benchmarking
a Catchment-Aware Long Short-Term Memory Network (LSTM) for Large-Scale Hydrological Modeling".
submitted to Hydrol. Earth Syst. Sci. Discussions (2019)
The CAMELS data set only allows non-commercial use. Thus, our pre-trained models and the updated Maurer forcings are subject to the same TERMS OF USE as the CAMELS data set.