# 1) Generate Datasets

## 1.1) Training Dataset

In [1]:
!ufs2arco training.yaml



## 1.2) Validation Dataset

In [2]:
!ufs2arco validation.yaml



## 1.3) Testing Dataset

In [3]:
!ufs2arco testing.yaml



# 2) Generate and Modify Config Files

## 2.1) Generate Config Files

In [7]:
!anemoi-training config generate

2025-07-02 17:52:09 INFO Generating configs, please wait.


The generated config files contain many default settings that must be changed before training.

## 2.2) Define batch sizes and configure datasets

Batch sizes must be defined for each dataset. The default *dataloader* file, *dataloader/native_grid.yaml*, has pre-defined batch sizes; however, these can be overridden in *config.yaml*.
- **batch_size.training**: training dataset batch size
- **batch_size.validation**: validation dataset batch size
- **batch_size.test**: testing dataset batch size

For each dataset, the dataset path and start and end dates need to be specified.
- **training.dataset**: full path to the training dataset
- **training.start**: start date for training dataset (YYYY-MM-DD)
- **training.end**: end date for training dataset (YYYY-MM-DD)
- **validation.dataset**: full path to the validation dataset
- **validation.start**: start date for validation dataset (YYYY-MM-DD)
- **validation.end**: end date for validation dataset (YYYY-MM-DD)
- **test.dataset**: full path to the test dataset
- **test.start**: start date for test dataset (YYYY-MM-DD)
- **test.end**: end date for test dataset (YYYY-MM-DD)

Example implementation in *config.yaml*:
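As a hedged sketch, the dotted keys listed above can be expanded into nested YAML with a few lines of Python. Every value below (batch sizes, paths, dates) is a placeholder rather than a recommended setting, and the helper names `nest` and `to_yaml` are illustrative. The parent key under which these entries sit in *config.yaml* is not shown here, so check the generated *dataloader/native_grid.yaml* for the exact layout.

```python
import json


def nest(flat):
    """Expand {'a.b': 1} style dotted keys into nested dictionaries."""
    tree = {}
    for dotted, value in flat.items():
        node = tree
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return tree


def to_yaml(tree, indent=0):
    """Render a nested dict of scalars as YAML lines (mappings and scalars only)."""
    lines = []
    pad = "  " * indent
    for key, value in tree.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        else:
            # json.dumps quotes strings and leaves integers bare
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return lines


# Placeholder values only: substitute your own paths, dates, and batch sizes.
settings = {
    "batch_size.training": 2,
    "batch_size.validation": 2,
    "batch_size.test": 2,
    "training.dataset": "/path/to/training.zarr",
    "training.start": "1993-01-01",
    "training.end": "1993-10-31",
    "validation.dataset": "/path/to/validation.zarr",
    "validation.start": "1993-11-01",
    "validation.end": "1993-12-31",
    "test.dataset": "/path/to/testing.zarr",
    "test.start": "1994-01-01",
    "test.end": "1994-12-31",
}

config_text = "\n".join(to_yaml(nest(settings)))
print(config_text)
```

The printed text can be pasted under the appropriate parent key and then edited by hand.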

## 2.3) Configure GPUs and Paths

Configuring paths is one of the most important steps in running the Anemoi framework. At the top of *config.yaml*, set the 'hardware' parameter to 'example'; this loads the default settings in *hardware/example.yaml*. The **data** path is not specified in that file, so it must be set explicitly. You may also want to specify different directories for storing outputs and model graphs.

- **paths.output**: directory for the outputs (checkpoints, plots, etc.). Directory structure will be created if it does not already exist.
- **paths.data**: directory for the datasets generated with ufs2arco.
- **paths.graph**: directory for the model graph.

The name of the zarr file containing the training dataset must also be specified.
- **files.dataset**: name of the training dataset zarr file (file name only, without the directory path)

You can also specify the number of GPUs to use for each model with the **num_gpus_per_model** parameter.

An example implementation in *config.yaml* is shown below.
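Before training, it can help to confirm that these settings line up on disk. The sketch below is not part of Anemoi. It assumes the training dataset is expected at **paths.data** joined with **files.dataset**, and the function name `check_paths` and all directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def check_paths(data_dir, dataset_name):
    """Return the full dataset path after basic sanity checks.

    Raises ValueError if dataset_name contains directory components and
    FileNotFoundError if the dataset is missing under data_dir.
    """
    if Path(dataset_name).name != dataset_name:
        raise ValueError("files.dataset should be a bare file name, not a path")
    dataset = Path(data_dir) / dataset_name
    if not dataset.exists():
        raise FileNotFoundError(f"dataset not found: {dataset}")
    return dataset


# Demonstration in a throwaway directory (a zarr store is a directory on disk)
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "dataset"
    (data_dir / "training.zarr").mkdir(parents=True)
    found = check_paths(data_dir, "training.zarr")
    print(found.name)
```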

## 2.4) Configure Model Training

A few parameters should be set in the main *config.yaml* file so that the training configuration can be modified easily.

At the top of *config.yaml*, the 'training' parameter is typically set to 'default', which loads the settings in *training/default.yaml*. All of these settings can be overridden in *config.yaml*.

Here are some useful training parameters to include in *config.yaml*:
- **max_epochs**: specifies the maximum number of epochs for model training. Training will stop if this limit is reached.
- **max_steps**: specifies the maximum number of total steps for model training (*not steps per epoch*). Training will stop if this limit is reached.
- **lr.rate**: starting learning rate
- **lr.min**: minimum learning rate

An example implementation in *config.yaml* with the aforementioned parameters is shown below.
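Because training stops when either **max_epochs** or **max_steps** is reached, it can be useful to estimate which limit applies first. The sketch below assumes one optimizer step per batch (no gradient accumulation) on a single model instance. The sample count and batch size are placeholders, and `stopping_limit` is a hypothetical helper, not an Anemoi function.

```python
import math


def stopping_limit(n_samples, batch_size, max_epochs, max_steps):
    """Return which limit ends training first and the total step count."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    steps_from_epochs = steps_per_epoch * max_epochs
    if max_steps < steps_from_epochs:
        return "max_steps", max_steps
    return "max_epochs", steps_from_epochs


# 1000 samples at batch size 4 gives 250 steps per epoch, so 2 epochs = 500 steps
result = stopping_limit(n_samples=1000, batch_size=4, max_epochs=2, max_steps=150000)
print(result)  # ('max_epochs', 500)
```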

## 2.5) Configure Diagnostics

During training, it is useful to plot sample model predictions and log other information about model output and performance to check whether the model is working as intended.

In *config.yaml*, the default diagnostics file is *diagnostics/evaluation.yaml*. It contains a couple of empty fields that are defined in the following steps.

### 2.5.1) Performance Logging

For now, we will disable Weights and Biases for performance logging (though you may want to configure a WandB workflow in the future). This can be done by setting the **diagnostics.log.wandb.entity** parameter to 'null'.

We will also disable the MLflow tracking server by setting **diagnostics.log.mlflow.tracking_uri** to 'null'.

An example implementation in *config.yaml* is shown below. Note that we will continue to modify **diagnostics** in later steps.

### 2.5.2) Plotting

With the default settings in *diagnostics/evaluation.yaml*, the following plots will be produced at user-defined frequencies for specified variables:
* Spatial plots of model predictions and errors
* Histograms showing binned model predictions and errors for **every** variable in a single plot

The frequency of plotting can be modified directly in *config.yaml* with the following parameters:
* **diagnostics.plot.frequency.epoch**: plot frequency in epochs
* **diagnostics.plot.frequency.batch**: plot frequency in batches

Adding these to **diagnostics** in *config.yaml*:

Next, define which variables to plot.

First, let's modify a few lines in *diagnostics/evaluation.yaml*.
- Under **callbacks**, ensure that every instance of **parameters** (there should be three in total) references the user-specified variables in **diagnostics.plot.parameters** (see cell below). This ensures that plots include every variable you want to monitor.
- You can leave the instance of **parameters** near the top of the file unchanged as we will be overriding it in *config.yaml*.

Now that the plotting file is configured, we can define the variables we want to plot in *config.yaml*.
* Note that precipitation and related moisture variables need to be defined in **diagnostics.plot.precip_and_related_fields** as well as **diagnostics.plot.parameters**.
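The requirement in the note above can be checked with a short sketch before training starts. The variable names are placeholders (use the names in your own dataset), and `missing_from_parameters` is a hypothetical helper rather than an Anemoi function.

```python
def missing_from_parameters(parameters, precip_and_related_fields):
    """Return precipitation-related fields absent from the plot parameters."""
    listed = set(parameters)
    return [name for name in precip_and_related_fields if name not in listed]


parameters = ["t2m", "u10", "v10"]   # placeholder variable names
precip_and_related_fields = ["tp"]   # placeholder precipitation field

missing = missing_from_parameters(parameters, precip_and_related_fields)
print(missing)  # ['tp']
```

An empty list means every precipitation-related field is also listed in **diagnostics.plot.parameters**.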

Adding our desired variables for plotting to **diagnostics.plot** in *config.yaml*:

After configuring the diagnostics, the *config.yaml* file can be used for training.

# 3) Set Environment Variables

Anemoi requires a "base seed" and a SLURM job ID.
- The base seed is used to initialize model weights. Changing the seed will result in different initial model parameters.
- The SLURM job ID is required even when you are not running under SLURM; in that case, leave it as "0".

*Hydra* can be configured to output more complete tracebacks for debugging purposes.

In [4]:
import os

### Required ###
os.environ["ANEMOI_BASE_SEED"] = "42"
os.environ["SLURM_JOB_ID"] = "0"

### Optional ###
os.environ['HYDRA_FULL_ERROR'] = "1"  # for debugging

# 4) Train the Model

In [None]:
!anemoi-training train --config-name=config.yaml

# 5) Model Inference

Model inference with Anemoi is performed with the *anemoi-inference* module: https://anemoi.readthedocs.io/projects/inference/en/latest/index.html#index-page

## 5.1) Retrieve Model Runs and Load Checkpoint
Each model run is saved in a folder named with a unique identifier (a UUID).

In [2]:
import os
model_runs = os.listdir('p1/training-output/checkpoint')
print('Available model runs:')
for run in model_runs:
    print(run + '\n')

Available model runs:
d46e7b66-9ba1-474f-9142-5dd28be63f50



Select a model run from the list above and load the checkpoint.
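If the most recent run is the one you want, the selection can be automated. This is a minimal sketch that assumes the newest run is the folder with the latest modification time; `latest_run` is a hypothetical helper, and the demonstration uses fake folders rather than real checkpoints.

```python
import os
import tempfile
from pathlib import Path


def latest_run(checkpoint_root):
    """Return the name of the most recently modified run directory."""
    runs = [p for p in Path(checkpoint_root).iterdir() if p.is_dir()]
    if not runs:
        raise FileNotFoundError(f"no model runs under {checkpoint_root}")
    return max(runs, key=lambda p: p.stat().st_mtime).name


# Demonstration with two fake run folders and explicit modification times
with tempfile.TemporaryDirectory() as root:
    for name, mtime in [("run-a", 1_000), ("run-b", 2_000)]:
        run_dir = Path(root) / name
        run_dir.mkdir()
        os.utime(run_dir, (mtime, mtime))
    newest = latest_run(root)
    print(newest)  # run-b
```

With the layout used in this notebook, `latest_run('p1/training-output/checkpoint')` would return a folder name that can be assigned to `model_run` in place of a hard-coded identifier.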

In [5]:
model_run = 'd46e7b66-9ba1-474f-9142-5dd28be63f50'  # model run hash identifier

## Do not change this ##
checkpoint = f'p1/training-output/checkpoint/{model_run}/inference-last.ckpt'

## 5.2) Configure and Run Model Inference
Select an initialization time from the testing dataset and set a forecast lead time. NOTE: Make sure that the valid time (i.e., time of the forecast) is within the testing dataset.
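The valid-time check in the note above can be sketched with the standard library. The dataset start and end times below are placeholders, not the actual bounds of the testing dataset, and `valid_time_in_range` is a hypothetical helper.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H"  # matches the [YYYY]-[MM]-[DD]T[HH] convention


def valid_time_in_range(init_time, lead_hours, data_start, data_end):
    """Return (ok, valid_time), where ok is True if the forecast fits the dataset."""
    init = datetime.strptime(init_time, FMT)
    valid = init + timedelta(hours=lead_hours)
    start = datetime.strptime(data_start, FMT)
    end = datetime.strptime(data_end, FMT)
    return (start <= init and valid <= end), valid


# Placeholder dataset bounds; replace with the range of your testing dataset
ok, valid = valid_time_in_range("1994-03-12T21", 240, "1994-01-01T00", "1994-12-31T21")
print(ok, valid.isoformat())  # True 1994-03-22T21:00:00
```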

You can also put the inference settings in a config YAML file and pass that instead; however, all settings can easily be passed on the command line.

In [6]:
init_time = '1994-03-12T21'  # initialization time [YYYY]-[MM]-[DD]T[HH]
lead_time = 240  # hours

## Do not change these ##
inference_dataset = 'p1/dataset/testing.zarr'
output_file = 'forecast.nc'  # output file containing the model forecast

!anemoi-inference run checkpoint={checkpoint} date={init_time} lead_time={lead_time} input.dataset={inference_dataset} output.netcdf={output_file}

                No post_processors defined. Accumulations will be accumulated from the beginning of the forecast.

                🚧🚧🚧 In a future release, the default will be to NOT accumulate from the beginning of the forecast. 🚧🚧🚧
                Update your config if you wish to keep accumulating from the beginning.
                https://github.com/ecmwf/anemoi-inference/issues/131
                
2025-07-08 16:59:35 INFO Pre processors: []
2025-07-08 16:59:35 INFO Accumulating fields []
2025-07-08 16:59:35 INFO Post processors: [Accumulate([])]
2025-07-08 16:59:35 INFO Using DefaultRunner runner, device=cuda
2025-07-08 16:59:35 INFO Input: DatasetInput(('p1/dataset/testing.zarr',), {})
2025-07-08 16:59:35 INFO Output: NetCDFOutput(forecast.nc)
2025-07-08 16:59:36 INFO 🚧🚧🚧🚧🚧🚧 XXXXXX cos_julian_day, 0, (73728,)
2025-07-08 16:59:36 INFO 🚧🚧🚧🚧🚧🚧 XXXXXX cos_local_time, 0, (73728,)
2025-07-08 16:59:36 INFO 🚧🚧🚧🚧🚧🚧 XXXXXX cos_longitude, 0, (73728,)
2025-07-08 16:59:36 INFO 🚧🚧🚧🚧🚧🚧 XXXXXX