<a href="https://colab.research.google.com/github/andrewjustin/anemoi-house-workflow/blob/master/colab-anemoi-workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Datasets with ufs2arco for Anemoi Model Training

##### This notebook will guide you through the process of generating training, validation, and testing datasets for Anemoi models.

##### For questions, please contact andrew.justin@noaa.gov.

## 1) Environment Setup

In [8]:
!pip install ufs2arco==0.6 mpi4py anemoi-datasets==0.5.23 anemoi-graphs==0.5.2 anemoi-models==0.5.0 \
anemoi-training==0.4.0 anemoi-inference flash-attn 'numpy<2.3' 'earthkit-data<0.14.0'

import os
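
After installing, it can help to confirm which versions actually ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`; the `PINNED` dictionary and the `installed_versions` helper are illustrative names, not part of any Anemoi or ufs2arco package. It reports versions rather than comparing strings, because a pin such as `0.6` may be reported as `0.6.0`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above (only the packages pinned with "==").
PINNED = {
    "ufs2arco": "0.6",
    "anemoi-datasets": "0.5.23",
    "anemoi-graphs": "0.5.2",
    "anemoi-models": "0.5.0",
    "anemoi-training": "0.4.0",
}


def installed_versions(pins):
    """Return {package name: installed version, or None if not installed}."""
    report = {}
    for name in pins:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(PINNED)
for name, found in report.items():
    print(f"{name}: pinned {PINNED[name]}, installed {found}")
```

A value of `None` means the package is not installed in the current kernel, which usually means the install cell needs to be re-run.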



## 2) Dataset YAML Paths

In [3]:
train_yaml_path = 'training.yaml'  # training YAML path
valid_yaml_path = 'validation.yaml'  # validation YAML path
test_yaml_path = 'testing.yaml'  # testing YAML path
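
Before running the generation steps, it may be worth checking that the three YAML files are present in the working directory. This is a minimal sketch using `pathlib`; the `find_missing` helper is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

# Same file names as the path variables defined above.
yaml_paths = {
    "training": "training.yaml",
    "validation": "validation.yaml",
    "testing": "testing.yaml",
}


def find_missing(paths):
    """Return the split names whose YAML file does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).is_file()]


missing = find_missing(yaml_paths)
if missing:
    print("Missing YAML files for:", ", ".join(missing))
else:
    print("All dataset YAML files found.")
```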

The datasets will include data for the following timeframes at 3-hourly intervals:
- **Training**: 00Z 1 Jan 1994 - 21Z 2 Jan 1994
- **Validation**: 00Z 3 Jan 1994 - 21Z 4 Jan 1994
- **Testing**: 00Z 5 Jan 1994 - 21Z 6 Jan 1994
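
To make the timeframes above concrete, the following sketch enumerates the 3-hourly timestamps for each split with the standard library. It assumes both endpoints are included; the `three_hourly` helper is illustrative and does not reflect how ufs2arco builds its time axis internally.

```python
from datetime import datetime, timedelta


def three_hourly(start, end, step_hours=3):
    """List timestamps from start to end (inclusive) at a fixed hourly step."""
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += timedelta(hours=step_hours)
    return times


# Start and end times taken from the list above.
splits = {
    "training": (datetime(1994, 1, 1, 0), datetime(1994, 1, 2, 21)),
    "validation": (datetime(1994, 1, 3, 0), datetime(1994, 1, 4, 21)),
    "testing": (datetime(1994, 1, 5, 0), datetime(1994, 1, 6, 21)),
}

counts = {name: len(three_hourly(start, end)) for name, (start, end) in splits.items()}
for name, count in counts.items():
    print(f"{name}: {count} time steps")
```

Under the inclusive-endpoint assumption, each two-day split contains 16 time steps.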

## 3) Dataset Generation

### 3.1) Create the Training Dataset

In [4]:
!ufs2arco {train_yaml_path}



### 3.2) Create the Validation Dataset

In [5]:
!ufs2arco {valid_yaml_path}



### 3.3) Create the Testing Dataset

In [6]:
!ufs2arco {test_yaml_path}
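
The three generation steps run the same command with a different YAML file, so they could also be driven from one loop. The sketch below is one possible way to do that with `subprocess`; `build_command` and `generate_all` are hypothetical helpers, and the `dry_run` flag is an addition here so the commands can be inspected without launching ufs2arco. The only CLI usage assumed is the `ufs2arco <yaml path>` form shown in the cells above.

```python
import subprocess

yaml_paths = ["training.yaml", "validation.yaml", "testing.yaml"]


def build_command(yaml_path):
    """Build the argument list for the `ufs2arco <yaml>` call used above."""
    return ["ufs2arco", yaml_path]


def generate_all(paths, dry_run=True):
    """Run (or, when dry_run is True, only print) one ufs2arco call per YAML."""
    commands = [build_command(path) for path in paths]
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError if ufs2arco exits non-zero.
            subprocess.run(command, check=True)
    return commands


commands = generate_all(yaml_paths, dry_run=True)
```

Calling `generate_all(yaml_paths, dry_run=False)` would execute the three commands in sequence and stop at the first failure.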



## 4) Model Setup & Training

### 4.1) Environment Variables

- Anemoi requires two environment variables: a "base seed" (`ANEMOI_BASE_SEED`) and a SLURM job ID (`SLURM_JOB_ID`).
  - The base seed is used to initialize model weights. Changing the seed will result in different initial model parameters.
  - The SLURM job ID is required even if you are not running under SLURM; in that case, leave it as "0".
- Hydra can be configured to print full tracebacks for debugging by setting `HYDRA_FULL_ERROR` to "1".
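
The effect of a seed on initial parameters can be illustrated with a toy example. The sketch below uses Python's `random` module and a hypothetical `init_weights` function; it is not how Anemoi consumes `ANEMOI_BASE_SEED`, only a demonstration of the general idea that a fixed seed makes random initialization reproducible.

```python
import random


def init_weights(seed, n=4):
    """Draw n pseudo-random 'weights' from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_a = init_weights(42)
same_b = init_weights(42)
different = init_weights(43)

print("seed 42 twice gives identical weights:", same_a == same_b)
print("seed 42 vs 43 gives identical weights:", same_a == different)
```

Keeping the seed fixed across runs is what makes two training runs start from the same parameters; changing it is a simple way to produce an independent run.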


In [9]:
### Required ###
os.environ["ANEMOI_BASE_SEED"] = "42"
os.environ["SLURM_JOB_ID"] = "0"

### Optional ###
os.environ["HYDRA_FULL_ERROR"] = "1"  # print full Hydra tracebacks for debugging