# Generating Datasets with ufs2arco for Anemoi Model Training

##### This notebook will guide you through the process of generating training, validation, and testing datasets for Anemoi models.

##### For questions, please contact andrew.justin@noaa.gov.

## 1) Environment Setup

In [1]:
!pip install ufs2arco==0.6 mpi4py

import time

Collecting ufs2arco==0.6
  Using cached ufs2arco-0.6.0-py3-none-any.whl.metadata (1.8 kB)
Collecting mpi4py
  Using cached mpi4py-4.0.3-cp311-cp311-linux_x86_64.whl
Collecting numpy (from ufs2arco==0.6)
  Using cached numpy-2.3.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting xarray (from ufs2arco==0.6)
  Using cached xarray-2025.6.1-py3-none-any.whl.metadata (12 kB)
Collecting cf_xarray (from ufs2arco==0.6)
  Using cached cf_xarray-0.10.6-py3-none-any.whl.metadata (16 kB)
Collecting cftime (from ufs2arco==0.6)
  Using cached cftime-1.6.4.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.7 kB)
Collecting netCDF4 (from ufs2arco==0.6)
  Using cached netCDF4-1.7.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting h5netcdf (from ufs2arco==0.6)
  Using cached h5netcdf-1.6.1-py3-none-any.whl.metadata (13 kB)
Collecting zarr<3 (from ufs2arco==0.6)
  Using cached zarr-2.18.7-py3-none-any.whl.metadata (5.8 kB)

## 2) Dataset YAML Paths

In [2]:
train_yaml_path = 'training.yaml'  # training YAML path
valid_yaml_path = 'validation.yaml'  # validation YAML path
test_yaml_path = 'testing.yaml'  # testing YAML path
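Before launching the longer dataset builds, it can help to confirm that the three YAML files are actually present in the working directory. The following is a minimal sketch of such a check; the helper name `missing_yaml_files` is hypothetical and is not part of ufs2arco, and the file names simply repeat the paths defined above.

```python
from pathlib import Path


def missing_yaml_files(paths):
    """Return the paths that do not point to an existing file.

    Hypothetical helper for illustration; not part of ufs2arco.
    """
    return [p for p in paths if not Path(p).is_file()]


# Same file names as the notebook variables defined above.
yaml_paths = ['training.yaml', 'validation.yaml', 'testing.yaml']

missing = missing_yaml_files(yaml_paths)
if missing:
    print(f'Missing YAML files: {missing}')
else:
    print('All YAML files found.')
```

This only checks that the files exist; it does not validate their contents against what ufs2arco expects.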

Each dataset covers two days of data at 3-hourly intervals (all times in UTC):
- **Training**: 00Z 1 Jan 1994 to 21Z 2 Jan 1994
- **Validation**: 00Z 3 Jan 1994 to 21Z 4 Jan 1994
- **Testing**: 00Z 5 Jan 1994 to 21Z 6 Jan 1994
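As a quick sanity check on the date ranges listed above, the number of time steps per split can be worked out with the standard library. This is a sketch that assumes both endpoints are included and that the YAML files use exactly these dates and a 3-hour step; the actual sample count is determined by each YAML configuration. The function name `count_steps` is hypothetical.

```python
from datetime import datetime, timedelta


def count_steps(start, end, step_hours=3):
    """Count time steps from start to end, both endpoints included."""
    step = timedelta(hours=step_hours)
    return (end - start) // step + 1


# Date ranges copied from the list above (assumed inclusive).
splits = {
    'training': (datetime(1994, 1, 1, 0), datetime(1994, 1, 2, 21)),
    'validation': (datetime(1994, 1, 3, 0), datetime(1994, 1, 4, 21)),
    'testing': (datetime(1994, 1, 5, 0), datetime(1994, 1, 6, 21)),
}

for name, (start, end) in splits.items():
    print(f'{name}: {count_steps(start, end)} time steps')
```

Under these assumptions each split contains 16 time steps (8 per day for 2 days).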

### 3.1) Create the Training Dataset

In [3]:
start_time = time.time()
!ufs2arco {train_yaml_path}
# divmod keeps the minute count correct even if the run exceeds one hour
minutes, seconds = divmod(int(time.time() - start_time), 60)
print(f'ufs2arco training dataset: {minutes} min {seconds} sec')

ufs2arco training dataset: 1 min 11 sec


### 3.2) Create the Validation Dataset

In [4]:
start_time = time.time()
!ufs2arco {valid_yaml_path}
# divmod keeps the minute count correct even if the run exceeds one hour
minutes, seconds = divmod(int(time.time() - start_time), 60)
print(f'ufs2arco validation dataset: {minutes} min {seconds} sec')

ufs2arco validation dataset: 1 min 6 sec


### 3.3) Create the Testing Dataset

In [5]:
start_time = time.time()
!ufs2arco {test_yaml_path}
# divmod keeps the minute count correct even if the run exceeds one hour
minutes, seconds = divmod(int(time.time() - start_time), 60)
print(f'ufs2arco test dataset: {minutes} min {seconds} sec')

ufs2arco test dataset: 1 min 9 sec
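
The three cells above repeat the same run-and-time pattern with the notebook `!` shell syntax. Outside a notebook, the same idea could be sketched with `subprocess`, as below. The helpers `run_and_time` and `format_elapsed` are hypothetical names introduced for illustration, and the demonstration runs a harmless Python no-op rather than ufs2arco itself.

```python
import subprocess
import sys
import time


def run_and_time(command):
    """Run a command and return (return code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


def format_elapsed(seconds):
    """Format a duration as whole minutes and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f'{minutes} min {secs} sec'


# Demonstration with a no-op command so the sketch runs anywhere.
returncode, elapsed = run_and_time([sys.executable, '-c', 'pass'])
print(f'exit code {returncode}: {format_elapsed(elapsed)}')
```

To time a dataset build this way, the command list would be replaced with something like `['ufs2arco', 'training.yaml']`, mirroring the invocation used in the cells above. Checking the return code is a useful addition, since the `!` syntax does not stop the notebook when the command fails.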
