# Ice Station Zebra Pipeline Demo

This demonstration showcases the complete Ice Station Zebra ML pipeline capabilities through CLI commands. 

**Target Audience:** Developer teams and future team members who want to understand our design decisions, 
trade-offs, and flexible experimentation capabilities.

**You'll learn how to:**
- Run our training pipeline end-to-end in three lines of code
- Swap between different modelling paradigms
- Reproduce runs and inspect the outputs
- Evaluate the performance of the models in line with community standards on sea ice forecasting

In [1]:
#import warnings
#warnings.filterwarnings("ignore", message=".*repr.*Field.*")

<div class="alert alert-danger">

If you want to run this notebook, you will need a **CDS account** in order to download the ERA5 weather data. Details of how to set this up can be found [here](https://cds.climate.copernicus.eu/how-to-api).

</div>

## Notebook Structure

[**Section 1: End-to-End Training**](#section-1-end-to-end-training-pipeline)
- Run a full zebra pipeline end-to-end using a minimal configuration & data
- Inspect training artifacts and see evaluation outputs

[**Section 2: Model Flexibility**](#section-2-model-flexibility)
- Switch between Encode-Process-Decode paradigm and standalone persistence model
- Explore Encoder module functionality (Multimodality)

[**Section 3: Full flexibility - Advanced Example**](#section-3-full-flexibility---advanced-example)
- Write or adapt config files to change pipeline behavior
- Use anemoi functionality to fetch and inspect standard datasets
- See our pipeline data checks and validation in action
- Evaluate and compare model performance using a pretrained model checkpoint
- Explore different plotting formats and metrics

# Section 1: End-to-End Training Pipeline

In this section, we'll demonstrate the complete training pipeline using a simple **UNet model with a naive encoder / decoder** (more details of this can be found below in [section 2](#section-2-model-flexibility)). 
The dataset contains sea ice concentration data (OSISAF) and corresponding atmospheric data (ERA5).
We don't expect the model to do well as we are only training for 10 epochs, and won't do any hyperparameter optimisation. However, it will give us a sense of the pipeline.

You can install the repo by running the following commands in your terminal:

```bash
git clone https://github.com/alan-turing-institute/ice-station-zebra
cd ice-station-zebra
pip install .
```

### Environment Verification

Let's verify that our `zebra` CLI tools are available and working.

To run this notebook, you'll need a kernel (e.g. conda or .venv) with the ice_station_zebra repo and jupyter installed.

In [2]:
!zebra --help

 Usage: zebra [OPTIONS] COMMAND [ARGS]...

 Entrypoint for zebra application commands

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --install-completion            Install completion for the current shell.    │
│ --show-completion               Show completion for the current shell, to    │
│                                 copy it or customize the installation.       │
│ --help                -h

### Download the dataset for running the model

<div class="alert alert-danger"> 

N.B. Running this minimal example requires you to have signed up for access to the ERA5 data. Details on how to do so can be found [here](https://cds.climate.copernicus.eu/how-to-api).

</div>

The command below downloads a set of ERA5 weather data and OSISAF sea ice concentration (SIC) data covering 2017-2019. The details of this data are specified in `demo_nb.yaml`. (Details about how these config files work will be covered in a later section.)

If the data is already present, a summary of the dataset will be printed (it will not be downloaded again). This same summary can be created by running `zebra datasets inspect`.

**This assumes you have a folder called `my_data/` in the root of the repo.**

In [3]:
!zebra datasets create --config-name=demo_nb.yaml

Working on samp-sicsouth-osisaf-25k-2017-2019-24h-v1.
Inspecting dataset samp-sicsouth-osisaf-25k-2017-2019-24h-v1 at /Users/sarana/Documents/e&s/SeaIce/ice-station-zebra/my_data/data/anemoi/samp-sicsouth-osisaf-25k-2017-2019-24h-v1.zarr.
Dataset samp-sicsouth-osisaf-25k-2017-2019-24h-v1 not found at /Users/sarana/Documents/e&s/SeaIce/ice-station-zebra/my_data/data/anemoi/samp-sicsouth-osisaf-25k-2017-2019-24h-v1.zarr.
OSISAF SIC data will be downloaded to /Users/sarana/Documents/e&s/SeaIce/ice-station-zebra/my_data/data/preprocessing/samp-sicsouth-osisaf-25k-2017-2019-24h-v1/IceNetSIC.
Preparing to download data for 2017.
Downloading polar masks for 2015.
No mask found for 2015-02, using previous month.
No mask found for 2015-03, using previous month.
No mask found for 2015-04, using previous month.
No mask found for 2015-05, using previous month.
No mask found for 2015-06, using previous month.
No mask found for 2015-07, using previous month.
No mask found for 2015-08, using previous

### Train the model

The next command will train a simple UNet model for sea ice concentration forecasting, and the training will run for 10 epochs. The options used for training this model are specified in `demo_nb.yaml`. 

In [4]:
!zebra train --config-name=demo_nb.yaml

Found 2 dataset_groups.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/miniconda3/envs/ice_station_zebra/lib/python3.12/site-packages/anemoi/d │
│ atasets/data/stores.py:186 in open_zarr                                      │
│                                                                              │
│   183 │   │   if cache is not None:                                          │
│   184 │   │   │   store = zarr.LRUStoreCache(store, max_size=cache)          │
│   185 │   │                                                                  │
│ ❱ 186 │   │   return zarr.convenience.open(store, "r"

### Evaluate the model

Finally, the model is evaluated using the checkpoint saved during the training run. The results are then logged to Weights & Biases. (There is a wandb account for the Sea Ice project where results are logged by default.)

In [5]:
!zebra evaluate --config-name=demo_nb.yaml --checkpoint="../my_data/training/naive_unet_naive_demo_south/wandb/latest-run/checkpoints/epoch=9-step=1810.ckpt"

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/miniconda3/envs/ice_station_zebra/lib/python3.12/site-packages/ice_stat │
│ ion_zebra/cli/hydra.py:41 in wrapper                                         │
│                                                                              │
│   38 │   ) -> RetType:                                                       │
│   39 │   │   with initialize(config_path="../config", version_base=None):    │
│   40 │   │   │   config = compose(config_name=config_name, overrides=overrid │
│ ❱ 41 │   │   return function(*args, config=config, **kwargs)                 │

We can then check the predictions from this model. (N.B. as the model is only trained for a small number of epochs on a limited dataset, the results are not great.)

In [6]:
import os
from IPython.display import Image

# extract the file name (it has a random string on the end)
folder = "../my_data/training/naive_unet_naive_demo_south/wandb/latest-run/files/media/videos/"
file = os.path.join(folder, os.listdir(folder)[0])

Image(filename=file)

FileNotFoundError: [Errno 2] No such file or directory: '../my_data/training/naive_unet_naive_demo_south/wandb/latest-run/files/media/videos/'
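
The `FileNotFoundError` above is what you see when the run has not written any media, for example because training or evaluation failed earlier. A more defensive version of the lookup might look like the sketch below; `first_media_file` is a hypothetical helper written for this notebook, not part of the `zebra` package.

```python
from pathlib import Path
from typing import Optional


def first_media_file(folder: str) -> Optional[Path]:
    """Return the first file in `folder` (sorted by name), or None if there is none."""
    path = Path(folder)
    if not path.is_dir():
        # The run has not written any media yet (e.g. training or evaluation failed).
        return None
    files = sorted(p for p in path.iterdir() if p.is_file())
    return files[0] if files else None


folder = "../my_data/training/naive_unet_naive_demo_south/wandb/latest-run/files/media/videos/"
media = first_media_file(folder)
if media is None:
    print(f"No media found in {folder}; check that training and evaluation completed.")
else:
    # In the notebook, pass str(media) to Image(filename=...) to display it.
    print(f"Found {media}")
```

Sorting the directory listing also makes the choice of file deterministic, which `os.listdir` does not guarantee.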

# Section 2: Model Flexibility

In this section, we'll demonstrate how easy it is to switch between different model architectures.
We'll also show the difference between standalone models and processor models.

The conceptually simpler model type is a standalone model, which takes in the input data and directly outputs the prediction. These models are less flexible, as they have to be specifically coded to handle new input / output data. Consequently a separate instance of the model is likely to be needed for each input / output combination. However, the input variables are available without transformation.  

![pipeline_standalone](../docs/assets/pipeline-standalone.png)

The more complex model type is a processor model, which uses an encode-process-decode paradigm. Here, the input data is first encoded into a latent representation, which is then processed by a core model, before being decoded back into the output space. This allows for more flexibility in terms of input and output variables, as well as the ability to use different types of models for each component.

![pipeline_encode_process_decode](../docs/assets/pipeline-encode-process-decode.png)

## Using an alternative processor model

The initial model we ran, the `naive_unet_naive` model, is an example of a processor model. It uses a naive encoder to convert the parameters into the right dimensions for the latent space. A UNet model is then run on the latent space, before a naive decoder extracts the SIC predictions.

In this example, we switch the `naive_unet_naive` model for a more complex `naive_vit_naive` model, which still uses a naive encoder / decoder, but replaces the UNet model with a Vision Transformer (ViT).

In [None]:
!zebra train --config-name=demo_nb.yaml model=naive_vit_naive loggers.wandb.name=naive_vit_naive_demo_south
# the loggers.wandb.name is for convenience of plotting, and is not a required argument

In [None]:
!zebra evaluate --config-name=demo_nb.yaml --checkpoint="../my_data/training/naive_vit_naive_demo_south/wandb/latest-run/checkpoints/epoch=9-step=1810.ckpt" loggers.wandb.name=naive_vit_naive_demo_south
# the loggers.wandb.name is for convenience of plotting, and is not a required argument
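
The evaluation commands above hard-code the checkpoint file name (`epoch=9-step=1810.ckpt`), which changes if you alter the number of epochs or the dataset size. As a rough sketch, assuming checkpoints keep the `epoch=<n>-step=<m>.ckpt` naming seen above, a small helper (hypothetical, not part of the `zebra` package) could pick the most recent one:

```python
import re
from pathlib import Path
from typing import Optional

# Matches file names of the form seen above, e.g. "epoch=9-step=1810.ckpt".
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest (epoch, step), or None if none exist."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = []
    for path in directory.glob("*.ckpt"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            # Compare numerically so that epoch=10 sorts after epoch=9.
            epoch, step = int(match.group(1)), int(match.group(2))
            candidates.append(((epoch, step), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


print(latest_checkpoint("../my_data/training/naive_vit_naive_demo_south/wandb/latest-run/checkpoints"))
```

The returned path can then be passed to the `--checkpoint` option instead of typing the file name by hand.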

The results are logged to [Weights & Biases](https://wandb.ai/), but they can also be viewed locally. When inspecting the predictions of the model using a vision transformer architecture, you may notice a checkerboard pattern, an artefact of the patch embedding approach this model uses.

In [None]:
# extract the file name (it has a random string on the end)
folder = "../my_data/training/naive_vit_naive_demo_south/wandb/latest-run/files/media/videos/"
file = os.path.join(folder, os.listdir(folder)[0])

Image(filename=file)


There is also the option to replace the naive encoder / decoder with convolutional neural networks (CNNs). However, these struggle to train on a laptop, so we won't demonstrate them here. If you want to test them out, just set the model to be `cnn_unet_cnn` or `cnn_vit_cnn` in the config file.

## Standalone persistence model

As mentioned above, an alternative form of model doesn't use an encoder / decoder architecture. We can demonstrate this with a `persistence` model, which simply outputs the last input frame as the prediction.
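
To make the idea concrete, here is a minimal sketch of a persistence forecast in plain Python. It illustrates the concept only and is not the implementation used in the pipeline; the function name, the list-of-lists frame type and the example values are all made up for this illustration.

```python
from typing import List

Frame = List[List[float]]  # one 2D grid of sea ice concentration values


def persistence_forecast(history: List[Frame], n_forecast_steps: int) -> List[Frame]:
    """Repeat the most recent input frame for every forecast step."""
    if not history:
        raise ValueError("history must contain at least one frame")
    last_frame = history[-1]
    # Copy each row so that the forecast frames do not alias the input.
    return [[row[:] for row in last_frame] for _ in range(n_forecast_steps)]


history = [
    [[0.0, 0.2], [0.4, 0.6]],  # day t-1
    [[0.1, 0.3], [0.5, 0.7]],  # day t
]
forecast = persistence_forecast(history, n_forecast_steps=3)
print(forecast[0])  # [[0.1, 0.3], [0.5, 0.7]]
```

Because nothing is learned, a model like this is a useful baseline: any trained model should at least beat it.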

Note that the easiest way to run the `persistence` model is to use the `persistence.yaml` config file. As this config file is not set up to be run in this notebook (unlike the `demo_nb.yaml` file we've been using so far), we need to specify a few extra options in the command line to ensure it works correctly.

In [None]:
!zebra train --config-name=persistence.yaml ++base_path="../my_data" loggers.wandb.save_dir="../my_data/training/persistence_demo_south"
# note the base_path and the loggers.wandb.save_dir command were not needed in the previous examples as they were specified in the demo_nb.yaml config file

In [None]:
!zebra evaluate --config-name=persistence.yaml ++base_path="../my_data" --checkpoint="../my_data/training/persistence_demo_south/wandb/latest-run/checkpoints/epoch=1-step=0.ckpt" loggers.wandb.save_dir="../my_data/training/persistence_demo_south"

In [None]:
# extract the file name (it has a random string on the end)
folder = "../my_data/training/persistence_demo_south/wandb/latest-run/files/media/videos/"
file = os.path.join(folder, os.listdir(folder)[0])

Image(filename=file)

# Section 3: Full flexibility - Advanced Example

This section shows how to adapt the different parts of the model pipeline to your needs. We will look into the config files that are the basis of the pipeline, and we will show how to create your own config file to give you full control over the model training.

The pipeline uses Hydra, which is a powerful configuration management tool. More details can be found [here](https://hydra.cc/docs/intro/). 

## The base config file

So far in this notebook, we've mostly used the `demo_nb.yaml` config file. We can have a look at the contents of this file to see what options it is using.

In [None]:
!cat ../ice_station_zebra/config/demo_nb.yaml

As you can see here, the defaults section points to the `base.yaml` config file, which contains the main options for the pipeline. There are then some specific options for how many epochs to train for, and where to save the data.

So let's have a look at the `base.yaml` file.

In [None]:
!cat ../ice_station_zebra/config/base.yaml

There are lots more options under the defaults section here, for specifying a range of options such as the datasets, or the model to use. Some of these options should also look familiar from the `demo_nb.yaml` file. For example, under `loggers`, we can see `wandb` specified as the logger to use.

If you look at the `ice_station_zebra/config/` folder, you can see that all the defaults in this base config map to specific folders / files. By opening those in turn, you can see the options that are being used for each part of the pipeline. You can also see the alternative files that could be substituted in (e.g. using `naive_vit_naive.yaml` instead of `naive_unet_naive.yaml` for the model).

If you only want to change a single parameter, rather than the whole config file, you can specify this in the config, referencing the nested structure. An example of this is how we specified the number of epochs to train for in the `demo_nb.yaml` file. Here, we overrode the default of 50 epochs specified in `train/trainer/default.yaml`, by setting `max_epochs` to be 10.
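
The way a single nested parameter replaces a default can be pictured with a few lines of plain Python. This is a conceptual sketch only: it is not how Hydra is implemented, and the key path `train.trainer.max_epochs` is an assumption inferred from the file location `train/trainer/default.yaml`, so the real key names may differ.

```python
import copy
from typing import Any, Dict


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Return a copy of `config` with the nested key `a.b.c` set to `value`."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the hierarchy, creating intermediate levels if needed.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# Hypothetical stand-in for the defaults; only the 50-epoch value comes from the text above.
defaults = {"train": {"trainer": {"max_epochs": 50}}}
demo = apply_override(defaults, "train.trainer.max_epochs", 10)
print(demo["train"]["trainer"]["max_epochs"])  # 10
print(defaults["train"]["trainer"]["max_epochs"])  # 50
```

The defaults are left untouched, which mirrors the recommendation to override values in your own config rather than edit the shipped files.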

You might remember, we also used the `persistence.yaml` config file to run a persistence model. We can look at the contents of that file:

In [None]:
!cat ../ice_station_zebra/config/persistence.yaml

Here you can see that it mostly uses the same parameters as the `base` config. However, it overrides the default configs for `train` and `model` to switch in the persistence model options. 

<div class="alert alert-success">

Typically you should not change the values in the config files. If you want to change some of these options, the best way to do so is to create your own config file, similar to `demo_nb.yaml`. It's easiest to use the `base` config as the default, and then specify the parameters you want to change using `override` or by specifying the parameter in the config hierarchy.

When you want to run code using that config file, you can specify it using the `--config-name` option in the CLI command. This will remove the need to specify the individual parameters in the command line each time. 

</div>

## Creating a dataset

The first step in the model pipeline is to download the required input data. This has been developed to make use of the [anemoi datasets](https://anemoi.readthedocs.io/projects/datasets/en/latest/) package, which is part of the [anemoi toolkit](https://anemoi.readthedocs.io/en/latest/) developed by ECMWF. 

New datasets can be downloaded using the `zebra datasets create` command shown in the first section of this notebook. 

If the data has already been downloaded, the `zebra datasets inspect` command prints a summary of the dataset (or datasets), including a full list of the variables and some summary statistics.

In [None]:
!zebra datasets inspect --config-name=demo_nb.yaml

The details of the dataset to be downloaded are defined in the datasets config file. The default datasets (those used by the `base` config) are `samp_sicsouth_osisaf_25k_2017_2019_24h_v1` and `samp_weathersouth_era5_0p5_2017_2019_24h_v1`. We will have a look at these in a bit more detail.

In [None]:
!cat ../ice_station_zebra/config/datasets/samp_sicsouth_osisaf_25k_2017_2019_24h_v1.yaml

This dataset contains sea ice concentration (SIC) data from the OSI SAF 25km dataset for the years 2017-2019, with a temporal resolution of 24 hours. The data is from the Southern Hemisphere. This download makes use of data preprocessing developed as part of the [IceNet repository](https://github.com/icenet-ai/icenet).

In [None]:
!cat ../ice_station_zebra/config/datasets/samp_weathersouth_era5_0p5_2017_2019_24h_v1.yaml

This dataset contains 50km resolution weather data from ERA5 for the years 2017-2019, with a temporal resolution of 24 hours. The data is from the Southern Hemisphere. The variables downloaded are those commonly used in sea ice forecasting, including temperature, u and v wind components, surface / sea level pressure, humidity and geopotential height. It also includes the cosine and sine of the Julian day to help the model learn seasonal patterns.
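
To see why the day of the year is supplied as a cosine / sine pair rather than a raw number, consider the sketch below. The exact convention used when building the dataset (offset, year length) is not stated here, so the 365.25-day year and the function name are assumptions made for illustration.

```python
import math
from datetime import date
from typing import Tuple


def cyclical_day_of_year(day: date, days_per_year: float = 365.25) -> Tuple[float, float]:
    """Return (cos, sin) of the angle corresponding to the day of the year."""
    day_of_year = day.timetuple().tm_yday  # 1 on 1 January
    angle = 2.0 * math.pi * day_of_year / days_per_year
    return math.cos(angle), math.sin(angle)


cos_a, sin_a = cyclical_day_of_year(date(2017, 12, 31))
cos_b, sin_b = cyclical_day_of_year(date(2018, 1, 1))
# The two dates are one day apart, so their encodings are close together,
# even though the raw day numbers (365 and 1) are far apart.
print(math.hypot(cos_a - cos_b, sin_a - sin_b))
```

With this encoding the end of December sits next to the start of January, so the model sees the seasonal cycle as continuous.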

N.B. there is also a set of other configs in the `config/datasets` folder for different subsets of variables, different hemispheres, and different datasets.

## Training a model

Training a model is done using the `zebra train` command, as shown in section 1 of this notebook. There is a set of config files that specifically relate to the model training. These cover, for example, the choice of model, the training parameters, the variable to predict, the dataset split and the logger to use. We will explore each of these in turn.

### Model choice

In [None]:
!cat ../ice_station_zebra/config/model/naive_unet_naive.yaml

This config points to the specific Python code used for each part of the model (i.e. the naive encoder, the UNet processor and the naive decoder). There are various parameters that can be set, such as the size of the latent space, the UNet kernel size and the number of start-out channels for the UNet.

There is a set of possible model configs. Each of these has specific parameters (and their default values) that need to be set for that model.

### Training parameters

In [None]:
!cat ../ice_station_zebra/config/train/default.yaml

A set of training parameters is specified here. Normally it is fine to just use the default values.

### Prediction parameters

In [None]:
!cat ../ice_station_zebra/config/predict/osisaf-south.yaml

The `predict` config files specify the variable to be predicted, as well as the number of historic days to include as input, and the number of days to forecast ahead. There are config files for predicting SIC in the northern or the southern hemisphere. 

### Dataset split

In [None]:
!cat ../ice_station_zebra/config/split/sample_dataset.yaml

The dataset split specifies the number of batches to use for training. It also specifies how the datasets should be split into training, validation and test sets, based on date ranges. These dates shouldn't be altered, to ensure that all the different model runs are comparable.
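
As a rough illustration of splitting by date range, the sketch below assigns each date to the split whose range contains it. The ranges, the split names and the helper are hypothetical; the real values are the ones in the split config printed above.

```python
from datetime import date
from typing import Dict, List, Tuple

DateRange = Tuple[date, date]  # inclusive start and end


def split_by_date(dates: List[date], ranges: Dict[str, DateRange]) -> Dict[str, List[date]]:
    """Assign each date to the first split whose inclusive range contains it."""
    splits: Dict[str, List[date]] = {name: [] for name in ranges}
    for day in dates:
        for name, (start, end) in ranges.items():
            if start <= day <= end:
                splits[name].append(day)
                break
    return splits


# Hypothetical ranges for illustration only; the real ones live in the split config.
ranges = {
    "train": (date(2017, 1, 1), date(2018, 6, 30)),
    "validate": (date(2018, 7, 1), date(2018, 12, 31)),
    "test": (date(2019, 1, 1), date(2019, 12, 31)),
}
dates = [date(2017, 3, 1), date(2018, 9, 15), date(2019, 5, 20)]
splits = split_by_date(dates, ranges)
print({name: len(days) for name, days in splits.items()})
```

Splitting on fixed, non-overlapping date ranges keeps the test period unseen during training, which is what makes different runs comparable.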

### Loggers

In [None]:
!cat ../ice_station_zebra/config/loggers/wandb.yaml

By default, the pipeline uses Weights & Biases to log the training and evaluation metrics. All runs are saved to the 'turing-seaice' project, as well as being saved locally in a `training` folder alongside the data. We don't log the model to W&B to save space.

## Evaluating a model

Training a model produces a set of artifacts, including one or more model checkpoints. The checkpoints can be used to evaluate the model performance using the `zebra evaluate` command (as shown in section 1 of this notebook). There are a couple of config files that specifically relate to the evaluation process within the `evaluate` folder (though several of the ones we have already explored are also relevant).

### Plotting

In [None]:
!cat ../ice_station_zebra/config/evaluate/callbacks/plotting.yaml

By default, static maps and videos of the predicted forecasts of the output variable (normally SIC) are created for a set of examples from the test set. These are uploaded to Weights & Biases, and also saved locally. 

There are a lot of options for adjusting the model output plots. 

### Metric summary

In [None]:
!cat ../ice_station_zebra/config/evaluate/callbacks/metric_summary.yaml

The default metric summary is based on average loss.

<div class="alert alert-success">

Typically you should not change the values in the config files. If you want to change some of these options, the best way to do so is to create your own config file, similar to `demo_nb.yaml`. It's easiest to use the `base` config as the default, and then specify the parameters you want to change using `override` or by specifying the parameter in the config hierarchy.

When you want to run code using that config file, you can specify it using the `--config-name` option in the CLI command. This will remove the need to specify the individual parameters in the command line each time. 

</div>

Hopefully this notebook has given you an overview of the Sea Ice forecasting pipeline, and shown you how you can use it flexibly to run different models and evaluate their performance.