# Data Processing

ClimateLearn makes it easy to prepare data for your machine learning pipelines. In this tutorial, we'll see how to download [ERA5](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) data from [WeatherBench](https://github.com/pangeo-data/WeatherBench) and prepare it for both the forecasting and [downscaling](https://uaf-snap.org/how-do-we-do-it/downscaling) tasks. This tutorial is intended for use in Google Colab.

## Google Colab setup
You might need to restart the kernel after installing ClimateLearn so that your Colab environment knows to use the correct package versions.

In [None]:
!pip install climate-learn

In [None]:
from google.colab import drive
drive.mount("/content/drive")

## Download

The following cell will take several minutes to run - the scale of climate data is huge!

In [1]:
from climate_learn.data import download

root = "/content/drive/MyDrive/ClimateLearn"
source = "weatherbench"
dataset = "era5"
resolution = "5.625"
variable = "2m_temperature"

download(root=root, source=source, dataset=dataset, resolution=resolution, variable=variable)

ClimateLearn comes with some utilities to view the downloaded data in its raw format. This can be useful as a quick sanity check that you have the data you expect. Climate data is natively stored in the [NetCDF format](https://www.unidata.ucar.edu/software/netcdf/), which means it comes bundled with lots of helpful named metadata such as latitude, longitude, and time. However, we want the data in a form that can be easily ingested by PyTorch machine learning models.

In [3]:
from climate_learn.utils.data import load_dataset, view

my_dataset = load_dataset(f"{root}/data/{source}/{dataset}/{resolution}/{variable}")
view(my_dataset)
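
As a complementary sanity check, the sketch below lists the NetCDF files in a download directory and reports their total size. It uses only the standard library. `summarize_netcdf_dir` is a hypothetical helper, not part of ClimateLearn, and the `.nc` file extension is an assumption about how the downloaded files are named.

```python
from pathlib import Path


def summarize_netcdf_dir(directory):
    """Return (sorted file names, total size in MB) for *.nc files in a directory.

    Hypothetical helper for illustration; not part of ClimateLearn.
    """
    path = Path(directory)
    if not path.is_dir():
        # Nothing downloaded yet (or wrong path): report an empty listing.
        return [], 0.0
    # Assumption: the downloaded files use the .nc extension.
    files = sorted(path.glob("*.nc"))
    total_mb = sum(f.stat().st_size for f in files) / 1e6
    return [f.name for f in files], total_mb


# Same directory layout as the cell above, written out in full.
data_dir = "/content/drive/MyDrive/ClimateLearn/data/weatherbench/era5/5.625/2m_temperature"
names, total_mb = summarize_netcdf_dir(data_dir)
print(f"{len(names)} NetCDF files, {total_mb:.1f} MB in total")
```

An empty listing usually means the download has not finished or the path does not match the one passed to `download`.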

## Preparing data for forecasting

In this cell, we specify the dataset arguments. The temporal range of ERA5 data on WeatherBench is 1979 to 2018.

In [4]:
from climate_learn.data.climate_dataset.args import ERA5Args

years = range(1979, 2019)  # range() excludes its end value, so this covers 1979 through 2018
data_args = ERA5Args(
    root_dir=f"{root}/data/{source}/{dataset}/{resolution}/",
    variables=[variable],
    years=years
)

Now we specify the task arguments. In this case we are interested in forecasting only `2m_temperature` using only `2m_temperature`, but one could specify additional variables, provided that the data for those variables is downloaded. The prediction range is in hours, so if we want to predict 3 days ahead, we provide `3*24`. We also subsample the data to one time step every 6 hours, since weather conditions do not change significantly from one hour to the next.

In [5]:
from climate_learn.data.task.args import ForecastingArgs

forecasting_args = ForecastingArgs(
    in_vars=[variable],
    out_vars=[variable],
    pred_range=3*24,
    subsample=6
)
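
To make these two numbers concrete, here is a minimal sketch of how a 72-hour prediction range and 6-hour subsampling could pair inputs with targets on an hourly time axis. It illustrates the idea only and is not ClimateLearn's internal implementation; `forecast_pairs` is a hypothetical helper.

```python
def forecast_pairs(n_hours, pred_range, subsample):
    """Return (input_hour, target_hour) index pairs on an hourly time axis.

    Hypothetical helper for illustration; not part of ClimateLearn.
    """
    pairs = []
    # Keep every `subsample`-th hour as an input time step, stopping early
    # enough that the target index stays inside the time axis.
    for t in range(0, n_hours - pred_range, subsample):
        # The target is the same field `pred_range` hours later.
        pairs.append((t, t + pred_range))
    return pairs


# Ten days of hourly data, predicting 3 days ahead, one sample every 6 hours.
pairs = forecast_pairs(n_hours=24 * 10, pred_range=3 * 24, subsample=6)
print(pairs[:3])   # [(0, 72), (6, 78), (12, 84)]
print(len(pairs))  # 28
```

Note that subsampling reduces the number of training samples by a factor of 6 but leaves the lead time unchanged.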

Because climate data is so large, we need to specify how to load it into CPU memory. ClimateLearn lets us either load the entire dataset into memory at once or shard it and load it in chunks. The latter comes with the overhead of loading data multiple times in every epoch. Since the data in this tutorial has just a single variable, we use the first approach.

In [6]:
from climate_learn.data.dataset.args import MapDatasetArgs

map_dataset_args = MapDatasetArgs(
    climate_dataset_args=data_args,
    task_args=forecasting_args
)
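
A rough back-of-the-envelope estimate shows why loading everything at once is reasonable here. The sketch assumes a regular latitude-longitude grid whose shape follows from the 5.625 degree resolution, one single-level variable, and 4-byte (float32) values. These are assumptions for illustration, not a statement about how ClimateLearn stores arrays internally.

```python
from datetime import datetime


def hourly_steps(first_year, last_year):
    """Number of hourly time steps from first_year through last_year, inclusive."""
    span = datetime(last_year + 1, 1, 1) - datetime(first_year, 1, 1)
    return span.days * 24


def array_size_gib(n_steps, n_lat, n_lon, bytes_per_value=4):
    """Size in GiB of a (time, lat, lon) array; float32 is assumed by default."""
    return n_steps * n_lat * n_lon * bytes_per_value / 2**30


# Grid shape implied by the resolution: 180 degrees of latitude, 360 of longitude.
n_lat, n_lon = int(180 / 5.625), int(360 / 5.625)
steps = hourly_steps(1979, 2018)

full = array_size_gib(steps, n_lat, n_lon)
# If only every 6th time step were kept in memory.
subsampled = array_size_gib(steps // 6, n_lat, n_lon)

print(f"{n_lat} x {n_lon} grid, {steps} hourly steps")
print(f"hourly: {full:.2f} GiB, every 6 hours: {subsampled:.2f} GiB")
```

Each additional variable or pressure level multiplies this estimate, which is when sharding becomes attractive.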

Finally, we specify the data module, where we define our train-validation-testing split and the batch size.

In [7]:
from climate_learn.data import DataModuleArgs, DataModule

data_module_args = DataModuleArgs(
    dataset_args=map_dataset_args,
    train_start_year=1979,
    val_start_year=2015,
    test_start_year=2017,
    end_year=2018
)

data_module = DataModule(
    data_module_args=data_module_args,
    batch_size=128,
    num_workers=1
)
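
The four year boundaries above can be read as three consecutive blocks of years. The sketch below spells this out under the assumption that each split runs from its start year up to, but not including, the next start year, and that `end_year` is inclusive. `split_years` is a hypothetical helper, not ClimateLearn's API.

```python
def split_years(train_start, val_start, test_start, end_year):
    """Partition years into train/val/test blocks.

    Hypothetical helper; assumes end_year is inclusive.
    """
    train = list(range(train_start, val_start))
    val = list(range(val_start, test_start))
    test = list(range(test_start, end_year + 1))
    return train, val, test


train, val, test = split_years(1979, 2015, 2017, 2018)
print(len(train), "training years:", train[0], "to", train[-1])
print("validation years:", val)
print("test years:", test)
```

Splitting by whole years keeps the validation and test periods strictly later than the training period, which avoids leaking future weather into training.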

## Preparing data for downscaling

In the [downscaling task](https://uaf-snap.org/how-do-we-do-it/downscaling), we want to build a machine learning model that can map low-resolution weather patterns (source) to high-resolution weather patterns (target). In the previous section, we already downloaded a dataset for `2m_temperature` at 5.625 degrees resolution. Here, let's download a dataset also for `2m_temperature` but at 2.8125 degrees resolution.

In [8]:
hi_resolution = "2.8125"
download(root=root, source=source, dataset=dataset, resolution=hi_resolution, variable=variable)
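
To see what the two resolutions mean in terms of array shapes, the sketch below derives the grid size from the resolution string, assuming a regular latitude-longitude grid that spans 180 degrees of latitude and 360 degrees of longitude. `grid_shape` is a hypothetical helper for illustration.

```python
def grid_shape(resolution_deg):
    """Return (n_lat, n_lon) for a regular global grid at the given resolution.

    Hypothetical helper; assumes the grid covers 180 x 360 degrees.
    """
    return round(180 / resolution_deg), round(360 / resolution_deg)


low = grid_shape(float("5.625"))
high = grid_shape(float("2.8125"))
# Upsampling factor along each spatial axis.
factor = high[0] // low[0]

print("low-res grid:", low)    # (32, 64)
print("high-res grid:", high)  # (64, 128)
print("upsampling factor per axis:", factor)  # 2
```

Under these assumptions, a downscaling model for this pair of datasets maps each low-resolution field to a field with twice as many points along each axis.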

Next, we specify the dataset arguments. This is the same procedure as for forecasting, but with two datasets now: one set of arguments is for the source, and another set of arguments is for the target.

In [9]:
lowres_data_args = ERA5Args(
    root_dir=f"{root}/data/{source}/{dataset}/{resolution}/",
    variables=[variable],
    years=years
)

highres_data_args = ERA5Args(
    root_dir=f"{root}/data/{source}/{dataset}/{hi_resolution}/",
    variables=[variable],
    years=years
)

Now, we need to wrap these multiple dataset sources into one.

In [10]:
from climate_learn.data.climate_dataset.args import StackedClimateDatasetArgs

data_args = StackedClimateDatasetArgs(
    data_args=[lowres_data_args, highres_data_args]
)

Then, we specify the task arguments.

In [11]:
from climate_learn.data.task.args import DownscalingArgs

downscaling_args = DownscalingArgs(
    in_vars=[variable],
    out_vars=[variable],
    subsample=6,
)

We again need to specify how to load data into memory. This time let's try the sharding approach. In addition to `climate_dataset_args` and `task_args`, we need to specify the number of chunks to shard the dataset into. Note that in `ERA5` the data for different years is stored in different files, so the number of chunks cannot exceed the number of training years.

In [12]:
from climate_learn.data.dataset.args import ShardDatasetArgs

shard_data_args = ShardDatasetArgs(
    climate_dataset_args=data_args,
    task_args=downscaling_args,
    n_chunks=5
)
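
The constraint on `n_chunks` can be illustrated with a small sketch that splits a list of training years into contiguous, nearly equal chunks and rejects a chunk count larger than the number of years. This is an assumed splitting scheme for illustration, not ClimateLearn's sharding code; `chunk_years` is a hypothetical helper.

```python
def chunk_years(years, n_chunks):
    """Split years into n_chunks contiguous, nearly equal chunks.

    Hypothetical helper; not ClimateLearn's sharding implementation.
    """
    years = list(years)
    if n_chunks < 1 or n_chunks > len(years):
        # One file per year means a chunk needs at least one whole year.
        raise ValueError(
            f"n_chunks must be between 1 and {len(years)}, got {n_chunks}"
        )
    base, extra = divmod(len(years), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional year each.
        size = base + (1 if i < extra else 0)
        chunks.append(years[start:start + size])
        start += size
    return chunks


# Training years implied by train_start_year=1979 and val_start_year=2015.
chunks = chunk_years(range(1979, 2015), n_chunks=5)
print([len(c) for c in chunks])  # [8, 7, 7, 7, 7]
```

Fewer, larger chunks mean fewer reloads per epoch but more memory per chunk, so `n_chunks` trades one against the other.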

Finally, we specify the data module, which looks the same as for the forecasting task.

In [13]:
data_module_args = DataModuleArgs(
    dataset_args=shard_data_args,
    train_start_year=1979,
    val_start_year=2015,
    test_start_year=2017,
    end_year=2018
)

data_module = DataModule(
    data_module_args=data_module_args,
    batch_size=128,
    num_workers=1
)

Congratulations! Now you know how to load and process data with ClimateLearn. Please visit our [docs](https://climatelearn.readthedocs.io/en/latest/user-guide/datasets.html) to learn more.