# Data Processing for Weather Forecasting

ClimateLearn makes it super easy to prepare data for your machine learning pipelines. In this tutorial, we'll see how to download [ERA5](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) data from [WeatherBench](https://github.com/pangeo-data/WeatherBench) and prepare it for both the forecasting and [downscaling](https://uaf-snap.org/how-do-we-do-it/downscaling) tasks. This tutorial is intended for use in Google Colab.

## Google Colab setup
You might need to restart the kernel after installing ClimateLearn so that your Colab environment knows to use the correct package versions.

In [None]:
!pip install climate-learn
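
After restarting, a quick check like the one below can confirm which version of the package the environment actually sees. This is only a sketch built on the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration and is not part of ClimateLearn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "climate-learn" is the distribution name used in the pip command above.
print(installed_version("climate-learn"))
```

If this prints `None`, the install did not take effect in the current kernel.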

In [None]:
from google.colab import drive
drive.mount("/content/drive")

## Download

The following cell will take several minutes to run - the scale of climate data is huge!

In [None]:
from climate_learn.data import download

root = "/content/drive/MyDrive/ClimateLearn"
source = "weatherbench"
dataset = "era5"
resolution = "5.625"
variable = "2m_temperature"
years = range(1979, 2019)  # range() excludes its end, so this covers 1979 through 2018

download(root=root, source=source, dataset=dataset, resolution=resolution, variable=variable)
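
To get a rough feel for the scale involved, the arithmetic below estimates the raw size of one hourly variable at this resolution. It is a back-of-the-envelope sketch, not a statement about the actual files. It assumes a regular global grid, hourly time steps from 1979 through 2018, and uncompressed 4-byte floats, so the real download size will differ.

```python
from datetime import datetime

resolution_deg = 5.625

# A regular global grid spans 180 degrees of latitude and 360 of longitude.
n_lat = round(180 / resolution_deg)
n_lon = round(360 / resolution_deg)

# Hourly time steps from the start of 1979 to the end of 2018.
span = datetime(2019, 1, 1) - datetime(1979, 1, 1)
n_hours = span.days * 24

# Assumption: 4 bytes per value (float32) and no compression.
n_bytes = n_hours * n_lat * n_lon * 4

print(f"grid: {n_lat} x {n_lon}")
print(f"hourly time steps: {n_hours}")
print(f"approx. size: {n_bytes / 1e9:.2f} GB")
```

Even at this coarse resolution a single variable holds hundreds of millions of values, and finer grids grow quadratically from there.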

ClimateLearn comes with some utilities to view the downloaded data in its raw format. This can be useful as a quick sanity check that you have the data you expect. Climate data is natively stored in the [NetCDF format](https://www.unidata.ucar.edu/software/netcdf/), which means it comes bundled with lots of helpful named metadata such as latitude, longitude, and time. However, we want the data in a form that can be easily ingested by PyTorch machine learning models.

In [None]:
from climate_learn.utils.data import load_dataset, view

my_dataset = load_dataset(f"{root}/data/{source}/{dataset}/{resolution}/{variable}")
view(my_dataset)

## Preparing data for forecasting

In this cell, we specify the dataset arguments. The temporal range of ERA5 data on WeatherBench is 1979 to 2018.

In [None]:
from climate_learn.data.climate_dataset.args import ERA5Args

data_args = ERA5Args(
    root_dir=f"{root}/data/{source}/{dataset}/{resolution}/",
    variables=[variable],
    years=years
)

Now we specify the task arguments. In this case we forecast `2m_temperature` from `2m_temperature` alone, but one could specify additional input or output variables, provided that the data for those variables has been downloaded. The prediction range is given in hours, so to predict 3 days ahead we pass `3*24`. We also subsample the data to one time step every 6 hours, since weather conditions change little from one hour to the next.

In [None]:
from climate_learn.data.tasks.args import ForecastingArgs

forecasting_args = ForecastingArgs(
    dataset_args=data_args,
    in_vars=[variable],
    out_vars=[variable],
    pred_range=3*24,
    subsample=6
)
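
To picture what the prediction range and subsampling mean for the training samples, here is a minimal sketch in plain Python. It pairs each input hour with the target hour `pred_range` later and keeps only every `subsample`-th input. The function `forecasting_pairs` is hypothetical and written for illustration; ClimateLearn's internal indexing may differ in detail.

```python
def forecasting_pairs(num_hours, pred_range, subsample):
    """Return (input_hour, target_hour) index pairs for an hourly series.

    Illustrative only: an input at hour t is paired with the target at
    hour t + pred_range, and inputs are taken every `subsample` hours.
    """
    pairs = []
    for t in range(0, num_hours - pred_range, subsample):
        pairs.append((t, t + pred_range))
    return pairs


# Five days of hourly data, predicting 3 days ahead, sampled every 6 hours.
pairs = forecasting_pairs(num_hours=5 * 24, pred_range=3 * 24, subsample=6)
print(len(pairs))
print(pairs[:3])
```

A longer prediction range leaves fewer usable pairs at the end of the series, and a larger subsampling step thins the pairs out evenly.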

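Finally, we specify the data module, where we define our train-validation-test split and the batch size. The split is made by year: training data starts in 1979, validation data in 2015, and test data in 2017, with the data ending in 2018.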

In [None]:
from climate_learn.data import DataModuleArgs, DataModule

data_module_args = DataModuleArgs(
    task_args=forecasting_args,
    train_start_year=1979,
    val_start_year=2015,
    test_start_year=2017,
    end_year=2018
)

data_module = DataModule(
    data_module_args=data_module_args,
    batch_size=128,
    num_workers=1
)
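
The split arguments above can be read as year ranges. The sketch below spells that reading out, under the assumption that each split runs up to the year before the next one starts and that `end_year` is inclusive. The helper `split_years` is hypothetical and not part of ClimateLearn.

```python
def split_years(train_start_year, val_start_year, test_start_year, end_year):
    """Partition years into train/val/test lists.

    Assumption (not confirmed by the library docs): each split ends the year
    before the next split begins, and end_year is included in the test split.
    """
    return {
        "train": list(range(train_start_year, val_start_year)),
        "val": list(range(val_start_year, test_start_year)),
        "test": list(range(test_start_year, end_year + 1)),
    }


splits = split_years(1979, 2015, 2017, 2018)
for name, split in splits.items():
    print(f"{name}: {split[0]} to {split[-1]} ({len(split)} years)")
```

Splitting by whole years keeps the validation and test periods strictly later than the training period, which avoids leaking future weather into training.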