# Data

## WeatherBench

We use the [WeatherBench dataset](https://github.com/pangeo-data/WeatherBench). WeatherBench consists of preprocessed meteorological data from [ERA5](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5). ERA5 provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables. The data cover the Earth on a 30 km grid and resolve the atmosphere using 137 levels from the surface up to a height of 80 km. WeatherBench processes this dataset to provide wind direction, wind speed, temperature, geopotential, vorticity, humidity, solar radiation, cloud cover, and precipitation data at various levels, with hourly readings going back to 1979. The data are provided at three spatial resolutions: 1.40625 deg, 2.8125 deg, and 5.625 deg. The total size of all provided data is 5.8 TB, but we download only the data that we need.
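
To give a feel for what the three resolutions mean in practice, the short sketch below computes the number of grid points per field. It assumes a regular latitude-longitude grid spanning 180 x 360 degrees; the helper name `grid_shape` is ours and is not part of WeatherBench.

```python
# Grid dimensions implied by each resolution, assuming a regular
# latitude-longitude grid covering 180 x 360 degrees.
RESOLUTIONS_DEG = [1.40625, 2.8125, 5.625]


def grid_shape(resolution_deg):
    """Return (n_lat, n_lon) for a regular global grid at this spacing."""
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat, n_lon


grid_shapes = {res: grid_shape(res) for res in RESOLUTIONS_DEG}
for res, (n_lat, n_lon) in grid_shapes.items():
    print(f"{res} deg -> {n_lat} x {n_lon} grid points")
```

Under this assumption, halving the grid spacing quadruples the number of points per field, which is one reason to start with the coarsest resolution.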

## Loading The Data

We can download zip files containing the data we are interested in directly from [here](https://dataserv.ub.tum.de/index.php/s/m1524895?path=%2F). Each zip file contains yearly [NetCDF](https://en.wikipedia.org/wiki/NetCDF) files, which can be combined into a single `xarray.Dataset`.

In [None]:
# Downloads 2m temperature data at 5.625 degree resolution
!wget --no-check-certificate "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg%2F2m_temperature&files=2m_temperature_5.625deg.zip" -O '2m_temperature_5.625deg.zip'
!mkdir -p '2m_temperature'
!unzip -d '2m_temperature/' '2m_temperature_5.625deg.zip'
!rm '2m_temperature_5.625deg.zip'

In [None]:
# Loads the yearly NetCDF files as a single xarray.Dataset
import xarray as xr

data = xr.open_mfdataset('2m_temperature/*.nc', combine='by_coords')
data

## CliMetLab

Instead of downloading the zip files manually, we will use the CliMetLab package. CliMetLab is a Python package that aims to simplify access to climate and meteorological datasets, allowing users to focus on science instead of technical issues such as data access and data formats. It is mostly intended to be used in Jupyter notebooks and to be interoperable with popular data analysis packages such as NumPy, Pandas, Xarray, SciPy, and Matplotlib, as well as machine learning frameworks such as TensorFlow, Keras, and PyTorch.

In [None]:
# !pip install climetlab
# !pip install climetlab_weatherbench
# !pip install numexpr=='2.7.3'
import climetlab as cml
data = cml.load_dataset("weather-bench", parameter="2m_temperature", resolution=5.625)
data

### Plotting

CliMetLab also provides plotting features that make it easy to visualize various kinds of meteorological data.

In [None]:
cml.plot_map(
    data.to_xarray().t2m[0],
    foreground=dict(
        map_grid=False,
        map_boundaries=True,
    ),
)

## Saving Data

To process the data, we keep only the variables we need, specifically `2m_temperature` and `total_precipitation`. If a variable has multiple vertical levels, we use only the level closest to the Earth's surface. We also calculate running-window averages for longer-term climatology forecasting using xarray's [`rolling`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.rolling.html) function.

In [None]:
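# Keeps every (24*365)-th hourly sample, i.e. roughly one entry per year
# (leap days are ignored, so the kept timestamps drift slightly).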
yearly = data.to_xarray().t2m.thin(time=24*365)

# Saves as a NetCDF File.
yearly.to_netcdf("../data/yearly_2m_temperature.nc")
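
A fixed stride of `24*365` hours does not account for leap days, so the selected timestamps slowly drift away from 1 January. The sketch below illustrates this with plain `datetime` arithmetic. The start timestamp is an assumption made for illustration; the actual first timestamp should be checked against the dataset's `time` coordinate.

```python
from datetime import datetime, timedelta

# Assumed first timestamp of the hourly series (hypothetical; check the
# dataset's time coordinate for the real value).
start = datetime(1979, 1, 1, 0)
step = timedelta(hours=24 * 365)  # the stride used when thinning

# Timestamps selected by keeping every (24*365)-th hourly sample.
picked = [start + k * step for k in range(40)]

print("second sample:", picked[1])
print("last sample:  ", picked[-1])
```

Under this assumption, the last of the 40 samples falls on 22 December 2017 rather than 1 January 2018, because ten leap days occur in between. If exact calendar alignment matters, selecting by date instead of by a fixed stride would avoid the drift.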

## Data Augmentation

One of our innovations is that we will also augment the data with rotations of the original dataset and with running-window averages for longer-term climatology forecasting.

In [None]:
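# Running averages over non-overlapping windows of t hourly samples
x = data.to_xarray().t2m

def n_avg(x, t):
    # The rolling mean is labelled at the end of each window, so start at
    # index t - 1 to keep only complete windows (avoids a leading NaN).
    return x.rolling(time=t).mean().isel(time=slice(t - 1, None, t))

daily_averages = n_avg(x, 24)
weekly_averages = n_avg(x, 24*7)
monthly_averages = n_avg(x, 24*30)
yearly_averages = n_avg(x, 24*365)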
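
To make the windowing concrete, here is a minimal NumPy sketch of averaging over consecutive, non-overlapping windows, which is the result that a rolling mean combined with a stride of the same length is meant to produce. The helper `block_means` and the synthetic series are ours, for illustration only.

```python
import numpy as np


def block_means(values, window):
    """Average consecutive, non-overlapping blocks of `window` samples.

    The first axis is treated as time. Trailing samples that do not fill
    a whole block are dropped.
    """
    values = np.asarray(values, dtype=float)
    n_blocks = values.shape[0] // window
    trimmed = values[: n_blocks * window]
    return trimmed.reshape(n_blocks, window, *values.shape[1:]).mean(axis=1)


# Synthetic stand-in for an hourly series: three days of hourly values.
hourly = np.arange(72)
daily = block_means(hourly, 24)
print(daily)
```

Each output value is the mean of one full day of hourly samples, so a 72-hour series yields three daily averages.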

In [None]:
# TODO: Augment Data with Rotations
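
One possible reading of "rotation" is a rotation about the Earth's axis, which on a regular latitude-longitude grid is a cyclic shift along the longitude axis. The sketch below shows that interpretation on a synthetic field. Both the interpretation and the helper `rotate_longitude` are assumptions, not the final augmentation scheme.

```python
import numpy as np


def rotate_longitude(field, shift):
    """Cyclically shift a (..., lat, lon) array by `shift` grid columns."""
    return np.roll(field, shift, axis=-1)


# Synthetic stand-in for one 5.625 degree field: 32 latitudes x 64 longitudes.
rng = np.random.default_rng(0)
field = rng.normal(size=(32, 64))

# Rotate by 90 degrees of longitude: 90 / 5.625 = 16 grid columns.
rotated = rotate_longitude(field, 16)
```

A longitude shift keeps each latitude band intact but moves features relative to the fixed geography, so whether it helps a given model is something to verify empirically.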

## Prepare Data for Training/Testing

We split the data into training and testing sets, using data from 2017-2018 for testing and pre-2017 data for training. Then, we normalize both the training and testing data.


In [None]:
# TODO: split data into testing, training, validation
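
As a rough sketch of the year-based split, the example below builds a synthetic hourly time axis with NumPy and separates pre-2017 samples from 2017-2018 samples. The arrays are stand-ins for the real dataset, and a validation split is not shown.

```python
import numpy as np

# Synthetic hourly timestamps standing in for the dataset's time coordinate.
time = np.arange(
    np.datetime64("2015-01-01T00"),
    np.datetime64("2019-01-01T00"),
    np.timedelta64(1, "h"),
)
values = np.arange(time.size, dtype=float)

# Pre-2017 samples are training data; 2017-2018 samples are testing data.
cutoff = np.datetime64("2017-01-01T00")
end = np.datetime64("2019-01-01T00")
train_mask = time < cutoff
test_mask = (time >= cutoff) & (time < end)

train_values, test_values = values[train_mask], values[test_mask]
print(train_values.size, test_values.size)
```

Splitting by time rather than at random keeps the test period strictly after the training period, which matches how a forecast would be used.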


In [None]:
# TODO: Normalize dataset
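
A common way to normalize is to standardize with the mean and standard deviation of the training split and to reuse those same statistics for the test split. The sketch below assumes that approach and uses synthetic arrays; the choice of a single global mean and standard deviation is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins shaped (time, lat, lon), with values roughly in kelvin.
train = rng.normal(loc=280.0, scale=15.0, size=(100, 32, 64))
test = rng.normal(loc=281.0, scale=15.0, size=(20, 32, 64))

# Statistics come from the training split only, so that no information
# from the test period leaks into preprocessing.
mean = train.mean()
std = train.std()

train_norm = (train - mean) / std
test_norm = (test - mean) / std
```

After this step the training data have zero mean and unit variance, while the test data are only approximately standardized, which is expected.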


In [None]:
# TODO: Create Keras DataGenerators for testing, training, validation
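
The batching logic behind such a generator can be sketched independently of Keras. The plain Python generator below pairs each input field with the field a fixed number of time steps later. The function name, the `lead_steps` parameter, and the (time, lat, lon) layout are assumptions for illustration.

```python
import numpy as np


def batch_generator(fields, lead_steps, batch_size, rng=None):
    """Yield (inputs, targets) batches from an array shaped (time, lat, lon).

    Each target is the field `lead_steps` time steps after its input.
    If `rng` is given, the sample order is shuffled once per pass.
    """
    n_samples = fields.shape[0] - lead_steps
    order = np.arange(n_samples)
    if rng is not None:
        rng.shuffle(order)
    for start in range(0, n_samples, batch_size):
        idx = order[start : start + batch_size]
        yield fields[idx], fields[idx + lead_steps]


# Tiny synthetic example: 10 time steps on a 2 x 3 grid.
fields = np.arange(10 * 2 * 3, dtype=float).reshape(10, 2, 3)
batches = list(batch_generator(fields, lead_steps=3, batch_size=4))
```

A Keras-specific generator would wrap the same indexing in whatever interface the chosen Keras version expects; this sketch deliberately avoids depending on Keras.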


## Future Work

Note that we are using a relatively small subset of the provided data. Processing larger datasets requires more computation, but using all features and all levels may result in more accurate predictions, particularly because accurate prediction of temperature and precipitation depends on the other variables provided.