# ClimateNet Dataset

In this notebook, we demonstrate how the ClimateNet dataset can be loaded using the xarray library.
Furthermore, we analyze the dataset by calculating several useful statistics and visualizing interesting samples.

In [None]:
from pathlib import Path

# specify the path to the data and output directories
out_dir = Path('/mnt/data/ai4good/out')
data_dir = Path('/mnt/data/ai4good/climatenet_new')  # expected to have subfolders 'train' and 'test' containing the train and test sets
train_dir = data_dir / 'train'
test_dir = data_dir / 'test'

## Introduction to xarray

Xarray is a Python library for working with labelled multi-dimensional arrays.
NetCDF is the recommended file format for xarray objects, and the xarray Dataset data model is modeled on the structure of a netCDF file.

**Resources:**
- [Xarray documentation](https://docs.xarray.dev/en/stable/getting-started-guide/quick-overview.html)
- [NetCDF CF Metadata Conventions](https://cfconventions.org/cf-conventions/cf-conventions.html)
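
To make the Dataset data model concrete, the following minimal sketch builds a small dataset by hand. The variable names `TMQ` and `LABELS` mirror the ones used later in this notebook, but the shapes, coordinate values, and attributes are made up for illustration and do not reflect the real ClimateNet files.

```python
import numpy as np
import xarray as xr

# Toy dataset: all values, shapes and attributes are assumptions for illustration.
toy_ds = xr.Dataset(
    data_vars={
        # each variable is given as (dimension names, data, optional attributes)
        'TMQ': (
            ('time', 'lat', 'lon'),
            np.arange(12, dtype=float).reshape(1, 3, 4),
            {'description': 'toy variable'},
        ),
        'LABELS': (('lat', 'lon'), np.zeros((3, 4), dtype=int)),
    },
    coords={
        'time': ['sample-0'],
        'lat': [-90.0, 0.0, 90.0],
        'lon': [0.0, 90.0, 180.0, 270.0],
    },
    attrs={'title': 'toy dataset'},
)

print(toy_ds)
```

A Dataset is thus a collection of named arrays (data variables) that share named dimensions and coordinates, with free-form attributes attached to both the dataset and its variables.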

In [None]:
import xarray as xr

NetCDF files can conveniently be loaded and investigated as an xarray dataset.

In [None]:
example_file = 'data-2000-12-20-01-1_5.nc'
example_ds = xr.load_dataset(train_dir / example_file)  # example dataset containing a single sample

In [None]:
example_ds

In [None]:
example_ds_dims = example_ds.dims
example_ds_coords = example_ds.coords
example_ds_vars = example_ds.data_vars

print(f'Dimensions of the example dataset: {example_ds_dims}\n')
print(f'Coordinates of the example dataset: {example_ds_coords}\n')
print(f'Variables of the example dataset: {example_ds_vars}\n')

There are four different approaches to [indexing an xarray dataset](https://docs.xarray.dev/en/stable/getting-started-guide/quick-overview.html#indexing):
- positional, by integer index (like numpy)
- `loc` or "location": positional, by coordinate label (like pandas)
- `isel` or "integer select": by dimension name and integer index
- `sel` or "select": by dimension name and coordinate label
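
The four approaches can be compared side by side on a small synthetic array. This is only a sketch: the array `toy` and its coordinate values are invented for illustration. The last line also shows `method='nearest'`, which can help when the requested coordinate does not lie exactly on the grid.

```python
import numpy as np
import xarray as xr

# Synthetic 2 x 3 array; the coordinate values are assumptions for illustration.
toy = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=('lat', 'lon'),
    coords={'lat': [-90.0, 90.0], 'lon': [0.0, 120.0, 240.0]},
)

by_position = toy[1, 2].item()               # positional, integer index
by_loc = toy.loc[90.0, 240.0].item()         # positional, coordinate label
by_isel = toy.isel(lat=1, lon=2).item()      # by dimension name, integer index
by_sel = toy.sel(lat=90.0, lon=240.0).item() # by dimension name, coordinate label

# nearest-neighbour lookup for coordinates that are not exactly on the grid
nearest = toy.sel(lat=80.0, lon=250.0, method='nearest').item()

print(by_position, by_loc, by_isel, by_sel, nearest)
```

Selecting by dimension name (`isel`, `sel`) does not depend on the order of the dimensions, which tends to make code more robust than purely positional indexing.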

In [None]:
example_var = 'TMQ'
example_desc = example_ds[example_var].attrs['description']
like_np = example_ds[example_var][0, 0, 0].values
like_pandas = example_ds[example_var].loc[
    dict(
        time='data-2000-12-20-01-1.nc',
        lat=-90.0,
        lon=0.0,
    )
].values
isel = example_ds[example_var].isel(time=0, lat=0, lon=0).values
sel = example_ds[example_var].sel(time='data-2000-12-20-01-1.nc', lat=-90.0, lon=0.0).values

print(f'The different queries for {example_var} return the same value (as they should):')
print(f'like_np: {like_np}')
print(f'like_pandas: {like_pandas}')
print(f'isel: {isel}')
print(f'sel: {sel}')
print(f'\nWe also can print the description of the variable {example_var}:\n{example_desc}')

Computations on xarray data arrays work much like they do in numpy.
As an example, we can print some useful statistics about our labels.
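
Besides passing `.values` to numpy functions, xarray data arrays also offer their own reduction methods, which can reduce over named dimensions. The sketch below uses a small made-up label array (not ClimateNet data) and additionally counts how often each class occurs, which is often more informative than a mean for class labels.

```python
import numpy as np
import xarray as xr

# Made-up label array for illustration; the class ids 0, 1, 2 are assumptions.
labels = xr.DataArray(
    np.array([[0, 0, 1],
              [2, 0, 1]]),
    dims=('lat', 'lon'),
)

np_mean = np.mean(labels.values)       # numpy on the raw values
xr_mean = labels.mean().item()         # equivalent xarray reduction
per_lat_max = labels.max(dim='lon')    # reduce over one named dimension only

# frequency of each class id
classes, counts = np.unique(labels.values, return_counts=True)

print(f'mean: {xr_mean:.3f}')
print(f'max per latitude: {per_lat_max.values}')
print(f'class counts: {dict(zip(classes.tolist(), counts.tolist()))}')
```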

In [None]:
import numpy as np

example_labels = example_ds['LABELS'].values
example_labels_max = np.max(example_labels)
example_labels_min = np.min(example_labels)
example_labels_mean = np.mean(example_labels)
example_labels_std = np.std(example_labels)

print(f'The labels have a maximum value of {example_labels_max}, a minimum value of {example_labels_min}, a mean of {example_labels_mean} and a standard deviation of {example_labels_std}.')

Just like pandas, xarray supports grouped operations. The code cell below computes the mean total (vertically integrated) precipitable water (TMQ) for each of the classes.
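
To see what such a grouped mean does, here is a minimal sketch on a toy dataset whose result can be checked by hand. The numbers and the 2 x 3 grid are invented for illustration; only the variable names follow this notebook.

```python
import numpy as np
import xarray as xr

# Toy data: values and grid size are assumptions chosen so the result is easy to verify.
toy_ds = xr.Dataset({
    'TMQ': (('lat', 'lon'), np.array([[1.0, 3.0, 10.0],
                                      [20.0, 5.0, 30.0]])),
    'LABELS': (('lat', 'lon'), np.array([[0, 0, 1],
                                         [2, 0, 1]])),
})

# group every grid cell by its class label and average TMQ within each group
class_means = toy_ds['TMQ'].groupby(toy_ds['LABELS']).mean()

print(class_means)
```

The result has a single dimension named `LABELS` with one entry per class: class 0 averages the cells 1, 3 and 5, class 1 averages 10 and 30, and class 2 contains only the cell with value 20.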

In [None]:
example_ds['TMQ'].groupby(example_ds['LABELS']).mean()

Last but not least, xarray data arrays can be plotted directly:

In [None]:
example_ds['LABELS'].plot()

## ClimateNet Dataset Analysis