# How to work with ERA5 single levels on Earth Data Hub
### Modelling of climate zones in Europe

***
This notebook shows how to access and use the `ecv-for-climate-change-1979-2023.zarr` dataset on Earth Data Hub (EDH).

The first goal is to compute monthly averages over Europe.

The second goal is to model a given number of climate zones using a profile classification model.
***

## What you will learn:

* how to access and preview the dataset
* how to select and reduce the data
* how to define a profile classification model (PCM)
* how to plot the results

## Preparation of software packages for Google Colab
***

* The zarr package is needed by xarray to use `engine="zarr"` for Earth Data Hub datasets; it must be installed before xarray is imported
* The s3fs package is needed to access S3


In [2]:
# install dependencies
# this cell might need to be run twice to solve version conflicts
# cannot use the apt package python3-zarr because its numcodecs is too old and lacks the BitRound compressor
#!apt-get remove -y python3-numcodecs

!pip install zarr
!pip install cartopy
# use latest pyxpcm to avoid incompatibility with new numpy versions
#!pip install pyxpcm
!pip install ipython==8.3.0
!pip install git+https://github.com/obidam/pyxpcm.git@master



Support for the S3 filesystem, including the Python package:

In [None]:
!apt-get install s3fs

#!pip install fsspec==2023.6.0
!pip uninstall -y s3fs
#!pip uninstall -y gcsfs
#!pip uninstall -y fsspec

# the s3fs version must match the already installed version of gcsfs
!pip install s3fs==2023.6.0


## Load packages needed for this tutorial

In [1]:
import os, sys
import numpy as np
import pandas as pd
import xarray as xr
print("xarray: %s, %s" % (xr.__version__, xr.__file__))

import matplotlib.pyplot as plt
%matplotlib inline

import pyxpcm
print("pyxpcm: %s, %s" % (pyxpcm.__version__, pyxpcm.__file__))

import matplotlib as mpl
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import seaborn as sns

xarray: 2023.7.0, /usr/local/lib/python3.10/dist-packages/xarray/__init__.py


ModuleNotFoundError: No module named 'pyxpcm'

## Data access and preview
***

Xarray and Dask work together following a lazy principle. This means that when you access and manipulate a Zarr store, the data is not immediately downloaded and loaded in memory. Instead, Dask constructs a task graph that represents the operations to be performed. A smart user will reduce the amount of data that needs to be downloaded before the computation takes place (e.g., when the `.compute()` or `.plot()` methods are called).

To preview the data, only the dataset metadata must be downloaded. Xarray does this automatically:

***

In [None]:
# your `~/.netrc` file MUST contain your credentials for earthdatahub.com
#
# machine earthdatahub.com
#   login {your_username}
#   password {your_password}

ds = xr.open_dataset("s3://hedp/era5/ecv-for-climate-change-1979-2023.zarr", chunks={}, engine="zarr").astype("float32")
ds
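The credentials file described in the cell above can be checked with Python's standard `netrc` module before opening the dataset. The sketch below writes placeholder credentials (`my_username` and `my_password` are hypothetical values) to a temporary file so that your real `~/.netrc` is not touched, then reads them back:

```python
import netrc
import os
import tempfile

# Placeholder credentials, written to a temporary file so that the real
# ~/.netrc is left untouched (hypothetical values, for illustration only).
content = (
    "machine earthdatahub.com\n"
    "  login my_username\n"
    "  password my_password\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(content)
    netrc_path = f.name

try:
    # authenticators() returns a (login, account, password) tuple,
    # or None when the host has no entry in the file.
    auth = netrc.netrc(netrc_path).authenticators("earthdatahub.com")
finally:
    os.remove(netrc_path)

login, password = auth[0], auth[2]
print(f"login={login!r}, password set: {bool(password)}")
```

To check your own file, call `netrc.netrc()` without a path; a result of `None` from `authenticators()` means the host entry is missing.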

In [None]:
# ERA5 climate variables from EarthDataHub
# url for testing, not official
#dataset_url = "https://user1:lojppbmcw2EMwVRSHv8s0OKR@earthdatahub.com/stores/hedp/era5/ecv-for-climate-change-1979-2023.zarr"
dataset_url = "https://user1:lojppbmcw2EMwVRSHv8s0OKR@earthdatahub.com/stores/ecmwf-era5-single-levels/reanalysis-era5-single-levels-v0.zarr"
ds = xr.open_dataset(dataset_url, chunks={}, engine="zarr", storage_options={"client_kwargs": {"trust_env": True}})
# .astype("float32")
ds
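The lazy principle described at the start of this section can be illustrated with a toy example in plain Python. This is only a sketch of the idea, not how Dask is implemented: the `Lazy` class and `fake_download` function are hypothetical names, and unlike Dask this toy always fetches the full data when `compute()` is called.

```python
class Lazy:
    """Record operations instead of running them (toy stand-in for a task graph)."""

    def __init__(self, load, steps=()):
        self._load = load      # function that fetches the data when called
        self._steps = steps    # operations recorded so far

    def map(self, func):
        # Return a new Lazy object with one more recorded step; nothing runs yet.
        return Lazy(self._load, self._steps + (func,))

    def compute(self):
        # Only now is the data "downloaded" and the recorded steps applied.
        data = self._load()
        for func in self._steps:
            data = func(data)
        return data


downloads = []


def fake_download():
    downloads.append("fetched")   # stands in for a remote read
    return list(range(10))


lazy = Lazy(fake_download).map(lambda xs: xs[:3]).map(sum)
calls_before = len(downloads)     # building the chain fetched nothing
result = lazy.compute()           # sum of the first three values
calls_after = len(downloads)      # compute() triggered exactly one fetch
print(calls_before, result, calls_after)
```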

## Working with data

Datasets on EDH are typically very large and remotely hosted. Typical use implies a selection of the data followed by one or more reduction steps to be performed in a local or distributed Dask environment.

The structure of a workflow that uses EDH data looks like this:
1. data selection
2. (optional) data reduction
3. (optional) visualization

## Long-term monthly averages of the variables for Europe

### 1. Data selection

We perform a geographical selection corresponding to central Europe. This reduces the amount of data that will be downloaded from EDH.

In [None]:
ds_europe = ds.sel(**{"latitude": slice(55, 45), "longitude": slice(2, 24)})
ds_europe

### 2. Data reduction

Now we want long-term monthly averages, but only for the 30-year period from 1991 to 2020 (the dataset starts at 1940-01-01):

In [None]:
ds_europe_30yrs = ds_europe.sel(valid_time=slice("1991-01-01", "2020-12-31"))
ds_europe_30yrs
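It can help to estimate how much data the spatial and temporal selections above still cover. The back-of-the-envelope sketch below assumes a 0.25 degree grid, hourly time steps and 4-byte values; none of these is stated in this notebook, so check them against the dataset preview. The result is the uncompressed size per variable, which is only a rough guide to the actual download volume.

```python
from datetime import datetime

# Assumptions (not stated in the notebook): 0.25 degree grid, hourly time
# steps, and 4 bytes per value (float32).
GRID_STEP = 0.25
BYTES_PER_VALUE = 4


def n_points(start, stop, step=GRID_STEP):
    """Number of grid points between two bounds, both included."""
    return int(round(abs(stop - start) / step)) + 1


n_lat = n_points(55, 45)   # latitude slice(55, 45)
n_lon = n_points(2, 24)    # longitude slice(2, 24)

# Hours from 1991-01-01 00:00 up to, but not including, 2021-01-01 00:00
n_hours = int((datetime(2021, 1, 1) - datetime(1991, 1, 1)).total_seconds() // 3600)

bytes_per_variable = n_lat * n_lon * n_hours * BYTES_PER_VALUE
print(f"{n_lat} x {n_lon} grid points, {n_hours} time steps")
print(f"about {bytes_per_variable / 1e9:.2f} GB per variable (uncompressed)")
```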

Wind speed is an interesting variable to add to the modelling of climate zones, but it takes some time to calculate it from the u and v components, which must happen before any spatial or temporal aggregation. Therefore wind speed is disabled by default, but can be enabled by setting `USE_WINDSPEED` to `True`.

In [None]:
USE_WINDSPEED = False

if USE_WINDSPEED:
  ds_europe_30yrs = ds_europe_30yrs.assign(windspeed=lambda x: np.sqrt(x.u10 * x.u10 + x.v10 * x.v10))
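The reason the wind speed has to be computed before aggregating can be seen with two made-up samples. In this small NumPy sketch (the values are hypothetical), averaging the components first lets opposite wind directions cancel out:

```python
import numpy as np

# Two hypothetical hourly samples with opposite wind directions
u10 = np.array([3.0, -3.0])   # eastward component [m/s]
v10 = np.array([4.0, -4.0])   # northward component [m/s]

# Speed computed per time step, then averaged (the order used in this notebook)
mean_of_speed = np.sqrt(u10**2 + v10**2).mean()

# Components averaged first, then converted to a speed (not equivalent)
speed_of_mean = np.sqrt(u10.mean() ** 2 + v10.mean() ** 2)

print(mean_of_speed, speed_of_mean)
```

The first value is the mean wind speed; the second is the magnitude of the mean wind vector, which is a different quantity.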


Long-term monthly averages:

In [None]:
ds_europe_lt_monthly = ds_europe_30yrs.groupby("valid_time.month").mean("valid_time")
ds_europe_lt_monthly
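What the `groupby("valid_time.month").mean("valid_time")` call computes can be sketched in plain NumPy on a tiny made-up series: all time steps that fall in the same calendar month, regardless of year, are averaged together. The series below is hypothetical and monthly rather than hourly to keep it short.

```python
import numpy as np

# Hypothetical monthly time series covering two years (24 values)
times = np.arange("2019-01", "2021-01", dtype="datetime64[M]")
values = np.arange(24, dtype="float64")

# Month number (1..12) of every time step; datetime64[M] counts months since 1970-01
month = times.astype("int64") % 12 + 1

# One mean per calendar month, taken over all years
climatology = np.array([values[month == m].mean() for m in range(1, 13)])
print(climatology)
```

The result has one entry per calendar month, which corresponds to the new `month` dimension of `ds_europe_lt_monthly`.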



At this point, no data has been downloaded yet, nor loaded in memory. However, the selection is small enough to call `.compute()` on it. This will trigger the download of data from EDH and load it in memory.

We can measure the time it takes, which should be about 8 minutes:


In [None]:
%%time

ds_europe_lt_monthly = ds_europe_lt_monthly.compute()

### 3. Visualization

We can plot the average precipitation for July on a map:

In [None]:
from cartopy import crs, feature
import matplotlib.pyplot as plt

# select July (month 7) from the long-term monthly averages
ds_europe_lt_monthly_jul = ds_europe_lt_monthly.sel(month=7)
jul_tp = ds_europe_lt_monthly_jul.tp

_, ax = plt.subplots(
    figsize=(6, 6),
    subplot_kw={"projection": crs.Miller()},
)
jul_tp.plot(
    ax=ax,
    cmap="YlOrRd",
    transform=crs.PlateCarree(),
    cbar_kwargs={"orientation": "horizontal", "pad": 0.05, "aspect": 40, "label": "Average precipitation for July [m]"},
)
ax.coastlines()
ax.add_feature(feature.BORDERS)
ax.set_title("Average July precipitation")
plt.show()
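The colour bar above is in metres, which gives very small numbers. A common alternative is millimetres per day. The sketch below assumes that the averaged values are hourly accumulations in metres (an assumption about the dataset, not something this notebook states) and uses a made-up 2 x 2 grid:

```python
import numpy as np

# Hypothetical long-term mean hourly precipitation on a 2 x 2 grid [m per hour]
tp_m_per_hour = np.array([[0.00010, 0.00005],
                          [0.00020, 0.00000]])

# 1 m = 1000 mm, and a day holds 24 hourly accumulations
tp_mm_per_day = tp_m_per_hour * 1000.0 * 24.0
print(tp_mm_per_day)
```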

## Profile Classification Model (PCM)

We want to determine homogeneous climatic zones in Europe using the monthly long-term averages.

### Create a model

Let's import the Profile Classification Model (PCM) constructor:

In [None]:
from pyxpcm.models import pcm

A PCM can be created independently of any dataset using the class constructor.

A PCM requires a number of classes (or clusters) and a dictionary to define the list of features and their profile axis:

In [None]:
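# profile axis of the PCM features: negated month numbers
# note: np.arange(-1, -12, -1) yields -1 ... -11 (11 levels)
z = np.arange(-1, -12, -1)
if USE_WINDSPEED:
  pcm_features = {'temperature': z, 'precipitation': z, 'windspeed': z}
else:
  pcm_features = {'temperature': z, 'precipitation': z}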

We can now instantiate a PCM, say with 8 classes:

In [None]:
# error in PCA:
# ValueError: n_components=15 must be between 0 and min(n_samples, n_features)=11 with svd_solver='full'
# n_components is set somewhere internally
# -> try without PCA: reduction=0

m = pcm(K=8, features=pcm_features, reduction=0)
m
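The `ValueError` quoted in the cell above reflects a general limit of PCA: a data matrix cannot yield more principal components than `min(n_samples, n_features)`. The NumPy sketch below shows this on a random matrix with 11 columns (hypothetical data, chosen to match the 11 reported in the error message):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 100 profiles, each with 11 levels
X = rng.normal(size=(100, 11))
X_centered = X - X.mean(axis=0)

# The SVD behind PCA yields at most min(n_samples, n_features) components
singular_values = np.linalg.svd(X_centered, compute_uv=False)
max_components = min(X.shape)

print(len(singular_values), max_components)
```

Requesting 15 components from such a matrix therefore cannot work, which is why the dimensionality reduction is switched off here with `reduction=0`.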

### Fit the model on data

Fitting can be done on any dataset coherent with the PCM definition, in the sense that it must have the feature variables of the PCM.

To tell the PCM how to identify features in any `xarray.Dataset`, we need to provide a dictionary that maps feature names to variable names:

In [None]:
if USE_WINDSPEED:
  features_in_ds = {'temperature': 't2m', 'precipitation': 'tp', 'windspeed': 'windspeed'}
else:
  features_in_ds = {'temperature': 't2m', 'precipitation': 'tp'}


which means that the PCM feature `temperature` is to be found in the dataset variable `t2m`.

We also need to specify the profile dimension of the dataset variables:

In [None]:
features_pdim='month'
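Conceptually, using `month` as the profile dimension means that every grid point contributes one sample whose features are its 12 monthly values. The NumPy sketch below illustrates that reshaping on random data; it is an illustration of the idea, not pyXpcm's actual preprocessing code.

```python
import numpy as np

n_month, n_lat, n_lon = 12, 3, 5
rng = np.random.default_rng(0)

# Hypothetical long-term monthly means with dimensions (month, latitude, longitude)
t2m = rng.normal(280.0, 5.0, size=(n_month, n_lat, n_lon))

# Flatten the two spatial dimensions and transpose, so that every row is
# the 12-month profile of one grid point
profiles = t2m.reshape(n_month, n_lat * n_lon).T

print(profiles.shape)   # (number of grid points, number of months)
```

With this layout, row `i * n_lon + j` holds the profile of the grid point at latitude index `i` and longitude index `j`.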

The values of the profile dimension must be <= 0, so we negate the month numbers:

In [None]:
ds_europe_lt_monthly_neg = ds_europe_lt_monthly.assign_coords(month=(-ds_europe_lt_monthly.month))

In [None]:
ds_europe_lt_monthly_neg = ds_europe_lt_monthly_neg.compute()
ds_europe_lt_monthly_neg

Now we're ready to fit the model on this dataset:

In [None]:
m.fit(ds_europe_lt_monthly_neg, features=features_in_ds, dim=features_pdim)
m

### Classify data

Now that the PCM is fitted, we can predict the class labels:

In [None]:
m.predict(ds_europe_lt_monthly_neg, features=features_in_ds, dim=features_pdim, inplace=True)
ds_europe_lt_monthly_neg

Prediction labels are automatically added to the dataset as `PCM_LABELS` because the option `inplace` was set to `True`.

pyXpcm uses a Gaussian Mixture Model (GMM) classifier by default, which is a fuzzy classifier. So we can also predict the probability of each class for all profiles, the so-called *posteriors*:

In [None]:
m.predict_proba(ds_europe_lt_monthly_neg, features=features_in_ds, dim=features_pdim, inplace=True)
ds_europe_lt_monthly_neg

which are added to the dataset as the `PCM_POST` variable. The class probabilities of each profile have a new dimension, `pcm_class` by default, that goes from 0 to K-1.

In [None]:
classno = 2
ds_plt = ds_europe_lt_monthly_neg['PCM_POST'].sel(pcm_class=classno)
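How labels and posteriors relate in a Gaussian mixture can be shown with a one-dimensional toy model. The sketch below uses made-up mixture parameters with K = 2 classes and plain NumPy; it illustrates the principle only and does not call pyXpcm.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


# Hypothetical fitted mixture with K = 2 classes
weights = np.array([0.5, 0.5])
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])

x = np.array([0.0, 2.0, 4.0])   # three samples to classify

# Weighted density of every sample under every class, shape (n_samples, K)
weighted = weights * gaussian_pdf(x[:, None], means, stds)

# Bayes' rule: normalise so that each row sums to 1
posteriors = weighted / weighted.sum(axis=1, keepdims=True)

# Hard labels are the class with the highest posterior
labels = posteriors.argmax(axis=1)
print(posteriors.round(3))
print(labels)
```

The sample halfway between the two class means gets equal posteriors for both classes, which is the kind of ambiguity a hard label alone would hide.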


### Geographic distribution of classes

In [None]:
ds_plt = ds_europe_lt_monthly_neg['PCM_LABELS']

_, ax = plt.subplots(
    figsize=(6, 6),
    subplot_kw={"projection": crs.Miller()},
)
ds_plt.plot(
    ax=ax,
    cmap="tab20",
    transform=crs.PlateCarree(),
    cbar_kwargs={"orientation": "horizontal", "pad": 0.05, "aspect": 40, "label": "PCM class numbers"},
)
ax.coastlines()
ax.add_feature(feature.BORDERS)
ax.set_title("PCM classes")
plt.show()


Show probabilities for a selected class:

In [None]:
classno = 2
ds_plt = ds_europe_lt_monthly_neg['PCM_POST'].sel(pcm_class=classno)

cmap = sns.light_palette("blue", as_cmap=True)

_, ax = plt.subplots(
    figsize=(6, 6),
    subplot_kw={"projection": crs.Miller()},
)
ds_plt.plot(
    ax=ax,
    cmap=cmap,
    transform=crs.PlateCarree(),
    cbar_kwargs={"orientation": "horizontal", "pad": 0.05, "aspect": 40, "label": f"Probabilities for class no {classno}"},
)
ax.coastlines()
ax.add_feature(feature.BORDERS)
ax.set_title(f"PCM Probabilities for class no {classno}")
plt.show()

### Prediction

It is important to note that once the PCM is fitted, you can predict labels for any dataset, as long as it has the PCM features.

For instance, let's predict labels for a single year:

In [None]:
ds_europe_2023 = ds_europe.sel(valid_time=slice("2023-01-01", "2023-12-31"))
if USE_WINDSPEED:
  ds_europe_2023 = ds_europe_2023.assign(windspeed=lambda x: np.sqrt(x.u10 * x.u10 + x.v10 * x.v10))


Aggregate to monthly data and prepare for the PCM:

In [None]:
ds_europe_2023_monthly = ds_europe_2023.groupby("valid_time.month").mean("valid_time")
ds_europe_2023_monthly_neg = ds_europe_2023_monthly.assign_coords(month=(-ds_europe_2023_monthly.month))
ds_europe_2023_monthly_neg = ds_europe_2023_monthly_neg.compute()
ds_europe_2023_monthly_neg

Apply the model:

In [None]:
m.predict(ds_europe_2023_monthly_neg, features=features_in_ds, dim=features_pdim, inplace=True)
ds_europe_2023_monthly_neg

Show the result:

In [None]:
ds_plt = ds_europe_2023_monthly_neg['PCM_LABELS']

_, ax = plt.subplots(
    figsize=(6, 6),
    subplot_kw={"projection": crs.Miller()},
)
ds_plt.plot(
    ax=ax,
    cmap="tab20",
    transform=crs.PlateCarree(),
    cbar_kwargs={"orientation": "horizontal", "pad": 0.05, "aspect": 40, "label": "PCM class numbers"},
)
ax.coastlines()
ax.add_feature(feature.BORDERS)
ax.set_title("PCM classes")
plt.show()