# How to work with Climate Adaptation Digital Twin data on Earth Data Hub: fields on a single level or surface, standard resolution

***
This notebook shows how to access and use the `SSP3-7.0-IFS-NEMO-0001-standard-sfc-v0.zarr` dataset on Earth Data Hub. This is a sample dataset for the DestinE Climate Adaptation Digital Twin, fields on a single level or surface, standard resolution.

Our first goal is to plot the mean 2 metre temperature in January 2029 over Central Europe.

Our second goal is to compute the 2 metre temperature climatology (monthly means and standard deviations) in Berlin for the 2020-2028 reference period.
***

## What you will learn:

* how to access and preview the dataset
* how to select and reduce the data
* how to plot the results

## Setup distributed processing

In [None]:
import distributed

dask_client = distributed.Client("tcp://controller:8786")
dask_client

## Data access and preview
***

Xarray and Dask work together following a lazy principle. This means that when you access and manipulate a Zarr store, the data is not immediately downloaded and loaded in memory. Instead, Dask constructs a task graph that represents the operations to be performed. A smart user will reduce the amount of data that needs to be downloaded before the computation takes place (e.g., when the `.compute()` or `.plot()` methods are called).
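
The lazy principle can be tried out without any remote data. The following is a minimal sketch, assuming only that `dask` is installed; the array of ones and its chunk sizes are invented for illustration and have nothing to do with the EDH dataset.

```python
import dask.array as da

# Build a lazy 1000 x 1000 array of ones, split into 250 x 250 chunks.
lazy = da.ones((1000, 1000), chunks=(250, 250))

# Selecting and reducing only extends the task graph; nothing is computed yet.
lazy_mean = (lazy[:100, :100] * 2).mean()
print(type(lazy_mean))

# The computation runs only when .compute() is called.
result = lazy_mean.compute()
print(float(result))
```

Because the selection happens before `.compute()`, only the chunks overlapping the selected corner need to be evaluated.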

To preview the data, only the dataset metadata must be downloaded. Xarray does this automatically:

***

In [None]:
import xarray as xr

ds = xr.open_dataset(
    "https://bopen:edh_pat_ba2fd8913788600bfa1eded6aa161604ac6b915e58ee8499b94ce2e7a20a19db10aaff39b7f494c1e44fa1b1e86f3c18@data.earthdatahub.com/d1-climate-dt/ScenarioMIP-SSP3-7.0-IFS-NEMO-0001-high-sfc-v0.zarr",
    chunks={},
    engine="zarr",
)
ds

In [None]:
t2m = (ds.t2m - 273.15).drop_vars(["heightAboveGround", "surface", "step"])
t2m.attrs["units"] = "°C"
t2m.attrs["long_name"] = "temperature"

t2m

In [None]:
%config InlineBackend.figure_format='retina'

import display
import matplotlib.ticker

ax = display.map(
    t2m.sel(time="2020-02-15T12:00")[::10, ::10],
    vmax=45, vmin=-45, cmap="RdBu_r",
    figsize=(15, 6),
)
gl = ax.gridlines(linewidth=0.25, color='dimgrey', alpha=0.5)
gl.xlocator = matplotlib.ticker.MultipleLocator(22.5)
gl.ylocator = matplotlib.ticker.MultipleLocator(22.5)

# Working with data

Datasets on EDH are typically very large and remotely hosted. Typical use implies a selection of the data followed by one or more reduction steps, performed in a local or distributed Dask environment.

The structure of a workflow that uses EDH data looks like this:
1. data selection
2. (optional) data reduction
3. (optional) visualization
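
The first two steps of this workflow can be sketched on a small synthetic array. This is an illustrative example only: the `toy` array, its coarse 10-degree grid and its random values are assumptions made up for the sketch, not EDH data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a remote dataset: 10 daily fields on a coarse global grid.
time = pd.date_range("2029-01-01", periods=10, freq="D")
latitude = np.arange(-90.0, 91.0, 10.0)   # south to north
longitude = np.arange(0.0, 360.0, 10.0)
values = np.random.default_rng(0).normal(
    5.0, 3.0, (time.size, latitude.size, longitude.size)
)
toy = xr.DataArray(
    values,
    coords={"time": time, "latitude": latitude, "longitude": longitude},
    dims=("time", "latitude", "longitude"),
    name="t2m",
)

# 1. Data selection: keep only a small region.
region = toy.sel(latitude=slice(40, 60), longitude=slice(0, 20))

# 2. Data reduction: average over the time dimension.
region_mean = region.mean(dim="time")
print(region_mean.shape)
```

With real EDH data the same two calls stay lazy until `.compute()` or `.plot()` is invoked, so the order (select first, reduce second) determines how much is downloaded.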

## 2 metre temperature: average January 2029 in Germany

### 1. Data selection

First, we perform a geographical selection covering Germany and its surroundings. This greatly reduces the amount of data that will be downloaded from EDH. The temperature has already been converted to `°C` above.

In [None]:
selection = {"latitude": slice(42, 56), "longitude": slice(0, 18)}
extent = (0, 16, 42, 56)

t2m_central_europe = t2m.sel(selection)
t2m_central_europe

**NB:** At this point, no data has been downloaded yet, nor loaded in memory.

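Second, we further select January 2029. This is again a lazy operation. Before doing so, the cells below preview a single time step of the selection and derive monthly mean heating and cooling degree days (`hdd`, `cdd`) from daily means: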

In [None]:
display.map(
    t2m_central_europe.sel(time="2020-05-15T12:00"),
    vmax=40, vmin=-40, cmap="RdBu_r",
    figsize=(10, 6),
    extent=extent,
)

In [None]:
t2m_central_europe_daily = t2m_central_europe.resample(time="1D").mean()
t2m_central_europe_daily

In [None]:
t2m_central_europe_daily = t2m_central_europe_daily.compute()

In [None]:
import numpy as np
hdd_daily = np.maximum((15.5 - t2m_central_europe_daily), 0)

In [None]:
hdd = hdd_daily.resample(time="1M").mean()
hdd

In [None]:
display.map(
    hdd.sel(time="2020-01"),
    vmax=25, vmin=0, cmap="Blues",
    figsize=(10, 6),
    extent=extent,
)

In [None]:
display.map(
    hdd.sel(time="2029-01"),
    vmax=25, vmin=0, cmap="Blues",
    figsize=(10, 6),
    extent=extent,
)

In [None]:
_ = hdd.groupby("time.month")[1].plot(col="time", col_wrap=3, vmax=25, vmin=0, cmap="Blues", add_colorbar=False)

In [None]:
cdd_daily = np.maximum(t2m_central_europe_daily - 22, 0)

In [None]:
cdd = cdd_daily.resample(time="1M").mean()
cdd

In [None]:
display.map(
    cdd.sel(time="2020-07"),
    vmax=8, vmin=0, cmap="Reds",
    figsize=(10, 6),
    extent=(0, 16, 43, 56),
)

In [None]:
cdd.groupby("time.month")[7].plot(col="time", col_wrap=3, vmax=8, cmap="Reds")

At this point the selection is small enough to call `.compute()` on it. This will trigger the download of data from EDH and load it in memory. 

We can measure the time it takes:

In [None]:
%%time

# Select January 2029 (lazy), then download the data and load it in memory
t2m_germany_area_january_2029 = t2m_central_europe.sel(time="2029-01").compute()

The data was very small, so this didn't take long.

### 2. Data reduction

Now that the data is loaded in memory, we can easily compute the January 2029 monthly mean:

In [None]:
t2m_germany_area_january_2029_monthly_mean = t2m_germany_area_january_2029.mean(dim="time")
t2m_germany_area_january_2029_monthly_mean

### 3. Visualization
Finally, we can plot the January 2029 monthly mean on a map:

In [None]:
import display
import matplotlib.pyplot as plt

In [None]:
display.map(t2m_germany_area_january_2029_monthly_mean, vmax=None, cmap="YlOrRd", title="Mean Surface Temperature, Jan 2029")

## 2020-2028 climatology

We will now compute the 2 metre temperature climatology (monthly mean and standard deviation) in Berlin for the reference period 2020-2028.

We first select the data closest to Berlin:

In [None]:
%%time

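# Pick the grid point nearest to Berlin, then restrict to the 2020-2028 reference period
t2m_Berlin_2020_2028 = t2m.sel(latitude=52.5, longitude=13.4, method="nearest").sel(time=slice("2020", "2028"))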
t2m_Berlin_2020_2028
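
To see what `method="nearest"` does, here is a small sketch on an invented 0.5-degree grid (the `grid` array and its coordinates are assumptions for illustration; the real dataset grid differs). The requested point does not need to lie on the grid: xarray returns the closest grid point along each coordinate.

```python
import numpy as np
import xarray as xr

# Hypothetical regular 0.5-degree grid around Berlin.
latitude = np.arange(50.0, 55.5, 0.5)    # 50.0, 50.5, ..., 55.0
longitude = np.arange(10.0, 15.5, 0.5)   # 10.0, 10.5, ..., 15.0
grid = xr.DataArray(
    np.zeros((latitude.size, longitude.size)),
    coords={"latitude": latitude, "longitude": longitude},
    dims=("latitude", "longitude"),
)

# Nearest-neighbour lookup: 13.4 is not on the grid, so 13.5 is returned.
point = grid.sel(latitude=52.5, longitude=13.4, method="nearest")
print(float(point.latitude), float(point.longitude))  # 52.5 13.5
```

The coordinates of the returned point are worth checking on real data, since the nearest grid point can be some distance from the requested location on a coarse grid.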

The Berlin selection is already small enough to be computed:

In [None]:
%%time

t2m_Berlin_2020_2028 = t2m_Berlin_2020_2028.compute()

Now that the data is loaded in memory we can easily compute the climatology:

In [None]:
t2m_Berlin_climatology_mean = t2m_Berlin_2020_2028.groupby("time.month").mean(dim="time")
t2m_Berlin_climatology_std = t2m_Berlin_2020_2028.groupby("time.month").std(dim="time")
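
The same `groupby("time.month")` pattern can be checked on a synthetic series. This is a sketch under stated assumptions: the two-year daily series below is an invented cosine seasonal cycle, not model output.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Invented daily series for 2020-2021: coldest in mid-January, warmest in mid-July.
time = pd.date_range("2020-01-01", "2021-12-31", freq="D")
day = time.dayofyear.to_numpy()
seasonal = 10.0 - 10.0 * np.cos(2 * np.pi * (day - 15) / 365.25)
series = xr.DataArray(seasonal, coords={"time": time}, dims="time", name="t2m")

# Group all Januaries together, all Februaries together, and so on,
# then reduce each group over time.
monthly_mean = series.groupby("time.month").mean(dim="time")
monthly_std = series.groupby("time.month").std(dim="time")
print(monthly_mean.sizes["month"])  # 12
```

The result has a `month` dimension with 12 entries instead of a `time` dimension, which is why the plot that follows uses `month` on the x axis.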

We can finally plot the climatology in Berlin for the 2020-2028 reference period:

In [None]:
plt.figure(figsize=(10, 5))
t2m_Berlin_climatology_mean.plot(label="Mean", color="#3498db")
plt.errorbar(
    t2m_Berlin_climatology_mean.month, 
    t2m_Berlin_climatology_mean, 
    yerr=t2m_Berlin_climatology_std, 
    fmt="o", 
    label="Standard Deviation",
    color="#a9a9a9"
)

plt.title("Surface Temperature climatology in Berlin (DE), 2020-2028")
plt.xticks(t2m_Berlin_climatology_mean.month)
plt.xlabel("Month")
plt.ylabel("Surface Temperature [°C]")
plt.legend()
plt.grid(alpha=0.3)
plt.show()