# How to work with ERA5 land on Earth Data Hub

***
This notebook provides guidance on how to access and use the `reanalysis-era5-land-no-antartica-v0.zarr` dataset on Earth Data Hub.

The first goal is to compute the total precipitation observed during the Storm Daniel event, from 6 to 7 September 2023, in Greece, and compare it with the average 1991-2020 precipitation in the same area.

The second goal is to compare the 2023 cumulative precipitation at a specific location (the Greek hinterland) with the cumulative precipitation of past years (1991-2020) at the same location.
***

## What you will learn:

* how to access and preview the dataset
* how to select and reduce the data
* how to plot the results

## Data access and preview
***

Xarray and Dask work together following a lazy principle. This means that when you access and manipulate a Zarr store, the data is not immediately downloaded and loaded into memory. Instead, Dask constructs a task graph that represents the operations to be performed. It therefore pays to narrow the selection as much as possible before the computation is triggered (e.g., when the `.compute()` or `.plot()` methods are called), because only the chunks needed for the result are downloaded.
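As a minimal sketch of this lazy behaviour, the example below uses a small synthetic Dask-backed array in place of the remote store. The array shape, chunk sizes, and variable names are illustrative assumptions, not properties of the EDH dataset. Selecting and reducing only extends the task graph; values are materialised when `.compute()` is called.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Synthetic stand-in for a remote Zarr variable (assumed shape and chunks).
lazy = xr.DataArray(
    da.ones((365, 100, 100), chunks=(30, 100, 100), dtype="float32"),
    dims=("valid_time", "latitude", "longitude"),
    name="tp",
)

# Selection and reduction are recorded in the task graph, not executed.
subset_mean = lazy.isel(valid_time=slice(0, 2)).mean("valid_time")
is_lazy_before = isinstance(subset_mean.data, da.Array)

# .compute() runs the graph and returns a NumPy-backed DataArray.
result = subset_mean.compute()
is_numpy_after = isinstance(result.data, np.ndarray)
```

Only the chunks that overlap the selected time steps have to be evaluated, which is why a narrow selection keeps the later download small.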

To preview the data, only the dataset metadata must be downloaded. Xarray does this automatically:

***

In [None]:
import xarray as xr

ds = xr.open_dataset(
    "s3://ecmwf-era5-land/reanalysis-era5-land-no-antartica-v0.zarr", 
    chunks={}, 
    engine="zarr"
).astype("float32")
ds

## Working with data

Datasets on EDH are typically very large and remotely hosted. Typical use involves a selection of the data followed by one or more reduction steps, performed in a local or distributed Dask environment.

The structure of a workflow that uses EDH data looks like this:
1. data selection
2. (optional) data reduction
3. (optional) visualization

## Storm Daniel precipitation VS average September precipitation 1991-2020

### 1. Data selection

First, we perform a geographical selection corresponding to the Greece area:

In [None]:
tp = ds.tp
tp_greece = tp.sel(latitude=slice(41, 34), longitude=slice(19, 28))
tp_greece
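The latitude slice above runs from 41 down to 34, which suggests that the latitude coordinate is stored in descending order (north to south). The sketch below illustrates why the order of the bounds matters, using a synthetic 0.5° grid that is purely an assumption made for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic grid: latitude descending (north to south), longitude ascending.
lat = np.arange(45.0, 29.9, -0.5)
lon = np.arange(15.0, 30.1, 0.5)
grid = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"latitude": lat, "longitude": lon},
    dims=("latitude", "longitude"),
)

# Bounds follow the coordinate order: high to low latitude.
box = grid.sel(latitude=slice(41, 34), longitude=slice(19, 28))

# Bounds in the opposite order select nothing on a descending coordinate.
reversed_bounds = grid.sel(latitude=slice(34, 41), longitude=slice(19, 28))
```

If a selection unexpectedly comes back empty, checking the order of the coordinate values is a quick first diagnostic.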

Second, we further select only two days: September 6 and 7, 2023. This greatly reduces the amount of data that will be downloaded from EDH.

In [None]:
tp_greece_storm_daniel = tp_greece.sel(valid_time=["2023-09-06", "2023-09-07"])
tp_greece_storm_daniel
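Passing a list of date strings to `.sel()` matches exact timestamps (here, 00:00 of each day), whereas a single date string selects every time step of that day. The sketch below shows the difference on a synthetic hourly time axis; the values are placeholders, not ERA5-Land data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic hourly series from 5 to 8 September 2023.
times = pd.date_range("2023-09-05", "2023-09-08", freq=pd.Timedelta(hours=1))
series = xr.DataArray(
    np.arange(times.size),
    coords={"valid_time": times},
    dims="valid_time",
)

# A list of labels: one exact match per label (the 00:00 time steps).
exact = series.sel(valid_time=["2023-09-06", "2023-09-07"])

# A single partial date string: all time steps within that day.
whole_day = series.sel(valid_time="2023-09-06")
```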

At this point, the selection is small enough to call `.compute()` on it, which will trigger the download of the data and load it into memory.

We can measure the time it takes:

In [None]:
%%time

tp_greece_storm_daniel = tp_greece_storm_daniel.compute()

The data was very small. This didn't take long!

### 2. Data reduction

Now that the data is loaded in memory, we can easily compute the total precipitation for the Storm Daniel event. We also convert the unit of measure to `mm`.

In [None]:
# Sum over the selected time steps to get the event total (in metres)
tp_greece_storm_daniel_sum = tp_greece_storm_daniel.sum("valid_time")

# Convert from metres to millimetres
tp_greece_storm_daniel_sum = tp_greece_storm_daniel_sum * 1000
tp_greece_storm_daniel_sum.attrs["units"] = "mm"
tp_greece_storm_daniel_sum
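The reduction and unit conversion can be checked on a tiny synthetic array. The sketch below assumes precipitation values in metres (as implied by the factor of 1000) and uses made-up numbers for two time steps at two hypothetical locations.

```python
import numpy as np
import xarray as xr

# Made-up precipitation in metres: 2 time steps x 2 locations.
tp_m = xr.DataArray(
    np.array([[0.010, 0.002], [0.120, 0.035]]),
    dims=("valid_time", "location"),
    attrs={"units": "m"},
)

# Sum over time, then convert metres to millimetres.
total_mm = tp_m.sum("valid_time") * 1000

# Set the unit explicitly so that plots and metadata reflect the conversion.
total_mm.attrs["units"] = "mm"
```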

### 3. Visualization
Finally, we can plot the Storm Daniel event on a map:

In [None]:
import display
display.map(tp_greece_storm_daniel_sum, vmax=400, title="Storm Daniel precipitation")

We want to compare the total precipitation observed during Storm Daniel with the average precipitation observed in September between 1991 and 2020. 

The same considerations as before apply here: we first select a subset of the dataset and then compute it.

In [None]:
# Reference period 1991-2020 and the 30 days of September
YEARS = [str(year) for year in range(1991, 2021)]
DAYS = [f"{day:02d}" for day in range(1, 31)]

MONTH_REFERENCE_TIME = [f"{y}-09-{d}" for y in YEARS for d in DAYS]

tp_greece_september_1991_2020 = tp_greece.sel(valid_time=MONTH_REFERENCE_TIME)
tp_greece_september_1991_2020

This is already small enough to call `.compute()` on it.

In [None]:
%%time

tp_greece_september_1991_2020 = tp_greece_september_1991_2020.compute()

Now that the data is loaded in memory, we can easily compute the average September total precipitation for the years 1991-2020. We also convert the unit of measure to `mm`:

In [None]:
# Mean September total: sum over all selected days, divided by the number of years
tp_greece_september_1991_2020_average = tp_greece_september_1991_2020.sum("valid_time") / len(YEARS)

# Convert from metres to millimetres
tp_greece_september_1991_2020_average = tp_greece_september_1991_2020_average * 1000
tp_greece_september_1991_2020_average.attrs["units"] = "mm"
tp_greece_september_1991_2020_average
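Dividing the grand total by the number of years gives the mean September total. The sketch below rebuilds the 900 date labels (30 years times 30 days) and checks, on randomly generated daily values that stand in for real data, that this route agrees with averaging the per-year September totals.

```python
import numpy as np

years = [str(y) for y in range(1991, 2021)]  # 30 years
days = [f"{d:02d}" for d in range(1, 31)]    # 30 days of September
labels = [f"{y}-09-{d}" for y in years for d in days]

# Synthetic daily totals in mm, one per label (illustrative only).
rng = np.random.default_rng(0)
daily_mm = rng.uniform(0.0, 5.0, size=len(labels))

# Route used in the text: sum everything, divide by the number of years.
average_a = daily_mm.sum() / len(years)

# Equivalent route: total per September, then mean across the years.
per_year_totals = daily_mm.reshape(len(years), len(days)).sum(axis=1)
average_b = per_year_totals.mean()
```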

Finally, we can plot the Storm Daniel event and the September 1991-2020 average side by side:

In [None]:
display.maps(
    [tp_greece_storm_daniel_sum, tp_greece_september_1991_2020_average],
    vmax=400,
    axs_set=[
        {"title": "Storm Daniel precipitation"},
        {"title": "Average precipitation in September"},
    ],
)

## 2023 cumulative precipitation VS 1991-2020 cumulative precipitation

In this section we will compare the 2023 cumulative precipitation at a specific location in Greece with the cumulative precipitation of each year between 1991 and 2020 at the same location.

In [None]:
tp_hinterland_location = ds.tp.sel(latitude=39.25, longitude=21.9, method="nearest")
tp_hinterland_location

This is already small enough to be computed:

In [None]:
%%time

tp_hinterland_location = tp_hinterland_location.compute()

With the data already loaded in memory, we can easily select the total daily precipitation (time 00:00) for each day of the year:

In [None]:
import datetime

# datetime.time() is 00:00; that group holds the accumulated daily totals
tp_hinterland_location_daily_total_2023 = tp_hinterland_location.sel(valid_time="2023").groupby("valid_time.time")[datetime.time()]
tp_hinterland_location_daily_total_1991_2022 = tp_hinterland_location.sel(valid_time=slice("1991", "2020")).groupby("valid_time.time")[datetime.time()]
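An alternative way to keep only the 00:00 time steps is a boolean mask on the hour of the time coordinate. The sketch below shows this on a synthetic 6-hourly series; it is an equivalent formulation for illustration, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 6-hourly series covering 1 to 4 January 2023.
times = pd.date_range("2023-01-01", "2023-01-04", freq=pd.Timedelta(hours=6))
series = xr.DataArray(
    np.arange(times.size, dtype="float32"),
    coords={"valid_time": times},
    dims="valid_time",
)

# Keep only the time steps whose hour is 0 (i.e. 00:00).
at_midnight = series.sel(valid_time=series.valid_time.dt.hour == 0)
```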

Using the `display.compare()` function, we can plot the cumulative precipitation for the years between 1991 and 2020 (mean curve in red) and the cumulative precipitation for the year 2023 up to 31 October (blue curve):

In [None]:
display.compare(tp_hinterland_location_daily_total_2023, tp_hinterland_location_daily_total_1991_2022, time="valid_time", ylim=[0, 1600])
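The internals of `display.compare()` are not shown in this notebook. As a rough sketch of the quantities such a plot is based on, the example below computes per-year cumulative sums and their mean by day of year with pandas, using a synthetic series of 1 mm per day. It is an assumption about the general approach, not the helper's actual implementation.

```python
import pandas as pd

# Synthetic daily totals: 1 mm per day over two years (2020 is a leap year).
days = pd.date_range("2019-01-01", "2020-12-31", freq="D")
daily_mm = pd.Series(1.0, index=days)

# Cumulative precipitation, restarted at the beginning of each year.
cumulative = daily_mm.groupby(daily_mm.index.year).cumsum()

# Mean cumulative curve across years, indexed by day of year.
mean_curve = cumulative.groupby(cumulative.index.dayofyear).mean()
```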