# Validation and intercomparison of AgERA5 and other reanalysis datasets for agricultural applications

Production date: DD-MM-2025

**Please note that this repository is used for development and review, so quality assessments should be considered work in progress until they are merged into the main branch.**

Dataset version: 2.0.

Produced by: C3S2_521 contract.

## 🌍 Use case: Agricultural yield estimation and prediction based on reanalysis data

## ❓ Quality assessment question
* How do reanalysis datasets compare to observations, and to each other, for agriculturally relevant variables?
* Is AgERA5 fit-for-purpose as an input dataset for crop yield models?

A very short introduction before the assessment statement describing the approach taken to answer the user question. One or two key references could be useful if the assessment summarises literature. These can be referenced directly in the text or with numerical labels (also listed at the end), e.g. `[[1]](https://doi.org/10.1038/s41598-018-20628-2)`, giving: [[1]](https://doi.org/10.1038/s41598-018-20628-2).

[[CDS AgERA5]](https://doi.org/10.24381/cds.6c68c9bb).

## 📢 Quality assessment statement

```{admonition} These are the key outcomes of this assessment
:class: note
* Finding 1
* Finding 2
* Finding 3
* etc
```

## 📋 Methodology

**Agrometeorological indicators from 1979 to present derived from reanalysis** (*AgERA5*; [doi 10.24381/cds.6c68c9bb](https://doi.org/10.24381/cds.6c68c9bb)).

A ‘free text’ introduction to the data analysis steps or a description of the literature synthesis, with a justification of the approach taken, and limitations mentioned. **Mention which CDS catalogue entry is used, including a link, and also any other entries used for the assessment**.

---
Variables of interest for a crop growth simulator such as [PCSE/WOFOST](https://github.com/ajwdewit/pcse):

| Variable name     | Statistics | Unit     | Example assessment |
|-------------------|------------|----------|--------------------|
| Solar irradiation | 24 h total | J/m²/day | example |
| 2 m temperature   | 24 h min   | °C       | example |
|                   | 24 h max   |          |         |
|                   | 24 h mean  |          |         |
| Vapour pressure   | 24 h mean  | kPa      | example |
| Rain              | 24 h total | cm/day   | example |
| 2 m Wind speed    | 24 h mean  | m/s      | example |
| Snow depth        | 24 h mean  | cm       | example |

[Source](https://pcse.readthedocs.io/en/stable/code.html#pcse.base.WeatherDataContainer)

E0, ES0, and ET0 are taken from a separate evapotranspiration calculation.

---

* These headings can be specific to the quality assessment, and help guide the user through the ‘story’ of the assessment. This means we cannot pre-define the sections and headings here, as they will be different for each assessment.
* Sub-bullets could be used to outline what will be done/shown/discussed in each section
* The list below is just an example, or may need more or fewer sections, with different headings

The analysis and results are organised in the following steps, which are detailed in the sections below:

**[](section-setup)**

**[](section-download)**
 * AgERA5, ERA5-Land, E-OBS, ...

**[](section-results)** 
 * Point-by-point comparison
 * Time series comparison
 * Geospatial comparison

Any further notes on the method could go here (explanations, caveats or limitations).

## 📈 Analysis and results

(section-setup)=
### 1. Code setup
```{note}
This notebook uses [earthkit](https://github.com/ecmwf/earthkit) for 
downloading ([earthkit-data](https://github.com/ecmwf/earthkit-data)) 
and 
visualising ([earthkit-plots](https://github.com/ecmwf/earthkit-plots)) data.
Because earthkit is in active development, some functionality may change after this notebook is published.
If any part of the code stops functioning, please raise an issue on our GitHub repository so it can be fixed.
```

In [None]:
import earthkit.data as ekd
import earthkit.plots as ekp
import xarray as xr
from matplotlib import pyplot as plt

(section-download)=
### 2. Download data
#### General setup
This notebook uses [earthkit-data](https://github.com/ecmwf/earthkit-data) to download files from the CDS.
If you intend to run this notebook multiple times, it is highly recommended that you [enable caching](https://earthkit-data.readthedocs.io/en/latest/guide/caching.html) to prevent having to download the same files multiple times.

We will be downloading multiple datasets in this notebook.
In this section, we define the parameters common to all datasets: time and space.
This way, these only need to be changed in one place if you wish to modify the notebook for your own use case.

In this example, we will be looking at data for the United Kingdom and Ireland every day in January–September 2024:

In [None]:
request_domain = {
    "area": [60, -12, 48, 4]  # North, West, South, East
}

In [None]:
request_time = {
    "year": "2024",
    "month": [f"{mo:02}" for mo in range(1, 10)],
    "day": [f"{d:02}" for d in range(1, 32)],
}

We can define a helper function that adds the time and domain parameters, as well as a dictionary of parameters specific to one dataset (e.g. AgERA5, ERA5-Land), to a number of requests:

In [None]:
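def make_full_request(request_dataset: dict, *requests: dict) -> list[dict]:
    """Combine the shared time/domain parameters and dataset-specific parameters with each request."""
    # Later dictionaries take precedence if a key appears more than once
    base_request = request_time | request_domain | request_dataset
    updated_requests = [base_request | req for req in requests]
    return updated_requests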
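
As an illustration of how the dictionary merging behaves, the following self-contained sketch repeats the helper with shortened, hypothetical time parameters (the real ones are defined in the cells above). It assumes Python 3.9 or later, where the `|` operator is available for dictionaries. When a key appears in more than one dictionary, the right-most value is kept.

```python
# Shortened, hypothetical parameters for illustration only
request_time = {"year": "2024", "month": ["01"], "day": ["01", "02"]}
request_domain = {"area": [60, -12, 48, 4]}  # North, West, South, East


def make_full_request(request_dataset: dict, *requests: dict) -> list[dict]:
    """Merge shared and dataset-specific parameters into each request."""
    base_request = request_time | request_domain | request_dataset
    return [base_request | req for req in requests]


example_requests = make_full_request(
    {"version": "2_0"},
    {"variable": "precipitation_flux"},
    {"variable": "2m_temperature", "statistic": ["24_hour_mean"]},
)

# One complete request per variable-specific dictionary
for req in example_requests:
    print(req)

# Keys on the right-hand side override earlier values
overridden = example_requests[0] | {"month": "10", "day": "01"}
print(overridden["month"], overridden["day"])
```

The same override behaviour is what allows a single extra day to be requested by merging a small dictionary onto an existing request.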

#### AgERA5
We now define parameters unique to AgERA5:

In [None]:
agera5_ID = "sis-agrometeorological-indicators"

request_agera5 = {
    "version": "2_0",
}

Next, we specify the variables of interest:

In [None]:
request_irradiation = {
    "variable": "solar_radiation_flux",
}

# Temperature has to be split into separate requests because of size limits
request_temperature_min = {
    "variable": "2m_temperature",
    "statistic": ["24_hour_minimum"],
}

request_temperature_max = {
    "variable": "2m_temperature",
    "statistic": ["24_hour_maximum"],
}

request_temperature_mean = {
    "variable": "2m_temperature",
    "statistic": ["24_hour_mean"],
}

request_vapour_pressure = {
    "variable": "vapour_pressure",
    "statistic": ["24_hour_mean"],
}

request_rain = {
    "variable": "precipitation_flux",
}

request_wind = {
    "variable": "10m_wind_speed",
    "statistic": ["24_hour_mean"],
}

request_snow = {
    "variable": "snow_thickness",
    "statistic": ["24_hour_mean"],
}

The requests for specific variables are combined with the default, time, and domain parameters and passed to earthkit for download from the CDS:

In [None]:
requests_agera5_combined = make_full_request(request_agera5,
                                             request_irradiation,
                                             request_temperature_min, request_temperature_max, request_temperature_mean,
                                             request_vapour_pressure,
                                             request_rain,
                                             request_wind,
                                             request_snow,
                                            )

ds_agera5 = ekd.from_source("cds", agera5_ID, *requests_agera5_combined)

Earthkit-data downloads the dataset as a [field list](https://earthkit-data.readthedocs.io/en/latest/guide/data.html), which can be manipulated directly.
Here, we convert it to an Xarray object for ease of use later (when comparing multiple datasets):

In [None]:
print("AgERA5 data type from earthkit-data:", type(ds_agera5))
data_agera5 = ds_agera5.to_xarray(compat="equals")
print("AgERA5 data type in Xarray:", type(data_agera5))
data_agera5

#### ERA5-Land
We now define parameters unique to ERA5-Land and the variables of interest.
This will involve two CDS datasets.
The [*ERA5-Land hourly data from 1950 to present*](https://doi.org/10.24381/cds.e2161bac) dataset contains hourly data for variables like 2m temperature as well as accumulated data for variables like solar irradiation.
For the present use case, we are only interested in the daily accumulated data.
[*ERA5-Land post-processed daily statistics from 1950 to present*](https://doi.org/10.24381/cds.e9c9c792) provides daily minimum/maximum/mean statistics for the hourly variables, saving us the effort of aggregating these ourselves.
In the following subsections, we will download the data (accumulated and daily statistics) and harmonise these to the same format as AgERA5.

##### Accumulated data from ERA5-Land
Per the [documentation for ERA5-Land](https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation#heading-Accumulations):

> The accumulations in the short forecasts of ERA5-Land (with hourly steps from 01 to 24) are treated the same as those in ERA-Interim or ERA-Interim/Land, i.e., they are accumulated from the beginning of the forecast to the end of the forecast step. For example, runoff at day=D, step=12 will provide runoff accumulated from day=D, time=0 to day=D, time=12. The maximum accumulation is over 24 hours, i.e., from day=D, time=0 to day=D+1,time=0 (step=24). For the CDS time, or validity time, of 00 UTC, the accumulations are over the 24 hours ending at 00 UTC i.e. the accumulation is during the previous day.

In practice, this means that we need to download data for *day+1* (e.g. 2 January 2024) to get the total accumulated value for *day* (1 January 2024).
Our existing `request_time` will provide data for 2023-12-31 – 2024-09-29, so we need to add one extra day.
This is achieved by adding a second request for just that extra day (1 October 2024) to the earthkit-data download.
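
The date arithmetic behind this can be checked with a minimal sketch using only the standard library. The helper `accumulation_day` is hypothetical and not part of earthkit or the CDS API. It simply maps a 00:00 UTC validity time to the calendar day that the 24-hour accumulation covers.

```python
from datetime import date, datetime, timedelta


def accumulation_day(valid_time: datetime) -> date:
    """Return the day covered by a 24 h accumulation valid at 00:00 UTC."""
    if (valid_time.hour, valid_time.minute) != (0, 0):
        raise ValueError("This sketch only handles fields valid at 00:00 UTC")
    return (valid_time - timedelta(days=1)).date()


# First and last validity times covered by the shared request_time
first_day = accumulation_day(datetime(2024, 1, 1))
last_day = accumulation_day(datetime(2024, 9, 30))
# The extra request for 1 October fills in the missing final day
extra_day = accumulation_day(datetime(2024, 10, 1))

print(first_day)  # 2023-12-31
print(last_day)   # 2024-09-29
print(extra_day)  # 2024-09-30
```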

In [None]:
era5land_ID = "reanalysis-era5-land"

request_era5land = {
    "time": ["00:00"],
    "data_format": "grib",
    "download_format": "unarchived",
}

request_era5land_extratime = {
    "month": "10",
    "day": "01",
}

In [None]:
request_irradiation = {
    "variable": ["surface_solar_radiation_downwards"],
}

request_rain = {
    "variable": ["total_precipitation"],
}

In [None]:
request_era5land_irradiation, request_era5land_rain = make_full_request(request_era5land,
                                                                        request_irradiation,
                                                                        request_rain,
                                                                       )

ds_era5land_accumulated = ekd.from_source("cds", era5land_ID, request_era5land_irradiation, request_era5land_irradiation | request_era5land_extratime,
                                                              request_era5land_rain, request_era5land_rain | request_era5land_extratime)
data_era5land_accumulated = ds_era5land_accumulated.to_xarray()

When we inspect the resulting dataset (in Xarray format), we see that the `forecast_reference_time` coordinate conveniently matches the variable to the day of accumulation:

In [None]:
data_era5land_accumulated

##### Daily statistics from the post-processed ERA5-Land dataset
The daily statistics are indexed according to the day they apply to, meaning we do not need to worry about adding extra days here.

In [None]:
era5land_ID = "derived-era5-land-daily-statistics"

request_era5land = {
    "time_zone": "utc+00:00",
    "frequency": "1_hourly",
}

In [None]:
# Temperature has to be split into separate requests because of size limits
request_temperature_min = {
    "variable": "2m_temperature",
    "daily_statistic": "daily_minimum",
}

request_temperature_max = {
    "variable": "2m_temperature",
    "daily_statistic": ["daily_maximum"],
}

request_temperature_mean = {
    "variable": "2m_temperature",
    "daily_statistic": ["daily_mean"],
}

# request_vapour_pressure : Not available

request_wind_u = {
    "variable": "10m_u_component_of_wind",
    "daily_statistic": ["daily_mean"],
}

request_wind_v = {
    "variable": "10m_v_component_of_wind",
    "daily_statistic": ["daily_mean"],
}

request_snow = {
    "variable": "snow_depth",
    "daily_statistic": ["daily_mean"],
}

We download the different variables separately in anticipation of the harmonisation step in the next subsection:

In [None]:
requests_era5land_combined = make_full_request(request_era5land,
                                               request_temperature_min, request_temperature_max, request_temperature_mean,
                                               request_wind_u, request_wind_v,
                                               request_snow
                                              )

ds_era5land = [ekd.from_source("cds", era5land_ID, req) for req in requests_era5land_combined]
data_era5land = [ds.to_xarray() for ds in ds_era5land]
data_era5land_temperature_min, data_era5land_temperature_max, data_era5land_temperature_mean, data_era5land_wind_u, data_era5land_wind_v, data_era5land_snow = data_era5land

##### Harmonisation
The ERA5-Land dataset is structured differently from AgERA5 and requires some processing before the two can be intercompared.
This involves renaming coordinates and variables.
**Note: Also check units**
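
As a starting point for the unit check, the sketch below shows the kind of conversions that may be needed. The units are assumptions that should be verified against the `units` attribute of each variable: ERA5-Land total precipitation and snow depth are assumed to be in metres, AgERA5 precipitation in mm/day and snow thickness in cm, and temperatures in kelvin. The function names are hypothetical. Because the functions only use arithmetic operators, they should work on plain numbers as well as on array-like objects.

```python
# Conversion factors under the assumed units (verify before use)
M_TO_MM = 1000.0
M_TO_CM = 100.0
KELVIN_OFFSET = 273.15


def precipitation_m_to_mm(tp_m):
    """Convert a daily precipitation total from metres to millimetres."""
    return tp_m * M_TO_MM


def snow_depth_m_to_cm(sde_m):
    """Convert snow depth from metres to centimetres."""
    return sde_m * M_TO_CM


def kelvin_to_celsius(temperature_k):
    """Convert temperature from kelvin to degrees Celsius."""
    return temperature_k - KELVIN_OFFSET


# Hypothetical example values
print(precipitation_m_to_mm(0.0025))
print(snow_depth_m_to_cm(0.12))
print(kelvin_to_celsius(283.15))
```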

For the accumulated data, we only need to rename the variables and coordinates to match those in AgERA5:

In [None]:
data_era5land_accumulated = data_era5land_accumulated.rename({"ssrd": "Solar_Radiation_Flux",
                                                              "tp": "Precipitation_Flux",
                                                              "forecast_reference_time": "time", 
                                                              "latitude": "lat", "longitude": "lon"})

The temperature statistics are downloaded as simply `t2m`.
These need to be renamed before they can be combined:

In [None]:
data_era5land_temperature_min = data_era5land_temperature_min.rename({"t2m": "Temperature_Air_2m_Min_24h"})
data_era5land_temperature_max = data_era5land_temperature_max.rename({"t2m": "Temperature_Air_2m_Max_24h"})
data_era5land_temperature_mean = data_era5land_temperature_mean.rename({"t2m": "Temperature_Air_2m_Mean_24h"})
data_era5land_temperature = xr.merge([data_era5land_temperature_min, data_era5land_temperature_max, data_era5land_temperature_mean], compat="equals")

The 10 m wind speed is calculated from the two variables representing its U (east–west) and V (north–south) components:

In [None]:
data_era5land_wind = xr.merge([data_era5land_wind_u, data_era5land_wind_v], compat="equals")
# Wind speed as the magnitude of the (u, v) vector
data_era5land_wind = data_era5land_wind.assign(
    {"Wind_Speed_10m_Mean_24h": (data_era5land_wind["u10"]**2 + data_era5land_wind["v10"]**2) ** 0.5}
)
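
One caveat applies to this calculation. A wind speed computed from daily mean components is the magnitude of the daily mean wind vector, which can never exceed the daily mean of the hourly wind speeds. If the AgERA5 24-hour mean is derived from hourly wind speeds (an assumption to be checked in the dataset documentation), some systematic difference should be expected. The sketch below illustrates the effect with hypothetical hourly values.

```python
from math import hypot
from statistics import mean

# Hypothetical hourly wind components (m/s); the wind reverses during the day
u_hourly = [3.0, 3.0, -3.0, -3.0]
v_hourly = [4.0, 4.0, -4.0, -4.0]

# Approach used above: speed from the daily mean components
speed_of_mean_vector = hypot(mean(u_hourly), mean(v_hourly))

# Alternative: daily mean of the hourly speeds
mean_of_speeds = mean(hypot(u, v) for u, v in zip(u_hourly, v_hourly))

print(speed_of_mean_vector)  # 0.0
print(mean_of_speeds)        # 5.0
```

The example is deliberately extreme. On days with a steady wind direction, the two quantities are nearly equal.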

Next, we rename the snow depth variable to match AgERA5:

In [None]:
data_era5land_snow = data_era5land_snow.rename({"sde": "Snow_Thickness_Mean_24h"})

Now we can combine the pre-processed variables into a single Xarray dataset.
We also rename the coordinates to match AgERA5.

In [None]:
data_era5land = xr.merge([data_era5land_temperature, data_era5land_wind, data_era5land_snow], compat="equals")
data_era5land = data_era5land.rename({"valid_time": "time", "latitude": "lat", "longitude": "lon"})

Lastly, we combine the pre-processed and accumulated variables:

In [None]:
data_era5land = xr.merge([data_era5land_accumulated, data_era5land], compat="equals")
data_era5land

#### E-OBS
We now define parameters unique to E-OBS and the variables of interest:

In [None]:
eobs_ID = "insitu-gridded-observations-europe"
request_eobs = {
    "product_type": "ensemble_mean",
    "variable": [
        "mean_temperature",
        "minimum_temperature",
        "maximum_temperature",
        "precipitation_amount",
        "surface_shortwave_downwelling_radiation",
        "wind_speed"
    ],
    "grid_resolution": "0_1deg",
    "period": "2011_2024",
    "version": ["30_0e"]
}

In [None]:
ds_eobs = ekd.from_source("cds", eobs_ID, request_eobs | request_domain)
data_eobs = ds_eobs.to_xarray(compat="equals")

In [None]:
eobs_ID = "insitu-gridded-observations-europe"
request_eobs = {
    "product_type": "ensemble_mean",
    "variable": [
        "mean_temperature",
    ],
    "grid_resolution": "0_1deg",
    "period": "2011_2024",
    "version": ["30_0e"]
}

In [None]:
ds_eobs = ekd.from_source("cds", eobs_ID, request_eobs | request_domain)
data_eobs = ds_eobs.to_xarray(compat="equals")

Rename variables and coordinates

In [None]:
data_eobs
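
The renaming step for E-OBS could follow the same pattern as for ERA5-Land. The sketch below is a suggestion only. It assumes that the E-OBS variables carry their usual short names (`tg`, `tn`, `tx`, `rr`, `qq`, `fg`) and that the coordinates are called `latitude` and `longitude`, both of which should be checked against the dataset shown above. The helper `build_rename_mapping` is hypothetical. Units may also differ (for example, E-OBS temperatures are given in °C), so renaming alone does not make the datasets comparable.

```python
# Assumed E-OBS names (left) and the AgERA5 names used in this notebook (right)
EOBS_TO_AGERA5 = {
    "tg": "Temperature_Air_2m_Mean_24h",
    "tn": "Temperature_Air_2m_Min_24h",
    "tx": "Temperature_Air_2m_Max_24h",
    "rr": "Precipitation_Flux",
    "qq": "Solar_Radiation_Flux",
    "fg": "Wind_Speed_10m_Mean_24h",
    "latitude": "lat",
    "longitude": "lon",
}


def build_rename_mapping(names_present) -> dict:
    """Keep only the entries whose original name exists in the dataset."""
    return {old: new for old, new in EOBS_TO_AGERA5.items() if old in names_present}


# Hypothetical list of names, e.g. obtained from list(data_eobs.variables)
mapping = build_rename_mapping(["tg", "latitude", "longitude", "time"])
print(mapping)

# With an Xarray dataset, the mapping would be applied as:
# data_eobs = data_eobs.rename(mapping)
```

Filtering the mapping first avoids an error when a variable was not downloaded.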

#### General harmonisation
Before the data can be analysed, we must make sure their coordinates are aligned.
All of the datasets used here are provided on a regular 0.1° by 0.1° grid, so no regridding is necessary.
However, two steps need to be taken before the data can be compared directly:
* Their representation as floating-point numbers can cause very small differences to appear, which do not reflect any real differences in the data but are difficult for software (in this case Xarray) to work with. Knowing that the data are on a regular 0.1° by 0.1° grid, we can simply round all of the coordinates to 2 digits to force them to be the same.
* The bounds of the datasets (in space and in time) are slightly different and need to be aligned.
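
The floating-point issue can be reproduced without any geospatial data. In this minimal sketch, two hypothetical coordinate lists describe the same 0.1° grid but are constructed differently, so an exact comparison fails until both are rounded.

```python
# The same 0.1° grid, built in two different ways
grid_computed = [0.1 * i for i in range(4)]
grid_literal = [0.0, 0.1, 0.2, 0.3]

# Exact comparison fails because 0.1 * 3 is not exactly 0.3
print(grid_computed == grid_literal)  # False

# Rounding to 2 digits removes the spurious difference
rounded_computed = [round(x, 2) for x in grid_computed]
rounded_literal = [round(x, 2) for x in grid_literal]
print(rounded_computed == rounded_literal)  # True
```

Two digits are sufficient here because the grid spacing is 0.1°. A finer grid would need more digits.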

In [None]:
# Round coordinates before alignment to avoid floating-point errors
# Cannot be a dict-comp with coord in lat/lon because of symbol table issues
d = 2
round_mapping = {"lat": (lambda dataset: dataset["lat"].round(d)),
                 "lon": (lambda dataset: dataset["lon"].round(d))}

round_agera5, round_era5land = [dataset.assign_coords(round_mapping) for dataset in (data_agera5, data_era5land)]
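
The comment about a dictionary comprehension most likely refers to Python's late binding of closures: a lambda created in a loop looks up the loop variable when it is called, not when it is defined. The sketch below demonstrates this with a plain dictionary standing in for a dataset (the values are hypothetical) and shows the common workaround of binding the variable as a default argument.

```python
# A plain dictionary standing in for a dataset's coordinates
values = {"lat": 52.5049, "lon": -0.1251}

# Late binding: both lambdas see the final value of `coord`, which is "lon"
late = {coord: (lambda ds: round(ds[coord], 2)) for coord in ("lat", "lon")}
print(late["lat"](values))  # -0.13, the rounded longitude

# Workaround: bind the current value of `coord` as a default argument
bound = {coord: (lambda ds, coord=coord: round(ds[coord], 2)) for coord in ("lat", "lon")}
print(bound["lat"](values))  # 52.5
```

Whether the default-argument form is preferable to writing out both entries by hand is a matter of taste. With only two coordinates, the explicit version used above is arguably easier to read.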

In [None]:
# Align data using an inner join
data_agera5, data_era5land, data_eobs = xr.align(round_agera5, round_era5land, data_eobs, join="inner")

In [None]:
data_eobs

(section-results)=
### 3. Results
Describe what is done in this step/section and what the `code` in the cell does (if code is included). 

If this is the **results section**, we expect the final plots to be created here with a description of how to interpret them, and what information can be extracted for the specific use case and user question. The information in the 'quality assessment statement' should be derived here.

#### Setup
First, we get a list of all variables from the AgERA5 dataset, which can be looped over for plots:

In [None]:
variables = list(data_agera5.keys())
variables.remove("crs")

#### Point-by-point comparison
Here, we compare the different variables 1-to-1 between the different datasets, across their entire spatial and temporal domain, to determine the typical differences.

In [None]:
# Shared arguments for scatter plots
hexbin_kw = {"gridsize": 500, "mincnt": 1, "cmap": "cividis"}

# Create figure
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(10, 20), layout="constrained")
axs = axs.ravel()

# Plot individual scatter plots
for ax, var in zip(axs, variables):
    try:  # Account for missing variables
        ax.hexbin(data_agera5[var].values.ravel(), data_era5land[var].values.ravel(),
                  **hexbin_kw)
    except KeyError:
        print(f"KeyError: no variable `{var}` in one of the datasets")

    ax.text(0.05, 0.95, var, 
            horizontalalignment="left", verticalalignment="top", transform=ax.transAxes,
            bbox={"facecolor": "white", "edgecolor": "black", "alpha": 1})

# Visual settings
for ax in axs:
    ax.grid(True, axis="both", linestyle="--")
    ax.set_aspect("equal", "box")
    ax.set_xlabel("")
    ax.set_title("")

fig.suptitle("Point-by-point comparison")

# Show result
plt.show()
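
The scatter plots could be complemented with summary statistics that quantify the typical differences. The sketch below computes the mean bias and root-mean-square error (RMSE) for two sequences of values, skipping pairs where either value is missing. The function `bias_and_rmse` and the example values are hypothetical. In the notebook, the flattened arrays passed to `hexbin` could be used as input.

```python
from math import isnan, sqrt


def bias_and_rmse(reference, other):
    """Return the mean bias (other - reference) and RMSE, ignoring NaN pairs."""
    diffs = [o - r for r, o in zip(reference, other) if not (isnan(r) or isnan(o))]
    if not diffs:
        raise ValueError("No valid pairs to compare")
    bias = sum(diffs) / len(diffs)
    rmse = sqrt(sum(d**2 for d in diffs) / len(diffs))
    return bias, rmse


# Hypothetical 2 m temperatures (K); one value is missing
agera5_values = [280.0, 281.0, float("nan"), 283.0]
era5land_values = [281.0, 281.0, 282.0, 281.0]

bias, rmse = bias_and_rmse(agera5_values, era5land_values)
print(f"Bias: {bias:.2f} K, RMSE: {rmse:.2f} K")
```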

#### Time series comparison
In this section, we compare time series from the various datasets at a few chosen sites.
First, we define our test site(s) and select only the data at those locations:

In [None]:
selection = {"lat": 52.5, "lon": 0.0, "method": "nearest"}
# Nearest is necessary because of floating-point errors

In [None]:
timeseries_agera5 = data_agera5.sel(**selection)
timeseries_era5land = data_era5land.sel(**selection)

We now create a plot showing all variables:

In [None]:
# Create figure
fig, axs = plt.subplots(nrows=8, figsize=(10, 20), sharex=True, layout="constrained")

# Plot individual time series
for j, timeseries in enumerate([timeseries_agera5, timeseries_era5land]):
    for ax, var in zip(axs, variables):
        try:  # Account for missing variables
            timeseries[var].plot(ax=ax)
        except KeyError:
            print(f"KeyError: no variable `{var}` in dataset {j}")

# Visual settings
for ax in axs:
    ax.grid(True, axis="both", linestyle="-")
    ax.tick_params("x", labelbottom=True)
    ax.set_xlabel("")
    ax.set_title("")

fig.suptitle(f"Time series comparison at ({selection['lat']} °N, {selection['lon']} °E)")

# Show result
plt.show()

#### Geospatial comparison

In [None]:
domain = ekp.geo.domains.union(["United Kingdom", "Ireland"], name="UK & Ireland")

In [None]:
agera5_oneday = data_agera5.sel(time="20240101")

In [None]:
ekp.quickplot(agera5_oneday[["Solar_Radiation_Flux", "crs"]], domain=domain)

In [None]:
era5land_oneday = data_era5land.sel(time="20240101")

In [None]:
ekp.quickplot(era5land_oneday["Solar_Radiation_Flux"], domain=domain)

In [None]:
chart = ekp.Map(domain=domain)
chart.contourf(era5land_oneday, z="Temperature_Air_2m_Mean_24h", units="celsius", levels={"step": 1})
chart.land()
chart.coastlines()
chart.gridlines()
chart.legend()
chart.title()
chart.show()

## ℹ️ If you want to know more

### Key resources

List some key resources related to this assessment. E.g. CDS entries, applications, dataset documentation, external pages.
Also list any code libraries used (if applicable).

Code libraries used:
* Earthkit
  * [earthkit-data](https://github.com/ecmwf/earthkit-data)
  * [earthkit-plots](https://github.com/ecmwf/earthkit-plots)

### References
[[CDS AgERA5]](https://doi.org/10.24381/cds.6c68c9bb) Boogaard, H., Schubert, J., De Wit, A., Lazebnik, J., Hutjes, R., Van der Grijn, G. (2020): Agrometeorological indicators from 1979 to present derived from reanalysis. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.6c68c9bb (Accessed on DD-MMM-YYYY)