# Agricultural simulations using AgERA5: Regional intercomparison between reanalysis datasets

Production date: 2025-MM-DD

**Please note that this repository is used for development and review, so quality assessments should be considered work in progress until they are merged into the main branch.**

Dataset version: 2.0.

Produced by: Olivier Burggraaff (National Physical Laboratory).

## 🌍 Use case: Agricultural yield estimation based on reanalysis data

## ❓ Quality assessment question
* Which reanalysis dataset should I use for simulations of crop yields?
* Is AgERA5 fit-for-purpose as an input dataset for crop models?

Reliable estimation of crop yields using models such as WOrld FOod STudies ([WOFOST](https://github.com/ajwdewit/pcse)) [[De Wit+19](https://doi.org/10.1016/j.agsy.2018.06.018)] depends on reliable and fit-for-purpose meteorological data.
Crop models simulate growth and yield based on daily weather variables including temperature, solar irradiation, vapour pressure, wind speed, and precipitation (snow/rain).
Local simulations may make use of local observational data,
such as individual weather stations,
but these do not always provide sufficient spatial and temporal coverage and consistency.

Reanalyses
like ECMWF Reanalysis v5 (ERA5) [[Hersbach+20](https://doi.org/10.1002/qj.3803)]
provide consistent meteorological data with gap-free spatial and temporal coverage by combining historical observations with modelling.
Thanks to these advantages,
ERA5 and its higher-resolution derivative ERA5-Land [[Muñoz-Sabater+21](https://doi.org/10.5194/essd-13-4349-2021)]
are now commonly used to drive crop models [[Evenflow+24](https://climate.copernicus.eu/sites/default/files/2024-12/Value-generated-by-ERA5-full-report.pdf)].
However,
these datasets were not designed specifically for agriculture and therefore require considerable processing before they can be used in crop models.

The [*Agrometeorological indicators from 1979 to present derived from reanalysis*](https://doi.org/10.24381/cds.6c68c9bb) or AgERA5 dataset fills this gap by providing agriculturally relevant variables in a ready-to-use format.
Compared to ERA5,
AgERA5 is downscaled from 0.25° to 0.1° spatial resolution,
aggregated to daily resolution in local time zones,
and
bias-adjusted.
It is accessible from the CDS directly [[AgERA5 dataset](https://doi.org/10.24381/cds.6c68c9bb)]
or through the [`agera5tools`](https://github.com/ajwdewit/agera5tools) Python package.
AgERA5 has become very popular for crop modelling and climate analysis, among other use cases [[De Wit+24](https://climate.copernicus.eu/sites/default/files/custom-uploads/7th%20GA%20C3S/Presentations/Day%203/S1/05-s19.06.24_AgERA5UserPerspective_AllarddeWit_v1.pdf)].

Here, we assess
the fitness-for-purpose of AgERA5 for agricultural studies
through an intercomparison with ERA5-Land.
The assessment focuses on
the ease of use of both datasets for this particular use case
as well as
the quantitative difference between AgERA5 and ERA5-Land for agriculturally relevant variables.
Both datasets are derived from ERA5,
with differences in their aggregation and downscaling methods as well as in the variables provided.
Hence,
this intercomparison is not an independent validation of either dataset,
but rather a consistency check between closely related products.
Quality assessments for ERA5 and ERA5-Land can be found in [the relevant chapter of this website](../Reanalyses/reanalysis.md).

Agricultural intercomparisons are best performed regionally
to avoid confusing differences between datasets with physical differences between regions.
Accordingly,
this notebook is focused on one region in one growth season,
but it is written in such a way
that it can serve as a template for the user to pick up and apply to their own desired region or time frame.

## 📢 Quality assessment statement

```{admonition} These are the key outcomes of this assessment
:class: note
* The AgERA5 dataset ("Agrometeorological indicators from 1979 to present derived from reanalysis") is well-suited to agricultural studies because it provides daily aggregates and statistics of important parameters such as irradiation, temperature, vapour pressure, precipitation, wind speed, and snow depth. 
* The resolution of AgERA5 (daily, 0.1°) is comparable to other data sources and is sufficient for simulations at similar spatial scales. Comparisons can be made at scales larger than the cell size, i.e. 0.1° or ~11 km.
* The AgERA5 and ERA5-Land datasets, both derived from ERA5, provide different values due to their different methods for downscaling and bias adjustment. While these differences are (in some cases) statistically significant, they are small compared to the typical uncertainty in ERA5 and compared to the uncertainties in agricultural models. Differences may be larger in specific areas, e.g. at higher elevation.
* etc
```

## 📋 Methodology
This quality assessment tests the
ease of use
of the [*Agrometeorological indicators from 1979 to present derived from reanalysis*](https://doi.org/10.24381/cds.6c68c9bb) (AgERA5) dataset
and its consistency with
ERA5-Land ([*ERA5-Land hourly data from 1950 to present*](https://doi.org/10.24381/cds.e2161bac) and [*ERA5-Land post-processed daily statistics from 1950 to present*](https://doi.org/10.24381/cds.e9c9c792)).
It focuses on the following variables of interest for a crop model such as [PCSE/WOFOST](https://github.com/ajwdewit/pcse):

| Variable name     | Statistic (24h)                | Unit     | Example assessment |
|-------------------|--------------------------------|----------|--------------------|
| Solar irradiation | Total                          | MJ/m²/day| [[Araghi+22](https://doi.org/10.1016/j.eja.2021.126419)] |
| Temperature (2m)  | Mean <br> Minimum <br> Maximum | °C       | [[Kruger+24](https://doi.org/10.17159/sajs.2024/16043), [Hasan Karaman+23](https://doi.org/10.1016/j.asr.2023.02.006)] |
| Rain              | Total                          | cm/day   | [[Esquivel-Arriaga+24](https://doi.org/10.1175/JAMC-D-23-0227.1), [Suraweera+24](https://doi.org/10.1109/MERCon63886.2024.10689062)] |
| Wind speed (2m)   | Mean                           | m/s      |  |
| Snow depth        | Mean                           | cm       |  |
<!--
| Vapour pressure   | Mean                           | kPa      |  |
[Source](https://pcse.readthedocs.io/en/stable/code.html#pcse.base.WeatherDataContainer)
-->
AgERA5 also provides reference evapotranspiration values derived from the variables described above.
Example assessments comparing evapotranspiration between AgERA5 and other datasets include [[Garbanzo+25](https://doi.org/10.3390/hydrology12070161), [Garcia-Prats+25](https://doi.org/10.1016/j.ejrh.2025.102531)].

Both AgERA5 and ERA5-Land provide wind speed at a height of 10 m,
following ERA5,
because this is the standard height for wind measurements in meteorology.
The wind speed at 2 m,
which crop models typically use for their evapotranspiration calculations,
can be estimated as 0.75 times the wind speed at 10 m [[Allen+98](https://www.fao.org/4/x0490e/x0490e00.htm)].
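
The factor of 0.75 follows from the logarithmic wind profile given in FAO-56 (Eq. 47 in Allen+98), u(2 m) = u(z) × 4.87 / ln(67.8 z − 5.42), evaluated at z = 10 m.
The sketch below evaluates this expression; `wind_conversion_factor` is a hypothetical helper name used only for illustration and is not part of this notebook's code.

```python
import math

def wind_conversion_factor(z: float) -> float:
    """ Factor converting wind speed measured at height z [m] to 2 m (FAO-56 logarithmic profile). """
    return 4.87 / math.log(67.8 * z - 5.42)

factor_10m = wind_conversion_factor(10.)
print(f"{factor_10m:.3f}")  # 0.748
```

Rounded to two decimals, this gives the factor of 0.75 used in this assessment.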

An important quantity derived from temperature is the number of _growing degree-days_ (°C d) or _thermal time_ (TSUM in WOFOST).
Growing degree-days are essentially the sum of the _effective daily temperature_ T{sub}`eff` over time,
with T{sub}`eff` the temperature relative to 
a base temperature T{sub}`base`, below which crop development stops, 
and
capped by a temperature T{sub}`cap`, above which crop development does not increase [[De Wit+20](https://research.wur.nl/en/publications/system-description-of-the-wofost-72-cropping-systems-model), [Ceglar+19](https://doi.org/10.1016/j.agsy.2018.05.002)]:
<!-- \qquad used because amsmath is not active, aligning with & in $$ is tricky -->

$$
T_\text{eff} &= 0                 &\qquad\text{if}&&\                  T \leq T_\text{base} \\
T_\text{eff} &= T - T_\text{base} &\qquad\text{if}&&\  T_\text{base} < T < T_\text{cap}     \\
T_\text{eff} &= T_\text{cap} - T_\text{base} &\qquad\text{if}&&\       T \geq T_\text{cap}  \\
$$

The cumulative thermal time or total growing degree-days, TSUM, is the sum of T{sub}`eff` over time:

$$
\text{TSUM} = \sum_t T_\text{eff}(t)
$$

In many crop models, growing degree-days control the development stages of a crop,
with thresholds set for emergence, anthesis, and maturity.
Specific values depend on the crop, cultivar, site, model, etc.
As an example,
temperate maize has T{sub}`base` of 4 °C and T{sub}`cap` of 30 °C, and TSUM of 110 °C d for emergence, 695 °C d for emergence–anthesis, and 800 °C d for anthesis–maturity
in the [WOFOST crop parameter sets](https://github.com/ajwdewit/WOFOST_crop_parameters).
These are the values of T{sub}`base` and T{sub}`cap` used in this assessment;
if running the notebook yourself, these values can be customised by editing the `growing_degree_days` function in the [code setup](section-setup).
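
To illustrate how such thresholds are applied, the sketch below accumulates growing degree-days for a made-up series of daily mean temperatures and finds the day on which the emergence threshold of 110 °C d is reached.
The temperatures are illustrative only and stay between T{sub}`base` and T{sub}`cap`, so the cap is never reached; `days_to_threshold` is a hypothetical helper, not part of WOFOST or of this notebook.

```python
from itertools import accumulate

T_BASE = 4.  # °C, base temperature for temperate maize (see text)
TSUM_EMERGENCE = 110.  # °C d, threshold for emergence (see text)

def days_to_threshold(temperatures, tsum_threshold, T_base=T_BASE):
    """ Number of days until the cumulative growing degree-days reach the threshold; None if never reached. """
    # Effective temperature: 0 below T_base, T - T_base otherwise
    # (assumes all temperatures are below T_cap)
    T_eff = [max(T - T_base, 0.) for T in temperatures]
    for day, tsum in enumerate(accumulate(T_eff), start=1):
        if tsum >= tsum_threshold:
            return day
    return None

# Made-up example: 30 days at a constant 15 °C -> 11 °C d per day
temperatures = [15.] * 30
print(days_to_threshold(temperatures, TSUM_EMERGENCE))  # 10
```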

A full crop model run is outside the scope of this assessment,
but the effective daily temperature and cumulative growing degree-days will be included in the analysis.
There are other agriculturally relevant cumulative quantities relating to moisture, irradiation, and nutrient absorption, but these are considerably more complex and therefore also out of scope here.

The analysis and results are organised in the following steps, which are detailed in the sections below:

**[](section-setup)**
* Import all required libraries.
* Define helper functions.

**[](section-download)**
 * Download data from AgERA5.
 * Download data from ERA5-Land.
 * Pre-process data.

**[](section-results)** 
 * Geospatial comparison
 * Point-by-point comparison
 * Time series comparison

## 📈 Analysis and results

(section-setup)=
### 1. Code setup

```{note}
This notebook uses [earthkit](https://github.com/ecmwf/earthkit) for 
downloading ([earthkit-data](https://github.com/ecmwf/earthkit-data)) 
and 
visualising ([earthkit-plots](https://github.com/ecmwf/earthkit-plots)) data.
Because earthkit is in active development, some functionality may change after this notebook is published.
If any part of the code stops functioning, please raise an issue on our GitHub repository so it can be fixed.
```

#### Import required libraries
In this section, we import all the relevant packages needed for running the notebook.

In [1]:
# Input / Output
from pathlib import Path
import earthkit.data as ekd
import warnings

# General data handling
import numpy as np
import pandas as pd
import xarray as xr
from functools import partial, wraps

# Visualisation
import pprint  # Pretty-print
import earthkit.plots as ekp
from earthkit.plots.styles import Style
import matplotlib.pyplot as plt
plt.rcParams["grid.linestyle"] = "--"
from tqdm import tqdm  # Progress bars

# Visualisation in Jupyter book -- automatically ignored otherwise
try:
    from myst_nb import glue
except ImportError:
    glue = None

# Type hints for helper functions
from typing import Callable, Optional, Iterable
from earthkit.plots.geo.domains import Domain
AnyDomain = (Domain | str)

#### Helper functions
This section defines some functions and variables used in the following analysis, allowing code cells in later sections to be shorter and ensuring consistency.

##### Data downloading

The following functions help with downloading data within the desired spatial/temporal domain:

In [2]:
# Download within a spatial domain
def domain_to_request(domain: ekp.geo.domains.Domain) -> dict:
    """ From an earthkit-plots domain, generate a request for earthkit-data / cdsapi. """
    bbox = domain.bbox.to_latlon_bbox()

    # Round outwards to whole degrees and add a 1° margin on each side
    north = int(np.ceil(bbox.north) + 1)
    south = int(np.floor(bbox.south) - 1)
    west = int(np.floor(bbox.west) - 1)
    east = int(np.ceil(bbox.east) + 1)
    
    area = [north, west, south, east]
    return {"area": area}

# Pretty-print a request (dict)
print_request = partial(pprint.pp, compact=True)
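
As an illustration of the rounding in `domain_to_request`, the sketch below applies the same logic to plain numbers.
The bounding box values are made up, and `pad_bbox` is a hypothetical stand-in that does not use earthkit.

```python
import math

def pad_bbox(north: float, west: float, south: float, east: float) -> list[int]:
    """ Round a bounding box outwards to whole degrees and add a 1° margin on each side. """
    return [math.ceil(north) + 1, math.floor(west) - 1,
            math.floor(south) - 1, math.ceil(east) + 1]

# Made-up bounding box (N, W, S, E) in degrees
area = pad_bbox(53.6, 3.2, 50.7, 7.3)
print({"area": area})  # {'area': [55, 2, 49, 9]}
```

The resulting list follows the same north, west, south, east order as the `area` entry returned by `domain_to_request`.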

##### Data (pre-)processing

The following functions handle [data chunking in dask](https://docs.xarray.dev/en/latest/user-guide/dask.html) for computational efficiency:

In [3]:
# Rechunking of data to speed up dask calculations
# and maintain grib / netcdf compatibility
def rechunk(ds: xr.Dataset) -> xr.Dataset:
    """ Rechunk a dataset `ds` based on pre-set memory requirements. """
    # Assign new chunk sizes
    chunks = {"time": -1, "lat": 200, "lon": 200,}
    return ds.chunk(chunks)

The following cell defines the order in which data variables are shown in tables and figures, for consistency:

In [4]:
# Pre-defined variables to ensure consistent order
# These lists will be used as defaults in most functions below
# ALL-CAPS to signify these are constants
VARIABLES_TEMPERATURE = [
                         "Temperature_Air_2m_Max_24h",
                         "Temperature_Air_2m_Mean_24h",
                         "Temperature_Air_2m_Min_24h",
                        ]

VARIABLES_NOT_TEMPERATURE = [
                             "Solar_Radiation_Flux", 
                             "Wind_Speed_2m_Mean_24h",
                             "Precipitation_Flux",
                             "Snow_Thickness_Mean_24h",
                            ]

VARIABLES_DERIVED = [
                     "Temperature_Effective",
                     "Growing_Degree_Days",
                    ]


VARIABLES = VARIABLES_TEMPERATURE + VARIABLES_NOT_TEMPERATURE + VARIABLES_DERIVED
VARIABLES_RELATIVE = VARIABLES_NOT_TEMPERATURE + VARIABLES_DERIVED  # Variables for which relative differences make sense (not temperature)

# Shortcuts for display purposes
nvars = len(VARIABLES)
nvars_half = int(np.ceil(nvars / 2))

The following functions handle updating and propagating metadata:

In [5]:
# Metadata handling
def adjust_metadata(data_array: xr.DataArray, **updates) -> xr.DataArray:
    """ Adjust metadata using a new dictionary. Returns a new object. """
    metadata_old = data_array.attrs
    metadata_new = metadata_old | updates
    data_array = data_array.assign_attrs(**metadata_new)
    return data_array

# Adjust metadata to a consistent format
def adjust_names(dataset: xr.Dataset, dataset_name: str) -> xr.Dataset:
    """
    Adjust the names and units of pre-defined variables in a dataset.
    Also adds the name of the dataset to its attrs for easy access.
    """
    # Rename variables
    dataset["Temperature_Air_2m_Max_24h"]  = adjust_metadata(dataset["Temperature_Air_2m_Max_24h"], 
                                                             long_name="Temperature (max)", units="°C")
    dataset["Temperature_Air_2m_Mean_24h"] = adjust_metadata(dataset["Temperature_Air_2m_Mean_24h"], 
                                                             long_name="Temperature (mean)", units="°C")
    dataset["Temperature_Air_2m_Min_24h"]  = adjust_metadata(dataset["Temperature_Air_2m_Min_24h"], 
                                                             long_name="Temperature (min)", units="°C")
    dataset["Solar_Radiation_Flux"]        = adjust_metadata(dataset["Solar_Radiation_Flux"], 
                                                             long_name="Solar irradiation", units="MJ/m²/d")
    dataset["Wind_Speed_10m_Mean_24h"]     = adjust_metadata(dataset["Wind_Speed_10m_Mean_24h"], 
                                                             long_name="Wind speed (10 m)", units="m/s")
    dataset["Precipitation_Flux"]          = adjust_metadata(dataset["Precipitation_Flux"], 
                                                             long_name="Total precipitation", units="mm/d")
    dataset["Snow_Thickness_Mean_24h"]     = adjust_metadata(dataset["Snow_Thickness_Mean_24h"], 
                                                             long_name="Snow depth", units="cm")
    
    # Assign dataset name
    dataset = dataset.assign_attrs({"name": dataset_name})
    return dataset

# Helper: Preserve CRS (decorator)
def preserve_crs(func: Callable) -> Callable:
    """ Decorator that ensures propagation of CRS: find it in the original data and apply it to the result. """
    @wraps(func)
    def func_with_data(data1: xr.Dataset, *args, **kwargs) -> xr.Dataset:
        # Capture CRS from first dataset
        CRS_present = hasattr(data1, "crs")
        if CRS_present:
            crs = data1.crs

        # Apply function as normal
        result = func(data1, *args, **kwargs)

        # Propagate CRS
        if CRS_present:
            result["crs"] = crs
        return result

    return func_with_data
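
The decorator pattern used in `preserve_crs` can be illustrated without xarray.
In the sketch below, plain dictionaries stand in for datasets; `preserve_key` and `subtract` are hypothetical names that do not appear elsewhere in this notebook.

```python
from functools import wraps

def preserve_key(key: str):
    """ Decorator factory: copy `key` from the first argument to the result, if present. """
    def decorator(func):
        @wraps(func)
        def wrapper(data1: dict, *args, **kwargs) -> dict:
            result = func(data1, *args, **kwargs)
            if key in data1:
                result[key] = data1[key]
            return result
        return wrapper
    return decorator

@preserve_key("crs")
def subtract(data1: dict, data2: dict) -> dict:
    """ Subtract the temperature entries; on its own, this drops everything else, including the CRS. """
    return {"temperature": data1["temperature"] - data2["temperature"]}

a = {"temperature": 15.0, "crs": "EPSG:4326"}
b = {"temperature": 14.5, "crs": "EPSG:4326"}
print(subtract(a, b))  # {'temperature': 0.5, 'crs': 'EPSG:4326'}
```

Because the decorator only inspects the first argument, the CRS of the result is taken from the first dataset, as in `preserve_crs`.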

The following functions handle unit conversions, e.g. K to °C:

In [6]:
# Unit conversion
def convert_unit(dataset: xr.Dataset, key: str, conversion: Callable, new_unit: str) -> None:
    """ Convert the units of dataset[key] to new_unit using a conversion function (e.g. lambda x: x*1000 for m to mm), in-place. """
    # Metadata handling
    metadata_old = dataset[key].attrs
    metadata_new = metadata_old | {"units": new_unit}

    # Apply changes
    dataset[key] = conversion(dataset[key]).assign_attrs(**metadata_new)

convert_m2cm = partial(convert_unit, conversion=(lambda x: x*100),    new_unit="cm")  # meter -> centimeter
convert_m2mm = partial(convert_unit, conversion=(lambda x: x*1000),   new_unit="mm")  # meter -> millimeter
convert_K2C  = partial(convert_unit, conversion=(lambda x: x-273.15), new_unit="°C")  # Kelvin -> Celsius
convert_J2MJ = partial(convert_unit, conversion=(lambda x: x/1e6),    new_unit="MJ/m²/d")  # J/m²/d -> MJ/m²/d

The following functions handle harmonisation of coordinates:

In [7]:
def round_coordinates(data: xr.Dataset, d: int=2) -> xr.Dataset:
    """ Round coordinates in `data` to `d` decimals. Hard-coded for lat/lon. """
    # Cannot be a dict comprehension looping over lat/lon: the lambdas would bind the loop variable late and both use "lon"
    round_mapping = {"lat": (lambda data: data["lat"].round(d)),
                     "lon": (lambda data: data["lon"].round(d))}

    return data.assign_coords(round_mapping)
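
Rounding matters because coordinates that are nominally identical can differ by floating-point error, in which case an exact comparison between them fails.
A minimal illustration in plain Python, using made-up longitude values:

```python
# Made-up example: the same nominal longitude obtained in two ways
lon_a = 0.1 + 0.2  # e.g. grid origin + offset
lon_b = 0.3        # e.g. value stored in a file

print(lon_a == lon_b)                      # False: the values differ by floating-point error
print(round(lon_a, 2) == round(lon_b, 2))  # True: identical after rounding to 2 decimals
```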

##### Derived variables
The following functions calculate derived variables such as 2 m wind speed and growing degree-days:

In [8]:
def calculate_2m_wind_speed(wind_10m: xr.DataArray) -> xr.DataArray:
    """ Calculate 2 m wind speed from 10 m wind speed. """
    # Calculate
    wind_2m = 0.75 * wind_10m

    # Fix metadata
    wind_2m.name = "Wind_Speed_2m_Mean_24h"
    metadata = wind_10m.attrs | {"long_name": "Wind speed (2 m)"}
    wind_2m = adjust_metadata(wind_2m, **metadata)

    return wind_2m

def convert_to_2m_wind_speed(data: xr.Dataset) -> xr.Dataset:
    """ Convert 10 m wind speed to 2 m wind speed. """
    # Add new variable
    data["Wind_Speed_2m_Mean_24h"] = calculate_2m_wind_speed(data["Wind_Speed_10m_Mean_24h"])

    # Remove old variable
    data = data.drop_vars(["Wind_Speed_10m_Mean_24h"])

    return data

In [9]:
# Calculate growing degree days
def growing_degree_days(data: xr.Dataset, *,
                        method: str="mean",
                        T_base: float=4., T_cap: float=30.) -> tuple[xr.DataArray, xr.DataArray]:
    """ For a dataset `data` with temperatures, calculate the effective daily temperature and the growing degree-days accumulated over the entire temporal range. """
    # Use min/max temperatures
    if method == "minmax":
        T_mean = (data["Temperature_Air_2m_Max_24h"] + data["Temperature_Air_2m_Min_24h"]) / 2
    elif method == "mean":
        T_mean = data["Temperature_Air_2m_Mean_24h"]
    else:
        raise ValueError(f"Unrecognised growing_degree_days method '{method}' -- please use 'minmax' or 'mean'")

    # Calculate T_eff
    T_eff = T_mean - T_base  # General case
    T_eff = T_eff.where(T_mean > T_base, 0)  # 0 if T <= T_base
    T_eff = T_eff.where(T_mean < T_cap, T_cap - T_base)  # Capped at T_cap - T_base if T >= T_cap
    T_eff = adjust_metadata(T_eff, long_name="Effective temperature", units="°C")

    # Aggregate TSUM
    TSUM = T_eff.cumsum("time")
    TSUM = adjust_metadata(TSUM, long_name="Σ Growing degree-days", units="°C d")

    # Propagate NaN
    T_eff = T_eff.where(T_mean.notnull(), np.nan)
    TSUM = TSUM.where(T_mean.notnull(), np.nan)

    return T_eff, TSUM

# Convenience: Add growing degree days to dataset
def add_growing_degree_days(data: xr.Dataset, **kwargs) -> xr.Dataset:
    """ Calculate growing degree days (daily, aggregate) and add them to the input dataset. """
    T_eff, TSUM = growing_degree_days(data, **kwargs)
    data["Temperature_Effective"] = T_eff
    data["Growing_Degree_Days"] = TSUM
    return data

##### Statistics
The following functions calculate the difference (absolute / relative) between datasets, handling metadata etc.:

In [10]:
# Labels
def label_with_unit(data: xr.Dataset, key: str, *, linebreak=False, unit: Optional[str]=None) -> str:
    """ Extract the full name with unit for a key in a dataset. """
    long_name = data[key].long_name  # Variable name
    unit = data[key].units if unit is None else unit  # If unit is specified -> use that ; else -> use metadata unit
    spacer = "\n" if linebreak else " "
    return f"{long_name}{spacer}[{unit}]"

In [11]:
# Constants
NONZERO_THRESHOLD = 1e-5
NONZERO_THRESHOLD_PCT = 0.1

# Difference between datasets
@preserve_crs
def difference_between_datasets(data1: xr.Dataset, data2: xr.Dataset, *,
                                diff_variables: Iterable[str]=VARIABLES) -> xr.Dataset:
    """ Calculate the difference between two datasets, preserving CRS and updating metadata. """
    # Subtract
    difference = data1[diff_variables] - data2[diff_variables]

    # Adjust metadata
    for var in diff_variables:
        old_metadata = data1[var].attrs
        updated_metadata = old_metadata | {"long_name": r"Δ " + old_metadata["long_name"]}
        difference[var] = adjust_metadata(difference[var], **updated_metadata)

    # Add name
    name1, name2 = [dataset.name if hasattr(dataset, "name") else "<unspecified>" for dataset in (data1, data2)]
    difference = difference.assign_attrs({"name": f"Difference: {name1} – {name2}"})
        
    return difference

@preserve_crs
def relative_difference_between_datasets(data1: xr.Dataset, data2: xr.Dataset, *,
                                         reldiff_variables: Iterable[str]=VARIABLES_RELATIVE) -> xr.Dataset:
    """
    Calculate the relative [%] difference between two datasets, preserving CRS and updating metadata.
    Relative difference is calculated symmetrically, i.e. divided by (data1 + data2)/2.
    Where data1 == 0 and data2 == 0, the relative difference is set to 0 too.
    """
    # Select and calculate
    data1, data2 = [dataset[reldiff_variables] for dataset in (data1, data2)]
    
    relative_difference = (data1 - data2) / (data1 + data2) * 200.

    # Replace 0/0 with 0
    both_zero = ((data1 + data2) <= NONZERO_THRESHOLD)  # Threshold slightly > 0 because of floating-point errors
    relative_difference = relative_difference.where(~both_zero, 0.)

    # Adjust metadata
    for var in reldiff_variables:
        old_metadata = data1[var].attrs
        updated_metadata = old_metadata | {"long_name": r"rΔ " + old_metadata["long_name"], "units": "%"}
        relative_difference[var] = adjust_metadata(relative_difference[var], **updated_metadata)

    # Add name
    name1, name2 = [dataset.name if hasattr(dataset, "name") else "<unspecified>" for dataset in (data1, data2)]
    relative_difference = relative_difference.assign_attrs({"name": f"% Difference: {name1} – {name2}"})

    return relative_difference

def comparison_statistics(data1: xr.Dataset, data2: xr.Dataset, *,
                          diff_variables: Iterable[str]=VARIABLES,
                          reldiff_variables: Iterable[str]=VARIABLES_RELATIVE) -> pd.DataFrame:
    """
    Given two datasets, calculate a number of statistics for each variable and return the result in a table.
    """
    # Calculate differences
    differences     =          difference_between_datasets(data1, data2, diff_variables=diff_variables)
    differences_rel = relative_difference_between_datasets(data1, data2, reldiff_variables=reldiff_variables)

    # Convert to pandas
    differences = differences.to_dataframe()[diff_variables]
    differences_abs = differences.abs()

    differences_rel = differences_rel.to_dataframe()[reldiff_variables]
    differences_rel_abs = differences_rel.abs()

    # Calculate aggregate statistics
    md   =               differences.agg(["mean", "median"])  \
                                    .rename({"median": r"Median Δ", "mean": "Mean Δ"})
    mad  =           differences_abs.agg(["median"])  \
                                    .rename({"median": r"Median |Δ|"})
    mapd =       differences_rel_abs.agg(["median"])  \
                                    .rename({"median": r"Median |Δ| [%]"})
    md, mad, mapd = [df.T for df in (md, mad, mapd)]

    # Calculate correlation coefficients
    corrs = {var: xr.corr(data1[var], data2[var]).values 
             for var in diff_variables}
    corrs = pd.DataFrame.from_dict(corrs, orient="index", columns=["Pearson r"])
    
    # Combine statistics into one dataframe
    stats = pd.concat([md, mad, mapd, corrs], axis=1)

    return stats

def display_difference_stats(data1: xr.Dataset, data2: xr.Dataset, *args, **kwargs) -> "pd.io.formats.style.Styler":
    """ Given two datasets, calculate a number of statistics for each variable and display the result in a table. """
    # Helper function for displaying variables / units nicely
    # Uses <br> instead of "\n" for HTML output
    display_var = lambda var: label_with_unit(data1, var, linebreak=True).replace("\n", "<br>")

    # Calculate and format statistics
    comparison_stats = comparison_statistics(data1, data2, *args, **kwargs)
    formatted = comparison_stats.style \
                                .format(precision=4)  \
                                .format_index(display_var) \
                                .set_caption("AgERA5 – ERA5-Land")
    return formatted

def timeseries_statistics(data: xr.Dataset, *, coords=("lat", "lon")) -> tuple[xr.Dataset, xr.Dataset]:
    """ For a given dataset, provide statistics (median, standard deviation) calculated across the spatial dimensions. """
    return data.median(dim=coords), data.std(dim=coords)
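
The symmetric relative difference used in `relative_difference_between_datasets` can be checked on single numbers.
The sketch below repeats the same formula and zero-handling for scalar, non-negative inputs; the input values are made up, and `symmetric_relative_difference` is a hypothetical name used for illustration only.

```python
NONZERO_THRESHOLD = 1e-5

def symmetric_relative_difference(a: float, b: float) -> float:
    """ Relative difference [%] between non-negative a and b, relative to their mean; 0 if both are (near) zero. """
    if (a + b) <= NONZERO_THRESHOLD:
        return 0.
    return (a - b) / (a + b) * 200.

print(symmetric_relative_difference(3., 1.))  # 100.0
print(symmetric_relative_difference(1., 3.))  # -100.0
print(symmetric_relative_difference(5., 0.))  # 200.0
print(symmetric_relative_difference(0., 0.))  # 0.0
```

Swapping the inputs only flips the sign, and the result is bounded between −200% and +200% for non-negative inputs.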

The following functions aid in sub-selecting data, e.g. extracting time series:

In [12]:
# Subselection of data
def select_in_multiple_datasets(*datasets: xr.Dataset, method: str="nearest", **kwargs) -> list[xr.Dataset]:
    """ Extract the same selection (e.g. one site, time series, ...) from any number of datasets. """
    datasets_selected = [dataset.sel(method=method, **kwargs) for dataset in datasets]
    return datasets_selected
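
With `method="nearest"`, xarray selects the grid cell closest to the requested coordinate along each dimension.
The sketch below mimics this for a single coordinate in plain Python; the two grids are made up, and `nearest_value` is a hypothetical helper.

```python
def nearest_value(grid: list[float], target: float) -> float:
    """ Return the grid coordinate closest to `target`. """
    return min(grid, key=lambda value: abs(value - target))

# Made-up latitude grids with the same 0.1° spacing but a different origin
grid_a = [50.0, 50.1, 50.2, 50.3]
grid_b = [50.05, 50.15, 50.25, 50.35]

site_lat = 50.17
print(nearest_value(grid_a, site_lat))  # 50.2
print(nearest_value(grid_b, site_lat))  # 50.15
```

If the grids of two datasets are offset like this, the same request returns slightly different locations for each dataset.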

##### Visualisation

The following cell defines [earthkit-plots styles](https://earthkit-plots.readthedocs.io/en/latest/_api/plots/styles/index.html) for the variables in the datasets.
These styles define the colour maps and colour bar ranges for each quantity.
Earthkit-plots styles are explained further in the [corresponding documentation](https://earthkit-plots.readthedocs.io/en/latest/examples/examples/examples.html#Styles).

In [13]:
# Styles for indicators
n_diff = 9  # Levels in difference charts

# Temperature
_style_t        = {"cmap": plt.cm.YlOrBr.resampled(14),    "vmin": -5,  "vmax": 30, "extend": "both"}
_style_t_diff   = {"cmap": plt.cm.RdBu.resampled(n_diff),  "vmin": -3,  "vmax": 3,  "extend": "both"}
_style_gdd      = {"cmap": plt.cm.YlOrRd.resampled(14),    "vmin": 0,   "vmax": 3e3,"extend": "max"}
_style_gdd_diff = {"cmap": plt.cm.RdBu.resampled(n_diff),  "vmin": -50, "vmax": 50, "extend": "both"}

# Irradiation
_style_ssrd      = {"cmap": plt.cm.YlOrRd.resampled(12),   "vmin": 0,   "vmax": 30, "extend": "max"}
_style_ssrd_diff = {"cmap": plt.cm.RdBu.resampled(n_diff), "vmin": -2,  "vmax": 2,  "extend": "both"}

# Wind speed
_style_wind      = {"cmap": plt.cm.Purples.resampled(12),  "vmin": 0,   "vmax": 12, "extend": "max"}
_style_wind_diff = {"cmap": plt.cm.PuOr.resampled(n_diff), "vmin": -3,  "vmax": 3,  "extend": "both"}

# Precipitation
_style_tp        = {"cmap": plt.cm.GnBu.resampled(10),     "vmin": 0,   "vmax": 20, "extend": "max"}
_style_tp_diff   = {"cmap": plt.cm.BrBG.resampled(n_diff), "vmin": -3,  "vmax": 3,  "extend": "both"}
_style_snow      = {"cmap": plt.cm.GnBu.resampled(8),      "vmin": 0,   "vmax": 16, "extend": "max"}
_style_snow_diff = {"cmap": plt.cm.BrBG.resampled(n_diff), "vmin": -15, "vmax": 15, "extend": "both"}

# Individual styles
# Set up like this so they can still be edited individually
styles = {
    "Temperature_Air_2m_Max_24h":  Style(**_style_t),      "Temperature_Air_2m_Max_24h_diff":  Style(**_style_t_diff),
    "Temperature_Air_2m_Mean_24h": Style(**_style_t),      "Temperature_Air_2m_Mean_24h_diff": Style(**_style_t_diff),
    "Temperature_Air_2m_Min_24h":  Style(**_style_t),      "Temperature_Air_2m_Min_24h_diff":  Style(**_style_t_diff),
    "Temperature_Effective":       Style(**_style_t),      "Temperature_Effective_diff":       Style(**_style_t_diff),
    "Growing_Degree_Days":         Style(**_style_gdd),    "Growing_Degree_Days_diff":         Style(**_style_gdd_diff),
    "Solar_Radiation_Flux":        Style(**_style_ssrd),   "Solar_Radiation_Flux_diff":        Style(**_style_ssrd_diff),
    "Wind_Speed_10m_Mean_24h":     Style(**_style_wind),   "Wind_Speed_10m_Mean_24h_diff":     Style(**_style_wind_diff),
    "Wind_Speed_2m_Mean_24h":      Style(**_style_wind),   "Wind_Speed_2m_Mean_24h_diff":      Style(**_style_wind_diff),
    "Precipitation_Flux":          Style(**_style_tp),     "Precipitation_Flux_diff":          Style(**_style_tp_diff),
    "Snow_Thickness_Mean_24h":     Style(**_style_snow),   "Snow_Thickness_Mean_24h_diff":     Style(**_style_snow_diff),
}

# Apply general settings
for style in styles.values():
    style.normalize = False

The following functions are helpers for displaying in Jupyter Notebook or Jupyter Book style, adding textboxes with consistent formatting, adjusting axis limits, etc.:

In [14]:
# Visualisation: Helper functions, general
RELATIVE_DIFFERENCE_LIMIT = 50

def _glue_or_show(fig: plt.Figure, glue_label: Optional[str]=None) -> None:
    """
    If `glue` is available, glue the figure using the provided label.
    If not, display the figure in the notebook.
    """
    try:
        glue(glue_label, fig, display=False)
    except TypeError:
        plt.show()
    finally:
        plt.close()

def _add_textbox_to_subplots(text: str, *axs: Iterable[plt.Axes | ekp.Subplot], right=False) -> None:
    """ Add a text box to each of the specified subplots. """
    # Get the plt.Axes for each ekp.Subplot
    axs = [subplot.ax if isinstance(subplot, ekp.Subplot) else subplot for subplot in axs]

    # Set up location
    x = 0.95 if right else 0.05
    horizontalalignment = "right" if right else "left"

    # Add the text
    for ax in axs:
        ax.text(x, 0.95, text, transform=ax.transAxes,
                horizontalalignment=horizontalalignment, verticalalignment="top",
                bbox={"facecolor": "white", "edgecolor": "black", "boxstyle": "round",
                      "alpha": 1})

def _sharexy(axs: np.ndarray, *, which: str="xy") -> None:
    """ Force all of the axes in axs to share x and/or y with the first element. """
    main_ax = axs.ravel()[0]
    for ax in axs.ravel():
        if "x" in which:
            ax.sharex(main_ax)
        if "y" in which:
            ax.sharey(main_ax)

def _symmetric_lim(ax: plt.Axes, which: str=None) -> None:
    """ Adjust the x- or y-lims for one Axes to be symmetric, based on existing values. """
    # Pick axis
    if which == "x":
        getter, setter = ax.get_xlim, ax.set_xlim
    elif which == "y":
        getter, setter = ax.get_ylim, ax.set_ylim
    else:
        raise ValueError(f"_symmetric_lim needs axis 'x' or 'y', was given '{which}'")

    # Apply
    current = getter()
    current = np.abs(current)
    maxlim = np.max(current)
    newlim = (-maxlim, maxlim)
    setter(newlim)

_symmetric_xlim = partial(_symmetric_lim, which="x")
_symmetric_ylim = partial(_symmetric_lim, which="y")

def _set_lim_from_style(ax: plt.Axes, var: str, axis: str="y") -> None:
    """ Retrieve a style from the global styles dict, then use its vmin/vmax to set the x/y lim for this ax. """
    style = styles[var]._kwargs
    if "x" in axis:
        ax.set_xlim(style["vmin"], style["vmax"])
    if "y" in axis:
        ax.set_ylim(style["vmin"], style["vmax"])

def find_percentile(*data_arrays: xr.DataArray, percentile: float, round: Optional[str]=None) -> float:
    """
    Find the specified percentile across all of the provided datasets.
    Used for making consistent colour maps and axis limits.
    """
    data_flat = np.concatenate([arr.to_numpy().ravel() for arr in data_arrays])
    perc = np.nanpercentile(data_flat, percentile)
    if round == "up":
        perc = np.ceil(perc)
    elif round == "down":
        perc = np.floor(perc)
    return perc

cmap_percentile = 0.5
find_vmin = partial(find_percentile, percentile=cmap_percentile)
find_vmax = partial(find_percentile, percentile=100-cmap_percentile)

def subplots_2byN(layout="constrained", **kwargs) -> tuple[plt.Figure, np.ndarray[plt.Axes]]:
    """
    Create a figure with (N/2) x 2 panels (rows x columns), with N the number of variables.
    Return the axes flattened, with any spare panels switched off and dropped (e.g. the 8th panel for 7 variables).
    """
    # Create figure, panels
    fig, axs = plt.subplots(nrows=nvars_half, ncols=2, layout=layout, **kwargs)
    axs = axs.ravel()

    # Hide any spare panels (e.g. the last one if the number of variables is odd)
    for ax in axs[nvars:]:
        ax.set_axis_off()
    axs = axs[:nvars]

    return fig, axs

The following functions are also base helper functions, but specific to geospatial plots:

In [15]:
# Visualisation: Helper functions for geospatial plots
def _spatial_plot_append_subplots(fig: ekp.Figure, *data: xr.Dataset, domain: Optional[AnyDomain]=None, **kwargs) -> list[ekp.Subplot]:
    """ Plot any number of datasets into new subplots in an existing earthkit figure. """
    # Create subplots
    subplots = [fig.add_map(domain=domain) for d in data]

    # Plot
    for subplot, d in zip(subplots, data):
        subplot.grid_cells(d, x="lon", y="lat", **kwargs)

    return subplots

The following functions perform geospatial comparisons between datasets, including the per-pixel difference:

In [16]:
# Visualisation: Plot indicators geospatially
def geospatial_comparison_with_difference(data1: xr.Dataset, data2: xr.Dataset, date: str, *,
                                          variables: Iterable[str]=VARIABLES,
                                          label1: str="AgERA5", label2: str="ERA5-Land",
                                          domain: Optional[AnyDomain]=None,
                                          glue_label: Optional[str]=None) -> None:
    """
    Plot a list of `variables` in two datasets, geospatially.
    A specific date has to be specified.
    """
    # Pre-process: Select data on specified date, calculate difference
    data1_date, data2_date = select_in_multiple_datasets(data1, data2, time=date)
    difference = difference_between_datasets(data1_date, data2_date, diff_variables=variables)

    # Setup indicators
    n_variables = len(variables)
    loop_variables = tqdm(variables, desc="Plotting variables", leave=False)

    # Create figure
    fig = ekp.Figure(rows=n_variables, columns=3, size=(7.5, max(5, 2*n_variables)))

    # Plot indicators
    for var in loop_variables:
        # Plot individual datasets
        subplots_data = _spatial_plot_append_subplots(fig, data1_date, data2_date, domain=domain, 
                                                      z=var, style=styles[var])

        # Plot difference
        subplot_diff, *_ = _spatial_plot_append_subplots(fig, difference, domain=domain,
                                                         z=var, style=styles[f"{var}_diff"])
        # Decorate: Text + Colour bar
        var_label = label_with_unit(data1, var, linebreak=True)
        subplots_data[0].legend(label=var_label, location="left")
        subplot_diff.legend(label="Difference", location="right")

    # Titles on top
    titles = [label1, label2, "Difference"]
    for title, subplot in zip(titles, fig.subplots):
        subplot.ax.set_title(title)

    # Decorate figure
    fig.land()
    fig.coastlines()
    fig.gridlines(linestyle=plt.rcParams["grid.linestyle"])
    fig.title("Geospatial intercomparison: {time:%-d %B %Y}")
    
    # Show result
    _glue_or_show(fig.fig, glue_label)

The following functions perform point-by-point (scatter, histogram) comparisons between datasets:

In [17]:
# Point-by-point comparison - Scatter
def scatter_comparison(data1: xr.Dataset, data2: xr.Dataset, *,
                       label1: str="AgERA5", label2: str="ERA5-Land",
                       n_bins: int=51,
                       glue_label: Optional[str]=None) -> None:
    """
    Plot a list of variables in two datasets, with one-to-one comparisons.
    Hard-coded for VARIABLES, VARIABLES_TEMPERATURE, VARIABLES_NOT_TEMPERATURE.
    """
    # Create figure
    fig, axs = subplots_2byN(figsize=(8, 16))

    # Plot individual scatter plots
    for ax, var in zip(axs, VARIABLES):
        # Get vmin/vmax for lims
        style = styles[var]._kwargs
        ax.hexbin(data1[var].values.ravel(), data2[var].values.ravel(),
                  extent=(style["vmin"], style["vmax"], style["vmin"], style["vmax"]),
                  gridsize=n_bins, mincnt=1, cmap="cividis")

        # Set limits from style, matching colour maps
        _set_lim_from_style(ax, var, axis="xy")
        ax.set_aspect("equal", adjustable="box")
        
        var_label = label_with_unit(data1, var, linebreak=True)
        _add_textbox_to_subplots(var_label, ax)

    # Visual settings
    for ax in axs:
        ax.grid(True, axis="both")
        ax.set_title("")

        # Highlight diagonal
        ax.axline((0, 0), slope=1, color=plt.rcParams["grid.color"], linewidth=0.8, linestyle="-")

    fig.suptitle("Point-by-point comparison across full domain", fontweight="bold")
    fig.supxlabel(label1, fontweight="bold")
    fig.supylabel(label2, fontweight="bold")

    # Show result
    _glue_or_show(fig, glue_label)


# Point-by-point comparison - Histogram
def histogram_comparison(data1: xr.Dataset, data2: xr.Dataset, *,
                         label1: str="AgERA5", label2: str="ERA5-Land",
                         n_bins: int=51,
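                         glue_label: Optional[str]=None) -> None:
    """
    Plot histograms of the absolute (left) and relative (right, where defined) differences between two datasets,
    with one row per variable.
    Hard-coded for VARIABLES, VARIABLES_TEMPERATURE.
    """
    # Calculate difference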
    difference = difference_between_datasets(data1, data2)
    difference_rel = relative_difference_between_datasets(data1, data2)
    
    # Create figure
    fig, axs = plt.subplots(nrows=nvars, ncols=2, figsize=(5, 2*nvars), layout="constrained")
    
    # Share x, y for temperature plots; assume these are the first N panels
    n_temperature_variables = len(VARIABLES_TEMPERATURE)
    axes_temperature     = axs[:n_temperature_variables, 0]
    axes_not_temperature = axs[n_temperature_variables:, 0]
    _sharexy(axes_temperature)
    
    # Share y for absolute/relative non-temperature plots (row-wise)
    for ax_row in axs[n_temperature_variables:]:
        _sharexy(ax_row, which="y")
    
    # Plot differences
    for ax_row, var in zip(axs, VARIABLES):
        # Limits based on styles
        lim = styles[var+"_diff"]._kwargs["vmax"]
    
        bins = np.linspace(-lim, lim, n_bins)
        bins_rel = np.linspace(-RELATIVE_DIFFERENCE_LIMIT, RELATIVE_DIFFERENCE_LIMIT, n_bins)
        ax_row[0].set_xlim(-lim, lim)
        ax_row[1].set_xlim(-RELATIVE_DIFFERENCE_LIMIT, RELATIVE_DIFFERENCE_LIMIT)
    
        # Plot differences
        ax_row[0].hist(difference[var].values.ravel(), bins=bins, color="black")
        try:
            ax_row[1].hist(difference_rel[var].values.ravel(), bins=bins_rel, color="black")
        except KeyError:  # Remove panel if there are no relative differences, e.g. temperature
            ax_row[1].set_axis_off()
        else:  # If panel is active
            ax_row[1].yaxis.set_label_position("right")
            ax_row[1].yaxis.tick_right()
        
        # Axis labels
        ax_row[0].set_xlabel(label_with_unit(difference, var, linebreak=True))
        ax_row[1].set_xlabel(label_with_unit(difference, var, linebreak=True, unit="%"))
        ax_row[0].set_ylabel("Frequency")
    
    # Visual settings
    for ax in axs.ravel():
        ax.grid(True, axis="both")
        ax.set_title("")
    
        # Highlight 0
        if ax.axison:
            ax.axvline(0, color=plt.rcParams["grid.color"], linewidth=1.5, linestyle="-", alpha=0.7)
    
        # Log scale if the histogram is extremely concentrated (usually around 0)
        # if ax.get_ylim()[1] > 1e5:
            # ax.set_yscale("log")
    
    # Figure
    fig.suptitle(f"{difference.name} (overall distribution)")
    fig.align_ylabels()
    
    # Show result
    _glue_or_show(fig, glue_label)

The following functions perform time series comparisons between datasets:

In [18]:
# Time series comparisons
# Consistent styling
COLOURS_DATA = "#0077bb", "#ee7733"
COLOUR_DIFF, COLOUR_DIFF_REL = "#004488", "#bb5566"
ALPHA_TIMESERIES = 0.6

# Consistent plot setup
_subplots_timeseries = partial(plt.subplots, nrows=nvars, figsize=(8, 2*nvars), sharex=True, layout="constrained")

def _plot_mean_and_std(ax: plt.Axes, mean: xr.Dataset, std: xr.Dataset, var: str, *,
                       alpha: Optional[float]=1., **kwargs) -> None:
    """ Plot the mean (line) and std (shaded area) into an ax. """
    m, s = mean[var], std[var]  # Short-hand
    m.plot(ax=ax, alpha=alpha, **kwargs)  # Mean
    ax.fill_between(m["time"], m - s, m + s, alpha=alpha*0.7, **kwargs)  # Spread


# Time series comparison - Values - One site
def timeseries_comparison(data1: xr.Dataset, data2: xr.Dataset, site: dict, *,
                          label1: str="AgERA5", label2: str="ERA5-Land",
                          glue_label: Optional[str]=None) -> None:
    """
    Plot a list of variables in two datasets, with time series comparisons -- here showing the values of variables.
    Hard-coded for VARIABLES, VARIABLES_TEMPERATURE, VARIABLES_NOT_TEMPERATURE.
    """
    # Select site
    timeseries = select_in_multiple_datasets(data1, data2, **site)

    # Create figure
    fig, axs = _subplots_timeseries()

    # Plot time series
    for ax, var in zip(axs, VARIABLES):
        # Plot individual time series
        for ts, c, label in zip(timeseries, COLOURS_DATA, [label1, label2]):
            ts[var].plot(ax=ax, alpha=ALPHA_TIMESERIES, color=c, label=label)

        # Set vertical limits from style, matching colour maps
        _set_lim_from_style(ax, var, axis="y")

        # Label variable
        ax.set_ylabel(label_with_unit(data1, var, linebreak=True))
        _add_textbox_to_subplots(label_with_unit(data1, var), ax)

    # Visual settings
    for ax in axs:
        ax.grid(True, axis="both")
        ax.set_xlabel("")
        ax.set_title("")

        ax.axhline(0, color=plt.rcParams["grid.color"], linewidth=1.5, linestyle="-")
        ax.legend(loc="lower right")

    # Decoration
    axs[-1].set_xlabel("Time")
    fig.suptitle(f"Time series comparison at ({site['lat']} °N, {site['lon']} °E)")
    fig.align_ylabels()

    # Show result
    _glue_or_show(fig, glue_label)


# Time series comparison - Difference - One site
def timeseries_comparison_difference(data1: xr.Dataset, data2: xr.Dataset, site: dict, *,
                                     label1: str="AgERA5", label2: str="ERA5-Land",
                                     glue_label: Optional[str]=None) -> None:
    """
    Plot a list of variables in two datasets, with time series comparisons -- here showing the difference (abs+rel).
    Hard-coded for VARIABLES, VARIABLES_TEMPERATURE, VARIABLES_NOT_TEMPERATURE.
    """
    # Select site
    timeseries = select_in_multiple_datasets(data1, data2, **site)

    # Calculate difference
    difference = difference_between_datasets(*timeseries)
    difference_rel = relative_difference_between_datasets(*timeseries)

    # Create figure
    fig, axs = _subplots_timeseries()

    # Plot time series
    for ax, var in zip(axs, VARIABLES):
        # Plot absolute difference
        difference[var].plot(ax=ax, alpha=ALPHA_TIMESERIES, color=COLOUR_DIFF)
    
        # Set vertical limits from style, matching colour maps
        _set_lim_from_style(ax, var+"_diff", axis="y")
    
        # Label variable
        units = data1[var].units
        ax.set_ylabel(f"Difference\n[{units}]", color=COLOUR_DIFF)
        _add_textbox_to_subplots(label_with_unit(data1, var), ax)
    
        # Make ytick colour match line + ylabel
        ax.tick_params(axis="y", colors=COLOUR_DIFF)
    
        # Plot relative difference
        if var in VARIABLES_RELATIVE:
            # Create second y-axis
            ax2 = ax.twinx()
    
            # Plot relative difference
            difference_rel[var].plot(ax=ax2, alpha=ALPHA_TIMESERIES, color=COLOUR_DIFF_REL)
    
            # Set vertical limits
            ax2.set_ylim(-RELATIVE_DIFFERENCE_LIMIT, RELATIVE_DIFFERENCE_LIMIT)
    
            # Label variable
            ax2.set_ylabel("Difference [%]", color=COLOUR_DIFF_REL)
    
            # Make ytick colour match line + ylabel
            ax2.tick_params(axis="y", colors=COLOUR_DIFF_REL)
    
            # Remove unneeded visuals
            ax2.grid(False)
            ax2.set_title("")

    # Visual settings
    for ax in axs:
        ax.grid(True, axis="both")
        ax.set_xlabel("")
        ax.set_title("")
    
        ax.axhline(0, color=plt.rcParams["grid.color"], linewidth=1.5, linestyle="-")
    
    # Decoration
    axs[-1].set_xlabel("Time")
    fig.suptitle(f"Time series comparison at ({site['lat']} °N, {site['lon']} °E)")
    fig.align_ylabels()
    
    # Show result
    _glue_or_show(fig, glue_label)


# Time series comparison - Difference - All sites
def timeseries_comparison_difference_multi(data1: xr.Dataset, data2: xr.Dataset, *,
                                           label1: str="AgERA5", label2: str="ERA5-Land",
                                           glue_label: Optional[str]=None) -> None:
    """
    Plot a list of variables in two datasets, with time series comparisons -- here showing the difference (abs+rel).
    Hard-coded for VARIABLES, VARIABLES_TEMPERATURE, VARIABLES_NOT_TEMPERATURE.
    """
    # Calculate difference
    difference = difference_between_datasets(data1, data2)
    difference_rel = relative_difference_between_datasets(data1, data2)

    # Calculate statistics
    difference_stats = timeseries_statistics(difference)
    difference_rel_stats = timeseries_statistics(difference_rel)

    # Create figure
    fig, axs = _subplots_timeseries()

    # Plot time series
    for ax, var in zip(axs, VARIABLES):
        # Plot absolute difference
        _plot_mean_and_std(ax, *difference_stats, var, alpha=ALPHA_TIMESERIES, color=COLOUR_DIFF)

        # Set vertical limits from style, matching colour maps
        _set_lim_from_style(ax, var+"_diff", axis="y")
    
        # Label variable
        units = data1[var].units
        ax.set_ylabel(f"Difference\n[{units}]", color=COLOUR_DIFF)
        _add_textbox_to_subplots(label_with_unit(data1, var), ax)

        # Make ytick colour match line + ylabel
        ax.tick_params(axis="y", colors=COLOUR_DIFF)

        # Plot relative difference
        if var in VARIABLES_RELATIVE:
            # Create second y-axis
            ax2 = ax.twinx()

            # Plot relative difference
            _plot_mean_and_std(ax2, *difference_rel_stats, var, alpha=ALPHA_TIMESERIES, color=COLOUR_DIFF_REL)

            # Set vertical limits
            ax2.set_ylim(-RELATIVE_DIFFERENCE_LIMIT, RELATIVE_DIFFERENCE_LIMIT)

            # Label variable
            ax2.set_ylabel("Difference [%]", color=COLOUR_DIFF_REL)

            # Make ytick colour match line + ylabel
            ax2.tick_params(axis="y", colors=COLOUR_DIFF_REL)

            # Remove unneeded visuals
            ax2.grid(False)
            ax2.set_title("")

    # Visual settings
    for ax in axs:
        ax.grid(True, axis="both")
        ax.set_xlabel("")
        ax.set_title("")

        ax.axhline(0, color=plt.rcParams["grid.color"], linewidth=1.5, linestyle="-")

    # Decoration
    axs[-1].set_xlabel("Time")
    fig.suptitle("Time series comparison (overall)")
    fig.align_ylabels()

    # Show result
    _glue_or_show(fig, glue_label)

(section-download)=
### 2. Download data
#### General setup
This notebook uses [earthkit-data](https://github.com/ecmwf/earthkit-data) to download files from the CDS.
If you intend to run this notebook multiple times, it is highly recommended that you [enable caching](https://earthkit-data.readthedocs.io/en/latest/guide/caching.html) to prevent having to download the same files multiple times.
If you prefer not to use earthkit, the following requests can also be used with the [cdsapi module](https://cds.climate.copernicus.eu/how-to-api#linux-use-client-step).
In either case (earthkit-data or cdsapi), it is required to set up a CDS account and API key as explained [on the CDS website](https://cds.climate.copernicus.eu/how-to-api).
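
For readers taking the cdsapi route, the following minimal sketch shows how a request dictionary could be passed to `cdsapi.Client.retrieve`. The helper name `download_with_cdsapi`, the single-day example request, and the target file name are illustrative assumptions and not part of this notebook. The import is deferred so that the helper can be defined without cdsapi installed.

```python
from pathlib import Path


def download_with_cdsapi(dataset_id: str, request: dict, target: str) -> Path:
    """Download one CDS request to `target` using cdsapi (hypothetical helper)."""
    import cdsapi  # Deferred import: only needed when a download is actually performed

    client = cdsapi.Client()  # Uses the API key configured for your CDS account
    client.retrieve(dataset_id, request, target)
    return Path(target)


# Illustrative request, reusing keys that appear in the requests in this notebook
example_request = {
    "version": "2_0",
    "variable": "2m_temperature",
    "statistic": ["24_hour_mean"],
    "year": "2023",
    "month": ["01"],
    "day": ["01"],
    "area": [61, -13, 48, 5],
}

# Uncomment to download (requires a CDS account and API key; the file name is an assumption):
# download_with_cdsapi("sis-agrometeorological-indicators", example_request, "agera5_example.zip")
```

With this approach, opening the downloaded file and converting it to xarray is left to the user, whereas earthkit-data handles both steps.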

We will be downloading multiple datasets in this notebook.
CDS data requests take the form of dictionaries in Python.
When making multiple requests
(e.g. to download data from multiple catalogue entries),
it is convenient to set up a _template_ request with some default parameters.
In this section, we define our template containing those parameters that are constant between datasets: the domain in time and space.
This way, these are guaranteed to be consistent between downloads
and
only need to be changed in one place if you wish to modify the notebook for your own use case.

In this example, we will be looking at data for the United Kingdom and Ireland every day in January–September 2023.
We will also do a time series comparison at one site within this area.
The domain, site, and period are defined in the following cell,
and can be edited when running this notebook yourself:

In [19]:
# Space
domain = ekp.geo.domains.union(["United Kingdom", "Ireland"], name="UK & Ireland")
site = {"lat": 52.5, "lon": 0.0}

# Time
year = 2023
months = range(1, 10)  # [start, end)

The CDS request template is then defined using the values defined above.
Additional information to download specific data variables will be mixed into this template in the following sections.

In [20]:
# Space
request_domain = domain_to_request(domain)

# Time
request_time = {
    "year": year,
    "month": [f"{mo:02}" for mo in months],
    "day": [f"{d:02}" for d in range(1, 32)],  # All days
}

# Template (combining space + time)
request_default = request_domain | request_time

print("Request template:")
print_request(request_default, compact=True)

Request template:
{'area': [61, -13, 48, 5],
 'year': 2023,
 'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
 'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12',
         '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
         '25', '26', '27', '28', '29', '30', '31']}
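
The template lists days 01 to 31 for every month, so it also names combinations such as 30 February that are not calendar dates. Assuming the CDS simply skips such combinations (an assumption here, not something tested in this notebook), the number of days actually returned can be checked with a short sketch. The helper `valid_dates` is hypothetical and uses only the standard library.

```python
from datetime import date

year = 2023
months = range(1, 10)
days = range(1, 32)


def valid_dates(year: int, months, days) -> list[date]:
    """Return the year/month/day combinations that are real calendar dates."""
    dates = []
    for month in months:
        for day in days:
            try:
                dates.append(date(year, month, day))
            except ValueError:  # Not a calendar date, e.g. 31 February
                continue
    return dates


dates = valid_dates(year, months, days)
print(len(dates), dates[0], dates[-1])  # 273 2023-01-01 2023-09-30
```

The count of 273 days is consistent with the length of the time axis in the AgERA5 dataset displayed further below.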


#### AgERA5
Due to size limits on the CDS, the different variables of interest in AgERA5 (temperature, irradiation, etc.) have to be downloaded separately and then combined afterwards.
This is achieved by combining the previously defined template request with parameters specific to AgERA5, and subsequently with parameters specific to each individual variable.
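
The combination relies on Python's dictionary union operator (`|`, available from Python 3.9). The sketch below illustrates its behaviour using small, hypothetical dictionaries (`template`, `dataset_defaults`, `variable_request`) rather than the actual requests in this notebook.

```python
# Hypothetical template and variable-specific parameters, for illustration only
template = {"year": 2023, "month": ["01", "02"], "area": [61, -13, 48, 5]}
dataset_defaults = {"version": "2_0"}
variable_request = {"variable": "2m_temperature", "statistic": ["24_hour_mean"]}

# The union operator returns a new dict and leaves its operands unchanged
request = dataset_defaults | template | variable_request

# When both sides define a key, the right-hand value wins
override = template | {"month": "03"}

print(request)
print(override["month"])  # 03
print(template["month"])  # ['01', '02'], the template itself is not modified
```

Because the right-hand side takes precedence, the order of the operands matters whenever a template and a specific request share a key.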

First, the parameters specific to AgERA5 (dataset ID and version) are defined and combined with the template:

In [21]:
agera5_ID = "sis-agrometeorological-indicators"

request_agera5_default = {
    "version": "2_0",
} | request_default

print(f"Request template for {agera5_ID}:")
print_request(request_agera5_default)

Request template for sis-agrometeorological-indicators:
{'version': '2_0',
 'area': [61, -13, 48, 5],
 'year': 2023,
 'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
 'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12',
         '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
         '25', '26', '27', '28', '29', '30', '31']}


Next, the requests for each of the individual data variables are defined:

In [22]:
# Temperature has to be split into separate requests because of size limits
request_temperature_min = {
    "variable": "2m_temperature",
    "statistic": ["24_hour_minimum"],
}

request_temperature_max = {
    "variable": "2m_temperature",
    "statistic": ["24_hour_maximum"],
}

request_temperature_mean = {
    "variable": "2m_temperature",
    "statistic": ["24_hour_mean"],
}

# Non-temperature variables
request_irradiation = {
    "variable": "solar_radiation_flux",
}

request_wind = {
    "variable": "10m_wind_speed",
    "statistic": ["24_hour_mean"],
}

request_rain = {
    "variable": "precipitation_flux",
}

request_snow = {
    "variable": "snow_thickness",
    "statistic": ["24_hour_mean"],
}

# Compile all requests into one list for easier iteration
requests_agera5_variables = [request_temperature_min, request_temperature_max, request_temperature_mean,
                             request_irradiation,
                             request_rain,
                             request_wind,
                             request_snow,
                            ]

The requests for specific variables are combined with the template and passed to earthkit for download from the CDS.
Earthkit-data downloads the dataset as a [field list](https://earthkit-data.readthedocs.io/en/latest/guide/data.html).
Here, we convert this object to an [xarray dataset](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) for ease of use later (when comparing multiple datasets):

In [23]:
# Final setup for requests: mix template and variable-specific parameters
requests_agera5 = [request_agera5_default | request for request in requests_agera5_variables]

# Show one example
print(f"Example request for one variable from {agera5_ID}:")
print_request(requests_agera5[0])

Example request for one variable from sis-agrometeorological-indicators:
{'version': '2_0',
 'area': [61, -13, 48, 5],
 'year': 2023,
 'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
 'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12',
         '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
         '25', '26', '27', '28', '29', '30', '31'],
 'variable': '2m_temperature',
 'statistic': ['24_hour_minimum']}


In [24]:
# Download data and convert to desired format
data_agera5 = ekd.from_source("cds", agera5_ID, *requests_agera5)  # Download as field list
data_agera5 = data_agera5.to_xarray(compat="equals")  # Convert to xarray dataset
data_agera5 = rechunk(data_agera5)  # Rechunk in Dask for better performance -- not strictly necessary
data_agera5  # Display in notebook

100%|██████████| 7/7 [00:00<00:00, 118.37it/s]

[Dataset display abbreviated: float32 Dask arrays of shape (273, 130, 180), 24.37 MiB each in a single chunk, plus int64 arrays of shape (273,), 2.13 kiB each.]


#### ERA5-Land
The process to download data from ERA5-Land is similar to that for AgERA5 above:
defining a template and variable-specific requests.
A difference is that data will be downloaded from two CDS datasets.
The [*ERA5-Land hourly data from 1950 to present*](https://doi.org/10.24381/cds.e2161bac) dataset contains hourly data for variables like 2 m temperature as well as accumulated data for variables like solar irradiation.
For the use case in this assessment, we only need the daily accumulated data from this dataset.
[*ERA5-Land post-processed daily statistics from 1950 to present*](https://doi.org/10.24381/cds.e9c9c792) provides daily minimum/maximum/mean statistics for the hourly variables, saving us the effort of aggregating these ourselves.
In the following subsections, the data (accumulated and daily statistics) are downloaded and harmonised to the same format as AgERA5.

##### Accumulated data from ERA5-Land
The [documentation for ERA5-Land](https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation#heading-Accumulations) explains:
> The accumulations in the short forecasts of ERA5-Land (with hourly steps from 01 to 24) are treated the same as those in ERA-Interim or ERA-Interim/Land, i.e., they are accumulated from the beginning of the forecast to the end of the forecast step. For example, runoff at day=D, step=12 will provide runoff accumulated from day=D, time=0 to day=D, time=12. The maximum accumulation is over 24 hours, i.e., from day=D, time=0 to day=D+1,time=0 (step=24). For the CDS time, or validity time, of 00 UTC, the accumulations are over the 24 hours ending at 00 UTC i.e. the accumulation is during the previous day.

In practice, this means that one needs to download data for *day+1* (e.g. 2 January 2023) to get the total accumulated value for *day* (1 January 2023).
Hence, for this specific dataset, our existing `request_time` yields accumulated values for 2022-12-31 to 2023-09-29, so we need to request one extra day (2023-10-01) to cover 2023-09-30.
This is achieved by adding a second request for just the extra day to the earthkit-data download.
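
The date bookkeeping described above can be sketched with the standard library. The helper `accumulation_day` is hypothetical and only encodes the rule quoted from the documentation: a value valid at 00 UTC on a given day holds the accumulation over the previous day.

```python
from datetime import date, timedelta


def accumulation_day(validity_day: date) -> date:
    """Day over which a 00 UTC ERA5-Land accumulation was summed: the previous calendar day."""
    return validity_day - timedelta(days=1)


# Validity dates covered by the main request (first and last) and by the extra-day request
first_validity, last_validity = date(2023, 1, 1), date(2023, 9, 30)
extra_validity = date(2023, 10, 1)

print(accumulation_day(first_validity))  # 2022-12-31
print(accumulation_day(last_validity))   # 2023-09-29
print(accumulation_day(extra_validity))  # 2023-09-30
```

The extra request for 1 October therefore supplies the accumulation for 30 September, the last day of the period of interest.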

In [25]:
era5land_ID = "reanalysis-era5-land"

# Default parameters
request_era5land_default = {
    "time": ["00:00"],
    "data_format": "grib",  # Downloading in NetCDF changes the time format, causing a 1-day offset
    "download_format": "unarchived",
} | request_default

print(f"Request template for {era5land_ID}:")
print_request(request_era5land_default)

# Additional day (accounting for accumulated format)
request_extraday = {
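    # Month after the last one requested: range(1, 10) covers months 1-9, so this adds month 10.
    # Note: this does not roll over to January of the next year if `months` ends in December.
    "month": f"{months.stop:02}",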
    "day": "01",
}
request_era5land_default_extraday = request_era5land_default | request_extraday

print(f"\nExtra-day request template for {era5land_ID}:")
print_request(request_era5land_default_extraday)

Request template for reanalysis-era5-land:
{'time': ['00:00'],
 'data_format': 'grib',
 'download_format': 'unarchived',
 'area': [61, -13, 48, 5],
 'year': 2023,
 'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
 'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12',
         '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
         '25', '26', '27', '28', '29', '30', '31']}

Extra-day request template for reanalysis-era5-land:
{'time': ['00:00'],
 'data_format': 'grib',
 'download_format': 'unarchived',
 'area': [61, -13, 48, 5],
 'year': 2023,
 'month': '10',
 'day': '01'}


Next, the requests for each of the individual data variables are defined:

In [26]:
request_irradiation = {
    "variable": ["surface_solar_radiation_downwards"],
}

request_rain = {
    "variable": ["total_precipitation"],
}

requests_era5land_variables = [request_irradiation,
                               request_rain,
                              ]

The requests for specific variables are combined with the template and passed to earthkit for download from the CDS.
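
To illustrate how the `|` operator combines request dictionaries, here is a small sketch with toy values (these are not the full CDS requests, and the names are made up for this example). The operator requires Python 3.9 or later.

```python
# Toy request dictionaries (illustrative values only)
template = {"time": ["00:00"], "data_format": "grib", "month": ["01", "02"]}
extra_day = {"month": "03", "day": "01"}
variable = {"variable": ["total_precipitation"]}

# `|` returns a new dictionary; on duplicate keys, the right-hand value wins
template_extra_day = template | extra_day
request = template_extra_day | variable

print(request)
```

Because the right-hand operand takes precedence, the order of the operands determines which settings override which; the original `template` is left unchanged.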

In [27]:
# Final setup for requests: mix template and variable-specific parameters
requests_era5land          = [request_era5land_default          | request for request in requests_era5land_variables]
requests_era5land_extraday = [request_era5land_default_extraday | request for request in requests_era5land_variables]

# Download data and convert to desired format
data_era5land_accumulated = ekd.from_source("cds", era5land_ID, *requests_era5land, *requests_era5land_extraday)
data_era5land_accumulated = data_era5land_accumulated.to_xarray()



Inspecting the resulting dataset
(in xarray format)
shows that the `forecast_reference_time` coordinate conveniently matches the variable to the day of accumulation.
It is important to note that this is not true if the data are downloaded in NetCDF format,
in which case there is a 1-day offset;
beware of this difference if modifying this notebook yourself.
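
One way to guard against such an offset is to check the time coordinate explicitly after downloading. The sketch below uses a hypothetical helper, `check_time_coverage`, which is not part of this notebook, and synthetic timestamps rather than the downloaded data:

```python
import numpy as np

ONE_DAY = np.timedelta64(1, "D")

def check_time_coverage(times, first, last):
    """Return True if `times` runs daily from `first` to `last` (inclusive)."""
    expected = np.arange(np.datetime64(first, "D"), np.datetime64(last, "D") + ONE_DAY, ONE_DAY)
    actual = np.asarray(times).astype("datetime64[D]")
    return actual.shape == expected.shape and bool(np.all(actual == expected))

# Synthetic example: the expected accumulation days, and the same days shifted by one
expected_times = np.arange(np.datetime64("2022-12-31"), np.datetime64("2023-10-01"), ONE_DAY)
shifted_times = expected_times + ONE_DAY

print(check_time_coverage(expected_times, "2022-12-31", "2023-09-30"))
print(check_time_coverage(shifted_times, "2022-12-31", "2023-09-30"))
```

In this notebook, such a check could presumably be applied to the values of the `forecast_reference_time` coordinate.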

In [28]:
data_era5land_accumulated

##### Daily statistics from the post-processed ERA5-Land dataset
The daily statistics are indexed according to the day they apply to,
like AgERA5,
meaning we do not need to worry about adding extra days here.

In [29]:
era5land_ID = "derived-era5-land-daily-statistics"

request_era5land_default = {
    "time_zone": "utc+00:00",
    "frequency": "1_hourly",
} | request_default

print(f"Request template for {era5land_ID}:")
print_request(request_era5land_default)

Request template for derived-era5-land-daily-statistics:
{'time_zone': 'utc+00:00',
 'frequency': '1_hourly',
 'area': [61, -13, 48, 5],
 'year': 2023,
 'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
 'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12',
         '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
         '25', '26', '27', '28', '29', '30', '31']}


Next, the requests for each of the individual data variables are defined:

In [30]:
# Temperature has to be split into separate requests because of size limits
request_temperature_min = {
    "variable": "2m_temperature",
    "daily_statistic": ["daily_minimum"],
}

request_temperature_max = {
    "variable": "2m_temperature",
    "daily_statistic": ["daily_maximum"],
}

request_temperature_mean = {
    "variable": "2m_temperature",
    "daily_statistic": ["daily_mean"],
}

# Wind is downloaded as two separate components
request_wind_u = {
    "variable": "10m_u_component_of_wind",
    "daily_statistic": ["daily_mean"],
}

request_wind_v = {
    "variable": "10m_v_component_of_wind",
    "daily_statistic": ["daily_mean"],
}

request_snow = {
    "variable": "snow_depth",
    "daily_statistic": ["daily_mean"],
}

requests_era5land_variables = [request_temperature_min, request_temperature_max, request_temperature_mean,
                               request_wind_u, request_wind_v,
                               request_snow,
                              ]

The requests for specific variables are combined with the template and passed to earthkit for download from the CDS.
Here, we download the different variables separately in anticipation of the harmonisation step in the next subsection.

In [31]:
# Final setup for requests: mix template and variable-specific parameters
requests_era5land = [request_era5land_default | request for request in requests_era5land_variables]

# Download data and convert to desired format
data_era5land = [ekd.from_source("cds", era5land_ID, req) for req in requests_era5land]
data_era5land = [ds.to_xarray() for ds in data_era5land]
(data_era5land_temperature_min, data_era5land_temperature_max, data_era5land_temperature_mean,
 data_era5land_wind_u, data_era5land_wind_v,
 data_era5land_snow) = data_era5land

##### Pre-processing
The ERA5-Land dataset is set up differently from AgERA5 and requires some pre-processing before the two can be intercompared.
This involves renaming coordinates and variables, and adjusting units.

For the accumulated data, the following steps are necessary:
* Rename the variables and coordinates to match those in AgERA5.
* Select data within the desired time window only.
* Convert the units for precipitation to mm.

In [32]:
# Rename variables and coordinates
data_era5land_accumulated = data_era5land_accumulated.rename({"ssrd": "Solar_Radiation_Flux",
                                                              "tp": "Precipitation_Flux",
                                                              "forecast_reference_time": "time", 
                                                              "latitude": "lat", "longitude": "lon"})
# Select only relevant dates
# Note: label-based slices in xarray include both endpoints; since the data end on 2023-09-30,
# this selects 2023-01-01 to 2023-09-30
time_window = slice(f"{year}-{months.start}-01", f"{year}-{months.stop}-01")
data_era5land_accumulated = data_era5land_accumulated.sel(time=time_window)

# Unit conversions
convert_m2mm(data_era5land_accumulated, "Precipitation_Flux")

For the daily statistics, the steps are as follows:
* Rename the temperature variables from just `t2m` to maximum, mean, minimum.
* Calculate 10 m wind speed from its U (east–west) and V (north–south) components.
* Rename the snow variable and convert its units to cm.
* Combine the pre-processed variables into one dataset.
* Rename coordinates to match AgERA5's.
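
The wind-speed step in the list above amounts to taking the magnitude of the (U, V) vector. A minimal sketch with made-up values (the array names are illustrative):

```python
import numpy as np

# Illustrative daily-mean wind components in m/s (made-up values)
u10 = np.array([3.0, -1.5, 0.0])  # eastward component
v10 = np.array([4.0, 2.0, -2.5])  # northward component

# Wind speed is the magnitude of the (u, v) vector
wind_speed_10m = np.hypot(u10, v10)  # equivalent to sqrt(u10**2 + v10**2)

print(wind_speed_10m)
```

Note that the magnitude of a daily-mean wind vector can never exceed the daily mean of the hourly wind speeds, because components of opposite sign partly cancel in the averaging.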

In [33]:
# Rename and combine temperature variables
data_era5land_temperature_max  = data_era5land_temperature_max.rename( {"t2m": "Temperature_Air_2m_Max_24h"})
data_era5land_temperature_mean = data_era5land_temperature_mean.rename({"t2m": "Temperature_Air_2m_Mean_24h"})
data_era5land_temperature_min  = data_era5land_temperature_min.rename( {"t2m": "Temperature_Air_2m_Min_24h"})
data_era5land_temperature = xr.merge([data_era5land_temperature_min, data_era5land_temperature_max, data_era5land_temperature_mean], compat="equals")

# Calculate 10 m wind speed
data_era5land_wind = xr.merge([data_era5land_wind_u, data_era5land_wind_v], compat="equals")
data_era5land_wind = data_era5land_wind.assign(
    {"Wind_Speed_10m_Mean_24h": xr.ufuncs.sqrt(data_era5land_wind["u10"]**2 + data_era5land_wind["v10"]**2)}
)
data_era5land_wind = data_era5land_wind.drop_vars(["u10", "v10"])

# Rename snow and convert its units
data_era5land_snow = data_era5land_snow.rename({"sde": "Snow_Thickness_Mean_24h"})
convert_m2cm(data_era5land_snow, "Snow_Thickness_Mean_24h")

# Combine into one dataset
data_era5land = xr.merge([data_era5land_temperature, data_era5land_wind, data_era5land_snow], compat="equals")

# Rename coordinates
data_era5land = data_era5land.rename({"valid_time": "time", "latitude": "lat", "longitude": "lon"})
data_era5land = data_era5land.drop_vars("number")  # Remove unneeded coordinate

Lastly, the accumulated data and daily statistics are combined into one xarray object:

In [34]:
# Rechunk both as Dask arrays with the same chunking
data_era5land_accumulated = rechunk(data_era5land_accumulated)
data_era5land             = rechunk(data_era5land)

# Combine and display
data_era5land = xr.merge([data_era5land_accumulated, data_era5land], compat="equals")
data_era5land  # Display in notebook

Output: an `xarray.Dataset` containing seven arrays of shape (273, 131, 181), i.e. (time, lat, lon), each stored as a single Dask chunk. Two arrays are `float64` (49.39 MiB each) and five are `float32` (24.69 MiB each).


#### Harmonise datasets
Before the data can be analysed, the two datasets (AgERA5 and ERA5-Land) must be aligned in terms of coordinates and variable names.

##### Grid alignment
Both datasets are provided on a regular 0.1° by 0.1° grid, so no regridding is necessary.
However, two steps need to be taken before the data can be compared directly:
* Their representation as floating-point numbers can cause very small differences to appear, which do not reflect any real differences in the data but are difficult for software (in this case xarray) to work with. Knowing that the data are on a regular 0.1° by 0.1° grid, we can simply round all of the coordinates to 2 decimal places to force them to be the same.
* The bounds of the datasets (in space and in time) are slightly different and need to be aligned.
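
The effect of the two steps above can be sketched with plain NumPy arrays. The coordinate values and the size of the floating-point noise are made up for this example:

```python
import numpy as np

# Two versions of a 0.1° latitude axis; the second carries tiny floating-point noise
lat_a = np.array([48.0, 48.1, 48.2, 48.3])
lat_b = np.array([48.1, 48.2, 48.3, 48.4]) + 1e-9

# Without rounding, an inner join (intersection) finds no matching coordinates
common_raw = np.intersect1d(lat_a, lat_b)

# After rounding to 2 decimal places, the shared grid points are recovered
common_rounded = np.intersect1d(np.round(lat_a, 2), np.round(lat_b, 2))

print(len(common_raw), len(common_rounded))
```

This is why the coordinates are rounded before the inner join: without rounding, the join would discard grid cells that are in fact shared.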

In [35]:
# Round coordinates before alignment to avoid floating-point errors
data_agera5 = round_coordinates(data_agera5)
data_era5land = round_coordinates(data_era5land)

# Align data using an inner join
data_agera5, data_era5land = xr.align(data_agera5, data_era5land, join="inner")

##### Units and variable names
Temperatures are provided in K in both datasets; here, we convert these to °C, which is more commonly used in agricultural studies.
Solar irradiation is converted from J/m²/d to MJ/m²/d, which is more intuitive, with 2.45 MJ/m² approximately equivalent to 1 mm of potential water evaporation [[De Wit+20](https://research.wur.nl/en/publications/system-description-of-the-wofost-72-cropping-systems-model)].
The "long" names of variables and units, as stored in xarray metadata, are also harmonised between the two datasets to simplify the analysis steps and figures later on.
This is not strictly necessary, but it is convenient when intercomparing two datasets.
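
The unit conversions described above can be sketched as plain functions. The function names are illustrative and are not the `convert_*` helpers used in this notebook, which operate on xarray datasets:

```python
def kelvin_to_celsius(temperature_k):
    """Convert a temperature from K to °C."""
    return temperature_k - 273.15

def joule_to_megajoule(energy_j):
    """Convert an energy (or daily irradiation) from J to MJ."""
    return energy_j / 1e6

def evaporation_equivalent_mm(irradiation_mj):
    """Approximate potential evaporation in mm, using 2.45 MJ/m² per mm."""
    return irradiation_mj / 2.45

# Made-up example values: 288.15 K and 1.5e7 J/m²/d
temperature_c = kelvin_to_celsius(288.15)
irradiation_mj = joule_to_megajoule(1.5e7)
evaporation_mm = evaporation_equivalent_mm(irradiation_mj)

print(round(temperature_c, 2), irradiation_mj, round(evaporation_mm, 2))
```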

In [36]:
# Convert temperatures to °C
for var in VARIABLES_TEMPERATURE:
    convert_K2C(data_agera5,   var)
    convert_K2C(data_era5land, var)

# Convert irradiation to MJ/m²/d
convert_J2MJ(data_agera5,   "Solar_Radiation_Flux")
convert_J2MJ(data_era5land, "Solar_Radiation_Flux")

# Harmonise variable/unit names
data_agera5   = adjust_names(data_agera5,   "AgERA5")
data_era5land = adjust_names(data_era5land, "ERA5-Land")

#### Derived variables
Lastly, the derived variables of 2 m wind speed, effective temperature, and cumulative growing degree-days are calculated and added to both datasets:

In [37]:
# Add 2 m wind speed (from 10 m wind speed)
data_agera5   = convert_to_2m_wind_speed(data_agera5)
data_era5land = convert_to_2m_wind_speed(data_era5land)

In [38]:
# Add effective temperature and cumulative growing degree-days
data_agera5   = add_growing_degree_days(data_agera5)
data_era5land = add_growing_degree_days(data_era5land)
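
The helper functions `convert_to_2m_wind_speed` and `add_growing_degree_days` are defined elsewhere and are not reproduced here. As a rough, self-contained illustration of what such derived variables can involve, the sketch below assumes the FAO-56 logarithmic wind profile for the height conversion and a simple thresholded temperature sum for the growing degree-days. Both formulas, the function names, and the base temperature are assumptions made for this example and may differ from the notebook's own definitions.

```python
import numpy as np

def wind_speed_2m(wind_speed_z, height_m=10.0):
    """Scale wind speed measured at `height_m` to 2 m (FAO-56 logarithmic profile)."""
    return wind_speed_z * 4.87 / np.log(67.8 * height_m - 5.42)

def growing_degree_days(temperature_mean_c, base_temperature_c):
    """Return daily effective temperature and its cumulative sum (assumed definition)."""
    excess = np.asarray(temperature_mean_c, dtype=float) - base_temperature_c
    effective = np.clip(excess, 0.0, None)  # days below the base contribute nothing
    return effective, np.cumsum(effective)

# Made-up daily mean temperatures in °C and an arbitrary base temperature of 5 °C
effective, gdd = growing_degree_days([3.0, 7.0, 12.0, 4.0], base_temperature_c=5.0)
factor_10m_to_2m = wind_speed_2m(1.0)

print(effective, gdd, round(float(factor_10m_to_2m), 3))
```

Under these assumptions, a 10 m wind speed is scaled by a factor of roughly 0.75 to obtain the 2 m value.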

In [39]:
data_agera5

Output: an `xarray.Dataset` containing nine `float32` arrays of shape (273, 130, 180) (24.37 MiB each) and one `int64` array of shape (273,) (2.13 kiB), each stored as a single Dask chunk.

In [40]:
data_era5land

Output: an `xarray.Dataset` containing nine arrays of shape (273, 130, 180), each stored as a single Dask chunk. Two arrays are `float64` (48.74 MiB each) and seven are `float32` (24.37 MiB each).


(section-results)=
### 3. Results
This section contains the comparison between values retrieved from AgERA5 vs ERA5-Land.
The datasets are compared in three ways:
* Point-by-point: Comparison between individual data points with matching coordinates (in time and space). Provides a quantitative estimate of the agreement between the datasets averaged over the entire domain.
* Time series: Agreement between the datasets over time for one site or an ensemble. Most practically relevant for crop yield modelling, which is based on values at specific times and/or aggregated over the entire time series.
* Geospatial: Distribution across the spatial domain. Displays spatial patterns and highlights specific areas or types of terrain (e.g. coastal, mountain) with better or worse agreement.

These comparisons are followed by a general discussion and conclusions.

#### Point-by-point comparison
In this section, the overall distributions of values in the two datasets, and their differences, are compared.

We first examine some metrics that describe the difference Δ between corresponding (in time and space) pixels:

In [41]:
display_difference_stats(data_agera5, data_era5land)

| Variable | Mean Δ | Median Δ | Median \|Δ\| | Median \|Δ\| [%] | Pearson r |
|---|---|---|---|---|---|
| Temperature (max) [°C] | 0.2329 | 0.2401 | 0.4402 | | 0.9942 |
| Temperature (mean) [°C] | -0.1145 | -0.1044 | 0.3325 | | 0.9957 |
| Temperature (min) [°C] | -0.5084 | -0.4838 | 0.59 | | 0.9895 |
| Solar irradiation [MJ/m²/d] | -0.3077 | -0.2269 | 0.3099 | 2.9868 | 0.9978 |
| Wind speed (2 m) [m/s] | 0.4809 | 0.4702 | 0.5099 | 19.2494 | 0.9088 |
| Total precipitation [mm/d] | -0.0002 | -0.0013 | 0.068 | 12.8833 | 0.991 |
| Snow depth [cm] | -0.0454 | 0.0 | 0.0 | 0.0 | 0.673 |
| Effective temperature [°C] | -0.1066 | -0.0449 | 0.2845 | 3.569 | 0.9958 |
| Σ Growing degree-days [°C d] | -12.238 | -8.3359 | 11.6122 | 2.8305 | 0.9995 |
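
For readers who want to reproduce metrics of this kind on their own data, the sketch below computes comparable statistics with NumPy. It is not the notebook's `display_difference_stats`. The function name and the sign convention (first dataset minus second) are assumptions, and the percentage column is omitted because its normalisation is not specified in this section.

```python
import numpy as np

def difference_stats(values_a, values_b):
    """Summary statistics of the difference between two matched arrays."""
    a = np.asarray(values_a, dtype=float).ravel()
    b = np.asarray(values_b, dtype=float).ravel()

    # Only compare points that are valid in both datasets
    valid = np.isfinite(a) & np.isfinite(b)
    a, b = a[valid], b[valid]
    delta = a - b  # assumed sign convention

    return {
        "mean_delta": float(np.mean(delta)),
        "median_delta": float(np.median(delta)),
        "median_abs_delta": float(np.median(np.abs(delta))),
        "pearson_r": float(np.corrcoef(a, b)[0, 1]),
    }

# Made-up example: the second dataset is offset by +0.5, with one missing value
stats = difference_stats([1.0, 2.0, 3.0, 4.0, np.nan], [1.5, 2.5, 3.5, 4.5, 5.0])
print(stats)
```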


Next, we display the point-by-point comparison in two ways:
a scatter plot showing the relationship between values in the two datasets
(Figure {numref}`{number} <indicator_sis-agrometeorological-indicators_consistency_q01_fig-scatter>`)
and
a histogram showing the overall distribution of differences between values
(Figure {numref}`{number} <indicator_sis-agrometeorological-indicators_consistency_q01_fig-hist>`).

In [42]:
scatter_comparison(data_agera5, data_era5land,
                   glue_label="indicator_sis-agrometeorological-indicators_consistency_q01_fig-scatter")

In [43]:
histogram_comparison(data_agera5, data_era5land,
                     glue_label="indicator_sis-agrometeorological-indicators_consistency_q01_fig-hist")

::::{tab-set}
:::{tab-item} Scatter plot
:sync: scatter
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-scatter
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-scatter"

Point-by-point comparison (density scatter) between AgERA5 and ERA5-Land for each variable.
Yellow means high density, blue means low density, white means zero points in that bin.
```
:::
:::{tab-item} Histogram
:sync: histogram
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-hist
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-hist"

Overall distributions of point-by-point differences between AgERA5 and ERA5-Land for each variable.
Note that some variables (e.g. snow depth) are highly concentrated at 0 difference.
```
:::
::::

#### Time series comparison
Crop yield estimates make use of the time series of the relevant variables at a given site.
Therefore, a comparison between time series from the different datasets should provide a better estimate of the differences downstream than an overall point-by-point comparison.

First, we examine the metrics that describe the difference Δ between corresponding (in time and space) pixels at our test site (defined at the top of this notebook):

In [44]:
# Select from each dataset
# **site unpacks `site` (defined at the start) into lat=..., lon=...
timeseries_agera5, timeseries_era5land = select_in_multiple_datasets(data_agera5, data_era5land, **site)
display_difference_stats(timeseries_agera5, timeseries_era5land)

| Variable | Mean Δ | Median Δ | Median \|Δ\| | Median \|Δ\| [%] | Pearson r |
|---|---|---|---|---|---|
| Temperature (max) [°C] | 0.2198 | 0.2174 | 0.4139 | | 0.9954 |
| Temperature (mean) [°C] | -0.0071 | 0.0203 | 0.329 | | 0.9962 |
| Temperature (min) [°C] | -0.1923 | -0.16 | 0.457 | | 0.9911 |
| Solar irradiation [MJ/m²/d] | -0.038 | -0.042 | 0.2173 | 1.9353 | 0.9989 |
| Wind speed (2 m) [m/s] | 0.5495 | 0.4769 | 0.4769 | 16.0054 | 0.9375 |
| Total precipitation [mm/d] | 0.0596 | -0.0015 | 0.0444 | 16.2491 | 0.9934 |
| Snow depth [cm] | -0.0008 | 0.0 | 0.0 | 0.0 | 0.3361 |
| Effective temperature [°C] | -0.0366 | 0.0 | 0.2772 | 3.2895 | 0.9965 |
| Σ Growing degree-days [°C d] | -5.2544 | -5.4141 | 5.4141 | 1.1388 | 1.0 |
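
Extracting a time series for one site requires mapping the site coordinates onto the grid. The sketch below assumes a nearest-neighbour selection on a regular 0.1° grid; the site coordinates, the function name, and the synthetic data are made up, and the notebook's `select_in_multiple_datasets` may work differently. It also shows how `**site` unpacks a dictionary into keyword arguments.

```python
import numpy as np

# Regular 0.1° grid covering the study area (48–61°N, 13°W–5°E)
lats = np.round(np.arange(48.0, 61.05, 0.1), 2)
lons = np.round(np.arange(-13.0, 5.05, 0.1), 2)

# Synthetic data cube with dimensions (time, lat, lon)
field = np.arange(3 * lats.size * lons.size, dtype=float).reshape(3, lats.size, lons.size)

def select_nearest(field, lats, lons, lat, lon):
    """Return the time series at the grid cell nearest to (lat, lon), and that cell's centre."""
    i = int(np.argmin(np.abs(lats - lat)))
    j = int(np.argmin(np.abs(lons - lon)))
    return field[:, i, j], (float(lats[i]), float(lons[j]))

# Hypothetical site; **site expands to lat=52.23, lon=0.47
site = {"lat": 52.23, "lon": 0.47}
timeseries, nearest_cell = select_nearest(field, lats, lons, **site)

print(timeseries.shape, nearest_cell)
```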


Next, we plot the time series of all variables, both as values and as differences between the datasets:

In [45]:
timeseries_comparison(data_agera5, data_era5land, site,
                      glue_label="indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-abs")

timeseries_comparison_difference(data_agera5, data_era5land, site,
                      glue_label="indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-rel")

timeseries_comparison_difference_multi(data_agera5, data_era5land,
                      glue_label="indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-multi")

::::{tab-set}
:::{tab-item} Values, one site
:sync: abs1
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-abs
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-abs"

Time series comparison between AgERA5 and ERA5-Land for each variable, at one site.
```
:::
:::{tab-item} Difference, one site
:sync: rel1
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-rel
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-rel"

Time series of the differences between AgERA5 and ERA5-Land for each variable, at one site.
```
:::
:::{tab-item} Difference, full domain
:sync: reldom
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-multi
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-timeseries-multi"

Time series of the differences between AgERA5 and ERA5-Land for each variable, across the full domain.
```
:::
::::

#### Geospatial comparison
In this section, the spatial distributions of values in the two datasets, and their differences, are compared visually using earthkit-plots.
For a fair comparison, this is done across multiple dates, which can be specified in the following cell:

In [46]:
# Date(s) for geospatial comparison plot, in {year}-mm-dd format
# Examples were chosen arbitrarily
comparison_dates = {
    "winter": f"{year}-01-01",  #  1 January
    "spring": f"{year}-04-15",  # 15 April
    "summer": f"{year}-06-20",  # 20 June
    "autumn": f"{year}-09-24",  # 24 September
}
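
Because the plotting loop further down takes several minutes, it can be worth checking the date strings before running it. The following is a minimal sketch of such a check, not part of the original workflow: `example_year`, `example_dates`, `SEASON_BY_MONTH`, and `check_comparison_dates` are hypothetical names introduced here, and the month-to-season mapping assumes Northern Hemisphere meteorological seasons.

```python
from datetime import date

# Assumption: an example year; in the notebook, `year` is defined earlier
example_year = 2023

# Same structure as `comparison_dates`, using separate names so that
# nothing in the notebook is overwritten
example_dates = {
    "winter": f"{example_year}-01-01",
    "spring": f"{example_year}-04-15",
    "summer": f"{example_year}-06-20",
    "autumn": f"{example_year}-09-24",
}

# Assumption: Northern Hemisphere meteorological seasons (DJF, MAM, JJA, SON)
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def check_comparison_dates(dates):
    """Parse each date string and check that it falls in the season named by its label.

    Raises ValueError for an invalid date string or a label/season mismatch.
    """
    parsed = {}
    for label, text in dates.items():
        day = date.fromisoformat(text)  # ValueError if not a valid yyyy-mm-dd date
        season = SEASON_BY_MONTH[day.month]
        if season != label:
            raise ValueError(f"{text} is labelled '{label}' but falls in {season}")
        parsed[label] = day
    return parsed


parsed_dates = check_comparison_dates(example_dates)
```

For a domain in the Southern Hemisphere, the mapping would need to be shifted by six months; labels other than the four season names would need their own rule or would have to skip the season check.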

The resulting maps are shown for winter
(Figure {numref}`{number} <indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-winter>`),
spring (Figure {numref}`{number} <indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-spring>`),
summer (Figure {numref}`{number} <indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-summer>`),
and autumn (Figure {numref}`{number} <indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-autumn>`).

<!--
Note that differences in snow thickness are highest in areas where you wouldn't generally have crops anyway
-->

In [47]:
# Loop over dates and plot each -- Note: this takes several minutes
for label, date in comparison_dates.items():
    # Label for Jupyter-book
    glue_label = f"indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-{label}"

    # Plot
    geospatial_comparison_with_difference(data_agera5, data_era5land, date, domain=domain,
                                          glue_label=glue_label)


::::{tab-set}
:::{tab-item} Winter
:sync: winter
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-winter
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-winter"

Geospatial comparison between AgERA5 and ERA5-Land.
```
:::
:::{tab-item} Spring
:sync: spring
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-spring
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-spring"

Geospatial comparison between AgERA5 and ERA5-Land.
```
:::
:::{tab-item} Summer
:sync: summer
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-summer
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-summer"

Geospatial comparison between AgERA5 and ERA5-Land.
```
:::
:::{tab-item} Autumn
:sync: autumn
```{glue:figure} indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-autumn
:figwidth: 700px
:name: "indicator_sis-agrometeorological-indicators_consistency_q01_fig-geo-autumn"

Geospatial comparison between AgERA5 and ERA5-Land.
```
:::
::::

#### Discussion and Conclusions
Text.

## ℹ️ If you want to know more

### Key resources
The CDS catalogue entries for the data used were:
* Agrometeorological indicators from 1979 to present derived from reanalysis (AgERA5): [sis-agrometeorological-indicators](https://doi.org/10.24381/cds.6c68c9bb)
* ERA5-Land hourly data from 1950 to present: [reanalysis-era5-land](https://doi.org/10.24381/cds.e2161bac)
* ERA5-Land post-processed daily statistics from 1950 to present: [derived-era5-land-daily-statistics](https://doi.org/10.24381/cds.e9c9c792)

Code libraries used:
* [earthkit](https://github.com/ecmwf/earthkit)
  * [earthkit-data](https://github.com/ecmwf/earthkit-data)
  * [earthkit-plots](https://github.com/ecmwf/earthkit-plots)

More about crop yield estimation:
* [A gentle introduction to WOFOST](https://www.wur.nl/en/show/a-gentle-introduction-to-wofost.htm)
* [Crop yield prediction based on reanalysis and crop phenology data in the agroclimatic zones](https://doi.org/10.1007/s00704-024-05046-x)
* [Historical trends and future projections of compound cloudy-rainy events during the global winter wheat harvest phase](https://doi.org/10.1016/j.agrformet.2025.110637)

More about reanalysis data:
* [The ERA5 global reanalysis](https://doi.org/10.1002/qj.3803)
* [ERA5-Land: a state-of-the-art global reanalysis dataset for land applications](https://doi.org/10.5194/essd-13-4349-2021)
* AgERA5
  * [Algorithm Theoretical Basis (ATBD)](https://confluence.ecmwf.int/pages/viewpage.action?pageId=278550984)
  * [Product User Guide and Specification (PUGS)](https://confluence.ecmwf.int/pages/viewpage.action?pageId=278551004)
  * [Global Agriculture Downscaling and bias correction](https://confluence.ecmwf.int/display/CKB/Global+Agriculture+Downscaling+and+bias+correction)
  * [AgERA5tools Python package](https://github.com/ajwdewit/agera5tools)

### References
[[De Wit+19](https://doi.org/10.1016/j.agsy.2018.06.018)] A. de Wit et al., ‘25 years of the WOFOST cropping systems model’, Agricultural Systems, vol. 168, pp. 154–167, Oct. 2019, doi: 10.1016/j.agsy.2018.06.018.

[[Hersbach+20](https://doi.org/10.1002/qj.3803)] H. Hersbach et al., ‘The ERA5 global reanalysis’, Quarterly Journal of the Royal Meteorological Society, vol. 146, no. 730, pp. 1999–2049, May 2020, doi: 10.1002/qj.3803.

[[Muñoz-Sabater+21](https://doi.org/10.5194/essd-13-4349-2021)] J. Muñoz-Sabater et al., ‘ERA5-Land: a state-of-the-art global reanalysis dataset for land applications’, Earth System Science Data, vol. 13, no. 9, pp. 4349–4383, Sept. 2021, doi: 10.5194/essd-13-4349-2021.

[[Evenflow+24](https://climate.copernicus.eu/sites/default/files/2024-12/Value-generated-by-ERA5-full-report.pdf)] Evenflow, ‘The value generated by ERA5’, Copernicus Climate Change Service (C3S), Bonn, Germany, Dec. 2024.

[[AgERA5 dataset](https://doi.org/10.24381/cds.6c68c9bb)] H. Boogaard, J. Schubert, A. de Wit, J. Lazebnik, R. Hutjes, and G. van der Grijn, ‘Agrometeorological indicators from 1979 to present derived from reanalysis’. Copernicus Climate Change Service (C3S) Climate Data Store (CDS), Jan. 30, 2020. doi: 10.24381/cds.6c68c9bb.

[[De Wit+24](https://climate.copernicus.eu/sites/default/files/custom-uploads/7th%20GA%20C3S/Presentations/Day%203/S1/05-s19.06.24_AgERA5UserPerspective_AllarddeWit_v1.pdf)] A. de Wit, H. Boogaard, S. Hoek, and E. Müller, ‘Climate Services for Agriculture – A user perspective on AgERA5’, presented at the 7th C3S General Assembly, Brussels, Belgium, June 19, 2024.

[[Araghi+22](https://doi.org/10.1016/j.eja.2021.126419)] A. Araghi, C. J. Martinez, and J. E. Olesen, ‘Evaluation of multiple gridded solar radiation data for crop modeling’, European Journal of Agronomy, vol. 133, p. 126419, Feb. 2022, doi: 10.1016/j.eja.2021.126419.

[[Hasan Karaman+23](https://doi.org/10.1016/j.asr.2023.02.006)] Ç. Hasan Karaman and Z. Akyürek, ‘Evaluation of near-surface air temperature reanalysis datasets and downscaling with machine learning based Random Forest method for complex terrain of Turkey’, Advances in Space Research, vol. 71, no. 12, pp. 5256–5281, June 2023, doi: 10.1016/j.asr.2023.02.006.

[[Kruger+24](https://doi.org/10.17159/sajs.2024/16043)] J. A. Kruger, S. J. Roffe, and A. J. van der Walt, ‘AgERA5 representation of seasonal mean and extreme temperatures in the Northern Cape, South Africa’, South African Journal of Science, vol. 120, no. 3–4, pp. 1–13, Mar. 2024, doi: 10.17159/sajs.2024/16043.

[[Esquivel-Arriaga+24](https://doi.org/10.1175/JAMC-D-23-0227.1)] G. Esquivel-Arriaga et al., ‘Performance Evaluation of Global Precipitation Datasets in Northern Mexico Drylands’, Journal of Applied Meteorology and Climatology, vol. 63, no. 12, pp. 1545–1558, Dec. 2024, doi: 10.1175/JAMC-D-23-0227.1.

[[Suraweera+24](https://doi.org/10.1109/MERCon63886.2024.10689062)] B. Suraweera, K. De Silva, L. Gunawardhana, and L. Rajapakse, ‘Evaluation of Satellite Rainfall Estimates for Surface Runoff Modelling in the Maha Oya Basin, Sri Lanka’, in 2024 Moratuwa Engineering Research Conference (MERCon), Aug. 2024, pp. 67–72. doi: 10.1109/MERCon63886.2024.10689062.

[[Garbanzo+25](https://doi.org/10.3390/hydrology12070161)] G. Garbanzo et al., ‘Addressing Weather Data Gaps in Reference Crop Evapotranspiration Estimation: A Case Study in Guinea-Bissau, West Africa’, Hydrology, vol. 12, no. 7, p. 161, June 2025, doi: 10.3390/hydrology12070161.

[[Garcia-Prats+25](https://doi.org/10.1016/j.ejrh.2025.102531)] A. Garcia-Prats et al., ‘High-resolution spatially interpolated FAO Penman-Monteith crop reference evapotranspiration maps of Sicily Island (Italy) and Jucar River system (Spain) using AgERA5 and ERA5-Land reanalysis datasets’, Journal of Hydrology: Regional Studies, vol. 60, p. 102531, Aug. 2025, doi: 10.1016/j.ejrh.2025.102531.

[[Allen+98](https://www.fao.org/4/x0490e/x0490e00.htm)] R. G. Allen, L. S. Pereira, D. Raes, and M. Smith, Crop evapotranspiration – Guidelines for computing crop water requirements. in FAO Irrigation and drainage papers, no. 56. Rome, Italy: FAO – Food and Agriculture Organization of the United Nations, 1998. Accessed: Nov. 24, 2025. [Online]. Available: https://www.fao.org/4/x0490e/x0490e00.htm

[[Ceglar+19](https://doi.org/10.1016/j.agsy.2018.05.002)] A. Ceglar et al., ‘Improving WOFOST model to simulate winter wheat phenology in Europe: Evaluation and effects on yield’, Agricultural Systems, vol. 168, pp. 168–180, Jan. 2019, doi: 10.1016/j.agsy.2018.05.002.

[[De Wit+20](https://research.wur.nl/en/publications/system-description-of-the-wofost-72-cropping-systems-model)] A. J. W. de Wit, H. L. Boogaard, I. Supit, and M. van den Berg, ‘System description of the WOFOST 7.2 cropping systems model’, Wageningen Environmental Research, May 2020. [Online]. Available: https://research.wur.nl/en/publications/system-description-of-the-wofost-72-cropping-systems-model