# How to work with ERA5 land hourly and Eurostat data on Earth Data Hub
### Forecast of crop production

***
This notebook will provide you guidance on how to access and use the [`reanalysis-era5-land-no-antartica-v0.zarr`](https://earthdatahub.com/collections/era5/datasets/reanalysis-era5-land) dataset on Earth Data Hub in combination with Eurostat crop production.

Eurostat crop production is available for NUTS levels 0, 1 and 2, starting with the year 2000. The time lag for updates to the database is about two years. Detailed data at NUTS level 2 stop in 2015.

The crops of interest here are Cereals (excluding rice) for the production of grain (including seed) for which annual yield in t/ha will be used.

In this tutorial, we try to use air temperature and precipitation as predictors for annual crop yields.

The goal is to train and evaluate a neural network to predict annual crop yields from air temperature and precipitation.


## What you will learn:

* how to access and preview the dataset
* how to select and reduce the data
* how to set up and train a neural network
* how to plot the results

## Installations for Google Colab

Install required packages for Google Colab.

In [None]:
!pip install zarr
!pip install s3fs==2023.6.0
!pip install cartopy==0.21.0

## Software Requirements

Load required packages.

In [None]:
%matplotlib inline
import xarray as xr
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import scipy.stats
from scipy.stats import pearsonr
from tqdm import tqdm
from cartopy import crs as ccrs
import cartopy.feature as cfeature
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader


## Data access and preview
***

Xarray and Dask work together following a lazy principle. This means that when you access and manipulate a Zarr store, the data is not immediately downloaded and loaded into memory. Instead, Dask constructs a task graph that represents the operations to be performed. A smart user will reduce the amount of data that needs to be downloaded before the computation takes place (e.g., when the `.compute()` or `.plot()` methods are called).
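The lazy behaviour can be tried on a small synthetic array before touching the remote store. The sketch below is only an illustration: the array `toy` and its values are made up, and it assumes that Dask is installed so that `.chunk()` produces a Dask-backed array.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a remote Zarr store (illustrative values only)
toy = xr.DataArray(
    np.full((4, 6), 283.15, dtype="float32"),
    dims=("latitude", "longitude"),
    name="t2m",
).chunk({"latitude": 2})  # chunking makes the array Dask-backed (requires dask)

# Selection and arithmetic only extend the task graph
lazy_mean = (toy.isel(latitude=slice(0, 2)) - 273.15).mean("longitude")
still_lazy = lazy_mean.chunks is not None

# Only now are the values actually computed
values = lazy_mean.compute().values
```

Until `compute()` is called, `lazy_mean` only describes the work to be done, which is why it pays off to select a small region first.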

To preview the data, only the dataset metadata must be downloaded. Xarray does this automatically:

***

Data access with HTTPS:

In [None]:
# import with https

dataset = "https://earthdatahub.com/stores/ecmwf-era5-land/reanalysis-era5-land-no-antartica-v1.zarr"
ds_era5 = xr.open_dataset(
    dataset,
    chunks={},
    engine="zarr",
    storage_options={"client_kwargs": {"trust_env": True}},
)
ds_era5

Eurostat data

In [None]:
dataset = "https://earthdatahub.com/stores/eurostat/apro_cpshr-20000101-20240101.zarr"

ds_eurostat = xr.open_dataset(
    dataset,
    engine="zarr",
    chunks={}
)

ds_eurostat

## Working with data

Datasets on EDH are typically very large and remotely hosted. Typical use cases imply a selection of the data followed by one or more reduction steps to be performed in a local or distributed Dask environment.

The structure of a workflow that uses EDH data looks like this:
1. data selection
2. data reduction
3. (optional) visualization

Wrapping longitudes from 0, 360 to -180, 180

In [None]:
# wrap longitudes from (0, 360) to (-180, 180)
lon_name = 'longitude'  # whatever name is in the data

# Shift values above 180 by -360; all other values stay unchanged
lon = ds_era5[lon_name]
ds_era5 = ds_era5.assign_coords({lon_name: xr.where(lon > 180, lon - 360, lon)})

# sort the data along the wrapped coordinate so that slicing works
ds_era5 = ds_era5.sortby(lon_name)

ds_era5
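
As a quick sanity check, the wrapping can be tried on a toy coordinate. This is a minimal sketch using the `assign_coords` and `sortby` idiom; the array `toy` and its values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Toy data on a 0..360 longitude axis (illustrative values only)
toy = xr.DataArray(
    np.arange(5),
    dims="longitude",
    coords={"longitude": [0.0, 90.0, 180.0, 270.0, 359.0]},
)

# Shift longitudes above 180 by -360, then sort along the new coordinate
lon = toy["longitude"]
wrapped = toy.assign_coords(
    longitude=xr.where(lon > 180, lon - 360, lon)
).sortby("longitude")

wrapped_lons = wrapped["longitude"].values.tolist()  # [-90.0, -1.0, 0.0, 90.0, 180.0]
wrapped_vals = wrapped.values.tolist()               # [3, 4, 0, 1, 2]
```

The data values travel with their coordinates, so the western longitudes end up at the start of the axis.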


## Air temperature anomaly

### 1. Data selection

From the original dataset we extract the air temperature (variable `t2m`) and perform a geographical selection corresponding to Europe. This greatly reduces the amount of data that will be downloaded from EDH.

The Eurostat dataset used here starts with the year 2000, while the ERA5 data start with 1940, thus we select only the time period from 2000 onwards.

Note that latitudes are decreasing from 90 to -90 degrees and longitudes are increasing from 0 to 360 degrees. Longitudes need to be wrapped to -180, 180.
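Because the latitude axis is stored in descending order, the bounds of a label-based slice have to follow that order. The following sketch shows the effect on a made-up latitude axis with a 10 degree spacing.

```python
import numpy as np
import xarray as xr

lat = np.arange(90, -91, -10)  # 90, 80, ..., -90 (descending)
toy = xr.DataArray(np.zeros(lat.size), dims="latitude", coords={"latitude": lat})

north_to_south = toy.sel(latitude=slice(72, 36))  # matches the coordinate order
south_to_north = toy.sel(latitude=slice(36, 72))  # wrong order: empty selection

selected = north_to_south["latitude"].values.tolist()  # [70, 60, 50, 40]
```

An empty result after a `sel` call is therefore a hint that the slice bounds should be swapped.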

Convert from Kelvin to Celsius.

Select temperature (t2m) and precipitation (tp)

In [None]:
# geographical selection
ds_t2m = ds_era5.t2m.sel(**{"latitude": slice(72, 36), "longitude": slice(-10, 32)})
# appropriate time period
ds_t2m = ds_t2m.sel(valid_time=slice("2000-01-01", "2024-12-31"))
# convert from Kelvin to Celsius
ds_t2m = ds_t2m.astype("float32") - 273.15
ds_t2m.attrs["units"] = "C"
ds_t2m

At this point, no data has been downloaded yet, nor loaded in memory.

### 2. Data reduction

We compute the monthly air temperature averages in the selected region over the reference period:

Monthly averages:

In [None]:
ds_t2m_monthly = ds_t2m.resample(valid_time="1M").mean(dim="valid_time")

Mean temperature over the whole reference period:

In [None]:
ds_t2m_mean = ds_t2m.mean(dim="valid_time")

Long-term monthly averages

In [None]:
ds_t2m_monthly_lt = ds_t2m_monthly.groupby("valid_time.month").mean("valid_time")

Monthly anomalies

In [None]:
#climatology = ds.groupby("valid_time.month").mean("valid_time")
#anomalies = ds.groupby("valid_time.month") - climatology

climatology = ds_t2m_monthly.groupby("valid_time.month").mean("valid_time")
anomalies = ds_t2m_monthly.groupby("valid_time.month") - climatology
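
To see what the climatology and anomaly steps do, here is a small sketch on two years of synthetic monthly data. The values are made up: a seasonal cycle plus a second year that is 1 degree warmer.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic monthly means (illustrative values only)
time = pd.date_range("2000-01-01", periods=24, freq="MS")
month = time.month.to_numpy()
year = time.year.to_numpy()
seasonal_cycle = 10.0 * np.sin(2 * np.pi * (month - 1) / 12)
warming = np.where(year == 2001, 1.0, 0.0)  # second year is 1 degree warmer
monthly = xr.DataArray(
    seasonal_cycle + warming, dims="valid_time", coords={"valid_time": time}
)

# Long-term mean per calendar month, then the deviation from it
climatology = monthly.groupby("valid_time.month").mean("valid_time")
anomalies = monthly.groupby("valid_time.month") - climatology

first_year = anomalies.sel(valid_time="2000").values   # all -0.5
second_year = anomalies.sel(valid_time="2001").values  # all +0.5
```

The seasonal cycle cancels out and only the year-to-year difference remains, which is the signal we want to relate to crop yields.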


Growing season from May to September: anomalies

In [None]:
ds_t2m_anom_season = anomalies.sel(valid_time=anomalies.valid_time.dt.month.isin([5, 6, 7, 8, 9]))

Averages for each year

In [None]:
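# Option 1: group by calendar year (the result has a "year" dimension)
# ds_t2m_anom_year = ds_t2m_anom_season.groupby("valid_time.year").mean("valid_time")

# Option 2 (used below): resample to annual means, keeping "valid_time"
ds_t2m_anom_year = ds_t2m_anom_season.resample(valid_time="1Y").mean(dim="valid_time")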
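
The seasonal selection and the yearly averaging can be illustrated on synthetic monthly data. In this sketch each value is simply its month number, so the expected result is easy to check by hand. It uses `groupby("valid_time.year")`, which yields a `year` dimension, whereas `resample` keeps `valid_time`.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=24, freq="MS")
# Synthetic anomalies: each month's value equals its month number
anoms = xr.DataArray(
    time.month.to_numpy().astype("float64"),
    dims="valid_time",
    coords={"valid_time": time},
)

# Keep May to September only, then average per calendar year
in_season = anoms["valid_time"].dt.month.isin([5, 6, 7, 8, 9])
season = anoms.sel(valid_time=in_season)
annual = season.groupby("valid_time.year").mean("valid_time")

annual_values = annual.values.tolist()       # [7.0, 7.0]
years = annual["year"].values.tolist()       # [2000, 2001]
```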

After that, we can compute the annual growing-season temperature anomalies in the same area. Calling `compute()` on the result triggers the download and the computation; the result is needed to feed the data to the neural network.

We can measure the time it takes; it should be about 4 minutes:

In [None]:
%%time

ds_t2m_anom_year = ds_t2m_anom_year.compute()
ds_t2m_anom_year

## Precipitation anomaly

Same procedure as for temperature.

In [None]:
# geographical selection
ds_tp = ds_era5.tp.sel(**{"latitude": slice(72, 36), "longitude": slice(-10, 32)})
# appropriate time period
ds_tp = ds_tp.sel(valid_time=slice("2000-01-01", "2024-12-31"))
# convert from meters to millimeters
ds_tp = ds_tp.astype("float32") * 1000.0
ds_tp

Monthly sums

In [None]:
ds_tp_monthly = ds_tp.resample(valid_time="1M").reduce(np.sum)

Long-term monthly averages

In [None]:
ds_tp_monthly_lt = ds_tp_monthly.groupby("valid_time.month").mean("valid_time")

Monthly anomalies

In [None]:
#climatology = ds.groupby("valid_time.month").mean("valid_time")
#anomalies = ds.groupby("valid_time.month") - climatology

climatology = ds_tp_monthly.groupby("valid_time.month").mean("valid_time")
anomalies = ds_tp_monthly.groupby("valid_time.month") - climatology


Growing season from May to September: anomalies

In [None]:
ds_tp_anom_season = anomalies.sel(valid_time=anomalies.valid_time.dt.month.isin([5, 6, 7, 8, 9]))

Averages for each year

In [None]:
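# Option 1: group by calendar year (the result has a "year" dimension)
# ds_tp_anom_year = ds_tp_anom_season.groupby("valid_time.year").mean("valid_time")

# Option 2 (used below): resample to annual means, keeping "valid_time"
ds_tp_anom_year = ds_tp_anom_season.resample(valid_time="1Y").mean(dim="valid_time")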

After that, we can compute the annual growing-season precipitation anomalies in the same area. Calling `compute()` on the result triggers the download and the computation; the result is needed to feed the data to the neural network.

We can measure the time it takes; it should be about 4 minutes:


In [None]:
%%time

ds_tp_anom_year = ds_tp_anom_year.compute()
ds_tp_anom_year

## Crop yields

Annual crop yields on NUTS2 level are calculated as t/ha from harvested production and harvested area.
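The yield is simply harvested production divided by harvested area. The sketch below uses made-up numbers and, as an assumption that is not part of the original workflow, masks cells with zero area so that they become NaN instead of raising a division warning.

```python
import numpy as np

# Hypothetical values for three grid cells (production and area in matching units)
production = np.array([500.0, 0.0, 120.0])
area = np.array([100.0, 0.0, 40.0])

# Divide only where the area is positive; other cells stay NaN
cropyield = np.divide(
    production,
    area,
    out=np.full_like(production, np.nan),
    where=area > 0,
)
# cropyield is [5.0, nan, 3.0]
```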

In [None]:
# TODO
ds_eurostat = ds_eurostat.sel(**{"latitude": slice(72, 36), "longitude": slice(-10, 32)})
ds_eurostat = ds_eurostat.assign(cropyield = lambda x: x.L2_A_C1000_PR_HU_EU / x.L2_A_C1000_AR)
ds_cropyield = ds_eurostat.cropyield
ds_cropyield = ds_cropyield.compute()
ds_cropyield

### 3. Visualization
We can plot crop yields for a given year, e.g. 2015, on a map.

**! This crashes the Google Colab session !**

In [None]:
ds_cropyield_2015 = ds_cropyield.sel(time=["2015-01-01"])

_, ax = plt.subplots(
    figsize=(6, 6),
    subplot_kw={"projection":  ccrs.Miller()},
)
ds_cropyield_2015.plot(
    ax=ax,
    cmap="Blues",
    transform=ccrs.PlateCarree(),
    cbar_kwargs={"orientation": "horizontal", "pad": 0.05, "aspect": 40, "label": "Crop yield [t/ha]"},
)
ax.coastlines()
ax.add_feature(cfeature.BORDERS)
ax.set_title("Crop yields 2015")
plt.show()

## Data preparation for AI

We have already computed annual growing-season temperature and precipitation anomalies from the ERA5 dataset and now need to prepare them for training a network.

Helper functions to manage input data:

In [None]:
# Scaffold code to load in data.  This code cell is mostly data wrangling

def assemble_predictors_predictands(ds_t2m, ds_tp, ds_crop,
                                    start_date, end_date,
                                    use_pca=False, n_components=32):
  """
  inputs
  ------

      ds_t2m            xarray : input xarray DataArray with temperature anomalies
      ds_tp             xarray : input xarray DataArray with precipitation anomalies
      ds_crop           xarray : input xarray DataArray with crop yields
      start_date           str : the start date from which to extract data
      end_date             str : the end date
      use_pca             bool : not used in this tutorial
      n_components         int : not used in this tutorial

  outputs
  -------
      Returns a tuple of the predictors (np array of temperature and precipitation anomalies)
      and the predictands (np array of crop yields).

  """

  t2m = ds_t2m.sel(valid_time=slice(start_date, end_date))
  tp = ds_tp.sel(valid_time=slice(start_date, end_date))
  cropyields = ds_crop.sel(time=slice(start_date, end_date))

  # t2m and tp are (num_samples, lat, lon) arrays
  # the line below combines them to (num_samples, num_params, lat, lon)
  # with num_params = 2
  combined_arr = np.stack((t2m.values, tp.values), axis=1)

  # replace missing values with 0 and work on plain numpy arrays
  X = np.where(np.isnan(combined_arr), 0, combined_arr)

  y = cropyields.values
  y = np.where(np.isnan(y), 0, y)

  return X.astype(np.float32), y.astype(np.float32)


class ERA5Dataset(Dataset):
    def __init__(self, predictors, predictands):
        self.predictors = predictors
        self.predictands = predictands
        assert self.predictors.shape[0] == self.predictands.shape[0], \
               "The number of predictors must equal the number of predictands!"

    def __len__(self):
        return self.predictors.shape[0]

    def __getitem__(self, idx):
        return self.predictors[idx], self.predictands[idx]
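
To check how such a dataset behaves inside a `DataLoader`, here is a minimal sketch with random arrays. The class name `ArrayPairDataset` and the array sizes are hypothetical and chosen only for this example; the class mirrors the structure of the dataset class above so that the block runs on its own.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader


class ArrayPairDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, predictors, predictands):
        assert predictors.shape[0] == predictands.shape[0]
        self.predictors = predictors
        self.predictands = predictands

    def __len__(self):
        return self.predictors.shape[0]

    def __getitem__(self, idx):
        return self.predictors[idx], self.predictands[idx]


rng = np.random.default_rng(0)
# 6 samples, 2 parameters (temperature, precipitation), 8 x 10 grid
X = rng.normal(size=(6, 2, 8, 10)).astype(np.float32)
y = rng.normal(size=(6, 8, 10)).astype(np.float32)

loader = DataLoader(ArrayPairDataset(X, y), batch_size=2)
batch_X, batch_y = next(iter(loader))  # numpy arrays are collated into tensors
```

Each batch carries the predictors as `(batch, params, lat, lon)` and the predictands as `(batch, lat, lon)`.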



## Train a Simple Convolutional Neural Network to model crop yields

Let's define a simple convolutional neural network architecture. This architecture has two convolutional layers, each followed by a max-pooling layer, followed by two transposed convolutional layers. The output of the final layer is a 2-D array per sample, since we are trying to model one value for each pixel: the crop yield. Note that pooling reduces the spatial resolution, so the output grid is smaller than the input grid.
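Before training, it is worth checking which output size such a stack of layers produces. The sketch below mirrors the layer sizes used in the `CNN` class (with pooling applied after each convolution, as in its `forward` method) on a made-up 90 x 90 input. The final `F.interpolate` call is an assumption, not part of the original network: it is one possible way to bring the output back to the input grid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer sizes as in the CNN class, written as a plain stack
layers = nn.Sequential(
    nn.Conv2d(2, 15, 3), nn.ReLU(), nn.MaxPool2d(3, 3),
    nn.Conv2d(15, 32, 3), nn.ReLU(), nn.MaxPool2d(3, 3),
    nn.ConvTranspose2d(32, 15, 3), nn.ReLU(),
    nn.ConvTranspose2d(15, 1, 3), nn.ReLU(),
)

x = torch.zeros(1, 2, 90, 90)  # (batch, params, lat, lon), made-up grid size
with torch.no_grad():
    out = layers(x)
    # Assumption: bilinear upsampling to the input grid size
    resized = F.interpolate(out, size=x.shape[-2:], mode="bilinear", align_corners=False)

out_shape = tuple(out.shape)          # (1, 1, 13, 13)
resized_shape = tuple(resized.shape)  # (1, 1, 90, 90)
```

Comparing `out_shape` with the shape of the predictands shows whether a resizing step of this kind is needed before the loss can be computed.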

### Define the network

In [None]:
class CNN(nn.Module):
    def __init__(self, num_params=2, print_feature_dimension=False):
        """
        inputs
        -------
            num_params                  (int) : the number of parameters
                                                in the predictor
            print_feature_dimension    (bool) : whether or not to print
                                                out the dimension of the features
                                                extracted from the conv layers
        """
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(num_params, 15, 3)
        self.pool = nn.MaxPool2d(3, 3)
        self.conv2 = nn.Conv2d(15, 32, 3)
        self.print_layer = Print()

        # TIP: set print_feature_dimension=True to print out the dimension
        # of the features extracted from the conv layers
        self.deconv1 = nn.ConvTranspose2d(32, 15, 3)
        self.deconv2 = nn.ConvTranspose2d(15, 1, 3)
        self.print_feature_dimension = print_feature_dimension

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        if self.print_feature_dimension:
          x = self.print_layer(x)
        # adjust to dimensions of fc1
        #x = x.view(-1, 32 * 8 * 48)
        x = F.relu(self.deconv1(x))
        x = F.relu(self.deconv2(x))
        return x

class Print(nn.Module):
    """
    This class prints out the size of the features
    """
    def forward(self, x):
        print(x.size())
        return x

Next, let's define a method that trains our neural network.

In [None]:
def train_network(net, criterion, optimizer, trainloader, testloader,
                  experiment_name, num_epochs=40):
  """
  inputs
  ------

      net               (nn.Module)   : the neural network architecture
      criterion         (nn)          : the loss function (e.g. mean squared error)
      optimizer         (torch.optim) : the optimizer used to update the neural network
                                        weights to minimize the loss function
      trainloader       (torch.utils.data.DataLoader): dataloader that loads the
                                        predictors and predictands
                                        for the train dataset
      testloader        (torch.utils.data.DataLoader): dataloader that loads the
                                        predictors and predictands
                                        for the test dataset
  outputs
  -------
      predictions (np.array), train_losses (list), test_losses (list);
      also saves the best neural network as a .pt file
  """
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
  net = net.to(device)
  best_loss = np.inf
  train_losses, test_losses = [], []

  for epoch in range(num_epochs):
    for mode, data_loader in [('train', trainloader), ('test', testloader)]:
      # Set the model to train mode to allow its weights to be updated
      # while training
      if mode == 'train':
        net.train()

      # Set the model to eval mode while testing (this changes the behaviour
      # of layers such as dropout); no optimizer step is taken in this mode
      elif mode == 'test':
        net.eval()

      running_loss = 0.0
      for i, data in enumerate(data_loader):
          # get a mini-batch of predictors and predictands
          batch_predictors, batch_predictands = data
          batch_predictands = batch_predictands.to(device)
          batch_predictors = batch_predictors.to(device)

          # zero the parameter gradients
          optimizer.zero_grad()

          # calculate the predictions of the current neural network
          # (drop only the channel dimension so that a batch of size 1 keeps its batch axis)
          predictions = net(batch_predictors).squeeze(1)

          # quantify the quality of the predictions using a
          # loss function (aka criterion) that is differentiable
          loss = criterion(predictions, batch_predictands)

          if mode == 'train':
            # the backward pass: calculates the gradients of the loss
            # with respect to each weight of the neural network
            loss.backward()

            # the optimizer updates the weights of the neural network
            # based on the gradients calculated above and the choice
            # of optimization algorithm
            optimizer.step()

          running_loss += loss.item()

      # Save the model weights that have the best performance!
      if running_loss < best_loss and mode == 'test':
          best_loss = running_loss
          torch.save(net.state_dict(), '{}.pt'.format(experiment_name))
      print('{} Set: Epoch {:02d}. loss: {:3f}'.format(mode, epoch+1, \
                                            running_loss/len(data_loader)))
      if mode == 'train':
          train_losses.append(running_loss/len(data_loader))
      else:
          test_losses.append(running_loss/len(data_loader))

  # restore the weights of the best model seen during training
  net.load_state_dict(torch.load('{}.pt'.format(experiment_name), map_location=device))
  net.eval()
  net.to(device)

  # the remainder of this function calculates the predictions of the best
  # saved model
  batch_outputs = []
  with torch.no_grad():
    for batch_predictors, _ in testloader:
      batch_predictors = batch_predictors.to(device)
      # drop only the channel dimension so that a batch of size 1
      # keeps its batch axis
      batch_predictions = net(batch_predictors).squeeze(1)
      batch_outputs.append(batch_predictions.cpu().numpy())
  # stack the per-batch maps along the sample axis
  predictions = np.concatenate(batch_outputs, axis=0)
  return predictions, train_losses, test_losses
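
One common way to checkpoint a model is to store only its weights via `state_dict()` and to load them into a freshly constructed model. The sketch below shows this round trip on a small stand-in layer; the file name `checkpoint.pt` and the layer sizes are made up for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Conv2d(2, 4, 3)  # small stand-in for the real network

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "checkpoint.pt")  # hypothetical file name
    # Save only the weights, not the whole pickled module
    torch.save(model.state_dict(), path)

    # Rebuild the architecture, then load the stored weights into it
    restored = nn.Conv2d(2, 4, 3)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

same_weights = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
```

Saving weights rather than the whole module keeps the checkpoint independent of how the class was pickled, at the cost of having to construct the model again before loading.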


### Actual training

Prepare data for the network and train it.

The earliest possible start date is 2000-01-01, the latest possible end date is 2015-12-31.

In [None]:
%%time

# Assemble numpy arrays corresponding to predictors and predictands
train_start_date = '2000-01-01'
train_end_date = '2011-12-31'

test_start_date = '2012-01-01'
test_end_date = '2015-12-31'

train_predictors, train_predictands = assemble_predictors_predictands(ds_t2m_anom_year, ds_tp_anom_year, ds_cropyield,
                                                                      train_start_date, train_end_date)

print("train_predictors: %d" % train_predictors.shape[0])
print("train_predictands: %d" % train_predictands.shape[0])

test_predictors, test_predictands = assemble_predictors_predictands(ds_t2m_anom_year, ds_tp_anom_year, ds_cropyield,
                                                                    test_start_date, test_end_date)

print("test_predictors: %d" % test_predictors.shape[0])
print("test_predictands: %d" % test_predictands.shape[0])

# Convert the numpy arrays into ERA5Dataset, which is a subclass of the
# torch.utils.data.Dataset class.  This class is compatible with
# the torch dataloader, which allows for data loading for a CNN
train_dataset = ERA5Dataset(train_predictors, train_predictands)
test_dataset = ERA5Dataset(test_predictors, test_predictands)

# Create a torch.utils.data.DataLoader from the ERA5Dataset() created earlier!
# the similarity between the name DataLoader and Dataset in the pytorch API is unfortunate...
trainloader = DataLoader(train_dataset, batch_size=2)
testloader = DataLoader(test_dataset, batch_size=2)
net = CNN(num_params=2, print_feature_dimension=False)
optimizer = optim.Adam(net.parameters(), lr=0.0001)

# train the model and make predictions for the test time period
experiment_name = "twolayerCNN_{}_{}".format(train_start_date, train_end_date)
predictions, train_losses, test_losses = train_network(net, nn.MSELoss(),
                  optimizer, trainloader, testloader, experiment_name, num_epochs=40)

## Results of NN training

Plot train and test losses:

In [None]:
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Performance of {} Neural Network During Training'.format(experiment_name))
plt.legend(loc='best')
plt.show()

If the test loss does not look satisfactory, try reducing the number of parameters of the network. You could define your own network architecture, which uses a different number of parameters.

Alternatively, try increasing the number of training samples by using a longer time period or a larger area.
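To judge whether a network is large relative to the number of training samples, it helps to count its trainable parameters. The sketch below does this for a stack with the same layer sizes as the `CNN` class; the helper name `count_parameters` is hypothetical.

```python
import torch.nn as nn


def count_parameters(module):
    """Return the number of trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Same layer sizes as in the CNN class
layers = nn.Sequential(
    nn.Conv2d(2, 15, 3),
    nn.Conv2d(15, 32, 3),
    nn.ConvTranspose2d(32, 15, 3),
    nn.ConvTranspose2d(15, 1, 3),
)

n_params = count_parameters(layers)  # 285 + 4352 + 4335 + 136 = 9108
```

Reducing the number of channels (15 and 32 here) is the most direct way to shrink this number.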

Show predictions:

In [None]:
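# flatten the maps so that every pixel of every year is one sample
y_true = np.asarray(test_predictands).ravel()
y_pred = np.asarray(predictions).ravel()

corr, _ = pearsonr(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5

print("Pearson r: {:.2f}".format(corr))
print("RMSE: {:.2f}".format(rmse))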
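
Pixels without crop yield data can distort these scores if they are counted as zeros. As an optional variation, which is an assumption and not part of the workflow above, the metrics can be restricted to pixels that have observations. The sketch uses made-up arrays and plain NumPy.

```python
import numpy as np

# Hypothetical observed yields with missing pixels (NaN) and model output
y_true = np.array([[np.nan, 2.0, 4.0], [np.nan, 6.0, 8.0]])
y_pred = np.array([[1.0, 2.5, 3.5], [1.0, 6.5, 7.5]])

valid = np.isfinite(y_true)  # pixels that actually have yield data
diff = y_pred[valid] - y_true[valid]

rmse = float(np.sqrt(np.mean(diff ** 2)))  # 0.5
corr = float(np.corrcoef(y_true[valid], y_pred[valid])[0, 1])
```

Applying the same mask to both arrays keeps the pairs aligned, so the scores describe only the regions where yields were reported.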
