# ML on ESDC using PyTorch including Transfer Learning
A DeepESDL example notebook

## Linear Regression for prediction of missing land surface temperature values from air temperature values
This notebook demonstrates how to implement Machine Learning on the Earth System Data Cube using the ML library PyTorch, how to save the model, and how to reload it for a second task (Transfer Learning). The workflow is self-contained and based on a generic use case to showcase data loading, sampling strategies, model training, model evaluation and visualisation.

Please also refer to the DeepESDL documentation and visit the platform's website for further information.

ScaDS.AI, 2023

### Import necessary libraries
In case you experience an error due to a missing library xy, please install it via "pip install xy".

In [1]:
import math
import numpy as np
import xarray as xr
from xcube.core.store import new_data_store


import mltools as ml
import pandas as pd

import dask.array as da

import torch
from torch.utils.data import TensorDataset, DataLoader
from torch import nn
from torch.nn.functional import normalize

import nbimporter

### Load Data (Earth System Data Cube)
We lazily load the ESDC (*.zarr) from the s3 data store. The ESDC has three dimensions (longitude, latitude, time). Out of the many available cube variables, which are stored as dask arrays, we select two ("land_surface_temperature", "air_temperature_2m").

In [13]:
data_store = new_data_store("s3", root="esdl-esdc-v2.1.1", storage_options=dict(anon=True))
dataset = data_store.open_data('esdc-8d-0.083deg-184x270x270-2.1.1.zarr')
ds = dataset[['land_surface_temperature', 'air_temperature_2m']]
ds

### Assign a random train/test split

In [5]:
xds = ds.assign({"split": ml.rand})
xds

Data variables in xds (lazy dask arrays backed by numpy.ndarray):

| Variable | Type | Shape | Chunk shape | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|---|
| land_surface_temperature | float32 | (10, 2160, 4320) | (10, 270, 270) | 355.96 MiB | 2.78 MiB | 129 | 128 |
| air_temperature_2m | float32 | (10, 2160, 4320) | (10, 270, 270) | 355.96 MiB | 2.78 MiB | 129 | 128 |
| split | bool | (10, 2160, 4320) | (10, 270, 270) | 88.99 MiB | 711.91 kiB | 256 | 128 |

### Model set up

Select the cuda device, if available, to use GPU resources.

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device


#### Define model, loss and optimizer

In [7]:
# model, loss and optimizer
class Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc4 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        return x

reg_model = Model(input_size=1, hidden_size=1, output_size=1)
mse_loss = nn.MSELoss()
optimizer = torch.optim.SGD(reg_model.parameters(), lr=0.0001)
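
The model stacks four linear layers without activation functions in between. A composition of affine maps is itself affine, so the network as a whole is still a linear regression y = slope * x + intercept. The following minimal sketch illustrates this with a stand-in built from nn.Sequential (an assumption for brevity, not the notebook's Model class) by folding the four layers into a single slope and intercept and comparing the results.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Four stacked 1 -> 1 linear layers without activations, mirroring fc1..fc4
layers = nn.Sequential(*[nn.Linear(1, 1) for _ in range(4)])

# Fold the layers into one affine map: y = slope * x + intercept
slope = torch.ones(())
intercept = torch.zeros(())
with torch.no_grad():
    for layer in layers:
        w, b = layer.weight.squeeze(), layer.bias.squeeze()
        # Applying w * (.) + b to the running affine map
        slope, intercept = w * slope, w * intercept + b

    x = torch.linspace(-2.0, 2.0, steps=5).unsqueeze(1)  # shape (5, 1)
    stacked = layers(x)                  # output of the layer stack
    collapsed = slope * x + intercept    # output of the folded map

print(torch.allclose(stacked, collapsed, atol=1e-5))
```

Inserting a non-linearity such as nn.ReLU between the layers would turn the same architecture into a non-linear regressor.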

Get range (min, max) and statistics (mean, std) of data variables for normalization or standardization.

In [4]:
at_range = ml.getRange(ds, 'air_temperature_2m')
lst_range = ml.getRange(ds, 'land_surface_temperature')

at_stat = ml.getStatistics(ds, 'air_temperature_2m')
lst_stat = ml.getStatistics(ds, 'land_surface_temperature')
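
The implementation of the mltools helpers is not shown in this notebook. As a rough sketch of what standardization and normalization typically compute, assuming the statistics are passed as (mean, std) and the range as (min, max), the functions below use hypothetical names and toy values that are not taken from the ESDC.

```python
import numpy as np

def standardize_sketch(x, mean, std):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - mean) / std

def normalize_sketch(x, xmin, xmax):
    """Rescale linearly so that xmin maps to 0 and xmax maps to 1."""
    return (x - xmin) / (xmax - xmin)

def undo_standardize_sketch(z, mean, std):
    """Map standardized values back to the original units."""
    return z * std + mean

# Toy temperature values (illustrative only)
at = np.array([270.0, 280.0, 290.0, 300.0], dtype=np.float32)
at_stat_toy = (at.mean(), at.std())    # (mean, std)
at_range_toy = (at.min(), at.max())    # (min, max)

z = standardize_sketch(at, *at_stat_toy)
n = normalize_sketch(at, *at_range_toy)
back = undo_standardize_sketch(z, *at_stat_toy)

print(z.mean(), z.std())   # close to 0 and 1
print(n.min(), n.max())    # 0 and 1
```

Computing the statistics once over the whole cube, rather than per chunk, keeps the scaling consistent across all chunks seen during training. The inverse transform is what turns a standardized prediction back into a temperature.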

### Train model

We iterate through the chunks of the ESDC. The data will be preprocessed by flattening, removing NaNs, normalization or standardization. Further, we will split the data into a training and testing fraction. We generate a train data loader and a test data loader and perform a linear regression. The train and test errors are returned during model training.
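
The preprocessing steps described above can be tried on a small in-memory array first. The sketch below builds a toy chunk with the same variable names, flattens it, drops positions where the target is missing, and applies the Boolean split mask. The values and the roughly 80/20 split ratio are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy chunk with the notebook's variable names (values are made up)
chunk = {
    "air_temperature_2m": rng.normal(285.0, 10.0, size=(2, 3, 4)),
    "land_surface_temperature": rng.normal(290.0, 12.0, size=(2, 3, 4)),
    "split": rng.random(size=(2, 3, 4)) < 0.8,  # True = train (assumed ratio)
}
# Knock out two target values to mimic gaps in the observations
chunk["land_surface_temperature"][0, 0, :2] = np.nan

# Flatten every variable to 1-D
cf = {name: values.ravel() for name, values in chunk.items()}

# Keep only positions where the target is present
valid = ~np.isnan(cf["land_surface_temperature"])
cfn = {name: values[valid] for name, values in cf.items()}

# The mask selects the training rows; its negation selects the test rows
train = cfn["split"]
X_train, X_test = cfn["air_temperature_2m"][train], cfn["air_temperature_2m"][~train]
y_train, y_test = cfn["land_surface_temperature"][train], cfn["land_surface_temperature"][~train]

print(len(X_train), len(X_test))
```

Because the same mask is applied to every variable, inputs, targets and split flags stay aligned element by element.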

In [8]:
for chunk in ml.iter_data_var_blocks(xds): 
    ### preprocessing 
    # flatten
    cf = {x: chunk[x].ravel() for x in chunk.keys()}
    # drop nans
    lst = cf['land_surface_temperature']
    cfn = {x: cf[x][~np.isnan(lst)] for x in cf.keys()}

    if len(cfn['land_surface_temperature']) > 0:
        #X = normalize(cfn['air_temperature_2m'], 'air_temperature_2m')
        #y = normalize(cfn['land_surface_temperature'], 'land_surface_temperature')
        X = ml.standardize(cfn['air_temperature_2m'],*at_stat)
        y = ml.standardize(cfn['land_surface_temperature'], *lst_stat)
        
        
        ### get train/test data 
        X_train = X[cfn['split']==True]
        X_test  = X[cfn['split']==False]
        y_train = y[cfn['split']==True]
        y_test  = y[cfn['split']==False]
        
        inputs  = torch.tensor(X_train)
        outputs =  torch.tensor(y_train)
        
        train_ds = TensorDataset(inputs, outputs)
        test_ds  = TensorDataset(torch.tensor(X_test), torch.tensor(y_test))
        
        trainloader = DataLoader(train_ds, batch_size=50, shuffle=True)
        testloader  = DataLoader(test_ds, batch_size=50, shuffle=True)
        
        ### train model 
        for i in range(3):
            reg_model,train_pred,loss = ml.train_one_epoch(i, trainloader, reg_model, mse_loss, optimizer, device)
            print(f"Training Error: Avg loss: {loss:>8f}")
            test_pred, test_loss = ml.test(testloader, reg_model, mse_loss, device)
            print(f"Test Error: Avg loss: {test_loss:>8f} \n")

Training Error: Avg loss: 0.072995
Test Error: Avg loss: 0.073645 

Training Error: Avg loss: 0.071274
Test Error: Avg loss: 0.071836 

Training Error: Avg loss: 0.072579
Test Error: Avg loss: 0.070143 

Training Error: Avg loss: 0.119701
Test Error: Avg loss: 0.116720 

Training Error: Avg loss: 0.110845
Test Error: Avg loss: 0.109718 

Training Error: Avg loss: 0.103377
Test Error: Avg loss: 0.102931 

Training Error: Avg loss: 0.097270
Test Error: Avg loss: 0.095153 

Training Error: Avg loss: 0.085659
Test Error: Avg loss: 0.085687 

Training Error: Avg loss: 0.077311
Test Error: Avg loss: 0.076824 

Training Error: Avg loss: 0.045944
Test Error: Avg loss: 0.045861 

Training Error: Avg loss: 0.043628
Test Error: Avg loss: 0.042946 

Training Error: Avg loss: 0.041539
Test Error: Avg loss: 0.040540 

Training Error: Avg loss: 0.058766
Test Error: Avg loss: 0.058385 

Training Error: Avg loss: 0.057920
Test Error: Avg loss: 0.058047 

Training Error: Avg loss: 0.057886
Test Error: A
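
The training and test helpers used above come from mltools, and their implementation is not shown here. As a hedged sketch of what one training epoch and one evaluation pass usually look like in PyTorch, the functions below use hypothetical names and simplified signatures, and run on synthetic data rather than on the ESDC.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch_sketch(loader, model, loss_fn, optimizer, device):
    """Run one pass over the loader and return the mean batch loss."""
    model.train()
    total = 0.0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(X), y)
        loss.backward()                # backpropagate
        optimizer.step()               # update the parameters
        total += loss.item()
    return total / len(loader)

def evaluate_sketch(loader, model, loss_fn, device):
    """Evaluate without gradient tracking and return the mean batch loss."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            total += loss_fn(model(X), y).item()
    return total / len(loader)

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic standardized data with a known linear relation plus noise
X = torch.randn(1000, 1)
y = 0.9 * X + 0.1 * torch.randn(1000, 1)
train_loader = DataLoader(TensorDataset(X[:800], y[:800]), batch_size=50, shuffle=True)
test_loader = DataLoader(TensorDataset(X[800:], y[800:]), batch_size=50)

model = nn.Linear(1, 1).to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

history = []
for epoch in range(3):
    train_loss = train_one_epoch_sketch(train_loader, model, loss_fn, optimizer, device)
    test_loss = evaluate_sketch(test_loader, model, loss_fn, device)
    history.append((train_loss, test_loss))
    print(f"Training Error: Avg loss: {train_loss:>8f}")
    print(f"Test Error: Avg loss: {test_loss:>8f} \n")
```

Note that the inputs here have shape (N, 1), which is what nn.Linear(1, 1) expects. Flattened 1-D arrays need an extra feature dimension, for example via unsqueeze(1), unless the helper adds it internally.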

### Save pre-trained model

In [10]:
torch.save(reg_model.state_dict(), 'trained_model.pt')
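
A state_dict is an ordered mapping from parameter names to tensors, so saving it stores the weights but not the model class. The following minimal sketch shows a save and restore round trip. It uses an in-memory buffer instead of a file and a single nn.Linear layer as a stand-in for the notebook's model.

```python
import io
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)

# A state_dict maps parameter names to tensors
keys = list(model.state_dict().keys())
print(keys)

# Save to an in-memory buffer; a file path such as 'trained_model.pt' works the same way
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Restore into a freshly initialised model of the same architecture
restored = nn.Linear(1, 1)
restored.load_state_dict(torch.load(buffer))

# Both models now produce identical predictions
x = torch.tensor([[0.5]])
same = torch.equal(model(x), restored(x))
print(same)
```

Since only the weights are stored, the code that defines the architecture must be available again when the file is loaded.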

### Load pre-trained model and set up
We load the pre-trained model weights into a modified model. The last layer of the pre-trained model is replaced by a new one.
The modified model is then trained on a second task.

In [12]:
# Define the modified model
class ModifiedModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        # no layer 4

        # Add a new layer
        self.fc5 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc5(x) # This is the new layer
        return x

# Create an instance of the modified model
reg_model = ModifiedModel(input_size=1, hidden_size=1, output_size=1)

# Load the pre-trained model weights
# strict=False: skip keys that do not match (fc4.* in the file, fc5.* in the model)
reg_model.load_state_dict(torch.load('trained_model.pt'), strict=False)
reg_model.eval()

mse_loss = nn.MSELoss()

optimizer = torch.optim.SGD(reg_model.parameters(), lr=0.01)

# use gpu if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device
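
With strict=False, load_state_dict copies every parameter whose name matches and reports the rest instead of raising an error. The sketch below uses two reduced stand-in classes (hypothetical, with one shared layer and one differing layer) to show which keys end up missing and which are unexpected.

```python
import torch
from torch import nn

class Pretrained(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc3 = nn.Linear(1, 1)
        self.fc4 = nn.Linear(1, 1)  # old output layer

class Modified(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc3 = nn.Linear(1, 1)
        self.fc5 = nn.Linear(1, 1)  # new output layer

torch.manual_seed(0)
source, target = Pretrained(), Modified()

result = target.load_state_dict(source.state_dict(), strict=False)
print("missing:", result.missing_keys)        # parameters left at their fresh initialisation
print("unexpected:", result.unexpected_keys)  # saved parameters with no destination

# The shared layer now carries the pre-trained weights
copied = torch.equal(target.fc3.weight, source.fc3.weight)
print(copied)
```

Inspecting the returned keys is a cheap way to confirm that only the intended layer was left untouched, since a typo in a layer name would otherwise go unnoticed.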


#### Load Data
Here we reuse the same ESDC data as before for simplicity. In an actual transfer learning setting, the second task would normally use different data.

In [None]:
data_store = new_data_store("s3", root="esdl-esdc-v2.1.1", storage_options=dict(anon=True))
dataset = data_store.open_data('esdc-8d-0.083deg-184x270x270-2.1.1.zarr')
ds = dataset[['land_surface_temperature', 'air_temperature_2m']]
ds

### Assign random train/test split

In [None]:
xds = ds.assign({"split": ml.rand})

Get range (min, max) and statistics (mean, std) of data variables for normalization or standardization.

In [13]:
at_range = ml.getRange(ds, 'air_temperature_2m')
lst_range = ml.getRange(ds, 'land_surface_temperature')

at_stat = ml.getStatistics(ds, 'air_temperature_2m')
lst_stat = ml.getStatistics(ds, 'land_surface_temperature')

### Train pre-trained model

In [19]:
for chunk in ml.iter_data_var_blocks(xds): 
    ### preprocessing 
    # flatten
    cf = {x: chunk[x].ravel() for x in chunk.keys()}
    # drop nans
    lst = cf['land_surface_temperature']
    cfn = {x: cf[x][~np.isnan(lst)] for x in cf.keys()}

    if len(cfn['land_surface_temperature']) > 0:
        #X = normalize(cfn['air_temperature_2m'], 'air_temperature_2m')
        #y = normalize(cfn['land_surface_temperature'], 'land_surface_temperature')
        X = ml.standardize(cfn['air_temperature_2m'],*at_stat)
        y = ml.standardize(cfn['land_surface_temperature'], *lst_stat)
               
        ### get train/test data 
        X_train = X[cfn['split']==True]
        X_test  = X[cfn['split']==False]
        y_train = y[cfn['split']==True]
        y_test  = y[cfn['split']==False]
        
        inputs  = torch.tensor(X_train)
        outputs =  torch.tensor(y_train)
        
        train_ds = TensorDataset(inputs, outputs)
        test_ds  = TensorDataset(torch.tensor(X_test), torch.tensor(y_test))
        
        trainloader = DataLoader(train_ds, batch_size=50, shuffle=True)
        testloader  = DataLoader(test_ds, batch_size=50, shuffle=True)
        
        ### train model 
        for i in range(3):
            reg_model,train_pred,loss = ml.train_one_epoch(i, trainloader, reg_model, mse_loss, optimizer, device)
            print(f"Training Error: Avg loss: {loss:>8f}")
            test_pred, test_loss = ml.test(testloader, reg_model, mse_loss, device)
            print(f"Test Error: Avg loss: {test_loss:>8f} \n")

Training Error: Avg loss: 0.034744
Test Error: Avg loss: 0.034756 

Training Error: Avg loss: 0.034996
Test Error: Avg loss: 0.034780 

Training Error: Avg loss: 0.035408
Test Error: Avg loss: 0.034731 

Training Error: Avg loss: 0.040708
Test Error: Avg loss: 0.040590 

Training Error: Avg loss: 0.040530
Test Error: Avg loss: 0.040593 

Training Error: Avg loss: 0.041077
Test Error: Avg loss: 0.040591 

Training Error: Avg loss: 0.037255
Test Error: Avg loss: 0.037942 

Training Error: Avg loss: 0.037782
Test Error: Avg loss: 0.037871 

Training Error: Avg loss: 0.037483
Test Error: Avg loss: 0.037931 

Training Error: Avg loss: 0.033409
Test Error: Avg loss: 0.033033 

Training Error: Avg loss: 0.033656
Test Error: Avg loss: 0.033035 

Training Error: Avg loss: 0.033488
Test Error: Avg loss: 0.033188 

Training Error: Avg loss: 0.058158
Test Error: Avg loss: 0.056970 

Training Error: Avg loss: 0.056816
Test Error: Avg loss: 0.057112 

Training Error: Avg loss: 0.056617
Test Error: A
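
The loop above updates all layers of the modified model. A common alternative in transfer learning, not used in this notebook, is to freeze the pre-trained layers and train only the new one. The following sketch shows this option on a two-layer stand-in model. The layer names follow the notebook, while the data and learning target are synthetic.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the modified model: one pre-trained layer plus the new output layer
model = nn.Sequential()
model.add_module("fc1", nn.Linear(1, 1))
model.add_module("fc5", nn.Linear(1, 1))

# Freeze everything except the new layer
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc5")

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

frozen_before = model.fc1.weight.detach().clone()
new_before = model.fc5.weight.detach().clone()

# One optimisation step on synthetic data
X = torch.randn(64, 1)
y = 2.0 * X + 1.0
loss = nn.functional.mse_loss(model(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(model.fc1.weight, frozen_before)
new_changed = not torch.equal(model.fc5.weight, new_before)
print(frozen_unchanged, new_changed)
```

Freezing reduces the number of trained parameters and protects the pre-trained weights, at the cost of less flexibility when the second task differs strongly from the first.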