# Tutorial 2: Implementing the model and training pipeline

### Outline

* Imports, including the library code from previous step
* Description of LSTM-based models (describe modeling approach: seq2val)
* Defining the training loop and procedure
* Setting hyperparameters and region of interest
* Running the model training
* Recording model configuration and saving the trained model
* Putting the model architecture code into a library module

## Setup and configuration

Before we get to the core of the tutorial, building and training a neural network, we need to do the usual setup. This includes a standard set of imports, along with our first import from the local codebase in the `src` folder. Here we import some functions from the `src.datapipes` module, which contains the key portions of the data-loading code from the previous tutorial. Offloading this code to a Python module keeps this step of the tutorial much clearer. It also gives us the start of a Python package that might be useful to others and could be adapted to be generally importable. Finally, after our imports we set the default `DEVICE` to use a GPU if one is available, and fall back to the CPU if not.

In [1]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import warnings
import torch
import yaml

from torch import nn
from tqdm.autonotebook import tqdm

# Code from the last part of the tutorial!!
from src.datapipes import make_data_pipeline, merge_data, select_region

warnings.filterwarnings('ignore')
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Creating the data pipeline

If you followed along with the previous section of the tutorial you know the general pieces of info that go into setting up the data pipeline for our dataset, which culminated in developing the `make_data_pipeline` function. Before fully defining the pipelines, let's define our train/valid/test split. There are many ways to do this, and it's one of the most important parts of a rigorous machine learning pipeline. In our case, we'll just use a temporal split to define this, but our workflow is already set up quite nicely to use different regions for the splitting mechanism. Anyhow, for the tutorial we just set the splits as such:

In [2]:
train_period = slice('1985', '2000')
valid_period = slice('2001', '2007')
test_period = slice('2008', '2015')

ds = merge_data()
train_ds = ds.sel(time=train_period)
valid_ds = ds.sel(time=valid_period)
test_ds = ds.sel(time=test_period)
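
As a quick sanity check on a temporal split like the one above, the sketch below confirms that the periods do not overlap and reports how many calendar years each one covers. It only uses the year strings from the cell above. The helper name `years_in` is hypothetical, and it assumes both ends of a period are inclusive, as they are with label-based slicing in xarray.

```python
# Periods copied from the cell above; both ends are inclusive years.
periods = {
    "train": slice("1985", "2000"),
    "valid": slice("2001", "2007"),
    "test": slice("2008", "2015"),
}


def years_in(period):
    """Hypothetical helper: set of calendar years covered by a period."""
    return set(range(int(period.start), int(period.stop) + 1))


covered = {name: years_in(p) for name, p in periods.items()}

# No year should appear in more than one split.
names = list(covered)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not covered[a] & covered[b], f"{a} and {b} overlap"

n_years = {name: len(years) for name, years in covered.items()}
print(n_years)
```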

Given that, we can set up all of the rest of the pipeline configuration. This includes the region we're interested in modeling, the input and output variables, and the timescales we'll model with. Most of this code should be pretty self-explanatory, but it's worth highlighting the `input_sequence_length` and `output_sequence_length` variables. As we will get into, we'll be training a "recurrent neural network" (RNN), which processes variables sequentially and can store a "hidden state" that tracks information coming in from past inputs. We specify different input and output sequence lengths because we know that snowpack has a long-term dependence on temperature and precipitation. So, we define a much longer input sequence length than output sequence length to capture this long-term dependence. We set the input to be 360 days to account for roughly a full year of input data, and the output to be 30 so that we predict about a month of snowpack dynamics for any given input. This asymmetry amounts to letting the model "spin up" its hidden state for the first 330 days and then start to output for the last 30. These are arbitrary choices in our case, which account for some level of knowledge that we have about snow hydrology, but could be further tuned as "hyperparameters". As a last note, we set the `input_overlap` to be the difference between the input and output sequence lengths as a way to take advantage of as much data as possible in the training dataset.

In [3]:
regions = 'WNA'
input_vars = ['pr', 'tasmax', 'tasmin', 'elevation', 'aspect_cosine']
output_vars = ['swe']
input_sequence_length = 360
output_sequence_length = 30
batch_dims = {'lat': 30, 'lon': 30}
input_overlap = {'time': input_sequence_length - output_sequence_length}
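
To make the windowing arithmetic concrete, here is a minimal sketch in plain Python. It assumes that consecutive samples along `time` advance by `input_sequence_length - input_overlap` steps, which is a common way to generate overlapping windows. The internals of `make_data_pipeline` are not shown here, so treat `window_starts` as a hypothetical helper rather than the library's implementation.

```python
input_sequence_length = 360
output_sequence_length = 30
input_overlap = input_sequence_length - output_sequence_length  # 330


def window_starts(n_time, length, overlap):
    """Hypothetical helper: start index of each full window along time."""
    stride = length - overlap
    return list(range(0, n_time - length + 1, stride))


# Pretend we have 450 daily time steps.
starts = window_starts(450, input_sequence_length, input_overlap)

targets = []
for s in starts:
    in_end = s + input_sequence_length           # exclusive end of the inputs
    out_start = in_end - output_sequence_length  # targets are the last 30 days
    targets.append((out_start, in_end))
    print(f"inputs [{s}, {in_end}) -> targets [{out_start}, {in_end})")
```

Under this assumption the target windows tile the time axis without gaps or repeats, so every day after the initial 330-day spin-up is used as a target exactly once.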

With everything set up, we can use our handy `make_data_pipeline` function to create the pipes for the training and validation datasets. Note how these make use of all of our metadata/hyperparameters from above in a nicely encapsulated way. This could ideally even be taken to other problems where similar datasets are being used, perhaps for something like soil moisture modeling.

In [4]:
train_pipe = make_data_pipeline(
    train_ds, regions,
    input_vars, output_vars,
    input_sequence_length, output_sequence_length,
    batch_dims, input_overlap, preload=True
)

valid_pipe = make_data_pipeline(
    valid_ds, regions,
    input_vars, output_vars,
    input_sequence_length, output_sequence_length,
    batch_dims, input_overlap, preload=True
)

## Developing the model structure

We have finally gotten to the point where we can start developing our model structure. This part is going to be quite concise in terms of the code that we'll use, but behind such a short amount of code are decades of both code and mathematical development. We'll take a brief detour to understand how and why we use the methods we do before moving on.

<<! LSTM world here>>>

With all of this background understanding in tow, we can start to tackle the practical problem of building our model. For the most part we can rely on off-the-shelf components, but it's worth seeing how to implement a (very) basic neural network layer in the pytorch framework. To be clear, what follows could be implemented inside of something like the training loop, but [training neural networks is a leaky abstraction](http://karpathy.github.io/2019/04/25/recipe/) and as often as possible, it is better to decouple what your model *does* from how *well* it does it.

As we saw, an LSTM network can take an input of dimensions `(batch, timesteps, features)` and output `(batch, timesteps, targets)`. But it's often the case (including here) that the number of input timesteps won't match the number of target timesteps. From an abstract standpoint this is exactly the same consideration as converting the *features* dimension to a *target* dimension. But unlike situations such as language translation, where the lengths of the input and output sequences may be decoupled, we claim that any change in snowpack can only be a result of the meteorological conditions from today or the past. As such, we further claim that the current snowpack state is only a function of some history of meteorological states. This is exactly how we've set up the data loaders!

To take advantage of this, and of the setup of the standard LSTM module from pytorch, we can define the `LSTMOutput` class, a neural network "layer" that simply truncates the output time length to whatever is specified via the `out_len` variable. This is done in standard Python fashion by declaring the class and its `__init__` constructor method, and in standard pytorch fashion by defining the `forward` method, which tells pytorch how to handle the forward pass; backpropagation is handled by pytorch's internal machinery.

In [5]:
class LSTMOutput(nn.Module):
    def __init__(self, out_len=1):
        super().__init__()
        self.out_len = out_len
        
    def forward(self, x):
        # nn.LSTM returns (output, (hn, cn)), so we just
        # want to grab the `output`
        # Output shape (batch, sequence_length, hidden)
        output, _ = x
        # Now keep only the last `out_len` steps of the sequence
        # Result shape (batch, out_len, hidden)
        return output[:, -self.out_len:, :]
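
As a small, self-contained check of the shapes involved, the sketch below runs a random batch through `nn.LSTM` and applies the same slicing that `LSTMOutput` uses. The sizes (4 samples, 8 hidden units) are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)
x = torch.randn(4, 360, 5)        # (batch, timesteps, features)

with torch.no_grad():
    output, (hn, cn) = lstm(x)    # the tuple that LSTMOutput receives

truncated = output[:, -30:, :]    # keep only the last 30 time steps

print(output.shape)     # torch.Size([4, 360, 8])
print(truncated.shape)  # torch.Size([4, 30, 8])

# For a single-layer, unidirectional LSTM the last output step
# is the same tensor of values as the final hidden state.
last_step_is_hn = torch.allclose(output[:, -1, :], hn[-1])
print(last_step_is_hn)
```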

With our custom layer set up we can start to build our model that will (hopefully) do something useful! In doing this we will be good scientists and put the full model creation workflow into a function that can take even more hyperparameters into consideration. We'll set some defaults here just to get things started. Then, in the `create_lstm_model` function we simply create a new overall model structure by chaining together layers, starting with the pytorch implementation of the LSTM, followed by our `LSTMOutput` layer that selects on time, and finally followed by a linear layer to project the dimensionality from the `hidden_size` down to the `output_size`.

In [6]:
hidden_size = 128
num_layers = 1
learning_rate = 1e-4
dropout = 0.0

def create_lstm_model(
    input_size, 
    hidden_size, 
    output_size, 
    output_sequence_length,
    num_layers, 
    dropout
):
    model = nn.Sequential(
        nn.LSTM(
            input_size=input_size, 
            hidden_size=hidden_size, 
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True,
        ),
        LSTMOutput(output_sequence_length),
        nn.Linear(in_features=hidden_size, out_features=output_size, bias=False),
        nn.LeakyReLU()
    )
    return model
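
To get a feel for the size of this model, the parameter count can be worked out by hand. The sketch below assumes PyTorch's standard LSTM parameterization (four gates, each with input weights, recurrent weights, and two bias vectors) and the default values above. `lstm_param_count` is a hypothetical helper, not part of the tutorial code.

```python
def lstm_param_count(input_size, hidden_size, num_layers=1):
    """Hypothetical helper: trainable parameters in a unidirectional LSTM."""
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates x (input weights + recurrent weights + two bias vectors)
        total += 4 * (
            in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
    return total


input_size, hidden_size, output_size = 5, 128, 1
n_lstm = lstm_param_count(input_size, hidden_size)
n_linear = hidden_size * output_size  # bias=False, so weights only
n_total = n_lstm + n_linear
print(n_lstm, n_linear, n_total)  # 69120 128 69248
```

For a model built with `create_lstm_model` and these defaults, `sum(p.numel() for p in model.parameters())` should report the same total, since the slicing layer and `nn.LeakyReLU` have no parameters.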

From this we can easily create new models in a programmatic way, which makes hyperparameter tuning and reproducibility much easier. We can use this function right away, given all of our other configuration. But before we can actually train the model there are a couple of other pieces that need to be connected, namely a loss function and an optimizer. Both have many options, including those implemented by default in the pytorch library, but these are generally beyond the scope of this tutorial. We'll just use the standard `Adam` optimizer (with the learning rate hyperparameter defined above) and mean squared error loss here.

In [7]:
model = create_lstm_model(
    len(input_vars), 
    hidden_size, 
    len(output_vars), 
    output_sequence_length, 
    num_layers, 
    dropout
)
model = model.float().to(DEVICE)

In [8]:
# model.load_state_dict(torch.load('../experiments/tutorial/tutorial.pt'))

In [9]:
opt = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fun = nn.MSELoss()  

## Some quick model/data verification

Many tutorials would jump straight into training the model now, but from a practical standpoint, it's worth making sure that your data inputs and outputs match up in the way that you expect them to before getting too deep. This can largely be a matter of trial and error in the worst case, but if you are careful and understand everything that happens in your model it is merely a formality. Starting here we'll first see that our input and output batches differ along both the `timesteps` and `features` dimensions. This should come as no surprise, but quantifying it is a good "gut check". Similarly, we can make sure that the shape of our model's output is the same as the shape of the target from the dataloader. If this is not the case something is wrong and you will need to go back in the data/model workflow to debug what's going on before you can train your model.

In [10]:
x, y = next(iter(train_pipe))
x = x.to(DEVICE)
y = y.to(DEVICE)

print('Dims are: (batch, timesteps, features)')
print(x.shape, y.shape)

with torch.no_grad():
    print(
        'Model targets match output ',
        model(x).shape == y.shape
    )

Dims are: (batch, timesteps, features)
torch.Size([232, 360, 5]) torch.Size([232, 30, 1])
Model targets match output  True


## Setting up the training procedure

In [11]:
def train_epoch(model, loader, opt, loss_fun, device=DEVICE):
    model.train()
    total_loss, n_batches = 0.0, 0
    for x, y in (bar := tqdm(loader)):
        # Skip batches with no valid samples
        if not len(x):
            continue
        x = x.to(device)
        y = y.to(device)
        opt.zero_grad()
        yhat = model(x)
        loss = loss_fun(yhat, y)
        loss.backward()
        opt.step()
        # .item() returns the loss as a plain Python float
        total_loss += loss.item()
        n_batches += 1
        bar.set_description(f'Training loss: {loss.item():.2e}')
    bar.container.close()
    # Average over the batches actually processed
    return total_loss / max(n_batches, 1)

In [12]:
def valid_epoch(model, loader, opt, loss_fun, device=DEVICE):
    # `opt` is unused here; it is kept so the signature matches `train_epoch`
    model.eval()
    total_loss, n_batches = 0.0, 0
    for x, y in (bar := tqdm(loader)):
        # Skip batches with no valid samples
        if not len(x):
            continue
        x = x.to(device)
        y = y.to(device)
        # No gradients are needed for evaluation
        with torch.no_grad():
            yhat = model(x)
            loss = loss_fun(yhat, y)
        total_loss += loss.item()
        n_batches += 1
        bar.set_description(f'Validation loss: {loss.item():.2e}')
    bar.container.close()
    # Average over the batches actually processed
    return total_loss / max(n_batches, 1)
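
When averaging per-batch losses over an epoch, it helps to divide by the number of batches that actually contributed, since empty batches are skipped. Here is a minimal sketch of that bookkeeping in plain Python, with made-up loss values standing in for real batches and a hypothetical `mean_batch_loss` helper:

```python
def mean_batch_loss(batch_losses):
    """Hypothetical helper: average loss over non-empty batches only.

    `None` stands in for a batch with no valid samples.
    """
    total, n_batches = 0.0, 0
    for loss in batch_losses:
        if loss is None:
            continue
        total += loss
        n_batches += 1
    # Guard against an epoch in which every batch was empty
    return total / n_batches if n_batches else float("nan")


epoch_mean = mean_batch_loss([0.5, None, 0.25, 0.75])
print(epoch_mean)  # 0.5
```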

## Time for model training

In [13]:
train_loss = []
valid_loss = []
max_epochs = 30

In [14]:
for e in (bar := tqdm(range(max_epochs))):
    vl = valid_epoch(model, valid_pipe, opt, loss_fun)
    tl = train_epoch(model, train_pipe, opt, loss_fun)
    train_loss.append(tl)
    valid_loss.append(vl)
    bar.set_description(f'Train loss: {tl:0.1e}, valid loss: {vl:0.1e}')

In [None]:
plt.plot(train_loss[1:], label='Train')
plt.plot(valid_loss[1:], label='Valid')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.semilogy()
plt.savefig('loss.png')

## A very light introduction to MLOps

In [None]:
experiment_config = {
    "data_config": {
        # Note: using `start`/`stop` attributes so we save a tuple
        "train_period": (train_period.start, train_period.stop),
        "valid_period": (valid_period.start, valid_period.stop),
        "test_period": (test_period.start, test_period.stop),
        "regions": regions,
        "input_vars": input_vars,
        "output_vars": output_vars,
        "input_sequence_length": input_sequence_length,
        "output_sequence_length": output_sequence_length,
        "batch_dims": batch_dims,
        "input_overlap": input_overlap
    },
    "model_config": {
        "input_size": len(input_vars),
        "hidden_size": hidden_size,
        "output_size": len(output_vars),
        "output_sequence_length": output_sequence_length,
        "num_layers": num_layers,
        "dropout": dropout
    },
}
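
Because the periods are stored as `(start, stop)` tuples, they need to be turned back into `slice` objects before they can be used to select data again. A minimal sketch of that round trip, using a trimmed-down stand-in for `experiment_config`:

```python
train_period = slice("1985", "2000")

# Store only plain values so the config is easy to serialize
config = {
    "data_config": {
        "train_period": (train_period.start, train_period.stop),
    }
}

# Later, rebuild the slice from the stored pair
restored = slice(*config["data_config"]["train_period"])

print(restored)                  # slice('1985', '2000', None)
print(restored == train_period)  # True
```

The restored slice can then be passed to `ds.sel(time=restored)` in the same way as in the split cell earlier.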

In [None]:
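import os


def save_experiment(config, output_dir, name, model=None):
    # Make sure the output directory exists before writing to it
    os.makedirs(output_dir, exist_ok=True)
    outfile = f"{output_dir}/{name}.yml"
    # Compare against None: the truthiness of an nn.Sequential depends on its length
    if model is not None:
        config["weights_file"] = f"{output_dir}/{name}.pt"
        torch.save(model.state_dict(), config["weights_file"])
    with open(outfile, "w") as f:
        yaml.dump(config, f)
    return outfile


def load_experiment(config_path):
    with open(config_path, "r") as f:
        # Recent PyYAML versions require an explicit Loader;
        # FullLoader can rebuild the tuples written by `yaml.dump`
        config = yaml.load(f, Loader=yaml.FullLoader)
    return config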

In [None]:
f = save_experiment(
    config=experiment_config, 
    output_dir="../experiments/tutorial", 
    name="tutorial", 
    model=model
)

In [None]:
experiment_config = load_experiment(f)
experiment_config

In [None]:
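loaded_model = create_lstm_model(**experiment_config['model_config'])
# map_location lets weights saved on a GPU load on a CPU-only machine
state_dict = torch.load(experiment_config['weights_file'], map_location=DEVICE)
loaded_model.load_state_dict(state_dict)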