# Introduction to Neural Networks

## Outline

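The **Universal Approximation Theorem** states that a feedforward neural network with at least one hidden layer and a suitable nonlinear activation function can approximate any continuous function on a compact domain to arbitrary accuracy, provided it has enough hidden units. A visual demonstration of this result can be seen on [this page](http://neuralnetworksanddeeplearning.com/chap4.html).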
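To make the statement a little more concrete, here is a minimal sketch of the "stack of steps" construction used in the linked page: a single hidden layer of steep sigmoids, with weights set by hand rather than learned, approximating $\sin(2\pi x)$ on $[0, 1]$. The target function, the number of intervals and the steepness are illustrative assumptions, not part of this notebook's workflow.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z))
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def target(x):
    return np.sin(2.0 * np.pi * x)


n_intervals = 50    # hypothetical choice: more intervals give a finer staircase
steepness = 1000.0  # a large input weight makes each sigmoid nearly a step

edges = np.linspace(0.0, 1.0, n_intervals + 1)
midpoints = 0.5 * (edges[:-1] + edges[1:])
levels = target(midpoints)  # height of the staircase on each interval

# One hidden unit per interior edge; its output weight is the jump in height there
interior_edges = edges[1:-1]
jumps = np.diff(levels)


def one_hidden_layer(x):
    # hidden[i, j] = sigmoid(steepness * (x_i - edge_j))
    hidden = sigmoid(steepness * (x[:, None] - interior_edges[None, :]))
    return levels[0] + hidden @ jumps


x = np.linspace(0.0, 1.0, 2001)
max_error = np.max(np.abs(one_hidden_layer(x) - target(x)))
print(f"Max error with {len(jumps)} hidden units: {max_error:.3f}")
```

Increasing `n_intervals` shrinks the error, which is the intuition behind "enough hidden units" in the theorem.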

In this notebook, we give a brief overview of neural networks and how to build them using PyTorch. If you want to go through it in depth, check out these resources:
- [Deep Learning With Pytorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)
- [Neural Networks](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html)

In [None]:
%matplotlib inline
import time
from IPython.display import Image

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.utils.data as Data
from torch import nn, optim

from L96_model import L96, RK2, RK4, EulerFwd, L96_eq1_xdot, integrate_L96_2t

In [None]:
# Ensuring reproducibility
np.random.seed(14)
torch.manual_seed(14);

### Build the *Real World* to Generate the Ground Truth Dataset

We initialise the L96 two time-scale model using $K$ (set to 8) values of $X$ and $J$ (set to 32) values of $Y$ for each $X$. The model is run for 20,000 timesteps to generate the dataset for the neural network.

In [None]:
time_steps = 20000
forcing, dt, T = 18, 0.01, 0.01 * time_steps

# Create a "real world" with K=8 and J=32
W = L96(8, 32, F=forcing)

### Getting Training Data

Using the *real world* model created above, we generate the training data (input and output pairs) for the neural network by running the model and storing, at every time step, the state $X$ and the subgrid tendencies (the effect of $Y$ on $X$).

In [None]:
# The effect of Y on X is `xy_true`
X_true, _, _, xy_true = W.run(dt, T, store=True, return_coupling=True)

# Change the data type to `float32` in order to avoid doing type conversions later on
X_true, xy_true = X_true.astype(np.float32), xy_true.astype(np.float32)

### Split the Data to obtain the Training and Test (Validation) Set

In [None]:
# Number of time steps for validation
val_size = 4000

# Training Data: every time step except the last `val_size`
# (the arrays are flattened later, when the datasets are built)
X_true_train = X_true[:-val_size, :]
subgrid_tend_train = xy_true[:-val_size, :]

# Test Data
X_true_test = X_true[-val_size:, :]
subgrid_tend_test = xy_true[-val_size:, :]
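The slicing pattern above keeps the time ordering: the last `val_size` rows are held out and everything before them is used for training. A minimal sketch on a toy array (the sizes here are hypothetical and far smaller than the real dataset) shows what each slice selects:

```python
import numpy as np

n_steps, n_vars, val_size = 10, 2, 3  # hypothetical toy sizes
series = np.arange(n_steps * n_vars).reshape(n_steps, n_vars)

train = series[:-val_size, :]  # first n_steps - val_size time steps
test = series[-val_size:, :]   # last val_size time steps

print("train shape:", train.shape)
print("test shape:", test.shape)
```

Because the split is chronological rather than random, the test set measures how the model performs on a later stretch of the trajectory.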

### Create Dataloaders 

- The `Dataset` and `DataLoader` classes provide a convenient way of iterating over a dataset while training a deep learning model.

- We iterate over the data in mini-batches because holding all the data in memory and running gradient descent over all of it at once is slow and memory-intensive (see more details [here](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/) and [here](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)).

In [None]:
# Number of samples in each batch
BATCH_SIZE = 1024
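As a rough illustration of what mini-batching means, the sketch below shuffles a toy dataset and walks through it in fixed-size chunks using NumPy only. It is a conceptual stand-in, not how `DataLoader` is implemented; the helper name `iterate_minibatches` and the toy sizes are assumptions made for this example.

```python
import numpy as np


def iterate_minibatches(inputs, targets, batch_size, rng):
    """Yield shuffled (inputs, targets) mini-batches that cover the data once."""
    order = rng.permutation(len(inputs))
    for start in range(0, len(inputs), batch_size):
        idx = order[start:start + batch_size]
        yield inputs[idx], targets[idx]


rng = np.random.default_rng(0)
n_samples, batch_size = 10, 4  # hypothetical toy sizes
inputs = np.arange(n_samples, dtype=np.float32)
targets = 2.0 * inputs

batch_sizes = []
for batch_x, batch_y in iterate_minibatches(inputs, targets, batch_size, rng):
    # Inputs and targets are indexed together, so the pairs stay aligned
    assert np.array_equal(batch_y, 2.0 * batch_x)
    batch_sizes.append(len(batch_x))

print(batch_sizes)  # the last batch is smaller when batch_size does not divide n_samples
```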

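Define the X (state) and Y (subgrid tendency) pairs for the local linear regression network. The arrays are flattened so that each individual $X_k$ value and its subgrid tendency form one training sample.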

In [None]:
local_dataset = Data.TensorDataset(
    torch.from_numpy(np.reshape(X_true_train, -1)),
    torch.from_numpy(np.reshape(subgrid_tend_train, -1)),
)

local_loader = Data.DataLoader(
    dataset=local_dataset, batch_size=BATCH_SIZE, shuffle=True
)
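Flattening both arrays with `np.reshape(..., -1)` keeps the input and output samples aligned, because both are unrolled in the same row-major order. A minimal check on toy values (hypothetical numbers, chosen so the pairing is easy to read):

```python
import numpy as np

# Two time steps (rows) and two X variables (columns)
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
tend_toy = np.array([[10.0, 20.0], [30.0, 40.0]])

flat_X = np.reshape(X_toy, -1)
flat_tend = np.reshape(tend_toy, -1)

# Each flattened input still sits next to the tendency from the same (time, k) entry
pairs = list(zip(flat_X.tolist(), flat_tend.tolist()))
print(pairs)
```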

Define the dataloader for the test set.

In [None]:
local_dataset_test = Data.TensorDataset(
    torch.from_numpy(np.reshape(X_true_test, -1)),
    torch.from_numpy(np.reshape(subgrid_tend_test, -1)),
)

# Shuffling is not needed for evaluation
local_loader_test = Data.DataLoader(
    dataset=local_dataset_test, batch_size=BATCH_SIZE, shuffle=False
)

Display a batch of samples from the dataset.

In [None]:
# Iterating over the data to get one batch
data_iterator = iter(local_loader)
X_iter, subgrid_tend_iter = next(data_iterator)

print("X (State):\n", X_iter)
print("\nY (Subgrid Tendency):\n", subgrid_tend_iter)

plt.figure(dpi=150)
plt.plot(X_iter, subgrid_tend_iter, ".")
plt.xlabel("State", fontsize=20)
plt.ylabel("Subgrid tendency", fontsize=20);

### Neural Network Architectures

We will build up to fully connected networks by starting from linear regression trained with gradient descent.

```{figure} https://miro.medium.com/max/720/1*VHOUViL8dHGfvxCsswPv-Q.png
:name: neural-network
:width: 600

A neural network with 4 hidden layers and an output layer.
```

### Building a Linear Regression Network

First, we will build a linear regression "network" and later see how to generalize the linear regression in order to use fully connected neural networks.

In [None]:
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(1, 1)  # A single input and a single output

    def forward(self, x):
        # This method is executed automatically when
        # we call an object of this class
        x = self.linear1(x)
        return x

In [None]:
linear_network = LinearRegression()
linear_network

### Test forward function

In [None]:
net_input = torch.randn(1, 1)
out = linear_network(net_input)
print(f"The output of the random input is: {out.item():.4f}")
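For reference, an `nn.Linear(1, 1)` layer computes $y = w x + b$ with one weight and one bias. The sketch below checks this by hand on a fresh layer; the seed and input values are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(1, 1)  # one weight w and one bias b

x = torch.tensor([[0.5], [-1.0], [2.0]])  # three samples, one feature each

# Manual computation of y = x w^T + b
manual = x @ layer.weight.T + layer.bias
auto = layer(x)

print("w =", layer.weight.item(), " b =", layer.bias.item())
print("matches nn.Linear:", torch.allclose(manual, auto))
```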

### Defining the Loss Function

In order to check how well our network is modeling the dataset, we need to define a loss function. For our task, we choose the *Mean Squared Error* metric as our loss function.

In [None]:
# MSE loss function
criterion = torch.nn.MSELoss()

In [None]:
# Load the input and output pair from the data loader
X_tmp = next(iter(local_loader))

# Predict the output
y_tmp = linear_network(torch.unsqueeze(X_tmp[0], 1))

# Calculate the MSE loss
loss = criterion(y_tmp, torch.unsqueeze(X_tmp[1], 1))
print(f"MSE Loss: {loss.item():.4f}")
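The mean squared error is the average of the squared differences between predictions and targets, $\frac{1}{N}\sum_i (\hat{y}_i - y_i)^2$. As a small sanity check on made-up numbers, the manual formula and `nn.MSELoss` (with its default `reduction="mean"`) should agree:

```python
import torch
from torch import nn

prediction = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 3.0, 2.0])

# Squared errors are 0, 1 and 4, so the mean is 5 / 3
manual_mse = torch.mean((prediction - target) ** 2)
builtin_mse = nn.MSELoss()(prediction, target)

print(f"manual: {manual_mse.item():.4f}, built-in: {builtin_mse.item():.4f}")
```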

### Calculating gradients

In [None]:
# Zero the gradient buffers of all parameters
linear_network.zero_grad()

print("Gradients before backward:")
print(linear_network.linear1.bias.grad)

# Compute the gradients
loss.backward(retain_graph=True)

print("\nGradients after backward:")
print(linear_network.linear1.bias.grad)
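For a linear model with MSE loss the gradients have a closed form: with residual $r_i = w x_i + b - y_i$, we get $\partial L / \partial w = \frac{2}{N}\sum_i r_i x_i$ and $\partial L / \partial b = \frac{2}{N}\sum_i r_i$. The following sketch compares these with what `backward()` computes, using a fresh layer and toy data that are assumptions for this example only.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(1, 1)
x = torch.tensor([[0.0], [1.0], [2.0]])
y = torch.tensor([[1.0], [3.0], [5.0]])

# Gradients from autograd
loss = torch.mean((layer(x) - y) ** 2)
layer.zero_grad()
loss.backward()

# Gradients from the closed-form expressions
with torch.no_grad():
    residual = layer(x) - y
    grad_w = 2.0 * torch.mean(residual * x)
    grad_b = 2.0 * torch.mean(residual)

print("dL/dw:", layer.weight.grad.item(), "vs", grad_w.item())
print("dL/db:", layer.bias.grad.item(), "vs", grad_b.item())
```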

### Updating the Weights using an Optimizer

To make the network learn, we need an algorithm that updates its weights so as to reduce the loss. This is the job of an optimizer. PyTorch ships implementations of most commonly used optimizers in `torch.optim`. The choice of optimizer can matter a great deal, since it affects how quickly the network learns.

In the example below, we show one popular optimizer, `SGD` (stochastic gradient descent).

In [None]:
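optimizer = optim.SGD(linear_network.parameters(), lr=0.003, momentum=0.9)
print("Before optimizer step: \n", list(linear_network.parameters())[0].data.numpy())

# Clear the gradients left over from the earlier backward call;
# otherwise the new gradients would be added on top of them
optimizer.zero_grad()
loss.backward(retain_graph=True)
optimizer.step()

print("\nAfter optimizer step: \n", list(linear_network.parameters())[0].data.numpy())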

SGD with momentum has two main hyperparameters: the **learning rate** and the **momentum**. The **learning rate** ($LR$) sets the size of each weight update. If it is too large, training can diverge; if it is too small, the network takes much longer to converge. To read about **momentum**, check out this [blog post](https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d).

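The effective value of the gradient $V$ at step $t$ in SGD with momentum ($\beta$) is determined by

\begin{equation}
V_t = \beta V_{t-1} + (1-\beta) \nabla_w L(w,X,y)
\end{equation}

and the updates to the weights will be

\begin{equation}
w^{new} = w^{old} - LR \cdot V_t
\end{equation}

Note that `optim.SGD` in PyTorch, with its default `dampening=0`, omits the $(1-\beta)$ factor and accumulates $V_t = \beta V_{t-1} + \nabla_w L(w,X,y)$. This rescales the effective step size but leaves the idea unchanged.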
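The momentum update can be tried out in a few lines of plain Python. This is a minimal sketch of the averaged form $V_t = \beta V_{t-1} + (1-\beta)\nabla_w L$ followed by $w \leftarrow w - LR \cdot V_t$, applied to a toy loss $L(w) = (w - 3)^2$. The loss, starting point and hyperparameter values are assumptions chosen for illustration.

```python
def sgd_momentum(grad_fn, w0, lr, beta, n_steps):
    """Run n_steps of the momentum update on a scalar weight."""
    w, v = w0, 0.0
    for _ in range(n_steps):
        v = beta * v + (1.0 - beta) * grad_fn(w)  # running average of gradients
        w = w - lr * v                            # weight update
    return w


def toy_grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)


w_first = sgd_momentum(toy_grad, w0=0.0, lr=0.1, beta=0.9, n_steps=1)
w_final = sgd_momentum(toy_grad, w0=0.0, lr=0.1, beta=0.9, n_steps=500)

print(f"after 1 step:    w = {w_first:.4f}")
print(f"after 500 steps: w = {w_final:.6f}")
```

After the first step $V_1 = 0.1 \times (-6) = -0.6$, so $w$ moves from 0 to 0.06; repeated steps carry it to the minimum at $w = 3$.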

#### Adam Optimizer

Another popular optimizer that is used in many neural networks is the Adam optimizer. It is an adaptive learning rate method that computes individual learning rates for different parameters. For further reading, check out this [post](https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c) about Adam, and this [post](https://www.ruder.io/optimizing-gradient-descent/) about other optimizers.
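To give a feel for what "adaptive learning rate" means, here is a minimal scalar sketch of the standard Adam update with bias correction. It is written for illustration and is not PyTorch's implementation; the helper name `adam_step`, the toy loss $L(w) = (w-3)^2$ and the hyperparameter values are assumptions.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.03, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# The first step has size close to lr, whatever the gradient magnitude
w_first, _, _ = adam_step(w=0.0, grad=-6.0, m=0.0, v=0.0, t=1)
print(f"after 1 step: w = {w_first:.4f}")

# Minimise the toy loss L(w) = (w - 3)^2
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
print(f"after 1000 steps: w = {w:.4f}")
```

Dividing by $\sqrt{\hat{v}}$ normalises the step by the recent gradient magnitude, which is why each parameter effectively gets its own learning rate.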


## Combining it all Together: Training the Whole Network


### Define the Training and Test Functions

In [None]:
def train_model(network, criterion, loader, optimizer):
    """Train the network for one epoch"""
    network.train()

    train_loss = 0
    for batch_x, batch_y in loader:
        # Get predictions
        if len(batch_x.shape) == 1:
            # This if block is needed to add a dummy dimension if our inputs are 1D
            # (where each number is a different sample)
            prediction = torch.squeeze(network(torch.unsqueeze(batch_x, 1)))
        else:
            prediction = network(batch_x)

        # Compute the loss
        loss = criterion(prediction, batch_y)
        train_loss += loss.item()

        # Clear the gradients
        optimizer.zero_grad()

        # Backpropagation to compute the gradients and update the weights
        loss.backward()
        optimizer.step()

    return train_loss / len(loader)
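The `unsqueeze`/`squeeze` pair in the training function exists because `nn.Linear` expects inputs of shape `(batch, features)`, while the flattened dataset yields 1D batches. A short sketch of the shapes involved (toy values, and a fresh layer standing in for the network) also shows why the prediction is squeezed before it is compared with the 1D targets:

```python
import torch
from torch import nn

network = nn.Linear(1, 1)
batch_x = torch.tensor([0.1, 0.2, 0.3, 0.4])  # shape (4,)
batch_y = torch.tensor([1.0, 2.0, 3.0, 4.0])  # shape (4,)

column = torch.unsqueeze(batch_x, 1)    # (4,)   -> (4, 1)
raw_output = network(column)            # (4, 1)
prediction = torch.squeeze(raw_output)  # (4, 1) -> (4,)

# Without the squeeze, broadcasting silently produces a (4, 4) matrix of differences
mismatched = raw_output - batch_y
matched = prediction - batch_y

print("without squeeze:", tuple(mismatched.shape))
print("with squeeze:   ", tuple(matched.shape))
```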

In [None]:
def test_model(network, criterion, loader):
    """Test the network"""
    network.eval()  # Evaluation mode (important when having dropout layers)

    test_loss = 0
    with torch.no_grad():
        for batch_x, batch_y in loader:
            # Get predictions
            if len(batch_x.shape) == 1:
                # This if block is needed to add a dummy dimension if our inputs are 1D
                # (where each number is a different sample)
                prediction = torch.squeeze(network(torch.unsqueeze(batch_x, 1)))
            else:
                prediction = network(batch_x)

            # Compute the loss
            loss = criterion(prediction, batch_y)
            test_loss += loss.item()

        # Get an average loss for the entire dataset
        test_loss /= len(loader)

    return test_loss

In [None]:
def fit_model(network, criterion, optimizer, train_loader, val_loader, n_epochs):
    """Train and validate the network"""
    train_losses, val_losses = [], []
    start_time = time.time()
    for epoch in range(1, n_epochs + 1):
        train_loss = train_model(network, criterion, train_loader, optimizer)
        val_loss = test_model(network, criterion, val_loader)
        train_losses.append(train_loss)
        val_losses.append(val_loss)
    end_time = time.time()
    print(f"Training completed in {int(end_time - start_time)} seconds.")

    return train_losses, val_losses

### Set Hyperparameters

An epoch is one full pass over the training data; `n_epochs` is the number of such passes made during training.

In [None]:
n_epochs = 3
optimizer = optim.Adam(linear_network.parameters(), lr=0.03)

### Train the Network

In [None]:
_, _ = fit_model(
    linear_network, criterion, optimizer, local_loader, local_loader_test, n_epochs
)

### Show the Weights of the Trained Network

In [None]:
weights = np.array(
    [
        linear_network.linear1.weight.data.numpy()[0][0],
        linear_network.linear1.bias.data.numpy()[0],
    ]
)
print(weights)
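Since the network is a one-variable linear regression, its trained weight and bias can be compared with an ordinary least-squares fit, which has a closed-form solution. The sketch below illustrates this with `np.polyfit` on synthetic data; the true slope, intercept and noise level are made up for the example. The same call on the flattened training arrays would give a reference for the values printed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line, y = 2x + 1, plus small Gaussian noise
x = rng.uniform(-5.0, 5.0, size=1000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```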

### Compare Predictions with Ground Truth

In [None]:
predictions = linear_network(
    torch.unsqueeze(torch.from_numpy(np.reshape(X_true_test[:, 1], -1)), 1)
)
plt.figure(dpi=150)
plt.plot(predictions.detach().numpy()[0:1000], label="Predicted Values")
plt.plot(subgrid_tend_test[:1000, 1], label="True Values")
plt.legend(fontsize=7);