# DAML4 notes
## Week 9 - Deep neural networks


In [None]:
# Torch has an annoying tendency to crash on macOS
# This line helps, but please just run it on Notable instead!
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

<hr style="border:2px solid black"> </hr>

We now move on to the topic that is responsible for most of the current hype in machine learning. When someone in the media refers to "AI", they usually mean a deep neural network (DNN).

For classification and regression we have considered linear models of the form 

$$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b$$ 

where:

- $\mathbf{x}\in\mathbb{R}^D$ and $f(\mathbf{x})\in\mathbb{R}^{1}$
- $\phi$ is a transformation that maps $\mathbf{x}$ to a feature vector $\phi({\mathbf{x}})\in\mathbb{R}^{Z}$ 
- $\mathbf{w}\in \mathbb{R}^{Z}$ and $b\in\mathbb{R}^1$

This setup is appropriate for regression to one-dimensional outputs or for binary classification. If, however, we want to perform regression to multidimensional outputs or perform multiway classification, then we can write a linear model more generally as:

$$f(\mathbf{x}) = \mathbf{W} \phi(\mathbf{x}) + \mathbf{b}$$ 

where:

- $\mathbf{x}\in\mathbb{R}^D$ and $f(\mathbf{x})\in\mathbb{R}^{K}$
- $\phi$ is a transformation that maps $\mathbf{x}$ to a feature vector $\phi({\mathbf{x}})\in\mathbb{R}^{Z}$ 
- $\mathbf{W}\in \mathbb{R}^{K\times Z}$ and $\mathbf{b}\in\mathbb{R}^K$
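
As a quick shape check, here is a minimal NumPy sketch of this general linear model. The sizes and the quadratic feature map `phi` are made up purely for illustration; the weight matrix is stored with one row per output so that `W @ phi(x)` is a length-$K$ vector.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z, K = 2, 4, 3  # hypothetical sizes, chosen only for illustration


def phi(x):
    """Hypothetical hand-picked feature map from R^2 to R^4: [x1, x2, x1^2, x2^2]."""
    return np.concatenate([x, x**2])


W = rng.normal(size=(K, Z))  # one row of weights per output
b = rng.normal(size=K)  # one bias per output
x = rng.normal(size=D)

f = W @ phi(x) + b
print(f.shape)  # (3,)
```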

The features we use are very important! You saw in the Week 7 lab that the difference between representing audio as raw waveforms versus spectrograms for classification was huge. Before DNNs became mainstream we had to hand-engineer feature transformations for images that involved multiple stages and were incredibly hacky.
DNNs give us a model framework where we can learn features directly from data. We effectively have a parameterised feature transformation:

$$f(\mathbf{x}) = \mathbf{W} \phi_{\boldsymbol\theta_f}(\mathbf{x}) + \mathbf{b}$$ 

For some loss $L$ with a linear model we would solve $\underset{\mathbf{W},\mathbf{b}}{\mathrm {minimise}}\,L$. With a DNN we instead solve $\underset{\boldsymbol\theta_f,\mathbf{W},\mathbf{b}}{\mathrm {minimise}}\,L$. This means we learn a feature transformation jointly with learning the linear transformation that is applied to those features.

So what does $\phi_{\boldsymbol\theta_f}(\mathbf{x})$ look like? In the simplest DNN, the multilayer perceptron (MLP), it is a series of linear transformations, each followed by a non-linearity.

## Multilayer perceptrons (MLPs)



Each linear transform + non-linearity pair is referred to as a layer. When counting layers we typically include the final linear transformation too. 

If we denote $\mathbf{x}$ as $\mathbf{h}^{(0)}$ then we can write an $\mathcal{L}$-layer MLP as:

- $\mathbf{h}^{(l+1)} = g(\mathbf{W}^{(l)}\mathbf{h}^{(l)} + \mathbf{b}^{(l)}) \,\,\text{for}\,\,l=0,1,\dots,\mathcal{L}-2$
- $f(\mathbf{x})=  \mathbf{W}^{(\mathcal{L} -1)}\mathbf{h}^{(\mathcal{L}-1)} + \mathbf{b}^{(\mathcal{L} -1)}$

We can compare this second equation to the $f(\mathbf{x})$ above to observe that our feature transformation is $\phi_{\boldsymbol\theta_f}(\mathbf{x}) = \mathbf{h}^{(\mathcal{L}-1)}$ and that $\boldsymbol\theta_f$ consists of the weight matrices and bias vectors of the first $\mathcal{L}-1$ layers. We can collect all our parameters into one set $\boldsymbol\theta= \{\boldsymbol\theta_f,\mathbf{W}^{(\mathcal{L} -1)},\mathbf{b}^{(\mathcal{L} -1)}\}$ for notational convenience.
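
The recursion above can be sketched in plain NumPy. This is only an illustration under some assumptions: $g$ is taken to be a ReLU, the layer sizes are made up, and each weight matrix is stored with shape (outputs, inputs). The function returns both $f(\mathbf{x})$ and the last hidden vector, which plays the role of the learned features.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def mlp_forward(x, weights, biases):
    """Return (f(x), features) for an MLP with len(weights) layers."""
    h = x  # h^(0) is the input
    # Every layer except the last: linear transform followed by the non-linearity g
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    # Final layer: linear transform only, applied to the learned features
    return weights[-1] @ h + biases[-1], h


rng = np.random.default_rng(0)
sizes = [2, 4, 4, 3]  # hypothetical sizes: D = 2, two hidden layers of 4, K = 3
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

f, features = mlp_forward(rng.normal(size=2), weights, biases)
print(f.shape, features.shape)  # (3,) (4,)
```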


The problem of learning an MLP is now just $\underset{\boldsymbol\theta}{\mathrm {minimise}}\ L$, which we can do with stochastic gradient descent. We use the backpropagation algorithm from the lecture to compute all the required gradients efficiently, although when we actually run the code this is all done under the hood!

Without further ado, let's consider a binary classification problem and train a two-layer MLP to solve it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "lines.markersize": 7,  # Big points
        "font.size": 15,  # Larger font
        "xtick.major.size": 5.0,  # Bigger xticks
        "ytick.major.size": 5.0,  # Bigger yticks
    }
)

# Generate two classes: an inner disc (radius < 1) and an outer ring (radius 1.5 to 3)

np.random.seed(42)
radii = np.hstack(
    [np.random.uniform(0, 1, size=100), np.random.uniform(1.5, 3, size=100)]
)
theta = np.random.uniform(0, 2 * np.pi, size=200)

X = np.zeros((200, 2))
X[:, 0] = radii * np.sin(theta)
X[:, 1] = radii * np.cos(theta)

y = np.hstack([np.zeros(100), np.ones(100)])

# Now let's plot our data, with a different colour for each class
fig, ax = plt.subplots(figsize=[6, 6])
ax.set_aspect("equal", "box")
ax.scatter(X[y == 0, 0], X[y == 0, 1], color="b", edgecolor="k")
ax.scatter(X[y == 1, 0], X[y == 1, 1], color="r", edgecolor="k")
ax.grid()
ax.legend(['class $0$','class $1$'])


The two layer MLP can be written as:

- $\mathbf{h}^{(1)} = g(\mathbf{W}^{(0) }\mathbf{x} + \mathbf{b}^{(0)})$
- $f(\mathbf{x})=  \mathbf{W}^{(1)}\mathbf{h}^{(1)} + \mathbf{b}^{(1)}$

Each data point is 2D, and we will use 3 *neurons* in the hidden layer. This means that $\mathbf{W}^{(0)}\in\mathbb{R}^{3\times2}$ and $\mathbf{b}^{(0)}\in\mathbb{R}^{3}$.

For classification with MLPs, we follow the multinomial logistic regression framework in that our output is a vector where each element is the logit for one class. There is no harm in doing this even when you only have two classes. We will therefore make the output 2D, where the elements are the logits for class 0 and class 1.

It follows that $\mathbf{W}^{(1)}\in\mathbb{R}^{2\times3}$ and $\mathbf{b}^{(1)}\in\mathbb{R}^{2}$. 
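
To make the logits concrete, here is a small NumPy sketch of how a 2D logit vector becomes class probabilities (via a softmax) and then a cross-entropy loss for one labelled point. The logit values and the label are made up for illustration.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


logits = np.array([2.0, 0.0])  # hypothetical logits for class 0 and class 1
probs = softmax(logits)

label = 0  # hypothetical true class
loss = -np.log(probs[label])  # cross-entropy for a single data point
print(probs, loss)
```

The larger logit gets the larger probability, and the loss is small when the true class is the one with the large logit.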

For $g$ we will use a ReLU non-linearity.

### Defining an MLP in PyTorch

[PyTorch](https://pytorch.org) is an amazing deep learning framework. I used it for the vast majority of my postdoc, and my PhD students use it all the time. It makes it really easy to define a neural network. The code below uses PyTorch to make a module that represents our two-layer MLP: 

In [None]:
import torch  # The whole pytorch package
from torch import nn  # the nn package for making neural networks
import torch.nn.functional as F  # Useful functions

torch.manual_seed(1)  # Fix RNG
np.random.seed(2)


class MLP_Module(nn.Module):
    def __init__(self):
        super().__init__()

        self.dense0 = nn.Linear(2, 3)
        self.nonlin = F.relu
        self.dense1 = nn.Linear(3, 2)

    def forward(self, X):
        # Cast to float32 so that float64 inputs (e.g. NumPy arrays)
        # match the dtype of the layer weights
        X = X.float()

        # Apply the two layers to the input
        h1 = self.nonlin(self.dense0(X))
        out = self.dense1(h1)
        return out

[nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) applies a linear transform (i.e. multiplies by a weight matrix and adds a bias vector) to some input vector. The weights and bias are initialised with random values, but can (and will!) be learnt.
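
As a small check of what `nn.Linear` computes, the sketch below compares a layer's output with the same calculation written out by hand. The batch of 4 random inputs is made up for illustration. Note that the layer stores its weight with shape `(out_features, in_features)`, and that inputs are rows, so the manual version multiplies by the transpose.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(2, 3)  # 2 inputs, 3 outputs
x = torch.randn(4, 2)  # hypothetical batch of 4 two-dimensional points

# Multiply by the weight matrix and add the bias by hand
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape)  # torch.Size([3, 2])
print(torch.allclose(layer(x), manual, atol=1e-6))  # True
```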

We can create an MLP object from this class and investigate some of these weights.

In [None]:
mlp = MLP_Module()

# This is W^(0)
print(mlp.dense0.weight)

# This is b^(0)
print(mlp.dense0.bias)

# This is W^(1)
print(mlp.dense1.weight)

# This is b^(1)
print(mlp.dense1.bias)


The magic of PyTorch (and other deep learning libraries) is their ability to do **automatic differentiation**. This means we don't have to explicitly code any gradient expressions (although you will in the lab as an exercise :)).
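
As a tiny illustration, consider a toy function chosen only because its gradient is easy to verify by hand: for $L(\mathbf{w}) = \sum_i w_i^2$ we know that $\partial L/\partial w_i = 2w_i$. Autograd recovers this without us writing the derivative anywhere.

```python
import torch

# requires_grad=True tells PyTorch to track operations on w
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

loss = (w**2).sum()  # a scalar function of w
loss.backward()  # fills in w.grad with dL/dw

print(w.grad)  # equals 2 * w, i.e. [2, -4, 6]
```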

To see this in action on our MLP, we are going to create some dummy data and labels. First, we'll pass the dummy data through the network to get logits.

In [None]:
# Dummy array of 10 random data points
X_dummy = torch.randn(10, 2, requires_grad=True)
logits = mlp(X_dummy)  # calling the module runs its forward method
print(logits)

Now we can use these logits, along with some dummy labels, to compute a scalar cross-entropy loss. We can then call `backward` and this will compute the gradient of the loss with respect to all the parameters automatically!

In [None]:
loss = nn.CrossEntropyLoss()

# Dummy labels
y_dummy = torch.empty(10, dtype=torch.long).random_(2)
output = loss(logits, y_dummy)
output.backward()


# This is dL/dW^(0)
print(mlp.dense0.weight.grad)

# This is dL/db^(0)
print(mlp.dense0.bias.grad)

# This is dL/dW^(1)
print(mlp.dense1.weight.grad)

# This is dL/db^(1)
print(mlp.dense1.bias.grad)

# Reset the gradients: PyTorch accumulates them across backward() calls
mlp.zero_grad()

We will be using skorch for training, so we don't even need to call `.backward()` ourselves, but it is useful to know that it is happening under the hood.
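
For reference, a hand-written training loop looks roughly like the sketch below. This is not the code skorch runs internally; it is a minimal illustration that assumes made-up toy data, full-batch gradient descent, and an `nn.Sequential` stand-in for the MLP.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: label is 1 if the point lies outside the unit circle
X_toy = torch.randn(64, 2)
y_toy = (X_toy.norm(dim=1) > 1.0).long()

model = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 2))
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(50):
    optimiser.zero_grad()  # clear gradients left over from the previous step
    loss = criterion(model(X_toy), y_toy)  # forward pass and loss
    loss.backward()  # backpropagation computes all the gradients
    optimiser.step()  # gradient descent update of every parameter
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Each iteration is the same four steps: zero the gradients, compute the loss, call `backward`, and take an optimiser step.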

### Training an MLP in skorch 

PyTorch is brilliant, but its training code tends to be verbose, whereas we are used to sklearn, where we can just call `.fit(X, y)` on a model. 

Fear not! For the rest of this course we will use [skorch](https://github.com/skorch-dev/skorch), which is an sklearn-style wrapper for PyTorch that lets us train neural networks in only a few lines.

To do this, we need to use skorch's [NeuralNetClassifier](https://skorch.readthedocs.io/en/stable/classifier.html#skorch.classifier.NeuralNetClassifier) with the network module we defined in PyTorch:

In [None]:
from skorch import NeuralNetClassifier


net = NeuralNetClassifier(
    MLP_Module,
    max_epochs=200,
    lr=0.1,
    criterion=nn.CrossEntropyLoss,
)

# Here, lr is the learning rate
# Criterion is our loss which is cross entropy
# Max_epochs is the number of times we cycle through the dataset during SGD

Now, we just need to convert `y` to the integer type skorch expects for class labels, and call `.fit`!

In [None]:
y = y.astype(np.int64)
net.fit(X, y)

In [None]:
# Sanity check: X is still a NumPy array; the forward method casts it to float32
isinstance(X, np.ndarray)

You'll notice that skorch printed some train and validation accuracies and losses, even though we only handed in training data. This is because skorch by default holds out 20% of the training data as a validation set.

### Viewing the decision boundary of the MLP

We can view decision boundaries as we did with sklearn before. The only requirement is that the forward method above casts its input to float. This is because PyTorch is very particular about its input types. 

In [None]:
# Imports for plotting the decision boundary
from matplotlib.colors import ListedColormap
from sklearn.inspection import DecisionBoundaryDisplay

In [None]:
# Now let's plot our data, with a different colour for each class
fig, ax = plt.subplots(figsize=[6, 6])
ax.set_aspect("equal", "box")
ax.scatter(X[y == 0, 0], X[y == 0, 1], color="b", edgecolor="k")
ax.scatter(X[y == 1, 0], X[y == 1, 1], color="r", edgecolor="k")
ax.grid()
ax.legend(["class $0$", "class $1$"])
# Add the decision boundary
disp = DecisionBoundaryDisplay.from_estimator(
    net,
    X,
    response_method="predict",
    alpha=0.3,
    grid_resolution=300,
    ax=ax,
    cmap=ListedColormap(["b", "r"]),
)

<hr style="border:2px solid black"> </hr>

#### Written by Elliot J. Crowley and &copy; The University of Edinburgh 2022-23