# Efficient training of neural networks

In the previous notebook, we outlined how neural networks can solve a large variety of tasks. Specifically, we demonstrated that neural networks can be used for image classification and for predicting complex biological processes, such as the conductance of nerve membranes. There are two main reasons for this versatility of neural networks:
1. neural networks are universal function approximators, meaning that, given enough neurons, they can approximate essentially any function from data; and
2. surprisingly many tasks can be approximated as input-output relationships, i.e. functions.

For example, we saw that digit classification can be achieved by learning a conditional probability distribution, $p(y | x)$, where $x$ denotes the pixels in the image, and $y$ denotes the digit. In this notebook, we dive deeper into some of the challenges of learning functions from data -- challenges that are important to understand when modelling language with neural networks.

**Learning outcomes:**

1. An intuitive understanding of architectural choices for neural networks
2. An understanding of the limitations of neural networks when learning functions from data
3. The value of prior knowledge for more efficient learning and better generalisation

In [None]:
import numpy as np
import torch
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
from torchsummary import summary

import kai

## 1 Using neural networks for language processing

The universal function approximation theorem states that multi-layer perceptrons (MLPs) can approximate any continuous function to arbitrary accuracy, provided they have sufficiently many artificial neurons. In principle, this makes it possible to learn any such function from data using neural networks. At this point, you might start to wonder what all of this has to do with natural language processing and large language models... After all, language seems to be quite different from simple input-output relationships, making it hard to imagine how the universality of neural networks helps with processing language.

But how different is language from an input-output relationship really? If we think, for example, about chatbots like our very own Equinor AIChat, typical interactions with the chatbot often take the form of questions (inputs) and answers (outputs): we ask a question or make a statement, and the chatbot replies. Hmm, so perhaps some elements of language can be interpreted as an input-output relationship after all!

That's exciting! Let us try to develop a neural network that can answer questions.

### Task 1: Brainstorm how you could use an MLP to answer questions

Assume for now that we have a fixed number of questions that can be asked and that also the number of answers to each question is fixed. What could the inputs to the network represent? What would the outputs be?

You can document your thoughts here:

### Task 2: What are the limitations of the Q&A MLP?

After making an initial attempt to define suitable inputs and outputs to the Q&A neural network, have a look at some example questions and answers below. Do you foresee any challenges with answering those questions with your approach? Think especially about how well your model might answer questions that it has not seen before, or even how it might answer questions that only use a slightly different wording than those in the dataset.

*Tip:* If you struggle to see the limitations of your Q&A neural network, let us simply try it out and see how well it does! Create a code cell and use the Q&A data to train an MLP. You can copy the neural network definition and the training script from the previous notebook. For simplicity, you could just focus on the first 50 questions and answers in the dataset and use ``nn.CrossEntropyLoss()`` as the objective for the training. Remember to split your dataset into a training and a test dataset to test your model on unseen questions. How well can your model answer the questions?

In [None]:
data = kai.get_qa_datatset()

print(data[0])
print('')
print(data[30])
print('')
print(data[42])

Ok, it's obvious that the Equinor AIChat does not use our simple Q&A neural network approach! While it is possible to enumerate questions and answers and to train a neural network to remember the mapping between questions and answers, it does not actually allow the neural network to answer any questions that are not identical to the training questions. In fact, by enumerating questions and answers, we abstract away any logic or knowledge that a human might be able to infer from the question-answer pairs, making it impossible to answer any new questions, even if the answers are contained in the training data.

For example, if the training data contained the question-answer pairs:

- Q: How long have Alice and Bob been living together? A: Alice and Bob have been living together for 10 years.
- Q: Who is Bob's neighbour? A: Bob's neighbour is Anders Opedal.

A human can infer that Alice's neighbour is Anders Opedal. But if we asked our simple Q&A neural network "Who is Alice's neighbour?", it would not be able to answer the question. In contrast, modern large language models *can* make similar inferences and answer many questions that they have not seen before. So we seem to be missing an important component of developing neural networks to make natural language processing work.

Note that, in principle, our Q&A neural network *can* answer any question, provided it is contained in the training data. It simply cannot make any inferences or generalise beyond the training questions in its current formulation. In other words, the universality of neural network approximations is not enough to solve natural language processing in practice.
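To make this limitation concrete, here is a minimal sketch of the enumeration approach. The two-question list and the helper name `encode_question` are illustrative assumptions; they are not part of the `kai` dataset or library. Each known question is identified only by its index, so a reworded or new question has no valid input representation at all.

```python
# Toy illustration (hypothetical data, not the kai Q&A dataset)
questions = [
    "How long have Alice and Bob been living together?",
    "Who is Bob's neighbour?",
]

# Enumerate the questions: each question is identified only by its index
question_to_index = {q: i for i, q in enumerate(questions)}


def encode_question(question):
    """Return a one-hot vector for a known question, or None otherwise."""
    index = question_to_index.get(question)
    if index is None:
        return None  # The question was never enumerated
    one_hot = [0.0] * len(questions)
    one_hot[index] = 1.0
    return one_hot


known = encode_question("Who is Bob's neighbour?")
unseen = encode_question("Who is Alice's neighbour?")
print(known)   # [0.0, 1.0]
print(unseen)  # None
```

The one-hot vector carries no information about the words in the question, which is why nothing learned for one question can transfer to another.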

## 2 Universal, but not generalisable

The challenges of using MLPs for learning function approximations that generalise beyond the training data are not unique to language processing. In fact, one of the first approaches that improved the generalisability of neural networks came from image classification. This advancement ultimately led to the development of the LLM neural networks that we are using today. We will therefore focus in this notebook on understanding this approach.

To begin, let us revisit the MNIST image classification task that we worked on in the previous notebook. Below we have defined the MLP for the image classification. The only change compared with the previous notebook is that we have slightly increased the size of the images by padding the space around the digits (see cell below).

In [None]:
# Load data
train_loader, test_loader1, test_loader2 = kai.get_modified_mnist_data()

# Get a batch of images and labels
images, labels = next(iter(train_loader))

kai.plot_images(images, labels)
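How `kai.get_modified_mnist_data` enlarges the images is not shown in this notebook. As a rough sketch, assuming the digits are zero-padded from 28x28 to 56x56 pixels (which would be consistent with the input size of 784 * 4 used for the MLP), the padding could look like this. The random array merely stands in for a real MNIST digit.

```python
import numpy as np

rng = np.random.default_rng(0)
digit = rng.random((28, 28))  # Placeholder for a 28x28 MNIST image

# Assumption: 14 black (zero) pixels are added on every side
padded = np.pad(digit, pad_width=14, mode="constant", constant_values=0.0)

print(padded.shape)            # (56, 56)
print(padded.size == 784 * 4)  # True
```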

### Task 3: Train the neural network on the new dataset

In [None]:
# Define model
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 50)
        self.fc2 = nn.Linear(50, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Set up training
input_size = 784 * 4  # Need to increase input size because images are bigger
num_classes = 10
learning_rate = 0.001
num_epochs = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize Network
model = NeuralNetwork(input_size=input_size, num_classes=num_classes).to(device)

# Loss and Optimizer
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train Network
kai.train_mnist_model(
    num_epochs, model, objective, optimizer, train_loader, device)

### Task 4: Check that the accuracy on the training data is still the same
You should get an accuracy of >95%.
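How `kai.check_accuracy` computes its result is not shown in this notebook. A common definition, sketched here with a hypothetical helper `accuracy` and made-up scores, is the fraction of images for which the class with the highest output score equals the true label.

```python
import numpy as np


def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = np.argmax(scores, axis=1)
    return float(np.mean(predictions == labels))


# Made-up output scores for 4 images and 3 classes
scores = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.1, 0.4, 0.2],
])
labels = np.array([0, 1, 2, 1])  # The last image is misclassified

print(accuracy(scores, labels))  # 0.75
```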

In [None]:
print('Accuracy on training data:')
kai.check_accuracy(train_loader, model, device)

Amazing! The padding of the images had no effect on the training accuracy -- and why would it?!? The digits in the images remain the same, whether there is black space around them or not. So let's verify that the accuracy is also unchanged on the test images (which we haven't used for the training).

### Task 5: Evaluate the accuracy of the model on the two test datasets
The test datasets are defined by ``test_loader1`` and by ``test_loader2``.

In [None]:
# TODO: Check the accuracy of the model on the test datasets

Wow! The accuracy for one of the datasets is significantly worse than for the other. 

### Task 6: Discuss why the model may perform differently on the two test datasets

Perhaps, visualising a few images from the test datasets can help you identify the reason...

In [None]:
# TODO: Plot a few images from the test dataset

Document your thoughts here:

## 3 Inductive biases

One of the first approaches that tackled this lack of "concept learning" was developed by Yann LeCun, who is now a Computer Science professor at NYU and the Chief AI scientist at Meta. In his seminal paper, LeCun proposed to exploit prior knowledge about the image classification task to make the neural network more efficient at learning from data and to help it generalise more easily beyond the training images (see https://hal.science/hal-03926082/document). More specifically, LeCun argued that image classification should not be sensitive to translations of the object of interest (i.e. the location of the digit in the image, see figure below). He therefore proposed the use of neural networks that are explicitly designed to make predictions which are invariant under these translations.

<img src="translation_invariance.gif" width="500" style="display: block; margin: 0 auto" >

His proposal to achieve this was to slightly modify how MLPs process images by introducing small blocks of artificial neurons that are slid across all patches of the image (see figure below). In this way, patterns learned in one region of the image can also be recognised in other regions of the image. These neural network layers are known as convolutional layers (in reference to functional analysis and physics, see https://en.wikipedia.org/wiki/Convolution).
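The claim that a pattern learned in one region is also recognised elsewhere can be checked numerically. The sketch below uses PyTorch's `F.conv2d` with an arbitrary random 3x3 window of weights and a toy 12x12 image (both are illustrative choices, not taken from the paper). Shifting the input pattern shifts the response of the convolution by the same amount, without changing its values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy image with a small 3x3 pattern, shape (batch, channels, height, width)
image = torch.zeros(1, 1, 12, 12)
image[0, 0, 2:5, 2:5] = torch.rand(3, 3)

# Move the pattern 4 pixels down and 4 pixels to the right
# (only zeros wrap around the image border)
shifted_image = torch.roll(image, shifts=(4, 4), dims=(2, 3))

# One randomly initialised 3x3 window of weights
kernel = torch.randn(1, 1, 3, 3)

features = F.conv2d(image, kernel)
shifted_features = F.conv2d(shifted_image, kernel)

# The response to the shifted image is the shifted response
same = torch.allclose(
    shifted_features, torch.roll(features, shifts=(4, 4), dims=(2, 3)))
print(same)  # True
```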

<img src="convolution.gif" width="500" style="display: block; margin: 0 auto" >

Using convolutional layers, LeCun defined the first convolutional neural network (CNN), which was the first neural network that was able to learn something close to a concept of a digit and thus classify images correctly beyond the training dataset (see below). His network consisted of 2 convolutional layers, each followed by a pooling layer. We will work through these details below. Note that the final component of the network remains an MLP (see the final two "dense" layers in the figure). Dense layers are the layers that we have used before (``nn.Linear``), where each input is processed by all artificial neurons. Dense layers are also sometimes referred to as fully connected layers.

<img src="Lenet.svg" width="500" style="display: block; margin: 0 auto" >

Below we will implement LeCun's CNN. Before we do so, a brief comment on its architecture, which looks substantially more complex than the "simple" MLPs that we used in the previous notebook. While LeCun's CNN looks complex at first, its architecture is conceptually quite intuitive: it has one block that is responsible for feature extraction (the convolutional layers) and another block that is responsible for the classification (the dense layers). The feature extractor is based on prior knowledge of the image classification task, and it implements inductive biases that help the neural network to learn more efficiently from data and to generalise more easily beyond the training data. The dense layers are not based on any prior assumptions about the modelling task; they simply make use of the universality of MLPs to map the extracted features to the probability masses of the digits. This combination of universal function approximation and inductive biases is one of the main reasons for the wide success of neural networks in recent years (including in language processing).

### Task 7: Implement a convolutional layer

A convolutional layer processes inputs similarly to the dense layers (``nn.Linear``) that we have used before. The only difference is that a smaller number of artificial neurons is now slid across the inputs, such that patterns learned from one patch of signals can also benefit the processing of another patch of signals. The neurons process the inputs as before by: 1. multiplying the signals by model weights; 2. summing the weighted signals; 3. adding a bias; and 4. applying an activation function.

For images, this "patch processing" takes the form of sliding a small window of model weights across the image (see below). After multiplying the inputs by the weights, the values are summed up and a bias is added to produce the value for one patch of signals before applying the activation function. In the figure below, the bias is equal to 0.

<img src="convolutional_layer.gif" width="500" style="display: block; margin: 0 auto" >

Implement a convolutional layer that takes an image (two-dimensional array of shape (m, n)), the model weights (two-dimensional array of shape (k, k)) and the bias (float) as input, and returns the processed image. For simplicity, we will choose a window of size 3x3 and assume that the input image is larger than 3x3 (just like in the image above).
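As a worked example for a single patch only (the full layer is left for the task), the sketch below computes the value for the top-left 3x3 patch of a 5x5 input with the bias set to 0. The arrays are the ones used in the first test further down. The activation function is left out here for simplicity, which is an assumption of this sketch.

```python
import numpy as np

inputs = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
weights = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])
bias = 0.0

# Top-left 3x3 patch of the image
patch = inputs[0:3, 0:3]

# 1. multiply by the weights, 2. sum, 3. add the bias
value = np.sum(patch * weights) + bias
print(value)  # 4.0

# Sliding a 3x3 window over a 5x5 image gives 3 positions per axis
m, n = inputs.shape
output_shape = (m - 3 + 1, n - 3 + 1)
print(output_shape)  # (3, 3)
```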

In [None]:
# TODO: Complete the implementation
def convolutional_layer(inputs, weights, bias):
    inputs = np.array(inputs)
    m, n = inputs.shape
    if (m <= 3) or (n <= 3):
        raise ValueError('Inputs has the wrong shape.')
    weights = np.array(weights)
    k1, k2 = weights.shape
    if (k1 != 3) or (k2 != 3):
        raise ValueError('Weights has the wrong shape.')
    bias = float(bias)

    # TODO Implement the convolution

    return output

In [None]:
# Test your implementation
# Test 1
inputs = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0]
])
weights = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1]
])
bias = 0
output = convolutional_layer(inputs, weights, bias)
true = np.array([
    [4, 3, 4],
    [2, 4, 3],
    [2, 3, 4]
])
np.testing.assert_equal(output, true)

# Test 2
bias = 5
output = convolutional_layer(inputs, weights, bias)
true = np.array([
    [9, 8, 9],
    [7, 9, 8],
    [7, 8, 9]
])
np.testing.assert_equal(output, true)

# Test 3
weights = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 1]
])
output = convolutional_layer(inputs, weights, bias)
true = np.array([
    [8, 7, 8],
    [7, 8, 7],
    [7, 7, 8]
])
np.testing.assert_equal(output, true)


### Task 8: Discuss the purpose of a pooling layer

In LeCun's CNN, each convolutional layer is followed by a pooling layer. A pooling layer aggregates neighbouring inputs into one output. This aggregation can take different forms. For example, the figure below shows a "max pooling" on the left and an "average pooling" on the right. For the max pooling, the output is the maximum value among the cluster of neighbouring pixels, while for the average pooling, the average value across the cluster of neighbouring pixels is returned.
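To make the two aggregation rules concrete, here is a minimal NumPy sketch of 2x2 pooling with stride 2 on a toy 4x4 input. The reshape trick and the variable names are illustrative choices; in PyTorch the corresponding layers are `nn.MaxPool2d` and `nn.AvgPool2d`.

```python
import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)

# Split the 4x4 map into a 2x2 grid of non-overlapping 2x2 blocks:
# axes are (block row, row in block, block column, column in block)
blocks = feature_map.reshape(2, 2, 2, 2)

# Aggregate within each block
max_pooled = blocks.max(axis=(1, 3))   # values [[5, 7], [13, 15]]
avg_pooled = blocks.mean(axis=(1, 3))  # values [[2.5, 4.5], [10.5, 12.5]]

print(max_pooled)
print(avg_pooled)
```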

<img src="average_pooling.gif" width="500" style="display: block; margin: 0 auto" >

Discuss in your group the purpose of pooling layers for the image classification task. Do they make features extracted by convolutional layers more robust? Are they simply a way of condensing the amount of information that is being processed? Do they have another purpose?

You can document your thoughts here:

### Task 9: Implement LeCun's CNN (LeNet)

Below is a diagram of LeCun's CNN, referred to as LeNet. Use PyTorch's pre-implemented layers (``nn.Conv2d``, ``nn.AvgPool2d``, ``nn.Linear``) to implement the LeNet neural network. You can find the API definition of the layers here: https://pytorch.org/docs/stable/nn.html. When you have finished implementing the neural network, train it on the MNIST dataset. We will first train it on the original MNIST dataset to establish that the network works as expected.

<img src="lenet-architecture.svg" width="150" style="display: block; margin: 0 auto" >

In [None]:
# TODO: Complete the implementation
class LeNet(nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        # TODO: Define the layers of the network

    def forward(self,x):
        # TODO: Chain the layers to process images according to the LeNet architecture

        return out

In [None]:
# Test that the implementation of the network is correct
model = LeNet()
s = summary(model, input_size=(1, 28, 28))

In [None]:
# Load MNIST dataset
batch_size = 64

# Load Data
train_dataset = datasets.MNIST(root='dataset/', train=True, transform=transforms.ToTensor(), download=True)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = datasets.MNIST(root='dataset/', train=False, transform=transforms.ToTensor(), download=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)

In [None]:
# Train Network
learning_rate = 0.001
num_epochs = 5

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LeNet().to(device)
objective = nn.CrossEntropyLoss()  # Name must match the argument passed to the training function below
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
kai.train_mnist_model(
    num_epochs, model, objective, optimizer, train_loader, device)

In [None]:
# Check the performance of the model
kai.check_accuracy(train_loader, model, device)
kai.check_accuracy(test_loader, model, device)

Amazing! You have just implemented your first CNN. Let us keep the momentum high and train the CNN on the modified MNIST dataset to see whether it is better at generalising beyond the training data.

### Task 10: Train the model on the modified MNIST dataset
Note that you have to slightly modify the neural network, because increasing the size of the input images will change the size of the feature maps, and the first fully connected layer of the classifier assumes an input size of 16 x 5 x 5 = 400. This is because the output of the final convolutional block (convolutional layer and pooling layer) consists of 16 feature maps of size 5x5. How does the size of the feature maps change when the input image has 4 times as many pixels (56x56 instead of 28x28)?
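The feature map sizes follow from the standard output-size formula for convolutional and pooling layers, out = floor((in + 2 * padding - kernel_size) / stride) + 1. The helper `output_size` below is a hypothetical convenience function, not part of PyTorch or `kai`. Using the layer settings of the network in this task, it reproduces the 5x5 feature maps for 28x28 inputs; rerunning it with a different starting size answers the question above.

```python
def output_size(size, kernel_size, stride=1, padding=0):
    """Side length of the output of a convolutional or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28  # Side length of an original MNIST image
size = output_size(size, kernel_size=5, stride=1, padding=2)  # conv 1 -> 28
size = output_size(size, kernel_size=2, stride=2)             # pool 1 -> 14
size = output_size(size, kernel_size=5, stride=1, padding=0)  # conv 2 -> 10
size = output_size(size, kernel_size=2, stride=2)             # pool 2 -> 5

num_features = 16 * size * size  # 16 feature channels
print(size, num_features)  # 5 400
```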

In [None]:
# TODO: Update the model, so it can process the larger images
class LeNet2(nn.Module):

    def __init__(self):
        super(LeNet2, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

        self.classifier = nn.Sequential(
            # TODO: Change the input size of the linear layer
            nn.Linear(400, 120),

            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, 10),
            # No Softmax here: nn.CrossEntropyLoss expects raw scores (logits)
            # and applies the log-softmax internally.
        )

    def forward(self,x):
        out = self.feature_extractor(x)
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out

In [None]:
# Test the implementation of the modified model
model = LeNet2()
s = summary(model, input_size=(1, 28*2, 28*2))

In [None]:
# Load modified MNIST dataset (kai.get_modified_mnist_data)
train_loader, test_loader1, test_loader2 = kai.get_modified_mnist_data()

In [None]:
# Train Network
learning_rate = 0.001
num_epochs = 5

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LeNet2().to(device)
objective = nn.CrossEntropyLoss()  # Name must match the argument passed to the training function below
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
kai.train_mnist_model(
    num_epochs, model, objective, optimizer, train_loader, device)

In [None]:
# Check the performance of the model
kai.check_accuracy(train_loader, model, device)
kai.check_accuracy(test_loader1, model, device)
kai.check_accuracy(test_loader2, model, device)

Hmm, that was a lot of work for no improvement on the second test dataset at all... But don't worry, we are almost there! Why do you think the LeNet network does not improve on the test dataset, and what can we do to fix it?

### Task 11: Discuss in your group why the LeNet network does not achieve translational invariance after all

*Tip:* Perhaps visualising the outputs from the feature extractor for some test images can help you see the problem. You can use the ``kai.plot_images`` function if you first convert the torch tensors to numpy arrays.

In [None]:
# TODO: Plot a few images from the test dataset

You can document your thoughts here:

### Task 12: Modify the final pooling layer in the LeNet neural network

The output of the pooling layer should be just one pixel for each feature channel.

In [None]:
class LeNet3(nn.Module):

    def __init__(self):
        super(LeNet3, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0),
            nn.ReLU(),
            # TODO: Change the pooling to return just one value per feature map
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

        self.classifier = nn.Sequential(
            # TODO: Also match the number of inputs here
            nn.Linear(12*12*16, 120),

            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, 10),
            # No Softmax here: nn.CrossEntropyLoss expects raw scores (logits)
            # and applies the log-softmax internally.
        )

    def forward(self,x):
        out = self.feature_extractor(x)
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out

In [None]:
# Test the implementation of the modified model
model = LeNet3()
s = summary(model, input_size=(1, 28*2, 28*2))

In [None]:
# Train Network
learning_rate = 0.005
num_epochs = 10

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LeNet3().to(device)
objective = nn.CrossEntropyLoss()  # Name must match the argument passed to the training function below
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
kai.train_mnist_model(
    num_epochs, model, objective, optimizer, train_loader, device)

In [None]:
# Check the performance of the model
kai.check_accuracy(train_loader, model, device)
kai.check_accuracy(test_loader1, model, device)
kai.check_accuracy(test_loader2, model, device)

Magic! With this simple tweak we managed to enable the neural network to learn generalisable concepts 🎉
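One way to see why this works: averaging over an entire feature map discards where in the map a response occurred. The NumPy sketch below uses a toy 6x6 feature map (an illustrative stand-in for a real feature channel) and shows that its global average is the same before and after the response is moved.

```python
import numpy as np

feature_map = np.zeros((6, 6))
feature_map[1:3, 1:3] = [[1.0, 2.0], [3.0, 4.0]]

# Move the response 3 pixels down and 2 pixels to the right
shifted_map = np.roll(feature_map, shift=(3, 2), axis=(0, 1))

# Global average pooling: one value per feature map
pooled = feature_map.mean()
shifted_pooled = shifted_map.mean()

print(np.array_equal(feature_map, shifted_map))  # False
print(np.isclose(pooled, shifted_pooled))        # True
```

The price of this invariance is that all spatial detail within a feature map is lost, which is one plausible reason for a lower training accuracy.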

Feel free to play around with the network to get its performance on the training dataset back up to >95%.

## Optional

TODO