# Practical session 4

Most datasets that you will work with are going to be quite challenging, and one of the biggest challenges is finding a model that performs well not only on the training data, but also on data that was not used for training. Especially as your data grows, there are two phenomena that you need to be aware of:

1. When your model has poor training performance and poor testing performance, we say that the model has **underfit** the training data.
2. When your model has good training performance and poor testing performance, we say that the model has **overfit** the training data.

The idea of over- and under-fitting has to do with _capacity_ (i.e. how large the search space is in which you expect to find your model). If your model has too much capacity (i.e. it can use very complex functions), then it is very likely to _overfit_ your training data. If your model has too little capacity (i.e. it can only use very simple functions), then it is likely to _underfit_ the data.
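To make the link between capacity and fit concrete, here is a minimal sketch that fits polynomials of increasing degree to a small noisy sample. The sine-shaped data, the noise level and the degrees are illustrative assumptions, not part of the exercises; the polynomial degree stands in for capacity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical toy problem: a noisy sine wave on [0, 1].
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)
    return x, y

x_tr, y_tr = make_data(20)    # small training set
x_te, y_te = make_data(200)   # held-out data, never used for fitting

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with these coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):      # higher degree = more capacity
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    train_mse[degree] = mse(coeffs, x_tr, y_tr)
    test_mse[degree] = mse(coeffs, x_te, y_te)
    print(f"degree={degree}: train MSE={train_mse[degree]:.4f}, "
          f"test MSE={test_mse[degree]:.4f}")
```

The training error can only go down as the degree grows, because each polynomial family contains the previous one. What matters is how the error on the held-out data behaves when you compare the three fits.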

You can control the capacity of most models by changing the number of parameters. However, this makes optimisation very inefficient, because you have to train the model repeatedly to find the right number of parameters for it. In machine learning, this challenge is addressed through **inductive biases** and/or **regularisation**.

Regularisation is "any technique that is used to specifically improve the performance in the test set, regardless of the performance in the training set". Common regularisation techniques (e.g. weight decay and dropout) discourage the model from choosing overly complex functions that are likely to result in overfitting. That is, you give the model high capacity, but you limit its ability to choose overly complex functions; the idea is that the model you land on will be "just right".
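As a rough illustration of how a penalty on large weights works, the sketch below fits a linear model with and without an L2 penalty, using the closed-form solution. The linear model, the synthetic data and the helper name `fit_l2` are assumptions made for this example; the exercises below apply the same kind of penalty to a neural network instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: only the first 3 of 10 features actually matter.
n_samples, n_features = 30, 10
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(0.0, 0.5, size=n_samples)

def fit_l2(X, y, reg_coefficient):
    """Minimise ||Xw - y||^2 + reg_coefficient * ||w||^2 in closed form."""
    identity = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + reg_coefficient * identity, X.T @ y)

w_plain = fit_l2(X, y, reg_coefficient=0.0)   # no penalty
w_reg = fit_l2(X, y, reg_coefficient=10.0)    # penalised weights

print("weight norm without penalty:", np.linalg.norm(w_plain))
print("weight norm with penalty:   ", np.linalg.norm(w_reg))
```

The penalised solution has a smaller weight norm: the model still has all ten weights available, but it pays a price for making any of them large.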

Inductive biases are specific relations that are built into the _mathematical definition_ of the ML model to reduce the search space. For example, Convolutional Neural Networks (CNNs) can be seen as a subset of MLPs in which we force each perceptron to process only a small subset of the outputs of the previous layer. When processing images, this enforces a bias of locality: pixels that are close to each other are processed together. In image processing, CNNs are often one of the most reliable ways to get good results.
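One way to see how much a locality bias shrinks the search space is to count trainable weights. The sketch below is plain arithmetic and assumes a 28x28 single-channel image, 32 dense units and 32 convolutional filters; these sizes and the helper names are illustrative choices.

```python
def dense_params(n_inputs, n_units):
    # Every unit connects to every input, plus one bias per unit.
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Every filter only has kernel_h * kernel_w * in_channels weights,
    # reused at every image position, plus one bias per filter.
    return kernel_h * kernel_w * in_channels * filters + filters

dense = dense_params(n_inputs=28 * 28, n_units=32)
conv = conv2d_params(kernel_h=3, kernel_w=3, in_channels=1, filters=32)

print("dense layer weights:", dense)  # 25120
print("conv layer weights: ", conv)   # 320
```

This only counts the weights of a single layer. A convolutional layer produces one activation per image position and filter, so any dense layer placed after it can still be large.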

## Practical exercise 1: hand-written digit recognition

In this exercise we are going to explore the idea of regularisation using the MNIST dataset. Each example in this dataset is a 28x28 grayscale image of a hand-written digit, and the target is the digit (0 to 9) that the image shows.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(path='mnist.npz')

# Pick nine random training images to preview.
random_gen = np.random.default_rng()
random_numbers = random_gen.integers(low=0, high=x_train.shape[0], size=9)

plt.figure(figsize=(5,5))
for idx, example in enumerate(random_numbers):
    plt.subplot(3, 3, idx+1)
    plt.imshow(x_train[example, :, :], cmap='gray')

# Scale pixel values from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# One-hot encode the digit labels (10 classes, digits 0 to 9).
num_classes = 10
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)


For the first exercise we are going to stick with the same model that we were using before, the MLP. We will now introduce two forms of regularisation:
1. Dropout: makes the network randomly drop (set to zero) a fraction of the perceptron outputs during training. A new random selection is made at every training step, and nothing is dropped at test time.
2. L2-regularisation: puts a penalty on high-value coefficients, helping the model prefer less complex models that generalise better.
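The dropout mechanism described above can be sketched in a few lines of NumPy. This is an illustrative version of so-called inverted dropout, where the surviving activations are scaled up so that their expected value stays the same; the function name and the toy activations are assumptions, and it is not meant as a replacement for `tf.keras.layers.Dropout`.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training):
    """Zero a random fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        # At test time the activations pass through unchanged.
        return activations
    # A fresh random mask is drawn on every call (i.e. every training step).
    keep_mask = rng.random(activations.shape) >= rate
    # Rescale the survivors so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - rate)

activations = np.ones((4, 1000))  # hypothetical layer output
dropped = dropout(activations, rate=0.1, training=True)
unchanged = dropout(activations, rate=0.1, training=False)

print("fraction dropped:", np.mean(dropped == 0.0))
print("mean activation after dropout:", dropped.mean())
```

Because the mask changes on every call, the network cannot rely on any single perceptron always being present, which is what makes dropout act as a regulariser.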

The objective is, again, to try to get the test accuracy as high as possible using the same parameters as before. Hint: in order to get good performance from the start, I suggest that you think of the size of the input and how many perceptrons you need in order to process all that information.

In [None]:
NUM_UNITS = 1
NUM_LAYERS = 1
NUM_EPOCHS = 5
DROPOUT = 0.1 # Usually less than 0.5
REG_COEFFICIENT = 0.001 # Usually 0.001 or less

inputs = tf.keras.Input(shape=x_train.shape[1:])
# The input shape is already set by tf.keras.Input, so Flatten needs no input_shape.
output = tf.keras.layers.Flatten()(inputs)
for _ in range(NUM_LAYERS):
    output = tf.keras.layers.Dense(
        units=NUM_UNITS,
        activation='relu',
        kernel_regularizer=tf.keras.regularizers.l2(REG_COEFFICIENT),
    )(output)
    output = tf.keras.layers.Dropout(DROPOUT)(output)
output = tf.keras.layers.Dense(
    units=y_test.shape[1],
    activation='softmax'
)(output)
model = tf.keras.Model(inputs, output)
model.compile(
    optimizer='adam',
    loss="categorical_crossentropy",
    metrics=['accuracy']
)

model.summary()
history = model.fit(x_train, y_train, epochs=NUM_EPOCHS)
test_metrics = model.evaluate(x_test, y_test)

plt.figure(figsize=(5, 5))
plt.plot(np.arange(NUM_EPOCHS), np.array(history.history['accuracy']) * 100)
plt.title(
    "Model accuracy v. epochs of training\n"
    f"Train Acc={round(history.history['accuracy'][-1]*100, 4)}%, Test Acc={round(test_metrics[1]*100, 4)}%"
)
plt.xlabel("Epoch")
plt.ylabel("Training accuracy (%)")


## Practical exercise 2: introducing convolutions as an inductive bias

We will now introduce a different architecture: the convolutional neural network (CNN). CNNs introduce an inductive bias of locality by reducing the number of connections available to the MLP. Here are some parameters for you to play with to achieve high performance:
1. Convolutional layers: like the MLP, the CNN builds complex models from simpler ones, so more layers give the CNN more capacity.
2. Kernel size: this is where the inductive bias comes in. This parameter sets how many pixels inform the activation of each perceptron: a 2x2 kernel processes 4 adjacent pixels at once, a 3x3 kernel processes 9 adjacent pixels, and so on.
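To see what "processing adjacent pixels together" means, here is a minimal NumPy sketch of a single convolutional filter sliding over an image. The tiny 5x5 image, the averaging kernel and the function name `conv2d_valid` are assumptions for illustration. Unlike the `padding='same'` setting used in the exercise, this sketch uses no padding, so the output is smaller than the input.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value only sees a kh x kw patch of neighbouring pixels.
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
result = conv2d_valid(image, kernel)

print(result.shape)  # (3, 3)
print(result)
```

The same nine kernel weights are reused at every position, which is why a convolutional layer needs far fewer weights than a dense layer looking at the whole image.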

The convolutional layers are still followed by an MLP at the end, which performs the classification.

How does the test accuracy compare to the previous one? What about the training times? What about model size?

In [None]:
NUM_CONV_LAYERS = 1
KERNEL_SIZE = (3, 3)
NUM_MLP_LAYERS = 1
NUM_MLP_UNITS = 128
NUM_EPOCHS = 5
DROPOUT = 0.1 # usually less than 0.5
REG_COEFFICIENT = 0.001 # Usually 0.001 or less

import numpy as np
import tensorflow as tf

x_train_conv = np.expand_dims(x_train, -1)
x_test_conv = np.expand_dims(x_test, -1)


inputs = tf.keras.Input(shape=x_train_conv.shape[1:])
output = inputs
for _ in range(NUM_CONV_LAYERS):
    output = tf.keras.layers.Conv2D(
        filters=32,
        kernel_size=KERNEL_SIZE,
        padding='same',
        activation='relu',
    )(output)

# Flatten the feature maps; the shape is inferred from the previous layer.
output = tf.keras.layers.Flatten()(output)
for _ in range(NUM_MLP_LAYERS):
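    output = tf.keras.layers.Dense(
        units=NUM_MLP_UNITS,
        activation='relu',
        # Apply the L2 penalty here so that REG_COEFFICIENT has an effect.
        kernel_regularizer=tf.keras.regularizers.l2(REG_COEFFICIENT),
    )(output)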
    output = tf.keras.layers.Dropout(DROPOUT)(output)

output = tf.keras.layers.Dense(
    units=y_test.shape[1],
    activation='softmax'
)(output)
model = tf.keras.Model(inputs, output)
model.compile(
    optimizer='adam',
    loss="categorical_crossentropy",
    metrics=['accuracy']
)

model.summary()
history = model.fit(x_train_conv, y_train, epochs=NUM_EPOCHS)
test_metrics = model.evaluate(x_test_conv, y_test)

plt.figure(figsize=(5, 5))
plt.plot(np.arange(NUM_EPOCHS), np.array(history.history['accuracy']) * 100)
plt.title(
    "Model accuracy v. epochs of training\n"
    f"Train Acc={round(history.history['accuracy'][-1]*100, 4)}%, Test Acc={round(test_metrics[1]*100, 4)}%"
)
plt.xlabel("Epoch")
plt.ylabel("Training accuracy (%)")
