# Practical session 4

Most datasets that you will work with are going to be quite challenging, and one of the biggest challenges will be to find a model that performs well not only on the training data, but also on data that has not been used for training. Especially as your data grows, there are two phenomena that you need to be aware of:

1. When your model has poor training performance and poor testing performance, we say that the model has **underfit** the training data.
2. When your model has good training performance and poor testing performance, we say that the model has **overfit** the training data.

The idea of over- and under-fitting has to do with _capacity_ (i.e. how large the search space is in which you expect to find your model). If your model has too much capacity (i.e. it can use very complex functions), then it is very likely to _overfit_ your training data. If your model has too little capacity (i.e. it can only use very simple functions), then it is likely to _underfit_ the data.
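As a small illustration of capacity, the sketch below fits polynomials of increasing degree to a few noisy samples of a sine curve. The toy data, the chosen degrees and the helper names `fit_and_score` and `noisy_sine` are assumptions made for this example and are not part of the exercise. The degree-0 model has too little capacity and scores poorly on both sets, while the training error can only go down as the degree grows; whether the test error follows depends on the data.

```python
import numpy as np

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

def noisy_sine(x, rng):
    """Toy target (an assumption for this sketch): a sine curve plus noise."""
    return np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=15)   # small training set
x_test = rng.uniform(-1, 1, size=200)   # larger held-out set
y_train = noisy_sine(x_train, rng)
y_test = noisy_sine(x_test, rng)

# Higher degree = more capacity (a larger family of functions to search).
results = {
    degree: fit_and_score(degree, x_train, y_train, x_test, y_test)
    for degree in (0, 3, 9)
}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

A large gap between the training and test error at high degree is the signature of overfitting; high error on both at low degree is the signature of underfitting.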

You can control the capacity of most models by playing with the number of parameters. However, this can make optimisation very inefficient, because you have to retrain your model repeatedly to find the right number of parameters for it. In machine learning, this challenge is solved through **inductive biases** and/or **regularisation**.
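To get a feel for how quickly the number of parameters changes with the shape of a network, here is a minimal sketch that counts the weights and biases of a fully connected network. The function name `count_mlp_parameters` and the example sizes (11 inputs, 10 outputs) are assumptions for illustration only, and the count ignores any extra layers such as batch normalisation.

```python
def count_mlp_parameters(num_inputs, width, depth, num_outputs):
    """Count weights and biases of an MLP with `depth` hidden layers of `width` units."""
    sizes = [num_inputs] + [width] * depth + [num_outputs]
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum(
        (fan_in + 1) * fan_out
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    )

# Hypothetical sizes, chosen only to show how the count scales.
for width in (5, 50, 500):
    total = count_mlp_parameters(num_inputs=11, width=width, depth=5, num_outputs=10)
    print(f"width={width}: {total} parameters")
```

Because the hidden-to-hidden layers contribute roughly `width * width` weights each, widening a network increases its parameter count much faster than deepening it.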

Regularisation is "any technique that is used to specifically improve the performance in the test set, regardless of the performance in the training set". Common regularisation techniques (e.g. weight decay and dropout) discourage the model from settling on overly complex functions that are likely to result in overfitting. That is, you give the model high capacity, but you limit its ability to choose overly complex functions; the idea is that the model you land on will be "just right".
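The effect of an L2 penalty (the idea behind weight decay) is easiest to see on a linear model, where the penalised problem has a closed-form solution. The sketch below is a stand-in under that assumption, not the Keras regulariser used in the exercise; `ridge_fit` and the toy data are made up for this example. As the regularisation coefficient grows, the learned weights shrink towards zero.

```python
import numpy as np

def ridge_fit(X, y, reg_coefficient):
    """Minimise ||Xw - y||^2 + reg_coefficient * ||w||^2 in closed form."""
    num_features = X.shape[1]
    gram = X.T @ X + reg_coefficient * np.eye(num_features)
    return np.linalg.solve(gram, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                      # 20 examples, 10 features
y = X[:, 0] + rng.normal(scale=0.1, size=20)       # only feature 0 matters

weight_norms = {}
for reg_coefficient in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, reg_coefficient)
    weight_norms[reg_coefficient] = float(np.linalg.norm(w))
    print(f"reg={reg_coefficient}: ||w|| = {weight_norms[reg_coefficient]:.3f}")
```

The model keeps all ten weights available (high capacity), but the penalty makes large weights expensive, which restricts the functions it ends up choosing.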

Inductive biases are specific relations that are built into the _mathematical definition_ of an ML model to reduce the search space. For example, Convolutional Neural Networks (CNNs) can be seen as a subset of MLPs in which we force each perceptron to process only a small subset of the outputs of the previous layer. When processing images, this enforces a bias of locality: pixels that are close together are processed together. CNNs are often the only practical way to get good results in image processing.
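The claim that a convolution is a constrained dense layer can be checked directly in one dimension. The sketch below builds the dense weight matrix that corresponds to a small sliding window and compares it with NumPy's own sliding-window operation. The helper name `conv1d_as_dense_matrix` and the example signal and kernel are assumptions for this illustration. Note that, besides locality, the rows also reuse the same kernel values, which is the weight sharing that convolutional layers apply.

```python
import numpy as np

def conv1d_as_dense_matrix(kernel, input_length):
    """Dense weight matrix equivalent to a 'valid' 1D sliding-window filter."""
    kernel = np.asarray(kernel, dtype=float)
    output_length = input_length - kernel.size + 1
    weights = np.zeros((output_length, input_length))
    for row in range(output_length):
        # Each output unit only sees a local window; everything else stays zero.
        weights[row, row:row + kernel.size] = kernel
    return weights

signal = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])
kernel = np.array([1.0, 0.0, -1.0])

dense_weights = conv1d_as_dense_matrix(kernel, signal.size)
dense_output = dense_weights @ signal
# np.correlate slides the kernel without flipping it, as deep learning "convolutions" do.
conv_output = np.correlate(signal, kernel, mode="valid")

print(dense_weights)
print(dense_output, conv_output)
```

An unconstrained dense layer of the same shape would have 24 free weights here; the convolution has only the 3 kernel values, which is a much smaller search space.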

## Practical exercise 1: let's drink some wine

In this exercise we are going to explore the idea of regularisation. For this exercise we are going to use a [wine quality dataset](https://www.kaggle.com/datasets/rajyellow46/wine-quality). The features of this dataset are measurable qualities of a wine (acidity, sugar, ...) and the target is the quality of the wine.

The question is, can you predict if a wine is going to be good based on the input features? Give it a try!

Observe that the data is re-split every time that you run the training code; this is to verify that you didn't just get a lucky split. The objective here is to get similar performance three times in a row, and your submission will be the average over the three re-trainings. Regularisation will help make your training more stable and less dependent on the input examples.
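The re-split-and-average procedure can be sketched as follows. This is only an illustration of the bookkeeping: `average_over_resplits` and the toy `majority_class_score` scorer are hypothetical helpers, and the random data stands in for the wine features. In the exercise itself you simply re-run the training cell three times and average the reported test accuracy.

```python
import numpy as np

def average_over_resplits(train_and_score, X, y, test_size=0.2, num_runs=3, seed=0):
    """Re-split the data `num_runs` times; return (mean, std) of the test scores."""
    rng = np.random.default_rng(seed)
    num_test = int(round(test_size * len(X)))
    scores = []
    for _ in range(num_runs):
        order = rng.permutation(len(X))            # a fresh random split each run
        test_idx, train_idx = order[:num_test], order[num_test:]
        scores.append(
            train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
        )
    return float(np.mean(scores)), float(np.std(scores))

def majority_class_score(X_train, y_train, X_test, y_test):
    """Toy stand-in for a model: always predict the most common training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    return float(np.mean(y_test == labels[np.argmax(counts)]))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

mean_acc, std_acc = average_over_resplits(majority_class_score, X, y)
print(f"mean accuracy={mean_acc:.3f}, std={std_acc:.3f}")
```

A small standard deviation across the runs is what "similar performance three times in a row" looks like in numbers.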

In [None]:
# -- Input your parameters here -- #
NUM_EPOCHS = 50
DEPTH = 5
WIDTH = 5
ACTIVATION = "relu"
REG_COEFFICIENT = 0.001
DROPOUT = 0.3

## -- DO NOT TOUCH -- #
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

def get_wine_data(filepath, test_size):
    """Get and prepare the wine data for ML"""
    dataset = pd.read_csv(filepath)
    num_classes = dataset.quality.max() + 1
    dataset = dataset.to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        dataset[:,:-2], dataset[:,-2],
        test_size=test_size,
    )

    return (
        X_train,
        X_test,
        tf.keras.utils.to_categorical(y_train, num_classes=num_classes),
        tf.keras.utils.to_categorical(y_test, num_classes=num_classes)
    )

X_train, X_test, y_train, y_test = get_wine_data('wine_data.csv', test_size=0.2)

# `shape` is passed as a tuple; newer Keras versions reject a bare integer.
input_ = tf.keras.Input(shape=(X_train.shape[1],))
output = input_
for _ in range(DEPTH):
    output = tf.keras.layers.Dense(
        units=WIDTH,
        activation=ACTIVATION,
        kernel_regularizer=tf.keras.regularizers.l2(REG_COEFFICIENT),
    )(output)
    output = tf.keras.layers.BatchNormalization()(output)
    output = tf.keras.layers.Dropout(DROPOUT)(output)
output = tf.keras.layers.Dense(
    units=y_test.shape[1],
    activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(REG_COEFFICIENT),
)(output)
model = tf.keras.Model(input_, output)
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=['accuracy'],
)
model.summary()  # prints the summary itself; wrapping it in print() also prints "None"
history = model.fit(
    x=X_train,
    y=y_train,
    epochs=NUM_EPOCHS,
    verbose=2,
)
test_metrics = model.evaluate(
    x=X_test,
    y=y_test,
)

plt.figure(figsize=(10,10))
# Plot the training accuracy (in percent) so that the curve matches the title and axis label.
plt.plot(np.arange(NUM_EPOCHS), np.array(history.history['accuracy']) * 100)
plt.title(
    "Model accuracy v. epochs of training\n"
    f"Train Acc={history.history['accuracy'][-1]}, Test Acc={test_metrics[1]}"
)
plt.xlabel("Epoch")
plt.ylabel("Accuracy (%)")
## -- END OF DO NOT TOUCH -- #