# Task 2 – MLP, Learning rate, Overfitting, and Hyper-parameters.

### <span style="color:red">Deadline Tuesday, June 3, 2025 at 11:59 p.m.</span>

# IMPORTANT SUBMISSION INSTRUCTIONS

- When you're done, download the notebook and rename it to task02_[name].ipynb
- Only submit the ipynb file, no other file is required.
- The deadline is strict.
- Minimal requirement for passing: solving all code cells.

Implementation
- Do not change the cells marked "Do not change". Write your solutions only in the cells marked for them, and do not create additional cells.

# Introduction

In this task you will implement an MLP model for virtual sensing using the flood dataset.
The objectives are:
- Implementing an MLP model via TensorFlow Functional API.
- Getting more familiar with model fitting and overfitting.
- Implementing early stopping.
- Exploring hyperparameters and their influence.
- Selecting model architecture.

## Tutorials

Some Python libraries are required for this homework. If you would like to work through a tutorial first, feel free to do so:

*   [Scikit-learn Tutorials](https://scikit-learn.org/stable/tutorial/index.html)
*   [TensorFlow Tutorials](https://www.tensorflow.org/tutorials)
*   [Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)

## Imports

In [1]:
import numpy as np
import random
import pandas as pd
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers, losses, callbacks

import matplotlib.pyplot as plt
import seaborn as sns

SEED = 24
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

## System checks


In [2]:
gpus = tf.config.list_physical_devices('GPU')
cpus = tf.config.list_physical_devices('CPU')
print(gpus)
print(cpus)

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


Choose the device for computation: the CPU or one of your CUDA devices.

In [3]:
tf.config.set_visible_devices(gpus, 'GPU')

# Subtask 2.1

## Backpropagation – Chain Rule

### Model
Let us assume an MLP with one hidden layer containing one neuron with `sigmoid` function ($S(x) = 1 / (1 + e^{-x})$) as the activation function. The input and the output dimensions are equal to one. The activation function for the input and output layers is `linear`.
The mapping $f: x \to \tilde{y}$ from the input $x$ to output $\tilde{y}$ can be written as:
\begin{equation}
\tilde{y} = f(x) = w_2 S(w_1 x)
\end{equation}
where $w_1$ and $w_2$ are the weights of the model. Note that we do not have bias for this example. Let us consider mean-squared error as the loss function. The loss $\ell$ can be obtained as:
\begin{equation}
\ell = (y - \tilde{y})^2 = (y - w_2 S(w_1 x))^2
\end{equation}
where $y$ denotes the reference label. Let's initialize the weights as $w_1 = 0.1$ and $w_2 = 0.1$.

### Data

Let's assume we have a data set containing three samples as: $x = [1.0, 2.0, 3.0]^{T}$ and $y = [1.0, 4.0, 9.0]^{T}$.

#### TODO
 - Perform gradient descent by hand with a learning rate of 0.1. Train the model for 1 epoch with a batch size of 1.
 - Report the prediction $\tilde{y}$, loss, gradients, and the updated weights at every iteration.

<span style='color:red'>**Your answer:**</span>

...

#### TODO
- Define two functions that compute the gradients of the loss with respect to $w_1$ and $w_2$ using analytical differentiation and chain rule.
- Use the functions you defined and train the model again using gradient descent and the same learning rate.
- Report the prediction $\tilde{y}$, loss, gradients, and the updated weights at every iteration.
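
For reference, applying the chain rule to the loss defined in the Model section gives the expressions below. This is a sketch of the derivation; the learning-rate symbol $\eta$ is introduced here for convenience and does not appear in the original text. With $S'(z) = S(z)(1 - S(z))$:
\begin{equation}
\frac{\partial \ell}{\partial w_2} = -2 (y - \tilde{y}) \, S(w_1 x)
\end{equation}
\begin{equation}
\frac{\partial \ell}{\partial w_1} = -2 (y - \tilde{y}) \, w_2 \, S(w_1 x) \left(1 - S(w_1 x)\right) x
\end{equation}
Each gradient-descent step then updates $w_i \leftarrow w_i - \eta \, \partial \ell / \partial w_i$, here with $\eta = 0.1$.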

In [4]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dL_dw2(x, y, w1, w2):
    z = w1 * x
    s = sigmoid(z)
    y_pred = w2 * s
    dL = -2 * (y - y_pred) * s
    return dL

def dL_dw1(x, y, w1, w2):
    z = w1 * x
    s = sigmoid(z)
    y_pred = w2 * s
    d_sigmoid = s * (1 - s)
    dL = -2 * (y - y_pred) * w2 * d_sigmoid * x
    return dL

# default weights
w1 = 0.1
w2 = 0.1
lr = 0.1

# Data
X = [1.0, 2.0, 3.0]
Y = [1.0, 4.0, 9.0]

for i in range(len(X)):
    x = X[i]
    y = Y[i]

    z = w1 * x
    s = sigmoid(z)
    y_pred = w2 * s

    #loss
    loss = (y - y_pred)**2

    grad_w1 = dL_dw1(x, y, w1, w2)
    grad_w2 = dL_dw2(x, y, w1, w2)

    # update weights
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

    # show
    print(f"value {i+1}")
    print(f"x = {x}, y = {y}")
    print(f"Prediction (ŷ) = {y_pred:.4f}")
    print(f"Loss = {loss:.4f}")
    print(f"Gradient w.r.t w1 = {grad_w1:.4f}, w2 = {grad_w2:.4f}")
    print(f"Updated w1 = {w1:.4f}, w2 = {w2:.4f}")
    print("-" * 40)


value 1
x = 1.0, y = 1.0
Prediction (ŷ) = 0.0525
Loss = 0.8978
Gradient w.r.t w1 = -0.0473, w2 = -0.9948
Updated w1 = 0.1047, w2 = 0.1995
----------------------------------------
value 2
x = 2.0, y = 4.0
Prediction (ŷ) = 0.1101
Loss = 15.1309
Gradient w.r.t w1 = -0.7675, w2 = -4.2957
Updated w1 = 0.1815, w2 = 0.6291
----------------------------------------
value 3
x = 3.0, y = 9.0
Prediction (ŷ) = 0.3981
Loss = 73.9928
Gradient w.r.t w1 = -7.5437, w2 = -10.8873
Updated w1 = 0.9358, w2 = 1.7178
----------------------------------------
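
One way to sanity-check the analytic gradients reported above is a central finite-difference approximation. The sketch below is not part of the assignment; `loss_fn`, `numerical_grad`, and the step size `eps` are hypothetical names chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(x, y, w1, w2):
    # Squared error of the one-neuron model y_pred = w2 * S(w1 * x)
    return (y - w2 * sigmoid(w1 * x)) ** 2

def numerical_grad(x, y, w1, w2, eps=1e-6):
    # Central differences: (f(w + eps) - f(w - eps)) / (2 * eps)
    g1 = (loss_fn(x, y, w1 + eps, w2) - loss_fn(x, y, w1 - eps, w2)) / (2 * eps)
    g2 = (loss_fn(x, y, w1, w2 + eps) - loss_fn(x, y, w1, w2 - eps)) / (2 * eps)
    return g1, g2

# First sample and initial weights from the task
g1, g2 = numerical_grad(x=1.0, y=1.0, w1=0.1, w2=0.1)
print(f"numerical dL/dw1 = {g1:.4f}, dL/dw2 = {g2:.4f}")
```

For the first sample, the two numbers should agree with the gradients printed for `value 1` above to four decimal places.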


#### TODO
- Use automatic differentiation (AD) (`tf.GradientTape`) to compute the gradients.
- Train the model again using the same learning rate, epoch, and batch size.
- Report the prediction $\tilde{y}$, loss, gradients, and the updated weights at every iteration.

In [8]:
# train data
x_data = tf.constant([1.0, 2.0, 3.0])
y_data = tf.constant([1.0, 4.0, 9.0])

# weights (as tf.Variable)
w1 = tf.Variable(0.1)
w2 = tf.Variable(0.1)

learning_rate = 0.1

for i in range(len(x_data)):
    x = x_data[i]
    y = y_data[i]

    with tf.GradientTape() as tape:
        z = w1 * x
        s = sigmoid(z)
        y_pred = w2 * s
        loss = (y - y_pred) ** 2

    # calculate gradients
    gradients = tape.gradient(loss, [w1, w2])
    grad_w1, grad_w2 = gradients

    # guard against missing gradients
    if grad_w1 is None or grad_w2 is None:
        print(f"Gradients are None! Check: x={x}, y={y}, z={z}, s={s}, y_pred={y_pred}, loss={loss}")
        break

    # weights updating
    w1.assign_sub(learning_rate * grad_w1)
    w2.assign_sub(learning_rate * grad_w2)

    # show
    print(f"value {i+1}")
    print(f"x = {x.numpy():.1f}, y = {y.numpy():.1f}")
    print(f"Prediction (ỹ) = {y_pred.numpy():.4f}")
    print(f"Loss = {loss.numpy():.4f}")
    print(f"Gradient w.r.t w1 = {grad_w1.numpy():.4f}, w2 = {grad_w2.numpy():.4f}")
    print(f"Updated w1 = {w1.numpy():.4f}, w2 = {w2.numpy():.4f}")
    print("-" * 40)


value 1
x = 1.0, y = 1.0
Prediction (ỹ) = 0.0525
Loss = 0.8978
Gradient w.r.t w1 = -0.0473, w2 = -0.9948
Updated w1 = 0.1047, w2 = 0.1995
----------------------------------------
value 2
x = 2.0, y = 4.0
Prediction (ỹ) = 0.1101
Loss = 15.1309
Gradient w.r.t w1 = -0.7675, w2 = -4.2957
Updated w1 = 0.1815, w2 = 0.6291
----------------------------------------
value 3
x = 3.0, y = 9.0
Prediction (ỹ) = 0.3981
Loss = 73.9928
Gradient w.r.t w1 = -7.5437, w2 = -10.8873
Updated w1 = 0.9358, w2 = 1.7178
----------------------------------------


# Subtask 2.2

In this task, we consider the data from **Margarethenklippe** as the input and train an MLP that gives the information about **Sennhuette** as the output. For each of the sensors, we have the water level, **W**, and the mass flow rate, **Q**, every 15 minutes.

## Pre-processing

#### TODO
 - Load the flood dataset (`Flood_Data.csv`) using *pandas* library.
 - Drop the first four columns of the DataFrame and columns containing NaNs.
 - Split the data into inputs and outputs.
 - Shuffle the data and get the NumPy arrays.
 - Split the data into 70% for training, 15% for validation, and 15% for testing.
 - Standardize inputs and outputs.


In [9]:
# TODO load the flood dataset using pandas
df = pd.read_csv("/content/Flood_Data.csv")

df = df.iloc[:, 4:]

# Drop every column that contains at least one NaN, as the task requires
df = df.dropna(axis=1)

In [10]:
# TODO split the data into inputs X and outputs y
n_input = 6
X = df.iloc[:, :n_input].values
Y = df.iloc[:, n_input:].values

N_samples = X.shape[0] # number of samples

In [11]:
from sklearn.model_selection import train_test_split
# TODO shuffle the data and get the NumPy arrays:

indices = np.arange(len(X))
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]

# TODO split the data to train, validation, and test:
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.3, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=42)
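
The two-stage call above first holds out 30% of the samples and then halves that hold-out. The resulting proportions can be checked on toy data, as in this sketch; the arrays `X_toy` and `Y_toy` are hypothetical stand-ins for the flood arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (1000 samples); the real arrays come from the flood dataset
X_toy = np.arange(1000).reshape(-1, 1)
Y_toy = np.arange(1000).reshape(-1, 1)

# Hold out 30%, then split the hold-out evenly into validation and test
X_tr, X_tmp, Y_tr, Y_tmp = train_test_split(X_toy, Y_toy, test_size=0.3, random_state=42)
X_va, X_te, Y_va, Y_te = train_test_split(X_tmp, Y_tmp, test_size=0.5, random_state=42)

fractions = [len(part) / len(X_toy) for part in (X_tr, X_va, X_te)]
print(fractions)
```

With 1000 samples this should give 700 training, 150 validation, and 150 test samples.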

Let us define a class for standardizing the data. We will implement a function that applies the scaling (`apply`) and a function that maps the scaled data back into the original scale (`apply_reverse`).

In [12]:
# TODO create a class for standardizing the data.

class Standardizer:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, data):
        self.mean = np.mean(data, axis=0)
        self.std = np.std(data, axis=0)

    def apply(self, data):
        return (data - self.mean) / self.std

    def apply_reverse(self, scaled_data):
        return scaled_data * self.std + self.mean
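
A quick round-trip check can confirm that a standardizer of this kind behaves as intended. The sketch below repeats the class so that it runs on its own and uses made-up toy arrays (`train`, `new`). It also illustrates the usual practice of fitting on the training data only and reusing those statistics for other splits.

```python
import numpy as np

class Standardizer:
    """Same interface as the class above, repeated so the sketch is self-contained."""
    def fit(self, data):
        self.mean = np.mean(data, axis=0)
        self.std = np.std(data, axis=0)

    def apply(self, data):
        return (data - self.mean) / self.std

    def apply_reverse(self, scaled_data):
        return scaled_data * self.std + self.mean

# Toy data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
new = np.array([[4.0, 700.0]])

scaler = Standardizer()
scaler.fit(train)                       # statistics come from the training data only
train_scaled = scaler.apply(train)
new_scaled = scaler.apply(new)          # reuse the training statistics
restored = scaler.apply_reverse(train_scaled)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(np.allclose(restored, train))
```

Note that a column with zero variance would cause a division by zero in `apply`; whether that can occur depends on the dataset.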

In [None]:
# TODO standardize the data using the class you defined in the cell above.
input_scaler = Standardizer()
output_scaler = Standardizer()

# Fit
input_scaler.fit(X_train)
output_scaler.fit(Y_train)

# Apply
X_train_scaled = input_scaler.apply(X_train)
X_val_scaled = input_scaler.apply(X_val)
X_test_scaled = input_scaler.apply(X_test)

Y_train_scaled = output_scaler.apply(Y_train)
Y_val_scaled = output_scaler.apply(Y_val)
Y_test_scaled = output_scaler.apply(Y_test)

# Subtask 2.3

## Implementing an MLP using TensorFlow Functional API.

#### TODO

- Implement an MLP using TensorFlow Functional API.
- Print the model architecture using `model.summary()`.
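
As a rough illustration of the Functional API pattern, here is a minimal sketch. The function name `build_mlp`, the input size of 6, the output size of 2, and the layer sizes are all assumptions made for the example; they are not prescribed by the task.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mlp(n_inputs, n_outputs, n_hidden_layers, n_neurons, activation):
    # Each layer is called on the tensor produced by the previous one
    inputs = layers.Input(shape=(n_inputs,))
    x = inputs
    for _ in range(n_hidden_layers):
        x = layers.Dense(n_neurons, activation=activation)(x)
    # Linear output layer, as is usual for regression
    outputs = layers.Dense(n_outputs, activation="linear")(x)
    return models.Model(inputs, outputs)

sketch_model = build_mlp(n_inputs=6, n_outputs=2, n_hidden_layers=2,
                         n_neurons=32, activation="relu")
sketch_model.summary()
```

The summary lists one row per layer together with its output shape and parameter count, which is a convenient way to confirm that the loop created the intended number of hidden layers.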

In [None]:
# TODO implement the model

def MLP(N_hidden_layers, N_neurons, activation):
    input_layer = ...
    x = input_layer
    for i in range(N_hidden_layers):
        x = ...
    output_layer = ...

    model = models.Model(input_layer, output_layer)
    return model

####################
## YOUR CODE HERE ##
####################
model =

In [None]:
# TODO print the model architecture
####################
## YOUR CODE HERE ##
####################

# Subtask 2.4

## Optimizer, loss function and training

#### TODO

- Define the optimizer and the loss function using `tf.keras.optimizers` and `tf.keras.losses`, respectively.
    - Use stochastic gradient descent (SGD) with a learning rate of 1.0 as the optimizer.
    - Use mean-squared error as the loss function.
- Compile the model using the optimizer and the loss.
- Train the model and get the history of training and validation losses.
    - Train the model for 10 epochs.
    - Use a batch size of 512.
- Plot the learning curves.
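
The compile-and-fit workflow described above might look roughly like the following sketch. It runs on synthetic random arrays with assumed shapes (6 inputs, 2 outputs) and a small stand-in network named `toy_model`, so the loss values it produces say nothing about the flood data.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses, models, optimizers

tf.random.set_seed(0)
rng = np.random.default_rng(0)
# Synthetic stand-ins for the standardized arrays (shapes are assumptions)
X_tr = rng.normal(size=(2048, 6)).astype("float32")
Y_tr = rng.normal(size=(2048, 2)).astype("float32")
X_va = rng.normal(size=(512, 6)).astype("float32")
Y_va = rng.normal(size=(512, 2)).astype("float32")

inputs = layers.Input(shape=(6,))
hidden = layers.Dense(16, activation="relu")(inputs)
outputs = layers.Dense(2)(hidden)
toy_model = models.Model(inputs, outputs)

# SGD with learning rate 1.0 and mean-squared error, as specified in the task
optimizer = optimizers.SGD(learning_rate=1.0)
loss_fn = losses.MeanSquaredError()
toy_model.compile(optimizer=optimizer, loss=loss_fn)

history = toy_model.fit(X_tr, Y_tr, validation_data=(X_va, Y_va),
                        epochs=10, batch_size=512, verbose=0)
print(sorted(history.history.keys()))
```

The returned `history.history` dictionary holds one value per epoch under `'loss'` and `'val_loss'`, which is the structure `plot_learning_curves` expects.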

In [None]:
# TODO define the optimizer and the loss function
####################
## YOUR CODE HERE ##
####################

# TODO compile the model
####################
## YOUR CODE HERE ##
####################

In [None]:
# TODO train the model
####################
## YOUR CODE HERE ##
####################

In [None]:

def plot_learning_curves(hist):
    epochs = np.arange(0, len(hist.history['loss'])) + 1
    sns.set(style='ticks')
    fig, ax = plt.subplots(1, 1, figsize = (5, 4))
    ax.plot(epochs, hist.history['loss'], label = 'Training loss', marker = 'o', ls = '--')
    ax.plot(epochs, hist.history['val_loss'], label = 'Validation loss', marker = 'o', ls = '--')

    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.set_title('Learning curves')
    ax.legend()
    sns.despine(trim=True, offset=5)

# TODO Use the function above to plot the learning curves
####################
## YOUR CODE HERE ##
####################

**TODO Your answer here**

Does the model learn properly? Why?

**TODO Your answer here**

<span style='color:red'>**Your answer:**</span>

...


# Subtask 2.5

## Learning rate

#### TODO
- Decrease the learning rate logarithmically, i.e. by a factor of 10 at each step, until your model starts to train.
    - Train the model for 10 epochs.
    - Use a batch size of 512.
- Plot the learning curves of the training and validation losses as in Subtask 2.4. Use the function defined above.
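
To build some intuition for why the step size matters, the sketch below runs plain gradient descent on the toy loss $\ell(w) = w^2$. This is a deliberately simple stand-in, not the flood model, and the helper `run_gd` is a hypothetical name; the learning rates at which a real MLP starts to train will differ.

```python
def run_gd(lr, w0=1.0, steps=10):
    # Gradient descent on l(w) = w**2, whose gradient is 2 * w
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

for lr in (1.0, 0.1, 0.01, 0.001):
    w_final = run_gd(lr)
    print(f"lr = {lr:<6} final w = {w_final:+.4f}, loss = {w_final ** 2:.4f}")
```

On this toy problem a learning rate of 1.0 only flips the sign of `w` at every step, so the loss never decreases, while very small learning rates make progress but slowly.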


In [None]:
# learning rate = 0.1
####################
## YOUR CODE HERE ##
####################


In [None]:
# learning rate = 0.01
####################
## YOUR CODE HERE ##
####################

In [None]:
# learning rate = 0.001
####################
## YOUR CODE HERE ##
####################

In [None]:
# learning rate = 1e-4
####################
## YOUR CODE HERE ##
####################

## Model selection

Is this the lowest validation loss that one can achieve? Probably not. So here are your TODOs:

#### TODO:
- Train the model for 100 epochs and plot the learning curves. Use a learning rate of 0.01.

#### TODO from now on, for all subsequent tasks:
- Print the overall best validation loss and the epoch at which it occurred.
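
Finding the best epoch from a loss history can be done with `np.argmin`. The sketch below uses a made-up list `val_loss` in place of `hist.history['val_loss']`; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical validation losses, standing in for hist.history['val_loss']
val_loss = [0.90, 0.55, 0.42, 0.38, 0.40, 0.37, 0.41, 0.45]

best_idx = int(np.argmin(val_loss))   # zero-based index of the minimum
best_epoch = best_idx + 1             # epochs are counted from 1
best_val_loss = val_loss[best_idx]
print(f"Best validation loss: {best_val_loss:.4f} at epoch {best_epoch}")
```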


In [None]:
####################
## YOUR CODE HERE ##
####################

**TODO Your answer here**

Answer the following questions in written form:

1. Does the training loss decrease after each epoch? Why does it? // Why does it not?
1. Does the validation loss decrease after each epoch? Why does it? // Why does it not? (For your answer to be sufficient, you should describe fluctuations and discuss the overall minimum of the curve.)
1. At which epoch was your model best? I.e. if you had saved your model after each training epoch, which one would you use to make predictions on unseen samples (e.g. from the test set)? Why? (For your answer to be sufficient: Also discuss what this means in terms of overfitting.)

**TODO Your answer here**


<span style='color:red'>**Your answer:**</span>

1. ...

# Subtask 2.6
## Save and restore model checkpoints

Training that model for 100 epochs took quite a bit of time, right? Wouldn't it be a pity if it were deleted from memory, e.g. because your Colab session terminates (this can even happen automatically)? We would have to train it again to make predictions! To prevent this, we save a checkpoint of the optimized model's weights to disk. Then we can load the weights at any time and use the model again without retraining. As you will see in a bit, this is very handy for early stopping, too!

#### TODO
- Save a checkpoint of the `model` trained above (i.e. the model's parameters) to disk.
- Initialize a new model, `model2` with the same architecture as used for the `model` you stored. Do *not* train `model2`.
- Evaluate `model2` on the validation dataset.
- Now, overwrite the initialized, untrained weights of `model2` with the weights you saved into the checkpoint of `model`.
- Evaluate `model2`'s validation loss again. It should be exactly the same as `model`'s validation loss.


*Hints:*
 - Read https://www.tensorflow.org/tutorials/keras/save_and_load
 - Use `model.save_weights`

In [None]:
####################
## YOUR CODE HERE ##
####################


# Subtask 2.7
## Early stopping
So the model you ended up with after 100 epochs was not the best one. This has two implications: (1) We did not need to train for that many epochs and could have saved some computing time. (2) The model we ended up with is not the best one for making predictions on unseen samples. If we continuously monitored the model's validation performance during training, we could stop the optimization as soon as that performance no longer improves. This is called *early stopping*.

### TODO
- Implement early stopping using `tf.keras.callbacks.EarlyStopping`. Use patience = 5, which means that training should stop if the validation loss has not improved for 5 consecutive epochs.
- Save the model checkpoint after each epoch if the validation loss is improved. Use `tf.keras.callbacks.ModelCheckpoint`.
- Train the model for 100 epochs with a learning rate of 0.01.
- Load the model checkpoint.
- Evaluate the model on the validation data.
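
A minimal sketch of how the two callbacks can be combined is shown below. It trains a small stand-in network (`toy_model`) on synthetic linear data; the checkpoint file name, the temporary directory, and all array shapes are assumptions for the example.

```python
import os
import tempfile
import numpy as np
import tensorflow as tf
from tensorflow.keras import callbacks, layers, models, optimizers

tf.random.set_seed(0)
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 6)).astype("float32")
Y = (X @ rng.normal(size=(6, 2))).astype("float32")   # toy linear targets
X_tr, Y_tr, X_va, Y_va = X[:768], Y[:768], X[768:], Y[768:]

inputs = layers.Input(shape=(6,))
x = layers.Dense(16, activation="relu")(inputs)
outputs = layers.Dense(2)(x)
toy_model = models.Model(inputs, outputs)
toy_model.compile(optimizer=optimizers.SGD(learning_rate=0.01), loss="mse")

ckpt_path = os.path.join(tempfile.mkdtemp(), "best.weights.h5")
# Stop once val_loss has not improved for 5 consecutive epochs
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5)
# Write the weights to disk only when val_loss improves
checkpoint = callbacks.ModelCheckpoint(filepath=ckpt_path, monitor="val_loss",
                                       save_best_only=True, save_weights_only=True)

hist = toy_model.fit(X_tr, Y_tr, validation_data=(X_va, Y_va),
                     epochs=100, batch_size=512,
                     callbacks=[early_stop, checkpoint], verbose=0)

toy_model.load_weights(ckpt_path)            # restore the best epoch's weights
best_val = toy_model.evaluate(X_va, Y_va, verbose=0)
print(len(hist.history["val_loss"]), best_val, min(hist.history["val_loss"]))
```

Loading the checkpoint matters because, when training stops, the model in memory holds the weights of the last epoch rather than those of the best one.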

In [None]:
####################
## YOUR CODE HERE ##
####################


In [None]:
## TODO load the model checkpoint and evaluate on validation data
####################
## YOUR CODE HERE ##
####################

#### TODO
- Compare the training you just did with the one of the same model trained for 100 epochs. Did you reach best model performance? If so: why? If not: why not?
- What is the purpose of `patience`, and why do we need that?
- Do the same training as in the previous cell, starting from scratch, but try different values for `patience` now. Did you end up with a model that reaches the lowest validation loss you have seen so far, but without training for the full 100 epochs?

In [None]:
####################
## YOUR CODE HERE ##
####################

## Explore batch size *(optional)*

*This task is optional, you do not need to solve it*

Let us explore even more model and training parameters. In this section, we will see the impact of batch size on training. Let us use a learning rate of $10^{-3}$ from now on.


#### TODO *(optional)*
- Run training of the same model used above with
    - batch size 1 for **one epoch**
    - batch size 1024 for 100 epochs, using early stopping with patience 10
- Compare your training results of all three batch sizes you have trained, i.e. batch size 1, 512 (from above) and 1024
- Was it smart to set batch size to 1?
- How long (in terms of computing time) do your models need to train for the different batch sizes? (You could even measure this with python, using the `time` package)
- What is the impact on model performance?
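
The batch size determines how many weight updates happen per epoch, which is a large part of the difference in computing time. The sketch below counts updates for a hypothetical training-set size `n_train` and shows one way to time a call with `time.perf_counter`; the summation is only a placeholder for the actual training call.

```python
import math
import time

n_train = 10_000  # hypothetical number of training samples

# One weight update per batch, so updates per epoch = ceil(N / batch_size)
steps = {bs: math.ceil(n_train / bs) for bs in (1, 512, 1024)}
for batch_size, n_steps in steps.items():
    print(f"batch size {batch_size:>4}: {n_steps} weight updates per epoch")

# Timing pattern: read the clock before and after the call of interest
start = time.perf_counter()
_ = sum(i * i for i in range(100_000))   # placeholder for model.fit(...)
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.3f} s")
```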

In [None]:
# Batch size 1

###############################
## YOUR CODE HERE - OPTIONAL ##
###############################

In [None]:
# Batch size 1024

###############################
## YOUR CODE HERE - OPTIONAL ##
###############################

# Subtask 2.8
## What about the architecture?

How does architecture affect predictive performance?

#### TODO:
In the following, try to improve model performance by varying
- number of hidden layers
- number of neurons per hidden layer
- activation function

These parameters are called hyper-parameters, since they are excluded from model optimization. Instead, we have to set them by hand and explore them to find a model with good predictive accuracy.

Vary only one hyper-parameter at a time. If you varied several parameters at once, it would be harder to see the impact of each one.
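
A one-at-a-time exploration can be organized by starting from a fixed baseline and changing a single entry per run. The baseline and candidate values in this sketch are hypothetical and should be adapted to your own experiments.

```python
# Hypothetical baseline and candidate values
baseline = {"n_hidden_layers": 2, "n_neurons": 32, "activation": "relu"}
candidates = {
    "n_hidden_layers": [1, 4],
    "n_neurons": [16, 100],
    "activation": ["tanh", "sigmoid"],
}

configs = [dict(baseline)]
for name, values in candidates.items():
    for value in values:
        config = dict(baseline)   # start from the baseline every time
        config[name] = value      # change exactly one hyper-parameter
        configs.append(config)

for config in configs:
    print(config)
```

Each configuration differs from the baseline in at most one entry, so any change in validation loss can be attributed to that entry.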

In [None]:
# number of hidden layers

####################
## YOUR CODE HERE ##
####################

# 4 hidden layers

In [None]:
# number of neurons per each hidden layer

####################
## YOUR CODE HERE ##
####################

# 100 neurons

In [None]:
# activation function

####################
## YOUR CODE HERE ##
####################

# Use tanh instead of ReLU

**TODO Your answer here**

1. How good do you get?
2. Which hyper-parameter makes the largest difference?
3. Does it always help to make your model bigger (i.e. wider / deeper)? Why not?

**TODO Your answer here**


**Your answers:**
1. ...

# Subtask 2.9
## Challenge
#### TODO
- If you choose your best values for the number of hidden units, the number of layers, and the activation function that you determined by varying them independently above: Does performance improve? Why?
- Vary all of the parameters at the same time to maximize the predictive performance of your model. How good do you get?

In [None]:
# Your best model:

####################
## YOUR CODE HERE ##
####################

# Subtask 2.10
## Evaluate your best model on test set, once!
When doing a study, at the very end right before writing up your paper, you evaluate the best model you chose on the test set. This is the performance value you will report to the public.

#### TODO
- Evaluate the model on the testing dataset.
- Plot the reference mass flow rate vs. the predicted values for the first 50 samples of the testing dataset.
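
The comparison plot might be laid out as in the following sketch. It uses synthetic arrays `q_ref` and `q_pred` as stand-ins for the reference and predicted mass flow rates in original units; with the real model, the predictions would first be mapped back with `apply_reverse`.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-ins: a reference signal and a noisy "prediction"
q_ref = 5.0 + np.sin(np.linspace(0, 6, 200))
q_pred = q_ref + rng.normal(scale=0.1, size=q_ref.shape)

n_show = 50  # only the first 50 samples are plotted
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(q_ref[:n_show], label="Reference", marker="o", ls="--")
ax.plot(q_pred[:n_show], label="Prediction", marker="x", ls="--")
ax.set_xlabel("Sample index")
ax.set_ylabel("Mass flow rate Q")
ax.set_title("Reference vs. prediction (first 50 samples)")
ax.legend()

toy_mse = float(np.mean((q_ref - q_pred) ** 2))
print(f"MSE on the toy data: {toy_mse:.4f}")
```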

In [None]:
# TODO evaluate the model
####################
## YOUR CODE HERE ##
####################


In [None]:
# TODO plot results
####################
## YOUR CODE HERE ##
####################

y_pred = ...


**TODO Your answer here**

1. Is the test loss of your model as good as the validation loss?
2. If those values are different: How can you explain the difference?
3. Why should you never use test set performance when trying out different hyper-parameters and architectures?

**TODO Your answer here**



<span style='color:red'>**Your answer:**</span>

1. ...