<div style="background-color: cyan">
This is one of two notebooks about neural networks.
One is about neural networks for regression, and the other one about classification. The regression notebook is a bit simpler, and its focus is on the actual neural network and variations of the architecture. The classification notebook is a bit more complex, as you will have to deal with imbalanced data. It also provides more experiments with regularization of neural networks, e.g. using dropout layers.

*You don't need to work on both notebooks. Choose the one that best fits your interests.*
</div>

# Neural Network extension of linear regression

In this task, we revisit the ***diamond*** dataset from the previous regression exercise. Earlier, we achieved an $R^2$-score of 0.85 using linear regression to predict the diamond `price`. Now, let's see if we can improve this performance further using a neural network!

We will build and train a neural network using the `Keras` library to perform regression on the `price` of the diamonds. This approach will allow us to explore the potential of deep learning for regression tasks and compare its effectiveness to that of traditional regression models.

### Load Required Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import tensorflow as tf

from tensorflow.keras.optimizers import Adam

### Random Seeds
We use random numbers in several places. The cell below sets the seeds of the random number generators so that the results are reproducible:

In [None]:
tf.random.set_seed(1)
np.random.seed(1)
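
To illustrate what seeding does, here is a small sketch that uses only NumPy. The helper `draw_numbers` is a hypothetical name introduced for this illustration and is not part of the notebook. Re-seeding with the same value should reproduce the same sequence of random numbers, while a different seed gives a different sequence.

```python
import numpy as np

def draw_numbers(seed, n=5):
    """Seed NumPy's global generator and draw n uniform random numbers (hypothetical helper)."""
    np.random.seed(seed)
    return np.random.rand(n)

first = draw_numbers(1)
second = draw_numbers(1)   # same seed, so the same numbers are drawn again
other = draw_numbers(2)    # different seed, so a different sequence is drawn

print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

The same idea applies to `tf.random.set_seed`, although some operations may remain non-deterministic depending on the hardware and TensorFlow configuration.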

### Data Preparation
We use the data provided in the file `diamonds.csv`. First we load the file:

In [None]:
# Load the dataset
data = pd.read_csv('diamonds.csv', index_col=0)

In [None]:
data.head()

We will use the `price` as the target value:

In [None]:
# Define y as the 'price' column
y_all = data['price']

We will do the prediction based on all other values that are numeric:

In [None]:
# Define X by dropping the 'price' column
X_all = data.drop('price', axis=1)

# Keep only numerical columns in X
X_all = X_all.select_dtypes(include=['number'])

**EXERCISE**

Write code to split the dataset and scale the features.

1. Split the whole dataset into train/val/test sets. The test set is 10% of the whole dataset. The validation set is 20% of the remaining data (i.e., of the joint training and validation data).
2. Standardize the datasets: Use the `StandardScaler` class from the `sklearn.preprocessing` module to standardize features.

**Hint**:
1. Use `random_state=1` to ensure the split is reproducible.
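
The arithmetic behind the split sizes and the standardization formula can be sketched in plain Python. This is not the exercise solution: the dataset size and the toy feature values below are made up for illustration, and no `sklearn` calls are used. The key point is that the mean and standard deviation come from the training data only and are then reused for the validation and test data.

```python
from statistics import fmean, pstdev

# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
n_total = 1000
n_test = round(0.10 * n_total)        # 10% of the whole dataset
n_train_val = n_total - n_test        # the remaining 90%
n_val = round(0.20 * n_train_val)     # 20% of the remaining data
n_train = n_train_val - n_val
print(n_train, n_val, n_test)

# Standardization: z = (x - mean) / std, with mean and std taken from the training data.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy training values (made up)
val_feature = [2.0, 6.0]                    # toy validation values (made up)

mu = fmean(train_feature)
sigma = pstdev(train_feature)   # population standard deviation (divides by n)

train_scaled = [(x - mu) / sigma for x in train_feature]
val_scaled = [(x - mu) / sigma for x in val_feature]   # reuses the training statistics
print(train_scaled)
print(val_scaled)
```

With these fractions, the training, validation and test sets end up with 72%, 18% and 10% of the data. The scaled validation values need not have mean 0 and standard deviation 1, because they are transformed with the training statistics.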

In [None]:
# 1. Split into train/val/test sets

# X_train_val, X_test, y_train_val, y_test = ...
# X_train, X_val, y_train, y_val = ...

In [None]:
# 2. Scale the features

# scaler = ...
# X_train_scaled = ...
# X_val_scaled = ...
# X_test_scaled = ...

## Deep Learning Pipeline
Now we are ready to evaluate deep learning models for the price prediction on the `diamonds` dataset. We first define three auxiliary functions that take over functionality we will use repeatedly to compare different models. You do not need to understand these functions in detail.

### Auxiliary Functions

In [None]:
def apply_eval_model(model, X, y_true, model_name, do_print=True):
    """
    Function to evaluate a given model on a feature data frame and compute several performance metrics.
    The results are returned as a data frame and optionally printed out.

    Arguments:
    - model: the model to be evaluated
    - X: a data frame containing the predictors
    - y_true: the true target values, used for the performance assessment
    - model_name: name of the model (will be added to the result data frame)
    - do_print: optional argument to indicate whether the results should be printed
    """
    y_pred = model.predict(X)
    if do_print:
        print('r2-Score: ' + str(r2_score(y_true, y_pred)))
        print('MSE: ' + str(mean_squared_error(y_true, y_pred)))
        print('RMSE: ' + str(root_mean_squared_error(y_true, y_pred)))

    if len(model_name)>0:
        df = pd.DataFrame({'model_name': model_name,
                           'r2_score': r2_score(y_true, y_pred),
                           'MSE': mean_squared_error(y_true, y_pred),
                           'MAE': mean_absolute_error(y_true, y_pred),
                           'RMS': root_mean_squared_error(y_true, y_pred)},
                           index=[model_name])
        return df
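
To make the reported metrics more concrete, the following sketch recomputes them by hand in plain Python on a few made-up values. The function `regression_metrics` and the toy numbers are hypothetical and only meant as an illustration; the notebook itself relies on the `sklearn.metrics` functions imported above.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute r2_score, MSE, MAE and RMS (root mean squared error) by hand."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    return {'r2_score': 1 - ss_res / ss_tot, 'MSE': mse, 'MAE': mae, 'RMS': sqrt(mse)}

toy_true = [100.0, 200.0, 300.0, 400.0]   # made-up prices
toy_pred = [110.0, 190.0, 310.0, 390.0]   # made-up predictions
metrics_by_hand = regression_metrics(toy_true, toy_pred)
print(metrics_by_hand)
```

Note that MSE is in squared price units, whereas MAE and RMS are in the same units as `price`, which makes them easier to interpret.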

In [None]:
def train_apply_eval_model(model, X_train, y_train, X_val, y_val, model_name='model', num_epochs=20, batch_size=16, do_print=True):
    """
    Train a given model on a training data set, and evaluate it on both the training and validation data.

    Arguments:
    - model: the model to be evaluated
    - X_train: the training predictors
    - y_train: the true labels of the training data set
    - X_val: the predictors of the validation data set
    - y_val: the true labels of the validation data set
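    - model_name: name of the model (will be added to the result data frame)
    - num_epochs: number of training epochs
    - batch_size: number of samples per gradient update
    - do_print: optional argument to indicate whether the results should be printed

    Returns the Keras training history and a data frame with the evaluation results.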
    """

    n_val = X_val.shape[0]

    # Fit the model
    history = model.fit(
        X_train,
        y_train,
        validation_data=(X_val, y_val),
        epochs=num_epochs,
        batch_size=batch_size,
        verbose=1
    )

    if do_print:
        print('Evaluation on Training Data:')
    res_train = apply_eval_model(model, X_train, y_train, model_name, do_print=do_print)
    res_train['dataset'] = 'train'

    if n_val>0:
        if do_print:
            print('\nEvaluation on Validation Data:')
        res_val = apply_eval_model(model, X_val, y_val, model_name, do_print=do_print)
        res_val['dataset'] = 'validation'

        return history, pd.concat([res_train, res_val])
    else:
        return history, res_train

In [None]:
def plot_history(history):
  """
  Plot model training history.
  Args:
  - history: tensorflow history object.

  Returns:
  None
  """
  # Plot MSE and r2_score during training
  f, axes = plt.subplots(ncols=2, figsize=(15, 6))

  sns.lineplot(x=history.epoch, y=history.history['mean_squared_error'], ax=axes[0], label='Train mse')
  sns.lineplot(x=history.epoch, y=history.history['val_mean_squared_error'], ax=axes[0], label='Val mse')
  axes[0].set_title('Loss history')
  axes[0].set(yscale='log') # Use a log scale on y-axis to show the wide range of values.
  axes[0].set(xlabel='Epoch', ylabel='Loss')

  sns.lineplot(x=history.epoch, y=history.history['r2_score'], ax=axes[1], label='Train r2_score')
  sns.lineplot(x=history.epoch, y=history.history['val_r2_score'], ax=axes[1], label='Val r2_score')
  axes[1].set_title('r2_score history')
  axes[1].set(xlabel='Epoch', ylabel='r2_score')

  plt.show()

### Define Model Architecture

We are now ready to define the model architecture. We start with the model itself:

In [None]:
# Define deep learning model
model_1 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

Next we define the optimizer - these are some standard settings we will not consider in detail.

In [None]:
learning_rate = 0.001
optimizer = Adam(learning_rate=learning_rate)

Next we need to compile the model in order to combine necessary components together. You must compile it before starting the training.

In [None]:
model_1.compile(
    optimizer=optimizer,
    loss='mean_squared_error',
    metrics=['mean_squared_error', 'r2_score']
    )

Now we can get a summary of the model:

In [None]:
model_1.summary()

**EXERCISE**

1. How many layers are there in this architecture?
2. Looking at the summary table, explain the number of parameters related to the first hidden layer. *Hint: Don't forget the bias terms*
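
As a reminder of how the parameters of a dense layer are counted, here is a minimal sketch on a deliberately different toy network, so that it does not give away the answer. The function `dense_param_count`, the number of input features and the layer sizes are all hypothetical. Each unit has one weight per input plus one bias term.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (one per input-unit pair) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 3            # hypothetical number of input features
layer_units = [8, 4, 1]   # hypothetical toy network, not the model above

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_param_count(n_in, units))
    n_in = units          # the output of one layer is the input of the next

print(counts, sum(counts))
```

The same rule can be checked against each row of the `model_1.summary()` table.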

### Train the model
Now we do the actual training of the model. We will do 20 epochs. For the training, we will use our helper function `train_apply_eval_model`, which returns both the training history and a result overview.

In [None]:
# Define the parameters
num_epochs = 20
batch_size = 16

# Train the model
history_1, base_model_result = train_apply_eval_model(model_1, X_train_scaled, y_train, X_val_scaled, y_val, 
                                                      model_name='base', num_epochs=num_epochs, batch_size=batch_size)

### Evaluate the model
We will first look at the summary results of this first model:

In [None]:
base_model_result

**EXERCISE:** What do you observe here? Please describe the performance of the model on the training and validation dataset. **Hint**: We will compare the performance of the neural networks with the results obtained using the "classical" regression methods later on.

We also look at the evolution of the performance as the training progresses:

In [None]:
plot_history(history_1)

**EXERCISE**:

1) Does the training loss decrease steadily over epochs?

2) How does the validation loss behave compared to the training loss?

3) Using your answers to the first two questions, what can you infer about the model's generalisation?
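
One possible way to support the visual inspection with numbers is sketched below. The loss values are made up for illustration; with a real Keras history object, the corresponding lists should be available as `history.history['loss']` and `history.history['val_loss']`. The sketch checks whether the training loss decreases at every epoch, finds the epoch with the lowest validation loss, and tracks the gap between the two curves.

```python
# Made-up loss values, for illustration only
train_loss = [9.0, 5.0, 3.0, 2.0, 1.5, 1.2, 1.0]
val_loss = [9.5, 5.5, 3.6, 3.0, 3.1, 3.4, 3.9]

# Does the training loss decrease from each epoch to the next?
train_decreasing = all(b < a for a, b in zip(train_loss, train_loss[1:]))

# Epoch (0-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss per epoch
gaps = [v - t for t, v in zip(train_loss, val_loss)]

print(train_decreasing, best_epoch)
print(gaps)
```

In this toy example, the validation loss rises again after its minimum while the training loss keeps falling, so the gap widens. Such a pattern is commonly read as a sign of overfitting.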

## Model Variations


### Wider Model

**EXERCISE**:

Define a wider neural network

- The model consists of an input layer of 64 neurons, two hidden layers of 128 neurons, and an output layer of 1 neuron.

In [None]:
# Define deep learning model
# model_2 = ...

# Define the optimizer
# learning_rate = 0.001
# optimizer = ...

# Compile the model. This combines the necessary components. You must compile it before starting the training.
# model_2.compile(
#     optimizer=...,
#     loss=...,
#     metrics=...
#     )

# Print the model info
# ...

In [None]:
# Train the model
history_2, wider_model_result = train_apply_eval_model(model_2, X_train_scaled, y_train, X_val_scaled, y_val, 
                                                       model_name='wider', num_epochs=num_epochs, batch_size=batch_size)

In [None]:
plot_history(history_2)

### Deeper Model

**EXERCISE**:

Define a deeper neural network

- The model consists of an input layer of 32 neurons, four hidden layers of 64 neurons, and an output layer of 1 neuron.

In [None]:
# Define deep learning model
# model_3 = ...

# Define the optimizer
# learning_rate = 0.001
# optimizer = ...

# Compile the model. This combines the necessary components. You must compile it before starting the training.
# model_3.compile(
#     optimizer=...,
#     loss=...,
#     metrics=...
#     )

# Print the model info
# ...

In [None]:
# Define the parameters
num_epochs = 20
batch_size = 16

# Train the model
history_3, deeper_model_result = train_apply_eval_model(model_3, X_train_scaled, y_train, X_val_scaled, y_val, model_name='deeper', num_epochs=num_epochs, batch_size=batch_size)

In [None]:
plot_history(history_3)

### Model Comparison

In [None]:
all_results = pd.concat([base_model_result, wider_model_result, deeper_model_result])

In [None]:
all_results

In [None]:
# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 3))

for i, metric in enumerate(metrics):
    sns.barplot(data=all_results, y='model_name', x=metric, hue='dataset', ax=axs[i])
    axs[i].legend(loc='lower center')
    axs[i].grid()

plt.tight_layout()
plt.show()

To get more detailed results, we now vary the number of layers and the layer sizes more systematically.

In [None]:
n_neurons = [32, 64, 128]
n_layers = [1, 2, 4]
results = []


for n_layer in n_layers:
    for n_neuron in n_neurons:
        print('\n****************************')

        # define network:
        model = tf.keras.Sequential(
            [tf.keras.layers.Input(shape=(X_train.shape[1],)),
             tf.keras.layers.Dense(n_neuron, activation='relu')] + 
            [tf.keras.layers.Dense(n_neuron*2, activation='relu') for _ in range(n_layer)] + 
            [tf.keras.layers.Dense(1)]
        )

        # define optimizer:
        learning_rate = 0.001
        optimizer = Adam(learning_rate=learning_rate)

        # compile model
        model.compile(
            optimizer=optimizer,
            loss='mean_squared_error',
            metrics=['mean_squared_error', 'r2_score']
        )

        model.summary()

        # train and evaluate the model
        history, result = train_apply_eval_model(model, X_train_scaled, y_train, X_val_scaled, y_val, 
                                                 model_name=f"{n_layer+2}_layers-{n_neuron}_size",
                                                 num_epochs=5, batch_size=batch_size)
        
        results.append(result)
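
To see which architectures the loop above builds, the following sketch lists the width of every `Dense` layer for each configuration without training anything. The helper `layer_widths` is a hypothetical name introduced here. It mirrors the construction in the loop: one layer of `n_neuron` units, then `n_layer` layers of `2 * n_neuron` units, then a single output unit.

```python
from itertools import product

n_neurons = [32, 64, 128]
n_layers = [1, 2, 4]

def layer_widths(n_layer, n_neuron):
    """Units per Dense layer, mirroring the model construction in the loop (hypothetical helper)."""
    return [n_neuron] + [2 * n_neuron] * n_layer + [1]

# Same naming scheme as the model_name used in the loop
configs = {
    f"{n_layer+2}_layers-{n_neuron}_size": layer_widths(n_layer, n_neuron)
    for n_layer, n_neuron in product(n_layers, n_neurons)
}

for name, widths in configs.items():
    print(name, widths)
```

This gives nine configurations in total, and the layer count in each name includes the first dense layer and the output layer.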

In [None]:
results_df = pd.concat(results)

# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 3))

for i, metric in enumerate(metrics):
    sns.barplot(data=results_df, y='model_name', x=metric, hue='dataset', ax=axs[i])
    axs[i].legend(loc='lower center')
    axs[i].grid()

plt.tight_layout()
plt.show()

Let's load the results from Project 1 and compare them.

In [None]:
project1_results = pd.read_csv('regression_results.csv')
project1_results

We join the two results dataframes and plot both the r2 score and the root mean squared error of all models we considered:

In [None]:
all_results = pd.concat([project1_results, results_df]).reset_index().drop(columns='index')

In [None]:
sns.barplot(data=all_results, y='model_name', x='r2_score', hue='dataset')
plt.grid()
plt.legend(loc='lower left')
plt.show()

In [None]:
sns.barplot(data=all_results, y='model_name', x='RMS', hue='dataset')
plt.grid()
plt.legend(loc='lower left')
plt.show()

In [None]:
sns.barplot(data=all_results, y='model_name', x='MAE', hue='dataset')
plt.grid()
plt.legend(loc='lower left')
plt.show()

**EXERCISE**

- Compare the results of deep learning models with the results of traditional regression models.
- Do deep learning models outperform traditional regressors in terms of MAE?

## Effect of Dataset Size
The following cell trains a rather small neural network on training sets of varying size. For each value in `training_sample_counts`, the given number of training data points is sampled randomly from the full training data set 10 times. The model is then trained on this randomly sampled training data set, and evaluated on the full validation dataset.

As the training can take quite some time, we have stored the results in a file `all_model_results.csv` that is available to you. If you want to run the trainings on your own, set `run_from_scratch = True`.

In [None]:
run_from_scratch = False

if run_from_scratch:
    n_train_all = X_train_scaled.shape[0]
    training_sample_counts = [ 2, 5, 10, 100, 1000, 10000 ]
    
    # define model
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    learning_rate = 0.0001
    optimizer = Adam(learning_rate=learning_rate)
    
    all_model_results = pd.DataFrame()
    np.random.seed(123)
    
    for n_train in training_sample_counts:
        # randomly sample the required number of indices
        print('Using ' + str(n_train) + ' samples for training')
        for iter in range(10):
            print('iteration', iter)
            my_train_idx = np.random.choice(range(n_train_all), size=n_train, replace=False)
        
            # select the respective training data points by position (X and y);
            # np.asarray avoids label-based lookup on the shuffled pandas index of y_train
            my_X_train = X_train_scaled[my_train_idx]
            my_y_train = np.asarray(y_train)[my_train_idx]
        
            model.compile(
                optimizer=optimizer,
                loss='mean_squared_error',
                metrics=['mean_squared_error', 'r2_score']
                )
        
            # train and evaluate model performance
            _, all_features_results = train_apply_eval_model(model, my_X_train, my_y_train, X_val_scaled, y_val,
                                                             model_name='NN', num_epochs=50, do_print=False)
            all_features_results['n_train'] = n_train
            all_features_results['iter'] = iter
        
            # add the result data frame to the overall data frame
            all_model_results = pd.concat([all_model_results, all_features_results])
    
    all_model_results.to_csv('all_model_results.csv')
else:
    all_model_results = pd.read_csv('all_model_results.csv')

The following cell plots the results in the same way you have seen in the lecture. The bars represent the average performance (in the given performance metric) on the dataset given by the color; the black bars (error bars) represent the standard deviation over the 10 sampled datasets.
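
The numbers behind such a plot can also be inspected as a table. The sketch below does this on a small made-up data frame with the same column names as `all_model_results` (`n_train`, `dataset`, `RMS`). The values are invented for illustration and are not results from the notebook. It uses the default sample standard deviation of pandas; whether the plotting library uses exactly the same convention is not checked here.

```python
import pandas as pd

# Made-up results: two repetitions per training set size and dataset
toy_results = pd.DataFrame({
    'n_train': [10, 10, 10, 10, 100, 100, 100, 100],
    'dataset': ['train', 'train', 'validation', 'validation'] * 2,
    'RMS': [100.0, 300.0, 900.0, 1100.0, 400.0, 600.0, 700.0, 900.0],
})

# Mean and standard deviation of the metric per training set size and dataset
summary = toy_results.groupby(['n_train', 'dataset'])['RMS'].agg(['mean', 'std'])
print(summary)
```

The same `groupby` call should work on `all_model_results` if you want to read off the exact averages and spreads.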

In [None]:
metric = 'RMS'

plt.figure()

sns.barplot(data=all_model_results, errorbar='sd', x='n_train', y=metric, hue='dataset')
plt.tick_params(axis='x', rotation=45)
plt.xlabel('Training Set Size')
plt.legend()
plt.grid()

plt.title('Model Performance depending on Training Data Size', fontsize="x-large")
plt.tight_layout()
plt.show()

**EXERCISE:**
Describe and interpret the above results. In particular, interpret the average performance on the training and validation datasets, as well as their standard deviation, as the size of the training data varies.