<div style="background-color: cyan">
This is one of two notebooks about neural networks.
One is about neural networks for regression, and the other one about classification. The regression notebook is a bit simpler, and its focus is on the actual neural network and variations of the architecture. The classification notebook is a bit more complex, as you will have to deal with imbalanced data. It also provides more experiments with regularization of neural networks, e.g. using dropout layers.
*You don't need to work on both notebooks. Choose the one that best fits your interests.*
</div>

# Neural Networks for classification

This project gives you hands-on experience with neural networks. It is organized as follows:
1. Data preparation
2. Define a neural network model
3. Role of model architecture
4. Overfitting in deep learning
    1. Dropout
    2. Weight regularization
    3. Early stopping
5. Other components in deep learning
    1. Optimizers
    2. Learning rate
6. Your own model
7. Summary


## Preparations and loading the data

We start by importing the necessary libraries. We also set a random seed, for the reproducibility of the results.

In [5]:

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf


In [6]:
tf.random.set_seed(1)
np.random.seed(1)


Next, we load the dataset that was used in the classification project, so that you can compare the performance of neural networks with that of traditional classifiers.  
Remember that the `Net Income Flag` column was constant and hence removed from the dataset.

In [8]:

bank_data = pd.read_csv("bank_data.csv") 
bank_data = bank_data.drop(columns="Net Income Flag") 

In this project, we use a single validation set instead of K-fold cross-validation to simplify the pipeline. The strategy is to split the whole dataset into training, validation and test sets.

In [None]:
# Train/val/test split
train_val_df, test_df = train_test_split(bank_data, stratify=bank_data["Bankrupt?"], test_size=0.2, random_state=2024)
train_df, val_df = train_test_split(train_val_df, stratify=train_val_df["Bankrupt?"], test_size=0.1, random_state=2024)

train_labels = np.array(train_df.pop('Bankrupt?'))
val_labels = np.array(val_df.pop('Bankrupt?'))
test_labels = np.array(test_df.pop('Bankrupt?'))

train_features = np.array(train_df)
val_features = np.array(val_df)
test_features = np.array(test_df)

print(f'Average class probability in training set:   {train_labels.mean():.4f}')
print(f'Average class probability in validation set: {val_labels.mean():.4f}')
print(f'Average class probability in test set:       {test_labels.mean():.4f}')
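As a quick sanity check on the two-stage split above, the resulting fractions of the full dataset can be worked out by hand. The sketch below only does the arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Fractions used in the two calls to train_test_split above
test_frac = 0.2          # first split: 20% of all rows go to the test set
val_frac_of_rest = 0.1   # second split: 10% of the remaining rows go to validation

# Express every part as a fraction of the full dataset
val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = (1 - test_frac) * (1 - val_frac_of_rest)

print(f"train: {train_frac:.2f}, val: {val_frac:.2f}, test: {test_frac:.2f}")
```

So roughly 72% of the rows are used for training, 8% for validation and 20% for testing (the exact row counts depend on rounding).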

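As usual, we standardize the data. The scaler is fitted on the training set only and then applied to the validation and test sets. We also limit the influence of outliers by clipping the standardized values.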

In [None]:
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)

val_features = scaler.transform(val_features)
test_features = scaler.transform(test_features)

# Clip the values to the range [-5, 5] to limit the influence of outliers.
train_features = np.clip(train_features, -5, 5)
val_features = np.clip(val_features, -5, 5)
test_features = np.clip(test_features, -5, 5)


print('Training labels shape:', train_labels.shape)
print('Validation labels shape:', val_labels.shape)
print('Test labels shape:', test_labels.shape)

print('Training features shape:', train_features.shape)
print('Validation features shape:', val_features.shape)
print('Test features shape:', test_features.shape)
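To see what clipping does to a single feature, here is a minimal sketch on a made-up standardized column (the values are purely illustrative). Note that clipping caps extreme values but keeps every row.

```python
import numpy as np

# Toy standardized feature column (made-up values); the last entry is an extreme outlier
toy_column = np.array([-0.3, 0.1, 0.2, 12.0])

# Values outside [-5, 5] are replaced by the nearest bound
clipped = np.clip(toy_column, -5, 5)

print(clipped)
print("Same number of rows:", len(clipped) == len(toy_column))
```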

## Define a neural network model

In this section we go through a typical deep learning pipeline, including:
- Defining the model and choosing the optimizer and loss
- Training
- Validation and testing

In [11]:
# Define the model
model_1 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(train_features.shape[-1],)),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])


In [12]:

# Compile the model: this attaches the optimizer, loss and metrics to it. A model must be compiled before it can be trained.
model_1.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=['accuracy', 'precision', 'recall']
    )


In [None]:
# Print the model info
model_1.summary()

We noticed in the classification project that the dataset is imbalanced. We used class weights to address this issue. In this project, we will use class weights as well.

In [None]:
# Assign different weights to class labels, as done in the classification project.
neg = len(train_labels[train_labels==0])
pos = len(train_labels[train_labels==1])
total = neg + pos
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
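To see what this weighting formula achieves, consider a small worked example with hypothetical class counts (the numbers below are made up and are not the counts of the bank dataset). With these weights, both classes contribute the same total weight to the loss.

```python
# Hypothetical class counts, chosen only for illustration
toy_neg = 970
toy_pos = 30
toy_total = toy_neg + toy_pos

# Same formula as in the cell above
toy_weight_0 = (1 / toy_neg) * (toy_total / 2.0)
toy_weight_1 = (1 / toy_pos) * (toy_total / 2.0)

print(f"Weight for class 0: {toy_weight_0:.2f}")
print(f"Weight for class 1: {toy_weight_1:.2f}")

# Total weight per class: both sum to toy_total / 2
print(f"Total weight of class 0: {toy_weight_0 * toy_neg:.1f}")
print(f"Total weight of class 1: {toy_weight_1 * toy_pos:.1f}")
```

The rare class gets a much larger per-sample weight, so that misclassifying one of its samples costs as much as misclassifying many samples of the frequent class.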

Now we can train the model:

In [15]:
# Training settings
EPOCHS = 50
BATCH_SIZE = 512

In [None]:
history_1 = model_1.fit(
    train_features,
    train_labels,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(val_features, val_labels),
    class_weight=class_weight
    )

We define a convenience function to plot the training history.

In [17]:
def plot_history(history):
  """
  Plot model training history.
  Args:
  - history: tensorflow history object.

  Returns:
  None
  """
  # Plot loss, precision and recall during training
  f, axes = plt.subplots(ncols=3, figsize=(24,6))

  sns.lineplot(x=history.epoch, y=history.history['loss'], ax=axes[0], label='Train loss')
  sns.lineplot(x=history.epoch, y=history.history['val_loss'], ax=axes[0], label='Val loss')
  axes[0].set_title('Loss history')
  axes[0].set(yscale='log') # Use a log scale on y-axis to show the wide range of values.
  axes[0].set(xlabel='Epoch', ylabel='Loss')

  sns.lineplot(x=history.epoch, y=history.history['precision'], ax=axes[1], label='Train precision')
  sns.lineplot(x=history.epoch, y=history.history['val_precision'], ax=axes[1], label='Val precision')
  axes[1].set_title('Precision history')
  axes[1].set(xlabel='Epoch', ylabel='Precision')

  sns.lineplot(x=history.epoch, y=history.history['recall'], ax=axes[2], label='Train recall')
  sns.lineplot(x=history.epoch, y=history.history['val_recall'], ax=axes[2], label='Val recall')
  axes[2].set_title('Recall history')
  axes[2].set(xlabel='Epoch', ylabel='Recall')

  plt.show()

In [None]:
plot_history(history_1)

We observe that the loss decreases as expected and that the precision improves with the number of epochs. The recall, however, decreases.

### Exercise 

- Can you think of a reason why the recall is decreasing? Hint: What is optimized in the loss?


YOUR ANSWER HERE

To gain more insight into the performance of the model, we compute the same classification metrics that we already used in the classification project. In particular, we also look at the trade-off between precision and recall by plotting the precision-recall curve.
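The trade-off between precision and recall comes from the decision threshold that turns predicted scores into class labels. The following minimal sketch uses made-up toy labels and scores (not model output) and a hypothetical helper `precision_recall_at` to show how the two metrics move as the threshold changes.

```python
import numpy as np

# Toy data, made up for illustration: true labels and predicted scores
toy_labels = np.array([0, 0, 1, 0, 1, 0, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

def precision_recall_at(threshold, labels, scores):
    """Compute precision and recall when predicting 1 for scores above the threshold."""
    predicted = scores > threshold
    tp = np.sum(predicted & (labels == 1))   # true positives
    fp = np.sum(predicted & (labels == 0))   # false positives
    fn = np.sum(~predicted & (labels == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

for threshold in (0.25, 0.5, 0.65):
    precision, recall = precision_recall_at(threshold, toy_labels, toy_scores)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy example, a higher threshold never increases recall, while precision tends to go up. The precision-recall curve summarizes this behavior over all thresholds.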

As we will evaluate a lot of models, we define a convenience function. 

In [19]:
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score, precision_recall_curve, auc
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# The same evaluation will be used multiple times, so we make it a function
def evaluate_model(model, model_name, test_features, test_labels):
  """
  Evaluate trained models. Plot ROC curve and PR curve.
  Args:
  - model: tensorflow model. trained neural network
  - model_name: str.
  - test_features: np.array.
  - test_labels: np.array.

  Returns:
  - results: dict.
  """
  results = {}
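  # Flatten the (n, 1) output of the sigmoid layer to shape (n,) so that it matches test_labels
  labels_score = model.predict(test_features).ravel()
  pred_labels = np.where(labels_score > 0.5, 1, 0)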

  results['accuracy'] = accuracy_score(test_labels, pred_labels)
  results['balanced_accuracy'] = balanced_accuracy_score(test_labels, pred_labels)
  results['precision'] = precision_score(test_labels, pred_labels)
  results['recall'] = recall_score(test_labels, pred_labels)
  results['f1'] = f1_score(test_labels, pred_labels)
  print("Accuracy", results['accuracy'])
  print("Balanced Accuracy", results['balanced_accuracy'])
  print("Precision", results['precision'])
  print("Recall", results['recall'])
  print("F1", results['f1'])

  fpr, tpr, thresholds = roc_curve(test_labels, labels_score)
  roc_auc = roc_auc_score(test_labels, labels_score)
  print('ROC-AUC Score:', roc_auc)
  precision, recall, thresholds = precision_recall_curve(test_labels, labels_score)
  pr_auc = auc(recall, precision)
  print('PR-AUC Score:', pr_auc)

  results['roc_auc'] = roc_auc
  results['pr_auc'] = pr_auc

  f, axes = plt.subplots(ncols=2, figsize=(18,6))
  disp = RocCurveDisplay(fpr=fpr, tpr=tpr, estimator_name=model_name)
  disp.plot(ax=axes[0])
  disp = PrecisionRecallDisplay(precision=precision, recall=recall, estimator_name=model_name)
  disp.plot(ax=axes[1])
  plt.show()
  return results

We will also define a convenience function to show all the results. Note, you don't need to understand the code in this function.

In [20]:
from IPython.display import display, HTML
def display_benchmark(benchmark):
  """
  Display benchmark results in a DataFrame.
  Args:
  - benchmark: dict.
 """
  df =  pd.DataFrame.from_dict(benchmark, orient='index', columns=["accuracy", "balanced_accuracy", "precision", "recall", "f1", "roc_auc", "pr_auc"])
  display(HTML("<b>Summary of all results</b>"))
  display(df)


In [None]:
# For gathering all evaluation results
benchmark = {}
benchmark['baseline'] = evaluate_model(model_1, "Model 1", test_features, test_labels)

display_benchmark(benchmark)

Congratulations! You've finished a typical pipeline for training a neural network. In the following sections, we will look more closely at model architecture, overfitting and underfitting, as well as other components in deep learning.

## Defining a larger network

We can now try to obtain a better model by increasing the number of hidden layers and the number of neurons in each layer.

**Exercise**

- Define a wider and deeper network consisting of 3 hidden layers with 256 neurons each

In [35]:
# model_2 = ...

# model_2.compile(optimizer='adam',
#        loss=tf.keras.losses.BinaryCrossentropy(),
#        metrics=['accuracy', 'precision', 'recall'])

# model_2.summary()

Let's train this model:

In [None]:
# Train the model
history_2 = model_2.fit(
    train_features,
    train_labels,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(val_features, val_labels),
    class_weight=class_weight
    )

In [None]:
plot_history(history_2)
benchmark['wider'] = evaluate_model(model_2, "Model 2", test_features, test_labels)

display_benchmark(benchmark)

### Exercise

- Is the model better or worse than the previous one?
- Look at the training history. What phenomenon do you observe?

YOUR ANSWER HERE

## Regularization

In the following we look at different regularization techniques to prevent overfitting.

### Dropout

The term “dropout” refers to randomly dropping nodes (in the input and hidden layers) of a neural network during training. All forward and backward connections of a dropped node are temporarily removed, which creates a new, thinner network architecture out of the parent network. Each node is dropped with probability $p$, independently at every training step; at prediction time no nodes are dropped.
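The mechanism can be illustrated with a few lines of NumPy. This is a minimal sketch of so-called inverted dropout, assuming the kept activations are rescaled by $1/(1-p)$ during training so that their expected value stays unchanged; the function `dropout_forward` is a hypothetical helper written for this illustration, not the Keras implementation.

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    """Apply inverted dropout with drop probability p to an array of activations."""
    if not training or p == 0.0:
        # At prediction time nothing is dropped
        return activations
    # Keep each unit with probability 1 - p
    keep_mask = rng.random(activations.shape) >= p
    # Rescale the kept units so that the expected activation is unchanged
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
toy_activations = np.ones((1000, 256))  # toy activations, all equal to 1

train_out = dropout_forward(toy_activations, p=0.5, rng=rng, training=True)
eval_out = dropout_forward(toy_activations, p=0.5, rng=rng, training=False)

print("Values during training:", np.unique(train_out))
print("Mean during training:  ", train_out.mean())
print("Mean at prediction:    ", eval_out.mean())
```

During training, about half of the units are set to zero and the rest are doubled, so the mean stays close to 1. At prediction time the activations pass through unchanged.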


In [None]:
# Same architecture as model 2, but with dropout to prevent overfitting
model_3 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(train_features.shape[-1],)),
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dropout(0.5), # Add dropout layer
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dropout(0.5), # Add dropout layer
  tf.keras.layers.Dense(1, activation='sigmoid')
])

model_3.compile(optimizer='adam',
       loss=tf.keras.losses.BinaryCrossentropy(),
       metrics=['accuracy', 'precision', 'recall'])

model_3.summary()

### Weight regularization

Another common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

- L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).

- L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is often called weight decay in the context of neural networks. For plain gradient descent the two are mathematically equivalent; for adaptive optimizers such as Adam they differ, which is why AdamW applies weight decay separately from the gradient-based update.
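As a small worked example, the two penalties can be computed by hand for a toy weight matrix. The weights below are made up; the regularization factor `lam` plays the role of the value passed to `tf.keras.regularizers.l2(...)`, which adds the factor times the sum of squared weights to the loss.

```python
import numpy as np

# Toy weight matrix, made up for illustration
toy_weights = np.array([[0.5, -1.0],
                        [2.0,  0.0]])
lam = 0.001  # regularization factor

# L1 penalty: factor times the sum of absolute values
l1_penalty = lam * np.sum(np.abs(toy_weights))

# L2 penalty: factor times the sum of squares
l2_penalty = lam * np.sum(toy_weights ** 2)

print(f"L1 penalty: {l1_penalty:.5f}")
print(f"L2 penalty: {l2_penalty:.5f}")
```

The L2 penalty is dominated by the largest weight (here 2.0), which is why it pushes large weights towards smaller values particularly strongly.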

In [None]:
# Same architecture as model 2, but with L2 regularization to prevent overfitting
model_4 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(train_features.shape[-1],)),
  tf.keras.layers.Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
  tf.keras.layers.Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

model_4.compile(optimizer='adam',
       loss=tf.keras.losses.BinaryCrossentropy(),
       metrics=['accuracy', 'precision', 'recall'])

model_4.summary()

**Exercise**

1. Train the two models (`model_3` and `model_4`). Do they avoid overfitting?
2. Plot the history of the models and do the evaluations. 


In [None]:
history_3 = ...

plot_history(history_3)

benchmark["dropout"] = ...
display_benchmark(benchmark)

In [None]:
history_4 = ...

plot_history(history_4)

benchmark["l2_reg"] = ...
display_benchmark(benchmark)


### Early stopping

Yet another way to prevent overfitting is to stop the training when the performance on the validation set starts to degrade. This is called early stopping.

During training, the model is evaluated on the validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.

In contrast to regularization, early stopping is not defined when compiling the model. Instead, it is a callback that is passed to the `fit` method.
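The stopping rule itself is simple. The following sketch shows a simplified version of the patience logic on a made-up sequence of validation losses; `early_stopping_epoch` is a hypothetical helper for illustration and not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # number of epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training runs through all epochs
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving first, then degrading
toy_val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.61, 0.7, 0.8]

stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With a patience of 3, training stops three epochs after the last improvement, and the weights from the best epoch would be the ones to keep.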

In [None]:
# Define the early stopping callback. It monitors the validation loss; if the loss has not improved for 10 consecutive epochs, training stops and the best weights are restored.
# The callback will be used in the fit method later.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    verbose=1,
    patience=10,
    mode='min',
    restore_best_weights=True)

# Same architecture as model 2, without dropout or L2 regularization; overfitting is limited by early stopping instead
model_5 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(train_features.shape[-1],)),
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

model_5.compile(optimizer='adam',
       loss=tf.keras.losses.BinaryCrossentropy(),
       metrics=['accuracy', 'precision', 'recall'])

model_5.summary()

history_5 = model_5.fit(
    train_features,
    train_labels,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(val_features, val_labels),
    class_weight=class_weight,
    callbacks=[early_stopping]  # <-- Early stopping
    )

plot_history(history_5)
benchmark['early_stopping'] = evaluate_model(model_5, "Model 5", test_features, test_labels)
display_benchmark(benchmark)

## Other components in deep learning

Besides the architecture and the regularization, there are other components that are important in deep learning. In this section, we will discuss the optimizer and the learning rate.

### Optimizers

Gradient descent is the most commonly used optimization algorithm in deep learning, and it has many variants. Below is a non-exhaustive list:

- SGD (`tf.keras.optimizers.SGD()`)
- Adagrad (`tf.keras.optimizers.Adagrad()`)
- Adadelta (`tf.keras.optimizers.Adadelta()`)
- Adam (`tf.keras.optimizers.Adam()`)
- AdamW (`tf.keras.optimizers.AdamW()`)

You can experiment with any of them. The following code (which is not runnable on its own) shows how you could compile a model with the Adam optimizer:

```python
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer,
       loss=tf.keras.losses.BinaryCrossentropy(),
       metrics=['accuracy', 'precision', 'recall'])
```

### Learning rate

The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.

A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution or even to diverge, whereas a learning rate that is too small can make training very slow or cause it to get stuck.

The learning rate is given as an argument when defining the optimizer. For example, to set the learning rate to 0.01, you can use the following code:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
```
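To build some intuition for the effect of the learning rate, here is a minimal sketch that uses plain gradient descent (not Adam) on the toy function $f(w) = w^2$, whose minimum is at $w = 0$. The function and the three learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=20, w_start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w_start
    for _ in range(steps):
        gradient = 2 * w              # derivative of w**2
        w = w - learning_rate * gradient
    return w

w_small = gradient_descent(learning_rate=0.01)  # too small: slow progress
w_good = gradient_descent(learning_rate=0.4)    # reasonable: converges quickly
w_large = gradient_descent(learning_rate=1.1)   # too large: diverges

print(f"lr=0.01 -> w = {w_small:.4f}")
print(f"lr=0.4  -> w = {w_good:.2e}")
print(f"lr=1.1  -> w = {w_large:.2f}")
```

After 20 steps, the small learning rate is still far from the minimum, the moderate one is essentially at the minimum, and the large one has moved further away from it than the starting point.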

## Your own model

In this section you will define your own model, based on everything you learned from previous sections.


**Exercise**

Define a model with suitable architecture and techniques for preventing overfitting.

Can you achieve the following performance on the validation set?
- balanced_accuracy > 0.8
- precision > 0.15
- recall > 0.8
- almost no overfitting

Hints:
1. Make the network bigger (wider and deeper) and use a smaller learning rate (but not as small as 0.0001)
2. Combine different methods, such as dropout, early stopping and weight regularization, to avoid overfitting
3. Train longer and use a larger batch size
4. Use robust optimizers, such as Adam or AdamW

You will probably need many experiments to find the best combination, so keep track of the parameters you have used. You can create as many cells as you need and keep them here.

When you finish your experiments, answer the following questions:
1. How does the model perform on the validation and test sets?
2. What is the best model you have found?
3. How does it compare to the models from project 2? Is it better than XGBoost?

To answer the last question, we have compiled the results of the models from project 2. You can find them in the following table:

In [None]:
project2_results = pd.read_csv("classification_results.csv", index_col=0)
project2_results

In [2]:
# Define and train your model here

In [None]:
# Evaluate your own model on the test set
benchmark['model_own'] = evaluate_model(model_own, "Own model", test_features, test_labels)
display_benchmark(benchmark)

**Exercise**

- Compare the results of deep learning models with the results of traditional classifiers, such as KNN, decision trees and XGBoost.
- Do deep learning models outperform traditional classifiers in terms of precision, recall and f1-score?

In [None]:
project2_results