In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

**Motivation & objective**

Imbalanced datasets are ubiquitous in real-world machine learning tasks, where one class significantly outnumbers the other(s). While this scenario is common, it poses significant challenges for traditional machine learning algorithms, which tend to be biased towards the majority class and perform poorly on minority classes. In this lab, we will explore three techniques to address the imbalance issue: subsampling, oversampling, and Synthetic Minority Over-sampling Technique (SMOTE). By implementing these techniques, we aim to improve the model's performance on imbalanced datasets and make our predictions more reliable.

---

**Importing the Dataset**:

Let's start by importing the Credit Card Fraud Detection dataset, a real-world example of an imbalanced dataset where fraudulent transactions are the minority class.

*Download the dataset here:*

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

In [None]:
df = pd.read_csv("../data/creditcard.csv")

In [None]:
df.info()

In [None]:
# check the first rows of this dataset

df.head()

In [None]:
# this is extremely imbalanced

df['Class'].value_counts()

# but also expected, since frauds are a mere fraction of all transactions
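
To put a number on "extremely imbalanced", it can help to look at relative frequencies rather than raw counts. The sketch below uses made-up toy labels in place of `df['Class']`; the variable names and counts are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy labels standing in for df['Class']: 0 = legitimate, 1 = fraud (made-up counts)
toy_labels = pd.Series([0] * 995 + [1] * 5, name="Class")

# Absolute counts per class
toy_counts = toy_labels.value_counts()

# Relative frequencies per class (fractions that sum to 1)
toy_proportions = toy_labels.value_counts(normalize=True)

# Imbalance ratio: number of majority samples per minority sample
toy_imbalance_ratio = toy_counts.max() / toy_counts.min()

print(toy_proportions)
print(f"Imbalance ratio: {toy_imbalance_ratio:.0f}:1")
```

On the real data, the same two calls on `df['Class']` would give the fraud share directly.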

**Train/Test Split and Baseline Model**

Before applying any techniques to handle the imbalance, let's establish a baseline model. 

We'll perform a simple train/test split and train a RandomForest classifier on the training set, evaluating its performance on the test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# Splitting the data into train and test sets

X,y = df.drop(columns=['Class']), df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

As we can see below, both splits are also heavily imbalanced.

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()
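
The split above is purely random, so the fraud share in each split can drift slightly from the overall share. As an optional variation (not used in this lab), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. A minimal sketch on made-up toy labels; all `toy_` names are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples, of which 10 belong to the positive class (made-up counts)
toy_X = np.arange(1000).reshape(-1, 1)
toy_y = np.array([0] * 990 + [1] * 10)

# stratify=toy_y asks for the same class proportions in both splits
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("Positive share in train:", toy_y_tr.mean())
print("Positive share in test: ", toy_y_te.mean())
```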

Training a RandomForest classifier

In [None]:
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

Evaluating it on the test set

In [None]:
y_pred = rf_classifier.predict(X_test)

In [None]:
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)

cm_display = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = [0, 1])

cm_display.plot()
plt.show() 

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Baseline Model Performance:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
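
Accuracy alone can look excellent on imbalanced data even when a model is useless for the minority class, which is why precision and recall are reported alongside it. A small illustration on made-up toy labels (not the lab's data), using a hypothetical "classifier" that always predicts class 0:

```python
import numpy as np

# Toy test labels: 990 legitimate (0) and 10 fraudulent (1) transactions (made-up counts)
toy_y_true = np.array([0] * 990 + [1] * 10)

# A "classifier" that ignores its input and always predicts the majority class
toy_y_always_zero = np.zeros_like(toy_y_true)

# Accuracy: fraction of predictions that are correct
naive_accuracy = (toy_y_always_zero == toy_y_true).mean()

# Recall: fraction of the actual frauds that were caught
true_positives = ((toy_y_always_zero == 1) & (toy_y_true == 1)).sum()
naive_recall = true_positives / (toy_y_true == 1).sum()

print("Accuracy:", naive_accuracy)
print("Recall:", naive_recall)
```

The accuracy is high simply because almost every sample is class 0, while the recall shows that no fraud is detected at all.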

---

**Subsampling**

Subsampling (also called undersampling) involves reducing the number of samples in the majority class to balance the dataset.

*However, it's essential to resample only the training set.*

This holds for every resampling method, not just subsampling. The test set must reflect the real-world class distribution, so, as usual, we leave it untouched.
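
To make the idea concrete, here is a hand-rolled sketch of random subsampling in plain pandas. It is only meant to illustrate the concept on a made-up toy training set and is not necessarily how any library implements it; all `toy_` names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy training set: 95 majority rows (Class 0) and 5 minority rows (Class 1)
toy_train = pd.DataFrame({
    "feature": np.arange(100, dtype=float),
    "Class": [0] * 95 + [1] * 5,
})

toy_minority = toy_train[toy_train["Class"] == 1]
toy_majority = toy_train[toy_train["Class"] == 0]

# Keep a random subset of the majority class, as large as the minority class
toy_majority_down = toy_majority.sample(n=len(toy_minority), random_state=42)

# Combine and shuffle so the classes are not grouped together
toy_undersampled = pd.concat([toy_minority, toy_majority_down]).sample(
    frac=1, random_state=42
)

print(toy_undersampled["Class"].value_counts())
```

Note how 90 of the 95 majority rows are thrown away: balance is bought at the cost of discarding data.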

In [None]:
# install the imbalanced-learn package (it is imported as `imblearn`)

!pip install imbalanced-learn

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Subsampling the majority class in the training set

undersampler = RandomUnderSampler(random_state=42)

X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)

In [None]:
# note that this is now balanced!

y_train_resampled.value_counts()

Training a RandomForest classifier

In [None]:
# Training a new RandomForest classifier on the resampled data

rf_classifier_resampled = RandomForestClassifier(random_state=42)
rf_classifier_resampled.fit(X_train_resampled, y_train_resampled)

Evaluate it on the test set

In [None]:
y_pred_resampled = rf_classifier_resampled.predict(X_test)

In [None]:
cm = confusion_matrix(y_true=y_test, y_pred=y_pred_resampled)

cm_display = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = [0, 1])

cm_display.plot()
plt.show() 

In [None]:
accuracy_resampled = accuracy_score(y_test, y_pred_resampled)
precision_resampled = precision_score(y_test, y_pred_resampled)
recall_resampled = recall_score(y_test, y_pred_resampled)

print("Subsampling Model Performance:")
print("Accuracy:", accuracy_resampled)
print("Precision:", precision_resampled)
print("Recall:", recall_resampled)

---

**Oversampling**

Oversampling involves increasing the number of samples in the minority class to balance the dataset.
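
Conceptually, random oversampling draws minority rows with replacement until both classes have the same size. A hand-rolled sketch in plain pandas on a made-up toy training set, for illustration only; all `toy_` names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy training set: 95 majority rows (Class 0) and 5 minority rows (Class 1)
toy_train = pd.DataFrame({
    "feature": np.arange(100, dtype=float),
    "Class": [0] * 95 + [1] * 5,
})

toy_minority = toy_train[toy_train["Class"] == 1]
toy_majority = toy_train[toy_train["Class"] == 0]

# Draw minority rows with replacement until they match the majority count
toy_minority_up = toy_minority.sample(
    n=len(toy_majority), replace=True, random_state=42
)

# Combine and shuffle so the classes are not grouped together
toy_oversampled = pd.concat([toy_majority, toy_minority_up]).sample(
    frac=1, random_state=42
)

print(toy_oversampled["Class"].value_counts())
print("Distinct minority rows:", toy_minority_up["feature"].nunique())
```

The minority class now has 95 rows, but they are copies of at most 5 distinct originals, so no new information is added.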

In [None]:
from imblearn.over_sampling import RandomOverSampler

# Oversampling the minority class in the training set

oversampler = RandomOverSampler(random_state=42)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)

In [None]:
# notice that this is now balanced!

y_train_oversampled.value_counts()

Train a RandomForest Classifier

In [None]:
rf_classifier_oversampled = RandomForestClassifier(random_state=42)
rf_classifier_oversampled.fit(X_train_oversampled, y_train_oversampled)

Evaluate it on the test set

In [None]:
y_pred_oversampled = rf_classifier_oversampled.predict(X_test)

In [None]:
cm = confusion_matrix(y_true=y_test, y_pred=y_pred_oversampled)

cm_display = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = [0, 1])

cm_display.plot()
plt.show() 

In [None]:
accuracy_oversampled = accuracy_score(y_test, y_pred_oversampled)
precision_oversampled = precision_score(y_test, y_pred_oversampled)
recall_oversampled = recall_score(y_test, y_pred_oversampled)

print("\nOversampling Model Performance:")
print("Accuracy:", accuracy_oversampled)
print("Precision:", precision_oversampled)
print("Recall:", recall_oversampled)

---

**SMOTE (Synthetic Minority Over-sampling Technique)**

SMOTE generates synthetic samples for the minority class to balance the dataset.
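
The core idea is interpolation: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that idea in plain NumPy on made-up 2-D points. The function `synthesize_point` and its parameters are hypothetical names for this illustration; this is not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples in a 2-D feature space (made-up values, no duplicates)
toy_minority_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.0, 3.0]])

def synthesize_point(points, index, k, rng):
    """Create one synthetic sample from points[index] and one of its k nearest neighbours."""
    base = points[index]
    # Euclidean distance from the base point to every minority point
    distances = np.linalg.norm(points - base, axis=1)
    # Sort by distance and skip position 0, which is the base point itself
    # (this assumes the points contain no exact duplicates)
    neighbour_indices = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbour_indices)]
    # Pick a random position on the line segment between base and neighbour
    gap = rng.uniform(0.0, 1.0)
    return base + gap * (neighbour - base)

# One synthetic sample per original minority point
toy_synthetic = np.array([
    synthesize_point(toy_minority_points, i, k=2, rng=rng)
    for i in range(len(toy_minority_points))
])

print(toy_synthetic)
```

Unlike random oversampling, the new points are not copies of existing rows, although they always lie between existing minority samples.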

In [None]:
from imblearn.over_sampling import SMOTE

# Applying SMOTE to the training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

In [None]:
# note that this is now balanced

y_train_smote.value_counts()

Train a RandomForest classifier

In [None]:
rf_classifier_smote = RandomForestClassifier(random_state=42)
rf_classifier_smote.fit(X_train_smote, y_train_smote)

Evaluate it on the test set

In [None]:
y_pred_smote = rf_classifier_smote.predict(X_test)

In [None]:
cm = confusion_matrix(y_true=y_test, y_pred=y_pred_smote)

cm_display = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = [0, 1])

cm_display.plot()
plt.show() 

In [None]:
accuracy_smote = accuracy_score(y_test, y_pred_smote)
precision_smote = precision_score(y_test, y_pred_smote)
recall_smote = recall_score(y_test, y_pred_smote)

print("\nSMOTE Model Performance:")
print("Accuracy:", accuracy_smote)
print("Precision:", precision_smote)
print("Recall:", recall_smote)

---

## Challenges 

**Task 1**

Understand everything we've done above.

**Task 2**

Recall that in binary classification, we predict the class with the highest predicted probability.

Since there are only 2 classes, this means we predict class 1 whenever its predicted probability exceeds 0.5 (the default threshold).

However, it's often worth changing this prediction threshold (cutoff) to another value. Lowering it typically raises recall at the cost of precision, and raising it does the opposite, so the threshold lets us trade these metrics off against each other.

Run the code below to see how to do this.

In [None]:
# fixed seed so the results are reproducible, as in the models above
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

In [None]:
# get all prediction probabilities

prediction_probabilities = rf.predict_proba(X_test)

# extract only the probabilities for the positive class

prediction_for_positive_class = prediction_probabilities[:, 1]

In [None]:
# Define the range of threshold values

threshold_values = np.linspace(0.2,0.8,25)

threshold_values

We will loop over these threshold values below. Each threshold is the minimum predicted probability for class 1 at which we predict class 1.

In [None]:
# Loop through each threshold value
for threshold in threshold_values:

    # Convert predicted probabilities to binary predictions based on the current threshold
    y_pred = (prediction_for_positive_class >= threshold).astype(int)
    
    accuracy = round(accuracy_score(y_test, y_pred),4)
    precision = round(precision_score(y_test, y_pred),4)
    recall = round(recall_score(y_test, y_pred),4)
    
    # Print the metrics for the current threshold
    print(f'Threshold : {round(threshold,2)}')
    print(f'Accuracy  : {accuracy}')
    print(f'Precision : {precision}')
    print(f'Recall    : {recall}', end='\n\n')
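
As an alternative to a manual loop, scikit-learn's `precision_recall_curve` computes precision and recall at every distinct score value. A minimal sketch on made-up toy labels and scores; with the lab's data you would presumably pass `y_test` and `prediction_for_positive_class` instead.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities for the positive class (made-up values)
toy_y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

# precision and recall have one more entry than thresholds:
# the final pair (precision=1, recall=0) has no corresponding threshold
toy_precision, toy_recall, toy_thresholds = precision_recall_curve(
    toy_y_true, toy_scores
)

# zip stops at the shortest input, so the final pair is left out here
for t, p, r in zip(toy_thresholds, toy_precision, toy_recall):
    print(f"Threshold : {t:.2f}  Precision : {p:.4f}  Recall : {r:.4f}")
```

Accuracy is not part of this function's output, so the loop above remains useful if you want all three metrics side by side.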

**Task 3**

Repeat the threshold analysis above, this time using the classifiers you trained on the subsampled, oversampled, and SMOTE training sets.