# **SW12: Imbalanced data**

Imbalanced data occurs when the distribution of classes in a dataset is highly skewed, with one class significantly underrepresented compared to others. This imbalance can lead to biased model performance, where the model favors the majority class and fails to correctly predict or recognize the minority class. Addressing imbalanced data is crucial in applications like fraud detection, medical diagnosis, and rare event prediction to ensure reliable and equitable outcomes.

---

## **Setup**


In [None]:
# Basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Some Jupyter magic for nicer output
%config InlineBackend.figure_formats = ["svg"]   # Enable vectorized graphics

# Adjust the default settings for plots
import sys
sys.path.append("..")
import ml
ml.setup_plotting()

### **Dataset / Problem**

In this tutorial, we will use the [adult dataset](https://www.openml.org/search?type=data&status=active&id=1590). It contains data about 
adult persons extracted from a 1994 Census database. The task is to 
determine whether a person makes over 50K a year or not.

We can obtain the data from [OpenML](https://www.openml.org/), an open online scientific platform for machine learning that provides open data, open algorithms and other resources. scikit-learn offers a function fetch_openml() to automatically download data from OpenML:

In [None]:
from sklearn.datasets import fetch_openml

df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)

# Drop columns that are not needed
df = df.drop(columns=["fnlwgt", "education-num"])

display(df)

In [None]:
# Display absolute and relative class frequencies
classes_count = y.value_counts()
display(classes_count)
display(classes_count / classes_count.sum())

The data is already imbalanced (with a 1:3 ratio). But since we want to 
demonstrate the effect of imbalanced data in a more pronounced way, we
will artificially reduce the number of positive samples. Here, we will
use the imbalanced-learn library to do this:

In [None]:
# To better highlight the effect of learning from an imbalanced dataset, 
# we increase its ratio to 30:1:
from imblearn.datasets import make_imbalance
sampling_strategy = {classes_count.idxmin(): classes_count.max() // 30}
df, y = make_imbalance(df, y, sampling_strategy=sampling_strategy)

display(y.value_counts())
display(y.value_counts(normalize=True))

In [None]:
# Does the data contain any missing values?
missing_values = df.isnull().sum()
display(missing_values[missing_values > 0])

---

## **Data preprocessing**

In the following, we will construct a pipeline for preprocessing the data. 
Inspection of the data reveals that the dataset contains both numerical and 
categorical features, which we want to process differently. Conveniently, 
scikit-learn provides the 
[ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#column-transformer),
which allows us to apply different transformations to different columns of the input data.


Note that the variables workclass, occupation, and native-country are 
sometimes missing. To handle these missing values, we will use the
SimpleImputer class to fill in missing values. For numerical features,
we will use the mean value, while for categorical features, we will use
a constant value "missing".
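
As a small illustration of the two imputation strategies, the following sketch applies them to a made-up three-row frame. The `toy` frame and its values are invented for this example and are not taken from the adult dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "workclass": ["Private", np.nan, "State-gov"],
})

# Numerical column: replace NaN with the column mean
num_imputed = SimpleImputer(strategy="mean").fit_transform(toy[["age"]])

# Categorical column: replace NaN with the constant string "missing"
cat_imputed = SimpleImputer(
    strategy="constant", fill_value="missing"
).fit_transform(toy[["workclass"]])

print(num_imputed.ravel())
print(cat_imputed.ravel())
```

With the constant strategy, "missing" becomes a category of its own, so the one-hot encoder can later assign it a separate column.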


In [None]:
# Preprocess
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector

# Pipeline for numeric data: Scale and impute missing values
num_pipe = make_pipeline(
    StandardScaler(), 
    SimpleImputer(strategy="mean", add_indicator=True)
)

# Pipeline for categorical data: Impute missing values, then one-hot encode
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Combine both pipelines to a single preprocessor
preprocessor = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=2,
)

# Fit the preprocessor and apply the transformation in one step.
# Note: df now holds the transformed feature matrix, not a DataFrame.
df = preprocessor.fit_transform(df)

---

## **Baseline models**

In the remainder of this tutorial, we will use the following function to train
and evaluate the performance of a classifier. Besides accuracy and some other
metrics, we will also report the balanced accuracy, which is defined as the
average of the recall obtained on each class. For a binary problem, balanced
accuracy is equivalent to the arithmetic mean of specificity and sensitivity.
This metric is particularly useful for imbalanced datasets.
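
To make the definition concrete, here is a minimal sketch on made-up labels (95 negatives, 5 positives, chosen only for illustration). It compares plain accuracy with balanced accuracy for a predictor that always outputs the majority class, and checks that balanced accuracy matches the mean of sensitivity and specificity.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical toy labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A predictor that always outputs the majority class
y_pred = np.zeros_like(y_true)

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall on the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on the negative class

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("accuracy:          %.2f" % acc)
print("balanced accuracy: %.2f" % bal_acc)
print("mean of sens/spec: %.2f" % ((sensitivity + specificity) / 2))
```

Accuracy rewards the predictor for the many correctly classified negatives, whereas balanced accuracy gives both classes the same weight regardless of their size.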

Furthermore, we employ 5-fold cross-validation ([cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html))
to train a classifier and evaluate its performance on multiple splits (n=5) of the data. 

In [None]:
from sklearn.model_selection import cross_validate
from time import time

def train_evaluate(clf, title=None):
    start = time()
    ret = cross_validate(clf, df, y, cv=5,
                         scoring=["accuracy", 
                                  "balanced_accuracy",
                                  #"sensitivity", 
                                  #"specificity",
                                  ])

    if title is not None:
        print(title)
        print("="*(len(title)+1))
        
    print("  accuracy=%.2f" % ret['test_accuracy'].mean())
    print("  balanced accuracy=%.2f" % ret['test_balanced_accuracy'].mean())
    print("  training time=%.2fs" % (time() - start))
    #print("  sensitivity=%.2f" % ret['test_sensitivity'].mean())
    #print("  specificity=%.2f" % ret['test_specificity'].mean())
    

### **DummyClassifier**

We will compare the performance of the different classifiers.
We start with a dummy classifier that always predicts the majority class.

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy="most_frequent")
train_evaluate(clf, title="DummyClassifier")

$\Rightarrow$ Even though the accuracy is high, the balanced accuracy (i.e. the average of the sensitivity and specificity) is only 0.5, which is chance level for a binary problem.
This confirms that the DummyClassifier does not learn anything useful from the data.

### **Logistic regression**



In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000)

train_evaluate(clf)

$\Rightarrow$ The linear model performs slightly better, but it is still affected by the class imbalance.

### **Random forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42, n_jobs=2)
train_evaluate(clf)

$\Rightarrow$ This looks better, but the model still suffers from the imbalance.

### **Class weighting**

Some classifiers allow setting class weights to counteract the imbalance.
The "balanced" mode uses the values of y to automatically adjust the weights
inversely proportional to the class frequencies in the input data:

```python
n_samples / (n_classes * np.bincount(y))
```
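
As a worked example of this formula, the sketch below evaluates it on made-up labels with a 30:1 ratio and compares the result with scikit-learn's `compute_class_weight` helper. The array `y_toy` is an assumption for illustration only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels with a 30:1 class ratio
y_toy = np.array([0] * 30 + [1] * 1)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# Formula from the text: n_samples / (n_classes * np.bincount(y))
manual_weights = n_samples / (n_classes * np.bincount(y_toy))

# The same weights computed by scikit-learn
sk_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_toy), y=y_toy
)

print("manual: ", manual_weights)
print("sklearn:", sk_weights)
```

The minority class receives a weight 30 times larger than the majority class, so each class contributes the same total weight to the loss.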

In [None]:
print("LR with class weighting:")
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
train_evaluate(clf)

print("\nRF with class weighting:")
clf = RandomForestClassifier(class_weight="balanced", random_state=42, n_jobs=2)
train_evaluate(clf)

$\Rightarrow$ Class weighting helps with LR but not with RF.

### **Specialized RandomForest from imbalanced-learn**

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier

clf = BalancedRandomForestClassifier(random_state=42, n_jobs=2, 
                                     replacement=True,
                                     bootstrap=False,
                                     sampling_strategy="auto")
train_evaluate(clf)

### **Undersampling**

In [None]:
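from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline as make_imb_pipeline

# Resample inside an imbalanced-learn pipeline: only the training folds are
# undersampled, and df and y stay untouched for the following sections.
rus = RandomUnderSampler(random_state=42)

print("Undersampling + LR:")
clf = make_imb_pipeline(rus, LogisticRegression(max_iter=1000))
train_evaluate(clf)

print()
print("Undersampling + RF:")
clf = make_imb_pipeline(rus, RandomForestClassifier(random_state=42, n_jobs=2))
train_evaluate(clf)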

### **Oversampling**


In [None]:
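from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline as make_imb_pipeline

# Resample inside an imbalanced-learn pipeline: only the training folds are
# oversampled, and the test folds keep their original class distribution.
ros = RandomOverSampler(random_state=42)

print("Oversampling + LR:")
clf = make_imb_pipeline(ros, LogisticRegression(max_iter=1000))
train_evaluate(clf)

print()
print("Oversampling + RF:")
clf = make_imb_pipeline(ros, RandomForestClassifier(random_state=42, n_jobs=2))
train_evaluate(clf)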

### **SMOTE**

Use [SMOTE (Synthetic Minority Over-sampling Technique)](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to generate synthetic samples
for the minority class. SMOTE interpolates between existing samples to create new ones.
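
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (random toy points in `X_min`, `k = 5` neighbours, a single synthetic sample) and not the implementation used by imbalanced-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical minority-class samples (20 points, 2 features)
X_min = rng.normal(size=(20, 2))

# Find the k nearest minority neighbours of every minority sample.
# We ask for k + 1 neighbours because each point is its own nearest neighbour.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)

# Pick a random minority sample and one of its k neighbours
i = rng.integers(len(X_min))
j = rng.choice(idx[i, 1:])

# The synthetic sample lies at a random position on the segment between them
lam = rng.random()
x_new = X_min[i] + lam * (X_min[j] - X_min[i])

print("sample:   ", X_min[i])
print("neighbour:", X_min[j])
print("synthetic:", x_new)
```

Because every synthetic point lies between two existing minority samples, SMOTE fills in the minority region instead of duplicating samples as random oversampling does.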

In [None]:
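from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_imb_pipeline

# Resample inside an imbalanced-learn pipeline: synthetic samples are created
# from the training folds only and never enter the test folds.
smote = SMOTE(random_state=42)

print("SMOTE + LR:")
clf = make_imb_pipeline(smote, LogisticRegression(max_iter=1000))
train_evaluate(clf)

print()
print("SMOTE + RF:")
clf = make_imb_pipeline(smote, RandomForestClassifier(random_state=42, n_jobs=2))
train_evaluate(clf)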