# Class imbalance ⚖️

## What is it?

### Most machine learning algorithms assume that all misclassification errors made by a model are equal. This is often not the case for imbalanced classification problems where missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class. There are many real-world examples, such as detecting spam email, diagnosing a medical condition, or identifying fraud. In all of these cases, a false negative (missing a case) is worse or more costly than a false positive. 

### Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a model. It is closely related to imbalanced learning, which is concerned with classification on datasets with a skewed class distribution. As such, many concepts and techniques developed for cost-sensitive learning can be adopted for imbalanced classification problems. This lecture gives a gentle introduction to cost-sensitive learning for imbalanced classification.

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import lazypredict

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

In [4]:
df = pd.read_csv("../data/creditcard.zip", index_col=0)
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
df.shape

(284807, 28)

In [6]:
df.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V13', 'V15', 'V16', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23',
       'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')

## Why is it a Problem?

- Because the classifier has an "incentive" to predict class 0: that class simply shows up far more often.


In [7]:
df["Class"].value_counts() / df.shape[0]

0    0.998273
1    0.001727
Name: Class, dtype: float64

- Because of the class imbalance in the dataset, any model is exposed far more to the majority class ("non-event", "negative class" or "0") => predictions will be biased towards the majority class.
- Even a simple dummy model reflects this: claiming that there are NO fraud cases whatsoever implies an error rate of only about 0.17% (the share of class 1 shown above) => why invest in preparing data, training a model, evaluating it, deploying and maintaining it??

- In practice: consider the loss that one single fraud case can cause for a whole financial institution (e.g. correspondent LLPs), or for that matter: many small fraud cases accumulated. Even if the fraud is detected, financial institutions suffer from "reputation risk", which in that case can grow by orders of magnitude (leading, in the worst case, to bank runs).
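The dummy-model argument can be checked with a minimal sketch. It assumes synthetic labels drawn with roughly the same positive rate as the dataset (the names `X_demo`, `y_demo` and the sample size are made up for illustration) and uses scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Assumption: synthetic labels with about 0.17% positives, mimicking the ratio above
rng = np.random.default_rng(0)
n_samples = 100_000
y_demo = (rng.random(n_samples) < 0.001727).astype(int)
X_demo = np.zeros((n_samples, 1))  # features do not matter for a dummy model

# A "model" that always predicts the majority class ("no fraud")
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)
y_pred = dummy.predict(X_demo)

dummy_accuracy = accuracy_score(y_demo, y_pred)
dummy_recall = recall_score(y_demo, y_pred, zero_division=0)
print(f"accuracy: {dummy_accuracy:.4f}, recall: {dummy_recall:.4f}")
```

Accuracy looks excellent while recall shows that not a single fraud case is found, which is why accuracy alone is a poor guide here.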

## How can we deal with it?

### 1. Compare models and select best-performing one (?) 🤓
### 2. Use the right evaluation metrics 🕙 🕞 🕚
### 3. Resample data (you've already seen this!) 📲
### 4. Cost-sensitive learning 🔨💰

## After participating in this lecture, you will know:
### - Imbalanced classification problems often value false-positive classification errors differently from false negatives. 📲
### - Cost-sensitive learning is a subfield of machine learning that involves explicitly defining and using costs when training machine learning algorithms.💻
### - Cost-sensitive techniques may be divided into three groups, including data sampling, algorithm modifications, and ensemble methods. 📊

## Re 1: "lazypredict" 🤓

In [17]:
from lazypredict.Supervised import LazyClassifier

ModuleNotFoundError: No module named 'sklearn.utils.testing'

In [7]:
X = df.iloc[:,:-1] 
y = df["Class"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y) # remember: "stratify" makes sure that train data and test data have the same class proportions
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((213605, 27), (71202, 27), (213605,), (71202,))
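To see what `stratify` does, here is a small sketch on toy labels (the 2% positive rate and the names `X_toy`, `y_toy` are assumptions for illustration, not the credit card data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 samples with 2% positives (made-up numbers)
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=10, stratify=y_toy
)

# With stratification, both splits keep the original share of positives
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positives in train: {train_rate:.3f}, positives in test: {test_rate:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test split purely by chance.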

In [9]:
# Running all models

clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

ModuleNotFoundError: No module named 'sklearn.utils.testing'
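Since the lazypredict import fails in this environment, the idea of "compare several models" can still be sketched by hand. The following is not lazypredict's API, just a plain scikit-learn loop over three arbitrarily chosen classifiers on synthetic data (`make_classification` stands in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Assumption: a small synthetic imbalanced dataset (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=1
)

# Hypothetical shortlist of candidate models
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=1),
    "GaussianNB": GaussianNB(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, scoring="roc_auc", cv=cv)
    results[name] = scores.mean()

# Print the candidates from best to worst mean ROC AUC
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```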

## Re 4: Cost-sensitive Learning 🔨💰

### LogReg WITHOUT weights assigned to cost function - formally:

$$ \min \sum_{i=1}^{n} -\left( y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right) $$
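As a quick sanity check, the sum above can be evaluated by hand with NumPy and compared to scikit-learn's `log_loss` (with `normalize=False`, so that it returns the sum instead of the mean). The four labels and predicted probabilities are made-up toy values:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 0, 1])
y_hat = np.array([0.1, 0.2, 0.3, 0.6])

# Per-sample loss, term by term as in the formula
per_sample = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
manual_loss = per_sample.sum()

# Same quantity computed by scikit-learn (sum, not mean)
sklearn_loss = log_loss(y_true, y_hat, normalize=False)
print(manual_loss, sklearn_loss)
```

Every sample contributes with the same weight, no matter which class it belongs to.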

### LogReg WITHOUT weights assigned to cost function - defining, fitting and evaluating the model:

In [8]:
X = df.drop(["Class"], axis = 1)
X.columns
X.shape

y = df["Class"]
y.shape

(284807,)

In [None]:
model = LogisticRegression(solver="lbfgs")    # instantiating the model

In [None]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # x-validation

In [None]:
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv, n_jobs=-1)  # defining evaluation metric ("ROC AUC")

In [None]:
print('Mean ROC AUC: %.3f' % np.mean(scores))  # "mean" alone is not imported; use NumPy's

### Mean ROC AUC = 0.917...but: what DOES this mean? 

## Re 2: Use the right evaluation metrics 🕙 🕞 🕚

### You have already learned about accuracy, recall and precision. Therefore, we will focus on the ROC AUC here.
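One way to read the ROC AUC: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case (ties count half). The sketch below checks this on six made-up scores by counting pairs and comparing against `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and model scores (made-up values)
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc = roc_auc_score(y_true, scores)
print(pairwise_auc, auc)
```

Because only the ranking of scores matters, the ROC AUC does not depend on the share of positives in the way accuracy does.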

### LogReg WITH weights - formally:

$$ \min \sum_{i=1}^{n} -\left( w_{1} \, y_{i} \log(\hat{y}_{i}) + w_{0} \, (1 - y_{i}) \log(1 - \hat{y}_{i}) \right) $$
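A minimal sketch of what class weights do to the loss, assuming the weight of each sample is looked up by its true class (this is how scikit-learn's `class_weight` dictionary behaves). The labels, probabilities and the name `class_weights` are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

class_weights = {0: 0.01, 1: 1.0}  # majority class down-weighted
y_true = np.array([0, 0, 0, 1])
y_hat = np.array([0.1, 0.2, 0.3, 0.6])

# Weight per sample, chosen by the sample's true label
sample_w = np.array([class_weights[int(label)] for label in y_true])

per_sample = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
unweighted_loss = per_sample.sum()
weighted_loss = (sample_w * per_sample).sum()

# Share of the total loss that comes from the positive case
share_unweighted = per_sample[y_true == 1].sum() / unweighted_loss
share_weighted = (sample_w * per_sample)[y_true == 1].sum() / weighted_loss
print(f"positive share of loss: {share_unweighted:.2f} -> {share_weighted:.2f}")
```

With the weights applied, almost all of the loss comes from the single positive case, so the optimizer has far more reason to get it right.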

### LogReg WITH weights assigned to cost function - defining, fitting and evaluating the model:

In [None]:
weights = {0:0.01, 1:1.0} # note the weights assigned to loss function, whereby positive cases have a proportionally higher contribution to loss

In [None]:
model = LogisticRegression(solver="lbfgs", class_weight=weights) # note the explicit spec of the "weights" param, as opposed to above
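The weights do not have to be picked by hand. A common heuristic, sketched here on toy labels, is scikit-learn's "balanced" scheme, which sets each class weight to `n_samples / (n_classes * class_count)`. The 990/10 split is an assumption for illustration:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 990 negatives, 10 positives (made-up numbers)
y_toy = np.array([0] * 990 + [1] * 10)
classes = np.unique(y_toy)

balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
balanced_weights = dict(zip(classes.tolist(), balanced.tolist()))
print(balanced_weights)
```

Passing `class_weight="balanced"` to `LogisticRegression` applies the same heuristic directly. Only the ratio between the weights matters for how strongly the minority class is emphasized.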

In [None]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv, n_jobs=-1)  # defining evaluation metric ("ROC AUC")

In [None]:
print('Mean ROC AUC: %.3f' % np.mean(scores))  # "mean" alone is not imported; use NumPy's

### Clearly, weighting the cost function improves the performance of LogReg on this imbalanced dataset!
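To compare both variants side by side, here is a sketch on synthetic data (`make_classification` is a stand-in for the credit card data, and the dataset size and fold counts are arbitrary choices to keep it fast). Whether and by how much weighting helps depends on the dataset, so no scores are quoted here:

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Assumption: synthetic data with roughly 1% positives
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.99], random_state=4
)

models = {
    "unweighted": LogisticRegression(solver="lbfgs", max_iter=1000),
    "weighted": LogisticRegression(
        solver="lbfgs", max_iter=1000, class_weight={0: 0.01, 1: 1.0}
    ),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
mean_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, scoring="roc_auc", cv=cv)
    mean_auc[name] = mean(scores)
    print("%s: mean ROC AUC %.3f" % (name, mean_auc[name]))
```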

#### Compare a neural network in the same way => eventually, you will have to go to Google Colab to do this