# Class imbalance

## What is it?

In [11]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import lazypredict


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

In [12]:
df = pd.read_csv("../data/creditcard.zip", index_col=0)
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [13]:
df.shape

(284807, 28)

In [14]:
df["Class"].value_counts() / df.shape[0]

0    0.998273
1    0.001727
Name: Class, dtype: float64

## Why is it a Problem?

- A classifier that optimises overall accuracy has an "incentive" to predict class 0 almost always: class 0 makes up 99.8% of the rows, so ignoring class 1 costs almost nothing in accuracy.


## How can we deal with it?

1. Compare models and select best-performing one.
2. Use the right evaluation metrics.
3. Resample data. 

### re 1.:

#### Build a simple baseline model

#### At the end of May 2021, monthly customer churn = 5% (see EDA). "Positives" are clients that will have left the SB in three months' time. Therefore
* if one assumes that no client will have left the institution, i.e. there are 0 predicted "positives" => TP = 0, FP = 0, TN = 95%, FN = 5%

In [15]:
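TP = 0.00  # nobody is predicted to leave, so there are no true positives
FP = 0.00  # ... and no false positives either
TN = 0.95  # the 95% who stay are labelled correctly
FN = 0.05  # the 5% who leave are all missed

In [16]:
accuracy = (TP + TN) / (FP + FN + TP + TN)
accuracy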

#### In fact, you don't even need to do all this: by simply looking at the class shares above, you can state "there is no fraud" and conclude that accuracy is about 99.8% (since your FN are approx. 0.17%)
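To make the baseline argument concrete, here is a minimal sketch in plain Python. The labels are made up (2 frauds in 1,000 rows, not the credit card data), and `confusion_counts` is a hypothetical helper written only for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


# Made-up labels: 2 frauds in 1,000 transactions (not the real data)
y_true_toy = [1] * 2 + [0] * 998
y_pred_toy = [0] * len(y_true_toy)  # baseline: "there is no fraud"

tp, fp, tn, fn = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The baseline reaches an accuracy of 0.998 while catching none of the positives (recall 0), which is why accuracy alone is a poor guide on imbalanced data.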

#### Now, let's compare models to each other (as well as to the dummy model). The "lazypredict" library is meant to make this easier.

In [17]:
from lazypredict.Supervised import LazyClassifier

ModuleNotFoundError: No module named 'sklearn.utils.testing'

In [7]:
X = df.iloc[:, :-1]  # all rows (:), and all columns EXCEPT for the last one
# X = df[df.columns[df.columns != 'Class']]  # alternative

y = df["Class"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, stratify=y)  # remember: stratify=y keeps the class proportions the same in the train and test sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((213605, 27), (71202, 27), (213605,), (71202,))

In [9]:
# Running all models

clf = LazyClassifier(verbose=0,ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models

ModuleNotFoundError: No module named 'sklearn.utils.testing'
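The `ModuleNotFoundError` above most likely comes from an older `lazypredict` release importing `sklearn.utils.testing`, a module that newer scikit-learn versions no longer ship. As a stand-in, a small hand-written comparison loop gives the same kind of overview. The sketch below runs on synthetic data, and the three candidate models are an arbitrary choice for illustration, not the set `lazypredict` would have tried.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 3% positives)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=10
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=10, stratify=y_demo
)

# Arbitrary candidates, including a dummy baseline for reference
candidates = {
    "dummy (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=10),
}

rows = []
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)
    rows.append(
        {
            "model": name,
            "precision": precision_score(yd_test, pred, zero_division=0),
            "recall": recall_score(yd_test, pred, zero_division=0),
            "f1": f1_score(yd_test, pred, zero_division=0),
        }
    )

comparison = pd.DataFrame(rows).set_index("model")
print(comparison)
```

The dummy row is the reference point: any model worth keeping should beat its recall and F1 for the minority class.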

### Methods to improve: Use Undersampling

In [None]:
# !pip install imbalanced-learn

In [None]:
from imblearn.under_sampling import RandomUnderSampler, NearMiss

In [None]:
(y_train == 0).sum(), (y_train == 1).sum()

Instantiating the models:
- just declaring them with their hyperparameters.

In [None]:
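rus = RandomUnderSampler(sampling_strategy={0: 20_000}, random_state=10)
# keep only 20,000 randomly chosen data points of class 0 (the majority class)

nm = NearMiss(sampling_strategy={0: 20_000})
# keep the 20,000 class-0 data points that lie closest to class-1 points (NearMiss version 1 is the default)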

Actually do the transformation, i.e. resampling:

In [None]:
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
# fit_resample() is the imblearn method; conceptually similar to sklearn's .fit_transform(),
# except that it returns both the resampled X and the resampled y

X_train_nm, y_train_nm = nm.fit_resample(X_train, y_train) 



In [None]:
X_train_rus.shape #20,000 non-frauds and 369 frauds. As opposed to 213,000 non-frauds.
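The following cells rely on `rf` and `print_evaluations`, neither of which is defined in this part of the notebook. A minimal sketch of what they might look like is given below; the Random Forest hyperparameters and the exact set of metrics printed are assumptions, not the original definitions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Assumed hyperparameters; the original definition of `rf` is not shown
rf = RandomForestClassifier(n_estimators=100, random_state=10)


def print_evaluations(y_true, y_pred, model_name):
    """Print common classification scores for one model (hypothetical helper)."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    print(f"Scores for {model_name}:")
    for metric, value in scores.items():
        print(f"  {metric:<9} {value:.3f}")
    print(confusion_matrix(y_true, y_pred))
    return scores
```

Returning the scores as a dictionary is a convenience for collecting results across resampling methods; the calls in the notebook only need the printed output.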

In [None]:
# Exact same code as before, but this time we are training the Random Forest on the randomly undersampled / down-sampled data
rf.fit(X_train_rus, y_train_rus)
ypred_rus = rf.predict(X_test)

In [None]:
print_evaluations(y_test, ypred_rus, 'Random Undersampling')

In `sklearn`, there are 2 types of "models":
1. Transformative / Feature Engineering models
    - scalers
    - binners
    - interpolators
    - polynomial features
    - they all use the `.fit()` and the `.transform()` methods.
2. Predictive Models
    - Tree-based models
    - Linear Models
    - Support vector machines
    - they all use the `.fit()` and the `.predict()` methods.
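The two patterns can be seen side by side in a short sketch. The toy arrays below are made up for illustration. Note that `imblearn` samplers follow a third pattern, `.fit_resample()`, because they change the number of rows and therefore have to return `y` as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two features on very different scales
X_toy_train = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = np.array([[1.5, 190.0], [8.5, 15.0]])

# 1. Transformer: learn the scaling on the training data, then apply it
scaler = StandardScaler()
scaler.fit(X_toy_train)
X_toy_train_scaled = scaler.transform(X_toy_train)
X_toy_test_scaled = scaler.transform(X_toy_test)  # reuse the training statistics

# 2. Predictor: learn from features and labels, then predict new labels
model = LogisticRegression()
model.fit(X_toy_train_scaled, y_toy_train)
toy_pred = model.predict(X_toy_test_scaled)
print(toy_pred)
```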

In [None]:
# Exact same code as before, but this time we are training the Random Forest on the undersampled / down-sampled data using Near Miss
rf.fit(X_train_nm, y_train_nm)
ypred_nm = rf.predict(X_test)
print_evaluations(y_test, ypred_nm, 'Near Miss')

### Use Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE

In [None]:
(y_train == 0).sum(), (y_train == 1).sum()

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

#### Random over sampling with replacement

In [None]:
# Oversample to 20000
ros = RandomOverSampler(random_state=10, sampling_strategy={1: 20_000}) #"up-sampling" the minority class to have 20_000 samples (up from 369)

In [None]:
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

In [None]:
X_train_ros.shape, X_train.shape
# the oversampled dataframe now has 19,631 more rows than before (class 1 grows from 369 to 20,000 samples)

In [None]:
# Exact same code as before, but this time we are training the Random Forest on the oversampled / up-sampled data (sampled with replacement)
rf.fit(X_train_ros, y_train_ros)
ypred_ros = rf.predict(X_test)
print_evaluations(y_test, ypred_ros, 'Random Oversampling')

#### Synthetic Minority Over Sampling

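SMOTE creates a new data point from an existing minority-class sample: it picks one of that sample's nearest minority-class neighbours, multiplies the difference between the two by a random factor between 0 and 1, and adds the result to the original sample. The new point therefore lies on the line segment between the sample and its neighbour.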
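The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration with made-up points, not the `imblearn` implementation: the real `SMOTE` also searches for the nearest neighbours and picks one of them at random, which is skipped here. `smote_like_sample` is a hypothetical helper name.

```python
import numpy as np


def smote_like_sample(x, neighbour, rng):
    """Return a synthetic point on the segment between x and a neighbour."""
    lam = rng.random()  # random factor in [0, 1)
    return x + lam * (neighbour - x)


rng = np.random.default_rng(10)
x = np.array([1.0, 2.0])          # a minority-class sample (made up)
neighbour = np.array([3.0, 6.0])  # one of its nearest minority neighbours (made up)

x_new = smote_like_sample(x, neighbour, rng)
print(x_new)
```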

In [None]:
sm = SMOTE(sampling_strategy={1: 20000}, random_state=10)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)

rf.fit(X_train_smote, y_train_smote)
ypred_smote = rf.predict(X_test)
print_evaluations(y_test, ypred_smote, 'SMOTE')

How to use this stuff in this week's project:
- Still split your data into train/test!
- You still need to vectorize your data, e.g. `CountVectorizer` + `TfidfTransformer`
- Then you can apply any sampling technique you'd like in order to better balance the classes, on the training data only (this is part of feature engineering!)
- then validate your results!
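The order of these steps can be sketched as follows. The corpus is made up, and a hand-rolled random oversampler stands in for `RandomOverSampler` so that the example only needs NumPy and scikit-learn; in the project you would call the `imblearn` sampler's `fit_resample()` on the vectorized training data instead.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy corpus: 12 documents of class 0, 4 of class 1
texts = [f"love heart baby night song {i}" for i in range(12)]
texts += [f"storm thunder steel fire song {i}" for i in range(4)]
labels = np.array([0] * 12 + [1] * 4)

# 1. Split first, so the test set is never resampled
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=10, stratify=labels
)

# 2. Vectorize: fit on the training texts only, then transform both sets
vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
X_tr_vec = tfidf.fit_transform(vectorizer.fit_transform(text_train))
X_te_vec = tfidf.transform(vectorizer.transform(text_test))

# 3. Resample the training data only (random oversampling with replacement)
rng = np.random.default_rng(10)
majority_idx = np.flatnonzero(y_tr == 0)
minority_idx = np.flatnonzero(y_tr == 1)
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
keep_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_tr_bal, y_tr_bal = X_tr_vec[keep_idx], y_tr[keep_idx]

# 4. Train on the balanced data, validate on the untouched test set
text_clf = LogisticRegression()
text_clf.fit(X_tr_bal, y_tr_bal)
y_te_pred = text_clf.predict(X_te_vec)
print((y_tr_bal == 0).sum(), (y_tr_bal == 1).sum())
```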

Models you can try out this week:
- Some probabilistic models (e.g. Naive Bayes) are very sensitive to class imbalance, although they work well with many features.
- Logistic Regression and Random Forests cope fairly well with imbalanced classes, but resampling can still help.

Resampling / balancing your classes will most likely not hurt, and it often improves your model's performance on the minority class. Confirm this on the untouched test set.