<a href="https://colab.research.google.com/github/dk-wei/ml-algo-implementation/blob/main/Imbalanced_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### The Imbalanced Classification Notebook

Source YouTube video 🚀: [Aditya Lahiri: Dealing With Imbalanced Classes in Machine Learning](https://www.youtube.com/watch?v=6M2d2n-QXCc&t=1672s)

In [None]:
# importing our favorite libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score,accuracy_score

Let us first create an artificial imbalanced dataset.
We draw 5 features from a standard normal distribution and sample 1010 rows.

In [None]:
# fix the seed so the sampled data is reproducible
np.random.seed(1)
data = np.random.normal(0,1,(1010,5))

In [None]:
# make a dataframe
df = pd.DataFrame(data)

In [None]:
# column 5 holds the sum of the five features; it is only used to derive the label
df[5] = df[0] + df[1] + df[2] + df[3] + df[4]

In [None]:
df[df[5]>4].shape

(41, 6)
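
As a rough sanity check on that count (a sketch, not part of the original notebook): the sum of 5 independent standard normal features is itself normal with mean 0 and variance 5, so the share of rows whose sum exceeds 4 can be estimated from the normal tail. The variable names below are illustrative.

```python
import math

n_rows = 1010       # rows sampled in the notebook
n_features = 5      # independent standard normal features per row
threshold = 4.0     # a row is labelled 1 when its feature sum exceeds this

# The sum of n_features independent N(0, 1) draws is N(0, n_features).
sigma = math.sqrt(n_features)

# Upper-tail probability P(sum > threshold) via the complementary error function.
p_positive = 0.5 * math.erfc(threshold / (sigma * math.sqrt(2)))
expected_positives = n_rows * p_positive

print(f"expected positive rate: {p_positive:.4f}")
print(f"expected positive rows: {expected_positives:.1f}")
```

The estimate comes out at a few percent of the rows, the same order as the 41 positives observed with this seed.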

In [None]:
# label a row 1 (minority class) when its feature sum exceeds 4, else 0
label = (df[5] > 4).astype(int)

In [None]:
df[6] = label

In [None]:
df.drop(columns=[5],inplace=True)
df.rename(columns={6:'label'},inplace=True)

In [None]:
df.head()

          0         1         2         3         4  label
0  1.624345 -0.611756 -0.528172 -1.072969  0.865408      0
1 -2.301539  1.744812 -0.761207  0.319039 -0.249370      0
2  1.462108 -2.060141 -0.322417 -0.384054  1.133769      0
3 -1.099891 -0.172428 -0.877858  0.042214  0.582815      0
4 -1.100619  1.144724  0.901591  0.502494  0.900856      0


In [None]:
df['label'].value_counts()

0    969
1     41
Name: label, dtype: int64

In [None]:
y = df['label']
X = df.drop(['label'],axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42,stratify=y)
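
The `stratify=y` argument is what keeps the rare class present in both splits at about the same rate. Here is a minimal sketch on stand-in labels with the same class counts as above (969 zeros, 41 ones). The array names are illustrative and the features are dummies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the notebook's class counts: 969 negatives, 41 positives.
y_demo = np.array([0] * 969 + [1] * 41)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single-column features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()

print(f"positive rate overall: {overall_rate:.4f}")
print(f"positive rate train:   {train_rate:.4f} ({y_tr.sum()} of {len(y_tr)})")
print(f"positive rate test:    {test_rate:.4f} ({y_te.sum()} of {len(y_te)})")
```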

###  First let us train and validate a standard XGBoost classifier on this dataset

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb = XGBClassifier(seed=42)

In [None]:
xgb.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
              silent=None, subsample=1, verbosity=1)

In [None]:
print(f1_score(y_test,xgb.predict(X_test)))
print(accuracy_score(y_test,xgb.predict(X_test)))

0.6956521739130435
0.9790419161676647
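
To see how these two numbers relate, here is a hedged reconstruction: one confusion matrix that is consistent with both printed scores, assuming the 334-row test split contains 14 positives (which is what a stratified 33% split of 41 positives gives). The counts are inferred from the scores and are not notebook output.

```python
# Counts inferred from the printed scores (an assumption, not notebook output).
tp, fn, fp, tn = 8, 6, 1, 319

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

Under these counts the model misses 6 of the 14 positives, yet accuracy barely notices because the 319 true negatives dominate.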


Not that great, right?
We get a decent F1 score and a very high accuracy (sounds good, doesn't work!).
Let us now see how much of a difference using scale_pos_weight makes in balancing out the effect of the skewed classes.

In [None]:
# ratio of negative to positive training rows, truncated to an integer
class_weight = int(y_train.value_counts()[0]/y_train.value_counts()[1])

In [None]:
class_weight

24
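
Why this ratio? A rough way to think about scale_pos_weight is as a per-row weight on the positive class in the training loss. The sketch below uses plain NumPy rather than XGBoost, with stand-in labels for a training split of 649 negatives and 27 positives (assumed counts, consistent with the ratio of 24 above). It shows that weighting positives by the negative-to-positive ratio gives both classes roughly equal total weight.

```python
import numpy as np

# Stand-in training labels: 649 negatives and 27 positives (assumed counts).
y_demo = np.array([0] * 649 + [1] * 27)

n_neg = int((y_demo == 0).sum())
n_pos = int((y_demo == 1).sum())
ratio = n_neg / n_pos            # negative-to-positive ratio before truncation
pos_weight = int(ratio)          # same truncation as in the cell above

# Per-row weights: each positive counts pos_weight times, each negative once.
row_weights = np.where(y_demo == 1, pos_weight, 1)

total_pos_weight = int(row_weights[y_demo == 1].sum())
total_neg_weight = int(row_weights[y_demo == 0].sum())

print(f"ratio={ratio:.2f}, pos_weight={pos_weight}")
print(f"weighted positives={total_pos_weight}, weighted negatives={total_neg_weight}")
```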

In [None]:
xgb = XGBClassifier(scale_pos_weight=class_weight,seed=42)

In [None]:
xgb.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=24, seed=42,
              silent=None, subsample=1, verbosity=1)

In [None]:
print(f1_score(y_test,xgb.predict(X_test)))
print(accuracy_score(y_test,xgb.predict(X_test)))

0.8148148148148148
0.9850299401197605


Wow, that worked! 
We just made quite a leap in the F1 score! That really did do wonders. The accuracy also improved a tad, but we don't want to read too much into it because accuracy is a misleading metric on data this skewed.
Why don't we take a little detour and show how useless accuracy really is here?


In [None]:
print(accuracy_score(y_test,[0 for i in range(len(X_test)) ] ))

0.9580838323353293


We just predicted all zeros(!) (the majority class) without any sort of model, and that gave us an accuracy of about 0.958 (95.8%). Now we know why accuracy is not the metric to rely on here.
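
For contrast, a small sketch of how the same all-zeros prediction scores on metrics that account for the minority class. It uses stand-in labels (320 negatives and 14 positives, an assumption matching the accuracy printed above) rather than the notebook's `y_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Stand-in test labels: 320 negatives and 14 positives (assumed counts).
y_true = np.array([0] * 320 + [1] * 14)
y_pred = np.zeros_like(y_true)   # always predict the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning: no positives are predicted at all.
f1 = f1_score(y_true, y_pred, zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print(f"accuracy={acc:.4f} f1={f1:.4f} balanced_accuracy={bal_acc:.4f}")
```

The F1 score drops to zero and balanced accuracy to chance level, which is the signal that plain accuracy hides.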

### Imblearn
Let's try out the balanced version of the famous Random Forest classifier from imblearn and see how it fares.
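
The core idea of a balanced random forest is that each tree is trained on a bootstrap sample in which the majority class has been randomly under-sampled. The following is a simplified NumPy illustration of that idea, not imblearn's actual implementation. The helper `balanced_bootstrap_indices` and the stand-in labels are hypothetical.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Draw one bootstrap sample with equal counts per class (hypothetical helper)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the rarest class
    picked = [
        # sample row positions of class c with replacement
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(picked)

rng = np.random.default_rng(0)
y_demo = np.array([0] * 649 + [1] * 27)   # stand-in imbalanced labels
idx = balanced_bootstrap_indices(y_demo, rng)
sample_counts = np.bincount(y_demo[idx])  # rows drawn per class

print(sample_counts)
```

Each tree then sees as many majority rows as minority rows, at the cost of using only a small slice of the majority class per tree.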

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier



In [None]:
brf = BalancedRandomForestClassifier(n_estimators=300,random_state=0)

In [None]:
brf.fit(X_train,y_train)



ValueError: ignored

In [None]:
print(f1_score(y_test,brf.predict(X_test)))

IndexError: ignored

### Neat!
We scored a decent F1 score using the imblearn version of Random Forest, which was created to handle imbalanced classes. Let us now see how the vanilla sklearn Random Forest implementation fares on the same task.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=300)

In [None]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
print(f1_score(y_test,rf.predict(X_test)))

0.35294117647058826


### Vanilla Random Forests don't quite kill it.
The vanilla implementation wasn't built to deal with skewed classes and its results show that.
What if we use the hyperparameters of the vanilla implementation to let it know that we are dealing with skewed classes here?

In [None]:
rf = RandomForestClassifier(n_estimators=300,class_weight= {0:1,1:class_weight})

In [None]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 1, 1: 24}, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       max_samples=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=300, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)

In [None]:
print(f1_score(y_test,rf.predict(X_test)))

0.5263157894736842


### Improvement!
That did much better than the vanilla Random Forest without the class_weight parameter set.
This also did better than the highly specialised BalancedRandomForestClassifier of imblearn.
However, in practice the two run neck and neck, and it is better to try out both before going ahead with one.
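
As an aside, scikit-learn can derive comparable weights itself. This sketch uses `compute_class_weight` with the "balanced" heuristic on stand-in labels (649 negatives and 27 positives, assumed counts) and compares the result with the manual `{0: 1, 1: 24}` dictionary. Passing `class_weight="balanced"` to `RandomForestClassifier` applies the same heuristic.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in training labels: 649 negatives and 27 positives (assumed counts).
y_demo = np.array([0] * 649 + [1] * 27)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_of_class).
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weights = dict(zip(classes.tolist(), balanced.tolist()))

# Only the relative weight matters when comparing with {0: 1, 1: 24}.
relative = weights[1] / weights[0]

print(weights)
print(f"positive class weighted {relative:.2f}x relative to negative")
```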

### Bad splitting in action!

Many datasets have rows sorted by class.
In that case, all class 0 rows come first, followed by the class 1 rows.
Let's create that scenario artificially first.

In [None]:
df =  df.sort_values(by=['label'])

In [None]:
X = df.drop(['label'],axis=1)
y = df['label']

### Now let us make the rookie mistake of splitting the data into train and validation via a traditional 70-30 split but **without randomization** ...

In [None]:
# take the first 70% of rows as train and the rest as test, with no shuffling
split = int(0.7*len(X))

X_train = X[:split]
y_train = y[:split]

X_test = X[split:]
y_test = y[split:]

### Time to train a model on this seemingly perfect split.

In [None]:
brf = BalancedRandomForestClassifier(n_estimators=500,n_jobs=-1)

In [None]:
brf.fit(X_train, y_train)

ValueError: ignored

### Boom, an error!

We got an error here that y_train, the target label list, has only one class! How could this happen? Let's double check.

In [None]:
y_train.max()

0

### We now know why!
When we split, none of the training rows had class 1. Why?
Because class 1 is the minority and all of its rows were at the end of the dataset.
The train split took the **top** 70% of the data points, and alas, none of them had class 1.
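
A minimal sketch of the failure and one common remedy, using stand-in sorted labels with the same class counts (969 zeros followed by 41 ones). The array names are illustrative. A shuffled, stratified `train_test_split` keeps both classes in the training set, while the head-of-file slice does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels sorted by class: 969 negatives first, then 41 positives.
y_sorted = np.array([0] * 969 + [1] * 41)
X_sorted = np.arange(len(y_sorted)).reshape(-1, 1)  # dummy features

# Naive split: first 70% of rows, no shuffling.
split = int(0.7 * len(y_sorted))
y_train_naive = y_sorted[:split]

# Shuffled split that preserves the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sorted, y_sorted, test_size=0.3, random_state=42, stratify=y_sorted
)

naive_classes = np.unique(y_train_naive).tolist()
stratified_classes = np.unique(y_tr).tolist()

print("classes in naive train split:     ", naive_classes)
print("classes in stratified train split:", stratified_classes)
```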