# IMBALANCED CLASSES

Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations in each class. Class imbalance can be found in many different areas, including medical diagnosis, spam filtering, and fraud detection.

In this guide, we’ll look at one possible way to handle an imbalanced class problem: the SMOTE method.


Important Note: This guide will focus solely on addressing imbalanced classes and will not address other important machine learning steps including, but not limited to, feature selection or hyperparameter tuning.

## Data

We will use the Credit Card Fraud Detection Dataset available on Kaggle. 

https://www.kaggle.com/mlg-ulb/creditcardfraud/home



In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score

In [8]:
# setting up default plotting parameters
%matplotlib inline

plt.rcParams['figure.figsize'] = [20.0, 7.0]
plt.rcParams.update({'font.size': 22,})

sns.set_palette('viridis')
sns.set_style('white')
sns.set_context('talk', font_scale=0.8)

In [9]:
# read in data
df = pd.read_csv('creditcard.csv')

print(df.shape)
df.head()

(284807, 31)


|   | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [10]:
print(df.Class.value_counts())

0    284315
1       492
Name: Class, dtype: int64


In [11]:
# fraud (Class == 1) count as a percentage of the non-fraud (Class == 0) count
(len(df.loc[df.Class==1])) / (len(df.loc[df.Class == 0])) * 100

0.17304750013189596

The dataset is highly imbalanced, with only 0.17% of transactions being classified as fraudulent.
Our objective will be to correctly classify the minority class of fraudulent transactions.
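
The figure printed above divides the fraud count by the non-fraud count, which is a minority-to-majority ratio rather than a share of all transactions. A minimal sketch of both quantities, using only the class counts reported above (the variable names are illustrative):

```python
# Class counts as reported by df.Class.value_counts() above
n_legit = 284315
n_fraud = 492

# Minority-to-majority ratio, as a percentage (what the cell above prints)
ratio_pct = n_fraud / n_legit * 100

# Share of all transactions that are fraudulent, as a percentage
share_pct = n_fraud / (n_legit + n_fraud) * 100

print(f"fraud : non-fraud ratio = {ratio_pct:.4f}%")
print(f"fraud share of total    = {share_pct:.4f}%")
```

Both values round to 0.17% here because the minority class is so small, but the two diverge quickly as classes become more balanced.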

## The Problem with Imbalanced Classes

Most machine learning algorithms work best when the number of samples in each class is about equal. This is because most algorithms are designed to maximize accuracy and reduce error, and on imbalanced data a model can reach high accuracy simply by favoring the majority class.

## The Problem with Accuracy

Here we can use the DummyClassifier to always predict “not fraud” just to show how misleading accuracy can be.

In [12]:
# Separate input features and target
y = df.Class
X = df.drop('Class', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

# DummyClassifier to predict only target 0
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# checking unique labels
print('Unique predicted labels: ', (np.unique(dummy_pred)))

# checking accuracy
print('Test score: ', accuracy_score(y_test, dummy_pred))

Unique predicted labels:  [0]
Test score:  0.9981461194910255


We got an accuracy score of 99.8%, and without even training a model! Let’s compare this to logistic regression, an actual trained classifier.

In [14]:
# Modeling the data as is
# Train model
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
 
# Predict on test set
lr_pred = lr.predict(X_test)

# Checking accuracy
accuracy_score(y_test, lr_pred)


0.9992135052386169

In [15]:
# Checking unique values
predictions = pd.DataFrame(lr_pred)
predictions[0].value_counts()

0    71108
1       94
Name: 0, dtype: int64

Our accuracy score barely changed compared to the dummy classifier above: 0.9992 versus 0.9981. A trained model that looks almost identical to a constant prediction tells us that either we did something wrong in our logistic regression model, or that accuracy might not be our best option for measuring performance.
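
To see why the two accuracy scores are so close, it helps to recompute them by hand. The sketch below assumes the test-set counts implied by the logistic regression confusion matrix reported later in this guide (71,070 legitimate and 132 fraudulent transactions); the helper name `accuracy` is illustrative.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tn + tp) / (tn + fp + fn + tp)

# Test-set class sizes (row sums of the confusion matrix in this guide)
n_legit, n_fraud = 71070, 132

# Always predicting "not fraud": every legitimate case is right, every fraud is missed
dummy_acc = accuracy(tn=n_legit, fp=0, fn=n_fraud, tp=0)

# Logistic regression counts from the confusion matrix in this guide
lr_acc = accuracy(tn=71061, fp=9, fn=47, tp=85)

print(f"dummy accuracy: {dummy_acc:.4f}")
print(f"logistic regression accuracy: {lr_acc:.4f}")
print(f"difference: {lr_acc - dummy_acc:.4f}")
```

The gap is about a tenth of a percentage point, even though the dummy classifier catches no fraud at all and the logistic regression catches 85 of 132 cases.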

Let’s take a look at some popular methods for dealing with class imbalance.

## Change the performance metric

As we saw above, accuracy is not the best metric to use when evaluating imbalanced datasets as it can be very misleading. Metrics that can provide better insight include:

* Confusion Matrix: a table showing correct predictions and types of incorrect predictions.

* Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.

* Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.

* F1 Score: the harmonic mean of precision and recall.
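
In terms of confusion-matrix counts, the last three metrics can be written as follows, where TP, FP, and FN denote true positives, false positives, and false negatives (standard notation, not symbols used elsewhere in this guide):

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well by excelling at only one of them.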



Let’s see what happens when we apply the F1 and recall scores to our logistic regression from above.

In [16]:
# Checking accuracy
accuracy_score(y_test, lr_pred)

0.9992135052386169

In [17]:
# f1 score
f1_score(y_test, lr_pred)


0.7522123893805309

In [18]:
# recall score
recall_score(y_test, lr_pred)

0.6439393939393939

In [19]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, lr_pred))

|   | 0 | 1 |
|---|---|---|
| 0 | 71061 | 9 |
| 1 | 47 | 85 |


We have a very high accuracy score of 0.999 but an F1 score of only 0.752. And from the confusion matrix, we can see that we miss 47 of the 132 fraudulent transactions in the test set, leading to a recall score of only 0.64.
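
The scores quoted above can be reproduced directly from the four cells of the confusion matrix. A minimal sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative:

```python
# Confusion matrix for the logistic regression model, as printed above
# Layout: [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class)
matrix = [[71061, 9],
          [47, 85]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)   # share of fraud predictions that are real fraud
recall = tp / (tp + fn)      # share of real fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
print(f"missed fraud cases: {fn} of {tp + fn}")
```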

## Generate synthetic samples

A technique similar to upsampling is to create synthetic samples. 

Here we will use imblearn’s SMOTE, or Synthetic Minority Oversampling Technique.

SMOTE uses a nearest-neighbors algorithm to generate new, synthetic minority-class samples that we can use for training our model.

It’s important to generate the new samples only in the training set, so that the test set contains only real observations and still measures how well our model generalizes to unseen data.
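
To make the nearest-neighbors idea concrete, here is a simplified sketch of the interpolation step at the heart of SMOTE. It is an illustration only, not imblearn's implementation: the function name `smote_sketch`, its parameters, and the toy data are all made up for this example.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point somewhere on the segment base -> neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy 2-D minority class (made-up values, not from the credit card data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=3, k=2)
for point in new_points:
    print(point)
```

Each synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies instead of duplicating existing rows.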

In [50]:
from imblearn.over_sampling import SMOTE

# Separate input features and target
y = df.Class
X = df.drop('Class', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

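# oversample the training set only; recent imblearn releases use
# sampling_strategy / fit_resample in place of the older ratio / fit_sample
sm = SMOTE(random_state=27, sampling_strategy=0.01)
X_train, y_train = sm.fit_resample(X_train, y_train)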
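
The 0.01 passed to SMOTE above is a target ratio of minority to majority samples after resampling, not a fraction of new rows to add. The rough arithmetic below assumes the training split holds 213,245 legitimate and 360 fraudulent transactions (the full-dataset counts minus the test-set counts shown in this guide's confusion matrices); the exact rounding imblearn applies is an assumption here.

```python
# Assumed training-set class counts: full dataset minus the 25% test split
n_legit_train = 284315 - 71070   # 213245
n_fraud_train = 492 - 132        # 360

target_ratio = 0.01  # desired fraud : non-fraud ratio after oversampling

# Approximate minority count needed to reach the target ratio
# (truncating to an integer is an assumption about imblearn's rounding)
n_fraud_target = int(target_ratio * n_legit_train)
n_synthetic = n_fraud_target - n_fraud_train

print(f"fraud samples before SMOTE: {n_fraud_train}")
print(f"fraud samples after SMOTE:  about {n_fraud_target}")
print(f"synthetic samples added:    about {n_synthetic}")
```

Even after oversampling, fraud remains roughly 1% of the training data, so this is a mild rebalancing rather than a 50/50 split.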



After generating our synthetic data points, let’s see how our logistic regression performs.

In [51]:
smote = LogisticRegression(solver='liblinear').fit(X_train, y_train)

smote_pred = smote.predict(X_test)

# Checking accuracy
accuracy_score(y_test, smote_pred)

0.9989326142524086

In [52]:
# f1 score
f1_score(y_test, smote_pred)

0.7342657342657343

In [53]:
recall_score(y_test, smote_pred)

0.7954545454545454

In [54]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, smote_pred))

|   | 0 | 1 |
|---|---|---|
| 0 | 71021 | 49 |
| 1 | 27 | 105 |


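Our recall increased from 0.64 to 0.80, so we now detect 105 of the 132 fraudulent cases instead of 85. This comes at the cost of more false positives (49 versus 9) and a slightly lower F1 score (0.734 versus 0.752).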
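
The two confusion matrices also show what the extra recall costs. A short sketch comparing the baseline and SMOTE-trained models cell by cell (the dictionary layout and names are illustrative; precision is computed here even though the guide does not report it):

```python
# Confusion-matrix cells reported in this guide for each model
models = {
    "baseline": {"tn": 71061, "fp": 9, "fn": 47, "tp": 85},
    "smote": {"tn": 71021, "fp": 49, "fn": 27, "tp": 105},
}

for name, m in models.items():
    precision = m["tp"] / (m["tp"] + m["fp"])
    recall = m["tp"] / (m["tp"] + m["fn"])
    print(f"{name:8s} precision={precision:.3f} recall={recall:.3f}")

# Net effect of training on SMOTE-resampled data
extra_caught = models["smote"]["tp"] - models["baseline"]["tp"]
extra_alarms = models["smote"]["fp"] - models["baseline"]["fp"]
print(f"additional fraud cases caught: {extra_caught}")
print(f"additional false alarms: {extra_alarms}")
```

Whether that trade is worthwhile depends on the relative cost of a missed fraud versus a false alarm in the application at hand.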

## Conclusion

SMOTE is just one of many possible methods to try when dealing with imbalanced datasets.
Other options to consider include collecting more data, choosing different resampling ratios, or even changing the algorithm.

You should always try several approaches and then decide which is best for your problem.