## Predicting Imbalanced Classes

#### Introduction
The inspiration behind this exercise came from a task I failed horribly during a job interview process. The goal was to accurately predict a binary outcome in which the positive case occurred 0.0005% of the time (1/200,000). The dataset consisted of around 6.5M rows and 503 input columns. They gave me a two-hour time limit, and I didn't do very well at all. I won't let that happen again.

This exercise roughly follows a walkthrough published by EliteDataScience titled "How to Handle Imbalanced Classes in Machine Learning," written on July 5, 2017 and available at https://elitedatascience.com/imbalanced-classes. Their walkthrough didn't use a train/test split and assessed each model's performance using accuracy alone. I did use a train/test split, and I explored model performance more thoroughly using scikit-learn's `classification_report`, found in `sklearn.metrics`.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier


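# Widen the notebook container (display and HTML are imported from IPython.display;
# importing them from IPython.core.display is deprecated)
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))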

#### Data
The data for this exercise comes from the UCI Machine Learning Repository. It is a synthetic dataset that uses four input variables to predict whether a scale is balanced. The output in the original data had three labels: leaning left, balanced, and leaning right. To turn this into a binary classification problem, I simply focused on whether the final result was balanced. This outcome happens only 8% of the time. 

In [3]:
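# Read and glimpse data
# Note: the file has no header row, so read_csv uses the first record (B,1,1,1,1)
# as column names; passing header=None would keep that row as data.
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data")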
df.head()

Unnamed: 0,B,1,1.1,1.2,1.3
0,R,1,1,1,2
1,R,1,1,1,3
2,R,1,1,1,4
3,R,1,1,1,5
4,R,1,1,2,1


In [4]:
# Change outcome to binary variable
df.columns = ['balance', 'var1', 'var2', 'var3', 'var4']
df['balance'] = [1 if b == 'B' else 0 for b in df.balance]
df['balance'].value_counts()

0    576
1     48
Name: balance, dtype: int64

In [5]:
%%capture --no-stdout

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis = 1)

# Partition into test/train split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 314
)

# Reconnect x and y for each set to make resampling easier
train = X_train.copy()
train['balance'] = y_train
test = X_test.copy()
test['balance'] = y_test
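
One optional refinement, not used in the split above: with so few positive cases, a purely random split can leave the train and test sets with noticeably different class ratios. The sketch below shows how the `stratify` argument of `train_test_split` keeps the ratio roughly equal in both partitions. The arrays `X_demo` and `y_demo` are hypothetical stand-ins built to match the class counts above, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 624 rows, 48 positives (about 8%), matching the counts above
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(624, 4))
y_demo = np.array([1] * 48 + [0] * 576)

# stratify=y_demo asks for (approximately) the same class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=314, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Whether stratifying is appropriate depends on the problem; it mainly helps keep evaluation metrics for the rare class from swinging with the luck of the split.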

In [6]:
# Train model
clf_lr = LogisticRegression().fit(X_train, y_train)
 
# Predict on test set
pred_y_lr = clf_lr.predict(X_test)

# Observe model performance
print('Logistic Regression')
print(classification_report(y_test, pred_y_lr))

#### Imbalanced Classes: Problems and Solutions
Here we see the issue with imbalanced classification. The accuracy for this model is an impressive 89%. Amazing! A novice data scientist might see these results, communicate them to stakeholders, and get the model into production. Doing so would be a huge mistake. Further examination shows that the model predicts the negative class 100% of the time. Such a model is completely useless, and implementing it would be equivalent to saying "We should plan on this event never happening."
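
The point can be checked with a few lines of arithmetic. The sketch below uses the test-set class counts that appear in the support column of the classification reports in this notebook (167 negatives, 21 positives) and scores a "model" that always predicts the negative class. The variable names are illustrative only.

```python
# Test-set class counts, taken from the support column of the classification reports
n_negative, n_positive = 167, 21

y_true = [0] * n_negative + [1] * n_positive
y_pred = [0] * len(y_true)  # a "model" that always predicts the negative class

# Accuracy: share of all rows predicted correctly
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Positive-class recall: share of actual positives that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_positive = true_positives / n_positive

print(f"accuracy: {accuracy:.3f}")
print(f"positive recall: {recall_positive:.3f}")
```

The accuracy simply equals the share of negatives in the test set, while recall on the positive class is zero.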

But what can we do? Sit around and wait for more data, until the rare event has happened often enough to give a moderately strong signal? That might be the best option depending on the importance of the model, but chances are you won't want to wait that long.

In approaching this problem, we have two paths to a solution. The first is to adjust the data. We can artificially inflate the number of positive cases (upsampling) to strengthen the signal, or we can reduce the number of negative cases (downsampling) to minimize the noise. Upsampling the positive class is the more common choice, as it doesn't drastically reduce the size of the dataset. Upsampling is performed either by sampling from the positive cases with replacement until a sufficient sample size is collected, or by artificially creating additional observations using methods such as SMOTE or ADASYN. For simplicity's sake, I will stick to sampling with replacement in this exercise.
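
For readers curious about the synthetic alternative, here is a rough sketch of the core idea behind SMOTE: each new row is placed at a random point on the line segment between a minority row and one of its nearest minority neighbours. This is a simplified illustration, not the reference algorithm or any library's implementation. The function name `smote_like`, its parameters, and the rows in `minority_demo` are all hypothetical.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)  # random base rows
    pick = rng.integers(0, k, size=n_new)              # which neighbour to use
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                       # position along the segment
    return minority[base] + gap * (minority[partner] - minority[base])

# Hypothetical minority rows with four features on a 1-5 scale
minority_demo = np.array(
    [[1, 1, 1, 1], [2, 2, 2, 2], [1, 2, 2, 1], [3, 1, 1, 3], [2, 4, 4, 2]]
)
synthetic = smote_like(minority_demo, n_new=10)
print(synthetic.shape)
```

Unlike sampling with replacement, this produces rows that are not exact duplicates, at the cost of inventing feature values that never occurred in the data.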

#### Solution 1: Resampling

In [7]:
# Separate majority and minority classes
train_majority = train[train.balance == 0]
train_minority = train[train.balance == 1]
 

In [8]:
# Upsample minority class
train_minority_upsampled = resample(
    train_minority, 
    replace = True,     # sample with replacement
    n_samples = train_majority.shape[0],    # to match majority class
    random_state = 123
)
 
# Combine majority class with upsampled minority class
train_upsampled = pd.concat([train_majority, train_minority_upsampled])

# Logistic regression fit on the upsampled (minority) training data
y = train_upsampled.balance
X = train_upsampled.drop('balance', axis = 1)

clf_lr_min = LogisticRegression().fit(X, y)

In [9]:
# Downsample majority class (without replacement) to match minority class
train_majority_downsampled = resample(
    train_majority,
    replace = False,
    n_samples = train_minority.shape[0],
    random_state = 123
)

# Combine minority class with downsampled majority class
train_downsampled = pd.concat([train_minority, train_majority_downsampled])

# Display new class counts
train_downsampled.balance.value_counts()

1    27
0    27
Name: balance, dtype: int64

In [10]:
# Logistic regression fit on the downsampled (majority) training data
y = train_downsampled.balance
X = train_downsampled.drop('balance', axis = 1)

clf_lr_maj = LogisticRegression().fit(X, y)

#### Solution 2: Model Selection
I mentioned earlier that there are two approaches to handling imbalanced data. We just walked through the first option, which is to adjust the data on which a model is fitted. The second option is to use a model that better handles imbalanced data. Two model types that often perform well on imbalanced classes are support vector machines (SVMs) and tree-based models such as random forests. SVMs can be adjusted to penalize mistakes on the minority class in proportion to how under-represented it is.
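
To make the penalty concrete, the sketch below works out the weights implied by `class_weight='balanced'`, which scikit-learn documents as `n_samples / (n_classes * np.bincount(y))`. The training-set counts are derived rather than printed anywhere in this notebook: 27 positives (from the downsampling output) and 436 training rows (624 rows minus the 188 test rows), leaving 409 negatives. Treat them as an assumption.

```python
import numpy as np

# Derived training-set class counts (assumption): 409 negatives, 27 positives
y_counts = np.array([409, 27])

n_samples = y_counts.sum()
n_classes = len(y_counts)

# scikit-learn's documented "balanced" heuristic:
#   n_samples / (n_classes * np.bincount(y))
class_weights = n_samples / (n_classes * y_counts)
print(class_weights.round(3))
```

Under these counts, a mistake on a positive case costs about 15 times as much as a mistake on a negative case, which is exactly the ratio of negatives to positives.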

In [11]:
# Penalized SVM
y = y_train
X = X_train
 
clf_sv = SVC(
    kernel = 'linear', 
    class_weight = 'balanced', # penalize
    probability = True
)
 
clf_sv.fit(X, y)


SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
# Random forest
clf_rf = RandomForestClassifier().fit(X, y)

After pursuing both types of corrections to adjust for an imbalanced output, we have four potential models to use. We can determine which model fits best by observing its performance on the testing data set aside earlier. Rather than accuracy, imbalanced class predictors are more effectively compared using measures such as precision, recall, and the f1-score, all of which are computed by scikit-learn's `classification_report`.
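
As a reminder of what these measures mean, here is a minimal sketch that computes them for the positive class from raw confusion-matrix counts. The helper `precision_recall_f1` and the counts are illustrative assumptions, not output from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall and F1 for the positive class from raw counts.
    Each measure is defined here as 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts (hypothetical): true/false positives, false/true negatives
tp, fp, fn, tn = 8, 67, 13, 100

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Precision asks how many predicted positives were real, recall asks how many real positives were found, and the f1-score is their harmonic mean. None of the three rewards a model for correctly labelling the abundant negative class, which is why they are more informative than accuracy here.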

In [13]:
# Make predictions
pred_y_lr = clf_lr.predict(X_test)
pred_y_lr_min = clf_lr_min.predict(X_test)
pred_y_lr_maj = clf_lr_maj.predict(X_test)
pred_y_sv = clf_sv.predict(X_test)
pred_y_rf = clf_rf.predict(X_test)

print('Logistic Regression')
print(classification_report(y_test, pred_y_lr))

print('Logistic Regression Minority Upsample')
print(classification_report(y_test, pred_y_lr_min))

print('Logistic Regression Majority Downsample')
print(classification_report(y_test, pred_y_lr_maj))

print('Penalized SVM')
print(classification_report(y_test, pred_y_sv))

print('Random Forest')
print(classification_report(y_test, pred_y_rf))


Logistic Regression
             precision    recall  f1-score   support

          0       0.89      1.00      0.94       167
          1       0.00      0.00      0.00        21

avg / total       0.79      0.89      0.84       188

Logistic Regression Minority Upsample
             precision    recall  f1-score   support

          0       0.86      0.49      0.62       167
          1       0.09      0.38      0.14        21

avg / total       0.77      0.47      0.57       188

Logistic Regression Majority Downsample
             precision    recall  f1-score   support

          0       0.87      0.49      0.62       167
          1       0.09      0.43      0.16        21

avg / total       0.78      0.48      0.57       188

Penalized SVM
             precision    recall  f1-score   support

          0       0.89      0.60      0.72       167
          1       0.11      0.38      0.17        21

avg / total       0.80      0.58      0.66       188

Random Forest
[classification report truncated in the source]


#### Results
Interestingly enough, performance for the data-adjusted logistic regression models was very similar to that of the penalized SVM. Surprisingly, the random forest, which in many cases would be the top performer by far, failed to correctly classify a single positive case in the testing data. The downsampled model performed nearly as well as the SVM despite being fit on only 54 observations, versus 818 for the upsampled model and 436 for the SVM. The SVM performed the best, with a positive-class f1-score of 0.17 against 0.16 and 0.14 for the resampled logistic regressions, and without consideration for training time and complexity of production, should be implemented. However, since the logistic regression models are simpler, they may be the better choice depending on the real-world context of the model.

An interesting note on the random forest: in the original walkthrough by EliteDataScience, the random forest model performed incredibly well, but they didn't use a train/test split to verify that performance. This highlights the dangers of overfitting and the importance of properly validating a model.
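
The effect is easy to reproduce in isolation. The sketch below fits a random forest to purely random data, where the labels have no relationship to the features, and compares positive-class recall on the training rows with recall on held-out rows. The arrays `X_noise` and `y_noise` are hypothetical and stand in for no real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical data: four random features, labels unrelated to them (~10% positive)
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(1000, 4))
y_noise = (rng.random(1000) < 0.1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.3, random_state=0, stratify=y_noise
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Recall on the positive class, measured on seen versus unseen rows
train_recall = recall_score(y_tr, forest.predict(X_tr))
test_recall = recall_score(y_te, forest.predict(X_te))
print(f"train recall={train_recall:.2f}, test recall={test_recall:.2f}")
```

Because the forest can memorize its training rows, it looks strong when scored on them even though there is nothing to learn; only the held-out score reveals that.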