## Handling the Imbalanced Dataset
<p>Consider a credit card fraud detection model trained on 99,000 non-fraudulent and 1,000 fraudulent transactions. However naive the model is, it can reach about 99% accuracy simply by predicting "not fraudulent" every time, because the data is so imbalanced. Accuracy alone is therefore a misleading metric here.</p>
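
A quick check of that claim, as a minimal sketch in plain Python: a model that labels every transaction as not fraudulent is scored on the counts from the example. The variable names are illustrative only.

```python
# Counts taken from the fraud example above.
n_legit, n_fraud = 99_000, 1_000

# A naive model that labels every transaction as not fraudulent (0).
y_true = [0] * n_legit + [1] * n_fraud
y_pred = [0] * len(y_true)

# Accuracy: fraction of all transactions labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: fraction of actual frauds that were caught.
fraud_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_fraud

print(f"accuracy = {accuracy:.2%}, fraud recall = {fraud_recall:.2%}")
```

The accuracy looks excellent while not a single fraudulent transaction is detected, which is why per-class precision and recall are reported in the rest of this notebook.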

## 1. Under sampling majority class
![undersampling](undersampling.png) <br> <br>
Undersampling randomly drops majority-class records until both classes are the same size. The drawback is that it discards a lot of data.

## 2. Over sampling minority class by duplication
![oversampling](oversampling.png) <br> <br>
But this may not be the best technique: duplicating minority-class records adds no new information and can encourage the model to overfit to the repeated samples.

## 3. Over sampling minority class using SMOTE
![smote](smote.png) <br> <br>
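
SMOTE (Synthetic Minority Over-sampling Technique) creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours, instead of copying rows. The sketch below illustrates only that interpolation step in plain Python. The function name `smote_like_sample` and the toy points are hypothetical; in practice a library implementation such as `imblearn.over_sampling.SMOTE` would normally be used.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point lying between a random minority point
    and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points (Euclidean distance).
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1).
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Toy 2-D minority points, made up for illustration.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(0)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(3)]
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.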

## 4. Ensemble Method
![ensemble](ensemble.png) <br> <br>
Split the majority class into batches, combine each batch with the full minority class, build a model on every batch, and take a majority vote of their classifications.

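
The batching and voting steps can be sketched as follows. This is an illustration under the assumption that each model is trained on one majority-class batch plus the whole minority class; `split_into_batches`, `majority_vote`, and the stand-in data are hypothetical names, and the per-model predictions are made up rather than produced by real models.

```python
from collections import Counter

def split_into_batches(items, n_batches):
    """Split a list into n_batches contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_model)]

majority = list(range(9))      # stand-in for 9 majority-class rows
minority = ["m0", "m1", "m2"]  # stand-in for 3 minority-class rows

# Each training set = one majority batch + the whole minority class.
training_sets = [batch + minority for batch in split_into_batches(majority, 3)]
print([len(ts) for ts in training_sets])  # [6, 6, 6]

# Hypothetical predictions from three models for four test samples.
preds = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]]
print(majority_vote(preds))  # [1, 1, 1, 0]
```

Using an odd number of models avoids tied votes in a binary problem.
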
## 5. Focal Loss
Focal loss down-weights easy, well-classified samples (usually from the majority class) during loss calculation, so that training gives more weight to hard samples, which often belong to the minority class.
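
For a single sample, the binary focal loss is commonly written as `-alpha_t * (1 - p_t)**gamma * log(p_t)`, where `p_t` is the probability the model assigns to the true class. The sketch below compares it with plain binary cross-entropy in plain Python. The function names are hypothetical, and `gamma=2.0` and `alpha=0.25` are commonly cited defaults that should be treated as tunable assumptions (`alpha` weights class 1 and `1 - alpha` weights class 0).

```python
import math

def binary_cross_entropy(y_true, p):
    """Cross-entropy for one sample; p is the predicted probability of class 1."""
    p_t = p if y_true == 1 else 1.0 - p
    return -math.log(p_t)

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Focal loss for one sample; p is the predicted probability of class 1."""
    p_t = p if y_true == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha  # class balancing weight
    # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = (0, 0.05)  # true class 0, confident and correct: p_t = 0.95
hard = (1, 0.20)  # true class 1, mostly wrong: p_t = 0.20

for name, (y, p) in {"easy": easy, "hard": hard}.items():
    bce = binary_cross_entropy(y, p)
    fl = binary_focal_loss(y, p)
    print(f"{name}: cross-entropy = {bce:.4f}, focal = {fl:.6f}, ratio = {fl / bce:.4f}")
```

The ratio column shows the effect: the easy sample keeps only a tiny fraction of its cross-entropy loss, while the hard sample keeps a much larger share.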

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')


In [4]:
df = pd.read_csv('customer_churn_scaled.csv')
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,...,InternetService_0,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,0,0.0,0,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
1,1,0,0,0,0.464789,1,0,1,0,1,...,0,1,0,0,1,0,0,0,0,1
2,1,0,0,0,0.014085,1,0,1,1,0,...,0,1,0,1,0,0,0,0,0,1
3,1,0,0,0,0.619718,0,0,1,0,1,...,0,1,0,0,1,0,1,0,0,0
4,0,0,0,0,0.014085,1,0,0,0,0,...,0,0,1,1,0,0,0,0,1,0


In [5]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Churn', axis=1), df['Churn'], test_size=0.2, random_state=42)


In [6]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5625, 26), (1407, 26), (5625,), (1407,))

In [8]:
def ANN(X_train, y_train, X_test, y_test, loss, weights):
    # Small feed-forward binary classifier: 26 -> 15 -> 1 (sigmoid output)
    model = keras.Sequential()
    model.add(keras.layers.Dense(26, input_dim=X_train.shape[1], activation='relu'))
    model.add(keras.layers.Dense(15, activation='relu'))
    model.add(keras.layers.Dense(1, activation='sigmoid'))

    model.compile(loss=loss, optimizer='adam', metrics=['accuracy'])

    # weights == -1 means "no class weighting"; otherwise pass a {class: weight} dict
    if weights == -1:
        model.fit(X_train, y_train, epochs=10)
    else:
        model.fit(X_train, y_train, epochs=10, class_weight=weights)

    # Prints [loss, accuracy] on the test set
    print(model.evaluate(X_test, y_test))

    # Round the predicted probabilities to 0/1 labels
    y_pred = model.predict(X_test)
    y_pred = np.round(y_pred)

    print('Classification Report: \n', classification_report(y_test, y_pred))

    return y_pred

In [9]:
y_pred = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
[0.43351054191589355, 0.7853589057922363]
Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.62      0.50      0.55       374

    accuracy                           0.79      1407
   macro avg       0.72      0.70      0.71      1407
weighted avg       0.77      0.79      0.78      1407



In [10]:
y_test.value_counts()

0    1033
1     374
Name: Churn, dtype: int64

As we can see, the data is moderately imbalanced: about 73% of the test samples belong to class 0, and recall for class 1 is only 0.50. So, let us apply the techniques discussed above to handle the imbalanced dataset.

In [12]:
count_class_0, count_class_1 = df.Churn.value_counts()

df_class_0 = df[df['Churn'] == 0]
df_class_1 = df[df['Churn'] == 1]


In [13]:
df_class_0.shape, df_class_1.shape

((5163, 27), (1869, 27))

The imbalance is visible here: 5163 records in class 0 versus 1869 in class 1, a ratio of roughly 2.8 to 1.
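
The `ANN` helper defined earlier accepts a `weights` argument that it passes to Keras as `class_weight`. One common way to fill it, sketched here as an assumption and not as the method used in this notebook, is the "balanced" heuristic `n_samples / (n_classes * class_count)`. The class counts are taken from the shapes printed above.

```python
# Class counts from df_class_0.shape and df_class_1.shape above.
class_counts = {0: 5163, 1: 1869}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights.
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}
print({label: round(w, 3) for label, w in class_weight.items()})  # {0: 0.681, 1: 1.881}
```

A dictionary like this could be passed as the last argument of `ANN(...)` in place of `-1`. In practice the counts should come from `y_train`, not from the full dataset.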

## 1. Under Sampling

In [14]:
df_class_0_under = df_class_0.sample(count_class_1)
df_class_0_under.shape

(1869, 27)

In [15]:
df_class_0_under.shape, df_class_1.shape

((1869, 27), (1869, 27))

Now both classes have the same number of records. Let's concatenate them.

In [18]:
df_test_under = pd.concat([df_class_0_under, df_class_1])
df_test_under.shape

(3738, 27)

In [20]:
X_train, X_test, y_train, y_test = train_test_split(df_test_under.drop('Churn', axis=1), df_test_under['Churn'], test_size=0.2, random_state=15, stratify=df_test_under['Churn'])
# stratify ensures that the train and test splits keep the same proportions of 0 and 1

In [22]:
y_train.value_counts()

1    1495
0    1495
Name: Churn, dtype: int64

In [23]:
y_pred = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
[0.48828670382499695, 0.7566844820976257]
Classification Report: 
               precision    recall  f1-score   support

           0       0.75      0.77      0.76       374
           1       0.76      0.75      0.75       374

    accuracy                           0.76       748
   macro avg       0.76      0.76      0.76       748
weighted avg       0.76      0.76      0.76       748



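We can see that precision, recall and F1 for the minority class (1) have improved after undersampling: recall rose from 0.50 to 0.75 and F1 from 0.55 to 0.75. This comes at the cost of class 0 performance and overall accuracy (0.79 to 0.76). Note also that this test set is itself balanced, so the two reports are not strictly comparable.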

## 2. Over Sampling