# **Handling Imbalanced Data**

- Imbalanced data can Give falls understanding and accuracy for a ML model. 
- In general their are 3 distinct ways to handle this problem. Let's say we have 99000 data belonging to class 1 and 1000 data belonging to class 2 : 

                - You can take 1000 random data from each class and train your model. The cons on that is that you wasting to much data

                - You can copy the small data (1000) many times undtil you reach the same number of samples, then you train. THis migh sounds like a good idea but there is another way to handle this problem. 

                - **SMOTE** - Synthetic Minority Over-sampling Technique :  Using k nearest neighbor algorithm to generate synthetic samples 

                - **Ensemble Method** : Divide the bigger class in many batchs, take the first batch and the samples of the smaller class and train your model. Do the same for all the batches. At the end use a majority vote.

                - **Focal Loss** : Focal Loss will penalize majority samples during loss calculation and give more weight to minority class samples (a speciall type of loss function).


The following implementation is based on the customer churn project which is located in notebook : "ANN_Prediction_Customer_Churn.ipynb"

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [3]:
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import MinMaxScaler

In [None]:
from tensorflow_addons import losses


In [11]:
# The following code is from the Notebook : "ANN_Prediction_Customer_Churn.ipynb"
df = pd.read_csv("customer_churn.csv")
df.drop('customerID',axis='columns',inplace=True)

pd.to_numeric(df.TotalCharges,errors='coerce').isnull()
df[pd.to_numeric(df.TotalCharges,errors='coerce').isnull()]
df1 = df[df.TotalCharges!=' ']
df1.TotalCharges = pd.to_numeric(df1.TotalCharges)

def print_unique_col_values(df):
       for column in df:
            if df[column].dtypes=='object':
                print(f'{column}: {df[column].unique()}') 
                
                
df1.replace('No internet service','No',inplace=True)
df1.replace('No phone service','No',inplace=True)


yes_no_columns = ['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
                  'DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']
for col in yes_no_columns:
    df1[col].replace({'Yes': 1,'No': 0},inplace=True)
    
df1['gender'].replace({'Female':1,'Male':0},inplace=True)


df2 = pd.get_dummies(data=df1, columns=['InternetService','Contract','PaymentMethod'])


cols_to_scale = ['tenure','MonthlyCharges','TotalCharges']


scaler = MinMaxScaler()
df2[cols_to_scale] = scaler.fit_transform(df2[cols_to_scale])


X = df2.drop('Churn',axis='columns')
y = testLabels = df2.Churn.astype(np.float32)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)


In [9]:
## ANN algorithm for automatic training, compiling and printing the classification report (all in one)
def ANN(X_train, y_train, X_test, y_test, loss, weights):
    model = keras.Sequential([
        keras.layers.Dense(26, input_dim=26, activation='relu'),
        keras.layers.Dense(15, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
    
    if weights == -1:
        model.fit(X_train, y_train, epochs=100)
    else:
        model.fit(X_train, y_train, epochs=100, class_weight = weights)
    
    print(model.evaluate(X_test, y_test))
    
    y_preds = model.predict(X_test)
    y_preds = np.round(y_preds)
    
    print("Classification Report: \n", classification_report(y_test, y_preds))
    
    return y_preds

In [10]:
y_preds = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.5338 - loss: 0.7170
Epoch 2/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7843 - loss: 0.4413
Epoch 3/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7913 - loss: 0.4303
Epoch 4/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7945 - loss: 0.4324
Epoch 5/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8059 - loss: 0.4144
Epoch 6/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.8031 - loss: 0.4092
Epoch 7/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8001 - loss: 0.4193
Epoch 8/100
[1m176/176[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7983 - loss: 0.4202
Epoch 9/100
[1m176/176[0m [32

In [17]:
count_class_0, count_class_1 = df1.Churn.value_counts()
# Divide by class
df_class_0 = df2[df2['Churn'] == 0]
df_class_1 = df2[df2['Churn'] == 1]

print(f'class 1 = {count_class_1},  class 0 = {count_class_0}')# From the results bellow it is obvious we have undersample for class 1

class 1 = 1869,  class 0 = 5163


# **Method 1 : Undersample**

In [19]:
df_class_0_under = df_class_0.sample(count_class_1)

df_class_0_under = df_class_0.sample(count_class_1)

df_test_under = pd.concat([df_class_0_under , df_class_1], axis = 0)
print('Random under-sampling:')
print(df_test_under.Churn.value_counts())

Random under-sampling:
Churn
0    1869
1    1869
Name: count, dtype: int64


In [22]:
# Create a ML model for the above undersampling data 

X = df_test_under.drop('Churn', axis = 'columns')
y = df_test_under['Churn']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 15, stratify = y)
y_train.value_counts()

Churn
0    1495
1    1495
Name: count, dtype: int64

In [23]:
y_pred = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.5720 - loss: 0.6848
Epoch 2/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7596 - loss: 0.5343
Epoch 3/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7672 - loss: 0.4933
Epoch 4/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7801 - loss: 0.4760
Epoch 5/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7651 - loss: 0.4870
Epoch 6/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7733 - loss: 0.4783
Epoch 7/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7569 - loss: 0.4855
Epoch 8/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7700 - loss: 0.4785
Epoch 9/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━

The above classification report indicates that the f1 - sccore for class 1 we increased it using undersampling. 

# **Method 2 : Oversampling**

In [27]:
df_class_1_over = df_class_1.sample(count_class_0, replace = True) # Using replace = True the function takes random samples and copy them
print(f'Shape : {df_class_1_over.shape}')

df_test_over = pd.concat([df_class_0, df_class_1_over], axis = 0)

print('Random over-sampling : ')
print(df_test_over.Churn.value_counts())

Shape : (5163, 27)
Random over-sampling : 
Churn
0    5163
1    5163
Name: count, dtype: int64


In [28]:
# Train ANN again
X = df_test_under.drop('Churn', axis = 'columns')
y = df_test_under['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 15, stratify = y)
y_train.value_counts()

y_pred = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step - accuracy: 0.6652 - loss: 0.6357
Epoch 2/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7565 - loss: 0.5109
Epoch 3/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.7547 - loss: 0.5165
Epoch 4/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.7805 - loss: 0.4663
Epoch 5/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.7772 - loss: 0.4858
Epoch 6/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.7739 - loss: 0.4785
Epoch 7/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.7730 - loss: 0.4764
Epoch 8/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.7769 - loss: 0.4677
Epoch 9/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━

Observing the above Classification report, we can see that f1 - score has increased from the initial report (before handling imbalanced data). For class 0 we have a dicrese but this is not a problem at the moment. 

# **Method 3: SMOTE**

Creating new samples using KNN

In [29]:
X = df2.drop('Churn', axis = 'columns')
y = df2['Churn']

In [31]:
from imblearn.over_sampling import SMOTE # From the imbalanced-learn lib

In [37]:
smote = SMOTE(sampling_strategy = 'minority') # Creating a smote object
X_sm, y_sm = smote.fit_resample(X, y) 

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size = 0.2, random_state = 15, stratify = y_sm)
y_train.value_counts()

y_pred = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)


Epoch 1/100


[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.6844 - loss: 0.6113
Epoch 2/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7803 - loss: 0.4698
Epoch 3/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.7762 - loss: 0.4664
Epoch 4/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7782 - loss: 0.4671
Epoch 5/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7879 - loss: 0.4443
Epoch 6/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7904 - loss: 0.4420
Epoch 7/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7881 - loss: 0.4454
Epoch 8/100
[1m259/259[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7979 - loss: 0.4375
Epoch 9/100
[1m259/259[0m [32m━━━━━━━━━━━

WE observe from the above classification report that f1 - score has increased for both classes in comparison to the other handling methods. 

# **Method 4: Use of Ensemble with undersampling**

In [41]:
X = df2.drop('Churn', axis = 'columns')
y = df2['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 15, stratify = y)

y_train.value_counts()

Churn
0    4130
1    1495
Name: count, dtype: int64

In [43]:
df3 = X_train.copy()
df3['Churn'] = y_train

In [45]:
# The above count indicates imbalanced 
df3_class0 = df3[df3.Churn == 0 ]   
df3_class1 = df3[df3.Churn == 1]


In [46]:
df3_class0.shape, df3_class1.shape 

((4130, 27), (1495, 27))

In [50]:
def get_train_batch(df_majority, df_minority, start, end):
    df_train  = pd.concat([df_majority[start:end] , df_minority], axis = 0) # We want to take 3 batches without generating random sequence
    
    X_train = df_train.drop('Churn' , axis = 'columns')
    y_train = df_train.Churn
    
    return X_train, y_train

In [53]:
# Train the first model 
X_train, y_train = get_train_batch(df3_class0, df3_class1, 0, 1495)
y_pred1 = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)


In [57]:
# Train second model 
X_train, y_train = get_train_batch(df3_class0, df3_class1, 1495, 2*1495 )
y_pred2 = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.6252 - loss: 0.6515
Epoch 2/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7570 - loss: 0.5198
Epoch 3/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7650 - loss: 0.4893
Epoch 4/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7706 - loss: 0.4859
Epoch 5/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7608 - loss: 0.4848
Epoch 6/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7625 - loss: 0.4713
Epoch 7/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7555 - loss: 0.4884
Epoch 8/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7639 - loss: 0.4819
Epoch 9/100
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━

In [56]:
# Train third model 
X_train, y_train = get_train_batch(df3_class0, df3_class1, 2*1495, 3*1495 )
y_pred3 = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.5812 - loss: 0.6713
Epoch 2/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7468 - loss: 0.5342
Epoch 3/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7627 - loss: 0.4927
Epoch 4/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7693 - loss: 0.4838
Epoch 5/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7845 - loss: 0.4748
Epoch 6/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7901 - loss: 0.4634
Epoch 7/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7627 - loss: 0.4868
Epoch 8/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.7841 - loss: 0.4578
Epoch 9/100
[1m83/83[0m [32m━━━━━━━━━━━━━━━━━

Now that we have trained all the models we are going to take a majority vote. In order to do that, we see in wich class each model is classifying a sample and we follow the majority. We work with the follow logic : 

vote1 = 0
vote2 = 1
vote3 = 0 
 sum = 1 : Everything above 1 is going to class_1 

In [58]:
y_pred_final = y_pred1.copy()
for i in range(len(y_pred1)):
    n_ones = y_pred1[i] + y_pred2[i] + y_pred3[i]
    if n_ones > 1:
        y_pred_final[i] = 1
    else: 
        y_pred_final[i] = 0

In [60]:
print(classification_report(y_test, y_pred_final))
# The results below doesn't look that satisfying. We could propably resample or add random sampling for this method. IN general imblanaced data it is good practice to do

              precision    recall  f1-score   support

           0       0.90      0.72      0.80      1033
           1       0.50      0.77      0.61       374

    accuracy                           0.73      1407
   macro avg       0.70      0.75      0.70      1407
weighted avg       0.79      0.73      0.75      1407

