<h3 style='color:blue' align='center'>Handling imbalanced data in customer churn prediction</h3>

Customer churn prediction is to measure why customers are leaving a business. In this tutorial we will be looking at customer churn in telecom business. We will build a deep learning model to predict the churn and use precision,recall, f1-score to measure performance of our model.
We will then handle imbalance in data using various techniques and improve f1-score

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

**Load the data**

In [3]:
df1 = pd.read_csv("random_imputed.csv",index_col=[0])
df1.sample(5)

Unnamed: 0_level_0,gender,body.temperature,pulse,respiration,systolic.blood.pressure,diastolic.blood.pressure,map,BMI,type.of.heart.failure,NYHA.cardiac.function.classification,...,measured.bicarbonate,carboxyhemoglobin,oxygen.saturation,partial.oxygen.pressure,oxyhemoglobin,anion.gap,free.calcium,total.hemoglobin,GCS,ageCat
inpatient.number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
835982,1,36.0,100,19,140,72,94.666667,19.723866,2,3,...,23.246503,1.217008,96.585423,86.74564,94.749285,15.874385,1.252327,151.929679,15,75
742228,0,36.3,66,19,90,68,75.333333,20.957171,2,3,...,26.9,0.0,98.0,121.0,97.7,8.4,1.0,92.0,15,85
775827,0,36.2,66,18,130,80,96.666667,23.555556,2,4,...,23.246503,1.217008,96.585423,86.74564,94.749285,15.874385,1.252327,151.929679,15,75
736744,1,37.1,121,19,162,100,120.666667,27.886696,0,3,...,23.246503,1.217008,96.585423,86.74564,94.749285,15.874385,1.252327,151.929679,15,55
840048,1,36.5,76,21,120,60,80.0,18.442546,2,3,...,27.0,0.3,98.0,109.0,97.7,18.0,1.02,93.0,15,85


In [4]:
df3=pd.read_csv("drug_onehot_latest.csv",index_col=[0])
df3.head()


Unnamed: 0_level_0,sulfotanshinone sodium injection,Furosemide tablet,Meglumine Adenosine Cyclophosphate for injection,Furosemide injection,Milrinone injection,Deslanoside injection,Torasemide tablet,Benazepril hydrochloride tablet,Atorvastatin calcium tablet,Digoxin tablet,Hydrochlorothiazide tablet,Spironolactone tablet,Valsartan Dispersible tablet,Dobutamine hydrochloride injection,Isoprenaline Hydrochloride injection,Nitroglycerin injection,Shenfu injection,Isosorbide Mononitrate Sustained Release tablet
inpatient.number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
722128,0,1,0,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0
723327,0,1,0,1,1,1,1,0,1,1,0,1,1,0,0,0,0,1
723617,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0
724385,0,1,0,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0
725509,0,1,0,1,1,1,1,0,0,1,0,1,0,0,0,0,1,0


In [6]:
df2=df3['Atorvastatin calcium tablet']
df2 = df2.to_frame()

In [7]:
df2.head()

Unnamed: 0_level_0,Atorvastatin calcium tablet
inpatient.number,Unnamed: 1_level_1
722128,0
723327,1
723617,1
724385,0
725509,0


In [8]:
df=pd.concat([df1,df2],axis=1)

In [9]:
df.head()

Unnamed: 0_level_0,gender,body.temperature,pulse,respiration,systolic.blood.pressure,diastolic.blood.pressure,map,BMI,type.of.heart.failure,NYHA.cardiac.function.classification,...,carboxyhemoglobin,oxygen.saturation,partial.oxygen.pressure,oxyhemoglobin,anion.gap,free.calcium,total.hemoglobin,GCS,ageCat,Atorvastatin calcium tablet
inpatient.number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
722128,0,36.9,66,18,121,56,77.666667,18.590125,2,4,...,1.217008,96.585423,86.74564,94.749285,15.874385,1.252327,151.929679,15,75,0
723327,0,36.0,72,21,150,80,103.333333,21.644121,2,4,...,0.7,91.0,64.0,90.0,13.0,1.11,73.0,15,65,1
723617,1,36.4,70,20,120,80,93.333333,31.111111,2,3,...,0.0,98.0,106.0,97.7,9.1,1.11,104.0,15,85,1
724385,1,36.7,129,22,122,86,98.0,20.13478,2,3,...,1.217008,96.585423,86.74564,94.749285,15.874385,1.252327,151.929679,15,75,0
725509,0,36.9,100,20,95,65,75.0,22.031726,2,3,...,1.217008,96.585423,86.74564,94.749285,15.874385,1.252327,151.929679,15,45,0


In [10]:
517400/df.shape[0]

258.3125312031952

**First of all, drop customerID column as it is of no use**

In [11]:
#df.drop('customerID',axis='columns',inplace=True)

In [12]:
df.dtypes

gender                           int64
body.temperature               float64
pulse                            int64
respiration                      int64
systolic.blood.pressure          int64
                                ...   
free.calcium                   float64
total.hemoglobin               float64
GCS                              int64
ageCat                           int64
Atorvastatin calcium tablet      int64
Length: 129, dtype: object

**Quick glance at above makes me realize that TotalCharges should be float but it is an object. Let's check what's going on with  this column**

In [13]:
#df.TotalCharges.values

**Ahh... it is string. Lets convert it to numbers**

In [14]:
#pd.to_numeric(df.TotalCharges,errors='coerce').isnull()

In [15]:
#df[pd.to_numeric(df.TotalCharges,errors='coerce').isnull()]

In [16]:
df.shape

(2003, 129)

In [17]:
#df.iloc[488].TotalCharges

In [18]:
#df[df.TotalCharges!=' '].shape

**Remove rows with space in TotalCharges**

In [19]:
#df1 = df[df.TotalCharges!=' ']
#df1.shape

In [20]:
# df1.dtypes

In [21]:
# df1.TotalCharges = pd.to_numeric(df1.TotalCharges)

In [22]:
# df1.TotalCharges.values

In [23]:
# df1[df1.Churn=='No']

**Data Visualization**

In [24]:
# tenure_churn_no = df1[df1.Churn=='No'].tenure
# tenure_churn_yes = df1[df1.Churn=='Yes'].tenure

# plt.xlabel("tenure")
# plt.ylabel("Number Of Customers")
# plt.title("Customer Churn Prediction Visualiztion")

# blood_sugar_men = [113, 85, 90, 150, 149, 88, 93, 115, 135, 80, 77, 82, 129]
# blood_sugar_women = [67, 98, 89, 120, 133, 150, 84, 69, 89, 79, 120, 112, 100]

# plt.hist([tenure_churn_yes, tenure_churn_no], rwidth=0.95, color=['green','red'],label=['Churn=Yes','Churn=No'])
# plt.legend()

In [25]:
# mc_churn_no = df1[df1.Churn=='No'].MonthlyCharges      
# mc_churn_yes = df1[df1.Churn=='Yes'].MonthlyCharges      

# plt.xlabel("Monthly Charges")
# plt.ylabel("Number Of Customers")
# plt.title("Customer Churn Prediction Visualiztion")

# blood_sugar_men = [113, 85, 90, 150, 149, 88, 93, 115, 135, 80, 77, 82, 129]
# blood_sugar_women = [67, 98, 89, 120, 133, 150, 84, 69, 89, 79, 120, 112, 100]

# plt.hist([mc_churn_yes, mc_churn_no], rwidth=0.95, color=['green','red'],label=['Churn=Yes','Churn=No'])
# plt.legend()

**Many of the columns are yes, no etc. Let's print unique values in object columns to see data values**

In [26]:
# # def print_unique_col_values(df):
#        for column in df:
#             if df[column].dtypes=='object':
#                 print(f'{column}: {df[column].unique()}') 

In [27]:
# print_unique_col_values(df2)

**Some of the columns have no internet service or no phone service, that can be replaced with a simple No**

In [28]:
# df1.replace('No internet service','No',inplace=True)
# df1.replace('No phone service','No',inplace=True)

In [29]:
# print_unique_col_values(df1)

**Convert Yes and No to 1 or 0**

In [30]:
# yes_no_columns = ['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
#                   'DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']
# for col in yes_no_columns:
#     df1[col].replace({'Yes': 1,'No': 0},inplace=True)

In [31]:
# for col in df1:
#     print(f'{col}: {df1[col].unique()}') 

In [32]:
#df1['gender'].replace({'Female':1,'Male':0},inplace=True)

In [33]:
# df1.gender.unique()

**One hot encoding for categorical columns**

In [34]:
# df2 = pd.get_dummies(data=df1, columns=['InternetService','Contract','PaymentMethod'])
# df2.columns

In [35]:
# df2.sample(5)

In [36]:
# df2.dtypes

In [37]:
cols_to_scale = [df.columns]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_df= scaler.fit_transform(df)

In [38]:
df = pd.DataFrame(scaled_df,columns = df.columns)

In [39]:
df.head()

Unnamed: 0,gender,body.temperature,pulse,respiration,systolic.blood.pressure,diastolic.blood.pressure,map,BMI,type.of.heart.failure,NYHA.cardiac.function.classification,...,carboxyhemoglobin,oxygen.saturation,partial.oxygen.pressure,oxyhemoglobin,anion.gap,free.calcium,total.hemoglobin,GCS,ageCat,Atorvastatin calcium tablet
0,0.0,0.271429,0.333333,0.5,0.480159,0.383562,0.428309,0.046006,1.0,1.0,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,1.0,0.666667,0.0
1,0.0,0.142857,0.363636,0.583333,0.595238,0.547945,0.569853,0.053564,1.0,1.0,...,0.132075,0.88,0.187234,0.878342,0.316258,0.44,0.22619,1.0,0.533333,1.0
2,1.0,0.2,0.353535,0.555556,0.47619,0.547945,0.514706,0.076992,1.0,0.5,...,0.0,0.973333,0.365957,0.981283,0.229399,0.44,0.410714,1.0,0.8,1.0
3,1.0,0.242857,0.651515,0.611111,0.484127,0.589041,0.540441,0.049828,1.0,0.5,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,1.0,0.666667,0.0
4,0.0,0.271429,0.505051,0.555556,0.376984,0.445205,0.413603,0.054523,1.0,0.5,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,1.0,0.266667,0.0


**Train test split**

In [40]:
X = df2.drop('Atorvastatin calcium tablet',axis='columns')
y = testLabels = df2['Atorvastatin calcium tablet'].astype(np.float32)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, df2, test_size=0.2, random_state=15, stratify=df2)

In [41]:
y_train.value_counts()

Atorvastatin calcium tablet
0                              948
1                              654
dtype: int64

In [42]:
y.value_counts()

0.0    1185
1.0     818
Name: Atorvastatin calcium tablet, dtype: int64

In [43]:
1185/818

1.4486552567237163

In [44]:
y_test.value_counts()

Atorvastatin calcium tablet
0                              237
1                              164
dtype: int64

In [45]:
X_train.shape

(1602, 129)

In [46]:
X_test.shape

(401, 129)

In [47]:
X_train[:10]

Unnamed: 0,gender,body.temperature,pulse,respiration,systolic.blood.pressure,diastolic.blood.pressure,map,BMI,type.of.heart.failure,NYHA.cardiac.function.classification,...,carboxyhemoglobin,oxygen.saturation,partial.oxygen.pressure,oxyhemoglobin,anion.gap,free.calcium,total.hemoglobin,GCS,ageCat,Atorvastatin calcium tablet
1486,0.0,0.185714,0.494949,0.555556,0.714286,0.616438,0.661765,0.054994,1.0,0.0,...,0.09434,0.986667,0.47234,0.98262,0.340757,0.44,0.583333,1.0,0.8,1.0
223,1.0,0.228571,0.353535,0.5,0.357143,0.410959,0.386029,0.048805,1.0,1.0,...,0.056604,0.986667,0.506383,0.987968,0.296214,0.5,0.619048,1.0,0.533333,0.0
1034,0.0,0.214286,0.333333,0.5,0.492063,0.479452,0.485294,0.043263,0.0,0.5,...,0.018868,0.986667,0.710638,1.0,0.35412,0.52,0.547619,1.0,0.8,1.0
762,0.0,0.142857,0.313131,0.5,0.388889,0.458904,0.426471,0.046353,1.0,0.5,...,0.09434,0.946667,0.234043,0.94385,0.400891,0.32,0.72619,1.0,0.4,0.0
188,1.0,0.242857,0.30303,0.527778,0.531746,0.616438,0.577206,0.063431,1.0,0.5,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,0.583333,0.8,1.0
1909,1.0,0.285714,0.328283,0.5,0.595238,0.479452,0.533088,0.045641,0.0,0.0,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,1.0,0.8,0.0
1087,1.0,0.142857,0.520202,0.555556,0.436508,0.547945,0.496324,0.066357,1.0,0.5,...,0.056604,0.986667,0.544681,0.989305,0.191537,0.32,0.428571,1.0,0.266667,1.0
1955,0.0,0.257143,0.419192,0.5,0.543651,0.568493,0.556985,0.052795,1.0,0.5,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,1.0,0.4,0.0
606,0.0,0.371429,0.414141,0.555556,0.484127,0.479452,0.481618,0.047097,0.0,0.5,...,0.229624,0.954472,0.284024,0.941835,0.380276,0.724655,0.69601,1.0,0.4,0.0
526,0.0,0.228571,0.338384,0.5,0.515873,0.465753,0.488971,0.045761,1.0,1.0,...,0.283019,0.64,0.102128,0.631016,0.360802,0.36,0.470238,1.0,0.8,0.0


In [48]:
# len(X_train.columns)

**Build a model (ANN) in tensorflow/keras**

In [262]:
from tensorflow_addons import losses

ModuleNotFoundError: No module named 'tensorflow_addons'

In [49]:
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import confusion_matrix , classification_report

In [73]:
def ANN(X_train, y_train, X_test, y_test, loss, weights):
    model = keras.Sequential([
        keras.layers.Dense(128, input_dim=128, activation='relu'),
        keras.layers.Dense(15, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='rmsprop', loss=loss, metrics=['accuracy'])
    
    if weights == -1:
        model.fit(X_train, y_train, epochs=50)
    else:
        model.fit(X_train, y_train, epochs=50, class_weight = weights)
    
    print(model.evaluate(X_test, y_test))
    
    y_preds = model.predict(X_test)
    y_preds = np.round(y_preds)
    
    print("Classification Report: \n", classification_report(y_test, y_preds))
    
    return y_preds

In [66]:
y_preds = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
[0.6114324331283569, 0.6666666865348816]
Classification Report: 
               precision    recall  f1-score   support

         0.0       0.64      0.76      0.69       237
         1.0       0.70      0.58      0.63       237

    accuracy                           0.67       474
   macro avg       0.67      0.67      0.66       474
weighted avg       0.67      0.67      0.66       474



## Mitigating Skewdness of Data

### Method 1: Undersampling

reference: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

In [51]:
# Class count
count_class_0, count_class_1 = df['Atorvastatin calcium tablet'].value_counts()

# Divide by class
df_class_0 = df[df['Atorvastatin calcium tablet'] == 0]
df_class_1 = df[df['Atorvastatin calcium tablet'] == 1]

In [52]:
df['Atorvastatin calcium tablet'].value_counts()

0.0    1185
1.0     818
Name: Atorvastatin calcium tablet, dtype: int64

In [53]:
# Undersample 0-class and concat the DataFrames of both class
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print('Random under-sampling:')
print(df_test_under['Atorvastatin calcium tablet'].value_counts())

Random under-sampling:
0.0    818
1.0    818
Name: Atorvastatin calcium tablet, dtype: int64


In [54]:
X = df_test_under.drop('Atorvastatin calcium tablet',axis='columns')
y = df_test_under['Atorvastatin calcium tablet']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)

In [55]:
# Number of classes in training Data
y_train.value_counts()

0.0    654
1.0    654
Name: Atorvastatin calcium tablet, dtype: int64

**Printing Classification in the last, Scroll down till the last epoch to watch the classification report**

In [274]:
y_preds = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[0.6618996858596802, 0.582317054271698]
Classification Report: 
               precision    recall  f1-score   support

         0.0       0.59      0.55      0.57       164
         1.0       0.58      0.61      0.59       164

    accuracy                           0.58       328
   macro avg       0.58      0.58      0.58       328
weighted avg       0.58      0.58      0.58       328



Check classification report above. f1-score for minority class 1 improved from **0.57 to 0.76**. Score for class 0 reduced to 0.75 from 0.85 but that's ok. We have more generalized classifier which classifies both classes with similar prediction score

### Method2: Oversampling

In [57]:
# Oversample 1-class and concat the DataFrames of both classes
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_test_over['Atorvastatin calcium tablet'].value_counts())

Random over-sampling:
0.0    1185
1.0    1185
Name: Atorvastatin calcium tablet, dtype: int64


In [58]:
X = df_test_over.drop('Atorvastatin calcium tablet',axis='columns')
y = df_test_over['Atorvastatin calcium tablet']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)

In [59]:
# Number of classes in training Data
y_train.value_counts()

0.0    948
1.0    948
Name: Atorvastatin calcium tablet, dtype: int64

In [74]:
loss = keras.losses.BinaryCrossentropy()
weights = -1
y_preds = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
[0.6194208860397339, 0.7067510485649109]
Classification Report: 
               precision    recall  f1-score   support

         0.0       0.71      0.70      0.70       237
         1.0       0.70      0.72      0.71       237

    accuracy                           0.71       474
   macro avg       0.71      0.71      0.71       474
weighted avg       0.71      0.71      0.71       474



Check classification report above. f1-score for minority class 1 improved from **0.57 to 0.76**. Score for class 0 reduced to 0.75 from 0.85 but that's ok. We have more generalized classifier which classifies both classes with similar prediction score

### Method3: SMOTE

To install imbalanced-learn library use **pip install imbalanced-learn** command

In [71]:
X = df.drop('Atorvastatin calcium tablet',axis='columns')
y = df['Atorvastatin calcium tablet']

In [61]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_sample(X, y)

y_sm.value_counts()

1    5163
0    5163
Name: Churn, dtype: int64

In [62]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=15, stratify=y_sm)

In [63]:
# Number of classes in training Data
y_train.value_counts()

1    4130
0    4130
Name: Churn, dtype: int64

In [64]:
y_preds = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

SMOT Oversampling increases f1 score of minority class 1 from **0.57 to 0.81 (huge improvement)** Also over all accuracy improves from 0.78 to 0.80

### Method4: Use of Ensemble with undersampling

In [65]:
df2.Churn.value_counts()

0    5163
1    1869
Name: Churn, dtype: int64

In [66]:
# Regain Original features and labels
X = df2.drop('Churn',axis='columns')
y = df2['Churn']

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)

In [68]:
y_train.value_counts()

0    4130
1    1495
Name: Churn, dtype: int64

model1 --> class1(1495) + class0(0, 1495)

model2 --> class1(1495) + class0(1496, 2990)

model3 --> class1(1495) + class0(2990, 4130)

In [73]:
df3 = X_train.copy()
df3['Churn'] = y_train

In [74]:
df3.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,...,InternetService_Fiber optic,InternetService_No,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn
684,1,0,0,0,0.0,1,0,0,0,0,...,1,0,1,0,0,0,0,0,1,0
2446,1,0,0,0,0.239437,1,1,0,1,0,...,1,0,1,0,0,0,1,0,0,1
1680,0,0,1,1,0.774648,1,1,0,0,0,...,0,1,0,1,0,0,0,0,1,0
2220,0,0,1,0,1.0,1,0,1,1,0,...,0,0,0,0,1,1,0,0,0,0
2842,1,0,0,0,0.042254,0,0,1,0,1,...,0,0,1,0,0,0,0,0,1,0


In [75]:
df3_class0 = df3[df3.Churn==0]
df3_class1 = df3[df3.Churn==1]

In [76]:
def get_train_batch(df_majority, df_minority, start, end):
    df_train = pd.concat([df_majority[start:end], df_minority], axis=0)

    X_train = df_train.drop('Churn', axis='columns')
    y_train = df_train.Churn
    return X_train, y_train    

In [77]:
X_train, y_train = get_train_batch(df3_class0, df3_class1, 0, 1495)

y_pred1 = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
[0.6136508584022522, 0.711442768573761]
Classification Report: 
               precision    recall  f1-score   support

           0       0.89      0.70      0.78      1033
           1       0.47      0.76      0.58       374

    accuracy                           0.71      1407
   macro avg       0.68      0.73      0.68      1407
weighted avg       0.78      0.71      0.73      1407



In [78]:
X_train, y_train = get_train_batch(df3_class0, df3_class1, 1495, 2990)

y_pred2 = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
[0.6378723382949829, 0.7078891396522522]
Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.70      0.78      1033
           1       0.47      0.74      0.57       374

    accuracy                           0.71      1407
   macro avg       0.67      0.72      0.68      1407
weighted avg       0.77      0.71      0.72      1407



In [79]:
X_train, y_train = get_train_batch(df3_class0, df3_class1, 2990, 4130)

y_pred3 = ANN(X_train, y_train, X_test, y_test, 'binary_crossentropy', -1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
[0.6610018014907837, 0.6915422677993774]
Classification Report: 
               precision    recall  f1-score   support

           0       0.89      0.66      0.76      1033
           1       0.45      0.78      0.57       374

    accuracy                           0.69      1407
   macro avg       0.67      0.72      0.67      1407
weighted avg       0.78      0.69      0.71      1407



In [80]:
len(y_pred1)

1407

In [81]:
y_pred_final = y_pred1.copy()
for i in range(len(y_pred1)):
    n_ones = y_pred1[i] + y_pred2[i] + y_pred3[i]
    if n_ones>1:
        y_pred_final[i] = 1
    else:
        y_pred_final[i] = 0

In [82]:
cl_rep = classification_report(y_test, y_pred_final)
print(cl_rep)

              precision    recall  f1-score   support

           0       0.88      0.69      0.77      1033
           1       0.46      0.75      0.57       374

    accuracy                           0.70      1407
   macro avg       0.67      0.72      0.67      1407
weighted avg       0.77      0.70      0.72      1407



f1-score for minority class 1 improved to 0.62 from 0.57. The score for majority class 0 is suffering and reduced to 0.80 from 0.85 but at least there is some balance in terms of prediction accuracy across two classes