# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons, we will build a binary classification model for customer churn. You will use the `files_for_lab/Customer-Churn.csv` file.

### Scenario

You are working as an analyst at an internet service provider. You have been given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to churn, so that it can prevent losses from such customers.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the libraries and modules you will need.

In [66]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.utils import resample

pd.set_option('display.max_columns', None)

- Read the data into Python and call the dataframe `churnData`.

In [33]:
churnData = pd.read_csv('files_for_lab/Customer-Churn.csv')
print(churnData.shape)
churnData

(7043, 16)


| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No | Yes | No | No | No | No | Month-to-month | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | Yes | No | Yes | No | No | No | One year | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | Yes | Yes | No | No | No | No | Month-to-month | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | Yes | No | Yes | Yes | No | No | One year | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | No | No | No | No | No | Month-to-month | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | No | Yes | Yes | Yes | Yes | One year | 84.80 | 1990.5 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | No | Yes | Yes | No | Yes | Yes | One year | 103.20 | 7362.9 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | Yes | No | No | No | No | No | Month-to-month | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | No | No | No | No | No | No | Month-to-month | 74.40 | 306.6 | Yes |


- Check the data types of all the columns. You will see that the `TotalCharges` column has type `object`. Convert it to a numeric type with the `pd.to_numeric` function.

In [34]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [35]:
churnData.TotalCharges.value_counts()

          11
20.2      11
19.75      9
20.05      8
19.9       8
          ..
6849.4     1
692.35     1
130.15     1
3211.9     1
6844.5     1
Name: TotalCharges, Length: 6531, dtype: int64

Some fields contain a blank string (a single space) instead of a number, so I will replace it with 0.

In [36]:
# Assign the result back instead of relying on inplace=True on a column accessor
churnData['TotalCharges'] = churnData['TotalCharges'].replace(' ', 0)

In [37]:
churnData.TotalCharges.value_counts()

0         11
20.2      11
19.75      9
20.05      8
19.9       8
          ..
6849.4     1
692.35     1
130.15     1
3211.9     1
6844.5     1
Name: TotalCharges, Length: 6531, dtype: int64

In [38]:
churnData.TotalCharges = pd.to_numeric(churnData.TotalCharges, downcast="float")
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float32
Churn                object
dtype: object
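
An alternative to replacing the blank strings by hand is to let `pd.to_numeric` coerce anything it cannot parse to `NaN` and then fill those gaps explicitly. The sketch below uses a small made-up series (`raw` is hypothetical, not the lab data) and keeps this notebook's choice of filling with 0:

```python
import pandas as pd

# Hypothetical stand-in for the TotalCharges column: numbers stored as text,
# with one blank entry like the blank rows seen above.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns values that cannot be parsed (here the blank) into NaN.
as_number = pd.to_numeric(raw, errors="coerce")
print(as_number.isna().sum())  # number of entries that could not be parsed

# Fill the gaps explicitly; 0 mirrors the choice made in this notebook.
cleaned = as_number.fillna(0)
print(cleaned.dtype)
```

With this approach the unparseable rows show up as missing values first, so the decision about how to fill them stays visible.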

- Check for null values in the dataframe. Replace the null values.

In [39]:
churnData.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

There are no null values left to replace. The blank `TotalCharges` entries were already set to 0 above, so `isna()` finds nothing.
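
Note that `isna()` only counts real missing values, so text placeholders such as a single space are not reported. A check like the one below is most useful before the numeric conversion. It is a minimal sketch on a made-up frame (`df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame: one blank-string placeholder and no real NaN values.
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15"],
    "Churn": ["No", "No", "Yes"],
})

# isna() sees no missing values, because " " is an ordinary string.
print(df.isna().sum())

# Count entries that are empty once surrounding whitespace is stripped.
blank_counts = df.apply(lambda col: col.astype(str).str.strip().eq("").sum())
print(blank_counts)
```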

- Use the following features: `tenure`, `SeniorCitizen`, `MonthlyCharges` and `TotalCharges`:
  - Scale the features using either a normalizer or a standard scaler.


In [40]:
numerical = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
numerical

| | tenure | SeniorCitizen | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| 0 | 1 | 0 | 29.85 | 29.850000 |
| 1 | 34 | 0 | 56.95 | 1889.500000 |
| 2 | 2 | 0 | 53.85 | 108.150002 |
| 3 | 45 | 0 | 42.30 | 1840.750000 |
| 4 | 2 | 0 | 70.70 | 151.649994 |
| ... | ... | ... | ... | ... |
| 7038 | 24 | 0 | 84.80 | 1990.500000 |
| 7039 | 72 | 0 | 103.20 | 7362.899902 |
| 7040 | 11 | 0 | 29.60 | 346.450012 |
| 7041 | 4 | 1 | 74.40 | 306.600006 |


In [41]:
scaler = MinMaxScaler()
numerical_scaled = scaler.fit_transform(numerical)

In [47]:
# Rebuild a dataframe from the scaled array, keeping the original column names and index
numerical_scaled = pd.DataFrame(numerical_scaled, columns=numerical.columns, index=numerical.index)
numerical_scaled

| | tenure | SeniorCitizen | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| 0 | 0.013889 | 0.0 | 0.115423 | 0.003437 |
| 1 | 0.472222 | 0.0 | 0.385075 | 0.217564 |
| 2 | 0.027778 | 0.0 | 0.354229 | 0.012453 |
| 3 | 0.625000 | 0.0 | 0.239303 | 0.211951 |
| 4 | 0.027778 | 0.0 | 0.521891 | 0.017462 |
| ... | ... | ... | ... | ... |
| 7038 | 0.333333 | 0.0 | 0.662189 | 0.229194 |
| 7039 | 1.000000 | 0.0 | 0.845274 | 0.847792 |
| 7040 | 0.152778 | 0.0 | 0.112935 | 0.039892 |
| 7041 | 0.055556 | 1.0 | 0.558706 | 0.035303 |
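
Fitting the scaler on the full dataset, as above, lets the test rows influence the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training rows only. This is a sketch on random, made-up numbers (`features` is hypothetical), not a change to the results shown in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features standing in for tenure and MonthlyCharges.
rng = np.random.default_rng(42)
features = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=200),
    "MonthlyCharges": rng.uniform(18, 120, size=200),
})

train, test = train_test_split(features, random_state=42)

# Learn min and max from the training rows only, then reuse them on the test rows.
scaler = MinMaxScaler().fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.min().min(), train_scaled.max().max())
```

With this setup the scaled test values can fall slightly outside the 0 to 1 range, which is expected because the scaler never saw those rows.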


In [63]:
# To use the remaining columns I would need to make everything numerical:
# turn the Yes/No columns into 0/1 and encode the rest. These cells are an unused draft.


In [59]:
# booleans = churnData[['Partner', 'Dependents', 'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']]

In [None]:
# Map 'Yes' to 1 and anything else to 0 (the values are capitalised), keeping the original index
# booleans_encoded = (booleans == 'Yes').astype(int)
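
The encoding idea in the draft cells above can be sketched as follows. The frame `sample` and the output names are made up for illustration, and treating every value other than `Yes` as 0 is an assumption that would need checking against the real columns:

```python
import pandas as pd

# Hypothetical rows with two Yes/No columns and one multi-valued column.
sample = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "PhoneService": ["No", "Yes", "Yes"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Yes/No columns: compare with 'Yes' (case-sensitive) and cast booleans to 0/1.
yes_no_cols = ["Partner", "PhoneService"]
binary = (sample[yes_no_cols] == "Yes").astype(int)

# Multi-valued column: one indicator column per category.
contract = pd.get_dummies(sample["Contract"], prefix="Contract", dtype=int)

encoded = pd.concat([binary, contract], axis=1)
print(encoded)
```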

Now I can merge the scaled numerical columns back with the remaining columns.

In [51]:

churnData_scaled = pd.concat([churnData.drop(numerical.columns, axis=1), numerical_scaled], axis=1)

In [52]:
churnData_scaled

| | gender | Partner | Dependents | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | Churn | tenure | SeniorCitizen | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | Yes | No | No | No | Yes | No | No | No | No | Month-to-month | No | 0.013889 | 0.0 | 0.115423 | 0.003437 |
| 1 | Male | No | No | Yes | Yes | No | Yes | No | No | No | One year | No | 0.472222 | 0.0 | 0.385075 | 0.217564 |
| 2 | Male | No | No | Yes | Yes | Yes | No | No | No | No | Month-to-month | Yes | 0.027778 | 0.0 | 0.354229 | 0.012453 |
| 3 | Male | No | No | No | Yes | No | Yes | Yes | No | No | One year | No | 0.625000 | 0.0 | 0.239303 | 0.211951 |
| 4 | Female | No | No | Yes | No | No | No | No | No | No | Month-to-month | Yes | 0.027778 | 0.0 | 0.521891 | 0.017462 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | One year | No | 0.333333 | 0.0 | 0.662189 | 0.229194 |
| 7039 | Female | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | One year | No | 1.000000 | 0.0 | 0.845274 | 0.847792 |
| 7040 | Female | Yes | Yes | No | Yes | No | No | No | No | No | Month-to-month | No | 0.152778 | 0.0 | 0.112935 | 0.039892 |
| 7041 | Male | Yes | No | Yes | No | No | No | No | No | No | Month-to-month | Yes | 0.055556 | 1.0 | 0.558706 | 0.035303 |


  - Split the data into a training set and a test set.


In [61]:
y = churnData.Churn
# X = churnData_scaled.drop('Churn', axis=1)
X = numerical_scaled

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
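
With an imbalanced target, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. The call above does not do this, so the sketch below is only an illustration on made-up labels (`labels` and `features` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target with 75% 'No' and 25% 'Yes'.
labels = pd.Series(["No"] * 300 + ["Yes"] * 100, name="Churn")
features = pd.DataFrame({"x": np.arange(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, random_state=42
)

# Both splits keep the original share of 'Yes' rows.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```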

  - Fit a logistic regression model on the training data.

In [62]:
classification = LogisticRegression(random_state=42, solver='lbfgs',
                  multi_class='multinomial').fit(X_train, y_train)

  - Check the accuracy on the test data.

In [64]:
predictions = classification.predict(X_test)
classification.score(X_test, y_test)

0.7989778534923339
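
Accuracy alone can hide how the model does on the minority class. The sketch below shows the confusion matrix and per-class recall on synthetic data generated with `make_classification`; the data, the class weights and the variable names are all assumptions for illustration, not results from the lab data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 73% class 0 and 27% class 1.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=42)

model = LogisticRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred)
print(cm)

# Recall per class: the share of each true class that the model finds.
recalls = recall_score(y_te, pred, average=None)
print(recalls)
```

On the churn data, the recall for `Yes` would be the number to watch, since missing a churner is the costly mistake in the scenario.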

### Managing imbalance in the dataset

- Check for the imbalance.


In [65]:
y.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

Yes, the target is imbalanced: `No` outnumbers `Yes` by almost 3 to 1 (5174 vs 1869).
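
A quick way to quantify the imbalance is to turn the counts into shares and compare the model's accuracy with a majority-class baseline. The arithmetic below uses only the counts printed above; the variable names are made up:

```python
# Class counts reported by y.value_counts() above.
n_no, n_yes = 5174, 1869
total = n_no + n_yes

# Share of each class in the target.
share_no = n_no / total
share_yes = n_yes / total

# A model that always predicts 'No' is right exactly when the true label is 'No'.
baseline_accuracy = share_no
print(f"No: {share_no:.1%}, Yes: {share_yes:.1%}, baseline accuracy: {baseline_accuracy:.3f}")
```

On the full dataset the baseline comes out at about 0.735, so the 0.799 accuracy of the first model is only a modest improvement over always predicting `No`.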

- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.


In [72]:
category_0 = churnData_scaled[churnData_scaled['Churn'] == 'No']
category_1 = churnData_scaled[churnData_scaled['Churn'] == 'Yes']

**Downsampling**

In [73]:
category_0_undersampled = resample(category_0, 
                                   replace=False, 
                                   n_samples = len(category_1))

In [74]:
print(category_0_undersampled.shape)
print(category_1.shape)

(1869, 16)
(1869, 16)


In [76]:
data_downsampled = pd.concat([category_0_undersampled, category_1], axis=0)
data_downsampled['Churn'].value_counts()

No     1869
Yes    1869
Name: Churn, dtype: int64
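
Because `resample` draws rows at random, the selected rows (and any score computed from them) change from run to run unless `random_state` is set. A sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced frame: 8 'No' rows and 3 'Yes' rows.
toy = pd.DataFrame({
    "tenure": range(11),
    "Churn": ["No"] * 8 + ["Yes"] * 3,
})
majority = toy[toy["Churn"] == "No"]
minority = toy[toy["Churn"] == "Yes"]

# Downsample the majority class without replacement; the seed fixes the draw.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority], axis=0)

print(balanced["Churn"].value_counts())
```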

**Upsampling**

In [77]:
category_1_oversampled = resample(category_1, 
                                  replace=True, # the difference
                                  n_samples = len(category_0))

In [78]:
print(category_0.shape)
print(category_1_oversampled.shape)

(5174, 16)
(5174, 16)


In [79]:
data_upsampled = pd.concat([category_0, category_1_oversampled], axis=0)
data_upsampled['Churn'].value_counts()

No     5174
Yes    5174
Name: Churn, dtype: int64

- After each resampling, fit the model again and check its accuracy.


**Downsampling** modelling

In [80]:
numerical_down = data_downsampled[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]

y_down = data_downsampled.Churn
X_down = numerical_down

X_train_down, X_test_down, y_train_down, y_test_down = train_test_split(X_down, y_down, random_state=42)

classification = LogisticRegression(random_state=42, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_down, y_train_down)

predictions = classification.predict(X_test_down)
classification.score(X_test_down, y_test_down)

0.7262032085561497

**Upsampling** modelling

In [81]:
numerical_up = data_upsampled[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]

y_up = data_upsampled.Churn
X_up = numerical_up

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, random_state=42)

classification = LogisticRegression(random_state=42, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_up, y_train_up)

predictions = classification.predict(X_test_up)
classification.score(X_test_up, y_test_up)

0.7243911867027445
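
The two scores above are measured on test sets that were themselves resampled, so they are not directly comparable with the 0.799 obtained on the original class mix. After upsampling, copies of the same customer can also land in both the training and the test split. One way to avoid both effects is to split first and resample only the training rows. This is a sketch on synthetic data from `make_classification`, with hypothetical variable names, not a rerun of the lab:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 73% class 0 and 27% class 1.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=42)

# Upsample the minority class inside the training split only.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)])

# Train on the original split and on the balanced split,
# then score both models on the same untouched test rows.
for name, (X_fit, y_fit) in {"original": (X_tr, y_tr), "upsampled": (X_bal, y_bal)}.items():
    model = LogisticRegression().fit(X_fit, y_fit)
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), recall_score(y_te, pred))
```

Printing recall next to accuracy makes the trade-off visible: balancing the training data typically changes how many minority rows are found, which accuracy on its own does not show.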