# Applying Balancing Techniques on a Telecom Dataset

Now that we have seen different balancing techniques, let's apply them to a new dataset on telecom customer churn. This dataset is available on GitHub.

This dataset has various variables related to the usage level of a mobile connection, such as total call minutes, call charges, calls made during certain periods of the day, details of international calls, and details of calls to customer services.

The problem statement is to predict whether a customer will churn. The dataset is highly imbalanced: customers who churn are the minority class. You will be using this dataset in the following activity.

# Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset

You are working as a data scientist for a telecom company. You have encountered a dataset that is highly imbalanced, and you want to correct the class imbalance before fitting the classifier to analyze the churn. You know different methods for correcting the imbalance in datasets and you want to compare them to find the best method before fitting the model.

In this activity, you need to implement all three methods that you have come across so far (random undersampling, SMOTE, and MSMOTE) and compare the results.

Note: You will be using the telecom churn dataset that you used in Chapter 10, Analyzing a Dataset.

Use the `MinMaxScaler` class to scale the dataset instead of the robust scaler you have been using so far. Compare the methods based on the results you get by fitting a logistic regression model on the dataset.
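
As a reminder of what min-max scaling does, the sketch below rescales one feature using the minimum and maximum learned from the training values only, then reuses those statistics for the test values. It is a plain-Python illustration with made-up numbers, not the scikit-learn implementation, and the helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of a training feature."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Made-up call-minute values, for illustration only.
train_minutes = [100.0, 150.0, 200.0, 300.0]
test_minutes = [50.0, 250.0]

# Fit on the training values only, then reuse lo and hi for the test values.
lo, hi = fit_min_max(train_minutes)
train_scaled = apply_min_max(train_minutes, lo, hi)
test_scaled = apply_min_max(test_minutes, lo, hi)

print(train_scaled)  # training values fall inside [0, 1]
print(test_scaled)   # test values can fall outside [0, 1]
```

Fitting on the training split alone, as the activity does with `scaler.fit(X_train)`, keeps information about the test set out of the preprocessing step.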

In [66]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import smote_variants as sv 
from sklearn.utils import shuffle

In [2]:
df = pd.read_csv('../Dataset/churn.csv')
df

Unnamed: 0,churn,accountlength,internationalplan,voicemailplan,numbervmailmessages,totaldayminutes,totaldaycalls,totaldaycharge,totaleveminutes,totalevecalls,totalevecharge,totalnightminutes,totalnightcalls,totalnightcharge,totalintlminutes,totalintlcalls,totalintlcharge,numbercustomerservicecalls
0,No,128,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1
1,No,107,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1
2,No,137,no,no,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0
3,No,84,yes,no,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,No,75,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,No,50,no,yes,40,235.7,127,40.07,223.0,126,18.96,297.5,116,13.39,9.9,5,2.67,2
4996,Yes,152,no,no,0,184.2,90,31.31,256.8,73,21.83,213.6,113,9.61,14.7,2,3.97,3
4997,No,61,no,no,0,140.6,89,23.90,172.8,128,14.69,212.4,97,9.56,13.6,4,3.67,1
4998,No,109,no,no,0,188.8,67,32.10,171.7,92,14.59,224.4,89,10.10,8.5,6,2.30,0


In [3]:
df['internationalplan'] = df['internationalplan'].apply(lambda x: 1 if x == 'yes' else 0)
df['voicemailplan'] = df['voicemailplan'].apply(lambda x: 1 if x == 'yes' else 0)
df['churn'] = df['churn'].apply(lambda x: 1 if x == 'Yes' else 0)
df

Unnamed: 0,churn,accountlength,internationalplan,voicemailplan,numbervmailmessages,totaldayminutes,totaldaycalls,totaldaycharge,totaleveminutes,totalevecalls,totalevecharge,totalnightminutes,totalnightcalls,totalnightcharge,totalintlminutes,totalintlcalls,totalintlcharge,numbercustomerservicecalls
0,0,128,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1
1,0,107,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1
2,0,137,0,0,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0
3,0,84,1,0,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,0,75,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0,50,0,1,40,235.7,127,40.07,223.0,126,18.96,297.5,116,13.39,9.9,5,2.67,2
4996,1,152,0,0,0,184.2,90,31.31,256.8,73,21.83,213.6,113,9.61,14.7,2,3.97,3
4997,0,61,0,0,0,140.6,89,23.90,172.8,128,14.69,212.4,97,9.56,13.6,4,3.67,1
4998,0,109,0,0,0,188.8,67,32.10,171.7,92,14.59,224.4,89,10.10,8.5,6,2.30,0


In [5]:
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4000, 17)
(1000, 17)
(4000,)
(1000,)


In [44]:
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

In [49]:
y_train = y_train.reset_index().drop('index', axis=1)

In [52]:
# check imbalance
print(y_train.churn[y_train.churn==1].count())
print(y_train.churn[y_train.churn==0].count())

553
3447
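
To put these counts in perspective, the short sketch below derives the minority share, the imbalance ratio, and the accuracy that a trivial "always predict no churn" model would reach on the training set. The counts are the ones printed above; the variable names are only illustrative.

```python
# Class counts taken from the training-set output above.
n_churn = 553
n_no_churn = 3447

total = n_churn + n_no_churn
minority_share = n_churn / total          # fraction of churners
imbalance_ratio = n_no_churn / n_churn    # majority rows per minority row

# Accuracy of a model that always predicts "no churn" on this training set.
baseline_accuracy = n_no_churn / total

print(f"Minority share: {minority_share:.1%}")
print(f"Imbalance ratio: {imbalance_ratio:.2f} : 1")
print(f"Always-'no churn' accuracy: {baseline_accuracy:.1%}")
```

A high baseline accuracy of this kind is the reason accuracy alone is a weak yardstick for comparing the balancing techniques.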


## Undersampling

In [57]:
# minority class
ndx_min = y_train[y_train.churn==1].index
X_train_min = X_train.iloc[ndx_min, :]
y_train_min = y_train.iloc[ndx_min, :]
print(X_train_min.shape)
print(y_train_min.shape)

(553, 17)
(553, 1)


In [58]:
# majority class
ndx_maj = y_train[y_train.churn==0].index
X_train_maj = X_train.iloc[ndx_maj, :]
y_train_maj = y_train.iloc[ndx_maj, :]
print(X_train_maj.shape)
print(y_train_maj.shape)

(3447, 17)
(3447, 1)


In [62]:
# sample majority class
X_train_maj_sample = X_train_maj.sample(n=len(ndx_min),random_state = 123)
y_train_maj_sample = y_train.iloc[X_train_maj_sample.index, :]
print(X_train_maj_sample.shape)
print(y_train_maj_sample.shape)

(553, 17)
(553, 1)


In [69]:
# concat
X_train_us = pd.concat([X_train_min, X_train_maj_sample], axis=0)
y_train_us = pd.concat([y_train_min, y_train_maj_sample], axis=0)
# shuffle (fixed random_state so the row order is reproducible)
temp_data = pd.concat([X_train_us, y_train_us], axis=1)
temp_data = shuffle(temp_data, random_state=123)
# re_separate
X_train_us = temp_data.iloc[:, :-1]
y_train_us = temp_data.iloc[:, -1]

print(X_train_us.shape)
print(y_train_us.shape)

(1106, 17)
(1106,)
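
The undersampling steps above (split by class, sample the majority class down to the minority size, concatenate, shuffle) can be condensed into one function. The following is a minimal standard-library sketch on toy data, assuming the minority class is labelled 1; `random_undersample` is a hypothetical helper, not part of pandas or scikit-learn.

```python
import random


def random_undersample(rows, labels, minority_label=1, seed=123):
    """Keep every minority row and an equally sized random sample of the rest."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    rng = random.Random(seed)
    # Sample without replacement so no majority row is duplicated.
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))

    kept = minority_idx + kept_majority
    rng.shuffle(kept)  # avoid having all minority rows first
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 3 churners among 12 customers (values are made up).
rows = [[float(i)] for i in range(12)]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

rows_us, labels_us = random_undersample(rows, labels)
print(len(rows_us), labels_us.count(1), labels_us.count(0))
```

Selecting row indices first and then indexing both the features and the labels with the same list keeps each row paired with its own label.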


In [90]:
churnModel1 = LogisticRegression()
churnModel1.fit(X_train_us, y_train_us)
y_pred1 = churnModel1.predict(X_test)

## SMOTE

In [73]:
oversampler_smote = sv.SMOTE(random_state=123)
X_train_smote, y_train_smote = oversampler_smote.sample(X_train.values, y_train.values.flatten())
print(X_train_smote.shape)
print(y_train_smote.shape)

2020-09-21 21:24:58,793:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': 123}")
(6894, 17)
(6894,)


In [89]:
print(sum(y_train_smote==1))
print(sum(y_train_smote==0))

3447
3447
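
The new minority rows that bring the churn class up to 3447 are synthetic. The sketch below illustrates the core SMOTE idea of interpolating between a minority point and one of its nearest minority neighbours. It is a simplified standard-library illustration on made-up points, not the `smote_variants` implementation; `smote_like_sample` and its parameters are hypothetical.

```python
import math
import random


def smote_like_sample(minority, n_new, k=2, seed=123):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up, already scaled minority points with two features.
minority = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.1], [0.8, 0.9]]
new_points = smote_like_sample(minority, n_new=5)

for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two existing minority points, the new rows stay inside the region already occupied by the minority class.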


In [94]:
churnModel2 = LogisticRegression()
churnModel2.fit(X_train_smote, y_train_smote)
y_pred2 = churnModel2.predict(X_test)

## MSMOTE

In [92]:
oversampler_msmote = sv.MSMOTE(random_state=123)
X_train_msmote, y_train_msmote = oversampler_msmote.sample(X_train.values, y_train.values.flatten())
print(X_train_msmote.shape)
print(y_train_msmote.shape)

2020-09-21 21:32:47,574:INFO:MSMOTE: Running sampling via ('MSMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': 123}")
(6894, 17)
(6894,)


In [93]:
print(sum(y_train_msmote==1))
print(sum(y_train_msmote==0))

3447
3447


In [95]:
churnModel3 = LogisticRegression()
churnModel3.fit(X_train_msmote, y_train_msmote)
y_pred3 = churnModel3.predict(X_test)

## Metrics

In [96]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [101]:
print('ChurnModel1')
print(f'Accuracy: {accuracy_score(y_test, y_pred1)}')
print(confusion_matrix(y_test, y_pred1))
print(classification_report(y_test, y_pred1))

print('\nChurnModel2')
print(f'Accuracy: {accuracy_score(y_test, y_pred2)}')
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))


print('\nChurnModel3')
print(f'Accuracy: {accuracy_score(y_test, y_pred3)}')
print(confusion_matrix(y_test, y_pred3))
print(classification_report(y_test, y_pred3))

ChurnModel1
Accuracy: 0.799
[[674 172]
 [ 29 125]]
              precision    recall  f1-score   support

           0       0.96      0.80      0.87       846
           1       0.42      0.81      0.55       154

    accuracy                           0.80      1000
   macro avg       0.69      0.80      0.71      1000
weighted avg       0.88      0.80      0.82      1000


ChurnModel2
Accuracy: 0.794
[[672 174]
 [ 32 122]]
              precision    recall  f1-score   support

           0       0.95      0.79      0.87       846
           1       0.41      0.79      0.54       154

    accuracy                           0.79      1000
   macro avg       0.68      0.79      0.70      1000
weighted avg       0.87      0.79      0.82      1000


ChurnModel3
Accuracy: 0.802
[[682 164]
 [ 34 120]]
              precision    recall  f1-score   support

           0       0.95      0.81      0.87       846
           1       0.42      0.78      0.55       154

    accuracy               

In this activity, we balanced the telecom churn dataset using random undersampling, SMOTE, and MSMOTE, and fitted a logistic regression model on each balanced training set. From the results, we can see that MSMOTE has the highest accuracy (80.2%), followed by random undersampling (79.9%) and SMOTE (79.4%). However, accuracy alone is misleading on an imbalanced dataset, so it is important to look at the recall values, especially of the minority class. From the recall values, we see that random undersampling has the largest value, 81%. This means that 81% of the customers who actually churned in the test set were correctly identified by the model. SMOTE and MSMOTE have slightly lower recall values of 79% and 78%, respectively.

We now have a situation where MSMOTE has the highest accuracy but the lowest minority-class recall. In such a situation, we look at the F1 score, which is the harmonic mean of precision and recall. For the minority class, random undersampling and MSMOTE both score 0.55, and SMOTE scores 0.54. On these results, no technique is a clear winner: MSMOTE is marginally ahead on accuracy, while random undersampling is marginally ahead on minority-class recall and matches MSMOTE on F1 score. Because identifying churners is the goal, random undersampling is a reasonable choice in this context, although the differences between the three techniques are small.
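
The minority-class metrics used in this comparison can be recomputed directly from the three confusion matrices printed in the Metrics section. The sketch below does so in plain Python, assuming the scikit-learn layout `[[TN, FP], [FN, TP]]`; `minority_metrics` is a hypothetical helper name.

```python
def minority_metrics(confusion):
    """Return accuracy plus precision, recall, and F1 for class 1.

    `confusion` follows the scikit-learn layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Confusion matrices copied from the Metrics output.
results = {
    "Undersampling": [[674, 172], [29, 125]],
    "SMOTE": [[672, 174], [32, 122]],
    "MSMOTE": [[682, 164], [34, 120]],
}

for name, confusion in results.items():
    acc, prec, rec, f1 = minority_metrics(confusion)
    print(f"{name:<14} acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

Printing three decimal places makes the small gaps between the techniques visible, which the two-decimal classification report rounds away.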