# Lab | Imbalanced data 

We will be using the files_for_lab/customer_churn.csv dataset to build a churn predictor.


Load the dataset and explore the variables.
- We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen, and MonthlyCharges.
- Extract the target variable.
- Extract the independent variables and scale them.
- Build the logistic regression model.
- Evaluate the model.
- Even a simple model will give us more than 70% accuracy. Why?
- Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset, then build and evaluate the logistic regression model. Is there any improvement?
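
To make the phrase "adds new points between existing points" concrete, here is a minimal NumPy sketch of the interpolation idea behind SMOTE. It is not imblearn's implementation: the helper name `smote_like_oversample`, its parameters, and the random demo data are all assumptions made for illustration. For the lab itself, the library route is `imblearn.over_sampling.SMOTE` with its `fit_resample` method, applied to the training split only.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Illustrative helper (hypothetical): build n_new synthetic rows by
    interpolating between minority rows and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between all minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every minority sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # For each new point: pick a base sample and one of its neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    pick = rng.integers(0, neighbours.shape[1], size=n_new)
    neigh = neighbours[base, pick]

    # Place the new point at a random position on the segment base -> neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Made-up minority-class points, only to exercise the helper.
demo_rng = np.random.default_rng(42)
X_minority = demo_rng.random((20, 2))
X_new = smote_like_oversample(X_minority, n_new=30, k=5)
print(X_new.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.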


In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler 

In [2]:
pd.set_option('display.max_columns', None)
data=pd.read_csv('customer_churn.csv')
data

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [3]:
data = data[['tenure','SeniorCitizen', 'MonthlyCharges', 'Churn']]
data

,tenure,SeniorCitizen,MonthlyCharges,Churn
0,1,0,29.85,No
1,34,0,56.95,No
2,2,0,53.85,Yes
3,45,0,42.30,No
4,2,0,70.70,Yes
...,...,...,...,...
7038,24,0,84.80,No
7039,72,0,103.20,No
7040,11,0,29.60,No
7041,4,1,74.40,Yes


In [4]:
data.dtypes

tenure              int64
SeniorCitizen       int64
MonthlyCharges    float64
Churn              object
dtype: object

In [5]:
data['tenure'].value_counts()

1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: tenure, Length: 73, dtype: int64

In [31]:
data['SeniorCitizen'].value_counts()

0    5901
1    1142
Name: SeniorCitizen, dtype: int64

In [26]:
# Treat SeniorCitizen as categorical. assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by assigning into a slice.
data = data.assign(SeniorCitizen=data['SeniorCitizen'].astype(object))
data['SeniorCitizen']

0       0
1       0
2       0
3       0
4       0
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: SeniorCitizen, Length: 7043, dtype: object

In [8]:
data['MonthlyCharges'].value_counts()

20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
23.65      1
114.70     1
43.65      1
87.80      1
78.70      1
Name: MonthlyCharges, Length: 1585, dtype: int64

In [9]:
data.dtypes

tenure              int64
SeniorCitizen      object
MonthlyCharges    float64
Churn              object
dtype: object

In [10]:
# Split into features (X) and target (y)
X = data.drop('Churn', axis=1)
y = data['Churn']
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
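
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic data; the names `X_demo` and `y_demo` and the roughly 27% share of "Yes" labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data: roughly 27% "Yes" labels (assumed).
rng = np.random.default_rng(0)
y_demo = pd.Series(np.where(rng.random(1000) < 0.27, "Yes", "No"), name="Churn")
X_demo = pd.DataFrame({"tenure": rng.integers(0, 73, size=1000)})

# stratify=y_demo keeps the Yes/No ratio (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(y_tr.value_counts(normalize=True).round(3))
print(y_te.value_counts(normalize=True).round(3))
```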

In [27]:
y_train

2142     No
1623     No
6074    Yes
1362    Yes
6754     No
       ... 
3772    Yes
5191     No
5226     No
5390    Yes
860      No
Name: Churn, Length: 5634, dtype: object

In [22]:
# Min-max scale the numerical columns; the scaler is fitted on the training set only

transformer = MinMaxScaler().fit(X_train[['MonthlyCharges','tenure']])
X_norm = transformer.transform(X_train[['MonthlyCharges','tenure']])
print(X_norm.shape)
X_train_scale = pd.DataFrame(X_norm, columns=X_train[['MonthlyCharges','tenure']].columns)
X_train_scale

(5634, 2)


,MonthlyCharges,tenure
0,0.464375,0.291667
1,0.786746,0.750000
2,0.051819,0.013889
3,0.517688,0.055556
4,0.434978,0.000000
...,...,...
5629,0.764823,0.013889
5630,0.725959,0.319444
5631,0.028899,0.166667
5632,0.809168,0.166667


In [23]:
# Scale the test set with the scaler fitted on the training set (no refit)

X_norm = transformer.transform(X_test[['MonthlyCharges','tenure']])
print(X_norm.shape)
X_test_scale = pd.DataFrame(X_norm, columns=X_test[['MonthlyCharges','tenure']].columns)
X_test_scale

(1409, 2)


,MonthlyCharges,tenure
0,0.065272,0.013889
1,0.069756,0.569444
2,0.010962,0.722222
3,0.578974,0.013889
4,0.321873,0.930556
...,...,...
1404,0.498754,0.888889
1405,0.914798,0.708333
1406,0.016442,0.236111
1407,0.256104,0.958333


In [37]:
X_test

,tenure,SeniorCitizen,MonthlyCharges
0,1,0,24.80
1,41,0,25.25
2,52,0,19.35
3,1,0,76.35
4,67,0,50.55
...,...,...,...
1404,64,0,68.30
1405,51,0,110.05
1406,17,0,19.90
1407,69,0,43.95


In [None]:
# X_train categorical one-hot encoding, if necessary

In [None]:
# X_test categorical one-hot encoding, if necessary

In [28]:
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True) 


In [35]:
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True) 
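
The index resets matter because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The scaled frames were rebuilt from NumPy arrays and therefore carry a fresh 0..n-1 index, while the split frames keep their shuffled original labels. The toy frames below (hypothetical values; only the three index labels are taken from the `y_train` output) show what happens with and without the reset.

```python
import pandas as pd

# A frame that kept its shuffled index after the split (values are made up).
original = pd.DataFrame({"SeniorCitizen": [0, 1, 0]}, index=[2142, 1623, 6074])
# A frame rebuilt from a NumPy array gets a fresh 0..n-1 index.
scaled = pd.DataFrame({"tenure": [0.29, 0.75, 0.01]})

# Without a reset the labels do not match, so concat produces an outer join.
misaligned = pd.concat([scaled, original], axis=1)
# After the reset both frames share the labels 0, 1, 2 and line up row by row.
aligned = pd.concat([scaled, original.reset_index(drop=True)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

An alternative with the same effect would be to pass `index=X_train.index` when building the scaled DataFrame, which keeps the original labels on both sides.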

In [29]:
y_train

0        No
1        No
2       Yes
3       Yes
4        No
       ... 
5629    Yes
5630     No
5631     No
5632    Yes
5633     No
Name: Churn, Length: 5634, dtype: object

In [36]:
train_data = pd.concat([X_train_scale, X_train['SeniorCitizen'], y_train],axis=1)
train_data

,MonthlyCharges,tenure,SeniorCitizen,Churn
0,0.464375,0.291667,0,No
1,0.786746,0.750000,0,No
2,0.051819,0.013889,0,Yes
3,0.517688,0.055556,0,Yes
4,0.434978,0.000000,0,No
...,...,...,...,...
5629,0.764823,0.013889,0,Yes
5630,0.725959,0.319444,0,No
5631,0.028899,0.166667,0,No
5632,0.809168,0.166667,1,Yes


In [49]:
X_train_all_scaled = pd.concat([X_train_scale, X_train['SeniorCitizen']],axis=1)
X_train_all_scaled

,MonthlyCharges,tenure,SeniorCitizen
0,0.464375,0.291667,0
1,0.786746,0.750000,0
2,0.051819,0.013889,0
3,0.517688,0.055556,0
4,0.434978,0.000000,0
...,...,...,...
5629,0.764823,0.013889,0
5630,0.725959,0.319444,0
5631,0.028899,0.166667,0
5632,0.809168,0.166667,1


In [38]:
test_data = pd.concat([X_test_scale, X_test['SeniorCitizen'], y_test],axis=1)
test_data

,MonthlyCharges,tenure,SeniorCitizen,Churn
0,0.065272,0.013889,0,Yes
1,0.069756,0.569444,0,No
2,0.010962,0.722222,0,No
3,0.578974,0.013889,0,Yes
4,0.321873,0.930556,0,No
...,...,...,...,...
1404,0.498754,0.888889,0,No
1405,0.914798,0.708333,0,No
1406,0.016442,0.236111,0,No
1407,0.256104,0.958333,0,No


In [55]:
X_test_all_scaled = pd.concat([X_test_scale, X_test['SeniorCitizen']],axis=1)
X_test_all_scaled

,MonthlyCharges,tenure,SeniorCitizen
0,0.065272,0.013889,0
1,0.069756,0.569444,0
2,0.010962,0.722222,0
3,0.578974,0.013889,0
4,0.321873,0.930556,0
...,...,...,...
1404,0.498754,0.888889,0
1405,0.914798,0.708333,0
1406,0.016442,0.236111,0
1407,0.256104,0.958333,0


In [56]:
LR = LogisticRegression(random_state=0, solver='lbfgs')
LR.fit(X_train_all_scaled, y_train)
LR.score(X_test_all_scaled, y_test)

0.8041163946061036
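
The manual scale, reset and concat steps can also be bundled into a single scikit-learn pipeline. This is an alternative to the approach used in this notebook, not a description of it. The sketch assumes synthetic data with the lab's column names and a made-up churn rule, so its score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with the same column names as the lab data.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "SeniorCitizen": rng.integers(0, 2, size=n),
    "MonthlyCharges": rng.uniform(18, 120, size=n).round(2),
})
# Made-up rule so the target is learnable: short tenure means churn.
demo["Churn"] = np.where(demo["tenure"] < 12, "Yes", "No")

X_demo = demo.drop(columns="Churn")
y_demo = demo["Churn"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Scale the two numerical columns; pass SeniorCitizen (already 0/1) through.
preprocess = ColumnTransformer(
    [("scale", MinMaxScaler(), ["MonthlyCharges", "tenure"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression())

# fit() learns the scaler on the training rows only; score() reuses it on the test rows.
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```

Since the pipeline works on the original DataFrames, no index reset or concatenation is needed.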

In [51]:
LR.predict_proba(X_train_all_scaled)

array([[0.71317004, 0.28682996],
       [0.84319759, 0.15680241],
       [0.74683136, 0.25316864],
       ...,
       [0.84841088, 0.15158912],
       [0.23910814, 0.76089186],
       [0.92351195, 0.07648805]])
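
The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, so with the labels "No" and "Yes" the second column is the estimated churn probability. The sketch below, on synthetic data with assumed names (`clf_demo`, `X_demo`, `y_demo`), shows how to read that column and how lowering the decision threshold flags more rows as "Yes". The 0.3 threshold is an arbitrary illustration, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced two-class data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0.8, "Yes", "No")

clf_demo = LogisticRegression().fit(X_demo, y_demo)
proba = clf_demo.predict_proba(X_demo)

# classes_ gives the column order of predict_proba.
print(clf_demo.classes_)
yes_col = list(clf_demo.classes_).index("Yes")
p_yes = proba[:, yes_col]

# For two classes, predict() corresponds to a 0.5 threshold on P(Yes).
default_pred = np.where(p_yes > 0.5, "Yes", "No")
# A lower threshold trades precision for recall on the minority class.
lower_pred = np.where(p_yes > 0.3, "Yes", "No")

print((default_pred == "Yes").sum(), (lower_pred == "Yes").sum())
```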

In [52]:
y_train

0        No
1        No
2       Yes
3       Yes
4        No
       ... 
5629    Yes
5630     No
5631     No
5632    Yes
5633     No
Name: Churn, Length: 5634, dtype: object

In [57]:
from sklearn.linear_model import LogisticRegression

classification = LogisticRegression(solver='saga').fit(X_train_all_scaled, y_train)
predictions = classification.predict(X_test_all_scaled)
classification.score(X_test_all_scaled, y_test)

0.8041163946061036

In [58]:
pd.Series(y_test).value_counts()

No     1036
Yes     373
Name: Churn, dtype: int64
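
These counts answer the question of why even a simple model exceeds 70% accuracy: a model that always predicts the majority class "No" is already right for every "No" row. A quick check using the test-set counts shown above:

```python
# Class counts copied from the y_test.value_counts() output above.
n_no, n_yes = 1036, 373

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = max(n_no, n_yes) / (n_no + n_yes)
print(round(baseline_accuracy, 4))  # 0.7353
```

The logistic regression's 0.804 should therefore be read against a baseline of about 0.735, not against 0.5.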

In [59]:
pd.Series(predictions).value_counts()

No     1158
Yes     251
dtype: int64

In [61]:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(n_neighbors=6, weights='uniform')
clf.fit(X_train_all_scaled, y_train)
predictions_clf = clf.predict(X_test_all_scaled)
clf.score(X_test_all_scaled, y_test)


0.7835344215755855

In [62]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions_clf)

array([[953,  83],
       [222, 151]])
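
Accuracy hides how the classifier treats the minority class, so it helps to read recall and precision for "Yes" off the confusion matrix. The sketch below uses the matrix printed above and assumes scikit-learn's default layout: rows are actual labels, columns are predicted labels, in sorted order "No", "Yes".

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: actual (No, Yes); columns: predicted (No, Yes).
cm = np.array([[953, 83],
               [222, 151]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all rows predicted correctly
recall_yes = tp / (tp + fn)       # share of actual churners that were found
precision_yes = tp / (tp + fp)    # share of predicted churners that really churned

print(round(accuracy, 4), round(recall_yes, 4), round(precision_yes, 4))
```

The accuracy matches the reported score of about 0.78, but recall for "Yes" is only about 0.40: the KNN model misses roughly six out of ten churners in the test set.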

In [63]:
pd.Series(y_test).value_counts()

No     1036
Yes     373
Name: Churn, dtype: int64

In [64]:
pd.Series(predictions_clf).value_counts()

No     1175
Yes     234
dtype: int64