In [2]:
import numpy as np
import pandas as pd

# Introduction

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class. We start by building a simple model without balancing the data, following the steps listed below.

# Challenge 1 - Loading and Extracting Features from the First Dataset

#### In this challenge, our goals are: 

* Import the required libraries and modules.
* Read the data into Python and call the dataframe churnData.
* Check the datatypes of all the columns in the data. You will see that the column TotalCharges is of object type. Convert this column into a numeric type using the pd.to_numeric function.
* Check for null values in the dataframe. Replace the null values.
* Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges.
* Split the data into a training set and a test set.
* Scale the features using either MinMaxScaler or StandardScaler.
* (Optional) Encode the categorical variables so you can use them for modeling later.

#### The dataset contains information describing each customer, including whether the customer churned.


In [20]:
churnData=pd.read_csv('DATA_Customer-Churn.txt')

#### Examine all variables and their types in the following cell

In [21]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [57]:
# A direct cast fails because some TotalCharges entries are blank strings (' ')
churnData['TotalCharges'] = churnData['TotalCharges'].astype(float)

ValueError: could not convert string to float: ' '

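#### Since only three columns are numeric at this point (`SeniorCitizen`, `tenure`, `MonthlyCharges`), `describe()` is of limited use; let's look at the first 5 rows using the `head()` function

In [22]:
# describe is referenced without parentheses here, so the output is the bound method's repr, not summary statistics
churnData.describe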

<bound method NDFrame.describe of       gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0     Female              0     Yes         No       1           No   
1       Male              0      No         No      34          Yes   
2       Male              0      No         No       2          Yes   
3       Male              0      No         No      45           No   
4     Female              0      No         No       2          Yes   
...      ...            ...     ...        ...     ...          ...   
7038    Male              0     Yes        Yes      24          Yes   
7039  Female              0     Yes        Yes      72          Yes   
7040  Female              0     Yes        Yes      11           No   
7041    Male              1     Yes         No       4          Yes   
7042    Male              0      No         No      66          Yes   

     OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV  \
0                No          Yes    

In [41]:
churnData.head(5)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn,TotalCharges_numeric
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,0,29.85
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,0,1889.5
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,1,108.15
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,0,1840.75
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,1,151.65


In [40]:
churnData['Churn']= churnData['Churn'].replace({'No': 0, 'Yes': 1}).astype(int)

In [None]:
# dtype is an attribute, not a method, so it takes no parentheses
churnData['Churn'].dtype

#### We can see that the `TotalCharges` column could be coerced to numeric.

To find out which values cause this column to be of object type, we look for its non-numeric entries. To do this, we recall the `to_numeric()` function. With `errors='coerce'`, this function converts all non-numeric data to null. We can then use the `isnull()` function to subset our dataframe using the True/False column that this function generates.

In the cell below, transform the `TotalCharges` column to numeric and assign the result to a new column, `TotalCharges_numeric`. Make sure to coerce the errors.

In [24]:
# errors='coerce' turns values that cannot be parsed (such as ' ') into NaN instead of raising
churnData['TotalCharges_numeric'] = pd.to_numeric(churnData.TotalCharges, errors='coerce')

In [25]:
churnData['TotalCharges_numeric'].unique()

array([  29.85, 1889.5 ,  108.15, ...,  346.45,  306.6 , 6844.5 ])
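To see which raw values were turned into nulls, the `isnull()` mask can be used to subset the original strings. The following is a minimal sketch on a made-up Series (the values and the name `charges` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for a column that was read as strings (values are made up)
charges = pd.Series(["29.85", " ", "108.15", "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
charges_numeric = pd.to_numeric(charges, errors="coerce")

# Use the True/False mask from isnull() to pull out the offending raw values
bad_rows = charges[charges_numeric.isnull()]
print(bad_rows.tolist())
```

The same pattern applied to `churnData` would be `churnData[churnData['TotalCharges_numeric'].isnull()]`.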

Also check the type of the new column. `TotalCharges_numeric` should be of `float64` type now.

In [26]:
churnData['TotalCharges_numeric'].dtypes


dtype('float64')

In [27]:
churnData.isna().sum()

gender                   0
SeniorCitizen            0
Partner                  0
Dependents               0
tenure                   0
PhoneService             0
OnlineSecurity           0
OnlineBackup             0
DeviceProtection         0
TechSupport              0
StreamingTV              0
StreamingMovies          0
Contract                 0
MonthlyCharges           0
TotalCharges             0
Churn                    0
TotalCharges_numeric    11
dtype: int64

In [87]:
churnData_new = churnData.dropna()
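The cell above drops the rows with nulls. Since the challenge asks to replace them, one alternative is to fill the nulls with the column median. This is only a sketch on a toy frame; choosing the median as the fill value is an assumption, not something the lab prescribes:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing charge (values are made up)
df = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges_numeric": [29.85, np.nan, 1889.5],
})

# median() skips NaN by default, so it is computed from the known values only
median_charge = df["TotalCharges_numeric"].median()

# Fill on a copy so the original frame stays untouched
df_filled = df.copy()
df_filled["TotalCharges_numeric"] = df_filled["TotalCharges_numeric"].fillna(median_charge)
print(df_filled)
```

Unlike `dropna()`, this keeps all rows, at the cost of inserting an estimated value.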


In [88]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

In [89]:
X = churnData_new[['tenure', 'SeniorCitizen','MonthlyCharges','TotalCharges_numeric']]
y = churnData_new['Churn']

In [90]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state= 100)

In [91]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [92]:
from sklearn.preprocessing import StandardScaler
# all features are numeric, so no need to split into _num and _cat
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train),columns=X.columns)

# we can immediately transform the X_test as well
X_test_scaled = pd.DataFrame(scaler.transform(X_test),columns=X.columns)
X_train_scaled.head(5)

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges_numeric
0,-1.201477,-0.437013,-0.483666,-0.937173
1,-1.160653,-0.437013,-1.47288,-0.966413
2,0.839728,2.288262,-0.340687,0.249914
3,1.615386,-0.437013,0.00346,1.071889
4,0.186543,-0.437013,-1.491168,-0.691863


In [82]:
#from sklearn.impute import SimpleImputer
#imp = SimpleImputer(strategy="most_frequent")
#print(imp.fit_transform(churnData))
#X = churnData[['tenure', 'SeniorCitizen','MonthlyCharges','TotalCharges_numeric']]
#y = churnData['Churn']

[['Female' 0 'Yes' ... '29.85' 0 29.85]
 ['Male' 0 'No' ... '1889.5' 0 1889.5]
 ['Male' 0 'No' ... '108.15' 1 108.15]
 ...
 ['Female' 0 'Yes' ... '346.45' 0 346.45]
 ['Male' 1 'Yes' ... '306.6' 1 306.6]
 ['Male' 0 'No' ... '6844.5' 0 6844.5]]


In [93]:
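from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
# Note: the model is fit on the unscaled X_train; X_train_scaled and X_test_scaled from above are not used here
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)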

In [96]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.7377398720682303
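KNN is distance-based, so features on larger scales dominate unless the inputs are scaled. One way to make sure the scaler is always applied is to wrap it with the classifier in a pipeline. The sketch below uses synthetic data from `make_classification` (an assumption made so the example is self-contained), not the churn data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four numeric features and a binary target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=100)

# The pipeline fits the scaler on the training data only and reuses it at predict time
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_scaled.fit(X_tr, y_tr)

acc = knn_scaled.score(X_te, y_te)
print(acc)
```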

In [98]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[864, 149],
       [220, 174]], dtype=int64)

In [104]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
print("precision: ",precision_score(y_test,y_pred))
print("recall: ",recall_score(y_test,y_pred))
print("f1: ",f1_score(y_test,y_pred))

precision:  0.5386996904024768
recall:  0.4416243654822335
f1:  0.48535564853556484
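These scores can be checked by hand from the confusion matrix above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so the four counts can be read off directly:

```python
# Counts read from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 864, 149, 220, 174

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted churners who really churned
recall = tp / (tp + fn)                      # share of actual churners the model found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: ", accuracy)
print("precision: ", precision)
print("recall: ", recall)
print("f1: ", f1)
```

The recall of about 0.44 means more than half of the actual churners are missed, which the accuracy of about 0.74 hides because the majority class dominates.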


In [99]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

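In [100]:
# Note: y_pred still holds the KNN predictions, so the scores below repeat the KNN results.
# Scoring the tree requires predicting with clf first, e.g. clf.predict(X_test).
accuracy_score(y_test, y_pred)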

0.7377398720682303

In [101]:
confusion_matrix(y_test, y_pred)

array([[864, 149],
       [220, 174]], dtype=int64)

In [105]:
print("precision: ",precision_score(y_test,y_pred))
print("recall: ",recall_score(y_test,y_pred))
print("f1: ",f1_score(y_test,y_pred))

precision:  0.5386996904024768
recall:  0.4416243654822335
f1:  0.48535564853556484
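The decision tree scores above are identical to the KNN scores because `y_pred` was not recomputed after fitting `clf`. A minimal sketch of keeping one prediction array per model follows; it runs on synthetic data, and the names `y_pred_knn` and `y_pred_tree` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features and target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=100)

knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
tree_clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Each model gets its own predictions, so the metrics cannot be mixed up
y_pred_knn = knn_clf.predict(X_te)
y_pred_tree = tree_clf.predict(X_te)

f1_knn = f1_score(y_te, y_pred_knn)
f1_tree = f1_score(y_te, y_pred_tree)
print("knn f1: ", f1_knn)
print("tree f1: ", f1_tree)
```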


In [106]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import absolute
from numpy import sqrt

In [107]:
from sklearn.linear_model import LinearRegression
cv = KFold(n_splits=10, random_state=1, shuffle=True)
model = LinearRegression()
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
mean(absolute(scores))

0.30850033865406995

In [113]:
#cv_results = cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc')
#cv_results

array([0.81099183, 0.8144133 , 0.81414858])
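For a binary target such as `Churn`, cross-validation is usually run with a classifier and a classification metric. The sketch below shows one way to do that with `roc_auc` scoring and stratified folds. The synthetic data and the roughly 73/27 class split are assumptions chosen to resemble the imbalance in this lab:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=1,
)

# Stratified folds keep the class ratio roughly the same in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
clf = LogisticRegression(max_iter=1000)

auc_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(auc_scores)
print(auc_scores.mean())
```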

In [109]:
from sklearn.ensemble import RandomForestRegressor
 
# create regressor object
randomregressor = RandomForestRegressor(n_estimators=100, random_state=0)
 
# fit the regressor with x and y data
randomregressor.fit(X_train, y_train)

In [115]:
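from sklearn.model_selection import GridSearchCV
# Note: model is still the LinearRegression defined above, which has no C parameter;
# a grid over C needs an estimator that has it, such as LogisticRegression
grid_search_cv = GridSearchCV(model, {'C': [.01, .1, 1, 10, 100]}, cv=3, scoring='roc_auc')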

In [117]:
from sklearn.metrics import roc_auc_score
grid_search_cv.fit(X_train, y_train)

ValueError: Invalid parameter 'C' for estimator LinearRegression(). Valid parameters are: ['copy_X', 'fit_intercept', 'n_jobs', 'positive'].
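The error arises because `C` is a regularization parameter of `LogisticRegression`, not of `LinearRegression`. A sketch of the same grid search with a logistic regression is shown below, again on synthetic data so that it runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

# Same grid as above, but on an estimator that actually exposes C
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="roc_auc")
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

`best_params_` holds the value of `C` with the highest mean cross-validated ROC AUC.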

In [120]:
# separate majority/minority classes
no_churn = churnData[churnData['Churn']==0]
yes_churn = churnData[churnData['Churn']==1]

In [121]:
display(no_churn.shape)
display(yes_churn.shape)

(5174, 17)

(1869, 17)

In [122]:
yes_churn.shape

(1869, 17)

In [127]:
from sklearn.utils import resample
yes_churn_oversampled = resample(yes_churn, #<- sample from here
                                    replace=True, #<- we need replacement, since we don't have enough data otherwise
                                    n_samples = len(no_churn),#<- make both sets the same size
                                    random_state=0)

In [128]:
yes_churn.groupby(yes_churn.columns.tolist(),as_index=False).size()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn,TotalCharges_numeric,size
0,Female,0,No,No,1,No,No,No,No,No,No,No,Month-to-month,24.60,24.6,1,24.60,1
1,Female,0,No,No,1,No,No,No,No,No,No,No,Month-to-month,25.10,25.1,1,25.10,1
2,Female,0,No,No,1,No,No,No,No,No,No,No,Month-to-month,25.20,25.2,1,25.20,1
3,Female,0,No,No,1,No,No,No,No,No,No,Yes,Month-to-month,35.05,35.05,1,35.05,1
4,Female,0,No,No,1,No,No,No,No,No,Yes,No,Month-to-month,34.70,34.7,1,34.70,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1845,Male,1,Yes,Yes,51,Yes,No,Yes,Yes,No,No,No,Month-to-month,84.20,4146.05,1,4146.05,1
1846,Male,1,Yes,Yes,56,Yes,Yes,No,Yes,No,Yes,Yes,One year,104.55,5794.65,1,5794.65,1
1847,Male,1,Yes,Yes,66,Yes,No,No,Yes,No,Yes,Yes,Month-to-month,99.50,6822.15,1,6822.15,1
1848,Male,1,Yes,Yes,66,Yes,Yes,Yes,Yes,Yes,Yes,No,Two year,79.40,5154.6,1,5154.60,1


In [129]:
yes_churn_oversampled.groupby(yes_churn_oversampled.columns.tolist(),as_index=False).size()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn,TotalCharges_numeric,size
0,Female,0,No,No,1,No,No,No,No,No,No,No,Month-to-month,24.60,24.6,1,24.60,1
1,Female,0,No,No,1,No,No,No,No,No,No,No,Month-to-month,25.10,25.1,1,25.10,3
2,Female,0,No,No,1,No,No,No,No,No,No,No,Month-to-month,25.20,25.2,1,25.20,2
3,Female,0,No,No,1,No,No,No,No,No,No,Yes,Month-to-month,35.05,35.05,1,35.05,5
4,Female,0,No,No,1,No,No,No,No,No,Yes,No,Month-to-month,34.70,34.7,1,34.70,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1733,Male,1,Yes,Yes,51,Yes,No,Yes,Yes,No,No,No,Month-to-month,84.20,4146.05,1,4146.05,2
1734,Male,1,Yes,Yes,56,Yes,Yes,No,Yes,No,Yes,Yes,One year,104.55,5794.65,1,5794.65,3
1735,Male,1,Yes,Yes,66,Yes,No,No,Yes,No,Yes,Yes,Month-to-month,99.50,6822.15,1,6822.15,3
1736,Male,1,Yes,Yes,66,Yes,Yes,Yes,Yes,Yes,Yes,No,Two year,79.40,5154.6,1,5154.60,2


In [130]:
display(no_churn.shape)
display(yes_churn_oversampled.shape)

(5174, 17)

(5174, 17)
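The effect of `resample` with `replace=True` is easier to see on a tiny frame. In this sketch (toy data, illustrative names), three minority rows are drawn with replacement until they match the seven majority rows:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 7 majority rows (Churn == 0) and 3 minority rows (Churn == 1)
toy = pd.DataFrame({"x": list(range(10)), "Churn": [0] * 7 + [1] * 3})
majority = toy[toy["Churn"] == 0]
minority = toy[toy["Churn"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

balanced = pd.concat([majority, minority_up], axis=0)
print(balanced["Churn"].value_counts())
```

Every row in `minority_up` is a copy of one of the three original minority rows, which is why the grouped sizes above are greater than 1 after oversampling.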

In [148]:
churnData_oversampled = pd.concat([no_churn,yes_churn_oversampled],axis=0)
churnData_oversampled.tail()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn,TotalCharges_numeric
7018,Male,0,Yes,Yes,1,Yes,No,No,No,No,No,No,Month-to-month,70.65,70.65,1,70.65
5787,Female,0,No,No,36,Yes,No,No,Yes,Yes,No,Yes,Month-to-month,87.55,3078.1,1,3078.1
5906,Female,1,No,No,14,Yes,No,Yes,No,No,No,No,Month-to-month,78.95,1101.85,1,1101.85
1651,Male,1,No,No,1,Yes,No,No,No,No,Yes,No,Month-to-month,79.1,79.1,1,79.1
3090,Male,0,No,No,1,Yes,No,No,No,No,Yes,No,Month-to-month,53.5,53.5,1,53.5


In [149]:
churnData_oversampled['gender']= churnData_oversampled ['gender'].replace({'Male': 0, 'Female': 1}).astype(int)

In [154]:
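churnData_oversampled['Partner'] = churnData_oversampled['Partner'].replace({'No': 0, 'Yes': 1}).astype(int)
churnData_oversampled['Dependents'] = churnData_oversampled['Dependents'].replace({'No': 0, 'Yes': 1}).astype(int)
# Note: the right-hand side reads 'Partner', so PhoneService becomes a copy of Partner;
# encoding the real column would read churnData_oversampled['PhoneService'] instead
churnData_oversampled['PhoneService'] = churnData_oversampled['Partner'].replace({'No': 0, 'Yes': 1}).astype(int)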

In [158]:
#churnData_oversampled['OnlineSecurity']= churnData_oversampled ['OnlineSecurity'].replace({'No': 0, 'Yes': 1}).astype(int)
#churnData_oversampled['DeviceProtection']= churnData_oversampled ['DeviceProtection'].replace({'No': 0, 'Yes': 1}).astype(int)
#churnData_oversampled['TechSupport']= churnData_oversampled ['TechSupport'].replace({'No': 0, 'Yes': 1}).astype(int)
#churnData_oversampled['StreamingMovies']= churnData_oversampled ['StreamingMovies'].replace({'No': 0, 'Yes': 1}).astype(int)
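Encoding several Yes/No columns one line at a time invites copy-paste slips. A loop over the column names applies the same mapping to each column from its own values. The sketch below uses a toy frame. Note that `map` leaves `NaN` for any value missing from the dictionary, so it assumes the columns hold only `Yes` and `No`:

```python
import pandas as pd

# Toy frame with Yes/No columns (values are made up)
toy = pd.DataFrame({
    "Partner": ["Yes", "No"],
    "Dependents": ["No", "No"],
    "PhoneService": ["Yes", "Yes"],
})

yes_no = {"No": 0, "Yes": 1}

# Each column is encoded from its own values, so no column can overwrite another
for col in ["Partner", "Dependents", "PhoneService"]:
    toy[col] = toy[col].map(yes_no)

print(toy)
```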

In [159]:
churnData_oversampled = churnData_oversampled.drop(columns=['OnlineSecurity', 'DeviceProtection', 'TechSupport', 'StreamingMovies'])

In [162]:
X = churnData_oversampled[["gender", "Partner", "Dependents", "PhoneService"]]
y = churnData_oversampled['Churn']

In [164]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=532)

In [165]:
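LR = LogisticRegression(max_iter=1000)
# Note: the model is fit on all of X and y, which includes the test rows;
# LR.fit(X_train, y_train) would keep the test set unseen
LR.fit(X, y)
y_pred = LR.predict(X_test)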

print("precision: ", precision_score(y_test, y_pred))

precision:  0.600625


In [167]:
print("recall: ",recall_score(y_test,y_pred))
print("f1: ",f1_score(y_test,y_pred))

recall:  0.6101587301587301
f1:  0.6053543307086614
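Because the oversampling above happens before the train/test split, copies of the same churner can land in both sets, which tends to make test scores look better than they are. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order on synthetic data. The features `f1` and `f2` and the way `Churn` is generated are assumptions for the example only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
demo["Churn"] = ((demo["f1"] + rng.normal(scale=0.5, size=n)) > 0.8).astype(int)

# 1) Split first, so the test rows are never seen during resampling or fitting
train_df, test_df = train_test_split(
    demo, test_size=0.3, random_state=0, stratify=demo["Churn"]
)

# 2) Oversample the minority class inside the training set only
majority = train_df[train_df["Churn"] == 0]
minority = train_df[train_df["Churn"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_up], axis=0)

# 3) Fit on the balanced training set, evaluate on the untouched test set
clf = LogisticRegression(max_iter=1000).fit(train_bal[["f1", "f2"]], train_bal["Churn"])
test_pred = clf.predict(test_df[["f1", "f2"]])
test_recall = recall_score(test_df["Churn"], test_pred)
print("recall: ", test_recall)
```

The test set keeps the original class ratio, so the reported recall reflects how the model would behave on unbalanced, unseen data.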
