**LOGISTIC REGRESSION**

Problem statement:

A telecom firm has collected data on all of its customers. The main types of attributes are:

1. Demographics (age, gender, etc.)
2. Services availed (internet packs purchased, special offers, etc.)
3. Expenses (amount of recharge done per month, etc.)

Based on this past information, we want to build a model that predicts whether a particular customer will churn, i.e. switch to another telecom provider.
The target variable is 'Churn', a binary variable: 1 means the customer has churned and 0 means the customer has not.
The merged dataset has 21 columns; apart from 'customerID' and the target 'Churn', the remaining columns are the candidate predictors.


**Import necessary libraries**

In [1]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas package

import numpy as np
import pandas as pd

# Data Visualisation

import matplotlib.pyplot as plt 
 

**Importing all datasets**

In [2]:
# Importing all datasets
churn_data = pd.read_csv('churn_data.csv')
customer_data = pd.read_csv('customer_data.csv')
internet_data = pd.read_csv('internet_data.csv')

**Merging all datasets on 'customerID'**

In [3]:
# Merging on 'customerID'
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')

# Final dataframe with all predictor variables
dataset = pd.merge(df_1, internet_data, how='inner', on='customerID')

# Let's see the head of our master dataset
dataset.head()

# let's look at the statistical aspects of the dataframe
dataset.describe()

# Let's see the type of each column
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7042 entries, 0 to 7041
Data columns (total 21 columns):
customerID          7042 non-null object
tenure              7042 non-null int64
PhoneService        7042 non-null object
Contract            7042 non-null object
PaperlessBilling    7042 non-null object
PaymentMethod       7042 non-null object
MonthlyCharges      7042 non-null float64
TotalCharges        7042 non-null object
Churn               7042 non-null object
gender              7042 non-null object
SeniorCitizen       7042 non-null int64
Partner             7042 non-null object
Dependents          7042 non-null object
MultipleLines       7042 non-null object
InternetService     7042 non-null object
OnlineSecurity      7042 non-null object
OnlineBackup        7042 non-null object
DeviceProtection    7042 non-null object
TechSupport         7042 non-null object
StreamingTV         7042 non-null object
StreamingMovies     7042 non-null object
dtypes: float64(1), int64(2), obj
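
The merge step can be checked on a small scale. The following is a minimal sketch, assuming toy stand-ins for the three CSV files (the values are hypothetical): `validate="one_to_one"` makes pandas raise an error if 'customerID' is duplicated on either side, and comparing IDs before and after shows which customers an inner join silently drops.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (hypothetical values).
churn = pd.DataFrame({"customerID": ["a", "b", "c"], "Churn": ["No", "Yes", "No"]})
customer = pd.DataFrame({"customerID": ["a", "b", "c"], "gender": ["F", "M", "F"]})
internet = pd.DataFrame({"customerID": ["a", "b"], "InternetService": ["DSL", "No"]})

# validate="one_to_one" raises MergeError if customerID repeats on either side.
merged = churn.merge(customer, on="customerID", how="inner", validate="one_to_one")
merged = merged.merge(internet, on="customerID", how="inner", validate="one_to_one")

# An inner join keeps only customers present in every table ("c" is dropped here).
dropped = set(churn["customerID"]) - set(merged["customerID"])
print(merged.shape, dropped)
```

If `dropped` is non-empty on the real data, an outer join or a closer look at the missing IDs may be worth considering.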

**Data cleaning: checking the null values**

In [4]:
# Checking Null values
dataset.isnull().sum()*100/dataset.shape[0]



customerID          0.0
tenure              0.0
PhoneService        0.0
Contract            0.0
PaperlessBilling    0.0
PaymentMethod       0.0
MonthlyCharges      0.0
TotalCharges        0.0
Churn               0.0
gender              0.0
SeniorCitizen       0.0
Partner             0.0
Dependents          0.0
MultipleLines       0.0
InternetService     0.0
OnlineSecurity      0.0
OnlineBackup        0.0
DeviceProtection    0.0
TechSupport         0.0
StreamingTV         0.0
StreamingMovies     0.0
dtype: float64

**No NULL values are reported, but TotalCharges has dtype object and holds blank strings, so it still needs cleaning**

In [5]:
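# Converting blank strings in TotalCharges to NaN and casting the column to numeric
dataset['TotalCharges'].describe()
dataset['TotalCharges'] = dataset['TotalCharges'].replace(' ', np.nan)
dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'])

# Estimate of TotalCharges: median (TotalCharges / MonthlyCharges) ratio times MonthlyCharges
value = (dataset['TotalCharges']/dataset['MonthlyCharges']).median()*dataset['MonthlyCharges']
# Caution: `== np.nan` is always False, so this line replaces nothing and the NaN values remain
# (count is 7031 below); use dataset['TotalCharges'].isna() as the condition to actually impute.
dataset['TotalCharges'] = value.where(dataset['TotalCharges'] == np.nan, other =dataset['TotalCharges'])
dataset['TotalCharges'].describe()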



count    7031.000000
mean     2282.651714
std      2266.279660
min        18.800000
25%       401.400000
50%      1397.300000
75%      3793.050000
max      8684.800000
Name: TotalCharges, dtype: float64
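
The count of 7031 shows that the blank entries were not filled. A minimal sketch of an imputation that does take effect, assuming a toy frame with hypothetical values: `pd.to_numeric(..., errors="coerce")` turns blanks into NaN, and `fillna` detects missing values correctly, whereas a comparison with `np.nan` never matches.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself.
nan_equals_nan = (np.nan == np.nan)  # False

# Toy data (hypothetical values); the second row has a blank TotalCharges.
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 50.0, 80.0, 40.0],
    "TotalCharges": ["200.0", " ", "1600.0", "400.0"],
})

# errors="coerce" converts unparseable strings such as ' ' to NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Median ratio of total to monthly charges, ignoring the NaN row.
ratio = (toy["TotalCharges"] / toy["MonthlyCharges"]).median()
estimate = ratio * toy["MonthlyCharges"]

# fillna only touches the rows that are actually missing.
toy["TotalCharges"] = toy["TotalCharges"].fillna(estimate)
print(toy)
```

In this toy frame the median ratio is 10, so the blank row is filled with 10 times its monthly charge.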

**Model building**

In [6]:

#Model Building
#Data Preparation
#Converting some binary variables (Yes/No) to 0/1
# List of variables to map

varlist =  ['PhoneService', 'PaperlessBilling', 'Churn', 'Partner', 'Dependents']

**Binary encoding**

In [7]:
# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

**Applying the binary encoding**

In [8]:
# Applying the function to the var list
dataset[varlist] = dataset[varlist].apply(binary_map)
dataset.head()

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,gender,SeniorCitizen,Partner,Dependents,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,1,0,Month-to-month,1,Electronic check,29.85,29.85,0,Female,0,1,0,No phone service,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,34,1,One year,0,Mailed check,56.95,1889.5,0,Male,0,0,0,No,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,2,1,Month-to-month,1,Mailed check,53.85,108.15,1,Male,0,0,0,No,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,45,0,One year,0,Bank transfer (automatic),42.3,1840.75,0,Male,0,0,0,No phone service,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,2,1,Month-to-month,1,Electronic check,70.7,151.65,1,Female,0,0,0,No,Fiber optic,No,No,No,No,No,No
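
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes NaN. A minimal sketch, assuming a toy frame with a deliberately misspelled level ('yes' in lower case, a hypothetical value), shows how to count such unmapped entries after encoding.

```python
import pandas as pd

def binary_map(x):
    # Values other than 'Yes' / 'No' are mapped to NaN.
    return x.map({"Yes": 1, "No": 0})

# Toy data (hypothetical); 'yes' is a deliberate typo.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "yes"],
    "Dependents": ["No", "No", "Yes"],
})

encoded = toy.apply(binary_map)

# Number of values per column that did not match the mapping.
unmapped = encoded.isna().sum()
print(unmapped)
```

A non-zero count in `unmapped` suggests the column has levels beyond 'Yes' and 'No' and may need a different encoding.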


**Creating dummy variables and removing the extra columns**

In [9]:
#For categorical variables with multiple levels, create dummy features (one-hot encoded)
# Creating a dummy variable for some of the categorical variables and dropping the first one.
dummy1 = pd.get_dummies(dataset[['Contract', 'PaymentMethod', 'gender', 'InternetService']], drop_first=True)

# Adding the results to the master dataframe
dataset = pd.concat([dataset, dummy1], axis=1)
dataset.head()

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,gender,SeniorCitizen,Partner,Dependents,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,gender_Male,InternetService_Fiber optic,InternetService_No
0,7590-VHVEG,1,0,Month-to-month,1,Electronic check,29.85,29.85,0,Female,0,1,0,No phone service,DSL,No,Yes,No,No,No,No,0,0,0,1,0,0,0,0
1,5575-GNVDE,34,1,One year,0,Mailed check,56.95,1889.5,0,Male,0,0,0,No,DSL,Yes,No,Yes,No,No,No,1,0,0,0,1,1,0,0
2,3668-QPYBK,2,1,Month-to-month,1,Mailed check,53.85,108.15,1,Male,0,0,0,No,DSL,Yes,Yes,No,No,No,No,0,0,0,0,1,1,0,0
3,7795-CFOCW,45,0,One year,0,Bank transfer (automatic),42.3,1840.75,0,Male,0,0,0,No phone service,DSL,Yes,No,Yes,Yes,No,No,1,0,0,0,0,1,0,0
4,9237-HQITU,2,1,Month-to-month,1,Electronic check,70.7,151.65,1,Female,0,0,0,No,Fiber optic,No,No,No,No,No,No,0,0,0,1,0,0,1,0


In [10]:
# Creating dummy variables for the remaining categorical variables and dropping the level with the long name (e.g. 'No internet service').

# Creating dummy variables for the variable 'MultipleLines'
ml = pd.get_dummies(dataset['MultipleLines'], prefix='MultipleLines')
# Dropping MultipleLines_No phone service column
ml1 = ml.drop(['MultipleLines_No phone service'], axis=1)
#Adding the results to the master dataframe
dataset = pd.concat([dataset,ml1], axis=1)

# Creating dummy variables for the variable 'OnlineSecurity'.
os = pd.get_dummies(dataset['OnlineSecurity'], prefix='OnlineSecurity')
os1 = os.drop(['OnlineSecurity_No internet service'], axis=1)
# Adding the results to the master dataframe
dataset = pd.concat([dataset,os1], axis=1)

# Creating dummy variables for the variable 'OnlineBackup'.
ob = pd.get_dummies(dataset['OnlineBackup'], prefix='OnlineBackup')
ob1 = ob.drop(['OnlineBackup_No internet service'], axis=1)
# Adding the results to the master dataframe
dataset = pd.concat([dataset,ob1], axis=1)

# Creating dummy variables for the variable 'DeviceProtection'. 
dp = pd.get_dummies(dataset['DeviceProtection'], prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'], axis=1)
# Adding the results to the master dataframe
dataset = pd.concat([dataset,dp1], axis=1)

# Creating dummy variables for the variable 'TechSupport'. 
ts = pd.get_dummies(dataset['TechSupport'], prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'], axis=1)
# Adding the results to the master dataframe
dataset = pd.concat([dataset,ts1], axis=1)

# Creating dummy variables for the variable 'StreamingTV'.
st =pd.get_dummies(dataset['StreamingTV'], prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'], axis=1)
# Adding the results to the master dataframe
dataset = pd.concat([dataset,st1], axis=1)

# Creating dummy variables for the variable 'StreamingMovies'. 
sm = pd.get_dummies(dataset['StreamingMovies'], prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'], axis=1)
# Adding the results to the master dataframe
dataset = pd.concat([dataset,sm1], axis=1)
dataset.head()


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,gender,SeniorCitizen,Partner,Dependents,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,gender_Male,InternetService_Fiber optic,InternetService_No,MultipleLines_No,MultipleLines_Yes,OnlineSecurity_No,OnlineSecurity_Yes,OnlineBackup_No,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_Yes,TechSupport_No,TechSupport_Yes,StreamingTV_No,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_Yes
0,7590-VHVEG,1,0,Month-to-month,1,Electronic check,29.85,29.85,0,Female,0,1,0,No phone service,DSL,No,Yes,No,No,No,No,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,1,0,1,0,1,0
1,5575-GNVDE,34,1,One year,0,Mailed check,56.95,1889.5,0,Male,0,0,0,No,DSL,Yes,No,Yes,No,No,No,1,0,0,0,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,0,1,0
2,3668-QPYBK,2,1,Month-to-month,1,Mailed check,53.85,108.15,1,Male,0,0,0,No,DSL,Yes,Yes,No,No,No,No,0,0,0,0,1,1,0,0,1,0,0,1,0,1,1,0,1,0,1,0,1,0
3,7795-CFOCW,45,0,One year,0,Bank transfer (automatic),42.3,1840.75,0,Male,0,0,0,No phone service,DSL,Yes,No,Yes,Yes,No,No,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,1,1,0,1,0
4,9237-HQITU,2,1,Month-to-month,1,Electronic check,70.7,151.65,1,Female,0,0,0,No,Fiber optic,No,No,No,No,No,No,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
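
The seven near-identical blocks in this cell follow one pattern: build dummies with a prefix, drop one named level, and append the rest. A minimal sketch of that pattern as a loop, assuming a hypothetical helper `add_dummies` and toy data (neither is part of the original notebook):

```python
import pandas as pd

def add_dummies(df, column, drop_level):
    """Append dummy columns for `column`, leaving out the `drop_level` dummy."""
    dummies = pd.get_dummies(df[column], prefix=column)
    dummies = dummies.drop(columns=[f"{column}_{drop_level}"])
    return pd.concat([df, dummies], axis=1)

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "TechSupport": ["Yes", "No internet service", "No"],
})

for col in ["OnlineSecurity", "TechSupport"]:
    toy = add_dummies(toy, col, "No internet service")

print(list(toy.columns))
```

A row whose level was dropped ends up with zeros in all remaining dummies for that variable, which is how the dropped level stays recoverable.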


In [11]:
# We have created dummies for the below variables, so we can drop them
dataset = dataset.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies'], axis=1)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7042 entries, 0 to 7041
Data columns (total 32 columns):
customerID                               7042 non-null object
tenure                                   7042 non-null int64
PhoneService                             7042 non-null int64
PaperlessBilling                         7042 non-null int64
MonthlyCharges                           7042 non-null float64
TotalCharges                             7031 non-null float64
Churn                                    7042 non-null int64
SeniorCitizen                            7042 non-null int64
Partner                                  7042 non-null int64
Dependents                               7042 non-null int64
Contract_One year                        7042 non-null uint8
Contract_Two year                        7042 non-null uint8
PaymentMethod_Credit card (automatic)    7042 non-null uint8
PaymentMethod_Electronic check           7042 non-null uint8
PaymentMethod_Mailed check        

In [12]:
# Checking for outliers in the continuous variables
num_telecom = dataset[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']]
# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%
num_telecom.describe(percentiles=[.25, .5, .75, .90, .95, .99])


Unnamed: 0,tenure,MonthlyCharges,SeniorCitizen,TotalCharges
count,7042.0,7042.0,7042.0,7031.0
mean,32.366373,64.755886,0.16217,2282.651714
std,24.557955,30.088238,0.368633,2266.27966
min,0.0,18.25,0.0,18.8
25%,9.0,35.5,0.0,401.4
50%,29.0,70.35,0.0,1397.3
75%,55.0,89.85,0.0,3793.05
90%,69.0,102.6,1.0,5974.3
95%,72.0,107.4,1.0,6923.8
99%,72.0,114.7295,1.0,8039.94


In [13]:
# Checking up the missing values (column-wise)
dataset.isnull().sum()


customerID                                0
tenure                                    0
PhoneService                              0
PaperlessBilling                          0
MonthlyCharges                            0
TotalCharges                             11
Churn                                     0
SeniorCitizen                             0
Partner                                   0
Dependents                                0
Contract_One year                         0
Contract_Two year                         0
PaymentMethod_Credit card (automatic)     0
PaymentMethod_Electronic check            0
PaymentMethod_Mailed check                0
gender_Male                               0
InternetService_Fiber optic               0
InternetService_No                        0
MultipleLines_No                          0
MultipleLines_Yes                         0
OnlineSecurity_No                         0
OnlineSecurity_Yes                        0
OnlineBackup_No                 

In [14]:
# Removing NaN TotalCharges rows
dataset = dataset[~np.isnan(dataset['TotalCharges'])]

In [15]:
# Checking percentage of missing values after removing the missing values
round(100*(dataset.isnull().sum()/len(dataset.index)), 2)

customerID                               0.0
tenure                                   0.0
PhoneService                             0.0
PaperlessBilling                         0.0
MonthlyCharges                           0.0
TotalCharges                             0.0
Churn                                    0.0
SeniorCitizen                            0.0
Partner                                  0.0
Dependents                               0.0
Contract_One year                        0.0
Contract_Two year                        0.0
PaymentMethod_Credit card (automatic)    0.0
PaymentMethod_Electronic check           0.0
PaymentMethod_Mailed check               0.0
gender_Male                              0.0
InternetService_Fiber optic              0.0
InternetService_No                       0.0
MultipleLines_No                         0.0
MultipleLines_Yes                        0.0
OnlineSecurity_No                        0.0
OnlineSecurity_Yes                       0.0
OnlineBack

In [16]:

# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Putting feature variables to X
X = dataset.drop(['Churn','customerID'], axis=1)
X.head()

# Putting response variable to y
y = dataset['Churn']

y.head()

0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64

In [17]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)


In [18]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_train[['tenure','MonthlyCharges','TotalCharges']])

X_train.head()

Unnamed: 0,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,SeniorCitizen,Partner,Dependents,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,gender_Male,InternetService_Fiber optic,InternetService_No,MultipleLines_No,MultipleLines_Yes,OnlineSecurity_No,OnlineSecurity_Yes,OnlineBackup_No,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_Yes,TechSupport_No,TechSupport_Yes,StreamingTV_No,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_Yes
879,0.032381,1,1,-0.333459,-0.269045,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,1,0,1,0,1,0,1,0
5789,-0.29323,1,0,-1.491062,-0.793442,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
5008,-1.066555,1,0,-1.496045,-0.951355,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
880,-0.903749,1,1,1.508408,-0.543097,0,0,0,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1
2784,-1.147957,1,1,1.109808,-0.82821,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1


In [19]:
#Model Building
# Logistic regression model
import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()



0,1,2,3
Dep. Variable:,Churn,No. Observations:,4921
Model:,GLM,Df Residuals:,4897
Model Family:,Binomial,Df Model:,23
Link Function:,logit,Scale:,1.0000
Method:,IRLS,Log-Likelihood:,-2025.4
Date:,"Fri, 07 Jun 2019",Deviance:,4050.8
Time:,11:02:54,Pearson chi2:,5.95e+03
No. Iterations:,7,Covariance Type:,nonrobust

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-2.4358,1.173,-2.077,0.038,-4.734,-0.138
tenure,-1.4652,0.188,-7.811,0.000,-1.833,-1.098
PhoneService,0.4106,0.582,0.705,0.481,-0.730,1.551
PaperlessBilling,0.3283,0.089,3.679,0.000,0.153,0.503
MonthlyCharges,-1.3506,1.147,-1.178,0.239,-3.598,0.897
TotalCharges,0.7252,0.197,3.683,0.000,0.339,1.111
SeniorCitizen,0.3976,0.101,3.936,0.000,0.200,0.596
Partner,0.0100,0.092,0.108,0.914,-0.171,0.191
Dependents,-0.1641,0.107,-1.528,0.127,-0.375,0.046


In [20]:
#Feature Selection Using RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)             # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)
rfe.support_

list(zip(X_train.columns, rfe.support_, rfe.ranking_))


col = X_train.columns[rfe.support_]
X_train.columns[~rfe.support_]


Index(['Partner', 'Dependents', 'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Mailed check', 'gender_Male', 'MultipleLines_Yes',
       'OnlineSecurity_No', 'OnlineBackup_No', 'DeviceProtection_No',
       'DeviceProtection_Yes', 'TechSupport_No', 'StreamingTV_No',
       'StreamingTV_Yes', 'StreamingMovies_No', 'StreamingMovies_Yes'],
      dtype='object')
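
For reference, a minimal self-contained sketch of recursive feature elimination on synthetic data. Recent scikit-learn versions expect the number of features as the keyword argument `n_features_to_select`; the dataset from `make_classification` and the choice of 3 features are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 are informative (illustrative only).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

# RFE repeatedly fits the estimator and discards the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features.
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected, rfe.ranking_)
```

Features with `ranking_ == 1` are the ones retained; higher ranks were eliminated earlier.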

In [21]:
#Adding a constant

X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()


0,1,2,3
Dep. Variable:,Churn,No. Observations:,4921
Model:,GLM,Df Residuals:,4905
Model Family:,Binomial,Df Model:,15
Link Function:,logit,Scale:,1.0000
Method:,IRLS,Log-Likelihood:,-2028.6
Date:,"Fri, 07 Jun 2019",Deviance:,4057.1
Time:,11:02:55,Pearson chi2:,6.03e+03
No. Iterations:,7,Covariance Type:,nonrobust

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.6769,0.230,-2.942,0.003,-1.128,-0.226
tenure,-1.4703,0.184,-7.994,0.000,-1.831,-1.110
PhoneService,-0.5897,0.206,-2.868,0.004,-0.993,-0.187
PaperlessBilling,0.3366,0.089,3.786,0.000,0.162,0.511
MonthlyCharges,0.3889,0.165,2.351,0.019,0.065,0.713
TotalCharges,0.7244,0.195,3.710,0.000,0.342,1.107
SeniorCitizen,0.4223,0.099,4.244,0.000,0.227,0.617
Contract_One year,-0.7089,0.130,-5.453,0.000,-0.964,-0.454
Contract_Two year,-1.4366,0.214,-6.699,0.000,-1.857,-1.016
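
The coefficients in this summary are on the log-odds scale. A minimal sketch of converting a few of them to odds ratios with `exp(coef)`, using values copied from the table above; note that 'tenure' was standardised, so its ratio refers to a change of one standard deviation, not one month.

```python
import math

# Coefficients copied from the summary above (log-odds scale).
coefs = {
    "tenure": -1.4703,            # standardised feature
    "PaperlessBilling": 0.3366,   # 0/1 feature
    "Contract_Two year": -1.4366, # 0/1 dummy, baseline is month-to-month
}

# exp(coef) is the multiplicative change in the odds of churn
# for a one-unit increase in the feature, other features held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

A ratio below 1 means the feature is associated with lower odds of churn, and a ratio above 1 with higher odds.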


**Getting the predicted values on the train set**

In [22]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]


y_train_pred_final = pd.DataFrame({'Churn':y_train.values, 'Churn_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head()

Unnamed: 0,Churn,Churn_Prob,CustID
879,0,0.172464,879
5789,0,0.01393,5789
5008,0,0.037725,5008
880,1,0.473307,880
2784,1,0.577652,2784


**Creating a new column 'predicted' with 1 if Churn_Prob > 0.5 else 0**

In [23]:
#Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

Unnamed: 0,Churn,Churn_Prob,CustID,predicted
879,0,0.172464,879,0
5789,0,0.01393,5789,0
5008,0,0.037725,5008,0
880,1,0.473307,880,0
2784,1,0.577652,2784,1


**Create a confusion matrix on the train set**

In [24]:
# Confusion matrix
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted )
print(confusion_matrix)


[[3253  369]
 [ 578  721]]


In [25]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))


0.8075594391383865


In [26]:
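# Making predictions on the test set
# Caution: fit_transform re-estimates the mean and standard deviation on the test data;
# scaler.transform would reuse the training-set statistics, which is the usual practice.
X_test[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_test[['tenure','MonthlyCharges','TotalCharges']])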
X_test = X_test[col]
X_test.head()

X_test_sm = sm.add_constant(X_test)
y_test_pred = res.predict(X_test_sm)
y_test_pred[:10]



942     0.470360
3729    0.002782
1761    0.005110
2283    0.631098
1872    0.006129
1970    0.682815
2532    0.276677
1616    0.004593
2485    0.670316
4783    0.033912
dtype: float64
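
The difference between refitting the scaler on the test set and reusing the training statistics can be seen on a tiny example. A minimal sketch, assuming toy single-column arrays (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

# Usual practice: learn mean/std on the training data only...
scaler = StandardScaler().fit(train)
# ...and apply the same transformation to the test data.
test_scaled = scaler.transform(test)

# Refitting on the test data uses the test set's own mean/std instead.
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel(), test_refit.ravel())
```

The value 2.0 equals the training mean, so it maps to 0 under `transform`, while refitting centres the test set on its own mean and gives a different result for the same input.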

In [27]:
# Converting y_test_pred (a pandas Series) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)
y_pred_1.head()

Unnamed: 0,0
942,0.47036
3729,0.002782
1761,0.00511
2283,0.631098
1872,0.006129


In [28]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [29]:
# Copying the index into a CustID column
y_test_df['CustID'] = y_test_df.index


In [30]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [31]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)
y_pred_final.head()

Unnamed: 0,Churn,CustID,0
0,0,942,0.47036
1,0,3729,0.002782
2,0,1761,0.00511
3,1,2283,0.631098
4,0,1872,0.006129


In [32]:
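# Rearranging the columns
# Caution: the probability column is still named 0 here, so reindexing to 'Churn_Prob' creates
# an all-NaN column; rename it first with y_pred_final.rename(columns={0: 'Churn_Prob'}).
y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1)
# Let's see the head of y_pred_final
y_pred_final.head()
# Applying a probability cutoff of 0.42 (NaN > 0.42 is False, so every row gets 0 below)
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.42 else 0)
y_pred_final.head()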


Unnamed: 0,CustID,Churn,Churn_Prob,final_predicted
0,942,0,,0
1,3729,0,,0
2,1761,0,,0
3,2283,1,,0
4,1872,0,,0


**Check the overall accuracy**

In [33]:
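# Let's check the overall accuracy.
# Caution: final_predicted is 0 for every row (Churn_Prob is NaN above), so this value is
# just the share of non-churners in the test set, not a measure of the model's performance.
metrics.accuracy_score(y_pred_final.Churn, y_pred_final.final_predicted)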

0.7298578199052133
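
For the test-set predictions to carry through, the probability column needs the name 'Churn_Prob' before the columns are rearranged. A minimal sketch of that step, using only the five rows displayed earlier for `y_pred_final` (not the full test set) and the same 0.42 cutoff:

```python
import pandas as pd

# The five test rows shown earlier (actual label, row id, predicted probability).
y_test_df = pd.DataFrame({"Churn": [0, 0, 0, 1, 0],
                          "CustID": [942, 3729, 1761, 2283, 1872]})
y_pred_1 = pd.DataFrame([0.47036, 0.002782, 0.00511, 0.631098, 0.006129])

y_pred_final = pd.concat([y_test_df, y_pred_1], axis=1)

# Give the probability column (named 0 after concat) its intended name.
y_pred_final = y_pred_final.rename(columns={0: "Churn_Prob"})

# Reorder the columns now that 'Churn_Prob' exists.
y_pred_final = y_pred_final[["CustID", "Churn", "Churn_Prob"]]

# Apply the cutoff of 0.42 used in this notebook.
y_pred_final["final_predicted"] = (y_pred_final["Churn_Prob"] > 0.42).astype(int)
print(y_pred_final)
```

The accuracy on the full test set would have to be recomputed after this change; it is not reproduced here.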