# Project Overview

### Project - Predicting Customer Churn using Machine Learning

Customer churn is when a company’s customers stop doing business with that company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer. New business involves working leads through a sales funnel, using marketing and sales budgets to gain additional customers. Existing customers will often have a higher volume of service consumption and can generate additional customer referrals.

Preventing customer churn is critically important to the telecommunications sector, because the barriers to switching providers are so low.

# Dataset Overview

### Context

This is an analysis of a telecom company's customer database, which holds information about the attributes of its customers. The intention is to identify the customers most likely to leave the company.

### Content

Each row represents a customer; each column contains a customer attribute described in the column metadata.

### The data set includes information about:

* Customers who left within the last month – the column is called Churn

* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

* Demographic info about customers – gender, age range, and if they have partners and dependents

# Data preparation

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# Write code to read the dataset and mention index_col="customerID"
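
As a minimal sketch of what `index_col` does, the example below reads a small made-up CSV from memory. The rows and the name `toy` are illustrative assumptions; the real file name is not given here.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (contents are made up for illustration).
csv_text = """customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn
0001-A,Female,1,29.85,29.85,No
0002-B,Male,34,56.95,1889.5,No
0003-C,Male,2,53.85,108.15,Yes
"""

# index_col="customerID" uses the customer ID as the row index
# instead of keeping it as a regular column.
toy = pd.read_csv(io.StringIO(csv_text), index_col="customerID")
print(toy.head())
```

For the actual dataset, the path to the CSV file would take the place of the `io.StringIO` object.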



# What does the data look like?

In [None]:
df.head(15).T

In [None]:
df.info()

**Here we can see that TotalCharges has dtype object. Let's convert it to float.**

In [None]:
# Convert TotalCharges from object type to numeric.
# Cells that are empty or contain only whitespace become NaN first.
df['TotalCharges'] = df['TotalCharges'].replace(r'^\s*$', np.nan, regex=True)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])

In [None]:
df.info()
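
An alternative way to do the same conversion in one step, sketched on a toy frame (the values and the name `toy` are assumptions for illustration): `pd.to_numeric` with `errors="coerce"` turns any value it cannot parse into NaN.

```python
import pandas as pd

# Toy column that mixes numeric strings with blank entries (made-up values).
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "  "]})

# errors="coerce" converts unparseable entries (here the blanks) to NaN
# instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(toy.dtypes)
print(toy["TotalCharges"].isna().sum())
```

The trade-off is that `errors="coerce"` silently converts every unparseable value, not only blanks, so it is worth checking the number of NaN values afterwards.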

# Data Exploration



In [None]:
df.Partner.value_counts(normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




In [None]:
df.SeniorCitizen.value_counts(normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




In [None]:
# As above, write code to plot the value counts for gender
# (for example, a countplot of the gender column)




In [None]:
# Comment your analysis here-




In [None]:
# As above, write code to plot the value counts for tenure
# (for example, a countplot of the tenure column)




In [None]:
# Comment your analysis here-




In [None]:
df.PhoneService.value_counts(normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




In [None]:
df.MultipleLines.value_counts(normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




In [None]:
df.InternetService.value_counts(normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




In [None]:
df.Contract.value_counts(normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




In [None]:
# As above, write code to plot the value counts for PaymentMethod
# (for example, a countplot of the PaymentMethod column)




In [None]:
# Comment your analysis here-




**We will visualize other variables as we perform our analysis.**

**Now let's plot variables with respect to our target variable.**

In [None]:
# First let's see Our Target Variable
df.Churn.value_counts(normalize=True).plot(kind='bar');


In [None]:
# Comment your analysis here-




In [None]:
# Now Let's Start Comparing.
# Gender Vs Churn
print(pd.crosstab(df.gender,df.Churn,margins=True))
pd.crosstab(df.gender,df.Churn,margins=True).plot(kind='bar',figsize=(7,5));

In [None]:
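# Both figures divide by the total number of churned customers (1869),
# so they give the gender mix among churners, not the churn rate per gender.
print('Percent of churned customers who are female {0}'.format((939/1869)*100))
print('Percent of churned customers who are male {0}'.format((930/1869)*100))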

In [None]:
# Comment your analysis here-




**We can see that gender doesn't play an important role in predicting our target variable.**
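
The percentages printed above divide by the total number of churned customers, so they describe the gender mix among churners. The churn rate within each group is a different quantity. The sketch below shows both on made-up data, using the `normalize` argument of `pd.crosstab`; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up data: 4 female and 4 male customers, one churner in each group.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "Churn":  ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# normalize="index": each row sums to 1, giving the churn rate within each gender.
rate_within_group = pd.crosstab(toy.gender, toy.Churn, normalize="index")

# normalize="columns": each column sums to 1, giving the gender mix
# among churned and among retained customers.
share_of_churners = pd.crosstab(toy.gender, toy.Churn, normalize="columns")

print(rate_within_group)
print(share_of_churners)
```

The same two views apply to the Contract and Internet Service comparisons that follow.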

In [None]:
# Contract Vs Churn
print(pd.crosstab(df.Contract,df.Churn,margins=True))
pd.crosstab(df.Contract,df.Churn,margins=True).plot(kind='bar',figsize=(7,5));

In [None]:
# Each figure is a share of all churned customers (1869), not a churn rate per contract type.
print('Percent of churned customers on a Month-to-Month contract {0}'.format((1655/1869)*100))
print('Percent of churned customers on a One-Year contract {0}'.format((166/1869)*100))
print('Percent of churned customers on a Two-Year contract {0}'.format((48/1869)*100))

In [None]:
# Comment your analysis here-




**Most of the people who left had a month-to-month contract.**

In [None]:
# Internet Service Vs Churn Write code




In [None]:
# Each figure is a share of all churned customers (1869), not a churn rate per service type.
print('Percent of churned customers with DSL internet service {0}'.format((459/1869)*100))
print('Percent of churned customers with Fiber Optic internet service {0}'.format((1297/1869)*100))
print('Percent of churned customers with no internet service {0}'.format((113/1869)*100))

In [None]:
# Comment your analysis here-




**Most of the people who left had fiber optic internet service.**

In [None]:
# Tenure Median Vs Churn - write code



In [None]:
# Comment your analysis here-




In [None]:
# Partner Vs Dependents
print(pd.crosstab(df.Partner,df.Dependents,margins=True))
pd.crosstab(df.Partner,df.Dependents,margins=True).plot(kind='bar',figsize=(5,5));

In [None]:
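# Both figures divide by the number of customers with dependents (2110).
print('Percent of customers with dependents who have a partner {0}'.format((1749/2110)*100))
print('Percent of customers with dependents who have no partner {0}'.format((361/2110)*100))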

In [None]:
# Comment your analysis here-




**Customers with a partner account for a much larger share of those with dependents than customers without a partner, which suggests that most customers with a partner may be married.**

In [None]:
# Partner Vs Churn
print(pd.crosstab(df.Partner,df.Churn,margins=True))
pd.crosstab(df.Partner,df.Churn,margins=True).plot(kind='bar',figsize=(5,5));

In [None]:
plt.figure(figsize=(17,8))
sns.countplot(x=df['tenure'],hue=df.Partner);

In [None]:
# Comment your analysis here-




**Customers who have a partner tend to stay longer with the company, so having a partner is a positive sign for retention.**

In [None]:
# Partner Vs Churn (plotted as proportions of all customers)
print(pd.crosstab(df.Partner,df.Churn,margins=True))
pd.crosstab(df.Partner,df.Churn,normalize=True).plot(kind='bar')

In [None]:
# Senior Citizen Vs Churn
print(pd.crosstab(df.SeniorCitizen,df.Churn,margins=True))
pd.crosstab(df.SeniorCitizen,df.Churn,normalize=True).plot(kind='bar')

In [None]:
# Comment your analysis here-




**Let's Check for Outliers in Monthly Charges And Total Charges Using Box Plots**

In [None]:
df.boxplot('MonthlyCharges');

In [None]:
df.boxplot('TotalCharges');

**Neither Monthly Charges nor Total Charges shows any outliers in the box plots, so no outlier handling is needed.**

In [None]:
df.describe()

In [None]:
# Comment your analysis here-




# Correlation Matrix

In [None]:
# Let's check the correlation matrix in Seaborn - write code



In [None]:
# Comment your analysis here-




**Here we can see that Tenure and Total Charges are correlated, and that Monthly Charges and Total Charges are also correlated with each other. This is consistent with the hypothesis that Total charges = Monthly charges * Tenure + Additional Tax.**
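
To illustrate why this relationship shows up in a correlation matrix, here is a minimal sketch on made-up numbers where the total is built exactly as tenure times monthly charge (no tax term). The frame `toy` and its values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Made-up customers; TotalCharges is constructed as tenure * MonthlyCharges.
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 60, 72],
    "MonthlyCharges": [30.0, 55.0, 80.0, 45.0, 100.0, 65.0],
})
toy["TotalCharges"] = toy["tenure"] * toy["MonthlyCharges"]

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr)
```

A matrix like `corr` can be passed to `sns.heatmap(corr, annot=True)` to draw it with Seaborn.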

# Data Munging Process

In [None]:
# Checking For NULL - write code



**We can see here that we have 11 null values in Total Charges, so let's try to fill them.**

In [None]:
df.head(15)

In [None]:
fill = df.MonthlyCharges * df.tenure

In [None]:
# Assign the result back: an in-place fillna on a column accessor may not update df
df['TotalCharges'] = df['TotalCharges'].fillna(fill)

In [None]:
df.isnull().sum()

**There are no null values now.**
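
The fill step can be sketched on a toy frame as follows. The values and the name `toy` are made up for illustration; the logic mirrors the cells above, where missing totals are replaced by monthly charge times tenure.

```python
import numpy as np
import pandas as pd

# Made-up rows: two of the three totals are missing.
toy = pd.DataFrame({
    "tenure": [0, 10, 2],
    "MonthlyCharges": [52.55, 20.0, 70.0],
    "TotalCharges": [np.nan, 200.0, np.nan],
})

# Row-wise estimate of the total; fillna aligns it by index and
# only uses it where TotalCharges is missing.
fill = toy["MonthlyCharges"] * toy["tenure"]
toy["TotalCharges"] = toy["TotalCharges"].fillna(fill)

print(toy)
```

Note that a customer with a tenure of 0 gets a filled total of 0 under this rule.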

# When Churn = 'Yes'

In [None]:
# write code for calculating median MonthlyCharges for churn ==Yes



In [None]:
# write code for calculating median TotalCharges for churn ==Yes



In [None]:
# write code for calculating median tenure for churn ==Yes
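
One possible pattern for these medians, sketched on made-up data (the frame `toy` and its values are assumptions): filter the rows with a boolean mask, then call `median()` on the column of interest. A `groupby` gives the same figures for both churn groups at once.

```python
import pandas as pd

# Made-up customers for illustration.
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "No", "No"],
    "tenure": [1, 5, 12, 40, 60],
    "MonthlyCharges": [70.0, 80.0, 95.0, 20.0, 60.0],
})

# Keep only the churned customers, then take per-column medians.
churned = toy.loc[toy.Churn == "Yes"]
median_tenure = churned["tenure"].median()
median_monthly = churned["MonthlyCharges"].median()

# The same statistic for both groups side by side.
medians_by_churn = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()
print(median_tenure, median_monthly)
print(medians_by_churn)
```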


In [None]:
df.loc[(df.Churn == 'Yes'),'PaymentMethod'].value_counts(normalize = True)

**Most of the people who left paid by electronic check, so let's make a separate variable for it so that the model can use it more easily to predict our target variable.**

In [None]:
df['Is_Electronic_check'] = np.where(df['PaymentMethod'] == 'Electronic check',1,0)

In [None]:
df.loc[(df.Churn == 'Yes'),'PaperlessBilling'].value_counts(normalize = True)

In [None]:
df.loc[(df.Churn == 'Yes'),'DeviceProtection'].value_counts(normalize = True)

In [None]:
df.loc[(df.Churn == 'Yes'),'OnlineBackup'].value_counts(normalize = True)

In [None]:
df.loc[(df.Churn == 'Yes'),'TechSupport'].value_counts(normalize = True)

In [None]:
df.loc[(df.Churn == 'Yes'),'OnlineSecurity'].value_counts(normalize = True)

**We can see that most people who left the company did not subscribe to services like Online Security, Device Protection, Tech Support and Online Backup. In this analysis we treat these variables as less important for our prediction and will drop them at the end.**

In [None]:
df= pd.get_dummies(df,columns=['Partner','Dependents',
       'PhoneService', 'MultipleLines','StreamingTV',
       'StreamingMovies','Contract','PaperlessBilling','InternetService'],drop_first=True)

**We have encoded the categorical variables as numeric indicator columns using `pd.get_dummies`, so that the model can work with them.**
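
To show what `drop_first=True` does, here is a small sketch on made-up data (the frame `toy` is an assumption for illustration). For each encoded column, the first category in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up rows with one binary and one three-level categorical column.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One indicator column per category, minus the first category of each variable.
encoded = pd.get_dummies(toy, columns=["Partner", "Contract"], drop_first=True)
print(list(encoded.columns))
```

Here `Partner` yields a single `Partner_Yes` column, and `Contract` yields `Contract_One year` and `Contract_Two year`; a row with both contract indicators off is a month-to-month customer.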

In [None]:
df.info()

**Now Let's Drop the variables that are not Important For us according to our Analysis.**

In [None]:
# Write code to drop 'StreamingTV_No internet service','StreamingMovies_No internet service'




In [None]:
# write code to drop gender



In [None]:
# write code to drop tenure and MonthlyCharges



In [None]:
# write code to drop 'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','PaymentMethod'
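
A minimal sketch of dropping columns by name, on a made-up frame (`toy` and `reduced` are illustrative names): `DataFrame.drop` with the `columns` argument returns a new frame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Made-up frame with a few of the column names used in this notebook.
toy = pd.DataFrame({
    "gender": ["Female", "Male"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": [29.85, 1889.5],
})

# Returns a copy without the listed columns; toy itself is not modified.
reduced = toy.drop(columns=["gender", "tenure", "MonthlyCharges"])
print(list(reduced.columns))
```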



**Let's convert our target variable 'Churn' from Yes or No to 1 or 0.**

In [None]:
# write code to convert the target into 1 for yes and 0 for No
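
One common way to do this kind of conversion is `Series.map` with an explicit dictionary, sketched here on made-up values (the frame `toy` is an assumption):

```python
import pandas as pd

# Made-up target column.
toy = pd.DataFrame({"Churn": ["Yes", "No", "No", "Yes"]})

# Explicit mapping; any value not in the dictionary would become NaN,
# which makes unexpected labels easy to spot.
toy["Churn"] = toy["Churn"].map({"Yes": 1, "No": 0})
print(toy["Churn"].tolist())
```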



In [None]:
df.info()

**Now We have only 16 variables that we think are important for Our Prediction. So let's Start our Modelling Part.**

# Modelling Part

In [None]:
# write code to separate the features X from the target y

X = 
y = 

In [None]:
# train test split - 20% test data - write code


# print shape of all generated variables
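
A sketch of a 20% hold-out split on synthetic data, assuming scikit-learn is installed. The arrays `X_toy` and `y_toy`, the seed values, and the use of `stratify` are illustrative choices, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and binary target (random values, for shape checking only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "TotalCharges": rng.uniform(20, 8000, size=100),
    "Partner_Yes": rng.integers(0, 2, size=100),
})
y_toy = pd.Series(rng.integers(0, 2, size=100), name="Churn")

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable;
# stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```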




**Let's start with Logistic Regression Model because we know Our Target Variable has a Binary Outcome.**

In [None]:
# Import Logistic Regression



In [None]:
# create model



In [None]:
# train model



In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score,classification_report

In [None]:
# performance metrics
# accuracy
print ('accuracy for logistic regression - version 1 : {0:.2f}'.format(accuracy_score(y_test, model_lr_1.predict(X_test))))
# confusion matrix
print ('confusion matrix for logistic regression - version 1: \n {0}'.format(confusion_matrix(y_test, model_lr_1.predict(X_test))))
# precision 
print ('precision for logistic regression - version 1 : {0:.2f}'.format(precision_score(y_test, model_lr_1.predict(X_test))))
# recall
print ('recall for logistic regression - version 1 : {0:.2f}'.format(recall_score(y_test, model_lr_1.predict(X_test))))
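
For reference, the whole fit-and-evaluate flow can be sketched end to end on synthetic data, assuming scikit-learn is installed. The data, the variable names and `max_iter=1000` are assumptions for illustration. Predicting once and reusing `y_pred` avoids calling `predict` separately for every metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: the binary target is loosely driven by the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predict once, reuse for all metrics

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
prec = precision_score(y_test, y_pred)  # of predicted positives, share that are correct
rec = recall_score(y_test, y_pred)      # of actual positives, share that were found

print(acc, prec, rec)
print(cm)
```

On an imbalanced target such as churn, precision and recall for the positive class are usually more informative than accuracy alone.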

Write further code to improve the results, and comment on the performance of the model.