# Churn Prediction

Churn prediction is a common use case in machine learning. If you are not familiar with the term, churn means a customer leaving the company. It is critical for a business to understand why and when customers are likely to churn. A robust and accurate churn prediction model helps a business act in time to keep customers from leaving.

In this project, I will use "Telco Customer Churn" data set which is available on Kaggle.

There are 20 features (independent variables) and 1 target (dependent) variable for 7043 customers. The target variable indicates whether a customer has left the company (i.e. churn=yes) within the last month. Since the target variable has two states (yes/no or 1/0), this is a binary classification problem.

The variables are:

Classification labels
- Churn: Whether the customer churned or not (Yes or No)

Customer services booked
- PhoneService: Whether the customer has a phone service (Yes, No)
- MultipleLines: Whether the customer has multiple lines (Yes, No, No phone service)
- InternetService: Customer's internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies (Yes, No, No internet service)

Customer account information
- tenure: Number of months the customer has stayed with the company
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing (Yes, No)
- PaymentMethod: The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer

Customer demographic information
- customerID: Customer ID
- gender: Whether the customer is male or female
- SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)

At first glance, only customerID seems irrelevant to customer churn. The other variables may or may not have an effect on churn. We will find out.


# DATA COLLECTION

In [None]:
import numpy as np   #importing the libraries for data storing
import pandas as pd

In [None]:
import matplotlib.pyplot as plt  #importing libraries for data plotting
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv("Telco-Customer-Churn.csv")   #reading csv file 

# EXPLORATORY DATA ANALYSIS

In [None]:
df.head()   #by default it shows 20 columns only and 5 rows

In [None]:
pd.set_option('display.max_columns', None)   #to display all columns (the full option name avoids ambiguity in newer pandas)
df.head(7)  # to display top 7 rows

In [None]:
df.shape  #returns shape of data frame

In [None]:
df.isna().sum() # find number of missing values for each column of the data set

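No missing values are reported at this point. Note, however, that a numeric column stored as text can hide blank entries, which is why TotalCharges is converted and checked again below.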

In [None]:
df.columns  #returns the name of columns

In [None]:
df.dtypes #return the data type of each column

In [None]:
df['TotalCharges']=pd.to_numeric(df['TotalCharges'],errors="coerce")  #used to change data type of columns TotalCharges
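The effect of `errors="coerce"` is easiest to see on a toy column. The sketch below uses made-up values (the `raw` series is an assumption, not data from the Telco file) and shows how an unparseable entry such as a blank string becomes NaN, so it only shows up as missing after the conversion.

```python
import pandas as pd

# Toy column mimicking a numeric field stored as text, with one blank entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # number of entries that could not be parsed
```

This is why running `isna().sum()` before and after the conversion can give different answers for the same column.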

In [None]:
df.dtypes

In [None]:
df.isna().sum()

In [None]:
df.loc[df['TotalCharges'].isna()]  #display the rows which have a missing value in TotalCharges

In [None]:
df.dropna(subset=['TotalCharges'],inplace=True)   #dropping the rows 

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe() #returns summary statistics for the columns that hold numeric data

In [None]:
df.Churn.value_counts() #count number of each value i.e Yes and No for column Churn

In [None]:
100*df.Churn.value_counts()/len(df.Churn)   #to find % of churned customers.

The target variable has an imbalanced class distribution: the positive class (Churn=Yes) is much smaller than the negative class (Churn=No). An imbalanced class distribution can hurt the performance of a machine learning model, because the model would be trained on roughly 74% non-churned and 26% churned data. We will use upsampling or downsampling to overcome this issue.
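As a side note, the class shares can also be obtained directly with `value_counts(normalize=True)`. The following is a minimal sketch on a toy series (the values are invented for illustration and are not the Telco data):

```python
import pandas as pd

# Toy target column: 3 of 4 customers stayed
churn = pd.Series(["No", "No", "No", "Yes"])

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
share = churn.value_counts(normalize=True) * 100
print(share)
```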

It is always beneficial to explore the features (independent variables) before trying to build a model. Let's first discover the features that only have two values.

In [None]:
columns = df.columns
binary_cols = []
#making separate list for features having 2 types of values
for col in columns:
    if df[col].value_counts().shape[0] == 2:
        binary_cols.append(col)

In [None]:
binary_cols # categorical features with two classes

The remaining categorical variables have more than two values (or classes).

In [None]:
# Categorical features with multiple classes
multiple_cols_cat = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract','PaymentMethod']

## Binary categorical features

Let's check the class distribution of binary features.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharey=True)  #plotting countplot for binary categorical features

sns.countplot(x="gender", data=df, ax=axes[0,0])   #x is passed by keyword, as required by recent seaborn versions
sns.countplot(x="SeniorCitizen", data=df, ax=axes[0,1])
sns.countplot(x="Partner", data=df, ax=axes[0,2])
sns.countplot(x="Dependents", data=df, ax=axes[1,0])
sns.countplot(x="PhoneService", data=df, ax=axes[1,1])
sns.countplot(x="PaperlessBilling", data=df, ax=axes[1,2])

There is a high imbalance in SeniorCitizen and PhoneService variables. Most of the customers are not senior and similarly, most customers have a phone service.

It is better to check how the target variable (churn) changes according to the binary features. To be able to make calculations, we need to change the values of target variable. "Yes" will be 1 and "No" will be 0.

In [None]:
churn_numeric = {'Yes':1, 'No':0}  #changing categorical value into numbers
df['Churn'] = df['Churn'].map(churn_numeric)  #assign back instead of an in-place replace on the column


In [None]:
df.dtypes  #churn is now of int type

In [None]:
df[['gender','Churn']].groupby(['gender']).mean()   #finding average churn rate  wrt the different values of gender

The average churn rate is approximately the same for males and females, which indicates that the gender variable adds little predictive power to a model. Therefore, I will not use the gender variable in the machine learning model.

In [None]:
df[['SeniorCitizen','Churn']].groupby(['SeniorCitizen']).mean()

In [None]:
df[['Partner','Churn']].groupby(['Partner']).mean()

In [None]:
df[['Dependents','Churn']].groupby(['Dependents']).mean()

In [None]:
df[['PhoneService','Churn']].groupby(['PhoneService']).mean()

In [None]:
df[['PaperlessBilling','Churn']].groupby(['PaperlessBilling']).mean()

The other binary features have an effect on the target variable. PhoneService could arguably be skipped if a 2% difference is considered negligible; it is revisited in the Phone service section.
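The repeated groupby cells above all follow the same pattern, so they can be written as one loop. This is a minimal sketch on a toy frame (the `toy` data and the `rates` dictionary are illustrative names, not part of the original notebook); it assumes Churn is already encoded as 0/1, so the group mean is the churn rate.

```python
import pandas as pd

# Toy data with two binary features and a 0/1 target
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes", "No", "No"],
    "PaperlessBilling": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
    "Churn": [0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 target per group is the churn rate of that group
rates = {col: toy.groupby(col)["Churn"].mean()
         for col in ["Partner", "PaperlessBilling"]}

for col, rate in rates.items():
    print(rate, end="\n\n")
```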

## Other Categorical Features

It is time to explore other categorical features. We also have continuous features such as tenure, monthly charges and total charges which I will discuss in the next part.

There are 6 variables that come with internet service. These variables come into play only if the customer has internet service.

### Internet Service

In [None]:
sns.countplot(x="InternetService", data=df)

In [None]:
df.InternetService.value_counts()

In [None]:
df[['InternetService','Churn']].groupby('InternetService').mean()   #finding average churn rate wrt diff value of InternetService

The internet service variable is clearly important in predicting churn. As you can see, customers with fiber optic internet service are much more likely to churn than other customers, although the numbers of DSL and fiber optic customers are not very different. This company may have some problems with its fiber optic connection. However, it is not good practice to draw conclusions from a single variable. Let's also check the monthly charges.

In [None]:
df[['InternetService','MonthlyCharges']].groupby('InternetService').mean()

Fiber optic service is much more expensive than DSL which may be one of the reasons why customers churn.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharey=True)
#plotting countplot of the features related to the internetservice
sns.countplot(x="StreamingTV", data=df, ax=axes[0,0])
sns.countplot(x="StreamingMovies", data=df, ax=axes[0,1])
sns.countplot(x="OnlineSecurity", data=df, ax=axes[0,2])
sns.countplot(x="OnlineBackup", data=df, ax=axes[1,0])
sns.countplot(x="DeviceProtection", data=df, ax=axes[1,1])
sns.countplot(x="TechSupport", data=df, ax=axes[1,2])

In [None]:
df[['StreamingTV','Churn']].groupby('StreamingTV').mean()

In [None]:
df[['StreamingMovies','Churn']].groupby('StreamingMovies').mean()

In [None]:
df[['OnlineSecurity','Churn']].groupby('OnlineSecurity').mean()

In [None]:
df[['OnlineBackup','Churn']].groupby('OnlineBackup').mean()

In [None]:
df[['DeviceProtection','Churn']].groupby('DeviceProtection').mean()

In [None]:
df[['TechSupport','Churn']].groupby('TechSupport').mean()

All internet service related features seem to have different churn rates for their classes.

### Phone service

In [None]:
df.PhoneService.value_counts()

In [None]:
df.MultipleLines.value_counts()

The MultipleLines column carries more specific information than the PhoneService column, because whether a customer has phone service at all can also be read from MultipleLines (the "No phone service" value). So I will drop the PhoneService column.

In [None]:
df[['MultipleLines','Churn']].groupby('MultipleLines').mean()  

### Contract, Payment Method

In [None]:
sns.countplot(x="Contract", data=df)

In [None]:
df[['Contract','Churn']].groupby('Contract').mean()

As expected, customers on short-term contracts are more likely to churn. This clearly explains why companies are motivated to build long-term relationships with their customers.

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x="PaymentMethod", data=df)

In [None]:
df[['PaymentMethod','Churn']].groupby('PaymentMethod').mean()

### Continuous Variables

The continuous features are tenure, monthly charges and total charges. The amount in the total charges column is approximately tenure (months) multiplied by monthly charges, so it is unnecessary to include total charges in the model. Adding unnecessary features increases model complexity, and it is better to have a simpler model when possible.
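One way to check this kind of redundancy is to compare total charges with the product of tenure and monthly charges. The sketch below uses invented numbers (the `toy` frame is an assumption for illustration); on the real data the same two lines could be applied to `df` before TotalCharges is dropped.

```python
import pandas as pd

# Toy rows: TotalCharges is close to, but not exactly, tenure * MonthlyCharges
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 60],
    "MonthlyCharges": [30.0, 55.0, 80.0, 100.0],
    "TotalCharges": [30.0, 650.0, 1950.0, 6020.0],
})

# Reconstruct the total from the other two columns and measure how well it matches
approx_total = toy["tenure"] * toy["MonthlyCharges"]
corr = approx_total.corr(toy["TotalCharges"])  # Pearson correlation
print(round(corr, 3))
```

A correlation very close to 1 suggests the third column adds little information beyond the other two.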

In [None]:
df[['tenure','MonthlyCharges','Churn']].groupby('Churn').mean() 
#finding average tenure and monthlycharge on the basis of churn.

It is clear that people who have been a customer for a long time tend to stay with the company. The average tenure in months for people who left the company is 20 months less than the average for people who stay. 

It seems like monthly charges also have an effect on churn rate. 

The contract and tenure features may be correlated, because customers with long-term contracts are likely to stay longer with the company. Let's find out.

In [None]:
df[['Contract','tenure']].groupby('Contract').mean()  #average tenure wrt type of contract

As expected, contract and tenure are highly correlated. Customers with long contracts have been customers for longer than those with short-term contracts. I think contract will add little to no value on top of the tenure feature, so I will not use the contract feature in the model.

After exploring the variables, I have decided not to use the following variables because they add little or no informative power to the model:
1) Customer ID
2) Gender
3) PhoneService
4) Contract
5) TotalCharges

In [None]:
df.drop(['customerID','gender','PhoneService','Contract','TotalCharges'], axis=1, inplace=True)

In [None]:
df.head(10)

In [None]:
df.shape


# Data Preprocessing

In [None]:
from sklearn.preprocessing import MinMaxScaler  #LabelEncoder and OneHotEncoder are not needed here because pd.get_dummies does the encoding

In [None]:
#Categorical features need to be converted to numbers so that a machine learning model can use them in its calculations.
#This is called encoding.
cat_features = ['SeniorCitizen', 'Partner', 'Dependents',
        'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'PaperlessBilling', 'PaymentMethod']
X = pd.get_dummies(df, columns=cat_features,drop_first=True)  #encoding and saving the data in dataframe named as X


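'''Illustration of drop_first=True: the first category (English) gets no column
and is represented by zeros in all remaining dummy columns.

Subject    Subject_Hindi  Subject_Punjabi
English    0              0
Hindi      1              0
Punjabi    0              1
Hindi      1              0
English    0              0
Punjabi    0              1'''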
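To see what `drop_first=True` does in practice, the toy example can be run through `pd.get_dummies` directly. This is a small sketch with an invented `Subject` column; `dtype=int` is passed only so the output prints as 0/1 rather than booleans.

```python
import pandas as pd

# Toy categorical column with three classes
subjects = pd.DataFrame({"Subject": ["English", "Hindi", "Punjabi", "Hindi"]})

# drop_first=True drops the first category (in sorted order) to avoid a redundant column
encoded = pd.get_dummies(subjects, columns=["Subject"], drop_first=True, dtype=int)
print(encoded)
```

The dropped category is still recoverable: a row with zeros in every remaining dummy column belongs to it.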



In [None]:
#Values of numerical features are rescaled to the range 0 to 1. Without scaling, features with larger
#values can be given more weight by some models, which can affect the accuracy.
sc = MinMaxScaler()
a = sc.fit_transform(df[['tenure']])  #min=1 max=72
b = sc.fit_transform(df[['MonthlyCharges']])  #min=18 #max=118


In [None]:
X['tenure'] = a     
X['MonthlyCharges'] = b
#the scaled values kept in a and b are written back into the dataframe
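For reference, min-max scaling maps each value x to (x - min) / (max - min). The sketch below checks this by hand against `MinMaxScaler` on a few toy tenure values (the array is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy tenure values as a single-column 2D array, the shape the scaler expects
tenure = np.array([[1.0], [24.0], [72.0]])

scaled = MinMaxScaler().fit_transform(tenure)

# Same result by hand: (x - min) / (max - min)
manual = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```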

In [None]:
X.shape

In [None]:
X.head()


# Resampling

As briefly discussed in the beginning, a target variable with an imbalanced class distribution is not desirable for machine learning models. I will use upsampling, which means increasing the number of samples of the minority class by randomly selecting rows from it with replacement.

In [None]:
sns.countplot(x='Churn', data=X).set_title('Class Distribution Before Resampling')

In [None]:
X_no = X[X.Churn == 0]
X_yes = X[X.Churn == 1]

In [None]:
print(len(X_no),len(X_yes))

In [None]:
X_yes_upsampled = X_yes.sample(n=len(X_no), replace=True, random_state=42)
print(len(X_yes_upsampled))

In [None]:
X_upsampled = pd.concat([X_no, X_yes_upsampled]).reset_index(drop=True)  #DataFrame.append was removed in pandas 2.0

In [None]:
sns.countplot(x='Churn', data=X_upsampled).set_title('Class Distribution After Resampling')

# ML model

We need to divide the data set into training and test subsets so that we are able to measure the performance of our model on new, previously unseen examples.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = X_upsampled.drop(['Churn'], axis=1) #features (independent variables)
y = X_upsampled['Churn'] #target (dependent variable)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
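One caveat is worth keeping in mind: when upsampling is done before the split, copies of the same churned customer can end up in both the training and the test set, which tends to make test accuracy look better than it would be on truly unseen customers. A possible alternative, sketched here on a toy frame (the `toy` data and all variable names below are illustrative, not part of the original notebook), is to split first and upsample only the training part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced frame: 8 non-churned rows, 4 churned rows
toy = pd.DataFrame({"feature": range(12), "Churn": [0] * 8 + [1] * 4})

# 1) Split first, stratifying so both parts keep the class ratio
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["Churn"], random_state=42
)

# 2) Upsample the minority class inside the training part only
train_no = train[train["Churn"] == 0]
train_yes = train[train["Churn"] == 1]
train_yes_up = train_yes.sample(n=len(train_no), replace=True, random_state=42)
train_balanced = pd.concat([train_no, train_yes_up]).reset_index(drop=True)

# No test row was used to build the balanced training set
overlap = set(train_balanced["feature"]) & set(test["feature"])
print(len(overlap))
```

With this ordering the test set keeps the original class distribution, so the measured accuracy is closer to what the model would see in production.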

### Ridge Classifier

I have decided to use ridge classifier as a base model. Then I will try a model that I think will perform better.

In [None]:
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
clf_ridge = RidgeClassifier() #create a ridge classifier object
clf_ridge.fit(X_train, y_train) #train the model

In [None]:
pred = clf_ridge.predict(X_train)  #make predictions on training set

In [None]:
accuracy_score(y_train, pred) #accuracy on training set

In [None]:
confusion_matrix(y_train, pred)

In [None]:
pred_test = clf_ridge.predict(X_test)

In [None]:
accuracy_score(y_test, pred_test)

The model achieved 75% accuracy on the training set and 76% on the test set. The model is not overfitting, because the two accuracies are very close. However, 75% accuracy is not very good, so we will try to do better with a different model.
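Accuracy and the confusion matrix are directly related: accuracy is the sum of the diagonal (correct predictions) divided by the total. The sketch below works through this on made-up labels (`y_true` and `y_pred` are illustrative) and also derives recall for the churn class, which can be a useful complement to accuracy in churn problems.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 1 = churned, 0 = stayed
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # correct predictions over all predictions
recall_churn = tp / (tp + fn)     # share of actual churners that were caught

print(cm)
print(accuracy, recall_churn)
```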

### Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
clf_forest = RandomForestClassifier(n_estimators=101, max_depth=10)  #101 trees, each with a maximum depth of 10

In [None]:
clf_forest.fit(X_train, y_train)

In [None]:
pred1 = clf_forest.predict(X_train)

In [None]:
accuracy_score(y_train, pred1)

In [None]:
confusion_matrix(y_train, pred1)

In [None]:
pred_test1 = clf_forest.predict(X_test)

In [None]:
accuracy_score(y_test, pred_test1)

The accuracy on training set is 5% higher than the accuracy on test set which indicates a slight overfitting. We can decrease the depth of a tree in the forest because as trees get deeper, they tend to be more specific which results in not generalizing well. However, reducing tree depth may also decrease the accuracy. So we need to be careful when optimizing the parameters. We can also increase the number of trees in the forest which will help the model to be more generalized and thus reduce overfitting. Parameter tuning is a very critical part in almost every project.

Another option is cross-validation, which allows every sample to be used for both training and testing.

GridSearchCV makes this process easy to handle. We can both do cross-validation and try different parameters using GridSearchCV.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {'n_estimators':[151,201,251,301], 'max_depth':[15,20,25]}
forest = RandomForestClassifier()
clf = GridSearchCV(estimator=forest, param_grid=parameters, n_jobs=-1, cv=5)

cv = 5 means 5-fold cross-validation: the data set is divided into 5 subsets. At each iteration, 4 subsets are used for training and the remaining subset is used as the test set. After 5 iterations, every sample has been used both for training and for testing.
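The fold mechanics can be made concrete with a tiny example. The sketch below uses plain `KFold` on ten dummy samples to keep the illustration simple; note that for classifiers, GridSearchCV with an integer `cv` uses stratified folds, which additionally preserve the class ratio in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)   # ten dummy samples, identified by their index
kf = KFold(n_splits=5)

test_seen = []
for train_idx, test_idx in kf.split(samples):
    # Each iteration trains on 4 folds and tests on the remaining one
    print("train:", train_idx, "test:", test_idx)
    test_seen.extend(test_idx.tolist())
```

Collecting the test indices across iterations shows that every sample serves as a test sample exactly once.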

n_jobs parameter is used to select how many processors to use. -1 means using all processors.

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.best_params_

## BEST MODEL USING THE PARAMETERS WE FOUND FROM GRIDSEARCHCV

In [None]:
rf=RandomForestClassifier(n_estimators=301, max_depth=20)

In [None]:
rf.fit(X_train,y_train)

In [None]:
predtest3=rf.predict(X_test)

In [None]:
accuracy_score(y_test,predtest3)

With the parameters found by GridSearchCV, the model reaches an accuracy of almost 90%. Note that accuracy_score above is computed on the test set; the mean cross-validated score of the best estimator is available separately as clf.best_score_. The previous random forest scored approximately 86% on average (88% on training and 84% on test), so the tuned model improves accuracy by about 4 percentage points.
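Instead of retyping the best parameters by hand, a fitted GridSearchCV object also exposes the refit model and its cross-validated score. The following is a minimal sketch on synthetic data (the data from `make_classification`, the small grid, and the names `search`, `X_tr`, `X_te` are all assumptions chosen to keep the example fast, not the notebook's actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Deliberately tiny grid so the search finishes quickly
grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)                       # best parameter combination
print(search.best_score_)                        # mean cross-validated accuracy of that combination
print(search.best_estimator_.score(X_te, y_te))  # accuracy of the refit model on held-out data
```

Keeping the cross-validated score and the held-out test score apart makes it clearer which number is being reported.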

## How to improve

We can always try to improve the model. The fuel of machine learning models is data, so collecting more data is generally helpful. We can also try a wider range of parameters in GridSearchCV, because a small adjustment in a parameter may slightly improve the model.

Finally, we can try more robust or advanced models. Please keep in mind that there is a trade-off in such decisions: advanced models may increase the accuracy, but they require more data and more computing power. So it comes down to a business decision.