Customer churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay-TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as key business metrics, because retaining an existing customer costs far less than acquiring a new one.

Predictive analytics uses churn prediction models that estimate each customer's propensity to churn. Because these models produce a small, prioritized list of potential defectors, they help focus customer retention marketing programs on the subset of the customer base that is most vulnerable to churn.

For this project, we will explore the dataset of a telecom company and try to predict customer churn.

**Problem Statement**

Using boosting, classify whether or not a customer will churn.

A zipped file containing the following items is given:

**train.csv**
The data file train.csv contains 5634 instances with 21 features, including the target feature.

**test.csv**
The data file test.csv contains 1409 instances with 20 features, excluding the target feature.

**Evaluation metrics**
For this particular dataset, we are using accuracy_score as the evaluation metric.

Submissions will be evaluated on accuracy_score according to the thresholds below.

| Your accuracy_score | Points earned for the task |
|---|---|
| 0.795 <= accuracy_score | 100% of the available points |
| 0.77 <= accuracy_score < 0.795 | 80% of the available points |
| 0.75 < accuracy_score < 0.77 | 70% of the available points |
| accuracy_score <= 0.75 | No points earned |
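To make the thresholds concrete, here is a minimal sketch that computes accuracy by hand (the fraction of predictions that match the true labels) and maps it to a points band. The helper names `accuracy` and `points_for_accuracy` are illustrative and not part of the assignment.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def points_for_accuracy(score):
    """Map an accuracy value to the percentage of points in the threshold table."""
    if score >= 0.795:
        return 100
    if score >= 0.77:
        return 80
    if score > 0.75:
        return 70
    return 0


# Toy labels: 4 of the 5 predictions are correct.
acc = accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(acc, points_for_accuracy(acc))
```

Note that a score of exactly 0.75 earns no points, because the lowest paying band starts strictly above 0.75.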

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
train_path="/content/drive/MyDrive/Colab_Notebooks/boosting_train.csv"
test_path = "/content/drive/MyDrive/Colab_Notebooks/boosting_test.csv"
df_train=pd.read_csv(train_path)
df_test=pd.read_csv(test_path)

Viewing the first 5 rows in train & test data

In [85]:
df_train.head()

Unnamed: 0,Id,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1370,7596-IIWYC,Female,0,No,No,27,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),20.25,538.2,No
1,5676,9103-CXVOK,Male,0,Yes,Yes,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Electronic check,19.75,19.75,No
2,5800,7129-CAKJW,Female,0,No,No,17,Yes,Yes,Fiber optic,No,No,Yes,No,No,No,Month-to-month,No,Bank transfer (automatic),80.05,1345.65,No
3,1645,9490-DFPMD,Female,1,No,No,42,Yes,Yes,Fiber optic,No,No,No,No,Yes,No,Month-to-month,Yes,Electronic check,84.65,3541.35,Yes
4,366,9069-LGEUL,Male,0,Yes,No,23,Yes,No,DSL,Yes,No,No,No,No,Yes,Month-to-month,Yes,Bank transfer (automatic),59.95,1406.0,No


In [86]:
df_test.head()


Unnamed: 0,Id,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,4539,4355-HBJHH,Male,0,Yes,Yes,67,Yes,Yes,DSL,Yes,No,Yes,No,Yes,Yes,Two year,Yes,Electronic check,79.7,5293.4
1,1802,7492-TAFJD,Male,0,Yes,Yes,7,No,No phone service,DSL,Yes,Yes,Yes,No,No,No,Two year,No,Mailed check,38.55,280.0
2,1380,1131-SUEKT,Male,0,Yes,Yes,61,Yes,No,Fiber optic,No,Yes,No,Yes,Yes,Yes,One year,Yes,Bank transfer (automatic),98.45,6145.2
3,5305,9027-TMATR,Female,0,Yes,No,43,Yes,No,DSL,Yes,No,Yes,Yes,Yes,Yes,Two year,Yes,Electronic check,78.8,3460.3
4,1960,5846-QFDFI,Female,0,Yes,Yes,33,Yes,Yes,Fiber optic,No,Yes,Yes,Yes,No,No,Month-to-month,No,Credit card (automatic),88.6,2888.7


Dropping the Id and customerID columns from both the train & test data

In [87]:
df_train.drop(["Id","customerID"],axis=1,inplace=True)
df_test.drop(["Id","customerID"],axis=1,inplace=True)

Checking the data types for the given dataset

In [88]:
df_train.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

Converting the object (categorical) columns to integers using category codes, and the string column TotalCharges to float, in both train & test data

In [92]:
# Encode every categorical (object) column in the training data as integer category codes.
train_cat_cols = ["gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
                  "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
                  "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
                  "PaperlessBilling", "PaymentMethod", "Churn"]
for col in train_cat_cols:
    df_train[col] = df_train[col].astype('category').cat.codes
# TotalCharges is stored as text; entries that cannot be parsed become NaN.
df_train['TotalCharges'] = pd.to_numeric(df_train['TotalCharges'], errors='coerce').astype('float')

In [93]:
# Encode the same categorical columns in the test data (there is no Churn column here).
test_cat_cols = ["gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
                 "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
                 "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
                 "PaperlessBilling", "PaymentMethod"]
for col in test_cat_cols:
    df_test[col] = df_test[col].astype('category').cat.codes
# TotalCharges is stored as text; entries that cannot be parsed become NaN.
df_test['TotalCharges'] = pd.to_numeric(df_test['TotalCharges'], errors='coerce').astype('float')
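Calling `.cat.codes` on train and test separately assigns codes from each frame's own sorted set of values, so the two mappings agree only when a column has the same set of values in both files. One way to guarantee a shared mapping is to fix the category list from the training column and reuse it for the test column. The sketch below assumes that approach; the helper `encode_with_train_categories` and the toy series are hypothetical.

```python
import pandas as pd


def encode_with_train_categories(train_col, test_col):
    """Encode both columns using the category order learned from train_col."""
    categories = sorted(train_col.dropna().unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes, categories


# Toy data: the test column happens to lack the "DSL" level.
toy_train = pd.Series(["DSL", "Fiber optic", "No", "DSL"])
toy_test = pd.Series(["No", "Fiber optic"])

train_codes, test_codes, categories = encode_with_train_categories(toy_train, toy_test)
print(list(train_codes))  # codes under the shared mapping
print(list(test_codes))   # same mapping applied to test

# Encoding the test column on its own yields different codes for the same values.
separate_codes = toy_test.astype("category").cat.codes
print(list(separate_codes))
```

With a fixed category list, a value that appears only in the test data is encoded as -1, which makes such mismatches easy to detect.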

Checking the data types after making changes

In [95]:
df_train.dtypes

gender                 int8
SeniorCitizen         int64
Partner                int8
Dependents             int8
tenure                int64
PhoneService           int8
MultipleLines          int8
InternetService        int8
OnlineSecurity         int8
OnlineBackup           int8
DeviceProtection       int8
TechSupport            int8
StreamingTV            int8
StreamingMovies        int8
Contract               int8
PaperlessBilling       int8
PaymentMethod          int8
MonthlyCharges      float64
TotalCharges        float64
Churn                  int8
dtype: object

Checking for null values in both train & test data

In [96]:
df_train.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        9
Churn               0
dtype: int64

In [97]:
df_test.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        2
dtype: int64

Filling the null values using forward fill in both train & test data

In [102]:
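# Series.ffill() is the current spelling of fillna(method='ffill'), which newer pandas versions deprecate.
df_train['TotalCharges'] = df_train['TotalCharges'].ffill()
df_test['TotalCharges'] = df_test['TotalCharges'].ffill()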
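To see what the coercion and the forward fill do, the toy example below (made-up values, not rows from the dataset) converts a text column to numbers and then fills the gaps. Forward fill copies the value from the previous row, which is arbitrary when row order carries no meaning, so filling with the median is shown as one alternative. That alternative is an assumption for illustration, not the approach used in this notebook.

```python
import pandas as pd

# Blank strings stand in for missing charges.
raw = pd.Series(["29.85", " ", "108.15", " ", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().sum())

# Forward fill: each NaN takes the value of the row before it.
forward_filled = numeric.ffill()
print(forward_filled.tolist())

# Alternative: fill with the median of the observed values.
median_filled = numeric.fillna(numeric.median())
print(median_filled.tolist())
```

A forward fill leaves a NaN in place when the very first row is missing, so re-checking `isnull().sum()` afterwards is still worthwhile.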

In [104]:
df_test.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
dtype: int64

Splitting the training data into X & y, then fitting a depth-1 decision tree and predicting on the held-out split

In [108]:
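# Features are the first 19 columns; the target (Churn) is the last column.
X=df_train.iloc[:,:19]
y=df_train.iloc[:,-1]
# Hold out 30% of the training data for validation.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 0)
# A depth-1 tree (a decision stump) is the weak learner reused by AdaBoost.
dt_clf=DecisionTreeClassifier(max_depth=1,random_state=0)
dt_clf.fit(X_train,y_train)
y_pred=dt_clf.predict(X_test)
from sklearn.metrics import accuracy_score
dt_accuracy_score = accuracy_score(y_test, y_pred)
print(dt_accuracy_score)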

0.7332939089296274
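A depth-1 tree makes only a single split, so it is worth comparing its accuracy with the trivial baseline of always predicting the most frequent class. The sketch below uses a made-up label list and a hypothetical helper, `majority_baseline_accuracy`; on the real data it would be called with `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


toy_y = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # 7 zeros and 3 ones
baseline = majority_baseline_accuracy(toy_y)
print(baseline)
```

If a model's accuracy is close to this baseline, it has learned little beyond the class balance.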


Boosting the decision tree with AdaBoost & calculating the accuracy score

In [109]:
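# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
# The weak learner is passed positionally because newer scikit-learn releases renamed the base_estimator keyword to estimator.
ada_clf = AdaBoostClassifier(dt_clf,random_state=0)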
ada_clf.fit(X_train,y_train)
ada_pred=ada_clf.predict(X_test)
ada_accuracyscore = accuracy_score(y_test, ada_pred)
print(ada_accuracyscore)

0.7971614429331756
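The jump from the single stump to the boosted ensemble comes from re-weighting: after each round, the samples the current weak learner got wrong receive more weight, so the next learner concentrates on them. The sketch below shows one round of the textbook discrete AdaBoost update for labels in {-1, +1}. It is an illustration of the idea and not necessarily the exact variant that scikit-learn runs; `adaboost_round` is a hypothetical helper.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.

    Returns the learner weight alpha and the renormalized sample weights.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 1:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is -1 for a mistake and +1 for a hit, so mistakes are scaled up
    # by exp(alpha) and hits are scaled down by exp(-alpha).
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # the weak learner gets the second sample wrong
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha)
print(new_weights)
```

After the update, the misclassified sample carries half of the total weight, which is what forces the next weak learner to do better on it.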


Using gradient boosting (which builds its own trees) & calculating the accuracy score

In [110]:
from sklearn.ensemble import GradientBoostingClassifier
gb_clf=GradientBoostingClassifier(random_state=0)
gb_clf.fit(X_train,y_train)
gda_pred=gb_clf.predict(X_test)
gda_accuracyscore = accuracy_score(y_test, gda_pred)
print(gda_accuracyscore)


0.7983441750443524


Using XGBoost & calculating the accuracy score

In [112]:
from xgboost import XGBClassifier
# XGBClassifier grows its own trees and has no base_estimator parameter, so none is passed.
xgb_clf = XGBClassifier(random_state=0)
xgb_clf.fit(X_train,y_train)
xgb_pred=xgb_clf.predict(X_test)
xgb_accuracyscore = accuracy_score(y_test, xgb_pred)
print(xgb_accuracyscore)

0.8030751034890597
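The three boosted models differ by well under one percentage point on a single 70/30 split, so the ranking could change with a different `random_state`. A hedged sketch of comparing models with 5-fold cross-validation is shown below. It runs on synthetic data from `make_classification` as a stand-in; on the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`, and an `XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=25, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=25, random_state=0),
}

cv_results = {}
for name, model in models.items():
    # Each model is refit on 5 different train/validation partitions.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean together with the spread across folds gives a better sense of whether one model is really ahead.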


Since XGBoost gives the highest accuracy on the held-out split (0.803, against 0.798 for gradient boosting and 0.797 for AdaBoost), we use it to predict the test data and write the submission file

In [113]:
y_pred_test = xgb_clf.predict(df_test)
print(y_pred_test)
submissions_f = pd.DataFrame(y_pred_test,columns = ['Churn'])
submissions_f.to_csv('/content/drive/MyDrive/Colab_Notebooks/boosting_sample_submission.csv')

[0 0 0 ... 0 0 0]
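The predictions are integer codes because `Churn` was encoded with `.cat.codes`, which numbers categories in sorted order, so `No` became 0 and `Yes` became 1. If the submission is expected to carry the original labels (an assumption, since the required format is not stated here), the reverse mapping could be sketched as follows, using a made-up prediction array:

```python
import numpy as np
import pandas as pd

# Reverse of the sorted-order encoding: "No" -> 0, "Yes" -> 1.
code_to_label = {0: "No", 1: "Yes"}

toy_pred = np.array([0, 0, 1, 0, 1])  # stand-in for y_pred_test
labels = pd.Series(toy_pred, name="Churn").map(code_to_label)
print(labels.tolist())
```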
