By: Dominikus Krisna Herlambang | ©2023

## Metadata

Each row represents a customer, and each column contains one of the customer's attributes, as described in the column metadata.

The data set includes information about:

1. Customers who left within the last month – the column is called Churn
2. Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
3. Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
4. Demographic info about customers – gender, age range, and if they have partners and dependents


# Install & Load Library

In [None]:
!pip install scikit-plot
!pip install dataprep
!pip install pycaret

In [None]:
# load pandas for data wrangling
import pandas as pd
# load numpy for vector manipulation
import numpy as np
# load matplotlib for data visualization
import matplotlib.pyplot as plt
# load seaborn for data visualization
import seaborn as sns
# load dataprep.eda for data exploration
from dataprep.eda import *

# load train_test_split to divide the data into train and test sets
from sklearn.model_selection import train_test_split

# import all classification models from pycaret
from pycaret.classification import *

# load scikitplot for metric visualization
import scikitplot as skplt

%matplotlib inline

# Load Dataset

In [None]:
raw_data = pd.read_csv("data_project/Project_3/telco_customer.csv")
raw_data

# Data Inspection

In [None]:
# check data structure
raw_data.info()

Check missing values

In [None]:
# plot missing value inside raw_data
plot_missing(raw_data)

But wait: it is pretty strange that `TotalCharges` has the `object` data type. We need to convert it to a numeric data type.

In [None]:
raw_data['TotalCharges'] = pd.to_numeric(raw_data['TotalCharges'], errors = 'coerce')
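
To see why this conversion can introduce missing values, here is a minimal sketch on a toy column. The values are hypothetical and only mimic a numeric column stored as text in which one entry is blank; they are not taken from the dataset.

```python
import pandas as pd

# Toy column (hypothetical values): numbers stored as text, one entry is a blank string
toy_charges = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(toy_charges, errors="coerce")

print(converted)
print("missing values after coercion:", converted.isna().sum())
```

The blank entry cannot be parsed as a number, so it becomes `NaN` and shows up as a missing value afterwards.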

Check again, because NA values will probably appear as a result of the data type coercion

In [None]:
plot_missing(raw_data)

In [None]:
raw_data[raw_data['TotalCharges'].isna()]

In [None]:
raw_data.query('tenure == 0')

Because most rows with a missing `TotalCharges` have a tenure of 0, we can impute the missing values with `0`


In [None]:
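# fill only TotalCharges; a boolean row mask here would overwrite every column of those rows with 0
raw_data['TotalCharges'] = raw_data['TotalCharges'].fillna(0)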

Check null values again

In [None]:
plot_missing(raw_data).show()

Typecasting categorical variables into numeric

In [None]:
# change gender 'Male' and 'Female' to 0 and 1
raw_data['gender'].replace(['Male','Female'],[0,1],inplace=True)

# change partner 'Yes' and 'No' to 1 and 0
raw_data['Partner'].replace(['Yes','No'],[1,0],inplace=True)

# change dependent 'Yes' and 'No' to 1 and 0
raw_data['Dependents'].replace(['Yes','No'],[1,0],inplace=True)

# change PhoneService 'Yes' and 'No' to 1 and 0
raw_data['PhoneService'].replace(['Yes','No'],[1,0],inplace=True)

# change MultipleLines 'No phone service' and 'No' to 0 and 'Yes' to 1
raw_data['MultipleLines'].replace(['No phone service','No', 'Yes'],[0,0,1],inplace=True)

# change InternetService 'No', 'DSL', and 'Fiber optic' to 0, 1, and 2
raw_data['InternetService'].replace(['No','DSL','Fiber optic'],[0,1,2],inplace=True)

# change OnlineSecurity 'No' and 'No internet service' to 0 and 'Yes' to 1
raw_data['OnlineSecurity'].replace(['No','Yes','No internet service'],[0,1,0],inplace=True)

# change OnlineBackup 'No' and 'No internet service' to 0 and 'Yes' to 1
raw_data['OnlineBackup'].replace(['No','Yes','No internet service'],[0,1,0],inplace=True)

# change DeviceProtection 'No' and 'No internet service' to 0 and 'Yes' to 1
raw_data['DeviceProtection'].replace(['No','Yes','No internet service'],[0,1,0],inplace=True)

# change TechSupport 'No' and 'No internet service' to 0 and 'Yes' to 1
raw_data['TechSupport'].replace(['No','Yes','No internet service'],[0,1,0],inplace=True)

# change StreamingTV 'No' and 'No internet service' to 0 and 'Yes' to 1
raw_data['StreamingTV'].replace(['No','Yes','No internet service'],[0,1,0],inplace=True)

# change StreamingMovies 'No' and 'No internet service' to 0 and 'Yes' to 1
raw_data['StreamingMovies'].replace(['No','Yes','No internet service'],[0,1,0],inplace=True)

# change Contract 'Month-to-month', 'One year', and 'Two year' to 0, 1, and 2
raw_data['Contract'].replace(['Month-to-month', 'One year', 'Two year'],[0,1,2],inplace=True)

# change PaperlessBilling 'Yes' and 'No' to 1 and 0
raw_data['PaperlessBilling'].replace(['Yes','No'],[1,0],inplace=True)

# change PaymentMethod 'Electronic check', 'Mailed check',
# 'Bank transfer (automatic)', and 'Credit card (automatic)' to 0, 1, 2, and 3
raw_data['PaymentMethod'].replace(['Electronic check', 'Mailed check', 'Bank transfer (automatic)','Credit card (automatic)'],[0,1,2,3],inplace=True)

# change Churn 'Yes' and 'No' to 1 and 0
raw_data['Churn'].replace(['Yes','No'],[1,0],inplace=True)
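
Recent pandas releases may emit warnings when `replace(..., inplace=True)` is called on a selected column. As an alternative, the same encoding can be sketched with dictionaries and `Series.map`, assigning the result back to the column. The toy frame and the dictionary names below are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two of the categorical columns
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "Two year", "One year"],
})

# Explicit category-to-number mappings (names are illustrative)
yes_no = {"Yes": 1, "No": 0}
contract_levels = {"Month-to-month": 0, "One year": 1, "Two year": 2}

# map returns a new Series, so the result is assigned back instead of modified in place
encoded = toy.assign(
    Partner=toy["Partner"].map(yes_no),
    Contract=toy["Contract"].map(contract_levels),
)

print(encoded)
```

Note that `map` yields `NaN` for any value missing from the dictionary, which makes unexpected categories easy to spot.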

Check new data structure

In [None]:
raw_data.info()

We need to remove the customer ID from the dataset, since it is an identifier rather than a predictor

In [None]:
raw_data = raw_data.drop(["customerID"], axis = 1)

# Train-Test Split Data

Split the data before data exploration and feature engineering, so that the test set does not influence those steps

In [None]:
# define predictor variables
X = raw_data.drop(["Churn"], axis = 1)
# define target variable
y = raw_data["Churn"]

In [None]:
# split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(
    # using predictor from X
    X,
    # and target variable from y
    y,
    # with test set size around 20%
    test_size=0.2,
    # use stratified sampling using y
    stratify = y,
    # seed random number generator
    random_state=1000
)

In [None]:
# check X_train and X_test dimension
print("X_train dimension: ", X_train.shape)
print("X_test dimension: ", X_test.shape)

In [None]:
# check y_train and y_test dimension
print("y_train dimension: ", y_train.shape)
print("y_test dimension: ", y_test.shape)
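
As a sanity check on what `stratify` does, the sketch below splits a small synthetic target and compares the positive-class rate in both parts. The toy arrays are assumptions (80 negatives and 20 positives), not the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target (assumption): 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1000
)

# With stratification, both parts should keep roughly the original 20% positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate in train:", train_rate)
print("positive rate in test: ", test_rate)
```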

# Data Exploration

In [None]:
# include y_train to X_train with name 'Churn'
X_train["Churn"] = y_train

In [None]:
# include y_test to X_test with name 'Churn'
X_test["Churn"] = y_test

Let's check target distribution

In [None]:
# visualize 'Churn' using catplot
sns.catplot(x = "Churn", kind = "count", data = X_train);

In [None]:
# check the proportion of each category
y_train.value_counts(normalize=True)

In [None]:
# visualize using plot from dataprep
plot(X_train, 'Churn').show()

We can observe that the dataset is imbalanced. We can address this in three ways:

- While modeling, by adding weight on model parameter
- Post-modeling, by changing the classification threshold to optimize metrics such as F1-score, precision, recall, etc.
- Pre-modeling, by resampling, such as oversampling, downsampling, and mixed sampling
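
The first two options can be sketched with scikit-learn on synthetic data. This is only an illustration under stated assumptions: the data comes from `make_classification`, logistic regression stands in for whichever model is used later, and the 0.3 threshold is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data (assumption: roughly 80% class 0 and 20% class 1)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=0
)

# While modeling: weight classes inversely to their frequency
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_toy, y_toy)

# Post-modeling: keep the unweighted model but lower the decision threshold
plain = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
positive_proba = plain.predict_proba(X_toy)[:, 1]
recall_default = recall_score(y_toy, (positive_proba >= 0.5).astype(int))
recall_lowered = recall_score(y_toy, (positive_proba >= 0.3).astype(int))

print("recall, weighted model:   ", recall_score(y_toy, weighted.predict(X_toy)))
print("recall, threshold of 0.5: ", recall_default)
print("recall, threshold of 0.3: ", recall_lowered)
```

Lowering the threshold can only add predicted positives, so recall never decreases, usually at the cost of precision.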

We will apply pre-modeling oversampling if necessary

Next, we will check each variable for multicollinearity


In [None]:
# plot correlation matrix
plot_correlation(X_train)

In [None]:
# plot scatter plot
plot_correlation(X_train, "InternetService", "MonthlyCharges")

In [None]:
# plot the correlation of each variable against Churn
plot_correlation(X_train, "Churn")

We observe multicollinearity between several pairs of predictor variables, such as tenure vs. TotalCharges, Contract vs. tenure, StreamingTV vs. MonthlyCharges, etc.
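
Strongly correlated pairs can also be listed programmatically. The sketch below uses plain pandas on synthetic data; the toy columns only borrow names from the dataset, and the 0.7 cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): TotalCharges is built as tenure * MonthlyCharges
rng = np.random.default_rng(0)
tenure = rng.integers(1, 72, size=200)
monthly = rng.uniform(60, 80, size=200)
toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})

# Absolute Pearson correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Keep pairs above the (arbitrary) 0.7 cutoff
high_pairs = upper.stack().loc[lambda s: s > 0.7]
print(high_pairs)
```

For each flagged pair, one of the two variables is a candidate for removal.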

## Predictors vs Target Variable

In [None]:
for feature in ['gender', 'SeniorCitizen', 'Partner', 'MonthlyCharges']:
    plot(X_train, feature, 'Churn').show()

**Interpretation**

1. Gender seems to have no impact on churn, since the churn ratio is nearly the same for both genders.

2. Senior citizens have a higher tendency to churn than non-senior citizens.

3. People who have a partner have a lower tendency to churn than people who have no partner.

4. People who churn have higher monthly charges than people who do not churn, though the difference is not very pronounced.
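
Readings like these can be backed with numbers by computing the churn rate per group. A minimal sketch on a toy frame follows; the rows are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical rows): four non-senior and four senior customers
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1, 1, 1],
    "Churn":         [0, 0, 0, 1, 1, 1, 0, 1],
})

# Because Churn is encoded as 0/1, the group mean is the churn rate of that group
churn_rate = toy.groupby("SeniorCitizen")["Churn"].mean()
print(churn_rate)
```

The same pattern applies to any encoded categorical predictor, for example `X_train.groupby('Partner')['Churn'].mean()`.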


## Task

Analyze the other predictors. Check the correlation between each predictor and the target variable, and explain, based on your analysis, why certain variables should be kept in or removed from the analysis.

___

Also, we need to drop the removed predictors, if any, from both `X_train` and `X_test`

In [None]:
removed_features = ['gender', 'StreamingTV', 'PaymentMethod', 'tenure', 'TechSupport', 'DeviceProtection']
X_train = X_train.drop(removed_features, axis = 1)
X_test = X_test.drop(removed_features, axis = 1)

# Modeling

## Define Model

We will try **all** available models to see which one achieves the highest accuracy. The PyCaret package makes this easy.

In [None]:
s = setup(X_train, test_data = X_test, target = 'Churn', session_id = 1)

How many models can we use and test? As many as possible! We can list the available models by calling `models`:

In [None]:
models()

Then we train and compare the models by calling `compare_models`:

In [None]:
# compare baseline models
best = compare_models()

In [None]:
# check best model
best

## Model Evaluation

In [None]:
# plot confusion matrix
plot_model(best, plot = 'confusion_matrix')

In [None]:
# plot AUC
plot_model(best, plot = 'auc')

In [None]:
# plot feature importance
plot_model(best, plot = 'feature')

In [None]:
# plot lift curve
plot_model(best, plot = 'lift')

In [None]:
# plot cumulative gain chart
plot_model(best, plot = 'gain')

In [None]:
# predict on test set
test_data_pred = predict_model(best)

In [None]:
# show predictions df
test_data_pred.head()
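
Once predictions are available, they can also be scored by hand against the true labels. The sketch below uses scikit-learn metrics on toy label lists; the values are hypothetical and are not the output of the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 1 means churn, 0 means no churn
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision)
print("recall:   ", recall)
```

For churn, recall is often the metric to watch, since a missed churner (false negative) is a customer who leaves without being contacted.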

## Task

Are you satisfied with the result? If not, let's play with the model! Try different combinations of the parameters we can set in PyCaret. Check the documentation by calling `help(setup)`.

In [None]:
s = setup(
    X_train,
    test_data = X_test,
    target = 'Churn',
    session_id = 1,
    # only change the parameters below this comment
    polynomial_features = False,
    remove_multicollinearity = False,
    remove_outliers = False,
    fix_imbalance = False,
    normalize = False,
    feature_selection = False
)

In [None]:
help(setup)

## Save and Load Model

To save the model we have built, we can call `save_model`:

In [None]:
# save pipeline
save_model(best, 'best_model')

In [None]:
# load model
best_model_loaded = load_model('best_model')

In [None]:
# check best_model_loaded
best_model_loaded