# Telco Customer Churn - Create models

In this notebook, we look at customer churn in the telecommunications sector.  
Using the [Telco Customer Churn data](https://www.kaggle.com/blastchar/telco-customer-churn) from Kaggle, we compare the predictions of four machine learning algorithms against the actual churn in the past month:  
- Dummy Prediction
- Logistic Regression Prediction
- Naive Bayes Prediction
- SVM Classifier Linear Prediction

Note: we train the models with last month's churn data using the algorithm provided in [Telecom Customer Churn Prediction](https://www.kaggle.com/pavanraj159/telecom-customer-churn-prediction).

### Things to install
pip install imbalanced-learn  
pip install scikit-learn

`imbalanced-learn` (imported as `imblearn`) is only needed if you would like to use a SMOTE approach.

Load packages

In [1]:
# !conda install -c numba/label/dev numba
# !pip install pandas_profiling imbalanced-learn scikit-learn

In [2]:
import os
import numpy as np
import pandas as pd
from _utils import data_utils, prediction
from pandas_profiling import ProfileReport
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from collections import Counter
import pickle

### Global variables

In [3]:
OUTPUT_PATH = "./output/"
MODELS_PATH = "./models/"

# STEP 1: Load the data

In [4]:
binary_df = pd.read_csv(os.path.join(OUTPUT_PATH, "all_df.csv"))
train_df_transf = pd.read_csv(os.path.join(OUTPUT_PATH, "train_df_transf.csv"))
test_df_transf = pd.read_csv(os.path.join(OUTPUT_PATH, "test_df_transf.csv"))

In [5]:
ProfileReport(binary_df)


We create a few new columns in preparation for the machine learning output.  
In the actual churn data, `ChurnProbability` is fixed because the customers have already churned. Hence, we give the probability a value of 1.  
In this base use case, `ChurnPredicted` is the actual churn.

In [6]:
cols = [c for c in train_df_transf.columns if c != "Churn"]
target_col = "Churn"

train_X = train_df_transf[cols]
train_Y = train_df_transf[target_col]

test_X = test_df_transf[cols]
test_Y = test_df_transf[target_col]

# STEP 2: Modelling

You can expand the sections below to see how we train the models. As we reference an existing algorithm, we will not explain it further. Our purpose is to analyse the predictions and their impact on telco churn.

## Create models
Here, we build the models that are compared later on.

#### Dummy Model - Uniform
This model predicts churn uniformly at random, ignoring the features.

In [7]:
dummy_unif_clf = DummyClassifier(strategy="uniform")
dummy_unif_clf.fit(train_X, train_Y)

dummy_unif_clf = prediction.churn_prediction(
    dummy_unif_clf,
    test_X,
    test_Y,
    train_X.columns,
    "features",
    threshold_plot=True,
    coefs_or_features=False,
)

-------------------------------------------------------------------------------
DummyClassifier(strategy='uniform')
-------------------------------------------------------------------------------


 Classification report: 
               precision    recall  f1-score   support

           0       0.72      0.50      0.59       256
           1       0.27      0.49      0.35        96

    accuracy                           0.50       352
   macro avg       0.50      0.49      0.47       352
weighted avg       0.60      0.50      0.52       352

F1 score:  0.35
ROC AUC:  0.49 
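
As a sanity check on the baseline numbers above, the following sketch works out roughly what a coin-flip predictor should score on a test set with this class balance (256 non-churners and 96 churners, taken from the support column). The helper name `expected_uniform_metrics` is made up for illustration and is not part of `_utils`; the F1 value is an approximation obtained from the expected precision and recall.

```python
def expected_uniform_metrics(n_negative, n_positive):
    """Approximate positive-class metrics for a coin-flip classifier."""
    prevalence = n_positive / (n_negative + n_positive)
    # A coin flip flags about half of every class, so recall is 0.5 in expectation
    recall = 0.5
    # Among flagged customers, the share of true churners equals the prevalence
    precision = prevalence
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = expected_uniform_metrics(n_negative=256, n_positive=96)
print({name: round(value, 2) for name, value in metrics.items()})
# {'precision': 0.27, 'recall': 0.5, 'f1': 0.35}
```

These values are close to the churn-class row of the report above, which suggests the dummy model behaves as a random baseline should.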



### Save the model to disk

In [8]:
filename = os.path.join(MODELS_PATH, "dummy_unif_clf.sav")
with open(filename, "wb") as model_file:  # closes the file after writing
    pickle.dump(dummy_unif_clf, model_file)
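
To reuse a saved model later, it can be read back with `pickle.load`. The sketch below shows the round trip under some assumptions: a plain dictionary stands in for the fitted classifier, and a temporary directory stands in for `MODELS_PATH`, so the snippet runs without touching the real `./models/` folder.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object is handled the same way
model_stand_in = {"name": "dummy_unif_clf", "strategy": "uniform"}

with tempfile.TemporaryDirectory() as models_dir:  # hypothetical models folder
    filename = os.path.join(models_dir, "dummy_unif_clf.sav")

    # Context managers close the file even if pickling raises an error
    with open(filename, "wb") as model_file:
        pickle.dump(model_stand_in, model_file)

    # Reading the file back returns an equal copy of the saved object
    with open(filename, "rb") as model_file:
        reloaded = pickle.load(model_file)

print(reloaded == model_stand_in)
```

Pickle files should only be loaded from trusted sources, and a scikit-learn model is best reloaded with the same library version that saved it.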

#### Dummy Model - Stratified
This model predicts churn at random while respecting the training set’s class distribution.  
Enable this if you'd like to try a different dummy model.

### Save the model to disk

#### Dummy Model - Most frequent
This model always predicts the majority class (the most frequent label in the training set).  
Enable this if you'd like to try a different dummy model.
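
The two alternative baselines described above could be set up along the following lines. This is a minimal sketch on made-up toy data (8 non-churners, 2 churners), not the notebook's disabled cells; the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: 8 non-churners (0) and 2 churners (1)
X_toy = np.zeros((10, 1))  # dummy classifiers ignore the feature values
y_toy = np.array([0] * 8 + [1] * 2)

# "most_frequent" always predicts the majority class seen during fit
dummy_freq_clf = DummyClassifier(strategy="most_frequent")
dummy_freq_clf.fit(X_toy, y_toy)
majority_preds = dummy_freq_clf.predict(X_toy)

# "stratified" samples labels according to the training class frequencies
dummy_strat_clf = DummyClassifier(strategy="stratified", random_state=0)
dummy_strat_clf.fit(X_toy, y_toy)
class_frequencies = dummy_strat_clf.class_prior_

print(majority_preds)      # every prediction is the majority class 0
print(class_frequencies)   # training frequencies of classes 0 and 1
```

In the notebook itself, `train_X` and `train_Y` would take the place of the toy arrays.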

### Save the model to disk

#### Naive Bayes Model

Gaussian Naive Bayes assumes that, within each class, the features are independent of each other and each feature follows a Gaussian distribution.
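
To make the two assumptions concrete, here is a simplified hand-rolled version of the prediction rule using NumPy only. It is an illustrative sketch on made-up data, not scikit-learn's implementation: the function name `gaussian_nb_predict` and the fixed variance floor of `1e-9` are choices made for this example.

```python
import numpy as np


def gaussian_nb_predict(X_train, y_train, X_new):
    """Predict labels with a simplified Gaussian Naive Bayes."""
    classes = np.unique(y_train)
    log_posteriors = []
    for label in classes:
        X_c = X_train[y_train == label]
        mean = X_c.mean(axis=0)
        var = X_c.var(axis=0) + 1e-9  # small floor avoids division by zero
        log_prior = np.log(len(X_c) / len(X_train))
        # Independence assumption: per-feature Gaussian log-densities are summed
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_posteriors.append(log_prior + log_likelihood)
    # Pick the class with the highest (unnormalised) log-posterior per sample
    return classes[np.argmax(np.vstack(log_posteriors), axis=0)]


# Hypothetical toy data: two well-separated clusters
X_toy = np.array(
    [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
)
y_toy = np.array([0, 0, 0, 1, 1, 1])
predictions = gaussian_nb_predict(X_toy, y_toy, np.array([[1.1, 2.1], [5.1, 5.9]]))
print(predictions)  # [0 1]
```

When features are strongly correlated, as one-hot encoded service columns often are, the independence assumption is violated and the predicted probabilities tend to be overconfident, even if the ranking of customers stays useful.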

In [9]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
gnb_clf.fit(train_X, train_Y.values.ravel())

gnb_clf = prediction.churn_prediction(
    gnb_clf,
    test_X,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=True,
    coefs_or_features=False,
)

-------------------------------------------------------------------------------
GaussianNB()
-------------------------------------------------------------------------------


 Classification report: 
               precision    recall  f1-score   support

           0       0.85      0.89      0.87       256
           1       0.66      0.58      0.62        96

    accuracy                           0.80       352
   macro avg       0.75      0.74      0.74       352
weighted avg       0.80      0.80      0.80       352

F1 score:  0.62
ROC AUC:  0.74 



### Save the model to disk

In [10]:
filename = os.path.join(MODELS_PATH, "gnb_clf.sav")
with open(filename, "wb") as model_file:  # closes the file after writing
    pickle.dump(gnb_clf, model_file)

#### Logistic Regression Model

In [11]:
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression()

params_grid = {
    "penalty": ["l2"],
    "C": [0.1, 0.5, 1.0, 5, 10],
    "solver": ["liblinear", "newton-cg", "lbfgs"],
    "class_weight": [
        "balanced",
        None,
        {0: 1, 1: 1.5},
        {0: 1, 1: 2},
        {0: 1, 1: 3},
        {0: 1, 1: 5},
    ],
    "random_state": [0],
}

lr_clf = GridSearchCV(
    estimator=estimator, param_grid=params_grid, scoring="roc_auc", n_jobs=-1, cv=10
)

lr_clf.fit(train_X, train_Y.values.ravel())

lr_clf = lr_clf.best_estimator_

lr_clf = prediction.churn_prediction(
    lr_clf,
    test_X,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=True,
    coefs_or_features=True,
)

-------------------------------------------------------------------------------
LogisticRegression(C=0.1, class_weight={0: 1, 1: 1.5}, random_state=0,
                   solver='newton-cg')
-------------------------------------------------------------------------------


 Classification report: 
               precision    recall  f1-score   support

           0       0.86      0.88      0.87       256
           1       0.67      0.62      0.65        96

    accuracy                           0.81       352
   macro avg       0.76      0.75      0.76       352
weighted avg       0.81      0.81      0.81       352

F1 score:  0.65
ROC AUC:  0.75 



### Save the model to disk

In [12]:
filename = os.path.join(MODELS_PATH, "lr_clf.sav")
with open(filename, "wb") as model_file:  # closes the file after writing
    pickle.dump(lr_clf, model_file)

#### SVM Classifier Linear Model

**This cell will take a few minutes to run!**
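
The run time follows from the size of the search. The sketch below counts the candidate settings in the SVC grid used in this notebook and multiplies by the number of cross-validation folds; it uses only the standard library, and the variable names `n_candidates` and `n_fits` are made up for this example.

```python
from itertools import product

# Same grid and fold count as the SVC cell in this notebook
params_grid = {
    "C": [0.1, 0.5, 1.0, 5],
    "kernel": ["linear", "rbf"],
    "gamma": ["auto", "scale"],
    "class_weight": ["balanced", None, {0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}],
    "probability": [True],
}
cv_folds = 10

# Every combination of values is one candidate model
n_candidates = len(list(product(*params_grid.values())))
# Each candidate is fitted once per fold; GridSearchCV then refits the best one
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 80 800
```

Two settings add to the cost: `probability=True` makes each SVC fit run an extra internal cross-validation to calibrate probabilities, and `gamma` has no effect on the linear kernel, so half of the linear candidates repeat work that has already been done.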

In [13]:
from sklearn.svm import SVC

estimator = SVC()

params_grid = {
    "C": [0.1, 0.5, 1.0, 5],
    "kernel": ["linear", "rbf"],
    "gamma": ["auto", "scale"],
    "class_weight": [
        "balanced",
        None,
        {0: 1, 1: 2},
        {0: 1, 1: 3},
        {0: 1, 1: 5},
    ],
    "probability": [True],
}

svc_clf = GridSearchCV(
    estimator=estimator, param_grid=params_grid, scoring="roc_auc", n_jobs=-1, cv=10
)

svc_clf.fit(train_X, train_Y.values.ravel())

svc_clf = svc_clf.best_estimator_

svc_clf = prediction.churn_prediction(
    svc_clf,
    test_X,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=False,
    coefs_or_features=True,
)

-------------------------------------------------------------------------------
SVC(C=0.1, class_weight='balanced', gamma='auto', kernel='linear',
    probability=True)
-------------------------------------------------------------------------------


 Classification report: 
               precision    recall  f1-score   support

           0       0.89      0.75      0.81       256
           1       0.53      0.76      0.62        96

    accuracy                           0.75       352
   macro avg       0.71      0.75      0.72       352
weighted avg       0.79      0.75      0.76       352

F1 score:  0.62
ROC AUC:  0.75 



### Save the model to disk

In [14]:
filename = os.path.join(MODELS_PATH, "svc_clf.sav")
with open(filename, "wb") as model_file:  # closes the file after writing
    pickle.dump(svc_clf, model_file)