# Telco Customer Churn - Create models

In this notebook, we look at customer churn in the telecommunications sector.  
Using the [Telco Customer Churn data](https://www.kaggle.com/blastchar/telco-customer-churn) from Kaggle, we compare the accuracy of 4 machine learning algorithms against the actual churn in the past month:  
- Logistic Regression Prediction
- Logistic Regression (SMOTE) Prediction
- Naive Bayes Prediction
- SVM Classifier Linear Prediction

Note: we train the models with last month's churn data using the algorithm provided in [Telecom Customer Churn Prediction](https://www.kaggle.com/pavanraj159/telecom-customer-churn-prediction).

### Things to install
pip install imbalanced-learn  
pip install scikit-learn

!pip install imbalanced-learn scikit-learn

Load packages

In [1]:
# !conda install -c numba/label/dev numba
# !pip install pandas_profiling imbalanced-learn scikit-learn

In [2]:
import os
import atoti as tt
import numpy as np
import pandas as pd
from _utils import data_utils, prediction
from pandas_profiling import ProfileReport
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from collections import Counter
import pickle

Welcome to atoti 0.5.3!


### Global variables

In [3]:
PROJECT_PATH = "./"
DATA_PATH = "./data/"
MODELS_PATH = "./models/"

# STEP 1: Load the data

In [4]:
binary_df = pd.read_csv(os.path.join(DATA_PATH, "all_df.csv"))
train_df_transf = pd.read_csv(os.path.join(DATA_PATH, "train_df_transf.csv"))
test_df_transf = pd.read_csv(os.path.join(DATA_PATH, "test_df_transf.csv"))

In [5]:
ProfileReport(binary_df)



We create a few new columns in preparation for the machine learning output.  
In the actual churn data, `ChurnProbability` is fixed because the customers have already churned, so we set the probability to 1.  
`ChurnPredicted` is the actual churn in this base use case.

In [6]:
cols = [c for c in train_df_transf.columns if c != "Churn"]
target_col = "Churn"

train_X = train_df_transf[cols]
train_Y = train_df_transf[target_col]

test_X = test_df_transf[cols]
test_Y = test_df_transf[target_col]

# STEP 2: Modelling

You can expand the sections below to see how we train the models. Since the algorithm comes from the referenced notebook, we will not explain it further. Our purpose is to analyse the predictions and their impact on telco churn.

## Create models
Here, we build the models that are compared later.

#### Dummy Model - Uniform
This model predicts churn uniformly at random.

In [7]:
dummy_unif_clf = DummyClassifier(strategy="uniform")
dummy_unif_clf.fit(train_X, train_Y)

dummy_unif_clf = prediction.churn_prediction(
    dummy_unif_clf,
    train_X,
    test_X,
    train_Y,
    test_Y,
    train_X.columns,
    "features",
    threshold_plot=True,
    coefs_or_features=False,
)

-------------------------------------------------------------------------------
DummyClassifier(strategy='uniform')
-------------------------------------------------------------------------------


 Classification report - test : 
               precision    recall  f1-score   support

           0       0.74      0.47      0.57       256
           1       0.28      0.55      0.37        96

    accuracy                           0.49       352
   macro avg       0.51      0.51      0.47       352
weighted avg       0.61      0.49      0.52       352

F1 score - test :  0.37
ROC AUC - test:  0.51 


                ---------------------------------                             


 Classification report - train: 
               precision    recall  f1-score   support

           0       0.72      0.49      0.59      4907
           1       0.25      0.47      0.33      1773

    accuracy                           0.49      6680
   macro avg       0.49      0.48      0.46      6680
weigh

### Save the model to disk

In [8]:
filename = os.path.join(MODELS_PATH, "dummy_unif_clf.sav")
with open(filename, "wb") as f:  # the context manager closes the file handle
    pickle.dump(dummy_unif_clf, f)
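
A saved model is only useful if it can be read back later. The following is a minimal sketch of the save-and-reload round trip, assuming a plain dictionary as a stand-in for the fitted classifier and a temporary directory in place of `MODELS_PATH`. Both are illustrative choices and not part of the original notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted classifier (an assumption of this sketch);
# any picklable Python object goes through the same round trip.
model = {"name": "dummy_unif_clf", "strategy": "uniform"}

with tempfile.TemporaryDirectory() as models_path:
    filename = os.path.join(models_path, "dummy_unif_clf.sav")

    # Write: the context manager closes the file even if pickling fails.
    with open(filename, "wb") as f:
        pickle.dump(model, f)

    # Read: only unpickle files you trust, since loading can execute arbitrary code.
    with open(filename, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

The same pattern applies to each `.sav` file written in this notebook.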

#### Dummy Model - Stratified
This model predicts churn by sampling labels according to the training set’s class distribution.

In [9]:
dummy_strat_clf = DummyClassifier(strategy="stratified")
dummy_strat_clf.fit(train_X, train_Y)

dummy_strat_clf = prediction.churn_prediction(
    dummy_strat_clf,
    train_X,
    test_X,
    train_Y,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=True,
    coefs_or_features=False,
)

-------------------------------------------------------------------------------
DummyClassifier(strategy='stratified')
-------------------------------------------------------------------------------


 Classification report - test : 
               precision    recall  f1-score   support

           0       0.73      0.71      0.72       256
           1       0.29      0.31      0.30        96

    accuracy                           0.61       352
   macro avg       0.51      0.51      0.51       352
weighted avg       0.61      0.61      0.61       352

F1 score - test :  0.3
ROC AUC - test:  0.51 


                ---------------------------------                             


 Classification report - train: 
               precision    recall  f1-score   support

           0       0.73      0.73      0.73      4907
           1       0.26      0.26      0.26      1773

    accuracy                           0.61      6680
   macro avg       0.49      0.49      0.49      6680
wei

### Save the model to disk

In [10]:
filename = os.path.join(MODELS_PATH, "dummy_strat_clf.sav")
with open(filename, "wb") as f:  # the context manager closes the file handle
    pickle.dump(dummy_strat_clf, f)

#### Dummy Model - Most frequent
This model always predicts the majority class (the most frequent label in the training set).
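
Because the three dummy strategies ignore the features, their expected accuracy follows from the class proportions alone. The sketch below works this out from the support counts printed in the classification reports in this notebook (4907/1773 for training, 256/96 for testing). The helper names are illustrative.

```python
# Support counts taken from the classification reports in this notebook.
TRAIN_NEG, TRAIN_POS = 4907, 1773
TEST_NEG, TEST_POS = 256, 96


def positive_rate(neg, pos):
    """Share of churners (class 1) in a dataset."""
    return pos / (neg + pos)


def expected_accuracy(p_predict_pos, p_actual_pos):
    """Expected accuracy when predictions are drawn independently of the features.

    A prediction is correct when prediction and label are both 1 or both 0.
    """
    return p_predict_pos * p_actual_pos + (1 - p_predict_pos) * (1 - p_actual_pos)


p_train = positive_rate(TRAIN_NEG, TRAIN_POS)
p_test = positive_rate(TEST_NEG, TEST_POS)

baselines = {
    "uniform": expected_accuracy(0.5, p_test),         # coin flip
    "stratified": expected_accuracy(p_train, p_test),  # samples the training class mix
    "most_frequent": expected_accuracy(0.0, p_test),   # always predicts class 0
}

for name, acc in baselines.items():
    print(f"{name:>13}: expected test accuracy = {acc:.2f}")
```

The expected values (0.50, 0.61, 0.73) are close to the test accuracies the dummy models report in this notebook, so any real model needs to beat these numbers to show that it has learned something from the features.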

In [11]:
dummy_major_clf = DummyClassifier(strategy="most_frequent")
dummy_major_clf.fit(train_X, train_Y)

dummy_major_clf = prediction.churn_prediction(
    dummy_major_clf,
    train_X,
    test_X,
    train_Y,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=True,
    coefs_or_features=False,
)

-------------------------------------------------------------------------------
DummyClassifier(strategy='most_frequent')
-------------------------------------------------------------------------------


 Classification report - test : 
               precision    recall  f1-score   support

           0       0.73      1.00      0.84       256
           1       0.00      0.00      0.00        96

    accuracy                           0.73       352
   macro avg       0.36      0.50      0.42       352
weighted avg       0.53      0.73      0.61       352

F1 score - test :  0.0
ROC AUC - test:  0.5 


                ---------------------------------                             


 Classification report - train: 
               precision    recall  f1-score   support

           0       0.73      1.00      0.85      4907
           1       0.00      0.00      0.00      1773

    accuracy                           0.73      6680
   macro avg       0.37      0.50      0.42      6680
w


### Save the model to disk

In [12]:
filename = os.path.join(MODELS_PATH, "dummy_major_clf.sav")
with open(filename, "wb") as f:  # the context manager closes the file handle
    pickle.dump(dummy_major_clf, f)

#### Naive Bayes Model

The Gaussian Naive Bayes algorithm assumes that the features are independent of each other given the class and that each feature follows a Gaussian distribution.
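
To make the two assumptions concrete, here is a from-scratch sketch of the idea, not the scikit-learn implementation. It estimates a mean and variance per feature and per class, then sums per-feature Gaussian log-likelihoods, which is what the independence assumption allows. The toy data and function names are made up for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps every variance strictly positive (a choice of this sketch).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Pick the class with the highest log prior plus summed feature log-likelihoods."""

    def log_posterior(prior, means, variances):
        # Independence assumption: the joint likelihood is a product, so logs add up.
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )

    return max(params, key=lambda label: log_posterior(*params[label]))


# Toy data (made up): [tenure in months, monthly charges]; label 1 means churn.
X = [[2, 90.0], [5, 85.0], [3, 95.0], [40, 50.0], [60, 45.0], [55, 60.0]]
y = [1, 1, 1, 0, 0, 0]

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, [4, 88.0]))   # short tenure, high charges
print(predict_gaussian_nb(params, [50, 52.0]))  # long tenure, lower charges
```

When features are strongly correlated or far from bell-shaped, these assumptions are violated and the predicted probabilities tend to be less reliable, even if the ranking of customers remains usable.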

In [13]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
gnb_clf.fit(train_X, train_Y.values.ravel())

gnb_clf = prediction.churn_prediction(
    gnb_clf,
    train_X,
    test_X,
    train_Y,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=True,
    coefs_or_features=False,
)

-------------------------------------------------------------------------------
GaussianNB()
-------------------------------------------------------------------------------


 Classification report - test : 
               precision    recall  f1-score   support

           0       0.85      0.89      0.87       256
           1       0.66      0.58      0.62        96

    accuracy                           0.80       352
   macro avg       0.75      0.74      0.74       352
weighted avg       0.80      0.80      0.80       352

F1 score - test :  0.62
ROC AUC - test:  0.74 


                ---------------------------------                             


 Classification report - train: 
               precision    recall  f1-score   support

           0       0.85      0.87      0.86      4907
           1       0.62      0.58      0.60      1773

    accuracy                           0.79      6680
   macro avg       0.73      0.72      0.73      6680
weighted avg       0.79     

### Save the model to disk

In [14]:
filename = os.path.join(MODELS_PATH, "gnb_clf.sav")
with open(filename, "wb") as f:  # the context manager closes the file handle
    pickle.dump(gnb_clf, f)

#### Logistic Regression Model
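
The grid search in the cell that follows tries several `class_weight` settings, including `"balanced"`. As a rough guide to how that option compares with the manual weights, the sketch below applies the formula scikit-learn documents for `"balanced"`, `n_samples / (n_classes * class_count)`, to the training supports shown in this notebook's reports. The helper name is illustrative.

```python
# Training-set support counts from the classification reports in this notebook.
class_counts = {0: 4907, 1: 1773}


def balanced_class_weights(counts):
    """Weights following the "balanced" heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(class_counts)
ratio = weights[1] / weights[0]

print({label: round(w, 3) for label, w in weights.items()})
print(f"errors on churners weigh {ratio:.2f}x more than errors on non-churners")
```

With these counts, `"balanced"` behaves like a manual weighting between `{0: 1, 1: 2}` and `{0: 1, 1: 3}`, so the grid already brackets it on both sides.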

In [15]:
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression()

params_grid = {
    "penalty": ["l2"],
    "C": [0.1, 0.5, 1.0, 5, 10],
    "solver": ["liblinear", "newton-cg", "lbfgs"],
    "class_weight": [
        "balanced",
        None,
        {0: 1, 1: 1.5},
        {0: 1, 1: 2},
        {0: 1, 1: 3},
        {0: 1, 1: 5},
    ],
    "random_state": [0],
}

lr_clf = GridSearchCV(
    estimator=estimator, param_grid=params_grid, scoring="roc_auc", n_jobs=-1, cv=10
)

lr_clf.fit(train_X, train_Y.values.ravel())

lr_clf = lr_clf.best_estimator_

lr_clf = prediction.churn_prediction(
    lr_clf,
    train_X,
    test_X,
    train_Y,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=True,
    coefs_or_features=True,
)

-------------------------------------------------------------------------------
LogisticRegression(C=0.1, class_weight={0: 1, 1: 1.5}, random_state=0,
                   solver='newton-cg')
-------------------------------------------------------------------------------


 Classification report - test : 
               precision    recall  f1-score   support

           0       0.86      0.88      0.87       256
           1       0.67      0.62      0.65        96

    accuracy                           0.81       352
   macro avg       0.76      0.75      0.76       352
weighted avg       0.81      0.81      0.81       352

F1 score - test :  0.65
ROC AUC - test:  0.75 


                ---------------------------------                             


 Classification report - train: 
               precision    recall  f1-score   support

           0       0.86      0.86      0.86      4907
           1       0.61      0.62      0.61      1773

    accuracy                           

### Save the model to disk

In [16]:
filename = os.path.join(MODELS_PATH, "lr_clf.sav")
with open(filename, "wb") as f:  # the context manager closes the file handle
    pickle.dump(lr_clf, f)

#### SVM Classifier Linear Model

**This cell will take a few minutes to run!**
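
Much of the run time comes from the size of the search. The sketch below counts the fits implied by the SVC grid used in this notebook with 10-fold cross-validation. It also flags linear-kernel candidates that differ only in `gamma`, a parameter the linear kernel does not use. The variable names are illustrative.

```python
from itertools import product

# Same grid as the SVC search in this notebook.
params_grid = {
    "C": [0.1, 0.5, 1.0, 5],
    "kernel": ["linear", "rbf"],
    "gamma": ["auto", "scale"],
    "class_weight": ["balanced", None, {0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}],
    "probability": [True],
}
cv_folds = 10

names = list(params_grid)
candidates = [dict(zip(names, combo)) for combo in product(*params_grid.values())]

# gamma is ignored by the linear kernel, so linear candidates that differ
# only in gamma describe the same model.
redundant = [c for c in candidates if c["kernel"] == "linear" and c["gamma"] != "auto"]

n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
print(len(redundant), "candidates repeat a linear-kernel model already in the grid")
```

In addition, `probability=True` makes each `SVC` fit run an internal cross-validation to calibrate probabilities, which adds to the cost of every one of these fits.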

In [17]:
from sklearn.svm import SVC

estimator = SVC()

params_grid = {
    "C": [0.1, 0.5, 1.0, 5],
    "kernel": ["linear", "rbf"],
    "gamma": ["auto", "scale"],
    "class_weight": [
        "balanced",
        None,
        {0: 1, 1: 2},
        {0: 1, 1: 3},
        {0: 1, 1: 5},
    ],
    "probability": [True],
}

svc_clf = GridSearchCV(
    estimator=estimator, param_grid=params_grid, scoring="roc_auc", n_jobs=-1, cv=10
)

svc_clf.fit(train_X, train_Y.values.ravel())

svc_clf = svc_clf.best_estimator_

svc_clf = prediction.churn_prediction(
    svc_clf,
    train_X,
    test_X,
    train_Y,
    test_Y,
    train_X.columns,
    "coefficients",
    threshold_plot=False,
    coefs_or_features=True,
)

-------------------------------------------------------------------------------
SVC(C=0.1, class_weight='balanced', gamma='auto', kernel='linear',
    probability=True)
-------------------------------------------------------------------------------


 Classification report - test : 
               precision    recall  f1-score   support

           0       0.89      0.75      0.81       256
           1       0.53      0.76      0.62        96

    accuracy                           0.75       352
   macro avg       0.71      0.75      0.72       352
weighted avg       0.79      0.75      0.76       352

F1 score - test :  0.62
ROC AUC - test:  0.75 


                ---------------------------------                             


 Classification report - train: 
               precision    recall  f1-score   support

           0       0.90      0.75      0.82      4907
           1       0.53      0.78      0.63      1773

    accuracy                           0.76      6680
   mac

### Save the model to disk

In [18]:
filename = os.path.join(MODELS_PATH, "svc_clf.sav")
with open(filename, "wb") as f:  # the context manager closes the file handle
    pickle.dump(svc_clf, f)

**From the results above, we can see that the models are underfitting the data: the training and testing performance are nearly the same and both are quite low (the F1 score is below 0.70).
This is not surprising given that we only collected a small amount of data, corresponding to one month.
Both can therefore be improved by collecting more data.**
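
The comparison behind this conclusion can be tabulated directly. The sketch below uses the churn-class (label 1) F1 scores read from the classification reports above and checks both the train/test gap and the 0.70 threshold. The dictionary layout and variable names are illustrative.

```python
# F1 scores for the churn class (label 1), read from the classification reports above.
f1_scores = {
    "GaussianNB": {"train": 0.60, "test": 0.62},
    "LogisticRegression": {"train": 0.61, "test": 0.65},
    "SVC": {"train": 0.63, "test": 0.62},
}

F1_TARGET = 0.70  # threshold quoted in the conclusion

for name, s in f1_scores.items():
    gap = s["train"] - s["test"]
    below = max(s.values()) < F1_TARGET
    print(
        f"{name:<18} train={s['train']:.2f} test={s['test']:.2f} "
        f"gap={gap:+.2f} below_target={below}"
    )

# Largest absolute difference between training and testing F1 across the models.
max_gap = max(abs(s["train"] - s["test"]) for s in f1_scores.values())
all_below = all(max(s.values()) < F1_TARGET for s in f1_scores.values())
print(f"largest train/test gap: {max_gap:.2f}, all below target: {all_below}")
```

A small gap combined with low scores on both sets is the pattern described above: the models do not overfit, but they also do not capture enough signal.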