### Classification: Customer Churn

This project builds a machine learning classification model to predict customer churn. The dataset comes from [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) and describes the customers of a fictional telco company. Each row represents a customer, and each column holds a customer attribute; the attributes are a mix of numeric and categorical types. Customers who left within the last month are labeled as churned.

Objectives:
1. Build a classifier model to predict customer churn and compare different estimators (Logistic Regression, Random Forest, Naive Bayes).
2. Implement cross-validation to reduce the risk of overfitting.
3. Implement methods to deal with class imbalance.

In [106]:
from pathlib import Path
import pandas as pd

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

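import importlib  # used to reload `common` after it is edited
import common     # local module holding the workflow functions used below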

In [107]:
df = pd.read_csv(Path('WA_Fn-UseC_-Telco-Customer-Churn.csv'))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [108]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [109]:
# Convert datatypes
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Drop NaNs
df = df.dropna()
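
The conversion matters because `TotalCharges` is read from the CSV as text, and `errors='coerce'` turns any entry that cannot be parsed into NaN, which `dropna()` then removes (the row count goes from 7043 in `df.info()` to 7032 after the split below). A minimal sketch on a made-up three-row frame; the blank string is an illustrative assumption about what an unparseable entry might look like, not a statement about the raw file:

```python
import pandas as pd

# Toy frame: numbers stored as strings, one of them blank (assumed for illustration)
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors='coerce' replaces unparseable entries with NaN instead of raising
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# dropna() then removes the rows that failed to parse
clean = toy.dropna()
print(n_missing, len(clean))
```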

We'll begin with a few features, get the workflow and model working, then add more features to see if the model performance improves.

In [110]:
features = ['tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract',
            'PaymentMethod', 'MonthlyCharges', 'TotalCharges']

X = df[features]

In [111]:
# Get labels column
y = LabelEncoder().fit_transform(df['Churn'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Complete dataset:", X.shape)
print("Train dataset:", X_train.shape)
print("Test dataset:", X_test.shape)

Complete dataset: (7032, 8)
Train dataset: (5625, 8)
Test dataset: (1407, 8)


In [112]:
# Class balance check
y.mean()

0.26578498293515357

In this dataset only about 27% of customers churn, which means the classes are imbalanced. We'll deal with this later. For now, let's get the initial model built.
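
One practical consequence of the imbalance is that accuracy alone is a weak yardstick: a model that always predicts "no churn" would already be right roughly 73% of the time while catching no churners at all. The sketch below illustrates this with synthetic labels drawn at about the same churn rate; `y_toy` and `X_toy` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly the churn rate reported above (about 27% positives)
rng = np.random.default_rng(0)
y_toy = (rng.random(10_000) < 0.27).astype(int)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the majority class ("no churn")
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_hat)
baseline_recall = recall_score(y_toy, y_hat)
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

This is why the evaluation below also reports precision, recall, and the area under the precision-recall curve.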

In [113]:
# Last datatype check
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5625 entries, 6030 to 862
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tenure           5625 non-null   int64  
 1   PhoneService     5625 non-null   object 
 2   MultipleLines    5625 non-null   object 
 3   InternetService  5625 non-null   object 
 4   Contract         5625 non-null   object 
 5   PaymentMethod    5625 non-null   object 
 6   MonthlyCharges   5625 non-null   float64
 7   TotalCharges     5625 non-null   float64
dtypes: float64(2), int64(1), object(5)
memory usage: 395.5+ KB


In [114]:
# Build model pipeline
num_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

cat_features = [f for f in features if f not in num_features]

ohe = ColumnTransformer([
      ('ohe_features', OneHotEncoder(), cat_features),
      ('scaled_num', StandardScaler(), num_features)
      ])

lr_pipe = Pipeline([('ohe', ohe),
                    ('lr', LogisticRegression())])

lr_pipe.fit(X_train, y_train);

In [115]:
def model_evaluation(model, X, y, label='Test'):
    """Compute, print, and return the model's score metrics on (X, y)."""
    y_pred = model.predict(X)
    scores = {}
    scores['accuracy'] = round(metrics.accuracy_score(y, y_pred), 4)
    scores['precision'] = round(metrics.precision_score(y, y_pred), 4)
    scores['recall'] = round(metrics.recall_score(y, y_pred), 4)
    # Probability of the positive (churn) class
    probs = model.predict_proba(X)[:, 1]
    precisions, recalls, _ = metrics.precision_recall_curve(y, probs)
    scores['area under precision-recall curve'] = round(metrics.auc(recalls, precisions), 4)

    # Print scores, prefixed with the name of the data split
    for metric, score in scores.items():
        print(f'{label} {metric}: {score}')

    return scores
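
For context on the last metric: `metrics.auc(recalls, precisions)` integrates the precision-recall curve with the trapezoidal rule, which interpolates linearly between points. scikit-learn also provides `average_precision_score`, a step-wise summary of the same curve, and the two can differ noticeably on small samples. A minimal sketch on hand-made labels and scores (illustrative values only):

```python
from sklearn import metrics

# Tiny hand-made example (labels and scores are illustrative only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precisions, recalls, _ = metrics.precision_recall_curve(y_true, y_score)

# Trapezoidal area under the curve, as computed in model_evaluation
trapezoid_area = metrics.auc(recalls, precisions)
# Step-wise summary that avoids linear interpolation between points
average_precision = metrics.average_precision_score(y_true, y_score)

print(round(trapezoid_area, 4), round(average_precision, 4))
```

Either summary works for comparing models, as long as the same one is used throughout.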

In [116]:
model_evaluation(lr_pipe, X_test, y_test);
print(f"Model weights:\n{lr_pipe.named_steps['lr'].coef_}")

Test accuracy: 0.7861
Test precision: 0.6174
Test recall: 0.5134
Test area under precision-recall curve: 0.6424
Model weights:
[[ 0.20818339 -0.20850892 -0.23954177  0.20818339  0.03103285 -0.18456465
   0.7733577  -0.58911859  0.84270159 -0.05488144 -0.78814568 -0.02675818
  -0.16698435  0.34349142 -0.15007443 -1.47039759  0.13048608  0.68277358]]


Now that the model is working, let's wrap everything into functions to facilitate model iterations and maintenance. We want to be able to start from the raw csv (so that hypothetically newly recorded data in the future can be used), process it, extract features of interest, and run the classifier model in an automated workflow to minimize errors.
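
The `common` module itself is not shown in this notebook, so the following is only a rough sketch of what such a workflow could look like, not the actual implementation. The call shape `run_model(path, features, estimator, param_grid)` and the `est` step name are taken from how the function is used in the cells below. Everything else is an assumption: the helper names `load_data`, `build_pipeline`, and `evaluate`, the dtype-based column split, the 3-fold grid search, and forcing dense output so that estimators such as `GaussianNB` accept the features.

```python
from pathlib import Path

import pandas as pd
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def load_data(path, features, target="Churn"):
    """Hypothetical helper: read the raw CSV, fix dtypes, drop unparseable rows."""
    df = pd.read_csv(Path(path))
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df = df.dropna()
    y = (df[target] == "Yes").astype(int)  # 1 = churned
    return df[features], y


def build_pipeline(X, estimator):
    """Hypothetical helper: one-hot encode text columns, scale numeric ones."""
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]
    pre = ColumnTransformer([
        ("ohe_features", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("scaled_num", StandardScaler(), num_cols),
    ], sparse_threshold=0)  # always return a dense array
    return Pipeline([("pre", pre), ("est", estimator)])


def evaluate(model, X, y, label):
    """Hypothetical helper: print and return the metrics used in this notebook."""
    y_pred = model.predict(X)
    probs = model.predict_proba(X)[:, 1]
    precisions, recalls, _ = metrics.precision_recall_curve(y, probs)
    scores = {
        "accuracy": metrics.accuracy_score(y, y_pred),
        "precision": metrics.precision_score(y, y_pred, zero_division=0),
        "recall": metrics.recall_score(y, y_pred),
        "area under precision-recall curve": metrics.auc(recalls, precisions),
    }
    for name, value in scores.items():
        print(f"{label} {name}: {round(value, 4)}")
    return scores


def run_model(path, features, estimator, param_grid=None):
    """Sketch of the end-to-end workflow: load, split, fit, evaluate."""
    X, y = load_data(path, features)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = build_pipeline(X_train, estimator)
    if param_grid is not None:
        # Assumed: 3-fold cross-validated grid search over the pipeline
        model = GridSearchCV(model, param_grid, cv=3, verbose=1)
    model.fit(X_train, y_train)
    evaluate(model, X_test, y_test, "Test")
    evaluate(model, X_train, y_train, "Train")
    return model
```

Keeping the estimator as an argument is what makes it cheap to swap in other classifiers without touching the preprocessing.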

In [117]:
importlib.reload(common) # Reload necessary for notebooks if import was updated
path = Path('WA_Fn-UseC_-Telco-Customer-Churn.csv')
features = ['tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract',
            'PaymentMethod', 'MonthlyCharges', 'TotalCharges']
common.run_model(path, features, LogisticRegression());

Test accuracy: 0.7861
Test precision: 0.6174
Test recall: 0.5134
Test area under precision-recall curve: 0.6424
Train accuracy: 0.8014
Train precision: 0.6477
Train recall: 0.5545
Train area under precision-recall curve: 0.6442


In [118]:
# All features included
importlib.reload(common)
path = Path('WA_Fn-UseC_-Telco-Customer-Churn.csv')
features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 
            'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
            'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
            'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges']
common.run_model(path, features, LogisticRegression());

Test accuracy: 0.7882
Test precision: 0.6226
Test recall: 0.516
Test area under precision-recall curve: 0.6294
Train accuracy: 0.8089
Train precision: 0.6656
Train recall: 0.5645
Train area under precision-recall curve: 0.6668


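Including all features does not appear to improve model performance: test accuracy is essentially unchanged (0.7882 vs. 0.7861) and the test area under the precision-recall curve is slightly lower (0.6294 vs. 0.6424). Next, let's try a Random Forest classifier.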

In [119]:
importlib.reload(common) # Reload necessary for notebooks if import was updated
path = Path('WA_Fn-UseC_-Telco-Customer-Churn.csv')
features = ['tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract',
            'PaymentMethod', 'MonthlyCharges', 'TotalCharges']
common.run_model(path, features, 
        RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42));

Test accuracy: 0.7719
Test precision: 0.5852
Test recall: 0.4866
Test area under precision-recall curve: 0.628
Train accuracy: 0.8612
Train precision: 0.7725
Train recall: 0.6769
Train area under precision-recall curve: 0.8475


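The more flexible Random Forest classifier appears to be overfitting: the area under the precision-recall curve is 0.8475 on the training set but only 0.628 on the test set. Let's tune its hyperparameters with a cross-validated grid search to reduce that.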

In [120]:
importlib.reload(common)
path = Path('WA_Fn-UseC_-Telco-Customer-Churn.csv')
features = ['tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract',
            'PaymentMethod', 'MonthlyCharges', 'TotalCharges']

param_grid = {'est__max_depth': range(4, 10),
              'est__min_samples_split': [2, 3, 4],
              'est__min_samples_leaf': [1, 2, 3]}

model = common.run_model(path, features, 
    RandomForestClassifier(n_estimators=100, random_state=42), param_grid);

print(f'Best parameters from CV: {model.best_params_}')

Fitting 3 folds for each of 54 candidates, totalling 162 fits
Test accuracy: 0.7868
Test precision: 0.637
Test recall: 0.4599
Test area under precision-recall curve: 0.6533
Train accuracy: 0.814
Train precision: 0.692
Train recall: 0.5411
Train area under precision-recall curve: 0.6992
Best parameters from CV: {'est__max_depth': 6, 'est__min_samples_leaf': 1, 'est__min_samples_split': 4}
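
The "54 candidates" and "162 fits" in the log follow directly from the size of the parameter grid: 6 depths, 3 split sizes, and 3 leaf sizes, each combination fitted once per fold. A quick check with scikit-learn's `ParameterGrid`, where the fold count is read off the log line rather than from the `common` source:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'est__max_depth': range(4, 10),
              'est__min_samples_split': [2, 3, 4],
              'est__min_samples_leaf': [1, 2, 3]}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))
n_folds = 3  # taken from the "Fitting 3 folds" log line
print(n_candidates, n_candidates * n_folds)  # 54 162
```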


The cross-validated grid search narrowed the gap between train and test scores (area under the precision-recall curve of 0.6992 vs. 0.6533), but test performance is only marginally better than the logistic regression baseline (0.6533 vs. 0.6424). The remaining lever is to try different feature subsets or additional feature engineering. We'll skip that exercise for now and instead look at one more estimator, then at class imbalance.

Let's also try a Naive Bayes model. While it's not expected to perform better, a benefit of the function-based workflow is that trying alternative estimators is quick and easy.

In [121]:
importlib.reload(common)
path = Path('WA_Fn-UseC_-Telco-Customer-Churn.csv')
features = ['tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'Contract',
            'PaymentMethod', 'MonthlyCharges', 'TotalCharges']
common.run_model(path, features, GaussianNB());

Test accuracy: 0.7079
Test precision: 0.4703
Test recall: 0.7834
Test area under precision-recall curve: 0.6093
Train accuracy: 0.7264
Train precision: 0.4908
Train recall: 0.7886
Train area under precision-recall curve: 0.6149
