# Why Does It Churn, Smeagol?

## Overview

(Client Redacted) approached our company, The Classification Station, to ascertain whether we could build a model to accurately predict whether a customer would "soon" stop doing business with (Client Redacted). When a customer withdraws their business, this is known as "churning".

## Business Understanding

While a certain amount of churn is unavoidable, businesses strive to bring churn to as low a level as possible. In essence, they'd like to keep their customers if they can. So, our company was employed to determine the causes of churn and which of those causes had the most impact on (Client Redacted).

## Data Understanding

This public dataset is provided by the CrowdAnalytix community as part of their churn prediction competition. The real name of the telecom company is anonymized. It contains 20 predictor variables, mostly about customer usage patterns. There are 3333 records in this dataset, of which 483 customers are churners and the remaining 2850 are non-churners. Thus, the proportion of churners in this dataset is about 14.5%. Our first real steps toward understanding our data, beyond the metadata we have, are to run .head(), .describe() and .info() to see some basic statistics about our data.
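The class balance quoted above follows directly from the two counts. A quick arithmetic check, using only the numbers from the dataset description (the variable names are illustrative):

```python
# Counts taken from the dataset description above.
n_churn = 483
n_stay = 2850
n_total = n_churn + n_stay

# Share of customers who churned.
churn_rate = n_churn / n_total
print(f"{n_total} records, churn rate = {churn_rate:.1%}")
```

This prints a churn rate of 14.5%, which is why a model that ignores the minority class can still look accurate.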

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, mean_squared_error
from sklearn.dummy import DummyClassifier
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

In [30]:
cust_df = pd.read_csv("Data/bigml_59c28831336c6604c800002a.csv")

In [31]:
cust_df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


As we can see from .head(), we have 21 columns, including state, account length, a range of usage metrics, and a Boolean column that tells us whether the customer has churned.

In [32]:
cust_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

A little further exploration tells us that there are 3333 entries, which we knew from our data understanding and the metadata that came along with this dataset. Our data types are object, integer, float and a single Boolean column. We also appear to have no nulls in this dataset, which is great news.

In [33]:
cust_df.isna().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

As a quick verification, we can indeed see that there are no nulls in our dataset, which allows us to use all entries.

In [34]:
cust_df.describe()

| | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


A final "first step" in our exploration process is to look at some simple summary statistics for our dataset. In this case, they aren't particularly decisive; I doubt that a customer's minutes used affected their churn, although that is an assumption rather than something we have tested. We have the data nonetheless, and it helps us spot outliers and better understand our data overall.

## Data Preparation

Now that we've begun to grasp the contents and use of our data, we can start deciding what is and is not useful. We can also begin to think about which type of model might be best for the problem we've been presented, and curate our data for that purpose. Our first real step is to decide which columns will be useful for our purposes, and drop those that are unnecessary.

We have 21 available columns, and they are listed here, for a quick refresher.

In [35]:
cust_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

Of the available options, there are some that I judge unlikely to have any correlation with churn. Unnecessary columns (to my thinking) include "area code", "phone number", "voice mail plan", "number vmail messages", "total day minutes", "total day charge", "total eve minutes", "total eve charge", "total night minutes", "total night charge", "total intl minutes" and "total intl charge". I find these to have no relation to customer churn, so these columns will be dropped before we proceed to split our data for training and testing our model.
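Part of this decision can be checked rather than assumed. In the five rows printed by .head(), "total day charge" looks like a fixed multiple of "total day minutes", which would make one column of each pair redundant. A small sketch using only those five printed rows, so it suggests the pattern but does not prove it for the full dataset:

```python
import numpy as np

# Values copied from the five rows shown by cust_df.head() above.
day_minutes = np.array([265.1, 161.6, 243.4, 299.4, 166.7])
day_charge = np.array([45.07, 27.47, 41.38, 50.90, 28.34])

# If charge is billed at a flat per-minute rate, this ratio is constant.
rate = day_charge / day_minutes

# Pearson correlation between the two columns on these five rows.
corr = np.corrcoef(day_minutes, day_charge)[0, 1]

print(rate.round(3))
print(round(corr, 4))
```

Whether the minutes columns themselves relate to churn is a separate question. Comparing per-class means on the full data, for example with `cust_df.groupby("churn").mean(numeric_only=True)` before dropping anything, would be one way to test that assumption.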

In [36]:
cust_df = cust_df.drop(["area code", "phone number", "voice mail plan", "number vmail messages", "total day minutes", "total day charge", "total eve minutes", "total eve charge", "total night minutes", "total night charge", "total intl minutes", "total intl charge"], axis=1)

In [37]:
cust_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   state                   3333 non-null   object
 1   account length          3333 non-null   int64 
 2   international plan      3333 non-null   object
 3   total day calls         3333 non-null   int64 
 4   total eve calls         3333 non-null   int64 
 5   total night calls       3333 non-null   int64 
 6   total intl calls        3333 non-null   int64 
 7   customer service calls  3333 non-null   int64 
 8   churn                   3333 non-null   bool  
dtypes: bool(1), int64(6), object(2)
memory usage: 211.7+ KB


As we can see, the remaining columns are "state", "account length", "international plan", "total day/eve/night/intl calls", "customer service calls" and "churn". These seem important because the state someone lives in may affect their service and thus their decision to churn. Their account length may also have an impact, as may whether or not their plan works (or is required to work) internationally. The amount of usage, as seen in the total calls metrics, can have an impact on churn; perhaps the customer simply isn't using the service enough to warrant keeping it. Finally, customer service calls may indicate that the service is not working well, which itself may lead to churn.

Our first step before any further transformations or analysis is to separate our data into a "testing" and a "training" set. This means that as we build our model, it will only have access to the training data, and during evaluation, it will have access to the testing data. Never the twain shall meet, or it will result in data leakage, which is to be avoided at all costs.

In [38]:
X = cust_df.drop('churn', axis=1)
y = cust_df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)

The code above performs the train/test split as described. A test size of .25 means that 75% of the data is used for training the model and 25% for testing it. We've also set a random state of 42, an arbitrary seed that makes the shuffle reproducible: rerunning the split gives the same rows in each set. Note that a fixed seed does not by itself guarantee that both sets are representative of the data; because we did not pass stratify, the churn rate can differ slightly between the training and test sets.
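To make the role of random_state concrete, here is a minimal sketch on stand-in data. The arrays are toy placeholders that only reproduce the 483/2850 class counts, and the stratify argument is an optional extra that the split above does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 3333 rows with the same 483/2850 class counts.
y = np.array([True] * 483 + [False] * 2850)
X = np.arange(len(y)).reshape(-1, 1)

# Same random_state -> identical split on every run.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
same_split = np.array_equal(a_test, b_test)

# stratify=y keeps the churn rate nearly equal in the training and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(same_split, round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With a minority class this small, stratifying is a cheap way to keep the test set's churn rate close to the overall rate.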

Now it is time to model our data, though we still have some preprocessing steps. We can wrap all of that up into a pipeline! Our pipeline serves several purposes.

First, it uses a method called SMOTE to help deal with the class imbalance we have. We chose not to undersample the majority class, as that leaves data on the table, so we oversample the minority instead. In this case, SMOTE generates synthetic churned customers so that the roughly 14% minority class is as well represented as the majority in the training data. Because we use the imblearn Pipeline, the sampler is applied only during fitting, so the test data is never resampled.

Second, the pipeline uses StandardScaler() to put our numeric columns on a common scale. There will be far fewer "customer service calls" than "total day calls", for example, and without scaling the larger-valued column could carry more weight in the model simply because of its magnitude.

Third, the pipeline uses OneHotEncoder() to turn "state" and "international plan" into binary indicator columns, one per category, so that these features can be modeled against our target. In the pipeline itself, the scaling and encoding run first and SMOTE is applied to their output.
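The idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation step only, not imblearn's implementation; the function name `smote_like` and the toy points are hypothetical:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority rows; exclude each row from its own neighbours.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # random minority row
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one of its k nearest neighbours
    gap = rng.random((n_new, 1))                             # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Toy minority class: 10 points in 2 dimensions.
X_min = np.random.default_rng(1).normal(size=(10, 2))
synthetic = smote_like(X_min, n_new=40)
print(synthetic.shape)
```

Each synthetic row lies on the line segment between a real minority row and one of its nearest minority neighbours, which is why the new points stay inside the region the minority class already occupies.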

Our target, "churn", is already a Boolean value and thus does not need to be encoded; doing so would be redundant. In any case, once our pipeline is properly set up and fitted, we will have a logistic regression model, and we can proceed with tuning that model and deciding whether logistic regression is the best type of model for this question. We will also be exploring a decision tree model.

In [39]:
# Columns to scale vs. columns to one-hot encode.
numeric_features = ['account length', 'total day calls', 'total eve calls', 'total night calls', 'total intl calls', 'customer service calls']
categorical_features = ['state', 'international plan']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder()

# Apply each transformer only to its own group of columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Preprocess, then oversample the minority class, then fit the classifier.
steps = [
    ('preprocessor', preprocessor),
    ('sampler', SMOTE()),
    ('classifier', LogisticRegression())
]

# imblearn's Pipeline (imported above) accepts a sampler as a step.
log_pipeline = Pipeline(steps)

log_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['account length',
                                                   'total day calls',
                                                   'total eve calls',
                                                   'total night calls',
                                                   'total intl calls',
                                                   'customer service calls']),
                                                 ('cat', OneHotEncoder(),
                                                  ['state',
                                                   'international plan'])])),
                ('sampler', SMOTE()), ('classifier', LogisticRegression())])

Now we have a logistic regression model that has been fitted to our training data. Yay! There are four important metrics that we care about when evaluating a classification model such as this one: accuracy, precision, recall, and the F1 score. Each has its own definition and reason for importance depending on context, but the one we care most about in this context is the accuracy score, and potentially the precision score. These two scores are defined as follows:

Accuracy is the proportion of correctly predicted labels. Or, how many did our model get correct? Considering we were asked to build and tune a model to accurately predict whether a customer would soon churn, that's a very important metric! We also might care about the precision score, which is the proportion of true positive predictions out of all positive predictions. This metric may matter because our model is attempting to predict that something will happen, that is, a positive result. Therefore, we care about how well it predicts that positive result.
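As a worked example of these definitions, the sketch below computes all four metrics by hand from made-up labels for ten customers and compares them with scikit-learn's functions. The labels are purely illustrative and are not drawn from the dataset:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for ten customers: True means "churned".
y_true = [True, True, True, False, False, False, False, False, False, False]
y_hat = [True, True, False, True, True, False, False, False, False, False]

tp = sum(t and p for t, p in zip(y_true, y_hat))              # churners correctly flagged
fp = sum((not t) and p for t, p in zip(y_true, y_hat))        # loyal customers flagged
fn = sum(t and (not p) for t, p in zip(y_true, y_hat))        # churners missed
tn = sum((not t) and (not p) for t, p in zip(y_true, y_hat))  # loyal customers left alone

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat))
```

Here the model is right for 7 of 10 customers (accuracy 0.7), but only 2 of the 4 customers it flags actually churned (precision 0.5), which shows how the two metrics can tell different stories.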

These can be computed very easily, as seen here:

In [45]:
log_acc = log_pipeline.score(X_train, y_train)

In [48]:
log_acc

0.7342937174869948

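As we can see, our logistic regression model has a 73% accuracy rate on the training data, meaning that it labeled about 73% of the training customers correctly. Note that this score was computed on X_train, so it is not yet an estimate of performance on unseen customers. For context, with roughly 14.5% churners in the dataset, a model that always predicted "no churn" would be right about 85% of the time, so 73% is not especially good. Now let's check out the precision score.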
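Accuracy on an imbalanced dataset is easiest to read against a majority-class baseline. The sketch below uses DummyClassifier with labels rebuilt from the class counts in the dataset description rather than from the actual training split, so the exact baseline for X_train may differ slightly:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts in the dataset description.
y = np.array([True] * 483 + [False] * 2850)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class ("no churn").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print(round(baseline_acc, 3))
```

A model trained with SMOTE gives up some accuracy in exchange for catching more churners, so falling below this baseline on accuracy alone is not surprising; it is one reason to look at precision and recall as well.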

When we created our pipeline, we fitted it to the training data, which produced our logistic regression model. Now, we need to make predictions on the test data to obtain our precision score.

In [53]:
y_pred = log_pipeline.predict(X_test)

In [54]:
log_pre = precision_score(y_test, y_pred)

In [55]:
log_pre

0.2788104089219331

As we can see, our precision score is awful: only about 28% of the test-set customers our model flags as churners actually churned.