Churn forecasting at Buffer.
Churn occurs when customers or subscribers stop doing business with a company or service. Predicting these events helps us learn more about our service and how customers benefit from it.
That said, there is no single correct way to do churn prediction. This repository contains our approach to churn prediction with machine learning!
To run and develop churnado, you'll need the following environment variables in a .env file:
REDSHIFT_ENDPOINT
REDSHIFT_DB_PORT
REDSHIFT_USER
REDSHIFT_PASSWORD
REDSHIFT_DB_NAME
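As a sketch of how these variables might be consumed, here is a small helper that assembles a Redshift connection URL from them. The helper name and the postgres-style URL scheme are assumptions, not part of the repo; real values would come from the .env file rather than the dummy dictionary below.

```python
import os

# Hypothetical helper: build a Redshift connection URL from the environment
# variables listed above (Redshift speaks the postgres wire protocol).
def redshift_url(env=os.environ):
    return "postgresql://{user}:{password}@{host}:{port}/{db}".format(
        user=env["REDSHIFT_USER"],
        password=env["REDSHIFT_PASSWORD"],
        host=env["REDSHIFT_ENDPOINT"],
        port=env["REDSHIFT_DB_PORT"],
        db=env["REDSHIFT_DB_NAME"],
    )

# Dummy values for illustration only -- real values live in the .env file.
demo = {
    "REDSHIFT_USER": "analyst",
    "REDSHIFT_PASSWORD": "secret",
    "REDSHIFT_ENDPOINT": "example.redshift.amazonaws.com",
    "REDSHIFT_DB_PORT": "5439",
    "REDSHIFT_DB_NAME": "buffer",
}
print(redshift_url(demo))
```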
The goal of Churnado is to predict whether or not a customer will cancel their subscription within a given time period. This makes it a binary classification problem (churn or not churn).
Once we feel confident in our binary classification model, we may move on to more complex models that try to predict the amount of time until a churn event. In that case, we are no longer dealing with a classification model, and it isn't a regression model either: predicting time-to-event is a survival analysis problem.
Initially, with the binary classification model, we will use the area under the receiver operating characteristic curve (AUC) as the success metric. We could use model accuracy (the number of users classified correctly divided by the total number of users) instead, but imbalanced classes make accuracy an insufficient measure of success: a model that predicts that nobody churns could still have an accuracy of over 90%.
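The accuracy pitfall can be shown with a toy example (the 90/10 class split below is made up for illustration): a model that always predicts "no churn" scores 90% accuracy but an AUC of only 0.5, no better than chance.

```python
# 9 non-churners and 1 churner; the "model" scores everyone identically.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    # Mann-Whitney formulation of AUC, with average ranks for tied scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [r for r, t in zip(ranks, y_true) if t == 1]
    n_pos, n_neg = len(pos), len(y_true) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_true = [0] * 9 + [1]        # imbalanced classes: 10% churn
constant_scores = [0.0] * 10  # "nobody churns"
print(accuracy(y_true, [0] * 10))    # 0.9
print(auc(y_true, constant_scores))  # 0.5
```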
The receiver operating characteristic (ROC) curve works like this: it plots sensitivity, the probability of predicting that a real positive will be a positive, against 1 - specificity, the probability of predicting a false positive. This curve represents every possible trade-off between sensitivity and specificity that is available for the classifier.
When the area under this curve is maximized, the false positive rate increases much more slowly than the true positive rate, meaning that we are accurately predicting positives (churns) without incorrectly labeling many negatives (non-churns).
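The curve itself is traced by sweeping a decision threshold over the model's scores; here is a minimal sketch (the churn probabilities below are made-up numbers, not model output):

```python
# Hypothetical churn scores for six users, sorted for readability.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per
    distinct threshold, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for thresh in sorted(set(scores), reverse=True):
        preds = [1 if s >= thresh else 0 for s in scores]
        tpr = sum(p and t for p, t in zip(preds, y_true)) / n_pos
        fpr = sum(p and not t for p, t in zip(preds, y_true)) / n_neg
        points.append((fpr, tpr))
    return points

for fpr, tpr in roc_points(y_true, scores):
    print(fpr, tpr)
```

Each point is one achievable sensitivity/specificity trade-off; plotting them and integrating gives the AUC.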
To evaluate our models, we will maintain a hold-out validation set that is not used to train the model. Notice that we will then have three separate datasets: a training set, a testing set, and a validation set.
The reason that we need the hold-out validation set is that information from the testing set "leaks" into the model each time we use the testing set to score our model's performance during training.
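A three-way split might look like the sketch below. The 60/20/20 proportions and the function name are assumptions for illustration, not a convention from this repo.

```python
import random

def three_way_split(records, seed=42):
    """Shuffle records and split them 60/20/20 into train, test, and
    hold-out validation sets. The seed keeps the split reproducible."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.6 * n)
    n_test = int(0.2 * n)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # never touched during training
    return train, test, validation

train, test, validation = three_way_split(list(range(100)))
print(len(train), len(test), len(validation))  # 60 20 20
```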
Our predictive models must beat the performance of two models:
- A "dumb" model that uses the average churn rate to randomly assign users a value of "churned" or "not churned".
- A simple logistic regression model.
Remember that these models must be out-performed on the hold-out validation set.
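The "dumb" baseline is simple enough to sketch directly (the 5% churn rate below is a placeholder, not our actual rate; the logistic regression baseline would come from a standard library and isn't shown here):

```python
import random

def dumb_baseline(n_users, churn_rate, seed=0):
    """Randomly label each user as churned (1) with probability equal to
    the observed average churn rate, otherwise not churned (0)."""
    rng = random.Random(seed)
    return [1 if rng.random() < churn_rate else 0 for _ in range(n_users)]

preds = dumb_baseline(n_users=1000, churn_rate=0.05)
print(sum(preds) / len(preds))  # roughly 0.05
```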
We define a customer as churned if they cancel their subscription. Our inputs consist of snapshot data (billing information) and time-series data (detailed usage information). We will use 8 weeks of snapshot and time-series data to build our feature sets, and we will try to predict whether or not a customer will churn in the next 4 weeks.
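The labeling scheme above can be sketched as follows. The cutoff date, the helper name, and the example dates are all hypothetical; only the 4-week prediction window comes from the definition above.

```python
from datetime import date, timedelta

def label_churn(cancel_date, cutoff):
    """Label a customer 1 (churned) if they cancelled within the 4 weeks
    after the feature-window cutoff, else 0. cancel_date is None for
    customers who are still active."""
    window_end = cutoff + timedelta(weeks=4)
    return int(cancel_date is not None and cutoff < cancel_date <= window_end)

cutoff = date(2018, 6, 1)  # features come from the 8 weeks before this date
print(label_churn(date(2018, 6, 15), cutoff))  # 1: cancelled in the window
print(label_churn(date(2018, 8, 1), cutoff))   # 0: cancelled after the window
print(label_churn(None, cutoff))               # 0: still active
```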