Churnado

Churn forecasting at Buffer.


Introduction

Churn occurs when customers or subscribers stop doing business with a company or service. Predicting churn events helps us learn more about our service and how customers benefit from it.

That said, there is no single correct way to do churn prediction. This repository contains our approach to churn prediction with machine learning!

Requirements

To run and develop churnado you'll need the following environment variables in a .env file.

  • REDSHIFT_ENDPOINT
  • REDSHIFT_DB_PORT
  • REDSHIFT_USER
  • REDSHIFT_PASSWORD
  • REDSHIFT_DB_NAME
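
As a minimal sketch of how these variables might be consumed -- assuming Python with psycopg2 and python-dotenv, neither of which this repository prescribes -- a Redshift connection could be opened like so:

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # read the variables from the .env file into the environment

conn = psycopg2.connect(
    host=os.environ["REDSHIFT_ENDPOINT"],
    port=os.environ["REDSHIFT_DB_PORT"],
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
    dbname=os.environ["REDSHIFT_DB_NAME"],
)
```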

Defining the Problem

The goal of Churnado is to predict whether or not a customer will cancel their subscription within a given time period. This makes it a binary classification problem (churn or no churn).

Once we feel confident in our binary classification model, we may move on to more complex models that try to predict the amount of time until a churn event. In that case we are no longer dealing with a classification model, and it isn't a regression model either: predicting time-to-event is the domain of survival analysis.
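
Time-to-churn modeling is out of scope for now, but as an illustrative sketch under stated assumptions (the lifelines library and made-up per-user data, neither of which comes from this repo), a Cox proportional hazards model could look like:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical per-user data: how long each subscription lasted (weeks),
# whether a churn event was observed, and a single usage feature
df = pd.DataFrame({
    "weeks_subscribed": [10, 24, 7, 52, 30, 16, 44, 5],
    "churned":          [1, 0, 1, 0, 1, 1, 0, 1],
    "updates_per_week": [2.0, 9.5, 0.5, 12.0, 3.0, 1.5, 8.0, 0.2],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks_subscribed", event_col="churned")
cph.print_summary()  # hazard ratio for the usage feature
```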

The Measure of Success

Initially, with the binary classification model, we will use the area under the receiver operating characteristic curve (AUC) as the success metric. We could use model accuracy (the number of users classified correctly divided by the total number of users) as the success metric, however our imbalanced classes make accuracy an insufficient measure of success -- we could assume that nobody churns and still have an accuracy of over 90%.

The receiver operating characteristic (ROC) curve works like this: it plots sensitivity, the probability of predicting that a real positive will be a positive, against 1 - specificity, the probability of predicting that a real negative will be a positive (the false positive rate). This curve represents every possible trade-off between sensitivity and specificity that is available for this classifier.

[Figure: ROC curve]

When the area under this curve is maximized, the false positive rate increases much more slowly than the true positive rate, meaning that we accurately predict positives (churns) without incorrectly labeling too many negatives (non-churns).
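
As a concrete illustration (scikit-learn is an assumption here, not a dependency this README specifies), AUC can be computed from a model's predicted churn probabilities like this:

```python
from sklearn.metrics import roc_auc_score

# True labels: 1 = churned, 0 = did not churn
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical churn probabilities produced by a model
y_scores = [0.10, 0.30, 0.20, 0.80, 0.05, 0.35, 0.40, 0.15, 0.70, 0.25]

# 20 of the 21 (churner, non-churner) pairs are ranked correctly: AUC = 20/21
print(roc_auc_score(y_true, y_scores))  # 1.0 = perfect ranking, 0.5 = random
```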

Model Evaluation

To evaluate our models, we will maintain a hold-out validation set that is not used to train the model. Notice that we will then have three separate datasets: a training set, a testing set, and a validation set.

The reason that we need the hold-out validation set is that information from the testing set "leaks" into the model each time we use the testing set to score our model's performance during training.
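A minimal sketch of such a three-way split, assuming scikit-learn and a synthetic stand-in for the real churn data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and churn labels
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)

# Carve off 60% for training
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# Split the remainder into a testing set (used to score models during
# development) and a hold-out validation set (touched only at the very end)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
```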

Determining a Baseline

Our predictive models must beat the performance of two models:

  • A "dumb" model that uses the average churn rate to randomly assign users a value of "churned" or "not churned".
  • A simple logistic regression model.

Remember that these models must be outperformed on the hold-out validation set.
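
The sketch below (again assuming scikit-learn and synthetic data) scores both baselines by AUC on a hold-out set; DummyClassifier's "stratified" strategy makes random predictions that respect the observed churn rate:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real churn data (~10% positive class)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Baseline 1: random assignment based on the average churn rate
dumb = DummyClassifier(strategy="stratified", random_state=42)

# Baseline 2: a simple logistic regression
logit = LogisticRegression(max_iter=1000)

for name, model in [("dumb", dumb), ("logistic regression", logit)]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name} AUC: {auc:.3f}")
```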

Defining Inputs and Outputs

We define a customer as churned if they cancel their subscription. Our inputs will consist of snapshot data (billing info) and time series data (detailed usage info). We will use 8 weeks of snapshot and time series data to build our feature sets. We will try to predict whether or not a customer will churn in the next 4 weeks.
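
To make the windowing concrete, here is a sketch in pandas with hypothetical column names, dates, and a single usage metric (none of which come from this repo): it aggregates an 8-week usage window into features and labels churn within the following 4 weeks.

```python
import pandas as pd

# Hypothetical weekly usage log: one row per user per week
usage = pd.DataFrame({
    "user_id":      [1, 1, 2, 2],
    "week":         pd.to_datetime(["2018-01-01", "2018-01-08",
                                    "2018-01-01", "2018-01-08"]),
    "updates_sent": [12, 7, 0, 3],
})

# Hypothetical cancellation dates (user 2 never cancelled)
cancelled_at = {1: pd.Timestamp("2018-01-20")}

cutoff = pd.Timestamp("2018-01-15")  # end of the 8-week observation window

# Features: aggregate usage from the 8 weeks before the cutoff
window = usage[(usage["week"] >= cutoff - pd.Timedelta(weeks=8))
               & (usage["week"] < cutoff)]
features = window.groupby("user_id")["updates_sent"].agg(["sum", "mean"])

# Label: 1 if the user cancelled within the 4 weeks after the cutoff
features["churned"] = [
    int(uid in cancelled_at
        and cancelled_at[uid] < cutoff + pd.Timedelta(weeks=4))
    for uid in features.index
]
print(features)
```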