Forecasting the number of transactions a customer will make using the Beta-Geometric/Negative Binomial Distribution (BG/NBD), a Buy-Till-You-Die (BTYD) probabilistic model.
Trained a Beta-Geometric/Negative Binomial Distribution (BG/NBD) model on customer transactions from the public dataset of the e-commerce store Olist. The model describes how frequently customers make purchases while they are still "alive" and how likely a customer is to churn in any given time period.
Model Outcome
The trained model predicts the number of purchases with an RMSE of 0.144. It captures the purchase behaviour of the 99% of customers whose historical frequency is less than or equal to 4; fewer than 1% of customers in the dataset have more than 4 repeat purchases.
Model vs Actual - Cumulative Transactions and Daily Transactions
The dataset contains information on 100k orders made from 2016 to 2018 at multiple marketplaces in Brazil. The orders are split across 9 .csv files that follow a relational database schema.
For this study, I aggregated data from the following 3 of the 9 .csv files in the zip file.
- olist_customers_dataset.csv
- olist_orders_dataset.csv
- olist_payments_dataset.csv
Source : Olist Dataset
Table of Contents
- Experiment
- Package Introduction
- Pre-Processing
- Train BG/NBD Models
- Model Evaluation
- Model Interpretation
Model Selection
Experimented with the Pareto/NBD, Modified Beta-Geometric/NBD, and Beta-Geometric/NBD models. Of the three, BG/NBD was selected for further exploration because it trained faster and had a low prediction RMSE.
Calibration-Holdout Cut-off Selection
The year-month of the transactions dataset ranges from 2016-09 to 2018-08. Treating the calibration-holdout cut-off date as a hyperparameter, I experimented with different dates and selected 2017-01 to 2017-12 as the calibration period and 2018-01 to 2018-08 as the holdout period for model evaluation.
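As a rough illustration of the split described above, the sketch below divides one customer's order dates at a cut-off date and derives the calibration and holdout columns. The helper name `split_calibration_holdout`, the sample order dates, and the exact day-level boundaries (2017-12-31 and 2018-08-31) are assumptions made for illustration; this is not the code in src/preprocess, and it uses only the standard library.

```python
from datetime import date

def split_calibration_holdout(order_dates, calibration_end, observation_end):
    """Split one customer's order dates at the cut-off (hypothetical helper)."""
    # Distinct order days on or before the cut-off form the calibration history.
    calibration = sorted({d for d in order_dates if d <= calibration_end})
    # Distinct order days after the cut-off, up to the end of the data, are holdout.
    holdout = {d for d in order_dates if calibration_end < d <= observation_end}
    if not calibration:
        return None  # customer has no calibration history
    first, last = calibration[0], calibration[-1]
    return {
        "frequency_cal": len(calibration) - 1,         # repeat purchases only
        "recency_cal": (last - first).days,            # age at latest purchase
        "T_cal": (calibration_end - first).days,       # age at the cut-off
        "frequency_holdout": len(holdout),
        "duration_holdout": (observation_end - calibration_end).days,
    }

# Made-up customer with two calibration orders and one holdout order.
orders = [date(2017, 3, 10), date(2017, 6, 18), date(2018, 2, 1)]
row = split_calibration_holdout(orders, date(2017, 12, 31), date(2018, 8, 31))
print(row)
```

Moving `calibration_end` earlier or later changes how much history the model sees and how long the holdout window is, which is why the cut-off can be tuned like a hyperparameter.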
# Setting the working directory to the git clone path.
# A raw string keeps the backslashes in a Windows path literal.
mydir = r"Git\Clone\Path"
%cd $mydir
Module | Function | Description | Parameters | Yields | Returns |
---|---|---|---|---|---|
preprocess | make_dataset() | Pre-processes raw data | -- | transactions, summary, summary_cal_holdout, customer_mapping | transactions |
train | train_model() | Trains BG/NBD models on the summary and summary_cal_holdout datasets | -- | calibration_model.pkl, customer_lifetime_estimator.pkl, summary_cal_preds.csv | -- |
evaluation | -- | Utility functions for evaluation on calibration_holdout dataset | -- | -- | -- |
predict | number_of_purchases(), probability_alive() | Predictions using final fit model | -- | -- | -- |
# Basic Imports.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Local Imports.
from src.preprocess import make_dataset
from src.train import train_model
make_dataset() pre-processes the raw .csv files into the following datasets.
- Transactions data features
  - customer_unique_id - Customer ID
  - order_id - Order ID
  - order_purchase_timestamp - Timestamp when the order was placed
  - payment_value - Order payment value
  - payment_type - Method used to make the payment
  - year_month - Year-month from the order timestamp
  - order_date - Order date from the order timestamp
  - avg_inter_purchase_time - Average number of days between orders for repeat customers
- Summary Calibration and Holdout data features
  - frequency_cal - Number of repeat purchases: (total purchase count) - 1
  - recency_cal - Age of the customer at their latest purchase: days between first purchase and latest purchase
  - T_cal - Total age of the customer: days between first purchase and the end of the calibration period
  - frequency_holdout - Number of purchases after the threshold date
  - duration_holdout - Number of days in the holdout period
- Summary data features
  - frequency - Number of repeat purchases: (total purchase count) - 1
  - recency - Age of the customer at their latest purchase: days between first purchase and latest purchase
  - T - Total age of the customer: days between first purchase and the closing date of the dataset
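The summary features listed above can be sketched for several customers at once as follows. This is a minimal standard-library illustration, not the make_dataset() implementation: the function name `rfm_summary`, the toy transactions, and the closing date of 2018-08-31 are assumptions, and counting several orders on the same day as one purchase is also an assumption.

```python
from collections import defaultdict
from datetime import date

def rfm_summary(transactions, observation_end):
    """Build frequency / recency / T per customer from (customer_id, order_date) pairs."""
    dates_by_customer = defaultdict(set)
    for customer_id, order_date in transactions:
        # Assumption: several orders on the same day count as one purchase.
        dates_by_customer[customer_id].add(order_date)

    summary = {}
    for customer_id, dates in dates_by_customer.items():
        first, last = min(dates), max(dates)
        summary[customer_id] = {
            "frequency": len(dates) - 1,             # repeat purchases only
            "recency": (last - first).days,          # age at latest purchase
            "T": (observation_end - first).days,     # age at end of observation
        }
    return summary

# Toy data: customer "a" ordered three times, customer "b" once.
toy_transactions = [
    ("a", date(2018, 1, 1)),
    ("a", date(2018, 1, 31)),
    ("a", date(2018, 3, 2)),
    ("b", date(2018, 5, 8)),
]
toy_summary = rfm_summary(toy_transactions, observation_end=date(2018, 8, 31))
print(toy_summary)
```

A one-time buyer such as "b" ends up with frequency 0 and recency 0, which is the pattern visible in the first rows of the summary tables in this notebook.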
# Pre-Processing raw data.
transactions = make_dataset()
# Transactions dataset.
transactions.head(2)
| | customer_unique_id | order_id | order_purchase_timestamp | payment_value | payment_type | year_month | order_date | avg_inter_purchase_time |
---|---|---|---|---|---|---|---|---|
0 | f7b981e8a280e455ac3cbe0d5d171bd1 | ec7a019261fce44180373d45b442d78f | 2017-01-05 11:56:06 | 19.62 | credit_card | 2017-01 | 2017-01-05 | 0.0 |
1 | 83e7958a94bd7f74a9414d8782f87628 | b95a0a8bd30aece4e94e81f0591249d8 | 2017-01-05 12:01:20 | 19.62 | boleto | 2017-01 | 2017-01-05 | 0.0 |
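One plausible way to derive the avg_inter_purchase_time column shown above is to average the gaps between a customer's consecutive orders. The sketch below assumes that single-order customers receive 0.0, as the sample rows suggest; the function name `avg_inter_purchase_days` and the example dates are hypothetical and are not taken from src/preprocess.

```python
from datetime import date

def avg_inter_purchase_days(order_dates):
    """Mean gap in days between consecutive orders (hypothetical helper)."""
    ordered = sorted(order_dates)
    if len(ordered) < 2:
        return 0.0  # assumption: single-order customers get 0.0
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Made-up order dates with gaps of 34 and 10 days.
example_gap = avg_inter_purchase_days(
    [date(2017, 5, 15), date(2017, 6, 18), date(2017, 6, 28)]
)
print(example_gap)
```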
# Summary Calibration and Holdout dataset.
summary_cal_holdout = pd.read_csv("datasets/summary_cal_holdout.csv")
summary_cal_holdout.head(2)
| | frequency_cal | recency_cal | T_cal | frequency_holdout | duration_holdout |
---|---|---|---|---|---|
0 | 0.0 | 0.0 | 296.0 | 0.0 | 243.0 |
1 | 0.0 | 0.0 | 80.0 | 0.0 | 243.0 |
# Summary dataset for final Fit.
summary = pd.read_csv("datasets/summary.csv")
summary.head(2)
| | frequency | recency | T |
---|---|---|---|
0 | 0.0 | 0.0 | 113.0 |
1 | 0.0 | 0.0 | 116.0 |
train_model():
Trains two BG/NBD models: one on the summary_cal_holdout dataset for evaluation and another on the summary dataset as the final fit.
# Training Models.
train_model()
"""
Optimization terminated successfully.
Current function value: 0.070935
Iterations: 61
Function evaluations: 63
Gradient evaluations: 63
Optimization terminated successfully.
Current function value: 0.086931
Iterations: 62
Function evaluations: 63
Gradient evaluations: 63
"""
evaluation
- single_customer_evaluation() - Compares the model prediction to the ground truth for a customer sampled at random from the dataset.
- root_mean_squared_error() - Computes the root mean squared error of the model's frequency predictions against frequency_holdout.
- evaluation_plots() - 4 plots for model evaluation, selected with plot_type:
  - tracking - Tracking cumulative transactions and daily transactions.
  - repeated - Frequency of repeated purchases.
  - calibration_holdout - Calibration vs holdout repeated purchases.
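For reference, the RMSE metric named above can be written in a few lines. This is a generic standard-library sketch of the formula, not the body of root_mean_squared_error() in src/evaluation; the function name `rmse_sketch` and the toy numbers are assumptions.

```python
import math

def rmse_sketch(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy predicted vs observed holdout frequencies for four customers.
toy_rmse = rmse_sketch([0.0, 0.5, 1.0, 0.5], [0.0, 0.0, 1.0, 1.0])
print(f"RMSE: {toy_rmse:.4f}")
```

Because most customers in this dataset make no repeat purchase, both the predictions and the holdout frequencies sit close to zero for most rows, which tends to keep the RMSE small.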
# Evaluation utility functions.
from src.evaluation import single_customer_evaluation
from src.evaluation import root_mean_squared_error
from src.evaluation import evaluation_plots
# Evaluating the model's prediction for an individual customer.
frequency_predicted, frequency_holdout = single_customer_evaluation()
# Predicted vs Holdout.
print(f"SINGLE CUSTOMER PREDICTIONS:"
      f"\nPrediction:"
      f"\n {frequency_predicted}"
      f"\nGround Truth: "
      f"\n {frequency_holdout}")
"""
SINGLE CUSTOMER PREDICTIONS:
Prediction:
29881 0.008178
dtype: float64
Ground Truth:
29881 0.0
Name: frequency_holdout, dtype: float64
"""
# Overall Root Mean Squared Error of Predictions.
rmse = root_mean_squared_error()
print(f"RMSE: {rmse}")
"""
RMSE: 0.14444759935762416
"""
# Calibration vs Holdout Plot.
evaluation_plots(plot_type="calibration_holdout");
# Cumulative Transactions and Daily Transactions plot.
evaluation_plots(plot_type="tracking");
# Repeated Frequency of Transactions plot.
evaluation_plots(plot_type="repeated");
"""
C:\Program Files\Anaconda\lib\site-packages\lifetimes\generate_data.py:54: RuntimeWarning: divide by zero encountered in double_scalars
next_purchase_in = random.exponential(scale=1.0 / l)
"""
# Imports for Model Interpretation.
from lifetimes import BetaGeoFitter
from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix
# Loading the datasets used for training.
summary = pd.read_csv("datasets/summary.csv")
customer_id_mapping = pd.read_csv("datasets/customer_mapping.csv")
transactions = pd.read_csv("datasets/transactions.csv", parse_dates=["order_purchase_timestamp", "order_date"])
# Loading the trained customer lifetime estimator.
model = BetaGeoFitter()
model.load_model("models/customer_lifetime_estimator.pkl")
The 30-day forecast plot below shows the relation between frequency, recency, and the expected number of future purchases.
Frequency - Number of repeat purchases the customer has made.
Recency - Age of the customer at their last purchase, i.e. the number of days between first purchase and last purchase.
Best Customers
The model predicts that the best customers are the ones in the bottom right of the plot: customers with a historical recency of 400-600 days and a frequency of 10-15 are likely to make about 6 purchases in the next 30 days.
Coldest Customers
The customers in the top right, with a historical recency of 0-200 days and a frequency of 10-15, are likely to make almost no purchases.
# Frequency-Recency-Expected Number of Future Purchases.
plot_frequency_recency_matrix(model=model,
                              T=30,
                              max_frequency=None);
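The quantity behind the matrix above is the conditional expected number of purchases in the next t days, given a customer's frequency x, recency t_x, and age T. The sketch below evaluates the closed-form BG/NBD expression as I understand it from the BG/NBD literature, using a plain power series for the Gauss hypergeometric function. The parameter values r, alpha, a, and b are made up for illustration and are NOT the fitted parameters of this notebook's model, and `expected_purchases_sketch` is a hypothetical name rather than a lifetimes function.

```python
def hyp2f1_series(a, b, c, z, tol=1e-12, max_terms=10_000):
    """Gauss hypergeometric function 2F1 via its power series (needs |z| < 1)."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def expected_purchases_sketch(t, x, t_x, T, r, alpha, a, b):
    """Expected purchases in the next t time units under BG/NBD (assumes a != 1)."""
    z = t / (alpha + T + t)
    hyp = hyp2f1_series(r + x, b + x, a + b + x - 1, z)
    numerator = (a + b + x - 1) / (a - 1) * (
        1 - hyp * ((alpha + T) / (alpha + T + t)) ** (r + x)
    )
    denominator = 1.0
    if x > 0:
        # Extra term for repeat buyers: the chance they dropped out after t_x.
        denominator += a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return numerator / denominator

# Hypothetical parameters (illustration only, not this notebook's fitted values).
params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)
e30 = expected_purchases_sketch(30, x=2, t_x=125, T=220, **params)
e60 = expected_purchases_sketch(60, x=2, t_x=125, T=220, **params)
print(e30, e60)
```

With a fitted lifetimes model, model.predict() as used later in this notebook is the practical way to obtain these values; the sketch only makes the dependence on frequency, recency, and T explicit.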
This plot shows the relation between frequency, recency, and the probability that a customer is alive, where "alive" means the customer will place an order at some point in the future.
Interpretation
A customer who has made about 7 repeat purchases, the last of them about 200 days after their first purchase, has a probability of about 0.2 of coming back to make another purchase.
# Probability the customer is alive.
plot_probability_alive_matrix(model=model);
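The probability-alive surface plotted above has a simple closed form under BG/NBD, sketched here with the standard library. The parameter values are hypothetical and are not the fitted values of this notebook's model, and `probability_alive_sketch` is an illustrative name, not the function in src/predict.

```python
def probability_alive_sketch(x, t_x, T, r, alpha, a, b):
    """P(alive | x, t_x, T) under BG/NBD for one customer."""
    if x == 0:
        # Under BG/NBD a customer can only drop out right after a repeat
        # purchase, so a customer with no repeat purchases counts as alive.
        return 1.0
    ratio = a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + ratio)

# Hypothetical parameters (illustration only).
alive_params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)

# Same frequency, but one customer bought on the last observed day (t_x == T)
# while the other has been silent for 95 days.
recent = probability_alive_sketch(x=2, t_x=100, T=100, **alive_params)
lapsed = probability_alive_sketch(x=2, t_x=125, T=220, **alive_params)
print(recent, lapsed)
```

The longer the gap between the last purchase and the end of the observation period, the lower the probability, which is the gradient visible in the plot. The formula also returns 1.0 for every customer with frequency 0, so this metric is mainly informative for repeat customers.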
Finding insights from the historical transactions of the customers who are most likely to purchase in the next 30 days.
Insight Obtained
Customers who are most likely to make a purchase have historically paid by credit card, so we can infer that credit card is likely to be their payment method in the future too.
# Predictions.
frequency_predictions = model.predict(t=30,
                                      frequency=summary["frequency"],
                                      recency=summary["recency"],
                                      T=summary["T"])
summary["frequency_predictions"] = frequency_predictions.copy()
# Top 10 Likely to Purchase customers.
top_ten = summary.sort_values("frequency_predictions", ascending=False).head(10)
# Extracting IDs using Customer ID Mapping dataset.
top_ten_ids = customer_id_mapping.iloc[top_ten.index]
# List of ids.
top_ten_ids_list = list(top_ten_ids.customer_unique_id.values)
# Their Transactions.
historical_transactions_of_top_ten = transactions[transactions.customer_unique_id.isin(top_ten_ids_list)]
historical_transactions_of_top_ten.head(2)
| | customer_unique_id | order_id | order_purchase_timestamp | payment_value | payment_type | year_month | order_date | avg_inter_purchase_time |
---|---|---|---|---|---|---|---|---|
9401 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 5d848f3d93a493c1c8955e018240e7ca | 2017-05-15 23:30:03 | 22.77 | credit_card | 2017-05 | 2017-05-15 | 28.875 |
13399 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 369634708db140c5d2c4e365882c443a | 2017-06-18 22:56:48 | 51.75 | credit_card | 2017-06 | 2017-06-18 | 28.875 |
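# Payment methods used historically by the top 10 customers.
# Passing the column as x= avoids the positional-argument ambiguity in newer seaborn versions.
sns.countplot(x=historical_transactions_of_top_ten["payment_type"]);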
Insights Obtained
- Number of Purchases - The randomly sampled customer is not very likely to make a purchase in the next 30 days.
- Probability Alive - There is a very low probability that they will place an order anytime soon.
# Prediction Module for Local Package.
from src.predict import number_of_purchases, probability_alive
# Customers with repeated purchases.
repeated_customers = summary.loc[summary["frequency"] >= 2]
# Randomly sampling one customer.
random_sampled_customer = repeated_customers.sample()
random_sampled_customer
| | frequency | recency | T | frequency_predictions |
---|---|---|---|---|
79299 | 2.0 | 125.0 | 220.0 | 0.044719 |
# Predicting Number of Purchases 30 days in future.
number_of_purchases(historical_rfm_data=random_sampled_customer,
                    time_units=30)
"""
79299 0.044719
dtype: float64
"""
# Predicting Probability that they will be placing an order based on Historical Transactions.
probability_alive(historical_rfm_data=random_sampled_customer)
"""
array([0.22415287])
"""
END