Customer Lifetime Value.

Forecasting the number of transactions a customer will make using the Beta-Geometric/Negative Binomial Distribution (BG/NBD), a Buy-Till-You-Die (BTYD) probabilistic model.


Trained a Beta-Geometric/Negative Binomial Distribution (BG/NBD) model on customer transactions from the public dataset of the e-commerce store Olist. The model describes how frequently customers make purchases while they are still "alive" and how likely a customer is to churn after each purchase.
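To make the two ingredients of the model concrete, here is a minimal simulation sketch of one customer under the standard BG/NBD assumptions: purchases arrive at a customer-specific rate drawn from a Gamma distribution, and after each repeat purchase the customer churns with a customer-specific probability drawn from a Beta distribution. The function name and the parameter values (r, alpha, a, b) are hypothetical and are not the fitted values from this project.

```python
import random


def simulate_customer(r, alpha, a, b, horizon_days, rng):
    """Simulate repeat-purchase times (in days) for one customer.

    Time 0 is the customer's first purchase. All parameter values passed
    in are illustrative assumptions, not fitted parameters.
    """
    # Purchase rate: lambda ~ Gamma(shape=r, rate=alpha).
    # random.gammavariate expects a scale, hence 1 / alpha.
    lam = rng.gammavariate(r, 1.0 / alpha)
    # Churn probability after each purchase: p ~ Beta(a, b).
    p = rng.betavariate(a, b)

    purchase_times = []
    t = 0.0
    alive = True
    while alive:
        # While alive, the waiting time to the next purchase is exponential.
        t += rng.expovariate(lam)
        if t > horizon_days:
            break
        purchase_times.append(t)
        # After every repeat purchase the customer churns with probability p.
        alive = rng.random() >= p
    return purchase_times


rng = random.Random(42)
times = simulate_customer(r=1.0, alpha=10.0, a=1.0, b=3.0, horizon_days=365, rng=rng)
print(len(times), [round(t, 1) for t in times])
```

Running the simulation for many customers and counting their repeat purchases gives the kind of frequency distribution that the fitted model is compared against.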

Model Outcome

The trained model predicts the number of purchases with an RMSE of 0.144. It captures the historical transactions of the 99% of customers whose frequency is 4 or less; fewer than 1% of customers in the dataset have more than 4 repeat purchases.


Figure: Model vs Actual - Cumulative Transactions and Daily Transactions

About Olist Dataset

The dataset contains information on 100k orders made from 2016 to 2018 at multiple marketplaces in Brazil. The orders are split across 9 .csv files organized as a relational database schema.

For this study, I aggregated data from the following 3 of the 9 .csv files in the zip file.

  • olist_customers_dataset.csv
  • olist_orders_dataset.csv
  • olist_payments_dataset.csv

Source : Olist Dataset

Analysis Walk-through

Table of Contents

  1. Experiment
  2. Package Introduction
  3. Pre-Processing
  4. Train BG/NBD Models
  5. Model Evaluation
  6. Model Interpretation
    1. Frequency-Recency-Expected Number of Purchases Analysis
    2. Customer Segmentation
    3. Probability-Alive-Matrix
    4. High Probability Customers
    5. Future Forecast Random Sampled Customer


Model Selection

Experimented with the Pareto/NBD, Modified Beta-Geometric/NBD (MBG/NBD), and Beta-Geometric/NBD (BG/NBD) models. Of the three, BG/NBD was selected for further exploration because it trained faster and had a low prediction RMSE.

Calibration-Holdout Cut-off Selection

The year-month of the transactions ranges from 2016-09 to 2018-08. The calibration-holdout threshold date was treated as a hyperparameter, and several dates were tried. The selected calibration period is 2017-01 to 2017-12, and the holdout period used for model evaluation is 2018-01 to 2018-08.
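The split can be sketched in plain Python as follows. This is only an illustration of the idea, not the project's make_dataset() code: the function name, the exact day boundaries (2017-01-01, 2017-12-31, 2018-08-31), and the choice to count one purchase per distinct order date are all assumptions.

```python
from collections import defaultdict
from datetime import date


def split_calibration_holdout(orders, calibration_start, calibration_end, holdout_end):
    """Build per-customer calibration and holdout counts.

    orders: iterable of (customer_id, order_date) pairs.
    Only customers with at least one calibration purchase are returned.
    """
    cal_days = defaultdict(set)
    holdout_days = defaultdict(set)
    for customer_id, order_date in orders:
        if calibration_start <= order_date <= calibration_end:
            cal_days[customer_id].add(order_date)
        elif calibration_end < order_date <= holdout_end:
            holdout_days[customer_id].add(order_date)

    duration_holdout = (holdout_end - calibration_end).days
    return {
        customer_id: {
            # Repeat purchases in calibration: distinct purchase days minus the first.
            "frequency_cal": len(days) - 1,
            # Purchases observed after the threshold date.
            "frequency_holdout": len(holdout_days.get(customer_id, ())),
            "duration_holdout": duration_holdout,
        }
        for customer_id, days in cal_days.items()
    }


# Hypothetical orders for three customers.
orders = [
    ("a", date(2017, 3, 1)),
    ("a", date(2017, 6, 1)),
    ("a", date(2018, 2, 1)),
    ("b", date(2017, 5, 5)),
    ("c", date(2018, 3, 3)),  # first seen in holdout, so excluded
]
result = split_calibration_holdout(
    orders, date(2017, 1, 1), date(2017, 12, 31), date(2018, 8, 31)
)
print(result)
```

Moving calibration_end earlier or later is what treating the threshold date as a hyperparameter amounts to: each choice trades calibration history against holdout length.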

# Setting the working directory to the git clone path.
# A raw string keeps the backslashes in the Windows path literal.

mydir = r"Git\Clone\Path"

%cd $mydir

Package Introduction

| Module | Function | Description | Parameters | Yields | Returns |
|---|---|---|---|---|---|
| preprocess | make_dataset() | Pre-processes raw data | -- | transactions, summary, summary_cal_holdout, customer_mapping | transactions |
| train | train_model() | Trains BG/NBD models on the summary and summary_cal_holdout datasets | -- | calibration_model.pkl, customer_lifetime_estimator.pkl, summary_cal_preds.csv | -- |
| evaluation | -- | Utility functions for evaluation on the calibration_holdout dataset | -- | -- | -- |
| predict | number_of_purchases(), probability_alive() | Predictions using the final fit model | -- | -- | -- |

# Basic Imports.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Local Imports.
from src.preprocess import make_dataset
from src.train import train_model


make_dataset() pre-processes the raw .csv files into the following datasets.

  1. Transactions data features -

     customer_unique_id - Customer ID
     order_id - Order ID
     order_purchase_timestamp - Timestamp when the order was placed
     payment_value - Order payment value
     payment_type - Method used to make the payment
     year_month - Year-month from the order timestamp
     order_date - Order date from the order timestamp
     avg_inter_purchase_time - Average number of days between orders for repeat customers
  2. Summary Calibration and Holdout data features -

     frequency_cal - Number of repeat purchases in the calibration period: (total purchase count) - 1
     recency_cal - Age of the customer at their latest calibration purchase: (latest purchase) - (first purchase), in days
     T_cal - Age of the customer at the end of the calibration period: (calibration end date) - (first purchase), in days
     frequency_holdout - Number of purchases after the threshold date
     duration_holdout - Number of days in the holdout period
  3. Summary data features -

     frequency - Number of repeat purchases: (total purchase count) - 1
     recency - Age of the customer at their latest purchase: (latest purchase) - (first purchase), in days
     T - Total age of the customer: (closing date in dataset) - (first purchase), in days
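As a small worked example of the three summary features, the sketch below derives frequency, recency, and T for a single customer from a list of order dates. The function name and the sample dates are hypothetical, and counting one purchase per distinct order date is an assumption about how repeat purchases are deduplicated.

```python
from datetime import date


def summarize_customer(order_dates, observation_end):
    """Compute frequency, recency and T (in days) for one customer."""
    # One purchase per distinct day (assumption), in chronological order.
    days = sorted(set(order_dates))
    first, last = days[0], days[-1]
    return {
        "frequency": len(days) - 1,             # repeat purchases only
        "recency": (last - first).days,         # age at latest purchase
        "T": (observation_end - first).days,    # age at end of observation
    }


example = summarize_customer(
    [date(2018, 1, 1), date(2018, 1, 31), date(2018, 3, 2)],
    observation_end=date(2018, 8, 31),
)
print(example)  # {'frequency': 2, 'recency': 60, 'T': 242}
```

A customer with a single purchase gets frequency 0 and recency 0, which is why most rows in the summary samples start with 0.0 in both columns.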
# Pre-Processing raw data.

transactions = make_dataset()
# Transactions dataset.

|   | customer_unique_id | order_id | order_purchase_timestamp | payment_value | payment_type | year_month | order_date | avg_inter_purchase_time |
|---|---|---|---|---|---|---|---|---|
| 0 | f7b981e8a280e455ac3cbe0d5d171bd1 | ec7a019261fce44180373d45b442d78f | 2017-01-05 11:56:06 | 19.62 | credit_card | 2017-01 | 2017-01-05 | 0.0 |
| 1 | 83e7958a94bd7f74a9414d8782f87628 | b95a0a8bd30aece4e94e81f0591249d8 | 2017-01-05 12:01:20 | 19.62 | boleto | 2017-01 | 2017-01-05 | 0.0 |

# Summary Calibration and Holdout dataset.

summary_cal_holdout = pd.read_csv("datasets/summary_cal_holdout.csv")

|   | frequency_cal | recency_cal | T_cal | frequency_holdout | duration_holdout |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 296.0 | 0.0 | 243.0 |
| 1 | 0.0 | 0.0 | 80.0 | 0.0 | 243.0 |

# Summary dataset for final Fit.

summary = pd.read_csv("datasets/summary.csv")

|   | frequency | recency | T |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 113.0 |
| 1 | 0.0 | 0.0 | 116.0 |

Training BG/NBD Models


Two BG/NBD models are trained: one on the summary_cal_holdout dataset for evaluation, and another on the summary dataset as the final fit.

# Training Models.

train_model()

    Optimization terminated successfully.
             Current function value: 0.070935
             Iterations: 61
             Function evaluations: 63
             Gradient evaluations: 63
    Optimization terminated successfully.
             Current function value: 0.086931
             Iterations: 62
             Function evaluations: 63
             Gradient evaluations: 63


Model Evaluation


  • single_customer_evaluation() - Compares the model prediction to the ground truth for a customer randomly sampled from the dataset.

  • root_mean_squared_error() - Computes the root mean squared error of the model's frequency predictions against the holdout frequency.

  • evaluation_plots() - 4 plots for model evaluation:
      • tracking - Tracking of cumulative transactions and of daily transactions (two plots).
      • repeated - Frequency of repeated purchases.
      • calibration_holdout - Calibration vs holdout repeated purchases.
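For reference, the RMSE metric itself can be written in a few lines. This is a generic sketch of the formula, not the body of root_mean_squared_error() from src.evaluation; the function name rmse and the sample values are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("predicted and actual must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted holdout frequencies vs observed holdout frequencies.
predicted = [0.01, 0.05, 0.40]
observed = [0.0, 0.0, 1.0]
print(round(rmse(predicted, observed), 4))
```

Because most customers make no holdout purchases, a low RMSE is easy to reach on this kind of data, so it is worth reading the metric alongside the evaluation plots.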

# Evaluation utility functions.

from src.evaluation import single_customer_evaluation
from src.evaluation import root_mean_squared_error
from src.evaluation import evaluation_plots
# Evaluation of an Individual customer predictions by the model.

frequency_predicted, frequency_holdout = single_customer_evaluation()

# Predicted vs Holdout.

print(f"\n {frequency_predicted}"
      f"\nGround Truth: "
      f"\n {frequency_holdout}")
     29881    0.008178
    dtype: float64
    Ground Truth: 
     29881    0.0
    Name: frequency_holdout, dtype: float64

# Overall Root Mean Squared Error of Predictions.

rmse = root_mean_squared_error()

print(f"RMSE: {rmse}")

RMSE: 0.14444759935762416
# Calibration vs Holdout Plot.



# Cumulative Transactions and Daily Transactions plot.



# Repeated Frequency of Transactions plot.

C:\Program Files\Anaconda\lib\site-packages\lifetimes\ RuntimeWarning: divide by zero encountered in double_scalars
  next_purchase_in = random.exponential(scale=1.0 / l)


Model Interpretation

# Imports for Model Interpretation.

from lifetimes import BetaGeoFitter

from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix
# Loading the datasets used for training.

summary = pd.read_csv("datasets/summary.csv")
customer_id_mapping = pd.read_csv("datasets/customer_mapping.csv")
transactions = pd.read_csv("datasets/transactions.csv", parse_dates=["order_purchase_timestamp", "order_date"])
# Setting Trained Customer_Lifetime_Estimator.

model = BetaGeoFitter()

Frequency/Recency Analysis

Analyzing the relation between frequency, recency, and the expected number of future purchases using the 30-day forecast plot below.

Frequency - Number of repeat purchases the customer has made.

Recency - Age of the customer at their last purchase, i.e. (last purchase - first purchase) in days.

Customer Segmentation

Best Customers

The model predicts that the best customers are those in the bottom right of the plot: with a historical recency of 400-600 days and a frequency of 10-15, they are likely to make about 6 purchases in the next 30 days.

Coldest Customers

The customers in the top right, with a historical recency of 0-200 days and a frequency of 10-15, are likely to make almost no purchases. They bought many times shortly after their first purchase and then stopped, which the model treats as a sign of churn.

# Frequency-Recency-Expected Number of Future Purchases.



Probability Alive Matrix

This plot depicts the relation between frequency, recency, and the probability that a customer is alive, where "alive" means the customer will place another order at some point in the future.


A customer whose latest purchase came 200 days after their first, and who has made about 7 purchases, has a probability of 0.2 of coming back to make another purchase.
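The values in this matrix come from the closed-form BG/NBD expression for the probability of being alive, which for a customer with x repeat purchases, recency t_x, and age T is 1 / (1 + a / (b + x - 1) * ((alpha + T) / (alpha + t_x))^(r + x)), and 1 when x = 0. A minimal sketch of that expression follows; the parameter values r, alpha, a, b used here are hypothetical and are not the fitted parameters of this project's model.

```python
def bgnbd_probability_alive(frequency, recency, T, r, alpha, a, b):
    """BG/NBD probability that a customer is still alive.

    frequency: number of repeat purchases (x)
    recency:   age at the latest purchase (t_x), in days
    T:         age at the end of the observation period, in days
    r, alpha, a, b: model parameters (illustrative values only)
    """
    if frequency == 0:
        # With no repeat purchase the BG/NBD model cannot have observed churn.
        return 1.0
    ratio = (alpha + T) / (alpha + recency)
    odds_dead = (a / (b + frequency - 1)) * ratio ** (r + frequency)
    return 1.0 / (1.0 + odds_dead)


params = dict(r=0.2, alpha=5.0, a=1.0, b=2.5)  # hypothetical parameters

# Same purchase history, observed for longer and longer without a new order.
for age in (200, 300, 600):
    print(age, round(bgnbd_probability_alive(7, 200, age, **params), 4))
```

The loop shows the pattern visible in the matrix: for a fixed frequency and recency, the probability of being alive falls as the time since the last purchase grows.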

# Probability the customer is alive.



High Probability Customers.

Finding insights from the historical transactions of customers who are likely to purchase in the next 30 days.

Insight Obtained

Customers with a high probability of making a purchase have historically paid by credit card, so we could infer that a credit card will likely be their payment method in the future too.

# Predictions.

frequency_predictions = model.predict(t=30,
                                      frequency=summary["frequency"],
                                      recency=summary["recency"],
                                      T=summary["T"])

summary["frequency_predictions"] = frequency_predictions.copy()
# Top 10 Likely to Purchase customers.

top_ten = summary.sort_values("frequency_predictions", ascending=False).head(10)

# Extracting IDs using Customer ID Mapping dataset.

top_ten_ids = customer_id_mapping.iloc[top_ten.index]

# List of ids.

top_ten_ids_list = list(top_ten_ids.customer_unique_id.values)

# Their Transactions.

historical_transactions_of_top_ten = transactions[transactions.customer_unique_id.isin(top_ten_ids_list)]

|   | customer_unique_id | order_id | order_purchase_timestamp | payment_value | payment_type | year_month | order_date | avg_inter_purchase_time |
|---|---|---|---|---|---|---|---|---|
| 9401 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 5d848f3d93a493c1c8955e018240e7ca | 2017-05-15 23:30:03 | 22.77 | credit_card | 2017-05 | 2017-05-15 | 28.875 |
| 13399 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 369634708db140c5d2c4e365882c443a | 2017-06-18 22:56:48 | 51.75 | credit_card | 2017-06 | 2017-06-18 | 28.875 |

# Payment methods used historically by the top 10 customers.
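The cell for this step is not shown here. As a rough sketch of the tally behind the insight, the snippet below counts payment types over a list of plain records standing in for the rows of historical_transactions_of_top_ten. The function name and the sample records are hypothetical; on the actual DataFrame, the payment_type column's value_counts() would give the same counts.

```python
from collections import Counter


def payment_type_share(records):
    """Return each payment type's share of the given transactions."""
    counts = Counter(record["payment_type"] for record in records)
    total = sum(counts.values())
    # most_common() orders the result from the most to the least used type.
    return {ptype: n / total for ptype, n in counts.most_common()}


# Hypothetical transactions of high-probability customers.
sample_records = [
    {"customer": "x", "payment_type": "credit_card"},
    {"customer": "x", "payment_type": "credit_card"},
    {"customer": "y", "payment_type": "credit_card"},
    {"customer": "z", "payment_type": "boleto"},
]
print(payment_type_share(sample_records))  # {'credit_card': 0.75, 'boleto': 0.25}
```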



Forecast for Randomly Selected Customer in dataset.

Insights Obtained

  • Number of Purchases - The randomly sampled customer is not very likely to make a purchase in the next 30 days.
  • Probability Alive - There is a very low probability that they will place an order anytime soon.
# Prediction module from the local package.

from src.predict import number_of_purchases, probability_alive
# Customers with repeated purchases.

repeated_customers = summary.loc[summary["frequency"] >= 2]

# Randomly sampling one customer.

random_sampled_customer = repeated_customers.sample()


|   | frequency | recency | T | frequency_predictions |
|---|---|---|---|---|
| 79299 | 2.0 | 125.0 | 220.0 | 0.044719 |

# Predicting Number of Purchases 30 days in future.


 79299    0.044719
    dtype: float64
# Predicting Probability that they will be placing an order based on Historical Transactions.

