##### Phase Objective: In this phase we use a probabilistic modeling approach to predict CLTV. We'll use the `lifetimes` library, which provides models of customer purchasing behavior. This approach is highly interpretable and gives separate insights into customer purchasing frequency and monetary value.

In [1]:
import pandas as pd
import numpy as np
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.utils import calibration_and_holdout_data


In [2]:
# Set display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
print("Loading feature-engineered data...")
try:
    df_features = pd.read_csv('../data/processed/customer_features_rfm.csv')
    print("Feature-engineered data loaded successfully.")
except FileNotFoundError:
    print("Error: 'customer_features_rfm.csv' not found. Please ensure 02_Feature_Engineering.ipynb was run and saved.")
    raise  # Re-raise so execution stops here; exit() would terminate the notebook kernel

Loading feature-engineered data...
Feature-engineered data loaded successfully.


In [4]:
print("Shape of the  customer_features_rfm.csv dataframe is", df_features.shape)

Shape of the  customer_features_rfm.csv dataframe is (4338, 10)


In [6]:
print("Sample features are: ")
df_features.head()

Sample features are: 


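| | CustomerID | Recency | Frequency | Monetary | Tenure | Frequency_model | Recency_Model | AOV | AvgPurchaseGap | ProductDiversity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12346 | 326 | 1 | 77183.6 | 326 | 0 | 0 | 77183.6 | 0.0 | 1 |
| 1 | 12347 | 2 | 7 | 4310.0 | 367 | 6 | 365 | 615.71 | 2.0 | 103 |
| 2 | 12348 | 75 | 4 | 1797.24 | 358 | 3 | 282 | 449.31 | 9.4 | 22 |
| 3 | 12349 | 19 | 1 | 1757.55 | 19 | 0 | 0 | 1757.55 | 0.0 | 73 |
| 4 | 12350 | 310 | 1 | 334.4 | 310 | 0 | 0 | 334.4 | 0.0 | 17 |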


#### We are using the `lifetimes` library for the probabilistic models. These models require the data in a specific format.
#### The required columns are:
##### 1. `frequency`: The number of *repeat* purchases (transactions - 1)
##### 2. `recency`: The age of the customer *at the time of their last purchase* (t_x)
##### 3. `T`: The total age of the customer (Tenure)
##### 4. `monetary`: The *average* monetary value per transaction
##### Now let's build another dataframe that contains just the columns mentioned above.
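As a rough illustration of how these four columns relate to raw transactions, here is a minimal pandas sketch. The transaction log, its column names (`InvoiceDate`, `Amount`) and the `observation_end` date are all made up for this example, and the averaging rule is a simplification that may differ from the one used in the feature-engineering notebook or in the `lifetimes` helpers.

```python
import pandas as pd

# Hypothetical transaction log; column names and values are illustrative only.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-02-10", "2011-03-01",
        "2011-01-15",
        "2011-02-01", "2011-02-20",
    ]),
    "Amount": [50.0, 30.0, 40.0, 20.0, 10.0, 70.0],
})
observation_end = pd.Timestamp("2011-03-31")  # assumed end of the observation window

# Collapse to one row per customer per purchase day, so same-day orders count once.
daily = (
    tx.assign(day=tx["InvoiceDate"].dt.normalize())
      .groupby(["CustomerID", "day"], as_index=False)["Amount"].sum()
)

grouped = daily.groupby("CustomerID")
first, last = grouped["day"].min(), grouped["day"].max()

summary = pd.DataFrame({
    "frequency": grouped["day"].count() - 1,   # repeat purchases only
    "recency": (last - first).dt.days,         # customer age at the last purchase
    "T": (observation_end - first).dt.days,    # customer age at the end of the window
    "monetary": grouped["Amount"].mean(),      # average value per purchase day
})
print(summary)
```

A customer with a single purchase gets `frequency = 0` and `recency = 0`, which matches the pattern visible in the sample rows above.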

In [8]:
# Note: 'Monetary' holds each customer's total spend; 'AOV' holds the per-transaction average.
lifetimes_df = df_features[['CustomerID', 'Frequency_model', 'Recency_Model', 'Tenure', 'Monetary']].copy()

##### The Gamma-Gamma model requires monetary value to be greater than zero.
##### We'll filter out customers whose `Monetary` value is not positive. Because `Monetary` is total spend, single buyers are kept by this filter; they are excluded later by `frequency > 0`.
##### We will use the BG/NBD model on all remaining customers, but the Gamma-Gamma model only on repeat buyers.


In [9]:

lifetimes_df = lifetimes_df[lifetimes_df['Monetary'] > 0]

In [10]:
lifetimes_df.rename(columns={
    'Frequency_model': 'frequency',
    'Recency_Model': 'recency',
    'Tenure': 'T',
    'Monetary': 'monetary'
}, inplace=True)

In [11]:
print("\nData prepared for lifetimes library:")
print(lifetimes_df.head())


Data prepared for lifetimes library:
   CustomerID  frequency  recency    T  monetary
0       12346          0        0  326  77183.60
1       12347          6      365  367   4310.00
2       12348          3      282  358   1797.24
3       12349          0        0   19   1757.55
4       12350          0        0  310    334.40


#### Training the BG/NBD Model (Frequency Prediction)
#### The BG/NBD (Beta-Geometric / Negative Binomial Distribution) model predicts
#### how many future transactions a customer will make. It models two processes:
#### 1. A customer's purchasing process (frequency)
#### 2. A customer's "dropout" process (churn)
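To make the two processes concrete, here is a small simulation sketch of the standard BG/NBD story: while active, a customer buys at an individual rate drawn from a gamma distribution, and after each purchase drops out with an individual probability drawn from a beta distribution. The function name `simulate_bgnbd_customer` and the parameter values are hypothetical, chosen only for illustration; they are not the fitted values, and this is not how `lifetimes` fits the model.

```python
import numpy as np

def simulate_bgnbd_customer(r, alpha, a, b, T, rng):
    """Simulate repeat-purchase times (in days) for one customer over T days."""
    lam = rng.gamma(shape=r, scale=1.0 / alpha)  # individual purchase rate per day
    p = rng.beta(a, b)                           # individual dropout probability
    t, purchases = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam)          # waiting time until the next purchase
        if t > T:
            break                                # observation window is over
        purchases.append(t)
        if rng.random() < p:
            break                                # customer drops out after this purchase
    return purchases

rng = np.random.default_rng(0)
# Illustrative parameters only (not the values estimated in this notebook).
counts = [
    len(simulate_bgnbd_customer(r=0.7, alpha=50.0, a=0.5, b=2.5, T=365, rng=rng))
    for _ in range(1000)
]
print("mean repeat purchases per simulated customer:", np.mean(counts))
```

Raising `a` relative to `b` makes dropout more likely after each purchase, which shortens the simulated purchase histories.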


In [12]:
print("\n--- Training the BG/NBD model ---")
bgf = BetaGeoFitter(penalizer_coef=0.0) 
bgf.fit(lifetimes_df['frequency'], lifetimes_df['recency'], lifetimes_df['T'])




--- Training the BG/NBD model ---


<lifetimes.BetaGeoFitter: fitted with 4338 subjects, a: 0.03, alpha: 51.55, b: 2.50, r: 0.73>

In [13]:
print("BG/NBD model training complete. Model summary:")
print(bgf.summary)


BG/NBD model training complete. Model summary:
       coef  se(coef)  lower 95% bound  upper 95% bound
r      0.73      0.02             0.68             0.77
alpha 51.55      1.98            47.67            55.42
a      0.03      0.01             0.00             0.06
b      2.50      2.00            -1.42             6.41


#### Meaning of rows:


| Parameter | Meaning |
| --------- | ------- |
| **r**     | Shape parameter of the gamma distribution that describes how purchase rates vary across customers. The ratio r / alpha is the average purchase rate while a customer is active, here roughly 0.73 / 51.55 ≈ 0.014 purchases per day. |
| **alpha** | Rate parameter of that same gamma distribution. For a fixed **r**, a larger **alpha** means a lower average purchase rate. |
| **a**     | First shape parameter of the beta distribution over each customer's chance of stopping after a purchase. The ratio a / (a + b) is the average dropout probability, here roughly 0.01, so most customers will **probably come back**. |
| **b**     | Second shape parameter of that beta distribution. Together with **a** it controls how much the chance of stopping differs between customers: some stop early, some never. Its interval here (-1.42 to 6.41) is wide, so this estimate is uncertain. |


#### Meaning of columns:
| Column              | What it means |
| ------------------- | ------------- |
| **coef**            | The value the model learnt. Think of it as the model's best guess for that parameter. |
| **se(coef)**        | The **standard error** of the estimate: the smaller this number, the more precisely the parameter is estimated. |
| **lower 95% bound** | The lower end of the 95% confidence interval, approximately coef - 1.96 × se(coef). |
| **upper 95% bound** | The upper end of the 95% confidence interval, approximately coef + 1.96 × se(coef). |


##### Training the Gamma-Gamma Model (Monetary Prediction)
##### The Gamma-Gamma model predicts the average monetary value of a customer's transactions.
##### It assumes that monetary value and transaction frequency are independent.
##### This model is trained only on *repeat customers* (those with frequency > 0).
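One simple way to sanity-check the independence assumption is to look at the correlation between frequency and monetary value among repeat buyers; a value close to zero is consistent with the assumption. The sketch below uses a made-up dataframe named `repeat_buyers` purely for illustration; in this notebook the same check could be run on the repeat-buyer subset.

```python
import pandas as pd

# Hypothetical repeat-buyer summary; the values are made up for illustration.
repeat_buyers = pd.DataFrame({
    "frequency": [1, 2, 3, 5, 8, 13],
    "monetary": [120.0, 80.0, 150.0, 95.0, 110.0, 100.0],
})

# Pearson correlation between repeat-purchase count and monetary value.
corr = repeat_buyers["frequency"].corr(repeat_buyers["monetary"])
print(f"frequency vs. monetary correlation: {corr:.3f}")
```

A strong positive or negative correlation would suggest the Gamma-Gamma estimates should be interpreted with care.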

In [14]:
print("\n--- Training the Gamma-Gamma model ---")
ggf = GammaGammaFitter(penalizer_coef=0.0)
# Filter for repeat buyers (frequency > 0)
repeat_buyers_df = lifetimes_df[lifetimes_df['frequency'] > 0]


--- Training the Gamma-Gamma model ---


In [17]:
ggf.fit(repeat_buyers_df['frequency'], repeat_buyers_df['monetary'])
print("Gamma-Gamma model training complete. Model summary:")
print(ggf.summary)
print("\nNote: The Gamma-Gamma model is trained only on customers with >1 purchase.")


Gamma-Gamma model training complete. Model summary:
     coef  se(coef)  lower 95% bound  upper 95% bound
p    1.72      0.12             1.49             1.94
q    1.75      0.06             1.63             1.87
v 1234.57    137.37           965.33          1503.81

Note: The Gamma-Gamma model is trained only on customers with >1 purchase.


| Parameter | Value   | What It Means |
| --------- | ------- | ------------- |
| **p**     | 1.72    | Shape parameter of the gamma distribution of a customer's individual transaction values. It governs how **consistent** a customer is from one purchase to the next. |
| **q**     | 1.75    | Shape parameter of the gamma distribution that captures the **variation** between customers' spending behavior. |
| **v**     | 1234.57 | Scale parameter of that between-customer distribution. It is not an average by itself: the population mean implied by the model is p × v / (q - 1), here roughly 2,830 (in your currency, say ₹). |


In [24]:
import os

# Make sure the 'models' folder exists
os.makedirs('../models', exist_ok=True)

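# Use the lifetimes library's dedicated save_model() method.
# It serializes the fitted object with dill (a pickle-based format), whatever the file extension.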
bgf.save_model('../models/bgf.json')
ggf.save_model('../models/ggf.json')

print("Models saved successfully to the 'models/' folder as JSON files.")

Models saved successfully to the 'models/' folder as JSON files.


#### Predicting CLTV
#### Now we combine both models to predict CLTV over a future period.
#### CLTV = (Predicted Future Purchases) * (Predicted Average Monetary Value)
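When a discount rate is involved, the product above is usually applied period by period and each period's value is discounted back to today. The sketch below shows that idea with plain Python; the function `discounted_cltv` and its inputs are hypothetical and only illustrate the arithmetic, they are not the library's implementation.

```python
def discounted_cltv(expected_purchases_per_month, avg_order_value, monthly_discount_rate):
    """Sum expected monthly revenue, discounting month i by (1 + d) ** i."""
    total = 0.0
    for i, purchases in enumerate(expected_purchases_per_month, start=1):
        total += purchases * avg_order_value / (1 + monthly_discount_rate) ** i
    return total

# Hypothetical inputs: 0.5 expected purchases in each of 12 months, 100 per order.
monthly_purchases = [0.5] * 12
undiscounted = discounted_cltv(monthly_purchases, 100.0, 0.0)
discounted = discounted_cltv(monthly_purchases, 100.0, 0.01)
print(undiscounted, round(discounted, 2))
```

With a zero discount rate the result is simply purchases × average value summed over the months; a positive rate shrinks the contribution of later months.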


In [19]:
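# Intended horizon: the next 12 months (365 days).
# Caution: customer_lifetime_value() interprets `time` in months, so a 12-month horizon corresponds to time=12.
prediction_period = 365 # Passed as `time` below
discount_rate = 0.01 # Monthly discount rate applied to future profits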
print(f"\n--- Predicting CLTV for the next {prediction_period} days ---")


--- Predicting CLTV for the next 365 days ---


In [20]:
# Combine the models to get the final CLTV estimate.
# Caution: lifetimes expects the AVERAGE value per transaction as the monetary input.
# The `monetary` column here holds each customer's TOTAL spend, which inflates the result.
lifetimes_df['predicted_cltv'] = ggf.customer_lifetime_value(
    bgf, # Our Beta-Geo model
    lifetimes_df['frequency'],
    lifetimes_df['recency'],
    lifetimes_df['T'],
    lifetimes_df['monetary'], # TOTAL monetary value for each customer
    time=prediction_period,   # Interpreted by lifetimes as a number of months
    discount_rate=discount_rate
)

In [21]:
# Display the top 10 customers with the highest predicted CLTV
print("Top 10 customers by predicted CLTV:")
print(lifetimes_df.sort_values(by='predicted_cltv', ascending=False).head(10))

Top 10 customers by predicted CLTV:
      CustomerID  frequency  recency    T  monetary  predicted_cltv
1879       14911        200      372  373 143711.17    187546448.52
1689       14646         72      353  355 280206.02    137653390.94
4201       18102         54      366  367 259657.30     93120007.40
3728       17450         45      359  368 194390.79     57957620.92
326        12748        209      372  374  33053.19     44969098.89
1333       14156         53      361  372 117210.08     40717912.71
2176       15311         90      373  374  60632.75     35591954.94
562        13089         92      366  370  58762.08     35570354.25
4010       17841        123      371  373  40519.84     32551848.45
2702       16029         60      335  374  80850.84     26902393.20


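#### Customer 14911 has already made 200 repeat purchases and spent ₹143,711, and the model assigns a predicted CLTV of about ₹187.5 million.
#### Read the absolute amounts with caution: the call above passes total spend where `lifetimes` expects the average value per transaction, and `time=365` is interpreted as months rather than days, so the figures are heavily inflated. The ranking can still serve as a rough pointer to your most valuable customers, the ones you want to retain and reward.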

#### Saving predictions

In [22]:
output_path = '../data/processed/cltv_predictions_probabilistic.csv'
lifetimes_df.to_csv(output_path, index=False)
print(f"\nCLTV predictions saved to: {output_path}")



CLTV predictions saved to: ../data/processed/cltv_predictions_probabilistic.csv
