#### Feature engineering transforms the cleaned transactional data into customer-level features, focusing on RFM (Recency, Frequency, Monetary) and customer Tenure. These features are the core inputs for our CLTV prediction models.


In [2]:
import pandas as pd
import numpy as np
import datetime as dt



In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [4]:
print("loading the dataset...")
try:
    df_cleaned = pd.read_csv("../data/processed/online_retail_cleaned.csv")
    # Parse invoice timestamps so date arithmetic works later on
    df_cleaned['InvoiceDate'] = pd.to_datetime(df_cleaned['InvoiceDate'])
    print("Loaded the Cleaned dataset successfully!")
except FileNotFoundError:
    print("File not found!")
    raise  # stop the cell here; exit() would shut down the notebook kernel

loading the dataset...
Loaded the Cleaned dataset successfully!


In [5]:
print("Shape of the cleaned dataset",df_cleaned.shape)
print("Sample of the cleaned dataset: ")
df_cleaned.head()

Shape of the cleaned dataset (392692, 9)
Sample of the cleaned dataset: 


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalPrice
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


### Defining the analysis period
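#### For CLTV, we need a reference (snapshot) date that closes the analysis period. Recency and tenure are measured against this date. We use one day after the last invoice in the data, so every customer's recency is at least one day.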

In [6]:
snapshot_date=df_cleaned['InvoiceDate'].max() + dt.timedelta(days=1)
print("Analysis snapshot date: " , snapshot_date)

Analysis snapshot date:  2011-12-10 12:50:00
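To see what the one-day offset does, here is a minimal sketch on a tiny frame. The `toy` data and the names `toy_snapshot` and `toy_recency` are invented for illustration and are not part of the notebook's dataset.

```python
import datetime as dt
import pandas as pd

# Hypothetical invoice timestamps, invented for illustration only.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01 09:00", "2011-12-09 12:50", "2011-11-30 10:00"]
    ),
})

# Same rule as the notebook: one day after the latest invoice.
toy_snapshot = toy["InvoiceDate"].max() + dt.timedelta(days=1)

# Whole days from each customer's last purchase to the snapshot.
last_purchase = toy.groupby("CustomerID")["InvoiceDate"].max()
toy_recency = (toy_snapshot - last_purchase).dt.days

print(toy_snapshot)           # 2011-12-10 12:50:00
print(toy_recency.to_dict())  # {1: 1, 2: 10}
```

The customer with the most recent invoice gets a recency of 1 rather than 0, which is the effect of the offset.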


### Calculate the RFM
#### Here RFM is:
#### 1. R (Recency) : days since last purchase
#### 2. F (Frequency) : Total number of purchases/transactions (counted here as distinct invoice timestamps)
#### 3. M (Monetary) : Total monetary value of purchases

In [7]:
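print("Calculating RFM features....")
rfm_df = df_cleaned.groupby('CustomerID').agg(
    # Whole days from the customer's last invoice to the snapshot date
    Recency=('InvoiceDate', lambda date: (snapshot_date - date.max()).days),
    # Number of distinct invoice timestamps
    Frequency=('InvoiceDate', 'nunique'),
    # Total spend across all line items
    Monetary=('TotalPrice', 'sum')
).reset_index()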

print("RFM DataFrame head:")
print(rfm_df.head())
print("Shape of rfm is " , rfm_df.shape)

Calculating RFM features....
RFM DataFrame head:
   CustomerID  Recency  Frequency  Monetary
0       12346      326          1  77183.60
1       12347        2          7   4310.00
2       12348       75          4   1797.24
3       12349       19          1   1757.55
4       12350      310          1    334.40
Shape of rfm is  (4338, 4)
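
The `Frequency` column above counts distinct `InvoiceDate` values. The sketch below, on invented data, compares that choice with a few alternatives one might consider (line items, distinct invoice numbers, distinct calendar days). The `toy` frame and the column names in `counts` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical line items: invoice A has two lines; A and B fall on the same day.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1],
    "InvoiceNo": ["A", "A", "B", "C"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05 10:00", "2011-01-05 10:00",
         "2011-01-05 15:30", "2011-02-01 09:00"]
    ),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})

counts = toy.groupby("CustomerID").agg(
    line_items=("InvoiceNo", "size"),          # every row
    invoices=("InvoiceNo", "nunique"),         # distinct invoice numbers
    timestamps=("InvoiceDate", "nunique"),     # what the notebook uses
    purchase_days=("InvoiceDate", lambda s: s.dt.normalize().nunique()),
    Monetary=("TotalPrice", "sum"),
)
print(counts)
```

For this customer the four definitions give 4, 3, 3 and 2, so the choice matters whenever a customer places several invoices on one day.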


#### 4. Tenure: The age of the customer, measured from their first purchase date to the snapshot date. This is important because older customers (higher tenure) have had more time to purchase and might naturally have higher CLTV.

In [8]:
first_purchase_date_df = df_cleaned.groupby('CustomerID').agg(
    FirstPurchase=('InvoiceDate', 'min')
).reset_index()

#merge first purchase date with RFM dataframe
rfm_df = pd.merge(rfm_df,first_purchase_date_df, on='CustomerID' , how='left')

#Calculate tenure in days
rfm_df['Tenure']=(snapshot_date - rfm_df['FirstPurchase']).dt.days

print("RFM DataFrame with Tenure head:")
print(rfm_df.head())

RFM DataFrame with Tenure head:
   CustomerID  Recency  Frequency  Monetary       FirstPurchase  Tenure
0       12346      326          1  77183.60 2011-01-18 10:01:00     326
1       12347        2          7   4310.00 2010-12-07 14:57:00     367
2       12348       75          4   1797.24 2010-12-16 19:09:00     358
3       12349       19          1   1757.55 2011-11-21 09:51:00      19
4       12350      310          1    334.40 2011-02-02 16:01:00     310


### Adjusting frequency for CLTV modeling
#### Because we'll be using probabilistic models like BG/NBD, frequency is defined as the number of additional purchases after the first purchase. Therefore, a customer who bought once will have frequency = 0. We will use the Python `lifetimes` library to fit these probabilistic models.

##### Frequency model

In [9]:
# Count distinct invoice timestamps per customer -> distinct buying occasions.
customer_transactions_count = df_cleaned.groupby('CustomerID').agg(
    NumTransactions=('InvoiceDate', 'nunique')
).reset_index()

rfm_df = pd.merge(rfm_df, customer_transactions_count, on='CustomerID', how='left')

# Frequency for probabilistic models (e.g. the lifetimes library) is
# 'number of transactions - 1': n - 1 for repeat buyers and 0 for one-time buyers.
rfm_df['Frequency_model'] = rfm_df['NumTransactions'] - 1
# Already implied by the subtraction above; kept as an explicit guard for single buyers
rfm_df.loc[rfm_df['NumTransactions'] == 1, 'Frequency_model'] = 0
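
As a small illustration of the "transactions minus one" rule, the sketch below applies it to invented counts. The series `num_transactions` is hypothetical, and `clip(lower=0)` is an optional safeguard that only matters if a count of 0 could ever appear.

```python
import pandas as pd

# Hypothetical transaction counts per customer, invented for illustration.
num_transactions = pd.Series(
    [1, 2, 7], index=[101, 102, 103], name="NumTransactions"
)

# Repeat purchases = transactions after the first one.
frequency_model = (num_transactions - 1).clip(lower=0)

print(frequency_model.to_dict())  # {101: 0, 102: 1, 103: 6}
```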


#### Recency for probabilistic models is the age of the customer *at the time of their last purchase* (T_x): the number of days between the first purchase and the last purchase. It is measured up to the customer's *last purchase*, not up to the snapshot date, so one-time buyers have a value of 0.


##### Recency model

In [10]:
last_purchase_date_df = df_cleaned.groupby('CustomerID').agg(
    LastPurchase=('InvoiceDate', 'max')
).reset_index()
rfm_df = pd.merge(rfm_df, last_purchase_date_df, on='CustomerID', how='left')
rfm_df['Recency_Model'] = (rfm_df['LastPurchase'] - rfm_df['FirstPurchase']).dt.days

#### T for probabilistic models is the age of the customer (Tenure) in days, from first purchase to snapshot date. We already calculated 'Tenure' earlier.
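
The two definitions above can be checked on a tiny example. This is a sketch on invented data: the `toy` frame, the `snapshot` value and the helper column names are assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical purchase history for two customers (invented data).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01 10:00", "2011-03-02 10:00",
         "2011-06-30 10:00", "2011-05-01 08:00"]
    ),
})
snapshot = pd.Timestamp("2011-07-10 10:00")  # assumed reference date

bounds = toy.groupby("CustomerID")["InvoiceDate"].agg(
    first_purchase="min", last_purchase="max"
)
summary = pd.DataFrame({
    # First purchase -> last purchase (0 for one-time buyers)
    "Recency_Model": (bounds["last_purchase"] - bounds["first_purchase"]).dt.days,
    # First purchase -> snapshot date
    "Tenure": (snapshot - bounds["first_purchase"]).dt.days,
})
print(summary)
```

Customer 1 gets `Recency_Model` 180 and `Tenure` 190, while the one-time buyer gets 0 and 70. By construction `Recency_Model` can never exceed `Tenure`.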

In [11]:
print("\nRFM DataFrame with adjusted features for probabilistic models:")
# Display columns relevant for 'lifetimes' library: Frequency_model (x), Recency_Model (T_x), Tenure (T)
print(rfm_df[['CustomerID', 'Recency', 'Frequency', 'Monetary', 'Tenure', 'Frequency_model', 'Recency_Model']].head())




RFM DataFrame with adjusted features for probabilistic models:
   CustomerID  Recency  Frequency  Monetary  Tenure  Frequency_model  Recency_Model
0       12346      326          1  77183.60     326                0              0
1       12347        2          7   4310.00     367                6            365
2       12348       75          4   1797.24     358                3            282
3       12349       19          1   1757.55      19                0              0
4       12350      310          1    334.40     310                0              0


#### Average order value
##### Now let's calculate the average order value (AOV): total spend divided by the number of transactions.

In [12]:
rfm_df['AOV'] = rfm_df['Monetary'] / rfm_df['Frequency']
# Guard against division by zero. 'Frequency' counts all transactions, so it is at
# least 1 for every customer here and this line is not expected to change any row.
rfm_df.loc[rfm_df['Frequency'] == 0, 'AOV'] = rfm_df['Monetary']
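
If a zero denominator could occur (for example, if AOV were computed from a repeat-purchase count instead), one possible way to keep the division safe is sketched below. The `toy` frame and its zero-frequency row are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Monetary": [300.0, 50.0, 0.0],
    "Frequency": [3, 1, 0],  # the 0 is a hypothetical edge case
})

# Replace 0 with NaN before dividing so the result is NaN rather than inf or an
# error, then fall back to Monetary for those rows.
denominator = toy["Frequency"].replace(0, np.nan)
toy["AOV"] = (toy["Monetary"] / denominator).fillna(toy["Monetary"])

print(toy["AOV"].tolist())  # [100.0, 50.0, 0.0]
```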


### Purchase Gap
##### Here we look at the individual invoice dates per customer and average the number of days between consecutive ones.

In [13]:
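# Calculate AvgPurchaseGap: mean number of whole days between consecutive rows.
# Note: df_cleaned has one row per line item, so rows belonging to the same
# invoice contribute zero-day gaps and pull the average down.
customer_invoice_gap = df_cleaned.groupby('CustomerID')['InvoiceDate'].apply(
    lambda x: x.sort_values().diff().dropna().dt.days.mean()
).reset_index(name='AvgPurchaseGap')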

# Before merging, make sure no duplicate 'AvgPurchaseGap' column exists
if 'AvgPurchaseGap' in rfm_df.columns:
    rfm_df = rfm_df.drop(columns=['AvgPurchaseGap'])

# Merge safely
rfm_df = pd.merge(rfm_df, customer_invoice_gap, on='CustomerID', how='left')

# Fill NaN values (i.e., single purchase customers)
rfm_df['AvgPurchaseGap'] = rfm_df['AvgPurchaseGap'].fillna(0)
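
Because the cleaned data has one row per line item, a gap computed over all rows includes zero-day gaps inside a single invoice. The sketch below, on invented data, contrasts that with a variant that first drops duplicate timestamps. The helper `mean_gap_days` and the `toy` frame are hypothetical, and the deduplicated variant is an alternative to consider rather than what the notebook computes.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01 10:00", "2011-01-01 10:00",  # two line items, one invoice
        "2011-01-11 10:00", "2011-01-31 10:00",
        "2011-02-01 09:00",                      # single-purchase customer
    ]),
})

def mean_gap_days(dates: pd.Series, dedupe: bool) -> float:
    """Mean gap in whole days between consecutive timestamps (NaN if no gap)."""
    ordered = dates.drop_duplicates() if dedupe else dates
    return ordered.sort_values().diff().dropna().dt.days.mean()

grouped = toy.groupby("CustomerID")["InvoiceDate"]
per_row = grouped.apply(lambda s: mean_gap_days(s, dedupe=False)).fillna(0)
per_invoice = grouped.apply(lambda s: mean_gap_days(s, dedupe=True)).fillna(0)

print(per_row.to_dict())      # {1: 10.0, 2: 0.0}
print(per_invoice.to_dict())  # {1: 15.0, 2: 0.0}
```

Customer 1 bought on three occasions spaced 10 and 20 days apart, so the deduplicated mean of 15 days reflects the gap between buying occasions, while the row-level mean is diluted to 10.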

### Product diversity - number of unique products bought

In [14]:
product_diversity_df = df_cleaned.groupby('CustomerID')['StockCode'].nunique().reset_index(name='ProductDiversity')
rfm_df = pd.merge(rfm_df, product_diversity_df, on='CustomerID', how='left')


In [16]:
print("\nRFM DataFrame with additional engineered features:")
print(rfm_df.head())



RFM DataFrame with additional engineered features:
   CustomerID  Recency  Frequency  Monetary       FirstPurchase  Tenure  NumTransactions  Frequency_model        LastPurchase  Recency_Model      AOV  AvgPurchaseGap  ProductDiversity
0       12346      326          1  77183.60 2011-01-18 10:01:00     326                1                0 2011-01-18 10:01:00              0 77183.60            0.00                 1
1       12347        2          7   4310.00 2010-12-07 14:57:00     367                7                6 2011-12-07 15:52:00            365   615.71            2.00               103
2       12348       75          4   1797.24 2010-12-16 19:09:00     358                4                3 2011-09-25 13:13:00            282   449.31            9.40                22
3       12349       19          1   1757.55 2011-11-21 09:51:00      19                1                0 2011-11-21 09:51:00              0  1757.55            0.00                73
4       12350      310      

In [18]:
final_features_df = rfm_df[['CustomerID', 'Recency', 'Frequency', 'Monetary', 'Tenure',
                            'Frequency_model', 'Recency_Model', # For probabilistic models
                            'AOV', 'AvgPurchaseGap', 'ProductDiversity']]

print("\nFinal Feature DataFrame head (ready for modeling):")
print(final_features_df.head())




Final Feature DataFrame head (ready for modeling):
   CustomerID  Recency  Frequency  Monetary  Tenure  Frequency_model  Recency_Model      AOV  AvgPurchaseGap  ProductDiversity
0       12346      326          1  77183.60     326                0              0 77183.60            0.00                 1
1       12347        2          7   4310.00     367                6            365   615.71            2.00               103
2       12348       75          4   1797.24     358                3            282   449.31            9.40                22
3       12349       19          1   1757.55      19                0              0  1757.55            0.00                73
4       12350      310          1    334.40     310                0              0   334.40            0.00                17


In [19]:
print(f"Shape of final feature DataFrame: {final_features_df.shape}")
print("\nFinal feature DataFrame info:")
final_features_df.info()

Shape of final feature DataFrame: (4338, 10)

Final feature DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4338 entries, 0 to 4337
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   CustomerID        4338 non-null   int64  
 1   Recency           4338 non-null   int64  
 2   Frequency         4338 non-null   int64  
 3   Monetary          4338 non-null   float64
 4   Tenure            4338 non-null   int64  
 5   Frequency_model   4338 non-null   int64  
 6   Recency_Model     4338 non-null   int64  
 7   AOV               4338 non-null   float64
 8   AvgPurchaseGap    4338 non-null   float64
 9   ProductDiversity  4338 non-null   int64  
dtypes: float64(3), int64(7)
memory usage: 339.0 KB
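
Before saving, a few invariants implied by the definitions above can be checked. This is a minimal sketch: `check_feature_invariants` is a hypothetical helper, not part of the notebook, and the `sample` frame reuses the first two rows from the output above with a subset of columns.

```python
import pandas as pd

def check_feature_invariants(features: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if features["CustomerID"].duplicated().any():
        problems.append("duplicate CustomerID rows")
    if (features["Frequency_model"] < 0).any():
        problems.append("negative Frequency_model")
    # Last purchase cannot come after the snapshot date
    if (features["Recency_Model"] > features["Tenure"]).any():
        problems.append("Recency_Model exceeds Tenure")
    if features.isna().any().any():
        problems.append("missing values present")
    return problems

# First two rows copied from the output above (subset of columns).
sample = pd.DataFrame({
    "CustomerID": [12346, 12347],
    "Tenure": [326, 367],
    "Frequency_model": [0, 6],
    "Recency_Model": [0, 365],
})
print(check_feature_invariants(sample))  # []
```

In the notebook this could be called as `check_feature_invariants(final_features_df)`.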


## Save the feature-engineered data

In [20]:

output_path = '../data/processed/customer_features_rfm.csv'
final_features_df.to_csv(output_path, index=False)
print(f"\nFeature-engineered data saved to: {output_path}")


Feature-engineered data saved to: ../data/processed/customer_features_rfm.csv


---
## Feature Engineering

We successfully transformed the granular transactional data into a set of customer-level features essential for CLTV prediction. Key achievements include:

* **RFM (Recency, Frequency, Monetary) Calculation**: Derived these core behavioral metrics for each customer, based on a defined snapshot date.
* **Customer Tenure**: Calculated the age of each customer from their first purchase, a vital indicator of potential lifetime value.
* **Probabilistic Model Adjustments**: Created `Frequency_model` and `Recency_Model` specifically tailored for the `lifetimes` library's probabilistic models (BG/NBD, Gamma-Gamma), which require a slightly different interpretation of RFM.
* **Advanced Features**: Went beyond basic RFM by engineering `Average Order Value (AOV)`, `Average Purchase Gap`, and `Product Diversity`. These features provide deeper insights into customer behavior and will enrich our predictive models.

The `customer_features_rfm.csv` dataset, now containing these comprehensive customer profiles, is saved in the `data/processed/` directory. It is fully prepared to be used as input for our CLTV prediction models in the next phase.