## <span style="color:#956bbf">RFM: Recency, Frequency, Monetary value Summary</span>
---

### <span style="color:#956bbf">Introduction</span>
---

In order to estimate the parameters of **transaction-flow models** such as the Pareto/NBD and BG/NBD, as well as those of the associated models for **spend per transaction** (spend model), we need an RFM (recency, frequency, monetary value) summary of each customer’s purchasing behavior. In particular,
1. The **transaction-flow model** requires three pieces of information about each customer’s purchasing history: their “recency” (when their last transaction occurred), “frequency” (how many transactions they made in a specified time period), and the length of time over which we have observed their purchasing behavior. The notation used to represent this information is $(x, t_x, T)$, where $x$ is the number of transactions observed in the time period $(0, T]$ and $t_x$ $(0 \le t_x \le T)$ is the time of the last transaction.
2. The **spend model** requires two pieces of information about each customer’s purchasing history: the average “monetary value” of each transaction (denoted by $m_x$) and the number of transactions over which this average is computed (i.e., frequency, $x$).
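
To make the notation concrete, the following sketch computes $(x, t_x, T)$ and $m_x$ for a single hypothetical customer in plain Python. The transaction times and amounts are invented for illustration, time is measured in weeks from the start of the observation window, and the function name `rfm_summary` is an assumption rather than part of any library. Setting $t_x = 0$ and $m_x = 0$ when $x = 0$ is a convention assumed here.

```python
from statistics import mean

def rfm_summary(transactions, T):
    """Return (x, t_x, T, m_x) for one customer.

    transactions: list of (time, spend) pairs, with time measured from the
    start of the observation window (hypothetical toy data below).
    T: length of the observation window, in the same time unit.
    """
    # Keep only transactions that fall inside the window (0, T].
    in_window = [(t, s) for t, s in transactions if 0 < t <= T]
    x = len(in_window)                                # frequency
    # Recency: time of the last transaction (assumed 0 when x == 0).
    t_x = max((t for t, _ in in_window), default=0)
    # Monetary value: average spend per transaction (assumed 0.0 when x == 0).
    m_x = mean(s for _, s in in_window) if x > 0 else 0.0
    return x, t_x, T, m_x

# Hypothetical customer: four transactions, the last one after week 39.
toy = [(2.0, 20.0), (10.0, 30.0), (25.0, 40.0), (45.0, 15.0)]
x, t_x, T, m_x = rfm_summary(toy, T=39.0)
print(x, t_x, T, m_x)
```

The transaction at week 45 falls outside $(0, T]$, so it contributes to neither the frequency nor the average spend.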

### <span style="color:#956bbf">Imports</span>
---

#### Import Packages

In [1]:
import polars as pl
import numpy as np

#### Import Data

We will make use of the CDNOW dataset. The master dataset contains the entire purchase history up to the end of June 1998 of the cohort of 23,570 individuals who made their first-ever purchase at CDNOW in the first quarter of 1997. 

The file `CDNOW_sample.csv` contains purchasing data for a 1/10th systematic sample of the whole cohort (2357 customers). Each record in this file, 6919 in total, comprises five fields: the customer’s ID in the master dataset, the customer’s ID in the 1/10th sample dataset (ranging from 1 to 2357), the date of the transaction, the number of CDs purchased, and the dollar value of the transaction.

In [None]:
CDNOW_master = (
    pl.scan_csv(source = 'data/CDNOW/CDNOW_master.csv', 
                has_header=False, 
                separator=',', 
                schema={'CustID': pl.Int32,     # customer id
                        'Date': pl.String,      # transaction date
                        'Quant': pl.Int16,      # number of CDs purchased
                        'Spend': pl.Float64})   # dollar value (excl. S&H)
    .with_columns(pl.col('Date').str.to_date("%Y%m%d"))
    .with_columns((pl.col('Date') - pl.date(1996,12,31)).dt.total_days().cast(pl.UInt16).alias('PurchDay'))
    .with_columns((pl.col('Spend')*100).round(0).cast(pl.Int64).alias('Spend Scaled'))
    .group_by('CustID', 'Date')
    .agg(pl.col('*').exclude('PurchDay').sum(), pl.col('PurchDay').max()) # Multiple transactions by a customer on a single day are aggregated into one
    .sort('CustID', 'Date')
    .with_columns((pl.col("CustID").cum_count().over("CustID") - 1).cast(pl.UInt16).alias("DoR"))  # DoR = Depth of Repeat ('Transaction' time: starts with 0 as trial, 1 as 1st repeat and so on)
)

In [3]:
# MATLAB Sampling (due to numerical float precision handling differences, original sampling results cannot be replicated unless spend is scaled in MATLAB)
CDNOW_sample = (
    pl.scan_csv(source='data/CDNOW/CDNOW_sample.csv',
                has_header=False,
                separator=',',
                schema={'CustID': pl.Int32,
                        'NewID': pl.Int32,
                        'Date': pl.String,
                        'Quant': pl.Int16,
                        'Spend': pl.Float64})
    .with_columns(pl.col('Date').str.to_date("%Y%m%d"))
    .with_columns((pl.col('Date') - pl.date(1996,12,31)).dt.total_days().cast(pl.UInt16).alias('PurchDay'))
    .with_columns((pl.col('Spend')*100).round(0).cast(pl.Int64).alias('Spend Scaled'))
    .group_by('CustID', 'Date')
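    .agg(pl.col('NewID').first(), pl.col('Quant', 'Spend', 'Spend Scaled').sum(), pl.col('PurchDay').max())  # NewID is constant within a customer, so keep it as-is instead of summing it across same-day transactions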
    .sort('CustID', 'Date')
    .with_columns((pl.col("CustID").cum_count().over("CustID") - 1).cast(pl.UInt16).alias("DoR"))      
    .drop('CustID', 'Quant')
    .rename({'NewID': 'ID'})
)

CDNOW_sample.collect()

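| Date | ID | Spend | Spend Scaled | PurchDay | DoR |
|---|---|---|---|---|---|
| date | i32 | f64 | i64 | u16 | u16 |
| 1997-01-01 | 1 | 29.33 | 2933 | 1 | 0 |
| 1997-01-18 | 1 | 29.73 | 2973 | 18 | 1 |
| 1997-08-02 | 1 | 14.96 | 1496 | 214 | 2 |
| 1997-12-12 | 1 | 26.48 | 2648 | 346 | 3 |
| 1997-01-04 | 58 | 14.96 | 1496 | 4 | 0 |
| … | … | … | … | … | … |
| 1997-07-26 | 2356 | 45.74 | 4574 | 207 | 3 |
| 1997-09-27 | 2356 | 31.47 | 3147 | 270 | 4 |
| 1998-01-03 | 2356 | 28.98 | 2898 | 368 | 5 |
| 1998-06-07 | 2356 | 28.98 | 2898 | 523 | 6 |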


1. We assume that the records in the raw transaction data file are grouped by customer, and sorted within customer by date of transaction. If in doubt, sort the raw dataset by customer ID and date of transaction.
2. As they are of no interest to us in this particular case, we delete the first and fourth columns (master dataset customer ID and # CDs purchased, respectively). 
3. We note that some customers had more than one transaction on a given day. For example, customer 26 had two separate transactions on 13 January 1997, while customer 46 had two separate transactions on 28 August 1997. There are about 233 such "additional" transactions. The transaction-flow models are developed by telling a story about **interpurchase times**. As we only know the date (and not the time) of each transaction, we need to aggregate the records associated with same-day transactions, because we can’t have an interpurchase time of 0.
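
The same-day aggregation described in the notes above can be illustrated on a toy example. This is a minimal sketch in plain Python, not the polars pipeline itself. The records are hypothetical, spend is held in integer cents (mirroring the `Spend Scaled` column), and the helper name `aggregate_same_day` is an assumption.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records: (customer ID, transaction date, spend in cents).
raw = [
    (1, date(1997, 1, 13), 1500),
    (1, date(1997, 1, 13), 1200),   # second transaction on the same day
    (1, date(1997, 3, 2), 2000),
    (2, date(1997, 2, 5), 999),
]

def aggregate_same_day(records):
    """Sum spend per (customer, date) and attach the depth of repeat (DoR)."""
    totals = defaultdict(int)
    for cust, day, cents in records:
        totals[(cust, day)] += cents

    rows, seen = [], defaultdict(int)
    # Sorting by (customer, date) means DoR is assigned in chronological order.
    for (cust, day), cents in sorted(totals.items()):
        rows.append((cust, day, cents, seen[cust]))  # DoR: 0 = first purchase
        seen[cust] += 1
    return rows

daily = aggregate_same_day(raw)
extra = len(raw) - len(daily)   # number of "additional" same-day records
print(daily, extra)
```

After aggregation, every customer has at most one record per day, so consecutive records for a customer are always at least one day apart.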

### <span style="color:#956bbf">“Frequency” and “Monetary Value”</span>
---

We now compute the frequency and monetary value summaries for each customer.

Most of the previous analyses undertaken using this dataset have split the 78 weeks of data in half, creating a 39-week calibration period (1997-01-01–1997-09-30) and a 39-week validation period (1997-10-01–1998-06-30). Furthermore, these analyses have generally ignored each customer’s first-ever purchase at CDNOW, which signals the start of the customer’s “relationship” with the firm; this means calibration-period “frequency” has usually been the number of repeat transactions, and “monetary value” has been the average dollar value per repeat transaction.
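
As a rough illustration of this convention, the sketch below computes calibration-period frequency and monetary value for the customer with ID 1 in the sample output shown earlier, using the `Spend Scaled` values (cents) to avoid floating-point rounding. It is written in plain Python rather than polars. The function name `calibration_summary` is hypothetical, and the sketch assumes the first-ever purchase falls inside the calibration period and that same-day transactions have already been aggregated.

```python
from datetime import date

CALIB_END = date(1997, 9, 30)   # last day of the 39-week calibration period

def calibration_summary(purchases):
    """Return (frequency, monetary value) over the calibration period.

    purchases: date-sorted list of (date, spend in cents) for one customer,
    with one record per day. The first-ever purchase is ignored, so
    frequency counts repeat transactions only.
    """
    repeats = [(d, s) for d, s in purchases[1:] if d <= CALIB_END]
    x = len(repeats)                                     # repeat transactions
    # Average spend per repeat transaction (assumed 0.0 when x == 0).
    m_x = sum(s for _, s in repeats) / x if x else 0.0
    return x, m_x

# Customer ID 1 from the sample output (dates and Spend Scaled values).
customer_1 = [
    (date(1997, 1, 1), 2933),    # first-ever purchase, ignored
    (date(1997, 1, 18), 2973),
    (date(1997, 8, 2), 1496),
    (date(1997, 12, 12), 2648),  # falls in the validation period
]
x, m_x = calibration_summary(customer_1)
print(x, m_x)
```

Only the two repeat purchases made on or before 1997-09-30 count, giving a frequency of 2 and an average of 2234.5 cents per repeat transaction.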