## <span style="color:#956bbf">Depth-of-Repeat Model - Sales Summary & Sales Forecasting</span>
---

### <span style="color:#956bbf">Introduction</span>

**Source**:
- [Creating a Depth-of-Repeat Sales Summary Using Excel](https://www.brucehardie.com/notes/006/) by Bruce G. S. Hardie & Peter S. Fader
- [Generating a Sales Forecast With a Simple Depth-of-Repeat Model](https://www.brucehardie.com/notes/007/) by Bruce G. S. Hardie & Peter S. Fader

Central to diagnosing the performance of a new product is the decomposition of its total sales into trial, first repeat, second repeat, and subsequent repeat components. More formally, we are interested in creating a summary of purchasing that tells us, for each unit of time (e.g., week), the cumulative number of people who have made a trial (i.e., first-ever) purchase, a first repeat (i.e., second-ever) purchase, a second repeat purchase, and so on. We let $T(t)$ denote the cumulative number of people who have made a trial purchase by time $t$, and $R_{j}(t)$ denote the number of people who have made at least $j$ repeat purchases of the new product by time $t$ ($j=1,2,3,\ldots$).

With such a data summary in place, standard new product performance metrics such as “percent triers repeating” and “repeats per repeater” are easily computed from these data. At any point in time $t$, percent triers repeating is computed as $R_{1}(t)/T (t)$, while repeats per repeater is computed as $R(t)/R_{1}(t)$, where $R(t)$ is the total number of repeat purchases up to time $t$:

$$
R(t)=\sum_{j=1}^{\infty} R_{j}(t)
$$
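
As a rough illustration of these two metrics, the sketch below plugs in the year-end Market 2 counts that are reported later in this notebook (the column totals of the transactions summary). The variable names are chosen here for illustration only and are not part of the original notes.

```python
# Year-end cumulative counts for Market 2, taken from the column totals of the
# transactions summary later in this notebook: index 0 is T(t), index j is R_j(t).
cumulative_by_depth = [139, 52, 31, 23, 17, 14, 9, 8, 6, 4, 2, 1]

trial = cumulative_by_depth[0]                # T(t)
first_repeat = cumulative_by_depth[1]         # R_1(t)
total_repeat = sum(cumulative_by_depth[1:])   # R(t) = sum over j >= 1 of R_j(t)

percent_triers_repeating = first_repeat / trial
repeats_per_repeater = total_repeat / first_repeat

print(f"Percent triers repeating: {percent_triers_repeating:.1%}")  # 37.4%
print(f"Repeats per repeater:     {repeats_per_repeater:.2f}")      # 3.21
```

The infinite sum in $R(t)$ is finite in practice because $R_{j}(t)=0$ beyond the deepest repeat level observed (eleven in this dataset).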

Furthermore, a simple new product sales forecasting model can easily be built around such a data summary.

We describe how to create such a sales summary from raw customer-level transaction data (typically collected via a consumer panel) using Python.

A consumer panel is formed by selecting a representative sample of individuals or households (from the population of interest) and recording their complete behaviour (e.g., purchasing of FMCG products) over a fixed period of time.

For a given product category, we can construct a dataset that reports the timing of each purchase, along with details of the product purchased. A stylized representation of this is given in the figure (Purchase Histories: Total Category) below for a total of $n$ households, in which we consider an observation period starting with the launch of a new product and ending at time $t_{\text {end }}$. We let $\diamond$ denote a purchase of the new product and $\times$ denote the purchase of any other product in the category.

<div style="max-width:400px;margin-left: auto; margin-right: auto;">
<img src="references\Depth-of-Repeat\Purchase Histories-Total Category.png"/>
</div>

We see that HH 1 made three category purchases over the observation period but never purchased the new product. HH 2 made seven category purchases; the third category purchase represents a trial purchase of the new product, and no repeat purchasing activity was observed over the remainder of the observation period. HH 3 made a trial purchase of the new product and two repeat purchases. And so on.

In many analysis situations, where we are focusing on a particular product, purchase records not associated with the focal product are removed to yield a simpler (and smaller) dataset. A stylized representation of this is given in the figure (Purchase Histories: New Product Only) below. As HH 1 never bought the new product, there is no explicit record of this household in the resulting dataset.

<div style="max-width:400px;margin-left: auto; margin-right: auto;">
<img src="references\Depth-of-Repeat\Purchase Histories-New Product Only.png"/>
</div>

### <span style="color:#956bbf">Imports</span>
---

#### Import Packages

In [1]:
import polars as pl
import numpy as np
import altair as alt
from great_tables import GT, style, loc, md
import gc

alt.renderers.enable('mimetype')

RendererRegistry.enable('mimetype')

#### Import Panel Data

"Kiwi Bubbles" is a masked name for a shelf-stable juice drink, aimed primarily at children, which is sold as a multipack with several single-serve containers bundled together. Prior to national launch, it underwent a year-long test conducted in two of IRI's BehaviorScan markets. The file `kiwibubbles_tran.txt` contains purchasing data for the new product, drawn from 1300 panelists in Market 1 and 1499 panelists in Market 2.

Each record in this file comprises five fields: *Panelist ID*, *Market*, *Week*, *Day*, and *Units*. The value of the Market field is either 1 or 2. The Week field gives us the week number in which the purchase occurred (the product was launched at the beginning of week 1), the Day field tells us the day of the week (1-7) in which the purchase occurred, and the Units field tells us how many units of the new product were purchased on that particular purchase occasion.

We load this dataset into Python. We see that there are a total of 857 transactions across the two markets during the year-long test.

In [49]:
kiwi_lf = pl.scan_csv(source="data/kiwibubbles/kiwibubbles_tran.csv",
                      has_header=False,
                      separator=",",
                      schema={'ID': pl.UInt16,
                              'Market': pl.UInt8,
                              'Week': pl.Int16,
                              'Day': pl.Int16,
                              'Units': pl.Int16})
kiwi_lf.head().collect()

ID,Market,Week,Day,Units
u16,u8,i16,i16,i16
10001,1,19,3,1
10002,1,12,5,1
10003,1,37,7,1
10004,1,30,6,1
10004,1,47,3,1


To illustrate the process of creating a sales by depth-of-repeat summary from this raw transaction data, we will focus just on **Market 2**.

In [50]:
kiwi_lf_m2 = (kiwi_lf.filter(pl.col('Market') == 2).drop('Market'))
num_panellists_m2 = 1499

### <span style="color:#956bbf">Creating a Depth-of-Repeat Sales Summary</span>
---

#### Preliminaries

Let us consider the transaction history of panelist 20014. We note that this panelist made his/her trial purchase and first repeat purchase in week 4. Similarly, this panelist's third and fourth repeat purchases occurred in week 7. We also note that this panelist typically purchased several units of the product on any given purchase occasion.

In [4]:
kiwi_lf_m2.filter(pl.col('ID') == 20014).collect()

ID,Week,Day,Units
u16,i16,i16,i16
20014,4,2,1
20014,4,4,1
20014,6,6,2
20014,7,2,3
20014,7,6,3
20014,12,5,2
20014,17,6,1
20014,23,4,2
20014,47,6,2


This suggests several possible versions of the desired sales by depth-of-repeat summary:
- Our summary counts the number of trial, first repeat, second repeat, etc. transactions that occurred each week. The process of creating such a summary is described in section '[Creating a “Raw” Transactions Summary](#raw_trans_summary)' below.
- Our summary reports the *sales volume* (e.g., units) associated with trial, first repeat, second repeat, etc. transactions that occurred each week. The process of creating such a summary is described in section '[Creating a “Raw” Sales Volume Summary](#raw_vol_summary)' below.
- We have noted that this panelist's trial and first repeat purchases occurred in the same week, albeit on different days. Similarly, his/her third and fourth repeat purchases occurred in the same week. The structure of many simple models of new product sales forecasting is such that a customer can have only one transaction per unit of time. If the unit of time is one week (as is typically the case), we clearly have a problem. One solution would be to change the unit of time from week to day. However, as such purchasing behaviour tends to be rare, Eskin suggests that, "[f]or estimation purposes, second purchases within a single week are coded in the following week." The process of creating such a "shifted" summary is described in section '[Creating a "Shifted" Transactions Summary](#raw_shifted_trans_summary)' below.

But what happens if we observe multiple transactions on the same day? This is very rare and typically reflects bad pre-processing of the panel data. For example, as an individual's purchases are scanned at the supermarket checkout, one six-pack of Coke could be the first item scanned with another six-pack of Coke being the last item scanned. When the raw data are "cleaned up", these two purchases should be combined into one transaction with a quantity of two, but this doesn't always happen. If the (very) raw panel data file contains a transaction time field, we can easily determine whether the two records come from the same or different shopping trips. Even if they did come from separate shopping trips on the same day, our natural reaction would be to combine them into a single transaction with multiple units, rather than shift to an even smaller time unit (e.g., hour). We should therefore check, once the raw panel data has been loaded into Python, whether any individual panelist has multiple transactions on the same day. (Note that there are no such occurrences in the Kiwi Bubbles dataset.)
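
One way to run such a check is sketched below in plain Python. The records are hypothetical (panelist 29999 is invented to show a same-day duplicate), and the merge step simply follows the suggestion above of combining same-day records into one transaction with summed units.

```python
from collections import Counter

# Hypothetical (ID, Week, Day, Units) records; panelist 29999 is invented
# for illustration and has two records on the same day.
records = [
    (20014, 4, 2, 1),
    (20014, 4, 4, 1),
    (29999, 10, 3, 1),
    (29999, 10, 3, 1),
]

# Count records per (ID, Week, Day); any count above 1 is a same-day duplicate.
per_day = Counter((pid, week, day) for pid, week, day, _ in records)
same_day = {key: n for key, n in per_day.items() if n > 1}
print(same_day)  # {(29999, 10, 3): 2}

# Combine same-day records into a single transaction with summed units.
merged = {}
for pid, week, day, units in records:
    key = (pid, week, day)
    merged[key] = merged.get(key, 0) + units
print(merged[(29999, 10, 3)])  # 2
```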

#### Creating a "Raw" Transactions Summary
<a id='raw_trans_summary'></a>

The first thing we need to do is add a field that indicates the depth-of-repeat level associated with each record; i.e., is this a trial purchase ($\mathrm{DoR}=0$), a first repeat purchase ($\mathrm{DoR}=1$), a second repeat purchase ($\mathrm{DoR}=2$), etc.

This is a straightforward exercise. If the panelist ID associated with this record does not equal that of the previous record, we are dealing with a new panelist and we set the depth-of-repeat indicator to 0. If the panelist ID associated with this record does equal that of the previous record, we are dealing with a repeat purchase by that panelist and we increment the depth-of-repeat indicator by 1.
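
The previous-record rule can be sketched as a plain Python loop. This is only an illustration of the logic on a few records from panelists 20014 and 20069; it assumes the records are already sorted by panelist and then chronologically.

```python
# (ID, Week, Day) records for two Market 2 panelists, sorted by panelist and time.
records = [
    (20014, 4, 2), (20014, 4, 4), (20014, 6, 6),
    (20069, 18, 1), (20069, 18, 5), (20069, 19, 4),
]

dor = []
previous_id = None
depth = 0
for pid, week, day in records:
    if pid != previous_id:
        depth = 0      # new panelist: this is a trial purchase
    else:
        depth += 1     # same panelist: one level deeper in repeat
    dor.append(depth)
    previous_id = pid

print(dor)  # [0, 1, 2, 0, 1, 2]
```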

In [51]:
kiwi_lf_m2 = (
    kiwi_lf_m2
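    # Sort chronologically within each panelist so the running count follows purchase order
    .sort(by=['ID', 'Week', 'Day'])
    .with_columns((pl.col("ID").cum_count().over("ID") - 1).cast(pl.UInt16).alias("DoR"))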
)

The corresponding records for panelist 20014 are shown below with the DoR indicator:

In [6]:
kiwi_lf_m2.filter(pl.col('ID') == 20014).collect()

ID,Week,Day,Units,DoR
u16,i16,i16,i16,u16
20014,4,2,1,0
20014,4,4,1,1
20014,6,6,2,2
20014,7,2,3,3
20014,7,6,3,4
20014,12,5,2,5
20014,17,6,1,6
20014,23,4,2,7
20014,47,6,2,8


The next step is to aggregate the data and create two versions of the same dataframe: a **long-form** and a **wide-form**.

We want there to be 52 rows, one for each week of the test. It turns out, however, that this panel of 1499 households only purchased the test product in 49 weeks; no purchases occurred in weeks 25, 39, and 41. How can we create a table that will contain zeros in the rows corresponding to these three weeks? We accomplish this by creating a dummy dataframe that contains every combination of `Week` and `DoR`: weeks run from 1 to 52 and depth-of-repeat levels run from 0 to 11. Next, we left-join the aggregated dataframes of the main dataset onto the dummy dataframe, so that the result preserves the size of the dummy dataframe. The empty combinations come through as `null` values, which we then replace with zeros.
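
The grid-and-fill idea can be illustrated without any dataframe library. The sketch below is a simplified stand-in for the join, using the observed weekly counts for weeks 1 to 5 at depths 0 and 1 from this dataset; the deliberately smaller grid (weeks 1 to 5, depths 0 to 2) is an assumption made to keep the example short.

```python
from itertools import product

# Observed (Week, DoR) -> transaction count; combinations with no purchases are absent.
sparse_counts = {
    (1, 0): 8, (1, 1): 1,
    (2, 0): 6,
    (3, 0): 2, (3, 1): 1,
    (4, 0): 16, (4, 1): 2,
    (5, 0): 8, (5, 1): 3,
}

weeks = range(1, 6)    # small illustrative range (the notebook uses 1 to 52)
depths = range(0, 3)   # small illustrative range (the notebook uses 0 to 11)

# Every (Week, DoR) combination appears exactly once; missing ones become 0.
full_counts = {(w, d): sparse_counts.get((w, d), 0) for w, d in product(weeks, depths)}

print(len(full_counts))     # 15
print(full_counts[(2, 1)])  # 0
```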

In [7]:
# Week Range: 1 to 52, DoR Range: 0 to 11 (max(DoR) = 11)
week_range, dor_range = np.meshgrid(np.arange(1, 53, dtype='int16'), np.arange(0, 12, dtype='uint16'))
# Create a dummy LazyFrame that contains the full range of combinations for Week & DoR
dummy_lf = pl.LazyFrame({'Week': week_range.reshape(-1), 'DoR': dor_range.reshape(-1)})

agg_trans = (
    kiwi_lf_m2
    .group_by('Week', 'DoR')
    .agg(pl.len().alias('Count'))
)

week_total_trans = (
    agg_trans
    .group_by('Week')
    .agg(pl.col('Count').sum().alias('Total')) 
)

agg_trans_longform = (
    dummy_lf
    .join(agg_trans, on=['Week', 'DoR'], how='left')
    .join(week_total_trans, on='Week', how='left')
    .fill_null(0)
)

The wide-form is more intuitive and easier to read: it tells us how many trial, first repeat, etc. purchases (columns) occurred in each week (rows).

In [8]:
agg_trans_wideform = (
    agg_trans_longform
    .collect()
    .pivot(on='DoR', index='Week', values='Count')
    # Left join keeps all 52 weeks; weeks with no purchases get a Total of 0
    .join(week_total_trans.collect(), on='Week', how='left')
    .fill_null(0)
)

col_total = agg_trans_wideform.select(pl.col('*').exclude('Week').sum())

display(agg_trans_wideform)
display(col_total)

Week,0,1,2,3,4,5,6,7,8,9,10,11,Total
i16,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
1,8,1,0,0,0,0,0,0,0,0,0,0,9
2,6,0,0,0,0,0,0,0,0,0,0,0,6
3,2,1,0,0,0,0,0,0,0,0,0,0,3
4,16,2,0,0,0,0,0,0,0,0,0,0,18
5,8,3,0,0,0,0,0,0,0,0,0,0,11
…,…,…,…,…,…,…,…,…,…,…,…,…,…
48,1,1,1,1,0,0,0,1,0,0,0,0,5
49,4,0,0,0,0,2,0,1,1,0,0,0,8
50,0,2,0,0,0,0,0,1,2,1,1,1,8
51,0,1,0,0,0,0,0,1,0,0,0,0,2


0,1,2,3,4,5,6,7,8,9,10,11,Total
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
139,52,31,23,17,14,9,8,6,4,2,1,306


We note that no purchases were recorded in weeks 25, 39, and 41. Over the year-long test, 139 of the 1499 panelists made at least one purchase of the new product, with a total of 306 purchase occasions. We also note that by the end of the year, one person had made eleven repeat purchases of the new product.

A cleaned-up summary that reports these weekly transactions in cumulative form (i.e., $T(t), R_{1}(t), R_{2}(t)$, etc.) is created below:

In [9]:
cum_trans_longform = agg_trans_longform.with_columns(pl.col('Count').cum_sum().over('DoR').alias('Cum DoR'))
cum_trans_wideform = cum_trans_longform.collect().pivot(on='DoR', index='Week', values='Cum DoR')

display(cum_trans_wideform)

Week,0,1,2,3,4,5,6,7,8,9,10,11
i16,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
1,8,1,0,0,0,0,0,0,0,0,0,0
2,14,1,0,0,0,0,0,0,0,0,0,0
3,16,2,0,0,0,0,0,0,0,0,0,0
4,32,4,0,0,0,0,0,0,0,0,0,0
5,40,7,0,0,0,0,0,0,0,0,0,0
…,…,…,…,…,…,…,…,…,…,…,…,…
48,133,48,30,22,17,12,9,5,3,2,1,0
49,137,48,30,22,17,14,9,6,4,2,1,0
50,137,50,30,22,17,14,9,7,6,3,2,1
51,137,51,30,22,17,14,9,8,6,3,2,1
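
As a quick check of the cumulative logic, the sketch below takes the weekly trial and first repeat counts for weeks 1 to 5 from the raw transactions summary and accumulates them with the standard library; the results should match the first five rows of the cumulative table above.

```python
from itertools import accumulate

# Weekly counts for weeks 1 to 5, read from the raw transactions summary.
weekly_trials = [8, 6, 2, 16, 8]
weekly_first_repeats = [1, 0, 1, 2, 3]

# Running totals give T(t) and R_1(t) for t = 1, ..., 5.
cum_trials = list(accumulate(weekly_trials))
cum_first_repeats = list(accumulate(weekly_first_repeats))

print(cum_trials)         # [8, 14, 16, 32, 40]
print(cum_first_repeats)  # [1, 1, 2, 4, 7]
```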


#### Creating a "Raw" Sales Volume Summary
<a id='raw_vol_summary'></a>

Having created a weekly transactions summary by depth-of-repeat level, we can easily create an equivalent **sales volume** (e.g., units) summary. Instead of counting the rows in each `Week`/`DoR` group, we sum `Units`. We note that a total of 396 units of the product were purchased (across the 306 purchase occasions).

In [10]:
agg_vol = (
    kiwi_lf_m2
    .group_by('Week', 'DoR')
    .agg(pl.col('Units').sum().alias('Units'))
)

week_total_vol = (
    agg_vol
    .group_by('Week')
    .agg(pl.col('Units').sum().alias('Total')) 
)

agg_vol_longform = (
    dummy_lf
    .join(agg_vol, on=['Week', 'DoR'], how='left')
    .join(week_total_vol, on='Week', how='left')
    .fill_null(0)
)

In [11]:
agg_vol_wideform = (
    agg_vol_longform
    .collect()
    .pivot(on='DoR', index='Week', values='Units')
    # Left join keeps all 52 weeks; weeks with no purchases get a Total of 0
    .join(week_total_vol.collect(), on='Week', how='left')
    .fill_null(0)
)

col_total = agg_vol_wideform.select(pl.col('*').exclude('Week').sum())

display(agg_vol_wideform)
display(col_total)

Week,0,1,2,3,4,5,6,7,8,9,10,11,Total
i16,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
1,9,1,0,0,0,0,0,0,0,0,0,0,10
2,6,0,0,0,0,0,0,0,0,0,0,0,6
3,2,1,0,0,0,0,0,0,0,0,0,0,3
4,19,3,0,0,0,0,0,0,0,0,0,0,22
5,8,3,0,0,0,0,0,0,0,0,0,0,11
…,…,…,…,…,…,…,…,…,…,…,…,…,…
48,1,1,1,1,0,0,0,1,0,0,0,0,5
49,4,0,0,0,0,2,0,2,1,0,0,0,9
50,0,2,0,0,0,0,0,1,3,2,2,1,11
51,0,2,0,0,0,0,0,2,0,0,0,0,4


0,1,2,3,4,5,6,7,8,9,10,11,Total
i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
161,64,42,30,24,20,14,19,10,7,4,1,396


A cleaned-up summary that reports these weekly sales volumes in cumulative form is created below. We note that a total of 161 units were purchased on the 139 trial purchase occasions, an average of 1.16 units per trial purchase.
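
The same ratio can be computed at every depth-of-repeat level. The sketch below uses the column totals from the transactions and sales volume summaries above; the variable names are illustrative.

```python
# Column totals by depth-of-repeat level (0 = trial, 1 = first repeat, ...).
transactions = [139, 52, 31, 23, 17, 14, 9, 8, 6, 4, 2, 1]
units = [161, 64, 42, 30, 24, 20, 14, 19, 10, 7, 4, 1]

# Average units bought per purchase occasion at each level, and overall.
units_per_transaction = [u / n for u, n in zip(units, transactions)]
overall = sum(units) / sum(transactions)

print(f"Trial:   {units_per_transaction[0]:.2f}")  # 1.16
print(f"Overall: {overall:.2f}")                   # 1.29
```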

In [12]:
cum_vol_longform = agg_vol_longform.with_columns(pl.col('Units').cum_sum().over('DoR').alias('Cum DoR'))
cum_vol_wideform = cum_vol_longform.collect().pivot(on='DoR', index='Week', values='Cum DoR')

display(cum_vol_wideform)

Week,0,1,2,3,4,5,6,7,8,9,10,11
i16,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
1,9,1,0,0,0,0,0,0,0,0,0,0
2,15,1,0,0,0,0,0,0,0,0,0,0
3,17,2,0,0,0,0,0,0,0,0,0,0
4,36,5,0,0,0,0,0,0,0,0,0,0
5,44,8,0,0,0,0,0,0,0,0,0,0
…,…,…,…,…,…,…,…,…,…,…,…,…
48,155,59,41,28,24,18,14,14,6,3,2,0
49,159,59,41,28,24,20,14,16,7,3,2,0
50,159,61,41,28,24,20,14,17,10,5,4,1
51,159,63,41,28,24,20,14,19,10,5,4,1


#### Creating a "Shifted" Transactions Summary
<a id='raw_shifted_trans_summary'></a>

We now turn our attention to the task of creating a weekly transaction by depth-of-repeat level summary under the assumption that a customer can have only one transaction per week. In other words, a second purchase within a single week is "shifted" to the next week (i.e., coded as occurring in the following week).

Referring back to panelist `20014`, the field that indicates the depth-of-repeat level associated with each record (DoR) is correct. What we need to do is make some changes to the week field: we want the week associated with the first repeat purchase to be 5, and the week associated with the fourth repeat purchase to be 8. One solution would be to create a new week variable that equals the original week variable + 1 if the week associated with the current record is the same as that of the previous record. But what if we have three purchases occurring in the same week?

In [13]:
kiwi_lf_m2.filter(pl.col('ID') == 20014).collect()

ID,Week,Day,Units,DoR
u16,i16,i16,i16,u16
20014,4,2,1,0
20014,4,4,1,1
20014,6,6,2,2
20014,7,2,3,3
20014,7,6,3,4
20014,12,5,2,5
20014,17,6,1,6
20014,23,4,2,7
20014,47,6,2,8


In [14]:
kiwi_lf_m2.filter(pl.col('ID') == 20069).collect()

ID,Week,Day,Units,DoR
u16,i16,i16,i16,u16
20069,18,1,1,0
20069,18,5,1,1
20069,19,4,2,2


To complicate matters, consider the transaction history of panelist 20069. This person's trial and first repeat purchases occurred in the same week. We therefore change the week associated with the first repeat purchase from 18 to 19. But this creates another problem as this person's second repeat purchase occurred in week 19. Having shifted the first repeat purchase to week 19, we have to shift the second repeat purchase to week 20.

Our solution is to create an offset variable that can be added to the value of the week field, giving us a "shifted" week variable. Clearly the value of this offset will be zero for the trial purchase. For any repeat purchase record, if the week associated with the current record is the same as that of the previous record, we increment the offset variable by 1. (This ensures that the third purchase in a given week is shifted two weeks.) We also need to shift any purchases encroached on by the shifting of previous purchases (such as the second repeat purchase for panelist 20069).

We create this offset variable in the following manner.

```
=IF(Current ID = Previous ID,
    IF(Current Week = Previous Week, 
        Previous Offset+1 ,
        MAX(0,Previous Week + Previous Offset - Current Week + 1)),
    0)
```
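
Because each offset depends on the previous record's offset, the rule is naturally sequential. The sketch below is a plain Python loop that follows the rule row by row, applied to the week values of panelists 20069 and 20014 shown above. The function name `shifted_weeks` is hypothetical, and this is not the polars implementation used in the cells that follow.

```python
def shifted_weeks(records):
    """Apply the offset rule to (ID, Week) records sorted by panelist and time.

    Returns a list of (ID, Week, Offset, shWeek) tuples.
    """
    result = []
    prev_id = prev_week = None
    prev_offset = 0
    for pid, week in records:
        if pid != prev_id:
            offset = 0                    # trial purchase: never shifted
        elif week == prev_week:
            offset = prev_offset + 1      # another purchase in the same week
        else:
            # Shift only if the previous (shifted) purchase reaches this week.
            offset = max(0, prev_week + prev_offset - week + 1)
        result.append((pid, week, offset, week + offset))
        prev_id, prev_week, prev_offset = pid, week, offset
    return result


records = [(20069, 18), (20069, 18), (20069, 19)]
records += [(20014, w) for w in (4, 4, 6, 7, 7, 12, 17, 23, 47)]
shifted = shifted_weeks(records)

print([row[3] for row in shifted if row[0] == 20069])  # [18, 19, 20]
print([row[3] for row in shifted if row[0] == 20014])  # [4, 5, 6, 7, 8, 12, 17, 23, 47]
```

Note how the second repeat purchase of panelist 20069 moves from week 19 to week 20 because the first repeat purchase has already been shifted into week 19.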

In [28]:
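# Add a previous-row Week column and an Offset column initialised to 0.
# Note: because Offset starts as a literal 0, Offset.shift(1) is always 0 (or null) here,
# so this flags a second purchase in the same week but does not cascade the shift
# to later purchases (e.g., the second repeat purchase of panelist 20069).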
kiwi_lf_offset = (
    kiwi_lf_m2
    .with_columns(
        pl.col("Week").shift(1).alias("Prev_Week"),
        pl.lit(0).alias("Offset")
    ).with_columns(
        (pl.when(pl.col('Week') == pl.col('Prev_Week'))
         .then((pl.col('Offset').shift(1)) + 1)
         .otherwise(
             pl.max_horizontal(0, pl.col('Prev_Week') + (pl.col('Offset').shift(1)) - pl.col('Week') + 1)
        )            
        ).over('ID').alias('Offset')
    ).fill_null(0)
    .with_columns((pl.col('Week') + pl.col('Offset')).alias('shWeek'))
    .collect()
)

kiwi_lf_offset.filter(pl.col('Offset') == 1)

ID,Week,Day,Units,DoR,Prev_Week,Offset,shWeek
u16,i16,i16,i16,u16,i16,i32,i32
20014,4,4,1,1,4,1,5
20014,7,6,3,4,7,1,8
20051,17,7,2,1,17,1,18
20057,16,4,3,2,16,1,17
20069,18,5,1,1,18,1,19
20117,1,4,1,1,1,1,2
20118,19,5,1,5,19,1,20


In [38]:
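# Add previous-row Week and ID columns and an Offset column initialised to 0.
# The explicit ID comparison replaces .over('ID'). As before, Offset.shift(1) is
# always 0 (or null), so shifts are not cascaded to later purchases.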
kiwi_lf_offset = (
    kiwi_lf_m2
    .with_columns(
        pl.col("Week").shift(1).alias("Prev_Week"),
        pl.col('ID').shift(1).alias('Prev_ID'),
        pl.lit(0).alias("Offset")
    ).with_columns(
        pl.when(pl.col('ID') == pl.col('Prev_ID'))
        .then(pl.when(pl.col('Week') == pl.col('Prev_Week'))
              .then((pl.col('Offset').shift(1)) + 1)
              .otherwise(pl.max_horizontal(0, pl.col('Prev_Week') + (pl.col('Offset').shift(1)) - pl.col('Week') + 1))            
        ).otherwise(0).alias('Offset')
    ).fill_null(0)
    .with_columns((pl.col('Week') + pl.col('Offset')).alias('shWeek'))
    .collect()
)

kiwi_lf_offset.filter(pl.col('Offset') != 0)

ID,Week,Day,Units,DoR,Prev_Week,Prev_ID,Offset,shWeek
u16,i16,i16,i16,u16,i16,u16,i32,i32
20014,4,4,1,1,4,20014,1,5
20014,7,6,3,4,7,20014,1,8
20051,17,7,2,1,17,20051,1,18
20057,16,4,3,2,16,20057,1,17
20069,18,5,1,1,18,20069,1,19
20117,1,4,1,1,1,20117,1,2
20118,19,5,1,5,19,20118,1,20
