# 02 — Labeling and Feature Engineering (Engagement Risk)

**This notebook:**

Turns a clean order event log (“customer placed an order at time t with value v”) into a supervised learning dataset:
- Rows: “Customer at snapshot date”
- Features (X): derived from past behavior before the snapshot
- Label (y): whether they disengage in the future after the snapshot

Steps:
- loads the processed order-level event table (`orders.parquet`)
- defines weekly snapshot dates
- labels customers as disengaged if they make no purchase in the next 30 days
- builds leakage-safe rolling behavioral features (7/30/90 day windows)
- evaluates baseline heuristics (recency rule and all-churn baseline)

In [33]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
    average_precision_score,
)

In [34]:
orders = pd.read_parquet("../data/processed/orders.parquet")
orders.head(10), orders.shape

(  Customer ID Invoice            order_ts  total_revenue  total_qty  n_items  \
 0       12346  491725 2009-12-14 08:34:00          45.00         10        1   
 1       12346  491742 2009-12-14 11:00:00          22.50          5        1   
 2       12346  491744 2009-12-14 11:02:00          22.50          5        1   
 3       12346  492718 2009-12-18 10:47:00          22.50          5        1   
 4       12346  492722 2009-12-18 10:55:00           1.00          1        1   
 5       12346  493410 2010-01-04 09:24:00          22.50          5        1   
 6       12346  493412 2010-01-04 09:53:00          22.50          5        1   
 7       12346  494450 2010-01-14 13:50:00          22.50          5        1   
 8       12346  495295 2010-01-22 13:30:00          22.50          5        1   
 9       12346  499763 2010-03-02 13:08:00          27.05          5        5   
 
    n_lines  
 0        1  
 1        1  
 2        1  
 3        1  
 4        1  
 5        1  
 6       

In [35]:
# Sanity Checks

print(orders.dtypes)
print(orders['order_ts'].min(), orders['order_ts'].max())
print(orders['Customer ID'].nunique(), orders['Invoice'].nunique())

Customer ID              object
Invoice                  object
order_ts         datetime64[ns]
total_revenue           float64
total_qty                 int64
n_items                   int64
n_lines                   int64
dtype: object
2009-12-01 07:45:00 2010-12-09 20:01:00
4312 19213


Labeling, rolling-window feature engineering, and construction of the final modeling data frame proceed in the following steps:
- (1) Create a list of weekly snapshot dates
- (2) Choose the label condition: Label = 1 if the customer places no orders within 30 days after the snapshot, Label = 0 otherwise
- (3) Define an engagement label function that assigns 1 to customers who are inactive for the 30 days after the snapshot. It returns a DataFrame with Customer ID, snapshot date, and label.
- (4) Define a rolling-window feature-building function that uses only orders with order_ts <= snapshot_date (leakage-safe).
- (5) Build the modeling DataFrame by merging features and labels

In [85]:
# Snapshot Dates Array Definition 

# Create weekly snapshot dates
snapshot_dates = (
    orders["order_ts"]
    .dt.to_period("W")
    .dt.start_time
    .sort_values()
    .unique()
)

# Remove any week-start snapshots that occur before first order timestamp
min_ts = orders["order_ts"].min()
snapshot_dates = [s for s in snapshot_dates if s >= min_ts]

len(snapshot_dates), snapshot_dates[:5], snapshot_dates[-5:]

(52,
 [Timestamp('2009-12-07 00:00:00'),
  Timestamp('2009-12-14 00:00:00'),
  Timestamp('2009-12-21 00:00:00'),
  Timestamp('2010-01-04 00:00:00'),
  Timestamp('2010-01-11 00:00:00')],
 [Timestamp('2010-11-08 00:00:00'),
  Timestamp('2010-11-15 00:00:00'),
  Timestamp('2010-11-22 00:00:00'),
  Timestamp('2010-11-29 00:00:00'),
  Timestamp('2010-12-06 00:00:00')])
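
The list above jumps from 2009-12-21 to 2010-01-04. Because snapshots are derived from the weeks in which orders occur, a week without orders produces no snapshot, which is consistent with 52 entries rather than 53. If a gap-free weekly grid is preferred, a minimal sketch could look like the following. The helper name `weekly_snapshot_grid` is hypothetical, and Monday 00:00 starts are assumed so that the grid lines up with the week starts shown above.

```python
import pandas as pd

def weekly_snapshot_grid(first_ts, last_ts):
    """Every Monday 00:00 between the first and last order timestamp (hypothetical helper)."""
    first_ts, last_ts = pd.Timestamp(first_ts), pd.Timestamp(last_ts)
    grid = pd.date_range(start=first_ts.normalize(), end=last_ts, freq="W-MON")
    # Drop a Monday that falls before the first order, mirroring the filter used above
    return [s for s in grid if s >= first_ts]

# Min/max order timestamps as printed by the sanity-check cell
grid = weekly_snapshot_grid("2009-12-01 07:45:00", "2010-12-09 20:01:00")
print(len(grid), grid[0], grid[-1])
```

Whether to keep or fill the missing week is a modeling choice: a snapshot in a week with no orders still yields valid rows, since every previously seen customer can be labeled.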

In [70]:
# Check to see if first snapshot date is earlier than first timestamp 
orders["order_ts"].min(), snapshot_dates[0], snapshot_dates[1], snapshot_dates[2],snapshot_dates[9]

(Timestamp('2009-12-01 07:45:00'),
 Timestamp('2009-12-07 00:00:00'),
 Timestamp('2009-12-14 00:00:00'),
 Timestamp('2009-12-21 00:00:00'),
 Timestamp('2010-02-15 00:00:00'))

In [76]:
# Define Engagement Label 
# Label = 1 if user has no orders in the 30 days after snapshot
# Label = 0 otherwise

from datetime import timedelta

def label_engagement(orders, snapshot_date, horizon_days=30):
    horizon_end = snapshot_date + timedelta(days=horizon_days)
    
    users_active_before = orders.loc[
        orders["order_ts"] <= snapshot_date, "Customer ID"
    ].unique()
    
    future_orders = orders.loc[
        (orders["order_ts"] > snapshot_date) &
        (orders["order_ts"] <= horizon_end)
    ]
    
    users_active_future = set(future_orders["Customer ID"].unique())
    
    labels = []
    for u in users_active_before:
        labels.append({
            "Customer ID": u,
            "snapshot_date": snapshot_date,
            "label": int(u not in users_active_future)
        })
        
    return pd.DataFrame(labels, columns = ["Customer ID", "snapshot_date", "label"]) #Ensure joins later do not crash

In [77]:
# Test on one snapshot
# Result: Churn Rate = 66.93% Approx. 

test_snapshot = snapshot_dates[9]
labels_df = label_engagement(orders, test_snapshot)
labels_df["label"].value_counts(normalize=True)

label
1    0.669271
0    0.330729
Name: proportion, dtype: float64
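
The labeling rule can also be written without the Python loop. The sketch below is one possible vectorized variant, not the function used in this notebook; the name `label_engagement_vectorized` and the toy orders are illustrative assumptions. It applies the same rule: customers seen at or before the snapshot get label 1 when they place no order in the following `horizon_days`.

```python
import pandas as pd

def label_engagement_vectorized(orders, snapshot_date, horizon_days=30):
    """Vectorized variant of the labeling rule (hypothetical name)."""
    snapshot_date = pd.Timestamp(snapshot_date)
    horizon_end = snapshot_date + pd.Timedelta(days=horizon_days)
    ts = orders["order_ts"]

    # Customers with at least one order at or before the snapshot
    seen_before = orders.loc[ts <= snapshot_date, "Customer ID"].drop_duplicates()
    # Customers with at least one order inside (snapshot, snapshot + horizon]
    active_future = orders.loc[(ts > snapshot_date) & (ts <= horizon_end), "Customer ID"]

    return pd.DataFrame({
        "Customer ID": seen_before.to_numpy(),
        "snapshot_date": snapshot_date,
        "label": (~seen_before.isin(active_future)).astype(int).to_numpy(),
    })

# Toy data: A buys again within the horizon, B does not, C first appears after the snapshot
toy_orders = pd.DataFrame({
    "Customer ID": ["A", "A", "B", "C"],
    "order_ts": pd.to_datetime(["2010-01-01", "2010-01-20", "2010-01-05", "2010-01-15"]),
})
toy_labels = label_engagement_vectorized(toy_orders, "2010-01-10")
print(toy_labels)
```

Customer C is absent from the result because only customers with history at the snapshot are eligible, which matches the `users_active_before` filter above.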

**Define Feature Windows** 
- 7 Days -> Short-Term Engagement
- 30 Days -> Medium-Term Behavior
- 90 Days -> Long-Term Value

In [79]:
# Feature-Building Function 

def build_features(orders, snapshot_date, windows=(7, 30, 90)):
    snap_orders = orders[orders["order_ts"] <= snapshot_date].copy()
    
    features = []
    
    for customer_id, hist in snap_orders.groupby("Customer ID"):
        feat = {
            "Customer ID": customer_id,
            "snapshot_date": snapshot_date,
        }
        
        last_order_date = hist["order_ts"].max()
        feat["recency_days"] = (snapshot_date - last_order_date).days
        
        for w in windows:
            start = snapshot_date - pd.Timedelta(days=w)
            win = hist[hist["order_ts"] > start]
            
            feat[f"orders_{w}d"] = len(win)
            feat[f"revenue_{w}d"] = win["total_revenue"].sum()
            feat[f"items_{w}d"] = win["n_items"].sum()
        
        feat["lifetime_orders"] = len(hist)
        feat["lifetime_revenue"] = hist["total_revenue"].sum()
        
        features.append(feat)

    cols = ["Customer ID", "snapshot_date", "recency_days"] + \
       [f"orders_{w}d" for w in windows] + \
       [f"revenue_{w}d" for w in windows] + \
       [f"items_{w}d" for w in windows] + \
       ["lifetime_orders", "lifetime_revenue"]

    return pd.DataFrame(features, columns=cols) # Ensures join compatibility

In [80]:
# Feature Generation Test (Single Snapshot)

X_df = build_features(orders, test_snapshot)
X_df.head(), X_df.shape

(  Customer ID snapshot_date  recency_days  orders_7d  orders_30d  orders_90d  \
 0       12346    2010-02-15            23          0           1           9   
 1       12358    2010-02-15            68          0           0           1   
 2       12359    2010-02-15            60          0           0           2   
 3       12361    2010-02-15            19          0           1           1   
 4       12362    2010-02-15            75          0           0           1   
 
    revenue_7d  revenue_30d  revenue_90d  items_7d  items_30d  items_90d  \
 0         0.0         22.5       203.50         0          1          9   
 1         0.0          0.0      1429.83         0          0         17   
 2         0.0          0.0       838.89         0          0         34   
 3         0.0        109.2       109.20         0          7          7   
 4         0.0          0.0       130.00         0          0          1   
 
    lifetime_orders  lifetime_revenue  
 0            
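
The per-customer loop in `build_features` is easy to read but does one pass per customer. A possible vectorized variant is sketched below under the same window definition: an order counts toward a window when `snapshot_date - w days < order_ts <= snapshot_date`. The function name `build_features_vectorized` and the toy data are assumptions for illustration.

```python
import pandas as pd

def build_features_vectorized(orders, snapshot_date, windows=(7, 30, 90)):
    """Hypothetical variant: one groupby instead of a per-customer loop."""
    snapshot_date = pd.Timestamp(snapshot_date)
    hist = orders.loc[orders["order_ts"] <= snapshot_date].copy()  # past orders only

    window_cols = []
    for w in windows:
        in_win = hist["order_ts"] > snapshot_date - pd.Timedelta(days=w)
        # Zero out rows outside the window so a plain sum gives the windowed total
        hist[f"orders_{w}d"] = in_win.astype(int)
        hist[f"revenue_{w}d"] = hist["total_revenue"].where(in_win, 0.0)
        hist[f"items_{w}d"] = hist["n_items"].where(in_win, 0)
        window_cols += [f"orders_{w}d", f"revenue_{w}d", f"items_{w}d"]

    grouped = hist.groupby("Customer ID")
    feats = grouped[window_cols].sum()
    feats["recency_days"] = (snapshot_date - grouped["order_ts"].max()).dt.days
    feats["lifetime_orders"] = grouped.size()
    feats["lifetime_revenue"] = grouped["total_revenue"].sum()

    feats = feats.reset_index()
    feats.insert(1, "snapshot_date", snapshot_date)
    return feats

# Toy data: the 2010-02-20 order lies after the snapshot and must be ignored
toy_orders = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "A", "B"],
    "order_ts": pd.to_datetime(
        ["2010-02-10", "2010-01-20", "2009-12-01", "2010-02-20", "2009-12-20"]
    ),
    "total_revenue": [10.0, 5.0, 7.0, 99.0, 100.0],
    "n_items": [2, 1, 3, 9, 4],
})
toy_feats = build_features_vectorized(toy_orders, "2010-02-15")
print(toy_feats)
```

The column order differs from `build_features`, so select columns by name when comparing the two outputs.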

In [74]:
# Join Features + Labels 

model_df = (
    X_df
    .merge(labels_df, on=["Customer ID", "snapshot_date"], how="inner")
)

model_df["label"].value_counts(normalize=True)

label
1    0.669271
0    0.330729
Name: proportion, dtype: float64
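
The label proportions are unchanged by the join, which is what we expect: both functions select customers with `order_ts <= snapshot_date`, so every feature row should have exactly one label row. One way to make that expectation explicit is to use the `validate` and `indicator` options of `DataFrame.merge`. The sketch below uses toy stand-ins for `X_df` and `labels_df`; the values are illustrative only.

```python
import pandas as pd

keys = ["Customer ID", "snapshot_date"]
snap = pd.Timestamp("2010-02-15")

# Toy stand-ins for X_df and labels_df (illustrative values only)
toy_X = pd.DataFrame({"Customer ID": ["A", "B"], "snapshot_date": snap, "recency_days": [5, 57]})
toy_labels = pd.DataFrame({"Customer ID": ["A", "B"], "snapshot_date": snap, "label": [0, 1]})

# validate raises pandas.errors.MergeError if a key is duplicated on either side;
# indicator adds a _merge column showing which side(s) each row came from
checked = toy_X.merge(toy_labels, on=keys, how="outer", validate="one_to_one", indicator=True)
unmatched = int((checked["_merge"] != "both").sum())
print("rows:", len(checked), "unmatched:", unmatched)
```

If `unmatched` is 0, an inner join and an outer join give the same rows, so no customers are silently dropped.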

## Description of Features/Label/Rows

**Identifier/Metadata:**
- Customer ID: Unique identifier for customer
- snapshot_date: Moment in time features are observed

**Core Behavioral Features:**
- recency_days: Number of days since the customer's most recent purchase as of snapshot_date
- orders_7d: Number of orders placed in the 7 days up to and including snapshot_date
- revenue_7d: Total revenue from orders in that 7-day window
- items_7d: Sum of the per-order `n_items` column over orders in that 7-day window
- orders_30d: Number of orders placed in the 30 days up to and including snapshot_date
- revenue_30d: Total revenue from orders in that 30-day window
- items_30d: Sum of `n_items` over orders in that 30-day window
- orders_90d: Number of orders placed in the 90 days up to and including snapshot_date
- revenue_90d: Total revenue from orders in that 90-day window
- items_90d: Sum of `n_items` over orders in that 90-day window
- lifetime_orders: Total number of orders the customer has placed up to snapshot_date
- lifetime_revenue: Total revenue generated by customer up to snapshot_date

**Label:**
- 1 -> Customer does not make purchase in next 30 days (disengaged)
- 0 -> Customer does make a purchase in next 30 days (engaged)


**Each row means:**
- “At snapshot date T, customer C had the following historical behavior. Will customer C fail to make a purchase in the next 30 days?”
- Everything except label is computed using data at or before T.
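
The "at or before T" property can be checked mechanically: a leakage-safe feature function must return the same output whether or not orders after T are present in its input. The sketch below is a minimal version of such a check. The names `is_leakage_safe`, `safe_order_count`, and `leaky_order_count` are hypothetical, and the two toy feature functions exist only to show a passing and a failing case.

```python
import pandas as pd

def is_leakage_safe(feature_fn, orders, snapshot_date):
    """True if removing post-snapshot orders leaves the features unchanged."""
    snapshot_date = pd.Timestamp(snapshot_date)
    past_only = orders[orders["order_ts"] <= snapshot_date]
    with_future = feature_fn(orders, snapshot_date)
    without_future = feature_fn(past_only, snapshot_date)
    a = with_future.sort_values("Customer ID").reset_index(drop=True)
    b = without_future.sort_values("Customer ID").reset_index(drop=True)
    return a.equals(b)

def safe_order_count(orders, snapshot_date):
    # Filters to history first, so future orders cannot influence the count
    hist = orders[orders["order_ts"] <= snapshot_date]
    return hist.groupby("Customer ID").size().rename("n_orders").reset_index()

def leaky_order_count(orders, snapshot_date):
    # Deliberately wrong: counts every order, including those after the snapshot
    return orders.groupby("Customer ID").size().rename("n_orders").reset_index()

toy_orders = pd.DataFrame({
    "Customer ID": ["A", "A"],
    "order_ts": pd.to_datetime(["2010-01-01", "2010-03-01"]),
})
snap = pd.Timestamp("2010-02-01")
print(is_leakage_safe(safe_order_count, toy_orders, snap))
print(is_leakage_safe(leaky_order_count, toy_orders, snap))
```

In this notebook the same check could be run as `is_leakage_safe(build_features, orders, test_snapshot)`, since `build_features` accepts `(orders, snapshot_date)` as its first two arguments.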



In [42]:
model_df.shape

(1536, 15)

In [43]:
model_df.head()

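| | Customer ID | snapshot_date | recency_days | orders_7d | revenue_7d | items_7d | orders_30d | revenue_30d | items_30d | orders_90d | revenue_90d | items_90d | lifetime_orders | lifetime_revenue | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12346 | 2010-02-15 | 23 | 0 | 0.0 | 0 | 1 | 22.5 | 1 | 9 | 203.5 | 9 | 9 | 203.5 | 0 |
| 1 | 12358 | 2010-02-15 | 68 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 1 | 1429.83 | 17 | 1 | 1429.83 | 1 |
| 2 | 12359 | 2010-02-15 | 60 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 2 | 838.89 | 34 | 2 | 838.89 | 0 |
| 3 | 12361 | 2010-02-15 | 19 | 0 | 0.0 | 0 | 1 | 109.2 | 7 | 1 | 109.2 | 7 | 1 | 109.2 | 1 |
| 4 | 12362 | 2010-02-15 | 75 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 1 | 130.0 | 1 | 1 | 130.0 | 1 |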


In [60]:
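# X and y Split
# Note: Customer ID and snapshot_date are identifiers, not predictors; exclude them before fitting a model.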

feature_cols = ["Customer ID",
    "snapshot_date",
    "recency_days",
    "orders_7d", "revenue_7d", "items_7d",
    "orders_30d", "revenue_30d", "items_30d",
    "orders_90d", "revenue_90d", "items_90d",
    "lifetime_orders", "lifetime_revenue",
]

X = model_df[feature_cols]
y = model_df["label"]

We first compute the base rate, the benchmark our ML work must beat. The base rate is the label prevalence: the share of customers at the snapshot who actually disengage within the following 30 days:

In [45]:
print('Label Prevalence/Base Rate (% of Customers that did disengage after 30 days): ', y.mean()) # y.mean() = Label Prevalence; 66.93% approx. disengage within 30 days 

Label Prevalence/Base Rate (% of Customers that did disengage after 30 days):  0.6692708333333334


Next, we evaluate a baseline heuristic: predict disengagement/churn if the customer has not purchased anything in the 30 days before the snapshot date (recency_days > 30):

In [46]:
# Baseline Heuristic 
# Predict disengagement if recency_days > 30

baseline_pred = (model_df["recency_days"] > 30).astype(int)

print('Churn Ratio Based Off 30-Day Baseline Heuristic: ', baseline_pred.mean()) # Rule recency_days > 30 predicts churn for approx. 44.21% of users

Churn Ratio Based Off 30-Day Baseline Heuristic:  0.4420572916666667


So, if we predict disengagement for every customer who has not bought anything in the last 30 days, the rule flags ~44.21% of customers at our test snapshot as churners, well below the ~66.93% who actually churn.

Next, we calculate the accuracy of the baseline heuristic:

In [47]:
# Baseline Accuracy 
# -> 52.15% approx. 

# Worse than just always predicting churn (~67%)
# -> Recency alone is not sufficient, need a learned model that incorporates frequency and monetary signals

print('Baseline Accuracy: ', (baseline_pred == y).mean()) 

Baseline Accuracy:  0.521484375


Our baseline heuristic of predicting anyone who has not purchased within the last 30 days as disengaged yields an accuracy of ~52.15%.

**This accuracy is worse than simply predicting that everyone churns, which would yield an accuracy of ~67% (since ~67% of customers in the current data set truly churn). This suggests that a single recency threshold is not sufficient for a good model.**

Due to class imbalance in the data (~67% churn), accuracy is not a reliable metric for evaluating predictive power here. We now formally evaluate the baseline heuristic alongside an "all-churn" classifier, which simply predicts churn for every customer.

In [48]:
# Formal Baseline Evaluation Metrics (For recency_days > 30 = churn)
print("Baseline Accuracy:", accuracy_score(y, baseline_pred)) # Proportion of total labels predicted correctly (dataset is imbalanced)
print("Baseline Precision:", precision_score(y, baseline_pred)) # Proportion of predicted churn users that actually churned
print("Baseline Recall:", recall_score(y, baseline_pred)) # Proportion of true churn users successfully identified

# All-Churn Classifier Metrics (predict churn for all users)
all_churn_pred = np.ones_like(y)

print("All-Churn Accuracy:", accuracy_score(y, all_churn_pred))
print("All-Churn Precision:", precision_score(y, all_churn_pred))
print("All-Churn Recall:", recall_score(y, all_churn_pred))

Baseline Accuracy: 0.521484375
Baseline Precision: 0.7157584683357879
Baseline Recall: 0.4727626459143969
All-Churn Accuracy: 0.6692708333333334
All-Churn Precision: 0.6692708333333334
All-Churn Recall: 1.0


## Evaluation Metrics In-Depth 

### Baseline Metrics
-> Metrics for the baseline heuristic, which predicts churn if the customer has not purchased within the last 30 days

**Baseline Accuracy: ~52.15%**
- Out of all customers, how many were labeled correctly?
- Dangerous metric to use here due to class imbalance; a trivial model that predicts "everyone churns" gets an accuracy of ~67%

**Baseline Precision: ~71.58%**
- When the baseline heuristic did predict churn, how many actually churned? 
- **Precision is more significant when false positives are costly**
    - **-> High precision allows us to not waste time on the wrong users**
- ~71.58% is above the ~67% base rate, so customers flagged by the rule are somewhat more likely to churn than the average customer

**Baseline Recall: ~47.28%**
- Of all customers who actually churned, how many did our baseline heuristic correctly predict?
- **Recall matters more when missing a churner is costly**
    - **-> e.g., losing high-value customers**
- We caught fewer than half of the true churners

### All-Churn Metrics
-> Metrics yielded from predicting churn for every single customer

**All-Churn Accuracy & Precision: ~67% (Base Rate)**
- About 2/3 of customers churn, so predicting churn for everyone yields the base rate for both accuracy and precision

**All-Churn Recall: 100%**
- We catch every churner since we flagged everyone
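
The reported rates can be tied together by reconstructing the confusion matrix of the baseline heuristic from the printed figures. This is a worked-arithmetic sketch that uses only numbers shown above (1536 rows, the base rate, the flagged share, and the precision); the variable names are illustrative.

```python
n = 1536                                      # rows at the test snapshot
positives = round(0.6692708333333334 * n)     # true churners (base rate * n)
flagged = round(0.4420572916666667 * n)       # customers with recency_days > 30
tp = round(0.7157584683357879 * flagged)      # flagged customers who churned (precision * flagged)

fp = flagged - tp               # flagged but did not churn
fn = positives - tp             # churned but not flagged
tn = n - positives - fp         # neither flagged nor churned

accuracy = (tp + tn) / n
recall = tp / positives
print(tp, fp, fn, tn, accuracy, recall)
```

The recomputed accuracy and recall match the reported ~52.15% and ~47.28%. The matrix also shows where the heuristic loses: the false negatives (recent buyers who still churned) outnumber the true positives.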


## A good ML model will:
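- Beat the baseline heuristic on recall or precision, depending on business tradeoffs
- Beat the all-churn baseline on discrimination: a constant prediction cannot rank customers (ROC AUC of 0.5, average precision equal to the ~67% base rate), so a learned model should exceed both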
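
"Discrimination" here means how well a score ranks churners above non-churners, which `roc_auc_score` and `average_precision_score` (imported at the top of this notebook) measure. The sketch below uses a small made-up label and recency vector, not the notebook's data, to show the reference values for a constant prediction and how a raw score such as `recency_days` can be evaluated without choosing a threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy labels and recency values (illustrative only, not taken from model_df)
y_toy = np.array([1, 1, 1, 0, 0, 1])
recency_toy = np.array([80, 45, 10, 5, 20, 60])

# An all-churn classifier assigns the same score to everyone
constant_scores = np.ones(len(y_toy))
auc_const = roc_auc_score(y_toy, constant_scores)
ap_const = average_precision_score(y_toy, constant_scores)

# Using recency as a continuous score: higher recency = more likely to churn
auc_recency = roc_auc_score(y_toy, recency_toy)

print(auc_const, ap_const, y_toy.mean(), auc_recency)
```

On the real data the equivalent calls would be `roc_auc_score(y, model_df["recency_days"])` and `average_precision_score(y, model_df["recency_days"])`; their values are not computed in this notebook.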

Before we move on to model building, we must build a full dataset from multiple snapshots (so far we have only looked at one test snapshot).

**Important: We must skip the last few snapshots because a full 30-day future window is not available after them**

*Churn/Disengagement Horizon and Feature Windows can be chosen in the code below:*

In [88]:
from pathlib import Path
import pandas as pd

HORIZON_DAYS = 30          # CHOOSE: churn/disengagement horizon
WINDOWS = (7, 30, 90)      # CHOOSE: feature windows

# Keep only snapshots with a full HORIZON_DAYS of future data so the label definition holds.
max_order_ts = orders["order_ts"].max()
latest_snapshot_allowed = max_order_ts - pd.Timedelta(days = HORIZON_DAYS)
snapshots_used = [s for s in snapshot_dates if s <= latest_snapshot_allowed]

len(snapshot_dates), len(snapshots_used), snapshots_used[:3], snapshots_used[-3:]

(52,
 48,
 [Timestamp('2009-12-07 00:00:00'),
  Timestamp('2009-12-14 00:00:00'),
  Timestamp('2009-12-21 00:00:00')],
 [Timestamp('2010-10-25 00:00:00'),
  Timestamp('2010-11-01 00:00:00'),
  Timestamp('2010-11-08 00:00:00')])

In [89]:
# Full Dataset Build Loop: 
# Skip snapshots with no orders

all_rows = []

for i, snap in enumerate(snapshots_used):
    labels_df = label_engagement(orders, snap, horizon_days=HORIZON_DAYS)
    X_df = build_features(orders, snap, windows=WINDOWS)

    if labels_df.empty or X_df.empty:
        print(f"Skipping {snap.date()} (no eligible users)")
        continue

    df_snap = X_df.merge(labels_df, on=["Customer ID", "snapshot_date"], how="inner")

    all_rows.append(df_snap)

    # lightweight progress print every ~10 snapshots
    if (i + 1) % 10 == 0 or (i + 1) == len(snapshots_used):
        print(f"Built snapshot {i+1}/{len(snapshots_used)} — rows: {df_snap.shape[0]}")

full_model_df = pd.concat(all_rows, ignore_index=True)

full_model_df.shape, full_model_df["label"].mean()


Built snapshot 10/48 — rows: 1536
Built snapshot 20/48 — rows: 2370
Built snapshot 30/48 — rows: 3001
Built snapshot 40/48 — rows: 3385
Built snapshot 48/48 — rows: 4013


((120261, 15), np.float64(0.7098144868244901))
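
The overall prevalence (~0.71) is higher than the ~0.67 seen at the single test snapshot, and the progress log shows rows per snapshot growing from 1536 to 4013 as more customers accumulate history. Later snapshots therefore carry more weight in the pooled figure. A per-snapshot summary makes this visible. The sketch below assumes a frame with `snapshot_date`, `Customer ID`, and `label` columns, as in `full_model_df`; the helper name `summarize_snapshots` and the toy data are illustrative.

```python
import pandas as pd

def summarize_snapshots(model_df):
    """Rows and churn rate per snapshot (hypothetical helper)."""
    return (
        model_df.groupby("snapshot_date")
        .agg(rows=("label", "size"), churn_rate=("label", "mean"))
        .reset_index()
    )

# Toy frame: two snapshots with different sizes and churn rates
toy_df = pd.DataFrame({
    "Customer ID": ["A", "B", "A", "B", "C", "D"],
    "snapshot_date": pd.to_datetime(["2010-01-04"] * 2 + ["2010-01-11"] * 4),
    "label": [1, 0, 1, 1, 0, 1],
})
summary = summarize_snapshots(toy_df)
print(summary)

# Each (customer, snapshot) pair should appear once
has_duplicates = bool(toy_df.duplicated(["Customer ID", "snapshot_date"]).any())
print("duplicate keys:", has_duplicates)
```

The same customer appears in many snapshots, so rows in the full dataset are not independent; this is worth keeping in mind when the data is later split for training and evaluation.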

In [90]:
# Persist full_model_df to parquet file
out_dir = Path("../data/processed")
out_dir.mkdir(parents = True, exist_ok = True)
out_path = out_dir / "modeling_dataset.parquet"
full_model_df.to_parquet(out_path, index=False)

print("Saved to:", out_path)
print("Shape:", full_model_df.shape)

Saved to: ../data/processed/modeling_dataset.parquet
Shape: (120261, 15)
