# Stratified Cox Model - Step-by-Step Tutorial
## Dataset: UCI Online Retail II

**GOAL:** Predict which customers will repurchase which products and when

### KEY INNOVATION: Stratified by product (SKU)
- Each product gets its own baseline hazard curve
- Customer feature coefficients (beta) are shared across all products

This lets one model handle products with very different repurchase cycles, for example:
- Milk: ~7 days
- Shampoo: ~30 days  
- Winter coat: ~365 days
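
In symbols, the idea above can be sketched as follows. The notation is an assumption added here for illustration: $k$ indexes the product (stratum), $x$ is the customer feature vector, $\beta$ is the shared coefficient vector, and $h_{0k}$, $S_{0k}$ are the baseline hazard and baseline survival of product $k$.

```latex
h_k(t \mid x) = h_{0k}(t)\,\exp(\beta^\top x),
\qquad
S_k(t \mid x) = S_{0k}(t)^{\exp(\beta^\top x)}
```

Only the baseline changes from product to product; the multiplicative effect $\exp(\beta^\top x)$ of the customer features is the same in every stratum.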

---
## STEP 1: SETUP & DATA LOADING

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data
# Download from: https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci

print("Loading UCI Online Retail Dataset...")

# Load from the CSV export of the dataset
df = pd.read_csv('online_retail_II.csv')

print(f"Loaded {len(df):,} transactions")
print(f"Columns: {df.columns.tolist()}")

Loading UCI Online Retail Dataset...
Loaded 1,067,371 transactions
Columns: ['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID', 'Country']
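
If the data is available as an Excel workbook rather than a CSV, a loader along these lines may help. This is a minimal sketch: `load_retail_data` is a hypothetical helper, and the assumption that an Excel file may spread the transactions over several sheets should be checked against the file you actually have.

```python
from pathlib import Path

import pandas as pd


def load_retail_data(path):
    """Hypothetical helper: read transactions from a .csv or .xlsx/.xls file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == '.csv':
        return pd.read_csv(path)
    if suffix in ('.xlsx', '.xls'):
        # sheet_name=None returns a dict of DataFrames, one per sheet.
        # Assumption: every sheet has the same columns, so they can be stacked.
        sheets = pd.read_excel(path, sheet_name=None)
        return pd.concat(sheets.values(), ignore_index=True)
    raise ValueError(f"Unsupported file type: {suffix}")
```

Usage would be `df = load_retail_data('online_retail_II.csv')`, which behaves like the `pd.read_csv` call in the cell above.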


In [3]:
# Quick look at data
print("Sample Data:")
df.head()

Sample Data:


,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


---
## STEP 2: DATA CLEANING

In [4]:
print("Cleaning data...")

# Remove cancellations (Invoice starts with 'C')
df = df[~df['Invoice'].astype(str).str.startswith('C')]

# Remove missing CustomerID
df = df[df['Customer ID'].notna()]

# Remove non-positive quantities/prices (zero is excluded too)
df = df[(df['Quantity'] > 0) & (df['Price'] > 0)]

# Remove duplicate rows (EDA showed 3.22% duplicates)
df = df.drop_duplicates()

# Convert date
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create revenue column
df['Revenue'] = df['Quantity'] * df['Price']

# Rename for convenience
df = df.rename(columns={
    'Invoice': 'InvoiceNo',
    'Customer ID': 'CustomerID'
})

# Ensure StockCode is string type
df['StockCode'] = df['StockCode'].astype(str)

print(f"Clean data: {len(df):,} transactions")
print(f"   Customers: {df['CustomerID'].nunique():,}")
print(f"   Products: {df['StockCode'].nunique():,}")
print(f"   Date range: {df['InvoiceDate'].min().date()} to {df['InvoiceDate'].max().date()}")

Cleaning data...
Clean data: 779,425 transactions
   Customers: 5,878
   Products: 4,631
   Date range: 2009-12-01 to 2011-12-09
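
To see how much each cleaning rule removes, the same filters can be applied one at a time while counting rows. The sketch below runs on a small made-up DataFrame; `clean_with_audit` and the toy data are illustrative and not part of the original notebook.

```python
import pandas as pd


def clean_with_audit(df):
    """Apply the cleaning filters in order and record rows removed by each."""
    steps = [
        ('cancellations', lambda d: d[~d['Invoice'].astype(str).str.startswith('C')]),
        ('missing customer', lambda d: d[d['Customer ID'].notna()]),
        ('non-positive qty/price', lambda d: d[(d['Quantity'] > 0) & (d['Price'] > 0)]),
        ('duplicates', lambda d: d.drop_duplicates()),
    ]
    audit = []
    for name, step in steps:
        before = len(df)
        df = step(df)
        audit.append({'step': name, 'rows_removed': before - len(df)})
    return df, pd.DataFrame(audit)


# Toy data: one cancellation, one missing customer, one exact duplicate
toy = pd.DataFrame({
    'Invoice': ['1', 'C2', '3', '4', '4'],
    'Customer ID': [10.0, 10.0, None, 11.0, 11.0],
    'Quantity': [1, -1, 2, 3, 3],
    'Price': [2.0, 2.0, 1.0, 0.5, 0.5],
})
cleaned, audit = clean_with_audit(toy)
print(audit)
```

Because the filters run in sequence, a row is attributed to the first rule that removes it; here the cancelled invoice also has a negative quantity but is counted under cancellations.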


---
## STEP 3: CREATE SURVIVAL DATASET

In [5]:
print("Creating survival analysis dataset...")
print("   For each customer-product pair, calculating:")
print("   - DURATION_DAYS: Days until next purchase (or censoring)")
print("   - EVENT: 1 = repurchased, 0 = censored")

# Filter for products purchased frequently (min 30 times)
product_counts = df['StockCode'].value_counts()
popular_products = product_counts[product_counts >= 30].index
df_filtered = df[df['StockCode'].isin(popular_products)].copy()

print(f"   Using {len(popular_products)} popular products")

Creating survival analysis dataset...
   For each customer-product pair, calculating:
   - DURATION_DAYS: Days until next purchase (or censoring)
   - EVENT: 1 = repurchased, 0 = censored
   Using 3085 popular products


In [6]:
# Sort by customer, product, date
df_filtered = df_filtered.sort_values(['CustomerID', 'StockCode', 'InvoiceDate'])

# Build survival records
survival_records = []
observation_end = df_filtered['InvoiceDate'].max()

for (customer, product), group in df_filtered.groupby(['CustomerID', 'StockCode']):
    
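    # Distinct purchase timestamps for this customer-product pair.
    # pd.to_datetime guarantees pandas Timestamps, so `.days` works below
    # regardless of what `.unique()` returns in the installed pandas version.
    dates = sorted(pd.to_datetime(group['InvoiceDate'].unique()))
    
    # Create records for consecutive purchases (EVENT=1)
    for i in range(len(dates) - 1):
        duration = (dates[i+1] - dates[i]).days
        # Keep gaps of 1-365 days; same-day repeats and gaps over a year are dropped
        if 1 <= duration <= 365: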
            survival_records.append({
                'CustomerID': customer,
                'StockCode': product,
                'DURATION_DAYS': duration,
                'EVENT': 1
            })
    
    # Add censored observation (last purchase, EVENT=0).
    # Skipped when the last purchase falls within 1 day of the end of the observation window.
    last_date = dates[-1]
    censored_duration = (observation_end - last_date).days
    if censored_duration > 1:
        survival_records.append({
            'CustomerID': customer,
            'StockCode': product,
            'DURATION_DAYS': censored_duration,
            'EVENT': 0
        })

survival_df = pd.DataFrame(survival_records)

print(f"\nCreated {len(survival_df):,} survival records")
print(f"   Events (repurchases): {survival_df['EVENT'].sum():,}")
print(f"   Censored: {(survival_df['EVENT'] == 0).sum():,}")


Created 729,256 survival records
   Events (repurchases): 267,411
   Censored: 461,845
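
The Python loop above can be slow on several hundred thousand customer-product pairs. A vectorized version using `groupby(...).shift(-1)` is sketched below under the same rules (event gaps of 1 to 365 days, censored durations greater than 1 day). `build_survival_records` and the toy transactions are illustrative names and data, not part of the original notebook.

```python
import pandas as pd


def build_survival_records(tx, observation_end, max_gap_days=365):
    """Build (duration, event) rows from CustomerID / StockCode / InvoiceDate."""
    keys = ['CustomerID', 'StockCode']
    # One row per distinct purchase timestamp per pair, in time order
    purchases = (tx[keys + ['InvoiceDate']]
                 .drop_duplicates()
                 .sort_values(keys + ['InvoiceDate']))
    # Next purchase of the same product by the same customer (NaT for the last one)
    next_date = purchases.groupby(keys)['InvoiceDate'].shift(-1)
    has_next = next_date.notna()

    # Events: gap between consecutive purchases
    events = purchases.loc[has_next, keys].copy()
    gap = (next_date[has_next] - purchases.loc[has_next, 'InvoiceDate']).dt.days
    events['DURATION_DAYS'] = gap.astype(int)
    events['EVENT'] = 1
    events = events[events['DURATION_DAYS'].between(1, max_gap_days)]

    # Censored: time from the last purchase to the end of the observation window
    censored = purchases.loc[~has_next, keys].copy()
    tail = (observation_end - purchases.loc[~has_next, 'InvoiceDate']).dt.days
    censored['DURATION_DAYS'] = tail.astype(int)
    censored['EVENT'] = 0
    censored = censored[censored['DURATION_DAYS'] > 1]

    return pd.concat([events, censored], ignore_index=True)


toy_tx = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'StockCode': ['A', 'A', 'A', 'A'],
    'InvoiceDate': pd.to_datetime(
        ['2011-01-01', '2011-01-11', '2011-03-01', '2011-03-30']),
})
records = build_survival_records(toy_tx, pd.Timestamp('2011-03-31'))
print(records)
```

In the toy data, customer 2 bought one day before the window closed, so that pair contributes no censored row, mirroring the `censored_duration > 1` check in the loop.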


In [7]:
# Quick stats
print("Duration Statistics:")
survival_df.groupby('EVENT')['DURATION_DAYS'].describe()

Duration Statistics:


EVENT,count,mean,std,min,25%,50%,75%,max
0,461845.0,308.27142,220.688332,2.0,86.0,311.0,477.0,738.0
1,267411.0,86.69735,86.104399,1.0,24.0,55.0,120.0,365.0
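
Summaries split by `EVENT` treat censored and observed durations separately, so neither group's mean is a repurchase-time estimate on its own. A Kaplan-Meier curve uses both. The sketch below is a plain NumPy version for illustration; the function name and toy numbers are assumptions, and in practice `lifelines.KaplanMeierFitter` does the same job.

```python
import numpy as np


def kaplan_meier(durations, events):
    """Return event times and the Kaplan-Meier survival estimate at each."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(durations[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                 # still under observation at t
        repurchased = np.sum((durations == t) & (events == 1))
        s *= 1.0 - repurchased / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Toy data: five intervals, two of them censored (EVENT=0)
km_times, km_surv = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(km_times, km_surv)
```

Here "survival" at day t means "has not repurchased by day t"; censored intervals stay in the at-risk count until their own duration is reached.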


---
## STEP 4: FEATURE ENGINEERING

In [8]:
print("Engineering features...")

# Calculate customer-level aggregations from original data
customer_purchase_count = df.groupby('CustomerID')['InvoiceNo'].nunique()
customer_avg_revenue = df.groupby('CustomerID')['Revenue'].mean()

# Calculate customer-product aggregations
customer_product_count = df.groupby(['CustomerID', 'StockCode'])['InvoiceNo'].nunique()

# Add features to survival dataframe
# Note: these aggregates use each customer's FULL history, including purchases
# made after the interval being modeled, so they leak future information.
# In production, you'd track exact values at time of each purchase.

survival_df['FREQUENCY'] = survival_df['CustomerID'].map(customer_purchase_count)
survival_df['MONETARY'] = survival_df['CustomerID'].map(customer_avg_revenue)
# Vectorized lookup on (CustomerID, StockCode); replaces a slow row-wise apply
pair_index = pd.MultiIndex.from_frame(survival_df[['CustomerID', 'StockCode']])
survival_df['PRODUCT_FREQUENCY'] = (
    customer_product_count.reindex(pair_index).fillna(1).to_numpy()
)

# Log transform monetary (reduce skew)
survival_df['LOG_MONETARY'] = np.log1p(survival_df['MONETARY'])

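# Normalize features
# Note: the scaler is fit on all records before the train/test split, so test-set
# statistics influence the scaling; fit on the training set only to avoid this.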
feature_cols = ['FREQUENCY', 'LOG_MONETARY', 'PRODUCT_FREQUENCY']
scaler = StandardScaler()
survival_df[feature_cols] = scaler.fit_transform(survival_df[feature_cols])

print("Features created:")
for col in feature_cols:
    print(f"   - {col}")

Engineering features...
Features created:
   - FREQUENCY
   - LOG_MONETARY
   - PRODUCT_FREQUENCY
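
The note in the cell above says that production features should reflect what was known at the time of each purchase. One way to approximate that is a running count of earlier purchases. The sketch below is illustrative: `add_point_in_time_counts` and its output column names are hypothetical, and it counts transaction rows rather than distinct invoices.

```python
import pandas as pd


def add_point_in_time_counts(tx):
    """For each purchase row, count the customer's EARLIER rows only."""
    tx = tx.sort_values('InvoiceDate').copy()
    # cumcount() numbers rows 0, 1, 2, ... within each group, in date order,
    # which equals the number of earlier rows in that group
    tx['PRIOR_PURCHASES'] = tx.groupby('CustomerID').cumcount()
    tx['PRIOR_PRODUCT_PURCHASES'] = tx.groupby(['CustomerID', 'StockCode']).cumcount()
    return tx


toy_history = pd.DataFrame({
    'CustomerID': [1, 1, 1],
    'StockCode': ['A', 'B', 'A'],
    'InvoiceDate': pd.to_datetime(['2011-01-01', '2011-01-05', '2011-02-01']),
})
history_features = add_point_in_time_counts(toy_history)
print(history_features)
```

Features built this way never look past the purchase that starts an interval, which is what keeps them usable at scoring time.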


In [9]:
print("Feature Summary:")
survival_df[feature_cols].describe()

Feature Summary:


,FREQUENCY,LOG_MONETARY,PRODUCT_FREQUENCY
count,729256.0,729256.0,729256.0
mean,9.977234e-18,-1.097496e-16,-3.492032e-17
std,1.000001,1.000001,1.000001
min,-0.481941,-2.143993,-0.4610467
25%,-0.4210138,-0.7561171,-0.4610467
50%,-0.3143912,0.0819078,-0.2790326
75%,-0.1011459,0.5238603,0.08499573
max,5.565085,10.9333,25.74899


---
## STEP 5: SELECT TOP PRODUCTS FOR DEMO

In [10]:
# Use top 10 products by number of events for cleaner demo
top_products = survival_df[survival_df['EVENT'] == 1].groupby('StockCode').size().nlargest(10).index
model_df = survival_df[survival_df['StockCode'].isin(top_products)].copy()

print(f"Using top {len(top_products)} products for model:")
print(top_products.tolist())
print(f"Total records: {len(model_df):,}")

Using top 10 products for model:
['85123A', '85099B', '22423', '20725', '84879', 'POST', '21212', '22383', '20727', '21232']
Total records: 26,072


---
## STEP 6: TRAIN/TEST SPLIT

In [11]:
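# Note: this splits individual purchase intervals at random, so the same customer
# (and customer-product pair) can appear in both train and test.
train_df, test_df = train_test_split(model_df, test_size=0.2, random_state=42)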

print(f"Train/Test Split:")
print(f"   Train: {len(train_df):,} records")
print(f"   Test:  {len(test_df):,} records")

Train/Test Split:
   Train: 20,857 records
   Test:  5,215 records
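
A row-level split can place intervals from the same customer on both sides. If the goal is to measure performance on unseen customers, splitting by customer is one option. The sketch below uses only NumPy and pandas; `split_by_customer` and the toy frame are illustrative, and scikit-learn's `GroupShuffleSplit` is an off-the-shelf alternative.

```python
import numpy as np
import pandas as pd


def split_by_customer(df, test_size=0.2, seed=42):
    """Hold out a random subset of customers; all their rows go to the test set."""
    rng = np.random.default_rng(seed)
    customers = df['CustomerID'].unique()
    n_test = max(1, int(round(len(customers) * test_size)))
    test_customers = rng.choice(customers, size=n_test, replace=False)
    is_test = df['CustomerID'].isin(test_customers)
    return df[~is_test], df[is_test]


# Toy data: 10 customers with 3 intervals each
toy_model_df = pd.DataFrame({
    'CustomerID': np.repeat(np.arange(10), 3),
    'DURATION_DAYS': np.tile([10, 20, 30], 10),
})
toy_train, toy_test = split_by_customer(toy_model_df)
print(len(toy_train), len(toy_test))
```

The trade-off is that test-set size now varies with how many rows the held-out customers happen to have, so it will not be exactly 20% of records on real data.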


---
## STEP 7: TRAIN STRATIFIED COX MODEL

In [12]:
print("="*70)
print("TRAINING STRATIFIED COX PROPORTIONAL HAZARDS MODEL")
print("="*70)

# Initialize model with small L2 penalty for stability
cph_stratified = CoxPHFitter(penalizer=0.01)

# Fit model with STRATIFICATION by product
print("\nKEY: Using strata=['StockCode']")
print("   -> Each product gets its own baseline hazard h0_product(t)")
print("   -> Customer features (beta) are shared across all products")

cph_stratified.fit(
    train_df[feature_cols + ['DURATION_DAYS', 'EVENT', 'StockCode']],
    duration_col='DURATION_DAYS',
    event_col='EVENT',
    strata=['StockCode'],  # STRATIFICATION
    show_progress=True
)

print("\nModel trained!")

TRAINING STRATIFIED COX PROPORTIONAL HAZARDS MODEL

KEY: Using strata=['StockCode']
   -> Each product gets its own baseline hazard h0_product(t)
   -> Customer features (beta) are shared across all products
Iteration 1: norm_delta = 1.84e+00, step_size = 0.9500, log_lik = -95428.63590, newton_decrement = 7.87e+03, seconds_since_start = 0.4
Iteration 2: norm_delta = 3.04e+00, step_size = 0.9500, log_lik = -120403.78774, newton_decrement = 5.73e+04, seconds_since_start = 0.7
Iteration 3: norm_delta = 7.76e+00, step_size = 0.9500, log_lik = -108756.68722, newton_decrement = 4.57e+04, seconds_since_start = 0.8
Iteration 4: norm_delta = 7.05e-01, step_size = 0.2327, log_lik = -94846.16476, newton_decrement = 2.91e+03, seconds_since_start = 1.0
Iteration 5: norm_delta = 4.42e-01, step_size = 0.2965, log_lik = -93380.03471, newton_decrement = 1.33e+03, seconds_since_start = 1.1
Iteration 6: norm_delta = 1.95e-01, step_size = 0.5011, log_lik = -92383.76186, newton_decrement = 3.19e+02, second

In [13]:
print("="*70)
print("MODEL SUMMARY")
print("="*70)
cph_stratified.print_summary()

MODEL SUMMARY


model,lifelines.CoxPHFitter
duration col,'DURATION_DAYS'
event col,'EVENT'
penalizer,0.01
l1 ratio,0.0
strata,StockCode
baseline estimation,breslow
number of observations,20857
number of events observed,13497
partial log-likelihood,-92055.76

covariate,coef,exp(coef),se(coef),coef lower 95%,coef upper 95%,exp(coef) lower 95%,exp(coef) upper 95%,cmp to,z,p,-log2(p)
FREQUENCY,-0.01,0.99,0.01,-0.04,0.02,0.96,1.02,0.0,-0.87,0.39,1.38
LOG_MONETARY,0.06,1.06,0.01,0.05,0.08,1.05,1.08,0.0,7.13,<0.005,39.85
PRODUCT_FREQUENCY,0.35,1.41,0.01,0.34,0.36,1.4,1.43,0.0,65.2,<0.005,inf

Concordance,0.78
Partial AIC,184117.52
log-likelihood ratio test,6745.75 on 3 df
-log2(p) of ll-ratio test,inf


---
## STEP 8: COMPARE WITH NON-STRATIFIED MODEL

In [14]:
print("="*70)
print("COMPARISON: STRATIFIED vs NON-STRATIFIED")
print("="*70)

# Train non-stratified model
cph_unstratified = CoxPHFitter(penalizer=0.01)
cph_unstratified.fit(
    train_df[feature_cols + ['DURATION_DAYS', 'EVENT']],
    duration_col='DURATION_DAYS',
    event_col='EVENT'
)

# Evaluate both models using concordance_index_ (train) and score with scoring_method (test)
train_c_strat = cph_stratified.concordance_index_
train_c_unstrat = cph_unstratified.concordance_index_

# For test set evaluation, use scoring_method='concordance_index'
test_c_strat = cph_stratified.score(
    test_df[feature_cols + ['DURATION_DAYS', 'EVENT', 'StockCode']], 
    scoring_method='concordance_index'
)
test_c_unstrat = cph_unstratified.score(
    test_df[feature_cols + ['DURATION_DAYS', 'EVENT']], 
    scoring_method='concordance_index'
)

print(f"\nSTRATIFIED MODEL:")
print(f"   Train C-index: {train_c_strat:.4f}")
print(f"   Test C-index:  {test_c_strat:.4f}")

print(f"\nNON-STRATIFIED MODEL:")
print(f"   Train C-index: {train_c_unstrat:.4f}")
print(f"   Test C-index:  {test_c_unstrat:.4f}")

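# Difference in test C-index, in percentage points.
# Note: 0.5 is chance level, so a C-index below 0.5 means the ranking is worse
# than random and signals a problem worth investigating, not just a weak baseline.
improvement = (test_c_strat - test_c_unstrat) * 100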
print(f"\nWINNER: {'STRATIFIED' if improvement > 0 else 'NON-STRATIFIED'}")
print(f"   Improvement: {improvement:.2f} percentage points")

COMPARISON: STRATIFIED vs NON-STRATIFIED

STRATIFIED MODEL:
   Train C-index: 0.7811
   Test C-index:  0.7824

NON-STRATIFIED MODEL:
   Train C-index: 0.3509
   Test C-index:  0.3499

WINNER: STRATIFIED
   Improvement: 43.26 percentage points
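
To make the C-index values concrete, here is a naive O(n^2) sketch of Harrell's concordance index for right-censored data. It is an illustration, not the implementation lifelines uses: `harrell_c` is a hypothetical name, tied durations are simply skipped, and a higher `risk` is taken to mean an earlier expected repurchase.

```python
def harrell_c(durations, events, risk):
    """Share of comparable pairs where the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the shorter duration is an event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable if comparable else float('nan')


toy_durations = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
c_good = harrell_c(toy_durations, toy_events, risk=[4, 3, 2, 1])
c_bad = harrell_c(toy_durations, toy_events, risk=[1, 2, 3, 4])
c_flat = harrell_c(toy_durations, toy_events, risk=[1, 1, 1, 1])
print(c_good, c_bad, c_flat)
```

A perfectly ordered score gives 1.0, a constant score gives 0.5, and a perfectly inverted score gives 0.0, which is why a value well below 0.5 suggests the scores are ranking records in the wrong direction.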


In [15]:
print("Why stratified is better:")
print("   -> Captures product-specific repurchase cycles")
print("   -> Milk (7 days) vs Shampoo (30 days) vs Coat (365 days)")
print("   -> More accurate predictions for diverse product portfolios")

Why stratified is better:
   -> Captures product-specific repurchase cycles
   -> Milk (7 days) vs Shampoo (30 days) vs Coat (365 days)
   -> More accurate predictions for diverse product portfolios


---
## STEP 9: MAKE PREDICTIONS

In [16]:
print("="*70)
print("PREDICTING CUSTOMER REPURCHASE RISK")
print("="*70)

# Get baseline survival - check column format
baseline_survival = cph_stratified.baseline_survival_
print(f"\nBaseline survival columns (first 3): {list(baseline_survival.columns[:3])}")
print(f"Column type: {type(baseline_survival.columns[0])}")

# For stratified models, columns may be tuples like ('85123A',) or just strings
# Convert to dict for easier lookup
if isinstance(baseline_survival.columns[0], tuple):
    # Columns are tuples - extract the StockCode value
    baseline_dict = {col[0]: baseline_survival[col] for col in baseline_survival.columns}
    print("Note: Baseline columns are tuples, extracting first element")
else:
    baseline_dict = {col: baseline_survival[col] for col in baseline_survival.columns}

results = []

for product in top_products[:3]:  # Demo with 3 products
    print(f"\nProduct: {product}")
    print("-" * 50)
    
    product_str = str(product)
    
    # Get customers who purchased this product
    product_df = test_df[test_df['StockCode'] == product_str].copy()
    
    if len(product_df) == 0:
        print("   No test data for this product")
        continue
    
    # Check if baseline exists for this product
    if product_str not in baseline_dict:
        print(f"   Warning: No baseline survival for product {product_str}")
        print(f"   Available products: {list(baseline_dict.keys())[:5]}...")
        continue
    
    # Predict partial hazard (risk scores) - exp(X*beta)
    partial_hazard = cph_stratified.predict_partial_hazard(product_df[feature_cols])
    product_df['RISK_SCORE'] = partial_hazard.values
    
    # Get baseline survival for this product
    baseline_surv_product = baseline_dict[product_str]
    
    # Calculate survival probabilities: S(t|X) = S_0(t)^exp(X*beta)
    for horizon in [30, 60, 90]:
        valid_times = baseline_surv_product.index[baseline_surv_product.index <= horizon]
        if len(valid_times) > 0:
            # Baseline survival is a step function: use the last time <= horizon
            base_surv_at_t = baseline_surv_product.loc[valid_times.max()]
        else:
            # No baseline time at or before the horizon, so S_0(horizon) = 1
            base_surv_at_t = 1.0
        survival_probs = base_surv_at_t ** partial_hazard.values
        product_df[f'PROB_{horizon}D'] = 1 - survival_probs
    
    # Sort by risk (highest first)
    product_df = product_df.sort_values('RISK_SCORE', ascending=False)
    
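    # Show the top 5 rows. Each row is one purchase interval, so a customer with
    # several intervals in the test set can appear more than once.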
    print("\nTop 5 customers most likely to repurchase:")
    for idx, row in product_df.head(5).iterrows():
        print(f"\n   Customer {int(row['CustomerID'])}")
        print(f"   - Risk Score: {row['RISK_SCORE']:.2f}")
        print(f"   - 30-day prob: {row['PROB_30D']:.1%}")
        print(f"   - 60-day prob: {row['PROB_60D']:.1%}")
        print(f"   - 90-day prob: {row['PROB_90D']:.1%}")
        print(f"   - Actually repurchased: {'Yes' if row['EVENT'] == 1 else 'No (censored)'}")
    
    # Store results
    for _, row in product_df.head(10).iterrows():
        results.append({
            'StockCode': product,
            'CustomerID': row['CustomerID'],
            'RISK_SCORE': row['RISK_SCORE'],
            'PROB_30D': row['PROB_30D'],
            'PROB_60D': row['PROB_60D'],
            'PROB_90D': row['PROB_90D'],
            'EVENT': row['EVENT']
        })

results_df = pd.DataFrame(results)
print(f"\n\nTotal predictions generated: {len(results_df)}")

PREDICTING CUSTOMER REPURCHASE RISK

Baseline survival columns (first 3): ['20725', '20727', '21212']
Column type: <class 'str'>

Product: 85123A
--------------------------------------------------

Top 5 customers most likely to repurchase:

   Customer 14911
   - Risk Score: 21.29
   - 30-day prob: 98.9%
   - 60-day prob: 100.0%
   - 90-day prob: 100.0%
   - Actually repurchased: Yes

   Customer 14911
   - Risk Score: 21.29
   - 30-day prob: 98.9%
   - 60-day prob: 100.0%
   - 90-day prob: 100.0%
   - Actually repurchased: Yes

   Customer 14911
   - Risk Score: 21.29
   - 30-day prob: 98.9%
   - 60-day prob: 100.0%
   - 90-day prob: 100.0%
   - Actually repurchased: Yes

   Customer 14911
   - Risk Score: 21.29
   - 30-day prob: 98.9%
   - 60-day prob: 100.0%
   - 90-day prob: 100.0%
   - Actually repurchased: Yes

   Customer 14911
   - Risk Score: 21.29
   - 30-day prob: 98.9%
   - 60-day prob: 100.0%
   - 90-day prob: 100.0%
   - Actually repurchased: Yes

Product: 85099B
-------
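
The probability calculation in the cell above, S(t|X) = S_0(t)^exp(X*beta), can be checked on small numbers. The sketch below is a standalone NumPy version with made-up baseline values; `repurchase_probability` is a hypothetical helper, not a lifelines function.

```python
import numpy as np


def repurchase_probability(times, baseline_surv, partial_hazard, horizon):
    """P(repurchase by `horizon`) = 1 - S_0(horizon) ** partial_hazard."""
    times = np.asarray(times, dtype=float)          # sorted baseline time points
    baseline_surv = np.asarray(baseline_surv, dtype=float)
    # Index of the last baseline time <= horizon (-1 if there is none)
    idx = np.searchsorted(times, horizon, side='right') - 1
    s0 = baseline_surv[idx] if idx >= 0 else 1.0    # nothing has happened before the first time
    return 1.0 - s0 ** np.asarray(partial_hazard, dtype=float)


toy_times = [10, 20, 40]
toy_baseline = [0.9, 0.7, 0.5]
# Two customers: average risk (1.0) and twice the average hazard (2.0)
p_30 = repurchase_probability(toy_times, toy_baseline, [1.0, 2.0], horizon=30)
p_5 = repurchase_probability(toy_times, toy_baseline, [1.0, 2.0], horizon=5)
print(p_30, p_5)
```

With a baseline survival of 0.7 at day 30, doubling the hazard lowers survival to 0.7 squared, so the repurchase probability rises from 0.30 to 0.51. Large risk scores push the probability toward 100% quickly, which matches the saturated values in the output above.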

---
## STEP 10: BUSINESS INSIGHTS

In [17]:
print("="*70)
print("BUSINESS INSIGHTS & ACTIONABLE RECOMMENDATIONS")
print("="*70)

# 1. Feature importance
print("\n1. MOST IMPORTANT FACTORS:")
coef_summary = cph_stratified.summary[['coef', 'exp(coef)', 'p']].sort_values('coef', key=abs, ascending=False)
print(coef_summary)

BUSINESS INSIGHTS & ACTIONABLE RECOMMENDATIONS

1. MOST IMPORTANT FACTORS:
                       coef  exp(coef)             p
covariate                                           
PRODUCT_FREQUENCY  0.346321   1.413857  0.000000e+00
LOG_MONETARY       0.062093   1.064061  1.010849e-12
FREQUENCY         -0.012067   0.988006  3.854350e-01


In [18]:
print("Interpretation:")
for feature in coef_summary.index:
    coef = coef_summary.loc[feature, 'coef']
    hr = coef_summary.loc[feature, 'exp(coef)']
    p = coef_summary.loc[feature, 'p']
    
    # Features were standardized, so "per unit" below means per one standard deviation
    if p < 0.05:
        direction = "INCREASES" if coef > 0 else "DECREASES"
        print(f"\n  {feature}:")
        print(f"  - {direction} repurchase risk by {abs((hr-1)*100):.1f}% per unit")
        print(f"  - Hazard Ratio: {hr:.3f}")
        print(f"  - Statistically significant (p={p:.4f})")

Interpretation:

  PRODUCT_FREQUENCY:
  - INCREASES repurchase risk by 41.4% per unit
  - Hazard Ratio: 1.414
  - Statistically significant (p=0.0000)

  LOG_MONETARY:
  - INCREASES repurchase risk by 6.4% per unit
  - Hazard Ratio: 1.064
  - Statistically significant (p=0.0000)
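
Because a Cox coefficient acts on the log-hazard scale, its effect compounds multiplicatively. The short sketch below reuses the `PRODUCT_FREQUENCY` coefficient reported above; `hazard_ratio` is an illustrative helper.

```python
import math

coef_product_frequency = 0.346321  # PRODUCT_FREQUENCY coefficient from the summary above


def hazard_ratio(coef, delta_sd=1.0):
    """Hazard multiplier for a change of `delta_sd` standard deviations."""
    return math.exp(coef * delta_sd)


hr_one_sd = hazard_ratio(coef_product_frequency)
hr_two_sd = hazard_ratio(coef_product_frequency, 2.0)
print(f"+1 SD: x{hr_one_sd:.3f}, +2 SD: x{hr_two_sd:.3f}")
```

A two-standard-deviation increase multiplies the hazard by the square of the one-standard-deviation ratio, not by twice the percentage increase.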


In [19]:
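# 2. Customer segmentation
# Note: results_df holds only the 10 highest-risk test rows per product (see Step 9),
# so the tiers below describe that pre-selected group, not the full customer base.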
print("\n2. CUSTOMER TARGETING STRATEGY:")

high_intent = results_df[results_df['PROB_30D'] > 0.6]
medium_intent = results_df[(results_df['PROB_30D'] > 0.3) & (results_df['PROB_30D'] <= 0.6)]
low_intent = results_df[results_df['PROB_30D'] <= 0.3]

print(f"\n   HIGH INTENT (>60% in 30 days): {len(high_intent)} customers")
print(f"      -> Send gentle reminder email")
print(f"      -> No discount needed (already primed)")

print(f"\n   MEDIUM INTENT (30-60% in 30 days): {len(medium_intent)} customers")
print(f"      -> Offer 10-15% discount")
print(f"      -> Personalized recommendations")

print(f"\n   LOW INTENT (<30% in 30 days): {len(low_intent)} customers")
print(f"      -> Skip for now")
print(f"      -> Try re-engagement later")


2. CUSTOMER TARGETING STRATEGY:

   HIGH INTENT (>60% in 30 days): 30 customers
      -> Send gentle reminder email
      -> No discount needed (already primed)

   MEDIUM INTENT (30-60% in 30 days): 0 customers
      -> Offer 10-15% discount
      -> Personalized recommendations

   LOW INTENT (<30% in 30 days): 0 customers
      -> Skip for now
      -> Try re-engagement later
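
The same three tiers can be assigned with `pd.cut` once every scored record is available, not just the top rows per product. The sketch below first collapses repeated rows to one per customer and product; `segment_customers`, the choice to keep the highest probability, and the toy scores are assumptions made for illustration.

```python
import pandas as pd


def segment_customers(scored):
    """One row per (product, customer) with a LOW / MEDIUM / HIGH label."""
    # Assumption: when a customer has several intervals, keep the highest probability
    per_customer = (scored
                    .groupby(['StockCode', 'CustomerID'], as_index=False)['PROB_30D']
                    .max())
    per_customer['SEGMENT'] = pd.cut(
        per_customer['PROB_30D'],
        bins=[0.0, 0.3, 0.6, 1.0],          # same thresholds as the cell above
        labels=['LOW', 'MEDIUM', 'HIGH'],
        include_lowest=True,                 # so a probability of exactly 0 is LOW
    )
    return per_customer


toy_scored = pd.DataFrame({
    'StockCode': ['A', 'A', 'A', 'A'],
    'CustomerID': [1, 1, 2, 3],
    'PROB_30D': [0.9, 0.7, 0.45, 0.1],
})
segments = segment_customers(toy_scored)
print(segments)
```

Applied to the full scored test set, this would give non-empty medium and low tiers whenever such customers exist, instead of the 30 / 0 / 0 split produced by a pre-filtered list.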


In [20]:
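# 3. Demand forecast
# Summing probabilities gives the expected number of repurchases among the rows
# in results_df only (top 10 rows for each of 3 products), not total demand.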
print("\n3. DEMAND FORECASTING:")
expected_30d = results_df['PROB_30D'].sum()
print(f"\n   Expected repurchases in 30 days: {expected_30d:.0f}")
print(f"   -> Use for inventory planning")


3. DEMAND FORECASTING:

   Expected repurchases in 30 days: 28
   -> Use for inventory planning
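
Summing probabilities gives an expected count; a rough spread can be attached to it as well. The sketch below assumes the repurchase outcomes are independent across rows, which is a simplification, and `expected_repurchases` with its toy inputs is illustrative.

```python
import numpy as np


def expected_repurchases(probs):
    """Expected count and standard deviation of a sum of independent Bernoulli outcomes."""
    probs = np.asarray(probs, dtype=float)
    mean = probs.sum()                               # linearity of expectation
    std = np.sqrt((probs * (1.0 - probs)).sum())     # valid only under independence
    return mean, std


toy_mean, toy_std = expected_repurchases([0.9, 0.5, 0.1])
print(f"expected: {toy_mean:.2f} +/- {toy_std:.2f}")
```

For inventory planning the spread matters as much as the mean, since probabilities near 0% or 100% contribute almost no uncertainty while those near 50% contribute the most.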


---
## STEP 11: SAVE RESULTS

In [21]:
print("="*70)
print("SAVING RESULTS")
print("="*70)

# Save predictions
output_path = 'repurchase_predictions.csv'
results_df.to_csv(output_path, index=False)
print(f"\nPredictions saved to: {output_path}")

# Create summary report
summary = {
    'Total Records': len(survival_df),
    'Products Analyzed': len(top_products),
    'Train Size': len(train_df),
    'Test Size': len(test_df),
    'Model Type': 'Stratified Cox PH',
    'Train C-index': train_c_strat,
    'Test C-index': test_c_strat,
    'Improvement vs Non-Stratified': f"{improvement:.2f} pp"  # percentage points, not percent
}

summary_df = pd.DataFrame([summary])
summary_df.to_csv('model_summary.csv', index=False)
print(f"Summary saved to: model_summary.csv")

SAVING RESULTS

Predictions saved to: repurchase_predictions.csv
Summary saved to: model_summary.csv


In [22]:
print("="*70)
print("ANALYSIS COMPLETE!")
print("="*70)

print("""
Next steps:
1. Review predictions in repurchase_predictions.csv
2. Implement targeting strategy for high/medium/low intent customers
3. Monitor actual vs predicted repurchase rates
4. Refine features and retrain model with new data
5. Deploy to production for real-time scoring
""")

ANALYSIS COMPLETE!

Next steps:
1. Review predictions in repurchase_predictions.csv
2. Implement targeting strategy for high/medium/low intent customers
3. Monitor actual vs predicted repurchase rates
4. Refine features and retrain model with new data
5. Deploy to production for real-time scoring

