# 01 - Exploratory Data Analysis: Customer Lifetime Value (CLV)

**Dataset:** UCI Online Retail II  
**Source:** UK-based online gift retailer, transactions 01 Dec 2009 -- 09 Dec 2011  
**Objective:** Predict 12-month customer lifetime value (CLV) using a temporal split strategy.

---

## Temporal Split Strategy

| Window | Period | Purpose |
|--------|--------|---------|
| **Observation** | Dec 2009 -- Nov 2010 (12 months) | Compute RFM features and behavioral signals |
| **Prediction** | Dec 2010 -- 09 Dec 2011 (~12 months; the data ends on 09 Dec 2011) | Compute target: `clv_12m = sum(Quantity * Price)` per customer |

## CLV Definition

For each customer active in the observation window:
- **clv_12m** = total revenue attributed to that customer in the prediction window
- Customers with zero purchases in the prediction window have `clv_12m = 0` (churned)
- **Cold-start customers** (fewer than 2 purchases in observation) receive a median CLV prediction at inference time
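
As a minimal sketch of the definition above, the label could be built as follows. The toy transactions are made up for illustration, the window boundaries mirror the table, and the actual pipeline in `src/data_pipeline.py` may differ (for example in how it filters cancellations).

```python
import pandas as pd

# Toy transactions: column names mirror the raw CSV, values are invented.
tx = pd.DataFrame({
    'Customer ID': [1, 1, 1, 2, 2, 3],
    'InvoiceDate': pd.to_datetime([
        '2010-01-15', '2010-06-01', '2011-03-10',  # customer 1: both windows
        '2010-02-20', '2010-09-05',                # customer 2: observation only
        '2011-05-01',                              # customer 3: prediction only
    ]),
    'Quantity': [2, 1, 4, 3, 1, 5],
    'Price': [10.0, 5.0, 2.5, 4.0, 8.0, 1.0],
})
tx['Revenue'] = tx['Quantity'] * tx['Price']

OBS_START, OBS_END = pd.Timestamp('2009-12-01'), pd.Timestamp('2010-11-30 23:59:59')
PRED_START, PRED_END = pd.Timestamp('2010-12-01'), pd.Timestamp('2011-12-31 23:59:59')

in_obs = tx['InvoiceDate'].between(OBS_START, OBS_END)
in_pred = tx['InvoiceDate'].between(PRED_START, PRED_END)

# Label population: customers active in the observation window.
obs_customers = tx.loc[in_obs, 'Customer ID'].unique()

# Target: prediction-window revenue; customers with no purchases there get 0.
clv_12m = (
    tx.loc[in_pred]
      .groupby('Customer ID')['Revenue'].sum()
      .reindex(obs_customers, fill_value=0.0)
      .rename('clv_12m')
)
print(clv_12m)
```

Customer 3 never appears in the observation window, so it receives no label at all, whereas customer 2 is labeled with `clv_12m = 0`.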

## Sections

1. Introduction & Objectives  
2. Load & Inspect Raw Data  
3. Data Quality Assessment  
4. Temporal Coverage  
5. Customer Purchase Patterns  
6. CLV Label Analysis  
7. RFM Segmentation Preview  
8. Feature Correlations  
9. Key Findings Summary

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 11

# Paths
PROJECT_ROOT = Path.cwd().parent
SHARED_RAW = PROJECT_ROOT.parent / 'shared' / 'data' / 'raw'
PROCESSED_DIR = PROJECT_ROOT / 'data' / 'processed'
CSV_PATH = SHARED_RAW / 'online_retail_ii.csv'

# Temporal split boundaries
OBS_START = pd.Timestamp('2009-12-01')
OBS_END = pd.Timestamp('2010-11-30 23:59:59')
PRED_START = pd.Timestamp('2010-12-01')
PRED_END = pd.Timestamp('2011-12-31 23:59:59')

print(f'Project root:  {PROJECT_ROOT}')
print(f'Raw CSV path:  {CSV_PATH}')
print(f'File exists:   {CSV_PATH.exists()}')
print(f'Processed dir: {PROCESSED_DIR}')

---
## 2. Load & Inspect Raw Data

**Intent:** Load the raw CSV, inspect its schema, shape, and basic statistics.

In [None]:
df = pd.read_csv(CSV_PATH, parse_dates=['InvoiceDate'])

print(f'Shape: {df.shape[0]:,} rows x {df.shape[1]} columns')
print(f'Date range: {df["InvoiceDate"].min()} -- {df["InvoiceDate"].max()}')
print(f'\nColumn dtypes:')
print(df.dtypes)
print(f'\n--- First 5 rows ---')
df.head()

In [None]:
print('=== Numeric Summary ===')
df.describe()

In [None]:
print('=== Null Percentage Per Column ===')
null_pct = df.isnull().mean().mul(100).round(2)
null_pct_df = null_pct.to_frame('null_pct').assign(
    null_count=df.isnull().sum()
)[['null_count', 'null_pct']]
print(null_pct_df)

---
## 3. Data Quality Assessment

**Intent:** Identify data quality issues that the pipeline must handle: null customer IDs, cancellations, non-product stock codes, invalid prices/quantities, and duplicates.

In [None]:
# --- Null Customer ID ---
null_cid = df['Customer ID'].isna().sum()
null_cid_pct = df['Customer ID'].isna().mean() * 100
print(f'Rows with null Customer ID: {null_cid:,} ({null_cid_pct:.1f}%)')
print(f'Unique Customer IDs (non-null): {df["Customer ID"].nunique():,}')
print()

# How much revenue is lost by dropping null-CID rows?
df['Revenue'] = df['Quantity'] * df['Price']
rev_total = df['Revenue'].sum()
rev_null_cid = df.loc[df['Customer ID'].isna(), 'Revenue'].sum()
print(f'Total revenue (all rows): {rev_total:,.0f}')
print(f'Revenue in null-CID rows: {rev_null_cid:,.0f} ({rev_null_cid / rev_total * 100:.1f}%)')

In [None]:
# --- Cancellations (Invoice starts with 'C') ---
is_cancel = df['Invoice'].astype(str).str.startswith('C')
cancel_count = is_cancel.sum()
cancel_pct = is_cancel.mean() * 100
print(f'Cancellation rows: {cancel_count:,} ({cancel_pct:.2f}%)')
print()

# Revenue impact of cancellations
cancel_rev = df.loc[is_cancel, 'Revenue'].sum()
print(f'Revenue in cancellation rows: {cancel_rev:,.0f}')
print('(Negative revenue = returned value)')

In [None]:
# --- Non-product StockCodes ---
# Common non-product codes: POST, DOT, M, BANK CHARGES, PADS, etc.
non_product_mask = (
    df['StockCode'].astype(str).str.match(r'^[A-Za-z]')
    & ~df['StockCode'].astype(str).str.match(r'^C\d')  # exclude normal codes starting with C + digits
)
non_product_codes = df.loc[non_product_mask, 'StockCode'].value_counts().head(15)
print(f'Rows with non-product StockCodes: {non_product_mask.sum():,} ({non_product_mask.mean()*100:.2f}%)')
print(f'\nTop non-product StockCodes:')
print(non_product_codes)

In [None]:
# --- Price <= 0 or Quantity <= 0 ---
bad_price = (df['Price'] <= 0).sum()
bad_qty = (df['Quantity'] <= 0).sum()
print(f'Rows with Price <= 0:    {bad_price:,} ({bad_price / len(df) * 100:.2f}%)')
print(f'Rows with Quantity <= 0: {bad_qty:,} ({bad_qty / len(df) * 100:.2f}%)')
print()

# Overlap between non-positive quantity and cancellations
neg_qty_and_cancel = ((df['Quantity'] <= 0) & is_cancel).sum()
neg_qty_only = ((df['Quantity'] <= 0) & ~is_cancel).sum()
print(f'Quantity <= 0 AND cancellation: {neg_qty_and_cancel:,}')
print(f'Quantity <= 0 but NOT cancellation: {neg_qty_only:,}')

In [None]:
# --- Duplicates ---
n_dups = df.duplicated().sum()
print(f'Exact duplicate rows: {n_dups:,} ({n_dups / len(df) * 100:.2f}%)')

In [None]:
# --- Data Quality Summary Table ---
quality_summary = pd.DataFrame({
    'Issue': [
        'Null Customer ID',
        'Cancellations (Invoice starts with C)',
        'Non-product StockCodes',
        'Price <= 0',
        'Quantity <= 0',
        'Exact duplicates'
    ],
    'Row Count': [
        null_cid,
        cancel_count,
        non_product_mask.sum(),
        bad_price,
        bad_qty,
        n_dups
    ],
    'Pct of Total': [
        f'{null_cid_pct:.1f}%',
        f'{cancel_pct:.2f}%',
        f'{non_product_mask.mean()*100:.2f}%',
        f'{bad_price / len(df) * 100:.2f}%',
        f'{bad_qty / len(df) * 100:.2f}%',
        f'{n_dups / len(df) * 100:.2f}%'
    ]
})
print('=== Data Quality Summary ===')
print(quality_summary.to_string(index=False))
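
The issues tallied above could be handled by a single cleaning step. The sketch below is a hypothetical helper (`clean_transactions` is not part of the project code) that applies the same masks used in this section to a small invented sample; the real pipeline may order or define these filters differently.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep valid product purchases with a known customer."""
    invoice = df['Invoice'].astype(str)
    stock = df['StockCode'].astype(str)
    # Same non-product rule as the notebook: starts with a letter, except C + digit.
    non_product = stock.str.match(r'^[A-Za-z]') & ~stock.str.match(r'^C\d')
    keep = (
        df['Customer ID'].notna()         # customer-level modeling needs an ID
        & ~invoice.str.startswith('C')    # drop cancellations
        & ~non_product                    # drop codes such as POST, DOT, M
        & (df['Quantity'] > 0)
        & (df['Price'] > 0)
    )
    return df.loc[keep].drop_duplicates().reset_index(drop=True)

# Invented sample covering each issue once.
raw = pd.DataFrame({
    'Invoice':     ['489434', '489434', 'C489449', '489435', '489436', '489437', '489438'],
    'StockCode':   ['85048', '85048', '22087', 'POST', '21232', '21232', '84879'],
    'Quantity':    [12, 12, -12, 1, 5, 5, 8],
    'Price':       [6.95, 6.95, 2.95, 18.0, 0.0, 1.25, 1.69],
    'Customer ID': [13085.0, 13085.0, 16321.0, 13085.0, 13078.0, None, 13078.0],
})
clean = clean_transactions(raw)
print(clean)
```

Only the first row (once, after de-duplication) and the last row survive; the other rows each trip exactly one filter.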

---
## 4. Temporal Coverage

**Intent:** Visualize transaction volume over time and verify the observation/prediction window boundaries. Understand customer cohort arrival patterns.

In [None]:
# Monthly transaction volume
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
monthly_counts = df.groupby('YearMonth').size()

fig, ax = plt.subplots(figsize=(14, 5))
x_dates = monthly_counts.index.to_timestamp()
ax.bar(x_dates, monthly_counts.values, width=25, color='steelblue', edgecolor='white', alpha=0.85)

# Mark observation and prediction windows
ax.axvline(OBS_START, color='green', linestyle='--', linewidth=2, label='Observation Start (Dec 2009)')
ax.axvline(OBS_END, color='orange', linestyle='--', linewidth=2, label='Observation End (Nov 2010)')
ax.axvline(PRED_START, color='red', linestyle='--', linewidth=2, label='Prediction Start (Dec 2010)')
ax.axvline(PRED_END, color='darkred', linestyle='--', linewidth=2, label='Prediction End (Dec 2011)')

ax.set_title('Monthly Transaction Volume with Temporal Split Boundaries', fontsize=14)
ax.set_xlabel('Month')
ax.set_ylabel('Transaction Count')
ax.legend(loc='upper left', fontsize=9)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

print('Monthly transaction counts:')
print(monthly_counts)

In [None]:
# Monthly revenue (positive transactions only)
df_pos = df[(df['Quantity'] > 0) & (df['Price'] > 0) & (~is_cancel)].copy()
df_pos['Revenue'] = df_pos['Quantity'] * df_pos['Price']
monthly_revenue = df_pos.groupby(df_pos['InvoiceDate'].dt.to_period('M'))['Revenue'].sum()

fig, ax = plt.subplots(figsize=(14, 5))
x_dates_rev = monthly_revenue.index.to_timestamp()
ax.bar(x_dates_rev, monthly_revenue.values, width=25, color='mediumseagreen', edgecolor='white', alpha=0.85)

ax.axvline(OBS_END, color='orange', linestyle='--', linewidth=2, label='Split boundary')

ax.set_title('Monthly Revenue (Positive Transactions Only)', fontsize=14)
ax.set_xlabel('Month')
ax.set_ylabel('Revenue')
ax.legend()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Customer cohort analysis: month of first appearance
df_with_cust = df.dropna(subset=['Customer ID']).copy()
first_purchase = df_with_cust.groupby('Customer ID')['InvoiceDate'].min()
cohort_month = first_purchase.dt.to_period('M')
cohort_counts = cohort_month.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(14, 5))
x_cohort = cohort_counts.index.to_timestamp()
ax.bar(x_cohort, cohort_counts.values, width=25, color='coral', edgecolor='white', alpha=0.85)

ax.axvline(OBS_END, color='orange', linestyle='--', linewidth=2, label='Split boundary')

ax.set_title('New Customer Cohorts by Month (First Purchase Date)', fontsize=14)
ax.set_xlabel('Month')
ax.set_ylabel('New Customers')
ax.legend()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

# Observation-period vs prediction-period new customers
obs_custs = first_purchase[first_purchase <= OBS_END].count()
pred_only_custs = first_purchase[first_purchase > OBS_END].count()
print(f'Customers first seen in observation window: {obs_custs:,}')
print(f'Customers first seen in prediction window only: {pred_only_custs:,}')
print(f'(Prediction-only customers will NOT be in training data)')

---
## 5. Customer Purchase Patterns

**Intent:** Understand the distributions of frequency, monetary value, and recency in the observation window. These form the basis of RFM features.

In [None]:
# Filter to observation window, valid transactions only
obs_mask = (
    (df_with_cust['InvoiceDate'] >= OBS_START)
    & (df_with_cust['InvoiceDate'] <= OBS_END)
    & (df_with_cust['Quantity'] > 0)
    & (df_with_cust['Price'] > 0)
    & (~df_with_cust['Invoice'].astype(str).str.startswith('C'))
)
obs_df = df_with_cust[obs_mask].copy()
obs_df['Revenue'] = obs_df['Quantity'] * obs_df['Price']

print(f'Observation-window transactions: {len(obs_df):,}')
print(f'Unique customers in observation window: {obs_df["Customer ID"].nunique():,}')

# Compute per-customer RFM-like stats
ref_date = OBS_END + pd.Timedelta(days=1)
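customer_stats = obs_df.groupby('Customer ID').agg(
    frequency=('Invoice', 'nunique'),
    monetary=('Revenue', 'sum'),
    recency_days=('InvoiceDate', lambda x: (ref_date - x.max()).days),
    n_items=('Quantity', 'sum')
).reset_index()
# Average order value = revenue per invoice (a plain mean of Revenue averages line items, not orders)
customer_stats['avg_order_value'] = customer_stats['monetary'] / customer_stats['frequency']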

print(f'\n=== Customer-Level Summary (Observation Window) ===')
print(customer_stats[['frequency', 'monetary', 'recency_days']].describe().round(2))

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# (a) Frequency distribution
ax = axes[0, 0]
freq_clipped = customer_stats['frequency'].clip(upper=customer_stats['frequency'].quantile(0.99))
ax.hist(freq_clipped, bins=50, color='steelblue', edgecolor='white', alpha=0.85)
ax.set_title('Purchase Frequency (Observation Window)', fontsize=12)
ax.set_xlabel('Number of Unique Invoices')
ax.set_ylabel('Customers')
median_freq = customer_stats['frequency'].median()
ax.axvline(median_freq, color='red', linestyle='--', label=f'Median = {median_freq:.0f}')
ax.legend()

# (b) Monetary distribution (log scale)
ax = axes[0, 1]
monetary_pos = customer_stats.loc[customer_stats['monetary'] > 0, 'monetary']
ax.hist(np.log10(monetary_pos), bins=50, color='mediumseagreen', edgecolor='white', alpha=0.85)
ax.set_title('Monetary Value -- log10 (Observation Window)', fontsize=12)
ax.set_xlabel('log10(Total Revenue)')
ax.set_ylabel('Customers')
median_mon = monetary_pos.median()
ax.axvline(np.log10(median_mon), color='red', linestyle='--',
           label=f'Median = {median_mon:,.0f}')
ax.legend()

# (c) Recency distribution
ax = axes[1, 0]
ax.hist(customer_stats['recency_days'], bins=50, color='coral', edgecolor='white', alpha=0.85)
ax.set_title('Recency in Days (Observation Window)', fontsize=12)
ax.set_xlabel('Days Since Last Purchase')
ax.set_ylabel('Customers')
median_rec = customer_stats['recency_days'].median()
ax.axvline(median_rec, color='red', linestyle='--', label=f'Median = {median_rec:.0f} days')
ax.legend()

# (d) Frequency vs Monetary scatter
ax = axes[1, 1]
scatter_data = customer_stats[customer_stats['monetary'] > 0]
ax.scatter(scatter_data['frequency'], scatter_data['monetary'],
           alpha=0.3, s=10, color='mediumpurple')
ax.set_title('Frequency vs Monetary Value', fontsize=12)
ax.set_xlabel('Purchase Frequency')
ax.set_ylabel('Total Revenue')
ax.set_yscale('log')
ax.set_xscale('log')

plt.suptitle('Customer Purchase Patterns -- Observation Window', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

---
## 6. CLV Label Analysis

**Intent:** Analyze the target variable (`clv_12m`) produced by the data pipeline. Understand its distribution, zero-inflation, and relationship to observation-period behavior.

> **Note:** This section requires `data/processed/clv_labels.csv` from the data pipeline. If the pipeline has not run yet, this section will show instructions.

In [None]:
clv_labels_path = PROCESSED_DIR / 'clv_labels.csv'

if clv_labels_path.exists():
    clv_labels = pd.read_csv(clv_labels_path)
    print(f'CLV labels loaded: {len(clv_labels):,} customers')
    print(f'Columns: {clv_labels.columns.tolist()}')
    print()
    print(clv_labels.describe().round(2))
else:
    clv_labels = None
    print('clv_labels.csv not found. Run the data pipeline first:')
    print('  cd m02_clv')
    print('  python src/data_pipeline.py  # run with the project virtualenv active')

In [None]:
if clv_labels is not None:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # (a) Raw CLV distribution
    ax = axes[0]
    clv_pos = clv_labels.loc[clv_labels['clv_12m'] > 0, 'clv_12m']
    ax.hist(clv_pos, bins=100, color='steelblue', edgecolor='white', alpha=0.85)
    ax.set_title('CLV Distribution (CLV > 0 only)', fontsize=12)
    ax.set_xlabel('CLV (12-month revenue)')
    ax.set_ylabel('Customers')
    ax.set_xlim(0, clv_pos.quantile(0.99))
    median_clv = clv_pos.median()
    ax.axvline(median_clv, color='red', linestyle='--',
               label=f'Median = {median_clv:,.0f}')
    ax.legend()

    # (b) Log-scale CLV distribution
    ax = axes[1]
    ax.hist(np.log1p(clv_labels['clv_12m']), bins=100, color='mediumseagreen',
            edgecolor='white', alpha=0.85)
    ax.set_title('CLV Distribution -- log1p scale (all customers)', fontsize=12)
    ax.set_xlabel('log1p(CLV)')
    ax.set_ylabel('Customers')

    plt.tight_layout()
    plt.show()

    # Summary stats
    n_zero = (clv_labels['clv_12m'] == 0).sum()
    n_total = len(clv_labels)
    pct_zero = n_zero / n_total * 100
    print(f'Customers with CLV = 0 (churned): {n_zero:,} / {n_total:,} ({pct_zero:.1f}%)')
    print(f'Customers with CLV > 0 (retained): {n_total - n_zero:,} ({100 - pct_zero:.1f}%)')
    print(f'Mean CLV (all):  {clv_labels["clv_12m"].mean():,.2f}')
    print(f'Median CLV (all): {clv_labels["clv_12m"].median():,.2f}')
    print(f'Mean CLV (CLV>0): {clv_pos.mean():,.2f}')
    print(f'Max CLV: {clv_labels["clv_12m"].max():,.2f}')
else:
    print('Skipping -- clv_labels.csv not available.')

In [None]:
if clv_labels is not None:
    # CLV by decile (rank first so tied values, e.g. many zeros, still split into equal-sized bins)
    clv_labels_sorted = clv_labels.sort_values('clv_12m', ascending=False).copy()
    clv_labels_sorted['decile'] = pd.qcut(
        clv_labels_sorted['clv_12m'].rank(method='first'),
        q=10, labels=[f'D{i}' for i in range(1, 11)]
    )

    decile_stats = clv_labels_sorted.groupby('decile', observed=True)['clv_12m'].agg(
        ['mean', 'median', 'sum', 'count']
    ).round(2)
    decile_stats['pct_of_total_revenue'] = (decile_stats['sum'] / decile_stats['sum'].sum() * 100).round(1)
    print('=== CLV by Decile ===')
    print(decile_stats)

    # Pareto visualization
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(decile_stats.index.astype(str), decile_stats['pct_of_total_revenue'],
           color='steelblue', edgecolor='white', alpha=0.85)
    ax.set_title('Revenue Concentration by Customer Decile', fontsize=14)
    ax.set_xlabel('Customer Decile (D10 = highest CLV)')
    ax.set_ylabel('% of Total Revenue')
    for i, v in enumerate(decile_stats['pct_of_total_revenue']):
        ax.text(i, v + 0.5, f'{v:.1f}%', ha='center', fontsize=9)
    plt.tight_layout()
    plt.show()
else:
    print('Skipping -- clv_labels.csv not available.')
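
Revenue concentration can also be summarized as a single number, for example the revenue share of the top 20% of customers. The following is a minimal sketch on invented CLV values; the 20% cutoff is an illustrative choice, not something the pipeline defines.

```python
import numpy as np
import pandas as pd

# Invented CLV values, including zeros for churned customers.
clv = pd.Series([0, 0, 0, 10, 20, 30, 40, 100, 300, 500], dtype=float)

# Sort customers from highest to lowest CLV and accumulate their revenue share.
sorted_clv = clv.sort_values(ascending=False).reset_index(drop=True)
cum_revenue_share = sorted_clv.cumsum() / sorted_clv.sum()
cum_customer_share = (np.arange(len(sorted_clv)) + 1) / len(sorted_clv)

# Revenue share contributed by the top 20% of customers.
top20_share = cum_revenue_share[cum_customer_share <= 0.20].iloc[-1]
print(f'Top 20% of customers account for {top20_share:.1%} of revenue')
```

Plotting `cum_revenue_share` against `cum_customer_share` gives the usual cumulative (Lorenz-style) view of the same concentration that the decile bar chart shows.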

In [None]:
if clv_labels is not None:
    # Box plot of CLV by number of observation-period purchases
    plot_data = clv_labels.copy()

    # Cap n_obs_purchases for cleaner visualization
    cap_val = int(plot_data['n_obs_purchases'].quantile(0.95))
    cap_label = f'{cap_val}+'
    plot_data['obs_purchases_binned'] = plot_data['n_obs_purchases'].clip(upper=cap_val)
    plot_data['obs_purchases_label'] = plot_data['obs_purchases_binned'].astype(str)
    plot_data.loc[plot_data['n_obs_purchases'] >= cap_val, 'obs_purchases_label'] = cap_label

    # Use log1p for better visibility
    plot_data['log_clv'] = np.log1p(plot_data['clv_12m'])

    fig, ax = plt.subplots(figsize=(14, 6))
    sorted_labels = sorted(plot_data['obs_purchases_binned'].unique())
    label_strs = [str(int(x)) if x < cap_val else cap_label for x in sorted_labels]

    box_data = [plot_data.loc[plot_data['obs_purchases_binned'] == v, 'log_clv'].values
                for v in sorted_labels]
    bp = ax.boxplot(box_data, labels=label_strs, patch_artist=True,
                    boxprops=dict(facecolor='steelblue', alpha=0.6),
                    medianprops=dict(color='red', linewidth=2))

    ax.set_title('log1p(CLV) by Number of Observation-Period Purchases', fontsize=14)
    ax.set_xlabel('Number of Purchases in Observation Window')
    ax.set_ylabel('log1p(CLV 12m)')
    plt.tight_layout()
    plt.show()

    # Cold-start analysis
    if 'is_cold_start' in clv_labels.columns:
        n_cold = clv_labels['is_cold_start'].sum()
        pct_cold = n_cold / len(clv_labels) * 100
        print(f'Cold-start customers (< 2 obs purchases): {n_cold:,} ({pct_cold:.1f}%)')
        print(f'Mean CLV for cold-start: {clv_labels.loc[clv_labels["is_cold_start"] == 1, "clv_12m"].mean():,.2f}')
        print(f'Mean CLV for non-cold-start: {clv_labels.loc[clv_labels["is_cold_start"] == 0, "clv_12m"].mean():,.2f}')
else:
    print('Skipping -- clv_labels.csv not available.')

---
## 7. RFM Segmentation Preview

**Intent:** Load the engineered customer features and inspect RFM score distributions. Visualize how recency and frequency relate to mean CLV.
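
For reference, RFM scores are commonly derived as quintiles of each raw metric. The sketch below shows one way to do that with `pd.qcut`; the helper `quintile_score`, the 1-5 scale, and the sample values are assumptions for illustration, and the scores in `customer_features.csv` may be computed differently.

```python
import pandas as pd

# Invented per-customer metrics.
rfm = pd.DataFrame({
    'recency_days': [5, 40, 90, 200, 330, 12, 150, 60, 20, 300],
    'frequency':    [12, 4, 2, 1, 1, 8, 1, 3, 6, 1],
    'monetary':     [2400.0, 610.0, 150.0, 45.0, 30.0, 1800.0, 80.0, 320.0, 990.0, 25.0],
})

def quintile_score(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Score 1-5 by quintile; ranking first keeps qcut's bin edges unique under ties."""
    ranks = values.rank(method='first', ascending=higher_is_better)
    return pd.qcut(ranks, q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Recent buyers (few days since last purchase) should receive the high score.
rfm['recency_score'] = quintile_score(rfm['recency_days'], higher_is_better=False)
rfm['frequency_score'] = quintile_score(rfm['frequency'])
rfm['monetary_score'] = quintile_score(rfm['monetary'])
print(rfm)
```

Ranking with `method='first'` matters here because frequency has many tied values (several customers with a single purchase), which would otherwise make `pd.qcut` fail on duplicate bin edges.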

In [None]:
features_path = PROCESSED_DIR / 'customer_features.csv'

if features_path.exists():
    features_df = pd.read_csv(features_path)
    print(f'Customer features loaded: {features_df.shape}')
    print(f'Columns: {features_df.columns.tolist()}')
    print()
    print(features_df.head())
else:
    features_df = None
    print('customer_features.csv not found. Run the data pipeline + feature engineering first:')
    print('  cd m02_clv')
    print('  python src/data_pipeline.py  # run with the project virtualenv active')
    print('  python src/feature_engineering.py')

In [None]:
if features_df is not None:
    # Look for RFM score columns
    rfm_cols = [c for c in features_df.columns if 'score' in c.lower() or 'rfm' in c.lower()]
    print(f'RFM-related columns found: {rfm_cols}')

    if rfm_cols:
        n_plots = len(rfm_cols)
        fig, axes = plt.subplots(1, n_plots, figsize=(5 * n_plots, 4))
        if n_plots == 1:
            axes = [axes]
        for ax, col in zip(axes, rfm_cols):
            vals = features_df[col].dropna()
            ax.hist(vals, bins=30, color='steelblue', edgecolor='white', alpha=0.85)
            ax.set_title(col, fontsize=12)
            ax.set_xlabel('Score')
            ax.set_ylabel('Customers')
        plt.suptitle('RFM Score Distributions', fontsize=14, y=1.02)
        plt.tight_layout()
        plt.show()
    else:
        print('No RFM score columns found; showing top feature distributions instead.')
        numeric_cols = features_df.select_dtypes(include=[np.number]).columns[:6]
        fig, axes = plt.subplots(2, 3, figsize=(15, 8))
        for ax, col in zip(axes.flat, numeric_cols):
            vals = features_df[col].dropna()
            ax.hist(vals, bins=40, color='steelblue', edgecolor='white', alpha=0.85)
            ax.set_title(col, fontsize=11)
        plt.suptitle('Feature Distributions', fontsize=14, y=1.01)
        plt.tight_layout()
        plt.show()
else:
    print('Skipping -- customer_features.csv not available.')

In [None]:
if features_df is not None and clv_labels is not None:
    # Merge features with CLV labels
    # Identify customer ID column (may be 'customer_id' or 'Customer ID')
    cid_col_feat = [c for c in features_df.columns if 'customer' in c.lower() and 'id' in c.lower()]
    cid_col_label = [c for c in clv_labels.columns if 'customer' in c.lower() and 'id' in c.lower()]

    if cid_col_feat and cid_col_label:
        merged = features_df.merge(clv_labels[[cid_col_label[0], 'clv_12m']],
                                   left_on=cid_col_feat[0], right_on=cid_col_label[0],
                                   how='inner')
        print(f'Merged dataset: {len(merged):,} customers')

        # Look for recency_score and frequency_score
        rec_score = [c for c in merged.columns if 'recency' in c.lower() and 'score' in c.lower()]
        freq_score = [c for c in merged.columns if 'frequency' in c.lower() and 'score' in c.lower()]

        if rec_score and freq_score:
            heatmap_data = merged.groupby([rec_score[0], freq_score[0]])['clv_12m'].mean().unstack()

            fig, ax = plt.subplots(figsize=(8, 6))
            sns.heatmap(heatmap_data, annot=True, fmt='.0f', cmap='YlOrRd', ax=ax)
            ax.set_title('Mean CLV by Recency Score vs Frequency Score', fontsize=14)
            ax.set_xlabel('Frequency Score')
            ax.set_ylabel('Recency Score')
            plt.tight_layout()
            plt.show()
        else:
            print('Recency/Frequency score columns not found. Skipping heatmap.')
            print(f'Available columns: {merged.columns.tolist()}')
    else:
        print('Could not identify customer ID columns for merge.')
else:
    print('Skipping -- features or labels not available.')

---
## 8. Feature Correlations

**Intent:** Examine correlations between engineered features and the CLV target to guide feature selection and identify potential multicollinearity.
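
Beyond reading the heatmap, multicollinearity can be screened by listing feature pairs whose absolute correlation exceeds a threshold. This is a minimal sketch: `high_corr_pairs`, the 0.9 threshold, and the sample features are illustrative assumptions rather than part of the project code.

```python
from itertools import combinations

import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: feature pairs with |Pearson r| >= threshold, each pair once."""
    corr = df.corr().abs()
    rows = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if corr.loc[a, b] >= threshold
    ]
    pairs = pd.DataFrame(rows, columns=['feature_a', 'feature_b', 'abs_corr'])
    return pairs.sort_values('abs_corr', ascending=False).reset_index(drop=True)

# Invented features: n_items tracks frequency closely, recency does not.
sample = pd.DataFrame({
    'frequency':    [1, 2, 3, 4, 5, 6, 7, 8],
    'n_items':      [3, 7, 9, 13, 14, 19, 20, 25],
    'recency_days': [200, 30, 310, 90, 15, 250, 120, 60],
})
pairs = high_corr_pairs(sample)
print(pairs)
```

Pairs flagged this way are candidates for dropping one member or combining both, which matters more for linear baselines than for tree-based models.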

In [None]:
if features_df is not None and clv_labels is not None:
    # Build analysis dataframe
    cid_col_feat = [c for c in features_df.columns if 'customer' in c.lower() and 'id' in c.lower()]
    cid_col_label = [c for c in clv_labels.columns if 'customer' in c.lower() and 'id' in c.lower()]

    if cid_col_feat and cid_col_label:
        analysis_df = features_df.merge(
            clv_labels[[cid_col_label[0], 'clv_12m']],
            left_on=cid_col_feat[0], right_on=cid_col_label[0], how='inner'
        )

        # Select only numeric columns
        numeric_df = analysis_df.select_dtypes(include=[np.number])
        # Drop ID columns
        id_cols = [c for c in numeric_df.columns if 'id' in c.lower()]
        numeric_df = numeric_df.drop(columns=id_cols, errors='ignore')

        print(f'Numeric features for correlation: {numeric_df.shape[1]} columns')

        # Full correlation matrix
        corr = numeric_df.corr()

        fig, ax = plt.subplots(figsize=(16, 12))
        mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
        sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
                    center=0, vmin=-1, vmax=1, ax=ax,
                    annot_kws={'size': 8}, square=True)
        ax.set_title('Feature Correlation Matrix (lower triangle)', fontsize=14)
        plt.tight_layout()
        plt.show()
    else:
        print('Could not identify customer ID columns for merge.')
else:
    print('Skipping -- features or labels not available.')

In [None]:
if features_df is not None and clv_labels is not None:
    if 'numeric_df' in globals() and 'clv_12m' in numeric_df.columns:
        clv_corr = numeric_df.corr()['clv_12m'].drop('clv_12m').sort_values(ascending=False)

        print('=== Top 10 Features Correlated with clv_12m ===')
        print(clv_corr.head(10).round(4))
        print()
        print('=== Bottom 5 (Lowest Correlations) ===')
        print(clv_corr.tail(5).round(4))

        # Bar chart of correlations
        fig, ax = plt.subplots(figsize=(10, 6))
        top_n = min(15, len(clv_corr))
        top_features = clv_corr.head(top_n)
        colors = ['steelblue' if v > 0 else 'coral' for v in top_features.values]
        ax.barh(range(top_n), top_features.values, color=colors, edgecolor='white')
        ax.set_yticks(range(top_n))
        ax.set_yticklabels(top_features.index, fontsize=10)
        ax.set_xlabel('Pearson Correlation with clv_12m')
        ax.set_title('Top Features Correlated with CLV', fontsize=14)
        ax.invert_yaxis()
        plt.tight_layout()
        plt.show()
    else:
        print('clv_12m not found in numeric columns.')
else:
    print('Skipping -- features or labels not available.')

---
## 9. Key Findings Summary

### Dataset Overview
- ~1M transaction rows spanning Dec 2009 -- Dec 2011 from a UK-based online retailer
- ~5,900 unique customers with valid Customer IDs
- 22.8% of rows lack Customer ID (must be dropped for customer-level modeling)
- 91.9% of transactions originate from the United Kingdom

### Data Quality
- **Cancellations** (~1.8% of rows) are identified by Invoice prefix 'C' and carry negative quantities
- **Non-product StockCodes** (e.g., POST, DOT, M) should be excluded from revenue calculations
- **Duplicates** (~3.2%) exist and should be removed
- **Zero/negative prices** are minimal but need filtering

### Temporal Split
- Observation window (Dec 2009 -- Nov 2010) provides sufficient data for RFM feature computation
- Prediction window (Dec 2010 -- Dec 2011) captures the next 12 months for CLV target
- New customers appearing only in the prediction window are excluded from training

### CLV Target Characteristics
- CLV is highly right-skewed -- a log1p transformation is recommended for the regression target
- A significant fraction of customers have CLV = 0 (churned in the prediction window)
- Strong Pareto effect: the top decile likely accounts for the majority of total revenue
- CLV increases monotonically with observation-period purchase frequency

### Modeling Implications
- **Target transform:** Use `log1p(clv_12m)` with `expm1()` inverse for predictions
- **Cold-start handling:** Customers with < 2 observation-period purchases need special treatment (median CLV fallback)
- **Feature candidates:** Frequency, monetary, and recency are strong CLV predictors (as expected from RFM theory)
- **Baseline model:** BG/NBD + Gamma-Gamma probabilistic model from `lifetimes` library
- **Primary model:** LightGBM Regressor on engineered features
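
The target transform and cold-start fallback listed above can be sketched as follows. The training values are invented, `predict_clv` is a hypothetical wrapper, and using the median over all training customers (rather than, say, cold-start customers only) is an assumption.

```python
import numpy as np
import pandas as pd

# Invented training labels.
train = pd.DataFrame({
    'n_obs_purchases': [1, 1, 3, 5, 8],
    'clv_12m': [0.0, 120.0, 450.0, 0.0, 3200.0],
})

# Fit on log1p(target): zeros map to 0 and the long right tail is compressed.
y_log = np.log1p(train['clv_12m'])

# Predictions made in log space are mapped back to currency units with expm1.
y_back = np.expm1(y_log)

# Cold-start fallback value (assumption: median over all training customers).
median_clv = float(train['clv_12m'].median())

def predict_clv(n_obs_purchases: int, model_log_pred: float) -> float:
    """Hypothetical wrapper: median fallback for cold-start, expm1 inverse otherwise."""
    if n_obs_purchases < 2:
        return median_clv
    return float(np.expm1(model_log_pred))

print(predict_clv(1, model_log_pred=9.9))              # cold-start: model output ignored
print(predict_clv(3, model_log_pred=np.log1p(450.0)))  # regular customer
```

Note that averaging errors in log space and then applying `expm1` generally does not yield an unbiased estimate of mean CLV, so evaluation metrics should be computed on the back-transformed predictions.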