# Texas CLV Complete Walkthrough

This notebook demonstrates a complete Customer Lifetime Value (CLV) analysis workflow using the AutoCLV library with synthetic Texas customer data.

## What You'll Learn

1. Generate realistic synthetic customer data
2. Build a customer data mart with period aggregations
3. Run Five Lenses analyses (Lens 1-3)
4. Calculate RFM scores and segment customers
5. Train CLV models (BG/NBD + Gamma-Gamma)
6. Generate CLV predictions and insights

## Prerequisites

Install the package in editable mode from the repository root:

```bash
pip install -e .
```

## ⚠️ Data Privacy Notice

**This notebook uses synthetic data for demonstration purposes.**

When adapting this workflow to production data:
- Ensure compliance with data privacy regulations (GDPR, CCPA, etc.)
- Anonymize or pseudonymize customer identifiers
- Implement appropriate access controls
- Never commit notebooks with real customer data to version control
- Consider using differential privacy techniques for sensitive analyses

## Step 1: Generate Synthetic Data

We'll use the Texas CLV generator to create realistic customer transaction data across 4 Texas cities.

In [None]:
from datetime import datetime
from dataclasses import asdict
from customer_base_audit.synthetic.texas_clv_client import generate_texas_clv_client

# Generate 1000 customers with transactions over 2024
customers, transactions, city_map = generate_texas_clv_client(
    total_customers=1000,
    seed=42,  # Fixed seed for reproducibility
)

print(f"Generated {len(customers):,} customers")
print(f"Generated {len(transactions):,} transactions")
print(f"Cities: {set(city_map.values())}")
print(
    f"\nDate range: {min(t.event_ts for t in transactions).date()} to {max(t.event_ts for t in transactions).date()}"
)

## Step 2: Explore the Data

Let's examine the structure of our synthetic data.

In [None]:
import pandas as pd

# Convert to DataFrames for easy exploration
customers_df = pd.DataFrame([asdict(c) for c in customers])
transactions_df = pd.DataFrame([asdict(t) for t in transactions])

print("=== Customer Sample ===")
print(customers_df.head())
print("\nCustomers by city:")
print(customers_df["customer_id"].map(city_map).value_counts())

print("\n=== Transaction Sample ===")
print(transactions_df.head())

print("\n=== Transaction Statistics ===")
print(
    f"Total revenue: ${(transactions_df['quantity'].astype(float) * transactions_df['unit_price'].astype(float)).sum():,.2f}"
)
print(
    f"Average order value: ${(transactions_df['quantity'].astype(float) * transactions_df['unit_price'].astype(float)).mean():,.2f}"
)
print(f"Unique products: {transactions_df['product_id'].nunique()}")
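
The revenue figures above depend on multiplying quantity by unit price per row before aggregating. A minimal standalone sketch of that arithmetic follows. The `order_id` field and the values are assumptions for illustration, not the generator's actual schema.

```python
from statistics import mean

# Hypothetical line items: (order_id, quantity, unit_price)
line_items = [
    ("o1", 2, 10.00),
    ("o1", 1, 5.50),
    ("o2", 3, 4.00),
]

# Multiply per row first, then aggregate.
line_totals = [qty * price for _, qty, price in line_items]
total_revenue = sum(line_totals)     # → 37.5
avg_line_value = mean(line_totals)   # → 12.5

# Grouping rows by order gives a per-order average instead of a per-row average.
order_totals = {}
for order_id, qty, price in line_items:
    order_totals[order_id] = order_totals.get(order_id, 0.0) + qty * price
avg_order_value = mean(list(order_totals.values()))  # → 18.75
```

If each transaction row is already a whole order, the per-row mean and the per-order mean coincide; otherwise they differ, as in this toy data.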

## Step 3: Build Customer Data Mart

The data mart aggregates transactions by customer and time period.

In [None]:
from customer_base_audit.foundation.data_mart import (
    CustomerDataMartBuilder,
    PeriodGranularity,
)

# Build data mart with monthly granularity
builder = CustomerDataMartBuilder(period_granularities=[PeriodGranularity.MONTH])
mart = builder.build([asdict(t) for t in transactions])

print(f"Orders processed: {len(mart.orders):,}")
print(f"Monthly periods: {len(mart.periods[PeriodGranularity.MONTH]):,}")
print(
    f"Unique customers: {len(set(p.customer_id for p in mart.periods[PeriodGranularity.MONTH])):,}"
)
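
Conceptually, a monthly period aggregation groups transactions by customer and calendar month. The sketch below shows that grouping in plain Python. The tuple layout and the `monthly_aggregates` helper are hypothetical and are not the `CustomerDataMartBuilder` implementation.

```python
from collections import defaultdict
from datetime import datetime

def monthly_aggregates(txns):
    """Aggregate (customer_id, timestamp, amount) rows by customer and calendar month."""
    totals = defaultdict(lambda: {"orders": 0, "spend": 0.0})
    for cust, ts, amount in txns:
        key = (cust, ts.year, ts.month)
        totals[key]["orders"] += 1
        totals[key]["spend"] += amount
    return dict(totals)

# Hypothetical transactions for illustration only
sample_txns = [
    ("a", datetime(2024, 1, 5), 20.0),
    ("a", datetime(2024, 1, 20), 30.0),
    ("a", datetime(2024, 2, 3), 10.0),
    ("b", datetime(2024, 1, 9), 99.0),
]
periods = monthly_aggregates(sample_txns)
```

Each key identifies one customer-month, which is roughly the shape of record the later RFM and Lens steps consume.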

## Step 4: Calculate RFM Metrics

RFM (Recency, Frequency, Monetary) analysis helps segment customers by behavior.

In [None]:
from customer_base_audit.foundation.rfm import calculate_rfm, calculate_rfm_scores

# Calculate RFM metrics
period_aggregations = mart.periods[PeriodGranularity.MONTH]
observation_end = datetime(2024, 12, 31, 23, 59, 59)

rfm_metrics = calculate_rfm(
    period_aggregations=period_aggregations, observation_end=observation_end
)

rfm_scores = calculate_rfm_scores(rfm_metrics)

# Convert to DataFrame for visualization
rfm_df = pd.DataFrame([asdict(rfm) for rfm in rfm_metrics])
scores_df = pd.DataFrame([asdict(s) for s in rfm_scores])

print("=== RFM Metrics Summary ===")
print(rfm_df[["recency_days", "frequency", "monetary"]].describe())

print("\n=== RFM Scores Distribution ===")
print(
    f"Best customers (555): {len(scores_df[(scores_df['r_score'] == 5) & (scores_df['f_score'] == 5) & (scores_df['m_score'] == 5)])}"
)
print(
    f"Lowest-scoring customers (111): {len(scores_df[(scores_df['r_score'] == 1) & (scores_df['f_score'] == 1) & (scores_df['m_score'] == 1)])}"
)
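
To make the three RFM quantities concrete, here is a small sketch that derives them from raw transactions. It assumes recency is days since the last purchase, frequency is the purchase count, and monetary is average spend per purchase. `calculate_rfm` may define these differently, so treat `simple_rfm` as a hypothetical illustration.

```python
from datetime import datetime

def simple_rfm(txns, observation_end):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    by_customer = {}
    for cust, ts, amount in txns:
        by_customer.setdefault(cust, []).append((ts, amount))

    result = {}
    for cust, rows in by_customer.items():
        last_purchase = max(ts for ts, _ in rows)
        recency_days = (observation_end - last_purchase).days
        frequency = len(rows)
        monetary = sum(amount for _, amount in rows) / frequency  # avg spend per purchase
        result[cust] = (recency_days, frequency, monetary)
    return result

# Hypothetical transactions: (customer_id, timestamp, amount)
txns = [
    ("a", datetime(2024, 1, 10), 50.0),
    ("a", datetime(2024, 6, 1), 30.0),
    ("a", datetime(2024, 12, 1), 40.0),
    ("b", datetime(2024, 3, 15), 200.0),
]
rfm = simple_rfm(txns, observation_end=datetime(2024, 12, 31))
# rfm["a"] → (30, 3, 40.0); rfm["b"] → (291, 1, 200.0)
```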

## Step 5: Lens 1 - Single Period Analysis

Lens 1 provides a snapshot view of your customer base.

In [None]:
from customer_base_audit.analyses.lens1 import analyze_single_period

lens1_results = analyze_single_period(rfm_metrics, rfm_scores)

print("=== Lens 1: Single Period Analysis ===")
print(f"Total Customers: {lens1_results.total_customers:,}")
print(
    f"One-Time Buyers: {lens1_results.one_time_buyers:,} ({lens1_results.one_time_buyer_pct:.1f}%)"
)
print(f"Total Revenue: ${lens1_results.total_revenue:,.2f}")
print("\nRevenue Concentration:")
print(f"  Top 10% contribute: {lens1_results.top_10pct_revenue_contribution:.1f}%")
print(f"  Top 20% contribute: {lens1_results.top_20pct_revenue_contribution:.1f}%")
print("\nCustomer Metrics:")
print(f"  Avg orders per customer: {lens1_results.avg_orders_per_customer:.1f}")
print(f"  Median customer value: ${lens1_results.median_customer_value:.2f}")

if lens1_results.rfm_distribution:
    print("\nTop 5 RFM Segments:")
    sorted_segments = sorted(
        lens1_results.rfm_distribution.items(), key=lambda x: x[1], reverse=True
    )[:5]
    for segment, count in sorted_segments:
        print(f"  {segment}: {count} customers")
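
The revenue concentration figures can be read as "what share of revenue comes from the top N% of customers". A rough sketch of that calculation, under the assumption that customers are ranked by total revenue and the cutoff is rounded up (the library's own rounding rule is not shown in this notebook):

```python
import math

def top_share(customer_revenue, fraction):
    """Percent of total revenue contributed by the top `fraction` of customers."""
    ranked = sorted(customer_revenue.values(), reverse=True)
    top_n = max(1, math.ceil(len(ranked) * fraction))
    return 100.0 * sum(ranked[:top_n]) / sum(ranked)

# Hypothetical per-customer revenue totals (sum = 1000)
customer_revenue = {
    f"c{i}": float(v)
    for i, v in enumerate([500, 200, 100, 50, 50, 40, 30, 15, 10, 5])
}
top_10 = top_share(customer_revenue, 0.10)  # → 50.0
top_20 = top_share(customer_revenue, 0.20)  # → 70.0
```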

## Step 6: Lens 2 - Period-to-Period Comparison

Lens 2 compares two time periods to analyze customer migration patterns.

In [None]:
from customer_base_audit.analyses.lens2 import analyze_period_comparison

# Compare Q3 vs Q4 2024
period1_start = datetime(2024, 7, 1)
period1_end = datetime(2024, 9, 30, 23, 59, 59)
period2_start = datetime(2024, 10, 1)
period2_end = datetime(2024, 12, 31, 23, 59, 59)

period1_aggs = [
    agg
    for agg in period_aggregations
    if period1_start <= agg.period_start <= period1_end
]
period2_aggs = [
    agg
    for agg in period_aggregations
    if period2_start <= agg.period_start <= period2_end
]

period1_rfm = calculate_rfm(period1_aggs, observation_end=period1_end)
period2_rfm = calculate_rfm(period2_aggs, observation_end=period2_end)

all_customer_ids = list(set(c.customer_id for c in customers))

lens2_results = analyze_period_comparison(
    period1_rfm=period1_rfm,
    period2_rfm=period2_rfm,
    all_customer_history=all_customer_ids,
)

print("=== Lens 2: Q3 → Q4 2024 Comparison ===")
print("\nCustomer Migration:")
print(f"  Retained: {len(lens2_results.migration.retained):,}")
print(f"  Churned: {len(lens2_results.migration.churned):,}")
print(f"  New: {len(lens2_results.migration.new):,}")
print(f"  Reactivated: {len(lens2_results.migration.reactivated):,}")

print("\nRetention Metrics:")
print(f"  Retention Rate: {lens2_results.retention_rate:.1f}%")
print(f"  Churn Rate: {lens2_results.churn_rate:.1f}%")
print(f"  Reactivation Rate: {lens2_results.reactivation_rate:.1f}%")

print("\nBusiness Metrics:")
print(f"  Customer Count Change: {lens2_results.customer_count_change:+d}")
print(f"  Revenue Change: {lens2_results.revenue_change_pct:+.1f}%")
print(f"  AOV Change: {lens2_results.avg_order_value_change_pct:+.1f}%")
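
The migration categories reduce to set operations on customer IDs. The sketch below is one plausible formulation, assuming "reactivated" means a customer who is absent in period 1, present in period 2, and has purchased at some point before period 1. `analyze_period_comparison` may draw these lines differently, and `classify_migration` is a hypothetical helper.

```python
def classify_migration(period1_ids, period2_ids, prior_history_ids):
    """Split customers into retained / churned / new / reactivated sets."""
    p1, p2, prior = set(period1_ids), set(period2_ids), set(prior_history_ids)
    appeared = p2 - p1  # active in period 2 but not in period 1
    return {
        "retained": p1 & p2,
        "churned": p1 - p2,
        "reactivated": appeared & prior,  # bought before period 1, skipped it, came back
        "new": appeared - prior,          # never seen before period 2
    }

migration = classify_migration(
    period1_ids=["a", "b", "c", "d"],
    period2_ids=["a", "b", "e", "f"],
    prior_history_ids=["a", "e", "g"],
)
retention_rate = 100.0 * len(migration["retained"]) / 4  # → 50.0
churn_rate = 100.0 * len(migration["churned"]) / 4       # → 50.0
```

Under this formulation, retention and churn rates are both measured against the period 1 customer count, so they sum to 100%.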

## Step 7: Lens 3 - Single Cohort Evolution

Lens 3 tracks how a cohort's behavior evolves over time.

In [None]:
from customer_base_audit.foundation.cohorts import (
    create_monthly_cohorts,
    assign_cohorts,
)
from customer_base_audit.analyses.lens3 import analyze_cohort_evolution

# Create monthly cohorts
cohort_definitions = create_monthly_cohorts(
    customers=customers,
    start_date=datetime(2024, 1, 1),
    end_date=datetime(2024, 12, 31),
)

cohort_assignments = assign_cohorts(customers, cohort_definitions)

# Analyze January 2024 cohort
cohort_name = "2024-01"
cohort_customer_ids = [
    cust_id for cust_id, coh_id in cohort_assignments.items() if coh_id == cohort_name
]

cohort_definition = next(c for c in cohort_definitions if c.cohort_id == cohort_name)

lens3_results = analyze_cohort_evolution(
    cohort_name=cohort_name,
    acquisition_date=cohort_definition.period_start,
    period_aggregations=period_aggregations,
    cohort_customer_ids=cohort_customer_ids,
)

print(f"=== Lens 3: {cohort_name} Cohort Evolution ===")
print(f"Cohort Size: {lens3_results.cohort_size:,}")
print("\nPeriod-by-Period Metrics:")

for period in lens3_results.periods[:6]:  # First 6 months
    print(f"\nPeriod {period.period_number}:")
    print(f"  Active Customers: {period.active_customers:,}")
    print(f"  Retention Rate: {period.retention_rate:.1%}")
    print(f"  Avg Orders: {period.avg_orders_per_customer:.2f}")
    print(f"  Avg Revenue: ${period.avg_revenue_per_customer:,.2f}")
    print(f"  Total Revenue: ${period.total_revenue:,.2f}")
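
Cohort retention at period k is the fraction of the original cohort that is active k periods after acquisition. A minimal sketch with monthly periods follows. The `activity` mapping and both helper functions are hypothetical and do not mirror the internals of `analyze_cohort_evolution`.

```python
def add_months(year, month, offset):
    """Shift a (year, month) pair forward by `offset` months."""
    idx = year * 12 + (month - 1) + offset
    return idx // 12, idx % 12 + 1

def cohort_retention(cohort_ids, activity, start, n_periods):
    """Return [(period_number, active_customers, retention_rate), ...]."""
    rows = []
    for k in range(n_periods):
        year_month = add_months(start[0], start[1], k)
        active = sum(1 for c in cohort_ids if year_month in activity.get(c, set()))
        rows.append((k, active, active / len(cohort_ids)))
    return rows

# Hypothetical activity: customer_id -> set of (year, month) with at least one order
activity = {
    "a": {(2024, 1), (2024, 2), (2024, 3)},
    "b": {(2024, 1), (2024, 3)},
    "c": {(2024, 1)},
    "d": {(2024, 1), (2024, 2)},
}
curve = cohort_retention(["a", "b", "c", "d"], activity, start=(2024, 1), n_periods=3)
# → [(0, 4, 1.0), (1, 2, 0.5), (2, 2, 0.5)]
```

Here retention is a fraction between 0 and 1, which matches the `:.1%` formatting used for Lens 3 above.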

## Step 8: Prepare Model Input Data

Format data for CLV model training (BG/NBD + Gamma-Gamma).

In [None]:
from customer_base_audit.models.model_prep import prepare_clv_model_inputs

# Prepare data for BG/NBD (purchase frequency) and Gamma-Gamma (monetary value) models
model_data = prepare_clv_model_inputs(
    transactions=[asdict(t) for t in transactions],
    observation_start=datetime(2024, 1, 1),
    observation_end=datetime(2024, 12, 31, 23, 59, 59),
    customer_id_field="customer_id",
    timestamp_field="event_ts",
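    # Assumption to verify: "unit_price" ignores quantity. If quantity can exceed 1,
    # consider precomputing quantity * unit_price per transaction and passing that field.
    monetary_field="unit_price",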
)

print("=== CLV Model Input Data ===")
print(model_data.head())
print(f"\nTotal customers: {len(model_data)}")
print(
    f"Customers with 2+ purchases (eligible for Gamma-Gamma): {len(model_data[model_data['frequency'] >= 2])}"
)
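
BG/NBD-style models are usually fit on a per-customer summary rather than raw transactions. The sketch below uses the conventional definitions from the BG/NBD literature: frequency is the number of repeat purchase days, recency is the time from first to last purchase, and T is the time from first purchase to the end of observation. Whether `prepare_clv_model_inputs` uses exactly these definitions is an assumption; the frequency filter in this notebook suggests its `frequency` column may count purchases differently.

```python
from datetime import datetime

def summarize_customer(purchase_times, observation_end):
    """Conventional BG/NBD summary (in days) for one customer's purchase timestamps."""
    days = sorted({ts.date() for ts in purchase_times})  # collapse same-day purchases
    first, last = days[0], days[-1]
    return {
        "frequency": len(days) - 1,                     # repeat purchase days
        "recency": (last - first).days,                 # first → last purchase
        "T": (observation_end.date() - first).days,     # first purchase → observation end
    }

summary = summarize_customer(
    [
        datetime(2024, 1, 10, 9, 0),
        datetime(2024, 3, 10, 12, 0),
        datetime(2024, 3, 10, 18, 30),  # same day as the previous purchase
        datetime(2024, 6, 8, 8, 15),
    ],
    observation_end=datetime(2024, 12, 31, 23, 59, 59),
)
# → {"frequency": 2, "recency": 150, "T": 356}
```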

## Step 9: Train BG/NBD Model (Purchase Frequency)

The BG/NBD model predicts how many purchases a customer will make in a future time period.

In [None]:
from customer_base_audit.models.bg_nbd import BGNBDModelWrapper, BGNBDConfig

# Train BG/NBD model using MAP method (faster than MCMC)
config = BGNBDConfig(method="map")
bg_nbd_model = BGNBDModelWrapper(config)

print("Training BG/NBD model (this may take a minute)...")
bg_nbd_model.fit(model_data)
print("✓ BG/NBD model trained successfully")

# Predict purchases for next 90 days
purchase_predictions = bg_nbd_model.predict_purchases(model_data, time_periods=90.0)

print("\n=== Purchase Predictions (Next 90 Days) ===")
print(purchase_predictions.describe())
print("\nTop 5 customers by predicted purchases:")
print(purchase_predictions.nlargest(5, "predicted_purchases"))
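
Besides expected purchases, the BG/NBD model gives a closed-form probability that a customer is still "alive", which Step 11 uses. The sketch below implements the standard formula, where `x` is the number of repeat purchases, `t_x` the recency, and `T` the customer's age. The parameter values are illustrative placeholders, not fitted values from this notebook, and the function is not the wrapper's own code.

```python
def bg_nbd_prob_alive(x, t_x, T, r, alpha, a, b):
    """P(alive | x, t_x, T) under BG/NBD.

    x: repeat purchases, t_x: time of last purchase, T: time since first purchase.
    t_x and T must use the same time unit as the fitted alpha.
    """
    if x == 0:
        # BG/NBD only allows dropout right after a purchase, so customers with
        # no repeat purchase are treated as alive with probability 1.
        return 1.0
    ratio = (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + ratio)

# Illustrative (not fitted) parameters
params = dict(r=0.25, alpha=4.5, a=0.8, b=2.5)

recent_buyer = bg_nbd_prob_alive(x=3, t_x=28.0, T=30.0, **params)
lapsed_buyer = bg_nbd_prob_alive(x=3, t_x=5.0, T=30.0, **params)
```

With the same purchase count, the customer whose last purchase is closer to the end of observation gets the higher probability, which is the intuition behind the model.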

## Step 10: Train Gamma-Gamma Model (Monetary Value)

The Gamma-Gamma model predicts average transaction value per customer.

**Why 2+ purchases?** The Gamma-Gamma model requires customers to have made at least 2 purchases to estimate their average transaction value. This mathematical requirement ensures the model has enough data points to distinguish between a customer's "true" average spend and random variation.

In [None]:
from customer_base_audit.models.gamma_gamma import (
    GammaGammaModelWrapper,
    GammaGammaConfig,
)

# Filter to customers with 2+ purchases (Gamma-Gamma requirement)
gamma_data = model_data[model_data["frequency"] >= 2].copy()

print(f"Training Gamma-Gamma model on {len(gamma_data)} customers with 2+ purchases...")

gg_config = GammaGammaConfig(method="map")
gamma_gamma_model = GammaGammaModelWrapper(gg_config)
gamma_gamma_model.fit(gamma_data)
print("✓ Gamma-Gamma model trained successfully")

# Predict average transaction value
spend_predictions = gamma_gamma_model.predict_spend(gamma_data)

print("\n=== Spend Predictions (Avg Transaction Value) ===")
print(spend_predictions.describe())
print("\nTop 5 customers by predicted spend:")
print(spend_predictions.nlargest(5, "predicted_avg_spend"))
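
The Gamma-Gamma prediction can be understood as a weighted average of the customer's observed mean spend and the population mean, with more weight on the customer's own history as the transaction count grows. A sketch of that conditional expectation follows, with `x` the number of transactions behind the observed mean `m_x`. The parameter values are illustrative only, and this is not the wrapper's implementation.

```python
def gamma_gamma_expected_spend(x, m_x, p, q, gamma):
    """E[average transaction value | x transactions with observed mean m_x].

    Requires q > 1 so that the population mean is finite.
    """
    population_mean = gamma * p / (q - 1)
    weight = p * x / (p * x + q - 1)  # trust in the customer's own average
    return (1 - weight) * population_mean + weight * m_x

# Illustrative (not fitted) parameters
gg_params = dict(p=6.25, q=3.74, gamma=15.44)

few_orders = gamma_gamma_expected_spend(x=2, m_x=100.0, **gg_params)
many_orders = gamma_gamma_expected_spend(x=20, m_x=100.0, **gg_params)
```

Both customers have the same observed average of 100, but the estimate for the customer with 2 transactions is pulled further toward the population mean than the estimate for the customer with 20.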

## Step 11: Calculate Customer Lifetime Value

Combine purchase frequency and monetary value predictions to estimate CLV.

In [None]:
# Merge purchase and spend predictions
clv_data = purchase_predictions.merge(spend_predictions, on="customer_id", how="inner")

# Calculate 90-day CLV = predicted purchases × predicted avg spend
clv_data["clv_90_day"] = (
    clv_data["predicted_purchases"] * clv_data["predicted_avg_spend"]
)

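# Calculate probability alive for the same customers scored by Gamma-Gamma (2+ purchases)
prob_alive = bg_nbd_model.calculate_probability_alive(gamma_data)
clv_data = clv_data.merge(prob_alive, on="customer_id", how="left")

# Adjust CLV by probability alive.
# Caveat: if predict_purchases already returns a conditional expectation that accounts
# for dropout (as the standard BG/NBD expectation does), this is a second, conservative discount.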
clv_data["clv_90_day_adjusted"] = clv_data["clv_90_day"] * clv_data[
    "prob_alive"
].fillna(1.0)

print("=== Customer Lifetime Value (90-Day Predictions) ===")
print(clv_data[["clv_90_day", "clv_90_day_adjusted", "prob_alive"]].describe())

print(
    f"\nTotal predicted 90-day revenue: ${clv_data['clv_90_day_adjusted'].sum():,.2f}"
)
print(f"Average CLV per customer: ${clv_data['clv_90_day_adjusted'].mean():,.2f}")

print("\n=== Top 10 Customers by CLV ===")
top_customers = clv_data.nlargest(10, "clv_90_day_adjusted")[
    [
        "customer_id",
        "predicted_purchases",
        "predicted_avg_spend",
        "prob_alive",
        "clv_90_day_adjusted",
    ]
]
for idx, row in top_customers.iterrows():
    print(
        f"{row['customer_id']}: ${row['clv_90_day_adjusted']:.2f} ({row['predicted_purchases']:.1f} purchases @ ${row['predicted_avg_spend']:.2f}, {row['prob_alive']:.1%} alive)"
    )
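
Because the purchase and spend predictions are combined with an inner join, only customers present in both tables receive a CLV. The dictionary-based sketch below mirrors that logic on hypothetical values; `combine_clv` is an illustrative helper, not part of the library.

```python
def combine_clv(purchases, spend, alive):
    """Inner-join purchase and spend predictions, then discount by P(alive)."""
    rows = {}
    for cust in purchases.keys() & spend.keys():  # inner join on customer_id
        clv = purchases[cust] * spend[cust]
        rows[cust] = {
            "clv": clv,
            "clv_adjusted": clv * alive.get(cust, 1.0),  # missing P(alive) treated as 1.0
        }
    return rows

# Hypothetical predictions for illustration only
predicted_purchases = {"a": 1.5, "b": 0.4, "c": 0.2}  # every customer
predicted_avg_spend = {"a": 80.0, "b": 50.0}          # repeat buyers only
prob_alive_by_customer = {"a": 0.9, "b": 0.5}

clv = combine_clv(predicted_purchases, predicted_avg_spend, prob_alive_by_customer)
excluded = sorted(predicted_purchases.keys() - clv.keys())  # → ["c"]
```

Customer `c` has a purchase prediction but no spend estimate, so it drops out of the result. In the notebook, the same thing happens to customers who were filtered out before the Gamma-Gamma step.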

## Step 12: Visualize Results

Create visualizations to understand customer segments and CLV distribution.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. CLV Distribution
axes[0, 0].hist(clv_data["clv_90_day_adjusted"], bins=50, edgecolor="black")
axes[0, 0].set_xlabel("90-Day CLV ($)")
axes[0, 0].set_ylabel("Number of Customers")
axes[0, 0].set_title("CLV Distribution")
axes[0, 0].axvline(
    clv_data["clv_90_day_adjusted"].median(),
    color="red",
    linestyle="--",
    label=f"Median: ${clv_data['clv_90_day_adjusted'].median():.2f}",
)
axes[0, 0].legend()

# 2. Purchase Frequency vs Monetary Value
axes[0, 1].scatter(
    clv_data["predicted_purchases"], clv_data["predicted_avg_spend"], alpha=0.5
)
axes[0, 1].set_xlabel("Predicted Purchases (90 days)")
axes[0, 1].set_ylabel("Predicted Avg Spend ($)")
axes[0, 1].set_title("Purchase Frequency vs Monetary Value")

# 3. Probability Alive Distribution
axes[1, 0].hist(clv_data["prob_alive"].dropna(), bins=30, edgecolor="black")
axes[1, 0].set_xlabel("Probability Alive")
axes[1, 0].set_ylabel("Number of Customers")
axes[1, 0].set_title("Customer Activity Probability")

# 4. Top 20 Customers by CLV
top_20 = clv_data.nlargest(20, "clv_90_day_adjusted")
axes[1, 1].barh(range(len(top_20)), top_20["clv_90_day_adjusted"])
axes[1, 1].set_yticks(range(len(top_20)))
axes[1, 1].set_yticklabels(top_20["customer_id"].str[:10])  # Truncate IDs
axes[1, 1].set_xlabel("90-Day CLV ($)")
axes[1, 1].set_title("Top 20 Customers by CLV")
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

print("📊 Visualizations complete!")

## Summary

### What We Accomplished

1. ✅ Generated 1,000 customers with realistic synthetic transaction data
2. ✅ Built a customer data mart with monthly aggregations
3. ✅ Calculated RFM metrics and segmented customers
4. ✅ Ran Five Lenses analyses (Lens 1-3)
5. ✅ Trained BG/NBD model for purchase frequency prediction
6. ✅ Trained Gamma-Gamma model for monetary value prediction
7. ✅ Calculated 90-day CLV predictions for customers with 2+ purchases
8. ✅ Visualized customer segments and CLV distribution

### Key Insights

- **Customer Retention**: Check Lens 2 retention rate and Lens 1 one-time buyer percentage
- **Revenue Concentration**: See Lens 1 top 10%/20% revenue contribution
- **Cohort Health**: Review Lens 3 cohort evolution and retention decay
- **CLV Distribution**: Understand which customers drive future revenue

### Next Steps

- Compare multiple cohorts (see `02_custom_cohorts.ipynb`)
- Benchmark different models (see `03_model_comparison.ipynb`)
- Monitor model drift over time (see `04_monitoring_drift.ipynb`)
- Apply to your own production data!