# Chapter 1c: Temporal Pattern Analysis (Event Bronze Track)

**Purpose:** Discover temporal patterns in event-level data that inform feature engineering and model design.

**When to use this notebook:**
- After completing 01a and 01b (temporal deep dive and quality checks)
- Your dataset is EVENT_LEVEL granularity
- You want to understand time-based patterns before aggregation

**What you'll learn:**
- How to detect long-term trends in your data
- How to identify seasonality patterns (weekly, monthly)
- How cohort analysis reveals customer lifecycle patterns
- How recency relates to target outcomes

**Pattern Categories:**

| Pattern | Description | Feature Engineering Impact |
|---------|-------------|---------------------------|
| **Trend** | Long-term direction (up/down) | Detrend features, add trend slope |
| **Seasonality** | Periodic patterns (weekly, monthly) | Add cyclical encodings, seasonal indicators |
| **Cohort Effects** | Behavior varies by join date | Add cohort features, stratify models |
| **Recency Effects** | Recent activity predicts outcomes | Prioritize recent time windows |

## 1c.1 Load Findings and Data

In [1]:
from customer_retention.analysis.auto_explorer import ExplorationFindings
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType, DatasetGranularity
from customer_retention.stages.profiling import (
    TemporalPatternAnalyzer, TemporalPatternAnalysis,
    TrendResult, TrendDirection, SeasonalityResult, RecencyResult,
    TemporalFeatureAnalyzer, VelocityResult, MomentumResult,
    LagCorrelationResult, PredictivePowerResult, FeatureRecommendation,
    CategoricalTargetAnalyzer
)
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from scipy import stats

In [2]:
# === CONFIGURATION ===
from pathlib import Path

FINDINGS_DIR = Path("../experiments/findings")

findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"Loaded findings for {findings.column_count} columns")

Using: ../experiments/findings/customer_emails_31faba_findings.yaml
Loaded findings for 12 columns


In [3]:
# Get time series configuration
ts_meta = findings.time_series_metadata
ENTITY_COLUMN = ts_meta.entity_column if ts_meta else None
TIME_COLUMN = ts_meta.time_column if ts_meta else None

print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")

# Note: Target column configuration is handled in section 1c.2 below
# This allows for event-level to entity-level aggregation when needed

Entity column: customer_id
Time column: sent_date


In [None]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

# Load source data (prefers snapshots over raw files)
df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
charts = ChartBuilder()

# Parse time column
df[TIME_COLUMN] = pd.to_datetime(df[TIME_COLUMN])

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {data_source}")

## 1c.2 Target Column Configuration

**📖 Event-Level vs Entity-Level Targets:**

In time series data, targets can be defined at different granularities:

| Target Level | Example | Usage |
|--------------|---------|-------|
| **Event-level** | "Did this email get clicked?" | Exists in raw data |
| **Entity-level** | "Did this customer churn?" | Need to join from entity table |

If your target is entity-level, you may need to join it or configure it manually.

In [5]:
# === TARGET CONFIGURATION ===
# Override these settings to customize target handling

# Option 1: Use auto-detected target from findings
# TARGET_COLUMN_OVERRIDE = None

# Option 2: Specify a different column as target
# TARGET_COLUMN_OVERRIDE = "unsubscribed"

# Option 3: Target will come from another dataset (set in 05_multi_dataset)
# TARGET_COLUMN_OVERRIDE = "DEFER_TO_MULTI_DATASET"

TARGET_COLUMN_OVERRIDE = None  # Change this to override

# Aggregation method for event-level targets
# Options: "max" (any=1 means churned), "mean" (proportion), "last" (final state), "sum" (count)
TARGET_AGGREGATION = "max"

# === AUTO-DETECT TARGET LEVEL ===
def detect_target_level(df, target_col, entity_col):
    """
    Detect if target is at event-level or entity-level.
    
    Returns: ("entity_level", None) or ("event_level", suggested_aggregation)
    """
    if target_col is None or entity_col is None:
        return "unknown", None
    
    if target_col not in df.columns:
        return "missing", None
    
    # Check if target varies within entities
    target_per_entity = df.groupby(entity_col)[target_col].nunique()
    entities_with_variation = (target_per_entity > 1).sum()
    total_entities = len(target_per_entity)
    variation_pct = entities_with_variation / total_entities * 100 if total_entities > 0 else 0
    
    # Check value distribution
    value_counts = df[target_col].value_counts(normalize=True)
    is_binary = len(value_counts) == 2
    
    # If target is same for all events of an entity, it's entity-level
    if variation_pct < 5:  # Less than 5% of entities have variation
        return "entity_level", None
    
    # If binary and varies within entities, likely event-level churn indicator
    if is_binary and variation_pct > 10:
        # Check if it's a "any positive = churned" pattern
        # (most events are 0, occasional 1)
        if value_counts.get(0, 0) > 0.8 or value_counts.get(1, 0) < 0.2:
            return "event_level", "max"
    
    return "event_level", "max"

# === COMPUTE ENTITY-LEVEL TARGET ===
print("="*70)
print("TARGET COLUMN ANALYSIS")
print("="*70)

# Determine target column
TARGET_COLUMN = None  # Initialize

if TARGET_COLUMN_OVERRIDE == "DEFER_TO_MULTI_DATASET":
    TARGET_COLUMN = None
    print("\n⏳ Target deferred to multi-dataset notebook (05)")
    print("   Analysis will proceed without target-based comparisons")
elif TARGET_COLUMN_OVERRIDE is not None:
    TARGET_COLUMN = TARGET_COLUMN_OVERRIDE
    print(f"\n🔧 Using override target: {TARGET_COLUMN}")
else:
    # Use auto-detected from findings
    for col_name, col_info in findings.columns.items():
        if col_info.inferred_type == ColumnType.TARGET:
            TARGET_COLUMN = col_name
            break
    
    if TARGET_COLUMN:
        print(f"\n🔍 Auto-detected target: {TARGET_COLUMN}")
    else:
        # Try to find binary columns that could be targets
        binary_candidates = []
        for col_name, col_info in findings.columns.items():
            if col_info.inferred_type == ColumnType.BINARY:
                # Check for churn-related names
                churn_keywords = ['churn', 'unsub', 'cancel', 'retain', 'active', 'lost', 'leave']
                if any(kw in col_name.lower() for kw in churn_keywords):
                    binary_candidates.append(col_name)
        
        if binary_candidates:
            TARGET_COLUMN = binary_candidates[0]
            print(f"\n🔍 No explicit target detected, using binary candidate: {TARGET_COLUMN}")
            print(f"   Other candidates: {binary_candidates[1:] if len(binary_candidates) > 1 else 'none'}")
        else:
            print("\n🔍 No target column detected")

# Analyze target level
if TARGET_COLUMN and TARGET_COLUMN in df.columns and ENTITY_COLUMN:
    target_level, suggested_agg = detect_target_level(df, TARGET_COLUMN, ENTITY_COLUMN)
    
    print(f"\n📊 Target Analysis:")
    print(f"   Column: {TARGET_COLUMN}")
    print(f"   Level detected: {target_level.upper()}")
    
    if target_level == "event_level":
        print(f"\n⚠️  EVENT-LEVEL TARGET DETECTED")
        print(f"   The target '{TARGET_COLUMN}' varies within entities.")
        print(f"   This needs to be aggregated to entity-level for proper analysis.")
        
        # Show distribution
        event_dist = df[TARGET_COLUMN].value_counts()
        print(f"\n   Event-level distribution:")
        for val, count in event_dist.items():
            pct = count / len(df) * 100
            print(f"      {TARGET_COLUMN}={val}: {count:,} events ({pct:.1f}%)")
        
        # Compute entity-level target
        print(f"\n   Aggregating to entity-level using: {TARGET_AGGREGATION}()")
        
        if TARGET_AGGREGATION == "max":
            entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].max()
        elif TARGET_AGGREGATION == "mean":
            entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].mean()
        elif TARGET_AGGREGATION == "sum":
            entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].sum()
        elif TARGET_AGGREGATION == "last":
            entity_target = df.sort_values(TIME_COLUMN).groupby(ENTITY_COLUMN)[TARGET_COLUMN].last()
        else:
            entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].max()
        
        # Show entity-level distribution
        entity_dist = entity_target.value_counts()
        print(f"\n   Entity-level distribution (after aggregation):")
        for val, count in entity_dist.items():
            pct = count / len(entity_target) * 100
            label = "Churned" if val == 1 else "Retained" if val == 0 else str(val)
            print(f"      {label} ({TARGET_COLUMN}={val}): {count:,} entities ({pct:.1f}%)")
        
        # Create entity-level target mapping and merge back
        entity_target_map = entity_target.reset_index()
        entity_target_map.columns = [ENTITY_COLUMN, f"{TARGET_COLUMN}_entity"]
        df = df.merge(entity_target_map, on=ENTITY_COLUMN, how="left")
        
        # Update TARGET_COLUMN to use entity-level version
        ORIGINAL_TARGET = TARGET_COLUMN
        TARGET_COLUMN = f"{TARGET_COLUMN}_entity"
        print(f"\n   ✓ Created entity-level target: {TARGET_COLUMN}")
        print(f"   ✓ Original event-level column preserved: {ORIGINAL_TARGET}")
        
    elif target_level == "entity_level":
        print(f"\n   ✓ Target is already at entity-level (consistent within entities)")
        # Verify it's properly mapped
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first()
        entity_dist = entity_target.value_counts()
        print(f"\n   Entity-level distribution:")
        for val, count in entity_dist.items():
            pct = count / len(entity_target) * 100
            label = "Churned" if val == 1 else "Retained" if val == 0 else str(val)
            print(f"      {label} ({TARGET_COLUMN}={val}): {count:,} entities ({pct:.1f}%)")

elif not TARGET_COLUMN:
    print("\n   ℹ️  No target column detected or configured")
    print("   Temporal pattern analysis will proceed without target-based comparisons")
    print("   To enable comparisons:")
    print("   - Set TARGET_COLUMN_OVERRIDE above, or")
    print("   - Define target in multi-dataset notebook (05)")

print("\n" + "─"*70)
print(f"Final configuration:")
print(f"   ENTITY_COLUMN: {ENTITY_COLUMN}")
print(f"   TIME_COLUMN: {TIME_COLUMN}")
print(f"   TARGET_COLUMN: {TARGET_COLUMN}")
print("─"*70)

TARGET COLUMN ANALYSIS

🔍 No explicit target detected, using binary candidate: unsubscribed
   Other candidates: none

📊 Target Analysis:
   Column: unsubscribed
   Level detected: EVENT_LEVEL

⚠️  EVENT-LEVEL TARGET DETECTED
   The target 'unsubscribed' varies within entities.
   This needs to be aggregated to entity-level for proper analysis.

   Event-level distribution:
      unsubscribed=0: 85,667 events (97.4%)
      unsubscribed=1: 2,315 events (2.6%)

   Aggregating to entity-level using: max()

   Entity-level distribution (after aggregation):
      Retained (unsubscribed=0): 2,683 entities (53.7%)
      Churned (unsubscribed=1): 2,315 entities (46.3%)

   ✓ Created entity-level target: unsubscribed_entity
   ✓ Original event-level column preserved: unsubscribed

──────────────────────────────────────────────────────────────────────
Final configuration:
   ENTITY_COLUMN: customer_id
   TIME_COLUMN: sent_date
   TARGET_COLUMN: unsubscribed_entity
───────────────────────────────

## 1c.3 Configure Value Column for Analysis

Temporal patterns are analyzed on aggregated metrics. Choose the primary metric to analyze.

In [6]:
# Find numeric columns that could be aggregated
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [c for c in numeric_cols if c not in [ENTITY_COLUMN]]

print("Available numeric columns for pattern analysis:")
for col in numeric_cols:
    print(f"   - {col}")

# Default: use event count (most common for pattern detection)
# Change this to analyze patterns in a specific metric
VALUE_COLUMN = "_event_count"  # Special: will aggregate event counts

Available numeric columns for pattern analysis:
   - opened
   - clicked
   - send_hour
   - unsubscribed
   - bounced
   - time_to_open_hours
   - unsubscribed_entity


In [7]:
# Prepare data for pattern analysis
# Aggregate to daily level for trend/seasonality detection

if VALUE_COLUMN == "_event_count":
    # Aggregate event counts by day
    daily_data = df.groupby(df[TIME_COLUMN].dt.date).size().reset_index()
    daily_data.columns = [TIME_COLUMN, "value"]
    daily_data[TIME_COLUMN] = pd.to_datetime(daily_data[TIME_COLUMN])
    analysis_col = "value"
    print("Analyzing: Daily event counts")
else:
    # Aggregate specific column by day
    daily_data = df.groupby(df[TIME_COLUMN].dt.date)[VALUE_COLUMN].sum().reset_index()
    daily_data.columns = [TIME_COLUMN, "value"]
    daily_data[TIME_COLUMN] = pd.to_datetime(daily_data[TIME_COLUMN])
    analysis_col = "value"
    print(f"Analyzing: Daily sum of {VALUE_COLUMN}")

print(f"\nDaily data points: {len(daily_data)}")
print(f"Date range: {daily_data[TIME_COLUMN].min()} to {daily_data[TIME_COLUMN].max()}")

Analyzing: Daily event counts

Daily data points: 3286
Date range: 2015-01-01 00:00:00 to 2023-12-30 00:00:00


## 1c.4 Trend Detection

**📖 Understanding Trends:**
- **Increasing**: Metric growing over time (e.g., expanding customer base)
- **Decreasing**: Metric shrinking (e.g., declining engagement)
- **Stationary**: No significant trend (stable business)

**Impact on ML:**
- Strong trends can cause data leakage if not handled
- Consider detrending or adding trend as explicit feature

In [8]:
# Run trend detection
analyzer = TemporalPatternAnalyzer(time_column=TIME_COLUMN)
trend_result = analyzer.detect_trend(daily_data, value_column=analysis_col)

print("\U0001f4c8 TREND ANALYSIS RESULTS")
print("="*50)

direction_emoji = {
    TrendDirection.INCREASING: "\U0001f4c8",
    TrendDirection.DECREASING: "\U0001f4c9",
    TrendDirection.STABLE: "\u27a1\ufe0f",
    TrendDirection.UNKNOWN: "\u2753",
}

print(f"\n   Direction: {direction_emoji.get(trend_result.direction, '')} {trend_result.direction.value.upper()}")
print(f"   Strength (R\u00b2): {trend_result.strength:.3f}")
print(f"   Confidence: {trend_result.confidence.upper()}")

if trend_result.slope is not None:
    print(f"   Slope: {trend_result.slope:.4f} per day")
    # Interpret slope
    mean_val = daily_data[analysis_col].mean()
    daily_pct_change = (trend_result.slope / mean_val) * 100 if mean_val != 0 else 0
    print(f"   Daily % change: {daily_pct_change:+.3f}%")

if trend_result.p_value is not None:
    print(f"   P-value: {trend_result.p_value:.4f}")

📈 TREND ANALYSIS RESULTS

   Direction: ➡️ STABLE
   Strength (R²): 0.574
   Confidence: HIGH
   Slope: -0.0066 per day
   Daily % change: -0.025%
   P-value: 0.0000


In [9]:
# Visualize trend
fig = go.Figure()

# Actual data
fig.add_trace(go.Scatter(
    x=daily_data[TIME_COLUMN],
    y=daily_data[analysis_col],
    mode="lines",
    name="Daily Values",
    line=dict(color="steelblue", width=1),
    opacity=0.7
))

# Add trend line
if trend_result.slope is not None:
    x_numeric = (daily_data[TIME_COLUMN] - daily_data[TIME_COLUMN].min()).dt.days
    y_trend = trend_result.slope * x_numeric + (
        daily_data[analysis_col].mean() - trend_result.slope * x_numeric.mean()
    )
    
    trend_color = {
        TrendDirection.INCREASING: "green",
        TrendDirection.DECREASING: "red",
        TrendDirection.STABLE: "gray",
        TrendDirection.UNKNOWN: "gray",
    }.get(trend_result.direction, "gray")
    
    fig.add_trace(go.Scatter(
        x=daily_data[TIME_COLUMN],
        y=y_trend,
        mode="lines",
        name=f"Trend ({trend_result.direction.value})",
        line=dict(color=trend_color, width=3, dash="dash")
    ))

# Add rolling average for smoothing
rolling_avg = daily_data[analysis_col].rolling(window=7, center=True).mean()
fig.add_trace(go.Scatter(
    x=daily_data[TIME_COLUMN],
    y=rolling_avg,
    mode="lines",
    name="7-day Rolling Avg",
    line=dict(color="orange", width=2)
))

fig.update_layout(
    title=f"Trend Analysis: {trend_result.direction.value.title()} (R\u00b2={trend_result.strength:.2f}, {trend_result.confidence} confidence)",
    xaxis_title="Date",
    yaxis_title="Value",
    template="plotly_white",
    height=450,
    legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
display_figure(fig)

## 1c.5 Seasonality Detection

**📖 Understanding Seasonality:**
- **Weekly** (period=7): Higher activity on certain days
- **Monthly** (period~30): End-of-month patterns, billing cycles
- **Quarterly** (period~90): Business cycles, seasonal products

**Impact on ML:**
- Add day-of-week, month features
- Consider seasonal decomposition
- Use cyclical encodings (sin/cos) for neural networks

In [10]:
# Run seasonality detection
seasonality_results = analyzer.detect_seasonality(daily_data, value_column=analysis_col)

print("\U0001f501 SEASONALITY ANALYSIS RESULTS")
print("="*50)

if seasonality_results:
    print(f"\n   Detected {len(seasonality_results)} seasonal pattern(s):\n")
    
    for i, sr in enumerate(seasonality_results, 1):
        strength_label = "Strong" if sr.strength > 0.5 else "Moderate" if sr.strength > 0.3 else "Weak"
        period_name = sr.period_name or f"{sr.period}-day"
        print(f"   {i}. {period_name.title()} Pattern")
        print(f"      Period: {sr.period} days")
        print(f"      Strength: {sr.strength:.3f} ({strength_label})")
        print()
else:
    print("\n   No significant seasonal patterns detected.")
    print("   This could mean:")
    print("   - Data is truly non-seasonal")
    print("   - Not enough data points for detection")
    print("   - High noise obscuring patterns")

🔁 SEASONALITY ANALYSIS RESULTS

   Detected 3 seasonal pattern(s):

   1. Weekly Pattern
      Period: 7 days
      Strength: 0.593 (Strong)

   2. 14-Day Pattern
      Period: 14 days
      Strength: 0.585 (Strong)

   3. Monthly Pattern
      Period: 30 days
      Strength: 0.580 (Strong)



In [11]:
# Visualize day-of-week pattern
daily_data["day_of_week"] = daily_data[TIME_COLUMN].dt.day_name()
dow_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
daily_data["day_of_week"] = pd.Categorical(daily_data["day_of_week"], categories=dow_order, ordered=True)

dow_stats = daily_data.groupby("day_of_week")[analysis_col].agg(["mean", "std"]).reset_index()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=dow_stats["day_of_week"],
    y=dow_stats["mean"],
    error_y=dict(type="data", array=dow_stats["std"]),
    name="Mean",
    marker_color="steelblue"
))

# Mark weekends
for i, day in enumerate(dow_stats["day_of_week"]):
    if day in ["Saturday", "Sunday"]:
        fig.add_vrect(
            x0=i-0.4, x1=i+0.4,
            fillcolor="lightgray", opacity=0.3,
            layer="below", line_width=0
        )

fig.update_layout(
    title="Day of Week Pattern (gray = weekend)",
    xaxis_title="Day of Week",
    yaxis_title="Average Value",
    template="plotly_white",
    height=400
)
display_figure(fig)





In [12]:
# Monthly pattern analysis
daily_data["month"] = daily_data[TIME_COLUMN].dt.month_name()
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Only include months present in data
present_months = [m for m in month_order if m in daily_data["month"].values]
daily_data["month"] = pd.Categorical(daily_data["month"], categories=present_months, ordered=True)

monthly_stats = daily_data.groupby("month")[analysis_col].agg(["mean", "std"]).reset_index()

if len(monthly_stats) > 1:
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=monthly_stats["month"],
        y=monthly_stats["mean"],
        error_y=dict(type="data", array=monthly_stats["std"]),
        name="Mean",
        marker_color="mediumpurple"
    ))
    
    # Add overall mean line
    overall_mean = daily_data[analysis_col].mean()
    fig.add_hline(y=overall_mean, line_dash="dash", line_color="red",
                  annotation_text=f"Overall Mean: {overall_mean:.1f}",
                  annotation_position="top right")
    
    fig.update_layout(
        title="Monthly Pattern",
        xaxis_title="Month",
        yaxis_title="Average Value",
        template="plotly_white",
        height=400
    )
    display_figure(fig)
else:
    print("Not enough months of data for monthly pattern analysis")





## 1c.6 Cohort Analysis

**\U0001f4d6 Understanding Cohorts:**
- Group entities by when they first appeared (signup cohort)
- Compare behavior across cohorts
- Identify if acquisition quality changed over time

In [13]:
# Cohort analysis requires entity column
if ENTITY_COLUMN:
    # Define cohort as the month of first event
    first_events = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].min().reset_index()
    first_events.columns = [ENTITY_COLUMN, "first_event"]
    first_events["cohort"] = first_events["first_event"].dt.to_period("M")
    
    # Merge cohort info back to main data
    df_cohort = df.merge(first_events[[ENTITY_COLUMN, "cohort"]], on=ENTITY_COLUMN)
    
    # Cohort-level analysis
    cohort_result = analyzer.analyze_cohorts(
        df,
        entity_column=ENTITY_COLUMN,
        cohort_column=TIME_COLUMN,  # Will use min event date as cohort
        target_column=TARGET_COLUMN,
        period="M"
    )
    
    print("\U0001f465 COHORT ANALYSIS RESULTS")
    print("="*50)
    print(f"\n   Cohorts identified: {len(cohort_result)}")
    
    if len(cohort_result) > 0:
        display_table(cohort_result.head(12))
else:
    print("Entity column not set - skipping cohort analysis")

👥 COHORT ANALYSIS RESULTS

   Cohorts identified: 108


cohort,entity_count,first_event,last_event,retention_rate
2015-01,1051,2015-01-01,2015-01-31,0.584548
2015-02,927,2015-02-01,2015-02-28,0.598478
2015-03,974,2015-03-01,2015-03-31,0.546643
2015-04,968,2015-04-01,2015-04-30,0.569257
2015-05,1000,2015-05-01,2015-05-31,0.55678
2015-06,961,2015-06-01,2015-06-30,0.541268
2015-07,962,2015-07-01,2015-07-31,0.555749
2015-08,1023,2015-08-01,2015-08-31,0.552739
2015-09,974,2015-09-01,2015-09-30,0.530776
2015-10,962,2015-10-01,2015-10-31,0.528674


In [14]:
# Visualize cohort sizes and retention
if ENTITY_COLUMN and len(cohort_result) > 0:
    cohort_result_sorted = cohort_result.sort_values("cohort")
    
    # Compute retention rate if not present but target exists
    if "retention_rate" not in cohort_result.columns and TARGET_COLUMN and TARGET_COLUMN in df.columns:
        # Calculate retention rate per cohort from raw data
        entity_cohort = first_events[[ENTITY_COLUMN, "cohort"]]
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
        cohort_target = entity_cohort.merge(entity_target, on=ENTITY_COLUMN)
        
        retention_by_cohort = cohort_target.groupby("cohort")[TARGET_COLUMN].mean().reset_index()
        retention_by_cohort.columns = ["cohort", "retention_rate"]
        
        cohort_result_sorted = cohort_result_sorted.merge(retention_by_cohort, on="cohort", how="left")
    
    # Decide layout based on available data
    has_retention = "retention_rate" in cohort_result_sorted.columns and cohort_result_sorted["retention_rate"].notna().any()
    
    if has_retention:
        fig = make_subplots(rows=1, cols=2, subplot_titles=("Cohort Sizes", "Retention Rate by Cohort"))
        
        # Cohort sizes
        fig.add_trace(
            go.Bar(
                x=cohort_result_sorted["cohort"].astype(str),
                y=cohort_result_sorted["entity_count"],
                name="Entities",
                marker_color="steelblue"
            ),
            row=1, col=1
        )
        
        # Retention rate
        fig.add_trace(
            go.Scatter(
                x=cohort_result_sorted["cohort"].astype(str),
                y=cohort_result_sorted["retention_rate"] * 100,
                mode="lines+markers",
                name="Retention %",
                line=dict(color="green", width=2),
                marker=dict(size=8)
            ),
            row=1, col=2
        )
        fig.update_yaxes(title_text="Entity Count", row=1, col=1)
        fig.update_yaxes(title_text="Retention %", row=1, col=2)
        
        fig.update_layout(
            title="Cohort Overview",
            template="plotly_white",
            height=400,
            showlegend=False
        )
    else:
        # Single chart - cohort sizes only
        fig = go.Figure()
        fig.add_trace(
            go.Bar(
                x=cohort_result_sorted["cohort"].astype(str),
                y=cohort_result_sorted["entity_count"],
                name="Entities",
                marker_color="steelblue",
                text=cohort_result_sorted["entity_count"],
                textposition="outside"
            )
        )
        fig.update_layout(
            title="Cohort Sizes (No target column for retention analysis)",
            xaxis_title="Cohort",
            yaxis_title="Entity Count",
            template="plotly_white",
            height=400
        )
        print("💡 Tip: Set TARGET_COLUMN to see retention rates by cohort")
    
    fig.update_xaxes(tickangle=45)
    display_figure(fig)

## 1c.7 Recency Analysis

**\U0001f4d6 Understanding Recency:**
- Time since last event for each entity
- Often strongly correlated with churn/retention
- Key feature for predictive models

In [15]:
# Run recency analysis
if ENTITY_COLUMN:
    recency_result = analyzer.analyze_recency(
        df,
        entity_column=ENTITY_COLUMN,
        target_column=TARGET_COLUMN,
        reference_date=df[TIME_COLUMN].max()  # Use latest date in data as reference
    )
    
    print("\u23f1\ufe0f  RECENCY ANALYSIS RESULTS")
    print("="*50)
    print(f"\n   Reference date: {df[TIME_COLUMN].max()}")
    print(f"\n   Recency Statistics (days since last event):")
    print(f"      Mean: {recency_result.avg_recency_days:.1f}")
    print(f"      Median: {recency_result.median_recency_days:.1f}")
    print(f"      Min: {recency_result.min_recency_days:.1f}")
    print(f"      Max: {recency_result.max_recency_days:.1f}")
    
    if recency_result.target_correlation is not None:
        corr = recency_result.target_correlation
        corr_strength = "Strong" if abs(corr) > 0.5 else "Moderate" if abs(corr) > 0.3 else "Weak"
        corr_direction = "negative" if corr < 0 else "positive"
        
        print(f"\n   \U0001f3af Target Correlation:")
        print(f"      Correlation: {corr:.3f}")
        print(f"      Interpretation: {corr_strength} {corr_direction} correlation")
        
        if corr < -0.3:
            print(f"      \U0001f4a1 Insight: Lower recency (recent activity) associates with higher target")
            print(f"         This suggests recency is a strong predictor - use in features!")
        elif corr > 0.3:
            print(f"      \U0001f4a1 Insight: Higher recency (longer since last event) associates with higher target")
else:
    print("Entity column not set - skipping recency analysis")

⏱️  RECENCY ANALYSIS RESULTS

   Reference date: 2023-12-30 00:00:00

   Recency Statistics (days since last event):
      Mean: 879.1
      Median: 316.0
      Min: 0.0
      Max: 3285.0

   🎯 Target Correlation:
      Correlation: 0.779
      Interpretation: Strong positive correlation
      💡 Insight: Higher recency (longer since last event) associates with higher target


In [16]:
# Visualize recency distribution - COMPARING RETAINED VS CHURNED
if ENTITY_COLUMN:
    # Compute recency for each entity
    reference_date = df[TIME_COLUMN].max()
    entity_last = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].max().reset_index()
    entity_last["recency_days"] = (reference_date - entity_last[TIME_COLUMN]).dt.days
    
    # Add target for comparison
    if TARGET_COLUMN and TARGET_COLUMN in df.columns:
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
        entity_recency = entity_last.merge(entity_target, on=ENTITY_COLUMN)
        has_target = True
    else:
        entity_recency = entity_last.copy()
        has_target = False
    
    # Cap for visualization
    cap = entity_recency["recency_days"].quantile(0.99)
    entity_recency_capped = entity_recency[entity_recency["recency_days"] <= cap]
    
    if has_target:
        # SIDE-BY-SIDE COMPARISON: Retained vs Churned
        print("="*70)
        print("RECENCY DISTRIBUTION: Retained vs Churned Comparison")
        print("="*70)
        
        retained_recency = entity_recency_capped[entity_recency_capped[TARGET_COLUMN] == 1]["recency_days"]
        churned_recency = entity_recency_capped[entity_recency_capped[TARGET_COLUMN] == 0]["recency_days"]
        
        fig = make_subplots(
            rows=1, cols=2,
            subplot_titles=[
                f"🟢 RETAINED (n={len(retained_recency):,})",
                f"🔴 CHURNED (n={len(churned_recency):,})"
            ],
            horizontal_spacing=0.1
        )
        
        # Retained histogram
        fig.add_trace(go.Histogram(
            x=retained_recency,
            nbinsx=30,
            name="Retained",
            marker_color="rgba(46, 204, 113, 0.7)",
            showlegend=False
        ), row=1, col=1)
        
        # Churned histogram
        fig.add_trace(go.Histogram(
            x=churned_recency,
            nbinsx=30,
            name="Churned",
            marker_color="rgba(231, 76, 60, 0.7)",
            showlegend=False
        ), row=1, col=2)
        
        # Add median lines
        fig.add_vline(x=retained_recency.median(), line_dash="solid", line_color="green",
                      annotation_text=f"Med: {retained_recency.median():.0f}d", row=1, col=1)
        fig.add_vline(x=churned_recency.median(), line_dash="solid", line_color="red",
                      annotation_text=f"Med: {churned_recency.median():.0f}d", row=1, col=2)
        
        fig.update_layout(
            title="Recency Distribution: Compare Shape and Median Between Groups",
            template="plotly_white",
            height=400
        )
        fig.update_xaxes(title_text="Days Since Last Event", row=1, col=1)
        fig.update_xaxes(title_text="Days Since Last Event", row=1, col=2)
        fig.update_yaxes(title_text="Number of Entities", row=1, col=1)
        
        display_figure(fig)
        
        # Summary statistics
        print("\n📊 Recency Statistics by Retention Status:")
        print("-" * 60)
        print(f"{'Metric':<20} {'Retained':>15} {'Churned':>15} {'Difference':>15}")
        print("-" * 60)
        
        metrics = [
            ("Mean", retained_recency.mean(), churned_recency.mean()),
            ("Median", retained_recency.median(), churned_recency.median()),
            ("Std Dev", retained_recency.std(), churned_recency.std()),
            ("25th Percentile", retained_recency.quantile(0.25), churned_recency.quantile(0.25)),
            ("75th Percentile", retained_recency.quantile(0.75), churned_recency.quantile(0.75)),
        ]
        
        for name, ret_val, churn_val in metrics:
            diff = ret_val - churn_val
            print(f"{name:<20} {ret_val:>15.1f} {churn_val:>15.1f} {diff:>+15.1f}")
        
        # Calculate effect size for recency
        pooled_std = np.sqrt((retained_recency.var() + churned_recency.var()) / 2)
        if pooled_std > 0:
            cohens_d = (retained_recency.mean() - churned_recency.mean()) / pooled_std
        else:
            cohens_d = 0
        
        abs_d = abs(cohens_d)
        if abs_d >= 0.8:
            effect_interp = "Large effect"
        elif abs_d >= 0.5:
            effect_interp = "Medium effect"
        elif abs_d >= 0.2:
            effect_interp = "Small effect"
        else:
            effect_interp = "Negligible"
        
        print(f"\n📈 Effect Size (Cohen's d): {cohens_d:+.3f} ({effect_interp})")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 HOW TO INTERPRET RECENCY COMPARISON")
        print("─"*70)
        if churned_recency.median() > retained_recency.median():
            print("""
Key Finding: Churned customers have HIGHER recency (more days since last event)

This is a classic churn pattern - customers who leave typically show:
  • Longer gaps between activities before churning
  • Declining engagement over time
  • Last activity farther from observation date

Feature Engineering Recommendations:
  • days_since_last_event (recency as-is)
  • log_recency (if distribution is skewed)
  • recency_bucket (categorical: 0-7d, 8-30d, 31-90d, >90d)
  • is_recent_active (binary: recency < 30 days)
""")
        else:
            print("""
Observation: Retained customers have similar or higher recency than churned

This is unusual - investigate whether:
  • Churn is happening very quickly (new customers leaving fast)
  • There's a time window issue in the data
  • Target definition may need review
""")
        
    else:
        # Single distribution (no target)
        fig = go.Figure()
        fig.add_trace(go.Histogram(
            x=entity_recency_capped["recency_days"],
            nbinsx=50,
            name="Recency",
            marker_color="coral",
            opacity=0.7
        ))
        
        fig.add_vline(x=recency_result.median_recency_days, line_dash="solid", line_color="green",
                      annotation_text=f"Median: {recency_result.median_recency_days:.0f} days",
                      annotation_position="top right")
        
        fig.update_layout(
            title=f"Recency Distribution (capped at {cap:.0f} days = 99th percentile)",
            xaxis_title="Days Since Last Event",
            yaxis_title="Number of Entities",
            template="plotly_white",
            height=400
        )
        display_figure(fig)

RECENCY DISTRIBUTION: Retained vs Churned Comparison



📊 Recency Statistics by Retention Status:
------------------------------------------------------------
Metric                      Retained         Churned      Difference
------------------------------------------------------------
Mean                          1684.7           155.0         +1529.6
Median                        1721.0           106.0         +1615.0
Std Dev                        885.9           157.2          +728.7
25th Percentile                946.0            42.0          +904.0
75th Percentile               2448.0           217.0         +2231.0

📈 Effect Size (Cohen's d): +2.404 (Large effect)

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET RECENCY COMPARISON
──────────────────────────────────────────────────────────────────────

Observation: Retained customers have similar or higher recency than churned

This is unusual - investigate whether:
  • Churn is happening very quickly (new customers leaving fast)
  • Ther

In [17]:
# Recency vs Target visualization (if target exists)
if ENTITY_COLUMN and TARGET_COLUMN and TARGET_COLUMN in df.columns:
    # Get target per entity
    entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
    
    # Merge with recency
    recency_target = entity_last.merge(entity_target, on=ENTITY_COLUMN)
    
    # Bin recency for clearer visualization
    recency_target["recency_bin"] = pd.cut(
        recency_target["recency_days"],
        bins=[0, 7, 30, 90, 180, float("inf")],
        labels=["0-7d", "8-30d", "31-90d", "91-180d", ">180d"]
    )
    
    # Target rate by recency bin
    target_by_recency = recency_target.groupby("recency_bin")[TARGET_COLUMN].agg(["mean", "count"]).reset_index()
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig.add_trace(
        go.Bar(
            x=target_by_recency["recency_bin"].astype(str),
            y=target_by_recency["count"],
            name="Entity Count",
            marker_color="lightsteelblue",
            opacity=0.7
        ),
        secondary_y=False
    )
    
    fig.add_trace(
        go.Scatter(
            x=target_by_recency["recency_bin"].astype(str),
            y=target_by_recency["mean"] * 100,
            mode="lines+markers",
            name="Target Rate %",
            line=dict(color="red", width=3),
            marker=dict(size=10)
        ),
        secondary_y=True
    )
    
    fig.update_layout(
        title="Target Rate by Recency Bucket",
        xaxis_title="Days Since Last Event",
        template="plotly_white",
        height=450
    )
    fig.update_yaxes(title_text="Entity Count", secondary_y=False)
    fig.update_yaxes(title_text="Target Rate %", secondary_y=True)
    
    display_figure(fig)





## 1c.8 Feature Correlations and Relationships

**📖 Understanding Feature Relationships in Event Data:**
- **Correlation Matrix**: Identify redundant features (multicollinearity)
- **Effect Sizes**: How well features discriminate by target (if available)
- **Cramér's V**: Association strength for categorical features

These analyses parallel the standard track (notebook 04) but applied to event-level attributes.

In [18]:
# Correlation matrix for numeric event attributes
numeric_event_cols = [c for c in df.select_dtypes(include=[np.number]).columns 
                      if c not in [ENTITY_COLUMN, TARGET_COLUMN]]

if len(numeric_event_cols) >= 2:
    corr_matrix = df[numeric_event_cols].corr()
    fig = charts.heatmap(
        corr_matrix.values, x_labels=numeric_event_cols, y_labels=numeric_event_cols,
        title="Event Attribute Correlation Matrix"
    )
    display_figure(fig)
    
    # High correlation pairs (multicollinearity detection)
    high_corr_pairs = []
    for i in range(len(numeric_event_cols)):
        for j in range(i+1, len(numeric_event_cols)):
            corr_val = corr_matrix.iloc[i, j]
            if abs(corr_val) >= 0.7:
                high_corr_pairs.append({
                    "Column 1": numeric_event_cols[i], "Column 2": numeric_event_cols[j],
                    "Correlation": f"{corr_val:.3f}"
                })
    
    if high_corr_pairs:
        print("⚠️ High Correlation Pairs (|r| >= 0.7):")
        display_table(pd.DataFrame(high_corr_pairs))
    else:
        print("✓ No high correlation pairs detected (multicollinearity not a concern)")
else:
    print("Not enough numeric columns for correlation analysis.")

✓ No high correlation pairs detected (multicollinearity not a concern)


In [19]:
# Categorical feature analysis using Cramér's V (if target exists at entity level)
categorical_cols = [c for c in df.select_dtypes(include=['object', 'category']).columns 
                    if c not in [ENTITY_COLUMN, TIME_COLUMN]]

if categorical_cols and ENTITY_COLUMN:
    print("="*70)
    print("CATEGORICAL FEATURE ANALYSIS (Cramér's V)")
    print("="*70)
    
    # For event data, aggregate to entity level first (mode category per entity)
    entity_cats = df.groupby(ENTITY_COLUMN)[categorical_cols].agg(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    
    if TARGET_COLUMN and TARGET_COLUMN in df.columns:
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first()
        entity_data = entity_cats.join(entity_target)
        
        overall_retention = entity_data[TARGET_COLUMN].mean()
        print(f"\nOverall retention rate: {overall_retention:.1%}")
        
        cat_analyzer = CategoricalTargetAnalyzer(min_samples_per_category=10)
        cat_summary = cat_analyzer.analyze_multiple(entity_data.reset_index(), categorical_cols, TARGET_COLUMN)
        
        print("\n📊 Categorical Feature Strength:")
        print(f"{'Feature':<25} {'Cramér V':>10} {'Strength':<12} {'Significance'}")
        print("-" * 60)
        
        for _, row in cat_summary.iterrows():
            strength = "Strong" if row["cramers_v"] >= 0.3 else "Moderate" if row["cramers_v"] >= 0.1 else "Weak"
            sig = "***" if row["p_value"] < 0.001 else "**" if row["p_value"] < 0.01 else "*" if row["p_value"] < 0.05 else ""
            print(f"{row['feature'][:24]:<25} {row['cramers_v']:>10.3f} {strength:<12} {sig}")
        
        # Detailed analysis for top categorical features
        for col_name in categorical_cols[:3]:
            result = cat_analyzer.analyze(entity_data.reset_index(), col_name, TARGET_COLUMN)
            
            if len(result.category_stats) > 0:
                print(f"\n{'─'*60}")
                print(f"📊 {col_name.upper()} - Retention by Category")
                print("─"*60)
                
                cat_stats = result.category_stats
                categories = cat_stats['category'].tolist()
                retained_counts = cat_stats['retained_count'].tolist()
                churned_counts = cat_stats['churned_count'].tolist()
                
                # Stacked bar chart
                fig = go.Figure()
                fig.add_trace(go.Bar(
                    name='Retained', x=categories, y=retained_counts,
                    marker_color='rgba(46, 204, 113, 0.8)',
                    text=[f"{r/(r+c)*100:.0f}%" for r, c in zip(retained_counts, churned_counts)],
                    textposition='inside', textfont=dict(color='white', size=12)
                ))
                fig.add_trace(go.Bar(
                    name='Churned', x=categories, y=churned_counts,
                    marker_color='rgba(231, 76, 60, 0.8)',
                ))
                fig.update_layout(
                    barmode='stack', title=f"Retention by {col_name}",
                    template='plotly_white', height=350,
                    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5)
                )
                display_figure(fig)
                
                # High-risk categories
                if result.high_risk_categories:
                    print(f"\n⚠️ High-risk categories (below average retention):")
                    for cat in result.high_risk_categories[:3]:
                        cat_row = cat_stats[cat_stats['category'] == cat].iloc[0]
                        print(f"   • {cat}: {cat_row['retention_rate']:.1%} retention ({cat_row['lift']:.2f}x lift)")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 INTERPRETING CRAMÉR'S V")
        print("─"*70)
        print("""
Cramér's V measures association strength for categorical variables:
  V ≥ 0.3:  Strong association
  V 0.1-0.3: Moderate association
  V < 0.1:  Weak association

Significance: *** p<0.001, ** p<0.01, * p<0.05

High-risk categories (lift < 0.9x overall retention):
  → Target for retention campaigns
  → Investigate why these segments churn more
""")
    else:
        print("\nCategorical columns found but no target for association analysis")
        print(f"  Columns: {categorical_cols}")
elif not categorical_cols:
    print("No categorical columns found for Cramér's V analysis")

CATEGORICAL FEATURE ANALYSIS (Cramér's V)

Overall retention rate: 46.3%

📊 Categorical Feature Strength:
Feature                     Cramér V Strength     Significance
------------------------------------------------------------
email_id                       1.000 Strong       
campaign_type                  0.106 Moderate     ***
device_type                    0.041 Weak         *
subject_line_category          0.026 Weak         

────────────────────────────────────────────────────────────
📊 CAMPAIGN_TYPE - Retention by Category
────────────────────────────────────────────────────────────



────────────────────────────────────────────────────────────
📊 SUBJECT_LINE_CATEGORY - Retention by Category
────────────────────────────────────────────────────────────



──────────────────────────────────────────────────────────────────────
📖 INTERPRETING CRAMÉR'S V
──────────────────────────────────────────────────────────────────────

Cramér's V measures association strength for categorical variables:
  V ≥ 0.3:  Strong association
  V 0.1-0.3: Moderate association
  V < 0.1:  Weak association

Significance: *** p<0.001, ** p<0.01, * p<0.05

High-risk categories (lift < 0.9x overall retention):
  → Target for retention campaigns
  → Investigate why these segments churn more



## 1c.9 Entity-Level Feature Analysis (Effect Sizes)

**📖 Why Aggregate to Entity Level:**
- Time series data has multiple events per entity
- Target variable (retention) is typically at entity level
- Effect sizes (Cohen's d) require entity-level comparison

**Effect Size Interpretation (Cohen's d):**

| |d| | Interpretation | Predictive Power | Action |
|-----|----------------|------------------|--------|
| ≥ 0.8 | Large | Strong discriminator | Priority feature - include in model |
| 0.5-0.8 | Medium | Useful predictor | Include in model |
| 0.2-0.5 | Small | Weak signal | May help in combination |
| < 0.2 | Negligible | Limited value alone | Consider dropping or engineering |

**Direction matters:**
- **Positive d**: Retained customers have HIGHER values
- **Negative d**: Retained customers have LOWER values

In [20]:
# Aggregate event data to entity level for effect size analysis
if ENTITY_COLUMN and TARGET_COLUMN and TARGET_COLUMN in df.columns:
    # Build entity-level aggregations
    entity_aggs = df.groupby(ENTITY_COLUMN).agg({
        TIME_COLUMN: ['count', 'min', 'max'],
        **{col: ['mean', 'sum', 'std'] for col in numeric_event_cols if col != TARGET_COLUMN}
    })
    entity_aggs.columns = ['_'.join(col).strip() for col in entity_aggs.columns]
    entity_aggs = entity_aggs.reset_index()
    
    # Add target
    entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
    entity_df = entity_aggs.merge(entity_target, on=ENTITY_COLUMN)
    
    # Add derived features
    entity_df['tenure_days'] = (entity_df[f'{TIME_COLUMN}_max'] - entity_df[f'{TIME_COLUMN}_min']).dt.days
    entity_df['event_count'] = entity_df[f'{TIME_COLUMN}_count']
    
    # Calculate effect sizes (Cohen's d) for entity-level features
    effect_feature_cols = [c for c in entity_df.select_dtypes(include=[np.number]).columns
                          if c not in [ENTITY_COLUMN, TARGET_COLUMN]]
    
    print("="*80)
    print("ENTITY-LEVEL FEATURE EFFECT SIZES (Cohen's d)")
    print("="*80)
    print(f"\nAnalyzing {len(effect_feature_cols)} aggregated features at entity level")
    print(f"Entities: {len(entity_df):,} (Retained: {(entity_df[TARGET_COLUMN]==1).sum():,}, Churned: {(entity_df[TARGET_COLUMN]==0).sum():,})\n")
    
    effect_sizes = []
    for col in effect_feature_cols:
        churned = entity_df[entity_df[TARGET_COLUMN] == 0][col].dropna()
        retained = entity_df[entity_df[TARGET_COLUMN] == 1][col].dropna()
        
        if len(churned) > 0 and len(retained) > 0:
            pooled_std = np.sqrt(((len(churned)-1)*churned.std()**2 + (len(retained)-1)*retained.std()**2) / 
                                 (len(churned) + len(retained) - 2))
            d = (retained.mean() - churned.mean()) / pooled_std if pooled_std > 0 else 0
            
            abs_d = abs(d)
            if abs_d >= 0.8:
                interp, emoji = "Large effect", "🔴"
            elif abs_d >= 0.5:
                interp, emoji = "Medium effect", "🟡"
            elif abs_d >= 0.2:
                interp, emoji = "Small effect", "🟢"
            else:
                interp, emoji = "Negligible", "⚪"
            
            effect_sizes.append({
                "feature": col, "cohens_d": d, "abs_d": abs_d, 
                "interpretation": interp, "emoji": emoji,
                "retained_mean": retained.mean(), "churned_mean": churned.mean()
            })
    
    # Sort and display
    effect_df = pd.DataFrame(effect_sizes).sort_values("abs_d", ascending=False)
    
    print(f"{'Feature':<35} {'d':>8} {'Effect':<15} {'Direction':<20}")
    print("-" * 80)
    for _, row in effect_df.head(15).iterrows():
        direction = "↑ Higher in retained" if row["cohens_d"] > 0 else "↓ Lower in retained"
        print(f"{row['emoji']} {row['feature'][:33]:<33} {row['cohens_d']:>+8.3f} {row['interpretation']:<15} {direction:<20}")
    
    # Categorize features
    large_effect = effect_df[effect_df["abs_d"] >= 0.8]["feature"].tolist()
    medium_effect = effect_df[(effect_df["abs_d"] >= 0.5) & (effect_df["abs_d"] < 0.8)]["feature"].tolist()
    small_effect = effect_df[(effect_df["abs_d"] >= 0.2) & (effect_df["abs_d"] < 0.5)]["feature"].tolist()
    
    # INTERPRETATION
    print("\n" + "─"*80)
    print("📖 INTERPRETATION & RECOMMENDATIONS")
    print("─"*80)
    
    if large_effect:
        print(f"\n🔴 LARGE EFFECT (|d| ≥ 0.8) - Priority Features:")
        for f in large_effect[:5]:
            row = effect_df[effect_df["feature"] == f].iloc[0]
            direction = "higher" if row["cohens_d"] > 0 else "lower"
            print(f"   • {f}: Retained customers have {direction} values")
            print(f"     Mean: Retained={row['retained_mean']:.2f}, Churned={row['churned_mean']:.2f}")
        print("   → MUST include in predictive model")
    
    if medium_effect:
        print(f"\n🟡 MEDIUM EFFECT (0.5 ≤ |d| < 0.8) - Useful Features:")
        for f in medium_effect[:3]:
            print(f"   • {f}")
        print("   → Should include in model")
    
    if small_effect:
        print(f"\n🟢 SMALL EFFECT (0.2 ≤ |d| < 0.5) - Supporting Features:")
        print(f"   {', '.join(small_effect[:5])}")
        print("   → May help in combination with other features")
    
    negligible = effect_df[effect_df["abs_d"] < 0.2]["feature"].tolist()
    if negligible:
        print(f"\n⚪ NEGLIGIBLE EFFECT (|d| < 0.2): {len(negligible)} features")
        print("   → Consider engineering or dropping from model")
else:
    print("Entity column or target not available for effect size analysis")

ENTITY-LEVEL FEATURE EFFECT SIZES (Cohen's d)

Analyzing 21 aggregated features at entity level
Entities: 4,998 (Retained: 2,315, Churned: 2,683)

Feature                                    d Effect          Direction           
--------------------------------------------------------------------------------
🔴 unsubscribed_std                    +3.986 Large effect    ↑ Higher in retained
🔴 tenure_days                         -2.459 Large effect    ↓ Lower in retained 
🔴 unsubscribed_mean                   +1.283 Large effect    ↑ Higher in retained
🔴 opened_sum                          -0.862 Large effect    ↓ Lower in retained 
🔴 opened_std                          -0.846 Large effect    ↓ Lower in retained 
🟡 sent_date_count                     -0.793 Medium effect   ↓ Lower in retained 
🟡 event_count                         -0.793 Medium effect   ↓ Lower in retained 
🟡 send_hour_sum                       -0.780 Medium effect   ↓ Lower in retained 
🟡 opened_mean                     

In [21]:
# Box Plots: Entity-level feature distributions by target
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir() and len(effect_df) > 0:
    # Select top features by effect size for visualization
    top_features = effect_df.head(6)["feature"].tolist()
    n_features = len(top_features)
    
    if n_features > 0:
        print("="*70)
        print("DISTRIBUTION COMPARISON: Retained vs Churned (Box Plots)")
        print("="*70)
        print("\n📊 Showing top 6 features by effect size")
        print("   🟢 Green = Retained | 🔴 Red = Churned\n")
        
        fig = make_subplots(rows=1, cols=n_features, subplot_titles=top_features, horizontal_spacing=0.05)
        
        for i, col in enumerate(top_features):
            col_num = i + 1
            
            # Retained (1) - Green
            retained_data = entity_df[entity_df[TARGET_COLUMN] == 1][col].dropna()
            fig.add_trace(go.Box(y=retained_data, name='Retained',
                fillcolor='rgba(46, 204, 113, 0.7)', line=dict(color='#1e8449', width=2),
                boxpoints='outliers', width=0.35, showlegend=(i == 0), legendgroup='retained',
                marker=dict(color='rgba(46, 204, 113, 0.5)', size=4)), row=1, col=col_num)
            
            # Churned (0) - Red
            churned_data = entity_df[entity_df[TARGET_COLUMN] == 0][col].dropna()
            fig.add_trace(go.Box(y=churned_data, name='Churned',
                fillcolor='rgba(231, 76, 60, 0.7)', line=dict(color='#922b21', width=2),
                boxpoints='outliers', width=0.35, showlegend=(i == 0), legendgroup='churned',
                marker=dict(color='rgba(231, 76, 60, 0.5)', size=4)), row=1, col=col_num)
        
        fig.update_layout(height=450, title_text="Top Features: Retained (Green) vs Churned (Red)",
            template='plotly_white', showlegend=True, boxmode='group',
            legend=dict(orientation="h", yanchor="bottom", y=1.05, xanchor="center", x=0.5))
        fig.update_xaxes(showticklabels=False)
        display_figure(fig)
        
        # INTERPRETATION
        print("─"*70)
        print("📖 HOW TO READ BOX PLOTS")
        print("─"*70)
        print("""
Box Plot Elements:
  • Box = Middle 50% of data (IQR: 25th to 75th percentile)
  • Line inside box = Median (50th percentile)
  • Whiskers = 1.5 × IQR from box edges
  • Dots outside = Outliers

What makes a good predictor:
  ✓ Clear SEPARATION between green and red boxes
  ✓ Different MEDIANS (center lines at different heights)
  ✓ Minimal OVERLAP between boxes

Patterns to look for:
  • Green box entirely above red → Retained have higher values
  • Green box entirely below red → Retained have lower values
  • Overlapping boxes → Feature alone may not discriminate well
  • Many outliers in one group → Subpopulations worth investigating
""")

DISTRIBUTION COMPARISON: Retained vs Churned (Box Plots)

📊 Showing top 6 features by effect size
   🟢 Green = Retained | 🔴 Red = Churned



──────────────────────────────────────────────────────────────────────
📖 HOW TO READ BOX PLOTS
──────────────────────────────────────────────────────────────────────

Box Plot Elements:
  • Box = Middle 50% of data (IQR: 25th to 75th percentile)
  • Line inside box = Median (50th percentile)
  • Whiskers = 1.5 × IQR from box edges
  • Dots outside = Outliers

What makes a good predictor:
  ✓ Clear SEPARATION between green and red boxes
  ✓ Different MEDIANS (center lines at different heights)
  ✓ Minimal OVERLAP between boxes

Patterns to look for:
  • Green box entirely above red → Retained have higher values
  • Green box entirely below red → Retained have lower values
  • Overlapping boxes → Feature alone may not discriminate well
  • Many outliers in one group → Subpopulations worth investigating



In [22]:
# Feature-Target Correlation Ranking
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir():
    print("="*70)
    print("FEATURE-TARGET CORRELATIONS (Entity-Level)")
    print("="*70)
    
    correlations = []
    for col in effect_feature_cols:
        if col != TARGET_COLUMN:
            corr = entity_df[[col, TARGET_COLUMN]].corr().iloc[0, 1]
            if not np.isnan(corr):
                correlations.append({"Feature": col, "Correlation": corr})
    
    if correlations:
        corr_df = pd.DataFrame(correlations).sort_values("Correlation", key=abs, ascending=False)
        
        fig = charts.bar_chart(
            corr_df["Feature"].head(12).tolist(),
            corr_df["Correlation"].head(12).tolist(),
            title=f"Feature Correlations with {TARGET_COLUMN}"
        )
        display_figure(fig)
        
        print("\n📊 Correlation Rankings:")
        print(f"{'Feature':<35} {'Correlation':>12} {'Strength':<15} {'Direction'}")
        print("-" * 75)
        
        for _, row in corr_df.head(10).iterrows():
            abs_corr = abs(row["Correlation"])
            if abs_corr >= 0.5:
                strength = "Strong"
            elif abs_corr >= 0.3:
                strength = "Moderate"
            elif abs_corr >= 0.1:
                strength = "Weak"
            else:
                strength = "Very weak"
            
            direction = "Positive" if row["Correlation"] > 0 else "Negative"
            print(f"{row['Feature'][:34]:<35} {row['Correlation']:>+12.3f} {strength:<15} {direction}")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 INTERPRETING CORRELATIONS WITH TARGET")
        print("─"*70)
        print("""
Correlation with binary target (retained=1, churned=0):

  Positive correlation (+): Higher values → more likely RETAINED
  Negative correlation (-): Higher values → more likely CHURNED

Strength guide:
  |r| > 0.5:  Strong - prioritize this feature
  |r| 0.3-0.5: Moderate - useful predictor
  |r| 0.1-0.3: Weak - may help in combination
  |r| < 0.1:  Very weak - limited predictive value

Note: Correlation captures LINEAR relationships only.
Non-linear relationships may have low correlation but still be predictive.
""")

FEATURE-TARGET CORRELATIONS (Entity-Level)



📊 Correlation Rankings:
Feature                              Correlation Strength        Direction
---------------------------------------------------------------------------
unsubscribed_sum                          +1.000 Strong          Positive
unsubscribed_std                          +0.893 Strong          Positive
tenure_days                               -0.775 Strong          Negative
unsubscribed_mean                         +0.539 Strong          Positive
opened_sum                                -0.395 Moderate        Negative
opened_std                                -0.388 Moderate        Negative
sent_date_count                           -0.368 Moderate        Negative
event_count                               -0.368 Moderate        Negative
send_hour_sum                             -0.363 Moderate        Negative
opened_mean                               -0.352 Moderate        Negative

──────────────────────────────────────────────────────────────────────
📖 INTERPRETI

In [23]:
# Scatter Plot Matrix for top entity-level features
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir() and len(effect_df) > 0:
    # Select top 4 features for scatter matrix
    top_scatter_features = effect_df.head(4)["feature"].tolist()
    
    if len(top_scatter_features) >= 2:
        scatter_data = entity_df[top_scatter_features].sample(min(1000, len(entity_df)))
        fig = charts.scatter_matrix(scatter_data, title="Scatter Plot Matrix (Top Entity-Level Features)")
        display_figure(fig)
        
        print("\n📈 Scatter Matrix Insights:")
        print("   • Look for clusters indicating natural segments")
        print("   • Diagonal patterns suggest correlated features")
        print("   • Curved patterns may benefit from polynomial features")


📈 Scatter Matrix Insights:
   • Look for clusters indicating natural segments
   • Diagonal patterns suggest correlated features
   • Curved patterns may benefit from polynomial features


## 1c.10 Sparkline Comparison: Retained vs Churned Trends

**📖 Why Sparklines for Cohort Comparison:**

Sparklines provide a compact side-by-side visualization of how metrics evolve differently for retained vs churned customers:

| Row | What It Shows | Look For |
|-----|--------------|----------|
| **Retained (Green)** | Weekly trend for customers who stayed | Stable or upward trends |
| **Churned (Red)** | Weekly trend for customers who left | Declining trends before churn |

**Reading the Sparklines:**
- Each column = one metric
- Top row = Retained customers (green)
- Bottom row = Churned customers (red)
- Compare shapes: divergent patterns = predictive signal

**Configuration:**
- Variables are auto-selected based on **effect size** (Cohen's d) - metrics that best differentiate retained from churned
- Override `SPARKLINE_COLUMNS` below to specify custom columns
- Target defaults to detected churn/retention column

In [24]:
# === SPARKLINE CONFIGURATION ===
# Override these to customize the sparkline comparison

# Target column for cohort split (default: auto-detected churn/retention column)
SPARKLINE_TARGET = TARGET_COLUMN  # Override: e.g., "churn_flag"

# Columns to visualize (default: auto-select based on effect size)
# Set to specific columns: ["col1", "col2", "col3"]
# Set to None for auto-selection
SPARKLINE_COLUMNS = None  # Override: e.g., ["revenue", "login_count", "support_tickets"]

# Number of columns to show if auto-selecting
SPARKLINE_MAX_COLS = 6

# === AUTO-SELECT BEST COLUMNS (by Effect Size / Cohen's d) ===
def select_sparkline_columns(df, numeric_cols, target_col, max_cols=6):
    """
    Select columns most likely to show differences between retained/churned.
    
    Selection Logic:
    - WITH target: Uses effect size (Cohen's d) to find metrics that best 
      differentiate retained vs churned customers
    - WITHOUT target: Uses variance to find most variable (interesting) metrics
    
    Returns columns sorted by discriminative power.
    """
    if target_col is None or target_col not in df.columns:
        # No target - select by variance (most variable = most interesting)
        variances = {col: df[col].var() for col in numeric_cols if col in df.columns}
        sorted_cols = sorted(variances.keys(), key=lambda x: variances[x], reverse=True)
        return sorted_cols[:max_cols]
    
    # With target - select by discrimination power (effect size proxy)
    scores = {}
    for col in numeric_cols:
        if col not in df.columns or col == target_col:
            continue
        try:
            group0 = df[df[target_col] == 0][col].dropna()
            group1 = df[df[target_col] == 1][col].dropna()
            if len(group0) > 0 and len(group1) > 0:
                # Cohen's d: standardized difference in means
                pooled_std = np.sqrt((group0.var() + group1.var()) / 2)
                if pooled_std > 0:
                    scores[col] = abs(group1.mean() - group0.mean()) / pooled_std
                else:
                    scores[col] = 0
        except:
            continue
    
    if scores:
        sorted_cols = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return sorted_cols[:max_cols]
    
    # Fallback to first N columns
    return [c for c in numeric_cols if c in df.columns][:max_cols]

# Determine columns to use
if SPARKLINE_COLUMNS is not None:
    sparkline_cols = [c for c in SPARKLINE_COLUMNS if c in df.columns]
    selection_method = "user-specified"
else:
    sparkline_cols = select_sparkline_columns(df, numeric_event_cols, SPARKLINE_TARGET, SPARKLINE_MAX_COLS)
    selection_method = "auto-selected by effect size (Cohen's d)" if SPARKLINE_TARGET else "auto-selected by variance"

print("="*70)
print("SPARKLINE VARIABLE SELECTION")
print("="*70)
print(f"\nTarget column: {SPARKLINE_TARGET or 'None (no cohort split)'}")
print(f"Selection method: {selection_method}")
print(f"\nSelected columns ({len(sparkline_cols)}):")
for i, col in enumerate(sparkline_cols, 1):
    print(f"   {i}. {col}")

if SPARKLINE_COLUMNS is None and SPARKLINE_TARGET:
    print("""
💡 Why these columns?
   Columns are ranked by EFFECT SIZE (Cohen's d), which measures how well 
   each metric separates retained from churned customers. Higher effect 
   size = better discrimination = more interesting to visualize.
   
   To override: Set SPARKLINE_COLUMNS = ["your", "columns", "here"]
""")

SPARKLINE VARIABLE SELECTION

Target column: unsubscribed_entity
Selection method: auto-selected by effect size (Cohen's d)

Selected columns (6):
   1. unsubscribed
   2. opened
   3. clicked
   4. time_to_open_hours
   5. send_hour
   6. bounced

💡 Why these columns?
   Columns are ranked by EFFECT SIZE (Cohen's d), which measures how well 
   each metric separates retained from churned customers. Higher effect 
   size = better discrimination = more interesting to visualize.

   To override: Set SPARKLINE_COLUMNS = ["your", "columns", "here"]



In [25]:
# Sparkline comparison: Retained vs Churned behavior over time
if ENTITY_COLUMN and sparkline_cols:
    # Prepare data with target labels
    if SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
        entity_target_map = df.groupby(ENTITY_COLUMN)[SPARKLINE_TARGET].first()
        df_with_target = df.merge(entity_target_map.reset_index().rename(columns={SPARKLINE_TARGET: '_target'}), on=ENTITY_COLUMN)
        has_target_split = True
        
        # Count for validation
        n_retained = (df_with_target['_target'] == 1).sum()
        n_churned = (df_with_target['_target'] == 0).sum()
        print(f"Data split: {n_retained:,} retained events, {n_churned:,} churned events")
    else:
        df_with_target = df.copy()
        df_with_target['_target'] = 1  # Treat all as one group
        has_target_split = False
    
    # Aggregate by week
    df_with_target['_week'] = df_with_target[TIME_COLUMN].dt.to_period('W').dt.start_time
    
    print("="*70)
    if has_target_split:
        print("SPARKLINE COMPARISON: Retained vs Churned Weekly Trends")
        print("="*70)
        print("\n┌─────────────────────────────────────────────────────────┐")
        print("│  🟢 TOP ROW = RETAINED (target=1) - customers who stayed │")
        print("│  🔴 BOTTOM ROW = CHURNED (target=0) - customers who left │")
        print("└─────────────────────────────────────────────────────────┘\n")
    else:
        print("SPARKLINE TRENDS: Weekly Metric Patterns")
        print("="*70)
        print("\n📊 No target column - showing overall trends\n")
    
    # Build sparkline data for each column
    sparkline_data = []
    
    for col in sparkline_cols:
        if has_target_split:
            retained_weekly = df_with_target[df_with_target['_target'] == 1].groupby('_week')[col].mean()
            churned_weekly = df_with_target[df_with_target['_target'] == 0].groupby('_week')[col].mean()
            all_weeks = sorted(set(retained_weekly.index) | set(churned_weekly.index))
            retained_values = [retained_weekly.get(w, np.nan) for w in all_weeks]
            churned_values = [churned_weekly.get(w, np.nan) for w in all_weeks]
        else:
            overall_weekly = df_with_target.groupby('_week')[col].mean()
            all_weeks = sorted(overall_weekly.index)
            retained_values = overall_weekly.tolist()
            churned_values = None
        
        sparkline_data.append({
            'name': col,
            'retained': retained_values,
            'churned': churned_values,
            'weeks': all_weeks
        })
    
    # Create sparkline grid with CLEAR ROW LABELS
    n_cols = len(sparkline_data)
    
    if has_target_split:
        fig = make_subplots(
            rows=2, cols=n_cols,
            row_titles=['🟢 RETAINED', '🔴 CHURNED'],
            subplot_titles=[d['name'][:15] for d in sparkline_data],
            vertical_spacing=0.2,
            horizontal_spacing=0.05
        )
    else:
        fig = make_subplots(
            rows=1, cols=n_cols,
            subplot_titles=[d['name'][:15] for d in sparkline_data],
            horizontal_spacing=0.05
        )
    
    for i, data in enumerate(sparkline_data):
        col_num = i + 1
        
        # Retained sparkline (top row)
        fig.add_trace(go.Scatter(
            y=data['retained'],
            mode='lines',
            line=dict(color='#2ca02c', width=2),
            fill='tozeroy',
            fillcolor='rgba(44, 160, 44, 0.3)',
            showlegend=False,
            hovertemplate=f"{data['name']}<br>Retained: %{{y:.2f}}<extra></extra>"
        ), row=1, col=col_num)
        
        # Churned sparkline (bottom row) - only if target split
        if has_target_split and data['churned'] is not None:
            fig.add_trace(go.Scatter(
                y=data['churned'],
                mode='lines',
                line=dict(color='#d62728', width=2),
                fill='tozeroy',
                fillcolor='rgba(214, 39, 40, 0.3)',
                showlegend=False,
                hovertemplate=f"{data['name']}<br>Churned: %{{y:.2f}}<extra></extra>"
            ), row=2, col=col_num)
    
    # Clean up axes for sparkline appearance
    fig.update_xaxes(showticklabels=False, showgrid=False)
    fig.update_yaxes(showticklabels=False, showgrid=False)
    
    fig.update_layout(
        title=dict(
            text="Weekly Metric Trends: Compare TOP ROW (Retained) vs BOTTOM ROW (Churned)",
            font=dict(size=14)
        ),
        height=350 if has_target_split else 180,
        template='plotly_white',
        margin=dict(t=100, b=30, l=80, r=20)
    )
    display_figure(fig)
    
    # Detailed divergence analysis with interpretation
    if has_target_split:
        print("\n" + "─"*70)
        print("📊 DIVERGENCE ANALYSIS: Which metrics differ between groups?")
        print("─"*70)
        
        divergent_metrics = []
        for data in sparkline_data:
            if len(data['retained']) >= 4 and data['churned'] is not None and len(data['churned']) >= 4:
                # Compare first half to second half
                ret_early = np.nanmean(data['retained'][:len(data['retained'])//2])
                ret_late = np.nanmean(data['retained'][len(data['retained'])//2:])
                churn_early = np.nanmean(data['churned'][:len(data['churned'])//2])
                churn_late = np.nanmean(data['churned'][len(data['churned'])//2:])
                
                retained_trend = ret_late - ret_early
                churned_trend = churn_late - churn_early
                
                r_dir = "↑ rising" if retained_trend > 0.01 else "↓ falling" if retained_trend < -0.01 else "→ flat"
                c_dir = "↑ rising" if churned_trend > 0.01 else "↓ falling" if churned_trend < -0.01 else "→ flat"
                
                is_divergent = (retained_trend > 0) != (churned_trend > 0) and abs(retained_trend - churned_trend) > 0.01
                divergence = "⚠️ DIVERGENT" if is_divergent else ""
                
                if is_divergent:
                    divergent_metrics.append(data['name'])
                
                print(f"   {data['name'][:20]}: Retained {r_dir}, Churned {c_dir} {divergence}")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 HOW TO INTERPRET THE SPARKLINES")
        print("─"*70)
        print("""
Compare the SHAPE of green (top) vs red (bottom) rows:

Pattern Recognition:
  🟢 Top sparkline RISING + 🔴 Bottom sparkline FALLING
     → STRONG signal! Metric clearly differentiates groups
     → Priority feature candidate
     
  🟢 Top sparkline STABLE + 🔴 Bottom sparkline FALLING  
     → Good signal! Churned customers show declining behavior
     → Create velocity/momentum features
     
  Both rows have SIMILAR shapes
     → Weak signal for this metric alone
     → May still be useful in combination
     
  🔴 Bottom row more VOLATILE than 🟢 top row
     → Churned customers have erratic behavior
     → Consider variance-based features
""")
        
        if divergent_metrics:
            print(f"⭐ DIVERGENT METRICS (high priority): {', '.join(divergent_metrics)}")
            print("   These show opposite trends for retained vs churned.")
            print("   RECOMMENDED: Create trend/velocity features for these.")
else:
    print("Entity column required for sparkline comparison")
    if not sparkline_cols:
        print("No numeric columns available for sparklines")

Data split: 31,267 retained events, 56,715 churned events
SPARKLINE COMPARISON: Retained vs Churned Weekly Trends

┌─────────────────────────────────────────────────────────┐
│  🟢 TOP ROW = RETAINED (target=1) - customers who stayed │
│  🔴 BOTTOM ROW = CHURNED (target=0) - customers who left │
└─────────────────────────────────────────────────────────┘




──────────────────────────────────────────────────────────────────────
📊 DIVERGENCE ANALYSIS: Which metrics differ between groups?
──────────────────────────────────────────────────────────────────────
   unsubscribed: Retained ↑ rising, Churned → flat ⚠️ DIVERGENT
   opened: Retained ↓ falling, Churned → flat 
   clicked: Retained ↓ falling, Churned → flat ⚠️ DIVERGENT
   time_to_open_hours: Retained ↓ falling, Churned ↓ falling 
   send_hour: Retained ↓ falling, Churned ↓ falling 
   bounced: Retained → flat, Churned → flat 

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET THE SPARKLINES
──────────────────────────────────────────────────────────────────────

Compare the SHAPE of green (top) vs red (bottom) rows:

Pattern Recognition:
  🟢 Top sparkline RISING + 🔴 Bottom sparkline FALLING
     → STRONG signal! Metric clearly differentiates groups
     → Priority feature candidate

  🟢 Top sparkline STABLE + 🔴 Bottom sparkline FALLING  
     → 

In [26]:
# Use ChartBuilder sparkline_grid for monthly cohort trends (alternative visualization)
if ENTITY_COLUMN and sparkline_cols and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    # Monthly aggregation for cleaner sparklines
    df_with_target['_month'] = df_with_target[TIME_COLUMN].dt.to_period('M').dt.start_time
    
    # Build monthly trends for key metrics, split by target
    cols_to_plot = sparkline_cols[:4]  # Limit to 4 for grid layout
    monthly_retained = df_with_target[df_with_target['_target'] == 1].groupby('_month')[cols_to_plot].mean()
    monthly_churned = df_with_target[df_with_target['_target'] == 0].groupby('_month')[cols_to_plot].mean()
    
    # Create separate sparkline grids for retained and churned
    print("\n" + "="*70)
    print("MONTHLY SPARKLINE GRIDS (ChartBuilder)")
    print("="*70)
    print("""
Alternative view using monthly aggregation (smoother trends).
Compare the shapes between Retained (green dots) and Churned (red dots).
""")
    
    retained_series = {col[:20]: monthly_retained[col].dropna().tolist() for col in cols_to_plot if col in monthly_retained.columns}
    if retained_series:
        fig_retained = charts.sparkline_grid(retained_series, columns=2, sparkline_height=80)
        fig_retained.update_layout(title="🟢 RETAINED Customers - Monthly Trends")
        display_figure(fig_retained)
    
    churned_series = {col[:20]: monthly_churned[col].dropna().tolist() for col in cols_to_plot if col in monthly_churned.columns}
    if churned_series:
        fig_churned = charts.sparkline_grid(churned_series, columns=2, sparkline_height=80)
        fig_churned.update_layout(title="🔴 CHURNED Customers - Monthly Trends")
        display_figure(fig_churned)
    
    # Side-by-side summary
    print("\n📊 Monthly Trend Summary:")
    print(f"{'Metric':<20} {'Retained Trend':>15} {'Churned Trend':>15} {'Signal':<12}")
    print("-" * 65)
    
    for col in cols_to_plot:
        if col in monthly_retained.columns and col in monthly_churned.columns:
            ret_vals = monthly_retained[col].dropna().tolist()
            churn_vals = monthly_churned[col].dropna().tolist()
            
            if len(ret_vals) >= 2 and len(churn_vals) >= 2:
                ret_trend = ret_vals[-1] - ret_vals[0]
                churn_trend = churn_vals[-1] - churn_vals[0]
                
                ret_dir = "↑" if ret_trend > 0 else "↓" if ret_trend < 0 else "→"
                churn_dir = "↑" if churn_trend > 0 else "↓" if churn_trend < 0 else "→"
                
                signal = "⚠️ DIVERGENT" if (ret_trend > 0) != (churn_trend > 0) else ""
                print(f"{col[:19]:<20} {ret_dir:>15} {churn_dir:>15} {signal:<12}")
elif not SPARKLINE_TARGET:
    print("💡 Set SPARKLINE_TARGET to enable retained vs churned comparison")


MONTHLY SPARKLINE GRIDS (ChartBuilder)

Alternative view using monthly aggregation (smoother trends).
Compare the shapes between Retained (green dots) and Churned (red dots).




📊 Monthly Trend Summary:
Metric                Retained Trend   Churned Trend Signal      
-----------------------------------------------------------------
unsubscribed                       ↑               → ⚠️ DIVERGENT
opened                             ↓               ↓             
clicked                            ↓               ↓             
time_to_open_hours                 ↓               ↓             


## 1c.11 Velocity & Acceleration Analysis

**📖 Why Velocity and Acceleration Matter:**

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Velocity** | Δ(event_count) / Δt | Rate of change - is activity speeding up or slowing down? |
| **Acceleration** | Δ(velocity) / Δt | Change in rate - is the slowdown accelerating? |

- Positive velocity: Activity increasing
- Negative velocity: Activity decreasing (churn signal)
- Positive acceleration: Speeding up (engagement growing)
- Negative acceleration: Slowing down (disengagement)

In [27]:
# Velocity & Acceleration Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and sparkline_cols and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    # Initialize the analyzer
    feature_analyzer = TemporalFeatureAnalyzer(
        time_column=TIME_COLUMN,
        entity_column=ENTITY_COLUMN
    )
    
    velocity_cols = sparkline_cols[:4]
    
    print("="*70)
    print("VELOCITY & ACCELERATION ANALYSIS (using TemporalFeatureAnalyzer)")
    print("="*70)
    
    # Calculate velocity using framework
    velocity_results = feature_analyzer.calculate_velocity(df, velocity_cols, window_days=7)
    acceleration_results = feature_analyzer.calculate_acceleration(df, velocity_cols, window_days=7)
    
    # Compare cohorts
    cohort_comparison = feature_analyzer.compare_cohorts(df, velocity_cols, SPARKLINE_TARGET)
    
    # Build data for visualization
    chart_data = {}
    for col in velocity_cols:
        # Get weekly data for retained and churned
        entity_target = df.groupby(ENTITY_COLUMN)[SPARKLINE_TARGET].first()
        df_temp = df.merge(entity_target.reset_index().rename(columns={SPARKLINE_TARGET: '_target'}), on=ENTITY_COLUMN)
        df_temp['_week'] = df_temp[TIME_COLUMN].dt.to_period('W').dt.start_time
        
        retained_weekly = df_temp[df_temp['_target'] == 1].groupby('_week')[col].mean()
        churned_weekly = df_temp[df_temp['_target'] == 0].groupby('_week')[col].mean()
        
        # Calculate derivatives
        ret_vel = retained_weekly.diff().dropna()
        churn_vel = churned_weekly.diff().dropna()
        ret_acc = ret_vel.diff().dropna()
        churn_acc = churn_vel.diff().dropna()
        
        chart_data[col] = {
            "retained": retained_weekly.tolist(),
            "churned": churned_weekly.tolist(),
            "velocity_retained": ret_vel.tolist(),
            "velocity_churned": churn_vel.tolist(),
            "accel_retained": ret_acc.tolist(),
            "accel_churned": churn_acc.tolist(),
        }
    
    # Use ChartBuilder for visualization
    fig = charts.velocity_acceleration_chart(
        chart_data,
        title="Value → Velocity → Acceleration (🟢 Retained vs 🔴 Churned)"
    )
    display_figure(fig)
    
    # Display framework results
    print("\n📊 Velocity Analysis Results:")
    for col, result in velocity_results.items():
        print(f"   {col}: {result.trend_direction} (mean velocity: {result.mean_velocity:.4f})")
    
    print("\n📊 Cohort Comparison (Retained vs Churned):")
    divergent_cols = []
    for col, comparison in cohort_comparison.items():
        ret_vel = comparison["retained"].velocity
        churn_vel = comparison["churned"].velocity
        is_divergent = (ret_vel > 0) != (churn_vel > 0)
        if is_divergent:
            divergent_cols.append(col)
        diff = "⚠️ DIVERGENT" if is_divergent else ""
        print(f"   {col}: Retained vel={ret_vel:+.4f}, Churned vel={churn_vel:+.4f} {diff}")
    
    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("📖 HOW TO INTERPRET THESE RESULTS")
    print("─"*70)
    print("""
Velocity shows the RATE OF CHANGE of each metric over time:
  • Positive velocity = metric is INCREASING
  • Negative velocity = metric is DECREASING
  • Zero velocity = metric is STABLE

Key patterns to look for:
  1. DIVERGENT velocities (retained ↑ while churned ↓) = STRONG signal
     → These variables directly differentiate behavior
     → High priority for feature engineering
     
  2. Same direction but different magnitude = Moderate signal
     → Both groups trending same way, but at different rates
     → May indicate timing differences
     
  3. Both near zero = Weak signal for this metric
     → Stable behavior in both groups
     → Less useful for churn prediction
""")
    
    if divergent_cols:
        print(f"⭐ TOP CANDIDATES from velocity: {', '.join(divergent_cols)}")
        print("   These show opposite trends for retained vs churned customers.")
        print("   RECOMMENDED: Create velocity features for these variables.")
    
    print("\n💡 Feature Engineering from Velocity Analysis:")
    print("   • {col}_velocity_7d = (current - 7d_ago) / 7d_ago")
    print("   • {col}_momentum = mean_7d / mean_30d")
    print("   • {col}_acceleration = velocity_now - velocity_7d_ago")

VELOCITY & ACCELERATION ANALYSIS (using TemporalFeatureAnalyzer)



📊 Velocity Analysis Results:
   unsubscribed: stable (mean velocity: 0.0000)
   opened: stable (mean velocity: -0.0000)
   clicked: stable (mean velocity: 0.0000)
   time_to_open_hours: stable (mean velocity: -0.0004)

📊 Cohort Comparison (Retained vs Churned):
   unsubscribed: Retained vel=+0.0003, Churned vel=+0.0000 ⚠️ DIVERGENT
   opened: Retained vel=-0.0001, Churned vel=+0.0000 ⚠️ DIVERGENT
   clicked: Retained vel=-0.0000, Churned vel=+0.0000 ⚠️ DIVERGENT
   time_to_open_hours: Retained vel=-0.0092, Churned vel=+0.0005 ⚠️ DIVERGENT

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET THESE RESULTS
──────────────────────────────────────────────────────────────────────

Velocity shows the RATE OF CHANGE of each metric over time:
  • Positive velocity = metric is INCREASING
  • Negative velocity = metric is DECREASING
  • Zero velocity = metric is STABLE

Key patterns to look for:
  1. DIVERGENT velocities (retained ↑ while churned ↓) = STRONG

## 1c.12 Lag Correlation Analysis

**📖 Why Lag Correlations Matter:**

Lag correlations show how a metric relates to itself over time:
- High lag-1 correlation: Today's value predicts tomorrow's
- Decaying correlations: Effect diminishes over time
- Periodic spikes: Seasonality (e.g., spike at lag 7 = weekly pattern)

In [28]:
# Lag Correlation Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and sparkline_cols:
    lag_cols = sparkline_cols[:6]
    max_lag = 14
    
    print("="*70)
    print("LAG CORRELATION ANALYSIS (using TemporalFeatureAnalyzer)")
    print("="*70)
    
    # Use framework analyzer (initialized above or create new one)
    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(
            time_column=TIME_COLUMN,
            entity_column=ENTITY_COLUMN
        )
    
    # Calculate lag correlations using framework
    lag_results = feature_analyzer.calculate_lag_correlations(df, lag_cols, max_lag=max_lag)
    
    # Build data for heatmap
    lag_corr_data = {col: result.correlations for col, result in lag_results.items()}
    
    # Use ChartBuilder for visualization
    fig = charts.lag_correlation_heatmap(
        lag_corr_data,
        max_lag=max_lag,
        title="Autocorrelation by Lag (days) - Informs Lag Feature Selection"
    )
    display_figure(fig)
    
    # Display framework results
    print("\n📊 Best Lag per Variable:")
    strong_lag_vars = []
    weekly_pattern_vars = []
    for col, result in lag_results.items():
        best_lag_info = f"best lag={result.best_lag}d (r={result.best_correlation:.2f})"
        weekly_info = " [Weekly pattern]" if result.has_weekly_pattern else ""
        
        if result.best_correlation > 0.3:
            strong_lag_vars.append((col, result.best_lag, result.best_correlation))
        if result.has_weekly_pattern:
            weekly_pattern_vars.append(col)
            
        print(f"   {col[:25]}: {best_lag_info}{weekly_info}")
    
    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("📖 HOW TO INTERPRET LAG CORRELATIONS")
    print("─"*70)
    print("""
Lag correlation shows how a variable relates to its PAST values:

Reading the heatmap:
  • Darker colors = STRONGER correlation at that lag
  • Row = variable being analyzed
  • Column = lag in days (1-14)

What the patterns mean:
  1. HIGH correlation at lag-1 (r > 0.5)
     → Strong "memory" - today's value predicts tomorrow's
     → Use: {col}_lag_1d, {col}_diff_1d features
     
  2. HIGH correlation at lag-7 (weekly peak)
     → Clear weekly seasonality
     → Use: {col}_lag_7d, day_of_week encoding
     
  3. SLOWLY decaying correlations
     → Mean-reverting behavior
     → Use: Rolling averages work well
     
  4. LOW correlations everywhere (< 0.2)
     → Random/noisy variable
     → Lag features less useful here
""")
    
    if strong_lag_vars:
        print("⭐ STRONG LAG CANDIDATES:")
        for col, lag, corr in strong_lag_vars:
            print(f"   • {col}: lag {lag}d (r={corr:.2f}) → Create {col}_lag_{lag}d feature")
    
    if weekly_pattern_vars:
        print(f"\n📅 WEEKLY PATTERN DETECTED in: {', '.join(weekly_pattern_vars)}")
        print("   RECOMMENDED: Add day_of_week features + lag_7d features")

LAG CORRELATION ANALYSIS (using TemporalFeatureAnalyzer)



📊 Best Lag per Variable:
   unsubscribed: best lag=2d (r=0.05)
   opened: best lag=6d (r=0.04)
   clicked: best lag=9d (r=0.03)
   time_to_open_hours: best lag=6d (r=0.04)
   send_hour: best lag=6d (r=0.04)
   bounced: best lag=13d (r=0.05)

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET LAG CORRELATIONS
──────────────────────────────────────────────────────────────────────

Lag correlation shows how a variable relates to its PAST values:

Reading the heatmap:
  • Darker colors = STRONGER correlation at that lag
  • Row = variable being analyzed
  • Column = lag in days (1-14)

What the patterns mean:
  1. HIGH correlation at lag-1 (r > 0.5)
     → Strong "memory" - today's value predicts tomorrow's
     → Use: {col}_lag_1d, {col}_diff_1d features

  2. HIGH correlation at lag-7 (weekly peak)
     → Clear weekly seasonality
     → Use: {col}_lag_7d, day_of_week encoding

  3. SLOWLY decaying correlations
     → Mean-reverting behavior
     → 

## 1c.13 Predictive Power Analysis (IV & KS Statistics)

**📖 Information Value (IV) and KS Statistics:**

These metrics measure how well time-window features predict the target:

| Metric | Range | Interpretation |
|--------|-------|----------------|
| **IV** | 0-1+ | <0.02=weak, 0.02-0.1=medium, 0.1-0.3=strong, >0.3=very strong |
| **KS** | 0-1 | Maximum separation between target classes |

In [29]:
# Predictive Power Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    print("="*70)
    print("PREDICTIVE POWER ANALYSIS (using TemporalFeatureAnalyzer)")
    print("="*70)
    
    # Use framework analyzer
    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(
            time_column=TIME_COLUMN,
            entity_column=ENTITY_COLUMN
        )
    
    # Calculate predictive power using framework
    power_results = feature_analyzer.calculate_predictive_power(
        df, sparkline_cols, SPARKLINE_TARGET
    )
    
    # Build data for visualization
    iv_values = {col: result.information_value for col, result in power_results.items()}
    ks_values = {col: result.ks_statistic for col, result in power_results.items()}
    
    # Use ChartBuilder for visualization
    fig = charts.predictive_power_chart(
        iv_values,
        ks_values,
        title="Variable Predictive Power Rankings"
    )
    display_figure(fig)
    
    # Display framework results
    print("\n📊 Predictive Power Rankings (from framework):")
    print(f"{'Variable':<25} {'IV':>8} {'Strength':<12} {'KS':>8} {'p-value':>10}")
    print("-" * 70)
    
    sorted_results = sorted(power_results.items(), key=lambda x: x[1].information_value, reverse=True)
    strong_iv_vars = []
    strong_ks_vars = []
    suspicious_vars = []
    
    for col, result in sorted_results:
        sig = "***" if result.ks_pvalue < 0.001 else "**" if result.ks_pvalue < 0.01 else "*" if result.ks_pvalue < 0.05 else ""
        print(f"{col[:24]:<25} {result.information_value:>8.3f} {result.iv_interpretation:<12} {result.ks_statistic:>8.3f} {result.ks_pvalue:>9.4f} {sig}")
        
        if result.information_value > 0.3:
            strong_iv_vars.append(col)
        if result.ks_statistic > 0.4:
            strong_ks_vars.append(col)
        if result.iv_interpretation == "suspicious":
            suspicious_vars.append(col)
    
    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("📖 HOW TO INTERPRET IV AND KS STATISTICS")
    print("─"*70)
    print("""
Information Value (IV) - measures how well a variable separates classes:
  • IV < 0.02:   Very weak - not useful alone
  • IV 0.02-0.1: Weak - some signal
  • IV 0.1-0.3:  Medium - good predictor
  • IV 0.3-0.5:  Strong - excellent predictor
  • IV > 0.5:    SUSPICIOUS - check for data leakage!

KS Statistic - measures distribution separation between retained/churned:
  • KS < 0.2:    Heavy overlap - weak discriminator
  • KS 0.2-0.4:  Moderate separation
  • KS > 0.4:    Clear separation - strong discriminator

Significance stars: *** p<0.001, ** p<0.01, * p<0.05

Combined interpretation:
  • HIGH IV + HIGH KS + Significant → TOP FEATURE CANDIDATE
  • HIGH IV but LOW KS → May need binning/transformation
  • LOW IV but HIGH KS → May have outliers driving KS
""")
    
    # Warnings and recommendations
    if suspicious_vars:
        print(f"⚠️ WARNING: Suspicious IV for: {', '.join(suspicious_vars)}")
        print("   IV > 0.5 may indicate DATA LEAKAGE - investigate these carefully!")
        print("   Check if these variables are derived from the target or future data.")
    
    top_vars = [col for col, r in sorted_results if r.information_value > 0.1 or r.ks_statistic > 0.3]
    if top_vars:
        print(f"\n⭐ TOP FEATURE ENGINEERING CANDIDATES: {', '.join(top_vars[:5])}")
        print("   These variables show strong predictive power for the target.")
        print("   RECOMMENDED: Prioritize creating derived features from these.")
else:
    print("Target column required for predictive power analysis")

PREDICTIVE POWER ANALYSIS (using TemporalFeatureAnalyzer)



📊 Predictive Power Rankings (from framework):
Variable                        IV Strength           KS    p-value
----------------------------------------------------------------------
unsubscribed                 8.037 suspicious      1.000    0.0000 ***
opened                       0.775 suspicious      0.350    0.0000 ***
clicked                      0.583 suspicious      0.342    0.0000 ***
bounced                      0.237 medium          0.127    0.0000 ***
send_hour                    0.196 medium          0.090    0.0000 ***
time_to_open_hours           0.112 medium          0.087    0.0000 ***

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET IV AND KS STATISTICS
──────────────────────────────────────────────────────────────────────

Information Value (IV) - measures how well a variable separates classes:
  • IV < 0.02:   Very weak - not useful alone
  • IV 0.02-0.1: Weak - some signal
  • IV 0.1-0.3:  Medium - good predictor
  • IV 0

## 1c.14 Momentum Analysis (Window Ratios)

**📖 Momentum Features:**

Momentum captures behavioral changes by comparing time windows:
- recent_7d / recent_30d > 1: Activity increasing
- recent_7d / recent_30d < 1: Activity decreasing
- Large swings indicate volatility

In [30]:
# Momentum Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    print("="*70)
    print("MOMENTUM ANALYSIS (using TemporalFeatureAnalyzer)")
    print("="*70)
    
    # Use framework analyzer
    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(
            time_column=TIME_COLUMN,
            entity_column=ENTITY_COLUMN
        )
    
    momentum_cols = sparkline_cols[:4]
    
    # Calculate momentum using framework
    momentum_7_30 = feature_analyzer.calculate_momentum(df, momentum_cols, short_window=7, long_window=30)
    momentum_30_90 = feature_analyzer.calculate_momentum(df, momentum_cols, short_window=30, long_window=90)
    
    # For cohort comparison, split by target and calculate separately
    entity_target = df.groupby(ENTITY_COLUMN)[SPARKLINE_TARGET].first()
    df_temp = df.merge(entity_target.reset_index().rename(columns={SPARKLINE_TARGET: '_target'}), on=ENTITY_COLUMN)
    
    retained_df = df_temp[df_temp['_target'] == 1]
    churned_df = df_temp[df_temp['_target'] == 0]
    
    retained_mom_7_30 = feature_analyzer.calculate_momentum(retained_df, momentum_cols, 7, 30)
    churned_mom_7_30 = feature_analyzer.calculate_momentum(churned_df, momentum_cols, 7, 30)
    retained_mom_30_90 = feature_analyzer.calculate_momentum(retained_df, momentum_cols, 30, 90)
    churned_mom_30_90 = feature_analyzer.calculate_momentum(churned_df, momentum_cols, 30, 90)
    
    # Build data for visualization
    momentum_data = {}
    divergent_momentum_cols = []
    for col in momentum_cols:
        ret_7_30 = retained_mom_7_30[col].mean_momentum if col in retained_mom_7_30 else 1
        churn_7_30 = churned_mom_7_30[col].mean_momentum if col in churned_mom_7_30 else 1
        momentum_data[col] = {
            "retained_7_30": ret_7_30,
            "churned_7_30": churn_7_30,
            "retained_30_90": retained_mom_30_90[col].mean_momentum if col in retained_mom_30_90 else 1,
            "churned_30_90": churned_mom_30_90[col].mean_momentum if col in churned_mom_30_90 else 1,
        }
        if abs(ret_7_30 - churn_7_30) > 0.1:
            divergent_momentum_cols.append(col)
    
    # Use ChartBuilder for visualization
    fig = charts.momentum_comparison_chart(
        momentum_data,
        title="Momentum by Retention Status (ratio > 1 = increasing, < 1 = declining)"
    )
    display_figure(fig)
    
    # Display framework results
    print("\n📊 Momentum Results (from framework):")
    print(f"{'Variable':<20} {'7d/30d Ret':>12} {'7d/30d Churn':>12} {'Diff':>10}")
    print("-" * 60)
    
    for col in momentum_cols:
        ret = momentum_data[col]["retained_7_30"]
        churn = momentum_data[col]["churned_7_30"]
        diff = ret - churn
        signal = "⚠️" if abs(diff) > 0.1 else ""
        print(f"{col[:19]:<20} {ret:>12.3f} {churn:>12.3f} {diff:>+10.3f} {signal}")
    
    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("📖 HOW TO INTERPRET MOMENTUM ANALYSIS")
    print("─"*70)
    print("""
Momentum = mean(recent_window) / mean(longer_window)

Interpreting momentum values:
  • Momentum > 1.2:  Recent activity HIGHER than historical → Increasing
  • Momentum 0.8-1.2: Activity is STABLE
  • Momentum 0.5-0.8: Recent activity LOWER than historical → Declining
  • Momentum < 0.5:   Sharp decline → High churn risk

What to look for in the chart:
  1. RETAINED momentum > CHURNED momentum
     → Retained customers have improving engagement
     → This pattern is expected and validates the metric
     
  2. Large GAP between retained and churned
     → Strong differentiator - good feature candidate
     → Create: {col}_momentum_7_30 = mean_7d / mean_30d
     
  3. Both groups have similar momentum
     → Metric doesn't differentiate well on its own
     → May still be useful in combination with other features

Window pair choices:
  • 7d/30d:  Captures short-term changes (reacts quickly)
  • 30d/90d: Captures medium-term trends (more stable)
  • Both together can capture different dynamics
""")
    
    if divergent_momentum_cols:
        print(f"⭐ HIGH-SIGNAL MOMENTUM FEATURES: {', '.join(divergent_momentum_cols)}")
        print("   These show meaningful differences between retained and churned.")
        print("   RECOMMENDED: Create momentum features for these variables:")
        for col in divergent_momentum_cols:
            print(f"   • {col}_momentum_7_30 = mean({col}, 7d) / mean({col}, 30d)")

MOMENTUM ANALYSIS (using TemporalFeatureAnalyzer)



📊 Momentum Results (from framework):
Variable               7d/30d Ret 7d/30d Churn       Diff
------------------------------------------------------------
unsubscribed                1.250        1.000     +0.250 ⚠️
opened                      1.000        1.015     -0.015 
clicked                     1.000        0.800     +0.200 ⚠️
time_to_open_hours          1.000        1.015     -0.015 

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET MOMENTUM ANALYSIS
──────────────────────────────────────────────────────────────────────

Momentum = mean(recent_window) / mean(longer_window)

Interpreting momentum values:
  • Momentum > 1.2:  Recent activity HIGHER than historical → Increasing
  • Momentum 0.8-1.2: Activity is STABLE
  • Momentum 0.5-0.8: Recent activity LOWER than historical → Declining
  • Momentum < 0.5:   Sharp decline → High churn risk

What to look for in the chart:
  1. RETAINED momentum > CHURNED momentum
     → Retained customer

## 1c.15 Feature Engineering Summary

**📋 Consolidated recommendations from all analyses above:**

In [31]:
# Feature Engineering Recommendations using TemporalFeatureAnalyzer
print("="*80)
print("FEATURE ENGINEERING RECOMMENDATIONS (from TemporalFeatureAnalyzer)")
print("="*80)

if 'feature_analyzer' in dir() and SPARKLINE_TARGET:
    # Get automated recommendations from framework
    recommendations = feature_analyzer.get_feature_recommendations(
        df,
        value_columns=sparkline_cols,
        target_column=SPARKLINE_TARGET
    )
    
    if recommendations:
        print(f"\n🎯 Framework Generated {len(recommendations)} Feature Recommendations:\n")
        
        # Group by type
        by_type = {}
        for rec in recommendations:
            if rec.feature_type.value not in by_type:
                by_type[rec.feature_type.value] = []
            by_type[rec.feature_type.value].append(rec)
        
        for feat_type, recs in by_type.items():
            print(f"\n{'─'*60}")
            print(f"📌 {feat_type.upper()} FEATURES")
            print(f"{'─'*60}")
            for rec in recs[:5]:  # Top 5 per type
                print(f"   Feature: {rec.feature_name}")
                print(f"   Formula: {rec.formula}")
                print(f"   Reason:  {rec.rationale}")
                print()
    else:
        print("\n   No significant feature recommendations found.")

# Also show manual summary
print("\n" + "="*80)
print("QUICK REFERENCE: Feature Engineering Patterns")
print("="*80)

print("""
┌─────────────────┬────────────────────────────────────────────────────┐
│ Feature Type    │ Formula Example                                     │
├─────────────────┼────────────────────────────────────────────────────┤
│ Velocity        │ (value_now - value_7d_ago) / value_7d_ago          │
│ Acceleration    │ velocity_now - velocity_7d_ago                      │
│ Momentum        │ mean_7d / mean_30d                                  │
│ Lag             │ df[col].shift(N)                                    │
│ Rolling Mean    │ df[col].rolling(7).mean()                          │
│ Rolling Std     │ df[col].rolling(30).std()                          │
│ Ratio           │ sum_30d / sum_all_time                              │
└─────────────────┴────────────────────────────────────────────────────┘
""")

FEATURE ENGINEERING RECOMMENDATIONS (from TemporalFeatureAnalyzer)

🎯 Framework Generated 9 Feature Recommendations:


────────────────────────────────────────────────────────────
📌 ROLLING FEATURES
────────────────────────────────────────────────────────────
   Feature: unsubscribed_mean
   Formula: df.groupby(entity)['unsubscribed'].transform('mean')
   Reason:  IV=8.037 (suspicious)

   Feature: opened_mean
   Formula: df.groupby(entity)['opened'].transform('mean')
   Reason:  IV=0.775 (suspicious)

   Feature: clicked_mean
   Formula: df.groupby(entity)['clicked'].transform('mean')
   Reason:  IV=0.583 (suspicious)

   Feature: bounced_mean
   Formula: df.groupby(entity)['bounced'].transform('mean')
   Reason:  IV=0.237 (medium)

   Feature: send_hour_mean
   Formula: df.groupby(entity)['send_hour'].transform('mean')
   Reason:  IV=0.196 (medium)


────────────────────────────────────────────────────────────
📌 MOMENTUM FEATURES
──────────────────────────────────────────────────────

In [32]:
print("\n" + "="*70)
print("TEMPORAL PATTERN SUMMARY")
print("="*70)

# Trend summary
print(f"\n\U0001f4c8 TREND:")
print(f"   Direction: {trend_result.direction.value}")
print(f"   Confidence: {trend_result.confidence}")
if trend_result.direction == TrendDirection.INCREASING:
    print("   \U0001f4a1 Consider: Time-based features may improve with recency weighting")
elif trend_result.direction == TrendDirection.DECREASING:
    print("   \U0001f4a1 Consider: Investigate cause of decline; recent data may be more valuable")

# Seasonality summary
print(f"\n\U0001f501 SEASONALITY:")
if seasonality_results:
    for sr in seasonality_results[:2]:
        period_name = sr.period_name or f"{sr.period}-day"
        print(f"   {period_name.title()} pattern (strength: {sr.strength:.2f})")
    print("   \U0001f4a1 Consider: Add day-of-week or month features; use seasonal adjustments")
else:
    print("   No significant seasonality detected")

# Recency summary
if ENTITY_COLUMN:
    print(f"\n\u23f1\ufe0f  RECENCY:")
    print(f"   Median recency: {recency_result.median_recency_days:.0f} days")
    if recency_result.target_correlation is not None:
        print(f"   Target correlation: {recency_result.target_correlation:.3f}")
        if abs(recency_result.target_correlation) > 0.3:
            print("   \U0001f4a1 Consider: Recency is a strong predictor - prioritize in feature engineering")


TEMPORAL PATTERN SUMMARY

📈 TREND:
   Direction: stable
   Confidence: high

🔁 SEASONALITY:
   Weekly pattern (strength: 0.59)
   14-Day pattern (strength: 0.59)
   💡 Consider: Add day-of-week or month features; use seasonal adjustments

⏱️  RECENCY:
   Median recency: 316 days
   Target correlation: 0.779
   💡 Consider: Recency is a strong predictor - prioritize in feature engineering


In [33]:
# Feature engineering recommendations based on patterns
print("\n" + "="*70)
print("RECOMMENDED TEMPORAL FEATURES")
print("="*70)

print("\n\U0001f6e0\ufe0f Based on detected patterns, consider these features:\n")

print("1. RECENCY FEATURES:")
print("   - days_since_last_event")
print("   - log_days_since_last_event (if right-skewed)")
print("   - recency_bucket (categorical: 0-7d, 8-30d, etc.)")

if seasonality_results:
    weekly = any(6 <= sr.period <= 8 for sr in seasonality_results)
    monthly = any(28 <= sr.period <= 32 for sr in seasonality_results)
    
    print("\n2. SEASONALITY FEATURES:")
    if weekly:
        print("   - is_weekend (binary)")
        print("   - day_of_week_sin, day_of_week_cos (cyclical encoding)")
    if monthly:
        print("   - day_of_month")
        print("   - is_month_start, is_month_end")

print("\n3. TREND-ADJUSTED FEATURES:")
if trend_result.direction in [TrendDirection.INCREASING, TrendDirection.DECREASING]:
    print("   - event_count_recent_vs_overall (ratio)")
    print("   - activity_trend_direction (for each entity)")
else:
    print("   - Standard time-window aggregations should work well")

print("\n4. COHORT FEATURES:")
print("   - cohort_month (categorical or ordinal)")
print("   - tenure_days (days since first event)")


RECOMMENDED TEMPORAL FEATURES

🛠️ Based on detected patterns, consider these features:

1. RECENCY FEATURES:
   - days_since_last_event
   - log_days_since_last_event (if right-skewed)
   - recency_bucket (categorical: 0-7d, 8-30d, etc.)

2. SEASONALITY FEATURES:
   - is_weekend (binary)
   - day_of_week_sin, day_of_week_cos (cyclical encoding)
   - day_of_month
   - is_month_start, is_month_end

3. TREND-ADJUSTED FEATURES:
   - Standard time-window aggregations should work well

4. COHORT FEATURES:
   - cohort_month (categorical or ordinal)
   - tenure_days (days since first event)


## 1c.16 Save Pattern Analysis Results

In [34]:
# Store pattern analysis results in findings
pattern_summary = {
    "trend": {
        "direction": trend_result.direction.value,
        "strength": trend_result.strength,
        "confidence": trend_result.confidence,
    },
    "seasonality": [
        {"period": sr.period, "name": sr.period_name, "strength": sr.strength}
        for sr in seasonality_results
    ],
}

if ENTITY_COLUMN:
    pattern_summary["recency"] = {
        "median_days": recency_result.median_recency_days,
        "target_correlation": recency_result.target_correlation,
    }

# Add to findings notes
if not findings.metadata:
    findings.metadata = {}
findings.metadata["temporal_patterns"] = pattern_summary

findings.save(FINDINGS_PATH)
print(f"Pattern analysis saved to: {FINDINGS_PATH}")
print(f"\nSummary: {pattern_summary}")

Pattern analysis saved to: ../experiments/findings/customer_emails_31faba_findings.yaml

Summary: {'trend': {'direction': 'stable', 'strength': np.float64(0.5737213915046017), 'confidence': 'high'}, 'seasonality': [{'period': 7, 'name': 'weekly', 'strength': np.float64(0.5926612176155779)}, {'period': 14, 'name': None, 'strength': np.float64(0.5851522443106908)}, {'period': 30, 'name': 'monthly', 'strength': np.float64(0.5801820802499106)}], 'recency': {'median_days': 316.0, 'target_correlation': np.float64(0.77948901484635)}}


---

## Summary: What We Learned

In this notebook, we analyzed temporal patterns:

1. **Trend Detection** - Identified long-term direction in data
2. **Seasonality** - Found periodic patterns (weekly, monthly)
3. **Cohort Analysis** - Compared behavior by entity join date
4. **Recency Analysis** - Measured how recent activity relates to outcomes
5. **Feature Recommendations** - Generated feature engineering suggestions

## Pattern Summary

| Pattern | Status | Recommendation |
|---------|--------|----------------|
| Trend | Check findings | Detrend if strong |
| Seasonality | Check findings | Add cyclical features |
| Cohort Effects | Check findings | Add cohort indicators |
| Recency Effects | Check findings | Prioritize recent windows |

---

## Next Steps

**Complete the Event Bronze Track:**
- **01d_event_aggregation.ipynb** - Aggregate events to entity-level (produces new dataset)

After 01d produces the aggregated dataset, continue with:
- **02_column_deep_dive.ipynb** - Profile aggregated feature distributions
- **03_quality_assessment.ipynb** - Quality checks on aggregated data
- **04_relationship_analysis.ipynb** - Feature correlations and relationships

The aggregated data from 01d becomes the input for the Entity Bronze Track.