# Chapter 1c: Temporal Pattern Analysis (Event Bronze Track)

**Purpose:** Discover temporal patterns in event-level data that inform feature engineering and model design.

**When to use this notebook:**
- After completing 01a and 01b (temporal deep dive and quality checks)
- Your dataset is EVENT_LEVEL granularity
- You want to understand time-based patterns before aggregation

**What you'll learn:**
- How to detect long-term trends in your data
- How to identify seasonality patterns (weekly, monthly)
- How cohort analysis reveals customer lifecycle patterns
- How recency relates to target outcomes

**Pattern Categories:**

| Pattern | Description | Feature Engineering Impact |
|---------|-------------|---------------------------|
| **Trend** | Long-term direction (up/down) | Detrend features, add trend slope |
| **Seasonality** | Periodic patterns (weekly, monthly) | Add cyclical encodings, seasonal indicators |
| **Cohort Effects** | Behavior varies by join date | Add cohort features, stratify models |
| **Recency Effects** | Recent activity predicts outcomes | Prioritize recent time windows |

## 1c.1 Load Findings and Data

In [1]:
from customer_retention.analysis.auto_explorer import ExplorationFindings
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType, DatasetGranularity
from customer_retention.stages.profiling import (
    TemporalPatternAnalyzer, TemporalPatternAnalysis,
    TrendResult, TrendDirection, SeasonalityResult, RecencyResult,
    TemporalFeatureAnalyzer, VelocityResult, MomentumResult,
    LagCorrelationResult, PredictivePowerResult, FeatureRecommendation,
    CategoricalTargetAnalyzer
)
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from scipy import stats

In [2]:
# === CONFIGURATION ===
from pathlib import Path

FINDINGS_DIR = Path("../experiments/findings")

findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"Loaded findings for {findings.column_count} columns")

Using: ../experiments/findings/customer_emails_408768_findings.yaml
Loaded findings for 16 columns


In [3]:
# Get time series configuration
ts_meta = findings.time_series_metadata
ENTITY_COLUMN = ts_meta.entity_column if ts_meta else None
TIME_COLUMN = ts_meta.time_column if ts_meta else None

print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")

# Note: Target column configuration is handled in section 1c.2 below
# This allows for event-level to entity-level aggregation when needed

Entity column: customer_id
Time column: sent_date


In [4]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

# Load source data (prefers snapshots over raw files)
df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
charts = ChartBuilder()

# Parse time column
df[TIME_COLUMN] = pd.to_datetime(df[TIME_COLUMN])

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {data_source}")

Loaded 74,842 rows x 16 columns
Data source: snapshot


## 1c.2 Target Column Configuration

**📖 Event-Level vs Entity-Level Targets:**

In time series data, targets can be defined at different granularities:

| Target Level | Example | Usage |
|--------------|---------|-------|
| **Event-level** | "Did this email get clicked?" | Exists in raw data |
| **Entity-level** | "Did this customer churn?" | Need to join from entity table |

If your target is entity-level, you may need to join it or configure it manually.
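
To make the distinction concrete, here is a minimal sketch of a `"max"` aggregation on a toy event log. The data is made up and the column names only mirror this notebook; this is not the `TargetLevelAnalyzer` implementation, just an illustration of the idea that an entity counts as positive if any of its events is positive.

```python
import pandas as pd

# Toy event log; column names mirror this notebook but the data is made up.
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b", "c"],
    "target":      [0,   0,   1,   0,   0,   1],
})

# "max" aggregation: an entity is positive if ANY of its events is positive.
# transform() broadcasts the per-entity value back onto every event row.
events["target_entity"] = events.groupby("customer_id")["target"].transform("max")

# One row per entity, for checking the entity-level class balance.
entity_target = events.groupby("customer_id")["target_entity"].first()
entity_rate = entity_target.mean()
```

Aggregating this way can shift the class balance sharply, because a single positive event marks the whole entity as positive.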

In [5]:
# === TARGET CONFIGURATION ===
# Override target column if needed (None = auto-detect, "DEFER_TO_MULTI_DATASET" = skip)
TARGET_COLUMN_OVERRIDE = None
TARGET_AGGREGATION = "max"  # Options: "max", "mean", "sum", "last", "first"

# Detect and analyze target
from customer_retention.stages.profiling import (
    TargetLevelAnalyzer, TargetColumnDetector, AggregationMethod
)

detector = TargetColumnDetector()
target_col, method = detector.detect(findings, df, override=TARGET_COLUMN_OVERRIDE)
detector.print_detection(target_col, method)

TARGET_COLUMN = target_col
if TARGET_COLUMN and TARGET_COLUMN in df.columns and ENTITY_COLUMN:
    analyzer = TargetLevelAnalyzer()
    agg_method = AggregationMethod(TARGET_AGGREGATION)
    df, result = analyzer.aggregate_to_entity(df, TARGET_COLUMN, ENTITY_COLUMN, TIME_COLUMN, agg_method)
    analyzer.print_analysis(result)
    
    # Update TARGET_COLUMN to entity-level version if aggregated
    if result.entity_target_column:
        ORIGINAL_TARGET = TARGET_COLUMN
        TARGET_COLUMN = result.entity_target_column

print("\n" + "─"*70)
print("Final configuration:")
print(f"   ENTITY_COLUMN: {ENTITY_COLUMN}")
print(f"   TIME_COLUMN: {TIME_COLUMN}")
print(f"   TARGET_COLUMN: {TARGET_COLUMN}")
print("─"*70)



🔍 Auto-detected target: target
TARGET LEVEL ANALYSIS

Column: target
Level: EVENT_LEVEL

⚠️  EVENT-LEVEL TARGET DETECTED
   38.8% of entities have varying target values

   Event-level distribution:
      target=0: 72,869 events (97.4%)
      target=1: 1,973 events (2.6%)

   Suggested aggregation: max

   Aggregation applied: max
   Entity target column: target_entity

   Entity-level distribution (after aggregation):
      Retained (target_entity=0): 3,025 entities (60.5%)
      Churned (target_entity=1): 1,973 entities (39.5%)


──────────────────────────────────────────────────────────────────────
Final configuration:
   ENTITY_COLUMN: customer_id
   TIME_COLUMN: sent_date
   TARGET_COLUMN: target_entity
──────────────────────────────────────────────────────────────────────

## 1c.3 Aggregation Window Configuration

**⚙️ Central Configuration for All Pattern Analysis**

Windows are loaded from 01a findings and used consistently throughout this notebook for:
- Velocity analysis (shortest window)
- Momentum analysis (window pairs)
- Rolling statistics
- Feature engineering recommendations

Override below if needed for your specific analysis.
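
As a rough illustration of how window labels could map to the derived settings listed above, the sketch below parses labels such as `"180d"` into days, takes the shortest finite window for velocity, and pairs each finite window with the next longer one for momentum. The helper names `parse_window_days` and `momentum_pairs` are hypothetical, and the pairing rule is an assumption; `PatternAnalysisConfig` may derive its settings differently.

```python
from typing import List, Optional, Tuple

def parse_window_days(window: str) -> Optional[int]:
    """Convert a window label such as '180d' to days; None for non-numeric labels."""
    if window.endswith("d") and window[:-1].isdigit():
        return int(window[:-1])
    return None  # e.g. 'all_time' has no fixed length

def momentum_pairs(windows: List[str]) -> List[Tuple[int, int]]:
    """Pair each finite window with the next longer one as (short, long)."""
    days = sorted(d for d in map(parse_window_days, windows) if d is not None)
    return list(zip(days, days[1:]))

windows = ["180d", "365d", "all_time"]
finite_days = [d for d in map(parse_window_days, windows) if d is not None]
velocity_window_days = min(finite_days)  # shortest finite window
pairs = momentum_pairs(windows)
```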


In [6]:
# === AGGREGATION WINDOW CONFIGURATION ===
# These windows were recommended by 01a based on your data's temporal coverage.
# They are used consistently for velocity, momentum, rolling stats, and feature engineering.

# Override: Set to a list like ["7d", "30d", "90d"] to use custom windows
# Set to None to use 01a recommendations
WINDOW_OVERRIDE = None

from customer_retention.stages.profiling import PatternAnalysisConfig

pattern_config = PatternAnalysisConfig.from_findings(
    findings,
    target_column=TARGET_COLUMN,
    window_override=WINDOW_OVERRIDE,
)

# Display configuration
print("="*70)
print("AGGREGATION WINDOW CONFIGURATION")
print("="*70)
print(f"\nSource: {'Manual override' if WINDOW_OVERRIDE else '01a findings (recommended)'}")
print(f"\nWindows: {pattern_config.aggregation_windows}")
print(f"\nDerived settings used throughout this notebook:")
print(f"   • Velocity/Rolling window: {pattern_config.velocity_window_days} days")
print(f"   • Momentum pairs: {pattern_config.get_momentum_pairs()}")
print(f"\n💡 To override, set WINDOW_OVERRIDE = ['7d', '30d', '90d'] above and re-run")


AGGREGATION WINDOW CONFIGURATION

Source: 01a findings (recommended)

Windows: ['180d', '365d', 'all_time']

Derived settings used throughout this notebook:
   • Velocity/Rolling window: 180 days
   • Momentum pairs: [(180, 365)]

💡 To override, set WINDOW_OVERRIDE = ['7d', '30d', '90d'] above and re-run


## 1c.4 Configure Value Column for Analysis

Temporal patterns are analyzed on aggregated metrics. Choose the primary metric to analyze.

In [7]:
# Find numeric columns that could be aggregated
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [c for c in numeric_cols if c not in [ENTITY_COLUMN]]

print("Available numeric columns for pattern analysis:")
for col in numeric_cols:
    print(f"   - {col}")

# Default: use event count (most common for pattern detection)
# Change this to analyze patterns in a specific metric
VALUE_COLUMN = "_event_count"  # Special: will aggregate event counts

Available numeric columns for pattern analysis:
   - opened
   - clicked
   - send_hour
   - target
   - bounced
   - time_to_open_hours
   - target_entity


In [8]:
# Prepare data for pattern analysis
# Aggregate to daily level for trend/seasonality detection

if VALUE_COLUMN == "_event_count":
    # Aggregate event counts by day
    daily_data = df.groupby(df[TIME_COLUMN].dt.date).size().reset_index()
    daily_data.columns = [TIME_COLUMN, "value"]
    daily_data[TIME_COLUMN] = pd.to_datetime(daily_data[TIME_COLUMN])
    analysis_col = "value"
    print("Analyzing: Daily event counts")
else:
    # Aggregate specific column by day
    daily_data = df.groupby(df[TIME_COLUMN].dt.date)[VALUE_COLUMN].sum().reset_index()
    daily_data.columns = [TIME_COLUMN, "value"]
    daily_data[TIME_COLUMN] = pd.to_datetime(daily_data[TIME_COLUMN])
    analysis_col = "value"
    print(f"Analyzing: Daily sum of {VALUE_COLUMN}")

print(f"\nDaily data points: {len(daily_data)}")
print(f"Date range: {daily_data[TIME_COLUMN].min()} to {daily_data[TIME_COLUMN].max()}")

Analyzing: Daily event counts

Daily data points: 2826
Date range: 2015-01-01 00:00:00 to 2022-09-26 00:00:00


## 1c.5 Trend Detection

**📖 Understanding Trends:**
- **Increasing**: Metric growing over time (e.g., expanding customer base)
- **Decreasing**: Metric shrinking (e.g., declining engagement)
- **Stationary**: No significant trend (stable business)

**Impact on ML:**
- Strong trends can leak future information into training if splits are random rather than time-based
- Consider detrending or adding trend as explicit feature
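
The two options above (detrending, or adding the trend as a feature) can be sketched with an ordinary least-squares line. This is a minimal example on synthetic data using `numpy.polyfit`; it does not claim to reproduce what `TemporalPatternAnalyzer.detect_trend` computes internally.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a known downward slope plus noise (illustrative only).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=365, freq="D")
values = 100 - 0.05 * np.arange(365) + rng.normal(0, 2, size=365)
daily = pd.DataFrame({"date": dates, "value": values})

# Regress value on elapsed days; the fitted slope is the change per day.
x = (daily["date"] - daily["date"].min()).dt.days.to_numpy()
y = daily["value"].to_numpy()
slope, intercept = np.polyfit(x, y, deg=1)

# R^2 = 1 - SS_res / SS_tot measures how much variance the line explains.
fitted = slope * x + intercept
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# Detrended residuals can be used as a feature in place of the raw value.
daily["value_detrended"] = y - fitted
```

In a modeling pipeline the line would normally be fitted on the training period only and then applied to later data, so that the detrending step itself does not see the future.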

In [9]:
# Run trend detection
analyzer = TemporalPatternAnalyzer(time_column=TIME_COLUMN)
trend_result = analyzer.detect_trend(daily_data, value_column=analysis_col)

print("\U0001f4c8 TREND ANALYSIS RESULTS")
print("="*50)

direction_emoji = {
    TrendDirection.INCREASING: "\U0001f4c8",
    TrendDirection.DECREASING: "\U0001f4c9",
    TrendDirection.STABLE: "\u27a1\ufe0f",
    TrendDirection.UNKNOWN: "\u2753",
}

print(f"\n   Direction: {direction_emoji.get(trend_result.direction, '')} {trend_result.direction.value.upper()}")
print(f"   Strength (R\u00b2): {trend_result.strength:.3f}")
print(f"   Confidence: {trend_result.confidence.upper()}")

if trend_result.slope is not None:
    print(f"   Slope: {trend_result.slope:.4f} per day")
    # Interpret slope
    mean_val = daily_data[analysis_col].mean()
    daily_pct_change = (trend_result.slope / mean_val) * 100 if mean_val != 0 else 0
    print(f"   Daily % change: {daily_pct_change:+.3f}%")

if trend_result.p_value is not None:
    print(f"   P-value: {trend_result.p_value:.4f}")

📈 TREND ANALYSIS RESULTS

   Direction: ➡️ STABLE
   Strength (R²): 0.465
   Confidence: MEDIUM
   Slope: -0.0061 per day
   Daily % change: -0.023%
   P-value: 0.0000


In [10]:
# Visualize trend
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=daily_data[TIME_COLUMN], y=daily_data[analysis_col],
    mode="lines", name="Daily Values", line=dict(color="steelblue", width=1), opacity=0.7
))

# Trend line
if trend_result.slope is not None:
    x_numeric = (daily_data[TIME_COLUMN] - daily_data[TIME_COLUMN].min()).dt.days
    y_trend = trend_result.slope * x_numeric + (daily_data[analysis_col].mean() - trend_result.slope * x_numeric.mean())
    trend_color = {TrendDirection.INCREASING: "green", TrendDirection.DECREASING: "red"}.get(trend_result.direction, "gray")
    fig.add_trace(go.Scatter(
        x=daily_data[TIME_COLUMN], y=y_trend, mode="lines",
        name=f"Trend ({trend_result.direction.value})", line=dict(color=trend_color, width=3, dash="dash")
    ))

# Rolling average using configured window
rolling_avg = daily_data[analysis_col].rolling(window=pattern_config.rolling_window, center=True).mean()
fig.add_trace(go.Scatter(
    x=daily_data[TIME_COLUMN], y=rolling_avg, mode="lines",
    name=f"{pattern_config.rolling_window}-day Rolling Avg", line=dict(color="orange", width=2)
))

fig.update_layout(
    title=f"Trend Analysis: {trend_result.direction.value.title()} (R²={trend_result.strength:.2f})",
    xaxis_title="Date", yaxis_title="Value", template="plotly_white", height=400,
    legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
display_figure(fig)


## 1c.6 Seasonality Detection

**📖 Understanding Seasonality:**
- **Weekly** (period=7): Higher activity on certain days
- **Monthly** (period~30): End-of-month patterns, billing cycles
- **Quarterly** (period~90): Business cycles, seasonal products

**Impact on ML:**
- Add day-of-week, month features
- Consider seasonal decomposition
- Use cyclical encodings (sin/cos) for neural networks
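
A cyclical encoding maps a periodic value onto the unit circle so that the end of a cycle sits next to its start (Sunday next to Monday, December next to January). The sketch below is a minimal example on a made-up date range; the `sent_date` column name mirrors this notebook, and the output column names are illustrative.

```python
import numpy as np
import pandas as pd

# Two weeks of dates starting on a Monday.
dates = pd.DataFrame({"sent_date": pd.date_range("2022-01-03", periods=14, freq="D")})

# Day of week as an integer: Monday=0 ... Sunday=6.
dow = dates["sent_date"].dt.dayofweek

# Map the 7-day cycle onto the unit circle so Sunday (6) sits next to Monday (0).
dates["dow_sin"] = np.sin(2 * np.pi * dow / 7)
dates["dow_cos"] = np.cos(2 * np.pi * dow / 7)

# Same idea for month of year (1..12 shifted to 0..11).
month = dates["sent_date"].dt.month - 1
dates["month_sin"] = np.sin(2 * np.pi * month / 12)
dates["month_cos"] = np.cos(2 * np.pi * month / 12)

# A simple seasonal indicator alongside the cyclical encoding.
dates["is_weekend"] = (dow >= 5).astype(int)
```

Both the sine and cosine columns are needed: either one alone maps two different days to the same value.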

In [11]:
# Run seasonality detection
seasonality_results = analyzer.detect_seasonality(daily_data, value_column=analysis_col)

print("\U0001f501 SEASONALITY ANALYSIS RESULTS")
print("="*50)

if seasonality_results:
    print(f"\n   Detected {len(seasonality_results)} seasonal pattern(s):\n")
    
    for i, sr in enumerate(seasonality_results, 1):
        strength_label = "Strong" if sr.strength > 0.5 else "Moderate" if sr.strength > 0.3 else "Weak"
        period_name = sr.period_name or f"{sr.period}-day"
        print(f"   {i}. {period_name.title()} Pattern")
        print(f"      Period: {sr.period} days")
        print(f"      Strength: {sr.strength:.3f} ({strength_label})")
        print()
else:
    print("\n   No significant seasonal patterns detected.")
    print("   This could mean:")
    print("   - Data is truly non-seasonal")
    print("   - Not enough data points for detection")
    print("   - High noise obscuring patterns")

🔁 SEASONALITY ANALYSIS RESULTS

   Detected 3 seasonal pattern(s):

   1. Weekly Pattern
      Period: 7 days
      Strength: 0.484 (Moderate)

   2. 21-Day Pattern
      Period: 21 days
      Strength: 0.479 (Moderate)

   3. 14-Day Pattern
      Period: 14 days
      Strength: 0.474 (Moderate)



In [12]:
# Visualize day-of-week pattern
daily_data["day_of_week"] = daily_data[TIME_COLUMN].dt.day_name()
dow_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
daily_data["day_of_week"] = pd.Categorical(daily_data["day_of_week"], categories=dow_order, ordered=True)

dow_stats = daily_data.groupby("day_of_week")[analysis_col].agg(["mean", "std"]).reset_index()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=dow_stats["day_of_week"],
    y=dow_stats["mean"],
    error_y=dict(type="data", array=dow_stats["std"]),
    name="Mean",
    marker_color="steelblue"
))

# Mark weekends
for i, day in enumerate(dow_stats["day_of_week"]):
    if day in ["Saturday", "Sunday"]:
        fig.add_vrect(
            x0=i-0.4, x1=i+0.4,
            fillcolor="lightgray", opacity=0.3,
            layer="below", line_width=0
        )

fig.update_layout(
    title="Day of Week Pattern (gray = weekend)",
    xaxis_title="Day of Week",
    yaxis_title="Average Value",
    template="plotly_white",
    height=400
)
display_figure(fig)





In [13]:
# Monthly pattern analysis
daily_data["month"] = daily_data[TIME_COLUMN].dt.month_name()
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Only include months present in data
present_months = [m for m in month_order if m in daily_data["month"].values]
daily_data["month"] = pd.Categorical(daily_data["month"], categories=present_months, ordered=True)

monthly_stats = daily_data.groupby("month")[analysis_col].agg(["mean", "std"]).reset_index()

if len(monthly_stats) > 1:
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=monthly_stats["month"],
        y=monthly_stats["mean"],
        error_y=dict(type="data", array=monthly_stats["std"]),
        name="Mean",
        marker_color="mediumpurple"
    ))
    
    # Add overall mean line
    overall_mean = daily_data[analysis_col].mean()
    fig.add_hline(y=overall_mean, line_dash="dash", line_color="red",
                  annotation_text=f"Overall Mean: {overall_mean:.1f}",
                  annotation_position="top right")
    
    fig.update_layout(
        title="Monthly Pattern",
        xaxis_title="Month",
        yaxis_title="Average Value",
        template="plotly_white",
        height=400
    )
    display_figure(fig)
else:
    print("Not enough months of data for monthly pattern analysis")





## 1c.7 Cohort Analysis

**📖 Understanding Cohorts:**
- Group entities by when they first appeared (signup cohort)
- Compare behavior across cohorts
- Identify if acquisition quality changed over time
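
Cohort membership can also be carried onto each event row as a feature. The following is a small sketch on made-up data; `tenure_days` and `cohort_sizes` are illustrative names, not outputs of `analyze_cohorts`.

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c"],
    "sent_date": pd.to_datetime(
        ["2015-01-10", "2015-03-02", "2015-02-14", "2015-02-20", "2015-02-27"]
    ),
})

# Cohort = calendar month of each entity's first event, broadcast to every row.
first_event = events.groupby("customer_id")["sent_date"].transform("min")
events["cohort"] = first_event.dt.to_period("M").astype(str)

# Days since the entity's first event, measured at each event.
events["tenure_days"] = (events["sent_date"] - first_event).dt.days

# Entities per cohort, usable as a feature or for stratified sampling.
cohort_sizes = events.groupby("cohort")["customer_id"].nunique()
```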

In [14]:
# Cohort analysis requires entity column
if ENTITY_COLUMN:
    # Define cohort as the month of first event
    first_events = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].min().reset_index()
    first_events.columns = [ENTITY_COLUMN, "first_event"]
    first_events["cohort"] = first_events["first_event"].dt.to_period("M")
    
    # Merge cohort info back to main data
    df_cohort = df.merge(first_events[[ENTITY_COLUMN, "cohort"]], on=ENTITY_COLUMN)
    
    # Cohort-level analysis
    cohort_result = analyzer.analyze_cohorts(
        df,
        entity_column=ENTITY_COLUMN,
        cohort_column=TIME_COLUMN,  # Will use min event date as cohort
        target_column=TARGET_COLUMN,
        period="M"
    )
    
    print("\U0001f465 COHORT ANALYSIS RESULTS")
    print("="*50)
    print(f"\n   Cohorts identified: {len(cohort_result)}")
    
    if len(cohort_result) > 0:
        display_table(cohort_result.head(12))
else:
    print("Entity column not set - skipping cohort analysis")

👥 COHORT ANALYSIS RESULTS

   Cohorts identified: 93


| cohort | entity_count | first_event | last_event | retention_rate |
|--------|--------------|-------------|------------|----------------|
| 2015-01 | 971 | 2015-01-01 | 2015-01-31 | 0.55418 |
| 2015-02 | 894 | 2015-02-01 | 2015-02-28 | 0.501859 |
| 2015-03 | 960 | 2015-03-01 | 2015-03-31 | 0.505677 |
| 2015-04 | 913 | 2015-04-01 | 2015-04-30 | 0.481414 |
| 2015-05 | 954 | 2015-05-01 | 2015-05-31 | 0.510426 |
| 2015-06 | 882 | 2015-06-01 | 2015-06-30 | 0.476847 |
| 2015-07 | 941 | 2015-07-01 | 2015-07-31 | 0.498195 |
| 2015-08 | 940 | 2015-08-01 | 2015-08-31 | 0.460562 |
| 2015-09 | 915 | 2015-09-01 | 2015-09-30 | 0.491211 |
| 2015-10 | 960 | 2015-10-01 | 2015-10-31 | 0.463063 |


In [15]:
# Visualize cohort sizes and retention
if ENTITY_COLUMN and len(cohort_result) > 0:
    cohort_result_sorted = cohort_result.sort_values("cohort")
    
    # Compute retention rate if not present but target exists
    if "retention_rate" not in cohort_result.columns and TARGET_COLUMN and TARGET_COLUMN in df.columns:
        # Calculate retention rate per cohort from raw data
        entity_cohort = first_events[[ENTITY_COLUMN, "cohort"]]
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
        cohort_target = entity_cohort.merge(entity_target, on=ENTITY_COLUMN)
        
        retention_by_cohort = cohort_target.groupby("cohort")[TARGET_COLUMN].mean().reset_index()
        retention_by_cohort.columns = ["cohort", "retention_rate"]
        
        cohort_result_sorted = cohort_result_sorted.merge(retention_by_cohort, on="cohort", how="left")
    
    # Decide layout based on available data
    has_retention = "retention_rate" in cohort_result_sorted.columns and cohort_result_sorted["retention_rate"].notna().any()
    
    if has_retention:
        fig = make_subplots(rows=1, cols=2, subplot_titles=("Cohort Sizes", "Retention Rate by Cohort"))
        
        # Cohort sizes
        fig.add_trace(
            go.Bar(
                x=cohort_result_sorted["cohort"].astype(str),
                y=cohort_result_sorted["entity_count"],
                name="Entities",
                marker_color="steelblue"
            ),
            row=1, col=1
        )
        
        # Retention rate
        fig.add_trace(
            go.Scatter(
                x=cohort_result_sorted["cohort"].astype(str),
                y=cohort_result_sorted["retention_rate"] * 100,
                mode="lines+markers",
                name="Retention %",
                line=dict(color="green", width=2),
                marker=dict(size=8)
            ),
            row=1, col=2
        )
        fig.update_yaxes(title_text="Entity Count", row=1, col=1)
        fig.update_yaxes(title_text="Retention %", row=1, col=2)
        
        fig.update_layout(
            title="Cohort Overview",
            template="plotly_white",
            height=400,
            showlegend=False
        )
    else:
        # Single chart - cohort sizes only
        fig = go.Figure()
        fig.add_trace(
            go.Bar(
                x=cohort_result_sorted["cohort"].astype(str),
                y=cohort_result_sorted["entity_count"],
                name="Entities",
                marker_color="steelblue",
                text=cohort_result_sorted["entity_count"],
                textposition="outside"
            )
        )
        fig.update_layout(
            title="Cohort Sizes (No target column for retention analysis)",
            xaxis_title="Cohort",
            yaxis_title="Entity Count",
            template="plotly_white",
            height=400
        )
        print("💡 Tip: Set TARGET_COLUMN to see retention rates by cohort")
    
    fig.update_xaxes(tickangle=45)
    display_figure(fig)

## 1c.8 Recency Analysis

**📖 Understanding Recency:**
- Time since last event for each entity
- Often strongly correlated with churn/retention
- Key feature for predictive models
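
As a sketch of how recency could be turned into model features, the example below derives a raw recency, a log-transformed version, a bucketed version, and a binary activity flag from a toy event log. The data is made up, and the bucket edges and the 30-day threshold are illustrative choices.

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c"],
    "sent_date": pd.to_datetime(
        ["2022-09-01", "2022-09-26", "2022-06-01", "2021-01-15", "2022-09-20"]
    ),
})

# Reference date: latest event in the data (a prediction cutoff would be used in production).
reference_date = events["sent_date"].max()

last_event = events.groupby("customer_id")["sent_date"].max()
recency = (reference_date - last_event).dt.days.rename("days_since_last_event").to_frame()

# log1p handles recency = 0 and compresses the long right tail.
recency["log_recency"] = np.log1p(recency["days_since_last_event"])

# include_lowest=True keeps entities whose last event falls on the reference date.
recency["recency_bucket"] = pd.cut(
    recency["days_since_last_event"],
    bins=[0, 7, 30, 90, float("inf")],
    labels=["0-7d", "8-30d", "31-90d", ">90d"],
    include_lowest=True,
)
recency["is_recent_active"] = (recency["days_since_last_event"] < 30).astype(int)
```

When the target is derived from the same event stream, recency measured against the latest date in the data can encode the outcome itself, so it is worth checking that the reference date precedes the period in which the target is observed.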

In [16]:
# Run recency analysis
if ENTITY_COLUMN:
    recency_result = analyzer.analyze_recency(
        df,
        entity_column=ENTITY_COLUMN,
        target_column=TARGET_COLUMN,
        reference_date=df[TIME_COLUMN].max()  # Use latest date in data as reference
    )
    
    print("\u23f1\ufe0f  RECENCY ANALYSIS RESULTS")
    print("="*50)
    print(f"\n   Reference date: {df[TIME_COLUMN].max()}")
    print(f"\n   Recency Statistics (days since last event):")
    print(f"      Mean: {recency_result.avg_recency_days:.1f}")
    print(f"      Median: {recency_result.median_recency_days:.1f}")
    print(f"      Min: {recency_result.min_recency_days:.1f}")
    print(f"      Max: {recency_result.max_recency_days:.1f}")
    
    if recency_result.target_correlation is not None:
        corr = recency_result.target_correlation
        corr_strength = "Strong" if abs(corr) > 0.5 else "Moderate" if abs(corr) > 0.3 else "Weak"
        corr_direction = "negative" if corr < 0 else "positive"
        
        print(f"\n   \U0001f3af Target Correlation:")
        print(f"      Correlation: {corr:.3f}")
        print(f"      Interpretation: {corr_strength} {corr_direction} correlation")
        
        if corr < -0.3:
            print(f"      \U0001f4a1 Insight: Lower recency (recent activity) associates with higher target")
            print(f"         This suggests recency is a strong predictor - use in features!")
        elif corr > 0.3:
            print(f"      \U0001f4a1 Insight: Higher recency (longer since last event) associates with higher target")
else:
    print("Entity column not set - skipping recency analysis")

⏱️  RECENCY ANALYSIS RESULTS

   Reference date: 2022-09-26 00:00:00

   Recency Statistics (days since last event):
      Mean: 665.8
      Median: 246.5
      Min: 0.0
      Max: 2824.0

   🎯 Target Correlation:
      Correlation: 0.772
      Interpretation: Strong positive correlation
      💡 Insight: Higher recency (longer since last event) associates with higher target


In [17]:
# Visualize recency distribution - COMPARING RETAINED VS CHURNED
if ENTITY_COLUMN:
    # Compute recency for each entity
    reference_date = df[TIME_COLUMN].max()
    entity_last = df.groupby(ENTITY_COLUMN)[TIME_COLUMN].max().reset_index()
    entity_last["recency_days"] = (reference_date - entity_last[TIME_COLUMN]).dt.days
    
    # Add target for comparison
    if TARGET_COLUMN and TARGET_COLUMN in df.columns:
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
        entity_recency = entity_last.merge(entity_target, on=ENTITY_COLUMN)
        has_target = True
    else:
        entity_recency = entity_last.copy()
        has_target = False
    
    # Cap for visualization
    cap = entity_recency["recency_days"].quantile(0.99)
    entity_recency_capped = entity_recency[entity_recency["recency_days"] <= cap]
    
    if has_target:
        # SIDE-BY-SIDE COMPARISON: Retained vs Churned
        print("="*70)
        print("RECENCY DISTRIBUTION: Retained vs Churned Comparison")
        print("="*70)
        
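        # NOTE: this split assumes TARGET_COLUMN == 1 means retained. The 1c.2 output labels
        # target_entity=1 as churned, so confirm the label convention before reading the charts.
        retained_recency = entity_recency_capped[entity_recency_capped[TARGET_COLUMN] == 1]["recency_days"]
        churned_recency = entity_recency_capped[entity_recency_capped[TARGET_COLUMN] == 0]["recency_days"]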
        
        fig = make_subplots(
            rows=1, cols=2,
            subplot_titles=[
                f"🟢 RETAINED (n={len(retained_recency):,})",
                f"🔴 CHURNED (n={len(churned_recency):,})"
            ],
            horizontal_spacing=0.1
        )
        
        # Retained histogram
        fig.add_trace(go.Histogram(
            x=retained_recency,
            nbinsx=30,
            name="Retained",
            marker_color="rgba(46, 204, 113, 0.7)",
            showlegend=False
        ), row=1, col=1)
        
        # Churned histogram
        fig.add_trace(go.Histogram(
            x=churned_recency,
            nbinsx=30,
            name="Churned",
            marker_color="rgba(231, 76, 60, 0.7)",
            showlegend=False
        ), row=1, col=2)
        
        # Add median lines
        fig.add_vline(x=retained_recency.median(), line_dash="solid", line_color="green",
                      annotation_text=f"Med: {retained_recency.median():.0f}d", row=1, col=1)
        fig.add_vline(x=churned_recency.median(), line_dash="solid", line_color="red",
                      annotation_text=f"Med: {churned_recency.median():.0f}d", row=1, col=2)
        
        fig.update_layout(
            title="Recency Distribution: Compare Shape and Median Between Groups",
            template="plotly_white",
            height=400
        )
        fig.update_xaxes(title_text="Days Since Last Event", row=1, col=1)
        fig.update_xaxes(title_text="Days Since Last Event", row=1, col=2)
        fig.update_yaxes(title_text="Number of Entities", row=1, col=1)
        
        display_figure(fig)
        
        # Summary statistics
        print("\n📊 Recency Statistics by Retention Status:")
        print("-" * 60)
        print(f"{'Metric':<20} {'Retained':>15} {'Churned':>15} {'Difference':>15}")
        print("-" * 60)
        
        metrics = [
            ("Mean", retained_recency.mean(), churned_recency.mean()),
            ("Median", retained_recency.median(), churned_recency.median()),
            ("Std Dev", retained_recency.std(), churned_recency.std()),
            ("25th Percentile", retained_recency.quantile(0.25), churned_recency.quantile(0.25)),
            ("75th Percentile", retained_recency.quantile(0.75), churned_recency.quantile(0.75)),
        ]
        
        for name, ret_val, churn_val in metrics:
            diff = ret_val - churn_val
            print(f"{name:<20} {ret_val:>15.1f} {churn_val:>15.1f} {diff:>+15.1f}")
        
        # Calculate effect size for recency
        pooled_std = np.sqrt((retained_recency.var() + churned_recency.var()) / 2)
        if pooled_std > 0:
            cohens_d = (retained_recency.mean() - churned_recency.mean()) / pooled_std
        else:
            cohens_d = 0
        
        abs_d = abs(cohens_d)
        if abs_d >= 0.8:
            effect_interp = "Large effect"
        elif abs_d >= 0.5:
            effect_interp = "Medium effect"
        elif abs_d >= 0.2:
            effect_interp = "Small effect"
        else:
            effect_interp = "Negligible"
        
        print(f"\n📈 Effect Size (Cohen's d): {cohens_d:+.3f} ({effect_interp})")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 HOW TO INTERPRET RECENCY COMPARISON")
        print("─"*70)
        if churned_recency.median() > retained_recency.median():
            print("""
Key Finding: Churned customers have HIGHER recency (more days since last event)

This is a classic churn pattern - customers who leave typically show:
  • Longer gaps between activities before churning
  • Declining engagement over time
  • Last activity farther from observation date

Feature Engineering Recommendations:
  • days_since_last_event (recency as-is)
  • log_recency (if distribution is skewed)
  • recency_bucket (categorical: 0-7d, 8-30d, 31-90d, >90d)
  • is_recent_active (binary: recency < 30 days)
""")
        else:
            print("""
Observation: Retained customers have similar or higher recency than churned

This is unusual - investigate whether:
  • Churn is happening very quickly (new customers leaving fast)
  • There's a time window issue in the data
  • Target definition may need review
""")
        
    else:
        # Single distribution (no target)
        fig = go.Figure()
        fig.add_trace(go.Histogram(
            x=entity_recency_capped["recency_days"],
            nbinsx=50,
            name="Recency",
            marker_color="coral",
            opacity=0.7
        ))
        
        fig.add_vline(x=recency_result.median_recency_days, line_dash="solid", line_color="green",
                      annotation_text=f"Median: {recency_result.median_recency_days:.0f} days",
                      annotation_position="top right")
        
        fig.update_layout(
            title=f"Recency Distribution (capped at {cap:.0f} days = 99th percentile)",
            xaxis_title="Days Since Last Event",
            yaxis_title="Number of Entities",
            template="plotly_white",
            height=400
        )
        display_figure(fig)

RECENCY DISTRIBUTION: Retained vs Churned Comparison



📊 Recency Statistics by Retention Status:
------------------------------------------------------------
Metric                      Retained         Churned      Difference
------------------------------------------------------------
Mean                          1399.3           164.7         +1234.7
Median                        1434.0           116.0         +1318.0
Std Dev                        767.3           160.5          +606.8
25th Percentile                748.5            50.0          +698.5
75th Percentile               2057.0           226.0         +1831.0

📈 Effect Size (Cohen's d): +2.227 (Large effect)

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET RECENCY COMPARISON
──────────────────────────────────────────────────────────────────────

In [18]:
# Recency vs Target visualization (if target exists)
if ENTITY_COLUMN and TARGET_COLUMN and TARGET_COLUMN in df.columns:
    # Get target per entity
    entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
    
    # Merge with recency
    recency_target = entity_last.merge(entity_target, on=ENTITY_COLUMN)
    
    # Bin recency for clearer visualization
    recency_target["recency_bin"] = pd.cut(
        recency_target["recency_days"],
        bins=[0, 7, 30, 90, 180, float("inf")],
        labels=["0-7d", "8-30d", "31-90d", "91-180d", ">180d"],
        include_lowest=True  # keep entities with recency_days == 0 in the first bin
    )
    
    # Target rate by recency bin (observed=False keeps empty bins on the axis)
    target_by_recency = recency_target.groupby("recency_bin", observed=False)[TARGET_COLUMN].agg(["mean", "count"]).reset_index()
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig.add_trace(
        go.Bar(
            x=target_by_recency["recency_bin"].astype(str),
            y=target_by_recency["count"],
            name="Entity Count",
            marker_color="lightsteelblue",
            opacity=0.7
        ),
        secondary_y=False
    )
    
    fig.add_trace(
        go.Scatter(
            x=target_by_recency["recency_bin"].astype(str),
            y=target_by_recency["mean"] * 100,
            mode="lines+markers",
            name="Target Rate %",
            line=dict(color="red", width=3),
            marker=dict(size=10)
        ),
        secondary_y=True
    )
    
    fig.update_layout(
        title="Target Rate by Recency Bucket",
        xaxis_title="Days Since Last Event",
        template="plotly_white",
        height=450
    )
    fig.update_yaxes(title_text="Entity Count", secondary_y=False)
    fig.update_yaxes(title_text="Target Rate %", secondary_y=True)
    
    display_figure(fig)
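
The bucketing step above depends on how `pd.cut` treats interval edges. Intervals are right-closed by default and the lowest edge is excluded, so an entity with `recency_days == 0` receives no bucket unless `include_lowest=True` is passed. The sketch below illustrates this on made-up data; the frame `toy` and its values are illustrative assumptions, not the notebook's dataset.

```python
import pandas as pd

# Hypothetical entity-level frame: recency in days and a binary target
toy = pd.DataFrame({
    "recency_days": [0, 3, 7, 8, 45, 400, 500],
    "target": [1, 1, 0, 1, 0, 0, 0],
})

bins = [0, 7, 30, 90, 180, float("inf")]
labels = ["0-7d", "8-30d", "31-90d", "91-180d", ">180d"]

# Default behaviour: the first interval is (0, 7], so recency 0 becomes NaN
default_bins = pd.cut(toy["recency_days"], bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 7]
toy["recency_bin"] = pd.cut(
    toy["recency_days"], bins=bins, labels=labels, include_lowest=True
)

# observed=False keeps empty buckets (here "91-180d") in the result
rate_by_bin = toy.groupby("recency_bin", observed=False)["target"].agg(["mean", "count"])
print(rate_by_bin)
```

Keeping empty buckets in the table makes the x-axis of the chart complete, at the cost of a NaN target rate for buckets without entities.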





## 1c.9 Feature Correlations and Relationships

**📖 Understanding Feature Relationships in Event Data:**
- **Correlation Matrix**: Identify redundant features (multicollinearity)
- **Effect Sizes**: How well features discriminate by target (if available)
- **Cramér's V**: Association strength for categorical features

These analyses parallel the standard track (notebook 04) but are applied to event-level attributes.
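
As a point of reference for the Cramér's V figures in this section, here is a minimal sketch of the uncorrected statistic, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), built on `scipy.stats.chi2_contingency`. The helper `cramers_v` and the toy columns are assumptions made for illustration; `CategoricalTargetAnalyzer` may compute the statistic differently (for example with a bias correction).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series."""
    table = pd.crosstab(x, y)
    if table.shape[0] < 2 or table.shape[1] < 2:
        return 0.0  # a constant column carries no association
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical entity-level data
toy = pd.DataFrame({
    "plan": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "noise": ["x", "y", "x", "y", "y", "x", "x", "y"],
    "entity_id": [f"id{i}" for i in range(8)],
    "retained": [1, 1, 0, 0, 1, 0, 1, 0],
})

print(cramers_v(toy["plan"], toy["retained"]))       # plan fully determines the target
print(cramers_v(toy["noise"], toy["retained"]))      # independent of the target
print(cramers_v(toy["entity_id"], toy["retained"]))  # unique ID per row
```

A column with one distinct value per entity scores V = 1 by construction, so an ID-like column at the top of a Cramér's V ranking signals high cardinality rather than predictive value.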

In [19]:
# Correlation matrix for numeric event attributes
numeric_event_cols = [c for c in df.select_dtypes(include=[np.number]).columns 
                      if c not in [ENTITY_COLUMN, TARGET_COLUMN]]

if len(numeric_event_cols) >= 2:
    corr_matrix = df[numeric_event_cols].corr()
    fig = charts.heatmap(
        corr_matrix.values, x_labels=numeric_event_cols, y_labels=numeric_event_cols,
        title="Event Attribute Correlation Matrix"
    )
    display_figure(fig)
    
    # High correlation pairs (multicollinearity detection)
    high_corr_pairs = []
    for i in range(len(numeric_event_cols)):
        for j in range(i+1, len(numeric_event_cols)):
            corr_val = corr_matrix.iloc[i, j]
            if abs(corr_val) >= 0.7:
                high_corr_pairs.append({
                    "Column 1": numeric_event_cols[i], "Column 2": numeric_event_cols[j],
                    "Correlation": f"{corr_val:.3f}"
                })
    
    if high_corr_pairs:
        print("⚠️ High Correlation Pairs (|r| >= 0.7):")
        display_table(pd.DataFrame(high_corr_pairs))
    else:
        print("✓ No high correlation pairs detected (multicollinearity not a concern)")
else:
    print("Not enough numeric columns for correlation analysis.")

✓ No high correlation pairs detected (multicollinearity not a concern)


In [20]:
# Categorical feature analysis using Cramér's V (if target exists at entity level)
categorical_cols = [c for c in df.select_dtypes(include=['object', 'category']).columns 
                    if c not in [ENTITY_COLUMN, TIME_COLUMN]]

if categorical_cols and ENTITY_COLUMN:
    print("="*70)
    print("CATEGORICAL FEATURE ANALYSIS (Cramér's V)")
    print("="*70)
    
    # For event data, aggregate to entity level first (mode category per entity)
    entity_cats = df.groupby(ENTITY_COLUMN)[categorical_cols].agg(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    
    if TARGET_COLUMN and TARGET_COLUMN in df.columns:
        entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first()
        entity_data = entity_cats.join(entity_target)
        
        overall_retention = entity_data[TARGET_COLUMN].mean()
        print(f"\nOverall retention rate: {overall_retention:.1%}")
        
        cat_analyzer = CategoricalTargetAnalyzer(min_samples_per_category=10)
        cat_summary = cat_analyzer.analyze_multiple(entity_data.reset_index(), categorical_cols, TARGET_COLUMN)
        
        print("\n📊 Categorical Feature Strength:")
        print(f"{'Feature':<25} {'Cramér V':>10} {'Strength':<12} {'Significance'}")
        print("-" * 60)
        
        for _, row in cat_summary.iterrows():
            strength = "Strong" if row["cramers_v"] >= 0.3 else "Moderate" if row["cramers_v"] >= 0.1 else "Weak"
            sig = "***" if row["p_value"] < 0.001 else "**" if row["p_value"] < 0.01 else "*" if row["p_value"] < 0.05 else ""
            print(f"{row['feature'][:24]:<25} {row['cramers_v']:>10.3f} {strength:<12} {sig}")
        
        # Detailed analysis for the first three categorical columns (not ranked by Cramér's V)
        for col_name in categorical_cols[:3]:
            result = cat_analyzer.analyze(entity_data.reset_index(), col_name, TARGET_COLUMN)
            
            if len(result.category_stats) > 0:
                print(f"\n{'─'*60}")
                print(f"📊 {col_name.upper()} - Retention by Category")
                print("─"*60)
                
                cat_stats = result.category_stats
                categories = cat_stats['category'].tolist()
                retained_counts = cat_stats['retained_count'].tolist()
                churned_counts = cat_stats['churned_count'].tolist()
                
                # Stacked bar chart
                fig = go.Figure()
                fig.add_trace(go.Bar(
                    name='Retained', x=categories, y=retained_counts,
                    marker_color='rgba(46, 204, 113, 0.8)',
                    text=[f"{r/(r+c)*100:.0f}%" for r, c in zip(retained_counts, churned_counts)],
                    textposition='inside', textfont=dict(color='white', size=12)
                ))
                fig.add_trace(go.Bar(
                    name='Churned', x=categories, y=churned_counts,
                    marker_color='rgba(231, 76, 60, 0.8)',
                ))
                fig.update_layout(
                    barmode='stack', title=f"Retention by {col_name}",
                    template='plotly_white', height=350,
                    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5)
                )
                display_figure(fig)
                
                # High-risk categories
                if result.high_risk_categories:
                    print("\n⚠️ High-risk categories (below average retention):")
                    for cat in result.high_risk_categories[:3]:
                        cat_row = cat_stats[cat_stats['category'] == cat].iloc[0]
                        print(f"   • {cat}: {cat_row['retention_rate']:.1%} retention ({cat_row['lift']:.2f}x lift)")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 INTERPRETING CRAMÉR'S V")
        print("─"*70)
        print("""
Cramér's V measures association strength for categorical variables:
  V ≥ 0.3:  Strong association
  V 0.1-0.3: Moderate association
  V < 0.1:  Weak association

Significance: *** p<0.001, ** p<0.01, * p<0.05

High-risk categories (lift < 0.9x overall retention):
  → Target for retention campaigns
  → Investigate why these segments churn more
""")
    else:
        print("\nCategorical columns found but no target for association analysis")
        print(f"  Columns: {categorical_cols}")
elif not categorical_cols:
    print("No categorical columns found for Cramér's V analysis")

CATEGORICAL FEATURE ANALYSIS (Cramér's V)

Overall retention rate: 39.5%

📊 Categorical Feature Strength:
Feature                     Cramér V Strength     Significance
------------------------------------------------------------
email_id                       1.000 Strong
campaign_type                  0.070 Weak         ***
subject_line_category          0.049 Weak         *
device_type                    0.025 Weak
unsubscribe_date               0.000 Weak

────────────────────────────────────────────────────────────
📊 CAMPAIGN_TYPE - Retention by Category
────────────────────────────────────────────────────────────



────────────────────────────────────────────────────────────
📊 SUBJECT_LINE_CATEGORY - Retention by Category
────────────────────────────────────────────────────────────



──────────────────────────────────────────────────────────────────────
📖 INTERPRETING CRAMÉR'S V
──────────────────────────────────────────────────────────────────────

Cramér's V measures association strength for categorical variables:
  V ≥ 0.3:  Strong association
  V 0.1-0.3: Moderate association
  V < 0.1:  Weak association

Significance: *** p<0.001, ** p<0.01, * p<0.05

High-risk categories (lift < 0.9x overall retention):
  → Target for retention campaigns
  → Investigate why these segments churn more



## 1c.10 Entity-Level Feature Analysis (Effect Sizes)

**📖 Why Aggregate to Entity Level:**
- Time series data has multiple events per entity
- Target variable (retention) is typically at entity level
- Effect sizes (Cohen's d) require entity-level comparison

**Effect Size Interpretation (Cohen's d):**

| abs(d) | Interpretation | Predictive Power | Action |
|--------|----------------|------------------|--------|
| ≥ 0.8 | Large | Strong discriminator | Priority feature - include in model |
| 0.5-0.8 | Medium | Useful predictor | Include in model |
| 0.2-0.5 | Small | Weak signal | May help in combination |
| < 0.2 | Negligible | Limited value alone | Consider dropping or engineering |

**Direction matters:**
- **Positive d**: Retained customers have HIGHER values
- **Negative d**: Retained customers have LOWER values
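
To make the sign convention and the thresholds concrete, here is a small worked sketch of Cohen's d as retained mean minus churned mean, divided by a pooled standard deviation. The helper `cohens_d` and the sample values are assumptions for illustration. The `weighted` flag switches between the sample-size-weighted pooled variance and the simple average of the two variances; the two agree when the groups are the same size.

```python
import numpy as np

def cohens_d(retained, churned, weighted=True):
    """Cohen's d = (mean_retained - mean_churned) / pooled standard deviation."""
    retained = np.asarray(retained, dtype=float)
    churned = np.asarray(churned, dtype=float)
    var_r = retained.var(ddof=1)  # sample variances
    var_c = churned.var(ddof=1)
    if weighted:
        n_r, n_c = len(retained), len(churned)
        pooled_var = ((n_r - 1) * var_r + (n_c - 1) * var_c) / (n_r + n_c - 2)
    else:
        pooled_var = (var_r + var_c) / 2
    pooled_std = np.sqrt(pooled_var)
    if pooled_std == 0:
        return 0.0
    return float((retained.mean() - churned.mean()) / pooled_std)

retained = [10, 12, 14, 16, 18]  # mean 14, sample variance 10
churned = [6, 8, 10, 12, 14]     # mean 10, sample variance 10

d = cohens_d(retained, churned)
print(round(d, 3))  # 4 / sqrt(10), about 1.265: large effect, higher in retained
```

Swapping the two groups flips the sign but not the magnitude, which is why the tables in this section rank by `abs(d)` and report direction separately.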


In [21]:
# Aggregate event data to entity level for effect size analysis
if ENTITY_COLUMN and TARGET_COLUMN and TARGET_COLUMN in df.columns:
    # Build entity-level aggregations
    entity_aggs = df.groupby(ENTITY_COLUMN).agg({
        TIME_COLUMN: ['count', 'min', 'max'],
        **{col: ['mean', 'sum', 'std'] for col in numeric_event_cols if col != TARGET_COLUMN}
    })
    entity_aggs.columns = ['_'.join(col).strip() for col in entity_aggs.columns]
    entity_aggs = entity_aggs.reset_index()
    
    # Add target
    entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].first().reset_index()
    entity_df = entity_aggs.merge(entity_target, on=ENTITY_COLUMN)
    
    # Add derived features
    entity_df['tenure_days'] = (entity_df[f'{TIME_COLUMN}_max'] - entity_df[f'{TIME_COLUMN}_min']).dt.days
    entity_df['event_count'] = entity_df[f'{TIME_COLUMN}_count']
    
    # Calculate effect sizes (Cohen's d) for entity-level features
    effect_feature_cols = [c for c in entity_df.select_dtypes(include=[np.number]).columns
                          if c not in [ENTITY_COLUMN, TARGET_COLUMN]]
    
    print("="*80)
    print("ENTITY-LEVEL FEATURE EFFECT SIZES (Cohen's d)")
    print("="*80)
    print(f"\nAnalyzing {len(effect_feature_cols)} aggregated features at entity level")
    print(f"Entities: {len(entity_df):,} (Retained: {(entity_df[TARGET_COLUMN]==1).sum():,}, Churned: {(entity_df[TARGET_COLUMN]==0).sum():,})\n")
    
    effect_sizes = []
    for col in effect_feature_cols:
        churned = entity_df[entity_df[TARGET_COLUMN] == 0][col].dropna()
        retained = entity_df[entity_df[TARGET_COLUMN] == 1][col].dropna()
        
        if len(churned) > 0 and len(retained) > 0:
            pooled_std = np.sqrt(((len(churned)-1)*churned.std()**2 + (len(retained)-1)*retained.std()**2) / 
                                 (len(churned) + len(retained) - 2))
            d = (retained.mean() - churned.mean()) / pooled_std if pooled_std > 0 else 0
            
            abs_d = abs(d)
            if abs_d >= 0.8:
                interp, emoji = "Large effect", "🔴"
            elif abs_d >= 0.5:
                interp, emoji = "Medium effect", "🟡"
            elif abs_d >= 0.2:
                interp, emoji = "Small effect", "🟢"
            else:
                interp, emoji = "Negligible", "⚪"
            
            effect_sizes.append({
                "feature": col, "cohens_d": d, "abs_d": abs_d, 
                "interpretation": interp, "emoji": emoji,
                "retained_mean": retained.mean(), "churned_mean": churned.mean()
            })
    
    # Sort and display
    effect_df = pd.DataFrame(effect_sizes).sort_values("abs_d", ascending=False)
    
    print(f"{'Feature':<35} {'d':>8} {'Effect':<15} {'Direction':<20}")
    print("-" * 80)
    for _, row in effect_df.head(15).iterrows():
        direction = "↑ Higher in retained" if row["cohens_d"] > 0 else "↓ Lower in retained"
        print(f"{row['emoji']} {row['feature'][:33]:<33} {row['cohens_d']:>+8.3f} {row['interpretation']:<15} {direction:<20}")
    
    # Categorize features
    large_effect = effect_df[effect_df["abs_d"] >= 0.8]["feature"].tolist()
    medium_effect = effect_df[(effect_df["abs_d"] >= 0.5) & (effect_df["abs_d"] < 0.8)]["feature"].tolist()
    small_effect = effect_df[(effect_df["abs_d"] >= 0.2) & (effect_df["abs_d"] < 0.5)]["feature"].tolist()
    
    # INTERPRETATION
    print("\n" + "─"*80)
    print("📖 INTERPRETATION & RECOMMENDATIONS")
    print("─"*80)
    
    if large_effect:
        print("\n🔴 LARGE EFFECT (|d| ≥ 0.8) - Priority Features:")
        for f in large_effect[:5]:
            row = effect_df[effect_df["feature"] == f].iloc[0]
            direction = "higher" if row["cohens_d"] > 0 else "lower"
            print(f"   • {f}: Retained customers have {direction} values")
            print(f"     Mean: Retained={row['retained_mean']:.2f}, Churned={row['churned_mean']:.2f}")
        print("   → Priority candidates for the model (rule out target leakage first)")
    
    if medium_effect:
        print("\n🟡 MEDIUM EFFECT (0.5 ≤ |d| < 0.8) - Useful Features:")
        for f in medium_effect[:3]:
            print(f"   • {f}")
        print("   → Should include in model")
    
    if small_effect:
        print("\n🟢 SMALL EFFECT (0.2 ≤ |d| < 0.5) - Supporting Features:")
        print(f"   {', '.join(small_effect[:5])}")
        print("   → May help in combination with other features")
    
    negligible = effect_df[effect_df["abs_d"] < 0.2]["feature"].tolist()
    if negligible:
        print(f"\n⚪ NEGLIGIBLE EFFECT (|d| < 0.2): {len(negligible)} features")
        print("   → Consider engineering or dropping from model")
else:
    print("Entity column or target not available for effect size analysis")

ENTITY-LEVEL FEATURE EFFECT SIZES (Cohen's d)

Analyzing 21 aggregated features at entity level
Entities: 4,998 (Retained: 1,973, Churned: 3,025)

Feature                                    d Effect          Direction
--------------------------------------------------------------------------------
🔴 target_std                          +4.597 Large effect    ↑ Higher in retained
🔴 tenure_days                         -2.403 Large effect    ↓ Lower in retained
🔴 target_mean                         +1.570 Large effect    ↑ Higher in retained
🔴 opened_std                          -0.988 Large effect    ↓ Lower in retained
🔴 opened_sum                          -0.915 Large effect    ↓ Lower in retained
🔴 opened_mean                         -0.834 Large effect    ↓ Lower in retained
🟡 sent_date_count                     -0.759 Medium effect   ↓ Lower in retained
🟡 event_count                         -0.759 Medium effect   ↓ Lower in retained

In [22]:
# Box Plots: Entity-level feature distributions by target
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir() and len(effect_df) > 0:
    # Select top features by effect size for visualization
    top_features = effect_df.head(6)["feature"].tolist()
    n_features = len(top_features)
    
    if n_features > 0:
        print("="*70)
        print("DISTRIBUTION COMPARISON: Retained vs Churned (Box Plots)")
        print("="*70)
        print(f"\n📊 Showing top {n_features} features by effect size")
        print("   🟢 Green = Retained | 🔴 Red = Churned\n")
        
        fig = make_subplots(rows=1, cols=n_features, subplot_titles=top_features, horizontal_spacing=0.05)
        
        for i, col in enumerate(top_features):
            col_num = i + 1
            
            # Retained (1) - Green
            retained_data = entity_df[entity_df[TARGET_COLUMN] == 1][col].dropna()
            fig.add_trace(go.Box(y=retained_data, name='Retained',
                fillcolor='rgba(46, 204, 113, 0.7)', line=dict(color='#1e8449', width=2),
                boxpoints='outliers', width=0.35, showlegend=(i == 0), legendgroup='retained',
                marker=dict(color='rgba(46, 204, 113, 0.5)', size=4)), row=1, col=col_num)
            
            # Churned (0) - Red
            churned_data = entity_df[entity_df[TARGET_COLUMN] == 0][col].dropna()
            fig.add_trace(go.Box(y=churned_data, name='Churned',
                fillcolor='rgba(231, 76, 60, 0.7)', line=dict(color='#922b21', width=2),
                boxpoints='outliers', width=0.35, showlegend=(i == 0), legendgroup='churned',
                marker=dict(color='rgba(231, 76, 60, 0.5)', size=4)), row=1, col=col_num)
        
        fig.update_layout(height=450, title_text="Top Features: Retained (Green) vs Churned (Red)",
            template='plotly_white', showlegend=True, boxmode='group',
            legend=dict(orientation="h", yanchor="bottom", y=1.05, xanchor="center", x=0.5))
        fig.update_xaxes(showticklabels=False)
        display_figure(fig)
        
        # INTERPRETATION
        print("─"*70)
        print("📖 HOW TO READ BOX PLOTS")
        print("─"*70)
        print("""
Box Plot Elements:
  • Box = Middle 50% of data (IQR: 25th to 75th percentile)
  • Line inside box = Median (50th percentile)
  • Whiskers = 1.5 × IQR from box edges
  • Dots outside = Outliers

What makes a good predictor:
  ✓ Clear SEPARATION between green and red boxes
  ✓ Different MEDIANS (center lines at different heights)
  ✓ Minimal OVERLAP between boxes

Patterns to look for:
  • Green box entirely above red → Retained have higher values
  • Green box entirely below red → Retained have lower values
  • Overlapping boxes → Feature alone may not discriminate well
  • Many outliers in one group → Subpopulations worth investigating
""")

DISTRIBUTION COMPARISON: Retained vs Churned (Box Plots)

📊 Showing top 6 features by effect size
   🟢 Green = Retained | 🔴 Red = Churned



──────────────────────────────────────────────────────────────────────
📖 HOW TO READ BOX PLOTS
──────────────────────────────────────────────────────────────────────

Box Plot Elements:
  • Box = Middle 50% of data (IQR: 25th to 75th percentile)
  • Line inside box = Median (50th percentile)
  • Whiskers = 1.5 × IQR from box edges
  • Dots outside = Outliers

What makes a good predictor:
  ✓ Clear SEPARATION between green and red boxes
  ✓ Different MEDIANS (center lines at different heights)
  ✓ Minimal OVERLAP between boxes

Patterns to look for:
  • Green box entirely above red → Retained have higher values
  • Green box entirely below red → Retained have lower values
  • Overlapping boxes → Feature alone may not discriminate well
  • Many outliers in one group → Subpopulations worth investigating

In [23]:
# Feature-Target Correlation Ranking
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir():
    print("="*70)
    print("FEATURE-TARGET CORRELATIONS (Entity-Level)")
    print("="*70)
    
    correlations = []
    for col in effect_feature_cols:
        if col != TARGET_COLUMN:
            corr = entity_df[[col, TARGET_COLUMN]].corr().iloc[0, 1]
            if not np.isnan(corr):
                correlations.append({"Feature": col, "Correlation": corr})
    
    if correlations:
        corr_df = pd.DataFrame(correlations).sort_values("Correlation", key=abs, ascending=False)
        
        fig = charts.bar_chart(
            corr_df["Feature"].head(12).tolist(),
            corr_df["Correlation"].head(12).tolist(),
            title=f"Feature Correlations with {TARGET_COLUMN}"
        )
        display_figure(fig)
        
        print("\n📊 Correlation Rankings:")
        print(f"{'Feature':<35} {'Correlation':>12} {'Strength':<15} {'Direction'}")
        print("-" * 75)
        
        for _, row in corr_df.head(10).iterrows():
            abs_corr = abs(row["Correlation"])
            if abs_corr >= 0.5:
                strength = "Strong"
            elif abs_corr >= 0.3:
                strength = "Moderate"
            elif abs_corr >= 0.1:
                strength = "Weak"
            else:
                strength = "Very weak"
            
            direction = "Positive" if row["Correlation"] > 0 else "Negative"
            print(f"{row['Feature'][:34]:<35} {row['Correlation']:>+12.3f} {strength:<15} {direction}")
        
        # INTERPRETATION
        print("\n" + "─"*70)
        print("📖 INTERPRETING CORRELATIONS WITH TARGET")
        print("─"*70)
        print("""
Correlation with binary target (retained=1, churned=0):

  Positive correlation (+): Higher values → more likely RETAINED
  Negative correlation (-): Higher values → more likely CHURNED

Strength guide:
  |r| ≥ 0.5:  Strong - prioritize this feature
  |r| 0.3-0.5: Moderate - useful predictor
  |r| 0.1-0.3: Weak - may help in combination
  |r| < 0.1:  Very weak - limited predictive value

Note: Correlation captures LINEAR relationships only.
Non-linear relationships may have low correlation but still be predictive.
""")

FEATURE-TARGET CORRELATIONS (Entity-Level)



📊 Correlation Rankings:
Feature                              Correlation Strength        Direction
---------------------------------------------------------------------------
target_sum                                +1.000 Strong          Positive
target_std                                +0.913 Strong          Positive
tenure_days                               -0.761 Strong          Negative
target_mean                               +0.609 Strong          Positive
opened_std                                -0.434 Moderate        Negative
opened_sum                                -0.408 Moderate        Negative
opened_mean                               -0.378 Moderate        Negative
sent_date_count                           -0.348 Moderate        Negative
event_count                               -0.348 Moderate        Negative
send_hour_sum                             -0.344 Moderate        Negative
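
The ranking above is led by `target_sum` (r = +1.000), `target_std` and `target_mean`, which are aggregates of an event-level column named `target`. Features built from the label itself tend to reproduce the outcome rather than predict it, so it is worth screening for them before acting on the ranking. The sketch below is one possible screen, not part of the `customer_retention` package: the helper `flag_leakage_candidates`, the `label_prefixes` argument and the 0.95 threshold are all assumptions chosen for illustration.

```python
import pandas as pd

def flag_leakage_candidates(entity_df, target_col, label_prefixes=("target",), r_threshold=0.95):
    """Flag numeric features that look derived from the label, by name or by near-perfect correlation."""
    flagged = {}
    for col in entity_df.select_dtypes(include="number").columns:
        if col == target_col:
            continue
        if col.startswith(tuple(label_prefixes)):
            flagged[col] = "name derived from label"
            continue
        r = entity_df[col].corr(entity_df[target_col])
        if pd.notna(r) and abs(r) >= r_threshold:
            flagged[col] = f"|r| = {abs(r):.3f}"
    return flagged

# Hypothetical entity-level frame
toy = pd.DataFrame({
    "target_entity": [1, 1, 0, 0, 1, 0],
    "target_sum":    [3, 2, 0, 0, 4, 0],
    "opened_mean":   [0.2, 0.5, 0.4, 0.6, 0.1, 0.3],
    "copy_of_label": [1, 1, 0, 0, 1, 0],
})

flagged = flag_leakage_candidates(toy, "target_entity")
print(flagged)
```

Flagged columns are candidates for review, not automatic exclusions: a legitimately strong predictor can also cross the correlation threshold.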

In [24]:
# Scatter Plot Matrix for top entity-level features
if ENTITY_COLUMN and TARGET_COLUMN and 'entity_df' in dir() and len(effect_df) > 0:
    # Select top 4 features for scatter matrix
    top_scatter_features = effect_df.head(4)["feature"].tolist()
    
    if len(top_scatter_features) >= 2:
        # Fixed seed keeps the sampled scatter matrix reproducible between runs
        scatter_data = entity_df[top_scatter_features].sample(min(1000, len(entity_df)), random_state=42)
        fig = charts.scatter_matrix(scatter_data, title="Scatter Plot Matrix (Top Entity-Level Features)")
        display_figure(fig)
        
        print("\n📈 Scatter Matrix Insights:")
        print("   • Look for clusters indicating natural segments")
        print("   • Diagonal patterns suggest correlated features")
        print("   • Curved patterns may benefit from polynomial features")


📈 Scatter Matrix Insights:
   • Look for clusters indicating natural segments
   • Diagonal patterns suggest correlated features
   • Curved patterns may benefit from polynomial features


## 1c.11 Sparkline Comparison: Retained vs Churned Trends

**📖 Why Sparklines for Cohort Comparison:**

Sparklines provide a compact side-by-side visualization of how metrics evolve differently for retained vs churned customers:

| Row | What It Shows | Look For |
|-----|--------------|----------|
| **Retained (Green)** | Weekly trend for customers who stayed | Stable or upward trends |
| **Churned (Red)** | Weekly trend for customers who left | Declining trends before churn |

**Reading the Sparklines:**
- Each column = one metric
- Top row = Retained customers (green)
- Bottom row = Churned customers (red)
- Compare shapes: divergent patterns = predictive signal

**Configuration:**
- Variables are auto-selected based on **effect size** (Cohen's d) - metrics that best differentiate retained from churned
- Override `SPARKLINE_COLUMNS` below to specify custom columns
- Target defaults to detected churn/retention column
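
The series behind each sparkline can be pictured as a per-cohort weekly aggregate of one metric. The pandas-only sketch below shows that idea on a made-up event log. It approximates the concept and is not the implementation of `SparklineDataBuilder`; the column names (`customer_id`, `event_time`, `opened`, `retained_flag`) are hypothetical.

```python
import pandas as pd

# Hypothetical event log: one row per event, entity-level label repeated on each row
events = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c", "c"],
    "event_time": pd.to_datetime([
        "2024-01-01", "2024-01-09", "2024-01-02",
        "2024-01-10", "2024-01-03", "2024-01-11",
    ]),
    "opened": [1, 1, 1, 0, 0, 0],
    "retained_flag": [1, 1, 0, 0, 1, 1],
})

# Weekly mean of the metric per cohort: rows are weeks (ending Sunday), columns are cohorts
weekly = (
    events
    .groupby([pd.Grouper(key="event_time", freq="W"), "retained_flag"])["opened"]
    .mean()
    .unstack("retained_flag")
    .rename(columns={1: "retained", 0: "churned"})
)
print(weekly)
```

Each column of `weekly` is one sparkline; comparing the `retained` and `churned` columns week by week is what the "divergent patterns" guidance above refers to.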

In [25]:
# === SPARKLINE CONFIGURATION ===
# Override these to customize the sparkline comparison

# Target column for cohort split (default: auto-detected churn/retention column)
SPARKLINE_TARGET = TARGET_COLUMN  # Override: e.g., "churn_flag"

# Columns to visualize (default: auto-select based on effect size)
# Set to specific columns: ["col1", "col2", "col3"]
# Set to None for auto-selection
SPARKLINE_COLUMNS = None  # Override: e.g., ["revenue", "login_count", "support_tickets"]

# Number of columns to show if auto-selecting
SPARKLINE_MAX_COLS = 6

# === AUTO-SELECT BEST COLUMNS (by Effect Size / Cohen's d) ===
def select_sparkline_columns(df, numeric_cols, target_col, max_cols=6):
    """
    Select columns most likely to show differences between retained/churned.
    
    Selection Logic:
    - WITH target: Uses effect size (Cohen's d) to find metrics that best 
      differentiate retained vs churned customers
    - WITHOUT target: Uses variance to find most variable (interesting) metrics
    
    Returns columns sorted by discriminative power.
    """
    if target_col is None or target_col not in df.columns:
        # No target - select by variance (most variable = most interesting)
        variances = {col: df[col].var() for col in numeric_cols if col in df.columns}
        sorted_cols = sorted(variances.keys(), key=lambda x: variances[x], reverse=True)
        return sorted_cols[:max_cols]
    
    # With target - select by discrimination power (effect size proxy)
    scores = {}
    for col in numeric_cols:
        if col not in df.columns or col == target_col:
            continue
        try:
            group0 = df[df[target_col] == 0][col].dropna()
            group1 = df[df[target_col] == 1][col].dropna()
            if len(group0) > 0 and len(group1) > 0:
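                # Cohen's d using the unweighted average of the two group variances
                # (a ranking proxy; equals the weighted pooled form when groups are equal in size)
                pooled_std = np.sqrt((group0.var() + group1.var()) / 2)
                if pooled_std > 0:
                    scores[col] = abs(group1.mean() - group0.mean()) / pooled_std
                else:
                    scores[col] = 0
        except Exception:  # skip columns that cannot be compared; let KeyboardInterrupt through
            continue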
    
    if scores:
        sorted_cols = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return sorted_cols[:max_cols]
    
    # Fallback to first N columns
    return [c for c in numeric_cols if c in df.columns][:max_cols]

# Determine columns to use
if SPARKLINE_COLUMNS is not None:
    sparkline_cols = [c for c in SPARKLINE_COLUMNS if c in df.columns]
    selection_method = "user-specified"
else:
    sparkline_cols = select_sparkline_columns(df, numeric_event_cols, SPARKLINE_TARGET, SPARKLINE_MAX_COLS)
    selection_method = "auto-selected by effect size (Cohen's d)" if SPARKLINE_TARGET else "auto-selected by variance"

print("="*70)
print("SPARKLINE VARIABLE SELECTION")
print("="*70)
print(f"\nTarget column: {SPARKLINE_TARGET or 'None (no cohort split)'}")
print(f"Selection method: {selection_method}")
print(f"\nSelected columns ({len(sparkline_cols)}):")
for i, col in enumerate(sparkline_cols, 1):
    print(f"   {i}. {col}")

if SPARKLINE_COLUMNS is None and SPARKLINE_TARGET:
    print("""
💡 Why these columns?
   Columns are ranked by EFFECT SIZE (Cohen's d), which measures how well 
   each metric separates retained from churned customers. Higher effect 
   size = better discrimination = more interesting to visualize.
   
   To override: Set SPARKLINE_COLUMNS = ["your", "columns", "here"]
""")

SPARKLINE VARIABLE SELECTION

Target column: target_entity
Selection method: auto-selected by effect size (Cohen's d)

Selected columns (6):
   1. target
   2. opened
   3. clicked
   4. bounced
   5. time_to_open_hours
   6. send_hour

💡 Why these columns?
   Columns are ranked by EFFECT SIZE (Cohen's d), which measures how well 
   each metric separates retained from churned customers. Higher effect 
   size = better discrimination = more interesting to visualize.

   To override: Set SPARKLINE_COLUMNS = ["your", "columns", "here"]



In [26]:
# Sparkline comparison: Retained vs Churned behavior over time
from customer_retention.stages.profiling import SparklineDataBuilder

if ENTITY_COLUMN and sparkline_cols:
    builder = SparklineDataBuilder(
        entity_column=ENTITY_COLUMN,
        time_column=TIME_COLUMN,
        target_column=SPARKLINE_TARGET if SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns else None,
        freq="W"
    )
    
    sparkline_data, has_target = builder.build(df, sparkline_cols)
    builder.print_summary(sparkline_data, has_target)
    
    # Build chart data in format expected by cohort_sparklines
    chart_data = {}
    for sd in sparkline_data:
        chart_data[sd.column] = {
            "retained": sd.retained_values,
            "churned": sd.churned_values if sd.churned_values else sd.retained_values,
        }
    
    # Use ChartBuilder for visualization
    fig = charts.cohort_sparklines(
        chart_data,
        title="Weekly Trends: 🟢 Retained vs 🔴 Churned" if has_target else "Weekly Trends"
    )
    display_figure(fig)
    
    # Show high-divergence columns
    divergent = [(sd.column, sd.divergence_score) for sd in sparkline_data if sd.divergence_score > 0.5]
    if divergent:
        print(f"\n⭐ High-divergence columns: {[c for c, _ in sorted(divergent, key=lambda x: -x[1])]}")


SPARKLINE COMPARISON: Retained vs Churned Trends

  🟢 Retained (target=1) | 🔴 Churned (target=0)

  target: divergence=0.94
  opened: divergence=1.57
  clicked: divergence=1.01
  bounced: divergence=0.14
  time_to_open_hours: divergence=0.03
  send_hour: divergence=0.05



⭐ High-divergence columns: ['opened', 'clicked', 'target']


In [27]:
# Use ChartBuilder sparkline_grid for monthly cohort trends (alternative visualization)
if ENTITY_COLUMN and sparkline_cols and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    # Prepare target-labeled data
    entity_target = df.groupby(ENTITY_COLUMN)[SPARKLINE_TARGET].first()
    df_monthly = df.merge(entity_target.reset_index().rename(columns={SPARKLINE_TARGET: '_target'}), on=ENTITY_COLUMN)
    df_monthly['_month'] = df_monthly[TIME_COLUMN].dt.to_period('M').dt.start_time
    
    cols_to_plot = sparkline_cols[:4]
    monthly_retained = df_monthly[df_monthly['_target'] == 1].groupby('_month')[cols_to_plot].mean()
    monthly_churned = df_monthly[df_monthly['_target'] == 0].groupby('_month')[cols_to_plot].mean()
    
    print("\n" + "="*70)
    print("MONTHLY SPARKLINE GRIDS")
    print("="*70)
    
    retained_series = {col[:20]: monthly_retained[col].dropna().tolist() for col in cols_to_plot if col in monthly_retained.columns}
    if retained_series:
        fig_retained = charts.sparkline_grid(retained_series, columns=2, sparkline_height=80)
        fig_retained.update_layout(title="🟢 RETAINED Customers - Monthly Trends")
        display_figure(fig_retained)
    
    churned_series = {col[:20]: monthly_churned[col].dropna().tolist() for col in cols_to_plot if col in monthly_churned.columns}
    if churned_series:
        fig_churned = charts.sparkline_grid(churned_series, columns=2, sparkline_height=80)
        fig_churned.update_layout(title="🔴 CHURNED Customers - Monthly Trends")
        display_figure(fig_churned)



MONTHLY SPARKLINE GRIDS


## 1c.12 Velocity & Acceleration Analysis

**📖 Why Velocity and Acceleration Matter:**

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Velocity** | Δ(value) / Δt | Rate of change - is the metric rising or falling, and how fast? |
| **Acceleration** | Δ(velocity) / Δt | Change in the rate - is the rise or decline speeding up or slowing down? |

Window size is derived from 01a findings (shortest aggregation window).
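
As a minimal sketch of the two formulas in the table, assume a metric has already been aggregated to one value per week. The series values below are invented for illustration. Velocity is then the first difference of the series, and acceleration is the first difference of velocity:

```python
import pandas as pd

# Hypothetical weekly means of one metric (illustrative values only)
weekly = pd.Series(
    [10.0, 12.0, 15.0, 15.0, 13.0],
    index=pd.date_range("2024-01-01", periods=5, freq="W-MON"),
)

# Velocity: change between consecutive weeks (delta value / delta t, with delta t = 1 week)
velocity = weekly.diff().dropna()

# Acceleration: change in velocity between consecutive weeks
acceleration = velocity.diff().dropna()

print(velocity.tolist())      # [2.0, 3.0, 0.0, -2.0]
print(acceleration.tolist())  # [1.0, -3.0, -2.0]
```

Each differencing step drops one observation, so a series of n weeks yields n-1 velocity values and n-2 acceleration values.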


In [28]:
# Velocity & Acceleration Analysis (continuous metrics only)
if ENTITY_COLUMN and sparkline_cols and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    # Filter to continuous columns (exclude binary flags)
    continuous_cols = [c for c in sparkline_cols 
                       if c not in [SPARKLINE_TARGET, ENTITY_COLUMN, TIME_COLUMN]
                       and df[c].nunique() > 2]  # More than 2 unique values = likely continuous
    
    if not continuous_cols:
        print("⚠️ No continuous numeric columns found for velocity analysis.")
        print("   Velocity works best with metrics like amounts, durations, counts (not binary flags).")
    else:
        velocity_cols = continuous_cols[:4]
        
        print("="*70)
        print(f"VELOCITY & ACCELERATION ANALYSIS (window: {pattern_config.velocity_window_days}d)")
        print("="*70)
        print(f"Analyzing: {velocity_cols}")
        
        # Prepare cohort data once
        entity_target = df.groupby(ENTITY_COLUMN)[SPARKLINE_TARGET].first()
        df_temp = df.merge(entity_target.reset_index().rename(columns={SPARKLINE_TARGET: '_target'}), on=ENTITY_COLUMN)
        df_temp['_week'] = df_temp[TIME_COLUMN].dt.to_period('W').dt.start_time
        
        # Compute weekly aggregations and derivatives
        chart_data = {}
        velocity_summary = {}
        divergent_cols = []
        
        for col in velocity_cols:
            retained_weekly = df_temp[df_temp['_target'] == 1].groupby('_week')[col].mean()
            churned_weekly = df_temp[df_temp['_target'] == 0].groupby('_week')[col].mean()
            
            ret_vel = retained_weekly.diff().dropna()
            churn_vel = churned_weekly.diff().dropna()
            
            chart_data[col] = {
                "retained": retained_weekly.tolist(), "churned": churned_weekly.tolist(),
                "velocity_retained": ret_vel.tolist(), "velocity_churned": churn_vel.tolist(),
                "accel_retained": ret_vel.diff().dropna().tolist(), "accel_churned": churn_vel.diff().dropna().tolist(),
            }
            
            ret_mean_vel = ret_vel.mean() if len(ret_vel) > 0 else 0
            churn_mean_vel = churn_vel.mean() if len(churn_vel) > 0 else 0
            # Flag divergence when the cohorts' mean velocities differ in sign (absolute dead-band of ±0.001)
            is_divergent = (ret_mean_vel > 0.001) != (churn_mean_vel > 0.001) or (ret_mean_vel < -0.001) != (churn_mean_vel < -0.001)
            if is_divergent: divergent_cols.append(col)
            velocity_summary[col] = {"retained": ret_mean_vel, "churned": churn_mean_vel, "divergent": is_divergent}
        
        fig = charts.velocity_acceleration_chart(chart_data, title="Value → Velocity → Acceleration (🟢 Retained vs 🔴 Churned)")
        display_figure(fig)
        
        print("\n📊 Cohort Velocity Comparison:")
        for col, v in velocity_summary.items():
            signal = "⚠️ DIVERGENT" if v["divergent"] else ""
            print(f"   {col}: Retained={v['retained']:+.4f}, Churned={v['churned']:+.4f} {signal}")
        
        if divergent_cols:
            print(f"\n⭐ TOP CANDIDATES: {', '.join(divergent_cols)}")


VELOCITY & ACCELERATION ANALYSIS (window: 180d)
Analyzing: ['time_to_open_hours', 'send_hour']



📊 Cohort Velocity Comparison:
   time_to_open_hours: Retained=+0.0116, Churned=-0.0022 ⚠️ DIVERGENT
   send_hour: Retained=-0.0009, Churned=+0.0034 ⚠️ DIVERGENT

⭐ TOP CANDIDATES: time_to_open_hours, send_hour


## 1c.13 Lag Correlation Analysis

**📖 Why Lag Correlations Matter:**

Lag correlations show how a metric relates to itself over time:
- High lag-1 correlation: Today's value predicts tomorrow's
- Decaying correlations: Effect diminishes over time
- Periodic spikes: Seasonality (e.g., spike at lag 7 = weekly pattern)
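
To make the bullets above concrete, here is a small sketch using pandas `Series.autocorr` on a synthetic daily series with a built-in 7-day cycle. The series is invented for illustration, and `TemporalFeatureAnalyzer.calculate_lag_correlations` may compute its correlations differently (for example, per entity), so treat this as an analogy rather than the framework's method:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with an exact 7-day cycle (illustrative only)
days = np.arange(56)
daily = pd.Series(np.sin(2 * np.pi * days / 7))

# Autocorrelation at each lag: correlation between the series and itself shifted by `lag`
lag_corr = {lag: daily.autocorr(lag=lag) for lag in range(1, 15)}

# The strongest lags line up with the cycle length (7) and its multiple (14)
best_lag = max(lag_corr, key=lambda k: lag_corr[k])
for lag in (1, 3, 7):
    print(f"lag {lag}: r={lag_corr[lag]:.2f}")
```

On this synthetic series the correlation is moderate at lag 1, negative near half a cycle, and close to 1 at lag 7, which is the "periodic spike" signature described above.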

In [29]:
# Lag Correlation Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and sparkline_cols:
    lag_cols = sparkline_cols[:6]
    max_lag = 14
    
    print("="*70)
    print("LAG CORRELATION ANALYSIS (using TemporalFeatureAnalyzer)")
    print("="*70)
    
    # Use framework analyzer (initialized above or create new one)
    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(
            time_column=TIME_COLUMN,
            entity_column=ENTITY_COLUMN
        )
    
    # Calculate lag correlations using framework
    lag_results = feature_analyzer.calculate_lag_correlations(df, lag_cols, max_lag=max_lag)
    
    # Build data for heatmap
    lag_corr_data = {col: result.correlations for col, result in lag_results.items()}
    
    # Use ChartBuilder for visualization
    fig = charts.lag_correlation_heatmap(
        lag_corr_data,
        max_lag=max_lag,
        title="Autocorrelation by Lag (days) - Informs Lag Feature Selection"
    )
    display_figure(fig)
    
    # Display framework results
    print("\n📊 Best Lag per Variable:")
    strong_lag_vars = []
    weekly_pattern_vars = []
    for col, result in lag_results.items():
        best_lag_info = f"best lag={result.best_lag}d (r={result.best_correlation:.2f})"
        weekly_info = " [Weekly pattern]" if result.has_weekly_pattern else ""
        
        if abs(result.best_correlation) > 0.3:  # strong negative correlations count too
            strong_lag_vars.append((col, result.best_lag, result.best_correlation))
        if result.has_weekly_pattern:
            weekly_pattern_vars.append(col)
            
        print(f"   {col[:25]}: {best_lag_info}{weekly_info}")
    
    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("📖 HOW TO INTERPRET LAG CORRELATIONS")
    print("─"*70)
    print("""
Lag correlation shows how a variable relates to its PAST values:

Reading the heatmap:
  • Darker colors = STRONGER correlation at that lag
  • Row = variable being analyzed
  • Column = lag in days (1-14)

What the patterns mean:
  1. HIGH correlation at lag-1 (r > 0.5)
     → Strong "memory" - today's value predicts tomorrow's
     → Use: {col}_lag_1d, {col}_diff_1d features
     
  2. HIGH correlation at lag-7 (weekly peak)
     → Clear weekly seasonality
     → Use: {col}_lag_7d, day_of_week encoding
     
  3. SLOWLY decaying correlations
     → Persistent behavior - values change slowly over time
     → Use: Rolling averages work well
     
  4. LOW correlations everywhere (< 0.2)
     → Random/noisy variable
     → Lag features less useful here
""")
    
    if strong_lag_vars:
        print("⭐ STRONG LAG CANDIDATES:")
        for col, lag, corr in strong_lag_vars:
            print(f"   • {col}: lag {lag}d (r={corr:.2f}) → Create {col}_lag_{lag}d feature")
    
    if weekly_pattern_vars:
        print(f"\n📅 WEEKLY PATTERN DETECTED in: {', '.join(weekly_pattern_vars)}")
        print("   RECOMMENDED: Add day_of_week features + lag_7d features")

LAG CORRELATION ANALYSIS (using TemporalFeatureAnalyzer)



📊 Best Lag per Variable:
   target: best lag=14d (r=0.04)
   opened: best lag=4d (r=-0.03)
   clicked: best lag=6d (r=-0.03)
   bounced: best lag=3d (r=-0.03)
   time_to_open_hours: best lag=6d (r=0.03)
   send_hour: best lag=9d (r=-0.03)

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET LAG CORRELATIONS
──────────────────────────────────────────────────────────────────────

Lag correlation shows how a variable relates to its PAST values:

Reading the heatmap:
  • Darker colors = STRONGER correlation at that lag
  • Row = variable being analyzed
  • Column = lag in days (1-14)

What the patterns mean:
  1. HIGH correlation at lag-1 (r > 0.5)
     → Strong 

## 1c.14 Predictive Power Analysis (IV & KS Statistics)

**📖 Information Value (IV) and KS Statistics:**

These metrics measure how well time-window features predict the target:

| Metric | Range | Interpretation |
|--------|-------|----------------|
| **IV** | 0-1+ | <0.02=very weak, 0.02-0.1=weak, 0.1-0.3=medium, 0.3-0.5=strong, >0.5=suspicious (check for leakage) |
| **KS** | 0-1 | Maximum separation between target classes |
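
For intuition, both statistics can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code behind `calculate_predictive_power`: the function names, the quantile binning, the bin count, and the `eps` smoothing are all choices made here, and the data is synthetic. It also assumes a binary target with both classes present.

```python
import numpy as np

def information_value(feature, target, bins=5, eps=1e-6):
    """IV = sum over bins of (p1 - p0) * ln(p1 / p0), for a binary target."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target)
    uniq = np.unique(feature)
    if len(uniq) <= bins:
        # Few distinct values (e.g. a binary flag): one bin per value
        bin_ids = np.searchsorted(uniq, feature)
    else:
        # Quantile edges; the interior edges assign each value to a bin
        edges = np.unique(np.quantile(feature, np.linspace(0, 1, bins + 1)))
        bin_ids = np.digitize(feature, edges[1:-1], right=True)
    n1, n0 = (target == 1).sum(), (target == 0).sum()
    iv = 0.0
    for b in np.unique(bin_ids):
        in_bin = bin_ids == b
        p1 = (in_bin & (target == 1)).sum() / n1 + eps  # share of class 1 in this bin
        p0 = (in_bin & (target == 0)).sum() / n0 + eps  # share of class 0 in this bin
        iv += (p1 - p0) * np.log(p1 / p0)
    return float(iv)

def ks_statistic(feature, target):
    """Largest gap between the empirical CDFs of the two classes."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target)
    pos = np.sort(feature[target == 1])
    neg = np.sort(feature[target == 0])
    grid = np.unique(feature)
    cdf_pos = np.searchsorted(pos, grid, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, grid, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

# Synthetic data: one feature shifts with the target, one is pure noise
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=2000)
informative = rng.normal(loc=target * 1.0, scale=1.0)
noise = rng.normal(size=2000)

print(information_value(informative, target), ks_statistic(informative, target))
print(information_value(noise, target), ks_statistic(noise, target))
```

A feature that is a copy of the target puts each class entirely in its own bin, which drives IV far above 0.5. That is why very large IV values are treated as a leakage warning rather than good news. `scipy.stats.ks_2samp` returns the same KS statistic together with a p-value.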

In [30]:
# Predictive Power Analysis using TemporalFeatureAnalyzer
if ENTITY_COLUMN and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    print("="*70)
    print("PREDICTIVE POWER ANALYSIS (using TemporalFeatureAnalyzer)")
    print("="*70)
    
    # Use framework analyzer
    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(
            time_column=TIME_COLUMN,
            entity_column=ENTITY_COLUMN
        )
    
    # Calculate predictive power using framework
    power_results = feature_analyzer.calculate_predictive_power(
        df, sparkline_cols, SPARKLINE_TARGET
    )
    
    # Build data for visualization
    iv_values = {col: result.information_value for col, result in power_results.items()}
    ks_values = {col: result.ks_statistic for col, result in power_results.items()}
    
    # Use ChartBuilder for visualization
    fig = charts.predictive_power_chart(
        iv_values,
        ks_values,
        title="Variable Predictive Power Rankings"
    )
    display_figure(fig)
    
    # Display framework results
    print("\n📊 Predictive Power Rankings (from framework):")
    print(f"{'Variable':<25} {'IV':>8} {'Strength':<12} {'KS':>8} {'p-value':>10}")
    print("-" * 70)
    
    sorted_results = sorted(power_results.items(), key=lambda x: x[1].information_value, reverse=True)
    strong_iv_vars = []
    strong_ks_vars = []
    suspicious_vars = []
    
    for col, result in sorted_results:
        sig = "***" if result.ks_pvalue < 0.001 else "**" if result.ks_pvalue < 0.01 else "*" if result.ks_pvalue < 0.05 else ""
        print(f"{col[:24]:<25} {result.information_value:>8.3f} {result.iv_interpretation:<12} {result.ks_statistic:>8.3f} {result.ks_pvalue:>9.4f} {sig}")
        
        if result.information_value > 0.3:
            strong_iv_vars.append(col)
        if result.ks_statistic > 0.4:
            strong_ks_vars.append(col)
        if result.iv_interpretation == "suspicious":
            suspicious_vars.append(col)
    
    # INTERPRETATION SECTION
    print("\n" + "─"*70)
    print("📖 HOW TO INTERPRET IV AND KS STATISTICS")
    print("─"*70)
    print("""
Information Value (IV) - measures how well a variable separates classes:
  • IV < 0.02:   Very weak - not useful alone
  • IV 0.02-0.1: Weak - some signal
  • IV 0.1-0.3:  Medium - good predictor
  • IV 0.3-0.5:  Strong - excellent predictor
  • IV > 0.5:    SUSPICIOUS - check for data leakage!

KS Statistic - measures distribution separation between retained/churned:
  • KS < 0.2:    Heavy overlap - weak discriminator
  • KS 0.2-0.4:  Moderate separation
  • KS > 0.4:    Clear separation - strong discriminator

Significance stars: *** p<0.001, ** p<0.01, * p<0.05

Combined interpretation:
  • HIGH IV + HIGH KS + Significant → TOP FEATURE CANDIDATE
  • HIGH IV but LOW KS → May need binning/transformation
  • LOW IV but HIGH KS → May have outliers driving KS
""")
    
    # Warnings and recommendations
    if suspicious_vars:
        print(f"⚠️ WARNING: Suspicious IV for: {', '.join(suspicious_vars)}")
        print("   IV > 0.5 may indicate DATA LEAKAGE - investigate these carefully!")
        print("   Check if these variables are derived from the target or future data.")
    
    top_vars = [col for col, r in sorted_results if r.information_value > 0.1 or r.ks_statistic > 0.3]
    if top_vars:
        print(f"\n⭐ TOP FEATURE ENGINEERING CANDIDATES: {', '.join(top_vars[:5])}")
        print("   These variables show strong predictive power for the target.")
        print("   RECOMMENDED: Prioritize creating derived features from these.")
else:
    print("Target column required for predictive power analysis")

PREDICTIVE POWER ANALYSIS (using TemporalFeatureAnalyzer)



📊 Predictive Power Rankings (from framework):
Variable                        IV Strength           KS    p-value
----------------------------------------------------------------------
target                       6.573 suspicious      1.000    0.0000 ***
opened                       0.752 suspicious      0.360    0.0000 ***
clicked                      0.383 strong          0.334    0.0000 ***
bounced                      0.181 medium          0.133    0.0000 ***
send_hour                    0.128 medium          0.077    0.0000 ***
time_to_open_hours           0.123 medium          0.102    0.0000 ***

──────────────────────────────────────────────────────────────────────
📖 HOW TO INTERPRET IV AND KS STATISTICS
───────────────────────────────────────────

## 1c.15 Momentum Analysis (Window Ratios)

**📖 Momentum Features:**

Momentum captures behavioral changes by comparing time windows:

| Metric | Interpretation |
|--------|----------------|
| Momentum > 1 | Recent activity higher than historical (engagement increasing) |
| Momentum < 1 | Recent activity lower than historical (churn signal) |
| Large swings | High volatility in behavior |

Window pairs are derived from 01a findings (consecutive aggregation windows).
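
The ratio idea can be sketched for a single entity as follows. This is a hedged illustration: `window_momentum`, the `sent_date` column, the reference date, and the sample events are all hypothetical, and `TemporalFeatureAnalyzer.calculate_momentum` may define its windows or handle edge cases differently.

```python
import pandas as pd

def window_momentum(events, value_col, time_col, short_days, long_days, as_of):
    """Mean over the recent short window divided by the mean over the longer window."""
    short_start = as_of - pd.Timedelta(days=short_days)
    long_start = as_of - pd.Timedelta(days=long_days)
    in_short = events[(events[time_col] > short_start) & (events[time_col] <= as_of)]
    in_long = events[(events[time_col] > long_start) & (events[time_col] <= as_of)]
    long_mean = in_long[value_col].mean()
    if in_short.empty or long_mean == 0:  # no recent events, or a zero baseline
        return float("nan")
    return in_short[value_col].mean() / long_mean

# Hypothetical events for one customer (illustrative values only)
events = pd.DataFrame({
    "sent_date": pd.to_datetime(
        ["2024-01-10", "2024-03-01", "2024-06-15", "2024-11-20", "2024-12-05"]
    ),
    "opened": [1, 1, 1, 1, 0],
})

m = window_momentum(events, "opened", "sent_date", 180, 365, pd.Timestamp("2024-12-31"))
print(round(m, 3))  # 0.625
```

Here the open rate over the last 180 days (0.5) is lower than over the last 365 days (0.8), so momentum falls below 1, the pattern the table labels as a churn signal.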


In [31]:
# Momentum Analysis using configured windows
if ENTITY_COLUMN and SPARKLINE_TARGET and SPARKLINE_TARGET in df.columns:
    momentum_pairs = pattern_config.get_momentum_pairs()
    print("="*70)
    print(f"MOMENTUM ANALYSIS (window pairs: {momentum_pairs})")
    print("="*70)
    
    if 'feature_analyzer' not in dir():
        feature_analyzer = TemporalFeatureAnalyzer(time_column=TIME_COLUMN, entity_column=ENTITY_COLUMN)
    
    momentum_cols = sparkline_cols[:4]
    
    # Use primary momentum pair from config
    short_w, long_w = momentum_pairs[0]
    window_label = f"{short_w}d/{long_w}d"
    
    # Cohort comparison
    entity_target = df.groupby(ENTITY_COLUMN)[SPARKLINE_TARGET].first()
    df_temp = df.merge(entity_target.reset_index().rename(columns={SPARKLINE_TARGET: '_target'}), on=ENTITY_COLUMN)
    
    retained_mom = feature_analyzer.calculate_momentum(df_temp[df_temp['_target'] == 1], momentum_cols, short_w, long_w)
    churned_mom = feature_analyzer.calculate_momentum(df_temp[df_temp['_target'] == 0], momentum_cols, short_w, long_w)
    
    # Build chart data
    momentum_data = {}
    divergent_cols = []
    for col in momentum_cols:
        ret = retained_mom[col].mean_momentum if col in retained_mom else 1
        churn = churned_mom[col].mean_momentum if col in churned_mom else 1
        momentum_data[col] = {"retained": ret, "churned": churn}
        if abs(ret - churn) > 0.1: divergent_cols.append(col)
    
    fig = charts.momentum_comparison_chart(
        momentum_data, 
        title=f"Momentum Comparison ({window_label})",
        window_label=window_label
    )
    display_figure(fig)
    
    # Results summary
    print(f"\n📊 Momentum Results ({window_label}):")
    print(f"{'Variable':<20} {'Retained':>12} {'Churned':>12} {'Diff':>10}")
    print("-" * 60)
    for col in momentum_cols:
        ret, churn = momentum_data[col]["retained"], momentum_data[col]["churned"]
        signal = "⚠️" if abs(ret - churn) > 0.1 else ""
        print(f"{col[:19]:<20} {ret:>12.3f} {churn:>12.3f} {ret-churn:>+10.3f} {signal}")
    
    if divergent_cols:
        print(f"\n⭐ HIGH-SIGNAL FEATURES: {', '.join(divergent_cols)}")


MOMENTUM ANALYSIS (window pairs: [(180, 365)])



📊 Momentum Results (180d/365d):
Variable                 Retained      Churned       Diff
------------------------------------------------------------
target                      1.846        1.000     +0.846 ⚠️
opened                      0.570        0.975     -0.405 ⚠️
clicked                     0.240        0.917     -0.677 ⚠️
bounced                     1.083        1.091     -0.008 

⭐ HIGH-SIGNAL FEATURES: target, opened, clicked


## 1c.16 Feature Engineering Summary

**📋 Feature Types with Configured Windows:**

The table below shows feature formulas using windows derived from 01a findings.
Run the next cell to see actual values for your data.
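
On event-level data, lag and rolling formulas such as `df[col].shift(N)` and `df[col].rolling(...)` usually need to be applied per entity, and an integer passed to `rolling` counts rows rather than days. The sketch below shows one way to do both with pandas. The column names, the 7-day window, and the sample rows are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical event-level data; column names are illustrative
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "sent_date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-20", "2024-01-02", "2024-01-03"]
    ),
    "clicked": [1.0, 0.0, 1.0, 0.0, 1.0],
}).sort_values(["customer_id", "sent_date"])

# Lag: the previous event's value within the same entity (shift(N) with N = 1)
events["clicked_lag_1"] = events.groupby("customer_id")["clicked"].shift(1)

# Rolling mean over a 7-day time window, computed per entity.
# An offset string such as "7D" windows by elapsed time; an integer would window by row count.
rolled = (
    events.set_index("sent_date")
    .groupby("customer_id")["clicked"]
    .rolling("7D")
    .mean()
)
# Rows are already ordered by entity and date, so positions line up
events["clicked_mean_7d"] = rolled.to_numpy()

print(events[["customer_id", "sent_date", "clicked_lag_1", "clicked_mean_7d"]])
```

By default a time-based window includes the current event, so if the value at the current row must not leak into its own feature, shift the series before rolling.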


In [32]:
# Feature Engineering Recommendations
print("="*80)
print("FEATURE ENGINEERING RECOMMENDATIONS")
print("="*80)

# Display configured windows from pattern_config
momentum_pairs = pattern_config.get_momentum_pairs()
short_w = momentum_pairs[0][0] if momentum_pairs else 7
long_w = momentum_pairs[0][1] if momentum_pairs else 30

print(f"""
┌─────────────────┬────────────────────────────────────────────────────┐
│ Feature Type    │ Formula (using configured windows)                 │
├─────────────────┼────────────────────────────────────────────────────┤
│ Velocity        │ (value_now - value_{short_w}d_ago) / {short_w}                 │
│ Acceleration    │ velocity_now - velocity_{short_w}d_ago                   │
│ Momentum        │ mean_{short_w}d / mean_{long_w}d                              │
│ Lag             │ df[col].shift(N)                                   │
│ Rolling Mean    │ df[col].rolling({short_w}).mean()                        │
│ Rolling Std     │ df[col].rolling({long_w}).std()                         │
│ Ratio           │ sum_{long_w}d / sum_all_time                            │
└─────────────────┴────────────────────────────────────────────────────┘

Windows derived from 01a findings: {pattern_config.aggregation_windows}
Velocity window: {pattern_config.velocity_window_days}d
Momentum pairs: {momentum_pairs}
""")

# Framework recommendations
if 'feature_analyzer' in dir() and SPARKLINE_TARGET:
    recommendations = feature_analyzer.get_feature_recommendations(
        df, value_columns=sparkline_cols, target_column=SPARKLINE_TARGET
    )
    if recommendations:
        print("🎯 Framework Recommendations:")
        for rec in recommendations[:5]:
            print(f"   • {rec.feature_type.value}: {rec.source_column} → {rec.feature_name} (priority {rec.priority})")


FEATURE ENGINEERING RECOMMENDATIONS

┌─────────────────┬────────────────────────────────────────────────────┐
│ Feature Type    │ Formula (using configured windows)                 │
├─────────────────┼────────────────────────────────────────────────────┤
│ Velocity        │ (value_now - value_180d_ago) / 180                   │
│ Acceleration    │ velocity_now - velocity_180d_ago                     │
│ Momentum        │ mean_180d / mean_365d                                 │
│ Lag             │ df[col].shift(N)                                   │
│ Rolling Mean    │ df[col].rolling(180).mean()                          │
│ Rolling Std     │ df[col].rolling(365).s

In [33]:
print("\n" + "="*70)
print("TEMPORAL PATTERN SUMMARY")
print("="*70)

# Windows used
print(f"\n⚙️ CONFIGURED WINDOWS: {pattern_config.aggregation_windows}")
print(f"   Velocity: {pattern_config.velocity_window_days}d | Momentum: {pattern_config.get_momentum_pairs()}")

# Trend summary
print(f"\n📈 TREND:")
print(f"   Direction: {trend_result.direction.value}")
print(f"   Confidence: {trend_result.confidence}")

# Seasonality summary
print(f"\n🔁 SEASONALITY:")
if seasonality_results:
    for sr in seasonality_results[:2]:
        period_name = sr.period_name or f"{sr.period}-day"
        print(f"   {period_name.title()} pattern (strength: {sr.strength:.2f})")
else:
    print("   No significant seasonality detected")

# Recency summary
if ENTITY_COLUMN:
    print(f"\n⏱️ RECENCY:")
    print(f"   Median: {recency_result.median_recency_days:.0f} days")
    # Compare against None so a correlation of exactly 0.0 is still reported
    if recency_result.target_correlation is not None:
        corr = recency_result.target_correlation
        print(f"   Target correlation: {corr:.3f} {'(strong signal)' if abs(corr) > 0.3 else ''}")

# Velocity summary (if computed)
if 'velocity_summary' in dir() and velocity_summary:
    print(f"\n🚀 VELOCITY ({pattern_config.velocity_window_days}d window):")
    divergent = [col for col, v in velocity_summary.items() if v.get('divergent')]
    if divergent:
        print(f"   Divergent columns (retained vs churned): {divergent}")
    else:
        print("   No significant divergence between cohorts")

# Momentum summary (if computed)
if 'momentum_data' in dir() and momentum_data:
    print(f"\n📊 MOMENTUM ({pattern_config.get_momentum_pairs()[0] if pattern_config.get_momentum_pairs() else 'N/A'}):")
    if 'divergent_cols' in dir() and divergent_cols:
        print(f"   High-signal columns: {divergent_cols}")
    else:
        print("   No significant momentum differences detected")



TEMPORAL PATTERN SUMMARY

⚙️ CONFIGURED WINDOWS: ['180d', '365d', 'all_time']
   Velocity: 180d | Momentum: [(180, 365)]

📈 TREND:
   Direction: stable
   Confidence: medium

🔁 SEASONALITY:
   Weekly pattern (strength: 0.48)
   21-Day pattern (strength: 0.48)

⏱️ RECENCY:
   Median: 246 days
   Target correlation: 0.772 (strong signal)

🚀 VELOCITY (180d window):
   Divergent columns (retained vs churned): ['time_to_open_hours', 'send_hour']

📊 MOMENTUM ((180, 365)):
   High-signal columns: ['target', 'opened', 'clicked']


In [34]:
# Feature engineering recommendations based on patterns
print("\n" + "="*70)
print("RECOMMENDED TEMPORAL FEATURES")
print("="*70)

print("\n\U0001f6e0\ufe0f Based on detected patterns, consider these features:\n")

print("1. RECENCY FEATURES:")
print("   - days_since_last_event")
print("   - log_days_since_last_event (if right-skewed)")
print("   - recency_bucket (categorical: 0-7d, 8-30d, etc.)")

if seasonality_results:
    weekly = any(6 <= sr.period <= 8 for sr in seasonality_results)
    monthly = any(28 <= sr.period <= 32 for sr in seasonality_results)
    
    print("\n2. SEASONALITY FEATURES:")
    if weekly:
        print("   - is_weekend (binary)")
        print("   - day_of_week_sin, day_of_week_cos (cyclical encoding)")
    if monthly:
        print("   - day_of_month")
        print("   - is_month_start, is_month_end")

print("\n3. TREND-ADJUSTED FEATURES:")
if trend_result.direction in [TrendDirection.INCREASING, TrendDirection.DECREASING]:
    print("   - event_count_recent_vs_overall (ratio)")
    print("   - activity_trend_direction (for each entity)")
else:
    print("   - Standard time-window aggregations should work well")

print("\n4. COHORT FEATURES:")
print("   - cohort_month (categorical or ordinal)")
print("   - tenure_days (days since first event)")


RECOMMENDED TEMPORAL FEATURES

🛠️ Based on detected patterns, consider these features:

1. RECENCY FEATURES:
   - days_since_last_event
   - log_days_since_last_event (if right-skewed)
   - recency_bucket (categorical: 0-7d, 8-30d, etc.)

2. SEASONALITY FEATURES:
   - is_weekend (binary)
   - day_of_week_sin, day_of_week_cos (cyclical encoding)

3. TREND-ADJUSTED FEATURES:
   - Standard time-window aggregations should work well

4. COHORT FEATURES:
   - cohort_month (categorical or ordinal)
   - tenure_days (days since first event)
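
Several of the features listed above can be derived directly with pandas. The following is a minimal sketch, assuming hypothetical `customer_id` and `sent_date` columns and an arbitrary reference date `as_of`; substitute the entity column, time column, and snapshot date used in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical event-level data; column names and dates are illustrative
events = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "sent_date": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-03"]),
})
as_of = pd.Timestamp("2024-01-10")  # assumed reference date for recency and tenure

# Seasonality: cyclical day-of-week encoding keeps Sunday (6) adjacent to Monday (0)
dow = events["sent_date"].dt.dayofweek
events["day_of_week_sin"] = np.sin(2 * np.pi * dow / 7)
events["day_of_week_cos"] = np.cos(2 * np.pi * dow / 7)
events["is_weekend"] = (dow >= 5).astype(int)

# Recency and cohort: one row per entity
per_entity = events.groupby("customer_id")["sent_date"].agg(
    first_event="min", last_event="max"
)
per_entity["days_since_last_event"] = (as_of - per_entity["last_event"]).dt.days
per_entity["tenure_days"] = (as_of - per_entity["first_event"]).dt.days
per_entity["cohort_month"] = per_entity["first_event"].dt.to_period("M").astype(str)

print(per_entity[["days_since_last_event", "tenure_days", "cohort_month"]])
```

The sine and cosine columns are used together: either one alone maps two different weekdays to the same value.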


## 1c.17 Save Pattern Analysis Results

In [35]:
# Store pattern analysis results in findings
pattern_summary = {
    "windows_used": {
        "aggregation_windows": pattern_config.aggregation_windows,
        "velocity_window": pattern_config.velocity_window_days,
        "momentum_pairs": pattern_config.get_momentum_pairs(),
    },
    "trend": {
        "direction": trend_result.direction.value,
        "strength": trend_result.strength,
        "confidence": trend_result.confidence,
    },
    "seasonality": [
        {"period": sr.period, "name": sr.period_name, "strength": sr.strength}
        for sr in seasonality_results
    ],
}

if ENTITY_COLUMN:
    pattern_summary["recency"] = {
        "median_days": recency_result.median_recency_days,
        "target_correlation": recency_result.target_correlation,
    }

# Add velocity results if computed
if 'velocity_summary' in dir() and velocity_summary:
    pattern_summary["velocity"] = {
        col: {"retained": v["retained"], "churned": v["churned"], "divergent": v["divergent"]}
        for col, v in velocity_summary.items()
    }

# Add momentum results if computed
if 'momentum_data' in dir() and momentum_data:
    pattern_summary["momentum"] = {
        col: {"retained": v["retained"], "churned": v["churned"]}
        for col, v in momentum_data.items()
    }
    if 'divergent_cols' in dir():
        pattern_summary["momentum"]["_divergent_columns"] = divergent_cols

# Add to findings
if not findings.metadata:
    findings.metadata = {}
findings.metadata["temporal_patterns"] = pattern_summary

findings.save(FINDINGS_PATH)
print(f"\nPattern analysis saved to: {FINDINGS_PATH}")
print(f"\nSaved sections: {list(pattern_summary.keys())}")



Pattern analysis saved to: ../experiments/findings/customer_emails_408768_findings.yaml

Saved sections: ['windows_used', 'trend', 'seasonality', 'recency', 'velocity', 'momentum']


---

## Summary: What We Learned

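In this notebook, we analyzed temporal patterns:

1. **Trend Detection** - Identified long-term direction in data
2. **Seasonality** - Found periodic patterns (weekly, monthly)
3. **Cohort Analysis** - Compared behavior by entity join date
4. **Recency Analysis** - Measured how recent activity relates to outcomes
5. **Velocity, Lag, and Momentum Analysis** - Examined rates of change, autocorrelation by lag, and short/long window ratios
6. **Predictive Power** - Ranked variables by IV and KS statistics and flagged suspicious values as possible leakage
7. **Feature Recommendations** - Generated feature engineering suggestions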

## Pattern Summary

| Pattern | Status | Recommendation |
|---------|--------|----------------|
| Trend | Check findings | Detrend if strong |
| Seasonality | Check findings | Add cyclical features |
| Cohort Effects | Check findings | Add cohort indicators |
| Recency Effects | Check findings | Prioritize recent windows |

---

## Next Steps

**Complete the Event Bronze Track:**
- **01d_event_aggregation.ipynb** - Aggregate events to entity-level (produces new dataset)

After 01d produces the aggregated dataset, continue with:
- **02_column_deep_dive.ipynb** - Profile aggregated feature distributions
- **03_quality_assessment.ipynb** - Quality checks on aggregated data
- **04_relationship_analysis.ipynb** - Feature correlations and relationships

The aggregated data from 01d becomes the input for the Entity Bronze Track.