# Chapter 1a: Temporal Deep Dive (Event Bronze Track)

**Purpose:** Analyze event-level (time series) datasets with a focus on temporal patterns, entity lifecycles, and event frequency distributions.

**When to use this notebook:**
- Your dataset was detected as `EVENT_LEVEL` granularity in notebook 01
- You have multiple rows per entity (customer, user, etc.)
- Each row represents an event with a timestamp

**What you'll learn:**
- How to profile entity lifecycles (first event, last event, duration)
- How to read event frequency distributions per entity
- How to interpret inter-event timing patterns and their implications
- How to spot time series-specific feature engineering opportunities

**Outputs:**
- Entity lifecycle visualizations
- Event frequency distribution analysis
- Inter-event timing statistics
- Updated exploration findings with time series metadata

---

## Understanding Time Series Profiling

| Metric | Description | Why It Matters |
|--------|-------------|----------------|
| **Events per Entity** | Distribution of event counts | Identifies power users vs. one-time users |
| **Entity Lifecycle** | Duration from first to last event | Reveals customer tenure patterns |
| **Inter-event Time** | Time between consecutive events | Indicates engagement patterns |
| **Time Span** | Overall data period coverage | Helps plan time window aggregations |
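
The four metrics in the table can be computed directly with pandas. The following is a minimal sketch on a toy event log, not the `TimeSeriesProfiler` implementation; the column names `customer_id` and `event_time` and the data are illustrative assumptions.

```python
import pandas as pd

# Toy event log; column names and values are illustrative assumptions.
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b", "c"],
    "event_time": pd.to_datetime([
        "2023-01-01", "2023-01-08", "2023-01-29",
        "2023-01-05", "2023-03-06",
        "2023-02-01",
    ]),
})

# Sort so that consecutive rows within an entity are consecutive events.
grouped = (
    events.sort_values(["customer_id", "event_time"])
    .groupby("customer_id")["event_time"]
)

# Events per entity: number of rows per customer.
events_per_entity = grouped.size()

# Entity lifecycle: days from first to last event.
lifecycle_days = (grouped.max() - grouped.min()).dt.days

# Inter-event time: gap to the previous event of the same customer.
# The first event of each customer has no predecessor and is dropped.
inter_event_days = grouped.diff().dt.days.dropna()

# Time span: overall period covered by the data.
time_span_days = (events["event_time"].max() - events["event_time"].min()).days

print(events_per_entity.to_dict())
print(lifecycle_days.to_dict())
print(inter_event_days.tolist())
print(time_span_days)
```

Entity `c` has a single event, so its lifecycle is 0 days and it contributes no inter-event gaps.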

**Aggregation Windows (used in notebook 01d):**
- 24h: Very recent activity
- 7d: Weekly patterns
- 30d: Monthly patterns
- 90d: Quarterly trends
- 180d: Semi-annual patterns
- 365d: Annual patterns
- all-time: Historical totals
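
To make the windows concrete, here is a rough sketch of trailing-window event counts relative to a reference date. It is not the notebook 01d implementation; the reference date, the feature names, and the convention that an event belongs to a window when it is fewer than N days old are all assumptions for illustration.

```python
import pandas as pd

# Toy event log; names and values are illustrative assumptions.
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "event_time": pd.to_datetime(
        ["2023-12-29", "2023-12-10", "2023-06-01", "2023-12-28", "2022-01-15"]
    ),
})

reference_date = pd.Timestamp("2023-12-30")  # hypothetical snapshot date
windows = {"7d": 7, "30d": 30, "365d": 365}  # subset of the windows listed above

# Age of each event in whole days, measured back from the reference date.
age_days = (reference_date - events["event_time"]).dt.days

features = pd.DataFrame(index=sorted(events["customer_id"].unique()))
for label, days in windows.items():
    in_window = events[age_days < days]
    # Entities with no events in the window get an explicit zero.
    features[f"event_count_{label}"] = (
        in_window.groupby("customer_id").size().reindex(features.index, fill_value=0)
    )
features["event_count_all_time"] = events.groupby("customer_id").size()

print(features)
```

Entity `b` has the same count in every trailing window because its only other event is almost two years old; only the all-time total reflects it.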

## 1a.1 Load Previous Findings

In [1]:
from customer_retention.analysis.auto_explorer import ExplorationFindings
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType, DatasetGranularity
from customer_retention.stages.profiling import (
    TimeSeriesProfiler, TimeSeriesProfile,
    TypeDetector,
    DistributionAnalyzer, TransformationType,
)
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

In [2]:
# === CONFIGURATION ===
# Option 1: Set the exact path from notebook 01 output
# FINDINGS_PATH = "../experiments/findings/transactions_abc123_findings.yaml"

# Option 2: Auto-discover findings files
from pathlib import Path

FINDINGS_DIR = Path("../experiments/findings")

# Find all findings files
findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

# Sort by modification time (most recent first)
findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Found {len(findings_files)} findings file(s)")
print(f"Using: {FINDINGS_PATH}")
if len(findings_files) > 1:
    print(f"Other available: {[str(f.name) for f in findings_files[1:3]]}")

findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")

Found 4 findings file(s)
Using: ../experiments/findings/customer_emails_31faba_findings.yaml
Other available: ['customer_retention_retail_12f12a_findings.yaml', 'customer_transactions_10fc8c_findings.yaml']

Loaded findings for 12 columns from ../tests/fixtures/customer_emails.csv


In [3]:
# Verify this is a time series dataset
if findings.is_time_series:
    print("\u2705 Dataset confirmed as TIME SERIES (event-level)")
    ts_meta = findings.time_series_metadata
    print(f"   Entity column: {ts_meta.entity_column}")
    print(f"   Time column: {ts_meta.time_column}")
    if ts_meta.avg_events_per_entity is not None:
        print(f"   Avg events per entity: {ts_meta.avg_events_per_entity:.1f}")
else:
    print("\u26a0\ufe0f This dataset was NOT detected as time series.")
    print("   Consider using 02_column_deep_dive.ipynb instead.")
    print("   Or manually specify entity and time columns below.")

✅ Dataset confirmed as TIME SERIES (event-level)
   Entity column: customer_id
   Time column: sent_date
   Avg events per entity: 17.6


## 1a.2 Load Source Data & Configure Columns

In [None]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
charts = ChartBuilder()

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {data_source}")

In [5]:
# === COLUMN CONFIGURATION ===
# These will be auto-populated from findings if available
# Override manually if needed

if findings.is_time_series and findings.time_series_metadata:
    ENTITY_COLUMN = findings.time_series_metadata.entity_column
    TIME_COLUMN = findings.time_series_metadata.time_column
else:
    # Manual configuration - uncomment and set if auto-detection failed
    # ENTITY_COLUMN = "customer_id"
    # TIME_COLUMN = "event_date"
    
    # Try auto-detection
    detector = TypeDetector()
    granularity = detector.detect_granularity(df)
    ENTITY_COLUMN = granularity.entity_column
    TIME_COLUMN = granularity.time_column

print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")

if not ENTITY_COLUMN or not TIME_COLUMN:
    raise ValueError("Please set ENTITY_COLUMN and TIME_COLUMN manually above")

Entity column: customer_id
Time column: sent_date


## 1a.3 Time Series Profile Overview

**What we analyze:**
- Total events and unique entities
- Time span coverage
- Events per entity distribution
- Entity lifecycle metrics

In [6]:
# Create the time series profiler and run analysis
profiler = TimeSeriesProfiler(entity_column=ENTITY_COLUMN, time_column=TIME_COLUMN)
ts_profile = profiler.profile(df)

print("="*70)
print("TIME SERIES PROFILE SUMMARY")
print("="*70)
print(f"\n\U0001f4ca Dataset Overview:")
print(f"   Total Events: {ts_profile.total_events:,}")
print(f"   Unique Entities: {ts_profile.unique_entities:,}")
print(f"   Avg Events/Entity: {ts_profile.events_per_entity.mean:.1f}")
print(f"   Time Span: {ts_profile.time_span_days:,} days ({ts_profile.time_span_days/365:.1f} years)")

print(f"\n\U0001f4c5 Date Range:")
print(f"   First Event: {ts_profile.first_event_date}")
print(f"   Last Event: {ts_profile.last_event_date}")

print(f"\n\u23f1\ufe0f  Inter-Event Timing:")
if ts_profile.avg_inter_event_days is not None:
    print(f"   Avg Days Between Events: {ts_profile.avg_inter_event_days:.1f}")
else:
    print("   Not enough data to compute inter-event timing")

TIME SERIES PROFILE SUMMARY

📊 Dataset Overview:
   Total Events: 87,982
   Unique Entities: 4,998
   Avg Events/Entity: 17.6
   Time Span: 3,285 days (9.0 years)

📅 Date Range:
   First Event: 2015-01-01 00:00:00
   Last Event: 2023-12-30 00:00:00

⏱️  Inter-Event Timing:
   Avg Days Between Events: 136.0


## 1a.4 Events per Entity Distribution

**Goal:** Understand how event volume varies across entities to guide feature engineering and identify modeling challenges.

| Segment | Definition | Why It Matters for Modeling |
|---------|------------|---------------------------|
| **One-time** | Exactly 1 event | No temporal features possible; cold-start problem |
| **Low Activity** | Below Q25 | Sparse features, many zeros; log-transform counts |
| **Medium Activity** | Q25 to Q75 | Core population; standard aggregation windows work |
| **High Activity** | Above Q75 | Rich features; watch for training set dominance |
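
The segment definitions in the table can be sketched as a quantile-based rule. This is an assumption-laden illustration, not the code behind `classify_activity_segments`: in particular, computing Q25 and Q75 over multi-event entities only is a choice made here, and the event counts are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical event counts, one value per entity.
event_counts = pd.Series([1, 1, 2, 3, 4, 5, 6, 8, 12, 40], name="event_count")

# Assumption: quartiles are taken over entities with more than one event,
# so that one-time entities do not pull the thresholds down.
multi = event_counts[event_counts > 1]
q25, q75 = multi.quantile(0.25), multi.quantile(0.75)

# np.select applies the first matching condition, so order matters.
conditions = [event_counts == 1, event_counts < q25, event_counts <= q75]
labels = ["One-time", "Low Activity", "Medium Activity"]
segments = pd.Series(
    np.select(conditions, labels, default="High Activity"),
    index=event_counts.index,
    name="activity_segment",
)

print(f"Q25 = {q25}, Q75 = {q75}")
print(segments.value_counts())
```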

In [None]:
from customer_retention.stages.profiling import classify_activity_segments

segment_result = classify_activity_segments(ts_profile.entity_lifecycles)

segment_order = ["One-time", "Low Activity", "Medium Activity", "High Activity"]
segment_colors = {
    "One-time": "#d62728", "Low Activity": "#ff7f0e",
    "Medium Activity": "#2ca02c", "High Activity": "#1f77b4",
}

event_counts = segment_result.lifecycles["event_count"]
x_max = event_counts.quantile(0.99)
bins = np.linspace(0, x_max, 31)
bin_centers = (bins[:-1] + bins[1:]) / 2

lc = segment_result.lifecycles
bin_indices = np.digitize(lc["event_count"], bins) - 1
bin_indices = bin_indices.clip(0, len(bin_centers) - 1)
lc_binned = lc.assign(_bin=bin_indices)

fig = go.Figure()
for seg in segment_order:
    subset = lc_binned[lc_binned["activity_segment"] == seg]
    if subset.empty:
        continue
    counts_per_bin = subset.groupby("_bin").size().reindex(range(len(bin_centers)), fill_value=0)
    fig.add_trace(go.Bar(
        x=bin_centers, y=counts_per_bin.values, name=seg,
        marker_color=segment_colors[seg], opacity=0.85,
    ))

fig.add_vline(
    x=event_counts.median(), line_dash="solid", line_color="gray",
    annotation_text=f"Median: {event_counts.median():.0f}",
    annotation_position="top left",
)

use_log_y = event_counts.value_counts().max() > event_counts.value_counts().median() * 50

log_note = ("<br><sub>Log Y-axis: bar heights compress large differences — "
            "see table below for actual segment shares</sub>" if use_log_y else "")

fig.update_layout(
    barmode="stack", template="plotly_white", height=420,
    title="Events per Entity by Activity Segment" + log_note,
    xaxis_title="Number of Events",
    yaxis_title="Entities",
    yaxis_type="log" if use_log_y else "linear",
    legend=dict(orientation="h", yanchor="top", y=-0.15, xanchor="center", x=0.5),
    margin=dict(b=70),
)
display_figure(fig)

In [None]:
print(f"Segment thresholds: Q25 = {segment_result.q25_threshold:.0f} events, "
      f"Q75 = {segment_result.q75_threshold:.0f} events\n")
display_table(segment_result.recommendations)

## 1a.5 Entity Lifecycle Analysis

**Goal:** Classify entities by their engagement pattern to inform feature engineering and modeling strategy.

We combine two dimensions — **tenure** (days from first to last event) and **intensity** (events per day of tenure) — to identify four lifecycle quadrants:

| Quadrant | Tenure | Intensity | Meaning | Feature Implication |
|----------|--------|-----------|---------|---------------------|
| **Intense & Brief** | Short | High | Burst engagement, then gone | Recency features critical |
| **Steady & Loyal** | Long | High | Consistent power users | Trend/seasonality features valuable |
| **Occasional & Loyal** | Long | Low | Infrequent but persistent | Long time windows (90d+) needed |
| **One-shot** | Short | Low | Single/few interactions | May lack enough history for features |
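
As a rough sketch of the quadrant logic, the example below splits a toy lifecycle table at the median tenure and median intensity. It is not the `classify_lifecycle_quadrants` implementation: the data is invented, and clipping tenure to at least one day and using strict `>` comparisons are assumptions made here.

```python
import pandas as pd

# Hypothetical per-entity lifecycle table.
lifecycles = pd.DataFrame({
    "entity": ["a", "b", "c", "d"],
    "event_count": [30, 400, 4, 2],
    "duration_days": [10, 400, 500, 5],
})

# Intensity = events per day of tenure.
# Assumption: clip tenure to at least 1 day to avoid dividing by zero.
lifecycles["intensity"] = (
    lifecycles["event_count"] / lifecycles["duration_days"].clip(lower=1)
)

tenure_cut = lifecycles["duration_days"].median()
intensity_cut = lifecycles["intensity"].median()


def quadrant(row):
    """Assign a quadrant from the two median splits."""
    long_tenure = row["duration_days"] > tenure_cut
    high_intensity = row["intensity"] > intensity_cut
    if long_tenure:
        return "Steady & Loyal" if high_intensity else "Occasional & Loyal"
    return "Intense & Brief" if high_intensity else "One-shot"


lifecycles["lifecycle_quadrant"] = lifecycles.apply(quadrant, axis=1)
print(lifecycles[["entity", "duration_days", "intensity", "lifecycle_quadrant"]])
```

Because both cuts are medians of the observed data, the quadrants are relative to this population rather than absolute notions of "long" or "intense".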

In [None]:
from customer_retention.stages.profiling import classify_lifecycle_quadrants

quadrant_result = classify_lifecycle_quadrants(ts_profile.entity_lifecycles)
lifecycles = quadrant_result.lifecycles

quadrant_order = ["Steady & Loyal", "Occasional & Loyal", "Intense & Brief", "One-shot"]
quadrant_colors = {
    "Steady & Loyal": "#2ca02c", "Occasional & Loyal": "#1f77b4",
    "Intense & Brief": "#ff7f0e", "One-shot": "#d62728",
}
tenure_median = quadrant_result.tenure_threshold

print(f"Split thresholds: Tenure median = {quadrant_result.tenure_threshold:.0f} days, "
      f"Intensity median = {quadrant_result.intensity_threshold:.4f} events/day\n")
display_table(quadrant_result.recommendations)

In [None]:
# Combined panel: small multiples (top 2x2) + tenure histogram (bottom)
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[*quadrant_order, "Tenure Distribution by Quadrant", ""],
    specs=[[{}, {}], [{}, {}], [{"colspan": 2}, None]],
    vertical_spacing=0.08, horizontal_spacing=0.10,
    row_heights=[0.28, 0.28, 0.44],
)

# Top 2x2: scatter per quadrant
positions = [(1, 1), (1, 2), (2, 1), (2, 2)]
for (row, col), q in zip(positions, quadrant_order):
    subset = lifecycles[lifecycles["lifecycle_quadrant"] == q]
    fig.add_trace(go.Scatter(
        x=subset["duration_days"], y=subset["intensity"],
        mode="markers", marker=dict(color=quadrant_colors[q], opacity=0.4, size=3),
        showlegend=False,
    ), row=row, col=col)
    fig.update_xaxes(title_text="Tenure (d)", title_font_size=10, row=row, col=col)
    fig.update_yaxes(title_text="Ev/day", title_font_size=10, row=row, col=col)

# Bottom: overlaid tenure histograms
for q in quadrant_order:
    subset = lifecycles[lifecycles["lifecycle_quadrant"] == q]
    fig.add_trace(go.Histogram(
        x=subset["duration_days"], nbinsx=40, name=q,
        marker_color=quadrant_colors[q], opacity=0.6,
    ), row=3, col=1)

fig.add_vline(x=tenure_median, line_dash="dot", line_color="gray", opacity=0.5,
              row=3, col=1, annotation_text=f"Median: {tenure_median:.0f}d",
              annotation_position="top left")

fig.update_layout(
    barmode="overlay", template="plotly_white", height=900,
    title="Entity Lifecycle Quadrants",
    legend=dict(orientation="h", yanchor="top", y=-0.05, xanchor="center", x=0.5),
    margin=dict(b=80),
)
fig.update_xaxes(title_text="Tenure (days)", row=3, col=1)
fig.update_yaxes(title_text="Entities", row=3, col=1)
display_figure(fig)

## 1a.6 Temporal Coverage Analysis

**Why this matters for modeling:**

| Question | Impact |
|----------|--------|
| **Data gaps?** | Gaps produce misleading aggregation features — zeros that mean "no data" not "no activity" |
| **Volume trend?** | Growing volume means older entities have sparser history; declining means recent windows are underpopulated |
| **Entity coverage by window?** | Shows which aggregation windows will produce meaningful features vs. mostly zeros |
| **Entity arrival pattern?** | Concentrated arrivals = cohort effects; steady arrivals = stable population |
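
One simple way to look for data gaps is to resample event volume to a fixed period and find runs of empty periods. The sketch below assumes weekly periods and treats any zero-volume week as part of a gap; how `analyze_temporal_coverage` actually defines gaps and their severity is not shown in this notebook, so treat this only as an illustration.

```python
import pandas as pd

# Toy event timestamps with a deliberate hole between mid-January and late February.
event_times = pd.to_datetime(
    ["2023-01-02", "2023-01-09", "2023-01-16", "2023-02-20", "2023-02-27"]
)
events = pd.Series(1, index=event_times)

# Weekly event volume (weeks end on Sunday); weeks without rows become zeros.
weekly = events.resample("W").sum()

# A gap is a run of consecutive zero-volume weeks.
is_empty = weekly.eq(0)
run_id = (is_empty != is_empty.shift()).cumsum()  # new id whenever the state flips
gaps = [
    (run.index[0], run.index[-1], len(run))
    for _, run in weekly[is_empty].groupby(run_id[is_empty])
]

for start, end, n_weeks in gaps:
    print(f"Gap: weeks ending {start.date()} to {end.date()} ({n_weeks} empty weeks)")
```

Any trailing-window feature whose window overlaps such a gap would report zeros that mean "no data" rather than "no activity".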

In [None]:
from customer_retention.stages.profiling import analyze_temporal_coverage

df_temp = df.copy()
df_temp[TIME_COLUMN] = pd.to_datetime(df_temp[TIME_COLUMN])

coverage_result = analyze_temporal_coverage(df_temp, ENTITY_COLUMN, TIME_COLUMN)

# Events over time with gap highlighting
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=coverage_result.events_over_time.index,
    y=coverage_result.events_over_time.values,
    mode="lines", fill="tozeroy", name="Events", line_color="steelblue",
))

for gap in coverage_result.gaps:
    color = {"minor": "rgba(255,165,0,0.15)", "moderate": "rgba(255,100,0,0.25)",
             "major": "rgba(255,0,0,0.25)"}[gap.severity]
    fig.add_vrect(
        x0=gap.start, x1=gap.end, fillcolor=color, line_width=0,
        annotation_text=f"{gap.duration_days:.0f}d gap",
        annotation_position="top left", annotation_font_size=10,
    )

trend_label = f"{coverage_result.volume_trend} ({coverage_result.volume_change_pct:+.0%})"
fig.update_layout(
    title=f"Event Volume Over Time<br><sub>Trend: {trend_label}"
          + (f" | {len(coverage_result.gaps)} gap(s) highlighted" if coverage_result.gaps else "")
          + "</sub>",
    xaxis_title="Date", yaxis_title="Events per Period",
    template="plotly_white", height=380,
)
display_figure(fig)

In [None]:
fig = make_subplots(
    rows=1, cols=2, subplot_titles=["New Entities Over Time", "Entity Coverage by Window"],
    column_widths=[0.6, 0.4], horizontal_spacing=0.12,
)

# Left: new entities over time
fig.add_trace(go.Bar(
    x=coverage_result.new_entities_over_time.index,
    y=coverage_result.new_entities_over_time.values,
    marker_color="mediumseagreen", opacity=0.8, showlegend=False,
), row=1, col=1)
fig.update_xaxes(title_text="First Event Date", row=1, col=1)
fig.update_yaxes(title_text="New Entities", row=1, col=1)

# Right: entity window coverage bar chart
cov_data = [(c.window, c.coverage_pct, c.active_entities) for c in coverage_result.entity_window_coverage]
windows_labels = [c[0] for c in cov_data]
coverage_pcts = [c[1] * 100 for c in cov_data]
active_counts = [c[2] for c in cov_data]

bar_colors = ["#2ca02c" if p >= 50 else "#ff7f0e" if p >= 10 else "#d62728" for p in coverage_pcts]
fig.add_trace(go.Bar(
    x=windows_labels, y=coverage_pcts, showlegend=False,
    marker_color=bar_colors, opacity=0.85,
    text=[f"{p:.0f}%<br>({n:,})" for p, n in zip(coverage_pcts, active_counts)],
    textposition="outside", textfont_size=9,
), row=1, col=2)
fig.update_xaxes(title_text="Window", row=1, col=2)
fig.update_yaxes(title_text="% Entities Active", range=[0, 115], row=1, col=2)

fig.update_layout(
    template="plotly_white", height=380,
    title="Entity Arrival & Window Coverage"
          + f"<br><sub>Reference date: {coverage_result.last_event.strftime('%Y-%m-%d')}</sub>",
    margin=dict(b=50),
)
display_figure(fig)

In [None]:
print(f"Coverage Summary:")
print(f"  Time span: {coverage_result.time_span_days:,} days "
      f"({coverage_result.first_event.strftime('%Y-%m-%d')} to {coverage_result.last_event.strftime('%Y-%m-%d')})")
print(f"  Volume trend: {coverage_result.volume_trend} ({coverage_result.volume_change_pct:+.0%})")
print(f"  Data gaps: {len(coverage_result.gaps)} detected"
      + (f" ({sum(g.duration_days for g in coverage_result.gaps):.0f} total days)" if coverage_result.gaps else ""))

if coverage_result.recommendations:
    print(f"\nRecommendations:")
    for rec in coverage_result.recommendations:
        print(f"  -> {rec}")
else:
    print(f"\nNo coverage issues detected — data is suitable for all candidate windows.")

In [None]:
from customer_retention.stages.profiling import derive_drift_implications

drift = derive_drift_implications(coverage_result)

risk_colors = {"low": "\033[92m", "moderate": "\033[93m", "high": "\033[91m"}
reset = "\033[0m"
color = risk_colors.get(drift.risk_level, "")

print(f"Parameter Drift Assessment: {color}{drift.risk_level.upper()}{reset}")
print(f"  Volume drift risk: {drift.volume_drift_risk}")
print(f"  Population stability: {drift.population_stability:.2f}")
print(f"  Data regimes: {drift.regime_count}")
if drift.recommended_training_start:
    print(f"  Recommended training start: {drift.recommended_training_start.strftime('%Y-%m-%d')}")

print(f"\nRationale:")
for r in drift.rationale:
    print(f"  -> {r}")

## 1a.7 Inter-Event Timing Analysis

**📖 Understanding Inter-Event Time:**
- Time between consecutive events for each entity
- Short inter-event time: Frequent engagement
- Long inter-event time: Sporadic usage or churn risk
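
For larger datasets, the same per-entity gaps can be computed without a Python loop by sorting and using a grouped `diff`. This is a small sketch on invented data; the column names mirror this dataset but the values are assumptions.

```python
import pandas as pd

# Toy event log, deliberately unsorted.
events = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a", "c", "b"],
    "sent_date": pd.to_datetime(
        ["2023-01-10", "2023-01-01", "2023-01-01", "2023-01-31", "2023-02-01", "2023-04-01"]
    ),
})

ordered = events.sort_values(["customer_id", "sent_date"])

# diff() within each customer gives the time since that customer's previous event.
# Each customer's first event has no predecessor (NaT) and is dropped.
gaps_days = (
    ordered.groupby("customer_id")["sent_date"]
    .diff()
    .dt.total_seconds()
    .div(86400)
    .dropna()
)

print(gaps_days.tolist())
print(f"median = {gaps_days.median():.1f}d, mean = {gaps_days.mean():.1f}d")
```

Single-event entities such as `c` drop out automatically, so no explicit length check is needed.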

In [15]:
# Compute inter-event times for all entities with >1 event
inter_event_times = []

for entity, group in df_temp.groupby(ENTITY_COLUMN):
    if len(group) < 2:
        continue
    sorted_times = group[TIME_COLUMN].sort_values()
    diffs = sorted_times.diff().dropna()
    inter_event_times.extend(diffs.dt.total_seconds() / 86400)  # Convert to days

if inter_event_times:
    inter_event_series = pd.Series(inter_event_times)
    
    print("\u23f1\ufe0f  Inter-Event Time Distribution (days):")
    print(f"   Min: {inter_event_series.min():.2f}")
    print(f"   25th percentile: {inter_event_series.quantile(0.25):.2f}")
    print(f"   Median: {inter_event_series.median():.2f}")
    print(f"   Mean: {inter_event_series.mean():.2f}")
    print(f"   75th percentile: {inter_event_series.quantile(0.75):.2f}")
    print(f"   Max: {inter_event_series.max():.2f}")
    
    # Histogram
    fig = go.Figure()
    
    # Cap at 99th percentile for visualization
    cap = inter_event_series.quantile(0.99)
    display_data = inter_event_series[inter_event_series <= cap]
    
    fig.add_trace(go.Histogram(
        x=display_data,
        nbinsx=50,
        name="Inter-Event Time",
        marker_color="coral",
        opacity=0.7
    ))
    
    fig.add_vline(x=inter_event_series.median(), line_dash="solid", line_color="green",
                  annotation_text=f"Median: {inter_event_series.median():.1f} days",
                  annotation_position="top right")
    
    fig.update_layout(
        title=f"Inter-Event Time Distribution (capped at {cap:.0f} days = 99th percentile)",
        xaxis_title="Days Between Events",
        yaxis_title="Frequency",
        template="plotly_white",
        height=400
    )
    display_figure(fig)
else:
    print("Not enough multi-event entities to analyze inter-event timing")

⏱️  Inter-Event Time Distribution (days):
   Min: 0.00
   25th percentile: 33.00
   Median: 89.00
   Mean: 136.04
   75th percentile: 189.00
   Max: 1643.00


In [None]:
if inter_event_times:
    median_iet = inter_event_series.median()
    mean_iet = inter_event_series.mean()
    q25 = inter_event_series.quantile(0.25)
    q75 = inter_event_series.quantile(0.75)
    iqr = q75 - q25
    skew_ratio = mean_iet / median_iet if median_iet > 0 else 1.0

    print("Interpretation:")
    if skew_ratio > 1.5:
        print(f"  Distribution is heavily right-skewed (mean/median = {skew_ratio:.2f})")
        print(f"  -> Half of all gaps are {median_iet:.0f}d or shorter, yet the mean gap is {mean_iet:.0f}d")
        print(f"  -> A long tail of very long gaps pulls the mean above the median")
    elif skew_ratio > 1.2:
        print(f"  Distribution is moderately right-skewed (mean/median = {skew_ratio:.2f})")
        print(f"  -> Typical engagement every {median_iet:.0f} days, with some long gaps")
    else:
        print(f"  Distribution is approximately symmetric (mean/median = {skew_ratio:.2f})")
        print(f"  -> Consistent engagement pattern around {median_iet:.0f} days")

    print(f"\n  Spread: IQR = {iqr:.0f} days (Q25={q25:.0f}d to Q75={q75:.0f}d)")
    if iqr > median_iet:
        print(f"  -> High variability (IQR > median) — entities have inconsistent timing")
    else:
        print(f"  -> Moderate variability — most entities follow a similar cadence")

    print(f"\nRecommendations:")
    # Window alignment
    window_map = [(1, "24h"), (7, "7d"), (14, "14d"), (30, "30d"),
                  (90, "90d"), (180, "180d"), (365, "365d")]
    aligned = [(d, w) for d, w in window_map if 0.5 * median_iet <= d <= 2 * median_iet]
    if aligned:
        aligned_str = ", ".join(w for _, w in aligned)
        print(f"  -> Windows aligned with median inter-event time: {aligned_str}")
        print(f"     At the median cadence these capture roughly 0.5 to 2 events per entity")
    else:
        print(f"  -> Median inter-event ({median_iet:.0f}d) does not align with standard windows")

    events_in_30d = 30.0 / median_iet if median_iet > 0 else 0
    events_in_90d = 90.0 / median_iet if median_iet > 0 else 0
    if events_in_30d < 2:
        print(f"  -> 30d window captures only ~{events_in_30d:.1f} events/entity — "
              f"consider longer windows (90d+) for meaningful aggregations")
    if median_iet < 7:
        print(f"  -> High frequency engagement — 7d and 24h windows will be rich with signal")

    if skew_ratio > 1.5:
        print(f"  -> Consider log-transforming inter-event time as a feature "
              f"(reduces right-skew impact on models)")


## 1a.8 Column Distributions

Standard column profiling applied to event-level data: distributions, outliers, and transformation needs.
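
As a point of reference for the statistics reported below, here is a minimal pandas sketch of skewness and IQR-based outlier detection on invented values. It does not reproduce `DistributionAnalyzer`; the 1.5 × IQR fences are the conventional choice and are assumed here, and `Series.skew()` returns the bias-corrected sample skewness, which may differ slightly from the framework's figure.

```python
import pandas as pd

# Hypothetical right-skewed column with one extreme value.
values = pd.Series([1.0, 1.2, 1.5, 2.0, 2.2, 2.8, 3.1, 3.5, 4.0, 35.6])

# Conventional IQR fences: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Bias-corrected sample skewness; positive means a longer right tail.
skewness = values.skew()

print(f"IQR fences: [{lower:.2f}, {upper:.2f}]")
print(f"Outliers: {outliers.tolist()}")
print(f"Skewness: {skewness:.2f}")
```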

In [16]:
# Use framework's DistributionAnalyzer for comprehensive analysis
analyzer = DistributionAnalyzer()

numeric_cols = [n for n, c in findings.columns.items() 
                if c.inferred_type.value in ('numeric_continuous', 'numeric_discrete')
                and n not in [ENTITY_COLUMN, TIME_COLUMN]]

# Analyze all numeric columns using the framework
analyses = analyzer.analyze_dataframe(df, numeric_cols)
recommendations = {col: analyzer.recommend_transformation(analysis) 
                   for col, analysis in analyses.items()}

# Human-readable transformation names
TRANSFORM_DISPLAY_NAMES = {
    'none': 'None needed',
    'log': 'Log transform',
    'log1p': 'Log(1+x) transform',
    'sqrt': 'Square root',
    'box_cox': 'Box-Cox power transform',
    'yeo_johnson': 'Yeo-Johnson power transform',
    'quantile': 'Quantile normalization',
    'robust_scale': 'Robust scaling (median/IQR)',
    'standard_scale': 'Standard scaling (z-score)',
    'minmax_scale': 'Min-Max scaling',
}

print("="*70)
print("NUMERIC COLUMN PROFILES")
print("="*70)

for col_name in numeric_cols:
    col_info = findings.columns[col_name]
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"-" * 70)
    
    if analysis:
        print(f"📊 Distribution Statistics:")
        print(f"   Mean: {analysis.mean:.3f}  |  Median: {analysis.median:.3f}  |  Std: {analysis.std:.3f}")
        print(f"   Range: [{analysis.min_value:.3f}, {analysis.max_value:.3f}]")
        print(f"   Percentiles: 1%={analysis.percentiles['p1']:.3f}, 25%={analysis.q1:.3f}, 75%={analysis.q3:.3f}, 99%={analysis.percentiles['p99']:.3f}")
        print(f"\n📈 Shape Analysis:")
        skew_label = '(Right-skewed)' if analysis.skewness > 0.5 else '(Left-skewed)' if analysis.skewness < -0.5 else '(Symmetric)'
        print(f"   Skewness: {analysis.skewness:.2f} {skew_label}")
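        # Negative values in the output indicate excess kurtosis (normal distribution = 0)
        kurt_label = '(Heavy tails/outliers)' if analysis.kurtosis > 3 else '(Light tails)' if analysis.kurtosis < 0 else '(Moderate tails)'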
        print(f"   Kurtosis: {analysis.kurtosis:.2f} {kurt_label}")
        print(f"   Zeros: {analysis.zero_count:,} ({analysis.zero_percentage:.1f}%)")
        print(f"   Outliers (IQR): {analysis.outlier_count_iqr:,} ({analysis.outlier_percentage:.1f}%)")
        
        if rec:
            transform_display = TRANSFORM_DISPLAY_NAMES.get(rec.recommended_transform.value, rec.recommended_transform.value)
            print(f"\n🔧 Recommended Transformation: {transform_display}")
            print(f"   Reason: {rec.reason}")
            print(f"   Priority: {rec.priority}")
            if rec.warnings:
                for warn in rec.warnings:
                    print(f"   ⚠️ {warn}")

NUMERIC COLUMN PROFILES

Column: send_hour
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 13.502  |  Median: 13.000  |  Std: 3.846
   Range: [6.000, 22.000]
   Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000

📈 Shape Analysis:
   Skewness: 0.05 (Symmetric)
   Kurtosis: -0.54 (Light tails)
   Zeros: 0 (0.0%)
   Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: None needed
   Reason: Distribution is approximately normal (skewness: 0.05)
   Priority: low

Column: time_to_open_hours
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 3.997  |  Median: 2.800  |  Std: 4.020
   Range: [0.000, 35.600]
   Percentiles: 1%=0.000, 25%=1.100, 75%=5.500, 99%=18.770

📈 Shape Analysis:
   Skewness: 2.01 (Right-skewed)
   Kurtosis: 5.70 (Heavy tails/outliers)
   Zeros: 266 (1.3%)
   Outliers

In [17]:
# Per-column distribution visualizations with transformation recommendations
for col_name in numeric_cols:
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    if not analysis:
        continue
    
    data = df[col_name].dropna()
    fig = go.Figure()
    
    fig.add_trace(go.Histogram(x=data, nbinsx=50, name='Distribution',
                                marker_color='steelblue', opacity=0.7))
    
    mean_val = data.mean()
    median_val = data.median()
    
    # Position labels on opposite sides to avoid overlap
    mean_position = "top right" if mean_val >= median_val else "top left"
    median_position = "top left" if mean_val >= median_val else "top right"
    
    fig.add_vline(
        x=mean_val, line_dash="dash", line_color="red",
        annotation_text=f"Mean: {mean_val:.2f}",
        annotation_position=mean_position,
        annotation_font_color="red",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    
    fig.add_vline(
        x=median_val, line_dash="solid", line_color="green",
        annotation_text=f"Median: {median_val:.2f}",
        annotation_position=median_position,
        annotation_font_color="green",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    
    # Add 99th percentile marker if there are outliers
    if analysis.outlier_percentage > 5:
        fig.add_vline(x=analysis.percentiles['p99'], line_dash="dot", line_color="orange",
                      annotation_text=f"99th: {analysis.percentiles['p99']:.2f}",
                      annotation_position="top right",
                      annotation_font_color="orange",
                      annotation_bgcolor="rgba(255,255,255,0.8)")
    
    transform_key = rec.recommended_transform.value if rec else "none"
    transform_label = TRANSFORM_DISPLAY_NAMES.get(transform_key, transform_key)
    fig.update_layout(
        title=f"Distribution: {col_name}<br><sub>Skew: {analysis.skewness:.2f} | Kurt: {analysis.kurtosis:.2f} | Strategy: {transform_label}</sub>",
        xaxis_title=col_name,
        yaxis_title="Count",
        template='plotly_white',
        height=400
    )
    display_figure(fig)

In [None]:
print("\n" + "="*70)
print("CATEGORICAL COLUMN PROFILES")
print("="*70)

categorical_cols = [n for n, c in findings.columns.items()
                    if c.inferred_type.value in ('categorical_nominal', 'categorical_ordinal', 'binary', 'categorical_cyclical')
                    and c.inferred_type != ColumnType.TEXT  # TEXT columns processed separately in 01a_a
                    and n not in [ENTITY_COLUMN, TIME_COLUMN]]

for col_name in categorical_cols:
    col_info = findings.columns[col_name]
    cardinality = col_info.universal_metrics.get('distinct_count', df[col_name].nunique())
    
    print(f"\n{'='*50}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"Distinct Values: {cardinality}")
    
    # Encoding recommendation based on type and cardinality
    if col_info.inferred_type.value == 'categorical_cyclical':
        encoding_rec = "Sin/Cos encoding (cyclical)"
    elif cardinality <= 5:
        encoding_rec = "One-hot encoding (low cardinality)"
    elif cardinality <= 20:
        encoding_rec = "One-hot or Target encoding"
    else:
        encoding_rec = "Target encoding or Frequency encoding (high cardinality)"
    print(f"Recommended Encoding: {encoding_rec}")
    
    # Value counts visualization
    value_counts = df[col_name].value_counts().head(10)
    fig = charts.bar_chart(value_counts.index.tolist(), value_counts.values.tolist(),
                           title=f"Top Categories: {col_name}")
    display_figure(fig)

In [19]:
print("\n" + "="*70)
print("TRANSFORMATION SUMMARY")
print("="*70)

# Reuses TRANSFORM_DISPLAY_NAMES defined in the numeric column profiling cell above

transformations = []
for col_name, rec in recommendations.items():
    if rec and rec.recommended_transform != TransformationType.NONE:
        transform_key = rec.recommended_transform.value
        display_name = TRANSFORM_DISPLAY_NAMES.get(transform_key, transform_key)
        transformations.append({
            'column': col_name,
            'transform': display_name,
            'reason': rec.reason,
            'priority': rec.priority
        })

if transformations:
    print("\nRecommended transformations:")
    # Sort by priority
    priority_order = {'high': 0, 'medium': 1, 'low': 2}
    transformations.sort(key=lambda x: priority_order.get(x['priority'], 3))
    
    for t in transformations:
        priority_marker = "🔴" if t['priority'] == 'high' else "🟡" if t['priority'] == 'medium' else "🟢"
        print(f"\n   {priority_marker} {t['column']}: {t['transform']}")
        print(f"      Reason: {t['reason']}")
else:
    print("\nNo transformations needed - columns are well-behaved")


TRANSFORMATION SUMMARY

Recommended transformations:

   🔴 time_to_open_hours: Yeo-Johnson power transform
      Reason: High skewness (2.01) with non-positive values


In [None]:
# Aggregation perspective: which windows preserve temporal signal per column?
if numeric_cols and inter_event_times:
    median_iet = inter_event_series.median()
    print("="*70)
    print("TEMPORAL AGGREGATION PERSPECTIVE")
    print("="*70)
    print(f"\nMedian inter-event time: {median_iet:.0f} days")
    print(f"Expected events per window (at median cadence):")
    windows_days = [("7d", 7), ("30d", 30), ("90d", 90), ("180d", 180), ("365d", 365)]
    for label, days in windows_days:
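        # A zero median gap means events arrive faster than the time resolution, so cadence is unbounded
        expected = days / median_iet if median_iet > 0 else float("inf")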
        marker = "\u2705" if expected >= 2 else "\u26a0\ufe0f" if expected >= 1 else "\u274c"
        print(f"   {marker} {label}: ~{expected:.1f} events/entity")

    # Within-entity vs between-entity variance per column
    print(f"\nColumn Temporal Variability (within-entity CV vs between-entity CV):")
    print(f"{'Column':<25} {'Within-CV':<12} {'Between-CV':<12} {'Ratio':<8} {'Aggregation Guidance'}")
    print("-" * 90)

    for col in numeric_cols:
        col_data = df_temp.groupby(ENTITY_COLUMN)[col]
        entity_means = col_data.mean()
        entity_stds = col_data.std()

        within_cv = (entity_stds / entity_means.abs().clip(lower=1e-10)).median()
        between_cv = entity_means.std() / entity_means.abs().mean() if entity_means.abs().mean() > 1e-10 else 0.0

        if between_cv > 0:
            ratio = within_cv / between_cv
        else:
            ratio = float("inf") if within_cv > 0 else 0.0

        if not np.isfinite(within_cv):
            # No entity has enough repeat events for a defined per-entity std
            guidance = "Too few repeat events -> within-entity variability undefined"
        elif within_cv < 0.3:
            guidance = "Stable per entity -> all_time mean sufficient"
        elif ratio > 1.5:
            guidance = "High temporal dynamics -> shorter windows preserve signal"
        elif ratio > 0.5:
            guidance = "Mixed -> both short and long windows add value"
        else:
            guidance = "Entity-driven -> between-entity differences dominate"

        within_str = f"{within_cv:.2f}" if np.isfinite(within_cv) else "n/a"
        ratio_str = "n/a" if np.isnan(ratio) else "inf" if np.isinf(ratio) else f"{ratio:.2f}"
        print(f"{col:<25} {within_str:<12} {between_cv:<12.2f} {ratio_str:<8} {guidance}")

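    print("\nInterpretation:")
    print("  Within-CV: how much each entity's values vary across its own events (median over entities)")
    print("  Between-CV: how much entity averages differ from each other")
    print("  Within-CV < 0.3: values are stable per entity -> all_time mean is sufficient")
    print("  Ratio > 1.5: temporal variation dominates -> shorter windows capture dynamics")
    print("  Ratio 0.5-1.5: mixed -> both short and long windows add value")
    print("  Ratio <= 0.5: entity identity dominates -> longer windows (or all_time) are sufficient")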
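
To make the within-entity versus between-entity comparison concrete, the sketch below applies the same two formulas to a toy event table. The column names `entity_id`, `stable_metric`, and `volatile_metric`, and the helper `cv_summary`, are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy events: two entities with four events each
events = pd.DataFrame({
    "entity_id": ["a"] * 4 + ["b"] * 4,
    # Constant per entity, different across entities
    "stable_metric": [10, 10, 10, 10, 50, 50, 50, 50],
    # Varies within each entity, identical entity averages
    "volatile_metric": [1, 9, 1, 9, 1, 9, 1, 9],
})

def cv_summary(frame, entity_col, value_col):
    """Return (within_cv, between_cv) using the same formulas as the cell above."""
    grouped = frame.groupby(entity_col)[value_col]
    entity_means = grouped.mean()
    entity_stds = grouped.std()
    # Median of per-entity coefficients of variation
    within_cv = (entity_stds / entity_means.abs().clip(lower=1e-10)).median()
    mean_abs = entity_means.abs().mean()
    # Spread of entity averages relative to their typical magnitude
    between_cv = entity_means.std() / mean_abs if mean_abs > 1e-10 else 0.0
    return float(within_cv), float(between_cv)

stable_within, stable_between = cv_summary(events, "entity_id", "stable_metric")
volatile_within, volatile_between = cv_summary(events, "entity_id", "volatile_metric")

print(f"stable_metric:   within={stable_within:.2f}  between={stable_between:.2f}")
print(f"volatile_metric: within={volatile_within:.2f}  between={volatile_between:.2f}")
```

Here `stable_metric` has zero within-entity variation, so an all-time mean loses nothing, while `volatile_metric` has zero between-entity variation, so only windowed aggregates can carry signal.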


## 1a.9 Update Findings with Time Series Metadata

In [None]:
from customer_retention.analysis.auto_explorer.findings import TimeSeriesMetadata
from customer_retention.stages.profiling import WindowRecommendationCollector

# Build window recommendations from data coverage analysis
window_collector = WindowRecommendationCollector(coverage_threshold=0.10)
window_collector.add_segment_context(segment_result)
window_collector.add_quadrant_context(quadrant_result)

# Add inter-event timing context if available
if inter_event_times:
    window_collector.add_inter_event_context(
        median_days=inter_event_series.median(),
        mean_days=inter_event_series.mean(),
    )

window_result = window_collector.compute_union(
    lifecycles=quadrant_result.lifecycles,
    time_span_days=ts_profile.time_span_days,
    value_columns=len(numeric_cols),
    agg_funcs=4,
)

print(f"Selected windows: {window_result.windows}")
print(f"Total features per entity: ~{window_result.feature_count_estimate}\n")

explanation = window_result.explanation.drop(columns=["window_days"]).copy()
explanation["coverage_pct"] = (explanation["coverage_pct"] * 100).round(1).astype(str) + "%"
explanation["meaningful_pct"] = (explanation["meaningful_pct"] * 100).round(1).astype(str) + "%"
display_table(explanation)

print(f"\nCoverage: % of entities with enough tenure AND expected >=2 events in that window")
print(f"Meaningful: among entities with enough tenure, % that have sufficient event density")
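
The internals of `WindowRecommendationCollector` are not shown in this notebook, so the following is only a sketch of the two definitions printed above. It assumes that expected events in a window are estimated from each entity's average event rate; the `tenure_days` and `event_count` columns and the `window_coverage` helper are hypothetical.

```python
import pandas as pd

# Hypothetical per-entity lifecycle table; column names are illustrative only
lifecycles = pd.DataFrame(
    {
        "tenure_days": [400, 200, 40, 10, 365],
        "event_count": [40, 4, 8, 3, 5],
    },
    index=["e1", "e2", "e3", "e4", "e5"],
)

def window_coverage(lifecycles, window_days, min_events=2):
    """Return (coverage, meaningful) fractions for one window length."""
    # Entity has been around at least as long as the window
    has_tenure = lifecycles["tenure_days"] >= window_days
    # Assumption: expected events = average daily rate * window length
    events_per_day = lifecycles["event_count"] / lifecycles["tenure_days"].clip(lower=1)
    dense_enough = events_per_day * window_days >= min_events
    # Coverage: share of all entities with tenure AND density
    coverage = float((has_tenure & dense_enough).mean())
    # Meaningful: share with density among entities that have tenure
    meaningful = float(dense_enough[has_tenure].mean()) if has_tenure.any() else 0.0
    return coverage, meaningful

results = {}
for label, days in [("7d", 7), ("30d", 30), ("365d", 365)]:
    results[label] = window_coverage(lifecycles, days)
    coverage, meaningful = results[label]
    print(f"{label}: coverage={coverage:.0%}  meaningful={meaningful:.0%}")
```

Under these assumptions a window would be kept when its coverage reaches the threshold passed to the collector (0.10 in the cell above).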

In [None]:
h = window_result.heterogeneity

print("Temporal Heterogeneity (eta-squared):")
print(f"  eta² measures the fraction of variance in a metric explained by lifecycle quadrant grouping.")
print(f"  Scale: 0 = no group differences, 1 = all variance is between groups.")
print(f"  Thresholds: <0.06 = low | 0.06-0.14 = moderate | >0.14 = high effect size\n")

eta_max = max(h.eta_squared_intensity, h.eta_squared_event_count)
print(f"  Intensity eta²:   {h.eta_squared_intensity:.3f}  {'<-- dominant' if h.eta_squared_intensity >= h.eta_squared_event_count else ''}")
print(f"  Event count eta²: {h.eta_squared_event_count:.3f}  {'<-- dominant' if h.eta_squared_event_count > h.eta_squared_intensity else ''}")
print(f"  Overall level:    {h.heterogeneity_level.upper()} (max eta² = {eta_max:.3f})\n")

advisory_labels = {
    "single_model": "Single model with union windows is appropriate",
    "consider_segment_feature": "Add lifecycle_quadrant as a categorical feature to the model",
    "consider_separate_models": "Consider separate models for entities with vs without history",
}
advisory_text = advisory_labels.get(h.segmentation_advisory, h.segmentation_advisory)

print(f"Recommendation: {advisory_text}")
for r in h.advisory_rationale:
    print(f"  -> {r}")
print()
display_table(h.coverage_table)
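
For reference, eta-squared is the between-group sum of squares divided by the total sum of squares. The sketch below computes it for a toy metric grouped by a categorical label. The `eta_squared` and `effect_level` helpers are hypothetical and are not the library's implementation; how the exact boundary values 0.06 and 0.14 are classified is an assumption.

```python
import pandas as pd

def eta_squared(values, groups):
    """Fraction of variance in `values` explained by `groups` (SS_between / SS_total)."""
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    groups = pd.Series(groups).reset_index(drop=True)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # No variance at all, so nothing to explain
    stats = values.groupby(groups).agg(["mean", "count"])
    ss_between = (stats["count"] * (stats["mean"] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

def effect_level(eta2):
    """Map eta-squared to the thresholds printed above (boundary handling assumed)."""
    if eta2 < 0.06:
        return "low"
    if eta2 <= 0.14:
        return "moderate"
    return "high"

quadrants = ["a", "a", "a", "b", "b", "b"]
separated = eta_squared([1, 1, 1, 5, 5, 5], quadrants)   # groups explain everything
identical = eta_squared([1, 2, 3, 1, 2, 3], quadrants)   # group means are equal
partial = eta_squared([1, 2, 3, 4, 5, 6], quadrants)     # 13.5 / 17.5

print(separated, identical, round(partial, 3), effect_level(partial))
```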

In [None]:
advisory_labels = {
    "single_model": "Single model with union windows is appropriate",
    "consider_segment_feature": "Add lifecycle_quadrant as a categorical feature to the model",
    "consider_separate_models": "Consider separate models for entities with vs without history",
}

ts_metadata = TimeSeriesMetadata(
    granularity=DatasetGranularity.EVENT_LEVEL,
    entity_column=ENTITY_COLUMN,
    time_column=TIME_COLUMN,
    avg_events_per_entity=ts_profile.events_per_entity.mean,
    time_span_days=ts_profile.time_span_days,
    unique_entities=ts_profile.unique_entities,
    suggested_aggregations=window_result.windows,
    window_coverage_threshold=window_result.coverage_threshold,
    heterogeneity_level=window_result.heterogeneity.heterogeneity_level,
    eta_squared_intensity=window_result.heterogeneity.eta_squared_intensity,
    eta_squared_event_count=window_result.heterogeneity.eta_squared_event_count,
    temporal_segmentation_advisory=window_result.heterogeneity.segmentation_advisory,
    temporal_segmentation_recommendation=advisory_labels.get(
        window_result.heterogeneity.segmentation_advisory,
        window_result.heterogeneity.segmentation_advisory,
    ),
    drift_risk_level=drift.risk_level,
    volume_drift_risk=drift.volume_drift_risk,
    population_stability=drift.population_stability,
    regime_count=drift.regime_count,
    recommended_training_start=(
        drift.recommended_training_start.isoformat() if drift.recommended_training_start else None
    ),
)

findings.time_series_metadata = ts_metadata
findings.save(FINDINGS_PATH)

print(f"Updated findings saved to: {FINDINGS_PATH}")
print(f"  Suggested aggregations: {ts_metadata.suggested_aggregations}")
print(f"  Heterogeneity: {ts_metadata.heterogeneity_level}")
print(f"  Recommendation: {ts_metadata.temporal_segmentation_recommendation}")
print(f"  Drift risk: {ts_metadata.drift_risk_level}")

---

## Summary: What We Learned

In this notebook, we performed a deep dive on time series data:

1. **Event Distribution** - Analyzed how events are distributed across entities
2. **Activity Segments** - Categorized entities by activity level (one-time, low, medium, high)
3. **Lifecycle Analysis** - Examined entity tenure and duration patterns
4. **Temporal Coverage** - Visualized data volume over time
5. **Inter-Event Timing** - Understood engagement frequency patterns
6. **Feature Opportunities** - Identified time-window aggregations and recency features

## Key Metrics for This Dataset

| Metric | Value |
|--------|-------|
| Unique Entities | Fill from `ts_profile.unique_entities` |
| Avg Events/Entity | Fill from `ts_profile.events_per_entity.mean` |
| Median Lifecycle | Fill from the lifecycle analysis |
| Median Inter-Event Days | Fill from `inter_event_series.median()` |
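
The four metrics in this table can also be derived directly from an event table. The sketch below does so on toy data; `entity_id` and `event_time` are hypothetical column names standing in for `ENTITY_COLUMN` and `TIME_COLUMN`, and the values are not from this dataset.

```python
import pandas as pd

# Toy event log: entity A has 3 events, B has 2, C has 1
events = pd.DataFrame({
    "entity_id": ["A", "A", "A", "B", "B", "C"],
    "event_time": pd.to_datetime([
        "2024-01-01", "2024-01-11", "2024-01-31",
        "2024-02-01", "2024-02-05",
        "2024-03-01",
    ]),
}).sort_values(["entity_id", "event_time"])

grouped = events.groupby("entity_id")["event_time"]

# Lifecycle: days between each entity's first and last event
lifecycle_days = (grouped.max() - grouped.min()).dt.total_seconds() / 86400

# Inter-event gaps: days between consecutive events of the same entity
gaps_days = grouped.diff().dt.total_seconds().dropna() / 86400

metrics = {
    "unique_entities": int(events["entity_id"].nunique()),
    "avg_events_per_entity": float(grouped.size().mean()),
    "median_lifecycle_days": float(lifecycle_days.median()),
    "median_inter_event_days": float(gaps_days.median()),
}
print(metrics)
```

Single-event entities contribute a lifecycle of zero days and no inter-event gaps, which is why the two medians can differ substantially.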

---

## Next Steps

Continue with the **Event Bronze Track**:

1. **01b_temporal_quality.ipynb** - Check for duplicate events, temporal gaps, future dates
2. **01c_temporal_patterns.ipynb** - Detect trends, seasonality, cohort analysis
3. **01d_event_aggregation.ipynb** - Aggregate events to entity-level (produces new dataset)

After completing 01d, continue with the **Entity Bronze Track** (02 → 03 → 04) on the aggregated data.