<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [2]</a>'.</span>

# Chapter 1a: Temporal Deep Dive (Event Bronze Track)

**Purpose:** Analyze event-level (time series) datasets with focus on temporal patterns, entity lifecycles, and event frequency distributions.

**When to use this notebook:**
- Your dataset was detected as `EVENT_LEVEL` granularity in notebook 01
- You have multiple rows per entity (customer, user, etc.)
- Each row represents an event with a timestamp

**What you'll learn:**
- How to profile entity lifecycles (first event, last event, duration)
- How event frequency is distributed across entities
- What inter-event timing patterns look like and what they imply
- Which time series-specific feature engineering opportunities are available

**Outputs:**
- Entity lifecycle visualizations
- Event frequency distribution analysis
- Inter-event timing statistics
- Updated exploration findings with time series metadata

---

## Understanding Time Series Profiling

| Metric | Description | Why It Matters |
|--------|-------------|----------------|
| **Events per Entity** | Distribution of event counts | Identifies power users vs. one-time users |
| **Entity Lifecycle** | Duration from first to last event | Reveals customer tenure patterns |
| **Inter-event Time** | Time between consecutive events | Indicates engagement patterns |
| **Time Span** | Overall data period coverage | Helps plan time window aggregations |

**Aggregation Windows (used in notebook 01d):**
- 24h: Very recent activity
- 7d: Weekly patterns
- 30d: Monthly patterns
- 90d: Quarterly trends
- 180d: Semi-annual patterns
- 365d: Annual patterns
- all-time: Historical totals
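
As a rough illustration of what a trailing-window aggregation looks like, the sketch below counts events and sums a numeric column per entity for a few of the windows listed above. It uses plain pandas on a toy frame; the column names (`customer_id`, `event_date`, `amount`), the `window_features` helper, and the `as_of` reference date are assumptions made for this example and are not taken from notebook 01d.

```python
import pandas as pd

# Toy event log; column names are illustrative, not taken from the dataset.
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "event_date": pd.to_datetime(
        ["2024-01-01", "2024-01-25", "2024-01-30", "2023-06-01", "2024-01-29"]
    ),
    "amount": [10.0, 20.0, 5.0, 7.0, 3.0],
})


def window_features(df, entity_col, time_col, value_col, as_of, windows):
    """Hypothetical helper: per-entity count and sum in trailing windows ending at as_of."""
    entities = pd.Index(sorted(df[entity_col].unique()), name=entity_col)
    out = pd.DataFrame(index=entities)
    for label, delta in windows.items():
        # delta=None means all-time: keep every event up to as_of
        mask = df[time_col] <= as_of
        if delta is not None:
            mask &= df[time_col] > as_of - delta
        grouped = df[mask].groupby(entity_col)[value_col]
        # Entities with no events in the window get 0 rather than NaN
        out[f"event_count_{label}"] = grouped.size().reindex(entities, fill_value=0)
        out[f"{value_col}_sum_{label}"] = grouped.sum().reindex(entities, fill_value=0.0)
    return out


windows = {"7d": pd.Timedelta(days=7), "30d": pd.Timedelta(days=30), "all_time": None}
features = window_features(
    events, "customer_id", "event_date", "amount",
    as_of=pd.Timestamp("2024-01-31"), windows=windows,
)
print(features)
```

Fixing `as_of` to a snapshot date means each feature only sees events at or before that date, which is one way to keep window features from leaking future information.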

## 1a.1 Load Previous Findings

In [1]:
from customer_retention.analysis.auto_explorer import ExplorationFindings
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType, DatasetGranularity
from customer_retention.stages.profiling import (
    TimeSeriesProfiler, TimeSeriesProfile,
    TypeDetector,
    DistributionAnalyzer, TransformationType,
    TemporalAnalyzer, TemporalGranularity,
    SegmentAnalyzer
)
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [2]:
# === CONFIGURATION ===
# Option 1: Set the exact path from notebook 01 output
# FINDINGS_PATH = "../experiments/findings/transactions_abc123_findings.yaml"

# Option 2: Auto-discover findings files
from pathlib import Path

FINDINGS_DIR = Path("../experiments/findings")

# Find all findings files
findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

# Sort by modification time (most recent first)
findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Found {len(findings_files)} findings file(s)")
print(f"Using: {FINDINGS_PATH}")
if len(findings_files) > 1:
    print(f"Other available: {[str(f.name) for f in findings_files[1:3]]}")

findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")

FileNotFoundError: No findings files found in ../experiments/findings. Run notebook 01 first.

In [None]:
# Verify this is a time series dataset
if findings.is_time_series:
    print("\u2705 Dataset confirmed as TIME SERIES (event-level)")
    ts_meta = findings.time_series_metadata
    print(f"   Entity column: {ts_meta.entity_column}")
    print(f"   Time column: {ts_meta.time_column}")
    # Only print the average when it was recorded (avoids printing an empty line)
    if ts_meta.avg_events_per_entity is not None:
        print(f"   Avg events per entity: {ts_meta.avg_events_per_entity:.1f}")
else:
    print("\u26a0\ufe0f This dataset was NOT detected as time series.")
    print("   Consider using 02_column_deep_dive.ipynb instead.")
    print("   Or manually specify entity and time columns below.")

## 1a.2 Load Source Data & Configure Columns

In [None]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
charts = ChartBuilder()

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {data_source}")

In [None]:
# === COLUMN CONFIGURATION ===
# These will be auto-populated from findings if available
# Override manually if needed

if findings.is_time_series and findings.time_series_metadata:
    ENTITY_COLUMN = findings.time_series_metadata.entity_column
    TIME_COLUMN = findings.time_series_metadata.time_column
else:
    # Manual configuration - uncomment and set if auto-detection failed
    # ENTITY_COLUMN = "customer_id"
    # TIME_COLUMN = "event_date"
    
    # Try auto-detection
    detector = TypeDetector()
    granularity = detector.detect_granularity(df)
    ENTITY_COLUMN = granularity.entity_column
    TIME_COLUMN = granularity.time_column

print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")

if not ENTITY_COLUMN or not TIME_COLUMN:
    raise ValueError("Please set ENTITY_COLUMN and TIME_COLUMN manually above")

## 1a.3 Time Series Profile Overview

**What we analyze:**
- Total events and unique entities
- Time span coverage
- Events per entity distribution
- Entity lifecycle metrics

In [None]:
# Create the time series profiler and run analysis
profiler = TimeSeriesProfiler(entity_column=ENTITY_COLUMN, time_column=TIME_COLUMN)
ts_profile = profiler.profile(df)

print("="*70)
print("TIME SERIES PROFILE SUMMARY")
print("="*70)
print(f"\n\U0001f4ca Dataset Overview:")
print(f"   Total Events: {ts_profile.total_events:,}")
print(f"   Unique Entities: {ts_profile.unique_entities:,}")
print(f"   Avg Events/Entity: {ts_profile.events_per_entity.mean:.1f}")
print(f"   Time Span: {ts_profile.time_span_days:,} days ({ts_profile.time_span_days/365:.1f} years)")

print(f"\n\U0001f4c5 Date Range:")
print(f"   First Event: {ts_profile.first_event_date}")
print(f"   Last Event: {ts_profile.last_event_date}")

print(f"\n\u23f1\ufe0f  Inter-Event Timing:")
if ts_profile.avg_inter_event_days is not None:
    print(f"   Avg Days Between Events: {ts_profile.avg_inter_event_days:.1f}")
else:
    print("   Not enough data to compute inter-event timing")

## 1a.4 Events per Entity Distribution

**📖 How to Interpret:**
- **Right-skewed** distribution (common): Most entities have few events, some have many
- **Bimodal**: May indicate two distinct user segments
- **Power law**: Very common in transaction data (few heavy users, many light users)
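
The shapes described above can also be checked numerically. The following is a minimal sketch on synthetic, heavy-tailed counts (drawn from a Zipf distribution purely for illustration; nothing here comes from the real dataset). It compares the mean-to-median ratio and the sample skewness before and after a `log1p` transform.

```python
import numpy as np
import pandas as pd

# Synthetic, heavy-tailed event counts (illustrative only)
rng = np.random.default_rng(0)
event_counts = pd.Series(rng.zipf(a=2.0, size=5_000))

# A mean well above the median is a quick sign of right skew
mean_to_median = event_counts.mean() / event_counts.median()

# Sample skewness on the raw scale and after log(1 + x)
raw_skew = event_counts.skew()
log_skew = np.log1p(event_counts).skew()

print(f"mean / median: {mean_to_median:.2f}")
print(f"skewness raw:  {raw_skew:.2f}")
print(f"skewness log1p: {log_skew:.2f}")
```

On data like this, the log transform typically pulls the skewness down sharply, which is why it is a common first choice for event count features.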

In [None]:
# Events per entity distribution statistics
events_dist = ts_profile.events_per_entity

print("\U0001f4ca Events per Entity Distribution:")
print(f"   Min: {events_dist.min:.0f}")
print(f"   25th percentile: {events_dist.q25:.0f}")
print(f"   Median: {events_dist.median:.0f}")
print(f"   Mean: {events_dist.mean:.1f}")
print(f"   75th percentile: {events_dist.q75:.0f}")
print(f"   Max: {events_dist.max:.0f}")
print(f"   Std Dev: {events_dist.std:.1f}")

# Interpretation
if events_dist.mean > events_dist.median * 1.5:
    print("\n\U0001f4a1 Insight: Distribution is RIGHT-SKEWED (mean >> median)")
    print("   This is typical - a few entities have many events, most have few.")
    print("   Consider: Log transform for event count features, or segment by activity level.")

In [None]:
# Histogram of events per entity
event_counts = ts_profile.entity_lifecycles["event_count"]

fig = go.Figure()

# For discrete integer data, use bar chart with value_counts for cleaner display
value_counts = event_counts.value_counts().sort_index()

fig.add_trace(go.Bar(
    x=value_counts.index,
    y=value_counts.values,
    name="Event Count",
    marker_color="steelblue",
    opacity=0.7
))

# Add mean and median lines - offset annotations to avoid overlap
mean_val = events_dist.mean
median_val = events_dist.median

fig.add_vline(x=mean_val, line_dash="dash", line_color="red")
fig.add_vline(x=median_val, line_dash="solid", line_color="green")

# Use paper-referenced annotations to avoid overlap
fig.add_annotation(
    text=f"Mean: {mean_val:.1f}",
    xref="paper", yref="paper",
    x=0.98, y=0.95, showarrow=False,
    font=dict(size=11, color="red"),
    xanchor="right"
)
fig.add_annotation(
    text=f"Median: {median_val:.0f}",
    xref="paper", yref="paper",
    x=0.98, y=0.88, showarrow=False,
    font=dict(size=11, color="green"),
    xanchor="right"
)

# Use log scale on Y-axis (count of entities) if highly skewed, not X-axis
use_log_y = value_counts.max() > value_counts.median() * 50

title_text = "Events per Entity Distribution"
if use_log_y:
    title_text += "<br><sub>⚠️ Log scale on Y-axis due to high skewness</sub>"

fig.update_layout(
    title=title_text,
    xaxis_title="Number of Events",
    yaxis_title="Number of Entities",
    template="plotly_white",
    height=400,
    yaxis_type="log" if use_log_y else "linear"
)

display_figure(fig)

In [None]:
# Entity segmentation by activity level
def categorize_activity(count, q25, q75):
    if count <= 1:
        return "One-time"
    elif count <= q25:
        return "Low Activity"
    elif count <= q75:
        return "Medium Activity"
    else:
        return "High Activity"

lifecycles = ts_profile.entity_lifecycles.copy()
lifecycles["activity_segment"] = lifecycles["event_count"].apply(
    lambda x: categorize_activity(x, events_dist.q25, events_dist.q75)
)

segment_counts = lifecycles["activity_segment"].value_counts()
segment_order = ["One-time", "Low Activity", "Medium Activity", "High Activity"]
segment_counts = segment_counts.reindex([s for s in segment_order if s in segment_counts.index])

print("\n\U0001f465 Entity Activity Segments:")
for segment, count in segment_counts.items():
    pct = count / len(lifecycles) * 100
    print(f"   {segment}: {count:,} ({pct:.1f}%)")

# Calculate segment statistics for the second chart
segment_stats = lifecycles.groupby("activity_segment").agg({
    "event_count": ["mean", "median", "max"]
}).round(1)
segment_stats.columns = ["Avg Events", "Median Events", "Max Events"]
segment_stats = segment_stats.reindex([s for s in segment_order if s in segment_stats.index])

# Side-by-side charts: Pie chart + Bar chart with segment stats
colors = {"One-time": "#d62728", "Low Activity": "#ff7f0e", 
          "Medium Activity": "#2ca02c", "High Activity": "#1f77b4"}

fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "pie"}, {"type": "bar"}]],
    subplot_titles=("Entity Distribution", "Avg Events per Segment"),
    horizontal_spacing=0.12
)

# Left: Pie chart
fig.add_trace(go.Pie(
    labels=segment_counts.index,
    values=segment_counts.values,
    marker_colors=[colors.get(s, "gray") for s in segment_counts.index],
    textinfo="label+percent",
    hole=0.3,
    showlegend=False
), row=1, col=1)

# Right: Bar chart showing average events per segment
fig.add_trace(go.Bar(
    x=segment_stats.index,
    y=segment_stats["Avg Events"],
    marker_color=[colors.get(s, "gray") for s in segment_stats.index],
    text=[f"{v:.1f}" for v in segment_stats["Avg Events"]],
    textposition="outside",
    showlegend=False
), row=1, col=2)

fig.update_layout(
    title="Entity Activity Segments",
    height=400,
    template="plotly_white"
)
fig.update_yaxes(title_text="Avg Event Count", row=1, col=2)

display_figure(fig)

## 1a.5 Entity Lifecycle Analysis

**📖 Understanding Lifecycles:**
- **Tenure**: Days between first and last event
- **Active period**: Measured from first to last event only; gaps in activity are not subtracted
- **Short tenure with many events**: Intense but brief engagement
- **Long tenure with few events**: Occasional but loyal user
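
For readers who want to see what a lifecycle table contains, here is a minimal pandas sketch that derives first event, last event, event count, and duration per entity. It is not necessarily how `TimeSeriesProfiler` builds `entity_lifecycles`; the toy data and the `customer_id`, `event_date`, and `last_event` names are assumptions for this example.

```python
import pandas as pd

# Toy event log; names and values are illustrative only.
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "c", "c"],
    "event_date": pd.to_datetime([
        "2024-01-01", "2024-01-10", "2024-03-01",  # entity a: three events
        "2024-02-15",                              # entity b: one event
        "2024-01-05", "2024-01-05",                # entity c: two events, same day
    ]),
})

# One row per entity: first event, last event, number of events
lifecycle = events.groupby("customer_id")["event_date"].agg(
    first_event="min", last_event="max", event_count="count"
)

# Tenure is simply last minus first; inactive gaps are not subtracted
lifecycle["duration_days"] = (
    lifecycle["last_event"] - lifecycle["first_event"]
).dt.days

print(lifecycle)
```

Entity `c` has two events but a duration of zero days, so a zero duration alone does not prove that an entity has only one event.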

In [None]:
# Lifecycle duration distribution
duration_stats = lifecycles["duration_days"].describe()

print("\U0001f4c6 Entity Lifecycle Duration (days):")
print(f"   Min: {duration_stats['min']:.0f}")
print(f"   25th percentile: {duration_stats['25%']:.0f}")
print(f"   Median: {duration_stats['50%']:.0f}")
print(f"   Mean: {duration_stats['mean']:.1f}")
print(f"   75th percentile: {duration_stats['75%']:.0f}")
print(f"   Max: {duration_stats['max']:.0f}")

# Zero-duration entities, used as a proxy for single-event entities
# (this also matches entities whose events all share one timestamp)
single_event = (lifecycles["duration_days"] == 0).sum()
print(f"\n\U0001f6a8 Single-event entities: {single_event:,} ({single_event/len(lifecycles)*100:.1f}%)")
if single_event / len(lifecycles) > 0.3:
    print("   High proportion of one-time users - consider retention analysis")

In [None]:
# Lifecycle duration histogram (excluding zeros for clarity)
non_zero_duration = lifecycles[lifecycles["duration_days"] > 0]["duration_days"]

if len(non_zero_duration) > 0:
    fig = go.Figure()
    
    fig.add_trace(go.Histogram(
        x=non_zero_duration,
        nbinsx=50,
        name="Duration",
        marker_color="mediumpurple",
        opacity=0.7
    ))
    
    fig.add_vline(x=non_zero_duration.median(), line_dash="solid", line_color="green",
                  annotation_text=f"Median: {non_zero_duration.median():.0f} days",
                  annotation_position="top right")
    
    fig.update_layout(
        title=f"Entity Lifecycle Duration (excluding {single_event:,} single-event entities)",
        xaxis_title="Duration (days)",
        yaxis_title="Number of Entities",
        template="plotly_white",
        height=400
    )
    display_figure(fig)
else:
    print("All entities have only single events - no duration distribution to show")

In [None]:
# Scatter: Event count vs Lifecycle duration
fig = px.scatter(
    lifecycles,
    x="duration_days",
    y="event_count",
    color="activity_segment",
    color_discrete_map=colors,
    opacity=0.5,
    title="Event Count vs Lifecycle Duration",
    labels={"duration_days": "Lifecycle Duration (days)", "event_count": "Event Count"}
)

fig.update_layout(template="plotly_white", height=500)
display_figure(fig)

## 1a.6 Temporal Coverage Analysis

**📖 Why This Matters:**
- Shows when data collection started/ended
- Identifies gaps or seasonality in data volume
- Helps plan time window aggregations
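
One simple way to surface gaps in data volume is to resample events to a regular grid and look for periods with zero events. The sketch below does this at daily resolution on a handful of made-up timestamps; the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up event timestamps with a three-day hole in the middle
timestamps = pd.to_datetime([
    "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-06", "2024-01-07",
])

# One unit per event, summed per calendar day; days without events become 0
daily = pd.Series(1, index=timestamps).resample("D").sum()

# Days inside the observed range that have no events at all
empty_days = daily[daily == 0].index

print(daily)
print(f"Days without events: {len(empty_days)}")
```

The same idea works at weekly or monthly resolution by changing the resampling frequency, at the cost of hiding shorter gaps.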

In [None]:
# Parse time column
df_temp = df.copy()
df_temp[TIME_COLUMN] = pd.to_datetime(df_temp[TIME_COLUMN])

# Events over time (daily/weekly/monthly depending on span)
time_span_days = ts_profile.time_span_days

if time_span_days <= 90:
    freq = "D"
    freq_name = "Daily"
elif time_span_days <= 365:
    freq = "W"
    freq_name = "Weekly"
else:
    freq = "ME"  # Month-end frequency (pandas 2.2+)
    freq_name = "Monthly"

events_over_time = df_temp.groupby(pd.Grouper(key=TIME_COLUMN, freq=freq)).size()

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=events_over_time.index,
    y=events_over_time.values,
    mode="lines",
    fill="tozeroy",
    name="Events",
    line_color="steelblue"
))

fig.update_layout(
    title=f"{freq_name} Event Volume Over Time",
    xaxis_title="Date",
    yaxis_title="Number of Events",
    template="plotly_white",
    height=400
)
display_figure(fig)

In [None]:
# New entities over time (cohort arrival)
first_events = lifecycles.copy()
first_events["first_event"] = pd.to_datetime(first_events["first_event"])

new_entities = first_events.groupby(
    pd.Grouper(key="first_event", freq=freq)  # Uses freq from previous cell
).size()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=new_entities.index,
    y=new_entities.values,
    name="New Entities",
    marker_color="mediumseagreen"
))

fig.update_layout(
    title=f"New Entities Over Time ({freq_name})",
    xaxis_title="First Event Date",
    yaxis_title="Number of New Entities",
    template="plotly_white",
    height=400
)
display_figure(fig)

## 1a.7 Inter-Event Timing Analysis

**📖 Understanding Inter-Event Time:**
- Time between consecutive events for each entity
- Short inter-event time: Frequent engagement
- Long inter-event time: Sporadic usage or churn risk
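
Inter-event times can be computed without an explicit loop by sorting events within each entity and taking a grouped difference. This is a sketch on toy data, with `customer_id` and `event_date` as assumed column names; entities with a single event contribute no gaps.

```python
import pandas as pd

# Toy event log, deliberately unsorted; names and values are illustrative only.
events = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a", "b", "c"],
    "event_date": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-02 00:00", "2024-01-03 00:00",
        "2024-01-10 00:00", "2024-01-02 12:00", "2024-01-05 00:00",
    ]),
})

# Sort so that consecutive rows within an entity are consecutive events
ordered = events.sort_values(["customer_id", "event_date"])

# Grouped diff: time since the same entity's previous event (NaT for its first event)
ordered["gap_days"] = (
    ordered.groupby("customer_id")["event_date"].diff().dt.total_seconds() / 86400
)

# Pooled gaps across all entities, and a per-entity summary
gap_days = ordered["gap_days"].dropna()
per_entity_median = ordered.groupby("customer_id")["gap_days"].median()

print(gap_days.describe())
print(per_entity_median)
```

The pooled distribution is dominated by high-activity entities because they contribute more gaps, so a per-entity summary such as the median gap can give a more balanced view.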

In [None]:
# Compute inter-event times for all entities with >1 event
inter_event_times = []

for entity, group in df_temp.groupby(ENTITY_COLUMN):
    if len(group) < 2:
        continue
    sorted_times = group[TIME_COLUMN].sort_values()
    diffs = sorted_times.diff().dropna()
    inter_event_times.extend(diffs.dt.total_seconds() / 86400)  # Convert to days

if inter_event_times:
    inter_event_series = pd.Series(inter_event_times)
    
    print("\u23f1\ufe0f  Inter-Event Time Distribution (days):")
    print(f"   Min: {inter_event_series.min():.2f}")
    print(f"   25th percentile: {inter_event_series.quantile(0.25):.2f}")
    print(f"   Median: {inter_event_series.median():.2f}")
    print(f"   Mean: {inter_event_series.mean():.2f}")
    print(f"   75th percentile: {inter_event_series.quantile(0.75):.2f}")
    print(f"   Max: {inter_event_series.max():.2f}")
    
    # Histogram
    fig = go.Figure()
    
    # Cap at 99th percentile for visualization
    cap = inter_event_series.quantile(0.99)
    display_data = inter_event_series[inter_event_series <= cap]
    
    fig.add_trace(go.Histogram(
        x=display_data,
        nbinsx=50,
        name="Inter-Event Time",
        marker_color="coral",
        opacity=0.7
    ))
    
    fig.add_vline(x=inter_event_series.median(), line_dash="solid", line_color="green",
                  annotation_text=f"Median: {inter_event_series.median():.1f} days",
                  annotation_position="top right")
    
    fig.update_layout(
        title=f"Inter-Event Time Distribution (capped at {cap:.0f} days = 99th percentile)",
        xaxis_title="Days Between Events",
        yaxis_title="Frequency",
        template="plotly_white",
        height=400
    )
    display_figure(fig)
else:
    print("Not enough multi-event entities to analyze inter-event timing")

## 1a.8 Column Distributions

Standard column profiling applied to event-level data - distributions, outliers, transformation needs.

In [None]:
# Use framework's DistributionAnalyzer for comprehensive analysis
analyzer = DistributionAnalyzer()

numeric_cols = [n for n, c in findings.columns.items() 
                if c.inferred_type.value in ('numeric_continuous', 'numeric_discrete')
                and n not in [ENTITY_COLUMN, TIME_COLUMN]]

# Analyze all numeric columns using the framework
analyses = analyzer.analyze_dataframe(df, numeric_cols)
recommendations = {col: analyzer.recommend_transformation(analysis) 
                   for col, analysis in analyses.items()}

# Human-readable transformation names
TRANSFORM_DISPLAY_NAMES = {
    'none': 'None needed',
    'log': 'Log transform',
    'log1p': 'Log(1+x) transform',
    'sqrt': 'Square root',
    'box_cox': 'Box-Cox power transform',
    'yeo_johnson': 'Yeo-Johnson power transform',
    'quantile': 'Quantile normalization',
    'robust_scale': 'Robust scaling (median/IQR)',
    'standard_scale': 'Standard scaling (z-score)',
    'minmax_scale': 'Min-Max scaling',
}

print("="*70)
print("NUMERIC COLUMN PROFILES")
print("="*70)

for col_name in numeric_cols:
    col_info = findings.columns[col_name]
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"-" * 70)
    
    if analysis:
        print(f"📊 Distribution Statistics:")
        print(f"   Mean: {analysis.mean:.3f}  |  Median: {analysis.median:.3f}  |  Std: {analysis.std:.3f}")
        print(f"   Range: [{analysis.min_value:.3f}, {analysis.max_value:.3f}]")
        print(f"   Percentiles: 1%={analysis.percentiles['p1']:.3f}, 25%={analysis.q1:.3f}, 75%={analysis.q3:.3f}, 99%={analysis.percentiles['p99']:.3f}")
        print(f"\n📈 Shape Analysis:")
        skew_label = '(Right-skewed)' if analysis.skewness > 0.5 else '(Left-skewed)' if analysis.skewness < -0.5 else '(Symmetric)'
        print(f"   Skewness: {analysis.skewness:.2f} {skew_label}")
        kurt_label = '(Heavy tails/outliers)' if analysis.kurtosis > 3 else '(Light tails)'
        print(f"   Kurtosis: {analysis.kurtosis:.2f} {kurt_label}")
        print(f"   Zeros: {analysis.zero_count:,} ({analysis.zero_percentage:.1f}%)")
        print(f"   Outliers (IQR): {analysis.outlier_count_iqr:,} ({analysis.outlier_percentage:.1f}%)")
        
        if rec:
            transform_display = TRANSFORM_DISPLAY_NAMES.get(rec.recommended_transform.value, rec.recommended_transform.value)
            print(f"\n🔧 Recommended Transformation: {transform_display}")
            print(f"   Reason: {rec.reason}")
            print(f"   Priority: {rec.priority}")
            if rec.warnings:
                for warn in rec.warnings:
                    print(f"   ⚠️ {warn}")

In [None]:
# Per-column distribution visualizations with transformation recommendations
for col_name in numeric_cols:
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    if not analysis:
        continue
    
    data = df[col_name].dropna()
    fig = go.Figure()
    
    fig.add_trace(go.Histogram(x=data, nbinsx=50, name='Distribution',
                                marker_color='steelblue', opacity=0.7))
    
    mean_val = data.mean()
    median_val = data.median()
    
    # Position labels on opposite sides to avoid overlap
    mean_position = "top right" if mean_val >= median_val else "top left"
    median_position = "top left" if mean_val >= median_val else "top right"
    
    fig.add_vline(
        x=mean_val, line_dash="dash", line_color="red",
        annotation_text=f"Mean: {mean_val:.2f}",
        annotation_position=mean_position,
        annotation_font_color="red",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    
    fig.add_vline(
        x=median_val, line_dash="solid", line_color="green",
        annotation_text=f"Median: {median_val:.2f}",
        annotation_position=median_position,
        annotation_font_color="green",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    
    # Add 99th percentile marker if there are outliers
    if analysis.outlier_percentage > 5:
        fig.add_vline(x=analysis.percentiles['p99'], line_dash="dot", line_color="orange",
                      annotation_text=f"99th: {analysis.percentiles['p99']:.2f}",
                      annotation_position="top right",
                      annotation_font_color="orange",
                      annotation_bgcolor="rgba(255,255,255,0.8)")
    
    transform_key = rec.recommended_transform.value if rec else "none"
    transform_label = TRANSFORM_DISPLAY_NAMES.get(transform_key, transform_key)
    fig.update_layout(
        title=f"Distribution: {col_name}<br><sub>Skew: {analysis.skewness:.2f} | Kurt: {analysis.kurtosis:.2f} | Strategy: {transform_label}</sub>",
        xaxis_title=col_name,
        yaxis_title="Count",
        template='plotly_white',
        height=400
    )
    display_figure(fig)

In [None]:
print("\n" + "="*70)
print("CATEGORICAL COLUMN PROFILES")
print("="*70)

categorical_cols = [n for n, c in findings.columns.items()
                    if c.inferred_type.value in ('categorical_nominal', 'categorical_ordinal', 'binary', 'categorical_cyclical')
                    and c.inferred_type != ColumnType.TEXT  # TEXT columns processed separately in 01a_a
                    and n not in [ENTITY_COLUMN, TIME_COLUMN]]

for col_name in categorical_cols:
    col_info = findings.columns[col_name]
    cardinality = col_info.universal_metrics.get('distinct_count', df[col_name].nunique())
    
    print(f"\n{'='*50}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"Distinct Values: {cardinality}")
    
    # Encoding recommendation based on type and cardinality
    if col_info.inferred_type.value == 'categorical_cyclical':
        encoding_rec = "Sin/Cos encoding (cyclical)"
    elif cardinality <= 5:
        encoding_rec = "One-hot encoding (low cardinality)"
    elif cardinality <= 20:
        encoding_rec = "One-hot or Target encoding"
    else:
        encoding_rec = "Target encoding or Frequency encoding (high cardinality)"
    print(f"Recommended Encoding: {encoding_rec}")
    
    # Value counts visualization
    value_counts = df[col_name].value_counts().head(10)
    fig = charts.bar_chart(value_counts.index.tolist(), value_counts.values.tolist(),
                           title=f"Top Categories: {col_name}")
    display_figure(fig)

In [None]:
print("\n" + "="*70)
print("TRANSFORMATION SUMMARY")
print("="*70)

# TRANSFORM_DISPLAY_NAMES is reused from the numeric column profiling cell above

transformations = []
for col_name, rec in recommendations.items():
    if rec and rec.recommended_transform != TransformationType.NONE:
        transform_key = rec.recommended_transform.value
        display_name = TRANSFORM_DISPLAY_NAMES.get(transform_key, transform_key)
        transformations.append({
            'column': col_name,
            'transform': display_name,
            'reason': rec.reason,
            'priority': rec.priority
        })

if transformations:
    print("\nRecommended transformations:")
    # Sort by priority
    priority_order = {'high': 0, 'medium': 1, 'low': 2}
    transformations.sort(key=lambda x: priority_order.get(x['priority'], 3))
    
    for t in transformations:
        priority_marker = "🔴" if t['priority'] == 'high' else "🟡" if t['priority'] == 'medium' else "🟢"
        print(f"\n   {priority_marker} {t['column']}: {t['transform']}")
        print(f"      Reason: {t['reason']}")
else:
    print("\nNo transformations needed - columns are well-behaved")

## 1a.9 Data Segmentation Analysis

**Purpose:** Determine if the dataset contains natural subgroups that might benefit from separate models.

**📖 Why This Matters:**
- Some datasets have distinct customer segments with very different behaviors
- A single model might struggle to capture patterns that vary significantly across segments
- Segmented models can improve accuracy but add maintenance complexity

**Recommendations:**
- **single_model** - Data is homogeneous; one model for all records
- **consider_segmentation** - Some variation exists; evaluate if complexity is worth it
- **strong_segmentation** - Distinct segments with different target rates; separate models likely beneficial

In [None]:
# Initialize segment analyzer
segment_analyzer = SegmentAnalyzer()

# Find target column if detected
target_col = None
for col_name, col_info in findings.columns.items():
    if col_info.inferred_type == ColumnType.TARGET:
        target_col = col_name
        break

# Get numeric feature columns (excluding entity/time columns)
feature_cols = [n for n, c in findings.columns.items() 
                if c.inferred_type.value in ('numeric_continuous', 'numeric_discrete')
                and n not in [ENTITY_COLUMN, TIME_COLUMN]]

print("="*70)
print("DATA SEGMENTATION ANALYSIS")
print("="*70)

if feature_cols:
    segmentation = segment_analyzer.analyze(
        df,
        target_col=target_col,
        feature_cols=feature_cols,
        max_segments=5
    )

    print(f"\n🎯 Analysis Results:")
    print(f"   Method: {segmentation.method.value}")
    print(f"   Detected Segments: {segmentation.n_segments}")
    print(f"   Cluster Quality Score: {segmentation.quality_score:.2f}")
    if segmentation.target_variance_ratio is not None:
        print(f"   Target Variance Ratio: {segmentation.target_variance_ratio:.2f}")

    print(f"\n📊 Segment Profiles:")
    for profile in segmentation.profiles:
        target_info = f" | Target Rate: {profile.target_rate*100:.1f}%" if profile.target_rate is not None else ""
        print(f"   Segment {profile.segment_id}: {profile.size:,} records ({profile.size_pct:.1f}%){target_info}")

    # Display recommendation card
    fig = charts.segment_recommendation_card(segmentation)
    display_figure(fig)

    # Display segment overview
    fig = charts.segment_overview(segmentation, title="Segment Overview")
    display_figure(fig)

    # Display feature comparison if we have features
    if segmentation.n_segments > 1 and any(p.defining_features for p in segmentation.profiles):
        fig = charts.segment_feature_comparison(segmentation, title="Feature Comparison Across Segments")
        display_figure(fig)

    print(f"\n📝 Rationale:")
    for reason in segmentation.rationale:
        print(f"   • {reason}")
else:
    print("\nNo numeric feature columns available for segmentation analysis.")

## 1a.10 Feature Engineering Opportunities

Based on the time series profile, here are recommended aggregation features:

In [None]:
# Analyze available columns for aggregation
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols = [c for c in numeric_cols if c not in [ENTITY_COLUMN, TIME_COLUMN]]

print("\U0001f6e0\ufe0f Feature Engineering Recommendations:")
print("="*60)

print("\n1\ufe0f\u20e3  TIME WINDOW AGGREGATIONS (for each entity):")
windows = ["24h", "7d", "30d", "90d", "180d", "365d", "all_time"]
print(f"   Windows: {', '.join(windows)}")
print(f"   \n   For event counts:")
for w in windows:
    print(f"      - event_count_{w}")

if numeric_cols:
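    # Show at most three column names; add an ellipsis only when more exist
    shown_cols = ', '.join(numeric_cols[:3]) + ('...' if len(numeric_cols) > 3 else '')
    print(f"\n   For numeric columns ({shown_cols}):")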
    aggs = ["sum", "mean", "max", "min"]
    print(f"      Aggregations: {', '.join(aggs)}")
    print(f"      Example: {numeric_cols[0]}_sum_7d, {numeric_cols[0]}_mean_30d")

print("\n2\ufe0f\u20e3  RECENCY FEATURES:")
print("   - days_since_last_event")
print("   - days_since_first_event (tenure)")

print("\n3\ufe0f\u20e3  FREQUENCY FEATURES:")
print("   - avg_events_per_day")
print("   - avg_inter_event_days")
print("   - event_frequency_trend (increasing/decreasing)")

print("\n4\ufe0f\u20e3  LIFECYCLE FEATURES:")
print("   - lifecycle_duration_days")
print("   - is_new_entity (first event in last 30 days)")
print("   - activity_segment (one-time, low, medium, high)")

## 1a.11 Update Findings with Time Series Metadata

In [None]:
from customer_retention.analysis.auto_explorer.findings import TimeSeriesMetadata

# Update or create time series metadata
ts_metadata = TimeSeriesMetadata(
    granularity=DatasetGranularity.EVENT_LEVEL,
    entity_column=ENTITY_COLUMN,
    time_column=TIME_COLUMN,
    avg_events_per_entity=ts_profile.events_per_entity.mean,
    time_span_days=ts_profile.time_span_days,
    unique_entities=ts_profile.unique_entities,
    suggested_aggregations=["24h", "7d", "30d", "90d", "180d", "365d", "all_time"]
)

findings.time_series_metadata = ts_metadata

print("\u2705 Time series metadata updated:")
print(f"   Entity column: {ts_metadata.entity_column}")
print(f"   Time column: {ts_metadata.time_column}")
print(f"   Avg events/entity: {ts_metadata.avg_events_per_entity:.1f}")
print(f"   Time span: {ts_metadata.time_span_days} days")
print(f"   Suggested aggregations: {ts_metadata.suggested_aggregations}")

In [None]:
# Save updated findings
findings.save(FINDINGS_PATH)
print(f"Updated findings saved to: {FINDINGS_PATH}")

---

## Summary: What We Learned

In this notebook, we performed a deep dive on time series data:

1. **Event Distribution** - Analyzed how events are distributed across entities
2. **Activity Segments** - Categorized entities by activity level (one-time, low, medium, high)
3. **Lifecycle Analysis** - Examined entity tenure and duration patterns
4. **Temporal Coverage** - Visualized data volume over time
5. **Inter-Event Timing** - Understood engagement frequency patterns
6. **Feature Opportunities** - Identified time-window aggregations and recency features

## Key Metrics for This Dataset

| Metric | Value |
|--------|-------|
| Unique Entities | Fill from ts_profile |
| Avg Events/Entity | Fill from ts_profile |
| Median Lifecycle | Fill from analysis |
| Median Inter-Event Days | Fill from analysis |

---

## Next Steps

Continue with the **Event Bronze Track**:

1. **01b_temporal_quality.ipynb** - Check for duplicate events, temporal gaps, future dates
2. **01c_temporal_patterns.ipynb** - Detect trends, seasonality, cohort analysis
3. **01d_event_aggregation.ipynb** - Aggregate events to entity-level (produces new dataset)

After completing 01d, continue with the **Entity Bronze Track** (02 → 03 → 04) on the aggregated data.