# Chapter 1d: Event Aggregation (Event Bronze Track → Entity Bronze Track)

**Purpose:** Aggregate event-level data to entity-level, applying all insights from 01a-01c.

**When to use this notebook:**
- After completing 01a (temporal profiling), 01b (quality checks), 01c (pattern analysis)
- Your dataset has EVENT_LEVEL granularity (many rows per entity)
- You want to create entity-level features informed by temporal patterns

**What this notebook produces:**
- Aggregated parquet file (one row per entity)
- New findings file for the aggregated data
- Updated original findings with aggregation metadata

**How 01a-01c findings inform aggregation:**

| Source | Insight Applied |
|--------|----------------|
| **01a** | Recommended windows (e.g., 180d, 365d), lifecycle quadrant feature |
| **01b** | Quality issues to handle (gaps, duplicates) |
| **01c** | Divergent columns for velocity/momentum (prioritize these features) |

---

## Understanding the Shape Transformation

```
EVENT-LEVEL (input)              ENTITY-LEVEL (output)
┌──────────┬──────────┐          ┌──────────┬─────────────┬──────────┐
│ customer │ date     │          │ customer │ events_180d │ quadrant │ ...
├──────────┼──────────┤    →     ├──────────┼─────────────┼──────────┤
│ A        │ Jan 1    │          │ A        │ 12          │ Steady   │
│ A        │ Jan 5    │          │ B        │ 5           │ Brief    │
│ A        │ Jan 10   │          │ C        │ 2           │ Loyal    │
│ B        │ Jan 3    │          └──────────┴─────────────┴──────────┘
│ ...      │ ...      │
└──────────┴──────────┘
Many rows per entity           One row per entity + lifecycle features
```
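
The reshaping in the diagram can be sketched with plain pandas on a toy table. This is an illustration only, not how `TimeWindowAggregator` is implemented; the data, the `amount` column, and the `events_180d` / `amount_sum_180d` names are assumptions made for the example.

```python
import pandas as pd

# Toy event-level table: many rows per customer (hypothetical data)
events = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "C"],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-10", "2024-01-03", "2023-01-01"]
    ),
    "amount": [10.0, 20.0, 5.0, 7.5, 3.0],
})

reference_date = events["date"].max()
window_start = reference_date - pd.Timedelta(days=180)

# Keep only events inside the 180-day window, then collapse to one row per entity
in_window = events[events["date"] > window_start]
entity_level = (
    in_window.groupby("customer")
    .agg(events_180d=("date", "count"), amount_sum_180d=("amount", "sum"))
    # Entities with no events in the window still get a row, filled with 0
    .reindex(events["customer"].unique(), fill_value=0)
    .rename_axis("customer")
    .reset_index()
)
print(entity_level)
```

Customer C has an event in the raw data but none inside the window, which is why the sketch reindexes against the full entity list instead of relying on the filtered groupby alone.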

## 1d.1 Load Findings and Data

In [None]:
from customer_retention.analysis.auto_explorer import ExplorationFindings, DataExplorer
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType, DatasetGranularity
from customer_retention.stages.profiling import (
    TimeWindowAggregator,
    TimeSeriesProfiler,
    classify_lifecycle_quadrants,
    classify_activity_segments,
)
from datetime import datetime
from pathlib import Path
import pandas as pd
import numpy as np

In [None]:
# === CONFIGURATION ===
FINDINGS_DIR = Path("../experiments/findings")

# Find findings files (exclude multi_dataset and already-aggregated)
findings_files = [
    f for f in FINDINGS_DIR.glob("*_findings.yaml") 
    if "multi_dataset" not in f.name and "_aggregated" not in f.name
]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"Loaded findings for {findings.column_count} columns from {findings.source_path}")
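
The discovery rule used above (newest base findings file, skipping `multi_dataset` and `_aggregated` files) can be checked in isolation. The sketch below runs against a temporary directory; the helper name `newest_findings` and the file names are hypothetical and exist only for this example.

```python
import os
import tempfile
from pathlib import Path


def newest_findings(findings_dir: Path) -> Path:
    """Return the most recently modified base findings file."""
    candidates = [
        f for f in findings_dir.glob("*_findings.yaml")
        if "multi_dataset" not in f.name and "_aggregated" not in f.name
    ]
    if not candidates:
        raise FileNotFoundError(f"No findings files found in {findings_dir}")
    return max(candidates, key=lambda f: f.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    names = [
        "orders_findings.yaml",
        "visits_findings.yaml",
        "orders_aggregated_findings.yaml",  # newest, but filtered out by name
    ]
    for offset, name in enumerate(names):
        path = tmp_dir / name
        path.touch()
        # Set explicit, increasing modification times so the ordering is deterministic
        os.utime(path, (1_700_000_000 + offset, 1_700_000_000 + offset))
    selected = newest_findings(tmp_dir).name
    print(selected)
```

If several datasets share one findings directory, the newest file may not be the one you intend to aggregate, so check the `Using:` line printed by the cell above.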

In [None]:
# Verify this is event-level data and display findings summary
if not findings.is_time_series:
    print("⚠️ This dataset is NOT event-level. Aggregation not needed.")
    print("   Proceed directly to 02_column_deep_dive.ipynb")
    raise SystemExit("Skipping aggregation - data is already entity-level")

ts_meta = findings.time_series_metadata
ENTITY_COLUMN = ts_meta.entity_column
TIME_COLUMN = ts_meta.time_column

print("=" * 70)
print("FINDINGS SUMMARY FROM 01a-01c")
print("=" * 70)

# === 01a: Time Series Metadata ===
print("\n📊 FROM 01a (Temporal Profiling):")
print(f"   Entity column: {ENTITY_COLUMN}")
print(f"   Time column: {TIME_COLUMN}")
if ts_meta.unique_entities:
    print(f"   Unique entities: {ts_meta.unique_entities:,}")
if ts_meta.avg_events_per_entity:
    print(f"   Avg events/entity: {ts_meta.avg_events_per_entity:.1f}")
if ts_meta.time_span_days:
    print(f"   Time span: {ts_meta.time_span_days:,} days")

if ts_meta.suggested_aggregations:
    print(f"\n   ✅ Recommended windows: {ts_meta.suggested_aggregations}")
else:
    print("\n   ⚠️ No window recommendations - will use defaults")

if ts_meta.temporal_segmentation_recommendation:
    print(f"\n   📋 Segmentation recommendation:")
    print(f"      {ts_meta.temporal_segmentation_recommendation}")
    if ts_meta.heterogeneity_level:
        print(f"      Heterogeneity: {ts_meta.heterogeneity_level}")

if ts_meta.drift_risk_level:
    print(f"\n   ⚠️ Drift risk: {ts_meta.drift_risk_level.upper()}")
    if ts_meta.volume_drift_risk:
        print(f"      Volume drift: {ts_meta.volume_drift_risk}")
    if ts_meta.population_stability is not None:
        print(f"      Population stability: {ts_meta.population_stability:.2f}")

# === 01b: Temporal Quality ===
quality_meta = findings.metadata.get("temporal_quality", {})
if quality_meta:
    print(f"\n📋 FROM 01b (Temporal Quality):")
    if quality_meta.get("temporal_quality_score"):
        print(f"   Quality score: {quality_meta.get('temporal_quality_score'):.1f}")
    if quality_meta.get("temporal_quality_grade"):
        print(f"   Quality grade: {quality_meta.get('temporal_quality_grade')}")
    issues = quality_meta.get("issues", {})
    if issues.get("duplicate_events", 0) > 0:
        print(f"   ⚠️ Duplicate events: {issues['duplicate_events']:,}")
    if issues.get("temporal_gaps", 0) > 0:
        print(f"   ⚠️ Temporal gaps: {issues['temporal_gaps']:,}")

# === 01c: Temporal Patterns ===
pattern_meta = findings.metadata.get("temporal_patterns", {})
if pattern_meta:
    print(f"\n📈 FROM 01c (Temporal Patterns):")
    windows_used = pattern_meta.get("windows_used", {})
    if windows_used:
        if windows_used.get("aggregation_windows"):
            print(f"   Windows analyzed: {windows_used.get('aggregation_windows')}")
        if windows_used.get("velocity_window"):
            print(f"   Velocity window: {windows_used.get('velocity_window')} days")
        if windows_used.get("momentum_pairs"):
            print(f"   Momentum pairs: {windows_used.get('momentum_pairs')}")
    
    trend = pattern_meta.get("trend", {})
    if trend and trend.get("direction"):
        print(f"\n   Trend: {trend.get('direction')} (strength: {trend.get('strength', 0):.2f})")
    
    seasonality = pattern_meta.get("seasonality", [])
    if seasonality:
        periods = [f"{s.get('name', 'period')} ({s.get('period')})" for s in seasonality[:3]]
        print(f"   Seasonality: {', '.join(periods)}")
    
    recency = pattern_meta.get("recency", {})
    if recency and recency.get("median_days"):
        print(f"   Recency: median={recency.get('median_days'):.0f} days, "
              f"target_corr={recency.get('target_correlation', 0):.2f}")
    
    # Divergent columns (important for feature prioritization)
    velocity = pattern_meta.get("velocity", {})
    divergent_velocity = [k for k, v in velocity.items() if isinstance(v, dict) and v.get("divergent")]
    if divergent_velocity:
        print(f"\n   🎯 Divergent velocity columns: {divergent_velocity}")
    
    momentum = pattern_meta.get("momentum", {})
    divergent_momentum = momentum.get("_divergent_columns", [])
    if divergent_momentum:
        print(f"   🎯 Divergent momentum columns: {divergent_momentum}")

print("\n" + "=" * 70)

In [None]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

# Load source data (prefers snapshots over raw files)
df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
df[TIME_COLUMN] = pd.to_datetime(df[TIME_COLUMN])
charts = ChartBuilder()

print(f"Loaded {len(df):,} events x {len(df.columns)} columns")
print(f"Data source: {data_source}")
print(f"Date range: {df[TIME_COLUMN].min()} to {df[TIME_COLUMN].max()}")

## 1d.2 Configure Aggregation Based on Findings

Apply all insights from 01a-01c to configure optimal aggregation.

In [None]:
# === AGGREGATION CONFIGURATION ===
# Windows are loaded from findings (01a recommendations) with option to override

# Manual override (set to None to use findings recommendations)
WINDOW_OVERRIDE = None  # e.g., ["7d", "30d", "90d"] to override

# Get windows from findings or use defaults
if WINDOW_OVERRIDE:
    WINDOWS = WINDOW_OVERRIDE
    window_source = "manual override"
elif ts_meta.suggested_aggregations:
    WINDOWS = ts_meta.suggested_aggregations
    window_source = "01a recommendations"
else:
    WINDOWS = ["7d", "30d", "90d", "180d", "365d", "all_time"]
    window_source = "defaults (no findings)"

# Reference date for window calculations
REFERENCE_DATE = df[TIME_COLUMN].max()

# Extract pattern metadata for feature prioritization
pattern_meta = findings.metadata.get("temporal_patterns", {})
velocity_meta = pattern_meta.get("velocity", {})
momentum_meta = pattern_meta.get("momentum", {})

# Identify divergent columns (these are most predictive for target)
DIVERGENT_VELOCITY_COLS = [k for k, v in velocity_meta.items() 
                           if isinstance(v, dict) and v.get("divergent")]
DIVERGENT_MOMENTUM_COLS = momentum_meta.get("_divergent_columns", [])

# Value columns: prioritize divergent columns, then other numerics
# IMPORTANT: Exclude target column to prevent data leakage!
TARGET_COLUMN = findings.target_column
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
exclude_cols = {ENTITY_COLUMN, TIME_COLUMN}
if TARGET_COLUMN:
    exclude_cols.add(TARGET_COLUMN)
available_numeric = [c for c in numeric_cols if c not in exclude_cols]

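# Put divergent columns first (they showed predictive signal in 01c).
# dict.fromkeys de-duplicates while preserving order, so a column flagged by both
# the velocity and momentum analyses is not aggregated twice.
priority_cols = [c for c in dict.fromkeys(DIVERGENT_VELOCITY_COLS + DIVERGENT_MOMENTUM_COLS)
                 if c in available_numeric]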
other_cols = [c for c in available_numeric if c not in priority_cols]
VALUE_COLUMNS = priority_cols + other_cols

# Aggregation functions
AGG_FUNCTIONS = ["sum", "mean", "max", "count"]

# Lifecycle features (recommended by 01a segmentation analysis)
INCLUDE_LIFECYCLE_QUADRANT = ts_meta.temporal_segmentation_recommendation is not None
INCLUDE_RECENCY = True
INCLUDE_TENURE = True

# Print configuration
print("=" * 70)
print("AGGREGATION CONFIGURATION")
print("=" * 70)
print(f"\nWindows: {WINDOWS}")
print(f"   Source: {window_source}")
print(f"\nReference date: {REFERENCE_DATE}")
print(f"\nValue columns ({len(VALUE_COLUMNS)} total):")
if priority_cols:
    print(f"   Priority (divergent): {priority_cols}")
print(f"   Other: {other_cols[:5]}{'...' if len(other_cols) > 5 else ''}")
if TARGET_COLUMN:
    print(f"\n   ⚠️ Excluded from aggregation: {TARGET_COLUMN} (target - prevents leakage)")
print(f"\nAggregation functions: {AGG_FUNCTIONS}")
print(f"\nAdditional features:")
print(f"   Include lifecycle_quadrant: {INCLUDE_LIFECYCLE_QUADRANT}")
print(f"   Include recency: {INCLUDE_RECENCY}")
print(f"   Include tenure: {INCLUDE_TENURE}")
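
Window labels such as `"180d"` and `"all_time"` are interpreted relative to `REFERENCE_DATE`. How `TimeWindowAggregator` parses them is not shown in this notebook, so the following is only a sketch of one plausible reading; `window_bounds` is a hypothetical helper, not part of the library.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def window_bounds(window: str, reference_date: datetime) -> Tuple[Optional[datetime], datetime]:
    """Translate a label such as "30d" into (start, end); "all_time" has no start."""
    if window == "all_time":
        return None, reference_date
    if not (window.endswith("d") and window[:-1].isdigit()):
        raise ValueError(f"Unsupported window label: {window!r}")
    return reference_date - timedelta(days=int(window[:-1])), reference_date


reference = datetime(2024, 6, 30)
for label in ["7d", "180d", "all_time"]:
    start, end = window_bounds(label, reference)
    print(label, start, end)
```

Because `REFERENCE_DATE` is the latest timestamp in the data, every window looks backward from the end of the observation period. If the target is defined relative to an earlier cutoff, the reference date should be set to that cutoff instead so that later events do not leak into the features.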

## 1d.3 Preview Aggregation Plan

See what features will be created before executing.

In [None]:
# Initialize aggregator
aggregator = TimeWindowAggregator(
    entity_column=ENTITY_COLUMN,
    time_column=TIME_COLUMN
)

# Generate plan
plan = aggregator.generate_plan(
    df=df,
    windows=WINDOWS,
    value_columns=VALUE_COLUMNS,
    agg_funcs=AGG_FUNCTIONS,
    include_event_count=True,
    include_recency=INCLUDE_RECENCY,
    include_tenure=INCLUDE_TENURE
)

# Count additional features we'll add
additional_features = []
if INCLUDE_LIFECYCLE_QUADRANT:
    additional_features.append("lifecycle_quadrant")
if findings.target_column and findings.target_column in df.columns:
    additional_features.append(f"{findings.target_column} (entity target)")

print("\n" + "="*60)
print("AGGREGATION PLAN")
print("="*60)
print(f"\nEntity column: {plan.entity_column}")
print(f"Time column: {plan.time_column}")
print(f"Windows: {[w.name for w in plan.windows]}")

print(f"\nFeatures from aggregation ({len(plan.feature_columns)}):")
for feat in plan.feature_columns[:15]:
    # Highlight divergent column features
    is_priority = any(dc in feat for dc in priority_cols) if priority_cols else False
    marker = " 🎯" if is_priority else ""
    print(f"   - {feat}{marker}")
if len(plan.feature_columns) > 15:
    print(f"   ... and {len(plan.feature_columns) - 15} more")

if additional_features:
    print(f"\nAdditional features:")
    for feat in additional_features:
        print(f"   - {feat}")
    
print(f"\nTotal expected features: {len(plan.feature_columns) + len(additional_features) + 1}")
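
The size of the plan grows multiplicatively with windows, value columns, and aggregation functions. The sketch below estimates it for a small configuration. The feature naming scheme, the per-window event count, and the recency/tenure column names are assumptions for illustration; the actual names come from `plan.feature_columns`.

```python
from itertools import product

windows = ["30d", "180d"]
value_columns = ["amount", "quantity"]
agg_funcs = ["sum", "mean", "max", "count"]

# Hypothetical naming scheme; the real plan.feature_columns may differ
features = [
    f"{col}_{func}_{win}"
    for win, col, func in product(windows, value_columns, agg_funcs)
]
features += [f"event_count_{win}" for win in windows]  # assumes one count per window
features += ["days_since_last_event", "days_since_first_event"]  # recency and tenure

print(len(features))
```

With six default windows and a few dozen numeric columns the same arithmetic reaches hundreds of features, which is a good reason to review the plan before executing it.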

## 1d.4 Execute Aggregation

In [None]:
print("Executing aggregation...")
print(f"   Input: {len(df):,} events")
print(f"   Expected output: {df[ENTITY_COLUMN].nunique():,} entities")

# Step 1: Basic time window aggregation
df_aggregated = aggregator.aggregate(
    df,
    windows=WINDOWS,
    value_columns=VALUE_COLUMNS,
    agg_funcs=AGG_FUNCTIONS,
    reference_date=REFERENCE_DATE,
    include_event_count=True,
    include_recency=INCLUDE_RECENCY,
    include_tenure=INCLUDE_TENURE
)

# Step 2: Add lifecycle quadrant (from 01a recommendation)
if INCLUDE_LIFECYCLE_QUADRANT:
    print("\n   Adding lifecycle_quadrant feature...")
    profiler = TimeSeriesProfiler(entity_column=ENTITY_COLUMN, time_column=TIME_COLUMN)
    ts_profile = profiler.profile(df)
    
    # Rename 'entity' column to match our entity column name
    lifecycles = ts_profile.entity_lifecycles.copy()
    lifecycles = lifecycles.rename(columns={"entity": ENTITY_COLUMN})
    
    quadrant_result = classify_lifecycle_quadrants(lifecycles)
    
    # Merge lifecycle_quadrant into aggregated data
    quadrant_map = quadrant_result.lifecycles.set_index(ENTITY_COLUMN)["lifecycle_quadrant"]
    df_aggregated["lifecycle_quadrant"] = df_aggregated[ENTITY_COLUMN].map(quadrant_map)
    
    print(f"   Quadrant distribution:")
    for quad, count in df_aggregated["lifecycle_quadrant"].value_counts().items():
        pct = count / len(df_aggregated) * 100
        print(f"      {quad}: {count:,} ({pct:.1f}%)")

# Step 3: Add entity-level target (if available)
TARGET_COLUMN = findings.target_column
if TARGET_COLUMN and TARGET_COLUMN in df.columns:
    print(f"\n   Adding entity-level target ({TARGET_COLUMN})...")
    # For entity-level target, use max (if any event has target=1, entity has target=1)
    entity_target = df.groupby(ENTITY_COLUMN)[TARGET_COLUMN].max()
    df_aggregated[TARGET_COLUMN] = df_aggregated[ENTITY_COLUMN].map(entity_target)
    
    target_dist = df_aggregated[TARGET_COLUMN].value_counts()
    for val, count in target_dist.items():
        pct = count / len(df_aggregated) * 100
        print(f"      {TARGET_COLUMN}={val}: {count:,} ({pct:.1f}%)")

print(f"\n✅ Aggregation complete!")
print(f"   Output: {len(df_aggregated):,} entities x {len(df_aggregated.columns)} features")
print(f"   Memory: {df_aggregated.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
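
The recency and tenure features and the `max` rule for the entity-level target can be reproduced on toy data. The definitions below are assumptions (recency is days from the last event to the reference date, tenure is days from the first event to the reference date), and the `churned` flag is a hypothetical binary target.

```python
import pandas as pd

events = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "C"],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-02-10", "2024-02-20", "2024-03-31"]
    ),
    "churned": [0, 1, 0, 0, 0],  # hypothetical binary event-level flag
})
reference_date = events["date"].max()

per_entity = events.groupby("customer").agg(
    first_event=("date", "min"),
    last_event=("date", "max"),
    churned=("churned", "max"),  # 1 if any event for the entity is flagged
)
per_entity["recency_days"] = (reference_date - per_entity["last_event"]).dt.days
per_entity["tenure_days"] = (reference_date - per_entity["first_event"]).dt.days
print(per_entity[["recency_days", "tenure_days", "churned"]])
```

The `max` rule is only meaningful for a 0/1 flag. A target that changes over time, or one recorded only on an entity's final event, needs a different reduction (for example, the value at the last event).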

In [None]:
# Preview aggregated data
print("\nAggregated Data Preview:")
display(df_aggregated.head(10))

In [None]:
# Summary statistics
print("\nFeature Summary Statistics:")
display(df_aggregated.describe().T)

## 1d.5 Quality Check on Aggregated Data

Quick validation of the aggregated output.

In [None]:
print("="*60)
print("AGGREGATED DATA QUALITY CHECK")
print("="*60)

# Check for nulls
null_counts = df_aggregated.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]

if len(cols_with_nulls) > 0:
    print(f"\n⚠️ Columns with null values ({len(cols_with_nulls)}):")
    for col, count in cols_with_nulls.head(10).items():
        pct = count / len(df_aggregated) * 100
        print(f"   {col}: {count:,} ({pct:.1f}%)")
    if len(cols_with_nulls) > 10:
        print(f"   ... and {len(cols_with_nulls) - 10} more")
    print("\n   Note: Nulls in aggregated features typically mean no events in that window.")
    print("   Consider filling with 0 for count/sum features.")
else:
    print("\n✅ No null values in aggregated data")

# Check entity count matches
original_entities = df[ENTITY_COLUMN].nunique()
aggregated_entities = len(df_aggregated)

if original_entities == aggregated_entities:
    print(f"\n✅ Entity count matches: {aggregated_entities:,}")
else:
    print(f"\n⚠️ Entity count mismatch!")
    print(f"   Original: {original_entities:,}")
    print(f"   Aggregated: {aggregated_entities:,}")

# Check feature statistics
print(f"\n📊 Feature Statistics:")
numeric_agg_cols = df_aggregated.select_dtypes(include=[np.number]).columns.tolist()
if TARGET_COLUMN:
    numeric_agg_cols = [c for c in numeric_agg_cols if c != TARGET_COLUMN]

print(f"   Total features: {len(df_aggregated.columns)}")
print(f"   Numeric features: {len(numeric_agg_cols)}")

# Check for constant columns (no variance). nunique also catches all-null columns,
# for which std() returns NaN rather than 0.
const_cols = [c for c in numeric_agg_cols if df_aggregated[c].nunique(dropna=True) <= 1]
if const_cols:
    print(f"\n⚠️ Constant columns (zero variance): {len(const_cols)}")
    print(f"   {const_cols[:5]}{'...' if len(const_cols) > 5 else ''}")

# If lifecycle_quadrant was added, show the positive-target rate per quadrant
if INCLUDE_LIFECYCLE_QUADRANT and TARGET_COLUMN and TARGET_COLUMN in df_aggregated.columns:
    print(f"\n📊 Lifecycle Quadrant vs Target:")
    cross = pd.crosstab(df_aggregated["lifecycle_quadrant"], df_aggregated[TARGET_COLUMN], normalize='index')
    if 1 in cross.columns:
        for quad in cross.index:
            rate = cross.loc[quad, 1] * 100
            print(f"   {quad}: {rate:.1f}% positive")
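
The quality check suggests filling nulls with 0 for count and sum features. A minimal sketch of that step follows, assuming the aggregation function appears in the column name as a `_sum_` or `_count_` token; the naming convention and the toy columns are assumptions, so adapt the filter to the actual names in `df_aggregated`.

```python
import numpy as np
import pandas as pd

agg = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "amount_sum_30d": [120.0, np.nan, 15.0],
    "amount_count_30d": [4.0, np.nan, 1.0],
    "amount_mean_30d": [30.0, np.nan, 15.0],
})

# Assumed naming convention: the aggregation function is a "_sum_" / "_count_" token
zero_fill_cols = [c for c in agg.columns if "_sum_" in c or "_count_" in c]
agg[zero_fill_cols] = agg[zero_fill_cols].fillna(0)

# Means and maxima stay null: "no events" is not the same as an average of 0
print(agg)
```

Leaving mean and max features null keeps "no activity in the window" distinguishable from "activity with a value of zero", which some models can exploit directly.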

## 1d.6 Save Aggregated Data and Findings

In [None]:
# Generate output paths
original_name = Path(findings.source_path).stem
findings_name = Path(FINDINGS_PATH).stem.replace("_findings", "")

# Save aggregated data as parquet
AGGREGATED_DATA_PATH = FINDINGS_DIR / f"{findings_name}_aggregated.parquet"
df_aggregated.to_parquet(AGGREGATED_DATA_PATH, index=False)

print(f"\u2705 Aggregated data saved to: {AGGREGATED_DATA_PATH}")
print(f"   Size: {AGGREGATED_DATA_PATH.stat().st_size / 1024:.1f} KB")

In [None]:
# Create new findings for aggregated data using DataExplorer
print("\nGenerating findings for aggregated data...")

explorer = DataExplorer(output_dir=str(FINDINGS_DIR))
aggregated_findings = explorer.explore(
    str(AGGREGATED_DATA_PATH),
    name=f"{findings_name}_aggregated"
)

AGGREGATED_FINDINGS_PATH = explorer.last_findings_path
print(f"✅ Aggregated findings saved to: {AGGREGATED_FINDINGS_PATH}")

In [None]:
# Update original findings with comprehensive aggregation metadata
findings.time_series_metadata.aggregation_executed = True
findings.time_series_metadata.aggregated_data_path = str(AGGREGATED_DATA_PATH)
findings.time_series_metadata.aggregated_findings_path = str(AGGREGATED_FINDINGS_PATH)
findings.time_series_metadata.aggregation_windows_used = WINDOWS
findings.time_series_metadata.aggregation_timestamp = datetime.now().isoformat()

# Add aggregation details to metadata
findings.metadata["aggregation"] = {
    "windows_used": WINDOWS,
    "window_source": window_source,
    "reference_date": str(REFERENCE_DATE),
    "value_columns_count": len(VALUE_COLUMNS),
    "priority_columns": priority_cols,  # Divergent columns from 01c
    "agg_functions": AGG_FUNCTIONS,
    "include_lifecycle_quadrant": INCLUDE_LIFECYCLE_QUADRANT,
    "include_recency": INCLUDE_RECENCY,
    "include_tenure": INCLUDE_TENURE,
    "output_entities": len(df_aggregated),
    "output_features": len(df_aggregated.columns),
    "target_column": TARGET_COLUMN,
}

findings.save(FINDINGS_PATH)
print(f"✅ Original findings updated with aggregation metadata: {FINDINGS_PATH}")

In [None]:
# Summary of outputs
print("\n" + "="*70)
print("AGGREGATION COMPLETE - OUTPUT SUMMARY")
print("="*70)

print(f"\n📁 Files created:")
print(f"   1. Aggregated data: {AGGREGATED_DATA_PATH}")
print(f"   2. Aggregated findings: {AGGREGATED_FINDINGS_PATH}")
print(f"   3. Updated original findings: {FINDINGS_PATH}")

print(f"\n📊 Transformation stats:")
print(f"   Input events: {len(df):,}")
print(f"   Output entities: {len(df_aggregated):,}")
print(f"   Features created: {len(df_aggregated.columns)}")

print(f"\n⚙️ Configuration applied:")
print(f"   Windows: {WINDOWS} (from {window_source})")
print(f"   Aggregation functions: {AGG_FUNCTIONS}")
if priority_cols:
    print(f"   Priority columns (from 01c divergence): {priority_cols}")
if INCLUDE_LIFECYCLE_QUADRANT:
    print(f"   Lifecycle quadrant: included (from 01a recommendation)")

print(f"\n🎯 Ready for modeling:")
print(f"   Entity column: {ENTITY_COLUMN}")
if TARGET_COLUMN:
    print(f"   Target column: {TARGET_COLUMN}")
    if TARGET_COLUMN in df_aggregated.columns:
        positive_rate = df_aggregated[TARGET_COLUMN].mean() * 100
        print(f"   Target positive rate: {positive_rate:.1f}%")

# Drift warning if applicable
if ts_meta.drift_risk_level == "high":
    print(f"\n⚠️ DRIFT WARNING: High drift risk detected in 01a")
    print(f"   Volume drift: {ts_meta.volume_drift_risk or 'unknown'}")
    print(f"   Consider: temporal validation splits, monitoring for distribution shift")

---

## Summary: What We Did

In this notebook, we transformed event-level data to entity-level, applying all insights from 01a-01c:

1. **Loaded findings** from prior notebooks (windows, patterns, quality)
2. **Configured aggregation** using recommended windows from 01a
3. **Prioritized features** based on divergent columns from 01c velocity/momentum analysis
4. **Added lifecycle_quadrant** as recommended by 01a segmentation analysis
5. **Added entity-level target** for downstream modeling
6. **Saved outputs** - aggregated data, findings, and metadata

## How Findings Were Applied

| Finding | Source | Application |
|---------|--------|-------------|
| Aggregation windows | 01a | Used `suggested_aggregations` instead of defaults |
| Lifecycle quadrant | 01a | Added as categorical feature for model |
| Divergent columns | 01c | Prioritized in feature list (velocity/momentum signal) |
| Drift warning | 01a | Flagged for temporal validation consideration |
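
When drift risk is flagged, a time-ordered split keeps the validation set strictly later than the training set. The sketch below is one simple way to do that on entity-level data; the `last_event` column, the toy values, and the 80/20 ratio are assumptions, not outputs of this notebook.

```python
import pandas as pd

entities = pd.DataFrame({
    "customer": list("ABCDEFGHIJ"),
    "last_event": pd.date_range("2024-01-01", periods=10, freq="30D"),
    "target": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Order by time and hold out the most recent 20% for validation
ordered = entities.sort_values("last_event").reset_index(drop=True)
cutoff = int(len(ordered) * 0.8)
train, valid = ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Every validation row is later than every training row
print(train["last_event"].max(), valid["last_event"].min())
```

A random split would mix early and late entities in both sets and can hide the distribution shift that the drift warning is about.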

## Output Files

| File | Purpose | Next Use |
|------|---------|----------|
| `*_aggregated.parquet` | Entity-level data with temporal features | Input for notebooks 02-04 |
| `*_aggregated_findings.yaml` | Auto-profiled findings | Loaded by 02_column_deep_dive |
| Original findings (updated) | Aggregation tracking | Reference and lineage |

---

## Next Steps

**Event Bronze Track complete!** Continue with the **Entity Bronze Track** on the aggregated data:

1. **02_column_deep_dive.ipynb** - Profile the aggregated feature distributions
2. **03_quality_assessment.ipynb** - Run quality checks on entity-level data  
3. **04_relationship_analysis.ipynb** - Analyze feature correlations and target relationships

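The notebooks auto-discover findings via the standard discovery pattern (most recently modified file). Note that this notebook re-saves the original findings after the aggregated findings are generated, so the original file carries the later modification time. Confirm which file each downstream notebook actually loads before proceeding.

```python
# Notebooks 02-04 pick up findings via the standard discovery pattern.
# Verify that the file they load is the aggregated findings (AGGREGATED_FINDINGS_PATH).
```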