# Chapter 1b: Temporal Quality Assessment (Event Bronze Track)

**Purpose:** Run quality checks specific to event-level (time series) datasets to identify data issues before feature engineering.

**When to use this notebook:**
- After completing 01a_temporal_deep_dive.ipynb
- Your dataset is EVENT_LEVEL granularity
- You want to validate temporal data integrity

**What you'll learn:**
- How to detect duplicate events (same entity + timestamp)
- How to find unexpected temporal gaps
- How to identify future dates (data quality issue)
- How to check for event ordering ambiguities

**Quality Checks Performed:**

| Check ID | Name | Severity | Description |
|----------|------|----------|-------------|
| TQ001 | Duplicate Events | MEDIUM | Same entity with identical timestamp |
| TQ002 | Temporal Gaps | MEDIUM | Unexpected missing time periods |
| TQ003 | Future Dates | HIGH | Dates beyond reference date |
| TQ004 | Event Ordering | LOW | Ambiguous ordering (timestamp collisions) |

## 1b.1 Load Findings and Data

In [None]:
from customer_retention.analysis.auto_explorer import ExplorationFindings, RecommendationEngine
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType, DatasetGranularity
from customer_retention.core.components.enums import Severity
from customer_retention.stages.profiling import (
    TemporalQualityCheck, TemporalQualityResult,
    DuplicateEventCheck, TemporalGapCheck, FutureDateCheck, EventOrderCheck,
    NumericProfiler, CategoricalProfiler,
    SegmentAwareOutlierAnalyzer
)
from scipy import stats
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

In [None]:
# === CONFIGURATION ===
from pathlib import Path

FINDINGS_DIR = Path("../experiments/findings")

findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Using: {FINDINGS_PATH}")
findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"Loaded findings for {findings.column_count} columns")

In [None]:
# Verify time series dataset and get column names
if not findings.is_time_series:
    print("\u26a0\ufe0f Warning: This dataset was not detected as time series.")
    print("   Use 03_quality_assessment.ipynb for entity-level datasets.")

ts_meta = findings.time_series_metadata
ENTITY_COLUMN = ts_meta.entity_column if ts_meta else None
TIME_COLUMN = ts_meta.time_column if ts_meta else None

print(f"Entity column: {ENTITY_COLUMN}")
print(f"Time column: {TIME_COLUMN}")

if not ENTITY_COLUMN or not TIME_COLUMN:
    raise ValueError("Please run 01a_temporal_deep_dive.ipynb first to set entity/time columns")

In [None]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
charts = ChartBuilder()

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {data_source}")

## 1b.2 Configure Quality Checks

You can customize the check parameters based on your data characteristics.

In [None]:
# === QUALITY CHECK CONFIGURATION ===

# Reference date for future date check (default: now)
REFERENCE_DATE = pd.Timestamp.now()
# REFERENCE_DATE = pd.Timestamp("2024-01-01")  # Use fixed date if needed

# Expected data frequency for gap detection
# Options: "D" (daily), "W" (weekly), "M" (monthly), "H" (hourly)
EXPECTED_FREQUENCY = "D"

# Maximum gap multiplier (gaps > expected * multiplier are flagged)
MAX_GAP_MULTIPLE = 3.0

print("Quality Check Configuration:")
print(f"   Reference date: {REFERENCE_DATE}")
print(f"   Expected frequency: {EXPECTED_FREQUENCY}")
print(f"   Max gap multiple: {MAX_GAP_MULTIPLE}x")

## 1b.3 Run All Temporal Quality Checks

In [None]:
# Initialize all checks
checks = [
    DuplicateEventCheck(entity_column=ENTITY_COLUMN, time_column=TIME_COLUMN),
    TemporalGapCheck(time_column=TIME_COLUMN, expected_frequency=EXPECTED_FREQUENCY, max_gap_multiple=MAX_GAP_MULTIPLE),
    FutureDateCheck(time_column=TIME_COLUMN, reference_date=REFERENCE_DATE),
    EventOrderCheck(entity_column=ENTITY_COLUMN, time_column=TIME_COLUMN),
]

# Run all checks
results = []
for check in checks:
    result = check.run(df)
    results.append(result)

print("\n" + "="*70)
print("TEMPORAL QUALITY CHECK RESULTS")
print("="*70)

In [None]:
# Display summary
passed = sum(1 for r in results if r.passed)
failed = len(results) - passed

severity_colors = {
    Severity.HIGH: "\U0001f534",
    Severity.MEDIUM: "\U0001f7e0",
    Severity.LOW: "\U0001f7e1",
    Severity.INFO: "\U0001f535",
}

print(f"\n\U0001f4cb Summary: {passed}/{len(results)} checks passed\n")

for result in results:
    status = "\u2705" if result.passed else "\u274c"
    severity_icon = severity_colors.get(result.severity, "\u26aa")
    print(f"{status} [{result.check_id}] {result.check_name}")
    print(f"   {severity_icon} Severity: {result.severity.value}")
    print(f"   Message: {result.message}")
    if result.recommendation:
        print(f"   \U0001f4a1 {result.recommendation}")
    print()

## 1b.4 Detailed Check Results

In [None]:
# Consolidated quality summary with context
total_rows = len(df)

check_data = []
for r in results:
    issue_count = r.duplicate_count or r.gap_count or r.future_count or r.ambiguous_count or 0
    issue_pct = (issue_count / total_rows * 100) if total_rows > 0 else 0
    
    check_data.append({
        "Check": r.check_name,
        "Status": "✅ PASS" if r.passed else "❌ FAIL",
        "Severity": r.severity.value.upper(),
        "Issues": f"{issue_count:,}",
        "% of Data": f"{issue_pct:.2f}%",
        "Impact": "None" if r.passed else ("Critical" if r.severity == Severity.HIGH else "Moderate" if r.severity == Severity.MEDIUM else "Minor")
    })

summary_df = pd.DataFrame(check_data)
display(summary_df)

print(f"\n📊 Context: Dataset has {total_rows:,} total rows")

### TQ001: Duplicate Events Analysis

**📖 Why Duplicates Matter for ML:**

Duplicate events (same entity + timestamp) can distort your model in subtle ways:

| Impact Area | Problem |
|-------------|---------|
| **Event counts** | Inflated activity metrics ("sent 10 emails" when actually 5) |
| **Aggregations** | Sum/mean calculations skewed by duplicate values |
| **Sequence modeling** | Artificial patterns introduced in event sequences |
| **Class balance** | If duplicates correlate with target, creates sampling bias |

**Common causes:**
- System retries logging the same event multiple times
- ETL pipeline re-processing data without deduplication
- Intentional (e.g., multiple items in one transaction logged separately)

**⚠️ Key question:** Are duplicates a **data quality issue** or **valid business events**?
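
One way to approach that question is to check whether the duplicated rows are identical in every column or only share the entity and timestamp. The sketch below uses plain pandas; the column names `customer_id`, `event_time`, and `amount` are hypothetical stand-ins, and this is not how `DuplicateEventCheck` is implemented.

```python
import pandas as pd

# Hypothetical toy events; column names are illustrative only.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"]
    ),
    "amount": [10.0, 10.0, 25.0, 5.0, 5.0],
})

key = ["customer_id", "event_time"]

# Rows that share entity + timestamp with at least one other row.
key_dupes = events.duplicated(subset=key, keep=False)

# Rows identical in every column to an earlier row (likely retries or re-processing).
exact_dupes = events.duplicated(keep="first")

# Drop exact copies, then number the remaining same-timestamp rows
# so they stay distinguishable if they are valid business events.
deduped = events[~exact_dupes].copy()
deduped["seq"] = deduped.groupby(key).cumcount()
```

Rows that differ in some other column (here `amount`) are more likely to be valid events than rows that are exact copies.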

In [None]:
dup_result = results[0]  # DuplicateEventCheck

print("\U0001f50d Duplicate Events Analysis")
print("="*50)

if dup_result.duplicate_count > 0:
    print(f"\n\u26a0\ufe0f Found {dup_result.duplicate_count} duplicate events")
    print(f"   Affected entities: {dup_result.details.get('affected_entities', 'N/A')}")
    
    if "duplicate_examples" in dup_result.details:
        print("\n   Example duplicates:")
        examples = dup_result.details["duplicate_examples"][:5]
        for ex in examples:
            print(f"      Entity: {ex[ENTITY_COLUMN]}, Time: {ex[TIME_COLUMN]}")
    
    print("\n   \U0001f6e0\ufe0f Recommended Actions:")
    print("      1. Investigate why duplicates exist (system issue? intentional?)")
    print("      2. If unintentional: deduplicate by keeping first/last occurrence")
    print("      3. If intentional: add sequence column to differentiate")
else:
    print("\n\u2705 No duplicate events found")

### TQ002: Temporal Gap Analysis

**📖 Why Gaps Matter for ML:**

Temporal gaps (missing time periods) can silently corrupt your analysis and model:

| Impact Area | Problem |
|-------------|---------|
| **Rolling features** | "Events in last 30 days" becomes artificially low during/after gaps |
| **Recency features** | "Days since last event" inflated by data gaps, not actual inactivity |
| **Seasonality detection** | Missing months distort seasonal patterns (e.g., missing all Decembers) |
| **Train/test splits** | Gap at split boundary can cause data leakage or unfair evaluation |
| **Aggregation bias** | Time windows spanning gaps have incomplete data, biasing metrics |
| **Trend analysis** | Gaps create artificial trend breaks or distort slope calculations |

**⚠️ Key insight:** A gap doesn't mean customers were inactive; it means **we don't know** what happened. Models can't distinguish "no events" from "no data."

**Recommended actions if gaps exist:**
1. Document gap periods for downstream users
2. Exclude gap-affected time windows from training
3. Add a `data_available` flag to features
4. Consider imputation only if gap cause is known (e.g., planned maintenance)
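
Actions 2 and 3 can be sketched with plain pandas. The example assumes daily data and a 3-day threshold (an expected frequency of one day times a gap multiple of 3); the dates and the `calendar` frame are illustrative, and this is not the logic inside `TemporalGapCheck`.

```python
import pandas as pd

# Hypothetical event dates with a hole between Jan 3 and Jan 10.
event_dates = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-10", "2024-01-11"]
)

threshold = pd.Timedelta(days=1) * 3  # expected frequency x max gap multiple

# Dataset-level gaps: distance between consecutive distinct event dates.
timeline = pd.Series(event_dates).drop_duplicates().sort_values().reset_index(drop=True)
deltas = timeline.diff()
is_gap = deltas > threshold
gaps = pd.DataFrame({
    "gap_start": timeline.shift(1)[is_gap],
    "gap_end": timeline[is_gap],
})

# Daily availability flag: False for calendar days strictly inside a gap.
calendar = pd.DataFrame({"date": pd.date_range(timeline.min(), timeline.max(), freq="D")})
calendar["data_available"] = True
for gap in gaps.itertuples(index=False):
    inside = (calendar["date"] > gap.gap_start) & (calendar["date"] < gap.gap_end)
    calendar.loc[inside, "data_available"] = False
```

Joining `calendar` onto feature windows lets downstream code exclude or down-weight windows that overlap a gap.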

In [None]:
gap_result = results[1]  # TemporalGapCheck

print("\U0001f50d Temporal Gap Analysis")
print("="*50)

print(f"\n   Expected frequency: {EXPECTED_FREQUENCY}")
print(f"   Max gap detected: {gap_result.max_gap_days:.1f} days")

if gap_result.gap_count > 0:
    print(f"\n\u26a0\ufe0f Found {gap_result.gap_count} significant gaps")
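    # Format the threshold only when present; applying ":.1f" to the "N/A" fallback would raise ValueError
    threshold_days = gap_result.details.get("threshold_days")
    threshold_str = f"{threshold_days:.1f} days" if threshold_days is not None else "N/A"
    print(f"   Threshold: {threshold_str}")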
    
    print("\n   \U0001f6e0\ufe0f Recommended Actions:")
    print("      1. Investigate gaps - were systems down? holidays?")
    print("      2. Consider gap locations when designing time windows")
    print("      3. Document known gaps for downstream users")
else:
    print("\n\u2705 No significant temporal gaps detected")

In [None]:
# Visualize event volume over time
df_temp = df.copy()
df_temp[TIME_COLUMN] = pd.to_datetime(df_temp[TIME_COLUMN])

# Calculate time span to choose appropriate aggregation
time_span_days = (df_temp[TIME_COLUMN].max() - df_temp[TIME_COLUMN].min()).days

# Choose aggregation based on time span for better visibility
if time_span_days <= 90:
    freq, freq_label = "D", "Daily"
elif time_span_days <= 365:
    freq, freq_label = "W", "Weekly"
else:
    freq, freq_label = "ME", "Monthly"

event_counts = df_temp.groupby(pd.Grouper(key=TIME_COLUMN, freq=freq)).size()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=event_counts.index,
    y=event_counts.values,
    name=f"{freq_label} Events",
    marker_color="#4682B4"
))
fig.update_layout(
    title=f"{freq_label} Event Volume (gaps appear as missing bars)",
    xaxis_title="Date",
    yaxis_title="Number of Events",
    template="plotly_white",
    height=350,
    bargap=0.1
)
display_figure(fig)

# Calendar heatmap for pattern discovery
fig_calendar = charts.monthly_calendar_heatmap(
    df_temp[TIME_COLUMN],
    title="Event Patterns by Month and Day of Week"
)
display_figure(fig_calendar)

### TQ003: Future Dates Analysis

**📖 Why Future Dates Matter for ML:**

Events with timestamps in the future are almost always data quality issues:

| Impact Area | Problem |
|-------------|---------|
| **Data leakage** | Future events can leak into training data, inflating metrics |
| **Time-based splits** | Future dates break temporal train/test separation |
| **Recency features** | Negative "days since" values cause calculation errors |
| **Business logic** | Impossible to have events that haven't happened yet |

**Common causes:**
- Timezone conversion errors (UTC vs local time)
- Placeholder dates (e.g., "9999-12-31" for "unknown")
- Scheduled events logged with future execution date
- Data entry errors
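
Separating placeholder values from genuinely suspicious timestamps makes the causes above easier to triage. A minimal sketch, assuming a fixed reference date and a hypothetical set of known placeholder strings; neither is taken from `FutureDateCheck`.

```python
import pandas as pd

reference_date = pd.Timestamp("2024-06-01")  # fixed for reproducibility

# Hypothetical raw timestamps, including a placeholder and a future value.
raw = pd.Series(["2024-05-30", "2024-06-02", "9999-12-31", "2024-01-15"])

# Assumed list of sentinel strings meaning "unknown".
PLACEHOLDERS = {"9999-12-31"}
is_placeholder = raw.isin(PLACEHOLDERS)

# Mask placeholders before parsing so they become NaT rather than far-future dates.
parsed = pd.to_datetime(raw.where(~is_placeholder), errors="coerce")

# NaT compares as False, so placeholders are not counted as future dates.
is_future = parsed > reference_date

clean = parsed[~is_placeholder & ~is_future]
```

Keeping `is_placeholder` and `is_future` as separate masks lets you report the two problems separately before filtering.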

In [None]:
future_result = results[2]  # FutureDateCheck

print("\U0001f50d Future Dates Analysis")
print("="*50)

print(f"\n   Reference date: {REFERENCE_DATE}")

if future_result.future_count > 0:
    print(f"\n\U0001f6a8 Found {future_result.future_count} events with future dates!")
    
    if "future_date_examples" in future_result.details:
        print("\n   Example future dates:")
        for ex in future_result.details["future_date_examples"][:5]:
            print(f"      {ex}")
    
    print("\n   \U0001f6e0\ufe0f Recommended Actions:")
    print("      1. CRITICAL: Investigate data source - likely data quality issue")
    print("      2. Check for timezone issues or date parsing errors")
    print("      3. Filter out future dates before modeling")
else:
    print("\n\u2705 No future dates detected")

### TQ004: Event Ordering Analysis

**📖 Why Event Ordering Matters for ML:**

When multiple events share the same timestamp, their order becomes ambiguous:

| Impact Area | Problem |
|-------------|---------|
| **Sequence features** | "Previous event type" undefined when order is ambiguous |
| **State tracking** | Can't determine correct state transitions |
| **Lag calculations** | Which event comes "before" the other? |
| **Causal inference** | Impossible to establish event causality |

**When this is okay:**
- Events are independent (order doesn't matter for your features)
- You're only using aggregations (counts, sums) not sequences

**When this is a problem:**
- Building sequence models (RNNs, transformers)
- Creating "previous event" or "next event" features
- Tracking customer journey stages
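
If order matters and no finer-grained timestamp exists, a tie-breaking sequence number makes the ordering explicit. It only records the order in which rows arrived, which may not be the true order. A minimal sketch with hypothetical column names:

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": [7, 7, 7, 8],
    "event_time": pd.to_datetime(
        ["2024-03-01 10:00", "2024-03-01 10:00", "2024-03-01 09:00", "2024-03-01 10:00"]
    ),
    "event_type": ["view", "purchase", "login", "view"],
})

# Sort by entity and time; rows that tie on both keys keep their original relative order.
ordered = events.sort_values(["customer_id", "event_time"], kind="stable").reset_index(drop=True)

# Position of each event within its entity's history.
ordered["event_seq"] = ordered.groupby("customer_id").cumcount()

# A "previous event" feature is well defined only once ties are broken.
ordered["prev_event_type"] = ordered.groupby("customer_id")["event_type"].shift(1)
```

For customer 7, the two 10:00 events end up as `view` then `purchase` purely because of input row order, which is worth documenting for downstream users.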

In [None]:
order_result = results[3]  # EventOrderCheck

print("\U0001f50d Event Ordering Analysis")
print("="*50)

if order_result.ambiguous_count > 0:
    print(f"\n\U0001f7e1 Found {order_result.ambiguous_count} events with ambiguous ordering")
    print(f"   (Same entity + same timestamp = can't determine order)")
    print(f"   Collision groups: {order_result.details.get('collision_groups', 'N/A')}")
    
    print("\n   \U0001f6e0\ufe0f Recommended Actions:")
    print("      1. If order matters: add a sequence number column")
    print("      2. If order doesn't matter: proceed with aggregation")
    print("      3. Consider using sub-second timestamps if available")
else:
    print("\n\u2705 Event ordering is unambiguous")

## 1b.5 Quality Score & Recommendations

**📊 Proportional Scoring System:**

Each of the 4 checks contributes **25%** to the total score. Deductions are proportional to the **% of data affected**:

| % Affected | Severity | Score Range |
|------------|----------|-------------|
| 0% | None | 100% |
| < 0.1% | Negligible | ~99% |
| 0.1 - 1% | Minor | 90-95% |
| 1 - 5% | Moderate | 70-90% |
| 5 - 20% | Significant | 30-70% |
| > 20% | Severe | 0-30% |

**Example:** If a check finds issues affecting 0.5% of rows, that check scores ~92%, contributing ~23/25 points to the total.
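
Written out, the per-check score computed by `calculate_check_score` is a piecewise function of $p$, the percentage of rows affected, and the total quality score $Q$ weights each of the four checks by 0.25. The function jumps at the band edges (for example, from about 90 just below $p = 1$ to 86 at $p = 1$), so the ranges in the table are outer bounds rather than exact endpoints.

$$
s(p) =
\begin{cases}
100 & p = 0 \\
99 & 0 < p < 0.1 \\
95 - 5p & 0.1 \le p < 1 \\
90 - 4p & 1 \le p < 5 \\
70 - 2p & 5 \le p < 20 \\
\max(0,\ 30 - p) & p \ge 20
\end{cases}
\qquad
Q = \sum_{i=1}^{4} 0.25\, s(p_i)
$$

For $p = 0.5$ this gives $s = 95 - 2.5 = 92.5$ and a contribution of $0.25 \times 92.5 \approx 23.1$ points.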

In [None]:
# Proportional Quality Scoring System
# Each check contributes 25% to total score
# Deduction proportional to % of data affected

total_rows = len(df)

def calculate_check_score(issue_count: int, total: int) -> float:
    """Calculate score (0-100) based on % of data affected."""
    if total == 0:
        return 100.0
    pct_affected = (issue_count / total) * 100
    
    # Graduated severity based on magnitude
    if pct_affected == 0:
        return 100.0
    elif pct_affected < 0.1:
        return 99.0  # Negligible
    elif pct_affected < 1.0:
        return 95.0 - (pct_affected * 5)  # Minor: 90-95
    elif pct_affected < 5.0:
        return 90.0 - (pct_affected * 4)  # Moderate: 70-90
    elif pct_affected < 20.0:
        return 70.0 - (pct_affected * 2)  # Significant: 30-70
    else:
        return max(0, 30.0 - pct_affected)  # Severe: 0-30

# Calculate individual check scores
check_scores = []
for r in results:
    issue_count = r.duplicate_count or r.gap_count or r.future_count or r.ambiguous_count or 0
    pct_affected = (issue_count / total_rows * 100) if total_rows > 0 else 0
    score = calculate_check_score(issue_count, total_rows)
    deduction = (100 - score) * 0.25
    
    check_scores.append({
        "name": r.check_name,
        "check_id": r.check_id,
        "issues": issue_count,
        "pct_affected": pct_affected,
        "score": score,
        "max_points": 25.0,
        "deduction": deduction,
        "contribution": score * 0.25,
        "passed": r.passed
    })

quality_score = sum(c["contribution"] for c in check_scores)
total_deductions = sum(c["deduction"] for c in check_scores)

grade, message = (
    ("A", "Excellent! Ready for feature engineering.") if quality_score >= 90 else
    ("B", "Good. Minor issues - document and proceed.") if quality_score >= 75 else
    ("C", "Fair. Address issues before proceeding.") if quality_score >= 60 else
    ("D", "Needs attention. Significant issues found.")
)

# Stacked horizontal bar - compact visualization
def get_color(contribution):
    if contribution >= 23:
        return "#2ca02c"  # Green
    elif contribution >= 17.5:
        return "#ffbb00"  # Yellow
    elif contribution >= 12.5:
        return "#ff7f0e"  # Orange
    else:
        return "#d62728"  # Red

fig = go.Figure()
for c in check_scores:
    fig.add_trace(go.Bar(
        x=[c["contribution"]],
        y=["Score"],
        orientation="h",
        name=c["name"],
        marker_color=get_color(c["contribution"]),
        text=f"+{c['contribution']:.0f}",
        textposition="inside",
        textfont=dict(color="white", size=12),
        hovertemplate=f"<b>{c['name']}</b><br>Issues: {c['issues']:,} ({c['pct_affected']:.2f}%)<br>Contribution: {c['contribution']:.1f}/25<extra></extra>"
    ))

fig.update_layout(
    barmode="stack",
    title=f"Quality Score: {quality_score:.0f}/100 (Grade {grade})",
    xaxis=dict(range=[0, 105], title=""),
    yaxis=dict(visible=False),
    template="plotly_white",
    height=120,
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5),
    margin=dict(t=60, b=20, l=20, r=40)
)
fig.add_annotation(x=quality_score, y="Score", text=f"<b>{quality_score:.0f}</b>", showarrow=False, xanchor="left", xshift=5, font=dict(size=14))
display_figure(fig)

# Detailed breakdown table
print(f"\n{'='*90}")
print(f"{'Check':<22} {'Issues':>8} {'% Affected':>11} {'Max':>6} {'Deduction':>10} {'Actual':>8}")
print(f"{'-'*90}")
for c in check_scores:
    status = "✓" if c["deduction"] < 0.5 else "△" if c["deduction"] < 5 else "✗"
    ded_str = f"-{c['deduction']:.1f}" if c["deduction"] > 0 else "0"
    print(f"{status} {c['name']:<20} {c['issues']:>8,} {c['pct_affected']:>10.2f}% {c['max_points']:>6.0f} {ded_str:>10} {c['contribution']:>7.1f}")
print(f"{'-'*90}")
print(f"  {'TOTAL':<20} {'':<8} {'':<11} {'100':>6} {f'-{total_deductions:.1f}':>10} {quality_score:>7.1f}")
print(f"\n📊 Grade {grade}: {message}")

In [None]:
# Mitigation strategies for checks with issues
checks_with_issues = [c for c in check_scores if c["issues"] > 0]

if checks_with_issues:
    mitigation_map = {
        "TQ001": ("Duplicate Events", [
            "Deduplicate: df.drop_duplicates(subset=[entity_col, time_col], keep='first')",
            "If intentional: df['seq'] = df.groupby([entity_col, time_col]).cumcount()"
        ]),
        "TQ002": ("Temporal Gaps", [
            "Document gaps for downstream users",
            "Exclude gap periods from rolling calculations",
            "Add indicator: df['has_gap'] = df[time_col].diff() > threshold"
        ]),
        "TQ003": ("Future Dates", [
            "Filter: df = df[df[time_col] <= pd.Timestamp.now()]",
            "Check timezones: df[time_col] = df[time_col].dt.tz_convert('UTC')"
        ]),
        "TQ004": ("Event Ordering", [
            "Add sequence: df['event_seq'] = df.groupby(entity_col).cumcount()",
            "Stable sort: df.sort_values([entity_col, time_col], kind='stable')"
        ])
    }
    
    print("="*70)
    print("🔧 RECOMMENDED MITIGATIONS")
    print("="*70)
    
    for c in checks_with_issues:
        if c["check_id"] in mitigation_map:
            name, strategies = mitigation_map[c["check_id"]]
            severity = "🟢 Minor" if c["pct_affected"] < 1 else "🟡 Moderate" if c["pct_affected"] < 5 else "🔴 Significant"
            
            print(f"\n{severity}: {name}")
            print(f"   {c['issues']:,} issues ({c['pct_affected']:.2f}% of data) → Score: {c['score']:.0f}%")
            print(f"   Strategies:")
            for i, s in enumerate(strategies, 1):
                print(f"      {i}. {s}")
else:
    print("✅ No issues detected - data quality is excellent!")

## 1b.6 Target Variable Analysis

Understanding the target distribution is critical because:
- **Class imbalance** affects model training and evaluation metrics
- **Business context** helps interpret what we're trying to predict
- **Sampling strategies** depend on imbalance severity
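
As an illustration of how imbalance severity feeds into class weights, the sketch below computes the imbalance ratio and inverse-frequency weights using the common `n_samples / (n_classes * class_count)` heuristic. The target values are made up, and this weighting scheme is one option among several.

```python
import pandas as pd

# Hypothetical binary target: 1 = retained, 0 = churned.
target = pd.Series([1] * 80 + [0] * 20)

counts = target.value_counts().sort_index()
imbalance_ratio = counts.max() / counts.min()

# Inverse-frequency weights: rarer classes get proportionally larger weights.
class_weights = (len(target) / (len(counts) * counts)).to_dict()
```

With an 80/20 split, the minority class is weighted four times as heavily as the majority class, which matches the 4:1 imbalance ratio.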

In [None]:
print("=" * 60)
print(f"TARGET VARIABLE DISTRIBUTION: {findings.target_column}")
print("=" * 60)

if findings.target_column and findings.target_column in df.columns:
    target_series = df[findings.target_column]
    target_counts = target_series.value_counts().sort_index()
    
    dist_data = []
    for val, count in target_counts.items():
        pct = count / len(df) * 100
        dist_data.append({findings.target_column: val, "count": count, "percentage": f"{pct:.3f}"})
    
    display(pd.DataFrame(dist_data))
    
    if len(target_counts) == 2:
        majority, minority = target_counts.max(), target_counts.min()
        minority_class = target_counts.idxmin()
        imbalance_ratio = majority / minority
        retention_rate = target_counts.get(1, 0) / len(df) * 100
        
        print(f"\nImbalance ratio: {imbalance_ratio:.2f}:1 (minority class: {minority_class})")
        print(f"Retention rate: {retention_rate:.1f}%")
        
        if retention_rate > 70:
            print(f"\n📊 Business Context: {retention_rate:.0f}% retention is healthy!")
        elif retention_rate > 50:
            print(f"\n📊 Business Context: {retention_rate:.0f}% retention is moderate.")
        else:
            print(f"\n⚠️ Business Context: {retention_rate:.0f}% retention is concerning!")
        
        print("\n⚠️ Class imbalance considerations:")
        print("   - Use stratified sampling for train/test splits")
        print("   - Consider class weights in model training")
        print("   - Evaluate with Precision-Recall AUC (not just ROC-AUC)")
        
        # Visualization
        fig = make_subplots(rows=1, cols=2, specs=[[{"type": "pie"}, {"type": "bar"}]],
                           horizontal_spacing=0.25)
        
        labels = [f"{'Retained' if v == 1 else 'Churned'} ({v})" for v in target_counts.index]
        
        # Pie chart - percentages inside, legend from this trace
        fig.add_trace(go.Pie(
            labels=labels, 
            values=target_counts.values, 
            hole=0.4,
            marker_colors=["#e74c3c", "#2ecc71"],
            textposition="inside",
            textinfo="percent",
            textfont=dict(size=12, color="white"),
            hoverinfo="label+percent+value"
        ), row=1, col=1)
        
        # Bar chart - counts inside, no legend (uses same colors)
        fig.add_trace(go.Bar(
            x=labels, 
            y=target_counts.values,
            marker_color=["#e74c3c", "#2ecc71"],
            text=[f"{count:,}" for count in target_counts.values],
            textposition="inside",
            textfont=dict(color="white", size=12),
            showlegend=False
        ), row=1, col=2)
        
        fig.update_layout(
            height=420,
            showlegend=True,
            legend=dict(
                orientation="h",
                yanchor="bottom",
                y=1.02,
                xanchor="center",
                x=0.5
            ),
            template="plotly_white",
            margin=dict(t=100, b=60, l=50, r=50),
            title=dict(
                text="<b>Target Variable Distribution</b>",
                y=0.98,
                x=0.5,
                xanchor="center",
                yanchor="top"
            )
        )
        
        display_figure(fig)
else:
    print("\n⚠️ No target column detected. Set target_hint in DataExplorer.explore()")

## 1b.7 Numerical Feature Statistics

This section computes a statistical summary of each numeric feature, including skewness, kurtosis, and alerts on distribution shape.
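
A note on reading the kurtosis values: `scipy.stats.kurtosis` returns *excess* kurtosis by default, so a normal distribution scores about 0 rather than 3. The sketch below uses pandas' `Series.skew` and `Series.kurt`, which also report excess kurtosis but apply a sample bias correction, so the numbers differ slightly from SciPy's defaults. The data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features: one symmetric, one strongly right-skewed.
symmetric = pd.Series(rng.normal(size=5_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))

summary = pd.DataFrame(
    {
        "skewness": [symmetric.skew(), right_skewed.skew()],
        "excess_kurtosis": [symmetric.kurt(), right_skewed.kurt()],
    },
    index=["symmetric", "right_skewed"],
)

# A log transform is a common remedy for strong right skew in positive data.
log_skew = np.log(right_skewed).skew()
```

Comparing `summary` with `log_skew` shows how a transform can bring a heavily skewed feature back toward symmetry before modeling.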

In [None]:
numeric_cols = [name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
    and name not in [ENTITY_COLUMN, TIME_COLUMN]]

if numeric_cols:
    stats_data = []
    for col_name in numeric_cols:
        series = df[col_name].dropna()
        if len(series) > 0:
            stats_data.append({
                "feature": col_name, "count": len(series),
                "mean": series.mean(), "std": series.std(),
                "min": series.min(), "25%": series.quantile(0.25),
                "50%": series.quantile(0.50), "75%": series.quantile(0.75),
                "99%": series.quantile(0.99), "max": series.max(),
                "skewness": stats.skew(series), "kurtosis": stats.kurtosis(series)
            })
    
    stats_df = pd.DataFrame(stats_data)
    display_df = stats_df.copy()
    for col in ["mean", "std", "min", "25%", "50%", "75%", "99%", "max"]:
        display_df[col] = display_df[col].apply(lambda x: f"{x:.3f}")
    display_df["skewness"] = display_df["skewness"].apply(lambda x: f"{x:.3f}")
    display_df["kurtosis"] = display_df["kurtosis"].apply(lambda x: f"{x:.3f}")
    
    print("=" * 80)
    print("NUMERICAL FEATURE STATISTICS")
    print("=" * 80)
    display(display_df)
    
    print("\n📊 DISTRIBUTION ALERTS:")
    alerts = []
    for _, row in stats_df.iterrows():
        issues = []
        if abs(row["skewness"]) > 2:
            issues.append(f"highly skewed ({row['skewness']:.2f})")
        elif abs(row["skewness"]) > 1:
            issues.append(f"moderately skewed ({row['skewness']:.2f})")
        if row["kurtosis"] > 10:
            issues.append(f"very heavy tails ({row['kurtosis']:.1f})")
        elif row["kurtosis"] > 3:
            issues.append(f"heavy tails ({row['kurtosis']:.1f})")
        if issues:
            alerts.append(row["feature"])
            print(f"  ⚠️ {row['feature']}: {', '.join(issues)}")
    if not alerts:
        print("  ✅ No strong skew or heavy tails detected")
else:
    print("No numeric columns found (excluding entity/time columns).")

## 1b.8 Segment-Aware Outlier Analysis

Global outlier detection can produce false positives when data contains natural segments (e.g., high-value vs regular customers). This analysis detects segments and compares global vs segment-specific outliers.
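
The idea can be illustrated with a plain IQR rule applied once globally and once per segment. This is a sketch of the general technique using a hypothetical, pre-labelled `segment` column and a helper named `iqr_outliers` introduced here for illustration; it is not the algorithm inside `SegmentAwareOutlierAnalyzer`.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical spend data: many regular customers, a few high-value ones.
df_demo = pd.DataFrame({
    "segment": ["regular"] * 20 + ["high_value"] * 4,
    "spend": [10, 11, 12, 13, 14] * 4 + [500, 510, 520, 530],
})

# One rule over the whole column vs. the same rule within each segment.
global_mask = iqr_outliers(df_demo["spend"])
segment_mask = df_demo.groupby("segment")["spend"].transform(iqr_outliers).astype(bool)

# Flagged globally but normal within their own segment.
false_outliers = int((global_mask & ~segment_mask).sum())
```

Here every high-value customer is a global outlier, yet none is unusual within the `high_value` segment, which is the "false outlier" pattern the analysis below looks for.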

In [None]:
print("=" * 80)
print("SEGMENT-AWARE OUTLIER ANALYSIS")
print("=" * 80)

if numeric_cols:
    analyzer = SegmentAwareOutlierAnalyzer(max_segments=5)
    segment_result = analyzer.analyze(df, feature_cols=numeric_cols, segment_col=None,
                                       target_col=findings.target_column)
    
    print(f"\n📊 Segments detected: {segment_result.n_segments}")
    
    if segment_result.n_segments > 1:
        print(f"\n📈 GLOBAL VS SEGMENT OUTLIER COMPARISON:")
        comparison_data = []
        for col in numeric_cols:
            global_outliers = segment_result.global_analysis[col].outliers_detected
            segment_outliers = sum(seg[col].outliers_detected for seg in segment_result.segment_analysis.values() if col in seg)
            false_outliers = segment_result.false_outliers.get(col, 0)
            reduction_pct = (global_outliers - segment_outliers) / global_outliers * 100 if global_outliers > 0 else 0
            comparison_data.append({"Feature": col, "Global": global_outliers, "Segment": segment_outliers,
                                    "False Outliers": false_outliers, "Reduction": f"{reduction_pct:.1f}%"})
        display(pd.DataFrame(comparison_data))
        
        if segment_result.segmentation_recommended:
            print("\n💡 SEGMENT-SPECIFIC OUTLIER TREATMENT RECOMMENDED")
            for rec in segment_result.recommendations:
                print(f"   • {rec}")
        
        # Visualization
        cols_with_diff = [r["Feature"] for r in comparison_data if r["Global"] > 0 and r["Global"] != r["Segment"]]
        if cols_with_diff:
            fig = go.Figure()
            fig.add_trace(go.Bar(name="Global", x=cols_with_diff,
                y=[r["Global"] for r in comparison_data if r["Feature"] in cols_with_diff], marker_color="#e74c3c"))
            fig.add_trace(go.Bar(name="Segment", x=cols_with_diff,
                y=[r["Segment"] for r in comparison_data if r["Feature"] in cols_with_diff], marker_color="#2ecc71"))
            fig.update_layout(barmode="group", title="Global vs Segment Outliers", height=400, template="plotly_white")
            display_figure(fig)
    else:
        print("   Data appears homogeneous - using global outlier detection")
else:
    print("No numeric columns for outlier analysis.")

## 1b.9 Binary Field & Data Consistency Validation

In [None]:
# Binary Field Validation
binary_cols = [name for name, col in findings.columns.items() if col.inferred_type == ColumnType.BINARY]
print("=" * 60)
print("BINARY FIELD VALIDATION")
print("=" * 60)

if binary_cols:
    for col in binary_cols:
        unique_vals = sorted(df[col].dropna().unique())
        is_valid = set(unique_vals).issubset({0, 1, 0.0, 1.0})
        count_0, count_1 = (df[col] == 0).sum(), (df[col] == 1).sum()
        total = count_0 + count_1
        status = "✓" if is_valid else "⚠️"
        print(f"\n{status} {col}: 0={count_0:,} ({count_0/total*100:.1f}%), 1={count_1:,} ({count_1/total*100:.1f}%)")
        if not is_valid:
            print(f"   Invalid values: {[v for v in unique_vals if v not in [0, 1, 0.0, 1.0]]}")
else:
    print("\nNo binary columns detected.")

# Data Consistency Checks
print("\n" + "=" * 60)
print("DATA CONSISTENCY CHECKS")
print("=" * 60)

consistency_issues = []
for col_name in df.select_dtypes(include=['object']).columns:
    if col_name in [ENTITY_COLUMN, TIME_COLUMN]:
        continue
    unique_vals = df[col_name].dropna().unique()
    case_variants = {}
    for val in unique_vals:
        lower_val = str(val).lower().strip()
        if lower_val not in case_variants:
            case_variants[lower_val] = []
        case_variants[lower_val].append(val)
    for lower_val, variants in case_variants.items():
        if len(variants) > 1:
            consistency_issues.append({"Column": col_name, "Issue": "Case/Spacing Variants", "Details": str(variants[:3])})

if consistency_issues:
    print("\n⚠️ Consistency issues found:")
    display(pd.DataFrame(consistency_issues))
else:
    print("\n✅ No consistency issues detected.")

## 1b.10 Quality Improvement Recommendations

This section uses the framework's `RecommendationEngine` to generate prioritized, actionable recommendations based on the detected issues.

In [None]:
print("=" * 70)
print("QUALITY IMPROVEMENT RECOMMENDATIONS")
print("=" * 70)

rec_engine = RecommendationEngine()
cleaning_recs = rec_engine.recommend_cleaning(findings)

if cleaning_recs:
    severity_order = {"high": 0, "medium": 1, "low": 2}
    sorted_recs = sorted(cleaning_recs, key=lambda r: severity_order.get(r.severity, 3))
    
    for rec in sorted_recs:
        icon = "🔴" if rec.severity == "high" else "🟡" if rec.severity == "medium" else "🟢"
        print(f"\n{icon} [{rec.severity.upper()}] {rec.column_name}")
        print(f"   Issue: {rec.issue_type} - {rec.description}")
        print(f"   Strategy: {rec.strategy}")
        if rec.problem_impact:
            print(f"   Impact: {rec.problem_impact}")
        if rec.action_steps:
            print(f"   Steps: {', '.join(rec.action_steps[:3])}")
else:
    print("\n✅ No critical cleaning recommendations - data quality is good!")

## 1b.11 Save Quality Check Results

In [None]:
# Store quality check results in findings
quality_summary = {
    "temporal_quality_score": quality_score,
    "temporal_quality_grade": grade,
    "checks_passed": passed,
    "checks_total": len(results),
    "issues": {
        "duplicate_events": dup_result.duplicate_count,
        "temporal_gaps": gap_result.gap_count,
        "future_dates": future_result.future_count,
        "ambiguous_ordering": order_result.ambiguous_count,
    }
}

# Add to findings metadata
if not findings.metadata:
    findings.metadata = {}
findings.metadata["temporal_quality"] = quality_summary

findings.save(FINDINGS_PATH)
print(f"Quality results saved to: {FINDINGS_PATH}")
print(f"\nSummary: {quality_summary}")

---

## Summary: What We Learned

In this notebook, we performed temporal-specific quality checks:

1. **Duplicate Events** - Checked for same entity + timestamp combinations
2. **Temporal Gaps** - Identified unexpected missing time periods
3. **Future Dates** - Flagged dates beyond the reference date (a data quality issue)
4. **Event Ordering** - Verified events can be uniquely ordered

## Quality Score Interpretation

| Grade | Score | Meaning |
|-------|-------|--------|
| A | 90-100 | Excellent - proceed with confidence |
| B | 75-89 | Good - minor issues, document and proceed |
| C | 60-74 | Fair - address issues before feature engineering |
| D | <60 | Poor - significant investigation needed |

---

## Next Steps

Continue with the **Event Bronze Track**:

1. **01c_temporal_patterns.ipynb** - Detect trends, seasonality, cohort analysis
2. **01d_event_aggregation.ipynb** - Aggregate events to entity-level (produces new dataset)

After completing 01d, continue with the **Entity Bronze Track** (02 → 03 → 04) on the aggregated data.

If the grade is C or below, return and fix the data quality issues before continuing.