# Start Here: Data Discovery

**Purpose:** Create a point-in-time snapshot and understand your dataset's structure through automatic profiling.

**What you'll learn:**
- How to create temporally-safe training snapshots
- How automatic type inference works and when to override it
- How to identify entity-level vs event-level data
- How to set up your target column for downstream analysis

**Outputs:**
- Point-in-time training snapshot (Parquet)
- Dataset overview (rows, columns, memory, format, structure)
- Automatic column type inference with confidence scores
- Saved exploration findings (YAML)

---

## How to Read This Notebook

Each section includes:
- **📊 Charts** - Interactive Plotly visualizations
- **📖 Interpretation Guide** - How to read and understand the output
- **✅ Actions** - What to do based on the findings

## 1.1 Configuration

Configure your data source and target column **before** running the notebook.

In [1]:
from customer_retention.analysis.auto_explorer import DataExplorer
from customer_retention.analysis.auto_explorer.findings import TimeSeriesMetadata
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table, console
from customer_retention.stages.validation import TimeSeriesDetector
from customer_retention.core.config.column_config import DatasetGranularity, ColumnType
from customer_retention.stages.profiling import TypeDetector
from customer_retention.stages.temporal import (
    ScenarioDetector, UnifiedDataPreparer, SnapshotManager,
    TimestampConfig, TimestampStrategy, PointInTimeRegistry, CutoffAnalyzer
)
from datetime import datetime
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

In [2]:
# =============================================================================
# CONFIGURATION - Set these before running
# =============================================================================

# DATA_PATH: Path to your data file (CSV, Parquet, or Delta)
DATA_PATH = "../tests/fixtures/customer_retention_retail.csv"

# TARGET_COLUMN: Your prediction target (set to None for auto-detection)
TARGET_COLUMN = "retained"

# ENTITY_COLUMN: Customer/user ID column (set to None for auto-detection)
ENTITY_COLUMN = None

# LABEL_WINDOW_DAYS: Days after last activity to derive label timestamp
# Used when no explicit label timestamp column exists (e.g., churn_date)
# Default: 180 days (6 months observation window)
LABEL_WINDOW_DAYS = 180

# TIMESTAMP_CONFIG: Override auto-detection if needed (set to None for auto-detection)
# Example manual override:
# TIMESTAMP_CONFIG = TimestampConfig(
#     strategy=TimestampStrategy.PRODUCTION,
#     feature_timestamp_column="observation_date",
#     label_timestamp_column="churn_date",
# )
TIMESTAMP_CONFIG = None

# =============================================================================
# SAMPLE DATASETS (for learning/testing only)
# =============================================================================
# ENTITY-LEVEL (one row per customer):
# DATA_PATH = "../tests/fixtures/customer_retention_retail.csv"
# DATA_PATH = "../tests/fixtures/bank_customer_churn.csv"
# DATA_PATH = "../tests/fixtures/netflix_customer_churn.csv"
#
# EVENT-LEVEL (multiple rows per customer):
# DATA_PATH = "../tests/fixtures/customer_transactions.csv"
# DATA_PATH = "../tests/fixtures/customer_emails.csv"
# =============================================================================

# OUTPUT_DIR: All outputs go here (gitignored)
OUTPUT_DIR = Path("../experiments/findings")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

## 1.2 Load Data & Create Point-in-Time Snapshot

**This is the critical first step.** We:
1. Load raw data
2. Detect temporal scenario (production timestamps, derived, or synthetic)
3. Create a versioned snapshot with `feature_timestamp` and `label_timestamp`
4. Use the snapshot, not the raw data, for all subsequent analysis

This ensures temporal integrity and prevents data leakage.
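As a rough illustration of what the two temporal columns mean, the sketch below derives them by hand on a toy frame. The column names mirror the retail fixture, but the values are invented, and the filtering rule is an assumption for illustration: `UnifiedDataPreparer` may apply different logic internally.

```python
import pandas as pd

LABEL_WINDOW_DAYS = 180  # same value as the configuration cell

# Toy data: invented values, column names borrowed from the retail fixture
toy = pd.DataFrame({
    "custid": ["A", "B", "C"],
    "lastorder": pd.to_datetime(["2017-01-10", "2017-05-01", "2017-09-15"]),
    "retained": [1, 0, 1],
})

# Feature timestamp: the moment at which the features were observed
toy["feature_timestamp"] = toy["lastorder"]

# Label timestamp: when the outcome becomes knowable (derived scenario)
toy["label_timestamp"] = toy["feature_timestamp"] + pd.Timedelta(days=LABEL_WINDOW_DAYS)

# Assumed rule: keep rows whose features were observed on or before the cutoff
toy_cutoff = pd.Timestamp("2017-06-29")
toy_snapshot = toy[toy["feature_timestamp"] <= toy_cutoff]

print(toy_snapshot[["custid", "feature_timestamp", "label_timestamp"]])
```

Customer C is excluded because its last activity falls after the cutoff, which is the kind of row that would leak future information into training.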

In [3]:
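# Load raw data: CSV by extension (case-insensitive), otherwise assume Parquet.
# A Delta table would need its own reader; this cell does not handle it.
raw_df = pd.read_csv(DATA_PATH) if Path(DATA_PATH).suffix.lower() == ".csv" else pd.read_parquet(DATA_PATH)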

console.start_section()
console.header("Raw Data Loaded")
console.metric("Source", DATA_PATH)
console.metric("Rows", f"{len(raw_df):,}")
console.metric("Columns", len(raw_df.columns))
console.end_section()

# Detect granularity and entity column
type_detector = TypeDetector()
granularity_result = type_detector.detect_granularity(raw_df)
entity_column = ENTITY_COLUMN or granularity_result.entity_column

# Detect or use provided timestamp configuration
if TIMESTAMP_CONFIG:
    ts_config = TIMESTAMP_CONFIG
    scenario = "MANUAL_OVERRIDE"
    discovery_result = None
    console.info(f"Using manual timestamp config: {ts_config.strategy.value}")
else:
    detector = ScenarioDetector(label_window_days=LABEL_WINDOW_DAYS)
    scenario, ts_config, discovery_result = detector.detect(raw_df, TARGET_COLUMN)

console.start_section()
console.header("Temporal Scenario Detection")
console.metric("Scenario", scenario)
console.metric("Strategy", ts_config.strategy.value)
console.metric("Label Window", f"{LABEL_WINDOW_DAYS} days")

if discovery_result:
    if discovery_result.feature_timestamp:
        source_col = discovery_result.feature_timestamp.column_name
        if discovery_result.feature_timestamp.is_derived:
            console.metric("Feature Timestamp", f"derived from {discovery_result.feature_timestamp.source_columns}")
        else:
            was_promoted = "promoted" in discovery_result.feature_timestamp.notes.lower()
            if was_promoted:
                console.metric("Feature Timestamp", f"{source_col} (auto-selected as latest activity)")
            else:
                console.metric("Feature Timestamp", f"{source_col} (explicit match)")
    
    if discovery_result.label_timestamp:
        if discovery_result.label_timestamp.is_derived:
            console.metric("Label Timestamp", f"derived: {discovery_result.label_timestamp.derivation_formula}")
        else:
            console.metric("Label Timestamp", f"{discovery_result.label_timestamp.column_name} (explicit match)")
    
    if "datetime_ordering" in discovery_result.discovery_report:
        ordering = discovery_result.discovery_report["datetime_ordering"]
        if ordering:
            console.info(f"Datetime column ordering: {' → '.join(ordering)}")

console.end_section()

#### RAW DATA LOADED  
Source: **../tests/fixtures/customer_retention_retail.csv**  
Rows: **30,801**  
Columns: **15**

#### TEMPORAL SCENARIO DETECTION  
Scenario: **partial**  
Strategy: **production**  
Label Window: **180 days**  
Feature Timestamp: **lastorder (auto-selected as latest activity)**  
Label Timestamp: **derived: lastorder + 180 days**  
*(i) Datetime column ordering: created → firstorder → lastorder*

### Cutoff Date Selection

The chart below shows the temporal distribution of your data. Use it to select an appropriate cutoff date:

- **Top chart**: Records per time bin and cumulative count
- **Bottom chart**: Train/Score split percentage at each potential cutoff date
- **Suggested cutoff** (blue dashed): Targets a ~90% train / 10% score split; the achieved split can differ (86% / 14% in the run shown here)

**Final data allocation:**
- Cutoff: 90% train, 10% score (holdout for final evaluation)
- Train/Test split: 89% train, 11% test (from the 90%)
- **Result: ~80% training, ~10% test, ~10% score**

Adjust `CUTOFF_DATE` below if the suggested date doesn't fit your needs.
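The allocation above can be checked with a few lines of arithmetic. The sketch below also shows one simple way to pick a candidate cutoff from sorted timestamps. It is an assumption for illustration, not the method `CutoffAnalyzer` uses (that works on time bins), and the timestamps are invented.

```python
import pandas as pd

# Invented timestamps: ten records, one per month of 2017
toy_ts = pd.Series(pd.date_range("2017-01-01", periods=10, freq="MS")).sort_values()

# Candidate cutoff: the date on or before which ~90% of records fall
target_ratio = 0.9
candidate_cutoff = toy_ts.iloc[round(target_ratio * len(toy_ts)) - 1]
train_share = (toy_ts <= candidate_cutoff).mean()
print(candidate_cutoff.date(), f"{train_share:.0%} of records on or before cutoff")

# Allocation arithmetic from the list above
pre_cutoff = 0.90               # share of records before the cutoff
test_within_pre_cutoff = 0.11   # test share taken from the pre-cutoff records

train_final = pre_cutoff * (1 - test_within_pre_cutoff)  # 0.90 * 0.89 = 0.801
test_final = pre_cutoff * test_within_pre_cutoff         # 0.90 * 0.11 = 0.099
score_final = 1 - pre_cutoff                             # 0.10

print(f"train={train_final:.1%} test={test_final:.1%} score={score_final:.1%}")
```

With real data the achieved share depends on how records cluster in time, so it is worth reading the split reported for the chosen date rather than assuming exactly 90/10.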

In [4]:
# Analyze temporal distribution for cutoff selection
cutoff_analyzer = CutoffAnalyzer()
cutoff_analysis = None

# Get timestamp column from discovery result (reuse existing analysis)
timestamp_col = None
if discovery_result:
    if "datetime_ordering" in discovery_result.discovery_report:
        ordering = discovery_result.discovery_report["datetime_ordering"]
        if ordering:
            timestamp_col = ordering[-1]  # Latest datetime column
    if not timestamp_col and discovery_result.feature_timestamp:
        if not discovery_result.feature_timestamp.is_derived:
            timestamp_col = discovery_result.feature_timestamp.column_name

# Check registry for existing cutoff
pit_registry = PointInTimeRegistry(OUTPUT_DIR)
registry_cutoff = pit_registry.check_consistency().reference_cutoff

if timestamp_col:
    cutoff_analysis = cutoff_analyzer.analyze(raw_df, timestamp_column=timestamp_col, n_bins=50)
    data_suggested_cutoff = cutoff_analysis.suggest_cutoff(train_ratio=0.9)
    
    console.start_section()
    console.header("Cutoff Date Analysis")
    console.metric("Timestamp Column", timestamp_col)
    console.metric("Date Range", f"{cutoff_analysis.date_range[0].strftime('%Y-%m-%d')} to {cutoff_analysis.date_range[1].strftime('%Y-%m-%d')}")
    console.metric("Data-Suggested Cutoff", data_suggested_cutoff.strftime("%Y-%m-%d"))
    split = cutoff_analysis.get_split_at_date(data_suggested_cutoff)
    console.metric("At Suggested Split", f"{split['train_pct']:.0f}% train / {split['score_pct']:.0f}% score")
    
    if registry_cutoff:
        console.warning(f"Registry has cutoff: {registry_cutoff.date()} (may be stale)")
        console.info("To clear: pit_registry.clear_registry()")
    
    # Show milestones for reference
    milestones = cutoff_analysis.get_percentage_milestones(step=10)
    if milestones:
        console.subheader("Reference Dates (10% intervals)")
        for m in milestones:
            console.info(f"  {m['train_pct']:.0f}% train: {m['date'].strftime('%Y-%m-%d')}")
    console.end_section()
else:
    data_suggested_cutoff = datetime.now()
    console.start_section()
    console.header("Cutoff Date Analysis")
    console.warning("No timestamp column detected")
    console.metric("Default Cutoff", data_suggested_cutoff.strftime("%Y-%m-%d"))
    if registry_cutoff:
        console.info(f"Registry cutoff: {registry_cutoff.date()}")
    console.end_section()

#### CUTOFF DATE ANALYSIS  
Timestamp Column: **lastorder**  
Date Range: **2004-01-01 to 2018-01-21**  
Data-Suggested Cutoff: **2017-06-29**  
At Suggested Split: **86% train / 14% score**  
[!] Registry has cutoff: 2017-06-29 (may be stale)  
*(i) To clear: pit_registry.clear_registry()*  
**Reference Dates (10% intervals)**  
*(i)   11% train: 2011-11-15*  
*(i)   21% train: 2012-12-29*  
*(i)   51% train: 2013-07-23*  
*(i)   51% train: 2013-07-23*  
*(i)   51% train: 2013-07-23*  
*(i)   80% train: 2013-11-02*  
*(i)   80% train: 2013-11-02*  
*(i)   80% train: 2013-11-02*  
*(i)   91% train: 2017-06-29*

In [5]:
# =============================================================================
# CUTOFF DATE SELECTION - Set your preferred cutoff date
# =============================================================================
# Options:
#   None = use data-suggested cutoff (~90/10 split)
#   datetime(YYYY, M, D) = use specific date
#
# To clear stale registry: pit_registry.clear_registry()
# =============================================================================
CUTOFF_DATE = None  # e.g., datetime(2017, 7, 1)

# Compute final selected cutoff
selected_cutoff = CUTOFF_DATE or data_suggested_cutoff

console.start_section()
console.header("Selected Cutoff Date")
if CUTOFF_DATE:
    console.info(f"Manual override: {CUTOFF_DATE.strftime('%Y-%m-%d')}")
else:
    console.info(f"Using data-suggested: {selected_cutoff.strftime('%Y-%m-%d')}")

if cutoff_analysis:
    split = cutoff_analysis.get_split_at_date(selected_cutoff)
    console.metric("Train/Score Split", f"{split['train_pct']:.0f}% / {split['score_pct']:.0f}%")
    console.metric("Train Records", f"{split['train_count']:,}")
    console.metric("Score Records", f"{split['score_count']:,}")
console.end_section()

# Display chart with selected cutoff
if cutoff_analysis:
    chart_builder = ChartBuilder()
    display_figure(chart_builder.cutoff_selection_chart(
        cutoff_analysis, 
        suggested_cutoff=selected_cutoff,
        current_cutoff=registry_cutoff
    ))

#### SELECTED CUTOFF DATE  
*(i) Using data-suggested: 2017-06-29*  
Train/Score Split: **86% / 14%**  
Train Records: **26,578**  
Score Records: **4,180**

In [6]:
# pit_registry already initialized in cutoff analysis cell
dataset_name = Path(DATA_PATH).stem

# Use the user's selected cutoff (not forced by registry)
cutoff_date = selected_cutoff

# Warn if overriding registry
if registry_cutoff and registry_cutoff.date() != selected_cutoff.date():
    console.start_section()
    console.header("Registry Update")
    console.warning(f"Overriding registry cutoff ({registry_cutoff.date()}) with {selected_cutoff.date()}")
    console.info("All datasets in this project should use the same cutoff date")
    console.end_section()

preparer = UnifiedDataPreparer(OUTPUT_DIR, ts_config)
df = preparer.prepare_from_raw(raw_df, target_column=TARGET_COLUMN, entity_column=entity_column or "entity_id")
snapshot_df, snapshot_metadata = preparer.create_training_snapshot(df, cutoff_date)

pit_registry.register_snapshot(
    dataset_name=dataset_name,
    snapshot_id=snapshot_metadata['snapshot_id'],
    cutoff_date=cutoff_date,
    source_path=DATA_PATH,
    row_count=snapshot_metadata['row_count']
)

console.start_section()
console.header("Point-in-Time Snapshot Created")
console.metric("Dataset", dataset_name)
console.metric("Snapshot ID", snapshot_metadata['snapshot_id'])
console.metric("Rows", f"{snapshot_metadata['row_count']:,}")
console.metric("Features", len(snapshot_metadata['feature_columns']))
console.metric("Cutoff Date", str(cutoff_date.date()))
console.metric("Data Hash", snapshot_metadata['data_hash'][:16] + "...")

if "feature_timestamp" in df.columns:
    console.success("Temporal columns added: feature_timestamp, label_timestamp")
else:
    console.warning("No temporal columns added (synthetic strategy)")

updated_report = pit_registry.check_consistency()
if updated_report.is_consistent:
    console.success(f"All {len(pit_registry.snapshots)} datasets use cutoff: {cutoff_date.date()}")
else:
    console.error("INCONSISTENT CUTOFF DATES DETECTED")
    console.warning(f"Out of sync: {', '.join(updated_report.inconsistent_datasets)}")
    console.info("Re-run notebook 01 for out-of-sync datasets to align cutoff dates")

console.end_section()

df = snapshot_df


Column 'lastorder': 23 invalid dates coerced to NaT



#### POINT-IN-TIME SNAPSHOT CREATED  
Dataset: **customer_retention_retail**  
Snapshot ID: **training_v38**  
Rows: **26,578**  
Features: **14**  
Cutoff Date: **2017-06-29**  
Data Hash: **b38fd4f1bbda9a00...**  
[OK] Temporal columns added: feature_timestamp, label_timestamp  
[OK] All 1 datasets use cutoff: 2017-06-29

## 1.3 Dataset Exploration

Now we explore the **snapshot data** (not raw data). This ensures all visualizations and metrics reflect the actual training data with temporal integrity.

In [None]:
# Explore the snapshot data
# Note: UnifiedDataPreparer renames the target column to "target" in the snapshot
# So we use "target" as the hint, not the original TARGET_COLUMN name
explorer = DataExplorer(visualize=False, save_findings=True, output_dir=str(OUTPUT_DIR))
findings = explorer.explore(df, target_hint="target", name=dataset_name)
findings.source_path = DATA_PATH

# Store snapshot info in findings
findings.snapshot_id = snapshot_metadata['snapshot_id']
findings.snapshot_path = str(OUTPUT_DIR / "snapshots" / f"{snapshot_metadata['snapshot_id']}.parquet")
findings.timestamp_scenario = scenario
findings.timestamp_strategy = ts_config.strategy.value

# Also store the original target column name for reference
findings.metadata["original_target_column"] = TARGET_COLUMN

granularity = "event" if granularity_result.granularity == DatasetGranularity.EVENT_LEVEL else "entity"

# Display dataset overview
chart_builder = ChartBuilder()
display_figure(chart_builder.dataset_at_a_glance(
    df, findings,
    source_path=f"Snapshot: {snapshot_metadata['snapshot_id']}",
    granularity=granularity,
    max_columns=15,
    columns_per_row=5
))

## 1.4 Column Summary Table

In [8]:
# Exclude temporal metadata columns from summary
TEMPORAL_METADATA_COLS = {"feature_timestamp", "label_timestamp", "label_available_flag"}

summary_data = []
for name, col in findings.columns.items():
    if name in TEMPORAL_METADATA_COLS:
        continue
    null_pct = col.universal_metrics.get("null_percentage", 0)
    distinct = col.universal_metrics.get("distinct_count", "N/A")
    summary_data.append({
        "Column": name,
        "Type": col.inferred_type.value,
        "Confidence": f"{col.confidence:.0%}",
        "Nulls %": f"{null_pct:.1f}%",
        "Distinct": distinct,
        "Evidence": col.evidence[0] if col.evidence else ""
    })

summary_df = pd.DataFrame(summary_data)
display_table(summary_df)

Column,Type,Confidence,Nulls %,Distinct,Evidence
entity_id,identifier,90%,0.0%,26573,Column name contains identifier pattern
target,target,90%,0.0%,2,Column name contains generic target pattern 'target' with 2 classes
created,datetime,90%,0.0%,2511,100/100 values parseable as datetime
firstorder,datetime,90%,0.0%,2384,100/100 values parseable as datetime
lastorder,datetime,90%,0.0%,2210,100/100 values parseable as datetime
esent,numeric_continuous,90%,0.0%,85,Numeric with 85 unique values (>20)
eopenrate,numeric_continuous,90%,0.0%,941,Numeric with 941 unique values (>20)
eclickrate,numeric_continuous,90%,0.0%,475,Numeric with 475 unique values (>20)
avgorder,numeric_continuous,90%,0.0%,9409,Numeric with 9409 unique values (>20)
ordfreq,numeric_continuous,90%,0.0%,3484,Numeric with 3484 unique values (>20)
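
The evidence strings in the table hint at the kind of rules behind the inference: a unique-value threshold for numeric columns and a parse test on a sample for datetimes. The sketch below imitates those two rules only. The function name, the `"other"` fallback label, and the ISO date format are assumptions; the real `TypeDetector` covers more types and is not reproduced here.

```python
import pandas as pd

def infer_type_sketch(series, unique_threshold=20, sample_size=100):
    """Hypothetical helper: return (type_label, evidence) for one column."""
    if pd.api.types.is_numeric_dtype(series):
        n_unique = series.nunique()
        if n_unique > unique_threshold:
            return ("numeric_continuous",
                    f"Numeric with {n_unique} unique values (>{unique_threshold})")
        return ("other", f"Numeric with {n_unique} unique values")

    # Try to parse a sample of non-null values as ISO dates (assumed format)
    sample = series.dropna().head(sample_size).astype(str)
    parsed = pd.to_datetime(sample, format="%Y-%m-%d", errors="coerce")
    n_ok = int(parsed.notna().sum())
    if len(sample) > 0 and n_ok == len(sample):
        return ("datetime", f"{n_ok}/{len(sample)} values parseable as datetime")

    return ("other", "No rule matched")

print(infer_type_sketch(pd.Series(range(85))))
print(infer_type_sketch(pd.Series(["2017-01-10", "2016-12-01"])))
print(infer_type_sketch(pd.Series(["red", "blue"])))
```

Rules like these are cheap but fallible. For example, an integer-coded category with many levels would pass the numeric threshold, which is one reason to review inferred types before relying on them.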


## 1.5 Target Column Verification

In [9]:
console.start_section()
console.header("Target Column")

if findings.target_column and findings.target_column in df.columns:
    console.success(f"Target: {findings.target_column}")
    target_counts = df[findings.target_column].value_counts()
    for val, count in target_counts.items():
        pct = (count / len(df)) * 100
        console.metric(f"Class {val}", f"{count:,} ({pct:.1f}%)")
else:
    console.warning("No target column configured")
    console.info("Set TARGET_COLUMN in the configuration cell above")

console.end_section()

#### TARGET COLUMN  
[OK] Target: label_available_flag  
Class True: **26,578 (100.0%)**

**Note:** In this run the detected target is `label_available_flag`, a temporal metadata column with a single class (100% `True`), not the `target` column listed in the 1.4 summary table. A single-class target cannot be modeled, so confirm that `findings.target_column` points at the intended column before continuing.

## 1.6 Dataset Structure Detection

In [10]:
ts_detector = TimeSeriesDetector()
ts_characteristics = ts_detector.detect(df, entity_column=entity_column)

console.start_section()
console.header("Dataset Structure")
console.metric("Type", ts_characteristics.dataset_type.value.upper())
console.metric("Granularity", granularity_result.granularity.value.upper())
console.metric("Entity Column", entity_column or "N/A")

if granularity_result.unique_entities:
    console.metric("Unique Entities", f"{granularity_result.unique_entities:,}")
if granularity_result.avg_events_per_entity:
    console.metric("Avg Events/Entity", f"{granularity_result.avg_events_per_entity:.1f}")

is_event_level = granularity_result.granularity == DatasetGranularity.EVENT_LEVEL
if is_event_level:
    console.info("EVENT-LEVEL DATA - Use Event Bronze Track:")
    console.info("  -> 01a_temporal_deep_dive.ipynb")
    console.info("  -> 01b_temporal_quality.ipynb")
    console.info("  -> 01c_temporal_patterns.ipynb")
    console.info("  -> 01d_event_aggregation.ipynb")
else:
    console.info("ENTITY-LEVEL DATA - Use standard flow:")
    console.info("  -> 02_column_deep_dive.ipynb")
    console.info("  -> 03_quality_assessment.ipynb")

console.end_section()

#### DATASET STRUCTURE  
Type: **UNKNOWN**  
Granularity: **UNKNOWN**  
Entity Column: **custid**  
*(i) ENTITY-LEVEL DATA - Use standard flow:*  
*(i)   -> 02_column_deep_dive.ipynb*  
*(i)   -> 03_quality_assessment.ipynb*
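
Note that the cell above routes to the entity-level flow whenever granularity is anything other than `EVENT_LEVEL`, which includes the `UNKNOWN` result shown here. When detection is inconclusive, a quick manual check is to count rows per entity. The sketch below is one such check. The function name and the 1.5 threshold are assumptions, not the rule `TypeDetector.detect_granularity` applies.

```python
import pandas as pd

def granularity_sketch(frame, entity_col, event_threshold=1.5):
    """Hypothetical check: classify by average rows per entity."""
    rows_per_entity = frame.groupby(entity_col).size()
    avg_rows = float(rows_per_entity.mean())
    label = "event" if avg_rows > event_threshold else "entity"
    return label, avg_rows

# Invented examples
entity_frame = pd.DataFrame({"custid": ["a", "b", "c"], "avgorder": [10.0, 20.0, 30.0]})
event_frame = pd.DataFrame({"custid": ["a", "a", "a", "b", "b", "b"],
                            "amount": [1, 2, 3, 4, 5, 6]})

print(granularity_sketch(entity_frame, "custid"))
print(granularity_sketch(event_frame, "custid"))
```

A threshold slightly above 1 tolerates a handful of duplicate IDs in otherwise entity-level data, such as the small gap between row count and distinct `entity_id` values in the summary table.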

## 1.7 Type Override (Optional)

Override any incorrectly inferred column types before saving findings.

In [11]:
# === TYPE OVERRIDES ===
TYPE_OVERRIDES = {
    # "column_name": ColumnType.NEW_TYPE,
}

console.start_section()
console.header("Type Override Review")

low_conf = [(name, col.inferred_type.value, col.confidence) 
            for name, col in findings.columns.items() 
            if col.confidence < 0.8 and name not in TEMPORAL_METADATA_COLS]
if low_conf:
    console.subheader("Low Confidence Detections")
    for col_name, col_type, conf in sorted(low_conf, key=lambda x: x[2]):
        console.warning(f"{col_name}: {col_type} ({conf:.0%})")
else:
    console.success("All type detections have high confidence (>=80%)")

if TYPE_OVERRIDES:
    console.subheader("Applying Overrides")
    for col_name, new_type in TYPE_OVERRIDES.items():
        if col_name in findings.columns:
            old_type = findings.columns[col_name].inferred_type.value
            findings.columns[col_name].inferred_type = new_type
            findings.columns[col_name].confidence = 1.0
            console.success(f"{col_name}: {old_type} -> {new_type.value}")

console.end_section()

#### TYPE OVERRIDE REVIEW  
**Low Confidence Detections**  
[!] favday: categorical_cyclical (70%)

## 1.8 Save Findings

In [12]:
# Populate time series metadata if event-level
if is_event_level:
    findings.time_series_metadata = TimeSeriesMetadata(
        granularity=DatasetGranularity.EVENT_LEVEL,
        entity_column=entity_column,
        time_column=granularity_result.time_column or ts_characteristics.timestamp_column,
        avg_events_per_entity=granularity_result.avg_events_per_entity,
        time_span_days=int(ts_characteristics.time_span_days) if ts_characteristics.time_span_days else None,
        unique_entities=granularity_result.unique_entities,
        suggested_aggregations=["24h", "7d", "30d", "90d", "all_time"]
    )

FINDINGS_PATH = explorer.last_findings_path
findings.save(FINDINGS_PATH)

console.start_section()
console.header("Findings Saved")
console.success(f"Findings: {FINDINGS_PATH}")
console.success(f"Snapshot: {findings.snapshot_path}")
console.metric("Columns", findings.column_count)
console.metric("Target", findings.target_column or "Not set")
console.metric("Snapshot ID", findings.snapshot_id)
console.metric("Timestamp Strategy", findings.timestamp_strategy)
console.end_section()

#### FINDINGS SAVED  
[OK] Findings: ../experiments/findings/customer_retention_retail_408768_findings.yaml  
[OK] Snapshot: ../experiments/findings/snapshots/training_v38.parquet  
Columns: **18**  
Target: **label_available_flag**  
Snapshot ID: **training_v38**  
Timestamp Strategy: **production**

## 1.9 Summary

**What was created:**
- Point-in-time snapshot with `feature_timestamp` and `label_timestamp`
- Exploration findings with column types and metrics

**All downstream notebooks load the snapshot**, ensuring:
- Temporal integrity (no data leakage)
- Reproducibility (SHA256 hash verification)
- Consistency (same data across all analysis)
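
How the library computes `data_hash` is not shown in this notebook. As a minimal sketch of the idea behind hash verification, assuming a content hash over row values, a DataFrame can be fingerprinted like this (the function name is hypothetical):

```python
import hashlib
import pandas as pd

def frame_hash_sketch(frame):
    """Hypothetical helper: SHA-256 over pandas' per-row hashes."""
    row_hashes = pd.util.hash_pandas_object(frame, index=True)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()

# Invented data
original = pd.DataFrame({"custid": ["a", "b"], "avgorder": [10.0, 20.0]})
same_copy = original.copy()
modified = original.copy()
modified.loc[1, "avgorder"] = 21.0

print(frame_hash_sketch(original)[:16] + "...")
print(frame_hash_sketch(original) == frame_hash_sketch(same_copy))
print(frame_hash_sketch(original) == frame_hash_sketch(modified))
```

Comparing a stored hash against a freshly computed one is enough to detect that a snapshot file was changed after it was registered, though it does not say what changed.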

**Next steps:**
- Entity-level data: `02_column_deep_dive.ipynb`
- Event-level data: `01a_temporal_deep_dive.ipynb`