# Chapter 2: Column Deep Dive

**Purpose:** Examine each column in detail: validate value ranges, profile distributions, and recommend transformations.

**What you'll learn:**
- How to validate value ranges for different column types
- How to interpret distribution shapes (skewness, kurtosis)
- When and why to apply transformations (log, sqrt, capping)
- How to detect zero-inflation and handle it

**Outputs:**
- Value range validation results
- Per-column distribution visualizations with statistics
- Skewness/kurtosis analysis with transformation recommendations
- Zero-inflation detection
- Type confirmation/override capability
- Updated exploration findings

## 2.1 Load Previous Findings

In [1]:
from customer_retention.analysis.auto_explorer import ExplorationFindings, RecommendationRegistry
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table, console
from customer_retention.core.config.column_config import ColumnType
from customer_retention.stages.profiling import (
    DistributionAnalyzer, TransformationType,
    TemporalAnalyzer, TemporalGranularity,
    CategoricalDistributionAnalyzer, EncodingType
)
from customer_retention.stages.validation import DataValidator, RuleGenerator
import pandas as pd
import numpy as np
from scipy import stats
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from customer_retention.core.config.experiments import FINDINGS_DIR, EXPERIMENTS_DIR, OUTPUT_DIR, setup_experiments_structure


In [2]:
# === CONFIGURATION ===
# Option 1: Set the exact path from notebook 01 output
# FINDINGS_PATH = "../experiments/findings/customer_retention_retail_abc123_findings.yaml"

# Option 2: Auto-discover findings file (prefers aggregated over event-level)
from pathlib import Path
import os

# FINDINGS_DIR imported from customer_retention.core.config.experiments

# Find all findings files
findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

# Prefer aggregated findings (from 01d) over event-level findings
# This ensures notebooks 02-10 work with entity-level data
# Pattern: "_aggregated" anywhere in the filename indicates aggregated data
aggregated_files = [f for f in findings_files if "_aggregated" in f.name]
non_aggregated_files = [f for f in findings_files if "_aggregated" not in f.name]

if aggregated_files:
    # Use most recent aggregated file
    aggregated_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
    FINDINGS_PATH = str(aggregated_files[0])
    print(f"Found {len(aggregated_files)} aggregated findings file(s)")
    print(f"Using: {FINDINGS_PATH}")
    if non_aggregated_files:
        print(f"   (Skipping {len(non_aggregated_files)} event-level findings)")
else:
    # Fall back to most recent non-aggregated file
    non_aggregated_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
    FINDINGS_PATH = str(non_aggregated_files[0])
    print(f"Found {len(non_aggregated_files)} findings file(s)")
    print(f"Using: {FINDINGS_PATH}")

findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")

# Warn if this is event-level data (should run 01d first)
if findings.is_time_series and "_aggregated" not in FINDINGS_PATH:
    ts_meta = findings.time_series_metadata
    print(f"\n⚠️  WARNING: This appears to be EVENT-LEVEL data")
    print(f"   Entity: {ts_meta.entity_column}, Time: {ts_meta.time_column}")
    print(f"   Recommendation: Run 01d_event_aggregation.ipynb first to create entity-level data")

Found 1 aggregated findings file(s)
Using: ../experiments/findings/customer_emails_408768_aggregated_d24886_findings.yaml
   (Skipping 1 event-level findings)

Loaded findings for 68 columns from ../experiments/findings/customer_emails_408768_aggregated.parquet


## 2.2 Load Source Data

In [None]:
# Load data - handle aggregated parquet files directly
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

# For aggregated data, load directly from the parquet source
if "_aggregated" in FINDINGS_PATH and findings.source_path.endswith('.parquet'):
    source_path = Path(findings.source_path)
    # Relative paths that start with ".." already resolve from the notebook directory;
    # any other relative path is looked up by file name inside FINDINGS_DIR
    if not source_path.is_absolute() and not str(source_path).startswith(".."):
        source_path = FINDINGS_DIR / source_path.name
    df = pd.read_parquet(source_path)
    data_source = f"aggregated:{source_path.name}"
else:
    # Standard loading for event-level or entity-level data
    df, data_source = load_data_with_snapshot_preference(findings, output_dir=str(FINDINGS_DIR))

print(f"Loaded data from: {data_source}")
print(f"Shape: {df.shape}")

charts = ChartBuilder()

# Initialize recommendation registry for this exploration
registry = RecommendationRegistry()
registry.init_bronze(findings.source_path)

# Find target column for Gold layer initialization
target_col = next((name for name, col in findings.columns.items() if col.inferred_type == ColumnType.TARGET), None)
if target_col:
    registry.init_gold(target_col)

# Find entity column for Silver layer initialization
entity_col = next((name for name, col in findings.columns.items() if col.inferred_type == ColumnType.IDENTIFIER), None)
if entity_col:
    registry.init_silver(entity_col)

print(f"Initialized recommendation registry (Bronze: {findings.source_path})")

## 2.3 Value Range Validation

**📖 Interpretation Guide:**
- **Percentage fields** (rates): Should be 0-100 or 0-1 depending on format
- **Binary fields**: Should only contain 0 and 1
- **Count fields**: Should be non-negative integers
- **Amount fields**: Should be non-negative (unless refunds are possible)

**What to Watch For:**
- Rates > 100% suggest measurement or data entry errors
- Negative values in fields that should be positive
- Binary fields with values other than 0/1

**Actions:**
- Cap rates at 100 when they exceed it (or investigate the cause first)
- Flag records with impossible negative values
- Convert binary fields to proper 0/1 encoding
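
The checks in the guide above can be sketched with plain pandas. This is a minimal illustration, not the framework's `DataValidator` implementation: the helper `count_range_violations` and the sample column names are hypothetical, and only the rule names (`binary`, `non_negative`, `percentage`, `rate`) are taken from the validation cell in this section.

```python
import pandas as pd

def count_range_violations(series: pd.Series, rule_type: str) -> int:
    """Count non-null values that violate a simple range rule (hypothetical helper)."""
    values = series.dropna()
    if rule_type == "binary":
        invalid = ~values.isin([0, 1])
    elif rule_type == "non_negative":
        invalid = values < 0
    elif rule_type == "percentage":   # rates stored on a 0-100 scale
        invalid = (values < 0) | (values > 100)
    elif rule_type == "rate":         # rates stored on a 0-1 scale
        invalid = (values < 0) | (values > 1)
    else:
        raise ValueError(f"Unknown rule type: {rule_type}")
    return int(invalid.sum())

# Made-up sample data with one deliberate problem per rule
sample = pd.DataFrame({
    "opened_flag": [0, 1, 1, 2, None],
    "open_rate_pct": [12.5, 101.0, 0.0, 55.0, -3.0],
    "purchase_count": [3, 0, -1, 7, 2],
})

violations = {
    "opened_flag": count_range_violations(sample["opened_flag"], "binary"),
    "open_rate_pct": count_range_violations(sample["open_rate_pct"], "percentage"),
    "purchase_count": count_range_violations(sample["purchase_count"], "non_negative"),
}
print(violations)

# One possible "cap" action: clip the percentage column into [0, 100]
capped_rate = sample["open_rate_pct"].clip(lower=0, upper=100)
print(capped_rate.tolist())
```

Capping hides the underlying problem, so it is usually worth inspecting the offending rows before applying `clip`.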

In [4]:
validator = DataValidator()
range_rules = RuleGenerator.from_findings(findings)

console.start_section()
console.header("Value Range Validation")

if range_rules:
    range_results = validator.validate_value_ranges(df, range_rules)
    
    issues_found = []
    for r in range_results:
        detail = f"{r.invalid_values} invalid" if r.invalid_values > 0 else None
        console.check(f"{r.column_name} ({r.rule_type})", r.invalid_values == 0, detail)
        if r.invalid_values > 0:
            issues_found.append(r)
    
    all_invalid = sum(r.invalid_values for r in range_results)
    if all_invalid == 0:
        console.success("All value ranges valid")
    else:
        console.error(f"Found {all_invalid:,} values outside expected ranges")
        
        console.info("Examples of invalid values:")
        for r in issues_found[:3]:
            col = r.column_name
            if col in df.columns:
                if r.rule_type == 'binary':
                    # Missing values are not range violations; exclude them explicitly
                    invalid_mask = df[col].notna() & ~df[col].isin([0, 1])
                    condition = "value not in [0, 1]"
                elif r.rule_type == 'non_negative':
                    invalid_mask = df[col] < 0
                    condition = "value < 0"
                elif r.rule_type == 'percentage':
                    invalid_mask = (df[col] < 0) | (df[col] > 100)
                    condition = "value < 0 or value > 100"
                elif r.rule_type == 'rate':
                    invalid_mask = (df[col] < 0) | (df[col] > 1)
                    condition = "value < 0 or value > 1"
                else:
                    continue
                
                invalid_values = df.loc[invalid_mask, col].dropna()
                if len(invalid_values) > 0:
                    examples = invalid_values.head(5).tolist()
                    console.metric(f"  {col}", f"{examples}")
                    
                    # Add filtering recommendation
                    registry.add_bronze_filtering(
                        column=col, condition=condition, action="cap",
                        rationale=f"{r.invalid_values} values violate {r.rule_type} constraint",
                        source_notebook="02_column_deep_dive"
                    )
    
    console.info("Rules auto-generated from detected column types")
else:
    range_results = []
    console.info("No validation rules generated - no binary/numeric columns detected")

console.end_section()

#### VALUE RANGE VALIDATION  
[OK] opened_max_180d (binary)  
[OK] clicked_max_180d (binary)  
[OK] bounced_max_180d (binary)  
[OK] opened_max_365d (binary)  
[OK] clicked_max_365d (binary)  
[OK] bounced_max_365d (binary)  
[OK] opened_max_all_time (binary)  
[OK] clicked_max_all_time (binary)  
[OK] bounced_max_all_time (binary)  
[OK] All value ranges valid  
*(i) Rules auto-generated from detected column types*

## 2.4 Numeric Columns Analysis

**📖 How to Interpret These Charts:**
- **Red dashed line** = Mean (sensitive to outliers)
- **Green solid line** = Median (robust to outliers)
- **Large gap between mean and median** = Skewed distribution
- **Long right tail** = Positive skew (common in count/amount data)

**📖 Understanding Distribution Metrics**

| Metric | Interpretation | Action |
|--------|---------------|--------|
| **Skewness** | Measures asymmetry | \|skew\| > 1: Consider log transform |
| **Kurtosis** | Measures tail heaviness (reported as excess kurtosis, so a normal distribution scores about 0) | kurt > 10: Cap outliers before transform |
| **Zero %** | Percentage of zeros | > 40%: Use zero-inflation handling |
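
The three metrics in the table can be computed directly with pandas and SciPy. The sketch below is an assumption about how such values might be derived, not a copy of `DistributionAnalyzer`. The kurtosis values reported later in this section drop below zero, which is only possible for excess kurtosis, so `fisher=True` is used here. The helper name `shape_metrics` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

def shape_metrics(series: pd.Series) -> dict:
    """Return skewness, excess kurtosis, and zero percentage for a numeric column."""
    values = series.dropna().to_numpy(dtype=float)
    return {
        "skewness": float(stats.skew(values)),
        # fisher=True gives excess kurtosis (normal distribution = 0)
        "kurtosis": float(stats.kurtosis(values, fisher=True)),
        "zero_pct": float((values == 0).mean() * 100),
    }

# Synthetic zero-inflated counts: 600 zeros plus 400 strictly positive values
rng = np.random.default_rng(0)
counts = pd.Series(np.concatenate([np.zeros(600), rng.poisson(3, 400) + 1]))

metrics = shape_metrics(counts)
print(metrics)
```

The framework may use bias-corrected estimators, so its numbers can differ slightly from these on small samples.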

**📖 Transformation Decision Tree:**
1. If zeros > 40% → Create binary indicator + log(non-zeros)
2. If \|skewness\| > 1 AND kurtosis > 10 → Cap then log
3. If \|skewness\| > 1 → Log transform
4. If kurtosis > 10 → Cap outliers only
5. Otherwise → Standard scaling is sufficient
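
The decision tree above can be written as a small function. This is a simplified sketch under stated assumptions: `choose_transform` and `split_zero_inflated` are hypothetical helpers, the labels `log_transform`, `cap_outliers`, and `standard_scaling` are illustrative names, and the framework's own `recommend_transformation` evidently applies finer-grained rules (its output in this section also includes `sqrt_transform` and `yeo_johnson`).

```python
import numpy as np
import pandas as pd

def choose_transform(skewness: float, kurtosis: float, zero_pct: float) -> str:
    """Apply the five-step decision tree in order; the first matching rule wins."""
    if zero_pct > 40:
        return "zero_inflation_handling"
    if abs(skewness) > 1 and kurtosis > 10:
        return "cap_then_log"
    if abs(skewness) > 1:
        return "log_transform"       # illustrative label
    if kurtosis > 10:
        return "cap_outliers"        # illustrative label
    return "standard_scaling"        # illustrative label

def split_zero_inflated(series: pd.Series) -> pd.DataFrame:
    """Step 1 of the tree: binary indicator plus log of the non-zero part.

    Assumes non-negative values. log1p keeps zeros at 0; missing values
    are treated like zeros here, which may not suit every column.
    """
    values = series.astype(float)
    positive = values > 0
    return pd.DataFrame({
        "has_value": positive.astype(int),
        "log_value": np.log1p(values.where(positive, 0.0)),
    })

# Statistics taken from three columns reported in this section
print(choose_transform(2.14, 8.88, 57.7))   # event_count_180d
print(choose_transform(2.95, 18.06, 0.0))   # event_count_all_time
print(choose_transform(0.05, -0.17, 0.0))   # send_hour_mean_180d

features = split_zero_inflated(pd.Series([0, 0, 3, 1]))
print(features)
```

Checking the zero percentage first matters: a log transform alone cannot spread out a spike of identical zeros, so the indicator column carries that information separately.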

In [5]:
# Use framework's DistributionAnalyzer for comprehensive analysis
analyzer = DistributionAnalyzer()

numeric_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
    and name not in TEMPORAL_METADATA_COLS
]

# Analyze all numeric columns using the framework
analyses = analyzer.analyze_dataframe(df, numeric_cols)
recommendations = {col: analyzer.recommend_transformation(analysis) 
                   for col, analysis in analyses.items()}

for col_name in numeric_cols:
    col_info = findings.columns[col_name]
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"-" * 70)
    
    if analysis:
        print(f"📊 Distribution Statistics:")
        print(f"   Mean: {analysis.mean:.3f}  |  Median: {analysis.median:.3f}  |  Std: {analysis.std:.3f}")
        print(f"   Range: [{analysis.min_value:.3f}, {analysis.max_value:.3f}]")
        print(f"   Percentiles: 1%={analysis.percentiles['p1']:.3f}, 25%={analysis.q1:.3f}, 75%={analysis.q3:.3f}, 99%={analysis.percentiles['p99']:.3f}")
        print(f"\n📈 Shape Analysis:")
        skew_label = '(Right-skewed)' if analysis.skewness > 0.5 else '(Left-skewed)' if analysis.skewness < -0.5 else '(Symmetric)'
        print(f"   Skewness: {analysis.skewness:.2f} {skew_label}")
        kurt_label = '(Heavy tails/outliers)' if analysis.kurtosis > 3 else '(Light tails)'
        print(f"   Kurtosis: {analysis.kurtosis:.2f} {kurt_label}")
        print(f"   Zeros: {analysis.zero_count:,} ({analysis.zero_percentage:.1f}%)")
        print(f"   Outliers (IQR): {analysis.outlier_count_iqr:,} ({analysis.outlier_percentage:.1f}%)")
        
        if rec:
            print(f"\n🔧 Recommended Transformation: {rec.recommended_transform.value}")
            print(f"   Reason: {rec.reason}")
            print(f"   Priority: {rec.priority}")
            if rec.warnings:
                for warn in rec.warnings:
                    print(f"   ⚠️ {warn}")
    
    # Create enhanced histogram with Plotly
    data = df[col_name].dropna()
    fig = go.Figure()
    
    fig.add_trace(go.Histogram(x=data, nbinsx=50, name='Distribution',
                                marker_color='steelblue', opacity=0.7))
    
    # Calculate mean and median
    mean_val = data.mean()
    median_val = data.median()
    
    # Position labels on opposite sides (left/right) to avoid overlap
    # The larger value gets right-justified, smaller gets left-justified
    mean_position = "top right" if mean_val >= median_val else "top left"
    median_position = "top left" if mean_val >= median_val else "top right"
    
    # Add mean line
    fig.add_vline(
        x=mean_val, 
        line_dash="dash", 
        line_color="red",
        annotation_text=f"Mean: {mean_val:.2f}",
        annotation_position=mean_position,
        annotation_font_color="red",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    
    # Add median line
    fig.add_vline(
        x=median_val, 
        line_dash="solid", 
        line_color="green",
        annotation_text=f"Median: {median_val:.2f}",
        annotation_position=median_position,
        annotation_font_color="green",
        annotation_bgcolor="rgba(255,255,255,0.8)"
    )
    
    # Add 99th percentile marker if there are outliers
    if analysis and analysis.outlier_percentage > 5:
        fig.add_vline(x=analysis.percentiles['p99'], line_dash="dot", line_color="orange",
                      annotation_text=f"99th: {analysis.percentiles['p99']:.2f}",
                      annotation_position="top right",
                      annotation_font_color="orange",
                      annotation_bgcolor="rgba(255,255,255,0.8)")
    
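    transform_label = rec.recommended_transform.value if rec else "none"
    # Only include shape statistics in the subtitle when an analysis is available
    subtitle = (
        f"Skew: {analysis.skewness:.2f} | Kurt: {analysis.kurtosis:.2f} | Strategy: {transform_label}"
        if analysis else f"Strategy: {transform_label}"
    )
    fig.update_layout(
        title=f"Distribution: {col_name}<br><sub>{subtitle}</sub>",
        xaxis_title=col_name,
        yaxis_title="Count",
        template='plotly_white',
        height=400
    )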
    display_figure(fig)


Column: event_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.707  |  Median: 0.000  |  Std: 1.041
   Range: [0.000, 12.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
   Skewness: 2.14 (Right-skewed)
   Kurtosis: 8.88 (Heavy tails/outliers)
   Zeros: 2,884 (57.7%)
   Outliers (IQR): 302 (6.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (57.7%) combined with high skewness (2.14)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: event_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.481  |  Median: 1.000  |  Std: 1.733
   Range: [0.000, 25.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
   Skewness: 1.95 (Right-skewed)
   Kurtosis: 11.01 (Heavy tails/outliers)
   Zeros: 2,069 (41.4%)
   Outliers (IQR): 114 (2.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (41.4%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: event_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 14.974  |  Median: 14.000  |  Std: 8.287
   Range: [1.000, 106.000]
   Percentiles: 1%=2.000, 25%=11.000, 75%=17.000, 99%=48.030

📈 Shape Analysis:
   Skewness: 2.95 (Right-skewed)
   Kurtosis: 18.06 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 304 (6.1%)

🔧 Recommended Transformation: cap_then_log
   Reason: High skewness (2.95) with significant outliers (6.1%)
   Priority: high



Column: time_to_open_hours_sum_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.631  |  Median: 0.000  |  Std: 2.325
   Range: [0.000, 29.500]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=11.903

📈 Shape Analysis:
   Skewness: 5.47 (Right-skewed)
   Kurtosis: 38.20 (Heavy tails/outliers)
   Zeros: 4,320 (86.4%)
   Outliers (IQR): 678 (13.6%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (86.4%) combined with high skewness (5.47)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: time_to_open_hours_mean_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 3.992  |  Median: 2.975  |  Std: 3.814
   Range: [0.000, 26.500]
   Percentiles: 1%=0.100, 25%=1.300, 75%=5.500, 99%=18.851

📈 Shape Analysis:
   Skewness: 2.03 (Right-skewed)
   Kurtosis: 6.02 (Heavy tails/outliers)
   Zeros: 6 (0.9%)
   Outliers (IQR): 30 (4.4%)

🔧 Recommended Transformation: yeo_johnson
   Reason: High skewness (2.03) with non-positive values
   Priority: high



Column: time_to_open_hours_max_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 4.298  |  Median: 3.100  |  Std: 4.088
   Range: [0.000, 26.500]
   Percentiles: 1%=0.100, 25%=1.300, 75%=6.100, 99%=19.236

📈 Shape Analysis:
   Skewness: 1.80 (Right-skewed)
   Kurtosis: 4.34 (Heavy tails/outliers)
   Zeros: 6 (0.9%)
   Outliers (IQR): 26 (3.8%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (1.80)
   Priority: medium



Column: time_to_open_hours_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.157  |  Median: 0.000  |  Std: 0.427
   Range: [0.000, 6.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000

📈 Shape Analysis:
   Skewness: 3.31 (Right-skewed)
   Kurtosis: 16.31 (Heavy tails/outliers)
   Zeros: 4,314 (86.3%)
   Outliers (IQR): 684 (13.7%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (86.3%) combined with high skewness (3.31)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: send_hour_sum_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 9.568  |  Median: 0.000  |  Std: 14.501
   Range: [0.000, 164.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=16.000, 99%=57.000

📈 Shape Analysis:
   Skewness: 2.25 (Right-skewed)
   Kurtosis: 9.74 (Heavy tails/outliers)
   Zeros: 2,884 (57.7%)
   Outliers (IQR): 201 (4.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (57.7%) combined with high skewness (2.25)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: send_hour_mean_180d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 13.510  |  Median: 13.500  |  Std: 3.368
   Range: [6.000, 22.000]
   Percentiles: 1%=6.000, 25%=11.000, 75%=16.000, 99%=22.000

📈 Shape Analysis:
   Skewness: 0.05 (Symmetric)
   Kurtosis: -0.17 (Light tails)
   Zeros: 0 (0.0%)
   Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: none
   Reason: Distribution is approximately normal (skewness: 0.05)
   Priority: low



Column: send_hour_max_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 14.695  |  Median: 15.000  |  Std: 3.843
   Range: [6.000, 22.000]
   Percentiles: 1%=6.000, 25%=12.000, 75%=17.000, 99%=22.000

📈 Shape Analysis:
   Skewness: -0.17 (Symmetric)
   Kurtosis: -0.49 (Light tails)
   Zeros: 0 (0.0%)
   Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: none
   Reason: Distribution is approximately normal (skewness: -0.17)
   Priority: low



Column: send_hour_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.707  |  Median: 0.000  |  Std: 1.041
   Range: [0.000, 12.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
   Skewness: 2.14 (Right-skewed)
   Kurtosis: 8.88 (Heavy tails/outliers)
   Zeros: 2,884 (57.7%)
   Outliers (IQR): 302 (6.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (57.7%) combined with high skewness (2.14)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: opened_sum_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.157  |  Median: 0.000  |  Std: 0.427
   Range: [0.000, 6.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=2.000

📈 Shape Analysis:
   Skewness: 3.31 (Right-skewed)
   Kurtosis: 16.31 (Heavy tails/outliers)
   Zeros: 4,314 (86.3%)
   Outliers (IQR): 684 (13.7%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (86.3%) combined with high skewness (3.31)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: opened_mean_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.216  |  Median: 0.000  |  Std: 0.352
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.500, 99%=1.000

📈 Shape Analysis:
   Skewness: 1.36 (Right-skewed)
   Kurtosis: 0.35 (Light tails)
   Zeros: 1,430 (67.6%)
   Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (67.6%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: opened_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.707  |  Median: 0.000  |  Std: 1.041
   Range: [0.000, 12.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
   Skewness: 2.14 (Right-skewed)
   Kurtosis: 8.88 (Heavy tails/outliers)
   Zeros: 2,884 (57.7%)
   Outliers (IQR): 302 (6.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (57.7%) combined with high skewness (2.14)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: clicked_sum_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.049  |  Median: 0.000  |  Std: 0.225
   Range: [0.000, 3.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 4.79 (Right-skewed)
   Kurtosis: 25.28 (Heavy tails/outliers)
   Zeros: 4,761 (95.3%)
   Outliers (IQR): 237 (4.7%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (95.3%) combined with high skewness (4.79)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: clicked_mean_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.064  |  Median: 0.000  |  Std: 0.202
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 3.46 (Right-skewed)
   Kurtosis: 11.71 (Heavy tails/outliers)
   Zeros: 1,877 (88.8%)
   Outliers (IQR): 237 (11.2%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (88.8%) combined with high skewness (3.46)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: clicked_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.707  |  Median: 0.000  |  Std: 1.041
   Range: [0.000, 12.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
   Skewness: 2.14 (Right-skewed)
   Kurtosis: 8.88 (Heavy tails/outliers)
   Zeros: 2,884 (57.7%)
   Outliers (IQR): 302 (6.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (57.7%) combined with high skewness (2.14)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_sum_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.018  |  Median: 0.000  |  Std: 0.134
   Range: [0.000, 2.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 7.50 (Right-skewed)
   Kurtosis: 56.57 (Heavy tails/outliers)
   Zeros: 4,909 (98.2%)
   Outliers (IQR): 89 (1.8%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (98.2%) combined with high skewness (7.50)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_mean_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.025  |  Median: 0.000  |  Std: 0.136
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 6.04 (Right-skewed)
   Kurtosis: 37.70 (Heavy tails/outliers)
   Zeros: 2,025 (95.8%)
   Outliers (IQR): 89 (4.2%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (95.8%) combined with high skewness (6.04)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_count_180d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.707  |  Median: 0.000  |  Std: 1.041
   Range: [0.000, 12.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=4.000

📈 Shape Analysis:
   Skewness: 2.14 (Right-skewed)
   Kurtosis: 8.88 (Heavy tails/outliers)
   Zeros: 2,884 (57.7%)
   Outliers (IQR): 302 (6.0%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (57.7%) combined with high skewness (2.14)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: time_to_open_hours_sum_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.288  |  Median: 0.000  |  Std: 3.514
   Range: [0.000, 55.900]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=16.403

📈 Shape Analysis:
   Skewness: 4.85 (Right-skewed)
   Kurtosis: 37.02 (Heavy tails/outliers)
   Zeros: 3,769 (75.4%)
   Outliers (IQR): 1,229 (24.6%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (75.4%) combined with high skewness (4.85)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: time_to_open_hours_mean_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 3.941  |  Median: 2.900  |  Std: 3.613
   Range: [0.000, 26.500]
   Percentiles: 1%=0.000, 25%=1.325, 75%=5.400, 99%=17.032

📈 Shape Analysis:
   Skewness: 1.93 (Right-skewed)
   Kurtosis: 5.32 (Heavy tails/outliers)
   Zeros: 14 (1.1%)
   Outliers (IQR): 55 (4.4%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (1.93)
   Priority: medium



Column: time_to_open_hours_max_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 4.501  |  Median: 3.400  |  Std: 4.167
   Range: [0.000, 28.300]
   Percentiles: 1%=0.000, 25%=1.400, 75%=6.300, 99%=19.158

📈 Shape Analysis:
   Skewness: 1.81 (Right-skewed)
   Kurtosis: 4.53 (Heavy tails/outliers)
   Zeros: 14 (1.1%)
   Outliers (IQR): 50 (4.0%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (1.81)
   Priority: medium



Column: time_to_open_hours_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.326  |  Median: 0.000  |  Std: 0.660
   Range: [0.000, 10.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=3.000

📈 Shape Analysis:
   Skewness: 2.98 (Right-skewed)
   Kurtosis: 17.74 (Heavy tails/outliers)
   Zeros: 3,755 (75.1%)
   Outliers (IQR): 1,243 (24.9%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (75.1%) combined with high skewness (2.98)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: send_hour_sum_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 20.009  |  Median: 14.000  |  Std: 23.826
   Range: [0.000, 333.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=33.000, 99%=90.000

📈 Shape Analysis:
   Skewness: 1.94 (Right-skewed)
   Kurtosis: 10.08 (Heavy tails/outliers)
   Zeros: 2,069 (41.4%)
   Outliers (IQR): 81 (1.6%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (41.4%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: send_hour_mean_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 13.531  |  Median: 13.500  |  Std: 2.866
   Range: [6.000, 22.000]
   Percentiles: 1%=6.000, 25%=11.750, 75%=15.333, 99%=21.000

📈 Shape Analysis:
   Skewness: 0.06 (Symmetric)
   Kurtosis: 0.32 (Light tails)
   Zeros: 0 (0.0%)
   Outliers (IQR): 72 (2.5%)

🔧 Recommended Transformation: none
   Reason: Distribution is approximately normal (skewness: 0.06)
   Priority: low



Column: send_hour_max_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 15.772  |  Median: 16.000  |  Std: 3.601
   Range: [6.000, 22.000]
   Percentiles: 1%=6.000, 25%=13.000, 75%=18.000, 99%=22.000

📈 Shape Analysis:
   Skewness: -0.35 (Symmetric)
   Kurtosis: -0.25 (Light tails)
   Zeros: 0 (0.0%)
   Outliers (IQR): 0 (0.0%)

🔧 Recommended Transformation: none
   Reason: Distribution is approximately normal (skewness: -0.35)
   Priority: low



Column: send_hour_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.481  |  Median: 1.000  |  Std: 1.733
   Range: [0.000, 25.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
   Skewness: 1.95 (Right-skewed)
   Kurtosis: 11.01 (Heavy tails/outliers)
   Zeros: 2,069 (41.4%)
   Outliers (IQR): 114 (2.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (41.4%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: opened_sum_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.326  |  Median: 0.000  |  Std: 0.660
   Range: [0.000, 10.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=3.000

📈 Shape Analysis:
   Skewness: 2.98 (Right-skewed)
   Kurtosis: 17.74 (Heavy tails/outliers)
   Zeros: 3,755 (75.1%)
   Outliers (IQR): 1,243 (24.9%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (75.1%) combined with high skewness (2.98)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: opened_mean_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.214  |  Median: 0.000  |  Std: 0.302
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.333, 99%=1.000

📈 Shape Analysis:
   Skewness: 1.33 (Right-skewed)
   Kurtosis: 0.81 (Light tails)
   Zeros: 1,686 (57.6%)
   Outliers (IQR): 216 (7.4%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (57.6%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: opened_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.481  |  Median: 1.000  |  Std: 1.733
   Range: [0.000, 25.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
   Skewness: 1.95 (Right-skewed)
   Kurtosis: 11.01 (Heavy tails/outliers)
   Zeros: 2,069 (41.4%)
   Outliers (IQR): 114 (2.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (41.4%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: clicked_sum_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.101  |  Median: 0.000  |  Std: 0.332
   Range: [0.000, 4.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 3.71 (Right-skewed)
   Kurtosis: 17.20 (Heavy tails/outliers)
   Zeros: 4,534 (90.7%)
   Outliers (IQR): 464 (9.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (90.7%) combined with high skewness (3.71)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: clicked_mean_365d
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.063  |  Median: 0.000  |  Std: 0.172
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 3.28 (Right-skewed)
   Kurtosis: 11.91 (Heavy tails/outliers)
   Zeros: 2,465 (84.2%)
   Outliers (IQR): 464 (15.8%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (84.2%) combined with high skewness (3.28)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: clicked_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.481  |  Median: 1.000  |  Std: 1.733
   Range: [0.000, 25.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
   Skewness: 1.95 (Right-skewed)
   Kurtosis: 11.01 (Heavy tails/outliers)
   Zeros: 2,069 (41.4%)
   Outliers (IQR): 114 (2.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (41.4%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: bounced_sum_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.037  |  Median: 0.000  |  Std: 0.195
   Range: [0.000, 2.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=1.000

📈 Shape Analysis:
   Skewness: 5.40 (Right-skewed)
   Kurtosis: 30.14 (Heavy tails/outliers)
   Zeros: 4,820 (96.4%)
   Outliers (IQR): 178 (3.6%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (96.4%) combined with high skewness (5.40)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_mean_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.026  |  Median: 0.000  |  Std: 0.120
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.000, 99%=0.500

📈 Shape Analysis:
   Skewness: 5.94 (Right-skewed)
   Kurtosis: 39.72 (Heavy tails/outliers)
   Zeros: 2,751 (93.9%)
   Outliers (IQR): 178 (6.1%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (93.9%) combined with high skewness (5.94)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_count_365d
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.481  |  Median: 1.000  |  Std: 1.733
   Range: [0.000, 25.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=7.000

📈 Shape Analysis:
   Skewness: 1.95 (Right-skewed)
   Kurtosis: 11.01 (Heavy tails/outliers)
   Zeros: 2,069 (41.4%)
   Outliers (IQR): 114 (2.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (41.4%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: time_to_open_hours_sum_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 13.201  |  Median: 10.000  |  Std: 13.690
   Range: [0.000, 158.500]
   Percentiles: 1%=0.000, 25%=3.100, 75%=19.000, 99%=62.324

📈 Shape Analysis:
   Skewness: 2.32 (Right-skewed)
   Kurtosis: 11.04 (Heavy tails/outliers)
   Zeros: 695 (13.9%)
   Outliers (IQR): 168 (3.4%)

🔧 Recommended Transformation: yeo_johnson
   Reason: High skewness (2.32) with non-positive values
   Priority: high



Column: time_to_open_hours_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 3.934  |  Median: 3.520  |  Std: 2.490
   Range: [0.000, 29.600]
   Percentiles: 1%=0.250, 25%=2.300, 75%=5.080, 99%=12.400

📈 Shape Analysis:
   Skewness: 1.98 (Right-skewed)
   Kurtosis: 9.11 (Heavy tails/outliers)
   Zeros: 5 (0.1%)
   Outliers (IQR): 132 (3.1%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (1.98)
   Priority: medium



Column: time_to_open_hours_max_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 7.506  |  Median: 6.500  |  Std: 5.154
   Range: [0.000, 43.200]
   Percentiles: 1%=0.300, 25%=3.800, 75%=10.200, 99%=23.893

📈 Shape Analysis:
   Skewness: 1.32 (Right-skewed)
   Kurtosis: 2.96 (Light tails)
   Zeros: 5 (0.1%)
   Outliers (IQR): 124 (2.9%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (1.32)
   Priority: medium



Column: time_to_open_hours_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 3.336  |  Median: 3.000  |  Std: 2.865
   Range: [0.000, 35.000]
   Percentiles: 1%=0.000, 25%=1.000, 75%=5.000, 99%=13.000

📈 Shape Analysis:
   Skewness: 2.33 (Right-skewed)
   Kurtosis: 13.43 (Heavy tails/outliers)
   Zeros: 690 (13.8%)
   Outliers (IQR): 81 (1.6%)

🔧 Recommended Transformation: yeo_johnson
   Reason: High skewness (2.33) with non-positive values
   Priority: high



Column: send_hour_sum_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 202.492  |  Median: 195.000  |  Std: 112.777
   Range: [6.000, 1434.000]
   Percentiles: 1%=23.000, 25%=152.000, 75%=234.000, 99%=636.600

📈 Shape Analysis:
   Skewness: 2.86 (Right-skewed)
   Kurtosis: 17.18 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 356 (7.1%)

🔧 Recommended Transformation: cap_then_log
   Reason: High skewness (2.86) with significant outliers (7.1%)
   Priority: high



Column: send_hour_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 13.515  |  Median: 13.533  |  Std: 1.209
   Range: [6.000, 21.000]
   Percentiles: 1%=10.333, 25%=12.802, 75%=14.222, 99%=16.600

📈 Shape Analysis:
   Skewness: -0.24 (Symmetric)
   Kurtosis: 3.02 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 131 (2.6%)

🔧 Recommended Transformation: none
   Reason: Distribution is approximately normal (skewness: -0.24)
   Priority: low



Column: send_hour_max_all_time
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 19.732  |  Median: 20.000  |  Std: 2.184
   Range: [6.000, 22.000]
   Percentiles: 1%=13.000, 25%=19.000, 75%=22.000, 99%=22.000

📈 Shape Analysis:
   Skewness: -1.31 (Left-skewed)
   Kurtosis: 3.10 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 122 (2.4%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (-1.31)
   Priority: medium



Column: send_hour_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 14.974  |  Median: 14.000  |  Std: 8.287
   Range: [1.000, 106.000]
   Percentiles: 1%=2.000, 25%=11.000, 75%=17.000, 99%=48.030

📈 Shape Analysis:
   Skewness: 2.95 (Right-skewed)
   Kurtosis: 18.06 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 304 (6.1%)

🔧 Recommended Transformation: cap_then_log
   Reason: High skewness (2.95) with significant outliers (6.1%)
   Priority: high



Column: opened_sum_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 3.336  |  Median: 3.000  |  Std: 2.865
   Range: [0.000, 35.000]
   Percentiles: 1%=0.000, 25%=1.000, 75%=5.000, 99%=13.000

📈 Shape Analysis:
   Skewness: 2.33 (Right-skewed)
   Kurtosis: 13.43 (Heavy tails/outliers)
   Zeros: 690 (13.8%)
   Outliers (IQR): 81 (1.6%)

🔧 Recommended Transformation: yeo_johnson
   Reason: High skewness (2.33) with non-positive values
   Priority: high



Column: opened_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.205  |  Median: 0.200  |  Std: 0.133
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.115, 75%=0.286, 99%=0.533

📈 Shape Analysis:
   Skewness: 0.41 (Symmetric)
   Kurtosis: 0.64 (Light tails)
   Zeros: 690 (13.8%)
   Outliers (IQR): 39 (0.8%)

🔧 Recommended Transformation: none
   Reason: Distribution is approximately normal (skewness: 0.41)
   Priority: low



Column: opened_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 14.974  |  Median: 14.000  |  Std: 8.287
   Range: [1.000, 106.000]
   Percentiles: 1%=2.000, 25%=11.000, 75%=17.000, 99%=48.030

📈 Shape Analysis:
   Skewness: 2.95 (Right-skewed)
   Kurtosis: 18.06 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 304 (6.1%)

🔧 Recommended Transformation: cap_then_log
   Reason: High skewness (2.95) with significant outliers (6.1%)
   Priority: high



Column: clicked_sum_all_time
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 1.017  |  Median: 1.000  |  Std: 1.222
   Range: [0.000, 14.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=2.000, 99%=5.000

📈 Shape Analysis:
   Skewness: 2.07 (Right-skewed)
   Kurtosis: 8.91 (Heavy tails/outliers)
   Zeros: 2,103 (42.1%)
   Outliers (IQR): 38 (0.8%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (42.1%) combined with high skewness (2.07)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: clicked_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.063  |  Median: 0.059  |  Std: 0.070
   Range: [0.000, 0.500]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.100, 99%=0.286

📈 Shape Analysis:
   Skewness: 1.24 (Right-skewed)
   Kurtosis: 1.91 (Light tails)
   Zeros: 2,103 (42.1%)
   Outliers (IQR): 70 (1.4%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Significant zero-inflation (42.1%)
   Priority: medium
   ⚠️ Many zero values may indicate a mixture distribution



Column: clicked_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 14.974  |  Median: 14.000  |  Std: 8.287
   Range: [1.000, 106.000]
   Percentiles: 1%=2.000, 25%=11.000, 75%=17.000, 99%=48.030

📈 Shape Analysis:
   Skewness: 2.95 (Right-skewed)
   Kurtosis: 18.06 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 304 (6.1%)

🔧 Recommended Transformation: cap_then_log
   Reason: High skewness (2.95) with significant outliers (6.1%)
   Priority: high



Column: bounced_sum_all_time
Type: numeric_discrete (Confidence: 70%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.321  |  Median: 0.000  |  Std: 0.587
   Range: [0.000, 4.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=1.000, 99%=2.000

📈 Shape Analysis:
   Skewness: 2.02 (Right-skewed)
   Kurtosis: 4.80 (Heavy tails/outliers)
   Zeros: 3,658 (73.2%)
   Outliers (IQR): 39 (0.8%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (73.2%) combined with high skewness (2.02)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_mean_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 0.022  |  Median: 0.000  |  Std: 0.049
   Range: [0.000, 1.000]
   Percentiles: 1%=0.000, 25%=0.000, 75%=0.035, 99%=0.182

📈 Shape Analysis:
   Skewness: 6.85 (Right-skewed)
   Kurtosis: 103.75 (Heavy tails/outliers)
   Zeros: 3,658 (73.2%)
   Outliers (IQR): 313 (6.3%)

🔧 Recommended Transformation: zero_inflation_handling
   Reason: Zero-inflation (73.2%) combined with high skewness (6.85)
   Priority: high
   ⚠️ Consider creating a binary indicator for zeros plus log transform of non-zero values



Column: bounced_count_all_time
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 14.974  |  Median: 14.000  |  Std: 8.287
   Range: [1.000, 106.000]
   Percentiles: 1%=2.000, 25%=11.000, 75%=17.000, 99%=48.030

📈 Shape Analysis:
   Skewness: 2.95 (Right-skewed)
   Kurtosis: 18.06 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 304 (6.1%)

🔧 Recommended Transformation: cap_then_log
   Reason: High skewness (2.95) with significant outliers (6.1%)
   Priority: high



Column: days_since_last_event
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 665.750  |  Median: 246.500  |  Std: 803.342
   Range: [0.000, 2824.000]
   Percentiles: 1%=3.000, 25%=86.000, 75%=1088.000, 99%=2715.060

📈 Shape Analysis:
   Skewness: 1.22 (Right-skewed)
   Kurtosis: 0.12 (Light tails)
   Zeros: 9 (0.2%)
   Outliers (IQR): 116 (2.3%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (1.22)
   Priority: medium



Column: days_since_first_event
Type: numeric_continuous (Confidence: 90%)
----------------------------------------------------------------------
📊 Distribution Statistics:
   Mean: 2669.425  |  Median: 2719.000  |  Std: 158.137
   Range: [1498.000, 2825.000]
   Percentiles: 1%=2109.940, 25%=2603.000, 75%=2784.000, 99%=2824.000

📈 Shape Analysis:
   Skewness: -1.88 (Left-skewed)
   Kurtosis: 4.95 (Heavy tails/outliers)
   Zeros: 0 (0.0%)
   Outliers (IQR): 217 (4.3%)

🔧 Recommended Transformation: sqrt_transform
   Reason: Moderate skewness (-1.88)
   Priority: medium
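
**📖 Note on left-skewed columns:** The report above recommends `sqrt_transform` for the left-skewed columns (`send_hour_max_all_time`, `days_since_first_event`). This notebook does not show how the framework applies that transform. A plain square root compresses the right tail, so it does not reduce left skew; one common workaround is to reflect the values around their maximum first. The sketch below illustrates the idea on synthetic data. `reflect_sqrt` is a hypothetical helper, not part of `customer_retention`, and the sample is generated rather than taken from this dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def reflect_sqrt(series: pd.Series) -> pd.Series:
    """Reflect around the maximum, then take the square root.

    Hypothetical helper. Larger original values map to smaller transformed
    values, so the direction of the feature is reversed.
    """
    reflected = series.max() - series  # non-negative and right-skewed
    return np.sqrt(reflected)


# Synthetic left-skewed sample (illustrative only, not the notebook's data)
rng = np.random.default_rng(0)
sample = pd.Series(2825 - rng.gamma(shape=2.0, scale=80.0, size=5000))

skew_before = stats.skew(sample)
skew_plain_sqrt = stats.skew(np.sqrt(sample))
skew_reflected = stats.skew(reflect_sqrt(sample))
print(f"original: {skew_before:.2f}, sqrt: {skew_plain_sqrt:.2f}, "
      f"reflect+sqrt: {skew_reflected:.2f}")
```

In a real pipeline the reflection point would be fitted on the training split only, so that the same constant is reused at inference time.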


In [6]:
# Numerical Feature Statistics Table
if numeric_cols:
    stats_data = []
    for col_name in numeric_cols:
        series = df[col_name].dropna()
        if len(series) > 0:
            stats_data.append({
                "feature": col_name,
                "count": len(series),
                "mean": series.mean(),
                "std": series.std(),
                "min": series.min(),
                "25%": series.quantile(0.25),
                "50%": series.quantile(0.50),
                "75%": series.quantile(0.75),
                "95%": series.quantile(0.95),
                "99%": series.quantile(0.99),
                "max": series.max(),
                "skewness": stats.skew(series),
                "kurtosis": stats.kurtosis(series)
            })
    
    stats_df = pd.DataFrame(stats_data)
    
    # Format for display
    display_stats = stats_df.copy()
    for col in ["mean", "std", "min", "25%", "50%", "75%", "95%", "99%", "max"]:
        display_stats[col] = display_stats[col].apply(lambda x: f"{x:.3f}")
    display_stats["skewness"] = display_stats["skewness"].apply(lambda x: f"{x:.3f}")
    display_stats["kurtosis"] = display_stats["kurtosis"].apply(lambda x: f"{x:.3f}")
    
    print("=" * 80)
    print("NUMERICAL FEATURE STATISTICS")
    print("=" * 80)
    display(display_stats)

NUMERICAL FEATURE STATISTICS


| feature | count | mean | std | min | 25% | 50% | 75% | 95% | 99% | max | skewness | kurtosis |
|---------|-------|------|-----|-----|-----|-----|-----|-----|-----|-----|----------|----------|
| event_count_180d | 4998 | 0.707 | 1.041 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 4.0 | 12.0 | 2.135 | 8.867 |
| event_count_365d | 4998 | 1.481 | 1.733 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 | 7.0 | 25.0 | 1.951 | 10.997 |
| event_count_all_time | 4998 | 14.974 | 8.287 | 1.0 | 11.0 | 14.0 | 17.0 | 27.0 | 48.03 | 106.0 | 2.949 | 18.039 |
| time_to_open_hours_sum_180d | 4998 | 0.631 | 2.325 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5 | 11.903 | 29.5 | 5.472 | 38.164 |
| time_to_open_hours_mean_180d | 684 | 3.992 | 3.814 | 0.0 | 1.3 | 2.975 | 5.5 | 11.585 | 18.851 | 26.5 | 2.026 | 5.966 |
| time_to_open_hours_max_180d | 684 | 4.298 | 4.088 | 0.0 | 1.3 | 3.1 | 6.1 | 12.0 | 19.236 | 26.5 | 1.797 | 4.299 |
| time_to_open_hours_count_180d | 4998 | 0.157 | 0.427 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 6.0 | 3.306 | 16.288 |
| send_hour_sum_180d | 4998 | 9.568 | 14.501 | 0.0 | 0.0 | 0.0 | 16.0 | 38.0 | 57.0 | 164.0 | 2.252 | 9.73 |
| send_hour_mean_180d | 2114 | 13.51 | 3.368 | 6.0 | 11.0 | 13.5 | 16.0 | 19.0 | 22.0 | 22.0 | 0.049 | -0.176 |
| send_hour_max_180d | 2114 | 14.695 | 3.843 | 6.0 | 12.0 | 15.0 | 17.0 | 21.0 | 22.0 | 22.0 | -0.171 | -0.494 |


## 2.5 Distribution Summary & Transformation Plan

This table summarizes all numeric columns with their recommended transformations.

In [7]:
# Build transformation summary table
summary_data = []
for col_name in numeric_cols:
    analysis = analyses.get(col_name)
    rec = recommendations.get(col_name)
    
    if analysis and rec:
        summary_data.append({
            "Column": col_name,
            "Skewness": f"{analysis.skewness:.2f}",
            "Kurtosis": f"{analysis.kurtosis:.2f}",
            "Zeros %": f"{analysis.zero_percentage:.1f}%",
            "Outliers %": f"{analysis.outlier_percentage:.1f}%",
            "Transform": rec.recommended_transform.value,
            "Priority": rec.priority
        })
        
        # Add Gold transformation recommendation if not "none"
        if rec.recommended_transform != TransformationType.NONE and registry.gold:
            registry.add_gold_transformation(
                column=col_name,
                transform=rec.recommended_transform.value,
                parameters=rec.parameters,
                rationale=rec.reason,
                source_notebook="02_column_deep_dive"
            )

if summary_data:
    summary_df = pd.DataFrame(summary_data)
    display_table(summary_df)
    
    # Show how many transformation recommendations were added
    transform_count = sum(1 for r in recommendations.values() if r and r.recommended_transform != TransformationType.NONE)
    if transform_count > 0 and registry.gold:
        print(f"\n✅ Added {transform_count} transformation recommendations to Gold layer")
else:
    console.info("No numeric columns to summarize")

| Column | Skewness | Kurtosis | Zeros % | Outliers % | Transform | Priority |
|--------|----------|----------|---------|------------|-----------|----------|
| event_count_180d | 2.14 | 8.88 | 57.7% | 6.0% | zero_inflation_handling | high |
| event_count_365d | 1.95 | 11.01 | 41.4% | 2.3% | zero_inflation_handling | medium |
| event_count_all_time | 2.95 | 18.06 | 0.0% | 6.1% | cap_then_log | high |
| time_to_open_hours_sum_180d | 5.47 | 38.2 | 86.4% | 13.6% | zero_inflation_handling | high |
| time_to_open_hours_mean_180d | 2.03 | 6.02 | 0.9% | 4.4% | yeo_johnson | high |
| time_to_open_hours_max_180d | 1.8 | 4.34 | 0.9% | 3.8% | sqrt_transform | medium |
| time_to_open_hours_count_180d | 3.31 | 16.31 | 86.3% | 13.7% | zero_inflation_handling | high |
| send_hour_sum_180d | 2.25 | 9.74 | 57.7% | 4.0% | zero_inflation_handling | high |
| send_hour_mean_180d | 0.05 | -0.17 | 0.0% | 0.0% | none | low |
| send_hour_max_180d | -0.17 | -0.49 | 0.0% | 0.0% | none | low |



✅ Added 50 transformation recommendations to Gold layer
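
**📖 What the recommended transforms might look like:** `zero_inflation_handling` and `cap_then_log` in the table above are transformation names; this notebook does not show how the framework applies them. The sketch below follows the warning text in the per-column report (a binary indicator for zeros plus a log transform of the non-zero values) and one plausible reading of `cap_then_log`. The 99th-percentile cap, the use of `log1p`, and both function names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def zero_inflation_features(series: pd.Series, name: str) -> pd.DataFrame:
    """Split a zero-inflated, non-negative column into two features (hypothetical helper)."""
    is_zero = (series == 0).astype(int)
    # log1p maps 0 -> 0 and compresses the long right tail of the non-zero part
    log_values = np.log1p(series.clip(lower=0))
    return pd.DataFrame({f"{name}_is_zero": is_zero, f"{name}_log1p": log_values})


def cap_then_log(series: pd.Series, upper_quantile: float = 0.99) -> pd.Series:
    """Cap at an upper quantile, then apply log1p (assumed interpretation)."""
    cap = series.quantile(upper_quantile)
    return np.log1p(series.clip(upper=cap))


# Toy values, not taken from the dataset
opened = pd.Series([0, 0, 0, 1, 0, 3, 0, 10])
features = zero_inflation_features(opened, "opened_sum_365d")

counts = pd.Series([1, 11, 14, 17, 106])
capped = cap_then_log(counts, upper_quantile=0.75)
```

Keeping the indicator separate lets a model treat "no activity at all" differently from "a small amount of activity", which is the mixture-distribution concern raised in the report.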


## 2.6 Categorical Columns Analysis

**📖 Distribution Metrics (Analogues to Numeric Skewness/Kurtosis):**

| Metric | Interpretation | Action |
|--------|---------------|--------|
| **Imbalance Ratio** | Largest / Smallest category count | > 10: Consider grouping rare categories |
| **Entropy** | Diversity measure (0 = one category, higher = more uniform) | Low entropy: May need stratified sampling |
| **Top-3 Concentration** | % of data in top 3 categories | > 90%: Rare categories may cause issues |
| **Rare Category %** | Categories with < 1% of data | High %: Group into "Other" category |
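
The metrics in this table can be reproduced by hand with pandas. The sketch below is an illustration, not the `CategoricalDistributionAnalyzer` implementation: `categorical_metrics` is a hypothetical function, and the base-2 entropy is an assumption (it is consistent with the output printed later in this section, where 4 categories with entropy 1.93 are reported as 97% of the maximum).

```python
import numpy as np
import pandas as pd


def categorical_metrics(series: pd.Series, rare_threshold: float = 0.01) -> dict:
    """Compute the four distribution metrics for one categorical column."""
    counts = series.value_counts()          # sorted, most frequent first
    shares = counts / counts.sum()
    entropy = float(-(shares * np.log2(shares)).sum())
    max_entropy = np.log2(len(counts)) if len(counts) > 1 else 0.0
    return {
        "category_count": len(counts),
        "imbalance_ratio": counts.max() / counts.min(),
        "entropy": entropy,
        "normalized_entropy": entropy / max_entropy if max_entropy > 0 else 0.0,
        "top3_concentration": float(shares.head(3).sum() * 100),
        "rare_category_count": int((shares < rare_threshold).sum()),
    }


# Toy column: one dominant category and one rare one
labels = pd.Series(["A"] * 120 + ["B"] * 50 + ["C"] * 29 + ["D"] * 1)
metrics = categorical_metrics(labels)
uniform = categorical_metrics(pd.Series(list("ABCD")))
```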

**📖 Encoding Recommendations:**
- **Low cardinality (≤5)** → One-hot encoding
- **Medium cardinality (6-20)** → One-hot or Target encoding
- **High cardinality (>20)** → Target encoding or Frequency encoding
- **Cyclical (days, months)** → Sin/Cos encoding
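
As a rough sketch of the rules above, the function below maps cardinality to an encoding family and then applies two of the encodings with plain pandas. `suggest_encoding` and its return labels are hypothetical and only mirror the bullet list; they are not the framework's `EncodingType` logic. Target encoding is left out because it needs the target column and safeguards against leakage.

```python
import pandas as pd


def suggest_encoding(n_categories: int, is_cyclical: bool = False) -> str:
    """Mirror the cardinality rules from the list above (illustrative labels)."""
    if is_cyclical:
        return "sin_cos"
    if n_categories <= 5:
        return "one_hot"
    if n_categories <= 20:
        return "one_hot_or_target"
    return "target_or_frequency"


# Toy column, not taken from the dataset
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# One-hot: one indicator column per category
one_hot = pd.get_dummies(colors, prefix="color")

# Frequency encoding: replace each category by its share of rows
frequency = colors.map(colors.value_counts(normalize=True))
```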

**⚠️ Common Issues:**
- Rare categories can cause overfitting with one-hot encoding
- High cardinality + one-hot = feature explosion
- Imbalanced categories may need special handling in train/test splits
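
One way to limit the first two issues is to collapse rare categories before encoding. The sketch below assumes the same 1% threshold used for the "Rare Category %" metric; `group_rare_categories` and the `"Other"` label are illustrative choices, not framework functions.

```python
import pandas as pd


def group_rare_categories(series: pd.Series, min_share: float = 0.01,
                          other_label: str = "Other") -> pd.Series:
    """Replace categories below min_share of rows with a single label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; everything else becomes other_label
    return series.where(~series.isin(rare), other_label)


# Toy column: "D" covers 0.5% of rows
labels = pd.Series(["A"] * 120 + ["B"] * 50 + ["C"] * 29 + ["D"] * 1)
grouped = group_rare_categories(labels)
```

The set of rare categories would normally be learned on the training split and reused unchanged on validation and test data.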

In [8]:
# Use framework's CategoricalDistributionAnalyzer
cat_analyzer = CategoricalDistributionAnalyzer()

categorical_cols = [
    name for name, col in findings.columns.items()
    # TEXT columns are not in this list; they are processed separately in 02a
    if col.inferred_type in [ColumnType.CATEGORICAL_NOMINAL, ColumnType.CATEGORICAL_ORDINAL, ColumnType.CATEGORICAL_CYCLICAL]
    and name not in TEMPORAL_METADATA_COLS
]

# Analyze all categorical columns
cat_analyses = cat_analyzer.analyze_dataframe(df, categorical_cols)

# Get encoding recommendations
cyclical_cols = [name for name, col in findings.columns.items() 
                 if col.inferred_type == ColumnType.CATEGORICAL_CYCLICAL]
cat_recommendations = cat_analyzer.get_all_recommendations(df, categorical_cols, cyclical_columns=cyclical_cols)

for col_name in categorical_cols:
    col_info = findings.columns[col_name]
    analysis = cat_analyses.get(col_name)
    rec = next((r for r in cat_recommendations if r.column_name == col_name), None)
    
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print("-" * 70)
    
    if analysis:
        print(f"\n📊 Distribution Metrics:")
        print(f"   Categories: {analysis.category_count}")
        print(f"   Imbalance Ratio: {analysis.imbalance_ratio:.1f}x (largest/smallest)")
        print(f"   Entropy: {analysis.entropy:.2f} ({analysis.normalized_entropy*100:.0f}% of max)")
        print(f"   Top-1 Concentration: {analysis.top1_concentration:.1f}%")
        print(f"   Top-3 Concentration: {analysis.top3_concentration:.1f}%")
        print(f"   Rare Categories (<1%): {analysis.rare_category_count}")
        
        # Interpretation
        print(f"\n📈 Interpretation:")
        if analysis.has_low_diversity:
            print(f"   ⚠️ LOW DIVERSITY: Distribution dominated by few categories")
        elif analysis.normalized_entropy > 0.9:
            print(f"   ✓ HIGH DIVERSITY: Categories are relatively balanced")
        else:
            print(f"   ✓ MODERATE DIVERSITY: Some category dominance but acceptable")
        
        if analysis.imbalance_ratio > 100:
            print(f"   🔴 SEVERE IMBALANCE: Rarest category has very few samples")
        elif analysis.is_imbalanced:
            print(f"   🟡 MODERATE IMBALANCE: Consider grouping rare categories")
        
        # Recommendations
        if rec:
            print(f"\n🔧 Recommendations:")
            print(f"   Encoding: {rec.encoding_type.value}")
            print(f"   Reason: {rec.reason}")
            print(f"   Priority: {rec.priority}")
            
            if rec.preprocessing_steps:
                print(f"   Preprocessing:")
                for step in rec.preprocessing_steps:
                    print(f"      • {step}")
            
            if rec.warnings:
                for warn in rec.warnings:
                    print(f"   ⚠️ {warn}")
    
    # Visualization
    value_counts = df[col_name].value_counts()
    subtitle = f"Entropy: {analysis.normalized_entropy*100:.0f}% | Imbalance: {analysis.imbalance_ratio:.1f}x | Rare: {analysis.rare_category_count}" if analysis else ""
    fig = charts.bar_chart(
        value_counts.head(10).index.tolist(), 
        value_counts.head(10).values.tolist(),
        title=f"Top Categories: {col_name}<br><sub>{subtitle}</sub>"
    )
    display_figure(fig)

# Summary table and add recommendations to registry
if cat_analyses:
    print("\n" + "=" * 70)
    print("CATEGORICAL COLUMNS SUMMARY")
    print("=" * 70)
    summary_data = []
    for col_name, analysis in cat_analyses.items():
        rec = next((r for r in cat_recommendations if r.column_name == col_name), None)
        summary_data.append({
            "Column": col_name,
            "Categories": analysis.category_count,
            "Imbalance": f"{analysis.imbalance_ratio:.1f}x",
            "Entropy": f"{analysis.normalized_entropy*100:.0f}%",
            "Top-3 Conc.": f"{analysis.top3_concentration:.1f}%",
            "Rare (<1%)": analysis.rare_category_count,
            "Encoding": rec.encoding_type.value if rec else "N/A"
        })
        
        # Add encoding recommendation to Gold layer
        if rec and registry.gold:
            registry.add_gold_encoding(
                column=col_name,
                method=rec.encoding_type.value,
                rationale=rec.reason,
                source_notebook="02_column_deep_dive"
            )
    
    display_table(pd.DataFrame(summary_data))
    
    if registry.gold:
        print(f"\n✅ Added {len(cat_recommendations)} encoding recommendations to Gold layer")


Column: lifecycle_quadrant
Type: categorical_nominal (Confidence: 90%)
----------------------------------------------------------------------

📊 Distribution Metrics:
   Categories: 4
   Imbalance Ratio: 1.9x (largest/smallest)
   Entropy: 1.93 (97% of max)
   Top-1 Concentration: 32.6%
   Top-3 Concentration: 82.7%
   Rare Categories (<1%): 0

📈 Interpretation:
   ✓ HIGH DIVERSITY: Categories are relatively balanced

🔧 Recommendations:
   Encoding: one_hot
   Reason: Low cardinality (4 categories) - safe feature expansion
   Priority: low



CATEGORICAL COLUMNS SUMMARY


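| Column | Categories | Imbalance | Entropy | Top-3 Conc. | Rare (<1%) | Encoding |
|--------|------------|-----------|---------|-------------|------------|----------|
| lifecycle_quadrant | 4 | 1.9x | 97% | 82.7% | 0 | one_hot |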



✅ Added 1 encoding recommendations to Gold layer


## 2.7 Datetime Columns Analysis

**📖 Unlike numeric transformations, datetime analysis recommends NEW FEATURES to create:**

| Recommendation Type | Purpose | Examples |
|---------------------|---------|----------|
| **Feature Engineering** | Create predictive features from dates | `days_since_signup`, `tenure_years`, `month_sin_cos` |
| **Modeling Strategy** | How to structure train/test | Time-based splits when trends detected |
| **Data Quality** | Issues to address before modeling | Placeholder dates (1/1/1900) to filter |
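
The last two rows of the table can be sketched with plain pandas. This is an illustration under stated assumptions: the placeholder list contains only the 1/1/1900 example from the table, and `drop_placeholder_dates`, `time_based_split`, and the column names are hypothetical.

```python
import pandas as pd

# Assumption: only the 1900-01-01 placeholder from the table above
PLACEHOLDER_DATES = [pd.Timestamp("1900-01-01")]


def drop_placeholder_dates(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Remove rows whose date is a known placeholder value."""
    dates = pd.to_datetime(df[col], errors="coerce")
    return df[~dates.isin(PLACEHOLDER_DATES)]


def time_based_split(df: pd.DataFrame, col: str, cutoff: str):
    """Train on rows before the cutoff, test on rows at or after it.

    Rows with unparseable dates (NaT) fall into neither split.
    """
    dates = pd.to_datetime(df[col], errors="coerce")
    cutoff_ts = pd.Timestamp(cutoff)
    return df[dates < cutoff_ts], df[dates >= cutoff_ts]


# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup_date": ["1900-01-01", "2021-03-15", "2022-07-01", "2023-02-10"],
})
clean = drop_placeholder_dates(toy, "signup_date")
train, test = time_based_split(clean, "signup_date", "2022-01-01")
```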

**📖 Feature Engineering Strategies:**
- **Recency**: `days_since_X` - How recent was the event? (useful for predicting behavior)
- **Tenure**: `tenure_years` - How long has customer been active? (maturity/loyalty)
- **Duration**: `days_between_A_and_B` - Time between events (e.g., signup to first purchase)
- **Cyclical**: `month_sin`, `month_cos` - Keeps December adjacent to January, which a plain 1-12 month number does not
- **Categorical**: `is_weekend`, `is_quarter_end` - Behavioral indicators
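
The strategies above can be sketched in a few lines of pandas. The function below is a minimal illustration, not what `TemporalAnalyzer.recommend_features` generates: the column names, the output feature names, and the explicit `reference_date` argument are assumptions. Passing a fixed reference date, rather than using today's date, keeps recency and tenure reproducible.

```python
import numpy as np
import pandas as pd


def datetime_features(df: pd.DataFrame, signup_col: str, last_event_col: str,
                      reference_date: str) -> pd.DataFrame:
    """Build one example feature per strategy (hypothetical helper)."""
    signup = pd.to_datetime(df[signup_col], errors="coerce")
    last_event = pd.to_datetime(df[last_event_col], errors="coerce")
    ref = pd.Timestamp(reference_date)

    out = pd.DataFrame(index=df.index)
    out["days_since_last_event"] = (ref - last_event).dt.days            # recency
    out["tenure_years"] = (ref - signup).dt.days / 365.25                # tenure
    out["days_signup_to_last_event"] = (last_event - signup).dt.days     # duration
    angle = 2 * np.pi * (signup.dt.month - 1) / 12                       # cyclical
    out["signup_month_sin"] = np.sin(angle)
    out["signup_month_cos"] = np.cos(angle)
    out["signup_is_weekend"] = signup.dt.dayofweek >= 5                  # categorical
    return out


# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "signup_date": ["2023-01-07", "2022-12-05", "2023-07-10"],
    "last_event_date": ["2023-12-15", "2023-11-01", "2023-12-30"],
})
features = datetime_features(toy, "signup_date", "last_event_date", "2024-01-01")
```

With the sin/cos pair, the January and December rows end up close together in feature space, while January and July sit on opposite sides of the circle.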

In [9]:
from customer_retention.stages.profiling.temporal_analyzer import TemporalRecommendationType

datetime_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type == ColumnType.DATETIME
    and name not in TEMPORAL_METADATA_COLS
]

temporal_analyzer = TemporalAnalyzer()

# Store all datetime recommendations grouped by type
feature_engineering_recs = []
modeling_strategy_recs = []
data_quality_recs = []
datetime_summaries = []

for col_name in datetime_cols:
    col_info = findings.columns[col_name]
    print(f"\n{'='*70}")
    print(f"Column: {col_name}")
    print(f"Type: {col_info.inferred_type.value} (Confidence: {col_info.confidence:.0%})")
    print(f"{'='*70}")
    
    date_series = pd.to_datetime(df[col_name], errors='coerce', format='mixed')
    valid_dates = date_series.dropna()
    
    print(f"\n📅 Date Range: {valid_dates.min()} to {valid_dates.max()}")
    print(f"   Nulls: {date_series.isna().sum():,} ({date_series.isna().mean()*100:.1f}%)")
    
    # Basic temporal analysis
    analysis = temporal_analyzer.analyze(date_series)
    print(f"   Auto-detected granularity: {analysis.granularity.value}")
    print(f"   Span: {analysis.span_days:,} days ({analysis.span_days/365:.1f} years)")
    
    # Growth analysis
    growth = temporal_analyzer.calculate_growth_rate(date_series)
    if growth.get("has_data"):
        print(f"\n📈 Growth Analysis:")
        print(f"   Trend: {growth['trend_direction'].upper()}")
        print(f"   Overall growth: {growth['overall_growth_pct']:+.1f}%")
        print(f"   Avg monthly growth: {growth['avg_monthly_growth']:+.1f}%")
    
    # Seasonality analysis
    seasonality = temporal_analyzer.analyze_seasonality(date_series)
    if seasonality.has_seasonality:
        print(f"\n🔄 Seasonality Detected:")
        print(f"   Peak months: {', '.join(seasonality.peak_periods[:3])}")
        print(f"   Trough months: {', '.join(seasonality.trough_periods[:3])}")
        print(f"   Seasonal strength: {seasonality.seasonal_strength:.2f}")
    
    # Get recommendations using framework
    other_dates = [c for c in datetime_cols if c != col_name]
    recommendations = temporal_analyzer.recommend_features(date_series, col_name, other_date_columns=other_dates)
    
    # Group by recommendation type
    col_feature_recs = [r for r in recommendations if r.recommendation_type == TemporalRecommendationType.FEATURE_ENGINEERING]
    col_modeling_recs = [r for r in recommendations if r.recommendation_type == TemporalRecommendationType.MODELING_STRATEGY]
    col_quality_recs = [r for r in recommendations if r.recommendation_type == TemporalRecommendationType.DATA_QUALITY]
    
    feature_engineering_recs.extend(col_feature_recs)
    modeling_strategy_recs.extend(col_modeling_recs)
    data_quality_recs.extend(col_quality_recs)
    
    # Display recommendations grouped by type
    if col_feature_recs:
        print(f"\n🛠️ FEATURES TO CREATE:")
        for rec in col_feature_recs:
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f"   {priority_icon} {rec.feature_name} ({rec.category})")
            print(f"      Why: {rec.reason}")
            if rec.code_hint:
                print(f"      Code: {rec.code_hint}")
    
    if col_modeling_recs:
        print(f"\n⚙️ MODELING CONSIDERATIONS:")
        for rec in col_modeling_recs:
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f"   {priority_icon} {rec.feature_name}")
            print(f"      Why: {rec.reason}")
    
    if col_quality_recs:
        print(f"\n⚠️ DATA QUALITY ISSUES:")
        for rec in col_quality_recs:
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f"   {priority_icon} {rec.feature_name}")
            print(f"      Why: {rec.reason}")
            if rec.code_hint:
                print(f"      Code: {rec.code_hint}")
    
    # Standard extractions always available
    print(f"\n   Standard extractions available: year, month, day, day_of_week, quarter")
    
    # Store summary
    datetime_summaries.append({
        "Column": col_name,
        "Span (days)": analysis.span_days,
        "Seasonality": "Yes" if seasonality.has_seasonality else "No",
        "Trend": growth.get('trend_direction', 'N/A').capitalize() if growth.get("has_data") else "N/A",
        "Features to Create": len(col_feature_recs),
        "Modeling Notes": len(col_modeling_recs),
        "Quality Issues": len(col_quality_recs)
    })
    
    # === VISUALIZATIONS ===
    
    if growth.get("has_data"):
        fig = charts.growth_summary_indicators(growth, title=f"Growth Summary: {col_name}")
        display_figure(fig)
    
    chart_type = "line" if analysis.granularity in [TemporalGranularity.DAY, TemporalGranularity.WEEK] else "bar"
    fig = charts.temporal_distribution(analysis, title=f"Records Over Time: {col_name}", chart_type=chart_type)
    display_figure(fig)
    
    fig = charts.temporal_trend(analysis, title=f"Trend Analysis: {col_name}")
    display_figure(fig)
    
    yoy_data = temporal_analyzer.year_over_year_comparison(date_series)
    if len(yoy_data) > 1:
        fig = charts.year_over_year_lines(yoy_data, title=f"Year-over-Year: {col_name}")
        display_figure(fig)
        fig = charts.year_month_heatmap(yoy_data, title=f"Records Heatmap: {col_name}")
        display_figure(fig)
    
    if growth.get("has_data"):
        fig = charts.cumulative_growth_chart(growth["cumulative"], title=f"Cumulative Records: {col_name}")
        display_figure(fig)
    
    fig = charts.temporal_heatmap(date_series, title=f"Day of Week Distribution: {col_name}")
    display_figure(fig)

# === DATETIME SUMMARY ===
if datetime_summaries:
    print("\n" + "=" * 70)
    print("DATETIME COLUMNS SUMMARY")
    print("=" * 70)
    display_table(pd.DataFrame(datetime_summaries))
    
    # Summary by recommendation type
    print("\n📋 ALL RECOMMENDATIONS BY TYPE:")
    
    if feature_engineering_recs:
        print(f"\n🛠️ FEATURES TO CREATE ({len(feature_engineering_recs)}):")
        for i, rec in enumerate(feature_engineering_recs, 1):
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f"   {i}. {priority_icon} {rec.feature_name}")
    
    if modeling_strategy_recs:
        print(f"\n⚙️ MODELING CONSIDERATIONS ({len(modeling_strategy_recs)}):")
        for i, rec in enumerate(modeling_strategy_recs, 1):
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f"   {i}. {priority_icon} {rec.feature_name}: {rec.reason}")
    
    if data_quality_recs:
        print(f"\n⚠️ DATA QUALITY TO ADDRESS ({len(data_quality_recs)}):")
        for i, rec in enumerate(data_quality_recs, 1):
            priority_icon = "🔴" if rec.priority == "high" else "🟡" if rec.priority == "medium" else "✓"
            print(f"   {i}. {priority_icon} {rec.feature_name}: {rec.reason}")
    
    # Add recommendations to registry
    added_derived = 0
    added_modeling = 0
    
    # Add feature engineering recommendations to Silver layer (derived columns)
    if registry.silver:
        for rec in feature_engineering_recs:
            registry.add_silver_derived(
                column=rec.feature_name,
                expression=rec.code_hint or "",
                feature_type=rec.category,
                rationale=rec.reason,
                source_notebook="02_column_deep_dive"
            )
            added_derived += 1
    
    # Add modeling strategy recommendations to Bronze layer
    seen_strategies = set()
    for rec in modeling_strategy_recs:
        if rec.feature_name not in seen_strategies:
            registry.add_bronze_modeling_strategy(
                strategy=rec.feature_name,
                column=datetime_cols[0] if datetime_cols else "",
                parameters={"category": rec.category},
                rationale=rec.reason,
                source_notebook="02_column_deep_dive"
            )
            seen_strategies.add(rec.feature_name)
            added_modeling += 1
    
    print(f"\n✅ Added {added_derived} derived column recommendations to Silver layer")
    print(f"✅ Added {added_modeling} modeling strategy recommendations to Bronze layer")

## 2.8 Type Override (Optional)

If any column types were incorrectly inferred, you can override them here.

**Common overrides:**
- Binary columns detected as numeric → `ColumnType.BINARY`
- IDs detected as numeric → `ColumnType.IDENTIFIER`
- Ordinal categories detected as nominal → `ColumnType.CATEGORICAL_ORDINAL`
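
Before filling in the override dictionary, it can help to scan the data for likely candidates. The sketch below is a hypothetical helper, not part of the `customer_retention` package. It assumes a binary flag is a numeric column whose only values are 0 and 1, and that an identifier is an integer column with no repeated values. Both heuristics are illustrative and can misfire on small samples.

```python
import pandas as pd


def suggest_override_candidates(df: pd.DataFrame) -> dict[str, str]:
    """Flag numeric columns that look like binary flags or identifiers.

    Hypothetical helper: returns the *name* of the suggested ColumnType
    member so the result can be reviewed before editing TYPE_OVERRIDES.
    """
    suggestions = {}
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        if values.empty:
            continue
        n_unique = values.nunique()
        # Assumption: a numeric column holding only 0/1 is a binary flag
        if n_unique == 2 and set(values.unique()) <= {0, 1}:
            suggestions[col] = "BINARY"
        # Assumption: an integer column with all-distinct values is an ID
        elif n_unique == len(values) and pd.api.types.is_integer_dtype(values):
            suggestions[col] = "IDENTIFIER"
    return suggestions


# Synthetic example data, for illustration only
example = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "is_active": [1, 0, 1, 1],
    "monthly_spend": [20.5, 20.5, 31.0, 18.2],
})
candidates = suggest_override_candidates(example)
print(candidates)
```

Treat the output as a review list only: an all-distinct integer column may be a genuine measurement, so confirm each suggestion before adding it to the override dictionary.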

In [10]:
# === TYPE OVERRIDES ===
# Uncomment and modify to override any incorrectly inferred types
TYPE_OVERRIDES = {
    # "column_name": ColumnType.NEW_TYPE,
    # Examples:
    # "is_active": ColumnType.BINARY,
    # "user_id": ColumnType.IDENTIFIER,
    # "satisfaction_level": ColumnType.CATEGORICAL_ORDINAL,
}

if TYPE_OVERRIDES:
    print("Applying type overrides:")
    for col_name, new_type in TYPE_OVERRIDES.items():
        if col_name in findings.columns:
            old_type = findings.columns[col_name].inferred_type.value
            findings.columns[col_name].inferred_type = new_type
            findings.columns[col_name].confidence = 1.0
            findings.columns[col_name].evidence.append("Manually overridden")
            print(f"  {col_name}: {old_type} → {new_type.value}")
        else:
            # Surface typos instead of silently skipping unknown column names
            print(f"  ⚠️ {col_name}: not found in findings; override skipped")
else:
    print("No type overrides configured.")
    print("To override a type, add entries to TYPE_OVERRIDES dictionary above.")

No type overrides configured.
To override a type, add entries to TYPE_OVERRIDES dictionary above.


## 2.9 Data Segmentation Analysis

**Purpose:** Determine whether the dataset contains natural subgroups that might benefit from separate models.

**📖 Why This Matters:**
- Some datasets have distinct customer segments with very different behaviors
- A single model might struggle to capture patterns that vary significantly across segments
- Segmented models can improve accuracy but add maintenance complexity

**Recommendations:**
- **single_model** - Data is homogeneous; one model for all records
- **consider_segmentation** - Some variation exists; evaluate if complexity is worth it
- **strong_segmentation** - Distinct segments with different target rates; separate models likely beneficial

**Important:** This is exploratory guidance only. The final decision depends on business context, model complexity tolerance, and available resources.
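
To make "different target rates" concrete, the sketch below computes per-segment size and target rate, plus the share of target variance explained by segment membership (between-segment sum of squares divided by total sum of squares). This is one common way to quantify target variation across segments. Whether `SegmentAnalyzer` computes its target variance ratio this way is not shown in this notebook, so treat the formula, the function name, and the data as assumptions.

```python
import pandas as pd


def summarize_segments(segment_ids: pd.Series, target: pd.Series) -> tuple[pd.DataFrame, float]:
    """Return per-segment profiles and the between/total variance ratio.

    Hypothetical helper: the ratio is eta squared, an assumed definition
    that may differ from the framework's own metric.
    """
    frame = pd.DataFrame({"segment": segment_ids.to_numpy(), "target": target.to_numpy()})
    grouped = frame.groupby("segment")["target"]
    profiles = pd.DataFrame({
        "size": grouped.size(),
        "size_pct": grouped.size() / len(frame) * 100,
        "target_rate": grouped.mean(),
    })
    overall_rate = frame["target"].mean()
    # Total variation of the target around the overall rate
    total_ss = ((frame["target"] - overall_rate) ** 2).sum()
    # Variation explained by differences between segment rates
    between_ss = (profiles["size"] * (profiles["target_rate"] - overall_rate) ** 2).sum()
    ratio = float(between_ss / total_ss) if total_ss > 0 else 0.0
    return profiles, ratio


# Synthetic labels and targets, for illustration only
segments = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
churned = pd.Series([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
profiles, ratio = summarize_segments(segments, churned)
print(profiles)
print(f"Variance ratio: {ratio:.3f}")
```

A ratio near zero means segment membership says little about the target even when the per-segment rates look different, which is typical when the target is rare.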

In [11]:
from customer_retention.stages.profiling import SegmentAnalyzer

# Initialize segment analyzer
segment_analyzer = SegmentAnalyzer()

# Find target column if detected
target_col = None
for col_name, col_info in findings.columns.items():
    if col_info.inferred_type == ColumnType.TARGET:
        target_col = col_name
        break

# Run segmentation analysis using numeric features
print("="*70)
print("DATA SEGMENTATION ANALYSIS")
print("="*70)

segmentation = segment_analyzer.analyze(
    df,
    target_col=target_col,
    feature_cols=numeric_cols if numeric_cols else None,
    max_segments=5
)

print(f"\n🎯 Analysis Results:")
print(f"   Method: {segmentation.method.value}")
print(f"   Detected Segments: {segmentation.n_segments}")
print(f"   Cluster Quality Score: {segmentation.quality_score:.2f}")
if segmentation.target_variance_ratio is not None:
    print(f"   Target Variance Ratio: {segmentation.target_variance_ratio:.2f}")

print(f"\n📊 Segment Profiles:")
for profile in segmentation.profiles:
    target_info = f" | Target Rate: {profile.target_rate*100:.1f}%" if profile.target_rate is not None else ""
    print(f"   Segment {profile.segment_id}: {profile.size:,} records ({profile.size_pct:.1f}%){target_info}")

# Display recommendation card
fig = charts.segment_recommendation_card(segmentation)
display_figure(fig)

# Display segment overview
fig = charts.segment_overview(segmentation, title="Segment Overview")
display_figure(fig)

# Display feature comparison if we have features
if segmentation.n_segments > 1 and any(p.defining_features for p in segmentation.profiles):
    fig = charts.segment_feature_comparison(segmentation, title="Feature Comparison Across Segments")
    display_figure(fig)

print(f"\n📝 Rationale:")
for reason in segmentation.rationale:
    print(f"   • {reason}")

DATA SEGMENTATION ANALYSIS

🎯 Analysis Results:
   Method: kmeans
   Detected Segments: 2
   Cluster Quality Score: 0.64
   Target Variance Ratio: 0.00

📊 Segment Profiles:
   Segment 0: 118 records (17.3%) | Target Rate: 3.4%
   Segment 1: 566 records (82.7%) | Target Rate: 0.9%



📝 Rationale:
   • Moderate cluster quality (silhouette: 0.64)
   • Low target rate variation (0.00)
   • All segments have sufficient size (min: 17.3%)


## 2.10 Save Updated Findings

In [12]:
# Save updated findings back to the same file
findings.save(FINDINGS_PATH)
print(f"Updated findings saved to: {FINDINGS_PATH}")

# Save recommendations registry
import yaml
# str() keeps this a text substitution even if FINDINGS_PATH is a pathlib.Path
recommendations_path = str(FINDINGS_PATH).replace("_findings.yaml", "_recommendations.yaml")
with open(recommendations_path, "w") as f:
    yaml.dump(registry.to_dict(), f, default_flow_style=False, sort_keys=False)
print(f"Recommendations saved to: {recommendations_path}")

# Summary of recommendations
all_recs = registry.all_recommendations
print(f"\n📋 Recommendations Summary:")
print(f"   Bronze layer: {len(registry.get_by_layer('bronze'))} recommendations")
print(f"   Silver layer: {len(registry.get_by_layer('silver'))} recommendations")
print(f"   Gold layer: {len(registry.get_by_layer('gold'))} recommendations")
print(f"   Total: {len(all_recs)} recommendations")

Updated findings saved to: ../experiments/findings/customer_emails_408768_aggregated_d24886_findings.yaml
Recommendations saved to: ../experiments/findings/customer_emails_408768_aggregated_d24886_recommendations.yaml

📋 Recommendations Summary:
   Bronze layer: 0 recommendations
   Silver layer: 0 recommendations
   Gold layer: 51 recommendations
   Total: 51 recommendations


---

## Summary: What We Learned

In this notebook, we performed a deep-dive analysis that covered:

1. **Value Range Validation** - Validated rates, binary fields, and non-negative constraints
2. **Numeric Distribution Analysis** - Calculated skewness, kurtosis, and percentiles with transformation recommendations
3. **Categorical Distribution Analysis** - Calculated imbalance ratio, entropy, and concentration with encoding recommendations
4. **Datetime Analysis** - Analyzed seasonality, trends, and patterns with feature engineering recommendations
5. **Data Segmentation** - Evaluated whether natural subgroups exist that might benefit from separate models

## Key Metrics Reference

**Numeric Columns:**

| Metric | Threshold | Action |
|--------|-----------|--------|
| Skewness | \|skew\| > 1 | Log transform |
| Kurtosis | > 10 | Cap outliers first |
| Zero % | > 40% | Zero-inflation handling |
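
The thresholds in the table can be applied with plain pandas. The sketch below is a minimal illustration, not the `DistributionAnalyzer` implementation: the function name and action labels are hypothetical. Note that `Series.kurt()` reports excess kurtosis (a normal distribution scores 0); whether the notebook's threshold of 10 uses the same convention is an assumption.

```python
import pandas as pd


def recommend_numeric_actions(
    series: pd.Series,
    skew_threshold: float = 1.0,
    kurtosis_threshold: float = 10.0,
    zero_pct_threshold: float = 40.0,
) -> list[str]:
    """Map the reference thresholds to action labels (hypothetical names)."""
    values = series.dropna()
    actions = []
    # Share of exact zeros, expressed as a percentage
    if (values == 0).mean() * 100 > zero_pct_threshold:
        actions.append("zero_inflation_handling")
    # Heavy tails: cap outliers before any other transform
    if values.kurt() > kurtosis_threshold:
        actions.append("cap_outliers")
    # Strong asymmetry in either direction
    if abs(values.skew()) > skew_threshold:
        actions.append("log_transform")
    return actions


# Synthetic series, for illustration only
zero_heavy = pd.Series([0] * 6 + [5, 6, 7, 8])
outlier_heavy = pd.Series([1, 2, 3, 4, 5] * 5 + [10_000])

print(recommend_numeric_actions(zero_heavy))
print(recommend_numeric_actions(outlier_heavy))
```

The actions are returned in the order they would usually be applied, with capping ahead of the log transform as the table suggests.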

**Categorical Columns:**

| Metric | Threshold | Action |
|--------|-----------|--------|
| Imbalance Ratio | > 10x | Group rare categories |
| Entropy | < 50% | Stratified sampling |
| Rare Categories | > 0 | Group into "Other" |
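
The categorical metrics can be approximated as follows. This sketch assumes definitions that the notebook does not spell out: imbalance ratio as the largest category count divided by the smallest, entropy as Shannon entropy normalized by its maximum for the observed number of categories, and a rare category as one below a hypothetical 5% share. `CategoricalDistributionAnalyzer` may define these differently.

```python
import numpy as np
import pandas as pd


def categorical_balance_metrics(series: pd.Series, rare_threshold_pct: float = 5.0) -> dict:
    """Compute imbalance ratio, normalized entropy, and rare categories.

    Hypothetical helper; expects at least one non-null value.
    """
    counts = series.dropna().value_counts()
    shares = counts / counts.sum()
    # Shannon entropy in bits; value_counts never returns zero counts
    entropy = float(-(shares * np.log2(shares)).sum())
    # Maximum entropy is reached when all categories are equally frequent
    max_entropy = float(np.log2(len(counts))) if len(counts) > 1 else 1.0
    return {
        "imbalance_ratio": float(counts.max() / counts.min()),
        "normalized_entropy_pct": entropy / max_entropy * 100,
        "rare_categories": sorted(shares[shares * 100 < rare_threshold_pct].index),
    }


# Synthetic column, for illustration only
plan = pd.Series(["basic"] * 90 + ["plus"] * 9 + ["enterprise"] * 1)
metrics = categorical_balance_metrics(plan)
print(metrics)
```

In this example all three table rows would trigger: the ratio exceeds 10x, normalized entropy falls below 50%, and one category is rare.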

**Datetime Columns:**

| Finding | Action |
|---------|--------|
| Seasonality | Add cyclical month encoding |
| Strong trend | Time-based train/test split |
| Multiple dates | Calculate duration features |
| Placeholder dates | Filter or flag |

## Transformation & Encoding Summary

Review the summary tables above for:
- **Numeric**: Which columns need log transforms, capping, or zero-inflation handling
- **Categorical**: Which encoding to use and whether to group rare categories
- **Datetime**: Which temporal features to engineer based on detected patterns

---

## Next Steps

Continue to **03_quality_assessment.ipynb** to:
- Analyze duplicate records and value conflicts
- Deep dive into missing value patterns
- Analyze outliers with IQR method
- Check data consistency
- Get cleaning recommendations

Or jump to **05_feature_opportunities.ipynb** if you want to see derived feature recommendations.