# Chapter 6: Feature Opportunities

**Purpose:** Identify and implement feature engineering opportunities to improve model performance.

**What you'll learn:**
- How to derive time-based features (tenure, recency, active period)
- How to create composite engagement scores
- How to segment customers based on behavior patterns
- How to encode categorical variables effectively

**Outputs:**
- Derived feature recommendations with code examples
- Composite score formulas (engagement, service adoption)
- Customer segmentation rules
- Categorical encoding strategies

---

## Why Feature Engineering Matters

| Feature Type | Business Meaning | Predictive Power |
|-------------|-----------------|------------------|
| **Tenure** | How long the customer has been with us | Loyalty indicator |
| **Recency** | Days since last order | Engagement/churn signal |
| **Engagement Score** | Combined email metrics | Overall engagement level |
| **Segments** | High/Low value × Frequent/Infrequent | Risk stratification |

## 6.1 Setup

In [1]:
from customer_retention.analysis.auto_explorer import ExplorationFindings, RecommendationEngine, RecommendationRegistry
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table
from customer_retention.core.config.column_config import ColumnType
from customer_retention.stages.features import CustomerSegmenter, SegmentationType
from customer_retention.stages.profiling import FeatureCapacityAnalyzer
import yaml
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [2]:
# === CONFIGURATION ===
# Option 1: Set the exact path from notebook 01 output
# FINDINGS_PATH = "../experiments/findings/customer_retention_retail_abc123_findings.yaml"

# Option 2: Auto-discover the most recent findings file
from pathlib import Path

FINDINGS_DIR = Path("../experiments/findings")

findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])
RECOMMENDATIONS_PATH = FINDINGS_PATH.replace("_findings.yaml", "_recommendations.yaml")

print(f"Found {len(findings_files)} findings file(s)")
print(f"Using: {FINDINGS_PATH}")

findings = ExplorationFindings.load(FINDINGS_PATH)

# Load data with snapshot preference (uses temporal snapshots if available)
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS
df, data_source = load_data_with_snapshot_preference(findings, output_dir="../experiments/findings")
charts = ChartBuilder()

if Path(RECOMMENDATIONS_PATH).exists():
    with open(RECOMMENDATIONS_PATH, "r") as f:
        registry = RecommendationRegistry.from_dict(yaml.safe_load(f))
    print(f"Loaded existing recommendations: {len(registry.all_recommendations)} total")
else:
    registry = RecommendationRegistry()
    print("Initialized new recommendation registry")

# Ensure all layers are initialized (even if loaded from file)
if not registry.bronze:
    registry.init_bronze(findings.source_path)
if not registry.silver:
    registry.init_silver(findings.entity_column or "entity_id")
if not registry.gold:
    registry.init_gold(findings.target_column or "target")
    print("  Initialized gold layer for feature engineering recommendations")

print(f"\nLoaded {len(df):,} rows from: {data_source}")

FileNotFoundError: No findings files found in ../experiments/findings. Run notebook 01 first.

## 6.2 Automated Feature Recommendations

In [None]:
recommender = RecommendationEngine()
feature_recs = recommender.recommend_features(findings)

print(f"Found {len(feature_recs)} feature engineering opportunities:\n")

for rec in feature_recs:
    print(f"{rec.feature_name}")
    print(f"  Source: {rec.source_column}")
    print(f"  Type: {rec.feature_type}")
    print(f"  Priority: {rec.priority}")
    print(f"  Description: {rec.description}")
    print()

## 6.3 Feature Capacity Analysis

**📖 Understanding Feature-to-Data Ratios**

Before creating new features, check how many features your data can reliably support. This analysis uses the **Events Per Variable (EPV)** principle, where EPV is the number of minority-class events divided by the number of candidate features:

| EPV Level | Risk Level | Recommendations |
|-----------|------------|-----------------|
| **EPV ≥ 20** | Low risk | Stable coefficients, reliable inference |
| **EPV = 10-20** | Moderate | Standard practice, consider regularization |
| **EPV = 5-10** | Elevated | Strong regularization required (L1/Lasso) |
| **EPV < 5** | High risk | Reduce features or collect more data |
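
As a rough illustration of the table above, the sketch below computes EPV and the three feature budgets for a binary target. The function name `estimate_capacity`, the floor-division budgets, and the example counts are assumptions made for illustration; `FeatureCapacityAnalyzer` may compute these values differently.

```python
from dataclasses import dataclass


@dataclass
class CapacityEstimate:
    events: int          # size of the minority class
    epv: float           # events per variable at the current feature count
    risk: str            # risk band from the EPV table
    max_features: dict   # feature budget at each EPV target


def estimate_capacity(n_positive: int, n_negative: int, n_features: int) -> CapacityEstimate:
    """Apply the EPV rule of thumb to a binary target (illustrative helper)."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    events = min(n_positive, n_negative)  # the minority class limits capacity
    epv = events / n_features
    if epv >= 20:
        risk = "low"
    elif epv >= 10:
        risk = "moderate"
    elif epv >= 5:
        risk = "elevated"
    else:
        risk = "high"
    # Assumed budgets: how many features keep EPV at or above 20, 10, and 5.
    budget = {
        "conservative": events // 20,
        "moderate": events // 10,
        "aggressive": events // 5,
    }
    return CapacityEstimate(events, epv, risk, budget)


# Hypothetical dataset: 10,000 customers, 800 churned, 50 candidate features.
estimate = estimate_capacity(n_positive=800, n_negative=9_200, n_features=50)
print(f"EPV = {estimate.epv:.1f} ({estimate.risk} risk)")
print(estimate.max_features)
```

With these made-up counts the minority class has 800 events, so 50 features land in the moderate band and the EPV=10 budget is 80 features.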

**Key Assumptions:**
1. **Minority class drives capacity**: For classification, the smaller class limits feature count
2. **Correlated features are redundant**: Highly correlated features (|r| > 0.8) count as ~1 effective feature
3. **Model type matters**: Tree-based models tolerate more features at a given EPV than unregularized linear models
4. **Regularization helps**: L1/L2 penalties allow more features with less data
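
One simple way to picture assumption 2 is to group features whose absolute correlation exceeds the threshold and count each group once. The sketch below does this with a small union-find; the helper `correlated_clusters`, the column names, and the synthetic data are all hypothetical, and the analyzer's own `effective_count` may be derived by a different method.

```python
import numpy as np
import pandas as pd


def correlated_clusters(df: pd.DataFrame, threshold: float = 0.8) -> list[list[str]]:
    """Group columns whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    parent = {c: c for c in cols}  # union-find forest keyed by column name

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps trees shallow
            c = parent[c]
        return c

    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                parent[find(a)] = find(b)  # merge the two groups

    clusters: dict[str, list[str]] = {}
    for c in cols:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())


rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "total_spend": base,
    "total_spend_usd": base * 1.1 + rng.normal(scale=0.05, size=500),  # near-duplicate
    "visits": rng.normal(size=500),                                    # independent
})
clusters = correlated_clusters(demo)
print(f"{demo.shape[1]} features -> {len(clusters)} effective features")
```

Each cluster counts as roughly one effective feature, so the two spend columns collapse into one and the capacity check sees two features instead of three.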

**📊 What This Analysis Provides:**
- Recommended feature counts (conservative/moderate/aggressive)
- Effective feature count after removing redundancy
- Model complexity guidance (linear vs tree-based)
- Segment-specific capacity for multi-model strategies

In [None]:
# Feature Capacity Analysis
capacity_analyzer = FeatureCapacityAnalyzer()

# Get all potential feature columns (excluding target and identifiers)
feature_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [
        ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE,
        ColumnType.CATEGORICAL_NOMINAL, ColumnType.CATEGORICAL_ORDINAL,
        ColumnType.BINARY
    ] and name != findings.target_column
    and name not in TEMPORAL_METADATA_COLS
]

print("=" * 80)
print("FEATURE CAPACITY ANALYSIS")
print("=" * 80)

if findings.target_column:
    # Analyze capacity with current features
    numeric_features = [
        name for name, col in findings.columns.items()
        if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
        and name != findings.target_column
    ]
    
    capacity_result = capacity_analyzer.analyze(
        df,
        feature_cols=numeric_features,
        target_col=findings.target_column,
    )
    
    print(f"\n📊 DATA SUMMARY:")
    print(f"   Total samples: {capacity_result.total_samples:,}")
    print(f"   Minority class samples: {capacity_result.minority_class_samples:,}")
    print(f"   Minority class rate: {capacity_result.minority_class_samples/capacity_result.total_samples:.1%}")
    print(f"   Current numeric features: {capacity_result.total_features}")
    
    print(f"\n📈 FEATURE CAPACITY METRICS:")
    print(f"   Events Per Variable (EPV): {capacity_result.events_per_variable:.1f}")
    print(f"   Samples Per Feature: {capacity_result.samples_per_feature:.1f}")
    print(f"   Capacity Status: {capacity_result.capacity_status.upper()}")
    
    # Capacity status visualization
    status_colors = {"adequate": "#2ecc71", "limited": "#f39c12", "inadequate": "#e74c3c"}
    status_color = status_colors.get(capacity_result.capacity_status, "#95a5a6")
    
    print(f"\n🎯 RECOMMENDED FEATURE COUNTS:")
    print(f"   Conservative (EPV=20): {capacity_result.recommended_features_conservative} features")
    print(f"   Moderate (EPV=10):     {capacity_result.recommended_features_moderate} features")
    print(f"   Aggressive (EPV=5):    {capacity_result.recommended_features_aggressive} features")
    
    # Effective features analysis
    if capacity_result.effective_features_result:
        eff = capacity_result.effective_features_result
        print(f"\n🔍 EFFECTIVE FEATURES (accounting for correlation):")
        print(f"   Total features analyzed: {eff.total_count}")
        print(f"   Effective independent features: {eff.effective_count:.1f}")
        print(f"   Redundant features identified: {len(eff.redundant_features)}")
        
        if eff.redundant_features:
            print(f"\n   ⚠️ Redundant features (highly correlated):")
            for feat in eff.redundant_features[:5]:
                print(f"      • {feat}")
        
        if eff.feature_clusters:
            print(f"\n   📦 Correlated feature clusters ({len(eff.feature_clusters)}):")
            for i, cluster in enumerate(eff.feature_clusters[:3]):
                print(f"      Cluster {i+1}: {', '.join(cluster[:4])}")
                if len(cluster) > 4:
                    print(f"                  ... and {len(cluster)-4} more")
    
    # Persist feature capacity to registry
    registry.add_bronze_feature_capacity(
        epv=capacity_result.events_per_variable,
        capacity_status=capacity_result.capacity_status,
        recommended_features=capacity_result.recommended_features_moderate,
        current_features=capacity_result.total_features,
        rationale=f"EPV={capacity_result.events_per_variable:.1f}, status={capacity_result.capacity_status}",
        source_notebook="06_feature_opportunities"
    )
    print(f"\n✅ Persisted feature capacity recommendation to registry")
    
    # Store capacity info in findings
    findings.metadata["feature_capacity"] = capacity_result.to_dict()
else:
    print("\n⚠️ No target column detected. Capacity analysis requires a target variable.")

### 6.3.1 Model Complexity Guidance

Based on your data capacity, here's guidance on model complexity and feature limits.

In [None]:
# Model Complexity Guidance
if findings.target_column and 'capacity_result' in dir():
    guidance = capacity_result.complexity_guidance
    
    print("=" * 70)
    print("MODEL COMPLEXITY GUIDANCE")
    print("=" * 70)
    
    # Create visualization of feature limits by model type
    model_types = ["Linear\n(no regularization)", "Regularized\n(L1/L2)", "Tree-based\n(RF/XGBoost)"]
    max_features = [guidance.max_features_linear, guidance.max_features_regularized, guidance.max_features_tree]
    current_features = capacity_result.total_features
    
    colors = ['#e74c3c' if m < current_features else '#2ecc71' for m in max_features]
    
    fig = go.Figure()
    
    fig.add_trace(go.Bar(
        x=model_types,
        y=max_features,
        marker_color=colors,
        text=[f"{m}" for m in max_features],
        textposition='outside',
        name='Max Features'
    ))
    
    # Add horizontal line for current feature count
    fig.add_hline(
        y=current_features,
        line_dash="dash",
        line_color="#3498db",
        annotation_text=f"Current: {current_features}",
        annotation_position="right"
    )
    
    # Calculate y-axis range to fit labels
    max_val = max(max_features)
    fig.update_layout(
        title="Maximum Recommended Features by Model Type",
        xaxis_title="Model Type",
        yaxis_title="Max Features",
        yaxis_range=[0, max_val * 1.15],  # Add 15% headroom for labels
        template='plotly_white',
        height=400,
        showlegend=False,
    )
    
    display_figure(fig)
    
    print(f"\n🎯 RECOMMENDED MODEL TYPE: {guidance.recommended_model_type.replace('_', ' ').title()}")
    
    print("\n📋 MODEL-SPECIFIC RECOMMENDATIONS:")
    for rec in guidance.model_recommendations:
        print(f"   • {rec}")
    
    print("\n💡 GENERAL GUIDANCE:")
    for rec in guidance.recommendations:
        print(f"   {rec}")
    
    # Summary table
    print("\n" + "-" * 70)
    print("FEATURE BUDGET SUMMARY:")
    print("-" * 70)
    summary_data = {
        "Model Type": ["Linear (no regularization)", "Regularized (L1/L2)", "Tree-based"],
        "Max Features": [guidance.max_features_linear, guidance.max_features_regularized, guidance.max_features_tree],
        "Current": [current_features] * 3,
        "Status": [
            "✅ OK" if guidance.max_features_linear >= current_features else "⚠️ Reduce",
            "✅ OK" if guidance.max_features_regularized >= current_features else "⚠️ Reduce",
            "✅ OK" if guidance.max_features_tree >= current_features else "⚠️ Reduce"
        ]
    }
    display(pd.DataFrame(summary_data))
    
    # Persist model type recommendation to registry
    registry.add_bronze_model_type(
        model_type=guidance.recommended_model_type,
        max_features_linear=guidance.max_features_linear,
        max_features_regularized=guidance.max_features_regularized,
        max_features_tree=guidance.max_features_tree,
        rationale=f"Recommended: {guidance.recommended_model_type}",
        source_notebook="06_feature_opportunities"
    )
    print(f"\n✅ Persisted model type recommendation to registry: {guidance.recommended_model_type}")

### 6.3.2 Segment-Specific Capacity (for Multi-Model Strategy)

When considering **separate models per customer segment**, each segment must have sufficient data to support the feature set. This analysis shows whether segmented modeling is viable.

**📖 Single Model vs Segment Models:**

| Approach | When to Use | Pros | Cons |
|----------|------------|------|------|
| **Single Model** | Small data, uniform segments | More data per model, simpler | May miss segment-specific patterns |
| **Segment Models** | Large data, distinct segments | Tailored patterns | Need sufficient data per segment |
| **Hybrid** | Mixed segment sizes | Best of both | More complex to maintain |

In [None]:
# Segment Capacity Analysis
categorical_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.CATEGORICAL_NOMINAL, ColumnType.CATEGORICAL_ORDINAL]
]

print("=" * 70)
print("SEGMENT CAPACITY ANALYSIS")
print("=" * 70)

if findings.target_column and categorical_cols and 'numeric_features' in dir():
    # Analyze the first categorical column as potential segment
    segment_col = categorical_cols[0]
    
    print(f"\n📊 Analyzing segments by: {segment_col}")
    print(f"   Features to evaluate: {len(numeric_features)}")
    
    segment_result = capacity_analyzer.analyze_segment_capacity(
        df,
        feature_cols=numeric_features,
        target_col=findings.target_column,
        segment_col=segment_col,
    )
    
    print(f"\n🎯 RECOMMENDED STRATEGY: {segment_result.recommended_strategy.replace('_', ' ').title()}")
    print(f"   Reason: {segment_result.strategy_reason}")
    
    # Segment details table
    segment_data = []
    for seg_name, cap in segment_result.segment_capacities.items():
        segment_data.append({
            "Segment": seg_name,
            "Samples": cap.total_samples,
            "Minority Events": cap.minority_class_samples,
            "EPV": f"{cap.events_per_variable:.1f}",
            "Max Features (EPV=10)": cap.recommended_features_moderate,
            "Status": cap.capacity_status.title()
        })
    
    segment_df = pd.DataFrame(segment_data)
    segment_df = segment_df.sort_values("Samples", ascending=False)
    display(segment_df)
    
    # Visualization
    fig = go.Figure()
    
    max_events = 0
    for seg_name, cap in segment_result.segment_capacities.items():
        color = "#2ecc71" if cap.capacity_status == "adequate" else "#f39c12" if cap.capacity_status == "limited" else "#e74c3c"
        fig.add_trace(go.Bar(
            name=seg_name,
            x=[seg_name],
            y=[cap.minority_class_samples],
            marker_color=color,
            text=[f"EPV={cap.events_per_variable:.1f}"],
            textposition='outside'
        ))
        max_events = max(max_events, cap.minority_class_samples)
    
    # Add threshold line
    threshold_events = len(numeric_features) * 10  # EPV=10 threshold
    fig.add_hline(
        y=threshold_events,
        line_dash="dash",
        line_color="#3498db",
        annotation_text=f"Min events for {len(numeric_features)} features (EPV=10)",
        annotation_position="right"
    )
    
    # Calculate y-axis range to fit labels
    y_max = max(max_events, threshold_events)
    fig.update_layout(
        title=f"Minority Class Events by Segment ({segment_col})",
        xaxis_title="Segment",
        yaxis_title="Minority Class Events",
        yaxis_range=[0, y_max * 1.15],  # Add 15% headroom for labels
        template='plotly_white',
        height=400,
        showlegend=False,
    )
    display_figure(fig)
    
    print("\n📋 SEGMENT RECOMMENDATIONS:")
    for rec in segment_result.recommendations:
        print(f"   {rec}")
    
    if segment_result.viable_segments:
        print(f"\n   ✅ Viable for separate models: {', '.join(segment_result.viable_segments)}")
    if segment_result.insufficient_segments:
        print(f"   ⚠️ Insufficient data: {', '.join(segment_result.insufficient_segments)}")
    
    # Store in findings
    findings.metadata["segment_capacity"] = segment_result.to_dict()
else:
    print("\n⚠️ Segment capacity analysis skipped.")
    print("   It requires a target column and at least one categorical column.")

### 6.3.3 Feature Capacity Action Items

Based on the analysis above, here are the key considerations for feature engineering:

In [None]:
# Feature Capacity Action Items Summary
if findings.target_column and 'capacity_result' in dir():
    print("=" * 70)
    print("FEATURE CAPACITY ACTION ITEMS")
    print("=" * 70)
    
    print("\n📋 BASED ON YOUR DATA CAPACITY:")
    
    # Action items based on capacity status
    if capacity_result.capacity_status == "adequate":
        print("\n✅ ADEQUATE CAPACITY - You have room to add features")
        print(f"   • Current features: {capacity_result.total_features}")
        print(f"   • Can add up to: {capacity_result.recommended_features_moderate - capacity_result.total_features} more features (EPV=10)")
        print(f"   • Consider: Creating derived features from datetime and categorical columns")
    elif capacity_result.capacity_status == "limited":
        print("\n⚠️ LIMITED CAPACITY - Be selective with new features")
        print(f"   • Current features: {capacity_result.total_features}")
        print(f"   • Recommended max: {capacity_result.recommended_features_moderate} features (EPV=10)")
        print(f"   • Action: Remove {max(0, capacity_result.total_features - capacity_result.recommended_features_moderate)} redundant features before adding new ones")
        print(f"   • Consider: Using regularization (L1/Lasso) if keeping all features")
    else:
        print("\n🔴 INADEQUATE CAPACITY - Reduce features or get more data")
        print(f"   • Current features: {capacity_result.total_features}")
        print(f"   • Recommended max: {capacity_result.recommended_features_moderate} features (EPV=10)")
        print(f"   • CRITICAL: Reduce to {capacity_result.recommended_features_conservative} features for stable estimates")
        print(f"   • Options: (1) Feature selection, (2) PCA, (3) Collect more data")
    
    # Redundancy recommendations
    if capacity_result.effective_features_result and capacity_result.effective_features_result.redundant_features:
        redundant = capacity_result.effective_features_result.redundant_features
        print(f"\n🔄 REDUNDANT FEATURES TO CONSIDER REMOVING:")
        print(f"   These features are highly correlated with others and add little new information:")
        for feat in redundant[:5]:
            print(f"   • {feat}")
        if len(redundant) > 5:
            print(f"   ... and {len(redundant) - 5} more")
    
    # New feature budget
    print("\n💰 FEATURE BUDGET FOR NEW FEATURES:")
    remaining_budget = capacity_result.recommended_features_moderate - capacity_result.total_features
    if remaining_budget > 0:
        print(f"   You can safely add {remaining_budget} new features")
        print("   Prioritize:")
        print("   • Recency features (days_since_last_activity)")
        print("   • Tenure features (days_since_created)")
        print("   • Engagement composites (email_engagement_score)")
    else:
        print(f"   ⚠️ At or over capacity. Remove {-remaining_budget} features before adding new ones.")
    
    # Model selection summary
    print("\n🎯 RECOMMENDED MODELING APPROACH:")
    if capacity_result.complexity_guidance:
        print(f"   Model type: {capacity_result.complexity_guidance.recommended_model_type.replace('_', ' ').title()}")
        if "regularized" in capacity_result.complexity_guidance.recommended_model_type:
            print("   → Use Lasso (L1) for automatic feature selection")
            print("   → Use Ridge (L2) if you want to keep all features")
        elif "tree" in capacity_result.complexity_guidance.recommended_model_type:
            print("   → Random Forest or XGBoost recommended")
            print("   → Trees handle correlated features naturally")
    
    print("\n" + "=" * 70)

## 6.4 Datetime Feature Opportunities

In [None]:
datetime_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type == ColumnType.DATETIME
]

if datetime_cols:
    print("Datetime Feature Opportunities:")
    print("="*50)
    for col in datetime_cols:
        print(f"\n{col}:")
        print(f"  - {col}_year: Extract year")
        print(f"  - {col}_month: Extract month")
        print(f"  - {col}_day: Extract day of month")
        print(f"  - {col}_dayofweek: Extract day of week (0-6)")
        print(f"  - {col}_is_weekend: Is weekend flag")
        print(f"  - days_since_{col}: Days since date")
else:
    print("No datetime columns found.")
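
The cell above only lists candidate features. As a minimal sketch of how they could be materialized with pandas `.dt` accessors, assuming a hypothetical `signup_date` column and a fixed reference date (the helper name `add_datetime_features` is illustrative, not part of the package):

```python
import pandas as pd


def add_datetime_features(df: pd.DataFrame, col: str, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Add calendar parts and a days-since feature for one datetime column."""
    out = df.copy()
    ts = pd.to_datetime(out[col], errors="coerce")  # unparseable values become NaT
    out[f"{col}_year"] = ts.dt.year
    out[f"{col}_month"] = ts.dt.month
    out[f"{col}_day"] = ts.dt.day
    out[f"{col}_dayofweek"] = ts.dt.dayofweek       # Monday=0 ... Sunday=6
    out[f"{col}_is_weekend"] = ts.dt.dayofweek >= 5  # NaT rows evaluate to False
    out[f"days_since_{col}"] = (reference_date - ts).dt.days
    return out


demo = pd.DataFrame({"signup_date": ["2024-01-06", "2024-03-15"]})
features = add_datetime_features(demo, "signup_date", pd.Timestamp("2024-04-01"))
print(features.iloc[0].to_dict())
```

Using a fixed reference date rather than the current time keeps `days_since_*` reproducible across runs.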

## 6.5 Business-Driven Derived Features

These features are based on domain knowledge from the reference analysis (my_take Phase 1).

**📖 Key Derived Features:**
- **Tenure Days**: Days from account creation to analysis date
- **Days Since Last Order**: Recency indicator (critical for churn)
- **Active Period Days**: Duration of customer activity
- **Email Engagement Score**: Composite of open rate and click rate
- **Click-to-Open Ratio**: Quality of email engagement
- **Service Adoption Score**: Sum of service flags (paperless, refill, doorstep)
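
The formulas behind this list can be sketched in plain pandas. The column names (`created`, `lastorder`, `eopenrate`, `eclickrate`) and values are hypothetical, the 0.6/0.4 weights follow the expression this notebook records in the registry, and taking the active period as last order minus creation date is an assumption; `CustomerSegmenter` may implement these differently.

```python
import numpy as np
import pandas as pd

# Hypothetical two-customer frame; column names are illustrative only.
demo = pd.DataFrame({
    "created": pd.to_datetime(["2023-01-01", "2023-06-01"]),
    "lastorder": pd.to_datetime(["2023-12-01", "2023-06-15"]),
    "eopenrate": [40.0, 0.0],
    "eclickrate": [10.0, 0.0],
    "paperless": [1, 0],
    "refill": [1, 0],
    "doorstep": [0, 1],
})
reference_date = demo[["created", "lastorder"]].max().max()  # latest date seen in the data

derived = pd.DataFrame(index=demo.index)
derived["tenure_days"] = (reference_date - demo["created"]).dt.days
derived["days_since_last_order"] = (reference_date - demo["lastorder"]).dt.days
derived["active_period_days"] = (demo["lastorder"] - demo["created"]).dt.days
derived["email_engagement_score"] = 0.6 * demo["eopenrate"] + 0.4 * demo["eclickrate"]
# Guard the ratio: a zero open rate yields 0 instead of NaN or inf.
opens = demo["eopenrate"].replace(0, np.nan)
derived["click_to_open_rate"] = (demo["eclickrate"] / opens).fillna(0.0)
derived["service_adoption_score"] = demo[["paperless", "refill", "doorstep"]].sum(axis=1)
print(derived)
```

Anchoring the reference date to the latest date in the data, as the next cell also does, avoids inflating tenure and recency when the notebook is rerun later.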

In [None]:
print("=" * 70)
print("CREATING DERIVED FEATURES")
print("=" * 70)

segmenter = CustomerSegmenter()
df_features = df.copy()

datetime_cols = [name for name, col in findings.columns.items() 
                 if col.inferred_type == ColumnType.DATETIME
                 and name not in TEMPORAL_METADATA_COLS]
binary_cols = [name for name, col in findings.columns.items() 
               if col.inferred_type == ColumnType.BINARY
               and name not in TEMPORAL_METADATA_COLS]
numeric_cols = [name for name, col in findings.columns.items() 
                if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
                and name not in TEMPORAL_METADATA_COLS]

for col in datetime_cols:
    df_features[col] = pd.to_datetime(df_features[col], errors='coerce', format='mixed')

reference_date = pd.Timestamp.now()
if datetime_cols:
    last_dates = [df_features[col].max() for col in datetime_cols if df_features[col].notna().any()]
    if last_dates:
        reference_date = max(last_dates)
print(f"\nReference date: {reference_date}")

print("\n📅 TIME-BASED FEATURES:")
created_cols = [c for c in datetime_cols if 'creat' in c.lower() or 'signup' in c.lower() or 'register' in c.lower()]
if created_cols:
    created_col = created_cols[0]
    df_features = segmenter.create_tenure_features(df_features, created_column=created_col, reference_date=reference_date)
    print(f"  ✓ tenure_days from {created_col}")
    registry.add_silver_derived(
        column="tenure_days",
        expression=f"(reference_date - {created_col}).days",
        feature_type="tenure",
        rationale=f"Customer tenure in days from {created_col}",
        source_notebook="06_feature_opportunities"
    )

activity_cols = [c for c in datetime_cols if 'last' in c.lower() or 'recent' in c.lower()]
if activity_cols:
    activity_col = activity_cols[0]
    df_features = segmenter.create_recency_features(df_features, last_activity_column=activity_col, 
                                                     reference_date=reference_date, output_column='days_since_last_activity')
    print(f"  ✓ days_since_last_activity from {activity_col}")
    registry.add_silver_derived(
        column="days_since_last_activity",
        expression=f"(reference_date - {activity_col}).days",
        feature_type="recency",
        rationale=f"Days since last activity from {activity_col}",
        source_notebook="06_feature_opportunities"
    )

print("\n📧 ENGAGEMENT FEATURES:")
rate_cols = [c for c in numeric_cols if 'rate' in c.lower() or 'pct' in c.lower() or 'percent' in c.lower()]
open_rate_cols = [c for c in rate_cols if 'open' in c.lower()]
click_rate_cols = [c for c in rate_cols if 'click' in c.lower()]

if open_rate_cols and click_rate_cols:
    open_col, click_col = open_rate_cols[0], click_rate_cols[0]
    df_features = segmenter.create_engagement_score(df_features, open_rate_column=open_col, 
                                                     click_rate_column=click_col, output_column='email_engagement_score')
    print(f"  ✓ email_engagement_score from {open_col}, {click_col}")
    registry.add_silver_derived(
        column="email_engagement_score",
        expression=f"0.6 * {open_col} + 0.4 * {click_col}",
        feature_type="composite",
        rationale=f"Weighted engagement score from {open_col} and {click_col}",
        source_notebook="06_feature_opportunities"
    )
    
    df_features['click_to_open_rate'] = np.where(df_features[open_col] > 0, df_features[click_col] / df_features[open_col], 0)
    print(f"  ✓ click_to_open_rate")
    registry.add_silver_ratio(
        column="click_to_open_rate",
        numerator=click_col,
        denominator=open_col,
        rationale=f"Click-to-open ratio: {click_col} / {open_col}",
        source_notebook="06_feature_opportunities"
    )

print("\n🔧 SERVICE ADOPTION:")
if binary_cols:
    service_binary = [c for c in binary_cols if c != findings.target_column]
    if service_binary:
        df_features['service_adoption_score'] = df_features[service_binary].sum(axis=1)
        print(f"  ✓ service_adoption_score from {service_binary}")
        registry.add_silver_derived(
            column="service_adoption_score",
            expression=f"sum([{', '.join(service_binary)}])",
            feature_type="composite",
            rationale=f"Service adoption count from {len(service_binary)} binary flags",
            source_notebook="06_feature_opportunities"
        )

print("\n💰 VALUE FEATURES:")
value_cols = [c for c in numeric_cols if 'order' in c.lower() or 'amount' in c.lower() or 'value' in c.lower() or 'avg' in c.lower()]
freq_cols = [c for c in numeric_cols if 'freq' in c.lower() or 'count' in c.lower()]
if value_cols and freq_cols:
    df_features['value_frequency_product'] = df_features[value_cols[0]] * df_features[freq_cols[0]]
    print(f"  ✓ value_frequency_product from {value_cols[0]}, {freq_cols[0]}")
    registry.add_silver_interaction(
        column="value_frequency_product",
        features=[value_cols[0], freq_cols[0]],
        rationale=f"Value-frequency interaction: {value_cols[0]} × {freq_cols[0]}",
        source_notebook="06_feature_opportunities"
    )

new_cols = len(df_features.columns) - len(df.columns)
print(f"\n✓ Created {new_cols} new features (total: {len(df_features.columns)})")
print(f"✅ Persisted {len([c for c in ['tenure_days', 'days_since_last_activity', 'email_engagement_score', 'click_to_open_rate', 'service_adoption_score', 'value_frequency_product'] if c in df_features.columns])} derived feature recommendations to registry")

## 6.6 Customer Segmentation Features

Create business-meaningful segments for analysis and modeling.

**📖 Segmentation Strategy:**
- **Value Dimension**: High vs Low (based on avgorder median)
- **Frequency Dimension**: Frequent vs Infrequent (based on ordfreq median)
- **Recency Buckets**: Active, Recent, Lapsing, Dormant
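
A minimal sketch of this strategy, assuming median splits on `avgorder` and `ordfreq` and recency cut points of 30, 90, and 180 days. The cut points, the segment label format, and the sample values are assumptions for illustration; `CustomerSegmenter` may use different thresholds and names.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; values are illustrative only.
demo = pd.DataFrame({
    "avgorder": [20.0, 80.0, 25.0, 90.0],
    "ordfreq": [0.1, 0.5, 0.6, 0.05],
    "days_since_last_order": [10, 45, 120, 400],
})

# Median split on each dimension, then combine the two labels.
value = (demo["avgorder"] > demo["avgorder"].median()).map({True: "High", False: "Low"})
freq = (demo["ordfreq"] > demo["ordfreq"].median()).map({True: "Frequent", False: "Infrequent"})
demo["customer_segment"] = value + "_" + freq

# Recency buckets with right-closed intervals: (-inf, 30], (30, 90], (90, 180], (180, inf).
demo["recency_bucket"] = pd.cut(
    demo["days_since_last_order"],
    bins=[-np.inf, 30, 90, 180, np.inf],  # assumed cut points, in days
    labels=["Active", "Recent", "Lapsing", "Dormant"],
)
print(demo[["customer_segment", "recency_bucket"]])
```

Median splits give roughly balanced segments by construction, which helps each segment keep enough events for the capacity checks in section 6.3.2.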

In [None]:
print("=" * 70)
print("CUSTOMER SEGMENTATION")
print("=" * 70)

print("\n🎯 VALUE-FREQUENCY SEGMENTS:")
value_cols = [c for c in numeric_cols if 'order' in c.lower() or 'amount' in c.lower() or 'value' in c.lower() or 'avg' in c.lower()]
freq_cols = [c for c in numeric_cols if 'freq' in c.lower() or 'count' in c.lower()]

if value_cols and freq_cols:
    df_features, vf_result = segmenter.segment_by_value_frequency(
        df_features, value_column=value_cols[0], frequency_column=freq_cols[0])
    print(f"  Using {value_cols[0]} × {freq_cols[0]}")
    for seg in vf_result.segments:
        print(f"    {seg.name}: {seg.count:,} ({seg.percentage:.1f}%)")
else:
    print("  No suitable value/frequency columns found")

print("\n📅 RECENCY SEGMENTS:")
if 'days_since_last_activity' in df_features.columns:
    df_features, recency_result = segmenter.segment_by_recency(df_features, days_since_column='days_since_last_activity')
    for seg in recency_result.segments:
        print(f"    {seg.name}: {seg.count:,} ({seg.percentage:.1f}%)")
else:
    print("  No recency column available")

print("\n📧 ENGAGEMENT SEGMENTS:")
if 'email_engagement_score' in df_features.columns:
    max_score = df_features['email_engagement_score'].max()
    if max_score > 0:
        df_features['engagement_normalized'] = df_features['email_engagement_score'] / max_score
        df_features, eng_result = segmenter.segment_by_engagement(df_features, engagement_column='engagement_normalized')
        for seg in eng_result.segments:
            print(f"    {seg.name}: {seg.count:,} ({seg.percentage:.1f}%)")
        df_features = df_features.drop(columns=['engagement_normalized'])
else:
    print("  No engagement score available")

if 'customer_segment' in df_features.columns and findings.target_column and findings.target_column in df_features.columns:
    target = findings.target_column
    segment_retention = df_features.groupby('customer_segment')[target].mean() * 100
    
    max_rate = segment_retention.max()
    fig = go.Figure(go.Bar(
        x=segment_retention.index, y=segment_retention.values,
        marker_color=['#2ca02c' if r > 70 else '#ffbb00' if r > 50 else '#d62728' for r in segment_retention.values],
        text=[f'{r:.1f}%' for r in segment_retention.values], textposition='outside'))
    fig.update_layout(
        title='Retention Rate by Customer Segment', 
        xaxis_title='Segment', 
        yaxis_title='Retention Rate (%)',
        yaxis_range=[0, max_rate * 1.15],  # Add 15% headroom for labels
        template='plotly_white', 
        height=400,
    )
    display_figure(fig)

segment_cols = [c for c in df_features.columns if 'segment' in c.lower() or 'bucket' in c.lower()]
print(f"\n✓ Created {len(segment_cols)} segmentation features")

## 6.7 Numeric Transformation Opportunities

In [None]:
numeric_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.NUMERIC_CONTINUOUS, ColumnType.NUMERIC_DISCRETE]
    and name not in TEMPORAL_METADATA_COLS
]

transform_count = 0
if numeric_cols:
    print("Numeric Transformation Opportunities:")
    print("="*50)
    
    for col_name in numeric_cols:
        col_info = findings.columns[col_name]
        series = df[col_name].dropna()
        skewness = series.skew()
        
        print(f"\n{col_name}:")
        print(f"  Skewness: {skewness:.2f}")
        
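        if abs(skewness) > 1:
            # Note: a log transform needs strictly positive values and only corrects right skew;
            # check the sign of the skewness and the column minimum before applying it.
            print(f"  Recommendation: Apply log transform (highly skewed)")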
            registry.add_gold_transformation(
                column=col_name,
                transform="log",
                parameters={"skewness": float(skewness), "reason": "highly_skewed"},
                rationale=f"Log transform for highly skewed distribution (skewness={skewness:.2f})",
                source_notebook="06_feature_opportunities"
            )
            transform_count += 1
        elif abs(skewness) > 0.5:
            print(f"  Recommendation: Consider sqrt transform (moderately skewed)")
            registry.add_gold_transformation(
                column=col_name,
                transform="sqrt",
                parameters={"skewness": float(skewness), "reason": "moderately_skewed"},
                rationale=f"Sqrt transform for moderately skewed distribution (skewness={skewness:.2f})",
                source_notebook="06_feature_opportunities"
            )
            transform_count += 1
        else:
            print(f"  Recommendation: Standard scaling sufficient")
            registry.add_gold_scaling(
                column=col_name,
                method="standard",
                rationale=f"Standard scaling for approximately symmetric column (skewness={skewness:.2f})",
                source_notebook="06_feature_opportunities"
            )
            transform_count += 1
        
        if col_info.inferred_type == ColumnType.NUMERIC_CONTINUOUS:
            print(f"  Binning: Consider creating bins for {col_name}_binned")
    
    print(f"\n✅ Persisted {transform_count} transformation recommendations to registry")
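
The recommendations above only record which transform to use. As a rough sketch of what applying them could look like, the snippet below reuses the same skewness cutoffs (1 and 0.5). The helpers `suggest_transform` and `apply_transform` and the toy column are hypothetical, and `np.log1p` is swapped in for a plain log so that zeros stay finite; it still assumes non-negative input.

```python
import numpy as np
import pandas as pd


def suggest_transform(series: pd.Series) -> str:
    """Return 'log', 'sqrt', or 'standard' using the skewness cutoffs from the cell above."""
    skewness = series.dropna().skew()
    if abs(skewness) > 1:
        return "log"
    if abs(skewness) > 0.5:
        return "sqrt"
    return "standard"


def apply_transform(series: pd.Series, transform: str) -> pd.Series:
    """Apply the suggested transform; log and sqrt assume non-negative values."""
    if transform == "log":
        return np.log1p(series)  # log(1 + x), so zeros map to zero
    if transform == "sqrt":
        return np.sqrt(series)
    # Standard scaling: zero mean, unit sample standard deviation
    return (series - series.mean()) / series.std()


# Hypothetical right-skewed column with one large outlier
order_value = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 250.0])
choice = suggest_transform(order_value)
transformed = apply_transform(order_value, choice)
print(choice, round(order_value.skew(), 2), round(transformed.skew(), 2))
```

Comparing the skewness before and after is a quick way to confirm that the chosen transform actually helped for a given column.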

## 6.8 Categorical Encoding Opportunities

In [None]:
categorical_cols = [
    name for name, col in findings.columns.items()
    if col.inferred_type in [ColumnType.CATEGORICAL_NOMINAL, ColumnType.CATEGORICAL_ORDINAL]
    and name not in TEMPORAL_METADATA_COLS
]

encoding_count = 0
if categorical_cols:
    print("Categorical Encoding Recommendations:")
    print("="*50)
    
    for col_name in categorical_cols:
        col_info = findings.columns[col_name]
        distinct = col_info.universal_metrics.get("distinct_count", 0)
        
        print(f"\n{col_name}: ({distinct} unique values)")
        
        if distinct <= 5:
            print(f"  Recommendation: One-hot encoding")
            registry.add_gold_encoding(
                column=col_name,
                method="onehot",
                rationale=f"One-hot encoding for low cardinality ({distinct} unique values)",
                source_notebook="06_feature_opportunities"
            )
            encoding_count += 1
        elif distinct <= 20:
            print(f"  Recommendation: Target encoding or one-hot with frequency threshold")
            registry.add_gold_encoding(
                column=col_name,
                method="target",
                rationale=f"Target encoding for medium cardinality ({distinct} unique values)",
                source_notebook="06_feature_opportunities"
            )
            encoding_count += 1
        else:
            print(f"  Recommendation: Target encoding or embedding (high cardinality)")
            registry.add_gold_encoding(
                column=col_name,
                method="target",
                rationale=f"Target encoding for high cardinality ({distinct} unique values)",
                source_notebook="06_feature_opportunities"
            )
            encoding_count += 1
        
        if col_info.inferred_type == ColumnType.CATEGORICAL_ORDINAL:
            print(f"  Note: Consider ordinal encoding to preserve order")
    
    print(f"\n✅ Persisted {encoding_count} encoding recommendations to registry")
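
To make the two registered methods concrete, here is a small sketch of one-hot and target encoding in pandas. The helper names, the smoothing constant, and the toy columns `city` and `retained` are assumptions for illustration; the smoothed-mean formula is one common variant of target encoding, not necessarily the one the downstream pipeline uses.

```python
import pandas as pd


def choose_encoding(distinct: int) -> str:
    """Mirror the cutoff used above: one-hot up to 5 levels, target encoding otherwise."""
    return "onehot" if distinct <= 5 else "target"


def target_encode(df: pd.DataFrame, column: str, target: str, smoothing: float = 10.0) -> pd.Series:
    """Smoothed mean-target encoding: shrink each category mean toward the global mean."""
    global_mean = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return df[column].map(smoothed)


# Hypothetical toy data; `city` and `retained` are illustrative names
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "retained": [1, 0, 1, 1, 1, 0],
})
method = choose_encoding(toy["city"].nunique())
one_hot = pd.get_dummies(toy["city"], prefix="city")
toy["city_target_enc"] = target_encode(toy, "city", "retained")
```

Because target encoding uses the label, it should be fitted on training folds only and then applied to validation and test data; fitting it on the full dataset leaks the target into the features.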

---

## Summary: What We Learned

In this notebook, we identified feature engineering opportunities and analyzed data capacity:

### Feature Capacity Analysis
1. **Events Per Variable (EPV)** - Calculated the data's capacity to support features
2. **Effective Features** - Identified redundant features due to high correlation
3. **Model Complexity Guidance** - Determined appropriate model types based on data size
4. **Segment Capacity** - Evaluated whether segmented modeling is viable

### Feature Engineering
5. **Automated Recommendations** - Framework suggested feature opportunities
6. **Time-Based Features** - Created tenure, recency, active period metrics
7. **Engagement Scores** - Built composite email engagement metrics
8. **Customer Segments** - Created value-frequency and recency-based segments
9. **Encoding Strategies** - Recommended an encoding method for each categorical column based on its cardinality

## Feature Capacity Key Concepts

| Metric | What It Means | Rule of Thumb |
|--------|---------------|---------------|
| **EPV ≥ 20** | Stable, reliable estimates | Conservative, regulatory-grade |
| **EPV = 10-20** | Standard practice | Use for most applications |
| **EPV = 5-10** | Limited capacity | Requires strong regularization |
| **EPV < 5** | High risk | Reduce features or get more data |
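
As a quick illustration of how the bands above are applied, the sketch below uses the common definition of EPV: the count of the rarer outcome class divided by the number of candidate features. The function names and the example counts are hypothetical, and `FeatureCapacityAnalyzer` may compute the figure differently.

```python
def events_per_variable(n_positive: int, n_negative: int, n_features: int) -> float:
    """EPV = count of the rarer outcome class divided by the number of candidate features."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return min(n_positive, n_negative) / n_features


def capacity_band(epv: float) -> str:
    """Map an EPV value onto the bands in the table above."""
    if epv >= 20:
        return "stable"
    if epv >= 10:
        return "standard"
    if epv >= 5:
        return "limited"
    return "high risk"


# Hypothetical dataset: 800 retained, 200 churned customers, 25 candidate features
epv = events_per_variable(n_positive=800, n_negative=200, n_features=25)
print(f"EPV = {epv:.1f} -> {capacity_band(epv)}")  # prints "EPV = 8.0 -> limited"
```

In this example, dropping to 10 features would raise EPV to 20 and move the model into the most conservative band.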

## Key Derived Features Created

| Feature | Formula | Business Meaning |
|---------|---------|-----------------|
| `tenure_days` | reference_date - created | Customer longevity |
| `days_since_last_order` | reference_date - lastorder | Recency/engagement |
| `email_engagement_score` | 0.6×openrate + 0.4×clickrate | Overall engagement |
| `service_adoption_score` | paperless + refill + doorstep | Service utilization |
| `customer_segment` | Value × Frequency quadrant | Customer type |
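
The formulas in the table translate almost directly into pandas. The sketch below assumes the source columns are named exactly as in the formulas (`created`, `lastorder`, `openrate`, `clickrate`, `paperless`, `refill`, `doorstep`) and uses a made-up `reference_date`; both the toy rows and the snapshot date are hypothetical.

```python
import pandas as pd

# Hypothetical rows; column names follow the formulas in the table above
customers = pd.DataFrame({
    "created": pd.to_datetime(["2023-01-01", "2023-06-15"]),
    "lastorder": pd.to_datetime(["2023-12-01", "2023-07-01"]),
    "openrate": [50.0, 10.0],
    "clickrate": [20.0, 0.0],
    "paperless": [1, 0],
    "refill": [1, 0],
    "doorstep": [0, 1],
})
reference_date = pd.Timestamp("2024-01-01")  # assumed snapshot date

# Time-based features: whole days between the snapshot and each event
customers["tenure_days"] = (reference_date - customers["created"]).dt.days
customers["days_since_last_order"] = (reference_date - customers["lastorder"]).dt.days

# Composite scores: weighted email engagement and a count of adopted services
customers["email_engagement_score"] = 0.6 * customers["openrate"] + 0.4 * customers["clickrate"]
customers["service_adoption_score"] = customers[["paperless", "refill", "doorstep"]].sum(axis=1)
```

Fixing `reference_date` to a single snapshot, rather than using the current date, keeps the time-based features reproducible across runs.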

---

## Next Steps

Continue to **07_modeling_readiness.ipynb** to:
- Validate data is ready for modeling
- Check for data leakage
- Assess class imbalance
- Review feature completeness

In [None]:
print("Potential Interaction Features:")
print("="*50)

if len(numeric_cols) >= 2:
    print("\nNumeric Interactions:")
    for i, col1 in enumerate(numeric_cols[:3]):
        for col2 in numeric_cols[i+1:4]:
            print(f"  - {col1}_x_{col2}: Multiplication")
            print(f"  - {col1}_div_{col2}: Division (if {col2} > 0)")

if categorical_cols and numeric_cols:
    print("\nCategorical-Numeric Interactions:")
    for cat_col in categorical_cols[:2]:
        for num_col in numeric_cols[:2]:
            print(f"  - {num_col}_by_{cat_col}_mean: Group mean")
            print(f"  - {num_col}_by_{cat_col}_std: Group std")
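
The cell above only lists candidate names. A minimal sketch of how such interaction features might be materialized follows; the toy frame and its column names are hypothetical, and the division replaces zero denominators with NaN rather than dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
orders = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "avg_order": [10.0, 30.0, 20.0, 40.0],
    "order_count": [2, 0, 4, 5],
})

# Numeric x numeric: product, and a division guarded against zero denominators
orders["avg_order_x_order_count"] = orders["avg_order"] * orders["order_count"]
safe_denominator = orders["order_count"].where(orders["order_count"] != 0, np.nan)
orders["avg_order_div_order_count"] = orders["avg_order"] / safe_denominator

# Categorical x numeric: broadcast per-group statistics back onto each row
grouped = orders.groupby("city")["avg_order"]
orders["avg_order_by_city_mean"] = grouped.transform("mean")
orders["avg_order_by_city_std"] = grouped.transform("std")
```

Group statistics computed on the full dataset can leak information across train and test splits, so in a modeling pipeline they would normally be fitted on the training portion only.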

## 6.9 Feature Summary Table

In [None]:
feature_summary = []
for rec in feature_recs:
    feature_summary.append({
        "Feature Name": rec.feature_name,
        "Source": rec.source_column,
        "Type": rec.feature_type,
        "Priority": rec.priority
    })

if feature_summary:
    summary_df = pd.DataFrame(feature_summary)
    display(summary_df)

---

## Next Steps

Continue to **07_modeling_readiness.ipynb** to validate data is ready for modeling.

In [None]:
# Save recommendations
with open(RECOMMENDATIONS_PATH, "w") as f:
    yaml.dump(registry.to_dict(), f, default_flow_style=False, sort_keys=False)

print(f"✅ Saved {len(registry.all_recommendations)} recommendations to {RECOMMENDATIONS_PATH}")
print("\nRecommendations by layer:")
for layer in ["bronze", "silver", "gold"]:
    recs = registry.get_by_layer(layer)
    print(f"  {layer.upper()}: {len(recs)}")