# Feature Engineering Toolkit - In-Depth Tutorial

This tutorial covers advanced workflows and techniques for data preparation and feature engineering with the Feature Engineering Toolkit, from exploratory analysis through production patterns.

## What You'll Learn

1. **Exploratory Data Analysis** - Deep dive into DataAnalyzer and TargetAnalyzer
2. **Data Preprocessing** - Handle messy real-world data
3. **Feature Engineering** - Create powerful predictive features
4. **Feature Selection** - Identify and select important features
5. **End-to-End Pipeline** - Complete workflow from raw data to an ML-ready dataset
6. **Advanced Techniques** - Statistical robustness and production patterns

Let's get started!

## Setup

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

from feature_engineering_tk import (
    DataAnalyzer,
    TargetAnalyzer,
    DataPreprocessor,
    FeatureEngineer,
    FeatureSelector,
    statistical_utils
)
from feature_engineering_tk.data_analysis import quick_analysis
from feature_engineering_tk.feature_selection import select_features_auto

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Setup complete!")

---
# Section 1: Exploratory Data Analysis

We'll analyze a synthetic real estate price prediction dataset to understand the data before modeling.

## Generate Real Estate Dataset

In [None]:
# Generate realistic real estate data
n_houses = 5000

df_houses = pd.DataFrame({
    'square_feet': np.random.normal(2000, 800, n_houses).clip(500, 10000),
    'bedrooms': np.random.choice([1, 2, 3, 4, 5, 6], n_houses, p=[0.05, 0.15, 0.35, 0.30, 0.10, 0.05]),
    'bathrooms': np.random.choice([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], n_houses),
    'age_years': np.random.exponential(15, n_houses).clip(0, 100),
    'lot_size': np.random.lognormal(8.5, 0.5, n_houses),
    'garage_spaces': np.random.choice([0, 1, 2, 3], n_houses, p=[0.1, 0.2, 0.5, 0.2]),
    'neighborhood': np.random.choice(['Downtown', 'Suburb', 'Rural', 'Waterfront'], n_houses, p=[0.2, 0.5, 0.2, 0.1]),
    'property_type': np.random.choice(['Single Family', 'Condo', 'Townhouse'], n_houses, p=[0.6, 0.25, 0.15]),
    'has_pool': np.random.choice([0, 1], n_houses, p=[0.85, 0.15]),
    'has_fireplace': np.random.choice([0, 1], n_houses, p=[0.7, 0.3]),
    'school_rating': np.random.choice(range(1, 11), n_houses).astype(float),  # float so NaN can be assigned below
})

# Create target with realistic price model
base_price = 100000
price = (
    base_price +
    df_houses['square_feet'] * 150 +
    df_houses['bedrooms'] * 20000 +
    df_houses['bathrooms'] * 15000 +
    df_houses['garage_spaces'] * 10000 -
    df_houses['age_years'] * 1000 +
    df_houses['lot_size'] * 5 +
    df_houses['has_pool'] * 25000 +
    df_houses['has_fireplace'] * 8000 +
    df_houses['school_rating'] * 5000 +
    np.where(df_houses['neighborhood'] == 'Waterfront', 100000, 0) +
    np.where(df_houses['neighborhood'] == 'Downtown', 50000, 0) +
    np.random.normal(0, 30000, n_houses)  # Random variation
)
df_houses['price'] = price.clip(50000, 2000000)

# Add missing values
missing_idx = np.random.choice(df_houses.index, size=int(0.08 * n_houses), replace=False)
df_houses.loc[missing_idx, 'lot_size'] = np.nan

missing_idx2 = np.random.choice(df_houses.index, size=int(0.05 * n_houses), replace=False)
df_houses.loc[missing_idx2, 'school_rating'] = np.nan

print(f"Dataset shape: {df_houses.shape}")
df_houses.head(10)

## DataAnalyzer: General EDA

In [None]:
# Quick overview
quick_analysis(df_houses)

In [None]:
# Initialize analyzer
analyzer = DataAnalyzer(df_houses)

# Basic information
basic_info = analyzer.get_basic_info()
print(basic_info)

In [None]:
# Missing value analysis
missing_summary = analyzer.get_missing_summary()
print(missing_summary)

In [None]:
# Numeric summary statistics
numeric_summary = analyzer.get_numeric_summary()
print(numeric_summary)

In [None]:
# Detect outliers using IQR method
outlier_summary_iqr = analyzer.detect_outliers_iqr()
print("Outliers detected (IQR method):")
print(outlier_summary_iqr)
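
For intuition, the IQR rule can be sketched in plain NumPy. This is a minimal illustration, not the toolkit's implementation: `iqr_bounds` is a hypothetical helper, and the multiplier `k=1.5` is the conventional Tukey default, assumed here.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the Tukey IQR rule (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Small sample with one obviously extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)
lower, upper = iqr_bounds(sample)

# Anything outside the fences is flagged as an outlier
outlier_mask = (sample < lower) | (sample > upper)
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {sample[outlier_mask]}")
```

Because the fences come from quartiles rather than the mean and standard deviation, a single extreme value barely moves them.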

In [None]:
# Check correlations
high_correlations = analyzer.get_high_correlations(threshold=0.6)
if not high_correlations.empty:
    print("High correlations found:")
    print(high_correlations)
else:
    print("No high correlations above 0.6 threshold.")

In [None]:
# Check for multicollinearity using VIF
numeric_cols = df_houses.select_dtypes(include=[np.number]).columns.tolist()
if len(numeric_cols) > 1:
    vif_results = analyzer.calculate_vif(columns=numeric_cols)
    print("\nVariance Inflation Factor (VIF):")
    print(vif_results)
    print("\nNote: VIF > 10 indicates high multicollinearity")
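
To see what VIF measures, here is a minimal NumPy sketch. It relies on a standard identity: the VIF of each column equals the corresponding diagonal entry of the inverse correlation matrix. `vif_from_correlation` is a hypothetical helper, not part of the toolkit, and it assumes the input has no missing values (in `df_houses`, rows with NaN would need to be dropped or imputed first).

```python
import numpy as np

def vif_from_correlation(X):
    """VIF per column: diagonal of the inverse correlation matrix (hypothetical helper)."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                # independent of a
c = a + 0.1 * rng.normal(size=500)      # nearly a copy of a

vifs = vif_from_correlation(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

Columns `a` and `c` explain each other almost perfectly, so both get a large VIF, while the independent column `b` stays close to 1.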

In [None]:
# Detect misclassified categorical columns
misclassified = analyzer.detect_misclassified_categorical()
if not misclassified.empty:
    print("Columns that should be categorical:")
    for idx, row in misclassified.iterrows():
        print(f"  {row['column']}: {row['reason']} (unique values: {row['unique_count']})")
else:
    print("No misclassified categorical columns detected.")

In [None]:
# Get binning suggestions
binning_suggestions = analyzer.suggest_binning()
if not binning_suggestions.empty:
    print("Binning suggestions:")
    for idx, row in binning_suggestions.iterrows():
        print(f"\n{row['column']}:")
        print(f"  Strategy: {row['strategy']}")
        print(f"  Bins: {row['n_bins']}")
        print(f"  Reason: {row['reason']}")

In [None]:
# Visualize correlation heatmap
fig = analyzer.plot_correlation_heatmap(method='pearson', show=False)
if fig:
    plt.tight_layout()
    plt.show()

## TargetAnalyzer: Target-Aware Analysis

In [None]:
# Initialize target analyzer
target_analyzer = TargetAnalyzer(df_houses, target_column='price')

# Print the auto-detected task type
print(f"Task type: {target_analyzer.task}")

In [None]:
# Analyze target distribution
target_stats = target_analyzer.analyze_target_distribution()
print("Target Distribution Statistics:")
for key, value in target_stats.items():
    if isinstance(value, float):
        print(f"{key}: {value:.2f}")
    else:
        print(f"{key}: {value}")

In [None]:
# Plot target distribution
fig = target_analyzer.plot_target_distribution(show=False)
if fig:
    plt.tight_layout()
    plt.show()

In [None]:
# Analyze feature correlations with target
correlations = target_analyzer.analyze_feature_correlations(method='pearson')
print("Top 10 Features by Correlation with Price:")
print(correlations.head(10))

In [None]:
# Analyze mutual information scores
mi_scores = target_analyzer.analyze_mutual_information()
print("\nTop 10 Features by Mutual Information:")
print(mi_scores.head(10))

In [None]:
# Check data quality
quality_report = target_analyzer.analyze_data_quality()
print("Data Quality Report:")
print(f"Missing values: {quality_report['missing_values_count']}")
print(f"Constant features: {quality_report['constant_features_count']}")
if quality_report['constant_features']:
    print(f"  Constant feature list: {quality_report['constant_features']}")

In [None]:
# Get feature engineering suggestions
fe_suggestions = target_analyzer.suggest_feature_engineering()
print("\nTop Feature Engineering Suggestions:")
high_priority = [s for s in fe_suggestions if s['priority'] == 'high']
for i, sugg in enumerate(high_priority[:8], 1):
    print(f"\n{i}. {sugg['feature']} - {sugg['suggestion']}")
    print(f"   Reason: {sugg['reason']}")

In [None]:
# Get model recommendations
model_recs = target_analyzer.recommend_models()
print("\nModel Recommendations:")
for i, rec in enumerate(model_recs[:5], 1):
    print(f"\n{i}. {rec['model']} (Priority: {rec['priority']})")
    print(f"   Why: {rec['reason']}")
    print(f"   Note: {rec['considerations']}")

In [None]:
# Generate comprehensive report
report = target_analyzer.generate_full_report()
print("\nComprehensive Report Generated:")
print(f"Sections: {list(report.keys())}")

# Export to HTML
target_analyzer.export_report('real_estate_analysis.html', format='html')
print("\nReport exported to real_estate_analysis.html")

---
# Section 2: Data Preprocessing

Let's work with a messy customer dataset that has missing values, outliers, infinite values, duplicate rows, and inconsistently formatted text columns.

## Generate Messy Customer Dataset

In [None]:
# Generate customer dataset with various data quality issues
n_customers = 3000

# Create base date
base_date = datetime(2023, 1, 1)

df_customers = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'name': [f"Customer {i}" for i in range(1, n_customers + 1)],
    'email': [f"user{i}@example.com" if np.random.random() > 0.1 else f"  USER{i}@EXAMPLE.COM  " for i in range(1, n_customers + 1)],
    'signup_date': [base_date + timedelta(days=int(x)) for x in np.random.uniform(0, 730, n_customers)],
    'age': np.random.normal(45, 15, n_customers).clip(18, 90),
    'income': np.random.lognormal(10.8, 0.6, n_customers),
    'credit_score': np.random.normal(680, 80, n_customers).clip(300, 850),
    'num_purchases': np.random.poisson(8, n_customers),
    'total_spent': np.random.exponential(500, n_customers),
    'membership_tier': np.random.choice(['  Bronze  ', 'Silver', 'GOLD', 'platinum'], n_customers, p=[0.5, 0.3, 0.15, 0.05]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_customers),
    'is_active': np.random.choice([0, 1], n_customers, p=[0.3, 0.7]),
})

# Add missing values (realistic patterns)
missing_idx = np.random.choice(df_customers.index, size=int(0.15 * n_customers), replace=False)
df_customers.loc[missing_idx, 'income'] = np.nan

missing_idx2 = np.random.choice(df_customers.index, size=int(0.08 * n_customers), replace=False)
df_customers.loc[missing_idx2, 'credit_score'] = np.nan

missing_idx3 = np.random.choice(df_customers.index, size=int(0.05 * n_customers), replace=False)
df_customers.loc[missing_idx3, 'total_spent'] = np.nan

# Add outliers
outlier_idx = np.random.choice(df_customers.index, size=30, replace=False)
df_customers.loc[outlier_idx, 'total_spent'] = np.random.uniform(5000, 20000, 30)

# Add infinite values
inf_idx = np.random.choice(df_customers.index, size=5, replace=False)
df_customers.loc[inf_idx, 'income'] = np.inf

# Add duplicates
df_customers = pd.concat([df_customers, df_customers.sample(10)], ignore_index=True)

print(f"Dataset shape: {df_customers.shape}")
print(f"Missing values: {df_customers.isnull().sum().sum()}")
df_customers.head(10)

## Data Validation

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor(df_customers)

# Validate data quality
quality_report = preprocessor.validate_data_quality()
print("Data Quality Report:")
print(f"Total rows: {quality_report['shape'][0]}")
print(f"Total columns: {quality_report['shape'][1]}")
print(f"Missing values: {quality_report['missing_values']}")
print(f"Duplicate rows: {quality_report['duplicate_rows']}")
print(f"Constant columns: {quality_report['constant_columns']}")
print(f"Issues found: {quality_report['issues_found']}")

In [None]:
# Detect infinite values
infinite_vals = preprocessor.detect_infinite_values()
if infinite_vals:
    print("Infinite values detected:")
    for col, count in infinite_vals.items():
        print(f"  {col}: {count} infinite values")

## Handling Missing Values

In [None]:
# Handle missing values with appropriate strategies
# For normally distributed numeric columns: mean
preprocessor.handle_missing_values(
    columns=['age', 'credit_score'],
    strategy='mean',
    inplace=True
)

# For skewed numeric columns: median
preprocessor.handle_missing_values(
    columns=['income', 'total_spent'],
    strategy='median',
    inplace=True
)

# For categorical columns: mode
preprocessor.handle_missing_values(
    columns=['membership_tier'],
    strategy='mode',
    inplace=True
)

df_processed = preprocessor.get_dataframe()
print(f"Missing values after imputation: {df_processed.isnull().sum().sum()}")
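
The three strategies above map onto simple pandas operations. The following sketch uses a toy frame with made-up values to show what each one fills in. It illustrates the idea only and is not the toolkit's internal code.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap per column
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 40.0, 40.0],
    "income": [40_000.0, 45_000.0, np.nan, 50_000.0, 900_000.0],
    "tier": ["silver", "gold", None, "silver", "bronze"],
})

# Mean for roughly symmetric columns
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Median for skewed columns: the 900,000 outlier does not drag the fill value up
toy["income"] = toy["income"].fillna(toy["income"].median())

# Mode (most frequent value) for categorical columns
toy["tier"] = toy["tier"].fillna(toy["tier"].mode().iloc[0])

print(toy)
```

The median is also the safer choice for `income` in this tutorial's dataset, since a few infinite values would make the mean infinite but leave the median finite.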

## String Preprocessing

In [None]:
# Clean string columns
preprocessor.clean_string_columns(
    columns=['email', 'membership_tier'],
    operations=['strip', 'lower'],
    inplace=True
)

# Handle whitespace variants
preprocessor.handle_whitespace_variants(
    columns=['membership_tier'],
    inplace=True
)

# Extract string length as a feature
preprocessor.extract_string_length(
    columns=['name'],
    suffix='_length',
    inplace=True
)

df_processed = preprocessor.get_dataframe()
print("\nCleaned membership tiers:")
print(df_processed['membership_tier'].value_counts())
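
The `strip` and `lower` operations correspond to pandas string methods. As a rough sketch of the effect (plain pandas, not the toolkit's code), inconsistent spellings of the same tier collapse into one category:

```python
import pandas as pd

# The same kinds of inconsistencies as in membership_tier
tiers = pd.Series(["  Bronze  ", "Silver", "GOLD", "platinum", "gold "])

# Remove surrounding whitespace, then normalize case
cleaned = tiers.str.strip().str.lower()

print(cleaned.tolist())
print(f"Distinct tiers before: {tiers.nunique()}, after: {cleaned.nunique()}")
```

Without this step, `"GOLD"` and `"gold "` would be treated as different categories by any later encoding.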

## Handling Outliers and Duplicates

In [None]:
# Remove duplicates
rows_before = df_processed.shape[0]
preprocessor.remove_duplicates(inplace=True)
df_processed = preprocessor.get_dataframe()
print(f"Duplicates removed: {rows_before - df_processed.shape[0]}")

In [None]:
# Handle outliers in spending (cap extreme values)
preprocessor.handle_outliers(
    columns=['total_spent'],
    method='iqr',
    action='cap',
    inplace=True
)

print("Outliers capped in total_spent")

In [None]:
# Cap infinite values by clipping income to a finite range
preprocessor.clip_values(
    columns=['income'],
    lower=0,
    upper=1000000,
    inplace=True
)

df_processed = preprocessor.get_dataframe()
print(f"Final dataset shape: {df_processed.shape}")

## Operation History Tracking

In [None]:
# View preprocessing history
summary = preprocessor.get_preprocessing_summary()
print(summary)

In [None]:
# Export preprocessing report
preprocessor.export_summary('customer_preprocessing.md', format='markdown')
print("Preprocessing report exported to customer_preprocessing.md")

---
# Section 3: Feature Engineering

Let's work with e-commerce transaction data to create powerful features.

## Generate E-Commerce Dataset

In [None]:
# Generate e-commerce transaction data
n_transactions = 8000
n_customers_ecom = 1000

base_date = datetime(2023, 1, 1)

df_ecommerce = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.randint(1, n_customers_ecom + 1, n_transactions),
    'transaction_date': [base_date + timedelta(days=int(x)) for x in np.random.uniform(0, 365, n_transactions)],
    'order_value': np.random.lognormal(4, 1, n_transactions),
    'item_count': np.random.poisson(3, n_transactions) + 1,
    'discount_amount': np.random.exponential(5, n_transactions),
    'shipping_cost': np.random.uniform(0, 25, n_transactions),
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Sports'], n_transactions),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'PayPal', 'Gift Card'], n_transactions),
    'device_type': np.random.choice(['Desktop', 'Mobile', 'Tablet'], n_transactions, p=[0.4, 0.5, 0.1]),
    'is_weekend': np.random.choice([0, 1], n_transactions, p=[0.7, 0.3]),
    'is_repeat_customer': np.random.choice([0, 1], n_transactions, p=[0.4, 0.6]),
})

# Calculate final amount
df_ecommerce['final_amount'] = (
    df_ecommerce['order_value'] -
    df_ecommerce['discount_amount'] +
    df_ecommerce['shipping_cost']
).clip(lower=1)

print(f"E-commerce dataset shape: {df_ecommerce.shape}")
df_ecommerce.head(10)

## Encoding Categorical Variables

In [None]:
# Initialize feature engineer
engineer = FeatureEngineer(df_ecommerce)

# One-hot encoding for nominal categories
engineer.encode_categorical_onehot(
    columns=['category', 'payment_method', 'device_type'],
    inplace=True
)

df_engineered = engineer.get_dataframe()

print(f"Shape after one-hot encoding: {df_engineered.shape}")
print(f"New columns created: {df_engineered.shape[1] - df_ecommerce.shape[1]}")

## Scaling Features

In [None]:
# Standard scaling for normally distributed features
engineer.scale_features(
    columns=['item_count'],
    method='standard',
    inplace=True
)

# Robust scaling for features with outliers
engineer.scale_features(
    columns=['order_value', 'discount_amount', 'shipping_cost', 'final_amount'],
    method='robust',
    inplace=True
)

print("Scaling complete")
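
To make the difference between the two methods concrete, here is a minimal NumPy sketch of the usual formulas: standard scaling uses the mean and standard deviation, robust scaling uses the median and interquartile range. The helper names are hypothetical, and the toolkit's exact implementation is assumed rather than confirmed to match.

```python
import numpy as np

def standard_scale(x):
    """(x - mean) / std, using the population standard deviation."""
    return (x - x.mean()) / x.std()

def robust_scale(x):
    """(x - median) / IQR, which is insensitive to a few extreme values."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme value
standard = standard_scale(values)
robust = robust_scale(values)

print("standard:", np.round(standard, 3))
print("robust:  ", np.round(robust, 3))
```

With standard scaling the outlier inflates the standard deviation and squeezes the four ordinary values together. With robust scaling they keep their spacing and only the outlier lands far from zero.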

## Mathematical Transformations

In [None]:
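# Log transform for skewed distributions
# Note: a log transform is normally applied to raw positive values; if the scaling step
# above overwrote order_value in place, the column may now contain negative values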
engineer.create_log_transform(
    columns=['order_value'],
    inplace=True
)

# Polynomial features for non-linear relationships
engineer.create_polynomial_features(
    columns=['item_count'],
    degree=2,
    inplace=True
)

df_engineered = engineer.get_dataframe()
print(f"Shape after transformations: {df_engineered.shape}")

## Datetime Feature Extraction

In [None]:
# Extract datetime components
engineer.create_datetime_features(
    column='transaction_date',
    features=['year', 'month', 'day', 'dayofweek', 'quarter'],
    inplace=True
)

df_engineered = engineer.get_dataframe()

print("Datetime features extracted:")
datetime_cols = [col for col in df_engineered.columns if col.startswith('transaction_date_')]
print(datetime_cols)
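
The extracted components correspond to the pandas `.dt` accessor. The sketch below shows the equivalent plain-pandas calls on two example dates. The column names used here are illustrative and need not match the toolkit's naming.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-07-15"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "dayofweek": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    "quarter": dates.dt.quarter,
})
print(parts)
```

Since `dayofweek` runs from Monday (0) to Sunday (6), a weekend flag can be derived directly from the date as `dayofweek >= 5`.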

## Creating Derived Features

In [None]:
# Ratio features
engineer.create_ratio_features(
    numerator='discount_amount',
    denominator='order_value',
    name='discount_rate',
    inplace=True
)

engineer.create_ratio_features(
    numerator='shipping_cost',
    denominator='order_value',
    name='shipping_rate',
    inplace=True
)

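# Flag features
# Note: if the scaling step above overwrote these columns in place, the thresholds
# below are compared against scaled values rather than raw amounts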
engineer.create_flag_features(
    column='discount_amount',
    condition=lambda x: x > 10,
    flag_name='has_large_discount',
    inplace=True
)

engineer.create_flag_features(
    column='order_value',
    condition=lambda x: x > 10,
    flag_name='is_high_value',
    inplace=True
)

df_engineered = engineer.get_dataframe()

print(f"\nFinal engineered dataset shape: {df_engineered.shape}")
print(f"Total new features created: {df_engineered.shape[1] - df_ecommerce.shape[1]}")
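
Ratio features need some care when the denominator can be zero. The following plain-pandas sketch shows one common way to handle it, by turning zero denominators into NaN before dividing. Whether `create_ratio_features` does the same is not confirmed here.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "discount_amount": [5.0, 0.0, 2.0],
    "order_value": [50.0, 20.0, 0.0],   # last row would divide by zero
})

# Replace zero denominators with NaN so the ratio is missing rather than infinite
denominator = orders["order_value"].replace(0, np.nan)
orders["discount_rate"] = orders["discount_amount"] / denominator

print(orders)
```

The resulting NaN can then be imputed or flagged like any other missing value, which is usually easier to handle downstream than `inf`.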

## Aggregation Features by Group

In [None]:
# Aggregate spending by customer
engineer.create_aggregations(
    agg_column='order_value',
    group_by='customer_id',
    agg_funcs=['mean', 'sum', 'count'],
    inplace=True
)

df_engineered = engineer.get_dataframe()

print("Aggregation features created:")
agg_cols = [col for col in df_engineered.columns if 'order_value_customer_id' in col]
print(agg_cols)
print(f"\nFinal shape: {df_engineered.shape}")
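
Group aggregations attach a per-group statistic to every row of that group. In plain pandas this is a `groupby(...).transform(...)`, sketched below. The `order_value_customer_id_<func>` column naming is an assumption based on the filter used above, not a documented toolkit convention.

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_value": [10.0, 30.0, 5.0],
})

# transform() returns one value per original row, aligned to the input index
for func in ["mean", "sum", "count"]:
    column = f"order_value_customer_id_{func}"
    tx[column] = tx.groupby("customer_id")["order_value"].transform(func)

print(tx)
```

Unlike `agg`, which returns one row per group, `transform` keeps the original row count, so the new columns can be used directly as features.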

## Saving Transformers for Production

In [None]:
# Save fitted transformers
engineer.save_transformers('ecommerce_transformers.pkl')
print("Transformers saved to ecommerce_transformers.pkl")
print(f"Saved transformers: {list(engineer.encoders.keys()) + list(engineer.scalers.keys())}")

In [None]:
# Demonstrate loading transformers (for new data)
new_engineer = FeatureEngineer(df_ecommerce.head(100))  # Simulate new data
new_engineer.load_transformers('ecommerce_transformers.pkl')
print("\nTransformers loaded successfully!")
print(f"Loaded encoders: {list(new_engineer.encoders.keys())}")
print(f"Loaded scalers: {list(new_engineer.scalers.keys())}")

---
# Section 4: Feature Selection

We'll work with a high-dimensional dataset and select the most important features.

## Generate High-Dimensional Dataset

In [None]:
# Generate dataset with many features
n_samples = 2000
n_features = 50

# Create features
X = np.random.randn(n_samples, n_features)
feature_names = [f'feature_{i}' for i in range(1, n_features + 1)]

# Create target with only some features being truly predictive
important_features = [0, 5, 10, 15, 20, 25, 30, 35, 40]
y = (
    X[:, 0] * 2 +
    X[:, 5] * 1.5 +
    X[:, 10] * 1.2 +
    X[:, 15] * 1.0 +
    X[:, 20] * 0.8 +
    X[:, 25] * 0.6 +
    X[:, 30] * 0.5 +
    X[:, 35] * 0.4 +
    X[:, 40] * 0.3 +
    np.random.randn(n_samples) * 0.5  # Add noise
)

# Add some constant and near-constant features
X[:, -1] = 1  # Constant
X[:, -2] = np.random.choice([0, 1], n_samples, p=[0.99, 0.01])  # Near constant

# Add highly correlated features
X[:, -3] = X[:, 0] + np.random.randn(n_samples) * 0.01
X[:, -4] = X[:, 5] + np.random.randn(n_samples) * 0.01

df_highdim = pd.DataFrame(X, columns=feature_names)
df_highdim['target'] = y

print(f"High-dimensional dataset shape: {df_highdim.shape}")
print(f"Target statistics: mean={y.mean():.2f}, std={y.std():.2f}")

## Individual Selection Methods

In [None]:
# Initialize feature selector
selector = FeatureSelector(df_highdim, target_column='target')

# Remove low variance features
variance_features = selector.select_by_variance(threshold=0.01)
df_variance = selector.apply_selection(variance_features, keep_target=True)

print(f"Features after variance filter: {df_variance.shape[1] - 1}")
print(f"Features removed: {df_highdim.shape[1] - df_variance.shape[1]}")

In [None]:
# Remove highly correlated features
selector_corr = FeatureSelector(df_variance, target_column='target')
corr_features = selector_corr.select_by_correlation(threshold=0.95)
df_corr = selector_corr.apply_selection(corr_features, keep_target=True)

print(f"\nFeatures after correlation filter: {df_corr.shape[1] - 1}")
print(f"Features removed: {df_variance.shape[1] - df_corr.shape[1]}")
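
The two filters above can be sketched in a few lines of pandas. This is a simplified illustration under stated assumptions: it drops the later column of each highly correlated pair, which may differ from how `select_by_correlation` chooses which column to keep.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "signal": base,
    "near_copy": base + 0.01 * rng.normal(size=300),  # almost identical to signal
    "noise": rng.normal(size=300),
    "constant": np.ones(300),                          # zero variance
})

# Step 1: variance filter
variances = demo.var()
kept = variances[variances > 0.01].index.tolist()

# Step 2: correlation filter on the upper triangle, so each pair is checked once
corr = demo[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

selected = [col for col in kept if col not in to_drop]
print(f"Kept after variance filter: {kept}")
print(f"Dropped as redundant: {to_drop}")
print(f"Selected: {selected}")
```

Neither step looks at the target, which is why both can be run before any supervised selection method.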

In [None]:
# Select by target correlation
selector_target = FeatureSelector(df_corr, target_column='target')
target_corr_features = selector_target.select_by_target_correlation(k=20)
target_corr_df = selector_target.apply_selection(target_corr_features, keep_target=True)

print(f"\nTop 20 features with highest target correlation: {target_corr_df.shape[1] - 1}")

In [None]:
# Statistical test selection
selector_stat = FeatureSelector(df_corr, target_column='target')
df_ftest = selector_stat.select_by_statistical_test(score_func='f_regression', k=20)
print(f"\nTop 20 features by F-test: {df_ftest.shape[1] - 1}")

In [None]:
# Mutual information selection
df_mi = selector_stat.select_by_statistical_test(score_func='mutual_info_regression', k=15)
print(f"Top 15 features by mutual information: {df_mi.shape[1] - 1}")
print(f"\nSelected features: {[col for col in df_mi.columns if col != 'target'][:10]}...")

## Automatic Feature Selection Pipeline

In [None]:
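# Use automatic 3-step selection (judging by the parameters: variance filter,
# correlation filter, then a cap on the number of features)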
df_auto_selected = select_features_auto(
    df=df_highdim,
    target_column='target',
    task='regression',
    max_features=15,
    variance_threshold=0.01,
    correlation_threshold=0.95
)

print(f"Original features: {df_highdim.shape[1] - 1}")
print(f"Selected features: {df_auto_selected.shape[1] - 1}")
print(f"Reduction: {(1 - (df_auto_selected.shape[1] - 1) / (df_highdim.shape[1] - 1)) * 100:.1f}%")

In [None]:
# Feature importance scores are stored on a selector after it runs a selection.
# Note: this selector is newly created, so it may not hold any scores yet.
selector_auto = FeatureSelector(df_highdim, target_column='target')
if selector_auto.feature_scores is not None:
    print("\nTop 10 Features by Importance:")
    print(selector_auto.feature_scores.head(10))

## Tree-Based Feature Importance

In [None]:
# Select features using tree-based importance
selector_tree = FeatureSelector(df_highdim, target_column='target')
df_tree_selected = selector_tree.select_by_importance(
    k=15,
    task='regression'
)

print(f"Features selected by Random Forest importance: {df_tree_selected.shape[1] - 1}")
print("\nSelected features:")
selected_features_tree = [col for col in df_tree_selected.columns if col != 'target']
for i, feat in enumerate(selected_features_tree, 1):
    print(f"{i}. {feat}")

---
# Section 5: Complete End-to-End Pipeline

Let's put everything together with an insurance claim prediction example.

## Generate Insurance Claims Dataset

In [None]:
# Generate insurance claims data
n_claims = 4000

df_insurance = pd.DataFrame({
    'policy_id': range(1, n_claims + 1),
    'age': np.random.randint(18, 80, n_claims),
    'gender': np.random.choice(['Male', 'Female'], n_claims),
    'bmi': np.random.normal(28, 6, n_claims).clip(15, 50),
    'children': np.random.choice([0, 1, 2, 3, 4, 5], n_claims, p=[0.4, 0.25, 0.2, 0.1, 0.04, 0.01]),
    'smoker': np.random.choice(['Yes', 'No'], n_claims, p=[0.2, 0.8]),
    'region': np.random.choice(['Northeast', 'Northwest', 'Southeast', 'Southwest'], n_claims),
    'coverage_type': np.random.choice(['Basic', 'Standard', 'Premium'], n_claims, p=[0.3, 0.5, 0.2]),
    'years_insured': np.random.randint(0, 40, n_claims),
    'previous_claims': np.random.poisson(1.5, n_claims),
    'vehicle_age': np.random.randint(0, 20, n_claims).astype(float),  # float so NaN can be assigned below
})

# Create target (claim amount)
base_claim = 5000
claim = (
    base_claim +
    df_insurance['age'] * 100 +
    df_insurance['bmi'] * 200 +
    df_insurance['children'] * 1000 +
    np.where(df_insurance['smoker'] == 'Yes', 10000, 0) +
    df_insurance['previous_claims'] * 3000 +
    df_insurance['vehicle_age'] * 500 +
    np.where(df_insurance['coverage_type'] == 'Premium', 5000, 0) +
    np.where(df_insurance['coverage_type'] == 'Standard', 2000, 0) +
    np.random.normal(0, 2000, n_claims)
)
df_insurance['claim_amount'] = claim.clip(1000, 100000)

# Add realistic data issues
missing_idx = np.random.choice(df_insurance.index, size=int(0.06 * n_claims), replace=False)
df_insurance.loc[missing_idx, 'bmi'] = np.nan

missing_idx2 = np.random.choice(df_insurance.index, size=int(0.04 * n_claims), replace=False)
df_insurance.loc[missing_idx2, 'vehicle_age'] = np.nan

# Add outliers
outlier_idx = np.random.choice(df_insurance.index, size=20, replace=False)
df_insurance.loc[outlier_idx, 'claim_amount'] = np.random.uniform(150000, 250000, 20)

print(f"Insurance dataset shape: {df_insurance.shape}")
df_insurance.head(10)

## Step 1: Exploratory Data Analysis

In [None]:
# Quick analysis
quick_analysis(df_insurance)

In [None]:
# Target analysis
target_analyzer_insurance = TargetAnalyzer(df_insurance, target_column='claim_amount')

target_dist = target_analyzer_insurance.analyze_target_distribution()
print("\nClaim Amount Statistics:")
print(f"Mean: ${target_dist['mean']:.2f}")
print(f"Median: ${target_dist['median']:.2f}")
print(f"Std: ${target_dist['std']:.2f}")
print(f"Skewness: {target_dist['skewness']:.2f}")

In [None]:
# Feature correlations
correlations_insurance = target_analyzer_insurance.analyze_feature_correlations()
print("\nTop Features by Correlation:")
print(correlations_insurance.head())

## Step 2: Data Preprocessing

In [None]:
# Initialize preprocessor and clean data
preprocessor_insurance = DataPreprocessor(df_insurance)

preprocessor_insurance\
    .drop_columns(['policy_id'], inplace=True)\
    .handle_missing_values(strategy='median', columns=['bmi', 'vehicle_age'], inplace=True)\
    .handle_outliers(columns=['claim_amount'], method='iqr', action='cap', inplace=True)

df_insurance_clean = preprocessor_insurance.get_dataframe()
print(f"Cleaned dataset shape: {df_insurance_clean.shape}")
print(f"Missing values: {df_insurance_clean.isnull().sum().sum()}")

## Step 3: Feature Engineering

In [None]:
# Initialize feature engineer
engineer_insurance = FeatureEngineer(df_insurance_clean)

# Encode categorical variables
engineer_insurance.encode_categorical_onehot(
    columns=['gender', 'smoker', 'region', 'coverage_type'],
    inplace=True
)

# Create polynomial features
engineer_insurance.create_polynomial_features(
    columns=['bmi'],
    degree=2,
    inplace=True
)

# Create interaction features (age * bmi)
engineer_insurance.create_polynomial_features(
    columns=['age', 'bmi'],
    degree=2,
    interaction_only=True,
    inplace=True
)

# Create flags
engineer_insurance.create_flag_features(
    column='bmi',
    condition=lambda x: x > 30,
    flag_name='is_obese',
    inplace=True
)

engineer_insurance.create_flag_features(
    column='previous_claims',
    condition=lambda x: x > 2,
    flag_name='has_frequent_claims',
    inplace=True
)

# Scale numeric features
numeric_features = ['age', 'bmi', 'children', 'years_insured', 'previous_claims', 'vehicle_age']
engineer_insurance.scale_features(
    columns=numeric_features,
    method='standard',
    inplace=True
)

df_insurance_eng = engineer_insurance.get_dataframe()

print(f"Engineered dataset shape: {df_insurance_eng.shape}")
print(f"Features created: {df_insurance_eng.shape[1] - df_insurance_clean.shape[1]}")

## Step 4: Feature Selection

In [None]:
# Select best features using automatic pipeline
df_insurance_final = select_features_auto(
    df=df_insurance_eng,
    target_column='claim_amount',
    task='regression',
    max_features=20,
    variance_threshold=0.01,
    correlation_threshold=0.95
)

print(f"\nOriginal features: {df_insurance_eng.shape[1] - 1}")
print(f"Selected features: {df_insurance_final.shape[1] - 1}")
print("\nSelected feature list:")
for i, feat in enumerate([col for col in df_insurance_final.columns if col != 'claim_amount'], 1):
    print(f"{i}. {feat}")

## Step 5: Final Analysis and Insights

In [None]:
# Analyze with final features
final_analyzer = TargetAnalyzer(df_insurance_final, target_column='claim_amount')

# Get feature engineering suggestions
fe_suggestions_insurance = final_analyzer.suggest_feature_engineering()
print("Additional Feature Engineering Suggestions:")
for i, sugg in enumerate(fe_suggestions_insurance[:5], 1):
    print(f"\n{i}. {sugg['feature']}")
    print(f"   {sugg['suggestion']} (Priority: {sugg['priority']})")
    print(f"   Reason: {sugg['reason']}")

In [None]:
# Get model recommendations
model_recs_insurance = final_analyzer.recommend_models()
print("\nModel Recommendations:")
for i, rec in enumerate(model_recs_insurance[:4], 1):
    print(f"\n{i}. {rec['model']} (Priority: {rec['priority']})")
    print(f"   Reason: {rec['reason']}")

## Step 6: Export Everything

In [None]:
# Export preprocessing report
preprocessor_insurance.export_summary('insurance_preprocessing.json', format='json')
print("Preprocessing report exported")

# Export analysis report
final_analyzer.export_report('insurance_analysis.html', format='html')
print("Analysis report exported")

# Save transformers for production
engineer_insurance.save_transformers('insurance_transformers.pkl')
print("Transformers saved for production deployment")

# Save final dataset
df_insurance_final.to_csv('insurance_ml_ready.csv', index=False)
print("ML-ready dataset saved")

print("\n✓ End-to-end pipeline complete!")

## Step 7: Apply Pipeline to New Data

In [None]:
# Simulate new test data
df_new = df_insurance.sample(100, random_state=99).copy()
print(f"New data shape: {df_new.shape}")

# Apply same preprocessing steps
prep_new = DataPreprocessor(df_new)
prep_new\
    .drop_columns(['policy_id'], inplace=True)\
    .handle_missing_values(strategy='median', columns=['bmi', 'vehicle_age'], inplace=True)\
    .handle_outliers(columns=['claim_amount'], method='iqr', action='cap', inplace=True)

df_new_clean = prep_new.get_dataframe()

# Load transformers and apply
eng_new = FeatureEngineer(df_new_clean)
eng_new.load_transformers('insurance_transformers.pkl')

# Apply same transformations
df_new_transformed = eng_new.encode_categorical_onehot(
    columns=['gender', 'smoker', 'region', 'coverage_type'],
    method='onehot'
)

print(f"\nNew data after transformation: {df_new_transformed.shape}")
print("Pipeline successfully applied to new data!")
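
One caveat with the steps above: creating a fresh `DataPreprocessor` on the new data presumably computes the imputation median from the new data itself. In production, statistics are usually learned once on training data and then reused. The sketch below shows that fit/transform separation with a hypothetical `MedianImputer` class, which is not part of the toolkit.

```python
import numpy as np
import pandas as pd

class MedianImputer:
    """Hypothetical helper: learn medians on training data, reuse them on new data."""

    def fit(self, df, columns):
        # Store one median per column, computed from the training data only
        self.medians_ = {col: df[col].median() for col in columns}
        return self

    def transform(self, df):
        # fillna with a dict fills each listed column with its stored value
        return df.fillna(self.medians_)

train = pd.DataFrame({"bmi": [20.0, 30.0, 40.0]})
new = pd.DataFrame({"bmi": [np.nan, 50.0]})

imputer = MedianImputer().fit(train, ["bmi"])
filled = imputer.transform(new)
print(filled)
```

The gap in `new` is filled with the training median (30.0), not with a value derived from the new batch, so every batch is transformed the same way.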


---
# Section 6: Advanced Techniques

Let's explore statistical robustness and production patterns.

## Statistical Robustness with statistical_utils

In [None]:
# Generate sample data for statistical tests
group1 = np.random.normal(100, 15, 200)
group2 = np.random.normal(105, 15, 200)
group3 = np.random.normal(95, 15, 200)

# Check normality assumption
normality_check = statistical_utils.check_normality(group1)
print("Normality Check:")
print(f"Is normal: {normality_check['is_normal']}")
print(f"P-value: {normality_check['pvalue']:.4f}")
print(f"Recommendation: {normality_check['recommendation']}")

In [None]:
# Check homogeneity of variance
homogeneity_check = statistical_utils.check_homogeneity_of_variance([group1, group2, group3])
print("\nHomogeneity of Variance:")
print(f"Equal variances: {homogeneity_check['equal_variances']}")
print(f"P-value: {homogeneity_check['pvalue']:.4f}")
print(f"Recommendation: {homogeneity_check['recommendation']}")

In [None]:
# Calculate effect size (Cohen's d)
effect_size = statistical_utils.cohens_d(group1, group2)
print("\nEffect Size (Cohen's d):")
print(f"Cohen's d: {effect_size['cohens_d']:.3f}")
print(f"Interpretation: {effect_size['interpretation']}")
print(f"Description: {effect_size['description']}")
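For reference, Cohen's d is the difference in means divided by a standard deviation. The sketch below uses the pooled standard deviation and Cohen's conventional 0.2 / 0.5 / 0.8 cut-offs. Both are assumptions about what `statistical_utils.cohens_d` does internally, and the function names here are hypothetical.

```python
import numpy as np

def cohens_d_pooled(a, b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def interpret_d(d):
    """Map |d| to Cohen's conventional labels (rough guidelines only)."""
    size = abs(d)
    if size < 0.2:
        return 'negligible'
    if size < 0.5:
        return 'small'
    if size < 0.8:
        return 'medium'
    return 'large'

d = cohens_d_pooled([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"d = {d:.3f} ({interpret_d(d)})")
```

The sign of d only reflects which group is passed first, so the interpretation uses its absolute value.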

In [None]:
# Confidence interval for mean
ci = statistical_utils.calculate_mean_ci(group1, confidence=0.95)
print("\nConfidence Interval for Mean:")
print(f"Mean: {ci['mean']:.2f}")
print(f"95% CI: [{ci['ci_lower']:.2f}, {ci['ci_upper']:.2f}]")
print(f"Margin of error: ±{ci['margin_of_error']:.2f}")
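A confidence interval for a mean is usually the sample mean plus or minus a critical value times the standard error. The sketch below uses the Student t distribution via `scipy.stats.t`. Whether `calculate_mean_ci` uses t or a normal approximation is not stated here, so treat this as an assumed equivalent; `mean_ci_t` is a hypothetical name.

```python
import numpy as np
from scipy import stats

def mean_ci_t(sample, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)           # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_crit * sem                           # margin of error
    return mean, mean - margin, mean + margin

mean, lower, upper = mean_ci_t([1, 2, 3, 4, 5])
print(f"Mean: {mean:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

For samples as large as the 200-point groups above, the t and normal critical values are nearly identical; the difference matters mainly for small samples.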

In [None]:
# Multiple testing correction
pvalues = np.array([0.001, 0.01, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 0.20, 0.50])
correction = statistical_utils.apply_multiple_testing_correction(pvalues, method='fdr_bh', alpha=0.05)

print("\nMultiple Testing Correction (FDR):")
print(f"Significant before correction: {correction['num_significant_raw']}")
print(f"Significant after correction: {correction['num_significant_corrected']}")
print(f"\nRejected null hypotheses: {correction['reject']}")
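The `fdr_bh` method name suggests the Benjamini-Hochberg procedure: sort the p-values, compare the i-th smallest against `alpha * i / m`, and reject every hypothesis up to the largest rank that passes. The sketch below implements that step-up rule with NumPy; it is an illustration of the procedure, not the toolkit's code, and `benjamini_hochberg` is a hypothetical name.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean array marking hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # ranks from smallest to largest p
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * i / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        reject[order[:k + 1]] = True               # step-up: reject all ranks up to k
    return reject

example_p = [0.9, 0.039, 0.001, 0.035, 0.03]
print(benjamini_hochberg(example_p, alpha=0.05))
```

Note the step-up behavior: a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking passes.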

In [None]:
# Bootstrap confidence interval
bootstrap_ci = statistical_utils.bootstrap_ci(
    group1,
    statistic_func=np.median,
    n_bootstrap=1000,
    confidence=0.95
)

print("\nBootstrap CI for Median:")
print(f"Median: {bootstrap_ci['statistic']:.2f}")
print(f"95% CI: [{bootstrap_ci['ci_lower']:.2f}, {bootstrap_ci['ci_upper']:.2f}]")
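Bootstrapping estimates uncertainty by resampling the data with replacement and recomputing the statistic each time. The sketch below shows the percentile variant. Which variant `statistical_utils.bootstrap_ci` uses is not stated here, and the `seed` argument and function name are assumptions added for reproducibility.

```python
import numpy as np

def percentile_bootstrap_ci(sample, statistic_func=np.median,
                            n_bootstrap=1000, confidence=0.95, seed=0):
    """Return (statistic, lower, upper) from a percentile bootstrap."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    boot_stats = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Resample n points with replacement and recompute the statistic
        resample = rng.choice(sample, size=n, replace=True)
        boot_stats[i] = statistic_func(resample)
    tail = (1 - confidence) / 2
    lower, upper = np.quantile(boot_stats, [tail, 1 - tail])
    return statistic_func(sample), lower, upper

data = np.arange(1, 102)  # the integers 1..101
estimate, lower, upper = percentile_bootstrap_ci(data)
print(f"Median: {estimate}, 95% CI: [{lower:.1f}, {upper:.1f}]")
```

The percentile method is simple and works for statistics like the median that lack a convenient closed-form standard error, though it can be biased for small or heavily skewed samples.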

## Production Patterns and Best Practices

In [None]:
# Example: Creating a reproducible preprocessing pipeline class
class InsurancePreprocessingPipeline:
    def __init__(self):
        self.preprocessor = None
        self.engineer = None
        self.selected_features = None
        
    def fit(self, df_train, target_col='claim_amount'):
        """Fit the pipeline on training data."""
        # Preprocessing
        self.preprocessor = DataPreprocessor(df_train)
        self.preprocessor\
            .drop_columns(['policy_id'], inplace=True)\
            .handle_missing_values(strategy='median', columns=['bmi', 'vehicle_age'], inplace=True)\
            .handle_outliers(columns=[target_col], method='iqr', action='cap', inplace=True)
        
        df_clean = self.preprocessor.get_dataframe()
        
        # Feature engineering
        self.engineer = FeatureEngineer(df_clean)
        self.engineer.encode_categorical_onehot(
            columns=['gender', 'smoker', 'region', 'coverage_type'],
            method='onehot',
            inplace=True
        )
        
        df_eng = self.engineer.get_dataframe()
        
        # Feature selection
        df_final = select_features_auto(
            df=df_eng,
            target_column=target_col,
            task='regression',
            max_features=20,
            variance_threshold=0.01,
            correlation_threshold=0.95
        )
        
        # Store the target name and selected feature names for use in transform()
        self.target_col = target_col
        self.selected_features = [col for col in df_final.columns if col != target_col]

        return df_final

    def transform(self, df_test):
        """Transform test data using fitted pipeline."""
        if self.selected_features is None:
            raise RuntimeError("Call fit() before transform()")

        # Apply same preprocessing steps
        prep = DataPreprocessor(df_test)
        prep\
            .drop_columns(['policy_id'], inplace=True)\
            .handle_missing_values(strategy='median', columns=['bmi', 'vehicle_age'], inplace=True)

        df_clean = prep.get_dataframe()

        # Apply fitted transformers
        eng = FeatureEngineer(df_clean)
        eng.encoders = self.engineer.encoders
        eng.scalers = self.engineer.scalers

        eng.encode_categorical_onehot(
            columns=['gender', 'smoker', 'region', 'coverage_type'],
            method='onehot',
            inplace=True
        )

        df_eng = eng.get_dataframe()

        # Select same features, plus the target column used in fit()
        selected_cols = [col for col in self.selected_features if col in df_eng.columns]
        df_final = df_eng[selected_cols + [self.target_col]]

        return df_final

# Example usage
pipeline = InsurancePreprocessingPipeline()
df_train_final = pipeline.fit(df_insurance.iloc[:3000])
df_test_final = pipeline.transform(df_insurance.iloc[3000:])

print(f"Train shape after pipeline: {df_train_final.shape}")
print(f"Test shape after pipeline: {df_test_final.shape}")
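One caveat about the pipeline above: `transform` builds a fresh `DataPreprocessor` on the test data, so the median used for imputation may be recomputed from the test set rather than reused from training. That is an assumption about the toolkit's behavior; this tutorial does not show a way to persist imputation values. A common pattern for avoiding this kind of leakage is to learn the fill values in `fit` and reuse them in `transform`, sketched below in plain pandas with a hypothetical `MedianImputer` helper.

```python
import numpy as np
import pandas as pd

class MedianImputer:
    """Hypothetical helper: learns medians on training data and reuses them later."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.medians_ = None

    def fit(self, df):
        # Medians are computed from the training data only (NaNs are skipped)
        self.medians_ = df[self.columns].median()
        return self

    def transform(self, df):
        if self.medians_ is None:
            raise RuntimeError("Call fit() before transform()")
        # Fill each column with its training median; returns a new DataFrame
        return df.fillna(self.medians_.to_dict())

train = pd.DataFrame({'bmi': [20.0, 30.0, np.nan, 40.0],
                      'vehicle_age': [1.0, 3.0, 5.0, np.nan]})
test = pd.DataFrame({'bmi': [np.nan, 50.0],
                     'vehicle_age': [np.nan, 10.0]})

imputer = MedianImputer(['bmi', 'vehicle_age']).fit(train)
test_filled = imputer.transform(test)
print(test_filled)
```

The same idea applies to outlier caps and scaling parameters: anything estimated from data should be estimated once, on the training split.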

## Edge Case Handling

In [None]:
# Example 1: Handling constant columns
df_with_constant = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [1, 1, 1, 1, 1],  # Constant
    'target': [10, 20, 15, 25, 18]
})

analyzer_const = DataAnalyzer(df_with_constant)
outliers = analyzer_const.detect_outliers_zscore()
print("Edge Case 1: Constant Column Handling")
print("Outliers detected (skips constant columns):")
print(outliers)
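Constant columns need special treatment because a z-score divides by the column's standard deviation, which is zero when every value is identical. The sketch below shows one way to guard against that. It is not the toolkit's implementation, and both the threshold of 3 and the function name are assumptions.

```python
import numpy as np
import pandas as pd

def zscore_outliers(df, threshold=3.0):
    """Return {column: index labels of outliers}, skipping zero-variance columns."""
    result = {}
    for col in df.select_dtypes(include='number').columns:
        std = df[col].std(ddof=0)
        if std == 0 or np.isnan(std):
            continue  # constant (or empty) column: the z-score is undefined
        z = (df[col] - df[col].mean()) / std
        result[col] = df.index[z.abs() > threshold].tolist()
    return result

df_demo = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5] * 4 + [100],  # one extreme value at the end
    'feature2': [1] * 21,                     # constant column
})
print(zscore_outliers(df_demo))
```

Without the guard, the division would produce NaN or infinite z-scores for the constant column instead of a clean result.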

In [None]:
# Example 2: Single-class target
df_single_class = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'target': [1] * 100  # All same class
})

try:
    target_analyzer_single = TargetAnalyzer(df_single_class, target_column='target')
    imbalance = target_analyzer_single.get_class_imbalance_info()
    print("\nEdge Case 2: Single-Class Target Handling")
    print(f"Imbalance ratio: {imbalance['imbalance_ratio']}")
except Exception as e:
    print(f"\nEdge Case 2: Handled gracefully - {str(e)}")

In [None]:
# Example 3: Highly imbalanced data
df_imbalanced = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'target': [0] * 950 + [1] * 50  # 95-5 split
})

target_analyzer_imb = TargetAnalyzer(df_imbalanced, target_column='target')
imbalance_info = target_analyzer_imb.get_class_imbalance_info()

print("\nEdge Case 3: Highly Imbalanced Data")
print(f"Imbalance severity: {imbalance_info['imbalance_severity']}")
print(f"Imbalance ratio: {imbalance_info['imbalance_ratio']:.2f}")
print("Recommendation: Use SMOTE or class weights during modeling")
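To make the class-weight recommendation concrete, the sketch below computes an imbalance ratio and "balanced" class weights with pandas. The ratio is taken as majority count divided by minority count, which is an assumption about how the toolkit defines it. The weight formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn documents for `class_weight='balanced'`.

```python
import pandas as pd

target = pd.Series([0] * 950 + [1] * 50, name='target')  # same 95-5 split as above

counts = target.value_counts()

# Assumed definition: majority class count divided by minority class count
imbalance_ratio = counts.max() / counts.min()

# Balanced weights: rarer classes receive proportionally larger weights
class_weights = (len(target) / (counts.size * counts)).to_dict()

print(f"Imbalance ratio: {imbalance_ratio:.2f}")
print(f"Class weights: {class_weights}")
```

A dictionary like `class_weights` can typically be passed to estimators that accept per-class weights, which avoids generating synthetic samples.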

---
# Wrap-Up

## What We've Covered

### 1. Exploratory Data Analysis
- DataAnalyzer for general EDA
- TargetAnalyzer for target-aware analysis
- Statistical assumption checking
- Correlation and multicollinearity detection

### 2. Data Preprocessing
- Missing value handling strategies
- Outlier detection and treatment
- String preprocessing
- Data validation
- Operation history tracking

### 3. Feature Engineering
- Encoding categorical variables
- Scaling methods
- Mathematical transformations
- Datetime feature extraction
- Creating derived features
- Saving and loading transformers

### 4. Feature Selection
- Variance threshold
- Correlation-based filtering
- Statistical tests
- Tree-based importance
- Automatic selection pipeline

### 5. End-to-End Pipeline
- Complete workflow from raw data to ML-ready
- Reproducible preprocessing
- Applying pipeline to new data
- Exporting reports and transformers

### 6. Advanced Techniques
- Statistical robustness utilities
- Production deployment patterns
- Edge case handling
- Custom pipeline creation

## Next Steps

- Apply these patterns to your own datasets
- Explore the statistical utilities for rigorous analysis
- Build production-ready pipelines
- Check out the GitHub repo for updates: https://github.com/bluelion1999/feature_engineering_tk

Happy feature engineering! 🚀