# ML Model Optimization: Exploratory Analysis
## Human-Guided Interaction Term Engineering

**Expert**: Enzo Rodriguez  
**Task ID**: TASK_11251  
**Model**: Buffalo (Claude Sonnet 4.5)  
**Date**: 2026-02-10

---

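This notebook walks through an end-to-end workflow for improving a regression model: clean the data, use correlation analysis to shortlist feature pairs, engineer interaction terms from those pairs, and compare baseline models against a model trained with the selected interactions.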

## Setup

In [None]:
import os
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from data_processing import DataProcessor
from correlation_analysis import CorrelationAnalyzer
from interaction_engineering import InteractionEngineer
from model_training import ModelTrainer
from evaluation import ModelEvaluator, compare_multiple_models

# Settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Create the output directories that later cells write to, so saving does not fail
for out_dir in ('../results', '../data/processed', '../models'):
    os.makedirs(out_dir, exist_ok=True)

print("✓ Imports successful")

## 1. Load and Explore Data

In [None]:
# Initialize data processor
processor = DataProcessor()

# Load data - UPDATE THIS PATH
data_path = '../data/raw/your_dataset.csv'
target_col = 'target'  # UPDATE THIS

data = processor.load_data(data_path)

# Display first few rows
data.head()

In [None]:
# Generate and print data profile
processor.print_data_profile()

In [None]:
# Check for missing values
missing = data.isnull().sum()
missing[missing > 0]

## 2. Data Preprocessing

In [None]:
# Handle missing values
if data.isnull().sum().sum() > 0:
    data = processor.handle_missing_values(strategy='median')

# Handle outliers
data = processor.handle_outliers(method='iqr', threshold=1.5)

print(f"\nCleaned data shape: {data.shape}")
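
The IQR rule behind `method='iqr', threshold=1.5` can be sketched in plain pandas. This is an assumption about what `handle_outliers` does: the real method may drop rows rather than clip them. `clip_outliers_iqr` and the demo frame are hypothetical and exist only for illustration.

```python
import pandas as pd

def clip_outliers_iqr(df: pd.DataFrame, threshold: float = 1.5) -> pd.DataFrame:
    """Clip each numeric column to [Q1 - t*IQR, Q3 + t*IQR] (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - threshold * iqr,
                                 upper=q3 + threshold * iqr)
    return out

# Made-up data: 100.0 lies far above the upper fence and gets clipped to it
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
clipped = clip_outliers_iqr(demo)
print(clipped)
```

Clipping keeps every row, which matters when the dataset is small; dropping rows is the usual alternative when outliers are likely to be recording errors.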

In [None]:
# Encode categorical variables if present
categorical_cols = data.select_dtypes(include=['object', 'category']).columns.tolist()
if target_col in categorical_cols:
    categorical_cols.remove(target_col)

if categorical_cols:
    data = processor.encode_categorical_variables(columns=categorical_cols, method='onehot')
    print(f"Encoded {len(categorical_cols)} categorical columns")
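
For reference, one-hot encoding can be reproduced with `pandas.get_dummies`. Whether `encode_categorical_variables(..., method='onehot')` wraps this call is an assumption; the demo frame below is made up.

```python
import pandas as pd

# Made-up frame with one categorical column
demo = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Each category becomes its own 0/1 column; untouched columns are kept as they are
encoded = pd.get_dummies(demo, columns=["color"], dtype=int)
print(encoded)
```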

## 3. Correlation Analysis

In [None]:
# Initialize correlation analyzer
analyzer = CorrelationAnalyzer(data=data, target_col=target_col)

# Compute correlations
corr_matrix = analyzer.compute_correlation_matrix(method='pearson')
target_corr = analyzer.compute_target_correlations(method='pearson')

# Display top correlations with target
target_corr.head(10)
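
A minimal sketch of a target-correlation ranking, assuming `compute_target_correlations` returns one coefficient per feature. Sorting by absolute value is an assumption, and `target_correlations` is a hypothetical stand-in rather than the `CorrelationAnalyzer` implementation.

```python
import numpy as np
import pandas as pd

def target_correlations(df: pd.DataFrame, target_col: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, ranked by |r|."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target_col]).corrwith(numeric[target_col])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic data: the target depends on x1 only
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "target": 3 * x1 + rng.normal(scale=0.1, size=200),
})
ranked = target_correlations(demo, "target")
print(ranked)
```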

In [None]:
# Visualize correlation heatmap
analyzer.plot_correlation_heatmap(figsize=(14, 12), save_path='../results/correlation_heatmap.png')

In [None]:
# Visualize target correlations
analyzer.plot_target_correlations(top_n=20, save_path='../results/target_correlations.png')

In [None]:
# Identify multicollinearity
multicoll = analyzer.identify_multicollinearity(threshold=0.8)
print(f"Found {len(multicoll)} highly correlated feature pairs\n")
multicoll.head(10)
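
One plausible way to flag multicollinearity is to list every feature pair whose absolute correlation reaches the threshold. The sketch below assumes that is what `identify_multicollinearity` reports; `correlated_pairs` and its column names are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def correlated_pairs(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Feature pairs with |Pearson r| >= threshold (hypothetical helper)."""
    corr = features.corr().abs()
    rows = [
        {"feature_1": a, "feature_2": b, "abs_corr": corr.loc[a, b]}
        for a, b in combinations(corr.columns, 2)
        if corr.loc[a, b] >= threshold
    ]
    result = pd.DataFrame(rows, columns=["feature_1", "feature_2", "abs_corr"])
    return result.sort_values("abs_corr", ascending=False).reset_index(drop=True)

# Synthetic data: b is almost a rescaled copy of a, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(demo, threshold=0.8)
print(pairs)
```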

In [None]:
# Identify interaction candidates
interaction_candidates = analyzer.identify_interaction_candidates(
    target_corr_threshold=0.1,
    feature_corr_range=(0.1, 0.7),
    top_n=20
)

interaction_candidates
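
The three parameters above suggest a simple screening heuristic: keep features that correlate with the target at least weakly, then keep pairs whose mutual correlation is moderate (related, but not redundant). The sketch below implements that reading. It is an assumption about `identify_interaction_candidates`, and the `score` column is an illustrative ranking choice, not the library's.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def interaction_candidates(df, target_col, target_corr_threshold=0.1,
                           feature_corr_range=(0.1, 0.7), top_n=20):
    """Screen feature pairs for interaction terms (hypothetical helper, numeric data only)."""
    corr = df.corr().abs()
    features = [c for c in df.columns
                if c != target_col and corr.loc[c, target_col] >= target_corr_threshold]
    low, high = feature_corr_range
    rows = []
    for f1, f2 in combinations(features, 2):
        pair_corr = corr.loc[f1, f2]
        if low <= pair_corr <= high:
            # Illustrative score: product of the two target correlations
            score = corr.loc[f1, target_col] * corr.loc[f2, target_col]
            rows.append({"feature_1": f1, "feature_2": f2,
                         "feature_corr": pair_corr, "score": score})
    result = pd.DataFrame(rows, columns=["feature_1", "feature_2", "feature_corr", "score"])
    return result.sort_values("score", ascending=False).head(top_n).reset_index(drop=True)

# Synthetic data: a and b are moderately correlated and both drive the target
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = 0.5 * a + np.sqrt(0.75) * rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": rng.normal(size=500),
    "target": a + b + rng.normal(scale=0.1, size=500),
})
cands = interaction_candidates(demo, "target")
print(cands)
```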

In [None]:
# Print comprehensive correlation report
analyzer.print_report()

## 4. Interaction Engineering

In [None]:
# Initialize interaction engineer
engineer = InteractionEngineer(data=data, target_col=target_col)

# Create interaction pairs from top candidates
top_n = 10
interaction_pairs = [
    (row['feature_1'], row['feature_2'])
    for _, row in interaction_candidates.head(top_n).iterrows()
]

print(f"Creating {len(interaction_pairs)} interaction terms...")
for pair in interaction_pairs:
    print(f"  • {pair[0]} × {pair[1]}")

In [None]:
# Create multiplicative interactions
interactions = engineer.batch_create_interactions(
    interaction_pairs,
    interaction_type='multiplicative'
)

interactions.head()
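
A multiplicative interaction is the element-wise product of two columns. The sketch below assumes `batch_create_interactions` names each new column `a × b`, which matches the `'×'` filter used in section 7; `multiplicative_interactions` is a hypothetical helper.

```python
import pandas as pd

def multiplicative_interactions(df: pd.DataFrame, pairs) -> pd.DataFrame:
    """Return one product column per (feature_1, feature_2) pair."""
    return pd.DataFrame(
        {f"{a} × {b}": df[a] * df[b] for a, b in pairs},
        index=df.index,
    )

# Made-up frame
demo = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0], "x3": [0.0, 1.0, 2.0]})
products = multiplicative_interactions(demo, [("x1", "x2"), ("x2", "x3")])
print(products)
```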

In [None]:
# Evaluate interaction importance
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

importance = engineer.evaluate_interaction_importance(
    interactions,
    estimator=model,
    cv=5,
    scoring='r2'
)

importance
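
One way to score an interaction term is to compare the cross-validated R² of the base features with and without that term. The sketch below assumes `evaluate_interaction_importance` works along these lines and returns `interaction_term` and `improvement` columns, as the plotting cell expects. `interaction_improvements` is hypothetical, and the demo swaps in `LinearRegression` for the random forest so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def interaction_improvements(X, y, interactions, estimator, cv=5, scoring="r2"):
    """Mean CV score gained by adding each interaction column on its own."""
    baseline = cross_val_score(estimator, X, y, cv=cv, scoring=scoring).mean()
    rows = []
    for name in interactions.columns:
        X_plus = X.copy()
        X_plus[name] = interactions[name]
        score = cross_val_score(estimator, X_plus, y, cv=cv, scoring=scoring).mean()
        rows.append({"interaction_term": name, "score": score,
                     "improvement": score - baseline})
    result = pd.DataFrame(rows)
    return result.sort_values("improvement", ascending=False).reset_index(drop=True)

# Synthetic data: the target is the product x1 * x2 plus a little noise
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
y = X["x1"] * X["x2"] + rng.normal(scale=0.1, size=300)
terms = pd.DataFrame({
    "x1 × x2": X["x1"] * X["x2"],
    "x1 × x3": X["x1"] * X["x3"],
})
demo_importance = interaction_improvements(X, y, terms, LinearRegression())
print(demo_importance)
```

Scoring each term in isolation is cheap but ignores overlap between terms, so the improvements are not additive.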

In [None]:
# Visualize interaction importance
plt.figure(figsize=(12, 6))
top_interactions = importance.head(15)
plt.barh(top_interactions['interaction_term'], top_interactions['improvement'], alpha=0.7)
plt.xlabel('R² Improvement', fontsize=12)
plt.ylabel('Interaction Term', fontsize=12)
plt.title('Top Interaction Terms by Model Improvement', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='red', linestyle='--', linewidth=1)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('../results/interaction_importance.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Select best interactions
best_interactions = engineer.select_best_interactions(
    importance,
    threshold=0.0,
    top_n=None
)

print(f"\nSelected {len(best_interactions)} beneficial interactions")

In [None]:
# Create enhanced dataset
enhanced_data = engineer.add_interactions_to_data(interactions[best_interactions])

print(f"Original features: {data.shape[1] - 1}")
print(f"Enhanced features: {enhanced_data.shape[1] - 1}")
print(f"Added interactions: {len(best_interactions)}")

## 5. Model Training

In [None]:
# Initialize model trainer
trainer = ModelTrainer(
    data=data,
    target_col=target_col,
    test_size=0.2,
    random_state=42,
    scale_features=True
)

In [None]:
# Train baseline models
baseline_results = trainer.train_baseline_models(cv=5)

In [None]:
# Train enhanced model with interactions
enhanced_results = trainer.train_enhanced_model(
    enhanced_data=enhanced_data,
    model_name='Enhanced Random Forest',
    cv=5
)

In [None]:
# Compare all models
trainer.print_comparison()

## 6. Model Evaluation

In [None]:
# Create evaluators for all models
evaluators = []

# Baseline models
for name, results in baseline_results.items():
    evaluator = ModelEvaluator(
        y_true=trainer.y_test,
        y_pred=results['predictions_test'],
        model_name=f"Baseline - {name}"
    )
    evaluators.append(evaluator)

# Enhanced model
for name, results in trainer.enhanced_results.items():
    evaluator = ModelEvaluator(
        y_true=results['y_test'],
        y_pred=results['predictions_test'],
        model_name=name
    )
    evaluators.append(evaluator)

In [None]:
# Print evaluation reports
for evaluator in evaluators:
    evaluator.print_evaluation_report()

In [None]:
# Compare all models
comparison = compare_multiple_models(evaluators)
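
A comparison table of this kind usually holds a few regression metrics per model. The sketch below computes RMSE, MAE, and R² by hand with NumPy. The metric set and the layout are assumptions about `compare_multiple_models`, and `regression_metrics` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE, and R² for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

# Made-up predictions from two imaginary models
y_demo = [1.0, 2.0, 3.0, 4.0]
demo_predictions = {
    "perfect": [1.0, 2.0, 3.0, 4.0],
    "offset": [2.0, 3.0, 4.0, 5.0],
}
table = pd.DataFrame(
    {name: regression_metrics(y_demo, pred) for name, pred in demo_predictions.items()}
).T.sort_values("rmse")
print(table)
```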

In [None]:
# Visualize best enhanced model
best_enhanced = evaluators[-1]  # The enhanced model's evaluator was appended last above; adjust the index if you train several

best_enhanced.plot_predictions(save_path='../results/best_model_predictions.png')
best_enhanced.plot_residuals(save_path='../results/best_model_residuals.png')
best_enhanced.plot_error_distribution(save_path='../results/best_model_errors.png')

## 7. Feature Importance Analysis

In [None]:
# Get feature importance from enhanced model
feature_importance = trainer.get_feature_importance('Enhanced Random Forest')

feature_importance.head(20)

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(20)
plt.barh(top_features['feature'], top_features['importance'], alpha=0.7)
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Top 20 Feature Importances (Enhanced Model)', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('../results/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Identify interaction terms among the ranked features
is_interaction = feature_importance['feature'].str.contains('×', regex=False)
interaction_features = feature_importance[is_interaction]
# Count only the interaction terms that fall inside the top 20 rows of the ranking
print(f"\nInteraction terms in top 20 features: {int(is_interaction.head(20).sum())}")
print("\nTop interaction terms:")
interaction_features.head(10)

## 8. Save Results

In [None]:
# Save processed data
enhanced_data.to_csv('../data/processed/enhanced_data.csv', index=False)
print("✓ Enhanced data saved")

# Save feature importance
feature_importance.to_csv('../results/feature_importance.csv', index=False)
print("✓ Feature importance saved")

# Save model comparison
comparison.to_csv('../results/model_comparison.csv', index=False)
print("✓ Model comparison saved")

# Save best model
trainer.save_model('Enhanced Random Forest', '../models/best_model.joblib')
print("✓ Best model saved")

## Summary

This notebook demonstrated the complete ML optimization workflow:

1. **Data Loading & Preprocessing**: Handled missing values, outliers, and categorical encoding
2. **Correlation Analysis**: Identified relationships between features and potential interaction candidates
3. **Interaction Engineering**: Created and evaluated interaction terms
4. **Model Training**: Trained baseline and enhanced models
5. **Model Evaluation**: Compared baseline and enhanced models on the held-out test set, with prediction, residual, and error-distribution plots
6. **Feature Importance**: Analyzed which features (including interactions) contribute most

### Key Insights
- Review the model comparison table to see performance improvements
- Check which interaction terms provide the most value
- Examine residual plots to verify model assumptions
- Consider domain knowledge when interpreting interaction terms

### Next Steps
- Experiment with different interaction types (ratio, difference, polynomial)
- Try other correlation methods (Spearman, Kendall)
- Tune hyperparameters for the best performing model
- Validate on additional holdout data
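
The first two next steps can be tried in plain pandas before touching the project modules. The sketch below builds ratio and difference terms for one pair and contrasts Pearson with Spearman correlation on a monotonic but nonlinear relationship. `ratio_and_difference` and its column names are hypothetical, and the data is made up.

```python
import numpy as np
import pandas as pd

def ratio_and_difference(df: pd.DataFrame, a: str, b: str) -> pd.DataFrame:
    """Ratio and difference interactions for one feature pair."""
    denom = df[b].replace(0, np.nan)  # avoid division by zero; the ratio is NaN there
    return pd.DataFrame(
        {f"{a} / {b}": df[a] / denom, f"{a} - {b}": df[a] - df[b]},
        index=df.index,
    )

demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 0.0, 4.0]})
extra = ratio_and_difference(demo, "a", "b")
print(extra)

# Spearman works on ranks, so it captures monotonic relationships that Pearson understates
mono = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
mono["y"] = np.exp(mono["x"])
pearson = mono.corr(method="pearson").loc["x", "y"]
spearman = mono.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```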

---

**The Human Element**: This framework emphasizes human-guided optimization: statistical analysis shortlists candidate interactions, and the analyst decides which ones make sense for the domain, with the aim of producing more interpretable and robust models.