# Feature Engineering + Model Selection Workflow (Project Baseline Build)

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/09_tuning_feature_engineering_project_baseline.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Engineer features inside pipelines so that no information leaks from validation or test data
2. Use `GridSearchCV` / `RandomizedSearchCV` for systematic tuning
3. Define a project-grade evaluation plan (metric + split/CV + baseline + reporting)
4. Produce a baseline model notebook that can be extended
5. Use Gemini to draft search grids and then simplify them

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (
    train_test_split, StratifiedKFold,
    GridSearchCV, RandomizedSearchCV
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import uniform, randint
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

## 1. Load Data and Baseline

In [None]:
# Load data
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

# Split
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED, stratify=y_temp)

print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
print(f"Features: {X.shape[1]}")

# Simple baseline
baseline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
])

baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_val, y_val)

print(f"\nBaseline validation accuracy: {baseline_score:.4f}")

## 2. Feature Engineering Inside Pipelines

### Safe Feature Engineering Patterns

**Rule:** All feature engineering must happen INSIDE the pipeline

**Why?**
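- Prevents leakage: each step is fit on the training data only and then applied, unchanged, to validation and test data
- Ensures reproducibility: the full sequence of transformations lives in one object
- Makes deployment easier: a single fitted pipeline accepts raw inputs and returns predictions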

**Common feature engineering steps:**
1. Polynomial features
2. Feature interactions
3. Feature selection
4. Domain-specific transformations
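
To see why the rule matters, here is a minimal sketch on synthetic noise data (the arrays `X_noise` and `y_noise` are made up for illustration and are not part of this notebook's dataset). Because the labels are unrelated to the features, an honest estimate should sit near 0.5. Selecting features on all rows before cross-validation typically inflates the score, while selecting inside the pipeline does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic noise (illustrative assumption): features carry no signal about the labels
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = np.repeat([0, 1], 50)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: feature selection sees every row, including rows later used for validation
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=cv_demo
)

# Safe: selection is refit on the training portion of each fold
safe_pipeline = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipeline, X_noise, y_noise, cv=cv_demo)

print(f"Leaky CV accuracy: {leaky_scores.mean():.3f}")
print(f"Safe CV accuracy:  {safe_scores.mean():.3f}")
```

The only difference between the two runs is where `SelectKBest` is fit, which is exactly what the pipeline controls.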

In [None]:
# Pipeline with feature engineering
fe_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k=20)),  # Keep top 20 features
    ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
])

fe_pipeline.fit(X_train, y_train)
fe_score = fe_pipeline.score(X_val, y_val)

print("=== FEATURE ENGINEERING PIPELINE ===")
print(f"Original features: {X.shape[1]}")

# See which features were selected
selected_mask = fe_pipeline.named_steps['feature_selection'].get_support()
selected_features = X.columns[selected_mask].tolist()
print(f"Selected features: {len(selected_features)}")
print(f"Validation accuracy: {fe_score:.4f}")
print(f"Change vs. baseline: {(fe_score - baseline_score):+.4f}")
print(f"\nFirst 5 selected features: {selected_features[:5]}")

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Add engineered features and re-run CV.

Try adding polynomial features (degree=2) to a subset of features.

---

In [None]:
# YOUR CODE: Add polynomial features
from sklearn.model_selection import cross_val_score

# Warning: Polynomial features can explode the feature space!
# Let's use only a subset
poly_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection_pre', SelectKBest(f_classif, k=10)),  # Reduce to 10 first
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler2', StandardScaler()),  # Re-scale after polynomial
    ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=2000))
])

# Evaluate with CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
poly_scores = cross_val_score(poly_pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
baseline_scores = cross_val_score(baseline, X_train, y_train, cv=cv, scoring='roc_auc')

print("=== POLYNOMIAL FEATURES COMPARISON ===")
print(f"Baseline:   {baseline_scores.mean():.4f} ± {baseline_scores.std():.4f}")
print(f"Polynomial: {poly_scores.mean():.4f} ± {poly_scores.std():.4f}")
print(f"\nImprovement: {(poly_scores.mean() - baseline_scores.mean()):.4f}")

# Check feature explosion
# Fit on the full training set: a 10-row sample may contain a single class,
# which would make the classifier fail and the F-test scores meaningless
poly_pipeline.fit(X_train, y_train)
n_poly_features = poly_pipeline.named_steps['poly'].n_output_features_
print(f"\n⚠️ Feature explosion: 10 → {n_poly_features} features")
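
The size of the expanded feature space can also be worked out before fitting anything. As a sketch: with `n` input features, degree `d`, and `include_bias=False`, the number of polynomial terms is `C(n + d, d) - 1`. The helper `n_poly_terms` below is a hypothetical name introduced for illustration; the loop cross-checks it against `PolynomialFeatures` on a dummy row.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_terms(n_features: int, degree: int) -> int:
    """Count polynomial terms of total degree 1..degree (no bias column)."""
    return comb(n_features + degree, degree) - 1


for n in (10, 30):
    # A single row of zeros is enough for scikit-learn to compute the output width
    poly_check = PolynomialFeatures(degree=2, include_bias=False).fit(np.zeros((1, n)))
    print(f"{n} inputs -> formula: {n_poly_terms(n, 2)}, "
          f"sklearn: {poly_check.n_output_features_}")
```

This is why the exercise reduces to 10 features first: degree-2 expansion of all 30 original columns would produce 495 features for a training set of a few hundred rows.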

### YOUR ANALYSIS:

**Did polynomial features help?**  
[Your analysis]

**What's the cost?**  
[Feature explosion, complexity, overfitting risk]

**Would you use this in production?**  
[Justify your decision]

---

## 3. GridSearchCV - Systematic Hyperparameter Tuning

### How GridSearchCV Works

1. Define parameter grid
2. Try every combination
3. Use CV to evaluate each
4. Return best parameters
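
The four steps above can be sketched as a plain loop. This is an illustration of the idea, not scikit-learn's internal implementation; the names `demo_pipeline` and `demo_grid` are made up for the example. The final lines compare the hand-rolled result with `GridSearchCV` on the same folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV, ParameterGrid, StratifiedKFold, cross_val_score
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
demo_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0]}          # Step 1: define the grid
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)

manual_results = []
for params in ParameterGrid(demo_grid):                  # Step 2: every combination
    candidate = clone(demo_pipeline).set_params(**params)
    fold_scores = cross_val_score(                       # Step 3: CV for each one
        candidate, X_demo, y_demo, cv=cv_demo, scoring='roc_auc'
    )
    manual_results.append((fold_scores.mean(), params))

# Step 4: keep the combination with the highest mean CV score
best_score, best_params = max(manual_results, key=lambda item: item[0])

search_demo = GridSearchCV(demo_pipeline, demo_grid, cv=cv_demo, scoring='roc_auc')
search_demo.fit(X_demo, y_demo)
print(f"Manual loop:  {best_params} -> {best_score:.4f}")
print(f"GridSearchCV: {search_demo.best_params_} -> {search_demo.best_score_:.4f}")
```

In practice `GridSearchCV` is preferable because it also parallelizes the fits, stores every result in `cv_results_`, and refits the best pipeline for you.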

**Warning:** Grid search can be expensive!
- 3 parameters with 3 values each = 3 × 3 × 3 = 27 combinations
- 27 combinations × 5 folds = 135 model fits
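
A quick way to check this arithmetic before launching a search is to count the grid with `ParameterGrid`. The grid below is a hypothetical example chosen only to have three parameters with three values each.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: three parameters, three candidate values each
example_grid = {
    'clf__C': [0.1, 1.0, 10.0],
    'clf__tol': [1e-5, 1e-4, 1e-3],
    'clf__max_iter': [100, 500, 1000],
}
n_folds = 5

n_combinations = len(ParameterGrid(example_grid))
n_fits = n_combinations * n_folds
print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} model fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit at the end to retrain the best combination on the full training set.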

In [None]:
# Define pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
])

# Define parameter grid
param_grid = {
    'clf__C': [0.01, 0.1, 1.0, 10.0],
    'clf__penalty': ['l2'],  # Just L2 for speed
    'clf__solver': ['lbfgs', 'liblinear']
}

print("=== GRID SEARCH CONFIGURATION ===")
print(f"Parameter grid: {param_grid}")
n_combinations = len(param_grid['clf__C']) * len(param_grid['clf__penalty']) * len(param_grid['clf__solver'])
print(f"Total combinations: {n_combinations}")
print(f"With 5-fold CV: {n_combinations * 5} model fits")

# Run grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED),
    scoring='roc_auc',
    n_jobs=-1,
    verbose=0,
    return_train_score=True
)

print("\nRunning grid search...")
grid_search.fit(X_train, y_train)

print("\n=== GRID SEARCH RESULTS ===")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Validation score: {grid_search.score(X_val, y_val):.4f}")

In [None]:
# Examine all results
results_df = pd.DataFrame(grid_search.cv_results_)
results_summary = results_df[[
    'param_clf__C', 'param_clf__solver',
    'mean_test_score', 'std_test_score',
    'mean_train_score', 'rank_test_score'
]].sort_values('rank_test_score')

print("\n=== TOP 5 PARAMETER COMBINATIONS ===")
print(results_summary.head().to_string(index=False))

# Visualize C parameter effect
plt.figure(figsize=(10, 6))
for solver in param_grid['clf__solver']:
    mask = results_df['param_clf__solver'] == solver
    plt.plot(
        results_df[mask]['param_clf__C'],
        results_df[mask]['mean_test_score'],
        marker='o', label=f'Solver: {solver}'
    )
plt.xscale('log')
plt.xlabel('C (Regularization Parameter)')
plt.ylabel('Mean CV ROC-AUC')
plt.title('Grid Search: C Parameter Effect')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 4. RandomizedSearchCV - Faster Alternative

### When to Use Randomized Search

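**Use RandomizedSearchCV when:**
- The parameter space is large
- Some parameters are continuous (for example, regularization strength)
- The time budget is limited
- You are in an initial exploration phase

**Advantage:** It evaluates a fixed number of randomly sampled combinations (`n_iter`) instead of every combination, so the cost is set by your budget rather than by the size of the space.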
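
To preview what a randomized search would try, `ParameterSampler` draws candidates from the same kind of distributions that `RandomizedSearchCV` accepts. The sketch below assumes a log-uniform range for `C`, a common choice for regularization strength; the specific bounds are illustrative, not recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Illustrative distributions: one continuous parameter, one integer parameter
example_distributions = {
    'clf__C': loguniform(1e-3, 1e2),      # continuous, spread evenly across orders of magnitude
    'clf__max_iter': randint(200, 1001),  # integers from 200 to 1000 (upper bound is exclusive)
}

sampled = list(ParameterSampler(example_distributions, n_iter=5, random_state=474))
for params in sampled:
    print(params)
```

Five draws cover the continuous range without committing to a fixed list of values, which is what a grid cannot do.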

In [None]:
# Random Forest with randomized search
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=RANDOM_SEED))
])

# Define parameter distributions
param_distributions = {
    'clf__n_estimators': randint(50, 200),
    'clf__max_depth': randint(3, 20),
    'clf__min_samples_split': randint(2, 20),
    'clf__min_samples_leaf': randint(1, 10)
}

random_search = RandomizedSearchCV(
    rf_pipeline,
    param_distributions,
    n_iter=20,  # Try 20 random combinations
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED),
    scoring='roc_auc',
    n_jobs=-1,
    random_state=RANDOM_SEED,
    return_train_score=True
)

print("=== RANDOMIZED SEARCH ===")
print(f"Parameters searched: {list(param_distributions)}")
print(f"Sampling {random_search.n_iter} random combinations...")

random_search.fit(X_train, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
print(f"Validation score: {random_search.score(X_val, y_val):.4f}")

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Run a small grid (2-3 params) and report best CV score.

The grid search in Section 3 already covers this. Now collect the results in a baseline report table:

---

In [None]:
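# Create comprehensive baseline report
from sklearn.metrics import roc_auc_score

baseline_report = []

# Model 1: Simple baseline
# baseline_score is accuracy, so compute ROC-AUC here to keep the column comparable
baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
baseline_report.append({
    'Model': 'Logistic (default)',
    'Val_ROC_AUC': baseline_auc,
    'Parameters': 'C=1.0, penalty=l2',
    'Features': X.shape[1],
    'Notes': 'Simple baseline'
})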

# Model 2: Grid search best
baseline_report.append({
    'Model': 'Logistic (tuned)',
    'Val_ROC_AUC': grid_search.score(X_val, y_val),
    'Parameters': str(grid_search.best_params_),
    'Features': X.shape[1],
    'Notes': 'Grid search optimized'
})

# Model 3: Random Forest tuned
baseline_report.append({
    'Model': 'Random Forest (tuned)',
    'Val_ROC_AUC': random_search.score(X_val, y_val),
    'Parameters': str(random_search.best_params_),
    'Features': X.shape[1],
    'Notes': 'Random search optimized'
})

report_df = pd.DataFrame(baseline_report)
print("=== PROJECT BASELINE REPORT ===")
print(report_df.to_string(index=False))

# Identify champion
best_idx = report_df['Val_ROC_AUC'].idxmax()
print(f"\n✓ Champion model: {report_df.loc[best_idx, 'Model']}")
print(f"✓ Validation ROC-AUC: {report_df.loc[best_idx, 'Val_ROC_AUC']:.4f}")

## 5. Project Baseline Notebook Scaffold

### Required Components for Project Baseline

1. **Data Loading and Audit**
   - Load dataset
   - Check for issues
   - Document data quality

2. **Train/Val/Test Splits**
   - Proper splits with stratification
   - Lock test set away

3. **Baseline Model**
   - Simple model (mean/mode/simple classifier)
   - Establishes floor performance

4. **Improved Model**
   - Preprocessing pipeline
   - Tuned hyperparameters
   - CV evaluation

5. **Evaluation Report**
   - Multiple metrics
   - Comparison table
   - Visualizations

6. **Documentation**
   - Modeling choices explained
   - Assumptions documented
   - Next steps identified
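
As a minimal sketch of components 2 and 3, the block below sets a test split aside and measures a floor with `DummyClassifier`. The dataset and the seed are placeholders to be swapped for your project's own; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# Lock the test set away: it is not touched again in this sketch
X_dev, X_locked, y_dev, y_locked = train_test_split(
    X_all, y_all, test_size=0.20, stratify=y_all, random_state=474
)

# Floor model: always predicts the most frequent class
floor_model = DummyClassifier(strategy='most_frequent')
cv_floor = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)
floor_cv = cross_validate(
    floor_model, X_dev, y_dev, cv=cv_floor, scoring=['accuracy', 'roc_auc']
)

floor_accuracy = floor_cv['test_accuracy'].mean()
floor_auc = floor_cv['test_roc_auc'].mean()
print(f"Floor accuracy: {floor_accuracy:.4f}")
print(f"Floor ROC-AUC:  {floor_auc:.4f}")
```

Any improved model should beat both numbers. The accuracy floor equals the majority-class share, which is why accuracy alone can look deceptively good on imbalanced data.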

## 6. Gemini Prompts for Tuning

### Example Prompts:

**Prompt 1: Generate Parameter Grid**
```
I'm tuning a Random Forest classifier for a binary classification task.
Generate a reasonable parameter grid for GridSearchCV including:
- n_estimators
- max_depth
- min_samples_split

Keep it small (< 20 combinations) for initial exploration.
```

**Prompt 2: Optimize Grid**
```
I ran GridSearchCV and found best params: {results}
Help me design a refined grid search around these values
to fine-tune performance.
```

**Prompt 3: Debug Search**
```
My RandomizedSearchCV is taking too long. Here's my config: {config}
Help me reduce search time while maintaining good coverage.
```

**Remember:**
- Verify Gemini's suggestions
- Start small, then expand
- Always tune with CV, never with a single split
- Document your search strategy
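
The refinement idea in Prompt 2 can also be done by hand, which makes an assistant's suggestion easier to verify. A small sketch, assuming a log-scale parameter such as `C`: the helper `refine_log_grid` and the value of `best_C` are hypothetical, introduced only for illustration.

```python
import numpy as np


def refine_log_grid(best_value: float, factor: float = 3.0, num: int = 5) -> np.ndarray:
    """Return `num` log-spaced candidates from best_value / factor to best_value * factor."""
    return np.geomspace(best_value / factor, best_value * factor, num=num)


best_C = 1.0  # hypothetical result from a first, coarse search
refined_values = refine_log_grid(best_C)
refined_param_grid = {'clf__C': refined_values.round(4).tolist()}
print(refined_param_grid)
```

With an odd `num`, the previous best value stays in the middle of the new grid, so the refined search can only match or improve on the coarse CV score.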

## 7. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Pipeline Feature Engineering**: Keep everything inside pipelines
2. **GridSearchCV**: Exhaustive search for small parameter spaces
3. **RandomizedSearchCV**: Faster exploration of large spaces
4. **Baseline Reports**: Document all models systematically
5. **Project Readiness**: Structure for reproducible modeling

### Critical Rules:

> **"All feature engineering must be in the pipeline"**

> **"Start with small grids, then refine"**

> **"Document every modeling choice"**

### Next Steps:

- Next notebook: Midterm - Business case practicum
- **Project Milestone 2 checkpoint**: Draft baseline notebook
- Apply today's patterns to your project dataset

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Python labs on feature engineering
- scikit-learn User Guide: [Grid search](https://scikit-learn.org/stable/modules/grid_search.html)
- scikit-learn User Guide: [Pipeline parameter tuning](https://scikit-learn.org/stable/modules/compose.html#pipeline-tuning)
- Provost, F., & Fawcett, T. (2013). *Data Science for Business* - Evaluation and business framing

---



<center>

Thank you!

</center>