# Regularization (Ridge/Lasso) + Project Proposal Sprint

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center><a class="tocSkip"></a></center>
# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/05_regularization_project_proposal.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain why regularization improves generalization
2. Fit Ridge/Lasso with proper scaling and CV selection
3. Interpret coefficient shrinkage and sparsity
4. Draft a project proposal with a viable dataset + target + metric + split plan
5. Use Gemini to scaffold code and then add guardrails (checks + comments)

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings

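# Silence all warnings for cleaner output (this also hides scikit-learn ConvergenceWarning; comment out when debugging)
warnings.filterwarnings('ignore')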
pd.set_option('display.precision', 4)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

## 1. Load Data

In [None]:
california = fetch_california_housing(as_frame=True)
df = california.frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED)

print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)} (locked)")
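
A quick guardrail for the split above: the sketch below checks that the three parts share no rows and together cover the full dataset. It uses a small synthetic frame rather than the housing data, and `check_split` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (hypothetical column names).
rng = np.random.default_rng(474)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=1000), name="target")

# Same two-step split as above: 20% test, then 25% of the remainder for validation.
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=474)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=474)

def check_split(parts, n_total):
    """Raise if the splits overlap or do not cover every row exactly once."""
    index_sets = [set(p.index) for p in parts]
    for i in range(len(index_sets)):
        for j in range(i + 1, len(index_sets)):
            assert index_sets[i].isdisjoint(index_sets[j]), "splits share rows"
    assert sum(len(s) for s in index_sets) == n_total, "rows lost or duplicated"
    return [len(p) / n_total for p in parts]

fractions = check_split([X_tr, X_va, X_te], len(X_demo))
print([round(f, 2) for f in fractions])  # → [0.6, 0.2, 0.2]
```

Taking 25% of the remaining 80% gives a 60/20/20 train/validation/test split overall.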

## 2. Why Regularization?

### The Problem: Overfitting

Without regularization, linear regression can:
- Fit noise in training data
- Produce unstable coefficients
- Perform poorly on new data
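
To make "unstable coefficients" concrete, here is a minimal NumPy sketch under simulated, assumed conditions: two nearly identical features and a noisy target. The data and the helper names `fit_ols` and `simulate` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Least-squares coefficients (no intercept, for brevity)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate(n=50):
    """Draw one dataset where x2 is almost a copy of x1."""
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)      # near-duplicate feature
    y = x1 + x2 + rng.normal(scale=1.0, size=n)   # true coefficients are (1, 1)
    return np.column_stack([x1, x2]), y

# Refit on 200 fresh datasets and watch how much the estimates move.
coefs = np.array([fit_ols(*simulate()) for _ in range(200)])
coef_std = coefs.std(axis=0)
sum_std = coefs.sum(axis=1).std()

print("std of each coefficient across datasets:", coef_std)
print("std of their sum:", sum_std)
```

The individual coefficients swing widely from dataset to dataset while their sum stays comparatively stable: the data pins down the combined effect but not how to divide it between the two features.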

### The Solution: Penalize Complexity

**Ridge (L2 penalty)**
- Adds penalty: α × (sum of squared coefficients)
- Shrinks all coefficients toward zero
- Keeps all features
- Use when: All features might be relevant
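
The Ridge bullets above can be checked with the closed-form solution. This is a sketch on synthetic, standardized data with a centered target (so no intercept is needed); `ridge_closed_form` is a hypothetical helper written for illustration, not how scikit-learn is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize columns
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()                                   # center so no intercept is needed

def ridge_closed_form(X, y, alpha):
    """Solve (X'X + alpha * I) w = X'y, the ridge normal equations."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

norms = []
for alpha in [0.0, 10.0, 100.0, 1000.0]:
    w = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>7.1f}  ||w||={norms[-1]:.3f}  exact zeros={int(np.sum(w == 0))}")
```

The coefficient norm falls as alpha grows, yet no coefficient lands exactly on zero, which is why Ridge keeps every feature.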

**Lasso (L1 penalty)**
- Adds penalty: α × (sum of absolute coefficients)
- Can shrink coefficients to exactly zero
- Performs feature selection
- Use when: You expect many irrelevant features
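
Why can an L1 penalty produce exact zeros when L2 cannot? In the textbook special case of orthonormal features (an assumption real data rarely meets), and up to how the loss is scaled, the Lasso solution is the unpenalized coefficient moved toward zero by a fixed amount, while the Ridge solution is the same coefficient rescaled. The coefficient values below are hypothetical.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Move each value toward zero by alpha; values within alpha of zero become exactly 0."""
    return np.where(np.abs(z) > alpha, z - alpha * np.sign(z), 0.0)

ols_coefs = np.array([3.0, -2.0, 0.5, -0.1])   # hypothetical unpenalized coefficients
alpha = 0.6

lasso_like = soft_threshold(ols_coefs, alpha)   # L1: subtract alpha, clip at zero
ridge_like = ols_coefs / (1.0 + alpha)          # L2: rescale proportionally

print("L1 (soft threshold):", lasso_like)
print("L2 (rescale):       ", ridge_like)
```

The two small coefficients are removed entirely under L1, whereas under L2 they only get smaller.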

**Elastic Net**
- Combines L1 + L2
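- Use when: You want feature selection but features are correlated (Lasso tends to keep one feature from a correlated group somewhat arbitrarily; the added L2 term makes the selection more stable)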
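
For reference, the three penalties can be written as the objectives that scikit-learn documents for `Ridge`, `Lasso`, and `ElasticNet`, where $n$ is the number of samples, $w$ the coefficient vector, and $\rho$ the `l1_ratio` parameter:

$$
\begin{aligned}
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Because the Ridge loss is not divided by $2n$ while the Lasso and Elastic Net losses are, the same numeric value of $\alpha$ means different amounts of regularization in each model, so the alphas chosen for Ridge and Lasso should not be compared directly.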

## 3. Ridge Regression with Cross-Validation

In [None]:
# Ridge with automatic alpha tuning
alphas = np.logspace(-3, 3, 50)

ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', RidgeCV(alphas=alphas, cv=5))
])

ridge_pipeline.fit(X_train, y_train)

best_alpha = ridge_pipeline.named_steps['ridge'].alpha_
train_score_ridge = ridge_pipeline.score(X_train, y_train)
val_score_ridge = ridge_pipeline.score(X_val, y_val)

print("=== RIDGE REGRESSION ===")
print(f"Best alpha: {best_alpha:.4f}")
print(f"Train R²: {train_score_ridge:.4f}")
print(f"Val R²: {val_score_ridge:.4f}")
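
One guardrail worth adding after any CV search: check whether the selected alpha sits on the edge of the grid, which suggests the grid may be too narrow. The sketch below uses synthetic data from `make_regression` as a stand-in, and `at_grid_edge` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the grid mirrors the one used above.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=474)
alpha_grid = np.logspace(-3, 3, 50)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", RidgeCV(alphas=alpha_grid, cv=5)),
])
pipe.fit(X_demo, y_demo)
chosen = pipe.named_steps["ridge"].alpha_

def at_grid_edge(alpha, grid):
    """True when the selected alpha is the smallest or largest value searched."""
    return bool(np.isclose(alpha, grid.min()) or np.isclose(alpha, grid.max()))

if at_grid_edge(chosen, alpha_grid):
    print(f"alpha={chosen:.4g} sits on the grid boundary; consider widening the grid")
else:
    print(f"alpha={chosen:.4g} is interior to the grid")
```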

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Run RidgeCV and summarize alpha choice + validation performance.

The code above already ran RidgeCV. Now:
1. Interpret what the chosen alpha value means
2. Compare to baseline linear regression (from previous notebooks, or the baseline fit in Section 5)
3. Explain why Ridge might help with generalization

---

### YOUR ANALYSIS:

**Alpha Interpretation:**  
[What does the chosen alpha value mean? Higher = more regularization]

**Performance Comparison:**  
[How does Ridge compare to unregularized linear regression?]

**Generalization:**  
[Why might Ridge improve validation performance?]

---

## 4. Lasso Regression with Cross-Validation

In [None]:
# Lasso with automatic alpha tuning
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(alphas=alphas, cv=5, max_iter=10000))
])

lasso_pipeline.fit(X_train, y_train)

best_alpha_lasso = lasso_pipeline.named_steps['lasso'].alpha_
train_score_lasso = lasso_pipeline.score(X_train, y_train)
val_score_lasso = lasso_pipeline.score(X_val, y_val)

print("=== LASSO REGRESSION ===")
print(f"Best alpha: {best_alpha_lasso:.4f}")
print(f"Train R²: {train_score_lasso:.4f}")
print(f"Val R²: {val_score_lasso:.4f}")
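
To see how LassoCV arrives at its alpha, you can inspect the per-fold errors it stores. The sketch below runs on synthetic data (features from `make_regression` are already on a similar scale, so the scaler is omitted here for brevity) and reads the fitted attributes `alphas_` and `mse_path_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=3,
                                 noise=10.0, random_state=474)

lasso = LassoCV(alphas=np.logspace(-3, 3, 50), cv=5, max_iter=10000).fit(X_demo, y_demo)

# mse_path_ has one row per alpha and one column per CV fold.
mean_mse = lasso.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print("alpha with lowest mean CV MSE:", lasso.alphas_[best_idx])
print("alpha_ reported by LassoCV:   ", lasso.alpha_)
```

Inside a pipeline, the same attributes are available through `named_steps['lasso']`.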

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Run LassoCV and identify top selected features (if any).

---

In [None]:
# Extract Lasso coefficients
lasso_coefs = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': lasso_pipeline.named_steps['lasso'].coef_,
    'Abs_Coefficient': np.abs(lasso_pipeline.named_steps['lasso'].coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("=== LASSO COEFFICIENTS ===")
print(lasso_coefs)

# Count non-zero coefficients
n_nonzero = (lasso_coefs['Coefficient'] != 0).sum()
n_total = len(lasso_coefs)

print(f"\n=== FEATURE SELECTION ===")
print(f"Non-zero features: {n_nonzero} / {n_total}")
print(f"Features zeroed out: {n_total - n_nonzero}")

if n_nonzero < n_total:
    print("\n✓ Lasso performed feature selection!")
    print("\nSelected features:")
    print(lasso_coefs[lasso_coefs['Coefficient'] != 0]['Feature'].tolist())
else:
    print("\n⚠️ All features retained: CV preferred a small alpha, so every feature appears to carry signal")
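
The coefficients printed above are in standardized units (change in the target per one standard deviation of a feature). A sketch of converting them back to original units, using synthetic data and a fixed, arbitrarily chosen `alpha=0.1` as assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=474)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0])   # give features very different units

pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso(alpha=0.1))]).fit(X_demo, y_demo)
scaler, model = pipe.named_steps["scaler"], pipe.named_steps["lasso"]

# Coefficient per one standard deviation -> coefficient per one original unit.
coef_original = model.coef_ / scaler.scale_
intercept_original = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)

# The converted coefficients should reproduce the pipeline's predictions.
manual_pred = X_demo @ coef_original + intercept_original
print(np.allclose(manual_pred, pipe.predict(X_demo)))  # → True
```

Use standardized coefficients to compare feature importance within a model, and original-unit coefficients to explain the effect of a one-unit change.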

### YOUR FINDINGS:

**Feature Selection:**  
[How many features did Lasso select?]

**Top Features:**  
[Which features have the largest coefficients?]

**Comparison:**  
[How does Lasso performance compare to Ridge?]

---

## 5. Model Comparison Table

In [None]:
# Baseline for comparison
baseline_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
baseline_pipeline.fit(X_train, y_train)

comparison = pd.DataFrame([
    {
        'Model': 'Linear Regression',
        'Alpha': 'None',
        'Train_R2': baseline_pipeline.score(X_train, y_train),
        'Val_R2': baseline_pipeline.score(X_val, y_val),
        'Val_MAE': mean_absolute_error(y_val, baseline_pipeline.predict(X_val)),
        'Non_Zero_Features': len(X_train.columns)
    },
    {
        'Model': 'Ridge',
        'Alpha': f"{best_alpha:.4f}",
        'Train_R2': train_score_ridge,
        'Val_R2': val_score_ridge,
        'Val_MAE': mean_absolute_error(y_val, ridge_pipeline.predict(X_val)),
        'Non_Zero_Features': len(X_train.columns)
    },
    {
        'Model': 'Lasso',
        'Alpha': f"{best_alpha_lasso:.4f}",
        'Train_R2': train_score_lasso,
        'Val_R2': val_score_lasso,
        'Val_MAE': mean_absolute_error(y_val, lasso_pipeline.predict(X_val)),
        'Non_Zero_Features': n_nonzero
    }
])

print("=== MODEL COMPARISON ===")
print(comparison.to_string(index=False))

best_model = comparison.loc[comparison['Val_R2'].idxmax(), 'Model']
print(f"\n✓ Best validation performance: {best_model}")
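
The comparison table repeats the same metric calls for each model. One way to cut the repetition is a small helper; `evaluate` below is a hypothetical function, shown on synthetic data, that also reports RMSE alongside MAE and R².

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X, y):
    """Return MAE, RMSE and R² for an already fitted model."""
    pred = model.predict(X)
    return {
        "MAE": mean_absolute_error(y, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y, pred))),
        "R2": r2_score(y, pred),
    }

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=474)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=474)

models = {"Linear": LinearRegression(), "Ridge": Ridge(alpha=1.0)}
for name, model in models.items():
    scores = evaluate(model.fit(X_tr, y_tr), X_va, y_va)
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

RMSE is never smaller than MAE on the same predictions; a large gap between the two points to a few large errors.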

## 6. Coefficient Comparison Visualization

In [None]:
# Compare coefficients across models
coef_comparison = pd.DataFrame({
    'Feature': X_train.columns,
    'Linear': baseline_pipeline.named_steps['regressor'].coef_,
    'Ridge': ridge_pipeline.named_steps['ridge'].coef_,
    'Lasso': lasso_pipeline.named_steps['lasso'].coef_
})

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(coef_comparison))
width = 0.25

ax.bar(x - width, coef_comparison['Linear'], width, label='Linear', alpha=0.8)
ax.bar(x, coef_comparison['Ridge'], width, label='Ridge', alpha=0.8)
ax.bar(x + width, coef_comparison['Lasso'], width, label='Lasso', alpha=0.8)

ax.set_xlabel('Features')
ax.set_ylabel('Coefficient Value')
ax.set_title('Coefficient Comparison: Linear vs Ridge vs Lasso')
ax.set_xticks(x)
ax.set_xticklabels(coef_comparison['Feature'], rotation=45, ha='right')
ax.legend()
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()

print("💡 Compare Ridge with Linear: Ridge pulls coefficients toward zero but rarely to exactly zero")
print("💡 Notice how Lasso can zero out some features completely")
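
The bar chart above shows each model at a single alpha. To see how coefficients evolve as regularization strengthens, you can refit over a grid of alphas and record the coefficients each time. A minimal sketch for Ridge on synthetic data (the grid range is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=474)
X_scaled = StandardScaler().fit_transform(X_demo)

path_alphas = np.logspace(-2, 4, 30)
# One row of coefficients per alpha value.
coef_path = np.array([Ridge(alpha=a).fit(X_scaled, y_demo).coef_ for a in path_alphas])

norms = np.linalg.norm(coef_path, axis=1)
print("||w|| at smallest alpha:", round(float(norms[0]), 3))
print("||w|| at largest alpha: ", round(float(norms[-1]), 3))

# To draw the path, plot each column of coef_path against path_alphas on a log x-axis:
# import matplotlib.pyplot as plt
# plt.semilogx(path_alphas, coef_path); plt.xlabel("alpha"); plt.ylabel("coefficient"); plt.show()
```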

## 7. Project Proposal Builder

### Week 1 Milestone: Project Proposal (Due Today)

Your proposal should include:

1. **Dataset Description**
   - Source and link
   - Number of samples and features
   - Brief domain context

2. **Predictive Task**
   - Target variable (what are you predicting?)
   - Prediction unit (what is one row?)
   - Regression or classification?

3. **Evaluation Plan**
   - Primary metric (and why)
   - Train/validation/test split strategy
   - Baseline approach

4. **Leakage Risk Assessment**
   - List 3 potential sources of leakage
   - How you'll prevent them

5. **Initial Concerns**
   - Data quality issues
   - Missing values strategy
   - Class imbalance (if classification)
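
For the leakage assessment, one cheap first screen is to look for features that correlate almost perfectly with the target. This is only a heuristic: a high correlation is a prompt to investigate, not proof of leakage, and the 0.95 threshold is an arbitrary assumption. The data and the helper `leakage_screen` below are hypothetical.

```python
import numpy as np
import pandas as pd

def leakage_screen(df, target, corr_threshold=0.95):
    """Flag numeric features whose absolute correlation with the target is suspiciously high."""
    corr = df.select_dtypes("number").corr()[target].drop(target).abs()
    return corr[corr > corr_threshold].sort_values(ascending=False)

# Hypothetical example: 'refund_amount' is only known after the outcome, so it leaks.
rng = np.random.default_rng(474)
demo = pd.DataFrame({"price": rng.normal(100, 20, 500), "visits": rng.poisson(3, 500)})
demo["spend"] = 0.5 * demo["price"] + rng.normal(0, 10, 500)
demo["refund_amount"] = demo["spend"] * 0.99 + rng.normal(0, 0.1, 500)

flagged = leakage_screen(demo, target="spend")
print(flagged)
```

For every flagged column, ask whether its value would actually be available at prediction time.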

---

## 📋 PROJECT PROPOSAL TEMPLATE

### 1. Dataset Description

**Dataset Name:**  
[Name]

**Source:**  
[URL or citation]

**Size:**  
[Number of samples × features]

**Domain Context:**  
[Brief description of what the data represents]

---

### 2. Predictive Task

**Target Variable:**  
[What are you predicting?]

**Prediction Unit:**  
[What does one row represent?]

**Task Type:**  
[Regression / Binary Classification / Multiclass Classification]

**Business Question:**  
[Why does this prediction matter?]

---

### 3. Evaluation Plan

**Primary Metric:**  
[MAE / RMSE / R² / Accuracy / F1 / ROC-AUC / etc.]

**Metric Rationale:**  
[Why is this metric aligned with the business goal?]

**Split Strategy:**  
- Training: [%]
- Validation: [%]
- Test: [%]
- Special considerations: [Time-based? Stratified? Grouped?]

**Baseline Approach:**  
[Mean/median predictor, most frequent class, domain heuristic, etc.]

---

### 4. Leakage Risk Assessment

**Risk 1:**  
[Describe potential leakage source and prevention strategy]

**Risk 2:**  
[Describe potential leakage source and prevention strategy]

**Risk 3:**  
[Describe potential leakage source and prevention strategy]

---

### 5. Initial Concerns

**Data Quality:**  
[Issues you've identified or expect]

**Missing Values:**  
[Strategy for handling missingness]

**Class Imbalance (if classification):**  
[How you'll handle it]

**Other Challenges:**  
[Any other concerns]

---

### 6. Success Criteria

**Minimum Viable Model:**  
[What performance would make this useful?]

**Stretch Goal:**  
[What would be excellent performance?]

---

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Regularization Intuition**: Penalize complexity to improve generalization
2. **Ridge vs Lasso**: Shrinkage vs sparsity
3. **Cross-Validation Selection**: Let CV choose alpha automatically
4. **Coefficient Interpretation**: Regularization changes coefficient magnitudes
5. **Project Planning**: Clear problem framing prevents wasted effort

### Critical Rules:

> **"Always scale features before regularization"**

> **"Use CV to tune alpha, don't peek at test"**

> **"Start project planning early - dataset choice matters"**
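
Why the first rule matters: the penalty acts on coefficient size, and coefficient size depends on the units of each feature. The sketch below (synthetic data, arbitrary `alpha=10.0`) changes the units of one feature and compares predictions before and after.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=474)

# Express the first feature in different units (e.g. metres -> kilometres).
X_rescaled = X_demo.copy()
X_rescaled[:, 0] = X_rescaled[:, 0] / 1000.0

def predictions(model, X):
    """Fit on X and return in-sample predictions."""
    return model.fit(X, y_demo).predict(X)

ols_same = np.allclose(predictions(LinearRegression(), X_demo),
                       predictions(LinearRegression(), X_rescaled))
ridge_same = np.allclose(predictions(Ridge(alpha=10.0), X_demo),
                         predictions(Ridge(alpha=10.0), X_rescaled))

print("OLS predictions unchanged by rescaling:  ", ols_same)
print("Ridge predictions unchanged by rescaling:", ridge_same)
```

Unpenalized regression gives the same predictions in either unit, but Ridge does not: the rescaled feature needs a much larger coefficient, which the penalty suppresses. Standardizing inside the pipeline removes this dependence on units.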

### Next Steps:

- **Submit your project proposal today!**
- Week 2 starts classification (logistic regression, metrics)
- Start collecting your project dataset

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Linear Model Selection and Regularization
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Shrinkage and regularization theory
- scikit-learn User Guide: [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html), [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)

---



<center>

Thank you!

</center>