# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 8 - Notebook 01B: Regularized Linear Models
**Instructor:** Amir Charkhi | **Goal:** Master regularization techniques

> Ridge, Lasso, and ElasticNet for better models

## 📚 What You'll Learn

**Problem:** Overfitting and noisy features

**Solutions:**
1. ✅ Ridge Regression (L2) - Shrink all coefficients
2. ✅ Lasso Regression (L1) - Select features automatically
3. ✅ ElasticNet - Best of both worlds

**Why Regularization?**
- Prevents overfitting
- Handles many features
- Automatic feature selection (Lasso)
- More stable predictions

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
print("✅ Regularization module loaded!")

---

## Part 5: Understanding the Problem

**Scenario:** Many features, some are noisy or irrelevant

### 5.1: Create Dataset with Many Features

In [None]:
# Generate data with useful and noisy features
np.random.seed(42)
n_samples = 200
n_features = 20

# Create features
X = np.random.randn(n_samples, n_features)

# Only first 5 features are useful
true_coef = np.zeros(n_features)
true_coef[:5] = [10, 8, 6, 4, 2]  # Real effects
# Rest are 0 (noise features)

# Create target
y = X @ true_coef + np.random.randn(n_samples) * 2

print(f"Dataset: {n_samples} samples, {n_features} features")
print(f"Only 5 features are truly useful!")
print(f"True coefficients: {true_coef[:8]}...")

### 5.2: Split and Scale Data

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# IMPORTANT: Scale features for regularization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training: {len(X_train)} samples")
print(f"Test: {len(X_test)} samples")
print("✅ Data scaled (required for regularization!)")
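
One way to keep the fit-on-train, transform-on-test discipline from being broken by accident is to bundle the scaler and the model in a scikit-learn pipeline. The following is a minimal sketch, assuming synthetic data shaped like the dataset in 5.1; the `_demo` names and the `rng` generator are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): 200 samples, 20 features, 5 informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
coef_demo = np.zeros(20)
coef_demo[:5] = [10, 8, 6, 4, 2]
y_demo = X_demo @ coef_demo + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics when scoring the test data.
pipe_demo = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_demo.fit(X_tr, y_tr)

pipe_test_r2 = pipe_demo.score(X_te, y_te)
print(f"Pipeline test R²: {pipe_test_r2:.4f}")
```

A pipeline like this can also be passed directly to cross-validation helpers, which then refit the scaler inside every fold.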

### 5.3: Baseline - Regular Linear Regression

In [None]:
# Train regular linear regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Evaluate
train_score = lr.score(X_train_scaled, y_train)
test_score = lr.score(X_test_scaled, y_test)

print("Regular Linear Regression:")
print(f"  Training R²: {train_score:.4f}")
print(f"  Test R²:     {test_score:.4f}")
print(f"  Difference:  {train_score - test_score:.4f}")

if train_score - test_score > 0.1:
    print("\n⚠️ Overfitting detected! Need regularization.")

### 5.4: Visualize Coefficients

In [None]:
# Plot coefficients vs true coefficients
plt.figure(figsize=(12, 5))
x_pos = np.arange(len(lr.coef_))

plt.bar(x_pos, true_coef, alpha=0.5, label='True Coefficients', color='green')
plt.bar(x_pos, lr.coef_, alpha=0.5, label='Learned Coefficients', color='blue')

plt.xlabel('Feature Index', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('True vs Learned Coefficients (Linear Regression)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("💡 Notice: Noise features (5+) get non-zero coefficients!")
print("   This is overfitting to training noise.")

---

## Part 6: Ridge Regression (L2 Regularization)

**How it works:** Penalize large coefficients

**Formula:** Minimize: `MSE + α × (sum of squared coefficients)`

**Effect:** Shrinks all coefficients, especially large ones
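
To make the penalty concrete, here is a small sketch that solves Ridge in closed form with NumPy and compares it with scikit-learn. It assumes no intercept to keep the algebra short, and all `_demo` names are illustrative. Note that scikit-learn's `Ridge` penalizes the sum of squared errors rather than the mean, so its `alpha` is on a different scale from an MSE-based formula.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data (assumption): 100 samples, 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=100)

alpha_demo = 10.0
n_feat = X_demo.shape[1]

# Closed form without intercept: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(
    X_demo.T @ X_demo + alpha_demo * np.eye(n_feat),
    X_demo.T @ y_demo,
)

# scikit-learn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2
ridge_demo = Ridge(alpha=alpha_demo, fit_intercept=False).fit(X_demo, y_demo)

# Ordinary least squares for comparison (the alpha = 0 case)
w_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("closed form:", np.round(w_closed, 4))
print("sklearn    :", np.round(ridge_demo.coef_, 4))
print("L2 norm, OLS vs Ridge:", np.linalg.norm(w_ols), np.linalg.norm(w_closed))
```

The two coefficient vectors should agree to numerical precision, and the Ridge vector has a smaller L2 norm than the OLS one.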

### 6.1: Train Ridge Model

In [None]:
# Train Ridge with default alpha
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Evaluate
ridge_train = ridge.score(X_train_scaled, y_train)
ridge_test = ridge.score(X_test_scaled, y_test)

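print("Ridge Regression (α=1.0):")
print(f"  Training R²: {ridge_train:.4f}")
print(f"  Test R²:     {ridge_test:.4f}")
print(f"  Difference:  {ridge_train - ridge_test:.4f}")

# Only claim an improvement if the train/test gap actually shrank
if (ridge_train - ridge_test) < (train_score - test_score):
    print("\n✅ Less overfitting than plain linear regression!")
else:
    print("\nNo reduction in the train/test gap at this alpha.")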

### 6.2: Compare Coefficients

In [None]:
# Compare coefficient magnitudes
plt.figure(figsize=(12, 5))
x_pos = np.arange(n_features)

plt.bar(x_pos - 0.2, lr.coef_, width=0.4, alpha=0.7, 
        label='Linear Regression', color='blue')
plt.bar(x_pos + 0.2, ridge.coef_, width=0.4, alpha=0.7, 
        label='Ridge', color='red')

plt.xlabel('Feature Index', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Linear Regression vs Ridge Coefficients', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("💡 Ridge shrinks coefficients toward zero, but does not set them exactly to zero!")

### 6.3: Effect of Alpha

In [None]:
# Try different alpha values
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores = []
test_scores = []

for alpha in alphas:
    # Temporary name, so the α=1.0 `ridge` model from 6.1 stays available for later plots
    ridge_temp = Ridge(alpha=alpha)
    ridge_temp.fit(X_train_scaled, y_train)
    train_scores.append(ridge_temp.score(X_train_scaled, y_train))
    test_scores.append(ridge_temp.score(X_test_scaled, y_test))

print("Ridge Performance vs Alpha:")
print("\nAlpha   | Train R² | Test R²  | Difference")
print("-" * 50)
for alpha, train, test in zip(alphas, train_scores, test_scores):
    print(f"{alpha:7.3f} | {train:8.4f} | {test:8.4f} | {train-test:8.4f}")

### 6.4: Visualize Alpha Effect

In [None]:
# Plot scores vs alpha
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_scores, 'o-', label='Training R²', linewidth=2)
plt.semilogx(alphas, test_scores, 's-', label='Test R²', linewidth=2)
plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Ridge: Effect of Alpha on Performance', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Alpha controls regularization:")
print("  - Low alpha = less regularization (might overfit)")
print("  - High alpha = more regularization (might underfit)")
print("  - Sweet spot in the middle!")
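
Picking alpha by looking at test scores leaks information from the test set into the choice. A common alternative is to choose alpha by cross-validation on the training data only. The sketch below uses scikit-learn's `RidgeCV` for that; the synthetic data and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): same shape as the notebook's dataset
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
coef_demo = np.zeros(20)
coef_demo[:5] = [10, 8, 6, 4, 2]
y_demo = X_demo @ coef_demo + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
scaler_demo = StandardScaler().fit(X_tr)

# Candidate alphas: 0.001, 0.01, 0.1, 1, 10, 100
alpha_grid = np.logspace(-3, 2, 6)

# RidgeCV evaluates every alpha with cross-validation on the training data
ridge_cv = RidgeCV(alphas=alpha_grid).fit(scaler_demo.transform(X_tr), y_tr)
chosen_alpha = ridge_cv.alpha_

# The test set is touched once, only to report the final score
test_r2 = ridge_cv.score(scaler_demo.transform(X_te), y_te)
print(f"Chosen alpha: {chosen_alpha}, test R²: {test_r2:.4f}")
```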

---

## Part 7: Lasso Regression (L1 Regularization)

**How it works:** Penalize absolute value of coefficients

**Formula:** Minimize: `MSE + α × (sum of absolute coefficients)`

**Magic:** Can set coefficients to EXACTLY zero (feature selection!)
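
Why exact zeros? Under the textbook simplification of orthonormal features (XᵀX = I), the Lasso solution for the objective ½‖y − Xw‖² + λ‖w‖₁ is the ordinary least-squares coefficient passed through a soft-threshold, while Ridge with ‖y − Xw‖² + α‖w‖² only rescales each coefficient by 1/(1 + α). The sketch below illustrates that special case; the function names and the example coefficients are made up for illustration.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Move each value toward zero by `threshold`, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def ridge_shrink(z, alpha):
    """Rescale each value by 1 / (1 + alpha); never exactly zero unless z is."""
    return z / (1.0 + alpha)

# Hypothetical least-squares coefficients: two strong, two weak
ols_coefs = np.array([10.0, 2.0, 0.3, -0.2])

lasso_like = soft_threshold(ols_coefs, 0.5)
ridge_like = ridge_shrink(ols_coefs, 0.5)

print("Lasso-like:", lasso_like)
print("Ridge-like:", np.round(ridge_like, 3))
```

Coefficients smaller in magnitude than the threshold are clipped to zero by the soft-threshold, whereas the Ridge-style rescaling leaves every coefficient non-zero. With correlated features the picture is more involved, but the same mechanism explains the sparsity.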

### 7.1: Train Lasso Model

In [None]:
# Train Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Evaluate
lasso_train = lasso.score(X_train_scaled, y_train)
lasso_test = lasso.score(X_test_scaled, y_test)

print("Lasso Regression (α=0.1):")
print(f"  Training R²: {lasso_train:.4f}")
print(f"  Test R²:     {lasso_test:.4f}")
print(f"  Difference:  {lasso_train - lasso_test:.4f}")

### 7.2: Automatic Feature Selection

In [None]:
# Count non-zero coefficients
n_nonzero = np.sum(lasso.coef_ != 0)
n_zero = np.sum(lasso.coef_ == 0)

print(f"\n🎯 Lasso Feature Selection:")
print(f"  Selected features: {n_nonzero} (non-zero coefficients)")
print(f"  Removed features:  {n_zero} (zero coefficients)")
print(f"\n  Lasso automatically removed {n_zero} features!")

# Show which features were selected
selected = np.where(lasso.coef_ != 0)[0]
print(f"\n  Selected feature indices: {selected}")
print(f"  (Remember: first 5 are the real features!)")
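
Because the true coefficients are known for synthetic data, the selection can be checked against the ground truth instead of by eye. A minimal sketch, assuming data generated like the notebook's dataset and a hypothetical `alpha=0.5`:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): first 5 of 20 features are informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
coef_demo = np.zeros(20)
coef_demo[:5] = [10, 8, 6, 4, 2]
y_demo = X_demo @ coef_demo + rng.normal(scale=2.0, size=200)

X_scaled = StandardScaler().fit_transform(X_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_scaled, y_demo)

# Compare the selected indices with the truly informative ones
true_support = set(np.flatnonzero(coef_demo).tolist())
selected_demo = set(np.flatnonzero(lasso_demo.coef_).tolist())

true_pos = true_support & selected_demo    # informative and kept
false_pos = selected_demo - true_support   # noise features that were kept
missed = true_support - selected_demo      # informative features that were dropped

print(f"Kept informative features: {sorted(true_pos)}")
print(f"Kept noise features:       {sorted(false_pos)}")
print(f"Dropped informative:       {sorted(missed)}")
```

On real data the true support is unknown, so this kind of check is only available in simulations.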

### 7.3: Visualize Lasso Coefficients

In [None]:
# Plot coefficients
plt.figure(figsize=(12, 5))
x_pos = np.arange(n_features)

plt.bar(x_pos - 0.3, true_coef, width=0.3, alpha=0.7, 
        label='True', color='green')
plt.bar(x_pos, ridge.coef_, width=0.3, alpha=0.7, 
        label='Ridge', color='blue')
plt.bar(x_pos + 0.3, lasso.coef_, width=0.3, alpha=0.7, 
        label='Lasso', color='red')

plt.xlabel('Feature Index', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Ridge vs Lasso Coefficients', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("💡 Notice: Lasso sets many coefficients to EXACTLY zero!")
print("   Ridge only shrinks them.")

### 7.4: Lasso Path (varying alpha)

In [None]:
# Show how features are eliminated as alpha increases
alphas_lasso = [0.001, 0.01, 0.1, 0.5, 1.0, 5.0]
coef_matrix = []

for alpha in alphas_lasso:
    lasso_temp = Lasso(alpha=alpha)
    lasso_temp.fit(X_train_scaled, y_train)
    coef_matrix.append(lasso_temp.coef_)

coef_matrix = np.array(coef_matrix)

# Plot
plt.figure(figsize=(12, 6))
for i in range(min(10, n_features)):  # Plot first 10 features
    plt.plot(alphas_lasso, coef_matrix[:, i], marker='o', label=f'Feature {i}')

plt.xlabel('Alpha', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Lasso Path: Coefficients vs Alpha', fontsize=14, fontweight='bold')
plt.xscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 As alpha increases, more coefficients become zero!")
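
Rather than reading a good alpha off the path by eye, scikit-learn's `LassoCV` can pick one by cross-validation over an automatically generated grid. A short sketch, again on assumed synthetic data with illustrative `_demo` names:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): first 5 of 20 features are informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
coef_demo = np.zeros(20)
coef_demo[:5] = [10, 8, 6, 4, 2]
y_demo = X_demo @ coef_demo + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
scaler_demo = StandardScaler().fit(X_tr)

# 5-fold cross-validation over LassoCV's default alpha grid
lasso_cv = LassoCV(cv=5, random_state=0).fit(scaler_demo.transform(X_tr), y_tr)

chosen_alpha = lasso_cv.alpha_
n_kept = int(np.count_nonzero(lasso_cv.coef_))
print(f"Chosen alpha: {chosen_alpha:.4f}")
print(f"Non-zero coefficients: {n_kept} of {lasso_cv.coef_.size}")
```

The alpha that minimizes cross-validated error tends to be fairly small, so it may keep a few noise features alongside the informative ones.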

---

## Part 8: ElasticNet (L1 + L2)

**Best of both worlds!**

**Combines:**
- Ridge penalty (shrink coefficients)
- Lasso penalty (select features)

**Use when:** Many correlated features
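
To see how the two penalties are mixed, the sketch below writes out the objective that scikit-learn documents for `ElasticNet` and checks that the fitted coefficients score lower on it than two reference points. It assumes no intercept for brevity; the helper `elastic_net_objective` and the data are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Objective documented for sklearn's ElasticNet (no intercept)."""
    n = X.shape[0]
    residual = y - X @ w
    data_fit = residual @ residual / (2 * n)
    l1_term = alpha * l1_ratio * np.sum(np.abs(w))
    l2_term = 0.5 * alpha * (1 - l1_ratio) * np.sum(w ** 2)
    return data_fit + l1_term + l2_term

# Illustrative data (assumption): 3 of 8 features are informative
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 8))
true_w = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0, 2.0, 0.0])
y_demo = X_demo @ true_w + rng.normal(size=150)

enet_demo = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
enet_demo.fit(X_demo, y_demo)

# Reference points: unpenalized least squares and the all-zero vector
w_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]
w_zero = np.zeros(8)

obj_fitted = elastic_net_objective(X_demo, y_demo, enet_demo.coef_, 0.1, 0.5)
obj_ols = elastic_net_objective(X_demo, y_demo, w_ols, 0.1, 0.5)
obj_zero = elastic_net_objective(X_demo, y_demo, w_zero, 0.1, 0.5)

print(f"fitted: {obj_fitted:.4f} | OLS: {obj_ols:.4f} | zeros: {obj_zero:.4f}")
```

Setting `l1_ratio=1.0` leaves only the L1 term (Lasso), while values near 0 make the penalty behave more like Ridge.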

### 8.1: Train ElasticNet

In [None]:
# Train ElasticNet
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # 50% L1, 50% L2
elastic.fit(X_train_scaled, y_train)

# Evaluate
elastic_train = elastic.score(X_train_scaled, y_train)
elastic_test = elastic.score(X_test_scaled, y_test)

print("ElasticNet (α=0.1, l1_ratio=0.5):")
print(f"  Training R²: {elastic_train:.4f}")
print(f"  Test R²:     {elastic_test:.4f}")
print(f"  Difference:  {elastic_train - elastic_test:.4f}")

# Feature selection
n_selected = np.sum(elastic.coef_ != 0)
print(f"\n  Selected {n_selected} features")

### 8.2: Compare All Methods

In [None]:
# Summary comparison
results = pd.DataFrame({
    'Model': ['Linear', 'Ridge', 'Lasso', 'ElasticNet'],
    'Train R²': [train_score, ridge_train, lasso_train, elastic_train],
    'Test R²': [test_score, ridge_test, lasso_test, elastic_test],
    'Gap': [train_score - test_score,
            ridge_train - ridge_test,
            lasso_train - lasso_test,
            elastic_train - elastic_test],
    'Features': [n_features,
                 n_features,
                 np.sum(lasso.coef_ != 0),
                 np.sum(elastic.coef_ != 0)]
})

print("\n📊 MODEL COMPARISON:\n")
print(results.to_string(index=False))

print("\n💡 Analysis:")
print("  - Linear: No penalty, so most prone to overfitting (check the Gap column)")
print("  - Ridge: Reduces overfitting, keeps all features")
print("  - Lasso: Automatic feature selection")
print("  - ElasticNet: Balanced approach")

### 8.3: Visualize Model Comparison

In [None]:
# Bar plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Performance comparison
x = np.arange(len(results))
width = 0.35

axes[0].bar(x - width/2, results['Train R²'], width, label='Train', alpha=0.8)
axes[0].bar(x + width/2, results['Test R²'], width, label='Test', alpha=0.8)
axes[0].set_xlabel('Model', fontsize=12)
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('Model Performance Comparison', fontsize=13, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results['Model'])  # label the ticks with the model names
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Feature selection
axes[1].bar(results['Model'], results['Features'], alpha=0.8, color='coral')
axes[1].set_xlabel('Model', fontsize=12)
axes[1].set_ylabel('Number of Features Used', fontsize=12)
axes[1].set_title('Feature Selection Comparison', fontsize=13, fontweight='bold')
axes[1].axhline(y=5, color='red', linestyle='--', label='True # of useful features')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
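
A single train/test split can make small differences between models look bigger or smaller than they are. As a complementary check, the sketch below compares the same four models with 5-fold cross-validation. The data and the `_demo` names are assumptions for illustration, and the alpha values mirror the ones used earlier in this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): first 5 of 20 features are informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
coef_demo = np.zeros(20)
coef_demo[:5] = [10, 8, 6, 4, 2]
y_demo = X_demo @ coef_demo + rng.normal(scale=2.0, size=200)

models_demo = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

cv_means = {}
for name, model in models_demo.items():
    # Scaling lives inside the pipeline, so it is refit within every fold
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    cv_means[name] = scores.mean()
    print(f"{name:<10} mean R² = {scores.mean():.4f} (std {scores.std():.4f})")
```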

---

## Part 9: Choosing the Right Model

**Decision Guide:**

### 9.1: When to Use Each Model

In [None]:
# Decision framework
decision_guide = """
🎯 MODEL SELECTION GUIDE:

1. LINEAR REGRESSION
   Use when:
   - Few features (< 10)
   - All features are relevant
   - No multicollinearity
   - Need interpretability
   ⚠️ Avoid when: Many features or noise

2. RIDGE REGRESSION
   Use when:
   - Many correlated features
   - Want to keep all features
   - Prevent overfitting
   ✅ Safe default choice

3. LASSO REGRESSION
   Use when:
   - Many features, few are useful
   - Want automatic feature selection
   - Need sparse model
   ⚠️ Can be unstable with correlated features

4. ELASTICNET
   Use when:
   - Many correlated features
   - Want some feature selection
   - Lasso is unstable
   ✅ Best of Ridge + Lasso

5. POLYNOMIAL REGRESSION
   Use when:
   - Non-linear relationships
   - Curved patterns in data
   ⚠️ Watch for overfitting!
"""

print(decision_guide)
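
When it is unclear which penalty suits a dataset, the choice can be left to a cross-validated search. The sketch below tunes `alpha` and `l1_ratio` of an ElasticNet inside a pipeline; setting `l1_ratio=1.0` is equivalent to Lasso, so the grid spans Lasso-like and more Ridge-like settings. The grid values, step names, and data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): first 5 of 20 features are informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
coef_demo = np.zeros(20)
coef_demo[:5] = [10, 8, 6, 4, 2]
y_demo = X_demo @ coef_demo + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", ElasticNet(max_iter=10_000)),
])

# Hypothetical search grid; widen or refine it for real data
param_grid = {
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.2, 0.5, 0.8, 1.0],
}

search_demo = GridSearchCV(pipe_demo, param_grid, cv=5, scoring="r2")
search_demo.fit(X_tr, y_tr)

best_params = search_demo.best_params_
test_r2 = search_demo.score(X_te, y_te)
print(f"Best parameters: {best_params}")
print(f"Test R² of the refit best model: {test_r2:.4f}")
```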

---

## 🎓 Key Takeaways

### Linear Models Summary:

1. **Simple Linear Regression**
   - One feature → fast and interpretable
   - Great for understanding relationships

2. **Multiple Linear Regression**
   - Many features → more realistic
   - Check for multicollinearity

3. **Polynomial Regression**
   - Handle curves → more flexible
   - Careful with degree (don't overfit!)

4. **Regularization** (Ridge/Lasso/ElasticNet)
   - Prevents overfitting → better generalization
   - Essential with many features

### Best Practices:

✅ Always scale features for regularization  
✅ Use train/test split  
✅ Try multiple alpha values  
✅ Compare train vs test performance  
✅ Check coefficient magnitudes  
✅ Visualize predictions vs actuals  

### Next Steps:

→ **Notebook 02:** Tree-Based Models  
→ **Lab:** Practice with real datasets  
→ **Project:** Compare linear vs tree models  

---

**Great job! You now understand the core linear modeling techniques! 🎉**