# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 8 - Theory: Linear Models Foundation
**Instructor:** Amir Charkhi | **Goal:** Master the mathematics & intuition behind linear models

> Understanding what happens under the hood

## 📚 What You'll Learn Today

**Theory Topics:**
1. ✅ Linear Regression Mathematics
2. ✅ Loss Functions & Optimization
3. ✅ Assumptions of Linear Models
4. ✅ Logistic Regression Theory
5. ✅ Maximum Likelihood Estimation
6. ✅ Regularization Theory (Ridge, Lasso, ElasticNet)
7. ✅ Bias-Variance Tradeoff

**Why Study Theory?**
- Understand when models work (and when they don't)
- Debug problems effectively
- Make better modeling decisions
- Build intuition for advanced methods
- Communicate results confidently

**Time**: 75 minutes | **Prerequisites**: Basic calculus, linear algebra

---

In [None]:
# Essential imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("📐 THEORY OF LINEAR MODELS")
print("✅ Setup complete!")

---

## Part 1: Linear Regression Theory

### 1.1: The Model

**General Form:**
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon$$

**Matrix Form:**
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Where:
- $\mathbf{y}$ = target vector (n × 1)
- $\mathbf{X}$ = feature matrix (n × (p+1)), with a leading column of ones for the intercept $\beta_0$
- $\boldsymbol{\beta}$ = coefficients ((p+1) × 1)
- $\boldsymbol{\epsilon}$ = error/noise (n × 1)

**Key Idea:** We assume a *linear relationship* between features and target.

In [None]:
# Visualize the linear assumption
np.random.seed(42)
x = np.linspace(0, 10, 100)
y_perfect = 2 * x + 1
y_noisy = y_perfect + np.random.normal(0, 2, 100)

plt.figure(figsize=(10, 5))
plt.scatter(x, y_noisy, alpha=0.5, s=30, label='Observed data (y)')
plt.plot(x, y_perfect, 'r-', linewidth=2, label='True relationship (no noise)')
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Linear Relationship with Noise', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### 1.2: The Goal - Minimize Error

We want to find $\boldsymbol{\beta}$ that minimizes prediction errors.

**Residual (error) for one point:**
$$e_i = y_i - \hat{y}_i = y_i - (\beta_0 + \beta_1 x_i)$$

**Loss Function - Mean Squared Error (MSE):**
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2$$

**Why square the errors?**
- Penalizes large errors more
- Makes math tractable (differentiable)
- Positive values (can't cancel out)

**Matrix form:**
$$MSE = \frac{1}{n}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
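
The summation and matrix forms of the MSE are the same quantity written two ways. The sketch below checks this numerically on synthetic data; the seed, the sample size, and the candidate coefficients `beta_demo` are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility
n = 50
x_demo = rng.uniform(0, 10, size=n)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(0, 2, size=n)

# Design matrix with a leading column of ones so beta_demo[0] acts as the intercept
X_demo = np.column_stack([np.ones(n), x_demo])
beta_demo = np.array([1.0, 2.0])          # candidate coefficients (not fitted)

# Summation form: average of the squared residuals
residuals = y_demo - X_demo @ beta_demo
mse_sum = np.mean(residuals ** 2)

# Matrix form: (1/n) (y - X beta)^T (y - X beta)
mse_matrix = (residuals @ residuals) / n

print(f"Summation form: {mse_sum:.4f}")
print(f"Matrix form:    {mse_matrix:.4f}")
```

Both lines should print the same value, since the inner product of the residual vector with itself is exactly the sum of squared residuals.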

In [None]:
# Visualize residuals (errors)
from sklearn.linear_model import LinearRegression

X = x.reshape(-1, 1)
model = LinearRegression()
model.fit(X, y_noisy)
y_pred = model.predict(X)

plt.figure(figsize=(10, 5))
plt.scatter(x, y_noisy, alpha=0.5, s=30, label='Data')
plt.plot(x, y_pred, 'r-', linewidth=2, label='Fitted line')

# Draw residuals for every 5th point (20 in total)
for i in range(0, 100, 5):
    plt.plot([x[i], x[i]], [y_noisy[i], y_pred[i]], 'g--', alpha=0.5, linewidth=1)

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Residuals: Distance from Points to Line', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("💡 We minimize the sum of the squared lengths of the green lines!")

### 1.3: The Solution - Normal Equation

**Closed-form solution** (when it exists):

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

**Derivation (intuition):**
1. Take derivative of MSE with respect to $\boldsymbol{\beta}$
2. Set to zero (find minimum)
3. Solve for $\boldsymbol{\beta}$
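
The three steps above can be written out explicitly. This is a sketch of the standard derivation, starting from the matrix form of the MSE in Section 1.2 and writing $\hat{\boldsymbol{\beta}}$ for the minimizer.

Expand the loss:
$$MSE(\boldsymbol{\beta}) = \frac{1}{n}\left(\mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\right)$$

Differentiate with respect to $\boldsymbol{\beta}$:
$$\nabla_{\boldsymbol{\beta}} MSE = \frac{2}{n}\left(\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - \mathbf{X}^T\mathbf{y}\right)$$

Set the gradient to zero and solve:
$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y} \quad\Rightarrow\quad \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

The Hessian $\frac{2}{n}\mathbf{X}^T\mathbf{X}$ is positive semi-definite, so the loss is convex and this stationary point is a minimum. The last step requires $\mathbf{X}^T\mathbf{X}$ to be invertible.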

**When does it fail?**
- $\mathbf{X}^T\mathbf{X}$ is not invertible (singular matrix)
- Features are perfectly correlated (multicollinearity)
- More features than samples (p > n)

**Solution:** Use regularization (Ridge makes $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ invertible), the pseudo-inverse, or gradient descent instead!
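
To make the failure case concrete, here is a small sketch with a deliberately duplicated feature. The data, the seed, and the ridge strength `lam` are made up for illustration; `np.linalg.lstsq` and a small ridge term are shown as two possible workarounds, not the only ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = 2.0 * x1                       # a rescaled copy of x1: perfect multicollinearity
y_dup = 3.0 * x1 + rng.normal(0, 0.1, size=n)

X_dup = np.column_stack([np.ones(n), x1, x2])
gram = X_dup.T @ X_dup

# X has 3 columns but only rank 2, so X^T X (same rank) is singular
print("Columns:", X_dup.shape[1], "| rank:", np.linalg.matrix_rank(X_dup))

# lstsq returns the minimum-norm least-squares solution even when X is rank-deficient
beta_lstsq, _, rank, _ = np.linalg.lstsq(X_dup, y_dup, rcond=None)
print("lstsq coefficients:", np.round(beta_lstsq, 3))

# Adding a small ridge term makes the linear system solvable again
lam = 1e-3                          # illustrative value, not tuned
beta_ridge = np.linalg.solve(gram + lam * np.eye(3), X_dup.T @ y_dup)
print("ridge coefficients:", np.round(beta_ridge, 3))
```

The individual coefficients on `x1` and `x2` are not uniquely determined by the data, but the fitted values from the two approaches should agree closely.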

In [None]:
# Manual implementation of the Normal Equation
def normal_equation(X, y):
    """Solve linear regression using the normal equation."""
    X_with_intercept = np.c_[np.ones(len(X)), X]  # Add column of 1s for intercept
    # Solve (X^T X) beta = X^T y directly; more stable than forming the inverse explicitly
    beta = np.linalg.solve(X_with_intercept.T @ X_with_intercept, X_with_intercept.T @ y)
    return beta

# Compare with sklearn
beta_manual = normal_equation(X, y_noisy)

print(f"Manual solution: β₀={beta_manual[0]:.3f}, β₁={beta_manual[1]:.3f}")
print(f"Sklearn solution: β₀={model.intercept_:.3f}, β₁={model.coef_[0]:.3f}")
print(f"\n✅ Same results!")

### 1.4: Key Assumptions

Linear regression makes important assumptions:

#### 1. **Linearity**
- Relationship between X and y is linear
- Check: Residual plots should show no pattern

#### 2. **Independence**
- Observations are independent
- Errors are uncorrelated
- Violated in: Time series, spatial data

#### 3. **Homoscedasticity**
- Constant variance of errors
- $Var(\epsilon_i) = \sigma^2$ for all i
- Check: Residual plot should have constant spread

#### 4. **Normality**
- Errors are normally distributed: $\epsilon \sim N(0, \sigma^2)$
- Important for: Confidence intervals, hypothesis tests
- Less critical for: Predictions

#### 5. **No Multicollinearity**
- Features are not highly correlated
- Causes unstable coefficient estimates
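
One common way to check for multicollinearity is the variance inflation factor, $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the other features. The helper `vif` below is a hypothetical, minimal implementation written for this notebook, and the data are synthetic. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (features only, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress feature j on an intercept plus all remaining features
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = a + rng.normal(0, 0.1, size=500)    # nearly a copy of `a`
c = rng.normal(size=500)                # unrelated feature
vifs = vif(np.column_stack([a, b, c]))
print("VIF for a, b, c:", np.round(vifs, 1))
```

The two nearly identical features should show large VIF values, while the unrelated feature should stay close to 1.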

In [None]:
# Visualize assumption violations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Good: Assumptions met
x_good = np.linspace(0, 10, 100)
y_good = 2*x_good + np.random.normal(0, 1, 100)
model_good = LinearRegression().fit(x_good.reshape(-1,1), y_good)
residuals_good = y_good - model_good.predict(x_good.reshape(-1,1))

axes[0,0].scatter(x_good, residuals_good, alpha=0.5, s=20)
axes[0,0].axhline(0, color='red', linestyle='--')
axes[0,0].set_title('✅ Good: Random scatter', fontweight='bold')
axes[0,0].set_xlabel('x')  # the scatter plots residuals against x, not fitted values
axes[0,0].set_ylabel('Residuals')

# Bad: Non-linearity
x_curve = np.linspace(0, 10, 100)
y_curve = x_curve**2 + np.random.normal(0, 5, 100)
model_curve = LinearRegression().fit(x_curve.reshape(-1,1), y_curve)
residuals_curve = y_curve - model_curve.predict(x_curve.reshape(-1,1))

axes[0,1].scatter(x_curve, residuals_curve, alpha=0.5, s=20, color='orange')
axes[0,1].axhline(0, color='red', linestyle='--')
axes[0,1].set_title('⚠️ Bad: Curved pattern (non-linear)', fontweight='bold')
axes[0,1].set_xlabel('x')  # the scatter plots residuals against x, not fitted values
axes[0,1].set_ylabel('Residuals')

# Bad: Heteroscedasticity
x_hetero = np.linspace(1, 10, 100)
y_hetero = 2*x_hetero + np.random.normal(0, x_hetero*0.5, 100)
model_hetero = LinearRegression().fit(x_hetero.reshape(-1,1), y_hetero)
residuals_hetero = y_hetero - model_hetero.predict(x_hetero.reshape(-1,1))

axes[1,0].scatter(x_hetero, residuals_hetero, alpha=0.5, s=20, color='red')
axes[1,0].axhline(0, color='red', linestyle='--')
axes[1,0].set_title('⚠️ Bad: Funnel shape (heteroscedasticity)', fontweight='bold')
axes[1,0].set_xlabel('x')  # the scatter plots residuals against x, not fitted values
axes[1,0].set_ylabel('Residuals')

# Normality check with Q-Q plot
stats.probplot(residuals_good, dist="norm", plot=axes[1,1])
axes[1,1].set_title('✅ Q-Q Plot: Checking normality', fontweight='bold')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 2: Logistic Regression Theory

### 2.1: From Regression to Classification

**Problem:** Linear regression outputs can be outside [0,1]
- Can't interpret as probabilities
- Unbounded predictions

**Solution:** Use a transformation!

**The Logistic (Sigmoid) Function:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Properties:**
- Maps any real number to (0, 1)
- Smooth and differentiable
- $\sigma(0) = 0.5$ (decision boundary)
- $\sigma(z) \to 1$ as $z \to \infty$, and $\sigma(z) \to 0$ as $z \to -\infty$

In [None]:
# Visualize sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
y_sigmoid = sigmoid(z)

plt.figure(figsize=(10, 6))
plt.plot(z, y_sigmoid, linewidth=3, color='blue')
plt.axhline(0.5, color='red', linestyle='--', alpha=0.5, label='Decision boundary')
plt.axvline(0, color='red', linestyle='--', alpha=0.5)
plt.axhline(0, color='black', linestyle='-', alpha=0.3, linewidth=0.5)
plt.axhline(1, color='black', linestyle='-', alpha=0.3, linewidth=0.5)
plt.xlabel('z (linear combination)', fontsize=12)
plt.ylabel('σ(z) = P(y=1|x)', fontsize=12)
plt.title('Sigmoid Function: Transforming Linear Output to Probability', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend()
plt.ylim([-0.1, 1.1])
plt.show()

print("💡 Output is now between 0 and 1 - a valid probability!")

### 2.2: The Logistic Regression Model

**Model equation:**
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{x}^T\boldsymbol{\beta}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_p x_p)}}$$

**Decision rule:**
- If $P(y=1|\mathbf{x}) > 0.5$ → predict class 1
- If $P(y=1|\mathbf{x}) \leq 0.5$ → predict class 0

**The linear part (logit/log-odds):**
$$\log\left(\frac{P(y=1|\mathbf{x})}{1-P(y=1|\mathbf{x})}\right) = \mathbf{x}^T\boldsymbol{\beta}$$

The log-odds is linear in the features!
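
The relationship between probability and log-odds can be checked numerically. In the sketch below, `beta_example` is a made-up pair of coefficients (intercept and slope) for a single feature; the code confirms that applying the logit to the sigmoid output recovers the linear predictor, and that the 0.5 threshold sits where the log-odds is zero.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

beta_example = np.array([-1.0, 0.8])        # hypothetical [intercept, slope]
x_values = np.array([-2.0, 0.0, 1.25, 4.0])

z_lin = beta_example[0] + beta_example[1] * x_values   # linear predictor = log-odds
prob = sigmoid_fn(z_lin)                               # P(y=1 | x)

for xv, zv, pv in zip(x_values, z_lin, prob):
    print(f"x={xv:5.2f}  log-odds={zv:6.2f}  P(y=1|x)={pv:.3f}")

# The decision boundary P = 0.5 is where the log-odds equals zero
x_boundary = -beta_example[0] / beta_example[1]
print("Decision boundary at x =", x_boundary)
```

Each unit increase in `x` adds the slope to the log-odds, which is why logistic regression coefficients are usually interpreted on the log-odds scale.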

### 2.3: Maximum Likelihood Estimation

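**Problem:** MSE is a poor fit for classification: combined with the sigmoid it gives a non-convex loss, and it does not correspond to a probabilistic model of binary labels.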

**Solution:** Maximum Likelihood Estimation (MLE)

**Likelihood function** (probability of observing our data):
$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i, \boldsymbol{\beta})$$

**For binary classification:**
$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} [h(\mathbf{x}_i)]^{y_i}[1-h(\mathbf{x}_i)]^{1-y_i}$$

where $h(\mathbf{x}_i) = P(y=1|\mathbf{x}_i)$

**Log-Likelihood** (easier to work with):
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} [y_i \log h(\mathbf{x}_i) + (1-y_i) \log(1-h(\mathbf{x}_i))]$$

**Loss function** (Cross-Entropy):
$$J(\boldsymbol{\beta}) = -\frac{1}{n} \ell(\boldsymbol{\beta})$$

**Goal:** Maximize likelihood = Minimize cross-entropy loss
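
A tiny worked example may help connect the three formulas. The labels and predicted probabilities below are invented for illustration; the code computes the likelihood, the log-likelihood, and the cross-entropy by hand, then compares the result with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
h = np.array([0.9, 0.2, 0.6, 0.4, 0.1])    # hypothetical predicted P(y=1|x)

# Likelihood: product of the probability assigned to each observed label
likelihood = np.prod(h ** y_true * (1 - h) ** (1 - y_true))

# Log-likelihood: the same quantity as a sum of logs
log_likelihood = np.sum(y_true * np.log(h) + (1 - y_true) * np.log(1 - h))

# Cross-entropy: negative mean log-likelihood
cross_entropy = -log_likelihood / len(y_true)

print(f"Likelihood:       {likelihood:.5f}")
print(f"Log-likelihood:   {log_likelihood:.4f}")
print(f"Cross-entropy:    {cross_entropy:.4f}")
print(f"sklearn log_loss: {log_loss(y_true, h):.4f}")
```

Because the logarithm is monotonic, the coefficients that maximize the likelihood are the same ones that minimize the cross-entropy.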

In [None]:
# Visualize cross-entropy loss
p_pred = np.linspace(0.01, 0.99, 100)

# Loss when true label is 1
loss_y1 = -np.log(p_pred)
# Loss when true label is 0
loss_y0 = -np.log(1 - p_pred)

plt.figure(figsize=(10, 6))
plt.plot(p_pred, loss_y1, label='True label = 1', linewidth=2)
plt.plot(p_pred, loss_y0, label='True label = 0', linewidth=2)
plt.xlabel('Predicted Probability P(y=1)', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Cross-Entropy Loss Function', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim([0, 5])
plt.show()

print("💡 Confident wrong predictions are heavily penalized!")

### 2.4: Optimization - Gradient Descent

**No closed-form solution** for logistic regression!

**Gradient Descent Algorithm:**
1. Start with random $\boldsymbol{\beta}$
2. Compute gradient: $\nabla J(\boldsymbol{\beta})$
3. Update: $\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \alpha \nabla J(\boldsymbol{\beta})$
4. Repeat until convergence

**Gradient for logistic regression:**
$$\nabla J(\boldsymbol{\beta}) = \frac{1}{n}\mathbf{X}^T(h(\mathbf{X}) - \mathbf{y})$$

Where:
- $\alpha$ = learning rate
- $h(\mathbf{X})$ = predicted probabilities
- $\mathbf{y}$ = true labels
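
The algorithm above fits in a few lines of NumPy. This is a minimal sketch rather than a production optimizer: `fit_logistic_gd` is a hypothetical helper, the data are synthetic, the learning rate and iteration count are untuned guesses, and the coefficients start at zero instead of a random point so the run is reproducible.

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.5, n_iter=2000):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    beta = np.zeros(d)
    losses = []
    for _ in range(n_iter):
        h = sigmoid_fn(X @ beta)               # predicted probabilities
        # Record the loss at the current beta (clipped to avoid log(0))
        h_safe = np.clip(h, 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(h_safe) + (1 - y) * np.log(1 - h_safe)))
        grad = X.T @ (h - y) / n               # gradient of the cross-entropy
        beta = beta - lr * grad                # update step
    return beta, np.array(losses)

rng = np.random.default_rng(3)
n = 400
x_feat = rng.normal(size=n)
true_beta = np.array([-0.5, 2.0])              # hypothetical ground truth
X_gd = np.column_stack([np.ones(n), x_feat])   # column of ones for the intercept
y_gd = (rng.uniform(size=n) < sigmoid_fn(X_gd @ true_beta)).astype(float)

beta_hat, losses = fit_logistic_gd(X_gd, y_gd)
print("Estimated coefficients:", np.round(beta_hat, 3))
print(f"Loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The first recorded loss equals $\log 2$ because every prediction is 0.5 when all coefficients are zero. Library solvers typically use faster second-order or quasi-Newton methods, but they minimize the same loss.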

In [None]:
# Visualize gradient descent
def loss_surface(b0, b1):
    """Simplified 2D loss surface"""
    return (b0-2)**2 + (b1+1)**2

# Create grid
b0 = np.linspace(-1, 5, 100)
b1 = np.linspace(-4, 2, 100)
B0, B1 = np.meshgrid(b0, b1)
Z = loss_surface(B0, B1)

# Illustrative descent path (hand-picked points, not computed by an optimizer)
path_b0 = [0, 0.5, 1.0, 1.5, 1.8, 1.95, 2.0]
path_b1 = [-3, -2.5, -2.0, -1.5, -1.2, -1.05, -1.0]

plt.figure(figsize=(10, 8))
contours = plt.contour(B0, B1, Z, levels=20, cmap='viridis')
plt.plot(path_b0, path_b1, 'r.-', linewidth=2, markersize=10, label='Gradient descent path')
plt.scatter([2], [-1], s=200, c='red', marker='*', label='Minimum', zorder=5)
plt.xlabel('β₀', fontsize=12)
plt.ylabel('β₁', fontsize=12)
plt.title('Gradient Descent: Finding the Minimum', fontsize=14, fontweight='bold')
plt.legend()
# Pass the contour set explicitly so the colorbar reflects the loss levels
plt.colorbar(contours, label='Loss')
plt.grid(True, alpha=0.3)
plt.show()

---

## Part 3: Regularization Theory

### 3.1: The Overfitting Problem

**Issue:** Complex models can fit training data *too well*
- High variance
- Poor generalization
- Sensitive to noise

**Solution:** Add a penalty for model complexity!

**Regularized loss:**
$$J(\boldsymbol{\beta}) = \text{Loss}(\mathbf{y}, \hat{\mathbf{y}}) + \lambda \cdot \text{Penalty}(\boldsymbol{\beta})$$

Where:
- $\lambda$ = regularization strength (hyperparameter)
- Larger $\lambda$ → simpler model

### 3.2: Ridge Regression (L2 Regularization)

**Loss function:**
$$J(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

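**Matrix form:**
$$J(\boldsymbol{\beta}) = \frac{1}{n}||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 + \lambda||\boldsymbol{\beta}||^2$$

**Closed-form solution** (set $\nabla J = 0$; assumes centered data so the intercept is not penalized):
$$\boldsymbol{\beta}_{ridge} = (\mathbf{X}^T\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

Many references drop the $\frac{1}{n}$ factor from the loss, in which case the solution reads $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$; the two forms differ only in how $\lambda$ is scaled.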

**Properties:**
- Shrinks coefficients toward zero
- Coefficients almost never reach exactly zero
- All features kept in model
- Handles multicollinearity well
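
The closed-form solution can be checked against scikit-learn. The sketch below assumes the unnormalized objective $||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 + \alpha||\boldsymbol{\beta}||^2$, which is the one `sklearn.linear_model.Ridge` minimizes, and turns off the intercept so both computations solve the same problem. The data, `beta_true`, and `alpha` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n, p = 100, 5
X_r = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])    # hypothetical coefficients
y_r = X_r @ beta_true + rng.normal(0, 0.5, size=n)

alpha = 10.0                                         # illustrative strength, not tuned

# Closed form for ||y - X b||^2 + alpha * ||b||^2 (no intercept)
beta_closed = np.linalg.solve(X_r.T @ X_r + alpha * np.eye(p), X_r.T @ y_r)

# The same problem solved by scikit-learn
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X_r, y_r)

# Ordinary least squares for comparison
beta_ols = np.linalg.lstsq(X_r, y_r, rcond=None)[0]

print("Closed form:", np.round(beta_closed, 4))
print("sklearn:    ", np.round(ridge.coef_, 4))
print("OLS:        ", np.round(beta_ols, 4))
```

The ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage described in the properties list.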

### 3.3: Lasso Regression (L1 Regularization)

**Loss function:**
$$J(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

**Properties:**
- Can set coefficients exactly to zero
- Performs **feature selection**
- Creates sparse models
- No closed-form solution (needs optimization)

**When to use:**
- Many features, few important
- Want interpretable model
- Need automatic feature selection
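
To see the sparsity in practice, the sketch below fits Lasso and Ridge to synthetic data in which only the first three of twenty features matter (an assumption built into the simulation). The `alpha` values are illustrative and untuned. Note that scikit-learn's `alpha` plays the role of $\lambda$, up to a constant factor in how the squared-error term is scaled.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
n, p = 200, 20
X_l = rng.normal(size=(n, p))
beta_sparse = np.zeros(p)
beta_sparse[:3] = [3.0, -2.0, 1.5]         # only the first 3 features carry signal
y_l = X_l @ beta_sparse + rng.normal(0, 0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X_l, y_l)
ridge_l = Ridge(alpha=1.0).fit(X_l, y_l)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_l.coef_ == 0))

print(f"Lasso: {n_zero_lasso} of {p} coefficients are exactly zero")
print(f"Ridge: {n_zero_ridge} of {p} coefficients are exactly zero")
print("Lasso coefficients on the true features:", np.round(lasso.coef_[:3], 3))
```

The surviving Lasso coefficients are shrunk slightly toward zero compared with the true values, which is the price paid for the selection effect.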

### 3.4: ElasticNet (L1 + L2)

**Combines both penalties:**
$$J(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^2$$

**Alternative form:**
$$J(\boldsymbol{\beta}) = \text{MSE} + \lambda[\alpha||\boldsymbol{\beta}||_1 + (1-\alpha)||\boldsymbol{\beta}||^2]$$

Where:
- $\alpha \in [0,1]$ controls L1 vs L2 mix
- $\alpha=0$ → Pure Ridge
- $\alpha=1$ → Pure Lasso

**Benefits:**
- Feature selection (like Lasso)
- Handles correlated features (like Ridge)
- More stable than Lasso
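
In scikit-learn the two knobs map onto `ElasticNet(alpha=..., l1_ratio=...)`: `alpha` corresponds to the overall strength $\lambda$ and `l1_ratio` to the mixing weight $\alpha$ above, again up to constant scaling factors in the library's objective. The sketch below uses synthetic data with two nearly identical features; all values are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(6)
n = 200
base = rng.normal(size=n)
# Two nearly identical features plus three independent ones (synthetic data)
X_e = np.column_stack([
    base + rng.normal(0, 0.05, size=n),
    base + rng.normal(0, 0.05, size=n),
    rng.normal(size=(n, 3)),
])
y_e = 2.0 * base + rng.normal(0, 0.5, size=n)

lasso_e = Lasso(alpha=0.1).fit(X_e, y_e)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_e, y_e)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_e, y_e)

print("Lasso:                 ", np.round(lasso_e.coef_, 3))
print("ElasticNet (0.5 mix):  ", np.round(enet.coef_, 3))
print("ElasticNet (l1_ratio=1):", np.round(enet_as_lasso.coef_, 3))
```

With `l1_ratio=1.0` the model reduces to Lasso. With a mixed penalty, the weight on the correlated pair tends to be shared more evenly, although the exact split depends on the data and the penalty strength.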

In [None]:
# Visualize regularization penalties
beta_range = np.linspace(-3, 3, 100)
l1_penalty = np.abs(beta_range)
l2_penalty = beta_range**2

plt.figure(figsize=(10, 6))
plt.plot(beta_range, l1_penalty, label='L1: |β|', linewidth=2)
plt.plot(beta_range, l2_penalty, label='L2: β²', linewidth=2)
plt.axvline(0, color='black', linestyle='--', alpha=0.3)
plt.axhline(0, color='black', linestyle='--', alpha=0.3)
plt.xlabel('β (coefficient value)', fontsize=12)
plt.ylabel('Penalty', fontsize=12)
plt.title('L1 vs L2 Penalty Functions', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print("💡 L1 has a constant slope, so it can push coefficients exactly to zero; L2's slope vanishes near zero, so coefficients shrink but rarely reach it")

In [None]:
# Geometric intuition: Constraint regions
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# L1 constraint (diamond)
theta = np.linspace(0, 2*np.pi, 100)
l1_x = np.cos(theta)
l1_y = np.sin(theta)
l1_diamond_x = np.sign(l1_x) * np.abs(l1_x)
l1_diamond_y = np.sign(l1_y) * (1 - np.abs(l1_x))

# Plot L1
axes[0].plot(l1_diamond_x, l1_diamond_y, 'b-', linewidth=2)
axes[0].fill(l1_diamond_x, l1_diamond_y, alpha=0.2, color='blue')
axes[0].scatter([0], [1], s=200, c='red', marker='*', zorder=5, label='Optimal (β₂≠0, β₁=0)')
axes[0].set_xlabel('β₁', fontsize=12)
axes[0].set_ylabel('β₂', fontsize=12)
axes[0].set_title('L1 (Lasso): Diamond-shaped constraint\nHits axes → Sparse solution', fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].legend()
axes[0].set_xlim([-1.5, 1.5])
axes[0].set_ylim([-1.5, 1.5])

# L2 constraint (circle)
circle_x = np.cos(theta)
circle_y = np.sin(theta)

axes[1].plot(circle_x, circle_y, 'g-', linewidth=2)
axes[1].fill(circle_x, circle_y, alpha=0.2, color='green')
axes[1].scatter([0.7], [0.7], s=200, c='red', marker='*', zorder=5, label='Optimal (β₂≠0, β₁≠0)')
axes[1].set_xlabel('β₁', fontsize=12)
axes[1].set_ylabel('β₂', fontsize=12)
axes[1].set_title('L2 (Ridge): Circular constraint\nRarely hits axes → Dense solution', fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].legend()
axes[1].set_xlim([-1.5, 1.5])
axes[1].set_ylim([-1.5, 1.5])

plt.tight_layout()
plt.show()

### 3.5: Comparison Table

| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|----------|-----------|-----------|------------|
| **Penalty** | $\lambda\sum\beta_j^2$ | $\lambda\sum\lvert\beta_j\rvert$ | Both |
| **Feature Selection** | ❌ No | ✅ Yes | ✅ Yes |
| **Coefficient Values** | Shrunk toward 0 | Some exactly 0 | Mix |
| **Multicollinearity** | ✅ Handles well | ⚠️ Picks one | ✅ Handles well |
| **Computational Cost** | Fast (closed form) | Slower (iterative) | Slower |
| **Best When** | Many correlated features | Many irrelevant features | Combination needed |

---

## Part 4: Bias-Variance Tradeoff

### 4.1: Core Concept

**Total Error Decomposition:**
$$\text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

**Bias:**
- Error from wrong assumptions
- Model too simple (underfitting)
- High bias ‚Üí systematic errors

**Variance:**
- Error from sensitivity to training data
- Model too complex (overfitting)
- High variance ‚Üí inconsistent predictions

**Irreducible Error:**
- Noise in the data
- Cannot be reduced by better models
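
The decomposition can be estimated by simulation. The sketch below assumes a known "true" function (a sine wave), refits polynomials of several degrees on many freshly sampled noisy datasets, and measures how far the average prediction is from the truth (bias²) and how much predictions vary between datasets (variance). The function, noise level, degrees, and repeat count are all arbitrary choices for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(7)
x_grid = np.linspace(0, 1, 30)
f_true = np.sin(2 * np.pi * x_grid)       # assumed true function for the simulation
sigma = 0.3                               # noise std; sigma**2 is the irreducible error
n_repeats = 300

def bias_variance(degree):
    """Estimate average bias^2 and variance of a polynomial fit over repeated noisy samples."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        y_sample = f_true + rng.normal(0, sigma, size=x_grid.size)
        preds[r] = Polynomial.fit(x_grid, y_sample, degree)(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)   # systematic error of the average fit
    variance = np.mean(preds.var(axis=0))          # spread of fits across datasets
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print(f"degree {d}: bias^2={b2:.4f}  variance={var:.4f}  noise={sigma**2:.4f}")
```

The straight line (degree 1) should show the largest bias² and the smallest variance, and the degree-9 polynomial the reverse.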

In [None]:
# Visualize bias-variance tradeoff
complexity = np.linspace(0, 10, 100)
bias = 5 / (1 + complexity)
variance = complexity / 2
total_error = bias + variance + 0.5  # +0.5 is irreducible error

plt.figure(figsize=(10, 6))
plt.plot(complexity, bias, label='Bias²', linewidth=2)
plt.plot(complexity, variance, label='Variance', linewidth=2)
plt.plot(complexity, total_error, label='Total Error', linewidth=3, linestyle='--', color='red')
plt.axhline(0.5, color='gray', linestyle=':', alpha=0.5, label='Irreducible Error')

# Mark optimal point
optimal_idx = np.argmin(total_error)
plt.scatter([complexity[optimal_idx]], [total_error[optimal_idx]], 
           s=200, c='red', marker='*', zorder=5, label='Optimal Complexity')

plt.xlabel('Model Complexity', fontsize=12)
plt.ylabel('Error', fontsize=12)
plt.title('Bias-Variance Tradeoff', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.show()

### 4.2: Regularization & Bias-Variance

**Effect of regularization parameter $\lambda$:**

**Low $\lambda$ (weak regularization):**
- Low bias, high variance
- Complex model
- Risk of overfitting

**High $\lambda$ (strong regularization):**
- High bias, low variance
- Simple model
- Risk of underfitting

**Sweet spot:**
- Balance bias and variance
- Found via cross-validation
- Minimizes test error
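
Finding the sweet spot with cross-validation might look like the sketch below. It scores Ridge models over a grid of regularization strengths using 5-fold cross-validation and picks the value with the lowest average validation MSE. The synthetic data, the grid, and the fold count are assumptions chosen for illustration; scikit-learn calls the regularization strength `alpha`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
n, p = 60, 30                                   # few samples relative to features
X_cv = rng.normal(size=(n, p))
beta_cv = np.zeros(p)
beta_cv[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]       # hypothetical true coefficients
y_cv = X_cv @ beta_cv + rng.normal(0, 1.0, size=n)

alphas = np.logspace(-3, 3, 13)
cv_mse = []
for a in alphas:
    # scikit-learn returns negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(Ridge(alpha=a), X_cv, y_cv, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())
cv_mse = np.array(cv_mse)

best_alpha = alphas[np.argmin(cv_mse)]
for a, m in zip(alphas, cv_mse):
    print(f"alpha={a:10.3f}  CV MSE={m:.3f}")
print("Selected alpha:", best_alpha)
```

`sklearn.linear_model.RidgeCV` wraps the same idea in a single estimator, which is usually more convenient in practice.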

In [None]:
# Visualize regularization effect
lambda_range = np.logspace(-3, 3, 100)
log_lam = np.log10(lambda_range)
# Synthetic, illustrative curves (not fitted to data):
# training error rises with λ, while the train/test gap shrinks as variance falls
train_error = 0.1 + 0.3 / (1 + np.exp(-1.5 * log_lam))
test_error = train_error + 0.05 + 0.25 / (1 + np.exp(2 * (log_lam + 1)))

plt.figure(figsize=(10, 6))
plt.semilogx(lambda_range, train_error, label='Training Error', linewidth=2)
plt.semilogx(lambda_range, test_error, label='Test Error', linewidth=2)

# Mark optimal lambda
optimal_idx = np.argmin(test_error)
plt.scatter([lambda_range[optimal_idx]], [test_error[optimal_idx]],
           s=200, c='red', marker='*', zorder=5, label='Optimal λ')

# Annotate regions
plt.text(0.01, 0.52, 'Overfitting\n(High Variance)', ha='center', fontsize=10,
        bbox=dict(boxstyle='round', facecolor='red', alpha=0.2))
plt.text(100, 0.52, 'Underfitting\n(High Bias)', ha='center', fontsize=10,
        bbox=dict(boxstyle='round', facecolor='blue', alpha=0.2))
plt.ylim(0, 0.6)

plt.xlabel('Regularization Parameter (λ)', fontsize=12)
plt.ylabel('Error', fontsize=12)
plt.title('Effect of Regularization on Error', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3, which='both')
plt.show()

---

## Summary: Key Theoretical Insights

### Linear Regression
✅ **Goal:** Minimize MSE using Normal Equation or Gradient Descent  
✅ **Assumptions:** Linearity, Independence, Homoscedasticity, Normality, No Multicollinearity  
✅ **Limitation:** Only captures linear relationships

### Logistic Regression
✅ **Goal:** Maximize likelihood (minimize cross-entropy)  
✅ **Key:** Sigmoid transforms linear output to probability  
✅ **Optimization:** Gradient descent (no closed-form solution)  
✅ **Limitation:** Linear decision boundary

### Regularization
✅ **Ridge (L2):** Shrinks all coefficients, keeps all features  
✅ **Lasso (L1):** Sets some coefficients to zero, feature selection  
✅ **ElasticNet:** Combines Lasso's sparsity with Ridge's stability on correlated features  
✅ **Purpose:** Control model complexity, reduce overfitting

### Bias-Variance Tradeoff
✅ **Bias:** Error from simplifying assumptions  
✅ **Variance:** Error from sensitivity to training data  
✅ **Goal:** Find optimal complexity via cross-validation  
✅ **Tool:** Regularization helps balance the tradeoff

---

### Next Steps
1. ✅ Practice with real datasets
2. ✅ Experiment with regularization parameters
3. ✅ Check model assumptions
4. ✅ Use cross-validation for model selection
5. ✅ Visualize decision boundaries (for classification)

**Remember:** Theory guides practice. Understanding *why* helps you know *when* to apply each technique!