# Linear Regression & Normal Equation - Interview Q&A

This notebook contains interview-level questions and answers about Linear Regression and the Normal Equation method.

## Basic Concepts

### Q1: What is Linear Regression? When would you use it?

**Answer:**

Linear Regression is a supervised learning algorithm used to predict a continuous target variable based on one or more independent variables by fitting a linear equation to the observed data.

**Mathematical form:**
- Simple: $y = \beta_0 + \beta_1x + \varepsilon$
- Multiple: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \varepsilon$

**Use cases:**
- Predicting house prices based on features
- Forecasting sales based on advertising spend
- Estimating continuous outcomes (temperature, stock prices, etc.)
- When relationship between variables is approximately linear

**Key requirement:** The target variable must be continuous (not categorical).

### Q2: What are the four key assumptions of Linear Regression?

**Answer:**

1. **Linearity**: The relationship between independent and dependent variables is linear
   - Check: Scatter plots, residual plots

2. **Independence**: Observations are independent of each other
   - Violation example: Time series data with autocorrelation
   - Check: Durbin-Watson test

3. **Homoscedasticity**: Constant variance of residuals across all levels of independent variables
   - Violation: Residuals increase/decrease with fitted values (heteroscedasticity)
   - Check: Residual plot (should show random scatter)

4. **Normality**: Residuals are normally distributed
   - Check: Q-Q plot, histogram of residuals, Shapiro-Wilk test

**Consequences of violations:**
- Biased or inefficient parameter estimates
- Invalid confidence intervals and p-values
- Poor prediction performance

### Q3: Explain the difference between Simple and Multiple Linear Regression.

**Answer:**

**Simple Linear Regression:**
- One independent variable
- Form: $y = \beta_0 + \beta_1x$
- Example: Predicting salary based only on years of experience
- Visualization: 2D line

**Multiple Linear Regression:**
- Two or more independent variables
- Form: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$
- Example: Predicting salary based on experience, education, location
- Visualization: Hyperplane in (n+1)-dimensional space (n features plus the target)

**Key differences:**
- Multiple regression can capture more complex relationships
- Multiple regression is prone to multicollinearity
- Interpretation: In multiple regression, each coefficient represents the effect of that variable while holding others constant

## Normal Equation

### Q4: What is the Normal Equation? Derive or explain the formula.

**Answer:**

The Normal Equation is a closed-form analytical solution to find optimal parameters for linear regression without iteration.

**Formula:**
$$\theta = (X^TX)^{-1}X^Ty$$

**Derivation (high-level):**

1. Start with cost function (MSE):
   $$J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y)$$

2. To minimize, take derivative with respect to $\theta$ and set to zero:
   $$\frac{\partial J}{\partial \theta} = 0$$

3. Expand and solve:
   $$\frac{1}{m}X^T(X\theta - y) = 0$$
   $$X^TX\theta = X^Ty$$
   $$\theta = (X^TX)^{-1}X^Ty$$

**Key insight:** This gives the exact optimal parameters in one computation (no iterations needed).
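
The derivation above can be checked numerically. The following is a minimal sketch on synthetic data (the seed, sample size, and `true_theta` values are illustrative assumptions): it solves $X^TX\theta = X^Ty$, confirms that the gradient $\frac{1}{m}X^T(X\theta - y)$ vanishes at the solution, and cross-checks against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, illustrative data only
m = 200
X = rng.normal(size=(m, 2))                # two synthetic features
true_theta = np.array([4.0, 3.0, -2.0])    # assumed intercept and slopes
X_b = np.column_stack([np.ones(m), X])     # prepend the intercept column
y = X_b @ true_theta + rng.normal(scale=0.5, size=m)

# Solve X^T X theta = X^T y directly instead of forming the inverse
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# The gradient (1/m) X^T (X theta - y) should be ~0 at the solution
gradient = X_b.T @ (X_b @ theta - y) / m

# Cross-check against NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("theta:", theta)
print("max |gradient|:", np.abs(gradient).max())
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical preference; the result is the same $\theta$ that the formula describes.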

### Q5: What are the advantages and disadvantages of the Normal Equation?

**Answer:**

**Advantages:**
- ✅ **No hyperparameters** - No learning rate to tune
- ✅ **Exact solution** - Finds global optimum in one step
- ✅ **No iterations** - Single computation
- ✅ **No feature scaling needed** - Works with unnormalized data
- ✅ **Guaranteed convergence** - Always finds solution (if invertible)

**Disadvantages:**
- ❌ **Computational complexity** - O(n³) for the matrix inversion, plus O(mn²) to form $X^TX$
- ❌ **Slow for large n** - Impractical when features > 10,000
- ❌ **Memory intensive** - Must compute and store $X^TX$
- ❌ **Matrix inversion required** - Fails if $X^TX$ is singular (non-invertible)
- ❌ **Not suitable for online learning** - Requires all data at once

**Rule of thumb:** Use Normal Equation when n < 10,000 features; use Gradient Descent otherwise.

### Q6: When is $(X^TX)$ not invertible? What can you do about it?

**Answer:**

**When $X^TX$ is singular (non-invertible):**

1. **Redundant features**: Linearly dependent columns
   - Example: Feature 1 = Temperature in Celsius, Feature 2 = Temperature in Fahrenheit

2. **More features than samples**: $n > m$
   - System is underdetermined

3. **Perfect multicollinearity**: One feature is an exact linear combination of several others (the general form of case 1)

**Solutions:**

1. **Remove redundant features**
   - Identify and eliminate linearly dependent features
   - Use correlation matrix or VIF (Variance Inflation Factor)

2. **Use pseudo-inverse (Moore-Penrose)**
   ```python
   theta = np.linalg.pinv(X) @ y
   ```
   - Works even when $X^TX$ is singular
   - More numerically stable

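3. **Regularization** (Ridge/Lasso)
   - Ridge adds $\lambda I$ to $X^TX$, which makes the matrix invertible for any $\lambda > 0$
   - Ridge: $\theta = (X^TX + \lambda I)^{-1}X^Ty$
   - Lasso has no closed form but can still be fitted iteratively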

4. **Use Gradient Descent** instead
   - Doesn't require matrix inversion
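
As a small illustration of the Celsius/Fahrenheit case, the sketch below builds a design matrix with an exactly redundant column and then applies two of the remedies listed above. The temperature values, the target `y`, and `lam` are made-up assumptions for demonstration.

```python
import numpy as np

celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahrenheit = celsius * 9 / 5 + 32            # exact linear function of celsius
X_b = np.column_stack([np.ones_like(celsius), celsius, fahrenheit])
y = 2.0 * celsius + 1.0                      # illustrative target

gram = X_b.T @ X_b
# A huge (or infinite) condition number signals a numerically singular matrix
print("cond(X^T X):", np.linalg.cond(gram))

# Remedy 1: pseudo-inverse gives the minimum-norm least-squares solution
theta_pinv = np.linalg.pinv(X_b) @ y

# Remedy 2: ridge adds lam * I, which is invertible for lam > 0
lam = 1e-3
theta_ridge = np.linalg.solve(gram + lam * np.eye(3), X_b.T @ y)

print("pinv predictions: ", X_b @ theta_pinv)
print("ridge predictions:", X_b @ theta_ridge)
```

The individual coefficients are not unique in this situation (many combinations of the two temperature columns give the same fit), but the predictions are well defined.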

### Q7: Compare Normal Equation vs Gradient Descent. When would you use each?

**Answer:**

| Aspect | Normal Equation | Gradient Descent |
|--------|----------------|------------------|
| **Algorithm type** | Analytical/Direct | Iterative |
| **Iterations** | None (1 computation) | Many (100s-1000s) |
| **Learning rate** | Not needed | Required |
| **Feature scaling** | Not needed | Highly recommended |
| **Time complexity** | O(mn² + n³) | O(kmn), k=iterations |
| **Works for large n** | No (n > 10,000) | Yes |
| **Works for large m** | Yes | Yes |
| **Singularity issue** | Yes (if $X^TX$ singular) | No |
| **Online learning** | No | Yes |
| **Optimization** | Global optimum | Global optimum* |

*Linear regression cost function is convex, so GD converges to the global optimum given a suitable learning rate

**Use Normal Equation when:**
- Small to medium datasets (m < 100,000)
- Few features (n < 10,000)
- Need exact solution quickly
- Don't want to tune hyperparameters

**Use Gradient Descent when:**
- Large datasets (m > 100,000)
- Many features (n > 10,000)
- Online learning required
- Memory constrained
- Want to generalize to other algorithms (logistic regression, neural networks)
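
To make the comparison concrete, here is a minimal sketch that fits the same synthetic problem both ways. The learning rate `eta` and iteration count `n_iters` are assumed, untuned values; the features are drawn on a common scale so that plain batch gradient descent converges without extra preprocessing.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 500
X = rng.normal(size=(m, 3))                       # features already on a similar scale
X_b = np.column_stack([np.ones(m), X])
y = X_b @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=m)

# Normal Equation: one linear solve, no hyperparameters
theta_ne = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Batch Gradient Descent: needs a learning rate and an iteration count
eta, n_iters = 0.1, 1000                          # assumed values, not tuned
theta_gd = np.zeros(X_b.shape[1])
for _ in range(n_iters):
    gradient = X_b.T @ (X_b @ theta_gd - y) / m   # gradient of (1/2m)||X theta - y||^2
    theta_gd -= eta * gradient

print("Normal Equation: ", theta_ne)
print("Gradient Descent:", theta_gd)
```

Both approaches minimize the same convex cost, so they should agree up to the convergence tolerance of the iterative method.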

## Model Evaluation & Interpretation

### Q8: What is R² score? How do you interpret it?

**Answer:**

**R² (Coefficient of Determination)** measures the proportion of variance in the dependent variable explained by the independent variables.

**Formula:**
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Where:
- $SS_{res}$ = Residual sum of squares (model error)
- $SS_{tot}$ = Total sum of squares (variance in data)

**Interpretation:**
- **Range**: 0 to 1 (can be negative for very poor models)
- **R² = 0.85**: Model explains 85% of variance in target
- **R² = 1.0**: Perfect fit (likely overfitting)
- **R² = 0.0**: Model no better than predicting mean
- **R² < 0**: Model worse than predicting mean

**Limitations:**
- Never decreases on the training data when features are added (even irrelevant ones)
- Use **Adjusted R²** for multiple regression
- High R² doesn't mean causation
- Can be misleading with non-linear relationships
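
The sketch below illustrates the first two limitations on synthetic data: adding ten pure-noise features cannot lower the training R², while Adjusted R² applies a penalty for the extra parameters. The helper names (`r2`, `adjusted_r2`, `fit_predict`) and the data are illustrative assumptions; the adjustment used is the standard $1 - (1 - R^2)\frac{m-1}{m-p-1}$ with $p$ features.

```python
import numpy as np

def r2(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2_value, m, p):
    """m = number of samples, p = number of features (intercept excluded)."""
    return 1 - (1 - r2_value) * (m - 1) / (m - p - 1)

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns training predictions."""
    X_b = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return X_b @ theta

rng = np.random.default_rng(1)
m = 50
x = rng.normal(size=(m, 1))
y = 3.0 * x[:, 0] + rng.normal(size=m)
noise = rng.normal(size=(m, 10))                  # 10 features unrelated to y

r2_small = r2(y, fit_predict(x, y))
r2_big = r2(y, fit_predict(np.hstack([x, noise]), y))
print(f" 1 feature : R2={r2_small:.4f}  adj={adjusted_r2(r2_small, m, 1):.4f}")
print(f"11 features: R2={r2_big:.4f}  adj={adjusted_r2(r2_big, m, 11):.4f}")
```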

### Q9: What is multicollinearity? How do you detect and handle it?

**Answer:**

**Multicollinearity**: High correlation between two or more independent variables.

**Problems it causes:**
- Unstable coefficient estimates (small data changes → large coefficient changes)
- Difficult to interpret individual feature importance
- Inflated standard errors
- May lead to singular $X^TX$ matrix

**Detection methods:**

1. **Correlation Matrix**
   - Check pairwise correlations
   - |correlation| > 0.8-0.9 indicates problem

2. **Variance Inflation Factor (VIF)**
   $$VIF_j = \frac{1}{1 - R_j^2}$$
   - $R_j^2$ = R² from regressing feature j on all other features
   - **VIF > 5-10**: Problematic multicollinearity
   - **VIF = 1**: No multicollinearity

**Solutions:**
1. **Remove correlated features** - Drop one of the correlated variables
2. **Combine features** - Create interaction term or ratio
3. **Principal Component Analysis (PCA)** - Transform to uncorrelated components
4. **Regularization** - Ridge regression (L2) handles multicollinearity well
5. **Collect more data** - Sometimes helps reduce correlation
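
The VIF formula above can be computed with plain NumPy by regressing each feature on the others. This is a minimal sketch; the `vif` helper and the synthetic features (`x2` built as a noisy copy of `x1`) are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (features only, no intercept column)."""
    m, n = X.shape
    out = np.empty(n)
    for j in range(n):
        target = X[:, j]
        # Regress feature j on all remaining features plus an intercept
        others = np.column_stack([np.ones(m), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2_j = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2_j)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1
x3 = rng.normal(size=300)                   # unrelated feature
vifs = vif(np.column_stack([x1, x2, x3]))
print("VIFs:", vifs)
```

In this setup the first two features should show very large VIFs while the third stays close to 1.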

### Q10: What is the difference between MSE, RMSE, and MAE? When would you use each?

**Answer:**

**Mean Squared Error (MSE):**
$$MSE = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

- Squares errors (penalizes large errors more)
- Units: squared units of target variable
- Not interpretable in original units
- Sensitive to outliers

**Root Mean Squared Error (RMSE):**
$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$

- Square root of MSE
- **Units: same as target variable** (interpretable!)
- Still sensitive to outliers
- Most commonly used

**Mean Absolute Error (MAE):**
$$MAE = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$$

- Average absolute error
- Units: same as target variable
- **Less sensitive to outliers**
- All errors weighted equally

**When to use:**
- **RMSE**: General purpose, when large errors should be penalized more
- **MAE**: When outliers shouldn't dominate, need robust metric
- **MSE**: Theoretical work, optimization (easier to differentiate)

**Example:**
- Errors: [1, 1, 1, 10]
- MAE = 3.25
- RMSE = √25.75 ≈ 5.07 (heavily influenced by the 10)
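
The three metrics can be computed directly from the error vector in the example. The short sketch below also recomputes them without the single large error to show how strongly it drives each metric.

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])   # absolute errors from the example

mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
print(f"with outlier:    MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")

# Drop the large error to see how much each metric depended on it
trimmed = errors[:-1]
mae_t = np.mean(np.abs(trimmed))
rmse_t = np.sqrt(np.mean(trimmed ** 2))
print(f"without outlier: MAE={mae_t:.2f}  RMSE={rmse_t:.2f}")
```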

## Advanced Concepts

### Q11: What is the difference between population regression line and sample regression line?

**Answer:**

**Population Regression Line:**
- True relationship in the entire population
- Unknown in practice (we don't have access to entire population)
- Represented by true parameters: $\beta_0, \beta_1, ..., \beta_n$
- Example: True relationship between all people's education and income

**Sample Regression Line:**
- Estimated from a sample of data
- What we actually compute in practice
- Represented by estimated parameters: $\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_n$ (or $\theta$)
- **Goal**: Estimate the population line as accurately as possible
- Different samples → slightly different sample regression lines

**Key relationship:**
- Sample line is an **estimator** of population line
- As sample size increases, sample line approaches population line
- Standard errors quantify uncertainty in our estimates

### Q12: How would you handle outliers in Linear Regression?

**Answer:**

**Detection:**
1. **Residual analysis** - Large residuals indicate outliers
2. **Cook's distance** - Measures influence of each point
3. **Leverage** - Points far from mean of X
4. **Visualization** - Scatter plots, box plots

**Handling strategies:**

1. **Remove outliers** (if justified)
   - Data entry errors
   - Measurement errors
   - Document removal decision

2. **Transform variables**
   - Log transformation reduces impact
   - Box-Cox transformation

3. **Use robust regression**
   - RANSAC (Random Sample Consensus)
   - Huber regression
   - Minimize MAE instead of MSE

4. **Cap/Winsorize**
   - Replace extreme values with percentile values
   - Example: Cap at 95th percentile

5. **Add indicator variable**
   - Binary variable indicating outlier status
   - Allows model to treat differently

**Important**: Don't automatically remove outliers - they might contain valuable information!
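
Leverage and Cook's distance from the detection list can be computed with a few lines of NumPy. The sketch below assumes a one-feature model with a single injected outlier; it uses the standard definitions $h_{ii} = [X(X^TX)^{-1}X^T]_{ii}$ and $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}$, where $p$ is the number of fitted parameters and $s^2$ the residual variance estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 30
x = np.linspace(0, 10, m)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=m)
y[-1] += 15.0                                # inject one outlier at the edge of the x range

X_b = np.column_stack([np.ones(m), x])
p = X_b.shape[1]                             # number of fitted parameters
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
resid = y - X_b @ theta

# Leverage h_ii: diagonal of the hat matrix H = X (X^T X)^-1 X^T
H = X_b @ np.linalg.solve(X_b.T @ X_b, X_b.T)
leverage = np.diag(H)

s2 = resid @ resid / (m - p)                 # residual variance estimate
cooks_d = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

print("most influential index:", np.argmax(cooks_d))
print("its Cook's distance:   ", cooks_d.max())
```

A point with both a large residual and high leverage gets a large Cook's distance, which is why the injected point should stand out here.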

### Q13: Explain overfitting and underfitting in Linear Regression context.

**Answer:**

**Underfitting (High Bias):**
- Model is too simple
- Poor performance on both training and test data
- High training error, high test error

**Causes:**
- Too few features
- Model doesn't capture relationship (maybe non-linear)
- Too much regularization

**Solutions:**
- Add polynomial features
- Add more relevant features
- Reduce regularization
- Try more complex model

**Overfitting (High Variance):**
- Model is too complex
- Excellent performance on training, poor on test
- Low training error, high test error
- Model learns noise in training data

**Causes:**
- Too many features
- Polynomial features with high degree
- Too little regularization
- Small training set

**Solutions:**
- Add more training data
- Feature selection (remove irrelevant features)
- Regularization (Ridge/Lasso)
- Cross-validation
- Reduce model complexity

**Bias-Variance Tradeoff:**
- Need to balance model complexity
- Total Error = Bias² + Variance + Irreducible Error
- Goal: Find sweet spot (minimum total error)

### Q14: What is regularization? Explain Ridge and Lasso regression.

**Answer:**

**Regularization**: Technique to prevent overfitting by adding a penalty term to the cost function.

**Ridge Regression (L2 Regularization):**
$$J(\theta) = MSE + \lambda\sum_{j=1}^{n}\theta_j^2$$

- Penalty: Sum of squared coefficients
- **Effect**: Shrinks coefficients toward zero (but never exactly zero)
- Keeps all features
- Works well with multicollinearity
- Normal equation: $\theta = (X^TX + \lambda I)^{-1}X^Ty$
- Note: for any $\lambda > 0$, $X^TX + \lambda I$ is invertible even if $X^TX$ is singular!

**Lasso Regression (L1 Regularization):**
$$J(\theta) = MSE + \lambda\sum_{j=1}^{n}|\theta_j|$$

- Penalty: Sum of absolute coefficients
- **Effect**: Can shrink coefficients to exactly zero
- Performs automatic feature selection
- Creates sparse models
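- No closed-form solution in general (the L1 penalty is not differentiable at zero); typically solved with coordinate descent or subgradient methods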

**Elastic Net:**
$$J(\theta) = MSE + r\lambda\sum|\theta_j| + \frac{(1-r)}{2}\lambda\sum\theta_j^2$$

- Combines L1 and L2
- Best of both worlds

**Choosing λ:**
- Use cross-validation
- Higher λ → more regularization → simpler model
- λ = 0 → standard linear regression

**When to use:**
- **Ridge**: Many features, multicollinearity, want to keep all features
- **Lasso**: Want feature selection, sparse model
- **Elastic Net**: Many correlated features, want feature selection
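
The contrast between Ridge shrinkage and Lasso sparsity can be seen in a small NumPy sketch. Assumptions to note: the data are synthetic with only two relevant features; features are standardized and the target centered so no intercept is needed; the Lasso objective is scaled as $\frac{1}{2m}\lVert y - X\theta\rVert^2 + \lambda\lVert\theta\rVert_1$ (a common convention that differs from the formula above by constant factors); and the `lam` values are arbitrary.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; returns exactly 0 when |rho| <= lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for (1/2m)||y - X theta||^2 + lam * ||theta||_1."""
    m, n = X.shape
    theta = np.zeros(n)
    col_sq = (X ** 2).sum(axis=0) / m
    for _ in range(n_iters):
        for j in range(n):
            # Residual with feature j's current contribution added back
            partial_resid = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_resid / m
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

def ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam I)^-1 X^T y, solved without an explicit inverse."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=m)
y = y - y.mean()                             # center the target

theta_ols = ridge_closed_form(X, y, lam=0.0)
theta_ridge = ridge_closed_form(X, y, lam=10.0)
theta_lasso = lasso_cd(X, y, lam=0.2)
print("OLS:  ", np.round(theta_ols, 3))
print("Ridge:", np.round(theta_ridge, 3))
print("Lasso:", np.round(theta_lasso, 3))
```

Ridge should shrink every coefficient while keeping all of them nonzero, whereas Lasso should set the three irrelevant coefficients to exactly zero.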

### Q15: How would you implement Linear Regression from scratch in a coding interview?

**Answer:**

Here's a concise implementation using Normal Equation:

```python
import numpy as np

class LinearRegression:
    def __init__(self):
        self.theta = None
    
    def fit(self, X, y):
        """Train using Normal Equation."""
        # Add intercept term
        m = X.shape[0]
        X_b = np.c_[np.ones((m, 1)), X]
        
        # Normal equation: θ = (X^T X)^-1 X^T y
        self.theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        m = X.shape[0]
        X_b = np.c_[np.ones((m, 1)), X]
        return X_b @ self.theta
    
    def score(self, X, y):
        """Calculate R² score."""
        y_pred = self.predict(X)
        ss_tot = np.sum((y - y.mean()) ** 2)
        ss_res = np.sum((y - y_pred) ** 2)
        return 1 - (ss_res / ss_tot)

# Usage (X_train, y_train, X_test, y_test are assumed to be NumPy arrays defined elsewhere)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
```

**Key points to mention:**
- Add bias/intercept term (column of ones)
- Use vectorized operations
- Could use `np.linalg.pinv()` for stability
- Should handle edge cases (empty data, singular matrix)
- Could add input validation

## Polynomial Regression

### Q16: What is Polynomial Regression? Is it a linear model or a non-linear model?

**Answer:**

Polynomial Regression extends Linear Regression by adding polynomial terms (powers of the original features) to capture **non-linear relationships** between features and the target.

For a single feature $x$ with degree $d$:

$$\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_d x^d$$

**Is it linear?** Yes — despite fitting curves, Polynomial Regression is **linear in the parameters** $\theta$. We simply create new features ($x^2, x^3, \ldots$) and fit a standard linear model. The model is non-linear in $x$, but the optimization (Normal Equation, Gradient Descent) works identically to regular Linear Regression.

**Key distinction:**
- **Linear in features** = model outputs a straight line/hyperplane → standard LR
- **Linear in parameters** = the output is a linear combination of parameters → Polynomial Regression, still solvable with the same methods
- **Non-linear model** = parameters appear non-linearly (e.g., neural networks) → requires different optimization

### Q17: How do you choose the polynomial degree? What happens if it's too high or too low?

**Answer:**

The polynomial degree $d$ is a hyperparameter that controls model complexity:

| Degree | Effect | Risk |
|:---:|:---|:---|
| Too low (e.g., 1 for curved data) | Can't capture the true pattern | **Underfitting** (high bias) |
| Just right (e.g., 2–3) | Captures the relationship without memorizing noise | Good generalization |
| Too high (e.g., 10+ for small data) | Fits training data perfectly, including noise | **Overfitting** (high variance) |

**How to choose the degree:**

1. **Cross-validation:** Try degrees 1, 2, 3, ..., $k$ and pick the one with the lowest cross-validated error (e.g., RMSE or MAE).

2. **Validation curves:** Plot training error and validation error vs. degree:
   - If both are high → underfitting → increase degree
   - If training is low but validation is high → overfitting → decrease degree or add regularization

3. **Domain knowledge:** If you know the physical relationship is quadratic (e.g., projectile motion), use degree 2.

4. **Start simple:** Begin with degree 2. Only increase if the model underfits.

```python
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Test different degrees
for degree in range(1, 8):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('lr', LinearRegression())
    ])
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    rmse = (-scores.mean()) ** 0.5
    print(f"Degree {degree}: RMSE = {rmse:.4f}")
```

### Q18: How does the number of features grow with polynomial degree? Why is this a problem?

**Answer:**

With $n$ original features and polynomial degree $d$, the number of expanded features (including all cross terms) is:

$$\text{Number of features} = \binom{n + d}{d}$$

This grows **combinatorially**:

| Original features ($n$) | Degree ($d$) | Polynomial features |
|:---:|:---:|:---:|
| 2 | 2 | 6 |
| 2 | 3 | 10 |
| 10 | 2 | 66 |
| 10 | 3 | 286 |
| 100 | 2 | 5,151 |
| 100 | 3 | 176,851 |

**Why is this a problem?**

1. **Overfitting:** More features means more parameters to fit, and with limited training data the model can memorize noise instead of learning patterns.

2. **Computational cost:** Training time increases with the number of features. The Normal Equation has $O(p^3)$ complexity where $p$ is the number of polynomial features. Gradient Descent costs $O(mp)$ per iteration, so $O(kmp)$ over $k$ iterations.

3. **Memory:** Storing the expanded feature matrix can become prohibitive. With $n=100$ and $d=3$, you have ~177K features per sample.

4. **Curse of dimensionality:** In high-dimensional feature spaces, data points become sparse. The model needs exponentially more data to generalize well.

**Solutions:**
- Keep degree low (2–3)
- Use **regularization** (Ridge, Lasso, Elastic Net) to constrain coefficients
- Use Lasso or Elastic Net for **feature selection** — they zero out irrelevant polynomial terms
- Consider `interaction_only=True` in scikit-learn to skip pure powers and keep only cross terms
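
The feature counts in the table above follow directly from the binomial formula, which can be evaluated with the standard library. The helper name `n_poly_features` is an illustrative assumption; note that the count includes the constant (bias) term.

```python
from math import comb

def n_poly_features(n, d):
    """Number of monomials of total degree <= d in n variables (bias term included)."""
    return comb(n + d, d)

for n, d in [(2, 2), (2, 3), (10, 2), (10, 3), (100, 2), (100, 3)]:
    print(f"n={n:>3}, d={d}: {n_poly_features(n, d):,} features")
```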

### Q19: Why is feature scaling especially important for Polynomial Regression?

**Answer:**

Feature scaling is critical because polynomial features **amplify scale differences enormously**.

**Example:** Suppose a feature $x$ has values in the range $[100, 1000]$:
- $x$: range $[100, 1000]$
- $x^2$: range $[10\,000, \; 1\,000\,000]$
- $x^3$: range $[1\,000\,000, \; 1\,000\,000\,000]$

**What happens without scaling:**
1. **Gradient Descent** struggles: the cost surface becomes extremely elongated. Gradients for higher-degree terms are huge while gradients for lower-degree terms are tiny. The algorithm either diverges (if the learning rate is tuned for low-degree terms) or barely moves (if it is tuned for high-degree terms).

2. **Numerical instability:** Columns that differ by many orders of magnitude (up to $10^9$ here) make $X^TX$ severely ill-conditioned, so solving the Normal Equation loses precision to floating-point round-off.

3. **Regularization is biased:** The penalty $\lambda\sum\theta_j^2$ treats all coefficients equally. Without scaling, small-scale features need large coefficients to have any effect, so the penalty falls disproportionately on them.

**Best practice — scale before adding polynomial features:**

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

model = Pipeline([
    ('scaler', StandardScaler()),        # Scale FIRST
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('ridge', Ridge(alpha=1.0))
])
```

**Note:** Some practitioners scale after polynomial expansion instead, which is also valid. The key point is that the features seen by the model must be on comparable scales.
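
One way to quantify the numerical side of this is the condition number of $X^TX$. The sketch below, using the $[100, 1000]$ range from the example and an assumed grid of 50 points, compares degree-3 polynomial features built from the raw feature against those built from a standardized copy.

```python
import numpy as np

x = np.linspace(100, 1000, 50)                 # raw feature in [100, 1000]
x_scaled = (x - x.mean()) / x.std()            # standardized copy

def poly_design(v, degree=3):
    """Design matrix with columns [1, v, v^2, ..., v^degree]."""
    return np.vander(v, degree + 1, increasing=True)

A_raw = poly_design(x)
A_scaled = poly_design(x_scaled)
cond_raw = np.linalg.cond(A_raw.T @ A_raw)
cond_scaled = np.linalg.cond(A_scaled.T @ A_scaled)
print(f"cond(X^T X), raw feature:    {cond_raw:.3e}")
print(f"cond(X^T X), scaled feature: {cond_scaled:.3e}")
```

A larger condition number means more digits of precision are lost when solving the linear system, so the gap between the two values gives a rough sense of how much scaling helps.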

### Q20: When should you use Polynomial Regression vs. a non-linear model like Decision Trees?

**Answer:**

| Criterion | Polynomial Regression | Decision Trees / Random Forests |
|:---|:---|:---|
| **Relationship type** | Smooth, continuous curves | Arbitrary, step-like patterns |
| **Interpretability** | Moderate (coefficients have meaning if degree is low) | Low for ensembles, high for single trees |
| **Feature count** | Works well with few features + low degree | Handles many features natively |
| **Extrapolation** | Extrapolates (but dangerously — polynomials blow up outside training range) | Cannot extrapolate (predicts constant outside training range) |
| **Overfitting control** | Regularization + degree tuning | Max depth, min samples, pruning |
| **Training speed** | Fast (linear algebra) | Fast (greedy splits) |
| **Feature interactions** | Must explicitly create cross terms | Discovers interactions automatically |

**Use Polynomial Regression when:**
- You believe the true relationship is a smooth curve (e.g., physics, engineering).
- You have few features and want interpretable coefficients.
- You need to extrapolate slightly beyond training data (with caution).
- You want a simple extension of your existing Linear Regression pipeline.

**Use Tree-based models when:**
- The relationship is complex, discontinuous, or unknown.
- You have many features and don't want to worry about polynomial explosion.
- You need automatic feature interaction discovery.
- You don't need to extrapolate.

**Important warning about extrapolation:**
Polynomial models can produce wildly wrong predictions outside the training range. A degree-10 polynomial that fits beautifully within $[0, 10]$ can output absurd values at $x = 11$. Always be cautious when using polynomial models for prediction beyond the observed data range.
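
The warning can be reproduced with a short experiment. This sketch assumes a noisy sine curve as the true relationship (an arbitrary illustrative choice) and uses `numpy.polynomial.Polynomial.fit` for the degree-10 fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 10, 40)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=x_train.size)  # illustrative target

poly = Polynomial.fit(x_train, y_train, deg=10)   # least-squares fit on a rescaled domain

# Worst error inside the training range
in_range_err = np.max(np.abs(poly(x_train) - np.sin(x_train)))
print(f"max in-range error: {in_range_err:.3g}")

# Predictions at points increasingly far outside [0, 10]
for x_new in (10.5, 11.0, 12.0, 15.0):
    print(f"x={x_new:>4}: prediction={poly(x_new):.3g}, true={np.sin(x_new):.3g}")

extrap_err = abs(poly(15.0) - np.sin(15.0))
```

The in-range error should stay small, while the predictions drift further from the true values as `x_new` moves away from the training interval.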

### Q21: Implement Polynomial Regression from scratch and show how different degrees affect the fit.

**Answer:**
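
A minimal from-scratch sketch, assuming a single input feature and synthetic quadratic data (both are illustrative choices, as are the helper names and the degrees compared). It builds the polynomial features by hand, fits them by least squares, and reports training and validation error per degree.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D array x to columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_poly(x, y, degree):
    """Least-squares fit; the model is linear in theta despite the powers of x."""
    theta, *_ = np.linalg.lstsq(poly_features(x, degree), y, rcond=None)
    return theta

def mse(x, y, theta):
    degree = theta.size - 1
    return np.mean((poly_features(x, degree) @ theta - y) ** 2)

rng = np.random.default_rng(42)

def make_data(m):
    x = rng.uniform(-3, 3, size=m)
    y = 0.5 * x**2 + x + 2 + rng.normal(size=m)   # quadratic signal plus unit noise
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

results = {}
for degree in (1, 2, 10):
    theta = fit_poly(x_train, y_train, degree)
    results[degree] = (mse(x_train, y_train, theta), mse(x_val, y_val, theta))
    print(f"degree {degree:>2}: train MSE={results[degree][0]:.3f}  "
          f"val MSE={results[degree][1]:.3f}")
```

Training error can only go down as the degree increases, because each lower-degree model is a special case of the higher-degree one. Validation error is the quantity to watch: degree 1 should underfit the curved data, and a high degree fitted to only 30 points tends to show a wider gap between training and validation error.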

## Applied Exercises

### Exercise 1: Training with Millions of Features

**Question:** What Linear Regression training algorithm can you use if you have a training set with millions of features?

**Answer:**

When the number of features is very large (millions), the **Normal Equation** becomes impractical because it requires computing $(X^T X)^{-1}$, which involves inverting an $n \times n$ matrix (where $n$ is the number of features). This inversion has a computational complexity of approximately $O(n^{2.4})$ to $O(n^3)$, making it extremely slow or even impossible for millions of features.

Instead, you should use **Gradient Descent** — specifically:

| Algorithm | Why It Works |
|---|---|
| **Batch Gradient Descent** | Computes gradients over the full dataset. Each iteration is $O(m \times n)$, which is linear in $n$. Feasible if $m$ is moderate. |
| **Stochastic Gradient Descent (SGD)** | Updates parameters using one sample at a time — $O(n)$ per step. Very efficient for large $n$ and large $m$. |
| **Mini-batch Gradient Descent** | Compromise between Batch and SGD. Efficient, benefits from vectorized hardware. |

**Key points:**
- Gradient Descent scales linearly with the number of features, making it suitable for high-dimensional datasets.
- SGD or Mini-batch GD are preferred when both $m$ and $n$ are large.
- Regularization (Ridge, Lasso, Elastic Net) becomes important to prevent overfitting in high-dimensional settings.
- Lasso (L1) is especially useful because it can perform **feature selection** by driving irrelevant feature coefficients to zero.
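
A mini-batch Gradient Descent loop can be sketched in a few lines of NumPy. The problem size here (50 features) is a small stand-in for a genuinely wide dataset, the hyperparameters `eta`, `batch_size`, and `n_epochs` are assumed and untuned, and the intercept is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 50                                   # small stand-in for a very wide dataset
X = rng.normal(size=(m, n))
true_theta = rng.normal(size=n)
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = np.zeros(n)
eta, batch_size, n_epochs = 0.05, 32, 30          # assumed, untuned hyperparameters
for epoch in range(n_epochs):
    order = rng.permutation(m)                    # reshuffle every epoch
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        X_batch, y_batch = X[idx], y[idx]
        # Gradient on the mini-batch only: cost is O(batch_size * n) per step
        grad = X_batch.T @ (X_batch @ theta - y_batch) / len(idx)
        theta -= eta * grad

print("max |theta - true_theta|:", np.abs(theta - true_theta).max())
```

Each update touches only `batch_size` rows and never forms an $n \times n$ matrix, which is what keeps the per-step cost linear in the number of features.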

### Exercise 8: Large Gap Between Training and Validation Error in Polynomial Regression

**Question:** Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

**Answer:**

A **large gap** between training error (low) and validation error (high) is the classic sign of **overfitting** (high variance). The model has learned the noise and specific patterns of the training data rather than the underlying relationship, so it generalizes poorly to unseen data.

**What is happening:**
- The polynomial degree is too high, giving the model too many parameters relative to the amount of training data.
- The model memorizes the training set (low training error) but fails on new data (high validation error).

**Three ways to solve it:**

1. **Reduce model complexity:**
   - Lower the polynomial degree. A simpler model has fewer parameters and is less prone to overfitting.
   - This directly addresses the root cause — the model is too flexible.

2. **Add regularization:**
   - Apply **Ridge (L2)**, **Lasso (L1)**, or **Elastic Net** regularization to penalize large coefficients.
   - Regularization constrains the model, effectively reducing its capacity to overfit.
   - Tune the regularization hyperparameter $\alpha$ using cross-validation.

3. **Increase the training set size:**
   - With more data, the model has less opportunity to memorize specific examples.
   - The training error will increase slightly, but the validation error will decrease — the gap closes.
   - If collecting more data is not feasible, consider **data augmentation** techniques.

**Bonus approaches:**
- Use **cross-validation** to select the best polynomial degree.
- Apply **early stopping** if using an iterative training algorithm.
- Perform **feature selection** to remove irrelevant polynomial features.
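
The effect of the third remedy (more training data) can be illustrated with a small experiment. The sketch assumes a degree-10 polynomial fitted to synthetic quadratic data and compares the train/validation gap at three arbitrary training-set sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    x = rng.uniform(-3, 3, size=m)
    return x, 0.5 * x**2 + x + 2 + rng.normal(size=m)   # quadratic signal plus unit noise

def train_val_mse(m_train, degree=10):
    """Fit a degree-`degree` polynomial on m_train points; return (train MSE, val MSE)."""
    x_tr, y_tr = make_data(m_train)
    x_val, y_val = make_data(500)
    A_tr = np.vander(x_tr, degree + 1, increasing=True)
    A_val = np.vander(x_val, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return np.mean((A_tr @ theta - y_tr) ** 2), np.mean((A_val @ theta - y_val) ** 2)

gaps = {}
for m_train in (15, 50, 500):
    train_mse, val_mse = train_val_mse(m_train)
    gaps[m_train] = val_mse - train_mse
    print(f"m={m_train:>3}: train MSE={train_mse:.3f}  val MSE={val_mse:.3f}  "
          f"gap={gaps[m_train]:.3f}")
```

With only 15 points the model has almost as many parameters as samples, so the gap should be large; with 500 points both errors should settle near the noise level.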

### Exercise 9: Ridge Regression with High Training and Validation Error

**Question:** Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter $\alpha$ or reduce it?

**Answer:**

**Diagnosis: High Bias (Underfitting)**

When both training error and validation error are **high and close to each other**, the model is **underfitting**. It is too simple to capture the underlying patterns in the data. This is the hallmark of **high bias**.

| Scenario | Training Error | Validation Error | Gap | Problem |
|---|---|---|---|---|
| High bias (underfitting) | High | High | Small | Model is too constrained |
| High variance (overfitting) | Low | High | Large | Model is too flexible |

**What to do with $\alpha$:**

You should **reduce** the regularization hyperparameter $\alpha$.

**Reasoning:**
- Ridge Regression adds a penalty of $\alpha \sum_{j=1}^{n} \theta_j^2$ to the cost function.
- A **large $\alpha$** penalizes the coefficients heavily, forcing them toward zero. This makes the model too simple — it cannot fit the training data well.
- **Reducing $\alpha$** relaxes the constraint on the coefficients, allowing the model to learn more complex patterns and fit the data better.
- In the extreme case, $\alpha = 0$ gives plain Linear Regression (no regularization).

**Additional strategies if reducing $\alpha$ alone isn't enough:**
- Add more features or polynomial features to increase model capacity.
- Use a more complex model (e.g., decision trees, neural networks).
- Engineer better features from domain knowledge.
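
The link between $\alpha$ and underfitting can be seen by sweeping $\alpha$ in the closed-form Ridge solution. This is a sketch on synthetic data; the coefficient values, the 70/30 split, and the `alphas` grid are illustrative assumptions, and the data are centered with training statistics so that the intercept is not penalized.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.normal(size=(m, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(scale=0.5, size=m)
X_tr, y_tr, X_val, y_val = X[:70], y[:70], X[70:], y[70:]

# Center with training statistics so the intercept is left unpenalized
x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()

def ridge_mse(alpha):
    """Return (train MSE, validation MSE) of closed-form Ridge for a given alpha."""
    A = X_tr - x_mean
    theta = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ (y_tr - y_mean))
    predict = lambda Z: (Z - x_mean) @ theta + y_mean
    return np.mean((predict(X_tr) - y_tr) ** 2), np.mean((predict(X_val) - y_val) ** 2)

alphas = [0.0, 1.0, 100.0, 10_000.0]
train_errors = []
for alpha in alphas:
    tr, val = ridge_mse(alpha)
    train_errors.append(tr)
    print(f"alpha={alpha:>8}: train MSE={tr:.3f}  val MSE={val:.3f}")
```

As $\alpha$ grows, the coefficients are pushed toward zero and both errors rise together, which is the high-bias pattern described in the question; reducing $\alpha$ moves back toward the unregularized fit.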

### Exercise 10: When to Use Ridge, Lasso, or Elastic Net

**Question:** Why would you want to use:
- Ridge Regression instead of plain Linear Regression?
- Lasso instead of Ridge Regression?
- Elastic Net instead of Lasso?

**Answer:**

#### Ridge Regression instead of plain Linear Regression

Use Ridge when you suspect **overfitting** or when the model has **many features** relative to the number of observations.

| Aspect | Plain Linear Regression | Ridge Regression |
|---|---|---|
| Regularization | None | L2 penalty: $\alpha \sum \theta_j^2$ |
| Overfitting risk | High with many features | Controlled by $\alpha$ |
| Coefficient behavior | Can be arbitrarily large | Shrunk toward zero (but never exactly zero) |
| Multicollinearity | Unstable estimates | Stabilized estimates |

**Use Ridge when:**
- You have multicollinearity among features.
- You want to keep all features but prevent any from dominating.
- The number of features is large relative to samples.
- You want a regularized model as a default improvement over OLS.

---

#### Lasso instead of Ridge Regression

Use Lasso when you want **automatic feature selection** — Lasso can drive coefficients to exactly zero, effectively removing irrelevant features.

| Aspect | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | $\alpha \sum \theta_j^2$ | $\alpha \sum \lvert\theta_j\rvert$ |
| Feature selection | No (shrinks but keeps all) | Yes (sets some coefficients to exactly 0) |
| Sparse models | No | Yes |
| Correlated features | Handles well (keeps all) | Arbitrarily picks one, drops others |

**Use Lasso when:**
- You believe only a few features are truly relevant (sparse solution).
- You want an interpretable model with fewer features.
- Feature selection is part of the modeling goal.

---

#### Elastic Net instead of Lasso

Use Elastic Net when you want the **feature selection** ability of Lasso but also the **stability** of Ridge, especially when features are correlated.

| Aspect | Lasso (L1) | Elastic Net (L1 + L2) |
|---|---|---|
| Penalty | $\alpha \sum \lvert\theta_j\rvert$ | $\alpha \left( r \sum \lvert\theta_j\rvert + \frac{1-r}{2} \sum \theta_j^2 \right)$ |
| Correlated features | Unstable (picks one randomly) | Groups correlated features together |
| Feature selection | Yes | Yes (but softer) |
| Stability | Can be erratic | More stable |

**Use Elastic Net when:**
- Features are **correlated** with each other (groups of related features).
- You want feature selection but Lasso's behavior with correlated features is problematic.
- The number of features is larger than the number of samples ($n > m$) — Lasso selects at most $m$ features, while Elastic Net does not have this limitation.
- **As a general default**, Elastic Net is almost always preferred over pure Lasso.

---

**Summary decision flow:**

```
Plain LR → worried about overfitting? → Ridge
Ridge → want feature selection? → Lasso
Lasso → correlated features or n > m? → Elastic Net
```