# Linear Regression & Normal Equation - Interview Q&A

This notebook contains interview-level questions and answers about Linear Regression and the Normal Equation method.

## Basic Concepts

### Q1: What is Linear Regression? When would you use it?

**Answer:**

Linear Regression is a supervised learning algorithm used to predict a continuous target variable based on one or more independent variables by fitting a linear equation to the observed data.

**Mathematical form:**
- Simple: $y = \beta_0 + \beta_1x + \varepsilon$
- Multiple: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \varepsilon$

**Use cases:**
- Predicting house prices based on features
- Forecasting sales based on advertising spend
- Estimating continuous outcomes (temperature, stock prices, etc.)
- When the relationship between the features and the target is approximately linear

**Key requirement:** The target variable must be continuous (not categorical).
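
As a rough illustration of the simple form, the sketch below fits $y = \beta_0 + \beta_1x$ using the textbook least-squares formulas for slope and intercept. The toy data is made up and noise-free (so $\varepsilon = 0$) purely to make the result easy to check by hand.

```python
import numpy as np

# Hypothetical toy data generated from y = 2 + 3x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Least-squares estimates for simple linear regression:
#   slope     = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"intercept={beta_0:.3f}, slope={beta_1:.3f}")
```

With real, noisy data the estimates would only approximate the underlying coefficients.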

### Q2: What are the four key assumptions of Linear Regression?

**Answer:**

1. **Linearity**: The relationship between independent and dependent variables is linear
   - Check: Scatter plots, residual plots

2. **Independence**: Observations are independent of each other
   - Violation example: Time series data with autocorrelation
   - Check: Durbin-Watson test

3. **Homoscedasticity**: Constant variance of residuals across all levels of independent variables
   - Violation: Residuals increase/decrease with fitted values (heteroscedasticity)
   - Check: Residual plot (should show random scatter)

4. **Normality**: Residuals are normally distributed
   - Check: Q-Q plot, histogram of residuals, Shapiro-Wilk test

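**Consequences of violations:**
- Non-linearity: biased coefficient estimates and systematic prediction errors
- Dependence or heteroscedasticity: coefficients stay unbiased but are inefficient, and standard errors are wrong
- Non-normal residuals: confidence intervals and p-values become unreliable, mainly in small samples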
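
A couple of these checks can be sketched numerically. The example below computes the Durbin-Watson statistic directly from its definition on the residuals of a least-squares fit. The data is synthetic, and `durbin_watson` is a helper defined here for illustration, not a library call.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic, always between 0 and 4.

    Values near 2 suggest no first-order autocorrelation; values toward 0
    suggest positive and values toward 4 negative autocorrelation.
    """
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Synthetic data with independent noise (an assumption made for this sketch).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Fit by least squares and compute residuals.
X_b = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
residuals = y - X_b @ theta

dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.2f}")
```

Plotting `residuals` against `X_b @ theta` would be the natural next step for the linearity and homoscedasticity checks.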

### Q3: Explain the difference between Simple and Multiple Linear Regression.

**Answer:**

**Simple Linear Regression:**
- One independent variable
- Form: $y = \beta_0 + \beta_1x$
- Example: Predicting salary based only on years of experience
- Visualization: 2D line

**Multiple Linear Regression:**
- Two or more independent variables
- Form: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$
- Example: Predicting salary based on experience, education, location
- Visualization: Hyperplane in (n+1)-dimensional space (n features plus the target)

**Key differences:**
- Multiple regression can capture more complex relationships
- Multiple regression is prone to multicollinearity
- Interpretation: In multiple regression, each coefficient represents the effect of that variable while holding others constant

## Normal Equation

### Q4: What is the Normal Equation? Derive or explain the formula.

**Answer:**

The Normal Equation is a closed-form analytical solution to find optimal parameters for linear regression without iteration.

**Formula:**
$$\theta = (X^TX)^{-1}X^Ty$$

**Derivation (high-level):**

1. Start with cost function (MSE):
   $$J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y)$$

2. To minimize, take derivative with respect to $\theta$ and set to zero:
   $$\frac{\partial J}{\partial \theta} = 0$$

3. Expand and solve:
   $$\frac{1}{m}X^T(X\theta - y) = 0$$
   $$X^TX\theta = X^Ty$$
   $$\theta = (X^TX)^{-1}X^Ty$$

**Key insight:** This gives the exact optimal parameters in one computation (no iterations needed).
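
The derivation can be sanity-checked numerically: at the normal-equation solution, the gradient $\frac{1}{m}X^T(X\theta - y)$ should vanish. This is a minimal sketch on synthetic data; the coefficients and noise level are arbitrary choices, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 50, 3
X = rng.normal(size=(m, n))
X_b = np.column_stack([np.ones(m), X])  # prepend the intercept column

# Arbitrary "true" parameters, chosen only for this illustration.
true_theta = np.array([1.0, 2.0, -1.0, 0.5])
y = X_b @ true_theta + rng.normal(scale=0.1, size=m)

# Solve X^T X theta = X^T y (same solution as the formula, without forming the inverse).
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Gradient of J at the solution: (1/m) X^T (X theta - y), expected to be ~0.
gradient = X_b.T @ (X_b @ theta - y) / m
print("theta:", theta)
print("max |gradient|:", np.abs(gradient).max())
```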

### Q5: What are the advantages and disadvantages of the Normal Equation?

**Answer:**

**Advantages:**
- ✅ **No hyperparameters** - No learning rate to tune
- ✅ **Exact solution** - Finds global optimum in one step
- ✅ **No iterations** - Single computation
- ✅ **No feature scaling needed** - Works with unnormalized data
- ✅ **Guaranteed convergence** - Always finds solution (if invertible)

**Disadvantages:**
- ❌ **Computational complexity** - Roughly O(mn² + n³): O(mn²) to form $X^TX$ plus O(n³) to invert it
- ❌ **Slow for large n** - Becomes impractical once the number of features exceeds roughly 10,000
- ❌ **Memory intensive** - Must compute and store $X^TX$
- ❌ **Matrix inversion required** - Fails if $X^TX$ is singular (non-invertible)
- ❌ **Not suitable for online learning** - Requires all data at once

**Rule of thumb:** Use Normal Equation when n < 10,000 features; use Gradient Descent otherwise.

### Q6: When is $(X^TX)$ not invertible? What can you do about it?

**Answer:**

**When $X^TX$ is singular (non-invertible):**

1. **Redundant features**: Linearly dependent columns
   - Example: Feature 1 = Temperature in Celsius, Feature 2 = Temperature in Fahrenheit

2. **More features than samples**: $n > m$
   - System is underdetermined

3. **Perfect multicollinearity**: One feature is exact linear combination of others

**Solutions:**

1. **Remove redundant features**
   - Identify and eliminate linearly dependent features
   - Use correlation matrix or VIF (Variance Inflation Factor)

2. **Use pseudo-inverse (Moore-Penrose)**
   ```python
   theta = np.linalg.pinv(X) @ y
   ```
   - Works even when $X^TX$ is singular (returns the minimum-norm least-squares solution)
   - More numerically stable than explicitly inverting $X^TX$

3. **Regularization** (Ridge/Lasso)
   - Add penalty term to make matrix invertible
   - Ridge: $(X^TX + \lambda I)^{-1}X^Ty$

4. **Use Gradient Descent** instead
   - Doesn't require matrix inversion
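
The Celsius/Fahrenheit case can be reproduced in a few lines. This sketch uses made-up temperature data, shows that the design matrix loses rank, and then applies two of the remedies above (pseudo-inverse and Ridge). The explicit `tol`/`rcond` cutoffs and the value of `lam` are illustrative choices, not recommendations.

```python
import numpy as np

# Two features carrying the same information (hypothetical data).
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahrenheit = celsius * 9.0 / 5.0 + 32.0
X_b = np.column_stack([np.ones_like(celsius), celsius, fahrenheit])
y = 5.0 + 0.5 * celsius

# Fahrenheit = 32 * intercept column + 1.8 * Celsius, so only 2 of the
# 3 columns are linearly independent and X^T X is singular.
rank = np.linalg.matrix_rank(X_b, tol=1e-8)
print("columns:", X_b.shape[1], "rank:", rank)

# Remedy 1: pseudo-inverse gives the minimum-norm least-squares solution.
theta_pinv = np.linalg.pinv(X_b, rcond=1e-10) @ y

# Remedy 2: Ridge adds lam * I, which makes the system solvable.
lam = 1e-3
theta_ridge = np.linalg.solve(X_b.T @ X_b + lam * np.eye(X_b.shape[1]), X_b.T @ y)

print("pinv predictions: ", X_b @ theta_pinv)
print("ridge predictions:", X_b @ theta_ridge)
```

Both remedies reproduce the targets closely here, but the individual coefficients are not uniquely determined, which is why dropping the redundant feature is usually the cleanest fix.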

### Q7: Compare Normal Equation vs Gradient Descent. When would you use each?

**Answer:**

| Aspect | Normal Equation | Gradient Descent |
|--------|----------------|------------------|
| **Algorithm type** | Analytical/Direct | Iterative |
| **Iterations** | None (1 computation) | Many (100s-1000s) |
| **Learning rate** | Not needed | Required |
| **Feature scaling** | Not needed | Highly recommended |
| **Time complexity** | O(mn² + n³) | O(kmn), k=iterations |
| **Works for large n** | No (slow beyond n ≈ 10,000) | Yes |
| **Works for large m** | Yes | Yes |
| **Singularity issue** | Yes (if $X^TX$ singular) | No |
| **Online learning** | No | Yes |
| **Optimization** | Global optimum | Global optimum* |

*The linear regression cost function is convex, so gradient descent converges to the global optimum, provided the learning rate is small enough.

**Use Normal Equation when:**
- Small to medium datasets (m < 100,000)
- Few features (n < 10,000)
- Need exact solution quickly
- Don't want to tune hyperparameters

**Use Gradient Descent when:**
- Large datasets (m > 100,000)
- Many features (n > 10,000)
- Online learning required
- Memory constrained
- Want to generalize to other algorithms (logistic regression, neural networks)
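
To make the comparison concrete, here is a small sketch that solves the same synthetic problem both ways. The `gradient_descent` helper, the learning rate, and the iteration count are assumptions chosen for this toy data, where the features are already on a similar scale.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 2))              # features already roughly standardized
X_b = np.column_stack([np.ones(m), X])   # add intercept column
y = X_b @ np.array([4.0, 3.0, -2.0]) + rng.normal(scale=0.5, size=m)

# Normal Equation: one linear solve, no hyperparameters.
theta_ne = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

def gradient_descent(X_b, y, learning_rate=0.1, n_iterations=1000):
    """Batch gradient descent on the MSE cost J = (1/2m) * ||X theta - y||^2."""
    m, n = X_b.shape
    theta = np.zeros(n)
    for _ in range(n_iterations):
        gradient = X_b.T @ (X_b @ theta - y) / m
        theta -= learning_rate * gradient
    return theta

theta_gd = gradient_descent(X_b, y)

print("normal equation: ", theta_ne)
print("gradient descent:", theta_gd)
```

On badly scaled features the same learning rate could diverge or converge very slowly, which is the practical reason feature scaling is recommended for gradient descent.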

## Model Evaluation & Interpretation

### Q8: What is R² score? How do you interpret it?

**Answer:**

**R² (Coefficient of Determination)** measures the proportion of variance in the dependent variable explained by the independent variables.

**Formula:**
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Where:
- $SS_{res}$ = Residual sum of squares (model error)
- $SS_{tot}$ = Total sum of squares (variance in data)

**Interpretation:**
- **Range**: At most 1; typically between 0 and 1, but negative when the model fits worse than predicting the mean (for example on held-out data)
- **R² = 0.85**: Model explains 85% of variance in target
- **R² = 1.0**: Perfect fit (likely overfitting)
- **R² = 0.0**: Model no better than predicting mean
- **R² < 0**: Model worse than predicting mean

**Limitations:**
- Never decreases on the training data when features are added (even irrelevant ones)
- Use **Adjusted R²** for multiple regression
- High R² doesn't mean causation
- Can be misleading with non-linear relationships
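
The formula and its edge cases can be checked on a tiny hand-made example. The helpers `r2_score` and `adjusted_r2` below are defined locally for illustration; the adjusted version uses the standard correction $1 - (1 - R^2)\frac{m - 1}{m - n - 1}$ with $m$ samples and $n$ features.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, m, n):
    """Adjusted R^2 for m samples and n features (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

r2 = r2_score(y_true, y_pred)
r2_adj = adjusted_r2(r2, m=len(y_true), n=1)
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))  # predict the mean
r2_bad = r2_score(y_true, y_true[::-1])                          # reversed predictions

print(r2, r2_adj, r2_mean, r2_bad)
```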

### Q9: What is multicollinearity? How do you detect and handle it?

**Answer:**

**Multicollinearity**: High correlation between two or more independent variables.

**Problems it causes:**
- Unstable coefficient estimates (small data changes → large coefficient changes)
- Difficult to interpret individual feature importance
- Inflated standard errors
- May lead to singular $X^TX$ matrix

**Detection methods:**

1. **Correlation Matrix**
   - Check pairwise correlations
   - |correlation| > 0.8-0.9 indicates problem

2. **Variance Inflation Factor (VIF)**
   $$VIF_j = \frac{1}{1 - R_j^2}$$
   - $R_j^2$ = R² from regressing feature j on all other features
   - **VIF > 5-10**: Problematic multicollinearity
   - **VIF = 1**: No multicollinearity

**Solutions:**
1. **Remove correlated features** - Drop one of the correlated variables
2. **Combine features** - Create interaction term or ratio
3. **Principal Component Analysis (PCA)** - Transform to uncorrelated components
4. **Regularization** - Ridge regression (L2) handles multicollinearity well
5. **Collect more data** - Sometimes helps: it can weaken a correlation that is a sampling artifact and it reduces the inflated standard errors
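
The VIF definition translates almost directly into code: regress each feature on the others and plug the resulting $R_j^2$ into the formula. This is a NumPy-only sketch on synthetic data; `variance_inflation_factors` is a helper written for this example, and it assumes no feature is an exact linear combination of the others (otherwise the division blows up).

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    vifs = np.empty(n)
    for j in range(n):
        target = X[:, j]
        # Regress feature j on an intercept plus all remaining features.
        others = np.column_stack([np.ones(m), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2_j)
    return vifs

rng = np.random.default_rng(0)
m = 500
x1 = rng.normal(size=m)
x2 = x1 + 0.1 * rng.normal(size=m)   # nearly a copy of x1
x3 = rng.normal(size=m)              # unrelated feature
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)
print("VIFs:", np.round(vifs, 2))
```

The two nearly identical features should show large VIFs while the unrelated one stays close to 1.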

### Q10: What is the difference between MSE, RMSE, and MAE? When would you use each?

**Answer:**

**Mean Squared Error (MSE):**
$$MSE = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

- Squares errors (penalizes large errors more)
- Units: squared units of target variable
- Not interpretable in original units
- Sensitive to outliers

**Root Mean Squared Error (RMSE):**
$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$$

- Square root of MSE
- **Units: same as target variable** (interpretable!)
- Still sensitive to outliers
- Most commonly used

**Mean Absolute Error (MAE):**
$$MAE = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$$

- Average absolute error
- Units: same as target variable
- **Less sensitive to outliers**
- All errors weighted equally

**When to use:**
- **RMSE**: General purpose, when large errors should be penalized more
- **MAE**: When outliers shouldn't dominate, need robust metric
- **MSE**: Theoretical work, optimization (easier to differentiate)

**Example:**
- Errors: [1, 1, 1, 10]
- MAE = 3.25
- RMSE = $\sqrt{(1 + 1 + 1 + 100)/4} = \sqrt{25.75} \approx 5.07$ (heavily influenced by the 10)
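
The three metrics are one-liners in NumPy. The sketch below uses made-up targets and predictions whose absolute errors are 1, 1, 1 and 10, so the effect of the single large error is visible.

```python
import numpy as np

# Hypothetical values; the absolute errors are [1, 1, 1, 10].
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([9.0, 21.0, 29.0, 50.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # squared units of the target
rmse = np.sqrt(mse)              # same units as the target
mae = np.mean(np.abs(errors))    # same units as the target

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```

The gap between RMSE and MAE is itself a quick signal that a few large errors dominate.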

## Advanced Concepts

### Q11: What is the difference between population regression line and sample regression line?

**Answer:**

**Population Regression Line:**
- True relationship in the entire population
- Unknown in practice (we don't have access to entire population)
- Represented by true parameters: $\beta_0, \beta_1, ..., \beta_n$
- Example: True relationship between all people's education and income

**Sample Regression Line:**
- Estimated from a sample of data
- What we actually compute in practice
- Represented by estimated parameters: $\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_n$ (or $\theta$)
- **Goal**: Estimate the population line as accurately as possible
- Different samples → slightly different sample regression lines

**Key relationship:**
- Sample line is an **estimator** of population line
- As sample size increases, the sample line approaches the population line (the least-squares estimator is consistent under the standard assumptions)
- Standard errors quantify uncertainty in our estimates
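
A simulation makes the distinction tangible, because in a simulation the population parameters are known. The sketch below repeatedly draws samples from an assumed population line (`beta_0`, `beta_1` and the noise level are arbitrary) and compares how much the fitted slope varies for small and large samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Population" parameters: known here only because the data is simulated.
beta_0, beta_1 = 1.0, 2.0

def fit_sample(sample_size):
    """Draw one sample from the population and return (intercept, slope)."""
    x = rng.uniform(0, 10, size=sample_size)
    y = beta_0 + beta_1 * x + rng.normal(scale=2.0, size=sample_size)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    return intercept, slope

small = np.array([fit_sample(20) for _ in range(500)])
large = np.array([fit_sample(2000) for _ in range(500)])

slope_spread_small = small[:, 1].std()
slope_spread_large = large[:, 1].std()

print(f"slope spread, 20 points per sample:   {slope_spread_small:.4f}")
print(f"slope spread, 2000 points per sample: {slope_spread_large:.4f}")
```

The spread of the fitted slopes across repeated samples is what a standard error estimates from a single sample.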

### Q12: How would you handle outliers in Linear Regression?

**Answer:**

**Detection:**
1. **Residual analysis** - Large residuals indicate outliers
2. **Cook's distance** - Measures influence of each point
3. **Leverage** - Points far from mean of X
4. **Visualization** - Scatter plots, box plots

**Handling strategies:**

1. **Remove outliers** (if justified)
   - Data entry errors
   - Measurement errors
   - Document removal decision

2. **Transform variables**
   - Log transformation reduces impact
   - Box-Cox transformation

3. **Use robust regression**
   - RANSAC (Random Sample Consensus)
   - Huber regression
   - Minimize MAE instead of MSE

4. **Cap/Winsorize**
   - Replace extreme values with percentile values
   - Example: Cap at 95th percentile

5. **Add indicator variable**
   - Binary variable indicating outlier status
   - Allows model to treat differently

**Important**: Don't automatically remove outliers - they might contain valuable information!
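
Leverage and Cook's distance can be computed from the hat matrix $H = X(X^TX)^{-1}X^T$. The sketch below is a minimal NumPy version on made-up data with one corrupted point; `leverage_and_cooks_distance` is a helper defined here, and it uses the common form $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_{ii}}{(1 - h_{ii})^2}$, where $p$ counts the parameters including the intercept.

```python
import numpy as np

def leverage_and_cooks_distance(X, y):
    """Return (leverage, Cook's distance) for an OLS fit with intercept."""
    y = np.asarray(y, dtype=float)
    X_b = np.column_stack([np.ones(len(y)), X])
    m, p = X_b.shape
    hat = X_b @ np.linalg.pinv(X_b)            # hat matrix H
    leverage = np.diag(hat)                    # h_ii
    residuals = y - hat @ y                    # e = (I - H) y
    s2 = np.sum(residuals ** 2) / (m - p)      # residual variance estimate
    cooks_d = residuals ** 2 / (p * s2) * leverage / (1.0 - leverage) ** 2
    return leverage, cooks_d

# Hypothetical data on the line y = 1 + 2x, with the last point corrupted.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x
y[-1] += 20.0

leverage, cooks_d = leverage_and_cooks_distance(x, y)
most_influential = int(np.argmax(cooks_d))
print("most influential index:", most_influential)
print("Cook's distances:", np.round(cooks_d, 3))
```

A point needs both a sizeable residual and high leverage to get a large Cook's distance, which is why the corrupted endpoint stands out.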

### Q13: Explain overfitting and underfitting in Linear Regression context.

**Answer:**

**Underfitting (High Bias):**
- Model is too simple
- Poor performance on both training and test data
- High training error, high test error

**Causes:**
- Too few features
- Model doesn't capture relationship (maybe non-linear)
- Too much regularization

**Solutions:**
- Add polynomial features
- Add more relevant features
- Reduce regularization
- Try more complex model

**Overfitting (High Variance):**
- Model is too complex
- Excellent performance on training, poor on test
- Low training error, high test error
- Model learns noise in training data

**Causes:**
- Too many features
- Polynomial features with high degree
- Too little regularization
- Small training set

**Solutions:**
- Add more training data
- Feature selection (remove irrelevant features)
- Regularization (Ridge/Lasso)
- Cross-validation
- Reduce model complexity

**Bias-Variance Tradeoff:**
- Need to balance model complexity
- Total Error = Bias² + Variance + Irreducible Error
- Goal: Find sweet spot (minimum total error)
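
One way to see both failure modes is to fit polynomials of increasing degree to the same noisy data and compare training and test error. The data-generating function, noise level, and degrees below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(size):
    """Noisy samples from a non-linear function (an assumption for this demo)."""
    x = rng.uniform(-1, 1, size=size)
    y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=size)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 5, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Training error can only go down as the degree grows. The straight line underfits (high error on both sets), and a widening gap between training and test error at higher degrees is the usual sign of overfitting.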

### Q14: What is regularization? Explain Ridge and Lasso regression.

**Answer:**

**Regularization**: Technique to prevent overfitting by adding a penalty term to the cost function.

**Ridge Regression (L2 Regularization):**
$$J(\theta) = MSE + \lambda\sum_{j=1}^{n}\theta_j^2$$

- Penalty: Sum of squared coefficients
- **Effect**: Shrinks coefficients toward zero (but never exactly zero)
- Keeps all features
- Works well with multicollinearity
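- Normal equation: $\theta = (X^TX + \lambda I)^{-1}X^Ty$ (with $\lambda$ rescaled to absorb the constant factor in the MSE; the intercept is usually left unpenalized)
- Note: For $\lambda > 0$, $X^TX + \lambda I$ is always invertible, even if $X^TX$ is singular!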

**Lasso Regression (L1 Regularization):**
$$J(\theta) = MSE + \lambda\sum_{j=1}^{n}|\theta_j|$$

- Penalty: Sum of absolute coefficients
- **Effect**: Can shrink coefficients to exactly zero
- Performs automatic feature selection
- Creates sparse models
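- No closed-form solution in general: the L1 penalty is not differentiable at zero, so it is usually solved with coordinate descent or proximal/subgradient methods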

**Elastic Net:**
$$J(\theta) = MSE + r\lambda\sum|\theta_j| + \frac{(1-r)}{2}\lambda\sum\theta_j^2$$

- Combines L1 and L2
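- Balances Ridge's stability with correlated features against Lasso's sparsity; $r$ is the mixing ratio ($r = 1$ gives a pure L1 penalty, $r = 0$ a pure L2 penalty)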

**Choosing λ:**
- Use cross-validation
- Higher λ → more regularization → simpler model
- λ = 0 → standard linear regression

**When to use:**
- **Ridge**: Many features, multicollinearity, want to keep all features
- **Lasso**: Want feature selection, sparse model
- **Elastic Net**: Many correlated features, want feature selection
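
The contrast between shrinking and zeroing coefficients shows up in a small experiment. This sketch makes several assumptions: the data is synthetic and centered (so no intercept is fitted), Ridge minimizes $\|X\theta - y\|^2 + \lambda\|\theta\|^2$, and Lasso minimizes $\frac{1}{2m}\|X\theta - y\|^2 + \lambda\|\theta\|_1$ via a basic coordinate descent with soft-thresholding. The helper names and $\lambda$ values are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iterations=200):
    """Minimize (1/2m) * ||X theta - y||^2 + lam * ||theta||_1."""
    m, n = X.shape
    theta = np.zeros(n)
    col_scale = np.sum(X ** 2, axis=0) / m
    for _ in range(n_iterations):
        for j in range(n):
            # Residual with feature j's current contribution added back.
            partial_residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_residual / m
            theta[j] = soft_threshold(rho, lam) / col_scale[j]
    return theta

rng = np.random.default_rng(7)
m = 200
X = rng.normal(size=(m, 5))
X -= X.mean(axis=0)
true_theta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # only two relevant features
y = X @ true_theta + rng.normal(scale=0.5, size=m)
y -= y.mean()

theta_ols = ridge_closed_form(X, y, lam=0.0)
theta_ridge = ridge_closed_form(X, y, lam=50.0)
theta_lasso = lasso_coordinate_descent(X, y, lam=0.3)

print("OLS:  ", np.round(theta_ols, 3))
print("Ridge:", np.round(theta_ridge, 3))
print("Lasso:", np.round(theta_lasso, 3))
```

Ridge pulls every coefficient toward zero without eliminating any, while Lasso sets the irrelevant ones to exactly zero at the cost of also shrinking the relevant ones.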

### Q15: How would you implement Linear Regression from scratch in a coding interview?

**Answer:**

Here's a concise implementation using Normal Equation:

```python
import numpy as np

class LinearRegression:
    def __init__(self):
        self.theta = None
    
    def fit(self, X, y):
        """Train using Normal Equation."""
        # Add intercept term
        m = X.shape[0]
        X_b = np.c_[np.ones((m, 1)), X]
        
        # Normal equation: θ = (X^T X)^-1 X^T y
        self.theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        m = X.shape[0]
        X_b = np.c_[np.ones((m, 1)), X]
        return X_b @ self.theta
    
    def score(self, X, y):
        """Calculate R² score."""
        y_pred = self.predict(X)
        ss_tot = np.sum((y - y.mean()) ** 2)
        ss_res = np.sum((y - y_pred) ** 2)
        return 1 - (ss_res / ss_tot)

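# Usage (synthetic data, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4.0 + X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = X[:80], X[80:], y[:80], y[80:]

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
```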
```

**Key points to mention:**
- Add bias/intercept term (column of ones)
- Use vectorized operations
- Could use `np.linalg.pinv()` for stability
- Should handle edge cases (empty data, singular matrix)
- Could add input validation
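
If the interviewer asks for the hardened version, the following points could be sketched roughly as below. `RobustLinearRegression` is a hypothetical name; the class swaps the explicit inverse for `np.linalg.lstsq` (which also copes with singular $X^TX$) and adds a few basic input checks. It is one possible shape, not a complete implementation.

```python
import numpy as np

class RobustLinearRegression:
    """Least-squares regression with basic validation (illustrative sketch)."""

    def __init__(self):
        self.theta = None

    @staticmethod
    def _add_intercept(X):
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:                      # allow a single feature as a 1-D array
            X = X.reshape(-1, 1)
        return np.column_stack([np.ones(X.shape[0]), X])

    def fit(self, X, y):
        X_b = self._add_intercept(X)
        y = np.asarray(y, dtype=float)
        if X_b.shape[0] == 0:
            raise ValueError("X must contain at least one sample")
        if X_b.shape[0] != y.shape[0]:
            raise ValueError("X and y must have the same number of rows")
        # lstsq returns a least-squares solution even for rank-deficient X.
        self.theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
        return self

    def predict(self, X):
        if self.theta is None:
            raise RuntimeError("call fit() before predict()")
        return self._add_intercept(X) @ self.theta

# Usage with a duplicated feature, where inverting X^T X would fail.
x = np.array([1.0, 2.0, 3.0, 4.0])
X_dup = np.column_stack([x, x])
y_dup = 1.0 + 2.0 * x
robust_model = RobustLinearRegression().fit(X_dup, y_dup)
print(robust_model.predict(X_dup))
```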