# Multiple Linear Regression (Detailed Video Explanation)

---

## 1. Introduction and Context

The video starts by placing **Multiple Linear Regression (MLR)** within the broader topic of **Regression**.

### Recap
- The previous session covered **Simple Linear Regression (SLR)**.
- SLR works with datasets having **only one input feature**.

### Types of Regression
1. Simple Linear Regression  
2. **Multiple Linear Regression** (focus of this video)  
3. Polynomial Regression (covered later)

---

## 2. From Simple to Multiple Regression

### Simple Linear Regression (SLR)
- **Scenario:** One independent variable and one dependent variable.
- **Example:** Predicting **Salary (Package)** using **CGPA** only.
- **Geometry:** Data plotted on a 2D graph ($X$ vs $Y$).
- **Model:** A straight line.
- **Goal:** Find the best line equation  
  \[
  y = mx + b
  \]
  where  
  - $m$ = slope  
  - $b$ = intercept  

---

### Multiple Linear Regression (MLR)
- **Scenario:** Real-world outcomes depend on multiple factors.
- **Example:** Predicting **Salary** using **CGPA, Gender, IQ**.
- **Significance:**  
  - MLR is the **general form** of linear regression.
  - SLR is a special case of MLR with only one input.
  - Real-world datasets usually have more than one input feature, so MLR is what is typically needed in practice.

---

## 3. Geometric Intuition: The Hyperplane

### 3D Visualization
- Inputs:  
  - $X_1$ = CGPA  
  - $X_2$ = IQ  
- Output:  
  - $Y$ = Salary  
- Requires a **3D plot** (X, Y, Z axes).

### Intuition
- Data points are not on a flat sheet; they are **floating in space**.
- In SLR (2D): we draw a **line**.
- In MLR (3D): we draw a **plane**.

### Objective
- The algorithm positions the plane so that the **sum of squared vertical distances (residuals) between the data points and the plane is minimized**.

### Higher Dimensions
- 1 input → Line  
- 2 inputs → Plane  
- $n$ inputs → **Hyperplane**  

Visualization becomes impossible, but the math remains the same.

---

## 4. Mathematical Formulation

### Simple Linear Regression
\[
y = mx + b
\]

### Multiple Linear Regression
\[
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n
\]

### Term Explanation
- $y$ → predicted output (e.g., Salary)
- $\beta_0$ → intercept (baseline value of $y$)
- $\beta_1, \beta_2, \dots, \beta_n$ → coefficients (weights)
- $x_1, x_2, \dots, x_n$ → input features

### Training Goal
- Learn optimal values of  
  \[
  \beta_0, \beta_1, \beta_2, \dots, \beta_n
  \]
- For $n$ input features, the model learns **$n+1$ parameters**.
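As a quick check of the parameter count, the sketch below fits scikit-learn's `LinearRegression` on random data. The feature count (`n_features = 4`), the sample count, and the data itself are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 50 samples, 4 input features (random values, illustration only)
rng = np.random.default_rng(0)
n_features = 4
X = rng.normal(size=(50, n_features))
y = rng.normal(size=50)

model = LinearRegression().fit(X, y)

# coef_ holds beta_1..beta_n, intercept_ holds beta_0
n_params = model.coef_.size + 1
print(model.coef_.shape)  # (4,)
print(n_params)           # 5
```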

---

## 5. Interpretation of Coefficients

### Example Model
\[
\text{Salary} = \beta_0 + \beta_1(\text{CGPA}) + \beta_2(\text{IQ})
\]

### Meaning
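- **Magnitude of $\beta$** indicates importance, provided the features are on comparable scales (for example, after standardization).
- Each $\beta_j$ is the change in predicted Salary for a one-unit increase in feature $j$, with the other features held fixed.
- Large $\beta_1$ and small $\beta_2$ → CGPA matters more than IQ (again assuming comparable scales).
- Coefficient ≈ 0 → feature has little or no impact.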

### Prediction
- Multiply feature values with their coefficients.
- Add the intercept $\beta_0$.
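The two prediction steps can be sketched in a few lines. The coefficient values and the student's CGPA and IQ below are made-up numbers for illustration, not results from the video.

```python
import numpy as np

# Hypothetical learned parameters (illustrative values only)
beta_0 = 1.5                     # intercept
betas = np.array([0.8, 0.02])    # [beta_1 for CGPA, beta_2 for IQ]

# One hypothetical student: CGPA = 8.0, IQ = 110
x = np.array([8.0, 110.0])

# Multiply each feature by its coefficient, sum, then add the intercept
salary = beta_0 + np.dot(betas, x)
print(salary)
```

Here the small IQ coefficient still contributes 0.02 × 110 = 2.2 to the prediction because IQ values are large, which is why raw coefficient magnitudes are only comparable when the features share a similar scale.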

---

## 6. Coding Demonstration (Python)

### Dataset Creation
- Used `sklearn.datasets.make_regression`
- Parameters:
  - 100 samples
  - 2 input features
  - 1 output variable

### Visualization
- Used **Plotly** for interactive 3D scatter plot.
- Shows data points floating in 3D space.

### Model Training
- Imported `LinearRegression` from `sklearn.linear_model`
- Model fitting:
  ```python
  lr.fit(X_train, y_train)
  ```

### Results

- Predictions made on test data.
- Metrics calculated:
  - $R^2$ Score
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)

### Coefficients

- `lr.coef_` returns an array holding $\beta_1$ and $\beta_2$.
- `lr.intercept_` returns $\beta_0$.

### Plane Visualization

- A transparent 2D plane slicing through the 3D scatter plot.
- Confirms the geometric intuition of MLR.

### Notebook Code

```python
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```

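```python
# Generate 100 samples with 2 informative features and Gaussian noise (std = 50).
# No random_state is set, so the numbers shown below change from run to run.
X, y = make_regression(n_samples=100, n_features=2, n_informative=2, n_targets=1, noise=50)

df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})
```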

```python
df.shape
```

Output: `(100, 3)`

```python
df.head()
```

|  | feature1 | feature2 | target |
|----|----|----|----|
| 0 | -1.741814 | 1.524223 | 54.310125 |
| 1 | -0.03914 | -0.253922 | -74.895138 |
| 2 | -0.27799 | 0.365484 | 74.833617 |
| 3 | -0.725075 | -0.407639 | -21.057108 |
| 4 | -0.421594 | 1.859432 | 187.281502 |


```python
# Interactive 3D scatter plot of the two features against the target
fig = px.scatter_3d(df, x='feature1', y='feature2', z='target')
fig.show()
```

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

lr = LinearRegression()
lr.fit(X_train, y_train)
```

```python
y_pred = lr.predict(X_test)

print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2 score", r2_score(y_test, y_pred))
```

Output:

```text
MAE 35.79163835101106
MSE 2120.6110818929483
R2 score 0.8064841263319188
```


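```python
# Build a 10 x 10 grid over the feature range to draw the fitted plane.
# Separate names are used so the target array `y` is not overwritten.
x_axis = np.linspace(-5, 5, 10)
y_axis = np.linspace(-5, 5, 10)
x_grid, y_grid = np.meshgrid(x_axis, y_axis)

# Each row is one (feature1, feature2) grid point
grid_points = np.column_stack((x_grid.ravel(), y_grid.ravel()))

# Predicted target at every grid point, reshaped back to the grid
z_grid = lr.predict(grid_points).reshape(10, 10)

# Scatter plot of the data with the fitted plane drawn through it
fig = px.scatter_3d(df, x='feature1', y='feature2', z='target')
fig.add_trace(go.Surface(x=x_axis, y=y_axis, z=z_grid))
fig.show()
```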

```python
lr.coef_
```

Output: `array([37.88786443, 77.24160398])`

```python
lr.intercept_
```

Output: `np.float64(-1.438962736882175)`

# Multiple Linear Regression — Mathematical Derivation (From Scratch)

## 1. Introduction

- In the previous video:
  - Covered **Multiple Linear Regression (MLR)** intuition
  - Looked at **geometry (plane / hyperplane)**
  - Saw **basic coding**
- In this video:
  - Focus is **pure mathematics**
  - Goal: derive **MLR formulas from scratch**
  - Understand **where β (coefficients) come from**

Regression types recap:
1. Simple Linear Regression (SLR)
2. **Multiple Linear Regression (MLR)**
3. Polynomial Regression (later)

---

## 2. Problem Setup

### Data
- Inputs (features):
  - CGPA
  - IQ
  - Gender
- Output (target):
  - Salary

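Let:
- Number of students = **n**
- Number of features = **m**

Note that the earlier notes used $n$ for the number of features; from here on, $n$ counts samples and $m$ counts features.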

---

## 3. Model Equation

### Scalar Form

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m
$$

- $\beta_0$: intercept  
- $\beta_1, \dots, \beta_m$: coefficients  
- $x_1, \dots, x_m$: input features  

---


## 4. Matrix Representation

### Input Matrix (Design Matrix)

Add a column of **1s** for intercept.

\[
X =
\begin{bmatrix}
1 & x_{11} & x_{12} & \dots & x_{1m} \\
1 & x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nm}
\end{bmatrix}
\]

Shape:
\[
X \in \mathbb{R}^{n \times (m+1)}
\]

---

### Coefficient Vector

\[
\beta =
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_m
\end{bmatrix}
\]

Shape:
\[
\beta \in \mathbb{R}^{(m+1) \times 1}
\]

---

### Target Vector

\[
y =
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{bmatrix}
\]

---

### Prediction Equation

\[
\hat{y} = X\beta
\]

This gives predictions for **all students at once**.
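A minimal sketch of this matrix product in NumPy follows. The three students, their feature values, and the coefficient vector are all hypothetical numbers used only to show the shapes.

```python
import numpy as np

# Hypothetical raw features for n = 3 students, m = 2 features (CGPA, IQ)
features = np.array([
    [8.0, 110.0],
    [6.5,  95.0],
    [9.1, 120.0],
])

# Prepend a column of 1s so beta_0 is handled by the same matrix product
X = np.column_stack([np.ones(len(features)), features])  # shape (n, m + 1)

# Hypothetical coefficient vector [beta_0, beta_1, beta_2]
beta = np.array([1.5, 0.8, 0.02])

y_hat = X @ beta  # one prediction per student
print(X.shape, y_hat.shape)
```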

---

## 5. Residuals and Error Vector

The error vector represents the difference between the observed values and the predicted values.

$$
e = y - \hat{y} = y - X\beta
$$

---

## 6. Loss Function (Cost Function)

The goal is to minimize the error. We use the **Sum of Squared Errors (SSE)**.

### Scalar Form (from Simple Linear Regression)
$$
J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

### Matrix Form
$$
J(\beta) = (y - X\beta)^T (y - X\beta)
$$
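To see that the scalar and matrix forms agree, the sketch below evaluates both on random data. The sizes, the seed, and the arbitrary (not optimal) `beta` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # design matrix with intercept column
y = rng.normal(size=n)
beta = rng.normal(size=m + 1)  # an arbitrary coefficient vector

e = y - X @ beta  # residual vector

# Scalar form: sum of squared residuals, one sample at a time
sse_scalar = sum((y[i] - X[i] @ beta) ** 2 for i in range(n))

# Matrix form: e^T e
sse_matrix = e @ e

print(np.isclose(sse_scalar, sse_matrix))
```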

---

## 7. Expanding the Loss Function

Expanding the matrix notation gives us the quadratic form of the equation:

$$
J(\beta) = y^T y - 2 y^T X\beta + \beta^T X^T X \beta
$$
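The intermediate steps, written out using only $(AB)^T = B^T A^T$ and the fact that a scalar equals its own transpose:

$$
\begin{aligned}
J(\beta) &= (y - X\beta)^T (y - X\beta) \\
&= y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X \beta \\
&= y^T y - 2 y^T X\beta + \beta^T X^T X \beta
\end{aligned}
$$

The last step holds because $\beta^T X^T y$ is a $1 \times 1$ quantity, so $\beta^T X^T y = (\beta^T X^T y)^T = y^T X\beta$.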

---

## 8. Optimization Objective

To find the best-fit hyperplane, we minimize the cost function:

$$
\min_\beta J(\beta)
$$

**Strategy:** Take the derivative with respect to $\beta$ and set it to zero.

---

## 9. Differentiation (Key Result)

Using matrix differentiation rules:

$$
\frac{\partial J}{\partial \beta} = -2 X^T y + 2 X^T X \beta
$$

Set the derivative to zero:

$$
X^T X \beta = X^T y
$$
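The two rules used here are $\frac{\partial}{\partial \beta}(a^T \beta) = a$ and $\frac{\partial}{\partial \beta}(\beta^T A \beta) = 2A\beta$ for symmetric $A$. As a sanity check, the sketch below compares the analytic gradient with a finite-difference estimate; the data, seed, and step size `eps` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = rng.normal(size=n)
beta = rng.normal(size=m + 1)

def J(b):
    """Sum of squared errors for coefficient vector b."""
    r = y - X @ b
    return r @ r

# Analytic gradient: -2 X^T y + 2 X^T X beta
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros_like(beta)
for k in range(beta.size):
    step = np.zeros_like(beta)
    step[k] = eps
    grad_numeric[k] = (J(beta + step) - J(beta - step)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))
```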

---

## 10. Normal Equation (Closed-Form Solution)

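Assuming that the matrix $X^T X$ is invertible (the columns of $X$ are linearly independent, which requires $n \ge m+1$ and no perfectly collinear features), we can solve for $\beta$: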

> **Final Formula:**
> $$
> \beta = (X^T X)^{-1} X^T y
> $$

This is the **Ordinary Least Squares (OLS)** solution.
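A minimal sketch of the closed-form solution, checked against scikit-learn. The synthetic dataset and `random_state=0` are assumptions. `np.linalg.solve` is used on the system $X^T X \beta = X^T y$ instead of forming the inverse explicitly, which gives the same $\beta$ and is generally better behaved numerically.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data; random_state is an arbitrary choice for reproducibility
X_raw, y = make_regression(n_samples=100, n_features=2, noise=50, random_state=0)

# Design matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Solve (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Compare with scikit-learn: beta[0] is the intercept, beta[1:] the coefficients
lr = LinearRegression().fit(X_raw, y)
print(np.allclose(beta[0], lr.intercept_), np.allclose(beta[1:], lr.coef_))
```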

---

## 11. Interpretation

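- \(X^T X\): Gram matrix of the features; it captures how the features (including the intercept column) overlap with each other  
- \((X^T X)^{-1}\): adjusts for feature correlation  
- \(X^T y\): inner product of each feature with the output, a measure of how each feature co-varies with the target (a true correlation only after centering and scaling)  
- \(\beta\): optimal coefficients minimizing squared error  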

---

## 12. Shape Validation (Important)

| Term | Shape |
|----|----|
| \(X\) | \(n \times (m+1)\) |
| \(X^T\) | \((m+1) \times n\) |
| \(X^T X\) | \((m+1) \times (m+1)\) |
| \((X^T X)^{-1}\) | \((m+1) \times (m+1)\) |
| \(X^T y\) | \((m+1) \times 1\) |
| \(\beta\) | \((m+1) \times 1\) |

✔ Dimensions match → equation is valid.
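The same shape check can be run in NumPy. The sizes `n = 6` and `m = 3` and the random data are hypothetical; `y` is kept as a column vector to mirror the table.

```python
import numpy as np

n, m = 6, 3  # hypothetical: 6 students, 3 features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = rng.normal(size=(n, 1))  # column vector

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
Xty = X.T @ y
beta = XtX_inv @ Xty

for name, arr in [("X", X), ("X^T", X.T), ("X^T X", XtX),
                  ("(X^T X)^-1", XtX_inv), ("X^T y", Xty), ("beta", beta)]:
    print(f"{name:12s} {arr.shape}")
```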

---

## 13. Why Not Always Use Normal Equation?

### Problem
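- Inverting the \((m+1) \times (m+1)\) matrix \(X^T X\) costs roughly:
\[
O(m^3)
\]
- Forming \(X^T X\) adds another \(O(n m^2)\).
- Very slow when the **number of features** is large.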

---

## 14. Gradient Descent (Alternative)

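Instead of using the closed-form formula:
- Start with a random \(\beta\)
- Iteratively update:
\[
\beta := \beta - \alpha \nabla J(\beta)
\]
where \(\alpha\) is the learning rate and \(\nabla J(\beta) = -2 X^T y + 2 X^T X \beta\) is the gradient derived in Section 9.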

### Pros
- Scales to **large datasets**, especially in mini-batch or stochastic form
- No matrix inversion

### Cons
- Reaches the optimum only approximately, after enough iterations
- Needs learning rate tuning
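A minimal batch gradient descent sketch for the update rule above. The synthetic data, `true_beta`, the learning rate `alpha = 0.1`, and the iteration count are all assumed values. The gradient is divided by `n` (so it is the gradient of the mean rather than the sum of squared errors), a common scaling that keeps a reasonable `alpha` independent of dataset size; this is a choice of this sketch, not something stated in the video.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
true_beta = np.array([2.0, -1.0, 0.5, 3.0])  # hypothetical ground truth
y = X @ true_beta + rng.normal(scale=0.1, size=n)

alpha = 0.1     # learning rate (assumed; needs tuning in general)
n_iters = 1000
beta = rng.normal(size=m + 1)  # random start

for _ in range(n_iters):
    # Gradient of the SSE scaled by 1/n: (2/n) X^T (X beta - y)
    grad = (2 / n) * X.T @ (X @ beta - y)
    beta = beta - alpha * grad

# Reference solution from a direct least-squares solver
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta, beta_ols, atol=1e-6))
```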

---

## 15. Practical Usage (scikit-learn)

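- `LinearRegression` → solves the **OLS** problem directly with a least-squares solver (SVD-based), rather than explicitly inverting \(X^T X\)
- `SGDRegressor` → uses **Stochastic Gradient Descent** (updates \(\beta\) one sample at a time)

Rule of thumb:
- Small / medium number of features → `LinearRegression`
- Huge / high-dimensional data → `SGDRegressor`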
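The sketch below fits both estimators on the same synthetic data. The dataset sizes, `random_state` values, and `SGDRegressor` settings are assumptions for illustration. A `StandardScaler` is added in front of `SGDRegressor` because gradient-based fitting is sensitive to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Direct least-squares solution
ols = LinearRegression().fit(X_train, y_train)

# Iterative solution; features are standardized first
sgd = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
sgd.fit(X_train, y_train)

# score() returns R^2 on the held-out data
r2_ols = ols.score(X_test, y_test)
r2_sgd = sgd.score(X_test, y_test)
print(f"OLS R2: {r2_ols:.3f}  SGD R2: {r2_sgd:.3f}")
```

On a small problem like this the two should score almost identically; the difference in training cost only shows up at much larger scale.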

---

## 16. Final Takeaways

- Multiple Linear Regression is **just linear algebra**
- Core equation:
\[
\beta = (X^T X)^{-1} X^T y
\]
- Matrix form simplifies everything
- Gradient Descent is the practical alternative when **solving the normal equation is too expensive**
- Understanding the math → confidence in ML algorithms

---