# 📘 Understanding Linear Regression from Scratch

## 📌 1. Introduction to Linear Regression

Linear Regression predicts a continuous target variable $y$ as a weighted sum of the input features $x_1, \dots, x_n$ plus a bias term. For a single sample, the model is:

$$
y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
$$

or, in vectorized form over all $m$ samples:

$$
\mathbf{y} = \mathbf{X} \mathbf{w} + b
$$

**Where:**  
- $ \mathbf{X} $ is the $ m \times n $ input feature matrix (one row per sample).  
- $ \mathbf{w} $ is the weight vector of length $ n $.  
- $ b $ is the scalar bias term, added to every sample.  
- $ \mathbf{y} $ is the vector of $ m $ target values.  
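
As a minimal sketch of the vectorized form, the prediction can be written in one line with NumPy. The function name `predict` and the toy values for `X`, `w`, and `b` are assumptions made for illustration only.

```python
import numpy as np

def predict(X, w, b):
    """Return X @ w + b for an (m, n) feature matrix X and length-n weights w."""
    return X @ w + b

# Toy data (hypothetical): 3 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0

y_hat = predict(X, w, b)  # one prediction per row of X
```

The scalar `b` is broadcast across all samples, which matches adding the same bias to every row.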


## 📌 2. Cost Function (Mean Squared Error)

To measure how far the predictions are from the actual targets, we use **Mean Squared Error (MSE)**:

$$
MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
$$

**Where:**  
- $ y_i $ is the actual value.  
- $ \hat{y}_i $ is the predicted value.  
- $ m $ is the number of samples.  
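
The formula above translates almost directly into NumPy. This is a small sketch; the helper name `mse` and the sample arrays are illustrative assumptions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical values: errors are 1, 0, and -2
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

loss = mse(y_true, y_pred)  # (1 + 0 + 4) / 3
```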



## 📌 3. Gradient Descent Optimization

### **Step 1: Compute Predictions**
Predicted values are:

$$
\hat{y} = \mathbf{X} \mathbf{w} + b
$$

### **Step 2: Compute Gradients**
The gradients of the MSE with respect to $\mathbf{w}$ and $b$ are:

$$
\frac{\partial MSE}{\partial w} = \frac{2}{m} \mathbf{X}^T (\mathbf{X} \mathbf{w} + b - \mathbf{y})
$$

$$
\frac{\partial MSE}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
$$

**Where:**  
- $ \mathbf{X}^T $ is the transpose of $ \mathbf{X} $.  
- $ \mathbf{X} \mathbf{w} + b $ is the predicted output $ \hat{y} $ from Step 1, so the term in parentheses is the prediction error.  
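
The two gradient formulas can be sketched as follows. The function name `gradients` and the toy data are assumptions for illustration; the computation follows the equations in this step.

```python
import numpy as np

def gradients(X, y, w, b):
    """Return (dMSE/dw, dMSE/db) for the model X @ w + b."""
    m = X.shape[0]
    error = X @ w + b - y                 # prediction minus target, shape (m,)
    grad_w = (2.0 / m) * (X.T @ error)    # shape (n,)
    grad_b = (2.0 / m) * np.sum(error)    # scalar
    return grad_w, grad_b

# Toy data (hypothetical): y = 2x, starting from w = 0, b = 0
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

grad_w, grad_b = gradients(X, y, w=np.zeros(1), b=0.0)
```

Both gradients are negative here, which means increasing `w` and `b` from zero would reduce the loss.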

### **Step 3: Update Parameters**
Move each parameter a small step against its gradient, scaled by the learning rate $ \alpha $:

$$
w := w - \alpha \cdot \frac{\partial MSE}{\partial w}
$$

$$
b := b - \alpha \cdot \frac{\partial MSE}{\partial b}
$$
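
Putting the three steps together gives a basic training loop. The sketch below assumes zero initialization, a fixed number of iterations, and a learning rate of `0.05`; these choices, along with the name `fit_gradient_descent`, are illustrative and not prescribed by the text. It also records the loss at each iteration so it can be plotted later.

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.05, n_iters=2000):
    """Fit w and b by batch gradient descent on the MSE."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    history = []                                  # MSE before each update
    for _ in range(n_iters):
        error = X @ w + b - y                     # Step 1: predictions minus targets
        history.append(np.mean(error ** 2))
        grad_w = (2.0 / m) * (X.T @ error)        # Step 2: gradients
        grad_b = (2.0 / m) * np.sum(error)
        w = w - alpha * grad_w                    # Step 3: parameter updates
        b = b - alpha * grad_b
    return w, b, history

# Noise-free toy data (hypothetical) generated from y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 1.0

w, b, history = fit_gradient_descent(X, y)
```

If the learning rate is too large for the data, the loss can grow instead of shrink, so `alpha` usually needs tuning or feature scaling in practice.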



## 📌 4. Closed-Form Solution (Normal Equation)

Instead of iterating, the **Normal Equation** computes the optimal weights in a single step:

$$
\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

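This form assumes $ \mathbf{X} $ has been augmented with a leading column of ones, so the bias is absorbed into the weight vector as its first component:

$$
b = w_0
$$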

**Where:**  
- $ (\mathbf{X}^T \mathbf{X})^{-1} $ is the inverse of $ \mathbf{X}^T \mathbf{X} $, which exists only when the columns of $ \mathbf{X} $ are linearly independent.  
- $ \mathbf{X}^T \mathbf{y} $ is the matrix-vector product of $ \mathbf{X}^T $ and $ \mathbf{y} $.  
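
A minimal sketch of the closed-form fit, assuming the bias is handled by prepending a column of ones to $ \mathbf{X} $. Note one deliberate substitution: rather than forming the inverse explicitly, the code solves the equivalent linear system with `np.linalg.solve`, which is generally more stable numerically. The function name `fit_normal_equation` is hypothetical.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X_aug^T X_aug) theta = X_aug^T y, where X_aug = [1, X]."""
    m = X.shape[0]
    X_aug = np.hstack([np.ones((m, 1)), X])       # leading column of ones for the bias
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    b, w = theta[0], theta[1:]                    # first component is the bias
    return w, b

# Noise-free toy data (hypothetical) generated from y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 1.0

w, b = fit_normal_equation(X, y)
```

`np.linalg.solve` raises `LinAlgError` when the matrix is singular, which corresponds to the case where the inverse in the formula does not exist.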



## 📌 5. Model Evaluation Metrics

### **5.1 Mean Squared Error (MSE)**
$$
MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
$$

### **5.2 R² Score (Coefficient of Determination)**
$$
R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}
$$

**Where:**  
- $ \bar{y} $ is the mean of $ y $.  
- $ R^2 $ measures the fraction of the variance in $ y $ that the model explains: 1 is a perfect fit, and 0 is no better than always predicting $ \bar{y} $.  
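
The R² formula can be sketched in a few lines. The helper name `r2_score` and the sample values are assumptions for illustration; the two calls show the boundary cases of a perfect model and a model that always predicts the mean.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_score(y_true, y_true)                       # predictions equal targets
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))       # always predict the mean
```

This sketch does not guard against constant targets, where the denominator is zero.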


## 📌 6. Data Visualization Methods

### **6.1 Loss Curve**
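To visualize the progress of **Gradient Descent**, we plot the loss after each iteration. With a suitable learning rate, the curve should fall and then flatten as the parameters converge.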

$$
\text{Loss} = MSE
$$

### **6.2 Actual vs. Predicted Values**
A scatter plot is used to compare **predicted vs actual values**. A perfect model has points on the line $ y = x $.

### **6.3 Residuals Plot**
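Residuals ($ y - \hat{y} $) should be randomly distributed around zero, with no visible trend. A systematic pattern suggests the linear model is missing structure in the data.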
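
Before plotting, the residuals themselves can be computed as below. This is a sketch on synthetic data: the slope, intercept, noise level, and random seed are arbitrary assumptions, and the fit uses `np.linalg.lstsq` with an intercept column for brevity.

```python
import numpy as np

# Synthetic data (hypothetical): a line plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)

# Least-squares fit with an intercept column
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ theta
residuals = y - y_hat          # values to plot against x or y_hat
```

For a least-squares fit that includes an intercept, the residuals average to zero by construction, so the plot is mainly useful for spotting curvature or changing spread.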


## 📌 7. Conclusion

- **Gradient Descent** iteratively optimizes weights.  
- **Normal Equation** computes weights directly but becomes expensive when the number of features is large, since it requires inverting $ \mathbf{X}^T \mathbf{X} $.  
- **MSE and R² Score** measure performance.  
- **Loss curves and residual plots** help visualize model behavior.  

Linear Regression remains a **powerful yet simple** algorithm. 🚀