# **Summer Session: Machine Learning for High School Students**  
## **Week 6: Linear Regression - Theory and Implementation**  

### **1. Introduction to Linear Regression (30 min)**  
**Objective:** Understand the mathematical foundations of linear regression, including different optimization methods.  

#### **1.1 What is Linear Regression?**  
- A supervised learning algorithm for predicting continuous numerical values.  
- **Example Applications:**  
  - Predicting house prices based on square footage.  
  - Estimating student test scores based on study hours.  

#### **1.2 Simple Linear Regression Equation**  
The model assumes a linear relationship between input `X` and output `y`:  

$$
y = \beta_0 + \beta_1 X + \epsilon
$$  

- $y$: Target variable (dependent variable).  
- $X$: Feature (independent variable).  
- $ \beta_0 $: Intercept (bias term).  
- $ \beta_1 $: Slope (coefficient).  
- $ \epsilon $: Error term (the part of $y$ the line does not explain; the residuals are its estimates).  
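
To make the equation concrete, the sketch below applies it to a few inputs. The values of `beta_0` and `beta_1` are made up for illustration and are not fitted to any data, and `predict` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 100_000.0  # intercept: predicted price at 0 square feet
beta_1 = 200.0      # slope: price increase per extra square foot

def predict(X, beta_0, beta_1):
    """Return the model's prediction beta_0 + beta_1 * X (no error term)."""
    return beta_0 + beta_1 * np.asarray(X, dtype=float)

square_feet = [1000, 1500, 2000]
predicted_prices = predict(square_feet, beta_0, beta_1)
print(predicted_prices)  # [300000. 400000. 500000.]
```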

#### **1.3 Cost Function: Mean Squared Error (MSE)**  
The goal is to minimize the difference between predicted and actual values:  

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- $ n $: Number of data points.  
- $ y_i $: Actual value.  
- $ \hat{y}_i $: Predicted value.  
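
The formula can be checked by hand on a tiny example. The sketch below uses made-up test scores, and `compute_mse` is a hypothetical helper name rather than a library function.

```python
import numpy as np

def compute_mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up test scores, for illustration only
y_actual = [70, 80, 90]
y_hat = [72, 78, 93]

# The errors are -2, 2, -3, so the squares are 4, 4, 9 and the mean is 17/3
mse = compute_mse(y_actual, y_hat)
print(mse)
```

Squaring the errors means that over-predictions and under-predictions cannot cancel each other out, and that large errors are penalized more than small ones.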

#### **1.4 Optimization Methods**  

##### **Method 1: Normal Equation (Closed-Form Solution)**  
- Directly computes optimal coefficients without iteration.  
- Formula:  

$$
\beta = (X^T X)^{-1} X^T y
$$  

**Advantages:**  
✔ Exact solution (no approximation).  
✔ Works well for small datasets.  

**Disadvantages:**  
✖ Computationally expensive when there are many features (inverting $X^T X$ costs roughly $O(p^3)$ for $p$ features).  
✖ Requires matrix inversion (fails if $X^T X$ is singular).  
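
The singular case can be illustrated with a small sketch. The data below is made up so that the second feature is exactly twice the first, and no intercept column is used to keep the example short. Using the pseudoinverse (`np.linalg.pinv`) is one common workaround, shown here as an assumption rather than as part of the lesson's method.

```python
import numpy as np

# Hypothetical data: the second column is exactly twice the first,
# so the two features carry the same information.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([3.0, 6.0, 9.0])

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
print(rank)  # 1, which is less than 2, so XtX has no inverse

# The pseudoinverse still returns a least-squares solution
beta = np.linalg.pinv(X) @ y
print(X @ beta)  # predictions match y
```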

##### **Method 2: Gradient Descent (Iterative Approach)**  
- Gradually adjusts coefficients to minimize MSE.  
- Update rule:  

$$
\beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j} MSE
$$

- $\alpha$: Learning rate (controls step size).  
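
For simple linear regression the partial derivatives can be written out explicitly. Writing $x_i$ for the $i$-th input value (a symbol introduced here for illustration) and $\hat{y}_i = \beta_0 + \beta_1 x_i$, differentiating the MSE gives:

$$
\frac{\partial}{\partial \beta_0} MSE = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i),
\qquad
\frac{\partial}{\partial \beta_1} MSE = -\frac{2}{n} \sum_{i=1}^{n} x_i \, (y_i - \hat{y}_i)
$$

Each update therefore moves $\beta_0$ and $\beta_1$ in the direction that shrinks the average error.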

**Advantages:**  
✔ Scalable for large datasets ($O(n)$ per iteration for a fixed number of features).  
✔ Works well with high-dimensional data.  

**Disadvantages:**  
✖ Requires tuning learning rate.  
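✖ Only approximates the optimum and may need many iterations (local minima are not an issue here: the MSE of linear regression is convex, so it has a single global minimum).  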
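
The effect of the learning rate can be seen in a small experiment. This is a minimal sketch on made-up data that lies exactly on the line $y = 1 + 2X$; `run_gradient_descent` is a hypothetical helper, and the two learning rates were picked only to show one run that converges and one that diverges.

```python
import numpy as np

def run_gradient_descent(x, y, learning_rate, epochs=50):
    """Fit y ≈ b0 + b1 * x by gradient descent and return the final MSE."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (b0 + b1 * x) - y  # prediction minus actual value
        b0 -= learning_rate * (2 / n) * np.sum(error)
        b1 -= learning_rate * (2 / n) * np.sum(error * x)
    return np.mean(((b0 + b1 * x) - y) ** 2)

# Made-up data lying exactly on the line y = 1 + 2x
x = np.array([-1.0, 0.0, 1.0])
y = 1.0 + 2.0 * x

mse_small = run_gradient_descent(x, y, learning_rate=0.1)
mse_large = run_gradient_descent(x, y, learning_rate=1.5)
print(mse_small)  # close to 0: the steps settle near the best line
print(mse_large)  # very large: each step overshoots further than the last
```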

---

### **2. Implementing Linear Regression in Python (60 min)**  

#### **Method 1: Using Scikit-learn**  
```python
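import numpy as np
from sklearn.linear_model import LinearRegression

# Example data: square footage -> house price (one row per house)
X_train = np.array([[1000], [1500], [1200], [1800], [2000]])
y_train = np.array([300000, 400000, 350000, 450000, 500000])

model = LinearRegression()  # Solves the least-squares problem directly (no gradient descent)
model.fit(X_train, y_train)
print(f"Slope (β₁): {model.coef_[0]}, Intercept (β₀): {model.intercept_}")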
```
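
Once a model is fitted, it can be used on inputs it has not seen. The sketch below uses the same small house-price data as this section; the 1600 square foot house is a made-up query.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small house-price data: square footage -> price
X_train = np.array([[1000], [1500], [1200], [1800], [2000]])
y_train = np.array([300000, 400000, 350000, 450000, 500000])

model = LinearRegression().fit(X_train, y_train)

# predict expects a 2D array: one row per house, one column per feature
X_new = np.array([[1600]])
predicted_price = model.predict(X_new)[0]
print(f"Predicted price for 1600 sq ft: {predicted_price:.0f}")

# R^2 on the training data (1.0 would be a perfect fit)
r_squared = model.score(X_train, y_train)
print(f"R^2: {r_squared:.3f}")
```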

#### **Method 2: Normal Equation from Scratch**  
```python
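import numpy as np

def normal_equation(X, y):
    """Solve for the coefficients with beta = (X^T X)^{-1} X^T y."""
    X_b = np.c_[np.ones((len(X), 1)), X]  # Add bias term (β₀)
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # Fails if X_b.T @ X_b is singular
    return beta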

# Usage
X = np.array([1000, 1500, 1200, 1800, 2000])
y = np.array([300000, 400000, 350000, 450000, 500000])

beta = normal_equation(X.reshape(-1, 1), y)
print(f"Intercept (β₀): {beta[0]}, Slope (β₁): {beta[1]}")
```
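
A quick way to sanity-check a from-scratch solution is to compare it with a library routine. The sketch below solves the same system with `np.linalg.solve` (which avoids forming the inverse explicitly) and with NumPy's built-in `np.linalg.lstsq`; using these two functions is a suggestion for checking results, not part of the method above.

```python
import numpy as np

X = np.array([1000, 1500, 1200, 1800, 2000], dtype=float)
y = np.array([300000, 400000, 350000, 450000, 500000], dtype=float)

X_b = np.c_[np.ones((len(X), 1)), X]  # column of ones for the intercept

# Normal equation written as a linear system: (X_b^T X_b) beta = X_b^T y
beta_solve = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# NumPy's built-in least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(beta_solve)
print(beta_lstsq)
print(np.allclose(beta_solve, beta_lstsq))  # True if both agree
```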

#### **Method 3: Gradient Descent from Scratch**  
```python
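import numpy as np

def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    n = len(X)
    X_b = np.c_[np.ones((n, 1)), X]  # Add bias term once, outside the loop
    beta = np.random.randn(X_b.shape[1], 1)  # Random initialization, one value per coefficient

    for _ in range(epochs):
        y_pred = X_b @ beta
        gradients = (2/n) * X_b.T @ (y_pred - y)  # Gradient of the MSE
        beta -= learning_rate * gradients

    return beta

# Reuses X and y from Method 2. Standardize the feature first: with raw
# square footage (values in the thousands) the updates blow up.
X_mean, X_std = X.mean(), X.std()
X_scaled = (X - X_mean) / X_std
beta = gradient_descent(X_scaled.reshape(-1, 1), y.reshape(-1, 1))

slope = beta[1][0] / X_std  # Convert back to the original units
intercept = beta[0][0] - slope * X_mean
print(f"Intercept (β₀): {intercept}, Slope (β₁): {slope}")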
```

---

### **3. Hands-on Exercise (30 min)**  
**Task:** Compare Normal Equation and Gradient Descent on the `diabetes` dataset.  

1. Load the dataset:  
   ```python
   from sklearn.datasets import load_diabetes
   data = load_diabetes()
   X, y = data.data, data.target
   ```  
2. Implement both methods.  
3. Compare runtime and coefficients.  
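
For step 3, one possible way to measure runtime is sketched below. `time_fit` and `fit_lstsq` are hypothetical helper names, and the data is synthetic stand-in data with 10 features, used only to check that the helper runs; swap in the diabetes data and your own two implementations.

```python
import time
import numpy as np

def time_fit(fit_function, X, y, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs, plus the last result."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fit_function(X, y)
        best = min(best, time.perf_counter() - start)
    return best, result

def fit_lstsq(X, y):
    # Stand-in fitting function; replace it with your own implementation
    X_b = np.c_[np.ones((len(X), 1)), X]
    return np.linalg.lstsq(X_b, y, rcond=None)[0]

# Synthetic stand-in data: 200 rows, 10 features, intercept of 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ np.arange(1, 11) + 5.0

seconds, beta = time_fit(fit_lstsq, X_demo, y_demo)
print(f"{seconds:.6f} s, intercept ≈ {beta[0]:.2f}")
```

Taking the fastest of several runs reduces the influence of whatever else the computer happens to be doing during the measurement.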

**Discussion Questions:**  
- Which method is faster for small datasets?  
- What happens if $X^T X$ is non-invertible?  
- How does learning rate affect gradient descent?  

---

### **Summary**  
- **Theory:**  
  - Linear regression models relationships using coefficients.  
  - Normal equation is exact but slow when there are many features.  
  - Gradient descent is scalable but requires tuning.  
- **Coding:**  
  - Implemented regression using Scikit-learn, normal equation, and gradient descent.  
- **Next Week:** Decision Trees for classification!