# Build Your Own Linear Regression from Scratch

**Author:** Anik Tahabilder  
**Project:** 3 of 22 - Kaggle ML Portfolio  
**Dataset:** Student Performance  
**Difficulty:** 5/10 | **Learning Value:** 9/10

---

## Why Build from Scratch?

Using `sklearn.linear_model.LinearRegression` is just one line of code. But do you actually understand what's happening under the hood?

In this notebook, we'll:
- **Derive the math** behind linear regression
- **Implement everything using only NumPy** (no sklearn for the model)
- **Visualize** how the algorithm learns
- **Apply it** to predict student performance based on study habits
- **Verify** our implementation matches sklearn

By the end, you'll truly understand:
- What a cost function is and why we use Mean Squared Error
- How gradient descent finds optimal parameters
- Why feature scaling matters
- The closed-form solution (Normal Equation)

---

## The Problem: Predicting Student Performance

We have data about students including:
- Hours studied
- Previous scores
- Sleep hours
- Practice papers completed

**Goal:** Predict the **Performance Index** (final score) using linear regression.

---

## Table of Contents

1. [Part 1: What is Linear Regression?](#part1)
2. [Part 2: Load and Explore the Data](#part2)
3. [Part 3: The Cost Function](#part3)
4. [Part 4: Gradient Descent - The Learning Algorithm](#part4)
5. [Part 5: Implementation from Scratch](#part5)
6. [Part 6: The Normal Equation (Closed-Form Solution)](#part6)
7. [Part 7: Feature Scaling - Why It Matters](#part7)
8. [Part 8: Training on Student Performance Data](#part8)
9. [Part 9: Evaluation Metrics from Scratch](#part9)
10. [Part 10: Verification Against sklearn](#part10)
11. [Part 11: Summary and Key Takeaways](#part11)

---

<a id='part1'></a>
# Part 1: What is Linear Regression?

## The Intuition

Linear regression is one of the **simplest supervised learning algorithms**. Given input features, it predicts a continuous output by fitting a straight line (or hyperplane) through the data.

### Real-World Examples:
- Predicting house prices based on square footage
- Estimating salary based on years of experience
- **Predicting student scores based on study hours** (our problem!)

## The Mathematical Model

### Simple Linear Regression (One Feature)

The equation of a line:

$$\hat{y} = w_0 + w_1 x$$

Where:
- $\hat{y}$ = predicted value (output)
- $x$ = input feature
- $w_0$ = **bias** (y-intercept) - where the line crosses the y-axis
- $w_1$ = **weight** (slope) - how much y changes when x increases by 1

### Multiple Linear Regression (Multiple Features)

With multiple features:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n$$

For our student data:
$$\text{Performance} = w_0 + w_1 \cdot \text{Hours} + w_2 \cdot \text{PrevScores} + w_3 \cdot \text{Sleep} + ...$$

### Matrix Notation (Compact Form)

We can write this more elegantly using matrices:

$$\hat{y} = X \cdot w$$

Where:
- $X$ is the **feature matrix** of shape (m, n+1) - m samples, n features + 1 bias column
- $w$ is the **weight vector** of shape (n+1, 1)

**The Bias Trick:** We add a column of 1s to X so that $w_0 \cdot 1 = w_0$ (the bias term).
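
To make the bias trick concrete, here is a minimal sketch on made-up numbers. The arrays `X_toy` and `w_toy` are illustrative assumptions, not values from the student dataset. It checks that the single matrix product gives the same predictions as writing out $w_0 + w_1 x_1 + w_2 x_2$ by hand.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features (not from the student dataset)
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w_toy = np.array([[0.5], [2.0], [-1.0]])  # [w0, w1, w2] as a column vector

# Bias trick: prepend a column of 1s so w0 is handled by the same product
X_toy_b = np.c_[np.ones((X_toy.shape[0], 1)), X_toy]

# Matrix form: y_hat = X w
y_hat_matrix = X_toy_b @ w_toy

# Written out term by term: w0 + w1 * x1 + w2 * x2
y_hat_manual = (w_toy[0, 0]
                + w_toy[1, 0] * X_toy[:, [0]]
                + w_toy[2, 0] * X_toy[:, [1]])

print("Shape with bias column:", X_toy_b.shape)  # (m, n + 1)
print("Same predictions:", np.allclose(y_hat_matrix, y_hat_manual))
```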

## The Goal

Find the values of $w_0, w_1, ..., w_n$ that make our predictions $\hat{y}$ as close as possible to the actual values $y$.

But how do we measure "close"? That's where the **cost function** comes in.

In [None]:
# Part 1: Setup - NumPy for the model, pandas for loading data, Matplotlib for plots (NO sklearn for the model)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For nice plots
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Set random seed for reproducibility
np.random.seed(42)

print("Setup complete! Using only NumPy for the model.")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

---

<a id='part2'></a>
# Part 2: Load and Explore the Data

Let's load the Student Performance dataset and understand what we're working with.

In [None]:
# Load the dataset
df = pd.read_csv('Student_Performance.csv')

print("Dataset Shape:", df.shape)
print(f"\nWe have {df.shape[0]} students and {df.shape[1]} columns (features plus the target).")
print("\n" + "="*50)
print("First 10 rows:")
df.head(10)

In [None]:
# Dataset info
print("Dataset Info:")
print("="*50)
df.info()
print("\n" + "="*50)
print("\nStatistical Summary:")
df.describe()

In [None]:
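# Check for missing values
missing = df.isnull().sum()
print("Missing Values:")
print(missing)
if missing.sum() == 0:
    print("\nNo missing values - great!")
else:
    print(f"\nFound {missing.sum()} missing values - handle these before training.")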

In [None]:
# Understand the features
print("Feature Analysis:")
print("="*60)
print(f"\n{'Feature':<35} {'Type':<15} {'Range'}")
print("-"*60)
for col in df.columns:
    if df[col].dtype == 'object':
        unique = df[col].unique()
        print(f"{col:<35} {'Categorical':<15} {unique}")
    else:
        print(f"{col:<35} {'Numeric':<15} [{df[col].min():.1f} - {df[col].max():.1f}]")

print("\n" + "="*60)
print("TARGET: Performance Index (what we want to predict)")
print("FEATURES: All other columns (what we use to predict)")

In [None]:
# Visualize the data
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Distribution of target variable
axes[0, 0].hist(df['Performance Index'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Performance Index')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Performance Index (Target)')

# Scatter plots of numeric features vs target
numeric_features = ['Hours Studied', 'Previous Scores', 'Sleep Hours', 'Sample Question Papers Practiced']

for i, feature in enumerate(numeric_features):
    row, col = (i + 1) // 3, (i + 1) % 3
    axes[row, col].scatter(df[feature], df['Performance Index'], alpha=0.5, edgecolors='black', linewidth=0.3)
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Performance Index')
    axes[row, col].set_title(f'{feature} vs Performance')

# Extracurricular activities boxplot
df.boxplot(column='Performance Index', by='Extracurricular Activities', ax=axes[1, 2])
axes[1, 2].set_title('Performance by Extracurricular Activities')
axes[1, 2].set_xlabel('Extracurricular Activities')

plt.suptitle('Exploratory Data Analysis', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

# Measure the linear relationships instead of eyeballing them
print("\nCorrelation of each numeric feature with Performance Index:")
print(df[numeric_features].corrwith(df['Performance Index']).sort_values(ascending=False).round(3))
print("\nFeatures with larger correlations carry more linear signal for the model to learn from.")

In [None]:
# Let's start simple: predict Performance using just ONE feature (Hours Studied)
# This makes it easier to visualize and understand the math

X_simple = df['Hours Studied'].values.reshape(-1, 1)
y = df['Performance Index'].values.reshape(-1, 1)

print("Simple Linear Regression Setup:")
print(f"X (Hours Studied) shape: {X_simple.shape}")
print(f"y (Performance Index) shape: {y.shape}")
print(f"\nWe'll predict Performance = w0 + w1 * Hours_Studied")

In [None]:
# Visualize the simple regression problem
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y, alpha=0.5, edgecolors='black', linewidth=0.3, label='Students')
plt.xlabel('Hours Studied')
plt.ylabel('Performance Index')
plt.title('Can We Find a Line That Predicts Performance from Study Hours?')
plt.legend()
plt.show()

print("Our goal: Find the BEST line through this data!")
print("The line equation: Performance = w0 + w1 * Hours_Studied")
print("\nWe need to find optimal values for w0 (bias) and w1 (slope).")

---

<a id='part3'></a>
# Part 3: The Cost Function

## What is a Cost Function?

A **cost function** (also called loss function or objective function) measures how wrong our predictions are. It gives us a single number that represents the "badness" of our model.

**Goal:** Find parameters that **minimize** the cost function.

## Mean Squared Error (MSE)

The most common cost function for linear regression is **Mean Squared Error**:

$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2$$

Where:
- $m$ = number of training examples
- $\hat{y}^{(i)}$ = prediction for example $i$
- $y^{(i)}$ = actual value for example $i$
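- The $\frac{1}{2}$ is a convenience factor: it cancels the 2 that appears when differentiating the square. Strictly, $J$ is half the MSE, but both are minimized by the same weights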

### Why Squared?

1. **Penalizes large errors more** - An error of 10 contributes 100 to the cost (vs. 10 for absolute error)
2. **Differentiable everywhere** - Unlike absolute error which has a kink at 0
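3. **Convex function** - Every local minimum is a global minimum, so gradient descent cannot get stuck in a bad one (the minimizer is unique when the columns of $X$ are linearly independent)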

### In Matrix Form

$$J(w) = \frac{1}{2m} (Xw - y)^T (Xw - y)$$

This is equivalent to the summation but uses matrix operations (faster in NumPy).
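
As a quick sanity check, the sketch below evaluates both forms of $J(w)$ on a small random problem and confirms they agree. The data and the names `X_toy_b`, `y_toy`, and `w_toy` are made up for illustration.

```python
import numpy as np

# Hypothetical toy problem: 5 samples, 1 feature plus the bias column
rng = np.random.default_rng(0)
m_toy = 5
X_toy_b = np.c_[np.ones((m_toy, 1)), rng.normal(size=(m_toy, 1))]
y_toy = rng.normal(size=(m_toy, 1))
w_toy = np.array([[0.3], [-1.2]])

# Summation form: (1/2m) * sum over samples of (y_hat_i - y_i)^2
cost_sum = 0.0
for i in range(m_toy):
    y_hat_i = X_toy_b[i] @ w_toy[:, 0]
    cost_sum += (y_hat_i - y_toy[i, 0]) ** 2
cost_sum /= 2 * m_toy

# Matrix form: (1/2m) * (Xw - y)^T (Xw - y)
residual = X_toy_b @ w_toy - y_toy
cost_matrix = (residual.T @ residual).item() / (2 * m_toy)

print("Summation form:", cost_sum)
print("Matrix form:   ", cost_matrix)
print("Equal:", np.isclose(cost_sum, cost_matrix))
```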

In [None]:
def compute_cost(X, y, w):
    """
    Compute the Mean Squared Error cost function.
    
    Parameters:
    -----------
    X : numpy array of shape (m, n+1)
        Feature matrix with bias column
    y : numpy array of shape (m, 1)
        Target values
    w : numpy array of shape (n+1, 1)
        Weight vector (including bias)
    
    Returns:
    --------
    cost : float
        The MSE cost
    """
    m = len(y)
    predictions = X.dot(w)           # y_hat = Xw
    errors = predictions - y          # (y_hat - y)
    cost = (1 / (2 * m)) * np.sum(errors ** 2)  # (1/2m) * sum((y_hat - y)^2)
    return cost

print("Cost function defined!")
print("\nFormula: J(w) = (1/2m) * sum((y_hat - y)^2)")

In [None]:
# Add bias column to X (column of 1s)
m = len(X_simple)
X_b = np.c_[np.ones((m, 1)), X_simple]  # X_b = [1, x] for each sample

print("Original X shape:", X_simple.shape)
print("X with bias column shape:", X_b.shape)
print("\nFirst 5 rows of X_b:")
print(X_b[:5])
print("\n^ First column is all 1s (for the bias term)")

In [None]:
# Let's see how cost changes with different weight values
# Try some random weights and see their costs

test_weights = [
    np.array([[0], [0]]),      # Zero weights (predicting 0 for everyone)
    np.array([[20], [5]]),     # Random guess
    np.array([[0], [10]]),     # Reasonable guess (10 points per hour)
    np.array([[-10], [12]]),   # Another guess
]

print("Cost for different weight values:")
print("="*50)
for w in test_weights:
    cost = compute_cost(X_b, y, w)
    print(f"w0={w[0,0]:>6.1f}, w1={w[1,0]:>5.1f} --> Cost = {cost:>10.2f}")

print("\nWe need to find the weights that give the LOWEST cost!")
print("That's where gradient descent comes in...")

In [None]:
# Visualize the cost function surface (3D plot)
# This shows how cost changes for different combinations of w0 and w1

from mpl_toolkits.mplot3d import Axes3D

# Create a grid of w0 and w1 values
w0_vals = np.linspace(-30, 30, 100)
w1_vals = np.linspace(-5, 20, 100)
W0, W1 = np.meshgrid(w0_vals, w1_vals)

# Compute cost for each combination
# meshgrid returns arrays of shape (len(w1_vals), len(w0_vals)), so loop over that shape
costs = np.zeros_like(W0)
for i in range(W0.shape[0]):
    for j in range(W0.shape[1]):
        w = np.array([[W0[i, j]], [W1[i, j]]])
        costs[i, j] = compute_cost(X_b, y, w)

# 3D Surface Plot
fig = plt.figure(figsize=(14, 5))

# Surface plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(W0, W1, costs, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w0 (bias)')
ax1.set_ylabel('w1 (slope)')
ax1.set_zlabel('Cost J(w)')
ax1.set_title('Cost Function Surface')

# Contour plot
ax2 = fig.add_subplot(122)
contour = ax2.contour(W0, W1, costs, levels=50, cmap='viridis')
ax2.set_xlabel('w0 (bias)')
ax2.set_ylabel('w1 (slope)')
ax2.set_title('Cost Function Contour Plot')
plt.colorbar(contour, ax=ax2, label='Cost')

plt.tight_layout()
plt.show()

print("\nThe cost surface is a BOWL shape (convex).")
print("There's only ONE minimum - at the bottom of the bowl.")
print("Gradient descent will 'roll down' this bowl to find the minimum!")

---

<a id='part4'></a>
# Part 4: Gradient Descent - The Learning Algorithm

## The Intuition

Imagine you're blindfolded on a hilly terrain and want to reach the lowest point. What would you do?

1. **Feel the slope** under your feet (compute the gradient)
2. **Take a step downhill** (update parameters in opposite direction of gradient)
3. **Repeat** until you reach the bottom (convergence)

This is exactly what gradient descent does!
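
Step 3 says "repeat until convergence", which needs a concrete stopping test. One common choice, sketched here as an assumption rather than as this notebook's method, is to stop once the gradient norm falls below a tolerance. The function `gradient_descent_with_tol` and its parameters `tol` and `max_iter` are hypothetical names introduced for illustration.

```python
import numpy as np

def gradient_descent_with_tol(X_b, y, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Batch gradient descent that stops when the gradient norm drops below tol."""
    m = X_b.shape[0]
    w = np.zeros((X_b.shape[1], 1))
    n_steps = 0
    while n_steps < max_iter:
        grad = X_b.T @ (X_b @ w - y) / m      # feel the slope
        if np.linalg.norm(grad) < tol:        # flat enough: treat as converged
            break
        w = w - alpha * grad                  # step downhill
        n_steps += 1
    return w, n_steps

# Hypothetical noise-free toy data generated from y = 2 + 3x
x = np.linspace(-1, 1, 20).reshape(-1, 1)
X_toy_b = np.c_[np.ones_like(x), x]
y_toy = 2 + 3 * x

w_fit, n_steps = gradient_descent_with_tol(X_toy_b, y_toy)
print("Learned weights:", w_fit.ravel())
print("Steps taken:", n_steps)
```

A fixed iteration count is simpler to reason about and to plot; a tolerance avoids wasting steps once the weights have stopped moving.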

## The Math

### What is a Gradient?

The **gradient** is a vector of partial derivatives. It points in the direction of steepest INCREASE of a function.

To minimize the function, we move in the **opposite** direction of the gradient.

### Deriving the Gradient for MSE

Our cost function:
$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2$$

Where $\hat{y}^{(i)} = w_0 + w_1 x^{(i)}$

Taking partial derivatives:

$$\frac{\partial J}{\partial w_0} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$

$$\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x^{(i)}$$

### In Matrix Form (Much Cleaner!)

$$\nabla J(w) = \frac{1}{m} X^T (Xw - y)$$

This single line computes all partial derivatives at once!
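
A standard way to gain confidence in a derived gradient is to compare it with a finite-difference estimate. The sketch below does that on random toy data. The helper names `cost`, `analytic_gradient`, and `numerical_gradient` are illustrative and simply mirror the formulas above.

```python
import numpy as np

def cost(X_b, y, w):
    """J(w) = (1/2m) * sum((Xw - y)^2)."""
    residual = X_b @ w - y
    return (residual ** 2).sum() / (2 * len(y))

def analytic_gradient(X_b, y, w):
    """Matrix-form gradient: (1/m) * X^T (Xw - y)."""
    return X_b.T @ (X_b @ w - y) / len(y)

def numerical_gradient(X_b, y, w, eps=1e-6):
    """Central-difference estimate of each partial derivative."""
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j, 0] = eps
        grad[j, 0] = (cost(X_b, y, w + step) - cost(X_b, y, w - step)) / (2 * eps)
    return grad

# Hypothetical toy problem: 8 samples, 2 features plus bias
rng = np.random.default_rng(1)
X_toy_b = np.c_[np.ones((8, 1)), rng.normal(size=(8, 2))]
y_toy = rng.normal(size=(8, 1))
w_toy = rng.normal(size=(3, 1))

g_analytic = analytic_gradient(X_toy_b, y_toy, w_toy)
g_numeric = numerical_gradient(X_toy_b, y_toy, w_toy)
print("Analytic: ", g_analytic.ravel())
print("Numerical:", g_numeric.ravel())
print("Match:", np.allclose(g_analytic, g_numeric, atol=1e-6))
```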

### The Update Rule

$$w := w - \alpha \cdot \nabla J(w)$$

Or expanded:

$$w := w - \frac{\alpha}{m} X^T (Xw - y)$$

Where $\alpha$ is the **learning rate** - how big of a step we take.

## Learning Rate: The Critical Hyperparameter

- **Too small**: Very slow convergence (takes forever)
- **Too large**: Overshoots the minimum (may oscillate without converging, or diverge)
- **Just right**: Fast convergence to the minimum
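
The three regimes are easy to reproduce. This sketch runs 50 steps of gradient descent on a made-up, standardized one-feature problem with three learning rates. The specific values `0.001`, `0.5`, and `2.1` are chosen for this toy data only and are not recommendations for the student dataset.

```python
import numpy as np

# Hypothetical toy data with one standardized feature
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
x = (x - x.mean()) / x.std()
X_toy_b = np.c_[np.ones_like(x), x]
y_toy = 4 + 2 * x + 0.1 * rng.normal(size=(200, 1))

def final_cost(alpha, n_steps=50):
    """Run n_steps of batch gradient descent from zeros and return the final cost."""
    m = len(y_toy)
    w = np.zeros((2, 1))
    for _ in range(n_steps):
        w = w - alpha * (X_toy_b.T @ (X_toy_b @ w - y_toy)) / m
    residual = X_toy_b @ w - y_toy
    return (residual ** 2).sum() / (2 * m)

for alpha in (0.001, 0.5, 2.1):
    print(f"alpha={alpha:<6} cost after 50 steps = {final_cost(alpha):.4g}")
```

On this toy problem the smallest rate barely moves the cost, the middle one settles near the noise floor, and the largest one makes the cost grow with every step.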

In [None]:
def compute_gradient(X, y, w):
    """
    Compute the gradient of the cost function.
    
    Parameters:
    -----------
    X : numpy array of shape (m, n+1)
        Feature matrix with bias column
    y : numpy array of shape (m, 1)
        Target values
    w : numpy array of shape (n+1, 1)
        Current weight vector
    
    Returns:
    --------
    gradient : numpy array of shape (n+1, 1)
        The gradient vector
    """
    m = len(y)
    predictions = X.dot(w)            # y_hat = Xw
    errors = predictions - y          # (y_hat - y)
    gradient = (1 / m) * X.T.dot(errors)  # (1/m) * X^T(Xw - y)
    return gradient

print("Gradient function defined!")
print("\nFormula: gradient = (1/m) * X^T(Xw - y)")

In [None]:
# Let's see what the gradient looks like at different points

test_points = [
    np.array([[0], [0]]),      # Far from optimum
    np.array([[10], [5]]),     # Getting closer
    np.array([[-5], [10]]),    # Near optimum maybe?
]

print("Gradient at different points:")
print("="*60)
for w in test_points:
    grad = compute_gradient(X_b, y, w)
    cost = compute_cost(X_b, y, w)
    print(f"w = [{w[0,0]:>6.1f}, {w[1,0]:>5.1f}]")
    print(f"  Gradient = [{grad[0,0]:>8.2f}, {grad[1,0]:>8.2f}]")
    print(f"  Cost = {cost:.2f}")
    print()

print("The gradient tells us which direction to move to decrease cost!")
print("Negative gradient = we should increase that weight.")
print("Positive gradient = we should decrease that weight.")

---

<a id='part5'></a>
# Part 5: Implementation from Scratch

Now let's put it all together into a complete linear regression class using only NumPy!

## The Algorithm

```
1. Initialize weights randomly (or with zeros)
2. Repeat for n_iterations:
   a. Compute predictions: y_hat = Xw
   b. Compute cost: J(w)
   c. Compute gradient: gradient(J)
   d. Update weights: w = w - alpha * gradient
   e. Store cost for plotting
3. Return learned weights
```

In [None]:
class LinearRegressionScratch:
    """
    Linear Regression implemented from scratch using only NumPy.
    
    This implementation uses batch gradient descent to find optimal weights.
    
    Parameters:
    -----------
    learning_rate : float, default=0.01
        The step size for gradient descent
    n_iterations : int, default=1000
        Number of iterations to run gradient descent
    
    Attributes:
    -----------
    weights : numpy array
        Learned weights (including bias)
    cost_history : list
        Cost at each iteration (for plotting)
    """
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.cost_history = []
    
    def _add_bias(self, X):
        """Add a column of 1s to X for the bias term."""
        return np.c_[np.ones((X.shape[0], 1)), X]
    
    def _compute_cost(self, X, y):
        """Compute Mean Squared Error cost."""
        m = len(y)
        predictions = X.dot(self.weights)
        cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
        return cost
    
    def _compute_gradient(self, X, y):
        """Compute gradient of cost function."""
        m = len(y)
        predictions = X.dot(self.weights)
        gradient = (1 / m) * X.T.dot(predictions - y)
        return gradient
    
    def fit(self, X, y):
        """
        Fit the model using gradient descent.
        
        Parameters:
        -----------
        X : numpy array of shape (m, n)
            Training features
        y : numpy array of shape (m, 1)
            Target values
        
        Returns:
        --------
        self : object
            Returns self for method chaining
        """
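        # Add bias column and make sure y is a column vector
        # (a 1-D y would broadcast against the (m, 1) predictions into an (m, m) array)
        X_b = self._add_bias(X)
        y = np.asarray(y).reshape(-1, 1)
        m, n = X_b.shape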
        
        # Initialize weights to zeros
        self.weights = np.zeros((n, 1))
        self.cost_history = []
        
        # Gradient descent loop
        for i in range(self.n_iterations):
            # Compute and store cost
            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)
            
            # Compute gradient
            gradient = self._compute_gradient(X_b, y)
            
            # Update weights
            self.weights = self.weights - self.learning_rate * gradient
        
        return self
    
    def predict(self, X):
        """
        Make predictions using learned weights.
        
        Parameters:
        -----------
        X : numpy array of shape (m, n)
            Features to predict
        
        Returns:
        --------
        predictions : numpy array of shape (m, 1)
            Predicted values
        """
        X_b = self._add_bias(X)
        return X_b.dot(self.weights)
    
    def get_params(self):
        """Return bias and weights separately."""
        return {
            'bias': self.weights[0, 0],
            'coefficients': self.weights[1:].flatten()
        }

print("LinearRegressionScratch class defined!")
print("\nMethods:")
print("  - fit(X, y): Train the model using gradient descent")
print("  - predict(X): Make predictions")
print("  - get_params(): Get learned bias and coefficients")

In [None]:
# Train our model on the simple case (Hours Studied -> Performance)
model_simple = LinearRegressionScratch(learning_rate=0.01, n_iterations=1000)
model_simple.fit(X_simple, y)

# Get learned parameters
params = model_simple.get_params()

print("Training complete! (Simple Linear Regression)")
print("="*50)
print(f"Learned equation: Performance = {params['bias']:.2f} + {params['coefficients'][0]:.2f} * Hours_Studied")
print()
print(f"Interpretation:")
print(f"  - Base performance (0 hours studied): {params['bias']:.2f}")
print(f"  - Each additional hour of study adds: {params['coefficients'][0]:.2f} points")

In [None]:
# Visualize the learning process

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Cost over iterations
axes[0].plot(model_simple.cost_history)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Cost')
axes[0].set_title('Cost Function Over Training')

# Plot 2: Data with learned line
axes[1].scatter(X_simple, y, alpha=0.5, edgecolors='black', linewidth=0.3, label='Students')
X_line = np.array([[0], [9]])
y_pred_line = model_simple.predict(X_line)
axes[1].plot(X_line, y_pred_line, 'r-', linewidth=2, 
             label=f'Learned: y = {params["bias"]:.1f} + {params["coefficients"][0]:.1f}x')
axes[1].set_xlabel('Hours Studied')
axes[1].set_ylabel('Performance Index')
axes[1].set_title('Learned Regression Line')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Left: Cost decreases as the model learns")
print("Right: The red line is what our model learned!")

In [None]:
# Visualize gradient descent path on the cost surface

# Track weight history during training
weights = np.zeros((2, 1))
weight_history = [weights.copy()]

for i in range(100):  # Fewer iterations to see the path clearly
    gradient = (1 / m) * X_b.T.dot(X_b.dot(weights) - y)
    weights = weights - 0.01 * gradient
    weight_history.append(weights.copy())

weight_history = np.array(weight_history).squeeze()

# Plot the path on contour
plt.figure(figsize=(10, 8))
contour = plt.contour(W0, W1, costs, levels=30, cmap='viridis')
plt.colorbar(contour, label='Cost')

# Plot the gradient descent path
plt.plot(weight_history[:, 0], weight_history[:, 1], 'ro-', markersize=3, linewidth=1, label='Gradient Descent Path')
plt.plot(weight_history[0, 0], weight_history[0, 1], 'go', markersize=12, label='Start (0, 0)')
plt.plot(weight_history[-1, 0], weight_history[-1, 1], 'r*', markersize=15, label='End (after 100 steps)')

plt.xlabel('w0 (bias)')
plt.ylabel('w1 (slope)')
plt.title('Gradient Descent Path on Cost Surface')
plt.legend()
plt.show()

print("Watch how gradient descent navigates the cost surface!")
print("Starting from (0,0), each step moves along the negative gradient toward the minimum.")

---

<a id='part6'></a>
# Part 6: The Normal Equation (Closed-Form Solution)

## An Alternative to Gradient Descent

Instead of iteratively searching for the minimum, we can solve for it directly using calculus!

## Deriving the Normal Equation

To find the minimum, we set the gradient to zero:

$$\nabla J(w) = 0$$

$$\frac{1}{m} X^T (Xw - y) = 0$$

$$X^T Xw = X^T y$$

$$w = (X^T X)^{-1} X^T y$$

This is the **Normal Equation** - it gives us the optimal weights directly!
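
Before applying it to the student data, here is a small self-contained check on synthetic data (the names `X_toy_b` and `y_toy` and the generating line are assumptions for illustration). It solves $X^T X w = X^T y$ with `np.linalg.solve`, which avoids forming the explicit inverse, and compares the result with NumPy's least-squares routine `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical toy data scattered around y = 1.5 + 0.8x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y_toy = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=(50, 1))
X_toy_b = np.c_[np.ones_like(x), x]

# Normal equation written as a linear system: (X^T X) w = X^T y
w_solve = np.linalg.solve(X_toy_b.T @ X_toy_b, X_toy_b.T @ y_toy)

# Reference answer from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_toy_b, y_toy, rcond=None)

# At the minimum the gradient (1/m) X^T (Xw - y) should vanish
grad_at_solution = X_toy_b.T @ (X_toy_b @ w_solve - y_toy) / len(y_toy)

print("Normal equation:", w_solve.ravel())
print("lstsq:          ", w_lstsq.ravel())
print("Gradient at solution is ~0:", np.allclose(grad_at_solution, 0, atol=1e-8))
```

When $X^T X$ is singular or nearly so (for example, with duplicated features), the inverse does not exist or is unstable; a least-squares solver such as `np.linalg.lstsq` is the safer option in that case.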

## Gradient Descent vs Normal Equation

| Aspect | Gradient Descent | Normal Equation |
|--------|-----------------|------------------|
| **Iterative** | Yes | No |
| **Learning rate** | Needs tuning | Not needed |
| **Complexity** | O(kmn) for k iterations | O(mn² + n³) (build XᵀX, then invert it) |
| **Large n (features)** | Works well | Slow (matrix inversion) |
| **Large m (samples)** | Works well | Works well |

**Rule of thumb:** The Normal Equation is practical up to roughly 10,000 features. Beyond that, the O(n³) inversion dominates and Gradient Descent is usually the better choice.

In [None]:
def normal_equation(X, y):
    """
    Compute optimal weights using the Normal Equation.
    
    Formula: w = (X^T X)^(-1) X^T y
    
    Parameters:
    -----------
    X : numpy array of shape (m, n)
        Feature matrix (without bias column)
    y : numpy array of shape (m, 1)
        Target values
    
    Returns:
    --------
    w : numpy array of shape (n+1, 1)
        Optimal weights (including bias)
    """
    # Add bias column
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    
    # Normal equation: w = (X^T X)^(-1) X^T y
    w = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    
    return w

# Solve using normal equation
w_normal = normal_equation(X_simple, y)

print("Normal Equation Solution")
print("="*50)
print(f"w = (X^T X)^(-1) X^T y")
print()
print(f"Computed bias (w0): {w_normal[0, 0]:.6f}")
print(f"Computed slope (w1): {w_normal[1, 0]:.6f}")

In [None]:
# Compare: Gradient Descent vs Normal Equation

# Gradient Descent result
gd_params = model_simple.get_params()

print("Comparison: Gradient Descent vs Normal Equation")
print("="*55)
print(f"{'Method':<20} {'Bias (w0)':<15} {'Slope (w1)':<15}")
print("-"*55)
print(f"{'Gradient Descent':<20} {gd_params['bias']:<15.6f} {gd_params['coefficients'][0]:<15.6f}")
print(f"{'Normal Equation':<20} {w_normal[0,0]:<15.6f} {w_normal[1,0]:<15.6f}")
print()
max_diff = np.max(np.abs(model_simple.weights - w_normal))
print(f"Largest absolute difference between the two solutions: {max_diff:.6f}")
print("(Any gap comes from gradient descent not having fully converged in 1000 iterations)")

---

<a id='part7'></a>
# Part 7: Feature Scaling - Why It Matters

## The Problem

When features have very different scales, gradient descent can behave poorly:

- The cost surface becomes elongated (elliptical instead of circular)
- Gradient descent takes a zigzag path
- Convergence is slow or may not happen
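
The "elongation" can be measured. For our cost function the curvature is described by the matrix $\frac{1}{m} X^T X$, and the ratio of its largest to smallest eigenvalue (its condition number) indicates how stretched the bowl is. The sketch below compares that number before and after standardization on made-up features whose ranges are assumptions chosen to resemble hours and scores.

```python
import numpy as np

# Hypothetical features on very different scales
rng = np.random.default_rng(0)
m_toy = 500
hours = rng.uniform(1, 9, size=m_toy)
scores = rng.uniform(40, 99, size=m_toy)
X_raw = np.c_[hours, scores]
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def curvature_condition_number(X):
    """Condition number of (1/m) X_b^T X_b, the curvature matrix of J(w)."""
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    H = X_b.T @ X_b / X.shape[0]
    return np.linalg.cond(H)

cond_raw = curvature_condition_number(X_raw)
cond_std = curvature_condition_number(X_std)
print(f"Condition number, raw features:          {cond_raw:.1f}")
print(f"Condition number, standardized features: {cond_std:.2f}")
```

A condition number close to 1 means a nearly circular bowl, where a single learning rate works well in every direction.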

## The Solution: Feature Scaling

### Standardization (Z-score normalization)

$$x' = \frac{x - \mu}{\sigma}$$

Where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation of the feature

Result: Features have mean = 0 and std = 1
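
A tiny worked example, using made-up numbers, shows the arithmetic. It also shows that a new value is scaled with the mean and standard deviation already computed from the original data, rather than with fresh statistics.

```python
import numpy as np

# Hypothetical feature values
x_toy = np.array([2.0, 4.0, 6.0, 8.0])

mu_toy = x_toy.mean()       # (2 + 4 + 6 + 8) / 4 = 5
sigma_toy = x_toy.std()     # population std (ddof=0) = sqrt(5)

z_toy = (x_toy - mu_toy) / sigma_toy
print("Standardized:", z_toy.round(3))
print("Mean:", round(float(z_toy.mean()), 6), "Std:", round(float(z_toy.std()), 6))

# A new observation reuses the stored mu and sigma
x_new = np.array([10.0])
z_new = (x_new - mu_toy) / sigma_toy
print("New value 10 becomes:", z_new.round(3))
```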

## When to Scale

- **Gradient Descent**: Always scale features!
- **Normal Equation**: Not strictly necessary (but doesn't hurt)
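
One practical consequence of scaling: the learned weights are expressed in standardized units. Substituting $x_j' = (x_j - \mu_j)/\sigma_j$ into the model gives the conversion back to original units, $w_j^{\text{orig}} = w_j / \sigma_j$ and $w_0^{\text{orig}} = w_0 - \sum_j w_j \mu_j / \sigma_j$. The sketch below checks this on noise-free toy data; the names and the generating coefficients are assumptions for illustration.

```python
import numpy as np

# Hypothetical noise-free data: y = 10 + 3.0 * x1 + 0.5 * x2
rng = np.random.default_rng(0)
X_raw = np.c_[rng.uniform(1, 9, 300), rng.uniform(40, 99, 300)]
y_toy = 10 + 3.0 * X_raw[:, [0]] + 0.5 * X_raw[:, [1]]

# Fit on standardized features (least squares stands in for any fitting method)
mu_toy, sigma_toy = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std_b = np.c_[np.ones((300, 1)), (X_raw - mu_toy) / sigma_toy]
w_std, *_ = np.linalg.lstsq(X_std_b, y_toy, rcond=None)

# Convert the weights back to the original feature units
coef_orig = w_std[1:, 0] / sigma_toy
bias_orig = w_std[0, 0] - np.sum(w_std[1:, 0] * mu_toy / sigma_toy)

print("Standardized coefficients:", w_std[1:, 0].round(3))
print("Original-unit coefficients:", coef_orig.round(3))  # close to [3.0, 0.5]
print("Original-unit bias:", round(float(bias_orig), 3))   # close to 10
```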

In [None]:
# Implement standardization from scratch

def standardize(X):
    """
    Standardize features to have mean=0 and std=1.
    
    Parameters:
    -----------
    X : numpy array of shape (m, n)
        Features to standardize
    
    Returns:
    --------
    X_scaled : numpy array of shape (m, n)
        Standardized features
    mu : numpy array of shape (n,)
        Mean of each feature
    sigma : numpy array of shape (n,)
        Std of each feature
    """
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # constant features: avoid dividing by zero
    X_scaled = (X - mu) / sigma
    return X_scaled, mu, sigma

print("Standardization function defined!")
print("\nFormula: x' = (x - mean) / std")

In [None]:
# Prepare multiple features for the full model
# We'll encode 'Extracurricular Activities' as 1/0

df_encoded = df.copy()
df_encoded['Extracurricular Activities'] = (df['Extracurricular Activities'] == 'Yes').astype(int)

# Select features and target
feature_cols = ['Hours Studied', 'Previous Scores', 'Sleep Hours', 
                'Sample Question Papers Practiced', 'Extracurricular Activities']
X_multi = df_encoded[feature_cols].values
y = df_encoded['Performance Index'].values.reshape(-1, 1)

print("Multiple features prepared:")
print(f"X shape: {X_multi.shape} (5 features)")
print(f"y shape: {y.shape}")
print(f"\nFeatures: {feature_cols}")

In [None]:
# Check feature scales before standardization
print("Feature Scales BEFORE Standardization:")
print("="*60)
for i, col in enumerate(feature_cols):
    print(f"{col:<40} mean={X_multi[:, i].mean():.2f}, std={X_multi[:, i].std():.2f}")

print("\nNotice: Previous Scores has much larger values than others!")
print("This can cause problems for gradient descent.")

In [None]:
# Standardize features
X_scaled, mu, sigma = standardize(X_multi)

print("Feature Scales AFTER Standardization:")
print("="*60)
for i, col in enumerate(feature_cols):
    print(f"{col:<40} mean={X_scaled[:, i].mean():.4f}, std={X_scaled[:, i].std():.4f}")

print("\nAll features now have mean ~0 and std ~1!")

---

<a id='part8'></a>
# Part 8: Training on Student Performance Data

Now let's train our from-scratch model on all features!

In [None]:
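# Split data into train and test sets manually
np.random.seed(42)
m = len(X_multi)
indices = np.random.permutation(m)
split_idx = int(0.8 * m)

train_idx, test_idx = indices[:split_idx], indices[split_idx:]

# Fit the scaler on the training rows only, then reuse its mean/std for the test rows.
# Scaling with statistics from the full dataset would leak test information into training.
X_train, mu, sigma = standardize(X_multi[train_idx])
X_test = (X_multi[test_idx] - mu) / sigma
y_train, y_test = y[train_idx], y[test_idx]

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")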

In [None]:
# Train our from-scratch model!
model = LinearRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X_train, y_train)

# Get learned parameters
params = model.get_params()

print("Training complete! (Multiple Linear Regression)")
print("="*60)
print(f"\nLearned bias: {params['bias']:.4f}")
print(f"\nLearned coefficients (standardized scale):")
for i, col in enumerate(feature_cols):
    print(f"  {col:<40}: {params['coefficients'][i]:>8.4f}")

In [None]:
# Visualize training
plt.figure(figsize=(10, 5))
plt.plot(model.cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Cost Function During Training (Multiple Features)')
plt.show()

print(f"Initial cost: {model.cost_history[0]:.2f}")
print(f"Final cost: {model.cost_history[-1]:.2f}")
print(f"Cost reduction: {(1 - model.cost_history[-1]/model.cost_history[0])*100:.1f}%")

In [None]:
# Feature importance (coefficient magnitudes)
# Note: Because we standardized, coefficients are comparable

plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in params['coefficients']]
plt.barh(feature_cols, params['coefficients'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value (Standardized)')
plt.title('Feature Importance (Coefficient Magnitude)')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.show()

print("\nInterpretation:")
print("- Green bars: Positive impact on performance")
print("- Red bars: Negative impact on performance")
print("- Larger magnitude = stronger effect")

---

<a id='part9'></a>
# Part 9: Evaluation Metrics from Scratch

How do we measure how good our model is? Let's implement common regression metrics.

## Mean Squared Error (MSE)

$$MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

## Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{MSE}$$

## Mean Absolute Error (MAE)

$$MAE = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|$$

## R-squared (Coefficient of Determination)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
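
To see the four formulas in action, here is a worked example on four made-up values, small enough to verify by hand. The residuals are $1, 0, -1, -2$, so $SS_{res} = 6$; the mean of the actual values is 6, so $SS_{tot} = 20$.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 11.0])

residuals_toy = y_true_toy - y_pred_toy            # [1, 0, -1, -2]

mse_toy = np.mean(residuals_toy ** 2)              # (1 + 0 + 1 + 4) / 4
rmse_toy = np.sqrt(mse_toy)
mae_toy = np.mean(np.abs(residuals_toy))           # (1 + 0 + 1 + 2) / 4

ss_res_toy = np.sum(residuals_toy ** 2)                      # 6
ss_tot_toy = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2_toy = 1 - ss_res_toy / ss_tot_toy

print(f"MSE  = {mse_toy:.3f}")    # 1.500
print(f"RMSE = {rmse_toy:.3f}")   # 1.225
print(f"MAE  = {mae_toy:.3f}")    # 1.000
print(f"R^2  = {r2_toy:.3f}")     # 0.700
```

Note that RMSE and MAE are in the same units as the target, while MSE is in squared units and R² is unitless.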

In [None]:
# Implement evaluation metrics from scratch

def mean_squared_error(y_true, y_pred):
    """Calculate Mean Squared Error."""
    return np.mean((y_true - y_pred) ** 2)

def root_mean_squared_error(y_true, y_pred):
    """Calculate Root Mean Squared Error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

def mean_absolute_error(y_true, y_pred):
    """Calculate Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def r_squared(y_true, y_pred):
    """Calculate R-squared (coefficient of determination)."""
    ss_res = np.sum((y_true - y_pred) ** 2)  # Residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # Total sum of squares
    return 1 - (ss_res / ss_tot)

print("Evaluation metrics defined!")

In [None]:
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate
print("Model Evaluation")
print("="*60)
print(f"{'Metric':<25} {'Train':<15} {'Test':<15}")
print("-"*60)
print(f"{'MSE':<25} {mean_squared_error(y_train, y_pred_train):<15.4f} {mean_squared_error(y_test, y_pred_test):<15.4f}")
print(f"{'RMSE':<25} {root_mean_squared_error(y_train, y_pred_train):<15.4f} {root_mean_squared_error(y_test, y_pred_test):<15.4f}")
print(f"{'MAE':<25} {mean_absolute_error(y_train, y_pred_train):<15.4f} {mean_absolute_error(y_test, y_pred_test):<15.4f}")
print(f"{'R-squared':<25} {r_squared(y_train, y_pred_train):<15.4f} {r_squared(y_test, y_pred_test):<15.4f}")

print("\nInterpretation:")
print(f"- Our model explains {r_squared(y_test, y_pred_test)*100:.1f}% of the variance in student performance")
print(f"- Average prediction error: {mean_absolute_error(y_test, y_pred_test):.2f} points")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs Predicted
axes[0].scatter(y_test, y_pred_test, alpha=0.5, edgecolors='black', linewidth=0.3)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Performance')
axes[0].set_ylabel('Predicted Performance')
axes[0].set_title(f'Actual vs Predicted (R² = {r_squared(y_test, y_pred_test):.4f})')
axes[0].legend()

# Residual plot
residuals = y_test - y_pred_test
axes[1].scatter(y_pred_test, residuals, alpha=0.5, edgecolors='black', linewidth=0.3)
axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Performance')
axes[1].set_ylabel('Residuals (Actual - Predicted)')
axes[1].set_title('Residual Plot')

plt.tight_layout()
plt.show()

print("Left: Points close to the red line = accurate predictions")
print("Right: Residuals randomly scattered around 0 = good model")

---

<a id='part10'></a>
# Part 10: Verification Against sklearn

Let's verify that our implementation is correct by comparing with sklearn's LinearRegression.

In [None]:
# Import sklearn for verification only
from sklearn.linear_model import LinearRegression as SklearnLR
from sklearn.metrics import mean_squared_error as sklearn_mse, r2_score

# Train sklearn model
sklearn_model = SklearnLR()
sklearn_model.fit(X_train, y_train.ravel())

# Get sklearn predictions
sklearn_pred_test = sklearn_model.predict(X_test).reshape(-1, 1)

print("Comparison: Our Implementation vs sklearn")
print("="*60)
print(f"{'Metric':<20} {'Our Model':<20} {'sklearn':<20}")
print("-"*60)
print(f"{'Test MSE':<20} {mean_squared_error(y_test, y_pred_test):<20.6f} {sklearn_mse(y_test, sklearn_pred_test):<20.6f}")
print(f"{'Test R²':<20} {r_squared(y_test, y_pred_test):<20.6f} {r2_score(y_test, sklearn_pred_test):<20.6f}")

In [None]:
# Compare coefficients
print("\nCoefficient Comparison:")
print("="*70)
print(f"{'Feature':<40} {'Our Model':<15} {'sklearn':<15}")
print("-"*70)
print(f"{'Bias':<40} {params['bias']:<15.6f} {sklearn_model.intercept_:<15.6f}")
for i, col in enumerate(feature_cols):
    print(f"{col:<40} {params['coefficients'][i]:<15.6f} {sklearn_model.coef_[i]:<15.6f}")

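# Confirm the match numerically instead of assuming it (the 1e-4 tolerance is an assumption)
coef_match = np.allclose(np.ravel(params['coefficients']), sklearn_model.coef_, atol=1e-4)
bias_match = np.allclose(params['bias'], sklearn_model.intercept_, atol=1e-4)
if coef_match and bias_match:
    print("\nOur implementation matches sklearn!")
else:
    print("\nMismatch detected: check convergence and feature scaling.")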

---

<a id='part11'></a>
# Part 11: Summary and Key Takeaways

## What We Built

We implemented **Linear Regression from scratch** using only NumPy, and applied it to predict student performance!

### Components Implemented:

1. **Cost Function (MSE)**: Measures how wrong our predictions are
2. **Gradient Descent**: Iteratively updates weights to minimize cost
3. **Normal Equation**: Closed-form solution for optimal weights
4. **Feature Scaling**: Standardization for better gradient descent performance
5. **Evaluation Metrics**: MSE, RMSE, MAE, R² implemented from scratch

## Key Mathematical Concepts

| Concept | Formula |
|---------|----------|
| Model | $\hat{y} = Xw$ |
| Cost Function | $J(w) = \frac{1}{2m}(Xw - y)^T(Xw - y)$ |
| Gradient | $\nabla J(w) = \frac{1}{m}X^T(Xw - y)$ |
| Update Rule | $w := w - \alpha \nabla J(w)$ |
| Normal Equation | $w = (X^TX)^{-1}X^Ty$ |
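
The formulas in this table translate almost line for line into NumPy. The following is a minimal sketch on synthetic data, not the notebook's own class: the helper names `cost` and `gradient`, the learning rate, and the iteration count are assumptions chosen for illustration. It runs the update rule and compares the result with the Normal Equation.

```python
import numpy as np

def cost(X, y, w):
    """J(w) = (1 / 2m) * (Xw - y)^T (Xw - y)"""
    m = len(y)
    error = X @ w - y
    return (error @ error) / (2 * m)

def gradient(X, y, w):
    """grad J(w) = (1 / m) * X^T (Xw - y)"""
    m = len(y)
    return X.T @ (X @ w - y) / m

# Synthetic data (illustrative only); the first column of X is the bias column of ones
rng = np.random.default_rng(42)
m = 500
features = rng.normal(size=(m, 2))
X = np.c_[np.ones(m), features]
true_w = np.array([4.0, 2.0, -1.0])
y = X @ true_w + rng.normal(0, 0.1, size=m)

# Gradient descent: w := w - alpha * grad J(w)
alpha = 0.1  # assumed learning rate
w_gd = np.zeros(X.shape[1])
for _ in range(2000):
    w_gd -= alpha * gradient(X, y, w_gd)

# Normal Equation: solve (X^T X) w = X^T y instead of forming the inverse explicitly
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

print("Gradient descent:", np.round(w_gd, 4))
print("Normal equation: ", np.round(w_ne, 4))
print("Final cost:", cost(X, y, w_gd))
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a design choice: it computes the same $w$ but is generally more numerically stable.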

## Key Insights from Student Performance Data

- **Hours Studied** and **Previous Scores** are the strongest predictors
- Our model achieves R² ≈ 0.98, meaning it explains about 98% of the variance in performance
- Mean absolute error is only about 2 points

## What's Next?

- **Regularization**: Add L1/L2 penalties to reduce overfitting
- **Polynomial Features**: Model non-linear relationships
- **Logistic Regression**: Apply similar concepts to classification
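
As a preview of the regularization item, an L2 penalty changes the Normal Equation only slightly. Assuming the cost $J(w) = \frac{1}{2m}(Xw - y)^T(Xw - y) + \frac{\lambda}{2m}\lVert w \rVert^2$ with the bias left unpenalized, setting the gradient to zero gives $w = (X^TX + \lambda I')^{-1}X^Ty$, where $I'$ is the identity matrix with a zero in the bias position. The sketch below uses synthetic data, and the function name `ridge_normal_equation` is hypothetical, not something defined earlier in this notebook.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * I') w = X^T y, where I' leaves the bias unpenalized.

    Assumes the first column of X is the bias column of ones.
    """
    n = X.shape[1]
    penalty = lam * np.eye(n)
    penalty[0, 0] = 0.0  # do not shrink the bias term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
m = 100
X = np.c_[np.ones(m), rng.normal(size=(m, 3))]
y = X @ np.array([1.0, 3.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=m)

w_ols = ridge_normal_equation(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_normal_equation(X, y, lam=50.0)  # a larger lam shrinks the weights

print("OLS weights:  ", np.round(w_ols, 3))
print("Ridge weights:", np.round(w_ridge, 3))
```

The non-bias ridge weights come out smaller in norm than the ordinary least-squares weights, which is the shrinkage effect that helps when features are noisy or highly correlated.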

In [None]:
# Final summary visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Simple regression (Hours vs Performance)
axes[0, 0].scatter(X_simple, y, alpha=0.3, edgecolors='black', linewidth=0.2)
X_line = np.array([[0], [9]])
y_line = model_simple.predict(X_line)
axes[0, 0].plot(X_line, y_line, 'r-', linewidth=2)
axes[0, 0].set_xlabel('Hours Studied')
axes[0, 0].set_ylabel('Performance Index')
axes[0, 0].set_title('Simple Linear Regression')

# 2. Cost convergence
axes[0, 1].plot(model.cost_history)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Cost')
axes[0, 1].set_title('Gradient Descent Convergence')

# 3. Actual vs Predicted
axes[1, 0].scatter(y_test, y_pred_test, alpha=0.5, edgecolors='black', linewidth=0.3)
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
axes[1, 0].set_xlabel('Actual')
axes[1, 0].set_ylabel('Predicted')
axes[1, 0].set_title(f'Predictions (R² = {r_squared(y_test, y_pred_test):.4f})')

# 4. Feature importance (coefficient sizes are comparable only when features share the same scale)
colors = ['green' if c > 0 else 'red' for c in params['coefficients']]
axes[1, 1].barh(feature_cols, params['coefficients'], color=colors, alpha=0.7)
axes[1, 1].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
axes[1, 1].set_xlabel('Coefficient Value')
axes[1, 1].set_title('Feature Importance')

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("LINEAR REGRESSION FROM SCRATCH - COMPLETE!")
print("="*60)
print("\nYou now understand the math behind one of ML's most")
print("fundamental algorithms!")
print("\nThis knowledge transfers to:")
print("  - Logistic Regression (classification)")
print("  - Neural Networks (multi-layer linear + activation)")
print("  - Any gradient-based optimization")
print("\nThe same concepts of cost functions, gradients, and")
print("optimization apply throughout machine learning!")