# Linear Regression - Gradient Descent Implementation

## Overview
This notebook demonstrates the implementation of Linear Regression using **Gradient Descent**.

### Gradient Descent Algorithm
Gradient descent iteratively updates each parameter in the direction that reduces the cost function:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

**Vectorized form:**
$$\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)$$

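Where:
- $\theta$ = parameter vector (intercept and feature weights)
- $J(\theta) = \frac{1}{2m}\lVert X\theta - y\rVert^2$ = cost function
- $\alpha$ = learning rate
- $m$ = number of samples
- $X$ = feature matrix (with a leading column of ones for the intercept)
- $y$ = target vector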
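
To make the two forms of the update rule concrete, here is a minimal sketch of a single step on toy data. The data values, `alpha = 0.1`, and the variable names are illustrative assumptions; the cost is assumed to be the half mean squared error used in this notebook. It checks that the per-parameter rule and the vectorized rule give the same result.

```python
import numpy as np

# Toy data (hypothetical values, chosen only for illustration)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # first column is the intercept term
y = np.array([[1.0], [3.0], [5.0]])
theta = np.zeros((2, 1))
alpha = 0.1
m = X.shape[0]

# Vectorized update: theta - (alpha / m) * X^T (X theta - y)
theta_vec = theta - (alpha / m) * (X.T @ (X @ theta - y))

# The same update written one parameter at a time
theta_loop = theta.copy()
errors = X @ theta - y
for j in range(theta.shape[0]):
    grad_j = (1 / m) * np.sum(errors[:, 0] * X[:, j])
    theta_loop[j, 0] = theta[j, 0] - alpha * grad_j

print(theta_vec.ravel())
print(np.allclose(theta_vec, theta_loop))
```

Note that every partial derivative is computed from the same `errors` vector, so all parameters are updated simultaneously rather than one after another.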

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)

## 1. Generate and Prepare Dataset

In [None]:
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, alpha=0.6, label='Training')
plt.scatter(X_test, y_test, alpha=0.6, color='orange', label='Test')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Dataset')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 2. Feature Scaling

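**Important:** Standardizing features to zero mean and unit variance puts them on a comparable scale, which typically lets gradient descent converge in fewer iterations and tolerate a larger learning rate.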
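
One way to see why scaling helps is to look at the condition number of $X^T X / m$, the curvature (Hessian) of the squared-error cost: the larger it is, the more elongated the cost surface and the slower gradient descent tends to be. The sketch below uses hypothetical two-feature data and a helper named `curvature_condition` (both assumptions for illustration), and leaves out the intercept column for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-feature data on very different scales
X_raw = np.column_stack([
    rng.normal(0.0, 1.0, size=200),      # feature on a unit scale
    rng.normal(0.0, 100.0, size=200),    # feature on a much larger scale
])

# Standardize each column to zero mean and unit variance
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def curvature_condition(X):
    """Condition number of X^T X / m, the Hessian of the squared-error cost."""
    m = X.shape[0]
    return np.linalg.cond(X.T @ X / m)

cond_raw = curvature_condition(X_raw)
cond_std = curvature_condition(X_std)
print(f"Condition number, raw features:    {cond_raw:.1f}")
print(f"Condition number, scaled features: {cond_std:.2f}")
```

A condition number close to 1 means a single learning rate suits every direction of the parameter space about equally well.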

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Before scaling:")
print(f"  Mean: {X_train.mean():.4f}, Std: {X_train.std():.4f}")
print("\nAfter scaling:")
print(f"  Mean: {X_train_scaled.mean():.4f}, Std: {X_train_scaled.std():.4f}")

## 3. Add Intercept Term

In [None]:
def add_intercept(X):
    """Add column of ones for intercept."""
    m = X.shape[0]
    return np.concatenate([np.ones((m, 1)), X], axis=1)

X_train_b = add_intercept(X_train_scaled)
X_test_b = add_intercept(X_test_scaled)

print(f"Shape with intercept: {X_train_b.shape}")

## 4. Cost Function (MSE)
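
The cost used here carries a factor of $\frac{1}{2}$ in front of the mean squared error. A short derivation, written in matrix form and assuming $X$ already includes the intercept column, shows why that factor is convenient: it cancels when differentiating.

$$J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y)$$

$$\nabla_\theta J(\theta) = \frac{1}{m}X^T(X\theta - y)$$

Substituting this gradient into $\theta := \theta - \alpha \nabla_\theta J(\theta)$ gives the vectorized update $\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)$.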

In [None]:
def compute_cost(X, y, theta):
    """
    Compute the half mean squared error cost.
    
    J(θ) = (1/2m) * sum((h(x) - y)^2), i.e. MSE / 2; the 1/2 cancels in the gradient.
    """
    m = len(y)
    predictions = X @ theta
    errors = predictions - y.reshape(-1, 1)
    cost = (1 / (2 * m)) * np.sum(errors ** 2)
    return cost

# Test with random theta
test_theta = np.random.randn(X_train_b.shape[1], 1)
initial_cost = compute_cost(X_train_b, y_train, test_theta)
print(f"Initial cost (random theta): {initial_cost:.4f}")

## 5. Gradient Descent Implementation
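
The update rule in this section relies on the gradient $\frac{1}{m}X^T(X\theta - y)$. A common way to gain confidence in a hand-derived gradient is to compare it with a finite-difference approximation. The sketch below does that on small random data; the function names, the data, and `eps` are illustrative assumptions rather than part of the notebook's own code.

```python
import numpy as np

def cost(X, y, theta):
    """Half mean squared error: (1/2m) * sum((X theta - y)^2)."""
    errors = X @ theta - y
    return float(np.sum(errors ** 2) / (2 * len(y)))

def analytic_gradient(X, y, theta):
    """Gradient used in the update rule: (1/m) * X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)

def numerical_gradient(X, y, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = eps
        grad[j, 0] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # hypothetical data
y_demo = rng.normal(size=(20, 1))
theta_demo = rng.normal(size=(3, 1))

max_diff = np.max(np.abs(analytic_gradient(X_demo, y_demo, theta_demo)
                         - numerical_gradient(X_demo, y_demo, theta_demo)))
print(f"Largest difference between gradients: {max_diff:.2e}")
```

If the two gradients disagree by more than rounding error, the analytic formula or its implementation is the likely culprit.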

In [None]:
def gradient_descent(X, y, theta, learning_rate, num_iterations):
    """
    Perform gradient descent to learn theta.
    
    Update rule: θ := θ - α * (1/m) * X^T * (Xθ - y)
    
    Args:
        X: Feature matrix with intercept (m, n+1)
        y: Target vector (m,)
        theta: Initial parameters (n+1, 1)
        learning_rate: Step size α
        num_iterations: Number of iterations
    
    Returns:
        theta: Optimized parameters
        cost_history: Cost at each iteration
    """
    m = len(y)
    cost_history = []
    theta = theta.copy()
    
    for i in range(num_iterations):
        # Compute predictions
        predictions = X @ theta
        
        # Compute errors
        errors = predictions - y.reshape(-1, 1)
        
        # Compute gradient
        gradient = (1 / m) * (X.T @ errors)
        
        # Update parameters
        theta = theta - learning_rate * gradient
        
        # Store cost
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
        
        # Print progress
        if (i + 1) % 100 == 0:
            print(f"Iteration {i+1:4d}: Cost = {cost:.4f}")
    
    return theta, cost_history

## 6. Train the Model
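
Because linear regression also has a closed-form least-squares solution, trained parameters can be sanity-checked against it. The sketch below runs the same batch update on hypothetical standardized data and compares the result with `np.linalg.lstsq`; the data, the learning rate of 0.1, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized feature plus intercept column
x = rng.normal(size=(80, 1))
x = (x - x.mean()) / x.std()
X_demo = np.column_stack([np.ones(80), x])
y_demo = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=(80, 1))

# Plain batch gradient descent: theta := theta - (alpha / m) * X^T (X theta - y)
theta_gd = np.zeros((2, 1))
for _ in range(1000):
    theta_gd = theta_gd - 0.1 * X_demo.T @ (X_demo @ theta_gd - y_demo) / len(y_demo)

# Closed-form least-squares solution for comparison
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("Gradient descent:", theta_gd.ravel())
print("Least squares:   ", theta_ls.ravel())
```

Close agreement between the two suggests gradient descent has converged; a visible gap usually means more iterations or a different learning rate are needed.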

In [None]:
# Initialize parameters
initial_theta = np.zeros((X_train_b.shape[1], 1))

# Hyperparameters
learning_rate = 0.01
num_iterations = 1000

print(f"Training with learning rate = {learning_rate}, iterations = {num_iterations}\n")

# Run gradient descent
theta_optimal, cost_history = gradient_descent(
    X_train_b, y_train, initial_theta, learning_rate, num_iterations
)

print(f"\nFinal parameters:")
print(f"  θ₀ (intercept): {theta_optimal[0][0]:.4f}")
print(f"  θ₁ (slope):     {theta_optimal[1][0]:.4f}")
print(f"\nFinal cost: {cost_history[-1]:.4f}")

## 7. Visualize Cost Convergence

In [None]:
plt.figure(figsize=(12, 5))

# Full history
plt.subplot(1, 2, 1)
plt.plot(cost_history, 'b-', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Cost J(θ)')
plt.title('Cost Function Convergence')
plt.grid(True, alpha=0.3)

# Last 50% of iterations (to see final convergence)
plt.subplot(1, 2, 2)
start_idx = len(cost_history) // 2
plt.plot(range(start_idx, len(cost_history)), cost_history[start_idx:], 'g-', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Cost J(θ)')
plt.title('Cost Convergence (Last 50% of iterations)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Cost decreased from {cost_history[0]:.4f} (after the first update) to {cost_history[-1]:.4f}")
print(f"Reduction: {((cost_history[0] - cost_history[-1]) / cost_history[0] * 100):.2f}%")

## 8. Make Predictions and Evaluate

In [None]:
def predict(X, theta):
    """Make predictions."""
    return (X @ theta).flatten()

# Predictions
y_train_pred = predict(X_train_b, theta_optimal)
y_test_pred = predict(X_test_b, theta_optimal)

# Metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("=" * 50)
print("MODEL PERFORMANCE")
print("=" * 50)
print(f"Training Set:")
print(f"  R²:   {train_r2:.4f}")
print(f"  RMSE: {train_rmse:.4f}")
print(f"\nTest Set:")
print(f"  R²:   {test_r2:.4f}")
print(f"  RMSE: {test_rmse:.4f}")
print("=" * 50)

## 9. Visualize Results

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Regression line
axes[0].scatter(X_train_scaled, y_train, alpha=0.6, label='Training')
axes[0].scatter(X_test_scaled, y_test, alpha=0.6, color='orange', label='Test')

X_range = np.linspace(X_train_scaled.min(), X_train_scaled.max(), 100).reshape(-1, 1)
X_range_b = add_intercept(X_range)
y_range_pred = predict(X_range_b, theta_optimal)
axes[0].plot(X_range, y_range_pred, 'r-', linewidth=2, label='Regression line')

axes[0].set_xlabel('X (scaled)')
axes[0].set_ylabel('y')
axes[0].set_title('Linear Regression (Gradient Descent)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Residuals
residuals_test = y_test - y_test_pred
axes[1].scatter(y_test_pred, residuals_test, alpha=0.6)
axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 10. Experiment with Different Learning Rates
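
For a quadratic cost such as this one, batch gradient descent is stable only when $\alpha < 2 / \lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of $X^T X / m$. The sketch below computes that limit on hypothetical standardized data and runs a few learning rates on either side of it. The data, the helper `final_cost`, and the tested values of `alpha` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized feature with an intercept column
x = rng.normal(size=(100, 1))
x = (x - x.mean()) / x.std()
X_demo = np.column_stack([np.ones(100), x])
y_demo = 1.0 + 4.0 * x + rng.normal(scale=0.5, size=(100, 1))
m = len(y_demo)

# Largest eigenvalue of the Hessian X^T X / m sets the stability limit
lambda_max = np.linalg.eigvalsh(X_demo.T @ X_demo / m).max()
alpha_limit = 2.0 / lambda_max

def final_cost(alpha, steps=50):
    """Run batch gradient descent from zeros and return the last cost."""
    theta = np.zeros((2, 1))
    for _ in range(steps):
        theta = theta - alpha * X_demo.T @ (X_demo @ theta - y_demo) / m
    return float(np.sum((X_demo @ theta - y_demo) ** 2) / (2 * m))

print(f"Stability limit 2 / lambda_max: {alpha_limit:.3f}")
for alpha in (0.5, 1.9, 2.1):
    print(f"alpha = {alpha}: cost after 50 steps = {final_cost(alpha):.4g}")
```

With a standardized feature and an intercept column, $X^T X / m$ is the identity matrix, so the limit is 2 and every learning rate below it converges, only at different speeds.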

In [None]:
learning_rates = [0.001, 0.01, 0.1, 0.5]
iterations = 500

plt.figure(figsize=(12, 6))

for lr in learning_rates:
    theta_init = np.zeros((X_train_b.shape[1], 1))
    _, cost_hist = gradient_descent(X_train_b, y_train, theta_init, lr, iterations)
    plt.plot(cost_hist, label=f'α = {lr}', linewidth=2)

plt.xlabel('Iteration')
plt.ylabel('Cost J(θ)')
plt.title('Effect of Learning Rate on Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

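print("Observations (compare with the plot above):")
print("- α = 0.001: slow; cost is still far from its minimum after 500 iterations")
print("- α = 0.01 to 0.1: steady, stable convergence")
print("- α = 0.5: fastest of the four here, because the feature is standardized")
print("- α above 2 / λ_max of XᵀX/m (2 for this standardized feature): oscillates and diverges")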

## 11. Multiple Features Example

In [None]:
# Generate multi-feature data
X_multi, y_multi = make_regression(n_samples=200, n_features=5, noise=15, random_state=42)
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

# Scale
scaler_m = StandardScaler()
X_train_m_scaled = scaler_m.fit_transform(X_train_m)
X_test_m_scaled = scaler_m.transform(X_test_m)

# Add intercept
X_train_m_b = add_intercept(X_train_m_scaled)
X_test_m_b = add_intercept(X_test_m_scaled)

# Train
theta_init_m = np.zeros((X_train_m_b.shape[1], 1))
theta_multi, cost_hist_m = gradient_descent(X_train_m_b, y_train_m, theta_init_m, 0.01, 1000)

# Evaluate
y_test_pred_m = predict(X_test_m_b, theta_multi)
print(f"\nMultiple Features Performance:")
print(f"  Test R²:   {r2_score(y_test_m, y_test_pred_m):.4f}")
print(f"  Test RMSE: {np.sqrt(mean_squared_error(y_test_m, y_test_pred_m)):.4f}")

## Key Takeaways

### Gradient Descent Algorithm:
1. **Iterative optimization** - Updates parameters gradually
2. **Learning rate (α)** - Controls step size (critical hyperparameter)
3. **Convergence** - Cost decreases over iterations
4. **Feature scaling** - Essential for faster convergence

### Advantages:
- ✅ Scales to large datasets
- ✅ Works with many features
- ✅ Memory efficient
- ✅ Can be used for online learning
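
As a rough illustration of the online-learning point, the sketch below updates the parameters from one sample at a time instead of the full batch (stochastic gradient descent, a variant that this notebook does not otherwise implement). The function `sgd_step`, the simulated data stream, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(theta, x_row, y_value, alpha):
    """Update theta from a single sample (x_row includes the intercept term)."""
    error = x_row @ theta - y_value          # scalar prediction error
    return theta - alpha * error * x_row     # gradient of 0.5 * error**2

# Hypothetical data stream: y = 2 + 3x plus a little noise
theta = np.zeros(2)
for _ in range(5000):
    x = rng.normal()
    y_value = 2.0 + 3.0 * x + rng.normal(scale=0.1)
    theta = sgd_step(theta, np.array([1.0, x]), y_value, alpha=0.01)

print(theta)
```

Each step touches only one sample, so the full dataset never has to be held in memory; the price is a noisier path toward the minimum.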

### Challenges:
- ❌ Requires tuning learning rate
- ❌ Needs multiple iterations
- ❌ May need feature scaling
- ❌ Number of iterations depends on initialization (the minimum cost reached does not, because the squared-error cost is convex)

### Best Practices:
1. **Always scale features** (standardization or normalization)
2. **Start with small learning rate** (0.001 - 0.01)
3. **Plot cost history** to verify convergence
4. **Monitor for oscillation** or divergence
5. **Use vectorized operations** for efficiency
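
Practices 3 and 4 can also be checked programmatically. The helper below is a minimal sketch: the function name `diagnose_cost_history`, its labels, and the tolerance are illustrative choices, not a standard API. It expects a cost history with at least two entries.

```python
import numpy as np

def diagnose_cost_history(cost_history, tol=1e-8):
    """Label a cost history as diverging, oscillating, converged, or still decreasing."""
    costs = np.asarray(cost_history, dtype=float)
    # Non-finite values or a cost that ends higher than it started: diverging
    if not np.all(np.isfinite(costs)) or costs[-1] > costs[0]:
        return "diverging"
    # Any increase between consecutive iterations: oscillating
    if np.any(np.diff(costs) > tol):
        return "oscillating"
    # Last step changed the cost by a negligible amount: converged
    if abs(costs[-2] - costs[-1]) <= tol * max(costs[-1], 1.0):
        return "converged"
    return "still decreasing"

print(diagnose_cost_history([10.0, 5.0, 2.5, 1.25]))
print(diagnose_cost_history([10.0, 4.0, 6.0, 3.0]))
print(diagnose_cost_history([10.0, 20.0, 40.0]))
print(diagnose_cost_history([10.0, 1.0, 1.0, 1.0]))
```

A result of "still decreasing" suggests running more iterations or raising the learning rate, while "oscillating" or "diverging" suggests lowering it.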