# 🔬 Linear Regression: Closed-Form vs Gradient Descent

This notebook accompanies the problem set from **Session 1** of *Decode Life 2025*.

## Goals:
- Generate synthetic linear data with small noise.
- Fit a line using the **closed-form equation** and **gradient descent**.
- Compare the two methods in low-data vs. high-data regimes.

## Model:
We assume the model:
\begin{align}
y = wx + b + \text{noise}
\end{align}

Stacking all $n$ observations and ignoring the noise term, this can be written in matrix form as:
\begin{align}
\mathbf{y} = X \boldsymbol{\theta}, \quad \text{where } X =
\begin{bmatrix}
x_1 & 1 \\
x_2 & 1 \\
\vdots & \vdots \\
x_n & 1
\end{bmatrix}, \quad
\boldsymbol{\theta} =
\begin{bmatrix}
w \\
b
\end{bmatrix}
\end{align}

Our task is to minimize the total squared error:
\begin{align}
\min_{\boldsymbol{\theta}} \| X\boldsymbol{\theta} - \mathbf{y} \|^2
\end{align}

We will do this using:
1. The closed-form solution (the normal equation):
\begin{align}
\boldsymbol{\theta}^* = (X^\top X)^{-1} X^\top \mathbf{y}
\end{align}
2. Gradient descent, where $\eta$ is the learning rate and $J$ is the loss being minimized:
\begin{align}
w \leftarrow w - \eta \frac{\partial J}{\partial w}, \quad b \leftarrow b - \eta \frac{\partial J}{\partial b}
\end{align}
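
To make both recipes concrete, here is a short derivation. It assumes that $J$ is the mean squared error; the $1/n$ scaling is an assumption, since the notebook does not define $J$ explicitly. This choice has the same minimizer as the total squared error above.
\begin{align}
J(w, b) &= \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)^2 = \frac{1}{n} \| X\boldsymbol{\theta} - \mathbf{y} \|^2 \\
\frac{\partial J}{\partial w} &= \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)\, x_i \\
\frac{\partial J}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i) \\
\nabla_{\boldsymbol{\theta}} J &= \frac{2}{n} X^\top (X\boldsymbol{\theta} - \mathbf{y}) = 0
\;\Longrightarrow\; X^\top X \boldsymbol{\theta}^* = X^\top \mathbf{y}
\end{align}

Setting the gradient to zero gives the normal equations, and solving them (when $X^\top X$ is invertible) yields the closed-form solution. If $J$ were instead the un-normalized total squared error, the gradients would be $n$ times larger, so the learning rate $\eta$ would need to be correspondingly smaller.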

In [None]:
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_points = 100
x = np.linspace(0, 10, n_points)
true_w = 3
true_b = 2
noise = np.random.normal(0, 1, size=n_points)

# Generate y
y = true_w * x + true_b + noise

# Save to CSV
data = pd.DataFrame({'x': x, 'y': y})
data.to_csv("linear_regression_data.csv", index=False)

print("Data saved to linear_regression_data.csv")

# Linear Regression: Closed-Form vs Gradient Descent

This notebook explores two methods for fitting a line to data:

- Closed-form solution using the normal equation
- Gradient descent, which iteratively updates the parameters along the negative gradient of the loss

We will compare both methods under two settings:
- Low-data regime (few data points, small feature range)
- High-data regime (many data points, wide feature range)
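
As a rough illustration of the two settings, the sketch below builds one small, narrow dataset and one large, wide dataset and fits each with NumPy's least-squares solver. The helper `make_data` and the specific sizes and ranges are hypothetical choices made for this example; the notebook does not fix them.

```python
import numpy as np


def make_data(n_points, x_max, true_w=3.0, true_b=2.0, noise_std=1.0, seed=42):
    """Hypothetical helper: sample y = true_w * x + true_b + noise on [0, x_max]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, x_max, n_points)
    y = true_w * x + true_b + rng.normal(0, noise_std, size=n_points)
    return x, y


# Assumed settings for the two regimes (not specified in the notebook).
x_low, y_low = make_data(n_points=10, x_max=1.0)
x_high, y_high = make_data(n_points=1000, x_max=10.0)
regimes = {"low-data": (x_low, y_low), "high-data": (x_high, y_high)}

fits = {}
for name, (xs, ys) in regimes.items():
    # Design matrix with a bias column, as in the model section.
    X = np.column_stack([xs, np.ones_like(xs)])
    theta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    fits[name] = theta
    print(f"{name}: n = {len(xs)}, w = {theta[0]:.3f}, b = {theta[1]:.3f}")
```

With few points over a narrow range, the estimated slope is typically much noisier than in the large, wide setting, which is what makes the comparison between the two regimes informative.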

## Step 1: Load the Data

First, load the synthetic data we generated. The file should contain two columns: `x` and `y`.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv("linear_regression_data.csv")
x = data['x'].values
y = data['y'].values

plt.scatter(x, y, alpha=0.6)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic Linear Data")
plt.grid(True)
plt.show()

## Step 2: Closed-Form Solution

In [None]:
import numpy as np

# Design matrix X with bias column
X = np.column_stack([x, np.ones_like(x)])

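# Find the closed-form solution

## TO DO: compute theta from the normal equation and store the result
## as w_closed and b_closed (both are used for plotting in Step 4)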
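
One possible way to fill in this step is sketched below. It is not necessarily the intended solution. The data is regenerated inline with the same parameters as the generation cell so that the sketch runs on its own; inside the notebook, `x` and `y` would already be loaded from the CSV. The name `theta_lstsq` is introduced here only as a cross-check.

```python
import numpy as np

# Recreate the notebook's synthetic data so this sketch is self-contained.
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 3 * x + 2 + np.random.normal(0, 1, size=100)

# Design matrix X with bias column
X = np.column_stack([x, np.ones_like(x)])

# Solve the normal equations (X^T X) theta = X^T y.
# np.linalg.solve avoids forming the explicit inverse of X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
w_closed, b_closed = theta

# Cross-check against NumPy's general least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"Closed-form solution: w = {w_closed:.3f}, b = {b_closed:.3f}")
print("Matches lstsq:", np.allclose(theta, theta_lstsq))
```

Solving the linear system gives the same result as applying $(X^\top X)^{-1}$ directly, but is generally the more numerically stable habit.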


## Step 3: Gradient Descent Method

In [None]:
# Gradient descent implementation
def gradient_descent(x, y, eta=0.001, steps=1000):
    w, b = 0.0, 0.0
    n = len(x)

    # TO DO: repeat `steps` times: compute dJ/dw and dJ/db, then update w and b

    return w, b


w_gd, b_gd = gradient_descent(x, y, eta=0.001, steps=10000)
print(f"Gradient Descent solution: w = {w_gd:.3f}, b = {b_gd:.3f}")
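
A minimal sketch of one way to complete the loop is shown below. It assumes the loss $J$ is the mean squared error (an assumption; the notebook leaves $J$ unspecified) and regenerates the data inline so that it runs on its own. The names `residual`, `grad_w`, and `grad_b` are illustrative.

```python
import numpy as np


def gradient_descent(x, y, eta=0.001, steps=1000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        residual = w * x + b - y                   # prediction error per point
        grad_w = (2.0 / n) * np.dot(residual, x)   # dJ/dw
        grad_b = (2.0 / n) * residual.sum()        # dJ/db
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Recreate the notebook's synthetic data so this sketch is self-contained.
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 3 * x + 2 + np.random.normal(0, 1, size=100)

w_gd, b_gd = gradient_descent(x, y, eta=0.001, steps=10000)
print(f"Gradient Descent solution: w = {w_gd:.3f}, b = {b_gd:.3f}")
```

The intercept converges more slowly than the slope here because the loss surface is much flatter along the `b` direction, which is why the call uses 10000 steps rather than the default 1000.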

## Step 4: Compare Predictions


In [None]:
## TO DO
plt.scatter(x, y, label="Data", alpha=0.6)
plt.plot(x, w_closed * x + b_closed, label="Closed-form", linewidth=2)
plt.plot(x, w_gd * x + b_gd, label="Gradient Descent", linestyle="--", linewidth=2)
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.title("Line Fit Comparison")
plt.show()
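
The two lines in the plot can look identical, so a numeric comparison is a useful complement. The sketch below is self-contained: it recomputes both fits in matrix form, again assuming the mean squared error loss for gradient descent, and reports the parameter gap and the error of each fit. The helper `mse` and the names `theta_closed` and `theta_gd` are introduced for this example only.

```python
import numpy as np

# Recreate the notebook's synthetic data so this sketch is self-contained.
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 3 * x + 2 + np.random.normal(0, 1, size=100)
X = np.column_stack([x, np.ones_like(x)])
n = len(x)

# Closed-form fit via the normal equations.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent in matrix form: theta <- theta - eta * (2/n) X^T (X theta - y).
eta, steps = 0.001, 10000
theta_gd = np.zeros(2)
for _ in range(steps):
    theta_gd -= eta * (2.0 / n) * (X.T @ (X @ theta_gd - y))


def mse(theta):
    """Mean squared error of the line defined by theta = (w, b)."""
    return np.mean((X @ theta - y) ** 2)


gap = np.abs(theta_gd - theta_closed)
print(f"|w_gd - w_closed| = {gap[0]:.4f}, |b_gd - b_closed| = {gap[1]:.4f}")
print(f"MSE closed-form = {mse(theta_closed):.4f}, MSE gradient descent = {mse(theta_gd):.4f}")
```

Because the closed-form solution minimizes the squared error exactly, the gradient descent error can only match it or be slightly larger; the size of the gap indicates how close the iterations are to convergence.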