### Step 1: The 2D linear regression model and per‑point loss

We work with a **two‑dimensional** linear regression model with no explicit intercept term: the input has two features $x_0$ and $x_1$, and the model has two parameters $\theta_0$ and $\theta_1$. [mattsosna](https://mattsosna.com/LR-grad-desc/)

- Model prediction for a single data point $i$:

$$
\hat{y}_i = \theta_0 x_{0,i} + \theta_1 x_{1,i}
$$

- Squared error loss **for that single point**:

$$
\ell_i(\theta_0,\theta_1) = (y_i - \hat{y}_i)^2
= \big(y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i}\big)^2
$$

Here:

- $y_i$: true target for point $i$
- $x_{0,i}, x_{1,i}$: the two feature values for point $i$
- $\theta_0, \theta_1$: model parameters we want to learn

This is the **per-example loss**; it has not yet been averaged over all points. [mattsosna](https://mattsosna.com/LR-grad-desc/)

In the tips example:

- $x_0$ is a **bias** feature (all ones), so $\theta_0$ acts as the intercept: a fixed base tip that does not depend on the bill. [pyimagesearch](https://pyimagesearch.com/2016/10/10/gradient-descent-with-python/)
- $x_1$ is the **bill amount**. [pyimagesearch](https://pyimagesearch.com/2016/10/10/gradient-descent-with-python/)

So the prediction is effectively:

$$
\hat{y}_i = \theta_0 \cdot 1 + \theta_1 \cdot \text{bill}_i
$$
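As a small illustration of the model and the per-point loss, here is a sketch on made-up numbers. The three data points and the value of `theta` are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the tips dataset.

```python
import numpy as np

# Hypothetical toy data: a bias column of ones (x0) and three bills (x1)
x0 = np.array([1.0, 1.0, 1.0])
x1 = np.array([8.0, 16.0, 24.0])
y = np.array([2.0, 3.0, 5.0])      # made-up tips

theta = np.array([1.0, 0.125])     # [theta_0, theta_1], chosen arbitrarily

# Per-point prediction: y_hat_i = theta_0 * x0_i + theta_1 * x1_i
y_hat = theta[0] * x0 + theta[1] * x1

# Per-point squared error: l_i = (y_i - y_hat_i)^2, one value per point (not averaged)
per_point_loss = (y - y_hat) ** 2

print(y_hat)           # [2. 3. 4.]
print(per_point_loss)  # [0. 0. 1.]
```

Only the third point contributes to the loss here, because the first two predictions happen to be exact.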


### Step 2: Gradient of the loss for one data point

The gradient has two components:

$$
\nabla_{\theta} \ell_i =
\begin{bmatrix}
\dfrac{\partial \ell_i}{\partial \theta_0} \\
\dfrac{\partial \ell_i}{\partial \theta_1}
\end{bmatrix}
$$

Let

$$
e_i = y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i}
$$

Then

$$
\ell_i = e_i^2
$$

Use the chain rule:

1. $\dfrac{\partial \ell_i}{\partial e_i} = 2 e_i$  
2. $\dfrac{\partial e_i}{\partial \theta_0} = -x_{0,i}$  
3. $\dfrac{\partial e_i}{\partial \theta_1} = -x_{1,i}$

So:

$$
\frac{\partial \ell_i}{\partial \theta_0}
= 2 e_i \cdot (-x_{0,i})
= -2\, (y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})\, x_{0,i}
$$

$$
\frac{\partial \ell_i}{\partial \theta_1}
= 2 e_i \cdot (-x_{1,i})
= -2\, (y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})\, x_{1,i}
$$

The transcript writes these as:

- $2(y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})(-x_{0,i})$ for the θ₀ component
- $2(y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})(-x_{1,i})$ for the θ₁ component

and notes that the factor $-x_{0,i}$ or $-x_{1,i}$ comes from the chain rule. [mattsosna](https://mattsosna.com/LR-grad-desc/)

In column‑vector form:

$$
\nabla_{\theta} \ell_i =
\begin{bmatrix}
-2\, (y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})\, x_{0,i} \\
-2\, (y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})\, x_{1,i}
\end{bmatrix}
$$
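One way to sanity-check a hand-derived gradient like this is to compare it against central finite differences. The sketch below does that for a single made-up data point; the function names `point_loss`, `point_gradient`, and `numerical_gradient` are hypothetical helpers, not part of the transcript's code.

```python
import numpy as np

def point_loss(theta, x0_i, x1_i, y_i):
    """Squared error for one data point."""
    return (y_i - theta[0] * x0_i - theta[1] * x1_i) ** 2

def point_gradient(theta, x0_i, x1_i, y_i):
    """Analytic gradient from the chain rule: -2 * e_i * [x0_i, x1_i]."""
    e_i = y_i - theta[0] * x0_i - theta[1] * x1_i
    return np.array([-2 * e_i * x0_i, -2 * e_i * x1_i])

def numerical_gradient(theta, x0_i, x1_i, y_i, h=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        loss_plus = point_loss(theta + step, x0_i, x1_i, y_i)
        loss_minus = point_loss(theta - step, x0_i, x1_i, y_i)
        grad[j] = (loss_plus - loss_minus) / (2 * h)
    return grad

theta = np.array([0.5, 0.2])
x0_i, x1_i, y_i = 1.0, 10.0, 3.0   # hypothetical single data point

analytic = point_gradient(theta, x0_i, x1_i, y_i)
numeric = numerical_gradient(theta, x0_i, x1_i, y_i)
print(analytic)  # e_i = 3 - 0.5 - 2 = 0.5, so the gradient is [-1, -10]
print(numeric)   # should agree up to floating-point error
```

If the two vectors disagreed noticeably, a sign error or a dropped factor of 2 in the derivation would be the first thing to look for.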

### Step 3: Average gradient over all data points

The model loss over the **whole dataset** is the average of per‑point losses:

$$
L(\theta_0,\theta_1)
= \frac{1}{n} \sum_{i=1}^n \ell_i(\theta_0,\theta_1)
$$

So the gradient of the overall loss is the average of the per‑point gradients:

$$
\nabla_{\theta} L
= \frac{1}{n} \sum_{i=1}^n \nabla_{\theta} \ell_i
$$

Component‑wise:

$$
\frac{\partial L}{\partial \theta_0}
= \frac{1}{n}\sum_{i=1}^n -2\, (y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})\, x_{0,i}
$$

$$
\frac{\partial L}{\partial \theta_1}
= \frac{1}{n}\sum_{i=1}^n -2\, (y_i - \theta_0 x_{0,i} - \theta_1 x_{1,i})\, x_{1,i}
$$

Note that the per-point gradients are **averaged** (divided by $n$), not just summed. [mattsosna](https://mattsosna.com/LR-grad-desc/)
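The two component-wise averages can also be written as one matrix expression, $\nabla_\theta L = -\tfrac{2}{n} X^\top (y - X\theta)$. The sketch below checks numerically that both forms agree; it uses synthetic data as a stand-in for the tips dataset, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic stand-in data: a bias column of ones plus one "bill" feature
X = np.column_stack([np.ones(n), rng.uniform(5, 50, size=n)])
y = 1.0 + 0.1 * X[:, 1] + rng.normal(0, 0.5, size=n)
theta = np.array([0.3, 0.05])

residual = y - X @ theta   # e_i for every point at once

# Component-wise averages, exactly as in the two formulas
grad_componentwise = np.array([
    np.mean(-2 * residual * X[:, 0]),
    np.mean(-2 * residual * X[:, 1]),
])

# Equivalent matrix form: -(2/n) * X^T (y - X theta)
grad_matrix = (-2 / n) * (X.T @ residual)

print(grad_componentwise)
print(grad_matrix)
```

The matrix form generalizes to any number of features without writing one line per parameter.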

### Step 4: Writing the gradient in Python 
The code mirrors the formulas directly, with:

- `x0` = first column of `X` (bias = ones)  
- `x1` = second column of `X` (bill)  
- `theta[0]` = $\theta_0$, `theta[1]` = $\theta_1$

```python
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.linear_model import LinearRegression

# Load the tips dataset
df = sns.load_dataset("tips")

# Add a bias column of ones
df["bias"] = 1.0

# Feature matrix X and target y
X = df[["bias", "total_bill"]]
y = df["tip"]
```

```python
def mse_gradient(theta, X, y_obs):
    """Returns the gradient of the MSE on our data for the given theta"""
    x0 = X.iloc[:, 0]
    x1 = X.iloc[:, 1]
    dth0 = np.mean(-2 * (y_obs - theta[0]*x0 - theta[1]*x1) * x0)
    dth1 = np.mean(-2 * (y_obs - theta[0]*x0 - theta[1]*x1) * x1)
    return np.array([dth0, dth1])
```

```python
theta_init = np.array([0.0, 0.0])
grad_at_zero = mse_gradient(theta_init, X, y)
print(grad_at_zero)
```

Output:

```
[  -5.99655738 -135.22631803]
```

Both components are negative, so gradient descent, which subtracts the gradient, will **increase** both θ₀ and θ₁ to reduce the loss. [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)


### Step 5: Single‑argument gradient wrapper

As with the loss, it is convenient to create a **single‑argument** gradient function that only takes θ; X and y are captured from the outer scope:

```python
def gradient_single_arg(theta):
    # mse_gradient is the Step 4 function; X and y come from the enclosing scope
    return mse_gradient(theta, X, y)
```

Now you can evaluate the “2D slope” at any θ by calling:

```python
gradient_single_arg(np.array([0.0, 0.0]))
# -> large magnitude entries (far from optimum)

gradient_single_arg(np.array([0.9, 0.1]))
# -> much smaller gradient (close to optimum)
```

The transcript notes that for θ close to the correct answer (~0.9, 0.1), the gradient components are **small**, indicating you are near a minimum. [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)
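To see why a small gradient signals closeness to the minimum, the sketch below compares the gradient norm at the origin with the norm at the least-squares solution. It runs on synthetic data rather than the tips dataset, and `X_demo`, `y_demo`, and `demo_gradient` are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in data: bias column plus one feature
X_demo = np.column_stack([np.ones(n), rng.uniform(5, 50, size=n)])
y_demo = 0.9 + 0.1 * X_demo[:, 1] + rng.normal(0, 1.0, size=n)

def demo_gradient(theta):
    """Single-argument MSE gradient; X_demo and y_demo come from the enclosing scope."""
    residual = y_demo - X_demo @ theta
    return (-2 / n) * (X_demo.T @ residual)

# Least-squares solution, used here only as a reference optimum
theta_best, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

norm_at_zero = np.linalg.norm(demo_gradient(np.zeros(2)))
norm_at_best = np.linalg.norm(demo_gradient(theta_best))
print(norm_at_zero)  # large: far from the optimum
print(norm_at_best)  # essentially zero: at the optimum
```

At the exact minimizer the gradient vanishes up to floating-point error, which is the limiting case of the "small gradient near the answer" observation.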

***

### Step 6: Reusing the same gradient descent function (1D → 2D)

A crucial point in the transcript: the **same** generic gradient‑descent function used in 1D works unchanged in 2D, thanks to NumPy’s vectorization. [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)

A generic gradient descent:

```python
def gradient_descent(df, theta_init, alpha, n_steps):
    """
    df:  function that returns gradient given theta (shape (d,))
    theta_init: initial parameter vector
    alpha: learning rate
    n_steps: number of iterations
    """
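    # np.array makes a float copy and accepts a plain scalar (1D) or an array (2D);
    # a plain Python float has no .copy() method
    theta = np.array(theta_init, dtype=float)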
    for step in range(n_steps):
        grad = df(theta)          # df returns vector gradient
        theta = theta - alpha * grad
    return theta
```

Note:

- In 1D, `theta` was a scalar, `grad` was a scalar; subtraction worked.  
- In 2D, `theta` is a 2‑element NumPy array, `grad` is a 2‑element array; subtraction also works elementwise.  
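A minimal sketch of this point, using two simple quadratic objectives instead of the regression loss. The names `gd` and `grad_fn` are hypothetical and only mirror the structure of `gradient_descent`.

```python
import numpy as np

def gd(grad_fn, theta_init, alpha, n_steps):
    """Plain gradient descent; the same code handles scalars and arrays."""
    theta = np.array(theta_init, dtype=float)   # float copy; 0-d array for a scalar
    for _ in range(n_steps):
        theta = theta - alpha * grad_fn(theta)  # elementwise for arrays
    return theta

# 1D: minimize (t - 3)^2, whose gradient is 2 * (t - 3)
theta_1d = gd(lambda t: 2 * (t - 3.0), 0.0, alpha=0.1, n_steps=200)

# 2D: minimize (t0 - 1)^2 + (t1 + 2)^2, gradient computed elementwise
target = np.array([1.0, -2.0])
theta_2d = gd(lambda t: 2 * (t - target), np.zeros(2), alpha=0.1, n_steps=200)

print(theta_1d)  # close to 3.0
print(theta_2d)  # close to [ 1. -2.]
```

Nothing in `gd` refers to the dimension of `theta`; the update line is the same expression in both calls.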

So we simply pass the new 2D gradient:

```python
theta_start = np.array([0.0, 0.0])
alpha = 0.0001            # example learning rate
n_steps = 100000          # many steps (the transcript notes this is slow)

theta_gd = gradient_descent(gradient_single_arg, theta_start, alpha, n_steps)
print(theta_gd)
```

- The instructor mentions not actually running it live because it takes ~30s–1min as written. [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)
- After running, the trajectory moves from (0, 0) toward something like (0.888, 0.10), close to the optimal (0.92, 0.105). [towardsdatascience](https://towardsdatascience.com/linear-regression-and-gradient-descent-for-absolute-beginners-eef9574eadb0/)
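The gap between the gradient descent result and the optimum can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the data are randomly generated at roughly the scale of the tips data, the reference optimum comes from `np.linalg.lstsq`, and `run_gd` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 244

# Synthetic "bills" and "tips"; not the real tips dataset
bill = rng.uniform(5, 50, size=n)
X_syn = np.column_stack([np.ones(n), bill])
y_syn = 0.9 + 0.1 * bill + rng.normal(0, 1.0, size=n)

def grad(theta):
    """MSE gradient in matrix form."""
    return (-2 / n) * (X_syn.T @ (y_syn - X_syn @ theta))

def run_gd(n_steps, alpha=1e-4):
    """Fixed-step gradient descent starting from (0, 0)."""
    theta = np.zeros(2)
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta

# Reference optimum from least squares
theta_opt, *_ = np.linalg.lstsq(X_syn, y_syn, rcond=None)

gap_short = np.linalg.norm(run_gd(10_000) - theta_opt)
gap_long = np.linalg.norm(run_gd(100_000) - theta_opt)
print(theta_opt)
print(gap_short, gap_long)  # the longer run ends closer to the optimum
```

With a small fixed learning rate the intercept converges slowly, so even 100,000 steps leave a visible gap, which is consistent with the behavior described above.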

You can track the path by recording θ at each step:

```python
def gradient_descent_with_trace(df, theta_init, alpha, n_steps):
    # float copy that accepts a plain scalar (1D) or an array (2D)
    theta = np.array(theta_init, dtype=float)
    trace = [theta.copy()]
    for step in range(n_steps):
        grad = df(theta)
        theta = theta - alpha * grad
        trace.append(theta.copy())
    return theta, np.array(trace)
```

This trace can be plotted over the 2D loss surface (contours) to visualize the “path” sliding down toward the minimum; the transcript says you will explore such visualization on homework. [mattsosna](https://mattsosna.com/LR-grad-desc/)
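Before plotting, it can help to check that the recorded path really goes downhill. The sketch below records a short trace on synthetic data and evaluates the MSE at every recorded θ; the data, the `mse` helper, and the step count are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Synthetic stand-in data: bias column plus one feature
X_t = np.column_stack([np.ones(n), rng.uniform(5, 50, size=n)])
y_t = 0.9 + 0.1 * X_t[:, 1] + rng.normal(0, 1.0, size=n)

def mse(theta):
    """Mean squared error at theta."""
    return np.mean((y_t - X_t @ theta) ** 2)

def grad(theta):
    """MSE gradient in matrix form."""
    return (-2 / n) * (X_t.T @ (y_t - X_t @ theta))

# Record theta at every step, as in gradient_descent_with_trace
theta = np.zeros(2)
trace = [theta.copy()]
for _ in range(500):
    theta = theta - 1e-4 * grad(theta)
    trace.append(theta.copy())
trace = np.array(trace)                      # shape (501, 2): one row per step

losses = np.array([mse(t) for t in trace])   # loss at every recorded theta
print(trace.shape)
print(losses[0], losses[-1])                 # the loss drops along the path
```

The two columns of `trace` are the θ₀ and θ₁ coordinates of the path, so with matplotlib they could be drawn with `plt.plot(trace[:, 0], trace[:, 1])` on top of a `plt.contour` of the loss evaluated on a grid.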

***

### Step 7: Behavior of gradient descent in 2D

The transcript notes several qualitative behaviors:

- Starting from θ = (0, 0), both components of the gradient are negative, so the algorithm increases both θ₀ and θ₁. [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)
- Early steps: the parameters move **quickly**, making large jumps as gradients are large. [towardsdatascience](https://towardsdatascience.com/linear-regression-and-gradient-descent-for-absolute-beginners-eef9574eadb0/)
- As we approach the bottom of the “bowl” (the convex loss surface), gradients shrink, steps become small, and θ values approach the optimal solution. [towardsdatascience](https://towardsdatascience.com/linear-regression-and-gradient-descent-for-absolute-beginners-eef9574eadb0/)
- With the simple implementation and the chosen learning rate, the algorithm reaches about (0.888, 0.10), close to but not exactly at the scikit-learn optimum, because:
  - learning rate may be suboptimal  
  - fixed step size  
  - limited iterations  

The instructor comments that this simple implementation is **“finicky”**: different starting points or learning rates can change convergence speed and behavior. Industrial‑strength implementations need more robust strategies (e.g., adaptive step sizes, momentum, etc.). [stackabuse](https://stackabuse.com/gradient-descent-in-python-implementation-and-theory/)
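One concrete source of this sensitivity is the learning rate. For a quadratic loss such as MSE, fixed-step gradient descent is stable only when the learning rate is below 2 divided by the largest eigenvalue of the Hessian, which for MSE is the constant matrix (2/n)·XᵀX. The sketch below illustrates this on synthetic data; the data and the helper `final_loss` are assumptions made for this example, not part of the transcript.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100

# Synthetic stand-in data: bias column plus one feature
X_s = np.column_stack([np.ones(n), rng.uniform(5, 50, size=n)])
y_s = 0.9 + 0.1 * X_s[:, 1] + rng.normal(0, 1.0, size=n)

def mse(theta):
    """Mean squared error at theta."""
    return np.mean((y_s - X_s @ theta) ** 2)

def grad(theta):
    """MSE gradient in matrix form."""
    return (-2 / n) * (X_s.T @ (y_s - X_s @ theta))

def final_loss(alpha, n_steps=200):
    """Run fixed-step gradient descent from (0, 0) and return the final loss."""
    theta = np.zeros(2)
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return mse(theta)

# The Hessian of the MSE is constant: H = (2/n) X^T X
hessian = (2 / n) * (X_s.T @ X_s)
alpha_max = 2 / np.linalg.eigvalsh(hessian).max()   # stability threshold

loss_start = mse(np.zeros(2))
loss_safe = final_loss(0.5 * alpha_max)       # below the threshold: loss goes down
loss_unstable = final_loss(1.1 * alpha_max)   # above the threshold: loss blows up

print(alpha_max)
print(loss_start, loss_safe, loss_unstable)
```

Because the bill feature is much larger than the bias feature, the threshold is small, which forces a small learning rate and in turn makes progress on the intercept slow.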

***

### Step 8: Big picture and why this is “cool”

The transcript’s main takeaways:

- You derive the gradient of the **per-example squared error** for the 2D linear regression model using calculus and the chain rule. [youtube](https://www.youtube.com/watch?v=KLXP2RL0-Vg)
- You average across all data to get the gradient of the overall loss. [mattsosna](https://mattsosna.com/LR-grad-desc/)
- You implement that gradient directly in Python/NumPy and plug it into a **generic** gradient descent routine previously used for 1D. [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)
- Without relying on scikit‑learn or other libraries’ optimizers, you now numerically find near‑optimal values for both θ₀ and θ₁ for the tips dataset. [mattsosna](https://mattsosna.com/LR-grad-desc/)

You’ve effectively built your own small 2D gradient descent optimizer for linear regression from first principles.


An equivalent, more explicit version of the Step 4 gradient, `gradient_theta` (note the argument order `X, y_obs, theta`):

```python
import numpy as np

def gradient_theta(X, y_obs, theta):
    """
    X: shape (n_samples, 2) -> columns: x0 (bias), x1 (bill)
    y_obs: shape (n_samples,)
    theta: shape (2,) -> [theta0, theta1]
    """
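    # Convert to NumPy arrays so that X[:, 0] also works when X is the
    # pandas DataFrame built in Step 4 (DataFrames do not support X[:, 0])
    X = np.asarray(X, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)

    x0 = X[:, 0]  # all ones (bias)
    x1 = X[:, 1]  # bill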

    theta0, theta1 = theta

    # prediction: theta0 * x0 + theta1 * x1
    y_pred = theta0 * x0 + theta1 * x1
    residual = y_obs - y_pred  # y_i - (theta0*x0 + theta1*x1)

    # components of gradient (note the minus sign)
    d_theta0 = -2 * residual * x0
    d_theta1 = -2 * residual * x1

    # average across all points (the transcript adds np.mean here as a fix)
    grad_theta0 = np.mean(d_theta0)
    grad_theta1 = np.mean(d_theta1)

    return np.array([grad_theta0, grad_theta1])
```

- Initially, the instructor forgets to take the mean and then fixes it with `np.mean`. [mattsosna](https://mattsosna.com/LR-grad-desc/)
- The returned gradient has two components, for θ₀ and θ₁.

If you call:

```python
theta_init = np.array([0.0, 0.0])
grad_at_zero = gradient_theta(X, y, theta_init)
print(grad_at_zero)
```

You get something like:

- $[-5.9966,\ -135.23]$

which matches the transcript’s description: both are negative, meaning **increase** both θ₀ and θ₁ to reduce loss (since gradient descent updates subtract the gradient). [stackoverflow](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy)
