Gradient descent and the normal equation are two ways to fit the linear regression parameters $w_0, w_1$. Both minimize the same loss, typically the mean squared error (MSE).

### Role of gradient descent in linear regression

- In simple linear regression with one feature, the hypothesis is often written as $ \hat{y} = w_0 + w_1 x $. The goal is to choose $w_0, w_1$ to minimize the cost, the average squared difference between predictions and true values: $ J(w_0, w_1) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $.

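- Gradient descent does this by starting from an initial guess for the weights and repeatedly updating them in the direction of the negative gradient of the cost. With a sufficiently small learning rate, the loss decreases step by step until the weights converge to (or near) a minimum. Because the MSE cost is convex in the weights, that minimum is the global one.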
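The update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `mse` and `gradient_descent`, the toy data, the learning rate of 0.05, and the 2000 steps are all assumptions chosen for this example.

```python
def mse(w0, w1, xs, ys):
    """Mean squared error of the line y_hat = w0 + w1 * x."""
    n = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit w0 and w1 by batch gradient descent on the MSE.

    learning_rate and n_steps are illustrative defaults, not tuned values.
    """
    n = len(xs)
    w0, w1 = 0.0, 0.0  # initial guess for the weights
    for _ in range(n_steps):
        # Prediction errors for the current weights.
        errors = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w0 and w1.
        grad_w0 = 2.0 / n * sum(errors)
        grad_w1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step in the direction of the negative gradient.
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w0, w1 = gradient_descent(xs, ys)
print(w0, w1, mse(w0, w1, xs, ys))
```

On this noise-free data the fitted weights should approach $w_0 = 1$ and $w_1 = 2$. A learning rate that is too large makes the updates diverge instead, which is why the step size has to be chosen with some care.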

### Role of the normal equation

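- The normal equation provides a closed‑form solution for the weights that minimize mean squared error, typically written in matrix form as $ w = (X^\top X)^{-1} X^\top y $ when the inverse exists. Here $X$ is the design matrix, with a column of ones for the intercept $w_0$, and $y$ is the vector of true values.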

- Unlike gradient descent, the normal equation does not require choosing a learning rate or running an iterative loop; instead, it directly computes the optimal parameters in a single algebraic step.
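For the one‑feature case, $X^\top X$ is only a $2 \times 2$ matrix, so the normal equation can be solved by hand. The sketch below does exactly that using only the standard library; the function name `normal_equation` and the toy data are assumptions made for this example, and a real project would more likely hand the system to a linear algebra library.

```python
def normal_equation(xs, ys):
    """Solve (X^T X) w = X^T y for one feature plus an intercept.

    With X = [1, x], the system is
        [[n,      sum(x)  ],   [w0]   [sum(y)  ]
         [sum(x), sum(x^2)]] @ [w1] = [sum(x*y)]
    and is solved here with the explicit 2x2 inverse.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = n * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        # All x values are identical, so X^T X is not invertible.
        raise ValueError("X^T X is singular; the inverse does not exist")

    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w0, w1 = normal_equation(xs, ys)
print(w0, w1)
```

No learning rate or iteration count appears anywhere: the weights come out of a single computation, at the price of requiring $X^\top X$ to be invertible.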

### When to use which

- Gradient descent scales better to large datasets and high‑dimensional problems because it avoids forming and inverting $X^\top X$, which costs on the order of $n d^2 + d^3$ operations for $n$ examples and $d$ features. It can also be applied to models and loss functions for which no closed‑form solution exists.

- The normal equation is convenient for smaller problems with relatively few features, where computing $(X^\top X)^{-1}$ is feasible. It yields the exact minimizer (up to floating‑point error) without any hyperparameters to tune.

```python
import seaborn as sns
```