# Normal equation

So far, we have used the least squares method to *fit* functions for linear regression. 

If we have only two data points with different $x$ values, we can __always__ fit a line that passes through both points:
$$\begin{bmatrix}1 & x_1 \\ 1 & x_2 \end{bmatrix}
\cdot
\begin{bmatrix}c_0 \\ c_1\end{bmatrix} = \begin{bmatrix}y_1 \\ y_2\end{bmatrix}$$
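
As a minimal sketch of the two-point case, the system above can be solved directly with NumPy. The data values are made up for illustration, and `np.linalg.solve` is just one reasonable way to solve a square system.

```python
import numpy as np

# Two hypothetical data points: (x1, y1) = (1, 2) and (x2, y2) = (3, 8).
x = np.array([1.0, 3.0])
y = np.array([2.0, 8.0])

# Design matrix: a column of ones (for the intercept c0) next to the x values.
A = np.column_stack([np.ones_like(x), x])

# A is square and invertible because x1 != x2, so an exact solution exists.
c = np.linalg.solve(A, y)

print(c)      # intercept c0 and slope c1
print(A @ c)  # reproduces y: the line passes through both points
```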

If we have three (or more) points, we may not be able to fit a line through all three, because we have two unknowns but three equations:
$$\begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}
\cdot
\begin{bmatrix}c_0 \\ c_1\end{bmatrix} = \begin{bmatrix}y_1 \\ y_2 \\ y_3\end{bmatrix}$$

You will often see this written like:
$$ A \cdot \vec{c} = \vec{y}$$
and the residual error $$\vec{r} = \vec{y} - A \cdot \vec{c}$$
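
To make the residual concrete, here is a small sketch with three made-up points that do not lie on one line. The candidate coefficients `c` are an arbitrary guess for illustration, not the least squares solution.

```python
import numpy as np

# Three hypothetical points that do not lie on a single line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 4.0])

# Three equations (rows), two unknowns (columns).
A = np.column_stack([np.ones_like(x), x])

# An arbitrary candidate [c0, c1]; not necessarily the best fit.
c = np.array([1.0, 1.5])

r = y - A @ c   # residual vector, one entry per data point
sse = r @ r     # sum of the squared residuals

print(r)
print(sse)
```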

If we use least squares to solve for $\vec{c}$, we minimize $||\vec{r}||_2^2$ (the MSSE):
$$\begin{aligned}||\vec{r}||_2^2 &= \vec{r}^T\vec{r} \\ &= (\vec{y}-A\vec{c})^T(\vec{y}-A\vec{c}) \\ &= \vec{y}^T\vec{y}-\vec{y}^TA\vec{c}-(A\vec{c})^T\vec{y} + (A\vec{c})^TA\vec{c}\end{aligned}$$

Now, $(A\vec{c})^T = \vec{c}^TA^T$, so
$$\begin{aligned}||\vec{r}||_2^2 &= \vec{y}^T\vec{y}-\vec{y}^TA\vec{c}-\vec{c}^TA^T\vec{y} + \vec{c}^TA^TA\vec{c} \\ &= \vec{y}^T\vec{y}-2\vec{y}^TA\vec{c} + \vec{c}^TA^TA\vec{c}\end{aligned}$$
The last step holds because $\vec{c}^TA^T\vec{y}$ is a scalar, so it equals its own transpose, $\vec{y}^TA\vec{c}$.

Then we take the partial derivative with respect to $\vec{c}$, $\frac{\partial}{\partial \vec{c}}$, and set it to zero to find the minimum:

$$\frac{\partial}{\partial \vec{c}} \left(\vec{y}^T\vec{y} - 2\vec{y}^TA\vec{c} + \vec{c}^TA^TA\vec{c}\right) = 0 $$

Now, writing each derivative as a column vector,

$$\frac{\partial}{\partial \vec{c}} (\vec{y}^T\vec{y}) = 0$$

$$\frac{\partial}{\partial \vec{c}} (2\vec{y}^TA\vec{c}) = 2A^T\vec{y}$$

$$\frac{\partial}{\partial \vec{c}} (\vec{c}^TA^TA\vec{c}) = 2A^TA\vec{c} $$

so we have

$$\begin{aligned} -2A^T\vec{y} + 2A^TA\vec{c} &= 0 \\ A^TA\vec{c} &= A^T\vec{y} \\ \vec{c} &= (A^TA)^{-1} A^T\vec{y} \end{aligned}$$
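
One way to sanity-check the derivative is to compare the gradient in column-vector form, $-2A^T\vec{y} + 2A^TA\vec{c}$, against a finite-difference approximation. The sketch below assumes random made-up data; the names `sse`, `analytic`, and `numeric` are illustrative only.

```python
import numpy as np

# Hypothetical random problem: 5 data points, 2 unknowns.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
y = rng.normal(size=5)
c = rng.normal(size=2)

def sse(c):
    """Squared 2-norm of the residual y - A c."""
    r = y - A @ c
    return r @ r

# Gradient from the derivation, written as a column vector.
analytic = -2 * A.T @ y + 2 * A.T @ A @ c

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (sse(c + eps * e) - sse(c - eps * e)) / (2 * eps)
    for e in np.eye(len(c))
])

print(analytic)
print(numeric)  # should agree with the analytic gradient up to rounding
```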

We call $ \vec{c} = (A^TA)^{-1} A^T\vec{y}$ the __normal equation__.

$(A^TA)^{-1}$ means to invert $A^TA$. The inverse of a matrix is another matrix which, if multiplied by the original matrix, gives you the identity matrix. 
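
As a quick illustration of that definition, the sketch below inverts a small made-up matrix `M` with `np.linalg.inv` and checks that the product is the identity (up to floating-point rounding).

```python
import numpy as np

# A small hypothetical invertible matrix.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])

M_inv = np.linalg.inv(M)

# Multiplying the inverse by the original gives the identity matrix.
print(M_inv @ M)
print(np.allclose(M_inv @ M, np.eye(2)))
```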

Note that in some cases, $A^TA$ may not be *invertible*. Don't worry, we have other ways to fit linear regression! But if you are committed to the normal equation and run into this situation, you probably have some features (variables) that are not linearly independent of each other, and you can remove the redundant ones.
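
Here is a minimal sketch of how such a situation might be detected, assuming a made-up data set in which the feature `x2` is exactly twice `x1`. Checking the rank with `np.linalg.matrix_rank` is one possible diagnostic, not the only one.

```python
import numpy as np

# Hypothetical features: x2 is an exact multiple of x1, so it adds no information.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1

A = np.column_stack([np.ones_like(x1), x1, x2])

# A^T A has the same rank as A. A rank below the number of columns
# means A^T A is not invertible.
print(np.linalg.matrix_rank(A), "of", A.shape[1])

# Removing the redundant column restores full column rank.
A_reduced = A[:, :2]
print(np.linalg.matrix_rank(A_reduced), "of", A_reduced.shape[1])
```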

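Also note that computing and inverting $A^TA$ is computationally expensive when there are many features, because the cost of inverting a matrix grows roughly with the cube of its size. Don't worry, we have other ways to fit linear regression!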

In the case of multiple linear regression with two features, the model given by $\vec{c} = (A^TA)^{-1} A^T\vec{y}$ is not a line, it's a plane (and a hyperplane when there are more than two features).
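
As a small sketch of the two-feature case, the example below generates made-up, noise-free data from the plane $y = 1 + 2x_1 - 3x_2$ and recovers its three coefficients. It solves $A^TA\vec{c} = A^T\vec{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a choice made for this illustration.

```python
import numpy as np

# Hypothetical noise-free data from the plane y = 1 + 2*x1 - 3*x2.
x1 = np.array([0.0, 1.0, 0.0, 1.0, 2.0])
x2 = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# One column per unknown: intercept, x1 coefficient, x2 coefficient.
A = np.column_stack([np.ones_like(x1), x1, x2])

# Solve A^T A c = A^T y for c.
c = np.linalg.solve(A.T @ A, A.T @ y)

print(c)  # three coefficients describing a plane, not a line
```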

# Resources

* http://mlwiki.org/index.php/Normal_Equation
* https://mathworld.wolfram.com/MatrixInverse.html

# Question for the class

How would we write the normal equation using numpy (or numpy and scipy)?
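
One possible answer, as a sketch rather than a definitive implementation: the formula can be transcribed literally with `np.linalg.inv`, or the system $A^TA\vec{c} = A^T\vec{y}$ can be handed to `np.linalg.solve` (`scipy.linalg.solve` could be used the same way). The data values are made up, and `np.linalg.lstsq` is included only as a cross-check.

```python
import numpy as np

# Hypothetical noisy data lying roughly along a line.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
A = np.column_stack([np.ones_like(x), x])

# Literal transcription of c = (A^T A)^{-1} A^T y.
c_inv = np.linalg.inv(A.T @ A) @ A.T @ y

# Same coefficients without forming the inverse explicitly.
c_solve = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check against NumPy's built-in least squares routine.
c_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(c_inv)
print(c_solve)
print(c_lstsq)
```

All three print the same intercept and slope up to rounding. Solving the system is usually preferred over computing the inverse, since it does less work and tends to be more accurate.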