# Computing Parameters Analytically

## 1. Normal Equation

Until now, we have been using the iterative gradient descent algorithm to minimize the cost function
$J(\boldsymbol{\theta})$ and find the parameters $\boldsymbol{\theta}$.

In contrast, the normal equation gives us a method to find the parameters $\boldsymbol{\theta}$ analytically. That is, rather than running gradient descent iterations, we can solve for $\boldsymbol{\theta}$ in a single step.

As we pointed out in the last lecture, the cost function can be written as:

$$
J(\boldsymbol{\theta}) = \frac{1}{2m}\left\lvert\left\lvert\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{y}\right\rvert\right\rvert^2,
$$

where 
$$
\boldsymbol{X} = \left[
\begin{array}{ccccc}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots  & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots  & x_n^{(2)} \\
\vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots  & x_n^{(m)}
\end{array}
\right] \in \mathbb{R}^{m \times (n+1)}.
$$

Two things can be noted about this cost function:

$$
\frac{\partial}{\partial \boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{m} \boldsymbol{X}^T (\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{y}) \in \mathbb{R}^{n+1},
$$

and

$$
\frac{\partial^2}{\partial \boldsymbol{\theta}^2} J(\boldsymbol{\theta}) = \frac{1}{m} \boldsymbol{X}^T \boldsymbol{X} \in \mathbb{R}^{(n+1) \times (n+1)}.
$$

On the one hand, the Hessian $\frac{\partial^2}{\partial \boldsymbol{\theta}^2} J(\boldsymbol{\theta})$ is positive semi-definite, which assures the convexity of the cost function.

This implies that any local minimum of the cost function is also a global minimum.
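The positive semi-definiteness of the Hessian can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated design matrix (the sizes `m`, `n` and the random data are illustrative assumptions, not part of the lecture). It builds $\frac{1}{m}\boldsymbol{X}^T\boldsymbol{X}$ and verifies that none of its eigenvalues is negative, up to rounding error.

```python
import numpy as np

# Hypothetical data: m examples, n features, plus the intercept column x_0 = 1.
rng = np.random.default_rng(0)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])

# Hessian of J(theta); note that it does not depend on theta or on y.
H = X.T @ X / m

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(H)

# Allow a small tolerance for floating-point rounding.
is_psd = eigenvalues.min() >= -1e-12
print(is_psd)
```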

On the other hand:

$$
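\frac{\partial}{\partial \boldsymbol{\theta}} J(\boldsymbol{\theta}) = \boldsymbol{0} \Leftrightarrow \frac{1}{m} \boldsymbol{X}^T (\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{y}) = \boldsymbol{0} \Leftrightarrow \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta} = \boldsymbol{X}^T \boldsymbol{y} \Leftrightarrow \boldsymbol{\theta} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{y},
$$

where the last step assumes that $\boldsymbol{X}^T\boldsymbol{X}$ is invertible. Since the cost function is convex, this stationary point $\boldsymbol{\theta} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{y}$ is a global minimizer of the cost function $J(\boldsymbol{\theta})$.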
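As an illustration, the normal equation can be written in a few lines of NumPy. This is a minimal sketch, not a definitive implementation: the function name `normal_equation` and the toy data are assumptions made for the example. It solves the linear system $\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta} = \boldsymbol{X}^T\boldsymbol{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same $\boldsymbol{\theta}$ when $\boldsymbol{X}^T\boldsymbol{X}$ is invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y (hypothetical helper)."""
    # Solving the linear system avoids computing the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data generated from y = 1 + 2 * x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # first column is x_0 = 1
y = 1.0 + 2.0 * x

theta = normal_equation(X, y)
print(theta)  # approximately the generating parameters (1, 2)
```

Because the data here are noise-free, the recovered parameters match the ones used to generate `y` up to rounding error.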

### Which method should I use?

| **Gradient Descent**              | **Normal Equation**        |
| --------------------------------- | -------------------------- |
| Need to choose $\alpha$           | No need to choose $\alpha$ |
| Needs many iterations             | No need to iterate         |
| Works well even when $n$ is large | Need to compute $(\boldsymbol{X}^T \boldsymbol{X})^{-1}$, which is slow if $n$ is large |

In addition, gradient descent can be used to optimize generic differentiable functions, whereas the normal equation applies only to linear regression.
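The two methods can be compared on the same problem. The sketch below assumes NumPy and a small synthetic dataset; the learning rate `alpha`, the iteration count `num_iters`, and the data themselves are illustrative choices, not values from the lecture. Gradient descent needs both hyperparameters, while the normal equation needs neither, and on this convex problem both arrive at essentially the same parameters.

```python
import numpy as np

# Hypothetical synthetic data: y = 3 - 2 * x plus a little noise.
rng = np.random.default_rng(1)
m = 100
x = rng.uniform(-1.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x])
y = 3.0 - 2.0 * x + 0.1 * rng.normal(size=m)

# Normal equation: one linear solve, no hyperparameters.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: requires a learning rate and many iterations.
alpha, num_iters = 0.5, 500  # assumed values that converge on this data
theta_gd = np.zeros(2)
for _ in range(num_iters):
    gradient = X.T @ (X @ theta_gd - y) / m
    theta_gd -= alpha * gradient

print(theta_ne)
print(theta_gd)
```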

## 2. Normal Equation Noninvertibility

What happens if $\boldsymbol{X}^T\boldsymbol{X}$ is not invertible (is singular)?

This happens when:

1. There are linearly dependent features.

2. There are too many features (more features than training examples).

In either case, these problems can be avoided with good **feature selection** before applying linear regression.
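The first case can be illustrated with a small example. The following sketch assumes NumPy and a hypothetical design matrix whose third column is exactly twice the second. It shows that $\boldsymbol{X}^T\boldsymbol{X}$ is then rank-deficient, and that dropping the redundant feature (a simple form of feature selection) makes the normal equation solvable again.

```python
import numpy as np

# Hypothetical features: the third column is exactly twice the second,
# so the columns of X are linearly dependent.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2.0 * x1])
y = 1.0 + 0.5 * x1

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
# A rank below the matrix size means X^T X is singular.
print(rank, gram.shape[0])

# Feature selection: keep only the intercept and the first feature.
X_reduced = X[:, :2]
theta = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)
print(theta)
```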


<footer id="attribution" style="float:right; color:#808080; background:#fff;">
Created with Jupyter by Esteban Jiménez Rodríguez. Based on the content of the Machine Learning course offered through Coursera by Prof. Andrew Ng.
</footer>