## Gradient Descent

The problem with the methods we've looked at thus far is that they all require inverting the Hessian matrix, which can be very slow.

Here is a simpler approach. Consider $\nabla E(\hat\theta_{i, 0}, \hat\theta_{i, 1})$, the vector of partial derivatives of $E$ at that point. Now consider a change $(\Delta \hat\theta_0, \Delta \hat\theta_1)$. Using the tangent-plane (linear) approximation to the quadratic surface, we should have:

\\[
\Delta E
\approx
    \nabla E(\hat\theta_{i, 0}, \hat\theta_{i, 1}) \cdot (\Delta \hat\theta_0, \Delta \hat\theta_1)
= \frac{\partial E}{\partial \hat\theta_0} \Delta\hat\theta_0
  + \frac{\partial E}{\partial \hat\theta_1} \Delta\hat\theta_1
\\]

Now, this is only an approximation to $\Delta E$, because the error surface is quadratic, not linear. Therefore, as we change $\hat\theta_0, \hat\theta_1$, the partial derivatives will change. Still, just as $f'(x)$ is the slope of the line tangent to $f$ at $x$, $\nabla E(\hat\theta)$ is the gradient of the linear surface tangent to $E$ at $\hat\theta$. (Here $\hat\theta$ denotes the vector $(\hat\theta_0, \hat\theta_1)$.)

In other words, $\nabla E(\theta)$ gives us the best linear approximation to the quadratic surface. For small $\Delta\theta$ the approximation should stay pretty good.
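As a rough numerical check of the linear approximation, the sketch below compares the gradient-based estimate of $\Delta E$ with the actual change. The data set, the sum-of-squares form of $E$ for a line $y = \theta_0 + \theta_1 x$, and all function names are assumptions made for illustration; the text does not fix any of them.

```python
# Hypothetical data set (not from the text): points on the line y = 1 + 2x.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]


def error(theta0, theta1):
    """Assumed error surface: sum of squared residuals of y = theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(XS, YS))


def gradient(theta0, theta1):
    """Partial derivatives of error() with respect to theta0 and theta1."""
    residuals = [theta0 + theta1 * x - y for x, y in zip(XS, YS)]
    d_theta0 = sum(2.0 * r for r in residuals)
    d_theta1 = sum(2.0 * r * x for r, x in zip(residuals, XS))
    return d_theta0, d_theta1


def linear_estimate(theta, delta):
    """Dot product of the gradient at theta with the proposed change delta."""
    g0, g1 = gradient(*theta)
    return g0 * delta[0] + g1 * delta[1]


def actual_change(theta, delta):
    """Exact change in error when moving from theta to theta + delta."""
    moved = (theta[0] + delta[0], theta[1] + delta[1])
    return error(*moved) - error(*theta)


theta = (0.0, 0.0)
for delta in [(0.01, 0.01), (1.0, 1.0)]:
    est = linear_estimate(theta, delta)
    act = actual_change(theta, delta)
    print(f"delta={delta}: estimate={est:.4f}, actual={act:.4f}, gap={abs(est - act):.4f}")
```

With these assumed numbers the gap shrinks quickly as the step gets smaller, because the term the linear estimate ignores is quadratic in the step.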

Now, remember what we want to do: we want to find a minimum of the error surface $E$. But instead of making one big jump that tries to zero the partial derivatives, why don't we just take a small step in a downhill direction on the quadratic surface?

For instance, any update $\Delta\theta$ where we have $\nabla E(\theta) \cdot \Delta\theta < 0$ should reduce the error, provided $\Delta\theta$ is small enough that the approximation to $\Delta E$ is still good.
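The "small enough" caveat can be seen in a minimal sketch like the one below. It takes one direction with a negative dot product and tries a small and a large step along it. The data, the sum-of-squares error, and the helper `try_step` are hypothetical choices for illustration, not something the text specifies.

```python
# Hypothetical data set (not from the text): points on the line y = 1 + 2x.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]


def error(theta0, theta1):
    """Assumed error surface: sum of squared residuals of y = theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(XS, YS))


def gradient(theta0, theta1):
    """Partial derivatives of error() with respect to theta0 and theta1."""
    residuals = [theta0 + theta1 * x - y for x, y in zip(XS, YS)]
    return (sum(2.0 * r for r in residuals),
            sum(2.0 * r * x for r, x in zip(residuals, XS)))


def try_step(theta, direction, size):
    """Return (gradient . delta, actual change in error) for delta = size * direction."""
    delta = (size * direction[0], size * direction[1])
    g0, g1 = gradient(*theta)
    dot = g0 * delta[0] + g1 * delta[1]
    change = error(theta[0] + delta[0], theta[1] + delta[1]) - error(*theta)
    return dot, change


theta = (0.0, 0.0)
direction = (1.0, 0.0)  # some direction with a negative dot product, not the steepest one
for size in (0.1, 10.0):
    dot, change = try_step(theta, direction, size)
    print(f"size={size}: dot={dot:.2f}, change in error={change:.2f}")
```

In this assumed setup both steps have a negative dot product, but only the small one lowers the error; the large one overshoots the valley and ends up higher than where it started.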

Since we want to focus on small $\Delta\theta$, let's consider taking a step of length $\epsilon$. The length of a vector $v$ is given by $\sqrt{\sum_i v_i^2}$; we write the length of $v$ as $||v||$.

When we write $\alpha v$, where $\alpha \in \mathbb{R}$ (a real number) and $v \in \mathbb{R}^n$ (an $n$-dimensional vector of real numbers), we mean multiplying every component $v_i$ of $v$ by $\alpha$ to produce $\alpha v = (\alpha v_1, \alpha v_2, \ldots, \alpha v_n)$.

You may verify that $||\alpha v|| = |\alpha| \, ||v||$; for $\alpha \geq 0$ this is just $\alpha ||v||$. So considering all vectors of length $\epsilon$ means considering every vector that can be written as $\epsilon u$, where $||u|| = 1$. A vector with $||u|| = 1$ is called a *unit* vector; unit vectors are often used when we want to focus on *direction* more than length.
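The scaling and unit-vector ideas above can be checked with a few lines of plain Python. The helper names (`norm`, `scale`, `unit`, `step`) are made up for this sketch, and the scaling check uses the absolute value of $\alpha$ so that it also covers negative scalars.

```python
import math


def norm(v):
    """Euclidean length of v: the square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))


def scale(alpha, v):
    """Multiply every component of v by the scalar alpha."""
    return tuple(alpha * c for c in v)


def unit(v):
    """Unit vector pointing in the same direction as v (v must be nonzero)."""
    return scale(1.0 / norm(v), v)


def step(direction, epsilon):
    """A vector of length epsilon pointing along direction."""
    return scale(epsilon, unit(direction))


v = (3.0, 4.0)
print(norm(v))                  # length of v
print(norm(scale(-2.0, v)))     # equals abs(-2.0) * norm(v)
print(norm(unit(v)))            # approximately 1
print(norm(step(v, 0.1)))       # approximately 0.1
```

Splitting a step into `unit(direction)` and `epsilon` mirrors the text: the direction is chosen first, and the length is applied afterwards.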

Our question about $\Delta\theta$ is really about what *direction* to move in. Once we find a good direction, we'll just move $\epsilon$ units in that direction.

So what direction $u$ should we move in? Well, first note that any direction where $\nabla E(\theta) \cdot u > 0$ is heading *uphill* on the error surface.

On the other hand, $\nabla E(\theta) \cdot u = 0$ means you are traveling sideways on the error surface. A $u$ like this is *parallel to the contour* at $\theta$. That's because when you're on a contour line, if you move along that line, you don't change your height. Since moving in the direction $u$ doesn't change your height, it must be the direction of the contour line.

So we know we want $\nabla E(\theta) \cdot u < 0$. In fact, we want this to be as negative as possible: that would be the direction most downhill.
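One way to explore this is by brute force: walk around the unit circle, compute the dot product with the gradient for each direction, and keep the most negative one. The sketch below does that for a hypothetical gradient value (`GRAD` is an assumption, as are all function names) and compares the result with the normalized negative gradient, which is the standard answer for the steepest descent direction.

```python
import math

GRAD = (-32.0, -68.0)  # hypothetical gradient value, chosen only for illustration


def dot(a, b):
    """Dot product of two 2-D vectors."""
    return a[0] * b[0] + a[1] * b[1]


def classify(grad, u, tol=1e-9):
    """Label a direction u as uphill, downhill, or sideways (along the contour)."""
    d = dot(grad, u)
    if d > tol:
        return "uphill"
    if d < -tol:
        return "downhill"
    return "sideways"


def steepest_by_scan(grad, n=3600):
    """Try n evenly spaced unit vectors; return the one with the most negative dot product."""
    best = None
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        u = (math.cos(angle), math.sin(angle))
        if best is None or dot(grad, u) < dot(grad, best):
            best = u
    return best


def normalized_negative_gradient(grad):
    """Unit vector pointing opposite to the gradient (grad must be nonzero)."""
    length = math.hypot(grad[0], grad[1])
    return (-grad[0] / length, -grad[1] / length)


print(classify(GRAD, (1.0, 0.0)))
print(steepest_by_scan(GRAD))
print(normalized_negative_gradient(GRAD))
```

The scan is only as precise as its angular resolution, so the two printed vectors should agree to a few decimal places rather than exactly.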

A lot 

**TODO**: Image of contour.
**TODO**: Animation of dot products with a circle.
* Show them how constrained optimization on a circle is easy.


```python
%matplotlib inline
from examples.gradient_descent_example import GradientDescentExample

GradientDescentExample.run()
```