# Finding Critical Points with TensorFlow
## Part 0 - Working with Quadratic Forms

### Introduction: Why Critical Points?

The *critical points* of a differentiable function that
takes (possibly-multidimensional) real-valued input
and produces a scalar real output are those points where
the gradient of the function is equal to the $0$ vector.
An example of such a function is
the loss function of a neural network,
either as a function of network inputs
or as a function of network parameters.
Recall that the gradient $\nabla f(\theta)$
of a function $f$ at a point $\theta$ is the vector that satisfies

$$
f(\theta+\epsilon) \approx f(\theta) + \epsilon \nabla f(\theta)
$$

for sufficiently small $\epsilon$.

This is a generalization of the notion of critical points
familiar from single-variable calculus,
where a critical point is where the derivative is equal to $0$.
In both cases, a critical point is a place where the
*best linear approximation is constant*:

$$
f(\theta+\epsilon) \approx f(\theta) + \epsilon 0
$$

Critical points are of interest because they are fixed points
for the *gradient descent* optimization algorithm.
In this algorithm, the value of $f(\theta)$
is reduced at each iteration by
taking a small step in the direction that would
most quickly decrease the linear approximation to $f$:

$$
\theta^{t+1} = \theta^t - \epsilon\nabla f(\theta^t)
$$

where $\epsilon$ here denotes a small positive scalar step size
rather than a perturbation vector.

When the gradient is $0$, $\theta^{t+1} = \theta^t$,
and the parameters don't change after an iteration.
There is a long-standing debate over whether neural networks reach critical points
when they are being trained and, if so,
to what kind of critical points they tend to converge.
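
The fixed-point property described above can be checked numerically. The following is a minimal sketch, not part of the `crit_finder` code: the toy function $f(\theta) = \frac{1}{2}\|\theta\|^2$, the helper names `grad_f` and `gradient_descent_step`, and the step size of `0.1` are all assumptions chosen for illustration.

```python
import numpy as np


def grad_f(theta):
    """Gradient of the toy function f(theta) = 0.5 * ||theta||^2 (an assumed example)."""
    return theta


def gradient_descent_step(theta, step_size=0.1):
    """One gradient descent update: theta - step_size * grad_f(theta)."""
    return theta - step_size * grad_f(theta)


critical_point = np.zeros(2)          # the gradient vanishes here
other_point = np.array([1.0, -2.0])   # the gradient is nonzero here

# At the critical point the update leaves the parameters unchanged.
after_critical = gradient_descent_step(critical_point)
# Elsewhere the parameters move against the gradient.
after_other = gradient_descent_step(other_point)

print(after_critical, after_other)
```

Only the point with zero gradient is left in place by the update; any other point moves, however small the step size.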

While the linear approximation at every critical point is always a constant,
the higher order approximations can be wildly different.
The simplest way to distinguish between critical points, then,
is according to what the *best quadratic approximation* to the function looks like
in the neighborhood of the critical point:

$$
f(\theta + \epsilon) \approx f(\theta) + \epsilon\nabla f(\theta) + \frac{1}{2}\epsilon^\intercal \nabla^2 f(\theta) \epsilon
$$

Because the gradient is $0$ at a critical point,
the best quadratic approximation to a function there
is parametrized by the matrix in the quadratic term, $\nabla^2 f(\theta)$,
its matrix of second partial derivatives,
or *Hessian*,
just as the best linear approximation is parametrized by its vector of first partial derivatives.
We will be interested in the basis-independent properties of this matrix:
is it singular? what is its spectrum? does it have an eigenvalue gap? is it poorly conditioned?
and so on.
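
As a rough illustration of these questions, the sketch below inspects a symmetric matrix with NumPy. It is not taken from the `crit_finder` code: the helper name `describe_hessian`, the absolute tolerance `tol`, and the labels it returns are assumptions made for this example.

```python
import numpy as np


def describe_hessian(hessian, tol=1e-8):
    """Summarize basis-independent properties of a symmetric matrix.

    `tol` is an assumed absolute threshold below which an eigenvalue
    is treated as zero.
    """
    eigenvalues = np.linalg.eigvalsh(hessian)  # real, sorted ascending
    magnitudes = np.abs(eigenvalues)
    is_singular = bool(magnitudes.min() <= tol)
    # For a symmetric matrix, the 2-norm condition number is the ratio
    # of the largest to the smallest eigenvalue magnitude.
    condition_number = np.inf if is_singular else magnitudes.max() / magnitudes.min()

    if is_singular:
        kind = "degenerate"
    elif np.all(eigenvalues > 0):
        kind = "minimum"
    elif np.all(eigenvalues < 0):
        kind = "maximum"
    else:
        kind = "saddle"

    return {
        "eigenvalues": eigenvalues,
        "is_singular": is_singular,
        "condition_number": condition_number,
        "kind": kind,
    }


saddle = describe_hessian(np.array([[1.0, 0.0], [0.0, -2.0]]))
bowl = describe_hessian(np.eye(2))
flat = describe_hessian(np.array([[1.0, 0.0], [0.0, 0.0]]))
print(saddle["kind"], bowl["kind"], flat["kind"])
```

Because every quantity here is computed from the eigenvalues, none of them changes under an orthogonal change of basis.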

As it turns out, finding the critical points and calculating the curvature
of a high-dimensional, non-polynomial function
like a neural network loss function is quite difficult.
These notebooks are a record of our attempts to develop
and/or implement critical-point-discovery algorithms,
written with the aim of being useful as tutorials.

To develop our understanding of critical points and curvature,
we first focus on problems for which
*the best quadratic approximation is exact*.
This will allow us to analytically derive the locations of critical points
and the optimal values of hyperparameters
and so verify that our algorithms are properly implemented.
These functions are known as *quadratic forms*.

### Quadratic Forms

A quadratic form is a homogeneous polynomial of degree two over an $n$-dimensional input. For an $n \times n$ matrix $Q$, it is calculated as

$$
f(x) = \frac{1}{2}x^\intercal Q x
$$

where the factor of 1/2 is there mostly to simplify some later expressions
(but, much like the seemingly innocuous scaling factors in front of Fourier transforms,
this scaling turns out to ensure certain favorable algebraic properties).

Below, I walk through the gradient and Hessian calculations.

This isn't a particularly interesting exercise, but it has allowed me to catch a number of bugs.

In [1]:
import tensorflow as tf

import matplotlib.pyplot as plt

import numpy as np

from crit_finder import graphs

### Testing with Identity Matrix

The simplest case, useful for working out basic bugs and flaws in reasoning, is the identity matrix, because the quadratic form it defines is half the squared $\ell_2$ norm:

$$
\frac{1}{2} x^\intercal I x = \frac{1}{2} x^\intercal x = \frac{1}{2}\sum_i x_i \cdot x_i = \frac{1}{2}\|x\|_2^2
$$

In [2]:
N = 2

identity_matrix = np.eye(N).astype(np.float32)
input_vector = np.sqrt([1/2,1/2]).astype(np.float32)

identity_quadratic_form_graph = graphs.make_quadratic_form(identity_matrix, input_vector,
                                                                hyperparameters=graphs.DEFAULTS)

In [3]:
out = graphs.get_result("output", input_vector, identity_quadratic_form_graph)

assert np.isclose(0.5*np.sum(np.square(input_vector)), out)

The gradient is just the input (and here the scaling by 1/2 is helpful):

$$
\nabla_x \frac{1}{2}x^\intercal I x = \frac{1}{2} \nabla_x \sum_i x_i\cdot x_i = \frac{1}{2}\cdot2\cdot I\cdot x = x
$$

In [4]:
gradient = np.squeeze(graphs.get_result("gradients", input_vector, identity_quadratic_form_graph))

gradient

array([0.70710677, 0.70710677], dtype=float32)

In [5]:
assert np.allclose(input_vector.T, gradient)

The Hessian matrix is therefore just $I$.

In [6]:
hessian = np.squeeze(graphs.get_result("hessian", input_vector, identity_quadratic_form_graph))

hessian

array([[1., 0.],
       [0., 1.]], dtype=float32)

In [7]:
assert np.allclose(hessian, np.eye(N))

### Random Matrix

Of course, to really ensure that our code is correct, we should test on less-symmetric problems, e.g. matrices from the Gaussian ensemble.

Generically, the gradient of $x^\intercal Q x$ is

$$
\nabla_x x^\intercal Q x = (Q + Q^\intercal)x
$$

We can obtain this from the definition of the derivative, following
[this derivation from Mathematics Stack Exchange](https://math.stackexchange.com/questions/239207/hessian-matrix-of-a-quadratic-form).

Our goal is to find $\nabla_xf(x)$ such that

$$\begin{align}
f(x+\epsilon) &= f(x) + \nabla_xf(x)^\intercal\epsilon + o(\|\epsilon\|)
\end{align}$$

as the norm of $\epsilon$ goes to $0$.
We expand the left-hand side,
moving the factor of $\frac{1}{2}$ over:

$$\begin{align}
2\cdot f(x+\epsilon) &= (x+\epsilon)^\intercal Q (x+\epsilon) \\
&= x^\intercal Q x
+ x^\intercal Q \epsilon + \epsilon^\intercal Q x
+ \epsilon^\intercal Q \epsilon
\end{align}$$

Using the rules for transposition and distribution, together with the fact that a scalar such as $x^\intercal Q \epsilon$ equals its own transpose, we can rewrite this as

$$\begin{align}
2\cdot f(x+\epsilon)
&= x^\intercal Q x
+ x^\intercal Q \epsilon + \epsilon^\intercal Q x
+ \epsilon^\intercal Q \epsilon \\
&= x^\intercal Q x
+ (x^\intercal Q \epsilon)^\intercal + \epsilon^\intercal Q x
+ \epsilon^\intercal Q \epsilon \\
&= x^\intercal Q x
+ \epsilon^\intercal Q x + \epsilon^\intercal Q^\intercal x
+ \epsilon^\intercal Q \epsilon \\
&= x^\intercal Q x
+ \epsilon^\intercal (Q+Q^\intercal) x
+ \epsilon^\intercal Q \epsilon \\
f(x+\epsilon) &= \frac{1}{2} x^\intercal Q x
+ \frac{1}{2} \epsilon^\intercal (Q+Q^\intercal) x
+ \frac{1}{2}\epsilon^\intercal Q \epsilon
\end{align}$$

Comparing this to the definition of the derivative above,
we see that we have a term equal to $f(x)$ and a term that is quadratic in $\epsilon$
(and so vanishes faster than the norm of $\epsilon$),
leaving the middle term, sans $\epsilon^\intercal$, to be our derivative:

$$
\nabla_x f(x) = \frac{1}{2}(Q + Q^\intercal)x
$$

Since this gradient is linear in $x$, the Hessian is the constant matrix $\frac{1}{2}(Q+Q^\intercal)$, the symmetric part of $Q$.
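
These formulas can also be checked without TensorFlow. The sketch below compares them against central finite differences in plain NumPy. It is independent of the `crit_finder` code: the helper names, the fixed seed, and the step size `h` are assumptions for illustration, and it uses `float64` rather than the `float32` used in the cells of this notebook.

```python
import numpy as np


def quadratic_form(Q, x):
    """Evaluate f(x) = 0.5 * x^T Q x."""
    return 0.5 * x @ Q @ x


def finite_difference_gradient(f, x, h=1e-3):
    """Central-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


def finite_difference_hessian(f, x, h=1e-3):
    """Central-difference estimate of the Hessian, built column by column
    from finite-difference gradients."""
    hess = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        grad_plus = finite_difference_gradient(f, x + step, h)
        grad_minus = finite_difference_gradient(f, x - step, h)
        hess[:, j] = (grad_plus - grad_minus) / (2 * h)
    return hess


rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 3))   # generally not symmetric
x = rng.standard_normal(3)


def f(v):
    return quadratic_form(Q, v)


analytic_gradient = 0.5 * (Q + Q.T) @ x
analytic_hessian = 0.5 * (Q + Q.T)

numeric_gradient = finite_difference_gradient(f, x)
numeric_hessian = finite_difference_hessian(f, x)

assert np.allclose(numeric_gradient, analytic_gradient, atol=1e-6)
assert np.allclose(numeric_hessian, analytic_hessian, atol=1e-6)
```

Central differences carry no truncation error for a quadratic function, so the only discrepancy is floating-point rounding, and a fairly large step size is safe here.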

In [8]:
N = 3

random_matrix = np.random.standard_normal(size=(N,N)).astype(np.float32)

input_vector = np.random.standard_normal(size=(N)).astype(np.float32)

random_quadratic_form = graphs.make_quadratic_form(random_matrix, input_vector,
                                                   hyperparameters=graphs.DEFAULTS)


In [9]:
out = graphs.get_result("output", input_vector, random_quadratic_form)
out

0.24854209

In [10]:
assert np.isclose(out, 0.5*input_vector.T.dot(random_matrix).dot(input_vector))

In [11]:
gradient = np.squeeze(graphs.get_result("gradients", input_vector, random_quadratic_form))
gradient

array([-0.43652117,  0.6071776 , -0.21483205], dtype=float32)

In [12]:
assert np.allclose(gradient, 0.5*(random_matrix+random_matrix.T).dot(input_vector).squeeze())

In [13]:
tf_hessian = graphs.get_result("hessian", input_vector, random_quadratic_form)
tf_hessian

array([[ 0.21663527, -1.2224835 ,  0.1898982 ],
       [-1.2224835 , -1.4245507 ,  0.06303868],
       [ 0.1898982 ,  0.06303868,  0.24287845]], dtype=float32)

In [14]:
true_hessian = 0.5*(random_matrix+random_matrix.T)
true_hessian

array([[ 0.21663527, -1.2224835 ,  0.1898982 ],
       [-1.2224835 , -1.4245507 ,  0.06303868],
       [ 0.1898982 ,  0.06303868,  0.24287845]], dtype=float32)

In [15]:
assert np.allclose(true_hessian, tf_hessian)
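
With the gradient formula verified, the critical points of a quadratic form can be read off directly: the gradient $\frac{1}{2}(Q+Q^\intercal)x$ vanishes exactly on the null space of the symmetric part of $Q$. The sketch below computes a basis for that null space in plain NumPy. It is an illustration rather than part of the `crit_finder` code; the helper name `critical_point_basis` and the absolute tolerance `tol` are assumptions.

```python
import numpy as np


def critical_point_basis(Q, tol=1e-10):
    """Orthonormal basis for the critical points of f(x) = 0.5 * x^T Q x.

    The critical points are the null space of the symmetric part of Q.
    `tol` is an assumed absolute threshold for treating an eigenvalue as zero.
    """
    symmetric_part = 0.5 * (Q + Q.T)
    eigenvalues, eigenvectors = np.linalg.eigh(symmetric_part)
    return eigenvectors[:, np.abs(eigenvalues) <= tol]


# Nonsingular symmetric part: the origin is the only critical point,
# so the basis has no columns.
isolated = critical_point_basis(np.diag([1.0, -2.0]))

# Singular symmetric part: a whole line of critical points.
Q_singular = np.diag([1.0, 0.0])
line = critical_point_basis(Q_singular)

# Antisymmetric Q: the symmetric part is zero, so f is identically zero
# and every point is a critical point.
everywhere = critical_point_basis(np.array([[0.0, 1.0], [-1.0, 0.0]]))

print(isolated.shape, line.shape, everywhere.shape)
```

The number of columns returned is the dimension of the set of critical points; the origin always belongs to that set.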