# Finding Critical Points with TensorFlow
## Part 1b - Critical Points of Quadratic Forms - Gradient Norm Minimization

An alternative, possibly more intuitive, method of finding critical points
is to simply minimize the norm of the gradient directly.

That is, we are interested in points where the gradient is close to the zero vector.
To find these points,
we run gradient descent on a new function $g$
that is defined in terms of the
gradient of our original function $f$:

$$
g(\theta) = \frac{1}{2}\|\nabla f(\theta) \|_2^2
$$

A quick application of the chain rule gives the following:

$$\begin{align}
\nabla g(\theta) &= \nabla\left(\frac{1}{2}\|\nabla f(\theta) \|_2^2\right)\\
&= \nabla\nabla f(\theta)\cdot \nabla f(\theta)\\
&= \nabla^2f(\theta)\nabla f(\theta)
\end{align}$$

leading to the update rule:

$$\begin{align}
\theta^{t+1} &= \theta^{t} - \eta \nabla g(\theta^{t})\\
&= \theta^{t} - \eta\nabla^2f(\theta^{t})\nabla f(\theta^{t})
\end{align}$$
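
The gradient formula and the update rule above can be checked numerically. The following is a minimal sketch, not the `crit_finder` implementation: it assumes a quadratic form $f(\theta) = \frac{1}{2}\theta^\top A\theta$ with a random symmetric $A$, so that $\nabla f(\theta) = A\theta$ and $\nabla^2 f(\theta) = A$. The names `grad_g_analytic` and `grad_g_numeric` and the step size `eta` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed test function: f(theta) = 0.5 * theta^T A theta with A symmetric,
# so grad f(theta) = A theta and the Hessian is A.
n = 4
M = rng.standard_normal((n, n))
A = 0.5 * (M + M.T)

def grad_f(theta):
    return A @ theta

def g(theta):
    # g(theta) = 0.5 * ||grad f(theta)||^2
    return 0.5 * np.sum(grad_f(theta) ** 2)

def grad_g_analytic(theta):
    # Hessian times gradient, as in the derivation above
    return A @ grad_f(theta)

def grad_g_numeric(theta, eps=1e-6):
    # Central finite differences, one coordinate at a time
    out = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        out[i] = (g(theta + step) - g(theta - step)) / (2 * eps)
    return out

theta = rng.standard_normal(n)
max_abs_diff = np.max(np.abs(grad_g_analytic(theta) - grad_g_numeric(theta)))

# One step of the update rule with a small, illustrative step size
eta = 1e-3
g_before = g(theta)
g_after = g(theta - eta * grad_g_analytic(theta))

print("max |analytic - numeric|:", max_abs_diff)
print("g before and after one step:", g_before, g_after)
```

The finite-difference check only confirms the formula at one point for this assumed $f$; it says nothing about how quickly the iteration converges.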

Compare this to the Newton update:

$$\begin{align}
\theta^{t+1} &= \theta^{t} - \gamma \left(\nabla^2f(\theta^{t})\right)^{-1}\nabla f(\theta^{t})
\end{align}$$

Despite the intuitive appeal of gradient norm minimization,
an analysis of its performance on polynomial functions
suggests that it is a *bad idea*:
it converges more slowly than Newton's method even when the step size is well chosen,
and it diverges badly when it is not.
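
Both failure modes can be seen by working the update out for a quadratic form. This is a sketch under the assumption that $f(\theta) = \frac{1}{2}\theta^\top A\theta$ with $A$ symmetric; the symbols $A$, $\lambda_i$ (eigenvalues of $A$), and $c_i^{t}$ (the component of $\theta^{t}$ along the $i$-th eigenvector) are introduced here for illustration.

$$\begin{align}
\theta^{t+1} &= \theta^{t} - \eta A^2\theta^{t} = \left(I - \eta A^2\right)\theta^{t}\\
c_i^{t+1} &= \left(1 - \eta\lambda_i^2\right)c_i^{t}
\end{align}$$

Every component with $\lambda_i \neq 0$ shrinks only if $|1 - \eta\lambda_i^2| < 1$, that is, if $0 < \eta < 2/\lambda_{\max}^2$, where $\lambda_{\max}$ is the eigenvalue of largest magnitude. For larger $\eta$ the component along that eigenvector grows geometrically. With $\eta = 1/\lambda_{\max}^2$, the slowest component contracts by a factor of $1 - \lambda_{\min}^2/\lambda_{\max}^2$ per step, where $\lambda_{\min}$ is the nonzero eigenvalue of smallest magnitude, so the rate depends on the square of the condition number. Under the same assumption, the Newton update with $\gamma = 1$ and invertible $A$ gives $\theta^{t+1} = \theta^{t} - A^{-1}A\theta^{t} = 0$ in a single step.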

It is unclear, however, whether this carries over to other kinds of functions.

In [1]:
import tensorflow as tf

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white")

import numpy as np

from crit_finder import train, evaluate
from crit_finder.graphs import quadratics

%matplotlib inline

### Testing with Identity Matrix

The identity matrix quadratic form also serves as a sanity check for the gradient norm minimization technique:
since the Hessian is the identity,
the updates from gradient norm minimization should exactly match those
from gradient descent, provided both use the same step size.
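
The equivalence can be sketched in plain NumPy, independently of `crit_finder`. This assumes $f(\theta) = \frac{1}{2}\theta^\top A\theta$ with $A = I$ and a shared, illustrative step size `eta`; the starting point is arbitrary.

```python
import numpy as np

# Assumed setup: f(theta) = 0.5 * theta^T A theta with A = I, and the same
# step size eta for both methods (eta is an illustrative value).
A = np.eye(2)
eta = 0.1
theta_gd = np.array([1.0, -2.0])
theta_gnm = theta_gd.copy()

for _ in range(50):
    grad_gd = A @ theta_gd
    theta_gd = theta_gd - eta * grad_gd            # gradient descent step
    grad_gnm = A @ theta_gnm
    theta_gnm = theta_gnm - eta * (A @ grad_gnm)   # gradient norm minimization step

identical = np.array_equal(theta_gd, theta_gnm)
print(identical)
```

Multiplying by the identity leaves the gradient unchanged, so the two trajectories coincide step by step.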

In [2]:
N = 2

identity_matrix = np.eye(N).astype(np.float32)

initial_values = quadratics.generate_initial_values(N)

identity_quadratic_form = quadratics.make(identity_matrix, initial_values,
                                          quadratics.DEFAULTS)

In [3]:
gradmin_final_output, gradmin_final_parameters = quadratics.run_algorithm(
    identity_quadratic_form, "gradient_norm_min", 50)
gradmin_final_output, gradmin_final_parameters

(1.3510263e-06, array([-0.00154637, -0.00055748], dtype=float32))

In [4]:
gd_final_output, gd_final_parameters = quadratics.run_algorithm(
    identity_quadratic_form, "gradient_descent", 50)

gd_final_output, gd_final_parameters

(1.3510263e-06, array([-0.00154637, -0.00055748], dtype=float32))

In [5]:
assert np.array_equal(gradmin_final_parameters, gd_final_parameters)

### Random Symmetric Matrix

Next, we again turn to the case of random symmetric matrices
(for more on the specific random ensemble, see
[the previous notebook](./01a - Critical Points of Quadratic Forms - Newton's Method.ipynb)).

In [6]:
N = 5

initial_values = quadratics.generate_initial_values(N)

random_symmetric_matrix = quadratics.generate_symmetric(N)

random_symmetric_quadratic_form = quadratics.make(random_symmetric_matrix, initial_values,
                                                  quadratics.DEFAULTS)

_, values = quadratics.run_algorithm(random_symmetric_quadratic_form, "gradient_norm_min", 1500)

np.linalg.norm(values)

1.8901263e-09

Despite also using curvature information,
gradient norm minimization takes *far* more steps to reach a given error than
does Newton's method:
while Newton can get to solutions with norm of order `1e-8` in two or three steps,
gradient norm minimization takes thousands of steps even on small problems,
with the number of steps increasing with problem size.
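
The gap in step counts can be reproduced with a small NumPy sketch. It does not use `crit_finder` or the random ensemble from the previous notebook. Instead it assumes $f(\theta) = \frac{1}{2}\theta^\top A\theta$, where $A$ has a hand-picked indefinite spectrum and a random orthogonal eigenbasis. The helper `count_steps`, the tolerance, and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed test problem: f(theta) = 0.5 * theta^T A theta, where A is symmetric
# with a hand-picked indefinite spectrum and a random orthogonal eigenbasis.
eigenvalues = np.array([2.0, -1.0, 0.5, -0.25, 0.1])
n = eigenvalues.size
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(eigenvalues) @ Q.T

theta0 = rng.standard_normal(n)
tol = 1e-8                               # target gradient norm
eta = 1.0 / np.max(eigenvalues ** 2)     # step size for gradient norm minimization

def count_steps(update, max_steps):
    """Apply `update` until the gradient norm ||A theta|| drops below tol."""
    theta, steps = theta0.copy(), 0
    while np.linalg.norm(A @ theta) > tol and steps < max_steps:
        theta = update(theta)
        steps += 1
    return steps

# Gradient norm minimization: theta <- theta - eta * H * grad f, with H = A.
gnm_steps = count_steps(lambda th: th - eta * (A @ (A @ th)), max_steps=100_000)

# Newton with gamma = 1: theta <- theta - H^{-1} grad f, via a linear solve.
newton_steps = count_steps(lambda th: th - np.linalg.solve(A, A @ th), max_steps=100)

print("gradient norm minimization steps:", gnm_steps)
print("Newton steps:", newton_steps)
```

The Newton step is computed with `np.linalg.solve` rather than an explicit inverse, which is the usual choice when the matrix is well conditioned.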

### Random Positive Definite Matrix

The advantage of gradient norm minimization over Newton's method,
however, is that it does not require a matrix inverse calculation.

This has a computational benefit,
since matrix inversion has complexity $O(n^{k})$
for some $k$ in $(2, 3]$,
depending on the algorithm,
while all other computational steps have complexity at most $O(n^2\log n)$.

But even more crucially,
avoiding the matrix inverse sidesteps all of the issues with numerical non-invertibility
that bedeviled the Newton methods discussed in the last notebook.

We again select random ill-conditioned and singular matrices
according to the Wishart distribution,
and we find that gradient norm minimization successfully finds a point with small gradient norm.
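
The singular case can be sketched without `crit_finder`. The example below assumes a Wishart-type matrix $W = XX^\top/k$ with Gaussian $X$ of shape $n \times k$ and $k < n$, so $W$ has rank at most $k$ and no exact inverse exists. The sizes are smaller than in the cells that follow, the step size is an illustrative choice, and `quadratics.generate_wishart` may construct its matrices differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed construction: W = X X^T / k with X of shape (n, k) and k < n,
# which gives a positive semidefinite matrix of rank at most k.
n, k = 50, 10
X = rng.standard_normal((n, k))
W = X @ X.T / k
rank = np.linalg.matrix_rank(W)

theta = rng.standard_normal(n)
lam_max = np.linalg.eigvalsh(W)[-1]      # largest eigenvalue
eta = 1.0 / lam_max ** 2                 # illustrative step size

initial_grad_norm = np.linalg.norm(W @ theta)
for _ in range(1500):
    grad = W @ theta                     # gradient of 0.5 * theta^T W theta
    theta = theta - eta * (W @ grad)     # uses only matrix-vector products
final_grad_norm = np.linalg.norm(W @ theta)

print("rank:", rank, "of", n)
print("gradient norm:", initial_grad_norm, "->", final_grad_norm)
```

Directions in the null space of $W$ contribute nothing to the gradient, so the iteration leaves them alone and still drives the gradient norm down.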

In [7]:
N = 500
k = 100

wishart_random_matrix = quadratics.generate_wishart(N, k)

initial_values = quadratics.generate_initial_values(N)

hyperparameters = {"learning_rate": 0.1,
                   "newton_rate": 1,
                   "fudge_factor": 0.0,
                   "inverse_method": "fudged",
                   "gradient_norm_min_rate": 0.001}


wishart_quadratic_form = quadratics.make(wishart_random_matrix, initial_values,
                                         hyperparameters)

In [8]:
output, _ = quadratics.run_algorithm(wishart_quadratic_form, "gradient_norm_min", 1500)
output

3.5502217e-07

In [9]:
N = 500
k = 100

generate_full_rank_wishart = lambda N: quadratics.generate_wishart(N, k)

evaluate.gradient_test(N, generate_full_rank_wishart, "gradient_norm_min", 1500, hyperparameters)

output:
	initial: 0.5760053396224976 	final: 3.848661435768008e-06
gradient norm:
	initial: 2.7083492279052734 	final: 0.00349063565954566


In [10]:
N = 500
k = 500

wishart_random_matrix = quadratics.generate_wishart(N, k)

initial_values = quadratics.generate_initial_values(N)

wishart_quadratic_form = quadratics.make(wishart_random_matrix, initial_values,
                                         quadratics.DEFAULTS)

In [11]:
output, _ = quadratics.run_algorithm(wishart_quadratic_form, "gradient_norm_min", 1500)
output

0.0014270995

In [12]:
N = 500
k = 500

generate_full_rank_wishart = lambda N: quadratics.generate_wishart(N, k)

evaluate.gradient_test(N, generate_full_rank_wishart, "gradient_norm_min", 1500, quadratics.DEFAULTS)

output:
	initial: 0.44226109981536865 	final: 0.0013975112233310938
gradient norm:
	initial: 1.2912944555282593 	final: 0.010902918875217438
