# Second-Order Methods in TensorFlow - Part 1

Second-order methods are not well supported in TensorFlow, so it's easy to make a mistake when implementing them.

To get a better handle on using second-order methods for neural networks, where the ground truth is unclear or hard to calculate, I'm first working through a case where we do have ground truth: quadratic forms.

A quadratic form is a homogeneous polynomial of degree two over an $n$-dimensional input $\mathbf{x}$. For an $n \times n$ matrix $Q$, it is calculated as

$$
\mathbf{x}^\intercal Q \mathbf{x}
$$
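To make the definition concrete, here is a minimal NumPy sketch of the same calculation. The helper name `quadratic_form` and the fixed seed are my own choices for illustration; they are not part of the `second_order` module used in the rest of this notebook.

```python
import numpy as np


def quadratic_form(Q, x):
    """Evaluate x^T Q x for a square matrix Q and a 1-D vector x."""
    return x @ Q @ x


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
Q = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

# The same number, written out as the degree-two polynomial sum_ij Q_ij x_i x_j.
as_polynomial = sum(Q[i, j] * x[i] * x[j] for i in range(3) for j in range(3))

assert np.isclose(quadratic_form(Q, x), as_polynomial)
assert np.isclose(quadratic_form(Q, x), np.einsum("i,ij,j->", x, Q, x))
```

Every term in the sum has exactly two factors of $\mathbf{x}$, which is what makes the polynomial homogeneous of degree two.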

Below, I walk through the gradient and Hessian calculations.

This isn't a particularly interesting exercise, but it has allowed me to catch a number of bugs.

In [1]:
import tensorflow as tf

import matplotlib.pyplot as plt

import numpy as np

import second_order

## Testing with Identity Matrix

The simplest case, useful for working out basic bugs and flaws in reasoning, is the identity matrix, because the quadratic form it defines is the squared $\ell_2$ norm:

$$
\mathbf{x}^\intercal I \mathbf{x} = \mathbf{x}^\intercal \mathbf{x} = \sum_i x_i \cdot x_i
$$

In [2]:
N = 2

identity_matrix = np.eye(N).astype(np.float32)
input_vector = np.sqrt([1/2,1/2]).astype(np.float32)

identity_quadratic_form_graph = second_order.make_quadratic_form(identity_matrix, input_vector,
                                                                hyperparameters=second_order.DEFAULTS)

In [3]:
out = second_order.get_result("output", input_vector, identity_quadratic_form_graph)

assert np.isclose(np.sum(np.square(input_vector)), out)

The gradient is just twice the input:

$$
\nabla_\mathbf{x} \mathbf{x}^\intercal I \mathbf{x} = \nabla_\mathbf{x} \sum_i x_i\cdot x_i = 2\cdot I\cdot \mathbf{x}
$$

In [4]:
gradient = np.squeeze(second_order.get_result("gradients", input_vector, identity_quadratic_form_graph))

gradient

array([1.4142135, 1.4142135], dtype=float32)

In [5]:
assert np.allclose(input_vector.T*2, gradient)

The Hessian matrix is just $2\cdot I$.

In [6]:
hessian = np.squeeze(second_order.get_result("hessian", input_vector, identity_quadratic_form_graph))

hessian

array([[2., 0.],
       [0., 2.]], dtype=float32)

In [7]:
assert np.allclose(hessian, 2*np.eye(N))

## Random Matrix

Of course, to really ensure that our code is correct, we should test on less symmetric problems, e.g. matrices with independent standard normal entries.

Generically, the gradient of $\mathbf{x}^\intercal Q \mathbf{x}$ is

$$
\nabla_\mathbf{x} \mathbf{x}^\intercal Q \mathbf{x} = (Q + Q^\intercal)\mathbf{x}
$$

We can get this from the definition of the derivative, following
[this derivation from Mathematics Stack Exchange](https://math.stackexchange.com/questions/239207/hessian-matrix-of-a-quadratic-form).

Writing $f(\mathbf{x})$ for $\mathbf{x}^\intercal Q \mathbf{x}$, our goal is to find $\nabla_\mathbf{x}f(\mathbf{x})$ such that

$$\begin{align}
f(\mathbf{x}+\epsilon) &= f(\mathbf{x}) + \nabla_\mathbf{x}f(\mathbf{x})\epsilon + o(\epsilon) \\
\end{align}$$

as the norm of $\epsilon$ goes to $0$.
We expand the left-hand side:

$$\begin{align}
f(\mathbf{x}+\epsilon) &= (\mathbf{x}+\epsilon)^\intercal Q (\mathbf{x}+\epsilon) \\
&= \mathbf{x}^\intercal Q \mathbf{x}
+ \mathbf{x}^\intercal Q \epsilon + \epsilon^\intercal Q \mathbf{x}
+ \epsilon^\intercal Q \epsilon
\end{align}$$

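Each term here is a scalar, and a scalar equals its own transpose, so $\epsilon^\intercal Q \mathbf{x} = (\epsilon^\intercal Q \mathbf{x})^\intercal = \mathbf{x}^\intercal Q^\intercal \epsilon$. Using this together with distributivity, we can rewrite the expansion as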

$$\begin{align}
f(\mathbf{x}+\epsilon)
&= \mathbf{x}^\intercal Q \mathbf{x}
+ \mathbf{x}^\intercal Q \epsilon + \epsilon^\intercal Q \mathbf{x}
+ \epsilon^\intercal Q \epsilon \\
&= \mathbf{x}^\intercal Q \mathbf{x}
+ \mathbf{x}^\intercal Q \epsilon + (\epsilon^\intercal Q \mathbf{x})^\intercal
+ \epsilon^\intercal Q \epsilon \\
&= \mathbf{x}^\intercal Q \mathbf{x}
+ \mathbf{x}^\intercal Q \epsilon + \mathbf{x}^\intercal Q^\intercal \epsilon
+ \epsilon^\intercal Q \epsilon \\
&= \mathbf{x}^\intercal Q \mathbf{x}
+ \mathbf{x}^\intercal (Q+Q^\intercal) \epsilon
+ \epsilon^\intercal Q \epsilon \\
\end{align}$$
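The expansion is an exact identity, not an approximation, so it can be checked numerically term by term. The following is a small sketch with NumPy; the variable names and the seed are assumptions of mine, chosen for illustration.

```python
import numpy as np


def f(v, Q):
    """The quadratic form v^T Q v."""
    return v @ Q @ v


rng = np.random.default_rng(2)  # fixed seed, chosen only for reproducibility
Q = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
eps = 1e-3 * rng.standard_normal(3)  # a small perturbation of x

zeroth_order = f(x, Q)              # x^T Q x
first_order = x @ (Q + Q.T) @ eps   # x^T (Q + Q^T) eps, linear in eps
remainder = f(eps, Q)               # eps^T Q eps, quadratic in eps

# The three terms add up to f(x + eps), up to floating-point rounding.
assert np.isclose(f(x + eps, Q), zeroth_order + first_order + remainder)

# Shrinking eps tenfold shrinks the remainder a hundredfold.
assert np.isclose(f(0.1 * eps, Q), 0.01 * remainder)
```

The second assertion is the numerical face of the claim that the last term vanishes faster than $\epsilon$ itself.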

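Comparing the final line of the expansion to the definition of the derivative above,
we see that we have a term $f(\mathbf{x})$ and a term $\epsilon^\intercal Q \epsilon$ that is quadratic in $\epsilon$, and so is $o(\epsilon)$,
leaving the middle term, sans $\epsilon$, to be our derivative, written as a row vector:

$$
\mathbf{x}^\intercal (Q + Q^\intercal)
$$

Transposing it gives the gradient as a column vector, matching the expression we started with:

$$
\nabla_\mathbf{x} \mathbf{x}^\intercal Q \mathbf{x} = (Q + Q^\intercal)\mathbf{x}
$$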
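Before trusting the algebra, it is cheap to compare $(Q + Q^\intercal)\mathbf{x}$ against a finite-difference estimate of the gradient. This is a sketch under my own assumptions: `numerical_gradient` is a hypothetical helper written for this check, not a function from `second_order` or TensorFlow, and the step size is an arbitrary choice.

```python
import numpy as np


def quadratic_form(Q, x):
    """Evaluate x^T Q x."""
    return x @ Q @ x


def numerical_gradient(f, x, step=1e-4):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        e_k = np.zeros_like(x)
        e_k[k] = step  # perturb only the k-th coordinate
        grad[k] = (f(x + e_k) - f(x - e_k)) / (2 * step)
    return grad


rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
Q = rng.standard_normal((3, 3))  # float64 and deliberately not symmetric
x = rng.standard_normal(3)

estimate = numerical_gradient(lambda v: quadratic_form(Q, v), x)
closed_form = (Q + Q.T) @ x

assert np.allclose(estimate, closed_form, atol=1e-6)
```

For a quadratic function the second-order terms cancel in a central difference, so the only error here is floating-point rounding. I use `float64` for that reason; at `float32` precision the tolerance would need to be much looser.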

The gradient is linear in $\mathbf{x}$, so differentiating once more gives a constant Hessian: $Q+Q^\intercal$.
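For readers who prefer index notation, the same gradient and Hessian can be read off component by component. The indices $i, j, k, l$ are notation I'm introducing here; they all run over the $n$ input dimensions.

$$\begin{align}
f(\mathbf{x}) &= \sum_{i,j} Q_{ij} x_i x_j \\
\frac{\partial f}{\partial x_k} &= \sum_j Q_{kj} x_j + \sum_i Q_{ik} x_i = \left[(Q + Q^\intercal)\mathbf{x}\right]_k \\
\frac{\partial^2 f}{\partial x_k \partial x_l} &= Q_{kl} + Q_{lk} = \left[Q + Q^\intercal\right]_{kl}
\end{align}$$

Setting $Q = I$ recovers the identity-matrix results from the previous section: a gradient of $2\mathbf{x}$ and a Hessian of $2 \cdot I$.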

In [8]:
N = 3

random_matrix = np.random.standard_normal(size=(N,N)).astype(np.float32)

input_vector = np.random.standard_normal(size=(N)).astype(np.float32)

random_quadratic_form = second_order.make_quadratic_form(random_matrix, input_vector,
                                                         second_order.DEFAULTS)


In [9]:
out = second_order.get_result("output", input_vector, random_quadratic_form)
out

-0.26744586

In [10]:
assert np.isclose(out, input_vector.T.dot(random_matrix).dot(input_vector))

In [11]:
gradient = np.squeeze(second_order.get_result("gradients", input_vector, random_quadratic_form))
gradient

array([-1.2694671,  1.062211 ,  0.6193731], dtype=float32)

In [12]:
assert np.allclose(gradient, (random_matrix+random_matrix.T).dot(input_vector).squeeze())

In [13]:
tf_hessian = second_order.get_result("hessian", input_vector, random_quadratic_form)
tf_hessian

array([[-2.6165638 ,  1.5296859 ,  0.44749093],
       [ 1.5296859 ,  0.41847938,  0.95105606],
       [ 0.44749093,  0.95105606,  4.860975  ]], dtype=float32)

In [14]:
true_hessian = random_matrix+random_matrix.T
true_hessian

array([[-2.6165638 ,  1.5296859 ,  0.44749093],
       [ 1.5296859 ,  0.41847938,  0.95105606],
       [ 0.44749093,  0.95105606,  4.860975  ]], dtype=float32)

In [15]:
assert np.allclose(true_hessian, tf_hessian)
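The check above compares TensorFlow against the closed form $Q + Q^\intercal$, so it relies on the algebra being right. A finite-difference Hessian gives a check that depends only on function evaluations. The following is a minimal sketch, assuming plain NumPy; `numerical_hessian` is a hypothetical helper of mine, not part of `second_order`, and the step size is an arbitrary choice.

```python
import numpy as np


def numerical_hessian(f, x, step=1e-3):
    """Central-difference estimate of the Hessian of a scalar function f at x."""
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = step
        for j in range(n):
            e_j = np.zeros(n)
            e_j[j] = step
            # Second mixed difference along coordinates i and j.
            hess[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                          - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * step ** 2)
    return hess


rng = np.random.default_rng(3)  # fixed seed, chosen only for reproducibility
Q = rng.standard_normal((3, 3))  # float64 and deliberately not symmetric
x = rng.standard_normal(3)

estimate = numerical_hessian(lambda v: v @ Q @ v, x)

assert np.allclose(estimate, Q + Q.T, atol=1e-6)
assert np.allclose(estimate, estimate.T, atol=1e-6)  # symmetric even though Q is not
```

This costs $4n^2$ function evaluations, so it is only practical as a test on small problems like this one.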