## Taylor approximations.

We have some function, say $y = \sin(x)$, and we want to approximate it within some neighborhood around a point $a$.

We can do this (very poorly) with a linear approximation, $f(x) = mx + c$. We know $a$, the point whose neighborhood we want to approximate $y$ in; take $a = 0$ here. The approximation $f$ should agree with the true function at $a$, $f(a) = y(a) = \sin(0) = 0$, and the gradient should agree as well, $\frac{df}{dx}(a) = \frac{dy}{dx}(a) = \cos(0) = 1$.

Given these two requirements we can solve for $m$ and $c$.

$$
f(a) = ma + c = 0 \\
\frac{df}{dx}(a) = m = 1 \\
$$

Therefore (with $a = 0$) $c = 0$ and $m = 1$, so $f(x) = 1\cdot x = x$. We could continue this process, assuming we have access to higher order derivatives, to improve the accuracy of our approximation. So using $\frac{d^2y}{dx^2}$ we could fit a polynomial of order 2 that also matches the second derivative at $a$.

$$
f(x) = px^2 + qx + r \\
df/dx = 2px + q \\
d^2f/dx^2 = 2p \\
$$
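As a sketch of how the second order fit could be solved, write the quadratic with coefficients $c_2, c_1, c_0$ (labels introduced here only to avoid clashing with the expansion point $a$) and match the value and the first two derivatives of $y$ at $a$, starting from the highest derivative:

$$
f(x) = c_2x^2 + c_1x + c_0 \\
f''(a) = 2c_2 = y''(a) \Rightarrow c_2 = \tfrac{1}{2}y''(a) \\
f'(a) = 2c_2a + c_1 = y'(a) \Rightarrow c_1 = y'(a) - y''(a)\,a \\
f(a) = c_2a^2 + c_1a + c_0 = y(a) \Rightarrow c_0 = y(a) - y'(a)\,a + \tfrac{1}{2}y''(a)\,a^2 \\
$$

Collecting terms gives $f(x) = y(a) + y'(a)(x-a) + \tfrac{1}{2}y''(a)(x-a)^2$. For $y = \sin(x)$ at $a = 0$ we have $y''(0) = 0$, so the quadratic fit is still just $f(x) = x$.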

Continuing further gives the Taylor approximation,
$$f(x) \approx f(a)+{\frac {f'(a)}{1!}}(x-a)+{\frac {f''(a)}{2!}}(x-a)^{2}+{\frac {f'''(a)}{3!}}(x-a)^{3}+\cdots$$
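A minimal sketch of the series above for $y = \sin(x)$, assuming we exploit the fact that the derivatives of $\sin$ cycle with period 4. The name `taylor_sin` and the test point `x = 0.5` are illustrative choices, not something from these notes.

```python
import math


def taylor_sin(x, a=0.0, order=1):
    """Taylor polynomial of sin about `a`, keeping terms up to (x - a)**order."""
    # The derivatives of sin cycle with period 4: sin, cos, -sin, -cos.
    derivs = [math.sin(a), math.cos(a), -math.sin(a), -math.cos(a)]
    return sum(derivs[k % 4] * (x - a) ** k / math.factorial(k)
               for k in range(order + 1))


x = 0.5
# absolute error of the 1st, 3rd and 5th order approximations at x
errors = [abs(taylor_sin(x, order=n) - math.sin(x)) for n in (1, 3, 5)]
```

With `order=1` and `a=0` this reduces to the linear approximation $f(x) = x$ from above, and the error at `x` shrinks as the order grows.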

<!-- insert joke... in the hood-->

<!-- what if we learn each a? or could even just average the backpropagated grads? but how could we learn the higher order grads? estimates from the many first order ones?!? 
what about predicting $f''(a)$ -->



## Neural Taylor approximations

In a machine learning setting, we have access to multiple neighborhoods where we can evaluate $y, x$ (the training examples), $dy/dx$ (and even higher order derivatives if we are willing to pay the computational price). But storing all the neighborhoods is not ideal (it would require memory linear in the size of the dataset). So which neighborhoods are the best, and how should they be used? Could we learn which neighborhoods are the best!?

<!-- Can we collect/represent/approximate higher order gradients in some efficient manner? -->

```python
from math import factorial

import tensorflow as tf

M = 4  # the number of neighborhoods we want to use
N = 2  # the order of our approximation (terms i = 0..N-1)
D = 1  # dimensionality of x (an arbitrary choice for this sketch)


def make_net():
    # a function approximator. in this case linear. y = xw + c
    w = tf.Variable(tf.random.normal([D, D]))
    c = tf.Variable(tf.zeros([D]))
    return lambda x: tf.matmul(x, w) + c


# one net per neighborhood to predict `a`, plus one net per (neighborhood, order)
# to estimate the i-th derivative there. this split is only one possible reading
# of "a set of M*N function approximators".
center_nets = [make_net() for _ in range(M)]
grad_nets = [[make_net() for _ in range(N)] for _ in range(M)]


def nta(x):  # neural taylor approximation
    # predict/lookup values of `a` depending on `x`.
    # could just be a nearest neighbor...
    # these are the neighborhoods we are going to use to approximate f(x) at.
    neighborhoods = [net(x) for net in center_nets]
    # NOTE preferably a = x but we may not have good estimates for the gradients at, or close to x.
    # so we need to generalise!

    # evaluate the gradients within each neighborhood
    # could use;
    # - local variables to average/collect gradients (aka weights in final layer)
    # - the grads of the fn approximators
    # - accumulate the gradients w.r.t the error to find the best set of directions to remember
    # NOTE is there any intelligent sharing that can be done here?
    terms = [(x - a) ** i * grad(a) / factorial(i)
             for a, grads in zip(neighborhoods, grad_nets)
             for i, grad in enumerate(grads)]

    # take a linear combination of our estimates (unweighted here).
    # ideally weight the estimates upon how 'good' the neighborhood, a, will be at x.
    return tf.add_n(terms)


x = tf.random.normal([8, D])  # a batch of 8 inputs
nta(x)
```

So this first order neural Taylor approximation gives a piecewise linear approximation. This interpretation implies that we want to learn;
* the best (finite) set of neighborhoods that 'cover' the input distribution of data. (the first layer)
* how to combine predictions of f(x) from the different neighborhoods. (the second layer)
<img src="neuraltaylor.png" alt="pic" style="width:500px;height:300px;">
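
The piecewise linear picture can be sketched without any learning at all. This assumes a fixed, hand-picked grid of neighborhoods and a nearest neighbor lookup, both of which are stand-ins for the learned first and second layers; `piecewise_taylor` and `anchors` are illustrative names.

```python
import numpy as np


def piecewise_taylor(x, anchors, f, df):
    """First order Taylor estimate of f at each x, expanded about its nearest anchor."""
    x = np.asarray(x, dtype=float)
    # index of the closest anchor (neighborhood) for every input
    nearest = np.abs(x[:, None] - anchors[None, :]).argmin(axis=1)
    a = anchors[nearest]
    return f(a) + df(a) * (x - a)


anchors = np.linspace(-np.pi, np.pi, 9)  # M = 9 hand-picked neighborhoods
xs = np.linspace(-np.pi, np.pi, 201)
approx = piecewise_taylor(xs, anchors, np.sin, np.cos)
max_err = np.max(np.abs(approx - np.sin(xs)))
```

Note that a hard nearest neighbor switch gives a piecewise linear function that jumps at the boundaries between neighborhoods; a learned, smooth weighting of the neighborhoods would avoid that.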

This may remind you of a 1 hidden layer neural network! __TODO__ make this more rigorous.

Cool. So 1 hidden layer neural networks do a piecewise 1st order taylor approximation. 
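
A small numerical check of the piecewise linear claim, assuming a scalar input, ReLU units and random weights (all choices made for this sketch). Away from the points where a unit switches on or off, the second finite difference of the network output should vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 5                                          # hidden width (number of ramps)
w, c = rng.normal(size=H), rng.normal(size=H)  # first layer: slopes and offsets
v = rng.normal(size=H)                         # second layer: combination weights


def net(x):
    """y = sum_j v_j * relu(w_j * x + c_j) for a 1-D array of scalar inputs."""
    hidden = np.maximum(0.0, np.outer(x, w) + c)
    return hidden @ v


kinks = -c / w                                 # where each ramp switches on or off
xs = np.linspace(-5, 5, 2001)
second_diff = np.diff(net(xs), n=2)            # zero wherever the net is locally linear
# mark the 3-point stencils [xs[i], xs[i+2]] that contain a kink
has_kink = np.array([np.any((kinks >= xs[i]) & (kinks <= xs[i + 2]))
                     for i in range(len(xs) - 2)])
```

So a width `H` network of this form has at most `H + 1` linear pieces in 1-D, one per region between consecutive kinks.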

_From this perspective;_
* _it would make more sense to use a bounded ramp function, e.g. min(max(0, x), 10), as we want neighborhoods (otherwise we need to cancel out the other values of the ramp). Actually the grad of sigmoid seems ideal...?_
* _Width is the number of local approximators. And depth would be the ability to have non-linear weightings of the local approximations, i.e. we can repeat it a few times, recycle it over a few different areas, put half of it over there..._
* _maybe it would be nice to have some prior on the distribution of neighborhoods?_
* _as the dimensionality of x increases, there is exponentially more space that is 'far away' from a neighborhood._
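
One way to put a number on the last point, as a rough sketch: take a neighborhood to be a ball of radius 0.5 centred in the unit cube (an assumption made purely for illustration) and ask what fraction of the cube it covers as the dimension grows.

```python
import math


def ball_fraction(d, r=0.5):
    """Volume of a radius-r ball in d dimensions, as a fraction of the unit cube."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)


fractions = {d: ball_fraction(d) for d in (1, 2, 5, 10, 20)}
```

In one dimension the ball covers the whole interval, in two dimensions about 79% of the square, and by ten dimensions well under 1% of the cube.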


Cool, given these awesome results the next questions that follow are;
* _What do 2 hidden layer neural networks do?_
* _What would a 2nd order neural taylor approximation look like?_
* _Can correlations of correlations help us learn?_

Ultimately, I would like to investigate whether these are all the same thing, and if not, why not.

<!-- Such that we are doing a piecewise _(cts would be better than piecewise)_ combination of, say, hessians. Switching out the right hessian (or higher order derivatives) depending on the neighborhood of the data. -->

## Second order neural Taylor approximations



## Two hidden layer neural networks

$$
\begin{align}
y &= A\rho(B\rho(Cx)) \\
&= 
\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22} \\
\end{bmatrix} \rho \big(
\begin{bmatrix}
b_{11} & b_{12} \\
b_{21} & b_{22} \\
\end{bmatrix} \rho \big(
\begin{bmatrix}
c_{11} & c_{12} \\
c_{21} & c_{22} \\
\end{bmatrix} x \big) \big)\\
&= cbf \\
\end{align}
$$
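
A minimal sketch of the forward pass $y = A\rho(B\rho(Cx))$ with $2\times 2$ weights. The choice of ReLU for $\rho$ is an assumption, since the notes leave the non-linearity open, and the weights are random placeholders.

```python
import numpy as np


def rho(z):
    """Element-wise non-linearity; ReLU is an assumption, the notes leave rho open."""
    return np.maximum(0.0, z)


def two_layer(x, A, B, C):
    """y = A rho(B rho(C x)) for a vector x."""
    return A @ rho(B @ rho(C @ x))


rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(2, 2)) for _ in range(3))
x = rng.normal(size=2)
y = two_layer(x, A, B, C)
```

With ReLU and no biases the map is positively homogeneous, so scaling `x` by a positive constant scales `y` by the same constant.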

## Second order neural nets

Aaand finally. The real goal.

Second order networks. How can we train them, what do they learn, how do they generalise?

$$
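f(x) = x\mathbf Ax^T + Bx + c \\
\frac{\partial f}{\partial x} = x(\mathbf A + \mathbf A^T) + B \\
\frac{\partial^2 f}{\partial x^2} = \mathbf A + \mathbf A^T \\
$$
aka the hessian is $\mathbf A + \mathbf A^T$, which is just $2\mathbf A$ when $\mathbf A$ is symmetric.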
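
A quick finite difference check of the derivatives of the quadratic form, assuming a scalar output and a 1-D array for $x$. For a general (not necessarily symmetric) $\mathbf A$ the gradient works out to $x(\mathbf A + \mathbf A^T) + B$ and the hessian to $\mathbf A + \mathbf A^T$. The sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))  # deliberately not symmetric
B = rng.normal(size=d)
c = 0.7


def f(x):
    """Scalar quadratic f(x) = x A x^T + B x + c for a 1-D array x."""
    return x @ A @ x + B @ x + c


def numerical_grad(x, eps=1e-6):
    """Central finite difference estimate of df/dx."""
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(d)])


x = rng.normal(size=d)
analytic_grad = x @ (A + A.T) + B
hessian = A + A.T            # constant in x, and always symmetric
```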

$$
\mathcal L = \parallel y - f(x)\parallel_2^2 \\
\frac{\partial \mathcal L}{\partial \mathbf A} = \\
\frac{\partial \mathcal L}{\partial B} = \frac{\partial \mathcal L}{\partial f} \frac{\partial f}{\partial B}  = ?? (I \circ x)\\
\frac{\partial \mathcal L}{\partial c} =  \frac{\partial \mathcal L}{\partial f} \frac{\partial f}{\partial c} = ??(I)\\
$$
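
One way the blanks above could be filled in, assuming a single training example, a scalar output $f(x)$ and $x$ as a row vector (so that $x^Tx$ is an outer product). Every term shares the same factor $\frac{\partial \mathcal L}{\partial f}$:

$$
\frac{\partial \mathcal L}{\partial f} = -2(y - f(x)) \\
\frac{\partial f}{\partial \mathbf A} = x^Tx, \quad \frac{\partial f}{\partial B} = x, \quad \frac{\partial f}{\partial c} = 1 \\
\frac{\partial \mathcal L}{\partial \mathbf A} = -2(y - f(x))\,x^Tx \\
\frac{\partial \mathcal L}{\partial B} = -2(y - f(x))\,x \\
\frac{\partial \mathcal L}{\partial c} = -2(y - f(x)) \\
$$

The $\mathbf A$ term follows from $\frac{\partial}{\partial a_{ij}} \sum_{k,l} x_k a_{kl} x_l = x_ix_j$.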

What about a non-linearity? The function is already non-linear. But maybe we want to ?!?!? choose which neighborhood to use to approximate, ...

What if we nest multiple? Like a deep neural net?
$$
y = f_2(f_1(x)) \\
y = (x\mathbf A_1 x^T +B_1x +c_1) \mathbf A_2 (x\mathbf A_1 x^T +B_1x +c_1)^T + B_2(x\mathbf A_1 x^T +B_1x +c_1) + c_2 \\
$$
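
A sketch of nesting two such layers. For the composition to make sense the inner layer has to be vector valued, so this assumes each $\mathbf A$ is a 3-tensor with one matrix slice per output; `quad_layer` and the sizes `d, h, o` are illustrative choices.

```python
import numpy as np


def quad_layer(x, A, B, c):
    """Vector-valued quadratic layer: out[k] = x A[k] x^T + B[k] x + c[k]."""
    return np.einsum("i,kij,j->k", x, A, x) + B @ x + c


rng = np.random.default_rng(0)
d, h, o = 3, 4, 2  # input, hidden and output sizes (arbitrary)
A1, B1, c1 = rng.normal(size=(h, d, d)), rng.normal(size=(h, d)), rng.normal(size=h)
A2, B2, c2 = rng.normal(size=(o, h, h)), rng.normal(size=(o, h)), rng.normal(size=o)

x = rng.normal(size=d)
hidden = quad_layer(x, A1, B1, c1)
y = quad_layer(hidden, A2, B2, c2)
```

Two nested quadratic layers give a polynomial of degree 4 in $x$, so the degree doubles with every extra layer.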

## Problems.

the nth order derivative (e.g. the hessian) is large. for a $d$ dimensional input it has $d^n$ entries, so each higher order moment/derivative requires more space, scaling exponentially in the order.

solution. find compact/sparse representations, e.g. by factorising them.

could do;
* a rank decomposition. $x^TWy = \mathbf 1^T(Ax \circ By)$ when $W = A^TB$
* local decomposition (related somehow to a conv) $x^TWy = M\hat\Sigma : M \in \mathbb R^{o\times(k\times d)}, \hat\Sigma \text{ is diagonal}$
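
A small check of the rank decomposition idea, assuming the factorisation $W = A^TB$ with $A, B \in \mathbb R^{r\times d}$. In that case the bilinear form is just the sum of the element-wise product of the two projections. The sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2                  # input size and rank (r < d)
A = rng.normal(size=(r, d))
B = rng.normal(size=(r, d))
W = A.T @ B                  # a d x d matrix of rank at most r

x, y = rng.normal(size=d), rng.normal(size=d)
full = x @ W @ y                      # needs d*d parameters
factored = np.sum((A @ x) * (B @ y))  # needs only 2*r*d parameters
```

Here the full matrix has 36 entries while the two factors together have 24, and the gap widens as `d` grows with `r` held fixed.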


$$
\mathbf A_{i, :, :} =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \dots & & & \\
a_{2,1} & a_{2,2} & a_{2,3} &\dots& & \\
\dots &   a_{3,2} & a_{3,3} & a_{3,4} &\dots& \\
      &     \dots & a_{4,3} & a_{4,4} & a_{4,5} \\
\end{bmatrix}
$$

but it is symmetric, so we don't need the upper triangle?
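
A sketch of what storing only the lower triangle could look like, assuming a dense symmetric $d \times d$ slice. The parameter values are placeholders.

```python
import numpy as np

d = 4
rows, cols = np.tril_indices(d)         # lower triangle indices, diagonal included
params = np.arange(1.0, len(rows) + 1)  # d*(d+1)/2 free parameters

A = np.zeros((d, d))
A[rows, cols] = params
A = A + A.T - np.diag(np.diag(A))       # mirror the lower triangle into the upper one
```

This roughly halves the storage, from $d^2$ to $d(d+1)/2$ numbers per slice, but it does not change the exponential growth with the order.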


what about circular convolutions? they seem somehow relevant here?

## Resources

* http://www.math.smith.edu/~rhaas/m114-00/chp4taylor.pdf
* http://mathinsight.org/taylor_polynomial_multivariable_examples
* 3blue1brown...