In [None]:
import tensorflow as tf

## Taylor approximations.

We have some function, say $y = \sin(x)$, and we want to approximate it within some neighborhood around a point $a$.

We can do this (very poorly) with a linear approximation, $f(x) = mx + c$. We know $a$, the point we want to approximate $y$ around; take $a = 0$ for concreteness. At $a$ the approximation should agree with the function, so $f(a) = y(a) = \sin(0) = 0$, and the gradient should agree too, $df/dx(a) = dy/dx(a) = \cos(0) = 1$.

Given these two relations we can solve for $m$ and $c$.

$$
f(a) = ma + c = 0 \\
df/dx(a) = m = 1 \\
$$

Therefore $c = 0$ and $f(x) = x$. We could continue this process, assuming we have access to higher order gradients, to improve the accuracy of our approximation. So using $d^2y/dx^2$ we could fit a polynomial of order 2 that also matches the second derivative.

$$
f(x) = \alpha x^2 + \beta x + \gamma \\
df/dx = 2\alpha x + \beta \\
d^2f/dx^2 = 2\alpha \\
$$
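A minimal sketch of the two fits, solving the matching conditions by hand for a generic expansion point. The helper names `fit_line` and `fit_quadratic`, and the test points, are made up for illustration.

```python
import math

def fit_line(y_a, dy_a, a):
    """(m, c) of m*x + c matching the value and first derivative at x = a."""
    m = dy_a               # slopes agree at a
    c = y_a - m * a        # values agree at a: m*a + c = y(a)
    return m, c

def fit_quadratic(y_a, dy_a, d2y_a, a):
    """(c2, c1, c0) of c2*x**2 + c1*x + c0 matching value, 1st and 2nd derivative at a."""
    c2 = d2y_a / 2.0                   # second derivative of the quadratic is 2*c2
    c1 = dy_a - 2.0 * c2 * a           # first derivative at a is 2*c2*a + c1
    c0 = y_a - c2 * a * a - c1 * a     # value at a is c2*a**2 + c1*a + c0
    return c2, c1, c0

# Around a = 0 the line is f(x) = x; around a = pi/2 it is the constant f(x) = 1.
m0, c0_line = fit_line(math.sin(0.0), math.cos(0.0), 0.0)
m1, c1_line = fit_line(math.sin(math.pi / 2), math.cos(math.pi / 2), math.pi / 2)

# Compare both fits around a = 1 at the nearby point x = 1.2.
a, x = 1.0, 1.2
m, c = fit_line(math.sin(a), math.cos(a), a)
q2, q1, q0 = fit_quadratic(math.sin(a), math.cos(a), -math.sin(a), a)
line_err = abs(m * x + c - math.sin(x))
quad_err = abs(q2 * x * x + q1 * x + q0 - math.sin(x))
```

The quadratic error at $x = 1.2$ comes out well below the linear one, which is the whole point of paying for the extra derivative.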

Matching every order of derivative at $a$ in this way ends up giving the Taylor series
$$f(x) \approx f(a)+{\frac {f'(a)}{1!}}(x-a)+{\frac {f''(a)}{2!}}(x-a)^{2}+{\frac {f'''(a)}{3!}}(x-a)^{3}+\cdots$$
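As a quick check of the series, here is a sketch that truncates it at a given order for $y = \sin(x)$. It relies on the derivatives of sin cycling through sin, cos, -sin, -cos; the function names and the points `a = 0.5`, `x = 1.5` are arbitrary choices for illustration.

```python
import math

def sin_derivative(k, a):
    """k-th derivative of sin at a; the derivatives cycle sin, cos, -sin, -cos."""
    return [math.sin(a), math.cos(a), -math.sin(a), -math.cos(a)][k % 4]

def taylor_sin(x, a, order):
    """Taylor polynomial of sin around a, truncated after the given order."""
    return sum(sin_derivative(k, a) * (x - a) ** k / math.factorial(k)
               for k in range(order + 1))

a, x = 0.5, 1.5
# Absolute error of the truncated series at x for increasing order.
errors = [abs(taylor_sin(x, a, n) - math.sin(x)) for n in (1, 3, 5, 7)]
```

For these points the error shrinks with every extra pair of terms, even though $x$ is a full unit away from $a$.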



<!-- what if we learn each a? or could even just average the backpropagated grads? but how could we learn the higher order grads? estimates from the many first order ones?!? 
what about predicting $f''(a)$ -->


#### ML

In machine learning we have access to many neighborhoods where we can evaluate $x$ and $y$ (the training examples) and $dy/dx$ (and even higher order derivatives, if we are willing to pay the computational price). But surely storing all of the neighborhoods is not ideal. Which ones are the best, and how should they be used?

Questions. Can we collect/represent/approximate higher order gradients in some efficient manner? 
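One cheap option (a sketch, not something these notes settle on) is to estimate the second derivative from first order gradients alone, via a central finite difference of the gradient. The name `second_derivative_estimate` and the step size `h` are assumptions for illustration.

```python
import math

def second_derivative_estimate(grad, a, h=1e-3):
    """Central difference of the first derivative: (g(a + h) - g(a - h)) / (2h)."""
    return (grad(a + h) - grad(a - h)) / (2.0 * h)

a = 0.7
estimate = second_derivative_estimate(math.cos, a)  # the gradient of sin is cos
exact = -math.sin(a)                                # the true second derivative of sin
```

This costs two gradient evaluations per direction, so in many dimensions it gives Hessian-vector style information rather than the full Hessian.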

In [None]:
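import math

N = 3    # number of terms (orders 0..N-1) in the expansion
DIM = 1  # input width; elementwise powers of (x - a) assume scalar features

def make_net():
    # a function approximator. could just be lambda x: tf.matmul(x, w)
    return tf.keras.layers.Dense(DIM)

# a set of function approximators that predict the neighborhoods.
center_nets = [make_net() for _ in range(N)]
# a new net for each neighborhood, used to predict the gradients there.
# is there any intelligent sharing that can be done here?
grad_nets = [make_net() for _ in range(N)]

def nta(x): # neural taylor approximation
    # predict/lookup values of a depending on x.
    # could just be a nearest neighbor...
    # the neighborhood we are going to use to approximate f(x) around.
    # preferably at x but we may not have good estimates for the gradients at,
    # or close to, x. (this seems closely related to generalisation??!)
    neighborhoods = [net(x) for net in center_nets]

    # evaluate the gradients within this neighborhood
    # could use;
    # - the grads of the fn approximators
    # - local variables to average/collect gradients
    # - more fn approximators?! (the option used here)
    grads = [net(a) for net, a in zip(grad_nets, neighborhoods)]

    # take a linear combination of ...?!!?!
    # here: the taylor-style sum of (x - a_i)**i * g_i / i!
    terms = [((x - a) ** i) * g / math.factorial(i)
             for i, (a, g) in enumerate(zip(neighborhoods, grads))]
    return tf.add_n(terms)

x = tf.random.normal([4, DIM])  # a dummy batch of 4 inputs
nta(x)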

So the first order neural taylor expansion gives a piecewise linear approximation. This interpretation implies that we want to learn;
* the best (finite) set of neighborhoods that allow us to 'cover' the input distribution of data. (the first layer)
* how to combine predictions of f(x) from the different neighborhoods. (the second layer)

Cool. So 1 hidden layer neural networks roughly do a 1st order taylor approximation. What do 2 layer neural networks do? And what would a 2nd order neural taylor approximation look like?
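To make the piecewise linear picture concrete, here is a sketch with a fixed (not learned) set of neighborhoods for $y = \sin(x)$. The 'combination' is just a hard nearest neighbor lookup. The anchor grid, the `table` layout and `predict` are all assumptions for illustration.

```python
import math

# Hypothetical stored neighborhoods: anchor a, value y(a), gradient dy/dx(a).
anchors = [i * 0.5 for i in range(-6, 7)]        # 13 anchors covering [-3, 3]
table = [(a, math.sin(a), math.cos(a)) for a in anchors]

def predict(x):
    """First order Taylor prediction from the nearest stored anchor."""
    a, y_a, g_a = min(table, key=lambda row: abs(x - row[0]))
    return y_a + g_a * (x - a)

# Worst-case error over a dense grid. The nearest anchor is at most 0.25 away,
# so the first order remainder is bounded by 0.25**2 / 2 for sin.
xs = [i * 0.01 for i in range(-300, 301)]
max_err = max(abs(predict(x) - math.sin(x)) for x in xs)
```

A learned version would replace the fixed grid with predicted anchors and the hard `min` with some soft weighting over neighborhoods.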



## Problems.

the nth order derivative (e.g. the hessian, n = 2) is large. for a scalar function of $d$ inputs it has $d^n$ entries, so each higher order moment/derivative requires more space, scaling exponentially with the order.
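To put numbers on this, a small sketch counting entries for a scalar function of `d` inputs. The second count assumes the function is smooth enough that mixed partials commute, so only multisets of indices matter. Both helper names are made up.

```python
from math import comb

def full_entries(d, n):
    """Entries in the dense n-th order derivative tensor of a scalar function of d inputs."""
    return d ** n

def unique_entries(d, n):
    """Distinct entries when mixed partials commute: multisets of size n drawn from d indices."""
    return comb(d + n - 1, n)

# (order, dense entries, unique entries) for d = 100 inputs.
sizes = [(n, full_entries(100, n), unique_entries(100, n)) for n in (1, 2, 3)]
# sizes == [(1, 100, 100), (2, 10000, 5050), (3, 1000000, 171700)]
```

Symmetry alone only buys a constant factor (roughly $n!$), so it does not fix the growth with order.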

solution. find compact/sparse representations, factorise their representations.

could do;
* a rank decomposition. $x^TWy \approx \mathbf{1}^T(Ax \circ By)$, with $W \approx A^TB$
* local decomposition (related somehow to a conv) $x^TWy = M\hat\Sigma : M \in \mathbb R^{o\times(k\times d)}, \hat\Sigma \text{ is diagonal}$
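A sketch of the rank decomposition idea in plain Python: when $W = A^TB$ with $A, B \in \mathbb R^{k\times d}$, the bilinear form $x^TWy$ equals the sum of the entries of $Ax \circ By$, so $W$ never has to be formed. The sizes `d`, `k` and the random values are hypothetical.

```python
import random

random.seed(0)
d, k = 5, 2   # input size and rank (both hypothetical)

A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]   # k x d
B = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]   # k x d
x = [random.gauss(0, 1) for _ in range(d)]
y = [random.gauss(0, 1) for _ in range(d)]

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

# Factored form: sum of the elementwise product (Ax) o (By); stores 2*k*d numbers.
factored = sum(p * q for p, q in zip(matvec(A, x), matvec(B, y)))

# Dense form: build W = A^T B (d x d) and evaluate x^T W y; stores d*d numbers.
W = [[sum(A[r][i] * B[r][j] for r in range(k)) for j in range(d)] for i in range(d)]
dense = sum(xi * wy for xi, wy in zip(x, matvec(W, y)))
```

The two values agree up to floating point error. The saving only matters when $k \ll d$, and a general $W$ is only approximated by such a product.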

what about circular convolutions? they seem somehow relevant here?

## Correlation.

$$ 
\begin{align}
\operatorname{cov}(X, Y) &= \mathbb E \big[(X - \mathbb E [X]) (Y - \mathbb E [Y]) \big] \\
\operatorname{cov}(X, X) &= \mathbb E \big[(X - \mathbb E [X]) (X - \mathbb E [X]) \big] \\
&= \mathbb E[XX] - \mathbb E[X]\mathbb E [X] \\
&= \mathbb E[X^2] - \mathbb E[X]^2 \\
&\approx \frac{1}{n} \sum_i x_i^2 - \Big(\frac{1}{n} \sum_i x_i\Big)^2 \\
\end{align}
$$
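A quick numerical check of the identity on samples, using the population (divide by $n$) convention. The Gaussian parameters and sample size are arbitrary.

```python
import random
import statistics

random.seed(0)
xs = [random.gauss(2.0, 3.0) for _ in range(10000)]
n = len(xs)

mean = sum(xs) / n
centered = sum((x - mean) ** 2 for x in xs) / n      # E[(x - E[x])^2]
moments = sum(x * x for x in xs) / n - mean ** 2     # E[x^2] - E[x]^2
reference = statistics.pvariance(xs)                 # stdlib population variance
```

All three agree up to floating point error. The raw-moment form can lose precision when the mean is large relative to the spread, which is one reason to prefer the centered form in practice.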

#### nth order moments



#### relation to derivatives

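what relation is there to statistical moments of different orders? if we rearrange the second order expansion, writing $\Sigma_{xx}^{@a} = (x-a)(x-a)^T$ for the second moment of $x$ around $a$, we get (roughly)
$$
f(x)\approx f(a)+Df(a)(x-a)+\frac{1}{2}(x-a)^THf(a)(x-a)\\
f(x)\approx f(a)+Df(a)(x-a)+\frac{1}{2}\sum_{ij} Hf(a)_{ij}\,\big(\Sigma_{xx}^{@a}\big)_{ij}\\
$$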

so the hessian picks out a linear combination of the covariance (no, not covariance, correlation) values?!?
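A tiny check of that reading, taking $\Sigma_{xx}^{@a}$ to be the outer product $(x-a)(x-a)^T$ (an assumption about the notation). The quadratic form in the expansion is then the Hessian contracted elementwise with that matrix. The 2x2 Hessian and the points are made up.

```python
H = [[2.0, -1.0], [-1.0, 3.0]]     # hypothetical symmetric Hessian at a
a = [0.5, -1.0]
x = [1.5, 1.0]
delta = [xi - ai for xi, ai in zip(x, a)]   # x - a = [1.0, 2.0]

# Quadratic form (x - a)^T H (x - a).
quad = sum(delta[i] * H[i][j] * delta[j] for i in range(2) for j in range(2))

# Outer product S = (x - a)(x - a)^T, then the contraction sum_ij H_ij * S_ij.
S = [[delta[i] * delta[j] for j in range(2)] for i in range(2)]
contraction = sum(H[i][j] * S[i][j] for i in range(2) for j in range(2))
# quad == contraction == 10.0
```

Averaging `S` over many inputs $x$ would give an uncentered second moment around $a$, which is why "correlation" fits better than "covariance" here.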


## Resources.

* http://www.math.smith.edu/~rhaas/m114-00/chp4taylor.pdf
* http://mathinsight.org/taylor_polynomial_multivariable_examples
* 3blue1brown...