# Summary 

Recursive chain rule.
Keeping track of transformations on variables.

> _Automatic differentiation is a way to find the derivative of an expression without finding an expression for the derivative. [Dan Kalman](http://www1.american.edu/cas/mathstat/People/kalman/pdffiles/mmgautodiff.pdf)_


__What is the best way to think about rates of change in large systems?__

# What?

General setting. Formulation.



$f: \mathbb{R}^m \rightarrow \mathbb{R}^n$



# Chain rule


Want: $f(x+\epsilon x') = f(x) + \epsilon f'(x) x'$, aka the chain rule?

$$
z_0 = f(x), z_1 = g(x) \\
z_2 = h(z_0,z_1) \\
\frac{d z_2}{d x} = \frac{\partial z_2}{\partial z_0} \frac{d z_0}{d x} + \frac{\partial z_2}{\partial z_1} \frac{d z_1}{d x}\\
$$
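
A minimal numeric check of the formula above, as a sketch only. The concrete choices $f = \sin$, $g(x) = x^2$, $h(z_0, z_1) = z_0 z_1$ and the helper names `dz2_dx` and `central_difference` are assumptions made for illustration, not part of the notes.

```python
import math

# Hypothetical concrete choices for f, g, h (not from the notes).
def f(x):
    return math.sin(x)

def g(x):
    return x * x

def h(z0, z1):
    return z0 * z1

def dz2_dx(x):
    """Total derivative of z2 = h(f(x), g(x)) assembled term by term."""
    z0, z1 = f(x), g(x)
    dz0_dx = math.cos(x)   # f'(x)
    dz1_dx = 2.0 * x       # g'(x)
    dz2_dz0 = z1           # partial of z0*z1 with respect to z0
    dz2_dz1 = z0           # partial of z0*z1 with respect to z1
    return dz2_dz0 * dz0_dx + dz2_dz1 * dz1_dx

def central_difference(fn, x, step=1e-6):
    """Finite-difference estimate, used only as an independent check."""
    return (fn(x + step) - fn(x - step)) / (2.0 * step)

x = 0.7
analytic = dz2_dx(x)
numeric = central_difference(lambda t: h(f(t), g(t)), x)
```

The two numbers should agree to within finite-difference error, which is the kind of sanity check one would also run against an AD implementation.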

# Dual numbers

Look into this a bit? The ring of real numbers extended with $\epsilon$. Symmetries?
$(a + b\epsilon)(a + b\epsilon) = a^2 + 2ab\epsilon + b^2\epsilon^2$

Nilpotent $\epsilon \implies \epsilon^2 = 0$

https://en.wikipedia.org/wiki/Dual_number
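
A rough sketch of dual-number arithmetic doing forward-mode differentiation. The names `Dual`, `lift`, `sin` and `derivative` are hypothetical, only `+` and `*` are supported, and the only assumption baked in is $\epsilon^2 = 0$.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """a + b*eps with eps**2 == 0; `real` holds a value, `eps` its derivative."""
    real: float
    eps: float = 0.0

    @staticmethod
    def lift(value):
        # Treat plain numbers as constants (zero derivative part).
        return value if isinstance(value, Dual) else Dual(float(value))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)
    __rmul__ = __mul__

def sin(z):
    """sin lifted to dual numbers: the eps part picks up cos(a) * b."""
    z = Dual.lift(z)
    return Dual(math.sin(z.real), math.cos(z.real) * z.eps)

def derivative(fn, x):
    """Evaluate fn at x + 1*eps and read the derivative off the eps part."""
    return fn(Dual(x, 1.0)).eps
```

For example, `derivative(lambda x: x * sin(x), 1.0)` evaluates the product rule without ever building an expression for the derivative.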

## Lifting algebra

What does this mean?

### Linear (matrix) representation
$$
a + b\epsilon \rightarrow
\begin{bmatrix}
a & b \\
0 & a \\
\end{bmatrix}\\
\begin{bmatrix}
a & b \\
0 & a \\
\end{bmatrix}
\begin{bmatrix}
a & b \\
0 & a \\
\end{bmatrix} = 
\begin{bmatrix}
a^2 & 2ab \\
0 & a^2 \\
\end{bmatrix}
$$
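
A small check, in plain Python, that the matrix form multiplies the same way as the dual numbers themselves. The helper names `dual_to_matrix`, `matmul2` and `dual_mul` are made up for this sketch.

```python
def dual_to_matrix(a, b):
    """Represent a + b*eps as the 2x2 matrix [[a, b], [0, a]]."""
    return [[a, b], [0.0, a]]

def matmul2(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dual_mul(x, y):
    """(a + b eps)(c + d eps) = ac + (ad + bc) eps."""
    (a, b), (c, d) = x, y
    return (a * c, a * d + b * c)

x, y = (2.0, 3.0), (5.0, 7.0)
product_matrix = matmul2(dual_to_matrix(*x), dual_to_matrix(*y))
product_dual = dual_mul(x, y)

# eps itself is the matrix [[0, 1], [0, 0]], and squaring it gives zero.
eps_squared = matmul2(dual_to_matrix(0.0, 1.0), dual_to_matrix(0.0, 1.0))
```

Multiplying the matrices and then reading off the top row gives the same pair as multiplying the dual numbers directly.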

Compare with the matrix representation of complex numbers:

$$
a + \imath b \rightarrow
\begin{bmatrix}
a & -b \\
b & a \\
\end{bmatrix} \\
\begin{bmatrix}
a & -b \\
b & a \\
\end{bmatrix}
\begin{bmatrix}
a & -b \\
b & a \\
\end{bmatrix} = 
\begin{bmatrix}
a^2-b^2 & -2ab \\
2ab & a^2 - b^2 \\
\end{bmatrix}
$$

https://en.wikipedia.org/wiki/Complex_number#Matrix_representation_of_complex_numbers

#### Notes

* Doesn't work directly for partial derivatives: a single $\epsilon$ carries one directional derivative per evaluation.
* The cool thing is that this algebra encodes the idea of the chain rule. Its grammar (?) enforces the chain rule, as well as the rules for the usual binary operations.

## Geometry



# Taylor series





## Automatic integration
What about integration? Well, the Taylor series expansion of a sufficiently smooth (analytic) function about $a$ is $f(x) = f(a)+{\frac {f'(a)}{1!}}(x-a)+{\frac {f''(a)}{2!}}(x-a)^{2}+{\frac {f'''(a)}{3!}}(x-a)^{3}+\cdots$
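
One way to see the relation to dual numbers, as a sketch that assumes $f$ is analytic near $a$ and that $\epsilon^2 = 0$: substituting $x = a + b\epsilon$ into the expansion kills every term beyond first order.

$$
f(a + b\epsilon) = f(a) + \frac{f'(a)}{1!} b\epsilon + \frac{f''(a)}{2!} b^2\epsilon^2 + \cdots = f(a) + f'(a)\, b\, \epsilon
$$

So evaluating $f$ on a dual number returns the value and the first derivative together; higher-order terms would need an algebra that truncates later.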

# Computational complexity
Let's find some bounds on complexity, O(??), in both time and space.

Fundamentally, what information is required to calculate a derivative?
What is the algorithm?

## Automatic vs symbolic
__Why is automatic faster than symbolic?__

> Symbolic generation of derivatives can lead to exponential growth in the length of expressions [Dan Kalman]

Proof? Why??

At no point does AD need to rearrange equations, substitute, or otherwise manipulate expressions.
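
An illustration of the growth, not a proof. It assumes a deliberately naive symbolic differentiator (no simplification, no sharing of subexpressions) and a hypothetical test function built by squaring $x$ repeatedly. All names here are invented for the sketch.

```python
def symbolic_derivative(expr):
    """Naive derivative of a tree built from 'x', numbers and (op, left, right)."""
    if expr == "x":
        return 1.0
    if isinstance(expr, (int, float)):
        return 0.0
    op, left, right = expr
    if op == "add":
        return ("add", symbolic_derivative(left), symbolic_derivative(right))
    # Product rule, with no simplification or sharing of subexpressions.
    return ("add", ("mul", symbolic_derivative(left), right),
                   ("mul", left, symbolic_derivative(right)))

def size(expr):
    """Number of nodes in the expression tree."""
    if isinstance(expr, tuple):
        return 1 + size(expr[1]) + size(expr[2])
    return 1

def repeated_square_expr(k):
    """Expression for x^(2^k), obtained by substituting the expression into itself."""
    expr = "x"
    for _ in range(k):
        expr = ("mul", expr, expr)
    return expr

def repeated_square_forward(x, k):
    """Same function carried as a (value, derivative) pair: k steps, constant memory."""
    value, deriv = x, 1.0
    for _ in range(k):
        value, deriv = value * value, 2.0 * value * deriv
    return value, deriv

sizes = [size(symbolic_derivative(repeated_square_expr(k))) for k in range(1, 9)]
```

The entries of `sizes` more than double with each extra squaring, while `repeated_square_forward` does a fixed amount of work per step. A real computer algebra system can soften this with simplification and common-subexpression sharing, so treat the numbers as a worst-case picture.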


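## Forward mode: m = 1, n > 1

Derivatives are propagated from inputs to outputs, one input direction per pass, so this suits few inputs and many outputs.



## Reverse mode: m > 1, n = 1

Derivatives are propagated from outputs back to inputs, one output per pass, so this suits many inputs and few outputs.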

> spatial complexity is usually proportional to the time complexity of the original program [Griewank 1992](ftp://ftp.mcs.anl.gov/pub/tech_reports/reports/P228.pdf)

I.e. long time dependencies will not work! (Memory grows linearly with run time; at 1000 transformations/operations per second that means ...)
Need a way to solve this? Maybe this isn't such a big problem.
* checkpoint, TD learning?
* skip connections? (not really a solution... unless we forget a bunch of them)
* 
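
A minimal reverse-mode sketch that makes the memory cost visible. The `Var` class, `sin` and `backward` are hypothetical names, only `+`, `*` and `sin` are supported, and this is not how any particular library does it.

```python
import math

class Var:
    """A value plus the local partial derivatives to each of its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent Var, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Accumulate d output / d node into node.grad for every reachable node."""
    order, seen = [], set()

    def visit(node):             # depth-first topological sort
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)
backward(z)
```

Every intermediate `Var` stays alive until `backward` runs, so memory grows with the number of operations performed. That is the cost the quote describes, and what checkpointing trades against recomputation.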

## Forward or reverse??  m > 1, n > 1

> _For n > 1 and m > 1 there is a golden mean, but finding the optimal way is probably an NP-hard problem_ [lec notes](http://www.robots.ox.ac.uk/~tvg/publications/talks/autodiff.pdf) 
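
A sketch of why the order matters. It assumes the function is a simple chain of layers with dense Jacobians and counts scalar multiplications for the two extreme orderings only. `accumulation_cost` and `dims` are hypothetical names.

```python
def accumulation_cost(dims, order):
    """Scalar multiplications needed to multiply a chain of dense Jacobians.

    dims = [d0, d1, ..., dk] describes f = f_k o ... o f_1 with
    f_i: R^{d_{i-1}} -> R^{d_i}, so the Jacobian of f_i has shape (d_i, d_{i-1}).
    """
    d_in, d_out = dims[0], dims[-1]
    if order == "forward":
        # Start from J_1 and push towards the output; the running product
        # always has d_in columns.
        return sum(dims[i + 1] * dims[i] * d_in
                   for i in range(1, len(dims) - 1))
    if order == "reverse":
        # Start from J_k and pull towards the input; the running product
        # always has d_out rows.
        return sum(d_out * dims[i] * dims[i - 1]
                   for i in range(len(dims) - 2, 0, -1))
    raise ValueError(order)

many_inputs_one_output = [1000, 100, 100, 1]
one_input_many_outputs = [1, 100, 100, 1000]
```

With many inputs and one output the reverse ordering is about two orders of magnitude cheaper here, and the situation flips for one input and many outputs. Mixed orderings in between are where the search for an optimum gets hard.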

## Ideas

* Parameter tying. How can we take advantage of variables that have the same functions applied to them? (Well, the parameters should be tied in the original model. Should we be dealing with it here?)
* 

# Matrix-calculus AD

Reverse - trace?


# Backprop

Backprop is a special case of reverse-mode AD.


http://neuralnetworksanddeeplearning.com/chap2.html

# How?

### Source code transformation



### Operator-overloading


### New programming language...


### ??? Ideas?



### Issues

* The function needs to be sufficiently smooth so that higher-order derivatives are continuous.
* 


# Notes


> and where differentiating an approximation to $f$ produces much worse answers than explicitly approximating the (known) derivative of $f$. [alexey radul](http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/)

E.g. numerical integration



# Resources

### Automatic differentiation
* [Great intro](http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/)
* http://conal.net/blog/posts/what-is-automatic-differentiation-and-why-does-it-work
* http://conway.rutgers.edu/~ccshan/wiki/blog/posts/Differentiation/
* http://blog.sigfpe.com/2005/07/automatic-differentiation.html
* [Julia implementation](http://int8.io/automatic-differentiation-machine-learning-julia/#Reverse_mode_automatic_differentiation_8211_basic_bits)

### Backprop

* http://int8.io/backpropagation-from-scratch-in-julia-part-ii-derivation-and-implementation/
* http://colah.github.io/posts/2015-08-Backprop/

### Other
* https://en.wikipedia.org/wiki/Dual_number

# Questions and thoughts



* How is this related to functional programming and transforms on data?
* Relation to dynamic programming (colah mentioned this)
* Answer: Why is automatic better than symbolic?
* Investigate the geometry of dual numbers.
* Make an AD module in Haskell.
* Contribute to autograd.
* Figure out relation to Taylor expansion.