# Computational Graphs

Computational graphs (CGs) are a way of representing mathematical functions and how values propagate between mathematical expressions. Among other things, studying CGs leads to a deeper understanding of calculus, especially because they simplify taking derivatives by breaking complex expressions into simpler ones and chaining them together. When implemented in a programming language, CGs can differentiate functions automatically, either with numerically exact results or even symbolically.

CGs thus play an important role in mathematical function optimization, especially when computing derivatives analytically is infeasible. Supervised training of neural networks, for example, maps to optimizing (minimizing) a cost function with respect to all the weights in the network, which are chained across different layers. Optimization is usually performed using a gradient descent approach, which requires first order derivatives of the cost function with respect to all the parameters in the network. As we will see, CGs not only make this doable but also provide a computationally efficient algorithm named [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) to compute the partial derivatives.

## What this is about

This notebook gives an introduction to CGs and how to implement them in Python. By the end of the notebook we will have a working framework that
 - evaluates mathematical expressions represented as CGs,
 - performs [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to find numerically exact derivatives (up to floating point precision),
 - performs [symbolic differentiation](https://en.wikipedia.org/wiki/Symbolic_computation) to deduce higher order derivatives,
 - simplifies expressions to improve performance and readability.

## What it isn't about

To keep the code base readable, the framework developed here has some shortcomings. Foremost, it is not complete: you won't be able to plug in every possible function and expect it to return the correct result. This is mostly because derivatives are not provided for all elementary functions. However, the framework is structured in such a way that you will find it easy to add new building blocks to it and make it more feature complete.

Also, we'll mostly be dealing with so-called multivariate real-valued scalar functions $f:\mathbb {R} ^{n}\to \mathbb {R}$. In other words, we constrain ourselves to real values and functions that have multiple inputs but output only a single scalar. A glimpse of how to use the framework for vector-valued functions will be given in the examples section towards the end of this document.

## Introduction to computational graphs (CG)

Consider a function $$f(x,y) = (x + y) * x$$

We'll call $x$ and $y$ symbols, while $+$ and $*$ will be called operations, functions, or nodes. When we evaluate $f$, we first add up the values of $x$ and $y$ and then multiply the result by $x$. The output of the multiplication is what we call the value of $f(x,y)$. Now, a CG is a [directed graph](https://en.wikipedia.org/wiki/Directed_graph) that represents this procedure. A graphical representation of the CG for $f$ is shown below.

<img src="intro_0.png" width="400">

### Computing the value of $f$

Computing the value of $f$ in the CG is a matter of following the arrows. Assume we want to evaluate $f(2,3)$. First, we send 2 and 3 along the out-edges of $x$ and $y$ respectively.

<img src="intro_1.png" width="400">

Next, we compute the value of $+$. Note that $*$ cannot be computed yet, because one of its inputs, namely the output of $+$, is still missing.

<img src="intro_2.png" width="400">

Finally, we find the value of $f$ by evaluating $*$.

<img src="intro_3.png" width="400">

When evaluating $f$, one needs to process all predecessors of a node $n$ before evaluating $n$ itself. Such an ordering of the nodes of a CG is called a [topological order](https://en.wikipedia.org/wiki/Topological_sorting).
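The evaluation procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not the framework developed later in this notebook: the dictionary layout, the node names `add` and `mul`, and the helper `evaluate` are assumptions made for this example. The topological order comes from `graphlib.TopologicalSorter` in the standard library (Python 3.9+).

```python
import operator
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical representation: node name -> (operation, names of input nodes).
# Symbols have no operation and no inputs.
graph = {
    "x": (None, ()),
    "y": (None, ()),
    "add": (operator.add, ("x", "y")),
    "mul": (operator.mul, ("add", "x")),
}


def evaluate(graph, output, inputs):
    """Evaluate `output`, visiting every node after all of its predecessors."""
    values = dict(inputs)  # start with the values of the symbols
    predecessors = {name: ins for name, (_, ins) in graph.items()}
    for name in TopologicalSorter(predecessors).static_order():
        fn, ins = graph[name]
        if fn is not None:  # symbols already carry their value
            values[name] = fn(*(values[i] for i in ins))
    return values[output]


result = evaluate(graph, "mul", {"x": 2, "y": 3})
print(result)  # (2 + 3) * 2 = 10
```

Because `static_order()` yields a node only after all of its predecessors, the `mul` node is never visited before `add` has a value.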

### Computing the partial derivatives

We will now turn our attention to computing all partial derivatives. To do so, we will be a bit more abstract and use symbols for the outputs of all nodes, like so.

<img src="intro_4.png" width="400">

Consider any node in this CG in isolation, irrespective of where it is located in the graph. When computing the partial derivatives of an isolated node with respect to its inputs, each input is treated as a separate abstract symbol, so taking the derivative completely ignores the fact that each input might itself be the result of a complex operation. Taking this perspective of isolated nodes simplifies the computation of derivatives.

For example, take the multiplication node 

$$f(x,a)=a*x$$ 

in isolation. Computing the partial derivatives of $f(x,a)$ with respect to its inputs requires

$$\frac{\mathrm{d}f(x,a)}{\mathrm{d}a}, \frac{\mathrm{d}f(x,a)}{\mathrm{d}x}$$

to be found. Of course this will amount to

$$\frac{\mathrm{d}f(x,a)}{\mathrm{d}a}=x, \frac{\mathrm{d}f(x,a)}{\mathrm{d}x}=a$$


Similarly, for the addition node

$$a(x,y)=x + y$$

we get

$$\frac{\mathrm{d}a(x,y)}{\mathrm{d}x}=1, \frac{\mathrm{d}a(x,y)}{\mathrm{d}y}=1$$
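As a small illustration of treating nodes in isolation, the two local derivative rules above could be written as plain Python functions. The function names are hypothetical and chosen only for this sketch. Each function sees nothing but the values of its own inputs.

```python
def mul_local_partials(x, a):
    """Local partials of f(x, a) = a * x, returned as (df/da, df/dx)."""
    return x, a


def add_local_partials(x, y):
    """Local partials of a(x, y) = x + y, returned as (da/dx, da/dy)."""
    return 1.0, 1.0


# Evaluate the local partials at x = 2, y = 3, so that a = x + y = 5.
x, y = 2.0, 3.0
a = x + y
df_da, df_dx_local = mul_local_partials(x, a)
da_dx, da_dy = add_local_partials(x, y)
print(df_da, df_dx_local, da_dx, da_dy)
```

Note that `df_dx_local` is only the derivative of the isolated multiplication node with respect to its direct input $x$. It does not yet account for the fact that $a$ also depends on $x$.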

In order to display the partial derivatives in the CG diagrams, we will use backward oriented arrows between nodes to which we attach the corresponding partial derivatives as shown below.

<img src="intro_5.png" width="400">

To compute the partial derivative of the function $f(x,y)=(x+y)*x$, or $f(x,a)$ in the CG diagram, with respect to any node $n$, one needs to
 - find all the backward paths starting from $f(x,a)$ and reaching $n$,
 - for each path, form the product of the partial derivatives along its chain of backward arrows,
 - sum these path products.
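The three steps above can be sketched as a short recursive function. This is a naive illustration under stated assumptions, not the algorithm the notebook ultimately implements: the `backward` dictionary and the helper `path_sum` are hypothetical, and the local partial derivatives are hard-coded for the point $x=2$, $y=3$.

```python
# Point of evaluation (chosen for illustration only).
x, y = 2.0, 3.0
a = x + y

# Backward arrows: node -> list of (input node, local partial derivative).
backward = {
    "f": [("a", x), ("x", a)],      # df/da = x, df/dx = a
    "a": [("x", 1.0), ("y", 1.0)],  # da/dx = 1, da/dy = 1
    "x": [],
    "y": [],
}


def path_sum(backward, start, target):
    """Sum, over all backward paths from start to target, of the product
    of the local partial derivatives along each path."""
    if start == target:
        return 1.0  # derivative of a node with respect to itself
    return sum(
        weight * path_sum(backward, node, target)
        for node, weight in backward[start]
    )


df_dy = path_sum(backward, "f", "y")  # one path: f -> a -> y
df_dx = path_sum(backward, "f", "x")  # two paths: f -> a -> x and f -> x
print(df_dy, df_dx)
```

The recursion walks every path separately, so shared sub-paths are recomputed each time they occur.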
 
For example, the partial derivative $\frac{\mathrm{d}f(x,a)}{\mathrm{d}y}$ amounts to

$$\frac{\mathrm{d}f(x,a)}{\mathrm{d}y} = 
\frac{\mathrm{d}f(x,a)}{\mathrm{d}f(x,a)}
\frac{\mathrm{d}f(x,a)}{\mathrm{d}a(x,y)}
\frac{\mathrm{d}a(x,y)}{\mathrm{d}y}
$$

Now since $f(x,a)$ is equivalent to $f(x,y)$ by construction, we can deduce

$$\frac{\mathrm{d}f(x,y)}{\mathrm{d}y} = 
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}y} = 
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}f(x,a)}
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}a(x,y)}
    \frac{\mathrm{d}a(x,y)}{\mathrm{d}y}$$

By substitution (see the CG diagram above) we find

$$\frac{\mathrm{d}f(x,y)}{\mathrm{d}y} = 1*x*1 = x$$

Finding $\frac{\mathrm{d}f(x,y)}{\mathrm{d}x}$ is very similar, except that this time there are two paths along which $x$ can be reached from $f(x,a)$.

$$\frac{\mathrm{d}f(x,y)}{\mathrm{d}x} = 
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}x} = 
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}f(x,a)}
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}a(x,y)}
    \frac{\mathrm{d}a(x,y)}{\mathrm{d}x} +
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}f(x,a)}
    \frac{\mathrm{d}f(x,a)}{\mathrm{d}x}   
$$

By substitution we get

$$\frac{\mathrm{d}f(x,y)}{\mathrm{d}x} = 1*x*1 + 1*a(x,y) = x + (x+y) = 2x+y$$
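As a sanity check, the two results $\frac{\mathrm{d}f}{\mathrm{d}y}=x$ and $\frac{\mathrm{d}f}{\mathrm{d}x}=2x+y$ can be compared against a numerical approximation. The sketch below uses central finite differences, a different technique from the one described in this notebook, purely to verify the hand derivation. The helper name `central_difference` and the step size `h` are assumptions for this example.

```python
def f(x, y):
    return (x + y) * x


def central_difference(func, x, y, wrt, h=1e-6):
    """Approximate a partial derivative of func at (x, y) by central differences."""
    if wrt == "x":
        return (func(x + h, y) - func(x - h, y)) / (2 * h)
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x, y = 2.0, 3.0
numeric_df_dx = central_difference(f, x, y, "x")
numeric_df_dy = central_difference(f, x, y, "y")

# Compare with the analytic results derived above.
print(numeric_df_dx, 2 * x + y)  # both approximately 7
print(numeric_df_dy, x)          # both approximately 2
```

The numerical values only agree up to rounding error, whereas the derivation above is exact.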


### Backpropagation

While you can use the above recipe to compute partial derivatives for any two nodes connected by a path in the CG, it should be noted that it is not yet computationally efficient. Notice that the terms 

$$\frac{\mathrm{d}f(x,a)}{\mathrm{d}f(x,a)}\frac{\mathrm{d}f(x,a)}{\mathrm{d}a(x,y)}$$

appear in computations of both $\frac{\mathrm{d}f(x,y)}{\mathrm{d}x}$ and $\frac{\mathrm{d}f(x,y)}{\mathrm{d}y}$. The idea of the backpropagation algorithm for computing derivatives
