# Computational Graphs

Computational graphs (CGs) are a way of representing mathematical functions and how values propagate between mathematical expressions. Among other things, studying CGs leads to a deeper understanding of calculus, in particular because derivatives become easier to take once a complex expression is broken into simpler ones that are chained together. When implemented in a programming language, CGs can differentiate functions automatically, either numerically exact or even symbolically.

CGs thus play an important role in mathematical function optimization, especially when computing derivatives analytically is infeasible. Supervised training of neural networks, for example, maps to optimizing (minimizing) a cost function with respect to all the weights of the network, which are chained across different layers. The optimization is usually performed using a gradient descent approach, which requires first order derivatives of the cost function with respect to all the parameters in the network. As we will see, CGs not only make this doable but also provide a computationally efficient algorithm named [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) to compute the partial derivatives.

## What this is about

This notebook gives an introduction to CGs and how to implement them in Python. By the end of the notebook we will have a working framework that
 - evaluates mathematical expressions represented as CGs,
 - performs [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to find numerically exact derivatives (up to floating point precision),
 - performs [symbolic differentiation](https://en.wikipedia.org/wiki/Symbolic_computation) to deduce higher order derivatives,
 - simplifies expressions to improve performance and readability.

## What it isn't about

To keep the code base readable, the developed framework has some shortcomings. Foremost, it is not complete. That means you won't be able to plug in every possible function and expect it to return the correct result. This is mostly a matter of not providing derivatives for all elementary functions. However, the framework is structured in such a way that you will find it easy to add new building blocks to it and make it more feature complete.

Also, we'll mostly be dealing with so-called multivariate real-valued scalar functions $f:\mathbb {R} ^{n}\to \mathbb {R}$. In other words, we restrict ourselves to real values and to functions that have multiple inputs but output only a single scalar. A glimpse of how to use the framework for vector-valued functions will be given in the examples section towards the end of this document.

## Introduction to computational graphs (CG)

Consider a function $$f(x,y) = (x + y) * x$$

We'll call $x$ and $y$ symbols, while $+$ and $*$ will be called operations (or functions, or nodes). When we evaluate $f$, we first add up the values of $x$ and $y$ and then multiply the result by $x$. The output of the multiplication is what we call the value of $f(x,y)$. Now, a CG is a [directed graph](https://en.wikipedia.org/wiki/Directed_graph) that represents this procedure. A graphical representation of the CG for $f$ is shown below.

<img src="intro_0.png" width="400">
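The graph above can be written down as a small data structure. The following is a minimal sketch, not the framework developed in this notebook: the class names `Symbol`, `Add`, `Mul` and the helper `describe` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


# Hypothetical node types, used only to illustrate the graph structure.
@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


# f(x, y) = (x + y) * x. The same Symbol object x is referenced twice,
# which is what makes this a graph rather than a tree.
x, y = Symbol("x"), Symbol("y")
f = Mul(Add(x, y), x)


def describe(node):
    """Render a node and everything it depends on as an infix string."""
    if isinstance(node, Symbol):
        return node.name
    op = "+" if isinstance(node, Add) else "*"
    return f"({describe(node.left)} {op} {describe(node.right)})"


print(describe(f))  # ((x + y) * x)
```

Each operation node stores references to its inputs, so the directed edges of the graph are implicit in the `left` and `right` fields.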

### Computing the value of $f$

Computing the value of $f$ in the CG is a matter of following the arrows. Assume we want to evaluate $f(2,3)$. First, we send 2 and 3 along the out-edges of $x$ and $y$ respectively.

<img src="intro_1.png" width="400">

Next, we compute the value of $+$. Note that $*$ cannot be computed yet, because one of its inputs, namely the output of $+$, is still missing.

<img src="intro_2.png" width="400">

Finally, we find the value of $f$ by evaluating $*$.

<img src="intro_3.png" width="400">

When evaluating $f$, one needs to process all predecessors of a node $n$ before evaluating $n$ itself. Such an ordering of the nodes of a CG is called a [topological order](https://en.wikipedia.org/wiki/Topological_sorting).
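The forward evaluation described above can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions, not the notebook's framework: the graph is a plain dictionary, and the names `graph`, `topological_order` and `evaluate` are hypothetical.

```python
import operator

# Hypothetical representation: node name -> (operation, input node names).
# Symbols have no operation and no inputs.
graph = {
    "x": (None, []),
    "y": (None, []),
    "+": (operator.add, ["x", "y"]),
    "*": (operator.mul, ["+", "x"]),
}


def topological_order(graph, output):
    """Depth-first search that lists every node after all of its inputs."""
    order, visited = [], set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for parent in graph[name][1]:
            visit(parent)
        order.append(name)

    visit(output)
    return order


def evaluate(graph, output, inputs):
    """Evaluate `output` by processing the nodes in topological order."""
    values = dict(inputs)
    for name in topological_order(graph, output):
        op, parents = graph[name]
        if op is not None:
            values[name] = op(*(values[p] for p in parents))
    return values[output]


print(topological_order(graph, "*"))             # ['x', 'y', '+', '*']
print(evaluate(graph, "*", {"x": 2, "y": 3}))    # 10
```

Because `+` appears before `*` in the computed order, its value is available by the time the multiplication is evaluated.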

### Computing the partial derivatives

Now that we know how to evaluate CGs through forward propagation of values, we'll turn our attention to computing all partial derivatives. In order to do so, we will be a bit more abstract and use symbols for the outputs of all nodes, as in the diagram below.

The introduction of $\hat{x}$ and $\hat{y}$ demands a few words. It can be considered a minor technicality. When you see $\hat{x}$, think $x$, and likewise for $\hat{y}$. $\hat{x}$ can be seen as a function of $x$ that yields $x$: $\hat{x}(x) = x$.

<img src="intro_4.png" width="400">

Consider any node in a CG in isolation, irrespective of where it is located in the graph. When we compute the partial derivatives of the node's output with respect to its inputs, we don't care whether the inputs are results of earlier operations. From this perspective they are plain variables, just like $x$ and $y$ were in the beginning.

Take the multiplication node in the above diagram. Its inputs are $\hat{x}$ and $a$, its output is $f$. Equivalently, we could say $f$ is a function of $\hat{x}$ and $a$, in particular $f(\hat{x},a)=\hat{x}a$. Computing the partial derivatives amounts to finding $\dfrac{\mathrm{d}f}{\mathrm{d}\hat{x}}$ and $\dfrac{\mathrm{d}f}{\mathrm{d}a}$. Since $f$ in isolation is just a plain multiplication, it won't take long to convince you that $\dfrac{\mathrm{d}f}{\mathrm{d}\hat{x}} = a$ and $\dfrac{\mathrm{d}f}{\mathrm{d}a} = \hat{x}$. Similarly, for the addition operation we have $a(\hat{x},\hat{y})=\hat{x}+\hat{y}$, and the partial derivatives are given by $\dfrac{\mathrm{d}a}{\mathrm{d}\hat{x}} = 1$ and $\dfrac{\mathrm{d}a}{\mathrm{d}\hat{y}} = 1$.
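The local derivatives of the two nodes can be checked numerically. The sketch below is only an illustration: the function names `mul_partials`, `add_partials` and `central_difference` are hypothetical, and the finite-difference check is an assumption added here for verification, not part of the approach described in the text.

```python
def mul_partials(x_hat, a):
    """Local derivatives of f(x_hat, a) = x_hat * a: (df/dx_hat, df/da)."""
    return a, x_hat


def add_partials(x_hat, y_hat):
    """Local derivatives of a(x_hat, y_hat) = x_hat + y_hat: (da/dx_hat, da/dy_hat)."""
    return 1.0, 1.0


def central_difference(fn, args, index, eps=1e-6):
    """Approximate the partial derivative of fn with respect to args[index]."""
    hi, lo = list(args), list(args)
    hi[index] += eps
    lo[index] -= eps
    return (fn(*hi) - fn(*lo)) / (2 * eps)


x_hat, y_hat = 2.0, 3.0
a = x_hat + y_hat  # output of the addition node

print(mul_partials(x_hat, a))      # (5.0, 2.0)
print(add_partials(x_hat, y_hat))  # (1.0, 1.0)

# Numerical cross-check of the multiplication node, treated in isolation.
print(central_difference(lambda p, q: p * q, (x_hat, a), 0))
print(central_difference(lambda p, q: p * q, (x_hat, a), 1))
```

Note that each node only needs the values of its own inputs to compute its local derivatives, which is what makes the per-node view useful.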

In order to display the partial derivatives in the diagrams, we will use backward oriented arrows between nodes, to which we attach the corresponding partial derivatives. The diagram below shows the graph for $f(x,y) = (x + y) * x$ plus the partial derivatives in isolation. Another technicality is the introduction of $\dfrac{\mathrm{d}f}{\mathrm{d}f} = 1$, whose sole purpose is completeness of the backward edges.

<img src="intro_5.png" width="400">

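Because $\hat{x}$ feeds both the multiplication and the addition, the chain rule sums the contributions of both backward paths from $f$ to $x$:

$\dfrac{\mathrm{d}f}{\mathrm{d}x} = \dfrac{\mathrm{d}f}{\mathrm{d}f} \dfrac{\mathrm{d}f}{\mathrm{d}\hat{x}} \dfrac{\mathrm{d}\hat{x}}{\mathrm{d}x} + \dfrac{\mathrm{d}f}{\mathrm{d}f} \dfrac{\mathrm{d}f}{\mathrm{d}a} \dfrac{\mathrm{d}a}{\mathrm{d}\hat{x}} \dfrac{\mathrm{d}\hat{x}}{\mathrm{d}x}$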
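Multiplying the local derivatives along the backward edges, and summing over every backward path that reaches a symbol, can be sketched as follows. This is a simplified illustration rather than the notebook's implementation: the dictionary layout and the name `backpropagate` are hypothetical, the topological order is written out by hand, and the identity nodes $\hat{x}$ and $\hat{y}$ are folded into $x$ and $y$.

```python
# Hypothetical layout: node name -> (forward function, local derivatives, inputs).
# Symbols map to None. The hat nodes are omitted since they are identities.
graph = {
    "x": None,
    "y": None,
    "a": (lambda p, q: p + q, lambda p, q: (1.0, 1.0), ["x", "y"]),
    "f": (lambda p, q: p * q, lambda p, q: (q, p), ["x", "a"]),
}
order = ["x", "y", "a", "f"]  # a topological order, assumed given here


def backpropagate(graph, order, inputs):
    """Forward pass for values, then backward pass accumulating derivatives."""
    values = dict(inputs)
    for name in order:
        if graph[name] is not None:
            forward, _, parents = graph[name]
            values[name] = forward(*(values[p] for p in parents))

    grads = {name: 0.0 for name in order}
    grads[order[-1]] = 1.0  # df/df = 1 seeds the backward pass
    for name in reversed(order):
        if graph[name] is None:
            continue
        _, local, parents = graph[name]
        partials = local(*(values[p] for p in parents))
        for parent, partial in zip(parents, partials):
            # Contributions from different backward paths add up.
            grads[parent] += grads[name] * partial
    return values, grads


values, grads = backpropagate(graph, order, {"x": 2.0, "y": 3.0})
print(values["f"])             # 10.0
print(grads["x"], grads["y"])  # 7.0 2.0
```

For $f(x,y) = (x + y) * x$ the analytic derivatives are $2x + y$ and $x$, which at $(2, 3)$ give 7 and 2, matching the accumulated values.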


