# Computational Graphs - Symbolic Computation
*Christoph Heindl 2017, https://github.com/cheind/py-cgraph/*

This is part two in a series about computational graphs and their applications. The first part covered theoretical foundations of computational graphs and associated algorithms to perform forward function evaluation and backward derivative computations.

This part will focus on developing Python code that allows numeric and symbolic differentiation of arbitrary (real valued) functions.


## CGraph 

CGraph is the name of the Python library to be developed during the remainder of this notebook. While the code inside the notebook is functional, a separate, self-contained and enhanced implementation of CGraph is available in [cgraph.py](../cgraph.py).

CGraph performs numeric and symbolic differentiation using backpropagation.

```Python
import cgraph as cg

x = cg.Symbol('x')
y = cg.Symbol('y')
z = cg.Symbol('z')

f = (x * y + 3) / (z - 2)

# Evaluate function
cg.value(f, {x:2, y:3, z:3}) # 9.0

# Partial derivatives (numerically)
d = cg.numeric_gradient(f, {x:2, y:3, z:3})
d[x] # df/dx 3.0
d[z] # df/dz -9.0

# Partial derivatives (symbolically)
d = cg.symbolic_gradient(f)
cg.simplify(d[x]) # (y*(1/(z - 2)))
cg.value(d[x], {x:2, y:3, z:3}) # 3.0

# Higher order derivatives
ddx = cg.symbolic_gradient(d[x])
cg.simplify(ddx[y]) # ddf/dxdy
# (1/(z - 2))
```

Python 3.5 will be used for development. The reader is assumed to be familiar with its concepts, including generators and decorators. In addition, a technique called monkey patching will be used to iteratively refine classes that have already been introduced.

### Expression trees

Before diving into code, we need to cover the concept of [expression trees](https://en.wikipedia.org/wiki/Binary_expression_tree). Expression trees will be used to represent function decompositions in the Python code to be developed. While they are not a fundamentally new concept, they deserve a few words at this point.

An expression tree is similar to the computational graphs introduced earlier, but the arrows by default point backwards. It turns out that constructing function expressions in a tree-like manner (the top node is the function itself, the function parameters are leaves) simplifies development dramatically.

Take the computational graph (CG) of the toy example $f(x,y)=(x+y)x$:

<img src="intro_0.png" width="400">

The following expression tree represents the same function:

<img src="exp_tree.png" width="400">

Notice that we now have a tree-like structure. Our root node is the final operation to be executed to obtain the result of $f(x,y)$. Also notice that $x$ shows up twice. Finding the value of an expression tree requires computing the values of nodes in lower layers first and bubbling information up towards the root node. Backpropagation, on the other hand, is just a matter of following the tree's edges from the root downwards. Again, when computing derivatives, a summation over all paths from the top that lead to a given node will be performed.
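
As a sketch of that path summation for the toy example, write $a = x + y$ so that $f = a \cdot x$ (the name $a$ is introduced here only for illustration). Two paths lead from the root to $x$: the direct edge, and the edge through $a$. Summing the products of the local derivatives along each path gives

$$
\frac{\mathrm{d}f}{\mathrm{d}x} = \underbrace{a}_{\text{root} \to x} + \underbrace{x \cdot 1}_{\text{root} \to a \to x} = (x + y) + x = 2x + y
$$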

### Representing expression trees
First we need to come up with a way to represent the expression trees introduced above in Python code. Naturally, we will have a base class `Node` that manages child references, and derived classes that actually implement operations, symbols and constants.

In [1]:
class Node:

    def __init__(self, nary=0):
        self.children = [None]*nary

    def __repr__(self):
        return self.__str__()

For now, `Node` just tracks references to its children. Note that operations can be binary (e.g. addition), unary (e.g. cosine) or have no children at all (e.g. symbols). We can also think of n-ary functions such as summation. Next, we'll define the leaf nodes `Symbol` and `Constant`.

In [2]:
class Symbol(Node):

    def __init__(self, name):
        super(Symbol, self).__init__(nary=0)
        self.name = name

    def __str__(self):
        return self.name

    def __hash__(self):
        return hash(self.name)            
    
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.name == other.name      
        else:
            return False

Symbols are identified by their name, like $x$, which is why `__hash__` and `__eq__` are based on the name. They don't have any children. When a symbol is printed, its name is shown.
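
To illustrate what name-based identity buys us, here is a minimal sketch. It repeats trimmed-down versions of `Node` and `Symbol` so that it runs on its own; the variable names `a`, `b` and `lookup` are made up for this example. Two distinct `Symbol` objects with the same name compare equal and can be used interchangeably as dictionary keys, which is what later allows passing argument dictionaries such as `{x:2, y:3}`.

```python
class Node:
    def __init__(self, nary=0):
        self.children = [None] * nary


class Symbol(Node):
    def __init__(self, name):
        super().__init__(nary=0)
        self.name = name

    def __hash__(self):
        # Hash by name so that equal symbols land in the same dict bucket
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.name == other.name


a = Symbol('x')
b = Symbol('x')          # a second object carrying the same name

lookup = {a: 2}
same_object = a is b     # two different objects ...
equal_by_name = a == b   # ... that compare equal by name
found = lookup[b]        # so `b` finds the entry stored under `a`
```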

In [3]:
class Constant(Node):

    def __init__(self, value):
        super(Constant, self).__init__(nary=0)
        self.value = value

    def __str__(self):
        return str(self.value)

Constants are 'immutable' values. Next we start to add operations. For this notebook we will provide addition and multiplication. cgraph.py defines more operations, and once you know how to implement them it will be easy for you to add new ones.

In [4]:
class Add(Node):

    def __init__(self):
        super(Add, self).__init__(nary=2)

    def __str__(self):
        return '({}+{})'.format(str(self.children[0]), str(self.children[1]))
    
class Mul(Node):

    def __init__(self):
        super(Mul, self).__init__(nary=2)

    def __str__(self):
        return '({}*{})'.format(str(self.children[0]), str(self.children[1]))

`Add` and `Mul` don't do much yet, except stating that they are binary functions and providing some pretty printing (recursively calling `__str__` on their children). Next, we'll write a helper function that builds our toy function $f(x,y)=(x+y)x$. This looks a bit clumsy right now, but we'll improve the syntax as we go.

In [5]:
def gen_f(x, y): 
    a = Add()
    a.children[0] = x
    a.children[1] = y
    
    m = Mul()
    m.children[0] = a
    m.children[1] = x
    
    return m

x = Symbol('x')
y = Symbol('y')
f = gen_f(x, y)
f

((x+y)*x)

### Computing function values

Next we'll turn our attention towards computing values of functions represented as expression trees. As mentioned earlier, to compute the value we need to bubble up information from layers further down in the hierarchy to the root. Traversing expression trees can be performed in multiple ways. What we are looking for is a [depth-first search](https://en.wikipedia.org/wiki/Depth-first_search) in [post-order](https://en.wikipedia.org/wiki/Tree_traversal). There are many ways to implement the traversal; I've chosen the recursive generator approach because it is short.

In [6]:
def postorder(node):
    for c in node.children:
        yield from postorder(c)
    yield node
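
The recursive generator above is compact, but it is not the only option. As a sketch of one alternative, the same children-first order can be produced with an explicit stack. The class `TreeNode` and the function `postorder_iterative` are hypothetical names used only here; they are not part of the notebook or of cgraph.py.

```python
class TreeNode:
    """Hypothetical stand-in for Node: a label plus a list of children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def postorder_iterative(root):
    """Yield nodes children-first using an explicit stack instead of recursion."""
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All children have been yielded already, now yield the parent
            yield node
        else:
            stack.append((node, True))
            # Push children in reverse so the leftmost child is popped first
            for child in reversed(node.children):
                stack.append((child, False))


# Build the toy function f(x, y) = (x + y) * x by hand
x = TreeNode('x')
y = TreeNode('y')
add = TreeNode('(x+y)', [x, y])
mul = TreeNode('((x+y)*x)', [add, x])

order = [n.label for n in postorder_iterative(mul)]
```

The iterative form avoids Python's recursion limit on very deep trees, at the price of a few more lines.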

In [7]:
[n for n in postorder(f)]

[x, y, (x+y), x, ((x+y)*x)]

As you can see, children are evaluated before their parents, which is exactly what's needed for computing function values. Next, we define the function that computes the forward pass, i.e. the value of the function.

In [8]:
def values(f, fargs):
    """Returns a dictionary of computed values for each node in the expression tree including `f`."""
    v = {}
    v.update(fargs)
    for n in postorder(f):
        if n not in v:
            v[n] = n.compute_value(v)
    return v

This function calls `compute_value(values)` on each node and expects the node to return its value. Since we haven't defined this method for our nodes yet, it's time to do so. Also note that `fargs` is assumed to contain the values for the symbols in the expression tree.

In [9]:
# Monkey patching for compute_value
Symbol.compute_value = lambda self, values: values[self]
Constant.compute_value = lambda self, values : self.value
Add.compute_value = lambda self, values: values[self.children[0]] + values[self.children[1]]
Mul.compute_value = lambda self, values: values[self.children[0]] * values[self.children[1]]

After monkey patching in `compute_value` for all nodes we can evaluate `f` by

In [10]:
values(f, {x:2, y:3})[f]

10

Since `values` computes all the values, even for intermediate nodes, we need to add `[f]` as a postfix. Accessing just the value of `f` will, however, be such a common task that we provide a shortcut for it named `value`.

In [11]:
def value(f, fargs):
    return values(f, fargs)[f]

value(f, {x:2, y:3})

10

### Syntactic sugar

Before continuing, it makes sense to use Python's special methods for 'overloading' the `+` and `*` operators for nodes. First, we'll define a decorator that wraps plain numbers as `Constant` objects.

In [12]:
from numbers import Number

def wrap_args(func):
    """Wraps function arguments that are numbers as Constant objects."""
    def wrapped(*args, **kwargs):
        new_args = []
        for a in args:
            if isinstance(a, Number):
                a = Constant(a)
            new_args.append(a)
        return func(*new_args, **kwargs)
    return wrapped
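
To see the decorator in isolation, the following sketch applies it to a made-up function `kinds` that merely reports the types of its arguments. `Constant` is reduced to a bare value holder here, and the use of `functools.wraps` is an addition of this sketch (it only preserves the wrapped function's name and docstring).

```python
from functools import wraps
from numbers import Number


class Constant:
    """Reduced stand-in for the notebook's Constant node."""

    def __init__(self, value):
        self.value = value


def wrap_args(func):
    """Wraps positional arguments that are numbers as Constant objects."""
    @wraps(func)
    def wrapped(*args, **kwargs):
        new_args = [Constant(a) if isinstance(a, Number) else a for a in args]
        return func(*new_args, **kwargs)
    return wrapped


@wrap_args
def kinds(a, b):
    # Hypothetical helper: report which types actually arrive in the function
    return type(a).__name__, type(b).__name__


result = kinds(3, 'x')
```

Only positional arguments are wrapped; keyword arguments are passed through untouched, just as in the notebook's version.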

Next, we'll define some free functions that perform the 'lengthy' addition and multiplication. By convention these free functions start with the prefix `sym_` (for symbolic). When adding new functionality you should always provide such a function (e.g. `sym_pow`, `sym_cos`).

In [13]:
@wrap_args
def sym_add(x, y):
    n = Add()
    n.children[0] = x
    n.children[1] = y
    return n

@wrap_args
def sym_mul(x, y):
    n = Mul()
    n.children[0] = x
    n.children[1] = y
    return n

Finally, we monkey patch `Node` to support the `+` and `*` operations.

In [14]:
Node.__add__ = lambda self, other: sym_add(self, other)
Node.__radd__ = lambda self, other: sym_add(other, self)
Node.__mul__ = lambda self, other: sym_mul(self, other)
Node.__rmul__ = lambda self, other: sym_mul(other, self)

Note that the reflected `__r*__` methods are also defined, so that expressions of the type `n*3` and `3*n` work equally well. With that we can replace the `gen_f` helper introduced earlier by simply writing

In [15]:
f = (x + y)*x
f

((x+y)*x)
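
Why the reflected methods are needed can be seen in a small standalone sketch. The class `Tracer` is hypothetical and only records which special method Python ended up calling: for `3 * t`, the integer's own multiplication does not know the right-hand type, so Python falls back to `t.__rmul__(3)`.

```python
class Tracer:
    """Hypothetical class that reports which operator method was invoked."""

    def __mul__(self, other):
        return ('__mul__', other)

    def __rmul__(self, other):
        return ('__rmul__', other)


t = Tracer()
left = t * 3    # handled by Tracer.__mul__
right = 3 * t   # int cannot handle Tracer, so Tracer.__rmul__ is used
```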

### Computing numeric derivatives

Next we will turn our attention to backpropagation for computing numerical derivatives. For this we first need another traversal, one that visits all nodes on the same level before moving on to the next level. Such a traversal is called [breadth-first search](https://en.wikipedia.org/wiki/Breadth-first_search), and it can also be implemented in numerous ways.

The way it is implemented here is based on a generator that uses a queue. However, when performing backpropagation, we'd like to communicate a value back to the generator for each child of the current node. We then expect those values to be handed back to us when we visit the corresponding child. Doing so turns the generator into a [coroutine](https://en.wikipedia.org/wiki/Coroutine).

In [16]:
def bfs(node, node_data):
    q = [(node, node_data)]
    while q:
        t = q.pop(0)
        node_data = yield t
        for idx, c in enumerate(t[0].children):
            q.append((c, node_data[idx]))
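
To make the two-way communication concrete, here is a small standalone sketch that drives `bfs` by hand. The class `N` and the payload scheme (ten times the parent's value plus the child index) are made up for this illustration; `bfs` itself is repeated unchanged so the sketch runs on its own.

```python
class N:
    """Hypothetical minimal node: a name plus a list of children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def bfs(node, node_data):
    q = [(node, node_data)]
    while q:
        t = q.pop(0)
        node_data = yield t
        for idx, c in enumerate(t[0].children):
            q.append((c, node_data[idx]))


leaf_a = N('a')
leaf_b = N('b')
root = N('root', [leaf_a, leaf_b])

visited = []
gen = bfs(root, 1)
try:
    n, data = next(gen)
    while True:
        visited.append((n.name, data))
        # Send one value per child back; each child receives it when visited
        n, data = gen.send([data * 10 + i for i in range(len(n.children))])
except StopIteration:
    pass
```

Each child is yielded together with the value that was sent for it while its parent was being processed. For large trees, a `collections.deque` with `popleft()` would avoid the linear cost of `q.pop(0)`; the list is kept here to match the notebook.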

Next, `numeric_gradient` is presented. It takes an expression tree and the function arguments for the contained symbols, and computes the numeric partial derivatives of the root node with respect to every node in the tree.

In [17]:
from collections import defaultdict

def numeric_gradient(f, fargs):
    vals = values(f, fargs)
    derivatives = defaultdict(int) # by default 0 is the derivative for unknown nodes.

    gen = bfs(f, 1)
    try:
        n, in_grad = next(gen)
        while True:
            derivatives[n] += in_grad
            local_grad = n.compute_gradient(vals)
            n, in_grad = gen.send([l*in_grad for l in local_grad])
    except StopIteration:
        return derivatives

First, `numeric_gradient` performs a forward pass to compute all function values. Next, the breadth-first search is kicked off with `f` and a value of 1. Then, for each node visited, we accumulate the incoming partial derivatives sent along the edges. Next, the local isolated gradient is computed. We communicate back the local gradient times the incoming partial derivative, as explained in the backpropagation introduction. Finally, a dictionary of partial derivatives for each node is returned.

Also note that we need to provide implementations of `compute_gradient(values)` for each node. `compute_gradient` is expected to take a values dictionary and return a list containing the isolated partial derivative for each child. As always, let's monkey patch.

In [18]:
# Monkey patch for compute_gradient
Symbol.compute_gradient = lambda self, values: [] # Leaf node: nothing to do
Constant.compute_gradient = lambda self, values: [] # Leaf node: nothing to do

Add.compute_gradient = lambda self, values: [1, 1] # d(x+y)/dx = 1, d(x+y)/dy = 1
Mul.compute_gradient = lambda self, values: [values[self.children[1]], values[self.children[0]]] # d(x*y)/dx = y, d(x*y)/dy = x

The isolated gradients for `Add` and `Mul` should look familiar to you. If not, go back to the introduction on computational graphs. With that in place we can now compute numeric derivatives like so

In [19]:
numeric_gradient(f, {x:2, y:3})

defaultdict(int, {x: 7, y: 2, (x+y): 2, ((x+y)*x): 1})

As you can see, $\frac{\mathrm{d}f(x,y)}{\mathrm{d}x}\Bigr|_{\substack{x=2\\y=3}} = 7$ and $\frac{\mathrm{d}f(x,y)}{\mathrm{d}y}\Bigr|_{\substack{x=2\\y=3}} = 2$. But the dictionary provides more: the derivative w.r.t. $(x+y)$ is given as well! Some more examples:

In [20]:
numeric_gradient(x*x+y*y, {x:2, y:3})

defaultdict(int, {x: 4, y: 6, ((x*x)+(y*y)): 1, (y*y): 1, (x*x): 1})

In [21]:
z = Symbol('z')
numeric_gradient((x+3)*(y+4)*z*z, {x:2, y:3, z:5})

defaultdict(int,
            {((x+3)*(y+4)): 25,
             x: 175,
             z: 350,
             ((((x+3)*(y+4))*z)*z): 1,
             3: 175,
             (x+3): 175,
             y: 125,
             (y+4): 125,
             (((x+3)*(y+4))*z): 5,
             4: 125})
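
A quick way to sanity-check results such as these is to compare them against central finite differences. The sketch below does so for the toy function $f(x,y)=(x+y)x$ at $x=2, y=3$ using plain Python functions; `central_difference` is a helper written for this illustration and is not part of CGraph.

```python
def central_difference(func, args, index, h=1e-6):
    """Approximate the partial derivative of func w.r.t. args[index]."""
    hi = list(args)
    lo = list(args)
    hi[index] += h
    lo[index] -= h
    return (func(*hi) - func(*lo)) / (2 * h)


def f(x, y):
    return (x + y) * x


dfdx = central_difference(f, (2.0, 3.0), 0)  # should be close to 7
dfdy = central_difference(f, (2.0, 3.0), 1)  # should be close to 2
```

Both approximations agree with the values reported by `numeric_gradient` up to floating point error.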

### Computing symbolic derivatives

Now that we can compute numeric derivatives, one might wonder if we could do the same symbolically, i.e. return an expression tree instead of a number. Clearly such a feature would be beneficial, as it would allow the computation of higher order derivatives. Additionally, pre-factoring the derivative expressions might pay off when the derivative is evaluated multiple times.

It turns out that modifying the numeric gradient computation for symbolic computation is straightforward. All that needs to be done is to return appropriate `Node`s instead of numeric values. In fact, with the overloaded `+` and `*` operations in place for nodes, the symbolic gradient computation looks nearly identical to `numeric_gradient` as defined earlier.

Here it is, `symbolic_gradient`

In [22]:
def symbolic_gradient(f):
    derivatives = defaultdict(lambda: Constant(0))
    
    gen = bfs(f, Constant(1))
    try:
        n, in_grad = next(gen) # Need to use edge info when expressions are reused!
        while True:
            derivatives[n] = derivatives[n] + in_grad
            local_grad = n.symbolic_gradient()
            n, in_grad = gen.send([l * in_grad for l in local_grad])
    except StopIteration:
        return derivatives

The only differences: we use `Constant` instead of plain numbers, and a yet-to-be-defined method `symbolic_gradient` is invoked on each node. One can probably guess that the operations `+` and `*` call the corresponding overloads for `Node` objects introduced earlier. As always, let's monkey patch for `symbolic_gradient`.

In [23]:
# Monkey patch for symbolic_gradient
Symbol.symbolic_gradient = lambda self: [] # Leaf node: nothing to do
Constant.symbolic_gradient = lambda self: [] # Leaf node: nothing to do

Add.symbolic_gradient = lambda self: [Constant(1), Constant(1)] # d(x+y)/dx = 1, d(x+y)/dy = 1
Mul.symbolic_gradient = lambda self: [self.children[1], self.children[0]] # d(x*y)/dx = y, d(x*y)/dy = x

Again, replace numbers by Constants and for `Mul` just return the opposite child expression. Let's test.

In [24]:
symbolic_gradient(f)

defaultdict(<function __main__.symbolic_gradient.<locals>.<lambda>>,
            {x: ((0+((x+y)*1))+(1*(x*1))),
             y: (0+(1*(x*1))),
             (x+y): (0+(x*1)),
             ((x+y)*x): (0+1)})

Have a look at $x$. It claims that the derivative of $f$ with respect to $x$ is equal to `((0+((x+y)*1))+(1*(x*1)))`. After massaging the terms you indeed find that this is the same as $2x+y$. Although not very readable, the reported results are correct. Readability is something we will tackle in the next section, when we begin to simplify expressions.
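
The massaging amounts to dropping additions of $0$ and multiplications by $1$. Spelled out step by step:

$$
\begin{aligned}
(0 + ((x+y)\cdot 1)) + (1\cdot(x\cdot 1)) &= (x+y)\cdot 1 + 1\cdot(x \cdot 1) \\
&= (x+y) + x \\
&= 2x + y
\end{aligned}
$$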

Since the returned dictionary contains expressions, we can apply our whole arsenal of functions developed so far to them.

In [25]:
d = symbolic_gradient(f)
print('df/dx at (x=2,y=3) is {}'.format(value(d[x], {x:2, y:3})))
print('df/dy at (x=2,y=3) is {}'.format(value(d[y], {x:2, y:3})))

# Let's try second order derivatives
ddx = symbolic_gradient(d[x])
ddy = symbolic_gradient(d[y])
print('ddf/dxdx at (x=2,y=3) is {}'.format(value(ddx[x], {x:2, y:3})))
print('ddf/dxdy at (x=2,y=3) is {}'.format(value(ddx[y], {x:2, y:3})))
print('ddf/dydx at (x=2,y=3) is {}'.format(value(ddy[x], {x:2, y:3})))
print('ddf/dydy at (x=2,y=3) is {}'.format(value(ddy[y], {x:2, y:3})))

df/dx at (x=2,y=3) is 7
df/dy at (x=2,y=3) is 2
ddf/dxdx at (x=2,y=3) is 2
ddf/dxdy at (x=2,y=3) is 1
ddf/dydx at (x=2,y=3) is 1
ddf/dydy at (x=2,y=3) is 0


### Adding new operations

Adding new operations to CGraph is not hard. The following recipe sums up the necessary steps.
 1. Add a new class inheriting from `Node`.
 1. Add implementations for `compute_value`, `symbolic_gradient` and optionally for `compute_gradient`.
 1. Provide one or more free functions with the prefix `sym_` that connect arguments as inputs for your operation. Use `@wrap_args` where appropriate.
 1. Optionally provide an implementation for `__str__`.
 1. Optionally provide new `__*__` methods in `Node` to support improved syntax, delegating to the `sym_*` functions.
 
Note that `compute_gradient` is optional, as it can always be mimicked by `symbolic_gradient` followed by `value`. 
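
As a sketch of how the recipe might play out, the following adds a subtraction node. The names `Sub` and `sym_sub` are chosen for this illustration and need not match what cgraph.py actually does. To keep the example runnable on its own, condensed stand-ins for `Node`, `Symbol`, `Constant`, `postorder`, `value` and `wrap_args` are repeated first.

```python
from numbers import Number

# --- Condensed stand-ins for code developed earlier in this notebook ---
class Node:
    def __init__(self, nary=0):
        self.children = [None] * nary

class Symbol(Node):
    def __init__(self, name):
        super().__init__(nary=0)
        self.name = name
    def __str__(self):
        return self.name
    def __hash__(self):
        return hash(self.name)
    def __eq__(self, other):
        return isinstance(other, Symbol) and self.name == other.name
    def compute_value(self, values):
        return values[self]

class Constant(Node):
    def __init__(self, value):
        super().__init__(nary=0)
        self.value = value
    def __str__(self):
        return str(self.value)
    def compute_value(self, values):
        return self.value

def postorder(node):
    for c in node.children:
        yield from postorder(c)
    yield node

def value(f, fargs):
    v = dict(fargs)
    for n in postorder(f):
        if n not in v:
            v[n] = n.compute_value(v)
    return v[f]

def wrap_args(func):
    def wrapped(*args, **kwargs):
        args = [Constant(a) if isinstance(a, Number) else a for a in args]
        return func(*args, **kwargs)
    return wrapped

# --- Step 1: a new class inheriting from Node (hypothetical `Sub`) ---
class Sub(Node):
    def __init__(self):
        super().__init__(nary=2)

    # Step 2: value and gradients
    def compute_value(self, values):
        return values[self.children[0]] - values[self.children[1]]

    def compute_gradient(self, values):
        return [1, -1]  # d(x-y)/dx = 1, d(x-y)/dy = -1

    def symbolic_gradient(self):
        return [Constant(1), Constant(-1)]

    # Step 4: pretty printing
    def __str__(self):
        return '({}-{})'.format(self.children[0], self.children[1])

# --- Step 3: free function with the sym_ prefix ---
@wrap_args
def sym_sub(x, y):
    n = Sub()
    n.children[0] = x
    n.children[1] = y
    return n

# --- Step 5: operator support on Node ---
Node.__sub__ = lambda self, other: sym_sub(self, other)
Node.__rsub__ = lambda self, other: sym_sub(other, self)

x = Symbol('x')
y = Symbol('y')
g = 10 - (x - y)
text = str(g)
result = value(g, {x: 2, y: 3})
```

Because subtraction is not commutative, the order of arguments in `__rsub__` matters: `10 - n` must build `sym_sub(10, n)`, not the other way round.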

A word of caution: when computing numeric gradients through `compute_gradient` you might find yourself dividing by zero or raising some other math exception. When this happens on a path you are not even interested in, gradient computation will still stop. For example, consider $f(x,y) = x^y$ and let $x=-1, y=2$. When evaluating the gradient for $y$ you will find that it corresponds to $x^y \log(x)$. Unfortunately, $x$ is negative, so the value cannot be computed. Python raises an exception and gradient computation fails.

CGraph handles this by using `NAN`s instead of exceptions. Almost all operations that involve `NAN`s result in `NAN`, so they propagate nicely. In [CGraph](../cgraph.py) a function named `nan_on_fail` is used to catch math exceptions and return `NAN` instead. See `Div` for example usage.
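
The idea can be sketched as a decorator. Note that this is only an illustration of the technique: the signature and the set of caught exceptions are assumptions, and the actual `nan_on_fail` in cgraph.py may look different. The function `pow_grad_y` is likewise made up for this example.

```python
import math
from functools import wraps


def nan_on_fail(func):
    """Sketch (assumed design): return NaN when func raises a math exception."""
    @wraps(func)
    def wrapped(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except (ArithmeticError, ValueError):
            # ZeroDivisionError and OverflowError derive from ArithmeticError;
            # math domain errors are reported as ValueError
            return float('nan')
    return wrapped


@nan_on_fail
def pow_grad_y(x, y):
    # d(x**y)/dy = x**y * log(x); math.log raises ValueError for x <= 0
    return x ** y * math.log(x)


bad = pow_grad_y(-1, 2)       # NaN instead of an exception
good = pow_grad_y(2.0, 3.0)   # 8 * log(2)
```

Since `NaN` compares unequal to everything, including itself, use `math.isnan` to test for it.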

### Expression simplification

Earlier we saw that symbolic differentiation produces hardly readable expressions, such as `((0+((x+y)*1))+(1*(x*1)))`. In this section we will see how to simplify such expressions. Not only will this improve readability, but also performance, as fewer nodes need to be evaluated (this pays off especially when you evaluate the expression many times after simplification).

CGraph implements expression simplification by traversing the computational graph while trying to apply simplification rules. Each rule is given a node and may produce a simpler version of that node. The new node 'replaces' the old node in an expression tree that is formed in parallel. That means that we won't fiddle around with the original expression tree, but rather generate a new expression tree that represents a simplified version of the original one.

First we will implement a rule filter decorator, a helper function and a single rule.