In [2]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});


<IPython.core.display.Javascript object>

# Automatic Differentiation

Automatic differentiation is a technique for computing *exact* numerical derivatives of simple expressions or complicated models. The main idea is to decompose any numerical expression, simple or complicated, into a series of elementary steps where each step can be easily differentiated.

A common task in optimization is to compute the derivative of an objective function. In many cases the objective function is a single number that depends on many variables, and the relation between the variables and the function value is such that the derivatives can in principle be computed by hand, but only with a lot of effort. In such cases automatic differentiation can be used.

We will first illustrate the method with an (over) simplified example. Assume that we want to differentiate the function

\begin{eqnarray}
  f(x,y) = x^2 + y^2
                  \label{eq:f}
\end{eqnarray}

We can, of course, compute the derivatives of this function with respect to $x$ and $y$ trivially as

\begin{eqnarray}
  \frac{\partial f(x,y)}{\partial x} = 2x, \nonumber\\
  \frac{\partial f(x,y)}{\partial y} = 2y.
\end{eqnarray} 

Alternatively we can first split the computation of $f(x,y)$ into three simple steps
 
1. $w_1  =  x^2$ 
2. $w_2  =  y^2$
3. $w_3  = w_1 + w_2$


Formally, $w_3$ is a function of $w_1$ and $w_2$, which in turn are functions of $x$ and $y$, so to compute the partial derivatives of $w_3 = f$ we have to use the chain rule

\begin{eqnarray}
\frac{\partial w_3}{\partial x} & = & \frac{\partial w_3}{\partial w_1}\frac{\partial w_1}{\partial x},
                                \nonumber \\
\frac{\partial w_3}{\partial y} & = &\frac{\partial w_3}{\partial w_2}\frac{\partial w_2}{\partial y}.
                                \label{eq:chain}
\end{eqnarray}
We see that each of the partial derivatives in the chain rule can be computed as:
1. $\frac{\partial w_1}{\partial x} = 2x$
2. $\frac{\partial w_2}{\partial y} = 2y$
3. $\frac{\partial w_3}{\partial w_1} = 1$
4. $\frac{\partial  w_3}{\partial w_2} = 1$


To compute the derivatives we use the list above and add the computation of the derivatives to
each step

1. $w_1  =  x^2$, $\frac{\partial w_1}{\partial x}=2x$
2. $w_2  =  y^2$, $\frac{\partial w_2}{\partial y}=2y$
3. $w_3  = w_1 + w_2$, $\frac{\partial w_3}{\partial w_1}\frac{\partial w_1}{\partial x} 
                       +\frac{\partial w_3}{\partial w_2}\frac{\partial w_2}{\partial y}
                       = 1\cdot 2x + 1\cdot 2y = 2x+2y$
                       
We see that in the evaluation of the derivatives we start with the derivatives of $w_1$ and $w_2$ with respect to $x$ and $y$ and then continue with the derivatives of $w_3$ with respect to $w_1$ and $w_2$.
Referring to equation (\ref{eq:chain}), we proceed from *right* to *left*.

How can we use these concepts to do actual computations?

A simple implementation in Python is shown below.

                      



  
  

In [2]:
''' Automatic differentiation of simple expression'''

def f1(x):
  ''' Compute x*x and its derivative
       
    Arg   : x input argument
    Return: list of function value and derivative
  ''' 
  return [x*x,2*x]

def f2(y):
  ''' Compute y*y and its derivative
       
    Arg   : y input argument
    Return: list of function value and derivative
  '''
  return [y*y, 2*y]

def f3(w1,w2):
  ''' Compute w1+w2 and its derivative
       
    Args  : w1,w2 lists with function value and derivative
    Return: list of function value and summed derivative
  '''
  fval = w1[0]+w2[0]
  # d(w3)/d(w1) = d(w3)/d(w2) = 1
  dval = 1*w1[1]+1*w2[1]
  return [fval,dval]

#Initial values for x and y
x=1
y=1

# Step 1
w1 = f1(x)
# Step 2
w2 = f2(y)
# Step 3
w3 = f3(w1,w2)
print("f:  ",w3[0])
print("df: ",w3[1])
  
    
    
    

f:   2
df:  4
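The `[value, derivative]` lists used above can also be wrapped in a small class, so that the derivative is carried along automatically whenever two quantities are added or multiplied. The following is a minimal sketch of that idea, not a full implementation: the class name `Dual` and its attributes `val` and `der` are hypothetical and only addition and multiplication are supported. Both inputs are seeded with derivative 1, so the result reproduces the numbers printed above.

```python
class Dual:
    """Hypothetical helper: a function value together with its derivative."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: derivatives add
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def f(x, y):
    # Same function as in the text, written without explicit steps
    return x * x + y * y


# Seed both inputs with derivative 1, as in the steps above
x = Dual(1.0, 1.0)
y = Dual(1.0, 1.0)
w3 = f(x, y)
print("f:  ", w3.val)
print("df: ", w3.der)
```

The function `f` itself contains no derivative code; the rules live entirely in the overloaded operators.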


In the above equations and Python code, note that what we actually compute is

\begin{eqnarray}
  \delta f = \frac{\partial f(x,y)}{\partial x} + \frac{\partial f(x,y)}{\partial y},
                                                          \label{eq:df}
\end{eqnarray}

which is the sum of the partial derivatives of $f$ (the total derivative with $dx = dy = 1$). If we instead want the components of the gradient we need
the vector

\begin{eqnarray}
  \nabla f(x,y) = \left[\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y}\right] \nonumber 
\end{eqnarray}

What we have computed is formally 

\begin{eqnarray}
\delta f = \nabla f(x,y)\cdot \mathbf{e}
\end{eqnarray}

which is the dot product of the gradient with the vector $\mathbf e=(1,1)$, i.e. the derivative of $f$ along the direction $\mathbf e$ (note that $\mathbf e$ is not normalized).
To actually compute the gradient we have to modify the steps above as

1. $w_1  =  x^2$, $\frac{\partial w_1}{\partial x}=2x$
2. $w_2  =  y^2$, $\frac{\partial w_2}{\partial y}=2y$
3. $w_3  = w_1 + w_2$, $\frac{\partial w_3}{\partial w_1}\frac{\partial w_1}{\partial x}e_x 
                       +\frac{\partial w_3}{\partial w_2}\frac{\partial w_2}{\partial y}e_y
                       = 1\cdot 2x \cdot e_x + 1\cdot 2y \cdot e_y = 2x e_x+2y e_y$

where $e_x$ and $e_y$ are the $x$ and $y$ components of the direction vector $\mathbf{e} = (e_x,e_y)$. Choosing $\mathbf{e}=(1,0)$ gives the $x$ component of the gradient and $\mathbf{e}=(0,1)$ gives the $y$ component.
If we consider the vector $\mathbf{e}$ as input to the computation, we can modify the Python code as 


In [9]:
def f1(x):
  return [x*x,2*x]


def f3(w1,w2,e):
    fval = w1[0]+w2[0]
    dval = e[0]*w1[1]+e[1]*w2[1]
    return [fval,dval]



def df(x,e):
  # Step 1
  w1 = f1(x[0])
  # Step 2
  w2 = f1(x[1])
  # Step 3
  w3 = f3(w1,w2,e)
  return w3

#Compute x-component of gradient
x=[1,1]
e=[1,0]
f=df(x,e)
print("f:  ",f[0])
print("df/dx: ",f[1])

#Compute y-component of gradient
e=[0,1]
f=df(x,e)
print("df/dy: ",f[1])
  
  

f:   2
df/dx:  2
df/dy:  2


In effect we have to evaluate the function $f$ twice to get both gradient components. If $f$ depends on a large number of variables, we have to evaluate the function once per variable.
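To make the cost of the forward method concrete, the sketch below generalizes the example to a sum of squares in $n$ variables and counts the number of forward passes needed for the full gradient. The function names `sum_of_squares_forward` and `forward_gradient` are hypothetical and chosen for this illustration only.

```python
def sum_of_squares_forward(x, e):
    """Forward pass for f(x) = sum_i x_i^2.

    Returns [f, grad f . e] for the direction vector e.
    """
    fval = 0.0
    dval = 0.0
    for xi, ei in zip(x, e):
        fval += xi * xi          # w_i = x_i^2
        dval += 2.0 * xi * ei    # d(w_i)/d(x_i) * e_i
    return [fval, dval]


def forward_gradient(x):
    """Build the gradient one component at a time."""
    n = len(x)
    gradient = []
    passes = 0
    for i in range(n):
        e = [0.0] * n
        e[i] = 1.0               # i-th coordinate direction
        gradient.append(sum_of_squares_forward(x, e)[1])
        passes += 1
    return gradient, passes


x = [1.0, 2.0, 3.0, 4.0]
gradient, passes = forward_gradient(x)
print("gradient:", gradient)
print("forward passes:", passes)
```

Each pass recomputes the function value as well, so the work grows linearly with the number of variables.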

## The backward method

The method to compute derivatives outlined above uses the chain rule. Each step consists 
of a multiplication of the derivatives from previous steps and the derivative in the current step. Referring
to the chain rule, we evaluate the partial derivatives right to left, starting with
$\frac{\partial w_1}{\partial x}$. In principle we could also compute the derivatives in the opposite
order.
Starting with $w_3$ we could compute

\begin{eqnarray}
\frac{\partial w_3}{\partial w_1},\nonumber \\
\frac{\partial w_3}{\partial w_2}
\end{eqnarray}

Looking closely at the stepwise computational list

1. $w_1  =  x^2$, $\frac{\partial w_1}{\partial x}=2x$
2. $w_2  =  y^2$, $\frac{\partial w_2}{\partial y}=2y$
3. $w_3  = w_1 + w_2$, $\frac{\partial w_3}{\partial w_1}\frac{\partial w_1}{\partial x}e_x 
                       +\frac{\partial w_3}{\partial w_2}\frac{\partial w_2}{\partial y}e_y
                       = 1\cdot 2x e_x + 1\cdot 2y e_y = 2x e_x+2y e_y$
                       
we see that the last step involves a summation. If we start the computation at step 3 we first have
to undo the summation, which in this case consists of computing the factors
$\frac{\partial w_3}{\partial w_1}$ and $\frac{\partial w_3}{\partial w_2}$ which in our case
both are equal to one. So our first step would be to compute a vector, $\mathbf{u}$ as follows

3. $\mathbf{u}^3 = (u^3_0,u^3_1)=(\frac{\partial w_3}{\partial w_1}, \frac{\partial w_3}{\partial w_2}) = (1,1)$,
since $w_3 = w_1 + w_2$.

Recalling that the derivatives are given by

\begin{eqnarray}
\frac{\partial w_3}{\partial x} & = & \frac{\partial w_3}{\partial w_1}\frac{\partial w_1}{\partial x},
                                \nonumber \\
\frac{\partial w_3}{\partial y} & = &\frac{\partial w_3}{\partial w_2}\frac{\partial w_2}{\partial y}.
                                \label{eq:chain2}
\end{eqnarray}

we have now computed the first factors on the right-hand side. We therefore evaluate from left to right, which
is the opposite of the order used above, where we evaluated the factors from right to left.

We can now proceed to calculate the full derivative 
by

2. $u^2 = u^3_1\frac{\partial w_2}{\partial y} = 1\cdot 2y$ 

and 

1. $u^1 = u^3_0\frac{\partial w_1}{\partial x} = 1\cdot 2x$

The gradient is now given by 
\begin{eqnarray}
\frac{\partial f(x,y)}{\partial x} & = & u^1 = 2x, \nonumber\\
\frac{\partial f(x,y)}{\partial y} & = & u^2 = 2y. \nonumber 
\end{eqnarray}

Doing the calculation in this **backward** order we actually end up with the components of the 
gradient and not the sum of the components as we did in the **forward** calculation.
The list of computing steps for the backward method is then
<ol reversed>
<li >$\mathbf{u}^3 = (u^3_0,u^3_1)=(\frac{\partial w_3}{\partial w_1}, \frac{\partial w_3}{\partial w_2}) = (1,1)$
<li> $u^2 = u^3_1\frac{\partial w_2}{\partial y} = 1\cdot 2y$ 
<li> $u^1 = u^3_0\frac{\partial w_1}{\partial x} = 1\cdot 2x$
</ol>

Note that the backward method is *not* just a reversal of the steps of the forward method; it is a different calculation.






           


In [16]:
def f1(x):
  return [x*x,2*x]


def f3(w1,w2,e):
    fval = w1[0]+w2[0]
    dval = e[0]*w1[1]+e[1]*w2[1]
    return [fval,dval]

def df(x,e):
  # Step 1
  w1 = f1(x[0])
  # Step 2
  w2 = f1(x[1])
  # Step 3
  w3 = f3(w1,w2,e)
  return w3

def b3(w3) :
  # u3 = (dw3/dw1, dw3/dw2) = (1,1) since w3 = w1 + w2
  return [1,1]

def b2(u3,x) :
  # u2 = dw3/dw2 * dw2/dy
  return u3[1]*2*x[1]

def b1(u3,x) :
  # u1 = dw3/dw1 * dw1/dx
  return u3[0]*2*x[0]

def dfb(w3,x):
  #Step 3
  u3 = b3(w3)
  #Step 2
  u2 = b2(u3,x)
  #Step 1
  u1 = b1(u3,x)
  return [u1,u2]

#Initial values for x
x = [1,1]
e = [1,1]

#First run the forward calculation
w3 = df(x,e)

#Run the backward calculation
u = dfb(w3,x)
print("df/dx:", u[0])
print("df/dy:", u[1])




df/dx: 2
df/dy: 2


To do the calculation above in backward mode, we first have to do the forward calculation and then compute the derivatives. The amount of computation is roughly twice that of one forward calculation, which is the same as the two forward calculations required by the forward method for our two variables.
However, if the number of components in the gradient is large, the cost of the forward method is proportional to the number of components, while the backward method still requires only the equivalent of roughly two forward calculations.
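The following sketch illustrates this scaling for a sum of squares in $n$ variables: every operation records its inputs and local derivatives during the forward calculation, and a single backward sweep then fills in all gradient components. The class `Var`, its attribute `adj` and the function `backward` are hypothetical names for this illustration, and only `+` and `*` are supported.

```python
class Var:
    """Hypothetical node in a recorded computation."""

    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # tuples (input node, local derivative)
        self.adj = 0.0           # d(output)/d(this node), set by backward()

    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val,
                   ((self, other.val), (other, self.val)))


def backward(out):
    """Single backward sweep from the output to all inputs."""
    order, seen = [], set()

    def visit(node):
        # Depth-first search: inputs are listed before the nodes using them
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(out)
    out.adj = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule, evaluated left to right
            parent.adj += node.adj * local


# Forward calculation: f = sum_i x_i^2
xs = [Var(v) for v in [1.0, 2.0, 3.0, 4.0]]
f = xs[0] * xs[0]
for x in xs[1:]:
    f = f + x * x

# One backward sweep gives every gradient component
backward(f)
gradient = [x.adj for x in xs]
print("f:", f.val)
print("gradient:", gradient)
```

Regardless of the number of variables, `backward` is called only once per function evaluation.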

##  Python libraries for automatic differentiation

Automatic differentiation can be implemented as we have done above by manually creating computational steps and assigning differentiation rules. However, this is not feasible if the computation is very complicated. Fortunately it is possible to implement automatic differentiation in a fully automatic (no pun intended) way. The methodology is the same as described above, but the library records or parses the elementary operations of an expression and builds the derivative rules automatically.

There are several Python libraries that are capable of doing this. Somewhat arbitrarily, I use a library called Autograd.


In [7]:
import autograd.numpy as np
from autograd import grad



def f(x) :
    y = np.dot(x,x)
    return y

x=np.zeros(2)
x[0] = 1.0
x[1] = 1.0
grad_f = grad(f)       # Obtain its gradient function
df = grad_f(x)

print("df/dx: ", df)



df/dx:  [2. 2.]


## Optimization using automatic differentiation

Now assume we want to solve an optimization problem, for example to find
the minimum of the function $f(x,y)$

\begin{eqnarray}
f(x,y) = (x-2)^2 + (y-1)^2
\end{eqnarray}
We can solve the problem by combining two Python libraries: Autograd for the gradient and SciPy for the conjugate gradient minimizer.


In [1]:
import autograd.numpy as np
from autograd import grad
from scipy.optimize import fmin_cg as cg


def f(x) :
    # f(x,y) = (x-2)^2 + (y-1)^2, written as the squared distance to (2,1)
    d = x - np.array([2.0, 1.0])
    return np.dot(d,d)

gradient = grad(f)       # Obtain the gradient function once

x0=np.zeros(2)           # Start the search at (0,0)
x=cg(f,x0,fprime=gradient)
# Analytically, the minimum is at (x,y) = (2,1), where f = 0
print("x: ",x)
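To show how an optimizer uses the gradient, here is a minimal gradient descent sketch in plain Python for $f(x,y) = (x-2)^2 + (y-1)^2$. It is not the conjugate gradient algorithm used by SciPy: the gradient is written out by hand, and the step length, tolerance and function names are assumptions made for this illustration.

```python
def f(x, y):
    return (x - 2.0) ** 2 + (y - 1.0) ** 2


def grad_f(x, y):
    # Hand-written gradient; an AD library would supply this automatically
    return 2.0 * (x - 2.0), 2.0 * (y - 1.0)


def gradient_descent(x, y, step=0.25, tol=1e-10, max_iter=1000):
    """Walk against the gradient until its norm drops below tol."""
    iterations = 0
    while iterations < max_iter:
        gx, gy = grad_f(x, y)
        if (gx * gx + gy * gy) ** 0.5 < tol:
            break
        x -= step * gx
        y -= step * gy
        iterations += 1
    return x, y, iterations


x_min, y_min, n_iter = gradient_descent(0.0, 0.0)
print("x:", x_min, "y:", y_min, "iterations:", n_iter)
```

With this step length the distance to the minimum at $(2,1)$ is halved in every iteration.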


## Relation with adjoint 

For the simple example the objective function $f$ is given by

\begin{eqnarray}
  f(x,y) = x^2+y^2
\end{eqnarray}

For small changes $\Delta x$ and $\Delta y$, the change $\Delta f$ can be written, to first order, as

\begin{eqnarray}
  \Delta f(x,y) = \frac{\partial f(x,y)}{\partial x}\Delta x + \frac{\partial f(x,y)}{\partial y}\Delta y
\end{eqnarray}

In matrix notation this is
\begin{eqnarray}
\begin{bmatrix}
  \Delta f \\
  0        
\end{bmatrix}=
\begin{bmatrix}
\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\
         0                                 &          0                 \\                                  
 \end{bmatrix}
 \begin{bmatrix}
 \Delta x \\
 \Delta y \\
 \end{bmatrix}
\end{eqnarray}

If we denote

\begin{eqnarray}
\mathbf{J} = 
\begin{bmatrix}
\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\
         0                                 &          0                 \\                                  
 \end{bmatrix}
\end{eqnarray}
and write the vectors
\begin{eqnarray}
\mathbf{f} = 
\begin{bmatrix}
\Delta f \\
0
\end{bmatrix}
\end{eqnarray}
and
\begin{eqnarray}
\mathbf{x} = 
\begin{bmatrix}
\Delta x \\
\Delta y
\end{bmatrix}
\end{eqnarray}


We can write
\begin{eqnarray}
\mathbf{f} = \mathbf{J}\mathbf{x}
                \label{eq:J}
\end{eqnarray}

If we now use $\mathbf{J}^T$ to operate on $\mathbf{f}$

\begin{eqnarray}
\mathbf{J}^T \mathbf{f} = \mathbf{J}^T\mathbf{J}\mathbf{x}
                \label{eq:Jt}
\end{eqnarray}

the left-hand side becomes
\begin{eqnarray}
\begin{bmatrix}
\frac{\partial f(x,y)}{\partial x}                   &    0           \\
\frac{\partial f(x,y)}{\partial y}                   &    0           \\                   
 \end{bmatrix}
 \begin{bmatrix}
  \Delta f \\
  0        
\end{bmatrix}=
 \begin{bmatrix}
 \frac{\partial f(x,y)}{\partial x}\Delta f \\
 \frac{\partial f(x,y)}{\partial y}\Delta f \\
 \end{bmatrix}
\end{eqnarray}

If $\Delta f=1$ we get the gradient components.
$\mathbf{J}^T$ is the adjoint (here simply the transpose) of $\mathbf{J}$.
We see that by using the adjoint of the Jacobian to operate on the
output $\Delta f$ we get the gradient components.
The backward method corresponds to using the adjoint of the Jacobian, while
the forward method corresponds to using the Jacobian.
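The two products can be compared numerically with a short sketch. The Jacobian of $f(x,y)=x^2+y^2$ is padded with a zero row as in the matrices above, and the helper names `jacobian`, `matvec` and `transpose` are hypothetical. The point $(1,3)$ is chosen so that the two gradient components differ.

```python
def jacobian(x, y):
    """Jacobian of f(x, y) = x^2 + y^2, padded with a zero row."""
    return [[2.0 * x, 2.0 * y],
            [0.0, 0.0]]


def matvec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


J = jacobian(1.0, 3.0)

# Forward: J acts on an input perturbation (Delta x, Delta y) = (1, 0)
# and returns (Delta f, 0), i.e. one directional derivative
forward = matvec(J, [1.0, 0.0])

# Backward: J^T acts on the output (Delta f, 0) = (1, 0)
# and returns both gradient components at once
backward = matvec(transpose(J), [1.0, 0.0])

print("J   (1,0):", forward)
print("J^T (1,0):", backward)
```

One application of $\mathbf{J}$ yields a single directional derivative, while one application of $\mathbf{J}^T$ yields the whole gradient.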

