# <center>Automatic Differentiation</center>
### <center>[Dr David Race](dr.david.race@gmail.com)</center>

This notebook is specifically designed to provide a quick demonstration of "autograd" capabilities.  This is designed to be the first in a series on convex function minimization within Machine Learning (ML) environments, so it starts with the basics of differentiation.  This notebook uses the "tensor" concepts to demonstrate some of the nice methods available with both the MinPy and PyTorch packages.  These are both foundations for Deep Learning (DL) environments, but are equally adept at standard mathematics.

There are two main sections:
1.  MinPy
2.  PyTorch

The examples grow in sophistication as the notebook progresses, so be sure to follow the instructions carefully.

<i>NOTE:  This is designed to run in Colaboratory, but is likely to run in most other Jupyter environments also.  In particular, this does not connect to a GPU, so it will run on minimal hardware.</i>

## 1. Differentiation with MinPy.Autograd

Sympy is a great package for symbolic differentiation, but here sympy is only used for a simple comparison so you can better understand the autograd results.  Autograd is much more appropriate for larger problems, so it is more important for Differential Equations and Linear Algebra applications.  The reference for autograd is found at [MinPy - Autograd](https://minpy.readthedocs.io/en/latest/tutorial/autograd_tutorial.html#).

### 1.1 - Set Up Environment

This section installs autograd into the Colaboratory environment and imports the standard python numeric packages.

In [0]:
!pip install autograd
#python imports
import os, sys
#numpy
import numpy as np
from numpy import linspace
import scipy as sp
#sympy
import sympy as smp
from sympy import *
from sympy import Function, Symbol
from sympy import Derivative
#
import autograd

This section sets up the graphics.

In [0]:
#The general plot capabilities
import matplotlib
import matplotlib.pyplot as plt
#Since these are images, turn off the grids
matplotlib.rc('axes',**{'grid':False})
#  sympy plotting
from sympy.plotting import plot
#seaborn
import seaborn as sns

### 1.2 - Example 1 - $f(x) = x^2 + 4$

We start with the basic problem and progress along the knowledge path.

#### 1.2.1  Sympy Implementation

This section uses the symbolic package to derive the known quantities for our function.  This could be done by hand, but is intended to show how the results mesh together.

In [0]:
#Define the function
x = Symbol('x')
f = Function('f')(x)
f = x**2 + 4
#Show the function definition
print("The function f")
smp.pprint(f)
#take the derivative
f_prime = f.diff(x)
print('The derivative of f')
smp.pprint(f_prime)
#  Plot the function and derivative
p1 = plot(f,xlim=(-3.0,3.0),ylim=(0.0,12.0))
#  Compute the values of f between -3 and 3
f_n = lambdify(x, f, "numpy")
f_prime_n = lambdify(x,f_prime,"numpy")
x_vals = linspace(-3.0, 3.0)
y_vals = f_n(x_vals)
y_prime_vals = f_prime_n(x_vals)

sns.set_style('dark')
fig, ax = plt.subplots()
plt.ylim(0.0,12.0)
plt.yticks(np.arange(1,13))
ax.axvline(0.0, color='k')
ax.axhline(0.0, color='k')
fn, = ax.plot(x_vals,y_vals, label='$f$')
fprimen, = ax.plot(x_vals,y_prime_vals, label='$\\frac{\\partial f}{\\partial x}$')
plt.legend(handles=[fn, fprimen])
plt.show()

This is a standard and easily understood problem, so not much effort is put into the plot.  The main point is the generation of the x and y values.

#### 1.2.2  Autograd Implementation

Autograd understands the same types of operations, but rather than focusing on symbolic computation, the focus is on numeric computation using a similar underlying framework.  The main difference is that the gradient <i>(yes, these are the partial derivatives)</i> is taken relative to a scalar <i>"loss"</i> value.  Therefore, when working with tensors of numbers, we need to define the function that will be differentiated in terms of a loss value <i>(NOTE:  The use of the loss value stems from Machine Learning.)</i>  The following code generates the same example data.

It may not be obvious, but this provides a way to automatically compute the derivative of a function at many points concurrently.  Here is the process:

1.  Define your function, $f$, so it inputs a tensor (vector, matrix, etc).
2.  Define your loss function, $loss_f$, to be $np.sum(f(x))$
3.  Define the gradient of $f$ to be $grad(loss_f)$
4.  Then for clarity, define a function g, that outputs the $f(x)$ and $f^\prime(x)$

These steps are shown in the next example:


In [0]:
import autograd.numpy as np #This is so the gradient understands the numpy operations
from autograd import grad
#Follow the steps

def f(x):
  y = x*x + 4.0
  return y
def loss_f(x):
  loss = np.sum(f(x))
  return loss
f_p = grad(loss_f)
def g(x):
  return f(x), f_p(x)
#Compute points
y, y_p = g(x_vals)
#plot
sns.set_style('dark')
fig, ax = plt.subplots()
plt.ylim(0.0,12.0)
plt.yticks(np.arange(1,13))
ax.axvline(0.0, color='k')
ax.axhline(0.0, color='k')
fn, = ax.plot(x_vals,y, label='$f$')
fprimen, = ax.plot(x_vals,y_p, label='$\\frac{\\partial f}{\\partial x}$')
plt.legend(handles=[fn,fprimen])
plt.show()
#
#  check the results
#
max_der_diff = np.max(np.abs(y_p - y_prime_vals))
print("The max difference in the derivative computation: {:.8f}".format(max_der_diff))

As you can see, the results match the symbolic derivative, as expected.  Let's see why:

Recall from above that we defined the loss function as $np.sum(f(x))$, thus $loss_f = \sum_{i=0}^{N-1} f(x_i)$; therefore,

$\frac{\partial loss_f}{\partial x_k} = \sum_{i=0}^{N-1} \frac{\partial f(x_i)}{\partial x_k} = f^{\prime}(x_k)$

since $x_k$ appears in only one term of the summation, namely $f(x_k)$.

Consequently, using the $np.sum$ function provides a quick way to compute the derivatives of $f$ for the input $x$ values.
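The argument above can also be checked numerically without autograd.  The following is only a minimal sketch in plain Python: the helper `partial_loss`, the step size `h`, and the sample points `xs` are illustrative assumptions, not part of the notebook.  It approximates each partial derivative of the summed loss with a central difference and compares it with $f^{\prime}(x) = 2x$.

```python
# Finite-difference check of the "sum trick" (plain Python, no autograd needed).

def f(x):
    # Same scalar function as in the example: f(x) = x^2 + 4
    return x * x + 4.0

def loss_f(xs):
    # Scalar loss: the sum of f over every sample point
    return sum(f(x) for x in xs)

def partial_loss(xs, k, h=1e-6):
    # Central-difference approximation of d(loss_f)/d(x_k)  (hypothetical helper)
    up = list(xs)
    up[k] += h
    down = list(xs)
    down[k] -= h
    return (loss_f(up) - loss_f(down)) / (2.0 * h)

xs = [-3.0, -1.5, 0.0, 0.5, 2.0]   # arbitrary sample points
numeric = [partial_loss(xs, k) for k in range(len(xs))]
exact = [2.0 * x for x in xs]      # f'(x) = 2x
max_diff = max(abs(n - e) for n, e in zip(numeric, exact))
print("Max difference: {:.2e}".format(max_diff))
```

Perturbing a single $x_k$ only changes the one term $f(x_k)$ in the sum, which is why each partial derivative of the loss recovers the ordinary derivative at that point.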

### 1.3 - Example 2 - $f(x,y) = x^2 + y^2 + 4$

In this example, we will compute the gradient of $f$, namely $grad(f) = \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix}2x \\ 2y\end{bmatrix}$, for a set of random points with $x,y \in [-1,1)$.

This works much the same as the previous example by leveraging the grad function and providing the appropriate loss function that can operate on multiple inputs concurrently.
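
As a quick sanity check on the formula, take the (arbitrarily chosen) point $(x,y) = (0.5, -0.25)$:

$$f(0.5,-0.25) = 0.25 + 0.0625 + 4 = 4.3125, \qquad \nabla f(0.5,-0.25) = \begin{bmatrix} 2(0.5) \\ 2(-0.25) \end{bmatrix} = \begin{bmatrix} 1 \\ -0.5 \end{bmatrix}$$

Each column of the gradient array computed in the next cell should follow this same pattern, one column per sample point.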

In [0]:
import autograd.numpy as np #This is so the gradient understands the numpy operations
from autograd import grad
#Follow the steps

def f(xy):
  z = xy[0]*xy[0] + xy[1]*xy[1] + 4.0
  return z
def loss_f(z):
  loss = np.sum(f(z))
  return loss
f_p = grad(loss_f)
def g(xy):
  return np.array(f(xy)),np.array(f_p(xy))
#Define the x and y
x_vals = np.random.uniform(-1,1,50)
y_vals = np.random.uniform(-1,1,50)
xy = [x_vals,y_vals]
#
#Compute points
z, z_p = g(xy)
#Compute the formula values
z_p_compute = np.array([2*x_vals,2*y_vals])

max_err = np.max(np.abs(z_p - z_p_compute))
print("Max Error: {:.8f}".format(max_err))


Once again, the use of grad allows for easy computation of exactly the values we need for computation.

### 1.4  Conclusion

With autograd, the computation of gradients is automatic.  Even though autograd has its primary use in Machine Learning, this tool can be very powerful for mathematical operations since it both supports GPUs and targets numpy compatibility.

## 2. Differentiation with PyTorch.autograd

The MinPy.autograd is a very nice package, but at this point PyTorch probably has a larger user community and it is also very pythonic.  Like MinPy, it has GPU support, but it doesn't overload the numpy packages.  Given its sponsors (including Facebook), the implementation for Machine Learning is very robust and it has several pre-trained models that are ready for use in solving problems.  This series of studies on using gradients generally focuses on PyTorch; however, most of the work can be done within MinPy.

The documentation for PyTorch can be found at [Docs](https://pytorch.org/docs/stable/index.html).

### 2.1  Set Up Environment

In [0]:
#
!pip3 install -U torch
#
import torch as torch
from torch import tensor as T  #torch.tensor is a function, not a module, so import it by name
import torch.autograd as t_autograd  #normally I use autograd, but I want to distinguish between MinPy autograd
#
#  Output Environment Information
#
has_cuda = torch.cuda.is_available()
current_device = torch.cuda.current_device() if has_cuda else -1
gpu_count = torch.cuda.device_count() if has_cuda else -1
gpu_name = torch.cuda.get_device_name(current_device) if has_cuda else "NA"
print("Current device {}".format(current_device))
print("Number of devices: {:d}".format(gpu_count))
print("Current GPU Number: {:d}".format(current_device))
print("GPU Name: {:s}".format(gpu_name))
#Set the accelerator variable
accelerator = 'cuda' if has_cuda else 'cpu'
print("Accelerator: {:s}".format(accelerator))

### 2.2 - Example 1 - $f(x) = x^2 + 4$

This section solves the same problem as the previous section, but is written to accommodate a GPU, so it includes some of the details needed to use one.

In [0]:
#define setup
#
N = 50
device = torch.device(accelerator)
#
#Define the function and loss
#
def f(x):
  y = x * x + 4.0
  return y
def loss_f(x):
  z = f(x).sum()
  return z
def f_p(x):
  z = loss_f(x)
  z.backward()
  return x.grad
x_val = np.linspace(-3., 3.0, N)
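#Create x directly on the device so it stays a leaf tensor and x.grad is populated
x = T(x_val, requires_grad=True, device=device)
#Get the data (move to the CPU before converting to numpy)
x_vals = x.detach().cpu().numpy()
y = f(x).detach().cpu().numpy()
y_p = f_p(x).cpu().numpy()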
#Graph
sns.set_style('dark')
fig, ax = plt.subplots()
plt.ylim(0.0,12.0)
plt.yticks(np.arange(1,13))
ax.axvline(0.0, color='k')
ax.axhline(0.0, color='k')
fn, = ax.plot(x_vals,y, label='$f$')
fprimen, = ax.plot(x_vals,y_p, label='$\\frac{\\partial f}{\\partial x}$')
plt.legend(handles=[fn,fprimen])
plt.show()
#
#  check the results
#
max_der_diff = np.max(np.abs(y_p - y_prime_vals))
print("The max difference in the derivative computation: {:.8f}".format(max_der_diff))


As you can see, the computations are similar to using MinPy, but instead of calling a grad function this calls the <i>backward</i> method on the loss <i>(backward is the Machine Learning term for computing the derivatives relative to the loss)</i>, and then <i>grad</i> is a property of the tensor that was created with <i>requires_grad = True</i>.
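
One detail worth knowing when reusing this pattern is that repeated calls to <i>backward</i> add to <i>grad</i> rather than overwrite it.  The sketch below is only an illustration on a small CPU tensor; the three sample values and the variable names `first`, `second`, and `third` are assumptions chosen for the demonstration.

```python
import torch

# Leaf tensor created directly with requires_grad=True, so x.grad is populated
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

loss = (x * x + 4.0).sum()   # same loss as in the example above
loss.backward()
first = x.grad.clone()       # holds 2*x

# A second backward pass ADDS to x.grad instead of replacing it
loss = (x * x + 4.0).sum()
loss.backward()
second = x.grad.clone()      # holds 4*x (two accumulated passes)

# Reset the stored gradient before reusing x for another derivative
x.grad.zero_()
loss = (x * x + 4.0).sum()
loss.backward()
third = x.grad.clone()       # holds 2*x again

print(first, second, third)
```

If the derivative function is going to be called more than once on the same tensor, zeroing <i>grad</i> first (or creating a fresh tensor) keeps the results from silently doubling.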

### 2.3 - Example 2 - $f(x,y) = x^2 + y^2 + 4$

In [0]:

#Follow the steps

def f(xy):
  z = xy[0]*xy[0] + xy[1]*xy[1] + 4.0
  return z
def loss_f(z):
  loss = f(z).sum()
  return loss
def f_p(xy):
  z = loss_f(xy)
  z.backward()
  return xy.grad
def g(xy):
  return f(xy).detach().cpu().numpy(),f_p(xy).cpu().numpy()  #move to the CPU before converting to numpy
#Define the x and y
x_vals = np.random.uniform(-1,1,50)
y_vals = np.random.uniform(-1,1,50)
xy = T(np.array([x_vals,y_vals]), requires_grad=True, device=device)  #create on the device so xy stays a leaf tensor
#
#Compute points
z, z_p = g(xy)
#Compute the formula values
z_p_compute = np.array([2*x_vals,2*y_vals])

max_err = np.max(np.abs(z_p - z_p_compute))
print("Max Error: {:.8f}".format(max_err))

### 2.4 Conclusion

PyTorch provides a numpy compatible interface for the numpy functions, so starting with a minimal set of code using numpy, it is easy to scale up to use GPUs and PyTorch.autograd.  The inner workings of PyTorch.autograd are exactly as we expect.

## Overall Conclusion

Both MinPy and PyTorch are environments for Machine Learning, but they provide many benefits to numerical computations and modeling.  These free tools, coupled with Colaboratory, greatly expand the types of mathematical modeling and computations that are available to developers.  MinPy appears to have a smaller footprint, but PyTorch appears to have a larger following (especially when considering ML).  Using both isn't a bad option, depending on the resources available for processing.