In [27]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Numerical differentiation

Suppose we want to apply Newton's method to a function that we know how to evaluate, but for which we have no code to compute the derivative.  Often that code is difficult or error-prone to write, or the interface through which we call the function does not support derivatives.  (Commercial packages often fall in this category.)  So we just have a black-box function $f(x)$ and wish to approximate its derivative,
$$ f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h} .$$

In [28]:
def diff(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

diff(np.sin, 0.7, 1e-8) - np.cos(0.7)

2.297793622041411e-09

In [29]:
x = .5
diff(np.tan, x, 1e-8) - 1/np.cos(x)**2

1.77456063177317e-08

This formula works pretty well for the functions above.  It isn't always so nice, however:

In [30]:
x = 3.14/2
[(h, diff(np.tan, x, h) - 1/np.cos(x)**2)
 for h in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4]]

[(1e-14, -1271.3873432741966),
 (1e-12, 140.14843387738802),
 (1e-10, 0.32726590032689273),
 (1e-08, 19.79343111347407),
 (1e-06, 1982.7670766180381),
 (0.0001, 226466.6387994727)]

Recall that the derivative is ill conditioned for this function
$$ \kappa(f') = |f''| \frac{|x|}{|f'|} $$
but also that our best accuracy is worse than 

In [31]:
y = 3.14/2;
print(
    np.tan(y), # f
    np.cos(y)**(-2), # f'
    np.cos(y)**(-2) * y / np.tan(y), # cond(f)
    2*np.cos(y)**(-3)*np.sin(y) * y * np.cos(y)**2, # cond(f')
)

1255.7655915007897 1576948.220797328 1971.5532288895718 3943.1039573124795


In other cases, we see excellent accuracy so long as the size of $h$ is chosen appropriately.

In [32]:
x = 1e4
[(h, diff(np.log, x, h) - 1/x)
 for h in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]

[(1e-14, -0.0001),
 (1e-12, -0.0001),
 (1e-10, -1.1182158029987482e-05),
 (1e-08, 8.89005823409637e-09),
 (1e-06, 8.274037095116864e-12),
 (0.0001, -9.489531298885641e-12),
 (0.01, -5.0168102921151377e-11)]

Asking the user to choose a good value of $h$ is a leaky abstraction, and it adds unmanageable complexity in many applications.

### Automatically choosing a suitable $h$
One simple and popular choice scales the step with the size of $x$.

In [33]:
def diff_wp(f, x, eps=1e-8):
    """Numerical derivative with Walker and Pernice (1998) choice of step"""
    h = eps * (1 + abs(x))
    return (f(x+h) - f(x)) / h

x = 1e4
[(eps, diff_wp(np.log, x, eps) - 1/x) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]

[(1e-14, -1.1191038926094873e-05),
 (1e-12, -1.109830782817004e-09),
 (1e-10, -1.109830782817004e-09),
 (1e-08, -8.599665508239422e-12),
 (1e-06, -5.0162259288282114e-11),
 (0.0001, -5.000167712764913e-09),
 (0.01, -4.967408090441846e-07)]

In [34]:
x = 0
[(eps, diff_wp(np.exp, x, eps) - np.exp(x)) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]

[(1e-14, -0.0007992778373591136),
 (1e-12, 8.890058234101161e-05),
 (1e-10, 8.274037099909037e-08),
 (1e-08, -6.07747097092215e-09),
 (1e-06, 4.999621836532242e-07),
 (0.0001, 5.0001667140975314e-05),
 (0.01, 0.005016708416794913)]

In both of these experiments, `eps = 1e-8` (that is, $\sqrt{\epsilon_{\text{machine}}}$) produces good results.
This isn't always the case; consider $\log x$ for small values of $x$:

In [35]:
x = 1e-4
[(eps, diff_wp(np.log, x, eps) - 1/x) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]

[(1e-14, -0.11098307828251563),
 (1e-12, -0.0008599665507063037),
 (1e-10, -0.005016225928557105),
 (1e-08, -0.5000167712751136),
 (1e-06, -49.67408090441677),
 (0.0001, -3068.721334766654),
 (0.01, -9538.52419539635)]

This algorithm is imperfect, leaving some scaling responsibility to the user.
It is the default in PETSc's "matrix-free" Newton-type solvers; it's cheap and works well when problems are well scaled.
It's worth considering why we use
$$ h_{WP} = \sqrt{\epsilon_{\text{machine}}} (1 + \lvert x \rvert) $$
instead of the purely relative $h_? = \sqrt{\epsilon_{\text{machine}}} \lvert x \rvert$.
That choice would be scale-invariant, but it breaks down near $x = 0$, where $h \to 0$ while $f(0)$ need not be small, as in the example $f(x) = e^x$.
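
A minimal sketch of that failure mode, using only the standard library. The helper names `step_wp`, `step_relative`, and `forward_diff` are hypothetical and not part of the notebook. It evaluates both step rules at $x = 0$ for $f(x) = e^x$.

```python
import math

def step_wp(x, eps=1e-8):
    """Walker-Pernice step: never smaller than eps."""
    return eps * (1 + abs(x))

def step_relative(x, eps=1e-8):
    """Purely relative step (hypothetical name): collapses to zero at x = 0."""
    return eps * abs(x)

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

x = 0.0
err_wp = forward_diff(math.exp, x, step_wp(x)) - math.exp(x)
print("WP step:", step_wp(x), "error:", err_wp)

# With the purely relative rule, h == 0 and the difference quotient is 0/0.
try:
    forward_diff(math.exp, x, step_relative(x))
    relative_ok = True
except ZeroDivisionError:
    relative_ok = False
print("relative step usable at x = 0?", relative_ok)
```

The `1 +` in the Walker-Pernice rule is what keeps the step bounded away from zero. The price is that the rule is no longer invariant to rescaling $x$.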

### Accuracy of numerical differentiation

#### Taylor expansion

Classical accuracy analysis assumes that functions are sufficiently smooth, meaning that derivatives exist and Taylor expansions are valid within a neighborhood.  In particular,
$$ f(x+h) = f(x) + f'(x) h + f''(x) \frac{h^2}{2!} + \underbrace{f'''(x) \frac{h^3}{3!} + \dotsb}_{O(h^3)} . $$

The big-$O$ notation is meant in the limit $h\to 0$.  Specifically, a function $g(h) \in O(h^p)$ (sometimes written $g(h) = O(h^p)$) when
there exists a constant $C$ such that
$$ |g(h)| \le C h^p $$
for all sufficiently *small* $h$.

**Note:** When analyzing algorithms, we will refer to cost being $O(n^2)$, for example, where $n\to \infty$.
In this case, the above definition applies for sufficiently *large* $n$.  Whether the limit is small ($h\to 0$) or large ($n \to\infty$) should be clear from context.

#### Discretization error
The `diff` and `diff_wp` functions use a "forward difference" formula
$$ \tilde f'(x) := \frac{f(x+h) - f(x)}{h}.$$
Using the Taylor expansion of $f(x+h),$ we compute the discretization error
$$ \begin{split} \frac{f(x+h) - f(x)}{h} - f'(x) &= \frac{f(x) + f'(x) h + f''(x) h^2/2 + O(h^3) - f(x)}{h} - f'(x) \\
&= \frac{f'(x) h + f''(x) h^2/2 + O(h^3)}{h} - f'(x) \\
&= f''(x) h/2 + O(h^2) .
\end{split} $$

This is the *discretization error* caused by choosing a finite (not infinitesimal) differencing parameter $h$, and the leading order term depends linearly on $h$.
This is also called *truncation error*, due to truncating the Taylor series; the terms are interchangeable.
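
As a quick sanity check of the leading term, the sketch below compares the measured forward-difference error for $\sin$ with the prediction $f''(x) h/2$. It uses only the standard library. `forward_diff` is a hypothetical stand-in for the `diff` function above, and the steps are chosen large enough that rounding error should be negligible.

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

x = 0.7
ratios = []
for h in [1e-2, 1e-3, 1e-4]:
    err = forward_diff(math.sin, x, h) - math.cos(x)
    predicted = -math.sin(x) * h / 2   # f''(x) h / 2 with f'' = -sin
    ratios.append(err / predicted)
    print(f"h={h:.0e}  error={err:.3e}  f''(x)h/2={predicted:.3e}")
```

The ratio of measured to predicted error should approach 1 as $h$ shrinks, as long as $h$ stays well above the range where rounding takes over.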

#### Rounding error
We have an additional source of error, *rounding error*, which comes from not being able to compute $f(x)$ or $f(x+h)$ exactly, nor subtract them exactly.  Suppose that we can, however, compute these functions with a relative error on the order of $\epsilon_{\text{machine}}$.  This leads to
$$ \begin{split}
\tilde f(x) &= f(x)(1 + \epsilon_1) \\
\tilde f(x \oplus h) &= \tilde f((x+h)(1 + \epsilon_2)) \\
&= f((x + h)(1 + \epsilon_2))(1 + \epsilon_3) \\
&= [f(x+h) + f'(x+h)(x+h)\epsilon_2 + O(\epsilon_2^2)](1 + \epsilon_3) \\
&= f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)
\end{split}
$$

where each $\epsilon_i$ is an independent relative error on the order of $\epsilon_{\text{machine}}$ and we have used a Taylor expansion at $x+h$ to approximate $f(x \oplus h)$.
We thus write the rounding error in the forward difference approximation as
$$ \begin{split}
\left\lvert \frac{\tilde f(x+h) \ominus \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right\rvert &=
  \left\lvert \frac{f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) - f(x)(1 + \epsilon_1) - f(x+h) + f(x)}{h} \right\rvert \\
  &\le \frac{|f(x+h)\epsilon_3| + |f'(x+h)x\epsilon_2| + |f(x)\epsilon_1| + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}}h)}{h} \\
  &\le \frac{(2 \max_{[x,x+h]} |f| + \max_{[x,x+h]} |f' x|) \epsilon_{\text{machine}} + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)}{h} \\
  &= (2\max|f| + \max|f'x|) \frac{\epsilon_{\text{machine}}}{h} + O(\epsilon_{\text{machine}}) \\
\end{split} $$
where we have assumed that $h \ge \epsilon_{\text{machine}}$.
This error becomes large (relative to $f'$; we are concerned with relative error, after all) when
* $f$ is large compared to $f'$
* $x$ is large
* $h$ is too small
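
The sketch below illustrates the $\epsilon_{\text{machine}}/h$ growth for $f(x) = e^x$ at $x = 1$, using only the standard library. It rests on two assumptions: `sys.float_info.epsilon` stands in for $\epsilon_{\text{machine}}$, and the maxima over $[x, x+h]$ are replaced by values at $x$. The printed estimate is therefore a rough guide rather than a rigorous bound.

```python
import math
import sys

eps = sys.float_info.epsilon  # about 2.2e-16 for IEEE double precision

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

x = 1.0
f = fprime = math.exp(x)  # for exp, f = f' = f''
results = []
for h in [1e-10, 1e-12, 1e-14]:
    err = abs(forward_diff(math.exp, x, h) - fprime)
    # Leading rounding term (2|f| + |f' x|) eps / h, evaluated at x
    estimate = (2 * abs(f) + abs(fprime * x)) * eps / h
    results.append((h, err, estimate))
    print(f"h={h:.0e}  |error|={err:.2e}  rounding estimate={estimate:.2e}")
```

For steps this small the discretization term $f'' h/2$ is negligible, so the observed error is essentially all rounding error.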

#### Total error and optimal $h$

Suppose we would like to choose $h$ to minimize the combined discretization and rounding error,
$$ h^* = \arg\min_h | f''(x) h/2 | + (2\max|f| + \max|f'x|) \frac{\epsilon_{\text{machine}}}{h} $$
(dropping the higher order terms), which we can compute by differentiating with respect to $h$ and setting the result equal to zero
$$ |f''|/2 - (2\max|f| + \max|f'x|) \frac{\epsilon_{\text{machine}}}{h^2} = 0 $$
which can be rearranged as
$$ h^* = \sqrt{\frac{4\max|f| + 2\max|f'x|}{|f''|}} \sqrt{\epsilon_{\text{machine}}} .$$
Of course, this formula is of little use for computing $h$ directly: the whole point is to compute $f'$, which we don't know yet, much less $f''$.
However, it does have value:
* It explains why `1e-8` (i.e., $\sqrt{\epsilon_{\text{machine}}}$) was empirically found to be about optimal for well-behaved/scaled functions.
* It explains why even for the best behaved functions, our best attainable accuracy with forward differencing is $\sqrt{\epsilon_{\text{machine}}}$.
* If we have some special knowledge about the class of functions we need to differentiate, we might have bounds on these quantities and thus an ability to use this formula to improve accuracy.  Alternatively, we could run a parameter sweep to empirically choose a suitable $h$, though we would have to re-tune in response to parameter changes in the class of functions.
* If someone claims to have a simple and robust rule for computing $h$ then this formula tells us how to build a function that breaks their rule.  There are no silver bullets.
* If our numerical differentiation routine produces a poor approximation for some function that we run into in the wild, this helps us explain what happened and how to fix it.
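
For a function whose derivatives we do know, the formula can be evaluated directly. The sketch below does this for $f(x) = e^x$ at $x = 1$. The helper `optimal_h` is hypothetical, and it uses pointwise values in place of the maxima, which is an assumption.

```python
import math
import sys

eps = sys.float_info.epsilon

def optimal_h(f, fp, fpp, x):
    """h* from the formula above, using pointwise values in place of the maxima."""
    return math.sqrt((4 * abs(f) + 2 * abs(fp * x)) / abs(fpp)) * math.sqrt(eps)

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

x = 1.0
e = math.exp(x)
h_star = optimal_h(e, e, e, x)   # sqrt(6 * eps), about 3.65e-8
errors = {}
for label, h in [("h*/1000", h_star / 1000), ("h*", h_star), ("1000 h*", h_star * 1000)]:
    errors[label] = abs(forward_diff(math.exp, x, h) - e)
    print(f"{label:>8}: h={h:.2e}  |error|={errors[label]:.2e}")
```

Here $h^*$ lands within a small factor of $\sqrt{\epsilon_{\text{machine}}}$, consistent with the first bullet above. Making the step a thousand times larger lets discretization error dominate.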

### Centered difference

Instead of the forward difference approximation
$$ \frac{f(x+h) - f(x)}{h} $$
we could use the centered difference formula,
$$ \frac{f(x+h) - f(x-h)}{2h} . $$
(One way to derive this formula is to average a forward and backward difference.  We will learn a more general method later in the course when we do interpolation.)
We can compute the discretization error by Taylor expansion,
$$ \begin{split} \frac{f(x+h) - f(x-h)}{2h} &= \frac{f(x) + f'(x)h + f''(x)h^2/2 + f'''(x)h^3/6 - f(x) + f'(x)h - f''(x)h^2/2 + f'''(x) h^3/6 + O(h^4)}{2h} \\
&= f'(x) + f'''(x)h^2/6 + O(h^3) \end{split} $$
showing that the leading error term is of order $h^2$, versus order $h$ for forward differences.
A similar computation including rounding error will find that the optimal $h$ is now of order $\sqrt[3]{\epsilon_{\text{machine}}}$ so the best attainable accuracy is $\epsilon_{\text{machine}}^{2/3}$.
This accuracy improvement (versus $\sqrt{\epsilon_{\text{machine}}}$) is significant, but we'll also see that it is twice as expensive when computing derivatives of multi-variate functions.
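
A minimal comparison of the two formulas on $\sin$, using only the standard library. The helper names are hypothetical. The steps $\sqrt{\epsilon_{\text{machine}}}$ and $\sqrt[3]{\epsilon_{\text{machine}}}$ have only the right order of magnitude and are not tuned optima.

```python
import math
import sys

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def centered_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
exact = math.cos(x)

# Fixed steps where discretization error dominates: O(h) versus O(h^2)
errors = []
for h in [1e-2, 1e-3, 1e-4]:
    err_fwd = abs(forward_diff(math.sin, x, h) - exact)
    err_ctr = abs(centered_diff(math.sin, x, h) - exact)
    errors.append((h, err_fwd, err_ctr))
    print(f"h={h:.0e}  forward={err_fwd:.2e}  centered={err_ctr:.2e}")

# Steps of the near-optimal order for each formula
eps = sys.float_info.epsilon
best_fwd = abs(forward_diff(math.sin, x, math.sqrt(eps)) - exact)
best_ctr = abs(centered_diff(math.sin, x, eps ** (1 / 3)) - exact)
print(f"forward  at h=sqrt(eps): {best_fwd:.2e}")
print(f"centered at h=cbrt(eps): {best_ctr:.2e}")
```

Reducing $h$ by a factor of 10 should cut the forward error by about 10 and the centered error by about 100, until rounding error intervenes.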

# Symbolic differentiation

We've been differentiating basic mathematical functions, for which there is a formula for the derivative.
Symbolic differentiation is a tool that can compute those expressions (and generate code to evaluate the expressions numerically).

In [36]:
import sympy
from sympy.abc import x

f = sympy.cos(x**sympy.pi) * sympy.log(x)
f

log(x)*cos(x**pi)

In [37]:
sympy.diff(f, x)

-pi*x**pi*log(x)*sin(x**pi)/x + cos(x**pi)/x

In [38]:
sympy.ccode(f, 'y')

'y = log(x)*cos(pow(x, M_PI));'

In [39]:
sympy.fcode(f, 'y')

'      y = log(x)*cos(x**3.1415926535897932d0)'

In [40]:
f.evalf(40, subs={x: 1.9})

0.2155134138380419067452319459177557208730

In [41]:
def g(x, m=np):
    y = x
    for i in range(2):
        # a = m.log(y)
        # b = y ** m.pi
        # c = m.cos(b)
        # y = c * a
        y = m.cos(y**m.pi) * m.log(y)
    return y

gexpr = g(x, m=sympy)
gexpr

log(log(x)*cos(x**pi))*cos((log(x)*cos(x**pi))**pi)

In [42]:
sympy.diff(gexpr, x)

-pi*(log(x)*cos(x**pi))**pi*(-pi*x**pi*log(x)*sin(x**pi)/x + cos(x**pi)/x)*log(log(x)*cos(x**pi))*sin((log(x)*cos(x**pi))**pi)/(log(x)*cos(x**pi)) + (-pi*x**pi*log(x)*sin(x**pi)/x + cos(x**pi)/x)*cos((log(x)*cos(x**pi))**pi)/(log(x)*cos(x**pi))

# Hand-coding derivatives

The size of these expressions grows exponentially in the number of loop iterations, yet one can write efficient code for computing the derivative by hand.  We use the variational notation

$$ \operatorname{d} f = f'(x) \operatorname{d} x $$

which allows us to break a large computation into simple pieces that we can compute incrementally, instead of trying to build up expressions for complicated functions.  That is, we can differentiate a composition $h(g(f(x)))$ as

\begin{align}
  \operatorname{d} h &= h' \operatorname{d} g \\
  \operatorname{d} g &= g' \operatorname{d} f \\
  \operatorname{d} f &= f' \operatorname{d} x.
\end{align}
Consider our example above.

In [17]:
def gprime(x):
    y = x
    dy = 1
    for i in range(2):
        a = np.log(y)
        da = 1/y * dy
        b = y ** np.pi
        db = np.pi * y ** (np.pi - 1) * dy
        c = np.cos(b)
        dc = -np.sin(b) * db
        y = c * a
        dy = dc * a + c * da
    return y, dy

print('by hand', gprime(1.9))
print('numerical', diff_wp(g, 1.9))

by hand (-1.5346823414986814, -34.03241959914048)
numerical -34.032439961925064


* This code is pretty mechanical to write
* It's hard to maintain as you add new features
* It's hard to debug
  * You can test using finite differencing
  * You can take apart pieces for unit testing and/or debugging
* If you know you'll be writing this sort of code, plan ahead!

### Variational notation is handy (an example)

We'll differentiate the expression

$$ I = A^{-1} A $$
applying the product rule

$$ 0 = A^{-1} (\operatorname dA) + (\operatorname dA^{-1}) A $$
and collect terms

$$ \operatorname dA^{-1} = - A^{-1} (\operatorname dA) A^{-1}. $$

This expression for the derivative $\operatorname d A^{-1}$ in direction $\operatorname d A$ is useful when differentiating algorithms that involve linear algebra.
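
The identity can be checked numerically. The sketch below uses $2\times 2$ matrices stored as nested lists so that it runs without extra dependencies; in the notebook one would more naturally use `np.linalg.inv`. The matrices `A` and `dA` are arbitrary illustrative values, and `matmul` and `inv` are hypothetical helpers.

```python
def matmul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 3.0]]
dA = [[0.3, -0.2], [0.5, 0.1]]   # arbitrary perturbation direction
t = 1e-6

# Variational formula: d(A^{-1}) = -A^{-1} dA A^{-1}
Ainv = inv(A)
analytic = [[-v for v in row] for row in matmul(matmul(Ainv, dA), Ainv)]

# Forward difference of the inverse in direction dA
A_pert = [[A[i][j] + t * dA[i][j] for j in range(2)] for i in range(2)]
Ainv_pert = inv(A_pert)
numeric = [[(Ainv_pert[i][j] - Ainv[i][j]) / t for j in range(2)]
           for i in range(2)]

max_diff = max(abs(analytic[i][j] - numeric[i][j])
               for i in range(2) for j in range(2))
print("max entrywise difference:", max_diff)
```

The difference should be of order $t$, as expected for a forward difference.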

## Reverse-mode

What we've done above is called "forward mode", and amounts to placing the parentheses in the chain rule like

$$ \operatorname d h = \frac{dh}{dg} \left(\frac{dg}{df} \left(\frac{df}{dx} \operatorname d x \right) \right) .$$

The expression means the same thing if we rearrange the parentheses,

$$ \operatorname d h = \left( \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname d x .$$

In [18]:
def gprime_rev(x):
    # First compute all the values by going through the iteration forwards
    # I'm unrolling two iterations here for clarity ("static single assignment" form)
    # It is possible to write code that keeps the loop structure.
    a1 = np.log(x)
    b1 = x ** np.pi
    c1 = np.cos(b1)
    y1 = c1 * a1
    a2 = np.log(y1)
    b2 = y1 ** np.pi
    c2 = np.cos(b2)
    y = c2 * a2 # Result
    # Now go backwards computing dy/d_ for each variable
    y_ = 1
    y_c2 = y_ * a2
    y_a2 = c2 * y_
    y_b2 = -y_c2 * np.sin(b2) # dy/db2 = dy/dc2 dc2/db2
    y_y1 = y_b2 * np.pi * y1 ** (np.pi - 1) + y_a2 / y1
    y_c1 = y_y1 * a1
    y_a1 = c1 * y_y1
    y_b1 = -y_c1 * np.sin(b1)
    y_x = y_b1 * np.pi * x ** (np.pi - 1) + y_a1 / x
    return y, y_x

print('forward', gprime(1.9))
print('reverse', gprime_rev(1.9))

forward (-1.5346823414986814, -34.03241959914048)
reverse (-1.5346823414986814, -34.03241959914049)


* This is fairly mechanical, similar to forward-mode
* It is more complicated than forward-mode
* This sort of code is tricky to debug
  * You can test using forward-mode or finite differencing
* We need the results of intermediate computation in reverse order
  * We have to store those values somewhere ("taping" in the literature)
  * Or we have to recompute them (see "hierarchical checkpointing")
* Reverse-mode is also known as the "adjoint" method and "back-propagation".
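
The unrolled `gprime_rev` can also be written with the loop intact, by recording intermediate values on a tape during the forward sweep and consuming them in reverse. This is a minimal sketch using only the standard library. The names `g_forward_with_tape` and `gprime_rev_loop` are hypothetical. The default of two iterations matches the example above; more iterations at $x = 1.9$ would take the log of a negative number.

```python
import math

def g_forward_with_tape(x, n=2):
    """Evaluate y <- cos(y**pi) * log(y) n times, recording what the reverse pass needs."""
    tape = []
    y = x
    for _ in range(n):
        a = math.log(y)
        b = y ** math.pi
        c = math.cos(b)
        tape.append((y, a, b, c))   # input and intermediates of this iteration
        y = c * a
    return y, tape

def gprime_rev_loop(x, n=2):
    y, tape = g_forward_with_tape(x, n)
    y_bar = 1.0                     # d(result)/d(result)
    for y_in, a, b, c in reversed(tape):
        c_bar = y_bar * a           # y = c * a
        a_bar = y_bar * c
        b_bar = -c_bar * math.sin(b)                      # c = cos(b)
        # y_in feeds both b = y_in**pi and a = log(y_in)
        y_bar = b_bar * math.pi * y_in ** (math.pi - 1) + a_bar / y_in
    return y, y_bar

print('reverse (loop)', gprime_rev_loop(1.9))
```

The tape grows linearly with the number of iterations. That storage cost is what checkpointing schemes are designed to reduce.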
  
### Why reverse?

If all we had were scalar functions of scalar inputs, we would never use reverse mode.  But suppose we are given a dot product with a constant vector,

$$ f(\mathbf x) = \mathbf c^T \mathbf x = \begin{pmatrix} c_0 & c_1 & c_2 & \dotsb \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \end{pmatrix} $$
and wish to compute the gradient
$$ \nabla_{\mathbf x} f = \frac{\partial f}{\partial \mathbf x} = \begin{pmatrix} \frac{\partial f}{\partial x_0} & \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \dotsb \end{pmatrix} . $$

In [45]:
def dot(c, x):
    n = len(c)
    sum = 0
    for i in range(n):
        sum += c[i] * x[i]
    return sum
        
n = 20
c = np.random.randn(n)
x = np.random.randn(n)
f = dot(c, x)
f

-0.7881902865938712

If we use forward mode, we can only compute one direction at a time, effectively
$$ \left(\nabla_{\mathbf x} f\right) \cdot \operatorname d x $$
for one value of the vector $\operatorname d x$ at a time.
We can compute the full gradient by choosing $\operatorname d x$ to be each column of the identity.

In [46]:
def dot_x(c, x, dx):
    """Compute derivative in direction dx"""
    n = len(c)
    dsum = 0
    for i in range(n):
        dsum += c[i] * dx[i]
    return dsum

def grad_dot(c, x):
    n = len(c)
    I = np.eye(n)
    grad = np.zeros(n)
    for j in range(n):
        dx = I[:,j]
        grad[j] = dot_x(c, x, dx)
    return grad

grad_dot(c, x)

array([ 0.41288309, -0.02618   , -1.63423663,  0.3006944 , -0.25078029,
       -0.45382014,  1.94208405,  0.30730437,  0.45064477, -0.89218614,
       -1.25475455, -0.76033123, -0.54544687, -1.40029722, -0.74489052,
       -0.18788402, -1.00248867,  1.5845576 , -0.29049147,  0.98277278])

We've now traversed the loop once for each component of the vector.  The forward evaluation of `dot` costs $O(n)$, and computing the gradient costs $O(n^2)$ because we do $O(n)$ work for each of the $n$ directions.

Compare with reverse-mode:

In [49]:
def grad_dot_rev(c, x):
    n = len(c)
    sum_ = np.zeros(n)
    for i in range(n):
        sum_[i] = c[i]
    return sum_

grad_dot_rev(c, x)

array([ 0.41288309, -0.02618   , -1.63423663,  0.3006944 , -0.25078029,
       -0.45382014,  1.94208405,  0.30730437,  0.45064477, -0.89218614,
       -1.25475455, -0.76033123, -0.54544687, -1.40029722, -0.74489052,
       -0.18788402, -1.00248867,  1.5845576 , -0.29049147,  0.98277278])

* We get the same values in only $O(n)$ work!
* The astute reader may recall that we already worked out this case,
$$ \frac{\partial \mathbf c^T \mathbf x}{\partial \mathbf x} = \mathbf c^T .$$

## Shape of the gradient (Jacobian)

Suppose we have a vector-valued function of vector-valued input, $\mathbf f(\mathbf x)$ where $\mathbf f$ has length $m$ and $\mathbf x$ has length $n$.
* The gradient (Jacobian) matrix $J = \nabla_{\mathbf x} \mathbf f$ has shape $m\times n$.
* Usually in optimization, $m=1$ because we only have one objective
* If $m\ll n$ then finite differencing and forward-mode differentiation will be much more expensive than reverse-mode differentiation
  * Find a way to use reverse-mode!
* If $m \approx n$ then either is about as efficient, but forward-mode is simpler.
* If $m \gg n$ then forward-mode is the ticket.
* In real computations, there may be expensive stages that have lower-dimensional inputs or outputs, in which case that structure can be exploited. An example is
$$ \mathbf f(\mathbf x) = \mathbf q \sigma(\mathbf q^T \mathbf x) $$
where $\sigma$ is an expensive nonlinear function.
The Jacobian $J = \nabla_{\mathbf x} \mathbf f$ is a square matrix, but naive forward- and reverse-mode would both require $n$ evaluations of $\sigma$.
Since $\sigma$ is a scalar-valued function of a scalar argument, $\sigma'(\mathbf q^T \mathbf x)$ is just one number, and thus $J = (\sigma') \mathbf q \mathbf q^T$ is readily available (and you know it's rank-1 so don't need to store all $n^2$ entries). Models of this sort show up frequently in physical modeling.
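
A small sketch of exploiting that structure, in pure Python. Here `tanh` is only a stand-in for the expensive $\sigma$, which is an assumption, and the vectors are arbitrary illustrative values. The product $J \mathbf v$ is formed from one evaluation of $\sigma'$ and two dot products, then checked against a forward difference of $\mathbf f$ in direction $\mathbf v$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigma(s):
    """Stand-in for an expensive scalar nonlinearity (assumption: tanh)."""
    return math.tanh(s)

def sigma_prime(s):
    return 1 - math.tanh(s) ** 2

def f(q, x):
    s = sigma(dot(q, x))
    return [qi * s for qi in q]

def jvp(q, x, v):
    """Apply J = sigma'(q^T x) q q^T to v without forming the n-by-n matrix."""
    scale = sigma_prime(dot(q, x)) * dot(q, v)   # one sigma' evaluation
    return [qi * scale for qi in q]

q = [0.5, -1.0, 2.0]
x = [0.1, 0.2, 0.3]
v = [1.0, 0.0, -1.0]
h = 1e-6
x_pert = [xi + h * vi for xi, vi in zip(x, v)]
fd = [(fp - f0) / h for fp, f0 in zip(f(q, x_pert), f(q, x))]
max_diff = max(abs(a - b) for a, b in zip(jvp(q, x, v), fd))
print("max difference from finite differencing:", max_diff)
```

Storing $\mathbf q$ and the scalar $\sigma'$ takes $O(n)$ memory, versus $O(n^2)$ for the dense Jacobian.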

# Algorithmic (automatic) differentiation

Next, we'll consider ways to have libraries/compilers generate by-hand code such as we see above.
We'll use the [JAX](https://jax.readthedocs.io/en/latest/) library, which offers differentiation of NumPy computations (and offload to GPUs, which we won't use now).
Uncomment the line below if you need to install `jax` and `jaxlib`.

In [22]:
# ! pip install jax jaxlib

In [50]:
import jax
import jax.numpy as jnp

def g_jax(x):
    """Same function as before, but using jnp in place of np."""
    y = x
    for i in range(2):
        y = jnp.cos(y**jnp.pi) * jnp.log(y)
    return y

gprime_jax = jax.grad(g_jax)
print(gprime_jax(1.9))
print(gprime(1.9)[1])

-34.03244
-34.03241959914048


In [51]:
jax.grad(dot)(c, x), x # Differentiates with respect to argument zero

(DeviceArray([-1.9289212 ,  2.097858  ,  0.20503117,  0.22631234,
              -1.2956989 ,  0.5156321 , -0.5992178 ,  0.56441456,
              -0.23191658,  0.07927939,  0.451759  ,  1.7861412 ,
              -1.1517502 ,  0.4325543 ,  0.26122522, -0.44151178,
              -1.85014   ,  0.8240289 ,  0.69433916,  0.46808633],            dtype=float32),
 array([-1.92892122,  2.09785787,  0.20503117,  0.22631233, -1.29569892,
         0.51563212, -0.59921775,  0.56441458, -0.23191657,  0.07927939,
         0.451759  ,  1.78614113, -1.15175023,  0.4325543 ,  0.26122522,
        -0.44151178, -1.85014002,  0.82402888,  0.69433915,  0.46808634]))

In [52]:
jax.grad(dot, argnums=1)(c, x) - c

DeviceArray([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
             0., 0., 0., 0., 0.], dtype=float32)

## Software

* Algorithmic differentiation (AD) software has been around for over 40 years
* There are two classical approaches
  * Source transformation: AD tool emits Fortran (or C, etc.) code, which is compiled by a normal compiler
  * Operator overloading: each basic operation is overloaded to transform objects holding values + derivatives
* Source transformation is usually more efficient, retaining loop structure, etc.
* Implementations tend to have poor ergonomics, odd restrictions on use, poor composition.
* Vectorization has been poor with most classical tools.
* AD *implementations* have come a long way in the past few years (despite the math being old)
  * Just-in-time compilation and extensive software engineering
* Exemplars:
  * [JAX](https://jax.readthedocs.io/en/latest/) for Python
  * [Zygote.jl](http://fluxml.ai/Zygote.jl/latest/) for Julia
* AD is great within its domain, but is still intrusive (especially for multi-language projects, languages with poor AD tooling, etc.).  Even in JAX, you'll see [various constraints](https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html); for example, you can't update an array in place.

In [26]:
z = jnp.zeros(3)
z[1] = 1

TypeError: '<class 'jax.interpreters.xla.DeviceArray'>' object does not support item assignment. JAX arrays are immutable; perhaps you want jax.ops.index_update or jax.ops.index_add instead?

* If you work in this space, you'll eventually learn to judge when to use AD and when to hand-code a derivative.  This type of decision lies at the intersection of numerical analysis and software engineering.
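
To make the operator-overloading approach from the Software list concrete, here is a toy forward-mode sketch in pure Python. The `Dual` class and the `cos`, `log`, and `g_dual` helpers are hypothetical and support only the operations needed for the running example. This is not how JAX is implemented.

```python
import math

class Dual:
    """A value together with its derivative along one input direction."""
    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

    def __pow__(self, p):
        # Power rule; only a constant real exponent is supported
        return Dual(self.val ** p, p * self.val ** (p - 1) * self.der)

def cos(u):
    return Dual(math.cos(u.val), -math.sin(u.val) * u.der)

def log(u):
    return Dual(math.log(u.val), u.der / u.val)

def g_dual(x):
    y = Dual(x, 1.0)   # seed with dx = 1
    for _ in range(2):
        y = cos(y ** math.pi) * log(y)
    return y.val, y.der

print('dual numbers', g_dual(1.9))
```

The body of `g_dual` is the same expression as in `g`. The derivative bookkeeping that `gprime` spelled out by hand now lives in the overloaded operations.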