<img src=../figures/Brown_logo.svg width=50%>

# Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 1b: Finding gradients

### Suryanarayanan M. S. | <a href = "mailto: s.manojsanu@tudelft.nl">s.manojsanu@tudelft.nl</a>  | PhD Candidate

# Outline for today's lecture
* Finite differencing
* Symbolic differentiation
* Automatic differentiation

**References:**
* J. R. R. Martins & Andrew Ning, Engineering Design Optimization, 2021 - Chapter-6
* Extras:
    * Nocedal, Jorge, and Stephen J. Wright. Numerical optimization. Springer Science & Business Media, 2006. - Chapter 8
    * Automatic Differentiation in Machine Learning: a Survey, https://arxiv.org/abs/1502.05767

## Four main methods

- Hand calculation (Analytical)
- Finite differences 
- Symbolic differentiation 
- Automatic differentiation

## A. Hand calculations

* Trusting your high-school calculus     
        
* Chain-rule for derivatives
    * $y = f(x) \Rightarrow \frac{dy}{dx} = f^{'}(x)$
    * $z = g(y) \Rightarrow \frac{dz}{dy} = g^{'}(y)$
    <img align=right src=./data/hand_rule_graph.png width=40%>
    * $\frac{dz}{dx} = ?$
    


$$\frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx}$$
<img align=center src=./data/hand_rule_sliders.png width=40%>

### Slightly more complicated
<img align=center src=./data/hand_with_branches.png width=40%>

* $f: \mathcal{R} \rightarrow \mathcal{R}^2$ - A vector valued function
* $g: \mathcal{R}^2 \rightarrow \mathcal{R}$

$$\frac{dz}{dx} = \Big(\frac{\partial z}{\partial y_1} \times \frac{d y_1}{dx} \Big) + \Big(\frac{\partial z}{\partial y_2} \times \frac{d y_2}{dx}\Big)$$


### For a one layer neural network


* Input = $x$
* Linear layer ($l$), $z = Wx + b$
* Non-linearity, $y = \sigma(z)$
* MSE loss, $L = 0.5 \times (y - \bar{y})^2$

<img align=center src=./data/nn_hand.png width=40%>


#### Exercise 
* Find $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$

Derivation:
* $\frac{\partial L}{\partial y} = (y-\bar{y})$
* $\frac{dy}{dz} = \sigma^{'}$
* $\frac{\partial z}{\partial W} = x$
* $\frac{\partial z}{\partial b} = 1$

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \times \frac{dy}{dz} \times  \frac{\partial z}{\partial W}$$
$$= (y-\bar{y}).\sigma^{'}.x$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \times \frac{dy}{dz} \times  \frac{\partial z}{\partial b}$$
$$= (y-\bar{y}).\sigma^{'}.1$$
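
The hand-derived expressions can be checked numerically. The sketch below is only an illustration: it assumes a scalar input, a sigmoid for $\sigma$, and arbitrary values for $W$, $b$, $x$ and $\bar{y}$ (none of these come from the lecture), and the helper names `loss` and `hand_gradients` are hypothetical. It compares the chain-rule result with a central finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(W, b, x, y_bar):
    """Forward pass: z = W*x + b, y = sigma(z), L = 0.5*(y - y_bar)^2."""
    y = sigmoid(W * x + b)
    return 0.5 * (y - y_bar) ** 2

def hand_gradients(W, b, x, y_bar):
    """Gradients obtained from the chain rule derived by hand."""
    y = sigmoid(W * x + b)
    sigma_prime = y * (1.0 - y)          # derivative of the sigmoid at z
    common = (y - y_bar) * sigma_prime   # factor shared by both gradients
    return common * x, common * 1.0

# Illustrative values (assumed)
W, b, x, y_bar = 0.7, -0.2, 1.5, 0.3
dL_dW, dL_db = hand_gradients(W, b, x, y_bar)

# Central finite-difference check
h = 1e-6
fd_dW = (loss(W + h, b, x, y_bar) - loss(W - h, b, x, y_bar)) / (2 * h)
fd_db = (loss(W, b + h, x, y_bar) - loss(W, b - h, x, y_bar)) / (2 * h)
print("dL/dW:", dL_dW, "vs finite difference:", fd_dW)
print("dL/db:", dL_db, "vs finite difference:", fd_db)
```

Note that `common` is computed once and reused, which is exactly the redundancy a careful hand derivation tries to exploit.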


* Pros:
    * Fast
    * Exact
    * No special software needed
* Cons:
    * *Trust* your skills
        * error-prone
    * Time-consuming
    * Redundancies!
        * The factor $(y-\bar{y}).\sigma^{'}$ is common to both gradients but is derived twice


## B. Finite differencing

- Formula: $$\frac{df(x)}{dx} \approx \frac{f(x + h) - f(x)}{h}$$
- Standard way to check every other method!

* Approximate the derivative by a small change in the input
    * Definition of derivative
    
**Taylor series makes a comeback**
* From 1D Taylor series
$$f(x + h) = f(x) + h\frac{df}{dx} + \mathcal{O}(h^2)$$
$$f(x + h) - f(x) = h\frac{df}{dx} + \mathcal{O}(h^2)$$
$$\frac{df}{dx} = \frac{f(x + h) - f(x)}{h} + \mathcal{O}(h)$$

**In n-dimensions**

We saw that the Taylor series is expanded along a direction $\vec{p}$!
$$ f(\vec{x}_0 + \alpha \vec{p}) \approx f(\vec{x}_0) + \alpha \nabla f(\vec{x}_0)^T \vec{p} + \frac{1}{2}\alpha^2 \vec{p}^T \mathbf{H}(\vec{x}_0) \vec{p}$$

For finite differencing, the directions are the unit-vectors along a coordinate axis.
$$f(\vec{x} + h \hat{e}_j) = f(\vec{x}) + h \frac{\partial f}{ \partial x_j} +  \mathcal{O}(h^2)$$
$$\frac{\partial f}{ \partial x_j} = \frac{f(\vec{x} + h \hat{e}_j) - f(\vec{x})}{h} + \mathcal{O}(h)$$
$$ j= 1, 2, 3, ..., n$$
where $n$ is the dimensionality of $\vec{x}$


Strictly,
* This is the forward finite difference method
* There are backward and central finite difference methods as well
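
A minimal sketch of the forward formula above, together with the central variant $\frac{f(\vec{x} + h \hat{e}_j) - f(\vec{x} - h \hat{e}_j)}{2h}$, is given below. The function name `fd_gradient`, the default step `h=1e-6` and the test function are assumptions chosen for illustration.

```python
def fd_gradient(f, x, h=1e-6, scheme="forward"):
    """Approximate the gradient of f: R^n -> R at x by finite differences."""
    n = len(x)
    grad = [0.0] * n
    f0 = f(x)  # reused by every component in the forward scheme
    for j in range(n):
        x_plus = list(x)
        x_plus[j] += h                    # x + h * e_j
        if scheme == "forward":
            grad[j] = (f(x_plus) - f0) / h
        elif scheme == "central":
            x_minus = list(x)
            x_minus[j] -= h               # x - h * e_j
            grad[j] = (f(x_plus) - f(x_minus)) / (2 * h)
        else:
            raise ValueError("scheme must be 'forward' or 'central'")
    return grad

def f(x):
    # Test function (assumed): f = x1^2 + 3*x1*x2, exact gradient [2*x1 + 3*x2, 3*x1]
    return x[0] ** 2 + 3.0 * x[0] * x[1]

x0 = [1.0, 2.0]
g_forward = fd_gradient(f, x0)
g_central = fd_gradient(f, x0, scheme="central")
print(g_forward, g_central)   # both should be close to the exact gradient [8, 3]
```

The loop over `j` makes the cost visible: one perturbed evaluation per coordinate direction for the forward scheme, two for the central scheme.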

### Notes
In 2D, it is easier to see how the first formula gets transformed into the second.
* If $\vec{x}  = \begin{bmatrix} x_1  \\ x_2 \end{bmatrix} \in \mathcal{R}^2$
* Then we need two derivatives to form the gradient of f i.e.
    $$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2}\end{bmatrix}$$
* We have two coordinate directions as well
$$\vec{p} = \begin{bmatrix} 1  \\ 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 0  \\ 1 \end{bmatrix}$$

* To get the first component of the gradient, we use the corresponding direction, i.e. $\vec{p} = \begin{bmatrix} 1  \\ 0 \end{bmatrix}$ gives $\frac{\partial f}{\partial x_1}$

**Pros:**
* Simple to implement
* Works for any function (black-box)
* Used to check other methods

**Cons:**
- Not exact - depends on the choice of h
    * $h$ needs to be small for accurate gradients [From definition] - Truncation error
    * If $h$ is too small, finite precision errors creep in - Roundoff error
- Slow - a forward-difference gradient in $n$ dimensions needs $n + 1$ function evaluations
    - E.g. For a neural network, we need one extra evaluation of the function per parameter
        - What if we have 1 million parameters? Or if the function is a simulation!
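
The trade-off between truncation and roundoff error can be seen by sweeping $h$. The sketch below uses $f(x) = \sin(x)$ at $x = 1$ as an assumed test case (its exact derivative is $\cos(1)$) and prints the forward-difference error for step sizes from $10^{-1}$ down to $10^{-15}$.

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)   # exact derivative of sin at x

errors = {}           # exponent k -> absolute error for h = 10^-k
for k in range(1, 16):
    h = 10.0 ** (-k)
    errors[k] = abs(forward_diff(math.sin, x, h) - exact)
    print(f"h = 1e-{k:02d}   error = {errors[k]:.3e}")
```

In double precision the error typically shrinks as $h$ decreases, reaches a minimum at a moderate step size, and then grows again once roundoff dominates.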


## C. Symbolic differentiation
- Use symbolic data types to represent mathematical expressions
- Use equations of calculus
    - Basically what you did by hand, but now automated!

<img align=right src=./data/expression_swell.png width=40%>

**Pros:**
- Exact
- The formula identifies problem structure!

**Cons:**
- Slow due to redundancies
- Not scalable - expression swell
- Wasteful
    - We need the gradient's value at a point and not a formula!
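
Expression swell can be illustrated with a toy symbolic differentiator. This is a sketch under stated assumptions, not how a real computer algebra system works: expressions are nested tuples `("add", a, b)` or `("mul", a, b)`, the only variable is the string `"x"`, and no simplification is performed. The names `diff`, `size` and `evaluate` are hypothetical.

```python
def diff(e):
    """Differentiate expression e with respect to "x" (no simplification)."""
    if isinstance(e, (int, float)):
        return 0
    if e == "x":
        return 1
    op, a, b = e
    if op == "add":   # sum rule
        return ("add", diff(a), diff(b))
    if op == "mul":   # product rule
        return ("add", ("mul", diff(a), b), ("mul", a, diff(b)))
    raise ValueError(f"unknown operation: {op}")

def size(e):
    """Number of nodes in the expression tree."""
    if isinstance(e, tuple):
        return 1 + size(e[1]) + size(e[2])
    return 1

def evaluate(e, x):
    """Numerical value of expression e at the point x."""
    if isinstance(e, (int, float)):
        return e
    if e == "x":
        return x
    op, a, b = e
    va, vb = evaluate(a, x), evaluate(b, x)
    return va + vb if op == "add" else va * vb

expr = "x"
for k in range(1, 5):
    expr = ("mul", expr, expr)        # repeated squaring: expr = x^(2^k)
    d_expr = diff(expr)
    print(f"k={k}  size(f)={size(expr)}  size(df/dx)={size(d_expr)}")
```

The derivative tree grows noticeably faster than the expression itself, even though only a single number (the value of the derivative at a point) is usually needed.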

## D. Automatic differentiation

* aka Algorithmic differentiation (or autodiff)
* Combines the best aspects of symbolic and numerical differentiation
* a. Use on any function (like Finite Differences)
    - branches, loops, etc.
* b. Exact (like Symbolic differentiation)
    - No truncation error (accurate up to floating-point roundoff)
* c. Fast and scalable (unlike both)
* d. Modular

**Core idea**
* Every function is made from elementary operations (in a computer!).
* We know the derivative of each elementary operation.
* So, we can compute the derivative of the entire function by applying the chain rule repeatedly!

**For implementation:**
<img align=right src=./data/comp_graph.png width=40%>
- View the program as a directed acyclic computational graph!
    - Variables connected by operations (as nodes)
    - Any function can be an operation
    - Obtained usually by tracing variables
- Look up the rules for each operation somewhere
    - E.g. `def sin_derivative(x): return cos(x)`
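
The core idea can be sketched for the simplest case, a chain of unary elementary operations. The lookup table `DERIVATIVE_RULES` and the helper `value_and_derivative` below are hypothetical names for illustration; real autodiff systems handle general graphs, not just chains.

```python
import math

# Lookup table: elementary operation -> its derivative (unary operations only)
DERIVATIVE_RULES = {
    math.sin: math.cos,
    math.exp: math.exp,
    math.log: lambda x: 1.0 / x,
}

def value_and_derivative(ops, x):
    """Evaluate ops[-1](...ops[0](x)) and its derivative at x via the chain rule."""
    value, deriv = x, 1.0                        # dx/dx = 1
    for op in ops:
        deriv *= DERIVATIVE_RULES[op](value)     # local derivative at the current value
        value = op(value)
    return value, deriv

# f(x) = log(exp(sin(x))) = sin(x), so f'(x) should equal cos(x)
val, der = value_and_derivative([math.sin, math.exp, math.log], 0.5)
print(val, der)
```

Only numerical values are passed along; no formula for the derivative of the whole function is ever built.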
    

#### Two modes of automatic differentiation

- Direction of traversing the graph
<img align=right src=./data/forward_vs_backward.png width=40%>
- Forward accumulation
    - Derivatives "flow" along with the program execution
    - Inputs to outputs
- Reverse accumulation
    - Derivative computation is done once the execution is over
    - Similar to how we did by hand [Working backwards]
    - Outputs to inputs

**Analogy: Thinking of autodiff as a pipe-system**

<img align=right src=./data/pipe_ad.png width=20%>

- Input channels ($n$)
- Output channels ($m$)
- Many many intermediate operations
    - Connects the inputs to the outputs
- Unconnected parts have no influence on one another
    - Derivatives = 0
- Cyclic dependencies
    - Causes flow stagnation
    - A big NO!

#### Forward-mode 

<img align=right src=./data/forward_1.png width=30%>

* If you want to know the partial derivative w.r.t. one of the inputs
    * You start from that variable
        * aka Seeding
    * Flow through the graph
    * Get "accumulated" at the outputs
* It does not matter how many outputs you have
    * You get the effect of the input on all the outputs!
* If $x_1$ is seeded, we will get an ($m \times 1$) matrix
$$\begin{bmatrix}\frac{\partial y_1}{\partial x_1} \\ \frac{\partial y_2}{\partial x_1} \\ \frac{\partial y_3}{\partial x_1} \\ ... \end{bmatrix}$$

**We get the influence of $x_1$ on all outputs in one go!**

#### Reverse-mode

<img align=right src=./data/reverse_1.png width=15%>

* If you want *all* partial derivatives of *one* of the outputs!
    * Seed the output you are interested in
    * Flow through the graph
    * This is essentially the gradient ($1 \times n$) matrix

$$\begin{bmatrix}\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} &... \end{bmatrix}$$

* It does not matter how many inputs there are
    * You will get the effect of all of them on a single output    
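
A toy reverse-mode implementation makes the "seed the output, flow backwards" picture concrete. This is a minimal sketch, assuming scalar values and only three operations; the class `Var` and the function `backward` are hypothetical names and do not correspond to any particular framework.

```python
import math

class Var:
    """A value that remembers which Vars produced it and the local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent Var, d(self)/d(parent))
        self.grad = 0.0          # accumulates d(output)/d(self)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Seed the output with 1 and push derivatives back towards the inputs."""
    order, seen = [], set()
    def visit(node):             # depth-first post-order: parents before consumers
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0            # seeding
    for node in reversed(order): # consumers are processed before their parents
        for parent, local in node.parents:
            parent.grad += node.grad * local

x1, x2, x3 = Var(2.0), Var(3.0), Var(0.5)
y = x1 * x2 + sin(x3) * x1       # y = x1*x2 + x1*sin(x3)
backward(y)
print(x1.grad, x2.grad, x3.grad) # all three partial derivatives from one backward pass
```

A single call to `backward` fills in the derivative of the one output with respect to every input, regardless of how many inputs there are.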

## Mathematics behind autodiff

* Consider a general function $f: \mathcal{R}^n \rightarrow \mathcal{R}^m$
    * Domain = $\mathcal{R}^n$; $n$ inputs $\begin{bmatrix}x_1 \\ x_2 \\... \\x_n \end{bmatrix}$
    * Range = $\mathcal{R}^m$; $m$ outputs $\begin{bmatrix}y_1 \\ y_2 \\... \\y_m \end{bmatrix}$
    * $\vec{y} = f(\vec{x})$
* We can define the jacobian matrix ($\mathcal{J}$) as:
$$\mathcal{J} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & ... & \frac{\partial y_1}{\partial x_n} \\
... & ... & ... & ...\\
... & ... & ... & ...\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & ... & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}_{\quad m\times n}$$


### Why do Jacobians matter?

<img align=right src=./data/jacobian.png width=50%>

* Forward-mode
    * Gave us one column of the jacobian
    * To construct the entire jacobian
        * $n$ forward accumulation steps are needed
        * One per input
    * Useful when the function has more outputs than inputs
* Reverse-mode
    * Gave us one row of the jacobian
    * To construct the entire jacobian
        * $m$ reverse accumulation steps are needed
        * One per output
    * Useful when the function has more inputs than outputs
    
   
**In ML, most of the time, the number of inputs (model parameters) is far greater than the number of outputs (a scalar loss value). So reverse-mode is the better choice; in this setting it is known as backpropagation.**

### Autodiff's modularity

<img align=right src=./data/modularity.png width=30%>

* Each operation works independently
* Each operation needs to propagate information
    * Either from inputs to outputs (Upstream to downstream)
    * Or from outputs to inputs (Downstream to upstream)

**For forward-mode**

<img align=right src=./data/tangents_primals.png width=30%>

* Propagation of primals and tangents through a function
    * Primals (Primary values)
        * Inputs to outputs
    * Tangents (Derivatives)
        * Geometrically, tangent to a curve = derivative!
        * $\mathcal{J}_f$ is the function's jacobian at the given input and output
* The jacobian times a vector is propagated forward
    * This is called `jacobian-vector product` or `jvp`
    * Jacobian times an input vector = output vector
* This means we don't have to store intermediate values for later!
    * Forward-mode memory use is independent of the depth of the graph
    * E.g. Think extremely deep neural networks

**For forward-mode autodiff to work, `jvp` rules have to be written for all operations!**

In [4]:
# In pseudo-code
# Some frameworks may not support forward-mode autodiff


def unknown_function(x, y):
    """ A black-box function.

    Parameters
    ----------
    x, y
        Scalars, Arrays, or matrices of VALUES.
        Unlike symbolic differentiation, remember that we work with numerical values!
    """
    # Do something blackboxy!
    ...
    return z

def unknown_function_jvp_rule(up_primals, up_tangents):
    """ Tells the autodiff program how to differentiate the above function.

    Briefly, tell the software how to propagate the derivatives!
    See how modular this is: the user does not need to know the computational graph.

    Parameters
    ----------
    up_primals
        Upstream primals = Inputs to the original function
        i.e. x & y
    up_tangents
        Upstream tangents = Derivative information accumulated in the primals
        i.e. x_dot & y_dot
        Imagine the flow of water!

    Returns
    -------
    down_primals
        Primals to pass on downstream = Outputs of the function
        z = unknown_function(x, y)
    down_tangents
        Downstream tangents = Derivative information incorporating the up_tangents
        and the function's jacobian
        z_dot = f(x_dot, y_dot, Jf)
    """
    x, y = up_primals
    x_dot, y_dot = up_tangents
    down_primals = z = unknown_function(x, y)
    # Compute the jacobian of the unknown_function =  Jf
    down_tangents =  XXX # Compute the jvp [Jacobian times x_dot & y_dot]
    return down_primals, down_tangents

# Register the jvp rule

# YOUR_FRAMEWORK.register_jvp(func=unknown_function,
#                             jvp_rule=unknown_function_jvp_rule)

# Now this function can be differentiated by the system
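
As a concrete illustration of the template above, take $z = x \sin(y)$ as a stand-in for the black-box function. This choice is an assumption made only for the example, and the names `f` and `f_jvp_rule` are hypothetical; no framework registration is shown.

```python
import math

def f(x, y):
    """Concrete stand-in for the black-box function: z = x * sin(y)."""
    return x * math.sin(y)

def f_jvp_rule(up_primals, up_tangents):
    x, y = up_primals
    x_dot, y_dot = up_tangents
    down_primals = f(x, y)
    # Jacobian of f is the 1 x 2 row [dz/dx, dz/dy] = [sin(y), x*cos(y)];
    # the JVP is that row times the column [x_dot, y_dot].
    down_tangents = math.sin(y) * x_dot + x * math.cos(y) * y_dot
    return down_primals, down_tangents

# Seeding x (x_dot=1, y_dot=0) yields dz/dx; seeding y yields dz/dy
z, dz_dx = f_jvp_rule((2.0, 0.5), (1.0, 0.0))
_, dz_dy = f_jvp_rule((2.0, 0.5), (0.0, 1.0))
print(z, dz_dx, dz_dy)
```

The rule never needs to know what comes before or after `f` in the graph; it only maps incoming tangents to outgoing tangents.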

#### Notes:
* Upstream derivatives are denoted with "dot" notation i.e. $\dot{\vec{x}}$
* Forward-mode also has a nice interpretation using *Dual numbers*
    * Dual numbers are of the form $a + \epsilon b$, where $a$ and $b$ $\in \mathcal{R}^n$
    * $\epsilon$ is a hypothetical number having the property $\epsilon^2 = 0$ and $\epsilon \neq 0$ [I know it's weird]
    * They are represented as $(a, b)$
    * If you take the Taylor series expansion of any function at $a$ along $b$
    $$f\Big((a, b)\Big) = f(a + \epsilon b) = f(a) + \epsilon b * f^{'}(a) + 0 + 0 + 0 + ... \\
    = c + \epsilon d = (c, d)\\
    \text{where,} \; c = f(a) \\
    d = b* f^{'}(a)$$
* If $(a, b) = (\vec{x}, \dot{\vec{x}})$
    * For any function, $\vec{y} = f(\vec{x})$
    * $f\Big((\vec{x}, \dot{\vec{x}})\Big) = (\vec{y}, \dot{\vec{y}})$
    * Where, $\dot{\vec{y}} = \mathcal{J}_f*\dot{\vec{x}}$ = JVP!
    * i.e. Any function, evaluated on dual numbers, propagates outputs and derivatives
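
The dual-number view translates almost directly into code. The sketch below is a minimal illustration with scalar components; the class name `Dual`, its attributes `primal` and `tangent`, and the example function are assumptions, and only addition, multiplication and `sin` are supported.

```python
import math

class Dual:
    """Dual number (a, b) = a + eps*b with eps**2 = 0."""
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    def __mul__(self, other):
        # (a + eps*b)(c + eps*d) = ac + eps*(ad + bc); the eps**2 term vanishes
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

def sin(d):
    # f(a + eps*b) = f(a) + eps*b*f'(a)
    return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)

def f(x1, x2):
    """f: R^2 -> R^2, written once and reused for values and derivatives."""
    return x1 * x2, sin(x1) + x2 * x2

# Seed x1 with tangent 1 (and x2 with 0) to get the first column of the Jacobian
y1, y2 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(y1.primal, y2.primal)      # function values
print(y1.tangent, y2.tangent)    # dy1/dx1 and dy2/dx1
```

Both outputs receive their derivative with respect to $x_1$ in the same pass, and nothing is stored for later.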

**For Reverse-mode**
* First, we have to go from the start to end of the graph
    * Propagate the primals only
* Next, we start traversing backwards
    * Propagate the downstream derivatives [from outputs to inputs]
    * Forward pass stores required values to help with this
* What is needed are not `JVP` rules
    * We need `VJP`s or `vector-jacobian` products
    * We are "pulling" derivatives backwards (these backward-flowing quantities are often called cotangents or adjoints)
    * Output vector times the Jacobian = input vector
* Memory scales with graph depth!

**Notes**
* JVP = $\mathcal{J} \times \vec{v}$, where $\vec{v}$ is a vector in the input space of the function (size $n$)
* VJP = $\vec{w}^T \times \mathcal{J}$, where $\vec{w}$ is a vector in the output space of the function (size $m$)
    *  As a column vector, this is $\mathcal{J}^T \times \vec{w}$
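
The two products can be written out for an explicit Jacobian. The sketch below is for intuition only: the matrix values are made up, the helper names `jvp` and `vjp` are hypothetical, and a real autodiff system computes these products without ever forming $\mathcal{J}$.

```python
def jvp(J, v):
    """Jacobian-vector product: (m x n) times (n,) -> (m,)."""
    return [sum(J_ij * v_j for J_ij, v_j in zip(row, v)) for row in J]

def vjp(w, J):
    """Vector-Jacobian product: (m,) times (m x n) -> (n,)."""
    n = len(J[0])
    return [sum(w[i] * J[i][j] for i in range(len(J))) for j in range(n)]

# An arbitrary 2 x 3 Jacobian (m = 2 outputs, n = 3 inputs); values are assumed
J = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

# Basis vectors as seeds: e_j in the input space picks column j,
# e_i in the output space picks row i
column_0 = jvp(J, [1.0, 0.0, 0.0])   # what one forward-mode pass delivers
row_1 = vjp([0.0, 1.0], J)           # what one reverse-mode pass delivers
print(column_0, row_1)
```

Recovering the full matrix this way takes $n$ JVPs (one per column) or $m$ VJPs (one per row), which is the cost argument behind choosing a mode.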

In [None]:
# In pseudo-code

def unknown_function(x, y):
    """ A black-box function.

    Parameters
    ----------
    x, y
        Scalars, Arrays, or matrices of VALUES.
        Unlike symbolic differentiation, remember that we work with numerical values!
    """
    # Do something blackboxy!
    ...
    return z

# VJP rule - Needs two functions unlike forward-mode!

def unknown_function_forward(up_primals):
    """Forward pass during reverse-mode.

    This is very similar to the original function except that you can store (cache)
    values for the backward pass.

    Parameters
    ----------
    up_primals
        The inputs of the original function

    Returns
    -------
    down_primals
        Output of the original function
    stuff_to_store
        Residual values needed for backward pass
    """

    x, y = up_primals
    down_primals = z = unknown_function(x, y)

    stuff_to_store = (x, y)  # Often called the residuals [leftovers from evaluating the function]

    return down_primals, stuff_to_store

def unknown_function_backward(stored_stuff, down_tangents):
    """This is executed only after the entire graph has been traversed.

    Parameters
    ----------
    stored_stuff
        The stuff you stored during the forward pass (A long time ago)
    down_tangents
        Downstream tangents (or derivatives) accumulated so far

    Returns
    -------
    up_tangents
        The derivatives propagated through the function to its inputs
    """
    x, y = stored_stuff
    z_bar = down_tangents

    # We need to propagate z_bar to get x_bar and y_bar
    x_bar = z_bar * XXX # The vector-jacobian product [w.r.t x]
    y_bar = z_bar * YYY # The vector-jacobian product [w.r.t y]

    up_tangents = (x_bar, y_bar)
    return up_tangents


# Register the vjp rule

# YOUR_FRAMEWORK.register_vjp(func=unknown_function,
#                             func_forward=unknown_function_forward,
#                             func_backward=unknown_function_backward)

# Now this function can be differentiated by the system
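
As with the forward-mode rule, a concrete instance may help. The sketch below again assumes $z = x \sin(y)$ as a stand-in for the black-box function; the names `f_forward` and `f_backward` are hypothetical and no framework registration is shown.

```python
import math

def f(x, y):
    """Concrete stand-in for the black-box function: z = x * sin(y)."""
    return x * math.sin(y)

def f_forward(up_primals):
    """Forward pass: compute the output and keep what the backward pass needs."""
    x, y = up_primals
    down_primals = f(x, y)
    residuals = (x, y)
    return down_primals, residuals

def f_backward(residuals, z_bar):
    """Backward pass: pull the downstream derivative z_bar back to the inputs."""
    x, y = residuals
    x_bar = z_bar * math.sin(y)        # z_bar * dz/dx
    y_bar = z_bar * x * math.cos(y)    # z_bar * dz/dy
    return x_bar, y_bar

z, residuals = f_forward((2.0, 0.5))
# ... the rest of the graph would be evaluated here ...
x_bar, y_bar = f_backward(residuals, 1.0)   # seed the output with 1
print(z, x_bar, y_bar)
```

One backward call returns the derivatives with respect to both inputs, at the price of keeping `residuals` alive between the two passes.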

### Summary

### The end!