<img src=../figures/Brown_logo.svg width=50%>

# Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 1b: Finding gradients

### Suryanarayanan M. S. | <a href="mailto:s.manojsanu@tudelft.nl">s.manojsanu@tudelft.nl</a> | PhD Candidate

# Outline for today's lecture
* Finite differencing
* Symbolic differentiation
* Automatic differentiation

**References:**
* J. R. R. Martins & Andrew Ning, Engineering Design Optimization, 2021 - Chapter-6
* Extras:
    * Nocedal, Jorge, and Stephen J. Wright. Numerical optimization. Springer Science & Business Media, 2006. - Chapter 8
    * Automatic Differentiation in Machine Learning: a Survey, https://arxiv.org/abs/1502.05767

## Four main methods

- Hand calculation (Analytical)
- Finite differences 
- Symbolic differentiation 
- Automatic differentiation

## A. Hand calculations

* Trusting your high-school calculus     
        
* Chain-rule for derivatives
    * $y = f(x) \Rightarrow \frac{dy}{dx} = f^{'}(x)$
    * $z = g(y) \Rightarrow \frac{dz}{dy} = g^{'}(y)$
    <img align=right src=./data/hand_rule_graph.png width=40%>
    * $\frac{dz}{dx} = ?$
    


$$\frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx}$$
<img align=center src=./data/hand_rule_sliders.png width=40%>

### Slightly more complicated
* $f: \mathcal{R} \rightarrow \mathcal{R}^2$ - A vector valued function
* $g: \mathcal{R}^2 \rightarrow \mathcal{R}$
<img align=center src=./data/hand_with_branches.png width=40%>

$$\frac{dz}{dx} = \Big(\frac{\partial z}{\partial y_1} \times \frac{d y_1}{dx} \Big) + \Big(\frac{\partial z}{\partial y_2} \times \frac{d y_2}{dx}\Big)$$
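A small worked example of the branched chain rule. The particular functions are chosen here only for illustration: take $y_1 = x^2$, $y_2 = \sin x$ and $z = g(y_1, y_2) = y_1 y_2$.

$$\frac{\partial z}{\partial y_1} = y_2, \quad \frac{\partial z}{\partial y_2} = y_1, \quad \frac{d y_1}{dx} = 2x, \quad \frac{d y_2}{dx} = \cos x$$
$$\frac{dz}{dx} = \big(y_2 \times 2x\big) + \big(y_1 \times \cos x\big) = 2x \sin x + x^2 \cos x$$

This matches the product rule applied directly to $z = x^2 \sin x$.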

### For a one layer neural network


* Input = $x$
* Linear layer ($l$), $z = Wx + b$
* Non-linearity, $y = \sigma(z)$
* MSE loss, $L = 0.5 * (y - \bar{y})^2$

<img align=center src=./data/nn_hand.png width=40%>


#### Exercise 
* Find $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$

Derivation:
* $\frac{\partial L}{\partial y} = (y-\bar{y})$
* $\frac{dy}{dz} = \sigma^{'}$
* $\frac{\partial z}{\partial W} = x$
* $\frac{\partial z}{\partial b} = 1$

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \times \frac{dy}{dz} \times  \frac{\partial z}{\partial W}$$
$$= (y-\bar{y}).\sigma^{'}.x$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \times \frac{dy}{dz} \times  \frac{\partial z}{\partial b}$$
$$= (y-\bar{y}).\sigma^{'}.1$$
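The derivation above can be sketched in a few lines of Python. This is a minimal illustration, assuming scalar $W$, $b$, $x$ and a sigmoid for $\sigma$ (the lecture does not fix the non-linearity); the function name `loss_and_grads` is hypothetical.

```python
import math

def sigmoid(z):
    # Assumed non-linearity; its derivative is sigmoid(z) * (1 - sigmoid(z)).
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(W, b, x, y_target):
    """Scalar one-layer network: z = W*x + b, y = sigmoid(z), L = 0.5*(y - y_target)**2."""
    z = W * x + b
    y = sigmoid(z)
    L = 0.5 * (y - y_target) ** 2

    dL_dy = y - y_target        # dL/dy
    dy_dz = y * (1.0 - y)       # sigma'(z) for the sigmoid
    common = dL_dy * dy_dz      # factor shared by both gradients

    dL_dW = common * x          # dz/dW = x
    dL_db = common * 1.0        # dz/db = 1
    return L, dL_dW, dL_db

L, dL_dW, dL_db = loss_and_grads(W=0.5, b=-0.2, x=1.5, y_target=1.0)
print(L, dL_dW, dL_db)
```

Computing `common` once and reusing it is exactly the redundancy that a hand derivation tends to miss.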


* Pros:
    * Fast
    * Exact
    * No special software needed
* Cons:
    * *Trust* your skills
        * error-prone
    * Time-consuming
    * Redundancies!
        * $(y-\bar{y}).\sigma^{'}.\{\}$ - Common


## B. Finite differencing

- Formula: $$\frac{df(x)}{dx} \approx \frac{f(x + h) - f(x)}{h}$$
- Standard way to check every other method!

* Approximate the derivative by a small change in the input
    * Definition of derivative
    
**Taylor series makes a comeback**
* From 1D Taylor series
$$f(x + h) = f(x) + h\frac{df}{dx} + \mathcal{O}(h^2)$$
$$f(x + h) - f(x) = h\frac{df}{dx} + \mathcal{O}(h^2)$$
$$\frac{df}{dx} = \frac{f(x + h) - f(x)}{h} + \mathcal{O}(h)$$

**In n-dimensions**

We saw that the Taylor series is expanded along a direction $\vec{p}$!
$$ f(\vec{x}_0 + \alpha \vec{p}) \approx f(\vec{x}_0) + \alpha \nabla f(\vec{x}_0)^T \vec{p} + \frac{1} {2}\alpha^2 \vec{p}^T   \mathbf{H}(\vec{x}_0)   \vec{p}$$

For finite differencing, the directions are the unit-vectors along a coordinate axis.
$$f(\vec{x} + h \hat{e}_j) = f(\vec{x}) + h \frac{\partial f}{ \partial x_j} +  \mathcal{O}(h^2)$$
$$\frac{\partial f}{ \partial x_j} = \frac{f(\vec{x} + h \hat{e}_j) - f(\vec{x})}{h} + \mathcal{O}(h)$$
$$ j= 1, 2, 3, ..., n$$
where $n$ is the dimensionality of $\vec{x}$
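The coordinate-wise formula above can be sketched as a short loop. This is a minimal illustration in plain Python, not an optimized implementation; the name `fd_gradient`, the default step `h=1e-6`, and the test function `f` are assumptions made for the example.

```python
def fd_gradient(f, x, h=1e-6):
    """Forward-difference approximation of the gradient of f at x (a list of floats)."""
    f0 = f(x)                      # baseline evaluation, shared by all components
    grad = []
    for j in range(len(x)):
        x_step = list(x)           # copy so the original point is untouched
        x_step[j] += h             # x + h * e_j
        grad.append((f(x_step) - f0) / h)
    return grad

def f(x):
    # Illustrative function: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3.0 * x[0] * x[1]

# The exact gradient at (1, 2) is [2*x1 + 3*x2, 3*x1] = [8, 3].
print(fd_gradient(f, [1.0, 2.0]))
```

The loop makes the cost visible: one baseline evaluation plus one evaluation per coordinate.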


Strictly,
* This is the forward finite difference method
* There are backward and central finite difference methods as well
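
The other two variants mentioned above are commonly written as follows (standard textbook forms, stated here for a 1D function):

$$\text{Backward:} \quad \frac{df}{dx} = \frac{f(x) - f(x - h)}{h} + \mathcal{O}(h)$$
$$\text{Central:} \quad \frac{df}{dx} = \frac{f(x + h) - f(x - h)}{2h} + \mathcal{O}(h^2)$$

Central differencing is second-order accurate, but it needs two new function evaluations per coordinate instead of one.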

### Notes
In 2D, it is easier to see how the first formula gets transformed into the second.
* If $\vec{x}  = \begin{bmatrix} x_1  \\ x_2 \end{bmatrix} \in \mathcal{R}^2$
* Then we need two derivatives to form the gradient of f i.e.
    $$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2}\end{bmatrix}$$
* We have two coordinate directions as well
$$\vec{p} = \begin{bmatrix} 1  \\ 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 0  \\ 1 \end{bmatrix}$$

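* To get the first component of the gradient, $\frac{\partial f}{\partial x_1}$, we use the first direction $\vec{p} = \begin{bmatrix} 1 & 0 \end{bmatrix}^T$ with $\alpha = h$; the second component uses $\begin{bmatrix} 0 & 1 \end{bmatrix}^T$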

**Pros:**
* Simple to implement
* Works for any function (black-box)
* Used to check other methods

**Cons:**
- Not exact - depends on the choice of h
    * $h$ needs to be small for accurate gradients [From definition] - Truncation error
    * If $h$ is too small, finite precision errors creep in - Roundoff error
- Slow - a single gradient requires $n + 1$ function evaluations (with forward differences)
    - E.g. for a neural network, we need one extra function evaluation per parameter
        - What if we have 1 million parameters? Or if each function evaluation is a simulation?
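
The trade-off between truncation and roundoff error can be seen by sweeping $h$. This is a minimal sketch, assuming $f(x) = \sin(x)$ at $x = 1$ as the test case (chosen here because its exact derivative, $\cos(1)$, is known).

```python
import math

def forward_diff(f, x, h):
    # Forward finite difference approximation of df/dx
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)               # d/dx sin(x) = cos(x)

errors = {}
for k in range(1, 16):
    h = 10.0 ** (-k)
    errors[k] = abs(forward_diff(math.sin, x0, h) - exact)
    print(f"h = 1e-{k:02d}   error = {errors[k]:.3e}")
```

In double precision the error typically shrinks as $h$ decreases, reaches a minimum at a moderate step size, and then grows again once roundoff dominates.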


## C. Symbolic differentiation
- Use symbolic data types to represent mathematical expressions
- Use equations of calculus
    - Basically what you did by hand, but now automated!

<img align=right src=./data/expression_swell.png width=40%>

**Pros:**
- Exact
- The formula identifies problem structure!

**Cons:**
- Slow due to redundancies
- Not scalable - expression swell
- Wasteful
    - We need the gradient's value at a point and not a formula!
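
Expression swell can be reproduced with a toy symbolic differentiator. This is a minimal sketch, not how a real computer algebra system works: expressions are nested tuples, only `add` and `mul` are supported, and no simplification is performed. All names (`diff`, `size`, `evaluate`) are hypothetical.

```python
def diff(expr, var="x"):
    """Differentiate a tiny expression tree with respect to var.

    An expression is a number, a variable name (str), or a tuple
    ("add", a, b) or ("mul", a, b).
    """
    if isinstance(expr, (int, float)):
        return 0
    if isinstance(expr, str):
        return 1 if expr == var else 0
    op, a, b = expr
    if op == "add":   # sum rule
        return ("add", diff(a, var), diff(b, var))
    if op == "mul":   # product rule, applied without any simplification
        return ("add", ("mul", diff(a, var), b), ("mul", a, diff(b, var)))
    raise ValueError(f"unknown operation: {op}")

def size(expr):
    """Number of nodes in the expression tree."""
    if isinstance(expr, tuple):
        return 1 + size(expr[1]) + size(expr[2])
    return 1

def evaluate(expr, env):
    """Evaluate an expression tree given variable values in env."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, a, b = expr
    va, vb = evaluate(a, env), evaluate(b, env)
    return va + vb if op == "add" else va * vb

# Build x * (x + 1) * (x + 2) * ... and watch the derivative grow.
expr = "x"
for k in range(1, 6):
    expr = ("mul", expr, ("add", "x", k))
    print(k, size(expr), size(diff(expr)))
```

The derivative tree grows faster than the original expression, even though only a single number, `evaluate(diff(expr), {"x": ...})`, is needed in the end.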

## D. Automatic differentiation

* aka Algorithmic differentiation (or autodiff)
* Combines the best aspects of symbolic and numerical differentiation
* a. Use on any function (like Finite Differences)
    - ifs, loops, and other control flow
* b. Exact (like Symbolic differentiation)
    - No truncation error
* c. Fast and scalable (unlike both)



**Core idea**
* Every function is made from elementary operations (in a computer!).
* We know the derivative of each elementary operation.
* So, we can compute the derivative of the entire function by applying the chain rule repeatedly!

For implementation:
- Function as a computational graph!
    - Obtained usually by tracing
- Look up the rules for each operation somewhere

E.g. computational graph of our one-layer NN!

Two modes of automatic differentiation: -> Traversing the graph!
- Forward accumulation
- Reverse accumulation

Analogy:
- Computational graph is like a connected pipe system
- Forward accumulation is like pouring water from the input to the output
    - Put water in one of the inputs and you see its influence on all the outputs
- Reverse accumulation is like pouring water from the output to the input
    - Put water in one of the outputs and you see its influence on all the inputs
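
The two sweeps can be written out by hand on the one-layer network from earlier. This is a minimal sketch, assuming scalar $W$, $b$, $x$ and a sigmoid for $\sigma$; the suffixes `_dot` (forward) and `_adj` (reverse) are naming choices made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W, b, x, y_target = 0.5, -0.2, 1.5, 1.0

# Evaluate the graph once and keep the intermediate values.
z = W * x + b
y = sigmoid(z)
L = 0.5 * (y - y_target) ** 2

# Forward accumulation: seed ONE input (here W) and push its influence to the output.
W_dot, b_dot = 1.0, 0.0
z_dot = W_dot * x + b_dot
y_dot = y * (1.0 - y) * z_dot
L_dot = (y - y_target) * y_dot     # dL/dW only; dL/db needs a second sweep

# Reverse accumulation: seed the output and pull its sensitivity back to the inputs.
L_adj = 1.0
y_adj = L_adj * (y - y_target)
z_adj = y_adj * y * (1.0 - y)
W_adj = z_adj * x                  # dL/dW
b_adj = z_adj * 1.0                # dL/db, obtained from the same sweep

print(L_dot, W_adj, b_adj)
```

Both sweeps apply the same local derivatives; they differ only in the direction in which the chain rule is accumulated.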


Few inputs and many outputs -> Forward accumulation

Many inputs and few outputs -> Reverse accumulation

#### Mathematically
- Notation:
    - Upstream [dots]
    - Downstream [bars]
- Forward: 
    - Dual numbers!
    - Based on JVPs
    - Pushforward
- Reverse:
    - Based on VJPs
    - Pullback
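
Dual numbers give a compact way to sketch forward accumulation. This is a minimal illustration supporting only `+`, `*` and `sin`; the class name `Dual` and the test function `f` are assumptions for the example, not part of any library.

```python
import math

class Dual:
    """Dual number val + dot*eps with eps**2 = 0; `dot` carries the derivative."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants (zero derivative).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def sin(d):
    # Elementary operation with its known derivative rule
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def f(x):
    return x * x + 3.0 * sin(x)

out = f(Dual(2.0, 1.0))    # seed dx/dx = 1
print(out.val, out.dot)    # out.dot carries df/dx = 2*x + 3*cos(x) at x = 2
```

Each elementary operation propagates a value and a derivative together, so no expression tree and no step size $h$ are needed.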


Why do VJPs and JVPs matter?
- Example: $KU = F$ [matrix-vector products are easier to store than matrices!]


#### Implementation example (using JAX)
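
A minimal sketch of what such an example could look like, assuming JAX is installed. It reuses the one-layer network with scalar parameters and a sigmoid non-linearity (an assumption, since the lecture leaves $\sigma$ unspecified); the function name `loss` and the numerical values are illustrative.

```python
import jax
import jax.numpy as jnp

def loss(params, x, y_target):
    W, b = params
    z = W * x + b
    y = jax.nn.sigmoid(z)
    return 0.5 * (y - y_target) ** 2

params = (jnp.array(0.5), jnp.array(-0.2))
x, y_target = 1.5, 1.0

# Reverse mode: one call returns the derivative with respect to every parameter.
dL_dW, dL_db = jax.grad(loss)(params, x, y_target)

def loss_of_params(p):
    return loss(p, x, y_target)

# Forward mode (JVP): push a tangent that perturbs only W.
tangent = (jnp.array(1.0), jnp.array(0.0))
L_val, L_dot = jax.jvp(loss_of_params, (params,), (tangent,))

# Reverse mode (VJP): pull a unit output cotangent back to the parameters.
L_val2, pullback = jax.vjp(loss_of_params, params)
(grads,) = pullback(jnp.ones_like(L_val2))

print(dL_dW, dL_db, L_dot, grads)
```

Here `L_dot` from the JVP should agree with `dL_dW`, and `grads` from the VJP should agree with the output of `jax.grad`, which is itself built on reverse mode.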

The end!