#### Affine function

$ z(x) = \sum \limits _{i} ^{n} w_{i}x_{i} + b = w \cdot x + b $
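
As a minimal sketch of the formula above, the affine function can be written in plain Python. The function name `affine` and the sample values are assumptions made for illustration only.

```python
def affine(w, x, b):
    """Return z = w . x + b for equal-length sequences w and x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    # Dot product of weights and inputs, shifted by the bias b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b


z = affine([0.5, -1.0, 2.0], [2.0, 3.0, 1.0], 0.25)
print(z)  # 0.25
```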

#### Common loss function

$ \frac {1}{N} \sum \limits _{x} (target(x) - activation(x))^{2} = $
$ \frac {1}{N} \sum \limits _{x} (target(x) - max(0, \sum \limits _{i}^{\vert x \vert} w_{i}x_{i} + b))^2 $
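
The loss above can be sketched in Python as a mean squared error over the ReLU of an affine function. The names `relu` and `mse_loss` and the toy data are assumptions chosen for this example.

```python
def relu(z):
    """max(0, z), the activation used in the loss above."""
    return max(0.0, z)


def mse_loss(inputs, targets, w, b):
    """Mean of (target(x) - activation(x))^2 over all input vectors x."""
    total = 0.0
    for x, target in zip(inputs, targets):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b  # affine part
        total += (target - relu(z)) ** 2
    return total / len(inputs)


inputs = [[1.0, 2.0], [-1.0, 0.0]]
targets = [1.0, 0.0]
loss = mse_loss(inputs, targets, w=[0.5, 0.5], b=0.0)
print(loss)  # 0.125
```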

#### Table of Derivatives

![basic derivative chain rules](images/basic-derivative-chain-rules.png)


#### Partial Derivative Example

Consider the function $ f(x,y) = 3x^{2}y $

The partial derivative of _f_ wrt _x_ is:

$\frac{\partial f}{\partial x} = $
$\frac{\partial} {\partial x} 3x^{2}y  = $
$3y\frac{\partial}{\partial x}x^{2} = $
$ 6yx $

The partial derivative of _f_ wrt _y_ is:

$\frac{\partial f}{\partial y} = $
$\frac{\partial} {\partial y} 3x^{2}y = $
$3x^{2}\frac{\partial}{\partial y}y = $
$ 3x^{2} $


#### Gradient of $ f(x,y) $

$ \nabla f(x,y) = [\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y}] = $
$ [6yx, 3x^{2}] $
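
One way to sanity-check the gradient above is to compare it with a central-difference approximation. This is only a sketch: the helper `numerical_grad`, the step size `h`, and the evaluation point are assumptions.

```python
def f(x, y):
    return 3 * x**2 * y


def grad_f(x, y):
    """Analytic gradient [df/dx, df/dy] = [6yx, 3x^2]."""
    return [6 * y * x, 3 * x**2]


def numerical_grad(func, x, y, h=1e-6):
    """Central-difference estimate of the gradient of func at (x, y)."""
    dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return [dx, dy]


analytic = grad_f(2.0, 3.0)
numeric = numerical_grad(f, 2.0, 3.0)
print(analytic)  # [36.0, 12.0]
print(all(abs(a - n) < 1e-4 for a, n in zip(analytic, numeric)))  # True
```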


### Matrix calculus

Given the functions $ f(x,y) = 3x^{2}y  $ and 
$ g(x,y) = 2x + y^{8} $


$ \frac{\partial g(x,y)}{\partial x} = $
$ \frac{\partial 2x}{\partial x} + \frac{\partial y^{8}}{\partial x} = $
$ 2\frac{\partial x}{\partial x} + 0 = $
$ 2 \times 1 = 2 $

and

$ \frac{\partial g(x,y)}{\partial y} = $
$ \frac{\partial 2x}{\partial y} + \frac{\partial y^{8}}{\partial y} = $
$ 0 + 8y^{7} = $
$ 8y^{7} $

giving us the gradient of $ g(x,y) $ as

$ \nabla g(x,y) = $
$ [2,8y^{7}] $

### Jacobian matrix (aka Jacobian)

**_numerator layout_(used here)** :
* rows: _equations(f,g)_ , 
* columns: _variables(x,y)_ 


$$
J = 
\begin{bmatrix}
\nabla f(x,y) \\
\nabla g(x,y)
\end{bmatrix} =
\begin{bmatrix}
6yx & 3x^{2} \\
2 & 8y^{7}
\end{bmatrix} 
$$

_denominator layout_ - rows: _variables(x,y)_ , columns: _equations(f,g)_

$$
\begin{bmatrix}
6yx &  2 \\
3x^{2} & 8y^{7}
\end{bmatrix} 
$$
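
The two layouts differ only by a transpose. A small sketch, using nested lists as matrices and the partial derivatives derived above (the function names are assumptions):

```python
def jacobian_numerator(x, y):
    """Rows are the functions (f, g); columns are the variables (x, y)."""
    return [
        [6 * y * x, 3 * x**2],  # gradient of f(x, y) = 3x^2 y
        [2.0, 8 * y**7],        # gradient of g(x, y) = 2x + y^8
    ]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


J = jacobian_numerator(1.0, 2.0)
J_denominator = transpose(J)  # rows: variables, columns: functions
print(J)              # [[12.0, 3.0], [2.0, 1024.0]]
print(J_denominator)  # [[12.0, 2.0], [3.0, 1024.0]]
```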


### Generalization of the Jacobian

To generalize, combine multiple parameters into a single vector argument: $ f(x,y,z) \Rightarrow f(\pmb{x}) $. Bold **x** is a vector (aka $ \vec{x} $), while italic _x_ denotes a scalar, e.g. $ x_{i} $ is the $i^{th}$ element of vector **x**.

By default, assume **x** is a column (vertical) vector of size $ n \times 1 $.

$$
x = 
\begin{bmatrix}
x_{1} \\
x_{2} \\
\vdots \\
x_{n}
\end{bmatrix} 
$$

For multiple scalar-valued functions, combine all into a vector just like the parameters. 

Let $ y = f(x) $ be a vector of `m` scalar-valued functions that each take a vector **x** of length $ n = \vert x \vert $ where $ \vert x \vert $ is the count of elements in **x**.

Each $ f_{i} $ function within **f** returns a scalar.

$$
\begin{matrix}
y_{1} = f_{1}(x) \\
y_{2} = f_{2}(x) \\
\vdots \\
y_{m} = f_{m}(x)
\end{matrix} 
$$

For instance, given $ f(x,y) = 3x^{2}y $ and $ g(x,y) = 2x + y^{8} $,
then 

$ y_{1} = f_{1}(x) = 3x_{1}^{2}x_{2} $ (substituting $ x_{1} $ for _x_, $ x_{2} $ for _y_ )

$ y_{2} = f_{2}(x) = 2x_{1} + x_{2}^{8} $




For the identity function $ y = f(x) = x $ it will be the case that $ m = n $ :
$$
\begin{matrix}
y_{1} = f_{1}(x) = x_{1} \\
y_{2} = f_{2}(x) = x_{2} \\
\vdots \\
y_{m} = f_{m}(x) = x_{n}
\end{matrix} 
$$

So for the identity function, we will have $ m = n $ functions and parameters.

Generally, the Jacobian matrix is the collection of all $ m \times n $ possible partial derivatives (_m_ rows and _n_ columns), which is a stack of _m_ gradients with respect to **x**:

$$
\frac {\partial y}{\partial x} 
=
\begin{bmatrix}
\nabla f_{1}(x) \\
\nabla f_{2}(x) \\
\vdots \\
\nabla f_{m}(x)
\end{bmatrix} 
=
\begin{bmatrix}
\frac{\partial}{\partial x}f_{1}(x) \\
\frac{\partial}{\partial x}f_{2}(x) \\
\vdots \\
\frac{\partial}{\partial x}f_{m}(x) \\
\end{bmatrix} 
=
\begin{bmatrix}
\frac{\partial}{\partial x_{1}}f_{1}(x) & \frac{\partial}{\partial x_{2}}f_{1}(x) & \dotsb & \frac{\partial}{\partial x_{n}}f_{1}(x) \\
\frac{\partial}{\partial x_{1}}f_{2}(x) & \frac{\partial}{\partial x_{2}}f_{2}(x) & \dotsb & \frac{\partial}{\partial x_{n}}f_{2}(x) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial}{\partial x_{1}}f_{m}(x) & \frac{\partial}{\partial x_{2}}f_{m}(x) & \dotsb & \frac{\partial}{\partial x_{n}}f_{m}(x) \\
\end{bmatrix} 
$$


Each $ \frac{\partial}{\partial x}f_{i}(x) $ is a horizontal _n_-vector because the partial derivative is taken with respect to the vector **x**, whose length is $ n = \vert x \vert $. The _width_ of the Jacobian is _n_ if we take the partial derivative with respect to **x** because there are _n_ parameters we can wiggle, each potentially changing the function's value. The Jacobian therefore always has _m_ rows for _m_ equations and _n_ columns for _n_ parameters.
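
The "wiggle each parameter" idea can be sketched as a finite-difference Jacobian. The helper `numerical_jacobian`, the step size `h`, and the third function `f_3` are assumptions; `f_3` is hypothetical and is added only so that $ m \ne n $.

```python
def numerical_jacobian(func, x, h=1e-6):
    """Approximate the m x n Jacobian of func at x (numerator layout)."""
    columns = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        y_plus = func(x_plus)
        y_minus = func(x_minus)
        # Column j: change in every output when only x_j is wiggled.
        columns.append([(p - q) / (2 * h) for p, q in zip(y_plus, y_minus)])
    # Transpose so that row i is the gradient of f_i.
    return [list(row) for row in zip(*columns)]


def stacked(x):
    """m = 3 scalar-valued functions of n = 2 parameters."""
    x1, x2 = x
    return [
        3 * x1**2 * x2,  # f_1, from the text
        2 * x1 + x2**8,  # f_2, from the text
        x1 * x2,         # f_3, hypothetical, makes m differ from n
    ]


J = numerical_jacobian(stacked, [1.0, 1.0])
print(len(J), len(J[0]))  # 3 2
```

At the point $ (1, 1) $ the rows should be close to $ [6, 3] $, $ [2, 8] $ and $ [1, 1] $, matching the analytic partial derivatives.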

##### Jacobian Shapes 
![jacobian shapes](images/jacobian_shapes.png)

#### An Example: Jacobian of the identity function
Given the identity function $ \pmb{f}(x) = \pmb{x} $, with $ f_{i}(x) = x_{i} $, there are _n_ functions and each function has _n_ parameters held in a single vector **x**. The Jacobian is, therefore, a square matrix since $ m = n $:

$$
\frac{\partial y}{\partial x} 
=
\begin{bmatrix}
\frac{\partial}{\partial x}f_{1}(x) \\
\frac{\partial}{\partial x}f_{2}(x) \\
\vdots \\
\frac{\partial}{\partial x}f_{m}(x) \\
\end{bmatrix} 
=
\begin{bmatrix}
\frac{\partial}{\partial x_{1}}f_{1}(x) & \frac{\partial}{\partial x_{2}}f_{1}(x) & \dotsb & \frac{\partial}{\partial x_{n}}f_{1}(x) \\
\frac{\partial}{\partial x_{1}}f_{2}(x) & \frac{\partial}{\partial x_{2}}f_{2}(x) & \dotsb & \frac{\partial}{\partial x_{n}}f_{2}(x) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial}{\partial x_{1}}f_{m}(x) & \frac{\partial}{\partial x_{2}}f_{m}(x) & \dotsb & \frac{\partial}{\partial x_{n}}f_{m}(x) \\
\end{bmatrix} 
=
\begin{bmatrix}
\frac{\partial}{\partial x_{1}}x_{1} & \frac{\partial}{\partial x_{2}}x_{1} & \dotsb & \frac{\partial}{\partial x_{n}}x_{1} \\
\frac{\partial}{\partial x_{1}}x_{2} & \frac{\partial}{\partial x_{2}}x_{2} & \dotsb & \frac{\partial}{\partial x_{n}}x_{2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial}{\partial x_{1}}x_{n} & \frac{\partial}{\partial x_{2}}x_{n} & \dotsb & \frac{\partial}{\partial x_{n}}x_{n} \\
\end{bmatrix}
$$

And since $ \frac{\partial}{\partial x_{j}}x_{i} = 0 $ for $ j \ne i $ and $ \frac{\partial}{\partial x_{j}}x_{i} = 1 $ for $ j = i $

$$
=
\begin{bmatrix}
\frac{\partial}{\partial x_{1}}x_{1} & 0 & \dotsb & 0 \\
0 & \frac{\partial}{\partial x_{2}}x_{2} & \dotsb & 0 \\
  & & \ddots \\
0 & 0 & \dotsb & \frac{\partial}{\partial x_{n}}x_{n} \\
\end{bmatrix} 
=
\begin{bmatrix}
1 & 0 & \dotsb & 0 \\
0 & 1 & \dotsb & 0 \\
  & & \ddots \\
0 & 0 & \dotsb & 1 \\
\end{bmatrix}
= I 
$$

(_I_ is the identity matrix with the ones down the diagonal)
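
As a quick numerical check of this result, the sketch below estimates the Jacobian of the identity function by central differences and compares it with the identity matrix. The helper name, the step size `h`, and the test point are assumptions.

```python
def identity(x):
    return list(x)


def numerical_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: entry [i][j] estimates d f_i / d x_j."""
    n = len(x)
    J = [[0.0] * n for _ in range(len(func(x)))]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        y_plus = func(x_plus)
        y_minus = func(x_minus)
        for i in range(len(y_plus)):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * h)
    return J


x = [0.5, -2.0, 3.0]
J = numerical_jacobian(identity, x)
I = [[1.0 if i == j else 0.0 for j in range(len(x))] for i in range(len(x))]
is_identity = all(
    abs(J[i][j] - I[i][j]) < 1e-6
    for i in range(len(x))
    for j in range(len(x))
)
print(is_identity)  # True
```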

### Derivatives of vector element-wise binary operators

_Element-wise binary operations on vectors_ apply an operator to the first element of each vector to get the first element of the output, then to the second element of each vector to get the second element of the output, and so forth.



#### Generalized notation for element-wise binary operations
$$
\pmb{y} = \pmb{f}(w) \bigcirc \pmb{g}(x) 
$$

where $ m = n = \vert y \vert = \vert w \vert = \vert x \vert $

Reminder: $ \vert x \vert $ is the number of items in _x_

Zooming in on $ \pmb{y} = \pmb{f}(\pmb{w}) \bigcirc \pmb{g}(\pmb{x}) $ gives:

$$
\begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{n}
\end{bmatrix}
=
\begin{bmatrix}
f_{1}(w) \bigcirc g_{1}(x) \\
f_{2}(w) \bigcirc g_{2}(x)  \\
\vdots \\
f_{n}(w) \bigcirc g_{n}(x) 
\end{bmatrix} 
$$
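
In code, an element-wise binary operation is a loop over paired elements. The sketch below assumes the simplest case, where $ \pmb{f} $ and $ \pmb{g} $ are identity functions, and uses Python's `operator` module for $ \bigcirc $. The helper name `elementwise` is an assumption.

```python
import operator


def elementwise(op, w, x):
    """Apply the binary operator op to matching elements of w and x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return [op(wi, xi) for wi, xi in zip(w, x)]


w = [1.0, 2.0, 3.0]
x = [4.0, 5.0, 6.0]
added = elementwise(operator.add, w, x)
multiplied = elementwise(operator.mul, w, x)
print(added)       # [5.0, 7.0, 9.0]
print(multiplied)  # [4.0, 10.0, 18.0]
```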

#### Jacobian of Elementwise Binary Operations
The general case for the Jacobian of **y** wrt **w** is the square matrix:
$$
J_{W} =
\frac{\partial y}{\partial w} =
\begin{bmatrix}
\frac{\partial}{\partial w_{1}}(f_{1}(w) \bigcirc g_{1}(x)) & \frac{\partial}{\partial w_{2}}(f_{1}(w) \bigcirc g_{1}(x)) & \dotsb & \frac{\partial}{\partial w_{n}}(f_{1}(w) \bigcirc g_{1}(x)) \\
\frac{\partial}{\partial w_{1}}(f_{2}(w) \bigcirc g_{2}(x)) & \frac{\partial}{\partial w_{2}}(f_{2}(w) \bigcirc g_{2}(x)) & \dotsb & \frac{\partial}{\partial w_{n}}(f_{2}(w) \bigcirc g_{2}(x)) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial}{\partial w_{1}}(f_{n}(w) \bigcirc g_{n}(x)) & \frac{\partial}{\partial w_{2}}(f_{n}(w) \bigcirc g_{n}(x)) & \dotsb & \frac{\partial}{\partial w_{n}}(f_{n}(w) \bigcirc g_{n}(x)) 
\end{bmatrix} 
$$


The general case for the Jacobian of **y** wrt **x** is the square matrix:
$$
J_{X} =
\frac{\partial y}{\partial x} =
\begin{bmatrix}
\frac{\partial}{\partial x_{1}}(f_{1}(w) \bigcirc g_{1}(x)) & \frac{\partial}{\partial x_{2}}(f_{1}(w) \bigcirc g_{1}(x)) & \dotsb & \frac{\partial}{\partial x_{n}}(f_{1}(w) \bigcirc g_{1}(x)) \\
\frac{\partial}{\partial x_{1}}(f_{2}(w) \bigcirc g_{2}(x)) & \frac{\partial}{\partial x_{2}}(f_{2}(w) \bigcirc g_{2}(x)) & \dotsb & \frac{\partial}{\partial x_{n}}(f_{2}(w) \bigcirc g_{2}(x)) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial}{\partial x_{1}}(f_{n}(w) \bigcirc g_{n}(x)) & \frac{\partial}{\partial x_{2}}(f_{n}(w) \bigcirc g_{n}(x)) & \dotsb & \frac{\partial}{\partial x_{n}}(f_{n}(w) \bigcirc g_{n}(x)) 
\end{bmatrix} 
$$


### Diagonal Jacobians
In a diagonal Jacobian, all elements off the diagonal are zero: $ \frac{\partial}{\partial w_{j}}(f_{i}(w) \bigcirc g_{i}(x)) = 0 $ for $ j \ne i $.

This will be the case when $ f_{i} $ and $ g_{i} $ are constants wrt  $ w_{j} $:

$$ 
\frac{\partial}{\partial w_{j}}f_{i}(w) =
\frac{\partial}{\partial w_{j}}g_{i}(x) = 0
$$

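Regardless of the operator $ \bigcirc $, if both partial derivatives are zero, the derivative of the whole expression is zero too ($ 0 \bigcirc 0 = 0 $), because $ f_{i}(w) \bigcirc g_{i}(x) $ is then a constant wrt $ w_{j} $ and the partial derivative of a constant is zero.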

These partial derivatives go to zero when $f_{i} $ and $ g_{i} $ are not functions of $ w_{j} $.

Element-wise operations imply that $ f_{i} $ is purely a function of $ w_{i} $ and $ g_{i} $ is purely a function of $ x_{i} $.

For example, $ \pmb{w} + \pmb{x} $ sums $ w_{i} + x_{i} $. 

Consequently, $ f_{i}(w) \bigcirc g_{i}(x) $ reduces to $ f_{i}(w_{i}) \bigcirc g_{i}(x_{i}) $ and the goal becomes $ \frac{\partial}{\partial w_{j}}f_{i}(w_{i}) = 0$ and $ \frac{\partial}{\partial w_{j}}g_{i}(x_{i}) = 0 $

Notice that $ f_{i}(w_{i}) $ and $ g_{i}(x_{i}) $ look like constants to the partial differentiation wrt $ w_{j} $ when $ j \ne i $.

#### Element-wise diagonal condition
_Element-wise diagonal condition_ refers to the constraint that $ f_{i}(w) $ and $ g_{i}(x) $ access at most $ w_{i} $ and $ x_{i} $, respectively.
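
The condition can be probed numerically. The sketch below compares an element-wise product, which satisfies the condition, with a hypothetical `coupled` function in which every $ f_{i} $ reads all of **w**. All helper names and sample values are assumptions.

```python
def jacobian_wrt_w(func, w, x, h=1e-6):
    """Central-difference Jacobian of y = func(w, x) with respect to w."""
    n = len(w)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += h
        w_minus[j] -= h
        y_plus = func(w_plus, x)
        y_minus = func(w_minus, x)
        for i in range(n):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * h)
    return J


def elementwise_product(w, x):
    # Satisfies the condition: y_i touches only w_i and x_i.
    return [wi * xi for wi, xi in zip(w, x)]


def coupled(w, x):
    # Hypothetical counterexample: f_i(w) = sum(w) reads every w_j.
    total = sum(w)
    return [total * xi for xi in x]


def is_diagonal(J, tol=1e-6):
    n = len(J)
    return all(abs(J[i][j]) < tol for i in range(n) for j in range(n) if i != j)


w = [1.0, 2.0, 3.0]
x = [4.0, 5.0, 6.0]
print(is_diagonal(jacobian_wrt_w(elementwise_product, w, x)))  # True
print(is_diagonal(jacobian_wrt_w(coupled, w, x)))              # False
```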

### Jacobians under an element-wise diagonal condition

Under this condition, the elements along the diagonal of the Jacobian are $ \frac{\partial}{\partial w_{i}}(f_{i}(w_{i}) \bigcirc g_{i}(x_{i})) $:

$$
\frac{\partial y}{\partial w} = 
\begin{bmatrix}
\frac{\partial}{\partial w_{1}}(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})) & & & \\
& \frac{\partial}{\partial w_{2}}(f_{2}(w_{2}) \bigcirc g_{2}(x_{2})) & & \huge0 \\
& & \ddots & \\
\huge0 & & & \frac{\partial}{\partial w_{n}}(f_{n}(w_{n}) \bigcirc g_{n}(x_{n}))
\end{bmatrix}
$$

More succinctly, we can write this as:
$$
\frac{\partial y}{\partial w} = 
diag \left(\frac{\partial}{\partial w_{1}}(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})),
\frac{\partial}{\partial w_{2}}(f_{2}(w_{2}) \bigcirc g_{2}(x_{2})),
\dotsb ,
\frac{\partial}{\partial w_{n}}(f_{n}(w_{n}) \bigcirc g_{n}(x_{n}))\right)
$$

and

$$
\frac{\partial y}{\partial x} = 
diag \left(\frac{\partial}{\partial x_{1}}(f_{1}(w_{1}) \bigcirc g_{1}(x_{1})),
\frac{\partial}{\partial x_{2}}(f_{2}(w_{2}) \bigcirc g_{2}(x_{2})),
\dotsb ,
\frac{\partial}{\partial x_{n}}(f_{n}(w_{n}) \bigcirc g_{n}(x_{n}))\right)
$$

where $ diag(x) $ constructs a matrix whose diagonal elements are taken from the vector $ \pmb{x} $ 
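
As a worked example of these formulas, assume the simplest case $ f_{i}(w_{i}) = w_{i} $ and $ g_{i}(x_{i}) = x_{i} $, and write $ \otimes $ for element-wise multiplication (a notation assumed here, not defined earlier in these notes). For addition, each diagonal entry is $ \frac{\partial}{\partial w_{i}}(w_{i} + x_{i}) = 1 $:

$$
\frac{\partial (\pmb{w} + \pmb{x})}{\partial \pmb{w}} =
diag \left(\frac{\partial}{\partial w_{1}}(w_{1} + x_{1}),
\dotsb ,
\frac{\partial}{\partial w_{n}}(w_{n} + x_{n})\right) =
diag(1, \dotsb , 1) = I
$$

For multiplication, each diagonal entry is $ \frac{\partial}{\partial w_{i}}(w_{i}x_{i}) = x_{i} $:

$$
\frac{\partial (\pmb{w} \otimes \pmb{x})}{\partial \pmb{w}} =
diag \left(\frac{\partial}{\partial w_{1}}(w_{1}x_{1}),
\dotsb ,
\frac{\partial}{\partial w_{n}}(w_{n}x_{n})\right) =
diag(x_{1}, \dotsb , x_{n}) = diag(\pmb{x})
$$

By symmetry, $ \frac{\partial (\pmb{w} \otimes \pmb{x})}{\partial \pmb{x}} = diag(\pmb{w}) $.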


### Questions and Exercises

1. Compute the Jacobian of the function $ \pmb{f}(\pmb{w}) $ in each of the following cases:

    1. such that $ f_{i}(w) = w_{i}^{2} $
    
    1. such that $ f_{i}(w) = w_{i}^{i} $
    
    1. such that $ f_{i}(w) = \sum \limits _{j} ^{m} w_{j}^{i} $ where $ n = m = \vert w \vert $

