<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Neural Networks: Architecture
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 4: Topic 40</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

```python
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

import numpy as np
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
```

#### What is a neural network?

- Computational graph made of layers of composite calculation units:
    - designed to compute a function mapping inputs to outputs. 
- Each unit/layer gets tuned during training:
    - learns some specific part of input-output mapping.

<center> <img src = "images/dogcat.gif" width = 500 > </center>

The many interconnections: important
    
- Let nodes at each layer use relations learned in adjacent layers.

- Build high degree of flexibility: **can learn very complex functions**

- Use connections in model to learn what aspects of data to rely on.

<center><img src = "images/dogcat.gif" width = 500 ></center>

#### Function complexity

<img src = "Images/neural-networks-layers.webp" width = 600 >

Can learn complex functions and decision boundaries.

Let's see how a neural network is learning at each layer:

<a href = "http://playground.tensorflow.org" >TensorFlow Playground</a>

Neural network: minimizing objective function (squared loss/binary cross-entropy)
- tune connections
- each node/unit learning features in the process
- feeds features to next layer. Learns more complex features to predict with, etc.

Complexity sufficient to learn/generalize on some pretty difficult problems:

<center><img src = "Images/image_multiclass.png" width = 800 ></center>

#### But how does all this work?
- Back to basics.

**Composition of a single unit** 
- can be thought of as a model

<img src = "Images/single-unit.png" width = 700>

- If $f$ is the identity function: literally linear regression (with sum squared error objective function)


$$ \text{Output} = x_1 w_1 + x_2 w_2  + x_3 w_3 + b $$

In vector form: 

$$ \text{Output} = \textbf{w}^T \textbf{x} + b $$

<img src = "Images/single-unit.png" width = 500>

- Goal: tune weights $\textbf{w}$ to minimize objective function.
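The single unit with identity activation can be sketched in a few lines of NumPy. The feature values, weights, and bias below are made-up numbers for illustration only; the point is that the term-by-term sum and the vector form give the same output.

```python
import numpy as np

# Hypothetical values for a unit with three inputs (not from the slides).
x = np.array([1.0, 2.0, 3.0])   # features x_1, x_2, x_3
w = np.array([0.5, -1.0, 2.0])  # weights w_1, w_2, w_3
b = 0.25                        # bias

# Written out term by term, as in the first equation.
output_sum = x[0] * w[0] + x[1] * w[1] + x[2] * w[2] + b

# Vector form: w^T x + b.
output_vec = w @ x + b

print(output_sum, output_vec)
```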

#### Logistic regression 

<img src = "Images/single-unit.png" width = 500>

Now changing $f$ to sigmoid:

$$ y_{pred} = \sigma(w_1x_1+...+w_nx_n)$$

Weight connections trained on binary cross entropy via gradient descent.
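As a rough sketch of what "trained on binary cross entropy via gradient descent" means for one sigmoid unit, the snippet below computes predictions, the loss, and a single weight update. The toy data, the zero initial weights, and the learning rate `alpha` are assumptions chosen for illustration. Samples are stored as rows here, and the unit has no bias term, matching the equation above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical toy data: 4 samples (rows), 2 features (columns).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)  # with zero weights every prediction starts at 0.5

y_pred = sigmoid(X @ w)
loss = binary_cross_entropy(y, y_pred)

# One gradient descent step. For sigmoid + cross-entropy the gradient
# with respect to w is X^T (y_pred - y) / m.
alpha = 0.1  # learning rate (assumed value)
grad_w = X.T @ (y_pred - y) / len(y)
w_new = w - alpha * grad_w
new_loss = binary_cross_entropy(y, sigmoid(X @ w_new))

print(loss, new_loss)
```

The loss after the step is lower than before it, which is all a single update is expected to achieve.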

<img src = "Images/linear_vs_logistic_regression.png" >
<center> Linear doesn't model well</center>

#### NN relation to logistic regression

- Complexity and increase in representational power achieved via two key ingredients:
- ***hidden* layers with more than one node in each layer**
- nonlinear activation

<img src = "Images/single-unit.png" width = 500>

Now changing $f$ to sigmoid:

$$ f = \sigma(w_1x_1+...+w_nx_n + b) = \sigma(\textbf{w}^T\textbf{x} + b)$$

Becomes with single hidden layer:
-  square bracket superscript index for layer number
- subscript indexes node number in a given layer
- hidden layer has activation $g$

<img src = "images/activations_nn_layer.png" >

- $ a^{[1]}$, $ z^{[1]}$  are 4-dimensional vectors here (corresponding to number of nodes).

- **Layer 1**: weight *matrix* $W^{[1]}$ and bias *vector* $b^{[1]}$  for computing $z^{[1]}$ from feature vector $\textbf{x}$.
- Dimension of $W^{[1]T}$ is (4,3)
- Dimension of $b^{[1]}$ is (4,1)

To be explicit about it

$$ \textbf{z}^{[1]} = W^{[1]T} \textbf{x} + \textbf{b}^{[1]} \\ = \begin{bmatrix}
           z^{[1]}_{1} \\
           z^{[1]}_{2} \\
           z^{[1]}_{3} \\
           z^{[1]}_{4} \\
         \end{bmatrix} = \left[
  \begin{array}{ccc}
    w_{11} & w_{12} & w_{13}\\
    w_{21} & w_{22} & w_{23} \\
    w_{31} & w_{32} & w_{33} \\    
    w_{41} & w_{42} & w_{43} \\
  \end{array}
\right] 
\begin{bmatrix}
           x_{1} \\
           x_{2} \\
           x_{3} \\
         \end{bmatrix}
+ \begin{bmatrix}
           b_{1}^{[1]} \\
           b_{2}^{[1]} \\
           b_{3}^{[1]} \\
           b_{4}^{[1]} 
           \\
         \end{bmatrix}$$ 

And in the general case:

$$ \textbf{z}^{[1]} = W^{[1]T} \textbf{x} + \textbf{b}^{[1]} \\ = \begin{bmatrix}
           z^{[1]}_{1} \\
           z^{[1]}_{2} \\
           \vdots \\
           z^{[1]}_{n^{[1]}} \\
         \end{bmatrix} = \left[
  \begin{array}{cccc}
    w_{11} & w_{12} & \cdots & w_{1n}\\
    w_{21} & w_{22} & \cdots & w_{2n}\\
    \vdots & \vdots & \ddots & \vdots \\    
    w_{n^{[1]}1} & w_{n^{[1]}2} & \cdots & w_{n^{[1]}n}
  \end{array}
\right] 
\begin{bmatrix}
           x_{1} \\
           x_{2} \\
           \vdots \\
           x_{n} \\
         \end{bmatrix}
+ \begin{bmatrix}
           b_{1}^{[1]} \\
           b_{2}^{[1]} \\
           \vdots \\
           b_{n^{[1]}}^{[1]} 
           \\
         \end{bmatrix}$$ 
         
- where $n^{[1]}$ represents the number of nodes in hidden layer 1
- $n$ is input feature dimensionality
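The shapes in the first layer can be checked with a small NumPy sketch. The random values are placeholders; only the dimensions (3 input features, 4 hidden nodes) follow the example above. Storing `W1` with shape `(n, n1)` is an assumed convention that makes its transpose the (4, 3) matrix in the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3    # input feature dimensionality
n1 = 4   # number of nodes in hidden layer 1

W1 = rng.normal(size=(n, n1))   # W^{[1]}, so W^{[1]T} has shape (4, 3)
b1 = np.zeros((n1, 1))          # b^{[1]} as a column vector, shape (4, 1)
x = rng.normal(size=(n, 1))     # one feature vector as a column, shape (3, 1)

z1 = W1.T @ x + b1              # z^{[1]}, shape (4, 1)

print(W1.T.shape, b1.shape, z1.shape)
```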

<img src = "images/activations_nn_layer.png" >

The second layer is a single sigmoid output node:

$$ \hat{y} = \textbf{a}^{[2]} =\sigma( W^{[2]T} \textbf{a}^{[1]} + \textbf{b}^{[2]} )  $$

Note that it takes in the previous layer's activations.

The shape of $\textbf{W}^{[2]}$: is it really a matrix in this case?

An easy interpretation: 
- we are performing logistic regression over hidden features $a^{[1]}$.
- result of a linear feature transformation on $\textbf{x}$  composed with hidden activation $g$
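A minimal forward pass through this two-layer network might look like the following. The hidden activation `g` is not fixed at this point in the slides, so ReLU is used here purely as an assumption, and the weights are random placeholders. Note that with a single output node, $W^{[2]T}$ is a (1, 4) row vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(z):
    # Hidden activation; ReLU is an assumed choice for this sketch.
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))      # one input with 3 features

W1 = rng.normal(size=(3, 4))     # layer 1 weights
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(4, 1))     # layer 2 weights (single output node)
b2 = np.zeros((1, 1))

a1 = g(W1.T @ x + b1)            # hidden features, shape (4, 1)
y_hat = sigmoid(W2.T @ a1 + b2)  # logistic regression over a1, shape (1, 1)

print(W2.T.shape, a1.shape, y_hat.shape)
```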

- Complexity and increase in representational power achieved via two key ingredients:
- *hidden* layers with more than one node in each layer
- **nonlinear activation**

If $g$ is the identity, the network turns out to be too simple. The hidden layer reduces to $\textbf{a}^{[1]} = \textbf{z}^{[1]} = W^{[1]T} \textbf{x} + \textbf{b}^{[1]}$, so:

$$ \textbf{a}^{[2]} = \sigma( W^{[2]T} \textbf{a}^{[1]} + \textbf{b}^{[2]} ) $$

$$ \textbf{a}^{[2]} = \sigma( W^{[2]T} W^{[1]T} \textbf{x} + W^{[2]T} \textbf{b}^{[1]} + \textbf{b}^{[2]} ) $$

This has the form of logistic regression:

$$ \hat{y} = \textbf{a}^{[2]} = \sigma( W^{T} \textbf{x} + \textbf{b} ) $$

with

$$ W^{T} = W^{[2]T} W^{[1]T} $$

$$ \textbf{b} = W^{[2]T} \textbf{b}^{[1]}+ \textbf{b}^{[2]} $$

When $g$ is $I$:
- haven't gained anything in terms of expressive power over simple logistic regression
- require nonlinearity in activation to yield useful feature transformations/selections using hidden layers.
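The collapse can be verified numerically. In this sketch (random placeholder weights, same 3-4-1 layer sizes as before), a two-layer network with identity hidden activation gives exactly the same output as a single logistic regression with combined weights $W^{[2]T} W^{[1]T}$ and combined bias $W^{[2]T} \textbf{b}^{[1]} + \textbf{b}^{[2]}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 1))
W1 = rng.normal(size=(3, 4))
b1 = rng.normal(size=(4, 1))
W2 = rng.normal(size=(4, 1))
b2 = rng.normal(size=(1, 1))

# Two-layer network with identity hidden activation (g = I).
a1 = W1.T @ x + b1
y_two_layer = sigmoid(W2.T @ a1 + b2)

# Equivalent single-layer logistic regression.
W_T = W2.T @ W1.T     # combined weights, shape (1, 3)
b = W2.T @ b1 + b2    # combined bias, shape (1, 1)
y_logistic = sigmoid(W_T @ x + b)

print(np.allclose(y_two_layer, y_logistic))
```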

<center><img src = "Images/activation_func.png" >
Typical choices for activation function $g$</center>

**ReLU: the most common hidden layer activation function**

- simple thresholding behavior
- automated feature selection: 
    - turn off/on node depending on feature/data in previous steps to network.

<center><img src = "Images/relu.png" >
    If input < 0: turn off neuron/corresponding feature.</center>

<center><img src = "Images/relu_nn.webp" >
    If input < 0: turn off neuron/corresponding hidden nodes.</center>


ReLU allows network to learn hidden feature set in each layer important for each classification.

**Class of non-linear function for non-trivial learning**: linear in input stimulus over some region with thresholding behavior or saturation nonlinearity behavior.
- Leaky ReLU
- tanh (runs into some issues)
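For reference, these activations are short to write in NumPy. The leaky slope of 0.01 is a commonly used default and is an assumption here, not something the slides specify.

```python
import numpy as np

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope (assumed 0.01)."""
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

print(relu(z))        # negative entries are switched off
print(leaky_relu(z))  # negative entries are shrunk, not zeroed
print(np.tanh(z))     # saturates toward -1 and 1 for large |z|
```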
    
There are other activation functions; [see here](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a). 

<center><img src = "Images/activation_func.png" >
Typical choices for activation function $g$</center>

**Multiple target classes with one hidden layer**

<img src = "images/activations_nn_layer.png" >

The second layer now computes a vector:

$$ \hat{\textbf{y}} = \textbf{a}^{[2]} = \text{softmax}( W^{[2]T} \textbf{a}^{[1]} + \textbf{b}^{[2]} )  $$

Taking in the previous layer's activations.

What is the shape of $\textbf{W}^{[2]}$ now? Is it a matrix?

**Unpacking softmax**

The second layer now computes a vector:

$$ \hat{\textbf{y}} = \textbf{a}^{[2]} = \text{softmax}( W^{[2]T} \textbf{a}^{[1]} + \textbf{b}^{[2]} )  $$

Taking in the previous layer's activations.

<center><img src = "Images/softmax.png" width = 400></center>
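A minimal softmax sketch, assuming class scores are stored as a column vector (one row per class). Subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result. The scores are made-up values.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; the max shift avoids overflow in exp."""
    shifted = z - np.max(z, axis=0, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

z2 = np.array([[2.0], [1.0], [0.1]])  # hypothetical scores for 3 classes
y_hat = softmax(z2)

print(y_hat.ravel(), y_hat.sum())
```

The outputs are positive and sum to 1, so they can be read as class probabilities; the predicted class is the row with the largest value.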

**Extending to multiple hidden layers**
- this is where our networks can start to learn complex, high level representations.

<img src = "images/hiddenrep_manylayers.png" >

- sequence of layers of linear (affine) transformations + activation functions for that layer:
$$ \textbf{z}^{[l]} = \textbf{W}^{[l]T}\textbf{a}^{[l-1]} + \textbf{b}^{[l]}$$
$$ \textbf{a}^{[l]} = g_l(\textbf{z}^{[l]})$$

What is the shape of $\textbf{W}^{[3]T}$?

Understanding shapes of inputs and weight matrices important in diagnosing errors during NN execution.
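One way to keep track of shapes is to loop over the layers and print them. The layer sizes below (3 inputs, hidden layers of 5 and 4 nodes, 1 output) and the choice of ReLU for the hidden layers are assumptions for illustration and may not match the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [3, 5, 4, 1]   # n, n^{[1]}, n^{[2]}, n^{[3]} (assumed sizes)
rng = np.random.default_rng(3)

# W^{[l]} has shape (n^{[l-1]}, n^{[l]}), so W^{[l]T} is (n^{[l]}, n^{[l-1]}).
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]
activations = [relu, relu, sigmoid]   # g_1, g_2, g_3

a = rng.normal(size=(layer_sizes[0], 1))   # a^{[0]} is the input x
for l, (W, b, g_l) in enumerate(zip(weights, biases, activations), start=1):
    z = W.T @ a + b      # z^{[l]} = W^{[l]T} a^{[l-1]} + b^{[l]}
    a = g_l(z)           # a^{[l]} = g_l(z^{[l]})
    print(f"layer {l}: W.T {W.T.shape}, a {a.shape}")
```

With these assumed sizes, $W^{[3]T}$ has shape (1, 4): one row per node in layer 3, one column per node in layer 2.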

**A note about model interpretability**
- Neural network learns hidden layer features during optimization.
- Don't really know/control what hidden layers are finding: **black box** 
    - inputs/outputs in hidden layers are hidden

### The Optimization Algorithm

Because we need to fit those weights...

**Forward Propagation**

- Pass input: compute function map layer by layer using weights
- yield output $\hat{y}$ (regression target, class label, etc.)
- Compute cost function for each data point $L(\hat{y},y)$


For the $i^{th}$ training sample:

<center><img src = "Images/forward_prop_cost.png" width = 700></center>

We aggregate the cost over training samples:

The many-sample cost function:
$$ 
J(\{W^{[l]}\}, \{b^{[l]}\}) = \frac{1}{m} \sum_{i = 1}^m L(\hat{y_i}, y_i)$$ 

where $m$ is the trainset size used during forward propagation.

I could do this by a for loop:
- feed in one train sample $x^{(i)}$ at a time
- compute $ L(\hat{y_i}, y_i)$
- aggregate to get $ J(\{W^{[l]}\}, \{b^{[l]}\} )$
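The loop version can be sketched as follows. The network (one ReLU hidden layer, sigmoid output), the binary cross-entropy loss, and the random data are all assumptions for illustration; the weights are scaled down so the predictions stay away from exactly 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_one(x, W1, b1, W2, b2):
    """Forward pass for a single sample x of shape (n, 1)."""
    a1 = np.maximum(0.0, W1.T @ x + b1)
    return sigmoid(W2.T @ a1 + b2).item()

def loss(y_hat, y):
    """Binary cross-entropy for one sample."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(4)
m, n = 5, 3
samples = [rng.normal(size=(n, 1)) for _ in range(m)]   # x^{(1)}, ..., x^{(m)}
targets = [0, 1, 1, 0, 1]                               # made-up labels

W1 = 0.1 * rng.normal(size=(n, 4))
b1 = np.zeros((4, 1))
W2 = 0.1 * rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))

total = 0.0
for x_i, y_i in zip(samples, targets):
    y_hat_i = forward_one(x_i, W1, b1, W2, b2)   # feed in one sample
    total += loss(y_hat_i, y_i)                  # per-sample loss
J = total / m                                    # aggregate into the cost

print(J)
```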

For the $i^{th}$ training sample:

<center><img src = "Images/forward_prop_cost.png" width = 700></center>

Sometimes done this way. But we can also put multiple samples in at one time:
- vectorize operations across training set. 

Vector $x^{(i)}$ for a single sample:

$$ x^{(i)} = \begin{bmatrix}
           x^{(i)}_{1} \\
           x^{(i)}_{2} \\
           \vdots \\
           x^{(i)}_{n} \\
         \end{bmatrix}$$

Construct a matrix of inputs:

$$ X = \begin{bmatrix}
           x^{(1)}_{1} & \cdots &  x^{(m)}_{1}\\
           x^{(1)}_{2} & \cdots &  x^{(m)}_{2}\\
           \vdots & \vdots & \vdots \\
           x^{(1)}_{n} & \cdots &  x^{(m)}_{n}\\
         \end{bmatrix} = \begin{bmatrix}
           \mid & \mid &        & \mid \\
           x^{(1)} & x^{(2)} & \cdots &  x^{(m)}\\
           \mid & \mid & & \mid\\
         \end{bmatrix} $$


- $n$ is feature dimensionality
- $m$ is training set size fed into the network

Each layer now computes a matrix multiplication across training samples:



$$ \textbf{Z}^{[1]} = W^{[1]T} \textbf{X} + \textbf{b}^{[1]} \\ = \begin{bmatrix}
           z^{(1)}_{1} & \cdots &  z^{(m)}_{1}\\
           z^{(1)}_{2} & \cdots &  z^{(m)}_{2}\\
           \vdots & \vdots & \vdots \\
           z^{(1)}_{n^{[1]}} & \cdots &  z^{(m)}_{n^{[1]}}\\
         \end{bmatrix}  = \left[
  \begin{array}{cccc}
    w_{11} & w_{12} & \cdots & w_{1n}\\
    w_{21} & w_{22} & \cdots & w_{2n}\\
    \vdots & \vdots & \ddots & \vdots \\    
    w_{n^{[1]}1} & w_{n^{[1]}2} & \cdots & w_{n^{[1]}n}
  \end{array}
\right] 
\begin{bmatrix}
           x^{(1)}_{1} & \cdots &  x^{(m)}_{1}\\
           x^{(1)}_{2} & \cdots &  x^{(m)}_{2}\\
           \vdots & \vdots & \vdots \\
           x^{(1)}_{n} & \cdots &  x^{(m)}_{n}\\
         \end{bmatrix} 
+ \begin{bmatrix}
           b_{1}^{[1]} \\
           b_{2}^{[1]} \\
           \vdots \\
           b_{n^{[1]}}^{[1]} 
           \\
         \end{bmatrix}$$ 
         
- where $n^{[1]}$ represents the number of nodes in hidden layer 1
- $n$ is input feature dimensionality
- now, note the extension via columns to account for samples fed in

Then compute activation $g$ element-wise across matrix $Z^{[1]}$:

$$ A^{[1]} = g(Z^{[1]}) $$

This extension is done for every layer in the network.

- sequence of layers of linear (affine) transformations + activation functions for that layer:
$$ \textbf{Z}^{[l]} = \textbf{W}^{[l]T}\textbf{A}^{[l-1]} + \textbf{b}^{[l]}$$
$$ \textbf{A}^{[l]} = g_l(\textbf{Z}^{[l]})$$
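A sketch of the vectorized forward pass, assuming a ReLU hidden layer and a sigmoid output with random placeholder weights. The bias column vector is broadcast by NumPy across all $m$ sample columns, and the result is checked against feeding the samples in one column at a time.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n, m = 3, 6
X = rng.normal(size=(n, m))    # each column is one training sample

W1 = rng.normal(size=(n, 4))
b1 = rng.normal(size=(4, 1))
W2 = rng.normal(size=(4, 1))
b2 = rng.normal(size=(1, 1))

# Vectorized: all m samples pass through each layer at once.
Z1 = W1.T @ X + b1     # shape (4, m); b1 broadcasts across the columns
A1 = relu(Z1)
Z2 = W2.T @ A1 + b2    # shape (1, m)
A2 = sigmoid(Z2)

# Same computation, one sample (column) at a time.
A2_loop = np.hstack([
    sigmoid(W2.T @ relu(W1.T @ X[:, [i]] + b1) + b2) for i in range(m)
])

print(Z1.shape, A2.shape, np.allclose(A2, A2_loop))
```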

<center><img src = "Images/forward_prop_matrices.png" width = 700></center>

In the end we get a matrix $A^{[L]}$ which has predicted values for each sample as columns:

$$ A^{[L]} = \begin{bmatrix}
            \mid & \mid &        & \mid \\
           \hat{y}^{(1)} & \hat{y}^{(2)} & \cdots &  \hat{y}^{(m)}\\
           \mid & \mid & & \mid\\
         \end{bmatrix} $$

Which, along with the true target matrix: 

$$ Y = \begin{bmatrix}
            \mid & \mid &        & \mid \\
           y^{(1)} & y^{(2)} & \cdots &  y^{(m)}\\
           \mid & \mid & & \mid\\
         \end{bmatrix} $$
         
can be used to compute $ J(\{W^{[l]}\}, \{b^{[l]}\} )$
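For a binary classifier, the cost can then be computed from the two matrices in a couple of lines. The predictions and targets below are made-up values, and binary cross-entropy is assumed as the per-sample loss $L$.

```python
import numpy as np

A_L = np.array([[0.9, 0.2, 0.7, 0.4]])   # hypothetical predictions, shape (1, m)
Y = np.array([[1, 0, 1, 0]])             # true targets, shape (1, m)

m = Y.shape[1]

# Per-sample binary cross-entropy, computed element-wise across columns.
losses = -(Y * np.log(A_L) + (1 - Y) * np.log(1 - A_L))

# Cost J: average of the per-sample losses.
J = losses.sum() / m

print(J)
```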

**Now: need to optimize weights**

**Gradient descent algorithm**
- Compute $\frac{\partial J}{\partial W^{[l]}}$ and $\frac{\partial J}{\partial b^{[l]}}$ for each layer $l$.
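- Update each weight matrix and bias vector, with learning rate $\alpha$:
$$  W^{[l]} \rightarrow W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}$$
$$  b^{[l]} \rightarrow b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}$$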

- Lower $ J(\{W^{[l]}\}, \{b^{[l]}\} )$
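The update rule itself is short once the gradients are available. In this sketch the helper `gradient_descent_step`, the parameter names, and the gradient values are all hypothetical; in practice the gradients would come from backpropagation.

```python
import numpy as np

def gradient_descent_step(params, grads, alpha):
    """Return updated parameters: theta <- theta - alpha * dJ/dtheta."""
    return {name: value - alpha * grads[name] for name, value in params.items()}

# Hypothetical parameters and made-up gradients for one layer.
params = {"W1": np.ones((3, 4)), "b1": np.zeros((4, 1))}
grads = {"W1": np.full((3, 4), 0.5), "b1": np.full((4, 1), -1.0)}

new_params = gradient_descent_step(params, grads, alpha=0.1)

print(new_params["W1"][0, 0], new_params["b1"][0, 0])
```

Each parameter moves against its gradient, scaled by the learning rate: a positive gradient lowers the parameter, a negative one raises it.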

#### Backpropagation

The **backpropagation** algorithm: efficiently computes the gradients used to adjust the parameters (weights and biases) toward a better result.

Turns out: computing each gradient $\frac{\partial J}{\partial W^{[l]}}$ in a straightforward manner is prohibitively expensive.


<img src = "Images/backpropagation.png" width = 500 >

- Uses chain rule in calculus to break up calculation.
- Recursion in reverse graph traversal / storing derivatives makes this fast.

To compute $\frac{\partial L}{\partial w_1} =  \frac{\partial L}{\partial h}\frac{\partial h}{\partial w_1}$ 

- Already have $\frac{\partial L}{\partial h}$
- Just compute  $\frac{\partial h}{\partial w_1}$
- Multiply by $\frac{\partial L}{\partial h}$
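The chain-rule step can be checked on a tiny hypothetical graph, assumed here to be $h = w_1 x$ followed by $L = (h - y)^2$ (the graph in the figure may differ). The analytic gradient is compared with a finite-difference estimate.

```python
# Hypothetical scalar values, for illustration only.
x, y, w1 = 2.0, 1.0, 1.5

# Forward pass through the assumed graph.
h = w1 * x
L = (h - y) ** 2

# Backward pass: reuse dL/dh, then multiply by the local derivative dh/dw1.
dL_dh = 2 * (h - y)      # already available from the later node
dh_dw1 = x               # local derivative of h = w1 * x
dL_dw1 = dL_dh * dh_dw1  # chain rule

# Finite-difference check of the same derivative.
eps = 1e-6
L_plus = ((w1 + eps) * x - y) ** 2
L_minus = ((w1 - eps) * x - y) ** 2
numeric = (L_plus - L_minus) / (2 * eps)

print(dL_dw1, numeric)
```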

<img src = "Images/backpropagation.png" width = 500 >

<img src = "Images/simplegraph_chainrule.png" >

Then forward prop again. Repeat back prop...cycle until $J$ converges:

<img src = "Images/backprop.gif" width = 500>

**Update made to weights after computing all gradients traversing in backpropagation.**
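Putting the pieces together, the forward prop / back prop / update cycle can be sketched for a small network. Everything specific here is an assumption for illustration: the XOR toy data, the 2-4-1 layer sizes, tanh as the hidden activation, the learning rate `alpha`, and the number of steps. It is not meant as a production training loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical toy problem (XOR), samples stored as columns.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # shape (n=2, m=4)
Y = np.array([[0.0, 1.0, 1.0, 0.0]])   # shape (1, m)
m = X.shape[1]

W1 = rng.normal(size=(2, 4))
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))
alpha = 0.5   # learning rate (assumed value)

costs = []
for step in range(2000):
    # Forward propagation.
    A1 = np.tanh(W1.T @ X + b1)        # hidden activations, (4, m)
    A2 = sigmoid(W2.T @ A1 + b2)       # predictions, (1, m)
    J = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    costs.append(float(J))

    # Backpropagation: chain rule applied from the output back to the input.
    dZ2 = (A2 - Y) / m                 # dJ/dZ2 for sigmoid + cross-entropy
    dW2 = A1 @ dZ2.T                   # same shape as W2
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2 @ dZ2                     # gradient flowing into the hidden layer
    dZ1 = dA1 * (1 - A1 ** 2)          # tanh'(z) = 1 - tanh(z)^2
    dW1 = X @ dZ1.T                    # same shape as W1
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Update all parameters only after every gradient has been computed.
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

print(costs[0], costs[-1])
```

Plotting `costs` against the step number is a simple way to see whether $J$ is converging.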

Explanation of backpropagation by 3Blue1Brown (part of a full playlist): [Backpropagation calculus | Deep learning, chapter 4](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)