# Serverless Neural Network Introduction
## Architecture Overview

The **"Serverless Neural Network"** is a solution for generating a Cloud Native image classifier, using idempotent Amazon Web Services ([AWS](https://aws.amazon.com/what-is-aws/)) Lambda Functions. The **"SNN"** is used to train a model to predict whether a particular image is a **"cat"** vs. **"non-cat"** and fits within an overall prediction pipeline as the model training process. The primary objective of using Lambda Functions, as opposed to other services like **SageMaker** or dedicated Machine Learning Frameworks like **MXNet** or **TensorFlow**, is to remove the abstraction layer that inevitably perpetuates the concept that Neural Networks or Deep Learning is a "black box" architecture and thus somewhat difficult to understand.

<img src="images/Prediction_Architecture.png" style="width:800px;height:500px;">
<caption><center>**Machine Learning Pipeline**</center></caption><br>

By simulating the individual Neurons and the mathematical functions they perform, it's easier to learn exactly what each neuron is doing and how it contributes to optimizing (or "learning") the overall parameters used for the final prediction model. Additionally, leveraging this framework for model training will hopefully provide a more in-depth understanding of how each neuron deals with the "vectorized" matrix calculations during the Forward Propagation process **AND** the gradient derivative calculations during the Backward Propagation process.

## Neural Network Overview

<img src="images/2layerNN_kiank.png" style="width:800px;height:500px;">
<caption><left>[*image source](https://www.deeplearning.ai)</left></caption><br>

The Neural Network Model (shown above) can be summarized as:
  
**INPUT --> LINEAR/RELU --> LINEAR/SIGMOID --> OUTPUT**  

- The **Input** is a $(64, 64, 3)$ image that is flattened to a vector of size $(12288, 1)$. See the **Data Overview** Section.
- The corresponding vector: $[x_{0}, x_{1}, \dots, x_{12287}]^T$ is then multiplied by the **weight matrix** $W^{[1]}$ of size $(n^{[1]}, 12288)$.
- The **bias** term is then added and the **ReLU** (non-linear activation) is applied to get the vector $[a^{[1]}_0, a^{[1]}_1, \dots, a^{[1]}_{n^{[1]}-1}]^T$.
- The process is then repeated for the next layer, by taking the resulting vector and multiplying it by the weight matrix $W^{[2]}$ and then adding the intercept (**bias**).
- Lastly, the **sigmoid** activation is applied to the result. If the result is greater than $0.5$, it is classified as a **Cat**.
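
The flattening described in the first bullet can be sketched as follows. This is only an illustration using NumPy: the random `images` batch and the batch size `m = 5` are made up for the example and are not the project's dataset.

```python
import numpy as np

# Hypothetical batch of m = 5 RGB images, each 64 x 64 pixels (random data,
# used only to illustrate the shapes).
m = 5
images = np.random.default_rng(0).integers(0, 256, size=(m, 64, 64, 3))

# Flatten each image into one column, giving X of shape (12288, m).
X = images.reshape(m, -1).T

print(X.shape)  # (12288, 5)
```

Each column of `X` is one image, which is the layout the weight matrix $W^{[1]}$ of size $(n^{[1]}, 12288)$ expects.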

Therefore, the **Network Model Parameters** (*parameters.json*) for the above process are as follows:

```json
{
    "epochs": 10,
    "layers": 2,
    "activations": {
        "layer1": "relu",
        "layer2": "sigmoid"
    },
    "neurons": {
        "layer1": 3,
        "layer2": 1
    },
    "learning_rate": 0.0075
}
```
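
As a rough sketch of how such a configuration might be consumed, the snippet below parses the same JSON and derives the size of every layer. The helper name `layer_dims` and the inlined JSON string are assumptions made for illustration; they are not taken from the project code.

```python
import json

# The configuration shown above, inlined so the sketch is self-contained.
# In practice it would be read from parameters.json.
PARAMETERS_JSON = """
{
    "epochs": 10,
    "layers": 2,
    "activations": {"layer1": "relu", "layer2": "sigmoid"},
    "neurons": {"layer1": 3, "layer2": 1},
    "learning_rate": 0.0075
}
"""


def layer_dims(parameters, n_x):
    """Return [n_x, n1, ..., nL] from the configuration (hypothetical helper)."""
    dims = [n_x]
    for l in range(1, parameters["layers"] + 1):
        dims.append(parameters["neurons"][f"layer{l}"])
    return dims


parameters = json.loads(PARAMETERS_JSON)
dims = layer_dims(parameters, n_x=12288)  # 12288 = 64 * 64 * 3 input features
print(dims)  # [12288, 3, 1]
```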

## Neural Network Implementation
To implement the Neural Network using the **SNN** framework and the above Network configuration, the workflow is comprised of five key steps:  
1. Network Initialization.
2. Forward Propagation.
3. Calculate the Loss (Cost Function).
4. Backward Propagation.
5. Parameter Optimization (Gradient Descent).

The outcome of the above stages provides the optimal model parameters, for use in final prediction, as can be seen in the process diagram below.  

<img src="images/final_outline.png" style="width:800px;height:500px;">
<caption><left>[*image source](https://www.deeplearning.ai)</left></caption><br>

The next sections describe each phase in more detail.

### Network Initialization
For an **L-Layer** network, the *Weights* and *Bias* must be initialized for each individual layer, so the dimensions of these matrices must match the dimensions of each layer. For example, if $n^{[l]}$ is the number of hidden units (neurons) in layer $l$ and the size of the input $X$ is $(12288, 209)$, for $m = 209$ training examples, then:


|               	|      **Shape of W**      	|  **Shape of b**  	|                 **Activation**                	| **Shape of Activation** 	|
|---------------	|:------------------------:	|:----------------:	|:---------------------------------------------:	|:-----------------------:	|
| **Layer 1**   | $(n^{[1]},12288)$        | $(n^{[1]},1)$    | $Z^{[1]} = W^{[1]} X + b^{[1]}$               | $(n^{[1]},209)$         |
| **Layer 2**   | $(n^{[2]}, n^{[1]})$     | $(n^{[2]},1)$    | $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$         | $(n^{[2]}, 209)$        |
| $\vdots$      | $\vdots$                 | $\vdots$         | $\vdots$                                      | $\vdots$                |
| **Layer L-1** | $(n^{[L-1]}, n^{[L-2]})$ | $(n^{[L-1]}, 1)$ | $Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}$ | $(n^{[L-1]}, 209)$      |
| **Layer L**   | $(n^{[L]}, n^{[L-1]})$   | $(n^{[L]}, 1)$   | $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$       | $(n^{[L]}, 209)$        |



When we compute $W X + b$ in Python (with NumPy), broadcasting is carried out: the column vector $b$ is added to every column of $WX$. For example, if:  

$$
W = \begin{bmatrix}
j  & k  & l \\\
m  & n & o  \\\
p  & q & r
\end{bmatrix}
\space
\space
X = \begin{bmatrix}
a  & b  & c \\\
d  & e & f \\\
g  & h & i 
\end{bmatrix}
\space
\space
b =\begin{bmatrix}
s  \\\
t  \\\
u
\end{bmatrix}$$

Then $WX + b$ will be:

$$
WX + b = \begin{bmatrix}
(ja + kd + lg) + s  & (jb + ke + lh) + s  & (jc + kf + li)+ s\\\
(ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t\\\
(pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri)+ u
\end{bmatrix}$$
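
The same behaviour can be checked numerically. The sketch below uses NumPy with small made-up values in place of the symbolic entries above.

```python
import numpy as np

# Small numeric stand-ins for the symbolic matrices above (illustrative values).
W = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
X = np.array([[1., 0., 2.],
              [0., 1., 0.],
              [3., 0., 1.]])
b = np.array([[10.],
              [20.],
              [30.]])  # shape (3, 1)

# b has a single column, so NumPy broadcasts it across all columns of W @ X.
Z = W @ X + b
print(Z)
# [[20. 12. 15.]
#  [42. 25. 34.]
#  [64. 38. 53.]]
```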

To start the initialization process, the *Weights* are initialized randomly from a "standard" normal distribution with a mean of $0$ and a standard deviation of $1$. To further constrain the weights to be close to zero **but** not exactly zero (for *symmetry breaking*), each random weight is multiplied by $0.01$. The *Bias* is initialized to zero and is also multiplied by $0.01$, which leaves it at zero.
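
A minimal sketch of this initialization, assuming NumPy and a list of layer sizes such as `[12288, 3, 1]`. The function name `initialize_parameters`, the `W1`/`b1` key naming and the fixed `seed` are illustrative choices, not necessarily what the Lambda Functions use.

```python
import numpy as np


def initialize_parameters(layer_dims, seed=1):
    """Random small weights and zero biases for each layer (illustrative)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # Standard normal samples scaled by 0.01: close to zero, but not zero.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # Zeros scaled by 0.01 are still zeros.
        params[f"b{l}"] = np.zeros((layer_dims[l], 1)) * 0.01
    return params


params = initialize_parameters([12288, 3, 1])
print({name: value.shape for name, value in params.items()})
```

The resulting shapes match the table above: $W^{[1]}$ is $(3, 12288)$, $b^{[1]}$ is $(3, 1)$, $W^{[2]}$ is $(1, 3)$ and $b^{[2]}$ is $(1, 1)$.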

### Forward Propagation
The *Forward Propagation* step of the process is comprised of two separate pieces: the **Linear** activation, followed by the **Non-Linear** activation. In the final layer, the *Sigmoid* activation constrains the output to between $0$ and $1$.  

#### Linear Activation
The *Linear* part of the activation computes the following equation:  
$$Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}$$

Where $A^{[0]} = X$.

>**Note:** It is important to cache the Linear Activations ($Z$) for later use in the Backward Propagation process.
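
The linear step and its cache could look roughly like this in NumPy. The function name `linear_forward` and the layout of the `cache` dictionary are assumptions for illustration only.

```python
import numpy as np


def linear_forward(A_prev, W, b):
    """Compute Z = W . A_prev + b and keep what backpropagation needs later."""
    Z = W @ A_prev + b
    # Hypothetical cache layout: inputs plus Z, stored for the backward pass.
    cache = {"A_prev": A_prev, "W": W, "b": b, "Z": Z}
    return Z, cache


# Tiny example: 2 input features, 2 training examples, 1 neuron.
A0 = np.array([[1., 2.],
               [3., 4.]])
W1 = np.array([[1., -1.]])
b1 = np.array([[0.5]])

Z1, cache1 = linear_forward(A0, W1, b1)
print(Z1)  # [[-1.5 -1.5]]
```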

#### Non-Linear Activation
The **L-Layer** Neural Network implements two different non-linear activation functions:  

- **Rectified Linear Unit (ReLU):** The mathematical formula for the *ReLU* function is $A = ReLU(Z) = \max(0, Z)$.  
- **Sigmoid:** The mathematical formula for the *Sigmoid* function is $\sigma(Z) = \sigma(W\cdot A+b) = \frac{1}{1 + e^{-Z}}$.  
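
Both activations are one-liners in NumPy. The sketch below is a minimal illustration; the function names are chosen for this example.

```python
import numpy as np


def relu(Z):
    """Element-wise max(0, Z)."""
    return np.maximum(0, Z)


def sigmoid(Z):
    """Element-wise 1 / (1 + exp(-Z)), mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-Z))


Z = np.array([[-1.0, 0.0, 2.0]])
print(relu(Z))     # [[0. 0. 2.]]
print(sigmoid(Z))  # the middle entry is exactly 0.5
```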

### Loss
**Cross Entropy** is commonly used as a loss function in binary classification (labels are assumed to take values $0$ or $1$). It is computed by:

$$\mathcal{L} = -\frac{1}{m} \sum\limits_{i = 1}^{m} \big[y^{(i)}\cdot\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\cdot\log\left(1-a^{[L](i)}\right)\big]$$

Where $a^{[L](i)}$ is the output of the last layer of the network for the $i$-th training example and is synonymous with $\hat{y}^{(i)}$.

Cross entropy measures the divergence between two probability distributions: a large cross entropy means that the two distributions differ greatly, while a small cross entropy means that they are similar to each other. Compared with a quadratic cost function, the cross-entropy cost function generally has the advantages of faster convergence and a greater likelihood of reaching the global optimum. For the mathematical details, see [Wikipedia](https://en.wikipedia.org/wiki/Cross_entropy).
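
The cost formula above translates almost directly into NumPy. This is a sketch under the assumption that `AL` and `Y` both have shape $(1, m)$ and that `AL` lies strictly between $0$ and $1$; the name `compute_cost` is illustrative.

```python
import numpy as np


def compute_cost(AL, Y):
    """Binary cross-entropy averaged over the m training examples."""
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)


Y = np.array([[1, 0]])
undecided = np.array([[0.5, 0.5]])  # the network has no preference
confident = np.array([[0.9, 0.1]])  # the network is confident and correct

print(compute_cost(undecided, Y))  # ln(2), about 0.693
print(compute_cost(confident, Y))  # smaller, since predictions match labels
```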

### Backward Propagation
*Backward Propagation* is used to calculate the gradient of the *Loss* function with respect to the various parameters, as follows:

<img src="images/backprop_kiank.png" style="width:800px;height:500px;">
<caption><left>[*image source](https://www.deeplearning.ai)</left></caption><br>

#### Non-Linear Derivative
As with *Forward Propagation*, there are two non-linear activation derivatives, one for *Sigmoid* and one for *ReLU*. If $g(\cdot)$ is the activation function, then both compute the following, where $\odot$ denotes element-wise multiplication:
$$dZ^{[l]} = \frac{\partial\mathcal{L}}{\partial Z^{[l]}} = dA^{[l]} \odot g^{'}(Z^{[l]})$$
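
For the two activations used here, $g'(Z)$ has a simple closed form: $1$ where $Z > 0$ (and $0$ elsewhere) for *ReLU*, and $\sigma(Z)(1 - \sigma(Z))$ for *Sigmoid*. A minimal NumPy sketch, with illustrative function names:

```python
import numpy as np


def relu_backward(dA, Z):
    """dZ = dA * g'(Z), where g'(Z) is 1 for Z > 0 and 0 otherwise."""
    return dA * (Z > 0)


def sigmoid_backward(dA, Z):
    """dZ = dA * s * (1 - s), where s = sigmoid(Z)."""
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)


dA = np.array([[1.0, 2.0, 3.0]])
Z = np.array([[-1.0, 0.0, 2.0]])

print(relu_backward(dA, Z))     # [[0. 0. 3.]]
print(sigmoid_backward(dA, Z))  # the middle entry is 2 * 0.25 = 0.5
```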

#### Linear Derivative
Once the derivative of the non-linear activation is computed, the derivatives with respect to $W^{[l]}$, $b^{[l]}$ and $A^{[l-1]}$ are computed from the input $dZ^{[l]}$, to get $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$ as follows:

$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$$  
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$$  
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$$  
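
The three formulas map onto NumPy as sketched below. The function name `linear_backward` and the argument order are assumptions for illustration.

```python
import numpy as np


def linear_backward(dZ, A_prev, W):
    """Gradients of the loss with respect to W, b and the previous activation."""
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m  # sum over the m examples
    dA_prev = W.T @ dZ
    return dA_prev, dW, db


# Tiny example: 1 neuron, 2 inputs, m = 2 training examples.
dZ = np.array([[1.0, -1.0]])
A_prev = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
W = np.array([[0.5, -0.5]])

dA_prev, dW, db = linear_backward(dZ, A_prev, W)
print(dW)       # [[-0.5 -0.5]]
print(db)       # [[0.]]
print(dA_prev)  # [[ 0.5 -0.5]
                #  [-0.5  0.5]]
```

Note that each gradient has the same shape as the quantity it differentiates: `dW` matches `W`, `db` matches `b`, and `dA_prev` matches `A_prev`.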

### Parameter Update using Gradient Descent
The Model parameters ($W^{[l]}$ and $b^{[l]}$) are updated by **Gradient Descent** with the following formulas:  

$$W^{[l]} = W^{[l]} - \alpha\cdot dW^{[l]}$$  
$$b^{[l]} = b^{[l]} - \alpha\cdot db^{[l]}$$  

Where $\alpha$ is the *Learning Rate*.
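
A sketch of a single update step, assuming the parameters and gradients are stored in dictionaries keyed `W1`, `b1`, ... and `dW1`, `db1`, ... (a naming convention chosen for this example):

```python
import numpy as np


def update_parameters(params, grads, learning_rate):
    """One gradient descent step applied to every layer's W and b."""
    num_layers = len(params) // 2  # each layer contributes one W and one b
    updated = {}
    for l in range(1, num_layers + 1):
        updated[f"W{l}"] = params[f"W{l}"] - learning_rate * grads[f"dW{l}"]
        updated[f"b{l}"] = params[f"b{l}"] - learning_rate * grads[f"db{l}"]
    return updated


params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[2.0, -2.0]]), "db1": np.array([[1.0]])}

new_params = update_parameters(params, grads, learning_rate=0.5)
print(new_params["W1"])  # [[0. 3.]]
print(new_params["b1"])  # [[0.]]
```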

### Prediction
Once the parameters have been fitted using *Gradient Descent*, they can be used to predict whether a new image is classified as a **cat** or **non-cat** image. For further information on how the accuracy of the trained model fares against testing data or unseen data, see the **Analysis** Notebook.
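
For the two-layer network described earlier, prediction amounts to one forward pass followed by a $0.5$ threshold. The sketch below uses hand-picked toy parameters rather than trained ones, and the function name `predict` and the `W1`/`b1` key naming are assumptions for illustration.

```python
import numpy as np


def predict(X, params):
    """Forward pass (LINEAR/RELU then LINEAR/SIGMOID); 1 = cat, 0 = non-cat."""
    A1 = np.maximum(0, params["W1"] @ X + params["b1"])
    A2 = 1.0 / (1.0 + np.exp(-(params["W2"] @ A1 + params["b2"])))
    return (A2 > 0.5).astype(int)


# Toy parameters for a network with 2 inputs and 2 hidden neurons (not trained).
toy_params = {
    "W1": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "b1": np.zeros((2, 1)),
    "W2": np.array([[1.0, -1.0]]),
    "b2": np.zeros((1, 1)),
}
X_new = np.array([[2.0, 0.0],
                  [0.0, 2.0]])  # two examples, one per column

print(predict(X_new, toy_params))  # [[1 0]]
```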

---
## Next: Code Overview
Now that the **Serverless Neural Network** has been introduced, it's time to review the Python code that makes up the various Lambda Functions in the [**Codebook**](./Codebook.ipynb).