
# Deep Learning

## Deep learning - Linear Regression

A set of input features and weights is used to calculate a real-number output

* $ x_0 $ is a constant input fixed at 1, so $ w_0 $ acts as the intercept
* $ y = w_0 x_0 + w_1 x_1 + ... + w_m x_m = \sum_{i=0}^m w_i x_i $

Developing the model is a matter of assigning the weights to the features.

* [Example notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/linear_cost_example.ipynb)

Notes based on the notebook.

* Simple data set - one feature, target is a line plus noise
* Create a data set with some random noise - straight line with noise
* Fit with linear regression
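
The steps above can be sketched with NumPy. This is a minimal illustration, not the notebook's code: the true slope and intercept (2.0 and 5.0), the noise level, and the sample count are made-up values, and the fit uses NumPy's least-squares solver rather than any particular library from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: y = 2x + 5 plus Gaussian noise
true_w1, true_w0 = 2.0, 5.0
x = np.linspace(0, 10, 150)
y = true_w1 * x + true_w0 + rng.normal(0, 1.0, size=x.shape)

# Design matrix with x_0 = 1 so that w_0 plays the role of the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit: the weights that minimize the squared error
w, *_ = np.linalg.lstsq(X, y, rcond=None)
w0_hat, w1_hat = w
print(f"intercept ~ {w0_hat:.2f}, slope ~ {w1_hat:.2f}")
```

The recovered weights should land close to the values used to generate the data, but not exactly on them because of the noise.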

How did the algorithm determine the weights?

* Need a way to measure how close a prediction is to ground truth
    * Squared loss error function (aka mean squared error)
    * Loss function measures how close the predicted value is to ground truth
    
* Gradient descent optimizer used to determine the weights
    * Plot the loss at different weights - plot is parabolic
    * Algorithm starts with random weights
    * Gradient (slope) of the curve tells us which way to move the weight (larger or smaller) to decrease the loss
    * Negative slope - increasing weight moves downhill
    * Positive slope - decreasing weight moves downhill
    
* Magnitude of weight adjustments
    * Learning rate determines the size of the weight adjustment, tradeoff is number of iterations vs ability to converge on the optimal weight
    * [More info on optimizing gradient descent](https://ruder.io/optimizing-gradient-descent/)
    * Some adjust learning rate based on degree of slope, some use momentum, etc.
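
As a rough sketch of the idea, the loop below runs plain gradient descent on a single weight with a squared loss. The data, the true weight of 3.0, the learning rate, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, size=x.shape)  # hypothetical true weight: 3.0

def mse_loss(w):
    """Mean squared error of the prediction w * x."""
    return np.mean((w * x - y) ** 2)

def mse_grad(w):
    """Slope of the loss curve at w: d/dw mean((w*x - y)^2)."""
    return np.mean(2 * (w * x - y) * x)

w = rng.normal()       # start from a random weight
learning_rate = 0.5    # size of each adjustment
for _ in range(200):
    # Negative slope -> w increases; positive slope -> w decreases
    w -= learning_rate * mse_grad(w)

print(f"learned weight ~ {w:.3f}, loss ~ {mse_loss(w):.4f}")
```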
    
Gradient descent modes

* Batch
    * Compute loss for all training examples
    * Adjust weight
    * Example: 150 samples in training set. In each pass over the training set (epoch), weight is adjusted once
* Stochastic
    * Compute loss for next example
    * Adjust weight
    * Example: 150 samples in training set. In each pass over the training set, weight is adjusted 150 times
* Mini-batch
    * Compute loss for a specified number of examples
    * Adjust weight
    * Example: 150 samples in training set, mini-batch size is 15. In each pass over the training set, weight is adjusted 10 times.
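
The arithmetic behind the three examples can be checked with a few lines. The helper name `updates_per_pass` is hypothetical; it just counts how many batches fit in one pass over the training set.

```python
import math

def updates_per_pass(n_samples, batch_size):
    """Number of weight updates in one pass over the training set."""
    return math.ceil(n_samples / batch_size)

n = 150
batch_updates = updates_per_pass(n, batch_size=n)        # batch: all samples at once
stochastic_updates = updates_per_pass(n, batch_size=1)   # stochastic: one sample at a time
minibatch_updates = updates_per_pass(n, batch_size=15)   # mini-batch of 15

print(batch_updates, stochastic_updates, minibatch_updates)
```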

## Logistic Regression (Binary Classification)

The setup is similar to linear regression: we have a set of features, assign a weight to each feature, and sum the products of features and weights. But the target is 0 or 1, so for the output we want the probability that the example belongs to the positive class.

We can run the weighted sum through the sigmoid function, whose output is bounded between zero and one - see [this](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/logistic_cost_example.ipynb) notebook.

* Typically need to assign a cut off value for the output, for example anything greater than 0.5 is positive.
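
A small sketch of the sigmoid plus cut-off step, assuming NumPy. The weights and inputs are invented for illustration, and the first column of `X` is the constant $ x_0 = 1 $.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, cutoff=0.5):
    """Return (probabilities, class labels) for feature rows X and weights w."""
    prob = sigmoid(X @ w)
    return prob, (prob > cutoff).astype(int)

# Hypothetical weights and inputs
w = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 2.0]])
prob, label = predict(X, w)
print(prob, label)
```

Note that with a strict "greater than" cut-off, a probability of exactly 0.5 is classified as negative; whether to treat the boundary as positive is a design choice.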

The training objective with logistic regression is to select the weights that minimize misclassification.

* Use the logistic loss function, which has separate loss curves for positive and negative samples
* Like the squared loss, the logistic loss is convex (bowl-shaped) in the weights, so it indicates not only the loss at a given weight but also which direction to adjust the weight.
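
The two loss curves can be sketched as follows. This assumes the standard log loss, $ -\log(p) $ for positive samples and $ -\log(1 - p) $ for negative samples; the probabilities are arbitrary example values.

```python
import numpy as np

def logistic_loss(y_true, p):
    """Per-sample log loss: -log(p) for positives, -log(1 - p) for negatives."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

p = np.array([0.1, 0.5, 0.9])            # predicted probability of the positive class
loss_if_positive = logistic_loss(1, p)   # falls as p approaches 1
loss_if_negative = logistic_loss(0, p)   # falls as p approaches 0
print(loss_if_positive, loss_if_negative)
```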

How to find the optimal weights?

* Use the gradient descent optimizer

## Neural Networks

Linear models are simple and easy to understand, but typically underperform (underfit) on non-linear data. They require extensive feature engineering, and features need to be on a similar range and scale.

Linear models form the foundation for understanding neural networks. NN looks like stacking several logistic models, generalizing sigmoid with an activation function.

A 'neuron' is the weighted sum of the features passed through an activation function. At each layer the features can be connected to multiple neurons, with weights specific to each neuron.

* The neurons generate new features by combining existing ones, which are then inputs to the next layer of neurons.
* Basic architecture has an input layer, one or more hidden layers, and an output layer.

Benefits

* Automatic feature engineering - mixes features to create new ones
* Handles non-linear datasets
* Standard techniques to deal with overfitting (NNs overfit easily) - regularization, reducing model complexity, etc.

Activation Functions

* Introduce non-linearity into the model
* Improves ability of model to fit complex non-linear datasets
* Three popular activation functions: sigmoid, tanh, relu

Activation function notebook - see [here](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/activation_functions.ipynb)

* sigmoid - converts input to a number between 0 and 1
* tanh - output varies from -1 to 1
* relu - for negative input the output is 0, otherwise the output is the same as the input
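
The three functions are short enough to write out directly. This is a NumPy sketch with arbitrary sample inputs, not the code from the linked notebook.

```python
import numpy as np

def sigmoid(z):
    """Output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Output between -1 and 1."""
    return np.tanh(z)

def relu(z):
    """0 for negative input, otherwise the input itself."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```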

Deep learning - subset of machine learning that uses complex networks with many layers (sometimes hundreds). Why so popular?

* Traditional ML algorithms appear to saturate on how much they can learn. Having massive amounts of data does not translate to more learning
* Small NN can learn better. Medium NN can learn even more, and large NNs can keep learning with more data.

Binary classifier - send the output through a sigmoid function.

Multiclass classifier - use softmax to convert to array of probability scores for each class, sum of probs for all classes is 1.
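
A minimal softmax sketch, assuming NumPy. The raw scores are made-up values; subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # stability only; output is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # hypothetical scores for 3 classes
print(probs, probs.sum())
```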

Popular NN architectures

* General purpose
    * fully connected network
    * example: treats each pixel as a separate feature
* Convolutional Neural Network (CNN)
    * Useful for image analysis
    * Example: considers pixels and its surrounding pixels
* Recurrent NN
    * Looks at history
    * Used for timeseries prediction, natural language processing
    * Example: timeseries forecasting - model looks at current values and historical values
    

## MIT Introduction to Deep Learning - Lecture 1

Slides [here](http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L1.pdf)
Video [here](https://www.youtube.com/watch?v=5v1JnYv_yWs&feature=youtu.be)

### Why deep learning?

* Traditional ML algorithms - hand engineer features, which is time consuming, brittle, not scalable
* Can we learn the features directly from raw data?
    * lines and edges to eyes and noses to faces
* Why now?
    * big data (more and larger data sets, easier collection and storage), hardware (GPUs, parallel processing), software (improved techniques, new models, open source)
    
### The Perceptron - the structural block of deep learning

Feed-forward: inputs, weights, sum, non-linearity, output

$ \hat y = g(\sum_{i=1}^m x_i w_i)$ where $ g $ is the activation function

Note we also have another term, the bias term, which lets us shift the activation left or right:

$ \hat y = g(w_0 + \sum_{i=1}^m x_i w_i)$ where $ w_0 $ is the bias term.

We can rewrite this linear algebra style:

$ \hat y = g( w_0 + X^TW) $ where:

$ X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} $ and $ W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} $


Activation function - typically a non-linear function like the sigmoid function, e.g.

$ \sigma (z) =  \frac{\mathrm{1} }{\mathrm{1} + e^{-z} }  $

Can also be tanh or ReLU
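
The feed-forward computation $ \hat y = g(w_0 + X^TW) $ can be sketched in a few lines of NumPy. The input, weights, and bias below are invented so that the weighted sum comes out to zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, w0, g=sigmoid):
    """Dot product of inputs and weights, plus bias, through activation g."""
    return g(w0 + x @ w)

# Hypothetical values: 1 + (1*3 + 2*(-2)) = 0
x = np.array([1.0, 2.0])
w = np.array([3.0, -2.0])
w0 = 1.0
y_hat = perceptron(x, w, w0)
print(y_hat)
```

Passing a different function as `g` (for example `np.tanh`) swaps the activation without changing the rest of the computation.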


### Importance of Activation Functions

The purpose is to introduce non-linearities into the network, as many problems in the real world involve non-linearities.

* Think about nonlinear decision boundaries
    
    
### Building Neural Networks with Perceptrons

3 steps to computing the output of a perceptron: take the dot product, add a bias, apply a non-linearity

$ y = g(z) $ where $ z = w_0 + \sum_{j=1}^m x_j w_j $

Multi Output Perceptron

$ y_1 = g(z_1) $
$ y_2 = g(z_2) $
$ z_i = w_{0,i} + \sum_{j=1}^m x_j w_{j,i} $

Single Layer Neural Network

* Input layer, fully connected hidden layer, output layer (two outputs)

$ z_i = w^{(1)}_{0,i} + \sum_{j=1}^m x_j w^{(1)}_{j,i} $

and...

$ \hat {y_i} = g(w^{(2)}_{0,i} + \sum_{j=1}^{d_1} g(z_j) w^{(2)}_{j,i})$

Middle layer has $ z_1 ... z_{d_1} $ nodes

The hidden layer is learned: unlike the input and output layers, its values are not directly observable.

Another name for fully connected layers is a dense layer.
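
A NumPy sketch of the forward pass through one dense hidden layer and a two-output layer. It assumes the hidden layer applies the activation $ g $ to each $ z_j $ before passing it on; the layer sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, g=sigmoid):
    """Forward pass: input -> dense hidden layer -> dense output layer."""
    z = b1 + x @ W1           # hidden pre-activations z_1 .. z_d1
    hidden = g(z)             # hidden-layer outputs
    return g(b2 + hidden @ W2)

rng = np.random.default_rng(0)
m, d1, n_out = 3, 4, 2        # hypothetical sizes: inputs, hidden units, outputs
W1, b1 = rng.normal(size=(m, d1)), np.zeros(d1)
W2, b2 = rng.normal(size=(d1, n_out)), np.zeros(n_out)

y_hat = forward(rng.normal(size=m), W1, b1, W2, b2)
print(y_hat)
```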

In Keras/TF:

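```
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

m, d1 = 2, 3  # example sizes (assumed): number of inputs, hidden units
inputs = Input(shape=(m,))       # input layer with m features
hidden = Dense(d1)(inputs)       # fully connected hidden layer
outputs = Dense(2)(hidden)       # output layer with two outputs
model = Model(inputs, outputs)
```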


### Applying Neural Networks

Will I pass this class?

* Two inputs: hours spent on the final project, number of lectures attended
* Output: pass/fail

How to train the model? First need to know how to quantify the loss.

$ L(f(x^{(i)};W), y^{(i)}) $ (compares predicted and actual value)

Loss is low if close to actual, higher if not.

Empirical loss - measures the average loss over our entire dataset.

* aka Objective function, cost function, empirical risk

$ J(W) = \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)})$

* 0/1 output - use Binary Cross Entropy Loss
* Computing a grade or other real-number output - use Mean Squared Error Loss
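
Both empirical losses follow the $ J(W) $ pattern of averaging a per-example loss. The sketch below assumes NumPy and uses invented labels and predictions.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Average cross entropy for 0/1 targets and predicted probabilities."""
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)  # avoid log(0)
    per_example = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(per_example)

def mean_squared_error(y_true, y_pred):
    """Average squared difference for real-number targets."""
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical pass/fail labels with predicted probabilities
bce = binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.8]))
# Hypothetical grades with predicted grades
mse = mean_squared_error(np.array([90.0, 20.0]), np.array([85.0, 25.0]))
print(bce, mse)
```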

### Training Neural Networks

Training is about loss optimization. We want to find the network weights that achieve the lowest loss

$ W^{*} = \underset{W}{\operatorname{argmin}} \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)}) $

$ W^{*} = \underset{W}{\operatorname{argmin}} J(W) $

Remember: $ W = \{ W^{(0)},W^{(1)},... \} $

We find the optimal weights via Gradient Descent

Algorithm:

1. Initialize weights randomly $ \sim N(0, \sigma^2 )$
2. Loop until convergence

    1. Compute gradient $ \frac{\partial{J(W)}}{\partial{W}} $
    2. Update weights $ W \leftarrow W - \eta \frac{\partial{J(W)}}{\partial{W}} $
3. Return weights

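```
import tensorflow as tf
weights = tf.Variable(tf.random.normal(shape, stddev=sigma))  # shape, sigma: set elsewhere
with tf.GradientTape() as tape:
    loss = compute_loss(weights)  # compute_loss: placeholder for J(W)
grads = tape.gradient(loss, weights)
weights.assign(weights - lr * grads)  # lr is the learning rate (eta)
```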

#### Computing the Gradients: Backpropagation

Simple network: $x \rightarrow w_1 \rightarrow z_1 \rightarrow w_2 \rightarrow \hat y \rightarrow J(W)$

How does a small change in weight ($w_2$) affect the final loss $ J(W) $?

Unpack using the chain rule

$ \frac{\partial{J(W)}}{\partial{w_2}} = \frac{\partial{J(W)}}{\partial{\hat y}} \cdot \frac{\partial{\hat y}}{\partial{w_2}}$

And the influence of $ w_1 $? Apply the chain rule again

$ \frac{\partial{J(W)}}{\partial{w_1}} = \frac{\partial{J(W)}}{\partial{\hat y}} \cdot \frac{\partial{\hat y}}{\partial{z_1}} \cdot \frac{\partial{z_1}}{\partial{w_1}}$

Repeat this for every weight in the network using gradients from later layers.
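
The chain-rule products can be checked numerically on a tiny network. The sketch assumes one concrete choice the notes leave open: $ z_1 = w_1 x $, $ \hat y = w_2 \, \sigma(z_1) $, and a squared-error loss. All numeric values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, x, y):
    """J(W) for the assumed network: z1 = w1*x, y_hat = w2*sigmoid(z1)."""
    return (w2 * sigmoid(w1 * x) - y) ** 2

def gradients(w1, w2, x, y):
    """Backpropagation: multiply local derivatives from the loss backwards."""
    a1 = sigmoid(w1 * x)
    y_hat = w2 * a1
    dJ_dyhat = 2 * (y_hat - y)
    dyhat_dw2 = a1
    dyhat_dz1 = w2 * a1 * (1 - a1)   # sigmoid'(z1) = a1 * (1 - a1)
    dz1_dw1 = x
    return dJ_dyhat * dyhat_dz1 * dz1_dw1, dJ_dyhat * dyhat_dw2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
g1, g2 = gradients(w1, w2, x, y)

# Finite-difference check: nudge each weight and watch the loss change
eps = 1e-6
num_g1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
num_g2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(g1, num_g1, g2, num_g2)
```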


### Neural Networks in Practice: Optimization

In practice, training neural networks is difficult. The loss landscape is complex, with many local optima, so the loss can be hard to optimize.

Setting the learning rate $ \eta $ is difficult and can greatly affect the optimization.

How to deal with this?

* Try a lot of learning rates to see what works best.
* Adaptive learning rates - don't fix the rate, but adjust it based on how large the gradient is, how fast learning is happening, the size of particular weights, etc.
    * TensorFlow examples: momentum, adagrad, adadelta, adam, rmsprop
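
To see why the learning rate matters, the toy sketch below runs a fixed number of gradient descent steps on $ J(w) = w^2 $ with three rates. The loss function and the rates are assumptions picked to show slow progress, convergence, and divergence.

```python
def run_gd(learning_rate, steps=50, w=1.0):
    """Plain gradient descent on J(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

results = {lr: run_gd(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in results.items():
    print(f"learning rate {lr}: final w = {w:.4g}")
```

The smallest rate is still far from the minimum at 0 after 50 steps, the middle rate gets very close, and the largest rate overshoots further on every step.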
    
### Stochastic Gradient Descent

Compute the gradient at a single data point per update: cheaper to compute, but noisy

1. Initialize weights randomly $ \sim N(0, \sigma^2 )$
2. Loop until convergence
    1. Pick single data point $i$
    2. Compute gradient $ \frac{\partial{J_i(W)}}{\partial{W}} $
    3. Update weights $ W \leftarrow W - \eta \frac{\partial{J_i(W)}}{\partial{W}} $
3. Return weights

### Mini-batches

Easier to compute, less noise than stochastic as you are considering a wider population

1. Initialize weights randomly $ \sim N(0, \sigma^2 )$
2. Loop until convergence
    1. Pick batch of B data points
    2. Compute gradient $ \frac{\partial{J(W)}}{\partial{W}} = \frac{1}{B}\sum_{k=1}^B \frac{\partial{J_k(W)}}{\partial{W}} $
    3. Update weights $ W \leftarrow W - \eta \frac{\partial{J(W)}}{\partial{W}} $
3. Return weights
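
The steps above can be sketched for a linear model with a squared loss. Setting the batch size to 1 gives stochastic gradient descent and setting it to the dataset size gives batch gradient descent. The data, learning rate, and a fixed epoch count (used here in place of a convergence test) are assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, learning_rate=0.2, epochs=300, seed=0):
    """Mini-batch gradient descent for linear regression with squared loss."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = rng.normal(0, 0.1, size=m)           # 1. random initial weights
    for _ in range(epochs):                  # 2. fixed number of passes
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # pick a batch of B points
            err = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ err / len(idx)     # average gradient over the batch
            w -= learning_rate * grad                # update weights
    return w                                 # 3. return weights

# Hypothetical data: y = 1 + 2x plus a little noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=150)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.05, size=150)

w_batch = minibatch_gd(X, y, batch_size=150)
w_mini = minibatch_gd(X, y, batch_size=15)
w_sgd = minibatch_gd(X, y, batch_size=1)
print(w_batch, w_mini, w_sgd)
```

All three variants should end near the weights used to generate the data; the single-point variant tends to jitter around the optimum more than the others.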

### Overfitting

Want a model that performs well and generalizes well

* Underfit - model not complex enough to fully learn
* Ideal fit
* Overfitting - too complex, extra parameters, memorizes the training set, does not generalize well

#### Regularization

Regularization is a technique that constrains our optimization problem to discourage complex models. Why do we need it? To improve generalization of our model to unseen data.

Regularization 1: Dropout

* During training, randomly set some activations to 0
    * Typically drop 50% of activations in a layer in any given training iteration
    * Creates an ensemble of multiple models through the paths
    * Forces network to not rely on any 1 node
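
A sketch of a dropout mask in NumPy. One assumption goes beyond the notes: the surviving activations are rescaled by $ 1 / (1 - p) $ ("inverted dropout"), a common way to keep the expected activation unchanged so that nothing needs adjusting at prediction time.

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=None, training=True):
    """Randomly zero activations during training (with inverted-dropout scaling)."""
    if not training or drop_prob == 0.0:
        return activations                     # no dropout at prediction time
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescaling is an assumption, not something stated in these notes
    return activations * keep_mask / (1.0 - drop_prob)

a = np.ones(10_000)
dropped = dropout(a, drop_prob=0.5, rng=np.random.default_rng(0))
print(np.mean(dropped == 0), dropped.mean())
```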
    
Regularization 2: Early stopping

* Stop training before we have a chance to overfit.
    * Usually just before the point where the loss on the test set starts rising
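
One common way to implement this is a "patience" rule: stop once the held-out loss has not improved for a set number of epochs and keep the weights from the best epoch. The function name, the patience value, and the loss history below are all hypothetical.

```python
def early_stopping_epoch(held_out_losses, patience=2):
    """Return the index of the epoch whose weights we would keep."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(held_out_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up loss history: improves, then starts rising
held_out_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(held_out_losses)
print(best)
```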
    



## MIT 6.S191 (2019): Convolutional Neural Networks

Lecture 3 from Introduction to Deep Learning

Video [here](https://www.youtube.com/watch?v=H-HVZJ7kGI0&feature=youtu.be)
Slides [here](http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L3.pdf)

Image recognition

* Input is a 2D image, an array of pixel values
* Output - class label, can produce probability of belonging to a particular class

Fully connected network architecture

* Not a good fit for vision processing
* Squashing a 2D matrix into a vector and fully connecting it means you lose spatial information, and the dense connectivity makes computation too expensive

Goal: use an architecture that uses the spatial structure

* connect patches of input to neurons in hidden layer
* Pixels next to each other are probably related
* Use a sliding patch window across the image to define connections

How to weight the patch to detect particular features?

* Apply a set of weights - a filter - to extract local features.
* Use multiple filters to extract different features
* Spatially share parameters of each filter

Feature extraction with convolution

* Filter of size 4x4: 16 different weights
* Apply this same filter to 4x4 patches in the input
* Shift by 2 pixels for next patch

Convolution

* Uses filters to identify where features expressed in filter 'pop up' in the image
* Perform element-wise multiplication of the filter with the image patch under the window, then sum the results

Example - 5x5 image convolved with a 3x3 filter

* Yields 3x3 feature map (stride 1, no padding)
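
A plain NumPy sketch of the sliding-window operation, with no padding. The image and filter values are illustrative, and like most deep learning libraries the code does not flip the filter (strictly speaking this is cross-correlation).

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; multiply element-wise and sum at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)   # 5x5 image, 3x3 filter, stride 1
print(feature_map)
```

The same function covers the 4x4 filter shifted by 2 pixels: pass a 4x4 `kernel` and `stride=2`.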

