
# Deep Learning

## Deep learning - Linear Regression

A set of input features and weights is used to calculate a real-number output

* $ x_0 $ is the intercept input, fixed at 1, so $ w_0 $ acts as the intercept
* $ y = w_0 x_0 + w_1 x_1 + ... + w_m x_m = \sum_{i=0}^m w_i x_i$

Developing the model is a matter of assigning weights to the features.

* [Example notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/linear_cost_example.ipynb)

Notes based on the notebook.

* Simple data set - one feature, target is a line plus noise
* Create a data set with some random noise - straight line with noise
* Fit with linear regression

How did the algorithm determine the weights?

* Need a way to measure how close a prediction is to ground truth
    * Squared loss error function (aka mean squared error)
    * Loss function measures how close the predicted value is to ground truth
    
* Gradient descent optimizer used to determine the weights
    * Plot the loss at different weights - the plot is parabolic
    * Algorithm starts with random weights
    * Gradient (slope) of the curve tells us which way to adjust the weight (larger or smaller) to decrease the loss
    * Negative slope - increasing weight moves downhill
    * Positive slope - decreasing weight moves downhill
    
* Magnitude of weight adjustments
    * Learning rate determines the size of the weight adjustment, tradeoff is number of iterations vs ability to converge on the optimal weight
    * [More info on optimizing gradient descent](https://ruder.io/optimizing-gradient-descent/)
    * Some optimizers adjust the learning rate based on the steepness of the slope, some use momentum, etc.
    
Gradient descent modes

* Batch
    * Compute loss for all training examples
    * Adjust weight
    * Example: 150 samples in training set. For each pass over the training set, weight is adjusted once
* Stochastic
    * Compute loss for next example
    * Adjust weight
    * Example: 150 samples in training set. For each pass over the training set, weight is adjusted 150 times
* Mini-batch
    * Compute loss for a specified number of examples
    * Adjust weight
    * Example: 150 samples in training set, mini-batch size is 15. For each pass over the training set, weight is adjusted 10 times.
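
The three modes differ only in how many examples feed each weight update. A minimal NumPy sketch, assuming a one-feature linear model trained on the squared error; the function name `fit_linear`, the learning rate, and the synthetic data are illustrative choices, not taken from the course notebook:

```python
import numpy as np

def fit_linear(x, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w0 + w1*x by gradient descent on the mean squared error.

    Returns the learned weights and the total number of weight updates.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=2)                      # random starting weights [w0, w1]
    X = np.column_stack([np.ones_like(x), x])   # x_0 = 1 carries the intercept
    n, updates = len(y), 0
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle on every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)   # d(MSE)/dw on this batch
            w -= lr * grad                      # step against the gradient
            updates += 1
    return w, updates

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=150)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=150)   # line plus noise

for name, size in [("batch", 150), ("stochastic", 1), ("mini-batch", 15)]:
    w, updates = fit_linear(x, y, batch_size=size, epochs=1)
    print(f"{name:>10}: {updates} updates in one pass")
```

Setting `batch_size` to the full data set size gives batch mode, 1 gives stochastic mode, and anything in between gives mini-batch mode.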

## Logistic Regression (Binary Classification)

The setup is similar to linear regression: we have a set of features, assign a weight to each feature, and sum the products of features and weights. For the output, however, we want the probability that an example belongs to the positive class, given training labels of 0 or 1.

We can pass the weighted sum through the sigmoid function, whose output is bounded between zero and one - see [this](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/logistic_cost_example.ipynb) notebook.

* Typically need to assign a cut off value for the output, for example anything greater than 0.5 is positive.

The training objective with logistic regression is to select the weights that minimize misclassification.

* Use the logistic loss (cost) function, which treats negative and positive samples separately (one loss curve for positive samples, one for negative samples)
* The logistic loss function is convex (bowl-shaped, much like the parabolic squared loss), so it indicates not only the loss at a given weight but also which direction to adjust the weight.
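
The pieces above (weighted sum, sigmoid, cutoff, logistic loss) can be sketched in a few lines of NumPy. The feature values, weights, and the 0.5 cutoff below are made-up illustrations, not values from the linked notebook:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_true, p):
    """Mean logistic (log) loss; y_true holds 0/1 labels, p holds probabilities."""
    p = np.clip(p, 1e-12, 1 - 1e-12)            # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Illustrative values only: two features plus the intercept input x_0 = 1.
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 2.0, 0.3],
              [1.0, -1.0, -0.7]])
w = np.array([-0.5, 1.0, 0.8])
y = np.array([1, 1, 0])

p = sigmoid(X @ w)                  # probability of the positive class
labels = (p > 0.5).astype(int)      # apply the cutoff
print(p, labels, logistic_loss(y, p))
```

The loss is small when confident predictions agree with the labels and grows quickly when a confident prediction is wrong.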

How to find the optimal weights?

* Use the gradient descent optimizer

## Neural Networks

Linear models are simple and easy to understand, but typically underperform on non-linear data (underfit). They require extensive feature engineering, and features need to be on a similar range and scale.

Linear models form the foundation for understanding neural networks. A neural network (NN) looks like several stacked logistic models, with the sigmoid generalized to an arbitrary activation function.

A 'neuron' is the weighted sum of the features passed through an activation function. At each layer the features can be connected to multiple neurons, with weights specific to each neuron.

* The neurons generate new features by combining existing ones, which are then inputs to the next layer of neurons.
* Basic architecture has an input layer, one or more hidden layers, and an output layer.

Benefits

* Automatic feature engineering - mixes features to create new ones
* Handles non-linear datasets
* Standard techniques exist to deal with overfitting (NNs overfit easily) - regularization, reducing model complexity, etc.

Activation Functions

* Introduce non-linearity into the model
* Improves ability of model to fit complex non-linear datasets
* Three popular activation functions: sigmoid, tanh, relu

Activation function notebook - see [here](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/activation_functions.ipynb)

* sigmoid - converts input to a number between 0 and 1
* tanh - output varies from -1 to 1
* relu - output is 0 for negative input, otherwise same as input

Deep learning - subset of machine learning that uses complex networks with many layers (sometimes hundreds). Why so popular?

* Traditional ML algorithms appear to saturate on how much they can learn. Having massive amounts of data does not translate to more learning
* Small NN can learn better. Medium NN can learn even more, and large NNs can keep learning with more data.

Binary classifier - send the output through a sigmoid function.

Multiclass classifier - use softmax to convert the outputs to an array of probability scores, one per class; the probabilities for all classes sum to 1.
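
As a small illustration of the softmax step, here is a NumPy sketch; the three raw scores are invented, and subtracting the maximum is a common numerical-stability trick rather than something from the course:

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])      # made-up output-layer scores for three classes
probs = softmax(scores)
print(probs, probs.sum())
```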

Popular NN architectures

* General purpose
    * fully connected network
    * example: treats each pixel as a separate feature
* Convolutional Neural Network (CNN)
    * Useful for image analysis
    * Example: considers each pixel together with its surrounding pixels
* Recurrent NN
    * Looks at history
    * Used for timeseries prediction, natural language processing
    * Example: timeseries forecasting - model looks at current values and historical values
    

## MIT Introduction to Deep Learning - Lecture 1

Slides [here](http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L1.pdf)
Video [here](https://www.youtube.com/watch?v=5v1JnYv_yWs&feature=youtu.be)

### Why deep learning?

* Traditional ML algorithms - hand engineer features, which is time consuming, brittle, not scalable
* Can we learn the features directly from raw data?
    * lines and edges to eyes and noses to faces
* Why now?
    * big data (more and larger data sets, easier collection and storage), hardware (GPUs, parallel processing), software (improved techniques, new models, open source)
    
### The Perceptron - the structural block of deep learning

Feed-forward: inputs, weights, sum, non-linearity, output

$ \hat y = g(\sum_{i=1}^m x_i w_i)$ where $ g $ is the activation function

Note we also have another term, the bias term, which lets us shift the activation left or right:

$ \hat y = g(w_0 + \sum_{i=1}^m x_i w_i)$ where $ w_0 $ is the bias term.

We can rewrite this linear algebra style:

$ \hat y = g( w_0 + X^TW) $ where:

$ X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} $ and $ W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} $


Activation function - typically a non-linear function like the sigmoid function, e.g.

$ \sigma (z) =  \frac{\mathrm{1} }{\mathrm{1} + e^{-z} }  $

Can also be tanh or ReLU
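
A minimal NumPy sketch of the perceptron forward pass with each of the three activation functions; the input, weight, and bias values are arbitrary illustrations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def perceptron(x, w, w0, g):
    """Forward pass: dot product, add the bias, apply the non-linearity g."""
    return g(w0 + x @ w)

x = np.array([-1.0, 2.0])      # illustrative inputs
w = np.array([3.0, -2.0])      # illustrative weights
w0 = 1.0                       # bias term

# Here z = 1 + (-1)(3) + (2)(-2) = -6
for name, g in [("sigmoid", sigmoid), ("tanh", np.tanh), ("relu", relu)]:
    print(name, perceptron(x, w, w0, g))
```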


### Importance of Activation Functions

The purpose is to introduce non-linearities into the network, as many problems in the real world involve non-linearities.
* Think about nonlinear decision boundaries
    
    
### Building Neural Networks with Perceptrons

3 steps to computing the output of a perceptron: take the dot product, add a bias, apply a non-linearity

$ y = g(z) $ where $ z = w_0 + \sum_{j=1}^m x_j w_j $

Multi Output Perceptron

$ y_1 = g(z_1) $
$ y_2 = g(z_2) $
$ z_i = w_{0,i} + \sum_{j=1}^m x_j w_{j,i} $

Single Layer Neural Network

* Input layer, fully connected hidden layer, output layer (two outputs)

$ z_i = w^{(1)}_{0,i} + \sum_{j=1}^m x_j w^{(1)}_{j,i} $

and...

$ \hat {y_i} = g(w^{(2)}_{0,i} + \sum_{j=1}^{d_1} z_j w^{(2)}_{j,i})$

The middle (hidden) layer has $ d_1 $ nodes, $ z_1 ... z_{d_1} $

The hidden layer is learned; unlike the input and output layers, its values are not directly observed.

Another name for fully connected layers is a dense layer.

In Keras/TF:

```
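from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Input

m, d1 = 3, 4  # example sizes: m input features, d1 hidden units

inputs = Input(shape=(m,))
hidden = Dense(d1)(inputs)    # fully connected hidden layer
outputs = Dense(2)(hidden)    # two outputs
model = Model(inputs, outputs)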
```


### Applying Neural Networks

Will I pass this class?

* Two inputs: hours spent on the final project, number of lectures attended
* Output: pass/fail

How to train the model? First need to know how to quantify the loss.

$ L(f(x^{(i)};W), y^{(i)}) $ (compares predicted and actual value)

Loss is low if close to actual, higher if not.

Empirical loss - measures the total loss over our entire dataset.

* aka Objective function, cost function, empirical risk

$ J(W) = \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)})$

* 0/1 output - use Binary Cross Entropy Loss
* Computing a grade or other real-number output - use Mean Squared Error Loss
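
For reference, the two losses are usually written as follows. These are the standard textbook forms expressed in the notation of $ J(W) $ above, not formulas copied from the slides:

$ J(W) = -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \log f(x^{(i)};W) + (1 - y^{(i)}) \log \left( 1 - f(x^{(i)};W) \right) \right] $ (binary cross entropy)

$ J(W) = \frac{1}{n} \sum_{i=1}^n \left( y^{(i)} - f(x^{(i)};W) \right)^2 $ (mean squared error)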

### Training Neural Networks

Training is about loss optimization. We want to find the network weights that achieve the lowest loss

$ W^{*} = \underset{W}{\operatorname{argmin}} \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)}) $

$ W^{*} = \underset{W}{\operatorname{argmin}} J(W) $

Remember: $ W = \{ W^{(0)},W^{(1)},... \} $

We find the optimal weights via Gradient Descent

Algorithm:

1. Initialize weights randomly $ \sim N(0, \sigma^2 )$
2. Loop until convergence
    1. Compute gradient $ \frac{\partial{J(W)}}{\partial{W}} $
    2. Update weights $ W \leftarrow W - \eta \frac{\partial{J(W)}}{\partial{W}} $
3. Return weights

```
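import tensorflow as tf
weights = tf.Variable(tf.random.normal(shape, stddev=sigma))  # shape, sigma: your choice
with tf.GradientTape() as tape:
    loss = compute_loss(weights)  # compute_loss: placeholder for J(W)
grads = tape.gradient(loss, weights)
weights.assign_sub(lr * grads)  # W <- W - lr * dJ/dW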
```

#### Computing the Gradients: Backpropagation

Simple network: $x \rightarrow w_1 \rightarrow z_1 \rightarrow w_2 \rightarrow \hat y \rightarrow J(W)$

How does a small change in weight ($w_2$) affect the final loss $ J(W) $?

Unpack using the chain rule

$ \frac{\partial{J(W)}}{\partial{w_2}} = \frac{\partial{J(W)}}{\partial{\hat y}} * \frac{\partial{\hat y}}{\partial{w_2}}$

And the influence of $w_1$? Apply the chain rule again

$ \frac{\partial{J(W)}}{\partial{w_1}} = \frac{\partial{J(W)}}{\partial{\hat y}} * \frac{\partial{\hat y}}{\partial{z_1}} * \frac{\partial{z_1}}{\partial{w_1}}$

Repeat this for every weight in the network using gradients from later layers.
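
The chain-rule expressions can be checked numerically on a tiny instance of the network. The sketch below assumes a tanh activation for $z_1$, a linear output, and a squared-error loss; none of these choices come from the lecture, they just make the derivatives concrete:

```python
import numpy as np

def forward(w1, w2, x, y):
    """x -> w1 -> z1 -> w2 -> y_hat -> J, with tanh as an assumed activation."""
    z1 = np.tanh(w1 * x)
    y_hat = w2 * z1
    loss = (y_hat - y) ** 2
    return z1, y_hat, loss

def gradients(w1, w2, x, y):
    """Chain-rule gradients of the loss with respect to w1 and w2."""
    z1, y_hat, _ = forward(w1, w2, x, y)
    dJ_dyhat = 2.0 * (y_hat - y)
    dJ_dw2 = dJ_dyhat * z1                          # dJ/dy_hat * dy_hat/dw2
    dJ_dw1 = dJ_dyhat * w2 * (1.0 - z1 ** 2) * x    # dJ/dy_hat * dy_hat/dz1 * dz1/dw1
    return dJ_dw1, dJ_dw2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0                  # illustrative values
g1, g2 = gradients(w1, w2, x, y)

# Finite-difference check of the analytic gradients.
eps = 1e-6
n1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
n2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
print(g1, n1, g2, n2)
```

Note how the factor `dJ_dyhat` computed for the later layer is reused for the earlier weight, which is the reuse that backpropagation exploits.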


### Neural Networks in Practice: Optimization

In practice, training neural networks is difficult. The loss landscape is complex, with many local optima, so the loss can be hard to optimize.

Setting the learning rate $ \eta $ is difficult and can greatly affect the optimization.

How to deal with this?

* Try a lot of learning rates to see what works best.
* Adaptive learning rates - don't fix the rate, but adjust it based on how large the gradient is, how fast learning is happening, the size of particular weights, etc.
    * TensorFlow examples: momentum, adagrad, adadelta, adam, rmsprop
    
### Stochastic Gradient Descent

Compute the gradient at a single data point to save computation; the estimate is cheap but noisy

1. Initialize weights randomly $ \sim N(0, \sigma^2 )$
2. Loop until convergence
    1. Pick single data point $i$
    2. Compute gradient $ \frac{\partial{J_i(W)}}{\partial{W}} $
    3. Update weights $ W \leftarrow W - \eta \frac{\partial{J_i(W)}}{\partial{W}} $
3. Return weights

### Mini-batches

Easier to compute, less noise than stochastic as you are considering a wider population

1. Initialize weights randomly $ \sim N(0, \sigma^2 )$
2. Loop until convergence
    1. Pick batch of $B$ data points
    2. Compute gradient $ \frac{\partial{J(W)}}{\partial{W}} = \frac{1}{B}\sum_{k=1}^B \frac{\partial{J_k(W)}}{\partial{W}} $
    3. Update weights $ W \leftarrow W - \eta \frac{\partial{J(W)}}{\partial{W}} $
3. Return weights

### Overfitting

Want a model that performs well and generalizes well

* Underfit - model not complex enough to fully learn
* Ideal fit
* Overfitting - too complex, extra parameters, memorizes the training set, does not generalize well

#### Regularization

Regularization is a technique that constrains our optimization problem to discourage complex models. Why do we need it? To improve generalization of our model to unseen data.

Regularization 1: Drop out

* During training, randomly set some activations to 0
    * Typically drop 50% of activations in a layer in any given training iteration
    * Creates an ensemble of multiple models through the paths
    * Forces the network not to rely on any single node
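
A rough NumPy sketch of dropout applied to one layer's activations. The rescaling of surviving activations by `1 / (1 - rate)` ("inverted dropout") is an assumption added here so the expected activation stays unchanged; the notes above only mention zeroing activations:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    Survivors are scaled by 1 / (1 - rate), an assumed "inverted dropout"
    convention, so the expected activation is unchanged.
    """
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((4, 1000))                 # stand-in for one layer's activations
dropped = dropout(a, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0.0))
```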
    
Regularization 2: Early stopping

* Stop training before we have a chance to overfit.
    * Usually just before the point where the loss on the held-out (testing) data starts rising
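
One common way to implement this is a patience counter on the held-out loss. The sketch below is a framework-free illustration; the function name, the `patience` parameter, and the loss values are all hypothetical:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights we would keep.

    Training halts once the held-out loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Made-up held-out losses: they fall, then rise as the model starts to overfit.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(losses))
```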
    



## MIT 6.S191 (2019): Convolutional Neural Networks

Lecture 3 from Introduction to Deep Learning

Video [here](https://www.youtube.com/watch?v=H-HVZJ7kGI0&feature=youtu.be)
Slides [here](http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L3.pdf)

Image recognition

* Input is a 2D image, an array of pixel values
* Output - class label, can produce probability of belonging to a particular class

Fully connected network architecture

* Not a good fit for vision processing
* Squashing a 2D matrix into a vector and fully connecting it means you lose spatial information, and the full connectivity makes computation too expensive

Goal: use an architecture that uses the spatial structure

* connect patches of input to neurons in hidden layer
* Pixels next to each other are probably related
* Use a sliding patch window across the image to define connections

How to weight the patch to detect particular features?

* Apply a set of weights - a filter - to extract local features.
* Use multiple filters to extract different features
* Spatially share parameters of each filter

Feature extraction with convolution

* Filter of size 4x4: 16 different weights
* Apply this same filter to 4x4 patches in the input
* Shift by 2 pixels for next patch

Convolution

* Uses filters to identify where features expressed in filter 'pop up' in the image
* Perform element-wise multiplication of the filter with the image patch under the window, then sum the products

Example - 5x5 image convolved with a 3x3 filter

* Yields 3x3 feature map
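
The output size follows from sliding the filter over every valid position. A small NumPy sketch, assuming no padding; as in most CNN libraries the filter is not flipped, so strictly speaking this is cross-correlation. The image and filter values are toy data:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3))                           # toy 3x3 filter
feature_map = convolve2d(image, kernel)
print(feature_map.shape)
```

With `stride=2` and a 4x4 filter, the same function reproduces the "shift by 2 pixels" setup described earlier.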

Feature extraction with convolution: 

* Different filters can extract different features based on the spatial structure inherent in the data.
    * Apply a set of weights - a filter - to extract local features
    * Use multiple filters to extract different features
    * Spatially share parameters of each filter
    
Convolutional Neural Networks - CNNs

1. Convolution: apply filters with learned weights to generate feature maps
2. Apply non-linearity: often ReLU
3. Pooling: downsampling operation on each feature map
    * reduces dimensionality
    * provides spatial invariance
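
Max pooling, the most common pooling choice, can be sketched with a reshape. The 2x2 window size and the feature-map values are illustrative:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size              # drop any ragged edge
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)
print(max_pool(fm))
```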

Train model with image data, learn weights of filters in convolutional layers.

* What features in the image are we learning

Image input -> convolution (feature maps) -> maxpooling -> fully connected output layer

ImageNet

* Most famous data set for training CNNs
* 14 million images across 21,841 categories

Architecture for Many Applications

* Feature learning part, classification part
* Uses
    * Semantic segmentation
    * Object detection
    * Image captioning
    
Semantic Segmentation

* Task is to assign each pixel of the image to an object
* FCN: fully convolutional networks
    * Network design with all convolutional layers
    * Input and output sizes the same
    * Convolutional layers with downsampling and upsampling operations
* Applied to real time driving scenes for example

Object Detection

* R-CNN: 
    * Find regions we think have objects 
    * Use CNN to classify.
    
Image Captioning

* Generate a sentence that describes the scene
* Classify images with a CNN
* Connect to an RNN to generate the sentence
* Fixed output of CNN used to initialize the RNN



## MIT 6.S191: Recurrent Neural Networks

Video [here](https://www.youtube.com/watch?v=_h66BW-xNgk&feature=youtu.be)
Slides [here](http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L2.pdf)

Think of a moving ball: can we predict where it will move next?

* Much easier to predict where it will go next in space if we know the history of its previous positions

Sequences in the wild

* Audio - sequence of sound waves
* Text - sequences of words
* Stock prices
* Genomic data

A sequence modeling problem: predict the next word

* If we have "This morning I took my cat for a " - can we predict the next word?
* Need a way to handle variable length input. In a traditional feed forward network we could...
    * Use a fixed window, for example the previous two words. Problem: you may need more history than the window holds.
    * Bag of words - vector with one slot per word, holding the count of that word. Problem: does not preserve order.
    * Use a really big fixed window. Problem: no parameter sharing, so things we learn about the sequence at one point don't transfer to another point.
    
Sequence Modeling - Design Criteria

To model sequences, we need to...

* Handle variable-length sequences
* Track long-term dependencies
* Maintain information about the sequence order
* Share parameters learned across the entire sequence

RNNs

* Standard feed forward architecture - propagate from input to output, in one direction
* RNN - sequence of data fed through the model, can return a single output or an output at different points in time (e.g. music generation)
* RNNs take an input vector, produce an output vector, and maintain an internal state that is passed from the current step to the next step

Apply a recurrence relation at every time step to process a sequence.

$ h_t = f_W(h_{t-1},x_t)$

Function is parameterized by a set of weights. Note the same function and set of parameters are used at each step. Each step includes both an output and a state update.

Update hidden state:

$ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$

Output at a timestep is a transformed version of the internal state:

$ \hat y_t = W_{hy}h_t$
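
The two equations translate directly into NumPy. In this sketch the layer sizes, the random weights, and the input sequence are placeholders chosen only to make it runnable:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    """One recurrence step: update the hidden state, then read out an output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 3, 2                      # illustrative sizes
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_xh = rng.normal(scale=0.5, size=(hidden, n_in))
W_hy = rng.normal(scale=0.5, size=(n_out, hidden))

h = np.zeros(hidden)                               # initial state h_0
sequence = rng.normal(size=(5, n_in))              # five time steps of made-up input
outputs = []
for x_t in sequence:                               # the same weights at every step
    h, y_t = rnn_step(h, x_t, W_hh, W_xh, W_hy)
    outputs.append(y_t)
outputs = np.array(outputs)
print(outputs.shape)
```

Because the loop runs once per input, the same function handles sequences of any length.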

Can think of an RNN computational graph across time.

* Conceptually copies of the network passing state as an input to the next copy in the chain.
* Can make the parameters (weight matrices) explicit
    * Weights to transform inputs to the hidden state
    * Weights to transform the previous hidden state to the next hidden state
    * Weights to transform the hidden state to the output
* Same weights used throughout

Loss function

* Compute a loss at each step in the network
* Total loss is the sum of all the individual step losses

Backpropagation Through Time (BPTT)

* Forward pass - forward across time, outputs, states, losses
* Backprop - errors are backpropagated at each individual time step, and across time steps from the current step all the way back to the beginning of the sequence.
* Computing the gradient with respect to $h_0$ involves many factors of $W_{hh}$ (and repeated $f^{'}$)
    * Many values > 1: exploding gradients (become very large, difficult to optimize)
        * Handle this using gradient clipping to scale large gradients
    * Many values < 1: vanishing gradients
        * Change the activation function
        * Weight initialization
        * Network architecture
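
Gradient clipping by norm is one common form of the scaling mentioned above. A minimal sketch, with an arbitrary threshold and made-up gradients:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

big = np.array([30.0, -40.0])          # norm 50, an "exploding" gradient
small = np.array([0.3, -0.4])          # norm 0.5, left untouched
print(clip_by_norm(big, 5.0), clip_by_norm(small, 5.0))

# Repeated factors below or above 1 shrink or blow up quickly.
print(0.9 ** 50, 1.1 ** 50)
```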
        
Why are vanishing gradients a problem?

* Multiply many small numbers together
* Errors due to further back time steps have smaller and smaller gradients
* Biases the network toward capturing short-term dependencies
    
Activation functions
   
* Derivatives of tanh and sigmoid are usually less than one
* ReLU derivative is either 1 (x > 0) or 0 

Initialization

* Initializing weights to the identity matrix helps prevent vanishing gradients

Architecture

* Use a more complex recurrent unit with gates to control what information is passed through
* LSTM, GRU, etc


Long Short Term Memory

* Standard model has a single computation (e.g. tanh), LSTM has several computing units
* LSTMs maintain a cell state where it's easy for information to flow, this is in addition to unit/step state
* Single computation for cell state
* Information is added to or removed from the cell state through structures called gates
    * Gates optionally let information through via a sigmoid NN layer and pointwise multiplication
    
How do LSTMs work?

* *Forget* irrelevant parts of the previous state (forget gate)
* Selectively *update* cell state values
    * Sigmoid layer to decide what values to update
    * tanh layer: generate new vector of candidate values that could be added to the state
    * Apply forget operation to previous cell state
    * Add new candidate values, scaled by how much we decided to update
* Use an output gate to *output* certain parts of the cell state

LSTM Output

* Output filtered version of cell state
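
The forget, update, and output steps can be written out as a single cell step. This sketch uses the standard LSTM formulation with one stacked weight matrix; the stacking order, sizes, and random weights are assumptions for illustration, not details from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])        # forget gate: what to drop from the cell state
    i = sigmoid(z[1 * n:2 * n])        # input gate: which values to update
    g = np.tanh(z[2 * n:3 * n])        # candidate values
    o = sigmoid(z[3 * n:4 * n])        # output gate: which parts of the state to expose
    c_t = f * c_prev + i * g           # forget, then add scaled candidates
    h_t = o * np.tanh(c_t)             # filtered version of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
n_hidden, n_in = 3, 2                  # illustrative sizes
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(4, n_in)): # four made-up time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```

The cell state update uses only element-wise products and sums, which is the property the gradient-flow argument relies on.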

LSTM Gradient Flow

* Backpropagation from one cell state to the previous one requires only element-wise multiplication, with no matrix multiplication, which helps avoid the vanishing gradient problem.

RNN Applications

* Example: music generation
    * Input: sheet music
    * Output: next character in sheet music
* Example: sentiment classification
    * Input: sequence of words
    * Output: sentiment
* Example: machine translation
    * Input: sentence in one language
    * Output: sentence in a different language
    * Attention mechanisms: bring early encoder output to the decoder stages
        * Attention mechanisms in neural networks provide learnable memory access

## NIPS 2016 tutorial: Nuts and bolts of building AI applications

Video [here](https://www.youtube.com/watch?v=wjqaz6m42wU&feature=youtu.be)
Slides [here](https://media.nips.cc/Conferences/2016/Slides/6203-Slides.pdf)

### Major DL Trends

#### Trend: Scale

Many old ideas, why are they taking off now?

* Older algorithms' performance tapers off even when more data is available - learning capacity is limited.
* Small NNs have slightly better performance
* Medium NNs are even better
* Large nets keep getting better and better

Scale is driving improvement

* Amount of data 
* Size of networks

Note that in the small data regime the above performance ordering of the algorithms may vary.

Team Org

* ML team has both deep learning specialists and computer systems teams
* Leading edge is trending towards the HPC specialists

Different buckets

* General NN
* Sequence Models - RNN/LSTM/GRU
* Image (2d/3d), ConvNets
* Other - Unsupervised, RL

Most of what's shipping is in the first 3 categories (supervised learning), because there is still a lot of labeled data to be exploited.

Unsupervised learning algorithms need even more data and computation.



#### Trend: End to End DL for Complex Outputs

For supervised algs

Classification - output a number

* Movie review -> Sentiment
* Image -> Category

Shift to more complex outputs

* Audio -> text
* Image -> caption
* English -> French
* Text -> Audio

Achilles' heel - still need a lot of labeled data

Speech Recognition - Traditional

* non end to end - several intermediate steps
* audio -> features -> phonemes -> transcript

Speech Recognition - End to End

* Audio -> Transcript

Face recognition - turnstile

* Picture of person approaching - extract face
* Compare face to face from database
* Not enough pictures of people approaching the turnstile to train a single end to end model

Pediatricians - Look at Hand X-rays to Estimate Age

* Image -> Bones -> Age (works well)
* Image -> Age (not enough data to do well)

Autonomous Driving

* Image -> Position of cars, Position of pedestrians (DL) -> Planning Route (trad) -> steering
* Image -> Steering (very difficult to do well based on the amount of data available and the level of precision needed).

Chat Bots

* Best performance: text -> inference engine -> response
* End to end has been done for toy problems but not suitable for production apps: text -> response

### ML Strategy

Example: Build Human Level Speech Recognition System

Common Practice - split the data set, e.g. 60% train, 20% dev set, 20% (final) test set

Common Diagnostics

* Human level error - for example 1%
* Training set error - for example 5%
* Dev set error - for example 6%

Insights

* The gap between training set error and human-level error (5% - 1%) shows we are not fitting the training set well; there is no hope of reaching human-level performance until it closes (avoidable bias)
* The gap between dev set error and training set error (6% - 5%) shows how well the model generalizes beyond the training set (variance)

* If we saw 1%, 2%, 6% -> overfitting (high variance)
* If we saw 1%, 6%, 10% -> high bias and high variance
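
The arithmetic behind these readings is just two subtractions. A tiny sketch, using a hypothetical helper name and the three scenarios from the notes:

```python
def bias_variance_gaps(human, train, dev):
    """Split the error (in percent) into avoidable bias and variance."""
    return {"avoidable_bias": train - human, "variance": dev - train}

# The three scenarios from the notes: (human, training, dev) error in percent.
for errors in [(1, 5, 6), (1, 2, 6), (1, 6, 10)]:
    print(errors, bias_variance_gaps(*errors))
```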

Recipe for Driving Machine Learning Progress

1. Training error high?
    * Yes -> Bias Problem
        * Train a bigger network
        * Train longer/optimize
        * Try new architecture
2. Once 1 is fixed, Is Dev Set Error High?
    * Yes -> Variance Problem
        * More data
        * Regularization
        * New model architecture
3. Done (assuming you also do well on the test set)

Trend: less consideration of the bias/variance tradeoff

* With deep learning we have techniques to address bias and variance independently; not always a trade off.

    Data Synthesis

    * OCR - Print a word on a random background, use it as a training set
        * Need to do a lot of finicky work to make this work, e.g. color distribution needs to match between training and test data
        * Need to figure out tricks to get this to work well
    * Speech Recognition - take audio and add in background noise
        * 10,000 hours speech, 10 hours of background added - neural network overfits to the background noise
            * Need more background noise to randomly select (e.g. 100h or 1000h)
    * Are we trading hand engineered features for hand engineered data sets?
    * Autonomous driving - use Grand Theft Auto?
        * Game probably has just 20 different types of cars
        * Algorithm will overfit

Best Practice - Unified Data Warehouse

* All your customer data flows into one place
* Much more efficient way to get things done, will move faster
    * Data is like dynamite, need to put a lot of it together to get a big bang!

Training and Test Set Distributions

* Speech enabled rearview mirror
    * 50K hours general speech recognition, not recorded from a rear view mirror in a car
    * Add 10 hours of audio recorded from a rearview mirror in a car, with background noise, etc.
* Approaches
    * Train on most of the 50k, save rest of 50K for dev, test on the car data
        * No good; development and test sets drawn from different distributions
    * Train on the 50k set, split the in-car data and use half for the dev set, half for the test set
        * Dev and test drawn from the same distribution
            * Human level - 1% 
            * Training - 10% (training - human): avoidable bias
            * Training Dev - 10% (training dev - training): variance
            * Dev - 10% (Dev - Training Dev): data mismatch
            * Test - 10% (Test - Dev): Overfit dev set
            
Back to the recipe...

1. Training error high?
    * Yes -> Bias Problem
        * Train a bigger network
        * Train longer/optimize
        * Try new architecture
2. Once 1 is fixed, Is Train Dev Set Error High?
    * Yes -> Variance Problem
        * More data
        * Regularization
        * New model architecture
3. Once 2 is good, is the Dev set error high?
    * Yes -> Data mismatch problem
        * Make the data more similar
        * Play with data synthesis
        * Domain adaptation (research problem)
        * New model architecture
4. Once doing well on 3, Test set error is high?
    * Yes -> Need more dev data (you've overfit the dev set)
5. If 4 good, done.
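
The recipe is a fixed sequence of checks, which makes it easy to sketch as code. The function name, the `tol` threshold for what counts as a "high" gap, and the example numbers are assumptions for illustration:

```python
def next_step(human, train, train_dev, dev, test, tol=1.0):
    """Walk the recipe in order and return the first problem found.

    All errors are percentages; `tol` is an assumed threshold for "high".
    """
    if train - human > tol:
        return "bias: bigger network, train longer, new architecture"
    if train_dev - train > tol:
        return "variance: more data, regularization, new architecture"
    if dev - train_dev > tol:
        return "data mismatch: make data more similar, data synthesis"
    if test - dev > tol:
        return "overfit dev set: get more dev data"
    return "done"

print(next_step(human=1, train=1.5, train_dev=2, dev=6, test=6.5))
```

Checking the gaps in this order matters: a later gap is only meaningful once the earlier ones are under control.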

Take away: be methodical, we are learning to organize the work of machine learning.

Three things you can ask:

* What is human level performance?
* Performance on examples you've trained on
* Performance on unseen examples?

Machine learning is more iterative today

* Train and dev set needed to drive fast iterations
* Dev and test sets from the same distribution can still be overfit if you can't draw samples randomly (e.g. driving data collected from 4 cities, wanting to generalize to a 5th city; here you effectively have 4 data points even if you collect 1000 hours in each city)

Human Level Performance

* As machine learning approaches human level performance, progress slows down. Two reasons for this
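    1. Human level performance is a proxy for the Bayes optimal error (the mathematically best error rate any predictor can achieve), so little room for improvement remains beyond it
    2. While below human level, we can use humans to label data, use human insight in error analysis, and estimate bias and variance; these tools help less once human level is passed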
    
Future of AI

* Value of supervised learning has taken off
* Unsupervised learning is making steady progress
* Reinforcement learning should take off...
* Should see transfer learning value getting big lift soon

AI Product Management

* Workflow is User -> PM -> Engineers
    * PM can often mock up the app
    * ML is different
* Speech recognition - what to focus on?
    * Noisy environments (car, cafe), Low bandwidth audio, accented speech, latency, binary size?
    * PM decides
* Fuzzy set - what customers do, another set - what ML can do. Intersection is what is possible...
* How does the PM communicate to the engineers?
    * PM provides dev and test sets
    * Provides a validation metric
* Engineering team
    * Obtain the training data
    * Build system that works on the dev and test set
    
To excel:

* Read a lot of papers
* Replicate others' work
* Dirty work


## Lab - Regression with SKLearn Neural Net

[This notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/tree/master/DeepLearning/BikeSharingRegression)

Notes:

* Data set from Kaggle
* Model predicts log(rentals)
* One hot encode categorical features
* Scale numerical features
* Use relu for the activation
* Easy to use this to build a single layer network, but framework does not support GPU


Use the column transformer 

* Tuple of transformer and columns to transform
* Fit the transformer to the training data
    * Reuse the fitted transformer instance for all data
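
A short sketch of this pattern with scikit-learn's `ColumnTransformer`. The column names and values are hypothetical stand-ins, loosely modeled on a bike rental data set rather than copied from the lab notebook:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns and values, for illustration only.
train = pd.DataFrame({
    "season": ["spring", "summer", "fall", "summer"],
    "temp": [12.0, 28.0, 15.0, 30.0],
    "humidity": [60.0, 40.0, 70.0, 35.0],
})
test = pd.DataFrame({
    "season": ["fall", "spring"],
    "temp": [14.0, 10.0],
    "humidity": [65.0, 55.0],
})

# Each entry is a (name, transformer, columns) tuple.
transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["season"]),
        ("scale", StandardScaler(), ["temp", "humidity"]),
    ],
    sparse_threshold=0,   # always return a dense array
)

X_train = transformer.fit_transform(train)   # fit on the training data only
X_test = transformer.transform(test)         # reuse the fitted instance
print(X_train.shape, X_test.shape)
```

Fitting only on the training data keeps the scaling statistics and the category list from leaking information out of the test set.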
    


## Lab - Regression with Keras and TensorFlow

Lots of libraries - TensorFlow, Theano, MXNet


Keras API

* High level wrapper for TensorFlow, CNTK, Theano.
* Single API, pick your backend

Lab - use [this notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/DeepLearning/BikeSharingRegression/bikerental_neural_network_keras.ipynb)

* Pick the kernel you want to use corresponding to the backend you want (SageMaker)
* Dense layer - fully connected
* Adam optimizer - variant of gradient descent
* Use early stopping to avoid overfitting
* Can rerun the notebook with a different kernel, could even take advantage of GPUs based on the underlying kernel

## Lab - Binary Classification with NN

Customer Churn Prediction

Data prep notebook [here](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/DeepLearning/CustomerChurnClassification/customer_churn_data_preparation_onehotencoded.ipynb)

NN notebook [here](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/DeepLearning/CustomerChurnClassification/customer_churn_neural_network.ipynb)

Notes - Data prep:

* Data prep needs to one-hot encode categorical data
* Scale numeric features
* Split into training, validation, test - use a validation dataset for tuning so we don't memorize the test set

Notes: Model training

* Metrics
    * high precision - indicates that whenever the model makes a positive prediction there is a high likelihood of it matching ground truth.
    
Improving recall - lower the threshold.

* Can in some cases come up with a formula to calculate the cost of misclassification and optimize the cutoff threshold
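
The effect of the cutoff on precision and recall can be seen on a handful of predictions. The labels and predicted churn probabilities below are invented for illustration:

```python
import numpy as np

def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class at a given cutoff."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)

# Made-up labels and predicted churn probabilities.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.45, 0.6, 0.2, 0.35, 0.4, 0.1])

for t in (0.5, 0.3):
    print(t, precision_recall(y_true, scores, t))
```

Lowering the threshold flags more customers as positive, so recall can only stay the same or rise, usually at some cost in precision.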