In [1]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

# Deep Learning

## Deep learning - Linear Regression

A set of input features and weights is used to calculate a real-number output

* $ x_0 $ is a constant feature fixed at 1, so $ w_0 $ acts as the intercept
* $ y = w_0 x_0 + w_1 x_1 + ... + w_m x_m = \sum_{i=0}^m w_i x_i $
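
A minimal sketch of the weighted sum above in plain Python. The function name `predict` and the weight and feature values are made up for illustration; they do not come from the example notebook.

```python
def predict(weights, features):
    """Return w_0*x_0 + w_1*x_1 + ... + w_m*x_m, with x_0 fixed at 1."""
    x = [1.0] + list(features)  # prepend the constant x_0 = 1
    return sum(w_i * x_i for w_i, x_i in zip(weights, x))

# Hypothetical weights: intercept w_0 = 2.0, then one weight per feature.
weights = [2.0, 0.5, -1.0]
features = [4.0, 3.0]
y = predict(weights, features)  # 2.0*1 + 0.5*4.0 + (-1.0)*3.0 = 1.0
print(y)
```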

Developing the model is a matter of assigning weights to the features.

* [Example notebook](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/linear_cost_example.ipynb)

Notes based on the notebook.

* Simple data set - one feature, target is a line plus noise
* Create a data set with some random noise - straight line with noise
* Fit with linear regression
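
The steps above could be sketched roughly as follows. This is not the notebook's code: the true slope and intercept (3 and 8), the noise level, and the closed-form least-squares fit are all assumptions chosen to keep the example dependency-free.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed ground truth for illustration: y = 3x + 8 plus Gaussian noise.
true_slope, true_intercept = 3.0, 8.0
xs = [i / 10 for i in range(100)]
ys = [true_slope * x + true_intercept + random.gauss(0, 1.0) for x in xs]

# Ordinary least squares for a single feature (closed form).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

print(f"fitted slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted values should land close to, but not exactly on, the true slope and intercept because of the noise.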

How did the algorithm determine the weights?

* Need a way to measure how close a prediction is to ground truth
    * Squared loss error function (aka mean squared error)
    * Loss function measures how close the predicted value is to ground truth
    
* Gradient descent optimizer used to determine the weights
    * Plot the loss at different weights - plot is parabolic
    * Algorithm starts with random weights
    * Gradient (slope) of the curve tells us which way to adjust the weight (larger or smaller) to decrease the loss
    * Negative slope - increasing weight moves downhill
    * Positive slope - decreasing weight moves downhill
    
* Magnitude of weight adjustments
    * Learning rate determines the size of each weight adjustment; the tradeoff is number of iterations vs ability to converge on the optimal weight (small steps need more iterations, large steps can overshoot)
    * [More info on optimizing gradient descent](https://ruder.io/optimizing-gradient-descent/)
    * Some optimizers adjust the learning rate based on the steepness of the slope, some use momentum, etc.
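
A minimal sketch of the loss and weight-update loop described above, for a model with a single weight and no intercept. The toy data, starting weight, and learning rate are assumptions picked so the loop converges quickly; this is plain gradient descent, without momentum or an adaptive learning rate.

```python
def mse(w, xs, ys):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Slope of the loss curve at w: d(MSE)/dw = mean(2 * (w*x - y) * x)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Toy data on the line y = 2x, so the optimal weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = -5.0              # arbitrary starting weight (stands in for a random one)
learning_rate = 0.05  # assumed value; a much larger rate would overshoot

for step in range(100):
    grad = mse_gradient(w, xs, ys)
    # Negative slope -> w increases; positive slope -> w decreases.
    w -= learning_rate * grad

print(f"w={w:.4f}, loss={mse(w, xs, ys):.6f}")
```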
    
Gradient descent modes

* Batch
    * Compute loss for all training examples
    * Adjust weight
    * Example: 150 samples in training set. For each iteration (here, one full pass over the training set), weight is adjusted once
* Stochastic
    * Compute loss for next example
    * Adjust weight
    * Example: 150 samples in training set. For each iteration, weight is adjusted 150 times
* Mini-batch
    * Compute loss for a specified number of examples
    * Adjust weight
    * Example: 150 samples in training set, mini batch size is 15. For each iteration, weight is adjusted 10 times.
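
The three modes differ only in how many examples feed each weight adjustment. The sketch below counts the adjustments in one pass over 150 examples; `iter_batches` is a hypothetical helper, and the actual loss and update computation is left out.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`; one weight update per slice."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(150))  # stand-in for 150 training examples

batch_updates = sum(1 for _ in iter_batches(samples, batch_size=len(samples)))
stochastic_updates = sum(1 for _ in iter_batches(samples, batch_size=1))
mini_batch_updates = sum(1 for _ in iter_batches(samples, batch_size=15))

print(batch_updates, stochastic_updates, mini_batch_updates)  # 1 150 10
```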

## Logistic Regression (Binary Classification)

The setup is similar to linear regression: we have a set of features, assign a weight to each feature, and sum the products of features and weights. For the output, however, we want the probability that the sample belongs to the positive class, where the target label is 0 or 1.

We can run the sum through the sigmoid function, whose output is bounded between zero and one - see [this](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/logistic_cost_example.ipynb) notebook.

* Typically need to assign a cutoff value for the output; for example, anything greater than 0.5 is classified as positive.
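
A small sketch of the sigmoid and the cutoff, using the standard definition $ \sigma(z) = 1 / (1 + e^{-z}) $. The helper `predict_class` and the weights are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(weights, features, cutoff=0.5):
    """Return (probability of the positive class, predicted label)."""
    x = [1.0] + list(features)  # x_0 = 1 for the intercept weight
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    probability = sigmoid(z)
    return probability, int(probability > cutoff)

# Hypothetical weights for a single-feature model: z = -4 + 2*3 = 2.
probability, label = predict_class([-4.0, 2.0], [3.0])
print(f"probability={probability:.3f}, label={label}")
```

Raising the cutoff makes the model more conservative about predicting the positive class; lowering it does the opposite.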

The training objective with logistic regression is to select the weights that minimize misclassification.

* Use the logistic loss function, which has separate loss curves for positive and negative samples
* Logistic loss is convex (bowl-shaped, like the parabolic squared-loss curve), so it indicates not only the loss at a given weight but also which direction to adjust the weight.
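
As a rough illustration, the per-sample logistic loss is commonly written as $ -\log(p) $ for a positive sample and $ -\log(1 - p) $ for a negative one, where $ p $ is the predicted probability of the positive class. The function name below is an assumption, not taken from the notebook.

```python
import math

def logistic_loss(y_true, p):
    """Loss for one sample: -log(p) if positive, -log(1 - p) if negative."""
    return -math.log(p) if y_true == 1 else -math.log(1.0 - p)

# A confident, correct prediction costs little; a confident, wrong one costs a lot.
loss_good = logistic_loss(1, 0.9)      # about 0.105
loss_bad = logistic_loss(1, 0.1)       # about 2.303
loss_negative = logistic_loss(0, 0.1)  # about 0.105

print(f"{loss_good:.3f} {loss_bad:.3f} {loss_negative:.3f}")
```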

How to find the optimal weights?

* Use the gradient descent optimizer
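
A minimal sketch of batch gradient descent for a one-feature logistic model. It relies on the standard result that the gradient of the mean logistic loss with respect to each weight is the mean of (predicted probability - label) times the matching feature. The toy data, learning rate, and step count are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one feature; the label is 1 when the feature is large.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w0, w1 = 0.0, 0.0    # intercept weight and feature weight
learning_rate = 0.5  # assumed value

for step in range(2000):
    grad_w0 = grad_w1 = 0.0
    for x, y in zip(xs, ys):
        error = sigmoid(w0 + w1 * x) - y  # predicted probability minus label
        grad_w0 += error                  # x_0 = 1 for the intercept
        grad_w1 += error * x
    w0 -= learning_rate * grad_w0 / len(xs)
    w1 -= learning_rate * grad_w1 / len(xs)

predictions = [int(sigmoid(w0 + w1 * x) > 0.5) for x in xs]
print(predictions)
```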

## Neural Networks

Linear models are simple and easy to understand, but typically underperform (underfit) on non-linear data. They require extensive feature engineering, and features need to be on a similar range and scale.

Linear models form the foundation for understanding neural networks. A neural network (NN) looks like several stacked logistic models, with the sigmoid generalized to an arbitrary activation function.

A 'neuron' is the weighted sum of the features passed through an activation function. At each layer, the features can be connected to multiple neurons, with weights specific to each neuron.

* The neurons generate new features by combining existing ones, which are then inputs to the next layer of neurons.
* Basic architecture has an input layer, one or more hidden layers, and an output layer.
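
A rough sketch of a forward pass through a tiny network with 2 inputs, 3 hidden neurons, and 1 output. The weights are hand-picked for illustration rather than trained, and a separate `bias` term plays the role of the intercept weight $ w_0 $.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases, activation=sigmoid):
    """Every neuron sees all inputs but has its own weights and bias."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
inputs = [0.5, -1.0]
hidden = layer(inputs, [[0.1, 0.4], [-0.3, 0.2], [0.7, -0.5]], [0.0, 0.1, -0.2])
output = layer(hidden, [[0.6, -0.1, 0.3]], [0.05])

print(hidden, output)
```

The three values in `hidden` are the new features produced by the hidden layer; they become the inputs to the output neuron.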

Benefits

* Automatic feature engineering - mixes features to create new ones
* Handles non-linear datasets
* Standard techniques exist to deal with overfitting (NNs are easy to overfit) - regularization, reducing model complexity, etc.

Activation Functions

* Introduce non-linearity into the model
* Improves ability of model to fit complex non-linear datasets
* Three popular activation functions: sigmoid, tanh, relu

Activation function notebook - see [here](https://github.com/ChandraLingam/AmazonSageMakerCourse/blob/master/GradientDescent/activation_functions.ipynb)

* sigmoid - converts input to a number between 0 and 1
* tanh - output varies from -1 to 1
* relu - output is 0 for negative input, otherwise same as input
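
The three functions can be sketched in a few lines of plain Python (this is not the notebook's code); the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output between -1 and 1."""
    return math.tanh(z)

def relu(z):
    """0 for negative input, otherwise the input itself."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```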

Deep learning - subset of machine learning that uses complex networks with many layers (sometimes hundreds). Why so popular?

* Traditional ML algorithms appear to saturate on how much they can learn. Having massive amounts of data does not translate to more learning.
* Small NNs can learn more than traditional algorithms, medium NNs can learn more still, and large NNs can keep learning as more data is added.

Binary classifier - send the output through a sigmoid function.

Multiclass classifier - use softmax to convert the outputs to an array of probability scores, one per class; the probabilities across all classes sum to 1.
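
A minimal softmax sketch: exponentiate each raw score and divide by the total. Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The three raw scores are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical raw outputs for 3 classes
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs], predicted_class)
```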

Popular NN architectures

* General purpose
    * fully connected network
    * example: treats each pixel as a separate feature
* Convolutional Neural Network (CNN)
    * Useful for image analysis
    * Example: considers each pixel together with its surrounding pixels
* Recurrent NN
    * Looks at history
    * Used for time series prediction, natural language processing
    * Example: time series forecasting - model looks at current values and historical values
    