Training Neural Networks
-----

<center><img src="https://cdn.meme.am/instances/500x/65051569/waiting-skeleton-near-computer-still-waiting-for-my-neural-network.jpg" width="400"/></center>

By The End Of This Session You Should Be Able To:
----

- List the hyperparameters to tune for a NN
- Design the best general architecture for a NN
- Compare activation functions
- Explain how learning rate affects model performance

What is the goal of machine learning?
------

Automatically learning from data to generalize (across domain and time)

Hyperparameters to tune
-----

1. Number of epochs
1. Architecture 
1. Activation function
1. Weight update rate, aka learning rate
1. Define Training set
1. Weight update algorithm, aka optimizer
1. Regularization
1. Minibatch size
1. Batch normalization 
1. Types of optimizers
1. Data Augmentation
1. Weight sharing
1. Fine tuning, aka pretrained networks

What are epochs?
-------

A single iteration over the entire training set.


There is a trade-off between training time and test performance
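A minimal sketch of where epochs sit in a training loop, assuming a toy linear model fit with minibatch SGD in NumPy. The `train` function, the toy data, and every parameter value are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def train(X, y, n_epochs=20, batch_size=16, learning_rate=0.1, seed=0):
    """Fit y ~ X @ w with minibatch SGD; return weights and per-epoch loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    history = []
    for epoch in range(n_epochs):
        # One epoch = one full pass over the (shuffled) training set.
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ error / len(idx)  # gradient of mean squared error
            w -= learning_rate * grad
        history.append(float(np.mean((X @ w - y) ** 2)))
    return w, history

# Hypothetical toy data: y is an exact linear function of X.
rng = np.random.default_rng(42)
X = rng.normal(size=(128, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w, history = train(X, y)
print("loss after first epoch:", history[0])
print("loss after last epoch: ", history[-1])
```

More epochs cost more training time; whether they keep improving test performance has to be checked on held-out data.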

Architecture 
------

1. Number of nodes in each layer 
1. Number of layers  

Number of nodes in each layer
-------

Why do we want to stack NN layers (instead of very wide layers)?

Better at learning combinations of features!

Very wide, shallow networks are very good at memorization, but not so good at generalization. If you train the network with every possible input value, a super wide network could eventually memorize the corresponding output value that you want
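As a rough illustration of the width-versus-depth choice, the sketch below counts the parameters of a fully connected network from its layer sizes. The `count_parameters` helper and both architectures are hypothetical; parameter count alone says nothing about generalization, it only shows how differently the two shapes spend their capacity.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical architectures for 784 inputs (e.g. 28x28 images) and 10 classes.
wide_shallow = [784, 1024, 10]
deep_narrow = [784, 128, 128, 128, 10]

print("wide, shallow:", count_parameters(wide_shallow))
print("deep, narrow: ", count_parameters(deep_narrow))
```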

Add layers: Always go deeper!
------

<center><img src="https://cdn-business.discourse.org/uploads/analyticsvidhya/original/2X/5/55cce711aecac89c48d50691cf1c525f785d794f.png" height="500"/></center>

We'll see more tricks later

Activation Function
------

To allow Neural Networks to learn complex decision boundaries, 
we apply a nonlinear activation function to some of its layers. 
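A small sketch of why the nonlinearity matters, assuming random weights and inputs chosen only for illustration: two linear layers with nothing between them are equivalent to a single linear layer, while inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))   # 5 examples, 4 features (illustrative)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

# Two linear layers with no activation in between...
two_linear_layers = (x @ W1) @ W2
# ...are exactly one linear layer with weights W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a nonlinearity (here ReLU) between the layers, the collapse no longer holds.
with_relu = np.maximum(0, x @ W1) @ W2
differs = not np.allclose(with_relu, one_linear_layer)

print("linear stack collapses to one layer:", collapses)
print("ReLU stack differs from one layer:  ", differs)
```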

Check for understanding
-------

What is the fundamental requirement for an activation function?

An activation function must be differentiable (at least almost everywhere) because backprop needs its gradient.

Commonly used activation functions:
-------

- Sigmoid
- Tanh
- ReLU (Rectified Linear Unit)
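For side-by-side comparison, here is a sketch of the three functions in NumPy. The function names and sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1); a rescaled, zero-centered sigmoid.
    return np.tanh(x)

def relu(x):
    # Passes positive inputs through unchanged, zeroes out the rest.
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, fn(x))
```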

What is limiting about sigmoid?
-------

1. A node's activation saturates at either tail, near 0 or 1
2. Local gradient maximum of 0.25

Sigmoids saturate and kill gradients.
-------

<center><img src="https://cdn-images-1.medium.com/max/800/1*gkXI7LYwyGPLU5dn6Jb6Bg.png" height="500"/></center>

A node's activation saturates at either tail (near 0 or 1), and the gradient in these regions is almost zero.

[Source](http://www.kdnuggets.com/2016/03/must-know-tips-deep-learning-part-2.html)
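The saturation can be checked numerically. This is a sketch, with the sample inputs picked arbitrarily, that evaluates the sigmoid and its gradient at the center and in both tails.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient is largest at the center and falls toward zero as the output approaches 0 or 1.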

Local gradient maximum at 0.25
------
<center><img src="https://qph.ec.quoracdn.net/main-qimg-b89e3c9b324b958b1b38ec2976a18583" height="500"/></center>

Another non-obvious fun fact about sigmoid is that its local gradient, z(1 - z), achieves a maximum of 0.25 when the sigmoid output z = 0.5 (that is, when the input is 0).

Thus, every time the gradient signal flows through a sigmoid gate, its magnitude shrinks to one quarter or less.

If you’re using basic SGD, this would make the lower layers of a network train much slower than the higher ones.

[Source](https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b)
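A back-of-the-envelope sketch of how that factor compounds with depth. It assumes every sigmoid sits at its best case (local gradient of exactly 0.25) and ignores the weight matrices, purely to isolate the sigmoid's contribution; `max_gradient_scale` is a hypothetical helper.

```python
# Best-case local gradient of a single sigmoid gate.
MAX_SIGMOID_GRAD = 0.25

def max_gradient_scale(n_layers):
    """Upper bound on the gradient scale after passing through n sigmoid gates."""
    return MAX_SIGMOID_GRAD ** n_layers

for n in [1, 2, 5, 10]:
    print(f"{n:2d} sigmoid layers -> gradient scaled by at most {max_gradient_scale(n):.2e}")
```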

<center><img src="images/relu.png" height="500"/></center>

What is the derivative of ReLU?
-----

<center><img src="https://cdn-images-1.medium.com/max/800/1*g0yxlK8kEBw8uA1f82XQdA.png" height="500"/></center>
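In code, ReLU and its derivative can be sketched as below. The derivative is undefined at exactly 0; returning 0 there is a common convention and is an assumption of this sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, else 0. At x == 0 the derivative is undefined;
    # this sketch assumes the common convention of using 0 there.
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("relu:     ", relu(x))
print("relu_grad:", relu_grad(x))
```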

What are the advantages of ReLU?
----

1) ReLU does not saturate for positive inputs. Its output can grow without bound!

2) Induces sparsity in the hidden units. Hidden units can be driven to exactly zero.

3) Deeper networks can be trained because it mitigates the "vanishing gradient problem".

<center><img src="images/more_relu.jpg" height="500"/></center>

What is a drawback of ReLU?
-----

__“Dead ReLU” problem__

If a ReLU neuron is unfortunately initialized such that it never fires, or if a neuron’s weights ever get knocked off with a large update during training into this regime, then this neuron will remain __permanently dead__.

[Source 1](http://www.kdnuggets.com/2016/03/must-know-tips-deep-learning-part-2.html)  
[Source 2](http://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks)
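A sketch of a dead ReLU, assuming a single neuron whose bias has been pushed far negative (the value -100 and the random inputs are made up for illustration). Its output is zero for every input, so its local gradient is zero too and the weights never receive an update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # hypothetical inputs
w = rng.normal(size=4)
b = -100.0                       # bias knocked far negative, e.g. by one large update

pre_activation = X @ w + b
output = np.maximum(0.0, pre_activation)
local_grad = (pre_activation > 0).astype(float)

# The gradient of the output w.r.t. w for each example is local_grad * x,
# so if local_grad is 0 everywhere, no gradient ever reaches the weights.
grad_w = (local_grad[:, None] * X).sum(axis=0)

print("largest output over all inputs:", output.max())
print("gradient reaching the weights: ", grad_w)
```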

Weight update rate, aka learning rate
------

Picking the optimal learning rate is impossible.

Picking a mediocre learning rate is easy.

Picking a decent learning rate is the goal (and an art).

<center><img src="images/learning_rate.png" height="500"/></center>

[Source](http://cs231n.github.io/neural-networks-3/#baby)

If we pick a learning rate that's too big, we'll most likely start diverging (weights will go to infinity).

If we pick a learning rate that's too small, we risk taking too long during the training process.
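Both failure modes show up even on a one-dimensional toy problem. The sketch below runs plain gradient descent on f(w) = w², with three learning rates picked by hand for illustration; the function and the values are assumptions, not a recipe.

```python
def gradient_descent(learning_rate, n_steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w0."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * 2 * w
    return w

# Illustrative values: too big, too small, decent.
for label, lr in [("too big", 1.1), ("too small", 0.001), ("decent", 0.1)]:
    print(f"{label:9s} lr={lr:<6} final w = {gradient_descent(lr):.6g}")
```

With the large rate each step overshoots the minimum at 0 and lands farther away; with the tiny rate the weight has barely moved after 50 steps.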

How do I know when to stop?
------

When there are no more errors!

Handling Errors
------

<center><img src="images/errors.jpg" height="500"/></center>

[Source](http://bytes.schibsted.com/deep-learning-changing-data-science-paradigms/)

The power of common objective benchmarks
-----

Machine learning's secret sauce.

Everyone competes on a single dataset with clear benchmarks.

Other disciplines, for example Computer Science and Neuroscience, do not use them.

How good is "good enough" on MNIST?
------

[ML benchmarks](https://en.wikipedia.org/wiki/MNIST_database)

Lab tips:
------

> Being a machine learning researcher is a lot like being an addict at a slot machine, forever running experiments to see if intuitions about hyperparameters or setups are working,

> This sort of slot machine mentality does not encourage good science. Perhaps by chance we get to a set of parameters that “looks promising”.

[Source](https://jack-clark.net/2017/03/27/import-ai-issue-35-the-end-of-imagenet-unsupervised-image-fiddling-with-discogan-and-alibabas-voice-data-stockpile/)

Lab tips: Take detailed notes!
-----

<br>

<center><img src="images/time.png" height="500"/></center>

[Source](https://www.slideshare.net/JenAman/large-scale-deep-learning-with-tensorflow)

Summary
----

- There are __a lot__ of NN hyperparameters. Choose wisely.
- Architectures should be deep and simple.
- ReLU is currently the go-to default activation function.
- It is an art to find the "Goldilocks" learning rate.

<br>
<br>
--

<center><img src="http://7pn4yt.com1.z0.glb.clouddn.com/blog-prelu.png" height="500"/></center>

<br>
<br> 
<br>

----