# ♫ harder, better, faster, stronger ♫

-written by Victor Geislinger

We can optimize our model so that it trains faster and performs better.

## Some useful resources on optimization

- Keras documentation & discussion: https://keras.io/optimizers/
- Excellent blog post on optimizations: http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms

# Vanishing Gradient Problem

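When our gradients are too small, we can run into a few issues when training our model (adjusting weights and biases):
- Slow training: each update barely changes the weights
- Local minimum problem: tiny steps can't carry us out of a shallow dip in the cost function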

## Activation functions (from before)

A simple way to change the gradient is to change the activation function, since different activation functions produce different gradients.
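As a rough illustration, the sketch below compares the gradients of two common activation functions. The choice of sigmoid and ReLU, the sample inputs, and the depth of 10 layers are assumptions made for this example, not something prescribed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; it never exceeds 0.25 (reached at z = 0)
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z): 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

z = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])
sig_g = sigmoid_grad(z)
relu_g = relu_grad(z)
print("sigmoid gradients:", np.round(sig_g, 4))
print("ReLU gradients:   ", relu_g)

# Backpropagation multiplies one such factor per layer, so even the
# best-case sigmoid factor shrinks quickly with depth (hypothetical depth).
depth = 10
best_case = 0.25 ** depth
print("best-case product over", depth, "sigmoid layers:", best_case)
```

The sigmoid gradient is tiny for inputs far from zero, while the ReLU gradient stays at 1 for any positive input, which is one reason swapping the activation function can ease the vanishing gradient problem.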

## Random Starts

To avoid local minima, what if we just randomly initialize ourselves somewhere on the cost function?

We'll have to run the same model multiple times, but each run starts off with different weights, which can help us avoid falling into a local minimum.
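Here is a minimal sketch of random starts on a one-dimensional toy problem. The cost function, the learning rate, the number of steps, and the number of starts are all made up for illustration; a real model would restart with different random weights instead of a single number `w`.

```python
import numpy as np

def cost(w):
    # Toy cost with a local minimum near w = 1.13 and a global one near w = -1.30
    return w**4 - 3.0 * w**2 + w

def cost_grad(w):
    # Derivative of the toy cost
    return 4.0 * w**3 - 6.0 * w + 1.0

def descend(w, learning_rate=0.01, steps=500):
    # Plain gradient descent from a single starting point
    for _ in range(steps):
        w = w - learning_rate * cost_grad(w)
    return w

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=20)           # random starting points
finals = np.array([descend(w0) for w0 in starts])  # one run per start
best_w = finals[np.argmin(cost(finals))]           # keep the run with the lowest cost

print("distinct end points:", np.unique(np.round(finals, 2)))
print("best end point:", round(float(best_w), 3))
```

Runs that start on the right side of the hill get stuck in the shallower dip, but keeping the best of several runs makes it much more likely we end up in the deeper one.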

## Momentum

- Power over the hill of a local minimum
- Dampens oscillations
- Average over the past steps (decay the old steps), where $g_n$ is the raw step at iteration $n$ and $S_n$ is the smoothed step
    + $S_n = \big ( 1 - \beta \big ) \cdot \big ( g_n + \beta \cdot g_{n-1} + \beta^2 \cdot g_{n-2} + \cdots \big )$
    + $S_n = g_n + \beta \cdot \big ( S_{n-1} - g_n \big )$
    + $S_n = \beta \cdot S_{n-1} + \big ( 1- \beta \big ) \cdot g_n$
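The recursive form of the average above can be sketched in a few lines. The function name `smooth_steps`, the variable `g` for the current raw step, the value $\beta = 0.9$, and the oscillating input are assumptions chosen for this example.

```python
import numpy as np

def smooth_steps(raw_steps, beta=0.9):
    # Exponentially weighted average: new = beta * previous + (1 - beta) * current
    smoothed = []
    s = 0.0
    for g in raw_steps:
        s = beta * s + (1.0 - beta) * g
        smoothed.append(s)
    return np.array(smoothed)

beta = 0.9
raw_steps = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # steps that keep flipping sign
smoothed = smooth_steps(raw_steps, beta)

# The same final value written as an explicit decayed sum over past steps
weights = (1.0 - beta) * beta ** np.arange(len(raw_steps))
explicit_last = np.sum(weights * raw_steps[::-1])

print("raw steps:     ", raw_steps)
print("smoothed steps:", np.round(smoothed, 4))
print("explicit sum for last step:", round(float(explicit_last), 4))
```

The raw steps bounce between +1 and -1, while the smoothed steps stay close to zero, which is the dampening effect listed above.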

# Speeding Up

## Normalization

Normalizing the input features to a common scale allows speedier training, since no one feature will overpower the direction down the hill.
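One common way to do this is standardization: subtract each feature's mean and divide by its standard deviation. The sketch below assumes this particular method and uses a made-up feature matrix whose two columns live on very different scales.

```python
import numpy as np

# Hypothetical data: column 0 is in the thousands, column 1 is single digits
X = np.array([[1200.0, 3.0],
              [ 850.0, 2.0],
              [2000.0, 4.0],
              [1500.0, 3.0]])

mean = X.mean(axis=0)         # per-feature mean
std = X.std(axis=0)           # per-feature standard deviation
X_scaled = (X - mean) / std   # each column now has mean 0 and std 1

print(np.round(X_scaled, 3))
```

When applying this to new data (for example a test set), reuse the `mean` and `std` computed from the training data rather than recomputing them.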

## Stochastic Gradient Descent

Instead of taking a long time to process the whole dataset before each update (a careful step), we update the weights after going through only part of the dataset (quick, "drunken" steps).

"Stochastic" == Random

### Steps

1. Take a random set (batch)
2. Feedforward batch through model
3. Calculate error (loss) from batch
4. Adjust weights via backpropagation
5. Repeat with all points/batches
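The steps above can be sketched for a tiny linear model in NumPy. The synthetic data, the model `w * x + b`, the mean squared error loss, and the hyperparameter values are all assumptions for this example; with a single linear layer, the backpropagation step reduces to the two gradient formulas shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for this sketch): y = 3x + 2 plus a little noise
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=200)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 20

for epoch in range(50):
    order = rng.permutation(len(X))              # shuffle so each batch is a random set
    # 5. the inner loop repeats steps 1-4 until every batch has been used
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # 1. take a random batch
        x_batch, y_batch = X[idx, 0], y[idx]
        pred = w * x_batch + b                   # 2. feedforward the batch
        error = pred - y_batch                   # 3. loss is mean(error ** 2)
        grad_w = 2.0 * np.mean(error * x_batch)  # 4. gradient of the loss w.r.t. w
        grad_b = 2.0 * np.mean(error)            #    and w.r.t. b
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print("learned w:", round(w, 2), "learned b:", round(b, 2))
```

Each update only sees 20 points, so individual steps are noisy, but there are 10 updates per pass over the data instead of one.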

## Learning Rate Decay

The idea is that:

- **high learning rate** value --> we move fast but possibly skip over the minimum
- **low learning rate** value --> we move slowly, maybe never getting there

So there are advantages and disadvantages to each.

<img width=40% src='images/why-not-both.jpg'/>

As we complete more epochs, we should be getting closer to our answer. We can decrease the learning rate $\alpha$ as training progresses, so we (hopefully) do the following:
- Early on, when the gradient is steep (large), we use a large $\alpha$
- Later, when the gradient is level (small), we use a small $\alpha$
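One simple schedule is time-based decay, where the learning rate shrinks as the epoch count grows. The formula below, the function name `decayed_learning_rate`, and the values for the initial rate and decay constant are assumptions for this sketch; other schedules (step decay, exponential decay) follow the same idea.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    # Time-based decay: the rate shrinks as the epoch count grows
    return initial_rate / (1.0 + decay * epoch)

initial_rate = 0.1   # hypothetical starting learning rate
decay = 0.5          # hypothetical decay constant

rates = [decayed_learning_rate(initial_rate, decay, epoch) for epoch in range(5)]
for epoch, rate in enumerate(rates):
    print("epoch", epoch, "learning rate", round(rate, 4))
```

The first epochs take large steps to cover ground quickly, and later epochs take smaller steps so we are less likely to jump over the minimum.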