In [7]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

# Chapter 10 - Optimisers - Stochastic Gradient Descent (SGD)

- **Stochastic Gradient Descent (SGD)** historically refers to an optimizer that fits a single sample at a time.
- **Batch Gradient Descent** is an optimizer that fits the whole dataset at once.
- **Mini-batch Gradient Descent** fits slices of a dataset, which we call batches in our context.


In the context of deep learning and this book, we call *slices* of data *batches*, whereas historically, in the context of Stochastic Gradient Descent, such slices were called *mini-batches*. In a future chapter, we’ll introduce data slices, or batches, so we should start by using the Mini-batch Gradient Descent optimizer.
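
The three variants differ only in how many samples feed each parameter update. A minimal sketch, assuming a hypothetical toy dataset `X` and a helper `iterate_batches` that are not part of this chapter's code, counts the updates each variant makes in one pass over the data:

```python
import numpy as np

# Hypothetical toy dataset: 10 samples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)

def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data` holding up to `batch_size` samples."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# One sample per update: the historical meaning of SGD.
sgd_steps = sum(1 for _ in iterate_batches(X, 1))
# The whole dataset per update: batch gradient descent.
bgd_steps = sum(1 for _ in iterate_batches(X, len(X)))
# Slices of 4 samples per update: mini-batch gradient descent.
mbgd_steps = sum(1 for _ in iterate_batches(X, 4))

print(sgd_steps, bgd_steps, mbgd_steps)  # 10 1 3
```

The last mini-batch holds only 2 samples because 10 is not a multiple of 4.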


In the case of Stochastic Gradient Descent, we choose a learning rate, such as 1.0. We then
subtract learning_rate · parameter_gradients from the current parameter values.
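
Written as an update rule, this step looks as follows. The symbols are assumptions of this note rather than notation used in the chapter: $w$ is any parameter, $\eta$ the learning rate, and $L$ the loss.

$$
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}
$$

For example, with $w = 0.5$, a gradient of $0.2$, and $\eta = 1.0$, the updated value is $0.5 - 1.0 \cdot 0.2 = 0.3$.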

If our learning rate is 1, then we subtract the exact gradient value from our parameters. We’ll start with 1 to see the results and look at the learning rate more closely shortly. Let’s create the SGD optimizer class. The initialization method takes hyperparameters, starting with the learning rate, and stores them as instance attributes.

In [8]:
# SGD optimizer
class Optimizer_SGD:
    # Initialize optimizer - set settings,
    # learning rate of 1. is default for this optimizer
    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate
        
    # Update parameters
    def update_params(self, layer):
        # Step each parameter against its gradient, scaled by the
        # learning rate stored in __init__
        weight_updates = -self.learning_rate * layer.dweights
        bias_updates = -self.learning_rate * layer.dbiases
        layer.weights += weight_updates
        layer.biases += bias_updates
        

In [9]:
# The update_params method, given a layer object, multiplies the gradients
# stored in the layer by the negated learning rate and adds the result to
# the layer's parameters.
optimizer = Optimizer_SGD()
# Update the network layers' parameters after their gradients have been
# calculated. dense1 and dense2 must already exist, with dweights and
# dbiases set by a backward pass.
optimizer.update_params(dense1)
optimizer.update_params(dense2)



NameError: name 'dense1' is not defined
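
The cell above fails because `dense1` and `dense2` are not defined in this notebook. As a minimal sketch, the update can be tried on a hypothetical stand-in layer (`ToyLayer` and its values are made up for illustration and are not the dense layer class from earlier chapters):

```python
import numpy as np

class ToyLayer:
    """Hypothetical stand-in for a dense layer: parameters plus gradients."""
    def __init__(self):
        self.weights = np.array([[0.5, -0.2], [0.1, 0.4]])
        self.biases = np.array([[0.0, 0.0]])
        # In a real network these come from the backward pass.
        self.dweights = np.array([[0.1, 0.1], [-0.2, 0.3]])
        self.dbiases = np.array([[0.05, -0.05]])

class Optimizer_SGD:
    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate

    def update_params(self, layer):
        # Move each parameter against its gradient.
        layer.weights += -self.learning_rate * layer.dweights
        layer.biases += -self.learning_rate * layer.dbiases

layer = ToyLayer()
optimizer = Optimizer_SGD(learning_rate=1.0)
optimizer.update_params(layer)
print(layer.weights)
print(layer.biases)
```

With a learning rate of 1.0, each parameter moves by exactly its gradient, so the first weight goes from 0.5 to 0.4.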

## Learning Rate Decay

In [None]:
# Inverse-time (1/t) decay: the rate shrinks in proportion to 1 / (1 + decay * step)

starting_learning_rate = 1.
learning_rate_decay = 0.1
for step in range(1, 20):
    learning_rate = starting_learning_rate * \
        (1. / (1 + learning_rate_decay * step))
    print(learning_rate)


0.9090909090909091
0.8333333333333334
0.7692307692307692
0.7142857142857143
0.6666666666666666
0.625
0.588235294117647
0.5555555555555556
0.5263157894736842
0.5
0.47619047619047616
0.45454545454545453
0.4347826086956522
0.41666666666666663
0.4
0.3846153846153846
0.37037037037037035
0.35714285714285715
0.3448275862068965
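
The same schedule can be wrapped in a small function to query any step directly. This is a sketch only; the name `decayed_learning_rate` is hypothetical and does not appear elsewhere in this notebook.

```python
def decayed_learning_rate(initial_rate, decay, step):
    """Inverse-time (1/t) decay: initial_rate / (1 + decay * step)."""
    return initial_rate / (1.0 + decay * step)

# At step 0 no decay has been applied yet.
rate_at_start = decayed_learning_rate(1.0, 0.1, 0)
# With decay = 0.1 the rate has halved by step 1 / decay = 10.
rate_at_ten = decayed_learning_rate(1.0, 0.1, 10)

print(rate_at_start, rate_at_ten)
```

The value at step 10 matches the tenth line of the output above (0.5).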


## Learning Rate Momentum

The problem with regular SGD is that each update follows only the current gradient, which points roughly toward the nearest local minimum, so training can get stuck there. Adding momentum lets the optimizer "roll over" local minima by allowing previous updates to influence the current one. A further hyperparameter, the momentum factor, sets what fraction of the previous update is carried over.



In [None]:
# Update parameters
def update_params(self, layer):
    # If the layer does not contain momentum arrays, create them
    # filled with zeros. If there is no momentum array for weights,
    # the array doesn't exist for biases yet either.
    if not hasattr(layer, 'weight_momentums'):
        layer.weight_momentums = np.zeros_like(layer.weights)
        layer.bias_momentums = np.zeros_like(layer.biases)
    # Build weight updates with momentum - take previous
    # updates multiplied by retain factor and update with
    # current gradients
    weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
    layer.weight_momentums = weight_updates
    # Build bias updates
    bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
    layer.bias_momentums = bias_updates
    # Apply the updates to the layer's parameters
    layer.weights += weight_updates
    layer.biases += bias_updates
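
The method above relies on `self.momentum` and `self.current_learning_rate`, which the optimizer's constructor has to set. The following is a sketch of one way to assemble the full class, assuming the same constructor and `pre_update_params`/`post_update_params` pattern that the Adagrad class later in this chapter uses. The layer here is a hypothetical `SimpleNamespace` stand-in with a constant gradient.

```python
import numpy as np
from types import SimpleNamespace

class Optimizer_SGD:
    def __init__(self, learning_rate=1.0, decay=0.0, momentum=0.0):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate / (1.0 + self.decay * self.iterations)

    def update_params(self, layer):
        if self.momentum:
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)
            # Previous update scaled by the momentum factor, minus the
            # current gradient step
            weight_updates = self.momentum * layer.weight_momentums - self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates
            bias_updates = self.momentum * layer.bias_momentums - self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates
        else:
            # Vanilla SGD update
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases
        layer.weights += weight_updates
        layer.biases += bias_updates

    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1

# Hypothetical layer whose gradient stays at 1.0 for every parameter.
layer = SimpleNamespace(weights=np.zeros((1, 2)), biases=np.zeros((1, 2)),
                        dweights=np.ones((1, 2)), dbiases=np.ones((1, 2)))
optimizer = Optimizer_SGD(learning_rate=0.1, momentum=0.9)
for _ in range(2):
    optimizer.pre_update_params()
    optimizer.update_params(layer)
    optimizer.post_update_params()
print(layer.weights)
```

The first update is -0.1. The second is 0.9 · (-0.1) - 0.1 = -0.19, so after two steps each weight sits at -0.29 rather than the -0.2 that plain SGD would give.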


# AdaGrad: Adaptive Gradient

AdaGrad innovates on the learning rate by implementing a per-parameter learning rate rather than a single, globally shared value (like above).
The point is to normalize updates: during training, some weights change significantly while others change very little.
AdaGrad does this by keeping a history of previous updates. The larger a parameter's accumulated updates, the smaller its later updates become, which lets less-frequently
updated parameters keep up with changes, effectively utilizing more neurons for training.

```python
cache += parm_gradient ** 2
parm_updates = learning_rate * parm_gradient / (sqrt(cache) + epsilon)
```

- **cache** holds a running sum of squared gradients
- **parm_updates** is the learning rate multiplied by the gradient, divided by the square root of the cache plus epsilon
- **epsilon** is a small hyperparameter that prevents division by 0


Because we sum squared gradients and then take the square root, the denominator is not simply the sum of the gradient magnitudes; it grows more slowly than a plain sum would.
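
To see the difference in growth, the sketch below feeds a hypothetical constant gradient of 2.0 into the cache for four steps and compares the square-rooted cache with a plain running sum of gradient magnitudes. The variable names are illustrative only.

```python
import numpy as np

gradient = 2.0      # hypothetical constant gradient
cache = 0.0         # AdaGrad-style running sum of squared gradients
plain_sum = 0.0     # running sum of gradient magnitudes, for comparison

for step in range(1, 5):
    cache += gradient ** 2
    plain_sum += abs(gradient)
    print(step, np.sqrt(cache), plain_sum)
```

After four steps the square-rooted cache is 4.0 while the plain sum is 8.0, so dividing by the former shrinks the updates more gently.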


Implementation:

In [None]:
# Adagrad optimizer
class Optimizer_Adagrad:
    # Initialize optimizer - set settings
    def __init__(self, learning_rate=1., decay=0., epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        
    # Call once before any parameter updates
    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * (1. / (1. + self.decay * self.iterations))
    # Call once after any parameter updates
    def post_update_params(self):
        self.iterations += 1
        
    # Update parameters
    def update_params(self, layer):
        # If layer does not contain cache arrays,
        # create them filled with zeros
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
            
        # Update cache with squared current gradients
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2
        # Vanilla SGD parameter update + normalization
        # with square rooted cache
        layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases  += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)
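
The per-parameter effect is easiest to see on the very first update. The sketch below uses a hypothetical function `adagrad_step` (a condensed form of the update above, not part of the class) on two parameters whose gradients differ by a factor of 1000:

```python
import numpy as np

def adagrad_step(params, grads, cache, learning_rate=1.0, epsilon=1e-7):
    """Apply one AdaGrad update in place and return the params and cache."""
    cache += grads ** 2
    params -= learning_rate * grads / (np.sqrt(cache) + epsilon)
    return params, cache

params = np.array([0.0, 0.0])
grads = np.array([0.01, 10.0])     # hypothetical gradients
cache = np.zeros_like(params)

params, cache = adagrad_step(params, grads, cache)
print(params)
```

Both parameters move by roughly one learning rate, because each gradient is divided by (approximately) its own magnitude. Plain SGD would have moved them by 0.01 and 10.0 respectively.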

# RMSProp
## Root Mean Square Propagation

RMSProp is similar to AdaGrad, but the cache is calculated differently:

```python
cache = rho * cache + (1 - rho) * gradient**2
```

This adds a mechanism similar to momentum: because the cache is a moving average of squared gradients, per-parameter learning rate changes are smoother.
The new hyperparameter **rho** is the cache memory decay rate.

"Help retain global direction of changes"

Each update to the cache retains a part of
the cache and updates it with a fraction of the new, squared, gradients. In this way, cache contents
“move” with data in time, and learning does not stall.
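
A short comparison illustrates why learning does not stall. This sketch assumes `rho = 0.9` and a hypothetical constant gradient of 1.0, and tracks both cache styles over 100 steps:

```python
rho = 0.9            # cache memory decay rate (assumed value)
gradient = 1.0       # hypothetical constant gradient

adagrad_cache = 0.0
rmsprop_cache = 0.0
for _ in range(100):
    # AdaGrad: the cache only ever grows.
    adagrad_cache += gradient ** 2
    # RMSProp: the cache is a moving average of squared gradients.
    rmsprop_cache = rho * rmsprop_cache + (1 - rho) * gradient ** 2

print(adagrad_cache)
print(rmsprop_cache)
```

The AdaGrad cache reaches 100 and keeps growing, so its updates keep shrinking. The RMSProp cache levels off just below the squared gradient (1.0), so the effective step size stays roughly constant.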


In [None]:

# Update parameters
def update_params(self, layer):
    # If layer does not contain cache arrays,
    # create them filled with zeros
    if not hasattr(layer, 'weight_cache'):
        layer.weight_cache = np.zeros_like(layer.weights)
        layer.bias_cache = np.zeros_like(layer.biases)
    # Update cache with squared current gradients
    layer.weight_cache = self.rho * layer.weight_cache + (1 - self.rho) * layer.dweights**2
    layer.bias_cache = self.rho * layer.bias_cache + (1 - self.rho) * layer.dbiases**2
    # Vanilla SGD parameter update + normalization
    # with square rooted cache
    layer.weights += -self.current_learning_rate * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
    layer.biases += -self.current_learning_rate * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

# Adam
## Adaptive Moment Estimation

Adam is currently the most widely used optimizer. It is built atop RMSProp, with the momentum concept from SGD added back in.