## In this exercise, we will use the layer model which we built last week. Here we will explore different optimizer functions . In the previous section, we implemented the SGD with momentum. It does this by adding a fraction of the update vector of the past time step to the current update vector which accelerates the parameter updates for faster convergence 

References

http://cs231n.github.io/neural-networks-3/

http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants

In [6]:
import numpy as np

### Adagrad

A adaptive learning rate which individually adapts the learning rates of all model parameters by scaling them inversaly proportional to the square root of sum of all historical value of the gradient. It adapts to lower learning rates(updates) to the frequently occuring features and larger learning rates/updates to the infrequent feature

The main benefits about the Adagrad is it doesn't need the manual tuning of learning rate  and it is intially defaulted to 0.01 and leave it as it is.

the main  cons of adagrad is accumulation of squared gradients in the denominator and it keeps growing during the training process. This may shrink the learning rate too too small in way that that it can't move forward to gather additional knowledge.

In [7]:
class Adagrad:
    def __init__(self, learning_rate = 0.01, eps = 1e-8):
        self.learning_rate = learning_rate 
        #sum of squares of gradient
        self.ssg = None
        #smoothing term avoids division by zero
        self.eps = eps       
    
    def update(self, w, grad):
        if self.ssg is None:
            self.ssg = np.zeros(np.shape(w)) 
            
        # Add the square of the gradient of the loss function at w
        self.ssg += np.power(grad, 2)
        
        # Adaptive gradient with higher learning rate for infrequent data  
        return (w - (self.learning_rate * grad)) / np.sqrt(self.ssg+ self.eps)

## RMSProp

RMSProp a adaptive learning techniques which address the shrinking learning rate problem imposed by "AdaGrad" in a simple way by taking the running average of squared gradients.

cache = decay_rate * cache + (1 - decay_rate) * dx**2

x += - learning_rate * dx / (np.sqrt(cache) + eps)

In [8]:
class RmsProp:
    def __init__(self,learning_rate=0.001, decay_rate=0.9,eps =1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        
        #smoothing term avoids division by zero . same like adagrad
        self.eps = eps  
        
        #running average of gradient - cache
        self.rag = None
        
    def update(self, w, grad):
        if self.rag is None:
            self.rag = np.zeros(np.shape(grad)) 
            
        
        self.rag = (self.decay_rate * self.rag) + ((1- self.decay_rate) * np.power(grad, 2))
        
         # Adaptive gradient with higher learning rate for infrequent data  
        return (w - (self.learning_rate * grad)) / np.sqrt(self.rag+ self.eps)

## Adam = Adaptive Moment Estimation

Adam takes in addition to the moving average of past squared gradients , it also takes the moving average of the past gradients. in short we can call this as RMSProp with momentum.

This is refered from http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants

We compute the decaying averages of past and past squared gradients mt and vt respectively as follows

mt=β1mt−1+(1−β1)gt

vt=β2vt−1+(1−β2)gt**2

mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As mt and vt are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1).

They counteract these biases by computing bias-corrected first and second moment estimates:

m-cap= mt/1−β1

v-cap=vt/1−β2

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule

grad_updated = w - ((learning_rate * m-cap) / sqrt(v-cap) + eps)


In [10]:
class Adam:
    def __init__(self, learning_rate=0.001,b1=0.9,b2=0.999,eps =1e-8):
        self.learning_rate = learning_rate
        
        #decay rates(b1,be) and smoothing term
        self.b1, self.b2,self.eps = b1,b2,eps
        #to hold moving averages of the gradients
        self.mt = None
        #to hold moving averages of the squared gradients
        self.vt = None
        
    def update(self, w, grad):
        if self.mt is None:
            self.mt = np.zeros(np.shape(grad)) 
        if self.vt is None:
            self.vt = np.zeros(np.shape(grad)) 
            
        self.mt = (self.b1 * self.mt) + ((1-self.b1) * grad)
        self.vt = (self.b2 * self.vt) + ((1-self.b2) * np.power(grad,2))
        
        #to overcome these vectors get biased towards zero
        m_cap=  self.mt/(1-self.b1)
        v_cap= self.vt/(1-self.b2)
        
        grad_updated = self.learning_rate * m_cap / (np.sqrt(v_cap) + self.eps)
        
            
        return w - grad_updated
        
        

## Now let's test our handwritten dataset with these optimization algorithms, I have added the last session code into a python script named "DeepLearnerBase.py". Hence using the same  

In [28]:
from scipy.io import loadmat
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from DeepLearnerBase import Sequential, Dense, Activation, CrossEntropyForSoftMax,relu,softmax
%matplotlib inline

In [29]:
data = loadmat("data\handwritten.mat")
print(data['X'].shape)
print(data['y'].shape)

(5000, 400)
(5000, 1)


In [30]:
X = data['X']
y =  data['y']

In [31]:
X_train, X_valid, y_train, y_valid = train_test_split(
            X,y, test_size=0.20, random_state=42)
print(X_train.shape)
print(X_valid.shape)

(4000, 400)
(1000, 400)


In [32]:
optimizer = Adam()
loss = CrossEntropyForSoftMax()

In [33]:
model = Sequential([    
    Dense(100),
    Activation(relu),    
    Dense(10),
    Activation(softmax)    
],  optimizer, loss, X.shape[1])

In [34]:
model.summary()

+---------------+
| Model Summary |
+---------------+
Input Shape: 400
+------------+-------------+--------------+------------+
| Layer Name | Input Shape | Output Shape | Shape      |
+------------+-------------+--------------+------------+
| Dense      | 400         | 100          | (400, 100) |
| relu       | 100         | 100          | (100, 100) |
| Dense      | 100         | 10           | (100, 10)  |
| softmax    | 10          | 10           | (10, 10)   |
+------------+-------------+--------------+------------+


In [35]:
model.fit(X_train,y_train,X_valid,y_valid,epochs= 10000,batchsize= 5000)

  0% (5 of 10000) |                       | Elapsed Time: 0:00:00 ETA:  0:09:01

Epoch# 0 Training Loss:2.324754138876231 Validation Loss: 2.2704057128941066 Training Accuracy:0.0925 Validation Accuracy:0.193


 10% (1004 of 10000) |##                  | Elapsed Time: 0:00:48 ETA:  0:07:10

Epoch# 1000 Training Loss:0.6545246111218269 Validation Loss: 0.3474585389286208 Training Accuracy:1.0 Validation Accuracy:0.931


 20% (2004 of 10000) |####                | Elapsed Time: 0:01:37 ETA:  0:06:25

Epoch# 2000 Training Loss:0.8866205573666668 Validation Loss: 0.408305362960412 Training Accuracy:1.0 Validation Accuracy:0.929


 30% (3004 of 10000) |######              | Elapsed Time: 0:02:27 ETA:  0:06:00

Epoch# 3000 Training Loss:1.0397182766869877 Validation Loss: 0.45107997656365545 Training Accuracy:1.0 Validation Accuracy:0.928


 40% (4005 of 10000) |########            | Elapsed Time: 0:03:18 ETA:  0:05:00

Epoch# 4000 Training Loss:1.162536772729566 Validation Loss: 0.48619939954213465 Training Accuracy:1.0 Validation Accuracy:0.929


 50% (5004 of 10000) |##########          | Elapsed Time: 0:04:08 ETA:  0:03:47

Epoch# 5000 Training Loss:1.270333699059994 Validation Loss: 0.5178631301244215 Training Accuracy:1.0 Validation Accuracy:0.93


 60% (6005 of 10000) |############        | Elapsed Time: 0:04:53 ETA:  0:03:03

Epoch# 6000 Training Loss:1.3693007216574993 Validation Loss: 0.5478026637751537 Training Accuracy:1.0 Validation Accuracy:0.931


 70% (7005 of 10000) |##############      | Elapsed Time: 0:05:38 ETA:  0:02:15

Epoch# 7000 Training Loss:1.462527951222865 Validation Loss: 0.5762268455951637 Training Accuracy:1.0 Validation Accuracy:0.931


 80% (8003 of 10000) |################    | Elapsed Time: 0:06:27 ETA:  0:01:33

Epoch# 8000 Training Loss:1.5516516580796622 Validation Loss: 0.6040875292600255 Training Accuracy:1.0 Validation Accuracy:0.931


 90% (9004 of 10000) |##################  | Elapsed Time: 0:07:14 ETA:  0:00:47

Epoch# 9000 Training Loss:1.637452369519704 Validation Loss: 0.6317042878785042 Training Accuracy:1.0 Validation Accuracy:0.931


100% (10000 of 10000) |###################| Elapsed Time: 0:08:02 Time: 0:08:02


## we also has other optimization algorithm like Nadam, AdaMax etc... Adam appears to be widely used optimization algorithms, but still in some areas the SGD with momentum outperforms Adam (for ex:- in my last project of detecting solar panel from aerial images). Here it gives the similar performances like SGD with momentum with this dataset