### Adam Optimizer

#### Basic Math

Let us consider data that follows an approximately linear trend.

Equation of a straight line:

$h_\theta(x) = \theta_0 + \theta_1 x$

Equations when using batch gradient descent:

The cost function can be represented as:

$J(\theta_0,\theta_1) = {1 \over 2m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)^2$

We need to find the parameters that minimize the cost function, i.e. reach its global minimum.

The global minimum can be reached using various optimization techniques.

In order to do that, we need to find the gradients of the cost function with respect to the parameters $\theta_0$ and $\theta_1$.\
The gradients are:

$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)$

$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m ((h_\theta(x_i)-y_i) \cdot x_i)$
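As a quick sanity check on the two partial derivatives above, the following sketch evaluates them on a tiny hand-made dataset and compares them with central finite differences of the cost. The data values and the helper names `batch_cost` and `batch_gradients` are assumptions made for illustration; they are not part of the original notebook.

```python
def h(theta_0, theta_1, x):
    # Straight-line hypothesis: intercept theta_0, slope theta_1
    return theta_0 + theta_1 * x

def batch_cost(xs, ys, theta_0, theta_1):
    # J = 1/(2m) * sum of squared residuals over the whole batch
    m = len(xs)
    return sum((h(theta_0, theta_1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def batch_gradients(xs, ys, theta_0, theta_1):
    # Analytic partial derivatives of J with respect to theta_0 and theta_1
    m = len(xs)
    residuals = [h(theta_0, theta_1, x) - y for x, y in zip(xs, ys)]
    d_theta_0 = sum(residuals) / m
    d_theta_1 = sum(r * x for r, x in zip(residuals, xs)) / m
    return d_theta_0, d_theta_1

# Illustrative data only (assumed): points lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

g0, g1 = batch_gradients(xs, ys, 0.5, 1.0)

# Central finite differences of the cost, for comparison
eps = 1e-6
fd0 = (batch_cost(xs, ys, 0.5 + eps, 1.0) - batch_cost(xs, ys, 0.5 - eps, 1.0)) / (2 * eps)
fd1 = (batch_cost(xs, ys, 0.5, 1.0 + eps) - batch_cost(xs, ys, 0.5, 1.0 - eps)) / (2 * eps)

print('analytic:', g0, g1)
print('numeric: ', fd0, fd1)
```

The analytic and numeric values should agree to several decimal places, and both gradients vanish at the parameters that generated the data ($\theta_0 = 1$, $\theta_1 = 2$).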


Equations when using stochastic gradient descent:

The cost function to be minimized:

$J(\theta_0,\theta_1) = {1 \over 2} (h_\theta(x_i)-y_i)^2$

Gradients will be:

$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = h_\theta(x_i)-y_i$

$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = (h_\theta(x_i)-y_i) \cdot x_i$
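To see how the per-example gradients relate to the batch formulas, the sketch below computes the stochastic gradient for each point of a small made-up dataset and then averages them. The data and the helper name `sample_gradients` are assumptions for illustration only.

```python
def sample_gradients(x_i, y_i, theta_0, theta_1):
    # Gradient of the single-example cost 1/2 * (h(x_i) - y_i)^2
    residual = (theta_0 + theta_1 * x_i) - y_i
    return residual, residual * x_i

# Illustrative data only (assumed): points lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta_0, theta_1 = 0.5, 1.0

per_sample = [sample_gradients(x, y, theta_0, theta_1) for x, y in zip(xs, ys)]

# Averaging the per-example gradients over all m examples
mean_g0 = sum(g[0] for g in per_sample) / len(per_sample)
mean_g1 = sum(g[1] for g in per_sample) / len(per_sample)

for (x, y), g in zip(zip(xs, ys), per_sample):
    print('x = {}, y = {}, gradient = {}'.format(x, y, g))
print('mean gradient =', (mean_g0, mean_g1))
```

Each individual gradient is a noisy estimate that depends on the chosen example, but their mean over all $m$ examples is the batch gradient $\frac{1}{m} \sum_i$ of the same terms. This is why stochastic updates move toward the same minimum on average.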



In [1]:
import numpy as np  # needed by grad_J below

# The equation of a straight line, where theta_0 is the intercept and theta_1 is the slope.

h = lambda theta_0, theta_1, x: theta_0 + theta_1*x

# The cost function over the whole batch (used later for comparison)
def J(x, y, theta_0, theta_1):
    m = len(x)
    returnValue = 0
    for i in range(m):
        returnValue += (h(theta_0, theta_1, x[i]) - y[i])**2
    returnValue = returnValue/(2*m)
    return returnValue

# Gradient of the per-example cost for a single training example (x, y)
def grad_J(x, y, theta_0, theta_1):
    returnValue = np.array([0., 0.])
    returnValue[0] += (h(theta_0, theta_1, x) - y)
    returnValue[1] += (h(theta_0, theta_1, x) - y)*x
    return returnValue

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

### Adam Optimizer

We can define an Adam optimizer that can be used with various classifiers as well as neural networks.\
It follows the algorithm described in https://arxiv.org/pdf/1412.6980v8.pdf

The hyperparameters, whose default values can be used as-is or tuned, are:\
$\alpha$ (learning rate) = 0.001 (say).\
$\beta_1$ (exponential decay rate for the first moment estimate) = 0.9.\
$\beta_2$ (exponential decay rate for the second moment estimate) = 0.999.\
$\epsilon$ (small constant for numerical stability) = 1e-8

$\theta$  (Initial Parameter Vector)\
$m_0$ = 0  (Initialize 1st Moment Vector)\
$v_0$ = 0  (Initialize 2nd Moment Vector)\
t = 0  (Initialize timestep)

While $\theta$ has not converged, repeat:

$t = t + 1$

Get the gradients of the stochastic objective at timestep $t$:\
$g_t = \nabla_\theta f_t(\theta_{t-1})$

Update biased first moment estimate:\
$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$

Update biased second raw moment estimate:\
$v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$

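Both moment estimates start at zero, so they are biased toward zero during the first steps. We therefore apply bias correction to each of them.

Compute the bias-corrected first moment estimate:

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$

Compute the bias-corrected second raw moment estimate:

$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$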

Update parameters:

$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
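The update steps above can be sketched for a single scalar parameter as follows. The function name `adam_step`, the quadratic test objective $f(\theta) = (\theta - 3)^2$ and the larger learning rate of 0.01 used in the second part are assumptions chosen for illustration. The first part shows a consequence of bias correction: at $t = 1$ the step size is approximately $\alpha$, whatever the scale of the gradient.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; returns the new (theta, m, v, t)."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v, t

# Part 1: the very first step has magnitude close to alpha for any gradient scale
first_steps = []
for grad in (0.01, 1.0, 100.0):
    new_theta, _, _, _ = adam_step(0.0, grad, 0.0, 0.0, 0)
    first_steps.append(new_theta)
    print('gradient = {:>6}, theta after first step = {}'.format(grad, new_theta))

# Part 2: minimize the assumed objective f(theta) = (theta - 3)^2
theta, m, v, t = 0.0, 0.0, 0.0, 0
for _ in range(2000):
    grad = 2 * (theta - 3.0)                    # derivative of (theta - 3)^2
    theta, m, v, t = adam_step(theta, grad, m, v, t, alpha=0.01)
print('theta after 2000 steps =', theta)
```

In the first part every gradient, small or large, moves $\theta$ by about 0.001 in the descent direction. In the second part $\theta$ should end close to the minimizer at 3.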

In [3]:
class AdamOptimizer:
    def __init__(self, weights, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.alpha = alpha
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = 0
        self.v = 0
        self.t = 0
        self.theta = weights
        
    def pass_back(self, gradient):
        self.t = self.t + 1
        # Update First Biased Moment Estimate
        self.m = self.beta1*self.m + (1 - self.beta1)*gradient
        # Update Second Biased Raw Moment Estimate
        self.v = self.beta2*self.v + (1 - self.beta2)*(gradient**2)
        # Bias Correction
        m_hat = self.m/(1 - self.beta1**self.t)
        v_hat = self.v/(1 - self.beta2**self.t)
        # Update Parameters (epsilon is added to the denominator to avoid division by zero)
        self.theta = self.theta - self.alpha*(m_hat/(np.sqrt(v_hat) + self.epsilon))
        return self.theta

#### Some common variables

In [None]:
# initializing
epochs = 15000
print_interval = 1000
m = len(x)

In [None]:
# initial value of theta and cost function, before gradient descent
initial_theta = np.array([0., 0.]) 
initial_cost = J(x, y, initial_theta[0], initial_theta[1])

### Normal Stochastic Gradient Descent

In [None]:
# Learning Rate
alpha = 0.001 
# Copy so the in-place updates below leave initial_theta untouched
theta = initial_theta.copy()

# Record the SGD trajectory (for plotting later)
sgd_history = []
sgd_history.append({'theta': theta.copy(), 'cost': initial_cost})

for j in range(epochs):
    for i in range(m):
        # Calculate the gradient of the cost function using grad_J()
        gradients = grad_J(x[i], y[i], theta[0], theta[1])
        theta -= gradients*alpha
    
    # Record and print the state only once per print interval
    if ((j+1)%print_interval == 0 or j==0):
        cost = J(x, y, theta[0], theta[1])
        print('After {} epochs, Cost = {}, theta = {}'.format(j+1, cost, theta))
        # Store a copy, since theta is modified in place on later iterations
        sgd_history.append({'theta': theta.copy(), 'cost': cost})
        
print('\nFinal theta = {}'.format(theta))

### Stochastic Gradient Descent using Adam

In [None]:
# Start from a fresh copy of the initial parameters
theta = initial_theta.copy()
# Create the AdamOptimizer with its hyperparameters
adam_optimizer = AdamOptimizer(theta, alpha=0.001)

# Record the SGD-with-Adam trajectory (for plotting later)
adam_history = []
adam_history.append({'theta': theta, 'cost': initial_cost})

for j in range(epochs):
    for i in range(m):
        # Calculate the gradient of the cost function, same as for SGD
        gradients = grad_J(x[i], y[i], theta[0], theta[1])
        # Get the updated theta from the Adam optimizer
        theta = adam_optimizer.pass_back(gradients)
    
    # Record and print the state only once per print interval
    if ((j+1)%print_interval == 0 or j==0):
        cost = J(x, y, theta[0], theta[1])
        print ('After {} epochs, Cost = {}, theta = {}'.format(j+1, cost, theta))
        adam_history.append(dict({'theta': theta, 'cost': cost}))
        
print ('\nFinal theta = {}'.format(theta))