# Simple Code Implementation Of Gradient Descent And RMSProp

## Loss Function, Gradient, Starting Values

In [4]:
import numpy as np

Below is a simple loss function, for example purposes only: J(**theta**) = theta_1 \* theta_2 \* ... (the product of all elements). In a real model, the loss J is determined by the true values **y** and the predictions/classifications **y_hat**, which are a function of the predictors **x** (the data) and the parameters **theta** of the prediction model. E.g. loss J = mean(abs(f(**x**, **theta**) - **y**)) would be the Mean Absolute Error. To keep the example simple, we just assume a certain influence of **theta** without explaining how exactly **theta** affects the loss through the model f(**x**, **theta**) = **y_hat**. The given loss function has no minimum (it is unbounded below), but it has a saddle point at (0,0,...). In this example, the goal is to get close to the saddle point (while the real-life goal with more complex loss functions is to get close to a minimum).

In [1]:
# Loss Function J
def J(theta):
    loss = np.prod(theta) #multiply all elements of theta, e.g. theta1*theta2*...
    return loss

In [2]:
# Gradient Of J
def dJ(theta):
    gradient = theta * 0
    for i in range(len(theta)):
        gradient[i] = np.prod(theta[np.arange(len(theta)) != i])
    return gradient
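
Since each partial derivative of J is the product of all the other elements of theta, the analytic gradient can be cross-checked against a numerical one. The following is a minimal sketch of such a check; `numerical_gradient` and its `eps` parameter are hypothetical helpers introduced here for illustration, and `J` and `dJ` are restated compactly so the block runs on its own.

```python
import numpy as np

def J(theta):
    # Product of all elements of theta
    return np.prod(theta)

def dJ(theta):
    # Partial derivative i is the product of all elements except theta[i]
    gradient = np.zeros_like(theta)
    for i in range(len(theta)):
        gradient[i] = np.prod(np.delete(theta, i))
    return gradient

def numerical_gradient(J, theta, eps=1e-6):
    # Central-difference approximation (hypothetical helper, not part of the notebook)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

theta = np.array([2., 3., 4.])
analytic = dJ(theta)                    # partial derivatives: 12, 8, 6
numeric = numerical_gradient(J, theta)
print(analytic)
print(np.allclose(analytic, numeric))   # the two gradients should agree
```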

In [5]:
# Setting The Starting Values Of Theta, Getting Loss And Gradient At Theta_Start
theta_start = np.array([2., 2.])
print(J(theta_start))
print(dJ(theta_start))

4.0
[2. 2.]


## Algorithms: Gradient Descent And RMSProp

In [6]:
def gradient_descent(J, dJ, theta_start, steps = 100, lr = 0.001):
    theta = np.copy(theta_start)
    loss = []
    for _ in range(steps):
        gradient = dJ(theta)
        theta -= lr * gradient
        loss.append(J(theta))

    return theta, loss

In [7]:
def rmsprop(J, dJ, theta_start, steps = 100, lr = 0.001, decay = .9, delta = 1e-6):
    theta = np.copy(theta_start)
    loss = []
    gradient_mean_sqr = np.zeros(theta.shape, dtype=float) # vector full of zeros

    for _ in range(steps):
        gradient = dJ(theta)
        gradient_mean_sqr = decay * gradient_mean_sqr + (1 - decay) * gradient ** 2
        # ** 2 squares each element
        theta -= lr * gradient / (np.sqrt(gradient_mean_sqr) + delta)
        # np.sqrt takes the elementwise square root
        # / means elementwise division
        loss.append(J(theta))
    
    return theta, loss

"gradient_mean_sqr" is never negative, but it might have values equal or close to zero. "delta" is only there to stabilize the division by the square root of "gradient_mean_sqr".
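
To illustrate why the stabilizing term matters, here is a small sketch of a single update direction with and without it. The gradient values are made up for the illustration, and the factor 0.1 is approximately the accumulator weight in the first step with decay = .9.

```python
import numpy as np

gradient = np.array([0.0, 2.0])          # assumed example: one component is exactly zero
gradient_mean_sqr = 0.1 * gradient ** 2  # accumulator after the first step (decay = .9)

# Without delta, the zero component leads to 0 / 0, which NumPy evaluates to nan
with np.errstate(divide="ignore", invalid="ignore"):
    step_without_delta = gradient / np.sqrt(gradient_mean_sqr)

# With delta, the denominator is strictly positive and the step stays finite
step_with_delta = gradient / (np.sqrt(gradient_mean_sqr) + 1e-6)

print(step_without_delta)
print(step_with_delta)
```

A single nan in theta would propagate through every later iteration, so the small constant is cheap insurance.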

## Usage And Comparison

In [8]:
theta_opt1, loss1 = gradient_descent(J, dJ, theta_start, steps = 5, lr = .5)
print(loss1)
print(theta_opt1)

[1.0, 0.25, 0.0625, 0.015625, 0.00390625]
[0.0625 0.0625]
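
The Gradient Descent numbers above can be reproduced by hand. With the symmetric start theta = (t, t), the gradient is also (t, t), so each step maps t to (1 - lr) \* t and the loss after k steps is (t_0 \* (1 - lr)^k)^2. The sketch below evaluates this closed form; it only holds for this two-element symmetric start, and the names `t0` and `closed_form_loss` are introduced here for illustration.

```python
lr, t0 = 0.5, 2.0

# Loss after k = 1..5 steps: both elements equal t0 * (1 - lr)**k
closed_form_loss = [(t0 * (1 - lr) ** k) ** 2 for k in range(1, 6)]
print(closed_form_loss)  # [1.0, 0.25, 0.0625, 0.015625, 0.00390625]
```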


In [9]:
theta_opt2, loss2 = rmsprop(J, dJ, theta_start, steps = 5, lr = .5)
print(loss2)
print(theta_opt2)

[0.17544677397202935, 0.006086804167495761, 0.00012448839791948842, 1.1634232222190242e-06, 2.6593314747211394e-09]
[5.15687064e-05 5.15687064e-05]


In both cases, 5 iterations were run. The loss gets smaller with each iteration. After these 5 steps, RMSProp reached a point closer to the saddle point (0,0) than Gradient Descent did.
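
One way to make "closer to the saddle point" concrete is to compare the Euclidean distance of the final theta from the origin. The following is a sketch under that assumption; the two optimizers are restated in a compact form (without loss tracking) so the block runs on its own, and `dist_gd` and `dist_rms` are names introduced here.

```python
import numpy as np

def dJ(theta):
    # Gradient of the product loss: product of all elements except the i-th
    return np.array([np.prod(np.delete(theta, i)) for i in range(len(theta))])

def gradient_descent(dJ, theta_start, steps, lr):
    theta = np.copy(theta_start)
    for _ in range(steps):
        theta -= lr * dJ(theta)
    return theta

def rmsprop(dJ, theta_start, steps, lr, decay=.9, delta=1e-6):
    theta = np.copy(theta_start)
    gradient_mean_sqr = np.zeros_like(theta)
    for _ in range(steps):
        gradient = dJ(theta)
        gradient_mean_sqr = decay * gradient_mean_sqr + (1 - decay) * gradient ** 2
        theta -= lr * gradient / (np.sqrt(gradient_mean_sqr) + delta)
    return theta

theta_start = np.array([2., 2.])

# Distance of the final parameters from the saddle point at the origin
dist_gd = np.linalg.norm(gradient_descent(dJ, theta_start, steps=5, lr=.5))
dist_rms = np.linalg.norm(rmsprop(dJ, theta_start, steps=5, lr=.5))
print(dist_gd, dist_rms)
```

This comparison covers only this starting point, step count, and learning rate; other settings may rank the two methods differently.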