Gradient validation and finite differences
=======================

In [1]:
import Orange
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline

Loading the data
---------------

In [2]:
data = Orange.data.Table("data/age-sbp.tab")

In [3]:
x, y = data.X, data.Y
X = np.column_stack((np.ones(len(x)), x))

Analytical gradient, gradient descent and cost function for linear regression
----------------

In [4]:
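def grad(X, y, theta):
    """Analytical gradient of the cost function J with respect to theta."""
    return (X.dot(theta) - y).dot(X)

def gradient_descent(X, y, alpha=0.0001, epochs=1000):
    """For a matrix X and vector y, return the parameters theta of a linear regression model."""
    theta = np.zeros(X.shape[1])
    for i in range(epochs):
        # step against the gradient, averaged over the number of examples
        theta = theta - alpha * grad(X, y, theta) / len(X)
    return theta

def J(X, y, theta):
    """Cost function: half the sum of squared residuals."""
    return 0.5 * sum((X.dot(theta) - y)**2)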

In [5]:
theta = gradient_descent(X, y, alpha=0.0005, epochs=100000)
print(theta)

[ 98.04582353   0.98421061]


Gradient checking through comparison with finite differences
--

Here we check that our derivation of the gradient of the cost function J and its Python implementation are correct. As an alternative to the analytical gradient, we can compute the gradient with the method of [finite differences](https://en.wikipedia.org/wiki/Finite_difference) and then compare the two results. For this to work, we need a function that computes the cost function J (defined above).
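
For reference, the central-difference formula that a check of this kind typically relies on can be written as follows. The symbols $e_i$ (the $i$-th unit vector) and $\epsilon$ (the step size) are notation introduced for this sketch, and $J$ is assumed to be the cost function defined above.

$$
\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}
$$

For a smooth $J$, the error of this approximation shrinks proportionally to $\epsilon^2$ as $\epsilon$ decreases, until floating-point round-off starts to dominate.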

In [6]:
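def grad_approx(X, y, theta, e=1e-1):
    """Approximate the gradient of J by central finite differences with step e."""
    # each row of the scaled identity matrix perturbs a single component of theta
    return np.array([(J(X, y, theta+eps) - J(X, y, theta-eps))/(2*e)
                     for eps in np.identity(len(theta)) * e])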

In [9]:
theta = np.array([-10, -42])
print(grad(X, y, theta))
print(grad_approx(X, y, theta))

[  -61444. -3064664.]
[  -61443.99999991 -3064663.99999999]
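
Comparing two printed vectors by eye gets harder as the number of parameters grows, so one common option is to reduce the comparison to a single relative error. The sketch below is only an illustration: `relative_error`, `X_syn`, `y_syn` and `theta_test` are names made up here, and the data is synthetic (randomly generated), not the `age-sbp` table.

```python
import numpy as np

def J(X, y, theta):
    """Cost function: half the sum of squared residuals (same as above)."""
    return 0.5 * np.sum((X.dot(theta) - y) ** 2)

def grad(X, y, theta):
    """Analytical gradient of J with respect to theta (same as above)."""
    return (X.dot(theta) - y).dot(X)

def grad_approx(X, y, theta, e=1e-1):
    """Central finite-difference approximation of the gradient of J."""
    return np.array([(J(X, y, theta + eps) - J(X, y, theta - eps)) / (2 * e)
                     for eps in np.identity(len(theta)) * e])

def relative_error(a, b):
    """Hypothetical helper: norm of the difference scaled by the vector norms."""
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

# Synthetic stand-in data (an assumption, not the age-sbp data set).
rng = np.random.default_rng(0)
x_syn = rng.uniform(20, 70, size=30)
y_syn = 98 + x_syn + rng.normal(0, 5, size=30)
X_syn = np.column_stack((np.ones(len(x_syn)), x_syn))

theta_test = np.array([-10.0, -42.0])
err = relative_error(grad(X_syn, y_syn, theta_test),
                     grad_approx(X_syn, y_syn, theta_test))
print(err)  # a value close to zero suggests the two gradients agree
```

A relative error many orders of magnitude below 1 is usually taken as a sign that the analytical gradient is implemented correctly; what counts as "small enough" depends on the cost function and the step size.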


This works fine. Great! We could use this way of computing the gradient in the first place, and avoid all the "unnecessary" mathematics for deriving the analytical gradient. Right? Wrong! Computing the gradient through finite differences is much slower. Consider how many times we need to evaluate the cost function: twice for every parameter in theta, each time we need a gradient. Also, the finite-difference approximation depends on the parameter epsilon (`e` in the code above), which in general we do not know how to set appropriately for a given J and theta. If there exists an analytical expression for the gradient, we should use it.
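
Both drawbacks can be illustrated with a small self-contained sketch. Everything in it is an assumption made for illustration: `central_diff_gradient`, `counted_cost` and `exp_cost` are hypothetical helpers, and the toy functions stand in for J. The exponential is used in the second part because, for a quadratic cost such as J, central differences are exact up to round-off and would hide the effect of the step size.

```python
import numpy as np

def central_diff_gradient(f, theta, e):
    """Central-difference gradient of a scalar function f at theta."""
    return np.array([(f(theta + step) - f(theta - step)) / (2 * e)
                     for step in np.identity(len(theta)) * e])

# 1) Cost: count how often the cost function is evaluated for one gradient.
calls = {"n": 0}

def counted_cost(theta):
    """Toy quadratic cost that records every call."""
    calls["n"] += 1
    return 0.5 * np.sum(theta ** 2)

n_params = 50
central_diff_gradient(counted_cost, np.ones(n_params), e=1e-4)
print(calls["n"])  # 100, that is 2 * n_params evaluations for a single gradient

# 2) Step size: error of the approximation for a non-quadratic function.
def exp_cost(theta):
    """Toy cost exp(theta[0]); its exact derivative is exp(theta[0])."""
    return np.exp(theta[0])

theta0 = np.array([1.0])
exact = np.exp(1.0)
errors = {e: abs(central_diff_gradient(exp_cost, theta0, e)[0] - exact)
          for e in (1e-1, 1e-5, 1e-13)}
for e, err in errors.items():
    print(e, err)
```

In this toy setting the middle step size gives the smallest error: a large step suffers from truncation error, while a very small one suffers from floating-point cancellation. Which step size is best depends on the function and the point of evaluation, which is the difficulty described above.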