 #### Gradient Descent minimizes the cost function iteratively
 To implement gradient descent, we need to compute the gradient of the cost function with respect to each model parameter $\theta_j$.
 Essentially, we need to calculate how much the cost function changes for a small change in $\theta_j$ (the _partial derivative_).
The partial derivative is denoted as: ${\partial\over{\partial\theta_j}} MSE(\theta)$
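
To make "how much the cost changes for a small change in $\theta_j$" concrete, here is a minimal sketch that estimates the partial derivative numerically with a central difference. The helper names (`mse`, `numeric_partial`) and the three-point dataset are hypothetical and chosen only for illustration; the first column of 1s is assumed to be the bias feature.

```python
import numpy as np

def mse(theta, X_b, y):
    """Mean squared error of the linear model X_b @ theta against y."""
    residuals = X_b @ theta - y
    return float(np.mean(residuals ** 2))

def numeric_partial(theta, X_b, y, j, eps=1e-6):
    """Central-difference estimate of the partial derivative of MSE w.r.t. theta_j."""
    step = np.zeros_like(theta)
    step[j] = eps  # nudge only parameter j
    return (mse(theta + step, X_b, y) - mse(theta - step, X_b, y)) / (2 * eps)

# Hypothetical toy data: a bias column of 1s plus one feature.
X_b_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([[1.0], [3.0], [5.0]])
theta_demo = np.zeros((2, 1))

partial_0 = numeric_partial(theta_demo, X_b_demo, y_demo, j=0)
partial_1 = numeric_partial(theta_demo, X_b_demo, y_demo, j=1)
print(partial_0, partial_1)
```

Both estimates are negative here, meaning that increasing either parameter would lower the cost from this starting point.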


#### Partial derivative of the cost function with respect to parameter $\theta_j$
##### ${{\partial\over{\partial\theta_j}}MSE(\theta)} = {{2 \over m} \sum^m_{i=1}{\biggl( \theta^T \cdot \mathbf x^{(i)} -y^{(i)} \biggr) x^{(i)}_j} }$

Notation
* $m$ = Number of instances in the dataset
* $\mathbf x^{(i)}$ = vector of all feature values (excluding the label) of the $i^{th}$ instance; $x^{(i)}_j$ is its $j^{th}$ feature value
* $y^{(i)}$ = label of the $i^{th}$ instance
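
The sum in the formula can be written out as an explicit loop over instances. This is only a sketch to mirror the notation, not an efficient implementation. The function name `partial_mse` and the toy data are hypothetical, and the sketch assumes each feature vector starts with a constant 1 so that $\theta_0$ acts as the bias term.

```python
import numpy as np

def partial_mse(theta, X_b, y, j):
    """Partial derivative of MSE w.r.t. theta_j, computed as the explicit sum over instances."""
    m = X_b.shape[0]                 # number of instances
    total = 0.0
    for i in range(m):
        x_i = X_b[i]                 # feature vector of instance i (leading 1 assumed)
        prediction = theta.ravel() @ x_i
        total += (prediction - y[i, 0]) * x_i[j]
    return 2.0 / m * total

# Hypothetical toy data: a bias column of 1s plus one feature.
X_b_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([[1.0], [3.0], [5.0]])
theta_demo = np.zeros((2, 1))

d_theta0 = partial_mse(theta_demo, X_b_demo, y_demo, j=0)
d_theta1 = partial_mse(theta_demo, X_b_demo, y_demo, j=1)
print(d_theta0, d_theta1)
```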

#### Batch Gradient Descent
Rather than computing each partial derivative individually, we can compute all of them for the whole dataset in one go. The result is the gradient vector $\nabla_{\theta}MSE(\theta)$, which contains the partial derivative of the cost function with respect to each model parameter.

Over a full training set the formula would be:

${\nabla_{\theta}MSE(\theta)}={\left(
                                  \begin{array}{c} 
                                    {{\partial\over{\partial\theta_0}}MSE(\theta)}\\
                                    {{\partial\over{\partial\theta_1}}MSE(\theta)}\\
                                    {{\partial\over{\partial\theta_2}}MSE(\theta)}\\
                                    {\vdots}\\
                                    {{\partial\over{\partial\theta_n}}MSE(\theta)}\\
                                  \end{array} 
                                \right)} = {2\over m}{\mathbf X^T \cdot (\mathbf X \cdot \theta - \mathbf y)}$
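
As a quick sanity check on the matrix form, the sketch below computes the gradient vector in one expression and compares it with the per-parameter sums. The function name `batch_gradient`, the seed, and the randomly generated data are assumptions made for illustration.

```python
import numpy as np

def batch_gradient(theta, X_b, y):
    """Gradient vector of MSE for all parameters at once: (2/m) * X^T (X theta - y)."""
    m = X_b.shape[0]
    return 2.0 / m * X_b.T @ (X_b @ theta - y)

rng = np.random.default_rng(0)  # hypothetical seed, only for reproducibility
X_b_demo = np.c_[np.ones((5, 1)), rng.random((5, 2))]  # bias column plus two features
y_demo = rng.random((5, 1))
theta_demo = rng.standard_normal((3, 1))

vectorized = batch_gradient(theta_demo, X_b_demo, y_demo)

# The same quantities, one parameter at a time, following the summation formula.
errors = (X_b_demo @ theta_demo - y_demo)[:, 0]
per_parameter = np.array(
    [[2.0 / 5 * np.sum(errors * X_b_demo[:, j])] for j in range(3)]
)

print(np.allclose(vectorized, per_parameter))
```

The vectorized result has one row per model parameter, matching the column vector in the formula.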

The gradient vector points uphill, so to go downhill (the descent) we move in the opposite direction by subtracting ${\nabla_{\theta}MSE(\theta)}$ from $\theta$. To determine the size of the downhill step, we multiply the gradient vector by the learning rate $\eta$.
This is represented by the equation:
###  ${\theta^{(\text{next step})} =  \theta - \eta {\nabla_{\theta}MSE(\theta)}}$
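
A single update step can be worked through on a tiny example. The sketch below uses hypothetical toy data and an assumed learning rate of 0.1. It applies the update once and checks that the cost goes down.

```python
import numpy as np

def mse(theta, X_b, y):
    """Mean squared error of the linear model X_b @ theta against y."""
    return float(np.mean((X_b @ theta - y) ** 2))

# Hypothetical toy data: a bias column of 1s plus one feature.
X_b_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([[1.0], [3.0], [5.0]])

theta = np.zeros((2, 1))
eta = 0.1  # learning rate (assumed value for this sketch)
m = len(X_b_demo)

# Gradient at theta = 0 is approximately [[-6.0], [-8.667]].
gradient = 2.0 / m * X_b_demo.T @ (X_b_demo @ theta - y_demo)

# One descent step: move against the gradient, scaled by eta.
theta_next = theta - eta * gradient

mse_before = mse(theta, X_b_demo, y_demo)
mse_after = mse(theta_next, X_b_demo, y_demo)
print(theta_next.ravel(), mse_before, mse_after)
```

Because the gradient is negative in both components, subtracting it increases both parameters, and the cost after the step is lower than before.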


In [1]:
# A simple Gradient Descent implementation

#Prep stuff
import numpy as np
import numpy.random as rnd
import os
%matplotlib inline
import matplotlib 
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

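# Generate random, roughly linear data: y = 4 + 3x + Gaussian noise
X = 2 * rnd.rand(100, 1)
y = 4 + 3 * X + rnd.randn(100, 1)

import numpy.linalg as LA
x = np.ones((100, 1))  # column of 1s (x0 = 1) so that theta[0] acts as the bias term
X_b = np.c_[x, X]  # prepend the column of 1s to the feature matrix
# Normal Equation solution, kept for comparison:
#theta_best = LA.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)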


In [4]:

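eta = 0.1          # learning rate
iterations = 1000  # number of gradient descent steps
m = len(X_b)       # number of training instances (100)

# random initialization
theta = np.random.randn(2,1)

for iteration in range(iterations):
    # gradient of the MSE over the full training set
    gradients = 2.0/m * X_b.T.dot(X_b.dot(theta)-y)
    # step downhill, scaled by the learning rate
    theta = theta - eta * gradients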


In [5]:
theta

array([[ 4.20823178],
       [ 2.75650503]])

With this learning rate and number of iterations, gradient descent converges to the same $\theta$ as the Normal Equation.
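
The comparison with the Normal Equation can be checked directly. The sketch below regenerates similar data with a fixed seed (an assumption, so the numbers will differ from the output above), solves the Normal Equation, runs the same gradient descent loop, and measures the largest gap between the two parameter vectors.

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed, only for reproducibility
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # prepend the bias column of 1s

# Closed-form solution via the Normal Equation.
theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Batch gradient descent with the same settings as the cell above.
eta, iterations, m = 0.1, 1000, len(X_b)
theta_gd = rng.standard_normal((2, 1))
for _ in range(iterations):
    gradients = 2.0 / m * X_b.T.dot(X_b.dot(theta_gd) - y)
    theta_gd = theta_gd - eta * gradients

# Largest absolute difference between the two solutions.
max_gap = float(np.max(np.abs(theta_gd - theta_normal)))
print(theta_normal.ravel(), theta_gd.ravel(), max_gap)
```

Neither solution recovers exactly 4 and 3, because the noise in `y` shifts the best-fit line. What matters is that the two methods agree with each other.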