In [None]:
import numpy as np
import pandas as pd
from IPython import display

pd.set_option('display.max_columns',500)

# Gradient Calculation

So let's get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,

$\sigma'(x) = \sigma(x) (1-\sigma(x))$

To see why, write $\sigma(x) = \frac{1}{1+e^{-x}}$ and differentiate using the quotient rule or, equivalently, the **chain rule**:

In [None]:
display.Image('./images/derivative-sigmoid.gif')

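The chain rule says to take the derivative of the outer function (here $u^{-1}$, with derivative $-u^{-2}$) and multiply it by the derivative of the inner function (here $u = 1+e^{-x}$, with derivative $-e^{-x}$). This gives $\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2}$, which simplifies to $\sigma(x)(1-\sigma(x))$.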
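As a quick sanity check, the closed-form derivative can be compared against a central finite difference. This is only a sketch: the helper names `sigmoid` and `sigmoid_prime`, the sample points, and the step size `h` are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Sample points and step size are arbitrary choices for this check.
xs = np.linspace(-5.0, 5.0, 11)
h = 1e-6

# Central difference approximation of the derivative at each sample point.
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)

# Largest disagreement between the numerical and closed-form derivatives.
max_abs_diff = np.max(np.abs(numeric - sigmoid_prime(xs)))
print(max_abs_diff)
```

The printed difference should be tiny, limited only by floating-point error in the finite difference.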

And now, let's recall that if we have $m$ points labelled $x^{(1)}$, $x^{(2)}$, $\ldots$, $x^{(m)}$, the error formula is:

$E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)$
where the prediction is given by $\hat{y_i} = \sigma(Wx^{(i)} + b)$. 
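The error formula can be written in a few lines of NumPy. The sketch below assumes the points are stacked as rows of a matrix `X`, the labels form a vector `y`, and `W` is a weight vector; the function name `cross_entropy_error` and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy_error(X, y, W, b):
    """Average error over the m points stored as rows of X."""
    y_hat = sigmoid(X @ W + b)  # one prediction per point
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical toy data: m = 3 points with n = 2 coordinates each.
X = np.array([[1.0, 2.0],
              [-1.0, 0.5],
              [0.0, -1.5]])
y = np.array([1.0, 0.0, 0.0])

# With zero weights and bias, every prediction is sigmoid(0) = 0.5.
W = np.zeros(2)
b = 0.0

error = cross_entropy_error(X, y, W, b)
print(error)
```

With all predictions equal to 0.5, each point contributes $\ln 2$, so the average error is $\ln 2$ regardless of the labels.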

Our goal is to calculate the gradient of $E$ at a point $x = (x_1, \ldots, x_n)$, given by the partial derivatives

$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$

To simplify our calculations, we'll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,


$E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})$

In order to calculate the derivative of this error with respect to the weights, we'll first calculate $\frac{\partial}{\partial w_j} \hat{y}$. Recall that $\hat{y} = \sigma(Wx+b)$, so:

In [None]:
display.Image('./images/derivative-sigmoid2.gif')

The last equality is because the only term in the sum which is not a constant with respect to $w_j$ is precisely $w_j x_j$, which clearly has derivative $x_j$.
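Combining the sigmoid derivative with the chain rule, this step gives $\frac{\partial}{\partial w_j} \hat{y} = \hat{y}(1-\hat{y}) x_j$. A minimal numerical check of that expression, assuming a single point and made-up values for `W`, `x`, and `b` (the helper `predict` is also hypothetical):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def predict(W, x, b):
    """Prediction y_hat = sigmoid(Wx + b) for a single point x."""
    return sigmoid(np.dot(W, x) + b)

# Hypothetical weights, point, and bias, chosen only for illustration.
W = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, -0.5])
b = 0.1

# Analytic partial derivatives: y_hat * (1 - y_hat) * x_j for each j.
y_hat = predict(W, x, b)
analytic = y_hat * (1.0 - y_hat) * x

# Central differences: perturb one weight at a time.
h = 1e-6
numeric = np.zeros_like(W)
for j in range(len(W)):
    step = np.zeros_like(W)
    step[j] = h
    numeric[j] = (predict(W + step, x, b) - predict(W - step, x, b)) / (2 * h)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```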

Now, we can go ahead and calculate the derivative of the error $E$ at a point $x$, with respect to the weight $w_j$.

In [None]:
display.Image('./images/derivative-sigmoid3.png')

A similar calculation for the bias $b$ will show us that

In [None]:
display.Image('./images/derivative-sigmoid4.png')

This actually tells us something very important. For a point with coordinates $(x_1, \ldots, x_n)$, label $y$, and prediction $\hat{y}$, the gradient of the error function at that point is $\left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right)$. In summary, the gradient is

$\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$.
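The closed-form gradient can be checked against finite differences of the per-point error. This is a sketch under stated assumptions: the function names `point_error` and `point_gradient` and the values of `W`, `b`, `x`, and `y` are invented for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def point_error(W, b, x, y):
    """Error produced by a single point: -y ln(y_hat) - (1 - y) ln(1 - y_hat)."""
    y_hat = sigmoid(np.dot(W, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def point_gradient(W, b, x, y):
    """Closed-form gradient -(y - y_hat) * (x_1, ..., x_n, 1)."""
    y_hat = sigmoid(np.dot(W, x) + b)
    return -(y - y_hat) * np.append(x, 1.0)

# Hypothetical parameters and a single labelled point.
W = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, -0.5])
y = 1.0

grad = point_gradient(W, b, x, y)

# Central differences over the stacked parameters (w_1, ..., w_n, b).
params = np.append(W, b)
h = 1e-6
numeric = np.zeros_like(params)
for j in range(len(params)):
    step = np.zeros_like(params)
    step[j] = h
    plus, minus = params + step, params - step
    numeric[j] = (point_error(plus[:-1], plus[-1], x, y)
                  - point_error(minus[:-1], minus[-1], x, y)) / (2 * h)

max_abs_diff = np.max(np.abs(numeric - grad))
print(grad)
print(max_abs_diff)
```

The last entry of `grad` is the partial derivative with respect to the bias, which matches the trailing 1 in $(x_1, \ldots, x_n, 1)$.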

If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have?