In [1]:
import numpy as np
import pandas as pd


Defining the Sigmoid function

\begin{align}
\sigma{(x)} & = \frac{1}{1+e^{-x}}
\end{align}

In [2]:
def sigmoid(x):
    # Parentheses are required: 1/1+np.exp(-x) would evaluate as 1 + e^(-x)
    return 1 / (1 + np.exp(-x))

Taking the derivative of the sigmoid function, we get
\begin{equation*}
\frac{\partial }{\partial x}  \sigma{(x)}  = {\sigma{(x)}} * (1-\sigma{(x)})
\end{equation*}

In [3]:
def sigmoid_prime(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))
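
As a quick sanity check on the derivative formula, the sketch below compares it against a central finite difference. The step size `eps` and the sample points `xs` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, sigma(x) * (1 - sigma(x))."""
    return sigmoid(x) * (1 - sigmoid(x))

# Central finite difference as an independent estimate of the derivative
xs = np.linspace(-5, 5, 11)   # illustrative sample points
eps = 1e-6                    # illustrative step size
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)

# True when the analytic and numeric derivatives agree
print(np.allclose(numeric, sigmoid_prime(xs), atol=1e-8))
```

If the parentheses in either function are dropped, this comparison fails, which makes it a cheap guard against operator-precedence mistakes.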

In [4]:
# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

In [5]:
learnrate = 0.5

In [6]:
# the linear combination performed by the node (h in f(h) and f'(h))
h = x[0]*weights[0] + x[1]*weights[1]

The output of the neural network is then

In [7]:
nn_output = sigmoid(h)

 \begin{align}
 Error = y - \hat{y}
 \end{align}

Here,
\begin{equation*}
\hat{y}
\end{equation*}
is the network output, nn_output.

In [8]:
error = y - nn_output

The gradient of the output is found by passing the node input (*h*) through the derivative of the sigmoid

In [9]:
output_grad = sigmoid_prime(h)

 \begin{align}
 \delta = Error * \mathrm{output\_grad}
 \end{align}

In [10]:
error_term = error*output_grad

Now let us perform the gradient descent step

In [11]:
del_w = [ learnrate * error_term * x[0],
          learnrate * error_term * x[1]]

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Now let us implement the same step for an input with four values

In [12]:
learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)
# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

In [13]:
h = np.dot(x,w)

In [14]:
nn_output=sigmoid(h)

In [15]:
error = y - nn_output

In [16]:
output_grad = sigmoid_prime(h)


In [17]:
error_term = error * output_grad


In [18]:
del_w = learnrate * error_term * x


In [19]:
print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]
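
The cells above can be folded into a single function. This is a minimal sketch: `gradient_step` is a hypothetical helper name that does not appear in the original notebook, and it reuses the sigmoid output to compute the derivative instead of calling `sigmoid_prime`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_step(x, y, w, learnrate):
    """Hypothetical helper: one gradient descent step for a single sigmoid unit."""
    h = np.dot(x, w)                                   # linear combination
    nn_output = sigmoid(h)                             # network output
    error = y - nn_output                              # y - y_hat
    error_term = error * nn_output * (1 - nn_output)   # error * f'(h)
    return nn_output, error, learnrate * error_term * x

x = np.array([1, 2, 3, 4])
w = np.array([0.5, -0.5, 0.3, 0.1])
nn_output, error, del_w = gradient_step(x, 0.5, w, learnrate=0.5)
print(nn_output, error, del_w)
```

Because the function relies on `np.dot`, the same code handles the two-input example from earlier without modification.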


## Implementing Gradient Descent on an Actual Dataset

So far we have seen how to update the weights for a single iteration, but in practice we need to repeat the update over multiple epochs.
Let us use this [http://www.ats.ucla.edu/stat/data/binary.cs](https://stats.idre.ucla.edu/stat/data/binary.csv) dataset as an example.

The dataset has three input features: GRE score, GPA, and the rank of the undergraduate school.
The goal is to predict whether a student is admitted to the university given these features. We again use the sigmoid function as the output activation.

## Data Cleanup

If you look at the dataset, you will notice that the rank column is stored as a number but is not one-hot encoded.
Rank is a category, not a quantity: rank 2 is not twice rank 1, and rank 3 is not three times rank 1.

To handle this, we use dummy variables: we split rank into four columns and use 0 or 1 in each column to indicate the rank.
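
To make the encoding concrete, here is a small sketch on a made-up frame. Only the column name `rank` matches the dataset; the values in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; only the 'rank' column name matches the dataset
toy = pd.DataFrame({'rank': [1, 3, 2, 4, 3]})

# One indicator column per distinct rank value
dummies = pd.get_dummies(toy['rank'], prefix='rank')

# Cast to int so the indicators display as 0/1 regardless of pandas version
print(dummies.astype(int))
```

Each row has exactly one indicator set, so no artificial ordering or distance between ranks leaks into the model.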

First, let's read the data using pandas

In [20]:
admissions = pd.read_csv('binary.csv')


Now we can encode the rank column

In [21]:
# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

We now have to standardize gre and gpa, since their raw values are fairly large.
We are using the sigmoid activation function, which squashes very large and very small inputs.
For such inputs the gradient of the sigmoid is close to 0, so the gradient descent step shrinks to zero and training stalls.
We therefore scale the values so that they have a mean of 0 and a standard deviation of 1.
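
The effect can be seen numerically. In this sketch the GRE-like scores and the weight are hypothetical values chosen for illustration, not taken from the dataset.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

# Hypothetical GRE-like scores and an illustrative weight (not from the dataset)
gre = np.array([380.0, 520.0, 640.0, 660.0, 800.0])
weight = 0.1

# Standardize to zero mean and unit (sample) standard deviation
gre_scaled = (gre - gre.mean()) / gre.std(ddof=1)

grad_raw = sigmoid_prime(weight * gre)            # node inputs of 38 to 80
grad_scaled = sigmoid_prime(weight * gre_scaled)  # node inputs well inside (-1, 1)

print(grad_raw.max())     # effectively zero: the sigmoid is saturated
print(grad_scaled.min())  # close to the maximum slope of 0.25
```

`ddof=1` is used so that the result matches the sample standard deviation that pandas computes by default.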


In [22]:
pd.__version__

'0.25.1'

In [None]:
# Standardize features to zero mean and unit standard deviation
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data[field] = (data[field] - mean) / std

# Split off a random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
# .loc replaces the removed .ix indexer; sample holds index labels
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

## Mean Square Error

With a large dataset, the sum of squared errors (SSE) adds up the weight steps from every record, so the updates can become very large and training becomes unstable. We would have to compensate by using a very small learning rate. Instead, we can use the mean squared error (MSE), which divides by the number of records; the learning rate can then stay in the range of 0.01 to 0.001.

 \begin{align}
 Error = \frac{1}{2m} & \sum_{\mu} (y^{\mu} - \hat{y}^{\mu})^2
 \end{align}
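
The error formula translates directly into NumPy. This is a minimal sketch: the function name `mse` and the sample targets and predictions are illustrative, not from the dataset.

```python
import numpy as np

def mse(targets, predictions):
    """Error = 1/(2m) * sum over records of (y - y_hat)^2."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    m = targets.size
    return np.sum((targets - predictions) ** 2) / (2 * m)

# Illustrative targets and predictions
y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.8, 0.2, 0.6, 0.4])
print(mse(y, y_hat))  # squared errors sum to 0.4, divided by 2 * 4
```

The factor of 1/2 is a convenience: it cancels the 2 that appears when the squared error is differentiated.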

The steps we are going to follow:

1. Set the weight step to zero \begin{equation*} \Delta w_i = 0 \end{equation*}
2. For each record in the training data:

    * Calculate the output \begin{equation*} \hat{y} = f(\sum_i w_i x_i) \end{equation*}
    * Calculate the error term for the output \begin{equation*} \delta = (y - \hat{y}) * f'(\sum_i w_i x_i) \end{equation*}
    * Update the weight step \begin{equation*} \Delta w_i = \Delta w_i + \delta x_i \end{equation*}

3. Update the weights \begin{equation*} w_i = w_i + \eta \Delta w_i / m \end{equation*} Here η is the learning rate and m is the number of records. (Dividing by m averages the weight steps, which avoids large swings caused by the size of the training data.)
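
The steps above can be sketched as a training loop. Since the admissions file is not loaded here, the sketch runs on synthetic stand-in data; `true_w`, the data shape, the number of epochs, and the weight initialization are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Synthetic stand-in data (an assumption, not the admissions dataset)
rng = np.random.default_rng(42)
n_records, n_features = 200, 3
features = rng.normal(size=(n_records, n_features))
true_w = np.array([1.5, -2.0, 0.5])          # hypothetical generating weights
targets = (sigmoid(features @ true_w) > 0.5).astype(float)

def loss(w):
    """Half mean squared error over all records."""
    return np.mean((targets - sigmoid(features @ w)) ** 2) / 2

weights = rng.normal(scale=1 / n_features ** 0.5, size=n_features)
epochs, learnrate = 200, 0.5
initial_loss = loss(weights)

for epoch in range(epochs):
    del_w = np.zeros(weights.shape)               # step 1: reset the weight step
    for x, y in zip(features, targets):           # step 2: loop over records
        output = sigmoid(np.dot(x, weights))      # y_hat = f(sum_i w_i x_i)
        error_term = (y - output) * output * (1 - output)  # delta
        del_w += error_term * x                   # accumulate delta * x_i
    weights += learnrate * del_w / n_records      # step 3: averaged update

final_loss = loss(weights)
print(initial_loss, final_loss)
```

The inner loop mirrors the list one-to-one for readability; it could be replaced by a single matrix expression once the logic is clear.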