# 2-Layer Neural Network from Scratch
https://youtu.be/vcZub77WvFA?list=PL2-dafEMk2A5BoX3KyKu6ti5_Pytp91sk

In [1]:
import numpy as np
import time

Define <b>variables</b>

In [2]:
# Now let us define our variables
n_hidden = 10 # number of hidden neurons
n_in = 10

n_out = 10 # number of outputs
n_sample = 300


Hyperparameters

In [3]:
# Hyperparameters
learning_rate = 0.01
momentum = 0.9

np.random.seed(0) # fix the seed so that every run generates the same random numbers

Define our <b>activation function</b> 
https://youtu.be/-7scQpJT7uo


Activation functions add nonlinearity to our network. A <b>linear function</b> is a polynomial of degree one, e.g. <b>y = x or y = 2x</b>, and its graph is always a straight line. Any function that cannot be written this way is <b>non-linear</b>. Linear equations are easy to solve, but they are limited in the complexity they can express, and stacking several linear layers still gives a linear function. Neural networks are known as <b>universal function approximators</b>: any process you can imagine can be thought of as a function computation, so we need a way to compute not just linear functions but non-linear ones as well.
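
To make the point about linearity concrete, here is a minimal sketch. The weight matrices `W1` and `W2` and the input `x` are made-up illustrations, not part of the network defined later. It checks that two linear layers without an activation collapse into a single linear layer, and that putting tanh between them breaks that shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # one hypothetical input vector
W1 = rng.normal(size=(4, 5))    # hypothetical first-layer weights
W2 = rng.normal(size=(5, 3))    # hypothetical second-layer weights

# Two linear layers applied one after the other...
two_linear = np.dot(np.dot(x, W1), W2)
# ...give the same result as one linear layer with weights W1 @ W2.
one_linear = np.dot(x, np.dot(W1, W2))
linear_collapses = np.allclose(two_linear, one_linear)

# With a non-linear activation in between, the single-matrix shortcut fails.
with_tanh = np.dot(np.tanh(np.dot(x, W1)), W2)
tanh_differs = not np.allclose(with_tanh, one_linear)

print(linear_collapses, tanh_differs)
```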

Popular choices include the <b>sigmoid function</b>, which takes a number and squashes it into the range 0 to 1. It has two problems, the first being the <b>vanishing gradient</b>:
- When a neuron's activation saturates close to either 0 or 1, the gradient in these regions is very close to zero. During backpropagation this local gradient is multiplied by the gradient flowing back from the rest of the objective, so if the local gradient is very small the product vanishes and almost no signal flows through the neuron to its weights and, recursively, to its inputs.
- The second problem is that its output isn't <b>zero-centered</b>: it ranges from 0 to 1. The values passed on to the next layer are therefore always positive, which makes the gradients of that layer's weights either all positive or all negative. The resulting zig-zagging updates make optimization harder.
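
Both problems can be seen numerically. The sketch below is only an illustration: `sigmoid_prime` is a helper introduced here (it is not used elsewhere in this notebook), and the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])  # arbitrary sample inputs
outputs = sigmoid(xs)
grads = sigmoid_prime(xs)

for x, y, g in zip(xs, outputs, grads):
    print("x = %6.1f  sigmoid = %.5f  gradient = %.5f" % (x, y, g))
```

The gradient peaks at 0.25 for x = 0 and is close to zero at both ends, while every output is positive.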


We also have <b>tanh(x)</b>, which squashes the real numbers into the range -1 to 1, so its output is zero-centered, which makes optimization easier. In practice it is preferred over the sigmoid. But just like the sigmoid, it suffers from the <b>vanishing gradient problem</b>.

So then we have <b>ReLU</b>, which has become very popular. It's simply <b>max(0, x)</b>, which means that the value is 0 when x is less than 0 and linear with a slope of 1 when x is greater than 0. The landmark ImageNet classification paper by Krizhevsky et al. reported about 6x faster convergence than with tanh(x). It doesn't involve expensive calculations like tanh(x) or sigmoid, so it is cheap to compute, and because it does not saturate for positive inputs it mitigates the vanishing gradient problem. <i>But it's used for the <b>hidden layers</b></i>.
The output layer should use a <b>softmax function for classification, since it gives probabilities for the different classes</b>, and a <b>linear function for regression</b>, since the signal passes through unchanged.
<b>NB:</b> One problem with ReLU is that some units can be fragile during training and die: a large gradient flowing through a neuron can cause a weight update that makes it never activate on any data point again, so the gradient flowing through it is zero from that point on. A variant called the <b>Leaky ReLU</b> was introduced to address this problem.
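
None of these three functions is used in the network below, so here is a minimal sketch of how they might be written with NumPy. The function names, the sample input and the negative slope `alpha=0.01` are assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed small slope that keeps a gradient for x < 0
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # subtracting the maximum avoids overflow in exp without changing the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # arbitrary sample inputs
print(relu(x))
print(leaky_relu(x))
print(softmax(x), softmax(x).sum())
```

The softmax outputs are all positive and sum to 1, which is why they can be read as class probabilities.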

<img src="activation-functions.png">

In [4]:
# Sigmoid function
def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

# Derivative of the hyperbolic tangent function
def tanh_prime(x):
    return 1 - np.tanh(x)**2
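
As a quick sanity check, `tanh_prime` can be compared with a central finite difference of `np.tanh`. This is only a sketch; the sample points and the step size `eps` are arbitrary choices.

```python
import numpy as np

def tanh_prime(x):
    return 1 - np.tanh(x)**2

xs = np.linspace(-3.0, 3.0, 7)   # arbitrary sample points
eps = 1e-6                        # assumed finite-difference step

# central difference approximation of d/dx tanh(x)
numeric = (np.tanh(xs + eps) - np.tanh(xs - eps)) / (2 * eps)
max_diff = np.max(np.abs(numeric - tanh_prime(xs)))

print("largest difference:", max_diff)
print("derivative at x = 0 and x = 3:", tanh_prime(0.0), tanh_prime(3.0))
```

The derivative is 1 at x = 0 and shrinks towards 0 as |x| grows, which is the saturation described above.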

<b>Training function</b>

In [5]:
# takes 6 parameters
# x - input data (one sample)
# t - target output for that sample
# V, W - weight matrices of the hidden and output layers
# bv, bw - biases of the hidden and output layers

def train(x, t, V, W, bv, bw):
    
    # forward propagation. Matrix multiply + biases
    A = np.dot(x,V) + bv # multiply the inputs by the weights and add the bias
    Z = np.tanh(A) # apply an activation function. Repeat for next layer
    
    B = np.dot(Z, W) + bw
    Y = sigmoid(B)
    
    # Backward propagation
    Ew = Y - t # output error: prediction minus target
    Ev = tanh_prime(A) * np.dot(W, Ew) # error propagated back to the hidden layer
    
    # Gradients of the weights
    dW = np.outer(Z, Ew) 
    dV = np.outer(x, Ev)
    
    # loss function     -- cross entropy, generally preferred for 'classification'
    loss = -np.mean(t * np.log(Y) + (1 - t) * np.log(1 - Y ))
    return loss, (dV, dW, Ev, Ew)
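
One way to gain confidence in the backward pass is a finite-difference gradient check. The sketch below repeats `train` so that it runs on its own; the small layer sizes, the random data and the helper `numeric_grad` are assumptions made for illustration. Because the loss is a mean over the outputs while `Ew = Y - t` is the gradient of the summed cross entropy, the returned gradients come out as `n_out` times the numerical gradient of `loss`. That constant factor only rescales the effective learning rate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_prime(x):
    return 1 - np.tanh(x)**2

def train(x, t, V, W, bv, bw):
    A = np.dot(x, V) + bv
    Z = np.tanh(A)
    B = np.dot(Z, W) + bw
    Y = sigmoid(B)
    Ew = Y - t
    Ev = tanh_prime(A) * np.dot(W, Ew)
    dW = np.outer(Z, Ew)
    dV = np.outer(x, Ev)
    loss = -np.mean(t * np.log(Y) + (1 - t) * np.log(1 - Y))
    return loss, (dV, dW, Ev, Ew)

# small hypothetical problem, just for the check
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2
x = rng.integers(0, 2, n_in)
t = rng.integers(0, 2, n_out)
params = [rng.normal(scale=0.1, size=(n_in, n_hidden)),
          rng.normal(scale=0.1, size=(n_hidden, n_out)),
          np.zeros(n_hidden),
          np.zeros(n_out)]

def numeric_grad(k, params, eps=1e-6):
    # central finite difference of the loss w.r.t. every entry of params[k]
    grad = np.zeros_like(params[k])
    for idx in np.ndindex(grad.shape):
        plus = [p.copy() for p in params]
        minus = [p.copy() for p in params]
        plus[k][idx] += eps
        minus[k][idx] -= eps
        grad[idx] = (train(x, t, *plus)[0] - train(x, t, *minus)[0]) / (2 * eps)
    return grad

_, analytic = train(x, t, *params)
checks = [np.allclose(numeric_grad(k, params), analytic[k] / n_out, atol=1e-7)
          for k in range(len(params))]
print(checks)
```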



<img src="ml-model.png">

The final output value is our prediction. We find the difference between it and the expected output, then use that error value to compute the partial derivative with respect to the weights in each layer, going backwards recursively. We then update the weights with these values and repeat the process until the error is as small as possible.


<b>Prediction function</b> 

In [6]:
def predict(x, V, W, bv, bw):
    A = np.dot(x, V) + bv
    B = np.dot(np.tanh(A), W) + bw
    return (sigmoid(B) > 0.5).astype(int)
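
Thanks to NumPy broadcasting, the same `predict` also accepts a whole batch of rows at once. The sketch below uses untrained, randomly initialised weights and a made-up batch `X_batch`, so the predictions themselves are meaningless; it only illustrates the shapes and that the batched call agrees with calling `predict` row by row.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(x, V, W, bv, bw):
    A = np.dot(x, V) + bv
    B = np.dot(np.tanh(A), W) + bw
    return (sigmoid(B) > 0.5).astype(int)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 10, 10, 10
# untrained weights, for illustration only
V = rng.normal(scale=0.1, size=(n_in, n_hidden))
W = rng.normal(scale=0.1, size=(n_hidden, n_out))
bv = np.zeros(n_hidden)
bw = np.zeros(n_out)

X_batch = rng.integers(0, 2, (5, n_in))   # 5 hypothetical samples

batch_pred = predict(X_batch, V, W, bv, bw)                      # all rows at once
row_pred = np.array([predict(row, V, W, bv, bw) for row in X_batch])

print(batch_pred.shape)
print(np.array_equal(batch_pred, row_pred))
```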
    

Creating layers

In [7]:
#n_hidden = 10 # number of hidden neurons
#n_in = 10

#n_out = 10 # number of outputs
#n_sample = 300

V = np.random.normal(scale = 0.1, size = (n_in, n_hidden))
W = np.random.normal(scale = 0.1, size = ( n_hidden, n_out))

# initialize our biases
bv = np.zeros(n_hidden)
bw = np.zeros(n_out)

params = [V, W, bv, bw]

# Generating our data using numpy
X = np.random.binomial(1, 0.5, (n_sample, n_in))
T = X ^ 1 # flip every bit of X (XOR with 1)

#print(X[5])  [0 0 1 1 0 1 0 0 0 0]
#print(T[5])  [1 1 0 0 1 0 1 1 1 1]
print(len(params))
print(range(X.shape[0]))

4
range(0, 300)
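
The target `T` is simply every bit of `X` flipped. A small sketch, using a smaller sample count than the notebook, confirms that `X ^ 1` is the same as `1 - X` for 0/1 data:

```python
import numpy as np

np.random.seed(0)
X = np.random.binomial(1, 0.5, (3, 10))   # 3 samples instead of 300, for brevity
T = X ^ 1                                  # XOR with 1 flips each bit

print(X[0])
print(T[0])

flips_every_bit = np.array_equal(T, 1 - X)
sums_to_one = bool(np.all(X + T == 1))
print(flips_every_bit, sums_to_one)
```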


<b>Training time</b>

In [8]:
for epoch in range(100):
    err = [] # error list
    upd = [0]*len(params)
    
    t0 = time.process_time()
    # for each data point, update our weights
    for i in range(X.shape[0]): # X.shape[0] is the number of samples
        loss, grad = train(X[i], T[i], *params) # '*params' unpacks V, W, bv, bw
        
        # apply the update computed on the previous sample (zero at the start of each epoch)
        for j in range(len(params)):
            params[j] -= upd[j]
        
        # momentum update: current gradient step plus a fraction of the previous update
        for j in range(len(params)):
            upd[j] = learning_rate * grad[j] + momentum * upd[j] 
            
        err.append(loss)
        
    print(" Epoch: %d, loss: %.8f, Time: %.4fs" %(epoch, np.mean(err), time.process_time() - t0))
        

 Epoch: 0, loss: 0.45465070, Time: 0.0296s
 Epoch: 1, loss: 0.13697961, Time: 0.0275s
 Epoch: 2, loss: 0.06206941, Time: 0.0316s
 Epoch: 3, loss: 0.04092746, Time: 0.0175s
 Epoch: 4, loss: 0.03159958, Time: 0.0186s
 Epoch: 5, loss: 0.02592744, Time: 0.0236s
 Epoch: 6, loss: 0.02199575, Time: 0.0243s
 Epoch: 7, loss: 0.01907812, Time: 0.0186s
 Epoch: 8, loss: 0.01682099, Time: 0.0185s
 Epoch: 9, loss: 0.01502363, Time: 0.0213s
 Epoch: 10, loss: 0.01356039, Time: 0.0199s
 Epoch: 11, loss: 0.01234775, Time: 0.0219s
 Epoch: 12, loss: 0.01132776, Time: 0.0234s
 Epoch: 13, loss: 0.01045887, Time: 0.0243s
 Epoch: 14, loss: 0.00971052, Time: 0.0244s
 Epoch: 15, loss: 0.00905971, Time: 0.0200s
 Epoch: 16, loss: 0.00848887, Time: 0.0241s
 Epoch: 17, loss: 0.00798436, Time: 0.0242s
 Epoch: 18, loss: 0.00753542, Time: 0.0152s
 Epoch: 19, loss: 0.00713347, Time: 0.0166s
 Epoch: 20, loss: 0.00677160, Time: 0.0235s
 Epoch: 21, loss: 0.00644415, Time: 0.0227s
 Epoch: 22, loss: 0.00614650, Time: 0.0232
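
To see what the momentum term in the update rule does, here is a sketch on a toy problem: minimising f(w) = w**2 with the same `learning_rate` and `momentum` values as above. The function `minimize`, the starting point and the step count are made up for illustration, and unlike the loop above this sketch applies each update right after computing it.

```python
def minimize(momentum, learning_rate=0.01, steps=100):
    w = 5.0      # hypothetical starting point
    upd = 0.0
    for _ in range(steps):
        grad = 2.0 * w                                  # derivative of f(w) = w**2
        upd = learning_rate * grad + momentum * upd     # same rule as in the training loop
        w -= upd
    return w

w_plain = minimize(momentum=0.0)      # plain gradient descent
w_momentum = minimize(momentum=0.9)   # gradient descent with momentum

print("distance from the minimum without momentum:", abs(w_plain))
print("distance from the minimum with momentum:   ", abs(w_momentum))
```

After the same number of steps, the run with momentum ends up much closer to the minimum at w = 0, because each update keeps part of the previous one and so builds up speed along a consistent direction.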

Time to predict something 

In [9]:
x = np.random.binomial(1 ,0.5, n_in)
print("XOR Prediction")
print(x)
print(predict(x, *params))

XOR Prediction
[1 0 1 1 1 1 0 1 0 0]
[0 1 0 0 0 0 1 0 1 1]
