# Basics of Deep Learning
In this notebook, we will cover the basics behind Deep Learning. I'm talking about building a brain....

![gif of some colours](https://www.fleetscience.org/sites/default/files/images/neural-mlblog.gif)

Only kidding. Deep learning is a fascinating field that has exploded over the last few years. Its uses range from facial recognition in apps such as Snapchat or challenger banks to more advanced cases such as [protein-folding](https://www.independent.co.uk/life-style/gadgets-and-tech/protein-folding-ai-deepmind-google-cancer-covid-b1764008.html).

In this notebook we will:
- Explain the building blocks of neural networks
- Go over some applications of Deep Learning

## Building blocks of Neural Networks

I have no doubt that you have heard/seen how similar neural networks are to....our brains. 


### The Perceptron

The perceptron is the building block of neural networks. It was created in 1958 by Frank Rosenblatt (I love that name) at Cornell and has a rich history, but that story is for another day....or another section in this book (backgrounds!).

The perceptron is an algorithm that can learn a binary classifier (e.g. is that a cat or a dog?). The classifier is a threshold function, which maps an input vector $x$ to a binary output decision $f(x)$, using a weight vector $w$ and a bias $b$. Here is the formal maths to better explain my verbal fluff:

$ f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} $
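To make the threshold function concrete, here is a minimal sketch in NumPy. The function name `perceptron_predict` and the hand-picked weights and bias are assumptions for illustration only (nothing is being learned here); they are chosen so that the perceptron behaves like a logical AND.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return 1 if w.x + b > 0, otherwise 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias, picked by hand (not learned)
w = np.array([1.0, 1.0])
b = -1.5

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron_predict(np.array(p), w, b) for p in points]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above zero, so it is the only one classified as 1.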


### The Artificial neural network

Let's take a look at the high-level architecture first.

![3blue1brown neural network gif](https://thumbs.gfycat.com/DeadlyDeafeningAtlanticblackgoby-max-1mb.gif) 

The gif above of a neural network classifying images is one of the best visual ways of understanding how neural networks work. The neural network is made up of a few key concepts:
- An input: this is the data you pass into the network. For example, data relating to a customer (e.g. height, weight etc) or the pixels of an image
- An output: this is the prediction of the neural network
- A hidden layer: more on this later
- Neurons: the network is made up of neurons, each of which takes an input and gives an output

Now we have a slightly better understanding of what a neuron is. Let's look at a very simple example:

![simple neural network](https://databricks.com/wp-content/uploads/2019/02/neural1.jpg) 

From the above image, you can see the components listed above working together.

### But Abdi, what is the goal of a neural network?

Isn't it obvious? To me, it definitely was not when I first started to learn about neural networks. Neural networks are beautifully complex to understand, but with enough time and lots of YouTube videos, you'll be able to master this topic.

The goal of a neural network is to make a pretty good guess of something. For example, a phone may have a face unlock feature. The phone probably got you to take a short video or a few images of yourself in order to set up this security feature, and once it had **learned** your face, you were able to use it to unlock your phone. This is pretty much what we do with neural networks. We teach the network by giving it data, and we make sure it gets better at making predictions by adjusting the weights between neurons. More on this soon.


Neural networks are made up of:
- input layer: this is where you feed the features into the network
- hidden layers: 1 or more layers can be built into the network, in order to build a better function approximator
- output layer: this is where the prediction comes out, and it can have 1 or more nodes.

The above is a basic architecture. There are more that we will cover in different sections of this book.
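As a rough sketch of how data flows through these three layers, the snippet below pushes one example through a tiny network. The layer sizes, the random weights and the choice of a sigmoid activation are all assumptions made for illustration; the point is only to show the shapes at each layer.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical layer sizes: 3 input features, 4 hidden neurons, 1 output node
n_inputs, n_hidden, n_outputs = 3, 4, 1
w_input_hidden = rng.normal(size=(n_inputs, n_hidden))
w_hidden_output = rng.normal(size=(n_hidden, n_outputs))

features = np.array([0.5, 0.2, -0.3])            # input layer: one example
hidden = sigmoid(features @ w_input_hidden)      # hidden layer activations
prediction = sigmoid(hidden @ w_hidden_output)   # output layer

print(f'Hidden layer shape: {hidden.shape}')
print(f'Output layer shape: {prediction.shape}')
```

Each layer is just a matrix of weights, so adding more hidden layers means adding more weight matrices between the input and the output.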

## Gradient Descent Algorithm

One of the best videos on neural networks, by 3Blue1Brown:

[![3b1b logo](https://cdn.shopify.com/s/files/1/0506/0633/collections/3blue1brown_logo_1300x.jpg?v=1528987740)](https://www.youtube.com/watch?v=aircAruvnKk)

His series on Neural networks and Linear algebra are golden sources for learning Deep Learning.

### Simple Gradient Descent Implementation
With help from our friends over at Udacity, below is a simple implementation of a neural network. This very basic network has its inputs linked directly to the output, so it has no hidden layers.

We begin by defining some functions.

In [1]:
import numpy as np 

# We will be using a sigmoid activation function
def sigmoid(x):
    return 1/(1+np.exp(-x))

# derivative of sigmoid(x) - will be used for backpropagating errors through the network
def sigmoid_prime(x):
    return sigmoid(x)*(1-sigmoid(x))

We begin by defining a simple neural network:
- two input neurons: x1 and x2
- one output neuron: y1

In [4]:
x = np.array([1,5])
y = np.array([0.4])

In [5]:
print(f'Shape of our input: {x.shape}')
print(f'Shape of our output: {y.shape}')

Shape of our input: (2,)
Shape of our output: (1,)


We have a neuron that takes 2 inputs, and returns one output.

We now define the weights, w1 and w2, for the two input neurons, x1 and x2. We also define a learning rate that controls the size of our gradient descent step.

In [8]:
weights = np.array([-0.2,0.4])
learnrate = 0.5
print(f'Weight from Node 1 to output is {weights[0]}, and the weight from node 2 to the output is {weights[1]}. Our learning rate is {learnrate}.')

Weight from Node 1 to output is -0.2, and the weight from node 2 to the output is 0.4. Our learning rate is 0.5.


We now start moving forwards through the network, known as feed forward. Here, we want to take a weighted sum, which we will call $h$:

$ h = x \cdot w $

$ h = x_{1} \cdot w_{1} + x_{2} \cdot w_{2} + ... + x_{n} \cdot w_{n}  $

$ h = \sum_{i=1}^{n} x_{i} \cdot w_{i}  $

We can combine the input vector with the weight vector using numpy's dot product.

In [9]:
# linear combination
# h = x[0]*weights[0] + x[1]*weights[1]
h = np.dot(x, weights)

We now apply our non-linearity, which gives us our output.

In [10]:
# apply non-linearity
output = sigmoid(h)

Now that we have our prediction, we are able to determine the `error` of our neural network. Here, we will use the difference between the actual and predicted values, `y - output`.

The goal now is to determine how to change our weights in order to reduce the error above. This is where our good friend gradient descent and the chain rule come into play:

- we determine the derivative of our squared error with respect to each input weight, and step in the opposite direction. Hence:
- change in weights: $ \Delta w_{i} = -\eta \frac{\partial}{\partial w_{i}} \frac{1}{2}{(y - \hat{y})^2}$
- simplifies to: $ \Delta w_{i} = $ learning rate * error term * $ x_{i}$
- where:
    - learning rate = $ \eta $
    - error term = $ (y - \hat{y}) * f'(h) $
    - h =  $ \sum_{i} w_{i} x_{i} $
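One way to convince yourself of the simplification is to compare it against a numerical estimate. The sketch below assumes a single sigmoid neuron and the squared error $\frac{1}{2}(y - \hat{y})^2$; the helper name `squared_error` and the step size `eps` are my own choices. The gradient of the error with respect to each weight is $-(y - \hat{y}) f'(h) x_{i}$, which is exactly the error term times $x_{i}$ with the sign flipped.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def squared_error(w, x, y):
    """E = 1/2 * (y - y_hat)^2 for a single sigmoid neuron."""
    return 0.5 * (y - sigmoid(np.dot(x, w))) ** 2

x = np.array([1.0, 5.0])
y = 0.4
w = np.array([-0.2, 0.4])

# Analytic gradient: dE/dw_i = -(y - y_hat) * f'(h) * x_i
y_hat = sigmoid(np.dot(x, w))
analytic_grad = -(y - y_hat) * y_hat * (1 - y_hat) * x

# Numerical gradient using central differences
eps = 1e-6
numeric_grad = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric_grad[i] = (squared_error(w + step, x, y)
                       - squared_error(w - step, x, y)) / (2 * eps)

print(f'Analytic gradient: {analytic_grad}')
print(f'Numeric gradient:  {numeric_grad}')
```

The two gradients should agree to several decimal places. Multiplying the analytic gradient by minus the learning rate gives the weight change.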


The network we created here is simple: 
- input nodes:
    - calculate a weighted sum
- activation:
    - calculate activation function e.g. sigmoid
- cost:
    - calculate the error

Hence, to find how the error changes with respect to a change in the weights, we need to backpropagate the error through each of these steps. Here, we use the chain rule with partial derivatives:

$ \frac{\partial E}{\partial W} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W}$

Hence, putting it all together forms:

In [13]:
# negative derivative of the error with respect to the activation: -dE/dy_hat = y - output
# (the negative gradient is the direction that reduces the error)
dEdy = y - output
# derivative of the activation with respect to weighted sum
dydh = sigmoid_prime(h)
# derivative of the weighted sum with respect to the weights
dhdW = x

Now, we can calculate our gradient descent step:

In [15]:
# learning rate * error term * x
gradient_descent_step = learnrate * (dEdy * dydh * dhdW)

Hence:

In [21]:
error = y - output  # difference between actual and predicted
print(f'Actual output: \t\t{y}')
print(f'Predicted output: \t{output}')
print(f'Error: \t\t\t{error}\n')
print(f'Previous weights: \t{weights}')
# the step is learning rate * error term * x, so it is added to the weights
print(f'Updated weights: \t{weights + gradient_descent_step}')
print(f'Weight change: \t\t{gradient_descent_step}')

Actual output: 		[0.4]
Predicted output: 	0.8581489350995123
Error: 			[-0.45814894]

Previous weights: 	[-0.2  0.4]
Updated weights: 	[-0.22788508  0.26057458]
Weight change: 		[-0.02788508 -0.13942542]
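A single step only nudges the weights. As a rough sketch of what happens when we keep going, the loop below repeats the same update, adding learning rate * error term * x to the weights each time. The number of repeats (`n_steps`) is an arbitrary choice of mine; everything else reuses the values from above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 5.0])
y = 0.4
weights = np.array([-0.2, 0.4])
learnrate = 0.5
n_steps = 20  # hypothetical number of repeats

errors = []
for _ in range(n_steps):
    output = sigmoid(np.dot(x, weights))        # feed forward
    error = y - output                          # actual minus predicted
    errors.append(error)
    error_term = error * output * (1 - output)  # error * f'(h)
    weights = weights + learnrate * error_term * x

print(f'First error: {errors[0]:.4f}')
print(f'Last error:  {errors[-1]:.6f}')
```

The error starts at roughly -0.458 and shrinks towards zero as the prediction moves towards 0.4.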


### But that was a tiny network, let's go bigger

We will begin by creating some fake data, followed by implementing our neural network.

In [24]:
x = np.random.rand(200,2)
no_data_points, no_features = x.shape
print(f'For features: We have {no_data_points} instances and {no_features} features per instance')
y = np.random.randint(low=0, high=2, size=(200,1))
print(f'For labels: We have {y.shape[0]} instances and {y.shape[1]} label per instance\n')

print('lets take a look at our X')
print(x[:2])
print('\nLets look at our label')
print(y[:2])

For features: We have 200 instances and 2 features per instance
For labels: We have 200 instances and 1 label per instance

lets take a look at our X
[[0.30700922 0.90934141]
 [0.63079007 0.05063643]]

Lets look at our label
[[1]
 [1]]


In [25]:
def sig(x):
    '''Calc for sigmoid'''
    return 1 / (1+np.exp(-x))

Let's create the weights that connect our inputs to our output.

In [37]:
weights = np.random.normal(scale=1/no_features**.5, size=no_features)
assert len(weights) == no_features

We have now defined our neural network:
- We have 200 data points
- each instance has 2 features
- with a single prediction
- we have defined the weights between the input neurons and the output

Now, let's train our neural network with a feed forward pass followed by a weight update, passing our entire dataset through the network `epochs` number of times.

In [40]:
epochs = 1000
learning_rate = 0.5
last_loss = None

for single_data_pass in range(epochs):
    # reset the change_in_weights
    change_in_weights = np.zeros(weights.shape)
    for features, label in zip(x, y):
        h = np.dot(features, weights)
        pred = sigmoid(h)

        error = (label - pred)
        # error term = error * f'(h), where f'(h) = pred * (1 - pred) for the sigmoid
        error_term = error * (pred * (1 - pred))
        # now multiply this by the current x & add to our weight update
        change_in_weights += (error_term * features)
    # now update the actual weights after a complete pass of our data (mean of individual weight updates)
    weights += (learning_rate * change_in_weights / no_data_points)

    # print the loss every 100th pass
    if single_data_pass % (epochs // 10) == 0:
        # use current weights in NN to determine outputs for the whole dataset
        output = sigmoid(np.dot(x, weights))
        # find the loss (y is flattened so its shape matches output)
        loss = np.mean((output - y.ravel())**2)
        if last_loss and last_loss < loss:
            print(f'Train loss: {loss}, WARNING - Loss is increasing')
        else:
            print(f'Training loss: {loss}')
        last_loss = loss

Since the labels here are random, there is no real pattern to learn, so expect the printed training loss to hover around 0.25 (roughly the error of always predicting 0.5) rather than fall towards zero. The exact values will differ from run to run.
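Random labels give the network nothing to learn, so here is a hedged variation that makes the effect of training visible. It is a sketch, not the Udacity implementation: the labels are hypothetical (1 when the first feature is larger than the second), the weights start at zero, the seed is fixed, and the inner Python loop is replaced by NumPy operations over the whole dataset.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
x = rng.random((200, 2))
# Hypothetical labels: 1 when the first feature is larger than the second
y = (x[:, 0] > x[:, 1]).astype(float)

weights = np.zeros(2)
learning_rate = 0.5
epochs = 1000
losses = []

for epoch in range(epochs):
    pred = sigmoid(x @ weights)             # predictions for all 200 rows at once
    error = y - pred
    losses.append(np.mean(error ** 2))      # mean squared error before this update
    error_term = error * pred * (1 - pred)  # error * f'(h) for the sigmoid
    # mean of the individual weight updates over the dataset
    weights += learning_rate * (error_term @ x) / len(x)

print(f'Loss at start: {losses[0]:.4f}')
print(f'Loss at end:   {losses[-1]:.4f}')
print(f'Learned weights: {weights}')
```

With zero weights every prediction is 0.5, so the loss starts at 0.25. As training goes on, the first weight grows and the second shrinks, which is what you would expect for labels that compare the two features.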


## Backpropa what?

Ok, so now, how do we refine our weights? Well, this is where **backpropagation** comes in. After feeding our data forwards through the network, using feed forward, we propagate our errors backwards, making use of things such as the chain rule.

Let's do an implementation.

In [44]:
# we have three input nodes
x = np.array([0.5, 0.2, -0.3])
# one output node
y = 0.7

learnrate = 0.5
# 2 nodes in hidden layer
weights_input_hidden = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])
weights_hidden_output = np.array([0.1,-0.3])

In [45]:
# feeding data forwards through the network
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)
#---
output_layer_input = np.dot(hidden_layer_output, weights_hidden_output)
y_hat = sigmoid(output_layer_input)

In [47]:
# backward propagate the errors to tune the weights

# 1. calculate errors
error = y - y_hat
output_node_error_term = error * (y_hat * (1-y_hat))
#----
hidden_node_error_term = weights_hidden_output * output_node_error_term *(hidden_layer_output * (1-hidden_layer_output))

# 2. calculate weight changes
delta_w_output_node = learnrate * output_node_error_term * hidden_layer_output
#-----
delta_w_hidden_node = learnrate * hidden_node_error_term * x[:,None]


In [48]:
print(f'Original weights:\n{weights_input_hidden}\n{weights_hidden_output}')
print()
print('Change in weights for hidden layer to output layer:')
print(delta_w_output_node)
print('Change in weights for input layer to hidden layer:')
print(delta_w_hidden_node)

Original weights:
[[ 0.5 -0.6]
 [ 0.1 -0.2]
 [ 0.1  0.7]]
[ 0.1 -0.3]

Change in weights for hidden layer to output layer:
[0.01492263 0.00975438]
Change in weights for input layer to hidden layer:
[[ 0.00032851 -0.00092784]
 [ 0.0001314  -0.00037114]
 [-0.00019711  0.0005567 ]]
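As a quick sanity check, we can apply these weight changes and feed the same input forward again. This is a sketch under the same values as above; the helper `forward` and the `_new` variable names are my own. If backpropagation did its job, the prediction should move a little closer to the target of 0.7.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, w_ih, w_ho):
    """Feed x through the hidden layer and the output node."""
    hidden = sigmoid(np.dot(x, w_ih))
    return hidden, sigmoid(np.dot(hidden, w_ho))

x = np.array([0.5, 0.2, -0.3])
y = 0.7
learnrate = 0.5
w_ih = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])
w_ho = np.array([0.1, -0.3])

# forward pass and error terms, as in the cells above
hidden, y_hat = forward(x, w_ih, w_ho)
error_before = y - y_hat
output_term = error_before * y_hat * (1 - y_hat)
hidden_term = w_ho * output_term * hidden * (1 - hidden)

# add the weight changes to the original weights
delta_w_ho = learnrate * output_term * hidden
delta_w_ih = learnrate * hidden_term * x[:, None]
w_ho_new = w_ho + delta_w_ho
w_ih_new = w_ih + delta_w_ih

# feed the same input forward with the updated weights
_, y_hat_new = forward(x, w_ih_new, w_ho_new)
error_after = y - y_hat_new

print(f'Error before update: {error_before:.5f}')
print(f'Error after update:  {error_after:.5f}')
```

The error after the update should be slightly smaller than before. One step with tiny weight changes only moves the prediction a little, which is why we repeat this over many epochs.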


## Putting it all together

In [14]:
features = np.random.rand(200,2)
target = np.random.randint(low=0, high=2, size=(200,1))

def complete_backprop(features, target):
    '''Complete implementation of backpropagation'''
    n_hidden_units = 2
    epochs = 900
    learnrate = 0.005

    n_records, n_features = features.shape
    last_loss = None

    w_input_to_hidden = np.random.normal(scale=1/n_features**.5,size=(n_features, n_hidden_units))
    w_hidden_to_output = np.random.normal(scale=1/n_features**.5, size=n_hidden_units)

    for single_epoch in range(epochs):
        delw_input_to_hidden = np.zeros(w_input_to_hidden.shape)
        delw_hidden_to_output = np.zeros(w_hidden_to_output.shape)

        for x,y in zip(features, target):
            # ----------------------
            # 1. Feed data forwards
            # ----------------------
            
            hidden_layer_input = np.dot(x,w_input_to_hidden)
            hidden_layer_output = sigmoid(hidden_layer_input)

            output_layer_input = np.dot(hidden_layer_output, w_hidden_to_output)
            output_layer_output = sigmoid(output_layer_input)

            # ----------------------
            # 2. Backpropagate the errors
            # ----------------------

            # error at output layer
            prediction_error = y - output_layer_output
            output_error_term = prediction_error * (output_layer_output * (1-output_layer_output))

            # error at hidden layer (propagated from output layer)
            # scale error from output layer by weights
            hidden_layer_error = np.multiply(output_error_term, w_hidden_to_output)
            hidden_error_term = hidden_layer_error * (hidden_layer_output * (1-hidden_layer_output))

            # ----------------------
            # 3. Find change of weights for each data point
            # ----------------------

            delw_hidden_to_output += output_error_term * hidden_layer_output
            delw_input_to_hidden += hidden_error_term * x[:,None]
        
        
        # Now update the actual weights
        w_hidden_to_output += learnrate * delw_hidden_to_output / n_records
        w_input_to_hidden += learnrate * delw_input_to_hidden / n_records

        # Printing out the mean square error on the training set
        if single_epoch % (epochs / 10) == 0:
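            # use the whole dataset (not just the last record from the loop above)
            hidden_output = sigmoid(np.dot(features, w_input_to_hidden))
            out = sigmoid(np.dot(hidden_output,
                                w_hidden_to_output))
            # flatten target so its shape matches out
            loss = np.mean((out - target.ravel()) ** 2)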

            if last_loss and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss


complete_backprop(features,target)

Train loss:  0.2525403681187093
Train loss:  0.2524365740756108
Train loss:  0.25233663631306025
Train loss:  0.2522404119715983
Train loss:  0.2521477634680961
Train loss:  0.25205855830450913
Train loss:  0.2519726688830313
Train loss:  0.251889972327498
Train loss:  0.2518103503108861
Train loss:  0.2517336888887526
