# Neural networks

Last week, when we analyzed the neural activity data, we discussed the main principle behind communication between neurons: if a neuron sends an electrochemical signal that is above a certain threshold, the nearby neuron is activated.

Fundamentally, **artificial neural networks** (ANNs) are quite similar. An ANN is built from (multiple) building blocks called neurons, each of which can be defined as a mathematical function that takes data as input, performs a transformation and produces an output.

To better understand the mathematics behind ANNs, let's look at a single neuron.

## Neuron

![neuron](https://miro.medium.com/max/875/1*NZc0TcMCzpgVZXvUdEkqvA.png)

The diagram above shows the basic structure of a single neuron (also known as a perceptron).
When we pass input features through the perceptron, each feature ($x_1, x_2, ...$) is multiplied by its weight ($w_1, w_2, ...$). The products are summed and added to the bias ($b$), which can be thought of as a term that is independent of the features (a *starting value*). The result is then passed through a nonlinear function called the **activation function**, which produces the output.

The whole perceptron training process can be divided into 3 steps:
- Forward propagation
- Loss calculation
- Back propagation

### Forward propagation

Forward propagation can be described as the series of computations made to produce a prediction (it is the process we have just described in the previous section).

The previously described steps can be expressed mathematically as follows:
- The weighted sum computed by the neuron can be written as $z = \sum_{i = 1}^n w_ix_i + b$
- This value is passed through the activation function ($A$) to produce the output, $\hat{y} = A(z)$
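
The two steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the tutorial's final implementation: the function names (`neuron_forward`, `sigmoid`) and the input values are assumptions chosen so that the arithmetic is easy to follow by hand, and sigmoid is used as the activation $A$.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid activation: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron_forward(x, w, b):
    """Forward pass of one neuron: weighted sum plus bias, then activation."""
    z = np.dot(w, x) + b   # z = sum_i w_i * x_i + b
    y_hat = sigmoid(z)     # y_hat = A(z)
    return z, y_hat


# Illustrative values only (hypothetical, not taken from any dataset)
x = np.array([1.0, 2.0])    # input features x_1, x_2
w = np.array([0.5, -0.25])  # weights w_1, w_2
b = 0.0                     # bias

z, y_hat = neuron_forward(x, w, b)
print(z, y_hat)  # z = 0.5*1 - 0.25*2 + 0 = 0.0, so y_hat = sigmoid(0) = 0.5
```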

As in the previous lectures, this output is compared to the expected value to calculate the loss. But before moving on to loss calculation, it is useful to look at some of the activation functions.

##### Activation functions

For simplicity's sake, we will not cover all activation functions (at least in this tutorial). Instead, we will focus on the two activation functions we will most likely use in this week's challenge: **ReLU** and **sigmoid**.

**ReLU** (or rectified linear unit) is a simple function that compares its input with zero. If the input is greater than zero, the output is the input itself. Otherwise, the output is zero. In mathematical terms, $A(z) = \max(0, z)$.

We have already covered the **sigmoid function** in the logistic regression tutorial. It can be expressed mathematically as $A(z) = \frac{1}{1+\exp(-z)}$.
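
To make the two definitions concrete, here is a short NumPy sketch of both activation functions applied to the same inputs. The function names and the sample values are illustrative assumptions.

```python
import numpy as np


def relu(z):
    """ReLU: element-wise max(0, z)."""
    return np.maximum(0, z)


def sigmoid(z):
    """Sigmoid: 1 / (1 + exp(-z))."""
    return 1 / (1 + np.exp(-z))


# Illustrative inputs: one negative, one zero, one positive
z = np.array([-2.0, 0.0, 3.0])

relu_out = relu(z)
sigmoid_out = sigmoid(z)

print(relu_out)     # the negative input becomes 0, the positive input is unchanged
print(sigmoid_out)  # every output lies between 0 and 1; sigmoid(0) is 0.5
```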


### Loss calculation

The loss function is a way of mathematically measuring how good our model's predictions are (so that we can later adjust the weights and biases).

Throughout the series, we are going to introduce a variety of loss functions; for a start, let's look at just a few of them.

##### Cross-Entropy loss

- For classification tasks, we commonly choose cross-entropy loss.
- It can be calculated using the following formula: $loss = -\sum_{i = 1}^C y_i\log(\hat{y}_i)$, where $C$ is the number of classes
- For a binary classification problem ($C = 2$), this loss function can be written as $loss = -y_1\log(\hat{y}_1) - (1 - y_1)\log(1-\hat{y}_1)$
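
A minimal NumPy sketch of both forms is shown below. The function names, the `eps` clipping constant (used only to avoid evaluating $\log(0)$) and the example probabilities are assumptions made for illustration. With $C = 2$, the general and the binary form give the same value.

```python
import numpy as np


def cross_entropy(y, y_hat, eps=1e-12):
    """General form for one sample: y is one-hot, y_hat holds class probabilities."""
    y_hat = np.clip(y_hat, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(y_hat))


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary form for one sample: y is 0 or 1, y_hat is the probability of class 1."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0) on both sides
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


# Illustrative example: the true class is the first one, predicted with probability 0.8
ce = cross_entropy(np.array([1.0, 0.0]), np.array([0.8, 0.2]))
bce = binary_cross_entropy(1.0, 0.8)

print(ce, bce)  # both equal -log(0.8)
```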


##### Mean Squared Error (MSE)
- It can be calculated using the following formula: $loss = \frac{1}{N}\sum_{i = 1}^N(y_i - \hat{y}_i)^2$, where $N$ is the number of samples
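
The MSE formula translates almost directly into NumPy. The sketch below uses a hypothetical helper `mse` and made-up values so that the result can be checked by hand.

```python
import numpy as np


def mse(y, y_hat):
    """Mean squared error: average of the squared differences."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)


# Illustrative values only
y = [1.0, 0.0, 1.0, 0.0]      # expected values
y_hat = [0.5, 0.5, 1.0, 0.0]  # predictions

loss = mse(y, y_hat)
print(loss)  # (0.25 + 0.25 + 0 + 0) / 4 = 0.125
```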

### Back propagation

Back propagation is the process of training a neural network by updating its weights and bias. In a nutshell, our model computes predictions, which are compared to the expected values to calculate the loss. The weights and bias are then adjusted in a way that reduces the loss, and repeating this over a number of epochs leads to more accurate predictions.

As with the previous models, updating the coefficients (in this case, the weights and bias) involves calculating the derivative of the loss with respect to each coefficient, multiplying it by the learning rate and subtracting the result from the previous coefficient value.

To better visualize the whole process, let's look at a neuron with 2 inputs and a sigmoid activation function.

![neuron](https://i0.wp.com/neptune.ai/wp-content/uploads/Backpropagation-parameters.png?resize=581%2C361&ssl=1)

In such a case, the weights and bias would be updated in the following way (where $lr$ is the learning rate):
- $w_{1new} = w_1 - lr * \frac{\partial loss}{\partial w_1}$

- $w_{2new} = w_2 - lr * \frac{\partial loss}{\partial w_2}$

- $b_{new} = b - lr * \frac{\partial loss}{\partial b}$

However, the coefficients pass through multiple functions before they reach the final loss value, which means that we have to use the chain rule.

First, let's have a look at how it is written for $w_1$. We know that the loss is calculated from the predicted output ($\hat{y}$), which in turn is calculated by passing the weighted sum ($z$) to the sigmoid activation function. Finally, the weighted sum depends on the weight with respect to which we are trying to find the derivative. Using the chain rule:
- $\frac{\partial loss}{\partial w_1} = \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial w_1}$
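
As a worked example, each factor can be written out explicitly. This assumes the loss for a single sample is the squared error $loss = (y - \hat{y})^2$ (an assumption made here for illustration) and that the activation is sigmoid, as above:

- $\frac{\partial loss}{\partial \hat{y}} = -2(y - \hat{y})$

- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ (the derivative of the sigmoid function)

- $\frac{\partial z}{\partial w_1} = x_1$ (since $z = w_1x_1 + w_2x_2 + b$)

Multiplying the three factors gives $\frac{\partial loss}{\partial w_1} = -2(y - \hat{y})\hat{y}(1 - \hat{y})x_1$.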

Similarly, we can find the derivatives for the remaining weights and bias to get the following update equations:

- $w_{1new} = w_1 - lr * \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial w_1}$

- $w_{2new} = w_2 - lr * \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial w_2}$

- $b_{new} = b - lr * \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial b}$
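
The update equations above can be sketched for a single training sample as follows. This is an illustrative sketch, not a definitive implementation: it assumes a squared-error loss $(y - \hat{y})^2$ for one sample, and the names `loss_fn` and `gradients` as well as all numeric values are hypothetical. A finite-difference estimate is included as a sanity check of the chain-rule gradient.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def loss_fn(w, b, x, y):
    """Squared error of a single sigmoid neuron on one sample (assumed loss)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return (y - y_hat) ** 2


def gradients(w, b, x, y):
    """Chain rule: dloss/dw = dloss/dy_hat * dy_hat/dz * dz/dw."""
    y_hat = sigmoid(np.dot(w, x) + b)
    dloss_dyhat = 2 * (y_hat - y)     # derivative of (y - y_hat)^2
    dyhat_dz = y_hat * (1 - y_hat)    # derivative of sigmoid
    dloss_dz = dloss_dyhat * dyhat_dz
    return dloss_dz * x, dloss_dz     # dz/dw_i = x_i and dz/db = 1


# Illustrative values only
x = np.array([1.0, 2.0])
y = 1.0
w = np.array([0.5, -0.25])
b = 0.0
lr = 0.1

grad_w, grad_b = gradients(w, b, x, y)

# Sanity check: central finite difference for dloss/dw_1
h = 1e-6
step = np.array([h, 0.0])
num_grad_w1 = (loss_fn(w + step, b, x, y) - loss_fn(w - step, b, x, y)) / (2 * h)

# One gradient descent step: new = old - lr * gradient
w_new = w - lr * grad_w
b_new = b - lr * grad_b

loss_before = loss_fn(w, b, x, y)
loss_after = loss_fn(w_new, b_new, x, y)
print(grad_w, grad_b)           # analytic gradients from the chain rule
print(loss_before, loss_after)  # the loss is lower after the update
```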



## Python implementation

Let's say we want to program a neuron with 2 inputs, a sigmoid activation function and an MSE loss function.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def init(n_features=2):
    np.random.seed(1)

    # Defining weights for n_features inputs and 1 output, plus a single bias
    W = np.random.randn(n_features, 1)
    b = np.random.rand(1)

    lr = 0.001

    return W, b, lr


# Defining our activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def mse_loss(y, yhat):
    # Mean of the squared differences between expected and predicted values
    return np.mean((y - yhat) ** 2)


def forward(X, y, W, b):
    # X has shape (num_samples, n_features), y has shape (num_samples, 1)
    z = X.dot(W) + b
    yhat = sigmoid(z)

    loss = mse_loss(y, yhat)

    return yhat, loss


def back_propagation(X, y, yhat, W, b, lr):
    num_samples = len(y)

    # Chain rule: dl/dW = dl/dyhat * dyhat/dz * dz/dW
    dl_wrt_yhat = 2 * (yhat - y) / num_samples  # derivative of MSE
    dl_wrt_z = dl_wrt_yhat * yhat * (1 - yhat)  # derivative of sigmoid
    dl_wrt_w = X.T.dot(dl_wrt_z)                # dz/dW = X
    dl_wrt_b = dl_wrt_z.sum(axis=0)             # dz/db = 1

    W = W - lr * dl_wrt_w
    b = b - lr * dl_wrt_b

    return W, b


def fit(X, y, epochs):
    W, b, lr = init(X.shape[1])

    for i in range(epochs):
        yhat, loss = forward(X, y, W, b)
        W, b = back_propagation(X, y, yhat, W, b, lr)

    return W, b
```
        

## Multiple layers

So far, we have learned how to build a perceptron. However, as you might imagine, a single perceptron does not provide accurate results when applied to large, complex datasets. As the term neural network suggests, one of the main ways of making our model more sophisticated is to use multiple neurons and layers.

To better understand how this works, let's look at an example structure of a 3-layer neural network.

![neural network](https://miro.medium.com/max/875/1*Z3zHoX1nhK6Rsmd4yNPdsg.jpeg)

Even though the structure itself looks much more complex, the working principle remains the same. Each neuron has a weight for each neuron-to-neuron connection and its own bias. The output of each neuron (computed as described in the single perceptron section) is passed on, weighted and summed at the connected neurons (basically, one neuron's output becomes another's input). After passing through all layers, we generate the final output, which is then used to measure the loss and start back propagation.
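
The forward pass through several layers can be sketched as a loop in which each layer's output becomes the next layer's input. The layer sizes, the random weights and the choice of ReLU for the hidden layers and sigmoid for the output are assumptions made for illustration; they are not read off the diagram above.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def relu(z):
    return np.maximum(0, z)


def forward(X, layers):
    """Pass X through each layer; `layers` is a list of (W, b, activation) tuples."""
    a = X
    for W, b, activation in layers:
        z = a.dot(W) + b   # weighted sum for every neuron in the layer
        a = activation(z)  # this output becomes the next layer's input
    return a


rng = np.random.default_rng(0)

# Hypothetical architecture: 3 inputs -> 4 neurons -> 4 neurons -> 1 output
layers = [
    (rng.normal(size=(3, 4)), np.zeros(4), relu),
    (rng.normal(size=(4, 4)), np.zeros(4), relu),
    (rng.normal(size=(4, 1)), np.zeros(1), sigmoid),
]

X = rng.normal(size=(5, 3))  # 5 samples with 3 features each
y_hat = forward(X, layers)
print(y_hat.shape)  # (5, 1): one prediction per sample
```

Each `W` has one row per incoming connection and one column per neuron in the layer, which is how every neuron gets its own weight for every connection and its own bias.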