# Neural networks

This week, we discussed the main principle behind communication between neurons: if a neuron sends an electrochemical signal that is above a certain threshold, the nearby neuron is activated.

Fundamentally, **artificial neural networks** (ANNs) are quite similar. When we build an ANN, we use multiple building blocks called neurons, each of which can be defined as a mathematical function that takes data as input, performs a transformation and produces an output.

To better understand the mathematics behind ANNs, let's look at a single neuron.

## Neuron

![neuron](https://miro.medium.com/max/875/1*NZc0TcMCzpgVZXvUdEkqvA.png)

The diagram above demonstrates the basic structure of a single neuron (also known as a perceptron).
When we pass input features through the perceptron, each feature ($x_1, x_2, ...$) is multiplied by its weight ($w_1, w_2, ...$). The products are summed and added to the bias ($b$), which can be thought of as a constant term independent of the features (a *starting value*). The result is then passed through a nonlinear function called the **activation** function, which produces the output.

The whole perceptron training process can be divided into 3 steps:
- Forward propagation
- Loss calculation
- Backward propagation

### Forward propagation

Forward propagation can be described as the series of computations made to produce a prediction (it is the process we have just described in the previous section).

The previously described steps can be expressed mathematically as follows:
- The weighted sum computed by the neuron can be written as $z = \sum_{i = 1}^n w_ix_i + b$
- This sum is passed through the activation function ($A$) to produce the prediction, $\hat{y} = A(z)$

As in the previous lectures, this output is compared with the expected value to calculate the loss. Before moving on to loss calculation, it is useful to look at some of the activation functions.

##### Activation functions

For simplicity's sake, we will not cover all activation functions (at least in this tutorial). Instead, we will focus on the two activation functions we will most likely use in this week's challenge: **ReLU** and **sigmoid**.

**ReLU** (or rectified linear unit) is a simple function that compares its input with zero. In other words, if the passed value is greater than zero, it outputs that value unchanged. Otherwise, the output is zero. In mathematical terms, $A(z) = \max(0, z)$.

We have already covered the **sigmoid function** in the logistic regression tutorial. It can be expressed mathematically as $A(z) = \frac{1}{1+\exp(-z)}$.

Note that the code below is for educational purposes only. Soon, we're going to build more professional code using a class.

In [None]:
import numpy as np

# Defining our activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0)

# forward propagation
def forward(X, W, b):
    z = X.dot(W) + b
    yhat = sigmoid(z)
    
    return yhat
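
As a quick check, the forward pass can be tried on a small made-up batch. The sketch below redefines `sigmoid` and `forward` so that it runs on its own; the feature values, weights and bias are arbitrary illustrative numbers, not taken from any dataset.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W, b):
    # weighted sum of the inputs plus bias, squashed by the sigmoid
    return sigmoid(X.dot(W) + b)

# Two samples with three features each (made-up values)
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
W = np.array([0.5, -0.25, 0.0])  # one weight per feature
b = 0.0

yhat = forward(X, W, b)
print(yhat)  # both weighted sums are 0, so both predictions are 0.5
```

Because `X.dot(W)` handles the whole batch at once, the same function works for one sample or many.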

### Loss calculation

The loss function is a way of mathematically measuring how good our model's prediction is (so that we can later adjust the weights and biases).

Throughout the series, we are going to introduce a variety of loss functions. For a start, let's look at just a few of them.

##### Cross-Entropy loss

- For classification tasks, we commonly choose cross-entropy loss.
- It can be calculated using the following formula: $loss = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$
- For the binary classification problem ($C = 2$), this loss function can be written as $loss = -y_1\log(\hat{y}_1) - (1 - y_1)\log(1-\hat{y}_1)$
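The binary formula above can be sketched in NumPy as follows. The function name `binary_cross_entropy`, the clipping constant `eps` and the choice to average over the samples are assumptions made for this illustration; the sample labels and predictions are made up.

```python
import numpy as np

def binary_cross_entropy(y, yhat, eps=1e-9):
    # clip predictions away from 0 and 1 so that log() stays finite
    yhat = np.clip(yhat, eps, 1 - eps)
    # per-sample loss from the binary formula, averaged over the samples
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

y = np.array([1.0, 0.0, 1.0])
confident = np.array([0.9, 0.1, 0.9])  # close to the true labels
unsure = np.array([0.5, 0.5, 0.5])     # no better than guessing

loss_confident = binary_cross_entropy(y, confident)
loss_unsure = binary_cross_entropy(y, unsure)
print(loss_confident, loss_unsure)
```

Predictions that sit closer to the true labels give a smaller loss, which is the behaviour we want the training process to reward.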


##### Mean Squared Error (MSE)
- Can be calculated using the following formula: $loss = \frac{1}{N}\sum_{i = 1}^{N}(y_i - \hat{y}_i)^2$, where $N$ is the number of samples

In [None]:
def mse_loss(y, yhat):
    # convert to float arrays so that plain lists are accepted as well
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)

    # mean of the squared differences between targets and predictions
    loss = np.mean((y - yhat) ** 2)

    return loss
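
To make the MSE formula concrete, here is a small worked example with made-up targets and predictions. It spells out each step of the formula rather than calling a helper, so it runs on its own.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])     # expected values (made up)
yhat = np.array([1.0, 2.0, 5.0])  # predictions (made up)

errors = y - yhat          # [ 0.  0. -2.]
squared = errors ** 2      # [0. 0. 4.]
mse = np.mean(squared)     # (0 + 0 + 4) / 3

print(mse)
```

Only the third prediction is off, and because the error is squared, that single miss of 2 contributes 4 to the sum.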

### Back propagation

Back propagation is the process of training a neural network by updating its weights and bias. In a nutshell, our model computes predictions, which are compared with the expected values to calculate the loss. Over a number of epochs, the weights and bias are adjusted in a way that minimizes the loss, thus producing more accurate predictions.

As with the previous models, updating the coefficients (in this case, the weights and bias) involves calculating the derivative of the loss with respect to each coefficient, multiplying it by the learning rate and subtracting the result from the previous coefficient value.

To better visualize the whole process, let's look at a neuron with 2 inputs and a sigmoid activation function.

![neuron](https://i0.wp.com/neptune.ai/wp-content/uploads/Backpropagation-parameters.png?resize=581%2C361&ssl=1)

In such case, the weights and bias would be updated in the following way:
- $w_{1new} = w_1 - lr * \frac{\partial loss}{\partial w_1}$

- $w_{2new} = w_2 - lr * \frac{\partial loss}{\partial w_2}$

- $b_{new} = b - lr * \frac{\partial loss}{\partial b}$

However, each coefficient passes through several functions before it reaches the final loss value, which means that we have to use the chain rule.

First, let's have a look at how it is written for $w_1$. We know that the loss is calculated from the predicted output ($\hat{y}$), which in turn is calculated by passing the weighted sum ($z$) through the sigmoid activation function. Finally, the weighted sum depends on the weight with respect to which we are trying to find the derivative. Using the chain rule:
- $\frac{\partial loss}{\partial w_1} = \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial w_1}$

Similarly, we can find the derivatives for the remaining weights and bias to get the following update equations:

- $w_{1new} = w_1 - lr * \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial w_1}$

- $w_{2new} = w_2 - lr * \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial w_2}$

- $b_{new} = b - lr * \frac{\partial loss}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial b}$
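The update equations above can be sketched numerically for a neuron with 2 inputs and a sigmoid activation. This sketch assumes a squared-error loss on a single sample, $loss = (y - \hat{y})^2$; the sample, initial weights, bias and learning rate are made-up values. A finite-difference estimate is included as a sanity check on the chain-rule result.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(w, b, x, y):
    # squared error of a single sigmoid neuron on one sample
    yhat = sigmoid(np.dot(w, x) + b)
    return (y - yhat) ** 2

# Made-up sample, weights, bias and learning rate
x = np.array([0.5, -1.0])
y = 1.0
w = np.array([0.1, 0.2])
b = 0.0
lr = 0.1

# Chain rule, one factor per link
yhat = sigmoid(np.dot(w, x) + b)
dloss_dyhat = -2 * (y - yhat)        # d loss / d yhat
dyhat_dz = yhat * (1 - yhat)         # derivative of the sigmoid
grad_w = dloss_dyhat * dyhat_dz * x  # dz/dw_i = x_i
grad_b = dloss_dyhat * dyhat_dz      # dz/db = 1

# Finite-difference estimate of d loss / d w_1
h = 1e-6
w_shift = w + np.array([h, 0.0])
numeric = (loss_fn(w_shift, b, x, y) - loss_fn(w, b, x, y)) / h

# One gradient-descent update
w_new = w - lr * grad_w
b_new = b - lr * grad_b

print(grad_w[0], numeric)
print(loss_fn(w, b, x, y), loss_fn(w_new, b_new, x, y))
```

The analytic and numeric gradients should agree closely, and the loss after the update should be slightly lower than before it.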

## Multiple layers

So far, we have learned how to build a perceptron. However, as you might imagine, a single perceptron does not provide accurate results when applied to large, complex datasets. As the term neural network suggests, one of the main ways of making our model more sophisticated is to use multiple neurons and layers.

To better understand how this works, let's look at the example structure of a 3-layer neural network.

![neural network](https://miro.medium.com/max/875/1*Z3zHoX1nhK6Rsmd4yNPdsg.jpeg)

Even though the structure itself looks far more complex, the working principle remains the same. Each neuron has a weight for each neuron-to-neuron connection and its own bias. The output of each neuron (computed as described in the single-neuron section) is passed on, weighted and summed at the connected neurons (in other words, one neuron's output becomes another's input). After passing through all layers, we obtain the final output, which is then used to measure the loss and start back propagation.
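
The layer-by-layer flow described above can be sketched as a short loop. The layer sizes, the random weights and the choice of ReLU for the hidden layers are assumptions made for this illustration, not values read from the diagram.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

# Hypothetical layer sizes: 4 inputs -> 5 hidden -> 3 hidden -> 2 outputs
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

X = rng.normal(size=(6, 4))  # a batch of 6 made-up samples

a = X
for i, (W, b) in enumerate(zip(weights, biases)):
    z = a @ W + b  # weighted sum for every neuron in the layer
    # ReLU on the hidden layers; the last layer is left linear here
    a = relu(z) if i < len(weights) - 1 else z

print(a.shape)  # (6, 2): one 2-value output per sample
```

Each pass through the loop treats the previous layer's output `a` as the next layer's input, which is exactly the "one neuron's output becomes another's input" idea.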

## Putting it all together

In this exercise, we'll build a simple model with two hidden layers that classifies iris flowers. Defining each layer, such as ReLU or Dense, in a separate class makes it easy to stack the layers.

In [None]:
class ReLU:
    def forward(self, inputs):
        return np.maximum(0, inputs)
    
    def backward(self, inputs, grad_inputs):
        positive_indices = inputs > 0
        return positive_indices * grad_inputs
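
To see what the two methods of `ReLU` do, here is a small sketch on made-up numbers. It repeats the same two expressions outside the class so that it runs on its own; `upstream_grad` is a hypothetical gradient arriving from the next layer.

```python
import numpy as np

inputs = np.array([[-1.0, 2.0],
                   [3.0, -4.0]])
upstream_grad = np.array([[10.0, 20.0],
                          [30.0, 40.0]])

outputs = np.maximum(0, inputs)      # same expression as ReLU.forward
grad = (inputs > 0) * upstream_grad  # same expression as ReLU.backward

print(outputs)  # negative inputs become 0
print(grad)     # the gradient passes only where the input was positive
```

The gradient is blocked wherever the input was not positive, because the output there is constant at zero.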

In [None]:
class SoftmaxCrossEntropy:
    def __init__(self, num_classes=10, is_sparse=False):
        self.softmax_outputs = None
        self.num_classes = num_classes
        self.is_sparse = is_sparse
    
    def softmax(self, logits):
        # subtract the row-wise maximum to prevent overflow in np.exp
        max_logit = np.max(logits, axis=-1, keepdims=True)
        exp = np.exp(logits - max_logit)
        exp_sum = np.sum(exp, axis=-1, keepdims=True)
        return exp / exp_sum

    def convert_sparse_labels_to_one_hots(self, labels):
        """
        converts integer class labels to one-hot vectors
        """
        return np.eye(self.num_classes)[labels]
    
    def crossentropy_loss(self, predictions, labels):
        # clip probabilities away from 0 and 1 so that np.log stays finite
        predictions = np.clip(predictions, 1e-9, 1. - 1e-9)
        return - np.sum(labels * np.log(predictions))
    
    def forward(self, predictions, labels):
        """
        returns a crossentropy loss of softmax
        """
        batch_size = len(predictions)

        self.softmax_outputs = self.softmax(predictions)
        # print(self.softmax_outputs)

        if self.is_sparse:
            labels = self.convert_sparse_labels_to_one_hots(labels)

        # the loss is computed on the softmax probabilities, not on the raw logits
        crossentropy_loss = self.crossentropy_loss(self.softmax_outputs, labels) / batch_size
        return crossentropy_loss
    
    def backward(self, inputs, labels):
        """
        returns softmax - labels
        """
        batch_size = len(inputs)

        if self.is_sparse:
            labels = self.convert_sparse_labels_to_one_hots(labels)

        grad = (self.softmax_outputs - labels) / batch_size
        return grad
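
The reason for subtracting the maximum inside softmax is numerical: `np.exp` overflows for large inputs. The standalone sketch below uses a row-wise maximum and made-up logits to show that shifting the logits does not change the result.

```python
import numpy as np

def softmax(logits):
    # subtract each row's maximum so the largest exponent is exp(0) = 1
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# The first row would overflow np.exp without the shift
logits = np.array([[1000.0, 1001.0, 1002.0],
                   [1.0, 2.0, 3.0]])

probs = softmax(logits)
print(probs)
```

Both rows differ only by a constant offset, so they produce the same probabilities, and each row sums to 1.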

In [None]:
class Dense:
    def __init__(self, num_input_units, num_output_units, learning_rate=0.01):
        xavier_bound = np.sqrt(6 / (num_input_units + num_output_units))
        self.W = np.random.uniform(-xavier_bound, xavier_bound, (num_input_units, num_output_units))
        self.b = np.random.rand(num_output_units)
        self.learning_rate = learning_rate
    
    def forward(self, inputs):
        return inputs @ self.W + self.b
    
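    def backward(self, inputs, grad_inputs):
        # gradient w.r.t. this layer's inputs, computed before W is updated
        grad_outputs = grad_inputs @ self.W.T

        grad_W = inputs.T @ grad_inputs
        # the loss gradient is already divided by the batch size, so sum here
        grad_b = np.sum(grad_inputs, axis=0)
        
        self.W -= self.learning_rate * grad_W
        self.b -= self.learning_rate * grad_b
        
        return grad_outputs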
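
The `xavier_bound` used in `Dense.__init__` can be inspected on its own. The sketch below assumes a layer with 4 inputs and 32 outputs (the sizes of the first layer used later) and a seeded generator for reproducibility; it only illustrates the range the initial weights are drawn from.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

num_in, num_out = 4, 32
bound = np.sqrt(6 / (num_in + num_out))  # sqrt(6 / 36), roughly 0.408

# weights drawn uniformly from [-bound, bound], as in Dense.__init__
W = rng.uniform(-bound, bound, (num_in, num_out))

print(bound)
print(W.min(), W.max())
```

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so this bound gives the weights a variance of $2 / (n_{in} + n_{out})$, which shrinks as the layer gets wider.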

In [None]:
class NeuralNetwork:
    def __init__(self, layers):
        self.layers = layers
        
    def predict(self, X):
        outputs = X
        for layer in self.layers:
            outputs = layer.forward(outputs)
        
        return outputs
    
    def get_mini_batch(self, X, y, batch_size):
        random_indices = np.random.choice(len(X), len(X), replace=False)
        shuffled_X = X[random_indices]
        shuffled_y = y[random_indices]
        
        # + 1 so that the last full batch is kept when len(X) is a multiple of batch_size
        for i in range(0, len(X) - batch_size + 1, batch_size):
            yield shuffled_X[i:i+batch_size], shuffled_y[i:i+batch_size]
    
    def train(self, X, y, num_epochs, batch_size, loss_function):
        num_iterations = len(X) // batch_size
        
        for epoch in range(num_epochs):
            batch_generator = self.get_mini_batch(X, y, batch_size)
            for iteration in range(num_iterations):
                batch_X, batch_y = next(batch_generator)
                loss = self.backpropagate(batch_X, batch_y, loss_function)
                
                if iteration % 3 == 0:
                    print(f"Epoch: {epoch} Iteration: {iteration} / {num_iterations} Loss: {loss}")

            print()
            
                    
    def backpropagate(self, X, y, loss_function):
        # get activation for each layer including the input layer
        activations = [X]  # initialise with the input layer
        for layer in self.layers:
            activation = layer.forward(activations[-1])
            activations.append(activation)
        
        # get loss and its gradient
        predictions = activations[-1]
        loss = loss_function.forward(predictions, y)
        grad = loss_function.backward(predictions, y)
        
        for layer_index in range(len(self.layers))[::-1]:
            layer = self.layers[layer_index]
            grad = layer.backward(activations[layer_index], grad)

        return np.mean(loss)

Let's train our model on the Iris dataset. It's an easy multi-class classification dataset. The task is to classify three different types of iris, given 150 data points with 4 features each.

In [None]:
from sklearn.datasets import load_iris
dataset = load_iris()

x_train = dataset.data
y_train = dataset.target

print(x_train.shape, y_train.shape)

(150, 4) (150,)


In [None]:
num_features = x_train.shape[-1]
num_classes = 3

In [None]:
layers = [
    Dense(num_features, 32),
    ReLU(),
    Dense(32, 32),
    ReLU(),
    Dense(32, num_classes)
]

In [None]:
network = NeuralNetwork(layers)
network.train(x_train, y_train, num_epochs=4, batch_size=4, loss_function=SoftmaxCrossEntropy(num_classes=num_classes, is_sparse=True))

Epoch: 0 Iteration: 0 / 37 Loss: 10.729487079160572
Epoch: 0 Iteration: 3 / 37 Loss: 15.165377040535766
Epoch: 0 Iteration: 6 / 37 Loss: 10.87084199127973
Epoch: 0 Iteration: 9 / 37 Loss: 10.726921646096883
Epoch: 0 Iteration: 12 / 37 Loss: 10.6314229773482
Epoch: 0 Iteration: 15 / 37 Loss: 11.127993131575321
Epoch: 0 Iteration: 18 / 37 Loss: 10.3688210302413
Epoch: 0 Iteration: 21 / 37 Loss: 6.250562720254374
Epoch: 0 Iteration: 24 / 37 Loss: 10.367717317469392
Epoch: 0 Iteration: 27 / 37 Loss: 15.200618721084249
Epoch: 0 Iteration: 30 / 37 Loss: 5.526456463961456
Epoch: 0 Iteration: 33 / 37 Loss: 6.3394019269506625
Epoch: 0 Iteration: 36 / 37 Loss: 1.6010766707781028

Epoch: 1 Iteration: 0 / 37 Loss: 1.1315527013772961
Epoch: 1 Iteration: 3 / 37 Loss: 5.493653808402377
Epoch: 1 Iteration: 6 / 37 Loss: 6.4086806027411285
Epoch: 1 Iteration: 9 / 37 Loss: 0.6840143648487138
Epoch: 1 Iteration: 12 / 37 Loss: 0.34710307245456107
Epoch: 1 Iteration: 15 / 37 Loss: 0.1576412415970579
Epoch: 

## Keras

Building this simple model took a lot of work. We had to implement every part of it, from the cross-entropy loss to the training loop. Besides, the model is sometimes unstable and often encounters NaN loss. And what if you want to use fancy techniques like [Batch Normalisation](https://en.wikipedia.org/wiki/Batch_normalization) or the [Adam Optimiser](https://www.geeksforgeeks.org/intuition-of-adam-optimizer/)?

This is where TensorFlow and Keras come to the rescue. 




In [None]:
import tensorflow as tf
from tensorflow import keras

TensorFlow even lets you load an image dataset called [MNIST](https://www.tensorflow.org/datasets/catalog/mnist). Let's build a simple two-layer model that classifies digits from images.

In [None]:
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
        
# Normalize pixel values
x_train = x_train / 255.0

In [None]:
data_shape = x_train.shape

print(f"There are {data_shape[0]} examples with shape ({data_shape[1]}, {data_shape[2]})")

There are 60000 examples with shape (28, 28)


In [None]:
# Define the model
model = tf.keras.models.Sequential([ 
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
]) 
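
To get a feel for the size of this model, its trainable parameters can be counted by hand. The sketch below is plain arithmetic based on the 28 x 28 input images and the two `Dense` layers defined above; each dense layer has one weight per input-output pair plus one bias per output unit.

```python
input_units = 28 * 28  # Flatten turns each 28 x 28 image into 784 values
hidden_units = 128
output_units = 10

hidden_params = input_units * hidden_units + hidden_units   # 100480
output_params = hidden_units * output_units + output_units  # 1290
total_params = hidden_params + output_params                # 101770

print(total_params)
```

Almost all of the parameters sit in the first dense layer, because every one of the 784 pixels is connected to every hidden unit.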

In [None]:
# Compile the model
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy']) 
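
The word "sparse" in the loss name refers to the label format: labels are integer class indices rather than one-hot vectors. The NumPy sketch below illustrates the idea on made-up probabilities; it is not Keras's implementation, only a demonstration that both label formats lead to the same cross-entropy value.

```python
import numpy as np

# Made-up predicted probabilities for 2 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])  # sparse labels: one class index per sample

# Sparse version: pick the probability of the true class directly
picked = probs[np.arange(len(labels)), labels]
loss_sparse = -np.mean(np.log(picked))

# One-hot version: same result, with explicit one-hot vectors
one_hot = np.eye(3)[labels]
loss_one_hot = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

print(loss_sparse, loss_one_hot)
```

This is the same conversion that `convert_sparse_labels_to_one_hots` performed in our hand-written loss class.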

In [None]:
# Fit the model for 10 epochs
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f114039eb10>

With only a few lines of code, we could train a model that classifies digits with 99.5% accuracy on the training data. From now on, we will use this framework to build various models such as recurrent neural networks, convolutional neural networks, etc.