# Artificial Neural Networks  

**Import libraries**  

In [1]:
import numpy as np
seed = 42
rng = np.random.default_rng(seed=seed)

## Network Class  

We shall start by writing the `Network` class. The two methods that are indispensable for any ML class are:  

* `fit`  
* `predict`  

Fitting a neural network model requires two passes over the data:  

* `forward`  
* `backward`  

We need a starting point that sets up the network and its hyperparameters, which calls for an `init` method:  

* `init`  

Most of these methods rely on a few `helper` functions:  

* `activations`  
* `losses`  

That is the order in which the pieces were introduced, but we will work through them in the reverse order, starting from the helpers, so that no step refers to something that has not yet been defined:  

* `helpers` --> `init` --> `forward` --> `backward` --> `fit` --> `predict`  

The skeleton of the class is given in the code block below. For ease of exposition, we are going to discuss the methods one at a time and then plug them into the class right at the end.  

### Skeleton of Network Class  

In [4]:
class Network:

    def __init__(self, layers, activation_choice = 'relu',
                        output_choice = 'softmax',
                        loss_choice = 'cce'):
        pass

    def forward(self, X):
        pass

    def backward(self, Y, Y_hat):
        pass

    def fit(self, X, Y, lr = 0.01,
                    epochs = 100,
                    batch_size = 100):
        pass

    def predict(self, X):
        pass

## Activation functions  

### Hidden layer  

We will look at two activation functions for the hidden layers. Both these functions will be applied element-wise. The input to these functions can be scalars, vectors or matrices.  

* `Sigmoid`  
* `ReLU`  

We also need the derivatives of these functions while computing the backward pass. Deriving the mathematical expressions for them is left as an exercise for the learner.  

In [6]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def grad_sigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))

def relu(z):
    return np.where(z >= 0, z, 0)

def grad_relu(z):
    return np.where(z >= 0, 1, 0)

# A dictionary of activation functions will be used while initializing the network  
hidden_act = {'sigmoid': sigmoid, 'relu': relu}
grad_hidden_act = {'sigmoid': grad_sigmoid, 'relu': grad_relu}
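One way to sanity-check the derivatives is to compare them against a central finite-difference estimate. The sketch below repeats the four functions so that it runs on its own; `numerical_grad` and its step size `eps` are hypothetical names introduced only for this check, and the test points are chosen away from zero because ReLU is not differentiable there.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def grad_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.where(z >= 0, z, 0)

def grad_relu(z):
    return np.where(z >= 0, 1, 0)

def numerical_grad(f, z, eps=1e-6):
    # Central difference, applied element-wise (hypothetical helper)
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Test points away from 0, where ReLU has no derivative
z = np.array([[-2.0, -0.5],
              [0.5, 3.0]])

sigmoid_ok = np.allclose(grad_sigmoid(z), numerical_grad(sigmoid, z), atol=1e-6)
relu_ok = np.allclose(grad_relu(z), numerical_grad(relu, z), atol=1e-6)
print(sigmoid_ok, relu_ok)
```

If either flag comes out `False`, the analytic expression and the function it is supposed to differentiate disagree somewhere on the test points.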

### Output layer  

We will look at two activation functions for the output layer.  

* `Identity` for regression  
* `Softmax` for classification  

**Note**: In softmax, to avoid overflow, we will subtract the row-wise maximum from each row while computing the softmax.  

In [7]:
def identity(z):
    return z

def softmax(z):
    '''Row-wise softmax'''
    # Check if z is a matrix
    assert z.ndim == 2
    
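    # To prevent overflow, subtract the row-wise maximum from each row.
    # A new array is created so that the caller's input is not modified.
    z = z - z.max(axis=1, keepdims=True)

    # Compute row-wise softmax
    exp_z = np.exp(z)
    prob = exp_z / exp_z.sum(axis=1, keepdims=True)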

    # Check if the row probability is a probability distribution
    assert np.allclose(prob.sum(axis=1), np.ones(z.shape[0]))
    return prob

output_act = {'softmax': softmax, 'identity': identity}
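To see why the row-wise maximum is subtracted, the following sketch compares a softmax without the shift against one with it on a row of large logits. Both `naive_softmax` and `stable_softmax` are illustrative names defined here so that the block runs on its own; they are not part of the class being built.

```python
import numpy as np

def naive_softmax(z):
    # No max subtraction: np.exp overflows to inf for large entries
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def stable_softmax(z):
    # Subtracting each row's maximum leaves the result mathematically unchanged
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# The two rows differ only by a constant shift of 999
z = np.array([[1000.0, 1001.0, 1002.0],
              [1.0, 2.0, 3.0]])

# Silence the overflow warnings that the naive version triggers
with np.errstate(over='ignore', invalid='ignore'):
    naive = naive_softmax(z)    # first row evaluates inf / inf

stable = stable_softmax(z)

print(np.isnan(naive[0]).all())             # naive version breaks on row 0
print(np.allclose(stable[0], stable[1]))    # shifted rows give the same probabilities
```

Because softmax is invariant to adding a constant to every entry of a row, the shift changes nothing mathematically; it only keeps the exponents at or below zero.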

## Loss  

There are two types of losses we will use:  

* `Least square error` for regression  
* `Categorical cross-entropy` loss for classification  

[video link](https://youtu.be/q1GTG13OgNY?t=345)  

In [9]:
def least_square(y, y_hat):
    return 0.5 * np.sum((y_hat - y) * (y_hat - y))

def cce(Y, Y_hat):      # Note that capital Y is used to denote that Y is a matrix with shape n*k
    return -np.sum(Y * np.log(Y_hat))

losses = {'least_square': least_square, 'cce': cce}
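A small worked example may help fix the conventions. The sketch below repeats both loss functions so that it is self-contained and evaluates them on hand-picked toy values (the arrays are made up for illustration), where the expected results can be verified by hand.

```python
import numpy as np

def least_square(y, y_hat):
    return 0.5 * np.sum((y_hat - y) * (y_hat - y))

def cce(Y, Y_hat):
    return -np.sum(Y * np.log(Y_hat))

# Regression: the errors are 1 and -2, so the loss is 0.5 * (1 + 4) = 2.5
y = np.array([1.0, 2.0])
y_hat = np.array([2.0, 0.0])
reg_loss = least_square(y, y_hat)

# Classification: two samples, three classes, one-hot labels
Y = np.array([[1, 0, 0],
              [0, 0, 1]])
Y_hat = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.25, 0.5]])
# Only the probability of the true class contributes:
# -(log 0.5 + log 0.5) = 2 * log 2
clf_loss = cce(Y, Y_hat)

print(reg_loss, clf_loss)
```

Note that both losses are sums over the samples rather than averages, and that `cce` as written assumes every predicted probability for a true class is strictly positive.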

## Initialization  

Here, we will look at two parts:  

* Network architecture  
* Weight initialization  

`Network architecture`  

The following components mainly determine the structure of the network:  
* number of layers  
* number of neurons per layer  

We will use `l` to index the layers. Not counting the input layer, the network has `L` layers in all.  

* `l = 0`: Input layer  
* `1 <= l <= L - 1`: Hidden layers  
* `l = L`: Output layer  

We shall represent the number of layers and the number of neurons in each layer using a list `layers`, so that `len(layers) == L + 1`. The variable `L` will never make an explicit appearance anywhere; instead, we will use `range(len(layers))` to iterate through the layers.  
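As an illustration of this indexing, the sketch below walks through a hypothetical `layers` list with 64 inputs, 5 hidden neurons and 10 outputs, labelling each index with its role. The `roles` list is introduced only for this example.

```python
# Hypothetical architecture: 64 inputs, one hidden layer of 5 neurons, 10 outputs
layers = [64, 5, 10]

roles = []
for l in range(len(layers)):
    if l == 0:
        role = 'input'
    elif l == len(layers) - 1:
        role = 'output'
    else:
        role = 'hidden'
    roles.append(role)
    print(f'l = {l}: {role} layer with {layers[l]} neurons')
```

Here `L` would be `len(layers) - 1 = 2`, even though the variable itself never appears in the code.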

One useful task is to compute the total number of parameters in the network. This will come in handy later on.  

In [10]:
def count_params(layers):
    num_params = 0
    for l in range(1, len(layers)):
        num_weights = layers[l-1] * layers[l]
        num_biases = layers[l]
        num_params += (num_weights + num_biases)
    return num_params

# Test count_params  
assert count_params([64, 5, 10]) == (64 * 5 + 5) + (5 * 10 + 10)

`Parameter initialization`  

