## Implement a neural net

In [1]:
import numpy as np
import numpy.random as rd
from scipy.special import expit, logit, softmax

In [2]:
# this is a simple neural network with one hidden layer with n neurones and an output layer
class Simple_nnet:

    def __init__(self, input_shape , hidden_layer_shape = 16, output_shape = 2) :
        self.input_shape = input_shape
        self.hidden_layer_shape = hidden_layer_shape
        self.output_shape = output_shape

        # init weights and bias for hidden layer with random numbers
        self.h_weights = rd.rand(hidden_layer_shape, input_shape)
        self.h_bias = rd.rand(hidden_layer_shape, 1)

        # init weights and bias for output layer with random numbers
        self.o_weights = rd.rand(output_shape, hidden_layer_shape)
        self.o_bias = rd.rand(output_shape, 1)

    def process_input(self, input):
        # reshape input
        input = np.reshape(input, (input.shape[0], 1))
        # calculate hidden layer neurones; the @ operator performs matrix multiplication
        a1 = (self.h_weights @ input )+ self.h_bias
        a1 = expit(a1) # apply sigmoid function

        # calculate output layer; the @ operator performs matrix multiplication
        a2 = (self.o_weights @ a1 )+ self.o_bias
        a2 = softmax(a2) # apply softmax function
        
        return a2

    def __str__(self):
        return str(self.input_shape) + " --> " + str(self.hidden_layer_shape) + " --> " + str(self.output_shape)



In [3]:
nnet1 = Simple_nnet(4, 8)

In [4]:
input1 = np.array([1,1,1,1])

In [5]:
nnet1.process_input(input1)

array([[0.56470204],
       [0.43529796]])

In [6]:
softmax([10, 20])[0] + softmax([10, 20])[1]

1.0

Cost function (the output layer has $ n_o $ neurones): 

$$
  C =  \frac{1}{2} \sum_{i=1}^{n_o} (\hat{y}_{i} - y_i)^2
$$

So:

$$
\frac{\partial{C}}{\partial{\hat{y}_{i}}} = \hat{y}_{i} - y_i
$$
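
As a quick sanity check, the derivative above can be compared with a finite-difference estimate. This is only a minimal sketch: the helper names `cost` and `cost_gradient` and the example vectors are assumptions made for illustration and are not part of the notebook's own code.

```python
import numpy as np

def cost(y_hat, y):
    # C = 1/2 * sum_i (y_hat_i - y_i)^2
    return 0.5 * np.sum((y_hat - y) ** 2)

def cost_gradient(y_hat, y):
    # dC/dy_hat_i = y_hat_i - y_i
    return y_hat - y

# illustrative values (assumed, not taken from the notebook)
y_hat = np.array([0.8, 0.2])
y = np.array([1.0, 0.0])

# central finite difference, one component of y_hat at a time
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(y_hat.size):
    step = np.zeros_like(y_hat)
    step[i] = eps
    numeric[i] = (cost(y_hat + step, y) - cost(y_hat - step, y)) / (2 * eps)

print(cost_gradient(y_hat, y))  # analytic gradient
print(numeric)                  # numerical estimate, should be very close
```

If the two printed vectors agree to several decimal places, the formula is consistent with the cost definition.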

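Activation for the last layer (output), taking $ \sigma $ to be the sigmoid (note that `process_input` above applies softmax to the output instead): <br>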
$$ 
a_i^{o} = \sigma (z_i^{o}) = \hat{y}_{i} 
$$

where :
$$
z_i^{o} = \sum_{k=1}^{n_H} W_{i, k}^{o}.a_{k}^{H} + b_{i}^{o}
$$


So, let's calculate the gradient of the cost with respect to the bias $ b_i^{o} $. <br>
For this, we need to understand how each term depends on the others so that we can use the chain rule. <br>
The notation Y &rarr; X means that Y depends on X; in that case we can calculate the derivative of Y with respect to X: $  \frac{\partial{Y}}{\partial{X}} $. <br>
In our case, we have these dependencies (see the formulas above): C &rarr; $ a_{i}^{o} $ &rarr; $ z_{i}^{o} $ &rarr; $ b_{i}^{o} $. <br>
So we can use the chain rule to calculate the gradient of the cost with respect to $ b_i^{o} $:

$$
\frac{\partial{C}}{\partial{b_{i}^{o}}} = \frac{\partial{C}}{\partial{a_{i}^{o}}} \times \frac{\partial{a_{i}^{o}}}{\partial{z_{i}^{o}}} \times \frac{\partial{z_{i}^{o}}}{\partial{b_{i}^{o}}}
$$

Let's calculate each term on the right-hand side separately:

$$
\frac{\partial{C}}{\partial{a_{i}^{o}}} = (a_{i}^{o} - y_{i})
$$

$$
\frac{\partial{a_{i}^{o}}}{\partial{z_{i}^{o}}} = \sigma^{'}(z_{i}^{o}) = \sigma(z_{i}^{o}) \times (1 - \sigma(z_{i}^{o})) = a_{i}^{o} \times (1 - a_{i}^{o})
$$

$$
 \frac{\partial{z_{i}^{o}}}{\partial{b_{i}^{o}}} = 1
$$
So : 

$$
    \frac{\partial{C}}{\partial{b_{i}^{o}}} = (a_{i}^{o} - y_{i}) \times a_{i}^{o} \times (1 - a_{i}^{o})
$$
NB: Remember that $ a_{i}^{o} = \hat{y}_{i} $. <br>
The following function implements exactly this:

In [7]:
def gradient_output_bias(y_hat, y):

    """
        Calculate the gradient of the cost function with respect to the bias of the output layer.
        The cost function is: 1/2 * sum((y_hat_i - y_i)^2), i = 1,...,n_o, where n_o is the length of the output
        y_hat : output vector
        y : ground truth vector of the same length as y_hat
    """

    return (y_hat-y)*y_hat*(1-y_hat)

# test function
y_hat = np.array([1, 2])
y = np.array([0.5, 1.5])
gradient_output_bias(y_hat, y)

array([ 0., -1.])
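
One way to gain confidence in this formula is to compare it with a central finite-difference estimate of the gradient. The sketch below assumes a sigmoid output layer, as in the derivation above. The helpers `sigmoid` and `cost_from_bias`, the random test values and the seed are assumptions made for illustration; `gradient_output_bias` is repeated only so that the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_output_bias(y_hat, y):
    # same formula as the function above
    return (y_hat - y) * y_hat * (1 - y_hat)

def cost_from_bias(b_o, W_o, a_h, y):
    # hypothetical helper: cost as a function of the output bias only
    y_hat = sigmoid(W_o @ a_h + b_o)
    return 0.5 * np.sum((y_hat - y) ** 2)

# small random problem: 3 hidden neurones, 2 outputs (assumed sizes)
rng = np.random.default_rng(0)
W_o = rng.random((2, 3))
b_o = rng.random(2)
a_h = rng.random(3)
y = np.array([1.0, 0.0])

analytic = gradient_output_bias(sigmoid(W_o @ a_h + b_o), y)

# central finite difference on each bias component
eps = 1e-6
numeric = np.zeros_like(b_o)
for i in range(b_o.size):
    step = np.zeros_like(b_o)
    step[i] = eps
    c_plus = cost_from_bias(b_o + step, W_o, a_h, y)
    c_minus = cost_from_bias(b_o - step, W_o, a_h, y)
    numeric[i] = (c_plus - c_minus) / (2 * eps)

# prints whether the analytic and numerical gradients agree
print(np.allclose(analytic, numeric, atol=1e-8))
```

Unlike the quick test above, the values of `y_hat` here come from a sigmoid, so they stay in the interval (0, 1) that the derivation assumes.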

Now let's move on to the gradients with respect to the weights of the output layer:
$$
\frac{\partial{C}}{\partial{W_{i, k}^{o}}}
$$

For this we will again use the chain rule:
$$
\frac{\partial{C}}{\partial{W_{i, k}^{o}}} = \frac{\partial{C}}{\partial{a_{i}^{o}}} \times \frac{\partial{a_{i}^{o}}}{\partial{z_{i}^{o}}} \times \frac{\partial{z_{i}^{o}}}{\partial{W_{i, k}^{o}}}
$$

The first two terms were already calculated (see above):

$$
\frac{\partial{C}}{\partial{a_{i}^{o}}} = (a_{i}^{o} - y_{i})
$$

$$
    \frac{\partial{a_{i}^{o}}}{\partial{z_{i}^{o}}} =  a_{i}^{o} \times (1 - a_{i}^{o})
$$

For the third term, we know that : $ z_i^{o} = \sum_{k=1}^{n_H} W_{i, k}^{o}.a_{k}^{H} + b_{i}^{o} $, so : 

$$
    \frac{\partial{z_{i}^{o}}}{\partial{W_{i, k}^{o}}} = a_{k}^{H}
$$

Therefore:
$$
    \frac{\partial{C}}{\partial{W_{i, k}^{o}}} = (a_{i}^{o} - y_{i}) \times a_{i}^{o} \times (1 - a_{i}^{o}) \times a_{k}^{H}
$$

<b> NB: Remember that $ a_{i}^{o} = \hat{y}_{i} $ </b>

Let's implement these calculations in a function:

In [8]:
def gradient_output_weights(a_o, a_h, y):

    """
        Calculate the gradient of the cost function with respect to the weights of the output layer.
        a_o : vector of dimension no, activation of output ( = y_hat)
        a_h : vector of dimension nh, activation of hidden layer
        y : ground truth vector of the same length as y_hat
    """

    # first we need to get dimensions of a_o and a_h
    no = len(a_o) # equals dim of y 
    nh = len(a_h)

    # intermediate calculation (product of the first three factors in the equation above)
    temp = (a_o - y) * a_o * (1 - a_o) # here we get a vector of length no

    # reshape vectors so that they can be matrix-multiplied
    temp = temp.reshape((no, 1))
    a_h = a_h.reshape((1, nh))

    return temp @ a_h

In [9]:
# test the function
a_o_1 = np.array([1, 2])
y_1 = np.array([0.5, 1.9])
a_h_1 = np.array([15, 20, 16])

gradient_output_weights(a_o_1, a_h_1, y_1)

array([[ 0. ,  0. ,  0. ],
       [-3. , -4. , -3.2]])
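
The column-times-row product used above is an outer product, so the same gradient can also be written with `np.outer` and checked entry by entry against a finite-difference estimate. This is a sketch under the same assumption as the derivation (sigmoid output layer). The name `gradient_output_weights_outer`, the helper `cost_from_weights`, the random values and the seed are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_output_weights_outer(a_o, a_h, y):
    # same formula as gradient_output_weights, written as an outer product
    delta = (a_o - y) * a_o * (1 - a_o)   # vector of length n_o
    return np.outer(delta, a_h)           # matrix of shape (n_o, n_H)

def cost_from_weights(W_o, b_o, a_h, y):
    # hypothetical helper: cost as a function of the output weights only
    a_o = sigmoid(W_o @ a_h + b_o)
    return 0.5 * np.sum((a_o - y) ** 2)

rng = np.random.default_rng(1)
W_o = rng.random((2, 3))
b_o = rng.random(2)
a_h = rng.random(3)
y = np.array([1.0, 0.0])

analytic = gradient_output_weights_outer(sigmoid(W_o @ a_h + b_o), a_h, y)

# central finite difference on each weight W_o[i, k]
eps = 1e-6
numeric = np.zeros_like(W_o)
for i in range(W_o.shape[0]):
    for k in range(W_o.shape[1]):
        step = np.zeros_like(W_o)
        step[i, k] = eps
        c_plus = cost_from_weights(W_o + step, b_o, a_h, y)
        c_minus = cost_from_weights(W_o - step, b_o, a_h, y)
        numeric[i, k] = (c_plus - c_minus) / (2 * eps)

# prints whether the analytic and numerical gradients agree
print(np.allclose(analytic, numeric, atol=1e-8))
```

The result has the same shape as `o_weights`, which is what a gradient-descent update of the output weights would need.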

Now let's move on to the gradients with respect to the bias of the hidden layer:
$$
\frac{\partial{C}}{\partial{b_{i}^{H}}} , i = 1,... n_{H}
$$

To calculate this, we need to know how C depends on $ b_i^H $, $ i = 1,...n_H $: <br>

A change in $ b_{i}^{H} $ changes $ z_{i}^{H} $, which changes $ a_{i}^{H} $, which changes every $ z_{j}^{o} $ ($ z_{1}^{o} $ to $ z_{n_o}^{o} $), which changes C. <br>

This means that a change in $ b_{i}^{H} $ changes all $ z_{j}^{o} $, from j = 1 to ${n_o}$, and each of these contributes to the change in C. <br>

So, with the $ n_o = 2 $ output neurones used here, we can write:

$$
\frac{\partial{C}}{\partial{b_{i}^{H}}} = \frac{\partial{C}}{\partial{z_{1}^{o}}} \times \frac{\partial{z_{1}^{o}}}{\partial{b_{i}^{H}}} + \frac{\partial{C}}{\partial{z_{2}^{o}}} \times \frac{\partial{z_{2}^{o}}}{\partial{b_{i}^{H}}}
$$

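This can be written in matrix form:

$$
\frac{\partial{C}}{\partial{b^{H}}} = (\frac{\partial{C}}{\partial{z^{o}}})^{trans} \times \frac{\partial{z^{o}}}{\partial{b^{H}}}
$$
where trans stands for transpose and $ \frac{\partial{z^{o}}}{\partial{b^{H}}} $ is the $ n_o \times n_H $ matrix of the derivatives $ \frac{\partial{z_{j}^{o}}}{\partial{b_{i}^{H}}} $.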
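
The matrix form can be sketched in NumPy as follows. This goes one step further than the text above and is only an illustration under stated assumptions: both layers use a sigmoid, so that $ \frac{\partial{z_{j}^{o}}}{\partial{b_{i}^{H}}} = W_{j, i}^{o} \times a_{i}^{H} \times (1 - a_{i}^{H}) $ by the chain rule, and $ \frac{\partial{C}}{\partial{z_{j}^{o}}} $ is the product of the first two terms computed for the output bias. The names `forward`, `cost` and `gradient_hidden_bias`, the layer sizes and the seed are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_h, b_h, W_o, b_o):
    # hidden and output activations, sigmoid on both layers (assumption)
    a_h = sigmoid(W_h @ x + b_h)
    a_o = sigmoid(W_o @ a_h + b_o)
    return a_h, a_o

def cost(a_o, y):
    return 0.5 * np.sum((a_o - y) ** 2)

def gradient_hidden_bias(a_o, a_h, W_o, y):
    dC_dz_o = (a_o - y) * a_o * (1 - a_o)   # vector of length n_o
    # Jacobian dz_o/db_h, shape (n_o, n_H): entry [j, i] = W_o[j, i] * a_h[i] * (1 - a_h[i])
    dz_o_db_h = W_o * (a_h * (1 - a_h))
    # (dC/dz_o)^T times the Jacobian gives a vector of length n_H
    return dC_dz_o @ dz_o_db_h

rng = np.random.default_rng(2)
n_in, n_h, n_o = 4, 3, 2
W_h, b_h = rng.random((n_h, n_in)), rng.random(n_h)
W_o, b_o = rng.random((n_o, n_h)), rng.random(n_o)
x = np.ones(n_in)
y = np.array([1.0, 0.0])

a_h, a_o = forward(x, W_h, b_h, W_o, b_o)
analytic = gradient_hidden_bias(a_o, a_h, W_o, y)

# central finite difference on each hidden bias component
eps = 1e-6
numeric = np.zeros_like(b_h)
for i in range(n_h):
    step = np.zeros_like(b_h)
    step[i] = eps
    c_plus = cost(forward(x, W_h, b_h + step, W_o, b_o)[1], y)
    c_minus = cost(forward(x, W_h, b_h - step, W_o, b_o)[1], y)
    numeric[i] = (c_plus - c_minus) / (2 * eps)

# prints whether the analytic and numerical gradients agree
print(np.allclose(analytic, numeric, atol=1e-8))
```

The finite-difference comparison is there because this step sums over all output neurones, which is where an index mix-up is most likely to slip in.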