# Let's build a class that defines a *feed-forward artificial neural network*.

Such a "network" is essentially a function, potentially a very complicated one with a large number of adjustable parameters, that is used to approximate (or interpolate) other functions.

A neural network is made up of neurons that are connected to each other. A feed-forward artificial neural network takes inputs, transforms them by passing them through its "hidden" layers, and produces outputs.

A feed-forward artificial neural network looks like the following figure:

<img src="misc/FFANN.png" style="height:250px">

where the triangles represent the action of the neuron (some simple function we define), the coloured circles represent the "weights", and the empty ones represent the "biases". The weights and biases are adjustable parameters that we tune in order to get the outputs we need from given inputs. Note that a neuron is not connected to every other neuron: the network is organised in "layers", with each layer taking as input the output of the previous one.

Every neuron takes a number of inputs coming from the previous layer. The neuron then does the following:

<img src="misc/Neuron.png" style="height:450px">

That is, it takes the outputs of the previous layer ($l$) and the weights that represent the strength of the connection between node $i$ of layer $l$ and node $j$ of layer $l+1$ (all these inputs and outputs are called "signals"). It then adds up the different contributions and returns the output:
$$
s^{(l+1)}_{j} = \theta\left( \displaystyle \sum_{i=0}^{n^{(l)}-1} w^{(l)}_{j i}  s^{(l)}_{i} + b^{(l+1)}_{j}\right) \; , 
$$

with $\theta$ the "activation function", which is some simple function we choose.
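
To make the formula concrete, here is a minimal sketch of what one layer of neurons computes. The names `layer_output` and `sigmoid`, and the choice of a sigmoid as $\theta$, are assumptions made for illustration only; they are not part of the `FFANN` class.

```python
import math

def layer_output(weights, biases, signals, activation):
    """Return s^(l+1) given s^(l): one value per neuron j of the next layer."""
    outputs = []
    for w_j, b_j in zip(weights, biases):
        # weighted sum over the nodes i of layer l, plus the bias of neuron j
        z = sum(w_ji * s_i for w_ji, s_i in zip(w_j, signals)) + b_j
        outputs.append(activation(z))
    return outputs

def sigmoid(z):
    # one possible choice for the activation function theta (an assumption)
    return 1.0 / (1.0 + math.exp(-z))

# two inputs feeding three neurons (all numbers are arbitrary)
s_l = [0.5, -1.0]
w_l = [[0.1, 0.2], [0.0, -0.3], [1.0, 1.0]]
b_next = [0.0, 0.1, 0.5]
s_next = layer_output(w_l, b_next, s_l, sigmoid)
```

Here `w_l[j][i]` plays the role of $w^{(l)}_{ji}$ and `b_next[j]` the role of $b^{(l+1)}_{j}$.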


# The ```FFANN``` class

The class ```FFANN``` defines a feed-forward artificial neural network with an arbitrary number of input/output nodes and hidden layers.


## Functions and conventions


The constructor is called using

```FFANN( number_of_input_nodes, number_of_output_nodes, hidden_nodes, activation_functions)```

```hidden_nodes``` is a list that contains the number of nodes in each hidden layer 
(e.g. a list like [2,3,5] means that we have 3 hidden layers with 2, 3, and 5 nodes, respectively).

The weights, biases, and signals are stored in ```FFANN.weights```, ```FFANN.biases```, and ```FFANN.signals```. For the moment they can be initialized using ```FFANN.init_params```, but one can define their own initialization. It is **important** that the biases of the $l^{\rm th}$ layer correspond to ```FFANN.biases[l-1]```, because the input nodes do not have a bias.

```activation_functions``` is a list with the activation functions of the nodes. Again (**important**), the activations of the $l^{\rm th}$ layer correspond to ```FFANN.activations[l-1]```, because one fewer activation is needed than there are layers (the input layer has none).
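
The index conventions can be illustrated with a small stand-alone sketch. The function `init_params_sketch` and the nested-list layout `weights[l][j][i]` are assumptions chosen to mirror the notation $w^{(l)}_{ji}$ and $b^{(l+1)}_{j}$; the actual ```FFANN.init_params``` may store things differently.

```python
import random

def init_params_sketch(n_inputs, n_outputs, hidden_nodes, seed=0):
    """Hypothetical stand-in for FFANN.init_params: random weights and biases."""
    rng = random.Random(seed)
    # number of nodes in every layer, l = 0 (input) ... N-1 (output)
    nodes = [n_inputs] + list(hidden_nodes) + [n_outputs]
    # weights[l][j][i]: connection from node i of layer l to node j of layer l+1
    weights = [[[rng.uniform(-1, 1) for _ in range(nodes[l])]
                for _ in range(nodes[l + 1])]
               for l in range(len(nodes) - 1)]
    # biases[l][j]: bias of node j of layer l+1 (the input layer has none)
    biases = [[rng.uniform(-1, 1) for _ in range(nodes[l + 1])]
              for l in range(len(nodes) - 1)]
    return nodes, weights, biases

# 2 inputs, 1 output, hidden layers with 2, 3 and 5 nodes
nodes, weights, biases = init_params_sketch(2, 1, [2, 3, 5])
```

With this layout, a network of $N$ layers has $N-1$ weight matrices and $N-1$ bias vectors, and the biases of layer $l$ sit at index `l-1`.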

### Evaluation

The different ways to get the output of the network are:

1. ```FFANN.evaluate(x)```, which calculates *only* the signals.

1. ```FFANN(x)```, which calculates the signals and the "local" derivatives of the network with ```self.signals[0]=x```. This stores all the signals in ```FFANN.signals``` and the local derivatives in ```self.derivatives[l][j][i]``` $= \dfrac{\partial s^{(l+1)}_{j}}{\partial s^{(l)}_{i}}$.

1. ```FFANN.feedForwardDerivatives()```, which calculates the signals and the "local" derivatives of the network (at the current ```FFANN.signals[0]```). This feed-forward function also fills ```self.totalDerivatives```, with ```self.totalDerivatives[l][j][i]``` $= \dfrac{\partial s^{(l+1)}_{j}}{\partial s^{(0)}_{i}}$.
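
The following is a minimal sketch of how signals, local derivatives, and total derivatives can be accumulated in a single forward pass. It is not the `FFANN` implementation: the function names, the nested-list layout, and the use of a sigmoid in every layer are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[j][k] * b[k][i] for k in range(len(b)))
             for i in range(len(b[0]))] for j in range(len(a))]

def feed_forward_derivatives(weights, biases, x):
    signals = [list(x)]
    derivatives = []        # derivatives[l][j][i] = d s^(l+1)_j / d s^(l)_i
    total_derivatives = []  # total_derivatives[l][j][i] = d s^(l+1)_j / d s^(0)_i
    for w, b in zip(weights, biases):
        s = signals[-1]
        z = [sum(w_ji * s_i for w_ji, s_i in zip(w_j, s)) + b_j
             for w_j, b_j in zip(w, b)]
        signals.append([sigmoid(z_j) for z_j in z])
        # local Jacobian: theta'(z_j) * w_ji
        local = [[sigmoid_prime(z_j) * w_ji for w_ji in w_j]
                 for w_j, z_j in zip(w, z)]
        derivatives.append(local)
        # chain rule: multiply the new local Jacobian onto the running product
        total_derivatives.append(local if not total_derivatives
                                 else matmul(local, total_derivatives[-1]))
    return signals, derivatives, total_derivatives

weights = [[[0.5, -0.2], [0.3, 0.8]], [[1.0, -1.0]]]  # 2 inputs, 2 hidden, 1 output
biases = [[0.1, -0.1], [0.0]]
signals, derivatives, total = feed_forward_derivatives(weights, biases, [0.4, 0.7])
```

The last entry, `total[-1][j][i]`, is the derivative of output $j$ with respect to input $i$.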

After ```FFANN(x)``` one can call ```FFANN.backPropagation```. This fills ```FFANN.Delta[f][j][i]``` $= \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{[N-(f+2)]}_{i}}$, with $N$ the total number of layers ($N-1$ is the output layer since we start counting from $l=0$) and $f=0,1,2,\dots,N-2$. *Notice that* ```Delta``` is like ```totalDerivatives``` but backwards (hence **backPropagation**).

### Derivatives

As already mentioned, there are two different ways to get the derivatives.

1. Call ```FFANN.feedForwardDerivatives()```. After this you can find
$\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(0)}_{i}}$ in ```FFANN.totalDerivatives[N-2][j][i]```.

2. ```FFANN(x)``` and then ```FFANN.backPropagation()```. After this you can find
$\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(0)}_{i}}$ in ```FFANN.Delta[N-2][j][i]```.


Once you run ```FFANN.backPropagation()```, you can also calculate the derivatives with respect to the weights and biases. Simply run ```FFANN.derivative_bw(l,j,i)```, which stores $\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}}$ and $\dfrac{\partial s^{(N-1)}_{r}}{\partial b^{(l+1)}_{j}}$ (for every output node $r$) in ```FFANN.dsdw``` and ```FFANN.dsdb```, respectively.


---


#### Why two kinds of "derivatives"?

Both use the chain rule to obtain derivatives. However:
    
1. ```FFANN.totalDerivatives``` is obtained by applying the chain-rule during feed forward 
(i.e. at the same time as the calculation of the signals!). Therefore, this is useful when the network is already trained, and we just need its derivatives (since they are calculated during the feed forward). 

2. ```FFANN.Delta``` is obtained by applying the chain-rule during back propagation
(i.e. its elements are $\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{[N-(f+2)]}_{i}}$ 
with $N$ the total number of layers, $f=N-(l+3)$ and $l<N-2$).
This is useful as we train the network, as it cannot be calculated during feed forward.


The good thing is that both can be used to obtain the derivative of the outputs with respect to the inputs. That is:
$\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(0)}_{i}}$ = ```FFANN.Delta[N-2][j][i]``` = ```FFANN.totalDerivatives[N-2][j][i]```

----

## Numerical derivatives

For testing, debugging, or any other purpose, the numerical derivatives can be obtained using 
```FFANN.totalNumericalDerivative()```, which stores $\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(0)}_{i}}$ in ```FFANN.numericalDerivatives[j][i]```. 

The numerical derivatives $\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}}$ and $\dfrac{\partial s^{(N-1)}_{r}}{\partial b^{(l+1)}_{j}}$ (for every output node $r$) are stored in ```FFANN.numerical_dsdw``` and ```FFANN.numerical_dsdb``` after calling ```FFANN.numericalDerivative_bw(l,j,i)```. Note that this function has to be called after (at least) ```FFANN.evaluate```, since it relies on the fact that only the signals after the $(l+1)^{\rm th}$ layer need to be re-calculated.
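
As an illustration of the idea, the sketch below approximates the derivatives of the outputs with respect to the inputs using central differences. The scheme, the step size `h`, the `tanh` activation, and the function names are all assumptions; the source does not say which finite-difference formula ```FFANN.totalNumericalDerivative()``` uses.

```python
import math

def evaluate(weights, biases, x):
    """Signals of the output layer for input x (tanh activation everywhere)."""
    s = list(x)
    for w, b in zip(weights, biases):
        s = [math.tanh(sum(w_ji * s_i for w_ji, s_i in zip(w_j, s)) + b_j)
             for w_j, b_j in zip(w, b)]
    return s

def total_numerical_derivative(weights, biases, x, h=1e-5):
    """Central differences: result[j][i] approximates d s^(N-1)_j / d s^(0)_i."""
    n_out = len(weights[-1])
    result = [[0.0] * len(x) for _ in range(n_out)]
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        out_plus = evaluate(weights, biases, x_plus)
        out_minus = evaluate(weights, biases, x_minus)
        for j in range(n_out):
            result[j][i] = (out_plus[j] - out_minus[j]) / (2.0 * h)
    return result

# single neuron: s = tanh(2*x0 - x1), so ds/dx0 = 2*(1 - s^2) and ds/dx1 = -(1 - s^2)
weights = [[[2.0, -1.0]]]
biases = [[0.0]]
x = [0.3, 0.1]
numerical = total_numerical_derivative(weights, biases, x)
```

Because the single-neuron case has a closed-form derivative, it is a convenient check before comparing against the analytical derivatives of a larger network.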




# Derivatives with respect to the inputs and weights


During ```FFANN.feedForward``` we store all $\dfrac{\partial s^{(l+1)}_{j}}{\partial s^{(l)}_{i}}= \theta^{\prime \, (l+1)}_{j} w^{(l)}_{ji}$
in ```FFANN.derivatives```. For a network with $N$ layers ($N-2$ hidden + $1$ input layer + $1$ output layer), we wish to calculate  

* $\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(0)}_{p}}$ . [#1](#1)
* $\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}} = 
\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(l+1)}_{j}} \theta^{\prime\, (l+1)}_{j}s^{(l)}_{i}$ (no sum over $j$). [#2](#2)


In order to do this we can accumulate
$$
\Delta^{ (0) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-2)}_{i}} \\
\Delta^{ (1) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-3)}_{i}}=
\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-2)}_{k_1}} \cdot 
\dfrac{\partial s^{(N-2)}_{k_1}}{\partial s^{(N-3)}_{i}}=
\Delta^{(0)}_{j k}\cdot \dfrac{\partial s^{(N-2)}_{k}}{\partial s^{(N-3)}_{i}}
\\
\Delta^{ (2) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-4)}_{i}}=
\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-2)}_{k_1}} \cdot 
\dfrac{\partial s^{(N-2)}_{k_1}}{\partial s^{(N-3)}_{k_2}}\cdot 
\dfrac{\partial s^{(N-3)}_{k_2}}{\partial s^{(N-4)}_{i}}=
\Delta^{(1)}_{j k}\cdot  \dfrac{\partial s^{(N-3)}_{k}}{\partial s^{(N-4)}_{i}}\\
\vdots\\
\Delta^{ (f) }_{j i} =\Delta^{ (f-1) }_{j k} \cdot  \dfrac{\partial s^{[N-(f+1)]}_{k}}{\partial s^{[N-(f+2)]}_{i}} \;,
$$
where the dot ($\cdot$) indicates summation over repeated indices. For convenience we can also define $\Delta^{(-1)}_{ji} = \delta_{ji}$. 


For [#1](#1), $f=N-2$, i.e. $\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(0)}_{p}} = \Delta^{(N-2)}_{rp}$.
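
The recursion for $\Delta^{(f)}$ can be sketched as a sequence of matrix products over the stored local derivatives. The helper names and the nested-list layout below are assumptions for illustration, and the local Jacobians are arbitrary numbers rather than the output of a real network.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[j][k] * b[k][i] for k in range(len(b)))
             for i in range(len(b[0]))] for j in range(len(a))]

def back_propagation(derivatives):
    """Delta[f][j][i] = d s^(N-1)_j / d s^(N-(f+2))_i from the local Jacobians."""
    n_layers = len(derivatives) + 1          # N
    delta = [derivatives[n_layers - 2]]      # Delta^(0): last local Jacobian
    for f in range(1, n_layers - 1):
        # Delta^(f)_{ji} = sum_k Delta^(f-1)_{jk} * d s^(N-(f+1))_k / d s^(N-(f+2))_i
        delta.append(matmul(delta[f - 1], derivatives[n_layers - (f + 2)]))
    return delta

def forward_product(derivatives):
    """Same total derivative, accumulated from the input side instead."""
    total = derivatives[0]
    for local in derivatives[1:]:
        total = matmul(local, total)
    return total

# arbitrary local Jacobians for a network with 2 -> 3 -> 2 -> 1 nodes (N = 4)
derivatives = [
    [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]],   # d s^(1) / d s^(0), shape 3x2
    [[0.5, 0.0, 1.0], [2.0, -1.0, 0.0]],     # d s^(2) / d s^(1), shape 2x3
    [[1.0, -2.0]],                           # d s^(3) / d s^(2), shape 1x2
]
delta = back_propagation(derivatives)
```

The last element, `delta[N-2]`, holds the derivatives of the outputs with respect to the inputs and agrees with the product accumulated in the forward direction.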

For [#2](#2),  $f= N-(l+3)$, i.e. 
$$\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}} = 
\Delta^{ (N-(l+3)) }_{r j} \theta^{\prime\, (l+1)}_{j}s^{(l)}_{i} \qquad\qquad\qquad\qquad 
\text{ for } l \leq N -3
\\
\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(N-2)}_{ji}} = 
\Delta^{ (-1) }_{r j} \theta^{\prime\, (N-1)}_{j}s^{(N-2)}_{i} = \delta_{r j} \ \theta^{\prime\, (N-1)}_{j}s^{(N-2)}_{i} 
\qquad \text{ for } l=N-2 \; .
$$


Similarly we can obtain the derivatives with respect to the biases, since
$$
\dfrac{\partial s^{(N-1)}_r}{\partial b^{(l+1)}_{j}} = \dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(l+1)}_{j}} \theta^{\prime\, (l+1)}_{j} \;.
$$

This calculation can be done at the same time as  $\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}}$ since
$$
\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}} = \dfrac{\partial s^{(N-1)}_r}{\partial b^{(l+1)}_{j}} \ s^{(l)}_{i} \;.
$$
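
Putting the pieces together, the sketch below computes both derivatives for a given $(l,j,i)$, obtaining the bias derivative first and the weight derivative from it. It is a stand-alone illustration under assumed names, a nested-list layout, and a `tanh` activation in every layer; it only borrows the name of ```FFANN.derivative_bw``` and is not its implementation.

```python
import math

def forward(weights, biases, x):
    """Return the signals and the activation derivatives theta' of every layer."""
    signals, theta_prime = [list(x)], []
    for w, b in zip(weights, biases):
        z = [sum(w_ji * s_i for w_ji, s_i in zip(w_j, signals[-1])) + b_j
             for w_j, b_j in zip(w, b)]
        signals.append([math.tanh(z_j) for z_j in z])
        theta_prime.append([1.0 - math.tanh(z_j) ** 2 for z_j in z])
    return signals, theta_prime

def derivative_bw(weights, signals, theta_prime, l, j, i):
    """d s^(N-1)_r / d w^(l)_ji and d s^(N-1)_r / d b^(l+1)_j for every output r."""
    n_layers = len(signals)
    n_out = len(signals[-1])
    # d s^(N-1)_r / d s^(l+1)_k, started as the identity (Delta^(-1)) at the output
    delta = [[1.0 if r == k else 0.0 for k in range(n_out)] for r in range(n_out)]
    for m in range(n_layers - 2, l, -1):
        # multiply by the local Jacobian d s^(m+1)_k / d s^(m)_c = theta'_k * w^(m)_kc
        delta = [[sum(delta[r][k] * theta_prime[m][k] * weights[m][k][c]
                      for k in range(len(weights[m])))
                  for c in range(len(weights[m][0]))]
                 for r in range(n_out)]
    # bias derivative first, then the weight derivative is dsdb times s^(l)_i
    dsdb = [delta[r][j] * theta_prime[l][j] for r in range(n_out)]
    dsdw = [d * signals[l][i] for d in dsdb]
    return dsdw, dsdb

weights = [[[0.5, -0.2], [0.3, 0.8]], [[1.0, -1.0], [0.4, 0.6]]]  # 2 -> 2 -> 2 nodes
biases = [[0.1, -0.1], [0.0, 0.2]]
x = [0.4, 0.7]
signals, theta_prime = forward(weights, biases, x)
dsdw, dsdb = derivative_bw(weights, signals, theta_prime, 0, 1, 0)
```

For $l=N-2$ the loop is skipped and the identity matrix plays the role of $\Delta^{(-1)}$, so only the output node $r=j$ receives a non-zero derivative.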