# Artificial Neural Networks (ANN) Basics: Error Back Propagation

## Table of Contents <a name="TOC"></a>

1. [General setups](#setups)

2. [Creation and initialization](#creation) 

3. [Training](#training) 


### A. Learning objectives

- to create an ANN of arbitrary architecture
- to initialize the ANN parameters
- to construct an ANN training algorithm
- to know the difference between the online and batch training of an ANN
- to train ANN and track the progress
- to predict the outputs using the ANN and known inputs

### B. Use cases

- [Deep machine learning: Multilayer Perceptron](#mlp-1)


### C. Functions

- `liblibra::liblinalg`
  - [`pop_submatrix`](#pop_submatrix-1)

- `liblibra::libspecialfunctions`
  - [`randperm`](#randperm-1) | [also here](#randperm-2)

  
### D. Classes and class members

- `liblibra::libann`
  - [`NeuralNetwork`](#NeuralNetwork-1)
    - [`Nlayers`](#Nlayers-1)  
    - [`Npe`](#Npe-1)
    - [`W`](#W-1) | [also here](#W-2)
    - [`dW`](#dW-1) | [also here](#dW-2)
    - [`B`](#B-1) | [also here](#B-2)
    - [`dB`](#dB-1) | [also here](#dB-2)
    - [`propagate`](#propagate-1) | [also here](#propagate-2)
    - [`back_propagate`](#back_propagate-1) | [also here](#back_propagate-2)
    - [`init_weights_biases_uniform`](#init_weights_biases_uniform-1)
    - [`init_weights_biases_normal`](#init_weights_biases_normal-1)
    - [`train`](#train-1)    

- `liblibra::librandom`
  - [`Random`](#Random-1)    
    - [`normal`](#normal-1)
    

## 1. General setups
<a name="setups"></a> [Back to TOC](#TOC)

In [1]:
import math
import sys
import cmath
import os

if sys.platform=="cygwin":
    from cyglibra_core import *
elif sys.platform=="linux" or sys.platform=="linux2":
    from liblibra_core import *
import util.libutil as comn

from libra_py import units
from libra_py import data_outs
import matplotlib.pyplot as plt   # plots
#matplotlib.use('Agg')
#%matplotlib inline 

import numpy as np
#from matplotlib.mlab import griddata

plt.rc('axes', titlesize=24)      # fontsize of the axes title
plt.rc('axes', labelsize=20)      # fontsize of the x and y labels
plt.rc('legend', fontsize=20)     # legend fontsize
plt.rc('xtick', labelsize=16)    # fontsize of the tick labels
plt.rc('ytick', labelsize=16)    # fontsize of the tick labels

plt.rc('figure.subplot', left=0.2)
plt.rc('figure.subplot', right=0.95)
plt.rc('figure.subplot', bottom=0.13)
plt.rc('figure.subplot', top=0.88)

colors = {}

colors.update({"11": "#8b1a0e"})  # red       
colors.update({"12": "#FF4500"})  # orangered 
colors.update({"13": "#B22222"})  # firebrick 
colors.update({"14": "#DC143C"})  # crimson   

colors.update({"21": "#5e9c36"})  # green
colors.update({"22": "#006400"})  # darkgreen  
colors.update({"23": "#228B22"})  # forestgreen
colors.update({"24": "#808000"})  # olive      

colors.update({"31": "#8A2BE2"})  # blueviolet
colors.update({"32": "#00008B"})  # darkblue  

colors.update({"41": "#2F4F4F"})  # darkslategray

clrs_index = ["11", "21", "31", "41", "12", "22", "32", "13","23", "14", "24"]



## 2. Creation and initialization
<a name="creation"></a> [Back to TOC](#TOC)

Create an ANN object with 3 layers: one input, one hidden, and one output layer. The list passed to the constructor gives the number of neurons in each layer.
<a name="NeuralNetwork-1"></a>

In [2]:
ANN = NeuralNetwork( Py2Cpp_int( [2, 5, 1] ) )

We can then check that:
    
* there are 3 layers
* the input layer has 2 neurons
* the hidden layer has 5 neurons
* the output has 1 neuron
<a name="Nlayers-1"></a><a name="Npe-1"></a>

In [3]:
print(F"Number of layers = {ANN.Nlayers}")
print(F"Input layer dimension = {ANN.Npe[0]}")
print(F"Hidden layer dimension = {ANN.Npe[1]}")
print(F"Output layer dimension = {ANN.Npe[2]}")

Number of layers = 3
Input layer dimension = 2
Hidden layer dimension = 5
Output layer dimension = 1


This operation simply creates the collection of weights and biases, which are stored in the lists of matrices `W` and `B`, respectively.

It also allocates the storage for the updates (deltas) of these parameters, `dW` and `dB`.

Note that the matrices `W[0]` and `B[0]` are junk: they are never used in the calculations and exist only to keep the layer indexing consistent with the way ANN theory is commonly described in the literature.
<a name="W-1"></a><a name="B-1"></a>

In [4]:
print("W[0]"); data_outs.print_matrix(ANN.W[0])
print("W[1]"); data_outs.print_matrix(ANN.W[1])
print("W[2]"); data_outs.print_matrix(ANN.W[2])

print("B[0]"); data_outs.print_matrix(ANN.B[0])
print("B[1]"); data_outs.print_matrix(ANN.B[1])
print("B[2]"); data_outs.print_matrix(ANN.B[2])

W[0]
1.0  0.0  
0.0  1.0  
W[1]
0.0  0.0  
0.0  0.0  
0.0  0.0  
0.0  0.0  
0.0  0.0  
W[2]
0.0  0.0  0.0  0.0  0.0  
B[0]
0.0  
0.0  
B[1]
0.0  
0.0  
0.0  
0.0  
0.0  
B[2]
0.0  
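The shapes printed above follow a simple rule: the weight matrix of layer `L` has one row per neuron in layer `L` and one column per neuron in layer `L-1`, and the bias is a column with one entry per neuron. The sketch below reproduces only this bookkeeping in plain Python. The helper name `parameter_shapes` is hypothetical and is not part of Libra.

```python
def parameter_shapes(npe):
    """Return (weight_shapes, bias_shapes) for the layer sizes in `npe`.

    Index 0 corresponds to the input layer; its entries are placeholders
    that mirror the unused W[0] and B[0] matrices.
    """
    w_shapes = [(npe[0], npe[0])]   # placeholder, not used in propagation
    b_shapes = [(npe[0], 1)]        # placeholder, not used in propagation
    for layer in range(1, len(npe)):
        # rows: neurons in this layer, columns: neurons in the previous layer
        w_shapes.append((npe[layer], npe[layer - 1]))
        # one bias per neuron in this layer
        b_shapes.append((npe[layer], 1))
    return w_shapes, b_shapes


w_shapes, b_shapes = parameter_shapes([2, 5, 1])
print(w_shapes)  # [(2, 2), (5, 2), (1, 5)]
print(b_shapes)  # [(2, 1), (5, 1), (1, 1)]
```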


As an example, consider the AND gate: 


| Input A | Input B |  Output A AND B |
| --- | --- | --- |
|  0  |  0  |  0  |
|  0  |  1  |  0  |
|  1  |  0  |  0  |
|  1  |  1  |  1  |

For numerical convenience, the inputs and outputs are rescaled to the [0.0, 0.5] range.

The AND truth table can be summarized as 4 inputs and 4 outputs. Each input pattern and its output form a column of the corresponding matrix. The number of rows in each matrix equals the dimensionality of the input (2, for A and B) and of the output (1, for the value of A AND B).


In [5]:
inputs = MATRIX(2, 4)
outputs = MATRIX(1, 4)

# Pattern 0
inputs.set(0, 0, 0.0)
inputs.set(1, 0, 0.0)
outputs.set(0, 0, 0.0)

# Pattern 1
inputs.set(0, 1, 0.0)
inputs.set(1, 1, 0.5)
outputs.set(0, 1, 0.0)

# Pattern 2
inputs.set(0, 2, 0.5)
inputs.set(1, 2, 0.0)
outputs.set(0, 2, 0.0)

# Pattern 3
inputs.set(0, 3, 0.5)
inputs.set(1, 3, 0.5)
outputs.set(0, 3, 0.5)

Before we start, we need to initialize the weights and biases of the ANN. Here, each parameter is set to a random number drawn with `rnd.normal()` and scaled by 0.1.
<a name="Random-1"></a><a name="normal-1"></a>

In [6]:
rnd = Random()

for L in range(1, ANN.Nlayers):
    for i in range(ANN.Npe[L]):
        for j in range(ANN.Npe[L-1]):
            ANN.W[L].set(i, j, 0.1*rnd.normal())
        ANN.B[L].set(i, 0, 0.1*rnd.normal() )
        
print("W[0]"); data_outs.print_matrix(ANN.W[0])
print("W[1]"); data_outs.print_matrix(ANN.W[1])
print("W[2]"); data_outs.print_matrix(ANN.W[2])

print("B[0]"); data_outs.print_matrix(ANN.B[0])
print("B[1]"); data_outs.print_matrix(ANN.B[1])
print("B[2]"); data_outs.print_matrix(ANN.B[2])

W[0]
1.0  0.0  
0.0  1.0  
W[1]
-0.07467980759493115  0.11535009504088899  
-0.09194019431931354  -0.05608876287839962  
0.021204018696542492  0.11335481251639841  
-0.01365116321426383  -0.04694757874421607  
0.2074992501529765  0.09111487578906989  
W[2]
0.0688396129970422  -0.07374125095097554  0.12718422746755392  0.04758712038343996  -0.016865861519680028  
B[0]
0.0  
0.0  
B[1]
0.21761276937257526  
0.04667804195371106  
0.02300427344265035  
-0.029876880048314982  
-0.013555923647332428  
B[2]
0.09641082564507389  


This operation can also be done with the auxiliary functions `init_weights_biases_normal` or `init_weights_biases_uniform`, which have the signatures:

    void init_weights_biases_uniform(Random& rnd, double left_w, double right_w, double left_b, double right_b);
    void init_weights_biases_normal(Random& rnd, double scaling_w, double shift_w, double scaling_b, double shift_b);
 
<a name="init_weights_biases_normal-1"></a>

In [7]:
ANN.init_weights_biases_normal(rnd, 0.1, 0.0, 0.1, 0.0)

print("W[0]"); data_outs.print_matrix(ANN.W[0])
print("W[1]"); data_outs.print_matrix(ANN.W[1])
print("W[2]"); data_outs.print_matrix(ANN.W[2])

print("B[0]"); data_outs.print_matrix(ANN.B[0])
print("B[1]"); data_outs.print_matrix(ANN.B[1])
print("B[2]"); data_outs.print_matrix(ANN.B[2])

W[0]
1.0  0.0  
0.0  1.0  
W[1]
0.12444313572962284  -0.045443184771789685  
-0.016257630548090516  -0.011498667603197187  
0.019002906005639406  0.17753078630284458  
-0.05183249870320597  0.0159620186499942  
0.11555013892191231  -0.19511227958004979  
W[2]
0.1962570269469692  0.06204127313536661  0.11199818560287973  -0.039360286341801026  -0.0017918247500841996  
B[0]
0.0  
0.0  
B[1]
0.11389210692325298  
0.09354658686501437  
-0.11337431543870458  
0.21521497810771573  
-0.05955214956328814  
B[2]
-0.1319612639552035  


Initialize the weights and biases using random numbers sampled from a uniform distribution.
<a name="init_weights_biases_uniform-1"></a>

In [8]:
ANN.init_weights_biases_uniform(rnd, -0.1, 0.1, -0.1, 0.1)

print("W[0]"); data_outs.print_matrix(ANN.W[0])
print("W[1]"); data_outs.print_matrix(ANN.W[1])
print("W[2]"); data_outs.print_matrix(ANN.W[2])

print("B[0]"); data_outs.print_matrix(ANN.B[0])
print("B[1]"); data_outs.print_matrix(ANN.B[1])
print("B[2]"); data_outs.print_matrix(ANN.B[2])

W[0]
1.0  0.0  
0.0  1.0  
W[1]
0.005786912285623558  -0.03757073643504211  
-0.04014041458263082  0.005311267685755738  
0.09958013631383897  0.08877924642934429  
0.08163965948933721  -0.0494387475538248  
-0.03441802250892763  0.03828138650314947  
W[2]
0.08378060952936328  0.03417718062837477  -0.07769754816670787  -0.05893732679958331  -0.04999567556660422  
B[0]
0.0  
0.0  
B[1]
0.0012398463214001731  
-0.0158380354362717  
-0.04609113798760396  
-0.0684166976103637  
0.09977275384579448  
B[2]
-0.07770523725901975  
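To make the role of the arguments concrete, here is a minimal plain-Python sketch of the two initialization schemes. It assumes that the normal variant computes `shift + scaling * N(0, 1)` for each parameter and that the uniform variant samples from `[left, right]`; the actual Libra implementation may differ in details. The function names `init_normal` and `init_uniform` are hypothetical stand-ins, and nested lists replace the `MATRIX` objects.

```python
import random


def init_normal(npe, scaling_w, shift_w, scaling_b, shift_b, rng):
    """Assumed scheme: each parameter is shift + scaling * N(0, 1)."""
    W, B = [None], [None]  # index 0 is a placeholder for the input layer
    for L in range(1, len(npe)):
        W.append([[shift_w + scaling_w * rng.gauss(0.0, 1.0)
                   for _ in range(npe[L - 1])] for _ in range(npe[L])])
        B.append([shift_b + scaling_b * rng.gauss(0.0, 1.0)
                  for _ in range(npe[L])])
    return W, B


def init_uniform(npe, left_w, right_w, left_b, right_b, rng):
    """Assumed scheme: each parameter is uniform on [left, right]."""
    W, B = [None], [None]  # index 0 is a placeholder for the input layer
    for L in range(1, len(npe)):
        W.append([[rng.uniform(left_w, right_w)
                   for _ in range(npe[L - 1])] for _ in range(npe[L])])
        B.append([rng.uniform(left_b, right_b) for _ in range(npe[L])])
    return W, B


rng = random.Random(42)  # fixed seed so the sketch is reproducible
W_n, B_n = init_normal([2, 5, 1], 0.1, 0.0, 0.1, 0.0, rng)
W_u, B_u = init_uniform([2, 5, 1], -0.1, 0.1, -0.1, 0.1, rng)
print(len(W_u[1]), len(W_u[1][0]))  # 5 rows, 2 columns, like W[1] above
```

Small initial values keep the neurons away from the saturated region of the transfer function, which is why both cells above use a scale of about 0.1.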


## 3. Training
<a name="training"></a> [Back to TOC](#TOC)

To train the ANN on a given set of patterns, we use two key functions, `propagate` and `back_propagate`, which have the signatures:

    vector<MATRIX> propagate(MATRIX& input);
    double back_propagate(vector<MATRIX>& Y, MATRIX& target);

The `propagate` function takes a given input (which may contain as many patterns as needed, one per column) and computes the outputs of each layer for each pattern. The results are returned as a list of matrices, one per layer.

The `back_propagate` function takes the outputs of each layer (as returned by the `propagate` function) as well as the expected target output and computes the error on each layer (and the corresponding derivatives of the error with respect to the weights and biases), starting from the last (output) layer and working its way down to the first one. The error is thus propagated backwards, hence the name. 

As a result, the procedure updates the `dW` and `dB` values stored internally in the ANN object. The function also returns the error in the last layer to facilitate tracking of the progress.

Note that if several patterns are given, the `dW` and `dB` variables are computed as averages over all the patterns.
<a name="dW-1"></a><a name="dB-1"></a><a name="propagate-1"></a><a name="back_propagate-1"></a>

In [9]:
Y = ANN.propagate(inputs)
res = ANN.back_propagate(Y, outputs)
print(F"Error = {res}")

print("dW[0]"); data_outs.print_matrix(ANN.dW[0])
print("dW[1]"); data_outs.print_matrix(ANN.dW[1])
print("dW[2]"); data_outs.print_matrix(ANN.dW[2])

print("dB[0]"); data_outs.print_matrix(ANN.dB[0])
print("dB[1]"); data_outs.print_matrix(ANN.dB[1])
print("dB[2]"); data_outs.print_matrix(ANN.dB[2])

Error = 0.04519214277195703
dW[0]
0.0  0.0  
0.0  0.0  
dW[1]
0.006981422098446493  0.006966319701928951  
0.0028453098207850695  0.0028396355800587676  
-0.006462613642992636  -0.006448841305310657  
-0.004899931056470072  -0.004884864656572826  
-0.004125887118961536  -0.0041134794537402855  
dW[2]
-0.0023874349345629504  -0.0061559730794529255  0.006207818630260986  -0.011350855168443252  0.020743749447816894  
dB[0]
0.0  
0.0  
dB[1]
0.017211631886592204  
0.007015850018761089  
-0.015935499196266023  
-0.012070339046558558  
-0.010167499375128945  
dB[2]
0.20547487545388454  
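The two steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Libra implementation: it assumes a `tanh` transfer function on every non-input layer and an error defined as half the squared deviation averaged over patterns, and it stores patterns as rows of nested lists rather than as columns of a `MATRIX`. As in the text, `dW` and `dB` are pattern-averaged negative gradients.

```python
import math
import random


def propagate(W, B, inputs):
    """Forward pass; Y[L][p] is the output vector of layer L for pattern p."""
    Y = [[list(x) for x in inputs]]  # layer 0 simply passes the inputs on
    for L in range(1, len(W)):
        layer = []
        for y_prev in Y[L - 1]:
            net = [sum(w * y for w, y in zip(row, y_prev)) + b
                   for row, b in zip(W[L], B[L])]
            layer.append([math.tanh(v) for v in net])  # assumed tanh activation
        Y.append(layer)
    return Y


def back_propagate(W, Y, targets):
    """Return (error, dW, dB); dW and dB are averaged negative gradients."""
    n_layers, n_pat = len(W), len(targets)
    dW = [None] + [[[0.0] * len(row) for row in W[L]] for L in range(1, n_layers)]
    dB = [None] + [[0.0] * len(W[L]) for L in range(1, n_layers)]
    error = 0.0
    for p in range(n_pat):
        out = Y[-1][p]
        error += 0.5 * sum((t - y) ** 2 for t, y in zip(targets[p], out)) / n_pat
        # output-layer delta: (target - output) * tanh'(net), with tanh' = 1 - y^2
        delta = [(t - y) * (1.0 - y * y) for t, y in zip(targets[p], out)]
        for L in range(n_layers - 1, 0, -1):
            y_prev = Y[L - 1][p]
            for i, d in enumerate(delta):
                dB[L][i] += d / n_pat
                for j, yj in enumerate(y_prev):
                    dW[L][i][j] += d * yj / n_pat
            if L > 1:
                # push the error one layer back through the weights W[L]
                delta = [(1.0 - yj * yj) * sum(W[L][i][j] * d for i, d in enumerate(delta))
                         for j, yj in enumerate(y_prev)]
    return error, dW, dB


rng = random.Random(0)
npe = [2, 5, 1]
W = [None] + [[[rng.uniform(-0.1, 0.1) for _ in range(npe[L - 1])]
               for _ in range(npe[L])] for L in range(1, len(npe))]
B = [None] + [[rng.uniform(-0.1, 0.1) for _ in range(npe[L])]
              for L in range(1, len(npe))]
inputs = [[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5]]  # AND gate, rescaled
targets = [[0.0], [0.0], [0.0], [0.5]]

Y = propagate(W, B, inputs)
error, dW, dB = back_propagate(W, Y, targets)
print(f"Error = {error}")
```

A quick way to validate such a routine is to compare one entry of `dW` with a finite-difference estimate of the negative error derivative; the two should agree to many digits.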


Now, we can formulate a simple procedure to perform the gradient descent optimization of the weights and biases.

Note that `dW` and `dB` are the negative gradients of the error with respect to those parameters. So, in the gradient descent algorithm, they enter with the "+" sign.

Naturally, we don't want to print the results at every step, only once in a while (here, once per epoch).
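The sign convention is easy to check on a toy problem. The sketch below minimizes the one-parameter error $E(w) = (w - 3)^2$; the function `negative_gradient` is a hypothetical stand-in that plays the role of `dW` and `dB`, and `dt` is the learning rate.

```python
def negative_gradient(w):
    """Return -dE/dw for the toy error E(w) = (w - 3)**2."""
    return -2.0 * (w - 3.0)


w, dt = 0.0, 0.1
for step in range(100):
    dw = negative_gradient(w)  # already points downhill
    w = w + dt * dw            # hence the "+" sign in the update

print(w)  # approaches the minimum at w = 3
```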
<a name="W-2"></a><a name="B-2"></a><a name="dW-2"></a><a name="dB-2"></a><a name="propagate-2"></a><a name="back_propagate-2"></a>

In [10]:
n_epochs = 20
steps_per_epoch = 1000
dt = 0.01

for epoch in range(n_epochs):
    
    res, Y = 0.0, None
    for i in range(steps_per_epoch):
    
        for L in range(ANN.Nlayers):
            ANN.W[L] = ANN.W[L] + dt * ANN.dW[L]
            ANN.B[L] = ANN.B[L] + dt * ANN.dB[L]

        Y = ANN.propagate(inputs)
        res = ANN.back_propagate(Y, outputs)
        
    print(F"epoch = {epoch}  error = {res}")
    
    data_outs.print_matrix(Y[2])

epoch = 0  error = 0.023440060714565544
0.1251404016968137  0.1261748411974405  0.12405553896998396  0.12509934878777298  
epoch = 1  error = 0.022877031224618205
0.12079707098479159  0.12623134245508394  0.12440946486007147  0.12984834098216064  
epoch = 2  error = 0.021931851676821348
0.11343032688733137  0.12672544345056644  0.12492340040874216  0.13816691642502021  
epoch = 3  error = 0.020131881373898083
0.09843806164998027  0.12770045604872832  0.12576847769806593  0.15468871644234208  
epoch = 4  error = 0.017154659528963057
0.07001100954140131  0.1289552383265973  0.12691086082897943  0.18440545840196135  
epoch = 5  error = 0.013521031474355027
0.025989260420289657  0.12991805799449047  0.12794685345462875  0.2275230082453813  
epoch = 6  error = 0.010612814718493173
-0.024970551805868203  0.13028236876132387  0.12860830622325864  0.27468822212148164  
epoch = 7  error = 0.009141472331890901
-0.0682336216531886  0.13029595092996296  0.12901914858518704  0.31331048234882913  
e

After enough steps and cycles, we can compute the ANN recall (prediction) using the current state of the ANN parameters.

As an example, we use the input that was also used in the training.

In [11]:
Y = ANN.propagate(inputs)

In [12]:
data_outs.print_matrix(Y[2])

-0.12847185170227976  0.1290775453577819  0.1286563945348703  0.36801825524340365  


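Here, we can see that the results follow the expected trend: the output for the last pattern (both inputs on) is clearly the largest, although the values have not yet fully converged to the targets of 0.0 and 0.5.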

In the above example, we have utilized all of our training examples in each step. This is called **batch** training.

However, sometimes it is advantageous to use randomly selected subsets of the training examples in each step. Here, this is called **online** training (elsewhere it is often referred to as mini-batch training), and the number of examples presented at each step is called the **epoch size**. In the above example of batch training, we used an epoch size of 4.

Let's consider smaller epoch sizes.

To implement this functionality, we need a procedure to select random sequences of numbers. This can be done with the help of the `randperm` function, which has the signature:

    int randperm(int size,int of_size,vector<int>& result)
    
For instance, if we want to create a random sequence of 3 numbers from 5 numbers [0, 1, 2, 3, 4], we do:
<a name="randperm-1"></a>

In [13]:
res = intList()
randperm(3, 5, res)
print( Cpp2Py(res) )

[0, 2, 3]
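For readers without the Libra bindings at hand, a comparable selection can be sketched with the Python standard library. This is an analogue, not the `randperm` implementation: the helper name `random_subset` is hypothetical, and whether `randperm` returns its indices in sorted order is not assumed here.

```python
import random


def random_subset(size, of_size, rng=random):
    """Pick `size` distinct integers from 0 .. of_size - 1."""
    return rng.sample(range(of_size), size)


subset = random_subset(3, 5)
print(subset)  # three distinct indices between 0 and 4
```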


Now, we are ready to formulate the algorithm.

<a name="pop_submatrix-1"></a>
Note how we use the `pop_submatrix` function to extract certain columns (as defined by the `subset` variable) from the full matrix of all inputs. We do this extraction for both the inputs and the outputs, using the same `subset`, so that each selected input stays paired with its target.
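The column extraction described above can be pictured with nested lists. The sketch below is a plain-Python analogue under the assumption that the two index lists select rows and columns, respectively; `take_columns` is a hypothetical helper, not a Libra function.

```python
def take_columns(matrix, rows, cols):
    """Return the sub-matrix of `matrix` (a list of rows) with the given rows and columns."""
    return [[matrix[r][c] for c in cols] for r in rows]


# AND gate patterns stored column-wise, as in the notebook
inputs = [[0.0, 0.0, 0.5, 0.5],
          [0.0, 0.5, 0.0, 0.5]]
outputs = [[0.0, 0.0, 0.0, 0.5]]

subset = [1, 3]  # e.g. the result of a random selection
input_subset = take_columns(inputs, [0, 1], subset)
output_subset = take_columns(outputs, [0], subset)

print(input_subset)   # [[0.0, 0.5], [0.5, 0.5]]
print(output_subset)  # [[0.0, 0.5]]
```

Using the same `subset` for both calls is what keeps each selected input column aligned with its target column.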
<a name="randperm-2"></a>

In [14]:
ANN2 = NeuralNetwork( Py2Cpp_int( [2, 5, 1] ) )
ANN2.init_weights_biases_uniform(rnd, -0.1, 0.1, -0.1, 0.1)


n_epochs = 20
steps_per_epoch = 1000
epoch_size = 2
n_patterns = 4
dt = 0.01

input_subset = MATRIX(2, epoch_size)
output_subset = MATRIX(1, epoch_size)
subset = intList()

for epoch in range(n_epochs):
    
    res, Y = 0.0, None
    for i in range(steps_per_epoch):    
        for L in range(ANN2.Nlayers):
            ANN2.W[L] = ANN2.W[L] + dt * ANN2.dW[L]
            ANN2.B[L] = ANN2.B[L] + dt * ANN2.dB[L]
            
        # Make a random selection of the training patterns
        randperm(epoch_size, n_patterns, subset)
        
        # Extract the corresponding matrices from the inputs and outputs
        pop_submatrix(inputs, input_subset, Py2Cpp_int( [0, 1] ), subset )
        pop_submatrix(outputs, output_subset, Py2Cpp_int( [0] ), subset )

        Y = ANN2.propagate(input_subset)
        res = ANN2.back_propagate(Y, output_subset)
        
    print(F"epoch = {epoch}  error = {res}")
    
    data_outs.print_matrix(Y[2])
    
Y = ANN2.propagate(inputs)
data_outs.print_matrix(Y[2])

epoch = 0  error = 0.03823947430633084
0.1252214825107677  0.12949024635898065  
epoch = 1  error = 0.037252160169705985
0.13301653599205693  0.1197154034514969  
epoch = 2  error = 0.008482091112683457
0.13380230949427305  0.1265910993108623  
epoch = 3  error = 0.031340193420606954
0.13213619266623478  0.17151742775285705  
epoch = 4  error = 0.022018538568777207
0.07394119111043716  0.21258591799934606  
epoch = 5  error = 0.017121485645586906
0.23845297783791403  0.008893693298789666  
epoch = 6  error = 0.015762960491216046
0.14083840252468122  0.292114487423582  
epoch = 7  error = 0.005340491264554217
-0.08496770977843712  0.11892204738072815  
epoch = 8  error = 0.010621065936320971
0.12621924425300274  0.33704919108674325  
epoch = 9  error = 0.009597623762396534
0.13726153725449614  0.13982047575415663  
epoch = 10  error = 0.008965033872500968
0.36175077571492065  0.12941131123119684  
epoch = 11  error = 0.008742237662892307
0.1315795352113171  0.3671249588329584  
epoch = 

Conveniently, the above procedure can be run with a single function, `train`:
<a name="train-1"></a><a name="mlp-1"></a>

In [15]:
ANN3 = NeuralNetwork( Py2Cpp_int( [2, 5, 1] ) )
ANN3.init_weights_biases_uniform(rnd, -0.1, 0.1, -0.1, 0.1)

params = { "num_epochs":20, 
           "steps_per_epoch":1000, 
           "epoch_size":2, "learning_rate":0.01, 
           "verbosity":1 }

ANN3.train(rnd, params, inputs, outputs )

Y = ANN3.propagate(inputs)
data_outs.print_matrix(Y[2])

-0.11978020876310912  0.13074530990140731  0.13483606915940663  0.3696247393767548  


## Exercise 1

Train an ANN to learn the exclusive OR (XOR) gate:

| Input A | Input B |  Output A XOR B |
| --- | --- | --- |
|  0  |  0  |  0  |
|  0  |  1  |  1  |
|  1  |  0  |  1  |
|  1  |  1  |  0  |

Experiment with the ANN architecture and training parameters. Can you make an ANN with no hidden layers learn this pattern?


## Exercise 2

Train an ANN to learn the quadratic function $y(x) = x^2$ on the [0, 5] interval. 

Hint: keep in mind that the output of the $\tanh(x)$ function lies in the [-1, 1] interval, so you need to transform the target y values into that interval, or even better, into something like [-0.5, 0.5].

Also, the best learning happens where the slope of the transfer function isn't too close to zero, so it is a good idea to convert the input variables into another range, e.g. [-1, 1].
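The rescaling suggested in the hint is a linear map between two intervals. A minimal sketch, with the hypothetical helper `rescale` and a handful of sample points chosen only for illustration:

```python
def rescale(value, old_min, old_max, new_min, new_max):
    """Linearly map `value` from [old_min, old_max] to [new_min, new_max]."""
    return new_min + (value - old_min) * (new_max - new_min) / (old_max - old_min)


xs = [0.0, 1.0, 2.5, 5.0]

# Inputs: [0, 5] -> [-1, 1]
x_scaled = [rescale(x, 0.0, 5.0, -1.0, 1.0) for x in xs]

# Targets: y = x^2 spans [0, 25] on this interval -> [-0.5, 0.5]
y_scaled = [rescale(x * x, 0.0, 25.0, -0.5, 0.5) for x in xs]

# Predictions can be mapped back to the original units with the inverse map
y_back = [rescale(y, -0.5, 0.5, 0.0, 25.0) for y in y_scaled]

print(x_scaled)
print(y_scaled)
print(y_back)
```

The same function, with the intervals swapped, converts the ANN outputs back to the original scale when comparing predictions with $x^2$.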