# Implementing a Multilayer Perceptron with one hidden layer


## We need to compute the following operations in order:
1. linear function aggregation $z$

2. sigmoid function activation $a$

3. cost function (error) calculation $E$

4. derivative of the error w.r.t. the weights $w$ and bias $b$ in the $(L)$ layer

5. derivative of the error w.r.t. the weights $w$ and bias $b$ in the $(L−1)$ layer

6. weight and bias update for the $(L)$ layer

7. weight and bias update for the $(L−1)$ layer
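
For reference, the seven steps can be written compactly in matrix form. This is a sketch under assumed notation that the notebook itself does not define: $A^{(L-1)}$ and $A^{(L)}$ denote the hidden and output activations, $\delta$ denotes a layer's error term, $\odot$ is the element-wise product, and $E$ is taken in its summed form. The `cost_function` defined later divides by the number of elements, which only rescales $E$ by a constant.

$$
\begin{aligned}
Z^{(L-1)} &= X W^{(L-1)} + b^{(L-1)}, \qquad A^{(L-1)} = \sigma\left(Z^{(L-1)}\right) \\
Z^{(L)} &= A^{(L-1)} W^{(L)} + b^{(L)}, \qquad A^{(L)} = \sigma\left(Z^{(L)}\right) \\
E &= \frac{1}{2} \sum_{i} \left(A^{(L)}_{i} - y_{i}\right)^{2} \\
\delta^{(L)} &= \left(A^{(L)} - y\right) \odot A^{(L)} \odot \left(1 - A^{(L)}\right) \\
\frac{\partial E}{\partial W^{(L)}} &= \left(A^{(L-1)}\right)^{\top} \delta^{(L)}, \qquad \frac{\partial E}{\partial b^{(L)}} = \sum_{i} \delta^{(L)}_{i} \\
\delta^{(L-1)} &= \left(\delta^{(L)} \left(W^{(L)}\right)^{\top}\right) \odot A^{(L-1)} \odot \left(1 - A^{(L-1)}\right) \\
\frac{\partial E}{\partial W^{(L-1)}} &= X^{\top} \delta^{(L-1)}, \qquad \frac{\partial E}{\partial b^{(L-1)}} = \sum_{i} \delta^{(L-1)}_{i} \\
W &\leftarrow W - \eta \frac{\partial E}{\partial W}, \qquad b \leftarrow b - \eta \frac{\partial E}{\partial b}
\end{aligned}
$$

Here $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function, whose derivative $\sigma(z)\,(1 - \sigma(z))$ produces the $A \odot (1 - A)$ factors, and the sums over $i$ run over the samples.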

In [3]:
import numpy as np

## Initialize training parameters
Generate initial parameters sampled from a uniform distribution over $[0, 1)$.

### Args:

`n_features (int)`: number of features (columns of the input matrix)

`n_neurons (int)`: number of neurons in hidden layer

`n_output (int)`: number of output neurons

### Returns:

`parameters (dict)`: dictionary with the following entries:

`W1`: weight matrix, *shape = [n_features, n_neurons]*

`b1`: bias vector, *shape = [1, n_neurons]*

`W2`: weight matrix, *shape = [n_neurons, n_output]*

`b2`: bias vector, *shape = [1, n_output]*

In [4]:
def init_parameters(n_features, n_neurons, n_output):
    # for reproducibility
    np.random.seed(100)
    W1 = np.random.uniform(size=(n_features,n_neurons))
    b1 = np.random.uniform(size=(1,n_neurons))
    W2 = np.random.uniform(size=(n_neurons,n_output))
    b2 = np.random.uniform(size=(1,n_output))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters
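
The shapes listed above are chosen so that the matrix products line up. A minimal, self-contained sketch of how they propagate, assuming 4 samples and the default layer sizes used later in the notebook (the variable names `hidden` and `output` and the seed are illustrative only):

```python
import numpy as np

n_samples, n_features, n_neurons, n_output = 4, 2, 3, 1

# assumption: any seed works here, this is only a shape check
rng = np.random.default_rng(100)
X = rng.uniform(size=(n_samples, n_features))
W1 = rng.uniform(size=(n_features, n_neurons))
b1 = rng.uniform(size=(1, n_neurons))
W2 = rng.uniform(size=(n_neurons, n_output))
b2 = rng.uniform(size=(1, n_output))

# (4, 2) @ (2, 3) -> (4, 3); b1 with shape (1, 3) broadcasts over the 4 rows
hidden = X @ W1 + b1
# (4, 3) @ (3, 1) -> (4, 1); b2 with shape (1, 1) broadcasts over the 4 rows
output = hidden @ W2 + b2

print(hidden.shape, output.shape)  # (4, 3) (4, 1)
```

Keeping the biases as row vectors of shape `[1, n]` lets NumPy broadcasting add the same bias to every sample.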

## Compute $z$: linear function

Computes the net input as the matrix product of the features and the weights, plus the bias.

### Args:

`W (ndarray)`: weight matrix

`X (ndarray)`: matrix of features

`b (ndarray)`: vector of biases

### Returns:

`Z (ndarray)`: weighted sum of features

In [5]:
def linear_function(W, X, b):    
    return (X @ W)+b

### Example

This example uses a data set with **2 samples** *(rows)* and **2 features** *(columns)*.

**Input matrix** $X \in \mathbb{R}^{2 \times 2}$:

$$
X = \begin{bmatrix} X_{1,1} & X_{1,2} \\ X_{2,1} & X_{2,2} \end{bmatrix}
$$

**Weight matrix** $W \in \mathbb{R}^{2 \times 1}$:

$$
W = \begin{bmatrix} W_{1,1} \\ W_{2,1} \end{bmatrix}
$$

**Bias vector** $b \in \mathbb{R}^{1}$:

$$
b =\begin{bmatrix} b_{1}\end{bmatrix}
$$

1. Matrix Multiplication $XW$
  

$$
XW = 
\begin{bmatrix} X_{1,1} & X_{1,2} \\ X_{2,1} & X_{2,2} \end{bmatrix} 
\cdot
\begin{bmatrix} W_{1,1} \\ W_{2,1} \end{bmatrix}
=
\begin{bmatrix}
X_{1,1} \cdot W_{1,1} + X_{1,2} \cdot W_{2,1} \\
X_{2,1} \cdot W_{1,1} + X_{2,2} \cdot W_{2,1}
\end{bmatrix} 
$$

2. Add the Bias $b$ to $XW$
  

$$
XW + b = 
\begin{bmatrix}
X_{1,1} \cdot W_{1,1} + X_{1,2} \cdot W_{2,1} + b_{1}\\
X_{2,1} \cdot W_{1,1} + X_{2,2} \cdot W_{2,1} + b_{1}
\end{bmatrix} 
$$

3. Result
  

$$
Z = 
\begin{bmatrix}
Z_{1} \\
Z_{2}
\end{bmatrix}
=
\begin{bmatrix}
X_{1,1} \cdot W_{1,1} + X_{1,2} \cdot W_{2,1} + b_{1}\\
X_{2,1} \cdot W_{1,1} + X_{2,2} \cdot W_{2,1} + b_{1}
\end{bmatrix} 
$$
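
The symbolic example above can be checked with concrete numbers. This is a minimal sketch; the values of `X`, `W`, and `b` are made up for illustration:

```python
import numpy as np

# 2 samples (rows) and 2 features (columns); values are illustrative
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
W = np.array([[0.5],
              [-1.0]])
b = np.array([[0.1]])

# same computation as linear_function(W, X, b)
Z = X @ W + b

# the same result written out term by term, as in the matrices above
Z_by_hand = np.array([[1.0 * 0.5 + 2.0 * -1.0 + 0.1],
                      [3.0 * 0.5 + 4.0 * -1.0 + 0.1]])

print(Z)
print(np.allclose(Z, Z_by_hand))
```

Each row of `Z` is the weighted sum for one sample, so `Z` has one row per sample and one column per neuron.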

## Compute $a$: sigmoid activation function
Computes the sigmoid activation element-wise.
    
### Args:

`Z (ndarray)`: weighted sum of features
    
### Returns: 

`S (ndarray)`: neuron activation

In [6]:
def sigmoid_function(Z):
    return 1/(1+np.exp(-Z))
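
Backpropagation relies on the sigmoid derivative being expressible through the activation itself: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. A small sketch that checks this numerically; the helper name `sigmoid` and the sample inputs are illustrative:

```python
import numpy as np

def sigmoid(Z):
    # same formula as sigmoid_function in the notebook
    return 1 / (1 + np.exp(-Z))

Z = np.array([-2.0, 0.0, 2.0])
A = sigmoid(Z)

# analytic derivative, written in terms of the activation
analytic = A * (1 - A)

# central finite difference as an independent check
eps = 1e-6
numeric = (sigmoid(Z + eps) - sigmoid(Z - eps)) / (2 * eps)

print(A)
print(np.allclose(analytic, numeric, atol=1e-8))
```

The outputs always lie strictly between 0 and 1, with `sigmoid(0) = 0.5`, which is why 0.5 is a natural decision threshold.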

## Compute cost (error) function $E$:

Computes half the mean squared error.
    
### Args:

`A (ndarray)`: neuron activation

`y (ndarray)`: vector of expected values
    
### Returns:

`E (float)`: half the mean squared error

In [7]:
def cost_function(A, y):
    return (np.mean(np.power(A - y,2)))/2
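
A short worked example of the cost with made-up activations and targets. It also computes the summed form $\frac{1}{2}\sum (A - y)^2$, which differs from the mean-based value only by the number of elements. The variable names `half_mse` and `half_sse` are illustrative:

```python
import numpy as np

# illustrative activations and targets for 2 samples
A = np.array([[0.8],
              [0.2]])
y = np.array([[1.0],
              [0.0]])

# same expression as cost_function(A, y)
half_mse = np.mean(np.power(A - y, 2)) / 2

# summed form of the squared error
half_sse = 0.5 * np.sum(np.power(A - y, 2))

print(half_mse, half_sse)
print(np.isclose(half_sse, A.size * half_mse))
```

Both forms have the same minimizer, so the choice affects only the scale of the reported error.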

## Compute predictions $\hat{y}$ with learned parameters

Computes binary predictions by running a forward pass with the learned parameters and thresholding the output at 0.5.
    
### Args:

`X (ndarray)`: matrix of features

`W1 (ndarray)`: weight matrix for the first layer

`W2 (ndarray)`: weight matrix for the second layer

`b1 (ndarray)`: bias vector for the first layer

`b2 (ndarray)`: bias vector for the second layer
        
### Returns:

`d (ndarray)`: vector of predicted values

In [8]:
def predict(X, W1, W2, b1, b2):
    Z1 = linear_function(W1, X, b1)
    S1 = sigmoid_function(Z1)
    Z2 = linear_function(W2, S1, b2)
    S2 = sigmoid_function(Z2)
    # a threshold function for binary classification
    return np.where(S2 >= 0.5, 1, 0)
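
The thresholding step can be seen in isolation. A minimal sketch with made-up output activations (the array `S2` here is illustrative, not a network output):

```python
import numpy as np

# illustrative sigmoid outputs for 4 samples
S2 = np.array([[0.12],
               [0.50],
               [0.93],
               [0.49]])

# activations at or above 0.5 map to class 1, everything else to class 0
y_hat = np.where(S2 >= 0.5, 1, 0)

print(y_hat.ravel())  # [0 1 1 0]
```

Note that an activation of exactly 0.5 is assigned to class 1 because the comparison is `>=`.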

## Backpropagation and training loop
Multilayer Perceptron trained with backpropagation

### Args:
    
`X (ndarray)`: matrix of features
    
`y (ndarray)`: vector of expected values
    
`n_features (int)`: number of features (columns of `X`)
    
`n_neurons (int)`: number of neurons in hidden layer
    
`n_output (int)`: number of output neurons
    
`iterations (int)`: number of iterations over the training set
    
`eta (float)`: learning rate
    
### Returns: 
    
`errors (list)`: list of errors over iterations
    
`param (dict)`: dictionary of learned parameters

### How it works
The `fit` function trains the network.
* The first part of the function initializes the parameters by calling the `init_parameters` function.

* The loop inside the function does the following:

    1. The forward-propagation section chains the linear and sigmoid functions to compute the network output.

    2. The error-computation section computes the cost function value at each iteration, before the parameters are updated.

    3. The backpropagation section does two things:

        - computes the gradients for the weights and biases in the $(L)$ and $(L−1)$ layers.

        - updates the weights and biases in the $(L)$ and $(L−1)$ layers.

* The `fit` function returns a list of the errors recorded at each iteration and a dictionary with the learned weights and biases.
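
The gradient formulas used in the backpropagation step can be sanity-checked against finite differences. The following is a sketch, not part of the training code: the helpers `sigmoid`, `forward`, `summed_error`, and `numerical_grad` are hypothetical names introduced here, and the check assumes the summed form of the squared error, which is the one the delta terms differentiate.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward(X, W1, b1, W2, b2):
    # forward pass through the hidden and output layers
    S1 = sigmoid(X @ W1 + b1)
    S2 = sigmoid(S1 @ W2 + b2)
    return S1, S2

def summed_error(A, y):
    # assumed summed form: 1/2 * sum((A - y)^2)
    return 0.5 * np.sum((A - y) ** 2)

def numerical_grad(W, loss_fn, eps=1e-6):
    # central differences, perturbing one entry of W at a time (in place)
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = loss_fn()
        W[idx] = old - eps
        minus = loss_fn()
        W[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)  # illustrative seed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.uniform(size=(2, 3)), rng.uniform(size=(1, 3))
W2, b2 = rng.uniform(size=(3, 1)), rng.uniform(size=(1, 1))

# analytic gradients, both computed from the same (not yet updated) weights
S1, S2 = forward(X, W1, b1, W2, b2)
delta2 = (S2 - y) * S2 * (1 - S2)
delta1 = (delta2 @ W2.T) * S1 * (1 - S1)
W2_grad = S1.T @ delta2
W1_grad = X.T @ delta1

def loss():
    return summed_error(forward(X, W1, b1, W2, b2)[1], y)

W2_num = numerical_grad(W2, loss)
W1_num = numerical_grad(W1, loss)

print(np.allclose(W2_grad, W2_num, atol=1e-6))
print(np.allclose(W1_grad, W1_num, atol=1e-6))
```

Agreement between the analytic and numerical values is a common way to catch sign or transpose mistakes before running a long training loop.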

In [9]:
def fit(X, y, n_features=2, n_neurons=3, n_output=1, iterations=10, eta=0.001):
    #__Initialize_parameters 
    param = init_parameters(n_features=n_features, 
                            n_neurons=n_neurons, 
                            n_output=n_output)

    #__storage_errors_after_each_iteration
    errors = []

    for _ in range(iterations):
        #__ForwardPropagation
        Z1 = linear_function(param['W1'], X, param['b1'])
        S1 = sigmoid_function(Z1)
        Z2 = linear_function(param['W2'], S1, param['b2'])
        S2 = sigmoid_function(Z2)
        
        #__Error_computation
        error = cost_function(S2, y)
        errors.append(error)
        
        #__Backpropagation
        # Compute both error terms before any update, so that the hidden-layer
        # gradient uses the same W2 that produced the forward pass.
        delta2 = (S2 - y) * S2 * (1 - S2)
        delta1 = (delta2 @ param["W2"].T) * S1 * (1 - S1)

        #__compute_gradients
        W2_gradients = S1.T @ delta2
        W1_gradients = X.T @ delta1

        #__update_output_weights
        param["W2"] = param["W2"] - W2_gradients * eta

        #__update_output_bias
        param["b2"] = param["b2"] - np.sum(delta2, axis=0, keepdims=True) * eta

        #__update_hidden_weights
        param["W1"] = param["W1"] - W1_gradients * eta

        #__update_hidden_bias
        param["b1"] = param["b1"] - np.sum(delta1, axis=0, keepdims=True) * eta
        
    return errors, param

## Application

Solving the XOR problem

This is the truth table for the XOR function:

|x1|x2|y|
|--|--|-|
|0 | 0|0|
|0 | 1|1|
|1 | 0|1|
|1 | 1|0|
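
The truth table can also be generated programmatically rather than typed by hand. A minimal sketch using NumPy's `logical_xor`, with the inputs laid out one sample per row:

```python
import numpy as np

# all four input combinations, one sample per row
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# XOR of the two feature columns, reshaped to a column vector of targets
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int).reshape(-1, 1)

print(y.ravel())  # [0 1 1 0]
```

The targets are 1 exactly when the two inputs differ, matching the table above.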

In [None]:
#__expected_values
y = np.array([[0, 1, 1, 0]]).T

#__features
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]]).T

print(y)
print(X)

[[0]
 [1]
 [1]
 [0]]
[[0 0]
 [0 1]
 [1 0]
 [1 1]]


### Multilayer Perceptron training 

Train the network for 5,000 iterations with a learning rate of $\eta = 0.1$.

In [15]:
errors, param = fit(X, y, iterations=5000, eta=0.1)

### Multilayer Perceptron predictions and error

In [21]:
y_pred = predict(X, param["W1"], param["W2"], param["b1"], param["b2"])
num_correct_predictions = (y_pred == y).sum()
accuracy = (num_correct_predictions / y.shape[0]) * 100
print('Multi-layer perceptron accuracy: %.2f%%' % accuracy)

Multi-layer perceptron accuracy: 100.00%


In [22]:
import altair as alt
import pandas as pd
alt.data_transformers.disable_max_rows()

df = pd.DataFrame({"errors":errors, "time-step": np.arange(0, len(errors))})
alt.Chart(df).mark_line().encode(x="time-step", y="errors").properties(title='Chart 2')