# Multilayer Perceptron (MLP)

© 2024 by [Damir Cavar](http://damir.cavar.me/)


**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

**Literature:**

- Samy Baladram "[Multilayer Perceptron, Explained: A Visual Guide with Mini 2D Dataset](https://towardsdatascience.com/multilayer-perceptron-explained-a-visual-guide-with-mini-2d-dataset-0ae8100c5d1c)"



In [1]:
import numpy as np

## Data

We will use the data set from Samy Baladram's article listed above. The data shows scores for temperature and humidity from 0 to 3, and a corresponding decision whether playing golf is possible. See [here](https://towardsdatascience.com/support-vector-classifier-explained-a-visual-guide-with-mini-2d-dataset-62e831e7b9e9) for an explanation of the data set.

In [5]:
training_data = [
    (0, 0, 1),
    (1, 0, 0),
    (1, 1, 0),
    (2, 0, 0),
    (3, 1, 1),
    (3, 2, 1),
    (2, 3, 1),
    (3, 3, 0)
]

In [6]:
test_data = [
    (0, 1, 0),
    (0, 2, 0),
    (1, 3, 1),
    (2, 2, 1),
    (3, 1, 1)
]

## Introduction

The network architecture will consume an input vector with two dimensions. One dimension is the score for temperature and the other is the score for humidity.

We can design the first hidden layer with three nodes, a second subsequent hidden layer with two nodes, and an output layer with one node.

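All layers are fully connected. The weights between the input and the first hidden layer form a matrix $W$ of dimensions 2 x 3. The weights between the first and the second hidden layer form a matrix $U$ of dimensions 3 x 2, and the weights between the second hidden layer and the output node form a matrix $O$ of dimensions 2 x 1. Each layer also has a bias row vector.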
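As a compact summary, the forward pass computed further down in this notebook can be written as follows. Here $x$ is a 1 x 2 input row, $\sigma$ is the sigmoid function, $O$ is the 2 x 1 output weight matrix, and $b_W$, $b_U$, $b_O$ are the bias vectors. The names $h_1$, $h_2$ and $\hat{y}$ are chosen here for illustration only:

$$h_1 = \max(0,\; x W + b_W), \qquad h_2 = \max(0,\; h_1 U + b_U), \qquad \hat{y} = \sigma(h_2 O + b_O)$$

The intermediate results have dimensions 1 x 3, 1 x 2 and 1 x 1, respectively.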

In [37]:
W = np.random.random((2, 3))
print(f"W {W}")
U = np.random.random((3, 2))
print(f"U {U}")
bias_W = np.random.random((1, 3))
print(f"bias_W {bias_W}")
bias_U = np.random.random((1, 2))
print(f"bias_U {bias_U}")
O = np.random.random((2, 1))
print(f"O {O}")
bias_O = np.random.random((1, 1))
print(f"bias_O {bias_O}")

W [[0.57916493 0.1989773  0.71685006]
 [0.06420334 0.23917944 0.03679699]]
U [[0.44530666 0.60784364]
 [0.77164787 0.40612112]
 [0.83222563 0.69558143]]
bias_W [[0.90328775 0.89391968 0.63126251]]
bias_U [[0.93231218 0.7755912 ]]
O [[0.6369282 ]
 [0.36734706]]
bias_O [[0.93714153]]


In [16]:
input_data = np.array([[x[0], x[1]] for x in training_data])
input_data_ground_truth = np.array([[x[2]] for x in training_data])
print(f"input_data {input_data}")
print(f"input_data_ground_truth {input_data_ground_truth}")

input_data [[0 0]
 [1 0]
 [1 1]
 [2 0]
 [3 1]
 [3 2]
 [2 3]
 [3 3]]
input_data_ground_truth [[1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]]


In [17]:
one_hot = np.array([0, 1, 0, 0, 0, 0, 0, 0])
one_hot.dot(input_data)

array([1, 0])

In [18]:
for row, true_score in zip(input_data, input_data_ground_truth):
    print(row, true_score)

[0 0] [1]
[1 0] [0]
[1 1] [0]
[2 0] [0]
[3 1] [1]
[3 2] [1]
[2 3] [1]
[3 3] [0]


In [38]:
def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [42]:
def loss_function(predicted, actual):
    # Log-likelihood of the true label (the negative of binary cross-entropy); closer to 0 is better.
    return np.log(predicted) if actual else np.log(1 - predicted)

In [None]:
learning_rate = 0.01

In [50]:
for row, true_score in zip(input_data, input_data_ground_truth):
    # print(row, true_score)
    hidden_layer_W = np.maximum(row.dot(W) + bias_W, 0)[0]  # ReLU activation
    # print(f"hidden_layer_W {hidden_layer_W}")
    hidden_layer_U = np.maximum(hidden_layer_W.dot(U) + bias_U, 0)[0]  # ReLU activation
    # print(f"hidden_layer_U {hidden_layer_U}")
    output = (sigmoid(hidden_layer_U.dot(O) + bias_O))[0][0]
    loss = loss_function(output, true_score[0])
    print(f"output {output} - true score: {true_score[0]} - loss {loss}")

output 0.9658545034605426 - true score: 1 - loss -0.03474207364924937
output 0.986959889282255 - true score: 0 - loss -4.3397252318950565
output 0.9894527613414252 - true score: 0 - loss -4.5518911918432865
output 0.995086368253607 - true score: 0 - loss -5.315741947375225
output 0.9985133193959704 - true score: 1 - loss -0.0014877868101581678
output 0.9988002123932317 - true score: 1 - loss -0.0012005079281317262
output 0.9974135571146144 - true score: 1 - loss -0.002589793507494032
output 0.9990317957413032 - true score: 0 - loss -6.940067481896969
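The row-by-row loop can also be written as one batched computation. The following is a minimal sketch, assuming the same layer shapes as in this notebook; the function name `forward`, the seeded random parameters and the variable names `hidden_1` and `hidden_2` are illustrative assumptions, so the printed probabilities will differ from the run shown here.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def forward(X, W, bias_W, U, bias_U, O, bias_O):
    """Forward pass for a whole batch X of shape (n_samples, 2)."""
    hidden_1 = np.maximum(X @ W + bias_W, 0)         # (n, 3), ReLU activation
    hidden_2 = np.maximum(hidden_1 @ U + bias_U, 0)  # (n, 2), ReLU activation
    return sigmoid(hidden_2 @ O + bias_O)            # (n, 1), probabilities


# Hypothetical parameters: same shapes as in the notebook, seeded for reproducibility.
rng = np.random.default_rng(0)
W, bias_W = rng.random((2, 3)), rng.random((1, 3))
U, bias_U = rng.random((3, 2)), rng.random((1, 2))
O, bias_O = rng.random((2, 1)), rng.random((1, 1))

# The eight (temperature, humidity) pairs from the training data.
X = np.array([[0, 0], [1, 0], [1, 1], [2, 0], [3, 1], [3, 2], [2, 3], [3, 3]])
predictions = forward(X, W, bias_W, U, bias_U, O, bias_O)
print(predictions.shape)  # (8, 1)
```

Broadcasting adds each 1 x k bias row to every row of the batch, so no explicit loop over the samples is needed.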


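The `loss_function` above returns the log-likelihood of the true label, which is why the printed values are negative. Binary cross-entropy is its negative, $-[y \ln(\hat{y}) + (1 - y) \ln(1 - \hat{y})]$, with $y$ the true label and $\hat{y}$ the predicted probability, so a smaller value means a better prediction.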
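A binary cross-entropy loss that is positive and averaged over several samples could look like the sketch below. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions introduced here for illustration; they are not part of the notebook's code.

```python
import numpy as np


def binary_cross_entropy(predicted, actual, eps=1e-12):
    """Mean binary cross-entropy; eps (an assumption here) guards against log(0)."""
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    actual = np.asarray(actual, dtype=float)
    losses = -(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))
    return losses.mean()


# A confident correct prediction costs little, a confident wrong one costs a lot.
print(binary_cross_entropy([0.9], [1]))  # about 0.105
print(binary_cross_entropy([0.9], [0]))  # about 2.303
```

Clipping keeps the loss finite when the sigmoid saturates at exactly 0 or 1.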

## Training

## Backpropagation 


### Derivative Rules


#### Constant Rule

$y = k$ with $k$ a constant: $\frac{dy}{dx}=0$


#### Power Rule

$y=x^n$ the derivative is: $\frac{dy}{dx} = n x^{n-1}$


#### Exponential Rule

$y=e^{kx}$ the derivative is: $\frac{dy}{dx}= k e^{kx}$


#### Natural Logarithm Rule

$y=\ln(x)$ the derivative is: $\frac{dy}{dx}=\frac{1}{x}$


#### Sum and Difference Rule

$y = u + v$ or $y = u - v$ the derivatives are: $\frac{dy}{dx} = \frac{du}{dx} + \frac{dv}{dx}$ or $\frac{dy}{dx} = \frac{du}{dx} - \frac{dv}{dx}$


#### Product Rule

$y = u v$  the derivative is: $\frac{dy}{dx} = \frac{du}{dx} v + \frac{dv}{dx} u$


#### Chain Rule

$y(x) = u(v(x))$ the derivative is: $\frac{dy}{dx} = \frac{du}{dv} \frac{dv}{dx}$
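To see how these rules combine in backpropagation, consider only the output node of the network: a sigmoid $p = \sigma(z)$ followed by the binary cross-entropy loss $L = -[y \ln(p) + (1 - y) \ln(1 - p)]$. The chain rule gives $\frac{dL}{dz} = \frac{dL}{dp} \frac{dp}{dz}$, which simplifies to $p - y$. The sketch below checks this against a numerical central difference; the function names `bce` and `analytic_gradient` and the sample values are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def bce(z, y):
    """Binary cross-entropy as a function of the pre-activation z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def analytic_gradient(z, y):
    """dL/dz via the chain rule: dL/dp * dp/dz."""
    p = sigmoid(z)
    dL_dp = -(y / p) + (1 - y) / (1 - p)  # natural logarithm rule
    dp_dz = p * (1 - p)                   # derivative of the sigmoid
    return dL_dp * dp_dz                  # simplifies to p - y


z, y, h = 0.7, 1.0, 1e-6
numeric = (bce(z + h, y) - bce(z - h, y)) / (2 * h)  # central difference
print(np.isclose(numeric, analytic_gradient(z, y)))         # True
print(np.isclose(analytic_gradient(z, y), sigmoid(z) - y))  # True
```

The gradients for $O$, $U$ and $W$ follow by applying the chain rule further back through each layer.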


