# Multilayer Perceptron (MLP)

- "vanilla" feed-forward neural network

- consists of an input layer, one or more hidden layers and an output layer (a network with several hidden layers is a deep neural network, hence "deep learning")

- trained using the backpropagation algorithm

![MLP](img/mlp.png)

# Forward pass

- passing data through the network until output $\hat{\mathbf{y}}$ is calculated

- each layer multiplies its input (the input matrix $\mathbf{X}$ for the first layer, the previous layer's output otherwise) by its weight matrix $\mathbf{W}_n$, adds a bias vector $\mathbf{b}_n$ and passes the result to an activation function $f_n$; this is repeated until the output layer is reached

- it is quite common for the output layer to use a different activation function than the hidden layers

- forward pass for the image above:

$$
\begin{align*}

\begin{aligned}
\mathbf{Z}_1 &= \mathbf{X} \cdot \mathbf{W}_1 + \mathbf{b}_1 \\
\mathbf{U}_1 &= f_1(\mathbf{Z}_1) \\
&\text{Hidden layer 1}
\end{aligned} \quad \quad

\begin{aligned}
\mathbf{Z}_2 &= \mathbf{U}_1 \cdot \mathbf{W}_2 + \mathbf{b}_2 \\
\mathbf{U}_2 &= f_2(\mathbf{Z}_2) \\
&\text{Hidden layer 2}
\end{aligned} \quad \quad

\begin{aligned}
\mathbf{Z}_3 &= \mathbf{U}_2 \cdot \mathbf{W}_3 + \mathbf{b}_3 \\
\mathbf{U}_3 &= f_3(\mathbf{Z}_3) \\
&\text{Hidden layer 3}
\end{aligned} \quad \quad

\begin{aligned}
\mathbf{Z}_4 &= \mathbf{U}_3 \cdot \mathbf{W}_4 + \mathbf{b}_4 \\
\mathbf{U}_4 &= f_4(\mathbf{Z}_4) \\
&\text{Hidden layer 4}
\end{aligned} \quad \quad

\begin{aligned}
\mathbf{Z}_5 &= \mathbf{U}_4 \cdot \mathbf{W}_5 + \mathbf{b}_5 \\
\hat{\mathbf{y}} &= f_5(\mathbf{Z}_5) \\
&\text{Output layer}
\end{aligned}

\end{align*}
$$
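
The equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in this notebook: the layer sizes, the choice of ReLU for $f_1$ to $f_4$ and the identity for $f_5$, and the names `sizes`, `weights`, `biases`, `activations` and `forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 input features, four hidden layers, 2 outputs
sizes = [3, 8, 8, 8, 8, 2]

# One weight matrix W_n of shape (n_in, n_out) and one bias vector b_n per layer
weights = [0.1 * rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]


def relu(z):
    return np.maximum(0, z)


def identity(z):
    return z


# Assumed activations: ReLU for f_1..f_4, identity for f_5 (e.g. regression)
activations = [relu, relu, relu, relu, identity]


def forward(X):
    U = X
    for W, b, f in zip(weights, biases, activations):
        Z = U @ W + b  # Z_n = U_{n-1} . W_n + b_n (b_n is broadcast over the rows)
        U = f(Z)       # U_n = f_n(Z_n)
    return U           # output of the last layer, i.e. y_hat


X = rng.standard_normal((4, 3))  # batch of 4 samples with 3 features each
y_hat = forward(X)
print(y_hat.shape)  # (4, 2)
```

Each row of `X` is one sample, so the batch dimension is carried through every layer unchanged and only the number of columns follows the layer sizes.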


# Backpropagation (backward pass)

- distributing gradients backwards through the network using the chain rule

- gradient calculation starts at the output layer using the loss function $L(\hat{\mathbf{y}}, \mathbf{y})$

- calculated gradients of network parameters (weights $\mathbf{W}$ and biases $\mathbf{b}$) are used to minimize the loss function using gradient descent

- backward pass for the image above ($\cdot$ denotes the matrix product, $\odot$ the element-wise (Hadamard) product, and $\sum$ a sum over the samples in the batch):

$$
\begin{align*}

\frac{\partial L}{\partial \mathbf{Z}_5} &= \frac{\partial L}{\partial \hat{\mathbf{y}}} \cdot \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{Z}_5} \\
&= \frac{\partial L}{\partial \hat{\mathbf{y}}} \odot f_5'(\mathbf{Z}_5) \\
&= \delta_{5}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{W}_5} &=  \frac{\partial L}{\partial \mathbf{Z}_5} \cdot \frac{\partial \mathbf{Z}_5}{\partial \mathbf{W}_5} \\
&= \mathbf{U}_{4}^{\intercal} \cdot \frac{\partial L}{\partial \mathbf{Z}_5} \\
&= \mathbf{U}_{4}^{\intercal} \cdot \delta_{5}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{b}_5} &=  \frac{\partial L}{\partial \mathbf{Z}_5} \cdot \frac{\partial \mathbf{Z}_5}{\partial \mathbf{b}_5} \\
&= \delta_{5} \cdot 1 \\
&= \sum \delta_{5}

\end{align*}
$$

$$
\begin{align*}

\frac{\partial L}{\partial \mathbf{U}_4} &= \frac{\partial L}{\partial \mathbf{Z}_5} \cdot \frac{\partial \mathbf{Z}_5}{\partial \mathbf{U}_4} \\
&= \frac{\partial L}{\partial \mathbf{Z}_5} \cdot \mathbf{W}_{5}^{\intercal} \\
&= \delta_{5} \cdot \mathbf{W}_{5}^{\intercal}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{Z}_4} &= \frac{\partial L}{\partial \mathbf{U}_4} \cdot \frac{\partial \mathbf{U}_4}{\partial \mathbf{Z}_4} \\
&= \left( \delta_{5} \cdot \mathbf{W}_{5}^{\intercal} \right) \odot f_4'(\mathbf{Z}_4)\\
&= \delta_{4}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{W}_4} &=  \frac{\partial L}{\partial \mathbf{Z}_4} \cdot \frac{\partial \mathbf{Z}_4}{\partial \mathbf{W}_4} \\
&= \mathbf{U}_{3}^{\intercal} \cdot \frac{\partial L}{\partial \mathbf{Z}_4} \\
&= \mathbf{U}_{3}^{\intercal} \cdot \delta_{4}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{b}_4} &=  \frac{\partial L}{\partial \mathbf{Z}_4} \cdot \frac{\partial \mathbf{Z}_4}{\partial \mathbf{b}_4} \\
&= \delta_{4} \cdot 1 \\
&= \sum \delta_{4}

\end{align*}

$$

$$\vdots \\
\vdots$$

$$
\begin{align*}

\frac{\partial L}{\partial \mathbf{U}_1} &= \frac{\partial L}{\partial \mathbf{Z}_2} \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{U}_1} \\
&= \frac{\partial L}{\partial \mathbf{Z}_2} \cdot \mathbf{W}_{2}^{\intercal} \\
&= \delta_{2} \cdot \mathbf{W}_{2}^{\intercal}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{Z}_1} &= \frac{\partial L}{\partial \mathbf{U}_1} \cdot \frac{\partial \mathbf{U}_1}{\partial \mathbf{Z}_1} \\
&= \left( \delta_{2} \cdot \mathbf{W}_{2}^{\intercal} \right) \odot f_1'(\mathbf{Z}_1)\\
&= \delta_{1}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{W}_1} &=  \frac{\partial L}{\partial \mathbf{Z}_1} \cdot \frac{\partial \mathbf{Z}_1}{\partial \mathbf{W}_1} \\
&= \mathbf{X}^{\intercal} \cdot \frac{\partial L}{\partial \mathbf{Z}_1} \\
&= \mathbf{X}^{\intercal} \cdot \delta_{1}

\end{align*} \quad \quad

\begin{align*}

\frac{\partial L}{\partial \mathbf{b}_1} &=  \frac{\partial L}{\partial \mathbf{Z}_1} \cdot \frac{\partial \mathbf{Z}_1}{\partial \mathbf{b}_1} \\
&= \delta_{1} \cdot 1 \\
&= \sum \delta_{1}

\end{align*}

$$
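
The recursion above can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses $\tanh$ for the hidden layers, the identity for the output layer and the loss $L = \frac{1}{2} \sum (\hat{\mathbf{y}} - \mathbf{y})^2$, none of which are prescribed by the text, and all names (`Ws`, `bs`, `forward`, `loss`, `backward`, `lr`) are hypothetical. It compares one analytic gradient entry with a central finite difference and then takes a single gradient descent step.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 5, 5, 2]  # hypothetical layer sizes
Ws = [0.5 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
X = rng.standard_normal((4, 3))
y = rng.standard_normal((4, 2))


def forward(Ws, bs, X):
    """Return [X, U_1, ..., U_4, y_hat]; tanh hidden layers, identity output."""
    Us = [X]
    for i, (W, b) in enumerate(zip(Ws, bs)):
        Z = Us[-1] @ W + b
        Us.append(Z if i == len(Ws) - 1 else np.tanh(Z))
    return Us


def loss(Ws, bs, X, y):
    # Assumed loss: half the sum of squared errors over the whole batch
    return 0.5 * np.sum((forward(Ws, bs, X)[-1] - y) ** 2)


def backward(Ws, bs, X, y):
    Us = forward(Ws, bs, X)
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    # Output layer: dL/dy_hat = y_hat - y and f' = 1, so delta = y_hat - y
    delta = Us[-1] - y
    for i in reversed(range(len(Ws))):  # Ws[i] is W_{i+1}, Us[i] is its input
        dWs[i] = Us[i].T @ delta        # dL/dW = (layer input)^T . delta
        dbs[i] = delta.sum(axis=0)      # dL/db = delta summed over the samples
        if i > 0:
            dU = delta @ Ws[i].T            # gradient w.r.t. the layer input
            delta = dU * (1 - Us[i] ** 2)   # tanh'(Z) = 1 - tanh(Z)^2
    return dWs, dbs


dWs, dbs = backward(Ws, bs, X, y)
analytic = dWs[0][0, 0]

# Central finite difference on the same entry of the first weight matrix
eps = 1e-6
Ws[0][0, 0] += eps
loss_plus = loss(Ws, bs, X, y)
Ws[0][0, 0] -= 2 * eps
loss_minus = loss(Ws, bs, X, y)
Ws[0][0, 0] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print(analytic, numeric)

# One gradient descent step with a hypothetical learning rate
lr = 1e-3
before = loss(Ws, bs, X, y)
Ws = [W - lr * dW for W, dW in zip(Ws, dWs)]
bs = [b - lr * db for b, db in zip(bs, dbs)]
after = loss(Ws, bs, X, y)
print(before, after)
```

A smooth activation is used here because a finite-difference check can be unreliable near the kink of ReLU at zero. The two printed gradient values should agree to several digits, and the loss after the update should be slightly lower than before.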

# Implementation

In [None]:
import numpy as np
import matplotlib.pyplot as plt

class Layer:
    
    def __init__(self, n_inputs: int, n_neurons: int) -> None:
        """
        Layer of neurons consisting of a weight matrix and bias vector.

        Parameters
        ----------
        n_inputs : int
            Number of inputs that connect to the layer.

        n_neurons : int
            Number of neurons the layer consists of.

        Attributes
        ----------
        weights : numpy.ndarray
            Matrix of weight coefficients.

        biases : numpy.ndarray
            Vector of bias coefficients.
        """

        # Weights are randomly initialized, small random numbers seem to work well
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        # Bias vector is initialized to a zero vector
        self.biases = np.zeros(n_neurons)

    def forward(self, inputs: np.ndarray) -> None:
        """
        Forward pass using the layer. Creates output attribute.

        Parameters
        ----------
        inputs : numpy.ndarray
            Input matrix.

        Returns
        -------
        None
        """
        # Store inputs for later use (backpropagation)
        self.inputs = inputs
        # Output is the dot product of the input matrix and weights plus biases
        self.output = np.dot(inputs, self.weights) + self.biases

    def backward(self, delta: np.ndarray) -> None:
        """
        Backward pass using the layer. Creates gradient attributes for weights, biases and inputs.

        Parameters
        ----------
        delta : np.ndarray
            Accumulated gradient obtained by backpropagation.

        Returns
        -------
        None
        """
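        # Gradient w.r.t. the weights: inputs^T . delta
        self.dweights = np.dot(self.inputs.T, delta)
        # Gradient w.r.t. the biases: delta summed over the samples (rows)
        self.dbiases = np.sum(delta, axis=0)
        # Gradient w.r.t. the inputs, passed on to the previous layer: delta . weights^T
        self.dinputs = np.dot(delta, self.weights.T)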

class ReLU:

    def __init__(self) -> None:
        """
        Rectified Linear Unit activation function.
        """
        pass

    def forward(self, inputs: np.ndarray) -> None:
        """
        Forward pass using ReLU. Creates output attribute.

        Parameters
        ----------
        inputs : numpy.ndarray
            Input matrix.

        Returns
        -------
        None
        """
        # Store inputs for later use (backpropagation)
        self.inputs = inputs
        # Output is max value between 0 and inputs
        self.output = np.maximum(0, inputs)

    def backward(self, delta: np.ndarray) -> None:
        """
        Backward pass using ReLU. Creates gradient attribute for inputs.

        Parameters
        ----------
        delta : np.ndarray
            Accumulated gradient obtained by backpropagation.

        Returns
        -------
        None
        """
        # ReLU derivative is 1 for positive inputs and 0 otherwise (0 is used at the kink)
        self.dinputs = delta.copy()
        self.dinputs[self.inputs <= 0] = 0

a = ReLU()