A Multi-Layer Perceptron (MLP) is an Artificial Neural Network (ANN) that consists of:

 - an input layer (which is usually not counted as a layer)

 - an output layer

 - one or more hidden layers in between the input and output layer

All neurons of a layer are connected to all neurons of the successive layer. A layer with this property is also called a fully-connected layer or a dense layer.

![mlpL3.png](attachment:mlpL3.png)
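
As a minimal sketch of what "fully connected" means computationally: every neuron of a layer computes a weighted sum over all outputs of the previous layer, so a whole layer reduces to one matrix-vector product followed by an activation. The function name `dense_forward`, the sigmoid activation and the example weights below are illustrative assumptions, not part of the original material.

```python
import numpy as np

def dense_forward(x, W, b):
    """Output of a fully-connected layer with sigmoid activation.

    W has one row per neuron and one column per input value,
    so every neuron is connected to every input.
    """
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer: 3 inputs -> 2 neurons
x = np.array([1.0, 0.0, -1.0])
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.8, -0.5]])
b = np.array([0.0, 0.1])

y = dense_forward(x, W, b)
print(y.shape)  # (2,)
```

Stacking several such layers, each one feeding its output into the next, gives the forward pass of an MLP.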

In contrast to Single-Layer Perceptrons (SLPs), Multi-Layer Perceptrons are able to learn non-linear models, because their hidden layers apply non-linear activation functions.
This difference is depicted below:
 - the left-hand side shows the linear classification boundary, as learned by an SLP
 - the right-hand side shows the non-linear boundary, as learned by an MLP from the same training data

![nonlinearClassification.png](attachment:nonlinearClassification.png)
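
XOR is a classic example of data that no single linear boundary can separate. The sketch below uses one hidden layer with hand-picked weights and a step activation to compute it. The weights are an assumption chosen for illustration (they are not learned), and the names `step` and `xor_mlp` are hypothetical.

```python
import numpy as np

def step(z):
    # Threshold activation: 1 if the weighted sum is positive, else 0
    return (z > 0).astype(int)

def xor_mlp(x):
    # Hidden layer: neuron 0 computes OR, neuron 1 computes NAND
    W_hidden = np.array([[1.0, 1.0],
                         [-1.0, -1.0]])
    b_hidden = np.array([-0.5, 1.5])
    h = step(W_hidden @ x + b_hidden)
    # Output neuron: AND of the two hidden neurons
    w_out = np.array([1.0, 1.0])
    b_out = -1.5
    return int(step(w_out @ h + b_out))

inputs = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
outputs = [xor_mlp(x) for x in inputs]
print(outputs)  # [0, 1, 1, 0]
```

Each hidden neuron on its own is a linear classifier; only their combination in the output neuron yields the non-linear decision.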

Hyperparameters of the architecture and of the training process:
 - number of hidden layers
 - number of neurons per layer
 - activation function
 - loss function
 - optimizer
 - learning rate
 - batch size
 - number of epochs and/or termination criteria

The number of hidden layers, like the other hyperparameters, strongly influences the network’s performance.

Appropriate values for these hyperparameters depend strongly on the application and data at hand. They cannot be calculated analytically but have to be determined through evaluation and optimization experiments.
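
One simple way to organize such experiments is a grid search, which tries every combination of a few candidate values. The sketch below only enumerates the configurations. The search space, its keys and its values are placeholders assumed for illustration, not recommendations.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations
search_space = {
    "hidden_layers": [1, 2],
    "neurons_per_layer": [2, 8],
    "activation": ["sigmoid", "relu"],
    "learning_rate": [0.01, 0.5],
}

# Build one configuration per combination of values (a simple grid search)
names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(configs))  # 16
print(configs[0])
```

Each configuration would then be trained and scored on held-out validation data; that evaluation step is omitted here.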

Activation functions in hidden layers

The most commonly used activation functions for the hidden layers are:

 - sigmoid

 - tanh

 - relu

Finding the best, or at least an appropriate, activation function for the application and data at hand requires empirical analysis.
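
For reference, the three functions can be written in a few lines of NumPy. This is a minimal sketch; the function names simply follow the list above.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real value into the interval (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive values unchanged and sets negative values to 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))  # [0. 0. 2.]
```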

Activation functions in the output layer and loss functions

Regression:

 - Number of neurons in the output-layer: 1

 - Activation function in the output-layer: linear

 - Loss Function: Mean Squared Error, Mean Squared Logarithmic Error, Mean Absolute Error
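
The three regression losses can be sketched in NumPy as follows. The function names and the example values are assumptions made for illustration.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mean_absolute_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_squared_log_error(y, y_hat):
    # log1p(v) = log(1 + v); requires targets and predictions >= 0
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

y = np.array([1.0, 2.0, 3.0])      # true values
y_hat = np.array([1.0, 2.0, 5.0])  # predictions, the last one is off by 2

print(mean_squared_error(y, y_hat))   # 4/3
print(mean_absolute_error(y, y_hat))  # 2/3
print(mean_squared_log_error(y, y_hat))
```

The squared error penalizes the single large deviation more heavily than the absolute error does, which is one reason the choice of loss depends on how outliers should be treated.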


Binary Classification:

 - Number of neurons in the output-layer: 1

 - Activation function in the output-layer: sigmoid

 - Loss Function: Binary Cross-Entropy
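
A minimal NumPy sketch of this combination: the sigmoid turns the single output into a probability for class 1, and the binary cross-entropy compares it with the 0/1 label. The function names and the clipping constant `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip the probabilities so that log(0) can never occur
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([0.0, 1.0])            # true labels
p = sigmoid(np.array([0.0, 0.0]))   # an undecided network outputs 0.5 for both
print(binary_cross_entropy(y, p))   # log(2), about 0.693
```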


Multi-Class Classification:

 - Number of neurons in the output-layer: same as the number of classes (possible outcomes)

 - Activation function in the output-layer: softmax

 - Loss Function: Multi-Class (Categorical) Cross-Entropy, Sparse Multi-Class Cross-Entropy
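
The sketch below shows softmax together with both variants of the cross-entropy. The two losses differ only in how the target is encoded: as a one-hot vector or as an integer class index. The function names and the clipping constant `eps` are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def categorical_cross_entropy(y_one_hot, p, eps=1e-12):
    # Target given as a one-hot vector
    return -np.sum(y_one_hot * np.log(np.clip(p, eps, 1.0)))

def sparse_categorical_cross_entropy(class_index, p, eps=1e-12):
    # Same loss, but the target is given as an integer class index
    return -np.log(np.clip(p[class_index], eps, 1.0))

p = softmax(np.array([2.0, 1.0, 0.1]))  # class probabilities for 3 classes
loss_one_hot = categorical_cross_entropy(np.array([1.0, 0.0, 0.0]), p)
loss_sparse = sparse_categorical_cross_entropy(0, p)
print(p.sum())  # 1.0 up to rounding
print(loss_one_hot, loss_sparse)
```

Both calls describe the same target (class 0), so they return the same loss value.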

More about activation and loss functions at https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/.

In [None]:
import random
import numpy as np

In [None]:
class Layer:
    def __init__(self, n_outputs, n_inputs=0):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs

        self.weights = []  # one row of weights per neuron, shape (n_outputs, n_inputs)
        self.bias = []  # one value for each neuron, shape (n_outputs,)
        self.last_deltas = []  # one value for each neuron, shape (n_outputs,)
        self.last_outputs = []  # one value for each neuron, shape (n_outputs,)

    def init_weights(self):
        # Small random starting values break the symmetry between neurons
        self.weights = np.random.uniform(-0.5, 0.5, (self.n_outputs, self.n_inputs))
        self.bias = np.random.uniform(-0.5, 0.5, self.n_outputs)

    def activate(self, X):
        # Compute Y_hat for one input vector: sigmoid(W . X + b)
        z = np.dot(self.weights, X) + self.bias
        self.last_outputs = 1.0 / (1.0 + np.exp(-z))
        return self.last_outputs

In [None]:
class ANN:
    def __init__(self):
        self.layers = []

    def add_layer(self, layer):
        # Add a new layer and init the weights for this layer
        if layer.n_inputs == 0 and self.layers:
            # Infer the input size from the previous layer
            layer.n_inputs = self.layers[-1].n_outputs
        layer.init_weights()
        self.layers.append(layer)

    def forward_propagate(self, X):  # aka predict() or feed_forward()
        # Compute the output values for each layer; each output feeds the next layer
        out = np.asarray(X, dtype=float)
        for layer in self.layers:
            out = layer.activate(out)
        return out

    def derivative(self, V):
        # Derivative of the sigmoid, expressed through its output V = sigmoid(z)
        return V * (1.0 - V)

    def cost_function_mse(self, Y, Y_hat):
        # Squared error, summed over the output neurons of one sample
        return np.sum((Y - Y_hat) ** 2)

    def backward_propagate_error(self, Y):
        # Deltas of the last layer: error * derivative of the last layer's output
        last = self.layers[-1]
        last.last_deltas = (last.last_outputs - Y) * self.derivative(last.last_outputs)

        for i in reversed(range(len(self.layers) - 1)):
            # Error of the current layer: deltas of the next layer (dot product)
            # the weights between the current and the next layer
            layer, nxt = self.layers[i], self.layers[i + 1]
            error = np.dot(nxt.weights.T, nxt.last_deltas)
            layer.last_deltas = error * self.derivative(layer.last_outputs)

    def gradient_descent_step(self, X, alpha):
        # Take a step in the direction of steepest descent
        inputs = np.asarray(X, dtype=float)  # the first layer sees the raw features
        for layer in self.layers:
            layer.weights -= alpha * np.outer(layer.last_deltas, inputs)
            layer.bias -= alpha * layer.last_deltas
            inputs = layer.last_outputs  # following layers use the previous layer's outputs

    def fit(self, X, Y, epochs=10, lr=0.01):
        # Train with stochastic gradient descent: one update per training row
        for epoch in range(epochs):
            err = 0
            for row in range(len(X)):
                Y_hat = self.forward_propagate(X[row])
                err += self.cost_function_mse(Y[row], Y_hat)

                # Take a gradient descent step
                self.backward_propagate_error(Y[row])
                self.gradient_descent_step(X[row], lr)

            print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, lr, err))

In [None]:
dataset = np.array([[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]])

# preprocessing -> compute X and Y
# Y is one-hot encoded: 2 values per entry, 1 at the index of the right class, 0 elsewhere
X = dataset[:, :-1]
labels = dataset[:, -1].astype(int)
Y = np.eye(len(set(labels)))[labels]

n_inputs = len(dataset[0]) - 1
n_outputs = len(set([row[-1] for row in dataset]))

net = ANN()
# Add 2 layers to the network
# First layer with 2 neurons
net.add_layer(Layer(2, n_inputs))
# The 2nd layer with the number of classes
net.add_layer(Layer(n_outputs, 2))

# Train the network for 20 epochs with a learning rate of 0.5
net.fit(X, Y, epochs=20, lr=0.5)


In [None]:
dataset = np.array([[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]])

# compute X and Y, same as previous dataset
X = dataset[:, :-1]
Y = np.eye(2)[dataset[:, -1].astype(int)]

In [None]:
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(2,)),  # 2 input features
    tf.keras.layers.Dense(2, activation="sigmoid", use_bias=True),
    tf.keras.layers.Dense(2, activation="softmax", use_bias=True)
])

# MSE mirrors the from-scratch network above; cross-entropy is the usual choice with softmax
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5), loss='mean_squared_error')
model.fit(X, Y, epochs=20)