#### This submission is for... (*put up to three people*)
- Abraham Gassama (2285843)
- Zhuo Le Lee (2317240)


# Exercise 1 - Simple Neural Network

In the first exercise, you will incrementally build a simple neural network from scratch with numpy. Our first challenge is solving the XOR task that you've seen in the lecture, before we move to a slightly more complex problem, namely the Iris dataset.

You can receive up to three points for your implementation of Exercise 1. After that you can either choose Exercise 2A or Exercise 2B to receive another three points. In sum, you can get up to six bonus points for the exam.

- **Exercise 2A**: Building a Transformer network with PyTorch applied in the NLP domain
- **Exercise 2B**: Building a GAN network with PyTorch applied in the image domain

**Important Notice**: Throughout the notebook, basic structures are provided such as functions and classes without bodies or partial bodies, and variables that you need to assign to. **Don't change the names of functions, variables, and classes - and make sure that you are using them!** You're allowed to introduce helper variables and functions. Occasionally, we use **type annotations** that you should follow. They are not enforced by Python. Whenever you see an ellipsis `...` or TODO comment, you're supposed to insert code.

## XOR Task

XOR (exclusive OR) is a logic function that gives 1 as an output when the number of true inputs is odd, otherwise it outputs a 0. Our goal is to model this function using neurons. We'll start with a single neuron.

<center><img src="https://community.anaplan.com/t5/image/serverpage/image-id/29631i3AA6C01377A8550F/image-size/large?v=v2&px=999" width="250"/></center>

## A Single Neuron (Perceptron)

Let's start with importing some necessary dependencies that we will need throughout the notebook.

In [94]:
import numpy as np
import math

In the first part of this exercise you'll build a perceptron, a single neuron, that takes both binary input values and returns a binary output value.

<center><img src="https://i.stack.imgur.com/eBSki.jpg" width="280" /></center>

A perceptron can be seen as a single neuron that maps an input $\textbf{x}$ to an output $o$ using weights $\textbf{w}$ and a bias $b$, where $\cdot$ denotes the dot product.

$o = \textbf{w}\cdot \textbf{x}+b$

#### Perceptron Update Rule

In the lecture we learned about the **Perceptron algorithm** / **Perceptron update rule** which we can apply to binary classification problems. Let's use it here to have a first baseline.

For classification problems, $o>0$ is interpreted as class 1, and $o<0$ is interpreted as class 0.

For updating the associated weights, we can use the following update rule:

$w_i \leftarrow w_i + \Delta w_i$

where

$\Delta w_i = \eta(t-o)x_i$

- $t$ is the target
- $o$ is the output
- $\eta$ is the learning rate (a small constant)
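
A single application of the update rule above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `perceptron_step` is hypothetical, and the bias is assumed to be updated like a weight whose input is always 1, which the rule above does not spell out.

```python
import numpy as np

def perceptron_step(w, b, x, t, eta=1.0):
    """One perceptron update for a single sample (hypothetical helper)."""
    # Thresholded output: class 1 if w.x + b > 0, otherwise class 0
    o = 1.0 if np.dot(w, x) + b > 0 else 0.0
    # Update term eta * (t - o) * x_i, applied to every weight at once
    w_new = w + eta * (t - o) * x
    # Assumption: the bias is treated as a weight on a constant input of 1
    b_new = b + eta * (t - o)
    return w_new, b_new, o

w, b = np.zeros(2), 0.0
w, b, o = perceptron_step(w, b, np.array([1.0, 0.0]), t=1.0)
print(w, b, o)
```

When the prediction already equals the target, `t - o` is zero and nothing changes, so only misclassified samples move the weights.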

### Implementation of a Perceptron

In [95]:
class perceptron_implementation():
    def __init__(self):
        self.neuron_weights = None
        self.bias = None
        self.initialize_weights()
        
    def initialize_weights(self):
        # TODO: 
        # Initialize weights 
        # For perceptrons, it's possible to initialize the weights with 0
        self.neuron_weights = np.array([0.0, 0.0])
        self.bias = 1.0
        # END TODO

    def forward_pass(self, x):
        # TODO
        # Implement forward propagation
        # x is returned as a helper variable for the update rule
        output = None
        output = self.neuron_weights[0]*x[0] + self.neuron_weights[1]*x[1] + self.bias
        
        if output > 0.0 :
          output = 1.0
        elif output < 0.0 :
          output = 0.0

        
        # END TODO
        return output, x

    def perceptron_update_rule(self, target, prediction, input_perceptron_ur, learning_rate = 1):
        # TODO
        # Perform perceptron update rule that is defined above
        # use self.neuron_weights
        new_weights = None      
        update_delta = (learning_rate*(target - prediction))*input_perceptron_ur
        print(update_delta)
        new_weights = self.neuron_weights + update_delta
        # END TODO
        self.neuron_weights = new_weights

    def train(self, input_data, targets):
        """
        input_data: Multi-dimensional array that contains all inputs
        """
        # TODO
        # Call the necessary functions to train a single neuron for the given task
        # Complete the rest of the code to correctly train the model

        output_tr = None
        input_tr = None

        for i in range(len(input_data)) : 
          output_tr, input_tr  =  self.forward_pass(input_data[i])
          self.perceptron_update_rule(targets[i], output_tr, input_tr, learning_rate = 1)
        # END TODO

    def inference(self, input_data):
        # TODO
        # Test the trained neuron
        # Because forward_pass returns two values, the second one is stored in a variable that is not returned
        output = []
        var1 = None
        var2 = None

        
        for i in range(len(input_data)) : 
          var1, var2 = self.forward_pass(input_data[i])       
          output.append(var1)

        output = np.asarray(output)

        # END TODO
        return output

### Training

In [96]:
perceptron = perceptron_implementation()

# TODO

input_data = np.array([(0,0) , (1,0) , (0,1) , (1,1)], dtype=float)
targets = np.array([(0), (1), (1), (0)], dtype = float)


# train the corresponding single neuron 
perceptron.train(input_data, targets)
# END TODO

[-0. -0.]
[0. 0.]
[0. 0.]
[-1. -1.]


### Inference

In [97]:
# TODO
# Test the trained model

predictions = perceptron.inference(input_data)
print(predictions)
# END TODO

[1. 0. 0. 0.]


### Evaluation

For evaluation, we will need to consider appropriate metrics. For classification tasks, **accuracy** is one of the most common metrics.

It is defined as:

$\textrm{Accuracy}=\frac{1}{N}\sum_i^N1(y_i=\hat{y}_i)$

where $y$ is an array of our target values, and $\hat{y}$ is an array of our predictions.

For accuracy, if the outputs are probabilities, a threshold is needed to transform them into binary `(0,1)` predictions. We will set this threshold to `0.5`. Our perceptron does not need this, since it already outputs binary values; however, we will use the `accuracy` function later on, so the predictions should be treated as probabilities.
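
For the binary case, the definition above can be written in a few vectorized lines. The following is a minimal sketch rather than the required solution; the name `accuracy_sketch` is hypothetical, and it assumes one-dimensional arrays of probabilities and binary targets.

```python
import numpy as np

def accuracy_sketch(predictions, targets, threshold=0.5):
    """Fraction of thresholded predictions that equal the binary targets."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Threshold into a new array so the caller's data is not modified
    predicted_classes = (predictions > threshold).astype(float)
    return float(np.mean(predicted_classes == targets))

print(accuracy_sketch([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
```

Working on a thresholded copy keeps the original probabilities available for later inspection.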

In [98]:
def accuracy(predictions: np.ndarray, targets: np.ndarray, threshold=0.5) -> float:
    # TODO
    # Implement the accuracy metric
    # The Iris task has different dimensions than the other tasks: its targets are one-hot encoded,
    # so the predictions are converted to one-hot vectors before they are compared.
    # The first branch handles the Iris case, the second branch handles all other tasks.
    # The one-hot encodings could have been converted back to integer class labels instead, but this approach was already in place.
    if len(targets)>5 :

      prediction_copy = predictions
      for i in range(len(predictions)):
        copy = predictions[i]
        max_value = np.max(copy)
        for j in range (len(predictions[0])):

            if copy[j] == max_value:
                copy[j] = 1.
            else: 
                copy[j] = 0.
        prediction_copy[i] = copy

      help_value = 0.0

      for i in range(len(predictions)):
        if (prediction_copy[i] == targets[i]).all() == True:
            help_value = help_value + 1

      output = help_value/len(predictions)
      return output, prediction_copy

    else: 
      for i in range(len(targets)) : 
        if targets[i] > 0.5 : 
          targets[i] = 1.0
        else: 
          targets[i] = 0.0

        if predictions[i] > 0.5 : 
          predictions[i] = 1.0
        else: 
          predictions[i] = 0.0
      help_value = 0.0
      for i in range(len(targets)):
        if targets[i] == predictions[i] : 
          help_value += 1

      accuracy_value = help_value/len(targets)
             
    
      return accuracy_value

# END TODO

# TODO
# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value = accuracy(predictions, targets)
print(accuracy_value)
# END TODO

0.25


## Multiple Neurons

The perceptron algorithm can't be generalized to multiple neurons or even layers of neurons, which is why we will now use **backpropagation**. This requires us to have a **loss function**.

For our XOR task, our goal is now to build a network akin to this, i.e., a network with three single-neuron hidden layers:

<img src="https://i.imgur.com/oErVmm2.png">

#### Backpropagation

<center><img src="https://i.imgur.com/LgBzpYD.png" width="400" /></center>

### Sigmoid Activation Function

For a binary classification problem, we can use the sigmoid activation function in the output layer, which outputs values in the range between 0 and 1. So, for a positive case (class 1), we can interpret $p_1 = \sigma(o)$ as the probability of that class, while $p_0 = 1 - p_1$ can be seen as the probability of the negative case (class 0).
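
As an illustration of this interpretation, here is a small sketch of the sigmoid and its derivative. The function names are hypothetical, and the sketch uses `np.exp` so that it accepts arrays as well as scalars.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; works for scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def sigmoid_derivative_from_output(s):
    """d(sigma)/dz expressed through the already computed output s = sigma(z)."""
    return s * (1.0 - s)

p1 = sigmoid(0.0)   # probability of class 1 for o = 0
p0 = 1.0 - p1       # probability of class 0
print(p1, p0, sigmoid_derivative_from_output(p1))
```

Expressing the derivative through the output avoids recomputing the exponential during backpropagation.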

In [99]:
class sigmoid_activation_function():

    def forward(self, input_data):
        output = None
        
        # TODO
        # implement Sigmoid function for the input_data
        output = 1.0/(1.0 + math.exp(-input_data))
        # END TODO
        return output
    
    def backward(self, gradients):
        # calculate the gradients with help of the derivative
        derivate_sigmoid = None

        derivate_sigmoid = gradients*(1.0 - gradients)

        return derivate_sigmoid

### Loss Function (Binary Cross Entropy)

$L=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p(y_i))+(1-y_i)\log(1-p(y_i))\right]$

where $N$ is the batch size.
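
The loss above and its derivative can be sketched in vectorized form. This is a hedged illustration: `bce_loss`, `bce_grad`, and the clipping constant `eps` are assumptions introduced here to avoid `log(0)`, not part of the exercise template.

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    """Mean binary cross-entropy over a batch of probabilities p and targets y."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def bce_grad(p, y, eps=1e-12):
    """Derivative of the per-sample loss with respect to the probability p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return -(y / p) + (1.0 - y) / (1.0 - p)

print(bce_loss([0.5, 0.5], [1, 0]))
print(bce_grad(0.5, 1), bce_grad(0.5, 0))
```

Because of the leading minus sign, the loss is never negative. The same expression covers both target values, so no separate branch for `y = 0` is needed.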

In [100]:
class binary_cross_entropy():

    def forward(self, output, target, batch_size=None):
        loss = 0.0
        
        # TODO
        # implement Binary Cross-Entropy loss function for output, target, batch_size (batch_size added by us)


        if batch_size == None:
            
            if target == 1:

                loss = (target*np.log(output) + (1.0 - target)*np.log(1.0-output))

            else:

                loss = (target*np.log(1.0-output) + (1.0 - target)*np.log(1.0-(1.0-output)))
        
        else:
            
            if target == 1:

                for i in range(batch_size):
                    loss = loss - (1.0/batch_size)*(target*np.log(output) + (1.0 - target)*np.log(1.0-output))
                   
            else:

                for i in range(batch_size):
                    loss = loss - (1.0/batch_size)*(target*np.log(1.0-output) + (1.0 - target)*np.log(1.0-(1.0-output)))
                    

        # END TODO
        return loss, output, target
    
    def backward(self, output, target):
         # calculate the gradients with help of the derivative

        if target == 1:

            derivate_bce = (-(target/output) + (1.0 - target)/(1.0 - output))

        else:

            derivate_bce = (-(target/(1.0-output)) + (1.0 - target)/(1.0 - (1.0-output)))

        return derivate_bce

### Initializing Weights

Xavier initialization is commonly used to initialize the weights of a network. It draws the weights from a uniform distribution bounded by $\pm\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}}$, where $n_i$ is the number of incoming network connections, and $n_{i+1}$ is the number of outgoing network connections.
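
A minimal sketch of this initialization for a full weight matrix is shown below. The function name `xavier_uniform`, the `(n_out, n_in)` shape convention, and the optional `rng` argument are assumptions made for illustration; the template's signature differs.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Weight matrix of shape (n_out, n_in) drawn from U(-limit, +limit)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(low=-limit, high=limit, size=(n_out, n_in))

W = xavier_uniform(2, 3, rng=np.random.default_rng(0))
print(W.shape)
```

Passing a seeded generator makes the initial weights reproducible between runs.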

In [101]:
def xavier_initialization(n_incoming: np.ndarray, n_outgoing: np.ndarray) -> np.ndarray:
    """ Returns a numpy array of initialized weights """

    n_incoming = np.asarray(n_incoming)
    n_outgoing = np.asarray(n_outgoing)
    n_in = np.sum(n_incoming)
    n_out = np.sum(n_outgoing)
    limit = (np.sqrt(6)/np.sqrt(n_in + n_out))

    xavier_initialized_weights = np.random.uniform(low = (-limit), high = limit, size = n_in)

    return xavier_initialized_weights



### Implement Multiple Neurons

#### Feed-Forward Layer

A feed-forward layer applies a linear transformation to the input $x$ using a weight matrix $\textbf{W}$ and a bias vector $b$:

$z = x\textbf{W}^T+b$
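
The transformation above can be sketched with numpy as follows. The shapes are an assumption for illustration: `x` has shape `(n_in,)`, `W` has shape `(n_out, n_in)`, and `b` has shape `(n_out,)`; the name `feed_forward` is hypothetical.

```python
import numpy as np

def feed_forward(x, W, b):
    """Linear layer z = x W^T + b for a single input vector x."""
    return x @ W.T + b

x = np.array([1.0, 0.0])
W = np.array([[0.5, -0.5],
              [1.0,  2.0],
              [0.0,  1.0]])
b = np.array([0.1, 0.2, 0.3])
z = feed_forward(x, W, b)
print(z)
```

Each row of `W` holds the weights of one neuron, so the layer maps two inputs to three outputs here.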

In [102]:
class multi_neuron_implementation():
    def __init__(self, number_of_neurons, loss_function, output_activation_function, bias, extra_hidden_layers):
        self.neuron_weights = xavier_initialization((2,),(1,))
        self.number_of_neurons = number_of_neurons
        self.loss_function = loss_function
        self.output_activation_function = output_activation_function
        self.bias = bias
        self.extra_hidden_layers = extra_hidden_layers
        print(self.neuron_weights)

    def forward_pass(self, x):
        # TODO
        # Implement forward propagation
        output = 0

        for i in range(self.number_of_neurons):


            output = output + self.neuron_weights[i]*x[i]
        output = output + self.bias
        output = self.output_activation_function().forward(output)
        for i in range(self.extra_hidden_layers): 
            output = self.output_activation_function().forward(output)
        

        # END TODO
        return output, x

    def backward_pass(self, x, target, input):
        # TODO
        # Perform backpropagation by calculating derivative
        # input: the values x_0 and x_1
        output = None
        output = (self.loss_function().backward(x, target))*(self.output_activation_function().backward(x))
        for i in range(self.extra_hidden_layers):
            output = self.output_activation_function().backward(output)
        output = output*input
        # END TODO
        return output

    def update_parameter(self, derivative, learning_rate = 1):
        # TODO
        # Perform weight update
        # use self.neuron_weights
        new_weights = None
        new_weights = self.neuron_weights - learning_rate*derivative
        # END TODO
        self.neuron_weights = new_weights

    def train(self, input_data, targets):
        # TODO
        # Call the necessary functions to train the model with multiple neurons for the given task
        # Complete the rest of the code to correctly train the model
        for i in range(len(input_data)):
            output_tr, input_tr = self.forward_pass(input_data[i])
            derivate_tr = self.backward_pass(output_tr, targets[i], input_tr)
            self.update_parameter(derivate_tr, learning_rate=1)      
        # END TODO

    def inference(self, input_data):
        # TODO
        # Test the trained model
        
        output = []
        var1 = None
        var2 = None

        
        for i in range(len(input_data)) : 
          var1, var2 = self.forward_pass(input_data[i])       
          output.append(var1)

        output = np.asarray(output)

        # END TODO
        return output

### Training

In [103]:
multi_neuron = multi_neuron_implementation(number_of_neurons=2, loss_function=binary_cross_entropy, output_activation_function=sigmoid_activation_function,bias = 1, extra_hidden_layers=3)

# TODO
# train the corresponding single neuron 
multi_neuron.train(input_data, targets)
# END TODO

[ 0.81952103 -0.99499788]


### Inference

In [104]:
# TODO
# Test the trained model
predictions = multi_neuron.inference(input_data)
print(predictions)
# END TODO

[0.65985098 0.66205247 0.65981958 0.66204444]


### Evaluation

In [105]:
# TODO
# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value = accuracy(predictions, targets)
print(accuracy_value)
# END TODO

0.5


## Multi-Layer Perceptron (MLP)

Let's generalize even further and build a network with an arbitrary (parametrized) number of hidden layers and hidden dimensions. For the XOR task specifically, we will consider a network with three hidden layers and a hidden dimension of three. We will also add an activation function to introduce nonlinearity in our hidden layers.

<img src="https://i.imgur.com/IUQ05Ol.png">

### Implementation

In [106]:
# Input data shape changed for the MLP because (2,1) is different from (2,)


input_data = np.array([[[0],[0]] , [[1],[0]] , [[0],[1]] , [[1],[1]]], dtype=float)
targets = np.array([(0), (1), (1), (0)], dtype = float)



print(input_data[0].shape)
print(targets[0].shape)


(2, 1)
()


In [107]:


class MLP_implementation():
    def __init__(self,
        input_size,
        hidden_layers,
        hidden_layers_size,
        hidden_activation_func,
        output_size,
        output_activation_function,
        loss_function,
    ):
        self.input_size = input_size
        self.hidden_layers = hidden_layers
        self.hidden_layers_size = hidden_layers_size
        self.output_size = output_size
        self.hidden_activation_func = hidden_activation_func
        self.loss_function = loss_function
        self.output_activation_function = output_activation_function
        # input_size and output_size added by us
        # TODO
        # Implement your MLP model 

        self.weights_list = []

        for i in range(self.hidden_layers + 1):
            if i == 0:
                self.weights_list.append(np.random.randn(self.hidden_layers_size, self.input_size)*np.sqrt(1. / self.hidden_layers_size))
                

            elif i == self.hidden_layers:
                self.weights_list.append(np.random.randn(self.output_size, self.hidden_layers_size)*np.sqrt(1. / self.hidden_layers_size))
                
            else:
                self.weights_list.append(np.random.randn(self.hidden_layers_size, self.hidden_layers_size)*np.sqrt(1. / self.hidden_layers_size))
                
        # END TODO

    def forward_pass(self, x):
        # TODO
        # Implement forward propagation
        # Same principle as in the lecture
        # self.forward_hidden_states is reused in backpropagation, with h1 = self.forward_hidden_states[1]
        # The first entry is None (clearer indexing, and an accidental use would be noticed immediately)
        output = None
        
        self.forward_hidden_states = [None]
       

        for i in range (self.hidden_layers + 1):
            if i == 0:
                
                
                output = self.hidden_activation_func().forward(np.dot(self.weights_list[0], x))
                
                self.forward_hidden_states.append(output)     

            elif i == self.hidden_layers:  

                output_final = self.output_activation_function().forward(np.dot(self.weights_list[i], output)) 
                
            
            else:
                output = self.hidden_activation_func().forward(np.dot(self.weights_list[i], output))

                self.forward_hidden_states.append(output) 
                
        
        # END TODO
        return output_final, x 

    def backward_pass(self, x, target, input_of_mlp):
        # TODO
        # Perform backpropagation by calculating derivative
        # x = output of the MLP
        # Same procedure as in class, using matrix multiplication. The gradients are stored in a list that is filled from the last layer towards the first
        # o and v are not used again; they exist only because the loss forward pass returns three values
    
        loss, o, v = self.loss_function().forward(x, target)

        output = [None]*(self.hidden_layers+1)

        
        self.hidden_states_backprop = [None]*(self.hidden_layers+1)

        # For convenience, we use the closed-form expression dL/do = P - Y, which holds for softmax combined with cross-entropy. It needs less computation
        # The loss is computed with the loss function and printed after each iteration; it could be plotted as a curve during training
        # The output size decides the branch, since softmax and cross-entropy are used for multi-class probabilities
    

        if self.output_size > 1 :

            error_at_output = x-target
            
            
        else:
            
            error_at_output = np.dot(self.loss_function().backward(x, target), self.output_activation_function().backward(x))
            print(error_at_output)

        
        for i in range(self.hidden_layers, -1, -1):

            
            if i == self.hidden_layers:
                # w3 h3
            
                output[i] = np.dot(error_at_output ,(self.forward_hidden_states[i].T))
                self.hidden_states_backprop[i] = np.dot(self.weights_list[i].T, error_at_output)
                

            elif i == 0:

                c = self.hidden_states_backprop[i+1] * self.hidden_activation_func().backward(self.forward_hidden_states[i+1])                                                            

                output[i] = np.dot( c , input_of_mlp.T)
                   

            else:

                a = self.hidden_states_backprop[i+1]* self.hidden_activation_func().backward(self.forward_hidden_states[i+1])

                output[i] = np.dot( a, self.forward_hidden_states[i].T)

                b =   self.hidden_activation_func().backward(self.forward_hidden_states[i+1])* self.hidden_states_backprop[i+1]

                self.hidden_states_backprop[i] = np.dot(self.weights_list[i].T, b )

                

        # END TODO
        return output, loss

    def update_parameter(self, derivative, learning_rate = 0.015, loss=None):
        # TODO
        # Perform weight update
        
        for i in range(self.hidden_layers + 1):
            
            
            self.weights_list[i] = self.weights_list[i] - learning_rate*derivative[i]
            
        # END TODO

    def train(self, input_data, targets):
        # TODO
        # Call the necessary functions to train the model for the given task
        # Complete the rest of the code to correctly train the model

        for i in range(len(input_data)):

            output, mlp_input = self.forward_pass(input_data[i])
            
            weight_error, loss = self.backward_pass(output, targets[i], mlp_input)

            self.update_parameter(weight_error, learning_rate=1, loss = None)

            print('loss = ', loss)

        # END TODO

    def inference(self, input_data):
        # TODO
        # Test the trained model

        output = []
        var1 = None
        var2 = None

        
        for i in range(len(input_data)) : 
          var1, var2 = self.forward_pass(input_data[i])       
          output.append(var1)

        output = np.asarray(output)

        # END TODO
        return output

### Adding Nonlinearity

This time, you need to implement and apply a nonlinearity. For this, implement the Rectified Linear Unit (ReLU) and use it as the activation function of the hidden layers.

The ReLU activation function checks whether its input is positive. If it is, the input passes through unchanged; otherwise, ReLU outputs zero.

ReLU is therefore a combination of two linear pieces. This makes training simple yet effective: ReLU has no learnable parameters, and both the function and its derivative are cheap to evaluate. The following equation and figure show how ReLU behaves.

$$ y = \max(0, x) $$

<center><figure><img src="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png" width="450"/><figcaption>Graph of the ReLU activation function. <a href="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png">Image is taken from</a></figcaption></figure></center>
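
The function and its derivative can be sketched as follows. The names are hypothetical, and the derivative at exactly `x = 0` is taken to be 0, which is a common convention rather than something the exercise prescribes.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """1 where the input is positive, 0 elsewhere; returns a new array."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_derivative(x))
```

Returning a new array for the derivative, instead of overwriting the input, keeps the stored activations intact for the remaining backpropagation steps.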

In [108]:
# Reuse your implementation from the previous task where needed.
# You only need to add the ReLU activation function to it.

class relu_activation_function():

    def forward(self, input_data):
        output = None
        
        # TODO
        # implement ReLU function for the input_data
        
        output = np.maximum(0.0,input_data) 
        
        
        # END TODO
        return output
    
    def backward(self, gradients):
         # calculate the gradients with help of the derivative
        
        gradients[gradients<=0] = 0.0
        
        

        return gradients

### MLP Initialization

In [109]:
relu = relu_activation_function()
xor_mlp = MLP_implementation(2, 3, 3, relu_activation_function, 1, sigmoid_activation_function, binary_cross_entropy)


### Training

In [110]:
# Train the same model with ReLU activation function
# TODO
xor_mlp.train(input_data, targets)
# END TODO

0.5
loss =  -0.6931471805599453
-0.48985186101077344
loss =  -0.6730541268352774
-0.4888275958166507
loss =  -0.6710483598028536
0.47660489388551275
loss =  -0.64741863911229


### Evaluation

In [111]:
# Test and evaluate your new model as in the previous task
# TODO
predictions = xor_mlp.inference(input_data)
print(predictions)
accuracy_value = accuracy(predictions, targets)
print(predictions)
print(accuracy_value)
# END TODO

[0.5        0.50953659 0.50988458 0.51941385]
[0. 1. 1. 1.]
0.75


## Application

### Iris Dataset 🌷

Iris is a genus of hundreds of species of flowering plants with showy flowers. The Iris data set consists of 150 samples from three species of Iris which are hard to distinguish (Iris setosa, Iris virginica and Iris versicolor). There are four features from each sample: the length and the width of the sepals and petals, in centimeters. Based on these features, the goal is to predict which species of Iris the sample belongs to.

<center><img src="https://www.oreilly.com/library/view/python-artificial-intelligence/9781789539462/assets/462dc4fa-fd62-4539-8599-ac80a441382c.png" width="450"/></center>

### Loading Dataset

In [112]:
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test dataset
# they are ndarray
# The targets are also converted to one-hot encodings for the class probabilities, and the data is shuffled so that it is better distributed
# However, shuffling does not control the class distribution. train_test_split from sklearn could be used so that the class representation is similar in the training and test sets
# But we had to use numpy and this was not a task in itself, so we did not look further
# The data is also normalized so that the MLP can learn better

X_shuffled, y_shuffled = shuffle(X, y)

X_shuffled_normed = X_shuffled / X_shuffled.max(axis=0)

X_shuffled_good_dim = [None]*(len(X_shuffled))

X_shuffled_good_dim = np.array(X_shuffled_good_dim)

y_one_hot_good_dim = [None]*(len(y_shuffled))

y_one_hot_good_dim = np.array(y_one_hot_good_dim)

y_one_hot_enc = np.eye(3)[y_shuffled.reshape(-1)]

for i in range(len(X_shuffled)):
    X_shuffled_good_dim[i] = X_shuffled_normed[i].reshape((4,1))

for i in range(len(y_shuffled)):
    y_one_hot_good_dim[i] = y_one_hot_enc[i].reshape((3,1))

train_length = int(0.8*len(X))

X_train, y_train = X_shuffled_good_dim[0:train_length], y_one_hot_good_dim[0:train_length]
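# Note: the slice [train_length:-1] stops before the last sample, so the test set holds 29 samples, not 30
X_test, y_test = X_shuffled_good_dim[train_length:-1], y_one_hot_good_dim[train_length:-1]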
print(X_train[0])
print(y_train[0])


[[0.60759494]
 [0.77272727]
 [0.27536232]
 [0.08      ]]
[[1.]
 [0.]
 [0.]]
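
The comments in the loading cell note that a plain shuffle does not control the class balance of the split. One possible numpy-only way to keep the classes balanced is sketched below. The helper `stratified_split` and the stand-in labels `y_demo` are hypothetical; `y_demo` only mimics the 50 samples per class of Iris.

```python
import numpy as np

def stratified_split(y, train_fraction=0.8, rng=None):
    """Return (train_idx, test_idx) so that every class keeps the same train share."""
    rng = np.random.default_rng() if rng is None else rng
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the indices of this class, then cut them at the train fraction
        idx = rng.permutation(np.flatnonzero(y == label))
        cut = int(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    # Shuffle again so the classes are interleaved within each split
    return rng.permutation(train_idx), rng.permutation(test_idx)

y_demo = np.repeat([0, 1, 2], 50)  # stand-in labels, 50 per class
train_idx, test_idx = stratified_split(y_demo, rng=np.random.default_rng(0))
print(len(train_idx), len(test_idx), np.bincount(y_demo[test_idx]))
```

The returned index arrays can be used to select rows from both the features and the one-hot targets, so the two stay aligned.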


### Softmax

Previously, we only considered a **binary classification problem**. Iris, however, is a **multiclass classification problem** that requires us to distinguish between three classes. For this case, we can use a **Softmax activation function** in the output layer to transform our outputs (logits) to a probability distribution over our classes.

Softmax is defined as

$\texttt{softmax}(z)_i=\frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}}$

In [113]:
def softmax_test(x):
    e =np.exp(x - np.max(x))
    p = e/e.sum()
    return p
    
trest = [[3.0], [1.0], [0.2]]
print(softmax_test(trest))

[[0.8360188 ]
 [0.11314284]
 [0.05083836]]


In [114]:
class softmax_activation_function():

    def forward(self, input_data):
        output = None
        
        # TODO
        # implement Softmax function for the input_data
        # numerically stable softmax
       

        y = np.exp(input_data - np.max(input_data))
        output = y / y.sum()
        
        # END TODO
        return output
    
    def backward(self, gradients):
        # calculate the gradients with help of the derivative

        jacobian_m = np.diag(gradients)

        for i in range(len(jacobian_m)):
            for j in range(len(jacobian_m)):
                if i == j:
                    jacobian_m[i][j] = gradients[i] * (1-gradients[i])
                else: 
                    jacobian_m[i][j] = -gradients[i]*gradients[j]
        return jacobian_m

        #return

### Loss Function (Cross-Entropy)

Related to the previous notes about sigmoid and softmax, we now also need to move from a binary cross entropy loss to a more general cross entropy loss for a multiclass classification problem.

Cross-Entropy loss is defined as:

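$L=-\frac{1}{N}\sum_{n=1}^{N}\sum_i y_{n,i} \log(y_{n,i}')$

where $y_{n,i}$ is the target and $y_{n,i}'$ is the predicted probability of class $i$ for sample $n$.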
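
For a single sample, the loss above and the well-known shortcut for its gradient with respect to the logits, $p - y$ when softmax feeds into cross-entropy, can be checked numerically. This is a sketch with hypothetical function names; it compares a central-difference gradient against $p - y$. Note that the loss values printed during training in this notebook are negative, which is what results when the leading minus sign is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the maximum for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    """Cross-entropy for one sample with one-hot target y (note the leading minus)."""
    return float(-np.sum(y * np.log(p)))

z = np.array([3.0, 1.0, 0.2])
y = np.array([1.0, 0.0, 0.0])
p = softmax(z)

# Numerical gradient of the loss with respect to the logits (central differences)
h = 1e-6
numeric = np.array([
    (cross_entropy(softmax(z + h * np.eye(3)[i]), y)
     - cross_entropy(softmax(z - h * np.eye(3)[i]), y)) / (2 * h)
    for i in range(3)
])
print(np.allclose(numeric, p - y, atol=1e-6))
```

Using $p - y$ directly avoids building the full softmax Jacobian during backpropagation.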

In [115]:
class cross_entropy_loss():

    def forward(self, output, target):
        loss = None
        
        # TODO
        # implement Cross-Entropy loss function for output, target
       
        loss = np.sum(target*np.log(output))
        
        # END TODO
        return loss, output, target
    
    def backward(self, output, target):
        # calculate the gradients with help of the derivative
        
        derivate = -target*(1/output)

        return derivate

### Architecture

We will again use an MLP for this task. Initialize a model with **4 hidden layers** and a **hidden layer size of 24**.

### Training

In [116]:
iris_mlp = MLP_implementation(4, 4, 24, relu_activation_function, 3, softmax_activation_function, cross_entropy_loss)


In [117]:
iris_mlp.train(X_train, y_train)

loss =  -1.094810848524672
loss =  -1.1296956467855443
loss =  -1.1761605018003272
loss =  -1.1006173788689138
loss =  -1.0542484597093058
loss =  -0.987129511084618
loss =  -1.208723391213119
loss =  -1.1335378517539678
loss =  -1.0134629463581801
loss =  -0.9621954419428576
loss =  -1.1213000700365126
loss =  -1.0709877002452317
loss =  -1.3131761040037364
loss =  -1.0362895087494735
loss =  -1.0372370656175236
loss =  -1.010546118746285
loss =  -1.3138789254806165
loss =  -1.006413847633985
loss =  -1.0569059935492622
loss =  -0.9846588777218911
loss =  -1.3242427847118123
loss =  -1.0628418281361842
loss =  -1.0137634919912064
loss =  -1.006028959898533
loss =  -0.953469950813325
loss =  -1.3621328230033884
loss =  -1.0561353139203045
loss =  -0.9993689890194964
loss =  -0.9562857721124082
loss =  -0.9303551820878171
loss =  -1.399924038705112
loss =  -1.067123955245302
loss =  -1.2933100371664439
loss =  -0.9650366299704213
loss =  -1.0531927046721796
loss =  -0.9264779375212445

### Evaluation

Show the overall accuracy of our model on the test dataset. Use the existing `accuracy` function that you implemented earlier.

In [118]:
# TODO
# Test the trained model
predictions_MLP = iris_mlp.inference(X_test)
#print(predictions_MLP)

# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value_MLP, y_final_predicted = accuracy(predictions_MLP,y_test)
print("iris accuracy =" ,accuracy_value_MLP*100)

# END TODO

iris accuracy = 82.75862068965517


In [119]:
# Reshape for the confusion matrix as we have dimensions of (29,3,1)

y_bruh = np.empty((29,3,1))

for i in range(len(y_test)):
    y_bruh[i] = y_test[i]

y_bruh= y_bruh.reshape((29,3))
y_final_predicted = y_final_predicted.reshape((29,3))

print(y_bruh.shape)
print(y_final_predicted.shape)

(29, 3)
(29, 3)


Print the confusion matrix using `sklearn.metrics.confusion_matrix`.

In [120]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_bruh.argmax(axis=1), y_final_predicted.argmax(axis=1),labels=[0,1,2])


array([[10,  0,  0],
       [ 0,  9,  0],
       [ 0,  5,  5]], dtype=int64)

Now please also look at the confusion matrix, what can you conclude from it? (no code, write text as part of this question)

Of the 29 test samples, 24 are classified correctly. All 5 errors are of the same kind: five virginica samples are classified as versicolor, while setosa and versicolor are recognized without error. This could be due to the order of the training data, since the weights are updated after every sample; a possible remedy is a better mix and ordering of the classes in the data. The overall accuracy of 82.76% could be improved either by tuning the hyperparameters or by a more even class distribution across the training and test sets. Each run produces a different data split because of the shuffle operation, and different initial weights because of the random initialization, so the result varies between runs. Accuracy in general could probably also be improved by adding a bias to our MLP, for example in the XOR task. It could also be more stable with mini-batch training instead of an update after every sample, which makes the result highly dependent on the order of the training data. On the other hand, we expect per-sample updates to be better for fine-tuning the model and for reaching good accuracy on a given problem. The best accuracy we observed was 96%.
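
The numbers discussed above can be read off the confusion matrix directly. The sketch below hard-codes the matrix printed earlier (rows are true classes, columns are predicted classes, in the order setosa, versicolor, virginica) and derives accuracy, per-class recall, and per-class precision; the variable names are illustrative.

```python
import numpy as np

cm = np.array([[10, 0, 0],
               [ 0, 9, 0],
               [ 0, 5, 5]])

accuracy_from_cm = np.trace(cm) / cm.sum()   # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)        # per true class (rows)
precision = np.diag(cm) / cm.sum(axis=0)     # per predicted class (columns)
print(accuracy_from_cm, recall, precision)
```

The recall of 0.5 for virginica and the reduced precision for versicolor both reflect the same five misclassified samples.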
