# Machine Learning - Assignment 3

## Artificial Neural Network

The aim of the assignment is to implement an artificial neural network (mostly) from scratch. This includes implementing or fixing the following:

* Add support for additional activation functions and their derivatives.
* Add support for loss functions and their derivatives.
* Add the use of a bias in the forward propagation.
* Add the use of a bias in the backward propagation.

In addition, you will be doing the following as well:

* Test the algorithm on 3 datasets.
* Compare neural networks with and without scaling.
* Hyper-parameter tuning.

Forward and backward propagation are each implemented for a single layer and are reused once per layer to handle networks with multiple layers.

Follow the instructions and implement what is missing to complete the assignment. Some functions have been started to help you a little bit with the implementation.

**Note:** You might need to go back and forth during your implementation of the code. The structure is set up to make implementation easier, but you might still find yourself going back to change something to make a later step easier.

## Assignment preparations

We help you out with importing the libraries.

**IMPORTANT NOTE:** You may not import any more libraries than the ones already imported!

In [53]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# We set seed to better reproduce results later on.
np.random.seed(12345)

## Neural Network utility functions

### 1) Activation functions

Below is some setup for choosing an activation function. Implement two additional activation functions: "ReLU" and one more of your choosing.

In [54]:
# Activation functions
def activate(activations, selected_function = "none"):
    
    if selected_function == "none":
        y = activations
    elif selected_function == "relu" :
        # TODO: Implement the "ReLU" activation function
        y = np.maximum(0, activations)
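    elif selected_function == "sigmoid" :
        # TODO: Implement another activation function of your own choice.
        y = 1 / (1 + np.exp(-activations))
    else:
        # Fail loudly on an unknown name instead of reaching "return y" with y undefined.
        raise ValueError(f"Unknown activation function: {selected_function}")
    return y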

In [None]:
# TODO: Test your activation functions. Are the returned values what you expect?
test_values = np.array([-2, -1, 0, 1, 2, 3, 5])
print("Activation 'none' :", activate(test_values, "none"))
print("Activation 'relu' :", activate(test_values, "relu"))
print("Activation 'sigmoid' :", activate(test_values, "sigmoid"))
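One thing worth testing with extreme inputs: the plain form `1 / (1 + np.exp(-activations))` triggers an overflow warning in `np.exp` for large negative inputs such as `-1000`. The sketch below is one common way around this, not part of the assignment's required code. The name `stable_sigmoid` is hypothetical, and the function assumes the input is a NumPy array with at least one dimension.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    positive = z >= 0
    # For z >= 0, exp(-z) is at most 1, so this form cannot overflow.
    out[positive] = 1.0 / (1.0 + np.exp(-z[positive]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) is below 1 here.
    exp_z = np.exp(z[~positive])
    out[~positive] = exp_z / (1.0 + exp_z)
    return out

extreme_values = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(stable_sigmoid(extreme_values))
```

For moderate inputs the result matches the plain formula, so swapping it in should not change training behaviour on well-scaled data.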

### 2) Activation function derivatives

Neural networks need both the activation function and its derivative. Finish the code below.

In [56]:
def d_activate(activations, selected_function = "none"):
    if selected_function == "none":
        dy = np.ones_like(activations)
    elif selected_function == "relu":
        # TODO: Implement the "ReLU" derivative
        dy = np.where(activations > 0, 1, 0)
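    elif selected_function == "sigmoid" :
        # TODO: Implement the derivative of the activation function you chose yourself.
        sigmoid = 1 / (1 + np.exp(-activations))
        dy = sigmoid * (1 - sigmoid)
    else:
        # Fail loudly on an unknown name instead of reaching "return dy" with dy undefined.
        raise ValueError(f"Unknown activation function: {selected_function}")
    return dy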

In [None]:
# TODO: Test your activation function derivatives. Are the returned values what you expect?
test_values = np.array([-2, -1, 0, 1, 2, 3, 5])
print("Derivative 'none' :", d_activate(test_values, "none"))
print("Derivative 'relu' :", d_activate(test_values, "relu"))
print("Derivative 'sigmoid' :", d_activate(test_values, "sigmoid"))
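Printing the derivatives only helps if you already know what to expect. A more systematic check is to compare each analytic derivative against a central finite difference. The sketch below does this for the sigmoid; `numerical_derivative`, `sigmoid` and `d_sigmoid` are stand-in names defined here so the example is self-contained, and the step size `h` is an assumption that works well for smooth functions.

```python
import numpy as np

def numerical_derivative(f, z, h=1e-6):
    # Central difference: (f(z + h) - f(z - h)) / (2h) approximates f'(z).
    return (f(z + h) - f(z - h)) / (2 * h)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
max_error = np.max(np.abs(numerical_derivative(sigmoid, z) - d_sigmoid(z)))
print("Largest difference between numeric and analytic derivative:", max_error)
```

The same check applies to ReLU as long as the test points avoid 0, where ReLU is not differentiable and the finite difference gives 0.5 rather than 0 or 1.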

### 3) Loss functions

To penalize the network when it predicts incorrectly, we need to measure how "bad" the prediction is. This is done with loss functions.

As with the activation functions, each loss function needs its derivative as well.

Finish the MSE_loss (Mean Squared Error loss) and add one additional loss function.

In [58]:
# This is the loss for a set of predictions y_hat compared to a set of real values y
def MSE_loss(y_hat, y):
    loss = np.mean((y_hat - y) ** 2) # TODO: Finish this function
    return loss


# TODO: Choose another loss function and implement it
def BCE_loss(y_hat, y):
    epsilon = 1e-15
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) # TODO: Finish this function
    return loss


def CCE_loss(y_hat, y):
    epsilon = 1e-15
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    loss = -np.mean(np.sum(y * np.log(y_hat), axis=1))
    return loss

The derivatives of the loss are taken with respect to the predicted value **y_hat**.
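For reference, the MSE and BCE losses and their derivatives can be written out as follows. This is a sketch under the assumption that $N$ denotes the total number of elements in `y` (that is, `y.size`), which is what `np.mean` averages over, and that $\hat{y}_i$ lies strictly between 0 and 1 for the BCE case.

$$
L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2,
\qquad
\frac{\partial L_{\mathrm{MSE}}}{\partial \hat{y}_i} = \frac{2}{N} (\hat{y}_i - y_i)
$$

$$
L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],
\qquad
\frac{\partial L_{\mathrm{BCE}}}{\partial \hat{y}_i} = \frac{\hat{y}_i - y_i}{N \, \hat{y}_i (1 - \hat{y}_i)}
$$

The BCE derivative follows from combining $-\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i}$ over the common denominator $\hat{y}_i (1 - \hat{y}_i)$.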

In [59]:
def d_MSE_loss(y_hat, y):
    dy = 2 * (y_hat - y) / y.size 
    # TODO: Finish this function
    return dy

# TODO: Choose another loss function and implement it
def d_BCE_loss(y_hat, y):
    epsilon = 1e-15
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    dy = (y_hat - y) / (y_hat * (1 - y_hat) * y.size)
    # TODO: Finish this function
    return dy


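def d_CCE_loss(y_hat, y):
    # Derivative of CCE_loss with respect to y_hat itself.
    # NOTE: The shortcut (y_hat - y) / N is the gradient with respect to the
    #       logits of a softmax output layer, not with respect to y_hat.
    epsilon = 1e-15
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    dy = -y / (y_hat * y.shape[0])
    return dy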

### 4) Forward propagation

The first "fundamental" function for neural networks is to be able to propagate the data forward through the neural network. We will implement this function here.

In [60]:
def propagate_forward(weights, activations, bias, activation_function="none"):
    
    # TODO: Add support for the use of bias

    dot_product = np.dot(weights, activations) + bias

    new_activations = activate(dot_product, activation_function)

    return new_activations
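The bias term relies on NumPy broadcasting, so it helps to see the shapes involved. The sketch below assumes the layout used in this notebook: weights of shape `(neurons_out, neurons_in)`, activations with one sample per column, and a bias stored as a column vector of shape `(neurons_out, 1)`. The numbers are made up for illustration.

```python
import numpy as np

weights = np.array([[1.0, 2.0, 3.0],
                    [0.0, -1.0, 1.0]])      # shape (2, 3): 3 inputs, 2 neurons
activations = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])        # shape (3, 2): 2 samples as columns
bias = np.array([[0.5],
                 [-0.5]])                   # shape (2, 1): one bias per neuron

# The (2, 1) bias is broadcast across the sample axis, so every sample
# (column) receives the same per-neuron offset.
z = np.dot(weights, activations) + bias
print(z.shape)
print(z)
```

If the bias were stored with shape `(2,)` instead, broadcasting would align it with the sample axis rather than the neuron axis, which only works by accident when the batch size equals the layer width.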

### 5) Back-propagation

To be able to train a neural network, we need to be able to propagate the loss backwards and update the weights. We will implement this function here.

In [61]:
# Calculates the gradients that are passed through the layer in the backward pass.
# Returns the derivative of the loss with respect to the weights, the input signal (activations) and the bias.

def propagate_backward(weights, activations, dl_dz, bias, activation_function="none"):
    # NOTE: dl_dz is the derivative of the loss with respect to this layer's output,
    #       i.e. the gradient handed back from the layer that comes after this one.

    # TODO: Add support for the use of bias

    z = np.dot(weights, activations) + bias
    d_loss = d_activate(z, activation_function) * dl_dz

    d_weights = np.dot(d_loss, activations.T)
    
    d_bias = np.sum(d_loss, axis=1, keepdims=True)
    d_activations = np.dot(weights.T, d_loss)
    
    return d_weights, d_activations, d_bias
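A gradient check is a practical way to confirm that the backward pass is correct before training anything. The sketch below repeats the same computations for a single linear layer (activation "none") with an MSE loss and compares one weight gradient and one bias gradient against central finite differences. The data is random, and `layer_loss` is a hypothetical helper defined only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))
bias = rng.normal(size=(3, 1))
activations = rng.normal(size=(4, 5))   # 5 samples as columns
target = rng.normal(size=(3, 5))

def layer_loss(w, b):
    # Forward pass through one linear layer followed by MSE.
    z = np.dot(w, activations) + b
    return np.mean((z - target) ** 2)

# Analytic gradients, following the same steps as the backward pass above.
z = np.dot(weights, activations) + bias
dl_dz = 2 * (z - target) / target.size
d_weights = np.dot(dl_dz, activations.T)
d_bias = np.sum(dl_dz, axis=1, keepdims=True)

# Numeric gradients for a single weight and a single bias entry.
h = 1e-6
w_plus, w_minus = weights.copy(), weights.copy()
w_plus[1, 2] += h
w_minus[1, 2] -= h
numeric_dw = (layer_loss(w_plus, bias) - layer_loss(w_minus, bias)) / (2 * h)

b_plus, b_minus = bias.copy(), bias.copy()
b_plus[0, 0] += h
b_minus[0, 0] -= h
numeric_db = (layer_loss(weights, b_plus) - layer_loss(weights, b_minus)) / (2 * h)

print(abs(numeric_dw - d_weights[1, 2]), abs(numeric_db - d_bias[0, 0]))
```

Both differences should be tiny. A large difference usually points to a missing transpose, a missing sum over the sample axis, or a forgotten activation derivative.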

## Neural network implementation

### 6) Fixing the neural network

Below is a class implementation of an MLP neural network. This implementation is still missing several features that are needed for the network to be robust and function well. Your task is to improve and fix it with the following:

1. Add a bias to the activation functions, and make sure the bias is also updated during training.
2. Add a function that trains the network using minibatches (such that the neural network trains on a few samples at a time).
3. Make use of a validation set in the training function. The model should stop training when the loss starts to increase for the validation set. It should be possible to turn this feature on and off to test the difference.
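Stopping at the very first increase in validation loss can end training too early, because the validation loss is often noisy from epoch to epoch. A common variation is to allow a few epochs without improvement before stopping. The sketch below shows one way to express that rule; the function name `should_stop` and the `patience` parameter are assumptions for illustration, not part of the assignment's required interface.

```python
def should_stop(val_losses, patience=3):
    """Return True when none of the last `patience` epochs improved on the best earlier loss."""
    if len(val_losses) <= patience:
        # Not enough history yet to make a decision.
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))  # three epochs without improvement
print(should_stop([1.0, 0.8, 0.7, 0.71, 0.69, 0.73]))  # 0.69 improved on 0.7
```

Setting `patience=1` reduces this to "stop as soon as the loss is no better than the best loss so far".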


In [62]:
class NeuralNet(object):
    
    # Setup all parameters and activation functions.
    # This function runs directly when a new instance of this class is created.
    def __init__ (self, input_dim, output_dim, neurons = []):

        # NOTE: The "neurons" parameter is given as a list.
        # E.g., [4, 8, 4] means 4 neurons in layer 1, 8 neurons in layer 2 etc...

        # TODO: Add support for bias for each neuron in the code below.
        
        self.weights = [np.random.normal(0, 2, (m, n)) for n, m in zip([input_dim] + neurons, neurons + [output_dim])]

        
        self.biases = [np.zeros((layer_size, 1)) for layer_size in neurons + [output_dim]]
        
        self.activation_functions = ["relu"] * len(neurons) + ["none"]

    # Propagate the input through the network and calculate the output.
    def forward(self, x):
        activation = x.T
        # TODO: Add support for a bias for each neuron in the code below.
        for layer_weights, layer_bias, layer_activation_function in zip(self.weights, self.biases, self.activation_functions):
            activation = propagate_forward(layer_weights, activation, layer_bias, layer_activation_function)
            
        return activation.T
    
    
    # Adjust the weights in the network to better fit the desired output (y), given the input (x).
    # The weight updates are happening "in-place", thus we are only returning the loss from this function.
    # Note that this function can handle a variable size of the input (x), both full datasets or smaller parts of the dataset.
    def adjust_weights(self, x, y, learning_rate=1e-4):
                
        # TODO: Add support for a bias for each neuron and make sure these are learnt as well in the code below.

        activation = x.T
        activation_history = [activation] # NOTE: We need the previous (or intermediate) activations to make use of the "chain rule" (see lecture notes).
        
        for layer_weights, layer_bias, layer_activation_function in zip(self.weights, self.biases, self.activation_functions):

            activation = propagate_forward(layer_weights, activation, layer_bias, layer_activation_function)
            activation_history.append(activation)

        # NOTE: The "activation" variable is changing as we go forward in the neural network.
        y_pred = activation_history[-1].T
        loss = MSE_loss(y_pred, y)

        d_activations = d_MSE_loss(y_pred,y).T # NOTE: The final output can be "seen as" the final activations, thus the name.
        
        for i in range(len(self.weights))[::-1]:
            # print(f"Backward pass for layer {i}")
            current_activation = activation_history[i]
            current_weights = self.weights[i]
            current_bias = self.biases[i]
            current_activation_function = self.activation_functions[i]

            d_weights, d_activations, d_bias = propagate_backward(current_weights, current_activation, d_activations, current_bias, current_activation_function)
            
            self.weights[i] -= learning_rate * d_weights
            self.biases[i] -= learning_rate * d_bias

        return loss
    
    
    # A function for the training of the network.
    def train_net(self, x, y, batch_size=32, epochs=100, learning_rate=1e-4, use_validation_data=False, val_split=0.2):
        
        # TODO: Add a training loop where the weights and biases of the network are learnt over several epochs.

        # TODO: Add support for mini batches. That is, in each epoch the data should be split into several
        #       smaller subsets and the model should be trained on each of these subsets one at a time.

        # TODO: Implement the use of validation data, that is, splitting the training data into training data and validation data.
        #       The validation data should be used to stop the training when the model stops generalising and starts to overfit.
        #       This feature should be able to be turned on and off to test the difference.

        # NOTE: Make use of previously implemented functions here.
        num_samples = x.shape[0]
        if use_validation_data:
            split_idx = int(num_samples * (1 - val_split))
            x_train, x_val = x[:split_idx], x[split_idx:]
            y_train, y_val = y[:split_idx], y[split_idx:]
        else:
            x_train, y_train = x, y

        history = []
        val_history = []
        for epoch in range(epochs):
            permutation = np.random.permutation(x_train.shape[0])
            x_train_shuffled = x_train[permutation]
            y_train_shuffled = y_train[permutation]
            
            epoch_loss = 0
            for i in range(0, x_train_shuffled.shape[0], batch_size):
                x_batch = x_train_shuffled[i:i+batch_size]
                y_batch = y_train_shuffled[i:i+batch_size]
                loss = self.adjust_weights(x_batch, y_batch, learning_rate)
                epoch_loss += loss * x_batch.shape[0]
            epoch_loss /= x_train_shuffled.shape[0]
            history.append(epoch_loss)
            
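            if use_validation_data:
                y_val_pred = self.forward(x_val)
                val_loss = MSE_loss(y_val_pred, y_val)
                val_history.append(val_loss)
                print(f"Epoch {epoch+1}/{epochs}, Training Loss: {epoch_loss:.6f}, Validation Loss: {val_loss:.6f}")
                # Early stopping: stop as soon as the validation loss is higher than in the previous epoch.
                if len(val_history) > 1 and val_loss > val_history[-2]:
                    break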
            else:
                print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss:.6f}")
        if use_validation_data:
            return history, val_history
        else:
            return history

## Train Neural Networks

### 7) Simple test

This is a very simple test for you to use and toy around with before using the datasets.

Make sure to test both the **adjust_weights** function and the **train_net** function. What is the difference between the two?

Also, be sure to **plot the loss for each epoch** to see how the network training is progressing!

In [None]:
# TODO: You can change most things in this cell if you want to, we encourage it!

n = 1000
d = 4

k = np.random.randint(0, 10, (d, 1))
x = np.random.normal(0, 1, (n, d))
y = np.dot(x, k) + 0.1 + np.random.normal(0, 0.01, (n, 1))

nn = NeuralNet(d, 1, [18, 12])

loss_1 = [nn.adjust_weights(x, y) for _ in range(1000)]


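# TODO: Use the train_net function to compare with the "adjust_weights" function.
# NOTE: A fresh network is created so that train_net does not continue from the weights already trained above.
nn_2 = NeuralNet(d, 1, [18, 12])
loss_2 = nn_2.train_net(x, y, batch_size=32, epochs=100, learning_rate=1e-4, use_validation_data=False, val_split=0.2)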


plt.plot(loss_1)
plt.title("Loss 1 (adjust_weights)")
plt.show()

plt.plot(loss_2)
plt.title("Loss 2 (train_net)")
plt.show()

### Real test and preprocessing

When using real data and neural networks, it is very important to scale the features to a small, comparable range, usually between 0 and 1. Large or very differently scaled inputs produce large pre-activations and gradients, which makes training with a fixed learning rate unstable or slow.

To test this, we will use our first dataset and test with and without scaling.

As in assignment 2, we will use the scikit-learn library for this preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
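To make the idea concrete, here is a minimal NumPy sketch of min-max scaling in which the minimum and range are computed on the training data only and then reused for the test data. The helper names `fit_min_max` and `apply_min_max` are hypothetical; scikit-learn's `preprocessing.MinMaxScaler` follows the same fit-then-transform pattern.

```python
import numpy as np

def fit_min_max(x_train):
    # Learn the per-feature minimum and range from the training data only.
    x_min = x_train.min(axis=0)
    x_range = x_train.max(axis=0) - x_min
    x_range[x_range == 0] = 1.0   # avoid division by zero for constant features
    return x_min, x_range

def apply_min_max(x, x_min, x_range):
    return (x - x_min) / x_range

x_train = np.array([[1.0, 100.0],
                    [3.0, 300.0],
                    [5.0, 500.0]])
x_test = np.array([[2.0, 600.0]])

x_min, x_range = fit_min_max(x_train)
x_train_scaled = apply_min_max(x_train, x_min, x_range)
x_test_scaled = apply_min_max(x_test, x_min, x_range)
print(x_train_scaled)
print(x_test_scaled)
```

Test values can fall outside the interval from 0 to 1, as the second test feature does here. That is expected, because the test set must not influence the scaling parameters.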

### 8) Dataset 1: Wine - with and without scaling

Wine dataset: https://archive.ics.uci.edu/dataset/109/wine

Train two neural networks, one with scaling and one without. Can you see any difference in training results or loss over time?

**Note:** Do not train for too many epochs (more than maybe 50-100). The network might "learn" anyway in the end, but you should still be able to see a difference when training.

In [64]:
from sklearn import preprocessing

data_wine = pd.read_csv("wine.csv").to_numpy()

# TODO: Set up the data and split it into train and test-sets.

# TODO: Train and test your neural networks.
# NOTE: Use the same train/test split for both neural network models!

# TODO: Do the above at least 3 times
# NOTE: Use loops here!

# TODO: Plot the results with matplotlib (plt)
# NOTE: One combined lineplot with the scaling and one without the scaling, 2 plots in total.
# NOTE: Plot both the accuracy and the loss!
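Since the network outputs one value per output neuron, plotting accuracy for a classification dataset needs a way to turn class labels into target vectors and network outputs back into predicted classes. The sketch below shows one possible approach using one-hot encoding and `argmax`. The helpers `one_hot` and `accuracy` are hypothetical names, and the example assumes the class labels have already been mapped to integers starting at 0.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row i gets a 1 in the column given by labels[i] and 0 elsewhere.
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

def accuracy(y_pred, y_true_one_hot):
    # The predicted class is the output neuron with the largest value.
    predicted = np.argmax(y_pred, axis=1)
    actual = np.argmax(y_true_one_hot, axis=1)
    return np.mean(predicted == actual)

labels = np.array([0, 2, 1, 2])
y_true = one_hot(labels, 3)
y_pred = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.1, 0.8]])
print(accuracy(y_pred, y_true))  # 3 of 4 predictions are correct: 0.75
```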

### Real data and hyper-parameter tuning

Now we are going to use real data, preprocess it, and do hyper-parameter tuning.

Choose two hyper-parameters to tune to try to achieve an even better result.

**NOTE:** Changing the number of epochs should be part of the tuning, but it does not count towards the two hyper parameters.
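One simple way to organise the tuning is a grid search: try every combination of candidate values and keep the one with the lowest validation loss. The sketch below shows the loop structure only. `grid_search` is a hypothetical helper, and `toy_validation_loss` is a made-up stand-in for "train a network with these settings and return its validation loss", so the numbers carry no meaning.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Evaluate every combination in `grid` and return the best parameters and score."""
    best_params, best_score = None, float("inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in for training a network and measuring its validation loss.
def toy_validation_loss(learning_rate, batch_size):
    return (learning_rate - 1e-3) ** 2 + (batch_size - 32) ** 2 * 1e-8

grid = {"learning_rate": [1e-4, 1e-3, 1e-2], "batch_size": [16, 32, 64]}
best_params, best_score = grid_search(toy_validation_loss, grid)
print(best_params, best_score)
```

In a real run, the score should come from validation data rather than the test set, so that the test accuracy remains an unbiased estimate.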

### 9) Dataset 2: Mushroom

Mushroom dataset: https://archive.ics.uci.edu/dataset/73/mushroom

Note: This dataset has one feature with missing values. Remove this feature.
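The sketch below illustrates one possible preprocessing path with pandas on a tiny made-up frame: mark missing values, drop every column that contains one, and one-hot encode the remaining categorical features. It assumes missing values are written as `"?"`, which may not match how your `mushroom.csv` stores them, and the column names and values here are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same kind of categorical columns as the dataset.
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "b", "x", "f"],
    "stalk-root": ["e", "?", "c", "?"],
})

# Assumption: missing values are marked with "?".
df = df.replace("?", np.nan)
df = df.dropna(axis=1)   # drop every column that contains a missing value

# One-hot encode the categorical features and turn the class into 0/1.
encoded = pd.get_dummies(df.drop(columns="class"))
features = encoded.to_numpy(dtype=float)
target = (df["class"] == "p").to_numpy(dtype=float).reshape(-1, 1)

print(list(encoded.columns))
print(features.shape, target.shape)
```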

In [65]:
data_mushroom = pd.read_csv("mushroom.csv").to_numpy()

# TODO: Preprocess the data.

# TODO: Split the data into train and test

# TODO: Train a neural network on the data

# TODO: Visualize the loss for each epoch

# TODO: Visualize the test accuracy for each epoch

When hyper-parameter tuning, please write the parameters and network sizes you test here:

* Parameter 1: 
* Parameter 2:

* Neural network sizes: 

In [66]:
# TODO: Hyper-parameter tuning

# TODO: Visualize the loss after hyper-parameter tuning for each epoch

# TODO: Visualize the test accuracy after hyper-parameter tuning for each epoch

### 10) Dataset 3: Adult

Adult dataset: https://archive.ics.uci.edu/dataset/2/adult

**IMPORTANT NOTE:** This dataset is much larger than the previous two (48843 instances). If your code runs slow on your own computer, you may exclude parts of this dataset, but you must keep a minimum of 10000 datapoints.

In [None]:
dataset_3 = pd.read(...) # TODO: Read the data.

# TODO: Preprocess the data.

# TODO: Split the data into train and test

# TODO: Train a neural network on the data

# TODO: Visualize the loss for each epoch

# TODO: Visualize the test accuracy for each epoch

When hyper-parameter tuning, please write the parameters and network sizes you test here:

* Parameter 1: 
* Parameter 2:

* Neural network sizes: 

In [None]:
# TODO: Hyper-parameter tuning

# TODO: Visualize the loss after hyper-parameter tuning for each epoch

# TODO: Visualize the test accuracy after hyper-parameter tuning for each epoch

# Questions for examination:

In addition to completing the assignment with all its tasks, you should also prepare to answer the following questions:

1) Why would we want to use different activation functions?

2) Why would we want to use different loss functions?

3) Why are neural networks sensitive to large input values?

4) What is the role of the bias? 

5) What is the purpose of hyper-parameter tuning?

6) A small example neural network will be shown during the oral examination. You will be asked a few basic questions related to the number of weights, biases, inputs and outputs.
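As preparation for question 6, it can help to count the parameters of a fully connected network by hand and check the result in code. The sketch below assumes a plain MLP where every layer is fully connected to the next and each non-input neuron has one bias; `count_parameters` is a hypothetical helper, and the layer sizes match the simple test network used earlier (4 inputs, hidden layers of 18 and 12 neurons, 1 output).

```python
def count_parameters(layer_sizes):
    # One weight per connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    # One bias per neuron in every layer except the input layer.
    biases = sum(layer_sizes[1:])
    return weights, biases

print(count_parameters([4, 18, 12, 1]))  # (300, 31)
```

Here the weights are 4·18 + 18·12 + 12·1 = 300 and the biases are 18 + 12 + 1 = 31.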

# Finished!

Was part of the setup incorrect? Did you spot any inconsistencies in the assignment? Could something improve?

If so, please write them down and send them via email to:

* marcus.gullstrand@ju.se

Thank you!