# Tutorial on Neural Networks 
## ECM3412/ECMM409 - Nature Inspired Computation

In this tutorial, we will learn how to build a multi-layer perceptron (MLP) to solve the XOR logic gate problem. In addition, we will look at how to use Python's `scikit-learn` to build an MLP classifier for a real-world dataset: [the Wine dataset](https://archive.ics.uci.edu/ml/datasets/wine).

To follow this tutorial, please make sure you have `numpy`, `pandas`, `matplotlib` and `scikit-learn` installed on your local machine. Alternatively, you can use [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) to run your code in Google's cloud. (You will need a Google account to do so.)

**Intended learning outcomes:**
- To familiarise yourself with the learning algorithm for an MLP, in particular the feedforward and backpropagation phases involved in training a neural network.
- To gain hands-on experience of building neural network models using Python's `scikit-learn`.
- To understand how to evaluate model performance.
- To understand how to tune parameters to achieve better performance.
- To understand why standardisation may help to improve performance.


## 1. MLP for the XOR problem


###  The structure of a neural network with one hidden layer

![nn_pic.png](nn_pic.png)

Please note that bias terms are included in the diagram above. A bias acts like the intercept in a linear equation: it is an additional parameter that is added to the weighted sum of a neuron's inputs to adjust its output. In effect, the bias shifts the activation function to the left or to the right.
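
To make the shifting effect concrete, here is a minimal sketch for a single sigmoid neuron with one input. The weight `w` and the bias values are hypothetical numbers chosen purely for illustration; they are not taken from the network trained later in this tutorial.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def neuron(x, w, b):
    # Output of a single sigmoid neuron with one input, weight w and bias b
    return sigmoid(w * x + b)

w = 2.0  # hypothetical weight, chosen for illustration
midpoints = {}
for b in (-2.0, 0.0, 2.0):  # hypothetical bias values
    # The neuron outputs exactly 0.5 where w * x + b = 0
    midpoints[b] = (0.0 - b) / w
    print(f"b = {b:+.1f}: output crosses 0.5 at x = {midpoints[b]:+.1f}")
```

A positive bias moves the crossing point to the left, and a negative bias moves it to the right, while the shape of the curve stays the same.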

The XOR logic gate returns **True (1)** when the two Boolean inputs are different, otherwise it returns **False (0)**.
Here is the simple training dataset.

**x1**|**x2**|**y**
:-----:|:-----:|:-----:
0|0|0
0|1|1
1|0|1
1|1|0

In [None]:
import numpy as np # For array operations
import matplotlib.pyplot as plt # For plotting

In [None]:
# Define the training data as a numpy array
# Please note that the first column is bias
X = np.array([[1, 0, 0],
            [1, 0, 1],
            [1, 1, 0],
            [1, 1, 1]])

# The labels for the training data.
y = np.array([[0],
            [1],
            [1],
            [0]])

In [None]:
X

In [None]:
y

### Additional parameters

In [None]:
num_i_units = 3 # Number of Input units (bias included)
num_h_units = 2 # Number of Hidden units
num_o_units = 1 # Number of Output units

# The learning rate for Gradient Descent.
learning_rate = 0.15
# error
costs = []   # a list to record the cost of the NN after each Gradient Descent iteration.

# number of epochs
epochs = 10000

# Number of training examples
m = len(X)

### Weights and Biases
These are the parameters that the neural network needs to learn in order to make accurate predictions.

For the connections from the input layer to the hidden layer, the weights and biases are arranged so that **each column contains the weights for one hidden unit**. The shape of this set of weights is therefore *(number of input units $\times$ number of hidden units)*.

So, the overall shapes of the weights and biases are:

**Weights1 (Connection from input to hidden layers)**: num_i_units $\times$ num_h_units

**Weights2 (Connection from hidden to output layers)**: (num_h_units + 1) $\times$ num_o_units, where the extra row holds the bias weight for the output unit
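
As a quick sanity check of these shapes, the sketch below pushes the four XOR samples through two random weight matrices and prints the shape at each stage. Note that the code in this tutorial appends a bias column to the hidden layer, so the second weight matrix needs one extra row. The seed and the use of `np.random.default_rng` are assumptions made for this illustration only, and the activation function is left out because it does not change any shape.

```python
import numpy as np

num_i_units = 3  # input units (bias included)
num_h_units = 2  # hidden units
num_o_units = 1  # output units

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
X_demo = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])

W1_demo = rng.standard_normal((num_i_units, num_h_units))
W2_demo = rng.standard_normal((num_h_units + 1, num_o_units))  # +1 row for the hidden-layer bias

hidden_in = X_demo @ W1_demo                                  # (4, 3) @ (3, 2) -> (4, 2)
hidden_with_bias = np.hstack([np.ones((4, 1)), hidden_in])   # (4, 2) -> (4, 3)
output_in = hidden_with_bias @ W2_demo                        # (4, 3) @ (3, 1) -> (4, 1)
print(hidden_in.shape, hidden_with_bias.shape, output_in.shape)
```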

### Initialising the Weights and Biases

The weights are drawn from a [normal (Gaussian) distribution](http://mathworld.wolfram.com/NormalDistribution.html). The random number generator is seeded so that the outcome is the same on every run.

In [None]:
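# Set the random seed for reproducible results. Bear in mind that different random states can affect the algorithm's convergence.
np.random.seed(3412)

W1 = np.random.randn(num_i_units, num_h_units)  # input-to-hidden weights (bias row included)
W2 = np.random.randn(num_h_units + 1, num_o_units)  # hidden-to-output weights (+1 row for the hidden-layer bias)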

In [None]:
W1

In [None]:
W2

### Activation function

We will use the Sigmoid (Logistic) function as the activation function. The sigmoid function is a non-linear function that maps any input to a value between 0 and 1.
![](sigmoid-curve.png)

In [None]:
# Activation function: sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of sigmoid for backpropagation
def sigmoid_deriv(x):
    return sigmoid(x)*(1-sigmoid(x))
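
One way to convince yourself that the derivative formula is right is to compare it against a central finite-difference approximation. The sketch below does this on a handful of points; the step size `eps` and the test points are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    return sigmoid(x) * (1 - sigmoid(x))

x_check = np.linspace(-5, 5, 11)  # arbitrary test points
eps = 1e-6                         # arbitrary small step

# Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
numeric = (sigmoid(x_check + eps) - sigmoid(x_check - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_deriv(x_check)))
print(f"Largest difference between the two derivatives: {max_gap:.2e}")
```

The derivative is largest at `x = 0`, where it equals 0.25, and shrinks towards zero for large positive or negative inputs.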

### Forward propagation
During forward propagation, the inputs are fed through the network to compute the prediction.
In this implementation, the `forward` function accepts a feature matrix in which each row is the feature vector of a single sample. If the `predict` flag is set to `True`, the function returns only the final output; otherwise, it returns the pre-activations and outputs of all the layers.

In [None]:
# Define a forward function to calculate the predictions 
def forward(x, W1, W2, predict=False):

    a1 = np.matmul(x, W1)  # pre-activation for the hidden layer (4x3)x(3x2)-->(4x2)
    z1 = sigmoid(a1)  # output of the hidden layer (4x2)-->(4x2)
    
    # create and add bias
    bias = np.ones((len(z1), 1))  # bias term for the hidden layer (4x1)
    z1 = np.concatenate((bias, z1), axis=1)  # concatenate the bias column with the hidden layer output (4x3)
    a2 = np.matmul(z1, W2)  # pre-activation for the output neuron
    z2 = sigmoid(a2)  # output

    if predict: 
        return z2
    return a1, z1, a2, z2
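
Before training anything, it can help to see the forward pass solve XOR with weights picked by hand. The sketch below repeats the forward computation in a compact form and uses hypothetical weights (not the ones learned in this tutorial): the first hidden unit behaves like OR, the second like AND, and the output unit fires when OR is on and AND is off.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_demo(x, W1, W2):
    # Same computation as the forward pass: hidden layer, add bias column, output layer
    hidden = sigmoid(x @ W1)
    hidden = np.hstack([np.ones((len(hidden), 1)), hidden])
    return sigmoid(hidden @ W2)

X_demo = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # first column is the bias

# Hand-picked weights (hypothetical). Rows: bias, x1, x2. Columns: hidden unit 1 (OR), hidden unit 2 (AND).
W1_hand = np.array([[-10.0, -30.0],
                    [ 20.0,  20.0],
                    [ 20.0,  20.0]])
# Rows: bias, hidden unit 1, hidden unit 2.
W2_hand = np.array([[-10.0],
                    [ 20.0],
                    [-20.0]])

outputs = forward_demo(X_demo, W1_hand, W2_hand)
print(np.round(outputs).ravel())
```

The rounded outputs reproduce the XOR truth table, which shows that two hidden units are enough to represent the function; training then has to find such weights on its own.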

### Backpropagation

Backpropagation propagates the error at the output layer backwards through the network to calculate the error in each layer. Intuitively, it is like forward propagation, but in reverse.
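
For reference, the quantities computed in the code for this section can be written as the equations below. This is a sketch under one assumption that the text does not state: the output error $z_2 - y$ is the gradient of the binary cross-entropy loss with respect to $a_2$ when the output unit is a sigmoid (with a squared-error loss there would be an additional factor $\sigma'(a_2)$). Here $z_0$ denotes the input matrix `X`, $z_1$ is the hidden layer output with its bias column, $\tilde{W}_2$ denotes `W2` without its bias row, $\odot$ is the element-wise product, $\eta$ is the learning rate and $m$ is the number of training examples.

$$
\begin{aligned}
\delta_2 &= z_2 - y \\
\Delta_2 &= z_1^{\top} \delta_2 \\
\delta_1 &= \left(\delta_2 \tilde{W}_2^{\top}\right) \odot \sigma'(a_1) \\
\Delta_1 &= z_0^{\top} \delta_1 \\
W_k &\leftarrow W_k - \frac{\eta}{m} \Delta_k, \qquad k = 1, 2
\end{aligned}
$$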

In [None]:
# Backprop function
def backprop(a1, z0, z1, z2, y, W2):
    # Error at the output layer: output - target
    delta2 = z2 - y
    # Gradient for W2 (hidden-to-output weights, bias row included)
    Delta2 = np.matmul(z1.T, delta2)

    # Error at the hidden layer; W2[1:, :] skips the bias row
    delta1 = (delta2.dot(W2[1:, :].T)) * sigmoid_deriv(a1)
    # Gradient for W1 (input-to-hidden weights)
    Delta1 = np.matmul(z0.T, delta1)

    return delta2, Delta1, Delta2

### Training
This is the training loop, which runs the forward propagation and backpropagation phases in every epoch.

The gradients of the weights and biases are averaged over the number of training examples, multiplied by the learning rate, and subtracted from the current weights and biases.

The cost (here, the mean absolute error over all training examples) is recorded at every epoch.

In [None]:
for i in range(epochs):

    # Forward propagation
    a1, z1, a2, z2 = forward(X, W1, W2)

    # Back propagation
    delta2, Delta1, Delta2 = backprop(a1, X, z1, z2, y, W2)

    W1 = W1 - learning_rate*(1/m)*Delta1
    W2 = W2 - learning_rate*(1/m)*Delta2

    # Add costs to list for plotting
    c = np.mean(np.abs(delta2))
    costs.append(c)

    if i % 1000 == 0:
        print(f"Iteration: {i}; Error {c}")

# Training complete
print("Training complete.")

In [None]:
# Print the trained weights and biases

In [None]:
W1

In [None]:
W2

### Make predictions

In [None]:
z3 = forward(X, W1, W2, True)
print(f"Network outputs:\n {z3}\n")
print(f"Predictions:\n {np.round(z3)}\n")

### Plot results
Plot the error recorded at each epoch.

In [None]:

# Assigning the axes to the different elements.
plt.plot(range(epochs), costs)

# Labelling the x axis as the iterations axis.
plt.xlabel("Iterations")

# Labelling the y axis as the cost axis.
plt.ylabel("Error")

# Showing the plot.
plt.show()

## Questions:

1. How does the learning rate parameter affect the convergence of the learning algorithm? 
2. How does the number of epochs affect the convergence of the learning algorithm?
3. How does the structure of the neural network affect the convergence of the learning algorithm? (Tip: try changing the number of units in the hidden layer.) What about the number of hidden layers?
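
To explore these questions, it is convenient to wrap the training loop in a function so that it can be rerun with different settings. The sketch below is one possible way to do this, not the notebook's own code: `train_xor` is a hypothetical helper, it draws its weights from `np.random.default_rng` (so the numbers will differ from the cells above), and it computes the sigmoid derivative from the hidden outputs, which is mathematically the same as `sigmoid_deriv(a1)`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_xor(learning_rate, epochs=2000, num_h_units=2, seed=3412):
    """Train the XOR network and return the mean absolute error per epoch."""
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # first column is the bias
    y = np.array([[0], [1], [1], [0]])
    m = len(X)
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((X.shape[1], num_h_units))
    W2 = rng.standard_normal((num_h_units + 1, 1))
    costs = []
    for _ in range(epochs):
        # Forward propagation
        h = sigmoid(X @ W1)
        z1 = np.hstack([np.ones((m, 1)), h])
        z2 = sigmoid(z1 @ W2)
        # Backpropagation (delta1 uses W2 before it is updated)
        delta2 = z2 - y
        delta1 = (delta2 @ W2[1:, :].T) * h * (1 - h)
        # Gradient descent update
        W2 = W2 - learning_rate / m * (z1.T @ delta2)
        W1 = W1 - learning_rate / m * (X.T @ delta1)
        costs.append(float(np.mean(np.abs(delta2))))
    return costs

results = {}
for lr in (0.01, 0.15, 1.0):
    results[lr] = train_xor(lr)
    print(f"learning_rate={lr}: final error {results[lr][-1]:.4f}")
```

The same function can be called with a different `epochs`, `num_h_units` or `seed` to study the other questions, and the returned lists can be plotted with `plt.plot` in the same way as the error curve above.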


## 2. The Wine Dataset
### Dataset Information:
https://archive.ics.uci.edu/ml/datasets/Wine
>These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
The attributes are (donated by Riccardo Leardi, riclea '@' anchem.unige.it )
 0. **Wine class**
 1. Alcohol
 2. Malic acid
 3. Ash 
 4. Alcalinity of ash
 5. Magnesium 
 6. Total phenols 
 7. Flavanoids
 8. Nonflavanoid phenols
 9. Proanthocyanins
 10. Color intensity
 11. Hue
 12. OD280/OD315 of diluted wines
 13. Proline

In [None]:
import pandas as pd

In [None]:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
print(wine.shape)
wine.head()


In [None]:
X = wine[np.arange(1,14)]
y = wine[0]

Now our task is to split the data into the training and testing sets, using the sklearn `train_test_split` helper function.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
seed=3412 # set random state
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=seed)
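
With only 178 samples, a purely random split can leave the three classes unevenly represented in the test set. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions similar in both parts. The sketch below compares the two; it assumes the copy of the Wine data bundled with scikit-learn (`load_wine`, whose labels are 0, 1, 2 rather than 1, 2, 3) so that it runs without a download, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Bundled copy of the Wine data (assumption: used here instead of the CSV download)
X_demo, y_demo = load_wine(return_X_y=True)

splits = {
    "random": train_test_split(X_demo, y_demo, test_size=0.3, random_state=3412),
    "stratified": train_test_split(X_demo, y_demo, test_size=0.3, random_state=3412,
                                   stratify=y_demo),
}
for name, (X_tr, X_te, y_tr, y_te) in splits.items():
    # Count how many samples of each class ended up in the test set
    print(f"{name}: test-set class counts = {np.bincount(y_te)}")
```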

In [None]:
X_train.head()

Now, let us train our MLP Neural Network with **50 units** in the hidden layer. 

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
clf = MLPClassifier(hidden_layer_sizes=(50,), random_state=seed)  # one hidden layer with 50 units

In [None]:
clf.fit(X_train,y_train)

### Let us measure the performance of our NN using the test set.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, clf.predict(X_test))  # arguments are (y_true, y_pred)
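
Accuracy is a single number, so it does not show which classes are confused with each other. A confusion matrix gives that breakdown. The sketch below is self-contained and, as an assumption, uses the bundled `load_wine` copy of the data (labels 0, 1, 2) with the same split and classifier settings as above; the `_demo` names are illustrative. The classifier may emit a convergence warning with the default number of iterations.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = load_wine(return_X_y=True)  # bundled copy of the Wine data
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3412)

clf_demo = MLPClassifier(hidden_layer_sizes=(50,), random_state=3412)
clf_demo.fit(X_tr, y_tr)
y_pred = clf_demo.predict(X_te)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
print(cm)

# The accuracy is the share of samples on the diagonal
print(np.trace(cm) / cm.sum(), accuracy_score(y_te, y_pred))
```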

### Let us use cross-validation to have a more reliable estimation of the accuracy score.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [None]:
num_folds = 5
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

In [None]:
clf_cv = MLPClassifier(hidden_layer_sizes=(50,), random_state=seed)  # one hidden layer with 50 units
scores = cross_val_score(clf_cv, X, y, cv=kfold)

In [None]:
print(f"Accuracy: mean: {scores.mean()} standard deviation: {scores.std()}")
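
To see what `KFold` does behind the scenes, the sketch below splits ten stand-in sample indices into five folds and prints which indices are held out each time. The ten-sample array is a toy example chosen for readability, not the Wine data.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)  # stand-in for 10 samples
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=3412)

held_out = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(indices)):
    print(f"fold {fold}: train on {len(train_idx)} samples, test on indices {test_idx}")
    held_out.extend(test_idx.tolist())
    fold_sizes.append(len(test_idx))
```

Every sample is held out exactly once, so each of the five scores returned by `cross_val_score` is measured on data the model did not see during that fit.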

## Exercise 1
Now, your task is to use cross-validation to find the best number of hidden units in the network. 
You will have to do the following:
 * Iterate over the following numbers of hidden units: [20,30,40,50,60,70]. For each configuration, measure the performance of the classifier using K-Fold cross-validation (K=5).
 
 Which configuration produced the best performance?

### Now we will rescale the data before the training. 
Standardisation rescales each feature by removing its mean and scaling it to unit variance.

[Scikit-learn standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

[Why do we need scaling?](https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35#:~:text=Feature%20scaling%20is%20essential%20for,that%20calculate%20distances%20between%20data.&text=Since%20the%20range%20of%20values,not%20work%20correctly%20without%20normalization.)
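
In other words, each value is replaced by `(x - mean) / std`, computed per column. The sketch below checks this against `StandardScaler` on a tiny made-up array whose two columns have very different ranges; the numbers are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Manual standardisation, column by column (population standard deviation)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither feature dominates the weighted sums simply because of its units.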

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
X_s = StandardScaler().fit_transform(X)
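
One caveat: fitting the scaler on the whole dataset, as in the cell above, means the means and variances are computed using samples that later serve as test folds. For a small experiment this usually makes little difference, but a common alternative is to put the scaler and the classifier in a pipeline so that the scaler is refitted on the training part of each fold. The sketch below shows this variation; it is not the tutorial's own method, and it assumes the bundled `load_wine` copy of the data so that it runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)  # bundled copy of the Wine data

# The scaler is fitted inside each training fold, then applied to the held-out fold
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), random_state=3412),
)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=3412)
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=kfold_demo)
print(f"Accuracy: mean: {scores_demo.mean():.3f} standard deviation: {scores_demo.std():.3f}")
```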

## Exercise 2
 * Repeat the experiments above on the standardised dataset.
 * How much did the performance increase compared with the unstandardised dataset?