**Section 9: The Perceptron and Neural Networks**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.0, July 1 2024


## Required `import`-statements

`import` statements required for this notebook.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler,scale

With the following code we suppress the warning that some seaborn code will be deprecated in the future.

In [None]:
import warnings # To suppress some warnings
 
# Suppress the specific FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning, module="seaborn")


# 1. Introduction 

In this notebook we will look at **neural networks**. We start with the **perceptron**, the simple unit (neuron) that neural networks are composed of. A perceptron can be used to classify **linearly separable** data. We will use it to learn the `and` operator. Then we will try to learn the `xor` operator. This is not possible, as `xor` is not linearly separable.

We will then use a neural network to classify the `xor` operation. The network can be trained with standard parameters, but it becomes quite complex. Knowing more about the details of the problem, `xor` can be implemented with a much simpler network.

After this we will look at the Iris and the Breast Cancer data sets. Specifically, we will:
* see how to split data into "normal" and stratified sets and observe the difference
* split the data into training and test data
* output the confusion matrix and accuracy
* use normalization and observe the difference
* use cross-validation
* test different network architectures


# 2. The Perceptron

We start by defining a data frame for the `and` operator.

| `x` |  `y` |  `x and y` |
|---|----|----|
|`False`| `False`| `False`|
|`False`| `True`| `False`|
|`True`| `False`| `False`|
|`True`| `True`| `True`|


As the perceptron works with numerical input and output, we replace `False` by 0 and `True` by 1.

In [None]:
andData=np.array([[0,0,0],[0,1,0],[1,0,0],[1,1,1]])
andFrame=pd.DataFrame(andData,columns=['x','y','and'])

And we plot it. To better see the points, we enlarge the displayed axis segments:

In [None]:
cols=andFrame.columns
plt.figure()  # pyplot's figure() opens a new current figure; plt.Figure() would create an unmanaged object
sns.scatterplot(data=andFrame, x=cols[0], y=cols[1],hue=cols[2])
plt.xlim((-.5,1.5))
plt.ylim((-.5,1.5))
plt.title('and')

Now we create the $X$ matrix (the features / samples) and the $y$ array with the labels:

In [None]:
X=andFrame.copy()
y=X.pop('and')

We create an instance of the Perceptron and learn the classifier:

In [None]:
perc=Perceptron(random_state=10)
perc.fit(X,y)
print("Weights:",perc.coef_)
print("Intercept:",perc.intercept_)
print("unique class labels:",perc.classes_)

ypred=perc.predict(X)
print(ypred)
print('Accuracy:',perc.score(X,y))

As the function is linearly separable, the perceptron learns it correctly and quite fast.
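
To make the training step less of a black box, the classic perceptron learning rule can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions: a step activation, a learning rate `eta` of 1 and zero initial weights are illustrative choices, and the names `X_and`, `y_and` and `pred_and` are introduced here only. scikit-learn's `Perceptron` is trained differently in detail, so its weights need not match the ones printed below.

```python
import numpy as np

# Truth table of `and`: two inputs per row, one target per row
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

w = np.zeros(2)   # one weight per input
b = 0.0           # intercept (bias)
eta = 1.0         # learning rate (illustrative choice)

for epoch in range(20):                     # upper bound on passes, chosen for this sketch
    errors = 0
    for xi, target in zip(X_and, y_and):
        pred = int(np.dot(w, xi) + b > 0)   # step activation
        update = eta * (target - pred)      # 0 if the sample is classified correctly
        w += update * xi
        b += update
        errors += int(update != 0)
    if errors == 0:                         # a full pass without mistakes: done
        break

pred_and = (X_and @ w + b > 0).astype(int)
print("weights:", w, "intercept:", b, "passes:", epoch + 1)
print("predictions:", pred_and)
```

Because `and` is linearly separable, the loop stops after a handful of passes with all four rows classified correctly.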

Now we generate an `xor` data frame.

| `x` |  `y` |  `x xor y` |
|---|----|----|
|`False`| `False`| `False`|
|`False`| `True`| `True`|
|`True`| `False`| `True`|
|`True`| `True`| `False`|

In [None]:
xorData=np.array([[0,0,0],[0,1,1],[1,0,1],[1,1,0]])
xorFrame=pd.DataFrame(xorData,columns=['x','y','xor'])

And plot it as before:

In [None]:
cols=xorFrame.columns
plt.figure()  # pyplot's figure() opens a new current figure; plt.Figure() would create an unmanaged object
sns.scatterplot(data=xorFrame, x=cols[0], y=cols[1],hue=cols[2])
plt.xlim((-.5,1.5))
plt.ylim((-.5,1.5))
plt.title('xor')

We can easily see that the data is not linearly separable. We try to learn the function by training the perceptron:

In [None]:
X=xorFrame.copy()
y=X.pop('xor')

perc=Perceptron(random_state=10)
perc.fit(X,y)
print("Weights:",perc.coef_)
print("Intercept:",perc.intercept_)
print("unique class labels:",perc.classes_)

ypred=perc.predict(X)
print(ypred)
print('Accuracy:',perc.score(X,y))

The function was not learned. In fact, `xor` cannot be learned by a simple perceptron as this function is not linearly separable. 
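
A short argument shows why no choice of weights can work. Assume (as a convention for this sketch) that the perceptron outputs 1 exactly when $w_1 x + w_2 y + b > 0$; the symbols $w_1$, $w_2$ and $b$ stand for the two weights and the intercept. The four rows of the `xor` table then demand:

```latex
\begin{aligned}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}
```

Adding the second and third condition gives $w_1 + w_2 + 2b > 0$, hence $w_1 + w_2 + b > -b \ge 0$ by the first condition. This contradicts the fourth condition, so the four requirements cannot hold at the same time.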

# 3. A Neural Network for `xor`

We now use a neural network to learn a classifier for `xor`. 

We instantiate a classifier and use it with standard parameter values. `MLPClassifier` stands for "Multi-layer Perceptron classifier".

In [None]:
nn=MLPClassifier()
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

ypred=nn.predict(X)
print(ypred)

A warning indicates that convergence has not yet been reached. The default maximum number of iterations is 200. We can allow more iterations for the training.

In [None]:
nn=MLPClassifier(max_iter=300)
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

ypred=nn.predict(X)
print(ypred)
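
Whether training stopped because of the iteration limit can also be checked in code instead of by reading the warning. The following is a sketch, not part of the original workflow: it uses a deliberately tiny `max_iter=5` so that the limit is certain to be reached, and the names `X_xor`, `y_xor` and `nn_short` are introduced only for this example.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Deliberately tiny iteration budget so that the limit is certainly reached
nn_short = MLPClassifier(max_iter=5, random_state=10)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")      # record every warning raised inside this block
    nn_short.fit(X_xor, y_xor)

hit_limit = nn_short.n_iter_ == nn_short.max_iter
warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", nn_short.n_iter_, "of", nn_short.max_iter)
print("stopped at the limit:", hit_limit, "| ConvergenceWarning raised:", warned)
```

Comparing `n_iter_` with `max_iter` is a simple heuristic: if both are equal, the optimizer ran out of iterations rather than stopping on its own.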

Test different values for `max_iter` and try to get rid of the warning.

In [None]:
# Your Code

**Solution**: 

The trained network has the standard architecture which consists of one hidden layer with 100 nodes. 

In [None]:
print(nn.hidden_layer_sizes)

A neural network for `xor` can be built with one hidden layer with 2 nodes only, i.e. a much smaller network.

Let's test this:

In [None]:
nn=MLPClassifier(hidden_layer_sizes=(2,),random_state=10)
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

ypred=nn.predict(X)
print(ypred)


`xor` was not correctly learned. The weights are not yet stable.

Let's try some more iterations:

In [None]:
nn=MLPClassifier(hidden_layer_sizes=(2,),random_state=10,max_iter=1000)
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

ypred=nn.predict(X)
print(ypred)

And again it's not stable, so let's try more.

In [None]:
nn=MLPClassifier(hidden_layer_sizes=(2,),random_state=10,max_iter=4000)
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

ypred=nn.predict(X)
print(ypred)

The training was finished but `xor` was not learned correctly. 

We know that `xor` can be classified using a neural network with the given architecture. So let's modify other parameters:
- the activation function
- the solver algorithm

In [None]:
nn=MLPClassifier(hidden_layer_sizes=(2,),random_state=10,activation='tanh',solver='lbfgs')
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)
print("coefficients:",nn.coefs_)
print("intercepts:",nn.intercepts_)
ypred=nn.predict(X)
print(ypred)
print('Accuracy:',nn.score(X,y))  # score the neural network, not the perceptron from section 2

The network was correctly trained in 40 iterations. 

**Exercise:** Please sketch a network with two nodes on the input layer, one hidden layer with two nodes and one node on the output layer **on a sheet of paper**.
Assign the weights and intercepts to the sketch and test it by classifying the inputs `(1,0)` and `(1,1)`.
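
For comparison with the trained weights, a 2-2-1 network for `xor` can also be written down by hand. The sketch below is an illustration under assumptions: it uses a step activation instead of `tanh`, and the parameters `W1`, `b1`, `W2`, `b2` are hand-picked (hypothetical) values, not the ones learned above. The weight matrices are laid out like the entries of `coefs_`, with one row per input node and one column per node of the next layer.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked parameters (hypothetical, not the trained values)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # shape (inputs, hidden nodes)
b1 = np.array([-0.5, -1.5])      # hidden node 1 acts as `or`, hidden node 2 as `and`
W2 = np.array([[1.0], [-1.0]])   # output node: `or` minus `and`
b2 = np.array([-0.5])

hidden = step(X_xor @ W1 + b1)           # forward pass through the hidden layer
output = step(hidden @ W2 + b2).ravel()  # forward pass through the output layer
print(output)
```

The same two matrix products, with `tanh` in the hidden layer and the printed `coefs_` and `intercepts_`, are what the paper exercise asks you to carry out by hand.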

# 4. The Iris Dataset

Now we train a neural network for the iris flower data set.

First we load the data:

In [None]:
irisDS=load_iris()

In order to create a nice scatter plot using `pairplot()` from the `seaborn` module the data must be available in form of a pandas data frame. Therefore we create a data frame from the iris data set:

In [None]:
irisDF=pd.DataFrame(data=irisDS.data,columns=irisDS.feature_names)
irisDF["class"]=irisDS.target

And create and save the scatter plot:

In [None]:
sns.pairplot(irisDF,hue='class')
plt.savefig('irisScatter.png',dpi=600)


Now let's train the network with standard parameters:

In [None]:
X=irisDS.data
y=irisDS.target

nn=MLPClassifier(random_state=10)
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

The network did not reach a stable state, so we increase the number of iterations:

In [None]:
nn=MLPClassifier(random_state=10,max_iter=1000)
nn.fit(X,y)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)

And let's take a look at the accuracy:

In [None]:
print('accuracy:',nn.score(X,y))

We calculated the accuracy for the training data. This is of course not the best idea, as we aim to classify previously unseen data. Therefore, we split the data into training and test data:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=10)

And repeat the training and look at the output:

In [None]:
nn=MLPClassifier(random_state=10,max_iter=1000)
nn.fit(X_train,y_train)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)
print('accuracy:',nn.score(X_test,y_test))

As discussed in the lecture, it might be important to "stratify" the sets, i.e. to make sure that the class distribution in the training and test sets corresponds to the class distribution of the original data set.

We create stratified sets by using the additional parameter `stratify`. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=10,stratify=y)
nn.fit(X_train,y_train)
print("number of iterations:",nn.n_iter_)
print("number of weight updates:",nn.t_)
print('accuracy:',nn.score(X_test,y_test))

Please note that the splitting is randomized. Depending on the seed of the random number generator, the results might therefore be better or worse.
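
To see what stratification changes, the class counts of the test sets can be compared directly. This is a small sketch that reuses the split settings from above; the names `X_iris`, `y_iris`, `counts_plain` and `counts_strat` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# Same split settings as above, once without and once with stratification
_, _, _, y_test_plain = train_test_split(
    X_iris, y_iris, test_size=0.33, random_state=10)
_, _, _, y_test_strat = train_test_split(
    X_iris, y_iris, test_size=0.33, random_state=10, stratify=y_iris)

# np.bincount counts how often each class label 0, 1, 2 occurs
counts_plain = np.bincount(y_test_plain, minlength=3)
counts_strat = np.bincount(y_test_strat, minlength=3)
print("test set class counts, plain:     ", counts_plain)
print("test set class counts, stratified:", counts_strat)
```

Since the iris data contains 50 samples of each class, the stratified test set holds the three classes in (almost) equal numbers, whereas the plain split may be noticeably unbalanced.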

In addition to the accuracy, we can calculate and print the confusion matrix:

In [None]:
ypred=nn.predict(X_test)
cm=confusion_matrix(y_test,ypred)
print(cm)

We can also create a colorful display:

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=irisDS.target_names)
disp.plot()
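
The connection between the confusion matrix and the accuracy can be made explicit with a few lines of NumPy. The labels below are hypothetical values for eight samples and three classes, chosen only for illustration. As in `confusion_matrix()`, rows stand for the true class and columns for the predicted class.

```python
import numpy as np

# Hypothetical labels for eight samples and three classes (illustration only)
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 2, 1, 2, 2, 1])

n_classes = 3
cm_manual = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm_manual, (y_true, y_hat), 1)   # rows: true class, columns: predicted class

accuracy = np.trace(cm_manual) / cm_manual.sum()        # correct predictions lie on the diagonal
recall = np.diag(cm_manual) / cm_manual.sum(axis=1)     # per true class (row-wise)
precision = np.diag(cm_manual) / cm_manual.sum(axis=0)  # per predicted class (column-wise)

print(cm_manual)
print("accuracy:", accuracy)
print("recall per class:", recall)
print("precision per class:", precision)
```

The accuracy is the sum of the diagonal divided by the number of samples. The off-diagonal entries show which classes are confused with each other, which a single accuracy value hides.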

# 5. The Breast Cancer Data Set

We load the data set:

In [None]:
ds=load_breast_cancer()
X=ds.data
y=ds.target

The data set has a class attribute with two values, malignant and benign.

We split the data into training and test data (not yet stratified), create a neural network with standard parameters, train it and output the result:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=10)

nn=MLPClassifier(random_state=10)
nn.fit(X_train,y_train)

ypred=nn.predict(X_test)
print('accuracy:',nn.score(X_test,y_test)) 
cm=confusion_matrix(y_test,ypred)
print(cm)

Now we repeat the process using stratified training and test sets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=10,stratify=y)

nn=MLPClassifier(random_state=10)
nn.fit(X_train,y_train)

ypred=nn.predict(X_test)
print('accuracy:',nn.score(X_test,y_test)) 
cm=confusion_matrix(y_test,ypred)
print(cm)

Here the model did not converge yet, so we increase the number of iterations:

In [None]:
nn=MLPClassifier(random_state=10,max_iter=500)
nn.fit(X_train,y_train)

ypred=nn.predict(X_test)
print('accuracy:',nn.score(X_test,y_test)) 
cm=confusion_matrix(y_test,ypred)
print(cm)

We see that the accuracy was improved. 

Remember: The accuracy depends on the split, as training and test data are split randomly!

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.33, random_state=12, stratify=y)
nn=MLPClassifier(random_state=10,max_iter=1000)
nn.fit(X_train2,y_train2)

ypred2=nn.predict(X_test2)
print('accuracy:',nn.score(X_test2,y_test2)) 
cm=confusion_matrix(y_test2,ypred2)
print(cm)

Now we will display the minimum and maximum values of all features in tabular form using a loop:

In [None]:
for i in range(X.shape[1]):
    print('{:25}: minimum: {:8.3f} maximum: {:9.3f}'.format(ds.feature_names[i],X[:,i].min(),X[:,i].max()))

We observe that the ranges of the features differ considerably. As neural networks are sensitive to such differences, we **scale** the data using z-score normalization:

In [None]:
Xscaled=scale(X)

And output the data again:

In [None]:
for i in range(Xscaled.shape[1]):
    print('{:25}: minimum: {:8.3f} maximum: {:9.3f}'.format(ds.feature_names[i],Xscaled[:,i].min(),Xscaled[:,i].max()))
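
The z-score transformation itself is simple: each value $x$ of a feature is replaced by $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature. The sketch below reproduces this by hand on a small hypothetical matrix `X_demo` (not part of the data set). It uses the population standard deviation (`ddof=0`), which is also what `scale()` uses.

```python
import numpy as np

# Small hypothetical feature matrix: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

mean = X_demo.mean(axis=0)            # per-feature mean
std = X_demo.std(axis=0)              # per-feature standard deviation (population, ddof=0)
X_demo_scaled = (X_demo - mean) / std

print(X_demo_scaled)
print("means after scaling:", X_demo_scaled.mean(axis=0))
print("stds after scaling: ", X_demo_scaled.std(axis=0))
```

After scaling, every feature has mean 0 and standard deviation 1, so both columns end up on the same scale.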

**Warning:** As discussed in the lecture, training, validation and test data sets should be independent! By performing z-score scaling on the whole data set, we violate this requirement to a certain extent: the scaling parameters are based on the complete data set, and therefore some information from the test data is used when scaling the training data!
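
One common way to avoid this leakage is to estimate the scaling parameters on the training data only and to reuse them for the test data. The following sketch shows this with `StandardScaler`; it is an alternative to the approach used in this notebook, and the names `X_bc`, `X_tr`, `X_te` and `scaler` are introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.33, random_state=10, stratify=y_bc)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # mean and std are estimated on the training part only
X_te_scaled = scaler.transform(X_te)       # the test part reuses the training statistics

print("largest |mean| in scaled training data:", abs(X_tr_scaled.mean(axis=0)).max())
print("largest |mean| in scaled test data:    ", abs(X_te_scaled.mean(axis=0)).max())
```

The scaled training features have mean 0 by construction, while the test features are only approximately centered, because their statistics were never looked at. The following cells continue with `Xscaled`.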

We now use the scaled data set `Xscaled` to train and test the model:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Xscaled, y, test_size=0.33, random_state=10,stratify=y)

nn=MLPClassifier(random_state=10,max_iter=1000)
nn.fit(X_train,y_train)

ypred=nn.predict(X_test)
print('accuracy:',nn.score(X_test,y_test)) 
cm=confusion_matrix(y_test,ypred)
print(cm)

And we see that the accuracy has improved.

But we still cannot draw any valid conclusion about the accuracy, as it is based on one split only. So we use cross-validation and compare the results for the scaled and unscaled data. The function `cross_validate()` uses stratified folds if the estimator is a classifier and the problem is a binary / multiclass classification problem:

In [None]:
nn=MLPClassifier(random_state=10,max_iter=1000)

results=cross_validate(nn,X,y)
resultsUnscaled=results['test_score']
print('Unscaled:', resultsUnscaled)

results=cross_validate(nn,Xscaled,y)
resultsScaled=results['test_score']
print('Scaled:  ',resultsScaled)

print('mean accuracy unscaled: {:.6f}, std: {:.6f}, var: {:.6f}'.format(resultsUnscaled.mean(),
                                                                   resultsUnscaled.std(),resultsUnscaled.var()))
print('mean accuracy   scaled: {:.6f}, std: {:.6f}, var: {:.6f}'.format(resultsScaled.mean(),
                                                                   resultsScaled.std(),resultsScaled.var()))      
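
Scaling and cross-validation can be combined so that the scaler is fitted on the training folds only. The sketch below puts `StandardScaler` and `MLPClassifier` into a pipeline and passes explicit stratified folds to `cross_validate()`. Shuffling the folds with seed 10 and the names `model`, `folds` and `scores` are choices made for this example, so the numbers will differ somewhat from the results above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# The scaler is part of the model, so it is re-fitted on the training folds only
model = make_pipeline(StandardScaler(),
                      MLPClassifier(random_state=10, max_iter=1000))

# Explicit stratified folds; shuffling and the seed are choices made for this sketch
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
cv_results = cross_validate(model, X_bc, y_bc, cv=folds)

scores = cv_results['test_score']
print("fold accuracies:", scores)
print("mean: {:.4f}, std: {:.4f}".format(scores.mean(), scores.std()))
```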



In the lecture we said that the network architecture can also be optimized. In the following code we test several different network architectures using a `for` loop:

In [None]:
nnLayers=[(100,),(50,),(20,),(50,20),(50,20,5)]

results={}
for hl in nnLayers:
    nn=MLPClassifier(random_state=10,hidden_layer_sizes=hl,max_iter=1000)
    results[str(hl)]=cross_validate(nn,Xscaled,y)

for hl in results.keys():
    res=results[hl]['test_score']
    print('{:12}: mean accuracy: {:.6f}, std: {:.6f}, var: {:.6f}'.format(hl,res.mean(),res.std(),res.var()))

As can be seen in the results above, more nodes or more layers do not necessarily lead to better results, and the results also depend on the state of the random number generator. $\Rightarrow$ Many tests are required and (remember!) we can only get an estimate of the true accuracy within a confidence interval.

In fact, with the breast cancer data set we do not observe noticeable differences. Depending on the data set, different neural network architectures may lead to relevant differences in the results. So while we do not observe significant differences in this example, the methods can be applied to optimize neural networks and improve classification results.
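
The manual loop over architectures can also be expressed with `GridSearchCV`, which runs the cross-validation for every candidate and keeps the best one. This is a sketch under assumptions: the three candidate architectures and `cv=3` are chosen to keep the run time short, and the names `pipe`, `param_grid` and `search` are introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(random_state=10, max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {'mlpclassifier__hidden_layer_sizes': [(50,), (20,), (50, 20)]}

search = GridSearchCV(pipe, param_grid, cv=3)   # 3 folds keep the run time short
search.fit(X_bc, y_bc)

print("best architecture:", search.best_params_)
print("best mean accuracy: {:.4f}".format(search.best_score_))
```

The best score found this way tends to be optimistic, because the same folds are used both to choose the architecture and to report its accuracy. A separate test set would be needed for an unbiased estimate.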

# 6. Exercise

Load the penguin data set and create a classifier for the `species` attribute.

In [None]:
import seaborn as sns

penguins=sns.load_dataset('penguins')


# 7. Summary

In this last notebook of the course, some of the necessary steps to train a model were applied. Let's summarize them:
- data preparation: as neural networks are susceptible to different ranges of variables, the values were scaled / normalized using z-score scaling. As we applied this to the whole data set before any split was performed, the training data is not completely independent of the test data.
- hyperparameter settings: in the different examples we specified different hyperparameters such as the architecture, the solver algorithm and the activation function
- we tested several hyperparameters using a `for` loop. Please note that we did not use a separate validation set.
- we used cross-fold validation


*End of the Notebook*

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.
