# **Part 2: Supervised Learning: Classification**

In **Supervised Learning**, we have a dataset consisting of both features and labels.
The task is to construct an estimator which is able to predict the label of an object
given the set of features. A relatively simple example is predicting the species of
an iris given a set of measurements of its flower. Some more complicated examples are:

- Given a multicolor image of an object through a telescope, determine
  whether that object is a star, a quasar, or a galaxy.
- Given a photograph of a person, identify the person in the photo.
- Given a list of movies a person has watched and their personal rating
  of each movie, recommend a list of movies they would like
  (so-called *recommender systems*: a famous example is the [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_prize)).

What these tasks have in common is that there are one or more unknown
quantities associated with the object which need to be determined from other
observed quantities.

Supervised learning is further broken down into two categories, **classification** and **regression**.
* In **classification**, the label is discrete. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is one of three distinct categories.
* In **regression**, the label is continuous. For example, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.

In this worksheet, we will focus on **classification** problems.

References: https://www.ibm.com/topics/supervised-learning

## **Setting up our Notebook**

To begin, we need to set up some Python packages and download some files from GitHub. Run the code below by **clicking the <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/playsvg.png?raw=true" alt="drawing" width="28"/> button to load our prerequisite files**.


In [None]:
# Numpy, Pandas and Scikit-Learn modules for Machine Learning Exercise
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Pylab and Matplotlib for plotting
import pylab as pl
import matplotlib.pyplot as plt
from   matplotlib.lines import Line2D
from   matplotlib.colors import ListedColormap
%matplotlib inline
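# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6, so use whichever name is available.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')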

# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00A550', '#0000FF'])

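# Download and install packages that will be helpful for this section.
# Note: GitHub no longer supports Subversion, so the figure_code folder
# is fetched with git instead of "svn checkout".
%pip install -q ipywidgets
!test -d figure_code || git clone -q --depth 1 https://github.com/geoffreyweal/MESA_Bootcamp_2023_ML_Tutorial.git
!test -d figure_code || cp -r MESA_Bootcamp_2023_ML_Tutorial/Notebooks/figure_code .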

# Clear the output of this cell, because the installation messages are long and not useful to us.
# Comment this out if there is a problem when loading the programs and packages above.
from IPython.display import clear_output
clear_output()

When you click the <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/playsvg.png?raw=true" alt="drawing" width="28"/> button, it will be replaced with a <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/stopsvg.gif?raw=true" alt="drawing" width="28"/> icon.

When the files have been successfully loaded, you will see the <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/stopsvg.gif?raw=true" alt="drawing" width="28"/> icon turn back into the <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/playsvg.png?raw=true" alt="drawing" width="28"/> button, and a <font color="green" size="4">&check;</font>tick symbol will appear next to the <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/playsvg.png?raw=true" alt="drawing" width="28"/> button. The <font color="green" size="4">&check;</font>tick symbol means that Google Colab was successfully able to run the code. Google Colab will tell you more information if it encounters an issue.

## **Troubleshooting**

If you have any problems running this notebook, these suggestions may help.

### **What to do if there is a problem running code:**

The easiest thing to do if the code is not working as expected is to restart the runtime and run the code again.

On Google Colab:
1. Click ``Runtime`` > ``Disconnect and delete runtime``
2. A message will appear that will say ``Disconnect and delete runtime
Are you sure that you want to reset this runtime? The state of this runtime, including all local variables and files, will be lost``. Click the ``Yes`` button.
3. Repeat all the code you had run, or go ``Runtime`` > ``Run all`` to run all the code in sequence.

On Visual Studio:
1. Click the ``Restart`` button at the top of Visual Studio,
2. Repeat all the code you had run, or click the ``Run All`` button at the top of Visual Studio.

### **Can't I just load and run everything at once?**

You can run these Google Colab notebooks in full from start to end. To do this, at the top of the notebook click `Runtime -> Run all`. This will run every code box in the notebook. However, **this may take quite some time and is not recommended**.

### **Help, I ran the code by accident!**

If you want to stop code that is running, you can press the <img src="https://github.com/GardenGroupUO/Computational_Silver_Nanoparticle_Exercise_Data/blob/main/Images/stop_images/stopsvg.gif?raw=true" alt="drawing" width="28"/> icon at any time to interrupt the notebook's code.



## **Using the k Nearest Neighbours (kNN) ML algorithm to Predict the Species of Irises**

The k nearest neighbours (kNN) algorithm is one of the simplest machine learning strategies. It predicts the label of a new sample by finding the k training samples most similar to it (its k nearest neighbours) and taking the most common label among them. The algorithm works as follows:

1. Give the parameters for a test sample you would like to get a prediction for.
2. Determine the k samples from the training set that have the most similar parameters to your sample.
3. Of those k most similar training samples, our prediction is the most common type.
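
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how Scikit-learn implements kNN: the function name ``knn_predict`` and the toy data are made up for this example, and similarity is assumed to be measured with the Euclidean distance.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from the new sample to every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k training samples with the smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: the prediction is the most common label among those neighbours.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two features, two classes.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Step 1: choose the samples we want predictions for.
print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # prints 0
print(knn_predict(X_toy, y_toy, [2.9, 3.1], k=3))  # prints 1
```

Each test sample sits inside one of the two clusters, so all three of its nearest neighbours share the same label.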

Let's try it out on our Iris classification problem. We will first inspect the Iris data, then create a kNN object that performs the k-nearest neighbours ML algorithm and fit it to our Iris data.

### **Inspecting and Visualising the Iris Data**

To begin, let's load the Iris data from Scikit-learn and look at what it contains.

In [None]:
# Load our Iris data from Scikit-learn.
iris = datasets.load_iris()
print(iris.keys())

The Iris dataset contains two components we will focus on, the ``data``, and the ``target``.

Each row of ``data`` and the corresponding entry of ``target`` describe one Iris flower that was sampled: ``data`` holds its sepal and petal measurements, and ``target`` holds its species.

The ``data`` contains an array with four columns. These columns describe the sepal length, sepal width, petal length, and petal width, respectively:

In [None]:
df = pd.DataFrame(iris['data'], columns=iris['feature_names'], index=range(1,len(iris['data'])+1))
display(df)

The ``target`` is an array in which each entry is one of the numbers:
* 0: Indicating the Iris is a Setosa,
* 1: Indicating the Iris is a Versicolour,
* 2: Indicating the Iris is a Virginica.

In [None]:
corresponding_species = [iris['target_names'][target] for target in iris['target']]
df = pd.DataFrame({'target': iris['target'], 'Corresponding Species': corresponding_species}, index=range(1,len(iris['data'])+1))
display(df)

Let's take a look at the data visually. However, each flower has 4 measurements: the sepal length, sepal width, petal length, and petal width (side note: this means our feature space has 4 dimensions). To begin, let's just focus on the sepal length and sepal width.

In [None]:
# Make a figure to show the data points for the sepal lengths and widths.
pl.figure()

# Get our Iris data. Here, X_1 and y_1 will be used as our training data, where:
# * X_1 are our feature vectors
# * y_1 indicates whether each iris is a setosa, versicolor, or virginica species
X_1 = iris.data[:, :2] # We only take the first two features (the sepal length and width).
y_1 = iris.target
sepal_length_limits  = (4.0, 8.0)
sepal_width_limits   = (1.9, 4.5)

# Plot our sepal lengths and widths.
pl.scatter(X_1[:, 0], X_1[:, 1], c=y_1, cmap=cmap_bold)
pl.xlabel('sepal length (cm)')
pl.ylabel('sepal width (cm)')
pl.axis('tight')
pl.xlim(sepal_length_limits)
pl.ylim(sepal_width_limits)

# Give the legend details.
legend_elements = [Line2D([0], [0], marker='o', color=cmap_bold(0), label='Setosa',      linewidth=0),
                   Line2D([0], [0], marker='o', color=cmap_bold(1), label='Versicolour', linewidth=0),
                   Line2D([0], [0], marker='o', color=cmap_bold(2), label='Virginica',   linewidth=0)]
pl.legend(handles=legend_elements,frameon=True)
pl.show()

### **Making Iris Predictions using the kNN algorithm**

We will now use a kNN model to predict the species of an iris from its measurements. For our first example, we will only fit our sepal data (the sepal length and width). Here, we will set up a kNN model in Python and train it on our Iris data.
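
The classifier in the next cell is created with ``metric='minkowski'`` and ``p=2``. The Minkowski distance with ``p=2`` is the ordinary Euclidean (straight-line) distance, while ``p=1`` gives the Manhattan distance. The short sketch below illustrates this with a hand-written helper; the function ``minkowski_distance`` and the two feature vectors are made up for this example and are not part of Scikit-learn.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


# Two made-up feature vectors that differ by 3 in one feature and 4 in the other.
point_a = [5.0, 3.0]
point_b = [8.0, 7.0]

print(minkowski_distance(point_a, point_b, p=2))  # Euclidean distance: 5.0
print(minkowski_distance(point_a, point_b, p=1))  # Manhattan distance: 7.0
```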

In [None]:
# Get our Iris data. Here, X_1 and y_1 will be used as our training data, where:
# * X_1 are our feature vectors
# * y_1 indicates whether each iris is a setosa, versicolor, or virginica species
X_1 = iris.data[:, :2]  # We only take the first two features (the sepal length and width).
y_1 = iris.target

# Create the model in python.
n_neighbors_1 = 3
knn_1 = KNeighborsClassifier(n_neighbors=n_neighbors_1,metric='minkowski',p=2)

# Train our kNN model on our Iris data.
knn_1.fit(X_1, y_1)

We can now make a plot that shows what our kNN model predicts for different values of sepal length and width (our feature vectors).

In [None]:
# Create a meshgrid showing what our knn model predicts for all sepal lengths
# and widths between our limits.
sepal_length_limits  = (4.0, 8.0)
sepal_width_limits   = (1.9, 4.5)
sepal_lengths        = np.linspace(sepal_length_limits[0], sepal_length_limits[1], 100)
sepal_widths         = np.linspace(sepal_width_limits[0], sepal_width_limits[1], 100)
xx_1, yy_1           = np.meshgrid(sepal_lengths, sepal_widths)

# Make a prediction of the results for the knn machine learning model for
# every sepal length between 4.0 cm and 8.0 cm, and every sepal width between
# 1.9 cm and 4.5 cm.
Z_1 = knn_1.predict(np.c_[xx_1.ravel(), yy_1.ravel()])

# Convert our results into a meshgrid corresponding to xx_1 and yy_1 for all
# sepal lengths and widths between our limits.
Z_1 = Z_1.reshape(xx_1.shape)

# Make a figure that shows the predictions of our knn model.
pl.figure()
pl.pcolormesh(xx_1, yy_1, Z_1, cmap=cmap_light)

# Plot also our sepal training points.
pl.scatter(X_1[:, 0], X_1[:, 1], c=y_1, cmap=cmap_bold)
pl.xlabel('sepal length (cm)')
pl.ylabel('sepal width (cm)')
pl.axis('tight')
pl.xlim(sepal_length_limits)
pl.ylim(sepal_width_limits)

# Give the legend details.
legend_elements = [Line2D([0], [0], marker='o', color=cmap_bold(0), label='Setosa',      linewidth=0),
                   Line2D([0], [0], marker='o', color=cmap_bold(1), label='Versicolour', linewidth=0),
                   Line2D([0], [0], marker='o', color=cmap_bold(2), label='Virginica',   linewidth=0)]
pl.legend(handles=legend_elements,frameon=True)
pl.show()

For example, if we look at an iris with a sepal length of 5.5 cm and a sepal width of 3.0 cm, we see a green section of our figure. This indicates that our kNN model would predict an iris with this sepal length and width to be a versicolour.
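
Instead of reading the colour off the figure, we can also ask the model directly. The sketch below is self-contained: it refits a 3-neighbour model on the sepal data and queries it for a single sample. The variable names (``X_sepal``, ``sample``) are chosen for this example only.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_sepal = iris.data[:, :2]  # sepal length and sepal width only
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_sepal, y)

# One sample: sepal length (cm), sepal width (cm).
sample = [[5.5, 3.0]]
predicted_label = knn.predict(sample)[0]
print(iris.target_names[predicted_label])

# Distances to, and species of, the 3 nearest training samples.
distances, indices = knn.kneighbors(sample)
for distance, index in zip(distances[0], indices[0]):
    print(f"{iris.target_names[y[index]]}: {distance:.2f} cm away")
```

Note that Scikit-learn spells the species name ``versicolor``.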

We can look at how the algorithm is making predictions by using the interactive plot below.

In [None]:
# Create interactive figure, containing lines for the k nearest neighbours about point i.
from figure_code import visualise_knn
visualise_knn(sepal_length_limits, sepal_width_limits, n_neighbors_1, 'definite', X_1, y_1, Z_1, xx_1, yy_1, cmap_light, cmap_bold, knn_1)

**Use the sliders in the above plot to determine the prediction for different sepal lengths and widths and answer the following questions**.

1. How many setosa, versicolour, and virginica nearest neighbours do we have for a sepal length of ~5.8 cm and a sepal width of ~3.4 cm?
2. What is our species prediction for this iris with a sepal length of ~5.8 cm and a sepal width of ~3.4 cm?
3. Repeat this exercise for an iris with a sepal length of ~5.6 cm and a sepal width of ~3.4 cm.

Note: There is an odd area of this figure (around a sepal length of 6.25 cm and a sepal width of 3.18 cm) where versicolour is predicted, even though there appear to be more blue points than green points nearby. In fact there are more green points; they are just hidden beneath blue points with the same sepal length and width (this data contains overlapping flowers, of the same or different species, that share the same sepal length and width).
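
To check for such hidden points, we can count how many flowers share each (sepal length, sepal width) pair. The following is a small sketch using Pandas; the column names and the variable ``mixed`` are chosen for this example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data[:, :2], columns=["sepal_length", "sepal_width"])
df["species"] = iris.target_names[iris.target]

# For every (sepal length, sepal width) pair, count the flowers and the distinct species.
groups = df.groupby(["sepal_length", "sepal_width"])["species"].agg(["count", "nunique"])

# Keep only the pairs shared by flowers of more than one species.
mixed = groups[groups["nunique"] > 1]
print(mixed)
```

Every row printed corresponds to a single position on the scatter plot where markers of different species are drawn on top of each other.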

### **Predicting the Probability of Iris species using the kNN algorithm**

The version of the kNN model we have used above has been set to give a definite prediction for an Iris with a given sepal width and length. However, we may not want our model to give a definite answer, but rather a probabilistic prediction of which species our Iris flower is.
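
With Scikit-learn's default uniform weighting, the probability that kNN reports for each species is simply the fraction of the k nearest neighbours belonging to that species. The sketch below checks this by hand for one sample; the sample values and variable names are chosen for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_sepal = iris.data[:, :2]  # sepal length and sepal width only
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3).fit(X_sepal, y)

# One sample: sepal length (cm), sepal width (cm).
sample = [[6.0, 3.0]]
probabilities = knn.predict_proba(sample)[0]

# Count the species of the 3 nearest neighbours by hand.
_, indices = knn.kneighbors(sample)
neighbour_counts = np.bincount(y[indices[0]], minlength=3)

for name, p, count in zip(iris.target_names, probabilities, neighbour_counts):
    print(f"{name}: {count} of 3 neighbours, probability {p:.2f}")
```

With 3 neighbours, each probability can therefore only be 0, 1/3, 2/3, or 1.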

Let's try this below. NOTE: For this plot, the colours mean:

* <font color="FF0000" size="">Red</font> region: Likely Setosa
* <font color="E0B91D" size="">Yellow</font> region: Either Setosa or Versicolour
* <font color="1AAA1A" size="">Green</font> region: Likely Versicolour
* <font color="#0FA3B1" size="">Aqua-marine</font> region: Either Versicolour or Virginica
* <font color="0000FF" size="">Blue</font> region: Likely Virginica
* <font color="F99EAD" size="">Pink</font> region: Either Setosa or Virginica
* White region: Equally likely Setosa, Versicolour, or Virginica

In [None]:
# To begin, we will get our kNN model to give us the probability of each Iris species.
Z_1 = knn_1.predict_proba(np.c_[xx_1.ravel(), yy_1.ravel()])

# Put the result into a color plot
Z_1 = Z_1.reshape(list(xx_1.shape)+[3])

# To make it a bit easier to see what is going on, we will lighten the regions of colour a bit.
from figure_code import lighten_rgb_color
for i1 in range(len(Z_1)):
    for i2 in range(len(Z_1[i1])):
        Z_1[i1][i2] = lighten_rgb_color(Z_1[i1][i2], 0.65)

# Now lets visualise the data like we did before, but now for probabilities.
visualise_knn(sepal_length_limits, sepal_width_limits, n_neighbors_1, 'probabilistic', X_1, y_1, Z_1, xx_1, yy_1, cmap_light, cmap_bold, knn_1)

**Use the sliders in the above plot to determine the probabilistic predictions for different sepal lengths and widths and answer the following questions**.

1. How many setosa, versicolour, and virginica nearest neighbours do we have for a sepal length of ~5.8 cm and a sepal width of ~3.4 cm?
2. What is our species prediction for this iris with a sepal length of ~5.8 cm and a sepal width of ~3.4 cm?
3. Repeat this exercise for an iris with a sepal length of ~5.6 cm and a sepal width of ~3.4 cm.

--------

In ML models, a common method for improving the predictive ability of the model is to tune the settings of the model. For the k-Nearest Neighbour algorithm, this is done by changing the number of nearest neighbours that are analysed.

Currently the number of nearest neighbours (``n_neighbors``) that are analysed is 3. We could try a range of values for ``n_neighbors`` and see what happens. This may be important, in particular because some of our versicolour and virginica datapoints have exactly the same sepal widths and lengths.

Note: when changing the ``prediction_type``, it may take a few seconds to load the definite/probabilistic plots.

In [None]:
# Reload our Iris data, and only read in the sepal width and length from the dataset.
iris = datasets.load_iris()
X = iris.data[:, :2] # We only take the first two features (the sepal length and width).
y = iris.target

# Show figure that indicates what type of iris is predicted by the knn algorithm for different values of n_neighbors.
from figure_code import visualise_multi_knn
visualise_multi_knn(sepal_length_limits, sepal_width_limits, X, y, cmap_light, cmap_bold)

These plots are great for seeing how our kNN model decides which species of iris we have based on sepal length and width, but it's still a bit hard to figure out which value of ``n_neighbors`` to choose.

What we could also do is to test how well the kNN algorithm works using a **training and testing set of data**. Here, we randomly split our data into 70 % training data and 30 % testing data.

In [None]:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Get our Iris data.
iris = datasets.load_iris()
X = iris.data[:, :2]  # We only take the first two features (the sepal length and width).
y = iris.target

# Split the data into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=42)

# Testing KNN algorithm for several values of ``n_neighbors``.
training_accuracy = []
test_accuracy = []
neighbors_settings = tuple(range(1, 10))
for n_neighbors in neighbors_settings:
    # Build the model.
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # Record training set accuracy.
    training_accuracy.append(clf.score(X_train, y_train))
    # Record generalization accuracy.
    test_accuracy.append(clf.score(X_test, y_test))

# Plot results for several values of ``n_neighbors``.
pl.plot(neighbors_settings, training_accuracy, '-', label="training accuracy")
pl.plot(neighbors_settings, test_accuracy, 'r--', label="test accuracy")
pl.ylabel("Accuracy")
pl.xlabel("n_neighbors")
pl.ylim((0.5,1.0))
pl.legend()
pl.show()

Now we can see that ``n_neighbors=6`` may be the best choice based on the test data.
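
Rather than reading the best value off the plot, we can pick it programmatically. The sketch below repeats the same split and loop as the cell above and reports the ``n_neighbors`` with the highest test accuracy. The names ``best_index`` and ``best_k`` are chosen for this example, and the result depends on the particular random split, so treat it as a rough guide rather than a definitive answer.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]  # sepal length and sepal width only
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=42)

# Test accuracy for each value of n_neighbors from 1 to 9.
neighbors_settings = list(range(1, 10))
test_accuracy = [
    KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    for k in neighbors_settings
]

# np.argmax returns the first maximum, so ties go to the smallest n_neighbors.
best_index = int(np.argmax(test_accuracy))
best_k = neighbors_settings[best_index]
print(f"Best n_neighbors: {best_k} (test accuracy {test_accuracy[best_index]:.3f})")
```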

## **Using More than 2 Features in our Feature Space**

So far we have only used the sepal data, but we should also use the petal width and length. Let's repeat the above testing procedure, where the feature vector includes the sepal length and width, as well as the petal length and width.

Again, we will split our data into 70 % training set and 30 % test set.

In [None]:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Get our Iris data.
iris = datasets.load_iris()
X = iris.data  # We will now use all the Iris data for this exercise. This was originally "X = iris.data[:, :2]"
y = iris.target

# Split the data into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=42)

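# Testing KNN algorithm for several values of ``n_neighbors``.
training_accuracy = []
test_accuracy = []
neighbors_settings = tuple(range(1, 10))  # Defined here so this cell does not rely on an earlier cell.
for n_neighbors in neighbors_settings: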
    # Build the model.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # Record training set accuracy.
    training_accuracy.append(knn.score(X_train, y_train))
    # Record generalization accuracy.
    test_accuracy.append(knn.score(X_test, y_test))

# Plot results for several values of ``n_neighbors``.
pl.plot(neighbors_settings, training_accuracy, '-', label="training accuracy")
pl.plot(neighbors_settings, test_accuracy, 'r--', label="test accuracy")
pl.ylabel("Accuracy")
pl.xlabel("n_neighbors")
pl.ylim((0.5,1.0))
pl.legend()
pl.show()

We can see why adding the petal length and width will improve the accuracy of our model by looking at the data.

Because it is hard to plot four features (sepal length, sepal width, petal length, and petal width) at once, let's just plot the sepal length, sepal width, and petal width on a 3D scatter plot.

We can see that even just including the petal width significantly helps to distinguish between versicolor and virginica irises.

In [None]:
# We will plot the sepal length, sepal width, and petal width in a 3D scatterplot.
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width', color='species', color_discrete_map={'setosa': '#FF0000', 'versicolor': '#00A550', 'virginica': '#0000FF'}, width=1200, height=900)
fig.show()

We can see that the accuracy of our kNN model is considerably better than before for any value of ``n_neighbors`` we choose when we include more features (i.e. including petal width and length in our kNN model).
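
To compare the two cases side by side, the sketch below computes the test accuracy for ``n_neighbors`` from 1 to 9 using the sepal features only and using all four features, with the same split settings as above. The dictionary names ``feature_sets`` and ``results`` are chosen for this example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
y = iris.target

feature_sets = {
    "sepal only (2 features)": iris.data[:, :2],
    "sepal and petal (4 features)": iris.data,
}

results = {}
for description, X in feature_sets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=42)
    scores = []
    for k in range(1, 10):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores.append(knn.score(X_test, y_test))
    results[description] = scores
    # One line per feature set: test accuracy for n_neighbors = 1, 2, ..., 9.
    print(f"{description}: " + ", ".join(f"{s:.2f}" for s in scores))
```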

Let's choose ``n_neighbors=5`` for the last exercise, partly because ``n_neighbors=5`` does the best at predicting the testing set of data. Now let's make a prediction for a sample of an iris with:
* Sepal length: 3.6 cm
* Sepal width: 2.9 cm
* Petal length: 3.1 cm
* Petal width: 2.0 cm

In [None]:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Get our Iris data.
iris = datasets.load_iris()
X = iris.data  # We use all four features (sepal length and width, petal length and width).
y = iris.target

# Splitting the data into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=69)

# Initialise our kNN classification machine learning model.
knn = KNeighborsClassifier(n_neighbors=5)

# Fit our kNN model with all of the Iris data (the train/test split above is not used here).
knn.fit(X, y)

# Make a probabilistic prediction of the type of Iris species we have
# for an Iris with the following sepal length, sepal width, petal length,
# and petal width.
iris_input_values = [3.6, 2.9, 3.1, 2.0]
probabilities = knn.predict_proba([iris_input_values,])*100.0

# Show the probabilities that our species is a setosa, versicolour,
# or virginica iris in a table.
df = pd.DataFrame(probabilities)
df.columns = ['Setosa', 'Versicolour', 'Virginica']
df.index = ['Probabilities (%)']
display(df)

We can see that for this set of input variables, our model indicates that our sample is likely to be a versicolour (80 %), but there is a slight chance it is a virginica (20 %).