# Exercise set 6: Classification

The main goals of this exercise are to create classifiers and to calculate and interpret performance metrics that can be used to assess them.

**Learning Objectives:**

After completing this exercise set, you will be able to:

- Create classification models.
- Create and interpret the confusion matrix and use it to evaluate classifier performance.
- Visualise how a decision tree is making its classification.

**To get the exercise approved, complete the following problems:**

- [6.1(b)](#6.1(b)), [6.1(c)](#6.1(c)), [6.1(d)](#6.1(d)), and [6.1(e)](#6.1(e)): To show that you can create a decision tree, plot the confusion matrix and visualise the decision tree itself, and compare classifiers using the confusion matrix.

**Note:** A solution to [Exercise 6.2](#Exercise-6.2) is available online (see Blackboard or [GitHub](https://github.com/andersle/chemometrics/tree/main/exercises/exercise-006)) for those who wish to practice the interpretation aspect ([6.2(b)](#6.2(b)) and [6.2(f)](#6.2(f))) without completing the programming portion.

## Exercise 6.1 Penguins

In this exercise, we will have a look at penguins! We will attempt to figure out the species of penguins based
on their bill length, bill depth, flipper length, and body mass.
The data is from a paper by 
[Gorman, Williams, and Fraser](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081)
and can also be found in the R package [palmerpenguins](https://github.com/allisonhorst/palmerpenguins).
Here, we will use a version of the data set [penguins.csv](penguins.csv) where missing
values have been removed and we only have two species of penguins: [Adelie](https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin) and [Chinstrap](https://en.wikipedia.org/wiki/Chinstrap_penguin).

The image below shows the three islands where these penguins can be found (click the image to make it larger):

| <a href="penguins.png"><img src="penguins2.png" width="50%"></a>           |
|:-:|
| **Fig. 1** *Location of islands and images of the penguin species.*    |


The [penguins.csv](./Data/penguins.csv) data file contains seven columns. Each row holds the
observations for a single penguin, with one variable per column:


| Column            |  Description                                                        |
|:------------------|--------------------------------------------------------------------:|
| species           | The species (Chinstrap or Adelie)                                   |
| island            | The island where the observation was made (Dream/Torgersen/Biscoe)  |
| bill_length_mm    | Bill length in mm (see Fig. 2)                                      |
| bill_depth_mm     | Bill depth in mm (see Fig. 2)                                       |
| flipper_length_mm | Flipper length in mm (see Fig. 2)                                   |
| body_mass_g       | The weight of the penguin (in grams)                                |
| sex               | Female/Male                                                         |


| <img src="bill.png" width="50%">                                   |
|:-:|
| **Fig. 2** *Illustration of bill length, bill depth, and flipper length. (The foot is not used in this data set.)*    |

The data can be loaded as follows:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="colorblind")


data = pd.read_csv("penguins3.csv")
data.head()

### 6.1(a)

**Task: Investigate (by creating figures) if the variables `bill_length_mm`, `bill_depth_mm`,
`flipper_length_mm`, and `body_mass_g` can be used to separate the different
species.**

**Hint:** Several plots can be used to get an overview of the data, for instance, a [scatter plot matrix](https://seaborn.pydata.org/examples/scatterplot_matrix.html), a [jointplot](https://seaborn.pydata.org/tutorial/introduction.html#multivariate-views-on-complex-datasets),
or a [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) from seaborn. Here is one example of how to create the scatter plot matrix:
```python
# To create a scatter plot matrix with seaborn:
grid = sns.pairplot(
    data,
    corner=True,
    hue="species",  # Hue is used to select a column from data to use for colouring
)
```
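
A numeric summary per species can complement the figures. The following is a minimal sketch that uses a small, made-up data frame (the values and the name `toy` are hypothetical and are not taken from the penguin data); the same `groupby` call could be applied to `data`:

```python
import pandas as pd

# Hypothetical mini data set with column names borrowed from the penguin data.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap"],
        "bill_length_mm": [38.0, 40.0, 48.0, 50.0],
        "body_mass_g": [3700.0, 3800.0, 3700.0, 3900.0],
    }
)

# Mean and standard deviation of each numeric variable, per species.
summary = toy.groupby("species").agg(["mean", "std"])
print(summary)
```

A variable whose group means differ by much more than the within-group standard deviations is a candidate for separating the species.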

In [None]:
# Your code here

#### Your answer to question 6.1(a): Are there any promising variables that could separate the species?
*Double click here*

### 6.1(b)

**Task: Create a training set and a test set to use to classify the penguin species. What is the fraction of Adelie penguins in the original, test, and training data?**

**Hint:** With scikit-learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), splitting the data can be done with
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y
)
```
In the example above, we stratify on the y-values. This **ensures that each split** (training and testing) **contains approximately the same proportion of samples from each class as the original dataset**.
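
The effect of stratification can be checked by computing the class fractions after the split. The sketch below uses hypothetical toy labels (`"A"` and `"B"`) and an assumed `test_size` of 0.3; it is only meant to illustrate the idea, not to prescribe how the penguin data should be split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 samples of class "A" and 30 of class "B".
y_toy = np.array(["A"] * 70 + ["B"] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.3,  # Assumed test fraction (the scikit-learn default is 0.25).
    stratify=y_toy,  # Keep the class proportions in both splits.
    random_state=0,  # Fix the seed so the split is reproducible.
)

# Compare the fraction of class "A" in the original, training, and test labels.
for name, labels in [("original", y_toy), ("train", y_tr), ("test", y_te)]:
    print(f"{name}: fraction of A = {np.mean(labels == 'A'):.2f}")
```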

In [None]:
# Your code here

#### Your answer to question 6.1(b): What is the fraction of Adelie penguins?
*Double click here*

### 6.1(c)

**Task: Create a decision tree classifier to classify the penguin species. Use two levels for the tree and show the confusion matrix for the training and the test set. Is your classifier making any mistakes?**

**Hint:**

1. A decision tree can be created using scikit-learn's [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html):
```python
from sklearn.tree import DecisionTreeClassifier

# Create the tree. The parameter max_depth selects the number of levels in the tree
my_first_tree = DecisionTreeClassifier(max_depth=2)

# Train the tree:
my_first_tree.fit(X_train, y_train)
```

2. To show the confusion matrix:
```python
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(
    my_first_tree,  # The classifier to construct the confusion matrix for.
    X_train,  # The X data.
    y_train,  # The true y data.
    colorbar=True,  # Add a colorbar to show the color scale.
)
```
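
If you want the numbers behind the plot, `confusion_matrix` returns them as an array. The following sketch uses six hypothetical true and predicted labels (not results from the penguin data) to show how the matrix is read:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six penguins.
y_true = ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Chinstrap"]
y_pred = ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Chinstrap", "Adelie"]

labels = ["Adelie", "Chinstrap"]  # Fixes the order of rows and columns.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Rows are the true classes and columns are the predicted classes,
# so the off-diagonal entries count the misclassified samples.
n_mistakes = cm.sum() - cm.trace()
print("Number of mistakes:", n_mistakes)
```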

In [None]:
# Your code here

#### Your answer to question 6.1(c): Is your classifier making any mistakes?
*Double click here*

### 6.1(d)

**Task: Visualise your decision tree and use this to describe how the classification is made.**

**Hint:** The decision tree can be visualised using [plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) or [export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html):

1. Using [plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html):

```python
from sklearn import tree

variables = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
]

tree.plot_tree(
    my_first_tree,  # The tree to plot
    filled=True,  # Add color to the boxes.
    feature_names=variables,  # Get name for variables from the variables list.
    class_names=my_first_tree.classes_,  # Get the name of the different classes from the tree.
)
```

2. Alternative: Using [export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html):

```python
from sklearn.tree import export_graphviz  # To export the tree in the DOT format.
import graphviz  # To turn the tree into a graph; you may need to install this (pip install graphviz).
from IPython.display import display  # To show the graph.

dot_data = export_graphviz(
    my_first_tree,  # The tree to plot.
    out_file=None,  # Do not write to file.
    feature_names=variables,  # Get name for variables from the variables list.
    class_names=my_first_tree.classes_,  # Get the name of the different classes from the tree.
    rounded=True,  # Show the boxes in the tree with rounded corners.
    filled=True,  # Add color to the boxes.
)
display(graphviz.Source(dot_data))  # Show the tree in a notebook.
```
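
A plain-text view of the decision rules can also help when describing how the classification is made. The sketch below uses scikit-learn's `export_text` on a tree fitted to synthetic stand-in data; the data, the feature names `f0` to `f3`, and the variable names are all hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (not the penguin data): 4 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=100,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Hypothetical feature names, used only for this illustration.
demo_names = ["f0", "f1", "f2", "f3"]
text = export_text(demo_tree, feature_names=demo_names)
print(text)  # One line per split or leaf, indented by depth.
```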

In [None]:
# Your code here

#### Your answer to question 6.1(d): Describe the decision-making process of your classifier.
*Double click here*

### 6.1(e)

The figure below compares a decision tree classifier to a k-nearest neighbours classifier (using one neighbour) for the test set.

**Task: Use the figure to compare the two classifiers (the left part shows the confusion matrix of the tree classifier applied to the test set, and the right part shows the k-nearest neighbours classifier applied to the same test set). Which one performs best?**

![Compare classifiers](comparecls.png)

**Note:** Your confusion matrix in [6.1(c)](#6.1(c)) may differ from the one shown here since the splitting into a test and training set is randomized.

#### Your answer to question 6.1(e): Which of the two classifiers performs best?
*Double click here*

## Exercise 6.2

[Schummer *et al.*](https://doi.org/10.1016/S0378-1119(99)00342-X) used microarray technology to analyse the expression of 1536 genes in ovarian cancer and non-cancer tissues. Their primary objective was to identify genes that are differentially expressed in ovarian cancer versus non-cancer tissues, in order to discover genes with diagnostic potential.

The data file [`ovo.csv`](ovo.csv) contains numerical gene expressions (for 1536 genes) for 54 tissue samples. Each column corresponds to a specific gene, named `X.1`, `X.2`, and so on. Each tissue sample has been classified as non-cancer (`N`) or cancer (`C`) tissue, and these labels can be found in the column `class`. The raw data has been preprocessed by centring each gene expression, so no further preprocessing is needed. The data can be loaded as follows:

In [None]:
"""Load the data set."""

import pandas as pd

data_ovo = pd.read_csv("ovo.csv")
classes = data_ovo["class"]  # Classification of samples.
# Turn the class labels into numbers for numeric methods
y_ovo = [1 if i == "C" else 0 for i in classes]
X_ovo = data_ovo.filter(like="X.", axis=1)  # Gene expressions for samples.

### 6.2(a)

**Task: Explore the raw data. Do you find genes that appear to show significant differences in expression between non-cancer and cancer tissue?**

**Hint:** You can, for instance, inspect the raw data by running a principal component analysis.
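
As a rough sketch of what such an analysis could look like, the example below runs a PCA on a random stand-in matrix (not the ovarian data; `X_fake` and the other names are hypothetical) and lists the columns with the largest absolute loading on the first component:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a centred expression matrix: 20 samples, 50 "genes".
rng = np.random.default_rng(0)
X_fake = rng.normal(size=(20, 50))
X_fake -= X_fake.mean(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_fake)  # Sample coordinates (for a scores plot).
loadings = pca.components_.T  # One row per gene, one column per component.

print("Explained variance ratio:", pca.explained_variance_ratio_)

# Genes with the largest absolute loading on PC1 contribute most to it.
top_genes = np.argsort(np.abs(loadings[:, 0]))[::-1][:5]
print("Column indices with the largest |loading| on PC1:", top_genes)
```

With the real data, colouring the scores by the class labels would show whether the leading components separate cancer from non-cancer tissue.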

In [None]:
# Your code here

#### Your answer to question 6.2(a): Did you find any promising genes?
*Double click here*

### 6.2(b)

**Task: In the following tasks, you will develop classifiers to predict whether a tissue sample is cancerous or non-cancerous based on gene expression data. Which error type (false positive or false negative) should be minimised?**

#### Your answer to question 6.2(b): Will you minimise false positives or negatives?
*Double click here*

### 6.2(c)

**Task: Create a decision tree classifier to classify tissue type from the gene expressions. Optimise the tree depth using cross-validation on a training set. Report the optimal maximum depth of the resulting tree.**

With reference to the previous problem:

* If you prioritised minimising false positives, use the `precision` as your optimisation metric.
* If you prioritised minimising false negatives, use the `recall` as your optimisation metric.
* If you opted for a balanced approach, use the `balanced_accuracy` as your optimisation metric.
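
To see how the three metrics relate to the counts in a confusion matrix, here is a small worked example. The labels are hypothetical (1 = cancer, the positive class; 0 = non-cancer) and are not results from the ovarian data:

```python
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = cancer (positive class), 0 = non-cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Counts for these labels: TP = 3, FN = 1, FP = 1, TN = 5.
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)  # TP / (TP + FN) = 3 / 4
# Balanced accuracy is the mean of the recall of each class:
# (3 / 4 + 5 / 6) / 2
bac = balanced_accuracy_score(y_true, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Balanced accuracy: {bac:.3f}")
```

Precision is lowered by false positives and recall by false negatives, which is why the choice of metric follows from the error type you want to avoid.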


**Hint:**

1. The optimisation of the decision tree can be done as follows (assuming that you have already split into the training and test sets):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Set up a grid search:
parameters = {"max_depth": range(1, 10)}
grid = GridSearchCV(
    DecisionTreeClassifier(),
    parameters,
    scoring="recall",  # Swap this with the metric you prefer
)
# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_tree = grid.best_estimator_
print("Best tree:", best_tree)
print("Best score", grid.best_score_)
print("Best parameters", grid.best_params_)
```
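
It can be informative to look at the cross-validated score for every depth, not only the best one. The sketch below does this on synthetic stand-in data; the data, the explicit `StratifiedKFold` splitter, and the names `demo_grid` and `results` are assumptions made for the illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the ovarian data).
X_demo, y_demo = make_classification(n_samples=60, n_features=20, random_state=0)

# An explicit, seeded splitter makes the cross-validation reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": range(1, 6)},
    scoring="recall",
    cv=cv,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter setting.
results = pd.DataFrame(demo_grid.cv_results_)
print(results[["param_max_depth", "mean_test_score", "std_test_score"]])
```

If several depths give scores within one standard deviation of each other, the shallower tree is usually the easier one to interpret.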

#### Your answer to question 6.2(c): What depth did you get for your tree?
*Double click here*

### 6.2(d)

**Task: Create a k-nearest neighbours classifier to classify tissue type from the gene expressions. Optimise the number of neighbours using cross-validation on a training set. Report the optimal number of neighbours.**

**Hint:**

1. The optimisation of the k-nearest neighbours classifier can be done as follows (assuming that you have already split into the training and test sets):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Set up a grid search:
parameters = {"n_neighbors": range(1, 15)}
grid = GridSearchCV(
    KNeighborsClassifier(),
    parameters,
    scoring="recall",  # Swap this with the metric you prefer
)
# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_knn = grid.best_estimator_
print("Best knn:", best_knn)
print("Best score", grid.best_score_)
print("Best parameters", grid.best_params_)
```

In [None]:
# Your code here

#### Your answer to question 6.2(d): What was the optimal number of neighbours?
*Double click here*

### 6.2(e)

**Task: Create a random forest classifier to classify tissue type from the gene expressions. Optimise the number of trees and levels using cross-validation on a training set. Report the optimal number of trees and levels.**

**Hint:**

1. The optimisation of the random forest classifier can be done as follows (assuming that you have already split into the training and test sets):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Set up a grid search:
parameters = {
    "n_estimators": [10, 50, 100, 200, 500],  # the number of trees
    "max_depth": range(1, 11),  # the maximum depth
}
grid = GridSearchCV(
    RandomForestClassifier(),
    parameters,
    scoring="recall",  # Swap this with the metric you prefer
    verbose=2,  # Print out text to show the progress of the fitting
)
# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_forest = grid.best_estimator_
print("Best forest:", best_forest)
print("Best score", grid.best_score_)
print("Best parameters", grid.best_params_)
```

In [None]:
# Your code here

#### Your answer to question 6.2(e): What was the optimal number of estimators and tree depth?
*Double click here*

### 6.2(f)

**Task: Compare the three optimised classifiers you have made by applying them to the test set and obtaining the corresponding confusion matrices. Also compute the [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), and the [balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) for the test set. Which classifier performs best?**



**Hint:** The metrics can be computed as follows:
```python
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import (
    recall_score,
    precision_score,
    balanced_accuracy_score,
)

y_hat = best_tree.predict(X_test)
recall_tree = recall_score(y_test, y_hat)
precision_tree = precision_score(y_test, y_hat)
bac_tree = balanced_accuracy_score(y_test, y_hat)
print(f"Recall: {recall_tree:.3f}")
print(f"Precision: {precision_tree:.3f}")
print(f"Balanced accuracy: {bac_tree:.3f}")

ConfusionMatrixDisplay.from_estimator(
    best_tree,
    X_test,
    y_test,
    colorbar=True,
)
```
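
To compare several classifiers side by side, the metrics can be collected in a table. The following is a sketch on synthetic stand-in data with two hypothetical models; in the exercise you would loop over your own optimised classifiers and your own test set instead:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the ovarian data).
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Hypothetical fitted models standing in for the optimised classifiers.
models = {
    "tree": DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr),
    "knn": KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr),
}

rows = []
for name, model in models.items():
    y_hat = model.predict(X_te)
    rows.append(
        {
            "classifier": name,
            "recall": recall_score(y_te, y_hat),
            "precision": precision_score(y_te, y_hat, zero_division=0),
            "balanced_accuracy": balanced_accuracy_score(y_te, y_hat),
        }
    )

summary = pd.DataFrame(rows).set_index("classifier")
print(summary.round(3))
```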

In [None]:
# Your code here

#### Your answer to question 6.2(f): Which classifier performs best?
*Double click here*

## Your feedback for Exercise 6

1. **Time & Difficulty:**
* Length (1=too short, 5=too long): 1  2  3  4  5
* Difficulty (1=too easy, 5=too difficult): 1  2  3  4  5
* Most challenging part: ________________________

2. **Code Examples:**
* More or less example code?  More  Less  About Right
* Areas where more examples would be helpful: ________________________

3. **Errors/Inconsistencies:** Did you encounter any?  Yes  No  If yes, please describe: ________________________
    
4. **Suggestions:** How could this exercise be improved? ________________________