# Exercise set 10


>The goal of this exercise is to gain familiarity with some
classification methods and the different ways we can assess and compare them.

## Exercise 10.1


In this exercise, we will consider the
[UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://goo.gl/U2Uwz2).

This data set contains 569 tumours classified
as malignant or benign. In addition, 30 variables have been
measured, and our goal is to make a predictive model that
can classify new tumours as malignant or benign.

An overview of the different variables can be found
on the 
[scikit-learn website](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset).
In the following, we are going to label the two classes as:

* `benign` as a negative ($-1$), and
* `malignant` as a positive ($+1$). 


In the lectures, we mentioned categorical variables and that we might have to
transform these to use them in practice.
[Dummy variables](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) and
[one-hot encoding](https://en.wikipedia.org/wiki/One-hot) are examples of such transformations.
In scikit-learn, we normally do not have to worry about this for the y-values we use in classification.
For instance, the
[scikit-learn documentation for decision trees](https://scikit-learn.org/stable/modules/tree.html#classification)
says that a decision tree 
> is capable of both binary (where the labels are $[-1, 1]$) classification and multiclass (where the labels are 
$[0, \ldots, K-1]$) classification

so we use the values $-1$ and $+1$ to represent the two classes here. 
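
To make the transformations mentioned above concrete, here is a minimal sketch of dummy variables and one-hot encoding with `pandas.get_dummies`. The small data frame and its column name `tissue` are made up for illustration and are not part of the breast cancer data set.

```python
import pandas as pd

# Hypothetical categorical variable (illustration only):
df = pd.DataFrame({"tissue": ["breast", "ovary", "breast", "lung"]})

# One-hot encoding: one column per category, exactly one 1 per row.
one_hot = pd.get_dummies(df["tissue"], dtype=int)
print(one_hot)

# Dummy variables: dropping the first category gives k - 1 columns
# for k categories (the dropped category is the all-zero row).
dummies = pd.get_dummies(df["tissue"], dtype=int, drop_first=True)
print(dummies)
```

For the class labels in this exercise, no such encoding is needed: the two values $-1$ and $+1$ are passed directly to the classifier.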

### 10.1(a) 

Begin by loading the raw data and creating
a test set from $33\%$ of the available data points.
The example code below can be used to load the data set
and create training/test sets:

In [None]:
import black
import jupyter_black

jupyter_black.load(
    lab=False,
    line_length=79,
    verbosity="DEBUG",
    target_version=black.TargetVersion.PY310,
)

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data["data"]
# "Rename" y so that -1 = benign and 1 = malignant:
y = [-1 if i == 1 else 1 for i in data["target"]]
class_names = ["benign", "malignant"]
print("Classes:")
print(class_names)

print("Variables:", data["feature_names"])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    # stratify=y, # Uncomment if you are using stratification
)

For creating the training/test sets we use the method
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
from the module [sklearn.model_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection). One of the input parameters to `train_test_split` is
`stratify`:


* (i) Reading the documentation for
  [stratification](https://scikit-learn.org/stable/modules/cross_validation.html#stratification)
  (or the [Wikipedia entry on stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling))
  can you explain what `stratify` does?


* (ii)  Should we use `stratify` here?

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
import numpy as np


sns.set_context("notebook")

data = load_breast_cancer()
X = data["data"]
X = scale(X)
# "Rename" y so that -1 = benign and 1 = malignant:
y = np.array([-1 if i else 1 for i in data["target"]])
class_names = ["benign", "malignant"]
print("Classes:")
print(class_names)

print("Variables:")
print(data["feature_names"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print(f"\nItems in training set: {len(y_train)}")
print(f"Items in test set: {len(y_test)}")
# Malignant tumours are labelled +1 (benign tumours are labelled -1):
print(f"fraction malignant total: {np.mean(y == 1)}")
print(f"fraction malignant train: {np.mean(y_train == 1)}")
print(f"fraction malignant test: {np.mean(y_test == 1)}")

#### Your answer to question 10.1(a):

(i) and (ii):

Stratification ensures that the distribution of the different classes is the same in the test and training set as in the original data set. This is important if the classes in the data set are unevenly distributed. In this case, the numbers of benign (357) and malignant (212) tumours are not too different, but to be safe, we tell `train_test_split` to stratify according to the y-values (the class information).
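
The effect of stratification is easier to see on a more imbalanced data set. The sketch below uses made-up data (20 positives among 200 points; the names `X_demo` and `y_demo` are hypothetical) and compares the fraction of positives in the test set with and without `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration: 20 positives, 180 negatives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.array([1] * 20 + [-1] * 180)

fractions = {}
for name, stratify in [("without", None), ("with", y_demo)]:
    _, _, _, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=0, stratify=stratify
    )
    # Fraction of the test set that belongs to the positive class:
    fractions[name] = np.mean(y_te == 1)
    print(f"Positive fraction in test set ({name} stratify):", fractions[name])
```

With stratification, the test set reproduces the overall positive fraction (10% here); without it, the fraction depends on the random split.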

### 10.1(b)
In this case, we have to determine what
metric we are going to use
to judge the performance of the classifiers we make. Before selecting
a metric, we should consider what false positives and false negatives
mean for our current problem: How would you define these two terms in our present
case, and would you say that false positives are a more serious mistake than false negatives here?

#### Your answer to question 10.1(b):

Here we can make the following interpretations:


- False positive: This is a tumour classified as malignant, but it is, in fact, benign.
  The seriousness of mistakes like this depends on what we do with samples we believe to be malignant.
  If we, for instance, start treatment based on these results, we would, in this case, treat something
  that we did not need to treat. Depending on the treatment, this can be a grave mistake to make.


- False negative: This is a tumour classified as benign, but it is malignant! Here, we are
  missing potentially dangerous tumours (we think they are benign!). This is a grave
  mistake since it could delay the start of treatment.


Here, I will assume that the outcome of the classifier is used to screen for malignant tumours. A positive outcome is followed up by further tests to determine if the tumour is indeed malignant. In such a case, we would rather have someone tested extra due to a false positive than delay diagnosis due to a false negative. I will thus focus on minimizing the false negatives.

### 10.1(c)
Following up on the previous question, here
are some [metrics](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) we could use to assess the performance of a
classifier model we make:


* **Precision**: The ratio of true positives to the sum of
  true positives and false positives.


* **Recall**: The ratio of true positives to the sum of true positives and false negatives.


* **F1**: The (harmonic) mean of the precision and recall.


In addition, we can summarize the performance using the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).
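
As a small illustration of how these definitions connect to the confusion matrix, the sketch below computes the three metrics by hand from the four counts and compares them with the scikit-learn functions. The labels and predictions are made up.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels and predictions (-1 = negative, +1 = positive):
y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, 1, -1, 1, -1, -1, -1, -1, -1])

# With labels=[-1, 1] the matrix is [[TN, FP], [FN, TP]]:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("precision:", precision, "vs", precision_score(y_true, y_pred))
print("recall:", recall, "vs", recall_score(y_true, y_pred))
print("f1:", f1, "vs", f1_score(y_true, y_pred))
```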


(Note: There are many [other possibilities](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
as well! If you are curious, you can, for instance, include the
*accuracy* (the ratio of correct predictions
to the number of total predictions).)


The choice of the metric for assessing a classifier will lead to different results.
For instance, if we choose to use precision as our metric, we will maximize it
during the optimization of our model. This means that we will *minimize* the
number of *false positives*. If we choose to use the recall, on the other hand,
we will *minimize* the number of *false negatives*.


In the following, we will calculate all these metrics for the
different classification methods we consider. At the end of the
exercise, you will be asked to compare the different classifiers
using them. But before we do that: 
Which of the
metrics mentioned above is most important for
our classification task?

(Note: There is no single correct answer here: it depends on how you judge the seriousness of false positives vs false negatives.)

#### Your answer to question 10.1(c):

As stated in the previous question, I focus on minimising the number of false negatives. I will pick the **recall** since it will tell us how many malignant breast cancer tumours we correctly identified, and optimising it will minimise the number of false negatives.

**Note:** We can always make the number of false negatives go to zero by just classifying everything as positive. However, this would make the classification pretty useless (we do not even need a classifier to say that everything is positive!). Therefore, we will monitor the other scores (precision and F1) to ensure that the number of false positives is manageable.
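
The point in the note can be checked with a few lines of code. This is a sketch with made-up labels: a "classifier" that predicts positive for every sample gets a perfect recall, while its precision only equals the fraction of positives in the data.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 3 positives (+1) and 7 negatives (-1):
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])

# A "classifier" that labels everything as positive:
y_all_positive = np.ones_like(y_true)

recall_all = recall_score(y_true, y_all_positive)
precision_all = precision_score(y_true, y_all_positive)
print("recall:", recall_all)  # no false negatives
print("precision:", precision_all)  # 3 true positives out of 10 predictions
```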

### 10.1(d)
Create a [$k$-nearest neighbour classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
 with 3 neighbours and
fit it using your training set. Evaluate (with the test set) the classifier using the
precision, recall, and F1 metrics, and plot the confusion matrix.

An
example of how this can be done is:

In [None]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import ConfusionMatrixDisplay


# Make a table to store all results
table = {
    "classifier": [],
    "recall": [],
    "precision": [],
    "f1": [],
}


def add_results_to_table(table, name, recall, precision, f1):
    """Add results to the given table."""
    table["classifier"].append(name)
    table["recall"].append(recall)
    table["precision"].append(precision)
    table["f1"].append(f1)


# Create a classifier:
knn3 = KNeighborsClassifier(n_neighbors=3)

# Fit the classifier on the training set:
knn3.fit(X_train, y_train)

# Use classifier for prediction for the test set:
y_hat = knn3.predict(X_test)

# Calculate the precision etc. for the test set:
precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print(f"precision = {precision}")
print(f"recall = {recall}")
print(f"f1 = {f1}")

add_results_to_table(table, "knn_with_3_neighbours", recall, precision, f1)

# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    knn3,
    X_test,
    y_test,
    display_labels=["Benign", "Malignant"],
    ax=ax,  # Use the figure we created above
)

How many false positives and false negatives do you get?

#### Your answer to question 10.1(d):
Here (see the confusion matrix above) we got 0 false positives and 9 false negatives. The test set contains 61 + 9 = 70 malignant tumours, and the classifier correctly identified 61 of them. Still, it is not performing great: it misses about 12.9% (9 out of 70) of the positive (malignant) tumours.
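
As a quick check of the arithmetic, the metrics can be recomputed from the counts quoted in this answer (61 true positives, 0 false positives, 9 false negatives). These numbers come from one particular train/test split and will differ for other splits.

```python
# Counts read off the confusion matrix discussed in the answer:
tp, fp, fn = 61, 0, 9

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
missed = fn / (tp + fn)

print(f"precision = {precision:.3f}")  # 1.000
print(f"recall = {recall:.3f}")  # 0.871
print(f"f1 = {f1:.3f}")  # 0.931
print(f"fraction of malignant tumours missed = {missed:.3f}")  # 0.129
```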

### 10.1(e)
We will now try to optimize the $k$ for a $k$-nearest neighbour classifier.
This can be done using the method [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

One of the inputs to this method is the `scoring` parameter, which
selects the metric to use for finding the best $k$. Use the metric
you deemed most important in question [10.1(c)](#10.1(c)) 
and use $k$-values in the range $1 \leq k \leq 10$ in your search for the best $k$.

An example
of how this can be done is:

In [None]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import ConfusionMatrixDisplay


# Set up a grid search:
parameters = {"n_neighbors": range(1, 11)}
grid = GridSearchCV(
    KNeighborsClassifier(),
    parameters,
    scoring="accuracy",  # Select scoring here!
)

# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_knn = grid.best_estimator_
print("Best knn:", best_knn)

# Use the best classifier for the test set:
y_hat = best_knn.predict(X_test)

# Calculate the precision etc. for the test set:
precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print(f"precision = {precision}")
print(f"recall = {recall}")
print(f"f1 = {f1}")


# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    best_knn,
    X_test,
    y_test,
    display_labels=["Name of class one", "Name of class two"],
    ax=ax,  # Use the figure we created above
)

Evaluate the optimised classifier using
the metrics mentioned above (with the test set) and plot the confusion matrix.
What value for $k$ did you find? And did the number of false
positives and false negatives change compared to the non-optimised $k$-nearest neighbour classifier?

In [None]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import ConfusionMatrixDisplay


# Set up a grid search:
parameters = {"n_neighbors": range(1, 11)}
grid = GridSearchCV(
    KNeighborsClassifier(),
    parameters,
    scoring="recall",
    cv=5,
    return_train_score=True,
)

# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_knn = grid.best_estimator_
print("Best knn:", best_knn)

In [None]:
print(f"Best recall score (cross-validation): {grid.best_score_}")
print(f"Best k: {grid.best_params_['n_neighbors']}")
y_hat = best_knn.predict(X_test)

recall = recall_score(y_test, y_hat)
precision = precision_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print("Best recall score (test):", recall)
print("Best precision score (test):", precision)
print("Best F1 score (test):", f1)


add_results_to_table(table, "knn_optimized", recall, precision, f1)

In [None]:
# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    best_knn,
    X_test,
    y_test,
    display_labels=["Benign", "Malignant"],
    ax=ax,  # Use the figure we created above
)

#### Your answer to question 10.1(e):

We had already guessed the optimal value of $k$ (3), so the numbers of false positives and false negatives did not change.
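
It can also be useful to look at the cross-validated recall for every $k$, not only the best one, since several values may score almost equally well. The sketch below is self-contained and repeats the data preparation used in this solution (the `_demo` names are chosen here to avoid overwriting the variables above); the scores are read from the `cv_results_` attribute of `GridSearchCV`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale

data_demo = load_breast_cancer()
X_demo = scale(data_demo["data"])
# -1 = benign and +1 = malignant, as in the rest of the exercise:
y_demo = [-1 if i == 1 else 1 for i in data_demo["target"]]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

grid_demo = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(1, 11)},
    scoring="recall",
    cv=5,
)
grid_demo.fit(X_tr, y_tr)

# One row per k: mean and standard deviation of the recall over the folds.
columns = [
    "param_n_neighbors",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = pd.DataFrame(grid_demo.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```

If the differences in `mean_test_score` are small compared to `std_test_score`, the choice of $k$ is not well determined by this search.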

### 10.1(f)
Create a [decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
and fit it using your training set. Limit the tree to $3$ levels by setting
the parameter `max_depth=3`.
Evaluate the classifier using the
metrics mentioned above (with the test set) and plot the confusion matrix.

Note, the example below question [10.1(h)](#10.1(h)) shows how you can 
create the decision tree and optimise it. The example will plot the decision tree.
You can use this code as
inspiration for solving [10.1(f)](#10.1(f)) and the following two questions (maybe you
prefer to do them all at once).

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree3 = DecisionTreeClassifier(max_depth=3)
tree3.fit(X_train, y_train)
y_hat = tree3.predict(X_test)


recall = recall_score(y_test, y_hat)
precision = precision_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)

print("Tree recall score (test):", recall)
print("Tree precision score (test):", precision)
print("Tree F1 score (test):", f1)

add_results_to_table(table, "tree_depth_3", recall, precision, f1)


# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    tree3,
    X_test,
    y_test,
    display_labels=["Benign", "Malignant"],
    ax=ax,  # Use the figure we created above
)

#### Your answer to question 10.1(f):
It performs worse than the k-nearest neighbours classifier!

### 10.1(g)
We will also
try to tune the `DecisionTreeClassifier`
by determining the maximum depth
we should use for the tree. Use the method
`GridSearchCV` to optimize the parameter
`max_depth` for the `DecisionTreeClassifier`.
Use the metric you deemed most important
in question [10.1(c)](#10.1(c)). Limit the depth to the range `max_depth = range(1, 21)`, but also
include a depth where you set `max_depth = None` (this lets the
tree expand as far down as possible).

Evaluate the classifier with the best `max_depth` using the
metrics mentioned above (with the test set) and plot the confusion matrix.

What is the best `max_depth` you find in this case?

In [None]:
classifier = DecisionTreeClassifier(random_state=123)
parameters = [{"max_depth": list(range(1, 21)) + [None]}]
grid = GridSearchCV(
    classifier,
    param_grid=parameters,
    scoring="recall",
    cv=5,
    return_train_score=True,
)
grid.fit(X_train, y_train)
print("Best recall score (cross-validation):", grid.best_score_)
print("Best depth:", grid.best_params_["max_depth"])
best = grid.best_estimator_
y_hat = best.predict(X_test)

recall = recall_score(y_test, y_hat)
precision = precision_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)

print("Best recall score (test):", recall)
print("Best precision score (test):", precision)
print("Best F1 score (test):", f1)

add_results_to_table(table, "tree_optimized", recall, precision, f1)


# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    best,
    X_test,
    y_test,
    display_labels=["Benign", "Malignant"],
    ax=ax,  # Use the figure we created above
)

#### Your answer to question 10.1(g):
Six seems to be the optimal depth of the tree. There is some randomness in the algorithm that builds the decision tree. Above, the random state is set to a fixed number so that we get the same decision tree each time we run the code. If we remove this, the best depth will change somewhat from run to run. This indicates that the recall scores may not differ much between depths. Let us inspect how the recall changes as a function of the depth:

In [None]:
score_mean = grid.cv_results_["mean_test_score"]
score_std = grid.cv_results_["std_test_score"]
depths = [i["max_depth"] for i in grid.cv_results_["params"]]
# Deal with the "None" for plotting: place it one step past the largest depth
maxdepth = max(i for i in depths if i is not None)
none_position = maxdepth + 1  # Here 21, which now means max_depth = None
depths = [i if i is not None else none_position for i in depths]

fig, ax = plt.subplots(constrained_layout=True)
ax.errorbar(depths, score_mean, yerr=score_std, marker="o")
ax.set_xticks(depths)
labels = [i if i != none_position else "None" for i in depths]
ax.set_xticklabels(labels)
ax.set_xlabel("max_depth")
ax.set_ylabel("Recall (mean over cross-validation folds)")
sns.despine(fig=fig)

The figure above shows that the recall score varies a lot; here, it indicates that the tree might have difficulties "generalizing" the data.

### 10.1(h)
Visualise the best decision tree you found. This
can be done using the
method [export_graphviz from sklearn.tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html),
or the method [plot_tree from sklearn.tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)
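
If Graphviz is not installed, `plot_tree` only needs matplotlib. The following is a minimal, self-contained sketch: it fits a small tree (depth 2, chosen here just to keep the figure readable) on the full data set rather than on the training set used above, so the tree will not be identical to the optimised one.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree

data_demo = load_breast_cancer()
# -1 = benign and +1 = malignant, as in the rest of the exercise:
y_demo = [-1 if i == 1 else 1 for i in data_demo["target"]]

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(data_demo["data"], y_demo)

fig_demo, ax_demo = plt.subplots(figsize=(8, 4), constrained_layout=True)
annotations = plot_tree(
    tree_demo,
    feature_names=list(data_demo["feature_names"]),
    class_names=["benign", "malignant"],  # Same order as the labels -1, +1
    filled=True,
    rounded=True,
    ax=ax_demo,
)
```

To draw the optimised tree instead, pass the fitted estimator from the grid search in place of `tree_demo`.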


In [None]:
import graphviz
from sklearn.tree import export_graphviz
from IPython.display import display

dot_data = export_graphviz(
    best,
    out_file=None,
    feature_names=data["feature_names"],
    class_names=["benign", "malignant"],
    rounded=True,
    filled=True,
)
display(graphviz.Source(dot_data))

#### Your answer to question 10.1(h):
(See the tree above.)

### 10.1(i)
Compare the precision, recall, and F1 scores for the classifiers you have considered.
If you were to select one
classifier to put into real-life use, which one would you choose and why?

In [None]:
import pandas as pd

pd.DataFrame(table)

In [None]:
idx = np.argmax(table["recall"])
best_recall = table["recall"][idx]
best_name = table["classifier"][idx]
print(f"Best classifier was {best_name} with a recall = {best_recall}")

#### Your answer to question 10.1(i):
Of the classifiers we have considered here, k-nearest neighbours with three neighbours seems to perform best (it has better scores for precision, recall, and F1). We would therefore use this one, but we are not completely satisfied (see the question below)!

### 10.1(j)
Extra task for the curious student: Create an alternative classifier, for instance,
using a so-called [support vector machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). We will not go into the details about how
this classifier works in our lectures, but with `scikit-learn` it is rather easy
to try it and see what it can do for us.

In the scikit-learn documentation, there is also [an example that compares several classifiers](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html). Maybe you can find one
that outperforms those we have considered in this exercise?

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from catboost import CatBoostClassifier

names = [
    "Naive bayes",
    "LDA",
    "Support vector machine",
    "CatBoost",
]

classifiers = [
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    SVC(),
    CatBoostClassifier(verbose=0),
]
for clf, name in zip(classifiers, names):
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    recall = recall_score(y_test, y_hat)
    precision = precision_score(y_test, y_hat)
    f1 = f1_score(y_test, y_hat)

    add_results_to_table(table, name, recall, precision, f1)
    print(f"\nResults for classifier: {name}")
    print("Recall score (test):", recall)
    print("Precision score (test):", precision)
    print("F1 score (test):", f1)
    if recall > best_recall:
        print(f"** Recall score is better than {best_name}! **")

In [None]:
table2 = pd.DataFrame(table)
table2.sort_values(by="recall", ascending=False)

#### Your answer to question 10.1(j):
Here, we see that some of the classifiers we tried are performing better. We could now
expand our study and try to find optimized parameters for them to see if we can do even better!
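
As one possible next step, the sketch below tunes a support vector machine with `GridSearchCV`, again using recall as the scoring metric. The parameter grid (the values for `C` and `gamma`) is an arbitrary choice made here for illustration, not a recommendation, and the scaling is placed inside a pipeline so that it is fitted on the training folds only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data_demo = load_breast_cancer()
# -1 = benign and +1 = malignant, as in the rest of the exercise:
y_demo = [-1 if i == 1 else 1 for i in data_demo["target"]]
X_tr, X_te, y_tr, y_te = train_test_split(
    data_demo["data"],
    y_demo,
    test_size=0.33,
    random_state=42,
    stratify=y_demo,
)

# make_pipeline names the steps "standardscaler" and "svc":
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100],  # Assumed values, for illustration
    "svc__gamma": ["scale", 0.01, 0.1],  # Assumed values, for illustration
}
grid_svc = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
grid_svc.fit(X_tr, y_tr)

y_hat_svc = grid_svc.predict(X_te)
recall_svc = recall_score(y_te, y_hat_svc)
precision_svc = precision_score(y_te, y_hat_svc)
print("Best parameters:", grid_svc.best_params_)
print("Recall score (test):", recall_svc)
print("Precision score (test):", precision_svc)
```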

## Exercise 10.2

Consider again the data set for ovarian cancer and the measured gene expressions (see exercise 9).
Create a decision tree classifier for this data set. Limit the depth of the decision
tree to 2, and visualise the decision tree. How do the "rules" the decision tree uses
for its classification compare to what you found from the PCA analysis?
Does it consider the same genes?

Note: There is some "randomness" in decision trees, so
the tree you now
create will likely
use different genes from the ones you
found in exercise 9. You can rerun your code a few times to see how the randomness influences
things or you can also change the depth of the tree to see if it picks out
more genes.

In [None]:
# Load the data:
data_ovo = pd.read_csv("Data/ovo.csv")
classes_ovo = data_ovo["class"]
X_ovo = data_ovo.filter(like="X.", axis=1).values
X_ovo = scale(X_ovo, with_std=False)

feature_names_ovo = ["gene_{}".format(i) for i in range(X_ovo.shape[1])]

y_ovo = [1 if i == "C" else 0 for i in classes_ovo]
# Create a test and training set:
X_train_ovo, X_test_ovo, y_train_ovo, y_test_ovo = train_test_split(
    X_ovo,
    y_ovo,
    test_size=0.2,
    stratify=y_ovo,
)
tree_ovo = DecisionTreeClassifier(max_depth=2, random_state=1)
tree_ovo.fit(X_train_ovo, y_train_ovo)
dot_data_ovo = export_graphviz(
    tree_ovo,
    out_file=None,
    feature_names=feature_names_ovo,
    class_names=["N", "C"],
    rounded=True,
    filled=True,
)
display(graphviz.Source(dot_data_ovo))

In [None]:
# Score this classifier:
y_hat_ovo = tree_ovo.predict(X_test_ovo)
print("Recall score (test):", recall_score(y_test_ovo, y_hat_ovo))
print("Precision score (test):", precision_score(y_test_ovo, y_hat_ovo))
print("F1 score (test):", f1_score(y_test_ovo, y_hat_ovo))

#### Your answer to question 10.2:
In this case, what genes you get will vary due to the randomness in the algorithm making the decision tree classifier. However, it will typically pick out some of the genes we found in the previous exercise. Here, we could try many trees and see the most frequently picked genes. Let us use a random forest classifier and plot the importance of the different genes for making the classification:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_tree = RandomForestClassifier(
    max_depth=2, n_estimators=500, random_state=42
)  # max depth is 2, and we use 500(!) trees.
rnd_tree.fit(X_train_ovo, y_train_ovo)
# Plot the importance of variables (genes):
importance = rnd_tree.feature_importances_
idx = np.argsort(importance)
y_pos = []
y_label = []
figi, axi = plt.subplots(constrained_layout=True)

colors = sns.color_palette("husl", 20)

for i, idxi in enumerate(
    idx[-20:]
):  # 20 is here to plot the 20 most important ones:
    y_pos.append(i)
    y_label.append(feature_names_ovo[idxi])
    axi.barh(i, importance[idxi], align="center", color=colors[i])
axi.set_yticks(y_pos)
axi.set_yticklabels(y_label)
axi.set_xlabel("Variable importance for a random forest classifier")
sns.despine(fig=figi)

We see that this analysis picks out many of the genes we found using PCA, for instance, gene 92 or 1490.
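
The idea of fitting many single trees and counting the most frequently picked genes, mentioned in the answer above, can also be sketched directly. Since the ovarian cancer file is not bundled with scikit-learn, this sketch uses synthetic data from `make_classification` as a stand-in (50 made-up "genes", 3 of them informative), and it sets `max_features="sqrt"` so that the trees actually differ between random seeds; both are assumptions made for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the gene expression data (illustration only):
X_syn, y_syn = make_classification(
    n_samples=100,
    n_features=50,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

n_trees = 200
counts = Counter()
for seed in range(n_trees):
    tree = DecisionTreeClassifier(
        max_depth=2, max_features="sqrt", random_state=seed
    )
    tree.fit(X_syn, y_syn)
    # tree_.feature holds the feature index of each node; leaves are
    # marked with a negative value, so we keep indices >= 0 only.
    used = {int(f) for f in tree.tree_.feature if f >= 0}
    counts.update(used)

# Features (genes) used by the largest number of trees:
print(counts.most_common(5))
```

A random forest does something similar internally, but its `feature_importances_` also weigh how much each split improves the classification.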