# Exercise set 10


>The goal of this exercise is to gain familiarity with some
>classification methods and the different ways we can assess and compare them.

## Exercise 10.1


In this exercise, we will consider the
[UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://goo.gl/U2Uwz2).

This data set contains 569 tumors that have been classified
as malignant or benign. In addition, 30 variables have been
measured for each tumor, and our goal is to make a predictive model that
can classify new tumors as being malignant or benign.
An overview of the different variables can be found
on the
[sklearn website](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset).

In the following, we are going to label the two classes
as:


* "benign" as a negative ($-1$), and


* "malignant" as a positive ($+1$).


In the lectures, we mentioned categorical variables and that we might have to
transform these to use them in practice
([dummy variables](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) and
[one-hot encoding](https://en.wikipedia.org/wiki/One-hot)
are examples of such transformations).
In sklearn, we normally do not have to worry about this for the y-values we use in classification.
For instance, the [sklearn documentation for decision trees](https://scikit-learn.org/stable/modules/tree.html#classification) says that a decision tree

> is capable of both binary
> (where the labels are $[-1, 1]$) classification and multiclass (where the labels are
> $[0, \ldots, K-1]$)
> classification

so we use the values $-1$ and $+1$ to represent the two classes here.
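
To illustrate the kind of transformation mentioned above, here is a minimal sketch of one-hot encoding with `pandas.get_dummies`. The column name `tumor_type` and its values are made up for this illustration and are not part of the breast cancer data set:

```python
import pandas as pd

# A made-up categorical variable (hypothetical, for illustration only):
df = pd.DataFrame({"tumor_type": ["benign", "malignant", "benign"]})

# One column per category; each row has exactly one "hot" entry:
dummies = pd.get_dummies(df["tumor_type"])
print(dummies)

# For the y-values in sklearn, we can instead just relabel the classes:
y = [1 if label == "malignant" else -1 for label in df["tumor_type"]]
print(y)
```
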

### 10.1(a) 

Begin by loading the raw data and creating
a test set containing 33% of the available data points.
The example code below can be used to load the data set
and create training/test sets:

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data["data"]
# In the raw data, 0 = malignant and 1 = benign. "Rename" y so that
# -1 = benign and 1 = malignant:
y = [-1 if i == 1 else 1 for i in data["target"]]
class_names = ["benign", "malignant"]
print("Classes:")
print(class_names)

print("Variables:", data["feature_names"])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    # stratify=y, # Uncomment if you are using stratification
)

For creating the training/test sets, we use the function
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
from the module [sklearn.model_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection). One of the input parameters to `train_test_split` is
`stratify`:


* (i) After reading the documentation on
  [stratification](https://scikit-learn.org/stable/modules/cross_validation.html#stratification),
  can you explain what `stratify` does?


* (ii)  Should we use `stratify` here?
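
One way to explore these questions empirically is to split some imbalanced labels with and without `stratify` and compare the class fractions. The sketch below uses made-up labels (90 negatives and 10 positives) and a fixed `random_state`, both of which are assumptions for the illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 90 samples of class -1 and 10 of class +1.
y = np.array([-1] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # Dummy variable, only used for splitting

# Split with stratification on y:
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
# Split without stratification:
_, _, y_train_n, y_test_n = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print("Fraction of +1 in all data:", np.mean(y == 1))
print("Fraction of +1 in test set (stratified):", np.mean(y_test_s == 1))
print("Fraction of +1 in test set (not stratified):", np.mean(y_test_n == 1))
```

Try changing `random_state` and see which of the two test-set fractions varies from split to split.
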

In [None]:
# Your code here

#### Your answer to question 10.1(a):
*Double click here*

### 10.1(b)
In this case, we have to determine what
metric we are going to use
to judge the performance of the classifiers we make. Before selecting
a metric, we should consider what false positives and false negatives
mean for our current problem: How would you define these two terms in our present
case, and would you say that false positives are more serious than
false negatives here?

In [None]:
# Your code here

#### Your answer to question 10.1(b):
*Double click here*

### 10.1(c)
Following up on the previous question, here
are some [possible metrics](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) we could use to assess the performance of a
classifier model we make:


* Precision: The ratio of true positives to the sum of
  true positives and false positives.


* Recall: The ratio of true positives to the sum of true positives and false negatives.


* F1: The (harmonic) mean of the precision and recall.


In addition, we can summarize the performance using the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).


(Note: There are many [other possibilities](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
as well! If you are curious, you can for instance include the
*accuracy*, which is the ratio of correct predictions
to the total number of predictions.)
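
As a small worked example of the definitions above, the sketch below counts true positives, false positives, and false negatives by hand for some made-up labels and compares the resulting precision, recall, and F1 with the values from `sklearn.metrics`. The labels in `y_true` and `y_pred` are invented for the illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels (+1 = positive, -1 = negative), for illustration only:
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]

# Count the outcomes by hand:
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # Harmonic mean
print(f"By hand: precision = {precision}, recall = {recall}, f1 = {f1}")

# The same metrics from sklearn (the positive class is +1 by default):
print("sklearn: precision =", precision_score(y_true, y_pred))
print("sklearn: recall =", recall_score(y_true, y_pred))
print("sklearn: f1 =", f1_score(y_true, y_pred))
```
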


The choice of metric affects which model we end up with.
For instance, if we choose to use precision as our metric, we will maximize it
during the optimization of our model. This means that we favor models with few
*false positives* (relative to the number of true positives). If we choose to use the recall, on the other hand,
we favor models with few *false negatives*.


In the following, we will calculate these metrics for the
different classification methods we consider. At the end of the
exercise, you will be asked to compare the different classifiers
using them. But before we do that: 
Which of the
aforementioned metrics would you say is most important for
the classification task we have here?

(Note: There is no single correct answer here, and it
really depends on how *you* judge the seriousness of false positives vs.
false negatives.)

In [None]:
# Your code here

#### Your answer to question 10.1(c):
*Double click here*

### 10.1(d)
Create a [$k$-nearest neighbor classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
 with 3 neighbors and
fit it using your training set. Evaluate (with the test set) the classifier using the
precision, recall, and F1 metrics, and plot the confusion matrix.
An
example of how this can be done is (for some generated data):

In [None]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import ConfusionMatrixDisplay


# Just generate some synthetic classification data for this example:
X, y = make_classification(n_samples=1000)
# Create test/training sets:
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create a classifier:
knn3 = KNeighborsClassifier(n_neighbors=3)

# Fit the classifier:
knn3.fit(X_train, y_train)

# Use classifier for prediction for the test set:
y_hat = knn3.predict(X_test)

# Calculate the precision etc. for the test set:
precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print(f"precision = {precision}")
print(f"recall = {recall}")
print(f"f1 = {f1}")

# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    knn3,
    X_test,
    y_test,
    display_labels=["Name of class one", "Name of class two"],
    ax=ax,  # Use the figure we created above
)

How many false positives and false negatives do you get?
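
If you prefer to read the counts as numbers rather than from the plot, a minimal sketch using `confusion_matrix` from `sklearn.metrics` is given below. The labels in `y_true` and `y_pred` are made up for the illustration; for the exercise you would use your own test labels and predictions instead:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (-1 = negative, +1 = positive):
y_true = [-1, -1, -1, 1, 1, 1, 1, -1]
y_pred = [-1, 1, -1, 1, -1, 1, 1, -1]

# Rows are the true classes and columns are the predicted classes,
# both in the order given by labels:
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
print(cm)

# For two classes ordered as (negative, positive), the flattened
# matrix is: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()
print(f"False positives: {fp}, false negatives: {fn}")
```
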

In [None]:
# Your code here

#### Your answer to question 10.1(d):
*Double click here*

### 10.1(e)
We will now try to optimize the $k$ for a $k$-nearest neighbor classifier.
This can be done using the class [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

One of the inputs to this method is the `scoring` parameter, which
selects the metric to use for finding the best $k$. Here, use the metric
you deemed most important in question [10.1(c)](#10.1(c)). An example
of how this can be done is:

In [None]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import ConfusionMatrixDisplay


# Just generate some synthetic classification data for this example:
X, y = make_classification(n_samples=1000)

# Create test/training sets:
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Set up a grid search:
parameters = {"n_neighbors": range(1, 11)}
grid = GridSearchCV(
    KNeighborsClassifier(),
    parameters,
    scoring="accuracy",  # Select scoring here!
)

# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_knn = grid.best_estimator_
print("Best knn:", best_knn)

# Use the best classifier for the test set:
y_hat = best_knn.predict(X_test)

# Calculate the precision etc. for the test set:
precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print(f"precision = {precision}")
print(f"recall = {recall}")
print(f"f1 = {f1}")

# Make confusion matrix:
fig, ax = plt.subplots(constrained_layout=True)
ConfusionMatrixDisplay.from_estimator(
    best_knn,
    X_test,
    y_test,
    display_labels=["Name of class one", "Name of class two"],
    ax=ax,  # Use the figure we created above
)

When using `GridSearchCV`, consider $k$-values in the range $1 \leq k \leq 10$
for your search for the best $k$.

Evaluate the optimized classifier using
all of the aforementioned metrics (with the test set) and plot the confusion matrix.

What value for $k$ did you find in this case? And did the number of false
positives and false negatives change compared to the non-optimized $k$-nearest neighbor classifier?
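
To see which $k$ the grid search selected and how the score varies with $k$, we can inspect the attributes `best_params_`, `best_score_`, and `cv_results_` of the fitted `GridSearchCV` object. The sketch below uses synthetic data, and `scoring="recall"` is only an example choice, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only:
X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(1, 11)},
    scoring="recall",  # Just an example; use the metric you selected
)
grid.fit(X, y)

# The best k and its mean cross-validated score:
print("Best parameters:", grid.best_params_)
print("Best mean CV score:", grid.best_score_)

# The mean cross-validated score for every k that was tried:
for k, score in zip(
    grid.cv_results_["param_n_neighbors"],
    grid.cv_results_["mean_test_score"],
):
    print(f"k = {k}: {score:.3f}")
```
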

In [None]:
# Your code here

#### Your answer to question 10.1(e):
*Double click here*

### 10.1(f)
Create a [decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
and fit it using your training set. Limit the tree to $3$ levels by setting
the parameter `max_depth=3`.

Evaluate the classifier using the
aforementioned metrics (with the test set) and plot the confusion matrix.

There is an example below question [10.1(h)](#10.1(h)) that uses a decision tree and
shows how it can be imported. This example will plot the decision tree
and try to optimize the depth of the tree. You can use this example as
inspiration for solving this question and the next two questions (if you
prefer to do them all at once).

In [None]:
# Your code here

#### Your answer to question 10.1(f):
*Double click here*

### 10.1(g)
We will also
try to tune the `DecisionTreeClassifier`
by determining the maximum depth
we should use for the tree. Again, you can use
`GridSearchCV` to optimize the parameter
`max_depth` for the `DecisionTreeClassifier`.
Use the metric you deemed most important
in question [10.1(c)](#10.1(c)) and consider depths
in the range `max_depth = range(1, 21)`, and, in addition,
a depth
where you set `max_depth = None` (this lets the
tree expand as far down as possible).

Evaluate the classifier with the best `max_depth` using the
aforementioned metrics (with the test set) and plot the confusion matrix.

What is the best `max_depth` you find in this case?




In [None]:
# Your code here

#### Your answer to question 10.1(g):
*Double click here*

### 10.1(h)
Visualize the best decision tree you found. This
can be done using the
function [export_graphviz from sklearn.tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html)
or the function [plot_tree from sklearn.tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html).

An example using `export_graphviz` is:

In [None]:
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
from sklearn.model_selection import train_test_split, GridSearchCV


# Just generate some synthetic classification data for this example:
X, y = make_classification(n_samples=1000, n_features=5)
# Create test/training sets:
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Just make up some variable names for the generated data we have used:
variables = [f"x{i}" for i in range(X.shape[1])]

# Set up a grid search:
parameters = {"max_depth": list(range(1, 21)) + [None]}
grid = GridSearchCV(
    DecisionTreeClassifier(),
    parameters,
    scoring="accuracy",  # Select scoring here!
)
# Run the grid search:
grid.fit(X_train, y_train)

# Get the best classifier from the grid search:
best_clf = grid.best_estimator_
print("Best tree:", best_clf)

# Show the decision tree:
dot_data = export_graphviz(
    best_clf,  # The decision tree we want to draw
    out_file=None,  # Return the result as a string (file name is set below)
    feature_names=variables,  # Name of variables
    class_names=["Name of first class", "Name of second class"],  # Class names
    rounded=True,  # Use rounded boxes
    filled=True,  # Use colors
)
graph = graphviz.Source(dot_data)
graph.render("tree", view=False)  # Create a tree.pdf for the tree.

(If the code above executed successfully, a file named [tree.pdf](./tree.pdf) should have been created.)
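
If graphviz is not available, `plot_tree` can draw the tree directly with matplotlib. Below is a minimal sketch for synthetic data. The variable names and class names are placeholders, and the fixed `max_depth=3` is just an example (for the exercise you would plot the best tree from your grid search):

```python
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Just generate some synthetic classification data for this example:
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
# Just make up some variable names for the generated data:
variables = [f"x{i}" for i in range(X.shape[1])]

# Fit a small tree (the depth is only an example):
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)

# Draw the tree in a matplotlib figure:
fig, ax = plt.subplots(figsize=(12, 6), constrained_layout=True)
annotations = plot_tree(
    clf,  # The decision tree we want to draw
    feature_names=variables,  # Name of variables
    class_names=["Name of first class", "Name of second class"],
    rounded=True,  # Use rounded boxes
    filled=True,  # Use colors
    ax=ax,  # Use the figure we created above
)
```
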

In [None]:
# Your code here

#### Your answer to question 10.1(h):
*Double click here*

### 10.1(i)
Compare the precision, recall, and F1 scores for the classifiers you have considered.
If you were to select one
classifier to put into real-life use, which one would you choose and why?

In [None]:
# Your code here

#### Your answer to question 10.1(i):
*Double click here*

### 10.1(j)
Extra task for the curious student: Create an alternative classifier, for instance,
using a so-called [support vector machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). We will not go into the details about how
this classifier works in our lectures, but with `sklearn` it is rather easy
to just try
it and see what it can do for us.

In the sklearn documentation there is also [an example that compares several classifiers](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html). Maybe you can find one that performs better than the ones
we have considered so far in this exercise?
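
As a possible starting point, the sketch below fits a support vector classifier to synthetic data. Scaling the variables with `StandardScaler` before the `SVC` is our own assumption here (it is often helpful for this type of classifier), and the default `SVC` settings are used without any tuning:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Just generate some synthetic classification data for this example:
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the variables and then apply the SVC (scaling is an assumption):
svc = make_pipeline(StandardScaler(), SVC())
svc.fit(X_train, y_train)

# Use the classifier for prediction for the test set:
y_hat = svc.predict(X_test)

# Calculate the precision etc. for the test set:
precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print(f"precision = {precision}")
print(f"recall = {recall}")
print(f"f1 = {f1}")
```
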

In [None]:
# Your code here

#### Your answer to question 10.1(j):
*Double click here*

## Exercise 10.2

Consider again the data set for ovarian cancer and the measured gene expressions (see exercise 9).
Create a decision tree classifier for this data set. Limit the depth of the decision
tree to 2, and visualize the decision tree. How do the "rules" the decision tree
is using for its classification compare to what you found from the PCA analysis?
Does it consider the same genes?

Note: There is some "randomness" in decision trees, so
the tree you now create will likely
use different genes from the ones you
found in exercise 9. You can rerun your code a few times to see how the randomness influences
things, or you can also play with the depth of the tree to see if it picks out
more genes.
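
To list which variables a fitted tree actually splits on, we can read the `tree_.feature` attribute of the classifier. The sketch below does this for synthetic data and a few different values of `random_state`. The names `gene0`, `gene1`, etc. are made up and do not refer to the ovarian cancer data set:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with made-up variable names, for illustration only:
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
variables = [f"gene{i}" for i in range(X.shape[1])]

used_variables = {}
for seed in range(3):
    tree = DecisionTreeClassifier(max_depth=2, random_state=seed)
    tree.fit(X, y)
    # tree_.feature holds the index of the variable used in each node;
    # leaf nodes are marked with a negative value and are skipped here:
    used = sorted({variables[i] for i in tree.tree_.feature if i >= 0})
    used_variables[seed] = used
    print(f"random_state = {seed}: splits on {used}")
```
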

In [None]:
# Your code here

#### Your answer to question 10.2:
*Double click here*