**Exercise set 10**
==============


>The goal of this exercise is to gain familiarity with some
>classification methods and the different ways we can assess and compare them.


**Exercise 10.1**


In this exercise, we will consider the
[UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://goo.gl/U2Uwz2).

This data set contains $569$ tumors which have been classified
as malignant or benign. In addition, $30$ variables have been
measured and it is our goal to make a predictive model which
can classify new tumors as being malignant or benign.
An overview of the different variables can be found
on the [`sklearn` website](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset).

In the following, we are going to label the two classes as:

* "benign" as a negative, and

* "malignant" as a positive.


**(a)**  Begin the exercise by loading the raw data and creating a test set.
Use 33 % of the available data points for the test set.
The data set itself can be loaded directly from `sklearn`
as follows:


In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data['data']
# "Rename" y so that 0 = benign and 1 = malignant:
y = [0 if i == 1 else 1 for i in data['target']]
class_names = ['benign', 'malignant']
print('Classes:')
print(class_names)

print('Variables:')
print(data['feature_names'])

The test set can be created with the function
`train_test_split`, which can be found in the module
`sklearn.model_selection`. One of the input parameters to `train_test_split` is
`stratify`. After reading the documentation on [`stratification`](https://scikit-learn.org/stable/modules/cross_validation.html#stratification),
can you explain what this parameter does? Is it important for the data set we
are considering here?
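As a rough sketch of how such a split could look, the snippet below passes the labels to `stratify` and prints the fraction of malignant samples in each subset. The `random_state=0` seed and the variable names are assumptions made here for reproducibility; they are not prescribed by the exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data['data']
# Same relabeling as above: 0 = benign, 1 = malignant.
y = np.array([0 if i == 1 else 1 for i in data['target']])

# stratify=y asks for (approximately) the same class proportions
# in the training and test sets. random_state=0 is an arbitrary seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

print('Fraction malignant, all data:    ', y.mean())
print('Fraction malignant, training set:', y_train.mean())
print('Fraction malignant, test set:    ', y_test.mean())
```

Running the same code without `stratify=y` and comparing the three fractions is one way to see what the parameter changes.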

In [None]:
# Your code here

**Your answer to question 10.1(a):** *Double click here*

**(b)** In this case, we have to determine what
quantity we are going to use
to compare the different classification methods. Before selecting
what quantity to use, we should consider what false positives and false negatives
mean in our current context. How would you define these two terms in our present
case, and would you say that false positives are a more serious mistake to make
than false negatives?

**Your answer to question 10.1(b):** *Double click here*

**(c)** Following up on the previous question, here
are some possible metrics we could use to assess the performance of a
classifier model we <abbr title="Note: There are other possibilities
as well! If you are curious, you can for instance include the
*Accuracy*, which is the ratio of correct predictions
to the total number of predictions.">make:</abbr>


* (i)  Precision: The ratio of true positives to the sum of true positives and false positives.

* (ii)  Recall: The ratio of true positives to the sum of true positives and false negatives.

* (iii)  F1: The (harmonic) mean of the precision and recall.

In addition, we can summarize the performance using the *confusion matrix*.
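To make these definitions concrete, here is a small sketch that computes the three metrics by hand from a confusion matrix and compares them with the functions in `sklearn.metrics`. The labels `y_true` and `y_pred` are made-up toy values, not results from the breast cancer data.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels (hypothetical): 0 = negative, 1 = positive.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# For binary labels 0/1, ravel() returns the entries in the
# order: true neg., false pos., false neg., true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('By hand:', precision, recall, f1)
print('sklearn:',
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```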

The choice of metric for assessing a classifier will lead to different results.
For instance, if we choose to use the precision as our metric, we will maximize it
during the optimization of our model. This means that we will *minimize* the
number of *false positives*. If we choose to use the recall, on the other hand,
we will *minimize* the number of *false negatives*.


In the following, we will calculate all these metrics for the
different classification methods we consider. At the end of the
exercise, you will be asked to compare the different classifiers
using them. 

But before we do that: Which of the
aforementioned metrics would you say is most important for
the classification task we have here? Base this on your answer to
the previous point.

**Note:** There is no single correct answer here, and it
really depends on how *you* judge the seriousness of false positives vs.
false negatives.

**Your answer to question 10.1(c):** *Double click here*

**(d)**  Create a $k$-nearest neighbor classifier (available as
`sklearn.neighbors.KNeighborsClassifier`) with $3$ neighbors and
fit it using your training set. Evaluate the classifier (with the test set) using the
precision, recall, and F1 metrics, and plot the confusion matrix.

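The different metrics are available in the `sklearn.metrics` module.
For plotting the confusion matrix, this module also provides the class
`ConfusionMatrixDisplay` (with the methods `from_estimator` and `from_predictions`).
The older function `plot_confusion_matrix` was removed in scikit-learn 1.2.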

How many false positives and false negatives do you get?
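One possible way to set this up is sketched below. It assumes a stratified split with the arbitrary seed `random_state=0`, so the numbers you get will depend on your own split. The confusion matrix is only computed as numbers here; the plotting is left to you.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X = data['data']
y = np.array([0 if i == 1 else 1 for i in data['target']])
# The seed is an assumption made for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Fit on the training set only, then predict the unseen test set.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)

precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)

# Off-diagonal entries are the false positives and false negatives.
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
print('False positives:', fp)
print('False negatives:', fn)
```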

In [None]:
# Your code here

**Your answer to question 10.1(d):** *Double click here*

**(e)**  We will now try to optimize the $k$ for a $k$-nearest neighbor classifier.
This can be done using the method [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

One of the inputs to this method is the `scoring` parameter, which
selects the metric to use for finding the best $k$. Here, use the metric
you deemed most important in question **10.1(c)**.

When using `GridSearchCV`, consider $k$-values in the range $1 \leq k \leq 10$
for your search for the best $k$.
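A minimal sketch of such a search is given below. Here `scoring='recall'` is only a placeholder for whichever metric you selected, and both `cv=5` and `random_state=0` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X = data['data']
y = np.array([0 if i == 1 else 1 for i in data['target']])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Candidate values for k: 1, 2, ..., 10.
param_grid = {'n_neighbors': list(range(1, 11))}

# The search uses cross-validation on the training set only,
# so the test set stays untouched until the final evaluation.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    scoring='recall',  # placeholder: use the metric you chose
    cv=5,
)
grid.fit(X_train, y_train)

print('Best k:', grid.best_params_['n_neighbors'])
print('Best cross-validated score:', grid.best_score_)

# By default the best model is refitted on the whole training set.
best_knn = grid.best_estimator_
y_hat = best_knn.predict(X_test)
```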

Evaluate the classifier with the best $k$ using
all of the aforementioned metrics (with the test set) and plot the confusion matrix.

What value for $k$ did you find in this case? And did the number of false
positives and false negatives change compared to the non-optimized $k$-nearest neighbor classifier?

In [None]:
# Your code here

**Your answer to question 10.1(e):** *Double click here*

**(f)**  Create a decision tree classifier (available as
`sklearn.tree.DecisionTreeClassifier`) and fit it using your training set. Limit the tree to $3$ levels by setting
the parameter `max_depth=3`.

Evaluate the classifier using the
aforementioned metrics (with the test set) and plot the confusion matrix.

In [None]:
# Your code here

**Your answer to question 10.1(f):** *Double click here*

**(g)**  We will also try to tune the `DecisionTreeClassifier`
by determining the maximum depth we should use for the tree.
Again, you can use `GridSearchCV`, this time to optimize the parameter
`max_depth` for the `DecisionTreeClassifier`.
Use the metric you deemed most important in question **10.1(c)**
and consider the depths `range(1, 21)` and, in addition,
`max_depth = None` (this lets the
tree expand as far down as possible).
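The parameter grid for this search could be built as in the following sketch, which simply appends `None` to the list of integer depths. As before, `scoring='recall'` is a placeholder for your own choice of metric, and `cv=5` and `random_state=0` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data['data']
y = np.array([0 if i == 1 else 1 for i in data['target']])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Depths 1, 2, ..., 20 plus None (no limit on the depth).
depths = list(range(1, 21)) + [None]

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': depths},
    scoring='recall',  # placeholder: use the metric you chose
    cv=5,
)
grid.fit(X_train, y_train)

best_depth = grid.best_params_['max_depth']
print('Best max_depth:', best_depth)
# Actual depth of the refitted tree (relevant if max_depth is None).
print('Depth of the best tree:', grid.best_estimator_.get_depth())
```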

Evaluate the classifier with the best `max_depth` using the
aforementioned metrics (with the test set) and plot the confusion matrix.

What is the best `max_depth` you find in this case?

In [None]:
# Your code here

**Your answer to question 10.1(g):** *Double click here*

**(h)**  Visualize the decision tree with 3 levels. This
can be done using either the function `export_graphviz`
or the function `plot_tree`, both
from `sklearn.tree`. (Please
see the sklearn [`tree`](https://scikit-learn.org/stable/modules/tree.html) documentation and the documentation for [`export_graphviz`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html).)
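As an illustration of the `export_graphviz` route, the sketch below fits a tree with `max_depth=3` and exports it as text in the DOT format. With `out_file=None` the function returns a string, which can then be rendered with Graphviz. The split seed and the styling options `filled` and `rounded` are choices made here, not requirements.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

data = load_breast_cancer()
X = data['data']
y = np.array([0 if i == 1 else 1 for i in data['target']])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# out_file=None makes export_graphviz return the DOT source
# as a string instead of writing it to a file.
dot_data = export_graphviz(
    tree,
    out_file=None,
    feature_names=list(data['feature_names']),
    class_names=['benign', 'malignant'],
    filled=True,
    rounded=True,
)
print(dot_data[:300])
```

The string `dot_data` can be saved to a file and converted to an image with the Graphviz `dot` program, or displayed in a notebook if the separate `graphviz` Python package is installed.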

In [None]:
# Your code here

**Your answer to question 10.1(h):** *Double click here*

**(i)**  Compare the precision, recall, and F1 scores for all
the classifiers you have considered.

If you were to select one
classifier to put into real-life use, which one would you choose and why?
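One way to organize the comparison is to collect the scores in a dictionary and print them as a small table, as sketched below. Only the two non-optimized classifiers are included here; the names used as dictionary keys are arbitrary, and the tuned classifiers from your grid searches can be added to `classifiers` in the same way.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data['data']
y = np.array([0 if i == 1 else 1 for i in data['target']])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

classifiers = {
    'kNN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Tree (max_depth=3)': DecisionTreeClassifier(
        max_depth=3, random_state=0
    ),
}

# Fit each classifier and store its test-set scores.
results = {}
for name, clf in classifiers.items():
    y_hat = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        'precision': precision_score(y_test, y_hat),
        'recall': recall_score(y_test, y_hat),
        'f1': f1_score(y_test, y_hat),
    }

print(f"{'classifier':<22}{'precision':>10}{'recall':>10}{'f1':>10}")
for name, s in results.items():
    print(f"{name:<22}{s['precision']:>10.3f}"
          f"{s['recall']:>10.3f}{s['f1']:>10.3f}")
```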

In [None]:
# Your code here

**Your answer to question 10.1(i):** *Double click here*

**(j)**  Extra task for the curious student: Create an alternative classifier, for instance,
using a so-called support vector machine. We will not go into the details of how
this classifier works in our lectures, but with `sklearn` it is rather easy
to just try
it and see what it can do for us. In sklearn, this is available as the
class `SVC` from `sklearn.svm`.

There is also [an example](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) that compares several classifiers. Maybe you
can find one that performs better than the ones
we have considered so far in this exercise?
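A possible starting point for the support vector machine is sketched below. Scaling the variables with `StandardScaler` before the `SVC` is an addition made here (support vector machines are sensitive to the scale of the variables) and is not something the exercise asks for. The kernel and `C` are simply the default settings, written out explicitly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = data['data']
y = np.array([0 if i == 1 else 1 for i in data['target']])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# The pipeline learns the scaling from the training set only
# and applies the same scaling when predicting the test set.
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
svm.fit(X_train, y_train)
y_hat = svm.predict(X_test)

precision = precision_score(y_test, y_hat)
recall = recall_score(y_test, y_hat)
f1 = f1_score(y_test, y_hat)
print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)
```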

In [None]:
# Your code here

**Your answer to question 10.1(j):** *Double click here*

**Exercise 10.2**

Consider again the data set for ovarian cancer and the measured gene expressions (see exercise 9).
Create a decision tree classifier for this data set. Limit the depth of the decision
tree to 2, and visualize the decision tree. How do the "rules" the decision tree
is using for its classification compare to what you found from the PCA analysis?
Does it consider the same genes?

In [None]:
# Your code here

**Your answer to question 10.2:** *Double click here*