**Exercise set 8**
==============


>The goal of this exercise is to gain familiarity with some
>classification methods and the different ways we can assess and compare them.


**Exercise 8.1**


In this exercise, we will consider the
[UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://goo.gl/U2Uwz2).

This data set contains $569$ tumors which have been classified
as malignant or benign. In addition, $30$ variables have been
measured and it is our goal to make a predictive model which
can classify new tumors as being malignant or benign.
An overview of the different variables can be found
on the [`sklearn` website](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset).


**(a)**  Begin the exercise by loading the raw data, and creating a test set.
The data set itself can be loaded directly from `sklearn`
as follows:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data['data']
# "Rename" y so that 0 = benign and 1 = malignant:
y = [0 if i == 1 else 1 for i in data['target']]
class_names = ['benign', 'malignant']
print('Classes:')
print(class_names)

print('Variables:')
print(data['feature_names'])
```

The test set can be created as in previous exercises by using the method
`train_test_split` which can be found in the module
`sklearn.model_selection`.

Create the test set using $33$\% of the available data points.
What is important to consider when creating this test set?
Check that this is indeed satisfied for both your test set and
your training set.
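
As a starting point, the split might be sketched as below. This is only one possible approach: it assumes that the class balance is the property worth checking, so it passes `stratify=y` to `train_test_split`, and the seed `random_state=0` is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data['data']
# Same relabelling as above: 0 = benign, 1 = malignant.
y = np.array([0 if i == 1 else 1 for i in data['target']])

# Assumption: stratify on y so that both splits keep the class proportions.
# random_state=0 is an arbitrary seed, used only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Compare the fraction of malignant tumors in each part of the data.
for name, labels in [('all', y), ('train', y_train), ('test', y_test)]:
    print(f'{name:>5}: n = {len(labels)}, '
          f'malignant fraction = {labels.mean():.3f}')
```

If the three fractions are close to each other, the split has preserved the balance between the two classes.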




In [None]:
# Your code here

**Your answer to question 8.1(a):** *Double click here*

**(b)** In this case, we have to determine what
quantity we are going to use
to compare the different classification methods. Briefly explain
how the following possible metrics for assessing classifiers are
calculated:


* (i)  Accuracy.

* (ii)  Precision.

* (iii)  Recall.

* (iv)  F1.

* (v)  The confusion matrix.



Can you give an example of a classification task where false positives
should be avoided and false negatives are tolerable? Can you give
an example of the opposite (i.e. where false positives are tolerable
and false negatives should be avoided)?


In the following, we will calculate all these metrics for the
different classification methods we consider. At the end of the
exercise, you will be asked to compare the different classifiers
using these metrics. But before we do that: Which of the
aforementioned metrics would you say is most important for
the classification task we have here?
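
To make the metrics (i)-(v) concrete, the sketch below computes them by hand from the four counts in a confusion matrix and prints the values from `sklearn.metrics` next to them. The labels `y_true` and `y_pred` are made up for illustration and have nothing to do with the tumor data.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, the flattened confusion matrix is ordered as
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # correct among predicted positives
recall = tp / (tp + fn)                      # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('by hand:', accuracy, precision, recall, f1)
print('sklearn:', accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```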




In [None]:
# Your code here

**Your answer to question 8.1(b):** *Double click here*

**(c)**  Create a $k$-nearest neighbor classifier (this classifier is available from
`sklearn.neighbors.KNeighborsClassifier`) with $3$ neighbors and
fit it using your training set. Evaluate the classifier using the
aforementioned metrics (with the test set) and plot the confusion matrix.
(The different metrics are available in the `sklearn.metrics` module. This module also provides a method, `plot_confusion_matrix`, which you
can use for plotting the confusion matrix. Note that `plot_confusion_matrix` was removed in `scikit-learn` 1.2; in newer versions, use `ConfusionMatrixDisplay.from_estimator` instead.)
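
A minimal sketch of this step could look as follows. It rebuilds the data and the split so that it runs on its own; the stratified split and the seed `random_state=0` are assumptions, and `1 - data['target']` is simply a shorter way of writing the relabelling used earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X = data['data']
y = 1 - data['target']  # 0 = benign, 1 = malignant
class_names = ['benign', 'malignant']

# Assumed split: 33 % test data, stratified, arbitrary seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)

# Precision, recall and F1 per class, plus the overall accuracy.
print(classification_report(y_test, y_hat, target_names=class_names))

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_hat)
print(cm)
```

With `matplotlib` installed, `ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test, display_labels=class_names)` from `sklearn.metrics` draws the same matrix as a figure.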




In [None]:
# Your code here

**Your answer to question 8.1(c):** *Double click here*

**(d)**  We will now try to optimize the $k$ for a $k$-nearest neighbor classifier.
This can be done using the method `GridSearchCV` which we used in
exercise set $6$. Use this method to find the "best" value for $k$, and
use the metric you deemed most important in part **(b)** for
scoring what "best" is. Consider $k$ values in the range $1 \leq k \leq 10$
for your search for the best $k$.

Evaluate the classifier with the best $k$ using the
aforementioned metrics (with the test set) and plot the confusion matrix.
What value for $k$ did you find in this case?
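
The grid search might be set up roughly as sketched below. The choice `scoring='recall'` is an assumption made for illustration and should be replaced by the metric you argued for in part **(b)**; the split settings and `cv=5` are also assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X = data['data']
y = 1 - data['target']  # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Candidate values 1 <= k <= 10.
param_grid = {'n_neighbors': list(range(1, 11))}

# Assumption: recall (for class 1 = malignant) is the metric to optimize.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring='recall', cv=5)
search.fit(X_train, y_train)  # cross-validation uses the training set only

print('Best k:', search.best_params_['n_neighbors'])
print('Cross-validated recall:', search.best_score_)

# best_estimator_ has been refitted on the whole training set.
best_knn = search.best_estimator_
test_recall = recall_score(y_test, best_knn.predict(X_test))
print('Recall on the test set:', test_recall)
```

The search only sees the training set, so the test set remains untouched until the final evaluation.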




In [None]:
# Your code here

**Your answer to question 8.1(d):** *Double click here*

**(e)**  Create a decision tree classifier (this classifier is available from
`sklearn.tree.DecisionTreeClassifier`) and fit it using your training set.
Evaluate the classifier using the
aforementioned metrics (with the test set) and plot the confusion matrix.


In [None]:
# Your code here

**Your answer to question 8.1(e):** *Double click here*

**(f)**  We will also try to tune the `DecisionTreeClassifier`
by determining the maximum depth we should use for the tree.
Again, you can use the method `GridSearchCV` to optimize the parameter
`max_depth` for the `DecisionTreeClassifier`.
Use the metric you deemed most important in question **(b)** and
consider depths in the range `max_depth = range(1, 21)` and, in addition,
`max_depth = None` (this lets the tree expand as far down as possible).

Evaluate the classifier with the best `max_depth` using the
aforementioned metrics (with the test set) and plot the confusion matrix.

What is the best `max_depth` you find in this case? Is it different
from letting the tree expand as far down as possible and is this as you
would expect?
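
One way to include `None` in the search is to append it to the list of depths, as in the sketch below. As before, `scoring='recall'`, `cv=5` and the seeds are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data['data']
y = 1 - data['target']  # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Depths 1..20 and, in addition, None (no limit on the depth).
param_grid = {'max_depth': list(range(1, 21)) + [None]}

# Assumption: recall is the metric to optimize; random_state fixes the
# tie-breaking inside the tree so that repeated runs give the same result.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring='recall', cv=5)
search.fit(X_train, y_train)
print('Best max_depth:', search.best_params_['max_depth'])

# For comparison: the depth reached by a tree without any limit.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print('Depth of the unrestricted tree:', full_tree.get_depth())
```

Comparing the two printed depths gives a first indication of whether limiting the depth changes the tree at all.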

In [None]:
# Your code here

**Your answer to question 8.1(f):** *Double click here*

**(g)**  In addition to the classifiers we have considered so far,
we will try $3$ alternative classifiers.
For these, we will not try
to optimize any extra parameters.

For each of the following classifiers:


* (i)  Naive Bayes. (Available as `GaussianNB` from `sklearn.naive_bayes`)

* (ii)  LDA. (Available as `LinearDiscriminantAnalysis` from `sklearn.discriminant_analysis`)

* (iii)  Random forest. (Available as `RandomForestClassifier` from `sklearn.ensemble`)



Train the classifier using your training set, and evaluate it using
the aforementioned metrics (with the test set) and plot the confusion matrix.

For the Random forest classifier: Investigate the most important
variables for this classifier. This information is available through the
property `feature_importances_` of your
`RandomForestClassifier` object.
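
The importances can be paired with the variable names and sorted, for instance as in the following sketch. The split settings, `n_estimators=100` and the seeds are assumptions, and showing the top 10 variables is an arbitrary cut-off.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data['data']
y = 1 - data['target']  # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Indices of the variables, sorted from most to least important.
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]

# Print the 10 highest-ranked variables (arbitrary cut-off).
for idx in order[:10]:
    print(f"{data['feature_names'][idx]:<25} {importances[idx]:.3f}")
```

The importances sum to one, so each value can be read as the share of the forest's total impurity reduction attributed to that variable.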

In [None]:
# Your code here

**Your answer to question 8.1(g):** *Double click here*

**(h)**  Compare the accuracy, precision, recall and the F1 scores for all
the classifiers you have considered.

If you were to select one
classifier to put into real-life use, which one would you choose and why?
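
For the comparison, it can help to collect all scores in one table. The sketch below does this for classifiers with default or simple settings; the tuned classifiers from parts **(d)** and **(f)** could be added to the dictionary in the same way. The split settings and seeds are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = data['data']
y = 1 - data['target']  # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

classifiers = {
    'kNN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Naive Bayes': GaussianNB(),
    'LDA': LinearDiscriminantAnalysis(),
    'Random forest': RandomForestClassifier(random_state=0),
}
metrics = {'accuracy': accuracy_score, 'precision': precision_score,
           'recall': recall_score, 'F1': f1_score}

# Fit each classifier on the training set and score it on the test set.
results = {}
for name, clf in classifiers.items():
    y_hat = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {m: score(y_test, y_hat) for m, score in metrics.items()}

# Print one row per classifier and one column per metric.
print(f"{'classifier':<15}" + ''.join(f'{m:>11}' for m in metrics))
for name, scores in results.items():
    print(f'{name:<15}' + ''.join(f'{scores[m]:>11.3f}' for m in metrics))
```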




In [None]:
# Your code here

**Your answer to question 8.1(h):** *Double click here*

**(i)**  Two extra tasks for the curious student:

* (i)  Visualize the decision tree you have created. This
can be done by using the method `export_graphviz`
from `sklearn.tree`, or the method `plot_tree`
from `sklearn.tree`. (Please see https://scikit-learn.org/stable/modules/tree.html and https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html )


* (ii)  Create an additional classifier using a support vector machine. This is
available via `SVC` from `sklearn.svm`.
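
Both extra tasks are sketched below under a few assumptions. For the tree, `export_text` from `sklearn.tree` is used as a plain-text alternative to `plot_tree` and `export_graphviz`, and `max_depth=3` is chosen only to keep the printout short. For the support vector machine, the variables are standardized with `StandardScaler` first, which is a common choice but not something the exercise asks for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X = data['data']
y = 1 - data['target']  # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# (i) A shallow tree (assumed max_depth=3) printed as indented text.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
tree_text = export_text(tree, feature_names=list(data['feature_names']))
print(tree_text)

# (ii) SVC with default settings; scaling the inputs first is an assumption.
svm = make_pipeline(StandardScaler(), SVC())
svm.fit(X_train, y_train)
svm_accuracy = svm.score(X_test, y_test)  # mean accuracy on the test set
print('SVC accuracy on the test set:', svm_accuracy)
```

With `matplotlib` installed, `plot_tree(tree, feature_names=list(data['feature_names']))` draws the same tree as a figure.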




In [None]:
# Your code here

**Your answer to question 8.1(i):** *Double click here*