# Using Machine Learning Tools: Workshop 5

The aim of this week's workshop is to get to know the task of classification and to apply and compare two different classification methods.

You will use the [Wisconsin Breast Cancer data set](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset), which is included in scikit-learn. Follow the link and read about what the dataset contains.

Note: the download link for the dataset on that webpage is dead. However, you can load the data directly with the following two lines (the same ones used in the code cells further down):

`from sklearn.datasets import load_breast_cancer`

`data = load_breast_cancer()`


## **Setup and data loading**

Import required libraries and access the Wisconsin Breast Cancer data set by running the cells below.


In [None]:
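# Python ≥3.6 is required (the cells below use f-strings)
import sys
assert sys.version_info >= (3, 6)

# Scikit-Learn ≥0.20 is required
import sklearn
# Compare the version numerically: comparing strings would rank "0.9" above "0.20"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)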

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# To plot even prettier figures
import seaborn as sn

# General data handling (pure numerics are better in numpy)
import pandas as pd

In [None]:
# Load the dataset

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [None]:
# This is where the numerical data is
xarray = data.data
yarray = data.target

In [None]:
# This is where the names of features and targets are
print(f'Feature names are: {data.feature_names}')
print(f'Label names are: {data.target_names}')

In [None]:
# We recommend inverting the labels so that malignant (the worse disease) = 1 (i.e. positive)
yarray = 1 - yarray
# Don't forget to switch the label names too (if you are going to use them anywhere)
# Though it is good practice to switch them here anyway, as future modifications to the code then won't get confused
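
One possible way to switch the label names is sketched below. The variable name `target_names` is chosen for this sketch only; the idea is simply to reverse the order of `data.target_names` so that indexing by the inverted label gives the right name.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Original encoding in the dataset: 0 = malignant, 1 = benign
yarray = 1 - data.target  # after inversion: 1 = malignant, 0 = benign

# Reverse the name order so that target_names[label] matches the new labels.
# `target_names` is a name made up for this sketch.
target_names = data.target_names[::-1]

print(target_names)
print({str(name): int((yarray == i).sum()) for i, name in enumerate(target_names)})
```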

In [None]:
# This is how you could put it all into a pandas dataframe (useful for some investigations)
df = pd.DataFrame(xarray, columns=data.feature_names)
df['label'] = yarray  # 1 = malignant after the inversion above


## **Inspection and visualisation [15 mins]**

Familiarise yourself with the dataset and then use several different methods to display its properties, just like in Workshops 2 and 3. Pay particular attention to the class variable - this is often called the label or the target. The classifier aims to predict the label based on the remaining features in the data set. That makes it a supervised learning task.

In [None]:
# Add your code here to investigate the data further ...
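
As a starting point only, the following sketch shows a few quick numerical checks (size, missing values, class balance, feature scales). The column name `label` is an assumption of this sketch, and plots such as histograms or pair plots from Workshops 2 and 3 would normally follow.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["label"] = 1 - data.target  # 1 = malignant, as recommended above

print(df.shape)                    # number of rows and columns
print(df.isna().sum().sum())       # total number of missing values
print(df["label"].value_counts())  # class balance
# Compare the scales of the first few features
print(df.describe().T[["mean", "std", "min", "max"]].head())
```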

## **Splitting into separate datasets and building a pre-processing pipeline [10 mins]**

Split the data into a train and test set using an 80:20 ratio.
Then split the training part into a reduced training set and a validation set.  We will use a fixed hold-out validation set for this workshop, but we could have done K-fold cross-validation in the same way as we did for regression.

Once this is done, build an appropriate pre-processing pipeline, based on what you’ve seen in your investigations of the data above. What elements should you put in a pre-processing pipeline and why?

You can use tools like ChatGPT to help you write your code.

In [None]:
from sklearn.model_selection import train_test_split

# Your code for splitting into separate data sets goes here...
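
A minimal sketch of the two-stage split is shown below. The 80:20 ratio comes from the text; the 75:25 split of the training part, the use of `stratify`, the `random_state` value and all variable names are assumptions of this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # 1 = malignant

# Stage 1: hold out 20% as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stage 2: split the remaining 80% into reduced training and validation sets
# (25% of the training part is an arbitrary choice for this sketch)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the proportion of malignant cases roughly equal across the three sets, which matters when one class is rarer than the other.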

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

preproc_pl = Pipeline([ ??? ]) # Your code for building a pre-processing pipeline goes here...
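
One plausible answer, not the only one, is an imputer followed by a scaler, as sketched below. The median strategy and the step names are assumptions; this dataset has no missing values, so the imputer is only a safeguard, while scaling matters because the features have very different ranges.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

preproc_pl = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fills any missing values
    ("scaler", StandardScaler()),                   # zero mean, unit variance
])

# Fitted on all rows here purely to show the effect;
# in the workshop, fit on the training set only.
X_scaled = preproc_pl.fit_transform(X)
print(np.round(X_scaled.mean(axis=0)[:5], 3))
print(np.round(X_scaled.std(axis=0)[:5], 3))
```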

## **Implementing a Stochastic Gradient Descent Classifier [20 mins]**

Build a new pipeline that includes your preprocessing and the Stochastic Gradient Descent classifier (a binary classifier that uses a linear model). You can find it in `sklearn.linear_model` as `SGDClassifier`. Use the parameter setting `loss='log_loss'` (named `'log'` in scikit-learn versions before 1.1), as this gives probability outputs through `predict_proba`, whereas the default `'hinge'` loss gives only binary/integer class outputs.

Train the classifier and then apply it to the validation set to get both binary class outputs and also probabilistic outputs using two separate calls, and store the outputs separately. We will use the binary class outputs for the next few steps, but will want the probabilities a little later.

After this:
* Display the results graphically, along with the true labels, in such a way that it is easy to identify which ones are correct or incorrect.
* Calculate the confusion matrix using `confusion_matrix` from `sklearn.metrics`.
* Display this confusion matrix using the seaborn `heatmap` function, with annotations on (i.e. `annot=True`).
* Calculate and display a normalised version of the confusion matrix, such that each row (true class) sums to 1.0.
* Calculate the accuracy of the classification using `accuracy_score` from `sklearn.metrics`.
* Calculate the precision and recall using sklearn’s `precision_score` and `recall_score`.
* Calculate and display the Receiver Operating Characteristic (ROC) curve using `roc_curve` from `sklearn.metrics`.
* Calculate the Area Under the Curve (AUC) using the function `auc` from `sklearn.metrics`.
* Calculate and display the Precision-Recall curve (using `precision_recall_curve`). This plots Precision against Recall and offers an alternative view to the ROC curve, but here the top right corner represents the best result. It is particularly useful when you don’t care about True Negatives, as these don’t feature in Precision or Recall.

You can use tools like ChatGPT to help you write your code.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier

# Your code here...


**Questions:**
* How many distinct points are there in the ROC curve? Try calculating the ROC curve again using the probability outputs instead. Look at the thresholds and compare these to the predicted probability outputs from the classifier (especially those where the classification was incorrect). There is an additional notebook (ROC.ipynb) provided in the MyUni module where you can observe how the ROC curve is created.

* If this classifier were used to make decisions in the hospital, which threshold would you choose? Is precision more important or recall? Do you think this classifier is good enough or does it need more optimisation?
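
To see why the number of ROC points depends on the kind of output, the toy sketch below compares `roc_curve` on hard 0/1 predictions with `roc_curve` on scores. The four labels and probabilities are made up for illustration and are not taken from the workshop data.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # made-up probabilities
y_pred = (y_prob >= 0.5).astype(int)      # hard decisions at threshold 0.5

# Hard predictions take only two values, so few thresholds are possible
fpr_bin, tpr_bin, thr_bin = roc_curve(y_true, y_pred)

# Each distinct probability can act as a threshold
fpr_prob, tpr_prob, thr_prob = roc_curve(y_true, y_prob)

print(len(fpr_bin), len(fpr_prob))
print(thr_prob)
```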

## **Implementing a Decision Tree Classifier [20 mins]**

Train a decision tree classifier on the same training data using the `DecisionTreeClassifier` from `sklearn.tree`. The default parameters are fine for this.

Display the decision tree using the function `plot_tree` in `sklearn.tree`. Hint: To increase the resolution use `plt.rcParams['figure.dpi'] = 200` (if you did our standard `import matplotlib.pyplot as plt`)

After this:
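* Apply the classifier to the validation set. Note that a single decision tree grown with the default parameters does not give useful probability outputs: its leaves are normally pure, so `predict_proba` returns only 0 or 1.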
* Calculate the confusion matrix, precision and recall. Display the confusion matrix.
* Calculate and plot the ROC curve.
* Calculate the AUC.
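
A sketch of these steps is given below, under the same assumed split as earlier. The `random_state` on the tree is an addition for reproducibility (the text only asks for default parameters), and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (auc, confusion_matrix, precision_score,
                             recall_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # 1 = malignant
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

# Default parameters; random_state is only added so the sketch is repeatable
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

y_val_pred_tree = tree_clf.predict(X_val)
y_val_prob_tree = tree_clf.predict_proba(X_val)[:, 1]  # class fractions in the leaves

cm_tree = confusion_matrix(y_val, y_val_pred_tree)
prec_tree = precision_score(y_val, y_val_pred_tree)
rec_tree = recall_score(y_val, y_val_pred_tree)

fpr_tree, tpr_tree, thr_tree = roc_curve(y_val, y_val_pred_tree)
auc_tree = auc(fpr_tree, tpr_tree)

print(cm_tree)
print(f"precision={prec_tree:.3f} recall={rec_tree:.3f} AUC={auc_tree:.3f}")
print(sorted(set(y_val_prob_tree.tolist())), len(fpr_tree))
```

To draw the tree itself, `plot_tree(tree_clf, feature_names=list(data.feature_names), filled=True)` can be called after fitting.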

You can use tools like ChatGPT to help you write your code.

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Your code here

**Questions:**
* What does each of the components (nodes, branches, thresholds) of the decision tree mean?
* Why are there so few points in the ROC curve? Does it still show useful information?
* How does the decision tree compare to the SGD linear model? List 2 pros and 2 cons of each approach.

## **Model selection [20 mins]**

* What do you think would be a good performance metric to use in this case, and why? Choose one to work with here.

Note: A good answer here will depend on what you think is most important in the context of the task. If we primarily want to suppress False Negatives, then it would be good to choose an option with a good Recall, but still with acceptable Precision. Based on the Precision-Recall curves alone you would normally choose the model with an operating point "nearest" to the top right. Looking at the class predictions (`y_val_pred`) as opposed to the probabilities (`y_val_prob`) shows that the classifier is already choosing a good operating point, as the confusion matrices also show.

Compare the two models (pipelines) using your chosen performance metric, based on the results from the validation set.

Take the chosen model and re-train it on the combination of training and validation datasets.

Evaluate the chosen model on the test set. Compare the results to what you got from the validation data.

**Question:** What would it mean if there was a big difference between the performance scores on the validation and test datasets?

In [None]:
# Your code here

## **Extension**

* How stable do you think these results are?
Try flipping the value in one element of a probability or class output (i.e. new_val = 1 - old_val) and see how much the results and curves change.
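
The flipping experiment can be tried first on a toy example, as in the sketch below. The six labels and probabilities are made up, so that the effect of a single flip on the AUC is easy to follow; the same pattern applies to your own validation outputs.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.95])  # made-up, perfectly separated

# Flip one element: new_val = 1 - old_val
y_prob_flipped = y_prob.copy()
y_prob_flipped[0] = 1 - y_prob_flipped[0]  # 0.1 becomes 0.9


def roc_auc(y, scores):
    """Area under the ROC curve via roc_curve and auc."""
    fpr, tpr, _ = roc_curve(y, scores)
    return auc(fpr, tpr)


auc_before = roc_auc(y_true, y_prob)
auc_after = roc_auc(y_true, y_prob_flipped)
print(f"AUC before flip: {auc_before:.3f}, after flip: {auc_after:.3f}")
```

On a small validation set, one changed output can move the curves noticeably, which is worth keeping in mind when comparing models whose scores are close.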