## DSCI 100 - Introduction to Data Science


### Lecture 7 - Classification II: Evaluating & Tuning

<img src="https://datasciencebook.ca/_main_files/figure-html/06-decision-grid-K-1.png" width=500>


## Housekeeping

- Midterm on Thursday at 12:30
- Covers weeks 1-6
    - Introduction to Python and Pandas ([Ch 1 in textbook](https://python.datasciencebook.ca/intro.html))
    - Reading Data ([Ch 2](https://python.datasciencebook.ca/reading.html))
    - Wrangling Data ([Ch 3](https://python.datasciencebook.ca/wrangling.html))
    - Visualizing Data ([Ch 4](https://python.datasciencebook.ca/viz.html))
    - Version Control ([Ch 12](https://python.datasciencebook.ca/version-control.html))
    - Classification I: training & predicting ([Ch 5](https://python.datasciencebook.ca/classification1.html))
- quiz is 70 minutes
    - Multiple choice
    - Short answer
    - Fill in the blank coding questions
- on Canvas with lockdown browser
    - use Chrome or Firefox. Disable browser extensions
    - You will be able to access the [Reference sheet for Python](https://ubc-dsci.github.io/dsci-100-student/REFERENCE_PYTHON.html)
    - Please bring a calculator 

## Today: unanswered questions from last week

1. Is our model any good? How do we **evaluate** it?

2. How do we choose `K` in K-nearest neighbours classification?

## How to measure classifier performance?
</br>

### Accuracy

$$Accuracy  = \dfrac{\#\; correct\; predictions}{\#\; total\; predictions}$$


Downside: doesn't tell you the type of mistake being made


### Confusion matrix
</br>

Here is an example of a confusion matrix for the cancer diagnosis data we've seen before.

</br>

<table>
<thead style="font-size: 40px";>
<tr class="header">
<th></th>
<th>Truly Malignant</th>
<th>Truly Benign</th>
</tr>
</thead>
<tbody style="font-size: 40px";>
<tr class="odd">
<td><strong>Predicted Malignant</strong></td>
<td>1</td>
<td>4</td>
</tr>
<tr class="even">
<td><strong>Predicted Benign</strong></td>
<td>3</td>
<td>57</td>
</tr>
</tbody>
</table>
</br>




Typically we consider one of the class labels as "positive" - in this case the "Malignant" status is more interesting to researchers, hence we consider that label as "positive".

In this matrix, observations are sorted into the four cells based on their true class and predicted class. Each cell gives the total count of observations with a particular combination of true/predicted class. There's a tonne of information you can learn from this table. For example, it shows us how often predictions are wrong, broken down by the predicted class. It also shows us how likely it is that a person with malignant cancer doesn't get diagnosed correctly.

Downside: a confusion matrix contains super useful information, but we want to summarize it in a way that makes comparisons easy.
To introduce summary measures of the confusion matrix in general, let us consider one class label as "positive" - usually the one considered more interesting, e.g. "Malignant".

Relabeling the above confusion matrix: 

</br>

<table>
<thead style="font-size: 40px";>
<tr class="header">
<th></th>
<th>Truly Positive</th>
<th>Truly Negative</th>
</tr>
</thead>
<tbody style="font-size: 40px";>
<tr class="odd">
<td><strong>Predicted Positive</strong></td>
<td>1</td>
<td>4</td>
</tr>
<tr class="even">
<td><strong>Predicted Negative</strong></td>
<td>3</td>
<td>57</td>
</tr>
</tbody>
</table>
</br>

Note that:
* Top left cell = # correct positive predictions.
* Top *row* = # total positive predictions.
* Left *column* = # truly positive observations.




### Precision and Recall


$$
{Precision}  = \dfrac{{\#\; correct\; positive\; predictions}}{{\#\; total\; positive\; predictions}} \quad\quad\quad \quad {Recall}  = \dfrac{{\#\; correct\; positive\; predictions}}{{\#\; total\; truly\; positive\; observations}}
$$

</br>

In the above confusion matrix, precision = 1/(1+4) = 0.2 and recall = 1/(1+3) = 0.25.

Precision quantifies how many of the positive predictions the classifier made were actually positive. Intuitively, we would like a classifier to have a high precision: for a classifier with high precision, if the classifier reports that a new observation is positive, we can trust that the new observation is indeed positive. 

Recall quantifies how many of the positive observations in the test set were identified as positive. Intuitively, we would like a classifier to have a high recall: for a classifier with high recall, if there is a positive observation in the test data, we can trust that the classifier will find it.
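
As a small worked example, the three metrics can be computed directly from the four cells of the relabeled confusion matrix above. The variable names are chosen here for illustration and are not part of any library.

```python
# Counts taken from the relabeled confusion matrix above.
true_positives = 1    # predicted positive, truly positive
false_positives = 4   # predicted positive, truly negative
false_negatives = 3   # predicted negative, truly positive
true_negatives = 57   # predicted negative, truly negative

total = true_positives + false_positives + false_negatives + true_negatives

# Accuracy: all correct predictions over all predictions.
accuracy = (true_positives + true_negatives) / total
# Precision: top-left cell over the top row.
precision = true_positives / (true_positives + false_positives)
# Recall: top-left cell over the left column.
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy  = {accuracy:.3f}")   # 58 / 65
print(f"precision = {precision:.3f}")  # 1 / 5
print(f"recall    = {recall:.3f}")     # 1 / 4
```

Note how the accuracy is fairly high even though both precision and recall are low for the "positive" class.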

### How good is good and which metric's more important?

...is application context dependent. Use your judgement.

</br>

For example:
- a 99% accuracy on cancer prediction may not be very useful. Why?
- If we need patients with truly malignant cancer to be diagnosed correctly, what metric should we prioritize?
- What if a classifier never guesses positive except for the very few observations it is super confident in? What metric is affected?


Ask students to discuss the three questions on the slides with their neighbor:

- a 99% accuracy on cancer prediction may not be very useful. Why?
    - Because malignant cancer samples are much less common, a 99% accuracy might just mean that the classifier is always predicting "benign": if 99% of the cases in the data are "benign", then always predicting "benign" also gives 99% accuracy.
- If we need patients with truly malignant cancer to be diagnosed correctly, what metric should we prioritize?
    - The patients with truly malignant cancer are the "total truly positive observations" denominator in the "Recall" equation above. If we mostly care about these patients being correctly classified, we need to prioritize a high recall.
- What if a classifier never guesses positive except for the very few observations it is super confident in? What metric is affected?
    - This will lead to fewer "total positive predictions", which is the denominator in the "Precision" equation, so the Precision score will tend to be higher.
    - It could also affect the number of "correct positive predictions", which would affect Precision, Recall, and Accuracy, but we don't have enough information in the prompt to tell for sure: this strategy could lead to the same number of correct positive predictions and just fewer incorrect positive predictions, or it could lead to fewer correct positive predictions as well.
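
To make the first discussion point concrete, here is a minimal sketch with made-up data (990 benign and 10 malignant observations, an assumption for illustration only) and a "classifier" that always predicts the majority class.

```python
# Hypothetical data set: 990 benign and 10 malignant observations.
true_labels = ["Benign"] * 990 + ["Malignant"] * 10

# A "classifier" that ignores its input and always predicts the majority class.
predictions = ["Benign"] * len(true_labels)

pairs = list(zip(true_labels, predictions))

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Recall with "Malignant" as the positive label.
truly_positive = sum(t == "Malignant" for t in true_labels)
correct_positive = sum(t == "Malignant" and p == "Malignant" for t, p in pairs)
recall = correct_positive / truly_positive

# Precision is undefined here: the classifier makes no positive predictions.
total_positive_predictions = sum(p == "Malignant" for p in predictions)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
print("positive predictions made:", total_positive_predictions)
```

The accuracy looks excellent, yet not a single malignant case is found, which is why recall matters when missing a positive case is costly.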

## Adding evaluation to the pipeline for building a classifier

To add evaluation into our classification pipeline, we:

1. Split our data into two subsets: *training data* and *testing data*.
2. Build the model & choose K using training data only (sometimes called tuning)
3. Compute performance metrics (accuracy, precision, recall, etc.) by predicting labels on testing data only

We'll now talk about each step individually.

<center>
<img src="https://python.datasciencebook.ca/_images/training_test.png" width="1100"/>
</center>

## Tuning and evaluating the Model

<center>
<img src="https://python.datasciencebook.ca/_images/ML-paradigm-test.png" width="1700"/>
</center>


**Golden Rule of Machine Learning / Statistics:** *Don't use your testing data to train your model!*

## Why?

Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is<br>

- "training your model" includes choosing K, choosing predictors, choosing the model, scaling/centering variables, etc!

<br>
<br>
<center>
<img width="400px" src="https://media2.giphy.com/media/12vJgj7zMN3jPy/giphy.gif"/>
</center>



## Splitting Data

There are two important things to do when splitting data.

1. **Shuffling:** randomly reorder the data before splitting
2. **Stratification:** make sure the two split subsets of data have roughly equal proportions of the different labels


<center>
<img src="https://python.datasciencebook.ca/_images/training_test.png" width="500"/>
</center>


**Why?** 

(`sklearn`'s `train_test_split` thankfully shuffles by default, and stratifies when you pass the labels to its `stratify` argument)
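
A minimal sketch of a shuffled, stratified split with scikit-learn's `train_test_split`. The data are made up for illustration (one dummy predictor, 80 "Benign" and 20 "Malignant" labels), and the 75/25 split and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 "Benign" rows followed by 20 "Malignant" rows.
X = np.arange(100).reshape(-1, 1)  # one made-up predictor
y = np.array(["Benign"] * 80 + ["Malignant"] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.75,   # 75% training, 25% testing
    shuffle=True,      # the default: randomly reorder before splitting
    stratify=y,        # keep label proportions similar in both subsets
    random_state=1,    # arbitrary seed so the split is reproducible
)

train_fraction = np.mean(y_train == "Malignant")
test_fraction = np.mean(y_test == "Malignant")
print("malignant fraction in training data:", train_fraction)
print("malignant fraction in testing data: ", test_fraction)
```

Without shuffling, the ordered data above would put every "Malignant" row at the end of the file and therefore mostly in the testing data; without stratification, a random split could still leave one subset with too few malignant cases.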


## Choosing K (or, "tuning" the model)

Want to choose K to maximize accuracy, but:
- we can't use test data to evaluate performance (cheating!)
- we can't use training data to evaluate performance (choosing K is part of training!)

**Solution:** Split the training data further into a *training set* and a *validation set*

<br>2a. Choose some candidate values of K
<br>2b. Split the **training data** into two sets - one called the **training set**, another called the **validation set**
<br>2c. For each K, train the model using **training set only**
<br>2d. Evaluate accuracy (and/or other metrics of performance) for each K using **validation set only**
<br>2e. Pick the K that maximizes validation accuracy
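
Steps 2a to 2e can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the cancer data from the slides), and the candidate values of K, the split proportion, and the seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (an assumption for illustration).
X_training, y_training = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# 2a. Candidate values of K (an arbitrary choice).
candidate_ks = [1, 5, 15, 45]

# 2b. Split the training data into a training set and a validation set.
X_subtrain, X_valid, y_subtrain, y_valid = train_test_split(
    X_training, y_training, train_size=0.75, stratify=y_training, random_state=0
)

validation_accuracy = {}
for k in candidate_ks:
    # 2c. Train using the training set only.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_subtrain, y_subtrain)
    # 2d. Evaluate using the validation set only.
    validation_accuracy[k] = knn.score(X_valid, y_valid)

# 2e. Pick the K with the highest validation accuracy.
best_k = max(validation_accuracy, key=validation_accuracy.get)
print(validation_accuracy)
print("best K:", best_k)
```

The testing data never appear in this code: the choice of K is made entirely inside the training data.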

*But what if we get a bad training set? Just by chance?*

## Cross-Validation

We can get a better estimate of performance by splitting *multiple ways* and *averaging*

<center>
<img src="https://python.datasciencebook.ca/_images/cv.png" width="1100"/>
</center>
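
A minimal sketch of 5-fold cross-validation for a single value of K, using scikit-learn's `cross_val_score`. The synthetic data, K = 5, and the number of folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (an assumption for illustration).
X_training, y_training = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: each fold takes one turn as the validation set
# while the model is trained on the remaining four folds.
fold_accuracies = cross_val_score(knn, X_training, y_training, cv=5)
mean_accuracy = fold_accuracies.mean()

print("per-fold accuracy:", fold_accuracies)
print("mean accuracy:    ", mean_accuracy)
```

The per-fold accuracies usually differ from one another, which is exactly the chance variation that averaging is meant to smooth out.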

## Underfitting & Overfitting


**Overfitting:** when your model is too sensitive to your training data; noise can influence predictions!

**Underfitting:** when your model isn't sensitive enough to training data; useful information is ignored!


<center>
<img width="800" src="img/under_over_fitting.png">
</center>

Source: http://kerckhoffs.schaathun.net/FPIA/Slides/09OF.pdf

## Underfitting & Overfitting
**Which of these are under-, over-, and good fits?** 

<center>
<img width="1200px" src="https://datasciencebook.ca/_main_files/figure-html/06-decision-grid-K-1.png"/>
</center>



Ask students to discuss with their neighbor.

- K = 1 and K = 7 overfit, as can be seen from the wiggly decision boundary.
- K = 20 looks pretty decent
- K = 300 underfits and misses many of the malignant observations.

## Underfitting & Overfitting
For KNN: small K overfits, large K underfits, both cause lower accuracy

<center>
<img src="https://datasciencebook.ca/_main_files/figure-html/06-lots-of-ks-1.png"/>
</center>
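
A plot like the one above can be approximated by estimating cross-validation accuracy over a range of K values. The sketch below uses `GridSearchCV` on synthetic data with some label noise; the data, the grid of K values, and the number of folds are all assumptions for illustration, so the numbers will not match the figure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data, with 10% of labels flipped as noise.
X_training, y_training = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0,
)

# Candidate values of K: 1, 21, 41, ..., 181.
k_values = list(range(1, 200, 20))
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": k_values}, cv=5)
grid.fit(X_training, y_training)

# Mean cross-validation accuracy for each candidate K.
ks = grid.cv_results_["param_n_neighbors"]
mean_accuracies = grid.cv_results_["mean_test_score"]
for k, acc in zip(ks, mean_accuracies):
    print(f"K = {int(k):3d}: cross-validation accuracy = {acc:.3f}")

print("best K:", grid.best_params_["n_neighbors"])
```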

## The Big Picture

<center>
<img align="left" src="https://python.datasciencebook.ca/_images/train-test-overview.png" width="700"/></center>
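
The whole workflow can be sketched end to end as below. This is one possible arrangement, not the only one. It assumes the breast cancer data set bundled with scikit-learn as a stand-in (where label 0 is malignant and 1 is benign), and the split proportion, grid of K values, and seed are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data set bundled with scikit-learn; label 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# 1. Split into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=123
)

# 2. Tune K with cross-validation on the training data only.
#    Scaling lives inside the pipeline, so it is refit within each fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 30, 2))}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

# 3. Evaluate the tuned model once on the testing data.
#    "Malignant" (label 0) is treated as the positive class.
y_pred = grid.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_precision = precision_score(y_test, y_pred, pos_label=0)
test_recall = recall_score(y_test, y_pred, pos_label=0)

print("chosen K:", grid.best_params_["kneighborsclassifier__n_neighbors"])
print(f"test accuracy = {test_accuracy:.3f}")
print(f"test precision = {test_precision:.3f}, test recall = {test_recall:.3f}")
```

Putting the scaler inside the pipeline keeps the golden rule intact: neither the validation folds nor the testing data influence the centering and scaling used during training.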

## Worksheet Time! Go for it!

1. Go to your **project groups** 
2. You can work on the worksheet or discuss your group project proposal if needed 

## Class Activity

In your group, discuss the following prompts. 
- Explain what a test, validation and training data set are in your own words
- Explain cross-validation in your own words
- Imagine if we train *and* evaluate accuracy on all the data. **How can I get 100% accuracy, *always*?**
- Why can't I use cross validation when testing?