Classification Metrics
---
<center><img src="https://i.pinimg.com/736x/18/c0/36/18c036f262ef322194553462279e5bbf.jpg" width="100%"/></center>

Why do we care about evaluation metrics?
-----

Evaluation metrics help select better, aka more useful, models.

Are these good parameter estimates?     
Which hyperparameters are better?   
Which algorithm should we use?  

What are common evaluation metrics for regression?
-----

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)

What are common evaluation metrics for classification?
-----

- Accuracy 
- Recall
- Precision
- F-score

By the end of this session, you should be able to:
---

- List common classification metrics
- Explain the limitations of accuracy as a metric
- Construct a confusion matrix
- Extend confusion matrix beyond binary classification
- Define precision, recall, and F score
- Draw and explain a ROC curve
- Define AUC

Accuracy
------

$$Accuracy = \frac{All\ Correct}{Total}$$

- Fraction of observations classified correctly
- 1 - error rate
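
A minimal sketch of the formula above in plain Python. The labels are hypothetical, invented only for illustration, and `accuracy` is an illustrative helper name rather than a library function.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, invented for illustration.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]

acc = accuracy(y_true, y_pred)   # 4 of 5 correct
error_rate = 1 - acc             # accuracy and error rate sum to 1
print(f"accuracy={acc:.2f}, error rate={error_rate:.2f}")
```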

What is the biggest limitation of accuracy?
-------

Accuracy is an overall measure: it ignores which classes were predicted correctly, so it does not tell you what "types" of errors your classifier is making.

It is also affected by class imbalance, when one class is much larger than another.

Null Accuracy
-------

Accuracy that could be achieved by always predicting the most frequent class
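
A rough sketch of null accuracy, assuming a hypothetical dataset with 95 negatives and 5 positives. `null_accuracy` is an illustrative helper, not a library call.

```python
from collections import Counter

def null_accuracy(y_true):
    """Accuracy of a baseline that always predicts the most frequent class."""
    counts = Counter(y_true)
    label, count = counts.most_common(1)[0]
    return label, count / len(y_true)

# Hypothetical imbalanced labels, invented for illustration.
y_true = ["negative"] * 95 + ["positive"] * 5

label, baseline = null_accuracy(y_true)
print(f"always predict {label!r}: accuracy={baseline:.2f}")
```

Any model worth deploying on this data should beat the baseline, which is why a high accuracy alone says little under class imbalance.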

Check for Understanding
-----

Draw a confusion matrix

Confusion Matrix
------

<center><img src="http://i.stack.imgur.com/ysM0Z.png" width="40%"/></center>

- True Positives (TP): correctly predicted a successful outcome / one label
- True Negatives (TN): correctly predicted a lack of an outcome / the other label
- False Positives (FP): incorrectly predicted a successful outcome (a "Type I error")
- False Negatives (FN): incorrectly predicted a lack of an outcome (a "Type II error")
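
The four cells defined above can be counted directly. This is a minimal sketch with hypothetical 0/1 labels, where 1 is assumed to be the positive class. `confusion_counts` is an illustrative name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative (Type I error)
        elif t == positive:
            fn += 1   # predicted negative, actually positive (Type II error)
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```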


<center><img src="https://chemicalstatistician.files.wordpress.com/2014/05/pregnant.jpg?w=500" width="75%"/></center>

[Source](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

Let's classify movies as "RomCom" or not...
------

<center><img src="images/rom1.png" width="45%"/></center>

Extension beyond 2 groups
---

<center><img src="images/rom2.png" width="45%"/></center>

The classifier misclassifies RomCom movies as Comedy more often than as Drama.

Always look at the confusion matrix when analyzing your results: it gives strong clues about where your classifier is going wrong.

Check for understanding
----

How does a confusion matrix scale as a function of the number of classes (k)?

If there are 10 classes, how many cells does the confusion matrix have?

Number of Classes -> Number of Cells 
------

- 2 -> 4
- 3 -> 9
- 4 -> 16
- …
- 10 -> 100

The number of cells in a confusion matrix scales as __k<sup>2</sup>__.
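
A small sketch of a k-by-k confusion matrix, assuming rows are actual classes and columns are predicted classes. The movie labels below are hypothetical and are not the numbers from the slide images.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a k-by-k matrix: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    k = len(labels)
    matrix = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical labels, invented for illustration.
labels = ["RomCom", "Comedy", "Drama"]
y_true = ["RomCom", "RomCom", "Comedy", "Drama", "RomCom", "Drama"]
y_pred = ["RomCom", "Comedy", "Comedy", "Drama", "Comedy", "RomCom"]

matrix = confusion_matrix(y_true, y_pred, labels)
n_cells = sum(len(row) for row in matrix)   # k * k cells

for label, row in zip(labels, matrix):
    print(f"{label:>7}: {row}")
```

The diagonal holds the correct predictions; every off-diagonal cell is a distinct type of error.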

<center><img src="http://www.bluemontlabs.com/images/statistical-classification-metrics.png" width="80%"/></center>

<center><img src="images/roc.png" width="100%"/></center>

<center><img src="http://www.info.univ-angers.fr/~gh/Predipath/confus1.png" width="75%"/></center>

Precision
------

$$Precision = \frac{Class\ Correct}{Class\ Total\ Predicted}$$

Fraction of labeled items assigned to a class that are actually members of that class

Recall
-----

$$Recall = \frac{Class\ Correct}{Class\ Total\ Actual}$$

Fraction of labeled items in a class that are classified correctly
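
A minimal sketch of both formulas, computed one class at a time. The labels are hypothetical, and returning `None` when a denominator is zero is a choice made for this sketch, not a standard.

```python
def precision_recall(y_true, y_pred, cls):
    """Per-class precision and recall; None when the denominator is zero."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)   # Class Total Predicted
    actual = sum(1 for t in y_true if t == cls)      # Class Total Actual
    precision = correct / predicted if predicted else None
    recall = correct / actual if actual else None
    return precision, recall

# Hypothetical labels, invented for illustration.
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "cat", "dog", "dog", "dog", "dog"]

cat_precision, cat_recall = precision_recall(y_true, y_pred, "cat")
dog_precision, dog_recall = precision_recall(y_true, y_pred, "dog")
print(f"cat: precision={cat_precision:.2f}, recall={cat_recall:.2f}")
print(f"dog: precision={dog_precision:.2f}, recall={dog_recall:.2f}")
```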

Example
-----

<center><img src="images/results.png" width="85%"/></center>

The data is the labeled ground truth.

Let's compare models: 1 vs 2.

<center><img src="images/results.png" width="85%"/></center>

What is the base rate for red?

20% (2 reds out of 10 total)

<center><img src="images/results.png" width="85%"/></center>

What is the accuracy of Model 1? 

80% accurate 

Model 1 predicts all blue and gets all the blue dots correct. Misses the 2 reds.

<center><img src="images/results.png" width="85%"/></center>

What is the recall of Model 1?

Each class should be calculated separately!

0% recall for red. The model fails to label any true red as red.  
100% recall for blue. The model labels all true blues as blue.

You can also weight recall for multi-class classification. Read more [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
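
The Model 1 numbers can be checked with a short sketch. It assumes only what the slides state: 8 blue and 2 red points, with Model 1 predicting blue for everything. The order of the points is arbitrary, and the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of points labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, cls):
    """Fraction of true members of cls that the model labels as cls."""
    actual = sum(1 for t in y_true if t == cls)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return correct / actual

y_true = ["blue"] * 8 + ["red"] * 2   # ground truth from the example
model_1 = ["blue"] * 10               # Model 1 predicts all blue

acc = accuracy(y_true, model_1)
recall_red = recall(y_true, model_1, "red")
recall_blue = recall(y_true, model_1, "blue")
print(f"accuracy={acc:.0%}, recall red={recall_red:.0%}, recall blue={recall_blue:.0%}")
```

Precision for red is undefined here, because Model 1 never predicts red.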

<center><img src="images/results.png" width="85%"/></center>

What is the accuracy of Model 2? 

70% accurate  

<center><img src="images/results.png" width="85%"/></center>

What is the recall of Model 2?

100% recall for red. The model labels all true red as red.   
75% recall for blue. The model labels 6 out of 8 true blues as blue.

<center><img src="images/results.png" width="85%"/></center>

Which model would you deploy into production?

Extending to more than 2 groups
------
<center><img src="images/p_r.png" width="90%"/></center>

ROC (receiver operating characteristic) curve 
----
<center><img src="images/roc_first.png" width="50%"/></center>

ROC curve & Thresholds
----

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ROC_curves.svg/300px-ROC_curves.svg.png" width="55%"/></center>
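
A rough sketch of how the points of a ROC curve come from sweeping a threshold, assuming the usual axes: false positive rate on x, true positive rate on y. The scores are hypothetical, and `roc_points` is an illustrative helper, not a library function.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, for 0/1 labels."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical labels and model scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): more true positives are caught, at the price of more false positives.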

ROC curve to compare models
----

<center><img src="images/roc_2.png" width="65%"/></center>

Model A is strictly better than Model B

ROC curve to compare models
----

<center><img src="images/Roccurves.png" width="50%"/></center>

In real life, different models will do better at different thresholds.

AUC: Area Under the Curve
-----

<center><img src="https://i.stack.imgur.com/9NpXJ.png" width="55%"/></center>

A single metric that summarizes the ROC curve across all thresholds.

Check for understanding
-----

What is the AUC of a binary classifier that guesses randomly, with even base rates?

50%

What is the highest possible AUC? What does the ROC curve look like?

100%
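
One way to check both answers is the pairwise view of AUC: the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counted as half. This is a sketch with hypothetical scores; `auc` is an illustrative helper, not a library function.

```python
def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0    # positive ranked above negative
            elif p == n:
                wins += 0.5    # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0]

perfect = auc(y_true, [0.9, 0.8, 0.3, 0.7, 0.2, 0.1])        # every positive outranks every negative
uninformative = auc(y_true, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])  # same score for every point
imperfect = auc(y_true, [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])      # one negative outranks one positive
print(perfect, uninformative, round(imperfect, 3))
```

A model that carries no information about the label lands at 0.5, and a model that separates the classes perfectly reaches 1.0, where the ROC curve hugs the top-left corner.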

F<sub>1</sub> score
-----

$$F_1\ Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

A single metric that combines precision and recall.

In Machine Learning, we want a single metric.
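
A minimal sketch of the F<sub>1</sub> formula. The precision and recall values are hypothetical, and returning 0 when both inputs are 0 is a convention assumed for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0   # assumed convention for the undefined 0/0 case
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall pairs, invented for illustration.
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced F1={balanced:.3f}")
print(f"lopsided F1={lopsided:.3f} vs arithmetic mean={arithmetic_mean:.3f}")
```

Because F<sub>1</sub> is a harmonic mean, a model cannot score well by maximizing one of the two metrics while neglecting the other.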

Generalized F score
-----

<center><img src="images/f_score_2.png" width="75%"/></center>

F<sub>1</sub> weighs recall and precision equally.

F<sub>0.5</sub> weighs recall lower than precision (by reducing the influence of false negatives).

F<sub>2</sub> weighs recall higher than precision (by placing more emphasis on false negatives).
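
A sketch of the generalized score, assuming the standard definition F<sub>β</sub> = (1 + β²) · P · R / (β² · P + R). The precision and recall values are hypothetical.

```python
def fbeta_score(precision, recall, beta):
    """Generalized F score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0   # assumed convention for the undefined 0/0 case
    return (1 + b2) * precision * recall / denominator

# Hypothetical model with high precision and modest recall.
precision, recall = 0.9, 0.5

f_half = fbeta_score(precision, recall, beta=0.5)
f_one = fbeta_score(precision, recall, beta=1.0)
f_two = fbeta_score(precision, recall, beta=2.0)
print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

For this high-precision, low-recall model, the score drops as β grows, because recall counts for more.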

Which metrics should you focus on?
------

Choice of metric depends on your business objective

You can define custom evaluation metrics
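
As one illustration of a custom metric, here is a sketch of an average misclassification cost. The costs (a false negative is assumed to be 10 times as expensive as a false positive) and the labels are hypothetical. This is not the Airbnb metric.

```python
def average_cost(y_true, y_pred, fp_cost, fn_cost, positive=1):
    """Average cost per observation, given a cost for each error type."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    return (fp * fp_cost + fn * fn_cost) / len(y_true)

# Hypothetical labels and predictions, invented for illustration.
y_true = [1, 0, 0, 1, 0]
model_a = [1, 1, 0, 1, 0]   # one false positive
model_b = [0, 0, 0, 1, 0]   # one false negative

cost_a = average_cost(y_true, model_a, fp_cost=1, fn_cost=10)
cost_b = average_cost(y_true, model_b, fp_cost=1, fn_cost=10)
print(f"model A cost={cost_a:.1f}, model B cost={cost_b:.1f}")
```

Both models have the same accuracy (4 of 5 correct), yet the custom metric separates them because it prices the two error types differently.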

Airbnb custom metrics
------
<center><img src="https://adriancolyer.files.wordpress.com/2018/09/airbnb-fig-4.jpeg?w=480" width="55%"/></center>
<center><img src="https://adriancolyer.files.wordpress.com/2018/09/airbnb-fig-7.jpeg?w=480" width="55%"/></center>

Source: https://blog.acolyer.org/2018/10/03/customized-regression-model-for-airbnb-dynamic-pricing/

Summary, part I
---

- The most common classification metrics:
    - Accuracy
    - Precision
    - Recall
    - F-score
- Accuracy is not accurate if there are class imbalances. In real DS, there are __always__ class imbalances.
- A confusion matrix is an awesome way to visualize error types.
- Confusion matrices can be extended to multinomial classification.
- Brian ♥️s confusion matrices so always bring him one when asking for advice.


Summary, part II
---

- Precision measures whether the labels a classifier assigns are right.
- Recall measures whether a classifier misses members of a class.
- F score combines precision and recall.
- A ROC curve visualizes how changing a threshold impacts model performance.
- AUC is the area under the ROC curve.

<center><img src="images/break.png" width="55%"/></center>

<br>

Bonus Material
----

Precision recall curves
-----

![](images/pr.png)

Precision vs recall as we vary the threshold τ.

This curve can be summarized as a single number using the mean precision (averaging over recall values), which approximates the area under the curve. 
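
A rough sketch of that summary number, often called average precision: rank the points by score and average the precision at each rank where a true positive appears. It assumes 0/1 labels and no tied scores. The data is hypothetical, and `average_precision` is an illustrative helper.

```python
def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k where a true positive appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank   # precision when the threshold sits at this score
    return total / n_pos

# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ap = average_precision(y_true, scores)
print(f"average precision={ap:.3f}")
```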