# The One Goal for Today

Understand different ways to evaluate models.

## Evaluating regression models

*What are three types of regression model we know how to fit?*

*How do we evaluate the performance of a regression model?*

## Evaluating clustering models

*What is one type of clustering method we know how to fit?*

Clustering is unsupervised; this makes evaluation harder because there is no *ground truth* (there are no *labels*). 

One thing we can calculate without labels is the **silhouette coefficient**. The silhouette coefficient is calculated as:
$$SC = \frac{1}{N} \sum_{i=1}^{N} \frac{b_i - a_i}{\max(a_i, b_i)}$$

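where $N$ is the number of datapoints, $a_i$ is the average distance between the $i$th datapoint and all other datapoints in its cluster, and $b_i$ is the average distance between the $i$th datapoint and all datapoints in the nearest cluster it does not belong to. The silhouette coefficient ranges from -1 to 1; values near 1 indicate tight, well-separated clusters.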
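To make the formula concrete, here is a minimal sketch that computes the silhouette coefficient directly from the definitions of $a_i$ and $b_i$. It assumes Euclidean distance, at least two clusters, and that a datapoint alone in its cluster contributes 0; the four points and their cluster labels are made up for illustration.

```python
from math import dist

def silhouette_coefficient(points, labels):
    """Mean of (b_i - a_i) / max(a_i, b_i) over all datapoints."""
    # Group datapoint indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_distance(i, members):
        return sum(dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # assumption: a singleton cluster contributes 0
        a = mean_distance(i, own)
        # b: smallest average distance to any other cluster (needs >= 2 clusters)
        b = min(mean_distance(i, members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Made-up data: two tight pairs of points that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # sensible clustering
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # clusters cut across the pairs
print(good, bad)
```

The sensible clustering scores close to 1, while the clustering that splits each pair scores below 0.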

For more evaluation metrics for clustering, including ones that require you to obtain labels for some data points, see:
https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation.


## Evaluating binary classification models

Our Craigslist car dataset includes listings for two car manufacturers: Hyundai and Kia. Let's imagine that we train a kNN classifier to distinguish Hyundais (H) from Kias (K), and that these are the results for ten of the datapoints in the test data:

| Item | $y$ | $\hat{y}$ |
| ---- | --- | -------- |
|  0   |  H  |   H |
|  1   |  H  |   H |
|  2   |  H  |   K |
|  3   |  H  |   H |
|  4   |  H  |   K |
|  5   |  K  |   H |
|  6   |  K  |   K |
|  7   |  K  |   K |
|  8   |  K  |   K |
|  9   |  K  |   K |

### Accuracy

For k-nearest neighbors, so far we have evaluated using **accuracy**: the percentage of data points for which the predicted class is the same as the actual class. 
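As a quick sketch, accuracy can be computed by counting matches between the actual and predicted labels. The lists below copy the Craigslist results table; the function name `accuracy` is just a choice made for this example.

```python
# Actual (y) and predicted (yhat) labels from the Craigslist results table.
y    = ["H", "H", "H", "H", "H", "K", "K", "K", "K", "K"]
yhat = ["H", "H", "K", "H", "K", "H", "K", "K", "K", "K"]

def accuracy(y, yhat):
    """Fraction of items whose predicted class equals the actual class."""
    matches = sum(actual == predicted for actual, predicted in zip(y, yhat))
    return matches / len(y)

acc = accuracy(y, yhat)
print(f"accuracy = {acc:.0%}")
```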

*Using the Craigslist result table, what is the accuracy of our model?*

### Confusion matrix

Accuracy is a nice simple metric, but it doesn't help us understand *which* data points are being misclassified, which might be important for improving the model or deciding whether to deploy the model. For example, if a model for car logo identification works great on every manufacturer other than Kia, then maybe we focus on getting better data for Kias. Or if a model for determining which students to admit to Colby does a good job for white and Asian students, but a terrible job for Black students, maybe we *do not deploy that model*. What *can* help us dig deeper into the performance of a model is a **confusion matrix**.

For binary classification (two labels), pick one class to be 'positive' and the other 'negative'; then a confusion matrix looks like:

| Total population = P+N | Predict positive | Predict negative | 
| -- | --- | --- |
| **Actual positive** (P) | TP | FN | 
| **Actual negative** (N) | FP | TN | 

*Using the Craigslist results table, what is the confusion matrix for our model?*



### TPR, FPR, Precision, Recall, F

Once we have a confusion matrix we can calculate interesting things from it, including:
1. **True Positive Rate (TPR)**: TPR = TP/(TP+FN) (this is also called Recall, R)
2. **False Positive Rate (FPR)**: FPR = FP/(FP+TN)
3. **Accuracy** (!): ACC = (TP+TN)/(P+N)
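4. **Precision (P)**: P = TP/(TP+FP) (here P means precision, not the count of actual positives used in the accuracy formula)
5. **F1**: F1 = 2×((P×R)/(P+R)), the harmonic mean of precision and recall (we call this F1 because it is the β = 1 case of the more general F score, Fβ = (1+β²)×((P×R)/(β²×P+R)); other values of β weight precision and recall differently)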
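F1 is one member of a family of F scores. The sketch below shows the general form, assuming the standard Fβ definition with a weight `beta`: values above 1 favor recall and values below 1 favor precision. The precision and recall values are hypothetical, and returning 0 when both are 0 is a convention chosen for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """General F score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # assumption: define F as 0 when both inputs are 0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical values, chosen only for illustration: perfect recall, poor precision.
p, r = 0.5, 1.0
f1 = f_beta(p, r)                 # beta = 1: harmonic mean of p and r
f2 = f_beta(p, r, beta=2.0)       # weights recall more heavily
f_half = f_beta(p, r, beta=0.5)   # weights precision more heavily
print(f1, f2, f_half)
```

Because recall is higher than precision here, the recall-weighted score is the largest of the three and the precision-weighted score is the smallest.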

*Using the confusion matrix you created, what are the TPR (R), FPR, Accuracy, P and F1 for our model?*

## Multiclass classification

All those great metrics for binary classifiers do have multiclass equivalents, but they require a small mental leap.

To create a confusion matrix for a multiclass classifier, we have to think of it as a combination (or *ensemble*) of binary classifiers, either:
* One-vs-rest (one-vs-all) - one binary classifier for each class, with the positive examples being data points in the class and the negative examples being data points in any other class. 
* One-vs-one - one binary classifier for each pair of classes.
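To see what each scheme looks like in practice, this small sketch lists the imagined binary classifiers for four classes. The class names are hypothetical stand-ins, not the contents of the car logo dataset.

```python
from itertools import combinations

# Hypothetical class names, for illustration only.
classes = ["Hyundai", "Kia", "Toyota", "Ford"]

# One-vs-rest: each class in turn is 'positive', everything else is 'negative'.
one_vs_rest = [(c, "rest") for c in classes]

# One-vs-one: one classifier for every unordered pair of classes.
one_vs_one = list(combinations(classes, 2))

for positive, negative in one_vs_rest:
    print(f"{positive} vs {negative}")
for first, second in one_vs_one:
    print(f"{first} vs {second}")
```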

Note: depending on the ML algorithm, we don't actually have to *fit* a bunch of binary classifiers; we just have to *imagine that we did*. For example, for kNN we only fit one model regardless of the number of classes.

I like one-vs-rest, for a reason that will become clear in a minute.

Our car logo dataset has 34 car logos in it.

Questions:
1. *For 34 classes, how many classifiers would we (mentally) fit for one-vs-rest?*
2. *For 34 classes, how many classifiers would we (mentally) fit for one-vs-one?*

Once we have (mentally) fit binary classifiers, we can create one confusion matrix per model. Then we can calculate all the metrics for each class.
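A minimal sketch of the one-vs-rest version: for each class, treat that class as positive and every other class as negative, then tally the four cells of the confusion matrix. The three-class labels (H, K, T) and the function name `one_vs_rest_counts` are made up for this example.

```python
def one_vs_rest_counts(y, yhat, positive):
    """Tally TP, FN, FP, TN with `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y, yhat):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Hypothetical three-class results, for illustration only.
y    = ["H", "H", "K", "K", "T", "T"]
yhat = ["H", "K", "K", "K", "T", "H"]

# One confusion matrix per class.
per_class = {c: one_vs_rest_counts(y, yhat, c) for c in sorted(set(y))}
for c, counts in per_class.items():
    print(c, counts)
```

Each per-class dictionary can then be fed into the TPR, FPR, precision and F1 formulas to get one set of metrics per class.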

Now let's have a stab at writing the code to get a confusion matrix and calculate TPR, FPR, P, Accuracy and F1 from it.

In [1]:
import numpy as np

# this captures the Craigslist results table; we will use 0 for H and 1 for K
y = np.array([0,0,0,0,0,1,1,1,1,1])
yhat = np.array([0,0,1,0,1,0,1,1,1,1])

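# One possible implementation. Assumption: K (label 1) is the positive class.
def confusion_matrix(y, yhat, positive=1):
    """Return [[TP, FN], [FP, TN]], matching the table layout above."""
    tp = np.sum((y == positive) & (yhat == positive))
    fn = np.sum((y == positive) & (yhat != positive))
    fp = np.sum((y != positive) & (yhat == positive))
    tn = np.sum((y != positive) & (yhat != positive))
    return np.array([[tp, fn], [fp, tn]])

def true_positive_rate(cm):
    # TPR = TP / (TP + FN)
    return cm[0, 0] / (cm[0, 0] + cm[0, 1])

def false_positive_rate(cm):
    # FPR = FP / (FP + TN)
    return cm[1, 0] / (cm[1, 0] + cm[1, 1])

def accuracy(cm):
    # ACC = (TP + TN) / (P + N)
    return (cm[0, 0] + cm[1, 1]) / cm.sum()

def precision(cm):
    # P = TP / (TP + FP)
    return cm[0, 0] / (cm[0, 0] + cm[1, 0])

def recall(cm):
    # Recall is another name for the true positive rate
    return true_positive_rate(cm)

def f1(cm):
    # F1 = 2 * (P * R) / (P + R)
    p, r = precision(cm), recall(cm)
    return 2 * p * r / (p + r)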