In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)

import warnings
warnings.simplefilter('ignore')

# Lecture 27 – Classifier Evaluation and Fairness

## DSC 80, Spring 2022

### Announcements
- The Final Exam is on **Saturday, June 4th from 11:30AM-2:30PM in-person**!
    - See [this Campuswire post](https://campuswire.com/c/G325FA25B/feed/1754) for all the details, **including seating assignments and charts**.
    - Lectures 1-26 (including some of today's lecture), Projects 1-5, Labs 1-9, and Discussions 1-8 are all in scope.
    - **Discussion today is office hours; a [Fall 2021 Final walkthrough video](https://www.youtube.com/watch?v=8JZ71x-gr8E) was posted.**
- Project 5 is due on **Thursday, June 9th at 11:59PM**!
- If at least 80% of the class fills out both [CAPEs](https://cape.ucsd.edu/) and the [End-of-Quarter Survey](https://docs.google.com/forms/d/e/1FAIpQLSepSEBy0KC1-RHGF6dixYKZ-2p3SVdiPHB9spXPlA6PZNUy4A/viewform), then everyone will receive an extra 0.5% added to their overall course grade. 
    - Deadline: **Saturday at 8AM**.
    - Currently at ~40%.
- The Grade Report will be updated before Friday's lecture, with Lab 8-9 and Project 4 grades.

### Agenda

- Classifier evaluation.
    - Last in-scope topic for the final.
- Example: Tumor malignancy prediction (via logistic regression).
- Fairness.

## Classifier evaluation

### Recall

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 90 ✅ | FP = 1 ❌ |
| <span style='color:orange'><b>Actually Positive</b></span> | <span style='color:orange'>FN = 8</span> ❌ | <span style='color:orange'>TP = 1</span> ✅ |

<center><i><small>UCSD Health test results</small></i></center>

🤔 **Question:** What proportion of individuals who actually have COVID did the test **identify**?

**🙋 Answer:** $\frac{1}{1 + 8} = \frac{1}{9} \approx 0.11$

More generally, the **recall** of a binary classifier is the proportion of <span style='color:orange'><b>actually positive instances</b></span> that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{recall} = \frac{TP}{TP + FN}$$

To compute recall, look at the <span style='color:orange'><b>bottom (positive) row</b></span> of the above confusion matrix.

### Recall isn't everything, either!

$$\text{recall} = \frac{TP}{TP + FN}$$

🤔 **Question:** Can you design a "COVID test" with perfect recall?

**🙋 Answer:** Yes – **just predict that everyone has COVID!**

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 0 ✅ | FP = 91 ❌ |
| <span style='color:orange'><b>Actually Positive</b></span> | <span style='color:orange'>FN = 0</span> ❌ | <span style='color:orange'>TP = 9</span> ✅ |

<center><i><small>everyone-has-COVID classifier</small></i></center>


$$\text{recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 0} = 1$$

Like accuracy, recall on its own is not a perfect metric. Even though the classifier we just created has perfect recall, it has 91 false positives!

### Precision

| | Predicted Negative | <span style='color:orange'>Predicted Positive</span> |
| --- | --- | --- |
| **Actually Negative** | TN = 0 ✅ | <span style='color:orange'>FP = 91</span> ❌ |
| **Actually Positive** | FN = 0 ❌ | <span style='color:orange'>TP = 9</span> ✅ |

<center><i><small>everyone-has-COVID classifier</small></i></center>


The **precision** of a binary classifier is the proportion of <span style='color:orange'><b>predicted positive instances</b></span> that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.

$$\text{precision} = \frac{TP}{TP + FP}$$

To compute precision, look at the <span style='color:orange'><b>right (positive) column</b></span> of the above confusion matrix.

- **Tip:** A good way to remember the difference between precision and recall is that in the denominator for 🅿️recision, both terms have 🅿️ in them (TP and FP).

- Note that the "everyone-has-COVID" classifier has perfect recall, but a precision of $\frac{9}{9 + 91} = 0.09$, which is quite low.

- 🚨 **Key idea:** There is a "tradeoff" between precision and recall. For a particular prediction task, one may be more important than the other.
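
As a quick check of the numbers above, here is a minimal sketch that counts the four confusion matrix cells for the everyone-has-COVID classifier and computes precision and recall from them. The helper `confusion_counts` and the label lists are illustrative names, not part of the lecture code; the 9 positives and 91 negatives come from the table above.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels coded as 0/1."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

# 91 people without COVID and 9 people with COVID (from the table above).
y_true = [0] * 91 + [1] * 9
# The "everyone-has-COVID" classifier predicts positive for everybody.
y_pred = [1] * 100

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp)      # 0 91 0 9
print(precision, recall)   # 0.09 1.0
```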

### Precision and recall

<center><img src="imgs/Precisionrecall.svg.png" width=30%></center>

<center>(<a href="https://en.wikipedia.org/wiki/Precision_and_recall">source</a>)</center>

### Precision and recall

$$\text{precision} = \frac{TP}{TP + FP} \: \: \: \:  \: \: \: \: \text{recall} = \frac{TP}{TP + FN}$$

🤔 **Question:** When might high **precision** be more important than high recall?

**🙋 Answer:** For instance, in deciding whether or not someone committed a crime. Here, **false positives are really bad** – they mean that an innocent person is charged!

🤔 **Question:** When might high **recall** be more important than high precision?

**🙋 Answer:** For instance, in medical tests. Here, **false negatives are really bad** – they mean that someone's disease goes undetected!

### Discussion Question

Consider the confusion matrix shown below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actually Negative** | TN = 22 ✅ | FP = 2 ❌ |
| **Actually Positive** | FN = 23 ❌ | TP = 18 ✅ |

What is the accuracy of the above classifier? The precision? The recall?

<br>

After calculating all three on your own, click below to see the answers.

<details>
    <summary>Accuracy</summary>
    (22 + 18) / (22 + 2 + 23 + 18) = 40 / 65
</details>

<details>
    <summary>Precision</summary>
    18 / (18 + 2) = 9 / 10
</details>

<details>
    <summary>Recall</summary>
    18 / (18 + 23) = 18 / 41
</details>    
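
One way to double-check your answers after attempting the question is with exact fractions. The sketch below uses Python's built-in `fractions.Fraction`; the variable names are illustrative.

```python
from fractions import Fraction

# Cells of the confusion matrix from the discussion question.
tn, fp, fn, tp = 22, 2, 23, 18

# Fraction keeps results exact and reduces them to lowest terms.
accuracy = Fraction(tp + tn, tp + tn + fp + fn)
precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)

print(accuracy, precision, recall)
```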

End of Final Exam content! 🎉

(Note that the remaining content is still relevant for Project 5.)

## Example: Tumor malignancy prediction (via logistic regression)

### Wisconsin breast cancer dataset

The Wisconsin breast cancer dataset (WBCD) is a commonly-used dataset for demonstrating binary classification. It is built into `sklearn.datasets`.

In [None]:
from sklearn.datasets import load_breast_cancer
loaded = load_breast_cancer() # explore the value of `loaded`!
data = loaded['data']
# sklearn codes malignant as 0 and benign as 1; flip so that 1 = malignant
labels = 1 - loaded['target']
cols = loaded['feature_names']
bc = pd.DataFrame(data, columns=cols)

In [None]:
bc.head()

1 stands for "malignant", i.e. cancerous, and 0 stands for "benign", i.e. safe.

In [None]:
labels[:5]

In [None]:
pd.Series(labels).value_counts(normalize=True)

Our goal is to use the features in `bc` to predict `labels`.

### Aside: Logistic regression

Logistic _regression_ is, despite its name, a linear _classification_ technique that builds upon linear regression. It models **the probability of belonging to class 1, given a feature vector**:

$$P(y = 1 | \vec{x}) = \sigma (\underbrace{w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}}_{\text{linear regression model}})$$

Here, $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the **sigmoid** function; its outputs are between 0 and 1 (which means they can be interpreted as probabilities).

🤔 **Question:** Suppose our logistic regression model predicts the probability that a tumor is malignant is 0.75. What class do we predict – malignant or benign? What if the predicted probability is 0.3?

🙋 **Answer:** We have to pick a threshold (e.g. 0.5)!
- If the predicted probability is above the threshold, we predict malignant (1).
- Otherwise, we predict benign (0).
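
To make the thresholding step concrete, here is a minimal sketch of the sigmoid function and the decision rule described above. The function names `sigmoid` and `classify` are illustrative, and the rule follows the bullets above: predict 1 only when the probability is above the threshold.

```python
import math

def sigmoid(t):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + math.exp(-t))

def classify(prob, threshold=0.5):
    """Predict malignant (1) if prob is above the threshold, else benign (0)."""
    return 1 if prob > threshold else 0

print(sigmoid(0))                     # 0.5
print(classify(0.75))                 # 1
print(classify(0.3))                  # 0
print(classify(0.75, threshold=0.8))  # 0
```

Note that the same predicted probability (0.75) leads to different predicted classes under different thresholds.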

### Fitting a logistic regression model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(bc, labels)

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

How did `clf` come up with 1s and 0s?

In [None]:
clf.predict(X_test)

It turns out that the predicted labels come from applying a **threshold** of 0.5 to the predicted probabilities. We can access the predicted probabilities via the `predict_proba` method:

In [None]:
# [:, 1] refers to the predicted probabilities for class 1
clf.predict_proba(X_test)[:, 1]

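Note that our model still has optimal parameters ($w^*$s), stored in the `intercept_` ($w_0^*$) and `coef_` ($w_1^*, ..., w_d^*$) attributes: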

In [None]:
clf.intercept_

In [None]:
clf.coef_
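
To connect these parameters back to the formula, the sketch below checks that applying the sigmoid to $w_0 + w_1 x^{(1)} + ... + w_d x^{(d)}$ reproduces `predict_proba`. It is self-contained, so it fits its own model (`model`) rather than reusing `clf`; `max_iter=10000` and `random_state=0` are choices made here for reproducibility, not settings from the lecture.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip so that 1 = malignant, matching the lecture
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# Linear part: w_0 + w_1 x^(1) + ... + w_d x^(d), one value per test row.
linear = model.intercept_[0] + X_te @ model.coef_[0]
# The sigmoid of the linear part gives P(y = 1 | x).
manual_probs = 1 / (1 + np.exp(-linear))

# Compare the manual probabilities to sklearn's.
print(np.allclose(manual_probs, model.predict_proba(X_te)[:, 1]))
# Compare sklearn's predicted labels to a manual 0.5 threshold.
print(np.array_equal(model.predict(X_te), (manual_probs > 0.5).astype(int)))
```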

### Evaluating our model

Let's see how well our model does on the test set.

In [None]:
from sklearn import metrics

In [None]:
y_pred = clf.predict(X_test)

In [None]:
metrics.accuracy_score(y_test, y_pred)

In [None]:
metrics.precision_score(y_test, y_pred)

In [None]:
metrics.recall_score(y_test, y_pred)

Which metric is more important for this task – precision or recall?

In [None]:
metrics.confusion_matrix(y_test, y_pred)

In [None]:
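# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);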

### What if we choose a different threshold?

🤔 **Question:** Suppose we choose a threshold **higher** than 0.5. What will happen to our model's precision and recall?

🙋 **Answer:** Precision will typically increase, while recall will decrease*.
- If the "bar" is higher to predict 1, then we will have fewer false positives. 
- As FP shrinks, the denominator in $\text{precision} = \frac{TP}{TP + FP}$ gets smaller, and so precision will usually increase. (TP can shrink too, so an increase isn't guaranteed at every threshold.)
- However, the number of false negatives will increase, as we are being more "strict" about what we classify as positive, and so $\text{recall} = \frac{TP}{TP + FN}$ will decrease.
- *It is possible for either or both to stay the same, if changing the threshold slightly (e.g. from 0.5 to 0.500001) doesn't change any predictions.

Similarly, if we decrease our threshold, our model's precision will decrease, while its recall will increase. 
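
The effect of moving the threshold can be seen on a tiny made-up example. The probabilities and labels below are hypothetical (they are not from the breast cancer model), and `precision_recall_at` is an illustrative helper.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
probs  = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    1,    0,    1,    0,    1,    0]

def precision_recall_at(threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold.

    Assumes at least one instance is predicted positive at this threshold.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    return tp / (tp + fp), tp / (tp + fn)

for t in [0.2, 0.5, 0.8]:
    precision, recall = precision_recall_at(t)
    print(t, round(precision, 2), round(recall, 2))
# 0.2 0.71 1.0
# 0.5 0.8 0.8
# 0.8 1.0 0.4
```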

### Trying several thresholds

The classification threshold is not actually a hyperparameter of `LogisticRegression`, because the threshold doesn't change the coefficients ($w^*$s) of the logistic regression model itself (see [this article](https://stats.stackexchange.com/questions/390186/is-decision-threshold-a-hyperparameter-in-logistic-regression#:~:text=The%20decision%20threshold%20is%20not,how%20hyper%2Dparameters%20are%20tuned.) for more details).

As such, if we want to imagine how our predicted classes would change with thresholds other than 0.5, we need to manually threshold.

In [None]:
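thresholds = np.arange(0, 1.01, 0.01)

# The predicted probabilities don't depend on the threshold, so compute them once.
probs = clf.predict_proba(X_test)[:, 1]

precisions, recalls = [], []
for t in thresholds:
    # Predict class 1 whenever the probability is at least t.
    y_pred = probs >= t
    precisions.append(metrics.precision_score(y_test, y_pred))
    recalls.append(metrics.recall_score(y_test, y_pred))

precisions = np.array(precisions)
recalls = np.array(recalls)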

Let's visualize the results in `plotly`, which is interactive.

In [None]:
px.line(x=thresholds, y=precisions,
        labels={'x': 'Threshold', 'y': 'Precision'}, title='Precision vs. Threshold', width=1000, height=600)

In [None]:
px.line(x=thresholds, y=recalls, 
        labels={'x': 'Threshold', 'y': 'Recall'}, title='Recall vs. Threshold', width=1000, height=600)

In [None]:
px.line(x=recalls, y=precisions, hover_name=thresholds, 
        labels={'x': 'Recall', 'y': 'Precision'}, title='Precision vs. Recall')

The above curve is called a precision-recall (or PR) curve.

🤔 **Question:** Based on the PR curve above, what threshold would you choose?
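
As an aside, `sklearn` has a built-in helper, `metrics.precision_recall_curve`, that uses the distinct predicted scores as candidate thresholds, which avoids writing the loop by hand. The sketch below runs it on four made-up scores (hypothetical data, not the tumor model).

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = metrics.precision_recall_curve(y_true, y_scores)

# precision and recall have one more entry than thresholds: the final
# (precision = 1, recall = 0) point has no corresponding threshold.
print(len(precision), len(recall), len(thresholds))
print(precision[-1], recall[-1])
```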

### Combining precision and recall

If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the **F1-score**:

$$\text{F1-score} = \text{harmonic mean}(PR, RE) = 2\frac{PR \cdot RE}{PR + RE}$$

In [None]:
pr = metrics.precision_score(y_test, clf.predict(X_test))
re = metrics.recall_score(y_test, clf.predict(X_test))

2 * pr * re / (pr + re)

In [None]:
metrics.f1_score(y_test, clf.predict(X_test))

Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.

In [None]:
metrics.accuracy_score(y_test, clf.predict(X_test))
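
To see why the harmonic mean is a sensible way to combine the two, consider the everyone-has-COVID classifier from earlier, which has a precision of 0.09 and a recall of 1. The sketch below (with an illustrative helper, `f1`) compares the arithmetic mean of the two to the F1-score.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.09, 1.0   # everyone-has-COVID classifier

arithmetic_mean = (precision + recall) / 2
print(round(arithmetic_mean, 3))        # 0.545
print(round(f1(precision, recall), 3))  # 0.165
```

The harmonic mean is pulled toward the smaller of the two values, so a classifier can't earn a high F1-score by maximizing only one of precision or recall.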

### Other evaluation metrics for binary classifiers

We just scratched the surface! This [excellent table from Wikipedia](https://en.wikipedia.org/wiki/Template:Diagnostic_testing_diagram) summarizes the many other metrics that exist.

<center><img src='imgs/wiki-table.png' width=75%></center>

If you're interested in exploring further, a good next metric to look at is **true negative rate (i.e. specificity)**, which is the analogue of recall for the negative class: the proportion of actually negative instances that are correctly classified.
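
As a starting point, specificity is computed from the top (negative) row of the confusion matrix, $\frac{TN}{TN + FP}$, just as recall is computed from the bottom (positive) row. The sketch below applies both to the UCSD Health test results from earlier; the helper names are illustrative.

```python
def recall(tp, fn):
    """True positive rate: share of actual positives predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (tn + fp)

# UCSD Health test results: TN = 90, FP = 1, FN = 8, TP = 1.
print(round(recall(tp=1, fn=8), 3))        # 0.111
print(round(specificity(tn=90, fp=1), 3))  # 0.989
```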

## Fairness

### Recall, from Lecture 1

<center><img src='imgs/depixel2.png' width=60%></center>

### Fairness: why do we care?

- Sometimes, a model performs better for certain groups than others; in such cases we say the model is **unfair**.
- Since ML models are now used in processes that significantly affect human lives, it is important that they are fair!
    * Job applications and college admissions.
    * Criminal sentencing and parole grants.
    * Predictive policing.
    * Credit and loans.

### Example: COMPAS and recidivism prediction

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a "black-box" model that estimates the likelihood that someone who has committed a crime will recidivate (commit another crime).

<br>

<center><img src="imgs/compas.jpeg"></center>

[ProPublica found](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) that the model's false positive rate is higher for African-Americans than it is for White Americans, and that its false negative rate is lower for African-Americans than it is for White Americans.
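
A comparison like this one boils down to computing error rates separately for each group. Below is a minimal sketch on made-up records; the groups `'A'` and `'B'`, the data, and the helper `error_rates` are all hypothetical and are not ProPublica's data or method. It uses $FPR = \frac{FP}{FP + TN}$ and $FNR = \frac{FN}{FN + TP}$.

```python
# Each record: (group, actually_reoffended, predicted_high_risk). Made-up data.
records = [
    ('A', 0, 1), ('A', 0, 1), ('A', 0, 0), ('A', 0, 0),
    ('A', 1, 1), ('A', 1, 1), ('A', 1, 1), ('A', 1, 0),
    ('B', 0, 1), ('B', 0, 0), ('B', 0, 0), ('B', 0, 0),
    ('B', 1, 1), ('B', 1, 1), ('B', 1, 0), ('B', 1, 0),
]

def error_rates(records, group):
    """Return (FPR, FNR) using only the records from one group."""
    rows = [(y, yhat) for g, y, yhat in records if g == group]
    fp = sum(1 for y, yhat in rows if y == 0 and yhat == 1)
    tn = sum(1 for y, yhat in rows if y == 0 and yhat == 0)
    fn = sum(1 for y, yhat in rows if y == 1 and yhat == 0)
    tp = sum(1 for y, yhat in rows if y == 1 and yhat == 1)
    return fp / (fp + tn), fn / (fn + tp)

for group in ['A', 'B']:
    fpr, fnr = error_rates(records, group)
    print(group, fpr, fnr)
# A 0.5 0.25
# B 0.25 0.5
```

In this toy data, both groups have the same accuracy (5 of 8 correct), yet group A has the higher false positive rate and group B has the higher false negative rate. Overall accuracy alone would hide that difference.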



### Example: Facial recognition

* The table below comes from [a paper](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf) that analyzes several "gender classifiers", and shows that popular classifiers perform much worse for women and those with darker skin colors.
* Police departments are beginning to [use these models](https://www.usatoday.com/story/tech/2018/07/09/orlando-police-decide-keep-testing-amazon-facial-recognition-program/768507002/) for surveillance.
* Self-driving cars use similar models to recognize pedestrians!

<center><img src="imgs/imgnet.jpeg" width=60%></center>

Note:

$$PPV = \text{precision} = \frac{TP}{TP+FP},\:\:\:\:\:\: TPR = \text{recall} = \frac{TP}{TP + FN}, \:\:\:\:\:\: FPR = \frac{FP}{FP+TN}$$

### How does bias occur?

Remember, our models learn patterns from the training data. Various sources of bias may be present within training data:
* Training data may not be representative of the population.
    * There may be fewer data points for minority groups, leading to poorer model performance.
* The features chosen may be more useful in making predictions for certain groups than others.
* Training data may encode existing human biases.

### Example: Gender associations

- English is not a gendered language – words like "teacher" and "scientist" are not inherently gendered (unlike in, say, French). 
- However, English does have gendered pronouns (e.g. "he", "she").
- Humans subconsciously associate certain words with certain genders.
- What gender does English associate the following words with?

<center>soldier, teacher, nurse, doctor, dog, cat, president, nanny</center>

### Example: Gender associations

* Unlike English, Turkish 🇹🇷 **does not** have gendered pronouns – there is only a single, gender-neutral pronoun ("o").
* Let's see what happens when we use Google Translate to translate Turkish sentences that **should be** gender-neutral back to English.
* Click [this link](https://translate.google.com/?sl=tr&tl=en&text=o%20bir%20asker%20%0Ao%20bir%20öğretmen%0Ao%20bir%20mühendis%0Ao%20bir%20hemşire&op=translate) to follow along.
* Why is this happening?
    * Answer: Google Translate is "trained" on a large corpus of English text, and these associations are present in those English texts.
    * Ideally, the results should contain a gender-neutral singular "they", rather than "he" or "she".

### Example: Image searches

A 2015 study examined Google Images search results for various occupations and the gender makeup of those results. Since 2015, the behavior of Google Images has improved.

In 2015, a Google Images search for "**nurse**" returned...

<center><img src='imgs/nurses2015.jpg'></center>

Search for "nurse" now. What do you see?

In 2015, a Google Images search for "**doctor**" returned...

<center><img src='imgs/doctors2015.jpg'></center>

Search for "doctor" now. What do you see?

### Ethics: What gender ratio _should_ we expect in the results?

- Should it be 50/50?
- Should it reflect the true gender distribution of those jobs?
- More generally, what do you expect from your search results?
    - This is a philosophical and ethical question, but one that **we need to think about as data scientists**.

<center><img src='imgs/google-photos-paper.png' width=70%></center>

Excerpts:

> "male-dominated professions tend to have even more men
in their results than would be expected if the proportions
reflected real-world distributions."

> "People’s existing perceptions of gender ratios in occupations
are quite accurate, but that manipulated search results have an effect on perceptions."

### How did this unequal representation occur?

* The data that Google Images draws its results from encodes existing biases.
    - While 60% of doctors may be male, 80% of photos (including stock photos) of doctors on the internet may be of male doctors.
* Models (like PageRank) that "rank" images find the, say, 5 "most relevant" images, not the 5 "most typical" images.

## Summary, next time

### Summary

- Accuracy alone is not always a meaningful representation of a classifier's quality, particularly when the **classes are imbalanced**.
    - Precision and recall are classifier evaluation metrics that consider the **types of errors** being made.
    - There is a "tradeoff" between precision and recall. One may be more important than the other, depending on the task.
- A logistic regression model makes classifications by first predicting a probability and then thresholding that probability.
    - The default threshold is 0.5; by moving the threshold, we change the balance between precision and recall.
- A model is **unfair** if it performs better for some groups than others.
    - Unfairness often arises through biased training data, which encodes existing human biases.
- **Next time:** A mathematical framework for assessing the unfairness of a model, and a practical example. High-level overview of the quarter.