<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Evaluating Binary Classification Models

In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics

## Limitations of Accuracy

### Accuracy Can Be Misleading with Imbalanced Classes

The `.score` method of a scikit-learn classification estimator returns *accuracy*, which is simply the proportion of predictions that are correct.
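
As a minimal sketch of that definition, accuracy can be computed by hand as the fraction of predictions that match the true labels. The `accuracy` helper and the labels below are made up for illustration; they are not part of scikit-learn or the admissions data.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for illustration only
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

print(accuracy(y_true_demo, y_pred_demo))  # 4 of 5 correct -> 0.8
```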

In [2]:
# Load admissions data
# /scrub/
admissions_path = '../assets/data/admissions.csv'
admissions = pd.read_csv(admissions_path).dropna()
admissions.head()

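| Unnamed: 0 | admit | gre | gpa | prestige |
| --- | --- | --- | --- | --- |
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |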


In [3]:
# Split the data into feature columns and target column
# /scrub/
feature_cols = ['gre']
target_col = 'admit'
X = admissions.loc[:, feature_cols]
y = admissions.loc[:, target_col]

In [4]:
# Do train/test split
# /scrub/
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=46)

In [5]:
# Train a logistic regression estimator
# /scrub/
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [6]:
# Calculate test-set accuracy
# /scrub/
lr.score(X_test, y_test)

0.64

64% accuracy might sound OK, but check the class frequencies.

In [7]:
# Check class frequencies
# /scrub/
y_test.value_counts(normalize=True)

0    0.64
1    0.36
Name: admit, dtype: float64

**Exercise (4 mins., in groups)**

- Print the model's predictions on the test set.

In [8]:
# /scrub/
lr.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

- What is the model doing?

/scrub/

Predicting "no" every time.

- What is the model's accuracy on the test set?

/scrub/

64%

- What is the model's accuracy on students who are admitted?

/scrub/

0%

- Suppose we applied this model to an extremely selective program that admitted only 5% of applicants. What accuracy score would it get, assuming that it behaves the same way in that sample as it does on our test set? Does that mean that it is a good model?

/scrub/

It would get an accuracy score of 95%. That score does not mean that it is a good model -- we don't need machine learning to get 95% accuracy by guessing "no" every time!

$\blacksquare$

One limitation of accuracy as an evaluation metric is that **it can be misleading when classes are imbalanced.**
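
To make the point concrete, here is a small sketch of the accuracy a model earns by always predicting the more common class. The helper name `majority_class_accuracy` is hypothetical. The 0.36 and 0.05 positive rates come from the admissions test set and the selective-program exercise, and 0.01 is an extra illustrative value.

```python
def majority_class_accuracy(positive_rate):
    """Accuracy of a model that always predicts the more common class."""
    return max(positive_rate, 1 - positive_rate)


for rate in [0.36, 0.05, 0.01]:
    baseline = majority_class_accuracy(rate)
    print(f"positive rate {rate:.2f} -> baseline accuracy {baseline:.2f}")
```

The rarer the positive class, the higher this do-nothing baseline climbs, so a high accuracy score is only informative relative to it.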

### Accuracy Does Not Weigh the Costs of Different Kinds of Errors

Suppose we are building an airport security model that flags people for additional scrutiny when they are suspected of carrying dangerous items. Suppose that 1 out of every 100 people is carrying a potentially dangerous item.

Model 1 raises an alert in 90% of the cases in which a person is carrying a potentially dangerous item and in 10% of the cases in which a person is not carrying a potentially dangerous item.

Model 2 raises an alert in 10% of the cases in which a person is carrying a potentially dangerous item and in 5% of the cases in which a person is not carrying a potentially dangerous item.

**Exercise (5 mins., in groups)**

- What is Model 1's accuracy?

/scrub/

90%

- What is Model 2's accuracy?

In [9]:
# /scrub/
.1*.01+.95*.99

0.9415

- Which model would you recommend using?

/scrub/

I would recommend using Model 1. Model 2 is more accurate overall, but Model 1 is much more accurate for people who are carrying dangerous items, and the cost of being wrong in those cases is much higher than the cost of being wrong for people who are not carrying dangerous items.

$\blacksquare$

A second limitation of accuracy is that **it does not take into account the relative costs of false positives and false negatives.**
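
The airport comparison can be worked through as expected counts. This is a rough sketch: the `summarize` helper and the figure of 10,000 passengers are assumptions chosen for illustration, while the alert rates and the 1-in-100 prevalence come from the scenario above.

```python
def summarize(name, alert_rate_dangerous, alert_rate_safe,
              prevalence=0.01, n=10_000):
    """Return accuracy and expected error counts for a screening model."""
    n_dangerous = n * prevalence
    n_safe = n - n_dangerous
    true_pos = n_dangerous * alert_rate_dangerous
    false_neg = n_dangerous - true_pos      # dangerous items missed
    false_pos = n_safe * alert_rate_safe    # unnecessary alerts
    true_neg = n_safe - false_pos
    return {
        "model": name,
        "accuracy": (true_pos + true_neg) / n,
        "missed_dangerous": false_neg,
        "false_alarms": false_pos,
    }


model_1 = summarize("Model 1", alert_rate_dangerous=0.90, alert_rate_safe=0.10)
model_2 = summarize("Model 2", alert_rate_dangerous=0.10, alert_rate_safe=0.05)

for result in (model_1, model_2):
    print(result["model"],
          "accuracy:", round(result["accuracy"], 4),
          "missed:", round(result["missed_dangerous"]),
          "false alarms:", round(result["false_alarms"]))
```

Under these assumptions the more accurate model misses about nine times as many dangerous items, which is exactly the kind of trade-off that accuracy hides.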

## Confusion Matrices

A *confusion matrix* shows the counts for all combinations of true labels and predictions for a classification model.

In [10]:
# Get the confusion matrix for our admissions model
# /scrub/
y_pred = lr.predict(X_test)
metrics.confusion_matrix(y_test, y_pred)

array([[64,  0],
       [36,  0]], dtype=int64)

Sklearn uses columns for the predicted class, increasing from left to right, so that the first column is the predicted negative "0" class (not admitted) and the second is the predicted positive "1" class (admitted).

It uses rows for the true class, increasing from top to bottom, so that the first row is the actual negative "0" class (not admitted) and the second is the actual positive "1" class (admitted).

**Exercise (8 mins., in groups)**

- We call it a "true positive" when the model correctly predicts that someone is in the positive class (admitted). How many true positives did our model generate in this example?

/scrub/

0

- We call it a "true negative" when the model correctly predicts that someone is in the negative class (not admitted). How many true negatives did our model generate in this example?

/scrub/

64

- We call it a "false positive" when the model *incorrectly* predicts that someone is in the positive class (admitted). How many false positives did our model generate in this example?

/scrub/

0

- We call it a "false negative" when the model *incorrectly* predicts that someone is in the negative class (not admitted). How many false negatives did our model generate in this example?

/scrub/

36

**Categorize the following cases as true positives, true negatives, false positives, and false negatives.**
    
- We predict that a growth is malignant, and it is benign. (is_malignant=1)

/scrub/

FP

- We predict that an image does not contain a cat, and it does not. (has_cat=1)

/scrub/

TN

- We predict that a locomotive will fail in the next two weeks, and it does. (breaks=1)

/scrub/

TP

- We predict that a user will like a song, and she does not. (likes_song=1)

/scrub/

FP

- Give some examples of scenarios in which a false positive is worse than a false negative.

/scrub/

Criminal conviction, scientific discovery

- Give some examples of scenarios in which a false negative is worse than a false positive.

/scrub/

Medical screening, preventive maintenance

$\blacksquare$
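
The row and column layout described in this section can be reproduced with a few lines of plain Python. The helper `confusion_matrix_2x2` is a hypothetical stand-in that follows the same convention (rows are true classes, columns are predicted classes); it is not scikit-learn's implementation. The demo labels only mimic the shape of the admissions test set.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# 64 rejections and 36 admissions, with a model that predicts 0 every time
y_true_demo = [0] * 64 + [1] * 36
y_pred_demo = [0] * 100

# Unpack in reading order: top row is (TN, FP), bottom row is (FN, TP)
(tn, fp), (fn, tp) = confusion_matrix_2x2(y_true_demo, y_pred_demo)
print(tn, fp, fn, tp)  # 64 0 36 0
```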

## Changing the Probability Threshold

By default, a scikit-learn logistic regression estimator's `predict` method returns 1 exactly where the predicted probability is greater than .5. We can change that. For instance, we might want to lower the threshold for making a positive prediction in our admissions model so that it does not predict 0 every time.

In [11]:
# Get our model's probability predictions for the positive class
# /scrub/
y_pred_prob = lr.predict_proba(X_test)[:, 1]
y_pred_prob

array([0.29218435, 0.31554192, 0.26987406, 0.33987013, 0.33002806,
       0.31554192, 0.31554192, 0.31554192, 0.26987406, 0.31079002,
       0.35489405, 0.30140557, 0.29218435, 0.32516165, 0.25281932,
       0.34484418, 0.25701539, 0.35996808, 0.33002806, 0.24456417,
       0.34985245, 0.33493115, 0.29677426, 0.28763639, 0.26554309,
       0.30140557, 0.32033269, 0.30607769, 0.29677426, 0.33493115,
       0.29218435, 0.29677426, 0.33987013, 0.34484418, 0.30140557,
       0.31554192, 0.28763639, 0.30140557, 0.35996808, 0.26125675,
       0.2831309 , 0.27866838, 0.33493115, 0.31554192, 0.31079002,
       0.28763639, 0.27866838, 0.30607769, 0.32516165, 0.31554192,
       0.32516165, 0.27866838, 0.33493115, 0.29677426, 0.27424929,
       0.35996808, 0.30140557, 0.32033269, 0.30607769, 0.31079002,
       0.29677426, 0.32033269, 0.32516165, 0.33493115, 0.31079002,
       0.29677426, 0.32516165, 0.29218435, 0.31554192, 0.31079002,
       0.33002806, 0.30607769, 0.31554192, 0.30140557, 0.29677

In [12]:
# Find the confusion matrix for a probability threshold of .3
# /scrub/
y_pred_low_thresh = y_pred_prob > .3
metrics.confusion_matrix(y_test, y_pred_low_thresh)

array([[25, 39],
       [ 9, 27]], dtype=int64)

**Exercise (6 mins., in groups)**

- How many true positives did our model generate with this new threshold probability?

/scrub/

27

- How many true negatives did it generate?

/scrub/

25

- How many false positives did it generate?

/scrub/

39

- How many false negatives did it generate?

/scrub/

9

- What is its accuracy?

/scrub/

52%

- In a medical screening example where false positives are stressful and inconvenient but false negatives are potentially deadly, would you want to use a probability threshold above or below .5?

/scrub/

Below

- In a criminal trial where false negatives allow guilty people to walk free while false positives put innocent people in prison, would you want to use a probability threshold above or below .5?

/scrub/

Above

$\blacksquare$
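
The thresholding rule used in this section can be sketched on its own. The function `predict_with_threshold` and the probabilities below are hypothetical, chosen to resemble the range of values the admissions model produced.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Predict 1 where the positive-class probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


# Hypothetical positive-class probabilities
probs_demo = [0.25, 0.29, 0.31, 0.34, 0.36]

print(predict_with_threshold(probs_demo))       # [0, 0, 0, 0, 0]
print(predict_with_threshold(probs_demo, 0.3))  # [0, 0, 1, 1, 1]
```

Lowering the threshold can only turn negative predictions into positive ones, which is why it trades false negatives for false positives.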

## Precision, Recall, and $F_\beta$

### Precision and Recall

Voice assistants are generally designed to become active when you say a "wake word" or phrase (e.g. "Alexa," "OK Google," or "Hey, Siri"). To make these devices work well, their engineers need to classify utterances according to whether or not they are instances of the wake word.

A model's *precision* is the accuracy of its positive predictions -- for instance, when the voice assistant becomes active, how often was it because someone actually said the wake word? In other words, precision measures **how good a model is at avoiding false positives**.

A model's *recall* is its accuracy on the positive class -- when someone says the wake word, how often does the voice assistant respond? In other words, recall measures **how good a model is at avoiding false negatives.**

For instance, suppose a wake word model generates the following confusion matrix.

<table style="border: none">
<tr style="border: none">
    <td style=""><b> </b></td>
    <td style=""><b>Predicted: No</b></td>
    <td style=""><b>Predicted: Yes</b></td>
</tr>
<tr>
    <td><b>Actual: No</b></td>
    <td style="text-align: center; color: blue">TN = 50</td>
    <td style="text-align: center">FP = 10</td>
    <td style="text-align: center">60</td>
</tr>
<tr>
    <td><b>Actual: Yes</b></td>
    <td style="text-align: center">FN = 5</td>
    <td style="text-align: center; color: green">TP = 100</td>
    <td style="text-align: center; color: orange">105</td>
</tr>
<tr style="border: none">
    <td style="border: none"></td>
    <td style="text-align: center">55</td>
    <td style="text-align: center; color: red">110</td>
    <td style="border: none; color: purple">165</td>
</tr>

</table>

The model's accuracy is

$$\frac{|\color{green}{\text{True Positives}}| + |\color{blue}{\text{True Negatives}}|}{|\color{purple}{\text{Total}}|} = \frac{50 + 100}{165} = \frac{150}{165} = .91$$

The model's precision is

$$\frac{|\color{green}{\text{True Positives}}|}{|\color{red}{\text{Positive Predictions}}|} = \frac{100}{110} = .91$$

The model's recall is

$$\frac{|\color{green}{\text{True Positives}}|}{|\color{orange}{\text{Positive Class}}|} = \frac{100}{105} = .95$$
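
The same three calculations can be checked in a few lines of Python, using the counts from the wake word table above. The variable names are chosen for illustration.

```python
# Counts from the wake word confusion matrix
tn, fp, fn, tp = 50, 10, 5, 100

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # correct positives / positive predictions
recall = tp / (tp + fn)                     # correct positives / actual positives

print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.91 0.91 0.95
```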

**Exercise (6 mins., in groups)**

- Explain what the precision reported above means. A voice assistant that uses this model will do what 91% of the time under what conditions?

/scrub/

When it wakes up, 91% of the time it is because someone actually said the wake word.

- Explain what the recall reported above means. A voice assistant that uses this model will do what 95% of the time under what conditions?

/scrub/

It will wake up 95% of the time when the wake word is uttered.

**Here again is the confusion matrix for our admissions model with a probability threshold of .3.**

In [13]:
metrics.confusion_matrix(y_test, y_pred_low_thresh)

array([[25, 39],
       [ 9, 27]], dtype=int64)

- What is this model's precision?

In [14]:
# /scrub/
27/(27 + 39)

0.4090909090909091

- What does that number mean?

/scrub/

When the model predicts that a student will be admitted, it is right 41% of the time.

- What is this model's recall?

In [15]:
# /scrub/
27/(27 + 9)

0.75

- What does that number mean?

/scrub/

The model made the correct prediction for 75% of the students who were admitted.

- What happens to a logistic regression model's precision and recall as you decrease its threshold probability?

/scrub/

Recall increases, precision decreases

$\blacksquare$

### $F_\beta$

A key problem with accuracy is that it does not account for the relative costs of false positives and false negatives.

We can address this problem by using precision and recall to measure the model's ability to avoid false positives and false negatives, respectively.

But now we have two numbers instead of one. If we want to decide how to set our probability threshold or more generally what model to use, **we need to weigh precision against recall somehow.**

**The $F_\beta$ score addresses this problem.** It is a weighted "harmonic mean" of precision and recall. A harmonic mean is similar to the standard arithmetic mean but is pulled toward the smaller of the two values. As a result, a model will have a low $F_\beta$ score if either precision or recall is low.

The $\beta$ in $F_\beta$ encodes how much you care about recall relative to precision: $\beta=2$, for instance, encodes that you care twice as much about recall as you care about precision, and $\beta=1/2$ encodes the opposite. **The $\beta$ parameter allows you to account for the relative costs of false positives and false negatives.**
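
For reference, the standard definition of the score is

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

The sketch below implements that formula directly. The function name `f_beta` and the example precision and recall values are hypothetical, and returning 0 when both inputs are 0 is a convention assumed here to avoid dividing by zero.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention when the score is otherwise undefined
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical model with low precision and perfect recall
precision_demo, recall_demo = 0.5, 1.0

print(round(f_beta(precision_demo, recall_demo, beta=1), 4))    # 0.6667
print(round(f_beta(precision_demo, recall_demo, beta=2), 4))    # 0.8333
print(round(f_beta(precision_demo, recall_demo, beta=0.5), 4))  # 0.5556
```

Because recall is the stronger of the two numbers here, weighting recall more heavily ($\beta=2$) raises the score and weighting precision more heavily ($\beta=1/2$) lowers it.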

In [16]:
# Calculate precision, recall, F_1, F_2, and F_{1/2} for our admissions model
# at different thresholds.
from sklearn import metrics

for threshold in np.linspace(y_pred_prob.min(), y_pred_prob.max(), 4):
    y_pred_thresh = y_pred_prob > threshold
    print('threshold:', threshold)
    print('precision:', metrics.precision_score(y_test, y_pred_thresh))
    print('recall:', metrics.recall_score(y_test, y_pred_thresh))
    print('f1:', metrics.f1_score(y_test, y_pred_thresh))
    # beta must be passed by keyword in recent scikit-learn releases
    print('f2:', metrics.fbeta_score(y_test, y_pred_thresh, beta=2))
    print('f1/2:', metrics.fbeta_score(y_test, y_pred_thresh, beta=.5))
    print()

threshold: 0.2445641739351176
precision: 0.36363636363636365
recall: 1.0
f1: 0.5333333333333333
f2: 0.7407407407407408
f1/2: 0.41666666666666674

threshold: 0.2830321419054502
precision: 0.40476190476190477
recall: 0.9444444444444444
f1: 0.5666666666666667
f2: 0.7456140350877193
f1/2: 0.4569892473118279

threshold: 0.3215001098757828
precision: 0.59375
recall: 0.5277777777777778
f1: 0.5588235294117648
f2: 0.5397727272727273
f1/2: 0.5792682926829269

threshold: 0.3599680778461154
precision: 0.0
recall: 0.0
f1: 0.0
f2: 0.0
f1/2: 0.0





## Using $F_\beta$ to Tune the Probability Threshold

Suppose we care equally about precision and recall in the admissions example, so that $F_1$ is an appropriate evaluation metric. Let's find the probability threshold that maximizes that metric.

In [17]:
# Find the probability threshold that gives the best F-beta score.
# Predictions only change when the threshold hits a probability value
# that the model actually generates, so we can just check those values.
# /scrub/
best_score = -1
best_threshold = -1

for threshold in sorted(set(y_pred_prob)):
    y_pred_thresh = y_pred_prob > threshold
    score = metrics.f1_score(y_test, y_pred_thresh)
    if score > best_score:
        best_score = score
        best_threshold = threshold

best_threshold

0.2742492919202762

In [18]:
# Calculate the confusion matrix for the best threshold.
# /scrub/
y_pred_thresh = y_pred_prob > best_threshold
metrics.confusion_matrix(y_test, y_pred_thresh)

array([[13, 51],
       [ 0, 36]], dtype=int64)

**Exercise (15 mins., in pairs)**

- Calculate this model's accuracy both with base Python operators and with sklearn's `accuracy_score` function.

In [19]:
# /scrub/
(13 + 36) / 100

0.49

In [20]:
# /scrub/
metrics.accuracy_score(y_test, y_pred_thresh)

0.49

- Calculate this model's precision both with base Python operators and with the appropriate sklearn function.

In [21]:
# /scrub/
36/87

0.41379310344827586

In [22]:
# /scrub/
metrics.precision_score(y_test, y_pred_thresh)

0.41379310344827586

- Calculate this model's recall both with base Python operators and with the appropriate sklearn function.

In [23]:
# /scrub/
36/36

1.0

In [24]:
# /scrub/
metrics.recall_score(y_test, y_pred_thresh)

1.0

- Calculate this model's F1 score with the appropriate sklearn function.

In [25]:
# /scrub/
metrics.f1_score(y_test, y_pred_thresh)

0.5853658536585366

- In a medical screening example where false positives are stressful and inconvenient but false negatives are potentially deadly, would you want to use a value for $\beta$ above or below 1 in an $F_\beta$ score? Why?

/scrub/

Above, because you care about recall more than precision.

- In a criminal trial where false negatives allow guilty people to walk free while false positives put innocent people in prison, would you want to use a value for $\beta$ above or below 1 in an $F_\beta$ score?

/scrub/

Below, because you care about precision more than recall.

- Find the probability threshold that optimizes the $F_2$ score for the Titanic model below.

In [26]:
titanic = pd.read_csv('../assets/data/titanic.csv')

In [27]:
titanic = titanic.select_dtypes(['int64', 'float64']).dropna(axis='columns')

In [28]:
target_col = 'Survived'
X = titanic.drop(target_col, axis='columns')
y = titanic.loc[:, target_col]

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [30]:
lr = LogisticRegression()
lr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [31]:
y_pred_prob = lr.predict_proba(X_test)[:, 1]

In [32]:
# /scrub/
best_score = -1
best_threshold = -1

for threshold in sorted(set(y_pred_prob)):
    y_pred_thresh = y_pred_prob > threshold
    score = metrics.fbeta_score(y_test, y_pred_thresh, beta=2)
    if score > best_score:
        best_score = score
        best_threshold = threshold

best_threshold



0.2570949367160027

- Calculate the test-set accuracy, precision, recall, and $F_2$ score for this model.

In [33]:
# /scrub/
y_pred_thresh = y_pred_prob > best_threshold

In [34]:
# /scrub/
metrics.accuracy_score(y_test, y_pred_thresh)

0.6188340807174888

In [35]:
# /scrub/
metrics.precision_score(y_test, y_pred_thresh)

0.5125

In [36]:
# /scrub/
metrics.recall_score(y_test, y_pred_thresh)

0.9213483146067416

In [37]:
# /scrub/
metrics.fbeta_score(y_test, y_pred_thresh, beta=2)

0.7945736434108527

$\blacksquare$
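
The threshold search used in this section generalizes to any $\beta$. Below is a self-contained sketch in plain Python, assuming binary 0/1 labels. The helper names `f_beta_from_labels` and `best_threshold` and the small data set are hypothetical; this is not scikit-learn's implementation.

```python
def f_beta_from_labels(y_true, y_pred, beta=1.0):
    """F-beta computed from counts of true/false positives and negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(y_true, probs, beta=1.0):
    """Check each distinct predicted probability as a candidate threshold."""
    best_score, best_t = -1.0, None
    for t in sorted(set(probs)):
        y_pred = [int(p > t) for p in probs]
        score = f_beta_from_labels(y_true, y_pred, beta)
        if score > best_score:
            best_score, best_t = score, t
    return best_t, best_score


# Hypothetical labels and predicted probabilities
y_true_demo = [0, 0, 1, 0, 1, 1]
probs_demo = [0.1, 0.3, 0.35, 0.4, 0.6, 0.8]

for beta in (0.5, 1, 2):
    threshold, score = best_threshold(y_true_demo, probs_demo, beta)
    print(f"beta={beta}: threshold={threshold}, score={score:.3f}")
```

On this toy data, favoring precision ($\beta=1/2$) selects a higher threshold than $\beta=1$ or $\beta=2$, which matches the intuition from the exercises above.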

## Lesson Review

- Accuracy scores can be misleading when classes are imbalanced, and they do not account for the relative costs of false positives and false negatives.
- A confusion matrix displays the number of false positives, false negatives, true positives, and true negatives a model generates.
- Precision measures a model's ability to avoid false positives.
- Recall measures a model's ability to avoid false negatives.
- The $F_\beta$ score allows you to weigh precision against recall to select the best overall model even when the costs of false positives and false negatives are unequal.