# Classification Model Evaluation

- Different domains have different needs
    - iPhone face id -- high precision
    - Fraud detection -- high recall

In [1]:
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})
df

,actual,prediction
0,coffee,no coffee
1,no coffee,no coffee
2,no coffee,coffee
3,coffee,coffee
4,coffee,coffee
5,coffee,coffee
6,no coffee,no coffee
7,coffee,no coffee


# This Confusion Matrix is Key

In [2]:
pd.crosstab(df.actual, df.prediction)

actual \ prediction,coffee,no coffee
coffee,3,2
no coffee,1,2


- TP: predicted coffee + actual is coffee
- FP: predicted coffee, but they didn't want coffee
- FN: predicted no coffee, but really they wanted coffee
- TN: predicted no coffee + actual is no coffee

- our choice of positive and negative is arbitrary
- the labels / layout of the confusion matrix vary
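
The four outcomes above can be counted directly from the paired labels. This is a minimal sketch using plain Python lists instead of the DataFrame; the helper name `confusion_counts` is an assumption made for illustration, not part of the original notebook.

```python
# Same data as the DataFrame above, as plain lists.
actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']


def confusion_counts(actual, prediction, positive):
    """Count TP, FP, FN, TN for the chosen positive label (hypothetical helper)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, prediction):
        if p == positive and a == positive:
            counts["TP"] += 1   # predicted positive, actually positive
        elif p == positive:
            counts["FP"] += 1   # predicted positive, actually negative
        elif a == positive:
            counts["FN"] += 1   # predicted negative, actually positive
        else:
            counts["TN"] += 1   # predicted negative, actually negative
    return counts


counts = confusion_counts(actual, prediction, positive="coffee")
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 2, 'TN': 2}
```

These counts match the crosstab cells: the row is the actual label and the column is the predicted label.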

## Metrics

- **accuracy**: (TP + TN) / (TP + TN + FP + FN)
    - (3 + 2) / (3 + 2 + 1 + 2) = 62.5%
- **precision**: TP / (TP + FP)
    - 3 / (3 + 1) = 75%
    - use when a FP is more costly than a FN; precision does not take FN into account at all
- **recall**: TP / (TP + FN)
    - 3 / (3 + 2) = 60%
    - use when a FN is more costly than a FP; recall does not take FP into account at all
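
The three formulas can be checked with a few lines of arithmetic. This is a small sketch that hard-codes the counts from the confusion matrix (TP = 3, FP = 1, FN = 2, TN = 2); the function names are assumptions chosen for readability.

```python
def accuracy(tp, fp, fn, tn):
    # Share of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Share of positive predictions that were correct; FN never appears.
    return tp / (tp + fp)


def recall(tp, fn):
    # Share of actual positives that were found; FP never appears.
    return tp / (tp + fn)


tp, fp, fn, tn = 3, 1, 2, 2  # counts from the confusion matrix

print(f"accuracy:  {accuracy(tp, fp, fn, tn):.3f}")  # 0.625
print(f"precision: {precision(tp, fp):.3f}")         # 0.750
print(f"recall:    {recall(tp, fn):.3f}")            # 0.600
```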

Imagine you're bringing coffee to a meeting. Which metric should we choose? It depends.

Outcomes:
- FP: Buy coffee for someone who won't drink it
- FN: Don't buy coffee for someone who would have drunk it
- TP: Buy coffee for someone who will drink it
- TN: Don't buy coffee for someone who wouldn't drink it

- lola: really good coffee, but expensive
    - cost of a FP is higher than a FN
    - precision is better here because buying a cup of coffee for someone who won't drink it is expensive
    - we want to be sure about our positive predictions
- taco cabana: bad coffee, but cheap
    - cost of a FN is higher than a FP
    - recall is better here: the coffee is cheap, so buying a cup for someone who won't drink it isn't bad; it's worse to skip someone who wanted coffee
- meeting with super important client
    - cost of a FN is higher, because they might be offended if we don't get them coffee
    - cost of FN == not signing a contract
    - recall
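
One way to make "which error costs more" concrete is to put a price on each kind of mistake and total it up. The sketch below is illustrative only: the dollar figures are invented assumptions, not data from the notebook. It compares the model's errors (1 FP, 2 FN) with an "always buy coffee" strategy (3 FP, 0 FN).

```python
# Error counts with "coffee" as the positive class.
model = {"FP": 1, "FN": 2}        # from the confusion matrix
always_buy = {"FP": 3, "FN": 0}   # buy for all 8 people; 3 didn't want it


def total_cost(errors, fp_cost, fn_cost):
    """Total cost of mistakes, given a price per FP and per FN."""
    return errors["FP"] * fp_cost + errors["FN"] * fn_cost


# Hypothetical costs (assumptions for illustration only).
scenarios = {
    "lola": {"fp_cost": 5.0, "fn_cost": 1.0},         # wasted cup is expensive
    "taco cabana": {"fp_cost": 1.0, "fn_cost": 3.0},  # missing someone is worse
}

results = {}
for name, costs in scenarios.items():
    results[name] = {
        "model": total_cost(model, **costs),
        "always_buy": total_cost(always_buy, **costs),
    }
    print(name, results[name])
```

Under these assumed prices, the higher-precision model is cheaper at lola (7 vs. 15), while the perfect-recall "always buy" strategy is cheaper at taco cabana (3 vs. 7).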

What if we never buy coffee, or buy coffee for everyone? That's a baseline model: always predict the same class.

In [3]:
df.actual.value_counts()

coffee       5
no coffee    3
Name: actual, dtype: int64

In [4]:
df["baseline"] = "coffee"

In [5]:
df

,actual,prediction,baseline
0,coffee,no coffee,coffee
1,no coffee,no coffee,coffee
2,no coffee,coffee,coffee
3,coffee,coffee,coffee
4,coffee,coffee,coffee
5,coffee,coffee,coffee
6,no coffee,no coffee,coffee
7,coffee,no coffee,coffee


In [6]:
# foundational code
# model accuracy
(df.actual == df.prediction).mean()

0.625

In [7]:
# baseline accuracy
(df.actual == df.baseline).mean()

0.625

In [8]:
# precision -- how good are our positive predictions?
# precision -- model performance | predicted positive
subset = df[df.prediction == "coffee"]
(subset.prediction == subset.actual).mean()

0.75

In [9]:
# recall -- how often do we get the actual positive cases?
# recall -- model performance | actual positive
subset = df[df.actual == "coffee"]
(subset.prediction == subset.actual).mean()

0.6

In [10]:
# precision baseline
subset = df[df.baseline == "coffee"]
(subset.actual == subset.baseline).mean() # precision is the same as accuracy in this case

0.625

In [11]:
# recall baseline
subset = df[df.actual == "coffee"]
(subset.actual == subset.baseline).mean()

1.0

Another Example:

Predict whether an email is spam or not.

- positive case: spam
- negative case: not spam

Outcome:

- TP: Predict message is spam and it really is
- FP: Predict message is spam and it's really not
- FN: Predict message is not spam and it really is
- TN: Predict message is not spam and it really is not

Which has a higher cost, a false positive or a false negative?

FP: it is **_bad_** to put an email from Bill in accounting in your spam folder

Precision: because we want to be certain a message really is spam when we send it to the spam folder

Another Example:

Predict whether an email is a phishing attempt. When we predict positive, show an additional banner.

- positive case: phishing attempt
- negative case: legit email

Outcomes:

- FP: Show a warning about phishing on a legit email
- TP: Show a warning about phishing and it really is
- FN: Don't show a warning on a phishing email
- TN: Don't show a warning on a legit email

Which has the highest cost? FN: not showing a warning on a phishing email

The cost of showing a banner on a legit email is less than not showing a banner on a phishing email.

Optimize for recall in our phishing detection model.

# What does "positive" mean?

In [None]:
positive = "coffee"
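
Because the choice of positive class is arbitrary, precision and recall change when it is flipped, while accuracy does not. The sketch below recomputes both metrics for each choice on the coffee data; `precision_recall` is a hypothetical helper written for this example, using plain lists rather than the DataFrame.

```python
actual = ['coffee', 'no coffee', 'no coffee', 'coffee',
          'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee',
              'coffee', 'coffee', 'no coffee', 'no coffee']


def precision_recall(actual, prediction, positive):
    """Return (precision, recall) treating `positive` as the positive label."""
    pairs = list(zip(actual, prediction))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    # Guard against division by zero (e.g. a model that never predicts positive).
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


for positive in ("coffee", "no coffee"):
    p, r = precision_recall(actual, prediction, positive)
    print(f"positive={positive!r}: precision={p:.3f}, recall={r:.3f}")
```

With "coffee" as positive this gives precision 0.750 and recall 0.600; with "no coffee" as positive it gives precision 0.500 and recall 0.667. Always state which class is positive when reporting these metrics.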



# Recap
In short:

- optimize for **precision** when you want to be sure about your positive predictions
- optimize for **recall** when you don't want to miss actual positive cases