# Classification Model Evaluation

## Examples:

Imagine you're bringing coffee to a meeting, and you need to predict whether each person there will want one. Which metric should you choose? It depends on what each kind of mistake costs.

Outcomes (positive = the person wants a coffee):

- FP: Buy a coffee for someone who won't drink it
- FN: Don't buy a coffee for someone who wanted one
- TP: Buy a coffee for someone who will drink it
- TN: Don't buy a coffee for someone who wouldn't drink it anyway

Scenarios:

- Lola: really good coffee, but very expensive
    - the cost of a FP is higher than the cost of a FN
    - precision is the better metric here, because buying a cup for someone who won't drink it is expensive
    - we want to be sure about our positive predictions
- Taco Cabana: bad coffee, but cheap
    - the cost of a FN is higher than the cost of a FP
    - recall is the better metric here: because the coffee is cheap, it's not bad to buy a cup for someone who won't drink it; it's worse to skip someone who wanted one
- meeting with a very important client
    - the cost of a FN is higher, because they might be offended if we don't get them a coffee
    - cost of a FN == not signing a contract
    - recall is the better metric here
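The scenarios above come down to comparing what each kind of mistake costs. Here is a minimal sketch of that comparison. The dollar amounts, the two buying strategies, and the `total_cost` helper are all made up for illustration and are not part of the lesson's data.

```python
def total_cost(n_fp, n_fn, cost_fp, cost_fn):
    """Total cost of a set of predictions, given a cost per FP and per FN."""
    return n_fp * cost_fp + n_fn * cost_fn

# Two hypothetical strategies for the same meeting
cautious = {"n_fp": 0, "n_fn": 3}  # buys few coffees: nothing wasted, some people miss out
generous = {"n_fp": 3, "n_fn": 0}  # buys many coffees: some wasted, nobody misses out

# Hypothetical costs: with expensive coffee a wasted cup (FP) hurts most,
# with cheap coffee a missed cup (FN) is the bigger problem
expensive = {"cost_fp": 6.0, "cost_fn": 2.0}
cheap = {"cost_fp": 1.0, "cost_fn": 2.0}

for name, costs in [("expensive", expensive), ("cheap", cheap)]:
    print(name,
          "| cautious:", total_cost(**cautious, **costs),
          "| generous:", total_cost(**generous, **costs))
```

Under these assumed numbers, the cautious strategy (few false positives, which favors precision) is cheaper when coffee is expensive, and the generous strategy (few false negatives, which favors recall) is cheaper when coffee is cheap.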

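What if we never buy coffee, or buy coffee for everyone? Either rule is a baseline model: it makes the same prediction for every person, and it gives us something to compare our real model against.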

### Mini Exercise

**Scenario: Build a classifier to predict whether a given face should unlock the iPhone.**

**- What are the positive and negative cases?**
- positive case: the face belongs to the owner, so the iPhone should unlock
- negative case: the face does not belong to the owner, so the iPhone should stay locked

**- What are the possible outcomes?**

- FP: unlock the iPhone when it is not the correct face
- FN: do not unlock the iPhone when it is the correct face
- TP: unlock the iPhone when it is the correct face
- TN: do not unlock the iPhone when it is not the correct face


**- What are the costs of the outcomes?**
- the cost of a FP is higher, because anyone could then get access to the iPhone

**- Which metric should we use?**
- precision


**Scenario: Predict whether an email is spam or not. Emails marked as spam skip the inbox and go to the spam folder.**

**- What are the positive and negative cases?**
- positive case: the email is spam and should go to the spam folder
- negative case: the email is not spam and should go to the inbox

**- What are the possible outcomes?**
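- FP: an email that is not spam goes to the spam folder
- FN: a spam email goes to the inbox
- TP: a spam email goes to the spam folder
- TN: an email that is not spam goes to the inbox

**- What are the costs of the outcomes?**

**- Which metric should we use?**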

**Scenario: Predict whether an email is a phishing attempt. When we predict positive, show an additional banner warning the user that this might be a phishing email.**

**- What are the positive and negative cases?**

**- What are the possible outcomes?**

**- What are the costs of the outcomes?**

**- Which metric should we use?**
- recall

## Python Implementation

In [2]:
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})
df

|   | actual | prediction |
|---|--------|------------|
| 0 | coffee | no coffee |
| 1 | no coffee | no coffee |
| 2 | no coffee | coffee |
| 3 | coffee | coffee |
| 4 | coffee | coffee |
| 5 | coffee | coffee |
| 6 | no coffee | no coffee |
| 7 | coffee | no coffee |


## Confusion Matrix

In [3]:
pd.crosstab(df.prediction, df.actual)

| prediction | actual: coffee | actual: no coffee |
|------------|----------------|-------------------|
| coffee | 3 | 1 |
| no coffee | 2 | 2 |


- TP: predicted coffee, and they actually wanted coffee (3)
- FP: predicted coffee, but they didn't want coffee (1)
- FN: predicted no coffee, but they actually wanted coffee (2)
- TN: predicted no coffee, and they didn't want coffee (2)

Note:

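- our choice of which class is positive and which is negative is arbitrary
- the labels and layout of the confusion matrix vary between tools; here the rows are predictions and the columns are actual values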
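To make the four cells concrete, here is a minimal sketch that labels each row of the coffee data as TP, FP, FN, or TN using only the standard library. The `outcome` helper is a name introduced here for illustration; the lists repeat the `actual` and `prediction` columns from the DataFrame above.

```python
from collections import Counter

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

positive = 'coffee'  # our (arbitrary) choice of positive class

def outcome(actual_label, predicted_label):
    """Name the confusion-matrix cell for one (actual, predicted) pair."""
    correct = actual_label == predicted_label          # T or F
    predicted_positive = predicted_label == positive   # P or N
    return ('T' if correct else 'F') + ('P' if predicted_positive else 'N')

counts = Counter(outcome(a, p) for a, p in zip(actual, prediction))
print(counts)
```

The second letter always describes the prediction (P or N), and the first letter says whether that prediction was right. Changing `positive` to `'no coffee'` relabels the same eight rows without changing any prediction.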

## Metrics

- **accuracy**: (TP + TN) / (TP + TN + FP + FN)
    - (3 + 2) / (3 + 2 + 1 + 2) = 62.5%
- **precision**: TP / (TP + FP)
    - 3 / (3 + 1) = 75%
    - use when a FP is more costly than a FN
- **recall**: TP / (TP + FN)
    - 3 / (3 + 2) = 60%
    - use when a FN is more costly than a FP
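The three formulas above can be sketched as small functions that take the confusion-matrix counts directly. The function names are chosen here for illustration, and the sketch does not guard against a zero denominator (for example, precision when nothing is predicted positive).

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)

# Counts from the coffee confusion matrix
tp, fp, fn, tn = 3, 1, 2, 2

print(f'accuracy:  {accuracy(tp, tn, fp, fn):.1%}')
print(f'precision: {precision(tp, fp):.1%}')
print(f'recall:    {recall(tp, fn):.1%}')
```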

<div style="background: rgba(0, 150, 0, .25); padding: 1em 3em;">
    <p style="font-weight: bold">Sidebar: Baseline Models</p>
    <p>A <em>baseline model</em> is a model that we can compare to in order to see if the models we are creating are worthwhile.</p>
    <p>Depending on the context, this can mean either an existing rule-based model created with domain knowledge, or a model that makes predictions without any knowledge of the independent variables.</p>
</div>

In [4]:
df.actual.value_counts()

coffee       5
no coffee    3
Name: actual, dtype: int64

In [5]:
# baseline: always predict the most common actual value
df['baseline'] = 'coffee'

In [6]:
df

|   | actual | prediction | baseline |
|---|--------|------------|----------|
| 0 | coffee | no coffee | coffee |
| 1 | no coffee | no coffee | coffee |
| 2 | no coffee | coffee | coffee |
| 3 | coffee | coffee | coffee |
| 4 | coffee | coffee | coffee |
| 5 | coffee | coffee | coffee |
| 6 | no coffee | no coffee | coffee |
| 7 | coffee | no coffee | coffee |
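The baseline above hard-codes `'coffee'` because it is the most common actual value. A minimal standard-library sketch of deriving that label from the data instead of typing it in (the variable names are chosen here for illustration):

```python
from collections import Counter

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']

# most_common(1) returns a list with one (label, count) pair
most_common_label, count = Counter(actual).most_common(1)[0]

# The baseline predicts that label for every observation
baseline = [most_common_label] * len(actual)
baseline_accuracy = sum(a == b for a, b in zip(actual, baseline)) / len(actual)

print(most_common_label, count, baseline_accuracy)
```

Because this baseline always predicts the majority class, its accuracy is simply the share of observations in that class, here 5 out of 8.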


### Python Metric Implementation

In [7]:
(df.actual == df.prediction)

0    False
1     True
2    False
3     True
4     True
5     True
6     True
7    False
dtype: bool

In [8]:
# model accuracy
(df.actual == df.prediction).mean()

0.625

In [9]:
# baseline accuracy
(df.actual == df.baseline).mean()

0.625

In [10]:
# precision -- how good are our positive predictions?
# precision -- model performance given that we predicted positive
subset = df[df.prediction == 'coffee']
print(subset)
(subset.prediction == subset.actual).mean()

      actual prediction baseline
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee


0.75

In [11]:
# recall -- how often do we catch the actual positive cases?
# recall -- model performance given that the actual value is positive
subset = df[df.actual == 'coffee']
print(subset)
(subset.prediction == subset.actual).mean()

   actual prediction baseline
0  coffee  no coffee   coffee
3  coffee     coffee   coffee
4  coffee     coffee   coffee
5  coffee     coffee   coffee
7  coffee  no coffee   coffee


0.6

What will the precision and recall be for our baseline model, which always predicts the positive class?

In [12]:
# baseline precision
subset = df[df.baseline == 'coffee']
print(subset)
(subset.baseline == subset.actual).mean()

      actual prediction baseline
0     coffee  no coffee   coffee
1  no coffee  no coffee   coffee
2  no coffee     coffee   coffee
3     coffee     coffee   coffee
4     coffee     coffee   coffee
5     coffee     coffee   coffee
6  no coffee  no coffee   coffee
7     coffee  no coffee   coffee


0.625

In [13]:
# baseline recall
subset = df[df.actual == 'coffee']
print(subset)
(subset.baseline == subset.actual).mean()

   actual prediction baseline
0  coffee  no coffee   coffee
3  coffee     coffee   coffee
4  coffee     coffee   coffee
5  coffee     coffee   coffee
7  coffee  no coffee   coffee


1.0
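These two baseline numbers follow directly from the formulas. A model that always predicts positive never predicts negative, so FN = 0 and TN = 0, and every actual positive is a TP. With 5 actual coffee drinkers out of 8 people, TP = 5 and FP = 3:

```latex
\text{recall} = \frac{TP}{TP + FN} = \frac{5}{5 + 0} = 100\%
\qquad
\text{precision} = \frac{TP}{TP + FP} = \frac{5}{5 + 3} = 62.5\%
```

For an always-positive baseline, precision is just the share of actual positives in the data, which is why it matches the baseline accuracy.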

## What does "positive" mean?

In [14]:
positive = 'coffee'

# accuracy -- overall hit rate
model_accuracy = (df.prediction == df.actual).mean()
baseline_accuracy = (df.baseline == df.actual).mean()

# precision -- how good are our positive predictions?
# precision -- model performance | predicted positive
subset = df[df.prediction == positive]
model_precision = (subset.prediction == subset.actual).mean()
subset = df[df.baseline == positive]
baseline_precision = (subset.baseline == subset.actual).mean()

# recall -- how good are we at detecting actual positives?
# recall -- model performance | actual positive
subset = df[df.actual == positive]
model_recall = (subset.prediction == subset.actual).mean()
baseline_recall = (subset.baseline == subset.actual).mean()


print(f'''
positive: {positive}

         | accuracy | recall | precision
         | -------- | ------ | ---------         
   model | {model_accuracy:8.1%} | {model_recall:6.1%} | {model_precision:9.1%}
baseline | {baseline_accuracy:8.1%} | {baseline_recall:6.1%} | {baseline_precision:9.1%}
''')


positive: coffee

         | accuracy | recall | precision
         | -------- | ------ | ---------         
   model |    62.5% |  60.0% |     75.0%
baseline |    62.5% | 100.0% |     62.5%
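Precision and recall depend on which class we call positive. Here is a minimal standard-library sketch of what happens when we flip that choice. The `precision_recall` helper is a name introduced here; it returns `None` where the metric is undefined because the denominator is zero.

```python
actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']
baseline = ['coffee'] * len(actual)

def precision_recall(actual, predicted, positive):
    """Return (precision, recall) for the given positive class."""
    # actual labels of the rows we predicted positive
    predicted_pos = [a for a, p in zip(actual, predicted) if p == positive]
    # predicted labels of the rows that are actually positive
    actual_pos = [p for a, p in zip(actual, predicted) if a == positive]
    precision = (sum(a == positive for a in predicted_pos) / len(predicted_pos)
                 if predicted_pos else None)
    recall = (sum(p == positive for p in actual_pos) / len(actual_pos)
              if actual_pos else None)
    return precision, recall

for positive in ['coffee', 'no coffee']:
    print(positive,
          '| model:', precision_recall(actual, prediction, positive),
          '| baseline:', precision_recall(actual, baseline, positive))
```

With `'no coffee'` as the positive class, the always-coffee baseline never makes a positive prediction, so its recall drops to 0 and its precision is undefined. In the pandas version above, that same case comes out as `NaN`, because the mean of an empty subset is `NaN`. Accuracy is unaffected by the choice of positive class.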

