### Evaluation
[Course content](https://ds.codeup.com/classification/evaluation/)

**Objective:** Understand and apply various metrics used to evaluate the performance of a classification model. 

In [1]:
import pandas as pd 
import sklearn.metrics
from sklearn.metrics import confusion_matrix

In [2]:
### A dataframe which contains predicted values and actual values

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})


In [3]:
### View the dataframe
df
## Remember: the difference between supervised and unsupervised
## learning is the use of labels

Unnamed: 0,actual,prediction
0,coffee,no coffee
1,no coffee,no coffee
2,no coffee,coffee
3,coffee,coffee
4,coffee,coffee
5,coffee,coffee
6,no coffee,no coffee
7,coffee,no coffee


In [13]:
df['correct_prediction'] = df.actual == df.prediction

In [14]:
df

Unnamed: 0,actual,prediction,correct_prediction
0,coffee,no coffee,False
1,no coffee,no coffee,True
2,no coffee,coffee,False
3,coffee,coffee,True
4,coffee,coffee,True
5,coffee,coffee,True
6,no coffee,no coffee,True
7,coffee,no coffee,False


In [8]:
### Use a crosstab to count the outcomes

pd.crosstab(df.actual, df.prediction)

prediction,coffee,no coffee
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
coffee,3,2
no coffee,1,2


### Terminology

The two outcomes in classification are labeled as either **positive** or **negative**. 


While the designations are arbitrary, they impact how evaluation metrics are interpreted. 


### Evaluation on train, validate, and test


| Split |  Purpose |
| ----------- | :----------- |
| Train | Evaluate in-sample performance |
| Validate | Evaluate out-of-sample performance to tune hyperparameters |
| Test | Evaluate out-of-sample performance of the final model on unseen data |

### Confusion Matrix

A table that summarizes a model's outcomes by cross-tabulating actual values against predicted values.



| Designation      | Description |
| ----------- | ----------- |
| True Negative      | Model correctly predicted the negative outcome       |
| False Positive   | Model incorrectly predicted the positive outcome        |
| False Negative   | Model incorrectly predicted the negative outcome        |
| True Positive      | Model correctly predicted the positive outcome       |
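As a minimal sketch, the four outcomes in the table can be counted directly with boolean masks. This assumes 'coffee' is treated as the positive class; the variable names (`positive`, `actual_pos`, `predicted_pos`) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
    'prediction': ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee'],
})

positive = 'coffee'  # assumption: 'coffee' is the positive class

actual_pos = df.actual == positive         # rows that are truly positive
predicted_pos = df.prediction == positive  # rows the model called positive

tp = int((actual_pos & predicted_pos).sum())    # predicted positive, correctly
tn = int((~actual_pos & ~predicted_pos).sum())  # predicted negative, correctly
fp = int((~actual_pos & predicted_pos).sum())   # predicted positive, incorrectly
fn = int((actual_pos & ~predicted_pos).sum())   # predicted negative, incorrectly

print(f'TP={tp}, TN={tn}, FP={fp}, FN={fn}')
```

Swapping the value of `positive` to 'no coffee' exchanges TP with TN and FP with FN, which is why the choice of positive class affects how metrics are read.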



### Confusion Matrix with `sklearn`

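'coffee' is the positive outcome.

'no coffee' is the negative outcome.


The function `confusion_matrix` returns a 2x2 array for a two-class problem. Rows correspond to actual values and columns to predicted values, in the order given by `labels`.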

### Components of a confusion matrix
 
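For a confusion matrix $C$ whose rows are actual values and whose columns are predicted values, with labels ordered (negative, positive),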


| Index (row, column)      | Count of |
| ----------- | ----------- |
| $C_{0,0}$    |   True Negatives      |
| $C_{0,1}$    |   False Positives      |
| $C_{1,0}$    |   False Negatives      |
| $C_{1,1}$    |   True Positives      |

In [21]:
### Return a confusion matrix for the model's predictions
confusion_matrix(df.actual, df.prediction, 
                 labels=('no coffee', 'coffee'))

# Rows are actual values, columns are predicted values, both ordered as in `labels`

array([[2, 1],
       [2, 3]])
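One common way to pull the four counts out of the array is to flatten it with `.ravel()`. The sketch below assumes the same label order as above (negative first, positive second); with that order the flattened values come out as TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Negative label first, positive label second
matrix = confusion_matrix(actual, prediction, labels=['no coffee', 'coffee'])

# Row-major flattening: C[0,0], C[0,1], C[1,0], C[1,1]
tn, fp, fn, tp = matrix.ravel()

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```

If the labels were passed in the opposite order, the same unpacking would silently mislabel the counts, so it is worth setting `labels` explicitly.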

In [48]:
pd.crosstab(df.actual, df.prediction)

prediction,coffee,no coffee
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
coffee,3,2
no coffee,1,2


### Evaluation Metrics

### Accuracy 

Accuracy evaluates how many correct predictions (both positive and negative) were made over the total number of predictions. 


$\texttt{Accuracy} = \dfrac{TP + TN}{TP + FP + FN + TN}$
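As a quick check of the formula, the sketch below computes accuracy from the counts in this lesson's example (TP = 3, TN = 2, FP = 1, FN = 2) and compares it with `sklearn.metrics.accuracy_score`. The hard-coded counts are taken from the confusion matrix above.

```python
from sklearn.metrics import accuracy_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Counts from the confusion matrix, with 'coffee' as the positive class
tp, tn, fp, fn = 3, 2, 1, 2

accuracy_by_hand = (tp + tn) / (tp + fp + fn + tn)
accuracy_sklearn = accuracy_score(actual, prediction)

print(accuracy_by_hand, accuracy_sklearn)
```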

### Precision

Precision evaluates how many of the positive predictions were correct.

$\texttt{Precision} = \dfrac{TP}{TP + FP}$
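A minimal sketch of the same calculation with `sklearn.metrics.precision_score`. Because the labels here are strings, `pos_label` has to be set; 'coffee' is assumed to be the positive class.

```python
from sklearn.metrics import precision_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Of the 4 'coffee' predictions, 3 were actually 'coffee'
precision = precision_score(actual, prediction, pos_label='coffee')

print(f'Precision: {precision:.2%}')
```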

### Recall

Recall evaluates how many of the actual positive outcomes the model correctly identified.

$\texttt{Recall} = \dfrac{TP}{TP + FN}$
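The sketch below applies `sklearn.metrics.recall_score` to the example data, again assuming 'coffee' is the positive class.

```python
from sklearn.metrics import recall_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Of the 5 actual 'coffee' rows, the model caught 3
recall = recall_score(actual, prediction, pos_label='coffee')

print(f'Recall: {recall:.2%}')
```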


### Misclassification Rate

Misclassification rate concerns how many predictions were incorrect. This accounts for all other outcomes not included in the calculation of accuracy. 

$\texttt{Misclassification Rate} = 1 - \texttt{Accuracy}$

### Sensitivity (True Positive Rate)


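Sensitivity is another name for recall: the share of actual positive outcomes the model correctly identified.

$\texttt{True Positive Rate} = \dfrac{TP}{TP + FN}$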

### Specificity 

Specificity (the true negative rate) evaluates how many of the actual negative outcomes the model correctly identified.


$\texttt{Specificity} = \dfrac{TN}{FP + TN}$
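scikit-learn has no dedicated specificity function, but specificity is the recall of the negative class. As a sketch, it can be obtained by calling `recall_score` with `pos_label` set to the negative label ('no coffee' here).

```python
from sklearn.metrics import recall_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Treat the negative class as the "positive" label to get TN / (TN + FP)
specificity = recall_score(actual, prediction, pos_label='no coffee')

print(f'Specificity: {specificity:.2%}')
```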

### Negative Predictive Value

Negative predictive value evaluates how many of the negative predictions were correct.

$\texttt{NPV} = \dfrac{TN}{TN + FN}$
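NPV is the precision of the negative class, so one way to compute it (a sketch, not the only approach) is `precision_score` with `pos_label` set to the negative label.

```python
from sklearn.metrics import precision_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

# Of the 4 'no coffee' predictions, 2 were actually 'no coffee'
npv = precision_score(actual, prediction, pos_label='no coffee')

print(f'NPV: {npv:.2%}')
```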

### F1 Score

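The F1 score is the harmonic mean of precision and recall.

$\texttt{F1 Score} = 2 \times \dfrac{\texttt{Precision} \times \texttt{Recall}}{\texttt{Precision} + \texttt{Recall}}$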
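To illustrate, the sketch below computes F1 from this lesson's precision (0.75) and recall (0.6) and compares it against `sklearn.metrics.f1_score`, assuming 'coffee' is the positive class.

```python
from sklearn.metrics import f1_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']

precision, recall = 0.75, 0.6  # values for the example model

f1_by_hand = 2 * (precision * recall) / (precision + recall)
f1_sklearn = f1_score(actual, prediction, pos_label='coffee')

print(f'{f1_by_hand:.4f} {f1_sklearn:.4f}')
```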

## Baseline

The baseline is a simple model that is a reference point for the performance of other models. 

For a classification model, a common baseline predicts the most frequent class (the mode of the target) for every observation.
    

In [23]:
### Find the counts of each outcome
df.actual.value_counts()

coffee       5
no coffee    3
Name: actual, dtype: int64

In [24]:
### Set the baseline_prediction to be coffee
df['baseline'] = 'coffee'
df

Unnamed: 0,actual,prediction,correct_prediction,baseline
0,coffee,no coffee,False,coffee
1,no coffee,no coffee,True,coffee
2,no coffee,coffee,False,coffee
3,coffee,coffee,True,coffee
4,coffee,coffee,True,coffee
5,coffee,coffee,True,coffee
6,no coffee,no coffee,True,coffee
7,coffee,no coffee,False,coffee
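Rather than hard-coding the label, the baseline can be derived from the data. This is a small sketch using `Series.mode()`; if two classes were tied, `mode()` would return both and `[0]` would pick the first in sorted order.

```python
import pandas as pd

df = pd.DataFrame({
    'actual': ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee'],
})

# Most frequent class in the target column
most_common = df.actual.mode()[0]

# Predict that class for every row
df['baseline'] = most_common

print(most_common)
```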


<div class="alert alert-block alert-info">
    
### Evaluation Examples

## Accuracy 

In [29]:
# Compare predicted to actual

model_accuracy = df.correct_prediction.mean()

In [28]:
# Compare actual to baseline

(df.actual == df.baseline).mean()

0.625

In [38]:
print(f'Model accuracy: {model_accuracy:.2%}')

Model accuracy: 62.50%


## Recall

In [33]:
# Recall focuses on all the actual positive values
# Restrict to actual positive values ('coffee')

subset = df[df.actual == 'coffee']
subset

Unnamed: 0,actual,prediction,correct_prediction,baseline
0,coffee,no coffee,False,coffee
3,coffee,coffee,True,coffee
4,coffee,coffee,True,coffee
5,coffee,coffee,True,coffee
7,coffee,no coffee,False,coffee


In [36]:
# model recall

model_recall = (subset.prediction == subset.actual).mean()
model_recall

0.6

In [37]:
# Baseline recall

baseline_recall = (subset.baseline == subset.actual).mean()
baseline_recall

1.0

In [40]:
print(f'Model Recall: {model_recall:.2%}')
print(f'Baseline Recall: {baseline_recall:.2%}')

Model Recall: 60.00%
Baseline Recall: 100.00%


## Precision

In [42]:
# restrict to the positive values for the PREDICTED values

precision_subset = df[df.prediction == 'coffee']
precision_subset

Unnamed: 0,actual,prediction,correct_prediction,baseline
2,no coffee,coffee,False,coffee
3,coffee,coffee,True,coffee
4,coffee,coffee,True,coffee
5,coffee,coffee,True,coffee


In [44]:
# If we want to find the model's precision: we look at prediction and
# actual again

model_precision = (precision_subset.prediction == precision_subset.actual).mean()
model_precision

0.75

In [45]:
# Now let's compare it to the baseline's precision

baseline_subset = df[df.baseline == 'coffee']
baseline_precision = (baseline_subset.baseline == baseline_subset.actual).mean()

In [46]:
baseline_precision

0.625

In [47]:
baseline_subset

Unnamed: 0,actual,prediction,correct_prediction,baseline
0,coffee,no coffee,False,coffee
1,no coffee,no coffee,True,coffee
2,no coffee,coffee,False,coffee
3,coffee,coffee,True,coffee
4,coffee,coffee,True,coffee
5,coffee,coffee,True,coffee
6,no coffee,no coffee,True,coffee
7,coffee,no coffee,False,coffee
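The hand-computed values above can be cross-checked with scikit-learn. The sketch below builds a small comparison table for the model and the baseline; the `summary` name and table layout are illustrative choices, and 'coffee' is assumed to be the positive class.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

actual = ['coffee', 'no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'coffee']
prediction = ['no coffee', 'no coffee', 'coffee', 'coffee', 'coffee', 'coffee', 'no coffee', 'no coffee']
baseline = ['coffee'] * len(actual)  # always predict the most common class

rows = {}
for name, predicted in [('model', prediction), ('baseline', baseline)]:
    rows[name] = {
        'accuracy': accuracy_score(actual, predicted),
        'precision': precision_score(actual, predicted, pos_label='coffee'),
        'recall': recall_score(actual, predicted, pos_label='coffee'),
    }

# One row per set of predictions, one column per metric
summary = pd.DataFrame(rows).T
print(summary)
```

In this example the model and the baseline have the same accuracy, the baseline has higher recall, and the model has higher precision, so which one looks better depends on the metric chosen.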
