# Classification Metrics: Introducing the Confusion Matrix

## Objectives

- Calculate and interpret a confusion matrix
- Calculate and interpret classification metrics such as accuracy, recall, and precision
- Choose classification metrics appropriate to a business problem

# Motivation

There are many ways to evaluate a classification model, and your choice of evaluation metric can have a major impact on how well your model serves its intended goals. This lecture reviews common classification metrics and what to consider when choosing among them.

Let's start off with a page from [Google's Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative) and talk about a classic classification problem:

## The Boy Who Cried 'Wolf'

![adorable wolf image from instagram user fablefire: https://www.instagram.com/p/CCGgVLGFneE/](images/awoo.png)

In the old fable about the boy who cried 'wolf' there are two possible outcomes: 

- **No Wolf** - negative outcome, or 0
- **Wolf** - positive outcome, or 1

(I know, having a wolf arrive is not "positive" - but it is what we're trying to predict)

Think of this as a model in which the shepherd predicts whether or not a wolf will threaten the flock of sheep:

![outcome description for wolf scenarios as a confusion matrix](images/wolf_confusion_matrix.png)
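To make the four outcomes concrete, here is a minimal sketch that tallies them for a handful of made-up wolf sightings. The `actual` and `predicted` lists and the `count_outcomes` helper are invented for illustration and are not part of the original lesson.

```python
# Toy data, invented for illustration: 1 = wolf, 0 = no wolf
actual    = [0, 0, 1, 0, 1, 0, 0, 1]   # what really happened
predicted = [1, 0, 1, 0, 0, 0, 1, 1]   # what the shepherd cried


def count_outcomes(actual, predicted, positive=1):
    """Tally true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # The shepherd cried 'wolf': right (TP) or a false alarm (FP)
            counts["TP" if a == positive else "FP"] += 1
        else:
            # The shepherd stayed quiet: a missed wolf (FN) or correctly quiet (TN)
            counts["FN" if a == positive else "TN"] += 1
    return counts


outcomes = count_outcomes(actual, predicted)
print(outcomes)
```

Every observation lands in exactly one of the four buckets, so the counts always sum to the number of observations.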

So what does that look like with data?

In [None]:
# All of the imports

import pandas as pd
import numpy as np
np.random.seed(0)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml, load_breast_cancer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import recall_score, precision_score, f1_score
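from sklearn.metrics import confusion_matrix
try:
    from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2
except ImportError:
    # Newer versions: ConfusionMatrixDisplay.from_estimator takes the same arguments
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator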

In [None]:
# Getting the data from sklearn
dfX, dfy = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
# Cleaning a bit to get to a full dataframe of the data
df = dfX.copy()
df = df.drop(columns=['boat', 'body', 'home.dest'])
df['survived'] = dfy

df.head()

In [None]:
df['survived'].value_counts()

### Model-less Baseline

First of all, I want to see how well a model would do if it only ever predicted the majority class. In other words, if the model only predicts that no one survives, what percentage of the time would it be right?

How do we do this? Find the number of passengers who didn't survive, divide by the total number of passengers - which `value_counts` will do for us if we set `normalize=True`.
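As a rough sketch of that arithmetic, the snippet below finds the majority class and its share using only the standard library. The `labels` list is hypothetical (it is not the Titanic data), and `Counter` stands in for what `value_counts(normalize=True)` would report on a pandas Series.

```python
from collections import Counter

# Hypothetical labels, not the Titanic data: '0' = did not survive, '1' = survived
labels = ['0', '0', '1', '0', '1', '0', '0', '1', '0', '0']

counts = Counter(labels)

# The most common label and how many times it appears
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is right exactly this often
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

Any real model should beat this number before its accuracy is worth getting excited about.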

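To see the same result as a set of predictions, we can predict '0' for every passenger and score those predictions: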

In [None]:
y_actual = df['survived']

In [None]:
# Labels from fetch_openml are the strings '0' and '1', so the baseline predicts '0' for everyone
y_pred_baseline = ['0'] * len(df)

In [None]:
accuracy_score(y_actual, y_pred_baseline)

In [None]:
# A confusion matrix
confusion_matrix(y_actual, y_pred_baseline)

Or, prettier: 

<img alt="table view with colors to show results of modelless baseline" src="images/full_titanic_modelless_baseline_cm.png" height=200 width=200>

#### Evaluate:

What is this showing us? Why two zeros on the right side?

- 


## Confusion Matrix &rarr; Classification Metrics

The block above, where we hashed out true negatives, true positives, false negatives, and false positives, is called a **Confusion Matrix**: a summary of how well a classification model predicted each class. One axis holds the _predicted_ labels and the other holds the _actual_ labels, so you can see not just how many mistakes a model makes but, more importantly, what kinds of mistakes it makes.
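Here is a minimal sketch of how such a matrix can be assembled by hand. It follows the layout that scikit-learn's `confusion_matrix` returns, with rows for actual labels and columns for predicted labels; the toy data and the `confusion_matrix_2x2` helper are made up for illustration.

```python
def confusion_matrix_2x2(actual, predicted, labels=(0, 1)):
    """Build a 2x2 matrix with rows = actual label, columns = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Toy data, invented for illustration
actual    = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 0, 1, 1]

cm_toy = confusion_matrix_2x2(actual, predicted)

# With labels ordered (0, 1), the layout is [[TN, FP], [FN, TP]]
for row in cm_toy:
    print(row)
```

Reading across a row tells you what the model predicted for one actual class; reading down a column tells you what was really there when the model made one kind of prediction.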

So - how does a confusion matrix translate into classification metrics?

### Confusion Matrix Interpretation


<img alt="confusion matrix interpretation with metrics" src="images/confusion_matrix_interpretation.png" height=600 width=600>

Note that I've highlighted the most often used metrics in blue above. 

In other words, those metrics are:

- Accuracy: All True Predictions / All Predictions

- Precision score: TP / All Predicted Positives

- Recall or Sensitivity: TP / All Actual Positives 
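As a quick worked example, the sketch below applies those three definitions to a set of made-up counts. The `toy_*` values are hypothetical and have nothing to do with the Titanic model we build later.

```python
# Hypothetical counts, chosen only to make the arithmetic easy to follow
toy_tp, toy_fp, toy_fn, toy_tn = 40, 10, 20, 30

# Accuracy: all true predictions / all predictions
toy_accuracy = (toy_tp + toy_tn) / (toy_tp + toy_fp + toy_fn + toy_tn)

# Precision: TP / all predicted positives
toy_precision = toy_tp / (toy_tp + toy_fp)

# Recall: TP / all actual positives
toy_recall = toy_tp / (toy_tp + toy_fn)

print(f"accuracy:  {toy_accuracy:.3f}")
print(f"precision: {toy_precision:.3f}")
print(f"recall:    {toy_recall:.3f}")
```

Notice that precision and recall share a numerator and differ only in the denominator: precision is penalized by false positives, recall by false negatives.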

One more score that's often referenced balances precision and recall: the [**F1 Score**](https://en.wikipedia.org/wiki/F1_score).

$$ \text{F1 Score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
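To get a feel for what "balances" means here, this small sketch compares the F1 score with a plain average. The `f1` helper and the example values are assumptions made for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# When precision and recall agree, F1 equals both of them
balanced = f1(0.8, 0.8)

# When they disagree, F1 is pulled toward the lower of the two
lopsided = f1(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced F1:     {balanced:.3f}")
print(f"lopsided F1:     {lopsided:.3f}")
print(f"arithmetic mean: {arithmetic_mean:.3f}")
```

A model cannot earn a high F1 by doing well on only one of the two metrics, which is the reason to prefer it over a simple average.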



**Let's Discuss**: Why might we care more about precision than recall, or vice versa? In other words, which one of these would you think is the **primary metric** for the business problem of predicting whether or not someone survived the Titanic?

- 



Let's calculate the above highlighted classification metrics and consider which would be most useful for this scenario.

First, though, we'll create a real model for the Titanic data, generally following the strategy outlined by scikit-learn [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html). We'll use just three columns, and we'll set `drop='first'` in our one-hot encoder to reduce multicollinearity.

In [None]:
# Define our X and y
X = df[['pclass', 'sex', 'age']]
y = df['survived']

X.head()

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# Define our preprocessor
numeric_features = ["age"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

categorical_features = ["sex", "pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="error", drop='first')

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [None]:
# Fit our preprocessor, then transform train and test
preprocessor.fit(X_train)

X_train_pr = preprocessor.transform(X_train)
X_test_pr = preprocessor.transform(X_test)

In [None]:
# Instantiate and fit our model, then grab train and test predictions
model = LogisticRegression()
model.fit(X_train_pr, y_train)

train_preds = model.predict(X_train_pr)
test_preds = model.predict(X_test_pr)

In [None]:
# Show the confusion matrix for our test set
cm = confusion_matrix(y_test, test_preds)
cm

In [None]:
# Visualize that a bit nicer, using sklearn's function to plot CMs
plot_confusion_matrix(model, X_test_pr, y_test);

### Evaluate:

What is a false positive in this context?

- 


What is a false negative in this context?

- 


Which is worse?

- 


## Explore Our Metrics

In [None]:
# Define our true positives, true negatives, false positives, and false negatives
tn = cm[0, 0]
fp = cm[0, 1]
fn = cm[1, 0]
tp = cm[1, 1]

### Accuracy
$\frac{TP + TN}{TP + TN + FP + FN}$

In words: How often did my model correctly identify whether or not someone survived? 

In [None]:
# Code it here

Note: accuracy is the default metric for scikit-learn classifiers, so it is the score we get when we call `.score`

In [None]:
model.score(X_test_pr, y_test)

### Recall

AKA **Sensitivity**

$\frac{TP}{TP + FN}$

In words: How many of those who actually survived did my model identify? 

In [None]:
# Code it here

### Precision

$\frac{TP}{TP + FP}$

In words: How often was my model's prediction of 'survived' correct?

In [None]:
# Code it here

### F-Scores

An $F$-score is a combination of precision and recall, which can be useful when both are important for a business problem. 

Most common is the **$F_1$ Score**, which is an equal balance of the two using a [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean).

$$F_1 = 2 \frac{Pr \cdot Rc}{Pr + Rc} = \frac{2TP}{2TP + FP + FN}$$
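If the second equality isn't obvious, here is one way to derive it, substituting $Pr = \frac{TP}{TP + FP}$ and $Rc = \frac{TP}{TP + FN}$ and assuming $TP > 0$ so that it can be cancelled:

$$
\begin{aligned}
F_1 &= 2 \frac{Pr \cdot Rc}{Pr + Rc}
     = \frac{2 \cdot \frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}{\frac{TP}{TP + FP} + \frac{TP}{TP + FN}} \\
    &= \frac{2\,TP^2}{TP\,(TP + FN) + TP\,(TP + FP)} \\
    &= \frac{2TP}{2TP + FP + FN}
\end{aligned}
$$

The second line multiplies the numerator and denominator by $(TP + FP)(TP + FN)$. Note that true negatives never appear, so $F_1$ ignores how well the model handles the negative class.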

In [None]:
# Code it here

We can generalize this score to the **$F_\beta$ Score** where increasing $\beta$ puts more importance on _recall_:

$$F_\beta = \frac{(1+\beta^2) \cdot Pr \cdot Rc}{\beta^2 \cdot Pr + Rc}$$
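A minimal sketch of how $\beta$ shifts the score, using a hand-written `f_beta` helper and made-up precision and recall values (both are assumptions for illustration):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical model: strong precision, weaker recall
precision, recall = 0.8, 0.5

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f_one = f_beta(precision, recall, beta=1.0)   # the usual F1
f_two = f_beta(precision, recall, beta=2.0)   # leans toward recall

print(f"F0.5: {f_half:.3f}")
print(f"F1:   {f_one:.3f}")
print(f"F2:   {f_two:.3f}")
```

Because recall is the weaker metric in this example, the score drops as $\beta$ grows and recall counts for more. In practice, scikit-learn's `fbeta_score` computes this from labels and predictions directly.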

## `classification_report()`

You can get all of these metrics using the `classification_report()` function. 

- The top rows (here, `0` and `1`) show the statistics you would get by treating each label in turn as the "positive" class
    - The scores we calculated above all match the `1` row, since that's our positive class
- **Support** shows the number of actual samples in each class
- The bottom two rows average the per-class scores: `macro avg` is the unweighted mean across classes, and `weighted avg` weights each class by its support (useful when there are more than two classes)
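The two averages in the report, labeled `macro avg` and `weighted avg`, can differ noticeably when classes are imbalanced. This sketch computes both by hand for recall; the per-class numbers are hypothetical.

```python
# Hypothetical per-class recall and support (number of actual samples per class)
per_class = {
    "0": {"recall": 0.90, "support": 150},
    "1": {"recall": 0.60, "support": 50},
}

# Macro average: every class counts equally
macro_recall = sum(c["recall"] for c in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support
total_support = sum(c["support"] for c in per_class.values())
weighted_recall = sum(
    c["recall"] * c["support"] for c in per_class.values()
) / total_support

print(f"macro avg recall:    {macro_recall:.3f}")
print(f"weighted avg recall: {weighted_recall:.3f}")
```

The weighted average sits closer to the majority class's score, so the macro average is usually the more honest summary when the minority class is the one you care about.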

In [None]:
print(classification_report(y_test, test_preds))

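Scikit-learn also provides a function for each of these scores, such as the `accuracy_score`, `recall_score`, `precision_score`, and `f1_score` functions we imported earlier. You can see all of its classification metrics [here](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).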

# Exercise: Breast Cancer Prediction

Let's evaluate a model using Scikit-Learn's breast cancer dataset. [Data description available here](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)

This dataset has columns describing tumor details, and the target indicates whether or not a tumor is benign. In our target column:
- 0: Malignant
- 1: Benign
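Because the label coded as `1` is treated as "positive" by default, it's worth seeing how much precision and recall depend on that choice. The sketch below uses made-up labels (not the real dataset) and a hypothetical `precision_recall` helper to compute both metrics with each label in turn as the positive class.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall when `positive` is treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)


# Made-up labels for illustration: 0 = malignant, 1 = benign
actual    = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

benign_precision, benign_recall = precision_recall(actual, predicted, positive=1)
malignant_precision, malignant_recall = precision_recall(actual, predicted, positive=0)

print(f"benign as positive:    precision={benign_precision:.2f}, recall={benign_recall:.2f}")
print(f"malignant as positive: precision={malignant_precision:.2f}, recall={malignant_recall:.2f}")
```

The same predictions give quite different numbers depending on which class is called positive. In scikit-learn, functions such as `precision_score` and `recall_score` accept a `pos_label` argument to make that choice explicit.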

In [None]:
# Load the data
# (named `features` rather than `preds` to avoid confusion with model predictions)
features, target = load_breast_cancer(return_X_y=True)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=42)

# Scale the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

# Run the model
bc_model = LogisticRegression(solver='lbfgs', max_iter=10000, random_state=42)
bc_model.fit(X_train_sc, y_train)

## Task

**Step 1:** Calculate the following for this model:

- Confusion Matrix
- Accuracy
- Precision
- Recall
- F1 Score


In [None]:
# Your code here - confusion matrix

In [None]:
# Accuracy

In [None]:
# Precision

In [None]:
# Recall

In [None]:
# F1 Score

**Step 2:** Describe your business context:

- What is a false positive in this context?

    - 
    
- What is a false negative in this context?

    - 
    
- Which is worse?

    - 
    
- Based on the above questions, which metric would you want to optimize on?

    - 
