# Evaluating Classification: Confusion Matrix

In [None]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt

from sklearn.utils import resample
from sklearn.datasets import load_breast_cancer, load_iris, make_classification
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, accuracy_score,
    f1_score, log_loss, roc_curve, roc_auc_score, classification_report,
    ConfusionMatrixDisplay
)

# Objectives

- Calculate and interpret a confusion matrix
- Calculate and interpret classification metrics such as accuracy, recall, and precision
- Choose classification metrics appropriate to a business problem

# Motivation

There are many ways to evaluate a classification model, and your choice of evaluation metric can have a major impact on how well your model serves its intended goals. This lecture reviews common classification metrics and discusses how to choose among them.

# Scenario: Identifying Fraudulent Credit Card Transactions

Credit card companies often try to identify whether a transaction is fraudulent at the time when it occurs, in order to decide whether to approve it. Let's build a classification model to try to classify fraudulent transactions! 

The data for this example comes from [this Kaggle dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud).

In [None]:
# Code to downsample from original dataset
#
# credit_data = pd.read_csv('creditcard.csv')
# credit_data_small = credit_data.iloc[0:10000]
# credit_data_small.describe()
# credit_data_small.to_csv('credit_fraud_small.csv', index=False)

In [None]:
credit_data = pd.read_csv('data/credit_fraud_small.csv')

The dataset contains features for the transaction amount, the relative time of the transaction, and 28 other features formed using PCA. The target 'Class' is 1 if the transaction was fraudulent and 0 otherwise.

In [None]:
credit_data.head()

## EDA

Let's see what we can learn from some summary statistics.

In [None]:
credit_data.describe()

**Question**: What can we learn from the mean of the target 'Class'?

<details>
<summary>Answer</summary>
Fraudulent transactions are rare - only 0.4% of transactions were fraudulent
</details>

In [None]:
credit_data['Class'].value_counts(normalize=True)

## Logistic Regression

Let's run a logistic regression model using all of our features.

In [None]:
# Separate data into feature and target DataFrames
X = credit_data.drop('Class', axis=1)
y = credit_data['Class']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scale the data for modeling
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Train a logistic regression model with the train data
cred_model = LogisticRegression(random_state=42)
cred_model.fit(X_train_sc, y_train)

## Evaluation

Let's calculate the accuracy score for our model using cross validation.

In [None]:
cv_results = cross_validate(estimator=cred_model, X=X_train_sc, y=y_train, return_train_score=True)

In [None]:
cv_results

In [None]:
cv_results['test_score'].mean()

In [None]:
cv_results['train_score'].mean()

In [None]:
cred_model.score(X_test_sc, y_test)

In [None]:
credit_data['Class'].value_counts(normalize=True)

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_sc, y_train)
dummy.score(X_test_sc, y_test)

In [None]:
y_test.value_counts(normalize=True)

That seems great, right? Maybe... too great? Let's dig in deeper.
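
To see why high accuracy alone may not mean much here, consider a minimal sketch on made-up labels. The arrays `X_toy` and `y_toy` are hypothetical and are not drawn from the credit card data. A baseline that always predicts "not fraud" scores very well on accuracy while catching no fraud at all:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1,000 transactions, 4 of them fraudulent (0.4%)
y_toy = np.zeros(1000, dtype=int)
y_toy[:4] = 1
X_toy = np.zeros((1000, 1))  # the dummy model ignores the features

# A "model" that always predicts the most frequent class (not fraud)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
y_toy_pred = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, y_toy_pred)  # share of correct predictions
baseline_rec = recall_score(y_toy, y_toy_pred)    # share of fraud cases caught
print(baseline_acc, baseline_rec)
```

Any real model should be judged against this kind of baseline, not against 0%.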

## Confusion Matrix

Let's consider the four categories of predictions our model might have made:

* Predicting that a transaction was fraudulent when it actually was (**true positive** or **TP**)
* Predicting that a transaction was fraudulent when it actually wasn't (**false positive** or **FP**)
* Predicting that a transaction wasn't fraudulent when it actually was (**false negative** or **FN**)
* Predicting that a transaction wasn't fraudulent when it actually wasn't (**true negative** or **TN**)

<img src='images/precisionrecall.png' width=70%/>

The **confusion matrix** gives us all four of these values.

In [None]:
y_pred = cred_model.predict(X_test_sc)
cm_1 = confusion_matrix(y_test, y_pred)
cm_1

In [None]:
# More visual representation
display = ConfusionMatrixDisplay(confusion_matrix=cm_1)
display.plot();

In [None]:
# Same plot, with readable class labels
display = ConfusionMatrixDisplay(confusion_matrix=cm_1, display_labels=['No Fraud', 'Fraud'])
display.plot();

In [None]:
# Confusion matrix on the training data (compare with the test matrix to check for overfitting)
ConfusionMatrixDisplay(confusion_matrix(y_train, cred_model.predict(X_train_sc))).plot();

Notice the way that sklearn displays its confusion matrix: The rows are \['actually false', 'actually true'\]; the columns are \['predicted false', 'predicted true'\].

So it displays:

$\begin{bmatrix}
TN & FP \\
FN & TP
\end{bmatrix}$
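
As a small illustration of this layout, the sketch below builds a confusion matrix from hypothetical labels (`y_true_toy` and `y_pred_toy` are made up for this example) and unpacks the four cells in the order sklearn stores them:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = not fraud
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# ravel() flattens the 2x2 matrix row by row, so the order is TN, FP, FN, TP
tn_toy, fp_toy, fn_toy, tp_toy = cm_toy.ravel()

print(cm_toy)
print(tn_toy, fp_toy, fn_toy, tp_toy)
```

Note that other libraries and textbooks sometimes put the positive class first, so it is worth checking the orientation before reading values off a matrix.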

**Question:** Do you see anything surprising in the confusion matrix?

-

## Classification Metrics

Let's calculate some common classification metrics and consider which would be most useful for this scenario.

In [None]:
tn = cm_1[0, 0]
fp = cm_1[0, 1]
fn = cm_1[1, 0]
tp = cm_1[1, 1]

## Accuracy

**Accuracy** = $\frac{TP + TN}{TP + TN + FP + FN}$

In words: How often did my model correctly identify transactions (fraudulent or not fraudulent)? This should give us the same value as we got from the `.score()` method.

In [None]:
acc = (tp + tn) / (tp + tn + fp + fn)
print(acc)

In [None]:
# Via sklearn
accuracy_score(y_test, y_pred)

## Recall

**Recall** = **Sensitivity** = $\frac{TP}{TP + FN}$

In words: How many of the actually fraudulent transactions did my model identify? 

In [None]:
rec = tp / (tp + fn)
print(rec)

In [None]:
# Via sklearn
recall_score(y_test, y_pred)

**Question:** Do you think a credit card company would consider recall to be an important metric? Why or why not?

## Precision

**Precision** = $\frac{TP}{TP + FP}$

In words: How often was my model's prediction of 'fraudulent' correct?

In [None]:
prec = tp / (tp + fp)
print(prec)

In [None]:
# Via sklearn
precision_score(y_test, y_pred)

**Question:** Do you think a credit card company would care more about recall or precision?

## $F$-Scores

The $F$-score is a combination of precision and recall, which can be useful when both are important for a business problem. 

Most common is the **$F_1$ Score**, which is an equal balance of the two using a [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean).

$$F_1 = 2 \frac{Pr \cdot Rc}{Pr + Rc} = \frac{2TP}{2TP + FP + FN}$$

> _Recall that a **score** typically means higher is better._

In [None]:
f1_scores = 2*prec*rec / (prec + rec)
print(f1_scores)

In [None]:
# Via sklearn
f1_score(y_test, y_pred)

**Question:** Which of these metrics do you think a credit card company would care most about when trying to flag fraudulent transactions to deny?

We can generalize this score to the **$F_\beta$ Score** where increasing $\beta$ puts more importance on _recall_:

$$F_\beta =  \frac{(1+\beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$
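
The effect of $\beta$ can be checked numerically. The sketch below uses hypothetical labels (not the credit card data) in which recall is higher than precision, and compares a hand-written version of the formula (the helper `f_beta` is illustrative) with sklearn's `fbeta_score`:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: TP = 3, FN = 1, FP = 3, TN = 3
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

prec_toy = precision_score(y_true_toy, y_pred_toy)  # 3 / (3 + 3)
rec_toy = recall_score(y_true_toy, y_pred_toy)      # 3 / (3 + 1)

def f_beta(precision, recall, beta):
    """F-beta computed directly from the formula above."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

f_half = fbeta_score(y_true_toy, y_pred_toy, beta=0.5)  # leans toward precision
f_one = fbeta_score(y_true_toy, y_pred_toy, beta=1)     # same as the F1 score
f_two = fbeta_score(y_true_toy, y_pred_toy, beta=2)     # leans toward recall

print(f_half, f_one, f_two)
print(f_beta(prec_toy, rec_toy, 2))
```

Because recall exceeds precision in this toy data, the score rises as $\beta$ grows; with the roles reversed it would fall.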

## `classification_report()`

You can get all of these metrics using the `classification_report()` function. 

- The top rows show the metrics you would get by treating each label in turn as the "positive" class
- **Support** shows the sample size in each class
- The averages in the bottom two rows are across the rows in the class table above (useful when there are more than two classes)
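
To make the two averages concrete, here is a small sketch on hypothetical labels (8 negatives, 2 positives). It asks for the report as a dictionary with `output_dict=True` so the individual numbers can be inspected:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 8 samples of class 0, then 2 samples of class 1
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 7 + [1] + [1, 0]

report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Per-class rows are keyed by the label converted to a string
print(report['0']['recall'], report['1']['recall'])

# 'macro avg' is the plain mean of the per-class values;
# 'weighted avg' weights each class by its support
print(report['macro avg']['recall'])
print(report['weighted avg']['recall'])
```

With imbalanced classes the weighted average is pulled toward the majority class, while the macro average treats every class equally.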

In [None]:
print(classification_report(y_test, y_pred))

# Exercise: Breast Cancer Prediction

Let's evaluate a model using Scikit-Learn's breast cancer dataset:

In [None]:
# Load the data
preds, target = load_breast_cancer(return_X_y=True)

# Split into train and test


# Scale the data


# Run the model
bc_model = None  # replace None with your fitted model


## Task

Calculate the following for this model:

- Confusion Matrix
- Accuracy
- Precision
- Recall
- F1 Score

Discuss: Which one would you choose to evaluate the model for use as a diagnostic tool to detect breast cancer? Why?

In [None]:
# Your work here

In [None]:
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#

# Multiclass Classification

What if our target has more than two classes?

**Multiclass classification** problems have more than two possible values for the target. For example, your target would have 10 possible values if you were trying to [classify an image of a hand-written number as a digit from 0 to 9](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html). 

In these cases, we can use the same methods to evaluate our models. Confusion matrices will no longer be 2x2, but will have a number of rows/columns equal to the number of classes. 

When calculating metrics like precision, we choose one class to be the "positive" class, and the rest are assigned to the "negative" class. 
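
A minimal sketch of this one-vs-rest idea, using hypothetical three-class labels: per-class precision can be read off the confusion matrix as each diagonal entry divided by its column total, which should match `precision_score` with `average=None`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels for a three-class problem
y_true_toy = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

# Rows are actual classes, columns are predicted classes
cm_multi = confusion_matrix(y_true_toy, y_pred_toy)

# Treat each class in turn as "positive":
# TP is the diagonal entry, TP + FP is the column total
manual_precision = np.diag(cm_multi) / cm_multi.sum(axis=0)

# average=None returns one precision value per class
sk_precision = precision_score(y_true_toy, y_pred_toy, average=None)

print(cm_multi)
print(manual_precision)
print(sk_precision)
```

Dividing the diagonal by the row totals (`axis=1`) instead would give per-class recall.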

An example of comparing multiclass confusion matrices (letter recognition for two different models from [this repo](https://github.com/MrGeislinger/ASLTransalation)):

![https://github.com/MrGeislinger/ASLTransalation/blob/main/fingerspelling/paper/images/resnet50_confusionMatrix.png](images/resnet50_confusionMatrix.png)
![https://raw.githubusercontent.com/MrGeislinger/ASLTransalation/main/fingerspelling/paper/images/vgg16_confusionMatrix.png](images/vgg16_confusionMatrix.png)

# Summary: Which Metric Should I Care About?

Well, it depends.

Accuracy:
- Pro: Takes into account both false positives and false negatives.
- Con: Can be misleadingly high when there is a significant class imbalance. (A lottery-ticket predictor that *always* predicts a loser will be highly accurate.)

Recall:
- Pro: Highly sensitive to false negatives.
- Con: No sensitivity to false positives.

Precision:
- Pro: Highly sensitive to false positives.
- Con: No sensitivity to false negatives.

F-1 Score:
- Harmonic mean of recall and precision.

The nature of your business problem will help you determine which metric matters.

Sometimes false positives are much worse than false negatives. Arguably, a model that compares a sample of crime-scene DNA with the DNA in a city's database of its citizens presents one such case. Here a false positive would mean falsely identifying someone as having been present at a crime scene, whereas a false negative would mean only that we fail to identify someone who really was present.

On the other hand, consider a model that inputs X-ray images and predicts the presence of cancer. Here false negatives are surely worse than false positives: A false positive means only that someone without cancer is misdiagnosed as having it, while a false negative means that someone with cancer is misdiagnosed as *not* having it.

# Level Up: Cost Matrix

One might assign different weights to the costs associated with false positives and false negatives. (We'll standardly assume that the costs associated with *true* positives and negatives are negligible.)

**Example**. Suppose we are in the DNA prediction scenario above. Then we might construct the following cost matrix:

In [None]:
cost = np.array([[0, 10], [3, 0]])
cost

This cost matrix will allow us to compare models if we have access to those models' rates of false positives and false negatives, i.e. if we have access to the models' confusion matrices!

**Problem**. Given the cost matrix above and the confusion matrices below, which model should we go with?

In [None]:
conf1, conf2 = np.array([[100, 10], [30, 300]]), np.array([[120, 20], [0, 300]])

print(conf1, 2*'\n', conf2)

In [None]:
cost1 = (10*10) + (30*3)
cost2 = (20*10) + (0*3)
cost1, cost2
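
The same totals can be computed without picking out cells by hand. This is a minimal sketch, assuming the cost matrix and the confusion matrices share the same layout (rows are actual classes, columns are predicted classes); `total_cost` is a hypothetical helper, not an sklearn function.

```python
import numpy as np

cost = np.array([[0, 10], [3, 0]])
conf1 = np.array([[100, 10], [30, 300]])
conf2 = np.array([[120, 20], [0, 300]])

def total_cost(conf, cost):
    """Multiply each count by its matching cost and add everything up."""
    return int((conf * cost).sum())

cost1_total = total_cost(conf1, cost)
cost2_total = total_cost(conf2, cost)
print(cost1_total, cost2_total)
```

Under these costs the first model comes out cheaper even though it makes more errors in total, because its errors are mostly the less costly false negatives.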

# Level Up: Multiclass Example

In [None]:
flowers = load_iris()

In [None]:
print(flowers.DESCR)

In [None]:
dims_train, dims_test, spec_train, spec_test = train_test_split(flowers.data,
                                                                flowers.target,
                                                                test_size=0.5,
                                                                random_state=42)

In [None]:
spec_train[:5]

In [None]:
ss_f = StandardScaler()

ss_f.fit(dims_train)

dims_train_sc = ss_f.transform(dims_train)
dims_test_sc = ss_f.transform(dims_test)

In [None]:
logreg_f = LogisticRegression(multi_class='multinomial',
                              C=0.01, random_state=42)

logreg_f.fit(dims_train_sc, spec_train)

In [None]:
print(classification_report(spec_test,
              logreg_f.predict(dims_test_sc)))

In [None]:
precision_score(spec_test, logreg_f.predict(dims_test_sc), average='weighted')