# Classification Metrics

In today's session:

- Confusion Matrix
- Classification Metrics
    - Accuracy
    - Recall (+ False Negative Rate)
    - Specificity (+ False Positive Rate)
    - Precision (+ Negative Predictive Value)
    - Balanced Accuracy
    - F1 Score
- AUC/ROC curve
- Classification Report

---

**Objectives:**

By the end of this session:

- Understand how to read a confusion matrix
- Given equations, know how to use a confusion matrix to calculate metrics
- Describe an AUC/ROC curve
- Attempt to create a classification report after fitting a model using sklearn

---

Understanding metrics for any machine learning problem is crucial to choosing the best-performing model for your use case. We use metrics not only to assess the performance of a single model, but also to compare it with the others we create.

In regression, we were introduced to metrics such as the MSE and the RMSE. In classification, we need to use slightly different error metrics, as we are predicting labels/categories rather than continuous numerical values.

## Let's first import some data...

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Read data in and view first few entries
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [3]:
df['stroke'].value_counts()

0    4861
1     249
Name: stroke, dtype: int64

In [4]:
# some very simple pre-processing before we get into modelling
df = df.dropna()

# labels
y = df['stroke']

# features
X = df.drop('stroke', axis=1)

# transforming the features
X_transformed = pd.get_dummies(X, drop_first=True)

#now let's do a train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=50)

In [5]:
# finally we train our very simple model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

# and make predictions
y_pred = lr.predict(X_test)

---

## 1. Confusion Matrix

![](https://media3.giphy.com/media/eIm624c8nnNbiG0V3g/giphy.gif)

A confusion matrix is a table used to evaluate the performance of a classification model. It compares actual and predicted labels of a set of data to create a summary of the model's performance.

A confusion matrix contains four counts: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

![](https://glassboxmedicine.files.wordpress.com/2019/02/confusion-matrix.png?w=816)

- **True Negative**: Predicted Negative, is Negative
- **False Positive**: Predicted Positive, is Negative *(Type I error)*
- **False Negative**: Predicted Negative, is Positive *(Type II error)*
- **True Positive**: Predicted Positive, is Positive
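
To make the four cells concrete, here is a minimal sketch that counts them by hand for a small set of made-up labels. The lists `toy_actual` and `toy_predicted` are invented for illustration and are not from the stroke dataset. The counts are laid out with rows as actual classes and columns as predicted classes, which is the layout scikit-learn's `confusion_matrix` uses.

```python
# Toy labels, invented for illustration: 1 = positive, 0 = negative
toy_actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
toy_predicted = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(toy_actual, toy_predicted))

# Count each of the four outcomes by comparing actual and predicted labels
toy_tn = sum(1 for a, p in pairs if a == 0 and p == 0)
toy_fp = sum(1 for a, p in pairs if a == 0 and p == 1)
toy_fn = sum(1 for a, p in pairs if a == 1 and p == 0)
toy_tp = sum(1 for a, p in pairs if a == 1 and p == 1)

# Rows = actual class, columns = predicted class
toy_matrix = [[toy_tn, toy_fp],
              [toy_fn, toy_tp]]
print(toy_matrix)
```

Reading the matrix row by row gives the order TN, FP, FN, TP.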


Let's put this in the example of COVID-19 testing...
- True Negative = Individual tests negative, and they truly do not have COVID
- False Positive = Individual tests positive, but they truly do not have COVID
- False Negative = ...?
- True Positive = ...?

In [6]:
# Let's create a confusion matrix for our model

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

cm

array([[938,   3],
       [ 40,   1]])

In [7]:
labels = ['0: no stroke', '1: stroke']

pd.DataFrame(data = confusion_matrix(y_test, y_pred), index = labels, columns = labels)

| | 0: no stroke | 1: stroke |
|---|---|---|
| 0: no stroke | 938 | 3 |
| 1: stroke | 40 | 1 |


In [8]:
# Set up our TN, FP, FN, and TP

tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)

938 3 40 1


---

## 2. Classification Metrics

![](https://media.tenor.com/hh6n7Ou_OnUAAAAC/math-hangover.gif)

### Accuracy

Ratio of total number of *correctly* predicted instances to total number of instances.

$ Accuracy = \frac{(TP + TN)}{(TP + FP + TN + FN)} $

In [9]:
accuracy = (tp + tn) / (tp + fp + tn + fn)

print("Accuracy: ", accuracy)

Accuracy:  0.9562118126272913


---

### Recall (+ False Negative Rate)

Recall (also known as **sensitivity** or **true positive rate**) is the ratio of true positive predictions to the total number of *actual* positive instances in the dataset. 

--> how well the model is able to identify positive instances out of all the positive instances in the dataset

Recall is important in cases where the cost of missing a positive instance is high - e.g. in medical diagnoses: predicting a false negative (saying someone doesn't have a disease when they in fact do) can be costly to the patient.

However, pushing recall higher (for example, by predicting the positive class more readily) usually increases the number of false positives as well.

$ Recall = \frac{TP}{TP + FN} $

**False Negative Rate**

Ratio of false negative predictions to the total number of actual positive instances in the dataset. A high FNR means the model is missing many positive instances. False negatives can be reduced by improving sensitivity/recall, but again this comes at the risk of increasing the number of false positives.

$FNR = \frac{FN}{TP + FN} $

In [10]:
recall = (tp) / (tp + fn)
fnr = (fn) / (tp + fn)

print("Recall: ", recall)
print("FNR: ", fnr)

Recall:  0.024390243902439025
FNR:  0.975609756097561
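
The trade-off between recall and false positives is easiest to see by moving the decision threshold. The sketch below uses made-up probabilities and labels (`toy_scores` and `toy_labels` are invented for illustration, not model output from this session) and compares two thresholds.

```python
# Toy data, invented for illustration: predicted probabilities and true labels
toy_scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def recall_and_fp(threshold):
    """Return (recall, false positive count) when predicting 1 for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in toy_scores]
    n_tp = sum(1 for y, p in zip(toy_labels, preds) if y == 1 and p == 1)
    n_fn = sum(1 for y, p in zip(toy_labels, preds) if y == 1 and p == 0)
    n_fp = sum(1 for y, p in zip(toy_labels, preds) if y == 0 and p == 1)
    return n_tp / (n_tp + n_fn), n_fp

for t in (0.5, 0.25):
    r, n_false_pos = recall_and_fp(t)
    print(f"threshold={t}: recall={r:.2f}, false positives={n_false_pos}")
```

Lowering the threshold from 0.5 to 0.25 catches every positive in this toy set, but the number of false positives goes up with it.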


---

### Specificity (+ False Positive Rate)

Specificity (aka **True Negative Rate**) is the ratio of true negative predictions to the total number of actual negative instances in the dataset.

--> how well our model is able to identify negative instances in our dataset

Specificity is important in situations where the cost of a false positive is high - e.g. security at an airport: we'd want high specificity to avoid unnecessary extra checks and invasions of privacy, but this can also lead to an increase in false negatives (i.e. more security risks).

$Specificity = \frac{TN}{TN + FP}$

**False Positive Rate**

Ratio of false positive predictions to the total number of actual negative instances in the dataset. A high FPR indicates the model is incorrectly classifying many negative instances as positive. False positives can be reduced by improving specificity, but again this comes at the risk of increasing the number of false negatives.



$FPR = \frac{FP}{TN + FP}$

In [11]:
specificity = (tn) / (tn + fp)
fpr = (fp) / (tn + fp)

print("Specificity: ", specificity)
print("FPR: ", fpr)

Specificity:  0.9968119022316685
FPR:  0.003188097768331562
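
Every actual negative ends up as either a true negative or a false positive, so specificity and FPR always sum to 1. The same holds for recall and FNR on the actual positives:

$Specificity + FPR = \frac{TN}{TN + FP} + \frac{FP}{TN + FP} = \frac{TN + FP}{TN + FP} = 1$

$Recall + FNR = \frac{TP}{TP + FN} + \frac{FN}{TP + FN} = \frac{TP + FN}{TP + FN} = 1$

With the values printed above: $0.9968 + 0.0032 = 1$, so in practice only one metric of each pair needs to be computed.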


---

### Precision (+ Negative Predictive Value)

Precision (aka **positive predictive value**) is the ratio of true positive predictions to the total number of positive predictions made. High precision indicates the model is making accurate positive predictions (i.e. fewer false positives) - however, it may still be missing many true positive instances.

High precision is important when the cost of a false positive is high, such as in medical diagnoses, where a false positive can lead to unnecessary follow-up tests or treatment.

$Precision = \frac{TP}{TP + FP}$

**Negative Predictive Value**

Ratio of true negative predictions to the total number of negative predictions made. Indicates how much we can trust the model when it predicts the negative class.

High NPV is important when the cost of a false negative is high, such as in medical diagnoses, where a negative result needs to be trustworthy.

$NPV = \frac{TN}{TN + FN}$

In [12]:
precision = (tp) / (tp + fp)
npv = (tn) / (tn + fn)

print("Precision: ", precision)
print("NPV: ", npv)

Precision:  0.25
NPV:  0.9591002044989775
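
One practical wrinkle: precision is undefined when the model makes no positive predictions at all ($TP + FP = 0$), which is a real possibility with a rare class like stroke. Below is a minimal sketch of one way to guard against that. The helper name `safe_ratio` and the choice to return `nan` are assumptions made for illustration.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or nan when the denominator is zero."""
    if denominator == 0:
        return math.nan
    return numerator / denominator

# Counts from the confusion matrix above: TP = 1, FP = 3, so precision is defined
print(safe_ratio(1, 1 + 3))

# Hypothetical model that never predicts the positive class: TP = FP = 0
print(safe_ratio(0, 0 + 0))
```

Returning `nan` rather than 0 keeps "undefined" distinct from "every positive prediction was wrong"; which convention is better depends on how the value is used downstream.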


---

### Balanced Accuracy

Average of **sensitivity** and **specificity**. Indicates how well our model predicts each class, giving both classes equal weight regardless of how many instances each one has. Particularly useful when dealing with **data imbalance**.

$\text{Balanced Accuracy} = \frac{Sensitivity + Specificity}{2}$

In [13]:
bal_acc = (recall + specificity) / 2

print("Balanced Accuracy: ", bal_acc)

Balanced Accuracy:  0.5106010730670537
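
To see why balanced accuracy is more informative under imbalance, the sketch below scores a hypothetical classifier that always predicts "no stroke" on the same test set (941 actual negatives and 41 actual positives, taken from the confusion matrix above). The baseline is an illustration only and was not fitted in this session.

```python
# Hypothetical always-negative baseline on the same test set:
# every actual negative is a TN, every actual positive is a FN
b_tn, b_fp, b_fn, b_tp = 941, 0, 41, 0

b_recall = b_tp / (b_tp + b_fn)          # finds none of the positives
b_specificity = b_tn / (b_tn + b_fp)     # gets every negative right
b_accuracy = (b_tp + b_tn) / (b_tn + b_fp + b_fn + b_tp)
b_balanced = (b_recall + b_specificity) / 2

print("Accuracy:         ", round(b_accuracy, 4))
print("Balanced accuracy:", b_balanced)
```

The baseline reaches roughly 96% accuracy without detecting a single stroke, while its balanced accuracy is 0.5. Our model's balanced accuracy of about 0.51 is barely above that floor.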


---

### F1 Score

The harmonic mean of **precision** and **recall**, which provides a balance between the two. Because it is a harmonic mean, the score is pulled towards the lower of the two values, so a model needs both good precision and good recall to score well.

F1 scores fall between 0 and 1, with 0 being the worst and 1 being the best.

Indicates how well our model handles the positive class, penalising both false positives and false negatives. It is particularly useful when the cost of false positives and false negatives is similar. Note that true negatives do not appear in the formula.

$\text{F1 score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

In [14]:
f1 = 2 * ((precision * recall) / (precision + recall))

print("F1 score: ", f1)

F1 score:  0.04444444444444444
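
As a quick check on the harmonic-mean behaviour, the sketch below compares the plain average of precision and recall with their harmonic mean, using the counts from the confusion matrix above. It also uses the equivalent form $F1 = \frac{2TP}{2TP + FP + FN}$, which is written directly in terms of the counts. The variable names are chosen for this example only.

```python
prec_val = 1 / (1 + 3)     # precision from the counts above: TP / (TP + FP)
rec_val = 1 / (1 + 40)     # recall from the counts above: TP / (TP + FN)

arithmetic_mean = (prec_val + rec_val) / 2
harmonic_mean = 2 * prec_val * rec_val / (prec_val + rec_val)

# Equivalent form written directly in terms of the counts (no TN involved)
f1_from_counts = 2 * 1 / (2 * 1 + 3 + 40)

print("Arithmetic mean:", round(arithmetic_mean, 4))
print("Harmonic mean:  ", round(harmonic_mean, 4))
print("From counts:    ", round(f1_from_counts, 4))
```

The harmonic mean sits much closer to the weak recall than the plain average does, which is the behaviour we want from a score that should not be rescued by one good component.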


---

**As always, sklearn has its ways of making our lives much easier. It has built in functions for many of these metrics!**

In [15]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [16]:
acc = accuracy_score(y_test, y_pred)

balanced_acc = balanced_accuracy_score(y_test, y_pred)

# store the result under a new name so we don't overwrite the imported f1_score function
f1_sklearn = f1_score(y_test, y_pred)

prec = precision_score(y_test, y_pred)

rec = recall_score(y_test, y_pred)

In [18]:
f1_sklearn

0.04444444444444444

---

## 3. AUC - ROC Curve

![](https://miro.medium.com/v2/resize:fit:494/1*EPmzi0GCgdLstsJb6Q8e-w.png)



**ROC Curve**:

- Receiver Operating Characteristic.
- Graphical representation of how well our classification model classifies! It is a plot of the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds.
- Can help us to identify the optimal classification threshold for the problem at hand
- A good model will have a high TPR and a low FPR, indicating that it is accurately predicting positive instances while minimising false positives
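
A minimal sketch of how ROC points are produced and how one might pick a threshold from them. The scores and labels are made up for illustration, and the selection rule used here (Youden's J statistic, which maximises TPR minus FPR) is just one common heuristic. In a real problem the best threshold depends on the relative costs of false positives and false negatives.

```python
# Toy data, invented for illustration
roc_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
roc_labels = [1,   1,   1,   0,   0,   1,   0,   0]

def tpr_fpr(threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in roc_scores]
    n_tp = sum(1 for y, p in zip(roc_labels, preds) if y == 1 and p == 1)
    n_fp = sum(1 for y, p in zip(roc_labels, preds) if y == 0 and p == 1)
    n_pos = sum(roc_labels)
    n_neg = len(roc_labels) - n_pos
    return n_tp / n_pos, n_fp / n_neg

# One ROC point (TPR, FPR) per candidate threshold
points = [(t, *tpr_fpr(t)) for t in roc_scores]
for t, tpr, fpr in points:
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}, J={tpr - fpr:.2f}")

# Youden's J: choose the threshold with the largest TPR - FPR
best_threshold = max(points, key=lambda row: row[1] - row[2])[0]
print("Best threshold by Youden's J:", best_threshold)
```

Plotting the FPR values on the x-axis against the TPR values on the y-axis gives the ROC curve for this toy data.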

![](https://dorianbrown.dev/assets/images/logreg/prob_threshold.png)

**AUC**:

- Area Under the Curve
- Represents the area under the ROC curve. It summarises the performance of a binary classification model across all thresholds in a single number, which always lies between 0 and 1.
- Greater value of AUC = better model performance.
- An AUC value of 0.5 means the classifier is doing no better than random chance, and a value below 0.5 means it is ranking instances worse than chance.
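
Another way to read AUC: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data by comparing every positive with every negative. This is for intuition only and is not how you would compute it on a large dataset.

```python
# Toy data, invented for illustration
auc_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
auc_labels = [1,   1,   1,   0,   0,   1,   0,   0]

pos_scores = [s for s, y in zip(auc_scores, auc_labels) if y == 1]
neg_scores = [s for s, y in zip(auc_scores, auc_labels) if y == 0]

# Compare every positive with every negative; ties count as half a win
wins = 0.0
for p in pos_scores:
    for n in neg_scores:
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5

toy_auc = wins / (len(pos_scores) * len(neg_scores))
print("AUC:", toy_auc)
```

Here 14 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.875.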

![](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1341805045.jpg)

In [19]:
# Let's find the AUC

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])

0.6819159690002851

---

## 4. Classification Report

A table that summarises the performance of our classification model for each class in the target variable, which makes it useful for seeing where the model does well and where it struggles. In general, a model with high precision, recall and F1-score on every class is considered accurate and reliable, although which metric matters most depends on the consequences of FNs and FPs in your specific problem.

In [20]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names = ['No Stroke', 'Stroke']))

              precision    recall  f1-score   support

   No Stroke       0.96      1.00      0.98       941
      Stroke       0.25      0.02      0.04        41

    accuracy                           0.96       982
   macro avg       0.60      0.51      0.51       982
weighted avg       0.93      0.96      0.94       982
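
The `macro avg` and `weighted avg` rows can be reproduced by hand. The sketch below does this for the precision column, using the counts from the confusion matrix above. The variable names are chosen for this example only.

```python
# Counts from the confusion matrix above
tn_c, fp_c, fn_c, tp_c = 938, 3, 40, 1

# Per-class precision: each class is treated as the "positive" class in turn
precision_no_stroke = tn_c / (tn_c + fn_c)
precision_stroke = tp_c / (tp_c + fp_c)

# Support = number of actual instances of each class
support_no_stroke = tn_c + fp_c
support_stroke = fn_c + tp_c
total_support = support_no_stroke + support_stroke

# Macro average: plain mean, every class counts equally
macro_precision = (precision_no_stroke + precision_stroke) / 2

# Weighted average: mean weighted by support, so the majority class dominates
weighted_precision = (
    precision_no_stroke * support_no_stroke + precision_stroke * support_stroke
) / total_support

print("macro avg precision:   ", round(macro_precision, 2))
print("weighted avg precision:", round(weighted_precision, 2))
```

On imbalanced data the weighted average mostly reflects the majority class, so the macro average and the per-class rows are usually the better place to spot a struggling minority class.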



---

# Let's check our understanding!

1. We are checking whether each transaction in a dataset of credit card transactions is 'fraud' or 'not fraud'. What would a False Negative be in this situation? *Hint: start by thinking about which is 'positive' and which is 'negative'*

2. Given the following, calculate the **recall**

- TP = 73
- FP = 4
- FN = 11
- TN = 39


$ Recall = \frac{TP}{TP + FN} $

3. True/False: An F1 score of 0 indicates a really well-performing model

4. What two metrics are on the axes of an ROC curve?