# Evaluating Classification: ROC/AUC

In [None]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt

from sklearn.utils import resample
from sklearn.datasets import load_breast_cancer, load_iris, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report, RocCurveDisplay

# Objectives

- Calculate and interpret probability estimates
- Adjust the threshold of a logistic regression model
- Visualize, calculate and interpret the AUC-ROC metric

# Motivation

Now that we've learned how to evaluate a classification model's predictions, let's dig deeper to see how else we might evaluate our models and how we can use that information to improve them.

# Scenario: Identifying Heart Disease

Let's use [this UCI dataset](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) about predicting heart disease.

In [None]:
hd_data = pd.read_csv('data/heart.csv')
hd_data.info()

In [None]:
hd_data['target'].value_counts()

In [None]:
hd_data.head()

In [None]:
# Separate data into feature and target DataFrames
None

# Split data into train and test sets
None

# Scale the data for modeling
None

# Train a logistic regression model with the training data
None
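
One way the four placeholder steps above might be filled in is sketched below. Because `data/heart.csv` isn't available here, the sketch builds a synthetic stand-in DataFrame called `hd_demo` with a `target` column (an assumption for illustration only); with the real data you would use `hd_data` in its place. The output names `hd_model`, `X_train_sc`, and `X_test_sc` match the ones used in the cells that follow, and `random_state=42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for hd_data (assumption: the real data comes from data/heart.csv)
X_fake, y_fake = make_classification(n_samples=300, n_features=5, random_state=42)
hd_demo = pd.DataFrame(X_fake, columns=[f'feat_{i}' for i in range(5)])
hd_demo['target'] = y_fake

# Separate data into feature and target DataFrames
X = hd_demo.drop(columns='target')
y = hd_demo['target']

# Split data into train and test sets (default split: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the data for modeling (fit on train only, then transform both)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Train a logistic regression model with the training data
hd_model = LogisticRegression(random_state=42)
hd_model.fit(X_train_sc, y_train)
```

Fitting the scaler on the training split only keeps information about the test set out of the model.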

## Predicting Labels

Let's look at some predictions from our example.

In [None]:
y_pred = hd_model.predict(X_test_sc)

In [None]:
y_pred

When we run the `.predict()` method, `sklearn` gives us a predicted label for each patient in our test set: 0 if predicting "no heart disease", 1 if predicting "heart disease".

## Probability Estimates

If you remember how the logistic regression model works, though, it doesn't actually generate predicted values of 0 or 1. It creates an S-shaped curve to approximate the data, estimating the _probability_ that they belong to the target class. This probability takes a value _between_ 0 and 1.

![](https://www.graphpad.com/guides/prism/latest/curve-fitting/images/hmfile_hash_38a8acae.png)

Source: [GraphPad](https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_simple_logistic_and_linear_difference.htm)

We can get these estimated probabilities using the `.predict_proba()` method. Each row gives two probabilities: the estimated probability of being in the 0 class (no heart disease) and in the 1 class (heart disease).

In [None]:
y_prob = hd_model.predict_proba(X_test_sc)
y_prob[:5]

In [None]:
y_pred[:5]

## Thresholds

How did we get those 0 and 1 label predictions, when the model only calculates probabilities between 0 and 1? 

The default behavior is simply to take the larger of these values as the "real" prediction. If a row of `y_prob` reads `[0.996, 0.004]`, for example, then since $0.996 > 0.004$ we'd understand the model to be predicting that this point belongs to class "0" (the negative class). An equivalent way of understanding the default behavior is that we either:

- Round the predicted numbers up to 1 if they are at least as large as 0.5
- Round them down to 0 if they are less than 0.5

Since the two probabilities must sum to 1, the larger of them is always at least 0.5, so the two descriptions agree. We refer to this value of 0.5 as the **threshold**.
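
As a quick check of this equivalence, the sketch below fits a logistic regression to synthetic data (a stand-in, not the heart disease data) and confirms that each row of `.predict_proba()` sums to 1 and that thresholding the class-1 column at 0.5 reproduces `.predict()`. The names `demo_model` and `demo_prob` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

demo_prob = demo_model.predict_proba(X_demo)   # shape (200, 2): P(class 0), P(class 1)
row_sums = demo_prob.sum(axis=1)

# Apply the 0.5 threshold by hand to the class-1 column
manual_pred = (demo_prob[:, 1] >= 0.5).astype(int)
matches_predict = np.array_equal(manual_pred, demo_model.predict(X_demo))

print(np.allclose(row_sums, 1.0), matches_predict)
```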

### But Are We Stuck with a 0.5 Threshold?

But we don't have to do things this way. Suppose we're building a model that predicts the presence of cancer from X-ray scans. And suppose we get a pair of probabilities for some particular scan that look like this:

- pred_neg: 0.52, pred_pos: 0.48

Because false negatives (cancers not flagged) are *much* more costly than false positives (non-cancers flagged as cancers), we may well want to **adjust our threshold**. We might want to have our model predict "positive" if the corresponding probability is, say, as low as 0.4, or maybe even as low as 0.1. (Speaking for myself, if there was even a 10% chance that I had cancer, I think I'd probably want to know about it.) 
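
To make this concrete, here is a small sketch with made-up class-1 probabilities (the first is the 0.48 from the scan above; the rest are invented for illustration). It shows how the same probabilities yield different labels as the threshold drops. The helper `label_at()` is hypothetical.

```python
import numpy as np

# Hypothetical P(positive) values; 0.48 is the scan from the example above
pos_probs = np.array([0.48, 0.05, 0.30, 0.90])

def label_at(threshold):
    """Predict 1 wherever P(positive) is at least `threshold`."""
    return (pos_probs >= threshold).astype(int)

for t in (0.5, 0.4, 0.1):
    print(f'threshold={t}: {label_at(t)}')
```

At 0.5 the scan is labeled negative; at 0.4 it is flagged, and at 0.1 the 0.30 case is flagged as well.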

## True & False Positive Rates

Adjusting the threshold can increase or decrease performance on different evaluation metrics. When doing this, data scientists often look at changes in two metrics: **True Positive Rate (TPR)** and **False Positive Rate (FPR)**. Let's define and calculate these.

To do this, we'll first need to get the values from the confusion matrix.

In [None]:
hd_model.score(X_train_sc, y_train)

In [None]:
hd_model.score(X_test_sc, y_test)

In [None]:
cm = confusion_matrix(y_test, hd_model.predict(X_test_sc))

In [None]:
cm

In [None]:
tp, tn, fp, fn = cm[1][1], cm[0][0], cm[0][1], cm[1][0]

### True Positive Rate

True Positive Rate (TPR) is the same as recall, measuring how many of the positive cases we correctly classified as positive.

**True Positive Rate (TPR)** = **Recall** = $\frac{TP}{TP + FN}$

How many of the patients with heart disease did my model identify?

In [None]:
tpr = tp / (tp + fn)
print(tpr)

### False Positive Rate

False Positive Rate (FPR) measures how many of the negative cases we incorrectly classified as positive.

**False Positive Rate (FPR)** = $\frac{FP}{FP + TN}$

How many of the patients without heart disease did my model flag as having heart disease?

In [None]:
fpr = fp / (fp + tn)
print(fpr)
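
As a sanity check on both formulas, the sketch below uses a tiny made-up set of labels (not the heart disease data) and confirms that the hand-computed TPR matches `recall_score`. It unpacks the confusion matrix with `.ravel()`, which for binary labels returns the counts in the order TN, FP, FN, TP. All `*_toy` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tn_toy, fp_toy, fn_toy, tp_toy = confusion_matrix(y_true_toy, y_pred_toy).ravel()

tpr_toy = tp_toy / (tp_toy + fn_toy)   # 3 of the 4 positives were caught
fpr_toy = fp_toy / (fp_toy + tn_toy)   # 2 of the 6 negatives were flagged

print(tpr_toy, fpr_toy, recall_score(y_true_toy, y_pred_toy))
```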

## Adjusting the threshold

The true- and false-positive rates will change if we make adjustments to the threshold. In fact, in the present case that was the whole point of making the adjustment: We want to minimize our false negatives.

Tracking how both rates change as the threshold moves is what gives the plot of these rates its shape.

Let's build a function that will take in our data, together with a threshold setting, and return the corresponding true- and false-positive rates.

# The Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve plots the true-positive rate vs. the false-positive rate. The function below computes both rates for a given threshold:

In [None]:
def classify_rates(y_test, y_probs, model, thresh):
    # Note: `model` is not used in the body; it is kept so existing calls still work.
    y_hat = []
    for val in y_probs:                 # Each element in y_probs is an array.
        if val[1] < thresh:             # We'll set our own threshold for classifying
            y_hat.append(0)             # a test point as positive! The lower my threshold,
        else:                           # the more predicted positives I'll have. For the
            y_hat.append(1)             # cancer example, I'd want to set a *low* threshold.
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    cm = confusion_matrix(y_test, y_hat, labels=[0, 1])
    tp, tn, fp, fn = cm[1][1], cm[0][0], cm[0][1], cm[1][0]
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, f'tpr:{round(tpr, 3)}, fpr:{round(fpr, 3)}'

True- and false-positive rates for various thresholds:

In [None]:
for x in np.linspace(0, 1, 11):
    print(f'Rates at threshold = {round(x, 2)}: '\
          + classify_rates(y_test, y_prob, hd_model, x)[2])

As my threshold goes up, I'll have fewer positive predictions, which means I'll have both fewer true positives and fewer false positives.

> **NOTE**
>
> - I can artificially increase my true-positive rate to 1 by setting my threshold to 0, but at that point my false-positive rate is also 1! I'll have no true negatives and no false negatives. This will arise naturally if my training data has **very few (actual) negatives**. 
> - I can artificially reduce my false-positive rate to 0 by setting my threshold to 1, but at that point my true-positive rate is also 0! I'll have no true positives and no false positives. This will arise naturally if my training data has **very few (actual) positives**.
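
The two extremes in the note can be checked on toy numbers. The sketch below uses made-up labels and probabilities and a hypothetical helper `rates_at()` that predicts positive whenever the class-1 probability is at least the threshold.

```python
import numpy as np

# Hypothetical labels and class-1 probabilities
y_true_toy = np.array([0, 0, 0, 1, 1, 1])
p_pos_toy = np.array([0.1, 0.4, 0.7, 0.35, 0.8, 0.9])

def rates_at(thresh):
    """Return (TPR, FPR) when predicting positive for p >= thresh."""
    y_hat = (p_pos_toy >= thresh).astype(int)
    tp = np.sum((y_hat == 1) & (y_true_toy == 1))
    fp = np.sum((y_hat == 1) & (y_true_toy == 0))
    fn = np.sum((y_hat == 0) & (y_true_toy == 1))
    tn = np.sum((y_hat == 0) & (y_true_toy == 0))
    return tp / (tp + fn), fp / (fp + tn)

for t in (0.0, 0.5, 1.0):
    print(t, rates_at(t))
```

A threshold of 0 flags every point, so both rates are 1; a threshold of 1 flags none here, so both rates are 0.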

## Plotting the Curve

Let's plot our own ROC curve. We'll create an array of different thresholds and use our `classify_rates()` function to get the true- and false-positive rates for each threshold.

One way of choosing a threshold **independently of business concerns** is to select the point on the curve that is furthest from (1, 0), the "worst-case" point where our true-positive rate is 0 and our false-positive rate is 1. So let's find that point as well:

In [None]:
tprs = []
fprs = []
diffs = []
for x in np.linspace(0, 1, 101):
    # Call classify_rates() once per threshold and unpack both rates
    tpr_x, fpr_x, _ = classify_rates(y_test, y_prob, hd_model, x)
    tprs.append(tpr_x)
    fprs.append(fpr_x)
    # Distance from the worst-case point (fpr=1, tpr=0)
    diffs.append(np.sqrt(tpr_x**2 + (1 - fpr_x)**2))

max_dist = int(np.argmax(diffs))
# Index i in np.linspace(0, 1, 101) corresponds to a threshold of i / 100
print(f"""With a threshold of {max_dist / 100}: \n"""
      f"""\tYou\'ll have a True Positive Rate of {round(tprs[max_dist], 3)} \n"""
      f"""\tand a False Positive Rate of {round(fprs[max_dist], 3)}""")

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fprs[:max_dist], tprs[:max_dist], 'r.')
ax.plot(fprs[max_dist], tprs[max_dist], 'ko', ms=10)
ax.plot(fprs[max_dist + 1:], tprs[max_dist + 1:], 'r.')
ax.plot(fprs, fprs, '.');

Let's compare our curve with scikit-learn's:

In [None]:
# Pass the probabilities for the "1" class (heart disease)
roc_curve(y_test, y_prob[:, 1])

In [None]:
# Extract the probability predictions for the "1" class (heart disease)
y_hat_hd = y_prob[:, 1]

# Get the FPR and TPR data
fpr, tpr, thresholds = roc_curve(y_test, y_hat_hd)

# Plot the FPR and TPR data
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot((0,1), (0,1), 'k--');

### `RocCurveDisplay()`

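You can also plot the curve with `RocCurveDisplay`, either from the rates we just computed (as in the cell below) or directly from a fitted model and test data with `RocCurveDisplay.from_estimator()`.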

In [None]:
roc_disp = RocCurveDisplay(fpr=fpr, tpr=tpr, estimator_name='Heart Disease Model')
roc_disp.plot()

## Area Under the Curve (AUC)

The ROC curve will be a plot of tpr (on the y-axis) vs. fpr (on the x-axis). There will always be a point at (0, 0) and another at (1, 1). The question is what happens in the middle. Since we want our y-values to be as high as possible for any particular x-value, a natural metric is to calculate the **area under the curve**. The larger the area, the better the classifier. The maximum possible area is the area of the whole box between 0 and 1 on both axes, so that's a **maximum area of 1**.
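
The area can be computed from the curve's points with the trapezoidal rule, which is what `sklearn.metrics.auc()` does. The sketch below uses made-up labels and scores (all `*_toy` names are hypothetical) and checks that this area agrees with `roc_auc_score()`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9])

fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, scores_toy)

area_from_curve = auc(fpr_toy, tpr_toy)                    # trapezoidal area under (fpr, tpr)
area_from_scores = roc_auc_score(y_true_toy, scores_toy)   # same quantity, computed directly

print(area_from_curve, area_from_scores)
```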

What's the minimum? In principle the area can be as low as 0, but the baseline that matters is the area a model earns by guessing at random. Unlike a baseline accuracy score, that baseline does not depend on the ratio of (actual) positives to negatives in my data.

> Remember: If my test data comprises 90% positives and only 10% negatives, then a simple classifier that always predicts "positive" will be 90% accurate! And so that would be the baseline level for accuracy on that data.

For the ROC curve, whatever the class balance, random guessing gives a **baseline area of 0.5**. That's the "curve" we'd get by plotting a straight diagonal line from (0, 0) to (1, 1). An area below 0.5 means the model ranks negatives above positives more often than not, so reversing its predictions would push the area above 0.5.

Why? The area under the curve really represents the test's ability to **discriminate** positives from negatives. Suppose I randomly took several pairs of points, one actual positive and one actual negative, and compared my test's predicted probabilities for each pair. The area under the curve is a threshold-independent measure of how often my test would give the higher probability to the positive point. A test that guesses at random gets this right about half the time, which is where 0.5 comes from.
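
This pairwise reading can be verified directly. The sketch below, on made-up labels and scores, counts the fraction of (positive, negative) pairs in which the positive gets the higher score (ties count as half) and compares it with `roc_auc_score()`. The variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9])

pos_scores = scores_toy[y_true_toy == 1]
neg_scores = scores_toy[y_true_toy == 0]

# Compare every positive with every negative via broadcasting
wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
ties = (pos_scores[:, None] == neg_scores[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

print(pairwise_auc, roc_auc_score(y_true_toy, scores_toy))
```

Here 12 of the 16 pairs are ordered correctly, so both numbers come out to 0.75.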

### AUC Calculation with `sklearn`

Scikit-Learn's `roc_auc_score()` function will compute the area under the curve for us:

In [None]:
y_prob

In [None]:
# Extract the probability predictions for the "1" class (heart disease)
y_hat_hd = y_prob[:, 1]

roc_auc_score(y_test, y_hat_hd)

## Sidebar: Visualizing Threshold Changes

This [ROC Applet](https://web.archive.org/web/20210210014824/http://www.navan.name/roc/) helps visualize how a change in the threshold corresponds to moving along the ROC curve.

# Scenario: Breast Cancer Prediction

Let's evaluate a model using Scikit-Learn's breast cancer dataset:

In [None]:
# Load the data
preds, target = load_breast_cancer(return_X_y=True)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(preds, target,
                                                   random_state=42)

# Scale the data
bc_scaler = StandardScaler()
bc_scaler.fit(X_train)
X_train_sc = bc_scaler.transform(X_train)
X_test_sc = bc_scaler.transform(X_test)

# Run the model
bc_model = LogisticRegression(solver='lbfgs', max_iter=100,
                           random_state=42)
bc_model.fit(X_train_sc, y_train)

## Task

For this example, draw the ROC curve and calculate the AUC-ROC metric. Based on the results, do you think your model would be useful for identifying patients with breast cancer?

In [None]:
# Your work here
# Let's look at the confusion matrix first
None

In [None]:
# Train ROC-AUC
None

In [None]:
# Test ROC-AUC
None

In [None]:
bc_model.score(X_train_sc, y_train)

In [None]:
bc_model.score(X_test_sc, y_test)

In [None]:
bc_model.predict_proba(X_test_sc)[0]

In [None]:
roc_auc_score(y_test, bc_model.predict_proba(X_test_sc)[:, 1])

In [None]:
recall_score(y_test, bc_model.predict(X_test_sc))

# Level Up: Oversampling
## Coming soon in another lecture

What do you do if your model doesn't perform well due to class imbalance? One of the most effective strategies is to **oversample the minority class**. That is, I give myself more data points than I really have. I could achieve this either by [bootstrapping](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) or by generating some data that is fake but close to actual data. The latter is the idea behind [SMOTE](https://imbalanced-learn.org/stable/over_sampling.html).
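
A minimal sketch of the bootstrapping approach with `sklearn.utils.resample`, on made-up imbalanced data: the minority class is resampled with replacement until it matches the majority class. The array names are hypothetical, and in practice you would oversample only the training split so that duplicated rows never leak into the test set.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_imb = rng.normal(size=(100, 3))
y_imb = np.array([0] * 90 + [1] * 10)      # 90 majority rows, 10 minority rows

X_min, y_min = X_imb[y_imb == 1], y_imb[y_imb == 1]
X_maj, y_maj = X_imb[y_imb == 0], y_imb[y_imb == 0]

# Draw minority rows with replacement until there are as many as majority rows
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))
```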

In [None]:
# Another less intensive method that might help
log_class_weights = LogisticRegression(class_weight='balanced')