<img src='img/logo.png'>
<img src='img/title.png'>

This notebook describes how to quantitatively assess models' information value by calculating classification metrics like receiver operating characteristic (ROC) curves and confusion matrices.

# Table of Contents
* [Classification](#Classification)
* [Evaluation metrics and scoring](#Evaluation-metrics-and-scoring)
	* [Metrics for binary classification](#Metrics-for-binary-classification)
		* [Confusion matrices](#Confusion-matrices)
			* [Relation to accuracy](#Relation-to-accuracy)
		* [Precision, recall and f-score](#Precision,-recall-and-f-score)
* [Taking uncertainty into account](#Taking-uncertainty-into-account)
	* [Precision-Recall curves and ROC curves](#Precision-Recall-curves-and-ROC-curves)
		* [Exercise: compare cancer models](#Exercise:-compare-cancer-models)
		* [ROC and AUC](#ROC-and-AUC)
	* [Multi-class classification](#Multi-class-classification)
* [Summary](#Summary)


In [None]:
#Fix working directory
%cd notebooks

In [None]:
import os
import numpy as np
import pandas as pd

import holoviews as hv
import hvplot.pandas
import panel as pn
hv.extension('bokeh')

import matplotlib.pyplot as plt
plt.rcParams['image.interpolation'] = "none"
np.set_printoptions(precision=3)
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['legend.numpoints'] = 1

import src.mglearn as mglearn

# Classification

Classification is the set of techniques to predict a label, either binary or from a set, based on the features.

Simple classifiers predict the label of a new observation based on proximity to a known observation.

More generalized classifiers attempt to find dividing surfaces within the n-dimensional feature space to decide how to label a new observation.
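
As a rough illustration of the two families described above, the sketch below fits a 1-nearest-neighbour classifier (proximity-based) and a logistic regression (a linear dividing surface) on a synthetic two-cluster dataset. The data and all `*_demo` names are assumptions made for this example and are not part of the notebook's datasets.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated 2-D clusters.
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                            cluster_std=1.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

# Proximity-based: label a new point like its nearest training point.
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(Xd_train, yd_train)
# Surface-based: learn a linear boundary that separates the classes.
linear_demo = LogisticRegression().fit(Xd_train, yd_train)

knn_acc = knn_demo.score(Xd_test, yd_test)
linear_acc = linear_demo.score(Xd_test, yd_test)
print("1-nearest-neighbour accuracy: %.2f" % knn_acc)
print("logistic regression accuracy: %.2f" % linear_acc)
```

On data this cleanly separated both approaches should do well; the differences between them tend to show up on noisier or higher-dimensional problems.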

# Evaluation metrics and scoring

Reading on classification metrics:

http://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix

http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report

http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures

## Metrics for binary classification

Let's build a few different classifiers so we can compare them. The dataset seeks to model whether a bank customer will respond to a marketing campaign.

In [None]:
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/bank-campaign.csv")
X = data.drop(columns="target").values
y = data.target.values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


The DummyClassifier with `strategy='most_frequent'` will *always* assign the same label to every sample. Most people responded "no".

In [None]:
data['target'].value_counts()

88% seems like a good score, no?

In [None]:
from sklearn.dummy import DummyClassifier

dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
pred_most_frequent = dummy_majority.predict(X_test)

print("predicted labels: %s" % np.unique(pred_most_frequent))
print("score: %f" % dummy_majority.score(X_test, y_test))

Let's build some more sophisticated models. These have similar scores. 

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
pred_tree = tree.predict(X_test)

tree.score(X_test, y_test)

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=0.1, solver='liblinear').fit(X_train, y_train)
pred_logreg = logreg.predict(X_test)

print("logreg score: %f" % logreg.score(X_test, y_test))

### Confusion matrices

Confusion matrices display the counts of true and false positive and negative classifications.

In [None]:
mglearn.plots.plot_binary_confusion_matrix()

Let's start with LogisticRegression

In [None]:
from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(y_test, pred_logreg)
print(confusion)

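Fractions of each true class may be easier to compare than raw counts.

In [None]:
n_no = y_test[y_test=='no'].shape[0]
n_yes = y_test[y_test=='yes'].shape[0]

# rows of the confusion matrix are the true classes, so use a column vector
# to divide each row by the number of samples in that class
sizes = np.array([n_no, n_yes]).reshape(-1, 1)

confusion / sizes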
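
As an alternative to dividing by hand, scikit-learn 0.22 and later accept a `normalize` argument in `confusion_matrix`. The sketch below uses a small made-up label set (not the bank-campaign data) to check that `normalize="true"` matches dividing each row by its true-class count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not the bank-campaign data.
y_true_demo = np.array(["no", "no", "no", "no", "yes", "yes"])
y_pred_demo = np.array(["no", "no", "no", "yes", "yes", "no"])

counts = confusion_matrix(y_true_demo, y_pred_demo)
# Divide each row by the number of samples in that true class.
manual = counts / counts.sum(axis=1, keepdims=True)
# The same normalization done by scikit-learn itself.
builtin = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(manual)
print(np.allclose(manual, builtin))
```

Each row of the normalized matrix sums to 1, so the diagonal entries can be read as per-class recall.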

How did the other models do?

In [None]:
print("Most frequent class:")
print(confusion_matrix(y_test, pred_most_frequent)/sizes)
print("\nDecision tree:")
print(confusion_matrix(y_test, pred_tree)/sizes)
print("\nLogistic Regression")
print(confusion_matrix(y_test, pred_logreg)/sizes)

#### Relation to accuracy

\begin{equation}
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
\end{equation}
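
To make the formula concrete, here is a minimal sketch on a handful of made-up 0/1 labels. It unpacks the four cells of the confusion matrix and checks the result against `accuracy_score`. The arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
print("accuracy: %.2f" % accuracy_manual)
```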

### Precision, recall and f-score

\begin{equation}
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
\end{equation}

\begin{equation}
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
\end{equation}

\begin{equation}
\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{equation}
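
The three formulas above can be checked by hand on a small example. The sketch below uses made-up 0/1 labels (an assumption for illustration) and compares the manual results with scikit-learn's `precision_score`, `recall_score` and `f1_score`.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for illustration only.
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # how many predicted positives are real
recall_manual = tp / (tp + fn)      # how many real positives were found
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))

print("precision: %.3f" % precision_manual)
print("recall:    %.3f" % recall_manual)
print("f1:        %.3f" % f1_manual)
```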

In [None]:
from sklearn.metrics import f1_score
print("f1 score most frequent: %.2f" % f1_score(y_test, pred_most_frequent, pos_label="yes"))
print("f1 score tree: %.2f" % f1_score(y_test, pred_tree, pos_label="yes"))
print("f1 score: %.2f" % f1_score(y_test, pred_logreg, pos_label="yes"))

The DummyClassifier

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred_most_frequent,
                            target_names=["no", "yes"]))

DecisionTree

In [None]:
print(classification_report(y_test, pred_tree,
                            target_names=["no", "yes"]))

LogisticRegression

In [None]:
print(classification_report(y_test, pred_logreg,
                            target_names=["no", "yes"]))

# Taking uncertainty into account

The *decision threshold* is the value at which the classifier switches between predicting False and True. You can move this threshold away from its default to trade false positives against false negatives.

For SVC models, `decision_function` returns a real-valued score for each sample rather than a probability; the sign of the score indicates which side of the decision boundary the sample falls on. By default the decision threshold is 0: samples with a score greater than 0 are assigned True.
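
A minimal sketch of the idea, assuming a synthetic two-cluster dataset from scikit-learn's own `make_blobs` (not the notebook's data): thresholding the decision scores at 0 reproduces `predict`, and a lower threshold can only add positive predictions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data with labels 0 and 1.
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)
svc_demo = SVC(gamma=0.05).fit(X_demo, y_demo)

scores = svc_demo.decision_function(X_demo)
# Thresholding the scores at 0 gives the same labels as predict().
default_pred = (scores > 0).astype(int)
# A lower threshold labels at least as many samples as positive.
lenient_pred = (scores > -0.8).astype(int)

print("positives at threshold  0.0:", default_pred.sum())
print("positives at threshold -0.8:", lenient_pred.sum())
```

Lowering the threshold tends to raise recall for the positive class at the cost of precision; raising it does the opposite.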

In [None]:
mglearn.plots.plot_decision_threshold()

In [None]:
from src.mglearn.datasets import make_blobs 
from sklearn.svm import SVC

X, y = make_blobs(n_samples=(400, 50), centers=2, cluster_std=[7.0, 2],        
                  random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=.05).fit(X_train, y_train)                                     

In [None]:
print(classification_report(y_test, svc.predict(X_test)))

Compare the classification report when the threshold is lowered to -0.8. Precision for the False class is now perfect.

In [None]:
y_pred_lower_threshold = svc.decision_function(X_test) > -.8

print(classification_report(y_test, y_pred_lower_threshold))

## Precision-Recall curves and ROC curves

Curves allow us to view the tradeoff between 1) precision and recall or 2) true positive rate and false positive rate (ROC).

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

See also:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py

Precision-Recall curves are best used when you have *imbalanced classes*, like the bank campaign problem above.

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
# create a similar dataset as before, but with more samples to get a smoother curve
X, y = make_blobs(n_samples=(4000, 500), centers=2, cluster_std=[7.0, 2], random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=.05).fit(X_train, y_train)

precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))
# find threshold closest to zero:
close_zero = np.argmin(np.abs(thresholds))

In [None]:
precision_recall = pd.DataFrame({
    'precision': precision,
    'recall': recall,
    'threshold': np.append(thresholds, np.nan)
})

precision_recall_plot = precision_recall.hvplot.line(x='precision', y='recall', hover_cols=['threshold'], width=800,
                                                     label='SVC')
zero = hv.Points((precision[close_zero], recall[close_zero]), label='Zero').options(marker='o', color='black',
                                                                                    size=10, fill_color=None)
svc_plot = precision_recall_plot * zero
svc_plot
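
The `np.append(thresholds, np.nan)` padding used when building the DataFrame is needed because `precision_recall_curve` returns one more precision/recall value than thresholds: the final point (precision 1, recall 0) has no threshold attached. A small sketch with made-up scores shows the shapes involved.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores for illustration only.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

p_demo, r_demo, t_demo = precision_recall_curve(y_true_demo, scores_demo)

print("number of precision values:", len(p_demo))
print("number of recall values:   ", len(r_demo))
print("number of thresholds:      ", len(t_demo))
# Pad thresholds with NaN so all three columns have the same length.
t_padded = np.append(t_demo, np.nan)
```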

RandomForestClassifier has no `decision_function`; instead, `predict_proba` provides probabilities between 0 and 1. The default threshold is then 0.5.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0, max_features=2)
rf.fit(X_train, y_train)

precision_rf, recall_rf, thresholds_rf = precision_recall_curve(
    y_test, rf.predict_proba(X_test)[:, 1])

close_default_rf = np.argmin(np.abs(thresholds_rf - 0.5))

In [None]:
precision_recall_rf = pd.DataFrame({
    'precision': precision_rf,
    'recall': recall_rf,
    'threshold': np.append(thresholds_rf, np.nan)
})

pr_rf = precision_recall_rf.hvplot.line(x='precision', y='recall', hover_cols=['threshold'], width=800,
                                       label='RandomForest')
point5_rf = (hv.Points((precision_rf[close_default_rf], recall_rf[close_default_rf]), label='0.5')
             .options(marker='o', color='red', size=10, fill_color=None))

svc_plot * pr_rf * point5_rf

In [None]:
print("f1_score of random forest: %f" % f1_score(y_test, rf.predict(X_test)))
print("f1_score of svc:           %f" % f1_score(y_test, svc.predict(X_test)))

In [None]:
from sklearn.metrics import average_precision_score

ap_rf = average_precision_score(y_test, rf.predict_proba(X_test)[:, 1])
ap_svc = average_precision_score(y_test, svc.decision_function(X_test))

print("average precision of random forest: %f" % ap_rf)
print("average precision of svc:           %f" % ap_svc)

### Exercise: compare cancer models

Using the Naive Bayes `bayes` and LogisticRegression `logreg` cancer models fitted for you, construct confusion matrices with the argument `labels=['malignant','benign']`.

Would we be more concerned with improving *precision* or *recall*? Remember that False means no-cancer. Use `precision_score()` or `recall_score()` with `pos_label='benign'`.

In [None]:
from sklearn.datasets import load_breast_cancer

_cancer = load_breast_cancer()
cancer = pd.DataFrame(data=_cancer.data, columns=_cancer.feature_names)
cancer['target'] = _cancer.target
cancer['target'] = cancer['target'].replace({0:'malignant', 1:'benign'})
cancer.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score, precision_score

X = cancer.drop(columns='target').values
y = cancer['target'].values

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=42)

logreg = LogisticRegression(solver='liblinear', C=1)
logreg.fit(Xtrain, ytrain)

bayes = GaussianNB()
bayes.fit(Xtrain, ytrain)

print(f"LogisticRegression score: {logreg.score(Xtest,ytest):.4f}")
print(f"Naive Bayes        score: {bayes.score(Xtest,ytest):.4f}")

In [None]:
## YOUR SOLUTION HERE


<button data-toggle="collapse" data-target="#cancer" class='btn btn-primary'>Show solution</button>

<div id="cancer" class="collapse">

A high number of *false negatives* can put people at risk! False positives can be ignored (somewhat). You want a model whose **recall** is as close to 1.0 as possible.

Logistic Regression has a *slightly* higher precision score.


 ```python
print("Logistic Regression")
print(confusion_matrix(ytest, logreg.predict(Xtest), labels=['malignant', 'benign']))
print(recall_score(ytest, logreg.predict(Xtest), pos_label='benign'))

print("Naive Bayes")
print(confusion_matrix(ytest, bayes.predict(Xtest), labels=['malignant', 'benign']))
print(recall_score(ytest, bayes.predict(Xtest), pos_label='benign'))
 ```
 
Extra: The plots look weird. Why is that?

 ```python
plots = []

for model in logreg, bayes:
    p, r, t = precision_recall_curve(ytest, model.predict_proba(Xtest)[:, 1], pos_label='benign')
    close_default = np.argmin(np.abs(t - 0.5))
    
    df = pd.DataFrame({
        'precision': p,
        'recall': r,
        'threshold': np.append(t, np.nan)
    })
    
    curve = df.hvplot.line(x='precision', y='recall', hover_cols=['threshold'], width=800,
                                       label=model.__class__.__name__, padding=0.02)
    point = (hv.Points((p[close_default], r[close_default]), label=f'default {model.__class__.__name__}')
             .options(marker='o', size=10, fill_color=None))
    
    plots.append(curve * point)

plots[0] * plots[1]
 ```

</div>

### ROC and AUC

When looking at Receiver Operating Characteristic (ROC) plots, it is also helpful to calculate the Area Under the Curve (AUC). A perfect classifier has an AUC of 1.0.

Like precision-recall curves, ROC curves show how a classifier behaves across all thresholds, but they plot the *false positive rate (FPR)* against the *true positive rate (TPR)*. ROC is best used for problems where the classes are well balanced.

The false positive rate is defined as

\begin{equation}
\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}
\end{equation}

The true positive rate is defined as

\begin{equation}
\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}
\end{equation}
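
As a quick check of the two definitions, the sketch below computes FPR and TPR by hand from a confusion matrix built on made-up 0/1 labels (an assumption for illustration). TPR is the same quantity as recall, which the last line verifies with `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only.
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

fpr_manual = fp / (fp + tn)   # share of real negatives flagged as positive
tpr_manual = tp / (tp + fn)   # share of real positives found (recall)

print("FPR: %.3f" % fpr_manual)
print("TPR: %.3f" % tpr_manual)
print("TPR equals recall:",
      np.isclose(tpr_manual, recall_score(y_true_demo, y_pred_demo)))
```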

In [None]:
# create a similar dataset as before, but with balanced classes
X, y = make_blobs(n_samples=4500, centers=2, cluster_std=[7.0, 2], random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=.05).fit(X_train, y_train)

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))

>The top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.

>The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate.

In [None]:
roc_svc = pd.DataFrame({
    'FPR': fpr,
    'TPR (recall)': tpr,
    'threshold': thresholds
})

roc_svc_plot = roc_svc.hvplot.line(x='FPR', y='TPR (recall)', hover_cols=['threshold'], width=800,
                            label='SVC')

close_zero = np.argmin(np.abs(thresholds))
zero_svc = (hv.Points((fpr[close_zero], tpr[close_zero]), label='Zero (svc)')
             .options(marker='o', size=10, fill_color=None))

roc_svc_plot * zero_svc

Let's compare SVC to the RandomForest model.

**Note**: the diagonal line represents a *random guess* classifier. Good classifiers have ROC curves significantly above the diagonal.

In [None]:
# refit the random forest on the balanced data so that both models
# are trained and evaluated on the same split
rf.fit(X_train, y_train)

fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])


roc_rf = pd.DataFrame({
    'FPR': fpr_rf,
    'TPR (recall)': tpr_rf,
    'threshold': thresholds_rf
})

roc_rf_plot = roc_rf.hvplot.line(x='FPR', y='TPR (recall)', hover_cols=['threshold'], width=800,
                            label='RandomForest', title='ROC')

close_default_rf = np.argmin(np.abs(thresholds_rf - 0.5))
point5_rf = (hv.Points((fpr_rf[close_default_rf], tpr_rf[close_default_rf]), label='0.5 (rf)')
             .options(marker='o', size=10, fill_color=None))

diag = hv.Curve([(0,0),(1,1)]).options(color='black', alpha=0.2)

roc_svc_plot * zero_svc * roc_rf_plot * point5_rf * diag

Computing the area (AUC) provides a single metric to evaluate different models.
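
To show that the AUC really is the area under the ROC curve, the sketch below integrates the curve with `sklearn.metrics.auc` (trapezoidal rule) and compares it with `roc_auc_score`. The labels and scores are randomly generated assumptions, not output from any model in this notebook.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.RandomState(0)
# Hypothetical labels and noisy but informative scores.
y_true_demo = rng.randint(0, 2, size=200)
scores_demo = y_true_demo + rng.normal(scale=1.0, size=200)

fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
# Area under the (FPR, TPR) curve via the trapezoidal rule.
area_trapezoid = auc(fpr_demo, tpr_demo)
# The same quantity computed directly from labels and scores.
area_direct = roc_auc_score(y_true_demo, scores_demo)

print("trapezoidal area: %.4f" % area_trapezoid)
print("roc_auc_score:    %.4f" % area_direct)
```

A classifier that guesses at random would land near 0.5, which corresponds to the area under the diagonal.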

In [None]:
from sklearn.metrics import roc_auc_score

rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test))

print("AUC for Random Forest: %f" % rf_auc)
print("AUC for SVC:           %f" % svc_auc)

AUC is also sensitive to changes in hyperparameters.

In [None]:
data = pd.read_csv("data/bank-campaign.csv")
X = data.drop(columns="target").values
y = data.target.values

# use small train and test subsets (10% of the data each)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, train_size=.1, test_size=.1)

plots = []
for gamma in [1, 0.01, 0.001]:
    svc = SVC(gamma=gamma).fit(X_train, y_train)
    
    accuracy = svc.score(X_test, y_test)
    auc = roc_auc_score(y_test == "yes", svc.decision_function(X_test))
    print("gamma = %.03f  accuracy = %.02f  AUC = %.02f" % (gamma, accuracy, auc))
    
    fpr, tpr, _ = roc_curve(y_test , svc.decision_function(X_test), pos_label="yes")
    plots.append(hv.Curve((fpr,tpr), 'FPR', 'TPR', label=f'gamma={gamma:.03f}'))

(plots[0] * plots[1] * plots[2]).options(width=500, height=500)

## Multi-class classification

In a binary classification problem, the confusion matrix has a shape of `(2, 2)`.  In a multi-class problem, the confusion matrix has a shape `(n_classes, n_classes)` to show the correctly vs incorrectly classified counts for each class versus every other class.

A perfect confusion matrix has nonzero entries only on the main diagonal and zeros elsewhere.
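
A small three-class sketch with made-up labels (an assumption for illustration) shows the `(n_classes, n_classes)` shape and how the diagonal relates to accuracy: correct predictions sit on the diagonal, so accuracy is the trace divided by the total count.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical three-class labels for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes.
accuracy_from_cm = np.trace(cm_demo) / cm_demo.sum()

print(cm_demo)
print("accuracy from diagonal: %.3f" % accuracy_from_cm)
```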

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits
digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

lr = LogisticRegression(solver='liblinear', multi_class='auto').fit(X_train, y_train)
pred = lr.predict(X_test)

mat = confusion_matrix(y_test, pred)

print("accuracy: %0.3f" % accuracy_score(y_test, pred))
print("confusion matrix:")
print(mat)

In [None]:
print(classification_report(y_test, pred))

In [None]:
print("micro average f1 score: %f" % f1_score(y_test, pred, average="micro"))
print("macro average f1 score: %f" % f1_score(y_test, pred, average="macro"))
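
The two averages answer different questions. The macro average is the unweighted mean of the per-class f1 scores, so every class counts equally. The micro average pools all individual decisions, which for single-label multi-class problems makes it equal to accuracy. The sketch below checks both statements on made-up labels (an assumption for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical three-class labels for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 0])

per_class_f1 = f1_score(y_true_demo, y_pred_demo, average=None)
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")
micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")

print("per-class f1:", per_class_f1)
print("macro f1 == mean of per-class f1:",
      np.isclose(macro_f1, per_class_f1.mean()))
print("micro f1 == accuracy:",
      np.isclose(micro_f1, accuracy_score(y_true_demo, y_pred_demo)))
```

When classes are imbalanced, the macro average gives rare classes as much weight as common ones, while the micro average is dominated by the common classes.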

# Summary

In this notebook, we reviewed the following topics in preparation for more advanced topics:

 * [Evaluation metrics and scoring](#Evaluation-metrics-and-scoring)
 * [Metrics for binary classification](#Metrics-for-binary-classification)
 * [Confusion matrices](#Confusion-matrices)
 * [Precision, recall and f-score](#Precision,-recall-and-f-score)
 * [Precision-Recall curves and ROC curves](#Precision-Recall-curves-and-ROC-curves)
 * [ROC and AUC](#ROC-and-AUC)
 * [Multi-class classification](#Multi-class-classification)

<a href='Evaluation_Metrics_Exercises.ipynb' class='btn btn-primary btn-lg'>Exercises</a>

<img src='img/copyright.png'>