# Evaluation of Diagnostic Models

Welcome to the second assignment of Course 1. In this assignment, we will work with the results of the X-ray classification model we developed in the previous assignment. To keep the data processing manageable, we will work with a subset of our training and validation datasets. We will also use our manually labeled test dataset of 420 X-rays.

As a reminder, our dataset contains X-rays labeled for 14 different conditions that can be diagnosed from an X-ray. We'll evaluate our performance on each of these classes using the classification metrics we learned in lecture.

**By the end of this assignment you will learn about:**

1. Accuracy
2. Prevalence
3. Specificity & Sensitivity
4. PPV and NPV
5. ROC curve and AUROC (c-statistic)
6. Confidence Intervals

## Table of Contents
- [1. Packages](#1)
- [2. Overview](#2)
- [3. Metrics](#3)
    - [3.1 - True Positives, False Positives, True Negatives and False Negatives](#3-1)
        - [Exercise 1 - true positives, false positives, true negatives, and false negatives](#ex-1)
    - [3.2 - Accuracy](#3-2)
        - [Exercise 2 - get_accuracy](#ex-2)
    - [3.3 - Prevalence](#3-3)
        - [Exercise 3 - get_prevalence](#ex-3)
    - [3.4 Sensitivity and Specificity](#3-4)
        - [Exercise 4 - get_sensitivity and get_specificity](#ex-4)
    - [3.5 PPV and NPV](#3-5)
        - [Exercise 5 - get_ppv and get_npv](#ex-5)
    - [3.6 ROC Curve](#3-6)
- [4. Confidence Intervals](#4)
- [5. Precision-Recall Curve](#5)
- [6. F1 Score](#6)
- [7. Calibration](#7)

## 1. Packages <a name='1'></a>

We'll use:
- numpy: scientific computing
- matplotlib: visualization
- pandas: data manipulation
- sklearn: performance metrics
- util, public_tests, test_utils: provided utilities


In [None]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  

import util
from public_tests import *
from test_utils import *

## 2. Overview <a name='2'></a>

We'll go through our evaluation metrics in the following order:
- TP, TN, FP, FN
- Accuracy
- Prevalence
- Sensitivity and Specificity
- PPV and NPV
- AUC
- Confidence Intervals
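
The first five metrics listed above can be previewed on a toy example before we turn to the real predictions. The sketch below is illustrative only: the eight labels and scores are made up, the variable names are ours, and the 0.5 threshold is an assumption that mirrors the default `th=0.5` used by the functions in this notebook.

```python
import numpy as np

# Made-up ground truth and model scores for a single hypothetical condition
y_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0])
pred_toy = np.array([0.8, 0.3, 0.6, 0.2, 0.9, 0.1, 0.4, 0.7])
th = 0.5  # assumed decision threshold

pred_pos = pred_toy >= th  # True where the model calls the case positive

# Confusion-matrix counts
tp = int(np.sum((y_toy == 1) & pred_pos))
tn = int(np.sum((y_toy == 0) & ~pred_pos))
fp = int(np.sum((y_toy == 0) & pred_pos))
fn = int(np.sum((y_toy == 1) & ~pred_pos))

# Metrics derived from the counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
prevalence = float(np.mean(y_toy == 1))
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} prevalence={prevalence:.3f}")
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"PPV={ppv:.3f} NPV={npv:.3f}")
```

Every metric here is a ratio of the same four counts, which is why the exercises below start with TP, TN, FP, and FN.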

In [None]:
train_results = pd.read_csv("data/train_preds.csv")
valid_results = pd.read_csv("data/valid_preds.csv")

class_labels = ['Cardiomegaly', 'Emphysema', 'Effusion', 'Hernia', 'Infiltration',
 'Mass', 'Nodule', 'Atelectasis', 'Pneumothorax', 'Pleural_Thickening',
 'Pneumonia', 'Fibrosis', 'Edema', 'Consolidation']
pred_labels = [l + "_pred" for l in class_labels]

y = valid_results[class_labels].values
pred = valid_results[pred_labels].values

In [None]:
# peek at the dataset
valid_results[np.concatenate([class_labels, pred_labels])].head()

In [None]:
plt.xticks(rotation=90)
plt.bar(x=class_labels, height=y.sum(axis=0));

## 3. Metrics <a name='3'></a>

### 3.1 True Positives, False Positives, True Negatives and False Negatives <a name='3-1'></a>
#### Exercise 1 - true positives, false positives, true negatives, and false negatives <a name='ex-1'></a>


In [None]:
def true_positives(y, pred, th=0.5):
    TP = np.sum((y == 1) & (pred >= th))
    return TP

def true_negatives(y, pred, th=0.5):
    TN = np.sum((y == 0) & (pred < th))
    return TN

def false_positives(y, pred, th=0.5):
    FP = np.sum((y == 0) & (pred >= th))
    return FP

def false_negatives(y, pred, th=0.5):
    FN = np.sum((y == 1) & (pred < th))
    return FN

In [None]:
# test
get_tp_tn_fp_fn_test(true_positives, true_negatives, false_positives, false_negatives)

In [None]:
util.get_performance_metrics(y, pred, class_labels)

### 3.2 - Accuracy <a name='3-2'></a>
#### Exercise 2 - get_accuracy <a name='ex-2'></a>


In [None]:
def get_accuracy(y, pred, th=0.5):
    TP = true_positives(y, pred, th)
    FP = false_positives(y, pred, th)
    TN = true_negatives(y, pred, th)
    FN = false_negatives(y, pred, th)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    return accuracy

In [None]:
get_accuracy_test(get_accuracy)

In [None]:
util.get_performance_metrics(y, pred, class_labels, acc=get_accuracy)

In [None]:
# What if we predicted all zeros for 'Emphysema'?
get_accuracy(valid_results["Emphysema"].values, np.zeros(len(valid_results)))
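
The all-zeros question above can be reasoned through without the real data. As a minimal sketch, assume a made-up condition with 5% prevalence (this number is hypothetical, not taken from our dataset): a model that never predicts the condition scores an accuracy of one minus the prevalence while detecting no positive cases at all.

```python
import numpy as np

# Hypothetical labels: 5 positives out of 100 cases (5% prevalence)
y_rare = np.array([1] * 5 + [0] * 95)
pred_zeros = np.zeros(len(y_rare))  # a "model" that always outputs 0
th = 0.5

pred_pos = pred_zeros >= th
tp = np.sum((y_rare == 1) & pred_pos)
tn = np.sum((y_rare == 0) & ~pred_pos)
fn = np.sum((y_rare == 1) & ~pred_pos)

accuracy_zeros = (tp + tn) / len(y_rare)
sensitivity_zeros = tp / (tp + fn)
prevalence_rare = np.mean(y_rare == 1)

print(f"accuracy={accuracy_zeros:.2f}, 1 - prevalence={1 - prevalence_rare:.2f}")
print(f"sensitivity={sensitivity_zeros:.2f}")
```

This is why accuracy alone can be misleading for rare conditions, and why the following sections look at prevalence, sensitivity, and specificity.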

### 3.3 - Prevalence <a name='3-3'></a>
#### Exercise 3 - get_prevalence <a name='ex-3'></a>


In [None]:
def get_prevalence(y):
    prevalence = np.mean(y == 1)
    return prevalence

In [None]:
get_prevalence_test(get_prevalence)

In [None]:
util.get_performance_metrics(y, pred, class_labels, acc=get_accuracy, prevalence=get_prevalence)

### 3.4 Sensitivity and Specificity <a name='3-4'></a>
#### Exercise 4 - get_sensitivity and get_specificity <a name='ex-4'></a>


In [None]:
def get_sensitivity(y, pred, th=0.5):
    TP = true_positives(y, pred, th)
    FN = false_negatives(y, pred, th)
    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    return sensitivity

def get_specificity(y, pred, th=0.5):
    TN = true_negatives(y, pred, th)
    FP = false_positives(y, pred, th)
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    return specificity

In [None]:
get_sensitivity_specificity_test(get_sensitivity, get_specificity)

In [None]:
util.get_performance_metrics(y, pred, class_labels, acc=get_accuracy, prevalence=get_prevalence, 
                        sens=get_sensitivity, spec=get_specificity)

### 3.5 PPV and NPV <a name='3-5'></a>
#### Exercise 5 - get_ppv and get_npv <a name='ex-5'></a>


In [None]:
def get_ppv(y, pred, th=0.5):
    TP = true_positives(y, pred, th)
    FP = false_positives(y, pred, th)
    PPV = TP / (TP + FP) if (TP + FP) > 0 else 0
    return PPV

def get_npv(y, pred, th=0.5):
    TN = true_negatives(y, pred, th)
    FN = false_negatives(y, pred, th)
    NPV = TN / (TN + FN) if (TN + FN) > 0 else 0
    return NPV

In [None]:
get_ppv_npv_test(get_ppv, get_npv)

In [None]:
util.get_performance_metrics(y, pred, class_labels, acc=get_accuracy, prevalence=get_prevalence, 
                        sens=get_sensitivity, spec=get_specificity, ppv=get_ppv, npv=get_npv)

### 3.6 ROC Curve <a name='3-6'></a>


In [None]:
util.get_curve(y, pred, class_labels)

In [None]:
from sklearn.metrics import roc_auc_score
util.get_performance_metrics(y, pred, class_labels, acc=get_accuracy, prevalence=get_prevalence, 
                        sens=get_sensitivity, spec=get_specificity, ppv=get_ppv, npv=get_npv, auc=roc_auc_score)
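
The AUC reported above also has a ranking interpretation, which is where the name c-statistic comes from: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes that quantity directly with NumPy on made-up data; `pairwise_auc` is a hypothetical helper written for illustration and is not part of `util` or `sklearn`.

```python
import numpy as np

def pairwise_auc(y, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y == 1]
    neg = scores[y == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    # Correctly ordered pairs count 1, ties count 0.5
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Made-up labels and scores
y_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0])
pred_toy = np.array([0.8, 0.3, 0.6, 0.2, 0.9, 0.1, 0.4, 0.7])

auc_toy = pairwise_auc(y_toy, pred_toy)
print(f"pairwise AUC = {auc_toy:.3f}")
```

Because this definition depends only on how the scores are ordered, the AUC does not change when the decision threshold changes, unlike the metrics in the earlier sections.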

## 4. Confidence Intervals <a name='4'></a>


In [None]:
def bootstrap_auc(y, pred, classes, bootstraps = 100, fold_size = 1000):
    statistics = np.zeros((len(classes), bootstraps))

    for c in range(len(classes)):
        # build the per-class frame directly; assigning columns into an
        # empty DataFrame via .loc is fragile across pandas versions
        df = pd.DataFrame({'y': y[:, c], 'pred': pred[:, c]})
        # split positive and negative examples for stratified sampling
        df_pos = df[df.y == 1]
        df_neg = df[df.y == 0]
        prevalence = len(df_pos) / len(df)
        for i in range(bootstraps):
            # stratified sampling of positive and negative examples
            pos_sample = df_pos.sample(n = int(fold_size * prevalence), replace=True)
            neg_sample = df_neg.sample(n = int(fold_size * (1-prevalence)), replace=True)

            y_sample = np.concatenate([pos_sample.y.values, neg_sample.y.values])
            pred_sample = np.concatenate([pos_sample.pred.values, neg_sample.pred.values])
            score = roc_auc_score(y_sample, pred_sample)
            statistics[c][i] = score
    return statistics

statistics = bootstrap_auc(y, pred, class_labels)

In [None]:
util.print_confidence_intervals(class_labels, statistics)
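
One common way to turn an array of bootstrap statistics into an interval is the percentile method: take the 2.5th and 97.5th percentiles of the bootstrap distribution as a 95% interval. The sketch below shows the idea on made-up data. We have not shown the internals of `util.print_confidence_intervals`, so treat `percentile_interval` as a hypothetical helper and not as a description of what `util` does.

```python
import numpy as np

def percentile_interval(stats, alpha=0.05):
    """Percentile bootstrap interval covering 1 - alpha of the statistics."""
    lower = np.percentile(stats, 100 * alpha / 2)
    upper = np.percentile(stats, 100 * (1 - alpha / 2))
    return lower, upper

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up data: whether each of 200 toy predictions was correct (~80% are)
correct = rng.random(200) < 0.8

# Resample with replacement and recompute accuracy each time
boot_acc = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(500)
])

low, high = percentile_interval(boot_acc)
print(f"accuracy = {correct.mean():.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

The same helper could be applied row by row to an array shaped like `statistics`, one row per class.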

## 5. Precision-Recall Curve <a name='5'></a>


In [None]:
util.get_curve(y, pred, class_labels, curve='prc')

## 6. F1 Score <a name='6'></a>


In [None]:
from sklearn.metrics import f1_score
util.get_performance_metrics(y, pred, class_labels, acc=get_accuracy, prevalence=get_prevalence, 
                        sens=get_sensitivity, spec=get_specificity, ppv=get_ppv, npv=get_npv, auc=roc_auc_score, f1=f1_score)

## 7. Calibration <a name='7'></a>


In [None]:
from sklearn.calibration import calibration_curve
def plot_calibration_curve(y, pred):
    plt.figure(figsize=(20, 20))
    for i in range(len(class_labels)):
        plt.subplot(4, 4, i + 1)
        fraction_of_positives, mean_predicted_value = calibration_curve(y[:,i], pred[:,i], n_bins=20)
        plt.plot([0, 1], [0, 1], linestyle='--')
        plt.plot(mean_predicted_value, fraction_of_positives, marker='.')
        plt.xlabel("Predicted Value")
        plt.ylabel("Fraction of Positives")
        plt.title(class_labels[i])
    plt.tight_layout()
    plt.show()

plot_calibration_curve(y, pred)
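
To make the plotted quantities concrete, the sketch below computes calibration points by hand with NumPy: predictions are grouped into equal-width bins, and each non-empty bin contributes its mean predicted value and its observed fraction of positives. This is a simplified illustration on made-up data; `calibration_points` is a hypothetical helper, and `sklearn`'s `calibration_curve` may differ in details such as bin-edge handling.

```python
import numpy as np

def calibration_points(y, pred, n_bins=10):
    """Return (fraction_of_positives, mean_predicted_value) per non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each prediction to a bin id in 0..n_bins-1
    bin_ids = np.digitize(pred, edges[1:-1])
    frac_pos, mean_pred = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # skip empty bins
            frac_pos.append(y[mask].mean())
            mean_pred.append(pred[mask].mean())
    return np.array(frac_pos), np.array(mean_pred)

# Made-up, perfectly calibrated example: 20% of the 0.2-scored cases and
# 80% of the 0.8-scored cases are truly positive
pred_toy = np.array([0.2] * 5 + [0.8] * 5)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

frac_pos, mean_pred = calibration_points(y_toy, pred_toy, n_bins=2)
print("mean predicted value:", mean_pred)
print("fraction of positives:", frac_pos)
```

For a well-calibrated model the two arrays are close to each other, so the points fall near the dashed diagonal in the plots above.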

In [None]:
from sklearn.linear_model import LogisticRegression as LR 

y_train = train_results[class_labels].values
pred_train = train_results[pred_labels].values
pred_calibrated = np.zeros_like(pred)

for i in range(len(class_labels)):
    lr = LR(solver='liblinear', max_iter=10000)
    lr.fit(pred_train[:, i].reshape(-1, 1), y_train[:, i])    
    pred_calibrated[:, i] = lr.predict_proba(pred[:, i].reshape(-1, 1))[:,1]

In [None]:
plot_calibration_curve(y, pred_calibrated)

---
## That's it!
Congratulations! That was a lot of metrics to get familiar with.
We hope you now feel more confident in your understanding of medical diagnostic evaluation and will test your models correctly in your future work :)
