# ROC and AUC 

This notebook covers the basics of Receiver Operating Characteristic (ROC) curves and the area under the curve (AUC).

Source: "ROC and AUC, Clearly Explained!", StatQuest with Josh Starmer https://www.youtube.com/watch?v=4jRBRDbJemM

Data Science Project Steps:

1) Load the data
2) Define appropriate X and y
3) Normalize the data
4) Split the data into train and test sets
5) Fit a model
6) Predict
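
The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's own solution: it assumes the binary setosa/versicolor subset of iris and a default `LogisticRegression`, and it splits before scaling so that `StandardScaler` is fit on the training rows only (an ordering chosen here, not one the list prescribes).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1) Load the data
iris_data = load_iris()

# 2) Define X and y (assumption: keep only classes 0 and 1 for a binary problem)
mask = iris_data.target != 2
X, y = iris_data.data[mask], iris_data.target[mask]

# 4) Split into train and test sets (done before scaling in this sketch)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3) Normalize: fit the scaler on the training rows, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5) Fit a model
model = LogisticRegression().fit(X_train_scaled, y_train)

# 6) Predict
y_pred = model.predict(X_test_scaled)
print(y_pred[:10])
```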

## Part 1: Model Confusion Matrix


### A confusion matrix for a binary model
![Binary Classification Confusion Matrix](supporting_images/Binary_Classifiction_Confusion_Matrix.jpg)


### A confusion matrix for a multi-categorical model
![Multi-Categorical Classification Confusion Matrix](supporting_images/Multi_Categorical_Models.png)
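
As a rough illustration, a binary confusion matrix can be computed with scikit-learn's `confusion_matrix`. The labels below are made up for the example. Note that scikit-learn puts true classes on rows and predicted classes on columns, with the negative class first, which may not match the layout of the pictures above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```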


## Part 2: Sensitivity and Specificity

Sensitivity (the true positive rate) is the proportion of actual positives that the model correctly identifies: TP / (TP + FN). Specificity (the true negative rate) is the proportion of actual negatives that the model correctly identifies: TN / (TN + FP).

![Sensitivity and Specificity Calculations](supporting_images/Sensitivity_and_Specificity.png)
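
A small sketch of the two calculations, using hypothetical counts (3 true positives, 1 false negative, 5 true negatives, 1 false positive) that were chosen only for illustration:

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts.
tp, fn, tn, fp = 3, 1, 5, 1

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print("Sensitivity:", sens)
print("Specificity:", round(spec, 3))
```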


### Calculating Sensitivity
![Calculating Sensitivity](supporting_images/Sensitivity_Categorical_Models.png)


### Calculating Specificity
![Calculating Specificity](supporting_images/Specificity_Categorical_Models.png)
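
For a model with more than two classes, one common approach is to treat each class in turn as "positive" and all other classes as "negative" (one-vs-rest). The sketch below assumes a hypothetical 3-class confusion matrix with true classes on rows and predicted classes on columns; the numbers are invented.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class).
cm = np.array([[12, 1, 0],
               [2, 9, 3],
               [0, 2, 11]])

total = cm.sum()
tp = np.diag(cm)              # correctly predicted samples of each class
fn = cm.sum(axis=1) - tp      # samples of the class predicted as something else
fp = cm.sum(axis=0) - tp      # samples of other classes predicted as this class
tn = total - tp - fn - fp     # everything else

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

for k in range(cm.shape[0]):
    print(f"class {k}: sensitivity={sensitivity[k]:.3f}, specificity={specificity[k]:.3f}")
```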



## Part 3: ROC and AUC

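Let's evaluate a binary classifier whose predictions depend on a decision threshold. An ROC curve plots the true positive rate against the false positive rate as that threshold varies, and the AUC summarizes the whole curve in a single number.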
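
To make the role of the threshold concrete, here is a minimal sketch that computes one (FPR, TPR) point per threshold. The labels and predicted probabilities are made up, and `rates_at` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])


def rates_at(threshold):
    """Return (true positive rate, false positive rate) at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)


# Each threshold yields one point on the ROC curve.
for t in [0.2, 0.5, 0.8]:
    tpr_t, fpr_t = rates_at(t)
    print(f"threshold={t}: TPR={tpr_t:.3f}, FPR={fpr_t:.3f}")
```

Lowering the threshold labels more samples as positive, so both rates rise together; the ROC curve traces that trade-off.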


In [None]:
#Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [None]:
from sklearn import datasets

## Classes 0 and 1

In [None]:
iris = datasets.load_iris()

X = iris.data
y = iris.target
iris = pd.DataFrame(data= np.c_[X, y],
                     columns= iris['feature_names'] + ['target'])

In [None]:
iris.head()

In [None]:
iris = iris[iris.target != 2.0]

In [None]:
X = iris[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]]
X.head()
y = iris["target"]
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
#Fit a model
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code
model_log = logreg.fit(X_train, y_train)
print(model_log)

In [None]:
#Predict
y_hat_test = logreg.predict(X_test)

In [None]:
# Your code here
from sklearn.metrics import roc_curve, auc

# Compute the ROC curve across decision thresholds.
# decision_function returns signed confidence scores (log-odds for logistic
# regression), not probabilities; roc_curve only needs the ranking they induce.
y_score = model_log.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Repeat for the training set, keeping its thresholds in a separate variable.
y_train_score = model_log.decision_function(X_train)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Seaborn's styling
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

plt.figure(figsize=(10,8))
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Setosa and Versicolor Classifier ROC Curve')
plt.legend(loc="lower right")
print('AUC: {}'.format(auc(fpr, tpr)))
plt.show()
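
The `auc(fpr, tpr)` call above integrates the ROC curve with the trapezoidal rule. As a cross-check, the sketch below uses made-up labels and scores to compare that value with `roc_auc_score`, and with the share of positive/negative pairs in which the positive sample receives the higher score (with no tied scores, the AUC equals this share).

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical labels and scores for illustration only (no tied scores).
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

# Area under the ROC curve via the trapezoidal rule.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_trapezoid = auc(fpr, tpr)

# The same quantity computed directly from labels and scores.
area_direct = roc_auc_score(y_true, y_score)

# Fraction of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
area_pairwise = np.mean(pos[:, None] > neg[None, :])

print(area_trapezoid, area_direct, area_pairwise)
```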

## Classes 1 and 2

In [None]:
iris = datasets.load_iris()

X = iris.data
y = iris.target
iris = pd.DataFrame(data= np.c_[X, y],
                     columns= iris['feature_names'] + ['target'])

In [None]:
iris = iris[iris.target != 0.0]

iris.loc[iris.target == 1.0, "target"] = 0
iris.loc[iris.target == 2.0, "target"] = 1

In [None]:
X = iris[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]]
X.head()
y = iris["target"]
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
#Fit a model
logreg = LogisticRegression(fit_intercept = False, C = 1e12) #Starter code
model_log = logreg.fit(X_train, y_train)
print(model_log)

In [None]:
#Predict
y_hat_test = logreg.predict(X_test)

In [None]:
# Your code here
from sklearn.metrics import roc_curve, auc

# Compute the ROC curve across decision thresholds.
# decision_function returns signed confidence scores (log-odds for logistic
# regression), not probabilities; roc_curve only needs the ranking they induce.
y_score = model_log.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Repeat for the training set, keeping its thresholds in a separate variable.
y_train_score = model_log.decision_function(X_train)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Seaborn's styling
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

plt.figure(figsize=(10,8))
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Virginica and Versicolor Classifier ROC Curve')
plt.legend(loc="lower right")
print('AUC: {}'.format(auc(fpr, tpr)))
plt.show()