<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Imbalanced_Datasets_4c.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ROC curves and precision-recall curves are diagnostic tools for binary classification models.<br>

ROC AUC and Precision-Recall AUC provide scores that summarize the curves and can be used to compare classifiers.<br>

ROC Curves and ROC AUC can be optimistic on severely imbalanced classification problems with few samples of the minority class.<br>
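
As a rough sketch of what it means for a score to summarize a curve: an AUC is the area under the curve's points. The snippet below approximates that area with the trapezoidal rule, which is also what scikit-learn's `auc` function uses. The helper `trapezoid_area` and the curve points are hypothetical, made up for illustration.

```python
# Minimal sketch: area under a curve given as (x, y) points (trapezoidal rule).
def trapezoid_area(xs, ys):
    """Sum the trapezoid areas between consecutive points; xs must be ascending."""
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        mean_height = (ys[i] + ys[i - 1]) / 2.0
        area += width * mean_height
    return area

# hypothetical ROC points (false positive rate, true positive rate)
example_fpr = [0.0, 0.1, 0.4, 1.0]
example_tpr = [0.0, 0.6, 0.9, 1.0]
roc_auc_sketch = trapezoid_area(example_fpr, example_tpr)
print('Sketch ROC AUC: %.3f' % roc_auc_sketch)

# the diagonal of a no-skill ROC curve encloses an area of 0.5
diagonal_auc = trapezoid_area([0.0, 1.0], [0.0, 1.0])
print('Diagonal ROC AUC: %.3f' % diagonal_auc)
```

A curve that bows further toward the top-left corner encloses more area, so a higher score indicates a better ranking of positives over negatives.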

# **Precision-Recall Curves**


**Precision = TruePositive /( TruePositive + FalsePositive)**<br>

A value between 0.0 (no precision) and 1.0 (perfect precision).

**Recall = TruePositive / (TruePositive + FalseNegative)**<br>

A value between 0.0 (no recall) and 1.0 (perfect recall).
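
The two formulas above can be checked by hand on a small example. The labels below are hypothetical, made up for illustration; the counts are tallied directly rather than through a library call.

```python
# hypothetical true and predicted labels (1 = positive class)
example_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
example_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(example_true, example_pred))
# count each outcome type needed by the two formulas
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision_example = tp / (tp + fp)
recall_example = tp / (tp + fn)
print('TP=%d, FP=%d, FN=%d' % (tp, fp, fn))
print('Precision: %.3f' % precision_example)
print('Recall: %.3f' % recall_example)
```

Here two of the three predicted positives are correct (precision 0.667), while only two of the four actual positives are found (recall 0.500).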

**Import libraries**

In [None]:
# example of a precision-recall curve for a predictive model
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from matplotlib import pyplot
from numpy import where
from sklearn.metrics import auc
from collections import Counter

**Create a dataset**

In [None]:
# generate 2 class dataset
X, y = make_classification(n_samples=10000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)

In [None]:
for class_value in range(2):
  # get row indexes for samples with this class
  row_ix = where(y == class_value)
  # create scatter of the first two features for these samples
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

**Create and train a logistic regression model**

In [None]:
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
yhat = model.predict_proba(testX)

In [None]:
# retrieve just the probabilities for the positive class
pos_probs = yhat[:, 1]
# calculate the no skill line as the proportion of the positive class in the test set
no_skill = len(testy[testy==1]) / len(testy)

In [None]:
# calculate model precision-recall curve
precision, recall, _ = precision_recall_curve(testy, pos_probs)

The Precision-Recall Curve for the Logistic Regression model is shown (orange).
A random or baseline classifier is shown as a horizontal line (blue with dashes).

A model with perfect skill is depicted as a point at a coordinate of (1,1). <br>

A skillful model is represented by a curve that bows towards a coordinate of (1,1). <br>

A no-skill classifier will be a horizontal line on the plot with a precision equal to the proportion of positive examples in the dataset.<br>

For a balanced dataset this will be 0.5.<br>
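
A small simulation can illustrate why the no-skill precision matches the positive-class proportion. This is only a sketch: the class balance of 0.1, the sample size, and the coin-flip predictor are assumptions chosen for illustration and are not part of the notebook's dataset.

```python
import random

random.seed(1)
n_samples = 100000
positive_rate = 0.1  # assumed class balance for this illustration

# random labels with roughly 10% positives
labels = [1 if random.random() < positive_rate else 0 for _ in range(n_samples)]
# a no-skill classifier: coin-flip guesses that ignore the inputs entirely
guesses = [1 if random.random() < 0.5 else 0 for _ in range(n_samples)]

true_pos = sum(1 for t, g in zip(labels, guesses) if t == 1 and g == 1)
pred_pos = sum(guesses)
baseline_precision = true_pos / pred_pos
observed_rate = sum(labels) / n_samples
print('Positive proportion: %.3f' % observed_rate)
print('No-skill precision: %.3f' % baseline_precision)
```

Because the guesses are independent of the labels, the share of positives among the predicted positives is about the same as the share of positives overall.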

**The focus of the PR curve on the minority class makes it an effective diagnostic for imbalanced binary classification models**

Precision-recall curves (PR curves) are recommended for highly skewed domains,
where ROC curves may provide an excessively optimistic view of performance.
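
One way to see where the optimism comes from: the false positive rate divides by the (large) number of negatives, whereas precision divides by the number of predicted positives. The counts below are hypothetical, chosen to resemble a 99:1 class split, and are not results from this notebook.

```python
# hypothetical confusion-matrix counts: 990 negatives, 10 positives
example_tp, example_fn = 8, 2
example_fp, example_tn = 50, 940

# ROC axes: true positive rate (recall) and false positive rate
example_tpr = example_tp / (example_tp + example_fn)
example_fpr = example_fp / (example_fp + example_tn)
# PR axis: precision
example_precision = example_tp / (example_tp + example_fp)

print('TPR: %.3f' % example_tpr)
print('FPR: %.3f' % example_fpr)
print('Precision: %.3f' % example_precision)
```

On the ROC plot this operating point looks strong (TPR 0.800 at FPR 0.051), yet fewer than one in seven predicted positives is correct (precision 0.138).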

In [None]:
# plot the no skill precision-recall curve
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
# plot the model precision-recall curve
pyplot.plot(recall, precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

# **ROC and PR Curves with Severely Imbalanced Datasets**

**Import the libraries**

In [None]:
# ROC and PR curves on an imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from matplotlib import pyplot

**Create the dataset**

In [None]:
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01],
                           random_state=1)

In [None]:
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# summarize dataset
print('Dataset: Class0=%d, Class1=%d' % (len(y[y==0]), len(y[y==1])))
print('Train: Class0=%d, Class1=%d' % (len(trainy[trainy==0]), len(trainy[trainy==1]))) 
print('Test: Class0=%d, Class1=%d' % (len(testy[testy==0]), len(testy[testy==1])))

In [None]:
# plot no skill and model roc curves
def plot_roc_curve(test_y, naive_probs, model_probs):
  # plot naive skill roc curve
  fpr, tpr, _ = roc_curve(test_y, naive_probs)
  pyplot.plot(fpr, tpr, linestyle='--', label='No Skill')
  # plot model roc curve
  fpr, tpr, _ = roc_curve(test_y, model_probs)
  pyplot.plot(fpr, tpr, marker='.', label='Logistic')
  # axis labels
  pyplot.xlabel('False Positive Rate')
  pyplot.ylabel('True Positive Rate')
  # show the legend
  pyplot.legend()
  # show the plot
  pyplot.show()

**Create and train a model with no skill**

In [None]:
# no skill model, stratified random class predictions
model = DummyClassifier(strategy='stratified') 
model.fit(trainX, trainy)
yhat = model.predict_proba(testX)
naive_probs = yhat[:, 1]
# calculate roc auc
roc_auc = roc_auc_score(testy, naive_probs) 
print('No Skill ROC AUC %.3f' % roc_auc)

**Create and train a logistic regression model**

In [None]:
# skilled model
model = LogisticRegression(solver='lbfgs') 
model.fit(trainX, trainy)
yhat = model.predict_proba(testX) 
model_probs = yhat[:, 1]

**Plot the ROC Curves**

In [None]:
# calculate roc auc
roc_auc = roc_auc_score(testy, model_probs) 
print('Logistic ROC AUC %.3f' % roc_auc)
# plot roc curves
plot_roc_curve(testy, naive_probs, model_probs)

In [None]:
def plot_pr_curve(test_y, model_probs):
  # calculate the no skill line as the proportion of the positive class
  no_skill = len(test_y[test_y==1]) / len(test_y)
  # plot the no skill precision-recall curve
  pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
  # plot model precision-recall curve (use the test_y argument, not the global testy)
  precision, recall, _ = precision_recall_curve(test_y, model_probs)
  pyplot.plot(recall, precision, marker='.', label='Logistic')
  # axis labels
  pyplot.xlabel('Recall')
  pyplot.ylabel('Precision')
  # show the legend
  pyplot.legend()
  # show the plot
  pyplot.show()

In [None]:
precision, recall, _ = precision_recall_curve(testy, naive_probs)
auc_score = auc(recall, precision)
print('No Skill PR AUC: %.3f' % auc_score)

In [None]:
# calculate the precision-recall auc
precision, recall, _ = precision_recall_curve(testy, model_probs) 
auc_score = auc(recall, precision)
print('Logistic PR AUC: %.3f' % auc_score)

The horizontal line of the no-skill classifier is expected. In this case, the zig-zag line of the logistic regression curve stays close to the no-skill line.

In [None]:
# plot precision-recall curves
plot_pr_curve(testy, model_probs)

To explain why the ROC and PR curves tell a different story, recall that the PR curve focuses on the minority class, whereas the ROC curve covers both classes. If we use a threshold of 0.5 and use the logistic regression model to make a prediction for all examples in the test set, we see that it predicts class 0 or the majority class in all cases. 

In [None]:
# predict probabilities
yhat = model.predict_proba(testX)
# retrieve just the probabilities for the positive class
pos_probs = yhat[:, 1]
# predict class labels
yhat = model.predict(testX)
# summarize the distribution of class labels
print(Counter(yhat))

In [None]:
# create a histogram of the predicted probabilities 
pyplot.hist(pos_probs, bins=100) 
pyplot.show()
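
Because the model predicts class 0 for every test example at a threshold of 0.5, its positive-class probabilities must all fall below that cutoff. ROC and PR curves are traced by sweeping the threshold, so they can still register skill that the default labels hide. The sketch below shows the effect on a handful of hypothetical probabilities; `labels_at` is an illustrative helper and the cutoff of 0.15 is an arbitrary assumption, not a recommended value.

```python
# hypothetical positive-class probabilities, made up for illustration
example_probs = [0.01, 0.03, 0.08, 0.20, 0.35]

def labels_at(probs, threshold):
    """Assign class 1 to probabilities at or above the threshold, else class 0."""
    return [1 if p >= threshold else 0 for p in probs]

# the default cutoff labels every example as the majority class
default_labels = labels_at(example_probs, 0.5)
# a lower, arbitrarily chosen cutoff starts to flag some examples as positive
lower_labels = labels_at(example_probs, 0.15)

print('threshold=0.50:', default_labels)
print('threshold=0.15:', lower_labels)
```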