
In [1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd


In [2]:
X, y = make_classification(n_samples=5000, class_sep=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=314)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred)

0.762605472270009

In [3]:
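# Bin the test-set predictions into ten equal-count groups (deciles).
deciles = pd.qcut(y_pred, 10)
for dec in deciles.categories:
  mask = deciles == dec
  try:
    # AUC computed only on the observations that fall in this decile.
    auc = roc_auc_score(y_test[mask], y_pred[mask])
  except ValueError:
    # roc_auc_score raises ValueError when only one class is present.
    auc = np.nan
  print(dec, auc)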

(0.0194, 0.158] 0.5287610619469026
(0.158, 0.255] 0.5771452145214522
(0.255, 0.348] 0.5102777777777778
(0.348, 0.425] 0.6027647365675535
(0.425, 0.504] 0.5254760679361812
(0.504, 0.579] 0.4971693257848687
(0.579, 0.649] 0.5566666666666666
(0.649, 0.72] 0.48698412698412696
(0.72, 0.822] 0.4751228501228501
(0.822, 0.996] 0.6377995642701525


As I expected, the intra-decile AUCs are all substantially lower than the overall AUC of about 0.76. Three of them even fall slightly below 0.5.

In [4]:
df = pd.DataFrame({'y': y_test, 'p': y_pred, 'decile': deciles})
df.groupby('decile')[['y', 'p']].mean()


| decile | mean y | mean p |
|---|---|---|
| (0.0194, 0.158] | 0.096 | 0.101466 |
| (0.158, 0.255] | 0.192 | 0.204594 |
| (0.255, 0.348] | 0.36 | 0.303286 |
| (0.348, 0.425] | 0.432 | 0.386357 |
| (0.425, 0.504] | 0.464 | 0.465099 |
| (0.504, 0.579] | 0.536 | 0.539693 |
| (0.579, 0.649] | 0.64 | 0.616535 |
| (0.649, 0.72] | 0.72 | 0.68452 |
| (0.72, 0.822] | 0.704 | 0.767459 |
| (0.822, 0.996] | 0.864 | 0.893887 |


The decile table here helps to show why, given the interpretation of AUROC as the probability that a randomly chosen positive observation is scored higher than a randomly chosen negative one.  The model-score deciles have nearly monotonic mean responses.  A random (positive, negative) pair is either intra-decile or inter-decile.  The overall AUC is obtained by considering all of these pairs, but the individual decile AUCs consider just the intra-decile pairs.  Since the deciles' response rates are quite different, the inter-decile pairs give a significant boost to the overall AUC.
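The pair-counting argument above can be checked directly. What follows is a minimal sketch, not part of the original notebook: `split_auc_by_group` is a hypothetical helper name, and the data are simulated (uniform scores that are calibrated by construction) rather than the logistic regression predictions, so the numbers will not match the outputs above. It scores every (+,-) pair, then averages separately over pairs that share a decile and pairs that do not.

```python
import numpy as np


def split_auc_by_group(y_true, y_score, group):
    """Pairwise AUC overall, and restricted to same-group / cross-group (+,-) pairs.

    Hypothetical helper for illustration. It builds a dense
    (n_pos x n_neg) matrix, so it only suits small samples.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    group = np.asarray(group)
    pos, neg = y_true == 1, y_true == 0

    # Score difference for every (positive, negative) pair.
    diff = y_score[pos][:, None] - y_score[neg][None, :]
    # 1 for a correctly ordered pair, 0.5 for a tie, 0 otherwise.
    wins = (diff > 0) + 0.5 * (diff == 0)
    # True where the positive and the negative sit in the same group.
    same = group[pos][:, None] == group[neg][None, :]

    return {
        "overall": wins.mean(),
        "intra": wins[same].mean(),
        "inter": wins[~same].mean(),
        "intra_share": same.mean(),  # fraction of (+,-) pairs that are intra-group
    }


# Simulated data (an assumption, not the notebook's model output).
rng = np.random.default_rng(0)
n = 2000
p = rng.uniform(size=n)                    # predicted probabilities
y = (rng.uniform(size=n) < p).astype(int)  # labels drawn with P(y=1) = p
decile = np.argsort(np.argsort(p)) * 10 // n  # rank-based deciles, 0..9

res = split_auc_by_group(y, p, decile)
print(res)
```

The overall value is exactly the pair-weighted average `intra_share * intra + (1 - intra_share) * inter`, which is why well-ordered inter-decile pairs can lift the overall AUC well above the intra-decile figure.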

# But...

Theoretically, it's possible for the intra-decile AUCs to be better than the overall AUC, but it would be very strange for a model to produce that: it requires that pairs with larger differences in predicted probability (inter-decile (+,-) pairs) are in the right order less often than pairs with smaller differences (intra-decile pairs).  The contrived example below does exactly that.

In [5]:
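# Contrived scores: 50 evenly spaced values, so each decile holds 5 consecutive points.
y_pred_2 = np.arange(50) / 50
# Within every block of 5 the negatives come first (perfect intra-decile ordering),
# but the share of positives falls from 80% to 20% as the scores rise.
y_true_2 = np.array(
    [0, 1, 1, 1, 1] * 3
    + [0, 0, 1, 1, 1] * 2
    + [0, 0, 0, 1, 1] * 2
    + [0, 0, 0, 0, 1] * 3
)
roc_auc_score(y_true_2, y_pred_2)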

0.2704

In [6]:
deciles = pd.qcut(y_pred_2, 10)
for dec in deciles.categories:
  mask = deciles == dec
  try:
    auc = roc_auc_score(y_true_2[mask], y_pred_2[mask])
  except ValueError:
    auc = np.nan
  print(dec, auc)

(-0.001, 0.098] 1.0
(0.098, 0.196] 1.0
(0.196, 0.294] 1.0
(0.294, 0.392] 1.0
(0.392, 0.49] 1.0
(0.49, 0.588] 1.0
(0.588, 0.686] 1.0
(0.686, 0.784] 1.0
(0.784, 0.882] 1.0
(0.882, 0.98] 1.0
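To see where the low overall AUC of the contrived example comes from, the (+,-) pairs can be split by hand. This is a sketch added for illustration, not part of the original notebook; the names `decile`, `rates`, `correct` and `same` are introduced here. It relies on the scores already being sorted, so consecutive blocks of five coincide with the `pd.qcut` bins printed above.

```python
import numpy as np

# Toy data copied from the earlier cell.
y_pred_2 = np.arange(50) / 50
y_true_2 = np.array(
    [0, 1, 1, 1, 1] * 3
    + [0, 0, 1, 1, 1] * 2
    + [0, 0, 0, 1, 1] * 2
    + [0, 0, 0, 0, 1] * 3
)
# Scores are sorted, so blocks of five consecutive points are the deciles.
decile = np.arange(50) // 5

# Positive rate per decile: it falls as the score rises.
rates = y_true_2.reshape(10, 5).mean(axis=1)
print(rates)

pos, neg = y_true_2 == 1, y_true_2 == 0
# True where the positive is scored above the negative (there are no ties here).
correct = y_pred_2[pos][:, None] > y_pred_2[neg][None, :]
# True where both members of the pair fall in the same decile.
same = decile[pos][:, None] == decile[neg][None, :]

print("overall:", correct.mean())
print("intra:  ", correct[same].mean(), "over", same.sum(), "pairs")
print("inter:  ", correct[~same].mean(), "over", (~same).sum(), "pairs")
```

Only 48 of the 625 (+,-) pairs are intra-decile, and all of them are ordered correctly. The other 577 are inter-decile and are ordered correctly only about 21% of the time, which drags the overall AUC down to 0.2704.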


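A quick aside: with $n$ observations split into deciles of $n/10$ each, there are $10\cdot\binom{n/10}{2}\sim n^2/20$ intra-decile pairs and $\binom{10}{2}(n/10)^2 = 9n^2/20$ inter-decile pairs, so only about one pair in ten is intra-decile.  (These count all pairs of observations, not only (+,-) pairs.)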
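The counts in this aside are easy to verify numerically. The sketch below uses only the standard library; `pair_counts` is a hypothetical helper, and $n = 1250$ is chosen on the assumption that the test set is the default 25% split of the 5000 samples generated earlier.

```python
from math import comb


def pair_counts(n, k=10):
    """Count same-bin and cross-bin pairs when n items fill k equal bins.

    Hypothetical helper for illustration; assumes n is divisible by k.
    """
    m = n // k                      # items per bin
    intra = k * comb(m, 2)          # pairs drawn from within a single bin
    inter = comb(k, 2) * m * m      # pairs drawn from two different bins
    return intra, inter


n = 1250  # assumed test-set size: 25% of 5000
intra, inter = pair_counts(n)
share = intra / (intra + inter)
print(intra, inter, round(share, 4))
```

The two counts add up to $\binom{n}{2}$, and the intra-decile share comes out just under 10%, in line with the $n^2/20$ versus $9n^2/20$ approximation.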