# Diversity Quality (DQ) score

The Diversity Quality (DQ) score measures how close the diversity on the in-distribution (ID) and out-of-distribution (OOD) datasets is to the ideal. On the ID set, diversity should be small; on the OOD set, it should be large. Writing IDD for the ID diversity and OODD for the OOD diversity, the DQ score is the harmonic mean of (1 - IDD) and OODD. The $DQ_1$-score is calculated as follows:

$$DQ_1 = 2 \cdot \frac{(1 - IDD) \cdot OODD}{(1 - IDD) + OODD}$$

The $DQ_1$-score generalizes to a $DQ_\beta$-score, which values one of ID diversity or OOD diversity more than the other.
For a positive real factor $\beta$, OODD is considered $\beta$ times as important as IDD:

$$DQ_\beta = (1 + \beta^2) \cdot \frac{(1 - IDD) \cdot OODD}{\beta^2 \cdot (1 - IDD) + OODD}$$
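
The formula can be sketched in a few lines of plain Python. This is only an illustration of the equation above, not the implementation used by `reject`; the function name `dq_beta` and the zero-denominator convention are assumptions made for this sketch.

```python
def dq_beta(idd: float, oodd: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of (1 - idd) and oodd; beta weights oodd.

    Hypothetical helper for illustration, not part of the reject API.
    """
    if beta <= 0:
        raise ValueError("beta must be a positive real number")
    id_term = 1.0 - idd  # large when ID diversity is small
    denominator = beta**2 * id_term + oodd
    if denominator == 0.0:
        # Assumption: both terms are zero, so the score is defined as 0.
        return 0.0
    return (1.0 + beta**2) * id_term * oodd / denominator


# Illustrative inputs, chosen by hand.
print(dq_beta(0.2, 0.8))            # both terms are 0.8, so DQ_1 is about 0.8
print(dq_beta(0.5, 1.0, beta=2.0))  # OODD weighted more heavily
print(dq_beta(0.5, 1.0, beta=0.5))  # (1 - IDD) weighted more heavily
```

With `beta=1.0` the expression reduces to the $DQ_1$ formula, since both weights become equal.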
	

In [1]:
import reject
from reject.utils import generate_synthetic_output
from reject.diversity import diversity_quality_score, diversity_score

In [2]:
print(reject.__version__)

0.2.0


## Generate synthetic NN output

In this example, we generate synthetic outputs of a neural network (NN) with multiple samples of the predictive distribution. The output predictions have shape `(n_observations, n_samples, n_classes)` and the true labels have shape `(n_observations,)`. The data generation function uses 10 output classes.

In [3]:
NUM_SAMPLES = 10
NUM_OBSERVATIONS = 1000

(y_pred_id, y_true_id), (y_pred_ood, y_true_ood) = generate_synthetic_output(
    NUM_SAMPLES, NUM_OBSERVATIONS, concat=False
)
print(y_pred_id.shape, y_true_id.shape)
print(y_pred_ood.shape, y_true_ood.shape)

(1000, 10, 10) (1000,)
(1000, 10, 10) (1000,)


## Diversity score

We first calculate the diversity scores on the ID and OOD sets. The diversity score is calculated as the fraction of test data points on which predictions of ensemble members disagree. As the base model for diversity computation, we average the output distributions over the members and determine the resulting predicted label.

The `diversity_score` function takes the predictions directly. You can choose to get the diversity for each member or the average diversity.
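
To make the description concrete, here is a minimal NumPy sketch of one way to compute such a score. It is a reading of the paragraph above, not the `reject` source code: the name `diversity_sketch` is hypothetical, and it assumes "disagreement" means that a member's predicted label differs from the label of the averaged distribution.

```python
import numpy as np


def diversity_sketch(y_pred: np.ndarray, average: bool = False):
    """Fraction of observations where a member disagrees with the base label.

    y_pred has shape (n_observations, n_samples, n_classes).
    Hypothetical helper for illustration, not part of the reject API.
    """
    # Base model: average the distributions over members, then take the argmax.
    base_label = y_pred.mean(axis=1).argmax(axis=-1)  # (n_observations,)
    # Label predicted by each individual member.
    member_label = y_pred.argmax(axis=-1)  # (n_observations, n_samples)
    # Disagreement with the base label, averaged over observations.
    disagree = member_label != base_label[:, None]
    per_member = disagree.mean(axis=0)  # (n_samples,)
    return float(per_member.mean()) if average else per_member


# Toy input: 3 observations, 2 members, 2 classes.
toy = np.array([
    [[0.9, 0.1], [0.8, 0.2]],  # both members predict class 0
    [[0.9, 0.1], [0.4, 0.6]],  # members split; the mean favours class 0
    [[0.2, 0.8], [0.3, 0.7]],  # both members predict class 1
])
print(diversity_sketch(toy))                # one value per member
print(diversity_sketch(toy, average=True))  # single averaged value
```

In the toy input, only the second member deviates from the base label, and only on the second observation, so its diversity is 1/3 while the first member's is 0.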

In [4]:
# ID set - diversity for each member
div_score = diversity_score(y_pred=y_pred_id, average=False)
print(div_score)

# ID set - average diversity
div_score = diversity_score(y_pred=y_pred_id, average=True)
print(div_score)

[0.437 0.436 0.451 0.428 0.439 0.416 0.431 0.434 0.422 0.441]
0.4335000000000001


In [5]:
# OOD set - diversity for each member
div_score = diversity_score(y_pred=y_pred_ood, average=False)
print(div_score)

# OOD set - average diversity
div_score = diversity_score(y_pred=y_pred_ood, average=True)
print(div_score)

[0.667 0.671 0.655 0.653 0.641 0.644 0.698 0.665 0.662 0.662]
0.6618


We observe that the diversity scores on the OOD set are higher than the diversity scores on the ID set. This is expected and is desired in real-life applications.

## $DQ_1$-score

Based on the ID and OOD diversities, we calculate the $DQ_1$-score.

The `diversity_quality_score` function takes the ID and OOD predictions directly. You can choose to get the diversity quality for each member or the average diversity quality. By default, the $DQ_1$-score is calculated, which gives equal weight to the ID and OOD diversity.

In [6]:
# diversity quality for each member
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, average=False
)
print(dq_score)

# average diversity quality
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, average=True
)
print(dq_score)

[0.61060325 0.61286478 0.59733389 0.60982204 0.59833777 0.6125342
 0.62693291 0.61151909 0.61715484 0.60615561]
0.6104529837987462
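
As a sanity check, plugging the average diversities printed earlier (IDD $\approx 0.4335$, OODD $= 0.6618$) into the $DQ_1$ formula reproduces the average score:

$$DQ_1 = 2 \cdot \frac{(1 - 0.4335) \cdot 0.6618}{(1 - 0.4335) + 0.6618} = \frac{0.7498}{1.2283} \approx 0.6105$$

The plain mean of the ten per-member values above is slightly lower (about 0.6103). This suggests, although the output alone does not confirm it, that `average=True` applies the formula to the averaged diversities rather than averaging the per-member scores.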


## $DQ_\beta$-score

By adjusting the `beta_ood` parameter, we can assign a higher weight to either the ID or the OOD diversity. For example, with `beta_ood=2`, the OOD diversity is considered twice as important as the ID diversity. Conversely, with `beta_ood=0.5`, the ID diversity is considered twice as important as the OOD diversity.

In [7]:
# beta = 2.0
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, beta_ood=2.0, average=True
)
print(dq_score)

# beta = 0.5
dq_score = diversity_quality_score(
    y_pred_id=y_pred_id, y_pred_ood=y_pred_ood, beta_ood=0.5, average=True
)
print(dq_score)

0.6402583851355967
0.5832991567352271
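
To see why the two values move in opposite directions, the $DQ_\beta$ formula can be rewritten in reciprocal form, which makes the weights explicit:

$$\frac{1}{DQ_\beta} = \frac{1}{1 + \beta^2} \cdot \frac{1}{1 - IDD} + \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{OODD}$$

With the average diversities from this example, $1 - IDD \approx 0.5665$ and $OODD = 0.6618$. For $\beta = 2$, the OOD term carries weight $4/5$, so the score (0.6403) moves toward 0.6618. For $\beta = 0.5$, the ID term carries weight $4/5$, so the score (0.5833) moves toward 0.5665.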
