Selective Classification Can Magnify Disparities Across Groups
This repository contains the code used for the following paper:
Erik Jones*, Shiori Sagawa*, Pang Wei Koh*, Ananya Kumar, and Percy Liang
International Conference on Learning Representations (ICLR), 2021
For an executable version of our paper, check out the CodaLab Worksheet.
Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing abstentions can even decrease accuracies on some groups. To better understand this phenomenon, we study the margin distribution, which captures the model’s confidences over all predictions. For symmetric margin distributions, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. Our analysis also shows that selective classification tends to magnify full-coverage accuracy disparities. Motivated by our analysis, we train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group on these models. Altogether, our results suggest that selective classification should be used with care and underscore the importance of training models to perform equally well across groups at full coverage.
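As a rough illustration of the quantities discussed above, the sketch below computes average and per-group accuracy at a given coverage by predicting only on the most confident examples. It is not the repository's evaluation code: the function name `accuracy_at_coverage`, the toy data, and the choice to rank by a single confidence score are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_at_coverage(confidences, correct, groups, coverage):
    """Average and per-group accuracy when abstaining on the least confident examples.

    Hypothetical helper for illustration only; not taken from the repository.
    """
    n = len(confidences)
    n_keep = max(1, round(coverage * n))
    # Rank examples from most to least confident and predict only on the top n_keep.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    kept = order[:n_keep]

    hits = defaultdict(int)
    counts = defaultdict(int)
    for i in kept:
        counts[groups[i]] += 1
        hits[groups[i]] += int(correct[i])

    average = sum(hits.values()) / n_keep
    # Groups with no retained examples are left out of the per-group result.
    per_group = {g: hits[g] / counts[g] for g in counts}
    return average, per_group


# Made-up toy data: the minority group contains one confidently wrong prediction.
confidences = [0.99, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55]
correct = [True, True, True, False, True, False, True, True]
groups = ["majority", "majority", "majority", "minority",
          "majority", "majority", "minority", "minority"]

full_avg, full_groups = accuracy_at_coverage(confidences, correct, groups, 1.0)
sel_avg, sel_groups = accuracy_at_coverage(confidences, correct, groups, 0.625)
print(full_avg, full_groups)
print(sel_avg, sel_groups)
```

On this toy data, lowering coverage raises average accuracy while the minority group's accuracy drops, which is the kind of disparity the paper studies.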
We use five datasets in our experiments, four of which are available in the correct format as downloadable bundles from CodaLab: CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2020), CivilComments (Borkan et al., 2019), and MultiNLI (Williams et al., 2018). We additionally use a modified version of CheXpert (Irvin et al., 2019), called CheXpert-device, where we subsample to enforce a spurious correlation between the presence of pleural effusion and the presence of a support device. The splits we use are available on CodaLab.
There are two main steps in reproducing the results of our paper:
- Training models and saving predictions
- Plotting results based on saved predictions
The code for the first step is stored in src, and is heavily based on this code.
As an example, consider the following command:
python3 src/run_expt.py -d Waterbirds -t waterbird_complete95 -c forest2water2 --lr 0.001 --batch_size 128 --weight_decay 0.0001 --model resnet50 --n_epochs 300 --data_dir waterbirds --log_dir ./preds --save_preds --save_step 1000 --log_every 1000
Here, replace the --data_dir argument with the location of the Waterbirds bundle downloaded from CodaLab. The -d argument specifies the dataset, -t specifies the label, and -c specifies the name of the confounder; the predictions on the test set are stored in the directory given by --log_dir. This command trains with ERM; to train with DRO instead, add --robust --alpha 0.01 --gamma 0.1 --generalization_adjustment 1, and to add dropout, add --mc_dropout.
Commands to train each model for each dataset are available on CodaLab, except for CheXpert. For CheXpert, request access from the bottom of this page. Then run:
python3 src/run_expt.py -s confounder -d CheXpert -t Pleural_Effusion -c Support_Devices --lr 0.001 --batch_size 16 --weight_decay 0 --model densenet121 --n_epochs 4 --log_every 10000 --log_dir ./preds --data_dir CheXpert-v1.0-small
Here, CheXpert-v1.0-small is the folder containing the small version of the downloaded CheXpert dataset. You will first need to filter the downloaded metadata.csv file so that it only contains entries whose Path appears in the chexpert_paths.csv file available on CodaLab. Ensure that the split column from chexpert_paths.csv also carries over.
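One way to do this filtering is sketched below using only the Python standard library. It assumes that both files have a Path column with identically formatted values and that chexpert_paths.csv has a split column, as described above. The function name `filter_metadata`, the output file name, and the demonstration rows are hypothetical.

```python
import csv
import os
import tempfile


def filter_metadata(metadata_csv, paths_csv, output_csv):
    """Keep metadata rows whose Path is listed in paths_csv and copy over its split."""
    with open(paths_csv, newline="") as f:
        split_by_path = {row["Path"]: row["split"] for row in csv.DictReader(f)}

    with open(metadata_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames)
        if "split" not in fieldnames:
            fieldnames.append("split")
        kept = []
        for row in reader:
            if row["Path"] in split_by_path:
                # Take the split assignment from chexpert_paths.csv.
                row["split"] = split_by_path[row["Path"]]
                kept.append(row)

    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


# Tiny demonstration on made-up rows written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    meta = os.path.join(tmp, "metadata.csv")
    paths = os.path.join(tmp, "chexpert_paths.csv")
    out = os.path.join(tmp, "metadata_filtered.csv")
    with open(meta, "w", newline="") as f:
        f.write("Path,Pleural_Effusion\na.jpg,1\nb.jpg,0\nc.jpg,1\n")
    with open(paths, "w", newline="") as f:
        f.write("Path,split\na.jpg,train\nc.jpg,test\n")
    n_kept = filter_metadata(meta, paths, out)
    with open(out, newline="") as f:
        filtered_rows = list(csv.DictReader(f))
```

Writing to a separate file first makes it possible to inspect the result before replacing the original metadata.csv in the data directory.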
Next, given saved models, we compute and plot the accuracy-coverage curves, along with the group-agnostic reference, the Robin Hood reference, and the margin distributions. To do so, ensure the saved preds folder from the previous step, for each dataset for ERM, ERM with dropout, and MC-Dropout, are stored in dataset-MC respectively. Then, to plot figures, run:
python3 eval/process_preds.py --opt ERM --datasets CelebA Waterbirds CheXpert-device CivilComments MultiNLI
Feel free to remove and reorder datasets depending on the desired figures, and replace
MC to generate plots for models trained with DRO and MC-dropout based confidences respectively. The output is stored in