DOCTOR aims to identify whether the prediction of a classifier should or should not be trusted, so that the prediction can be accepted or rejected accordingly.
The results in the tables below are reported as AUROC% / FRR% (at 95% TRR).
1- Totally-Black-Box (TBB)
Dataset | D_alpha | D_beta | SR | MHLNB |
---|---|---|---|---|
CIFAR10 | 94 / 17.9 | 68.5 / 18.6 | 93.8 / 18.2 | 92.2 / 30.8 |
CIFAR100 | 87 / 40.6 | 84.2 / 40.6 | 86.9 / 40.5 | 82.6 / 66.7 |
TinyImageNet | 84.9 / 45.8 | 84.9 / 45.8 | 84.9 / 45.8 | 78.4 / 82.3 |
SVHN | 92.3 / 38.6 | 92.2 / 39.7 | 92.3 / 38.6 | 87.3 / 85.8 |
Amazon_Fashion | 89.7 / 27.1 | 89.7 / 26.3 | 87.4 / 50.1 | - / - |
Amazon_Software | 68.8 / 73.2 | 68.8 / 73.2 | 67.3 / 86.6 | - / - |
IMDb | 84.4 / 54.2 | 84.4 / 54.4 | 83.7 / 61.7 | - / - |
2- Partially-Black-Box (PBB)
Dataset | D_alpha | D_beta | ODIN | MHLNB |
---|---|---|---|---|
CIFAR10 | 95.2 / 13.9 | 94.8 / 13.4 | 94.2 / 18.4 | 84.4 / 44.6 |
CIFAR100 | 88.2 / 35.7 | 87.4 / 36.7 | 87.1 / 40.7 | 50 / 94 |
TinyImageNet | 86.1 / 43.3 | 85.3 / 45.1 | 84.9 / 45.3 | 59 / 86 |
SVHN | 93 / 36.6 | 92.8 / 38.4 | 92.3 / 40.7 | 88 / 54.7 |
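The two metrics used in the tables above can be sketched in a few lines. This is a minimal illustration, not the code in tests/compute_FRR_vs_TRR.py. It assumes that a higher score means "more likely misclassified", that TRR is the fraction of misclassified samples that are rejected, and that FRR is the fraction of correctly classified samples that are rejected. The function names `auroc` and `frr_at_trr` are hypothetical.

```python
def _split(scores, is_error):
    """Separate scores of misclassified samples from correctly classified ones."""
    errors = [s for s, e in zip(scores, is_error) if e]
    correct = [s for s, e in zip(scores, is_error) if not e]
    if not errors or not correct:
        raise ValueError("need at least one misclassified and one correct sample")
    return errors, correct


def auroc(scores, is_error):
    """Probability that a misclassified sample scores higher than a correct one.

    Ties count as one half (Mann-Whitney formulation of the AUROC).
    """
    errors, correct = _split(scores, is_error)
    wins = sum((e > c) + 0.5 * (e == c) for e in errors for c in correct)
    return wins / (len(errors) * len(correct))


def frr_at_trr(scores, is_error, target_trr=0.95):
    """Smallest FRR among thresholds whose TRR reaches target_trr.

    A sample is rejected when its score is >= the threshold (assumption).
    """
    errors, correct = _split(scores, is_error)
    best = 1.0  # the lowest threshold rejects everything: TRR = FRR = 1
    for threshold in sorted(set(scores)):
        trr = sum(e >= threshold for e in errors) / len(errors)
        frr = sum(c >= threshold for c in correct) / len(correct)
        if trr >= target_trr:
            best = min(best, frr)
    return best


if __name__ == "__main__":
    toy_scores = [0.9, 0.3, 0.7, 0.2]
    toy_is_error = [True, True, False, False]
    print("AUROC:", auroc(toy_scores, toy_is_error))
    print("FRR at 95% TRR:", frr_at_trr(toy_scores, toy_is_error))
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a toy check but too slow for full test sets; a rank-based implementation would be the usual replacement.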
Package
├── data
├── datasets
├── lib_discriminators
│ ├── discriminators.py
├── models
│ └── sigmoid_nn.py
├── mystat
│ └── statistics.py
├── plots
├── tests
│ ├── compute_FRR_vs_TRR.py
│ └── test_FRR_vs_TRR.py
├── utils
│ ├── GUI_tools.py
│ ├── dataset_utils.py
│ ├── files_utils.py
│ ├── var_utils.py
│ └── plot_utils.py
├── main.py
├── test_wrapper.py
├── README.md
└── requirements.txt
- T_tbb: temperature scaling in TBB (same for SR)
- eps_tbb: perturbation magnitude in TBB (same for SR)
- T_alpha: temperature scaling in PBB for D_alpha
- eps_alpha: perturbation magnitude in PBB for D_alpha
- T_beta: temperature scaling in PBB for D_beta
- eps_beta: perturbation magnitude in PBB for D_beta
- T_odin: temperature scaling in PBB for ODIN
- eps_odin: perturbation magnitude in PBB for ODIN
- T_mhlnb: temperature scaling in PBB for Mahalanobis
- eps_mhlnb: perturbation magnitude in PBB for Mahalanobis
Name | T_tbb | eps_tbb | T_alpha | eps_alpha | T_beta | eps_beta | T_odin | eps_odin | T_mhlnb | eps_mhlnb |
---|---|---|---|---|---|---|---|---|---|---|
CIFAR10 | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
CIFAR100 | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
TinyImageNet | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
SVHN | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
Amazon_Fashion | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
Amazon_Software | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
IMDb | 1 | 0 | 1 | 0.00035 | 1.5 | 0.00035 | 1.3 | 0 | 1 | 0.0002 |
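For reference, temperature scaling is commonly implemented by dividing the logits by T before applying the softmax. The sketch below assumes that standard definition; the repository's own implementation is not shown in this README, and the function name is hypothetical. The input perturbation step (the eps_* columns) is omitted because it needs gradients of the model.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities.

    T > 1 flattens the distribution; T = 1 is the ordinary softmax.
    """
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]  # made-up values for illustration
    for T in (1.0, 1.3, 1.5):  # temperatures that appear in the table above
        print(T, softmax_with_temperature(example_logits, T))
```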
DOCTOR requires the predictions for a given dataset to be in the following format. Example on CIFAR10:
- 1,...,10: softmax probability associated to the corresponding class
- label: predicted class
- true_label: true class
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | label | true_label |
---|---|---|---|---|---|---|---|---|---|---|---|
0.02 | 0.01 | 0.04 | 0.01 | 0.005 | 0.005 | 0.9 | 0.006 | 0.002 | 0.002 | 7 | 7 |
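A file in this format can be read and scored with the standard library alone. The sketch below reuses the example row above as an in-memory CSV. The score formulas are an assumption based on the definitions in the DOCTOR paper (D_alpha built on the sum of squared softmax probabilities, D_beta on the largest softmax probability); they are not stated in this README and should be checked against lib_discriminators/discriminators.py. The function names are hypothetical.

```python
import csv
import io

# The example row from the table above, written as CSV text.
SAMPLE_CSV = """1,2,3,4,5,6,7,8,9,10,label,true_label
0.02,0.01,0.04,0.01,0.005,0.005,0.9,0.006,0.002,0.002,7,7
"""


def read_predictions(handle, num_classes=10):
    """Yield (probabilities, predicted label, true label) for each CSV row."""
    for record in csv.DictReader(handle):
        probs = [float(record[str(k)]) for k in range(1, num_classes + 1)]
        yield probs, int(record["label"]), int(record["true_label"])


def doctor_scores(probs):
    """Return (D_alpha score, D_beta score); larger means less trustworthy.

    Assumed definitions:
      g      = sum of squared probabilities, D_alpha score = (1 - g) / g
      p_err  = 1 - max probability,          D_beta score  = p_err / (1 - p_err)
    """
    g = sum(p * p for p in probs)
    p_err = 1.0 - max(probs)
    return (1.0 - g) / g, p_err / (1.0 - p_err)


if __name__ == "__main__":
    for probs, label, true_label in read_predictions(io.StringIO(SAMPLE_CSV)):
        # Columns are 1-indexed, so the predicted label is argmax + 1.
        assert label == probs.index(max(probs)) + 1
        d_alpha, d_beta = doctor_scores(probs)
        print(f"label={label} true={true_label} "
              f"alpha={d_alpha:.4f} beta={d_beta:.4f}")
```

Under these assumed definitions, a prediction would be rejected when its score exceeds a chosen threshold; sweeping that threshold traces the FRR versus TRR curve.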
Dataframes are stored in the corresponding directories. For CIFAR10:
├── data
│ ├── cifar10_T_1_eps_0_test.csv
│ ├── cifar10_T_1_eps_0_train.csv
│ └── cifar10_T_1_eps_0_train_logits.csv
├── data_perturb
│ └── cifar10_T_1.3_eps_0_pt_odin_test.csv
├── data_perturb_our
│ ├── cifar10_T_1.5_eps_0.00035_pt_beta_test.csv
│ ├── cifar10_T_1_eps_0.0002_pt_mahalanobis_test_logits.csv
│ └── cifar10_T_1_eps_0.00035_pt_alpha_test.csv
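The file names above appear to follow one pattern: dataset, temperature, perturbation magnitude, an optional `pt_<method>` tag, the split, and an optional `logits` suffix. The helper below is a hypothetical sketch that reproduces the names listed in the tree; the pattern is inferred from those names, not taken from the repository code.

```python
def prediction_filename(dataset, T, eps, split, method=None, logits=False):
    """Build a CSV file name following the pattern seen in the tree above.

    Inferred pattern (assumption):
        <dataset>_T_<T>_eps_<eps>[_pt_<method>]_<split>[_logits].csv
    """
    # The "g" format prints 1 as "1", 1.5 as "1.5", and 0.00035 as "0.00035".
    parts = [dataset, "T", f"{T:g}", "eps", f"{eps:g}"]
    if method is not None:
        parts += ["pt", method]
    parts.append(split)
    if logits:
        parts.append("logits")
    return "_".join(parts) + ".csv"


if __name__ == "__main__":
    print(prediction_filename("cifar10", 1, 0, "test"))
    print(prediction_filename("cifar10", 1.5, 0.00035, "test", method="beta"))
    print(prediction_filename("cifar10", 1, 0.0002, "test",
                              method="mahalanobis", logits=True))
```

Note that the "g" format switches to scientific notation for values below 1e-4, so smaller perturbation magnitudes would need explicit formatting.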
A clean execution of DOCTOR is in:
tests/test_FRR_vs_TRR.py
To execute it:
- Create the environment for DOCTOR:
foo@bar:~$ conda create --name doctor python=3.8
- Activate the environment for DOCTOR:
foo@bar:~$ source activate doctor
- Install all the required packages:
(doctor) foo@bar:~$ pip install -r requirements.txt
- Launch the test from CLI for CIFAR10:
(doctor) foo@bar:~$ python main.py -d_name cifar10 -sc tbb
(doctor) foo@bar:~$ python main.py -d_name cifar10 -sc pbb
Output:
(doctor) foo@bar:~$ python main.py -d_name cifar10 -sc pbb -ood
ALPHA: AUROC 95.2 % --- FRR (95% TRR) 13.9 %
BETA: AUROC 94.8 % --- FRR (95% TRR) 13.4 %
ODIN: AUROC 94.2 % --- FRR (95% TRR) 18.4 %
MAHALANOBIS: AUROC 84.4 % --- FRR (95% TRR) 44.6 %
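If the printed results need to be collected across runs, the lines above can be parsed into numbers. The sketch below is a hypothetical helper, not part of the package; it assumes the output keeps exactly the line layout shown above and silently skips anything that does not match.

```python
import re

# The four result lines shown above, copied verbatim.
SAMPLE_OUTPUT = """\
ALPHA: AUROC 95.2 % --- FRR (95% TRR) 13.9 %
BETA: AUROC 94.8 % --- FRR (95% TRR) 13.4 %
ODIN: AUROC 94.2 % --- FRR (95% TRR) 18.4 %
MAHALANOBIS: AUROC 84.4 % --- FRR (95% TRR) 44.6 %
"""

RESULT_LINE = re.compile(
    r"^\s*(?P<name>[A-Z]+):\s+AUROC\s+(?P<auroc>\d+(?:\.\d+)?)\s*%"
    r"\s+---\s+FRR\s+\(95% TRR\)\s+(?P<frr>\d+(?:\.\d+)?)\s*%"
)


def parse_doctor_output(text):
    """Map each method name to a (AUROC %, FRR %) tuple of floats."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:  # ignore prompts, blank lines, and other output
            results[match["name"]] = (float(match["auroc"]), float(match["frr"]))
    return results


if __name__ == "__main__":
    for name, (auroc_pct, frr_pct) in parse_doctor_output(SAMPLE_OUTPUT).items():
        print(f"{name:12s} AUROC={auroc_pct:5.1f}  FRR={frr_pct:5.1f}")
```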
Plot:
Experiments with OOD samples:
(doctor) foo@bar:~$ python main.py -d_name isun_cifar10 -sc pbb -ood True
ALPHA: AUROC 95.6 % / 0.1 % --- FRR 15.1 % / 0.1 %
BETA: AUROC 95.6 % / 0.0 % --- FRR 13.6 % / 0.5 %
ODIN: AUROC 95.4 % / 0.0 % --- FRR 16.1 % / 0.2 %
ODIN (DEFAULT SETTING OF ODIN) : AUROC 93.5 % / 0.0 % --- FRR 30.6 % / 0.4 %
Note that the name of the dataset to set is out-dataset-name_in-dataset-name.csv.
Click here to download the datasets for OOD experiments.
We run each experiment on a machine equipped with an Intel(R) Xeon(R) CPU E5-2623 v4, 2.60GHz clock frequency, and a GeForce GTX 1080 Ti GPU.
We test this clean execution on a machine equipped with an Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz.