[Update April 2021] Check out a recent paper with a fast, efficient implementation of SA: https://github.com/testingautomated-usi/surprise-adequacy. Big thanks to the authors! 😃
Code release of the paper "Guiding Deep Learning System Testing using Surprise Adequacy".
If you find this paper helpful, please consider citing it:
```
@inproceedings{Kim2019aa,
  Author = {Jinhan Kim and Robert Feldt and Shin Yoo},
  Booktitle = {Proceedings of the 41st International Conference on Software Engineering},
  Pages = {1039--1049},
  Publisher = {IEEE Press},
  Series = {ICSE 2019},
  Title = {Guiding Deep Learning System Testing using Surprise Adequacy},
  Year = {2019}
}
```
This archive includes code for computing Surprise Adequacy (SA) and Surprise Coverage (SC), which are the basic components of the main experiments in the paper. Currently, the `run.py` script contains a simple example that calculates the SA and SC of a test set and of an adversarial set generated using the FGSM method for the MNIST dataset, considering only the last hidden layer (`activation_3`). Layer selection can be changed by modifying `layer_names` in `run.py`.
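As background, the paper defines LSA as the negative log of a kernel density estimate of a new input's activation trace (AT) evaluated against the training ATs. The following is a minimal, numpy-only sketch of that idea; it is not the repository's `sa.py` implementation, and the fixed `bandwidth` here is an arbitrary stand-in for a proper KDE bandwidth rule.

```python
import numpy as np

def lsa(train_ats, test_at, bandwidth=1.0):
    """Likelihood-based SA (sketch): -log of a Gaussian KDE of the test
    activation trace, estimated from the training activation traces."""
    d = train_ats.shape[1]
    sq = np.sum((train_ats - test_at) ** 2, axis=1) / bandwidth ** 2
    norm = (2 * np.pi) ** (d / 2) * bandwidth ** d
    density = np.mean(np.exp(-0.5 * sq)) / norm
    return -np.log(density + 1e-30)  # guard against log(0)

rng = np.random.default_rng(0)
train_ats = rng.normal(size=(500, 3))       # stand-in activation traces
near = lsa(train_ats, np.zeros(3))          # inside the training density
far = lsa(train_ats, np.full(3, 5.0))       # far away: more "surprising"
```

Inputs that lie far from the training distribution receive a larger LSA, i.e. `far > near` in the toy example above.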
- `run.py` - Script computing SA with a benign dataset and adversarial examples (MNIST and CIFAR-10).
- `sa.py` - Tools that fetch activation traces and compute LSA, DSA, and coverage.
- `train_model.py` - Model training script for MNIST and CIFAR-10. It keeps the trained models in the `model` directory (code from Ma et al.).
- `model` directory - Used for saving models.
- `tmp` directory - Used for saving activation traces and prediction arrays.
- `adv` directory - Used for saving adversarial examples.
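For orientation, the DSA computed by `sa.py` follows the paper's definition: the distance from an input's AT to the nearest training AT of its predicted class, divided by the distance from that neighbour to the nearest AT of any other class. A self-contained sketch of that definition (not the repository code):

```python
import numpy as np

def dsa(train_ats, train_labels, at, pred_label):
    """Distance-based SA (sketch): dist(at, nearest same-class AT) /
    dist(that neighbour, nearest other-class AT)."""
    same = train_ats[train_labels == pred_label]
    other = train_ats[train_labels != pred_label]
    d_same = np.linalg.norm(same - at, axis=1)
    x_a = same[np.argmin(d_same)]          # closest same-class AT
    dist_a = d_same.min()
    dist_b = np.linalg.norm(other - x_a, axis=1).min()
    return dist_a / dist_b

# Toy ATs: class 0 near the origin, class 1 near (5, 5).
train_ats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
boundary = dsa(train_ats, labels, np.array([2.5, 2.5]), 0)  # near boundary
typical = dsa(train_ats, labels, np.array([0.0, 0.1]), 0)   # deep in class 0
```

Inputs near a class boundary get a higher DSA than inputs deep inside their class region, which is what makes DSA useful for classification tasks.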
- `-d` - The subject dataset (either `mnist` or `cifar`). Default is `mnist`.
- `-lsa` - If set, computes LSA.
- `-dsa` - If set, computes DSA.
- `-target` - The name of the target input set. Default is `fgsm`.
- `-save_path` - The temporary save path of AT files. Default is the `tmp` directory.
- `-batch_size` - Batch size. Default is 128.
- `-var_threshold` - Variance threshold. Default is 1e-5.
- `-upper_bound` - Upper bound of SA. Default is 2000.
- `-n_bucket` - The number of buckets for coverage. Default is 1000.
- `-num_classes` - The number of classes in the dataset. Default is 10.
- `-is_classification` - Set if the task is a classification problem. Default is True.
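The `-upper_bound` and `-n_bucket` options correspond to how Surprise Coverage is computed: the range [0, upper_bound] is divided into equal-width buckets, and SC is the fraction of buckets hit by at least one input's SA value. A minimal sketch of that bucketing (hypothetical helper, not the repository's code):

```python
import numpy as np

def surprise_coverage(sa_values, upper_bound=2000.0, n_buckets=1000):
    """Fraction of equal-width SA buckets in [0, upper_bound] covered
    by at least one input's SA value."""
    clipped = np.clip(sa_values, 0.0, upper_bound)
    hist, _ = np.histogram(clipped, bins=n_buckets, range=(0.0, upper_bound))
    return np.count_nonzero(hist) / n_buckets

# Three SA values over four buckets of [0, 2]: two buckets are covered.
sc = surprise_coverage([0.5, 1.5, 1.6], upper_bound=2.0, n_buckets=4)
```

Raising `upper_bound` with the same inputs stretches the buckets over a wider range, which is why coverage varies with the chosen upper bound.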
We used the framework by Ma et al. to generate various adversarial examples (FGSM, BIM-A, BIM-B, JSMA, and C&W). Please refer to `craft_adv_samples.py` in the above repository of Ma et al., and put the generated examples in the `adv` directory. As a basic usage example, an adversarial set generated by the FGSM method for MNIST is included (see `./adv/adv_mnist_fgsm.npy`).
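For reference, FGSM itself is a single perturbation step along the sign of the loss gradient with respect to the input. The toy sketch below assumes the gradient is already available and uses the [-0.5, 0.5] pixel range this code works in; actual adversarial sets for this repo should still be produced with Ma et al.'s `craft_adv_samples.py`.

```python
import numpy as np

def fgsm_step(x, loss_grad, eps=0.1):
    """One FGSM step: move each pixel by eps in the direction that
    increases the loss, then clip back to the model's input range."""
    return np.clip(x + eps * np.sign(loss_grad), -0.5, 0.5)

x = np.zeros(4)                           # toy flattened "image"
grad = np.array([1.0, -1.0, 0.5, -2.0])   # pretend loss gradient
x_adv = fgsm_step(x, grad, eps=0.1)
```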
To reproduce the results of the Udacity self-driving car challenge, please refer to the DeepXplore and DeepTest repositories, which contain information about the dataset, the models (Dave-2, Chauffeur), and the synthetic data generation processes. Getting the dataset and models may take a few hours due to their sizes.
Our implementation is based on Python 3.5.2, TensorFlow 1.9.0, Keras 2.2, and NumPy 1.14.5. Details are listed in `requirements.txt`.
This is a simple example of installing the dependencies and computing the LSA or DSA of a test set and an FGSM adversarial set on the MNIST dataset.
```
# install Python dependencies
pip install -r requirements.txt

# train a model
python train_model.py -d mnist

# calculate LSA, coverage, and ROC-AUC score
python run.py -lsa

# calculate DSA, coverage, and ROC-AUC score
python run.py -dsa
```
- If you encounter a `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` error, you need to increase the variance threshold. Please refer to the configuration details in the paper (Section IV-C).
- Images were preprocessed by clipping their pixel values to between -0.5 and 0.5.
- If you want to select specific layers, you can modify the `layer_names` array in `run.py`.
- Coverage may vary depending on the upper bound.
- For a speed-up, use GPU-based TensorFlow.
- All experimental results
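The pixel-range note above can be reproduced with a one-line preprocessing step. This is a sketch of the convention, assuming raw 8-bit images; the exact preprocessing lives in the repository's data-loading code.

```python
import numpy as np

def preprocess(images):
    """Scale raw 8-bit pixels to [-0.5, 0.5], matching the clipping
    convention noted above (assumed helper, not repo code)."""
    return images.astype("float32") / 255.0 - 0.5

x = preprocess(np.array([[0, 128, 255]], dtype=np.uint8))
```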