This repository contains code to reproduce results from the paper:
Ensemble Adversarial Training: Attacks and Defenses
Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh and Patrick McDaniel
ArXiv report: https://arxiv.org/abs/1705.07204
The code was tested with Python 2.7.12, Tensorflow 1.0.1 and Keras 1.2.2.
We start by training a few simple MNIST models. These are described in mnist.py.
python -m train models/modelA --type=0
python -m train models/modelB --type=1
python -m train models/modelC --type=2
python -m train models/modelD --type=3
Then, we can use (standard) Adversarial Training or Ensemble Adversarial Training (we train for either 6 or 12 epochs in the paper). With Ensemble Adversarial Training, we additionally augment the training data with adversarial examples crafted from external pre-trained models (models A, C and D here):
python -m train_adv models/modelA_adv --type=0 --epochs=12
python -m train_adv models/modelA_ens models/modelA models/modelC models/modelD --type=0 --epochs=12
The accuracy of the models on the MNIST test set can be computed using
python -m simple_eval test [model(s)]
To evaluate robustness to various attacks, we use
python -m simple_eval [attack] [source_model] [target_model(s)] [--parameters (opt)]
The attack can be:
Attack | Description | Parameters |
---|---|---|
fgs | Standard FGSM | eps (the norm of the perturbation) |
rand_fgs | Our FGSM variant that prepends the gradient computation by a random step | eps (the norm of the total perturbation); alpha (the norm of the random perturbation) |
ifgs | The iterative FGSM | eps (the norm of the perturbation); steps (the number of iterative FGSM steps) |
CW | The Carlini and Wagner attack | eps (the norm of the perturbation); kappa (attack confidence) |
Note that due to GPU non-determinism, the obtained results may vary by a few percent compared to those reported in the paper. Nevertheless, we consistently observe the following:
- Standard Adversarial Training performs worse on transferred FGSM examples than on a "direct" FGSM attack on the model due to a gradient masking effect.
- Our RAND+FGSM attack outperforms the FGSM when applied to any model. The gap is particularly pronounced for the adversarially trained model.
- Ensemble Adversarial Training is more robust than (standard) adversarial training to transferred examples computed using any of the attacks above.
Questions and suggestions can be sent to tramer@cs.stanford.edu