Adversarial Robustness (and Interpretability) via Gradient Regularization
This repository contains Python code and iPython notebooks used to run the experiments in Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients.
If you add an imperceptibly small amount of carefully crafted noise to an image which a neural network classifies correctly, you can usually cause it to make an incorrect prediction. This type of noise addition is called "adversarial perturbation," and the perturbed images are called adversarial examples. Unfortunately, it turns out that it's pretty easy to generate adversarial examples which (1) fool almost any model trained on the same dataset, and (2) continue to fool models even when printed out or viewed from different perspectives and at different scales. As neural networks start being used for things like face recognition and self-driving cars, this vulnerability poses an increasingly pressing problem.
In this repository, we try to tackle this problem directly, by training neural networks with a type of regularization that penalizes how sensitive their predictions are to infinitesimal changes in their inputs. This type of regularization moves examples further away from the decision boundary in input-space, and has the side-effect of making gradient-based explanations of the model -- as well as the adversarial perturbations themselves -- more human-interpretable. Check out the experiments below or the paper for more details!
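As a rough illustration of the idea (not the repository's TensorFlow implementation), the sketch below adds an input-gradient penalty to the cross-entropy loss of a linear softmax classifier in NumPy. The model, the helper names (`cross_entropy`, `input_gradient`, `regularized_loss`), the penalty weight `lam`, and the random data are all assumptions made for this example. For a linear model the input gradient has a closed form, so no automatic differentiation is needed here; for a CNN, training against this penalty requires second derivatives.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy(W, b, X, Y):
    # Per-example cross-entropy of a linear softmax model (hypothetical toy model).
    return -np.sum(Y * log_softmax(X @ W + b), axis=1)

def input_gradient(W, b, X, Y):
    # Gradient of the per-example cross-entropy with respect to the inputs X.
    # For logits = X @ W + b and one-hot Y this is (softmax(logits) - Y) @ W.T.
    P = np.exp(log_softmax(X @ W + b))
    return (P - Y) @ W.T

def regularized_loss(W, b, X, Y, lam):
    # Cross-entropy plus lam times the squared L2 norm of the input gradient.
    penalty = np.sum(input_gradient(W, b, X, Y) ** 2, axis=1)
    return np.mean(cross_entropy(W, b, X, Y) + lam * penalty)

# Toy data: 8 examples, 5 input features, 3 classes (all made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
Y = np.eye(3)[rng.integers(0, 3, size=8)]
W = rng.normal(size=(5, 3))
b = np.zeros(3)

plain = regularized_loss(W, b, X, Y, lam=0.0)
regularized = regularized_loss(W, b, X, Y, lam=0.1)
print(plain, regularized)
```

Minimizing `regularized_loss` rather than the plain cross-entropy pushes the model toward predictions that change less under small input perturbations, which is the property described above.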
notebooks/ contains iPython notebooks replicating the main experiments from the paper:
- MNIST compares robustness to two adversarial attack methods (the FGSM and TGSM) when CNNs are trained on the MNIST dataset with various forms of regularization: defensive distillation, adversarial training, and two forms of input gradient regularization. This is a good one to look at first, since it's got both the results and some textual explanation of what's going on.
- notMNIST does the same accuracy comparisons, but for the notMNIST dataset. We omit the textual explanations since they would be redundant with what's in the MNIST notebook.
- SVHN does the same for the Street View House Numbers dataset.
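For readers unfamiliar with the two attacks named above, here is a minimal NumPy sketch of single-step FGSM (fast gradient sign method) and TGSM (targeted gradient sign method) against a linear softmax classifier. It is an illustration under stated assumptions, not the code used in the notebooks: the toy model, the function names `fgsm` and `tgsm`, the step size `eps`, and the random data are hypothetical, and clipping to a valid pixel range is omitted.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy(W, b, X, Y):
    # Per-example cross-entropy of a linear softmax model (hypothetical toy model).
    return -np.sum(Y * log_softmax(X @ W + b), axis=1)

def input_gradient(W, b, X, Y):
    # Gradient of the cross-entropy with respect to X for logits = X @ W + b.
    P = np.exp(log_softmax(X @ W + b))
    return (P - Y) @ W.T

def fgsm(W, b, X, Y_true, eps):
    # Untargeted: step in the direction that increases the loss on the true label.
    return X + eps * np.sign(input_gradient(W, b, X, Y_true))

def tgsm(W, b, X, Y_target, eps):
    # Targeted: step in the direction that decreases the loss on a chosen target label.
    return X - eps * np.sign(input_gradient(W, b, X, Y_target))

# Toy data: 8 examples, 5 input features, 3 classes (all made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
labels = rng.integers(0, 3, size=8)
Y_true = np.eye(3)[labels]
Y_target = np.eye(3)[(labels + 1) % 3]  # arbitrary "wrong" target classes
W = rng.normal(size=(5, 3))
b = np.zeros(3)

eps = 0.1
X_fgsm = fgsm(W, b, X, Y_true, eps)
X_tgsm = tgsm(W, b, X, Y_target, eps)
```

Both attacks change every input feature by at most `eps`, so the perturbation stays within an L-infinity ball of that radius around the original input.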
scripts/ contains code used to train models and generate / animate adversarial examples.
cached/ contains data files with trained model parameters and adversarial examples. The actual data is gitignored, but you can download it (see instructions below).
adversarial_robustness/ contains Python code for representing neural networks and datasets, and for training / explanation / visualization / adversarial perturbation. Some of the code is strongly influenced by cleverhans and tensorflow-adversarial, but we've modified everything to be more object-oriented.
To immediately run the notebooks using the models and adversarial examples used to generate the figures in the paper, you can download this zipped directory, which should replace the cached/ subdirectory of this folder.
To fully replicate all experiments, you can use the files in the scripts directory to retrain models and regenerate adversarial examples.
This code was tested with Python 3.5 and Tensorflow >= 1.2.1. Most files should also work with Python 2.7, but training may not work with earlier versions of Tensorflow, which lack second-derivative support for many CNN operations.