A live repository following research in adversarial image generation to fool convolutional (and other) neural networks. Currently testing on MNIST.
There are several ways to generate adversarial examples for image-recognition networks (and neural networks more broadly). These methods differ in how much knowledge they assume about the model architecture and its parameters, and in the optimization techniques they use to generate perturbations. The perturbations are then added to input sample features with the goal of causing misclassification.
Much progress has been made in generating and defending against white-box attacks (for example, through techniques that obfuscate or hide network gradients). However, defending against adversarial attacks resembles an arms race: shortly after new defenses are demonstrated, new adversaries are designed in response. This makes the area highly compelling (there is much left to be done).
For this project, we start with a simple, standard white-box attack commonly seen in the literature. Our goal is a hands-on introduction to adversarial attacks on neural networks. Using the MNIST dataset, we generate perturbations that let us (the adversarial engineers) choose how a standard convolutional network classifies images. Specifically, we try to force the network to classify images of 2s as 6s.
Time permitting, we will try two further things:
- White/Black-box attacks where we are allowed to modify only a handful of pixels.
- Black-box attacks (in which the adversary does not have access to the network)
We start with a white-box method, which requires access to the network architecture for gradient calculations. This method was introduced by Goodfellow et al. in 2015 as a simple test of the linearity of neural networks. Starting from the original image, one iteratively descends along the gradient of the loss function (computed with respect to that image and the adversarial target class) by subtracting a scaled multiple of the sign of the gradient from the image. Accumulated over time steps, this subtracted quantity constitutes the additive perturbation (note that this process is essentially gradient descent).
For this experiment, it was found through trial and error that allowing each pixel of the perturbed image to differ by at most ε = 0.14 from the original was sufficient to cause consistent misclassification in the target direction. This ε is a per-pixel bound and corresponds to the l_infinity distance metric from the literature.
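The iterative update described above can be sketched framework-agnostically in NumPy. Here `loss_grad` is a stand-in for the gradient of the target-class loss with respect to the input; the function and parameter names are illustrative, not the repo's actual API.

```python
import numpy as np

def iterative_targeted_fgsm(x, loss_grad, eps=0.14, alpha=0.01, steps=50):
    """Targeted iterative FGSM: descend the target-class loss.

    x         : original image, pixel values in [0, 1]
    loss_grad : function mapping an image to dL_target/dx
    eps       : per-pixel l_infinity budget (0.14 in this experiment)
    alpha     : per-step size
    """
    x_adv = x.copy()
    for _ in range(steps):
        # Step against the gradient of the target-class loss.
        x_adv = x_adv - alpha * np.sign(loss_grad(x_adv))
        # Project back into the eps-ball around x (l_infinity constraint)...
        x_adv = np.clip(x_adv, x - eps, x + eps)
        # ...and keep pixel values valid.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

In a TensorFlow implementation, `loss_grad` would be computed by differentiating the network's loss on the target label with respect to the input image.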
Using iterative FGSM, it is straightforward to direct a standard MNIST classifier to misclassify 2s as 6s. However, this technique requires white-box access to the network and its gradients. Future work will involve black-box testing and restrictions on the number of pixels that can be perturbed.
The code for this project is written primarily in Python 3 and TensorFlow. To reproduce the results, complete the following steps:
Setup:
- Install Python 3.
- Run
  ```
  pip3 install -r requirements.txt
  ```
  from the root directory of this repository to install the necessary Python packages.
- Run
  ```
  python3 train_mnist.py
  ```
  to train a new MNIST model. (This only needs to be run once.)
Fast Gradient Sign Method:
- Run
  ```
  python3 ./experiments/iterative_fgsm.py
  ```
  This script generates adversarial examples for 10 random images of 2s and saves the results under `./results/iterative_fgsm`.
References:
- Alexey Kurakin, Ian Goodfellow, Samy Bengio. "Adversarial Machine Learning at Scale," 2016. arXiv:1611.01236.
- Nicholas Carlini, David Wagner. "Towards Evaluating the Robustness of Neural Networks," 2016. arXiv:1608.04644.