Synthesizing [adversarial examples](https://arxiv.org/abs/1312.6199) for neural networks is surprisingly easy: small, carefully crafted perturbations to an input can cause a neural network to misclassify it in arbitrarily chosen ways. Given that adversarial examples [transfer to the physical world](https://arxiv.org/abs/1607.02533) and [can be made extremely robust](https://blog.openai.com/robust-adversarial-inputs/), this is a real security concern.

In this notebook, we'll give a brief introduction to algorithms for synthesizing adversarial examples, and we'll walk through the process of implementing attacks in [TensorFlow](https://www.tensorflow.org/), building up to synthesizing a robust adversarial example following [this technique](https://arxiv.org/abs/1707.07397).

# Setup

We'll attack an [Inception v3](https://arxiv.org/abs/1512.00567) network trained on [ImageNet](http://www.image-net.org/). In this section, we load a pre-trained network from the [TF-slim image classification library](https://github.com/tensorflow/models/tree/master/slim). This part isn't particularly interesting, so feel free to skip it.

In [None]:
import tensorflow as tf
import tensorflow.contrib.slim as slim
import tensorflow.contrib.slim.nets as nets

In [None]:
tf.logging.set_verbosity(tf.logging.ERROR)
sess = tf.InteractiveSession()

First, we set up the input image.

In [None]:
x = tf.placeholder(tf.float32, (299, 299, 3)) # InceptionV3 takes a 299x299x3 input

Next, we load the Inception v3 model.

In [None]:
def inception(x, reuse):
    preprocessed = tf.multiply(tf.subtract(x, 0.5), 2.0)
    arg_scope = nets.inception.inception_v3_arg_scope(weight_decay=0.0)
    with slim.arg_scope(arg_scope):
        logits, _ = nets.inception.inception_v3(
            preprocessed, 1001, is_training=False, reuse=reuse)
        logits = logits[:,1:] # ignore background class
        probs = tf.nn.softmax(logits) # probabilities
    return logits, probs

logits, probs = inception(tf.expand_dims(x, 0), reuse=False)
# gives us logits (pre-softmax) and probabilities (post-softmax)
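
To make the shapes concrete, here is a small NumPy sketch of what the wrapper does around the network: it rescales pixels from [0, 1] to [-1, 1], drops the background logit at index 0, and applies a softmax. The random `fake_logits` array is only a stand-in for the real network output, and the helper names `preprocess` and `softmax` are assumptions for illustration.

```python
import numpy as np

def preprocess(x):
    """Map pixel values from [0, 1] to [-1, 1], mirroring the wrapper's rescaling."""
    return (x - 0.5) * 2.0

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(1, 299, 299, 3)).astype(np.float32)
scaled = preprocess(image)

# Hypothetical stand-in for the 1001-way network output (index 0 = background).
fake_logits = rng.normal(size=(1, 1001))
class_logits = fake_logits[:, 1:]      # drop the background class -> 1000 classes
class_probs = softmax(class_logits)    # each row sums to 1

print(scaled.min() >= -1.0, scaled.max() <= 1.0)
print(class_logits.shape, class_probs.sum())
```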

Next, we load pre-trained weights. This Inception v3 has a top-5 accuracy of 93.9%.

In [None]:
import tempfile
from urllib.request import urlretrieve
import tarfile
import os

In [None]:
data_dir = tempfile.mkdtemp()
inception_tarball, _ = urlretrieve(
    'http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz')
tarfile.open(inception_tarball, 'r:gz').extractall(data_dir)

In [None]:
restore_vars = [
    var for var in tf.global_variables()
    if var.name.startswith('InceptionV3/')
]
saver = tf.train.Saver(restore_vars)
saver.restore(sess, os.path.join(data_dir, 'inception_v3.ckpt'))

Next, we write some code to show an image, classify it, and show the classification result.

In [None]:
import json
import matplotlib.pyplot as plt

In [None]:
with open('resources/imagenet.json') as f:
    imagenet_labels = json.load(f)

In [None]:
def classify(img, correct_class=None, target_class=None):
    # left panel: the image; right panel: the top-10 class probabilities
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
    p = sess.run(probs, feed_dict={x: img})[0]
    ax1.imshow(img)
    
    topk = list(p.argsort()[-10:][::-1])
    topprobs = p[topk]
    barlist = ax2.bar(range(10), topprobs)
    if target_class in topk:
        barlist[topk.index(target_class)].set_color('r')
    if correct_class in topk:
        barlist[topk.index(correct_class)].set_color('g')
    plt.sca(ax2)
    plt.ylim([0, 1.1])
    plt.xticks(range(10),
               [imagenet_labels[i][:15] for i in topk],
               rotation='vertical')
    fig.subplots_adjust(bottom=0.2)
    plt.show()

## Example image

We load our example image and make sure it's classified correctly.

In [None]:
import PIL.Image
import numpy as np

In [None]:
img_class = 281
img = PIL.Image.open('resources/cat.jpg')
# resize so the shorter side is 299 pixels, crop the top-left 299x299 square,
# and convert to a float32 image with pixels in [0, 1]
wide = img.width > img.height
new_w = 299 if not wide else int(img.width * 299 / img.height)
new_h = 299 if wide else int(img.height * 299 / img.width)
img = img.resize((new_w, new_h)).crop((0, 0, 299, 299))
img = (np.asarray(img) / 255.0).astype(np.float32)
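
Note that `crop((0, 0, 299, 299))` keeps the top-left square of the resized image. If a true center crop is preferred, the crop box could be computed as in the sketch below. The helper name `center_crop_box` is hypothetical, and switching to it changes the network input, so the results later in the notebook may differ.

```python
def center_crop_box(width, height, size=299):
    """Return (left, upper, right, lower) for a centered size x size crop."""
    if width < size or height < size:
        raise ValueError('image is smaller than the requested crop')
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)

# Example: an image resized so that its shorter side is 299 pixels.
box = center_crop_box(448, 299)
print(box)  # (74, 0, 373, 299)
```

With PIL, the box would be passed as `img.resize((new_w, new_h)).crop(center_crop_box(new_w, new_h))`.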

In [None]:
classify(img, correct_class=img_class)

As we expect, it's classified as a cat.

# Adversarial examples

Given an image $\mathbf{x}$, our neural network outputs a probability distribution over labels, $P(y \mid \mathbf{x})$. When we craft an adversarial input, we want to find an $\hat{\mathbf{x}}$ where $\log P(\hat{y} \mid \hat{\mathbf{x}})$ is maximized for a target label $\hat{y}$: that way, our input will be misclassified as the target class. We can ensure that $\hat{\mathbf{x}}$ doesn't look too different from the original $\mathbf{x}$ by constraining ourselves to some $\ell_\infty$ box with radius $\epsilon$, requiring that $\left\lVert \mathbf{x} - \hat{\mathbf{x}} \right\rVert_\infty \le \epsilon$.

In this framework, an adversarial example is the solution to a constrained optimization problem that we can solve using backpropagation and projected gradient descent, basically the same techniques that are used to train networks themselves. The algorithm is simple:

We begin by initializing our adversarial example as $\hat{\mathbf{x}} \leftarrow \mathbf{x}$. Then, we repeat the following until convergence:

1. $\hat{\mathbf{x}} \leftarrow \hat{\mathbf{x}} + \alpha \cdot \nabla \log P(\hat{y} \mid \hat{\mathbf{x}})$
2. $\hat{\mathbf{x}} \leftarrow \mathrm{clip}(\hat{\mathbf{x}}, \mathbf{x} - \epsilon, \mathbf{x} + \epsilon)$
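
The two steps above can be sketched in plain NumPy on a toy problem. This is a minimal illustration, not the attack on Inception: it assumes a small linear softmax classifier with random weights `W` so that the gradient can be written by hand, and the function names, step size, and problem sizes are all made up for the example.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax of a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def target_log_prob_and_grad(W, x, target):
    """Return log P(target | x) and its gradient w.r.t. x for logits = W @ x."""
    logp = log_softmax(W @ x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    grad = W.T @ (onehot - np.exp(logp))
    return logp[target], grad

def pgd_attack(W, x, target, epsilon, alpha, steps):
    """Projected gradient ascent on log P(target | x_hat) inside an l-infinity box."""
    lower, upper = x - epsilon, x + epsilon
    x_hat = x.copy()
    for _ in range(steps):
        _, grad = target_log_prob_and_grad(W, x_hat, target)
        x_hat = x_hat + alpha * grad           # step 1: move up the gradient
        x_hat = np.clip(x_hat, lower, upper)   # step 2: project back into the box
    return x_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 20))   # toy classifier: 5 classes, 20 input dimensions
x = rng.normal(size=20)
target = 3
epsilon = 0.1

x_adv = pgd_attack(W, x, target, epsilon=epsilon, alpha=0.01, steps=100)
before, _ = target_log_prob_and_grad(W, x, target)
after, _ = target_log_prob_and_grad(W, x_adv, target)
print('max |x_adv - x| =', np.abs(x_adv - x).max())
print('log P(target): %.3f -> %.3f' % (before, after))
```

The clip in step 2 is the projection onto the $\ell_\infty$ box: each coordinate is constrained independently, so projecting amounts to clipping elementwise.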

## Setup

In [None]:
y_hat = tf.placeholder(tf.int32, ()) # a placeholder for the target class
labels = tf.one_hot(y_hat, 1000) # the cross-entropy loss below takes one-hot labels
loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=[labels])

## Parameters

Before we perform our attack, let's choose some concrete parameters.

In [None]:
epsilon = 2.0/255.0 # a really small perturbation
target = 924 # this is "guacamole"; you could try choosing anything else

## Projected gradient descent

We use projected gradient descent to maximize the log probability of the target class (or equivalently, minimize the [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy)) while keeping the adversarial example visually close to the original image (what happens if we don't do this?).
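
The equivalence mentioned above is easy to check numerically: with a one-hot label, the cross entropy reduces to the negative log probability of the target class. The following NumPy sketch uses random logits as a stand-in for the network output; the helper names are assumptions for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax of a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def cross_entropy(logits, one_hot):
    """Cross entropy between a one-hot label distribution and softmax(logits)."""
    return -np.sum(one_hot * log_softmax(logits))

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)   # made-up logits for 1000 classes
target = 924
one_hot = np.zeros(1000)
one_hot[target] = 1.0

ce = cross_entropy(logits, one_hot)
log_p_target = log_softmax(logits)[target]
print(ce, -log_p_target)   # the two values agree
```

Minimizing `ce` by gradient descent is therefore the same as maximizing $\log P(\hat{y} \mid \hat{\mathbf{x}})$ by gradient ascent.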

In [None]:
grad, = tf.gradients(loss, x)

We can evaluate the gradient `grad` at a particular input `x_hat` and target class `target` by running `grad.eval({x: x_hat, y_hat: target})`.

Finally, we're ready to run projected gradient descent to find our adversarial example `x_hat`.

In [None]:
learning_rate = 1e-1
iterations = 30

x_hat = np.copy(img) # initial guess

upper = np.clip(img + epsilon, 0, 1) # an upper bound for pixels in the image
lower = np.clip(img - epsilon, 0, 1) # a lower bound for pixels in the image

for i in range(iterations):
    # gradient descent step
    # TODO
    
    # projection step; what happens if you don't do this (and e.g. clip to [0, 1] instead)?
    # TODO
    
    # print progress
    l = loss.eval({x: x_hat, y_hat: target})
    print('step %d, loss %f' % (i, l))
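
One possible way to fill in the two TODO steps is sketched below as a plain NumPy helper. It assumes the gradient of the cross-entropy loss has already been evaluated as an array, for instance with `grad.eval({x: x_hat, y_hat: target})`. The helper name `pgd_step` and the dummy image and gradient in the example are hypothetical and not part of the notebook.

```python
import numpy as np

def pgd_step(x_hat, grad_value, learning_rate, lower, upper):
    """One PGD iteration for a loss that should be minimized."""
    # gradient descent step: move against the gradient of the loss
    x_hat = x_hat - learning_rate * grad_value
    # projection step: clip back into the per-pixel bounds [lower, upper]
    return np.clip(x_hat, lower, upper)

# Tiny example with a made-up three-pixel "image" and a made-up gradient.
img = np.array([0.0, 0.5, 1.0], dtype=np.float32)
epsilon = 2.0 / 255.0
lower = np.clip(img - epsilon, 0, 1)
upper = np.clip(img + epsilon, 0, 1)

dummy_grad = np.array([1.0, -1.0, -1.0], dtype=np.float32)
x_next = pgd_step(img.copy(), dummy_grad, learning_rate=1e-1, lower=lower, upper=upper)
print(x_next)
```

Inside the loop this would be called as `x_hat = pgd_step(x_hat, grad.eval({x: x_hat, y_hat: target}), learning_rate, lower, upper)`. The minus sign appears because `loss` is the cross entropy, so descending on it is the same as ascending on the log probability of the target class.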

This adversarial image is visually indistinguishable from the original, with no visual artifacts. However, it's classified as "guacamole" with high confidence!

In [None]:
classify(x_hat, correct_class=img_class, target_class=target)

# Robust adversarial examples

Now, we go through a more advanced example. We follow an approach for synthesizing robust adversarial examples to find a single perturbation of our cat image that's simultaneously adversarial under some chosen distribution of transformations. We could choose any distribution of differentiable transformations; in this notebook, we'll synthesize a single adversarial input that's robust to rotation by $\theta \in [-\pi/4, \pi/4]$.

Before we proceed, let's check if our previous example is still adversarial if we rotate it, say by an angle of $\theta = \pi/8$.

In [None]:
ex_angle = np.pi/8

angle = tf.placeholder(tf.float32, ())
rotated_image = tf.contrib.image.rotate(x, angle)
rotated_example = rotated_image.eval(feed_dict={x: x_hat, angle: ex_angle})
classify(rotated_example, correct_class=img_class, target_class=target)

Looks like our original adversarial example is not rotation-invariant!

So, how do we make an adversarial example robust to a distribution of transformations? Given some distribution of transformations $T$, we can maximize $\mathbb{E}_{t \sim T} \log P\left(\hat{y} \mid t(\hat{\mathbf{x}})\right)$, subject to $\left\lVert \mathbf{x} - \hat{\mathbf{x}} \right\rVert_\infty \le \epsilon$. We can solve this optimization problem via projected gradient descent: the gradient of the expectation equals the expectation of the gradient, $\nabla \mathbb{E}_{t \sim T} \log P\left(\hat{y} \mid t(\hat{\mathbf{x}})\right) = \mathbb{E}_{t \sim T} \nabla \log P\left(\hat{y} \mid t(\hat{\mathbf{x}})\right)$, so at each gradient descent step we can approximate it by averaging the gradient over a few sampled transformations.
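
The sampling idea can be illustrated in NumPy on a toy problem. This sketch is an assumption-laden stand-in for the real setup: the "network" is a random linear softmax classifier, and the transformations are circular shifts of a 1-D input rather than image rotations. It averages per-sample gradients and compares the result with a finite-difference gradient of the averaged objective over the same samples.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax of a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def log_prob(W, x, target, shift):
    """log P(target | t(x)), where t circularly shifts x by `shift` positions."""
    return log_softmax(W @ np.roll(x, shift))[target]

def grad_log_prob(W, x, target, shift):
    """Gradient of log_prob w.r.t. the untransformed x (chain rule through the shift)."""
    logp = log_softmax(W @ np.roll(x, shift))
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    grad_wrt_shifted = W.T @ (onehot - np.exp(logp))
    return np.roll(grad_wrt_shifted, -shift)   # undo the shift

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
x = rng.normal(size=8)
target = 2
shifts = rng.integers(-2, 3, size=10)   # 10 sampled transformations

# Sample estimate of E_t[grad log P]: average the per-sample gradients.
mean_grad = np.mean([grad_log_prob(W, x, target, s) for s in shifts], axis=0)

# Finite-difference gradient of the averaged objective, for comparison.
def mean_objective(v):
    return np.mean([log_prob(W, v, target, s) for s in shifts])

h = 1e-6
fd_grad = np.array([
    (mean_objective(x + h * e) - mean_objective(x - h * e)) / (2 * h)
    for e in np.eye(8)
])
print(np.abs(mean_grad - fd_grad).max())   # small: the two gradients agree
```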

Rather than manually implementing the gradient sampling, we can get TensorFlow to do it for us.

In [None]:
num_samples = 10 # samples per step

rotated = []
for i in range(num_samples):
    rotated.append(tf.contrib.image.rotate(
        x, tf.random_uniform((), minval=-np.pi/4, maxval=np.pi/4)))
rotated = tf.stack(rotated)
rotated_logits, _ = inception(rotated, reuse=True)

duplicated_labels = tf.tile(tf.expand_dims(labels, 0), (num_samples, 1))

average_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=rotated_logits, labels=duplicated_labels))

## Projected gradient descent

Now, we can run PGD just like we did last time to generate our adversarial input. As in the previous example, we'll choose "guacamole" as our target class.

In [None]:
grad_robust, = tf.gradients(average_loss, x)

In [None]:
epsilon = 8.0/255.0 # still a pretty small perturbation
learning_rate = 2e-1
iterations = 50

x_hat_robust = np.copy(img) # initial guess

upper = np.clip(img + epsilon, 0, 1) # an upper bound for pixels in the image
lower = np.clip(img - epsilon, 0, 1) # a lower bound for pixels in the image

for i in range(iterations):
    # gradient descent step
    # TODO
    
    # projection step; what happens if you don't do this (and e.g. clip to [0, 1] instead)?
    # TODO
    
    # print progress
    l = average_loss.eval({x: x_hat_robust, y_hat: target})
    print('step %d, average loss %f' % (i, l))

This adversarial image is classified as "guacamole" with high confidence, even when it's rotated!

In [None]:
rotated_example = rotated_image.eval(feed_dict={x: x_hat_robust, angle: ex_angle})
classify(rotated_example, correct_class=img_class, target_class=target)

## Evaluation

Let's examine the rotation-invariance of the robust adversarial example we produced over the entire range of angles, looking at $P(\hat{y} \mid \hat{\mathbf{x}})$ over $\theta \in [-\pi/4, \pi/4]$.

In [None]:
thetas = np.linspace(-np.pi/4, np.pi/4, 51)

p_naive = []
p_robust = []
for theta in thetas:
    # use a separate name so the `rotated` tensor defined earlier isn't overwritten
    rotated_np = rotated_image.eval(feed_dict={x: x_hat_robust, angle: theta})
    p_robust.append(probs.eval(feed_dict={x: rotated_np})[0][target])
    
    rotated_np = rotated_image.eval(feed_dict={x: x_hat, angle: theta})
    p_naive.append(probs.eval(feed_dict={x: rotated_np})[0][target])

robust_line, = plt.plot(thetas, p_robust, color='b', linewidth=2, label='robust')
naive_line, = plt.plot(thetas, p_naive, color='r', linewidth=2, label='naive')
plt.ylim([0, 1.05])
plt.xlabel('rotation angle (radians)')
plt.ylabel('target class probability')
plt.legend(handles=[robust_line, naive_line], loc='lower right')
plt.show()

It's super effective!
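
To put a number on the plot, each curve could be summarized by its mean target probability and by the fraction of angles at which the target probability exceeds 0.5 (above 0.5, the target is necessarily the top prediction). A hedged sketch follows; the two curves are made-up stand-ins for `p_naive` and `p_robust`, and the helper name `summarize` and the 0.5 threshold are assumptions.

```python
import numpy as np

def summarize(probabilities, threshold=0.5):
    """Return (mean probability, fraction of angles above the threshold)."""
    probabilities = np.asarray(probabilities, dtype=float)
    return probabilities.mean(), np.mean(probabilities > threshold)

thetas = np.linspace(-np.pi / 4, np.pi / 4, 51)
fake_naive = np.exp(-(thetas / 0.05) ** 2)   # made-up: adversarial only near 0
fake_robust = np.full_like(thetas, 0.9)      # made-up: adversarial at every angle

naive_mean, naive_rate = summarize(fake_naive)
robust_mean, robust_rate = summarize(fake_robust)
print('naive:  mean %.3f, success rate %.3f' % (naive_mean, naive_rate))
print('robust: mean %.3f, success rate %.3f' % (robust_mean, robust_rate))
```

In the notebook, the same helper could be applied directly as `summarize(p_naive)` and `summarize(p_robust)`.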

# More exploration

Finished early? Try making this work for other types of transformation, such as random crops, rescaling, or shearing.