# Adversarial Attack on Neural Networks
In this question, we will implement the fast gradient sign method (FGSM) attack on a convolutional neural network trained on the MNIST dataset. The attack reveals a striking lack of robustness in neural networks. FGSM was introduced by Goodfellow et al. in 2015. Check out their paper: [Explaining and Harnessing Adversarial Examples](https://arxiv.org/pdf/1412.6572.pdf).
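
For reference, the perturbation rule from that paper can be written as below, where $J$ is the loss, $\theta$ the model parameters, $x$ the input image, $y$ its label, and $\epsilon$ the perturbation size. The subscript name $x_{\text{adv}}$ is a notational choice made here, not the paper's.

$$x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$$

In this notebook the model's own prediction stands in for $y$, and $\epsilon$ is set to 0.3.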

In [5]:
from __future__ import absolute_import
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function

import logging
import os

import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow.python.platform import flags

logging.basicConfig(level=logging.INFO,
                    format='%(levelname)s %(asctime)s %(name)s %(message)s')

from utils import data_mnist

# Step 1: Read in data


In [6]:
# Choose a subset of the images in the dataset.
train_start = 0
train_end = 60000
test_start = 0
test_end = 10000

X_train, Y_train, X_test, Y_test = data_mnist(train_start=train_start,
                                              train_end=train_end,
                                              test_start=test_start,
                                              test_end=test_end)
# (We will only need the test set.)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
(u'X_train shape:', (60000, 28, 28, 1))
(u'X_test shape:', (10000, 28, 28, 1))


# Step 2: Load model
Loading a pre-trained model in TensorFlow is simple. We first load the metadata, which contains the computation graph, and then restore the weights. Why do we need `tf.Session()` to load a model? Remember that the model variables have no values outside a session!

In [7]:
sess = tf.Session()
# load the computation graph
saver = tf.train.import_meta_graph('MNIST_model/model_Q3.meta')
# load the weights
saver.restore(sess, 'MNIST_model/model_Q3')

INFO:tensorflow:Restoring parameters from MNIST_model/model_Q3


# Step 2.1: Visualize the model with TensorBoard
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) is a visualization tool that comes with TensorFlow. Here, we use it to visualize the model that you just loaded.

To start TensorBoard:

- Open a terminal and navigate to the root directory of the assignment.
- Type `tensorboard --logdir=MNIST_model/`. The server will be launched in the terminal and a URL will be displayed.
- Open the URL in a browser and click the Graph tab. You should see the model that you just loaded. How many convolution layers does it have? (3 layers)

# Step 2.5: Get tensors from the model

In [8]:
graph = tf.get_default_graph()
x = graph.get_tensor_by_name('x:0')
y = graph.get_tensor_by_name('y:0')
logits = graph.get_tensor_by_name('fc_1/logits:0')
probs = graph.get_tensor_by_name('probs:0')
cross_entropy = graph.get_tensor_by_name('cross_entropy:0')
accuracy = graph.get_tensor_by_name('accuracy:0')
preds_label = graph.get_tensor_by_name('preds_label:0')

In [9]:
# sanity check: the testing accuracy should be over .99
print('test accuracy {}'.format(sess.run(accuracy, feed_dict={
    x: X_test, y: Y_test})))

test accuracy 0.990599989891


# Step 3: Get the output label for the raw image.
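Why do we use `tf.stop_gradient` here? (So that the predicted label is treated as a constant target: when we later differentiate the loss with respect to `x`, no gradient flows back through the label computation.)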

In [11]:
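# Highest class probability for each image.
probs_max = tf.reduce_max(probs, axis=1, keep_dims=True)
# One-hot encoding of the predicted class, treated as a constant target.
model_predict = tf.stop_gradient(tf.to_float(tf.equal(probs, probs_max)))
# Normalize each row so that ties between classes still sum to 1.
model_predict = model_predict / tf.reduce_sum(model_predict, 1, keep_dims=True)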
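
To see what the three lines above compute, here is a minimal NumPy sketch of the same idea on a toy probability matrix. The helper name `one_hot_of_max` and the toy values are illustrative assumptions, not part of the assignment code.

```python
import numpy as np


def one_hot_of_max(probs):
    """Mark the most probable class in each row (hypothetical helper).

    Ties share the mass equally, mirroring the division by the row sum.
    """
    probs = np.asarray(probs, dtype=np.float64)
    row_max = probs.max(axis=1, keepdims=True)
    is_max = (probs == row_max).astype(np.float64)
    return is_max / is_max.sum(axis=1, keepdims=True)


# Toy input: the first row has a clear winner, the second row has a tie.
toy_probs = np.array([[0.1, 0.7, 0.2],
                      [0.4, 0.4, 0.2]])
toy_labels = one_hot_of_max(toy_probs)
```

The first row becomes a plain one-hot vector, while the tied row splits its weight between the two tied classes, so every row still sums to 1.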

# Step 4: Calculate the cross-entropy loss
Use `model_predict` as your y labels.

HINT: This time, do not use `tf.reduce_mean`. Why?

In [12]:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=model_predict)
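
`tf.nn.softmax_cross_entropy_with_logits` returns one loss value per example rather than a single scalar. The NumPy sketch below reproduces that per-example computation on toy data; the function name `softmax_cross_entropy` and the inputs are assumptions made for illustration.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between softmax(logits) and labels."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


toy_logits = np.array([[2.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])
toy_labels = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
per_example_loss = softmax_cross_entropy(toy_logits, toy_labels)
```

The result has shape `(2,)`: one loss per row, with no averaging over the batch.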

# Step 5: Gradient of the loss w.r.t. x
The loss function is a function of the input image x. 
We want to calculate the gradient of the loss with respect to x.

HINT: use `tf.gradients`

In [13]:
grad = tf.gradients(loss, x)[0]

# Step 6: Get the sign of the gradient.
HINT: use `tf.sign` and `tf.stop_gradient`

In [14]:
normalized_grad = tf.stop_gradient(tf.sign(grad))

# Step 7: Rescale `normalized_grad`

In [15]:
eps = 0.3
scaled_grad = eps * normalized_grad

# Step 8: Perturb the raw image

Add `scaled_grad` to `x` and let the result be `adv_x` (`adv_x` is the adversarial image).

In [17]:
adv_x = scaled_grad + x

# Step 9: Clip adv_x
Clip `adv_x` so that its values lie between 0 and 1.

HINT: use `tf.clip_by_value`

In [18]:
adv_x = tf.clip_by_value(adv_x, 0, 1, name='adv_x_clipped')
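
Steps 3 to 9 can be sketched end to end in NumPy on a toy linear softmax classifier, where the gradient with respect to the input has a closed form. This is only an illustration under stated assumptions: the model (`W`, `b`), the helper names `softmax` and `fgsm_linear`, and the random inputs are all hypothetical and unrelated to the loaded MNIST network.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fgsm_linear(x, W, b, eps):
    """FGSM against a toy model with logits = x.W + b (hypothetical).

    x: (N, D) inputs in [0, 1], W: (D, K) weights, b: (K,) biases.
    """
    probs = softmax(np.dot(x, W) + b)
    # Step 3: use the model's own prediction as a constant one-hot target.
    labels = np.eye(W.shape[1])[probs.argmax(axis=1)]
    # Steps 4-5: for a linear softmax model, d(loss)/dx = (probs - labels).W^T.
    grad = np.dot(probs - labels, W.T)
    # Steps 6-8: take the sign, scale by eps, and perturb the input.
    adv = x + eps * np.sign(grad)
    # Step 9: keep pixel values in the valid range.
    return np.clip(adv, 0.0, 1.0)


rng = np.random.RandomState(0)
x_toy = rng.uniform(size=(4, 6))
W_toy = rng.normal(size=(6, 3))
b_toy = np.zeros(3)
adv_toy = fgsm_linear(x_toy, W_toy, b_toy, eps=0.3)
```

Each entry of `adv_toy` differs from `x_toy` by at most `eps`, and clipping can only shrink that difference because the inputs already lie in [0, 1].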

# Step 10: Generate adversarial images 
Store them as `adv_images`.

In [19]:
adv_images = None
for idx in range(10):
    print(idx, end=' ')
    # Take 1000 images at a time so the attack runs in batches.
    first, last = 1000 * idx, 1000 * (idx + 1)

    batch_adv = sess.run(adv_x, feed_dict={
        x: X_test[first:last, ...], y: Y_test[first:last]})

    # Append the current batch to the accumulated output.
    if adv_images is not None:
        adv_images = np.concatenate([adv_images, batch_adv], axis=0)
    else:
        adv_images = batch_adv

0 1 2 3 4 5 6 7 8 9 
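
The batching pattern above can also be written as a small reusable helper that collects the batches in a list and concatenates once at the end. The sketch below is one possible way to do it; `run_in_batches` is a hypothetical name, and a simple doubling function stands in for the `sess.run` call.

```python
import numpy as np


def run_in_batches(fn, inputs, batch_size):
    """Apply fn to consecutive slices of inputs and stack the results."""
    outputs = []
    for first in range(0, len(inputs), batch_size):
        outputs.append(fn(inputs[first:first + batch_size]))
    return np.concatenate(outputs, axis=0)


# Stand-in for the model call: five rows processed two at a time.
toy_inputs = np.arange(10, dtype=np.float64).reshape(5, 2)
doubled = run_in_batches(lambda batch: batch * 2, toy_inputs, batch_size=2)
```

Stepping by `batch_size` means a final, smaller batch is handled without special cases.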

# Step 11: Accuracy on adversarial images
Now check the accuracy on the modified test images. You should expect to see the accuracy drop dramatically to around 0.11!

In [20]:
print(sess.run(accuracy, feed_dict={x: adv_images, y: Y_test}))

0.1146


# Step 12: Save sample adversarial images
Is the accuracy so low because the perturbed test images have become unrecognizable? Save a few of them and check for yourself.

In [21]:
#get the classification label (digit) given to the adversarial image
label_hat = sess.run(tf.argmax(logits, axis=1), feed_dict={x: adv_images, y: Y_test})

if not os.path.exists('MNIST_noisy_imgs'):
    os.makedirs('MNIST_noisy_imgs')

#save first 10
for idx, img in enumerate(adv_images[:10]):
    im = Image.fromarray(np.uint8(img * 255).squeeze())
    #save each image with its corresponding predicted label in the filename
    im.save(os.path.join('MNIST_noisy_imgs', 'im{}_pred_label_{}.png'.format(idx, label_hat[idx])))
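
Besides looking at the saved images, the size of the perturbation can be checked numerically: each pixel moves by at most `eps` before clipping, so no adversarial image should differ from its original by more than 0.3 in any pixel. The sketch below shows such a check on synthetic data; `linf_distance` is a hypothetical helper, and the random arrays merely stand in for `X_test` and `adv_images`, assuming pixel values in [0, 1].

```python
import numpy as np


def linf_distance(original, perturbed):
    """Largest absolute per-pixel change for each image, shape (N,)."""
    original = np.asarray(original, dtype=np.float64)
    perturbed = np.asarray(perturbed, dtype=np.float64)
    diff = np.abs(perturbed - original)
    return diff.reshape(len(diff), -1).max(axis=1)


# Synthetic stand-ins with the MNIST image shape (N, 28, 28, 1).
rng = np.random.RandomState(0)
clean = rng.uniform(size=(3, 28, 28, 1))
noisy = np.clip(clean + 0.3 * np.sign(rng.normal(size=clean.shape)), 0.0, 1.0)
distances = linf_distance(clean, noisy)
```

With the real arrays, the call would be `linf_distance(X_test, adv_images)`, and every entry should be no larger than `eps`.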