# Classifier Performance with Adversarial Samples

In this notebook, we create adversarial samples of a dataset using the Fast Gradient Sign Method (FGSM) and show how several different classifier models perform when evaluated on these samples. To demonstrate how well adversarial samples generalize, we construct them using a dataset completely separate from the one used to train the classifier models. This mirrors the scenario in which an adversary does not have explicit access to a model's parameters or the exact data it was trained on.

# Fast Gradient Sign Method

The Fast Gradient Sign Method (FGSM) is an algorithm that produces adversarial samples by adding small changes to pre-existing samples from a classification dataset. These small changes are usually imperceptible to the human eye, but can change the output of a deep learning model enough to alter its class predictions. To compute which small changes will skew the model outputs the most, FGSM uses gradient information from the loss of a classifier model that was trained on the type of data to be altered. FGSM therefore requires two things: a representative dataset that mirrors the data used for any models that will be attacked, and a model trained on this representative dataset.

When training a deep learning model, gradient descent relies on the fact that the gradient of the loss function w.r.t. the model parameters (computed by backpropagation) is the multi-dimensional direction in model parameter space corresponding to the greatest increase in loss. Each update step therefore moves the model parameters in the exact opposite direction of this gradient in order to decrease the loss. For FGSM, we compute the gradient of the loss w.r.t. the input sample instead, which gives the multi-dimensional direction in input space that corresponds to the greatest increase in loss. Since the goal of FGSM is to construct a sample that *hurts* the performance of the model (i.e. *increases* the loss), this gradient information is used directly to slightly alter the input sample. The small alteration can produce a large increase in loss as it cascades through the model's multiple layers of multiplicative computation.
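
The contrast between the two gradients can be illustrated with a toy example. The sketch below is not part of the notebook's pipeline: it assumes a two-weight logistic model with hand-picked values, and the names `loss`, `gradients`, and `step` are hypothetical. It uses analytic gradients so that no deep learning library is needed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Binary cross-entropy of a logistic model p = sigmoid(w . x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradients(w, x, y):
    """Analytic gradients of the loss w.r.t. the weights and w.r.t. the input."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad_w = [(p - y) * xi for xi in x]  # direction used to train the model
    grad_x = [(p - y) * wi for wi in w]  # direction used to attack the model
    return grad_w, grad_x

# toy values chosen for illustration only
w = [0.8, -0.5]  # model weights
x = [1.0, 2.0]   # input sample
y = 1            # true label
step = 0.1       # step size shared by both updates

grad_w, grad_x = gradients(w, x, y)

# training step: move the weights against their gradient to lower the loss
w_trained = [wi - step * g for wi, g in zip(w, grad_w)]

# attack step: move the input along its gradient to raise the loss
x_attacked = [xi + step * g for xi, g in zip(x, grad_x)]

base_loss = loss(w, x, y)
trained_loss = loss(w_trained, x, y)
attacked_loss = loss(w, x_attacked, y)
print(f'base: {base_loss:.3f}  trained: {trained_loss:.3f}  attacked: {attacked_loss:.3f}')
```

The same loss is lowered by the weight update and raised by the input update; only the variable being differentiated and the sign of the step differ.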

Given a trained classifier $f$, with input $x$, and loss function $L(f(x))$, the FGSM computes an adversarial sample $x_{adv}$ with the formula:

\begin{align}
x_{adv} = x + \epsilon \cdot \text{sgn}(\nabla_x L(f(x)))
\tag{1}
\label{eq:fgsm}
\end{align}

Where $\epsilon$ is a small number that sets the strength of the perturbation (e.g. 0.1). The *sign* of the gradient is used instead of the gradient itself, since it preserves enough information to be effective.
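
As an illustration, Equation (1) could be written in PyTorch roughly as follows. This is a minimal sketch rather than the notebook's own implementation: the function name `fgsm_attack`, the stand-in linear model, and the random data are all assumptions made so that the example runs on its own. The loss here also takes the true labels `y`, which the formula leaves implicit.

```python
import torch
import torch.nn as nn

def fgsm_attack(model, loss_fn, x, y, epsilon):
    """Return x + epsilon * sign(grad_x loss), following Equation (1)."""
    # track gradients on a detached copy of the input, not on the model parameters
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    # gradient of the loss w.r.t. the input sample
    grad, = torch.autograd.grad(loss, x)
    return (x + epsilon * grad.sign()).detach()

torch.manual_seed(0)

# stand-in classifier and data (hypothetical, for illustration only)
model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
x = torch.rand(8, 4)
y = torch.randint(0, 3, (8,))
epsilon = 0.1

x_adv = fgsm_attack(model, loss_fn, x, y, epsilon)

# compare the loss on clean and adversarial inputs
with torch.no_grad():
    clean_loss = loss_fn(model(x), y).item()
    adv_loss = loss_fn(model(x_adv), y).item()
print(f'clean loss: {clean_loss:.4f}  adversarial loss: {adv_loss:.4f}')
```

Each input value moves by at most $\epsilon$. For image data, the result is often clamped back to the valid pixel range afterwards; that step is not part of Equation (1) and is omitted here.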

### Setup

Since FGSM needs a dataset and a model trained on this dataset to compute adversarial samples, we create two independent partitions of our training data prior to training models for this project. One partition is used to provide the gradient information to FGSM when creating adversarial samples, and the other is used to train the remaining "friendly" classifier models, which are then evaluated on adversarially altered data from the test set. This setup ensures that our friendly models cannot pick up on any common dataset artifacts to defend against adversarial samples, since these samples are constructed using information from a completely disjoint dataset.
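
One way such a partition could be produced is sketched below. The helper `split_disjoint`, the 50/50 ratio, and the random shuffle are assumptions for illustration; the text above does not specify how the split was actually made.

```python
import random

def split_disjoint(indices, adversary_fraction=0.5, seed=0):
    """Shuffle sample indices and cut them into two non-overlapping partitions."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * adversary_fraction)
    return shuffled[:cut], shuffled[cut:]

# indices of ten hypothetical training samples
adversary_idx, friendly_idx = split_disjoint(range(10))

# no sample is shared between the adversary's data and the friendly models' data
print(set(adversary_idx).isdisjoint(friendly_idx))
```

Splitting on indices before any model is trained keeps the two partitions disjoint regardless of how the samples are later loaded.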

In the following code block we set up our environment, initialize a test set dataloader, and load all models. You will have to alter the `config` dictionary and the variables `adversary_model_file` and `fiendly_model_directory` to match the locations and specifications of your local machine.

In [15]:
import torch
import torchvision
import glob

from util.pytorch_utils import build_image_dataset
from util.data_utils import generate_df_from_image_dataset
from module.classifier import Classifier

# local/model/data parameters
config = {
    'dataset_directory': '/home/dylan/datasets/cifar_png/',
    'batch_size': 32,
    'input_dimensions': (32, 32, 3),
    'number_workers': 1,
}

# local file locations
adversary_model_file = '/home/dylan/trained_model_files/pytorch/adversarial_project/adversarial_models/classifier.pt'
fiendly_model_directory = '/home/dylan/trained_model_files/pytorch/adversarial_project/friendly_models/'

# get local device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# generate filenames/labels df from image data directory
data_dict = generate_df_from_image_dataset(config['dataset_directory'])

# add number of classes in labels to config
config['output_dimension'] = data_dict['train']['Label'].nunique()

# add number of samples to config
config['number_train'] = len(data_dict['train'])
config['number_test'] = len(data_dict['test'])

# build testing dataloader
test_set, test_loader = build_image_dataset(
    data_dict['test'],
    image_size=config['input_dimensions'][:-1],
    batch_size=config['batch_size'],
    num_workers=config['number_workers']
)

# initialize and load adversarial model
adv_classifier = Classifier(config['input_dimensions'], config['output_dimension'])
adv_classifier.load_state_dict(torch.load(adversary_model_file, map_location=device))

# get all friendly model files
friendly_model_files = glob.glob(fiendly_model_directory+'*.pt')

# initialize all models found in friendly model directory
classifiers = [{
    'model_file': model_file,
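    # drop the file extension; str.strip('.pt') would also remove any leading/trailing '.', 'p', or 't' characters
    'name': model_file.split('/')[-1].rsplit('.', 1)[0],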
    'model': Classifier(config['input_dimensions'], config['output_dimension'])
} for model_file in friendly_model_files]

# load all models found in friendly model directory
for classifier in classifiers:
    classifier['model'].load_state_dict(torch.load(classifier['model_file'], map_location=device))
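
A note on the loading step: `map_location` only controls where the checkpoint tensors are placed when the file is read. `load_state_dict` then copies them into the existing parameters, and a newly constructed PyTorch module lives on the CPU in training mode unless its class changes this itself. Assuming the models are meant for inference on `device`, the sketch below shows the extra calls on a stand-in model. The `nn.Sequential` network and the in-memory buffer are hypothetical substitutes for the notebook's `Classifier` and its `.pt` files.

```python
import io
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def build_model():
    # stand-in architecture; dropout makes the train/eval difference visible
    return nn.Sequential(nn.Linear(4, 3), nn.Dropout(p=0.5))

# save a state dict to an in-memory buffer in place of a .pt file on disk
buffer = io.BytesIO()
torch.save(build_model().state_dict(), buffer)
buffer.seek(0)

# load the weights the same way as the notebook does
model = build_model()
model.load_state_dict(torch.load(buffer, map_location=device))

model.to(device)  # move the parameters themselves to the target device
model.eval()      # disable dropout and use running batch-norm statistics

# in eval mode, repeated forward passes on the same input agree
x = torch.rand(2, 4, device=device)
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)
print(torch.equal(out_a, out_b))
```

Evaluation mode does not disable gradient tracking, so gradients w.r.t. the input can still be computed on a model prepared this way.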