# Adversarial Attack on Cyberharassment Models

In Lab 3 (Adversarial Attack on Cyberharassment Models), you will add noise to the input of a cyberharassment model so that the model misclassifies it. This lab walks you through the process and develops your intuition about adversarial attacks on AI models.

**You will learn:**
- How to use an adversarial attack algorithm to fool a model trained to detect images that are not safe for work (NSFW).


## Required modules 
First, run the cells below to import the packages needed for this lab. The second cell mounts Google Drive, where some files required for the lab will be stored. You will be prompted for authentication; follow the instructions to authenticate.
- [numpy](https://www.numpy.org/), a Python package for scientific computing
- [matplotlib](http://matplotlib.org), a Python library for visualization
- [PIL](http://www.pythonware.com/products/pil/), a Python package for image processing
- [torch](https://pytorch.org/), an open source Python framework for deep learning


In [None]:
#@title
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms, models
from torchvision.io import read_image
from PIL import Image
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import os

seed_val = 42
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

## Lab Overview

<img src="https://clemson.box.com/shared/static/lp073durs1970cvy9cboy5lwfytw6awh.png">

You are given a model that has been trained (pretrained) to detect not safe for work (NSFW) images. You are
also given a set of NSFW test images. Building such models is important, but as you will see in this
lab, such a model can be fooled into missing this content.

You will use adversarial algorithms to generate noise that will be added to the test images. When a modified
test image is passed through the model, the model will classify it as not NSFW even though it is.

**About test data**

We will use NSFW images downloaded with [nsfw_data_scrapper](https://github.com/EBazarov/nsfw_data_source_urls). Each image belongs to one of five categories (neutral, hentai, drawings, porn and sexy). Each category is a folder containing the images that belong to it.
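The folder-per-category layout determines the numeric labels used later in the lab. The sketch below mimics how `datasets.ImageFolder` assigns indices: it sorts the folder names and numbers them from 0. The folder names are an assumption taken from the five categories listed above, and `sketch_class_to_idx` and `sketch_index_to_class` are hypothetical names used only for illustration.

```python
# Assumed folder names: the five categories listed above.
folders = ["neutral", "hentai", "drawings", "porn", "sexy"]

# ImageFolder sorts the class folder names and numbers them starting at 0.
sketch_class_to_idx = {name: i for i, name in enumerate(sorted(folders))}

# Reverse mapping, useful for turning a predicted index back into a name.
sketch_index_to_class = {i: name for name, i in sketch_class_to_idx.items()}

print(sketch_class_to_idx)
```

Under this assumption, indices 0 and 2 correspond to the drawings and neutral categories.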

Let us explore the test dataset. Download the data by running the cell below. 

In [None]:
#@title
%%capture
# Download the dataset
if not Path('test').is_dir():
    !wget https://buffalo.box.com/shared/static/x1wqhcvq46ux4uwlzd3df3uy1qh1n87h.zip
    # Unzip it
    !unzip -q x1wqhcvq46ux4uwlzd3df3uy1qh1n87h.zip -d test/

It is important to keep the dimensions of all images the same, because this allows a batch of images with the same shape to be grouped into a single tensor. We will apply several transformations to each image: resize it so that its shorter side is 224 pixels, center-crop it to a fixed height and width (224 x 224), convert it to a tensor (an array of pixel values), and normalize it. In this lab, we will batch our images into groups of 1, i.e., each image is in its own batch.
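To make the normalization step concrete, here is a minimal sketch of the arithmetic behind `transforms.Normalize`, together with its inverse, which is handy before displaying an image. The random tensor stands in for a real image, and the helper names `normalize` and `denormalize` are hypothetical rather than part of the lab code.

```python
import torch

# ImageNet channel statistics, the same values passed to transforms.Normalize
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(pixels):
    """Map a (3, H, W) tensor with values in [0, 1] to normalized space."""
    return (pixels - mean) / std

def denormalize(normalized):
    """Invert normalize(), for example before displaying an image."""
    return normalized * std + mean

pixels = torch.rand(3, 224, 224)        # stand-in for a ToTensor() output
batch = normalize(pixels).unsqueeze(0)  # batch of 1: shape (1, 3, 224, 224)
print(batch.shape)
```

Because the model receives normalized values, any perturbation added later in the lab is also expressed in normalized units rather than raw pixel intensities.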

In [None]:
#@title
image_path = "/content/test/nsfw_test"
batch_size = 1
test_image_transforms = transforms.Compose([transforms.Resize(224),
                                            transforms.CenterCrop(224),
                                            transforms.ToTensor(),
                                            transforms.Normalize(
                                                mean=[0.485, 0.456, 0.406],
                                                std=[0.229, 0.224, 0.225])    
                                        ])

dataset = datasets.ImageFolder(image_path, transform=test_image_transforms)
class_dict = dataset.class_to_idx
index_to_class = {}
for category, idx in class_dict.items():
    index_to_class[idx] = category

# Batch dataset and create an iterator over the dataset
test_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In [None]:
#@title View a few images. { run: "auto" }
#@markdown Select a number to view the image and its label. Note: Run manually first by clicking the play button to the left. After this manual run,  each subsequent value selection will run automatically.

index = "0" #@param [0, 1, 2, 3, 4, 5, 41, 42, 43, 44, 45]
image, label = dataset[int(index)]
# Reorder (channels, height, width) to (height, width, channels) for matplotlib
plt.imshow(image.permute(1, 2, 0))
print(f"Classes = {[category for category in class_dict.keys()]}")
print(f'class = {label}, which is {index_to_class[label]}:')

## NSFW Detection Without Attack

In [None]:
#@title
def detect_nsfw(model, test_dataloader):
    """
    Runs the model on the unmodified test set (no attack) and measures accuracy
    Args:
        model (nn.Module): The neural network to evaluate
        test_dataloader (DataLoader): an iterable over the dataset
    Returns:
        final_acc, examples (Tuple): a tuple of the model accuracy 
        and some (prediction, target, image) examples for visualization
    """
    
    # We are not training, so set the model to evaluation mode
    model.eval()

    # Accuracy counter
    correct = 0
    examples = []

    # Loop over all examples in test set
    for data, target in test_dataloader:                         

        # Send the data and label to the device
        data, target = data.to(device), target.to(device)         

        # Forward pass the data through the model
        output = model(data)

        prediction = output.max(1, keepdim=True)[1] # get the index of the max log-probability

        # Count correct predictions
        if prediction.item() == target.item():
            correct += 1

        # Save up to 5 examples for visualization later, whether or not the
        # prediction was correct; only predictions of class 0 or 2 are kept
        if len(examples) < 5 and prediction.item() in (0, 2):
            example = data.squeeze().detach().cpu().numpy()
            examples.append( (prediction.item(), target.item(), example) )

    # Calculate final accuracy; with a batch size of 1, len(test_dataloader)
    # equals the number of test images
    final_acc = correct / float(len(test_dataloader))
    print("Accuracy = {} / {} = {}".format(correct, len(test_dataloader), final_acc))

    # Return the accuracy and the saved examples
    return final_acc, examples

## White-box Attack

In a white-box attack, the adversary (the agent generating the adversarial examples) has knowledge of the target model's architecture, its parameters and its training data. We will perform a white-box attack: we have a model that has been trained to detect NSFW content, we have access to its parameters, and we know the data it was trained on.
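As a rough illustration of what white-box access provides, the sketch below uses a tiny linear classifier as a hypothetical stand-in for the target model (`toy_model` and its sizes are assumptions, not the NSFW model). It shows the two capabilities the attack relies on: reading the parameters and differentiating the loss with respect to the input.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the target model: 4 input features, 3 classes
toy_model = nn.Linear(4, 3)

# White-box capability 1: the attacker can read every parameter
num_params = sum(p.numel() for p in toy_model.parameters())

# White-box capability 2: the attacker can differentiate the loss
# with respect to the input, not only with respect to the weights
x = torch.rand(1, 4, requires_grad=True)
y = torch.tensor([1])
loss = nn.functional.cross_entropy(toy_model(x), y)
(input_grad,) = torch.autograd.grad(loss, x)

print(num_params, input_grad.shape)
```

The input gradient has the same shape as the input, so it gives one direction of change per input element.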

Execute the following cells to set up our model.


In [None]:
#@title
def get_model():
    """
    Initialize the NSFW model which is a modified ResNet50 model with
    two additional fully-connected layers
    Args:
        None
    Returns:
        model (nn.Module): pre-trained NSFW model
    """

    # NSFW model uses ResNet50 with additional layers 
    # Load ResNet50 model
    model = models.resnet50(pretrained=False)
    number_of_fully_connected_features = model.fc.in_features

    # Replacing last layer 
    model.fc = nn.Sequential(
        nn.Linear(number_of_fully_connected_features, 512),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(512, 10),
        nn.LogSoftmax(dim=1)
    )
    
    return model


Download the model parameters (pre-trained weights) by running the cell below. 
We will make use of a pretrained NSFW [model](https://github.com/emiliantolo/pytorch_nsfw_model).

In [None]:
#@title
%%capture
if not os.path.isfile('ResNet50_nsfw_model.pth'):
    !wget https://github.com/emiliantolo/pytorch_nsfw_model/blob/master/ResNet50_nsfw_model.pth?raw=true -O ResNet50_nsfw_model.pth

Initialize the model and load its parameters

In [None]:
#@title
# Use GPU if it is available 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pre_trained_model_path = "/content/ResNet50_nsfw_model.pth"
# Initialize the network
model = get_model().to(device)

# Initialize the network with the pretrained weights (parameters that the previous model learned) 
model.load_state_dict(torch.load(pre_trained_model_path, map_location='cpu')) 

## Attack
We will use the fast gradient sign method (FGSM) to generate adversarial perturbations that fool the target model into predicting the wrong output.
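For reference, the FGSM update is usually written as follows. The symbols are a notational assumption rather than something defined in this lab: $x$ is the input image, $y$ its true label, $\theta$ the model parameters, $J$ the loss, and $ϵ$ the perturbation scale.

$$
x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x J(\theta, x, y)\big)
$$

Each element of the input moves by exactly $ϵ$ in the direction that increases the loss, which corresponds to the expression `image + epsilon * sign_data_grad` in the `fgsm` function.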

In [None]:
#@title
def fgsm(image, epsilon, data_grad):
    """
    Perform the FGSM attack on a single image
    Args:
        image (torch.Tensor): The image to be perturbed
        epsilon (float): Hyperparameter for controlling the scale of perturbation
        data_grad (torch.Tensor): The gradient of the loss w.r.t. the image
    Returns:
        perturbed_image (torch.tensor): a perturbed image
    """
    # Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    
    # Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = image + epsilon * sign_data_grad
    
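    # Clip to the [0, 1] range; skipped when epsilon is 0 so that the
    # unperturbed image is returned unchanged. Note: the inputs in this lab
    # are already normalized, so the clamp acts on normalized values.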
    if epsilon != 0.0:
        perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    # Return the perturbed image
    return perturbed_image

FGSM Attack
- Perform a forward pass through the model using the original image
- Perform an FGSM attack by using FGSM to generate an adversarial image
- Perform a forward pass through the model using the adversarial image 
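The three steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions: `toy_model` is a hypothetical linear classifier rather than the NSFW model, the input is a random vector, and no clamping is applied.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

toy_model = nn.Linear(8, 3)               # hypothetical stand-in classifier
x = torch.rand(1, 8, requires_grad=True)  # stand-in for an input image
y = torch.tensor([2])                     # stand-in for the true label
epsilon = 0.1

# Step 1: forward pass on the original input
loss_before = F.cross_entropy(toy_model(x), y)

# Step 2: FGSM, move each input element by epsilon along the gradient sign
loss_before.backward()
x_adv = (x + epsilon * x.grad.sign()).detach()

# Step 3: forward pass on the adversarial input
with torch.no_grad():
    loss_after = F.cross_entropy(toy_model(x_adv), y)

print(f"loss before: {loss_before.item():.4f}, after: {loss_after.item():.4f}")
print(f"max change per element: {(x_adv - x).abs().max().item():.4f}")
```

No input element changes by more than `epsilon`, yet the loss on the true label does not decrease. For a deep network the same step is only a first-order approximation, so the effect on the prediction has to be checked empirically, which is what the attack function in this lab does.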

In [None]:
#@title
def fgsm_attack(model, test_dataloader, epsilon):
    """
    Performs the FGSM attack on a dataset for a given epsilon value
    Args:
        model (nn.Module): The neural network under attack
        test_dataloader (DataLoader): an iterable over the dataset
        epsilon (float): hyperparameter for controlling the scale of perturbation 
    Returns:
        final_acc, adversarial_examples, epsilon (Tuple): the accuracy under 
        attack, some adversarial examples for visualization, and the epsilon used
    """
    
    # We are not training, so set the model to evaluation mode
    model.eval()

    # Accuracy counter
    correct = 0
    adversarial_examples = []

    # Loop over all examples in test set
    for data, target in test_dataloader:                         

        # Send the data and label to the device
        data, target = data.to(device), target.to(device)         

        # Set requires_grad attribute of tensor. Important for Attack
        data.requires_grad = True

        # Forward pass the data through the model
        output = model(data)

        init_pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability

        # If the initial prediction is wrong, don't bother attacking, just move on
        if init_pred.item() != target.item():
            continue

        # Calculate the loss
        criterion = nn.CrossEntropyLoss()
        loss = criterion(output, target)

        # Zero all existing gradients
        model.zero_grad()

        # Calculate gradients of model in backward pass
        loss.backward()

        # Collect the gradient of the loss w.r.t. the input image
        data_grad = data.grad

        # Call FGSM to add perturbation to the data
        perturbed_data = fgsm(data, epsilon, data_grad)

        # Re-classify the perturbed image
        output = model(perturbed_data)

        # Check for success
        final_pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
        if final_pred.item() == target.item():
            correct += 1
            # Special case for saving 0 epsilon examples
            if (epsilon == 0.0) and (len(adversarial_examples) < 5):
                adversarial_example = perturbed_data.squeeze().detach().cpu().numpy()
                if (init_pred.item() == 0 and final_pred.item() == 0) or \
                (init_pred.item() == 2 and final_pred.item() == 2):
                    adversarial_examples.append( (init_pred.item(), final_pred.item(), adversarial_example) )
        else:
            # Save some adversarial examples for visualization later
            if len(adversarial_examples) < 5:
                adversarial_example = perturbed_data.squeeze().detach().cpu().numpy()
                if (init_pred.item() == 0 and final_pred.item() == 0) or \
                (init_pred.item() == 0 and final_pred.item() == 2) or \
                (init_pred.item() == 2 and final_pred.item() == 0) or \
                (init_pred.item() == 2 and final_pred.item() == 1):
                    adversarial_examples.append( (init_pred.item(), final_pred.item(), adversarial_example) )

    # Calculate final accuracy for this epsilon
    final_acc = correct / float(len(test_dataloader))
    print("Epsilon: {}\tTest Accuracy = {} / {} = {}".format(epsilon, correct, len(test_dataloader), final_acc))

    # Return the accuracy, the saved adversarial examples, and the epsilon used
    return final_acc, adversarial_examples, epsilon

## Visualization

In [None]:
#@title
def plot_accuracy_vs_epsilon(epsilons, accuracies):
    plt.figure(figsize=(5,5))
    plt.plot(epsilons, accuracies, "*-")
    plt.yticks(np.arange(0, 1.1, step=0.1))
    plt.xticks(np.arange(0, .35, step=0.05))
    plt.title("Accuracy vs Epsilon")
    plt.xlabel("Epsilon")
    plt.ylabel("Accuracy")
    plt.show()

In [None]:
#@title
# Plot several examples of adversarial samples at each epsilon. Note: Only showing images from the drawings and neutral classes to avoid showing inappropriate content
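def plot_adversarial_samples(epsilons, adversarial_examples):
    # Use the longest row as the column count so every subplot index is valid,
    # even when some epsilon values produced fewer saved examples than others
    n_rows = len(epsilons)
    n_cols = max((len(row) for row in adversarial_examples), default=0)
    if n_cols == 0:
        print("No adversarial examples to plot.")
        return
    
    plt.figure(figsize=(8,10))
    for i in range(n_rows):
        for j in range(len(adversarial_examples[i])):
            # Place example j for epsilon i at row i, column j of the grid
            plt.subplot(n_rows, n_cols, i * n_cols + j + 1)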
            plt.xticks([], [])
            plt.yticks([], [])
            if j == 0:
                plt.ylabel("Eps: {}".format(epsilons[i]), fontsize=14)
            orig, adv, ex = adversarial_examples[i][j]
            # Convert class indexes to string 
            orig = index_to_class[orig]
            adv = index_to_class[adv]
            plt.title("{}\n -> {}".format(orig, adv))
            plt.imshow(np.transpose(ex, (1,2,0)), cmap="gray")
    plt.tight_layout()
    plt.show()

## Execute Without FGSM Attack

Perform NSFW detection with the trained model.

- - - -
**<font color='red'>Task 1:</font>
What do you observe about how the model predicts NSFW content? Is the model predicting NSFW accurately?**
- - - -

Run the cell below to answer.

In [None]:
accuracy, _ = detect_nsfw(model, test_dataloader)

## Execute FGSM Attack

$ϵ$ controls the scale of the perturbations and we will use different $ϵ$ values to understand its effect. 
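To get a feel for these magnitudes, the sketch below converts an $ϵ$ expressed in normalized space into an approximate change on the familiar 0-255 intensity scale. It assumes the perturbation is added to images that were normalized with the channel standard deviations used earlier in this lab, and it ignores the clamping step; `epsilon_in_pixel_units` is a hypothetical helper.

```python
# Channel standard deviations from transforms.Normalize in this lab
std = [0.229, 0.224, 0.225]

def epsilon_in_pixel_units(epsilon):
    """Convert a step in normalized space to a step on the 0-255 scale.

    Since normalized = (pixel - mean) / std, a change of epsilon in
    normalized space is a change of epsilon * std on the [0, 1] pixel
    scale, or epsilon * std * 255 in 8-bit intensity levels.
    """
    return [epsilon * s * 255 for s in std]

for eps in [0.02, 0.06, 0.1, 0.14]:
    per_channel = ", ".join(f"{v:.1f}" for v in epsilon_in_pixel_units(eps))
    print(f"epsilon={eps}: about {per_channel} intensity levels (R, G, B)")
```

Under these assumptions, even the largest $ϵ$ in this lab changes each pixel by only a few intensity levels.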

Run the cell below to set up for the next task.

In [None]:
#@title
accuracies = []
adversarial_examples = []
epsilons = []

- - - -
**<font color='red'>Task 2:</font>**
- First, execute the cell below with the default epsilon of 0.0.
- Select each of the other epsilon values in turn. The code cell will execute automatically.
- Execute the code cells that visualize accuracy and sample adversarial examples.
- Compare the results with those obtained without the FGSM attack above and describe your observations. What effect does the epsilon value have on the accuracy and on the perceptibility of the examples?
- - - -

In [None]:
#@title Choose epsilon and execute. { run: "auto" }
#@markdown Select an epsilon value. Note: Run manually first by clicking the play button to the left. After this manual run,  each subsequent value selection will run automatically.
epsilon = "0.0"  #@param ["0.0", "0.02", "0.04", "0.06", "0.08", "0.1", "0.14"]

# Perform attack using each epsilon value
accuracy, adversarial_example, epsilon = fgsm_attack(model, test_dataloader, float(epsilon))
accuracies.append(accuracy)
epsilons.append(epsilon)
adversarial_examples.append(adversarial_example)

Visualize accuracy vs epsilon

In [None]:
#@title
plot_accuracy_vs_epsilon(epsilons, accuracies)

Sample adversarial examples - Original prediction -> adversarial prediction

In [None]:
#@title
plot_adversarial_samples(epsilons, adversarial_examples)