# Adversarial Example

An adversarial example is an input that has been intentionally crafted to deceive a machine learning model. It is created by making subtle, often imperceptible modifications to original input data, such as images, audio, or text, so that the model misclassifies the input or produces an incorrect output.

The goal of an adversarial example is to exploit vulnerabilities or weaknesses in the model's decision-making process. By manipulating the input data in a strategic manner, an attacker can cause the model to make mistakes or produce undesired outcomes. These adversarial examples are typically created with the intent to deceive the model, rather than being naturally occurring data points.

The concept of adversarial examples highlights the limitations and vulnerabilities of machine learning models. It demonstrates that even advanced models can be susceptible to manipulation and can be easily fooled by carefully crafted inputs. Adversarial examples raise concerns about the robustness, reliability, and security of machine learning systems in real-world applications.

Understanding and studying adversarial examples is crucial for developing more robust and resilient machine learning models. Researchers and practitioners aim to develop defense mechanisms and techniques to detect and mitigate the impact of adversarial examples, improving the overall security and trustworthiness of machine learning systems.

In summary, an adversarial example is a specially crafted input designed to exploit the vulnerabilities of a machine learning model, causing it to produce incorrect outputs or make mistakes. It serves as a test and a means to understand the limitations of machine learning systems and develop defenses against potential attacks.

# White-box Adversarial Examples

White-box adversarial examples refer to adversarial attacks where the attacker has complete knowledge about the targeted machine learning model. In a white-box setting, the attacker has access to the model's architecture, parameters, and training data, enabling them to analyze and exploit its vulnerabilities effectively.

To create a white-box adversarial example, the attacker can employ various optimization algorithms and techniques. They can use gradient information, obtained through backpropagation, to compute the direction and magnitude of perturbations that will maximize the model's prediction error. By iteratively adjusting the input data based on the gradient information, the attacker can craft an adversarial example that leads to a misclassification or an undesired output from the model.
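
The iterative procedure described above can be sketched on a model small enough to differentiate by hand. This is a minimal illustration under stated assumptions, not a definitive implementation: the logistic regression weights, the input, and the names `iterative_attack`, `epsilon`, and `step` are all hypothetical. For logistic regression, the gradient of the cross-entropy loss with respect to the input has the closed form `(p - y) * w`, which is the quantity backpropagation would return.

```python
import math

# Hypothetical white-box model: the attacker knows these weights and the bias.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1

def predict_proba(x):
    """Probability of class 1 under the logistic regression model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient_wrt_input(x, y):
    """Gradient of binary cross entropy with respect to the input: (p - y) * w."""
    p = predict_proba(x)
    return [(p - y) * w for w in WEIGHTS]

def sign(v):
    return (v > 0) - (v < 0)

def iterative_attack(x, y, epsilon=0.5, step=0.1, iterations=10):
    """Repeat small signed-gradient steps that increase the loss, keeping every
    feature within epsilon of its original value."""
    x_adv = list(x)
    for _ in range(iterations):
        grad = loss_gradient_wrt_input(x_adv, y)
        x_adv = [xi + step * sign(g) for xi, g in zip(x_adv, grad)]
        # Project back into the allowed range around the original input.
        x_adv = [min(max(xa, x0 - epsilon), x0 + epsilon)
                 for xa, x0 in zip(x_adv, x)]
    return x_adv

x_clean = [1.0, 0.2, 0.3]            # the model assigns this input to class 1
x_adv = iterative_attack(x_clean, y=1)
print(predict_proba(x_clean), predict_proba(x_adv))
```

In this toy setting the class-1 probability drops below 0.5 after the attack, while no feature moves by more than `epsilon`.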

The main advantage of white-box attacks is the extensive knowledge available to the attacker. They can thoroughly analyze the model's weaknesses, understand its decision boundaries, and tailor the adversarial example accordingly. This access to detailed information allows for more precise and effective manipulation of the input data.

White-box adversarial examples are particularly concerning because they simulate a scenario where the attacker has insider knowledge about the targeted model. This knowledge can be leveraged to design highly targeted attacks that exploit specific vulnerabilities, potentially causing severe consequences in real-world applications.

Defending against white-box adversarial examples requires the development of robust and resilient machine learning models. Techniques like adversarial training, defensive distillation, and input preprocessing can enhance the model's ability to withstand such attacks. Additionally, ensuring model transparency, frequent security evaluations, and careful consideration of model architectures can help mitigate the impact of white-box attacks.

In summary, white-box adversarial examples are crafted by attackers who possess complete knowledge of the targeted model. They exploit this knowledge to design precise and effective attacks, highlighting vulnerabilities in the model's decision-making process. Defending against white-box attacks requires robust model design, proactive security measures, and ongoing evaluation of the model's vulnerabilities.

# Black-box Adversarial Examples 
Black-box adversarial examples refer to adversarial attacks where the attacker has limited or no knowledge about the targeted machine learning model. In a black-box setting, the attacker doesn't have access to the model's internal architecture, parameters, or training data. They can only interact with the model by providing inputs and observing its outputs.

Creating black-box adversarial examples presents a greater challenge for the attacker compared to white-box attacks. Without detailed knowledge of the model, they need to rely on techniques that explore the model's behavior through input-output observations and feedback.

One common approach in black-box attacks is the use of transferability. The attacker trains a surrogate model, either by collecting their own labeled data or using a publicly available dataset. This surrogate model mimics the behavior of the targeted model to some extent. By crafting adversarial examples on the surrogate model, the attacker assumes that these examples will also fool the targeted model due to the transferability of adversarial perturbations.

Another technique employed in black-box attacks is the query-based method. The attacker can query the targeted model multiple times, providing carefully selected inputs and observing the corresponding outputs. By analyzing the model's responses and monitoring the changes in predictions, the attacker can gradually refine their understanding of the model's decision boundaries and craft adversarial examples.
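
A query-based attack can be sketched by estimating the gradient from output differences alone. The example below is a simplified illustration: `black_box_score` is a hypothetical stand-in for a remote model that returns only a class probability, and the finite-difference estimator is one of several possible query strategies, chosen here because it is short.

```python
import math

def black_box_score(x):
    """Stand-in for a deployed model: returns only the probability of class 1.
    The attacker is assumed not to see the weights used inside this function."""
    z = 1.5 * x[0] - 2.0 * x[1] + 0.5 * x[2] + 0.1
    return 1.0 / (1.0 + math.exp(-z))

def estimate_gradient(query, x, delta=1e-3):
    """Approximate d(score)/dx with central differences: two queries per feature."""
    grad = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += delta
        minus[i] -= delta
        grad.append((query(plus) - query(minus)) / (2 * delta))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def query_attack(query, x, epsilon=0.5, step=0.1, iterations=10):
    """Lower the class-1 score using only input-output queries."""
    x_adv, n_queries = list(x), 0
    for _ in range(iterations):
        grad = estimate_gradient(query, x_adv)
        n_queries += 2 * len(x)
        # Step against the estimated gradient to reduce the score.
        x_adv = [xi - step * sign(g) for xi, g in zip(x_adv, grad)]
        # Keep every feature within epsilon of the original input.
        x_adv = [min(max(xa, x0 - epsilon), x0 + epsilon)
                 for xa, x0 in zip(x_adv, x)]
    return x_adv, n_queries

x_clean = [1.0, 0.2, 0.3]
x_adv, n_queries = query_attack(black_box_score, x_clean)
print(black_box_score(x_clean), black_box_score(x_adv), n_queries)
```

The query count grows with the number of input features, which is one reason query-based attacks on high-dimensional inputs such as images tend to be expensive.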

Black-box attacks are particularly challenging because they simulate scenarios where the attacker has limited information and has to work with restricted access to the targeted model. These attacks closely resemble real-world situations where models are deployed as services, and the internal workings are not exposed.

Defending against black-box adversarial examples requires robustness even in the absence of specific knowledge about the attack. Techniques like adversarial training, ensemble methods, and input sanitization can enhance the model's resilience against unseen adversarial inputs. Additionally, monitoring and anomaly detection systems can help identify suspicious patterns and inputs that exhibit adversarial behavior.

In summary, black-box adversarial examples are crafted by attackers who have limited or no knowledge about the targeted model. They rely on techniques like transferability and query-based methods to explore the model's behavior and create adversarial examples. Defending against black-box attacks requires robust models, proactive security measures, and techniques that can generalize to unseen adversarial inputs.

# Risks of Adversarial Examples
When it comes to applications of classification, such as text or image classification, adversarial examples pose several risks. Here are some of the major risks associated with adversarial examples:

1. Misclassification: Adversarial examples can cause misclassification, where a model wrongly predicts the class or label of an input. By making small, imperceptible modifications to the input data, an attacker can trick the model into producing incorrect predictions. This misclassification can lead to severe consequences, especially in critical applications like autonomous vehicles or medical diagnosis.

2. Security Breaches: Adversarial examples can be used as a means to exploit vulnerabilities in the classification system. Attackers can craft adversarial inputs specifically designed to bypass security measures, gain unauthorized access, or deceive the system for malicious purposes. This can compromise the integrity and security of sensitive data and systems.

3. Privacy Violation: Adversarial manipulation of inputs can, in some settings, contribute to privacy risks. By probing a system with crafted inputs, attackers might be able to infer sensitive features or induce the system to reveal confidential information. This is a significant concern in applications that deal with personal data or sensitive information.

4. Model Robustness Evaluation: Adversarial examples serve as a critical tool for evaluating the robustness and reliability of classification models. By exposing models to adversarial attacks, researchers and developers can identify vulnerabilities, weaknesses, and limitations in the models' performance. Failure to account for adversarial examples during model evaluation can lead to overestimation of performance and create a false sense of security.

5. Transferability: Adversarial examples can exhibit transferability, meaning that an adversarial example designed to fool one model can also fool other models or even different machine learning algorithms. This transferability poses a risk across different systems and implementations, as an attack designed for one model can potentially affect multiple models or classifiers.

6. Social Engineering: Adversarial examples can be used as a form of social engineering, manipulating the perception of systems to deceive users or gain their trust. By crafting adversarial inputs, attackers can create a false sense of credibility or authority, leading users to make decisions based on incorrect or manipulated information.

7. Ethical Implications: The existence of adversarial examples raises ethical concerns related to the trustworthiness and fairness of classification systems. Adversarial attacks can disproportionately affect certain groups or bias decision-making processes. Addressing these ethical implications and ensuring fairness in classification systems is crucial to maintain public trust and avoid discriminatory outcomes.

8. Adversarial Attacks on Training Data: Adversarial manipulation can also target the training data itself, an attack usually called data poisoning. By introducing subtle perturbations or modifications to the training samples, attackers can influence the learning process and bias the model's behavior. This can lead to biased predictions, compromised model performance, or unintended consequences during deployment.

9. Legal and Regulatory Compliance: The presence of adversarial examples can have legal and regulatory implications, particularly in domains where accuracy and reliability are critical. Compliance with regulations, such as data protection or safety standards, may require thorough evaluation and mitigation strategies against adversarial attacks to ensure the integrity and trustworthiness of the classification system.

It is important to note that the risks associated with adversarial examples depend on the specific application, the potential impact of misclassification or manipulation, and the threat landscape. Understanding and mitigating these risks is essential for the development and deployment of robust and secure classification systems.

# How to Protect Classification Systems against this Type of Attack
To protect classification systems against adversarial attacks, several strategies can be employed. Here are some approaches:

1. Adversarial Training: This technique involves augmenting the training data with adversarial examples during the model training phase. By exposing the model to adversarial examples and incorporating them into the training process, the model learns to be more robust and resilient to such attacks. Adversarial training can enhance the model's ability to generalize and classify both clean and adversarial inputs accurately.

2. Defensive Distillation: This method involves training a model in two stages. In the first stage, a teacher model is trained on the dataset using a softmax with an elevated temperature. In the second stage, a student model is trained on the same inputs, but with the teacher's softened class probabilities (soft labels) as targets instead of the original hard labels. The objective is to smooth the student's decision surface so that small input perturbations have less effect on its predictions.

3. Input Preprocessing: Applying preprocessing techniques to the input data can help detect and mitigate adversarial attacks. Techniques like input normalization, feature scaling, and noise injection can help make the model more robust against perturbations introduced by attackers. Additionally, input validation and anomaly detection methods can be employed to identify potential adversarial inputs and reject or flag them for further analysis.

4. Ensemble Methods: Using ensemble methods, which combine the predictions of multiple models, can enhance the system's robustness. By training and combining multiple models with different architectures or initializations, the ensemble can collectively make more accurate predictions and be more resistant to adversarial attacks. Adversarial examples that affect one model in the ensemble are less likely to have the same effect on all models, reducing the overall vulnerability.

5. Model Regularization: Applying regularization techniques such as L1 or L2 regularization, dropout, or batch normalization during model training can help reduce the impact of adversarial perturbations. Regularization adds constraints to the model's parameters, making it less sensitive to small changes in the input data. Regularization can improve the model's generalization capability and make it more robust against adversarial attacks.

6. Adversarial Detection and Defense Mechanisms: Implementing adversarial detection and defense mechanisms can help identify and mitigate adversarial attacks. These mechanisms can include techniques like outlier detection, anomaly detection, or using specialized detection models to identify potential adversarial inputs. Once identified, appropriate actions can be taken, such as rejecting or flagging the inputs for manual review or applying additional defenses.

7. Ongoing Research and Development: Adversarial attacks are continuously evolving, and new defense techniques are being developed to counter them. Staying updated with the latest research and incorporating state-of-the-art defense mechanisms can help protect classification systems against emerging adversarial attack methods.

It's important to note that while these techniques can enhance the system's resilience, they may not provide complete protection against all types of adversarial attacks. Adversarial attacks and defense mechanisms are an ongoing cat-and-mouse game, and it's crucial to remain vigilant and adapt to new attack techniques and defense strategies.
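
Adversarial training, the first strategy in the list above, can be sketched on a toy problem. The following is a minimal illustration under stated assumptions: the two-cluster dataset, the logistic regression model, and the names `fgsm`, `train`, `epsilon`, and `lr` are hypothetical choices made to keep the example self-contained. Each training step uses both the clean sample and a perturbed copy generated against the current weights.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, epsilon):
    """One signed-gradient step that increases the loss on (x, y)."""
    p = predict(w, b, x)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad_x)]

def train(data, epsilon=0.5, lr=0.1, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            # Update on the clean sample and on its adversarial counterpart.
            for sample in (x, fgsm(w, b, x, y, epsilon)):
                p = predict(w, b, sample)
                w = [wi - lr * (p - y) * si for wi, si in zip(w, sample)]
                b -= lr * (p - y)
    return w, b

def accuracy(w, b, samples):
    return sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in samples) / len(samples)

# Toy data: class 1 clustered around (2, 2), class 0 around (-2, -2).
random.seed(0)
data = [([random.gauss(c, 0.5), random.gauss(c, 0.5)], y)
        for c, y in ((2.0, 1), (-2.0, 0)) for _ in range(100)]
random.shuffle(data)

w, b = train(data)
clean_acc = accuracy(w, b, data)
robust_acc = accuracy(w, b, [(fgsm(w, b, x, y, 0.5), y) for x, y in data])
print(clean_acc, robust_acc)
```

Reporting accuracy on perturbed inputs next to clean accuracy is the part worth keeping in real evaluations: a model that is only scored on clean data can look more reliable than it is.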

# Notebook 
The objective of this notebook is to use a pre-trained model (ResNet-50) to generate an adversarial image from an input image. The perturbation is a single step in the direction of the sign of the loss gradient, the approach used by the Fast Gradient Sign Method (FGSM). The code goes through the following steps:

1. Import the necessary libraries
2. Load a pre-trained model
3. Set up the model for evaluation
4. Define the loss function
5. Define preprocessing transformations
6. Load and preprocess the image
7. Save the original image
8. Define the target label
9. Set up the image tensor for gradient calculation
10. Make a prediction on the original image
11. Calculate the loss and backpropagate the gradient
12. Calculate the sign gradients
13. Generate the adversarial image
14. Save the adversarial image
15. Make a prediction on the adversarial image
16. Print the original and adversarial predictions

The code demonstrates how to create an adversarial image with a pre-trained model in PyTorch, and how the perturbation affects the model's prediction.

In [1]:
# Install the watermark package.
# This package is used to record the versions of other packages used in this Jupyter notebook.
# https://github.com/rasbt/watermark
!pip install -q -U watermark

In [26]:
import torch  # Importing the PyTorch library for deep learning
import torch.nn as nn  # Importing the neural network module of PyTorch
import torch.optim as optim  # Importing the optimization module of PyTorch
import torchvision.transforms as transforms  # Importing image transformations from torchvision
import torchvision.models as models  # Importing pre-trained models from torchvision
from PIL import Image  # Importing the Python Imaging Library for image processing

In [30]:
import warnings  # Import the warnings library for warning control and management

# Disable all warnings
warnings.filterwarnings("ignore")

In [31]:
# Load the watermark extension to display information about the Python version and installed packages.
%reload_ext watermark

# Display the versions of Python and installed packages.
%watermark -a 'Fabiano Falcão' -ws "https://fabianumfalco.github.io/" --python --iversions

Author: Fabiano Falcão

Website: https://fabianumfalco.github.io/

Python implementation: CPython
Python version       : 3.10.6
IPython version      : 8.11.0

PIL        : 9.0.1
torch      : 2.0.0
torchvision: 0.15.1



In [32]:
# Set the device to CUDA (GPU) if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  
print(f"Using device: {device}")  # Print the device being used

Using device: cuda


# ResNet-50

ResNet-50 is a convolutional neural network (CNN) architecture designed to make very deep networks easier to train. It was introduced in the paper "Deep Residual Learning for Image Recognition" by Kaiming He et al. in 2015, which addresses the degradation problem: as plain networks get deeper, their accuracy saturates and then degrades, even on the training set.

ResNet-50 has 50 weight layers (49 convolutional layers and one fully connected layer), interleaved with batch normalization, ReLU activations, and pooling. The main innovation is the use of residual (skip) connections, which add a block's input to its output so that gradient information can propagate directly through the layers, even in deep networks. These residual connections enable the model to learn richer and deeper representations of the images, thereby improving the accuracy of object recognition.
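
The effect of a skip connection on gradient flow can be illustrated with a deliberately simplified scalar example. This is a toy sketch, not the ResNet-50 block itself: each "layer" is a hypothetical linear function with a small slope, and the derivative is estimated numerically.

```python
def plain_layer(x, slope=0.1):
    """A layer whose local derivative is small (0.1)."""
    return slope * x

def residual_layer(x, slope=0.1):
    """The same transformation with a skip connection: output = F(x) + x."""
    return plain_layer(x, slope) + x

def stack(layer, depth):
    """Compose the same layer `depth` times."""
    def network(x):
        for _ in range(depth):
            x = layer(x)
        return x
    return network

def derivative(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

depth = 20
plain_grad = derivative(stack(plain_layer, depth), 1.0)        # about 0.1 ** 20
residual_grad = derivative(stack(residual_layer, depth), 1.0)  # about 1.1 ** 20
print(plain_grad, residual_grad)
```

Without the skip connection the derivative through 20 layers is vanishingly small, while the identity path keeps it on the order of 1 or larger. Real residual blocks are nonlinear and multi-dimensional, so this only conveys the intuition.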

ResNet-50 was trained on a large dataset, such as ImageNet, which contains millions of images across various classes. This allowed the model to learn general features of a wide range of objects and textures. As a result, ResNet-50 has become a benchmark model for image classification tasks, achieving state-of-the-art performance on object recognition benchmarks.

Furthermore, due to its deep learning capacity and rich representations, ResNet-50 has also been used as a base for transfer learning, where the learned features can be transferred to related tasks such as object detection and semantic segmentation.

In summary, ResNet-50 is a powerful CNN architecture that overcame the limitations of deep networks by introducing residual connections. It excelled in image classification tasks and has served as a foundation for many other computer vision applications.

In [33]:
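# Load the pre-trained ResNet-50 model from torchvision
# https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html
# https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/
# The weights argument replaces the deprecated pretrained=True (torchvision >= 0.13);
# IMAGENET1K_V1 is the checkpoint that pretrained=True used to load.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)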

# Move the model to GPU if available
model = model.to(device)

# Set the model to evaluation mode.
# In this mode, some layers behave differently than during training: dropout layers
# are deactivated and batch normalization layers use their running statistics.
# eval() does not disable gradient computation, so gradients with respect to the
# input image can still be obtained with backward().
model.eval()


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

# Cross entropy loss

Cross entropy loss, also known as softmax loss or log loss, is a commonly used loss function in machine learning for multi-class classification tasks. It measures the dissimilarity between the predicted probability distribution and the true distribution of the target labels.

In the context of classification, the cross entropy loss quantifies the difference between the predicted class probabilities and the actual class labels. It calculates the average negative log likelihood of the predicted probabilities for the true class labels. The goal is to minimize this loss function during the training process to improve the accuracy of the model.

The formula for cross entropy loss involves taking the logarithm of the predicted probabilities, multiplying them with the true label's one-hot encoded representation, summing the values across all classes, and negating the result. With a one-hot label this reduces to the negative log of the probability assigned to the true class. The loss is then averaged over the batch or the entire dataset.
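
As a small worked example of this computation, the sketch below evaluates the loss for one sample from raw scores (logits), which is also the input that PyTorch's `nn.CrossEntropyLoss` expects. The helper names `cross_entropy` and `batch_cross_entropy` and the example logits are illustrative, not part of the notebook.

```python
import math

def cross_entropy(logits, target):
    """Cross entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def batch_cross_entropy(batch_logits, targets):
    """Average the per-sample losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

loss = cross_entropy([2.0, 1.0, 0.1], target=0)
print(round(loss, 4))  # 0.417

# The same logits give a larger loss when the true class has a low score.
worse = cross_entropy([2.0, 1.0, 0.1], target=2)
print(worse > loss)  # True
```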

By optimizing the cross entropy loss, the model learns to assign higher probabilities to the correct class labels and lower probabilities to the incorrect ones. This loss function encourages the model to produce more confident predictions for the correct classes and penalizes uncertain or incorrect predictions.

The cross entropy loss is widely used because it is effective for optimizing models in multi-class classification tasks. It provides a gradient signal that guides the model's parameters towards better predictions, facilitating the learning process. Many deep learning frameworks, including PyTorch, provide an implementation of the cross entropy loss for easy integration into the training pipeline.

In [34]:
# Define the loss function for multi-class classification tasks.
# nn.CrossEntropyLoss() takes the model's raw logits and a class index. In this notebook
# it is not used to update the model's parameters: the loss is backpropagated to the
# input image, and that gradient determines the adversarial perturbation.
criterion = nn.CrossEntropyLoss()

In [35]:
# The transforms.Compose() function is used to chain the transformations together into 
# a single pipeline, allowing for easy and efficient preprocessing of the input images before
# feeding them into the model.

preprocess = transforms.Compose([
    transforms.Resize(256),  # Resize so the shorter side is 256 pixels (aspect ratio is preserved)
    transforms.CenterCrop(224),  # Crop the center of the image to a size of 224x224 pixels
    transforms.ToTensor()  # Convert the image to a tensor representation
])

In [36]:
# Preprocess the input image and convert it into a tensor
# .convert('RGB') guarantees three channels, which ResNet-50 expects.
# .unsqueeze(0): This adds an extra dimension to the tensor, making it a 4-dimensional tensor
# with a batch size of 1. This is necessary as many deep learning models expect inputs in batch format.
image = preprocess(Image.open('dog.jpg').convert('RGB')).unsqueeze(0)

# Move the image to the GPU if available
image = image.to(device)

In [37]:
# Remove the extra batch dimension from the image tensor
image2 = torch.squeeze(image, 0)

# Create an instance of the ToPILImage transformation
T = transforms.ToPILImage()

# Convert the tensor image back to a PIL Image object
img = T(image2)

# Save the PIL Image as 'dog_original.jpg'
img.save('dog_original.jpg')

In [38]:
# Set the target label to 3
# In the ImageNet class index used by the torchvision ResNet-50 weights, label 3 corresponds
# to the class "tiger shark, Galeocerdo cuvieri".
# https://deeplearning.cms.waikato.ac.nz/user-guide/class-maps/IMAGENET/
target_label = 3 # tiger shark, Galeocerdo cuvieri


# epsilon (ε)

In the context of adversarial attacks or perturbation-based techniques, epsilon (ε) is a small positive constant used to control the magnitude of perturbation applied to an image. It determines the maximum allowable change in pixel values for generating adversarial examples while maintaining visual similarity to the original image.

The purpose of using epsilon is to strike a balance between the effectiveness of the attack and the perceptibility of the perturbation. By setting an appropriate value for epsilon, one can control the level of distortion introduced to the image to deceive a machine learning model.

During adversarial attacks, the original image is perturbed by adding imperceptible perturbations to the pixel values. The magnitude of these perturbations is constrained by epsilon. A smaller epsilon value limits the amount of perturbation, making it less likely to be detected by humans but potentially less effective in fooling the model. On the other hand, a larger epsilon value allows for more significant perturbations, increasing the likelihood of misleading the model but potentially introducing noticeable visual changes.

The choice of epsilon depends on factors such as the sensitivity of the targeted model, the specific attack technique being employed, and the desired trade-off between stealthiness and attack success rate. It is often determined through experimentation and fine-tuning to achieve the desired level of adversarial perturbation while minimizing perceptibility.
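
The role of epsilon as a per-pixel bound can be made concrete with a few numbers. The sketch below assumes pixel values scaled to [0, 1]; the pixel values, the `direction` list, and the helper `perturb` are made up for illustration.

```python
def perturb(pixels, direction, epsilon):
    """Shift each pixel by epsilon in the given direction (+1, 0, or -1),
    then clip the result to the valid range [0, 1]."""
    return [min(1.0, max(0.0, p + epsilon * d)) for p, d in zip(pixels, direction)]

pixels = [0.00, 0.50, 0.99, 0.25]
direction = [-1, 1, 1, -1]

for epsilon in (0.01, 0.03, 0.1):
    adv = perturb(pixels, direction, epsilon)
    # Largest change of any single pixel (the L-infinity distance).
    linf = max(abs(a - p) for a, p in zip(adv, pixels))
    # Express the same change on the familiar 0-255 intensity scale.
    print(epsilon, round(linf * 255, 2))
```

No pixel ever moves by more than epsilon, and clipping can make the actual change smaller for pixels that are already near 0 or 1.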

In [39]:
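# Set the value of epsilon to 0.03. ToTensor() scales pixel values to [0, 1],
# so each pixel changes by at most 3% of the full intensity range.
epsilon = 0.03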

# Enable gradient calculation for the image tensor
image.requires_grad = True

In [41]:
# Perform forward pass through the model
output = model(image)

# Calculate the cross-entropy loss between the model's output and the target label
loss = criterion(output, torch.tensor([target_label], device=device))

In [42]:
# Reset gradients to zero
model.zero_grad()

# Perform backward pass to calculate gradients
loss.backward()

In [43]:
# Calculate the sign of the gradient of the loss with respect to the image.
# image.grad holds d(loss)/d(image): for every pixel, how the loss changes when that pixel
# changes. Its sign gives the per-pixel direction that increases the loss.
sign_grad = image.grad.sign()

# Generate the adversarial image by adding the signed gradient, scaled by epsilon
adversarial_image = image + epsilon * sign_grad

# Each pixel moves by exactly epsilon (or stays put where the gradient is zero) in the
# direction that increases the loss, so the distortion stays small.
# Note that the loss was computed against target_label (class 3), not the image's own class,
# so this step pushes the image away from class 3 rather than toward it.

# The resulting adversarial_image is a tensor with the same shape as the original image.
# The purpose is to create a visually similar image that can lead to a different prediction
# when fed into the model.
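
The signed-gradient step has two common variants, which differ in the label used for the loss and in the sign of the update. An untargeted attack adds the signed gradient of the loss for the true label. A targeted attack subtracts the signed gradient of the loss for the desired label. The following sketch shows both on a hypothetical three-class linear classifier; the weight matrix `W`, the input `x`, and the large `epsilon` are illustrative choices and are not taken from the notebook.

```python
import math

# Hypothetical 3-class linear classifier: logits[k] = sum_j W[k][j] * x[j]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]

def logits(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x):
    scores = logits(x)
    return scores.index(max(scores))

def loss_gradient(x, label):
    """Gradient of cross entropy w.r.t. x: W^T (softmax(logits) - one_hot(label))."""
    probs = softmax(logits(x))
    diff = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(diff[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, label, epsilon, targeted=False):
    """Untargeted: ascend the loss for `label`. Targeted: descend it."""
    grad = loss_gradient(x, label)
    direction = -1 if targeted else 1
    return [xi + direction * epsilon * sign(g) for xi, g in zip(x, grad)]

x = [1.0, 0.6]
true_label = predict(x)                             # class 0
untargeted = fgsm(x, true_label, epsilon=1.0)       # move away from class 0
targeted = fgsm(x, 2, epsilon=1.0, targeted=True)   # move toward class 2
print(predict(x), predict(untargeted), predict(targeted))  # 0 1 2
```

In this toy case the untargeted step lands in whichever class is nearest, while the targeted step reaches the chosen class. A single step is not guaranteed to succeed in general, which is why iterative variants are often used.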

In [44]:
# Remove the batch dimension, detach from the autograd graph, and clamp to the valid
# pixel range [0, 1] so the conversion to 8-bit values cannot overflow
adversarial_image2 = torch.squeeze(adversarial_image.detach().clamp(0, 1), 0)

# Convert the tensor to a PIL image
T = transforms.ToPILImage()
img = T(adversarial_image2)

# Save the adversarial image to disk
img.save('dog_adversarial.jpg')

In [51]:
# Predefined list of ImageNet labels
# https://deeplearning.cms.waikato.ac.nz/user-guide/class-maps/IMAGENET/

# Make a prediction on the adversarial image (step 15 of the outline)
with torch.no_grad():
    adversarial_output = model(adversarial_image)
adversarial_prediction = torch.argmax(adversarial_output)

# Print the original prediction from the model
# 207 - golden retriever
print("Original prediction:", torch.argmax(output).item())

# Print the adversarial prediction obtained after applying the perturbation
# 222 - kuvasz
print("Adversarial prediction:", adversarial_prediction.item())

Original prediction: 207
Adversarial prediction: 222
