# 🔐 Exploring Out-of-Distribution (OOD) Attacks on AI Models
This notebook gives you hands-on experience with evaluating how AI models respond to inputs that lie outside their training distribution.

**🎯 Learning Objectives:**
- Understand the concept of OOD attacks.
- Analyze model behavior under OOD conditions.
- Explore strategies to improve model robustness and reliability.

_Please run each cell and reflect on the questions throughout. At the end, submit a short response in Canvas._

In [None]:
# 🛠️ Install dependencies (for Colab only)
# This notebook only needs PyTorch and Hugging Face Transformers
!pip install -q torch transformers

# Out-of-Distribution (OoD) Attack on GPT-2

## Overview
This notebook demonstrates how **random token perturbations** can create an **Out-of-Distribution (OoD) attack** on a **GPT-2 model**.

## Goal:
- To **randomly modify a token** in the input prompt.
- To observe **how GPT-2's response changes** due to the perturbations.
- To simulate a simple, **untargeted attack** on a language model: the perturbations are random rather than optimized against the model.

## Why is this Important?
Language models can be **sensitive to small changes in input**, and a single altered token can lead to a drastically different response. This notebook explores how **random perturbations**, accumulated over multiple iterations, cause **semantic drift** in both the prompt and the generated text.


## Step 1: Load GPT-2 Model and Tokenizer
In this cell:
- We **import the necessary libraries** (`torch`, `random`, `transformers`).
- We **load the GPT-2 model and tokenizer** from Hugging Face.
- We **set the model to evaluation mode** (`model.eval()`), which disables dropout; we only run inference and do not fine-tune the model.


In [None]:
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()




GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
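
As a sanity check on the printed architecture, the layer shapes above can be turned into a parameter count. The sketch below assumes that each `Conv1D` and `LayerNorm` carries a bias term and that `lm_head` shares its weights with `wte` (the usual GPT-2 weight tying), so the output projection is not counted twice. The helper names are illustrative only.

```python
# Parameter count for the printed GPT-2 model, derived from the layer shapes.
# Assumptions: Conv1D and LayerNorm layers have biases; lm_head is tied to wte.
VOCAB, N_CTX, D_MODEL, N_LAYERS = 50257, 1024, 768, 12

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of one dense (Conv1D) layer."""
    return n_in * n_out + n_out

def layernorm_params(dim: int) -> int:
    """Scale and shift vectors of one LayerNorm."""
    return 2 * dim

def block_params(d: int) -> int:
    """Parameters of one GPT2Block."""
    attn = linear_params(d, 3 * d) + linear_params(d, d)      # c_attn, c_proj
    mlp = linear_params(d, 4 * d) + linear_params(4 * d, d)   # c_fc, c_proj
    return attn + mlp + 2 * layernorm_params(d)               # ln_1, ln_2

embeddings = VOCAB * D_MODEL + N_CTX * D_MODEL                # wte + wpe
total = embeddings + N_LAYERS * block_params(D_MODEL) + layernorm_params(D_MODEL)
print(f"{total:,}")  # 124,439,808
```

The token embedding alone accounts for 50257 × 768 ≈ 38.6M of these parameters: one 768-dimensional row for every vocabulary entry that a random perturbation can select.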

## Step 2: Function to Generate Text from GPT-2
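This function:
- **Takes input tokens** (`input_ids`) and **generates text**.
- **Uses GPT-2's `generate` method** with sampling enabled (`do_sample=True`), so the output can differ between runs even for an identical prompt.
- **Limits the output** to `max_length` tokens (default 50), counting the prompt tokens as well as the generated ones.
- **Skips special tokens** (`skip_special_tokens=True`) when decoding, for better readability.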


In [None]:
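# Function to generate text from a model
def generate_text(input_ids, max_length=50):
    # The prompt is unpadded, so attend to every position (avoids the attention-mask warning)
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        # max_length counts prompt plus generated tokens; do_sample=True makes output vary between runs
        output = model.generate(input_ids, attention_mask=attention_mask, max_length=max_length,
                                do_sample=True, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)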
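
Because `generate_text` samples (`do_sample=True`), two calls on the same prompt can return different continuations, which matters when attributing a change in output to a perturbation. The toy sketch below uses only the standard library and a made-up next-token distribution (the tokens and probabilities are hypothetical, not GPT-2's) to contrast greedy selection with sampling, and to show that fixing a seed makes sampled draws repeatable.

```python
import random

# Hypothetical next-token distribution, for illustration only
next_token_probs = {" it": 0.5, " they": 0.3, " of": 0.15, " robots": 0.05}

def greedy_pick(probs: dict) -> str:
    """Always return the most probable token (deterministic)."""
    return max(probs, key=probs.get)

def sample_pick(probs: dict, rng: random.Random) -> str:
    """Draw one token in proportion to its probability (stochastic)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Greedy selection returns the same token on every call
print([greedy_pick(next_token_probs) for _ in range(5)])

# Sampling can vary from draw to draw, but two generators seeded
# identically produce the same sequence of draws
rng_a = random.Random(0)
rng_b = random.Random(0)
draws_a = [sample_pick(next_token_probs, rng_a) for _ in range(5)]
draws_b = [sample_pick(next_token_probs, rng_b) for _ in range(5)]
print(draws_a)
print(draws_a == draws_b)  # True
```

In the notebook itself, the analogous controls would be calling `torch.manual_seed` before generation, or passing `do_sample=False` to `generate` for greedy decoding. Either makes it easier to separate the effect of a perturbation from sampling noise.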


## Step 3: Function to Randomly Perturb a Token
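This function introduces a **random perturbation** into the input prompt by:
- **Selecting a random token position**, excluding the first and last tokens of the prompt (GPT-2's tokenizer does not add start/end tokens by default), unless `attack_position` is given.
- **Replacing it with a token drawn uniformly at random** from GPT-2's 50,257-token vocabulary.
- **Returning the modified input** for use in the next attack iteration.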


In [None]:
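# Function to introduce a random perturbation in the token sequence
def perturb_prompt(input_ids, attack_position=None):
    """
    Replaces one token in the prompt with a token drawn uniformly at random
    from the vocabulary. Returns a modified copy; the input tensor is unchanged.
    """
    # Work on a copy so the caller's tensor is not mutated in place
    perturbed = input_ids.clone()

    # Choose a random interior position. GPT-2's tokenizer adds no start/end
    # tokens by default, so this simply leaves the first and last tokens intact.
    if attack_position is None:
        seq_len = perturbed.shape[1]
        if seq_len < 3:
            raise ValueError("Prompt needs at least 3 tokens to have an interior position")
        attack_position = random.randint(1, seq_len - 2)

    # Select a random replacement token ID from the vocabulary (50257 for GPT-2)
    new_token_id = random.randrange(model.config.vocab_size)

    # Replace the chosen token with the random one
    perturbed[0, attack_position] = new_token_id

    return perturbed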


## Step 4: Performing the OoD Attack
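This function:
- **Runs for `num_iterations` (default = 20)**.
- **Randomly perturbs a token** in the input at each step. The perturbations accumulate, because each step modifies the already-perturbed prompt.
- **Generates a new response** for the modified prompt.
- **Prints the modified prompt and the response** at each iteration.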


In [None]:
# OoD Attack function (randomized perturbations)
def ood_adversarial_attack(initial_prompt, num_iterations=20):
    """
    Performs Out-of-Distribution (OoD) attack using random token perturbation.
    """
    print(f"üîπ Initial Prompt: {initial_prompt}\n")

    # Tokenize the input prompt
    input_ids = tokenizer(initial_prompt, return_tensors="pt").input_ids

    for i in range(num_iterations):
        print(f"--- Iteration {i+1} ---")

        # Perturb a random token in the prompt
        input_ids = perturb_prompt(input_ids)

        # Decode the modified prompt
        modified_prompt = tokenizer.decode(input_ids[0], skip_special_tokens=True)

        # Generate response from the perturbed input
        output_text = generate_text(input_ids)

        print(f"Modified Prompt: {modified_prompt}")
        print(f"Generated Response: {output_text}\n")
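
One way to make the drift across iterations measurable is to count how many positions of the current prompt differ from the original. The sketch below is a standard-library stand-in: it works on plain lists of token IDs (in the notebook these could come from `input_ids[0].tolist()`), and the example IDs and the `hamming_distance` and `perturb_ids` helpers are illustrative assumptions, not part of the original code. It is a lexical measure only and says nothing about meaning.

```python
import random

def hamming_distance(original: list, current: list) -> int:
    """Number of positions at which two equal-length ID lists differ."""
    if len(original) != len(current):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(original, current))

def perturb_ids(ids: list, vocab_size: int, rng: random.Random) -> list:
    """Return a copy of ids with one interior position replaced at random."""
    out = list(ids)
    pos = rng.randint(1, len(out) - 2)      # skip first and last positions
    out[pos] = rng.randrange(vocab_size)
    return out

rng = random.Random(0)
original = [101, 102, 103, 104, 105, 106, 107]   # made-up token IDs
current = list(original)
for step in range(1, 6):
    current = perturb_ids(current, vocab_size=50257, rng=rng)
    print(step, hamming_distance(original, current))
```

The distance can never exceed the number of interior positions, and it tends to grow more slowly than the iteration count because later perturbations often hit a position that was already changed.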


In [None]:
# Example: Start with a structured prompt and introduce random perturbations
initial_prompt = "Artificial intelligence can help humans because"
ood_adversarial_attack(initial_prompt, num_iterations=20)



🔹 Initial Prompt: Artificial intelligence can help humans because

--- Iteration 1 ---
📝 Modified Prompt: Art Rabb intelligence can help humans because
🤖 Generated Response: Art Rabb intelligence can help humans because it knows how to decode complex intelligence.

The group found that when an individual is very clever at something, they develop new skills on it. However, when he receives information that could aid him in doing something

--- Iteration 2 ---
📝 Modified Prompt: Art Rabb intelligenceElectric help humans because
🤖 Generated Response: Art Rabb intelligenceElectric help humans because of its ability to solve real-time problems quickly. It has multiple types of weapons, including an air-and-nuclear weapon and two laser-based weapons that can be fired in a single move. The ability

--- Iteration 3 ---
📝 Modified Prompt: Art Rabb ParliamentElectric help humans because
🤖 Generated Response: Art Rabb ParliamentElectric help humans because: They're good at sa

## Step 5: Running the Attack with an Example Prompt
- We start with **an initial structured prompt**:  
  `"Artificial intelligence can help humans because"`  
- The function runs **for 20 iterations** and **modifies a token** in each step.
- Observe how **GPT-2's responses change** due to adversarial perturbations.
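
A short calculation shows why 20 iterations leave little of the original prompt. If each iteration replaces one of n interior tokens chosen uniformly at random, a given interior token is still untouched after k iterations with probability (1 - 1/n)^k. The sketch below evaluates this and checks it by simulation. Here n = 5 assumes the example prompt splits into 7 tokens, which the recorded output suggests but which is not verified here, and the calculation ignores the negligible chance that a random replacement equals the original token.

```python
import random

def survival_probability(num_interior: int, num_iterations: int) -> float:
    """Probability that one given interior position is never selected."""
    return (1 - 1 / num_interior) ** num_iterations

def simulate_survival(num_interior, num_iterations, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability, tracking position 0."""
    rng = random.Random(seed)
    untouched = 0
    for _ in range(trials):
        # The position is hit if any iteration selects it
        hit = any(rng.randrange(num_interior) == 0 for _ in range(num_iterations))
        untouched += not hit
    return untouched / trials

p = survival_probability(5, 20)
print(f"analytic:  {p:.4f}")   # 0.0115
print(f"simulated: {simulate_survival(5, 20):.4f}")
print(f"expected untouched interior tokens: {5 * p:.3f}")  # 0.058
```

Under these assumptions, almost every interior token has been replaced at least once by the final iteration, so the last prompts share little more than their first and last tokens with the original.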


In [None]:
# Run with your own example: call ood_adversarial_attack() with a prompt of your choice

## 🧠 Reflection Questions
Answer the following questions after completing the notebook. Submit your answers to the Canvas assignment.

1. What challenges do OOD inputs pose for AI models?
2. How well did the model in this notebook perform under OOD conditions?
3. What techniques could help detect or defend against such attacks?
4. How might these vulnerabilities be exploited in real-world GenAI applications?


## ✅ Quiz Checkpoint (Optional for Self-Assessment)
**Q: Which of the following best describes an OOD attack?**
A. Providing slightly perturbed inputs to cause model misclassification  
B. Injecting code into model prompts to bypass filters  
C. Supplying data that falls outside the training distribution  
D. Encrypting user input to trick the model

**Correct Answer: C**

## 📤 Submission Instructions
- Complete all code cells and answer the reflection questions.
- Take a screenshot of your final output OR copy your reflection answers.
- Submit them to the Canvas assignment titled **“OOD Attack Lab”**.
- If you're working in Colab, download the notebook via `File > Download > .ipynb` and upload it to Canvas if required.