# 🧠 Defense & Exploitation of Large Language Models (LLMs)

Welcome to this hands-on companion notebook for the course **"Defense and Exploitation of LLMs"**. Throughout this tutorial, we will explore how Large Language Models (LLMs) can be attacked, manipulated, and defended through practical techniques using a variety of transformer models (GPT-2, Phi-2, Phi-4, etc.). Some of these models are lightweight and run well on the Colab free tier, while others are larger and require extra compute, such as the A100 GPU available on Colab Pro. Throughout the course we will also show you how to use these models **offline** for your own research and development in air-gapped networks, where many of you normally work. Please run the notebook cells that best fit your situation. This notebook does not fully use our methodology for loading models, but Part 2 and Part 3 set up a complete pipeline.

In [None]:
# Run this code block for both groups (Colab or local).
# Local users: finish the environment setup described below before running this cell.
# pip installs are not needed in Colab unless specifically requested.

import transformers
from transformers import (
    AutoModel,
    AutoTokenizer,
    BertModel,
    BertTokenizer,
    DistilBertModel,
    DistilBertTokenizer,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    T5Tokenizer,
)
import torch
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 72
import seaborn as sns
import numpy as np
from IPython.display import Markdown
import warnings
transformers.logging.set_verbosity_error()
warnings.simplefilter(action='ignore', category=FutureWarning)
# Set seaborn style for nicer plots
sns.set(style='whitegrid')

# For Colab Users
## 💻 Setup and Model Loading

To authenticate with Hugging Face so you can download models:
- Log in to Hugging Face and go to https://huggingface.co/settings/tokens
- Click "+ Create new token".
- Select "Write" as the token type, name the token "responsible_ai", and create it.
- Copy the token displayed to you; it will look something like "hf_MQndTFAzdVxxxxxxxxxxxxxxxctLtWoIoaMabO".
- Open the Secrets panel in Google Colab and add a secret with Name = `HF_TOKEN` and Value = your token (for example `hf_MQndTFAzdVxxxxxxxxxxxxxxxctLtWoIoaMabO`).
- Turn on notebook access for that secret.

# For Local Python Users  
## 💻 Setup and Model Loading Instructions

### 1. Install Required Packages  
Make sure the libraries this notebook imports are installed (the T5 tokenizer additionally needs `sentencepiece`):

```bash
pip install transformers torch sentencepiece matplotlib seaborn numpy
```

### 2. Authenticate with Hugging Face  
To authenticate with Hugging Face so you can download models:
- Log in to Hugging Face and go to https://huggingface.co/settings/tokens
- Click "+ Create new token".
- Select "Write" as the token type, name the token "responsible_ai", and create it.
- Copy the token displayed to you; it will look something like "hf_MQndTFAzdVxxxxxxxxxxxxxxxctLtWoIoaMabO".
- Set the token as the `HF_TOKEN` environment variable in your Python session:

```python
import os  # needed for os.environ; not imported in the first cell

os.environ["HF_TOKEN"] = "hf_your_actual_token_here"
```
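
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the token was already exported in your shell or set by the line above, the helpers below read it from the environment and print only a masked form. `load_hf_token` and `mask_token` are hypothetical names chosen for this example, not part of any Hugging Face library.

```python
import os


def load_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment, or None if unset."""
    token = env.get("HF_TOKEN")
    if not token:
        return None
    return token.strip()


def mask_token(token, visible=4):
    """Hide all but the first few characters so the token is safe to print."""
    if token is None:
        return "<not set>"
    return token[:visible] + "*" * max(len(token) - visible, 0)


# Confirms whether a token is available without revealing it in the cell output.
print("HF_TOKEN:", mask_token(load_hf_token()))
```

Passing `env` as a parameter is only a convenience for testing the helper with a plain dictionary instead of the real environment.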

## Test Your Environment

In [None]:
print(f'PyTorch version= {torch.__version__}')
print(f'transformers version= {transformers.__version__}')
print(f'CUDA available= {torch.cuda.is_available()}')
Device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Large Language Models (LLMs) Basics
This notebook provides an introductory overview of how large language models work.

We will cover:
- What is an LLM?
- Tokenization basics
- Deep learning and embeddings
- Transformer architecture and self-attention
- How vectors become words
- Attention visualization example

## What is a Large Language Model?
Large language models (LLMs) are deep neural networks trained on vast amounts of text data to understand and generate human-like language. They learn statistical patterns of language, such as grammar, facts, and context.

LLMs work by converting text into numeric vectors, processing these vectors through many layers (typically Transformers), and then generating output text token-by-token.




## Tokenization
Tokenization breaks text into smaller units (tokens), which may be words, subwords, or characters.

This section demonstrates different tokenization methods used in NLP and LLMs:

- Word-level tokenization (simple split)
- Byte Pair Encoding (BPE) via GPT-2 tokenizer
- WordPiece tokenization via BERT tokenizer
- SentencePiece tokenization via T5 tokenizer
- Character-level tokenization

We apply each to the same example sentence to compare outputs.


In [None]:
sentence = "I do not like this movie."

print(f"Input sentence:\n{sentence}\n")

# 1. Word-level tokenization (simple split)
word_tokens = sentence.split()
print("Word-level tokens:")
print(word_tokens)
print()

# 2. BPE tokenization (GPT-2 tokenizer)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tokenizer.tokenize(sentence)
print("BPE tokens (GPT-2):")
print(gpt2_tokens)
print()

# 3. WordPiece tokenization (BERT tokenizer)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tokenizer.tokenize(sentence)
print("WordPiece tokens (BERT):")
print(bert_tokens)
print()

# 4. SentencePiece tokenization (T5 tokenizer)
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_tokens = t5_tokenizer.tokenize(sentence)
print("SentencePiece tokens (T5):")
print(t5_tokens)
print()

# 5. Character-level tokenization (manual)
char_tokens = list(sentence)
print("Character-level tokens:")
print(char_tokens)
print()
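
To make the BPE idea more concrete, here is a minimal sketch of the merge loop that BPE training is built on: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. It is a simplified character-level illustration, not GPT-2's actual byte-level implementation, and the toy corpus and its counts are invented for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across a corpus of {symbol-tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: each word split into characters, with a made-up count.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")

print(corpus)
```

Each merge adds one entry to the vocabulary, which is why frequent fragments end up as single tokens while rare words are split into several pieces.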


## Deep Learning & Embeddings: Representing Tokens as Vectors

Tokens are converted into dense numerical vectors called **embeddings**.  
Embeddings capture semantic meaning: words used in similar contexts tend to have similar vectors.

These vectors are the inputs to the neural network model.

Let's see how to get embeddings from a pretrained model.


In [None]:
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertModel.from_pretrained(model_name)

inputs = tokenizer(sentence, return_tensors="pt")
print(f"Tokenized input IDs: {inputs['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")
print()

with torch.no_grad():
    embeddings = model.embeddings.word_embeddings(inputs["input_ids"])

print(f"Embedding tensor shape: {embeddings.shape}")  # (batch_size, seq_len, embedding_dim)
print(f"Each token is represented by a {embeddings.shape[-1]}-dimensional vector.\n")

# Show the embedding for a specific token (index 0 is [CLS], so index 2 is 'do')
token_index = 2
token_id = inputs['input_ids'][0][token_index].item()
token_str = tokenizer.convert_ids_to_tokens([token_id])[0]

print(f"Embedding vector for token '{token_str}' (first 10 dimensions):")
print(embeddings[0, token_index, :10])
print("...")
print(f"Full embedding has {embeddings.shape[-1]} dimensions")
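
The claim that similar words have similar vectors is usually checked with cosine similarity. The sketch below uses small, hand-written 4-dimensional vectors that are invented purely for illustration; they are not taken from DistilBERT. With the real model, the same comparison could be made between rows of the `embeddings` tensor, for example with `torch.nn.functional.cosine_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up so that "movie" and "film" point the same way.
toy_embeddings = {
    "movie": [0.9, 0.1, 0.3, 0.0],
    "film": [0.8, 0.2, 0.4, 0.1],
    "not": [-0.1, 0.9, -0.5, 0.2],
}

for word in ("film", "not"):
    score = cosine_similarity(toy_embeddings["movie"], toy_embeddings[word])
    print(f"cosine(movie, {word}) = {score:.3f}")
```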



## Transformer Architecture Overview: The Heart of Modern LLMs

Transformers are the core architecture behind most LLMs.

Key ideas:

- **Self-Attention:** Each token looks at other tokens to understand context.
- **Feed-forward layers:** Further process the attention outputs.
- **Stacking layers:** Multiple layers enable learning complex language features.

The self-attention mechanism allows the model to weigh the importance of each token relative to others in the sequence.
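
As a rough sketch of how that weighting is computed, the code below implements scaled dot-product attention for a single head in plain Python. It makes two simplifying assumptions: the token vectors are hypothetical 2-dimensional values, and the same vectors are used as queries, keys, and values, whereas a real Transformer first passes them through learned linear projections.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return weights, outputs


# Hypothetical token vectors for a three-token sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(x, x, x)

for i, row in enumerate(weights):
    print(f"token {i} attention weights: {[round(w, 3) for w in row]}")
```

Each row of `weights` corresponds to one row of an attention heatmap: it shows how strongly that token attends to every token in the sequence.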

## Attention Mechanism in Detail & Visualization

Attention computes scores between tokens, telling the model which tokens to focus on.

Let's visualize attention weights from the last layer of the DistilBERT model for our sentence.

We average attention across all heads for simplicity.


In [None]:
# Get attention weights by running the model with output_attentions=True
with torch.no_grad():
    # Only pass the inputs that DistilBERT expects
    model_inputs = {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask']
    }
    outputs = model(**model_inputs, output_attentions=True)

# Get attention from the last layer
attentions = outputs.attentions[-1]  # Shape: (batch_size, num_heads, seq_len, seq_len)
print(f"Attention tensor shape: {attentions.shape}")
print(f"Number of attention heads: {attentions.shape[1]}")

# Average attention across all heads
avg_attention = attentions[0].mean(dim=0).cpu().numpy()

# Get the tokens for labeling
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
print(f"Tokens: {tokens}\n")

# Create attention heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    avg_attention,
    xticklabels=tokens,
    yticklabels=tokens,
    cmap="viridis",
    annot=True,
    fmt='.3f',
    cbar_kws={'label': 'Attention Weight'}
)
plt.title("Average Attention Weights (Last Layer)\nEach cell shows how much the row token attends to the column token")
plt.xlabel("Key Tokens (what we attend TO)")
plt.ylabel("Query Tokens (what is attending)")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Show some specific attention patterns
print("\nInteresting attention patterns:")
for i, query_token in enumerate(tokens):
    if query_token not in ['[CLS]', '[SEP]']:
        max_attention_idx = np.argmax(avg_attention[i])
        max_attention_token = tokens[max_attention_idx]
        max_attention_score = avg_attention[i, max_attention_idx]
        print(f"'{query_token}' attends most to '{max_attention_token}' with weight {max_attention_score:.3f}")



## Decoding: From Vectors Back to Words

After processing the input, the model predicts the next token by generating a probability distribution over its vocabulary.

It then either picks the token with the highest probability (greedy decoding) or samples from the distribution, converts the chosen ID back to a word piece, appends it to the input, and repeats, generating text token-by-token.

This iterative decoding enables text generation, translation, summarization, and more.
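
The difference between picking the most likely token and sampling can be sketched with a fixed distribution. The vocabulary and probabilities below are made up for illustration and do not come from any model; the random seed is fixed only so the cell is repeatable.

```python
import random

# Hypothetical next-token candidates and probabilities (not real model output).
vocab = ["Richmond", "the", "a", "Norfolk"]
probs = [0.55, 0.20, 0.15, 0.10]


def greedy_pick(vocab, probs):
    """Greedy decoding: always return the single most probable token."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best]


def sample_pick(vocab, probs, rng):
    """Sampling: draw one token at random, in proportion to its probability."""
    return rng.choices(vocab, weights=probs, k=1)[0]


rng = random.Random(0)
print("greedy:", greedy_pick(vocab, probs))

samples = [sample_pick(vocab, probs, rng) for _ in range(1000)]
for token in vocab:
    print(f"sampled {token!r}: {samples.count(token) / len(samples):.2%}")
```

Greedy decoding is deterministic, while sampling produces varied outputs whose frequencies approach the underlying probabilities over many draws.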


## Demonstrating the Full Prediction Cycle of an LLM

In this example, we show how a large language model predicts the next word (token) given a text prompt.

**What happens step-by-step:**

1. **Input Prompt:**  
   We start with a human-readable text input, e.g., `"The capital of Virginia is"`.  

2. **Tokenization:**  
   The prompt is broken down into tokens using the model’s tokenizer. These tokens are converted into numeric IDs that the model can process.

3. **Model Forward Pass:**  
   The tokens are fed into the pretrained language model, which outputs **logits** — raw scores representing how likely each token in the vocabulary is to come next.

4. **Next-Token Prediction:**  
   We take the logits at the last input position, which score every vocabulary token as a candidate for the next token, then apply a softmax function to convert these scores into probabilities.

5. **Top Candidate Tokens:**  
   The model ranks tokens by their probability of being the next token and we display the top choices with their associated likelihoods.
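
Steps 4 and 5 can be sketched without a model. The five-token vocabulary and the logits below are invented for this example; a real GPT-2 forward pass returns one logit for each of its roughly 50,000 vocabulary entries.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (token_id, probability) pairs for the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


# Hypothetical vocabulary and logits for the position after the prompt.
toy_vocab = [" Richmond", " the", " a", " located", " not"]
toy_logits = [3.2, 2.1, 1.7, 0.4, -1.0]

probs = softmax(toy_logits)
for token_id, prob in top_k(probs, 3):
    print(f"{toy_vocab[token_id].strip()}: {prob:.4f}")
```

Softmax preserves the ranking of the logits, so the top candidates are the same before and after normalization; only the scale changes.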

**Why this represents the full cycle:**

- The LLM does not “understand” facts or intentions internally; it **predicts the next token** solely based on learned statistical patterns from massive text data.  
- This next-token prediction is repeated token-by-token to generate fluent text, which is how models produce sentences, paragraphs, and even long documents.  
- By seeing the actual probabilities, we gain insight into the model’s “thought process” — which next words it considers most plausible given the prompt context.

This demonstration ties together all components covered so far: tokenization, embeddings (hidden inside the model), Transformer processing, and decoding predictions back into human-readable words.

It’s a hands-on view of what really happens inside an LLM when it generates language.


In [None]:
# Load tokenizer and model (GPT-2 small for speed)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Example prompt
prompt = "The capital of Virginia is"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# Forward pass to get logits
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get logits for the last token's next token prediction
last_token_logits = logits[0, -1, :]

# Convert logits to probabilities (softmax)
probs = torch.softmax(last_token_logits, dim=-1)

# Get top 10 tokens with highest probability
top_probs, top_indices = torch.topk(probs, k=10)

print(f"Prompt: {prompt}\n")
print("Top next token predictions with probabilities:")
for token_id, prob in zip(top_indices, top_probs):
    token_str = tokenizer.decode([token_id])
    print(f"{token_str.strip()}: {prob.item():.4f}")


In [None]:
# Repeat the same prediction with the larger GPT-2 Medium model to compare against GPT-2 small
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Example prompt
prompt = "The capital of Virginia is"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# Forward pass to get logits
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get logits for the last token's next token prediction
last_token_logits = logits[0, -1, :]

# Convert logits to probabilities (softmax)
probs = torch.softmax(last_token_logits, dim=-1)

# Get top 10 tokens with highest probability
top_probs, top_indices = torch.topk(probs, k=10)

print(f"Prompt: {prompt}\n")
print("Top next token predictions with probabilities:")
for token_id, prob in zip(top_indices, top_probs):
    token_str = tokenizer.decode([token_id])
    print(f"{token_str.strip()}: {prob.item():.4f}")
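
The two cells above stop at a single next-token prediction. As a sketch of how that step repeats to produce text, the loop below uses a hypothetical lookup table in place of a real model. Unlike a Transformer, which conditions on the whole context, this toy table looks only at the previous token, and all of its probabilities are invented. With the real model, `model.generate` runs a comparable loop for you.

```python
# Hypothetical stand-in for a language model: previous token -> next-token probabilities.
toy_model = {
    "The": {"capital": 0.7, "city": 0.3},
    "capital": {"of": 0.9, "is": 0.1},
    "of": {"Virginia": 0.8, "the": 0.2},
    "Virginia": {"is": 0.9, "<eos>": 0.1},
    "is": {"Richmond": 0.6, "a": 0.4},
    "Richmond": {"<eos>": 1.0},
}


def generate_greedy(model, prompt_tokens, max_new_tokens=10):
    """Append the most probable next token until <eos> or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = model.get(tokens[-1])
        if not distribution:
            break  # the toy table has no entry for this token
        next_token = max(distribution, key=distribution.get)
        if next_token == "<eos>":
            break  # end-of-sequence marker: stop generating
        tokens.append(next_token)
    return tokens


print(" ".join(generate_greedy(toy_model, ["The", "capital"])))
```

The `max_new_tokens` limit matters in practice because a model is not guaranteed to emit an end-of-sequence token on its own.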
