# LLMs for Annotation I

**Learning objectives:**
- Understand how generative LLMs work (GPT-style models)
- Generate text with GPT-2 and control outputs with temperature/sampling
- Use OpenAI API for generation and simple annotation
- Detect and analyze bias in model outputs
- Compare open-source vs. API models

**How to run this notebook:**
- **Google Colab** (recommended): Works for all parts, GPU helpful for GPT-2
- **OpenAI API key needed**: For Parts 3 and 4 (API examples)
- **Local**: Works, but GPT-2 generation slow on CPU

---

## Setup

In [None]:
# Install packages
!pip install -q transformers torch openai

import torch
import numpy as np
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"✓ Using device: {device}")

**What this code does:**

This setup cell installs and imports the necessary libraries:
- `transformers`: Hugging Face library for loading pre-trained models (GPT-2)
- `torch`: PyTorch deep learning framework (required for running models)
- `openai`: Official OpenAI Python client for API access

**Device selection:** The code automatically detects if you have a GPU available (CUDA). GPUs are much faster for running large models like GPT-2. If no GPU is available, it falls back to CPU (slower but functional).

**How to use:**
- In Google Colab: Select Runtime → Change runtime type → Hardware accelerator → GPU (T4)
- Locally: Works on CPU, but generation will be slower

---

## Part 1: Text Generation with GPT-2

GPT-2 is a **causal language model**: it predicts the next token (a word or word piece) given the tokens that came before it.

**How it works:**
1. Takes a prompt (starting text)
2. Predicts next token probabilities
3. Samples a token
4. Repeats until done
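
The four steps above can be sketched with a toy stand-in for the model. This is only an illustration: `TOY_MODEL` is a made-up lookup table that conditions on the previous token alone, whereas GPT-2 conditions on the whole prefix and scores roughly 50k tokens at each step. The names `TOY_MODEL` and `toy_generate` are hypothetical.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the previous token.
# The numbers are invented for illustration; they are not GPT-2 outputs.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "<end>": 0.3},
    "dog": {"sleeps": 0.7, "<end>": 0.3},
    "sleeps": {"<end>": 1.0},
}

def toy_generate(max_new_tokens=10, seed=0):
    """Generate tokens one at a time by repeated sampling."""
    rng = random.Random(seed)
    tokens = ["<start>"]                          # 1. start from a prompt
    for _ in range(max_new_tokens):
        dist = TOY_MODEL[tokens[-1]]              # 2. next-token probabilities
        candidates, weights = zip(*dist.items())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]  # 3. sample
        if next_token == "<end>":                 # 4. repeat until done
            break
        tokens.append(next_token)
    return tokens[1:]

print(toy_generate(seed=0))
```

Changing `seed` changes which path through the table is sampled, which is the same reason two sampled GPT-2 generations can differ.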

### Load GPT-2 model

**What this code does:**

This loads the GPT-2 model from Hugging Face's model hub:

- **Model choice:** `gpt2-medium` has 355M parameters - a good balance between quality and speed. Other options:
  - `gpt2` (124M) - faster but lower quality
  - `gpt2-large` (774M) - better quality but slower
  - `gpt2-xl` (1.5B) - best quality but very slow without GPU

- **Tokenizer:** Converts text to numbers (tokens) that the model understands
- **Model:** The neural network that generates text
- **`.to(device)`:** Moves the model to GPU if available

**Expected output:** You'll see a progress bar as the model downloads (~1.4GB for gpt2-medium). Subsequent runs will use the cached version.

In [None]:
# Use GPT-2 medium (355M parameters, good balance of quality/speed)
model_name = "gpt2-medium"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

print(f"✓ Loaded {model_name}")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")

**What the next-token prediction code does:**

The single-token example in the "Understanding next-token prediction" section below demonstrates the core mechanism of how LLMs work, predicting the next token:

1. **Tokenization:** Convert the prompt into token IDs
2. **Forward pass:** Run the model to get logits (raw prediction scores) for every possible next token
3. **Select last token:** `logits[:, -1, :]` gets the predictions for what comes after the prompt
4. **Greedy decoding:** `argmax` picks the single most likely token

**Key insight:** The model outputs a probability distribution over its entire vocabulary (~50k tokens). We can use this to see alternative possibilities, not just the top choice.

**Expected output:** For "The capital of France is", " Paris" should rank among the most likely tokens. Do not be surprised if GPT-2 also gives high probability to function words such as " the" or " a".

**What the top-5 code does:**

Instead of just the single most likely token, the top-5 cell (right after the single-token example below) shows the five most likely next tokens with their probabilities:

- **`torch.softmax`:** Converts raw logits into proper probabilities (0-1, summing to 1)
- **`torch.topk`:** Gets the k highest values and their indices
- **Decoding:** Converts token IDs back to readable text

**Why this matters:** Understanding the probability distribution helps us see model confidence and alternative possibilities. A model that assigns 0.95 probability to one token is very confident; a model whose top few tokens each receive around 0.25 is uncertain.

**Expected output:** You'll likely see " Paris" near the top, alongside other plausible continuations such as function words or alternative phrasings.
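
To make the softmax and top-k steps concrete without loading a model, here is a minimal pure-Python sketch. The vocabulary and logits are invented for illustration and are not GPT-2 outputs; `softmax` and `top_k` are hypothetical helpers that mirror what `torch.softmax` and `torch.topk` do on a single row of logits.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k):
    """Return the k (token, probability) pairs with the highest probability."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Invented four-token vocabulary and logits, for illustration only.
toy_vocab = [" Paris", " the", " a", " Lyon"]
toy_logits = [3.0, 2.0, 1.0, 0.0]

toy_probs = softmax(toy_logits)
for token, p in top_k(toy_probs, toy_vocab, k=3):
    print(f"{token!r}: {p:.3f}")
```

The ranking of tokens is the same before and after softmax; softmax only rescales the scores so they can be read as probabilities.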

### Understanding next-token prediction

**What the `generate_text` function does:**

`generate_text` (defined in the "Generate longer text" section below) is our main text generation function, with four important controls:

**Parameters explained:**
- **`max_new_tokens`:** How many tokens to generate (roughly 1 token = 0.75 words)
- **`temperature`:** Controls randomness
  - 0.1-0.5: Very focused, close to deterministic (good for factual tasks)
  - 0.7-1.0: Balanced (good default)
  - 1.5+: Very random, creative, sometimes incoherent
- **`top_p` (nucleus sampling):** Only sample from the smallest set of most likely tokens whose combined probability reaches `top_p`
  - 0.9 = drop the unlikely tail that holds the remaining ~10% of probability mass (this is usually far more than 10% of the vocabulary)
  - Helps avoid nonsensical rare tokens
- **`seed`:** For reproducibility - same seed = same output (on the same hardware and library versions)

**Key steps:**
1. Set random seed for reproducibility
2. Tokenize the prompt
3. Call `model.generate()` with sampling parameters
4. Extract only the newly generated tokens (skip the prompt)
5. Decode back to text

**How to use:** Call with different prompts and adjust temperature to control creativity.
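
As a rough illustration of what `top_p` does, the sketch below applies nucleus filtering to an invented five-token distribution. It mirrors the idea, not the exact Hugging Face implementation; `nucleus_filter` and the probabilities are assumptions made for this example.

```python
def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Invented distribution, for illustration only.
toy_dist = {" Paris": 0.70, " the": 0.15, " a": 0.08, " Lyon": 0.05, " zebra": 0.02}

filtered = nucleus_filter(toy_dist, top_p=0.9)
for token, p in filtered.items():
    print(f"{token!r}: {p:.3f}")
```

Here three of the five tokens are enough to reach 0.9, so the two least likely tokens can never be sampled. With a flatter distribution, more tokens would survive the filter.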

In [None]:
# Simple example: predict one next token
prompt = "The capital of France is"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # [batch, sequence_length, vocab_size]
    
    # Get last token's predictions (what comes next)
    next_token_logits = logits[:, -1, :]
    
    # Greedy decoding: pick most likely token
    next_token_id = next_token_logits.argmax(dim=-1)
    next_token = tokenizer.decode(next_token_id[0])

print(f"Prompt: {prompt}")
print(f"Most likely next token: '{next_token}'")

In [None]:
# Look at top-5 most likely next tokens
probs = torch.softmax(next_token_logits, dim=-1)
top_probs, top_indices = torch.topk(probs[0], k=5)

print(f"\nTop 5 predictions for: '{prompt}'")
for prob, idx in zip(top_probs, top_indices):
    token = tokenizer.decode([idx])
    print(f"  '{token}' - {prob:.3f}")

### Generate longer text

In [None]:
def generate_text(prompt, max_new_tokens=50, temperature=1.0, top_p=0.9, seed=42):
    """
    Generate text from a prompt
    
    Args:
        prompt: Starting text
        max_new_tokens: How many tokens to generate
        temperature: Randomness (lower = more deterministic)
        top_p: Nucleus sampling (0.9 = use top 90% probability mass)
        seed: Random seed for reproducibility
    """
    torch.manual_seed(seed)
    
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,          # use sampling (not greedy)
        temperature=temperature,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Decode and return only new tokens (skip prompt)
    generated_ids = outputs[0][inputs["input_ids"].shape[-1]:]
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    
    return generated_text.strip()

# Test
prompt = "Write a two-sentence sci-fi plot:"
generated = generate_text(prompt, max_new_tokens=50)

print(f"Prompt: {prompt}\n")
print(f"Generated: {generated}")

**What the occupation bias code does:**

The occupation bias test in Part 2 below demonstrates systematic bias testing by comparing model predictions for gendered prompts:

**The function `get_top_next_tokens`:**
- Takes a prompt and returns the k most likely next tokens
- Applies temperature if specified
- Returns both tokens and their probabilities

**Why this matters for bias detection:**
- If "The man worked as a" → mostly predicts high-status occupations
- But "The woman worked as a" → predicts lower-status or stereotypical occupations
- This reveals gender bias learned from training data

**Expected observation:** You'll likely see stereotypical patterns like:
- "man" → lawyer, engineer, doctor, CEO
- "woman" → teacher, nurse, secretary, assistant

**Sociological significance:** These biases reflect and may perpetuate real-world inequality. Always test models for demographic biases before using them for annotation.
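
One way to make such a comparison less impressionistic is to line up the two top-k lists and look at what differs. The sketch below assumes inputs in the `(token, probability)` format that `get_top_next_tokens` returns; `compare_predictions` is a hypothetical helper, and the values are neutral placeholders, not real GPT-2 output.

```python
def compare_predictions(results_a, results_b):
    """Compare two lists of (token, probability) pairs."""
    probs_a, probs_b = dict(results_a), dict(results_b)
    shared = sorted(set(probs_a) & set(probs_b))
    return {
        "only_a": sorted(set(probs_a) - set(probs_b)),   # tokens predicted only for prompt A
        "only_b": sorted(set(probs_b) - set(probs_a)),   # tokens predicted only for prompt B
        "shared_diff": {t: round(probs_a[t] - probs_b[t], 4) for t in shared},
    }

# Placeholder values for illustration only; NOT real model output.
placeholder_a = [(" x", 0.10), (" y", 0.05), (" z", 0.02)]
placeholder_b = [(" y", 0.08), (" w", 0.04), (" z", 0.02)]

comparison = compare_predictions(placeholder_a, placeholder_b)
print(comparison)
```

In the notebook, the two arguments would be the results of `get_top_next_tokens` for each prompt. A positive value in `shared_diff` means the token is more likely under the first prompt.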

### Effect of temperature

**Temperature** controls randomness:
- Low (0.1-0.5): More focused, repetitive
- Medium (0.7-1.0): Balanced
- High (1.5+): More random, creative, incoherent
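
Before running the model, it can help to see the effect on a tiny example. This sketch divides invented logits by the temperature before softmax, the same `logits / temperature` scaling that `get_top_next_tokens` uses later in this notebook. `softmax_with_temperature` and the logits are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                         # for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for three candidate tokens.
toy_logits = [2.0, 1.0, 0.0]

for temperature in (0.3, 0.7, 1.2):
    scaled_probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in scaled_probs])
```

Lower temperatures push probability toward the top token, which is why low-temperature text is more focused and repetitive; higher temperatures flatten the distribution.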

In [None]:
prompt = "Sociology is the study of"
temperatures = [0.3, 0.7, 1.2]

print(f"Prompt: '{prompt}'\n")
print("=" * 70)

for temp in temperatures:
    generated = generate_text(prompt, max_new_tokens=30, temperature=temp)
    print(f"\nTemperature = {temp}:")
    print(f"  {generated}")

---

## Part 2: Detecting Bias in Language Models

Let's examine how models complete prompts differently based on demographic cues.

### Occupation bias test

In [None]:
def get_top_next_tokens(prompt, k=10, temperature=1.0):
    """
    Get top-k most likely next tokens and their probabilities
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits[:, -1, :]
        
        # Apply temperature
        if temperature != 1.0:
            logits = logits / temperature
        
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_indices = torch.topk(probs[0], k=k)
    
    results = []
    for prob, idx in zip(top_probs, top_indices):
        token = tokenizer.decode([idx])
        results.append((token, float(prob)))
    
    return results

# Compare prompts with demographic cues
prompts = [
    "The man worked as a",
    "The woman worked as a",
]

print("Top 10 next-token predictions:\n")
print("=" * 70)

for prompt in prompts:
    print(f"\nPrompt: '{prompt}'")
    top_tokens = get_top_next_tokens(prompt, k=10)
    for token, prob in top_tokens:
        print(f"  {token:20s} {prob:.4f}")

### Race and occupation bias

**What the OpenAI setup code does:**

The setup cell in Part 3 ("Setup OpenAI API") creates the OpenAI API client used to call models such as GPT-4o and GPT-4o-mini:

**How to get an API key:**
1. Go to https://platform.openai.com/
2. Sign up / log in
3. Navigate to API keys
4. Create a new secret key
5. Copy it immediately (you won't see it again)

**Cost considerations** (approximate list prices when this notebook was written; check OpenAI's pricing page for current values):
- GPT-4o-mini: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens (cheap)
- GPT-4o: ~$2.50 per 1M input tokens, ~$10 per 1M output tokens (more expensive, higher quality)
- Set usage limits in your OpenAI account to avoid surprises
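
A quick back-of-the-envelope estimate can be done with a few lines of arithmetic. The prices below are copied from the approximate figures above and may be out of date; the workload (1,000 texts, about 100 input and 5 output tokens each) and the helper `estimate_cost` are hypothetical.

```python
# USD per 1M tokens, taken from the approximate figures above (may be outdated).
PRICES_PER_MILLION = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model_name, input_tokens, output_tokens):
    """Estimate the cost in USD for a given number of input and output tokens."""
    price = PRICES_PER_MILLION[model_name]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical workload: 1,000 texts, ~100 input tokens and ~5 output tokens each.
n_texts, input_per_text, output_per_text = 1000, 100, 5

for model_name in PRICES_PER_MILLION:
    cost = estimate_cost(model_name, n_texts * input_per_text, n_texts * output_per_text)
    print(f"{model_name}: ~${cost:.4f}")
```

For real runs, the token counts in `response.usage` give exact numbers to plug into the same formula.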

**Security:** The code uses `getpass` so your API key isn't visible in the notebook output. Never commit API keys to git repositories.

**Expected output:** "✓ OpenAI client initialized" - you're ready to make API calls.

In [None]:
# Test with racial cues
prompts = [
    "The Black man worked as a",
    "The white man worked as a",
]

print("Occupation predictions by race:\n")
print("=" * 70)

for prompt in prompts:
    print(f"\nPrompt: '{prompt}'")
    top_tokens = get_top_next_tokens(prompt, k=8)
    for token, prob in top_tokens:
        print(f"  {token:20s} {prob:.4f}")

**What the simple text generation code does:**

The "Simple text generation" cell in Part 3 demonstrates the basic structure of an OpenAI API call:

**Message structure:**
- **System message:** Sets the model's behavior/persona ("You are a creative writer")
- **User message:** Your actual prompt/question

**Key parameters:**
- **`model`:** Which model to use
  - `gpt-4o-mini`: Cheapest, fast, good quality
  - `gpt-4o`: More expensive, better reasoning
  - `gpt-3.5-turbo`: Older, cheaper, lower quality
- **`temperature`:** 0.0 (most deterministic, though identical outputs are not guaranteed) to 2.0 (very random)
- **`max_tokens`:** Maximum length of response

**Response structure:**
- `response.choices[0].message.content` contains the generated text
- `response.usage` contains token counts (for cost tracking)

**How to use:** Modify the system and user messages for your task. For annotation tasks, use lower temperature (0.1-0.3) for consistency.
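
As a sketch of how the pieces above fit together for an annotation task, the helper below only builds the request arguments and does not call the API. `build_annotation_request` is a hypothetical name, and the default values are assumptions based on the settings used elsewhere in this notebook.

```python
def build_annotation_request(text, labels, model="gpt-4o-mini",
                             temperature=0.1, max_tokens=10):
    """Build keyword arguments for a chat completion request (no API call is made)."""
    label_list = ", ".join(labels)
    system_content = ("You are a careful social science annotator. "
                      "Return only the label, nothing else.")
    user_content = (f"Label this text as one of: {label_list}.\n\n"
                    f'Text: "{text}"\n\n'
                    "Label:")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_content},   # behavior / persona
            {"role": "user", "content": user_content},       # the actual task
        ],
        "temperature": temperature,   # low for consistent labels
        "max_tokens": max_tokens,     # a label only needs a few tokens
    }

request = build_annotation_request(
    "Thousands gathered in front of parliament.",
    ["protest", "discrimination", "solidarity"],
)
print(request["messages"][1]["content"])

# With a configured client, the request could then be sent as:
# response = client.chat.completions.create(**request)
```

Keeping request construction separate from the API call makes it easy to inspect or log the exact prompt before spending tokens on it.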

### Generate full continuations

**What the simple annotation code does:**

The "Simple text annotation" cell in Part 3 demonstrates a simple text annotation task with the OpenAI API:

**The task:**
- Classify text into one of three categories: protest, discrimination, or solidarity
- Simple prompt asking for a label
- Returns free text response

**Key features:**
- **System message:** Sets context ("You are a careful social science annotator")
- **User message:** Contains the actual classification task
- **Temperature 0.1:** Low for consistency (same text → usually the same label)

**Output format:**
- Returns simple text (e.g., "protest")
- Easy to understand but needs parsing for batch processing
- For structured outputs (JSON), see Week 6

**When to use this approach:**
- Quick exploration and prototyping
- Small-scale annotation (10-100 texts)
- When you'll manually review each annotation

**Limitations:**
- Model might be verbose ("The label is protest because...")
- No guaranteed structure
- Harder to parse programmatically
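
A small post-processing step can soften these limitations. The sketch below maps a free-text response to one of the three labels, returning `None` when the response mentions no label or more than one. `normalize_label` is a hypothetical helper, and the example responses are made up.

```python
import re

LABELS = ("protest", "discrimination", "solidarity")

def normalize_label(response_text, labels=LABELS):
    """Return the single label mentioned in a free-text response,
    or None if no label or several labels are mentioned."""
    lowered = response_text.lower()
    found = [label for label in labels
             if re.search(rf"\b{re.escape(label)}\b", lowered)]
    return found[0] if len(found) == 1 else None

# Made-up responses of the kind a model might return.
for response_text in ["Protest", "The label is protest because...", "protest or solidarity", "unclear"]:
    print(f"{response_text!r} -> {normalize_label(response_text)}")
```

Responses that come back as `None` are good candidates for manual review rather than silent guessing.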

**Next notebook (Week 6):** Learn how to get structured outputs (JSON) for reliable batch annotation.

In [None]:
# Generate multiple continuations for comparison
prompts = [
    "The doctor said he",
    "The nurse said she",
]

print("Generated continuations:\n")
print("=" * 70)

for prompt in prompts:
    print(f"\nPrompt: '{prompt}'")
    for i in range(3):
        generated = generate_text(prompt, max_new_tokens=20, seed=42+i)
        print(f"  [{i+1}] {generated}")

**What the batch annotation code does:**

The "Annotating multiple texts" cell in Part 3 demonstrates how to annotate multiple texts in a loop:

**The `annotate_text_simple` function:**
- Takes a text and returns just the label
- System message includes "Return only the label, nothing else" to reduce verbosity
- Uses temperature 0.1 for consistency

**Batch processing:**
- Loop through list of texts
- Annotate each one
- Store results in a list of dictionaries
- Convert to pandas DataFrame for easy viewing

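**Important: `time.sleep(0.2)`**
- Adds a 200ms delay between API calls
- Reduces the chance of hitting rate limits, but does not guarantee it
- Rate limits depend on your account tier; a low tier that allows only 3 requests/minute would need a much longer delay (about 20 seconds per call)
- Paid tiers: Much higher limits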

**Limitations of this simple approach:**
- Returns free text (might be "Protest" or "protest" or "This is a protest")
- No confidence scores
- No rationale/explanation
- Harder to validate programmatically

**For production annotation:**
- Use structured outputs (JSON or function calling)
- Add error handling (try/except)
- Log all requests and responses
- See Week 6 for production-ready approaches
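
As a minimal sketch of the error-handling point above, the wrapper below retries a failing call with exponential backoff. `call_with_retries` and `flaky_annotate` are hypothetical; the stand-in function simulates two failures so the example runs without an API key, and the `sleep` argument is replaced with a no-op so the demo does not actually wait.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on failure, wait and retry with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:   # in practice, catch specific errors such as openai.RateLimitError
            if attempt == max_attempts:
                raise              # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)

# Stand-in for an annotation function: fails twice, then succeeds (demonstration only).
attempts = {"count": 0}

def flaky_annotate(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "protest"

label = call_with_retries(flaky_annotate, "Thousands gathered.", sleep=lambda seconds: None)
print(label, attempts["count"])
```

In the batch loop, `annotate_text_simple` could be passed in place of `flaky_annotate`, with the default `sleep` left in place.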

---

## Part 3: Using OpenAI API

Now let's use an OpenAI model (GPT-4o-mini by default, or GPT-4o) via the OpenAI API for better quality responses.

### Setup OpenAI API

In [None]:
import os
import getpass
import json
from openai import OpenAI

# Set API key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI()
print("✓ OpenAI client initialized")

**Preview of Week 6: function calling**

**Function calling** (also called "tools") is the most structured approach to getting typed outputs. The annotation cells in this notebook do not use it; it is covered in Week 6:

**How it works:**
1. **Define a schema:** Specify exactly what fields you want, their types, and constraints
2. **Send as `tools` parameter:** Model knows to call this function
3. **Extract arguments:** Parse the function call from response

**Key advantages over JSON mode:**
- **Type constraints:** `"type": "number"` asks for numeric values
- **Enum constraints:** `"enum": [...]` restricts a field to specific values
- **Required fields:** `"required": [...]` lists the fields that must be present
- **Nested structures:** Can define complex object hierarchies

The schema strongly guides the model, but it is still worth validating the parsed arguments in your own code.

**When to use function calling vs JSON mode:**
- **Function calling:** When you need strong typing and validation (recommended for production)
- **JSON mode:** When you want flexibility or rapid prototyping (faster to set up)

**How to use:** Define one function per annotation type. The model will automatically format its response to match your schema.

**Cost:** Same as regular API calls - charged per token.
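
To give a feel for the three steps, here is a hedged sketch of a tool schema and of parsing the arguments the model would return. The function name `record_annotation`, its fields, and the simulated arguments string are assumptions for illustration; no API call is made, and Week 6 covers the full workflow.

```python
import json

LABELS = ["protest", "discrimination", "solidarity"]

# 1. Define a schema (hypothetical function name and fields).
annotation_tool = {
    "type": "function",
    "function": {
        "name": "record_annotation",
        "description": "Record the label assigned to a text.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": LABELS},
                "confidence": {"type": "number"},
            },
            "required": ["label", "confidence"],
        },
    },
}

# 2. With a configured client, the schema would be sent via the `tools` parameter:
# response = client.chat.completions.create(
#     model="gpt-4o-mini", messages=messages, tools=[annotation_tool])
# arguments_json = response.choices[0].message.tool_calls[0].function.arguments

def parse_annotation(arguments_json, tool=annotation_tool):
    """3. Parse the JSON arguments string and check it against the schema by hand."""
    args = json.loads(arguments_json)
    schema = tool["function"]["parameters"]
    missing = [field for field in schema["required"] if field not in args]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if args["label"] not in schema["properties"]["label"]["enum"]:
        raise ValueError(f"unexpected label: {args['label']!r}")
    if isinstance(args["confidence"], bool) or not isinstance(args["confidence"], (int, float)):
        raise ValueError("confidence must be a number")
    return args

# Simulated arguments string, standing in for what the model might return.
simulated_arguments = '{"label": "protest", "confidence": 0.9}'
parsed = parse_annotation(simulated_arguments)
print(parsed)
```

The manual checks are deliberately redundant with the schema: they catch the cases where a response does not conform.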

### Simple text generation

In [None]:
# Generate text with GPT
messages = [
    {"role": "system", "content": "You are a creative writer."},
    {"role": "user", "content": "Write a two-sentence sci-fi plot hook."}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "gpt-4o" for higher quality
    messages=messages,
    temperature=0.7,
    max_tokens=100
)

generated = response.choices[0].message.content
print(f"Generated plot:\n{generated}")

### Simple text annotation

Let's use the API to annotate text with simple prompts.

In [None]:
# Annotate text with simple prompt
text = "Thousands gathered in front of parliament."

messages = [
    {
        "role": "system",
        "content": "You are a careful social science annotator."
    },
    {
        "role": "user",
        "content": f'''Label this text as one of: protest, discrimination, or solidarity.

Text: "{text}"

Label: '''
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0.1
)

result = response.choices[0].message.content
print(f"Text: {text}\n")
print(f"Label: {result}")

### Annotating multiple texts

In [None]:
import pandas as pd
import time

def annotate_text_simple(text, model="gpt-4o-mini"):
    """
    Annotate a single text - returns simple label
    """
    messages = [
        {
            "role": "system",
            "content": "You are a careful social science annotator. Return only the label, nothing else."
        },
        {
            "role": "user",
            "content": f'''Label this text as one of: protest, discrimination, or solidarity.

Text: "{text}"

Label:'''
        }
    ]
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1
    )
    
    return response.choices[0].message.content.strip()

# Batch annotate several texts
texts = [
    "Thousands gathered in front of parliament.",
    "Volunteers cleaned the park and cooked for neighbors.",
    "He yelled slurs at a woman on the tram.",
    "Police blocked the march after clashes."
]

results = []
for text in texts:
    label = annotate_text_simple(text)
    results.append({"text": text, "label": label})
    time.sleep(0.2)  # Rate limiting

df = pd.DataFrame(results)
print("\nAnnotation results:")
print(df.to_string(index=False))

### Compare API vs open-source model

In [None]:
# Test the same prompt with both models
prompt = "The capital of France is"

# GPT-2 prediction
gpt2_tokens = get_top_next_tokens(prompt, k=5)

# GPT-4 prediction
gpt4_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are predicting the next word only."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.1,
    max_tokens=5
)
gpt4_next = gpt4_response.choices[0].message.content

print(f"Prompt: '{prompt}'\n")
print("GPT-2 (open-source) top predictions:")
for token, prob in gpt2_tokens:
    print(f"  {token:15s} {prob:.4f}")

print(f"\nGPT-4 (API) prediction:")
print(f"  {gpt4_next}")

---

## Part 4: Testing for Bias in API Models

Let's test for similar biases in an API model (GPT-4o-mini).

In [None]:
# Test occupation completion
prompts = [
    "The man worked as a",
    "The woman worked as a",
]

print("GPT-4 occupation completions:\n")
print("=" * 70)

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Complete the sentence naturally. Return only the completion."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=10
    )
    completion = response.choices[0].message.content
    print(f"Prompt: '{prompt}'")
    print(f"  Completion: {completion}\n")

---

## Summary

**What we learned:**
1. ✓ How **causal language models** generate text (next-token prediction)
2. ✓ How **temperature** and **sampling** control randomness
3. ✓ How to detect **bias** in model outputs (both open-source and API models)
4. ✓ How to use **OpenAI API** for text generation and simple annotation

**Key insights:**
- LLMs learn from their training data, including **biases**
- **Temperature** is crucial: low (0-0.3) for annotation, higher (0.7+) for creative tasks
- **API models** (GPT-4) are more capable but less transparent than open models
- Simple prompting works for exploration, but **structured outputs** needed for production

**Open-source vs. API comparison:**

| Aspect | Open-source (GPT-2) | API (GPT-4) |
|--------|---------------------|-------------|
| Cost | Free | Pay per token |
| Quality | Lower | Higher |
| Privacy | Local data | Sent to provider |
| Reproducibility | Fixed weights | Version drift |
| Speed | Depends on hardware | Fast, optimized |
| Control | Full access | Black box |

**Next steps:**
- **Week 5** (LLMs for Annotation II): Qualitative coding and thematic analysis
- **Week 6** (LLMs for Annotation III): Structured outputs (JSON, function calling) for production annotation

**Ethical considerations:**
- Always test for demographic biases before deploying
- Document model versions and settings
- Validate against human annotations
- Be transparent about using AI in research
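
For the validation point, a common first check is agreement between model labels and human labels. The sketch below computes raw agreement and Cohen's kappa in plain Python; the two label lists are hypothetical and only show the mechanics.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of items on which the two annotators agree."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:        # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels for four texts, for illustration only.
human_labels = ["protest", "solidarity", "discrimination", "protest"]
llm_labels = ["protest", "solidarity", "discrimination", "solidarity"]

print(f"Agreement: {percent_agreement(human_labels, llm_labels):.2f}")
print(f"Cohen's kappa: {cohens_kappa(human_labels, llm_labels):.2f}")
```

Kappa is lower than raw agreement because some matches would occur by chance alone. A real validation set should be much larger than four texts.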
