# Structured Text Annotation with LLMs

This notebook demonstrates how to reliably extract structured outputs from LLMs for text annotation tasks in social science research.

## Learning Objectives

- Master **prompt formatting** strategies (f-strings, templates, few-shot, chain-of-thought)
- Understand **three approaches** to structured outputs (JSON)
  1. **Prompt for JSON** (simplest, least reliable)
  2. **JSON Mode API** (recommended default)
  3. **Function Calling** (most structured, type-safe)
- Implement **robust JSON extraction** with error handling and retries
- Perform **batch annotation** with quality checks and logging
- Apply **mixture of experts** (ensemble) approaches for increased reliability
- Follow **replication best practices** (validation, logging, fingerprinting)

## Setup

### Running in Google Colab
1. Upload this notebook to Google Colab
2. Run the installation cell below
3. You'll be prompted to enter your OpenAI API key

### Running Locally
1. Install requirements: `pip install openai pandas scikit-learn numpy`
2. Set environment variable: `export OPENAI_API_KEY="your-key-here"`
3. Run notebook with Jupyter: `jupyter notebook week6_structured_annotation_colab.ipynb`

In [None]:
# Install required packages (uncomment if needed)
# !pip install -q "openai>=1.40.0" pandas scikit-learn numpy

In [None]:
import json
import os
import hashlib
import re
from datetime import datetime
from collections import Counter

import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, accuracy_score, confusion_matrix
import getpass

# Set your OpenAI API key
# For Colab: you'll be prompted to enter it
# For local: set OPENAI_API_KEY environment variable
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

client = OpenAI()  # reads OPENAI_API_KEY from environment

print("✓ Setup complete!")

**What this code does:**

Sets up the complete environment for production-grade LLM annotation:

**Key libraries:**
- **`openai`**: API access to GPT models
- **`pandas`**: Tabular data manipulation (annotation results)
- **`scikit-learn`**: Validation metrics (Cohen's kappa, confusion matrix)
- **`hashlib`**: Model fingerprinting to detect API drift
- **`datetime`**: Timestamping for reproducibility

**Why this setup is more comprehensive than previous notebooks:**
- Includes validation metrics (Cohen's kappa)
- Supports model fingerprinting (detect when API changes)
- Designed for research-grade reproducibility

**When to use each metric:**
- **Cohen's kappa**: Inter-rater reliability (target: κ > 0.80)
- **Confusion matrix**: Where disagreements occur
- **Fingerprinting**: Detect if model behavior changes over time

**Security reminder:** Use `getpass` for API keys, never hardcode them.

**Expected output:** "✓ Setup complete!" - you're ready for structured annotation workflows.

---

## Part 1: Prompt Formatting Strategies

Before we get structured outputs, let's review simple ways to format prompts for text annotation tasks.

### 1A. Simple f-string Prompting

The most basic approach: use Python f-strings to insert text into a prompt template.

**What this code does:**

Demonstrates the **simplest** prompting approach using Python f-strings:

**How f-strings work:**
- `f"string with {variable}"` inserts variable values directly
- Simple, readable, familiar to Python developers

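**Key parameter: temperature=0**
- For annotation tasks, use temperature 0 (as close to deterministic as the API allows)
- The same input usually yields the same output, which matters for reproducibility; small run-to-run differences can still occur
- Higher temperatures (0.7+) suit creative tasks, not annotation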

**Limitations of this approach:**
- No structured output (returns free text)
- Hard to parse programmatically
- Model might add extra explanation
- Not suitable for batch processing

**When to use:**
- Quick prototyping
- Interactive exploration
- Single-shot annotations where you'll read the output

**For production:** Use structured outputs (Approach 2 or 3 later in this notebook)

In [None]:
# Sample political texts for annotation
texts = [
    "We must invest in renewable energy now!",
    "Cut taxes and reduce business regulations",
    "Healthcare is a human right for all citizens",
    "Maintain current spending levels and balanced budget"
]

# Simple f-string approach
text = texts[0]
prompt = f"""Classify the political stance of this text as:
- Progressive
- Conservative
- Centrist

Text: {text}
Stance:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0  # Deterministic for annotation
)

print(f"Text: {text}")
print(f"Stance: {response.choices[0].message.content}")

### 1B. Reusable Template with .format()

For batch processing, create a template you can reuse across multiple texts.

**What this code does:**

Improves on f-strings by using **reusable templates** with `.format()`:

**Why templates are better:**
- Define prompt once, reuse for all texts
- Ensures consistency across annotations
- Easy to modify prompt for entire batch
- Supports batch processing naturally

**How `.format()` works:**
- Template contains `{variable_name}` placeholders
- `.format(variable_name=value)` fills them in
- More explicit than f-strings (clearer what's being inserted)

**Best practices for template prompts:**
1. Keep instructions consistent across all texts
2. Clearly define categories (Progressive/Conservative/Centrist)
3. Place the variable text at the end, after the instructions (keeps the input from being read as part of the instructions)
4. Use imperative tone ("Classify the...") not questions

**Still limited:**
- Output is free text, not structured
- Need parsing logic to extract stance
- Model might be verbose or inconsistent

**Next steps:** Add few-shot examples or use JSON mode (Part 2)
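Because the template approach still returns free text, some parsing step is needed before the labels can be analyzed. The following is a minimal sketch of such a step, not part of the original notebook: `normalize_stance` is a hypothetical helper that maps a reply to one of the three labels and returns `None` when the reply is ambiguous.

```python
import re

# Hypothetical helper: maps a free-text model reply to one of the notebook's three labels.
LABELS = ("Progressive", "Conservative", "Centrist")

def normalize_stance(reply):
    """Return the single label mentioned in `reply`, or None if zero or several appear."""
    found = [
        label for label in LABELS
        if re.search(rf"\b{label}\b", reply, flags=re.IGNORECASE)
    ]
    return found[0] if len(found) == 1 else None

print(normalize_stance("Stance: progressive."))
print(normalize_stance("Could be Centrist or Conservative"))
```

Returning `None` for ambiguous replies keeps them visible for manual review instead of silently picking a label.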

**What this code does:**

Demonstrates **few-shot prompting** - providing examples to guide the model:

**What is few-shot learning:**
- Show model 2-5 examples of input → output pairs
- Model learns the pattern and applies it to new inputs
- No fine-tuning required (examples in prompt only)

**Why few-shot helps:**
- **Clarifies format:** Shows exact output style you want
- **Reduces ambiguity:** Examples demonstrate edge cases
- **Improves consistency:** Model mimics example structure
- **Domain adaptation:** Examples can include domain jargon

**Structure of few-shot prompts:**
1. Task description (brief)
2. Examples (2-5 is usually enough)
   - Show diverse cases (not all similar)
   - Include format you want (here: "Text → Label")
3. New input to classify

**How many examples to use:**
- **0-shot (zero-shot):** No examples (what we did before)
- **Few-shot (2-5):** Most common, good balance
- **Many-shot (10-100):** For complex tasks or narrow domains

**When few-shot is essential:**
- Narrow domain (not in training data)
- Specific format required
- Ambiguous category boundaries
- Model systematically over- or under-assigns a category

**Cost consideration:** Examples add tokens → higher cost. But often worth it for quality.
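The three-part structure described above (task description, examples, new input) can be assembled programmatically. This is a sketch under assumptions: `build_few_shot_prompt` and the `EXAMPLES` list are illustrative names, and the example pairs are placeholders for your own labeled data.

```python
# Hypothetical example pairs; in practice these come from your own labeled data.
EXAMPLES = [
    ("Cut taxes and reduce regulations", "Conservative"),
    ("Expand healthcare access for all", "Progressive"),
    ("Maintain current spending levels", "Centrist"),
]

def build_few_shot_prompt(text, examples=EXAMPLES):
    """Assemble task description, labeled examples, and the new input."""
    lines = [
        "Classify political stance as Progressive, Conservative, or Centrist.",
        "",
        "Examples:",
    ]
    for example_text, label in examples:
        lines.append(f'Text: "{example_text}" → {label}')
    lines.append("")
    lines.append(f'Text: "{text}" →')
    return "\n".join(lines)

prompt = build_few_shot_prompt("Protect traditional family values")
print(prompt)
```

Keeping the examples in a list makes it easy to swap them, or to pass `examples=[]` for a zero-shot baseline when comparing costs.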

In [None]:
# Template approach for consistency
STANCE_TEMPLATE = """Classify the political stance of this text as:
- Progressive
- Conservative
- Centrist

Text: {text}
Stance:"""

results = []

for text in texts:
    prompt = STANCE_TEMPLATE.format(text=text)
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    
    result = response.choices[0].message.content.strip()
    results.append({"text": text, "stance": result})
    print(f"• {result}: {text}")

print(f"\n✓ Annotated {len(results)} texts")

**What this code does:**

Implements **chain-of-thought (CoT) prompting** - asking the model to reason step-by-step:

**What is chain-of-thought:**
- Ask model to show its reasoning before giving final answer
- Originated in Wei et al. (2022) "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
- Can substantially improve performance on multi-step reasoning tasks, especially with larger models

**Why CoT helps:**
- **Better accuracy:** Especially on multi-step problems
- **Transparency:** Can see model's logic
- **Debugging:** Identify where reasoning goes wrong
- **Trust:** Justifications build confidence in annotations

**Temperature: 0.3 (slightly higher)**
- CoT needs some creativity for explanations
- But still low enough for consistency
- Trade-off between rigid and random

**When to use CoT:**
- Complex classification (multiple factors to consider)
- Need justifications for auditing
- Debugging misclassifications
- Training human annotators (shows reasoning process)

**When NOT to use CoT:**
- Simple tasks (adds unnecessary tokens/cost)
- Speed critical (CoT is slower)
- Don't need explanations

**Cost consideration:** CoT generates more tokens (reasoning + answer), so ~2-3x more expensive than direct classification.

**Alternative:** Use JSON mode (Approach 2) to get structured reasoning + classification
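A chain-of-thought reply mixes reasoning and label in one string, so the label has to be separated out before analysis. The sketch below assumes the reply ends with a `Stance:` line, as the notebook's CoT template requests; `split_cot_response` is a hypothetical helper and the sample reply is invented for illustration.

```python
import re

def split_cot_response(reply):
    """Split a reply shaped like 'Reasoning: ... Stance: <label>' into its two parts."""
    # Use the LAST 'Stance:' line so labels mentioned inside the reasoning are ignored.
    matches = list(
        re.finditer(r"^\s*Stance:\s*(.+)$", reply, flags=re.MULTILINE | re.IGNORECASE)
    )
    if not matches:
        return {"reasoning": reply.strip(), "stance": None}
    last = matches[-1]
    reasoning = reply[: last.start()].strip()
    reasoning = re.sub(r"^Reasoning:\s*", "", reasoning, flags=re.IGNORECASE)
    return {"reasoning": reasoning, "stance": last.group(1).strip().rstrip(".")}

sample = "Reasoning: Funding schools expands public services.\nStance: Progressive"
print(split_cot_response(sample))
```

If the model does not follow the format, `stance` comes back as `None`, which is a useful signal to fall back to structured output.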

### 1C. Few-Shot Prompting

Provide **examples** in the prompt to guide the model's behavior.

In [None]:
# Few-shot template with examples
FEW_SHOT_TEMPLATE = """Classify political stance as Progressive, Conservative, or Centrist.

Examples:
Text: "Cut taxes and reduce regulations" → Conservative
Text: "Expand healthcare access for all" → Progressive
Text: "Maintain current spending levels" → Centrist

Text: {text} →"""

text = "Protect traditional family values and limit government overreach"
prompt = FEW_SHOT_TEMPLATE.format(text=text)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

print(f"Text: {text}")
print(f"Predicted: {response.choices[0].message.content}")

**Key advantages of JSON mode (Approach 2) over Approach 1 (prompt-only):**
- **Syntactically valid JSON:** No "Here's the JSON:" preamble or markdown fences
- **Little need for retry logic:** Parsing normally succeeds on the first attempt, unless the response is cut off by the token limit
- **Same cost:** No extra charge
- **Simpler code:** No complex parsing

**Requirements:**
- Model must support JSON mode (GPT-4o, GPT-4o-mini, GPT-4 Turbo, and recent GPT-3.5 Turbo versions do)
- Must mention "JSON" in prompt (system or user message)

**When to use:**
- **This is the recommended default approach**
- Most annotation tasks (sentiment, topics, stance, etc.)
- When you want flexible schema (any JSON structure)

**Limitation:**
- No type validation (can't enforce "confidence must be 0-1")
- For strict typing, use Approach 3
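JSON mode checks syntax, not content, so value checks such as "confidence must be 0-1" have to happen in your own code. Here is a minimal sketch of such a check; `validate_annotation` and `ALLOWED_STANCES` are hypothetical names, and the expected keys follow the stance/confidence/reasoning schema used in this notebook.

```python
ALLOWED_STANCES = {"Progressive", "Conservative", "Centrist"}

def validate_annotation(data):
    """Return a list of problems found in a parsed annotation dict (empty list = valid)."""
    problems = []
    if data.get("stance") not in ALLOWED_STANCES:
        problems.append(f"unexpected stance: {data.get('stance')!r}")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence is not a number")
    elif not 0 <= confidence <= 1:
        problems.append(f"confidence out of range: {confidence}")
    if not isinstance(data.get("reasoning"), str):
        problems.append("reasoning is missing or not a string")
    return problems

print(validate_annotation({"stance": "Centrist", "confidence": 1.5, "reasoning": "x"}))
```

Returning a list of problems rather than raising lets a batch run record every issue for a row and keep going.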

### 1D. Chain-of-Thought Prompting

Ask the model to **explain its reasoning** step-by-step before giving an answer.

**What this code does:**

**Approach 2 (JSON Mode)** - The **recommended approach** for most annotation tasks:

**How it works:**
- Set `response_format={"type": "json_object"}` in API call
- API constrains the output to syntactically valid JSON
- Model cannot return prose or malformed JSON, though a reply cut off by the token limit can still be incomplete

**Key advantages:**
- **99.9%+ reliability:** Valid JSON in practice, but keys and value types are not checked
- **No examples needed:** Saves tokens/cost
- **Simple to use:** Just one parameter
- **Fast:** Retry logic is rarely needed

**Requirements:**
1. Must mention "JSON" in prompt (system or user message)
2. Model must support it (GPT-4o, GPT-4o-mini, GPT-4 Turbo, recent GPT-3.5 Turbo versions, etc.)
3. Temperature can be anything (0 for consistency)

**How to structure the request:**
```python
messages=[
    {"role": "system", "content": "You are X. Return valid JSON only."},
    {"role": "user", "content": "Task: ... Return JSON with keys: ..."}
]
response_format={"type": "json_object"}
```

**Compared to other approaches:**
- **vs Prompt-only:** Much more reliable (valid JSON nearly every time vs roughly 60-80%)
- **vs Few-shot:** Simpler and cheaper (no examples needed)
- **vs Function calling:** More flexible schema (next cell)

**When to use JSON mode:**
- Most annotation tasks (sentiment, topics, stance, etc.)
- When you want flexibility in schema
- When you don't need enum validation

**When to use Function calling instead:**
- Need strict type checking (enums, number ranges)
- Complex nested schemas
- Want IDE autocomplete on schema

**Cost:** Same as normal API calls - no extra charge for JSON mode.
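One case where a JSON-mode reply can still fail to parse is truncation: if generation stops at the token limit, the object may end mid-way. The sketch below shows one way to guard against that; `parse_json_mode_reply` is a hypothetical helper, and `response` in the comment is an assumed variable holding a chat completion.

```python
import json

def parse_json_mode_reply(content, finish_reason):
    """Parse a JSON-mode reply, refusing output that was cut off by the token limit."""
    if finish_reason == "length":
        # Truncated output can end mid-object, so it may not parse even in JSON mode
        raise ValueError("response truncated; raise max_tokens or shorten the prompt")
    return json.loads(content)

# With a real response object (assumed variable name `response`):
#   choice = response.choices[0]
#   data = parse_json_mode_reply(choice.message.content, choice.finish_reason)
print(parse_json_mode_reply('{"stance": "Centrist", "confidence": 0.8}', "stop"))
```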

In [None]:
COT_TEMPLATE = """Classify the stance and explain your reasoning.

Text: {text}

Think step-by-step:
1. What policy domain is this?
2. What values does it express?
3. What stance does this suggest?

Reasoning:
Stance:"""

text = "Invest heavily in public education and teacher salaries"
prompt = COT_TEMPLATE.format(text=text)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3  # Slightly higher for reasoning
)

print(f"Text: {text}\n")
print(f"Chain-of-Thought Response:\n{response.choices[0].message.content}")
print("\n✓ CoT provides reasoning steps before final classification")

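**Key advantages of function calling (Approach 3) over Approach 2 (JSON mode):**
- **Type declarations:** `"type": "number"` tells the model to return a numeric value
- **Enum constraints:** `"enum": ["A", "B", "C"]` restricts the field to the listed values
- **Required fields:** `"required": [...]` names the fields that must be present
- **Nested structures:** Complex object hierarchies can be described in one schema

The schema strongly steers the model, but without strict schema enforcement it is not a hard guarantee, and `"type": "number"` alone does not limit confidence to 0-1. Validate the parsed arguments in code.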

**The schema format:**
```python
{
    "type": "function",
    "function": {
        "name": "analyze_stance",
        "parameters": {
            "type": "object",
            "properties": {
                "stance": {"type": "string", "enum": ["Progressive", "Conservative", "Centrist"]},
                "confidence": {"type": "number"},  # Must be a number
                "reasoning": {"type": "string"}
            },
            "required": ["stance", "confidence", "reasoning"]
        }
    }
}
```

**When to use:**
- Production systems needing validation guarantees
- Strict category sets (must be one of 3 options, not free text)
- Type safety important (confidence must be number, not string)
- Complex nested schemas

**vs JSON mode:**
- **JSON mode:** Flexible, simpler, good default
- **Function calling:** Stricter, more setup, better for production

**Cost:** Same as JSON mode - no extra charge
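For stronger schema adherence, OpenAI's Structured Outputs feature lets a function definition opt in with `"strict": True`. The sketch below builds such a definition. It assumes your model and SDK version support the option; `make_strict_tool` is a hypothetical helper, and the requirements noted in the comments (all properties listed as required, no additional properties) reflect the strict-mode rules as I understand them, so check them against the current API documentation.

```python
def make_strict_tool(name, description, properties):
    """Build a tool definition that requests strict schema adherence (assumed supported)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "strict": True,  # assumption: model and SDK support Structured Outputs
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),   # strict mode expects every property listed
                "additionalProperties": False,  # and no extra keys
            },
        },
    }

tool = make_strict_tool(
    "analyze_stance",
    "Return structured political stance analysis",
    {
        "stance": {"type": "string", "enum": ["Progressive", "Conservative", "Centrist"]},
        "confidence": {"type": "number"},
        "reasoning": {"type": "string"},
    },
)
print(tool["function"]["parameters"]["required"])
```

The resulting dict can be passed as `tools=[tool]`. Range checks on `confidence` are still best done in Python after parsing.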

---

## Part 2: Three Approaches to Getting Structured Outputs (JSON)

When you want the LLM to return structured data (like JSON with specific keys), you have **three main approaches**. Each has different reliability and complexity.

**Why structured outputs matter:**
- Easy to parse programmatically (no regex needed)
- Can extract multiple fields (label, confidence, reasoning)
- Essential for batch annotation pipelines
- Enables downstream analysis

Let's compare the three approaches from simplest to most reliable.

### Approach 1: Prompt for JSON (Simplest, Least Reliable)

**How it works:** Just ask the LLM to return JSON in your prompt. No special API parameters.

**What you do:**
- Include "Return JSON" in your prompt
- Hope the model follows instructions
- Try to parse the response with `json.loads()`

**Success rate:** ~60-80% (varies by model and complexity)

### Approach 2: JSON Mode API Parameter (Recommended for Most Tasks)

**How it works:** Use `response_format={"type": "json_object"}` in your API call. The API then constrains the model to emit syntactically valid JSON.

**What you do:**
1. Add `response_format={"type": "json_object"}` parameter
2. Mention "JSON" somewhere in your prompt
3. Parse the response with `json.loads()`; it should be valid JSON unless the output was truncated (`finish_reason == "length"`)

**Success rate:** ~99.9% valid JSON; key names and value types are still not checked

**What this code does:**

Implements **robust JSON extraction with retry logic** for when parsing fails:

**The three-phase approach:**

**Phase 1: Initial request**
- Clear instructions: "Return only a JSON object"
- Low temperature (0.1) for consistency
- Specify exact schema in prompt

**Phase 2: Parse attempt**
- Try `json.loads()` on response
- If successful → return result
- If fails → proceed to Phase 3

**Phase 3: Retry with correction**
- Send original prompt + failed response + correction instruction
- Use temperature 0.0 (most deterministic)
- Try parsing again
- If still fails → raise exception for manual review

**Key features:**
- **`max_retries` parameter:** Control how many attempts (1 is usually enough)
- **Error logging:** Print failed output for debugging
- **Gradual temperature reduction:** 0.1 → 0.0 increases determinism

**When to use retry logic:**
- Using Approach 1 (prompt-only, with or without few-shot examples), i.e. not JSON mode
- Critical annotations (can't skip failures)
- Debugging schema issues

**When NOT needed:**
- Using JSON mode or function calling (already reliable)
- Batch processing (skip failures, review later)

**Success rates (rough figures; they vary by model and task):**
- Without retry: ~60-85% (prompt-only) up to ~95% (few-shot)
- With retry: ~98%
- Remaining ~2%: Usually schema issues or model limitations

**Best practice:** Use JSON mode (Approach 2) to avoid needing this complexity.

In [None]:
text = "Cut taxes and reduce regulations"
prompt = """You output ONLY valid JSON with keys: stance, confidence, reasoning.

Example:
Input: Expand healthcare access for all
Output: {{"stance":"Progressive","confidence":0.9,"reasoning":"Universal healthcare is progressive policy"}}

Example:
Input: Maintain current spending levels
Output: {{"stance":"Centrist","confidence":0.8,"reasoning":"Status quo signals moderate position"}}

Now do the same:
Input: {input}
Output:
""".format(input=text)

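# Plain chat completion: no JSON mode, so the model may still wrap the JSON in extra text
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)
output = response.choices[0].message.content.strip()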
print("Raw output:")
print(output)

print("\nParsing JSON...")
data = json.loads(output)
print("✓ Successfully parsed!")
print(json.dumps(data, indent=2))

### Approach 2: Provider JSON Mode (More Reliable)

Use the API's **JSON mode** to **force** valid JSON output. This is the recommended approach for most annotation tasks.

**Why Approach 1 (prompt for JSON) often fails:**

Common problems:
1. **Extra text:** "Here's the JSON: {...}"
2. **Markdown fences:** \`\`\`json {...} \`\`\`
3. **Prose instead:** "The stance is Progressive because..."
4. **Malformed JSON:** Trailing commas, unquoted keys

**When to use:**
- Quick prototyping only
- Not recommended for production

**Better alternatives:** Approach 2 (JSON mode API parameter) or Approach 3 (function calling)
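If you do have to work with prompt-only output, the first two problems listed above (extra text and markdown fences) can often be repaired before parsing. The following is a best-effort sketch, not a full solution: `extract_json` is a hypothetical helper, and it does not repair malformed JSON such as trailing commas.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe

def extract_json(raw):
    """Best-effort recovery of a JSON object from a messy model reply."""
    # Strip markdown code fences (three backticks, optionally followed by "json")
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw).strip()
    # Keep only the span from the first "{" to the last "}"
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(cleaned[start:end + 1])

messy = "Here's the JSON:\n" + FENCE + 'json\n{"stance": "Progressive"}\n' + FENCE
print(extract_json(messy))
```

Prose-only replies raise `ValueError`, and malformed JSON raises `json.JSONDecodeError` (a subclass of `ValueError`), so a single `except ValueError` catches both.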

**What this code does:**

Implements **production-grade batch annotation** with comprehensive logging and error handling:

**The `annotate_text` function:**
- Single-text annotation with JSON mode
- Returns parsed dictionary
- Extended schema: stance, confidence, reasoning, **policy_domain**

**The `batch_annotate` function workflow:**
1. Loop through all texts with progress indicator
2. Try to annotate each text
3. On success: Add metadata (model, timestamp, success=True)
4. On failure: Log error, mark success=False, set fields to None
5. Return pandas DataFrame for analysis

**Key metadata fields:**
- **`model`:** Track which model annotated (for comparison)
- **`timestamp`:** When annotation happened (ISO format)
- **`success`:** Boolean flag for filtering
- **`error`:** Error message if failed

**Why comprehensive logging matters:**
- **Reproducibility:** Can trace back to exact API call
- **Debugging:** Identify systematic failures
- **Cost tracking:** Know how many API calls made
- **Validation:** Compare annotations across time

**Quality summary:**
- Print success rate at end
- Identify low-confidence annotations
- Flag failures for manual review

**How to use:**
```python
df = batch_annotate(texts, model="gpt-4o-mini")
# Filter successful
successful = df[df['success']]
# Check low confidence
review = df[df['confidence'] < 0.7]
```

**Best practices:**
- Always wrap API calls in try/except
- Log everything (model, time, prompt, response)
- Return structured DataFrames (easier analysis)
- Calculate success rate and quality metrics
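The workflow above (loop, try, record metadata, mark failures) can be illustrated without any API calls. In this sketch, `batch_annotate_sketch` and `fake_annotate` are hypothetical stand-ins: the real annotator would call the model, while the fake one returns a fixed placeholder and fails on empty input so the error path is exercised.

```python
from datetime import datetime, timezone

def batch_annotate_sketch(texts, annotate_fn, model="gpt-4o-mini"):
    """Apply `annotate_fn(text) -> dict` to each text, recording failures instead of stopping."""
    rows = []
    for i, text in enumerate(texts, 1):
        row = {"text": text, "model": model,
               "timestamp": datetime.now(timezone.utc).isoformat()}
        try:
            row.update(annotate_fn(text))
            row.update(success=True, error=None)
        except Exception as exc:  # log and continue; review failures later
            row.update(stance=None, confidence=None, reasoning=None,
                       success=False, error=str(exc))
        rows.append(row)
        print(f"[{i}/{len(texts)}] success={row['success']}")
    return rows

def fake_annotate(text):
    """Stand-in for a real API-backed annotator; fails on empty input."""
    if not text:
        raise ValueError("empty text")
    return {"stance": "Centrist", "confidence": 0.5, "reasoning": "placeholder"}

rows = batch_annotate_sketch(["Reduce corporate tax rates", ""], fake_annotate)
success_rate = sum(r["success"] for r in rows) / len(rows)
print(f"Success rate: {success_rate:.0%}")
```

The list of dicts converts directly with `pd.DataFrame(rows)`, which gives the DataFrame shape the filtering examples above expect.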

In [None]:
text = "Protect traditional family values"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You are a political analyst. Return valid JSON only."},
        {"role": "user",
         "content": f"Analyze political stance: {text}\n\nReturn JSON with stance, confidence, reasoning."}
    ],
    response_format={"type": "json_object"},  # Force JSON mode
    temperature=0
)

data = json.loads(response.choices[0].message.content)
print("✓ JSON mode guarantees valid JSON")
print(json.dumps(data, indent=2))

In [None]:
# Define function schema with typed arguments
tools = [{
    "type": "function",
    "function": {
        "name": "analyze_stance",
        "description": "Return structured political stance analysis",
        "parameters": {
            "type": "object",
            "properties": {
                "stance": {
                    "type": "string",
                    "enum": ["Progressive", "Conservative", "Centrist"]
                },
                "confidence": {
                    "type": "number",
                    "description": "Confidence score from 0 to 1"
                },
                "reasoning": {
                    "type": "string",
                    "description": "Brief explanation of the classification"
                }
            },
            "required": ["stance", "confidence", "reasoning"]
        }
    }
}]

text = "Expand social safety nets"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Analyze: {text}"}],
    tools=tools,
    # Force the call: with "auto" the model may answer in prose and tool_calls would be None
    tool_choice={"type": "function", "function": {"name": "analyze_stance"}},
    temperature=0
)

# Extract structured arguments
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

print("✓ Function calling provides strongest guarantees")
print(f"Function called: {call.function.name}")
print(json.dumps(args, indent=2))

**What this code does:**

Implements **mixture of experts (MoE)** - using multiple models and aggregating predictions:

**Theoretical foundation:**
- Kraft et al. (2024): "Mixture-of-Experts Approach to LLM-Based Political Ideology Scaling"
- Averaging several models reduces variance and dampens any single model's idiosyncratic bias (bias shared by all models remains)
- Multiple perspectives increase reliability

**The approach:**
1. Get ideological score from each model (-1 to +1)
2. Aggregate using mean, median, std
3. Use std as uncertainty measure

**The `get_stance_score` function:**
- Asks for numeric score (not categorical)
- -1 = most progressive
- +1 = most conservative
- 0 = centrist
- Returns float for aggregation

**The `ensemble_stance` function:**
- Calls multiple models
- Handles failures gracefully (skip failed models)
- Computes aggregate statistics
- Returns uncertainty metrics

**Key metrics:**
- **Mean:** Central tendency
- **Median:** Robust to outliers
- **Std:** Agreement/uncertainty
  - Low std (<0.3) = high agreement
  - High std (>0.6) = low agreement = review needed

**When to use MoE:**
- High-stakes decisions (publication-critical)
- Ambiguous/contested texts
- Want confidence intervals
- Have budget for multiple API calls

**Cost consideration:**
- N models × cost per call
- 3 models = 3× cost (but usually worth it for quality)

**Alternative aggregation:**
- Majority vote (for categorical)
- Weighted average (weight by model quality)
- Bayesian model combination
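The aggregation step can be sketched independently of the API calls. Below, `aggregate_scores` and `majority_vote` are hypothetical helpers and the scores are made-up inputs; the 0.6 review threshold comes from the guidance above, and the population standard deviation (`pstdev`, the same convention as NumPy's default) is an assumption, since the text does not say which variant is meant.

```python
from collections import Counter
from statistics import mean, median, pstdev

def aggregate_scores(scores, review_threshold=0.6):
    """Summarize per-model scores on the -1 (progressive) to +1 (conservative) scale."""
    valid = [s for s in scores if s is not None]  # skip models that failed
    if not valid:
        raise ValueError("no model returned a score")
    spread = pstdev(valid) if len(valid) > 1 else 0.0
    return {"mean": mean(valid), "median": median(valid), "std": spread,
            "n_models": len(valid), "needs_review": spread > review_threshold}

def majority_vote(labels):
    """Most common categorical label; ties resolve to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

summary = aggregate_scores([0.6, 0.8, None, 0.7])  # None marks a failed model
print(summary)
print(majority_vote(["Progressive", "Centrist", "Progressive"]))
```

With only two or three models the standard deviation is a coarse signal, so it is better used to flag texts for review than to report as a formal confidence interval.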

### Comparison Summary

In [None]:
comparison_df = pd.DataFrame({
    "Approach": [
        "1. Prompt for JSON",
        "2. JSON Mode (API)",
        "3. Function Calling"
    ],
    "Reliability": [
        "60-80%",
        "99.9%",
        "99.9%"
    ],
    "Type Safety": [
        "None",
        "None",
        "Full"
    ],
    "Complexity": [
        "Lowest",
        "Low",
        "Medium"
    ],
    "When to Use": [
        "Prototyping only",
        "Most tasks (default)",
        "Production systems"
    ]
})

print(comparison_df.to_string(index=False))

print("\n" + "="*70)
print("\n💡 **Recommendation for novices:**")
print("   Start with Approach 2 (JSON Mode) - reliable and simple")
print("\n💡 **For production systems:**")
print("   Use Approach 3 (Function Calling) if you need type validation")

---

## Part 3: Robust JSON Extraction

Sometimes JSON parsing fails. Here's how to handle errors gracefully with **retry logic**.

**What this code does:**

Implements **comprehensive logging** for full reproducibility of LLM annotations:

**Why complete logging is essential:**
- **Reproducibility:** Others can replicate your exact setup
- **Debugging:** Trace errors back to source
- **Auditing:** See exactly what model was asked
- **Cost tracking:** Monitor token usage
- **Drift detection:** Compare results over time

**What to log:**
1. **Timestamp:** When annotation happened (ISO 8601 format)
2. **Input:** Original text
3. **Model:** Exact version (e.g., "gpt-4-0613", not just "gpt-4")
4. **Parameters:** Temperature, seed, top_p, etc.
5. **Prompt:** Exact prompt sent (system + user)
6. **Response:** Complete model output
7. **Usage:** Token counts (prompt, completion, total)
8. **Metadata:** Finish reason, API version

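**The `seed` parameter:**
- Available in some OpenAI models (GPT-4, GPT-4o)
- Makes sampling mostly deterministic (best effort, not a guarantee)
- Same (model + prompt + seed + temp) usually gives the same output; a changed `system_fingerprint` in the response signals a backend change that can alter results
- Important for reproducibility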

**Best practices:**
- Pin model version: Use "gpt-4-0613" not "gpt-4"
- Always log seed and temperature
- Store logs as JSONL (one JSON object per line)
- Include git commit hash if code changes
- Log API response headers (rate limits, etc.)

**How to use logs:**
- Debug failures by inspecting prompt
- Calculate cost from token counts
- Verify reproducibility by re-running with same params
- Track model drift by comparing same prompts over time

**Storage recommendation:** Save to file, not just print
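A JSONL log of the fields listed above can be written with the standard library alone. This is a sketch under assumptions: `log_annotation` is a hypothetical helper, the field names are illustrative, and the demo values (including the seed) are placeholders rather than real API output.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_annotation(path, *, text, model, params, messages, response_text, usage=None):
    """Append one annotation record to a JSONL file (one JSON object per line)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "input": text,
        "model": model,
        "parameters": params,      # e.g. {"temperature": 0, "seed": 42}
        "messages": messages,      # exact system + user prompts
        "response": response_text,
        "usage": usage,            # token counts, if available
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

# Demo with placeholder values, written to a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), "annotations.jsonl")
log_annotation(log_path, text="Reduce corporate tax rates", model="gpt-4o-mini",
               params={"temperature": 0, "seed": 42},
               messages=[{"role": "user", "content": "Analyze: ..."}],
               response_text='{"stance": "Conservative"}')
print(open(log_path, encoding="utf-8").read())
```

Appending one line per call means a crash mid-run loses at most the record in progress, and the file can be loaded later with `pd.read_json(path, lines=True)`.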

In [None]:
# Clear instructions for JSON-only output
INSTRUCTIONS = (
    'Return only a JSON object like this:\n'
    '{"stance":"Progressive|Conservative|Centrist|null",'
    '"confidence":0-1,"reasoning":"brief"}\n'
    'Do not add any extra text.'
)

def get_labels_robust(text, model="gpt-4o-mini", max_retries=1):
    """
    Robust JSON extraction with error handling and retry logic.
    
    Args:
        text: Text to annotate
        model: Model name
        max_retries: Number of retry attempts on parse failure
    
    Returns:
        dict: Parsed JSON result
    """
    # 1) Initial request with low temperature
    prompt = f'{INSTRUCTIONS}\n\nText: "{text}"'
    messages = [{"role": "user", "content": prompt}]
    fix_prompt = (
        "That was not valid JSON. Please send ONLY the JSON object, "
        "nothing else. No explanations, no markdown fences."
    )

    # One initial attempt plus up to max_retries correction attempts
    for attempt in range(max_retries + 1):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            # Low temp on the first try, zero temp on retries
            temperature=0.1 if attempt == 0 else 0.0
        )
        output = response.choices[0].message.content.strip()

        # 2) Try to parse as JSON
        try:
            return json.loads(output)
        except json.JSONDecodeError as e:
            print(f"⚠ Parse failed (attempt {attempt + 1}): {e}")
            print(f"Raw output: {output[:100]}...\n")

            if attempt == max_retries:
                print("✗ Parse failed after all attempts")
                raise

            # 3) Retry with the failed output and a correction prompt appended
            messages = messages + [
                {"role": "assistant", "content": output},
                {"role": "user", "content": fix_prompt}
            ]

# Test cases
test_texts = [
    "We must expand Medicare to cover everyone",
    "Cut taxes and reduce government spending",
    "Maintain balanced approach to fiscal policy"
]

print("Testing robust extraction with retry logic:\n")

for i, text in enumerate(test_texts, 1):
    print(f"Test {i}: {text}")
    try:
        result = get_labels_robust(text)
        print(f"✓ Success: {result['stance']} (confidence: {result['confidence']})\n")
    except Exception as e:
        print(f"✗ Failed: {e}\n")

**What this code does:**

Implements **model fingerprinting** to detect when API behavior changes (API drift):

**What is API drift:**
- Model providers update models without warning
- "gpt-4" might point to different weights next month
- Causes non-reproducibility issues
- Your results today ≠ results tomorrow

**How fingerprinting works:**
1. Create a small test set (5-10 prompts)
2. Run them through model with fixed seed and temperature
3. Hash the concatenated responses (SHA256)
4. Save this fingerprint
5. Periodically re-run and compare hashes

**The `model_fingerprint` function:**
- Takes model name and test prompts
- Uses temperature=0 and seed for determinism
- Concatenates all responses
- Computes SHA256 hash (deterministic)
- Returns 64-character hex string

**When fingerprint changes:**
- Model has been updated (weights changed)
- API behavior changed
- Your previous results may not be replicable
- Need to re-run annotations or document drift

**Best practices:**
1. **Create fingerprint at start:** Before annotating corpus
2. **Check periodically:** Weekly or monthly
3. **Store with results:** Save fingerprint with annotations
4. **Document changes:** Note when drift detected
5. **Use versioned models:** "gpt-4-0613" vs "gpt-4"

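**How often does drift happen:**
- Unversioned aliases ("gpt-4"): Whenever the provider repoints the alias to a newer snapshot, which can happen without notice
- Versioned models ("gpt-4-0613"): Rarely (usually stable until deprecated)
- Open-source models: Not at all, as long as you pin the weights and inference setup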

**Recommendation:** Always use versioned models for research
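The hashing step of fingerprinting can be sketched without the API. Here `fingerprint_responses` and `model_fingerprint_sketch` are hypothetical names, and `ask_model` is a stand-in for a fixed-seed, temperature-0 call; `str.upper` and `str.lower` play the role of two differently behaving "models". One small deviation from plain concatenation: a separator byte is added between responses so that different splits of the same text do not collide.

```python
import hashlib

def fingerprint_responses(responses):
    """SHA-256 over the responses, in order; returns a 64-character hex string."""
    digest = hashlib.sha256()
    for response in responses:
        digest.update(response.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()

def model_fingerprint_sketch(ask_model, test_prompts):
    """`ask_model(prompt) -> str` is a stand-in for a fixed-seed, temperature-0 API call."""
    return fingerprint_responses([ask_model(p) for p in test_prompts])

prompts = ["classify: a", "classify: b"]
baseline = model_fingerprint_sketch(str.upper, prompts)
later = model_fingerprint_sketch(str.upper, prompts)
drifted = model_fingerprint_sketch(str.lower, prompts)
print(baseline == later, baseline == drifted)
```

Because a hash changes on any single-character difference, occasional mismatches can come from residual sampling noise rather than a model update; storing the raw responses alongside the hash makes it possible to tell the two apart.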

---

## Part 4: Batch Annotation

Annotating multiple texts efficiently with **logging** and **quality checks**.

**What this code does:**

Measures **agreement between human and LLM annotations** using standard metrics:

**Why validation is critical:**
- LLMs can be systematically biased
- Need to know if LLM matches human judgment
- Expected by reviewers at most publication venues
- Establishes annotation quality

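**Cohen's Kappa (κ):**
- Measures inter-rater reliability (agreement corrected for chance)
- Range: -1 to +1 (usually 0 to 1)
- **Interpretation (Landis & Koch, 1977):**
  - κ > 0.80: Almost perfect agreement (excellent)
  - κ 0.61-0.80: Substantial agreement (acceptable)
  - κ 0.41-0.60: Moderate agreement (questionable for research use, needs work)
  - κ ≤ 0.40: Fair or worse (not usable without rework)
- Formula: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance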

**Accuracy:**
- Simple: % of matching labels
- Doesn't account for chance
- Can be misleading with imbalanced classes

**Confusion Matrix:**
- Shows where disagreements occur
- Rows = human labels
- Cols = LLM labels
- Diagonal = agreements
- Off-diagonal = disagreements

**How to use in practice:**
1. **Validation set:** Human-annotate 100-200 texts
2. **LLM annotation:** Annotate same texts with LLM
3. **Calculate κ:** Use `cohen_kappa_score()`
4. **Analyze errors:** Check confusion matrix
5. **Iterate:** If κ < 0.80, refine prompt and repeat

**Target thresholds for publication:**
- Minimum: κ > 0.70
- Good: κ > 0.80
- Excellent: κ > 0.90

**Cost saving:** Once validated, can annotate full corpus with LLM
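To make the chance correction concrete, here is a small pure-Python computation of unweighted Cohen's kappa and the confusion counts. The function names and the six toy labels are illustrative only; for real analyses, scikit-learn's `cohen_kappa_score` (imported in the setup cell) should give the same unweighted value.

```python
from collections import Counter

def cohen_kappa(human, llm):
    """Cohen's kappa for two equal-length label lists: (p_o - p_e) / (1 - p_e)."""
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    p_o = sum(h == m for h, m in zip(human, llm)) / n               # observed agreement
    h_counts, m_counts = Counter(human), Counter(llm)
    p_e = sum(h_counts[k] * m_counts[k] for k in h_counts) / n**2   # chance agreement
    # Convention chosen for this sketch: identical constant labels count as kappa = 1
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def confusion_counts(human, llm):
    """Counter keyed by (human_label, llm_label); equal pairs are agreements."""
    return Counter(zip(human, llm))

# Toy labels: P = Progressive, C = Conservative, M = Centrist (moderate)
human = ["P", "P", "C", "C", "M", "M"]
llm   = ["P", "P", "C", "M", "M", "M"]
print(round(cohen_kappa(human, llm), 3))
print(confusion_counts(human, llm))
```

In this toy case raw agreement is 5/6 (about 0.83) while chance agreement is 1/3, which is why kappa comes out noticeably lower than accuracy.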

In [None]:
# Sample corpus for batch annotation
corpus = [
    "We need stronger borders and immigration control",
    "Healthcare is a human right for all",
    "Balance the budget through moderate tax reform",
    "Invest in renewable energy infrastructure",
    "Cut regulations on small businesses",
    "Expand access to affordable childcare",
    "Maintain current defense spending levels",
    "Protect voting rights and access",
    "Reduce corporate tax rates",
    "Fund public education and teacher salaries"
]

print(f"Corpus: {len(corpus)} political statements")

**What the promptbook code does (see Part 6D):**

The code in Part 6D creates a **promptbook**, a comprehensive documentation artifact for reproducible LLM research:

**What is a promptbook:**
- Single JSON file documenting entire annotation pipeline
- Analogous to lab notebook in experimental research
- Lets other researchers rerun the pipeline with the same model, prompts, and settings
- Supports the reproducibility and materials-availability requirements that some journals impose

**What to include:**
1. **Task description:** What you're annotating
2. **Model details:** Exact version, provider, parameters
3. **Prompts:** Full system and user messages
4. **Output schema:** Structure of responses
5. **Validation:** Human agreement metrics (Cohen's κ)
6. **Fingerprint:** Hash for detecting drift
7. **Metadata:** Date, version, notes

**Why this matters for reproducibility:**
- Someone with your promptbook can rerun the pipeline with the same model, prompts, and settings (hosted models may still drift)
- Shows transparency about model and prompts
- Documents validation against human labels
- Tracks when model behavior changes (fingerprint)

**Best practices:**
1. **Version control:** Increment version when prompts change
2. **Git integration:** Include git commit hash
3. **Store with data:** Save alongside annotations
4. **Share openly:** Publish with paper (supplementary materials)
5. **Update regularly:** New fingerprint each month
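
The first three practices can be combined in a few lines. The sketch below is one possible arrangement under stated assumptions: `current_git_commit` and `save_promptbook` are hypothetical helpers, the file naming scheme is made up, and the git lookup simply returns `None` when the code is not run inside a repository.

```python
import json
import subprocess
from pathlib import Path


def current_git_commit():
    """Return the current commit hash, or None outside a git repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None


def save_promptbook(promptbook, directory, version):
    """Write the promptbook next to the annotations as promptbook_v<version>.json."""
    # Copy the dict so the caller's promptbook is not modified
    record = dict(promptbook, version=version, git_commit=current_git_commit())
    path = Path(directory) / f"promptbook_v{version}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Writing one file per version, instead of overwriting a single file, keeps earlier prompt versions available for comparison.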

**What to do with promptbook:**
- Include in paper's methods section
- Upload to OSF/Dataverse with data
- Reference in computational appendix
- Use for internal documentation

**Publication requirements:**
- Many journals now require computational reproducibility
- A promptbook covers much of what such policies ask for on the LLM side, but check each journal's own checklist
- Some journals have templates (adapt this structure)

**This establishes research-grade annotation practices**

In [None]:
def annotate_text(text, model="gpt-4o-mini", temperature=0):
    """Annotate a single text with structured output"""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a political analyst. Return valid JSON only."},
            {"role": "user",
             "content": f"""Analyze this political text: {text}

Return JSON with keys:
- stance (Progressive/Conservative/Centrist)
- confidence (0-1)
- reasoning (brief explanation)
- policy_domain (e.g., healthcare, economy, education)"""}
        ],
        response_format={"type": "json_object"},
        temperature=temperature
    )
    
    return json.loads(response.choices[0].message.content)

def batch_annotate(texts, model="gpt-4o-mini", temperature=0):
    """Annotate multiple texts with error handling and logging"""
    results = []
    
    print(f"Annotating {len(texts)} texts with {model}...\n")
    
    for i, text in enumerate(texts, 1):
        print(f"[{i}/{len(texts)}] {text[:50]}...")
        
        try:
            annotation = annotate_text(text, model=model, temperature=temperature)
            annotation['text'] = text
            annotation['model'] = model
            annotation['timestamp'] = datetime.now().isoformat()
            annotation['success'] = True
            annotation['error'] = None
            
        except Exception as e:
            print(f"  ✗ Error: {e}")
            annotation = {
                'text': text,
                'model': model,
                'timestamp': datetime.now().isoformat(),
                'success': False,
                'error': str(e),
                'stance': None,
                'confidence': None,
                'reasoning': None,
                'policy_domain': None
            }
        
        results.append(annotation)
    
    df = pd.DataFrame(results)
    print(f"\n✓ Completed: {df['success'].sum()}/{len(df)} successful")
    
    return df

# Run batch annotation
df = batch_annotate(corpus)
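
`batch_annotate` records a failure and moves on. For transient errors such as rate limits or timeouts, a retry with exponential backoff is a common complement. The wrapper below is a generic sketch, not part of the OpenAI client: `with_retries` and its parameters are hypothetical names, and it retries on any exception for simplicity.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt and try again.

    Re-raises the last exception once max_attempts calls have failed.
    The sleep argument is injectable so the function can be tested quickly.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Possible use with the function defined above (not executed here):
# annotation = with_retries(lambda: annotate_text(text))
```

In a real pipeline you would probably narrow the `except` clause to the error types worth retrying, so that genuine bugs fail fast.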

In [None]:
# Display results
print("\n=== ANNOTATION RESULTS ===\n")

print("Summary by stance:")
print(df['stance'].value_counts())

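print("\nAverage confidence by stance:")
# Coerce to numeric first: failed rows hold None and models sometimes return strings
confidence = pd.to_numeric(df['confidence'], errors='coerce')
print(confidence.groupby(df['stance']).mean().round(3))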

print("\nSample annotations:")
display_df = df[['text', 'stance', 'confidence', 'policy_domain']].head(5)
print(display_df.to_string(index=False))

In [None]:
# Quality checks
print("\n=== QUALITY CHECKS ===\n")

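# Check for low confidence predictions
# (coerce to numeric so failed rows holding None do not break the comparison)
confidence = pd.to_numeric(df['confidence'], errors='coerce')
low_confidence = df[confidence < 0.7]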
print(f"Low confidence annotations (< 0.7): {len(low_confidence)}")
if len(low_confidence) > 0:
    print(low_confidence[['text', 'stance', 'confidence']].to_string(index=False))

# Check for failures
failures = df[~df['success']]
print(f"\nFailed annotations: {len(failures)}")
if len(failures) > 0:
    print(failures[['text', 'error']].to_string(index=False))

print("\n✓ Quality checks complete")
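
JSON mode guarantees syntactically valid JSON, not that the keys and values match what the prompt asked for. A per-annotation schema check is a useful extra quality gate. This is a minimal sketch: `validate_annotation` and `ALLOWED_STANCES` are hypothetical names, and the allowed values are taken from the prompt used in `annotate_text`.

```python
ALLOWED_STANCES = {"Progressive", "Conservative", "Centrist"}


def validate_annotation(annotation):
    """Return a list of problems found in one annotation dict (empty if none)."""
    problems = []

    if annotation.get("stance") not in ALLOWED_STANCES:
        problems.append(f"unexpected stance: {annotation.get('stance')!r}")

    try:
        confidence = float(annotation.get("confidence"))
    except (TypeError, ValueError):
        problems.append(f"confidence is not numeric: {annotation.get('confidence')!r}")
    else:
        if not 0.0 <= confidence <= 1.0:
            problems.append(f"confidence out of range: {confidence}")

    for key in ("reasoning", "policy_domain"):
        if not annotation.get(key):
            problems.append(f"missing field: {key}")

    return problems
```

Applied row by row (for example `df.apply(lambda r: validate_annotation(r.to_dict()), axis=1)`), it flags labels such as "Liberal" or a confidence of "high" that would otherwise slip into the summary tables.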

---

## Part 5: Mixture of Experts (Ensemble)

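Using **multiple models** and aggregating their predictions can increase reliability. Here "mixture of experts" means an ensemble of separate models, not the neural-network architecture of the same name. This is based on Kraft et al. (2024).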

In [None]:
def get_stance_score(text, model="gpt-4o-mini"):
    """
    Get ideological position score from a model.
    Returns: float from -1 (most progressive) to +1 (most conservative)
    """
    prompt = f"""Rate this text on ideology from -1 (most progressive)
to +1 (most conservative). Return only the number.

Text: {text}"""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    
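    raw = response.choices[0].message.content.strip()
    # Models sometimes wrap the number in text ("Score: -0.6"), so extract it
    match = re.search(r"[-+]?\d*\.?\d+", raw)
    if match is None:
        raise ValueError(f"No numeric score in response: {raw!r}")
    # Clamp to the documented -1..+1 range
    return max(-1.0, min(1.0, float(match.group())))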

def ensemble_stance(text, models):
    """Aggregate stance estimates across multiple models"""
    scores = []
    individual = {}
    
    for model in models:
        try:
            score = get_stance_score(text, model=model)
            scores.append(score)
            individual[model] = score
            print(f"  {model:20}: {score:+.3f}")
        except Exception as e:
            print(f"  {model:20}: Error - {e}")
            continue
    
    if not scores:
        return None
    
    return {
        "mean": np.mean(scores),
        "median": np.median(scores),
        "std": np.std(scores),
        "min": np.min(scores),
        "max": np.max(scores),
        "individual": individual,
        "n_models": len(scores)
    }

**Using a continuous scale instead of categorical labels:**

For more nuanced analysis, we can ask models to rate ideology on a **continuous scale** rather than discrete categories. This approach:

- Captures gradations between positions (e.g., -0.45 vs -0.90 both progressive, but different intensities)
- Enables more sophisticated aggregation (mean, median, standard deviation)
- Allows measurement of **uncertainty** via ensemble variance
- Follows Kraft et al. (2024) methodology for ideological scaling

**The scale:**
- **-1**: Most progressive position
- **0**: Centrist/neutral
- **+1**: Most conservative position

This continuous representation is particularly useful for mixture-of-experts approaches where we aggregate scores across multiple models.
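
As a small offline illustration of the aggregation idea, the sketch below summarizes a list of per-model scores and flags texts on which the models disagree. `summarize_scores` is a hypothetical helper, and the 0.3 threshold is an assumption borrowed from the "High agreement" cutoff used later in this notebook.

```python
from statistics import mean, median, pstdev


def summarize_scores(scores, disagreement_threshold=0.3):
    """Aggregate per-model scores on the -1..+1 scale and flag disagreement."""
    spread = pstdev(scores)  # population std, the same default as np.std
    return {
        "mean": mean(scores),
        "median": median(scores),
        "std": spread,
        "needs_review": spread >= disagreement_threshold,
    }


# Two models that both lean progressive but with different intensity
print(summarize_scores([-0.45, -0.90]))
# Two models that disagree on the direction
print(summarize_scores([-0.8, 0.4]))
```

The first pair agrees on direction and stays below the threshold. The second pair straddles zero and is flagged, which is the kind of case worth routing to a human coder.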

In [None]:
# Single text with multiple models
models = ["gpt-4o-mini", "gpt-3.5-turbo"]

text = "We must protect traditional family values and limit government overreach"
print(f"Text: {text}\n")
print("Individual model scores:")

result = ensemble_stance(text, models)

if result:
    print(f"\nEnsemble results:")
    print(f"  Mean:      {result['mean']:+.3f}")
    print(f"  Median:    {result['median']:+.3f}")
    print(f"  Std dev:   {result['std']:.3f}")
    print(f"  Range:     [{result['min']:+.3f}, {result['max']:+.3f}]")
    
    agreement = "High" if result['std'] < 0.3 else "Medium" if result['std'] < 0.6 else "Low"
    print(f"  Agreement: {agreement}")

In [None]:
# Batch analysis with ensemble
sample_texts = [
    "Expand Medicare to cover everyone",
    "Cut taxes and regulations on businesses",
    "Protect voting rights and access",
    "Secure the border and enforce immigration laws",
    "Invest in public schools and teacher salaries"
]

results = []

print(f"Analyzing {len(sample_texts)} texts with ensemble...\n")

for i, text in enumerate(sample_texts, 1):
    print(f"[{i}/{len(sample_texts)}] {text}")
    ensemble = ensemble_stance(text, models)
    
    if ensemble:
        results.append({
            "text": text,
            "position": ensemble["mean"],
            "uncertainty": ensemble["std"],
            "n_models": ensemble["n_models"]
        })
    print()

ensemble_df = pd.DataFrame(results)

print("\nPosition scores (negative = progressive, positive = conservative):")
print(ensemble_df.to_string(index=False))

---

## Part 6: Validation and Replication

Best practices for **reproducible** and **validated** LLM annotation.

### 6A. Comprehensive Logging

In [None]:
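def annotate_with_logging(text, model="gpt-4o-mini-2024-07-18", temperature=0, seed=42):
    """Annotate text with complete logging for reproducibility.

    The default is a pinned snapshot that supports JSON mode;
    older snapshots such as gpt-4-0613 reject response_format=json_object.
    """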
    prompt = f"""Analyze political stance: {text}

Return JSON with stance, confidence, reasoning."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a political analyst. Return valid JSON only."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"},
        temperature=temperature,
        seed=seed
    )
    
    # Parse result
    result = json.loads(response.choices[0].message.content)
    
    # Create comprehensive log entry
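    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "text": text,
        "model": model,                    # model name requested
        "model_returned": response.model,  # exact snapshot that served the request
        "system_fingerprint": response.system_fingerprint,  # backend config ID (may be None)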
        "temperature": temperature,
        "seed": seed,
        "prompt": prompt,
        "response": result,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens
        },
        "finish_reason": response.choices[0].finish_reason
    }
    
    return result, log_entry

text = "Expand social safety nets and increase minimum wage"
result, log = annotate_with_logging(text)

print(f"✓ Annotated: {text}")
print(f"  Stance: {result.get('stance')}")
print(f"\nLog entry includes: {list(log.keys())}")
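
A log entry is only useful if it is persisted. One simple option is a JSON Lines file with one entry per line, which can be appended to safely during long runs. The helpers `append_log` and `read_log` and the default file name are illustrative assumptions, not part of the notebook.

```python
import json
from pathlib import Path


def append_log(log_entry, path="annotation_log.jsonl"):
    """Append one log entry as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(log_entry, ensure_ascii=False) + "\n")


def read_log(path="annotation_log.jsonl"):
    """Load all log entries back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is an independent JSON document, an interrupted run loses at most the entry being written, and `pd.DataFrame(read_log())` turns the file back into a table.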

### 6B. Model Fingerprinting (Detect API Drift)

Create a **fingerprint** of model behavior to detect when the API changes.

In [None]:
def model_fingerprint(model, test_prompts, temperature=0, seed=42):
    """
    Create fingerprint to detect if model behavior has changed.
    Returns: SHA256 hash of concatenated responses
    """
    responses = []
    
    for prompt in test_prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            seed=seed
        )
        responses.append(response.choices[0].message.content)
    
    # Hash concatenated responses
    fingerprint = hashlib.sha256(
        "".join(responses).encode()
    ).hexdigest()
    
    return fingerprint

# Create test set for fingerprinting
test_prompts = [
    "Classify: 'Cut taxes for businesses' - Progressive/Conservative/Centrist",
    "Classify: 'Expand healthcare coverage' - Progressive/Conservative/Centrist",
    "Classify: 'Balanced budget amendment' - Progressive/Conservative/Centrist"
]

fingerprint = model_fingerprint("gpt-4o-mini", test_prompts)
print(f"Model fingerprint: {fingerprint[:16]}...")
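print("\n✓ Save this fingerprint and check periodically for drift")
print("✓ A changed fingerprint signals a possible change in model behavior;")
print("  rerun to rule out ordinary nondeterminism (outputs can vary even at temperature=0)")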
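
A single hash over all responses tells you that something changed, but not what. A variant that hashes each response separately makes it possible to see which test prompts are affected. This is a sketch under assumptions: `per_prompt_hashes` and `changed_prompts` are hypothetical helpers, and the responses are assumed to have been collected already as strings.

```python
import hashlib


def per_prompt_hashes(prompts, responses):
    """Map each test prompt to the SHA-256 hash of its response."""
    return {
        p: hashlib.sha256(r.encode("utf-8")).hexdigest()
        for p, r in zip(prompts, responses)
    }


def changed_prompts(old_hashes, new_hashes):
    """Return the prompts whose response hash differs between two runs."""
    return [p for p in old_hashes if new_hashes.get(p) != old_hashes[p]]
```

If only one prompt out of many changes between runs, sampling noise is a plausible explanation. If most of them change at once, a model update is more likely.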

### 6C. Validation with Human Labels

Measure agreement between LLM and human annotations using **Cohen's kappa**.

In [None]:
# Simulate human and LLM labels for validation
# 0 = Progressive, 1 = Centrist, 2 = Conservative
human_labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
llm_labels = np.array([0, 1, 1, 0, 1, 2, 0, 2, 2, 0])

# Calculate agreement metrics
kappa = cohen_kappa_score(human_labels, llm_labels)
accuracy = accuracy_score(human_labels, llm_labels)

print("Human-LLM Agreement:\n")
print(f"Cohen's κ: {kappa:.3f}")
print(f"Accuracy:  {accuracy:.3f}")

if kappa > 0.80:
    print("✓ Almost perfect agreement")
elif kappa > 0.60:
    print("⚠ Substantial agreement - consider refinement")
else:
    print("✗ Moderate or lower agreement - significant issues")

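# Confusion matrix (pass labels explicitly so all three classes always appear)
cm = confusion_matrix(human_labels, llm_labels, labels=[0, 1, 2])
print("\nConfusion Matrix (rows=human, cols=LLM):")
print(f"{'':12}  {'Prog':>5} {'Cent':>5} {'Cons':>5}")
for i, label in enumerate(["Progressive", "Centrist", "Conservative"]):
    print(f"{label:12}  " + " ".join(f"{int(n):>5}" for n in cm[i]))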

### 6D. Promptbook Documentation

A **promptbook** documents your entire annotation pipeline for replication.

In [None]:
promptbook = {
    "task": "political_stance_classification",
    "date_created": datetime.now().strftime("%Y-%m-%d"),
    "version": "1.0",
    "models": [
        {
            "name": "gpt-4o-mini",
            "type": "api",
            "provider": "openai",
            "temperature": 0,
            "seed": 42,
            "response_format": "json_object"
        }
    ],
    "prompt_template": "Analyze political stance: {text}\n\nReturn JSON with stance, confidence, reasoning.",
    "output_schema": {
        "stance": ["Progressive", "Conservative", "Centrist"],
        "confidence": "float (0-1)",
        "reasoning": "string"
    },
    # Placeholder validation figures for illustration; replace with your own measured results
    "validation": {
        "method": "human_comparison",
        "sample_size": 200,
        "cohen_kappa": 0.78,
        "accuracy": 0.82,
        "validation_date": datetime.now().strftime("%Y-%m-%d")
    },
    "fingerprint": fingerprint,
    "notes": "Validated on US political tweets. Low confidence (<0.7) texts manually reviewed."
}

print("✓ Promptbook created")
print("\nPromptbook includes:")
for key in promptbook.keys():
    print(f"  • {key}")

print("\nPromptbook (excerpt):")
print(json.dumps(promptbook, indent=2)[:500] + "...")

---

## Summary

This notebook demonstrated key techniques for reliable, reproducible LLM annotation:

### 1. Prompt Formatting
- **f-strings**: Simple variable insertion
- **Templates**: Reusable format strings with `.format()`
- **Few-shot**: Provide examples to guide behavior
- **Chain-of-thought**: Ask for step-by-step reasoning

### 2. Structured Outputs (Four Approaches)
- **Prompt-only**: Ask for JSON (least reliable)
- **Few-shot with schema**: Show examples (better)
- **JSON mode**: Force valid JSON with API parameter (recommended)
- **Function calling**: Typed schemas (most structured)

### 3. Robust Extraction
- Error handling with try/except
- Retry logic for parse failures
- Low temperature for consistency

### 4. Batch Annotation
- Efficient processing of multiple texts
- Logging timestamps and metadata
- Quality checks (low confidence, failures)

### 5. Mixture of Experts
- Aggregate predictions from multiple models
- Measure uncertainty (standard deviation)
- Identify high/low agreement cases

### 6. Validation & Replication
- **Comprehensive logging**: All parameters and outputs
- **Model fingerprinting**: Detect API drift
- **Human validation**: Cohen's κ > 0.80 target
- **Promptbook**: Document entire pipeline

## Best Practices Checklist

- ☐ Pin model versions (use specific snapshots like `gpt-4-0613`)
- ☐ Set temperature to 0 for near-deterministic outputs (identical runs are not guaranteed)
- ☐ Use seed parameter when supported
- ☐ Log everything (prompts, responses, settings, timestamps)
- ☐ Create promptbook for documentation
- ☐ Validate against human labels (Cohen's κ > 0.80)
- ☐ Test-retest reliability (check consistency over time)
- ☐ Model fingerprinting (detect API drift)
- ☐ Share code and configs for replication
- ☐ Consider open models for perfect reproducibility

## Recommended Workflow

1. **Start simple**: JSON mode zero-shot on validation sample
2. **If needed**: Fine-tune open model (LoRA) with 100-1000 labels
3. **Add replication harness**: Fixed params, logs, regression tests
4. **Report validation**: Human-LLM κ, test-retest, promptbook

## Further Reading

- Kraft et al. (2024): "Mixture of Experts for Ideological Scaling"
- Alizadeh et al. (2024): "Open-Source LLMs for Text Classification"
- Heseltine & Clemm von Hohenberg (2024): "GPT-4 Accuracy on Political Texts"
- Ziems et al. (2024): "Can Large Language Models Transform Computational Social Science?"

## Exercises

1. **Compare approaches**: Annotate the same 20 texts with all 4 structured output approaches. Which is most reliable?
2. **Measure test-retest**: Run the same annotation twice with identical settings. Calculate Cohen's κ between runs.
3. **Build ensemble**: Use 3+ models and compare ensemble vs. individual model performance.
4. **Create promptbook**: Document a complete annotation pipeline for your research domain.
5. **Validate**: Annotate 100 texts yourself, then with LLM. Calculate agreement metrics.