# Week 6: Structured Output Annotation Examples

This notebook demonstrates different approaches to getting structured outputs from LLMs for text annotation tasks.

**Topics covered:**
1. Basic prompting patterns
2. Four approaches to structured outputs
3. Robust JSON extraction
4. Batch annotation
5. Local models with Ollama (optional)
6. Mixture of experts (ensemble)
7. Validation and logging

---

## Setup

Install required packages and set up API keys.

In [None]:
# Install packages
!pip install -q openai pandas scikit-learn numpy

In [None]:
# Import libraries
import json
import os
import numpy as np
import pandas as pd
from datetime import datetime
import hashlib
import re

from openai import OpenAI
from sklearn.metrics import cohen_kappa_score, accuracy_score, confusion_matrix

In [None]:
# Set up OpenAI API key
import getpass

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
print("✓ API key configured")

---

# Example 1: Basic Prompting Patterns

Simple approaches to formatting prompts for text annotation.

In [None]:
# Sample political texts
texts = [
    "We must invest in renewable energy now!",
    "Cut taxes and reduce business regulations",
    "Healthcare is a human right for all citizens",
    "Maintain current spending levels and balanced budget"
]

print("Sample texts:")
for i, text in enumerate(texts, 1):
    print(f"{i}. {text}")

## 1A: Simple f-string Prompting

In [None]:
text = texts[0]

# Simple f-string approach
prompt = f"""Classify the political stance of this text as:
- Progressive
- Conservative
- Centrist

Text: {text}
Stance:"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

print(f"Text: {text}")
print(f"Response: {response.choices[0].message.content}")

## 1B: Reusable Template for Batch Processing

In [None]:
# Template approach for consistency
STANCE_TEMPLATE = """Classify the political stance of this text as:
- Progressive
- Conservative
- Centrist

Text: {text}
Stance:"""

results = []
for text in texts:
    prompt = STANCE_TEMPLATE.format(text=text)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    result = response.choices[0].message.content.strip()
    results.append({"text": text, "stance": result})
    print(f"Text: {text}")
    print(f"Stance: {result}\n")
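
The loop above stores whatever string the model returns, so replies such as "Progressive." or "Stance: conservative" end up as distinct values. A minimal sketch of a normalization step follows; `normalize_stance` and `VALID_STANCES` are hypothetical names introduced here, not part of the notebook or the OpenAI client.

```python
from typing import Optional

VALID_STANCES = ("Progressive", "Conservative", "Centrist")

def normalize_stance(raw: str) -> Optional[str]:
    """Map a free-text model reply onto one of the three labels, or None."""
    cleaned = raw.strip().strip(".!\"'").lower()
    # Exact match first: the common case where the model returns just the label.
    for label in VALID_STANCES:
        if cleaned == label.lower():
            return label
    # Otherwise accept the reply only if exactly one label is mentioned.
    mentioned = [label for label in VALID_STANCES if label.lower() in cleaned]
    return mentioned[0] if len(mentioned) == 1 else None

print(normalize_stance(" Progressive.\n"))
print(normalize_stance("Stance: conservative"))
print(normalize_stance("Either Progressive or Centrist"))
```

Returning `None` for ambiguous replies keeps them visible for manual review instead of silently assigning a label.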

## 1C: Few-Shot Prompting

In [None]:
# Few-shot template with examples
FEW_SHOT_TEMPLATE = """Classify political stance as Progressive, Conservative, or Centrist.

Examples:
Text: "Cut taxes and reduce regulations" -> Conservative
Text: "Expand healthcare access for all" -> Progressive
Text: "Maintain current spending levels" -> Centrist

Text: {text} ->"""

text = "Protect traditional family values and limit government overreach"
prompt = FEW_SHOT_TEMPLATE.format(text=text)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

print(f"Text: {text}")
print(f"Response: {response.choices[0].message.content}")

## 1D: Chain-of-Thought Prompting

In [None]:
COT_TEMPLATE = """Classify the stance and explain your reasoning.

Text: {text}

Think step-by-step:
1. What policy domain is this?
2. What values does it express?
3. What stance does this suggest?

Reasoning:
Stance:"""

text = "Invest heavily in public education and teacher salaries"
prompt = COT_TEMPLATE.format(text=text)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3
)

print(f"Text: {text}")
print(f"Response:\n{response.choices[0].message.content}")
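
Chain-of-thought replies mix reasoning with the label, so the label has to be pulled out before it can be stored. The sketch below assumes the model follows the template and ends with a "Stance: <label>" line; `extract_final_stance` is a hypothetical helper, and the sample reply is hand-written for illustration, not real model output.

```python
import re
from typing import Optional

# Matches "Stance: Progressive", "**Stance:** progressive", and similar variants.
STANCE_PATTERN = re.compile(
    r"stance\W*(progressive|conservative|centrist)\b", re.IGNORECASE
)

def extract_final_stance(cot_output: str) -> Optional[str]:
    """Return the last stance label stated in a chain-of-thought reply."""
    matches = STANCE_PATTERN.findall(cot_output)
    # Take the last match so the conclusion wins over any earlier mention.
    return matches[-1].capitalize() if matches else None

sample = """Reasoning:
1. The policy domain is education.
2. It values public investment.
3. This suggests a left-leaning position.
**Stance:** Progressive"""

print(extract_final_stance(sample))
```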

---

# Example 2: Four Approaches to Structured Outputs

Demonstrates the progression from basic to most reliable structured output methods.

## Approach 1: Prompt-Only Formatting (basic)

In [None]:
def llm(prompt: str) -> str:
    """Generic LLM call wrapper"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content

text = "We must expand Medicare to cover everyone"
prompt = f"""
Extract fields as JSON and respond ONLY with valid JSON:
{{
  "stance": "Progressive/Conservative/Centrist",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation"
}}

Input: {text}
"""

raw = llm(prompt)  # Call once so the output we print is the output we parsed
try:
    data = json.loads(raw)
    print("✓ Successfully parsed JSON")
    print(json.dumps(data, indent=2))
except json.JSONDecodeError as e:
    print(f"✗ Failed to parse: {e}")
    print(f"Raw output: {raw}")

## Approach 2: Few-Shot with Schema + Examples (better)

In [None]:
text = "Cut taxes and reduce regulations"
prompt = """
You output ONLY valid JSON with keys: stance, confidence, reasoning.

Example:
Input: Expand healthcare access for all
Output: {{"stance":"Progressive","confidence":0.9,"reasoning":"Universal healthcare is progressive policy"}}

Example:
Input: Maintain current spending levels
Output: {{"stance":"Centrist","confidence":0.8,"reasoning":"Status quo signals moderate position"}}

Now do the same:
Input: {input}
Output:
""".format(input=text)

data = json.loads(llm(prompt))
print("✓ Successfully parsed JSON")
print(json.dumps(data, indent=2))

## Approach 3: Provider JSON Mode / Schema (more reliable)

In [None]:
# Define schema for structured output
schema = {
  "type": "object",
  "properties": {
    "stance": {
      "type": "string",
      "enum": ["Progressive", "Conservative", "Centrist"]
    },
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    "reasoning": {"type": "string"}
  },
  "required": ["stance", "confidence", "reasoning"],
  "additionalProperties": False
}

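text = "Protect traditional family values"
response = client.chat.completions.create(
    # JSON mode needs a model that accepts response_format; base "gpt-4" rejects it.
    # "gpt-4-turbo" is one option; swap in any JSON-mode model you have access to.
    model="gpt-4-turbo",
    messages=[
        {"role": "system",
         "content": "You are a political analyst. Return valid JSON only, "
                    f"matching this JSON Schema: {json.dumps(schema)}"},
        {"role": "user",
         "content": f"Analyze political stance: {text}"}
    ],
    response_format={"type": "json_object"},  # Force JSON mode (syntax only)
    temperature=0
)

data = json.loads(response.choices[0].message.content)
print("✓ JSON mode returned parseable JSON (the schema is requested in the prompt, not enforced)")
print(json.dumps(data, indent=2))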
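
JSON mode constrains syntax only; the `schema` dict defined above is not enforced by it, so keys, enum values, and ranges still need checking. Below is a minimal plain-Python sketch of that check, mirroring the schema's constraints. `validate_annotation` is a hypothetical helper; a schema-validation library would be an alternative.

```python
from typing import Any, List

ALLOWED_STANCES = ("Progressive", "Conservative", "Centrist")
REQUIRED_KEYS = {"stance", "confidence", "reasoning"}

def validate_annotation(data: Any) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    extra = set(data) - REQUIRED_KEYS
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    if "stance" in data and data["stance"] not in ALLOWED_STANCES:
        problems.append(f"stance not in enum: {data['stance']!r}")
    if "confidence" in data:
        conf = data["confidence"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            problems.append("confidence is not a number")
        elif not 0 <= conf <= 1:
            problems.append(f"confidence out of range: {conf}")
    if "reasoning" in data and not isinstance(data["reasoning"], str):
        problems.append("reasoning is not a string")
    return problems

good = {"stance": "Progressive", "confidence": 0.9, "reasoning": "Expands coverage"}
bad = {"stance": "Liberal", "confidence": 1.5}
print(validate_annotation(good))
print(validate_annotation(bad))
```

Collecting all problems, instead of raising on the first one, makes it easier to log why a record was rejected.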

## Approach 4: Function/Tool Calling (most structured)

In [None]:
# Define function schema with typed arguments
tools = [{
  "type": "function",
  "function": {
    "name": "analyze_stance",
    "description": "Return structured political stance analysis",
    "parameters": {
      "type": "object",
      "properties": {
        "stance": {
          "type": "string",
          "enum": ["Progressive", "Conservative", "Centrist"]
        },
        "confidence": {"type": "number"},
        "reasoning": {"type": "string"}
      },
      "required": ["stance", "confidence", "reasoning"]
    }
  }
}]

text = "Expand social safety nets"
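response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Analyze: {text}"}],
    tools=tools,
    # Force the call: with "auto" the model may answer in plain text,
    # leaving message.tool_calls as None.
    tool_choice={"type": "function", "function": {"name": "analyze_stance"}},
    temperature=0
)

# Extract structured arguments (still validate them; the schema is a strong hint, not a proof)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
print("✓ Function calling returned typed arguments")
print(f"Function called: {call.function.name}")
print(json.dumps(args, indent=2))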

## Comparison Table

In [None]:
comparison_df = pd.DataFrame({
    "Approach": ["Prompt-only", "Few-shot", "JSON mode", "Function calling"],
    "Reliability": ["Low", "Medium", "High", "Highest"],
    "Flexibility": ["High", "High", "Medium", "Low"],
    "Support": ["Universal", "Universal", "Most APIs", "OpenAI, Anthropic, Google"]
})

print(comparison_df.to_string(index=False))
print("\n✓ Recommendation: Start with JSON mode (Approach 3)")

---

# Example 3: Robust JSON Extraction

Shows how to reliably extract JSON from models without native JSON mode.

In [None]:
# Clear instructions for JSON-only output
INSTRUCTIONS = (
    'Return only a JSON object like this:\n'
    '{"stance":"Progressive|Conservative|Centrist|null",'
    '"confidence":0-1,"reasoning":"brief"}\n'
    'Do not add any extra text.'
)

def get_labels(client, text, model="gpt-4", max_retries=1):
    """
    Robust JSON extraction with error handling and retry logic
    """
    # 1) Ask for JSON only with low temperature
    prompt = f'{INSTRUCTIONS}\n\nText: "{text}"'
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1  # Low temp for consistency
    )
    
    output = response.choices[0].message.content.strip()
    
    # 2) Try to parse as JSON
    try:
        return json.loads(output)
    except json.JSONDecodeError as e:
        print(f"⚠ Parse failed on first attempt: {e}")
        print(f"Raw output: {output}\n")
        
        if max_retries > 0:
            # 3) One retry asking for just JSON again
            fix_prompt = (
                "That was not valid JSON. Please send ONLY the JSON object, "
                "nothing else. No explanations, no markdown fences."
            )
            
            retry_response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": output},
                    {"role": "user", "content": fix_prompt}
                ],
                temperature=0.0  # Zero temp for retry
            )
            
            retry_output = retry_response.choices[0].message.content.strip()
            print(f"Retry output: {retry_output}\n")
            
            try:
                return json.loads(retry_output)
            except json.JSONDecodeError as e:
                print(f"✗ Parse failed after retry: {e}")
                raise
        else:
            raise

# Test cases
test_texts = [
    "We must expand Medicare to cover everyone",
    "Cut taxes and reduce government spending",
    "Maintain balanced approach to fiscal policy"
]

print("Testing robust extraction with retry logic:\n")

for i, text in enumerate(test_texts, 1):
    print(f"Test {i}: {text}")
    try:
        result = get_labels(client, text)
        print(f"✓ Success: {json.dumps(result, indent=2)}\n")
    except Exception as e:
        print(f"✗ Failed: {e}\n")
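
`get_labels` retries when parsing fails, but many failures are just a valid object wrapped in markdown fences or surrounding prose. A sketch of a local extraction step that could run before spending a retry is shown below. `extract_json_object` is a hypothetical helper built on the standard library's `json.JSONDecoder.raw_decode`, and the wrapped reply is hand-written for illustration.

```python
import json
from typing import Any, Dict

def extract_json_object(raw: str) -> Dict[str, Any]:
    """Return the first JSON object embedded in raw, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
wrapped = (
    f"Here is the result:\n{fence}json\n"
    '{"stance": "Centrist", "confidence": 0.8, "reasoning": "status quo"}\n'
    f"{fence}\nHope that helps!"
)
print(extract_json_object(wrapped))
```

Scanning from each `{` is slower than a single regex but avoids mis-handling braces inside string values, since the real JSON parser decides where the object ends.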

---

# Example 4: Batch Annotation

Efficient batch processing with structured outputs and logging.

In [None]:
# Sample corpus
texts = [
    "We need stronger borders and immigration control",
    "Healthcare is a human right for all",
    "Balance the budget through moderate tax reform",
    "Invest in renewable energy infrastructure",
    "Cut regulations on small businesses",
    "Expand access to affordable childcare",
    "Maintain current defense spending levels",
    "Protect voting rights and access",
    "Reduce corporate tax rates",
    "Fund public education and teacher salaries"
]

# Template for consistent prompting
JSON_TEMPLATE = """Analyze this political text: {text}

Return JSON with keys:
- stance (Progressive/Conservative/Centrist)
- confidence (0-1)
- reasoning (brief explanation)
- policy_domain (e.g., healthcare, economy, education)"""

def annotate_text(text, model="gpt-4-turbo", temperature=0):  # default must support JSON mode; base "gpt-4" does not
    """Annotate a single text with structured output"""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a political analyst. Return valid JSON only."},
            {"role": "user",
             "content": JSON_TEMPLATE.format(text=text)}
        ],
        response_format={"type": "json_object"},
        temperature=temperature
    )
    
    return json.loads(response.choices[0].message.content)

def batch_annotate(texts, model="gpt-4-turbo", temperature=0):  # default must support JSON mode
    """Annotate multiple texts"""
    results = []
    
    print(f"Annotating {len(texts)} texts with {model}...")
    print(f"Temperature: {temperature}\n")
    
    for i, text in enumerate(texts, 1):
        print(f"[{i}/{len(texts)}] Processing: {text[:50]}...")
        
        try:
            annotation = annotate_text(text, model=model, temperature=temperature)
            annotation['text'] = text
            annotation['model'] = model
            annotation['temperature'] = temperature
            annotation['timestamp'] = datetime.now().isoformat()
            annotation['success'] = True
            annotation['error'] = None
            
        except Exception as e:
            print(f"  ✗ Error: {e}")
            annotation = {
                'text': text,
                'model': model,
                'temperature': temperature,
                'timestamp': datetime.now().isoformat(),
                'success': False,
                'error': str(e),
                'stance': None,
                'confidence': None,
                'reasoning': None,
                'policy_domain': None
            }
        
        results.append(annotation)
    
    df = pd.DataFrame(results)
    print(f"\n✓ Completed: {df['success'].sum()}/{len(df)} successful")
    
    return df

# Run batch annotation
# annotate_text uses JSON mode, which base "gpt-4" rejects; use a model that supports it
df = batch_annotate(texts, model="gpt-4-turbo", temperature=0)

In [None]:
# Display results
print("\nSummary by stance:")
print(df['stance'].value_counts())

print("\nAverage confidence by stance:")
print(df.groupby('stance')['confidence'].mean().round(3))

print("\nSample annotations:")
print(df[['text', 'stance', 'confidence', 'policy_domain']].head(3))

In [None]:
# Quality checks
print("\nQuality Checks:")
print("=" * 60)

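# Check for low confidence predictions
# (coerce to numeric first: the model may return confidence as a string)
confidence = pd.to_numeric(df['confidence'], errors='coerce')
low_confidence = df[confidence < 0.7]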
print(f"\nLow confidence annotations (< 0.7): {len(low_confidence)}")
if len(low_confidence) > 0:
    print(low_confidence[['text', 'stance', 'confidence']])

# Check for null values
nulls = df[df['stance'].isna()]
print(f"\nMissing stance labels: {len(nulls)}")

print("\n✓ Batch annotation complete!")
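
`batch_annotate` records a failure and moves on, which means a transient error (rate limit, timeout) costs an annotation. A minimal sketch of a retry wrapper with exponential backoff follows; `call_with_retries` is a hypothetical helper, and the `flaky` function merely simulates a call that fails twice so the example runs without an API key.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of base_delay * 2**k."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")

# Simulated call that fails twice, then succeeds.
calls = {"n": 0}

def flaky() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return {"stance": "Centrist"}

delays: List[float] = []  # record delays instead of actually sleeping
result = call_with_retries(flaky, max_attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap the per-text call, for example `call_with_retries(lambda: annotate_text(text))`. Retrying on every exception is a simplification; in practice it is usually better to retry only on errors known to be transient.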

---

# Example 5: Mixture of Experts (Ensemble)

Multi-model ensemble approach based on Kraft et al. (2024).

In [None]:
def get_stance_score(text, model="gpt-4"):
    """
    Get ideological position score from a model
    Returns: float from -1 (progressive) to +1 (conservative)
    """
    prompt = f"""Rate this text on ideology from -1 (most progressive)
to +1 (most conservative). Return only the number.

Text: {text}"""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return float(response.choices[0].message.content.strip())

def ensemble_stance(text, models):
    """Aggregate stance estimates across multiple models"""
    scores = []
    individual = {}
    
    for model in models:
        try:
            score = get_stance_score(text, model=model)
            scores.append(score)
            individual[model] = score
            print(f"  {model:20}: {score:+.3f}")
        except Exception as e:
            print(f"  {model:20}: Error - {e}")
            continue
    
    if not scores:
        return None
    
    return {
        "mean": np.mean(scores),
        "median": np.median(scores),
        "std": np.std(scores),
        "min": np.min(scores),
        "max": np.max(scores),
        "individual": individual,
        "n_models": len(scores)
    }
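
`get_stance_score` calls `float()` on the raw reply, which fails whenever the model adds words around the number, and nothing keeps the value inside [-1, 1]. A sketch of a more forgiving parser is shown below; `parse_score` is a hypothetical helper, and clamping out-of-range values (instead of rejecting them) is an assumption made here, not something the notebook prescribes.

```python
import re

# A signed integer or decimal, e.g. "-0.7", "+1", ".5".
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")

def parse_score(raw: str, low: float = -1.0, high: float = 1.0) -> float:
    """Extract the first number in raw and clamp it to [low, high]."""
    # Some models emit a Unicode minus sign (U+2212); normalize it first.
    match = NUMBER.search(raw.replace("\u2212", "-"))
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return max(low, min(high, float(match.group())))

print(parse_score("-0.7"))
print(parse_score("Score: 0.3 (leaning conservative)"))
print(parse_score("+1.5"))
```

Because the first number wins, a reply that restates the scale before the answer would be misread, so keeping the "Return only the number" instruction in the prompt still matters.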

In [None]:
# Example 1: Single text with multiple models
models = ["gpt-4", "gpt-3.5-turbo"]

text = "We must protect traditional family values and limit government overreach"
print(f"Text: {text}\n")
print("Individual model scores:")

result = ensemble_stance(text, models)

if result:
    print(f"\nEnsemble results:")
    print(f"  Mean:      {result['mean']:+.3f}")
    print(f"  Median:    {result['median']:+.3f}")
    print(f"  Std dev:   {result['std']:.3f}")
    print(f"  Range:     [{result['min']:+.3f}, {result['max']:+.3f}]")
    print(f"  Agreement: {'High' if result['std'] < 0.3 else 'Medium' if result['std'] < 0.6 else 'Low'}")

In [None]:
# Example 2: Batch analysis with ensemble
tweets = [
    "Expand Medicare to cover everyone",
    "Cut taxes and regulations on businesses",
    "Protect voting rights and access",
    "Secure the border and enforce immigration laws",
    "Invest in public schools and teacher salaries"
]

results = []

print(f"\nAnalyzing {len(tweets)} texts with ensemble...\n")

for i, tweet in enumerate(tweets, 1):
    print(f"[{i}/{len(tweets)}] {tweet}")
    ensemble = ensemble_stance(tweet, models)
    
    if ensemble:
        results.append({
            "text": tweet,
            "position": ensemble["mean"],
            "uncertainty": ensemble["std"],
            "n_models": ensemble["n_models"]
        })
    print()

ensemble_df = pd.DataFrame(results)

print("\nPosition scores (negative = progressive, positive = conservative):")
print(ensemble_df[['text', 'position', 'uncertainty']])

---

# Example 6: Validation and Logging

Best practices for reproducibility and validation.

## Comprehensive Logging

In [None]:
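# Default is a pinned snapshot that accepts JSON mode; "gpt-4-0613" rejects response_format
def annotate_with_logging(text, model="gpt-4-turbo-2024-04-09", temperature=0, seed=42):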
    """Annotate text with complete logging for reproducibility"""
    prompt = f"""Analyze political stance: {text}

Return JSON with stance, confidence, reasoning."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a political analyst. Return valid JSON only."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"},
        temperature=temperature,
        seed=seed
    )
    
    # Parse result
    result = json.loads(response.choices[0].message.content)
    
    # Create comprehensive log entry
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "text": text,
        "model": model,
        "temperature": temperature,
        "seed": seed,
        "prompt": prompt,
        "response": result,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens
        },
        "finish_reason": response.choices[0].finish_reason
    }
    
    return result, log_entry

text = "Expand social safety nets and increase minimum wage"
result, log = annotate_with_logging(text)

print(f"✓ Annotated: {text}")
print(f"  Stance: {result.get('stance')}")
print(f"\nLog entry (partial):")
print(json.dumps({k: log[k] for k in ['timestamp', 'model', 'temperature', 'seed']}, indent=2))
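
The log entry above only lives in memory. One common way to persist such entries is an append-only JSON Lines file, one entry per line, sketched below. `append_log` and `read_logs` are hypothetical helpers, and the entry written here is a hand-made stand-in for the `log` dict so the example runs without an API call.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator

def append_log(path: Path, entry: Dict[str, Any]) -> None:
    """Append one log entry as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_logs(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield log entries back in the order they were written."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Demo in a temporary directory; in a project this would be a fixed path.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "annotations.jsonl"
    append_log(log_path, {"text": "Example A", "model": "example-model", "seed": 42})
    append_log(log_path, {"text": "Example B", "model": "example-model", "seed": 42})
    entries = list(read_logs(log_path))

print(len(entries), entries[0]["text"])
```

Appending line by line means a crash mid-batch loses at most the entry being written, and the file loads directly with `pd.read_json(path, lines=True)`.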

## Model Fingerprinting (Detect API Drift)

In [None]:
def model_fingerprint(model, test_prompts, temperature=0, seed=42):
    """Create fingerprint to detect if model behavior has changed"""
    responses = []
    
    for prompt in test_prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            seed=seed
        )
        responses.append(response.choices[0].message.content)
    
    # Hash concatenated responses
    fingerprint = hashlib.sha256(
        "".join(responses).encode()
    ).hexdigest()
    
    return fingerprint

# Create test set for fingerprinting
test_prompts = [
    "Classify: 'Cut taxes for businesses' - Progressive/Conservative/Centrist",
    "Classify: 'Expand healthcare coverage' - Progressive/Conservative/Centrist",
    "Classify: 'Balanced budget amendment' - Progressive/Conservative/Centrist"
]

fingerprint = model_fingerprint("gpt-4-0613", test_prompts)
print(f"Model fingerprint: {fingerprint[:16]}...")
print("\n✓ Save this fingerprint and check periodically for drift")
print("⚠ A changed fingerprint signals possible drift; outputs can also vary between runs even with temperature=0 and a seed, so re-run before concluding")
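
A single hash over all responses says that something changed but not which prompt. A sketch of a per-prompt variant follows; `per_prompt_fingerprints` and `changed_prompts` are hypothetical helpers, and the response strings are invented for illustration, not real model output.

```python
import hashlib
from typing import Dict, List, Sequence

def per_prompt_fingerprints(
    prompts: Sequence[str], responses: Sequence[str]
) -> Dict[str, str]:
    """Hash each response separately, keyed by its prompt."""
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    return {
        prompt: hashlib.sha256(response.encode("utf-8")).hexdigest()
        for prompt, response in zip(prompts, responses)
    }

def changed_prompts(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List prompts whose response hash differs between two runs."""
    return [prompt for prompt in old if new.get(prompt) != old[prompt]]

prompts = ["Classify: 'Cut taxes'", "Classify: 'Expand healthcare'"]
run_1 = per_prompt_fingerprints(prompts, ["Conservative", "Progressive"])
run_2 = per_prompt_fingerprints(prompts, ["Conservative", "Progressive."])

print(changed_prompts(run_1, run_2))
```

As the second run shows, a trailing period is enough to change a hash, so normalizing responses (stripping whitespace and punctuation) before hashing may reduce false alarms.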

## Validation Strategies

In [None]:
# Simulate human and LLM labels for validation
human_labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])  # 0=Prog, 1=Centrist, 2=Cons
llm_labels = np.array([0, 1, 1, 0, 1, 2, 0, 2, 2, 0])

# 1. Human-LLM Agreement (Cohen's Kappa)
kappa = cohen_kappa_score(human_labels, llm_labels)
accuracy = accuracy_score(human_labels, llm_labels)

print("Human-LLM Agreement:\n")
print(f"Cohen's κ: {kappa:.3f}")
print(f"Accuracy:  {accuracy:.3f}")

# Interpretation bands follow Landis & Koch (1977)
if kappa > 0.80:
    print("✓ Almost perfect agreement")
elif kappa > 0.60:
    print("⚠ Substantial agreement - consider refinement")
else:
    print("✗ Moderate or lower agreement - significant issues")

# 2. Confusion Matrix
cm = confusion_matrix(human_labels, llm_labels)
print("\nConfusion Matrix (rows=human, cols=LLM):")
print("              Prog  Cent  Cons")
for i, label in enumerate(["Progressive", "Centrist", "Conservative"]):
    print(f"{label:12}  {cm[i]}")
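
To make the κ value less of a black box, the sketch below recomputes it by hand for the same simulated labels: observed agreement p_o, chance agreement p_e from the two raters' label frequencies, and κ = (p_o − p_e) / (1 − p_e). `cohen_kappa` is a hypothetical helper written for illustration; `sklearn.metrics.cohen_kappa_score` remains the function used in the notebook.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout (assumed convention)
    return (p_o - p_e) / (1 - p_e)

human = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
llm = [0, 1, 1, 0, 1, 2, 0, 2, 2, 0]

# p_o = 8/10 = 0.80, p_e = (4*4 + 3*3 + 3*3)/100 = 0.34
print(f"kappa = {cohen_kappa(human, llm):.3f}")  # kappa = 0.697
```

The gap between accuracy (0.80) and κ (about 0.70) is the correction for agreement expected by chance.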

## Promptbook Documentation

In [None]:
promptbook = {
    "task": "political_stance_classification",
    "date_created": "2024-10-08",
    "version": "1.0",
    "models": [
        {
            "name": "gpt-4-0613",
            "type": "api",
            "provider": "openai",
            "temperature": 0,
            "seed": 42,
            "response_format": "json"
        }
    ],
    "prompt_template": "Analyze political stance: {text}\n\nReturn JSON with stance, confidence, reasoning.",
    "output_schema": {
        "stance": ["Progressive", "Conservative", "Centrist"],
        "confidence": "float (0-1)",
        "reasoning": "string"
    },
    "validation": {
        "method": "human_comparison",
        "sample_size": 200,
        "cohen_kappa": 0.78,
        "accuracy": 0.82,
        "validation_date": "2024-10-08"
    },
    "fingerprint": fingerprint,
    "notes": "Validated on US political tweets. Low confidence (<0.7) texts manually reviewed."
}

print("✓ Promptbook created")
print("\nPromptbook includes:")
for key in promptbook.keys():
    print(f"  • {key}")

print("\n" + json.dumps(promptbook, indent=2)[:500] + "...")

## Reproducibility Checklist

In [None]:
checklist = [
    ("☐ Pin model versions", "Use specific snapshots (gpt-4-0613, not gpt-4)"),
    ("☐ Set temperature to 0", "Reduces (but does not eliminate) output variation"),
    ("☐ Use seed parameter", "When supported by API"),
    ("☐ Log everything", "Prompts, responses, settings, timestamps"),
    ("☐ Create promptbook", "Document complete annotation pipeline"),
    ("☐ Validate against humans", "Cohen's κ > 0.80 target"),
    ("☐ Test-retest reliability", "Check consistency over time"),
    ("☐ Model fingerprinting", "Detect API drift"),
    ("☐ Share code & configs", "Enable exact replication"),
    ("☐ Use open models when possible", "Fixed weights remove provider-side drift")
]

print("Reproducibility Checklist:\n")
for item, description in checklist:
    print(f"{item:30} - {description}")

print("\n✓ Following these practices enables credible, replicable research")

---

# Summary

## Key Takeaways

1. **Keep prompts simple**: f-strings and templates are enough
2. **Force structured output**: Use JSON mode (Approach 3) or function calling (Approach 4)
3. **Log everything**: Prompts, responses, settings, timestamps
4. **Validate**: Cohen's κ > 0.80 with human labels
5. **Detect drift**: Model fingerprinting for API changes
6. **Ensemble**: Multiple models > single model for robustness
7. **Document**: Create promptbooks for replication

## Recommended Workflow

1. Start with zero-shot JSON mode on a validation slice
2. If needed, fine-tune an open model (LoRA) with 100-1000 labels
3. Add a replication harness: fixed parameters, logs, and a regression test
4. Report human-LLM κ, test-retest reliability, and the promptbook

## Resources

- Kraft et al. (2024): Mixture of Experts for Ideological Scaling
- Alizadeh et al. (2024): Open-Source LLMs for Text Classification
- Heseltine & Clemm von Hohenberg (2024): GPT-4 Accuracy on Political Texts
- BPS Replication Guide (2025): Standards for LLM Reproducibility