# 02 - Structured Outputs

**Get reliable JSON and structured data from LLMs.**

## Learning Objectives

By the end of this notebook, you will:
- Use JSON mode for structured responses
- Validate outputs with Pydantic
- Extract entities and data from text
- Handle validation errors gracefully

## Table of Contents

1. [Why Structured Outputs?](#why)
2. [JSON Mode](#json-mode)
3. [Pydantic Validation](#pydantic)
4. [Entity Extraction](#extraction)
5. [Error Handling](#errors)
6. [Exercises](#exercises)
7. [Checkpoint](#checkpoint)

In [None]:
# GUIDED: Setup
import os
import sys
import json
from pathlib import Path

sys.path.append(str(Path.cwd().parent))

from dotenv import load_dotenv
load_dotenv(Path.cwd().parent / ".env")

print("Setup complete!")

---
## 1. Why Structured Outputs? <a id='why'></a>

LLMs return free-form text, but applications need structured data:

```
LLM Response (text)          What You Need (data)
─────────────────────        ─────────────────────
"The product costs $29.99    {"product": "Widget",
 and is available in blue     "price": 29.99,
 and red colors."             "colors": ["blue", "red"]}
```

### Challenges:
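- The model may wrap the JSON in extra prose or Markdown code fences
- The JSON may be malformed (trailing commas, unclosed brackets, truncated output)
- Fields may be missing or have the wrong type
- The format may vary from one request to the next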
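
These failure modes are easy to reproduce without calling any API. The sketch below uses hand-written strings standing in for model replies (they are illustrative, not real outputs) and a hypothetical helper, `check_reply`, that classifies each one using only the standard library.

```python
import json

# Hypothetical replies, written by hand to illustrate each failure mode.
replies = {
    "clean": '{"product": "Widget", "price": 29.99, "colors": ["blue", "red"]}',
    "extra_text": 'Sure! Here is the data: {"product": "Widget", "price": 29.99}',
    "malformed": '{"product": "Widget", "price": 29.99,}',
    "wrong_type": '{"product": "Widget", "price": "29.99", "colors": "blue"}',
}

def check_reply(raw: str) -> str:
    """Classify a reply as 'ok', 'invalid_json', or 'bad_fields'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    # Manual checks for the expected shape of the data.
    if not isinstance(data, dict):
        return "bad_fields"
    if not isinstance(data.get("price"), (int, float)):
        return "bad_fields"
    if not isinstance(data.get("colors"), list):
        return "bad_fields"
    return "ok"

results = {name: check_reply(raw) for name, raw in replies.items()}
for name, verdict in results.items():
    print(f"{name:12} -> {verdict}")
```

Hand-written checks like these grow quickly as the schema grows, which is the motivation for the Pydantic models in section 3.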

---
## 2. JSON Mode <a id='json-mode'></a>

In [None]:
# GUIDED: OpenAI JSON mode
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Extract product info. Return JSON with: name, price, colors (array)."
        },
        {
            "role": "user",
            "content": "The SuperWidget Pro costs $49.99 and comes in black, silver, and gold."
        }
    ],
    response_format={"type": "json_object"}
)

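# Parse the JSON response. JSON mode constrains the reply to syntactically
# valid JSON; it does not enforce a schema, so keys and types still depend
# on the prompt.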
data = json.loads(response.choices[0].message.content)
print("Extracted data:")
print(json.dumps(data, indent=2))

In [None]:
# GUIDED: Anthropic structured output
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system="""Extract product information from the text.
Return ONLY valid JSON with this structure:
{"name": "string", "price": number, "colors": ["string"]}""",
    messages=[
        {
            "role": "user",
            "content": "The SuperWidget Pro costs $49.99 and comes in black, silver, and gold."
        }
    ]
)

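# No JSON-mode flag is set here: the format relies on the prompt alone, so
# json.loads raises JSONDecodeError if the model adds prose or code fences.
data = json.loads(response.content[0].text)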
print("Extracted data:")
print(json.dumps(data, indent=2))
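
When the format is enforced only by the prompt, a reply can arrive wrapped in prose or a Markdown code fence. One possible way to tolerate that is sketched below. `parse_json_reply` is a hypothetical helper (not part of any SDK) that scans for the first decodable JSON object using the standard library's `json.JSONDecoder.raw_decode`, and the sample reply is hand-written.

```python
import json

def parse_json_reply(raw: str) -> dict:
    """Return the first JSON object found in a model reply.

    Tolerates surrounding prose and Markdown code fences.
    Raises ValueError if no JSON object can be decoded.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            start = raw.find("{", start + 1)
    raise ValueError("No JSON object found in reply")

fence = "`" * 3  # a Markdown code fence, built programmatically
reply = (
    "Here is the result:\n"
    f"{fence}json\n"
    '{"name": "SuperWidget Pro", "price": 49.99}\n'
    f"{fence}"
)
data = parse_json_reply(reply)
print(data)
```

This only recovers the JSON text. Whether the fields are present and correctly typed is a separate question, handled by validation in the next section.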

---
## 3. Pydantic Validation <a id='pydantic'></a>

In [None]:
# GUIDED: Define Pydantic models
from pydantic import BaseModel, Field, field_validator
from typing import Optional

class Product(BaseModel):
    """Product information extracted from text."""
    name: str = Field(description="Product name")
    price: float = Field(ge=0, description="Price in USD")
    colors: list[str] = Field(default_factory=list, description="Available colors")
    in_stock: Optional[bool] = Field(default=None, description="Stock status")
    
    @field_validator('name')
    @classmethod
    def name_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError('Name cannot be empty')
        return v.strip()

# Test validation
product = Product(
    name="SuperWidget Pro",
    price=49.99,
    colors=["black", "silver", "gold"]
)
print(product)
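
The cell above only exercises the happy path. As a complementary sketch, the block below feeds deliberately bad input to a trimmed copy of the `Product` model and lists what Pydantic reports. It assumes Pydantic v2, and `bad_input` is made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class Product(BaseModel):
    name: str
    price: float = Field(ge=0)
    colors: list[str] = Field(default_factory=list)

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Name cannot be empty")
        return v.strip()

# Three problems at once: blank name, negative price, colors not a list.
bad_input = {"name": "   ", "price": -5, "colors": "blue"}

failed_fields: list[str] = []
try:
    Product.model_validate(bad_input)
except ValidationError as exc:
    # errors() returns one dict per failing field, with its location and message.
    for err in exc.errors():
        print(f"{err['loc'][0]}: {err['msg']}")
    failed_fields = sorted(str(err["loc"][0]) for err in exc.errors())
```

Pydantic collects every failing field in a single `ValidationError` rather than stopping at the first one, which is useful later when feeding errors back to the model.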

In [None]:
# GUIDED: Validate LLM output with Pydantic
from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError
import json

class Person(BaseModel):
    name: str
    age: int = Field(ge=0, le=150)
    occupation: str
    skills: list[str] = Field(default_factory=list)

def extract_person(text: str) -> Person:
    """Extract person info from text with validation."""
    client = OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
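            {
                "role": "system",
                # json.dumps produces real JSON; formatting the dict directly
                # would embed Python's repr (single quotes, True/None)
                "content": f"""Extract person information from text.
Return JSON matching this schema: {json.dumps(Person.model_json_schema())}"""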
            },
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"}
    )
    
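    data = json.loads(response.choices[0].message.content)
    # model_validate raises ValidationError for any bad payload, including a
    # non-object one, where Person(**data) would raise TypeError instead
    return Person.model_validate(data)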

# Test it
text = """John Smith is a 35-year-old software engineer. 
He specializes in Python, machine learning, and cloud architecture."""

person = extract_person(text)
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"Occupation: {person.occupation}")
print(f"Skills: {person.skills}")
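
Parsing and validation can also be done in one call. The sketch below assumes Pydantic v2 and uses hand-written strings in place of API responses. `model_validate_json` accepts the raw JSON text and reports malformed JSON as a `ValidationError`, so there is only one exception type to handle.

```python
from pydantic import BaseModel, Field, ValidationError

class Person(BaseModel):
    name: str
    age: int = Field(ge=0, le=150)
    occupation: str
    skills: list[str] = Field(default_factory=list)

# Hand-written stand-in for response.choices[0].message.content
raw = '{"name": "John Smith", "age": 35, "occupation": "software engineer"}'
person = Person.model_validate_json(raw)
print(person)

# A truncated reply: invalid JSON surfaces as a ValidationError too.
error_type = None
try:
    Person.model_validate_json('{"name": "John Smith", "age": 35,')
except ValidationError as exc:
    error_type = exc.errors()[0]["type"]
    print(f"Rejected: {error_type}")
```

Note that `skills` was absent from the reply and fell back to its default empty list, while a missing required field such as `occupation` would have been an error.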

---
## 4. Entity Extraction <a id='extraction'></a>

In [None]:
# GUIDED: Extract multiple entities
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class Task(BaseModel):
    title: str
    description: Optional[str] = None
    priority: Priority = Priority.MEDIUM
    assignee: Optional[str] = None
    deadline: Optional[str] = None

class TaskList(BaseModel):
    tasks: list[Task]

def extract_tasks(meeting_notes: str) -> TaskList:
    """Extract tasks from meeting notes."""
    from openai import OpenAI
    
    client = OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                # json.dumps produces real JSON rather than Python's dict repr
                "content": f"""Extract action items/tasks from meeting notes.
Return JSON matching: {json.dumps(TaskList.model_json_schema())}
Priority levels: low, medium, high, critical"""
            },
            {"role": "user", "content": meeting_notes}
        ],
        response_format={"type": "json_object"}
    )
    
    data = json.loads(response.choices[0].message.content)
    return TaskList.model_validate(data)  # Validate with Pydantic

# Test with meeting notes
notes = """
Team meeting 2024-01-15:

- John will update the API documentation by Friday (high priority)
- Sarah needs to fix the login bug ASAP - this is blocking users
- We should clean up the test database sometime next week
- Mike to prepare demo for the client meeting on Monday
"""

tasks = extract_tasks(notes)
print(f"Found {len(tasks.tasks)} tasks:\n")
for task in tasks.tasks:
    print(f"[{task.priority.value.upper()}] {task.title}")
    if task.assignee:
        print(f"  Assignee: {task.assignee}")
    if task.deadline:
        print(f"  Deadline: {task.deadline}")
    print()
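
Enum fields are strict about spelling: a reply containing "High" instead of "high" fails validation. One option, sketched below under the assumption of Pydantic v2, is a `mode="before"` validator that normalizes the string before the enum check. `normalize_priority` is a name chosen for this example, and the inputs are hand-written.

```python
from enum import Enum
from pydantic import BaseModel, ValidationError, field_validator

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class Task(BaseModel):
    title: str
    priority: Priority = Priority.MEDIUM

    @field_validator("priority", mode="before")
    @classmethod
    def normalize_priority(cls, v):
        # Runs before enum validation: trim whitespace and lowercase strings.
        if isinstance(v, str):
            return v.strip().lower()
        return v

task = Task.model_validate({"title": "Fix login bug", "priority": " High "})
print(task.priority)

# Values outside the enum are still rejected after normalization.
rejected = False
try:
    Task.model_validate({"title": "Clean up test DB", "priority": "urgent"})
except ValidationError:
    rejected = True
print(f"'urgent' rejected: {rejected}")
```

Normalizing case is a judgment call. Mapping synonyms such as "urgent" or "ASAP" onto a level is better left to the prompt or to an explicit lookup table, so the mapping stays visible.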

In [None]:
# GUIDED: Use our utility function
from src.llm_utils import LLMClient, get_json_response

client = LLMClient(provider="openai", model="gpt-4o-mini")

result = get_json_response(
    client,
    message="What are the 3 largest countries by area? Include name and area in sq km.",
    system="Return JSON with 'countries' array, each with 'name' and 'area_km2' fields."
)

print(json.dumps(result, indent=2))

---
## 5. Error Handling <a id='errors'></a>

In [None]:
# GUIDED: Robust extraction with retries
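from pydantic import BaseModel, Field, ValidationError
from openai import OpenAI
import json

class ExtractedData(BaseModel):
    company: str
    # State the unit explicitly; otherwise "$5.2 billion" may come back as 5.2
    revenue: float = Field(description="Revenue in USD as a plain number, e.g. 5200000000")
    year: int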

def extract_with_retry(
    text: str,
    model_class: type[BaseModel],
    max_retries: int = 3
) -> BaseModel:
    """Extract data with retry on validation failure."""
    client = OpenAI()
    last_error = None
    
    for attempt in range(max_retries):
        try:
            # Add error context on retry
            error_context = ""
            if last_error:
                error_context = f"\n\nPrevious attempt failed: {last_error}. Please fix."
            
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        # json.dumps produces real JSON rather than Python's dict repr
                        "content": f"""Extract data from text.
Return JSON matching: {json.dumps(model_class.model_json_schema())}{error_context}"""
                    },
                    {"role": "user", "content": text}
                ],
                response_format={"type": "json_object"}
            )
            
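            data = json.loads(response.choices[0].message.content)
            # model_validate turns any bad payload into a ValidationError,
            # which the handler below converts into a retry
            return model_class.model_validate(data)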
            
        except json.JSONDecodeError as e:
            last_error = f"Invalid JSON: {e}"
        except ValidationError as e:
            last_error = f"Validation failed: {e}"
    
    raise ValueError(f"Failed after {max_retries} attempts: {last_error}")

# Test it
text = "Acme Corp reported $5.2 billion in revenue for fiscal year 2023."

try:
    data = extract_with_retry(text, ExtractedData)
    print(f"Company: {data.company}")
    print(f"Revenue: ${data.revenue:,.0f}")
    print(f"Year: {data.year}")
except ValueError as e:
    print(f"Extraction failed: {e}")
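
The retry loop can be tested without an API key by swapping the model call for a plain function. The sketch below is one way to do that, assuming Pydantic v2. `validate_with_retry` and `fake_llm` are hypothetical names introduced here, and the fake replies are hand-written: the first is invalid on purpose, the second is what a corrected reply might look like.

```python
from typing import Callable, Optional
from pydantic import BaseModel, ValidationError

class ExtractedData(BaseModel):
    company: str
    revenue: float
    year: int

def validate_with_retry(
    generate: Callable[[Optional[str]], str],
    model_class: type[BaseModel],
    max_retries: int = 3,
) -> BaseModel:
    """Call generate(feedback) until its output validates against model_class."""
    last_error: Optional[str] = None
    for _ in range(max_retries):
        raw = generate(last_error)
        try:
            return model_class.model_validate_json(raw)
        except ValidationError as exc:
            # Summarize each error as "field: message" to feed back to the model.
            last_error = "; ".join(
                f"{'.'.join(map(str, e['loc'])) or 'json'}: {e['msg']}"
                for e in exc.errors()
            )
    raise ValueError(f"Failed after {max_retries} attempts: {last_error}")

calls: list[Optional[str]] = []

def fake_llm(feedback: Optional[str]) -> str:
    """Stand-in for an API call: wrong on the first try, right once given feedback."""
    calls.append(feedback)
    if feedback is None:
        return '{"company": "Acme Corp", "revenue": "5.2 billion", "year": 2023}'
    return '{"company": "Acme Corp", "revenue": 5200000000, "year": 2023}'

result = validate_with_retry(fake_llm, ExtractedData)
print(result)
print(f"Attempts: {len(calls)}")
print(f"Feedback sent on retry: {calls[1]}")
```

Keeping the model call behind a callable also makes it straightforward to unit test the failure path, for example with a function that never returns valid output.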

---
## 6. Exercises <a id='exercises'></a>

### Exercise 1: Email Extractor

Create a Pydantic model and extractor for email metadata.

In [None]:
# TODO: Create Email model with: sender, recipient, subject, summary, sentiment

# Your code here:


### Exercise 2: Multi-format Extractor

Build an extractor that works with different document types.

In [None]:
# TODO: Create extractors for invoices, receipts, and contracts

# Your code here:


---
## 7. Checkpoint <a id='checkpoint'></a>

Before moving on, verify:

- [ ] You can use JSON mode with LLMs
- [ ] You created Pydantic models for validation
- [ ] You extracted structured data from text
- [ ] You handled validation errors

### Next Steps

In the next notebook, we'll learn about **Embeddings & Vectors** - the foundation of semantic search!