# Comprehensive Guide to Pydantic for LLM Workflows

This notebook provides a hands-on guide to using Pydantic for structuring and validating LLM outputs. We'll explore core concepts, practical examples, and best practices for building robust LLM-powered applications.

## Setup and Installation

First, let's install the required packages and set up our AWS Bedrock connection.

In [None]:
# Install required packages
!pip install pydantic boto3 langchain-aws email-validator python-dateutil

In [None]:
# Import necessary libraries
import os
import json
import boto3
from datetime import datetime, date
from typing import Optional, List, Literal
from enum import Enum

from pydantic import BaseModel, EmailStr, Field, HttpUrl, ValidationError, validator
from langchain_aws import ChatBedrock
from google.colab import userdata

print("✓ All imports successful")

In [None]:
# Configure AWS credentials using Colab secrets
AWS_ACCESS_KEY_ID = userdata.get('awsid')
AWS_SECRET_ACCESS_KEY = userdata.get('awssecret')
AWS_REGION = "us-east-1"

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=AWS_REGION,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

# Set up the Bedrock model (using Amazon Nova Lite for cost-effectiveness)
llm = ChatBedrock(
    client=bedrock_runtime,
    model_id="amazon.nova-lite-v1:0",
    model_kwargs={
        "temperature": 0,
        "max_tokens": 4096
    }
)

print("✓ AWS Bedrock client initialized")
print("✓ Using model: amazon.nova-lite-v1:0")

## 1. Introduction and Context

### The Challenge of Structured Output from LLMs

When working with Large Language Models (LLMs), one of the fundamental challenges is obtaining structured, predictable output that can be reliably processed by downstream systems. While you can simply ask an LLM to format its response in a particular way (like JSON), the results are often unpredictable:

**Common Issues:**
- Extra text outside the JSON structure (e.g., "Here's the JSON output you requested:")
- Markdown formatting (triple backticks around JSON)
- Missing or incorrectly formatted fields
- Invalid data types
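
The first two issues can often be handled before validation. The sketch below is one possible pre-processing step, not a Pydantic feature: `extract_json_object` is a hypothetical helper that drops a Markdown fence and any surrounding chatter by keeping the text between the first `{` and the last `}`. It assumes the response contains exactly one top-level JSON object.

```python
import json


def extract_json_object(llm_text: str) -> str:
    """Return the substring spanning the first '{' to the last '}'.

    Hypothetical helper: assumes the response holds exactly one top-level
    JSON object, possibly wrapped in prose or a Markdown code fence.
    """
    start = llm_text.find("{")
    end = llm_text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in LLM response")
    return llm_text[start : end + 1]


FENCE = "`" * 3  # built programmatically so this example contains no literal fence
raw_response = (
    "Here's the JSON output you requested:\n"
    f"{FENCE}json\n"
    '{"name": "Alice", "query": "reset password"}\n'
    f"{FENCE}\n"
    "Let me know if you need anything else."
)

cleaned = extract_json_object(raw_response)
parsed = json.loads(cleaned)
print(parsed)
```

A helper like this only removes wrapping text. Missing fields and wrong data types still need schema validation, which is where Pydantic comes in.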

### Why Pydantic?

Pydantic provides a robust solution by allowing you to:
1. Define explicit data models with field names and types
2. Validate LLM responses against these models
3. Catch and handle validation errors systematically
4. Ensure data consistency throughout your application

## 2. Basic Pydantic Models

Let's start by creating simple Pydantic models and understanding how they work.

### Creating Your First Model

In [None]:
class UserInput(BaseModel):
    name: str
    email: EmailStr
    query: str

# Valid data
user_input = UserInput(
    name="Alice Johnson",
    email="alice.johnson@company.com",
    query="I need help resetting my account password"
)

print("Valid user input:")
print(user_input)
print(f"\nName: {user_input.name}")
print(f"Email: {user_input.email}")
print(f"Query: {user_input.query}")

### Validation in Action

In [None]:
# Invalid email - missing @ symbol
try:
    invalid_user = UserInput(
        name="Bob Smith",
        email="bob.smith.invalid.com",  # Invalid format
        query="Where is my order?"
    )
except ValidationError as e:
    print("Validation Error Occurred:")
    print(e)

### E-commerce Support System Example

Let's build a more realistic model for a customer support ticketing system.

In [None]:
class IssueType(str, Enum):
    DELIVERY = "delivery"
    PRODUCT_QUALITY = "product_quality"
    BILLING = "billing"
    TECHNICAL = "technical"
    OTHER = "other"

class SupportTicket(BaseModel):
    customer_name: str = Field(min_length=2, max_length=100)
    email: EmailStr
    phone: Optional[str] = Field(default=None, pattern=r"^\+?1?\d{9,15}$")
    issue_type: IssueType
    description: str = Field(min_length=10, max_length=1000)
    order_number: Optional[str] = Field(
        default=None,
        pattern=r"^ORD-\d{8}$",
        description="Format: ORD-12345678"
    )
    purchase_date: Optional[date] = None
    attachments: List[str] = Field(default_factory=list, max_length=5)
    
    # Pydantic v2 style: model_config replaces the deprecated inner `class Config`
    model_config = {
        "json_schema_extra": {
            "example": {
                "customer_name": "Sarah Chen",
                "email": "sarah.chen@email.com",
                "phone": "+15555550123",  # digits only, to match the phone pattern above
                "issue_type": "product_quality",
                "description": "The laptop I received has a defective screen with dead pixels",
                "order_number": "ORD-20240315",
                "purchase_date": "2024-03-15",
                "attachments": ["screen_issue.jpg"]
            }
        }
    }

# Create a support ticket
ticket = SupportTicket(
    customer_name="Sarah Chen",
    email="sarah.chen@email.com",
    issue_type=IssueType.PRODUCT_QUALITY,
    description="The laptop I received has a defective screen with dead pixels",
    order_number="ORD-20240315"
)

print("Support Ticket Created:")
print(json.dumps(ticket.model_dump(), indent=2, default=str))

### Data Type Coercion

In its default (lax) mode, Pydantic coerces compatible input values to the declared field types, for example the string `"12345"` to the integer `12345`.

In [None]:
class Product(BaseModel):
    product_id: int
    price: float
    in_stock: bool
    
# String to int/float/bool conversion
product = Product(
    product_id="12345",      # Converted to int
    price="29.99",           # Converted to float
    in_stock="true"          # Converted to bool
)

print(f"Product ID: {product.product_id} (type: {type(product.product_id).__name__})")
print(f"Price: {product.price} (type: {type(product.price).__name__})")
print(f"In Stock: {product.in_stock} (type: {type(product.in_stock).__name__})")
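
Coercion is convenient, but it can hide sloppy LLM output. As a contrast, here is a minimal sketch of strict validation, assuming Pydantic v2, where `model_validate` accepts a `strict` flag. `StockItem` is a hypothetical model used only for this illustration.

```python
from pydantic import BaseModel, ValidationError


class StockItem(BaseModel):  # hypothetical model for illustration
    product_id: int
    in_stock: bool


payload = {"product_id": "12345", "in_stock": "true"}

# Default (lax) mode: compatible strings are coerced to int and bool
lax_item = StockItem.model_validate(payload)
print(lax_item)

# Strict mode: the same payload is rejected because the types do not match
try:
    StockItem.model_validate(payload, strict=True)
    strict_rejected = False
except ValidationError as exc:
    strict_rejected = True
    print(f"Strict mode rejected {exc.error_count()} field(s)")
```

Whether to use strict mode is a design choice: lax mode tolerates harmless quirks such as quoted numbers, while strict mode surfaces them as errors.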

### Working with JSON Data

In [None]:
class EnhancedUserInput(BaseModel):
    name: str
    email: EmailStr
    query: str
    order_id: Optional[int] = Field(
        default=None,
        description="5-digit order number, cannot start with 0",
        ge=10000,
        le=99999
    )
    purchase_date: Optional[date] = None

# From JSON string to model
json_data = '''
{
    "name": "Emily Rodriguez",
    "email": "emily.r@example.com",
    "query": "Need to update my shipping address",
    "order_id": 54321,
    "purchase_date": "2024-03-10"
}
'''

# Method 1: Parse JSON, then create model
data_dict = json.loads(json_data)
user1 = EnhancedUserInput(**data_dict)
print("Method 1 - From dict:")
print(user1)

# Method 2: Direct validation from JSON (preferred)
user2 = EnhancedUserInput.model_validate_json(json_data)
print("\nMethod 2 - Direct from JSON:")
print(user2)

# To JSON
json_output = user2.model_dump_json(indent=2)
print("\nBack to JSON:")
print(json_output)
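
Other cells in this notebook pass `default=str` to `json.dumps` because `model_dump()` returns Python objects such as `date`. An alternative, sketched below under the assumption of Pydantic v2, is `model_dump(mode="json")`, which returns only JSON-compatible values. `Order` is a hypothetical model for this illustration.

```python
import json
from datetime import date

from pydantic import BaseModel


class Order(BaseModel):  # hypothetical model for illustration
    order_id: int
    purchase_date: date


order = Order(order_id=54321, purchase_date="2024-03-10")

python_dict = order.model_dump()            # keeps the date object
json_ready = order.model_dump(mode="json")  # converts the date to an ISO string

print(type(python_dict["purchase_date"]).__name__)
print(json.dumps(json_ready, indent=2))     # no default=str needed
```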

## 3. Validating LLM Responses

Now let's see how to use Pydantic to validate and structure LLM outputs.

### Content Moderation System Example

In [None]:
class SeverityLevel(str, Enum):
    SAFE = "safe"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class ViolationType(str, Enum):
    NONE = "none"
    SPAM = "spam"
    HARASSMENT = "harassment"
    HATE_SPEECH = "hate_speech"
    EXPLICIT_CONTENT = "explicit_content"
    MISINFORMATION = "misinformation"
    VIOLENCE = "violence"

class ModerationResult(BaseModel):
    content_id: str
    is_safe: bool
    severity: SeverityLevel
    violations: List[ViolationType] = Field(default_factory=list)
    confidence_score: float = Field(ge=0.0, le=1.0)
    flagged_phrases: List[str] = Field(
        default_factory=list,
        max_length=10,  # Pydantic v2 name for the deprecated max_items
        description="Specific phrases that triggered flags"
    )
    recommended_action: Literal["approve", "review", "reject", "escalate"]
    explanation: str = Field(
        min_length=20,
        max_length=500,
        description="Brief explanation of the decision"
    )
    
    # Pydantic v2 style: model_config replaces the deprecated inner `class Config`
    model_config = {
        "json_schema_extra": {
            "example": {
                "content_id": "post_12345",
                "is_safe": False,
                "severity": "medium",
                "violations": ["spam", "misinformation"],
                "confidence_score": 0.87,
                "flagged_phrases": ["guaranteed results", "click here now"],
                "recommended_action": "review",
                "explanation": "Content contains promotional language and unverified claims requiring manual review"
            }
        }
    }

# Display the schema
schema = ModerationResult.model_json_schema()
print("ModerationResult Schema:")
print(json.dumps(schema, indent=2))
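
To see how field definitions turn into the schema the LLM will read, it can help to inspect a small model. The sketch below uses a hypothetical `MiniResult` model (a cut-down stand-in, not the notebook's `ModerationResult`) and assumes Pydantic v2's `model_json_schema()`.

```python
from typing import List

from pydantic import BaseModel, Field


class MiniResult(BaseModel):  # hypothetical cut-down model for illustration
    content_id: str
    confidence_score: float = Field(ge=0.0, le=1.0)
    flagged_phrases: List[str] = Field(default_factory=list)


mini_schema = MiniResult.model_json_schema()
score_schema = mini_schema["properties"]["confidence_score"]

# Fields without a default are listed as required
print("Required:", mini_schema["required"])
# ge/le constraints become JSON Schema minimum/maximum keywords
print("Score bounds:", score_schema["minimum"], "to", score_schema["maximum"])
```

Fields with defaults, such as `flagged_phrases`, are left out of `required`, so the model may omit them without failing validation.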

### Validation Function with Error Handling

In [None]:
def validate_moderation_response(llm_response: str) -> tuple[ModerationResult | None, str | None]:
    """Validate LLM moderation response with detailed error handling."""
    try:
        result = ModerationResult.model_validate_json(llm_response)
        return result, None
    except ValidationError as e:
        # In Pydantic v2, model_validate_json reports malformed JSON as a
        # ValidationError with type "json_invalid", not json.JSONDecodeError.
        error_details = []
        for error in e.errors():
            if error["type"] == "json_invalid":
                error_details.append(f"Invalid JSON format: {error['msg']}")
                continue
            field = " -> ".join(str(x) for x in error["loc"])
            message = error["msg"]
            error_details.append(f"{field}: {message}")
        return None, "; ".join(error_details)

# Test with valid data
valid_response = '''
{
    "content_id": "post_12345",
    "is_safe": false,
    "severity": "medium",
    "violations": ["spam"],
    "confidence_score": 0.85,
    "flagged_phrases": ["buy now", "limited time"],
    "recommended_action": "review",
    "explanation": "Content contains promotional language that may violate spam policies"
}
'''

result, error = validate_moderation_response(valid_response)
if result:
    print("✓ Validation successful!")
    print(f"Content ID: {result.content_id}")
    print(f"Safe: {result.is_safe}")
    print(f"Action: {result.recommended_action}")
else:
    print(f"✗ Validation failed: {error}")

### Testing with LLM - Content Moderation

In [None]:
def create_moderation_prompt(content: str, content_id: str) -> str:
    """Create a schema-based prompt for content moderation."""
    schema = ModerationResult.model_json_schema()
    
    prompt = f"""You are a content moderation AI. Analyze the following content for policy violations.

CONTENT TO ANALYZE:
ID: {content_id}
Text: {content}

OUTPUT REQUIREMENTS:
Provide your analysis as a valid JSON object that strictly conforms to this schema:

{json.dumps(schema, indent=2)}

CRITICAL INSTRUCTIONS:
1. Return ONLY valid JSON - no markdown formatting, no explanatory text
2. Ensure all required fields are present
3. Follow exact field types and constraints
4. confidence_score must be between 0.0 and 1.0
5. severity must be one of: safe, low, medium, high, critical

Begin your response with {{ and end with }}"""
    return prompt

# Test content
test_content = "Click here now for guaranteed weight loss! Limited time offer!"
content_id = "test_001"

prompt = create_moderation_prompt(test_content, content_id)
print("Prompt sent to LLM:")
print("=" * 60)
print(prompt)
print("=" * 60)

In [None]:
# Get LLM response
response = llm.invoke(prompt)
llm_output = response.content

print("\nLLM Response:")
print("=" * 60)
print(llm_output)
print("=" * 60)

# Validate the response
result, error = validate_moderation_response(llm_output)

if result:
    print("\n✓ Validation SUCCESSFUL!")
    print("\nParsed Result:")
    print(json.dumps(result.model_dump(), indent=2, default=str))
else:
    print(f"\n✗ Validation FAILED: {error}")

## 4. Retry Logic with Error Feedback

LLMs don't always get the format right on the first try. Let's implement a retry mechanism that sends the validation error back to the model so it can correct its own output.

In [None]:
def validate_llm_response_with_retry(
    prompt: str,
    data_model: type[BaseModel],
    max_retries: int = 3,
) -> BaseModel | None:
    """
    Validate LLM response with automatic retry on validation errors.
    """
    # Initial call
    response = llm.invoke(prompt)
    llm_response = response.content
    
    for attempt in range(max_retries):
        try:
            # Attempt validation
            validated_data = data_model.model_validate_json(llm_response)
            print(f"✓ Validation successful on attempt {attempt + 1}")
            return validated_data
            
        except ValidationError as e:
            # Pydantic v2 also reports malformed JSON as a ValidationError
            print(f"✗ Attempt {attempt + 1} failed: {str(e)[:200]}...")
            
            if attempt == max_retries - 1:
                print("Max retries reached. Validation failed.")
                return None
            
            # Create retry prompt with error feedback
            retry_prompt = f"""VALIDATION ERROR OCCURRED

Original Prompt:
{prompt}

Your Previous Response:
{llm_response}

Error Message:
{str(e)}

Please fix the error and provide a corrected response. Remember:
- Return ONLY valid JSON
- No markdown formatting or extra text
- Match the exact schema requirements
- Begin with {{ and end with }}"""
            
            # Retry with error feedback
            response = llm.invoke(retry_prompt)
            llm_response = response.content
    
    return None

# Test with retry logic
print("Testing retry logic...\n")
result = validate_llm_response_with_retry(
    prompt=create_moderation_prompt(test_content, content_id),
    data_model=ModerationResult,
    max_retries=3
)

if result:
    print("\nFinal validated result:")
    print(json.dumps(result.model_dump(), indent=2, default=str))
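
Because a live model's behavior varies between runs, the retry loop is easier to check against canned responses. The following is a minimal offline sketch of the same pattern, not the function above: `Verdict`, `validate_with_retry`, and `fake_llm` are hypothetical names, and the stub stands in for `llm.invoke`.

```python
from typing import Callable, List, Optional

from pydantic import BaseModel, Field, ValidationError


class Verdict(BaseModel):  # hypothetical model for illustration
    label: str
    score: float = Field(ge=0.0, le=1.0)


def validate_with_retry(
    call_llm: Callable[[str], str], prompt: str, max_retries: int = 3
) -> Optional[Verdict]:
    """Call the LLM up to max_retries times, feeding errors back each time."""
    current_prompt = prompt
    for _ in range(max_retries):
        reply = call_llm(current_prompt)
        try:
            return Verdict.model_validate_json(reply)
        except ValidationError as exc:
            # Include the failed reply and the error so the model can self-correct
            current_prompt = (
                f"{prompt}\n\nPrevious response:\n{reply}\n\n"
                f"Error:\n{exc}\n\nReturn corrected JSON only."
            )
    return None


# Stub LLM: the first reply violates the score bound, the second is valid
canned_replies = iter(
    ['{"label": "spam", "score": 1.7}', '{"label": "spam", "score": 0.7}']
)
prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(canned_replies)


verdict = validate_with_retry(fake_llm, "Classify: 'Click here now!'")
print(verdict)
print(f"LLM calls made: {len(prompts_seen)}")
```

Passing the LLM call in as a parameter is a deliberate choice here: it lets the loop be tested without network access or credentials.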

## 5. Advanced Example: Research Paper Analysis

Let's build a more complex model for analyzing academic papers.

In [None]:
class Author(BaseModel):
    name: str
    affiliation: str
    email: Optional[EmailStr] = None

class ResearchPaperAnalysis(BaseModel):
    title: str = Field(min_length=10, max_length=300)
    # Pydantic v2 uses min_length/max_length for lists (min_items/max_items are deprecated)
    authors: List[Author] = Field(min_length=1, max_length=20)
    publication_date: date
    abstract: str = Field(min_length=100, max_length=2000)
    keywords: List[str] = Field(min_length=3, max_length=10)
    methodology: str = Field(min_length=50, max_length=1000)
    key_findings: List[str] = Field(
        min_length=2,
        max_length=5,
        description="Main discoveries or conclusions"
    )
    limitations: List[str] = Field(min_length=1, max_length=5)
    impact_score: float = Field(
        ge=0.0,
        le=10.0,
        description="Estimated research impact (0-10)"
    )
    citations_analyzed: int = Field(ge=0)
    related_works: List[HttpUrl] = Field(default_factory=list, max_length=10)

# Display schema
print("Research Paper Analysis Schema:")
print(json.dumps(ResearchPaperAnalysis.model_json_schema(), indent=2))

In [None]:
def create_paper_analysis_prompt(paper_abstract: str) -> str:
    """Create a schema-based prompt for paper analysis."""
    schema = ResearchPaperAnalysis.model_json_schema()
    
    prompt = f"""You are an expert research analyst. Analyze the following academic paper abstract and extract structured information.

PAPER ABSTRACT:
{paper_abstract}

OUTPUT REQUIREMENTS:
Provide your analysis as a valid JSON object that strictly conforms to this schema:

{json.dumps(schema, indent=2)}

CRITICAL INSTRUCTIONS:
1. Return ONLY valid JSON - no markdown formatting, no explanatory text
2. Ensure all required fields are present
3. Follow exact field types and constraints
4. Arrays must contain the specified minimum number of items
5. Dates must be in YYYY-MM-DD format
6. Make reasonable inferences based on the abstract
7. For missing information, use plausible estimates

Begin your response with {{ and end with }}"""
    return prompt

# Sample abstract
sample_abstract = """Large language models have demonstrated remarkable capabilities in natural language understanding
and generation. This paper presents a novel approach to improving structured output generation from LLMs using
validation frameworks. We evaluated our method on three benchmark datasets and achieved a 34% reduction in
output formatting errors compared to baseline approaches. Our findings suggest that incorporating runtime
validation significantly improves the reliability of LLM-powered applications in production environments."""

prompt = create_paper_analysis_prompt(sample_abstract)
result = validate_llm_response_with_retry(
    prompt=prompt,
    data_model=ResearchPaperAnalysis,
    max_retries=3
)

if result:
    print("\n✓ Paper Analysis Completed!")
    print("\nStructured Output:")
    print(json.dumps(result.model_dump(), indent=2, default=str))
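
With nested models, the `loc` tuple in each validation error gives the path to the offending value, including list indices. The sketch below illustrates this with two hypothetical models, `Writer` and `Paper`, and made-up sample data. It assumes Pydantic v2's `ValidationError.errors()`.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Writer(BaseModel):  # hypothetical nested model
    name: str
    affiliation: str


class Paper(BaseModel):  # hypothetical parent model
    title: str
    authors: List[Writer]


# The second author is missing the required "affiliation" field
bad_payload = {
    "title": "Structured Output",
    "authors": [
        {"name": "A. Researcher", "affiliation": "Example University"},
        {"name": "B. Scientist"},
    ],
}

try:
    Paper.model_validate(bad_payload)
    error_paths = []
except ValidationError as exc:
    # Join each loc tuple, e.g. ("authors", 1, "affiliation"), into a readable path
    error_paths = [" -> ".join(str(part) for part in err["loc"]) for err in exc.errors()]

print(error_paths)
```

Paths like these are useful in retry prompts because they tell the model exactly which element of which list needs fixing.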

## 6. Best Practices and Patterns

Let's explore some important patterns for production use.

### Custom Validators

In [None]:
from pydantic import field_validator  # Pydantic v2 replacement for the deprecated @validator

class Transaction(BaseModel):
    transaction_id: str
    amount: float
    currency: str
    timestamp: datetime
    
    @field_validator('amount')
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError('Amount must be positive')
        return v
    
    @field_validator('currency')
    @classmethod
    def currency_must_be_valid(cls, v: str) -> str:
        # Normalize to upper case so "usd" and "USD" are both accepted
        valid_currencies = {'USD', 'EUR', 'GBP', 'JPY'}
        if v.upper() not in valid_currencies:
            raise ValueError(f'Currency must be one of {valid_currencies}')
        return v.upper()

# Test valid transaction
txn = Transaction(
    transaction_id="TXN123",
    amount=99.99,
    currency="usd",
    timestamp=datetime.now()
)
print(f"Valid transaction: {txn.currency} {txn.amount}")

# Test invalid amount
try:
    invalid_txn = Transaction(
        transaction_id="TXN124",
        amount=-50.0,
        currency="USD",
        timestamp=datetime.now()
    )
except ValidationError as e:
    print(f"\nValidation error: {e}")

### Model Inheritance

In [None]:
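from pydantic import ValidationInfo, field_validator  # Pydantic v2 validator API

class BaseUserInfo(BaseModel):
    name: str
    email: EmailStr
    user_id: str = Field(pattern=r"^USR\d{6}$")

class UserRegistration(BaseUserInfo):
    password: str = Field(min_length=8, max_length=128)
    confirm_password: str
    terms_accepted: bool = True
    
    @field_validator('confirm_password')
    @classmethod
    def passwords_match(cls, v: str, info: ValidationInfo) -> str:
        # info.data holds the fields validated so far (those declared before this one)
        if 'password' in info.data and v != info.data['password']:
            raise ValueError('Passwords do not match')
        return v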

class UserProfile(BaseUserInfo):
    bio: Optional[str] = Field(default=None, max_length=500)
    avatar_url: Optional[HttpUrl] = None
    created_at: datetime
    last_login: Optional[datetime] = None

# Create instances
registration = UserRegistration(
    name="John Doe",
    email="john@example.com",
    user_id="USR123456",
    password="SecurePass123",
    confirm_password="SecurePass123"
)

profile = UserProfile(
    name="John Doe",
    email="john@example.com",
    user_id="USR123456",
    created_at=datetime.now(),
    bio="Software developer"
)

print("Registration:", registration.name)
print("Profile:", profile.name, "-", profile.bio)
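
An alternative for cross-field checks such as the password comparison above is a model-level validator, which runs after all fields have been validated and so does not depend on field declaration order. The sketch below assumes Pydantic v2's `model_validator`. `SignupForm` is a hypothetical model for this illustration.

```python
from pydantic import BaseModel, Field, ValidationError, model_validator


class SignupForm(BaseModel):  # hypothetical model for illustration
    password: str = Field(min_length=8)
    confirm_password: str

    @model_validator(mode="after")
    def check_passwords_match(self) -> "SignupForm":
        # Runs once every field has passed its own validation
        if self.password != self.confirm_password:
            raise ValueError("Passwords do not match")
        return self


ok_form = SignupForm(password="SecurePass123", confirm_password="SecurePass123")

try:
    SignupForm(password="SecurePass123", confirm_password="Different456")
    mismatch_message = None
except ValidationError as exc:
    mismatch_message = exc.errors()[0]["msg"]

print(mismatch_message)
```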

### Safe Validation Helper

In [None]:
from typing import Union

def safe_validate(
    data: Union[str, dict],
    model: type[BaseModel]
) -> tuple[BaseModel | None, dict]:
    """
    Safely validate data with detailed error information.
    
    Returns:
        (validated_model, error_info) where error_info is empty dict if successful
    """
    try:
        if isinstance(data, str):
            validated = model.model_validate_json(data)
        else:
            validated = model.model_validate(data)
        return validated, {}
        
    except ValidationError as e:
        # Pydantic v2 reports malformed JSON as a ValidationError whose error
        # type is "json_invalid"; json.JSONDecodeError is not raised here.
        errors = e.errors()
        is_json_error = any(err["type"] == "json_invalid" for err in errors)
        return None, {
            "error_type": "json_decode" if is_json_error else "validation",
            "errors": [
                {
                    "field": ".".join(str(x) for x in err["loc"]),
                    "message": err["msg"],
                    "type": err["type"]
                }
                for err in errors
            ]
        }

# Test the helper
test_data = '{"name": "Test", "email": "invalid-email", "query": "Help"}'
result, error_info = safe_validate(test_data, UserInput)

if result:
    print("✓ Validation successful")
else:
    print("✗ Validation failed:")
    print(json.dumps(error_info, indent=2))

## 7. Key Takeaways and Summary

### Key Takeaways:

1. **Start with Clear Models**: Well-defined Pydantic models are the foundation of reliable LLM integrations

2. **Use Schema Over Examples**: Pass `model_json_schema()` to LLMs for better structured output

3. **Implement Retry Logic**: LLMs don't always get it right the first time; build in error handling and retries

4. **Validate Early and Often**: Catch data issues as early as possible in your pipeline

5. **Leverage Type Safety**: Pydantic's type checking prevents many runtime errors

6. **Document Your Models**: Use `Field(description=...)` to make models self-documenting

7. **Test Thoroughly**: Write tests for both valid and invalid data scenarios

### Next Steps:

- Try LLM APIs that accept Pydantic models directly for structured outputs (OpenAI's is one example)
- Explore the `instructor` library for more streamlined LLM + Pydantic workflows
- Build production-ready validation pipelines with comprehensive error handling
- If you maintain Pydantic v1 code, plan a migration to v2: this notebook's calls (`model_validate_json`, `model_dump`) are v2 API, and v2 validates faster

## 8. Practice Exercises

Try these exercises to reinforce your learning:

In [None]:
# Exercise 1: Create a model for a movie review
# Requirements:
# - movie_title: string
# - reviewer_name: string
# - rating: float between 0 and 5
# - review_text: string (min 50 chars)
# - watched_date: date
# - would_recommend: boolean
# - genre: one of ["action", "comedy", "drama", "horror", "sci-fi"]

# Your code here:


In [None]:
# Exercise 2: Create a prompt for the LLM to analyze a movie review
# and validate the response using your model

sample_review = """I watched The Matrix last night and it was absolutely mind-blowing! 
The special effects still hold up today, and the philosophical themes are fascinating. 
Definitely a must-watch for any sci-fi fan. Rating: 4.5/5"""

# Your code here:
