# Comprehensive Guide to Pydantic for LLM Workflows

This notebook provides a hands-on guide to using Pydantic for structuring and validating LLM outputs. We'll explore core concepts, practical examples, and best practices for building robust weather data processing applications.

## Setup and Installation

First, let's install the required packages and set up our AWS Bedrock connection.

In [None]:
# Install required packages (the examples below use the Pydantic v2 API)
!pip install "pydantic>=2" boto3 langchain-aws email-validator python-dateutil

In [None]:
# Import necessary libraries
import os
import json
import boto3
from datetime import datetime, date
from typing import Optional, List, Literal
from enum import Enum

from pydantic import BaseModel, EmailStr, Field, HttpUrl, ValidationError, validator
from langchain_aws import ChatBedrock
from google.colab import userdata

print("✓ All imports successful")

In [None]:
# Configure AWS credentials using Colab secrets
AWS_ACCESS_KEY_ID = userdata.get('awsid')
AWS_SECRET_ACCESS_KEY = userdata.get('awssecret')
AWS_REGION = "us-east-1"

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=AWS_REGION,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

# Set up the Bedrock model (using Amazon Nova Lite for cost-effectiveness)
llm = ChatBedrock(
    client=bedrock_runtime,
    model_id="amazon.nova-lite-v1:0",
    model_kwargs={
        "temperature": 0,
        "max_tokens": 4096
    }
)

print("✓ AWS Bedrock client initialized")
print("✓ Using model: amazon.nova-lite-v1:0")

## 1. Introduction and Context

### The Challenge of Structured Output from LLMs

When working with Large Language Models (LLMs), one of the fundamental challenges is obtaining structured, predictable output that can be reliably processed by downstream systems. While you can simply ask an LLM to format its response in a particular way (like JSON), the results are often unpredictable:

**Common Issues:**
- Extra text outside the JSON structure (e.g., "Here's the JSON output you requested:")
- Markdown formatting (triple backticks around JSON)
- Missing or incorrectly formatted fields
- Invalid data types
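
The first two issues can often be handled before validation by pulling the JSON object out of the surrounding text. The following is a minimal sketch rather than a complete parser: `extract_json_object` is a hypothetical helper name, and it assumes the response contains exactly one top-level JSON object and no stray braces in the surrounding prose.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker, built indirectly for readability


def extract_json_object(text: str) -> str:
    """Return the text between the first '{' and the last '}' (hypothetical helper)."""
    # Remove Markdown fence markers, with or without a "json" language tag
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", text)
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in LLM response")
    return cleaned[start:end + 1]


# A response with both extra text and Markdown fences around the JSON
raw = (
    "Here's the JSON output you requested:\n"
    + FENCE + "json\n"
    + '{"station_id": "KORD", "temperature": 72.5}\n'
    + FENCE
)

payload = json.loads(extract_json_object(raw))
print(payload)
```

Cleaning the text this way does nothing about missing fields or wrong types; those still need schema validation.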

### Why Pydantic?

Pydantic provides a robust solution by allowing you to:
1. Define explicit data models with field names and types
2. Validate LLM responses against these models
3. Catch and handle validation errors systematically
4. Ensure data consistency throughout your application

## 2. Basic Pydantic Models

Let's start by creating simple Pydantic models and understanding how they work.

### Creating Your First Model

In [None]:
class WeatherReport(BaseModel):
    station_id: str
    location: str
    temperature: float
    conditions: str

# Valid data
weather_report = WeatherReport(
    station_id="KORD",
    location="Chicago, IL",
    temperature=72.5,
    conditions="Partly cloudy with light winds"
)

print("Valid weather report:")
print(weather_report)
print(f"\nStation ID: {weather_report.station_id}")
print(f"Location: {weather_report.location}")
print(f"Temperature: {weather_report.temperature}°F")
print(f"Conditions: {weather_report.conditions}")

### Validation in Action

In [None]:
# Invalid temperature - string instead of float
try:
    invalid_report = WeatherReport(
        station_id="KLAX",
        location="Los Angeles, CA",
        temperature="very hot",  # Invalid format - should be float
        conditions="Sunny and clear"
    )
except ValidationError as e:
    print("Validation Error Occurred:")
    print(e)

### Weather Alert System Example

Let's build a more realistic model for a weather alert and warning system.

In [None]:
class AlertType(str, Enum):
    TORNADO = "tornado"
    THUNDERSTORM = "thunderstorm"
    FLOOD = "flood"
    HEAT = "heat"
    WINTER_STORM = "winter_storm"
    HIGH_WIND = "high_wind"

class WeatherAlert(BaseModel):
    alert_id: str = Field(min_length=5, max_length=20)
    station_id: str = Field(pattern=r"^[A-Z]{4}$")
    county: str = Field(min_length=2, max_length=50)
    alert_type: AlertType
    description: str = Field(min_length=10, max_length=1000)
    event_id: Optional[str] = Field(
        default=None,
        pattern=r"^EVT-\d{8}$",
        description="Format: EVT-12345678"
    )
    issued_date: Optional[date] = None
    affected_areas: List[str] = Field(default_factory=list, max_length=10)

    # Pydantic v2: `model_config` replaces the deprecated inner `class Config`,
    # and `max_length` replaces `max_items` for list fields.
    model_config = {
        "json_schema_extra": {
            "example": {
                "alert_id": "ALERT12345",
                "station_id": "KORD",
                "county": "Cook County",
                "alert_type": "thunderstorm",
                "description": "Severe thunderstorm warning with hail up to quarter size and winds up to 60 mph",
                "event_id": "EVT-20240315",
                "issued_date": "2024-03-15",
                "affected_areas": ["Chicago", "Evanston", "Oak Park"]
            }
        }
    }

# Create a weather alert
alert = WeatherAlert(
    alert_id="ALERT12345",
    station_id="KORD",
    county="Cook County",
    alert_type=AlertType.THUNDERSTORM,
    description="Severe thunderstorm warning with hail up to quarter size and winds up to 60 mph",
    event_id="EVT-20240315"
)

print("Weather Alert Created:")
print(json.dumps(alert.model_dump(), indent=2, default=str))

### Data Type Coercion

By default (lax mode), Pydantic coerces compatible input, such as numeric strings, to the declared field type.

In [None]:
class WeatherStation(BaseModel):
    station_id: int
    elevation: float
    is_active: bool
    
# String to int/float/bool conversion
station = WeatherStation(
    station_id="12345",      # Converted to int
    elevation="1025.5",      # Converted to float
    is_active="true"         # Converted to bool
)

print(f"Station ID: {station.station_id} (type: {type(station.station_id).__name__})")
print(f"Elevation: {station.elevation} ft (type: {type(station.elevation).__name__})")
print(f"Is Active: {station.is_active} (type: {type(station.is_active).__name__})")
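
Coercion is convenient, but it can also hide an LLM that returns numbers as strings. Here is a sketch of how Pydantic v2's strict mode turns it off; `StrictWeatherStation` is a hypothetical variant of the model above, introduced only for illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictWeatherStation(BaseModel):
    # strict=True disables lax-mode coercion for every field on this model
    model_config = ConfigDict(strict=True)

    station_id: int
    elevation: float
    is_active: bool


# Native Python types are accepted as usual
ok = StrictWeatherStation(station_id=12345, elevation=1025.5, is_active=True)

# The same string inputs that lax mode coerces are rejected in strict mode
try:
    StrictWeatherStation(station_id="12345", elevation="1025.5", is_active="true")
    rejected_fields = []
except ValidationError as e:
    rejected_fields = [err["loc"][0] for err in e.errors()]

print("Rejected fields:", rejected_fields)
```

Whether to use strict mode is a design choice: lax mode tolerates more LLM output, while strict mode surfaces type drift early.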

### Working with JSON Data

In [None]:
class WeatherObservation(BaseModel):
    station_name: str
    observer_email: EmailStr
    conditions: str
    temperature: Optional[int] = Field(
        default=None,
        description="Temperature in Fahrenheit, range -50 to 130",
        ge=-50,
        le=130
    )
    observation_date: Optional[date] = None

# From JSON string to model
json_data = '''
{
    "station_name": "Downtown Weather Station",
    "observer_email": "weather.observer@nws.gov",
    "conditions": "Clear skies with light winds from the northwest",
    "temperature": 68,
    "observation_date": "2024-03-10"
}
'''

# Method 1: Parse JSON, then create model
data_dict = json.loads(json_data)
observation1 = WeatherObservation(**data_dict)
print("Method 1 - From dict:")
print(observation1)

# Method 2: Direct validation from JSON (preferred)
observation2 = WeatherObservation.model_validate_json(json_data)
print("\nMethod 2 - Direct from JSON:")
print(observation2)

# To JSON
json_output = observation2.model_dump_json(indent=2)
print("\nBack to JSON:")
print(json_output)

## 3. Validating LLM Responses

Now let's see how to use Pydantic to validate and structure LLM outputs.

### Weather Forecast Validation System Example

In [None]:
class ConfidenceLevel(str, Enum):
    VERY_HIGH = "very_high"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    VERY_LOW = "very_low"

class ForecastIssue(str, Enum):
    NONE = "none"
    TEMPERATURE_ANOMALY = "temperature_anomaly"
    PRECIPITATION_MISMATCH = "precipitation_mismatch"
    PRESSURE_INCONSISTENT = "pressure_inconsistent"
    WIND_DIRECTION_ERROR = "wind_direction_error"
    HUMIDITY_OUT_OF_RANGE = "humidity_out_of_range"
    SEVERE_WEATHER_MISSED = "severe_weather_missed"

class ForecastValidation(BaseModel):
    forecast_id: str
    is_accurate: bool
    confidence: ConfidenceLevel
    issues: List[ForecastIssue] = Field(default_factory=list)
    accuracy_score: float = Field(ge=0.0, le=1.0)
    flagged_parameters: List[str] = Field(
        default_factory=list,
        max_length=10,  # Pydantic v2 name for the deprecated `max_items`
        description="Specific weather parameters that triggered validation flags"
    )
    recommended_action: Literal["publish", "review", "revise", "escalate"]
    explanation: str = Field(
        min_length=20,
        max_length=500,
        description="Brief explanation of the validation result"
    )
    
    # Pydantic v2: `model_config` replaces the deprecated inner `class Config`
    model_config = {
        "json_schema_extra": {
            "example": {
                "forecast_id": "FCST_12345",
                "is_accurate": False,
                "confidence": "medium",
                "issues": ["temperature_anomaly", "precipitation_mismatch"],
                "accuracy_score": 0.72,
                "flagged_parameters": ["high_temp: 95°F", "chance_of_rain: 80%"],
                "recommended_action": "review",
                "explanation": "Temperature prediction significantly above seasonal average and precipitation forecast conflicts with pressure readings"
            }
        }
    }

# Display the schema
schema = ForecastValidation.model_json_schema()
print("ForecastValidation Schema:")
print(json.dumps(schema, indent=2))

### Validation Function with Error Handling

In [None]:
def validate_forecast_response(llm_response: str) -> tuple[ForecastValidation | None, str | None]:
    """Validate LLM forecast validation response with detailed error handling."""
    try:
        # Parse and validate in one step
        result = ForecastValidation.model_validate_json(llm_response)
        return result, None
    except ValidationError as e:
        # Pydantic v2 reports malformed JSON as a ValidationError as well
        # (error type "json_invalid"), so a single handler covers both cases;
        # json.JSONDecodeError is never raised by model_validate_json.
        error_details = []
        for error in e.errors():
            # "loc" is empty for whole-document errors such as invalid JSON
            field = " -> ".join(str(x) for x in error["loc"]) or "<root>"
            message = error["msg"]
            error_details.append(f"{field}: {message}")
        return None, "; ".join(error_details)

# Test with valid data
valid_response = '''
{
    "forecast_id": "FCST_12345",
    "is_accurate": true,
    "confidence": "high",
    "issues": ["none"],
    "accuracy_score": 0.89,
    "flagged_parameters": [],
    "recommended_action": "publish",
    "explanation": "All weather parameters are within expected ranges and consistent with current atmospheric conditions"
}
'''

result, error = validate_forecast_response(valid_response)
if result:
    print("✓ Validation successful!")
    print(f"Forecast ID: {result.forecast_id}")
    print(f"Accurate: {result.is_accurate}")
    print(f"Action: {result.recommended_action}")
else:
    print(f"✗ Validation failed: {error}")

### Testing with LLM - Weather Forecast Validation

In [None]:
def create_forecast_validation_prompt(forecast_data: str, forecast_id: str) -> str:
    """Create a schema-based prompt for weather forecast validation."""
    schema = ForecastValidation.model_json_schema()
    
    prompt = f"""You are a weather forecast validation AI. Analyze the following forecast for accuracy and consistency.

FORECAST TO ANALYZE:
ID: {forecast_id}
Data: {forecast_data}

OUTPUT REQUIREMENTS:
Provide your analysis as a valid JSON object that strictly conforms to this schema:

{json.dumps(schema, indent=2)}

CRITICAL INSTRUCTIONS:
1. Return ONLY valid JSON - no markdown formatting, no explanatory text
2. Ensure all required fields are present
3. Follow exact field types and constraints
4. accuracy_score must be between 0.0 and 1.0
5. confidence must be one of: very_high, high, medium, low, very_low

Begin your response with {{ and end with }}"""
    return prompt

# Test forecast data
test_forecast = "Tomorrow: High 110°F, Low 32°F, 99% chance of snow, winds calm from all directions"
forecast_id = "FCST_001"

prompt = create_forecast_validation_prompt(test_forecast, forecast_id)
print("Prompt sent to LLM:")
print("=" * 60)
print(prompt)
print("=" * 60)

In [None]:
# Get LLM response
response = llm.invoke(prompt)
llm_output = response.content

print("\nLLM Response:")
print("=" * 60)
print(llm_output)
print("=" * 60)

# Validate the response
result, error = validate_forecast_response(llm_output)

if result:
    print("\n✓ Validation SUCCESSFUL!")
    print("\nParsed Result:")
    print(json.dumps(result.model_dump(), indent=2, default=str))
else:
    print(f"\n✗ Validation FAILED: {error}")

## 4. Retry Logic with Error Feedback

LLMs don't always get the format right on the first try. Let's implement a retry mechanism.

In [None]:
def validate_llm_response_with_retry(
    prompt: str,
    data_model: type[BaseModel],
    max_retries: int = 3,
) -> BaseModel | None:
    """
    Validate an LLM response, re-prompting with the error message on failure.

    `max_retries` is the total number of attempts, including the first call.
    """
    # Initial call
    response = llm.invoke(prompt)
    llm_response = response.content
    
    for attempt in range(max_retries):
        try:
            # Attempt validation
            validated_data = data_model.model_validate_json(llm_response)
            print(f"✓ Validation successful on attempt {attempt + 1}")
            return validated_data
            
        except ValidationError as e:  # Pydantic v2 also raises this for malformed JSON
            print(f"✗ Attempt {attempt + 1} failed: {str(e)[:200]}...")
            
            if attempt == max_retries - 1:
                print("Max retries reached. Validation failed.")
                return None
            
            # Create retry prompt with error feedback
            retry_prompt = f"""VALIDATION ERROR OCCURRED

Original Prompt:
{prompt}

Your Previous Response:
{llm_response}

Error Message:
{str(e)}

Please fix the error and provide a corrected response. Remember:
- Return ONLY valid JSON
- No markdown formatting or extra text
- Match the exact schema requirements
- Begin with {{ and end with }}"""
            
            # Retry with error feedback
            response = llm.invoke(retry_prompt)
            llm_response = response.content
    
    return None

# Test with retry logic
print("Testing retry logic...\n")
result = validate_llm_response_with_retry(
    prompt=create_forecast_validation_prompt(test_forecast, forecast_id),
    data_model=ForecastValidation,
    max_retries=3
)

if result:
    print("\nFinal validated result:")
    print(json.dumps(result.model_dump(), indent=2, default=str))
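
The retry loop can also be exercised without a Bedrock connection by passing the LLM in as a plain callable. The sketch below is a simplified variant of the function above, not a replacement for it: `validate_with_retry`, `Reading`, and `fake_llm` are hypothetical names, and the stub returns one invalid response followed by one valid response.

```python
from typing import Callable, Optional, Type

from pydantic import BaseModel, ValidationError


class Reading(BaseModel):  # hypothetical model for the demo
    station_id: str
    temperature: float


def validate_with_retry(
    call_llm: Callable[[str], str],
    prompt: str,
    data_model: Type[BaseModel],
    max_attempts: int = 3,
) -> Optional[BaseModel]:
    """Retry loop that takes the LLM as a callable so it can be stubbed."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            return data_model.model_validate_json(raw)
        except ValidationError as e:
            # Feed the previous response and the error back to the model
            current_prompt = (
                f"{prompt}\n\nYour previous response:\n{raw}\n\nError:\n{e}"
            )
    return None


# Stub LLM: invalid on the first call, valid on the second
canned = iter([
    '{"station_id": "KORD", "temperature": "very hot"}',
    '{"station_id": "KORD", "temperature": 72.5}',
])
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(canned)


reading = validate_with_retry(fake_llm, "Report the reading as JSON.", Reading)
print(reading, "after", len(prompts_seen), "calls")
```

Injecting the callable keeps the retry policy testable and independent of any particular LLM client.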

## 5. Advanced Example: Weather Station Data Analysis

Let's build a more complex model for analyzing weather station data and climate patterns.

In [None]:
class Meteorologist(BaseModel):
    name: str
    station: str
    email: Optional[EmailStr] = None

class ClimateDataAnalysis(BaseModel):
    report_title: str = Field(min_length=10, max_length=300)
    # Pydantic v2 uses min_length/max_length for lists (min_items/max_items are deprecated)
    meteorologists: List[Meteorologist] = Field(min_length=1, max_length=20)
    analysis_date: date
    summary: str = Field(min_length=100, max_length=2000)
    weather_patterns: List[str] = Field(min_length=3, max_length=10)
    methodology: str = Field(min_length=50, max_length=1000)
    key_findings: List[str] = Field(
        min_length=2,
        max_length=5,
        description="Main weather patterns or climate observations"
    )
    data_limitations: List[str] = Field(min_length=1, max_length=5)
    accuracy_score: float = Field(
        ge=0.0,
        le=10.0,
        description="Estimated forecast accuracy (0-10)"
    )
    measurements_analyzed: int = Field(ge=0)
    related_stations: List[HttpUrl] = Field(default_factory=list, max_length=10)

# Display schema
print("Climate Data Analysis Schema:")
print(json.dumps(ClimateDataAnalysis.model_json_schema(), indent=2))

In [None]:
def create_climate_analysis_prompt(weather_summary: str) -> str:
    """Create a schema-based prompt for climate data analysis."""
    schema = ClimateDataAnalysis.model_json_schema()
    
    prompt = f"""You are an expert meteorologist. Analyze the following weather data summary and extract structured information.

WEATHER DATA SUMMARY:
{weather_summary}

OUTPUT REQUIREMENTS:
Provide your analysis as a valid JSON object that strictly conforms to this schema:

{json.dumps(schema, indent=2)}

CRITICAL INSTRUCTIONS:
1. Return ONLY valid JSON - no markdown formatting, no explanatory text
2. Ensure all required fields are present
3. Follow exact field types and constraints
4. Arrays must contain the specified minimum number of items
5. Dates must be in YYYY-MM-DD format
6. Make reasonable inferences based on the weather data
7. For missing information, use plausible meteorological estimates

Begin your response with {{ and end with }}"""
    return prompt

# Sample weather summary
sample_summary = """Weather stations across the Midwest have recorded unprecedented temperature variations over the past month.
Our analysis of 15 monitoring stations shows a 23% increase in temperature fluctuations compared to historical averages.
The data reveals significant correlation between wind patterns and local temperature variations. Precipitation measurements
indicate irregular rainfall distribution with some areas experiencing 40% above normal levels while others report drought conditions."""

prompt = create_climate_analysis_prompt(sample_summary)
result = validate_llm_response_with_retry(
    prompt=prompt,
    data_model=ClimateDataAnalysis,
    max_retries=3
)

if result:
    print("\n✓ Climate Analysis Completed!")
    print("\nStructured Output:")
    print(json.dumps(result.model_dump(), indent=2, default=str))

## 6. Best Practices and Patterns

Let's explore some important patterns for production use.

### Custom Validators

In [None]:
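from pydantic import field_validator  # Pydantic v2 replacement for @validator

class WeatherMeasurement(BaseModel):
    measurement_id: str
    temperature: float
    unit: str
    timestamp: datetime

    @field_validator('temperature')
    @classmethod
    def temperature_must_be_realistic(cls, v: float) -> float:
        # Reject values outside a plausible range
        if v < -150 or v > 150:
            raise ValueError('Temperature must be between -150 and 150 degrees')
        return v

    @field_validator('unit')
    @classmethod
    def unit_must_be_valid(cls, v: str) -> str:
        # Accept any letter case, then normalize to upper case
        valid_units = {'F', 'C', 'K'}
        if v.upper() not in valid_units:
            raise ValueError(f'Unit must be one of {valid_units}')
        return v.upper()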

# Test valid measurement
measurement = WeatherMeasurement(
    measurement_id="TEMP123",
    temperature=72.5,
    unit="f",
    timestamp=datetime.now()
)
print(f"Valid measurement: {measurement.temperature}°{measurement.unit}")

# Test invalid temperature
try:
    invalid_measurement = WeatherMeasurement(
        measurement_id="TEMP124",
        temperature=200.0,
        unit="F",
        timestamp=datetime.now()
    )
except ValidationError as e:
    print(f"\nValidation error: {e}")
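
The temperature check above applies one range whatever the unit, so a plausible Kelvin reading such as 295 would be rejected. A cross-field rule is one way to handle this. The sketch below uses Pydantic v2's `model_validator`; `UnitAwareMeasurement` and the `BOUNDS` table are hypothetical, and the bounds are illustrative values, not meteorological limits.

```python
from pydantic import BaseModel, ValidationError, model_validator

# Hypothetical per-unit bounds, chosen only for illustration
BOUNDS = {"F": (-150.0, 150.0), "C": (-100.0, 70.0), "K": (170.0, 340.0)}


class UnitAwareMeasurement(BaseModel):
    temperature: float
    unit: str

    @model_validator(mode="after")
    def temperature_fits_unit(self):
        # Runs after both fields are validated, so it can compare them
        unit = self.unit.upper()
        if unit not in BOUNDS:
            raise ValueError(f"Unit must be one of {sorted(BOUNDS)}")
        low, high = BOUNDS[unit]
        if not low <= self.temperature <= high:
            raise ValueError(
                f"{self.temperature} is outside [{low}, {high}] for unit {unit}"
            )
        return self


# 295 K is accepted, but the same number in Fahrenheit is not
kelvin_ok = UnitAwareMeasurement(temperature=295.0, unit="K")

try:
    UnitAwareMeasurement(temperature=295.0, unit="F")
    unit_error = None
except ValidationError as e:
    unit_error = e.errors()[0]["msg"]

print(unit_error)
```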

### Model Inheritance

In [None]:
from pydantic import ValidationInfo, field_validator  # Pydantic v2 validator API

class BaseStationInfo(BaseModel):
    station_name: str
    contact_email: EmailStr
    station_id: str = Field(pattern=r"^WX\d{6}$")

class StationRegistration(BaseStationInfo):
    access_code: str = Field(min_length=8, max_length=128)
    confirm_code: str
    calibration_verified: bool = True

    @field_validator('confirm_code')
    @classmethod
    def codes_match(cls, v: str, info: ValidationInfo) -> str:
        # info.data holds the fields validated so far (v2 replacement for `values`)
        if 'access_code' in info.data and v != info.data['access_code']:
            raise ValueError('Access codes do not match')
        return v

class StationProfile(BaseStationInfo):
    location_description: Optional[str] = Field(default=None, max_length=500)
    station_url: Optional[HttpUrl] = None
    installed_at: datetime
    last_reading: Optional[datetime] = None

# Create instances
registration = StationRegistration(
    station_name="Downtown Weather Station",
    contact_email="admin@weather.gov",
    station_id="WX123456",
    access_code="SecureCode123",
    confirm_code="SecureCode123"
)

profile = StationProfile(
    station_name="Downtown Weather Station",
    contact_email="admin@weather.gov",
    station_id="WX123456",
    installed_at=datetime.now(),
    location_description="Urban monitoring station"
)

print("Registration:", registration.station_name)
print("Profile:", profile.station_name, "-", profile.location_description)

### Safe Validation Helper

In [None]:
from typing import Union

def safe_validate(
    data: Union[str, dict],
    model: type[BaseModel]
) -> tuple[BaseModel | None, dict]:
    """
    Safely validate data with detailed error information.
    
    Returns:
        (validated_model, error_info) where error_info is empty dict if successful
    """
    try:
        if isinstance(data, str):
            validated = model.model_validate_json(data)
        else:
            validated = model.model_validate(data)
        return validated, {}

    except ValidationError as e:
        # In Pydantic v2, malformed JSON also raises ValidationError
        # (error type "json_invalid") rather than json.JSONDecodeError.
        errors = e.errors()
        is_json_error = any(err["type"] == "json_invalid" for err in errors)
        return None, {
            "error_type": "json_decode" if is_json_error else "validation",
            "errors": [
                {
                    "field": ".".join(str(x) for x in err["loc"]),
                    "message": err["msg"],
                    "type": err["type"]
                }
                for err in errors
            ]
        }

# Test the helper
test_data = '{"station_id": "Test", "location": "Chicago", "temperature": "very_hot", "conditions": "Sunny"}'
result, error_info = safe_validate(test_data, WeatherReport)

if result:
    print("✓ Validation successful")
else:
    print("✗ Validation failed:")
    print(json.dumps(error_info, indent=2))

## 7. Key Takeaways and Summary

### Key Takeaways:

1. **Start with Clear Models**: Well-defined Pydantic models are the foundation of reliable weather data processing

2. **Prefer Schemas Over Examples**: Include the output of `model_json_schema()` in the prompt so the LLM sees the exact field names, types, and constraints

3. **Implement Retry Logic**: LLMs don't always get it right the first time; build in error handling and retries

4. **Validate Early and Often**: Catch data issues as early as possible in your weather pipeline

5. **Leverage Type Safety**: Pydantic's runtime type validation catches malformed values before they reach downstream code

6. **Document Your Models**: Use `Field(description=...)` so that field descriptions appear in the JSON schema the LLM receives

7. **Test Thoroughly**: Write tests for both valid and invalid weather data scenarios
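
Takeaway 7 can be sketched as a pair of small tests, one for a valid payload and one for an invalid payload. `Observation` is a hypothetical model used only for this illustration; the functions follow pytest's naming convention but are called directly here so the cell runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class Observation(BaseModel):  # hypothetical model under test
    station_id: str = Field(pattern=r"^[A-Z]{4}$")
    temperature: float = Field(ge=-50, le=130)


def test_accepts_valid_observation():
    obs = Observation.model_validate_json(
        '{"station_id": "KORD", "temperature": 72.5}'
    )
    assert obs.temperature == 72.5


def test_rejects_bad_station_and_temperature():
    try:
        Observation(station_id="kord", temperature=500)
    except ValidationError as e:
        # Both fields should be reported, not just the first failure
        failed = {err["loc"][0] for err in e.errors()}
        assert failed == {"station_id", "temperature"}
    else:
        raise AssertionError("expected a ValidationError")


# pytest would discover these by name; calling them directly also works
test_accepts_valid_observation()
test_rejects_bad_station_and_temperature()
checks_passed = True
print("all checks passed")
```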

### Next Steps:

- Try the structured-output features of modern LLM APIs (such as OpenAI's), which accept Pydantic models directly
- Explore the `instructor` library for more streamlined LLM + Pydantic workflows
- Build production-ready validation pipelines for meteorological data with comprehensive error handling
- If you are still on Pydantic v1, migrate to v2: the examples in this notebook rely on v2 methods such as `model_validate_json` and `model_dump`, and v2 also offers improved performance

## 8. Practice Exercises

Try these exercises to reinforce your learning:

In [None]:
# Exercise 1: Create a model for a weather observation report
# Requirements:
# - station_name: string
# - observer_name: string
# - temperature: float between -100 and 150
# - observation_notes: string (min 50 chars)
# - observation_date: date
# - severe_weather: boolean
# - weather_type: one of ["clear", "cloudy", "rainy", "stormy", "snowy"]

# Your code here:


In [None]:
# Exercise 2: Create a prompt for the LLM to analyze a weather observation
# and validate the response using your model

sample_observation = """Observed conditions at Central Park Weather Station today: Temperature reached 78°F with clear skies and light winds from the southwest. 
Visibility excellent at 10+ miles with no precipitation. Perfect conditions for outdoor activities. 
No severe weather warnings in effect for the metropolitan area."""

# Your code here:
