# Working with LLMs Effectively: Pydantic, Crawl4AI & Instructor

## 📚 Learning Objectives
By the end of this notebook, you will:
- Understand how to structure LLM outputs
- Learn how to create data schemas with Pydantic
- Master web scraping with LLM-powered extraction using Crawl4AI
- Use Instructor for reliable structured outputs from any LLM
- Build production-ready AI workflows that validate and handle errors gracefully

Libraries we will use:
1. Pydantic - defines what your data should look like and ensures that it is valid
2. Instructor - patches LLM APIs to return Pydantic objects
3. Crawl4AI - converts website HTML into Markdown that LLMs can understand, which helps reduce hallucinations

## 🎯 The Problem: LLM Outputs Are Unpredictable

Large Language Models excel at generating human-like text, but integrating their outputs into structured workflows is challenging. Consider these inconsistent outputs from the same prompt:

```python
# Attempt 1: LLM returns
{"price": "$10.00", "name": "Widget", "sku": "W123"}

# Attempt 2: LLM returns  
{"price": "ten dollars", "name": "Widget"}  # Missing SKU!

# Attempt 3: LLM returns
{"cost": "$10.00", "product_name": "Widget", "sku": "W123"}  # Different keys!
```

**This inconsistency breaks production systems.** We need a way to enforce structure and validate outputs.

In [16]:
%pip install pydantic crawl4ai instructor openai --quiet
%pip install nest_asyncio

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [17]:
# Import required libraries
import os
import json
import asyncio
from typing import List, Optional, Union
from datetime import datetime

# Core libraries
from pydantic import BaseModel, Field, field_validator, ValidationError
import instructor
import openai

# Web scraping
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import LLMConfig

## 📊 Part 1: Pydantic Fundamentals

### What is Pydantic?

Pydantic is a data validation library that uses Python type hints to validate data. It acts as a "data contract" that ensures your data has the expected structure and types.

In [18]:
# Basic Pydantic model
class Product(BaseModel):
    name: str
    price: float
    in_stock: bool
    tags: List[str] = []

# Valid data
valid_product = Product(
    name="Laptop",
    price=999.99,
    in_stock=True,
    tags=["electronics", "computers"]
)

print(f"Valid product: {valid_product}")
print(f"JSON representation: {valid_product.model_dump_json(indent=2)}")

Valid product: name='Laptop' price=999.99 in_stock=True tags=['electronics', 'computers']
JSON representation: {
  "name": "Laptop",
  "price": 999.99,
  "in_stock": true,
  "tags": [
    "electronics",
    "computers"
  ]
}


In [19]:
# Let's see what happens with invalid data
try:
    invalid_product = Product(
        name="Laptop",
        price="not a number",  # This should be a float
        in_stock=True
    )
except ValidationError as e:
    print("Validation Error:")
    print(e.json(indent=2))

Validation Error:
[
  {
    "type": "float_parsing",
    "loc": [
      "price"
    ],
    "msg": "Input should be a valid number, unable to parse string as a number",
    "input": "not a number",
    "url": "https://errors.pydantic.dev/2.11/v/float_parsing"
  }
]
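
To tie this back to the three inconsistent outputs from the problem statement, here is a minimal sketch that runs each of them through a Pydantic model. The `ProductRecord` class and the `results` list are hypothetical names introduced only for this illustration.

```python
from pydantic import BaseModel, ValidationError


class ProductRecord(BaseModel):
    """Hypothetical schema for the three sample outputs shown earlier."""
    name: str
    price: str
    sku: str


attempts = [
    {"price": "$10.00", "name": "Widget", "sku": "W123"},
    {"price": "ten dollars", "name": "Widget"},
    {"cost": "$10.00", "product_name": "Widget", "sku": "W123"},
]

results = []
for number, payload in enumerate(attempts, start=1):
    try:
        ProductRecord.model_validate(payload)
        results.append((number, "ok", []))
    except ValidationError as e:
        # Collect the names of required fields that were absent
        missing = [err["loc"][0] for err in e.errors() if err["type"] == "missing"]
        results.append((number, "invalid", missing))

for row in results:
    print(row)
```

Attempts 2 and 3 are rejected because required keys are absent. Note that `"ten dollars"` still passes a plain `str` check, so type hints alone do not catch every semantic problem; that is what field constraints and custom validators are for.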


### Advanced Pydantic Features

In [20]:
class AdvancedProduct(BaseModel):
    name: str = Field(..., min_length=1, max_length=100, description="Product name")
    price: float = Field(..., gt=0, description="Price must be positive")
    rating: Optional[float] = Field(None, ge=1, le=5, description="Rating between 1-5")
    tags: List[str] = Field(default_factory=list, description="Product tags")
    created_at: datetime = Field(default_factory=datetime.now)
    
    @field_validator('tags')
    @classmethod
    def validate_tags(cls, v: List[str]) -> List[str]:
        # Normalize tags: strip surrounding whitespace and lowercase
        return [tag.strip().lower() for tag in v]
    
    @field_validator('price')
    @classmethod
    def validate_price(cls, v: float) -> float:
        # Round price to 2 decimal places
        return round(v, 2)

# Test the advanced model
advanced_product = AdvancedProduct(
    name="Gaming Laptop",
    price=1499.999,  # Will be rounded
    rating=4.5,
    tags=["Gaming", "ELECTRONICS", " computers "]  # Will be normalized
)

print(f"Advanced product: {advanced_product.model_dump_json(indent=2)}")

Advanced product: {
  "name": "Gaming Laptop",
  "price": 1500.0,
  "rating": 4.5,
  "tags": [
    "gaming",
    "electronics",
    "computers"
  ],
  "created_at": "2025-06-13T12:48:03.734700"
}
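
The example above only exercises the happy path. As a rough sketch of how the `Field` constraints behave on bad input, the snippet below uses a trimmed-down model with the same constraints. `PricedItem` and `constraint_errors` are hypothetical names used only here.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class PricedItem(BaseModel):
    """Hypothetical model reusing the constraints from AdvancedProduct."""
    name: str = Field(..., min_length=1, max_length=100)
    price: float = Field(..., gt=0)
    rating: Optional[float] = Field(None, ge=1, le=5)


def constraint_errors(**kwargs) -> dict:
    """Return {field_name: error_type} for every constraint that failed."""
    try:
        PricedItem(**kwargs)
        return {}
    except ValidationError as e:
        return {err["loc"][0]: err["type"] for err in e.errors()}


errors = constraint_errors(name="", price=-5, rating=6)
print(errors)
print(constraint_errors(name="Pen", price=1.5))
```

Pydantic reports every failing field in a single `ValidationError` rather than stopping at the first one, which makes the error useful as feedback for a retry prompt.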


## 🕷️ Part 2: Web Scraping with Crawl4AI

**Crawl4AI** is a web scraping library designed for LLM workflows. It drives a headless browser, so it can handle JavaScript-heavy websites and dynamic content that traditional scrapers struggle with. Instead of manually parsing HTML and writing complex selectors, you describe what you want in natural language and let an LLM extract structured data from the page. The extraction strategy accepts a JSON schema generated from a Pydantic model, so the scraped data can be validated against that same model before it enters the rest of your pipeline.

In [21]:
# Read your OpenAI API key from the environment (create one at https://platform.openai.com)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Step 1: Define our data contract
class Product(BaseModel):
    name: str = Field(..., description="Product name from the webpage")
    price: str = Field(..., description="Current price including currency")
    rating: Optional[float] = Field(None, description="User rating between 1-5 stars")
    features: list[str] = Field(..., description="Key product features")

print("Product schema:")
print(json.dumps(Product.model_json_schema(), indent=2))

Product schema:
{
  "properties": {
    "name": {
      "description": "Product name from the webpage",
      "title": "Name",
      "type": "string"
    },
    "price": {
      "description": "Current price including currency",
      "title": "Price",
      "type": "string"
    },
    "rating": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "User rating between 1-5 stars",
      "title": "Rating"
    },
    "features": {
      "description": "Key product features",
      "items": {
        "type": "string"
      },
      "title": "Features",
      "type": "array"
    }
  },
  "required": [
    "name",
    "price",
    "features"
  ],
  "title": "Product",
  "type": "object"
}


In [22]:
# Step 2: Configure the LLM extraction strategy
def create_extraction_strategy():
    return LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini", 
            api_token=OPENAI_API_KEY
        ),
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="""
            Extract product details from this e-commerce page. 
            Be precise with pricing (include currency symbols).
            If information is not available, use null for optional fields.
            Extract features as a list of key product characteristics.
        """,
        chunk_token_threshold=2048,
        verbose=True
    )

In [23]:
# Step 3: Execute the scraping
async def scrape_product(url: str) -> Optional[Product]:
    """
    Scrape a product page and return structured data
    """
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--no-sandbox"]
    )
    
    crawl_config = CrawlerRunConfig(
        extraction_strategy=create_extraction_strategy(),
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=50
    )
    
    try:
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(url=url, config=crawl_config)
            
            if result.success and result.extracted_content:
                # Parse and validate with Pydantic
                product_data = Product.model_validate_json(result.extracted_content)
                return product_data
            else:
                print(f"Scraping failed: {result.error_message}")
                return None
                
    except ValidationError as e:
        print(f"Data validation failed: {e}")
        print(f"Raw extracted content: {result.extracted_content}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

In [24]:
# Example usage
product = await scrape_product("https://www.amazon.com/stores/page/A0F96D7A-62B9-40A6-B9FF-6143D9E58BFC")
if product:
    print(f"Extracted product: {product.model_dump_json(indent=2)}")

Unexpected error: 


This cell fails when run inside the Jupyter notebook, and the cause has not been identified yet. See `example_1-webscraping.py` for a script version, which produces the output below.

Output:
```
[
    {
        "name": "Beats Powerbeats Pro 2 Wireless Bluetooth Earbuds - Noise Cancelling, Heart Rate Monitor, IPX4, Up to 45H Battery & Charging Case, Works with Apple & Android - Jet Black",
        "price": "$199.95",
        "rating": null,
        "features": [
            "Noise Cancelling",
            "Heart Rate Monitor",
            "IPX4 Water Resistance",
            "Up to 45H Battery Life",
            "Charging Case",
            "Compatible with Apple & Android"
        ],
        "error": false
    },
    {
        "name": "Beats Powerbeats Pro 2 Wireless Bluetooth Earbuds - Noise Cancelling, Heart Rate Monitor, IPX4, Up to 45H Battery & Charging Case, Works with Apple & Android - Electric Orange",
        "price": "$199.95",
        "rating": null,
        "features": [
            "Noise Cancelling",
            "Heart Rate Monitor",
            "IPX4 Water Resistance",
            "Up to 45H Battery Life",
            "Charging Case",
            "Compatible with Apple & Android"
        ],
        "error": false
    },
    ...
]
```
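
The output above is a JSON array with one object per extracted item, plus an extra `error` key, rather than a single product. One way to validate such a payload is Pydantic's `TypeAdapter`, sketched below. `ScrapedProduct` is a hypothetical model with an explicitly optional `rating`, and the `raw` string is an abbreviated stand-in for the real extracted content, not actual scraper output.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field, TypeAdapter


class ScrapedProduct(BaseModel):
    """Hypothetical model; rating is Optional so that null is accepted."""
    name: str
    price: str
    rating: Optional[float] = None
    features: List[str] = Field(default_factory=list)


# Abbreviated stand-in for the extracted content shown above
raw = json.dumps([
    {"name": "Earbuds - Jet Black", "price": "$199.95", "rating": None,
     "features": ["Noise Cancelling", "Heart Rate Monitor"], "error": False},
    {"name": "Earbuds - Electric Orange", "price": "$199.95", "rating": None,
     "features": ["Noise Cancelling", "Heart Rate Monitor"], "error": False},
])

# TypeAdapter validates a whole list of models in one call
products = TypeAdapter(List[ScrapedProduct]).validate_json(raw)

for p in products:
    print(p.name, p.price, p.rating)
```

By default Pydantic ignores keys that are not declared on the model, so the extra `error` field is dropped silently. If you would rather reject unexpected keys, the model configuration can be set to forbid extras.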

## 🎯 Part 3: Instructor for Direct LLM Integration

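Instructor patches an LLM client (here, OpenAI's) so that chat completion calls accept a `response_model` argument and return a validated Pydantic object directly, instead of raw text that you have to parse yourself.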

In [25]:
# Patch the OpenAI client with instructor
client = instructor.patch(openai.OpenAI())

# Define a response schema
class PersonInfo(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job or profession")
    skills: List[str] = Field(description="List of skills or expertise")
    
class AnalysisResult(BaseModel):
    sentiment: str = Field(description="Sentiment: positive, negative, or neutral")
    confidence: float = Field(description="Confidence score between 0 and 1")
    key_topics: List[str] = Field(description="Main topics discussed")

In [26]:
# Example: Extract structured information from unstructured text
def analyze_text_with_instructor(text: str) -> Optional[AnalysisResult]:
    """
    Use instructor to get structured analysis from unstructured text
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_model=AnalysisResult,
            messages=[
                {
                    "role": "system", 
                    "content": "You are an expert text analyst. Extract sentiment, confidence, and key topics from the given text."
                },
                {
                    "role": "user", 
                    "content": f"Analyze this text: {text}"
                }
            ]
        )
        return response
    except Exception as e:
        print(f"Error: {e}")
        return None

# Test with sample text
sample_text = """
    I absolutely love the new smartphone I bought yesterday! 
    The camera quality is amazing and the battery life exceeds my expectations. 
    The user interface is intuitive and the build quality feels premium. 
    However, I wish it had better gaming performance for intensive games.
"""

In [None]:
result = analyze_text_with_instructor(sample_text)
if result:
    print(f"Analysis: {result.model_dump_json(indent=2)}")

Analysis: {
  "sentiment": "positive",
  "confidence": 0.85,
  "key_topics": [
    "smartphone",
    "camera quality",
    "battery life",
    "user interface",
    "build quality",
    "gaming performance"
  ]
}
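
To make the "validate and handle errors gracefully" goal concrete, here is a conceptual sketch of a validate-and-retry loop: validate the reply, and on failure ask again with the validation error as feedback. It is not Instructor's actual implementation. `Analysis`, `call_with_retries`, and `fake_llm` are hypothetical, and `fake_llm` stands in for a real model call so the example runs offline.

```python
import json
from typing import Callable, List, Optional

from pydantic import BaseModel, Field, ValidationError


class Analysis(BaseModel):
    """Hypothetical model mirroring AnalysisResult, with a bounded confidence."""
    sentiment: str
    confidence: float = Field(ge=0, le=1)
    key_topics: List[str]


def call_with_retries(ask: Callable[[Optional[str]], str], max_retries: int = 2) -> Analysis:
    """Call `ask`, validate its JSON reply, and re-ask with the error text on failure."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        raw = ask(feedback)
        try:
            return Analysis.model_validate_json(raw)
        except ValidationError as e:
            feedback = str(e)  # passed to the next attempt
    raise ValueError(f"No valid reply after {max_retries + 1} attempts: {feedback}")


# Stand-in for a model: the first reply is out of range, the second is valid
replies = iter([
    json.dumps({"sentiment": "positive", "confidence": 85, "key_topics": ["camera"]}),
    json.dumps({"sentiment": "positive", "confidence": 0.85, "key_topics": ["camera"]}),
])
calls = []


def fake_llm(feedback: Optional[str]) -> str:
    calls.append(feedback)
    return next(replies)


analysis = call_with_retries(fake_llm)
print(analysis.confidence, len(calls))
```

In a real workflow, `ask` would send the feedback text back to the model as part of the next prompt.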



## Documentation Links
- [Pydantic](https://docs.pydantic.dev/)
- [Crawl4AI](https://docs.crawl4ai.com/)
- [Instructor](https://python.useinstructor.com/)
- [Mirascope (Alternative to Instructor)](https://mirascope.com/docs/mirascope)

Tutorial for getting around most blockers, captchas & rate limits: 
- https://youtu.be/Htb_NsGlbgc?si=n9JWb6Na2zKoFh-z