# Structured Output Validation with Pydantic
This hands-on tutorial demonstrates how to validate and manage structured data using Pydantic in a Jupyter notebook on Colab. We'll cover:

- Installing Pydantic
- Defining simple models
- Handling validation errors
- Nested models
- Collections and optional fields
- Custom validation logic
- Cross-field validation with root validators

Each section includes example code and brief explanations.

In [None]:
# Install Pydantic
!pip install pydantic

## 1. Defining Simple Models
Use Pydantic's `BaseModel` to define data structures with type annotations. Instances are validated on creation.

In [None]:
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str

# Valid instance
user = User(id=1, name='Alice')
print(user)

# Invalid instance: the next cell passes a bad `id` and catches the resulting ValidationError

In [None]:
try:
    User(id='not-an-int', name=123)
except ValidationError as e:
    print(e)
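Printing the exception gives a human-readable summary. When the errors need to be handled in code, `ValidationError.errors()` returns them as a list of dictionaries. The snippet below is a minimal sketch of that; the `User` model is repeated so the cell runs on its own.

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str

try:
    User(id='not-an-int', name='Alice')
except ValidationError as e:
    # errors() returns one dict per failed field, with keys such as 'loc' and 'msg'
    error_list = e.errors()
    for err in error_list:
        print(err['loc'], err['msg'])
```

The `loc` tuple names the field that failed, which makes it straightforward to map errors back to inputs.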

## 2. Nested Models
Pydantic supports nesting models to represent complex structures.

In [None]:
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str

class UserWithAddress(BaseModel):
    id: int
    name: str
    address: Address

# You can pass a dict for nested models
user = UserWithAddress(id=2, name='Bob', address={'street':'Main St', 'city':'Metropolis'})
print(user)
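A short sketch of what nesting gives you in practice: the dict passed for `address` becomes an `Address` instance, and an error inside the nested data is reported with the full path to the offending field. The models are repeated here so the cell is self-contained.

```python
from pydantic import BaseModel, ValidationError

class Address(BaseModel):
    street: str
    city: str

class UserWithAddress(BaseModel):
    id: int
    name: str
    address: Address

user = UserWithAddress(id=2, name='Bob', address={'street': 'Main St', 'city': 'Metropolis'})
# The dict was converted into an Address instance during validation
print(type(user.address).__name__, user.address.city)

try:
    UserWithAddress(id=3, name='Carol', address={'street': 'Elm St'})  # 'city' is missing
except ValidationError as e:
    # Each 'loc' is a path from the outer model down to the failing field
    nested_locs = [err['loc'] for err in e.errors()]
    print(nested_locs)
```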

## 3. List and Optional Fields
Leverage `typing` annotations for lists and optional values.

In [None]:
from typing import List, Optional
from pydantic import BaseModel

class Blog(BaseModel):
    title: str
    tags: List[str]
    summary: Optional[str] = None

blog = Blog(title='My Post', tags=['pydantic', 'validation'])
print(blog)
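The difference between the two field styles can be checked directly. In this sketch, `summary` falls back to its default of `None`, while `tags` has no default and is therefore required.

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class Blog(BaseModel):
    title: str
    tags: List[str]
    summary: Optional[str] = None

blog = Blog(title='My Post', tags=['pydantic'])
print(blog.summary)  # no summary was supplied, so the default is used

try:
    Blog(title='No tags')  # 'tags' has no default, so omitting it is an error
except ValidationError as e:
    missing_fields = [err['loc'][0] for err in e.errors()]
    print(missing_fields)
```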

## 4. Custom Validation Logic
Add methods decorated with `@validator` for field-level validation.

In [None]:
from pydantic import BaseModel, validator, ValidationError

class Product(BaseModel):
    name: str
    price: float

    @validator('price')
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

# Valid product
product = Product(name='Book', price=9.99)
print(product)

# Invalid product: the next cell passes a negative price and catches the ValidationError

In [None]:
try:
    Product(name='Pen', price=-1)
except ValidationError as e:
    print(e)
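The `@validator` decorator comes from Pydantic v1. Pydantic v2 still accepts it but marks it deprecated in favor of `@field_validator`. The sketch below shows the same price check in the v2 style. It assumes Pydantic v2 is installed, and the class name `ProductV2` is only an illustrative label to avoid clashing with `Product` above.

```python
from pydantic import BaseModel, ValidationError, field_validator

class ProductV2(BaseModel):
    name: str
    price: float

    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v: float) -> float:
        # Runs after 'price' has been parsed as a float
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

try:
    ProductV2(name='Pen', price=-1)
    rejected = False
except ValidationError as e:
    rejected = True
    print(e)
```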

## 5. Root Validators
Use `@root_validator` to validate across multiple fields.

In [None]:
from pydantic import BaseModel, root_validator, ValidationError

class Order(BaseModel):
    item_id: int
    quantity: int
    price_per_unit: float

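    # skip_on_failure=True runs this check only when every field validated on its own,
    # so both values are guaranteed to be present. Pydantic v2 requires this argument
    # for a root validator that runs after field validation.
    @root_validator(skip_on_failure=True)
    def check_total_value(cls, values):
        qty = values['quantity']
        price = values['price_per_unit']
        if qty * price > 10000:
            raise ValueError('Total order value cannot exceed 10000')
        return values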

# Valid order
order = Order(item_id=123, quantity=5, price_per_unit=1500)
print(order)

# Invalid order (value > 10000)
try:
    Order(item_id=124, quantity=10, price_per_unit=2000)
except ValidationError as e:
    print(e)
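In Pydantic v2, `@root_validator` is deprecated and `@model_validator` is the replacement. The following is a minimal sketch of the same cross-field check, assuming Pydantic v2; `OrderV2` is an illustrative name, not part of the original notebook. With `mode='after'`, the validator is an instance method that sees the already-validated fields.

```python
from pydantic import BaseModel, ValidationError, model_validator

class OrderV2(BaseModel):
    item_id: int
    quantity: int
    price_per_unit: float

    @model_validator(mode='after')
    def check_total_value(self):
        # All fields have been validated by the time this runs
        if self.quantity * self.price_per_unit > 10000:
            raise ValueError('Total order value cannot exceed 10000')
        return self

ok_order = OrderV2(item_id=123, quantity=5, price_per_unit=1500)  # 7500 in total

try:
    OrderV2(item_id=124, quantity=10, price_per_unit=2000)  # 20000 in total
    too_large_rejected = False
except ValidationError:
    too_large_rejected = True
```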

# Task
Create a Python notebook that uses an LLM to generate social computing/graph datasets and validates the generated output using Pydantic.

## Setup


In [2]:
!pip install google-generativeai networkx



## Define Pydantic model for social graph data

### Subtask:
Create a Pydantic `BaseModel` that defines the expected structure and types for social graph data (e.g., nodes, edges, their attributes).


**Reasoning**:
Define the Pydantic models for Person, Relationship, and SocialGraphData as specified in the instructions.



In [3]:
from typing import List, Optional
from pydantic import BaseModel

class Person(BaseModel):
    id: int
    name: str
    age: int
    interests: List[str]

class Relationship(BaseModel):
    source_id: int
    target_id: int
    type: str
    duration_months: Optional[int] = None

class SocialGraphData(BaseModel):
    people: List[Person]
    relationships: List[Relationship]
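These models check types and required fields, but they do not check that every relationship points at a person who actually exists. A small helper can cover that gap. The function below, `find_dangling_relationships`, is a hypothetical addition for illustration and is not part of the original notebook.

```python
from typing import List, Optional
from pydantic import BaseModel

class Person(BaseModel):
    id: int
    name: str
    age: int
    interests: List[str]

class Relationship(BaseModel):
    source_id: int
    target_id: int
    type: str
    duration_months: Optional[int] = None

class SocialGraphData(BaseModel):
    people: List[Person]
    relationships: List[Relationship]

def find_dangling_relationships(graph: SocialGraphData) -> List[Relationship]:
    """Return relationships whose source or target id is not in graph.people."""
    known_ids = {person.id for person in graph.people}
    return [
        rel for rel in graph.relationships
        if rel.source_id not in known_ids or rel.target_id not in known_ids
    ]

# Sample data with one relationship that references a non-existent person (id 99)
sample = SocialGraphData(
    people=[{'id': 1, 'name': 'Alice', 'age': 30, 'interests': ['reading']}],
    relationships=[{'source_id': 1, 'target_id': 99, 'type': 'friend'}],
)
dangling = find_dangling_relationships(sample)
print(dangling)
```

Keeping this as a plain function, rather than a validator on the model, leaves the schema unchanged and works the same way across Pydantic versions.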

## Integrate with LLM

### Subtask:
Write code to call an LLM and prompt it to generate social graph data.


**Reasoning**:
Import the necessary library for interacting with the LLM and configure it with an API key.



In [9]:
import google.generativeai as genai
import os

# Replace the placeholder with your key, or load it from an environment variable
# so the key is never committed with the notebook.
# In Google Colab, keys saved in the Secrets panel are read with
# `from google.colab import userdata; userdata.get('SECRET_NAME')`.
API_KEY = "YOUR_API_KEY_HERE"
genai.configure(api_key=API_KEY)

# Choose a suitable model
model = genai.GenerativeModel('gemini-2.0-flash')
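To keep the key out of the notebook itself, it can be read from the environment and checked before the client is configured. This is a sketch only: the helper `load_api_key` and the variable name `GOOGLE_API_KEY` are assumptions chosen for illustration.

```python
import os

def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from an environment variable, failing early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before configuring the client."
        )
    return key
```

With this in place, the configuration line would become `genai.configure(api_key=load_api_key())`.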

**Reasoning**:
Define the prompt to instruct the LLM to generate social graph data in JSON format based on the Pydantic models and call the LLM to generate the data.
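One way to keep the prompt in step with the models is to generate the schema description from the models themselves instead of writing it by hand. The sketch below assumes Pydantic v2, where `model_json_schema()` is available. The name `schema_prompt` is illustrative, and the notebook's own hand-written prompt follows in the next cell.

```python
import json
from typing import List, Optional
from pydantic import BaseModel

class Person(BaseModel):
    id: int
    name: str
    age: int
    interests: List[str]

class Relationship(BaseModel):
    source_id: int
    target_id: int
    type: str
    duration_months: Optional[int] = None

class SocialGraphData(BaseModel):
    people: List[Person]
    relationships: List[Relationship]

# Serialize the JSON Schema that Pydantic derives from the model definitions
schema_text = json.dumps(SocialGraphData.model_json_schema(), indent=2)

schema_prompt = (
    "Generate a JSON object representing social graph data.\n"
    "It must conform to this JSON Schema:\n"
    f"{schema_text}\n"
    "Return only the JSON object."
)
print(schema_prompt)
```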



In [11]:
prompt = """
Generate a JSON object representing social graph data.
The JSON should have two top-level keys: 'people' and 'relationships'.

'people' should be a list of objects, where each object represents a person and has the following structure:
- "id": an integer, a unique identifier for the person.
- "name": a string, the name of the person.
- "age": an integer, the age of the person.
- "interests": a list of strings, representing the person's interests.

'relationships' should be a list of objects, where each object represents a relationship between two people and has the following structure:
- "source_id": an integer, the id of the person initiating the relationship.
- "target_id": an integer, the id of the person receiving the relationship.
- "type": a string, the type of relationship (e.g., "friend", "family", "colleague").
- "duration_months": an optional integer, the duration of the relationship in months (can be null).

Generate data for at least 5 people and a reasonable number of relationships between them. Ensure the source_id and target_id in relationships correspond to the ids in the people list.

Example of the desired structure:
{
  "people": [
    {
      "id": 1,
      "name": "Alice",
      "age": 30,
      "interests": ["reading", "hiking"]
    },
    {
      "id": 2,
      "name": "Bob",
      "age": 25,
      "interests": ["gaming", "movies"]
    }
  ],
  "relationships": [
    {
      "source_id": 1,
      "target_id": 2,
      "type": "friend",
      "duration_months": 60
    }
  ]
}
"""

# Generate content from the model
response = model.generate_content(prompt)

# Store the raw output
raw_llm_output = response.text

# Print the raw output (optional, for debugging)
print(raw_llm_output)

```json
{
  "people": [
    {
      "id": 1,
      "name": "Alice",
      "age": 30,
      "interests": ["reading", "hiking", "yoga"]
    },
    {
      "id": 2,
      "name": "Bob",
      "age": 25,
      "interests": ["gaming", "movies", "coding"]
    },
    {
      "id": 3,
      "name": "Charlie",
      "age": 35,
      "interests": ["cooking", "traveling", "photography"]
    },
    {
      "id": 4,
      "name": "David",
      "age": 28,
      "interests": ["music", "sports", "hiking"]
    },
    {
      "id": 5,
      "name": "Eve",
      "age": 32,
      "interests": ["dancing", "reading", "painting"]
    }
  ],
  "relationships": [
    {
      "source_id": 1,
      "target_id": 2,
      "type": "friend",
      "duration_months": 72
    },
    {
      "source_id": 1,
      "target_id": 4,
      "type": "friend",
      "duration_months": 24
    },
    {
      "source_id": 2,
      "target_id": 3,
      "type": "colleague",
      "duration_months": 36
    },
    {
      "source_id

## Validate LLM output

### Subtask:
Use the Pydantic model to validate the LLM's raw output.

**Reasoning**:
Attempt to parse the raw LLM output as JSON and validate it against the SocialGraphData Pydantic model.

In [12]:
import json
from pydantic import ValidationError

# Assume raw_llm_output contains the raw string output from the LLM
# and SocialGraphData is the Pydantic model defined earlier

try:
    # Clean the output by removing the markdown code block
    cleaned_output = raw_llm_output.strip()
    if cleaned_output.startswith("```json"):
        cleaned_output = cleaned_output[7:]
    if cleaned_output.endswith("```"):
        cleaned_output = cleaned_output[:-3]

    # Parse the cleaned output as JSON
    graph_data_json = json.loads(cleaned_output)

    # Validate the JSON data using the Pydantic model
    validated_graph_data = SocialGraphData(**graph_data_json)

    print("LLM output is valid according to the Pydantic model.")
    print(validated_graph_data)

except json.JSONDecodeError as e:
    print(f"JSON Decode Error: Could not parse LLM output as JSON. {e}")
    print("Raw LLM Output:")
    print(raw_llm_output)
except ValidationError as e:
    print(f"Pydantic Validation Error: LLM output does not match the model schema. {e}")
    print("Raw LLM Output:")
    print(raw_llm_output)
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    print("Raw LLM Output:")
    print(raw_llm_output)

LLM output is valid according to the Pydantic model.
people=[Person(id=1, name='Alice', age=30, interests=['reading', 'hiking', 'yoga']), Person(id=2, name='Bob', age=25, interests=['gaming', 'movies', 'coding']), Person(id=3, name='Charlie', age=35, interests=['cooking', 'traveling', 'photography']), Person(id=4, name='David', age=28, interests=['music', 'sports', 'hiking']), Person(id=5, name='Eve', age=32, interests=['dancing', 'reading', 'painting'])] relationships=[Relationship(source_id=1, target_id=2, type='friend', duration_months=72), Relationship(source_id=1, target_id=4, type='friend', duration_months=24), Relationship(source_id=2, target_id=3, type='colleague', duration_months=36), Relationship(source_id=2, target_id=5, type='friend', duration_months=12), Relationship(source_id=3, target_id=4, type='family', duration_months=None), Relationship(source_id=4, target_id=1, type='friend', duration_months=24), Relationship(source_id=5, target_id=3, type='friend', duration_months=
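The cleaning step in the cell above only recognizes a fence that opens with the `json` language tag. A slightly more tolerant version drops the whole opening fence line, whatever tag it carries. The helper name `parse_llm_json` is hypothetical, and this is a sketch rather than a complete parser for every shape of LLM output.

```python
import json

def parse_llm_json(raw: str):
    """Strip an optional markdown code fence and parse the remainder as JSON."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line, with or without a language tag
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return json.loads(text)

print(parse_llm_json('```json\n{"people": [], "relationships": []}\n```'))
```

The parsed dictionary can then be passed to `SocialGraphData(**data)` exactly as in the cell above.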

## Process Validated Social Graph Data

### Subtask:
Load the validated social graph data into a graph library and perform basic processing or analysis.

**Reasoning**:
Use the NetworkX library to create a graph from the validated people and relationships data.

In [13]:
import networkx as nx

# Assume validated_graph_data contains the validated data from the previous step

# Create a directed graph
G = nx.DiGraph()

# Add nodes (people)
for person in validated_graph_data.people:
    G.add_node(person.id, name=person.name, age=person.age, interests=person.interests)

# Add edges (relationships)
for relationship in validated_graph_data.relationships:
    G.add_edge(relationship.source_id, relationship.target_id, type=relationship.type, duration_months=relationship.duration_months)

print(f"Created a graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")

# Example basic analysis: print node attributes
print("\nNode Attributes:")
for node, attributes in G.nodes(data=True):
    print(f"Node ID: {node}, Attributes: {attributes}")

# Example basic analysis: print edge attributes
print("\nEdge Attributes:")
for source, target, attributes in G.edges(data=True):
    print(f"Edge from {source} to {target}, Attributes: {attributes}")

Created a graph with 5 nodes and 8 edges.

Node Attributes:
Node ID: 1, Attributes: {'name': 'Alice', 'age': 30, 'interests': ['reading', 'hiking', 'yoga']}
Node ID: 2, Attributes: {'name': 'Bob', 'age': 25, 'interests': ['gaming', 'movies', 'coding']}
Node ID: 3, Attributes: {'name': 'Charlie', 'age': 35, 'interests': ['cooking', 'traveling', 'photography']}
Node ID: 4, Attributes: {'name': 'David', 'age': 28, 'interests': ['music', 'sports', 'hiking']}
Node ID: 5, Attributes: {'name': 'Eve', 'age': 32, 'interests': ['dancing', 'reading', 'painting']}

Edge Attributes:
Edge from 1 to 2, Attributes: {'type': 'friend', 'duration_months': 72}
Edge from 1 to 4, Attributes: {'type': 'friend', 'duration_months': 24}
Edge from 2 to 3, Attributes: {'type': 'colleague', 'duration_months': 36}
Edge from 2 to 5, Attributes: {'type': 'friend', 'duration_months': 12}
Edge from 3 to 4, Attributes: {'type': 'family', 'duration_months': None}
Edge from 4 to 1, Attributes: {'type': 'friend', 'duration
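Once the data is in a `DiGraph`, simple structural questions become one-liners. The sketch below rebuilds a small graph from the edges visible in the output above (attributes omitted) and asks who receives the most relationships and whether the graph is connected when direction is ignored.

```python
import networkx as nx

G = nx.DiGraph()
# Edges copied from the visible part of the output above
G.add_edges_from([(1, 2), (1, 4), (2, 3), (2, 5), (3, 4), (4, 1)])

# In-degree counts how many relationships point at each person
in_degrees = dict(G.in_degree())
most_targeted = max(in_degrees, key=in_degrees.get)
print(in_degrees, most_targeted)

# Weakly connected components ignore edge direction
components = list(nx.weakly_connected_components(G))
print(len(components))
```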

## Implement Iterative Generation and Validation

### Subtask:
Refactor the LLM call and validation into an iterative process with feedback.

**Reasoning**:
Combine the LLM call and validation steps into a function that attempts validation and provides feedback to the LLM in case of errors, retrying for a set number of attempts.
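The shape of this loop can be sketched without any LLM client. In the version below, `generate` is any callable that maps a prompt to text, and validation is reduced to JSON parsing to keep the example short. `retry_with_feedback` and `fake_generate` are hypothetical names used only for illustration.

```python
import json
from typing import Callable, Optional

def retry_with_feedback(
    generate: Callable[[str], str],
    base_prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call generate until its output parses as JSON, feeding errors back into the prompt."""
    feedback = ""
    for _ in range(max_attempts):
        prompt = base_prompt
        if feedback:
            prompt = f"{base_prompt}\n\nPrevious error:\n{feedback}"
        raw = generate(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = str(exc)  # carried into the next attempt's prompt
    return None

# Stand-in for an LLM: returns invalid text once, then valid JSON
calls = []

def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return "not json" if len(calls) == 1 else '{"people": [], "relationships": []}'

result = retry_with_feedback(fake_generate, "Generate graph data")
print(result, len(calls))
```

Passing the generator in as a callable also makes the loop easy to test offline, since a stub can simulate failures without spending API calls.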

In [14]:
import json
from pydantic import ValidationError
import time # Import time for potential delays between retries

def generate_and_validate_graph_data(model, prompt, max_attempts=5):
    """
    Attempts to generate and validate social graph data iteratively.

    Args:
        model: The generative model to use for content generation.
        prompt: The initial prompt for the LLM.
        max_attempts: The maximum number of generation and validation attempts.

    Returns:
        A SocialGraphData object if validation is successful, otherwise None.
    """
    feedback = ""
    for attempt in range(max_attempts):
        print(f"Attempt {attempt + 1}/{max_attempts}...")

        # Add feedback to the prompt for retry attempts
        current_prompt = prompt
        if feedback:
            print("Providing feedback to the LLM.")
            current_prompt = f"{prompt}\n\nPrevious attempt failed validation with the following errors:\n{feedback}\nPlease try generating the data again, fixing these issues."

        try:
            # Generate content from the model
            response = model.generate_content(current_prompt)
            raw_llm_output = response.text
            print("Raw LLM Output:")
            print(raw_llm_output)

            # Clean the output by removing the markdown code block
            cleaned_output = raw_llm_output.strip()
            if cleaned_output.startswith("```json"):
                cleaned_output = cleaned_output[7:]
            if cleaned_output.endswith("```"):
                cleaned_output = cleaned_output[:-3]

            # Parse the cleaned output as JSON
            graph_data_json = json.loads(cleaned_output)

            # Validate the JSON data using the Pydantic model
            validated_graph_data = SocialGraphData(**graph_data_json)

            print("\nLLM output successfully validated.")
            return validated_graph_data # Return valid data

        except json.JSONDecodeError as e:
            feedback = f"JSON Decode Error: Could not parse output as JSON. Make sure the entire output is a valid JSON object. Error: {e}"
            print(feedback)
        except ValidationError as e:
            # Capture validation errors as feedback
            feedback = f"Pydantic Validation Error: Output does not match the schema. Errors:\n{e}"
            print(feedback)
        except Exception as e:
            feedback = f"An unexpected error occurred: {e}"
            print(feedback)

        # Optional: Add a small delay before retrying
        if attempt < max_attempts - 1:
            time.sleep(1)

    print(f"\nFailed to generate valid data after {max_attempts} attempts.")
    return None

# Now call the function to start the iterative process
# Make sure 'model' and 'prompt' variables are defined in previous cells
validated_data_iterative = generate_and_validate_graph_data(model, prompt)

# You can then use validated_data_iterative if it's not None
if validated_data_iterative:
    print("\nSuccessfully obtained validated social graph data through iteration.")
    # Proceed with processing the data, e.g., loading into NetworkX
    # (Code for loading into NetworkX would go here, using validated_data_iterative)
else:
    print("\nCould not obtain valid social graph data.")

Attempt 1/5...
Raw LLM Output:
```json
{
  "people": [
    {
      "id": 1,
      "name": "Alice",
      "age": 30,
      "interests": ["reading", "hiking", "photography"]
    },
    {
      "id": 2,
      "name": "Bob",
      "age": 25,
      "interests": ["gaming", "movies", "coding"]
    },
    {
      "id": 3,
      "name": "Charlie",
      "age": 35,
      "interests": ["travel", "cooking", "music"]
    },
    {
      "id": 4,
      "name": "David",
      "age": 28,
      "interests": ["sports", "hiking", "gaming"]
    },
    {
      "id": 5,
      "name": "Eve",
      "age": 32,
      "interests": ["reading", "yoga", "painting"]
    }
  ],
  "relationships": [
    {
      "source_id": 1,
      "target_id": 2,
      "type": "friend",
      "duration_months": 60
    },
    {
      "source_id": 1,
      "target_id": 3,
      "type": "colleague",
      "duration_months": 24
    },
    {
      "source_id": 2,
      "target_id": 4,
      "type": "friend",
      "duration_months": 12
  