# HR Structured Outputs with LangChain 1.0

**Module:** Working with Structured Response Formats

**Learning Objectives:**
- Understand four ways to define structured outputs
- Compare Pydantic, Dataclass, TypedDict, and JSON Schema
- Build production-ready HR agents with structured responses
- Apply best practices for data extraction

**Use Case:** Extract structured employee information from unstructured text

**Time:** 2-3 hours

---
## Setup: Install Dependencies

In [None]:
# Install LangChain 1.0 alpha packages
!pip install --pre -U langchain langchain-openai pydantic

## Setup: Configure OpenAI API Key

In [None]:
# For Google Colab
from google.colab import userdata
import os

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print("✅ API Key configured!")

In [None]:
# Alternative: For local Jupyter or other environments
# import os
# os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
# print("✅ API Key configured!")

## Import Required Libraries

In [None]:
from typing import Optional, List
from dataclasses import dataclass
from typing_extensions import TypedDict
from pydantic import BaseModel, Field
from langchain.agents import create_agent
from langchain_core.tools import tool

print("✅ All imports successful!")

---
# Lab 1: Pydantic BaseModel (⭐ Recommended)

**Objective:** Use Pydantic BaseModel for structured output

**Benefits:**
- Automatic validation
- Rich field descriptions
- IDE autocomplete support
- Easy serialization
- Best integration with LangChain
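
To make "automatic validation" concrete, here is a minimal offline sketch that calls no LLM. It assumes Pydantic v2; `EmployeeSketch` is a hypothetical two-field model invented for this illustration, not the model defined in Step 1. It shows type coercion on valid input and the errors collected on invalid input.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class EmployeeSketch(BaseModel):
    """Hypothetical two-field model used only for this illustration."""

    employee_id: str = Field(description="Unique employee identifier")
    salary: Optional[float] = Field(default=None, description="Annual salary in INR")


# Valid input: the numeric string is coerced to a float (Pydantic's default lax mode).
ok = EmployeeSketch(employee_id="EMP001", salary="1200000")
print(ok.salary)

# Invalid input: a missing required field and a non-numeric salary are both reported.
try:
    EmployeeSketch(salary="twelve lakh")
    error_fields = []
except ValidationError as exc:
    error_fields = sorted(err["loc"][0] for err in exc.errors())

print(error_fields)
```

Because every problem is collected into a single `ValidationError`, a malformed extraction can be logged or retried in one place.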

## Step 1: Define Pydantic Model

In [None]:
class EmployeeInfo(BaseModel):
    """Structured employee information using Pydantic."""
    
    employee_id: str = Field(
        description="Unique employee identifier (e.g., EMP001)"
    )
    full_name: str = Field(
        description="Full name of the employee"
    )
    email: str = Field(
        description="Work email address"
    )
    phone: str = Field(
        description="Contact phone number"
    )
    department: str = Field(
        description="Department name (e.g., Engineering, HR, Sales)"
    )
    position: str = Field(
        description="Job title/position"
    )
    salary: Optional[float] = Field(
        default=None,
        description="Annual salary in INR (optional)"
    )
    joining_date: Optional[str] = Field(
        default=None,
        description="Date of joining in YYYY-MM-DD format"
    )
    skills: Optional[List[str]] = Field(
        default=None,
        description="List of key skills"
    )

print("✅ EmployeeInfo Pydantic model defined!")
print(f"\nModel fields: {list(EmployeeInfo.model_fields.keys())}")

## Step 2: Create Agent with Pydantic Response Format

In [None]:
# Define a simple tool (optional - for demonstration)
@tool
def get_employee_database(query: str) -> str:
    """Search employee database for information."""
    return "Database contains employee records..."

# Create agent with Pydantic response format
tools = [get_employee_database]

agent_pydantic = create_agent(
    model="openai:gpt-4o-mini",
    tools=tools,
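    # Passing the schema directly lets LangChain choose the strategy
    # (ProviderStrategy when the model supports native structured output).
    response_format=EmployeeInfo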
)

print("✅ Agent created with Pydantic response format!")

## Step 3: Test the Agent

In [None]:
# Sample unstructured employee data
input_text = """
Extract employee info: Priya Sharma, EMP101, works in Engineering 
department as Senior Developer. Email: priya.sharma@company.com, 
Phone: +91-9876543210. Joined on 2020-05-15. Salary: 1200000 INR.
Skills: Python, Django, AWS, Docker.
"""

result = agent_pydantic.invoke({
    "messages": [{"role": "user", "content": input_text}]
})

employee = result["structured_response"]

print("=" * 70)
print("PYDANTIC BASEMODEL RESULT")
print("=" * 70)
print(f"Type: {type(employee)}")
print(f"\nEmployee ID: {employee.employee_id}")
print(f"Name: {employee.full_name}")
print(f"Email: {employee.email}")
print(f"Phone: {employee.phone}")
print(f"Department: {employee.department}")
print(f"Position: {employee.position}")
if employee.salary is not None:
    print(f"Salary: ₹{employee.salary:,.2f}")
if employee.joining_date:
    print(f"Joining Date: {employee.joining_date}")
if employee.skills:
    print(f"Skills: {', '.join(employee.skills)}")

print("\n✅ Pydantic provides validation, serialization, and IDE support!")

## Bonus: Serialize to Dictionary/JSON

In [None]:
import json

# Convert to dictionary
employee_dict = employee.model_dump()
print("As Dictionary:")
print(employee_dict)

# Convert to JSON
employee_json = employee.model_dump_json(indent=2)
print("\nAs JSON:")
print(employee_json)

---
# Lab 2: Python Dataclass

**Objective:** Use Python's built-in dataclass for structured output

**Benefits:**
- Built into Python 3.7+
- No external dependencies
- Simple and lightweight
- Good for prototypes
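
The sketch below uses only the standard library to show what a dataclass gives you, and what it does not. `ContactSketch` is a hypothetical class invented for this illustration. The dataclass itself performs no runtime type checking, so any validation has to come from whatever consumes it.

```python
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class ContactSketch:
    """Hypothetical dataclass used only for this illustration."""

    full_name: str
    email: str
    skills: Optional[List[str]] = None


contact = ContactSketch(full_name="Priya Sharma", email="priya.sharma@company.com")

# fields() exposes the declared field names for introspection.
field_names = [f.name for f in fields(ContactSketch)]
print(field_names)

# asdict() converts the instance (recursively) into a plain dictionary.
contact_dict = asdict(contact)
print(contact_dict)

# Type hints are not enforced at runtime: this wrong-typed value is accepted silently.
unchecked = ContactSketch(full_name=123, email="x@company.com")
print(type(unchecked.full_name).__name__)
```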

## Step 1: Define Dataclass

In [None]:
@dataclass
class EmployeeInfoDataclass:
    """Structured employee information using dataclass."""
    
    employee_id: str
    full_name: str
    email: str
    phone: str
    department: str
    position: str
    salary: Optional[float] = None
    joining_date: Optional[str] = None
    skills: Optional[List[str]] = None

print("✅ EmployeeInfoDataclass defined!")

## Step 2: Create Agent with Dataclass Response Format

In [None]:
agent_dataclass = create_agent(
    model="openai:gpt-4o-mini",
    tools=tools,
    response_format=EmployeeInfoDataclass
)

print("✅ Agent created with Dataclass response format!")

## Step 3: Test the Agent

In [None]:
input_text = """
Extract info: Rahul Verma (EMP102) - Engineering Manager
Contact: rahul.verma@company.com, +91-9876543211
Joined: 2018-03-20, Salary: 1800000 INR
Skills: Team Management, System Design, Kubernetes
"""

result = agent_dataclass.invoke({
    "messages": [{"role": "user", "content": input_text}]
})

employee = result["structured_response"]

print("=" * 70)
print("PYTHON DATACLASS RESULT")
print("=" * 70)
print(f"Type: {type(employee)}")
print(f"\nEmployee ID: {employee.employee_id}")
print(f"Name: {employee.full_name}")
print(f"Email: {employee.email}")
print(f"Phone: {employee.phone}")
print(f"Department: {employee.department}")
print(f"Position: {employee.position}")
if employee.salary is not None:
    print(f"Salary: ₹{employee.salary:,.2f}")
if employee.skills:
    print(f"Skills: {', '.join(employee.skills)}")

print("\n✅ Dataclass is simple and built into Python!")

---
# Lab 3: TypedDict

**Objective:** Use TypedDict for dictionary-based structured output

**Benefits:**
- Dictionary-based access
- Type hints for IDEs
- Flexible structure
- Works well with dict workflows
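
As a rough illustration of "dictionary-based" typing, the sketch below shows that a TypedDict value is an ordinary `dict` at runtime. It also shows how `total=False` changes which keys a type checker treats as mandatory. Both classes are hypothetical and exist only for this example. It imports `TypedDict` from `typing` (Python 3.9+ assumed), whereas the notebook imports it from `typing_extensions`.

```python
from typing import List, Optional, TypedDict


class RequiredContact(TypedDict):
    """Hypothetical TypedDict: every key must be present, though values may be None."""

    full_name: str
    skills: Optional[List[str]]


class PartialContact(TypedDict, total=False):
    """Hypothetical TypedDict with total=False: every key may be omitted."""

    full_name: str
    skills: Optional[List[str]]


record: RequiredContact = {"full_name": "Anjali Patel", "skills": None}

# At runtime the value is an ordinary dict; the annotations only guide type checkers.
print(type(record) is dict)

# These introspection attributes record which keys a type checker treats as mandatory.
print(sorted(RequiredContact.__required_keys__))
print(sorted(PartialContact.__optional_keys__))
```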

## Step 1: Define TypedDict

In [None]:
class EmployeeInfoTypedDict(TypedDict):
    """Structured employee information using TypedDict."""
    
    employee_id: str
    full_name: str
    email: str
    phone: str
    department: str
    position: str
    # Optional[...] allows None as a value; the key itself is still required.
    salary: Optional[float]
    joining_date: Optional[str]
    skills: Optional[List[str]]

print("✅ EmployeeInfoTypedDict defined!")

## Step 2: Create Agent with TypedDict Response Format

In [None]:
agent_typeddict = create_agent(
    model="openai:gpt-4o-mini",
    tools=tools,
    response_format=EmployeeInfoTypedDict
)

print("✅ Agent created with TypedDict response format!")

## Step 3: Test the Agent

In [None]:
input_text = """
Employee details: Anjali Patel, ID: EMP103
HR Director, anjali.patel@company.com
Phone: +91-9876543212, Joined: 2015-01-10
Annual compensation: 2500000 INR
Key skills: Recruitment, Policy Development, Employee Relations
"""

result = agent_typeddict.invoke({
    "messages": [{"role": "user", "content": input_text}]
})

employee = result["structured_response"]

print("=" * 70)
print("TYPEDDICT RESULT")
print("=" * 70)
print(f"Type: {type(employee)}")
print(f"\nEmployee ID: {employee['employee_id']}")
print(f"Name: {employee['full_name']}")
print(f"Email: {employee['email']}")
print(f"Phone: {employee['phone']}")
print(f"Department: {employee['department']}")
print(f"Position: {employee['position']}")
if employee.get('salary') is not None:
    print(f"Salary: ₹{employee['salary']:,.2f}")
if employee.get('skills'):
    print(f"Skills: {', '.join(employee['skills'])}")

print("\n✅ TypedDict returns a dictionary with type hints!")

---
# Lab 4: JSON Schema

**Objective:** Use JSON Schema for structured output

**Benefits:**
- Language-agnostic
- Fine-grained validation
- Enum constraints
- Cross-platform compatibility
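
To show what the `required` and `enum` keywords express, here is a deliberately small, hand-rolled check written with the standard library. It is not a JSON Schema validator: `check_required_and_enum` and `SKETCH_SCHEMA` are hypothetical names, and only those two keywords are covered. For real validation, a full implementation such as the `jsonschema` package is the more appropriate tool.

```python
from typing import Any, Dict, List

# Hypothetical, trimmed-down schema used only for this illustration.
SKETCH_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "employee_id": {"type": "string"},
        "department": {"type": "string", "enum": ["Engineering", "HR", "Sales"]},
    },
    "required": ["employee_id", "department"],
}


def check_required_and_enum(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; covers only the `required` and `enum` keywords."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required property: {key}")
    for key, rules in schema.get("properties", {}).items():
        if key in record and "enum" in rules and record[key] not in rules["enum"]:
            problems.append(f"{key} must be one of {rules['enum']}")
    return problems


valid_problems = check_required_and_enum(
    {"employee_id": "EMP104", "department": "Sales"}, SKETCH_SCHEMA
)
invalid_problems = check_required_and_enum({"department": "Legal"}, SKETCH_SCHEMA)
print(valid_problems)
print(invalid_problems)
```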

## Step 1: Define JSON Schema

In [None]:
EMPLOYEE_INFO_JSON_SCHEMA = {
    "type": "object",
    "title": "EmployeeInfo",
    "description": "Structured employee information using JSON Schema",
    "properties": {
        "employee_id": {
            "type": "string",
            "description": "Unique employee identifier (e.g., EMP001)"
        },
        "full_name": {
            "type": "string",
            "description": "Full name of the employee"
        },
        "email": {
            "type": "string",
            "description": "Work email address",
            "format": "email"
        },
        "phone": {
            "type": "string",
            "description": "Contact phone number"
        },
        "department": {
            "type": "string",
            "description": "Department name",
            "enum": ["Engineering", "HR", "Sales", "Marketing", "Finance", "Operations"]
        },
        "position": {
            "type": "string",
            "description": "Job title/position"
        },
        "salary": {
            "type": ["number", "null"],
            "description": "Annual salary in INR"
        },
        "joining_date": {
            "type": ["string", "null"],
            "description": "Date of joining in YYYY-MM-DD format",
            "format": "date"
        },
        "skills": {
            "type": ["array", "null"],
            "description": "List of key skills",
            "items": {
                "type": "string"
            }
        }
    },
    "required": ["employee_id", "full_name", "email", "phone", "department", "position"],
    "additionalProperties": False
}

import json

print("✅ JSON Schema defined!")
print("\nSchema preview:")
print(json.dumps(EMPLOYEE_INFO_JSON_SCHEMA, indent=2)[:500] + "...")

## Step 2: Create Agent with JSON Schema Response Format

In [None]:
agent_json_schema = create_agent(
    model="openai:gpt-4o-mini",
    tools=tools,
    response_format=EMPLOYEE_INFO_JSON_SCHEMA
)

print("✅ Agent created with JSON Schema response format!")

## Step 3: Test the Agent

In [None]:
input_text = """
Parse employee info: Arjun Reddy (EMP104), Sales Team Lead
Email: arjun.reddy@company.com, Mobile: +91-9876543213
Department: Sales, Joining: 2019-07-01
CTC: 1500000 per annum
Expertise: B2B Sales, CRM Management, Negotiation
"""

result = agent_json_schema.invoke({
    "messages": [{"role": "user", "content": input_text}]
})

employee = result["structured_response"]

print("=" * 70)
print("JSON SCHEMA RESULT")
print("=" * 70)
print(f"Type: {type(employee)}")
print(f"\nEmployee ID: {employee['employee_id']}")
print(f"Name: {employee['full_name']}")
print(f"Email: {employee['email']}")
print(f"Phone: {employee['phone']}")
print(f"Department: {employee['department']}")
print(f"Position: {employee['position']}")
if employee.get('salary') is not None:
    print(f"Salary: ₹{employee['salary']:,.2f}")
if employee.get('skills'):
    print(f"Skills: {', '.join(employee['skills'])}")

print("\n✅ JSON Schema provides fine-grained validation and is language-agnostic!")

---
# Lab 5: Advanced Example - Batch Processing with Nested Models

**Objective:** Process multiple employees with nested Pydantic models

**Use Case:** Batch employee onboarding

In [None]:
class EmployeeBatch(BaseModel):
    """Process multiple employees at once."""
    
    employees: List[EmployeeInfo] = Field(
        description="List of employee records to process"
    )
    processed_count: int = Field(
        description="Total number of employees processed"
    )
    department_summary: dict[str, int] = Field(
        description="Count of employees by department"
    )

print("✅ EmployeeBatch model with nested EmployeeInfo defined!")

In [None]:
agent_batch = create_agent(
    model="openai:gpt-4o-mini",
    tools=tools,
    response_format=EmployeeBatch
)

input_text = """
Process these employees:

1. Sneha Gupta (EMP105), Marketing Specialist
   Email: sneha.gupta@company.com, +91-9876543214
   Joined: 2021-09-15, Salary: 900000
   Skills: Digital Marketing, SEO, Content Strategy

2. Vikram Singh (EMP106), Finance Analyst
   Email: vikram.singh@company.com, +91-9876543215
   Joined: 2020-11-01, Salary: 1100000
   Skills: Financial Modeling, Excel, Power BI

3. Meera Krishnan (EMP107), Operations Manager
   Email: meera.krishnan@company.com, +91-9876543216
   Joined: 2017-06-20, Salary: 1600000
   Skills: Process Optimization, Supply Chain, Six Sigma

Generate a summary by department.
"""

result = agent_batch.invoke({
    "messages": [{"role": "user", "content": input_text}]
})

batch_data = result["structured_response"]

print("=" * 70)
print("BATCH PROCESSING WITH NESTED MODELS")
print("=" * 70)
print(f"Total Employees Processed: {batch_data.processed_count}")
print("\nDepartment Summary:")
for dept, count in batch_data.department_summary.items():
    print(f"  {dept}: {count} employee(s)")

print("\nEmployee Details:")
print("-" * 70)

for emp in batch_data.employees:
    print(f"\n{emp.full_name} ({emp.employee_id})")
    print(f"  Department: {emp.department}")
    print(f"  Position: {emp.position}")
    print(f"  Email: {emp.email}")
    if emp.salary is not None:
        print(f"  Salary: ₹{emp.salary:,.2f}")
    if emp.skills:
        print(f"  Skills: {', '.join(emp.skills)}")

print("\n✅ Nested Pydantic models enable complex structured outputs!")
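
Aggregate fields such as a processed count or a per-department tally are produced by the model, so they may not agree with the extracted records. One possible safeguard, sketched below with the standard library, is to recompute them from the records and compare. `EmployeeStub`, `summarize_departments`, and the hard-coded `reported` dictionary are hypothetical stand-ins for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EmployeeStub:
    """Hypothetical stand-in for an extracted employee record."""

    employee_id: str
    department: str


def summarize_departments(employees: List[EmployeeStub]) -> Dict[str, int]:
    """Count employees per department directly from the extracted records."""
    return dict(Counter(emp.department for emp in employees))


extracted = [
    EmployeeStub("EMP105", "Marketing"),
    EmployeeStub("EMP106", "Finance"),
    EmployeeStub("EMP107", "Operations"),
]

recomputed = summarize_departments(extracted)
print(len(extracted), recomputed)

# Compare against a model-reported summary (hard-coded here for illustration).
reported = {"Marketing": 1, "Finance": 1, "Operations": 2}
mismatches = {
    dept: (reported.get(dept), count)
    for dept, count in recomputed.items()
    if reported.get(dept) != count
}
print(mismatches)
```

Any entry in `mismatches` pairs the reported value with the recomputed one, which makes discrepancies easy to log.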

---
# Comparison Summary

Let's compare all four approaches side by side.

In [None]:
import pandas as pd

comparison_data = {
    "Format": ["Pydantic BaseModel", "Python Dataclass", "TypedDict", "JSON Schema"],
    "Validation": ["✅ Rich", "⚠️ Basic", "❌ None", "✅ Rich"],
    "Complexity": ["Medium", "Low", "Low", "High"],
    "Dependencies": ["External", "Built-in", "Built-in", "None"],
    "IDE Support": ["✅ Excellent", "✅ Good", "✅ Good", "❌ Limited"],
    "Documentation": ["✅ Field-level", "❌ Class-only", "❌ Class-only", "✅ Property-level"],
    "Best For": ["Production", "Simple cases", "Dict workflows", "Cross-platform"]
}

df = pd.DataFrame(comparison_data)
print("\n" + "=" * 80)
print("COMPARISON: STRUCTURED OUTPUT FORMATS")
print("=" * 80)
print(df.to_string(index=False))
print("\n" + "=" * 80)
print("RECOMMENDATION: Use Pydantic BaseModel for most HR use cases!")
print("=" * 80)

---
# Exercises

## Exercise 1: Create a Leave Request Model

Create a Pydantic model for leave requests that includes:
- employee_id
- leave_type (Casual/Sick/Earned)
- start_date
- end_date
- reason
- days_requested

Then create an agent that extracts this information from text.

In [None]:
# Your code here
class LeaveRequest(BaseModel):
    """TODO: Define the leave request model."""
    pass

# TODO: Create agent and test

## Exercise 2: Performance Review Model

Create a model for performance reviews with:
- employee_id
- reviewer_id
- review_period
- technical_rating (1-5)
- communication_rating (1-5)
- achievements (list)
- areas_of_improvement (list)
- promotion_recommended (boolean)

Add validation to ensure ratings are between 1 and 5.

In [None]:
# Your code here
class PerformanceReview(BaseModel):
    """TODO: Define the performance review model."""
    pass

# TODO: Add validation constraints
# Hint: Use Field(ge=1, le=5) for ratings

## Exercise 3: Compare All Four Formats

For the same input text, extract employee information using all four formats and compare:
1. Execution time
2. Response structure
3. Ease of access to fields

Which format would you choose for a production HR system?
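
For the timing comparison, a small helper built on `time.perf_counter` may be a convenient starting point. This is only a sketch: `time_call` and `fake_extract` are hypothetical names, and `fake_extract` is a placeholder to swap for a real agent call. A single measurement of a network-bound call is noisy, so averaging several runs is advisable.

```python
import time
from typing import Any, Callable, Tuple


def time_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_extract(text: str) -> dict:
    """Hypothetical placeholder for an agent call; replace with agent.invoke(...)."""
    return {"length": len(text)}


result, seconds = time_call(fake_extract, "Priya Sharma, EMP101")
print(result, f"{seconds:.6f}s")
```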

In [None]:
# Your code here
import time

test_input = "Your test employee data here"

# TODO: Test all four formats and measure time
# TODO: Compare results

## 🌟 Bonus Challenge: Multi-Department Report

Create a complex nested model that can:
1. Process employees from multiple departments
2. Calculate average salary per department
3. List top skills across all employees
4. Identify departments that are understaffed (< 3 employees)
5. Generate an executive summary

Test with at least 10 employees across 4 departments.

In [None]:
# Your code here - Be creative!
class DepartmentReport(BaseModel):
    """TODO: Design your comprehensive report model."""
    pass

---
# Conclusion

**What you learned:**
1. ✅ Four different ways to define structured outputs in LangChain 1.0
2. ✅ Using Pydantic BaseModel for production-ready extraction
3. ✅ Python Dataclass for simple, lightweight schemas
4. ✅ TypedDict for dictionary-based workflows
5. ✅ JSON Schema for language-agnostic specifications
6. ✅ Nested models for complex data structures
7. ✅ Best practices for HR data extraction

**Key Takeaways:**
- **Pydantic** is the recommended choice for most production use cases
- **Field descriptions** are critical for LLM understanding
- **Validation** catches errors early and ensures data quality
- **Nested models** enable complex hierarchical data structures
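
One way to see why field descriptions matter is to inspect the JSON Schema that Pydantic generates from a model. The sketch below assumes Pydantic v2, and `DescribedSketch` is a hypothetical model invented for this illustration. It does not show LangChain's actual request payload; the assumption is only that a schema derived from the model is what guides the LLM.

```python
from pydantic import BaseModel, Field


class DescribedSketch(BaseModel):
    """Hypothetical model used only for this illustration."""

    joining_date: str = Field(description="Date of joining in YYYY-MM-DD format")
    position: str


schema = DescribedSketch.model_json_schema()

# The description is carried into the generated schema for that property.
joining_description = schema["properties"]["joining_date"]["description"]
print(joining_description)

# A field declared without a description has no such hint in the schema.
position_has_description = "description" in schema["properties"]["position"]
print(position_has_description)
```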

**Next Steps:**
- Integrate with actual HR databases
- Add more complex validation rules
- Build end-to-end HR automation workflows
- Deploy as production API services

---
**Created with:** LangChain 1.0 + OpenAI + Pydantic

**References:**
- [LangChain Documentation](https://python.langchain.com/)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [JSON Schema](https://json-schema.org/)