# **Structured Outputs**
## **Transform raw LLM responses (such as an AIMessage object) into structured, usable formats**

## **What's Covered?**
1. Introduction to Output Parsers
    - What is Output Parser?
    - Why they are crucial?
    - Types of Output Parsers
    - What is Pydantic?
    - Key methods of LangChain output parser
2. CommaSeparatedListOutputParser
    - What it does?
    - Building an AI System to Auto-Extract Skills from Job Descriptions
3. PydanticOutputParser
    - What is Pydantic?
    - What it does?
    - Installation
    - Defining a Pydantic Model
    - Building an AI Powered Song Recommender using Pydantic Parser
    - Step 1: Defining a Pydantic Model
    - Step 2: Create PydanticOutputParser
    - Step 3: Create Prompt with Format Instructions
    - Step 4: Build Chain and Test
    - Step 5: Handle Parsing Errors (Production)
4. Case Study: Building an AI Powered Text2Movie Metadata Generator
    - Step 1: Defining a Pydantic Model
    - Step 2: Create PydanticOutputParser
    - Step 3: Create Prompt with Format Instructions
    - Step 4: Build Chain and Test
    - Step 5: Handle Parsing Errors (Production)
5. JSONOutputParser with Pydantic
    - What it does?
    - Building an Intelligent Parser to Translate User Requests into Order Objects
    - Step 1: Defining a Pydantic Model
    - Step 2: Create JsonOutputParser
    - Step 3: Create Prompt with Format Instructions
    - Step 4: Build Chain and Test
    - Step 5: Handle Parsing Errors (Production)

## **Introduction to Output Parsers**
### **What is Output Parser?**
For many applications, such as chatbots, models need to respond to users directly in natural language. However, there are scenarios where we need models to output in a structured format. For example, we might want to store the model output in a database and ensure that the output conforms to the database schema. This need motivates the concept of structured output, where models can be instructed to respond with a particular output structure.

### **Why they are crucial?**
Often we need the output of an LLM in a particular format, for example a Python datetime object or a JSON object. LangChain comes with parser utilities that let you convert the output into precise data types, or even into an instance of your own custom class with Pydantic.

Output parsers are responsible for taking the output of an LLM and transforming it to a more suitable format. This is very useful when you are using LLMs to generate any form of structured data.
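To make the idea concrete without any LangChain machinery, here is a minimal sketch that turns a date string, such as one an LLM might return, into a Python `datetime` object. The helper name `parse_llm_date` and the `%Y-%m-%d` format are assumptions chosen for illustration.

```python
from datetime import datetime

def parse_llm_date(raw_text: str, fmt: str = "%Y-%m-%d") -> datetime:
    """Convert a date string from a model response into a datetime object."""
    # Models often add stray whitespace or newlines around the answer.
    return datetime.strptime(raw_text.strip(), fmt)

parsed = parse_llm_date(" 2024-03-15\n")
print(type(parsed).__name__, parsed.year, parsed.month, parsed.day)
```

An output parser plays the same role, but it also tells the model which format to produce in the first place.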

### **Types of output parsers**
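Common output parser types include:
- Comma-separated list parser (`CommaSeparatedListOutputParser`)
- Pydantic parser (`PydanticOutputParser`)
- JSON parser with Pydantic (`JsonOutputParser`)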

### **What is Pydantic?**  
Pydantic is a library that lets us define data models, validate data against them, and coerce values to the declared types.  
Coercion in Pydantic refers to its ability to automatically convert input data into the types specified in the model, as long as the conversion is reasonable (for example, the string `"42"` into the integer `42`). 
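A small sketch of coercion in action may help. The `Reading` model and its field names are made up for this example; it assumes Pydantic's default (non-strict) validation mode.

```python
from pydantic import BaseModel, ValidationError

class Reading(BaseModel):
    sensor_id: int
    value: float

# Numeric strings are coerced to the declared types.
reading = Reading(sensor_id="42", value="3.5")
print(reading.sensor_id, reading.value)

# A string that cannot be read as an integer is rejected instead of coerced.
try:
    Reading(sensor_id="forty-two", value=1.0)
    coercion_failed = False
except ValidationError:
    coercion_failed = True
print(coercion_failed)
```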

### **Key methods of LangChain output parser**
A parser exposes two key methods, plus one optional method:
- `get_format_instructions()`: Returns a string containing instructions for how the output of a language model should be formatted.
- `parse()`: Takes in a string (assumed to be the response from a language model) and parses it into some structure.
- (Optional) `parse_with_prompt()`: Takes in a string (assumed to be the response from a language model) and a prompt (assumed to be the prompt that generated that response) and parses it into some structure. The prompt is provided mainly so that the output parser can retry or fix the output when it needs information from the prompt to do so.
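The following is a rough sketch of that interface written as a plain Python class. `YesNoParser` is a hypothetical name, and the class is not a LangChain parser; it only mirrors the shape of `get_format_instructions()` and `parse()` to show how the two methods work together.

```python
class YesNoParser:
    """Hypothetical parser that mirrors the two-method interface."""

    def get_format_instructions(self) -> str:
        # Text that would be added to the prompt sent to the model.
        return "Answer with a single word: YES or NO."

    def parse(self, text: str) -> bool:
        # Normalize the raw model response before interpreting it.
        cleaned = text.strip().strip(".!").upper()
        if cleaned == "YES":
            return True
        if cleaned == "NO":
            return False
        raise ValueError(f"Expected YES or NO, got: {text!r}")

parser = YesNoParser()
print(parser.get_format_instructions())
print(parser.parse(" yes.\n"))
```

The instructions shape what the model writes, and `parse()` turns that text into a typed value or fails loudly when the model did not follow them.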


## **Comma Separated List Parser**

### **What it does?**
This output parser can be used when you want to return a list of comma-separated items.

In [1]:
from langchain_core.output_parsers import CommaSeparatedListOutputParser

csv_output_parser = CommaSeparatedListOutputParser()


In [2]:
# As discussed above, let's experiment with get_format_instructions()

csv_output_parser.get_format_instructions()

'Your response should be a list of comma separated values, eg: `foo, bar, baz` or `foo,bar,baz`'

In [3]:
# prompt -> generate a list of modules one must study to become data scientist

example_input = "Python, DA, SQL, ML, DL"

print(type(example_input))

<class 'str'>


In [4]:
example_input = "Python, DA, SQL, ML, DL"

# using parse() method
parsed_output = csv_output_parser.parse(example_input)

print(type(parsed_output))
print(parsed_output)

<class 'list'>
['Python', 'DA', 'SQL', 'ML', 'DL']


### **Building an AI System to Auto-Extract Skills from Job Descriptions** 

**Use Case:** Extract a list of required skills from a job description so you can auto-fill a checklist or tag candidate profiles.

In [5]:
# Example: extract skills from a job description
from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate(
    template="""Extract the key technical skills from the following job description.
                Return them as a comma-separated list (no extra text).
                
                Job description:
                {job}
                """
)

prompt_template

PromptTemplate(input_variables=['job'], input_types={}, partial_variables={}, template='Extract the key technical skills from the following job description.\n                Return them as a comma-separated list (no extra text).\n\n                Job description:\n                {job}\n                ')

In [6]:
# Import the OpenAI chat model
from langchain_openai import ChatOpenAI

# Read the API key from a local file; strip the trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()

# Initialize the chat model with the OpenAI API key
openai_chat_model = ChatOpenAI(api_key=OPENAI_API_KEY, 
                               model="gpt-4o-mini", 
                               temperature=0)

In [7]:
from langchain_core.output_parsers import CommaSeparatedListOutputParser

output_parser = CommaSeparatedListOutputParser()

In [8]:
# Read the job description; the file is closed automatically
with open("data/job_desc.txt") as jd_file:
    file_data = jd_file.read()

print(file_data)

Job Description: Snowflake Data Engineer (2+ Years Experience)

Role: Snowflake Data Engineer
Experience: 2–4 years
Location: Hyderabad
Employment Type: Full-time

About the Role
We are looking for a Snowflake Data Engineer with hands-on experience in building data pipelines, developing data models, and working across modern cloud data platforms. The ideal candidate will have strong SQL skills, good understanding of data warehousing concepts, and practical experience implementing solutions on Snowflake.

Key Responsibilities
Develop, optimize, and maintain ETL/ELT pipelines using Snowflake and related tools.
Design and implement Snowflake schemas, views, materialized views, and stored procedures.
Manage Snowflake workloads, including Virtual Warehouses, Roles, and Security policies.
Work with semi-structured data (JSON, Parquet, Avro) using Snowflake-native functions.
Build and manage data ingestion pipelines using tools such as Airflow, DBT, AWS Glue, or Informatica (whatever applies 

In [9]:
chain = prompt_template | openai_chat_model | output_parser

result = chain.invoke({"job": file_data})

print(result)

['Snowflake', 'SQL', 'data modeling', 'ETL', 'ELT', 'data ingestion', 'Airflow', 'DBT', 'AWS', 'Azure', 'GCP', 'Python', 'Scala', 'data warehousing', 'semi-structured data', 'CI/CD', 'Git', 'Snowpipe', 'Streams', 'Tasks', 'BI tools', 'Tableau', 'Power BI', 'QuickSight', 'ML pipelines.']


In [10]:
print(type(result))

<class 'list'>
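The parsed list can still carry small artifacts from the model, such as the trailing period in `'ML pipelines.'` above. Below is a minimal post-processing sketch; `normalize_skills` is a hypothetical helper, not part of LangChain, and the sample list is shortened for illustration.

```python
def normalize_skills(skills: list[str]) -> list[str]:
    """Trim whitespace and trailing punctuation, then drop case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for skill in skills:
        item = skill.strip().rstrip(".;,")
        key = item.lower()
        # Keep the first spelling seen for each skill and skip empty entries.
        if item and key not in seen:
            seen.add(key)
            cleaned.append(item)
    return cleaned

raw = ["Snowflake", "SQL", " sql ", "ML pipelines."]
print(normalize_skills(raw))  # ['Snowflake', 'SQL', 'ML pipelines']
```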


## **PydanticOutputParser**

### **What is Pydantic?**
Pydantic is a **data validation and parsing library** for Python. It is widely used with FastAPI, but it is useful in any application that requires structured data handling.

Use Pydantic to declare your data model. Pydantic's BaseModel is like a Python dataclass, but with runtime type checking and coercion. (Think of it as a smart data checker and converter.)

### **What it does?**
This output parser allows users to specify an arbitrary Pydantic Model and query LLMs for outputs that conform to that schema.

Pydantic becomes your **guardrail** to catch bad LLM output.

You should have some Pydantic knowledge to use it.

### **Installation**
`pip install pydantic`

### **Building an AI Powered Song Recommender using Pydantic Parser**

We will build the Pydantic parser in the following steps:
- Step 1: Defining a Pydantic Model
- Step 2: Create PydanticOutputParser
- Step 3: Create Prompt with Format Instructions
- Step 4: Build Chain and Test
- Step 5: Handle Parsing Errors (Production)

### **Step 1: Defining a Pydantic Model**

In [11]:
from pydantic import BaseModel, Field

class Song(BaseModel):
    name: str = Field(description="Name of a Song")
    geners: list = Field(description="List of Geners")

### **Step 2: Create PydanticOutputParser**

In [12]:
from langchain_core.output_parsers import PydanticOutputParser

output_parser = PydanticOutputParser(pydantic_object=Song)

print(output_parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"name": {"description": "Name of a Song", "title": "Name", "type": "string"}, "geners": {"description": "List of Geners", "items": {}, "title": "Geners", "type": "array"}}, "required": ["name", "geners"]}
```


### **Step 3: Create Prompt with Format Instructions**

In [13]:
from langchain_core.prompts import ChatPromptTemplate

# Template
chat_template = ChatPromptTemplate(
    messages=[
        ("system", """You are a helpful AI Song Recommendation Engine.
                      You generate output while following the below mentioned format.
                      Output Format Instructions:
                      {output_format_instructions}"""), 
        ("human", "What is the most famous song by {singer_name}?")
    ],
    partial_variables={"output_format_instructions": output_parser.get_format_instructions()}
)

chat_template

ChatPromptTemplate(input_variables=['singer_name'], input_types={}, partial_variables={'output_format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"name": {"description": "Name of a Song", "title": "Name", "type": "string"}, "geners": {"description": "List of Geners", "items": {}, "title": "Geners", "type": "array"}}, "required": ["name", "geners"]}\n```'}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['output_format_instructions'], input_types={}, partial_variables={}, template='You are a helpful AI Song Recommendatio

### **Step 4: Build Chain and Test**

In [14]:
# Import the OpenAI chat model
from langchain_openai import ChatOpenAI

# Read the API key from a local file; strip the trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()

# Initialize the chat model with the OpenAI API key
openai_chat_model = ChatOpenAI(api_key=OPENAI_API_KEY, 
                               model="gpt-4o-mini", 
                               temperature=0)

In [15]:
chain = chat_template | openai_chat_model | output_parser

raw_input = {"singer_name": "arijit singh"}

chain.invoke(raw_input)

Song(name='Tum Hi Ho', geners=['Bollywood', 'Romantic', 'Pop'])

### **Step 5: Handle Parsing Errors (Production)**

In [16]:
from typing import Optional

from pydantic import ValidationError
from langchain_core.exceptions import OutputParserException

def song_recommendations(text: str) -> Optional[Song]:
    """Return the most famous song for a singer, or None if the chain fails."""
    try:
        return chain.invoke({"singer_name": text})
    except OutputParserException as e:
        # Raised when the model output cannot be parsed into a Song
        print(f"Output Parser failed:\n {e}")
    except ValidationError as e:
        print(f"Validation failed:\n {e.json()}")
    except Exception as e:
        print(f"Unexpected error: {type(e).__name__}: {e}")
    return None

In [17]:
singer = "imagine dragons"

song_obj = song_recommendations(singer)

print(song_obj)

name='Radioactive' geners=['Rock', 'Alternative', 'Indie']
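Because LLM output can vary between calls, a failed parse is sometimes worth retrying before giving up. The sketch below shows one possible retry wrapper using only the standard library. `call_with_retry` and `fake_invoke` are hypothetical names; `fake_invoke` stands in for `chain.invoke(...)` so the example runs without an API key. It catches every exception for brevity, which you may want to narrow in real code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def call_with_retry(call: Callable[[], T], attempts: int = 3,
                    delay_seconds: float = 0.0) -> Optional[T]:
    """Run `call` up to `attempts` times; return None if every attempt raises."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            print(f"Attempt {attempt} failed: {type(exc).__name__}: {exc}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    return None

# Stand-in for chain.invoke(...): fails once, then succeeds.
responses = iter([ValueError("malformed JSON"), {"name": "Radioactive"}])

def fake_invoke():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

retried_result = call_with_retry(fake_invoke)
print(retried_result)
```

With a real chain, the call would look like `call_with_retry(lambda: chain.invoke({"singer_name": singer}))`.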


## **Case Study: Building an AI Powered Text2Movie Metadata Generator**

**Use Case:** An AI-powered extraction pipeline that parses only what’s explicitly mentioned in the text, delivering reliable structured movie information.

### **Step 1: Defining a Pydantic Model**

In [18]:
from pydantic import BaseModel, Field
from typing import List

class Actor(BaseModel):
    name: str = Field(..., description="Actor's full name")
    role: str = Field(..., description="Character name")

class Movie(BaseModel):
    """Movie information extracted from user query."""
    title: str = Field(..., description="Movie title")
    year: int = Field(..., description="Release year", ge=1880, le=2025)
    director: str = Field(..., description="Director name")
    rating: float = Field(..., description="IMDB rating", ge=0, le=10)
    cast: List[Actor] = Field(default_factory=list, description="Main cast members")
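The `ge` and `le` arguments in the model above act as range checks at validation time. Here is a short sketch of that behaviour on a trimmed-down model; `Film` is a made-up name used so the example stays self-contained.

```python
from pydantic import BaseModel, Field, ValidationError

class Film(BaseModel):
    title: str
    year: int = Field(..., ge=1880, le=2025)
    rating: float = Field(..., ge=0, le=10)

# Values inside the allowed ranges are accepted.
ok = Film(title="Inception", year=2010, rating=8.8)
print(ok.year, ok.rating)

# Out-of-range values are rejected, and every failing field is reported.
try:
    Film(title="Inception", year=1700, rating=11.0)
    failing_fields = []
except ValidationError as exc:
    failing_fields = sorted(str(err["loc"][0]) for err in exc.errors())
print(failing_fields)  # ['rating', 'year']
```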

### **Step 2: Create PydanticOutputParser**

In [19]:
from langchain_core.output_parsers import PydanticOutputParser

output_parser = PydanticOutputParser(pydantic_object=Movie)

print(output_parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"Actor": {"properties": {"name": {"description": "Actor's full name", "title": "Name", "type": "string"}, "role": {"description": "Character name", "title": "Role", "type": "string"}}, "required": ["name", "role"], "title": "Actor", "type": "object"}}, "description": "Movie information extracted from user query.", "properties": {"title": {"description": "Movie title", "title": "Title", "type": "string"}, "year": {"description": "Release year", "maximum": 2025, "minimum": 1880, "title": "Year", "type": "integer"}, "director": {"descr

### **Step 3: Create Prompt with Format Instructions**

In [20]:
from langchain_core.prompts import ChatPromptTemplate

chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", """You extract structured movie information from text.
                      You are a strict parser. Extract only what is explicitly present in the text. Never invent facts.
                      {format_instructions}"""),
        ("user", "{input}")
    ]
)

# Partial to inject format instructions
chat_prompt_partial = chat_prompt.partial(
    format_instructions=output_parser.get_format_instructions()
)

### **Step 4: Build Chain and Test**

In [21]:
# Import the OpenAI chat model
from langchain_openai import ChatOpenAI

# Read the API key from a local file; strip the trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()

# Initialize the chat model with the OpenAI API key
openai_chat_model = ChatOpenAI(api_key=OPENAI_API_KEY, 
                               model="gpt-4o-mini", 
                               temperature=0)

In [22]:
chain = chat_prompt_partial | openai_chat_model | output_parser

# Test with movie description
result = chain.invoke({
    "input": """Inception is a 2010 sci-fi film directed by Christopher Nolan. 
    It stars Leonardo DiCaprio and has an IMDB rating of 8.8. Main cast includes 
    Joseph Gordon-Levitt as Arthur and Ellen Page as Ariadne."""
})

print(result)

title='Inception' year=2010 director='Christopher Nolan' rating=8.8 cast=[Actor(name='Leonardo DiCaprio', role='Lead'), Actor(name='Joseph Gordon-Levitt', role='Arthur'), Actor(name='Ellen Page', role='Ariadne')]


In [23]:
print(result.title)
print(result.rating)

Inception
8.8


### **Step 5: Handle Parsing Errors (Production)**

**Important Note: If the Pydantic constructor succeeds, the object is guaranteed to have all required fields.**

In [24]:
from typing import Optional

from pydantic import ValidationError
from langchain_core.exceptions import OutputParserException

def extract_movie_info(text: str) -> Optional[Movie]:
    """Extract movie information from text, or return None if the chain fails."""
    try:
        return chain.invoke({"input": text})
    except OutputParserException as e:
        # Raised when the model output cannot be parsed into a Movie
        print(f"Output Parser failed:\n {e}")
    except ValidationError as e:
        print(f"Validation failed:\n {e.json()}")
    except Exception as e:
        print(f"Unexpected error: {type(e).__name__}: {e}")
    return None

In [25]:
movie_info = """Inception is a 2010 sci-fi film directed by Christopher Nolan. 
It stars Leonardo DiCaprio and has an IMDB rating of 8.8. Main cast includes 
Joseph Gordon-Levitt as Arthur and Ellen Page as Ariadne."""

movie_obj = extract_movie_info(movie_info)

print(movie_obj)

title='Inception' year=2010 director='Christopher Nolan' rating=8.8 cast=[Actor(name='Leonardo DiCaprio', role='N/A'), Actor(name='Joseph Gordon-Levitt', role='Arthur'), Actor(name='Ellen Page', role='Ariadne')]


In [26]:
movie_info = """Inception is a great movie"""

movie_obj = extract_movie_info(movie_info)

print(movie_obj)

Output Parser failed:
 Failed to parse Movie from completion {"title": "Inception", "year": null, "director": null, "rating": null, "cast": []}. Got: 3 validation errors for Movie
year
  Input should be a valid integer [type=int_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.12/v/int_type
director
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.12/v/string_type
rating
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.12/v/float_type
For troubleshooting, visit: https://docs.langchain.com/oss/python/langchain/errors/OUTPUT_PARSING_FAILURE 
None
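The failure above happens because `year`, `director`, and `rating` are required, while the strict prompt tells the model not to invent missing facts. One possible alternative, sketched below, is to declare those fields as optional so that `null` is accepted. `LenientMovie` is a hypothetical variant of the model, the cast is simplified to a list of strings, and `model_validate_json` assumes Pydantic v2.

```python
from typing import List, Optional

from pydantic import BaseModel, Field

class LenientMovie(BaseModel):
    title: str = Field(..., description="Movie title")
    # Optional fields default to None when the text does not mention them.
    year: Optional[int] = Field(default=None, ge=1880, le=2025)
    director: Optional[str] = None
    rating: Optional[float] = Field(default=None, ge=0, le=10)
    cast: List[str] = Field(default_factory=list)

# The same completion that failed validation against the stricter model.
completion = '{"title": "Inception", "year": null, "director": null, "rating": null, "cast": []}'

movie = LenientMovie.model_validate_json(completion)
print(movie.title, movie.year, movie.director, movie.rating, movie.cast)
```

The trade-off is that downstream code must handle `None` values, so this choice fits cases where partial records are still useful.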


## **JSONOutputParser with Pydantic**

### **What it does?**
If you want the parser to return a plain Python dictionary (parsed from the model's JSON) instead of a Pydantic object, you can pass a Pydantic schema to the **JsonOutputParser**. The schema is used to build the format instructions for structured JSON output.

Let's learn how to do it using the following case study.

### **Building an Intelligent Parser to Translate User Requests into Order Objects**
**Use case:** Parse loosely structured user requests (here, a speech-to-text transcript of an order) into a structured order record that matches the Order schema and can be used by downstream business logic.

Steps involved:
- Step 1: Defining a Pydantic Model
- Step 2: Create JsonOutputParser
- Step 3: Create Prompt with Format Instructions
- Step 4: Build Chain and Test
- Step 5: Handle Parsing Errors (Production)

### **Step 1: Defining a Pydantic Model**

In [27]:
from pydantic import BaseModel, Field

# Define the pydantic model for expected output
class OrderItem(BaseModel):
    sku: str = Field(description="Unique product identifier (Stock Keeping Unit) for the item.")
    quantity: int = Field(description="Number of units ordered for this item. Must be greater than zero.", gt=0)
    price: float = Field(description="Price of a single unit of the item.")

class Order(BaseModel):
    order_id: str = Field(description="Unique identifier for the order.")
    customer_email: str = Field(description="Email address of the customer who placed the order.")
    items: list[OrderItem] = Field(description="List of all items included in the order.")
    total: float = Field(description="Total order amount after summing all item costs.")

### **Step 2: Create JsonOutputParser**

In [28]:
from langchain_core.output_parsers import JsonOutputParser

# Create parser from pydantic model
output_parser = JsonOutputParser(pydantic_object=Order)

print(output_parser.get_format_instructions())

STRICT OUTPUT FORMAT:
- Return only the JSON value that conforms to the schema. Do not include any additional text, explanations, headings, or separators.
- Do not wrap the JSON in Markdown or code fences (no ``` or ```json).
- Do not prepend or append any text (e.g., do not write "Here is the JSON:").
- The response must be a single top-level JSON value exactly as required by the schema (object/array/etc.), with no trailing commas or comments.

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema (shown in a code block for readability only — do not include any backticks or Markdown in your output):


### **Step 3: Create Prompt with Format Instructions**

In [29]:
from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate(
    template="""User provided this order information in text. 
    Extract into a JSON format that matches the output format instructions provided below.
    Return only a JSON format.
    Text:
    {text}
    
    Output Format Instructions:
    {output_format_instructions}
    """,
    input_variables=["text"],
    partial_variables={"output_format_instructions": output_parser.get_format_instructions()}
)

### **Step 4: Build Chain and Test**

In [30]:
# Import the OpenAI chat model
from langchain_openai import ChatOpenAI

# Read the API key from a local file; strip the trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()

# Initialize the chat model with the OpenAI API key
openai_chat_model = ChatOpenAI(api_key=OPENAI_API_KEY, 
                               model="gpt-4o-mini", 
                               temperature=0)

In [31]:
# Read the transcript; the file is closed automatically
with open("data/prod_desc_asr_output.txt") as pd_file:
    file_data = pd_file.read()

print(file_data)

uh yeah, for order number ORD 1001, the email is jane at example dot com.
Need two of item ABC one twenty three, price is nineteen ninety nine each,
and one of the X Y Z nine nine nine for hundred twenty nine point five.
total should be one sixty nine forty eight.



In [32]:
chain = prompt_template | openai_chat_model | output_parser

result = chain.invoke({"text": file_data})

print(result)

{'order_id': 'ORD 1001', 'customer_email': 'jane@example.com', 'items': [{'sku': 'ABC123', 'quantity': 2, 'price': 19.99}, {'sku': 'XYZ999', 'quantity': 1, 'price': 129.5}], 'total': 169.48}


In [33]:
type(result)

dict
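Since the result is a plain `dict`, it has not been checked against the field constraints in the Pydantic model. If typed objects are needed downstream, one option is to validate the dictionary yourself. The sketch below repeats the models without descriptions so it is self-contained, copies the dictionary from the output above, and assumes Pydantic v2 (`model_validate`). It also cross-checks the stated total against the sum of the line items: 2 × 19.99 + 1 × 129.5 = 169.48.

```python
import math

from pydantic import BaseModel, Field

class OrderItem(BaseModel):
    sku: str
    quantity: int = Field(gt=0)
    price: float

class Order(BaseModel):
    order_id: str
    customer_email: str
    items: list[OrderItem]
    total: float

# Dictionary copied from the chain output shown above.
parsed_order = {
    "order_id": "ORD 1001",
    "customer_email": "jane@example.com",
    "items": [
        {"sku": "ABC123", "quantity": 2, "price": 19.99},
        {"sku": "XYZ999", "quantity": 1, "price": 129.5},
    ],
    "total": 169.48,
}

# Validate the dict and convert it into typed objects.
order = Order.model_validate(parsed_order)

# Recompute the total from the line items and compare with a small tolerance.
computed_total = sum(item.quantity * item.price for item in order.items)
totals_match = math.isclose(computed_total, order.total, abs_tol=0.01)
print(type(order).__name__, round(computed_total, 2), totals_match)
```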