# Lesson 2.3: Output Parsers

---

In previous lessons, we learned how to create prompts and send them to LLMs to get responses. However, LLMs often return free-form text, without a specific structure. In many applications, we need structured data (e.g., lists, JSON, Python objects) for further processing. This is where **Output Parsers** come into play.

## 1. Why are Output Parsers Needed?

### 1.1. The Problem with LLM String Output

Large Language Models (LLMs) are trained to generate natural language text. When you ask an LLM a question, it will respond with a text string. For example:

* **Prompt:** "List 3 types of fruits."
* **LLM Response:** "Apple, Banana, Orange." (This is a text string.)

If you want to use this list of fruits in your Python code (e.g., to iterate over each fruit), you would have to manually parse this string yourself. This can be complex and error-prone, especially when the LLM's output format might vary slightly.
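
To make the fragility concrete, here is a small sketch that parses the example response by hand. The second input string is a hypothetical variation, invented here to show how a slightly different phrasing breaks the same parsing logic.

```python
# Manual parsing of the example response from above.
raw = "Apple, Banana, Orange."

# A plain split leaves stray spaces and the trailing period behind.
naive = raw.split(",")
print(naive)  # ['Apple', ' Banana', ' Orange.']

# Cleaning each item works for this exact format...
cleaned = [item.strip().rstrip(".") for item in raw.split(",")]
print(cleaned)  # ['Apple', 'Banana', 'Orange']

# ...but a hypothetical, slightly chattier reply breaks the same logic.
varied = "Sure! Here are three fruits: Apple, Banana, and Orange."
broken = [item.strip().rstrip(".") for item in varied.split(",")]
print(broken)  # ['Sure! Here are three fruits: Apple', 'Banana', 'and Orange']
```

Each new variation in the reply would need another special case, which is the maintenance burden output parsers are meant to reduce.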

### 1.2. The Role of Output Parsers

**Output Parsers** in LangChain are tools that help automatically convert the LLM's string output into structured data formats that are easier to use in your code. They act as a bridge between the LLM's free-form text response and the application's need for structured data.

* **Benefits:**
    * **Automation:** Automatically parses the output, reducing manual coding effort.
    * **Reliability:** Checks that the output follows the expected format, so downstream code receives predictable data.
    * **Ease of Use:** Converts data into native Python types (list, dict, Pydantic objects) for easy manipulation.
    * **Error Handling:** Some parsers can attempt to fix or flag cases where the LLM doesn't perfectly adhere to the format.




---

## 2. Common Output Parsers

LangChain provides various types of Output Parsers to suit diverse output formatting needs.

### 2.1. `StrOutputParser`: Extracting Strings

* **Concept:** This is the simplest parser. It extracts the text content from the model's response and returns it as a plain Python string. Without a parser, a chat model such as `ChatOpenAI` returns a message object (`AIMessage`) rather than a string, so `StrOutputParser` is typically added as the last step of a chain.
* **When to Use:** When you only need plain text output from the LLM and no special structure is required.

```python
# Install the library if not already installed
# pip install langchain-openai openai

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Set environment variable for OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# Define prompt
prompt = ChatPromptTemplate.from_template("Say 'Hello LangChain!'")

# Build chain with StrOutputParser, which converts the model's AIMessage into a plain string.
# Without it, the chain would return the message object instead.
chain = prompt | llm | StrOutputParser()

# Execute chain
response = chain.invoke({})
print(f"Response from StrOutputParser: {response}")
print(f"Data type: {type(response)}")
```
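
To see what the final parser step contributes, here is a toy model of the `prompt | llm | parser` pipeline that uses only the standard library and needs no API key. `Step`, `FakeMessage`, and `fake_llm` are hypothetical stand-ins invented for illustration. This is not how LangChain implements its runnables, only a sketch of the idea that each step feeds the next and that the last step turns a message object into a string.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FakeMessage:
    """Hypothetical stand-in for the message object a chat model returns."""
    content: str


class Step:
    """Toy pipeline step: `a | b` runs a, then feeds its result to b."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the three stages of the chain.
prompt = Step(lambda _inputs: "Say 'Hello LangChain!'")
fake_llm = Step(lambda _text: FakeMessage(content="Hello LangChain!"))
str_parser = Step(lambda message: message.content)

without_parser = (prompt | fake_llm).invoke({})
with_parser = (prompt | fake_llm | str_parser).invoke({})

print(type(without_parser).__name__)  # FakeMessage
print(type(with_parser).__name__)     # str
```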

### 2.2. `CommaSeparatedListOutputParser`: Converting to a List

* **Concept:** This parser is designed to convert a text string where items are separated by commas into a Python list of strings.
* **When to Use:** When you ask the LLM to list items, and you want to receive them as an iterable list.

```python
# Install the library if not already installed
# pip install langchain-openai openai

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import CommaSeparatedListOutputParser

# Set environment variable for OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# Define prompt requesting a comma-separated list of items
prompt = ChatPromptTemplate.from_template(
    "List 5 common mammals, separated by commas."
)

# Build chain with CommaSeparatedListOutputParser
chain = prompt | llm | CommaSeparatedListOutputParser()

# Execute chain
animals_list = chain.invoke({})
print(f"Response from CommaSeparatedListOutputParser: {animals_list}")
print(f"Data type: {type(animals_list)}")
print(f"First item: {animals_list[0]}")
```
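
The parsing step itself can be illustrated without calling an LLM. The sketch below is a simplified, standard-library stand-in for what a comma-separated list parser has to do. The function name `parse_comma_separated` and the sample strings are assumptions made for this example, and the code is not presented as LangChain's actual implementation.

```python
import csv
from io import StringIO


def parse_comma_separated(text: str) -> list[str]:
    """Split comma-separated text into a list of trimmed, non-empty items."""
    # csv.reader handles quoted items that contain commas;
    # skipinitialspace drops the space that usually follows each comma.
    reader = csv.reader(StringIO(text.strip()), skipinitialspace=True)
    return [item.strip() for row in reader for item in row if item.strip()]


print(parse_comma_separated("dog, cat, horse, cow, whale"))
# ['dog', 'cat', 'horse', 'cow', 'whale']

print(parse_comma_separated('lion, "seal, harbor", bat'))
# ['lion', 'seal, harbor', 'bat']
```

Using the `csv` module rather than `str.split(",")` is a design choice for this sketch: it keeps a quoted item such as `"seal, harbor"` intact instead of splitting it in two.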

### 2.3. `StructuredOutputParser` and `PydanticOutputParser`: Converting to JSON/Pydantic Objects

These are the most powerful parsers, allowing you to define a complex data structure (e.g., a JSON object) and instruct the LLM to generate output according to that structure.

#### 2.3.1. `PydanticOutputParser`

* **Concept:** `PydanticOutputParser` uses the `Pydantic` library to define the desired data structure as Python classes. The parser generates formatting instructions for the LLM and then parses the LLM's output string into a type-checked Pydantic object. If the output is not valid JSON or does not match the model, the parser raises an `OutputParserException`; it does not repair the output itself.
* **When to Use:** When you need complex, reliable, and automatically type-checked structured output. Ideal for information extraction, generating data for databases, or APIs.

```python
# Install the library if not already installed
# pip install langchain-openai openai pydantic

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field # Pydantic is the library used to define data structures

# Set environment variable for OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0) # temperature=0 for more consistent output

# 1. Define the desired data structure using Pydantic BaseModel
class PersonInfo(BaseModel):
    name: str = Field(description="Full name of the person")
    age: int = Field(description="Age of the person")
    occupation: str = Field(description="Current occupation of the person")
    hobbies: list[str] = Field(description="List of the person's hobbies")

# 2. Initialize PydanticOutputParser with the defined structure
parser = PydanticOutputParser(pydantic_object=PersonInfo)

# 3. Define Prompt Template
# Crucial: Include formatting instructions from the parser into the prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an information extraction assistant. Extract the following person's information based on the format:\n{format_instructions}"),
    ("human", "Extract information from the text: 'My name is John Doe, I am 30 years old and a software engineer. I like reading books and playing games.'")
]).partial(format_instructions=parser.get_format_instructions()) # Add formatting instructions to the prompt

# 4. Build chain
chain = prompt | llm | parser

# 5. Execute chain
person_info = chain.invoke({})
print("Response from PydanticOutputParser:")
print(f"Name: {person_info.name}")
print(f"Age: {person_info.age}")
print(f"Occupation: {person_info.occupation}")
print(f"Hobbies: {person_info.hobbies}")
print(f"Data type of hobbies: {type(person_info.hobbies)}")
```

**Explanation:**
* `BaseModel` and `Field` from `pydantic`: Used to define data fields and their types. `description` in `Field` is crucial as it provides context to the LLM about the meaning of each field.
* `parser.get_format_instructions()`: This method generates a detailed instruction string for the LLM on how to format the output (e.g., requesting JSON with specific fields). This string is then injected into the prompt.
* `chain = prompt | llm | parser`: The LLM will attempt to generate output according to the instructions, and the `parser` will attempt to parse that string into a `PersonInfo` object.
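
The two halves of this workflow, generating format instructions and then parsing and validating the reply, can be sketched with the standard library alone. The code below is a deliberately simplified stand-in: it uses a `dataclass` instead of a Pydantic model, and the helper names `format_instructions` and `parse` are hypothetical. It does not reproduce the instruction text or validation rules that `PydanticOutputParser` actually uses.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PersonInfo:
    name: str
    age: int
    occupation: str
    hobbies: list


def format_instructions(schema) -> str:
    """Describe the expected JSON keys and types as plain text for the prompt."""
    lines = [f'  "{f.name}": <{f.type.__name__}>' for f in fields(schema)]
    return "Return only a JSON object with these keys:\n{\n" + ",\n".join(lines) + "\n}"


def parse(text: str, schema):
    """Parse JSON text and check that each field is present with the right type."""
    data = json.loads(text)
    for f in fields(schema):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        if not isinstance(data[f.name], f.type):
            raise ValueError(f"{f.name} should be {f.type.__name__}")
    # Ignore any extra keys the model may have added.
    return schema(**{f.name: data[f.name] for f in fields(schema)})


print(format_instructions(PersonInfo))

# Hypothetical LLM reply that follows the instructions.
llm_output = (
    '{"name": "John Doe", "age": 30, "occupation": "software engineer", '
    '"hobbies": ["reading books", "playing games"]}'
)
person = parse(llm_output, PersonInfo)
print(person.age + 1)  # 31, because age is a real int, not a string
```

A reply that omits a field or uses the wrong type raises `ValueError` here, which mirrors the idea that a structured parser fails loudly instead of passing malformed data downstream.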

#### 2.3.2. `StructuredOutputParser` (Note: Largely superseded by `PydanticOutputParser`)

* **Concept:** `StructuredOutputParser` is an older parser in LangChain, also used for extracting structured data. You define the output structure with `ResponseSchema` objects and build the parser with `StructuredOutputParser.from_response_schemas`. It returns a plain Python dictionary rather than a typed object.
* **Relationship:** Although it is still available, `PydanticOutputParser` is recommended over it because Pydantic provides stronger type checking and clearer validation errors. You can think of `PydanticOutputParser` as a more advanced take on the same idea. This lesson focuses on `PydanticOutputParser` for that reason.


---

## 3. Practical Example: Using Output Parsers to Get Structured Data from LLMs

We've practiced individual examples above. Now, let's try a more comprehensive example to see the power of combining a prompt template, an LLM, and an output parser.

**Goal:** Build a chain to extract movie information from a text snippet and format it into a Python object.

```python
# Install the library if not already installed
# pip install langchain-openai openai pydantic

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

# Set environment variable for OpenAI API key
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 1. Define the data structure for movie information
class MovieInfo(BaseModel):
    title: str = Field(description="Title of the movie")
    release_year: int = Field(description="Release year of the movie")
    director: str = Field(description="Name of the movie's director")
    main_actors: List[str] = Field(description="List of main actors")
    genres: List[str] = Field(description="List of movie genres")
    imdb_score: float = Field(description="IMDb score of the movie")

# 2. Initialize PydanticOutputParser
parser_movie = PydanticOutputParser(pydantic_object=MovieInfo)

# 3. Define Prompt Template with formatting instructions
prompt_movie = ChatPromptTemplate.from_messages([
    ("system", "You are a movie information extraction assistant. Extract the following information based on the format:\n{format_instructions}"),
    ("human", "Extract movie information from the following text:\n{movie_text}")
]).partial(format_instructions=parser_movie.get_format_instructions())

# 4. Build chain
chain_movie = prompt_movie | llm | parser_movie

# 5. Input text
movie_text = """
The movie "The Shawshank Redemption" was released in 1994, directed by Frank Darabont.
Main actors include Tim Robbins and Morgan Freeman.
The film belongs to the Drama genre. Its IMDb score is 9.3.
"""

# 6. Execute chain
try:
    extracted_movie_info = chain_movie.invoke({"movie_text": movie_text})

    print("--- Extracted Movie Information ---")
    print(f"Title: {extracted_movie_info.title}")
    print(f"Release Year: {extracted_movie_info.release_year}")
    print(f"Director: {extracted_movie_info.director}")
    print(f"Main Actors: {', '.join(extracted_movie_info.main_actors)}")
    print(f"Genres: {', '.join(extracted_movie_info.genres)}")
    print(f"IMDb Score: {extracted_movie_info.imdb_score}")
    print(f"Data type of object: {type(extracted_movie_info)}")
    print(f"Data type of main actors: {type(extracted_movie_info.main_actors)}")

except Exception as e:
    # Covers both API failures and parsing failures (e.g., OutputParserException)
    print(f"An error occurred while extracting movie information: {e}")
```

**Important Notes:**
* The accuracy of information extraction depends on the LLM's capabilities and the clarity of the prompt/formatting instructions.
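* `PydanticOutputParser` does not repair malformed output. If the LLM's response is not valid JSON or fails Pydantic validation, the parser raises an `OutputParserException`, which tells you the output is not structured correctly. Automatic repair requires wrapping it in a separate parser such as `OutputFixingParser`.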
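
One common way to deal with a parse failure is to catch the error and ask the model again with a corrective hint. The sketch below shows that pattern with the standard library only. The function `parse_with_retry`, the `fake_llm` stand-in, and the wording of the corrective hint are all assumptions made for illustration; a real chain would call an actual model and parser in their place.

```python
import json
from typing import Callable


def parse_with_retry(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call `generate` until its reply parses as a JSON object, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            last_error = exc
            # Tell the model what went wrong before trying again.
            prompt = f"{prompt}\n\nYour previous reply could not be parsed ({exc}). Reply with JSON only."
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error


# Hypothetical model stand-in: the first reply is free text, the second is valid JSON.
replies = iter([
    "Sure! The movie is great.",
    '{"title": "The Shawshank Redemption", "release_year": 1994}',
])


def fake_llm(prompt: str) -> str:
    return next(replies)


result = parse_with_retry(fake_llm, "Extract the movie as JSON.")
print(result["release_year"])  # 1994
```

Capping the number of attempts keeps a persistently malformed reply from looping forever and surfaces the last parsing error to the caller.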


---

## Lesson Summary

This lesson explained the importance of **Output Parsers** in converting the LLM's free-form string output into structured data formats that are easy to use in Python code. We learned about common parser types:
* **`StrOutputParser`** for simple string extraction.
* **`CommaSeparatedListOutputParser`** for converting strings into lists of items.
* **`PydanticOutputParser`** (recommended) for extracting complex data into Python objects defined with Pydantic's `BaseModel`, with type validation on every field.

Through practical examples, you've seen how to integrate Output Parsers into LangChain chains using LCEL to obtain structured data from LLMs, opening up possibilities for building more reliable and powerful LLM applications.