# LangChain Output Parsers
* Notebook by Adam Lang
* Date: 6/20/2024
* This notebook will go over output parsers and their use cases in LangChain along with an implementation using Python.

# Output Parser - Overview
* Output parsers take the raw LLM output and transform it into a more "consumable" format.
* They are useful when we need structured or semi-structured output to store in a database or to feed into a downstream workflow.
* **The main concept is forcing a specific structured output result.**
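As a rough illustration of the idea, the sketch below turns a raw text reply into a Python dict using only the standard library. It is not LangChain's implementation; the function name `parse_json_reply` and the sample reply are hypothetical and written by hand for this example.

```python
import json
import re

# A Markdown code fence, built at runtime so this example does not nest fences.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(raw_text: str) -> dict:
    """Extract a JSON object from a raw model reply and return it as a dict."""
    # Models often wrap JSON in a Markdown code fence; unwrap it if present.
    match = FENCED_BLOCK.search(raw_text)
    payload = match.group(1) if match else raw_text
    return json.loads(payload.strip())


# Hypothetical raw reply (not produced by a real model call).
raw_reply = (
    "Here you go:\n"
    + FENCE + "json\n"
    + '{"city": "Paris", "population_millions": 2.1}\n'
    + FENCE
)

parsed = parse_json_reply(raw_reply)
print(parsed["city"])
```

The point is only that the caller receives a structure it can index into, rather than free-form text.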

# Types of output parsers in LangChain
* `PydanticOutputParser` - Uses a Pydantic model (data class) to parse and validate the LLM output into structured, typed fields.
* `JsonOutputParser` - Formats output as a well-defined JSON structure.
  * Returns a JSON object as specified.
  * Optionally takes a Pydantic model and returns JSON that matches that model's schema.
  * **Probably the most reliable output parser for getting structured data.**
* `CommaSeparatedListOutputParser` - Returns the output as a list of comma-separated items.
* `XML` (`XMLOutputParser`) - Returns a dictionary of tags. Used when XML output is needed. Works best with models that are good at writing XML, such as **Anthropic Claude**.
* `YAML` (`YamlOutputParser`) - Takes a user-defined Pydantic model and returns data in that format, with YAML used as the encoding.
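To make "a dictionary of tags" concrete, here is a minimal standard-library sketch that converts an XML reply into nested dicts. The output shape (a tag mapped either to its text or to a list of child dicts) is an assumption chosen for illustration, and `xml_to_dict` is a hypothetical helper, not a LangChain function.

```python
import xml.etree.ElementTree as ET


def xml_to_dict(element: ET.Element) -> dict:
    """Map an element to {tag: text} for leaves or {tag: [children]} otherwise."""
    children = list(element)
    if not children:
        return {element.tag: (element.text or "").strip()}
    return {element.tag: [xml_to_dict(child) for child in children]}


# Hypothetical model reply, written by hand for illustration.
raw_reply = "<movie><title>Alien</title><year>1979</year></movie>"

result = xml_to_dict(ET.fromstring(raw_reply))
print(result)
```

Every leaf value stays a string here; converting `year` to an integer would be a separate validation step.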

## Install dependencies

In [1]:
!pip install langchain==0.1.19
!pip install langchain-openai==0.1.6
!pip install langchain-community==0.0.38


## API Tokens
* Enter your OpenAI API key here.

In [2]:
from getpass import getpass

OPENAI_KEY = getpass('Please enter your Open AI API Key here: ')

Please enter your Open AI API Key here: ··········


## Setup environment variables

In [3]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY

## Chat Models and LLMs
* LLMs are the core component of LangChain and are accessed through provider APIs.

## Accessing Commercial LLMs like ChatGPT

In [4]:
from langchain_openai import ChatOpenAI

# instantiate the chat model (temperature=0 for more repeatable output)
chatgpt = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

### Output Parsers
* Pydantic parser
* JSON parser
* CSV parser


### PydanticOutputParser
* This parser lets you specify a Pydantic model and query the LLM for output that conforms to that model's schema.
* LLM output is non-deterministic, so free-form responses can change shape from call to call. Parsing against a Pydantic model lets you enforce the structure you need and removes most, if not all, of that variability.
* Pydantic uses Python type hints to declare and validate the fields.
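The idea of checking a reply against a typed schema can be sketched with the standard library alone. The example below uses a `dataclass` as a stand-in for a Pydantic model; `TopicSummary` and `parse_into` are hypothetical names, and the checks are far simpler than what Pydantic actually performs.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TopicSummary:
    """Hypothetical schema: two required string fields."""
    description: str
    conclusion: str


def parse_into(schema, raw_json: str):
    """Parse JSON text and verify it has the schema's fields with the right types."""
    data = json.loads(raw_json)
    expected = {f.name: f.type for f in fields(schema)}

    missing = expected.keys() - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    for name, expected_type in expected.items():
        if not isinstance(data[name], expected_type):
            raise TypeError(f"field {name!r} should be {expected_type.__name__}")

    # Keep only the declared fields and build the typed object.
    return schema(**{name: data[name] for name in expected})


# Hypothetical reply text, written by hand for illustration.
summary = parse_into(
    TopicSummary,
    '{"description": "A short overview.", "conclusion": "Worth a look."}',
)
print(summary.conclusion)
```

A reply that omits `conclusion` raises `ValueError` instead of silently producing a partial object, which is the behavior that makes a schema-backed parser useful.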

In [5]:
from langchain_core.prompts import PromptTemplate #prompt template
from langchain_core.output_parsers import PydanticOutputParser # specific parser
from langchain_core.pydantic_v1 import BaseModel, Field

# define pydantic data class for data structure
class QueryResponse(BaseModel):
  description: str = Field(description="A brief description of the topic asked by the user")
  pros: str = Field(description="3 bullet points showing the pros of the topic asked by the user")
  cons: str = Field(description="3 bullet points showing the cons of the topic asked by the user")
  conclusion: str = Field(description="One line conclusion of the topic asked by the user")

# set up parser + inject instructions into langchain prompt template
parser = PydanticOutputParser(pydantic_object=QueryResponse)  # pass the Pydantic class that defines your data structure
parser

PydanticOutputParser(pydantic_object=<class '__main__.QueryResponse'>)

In [6]:
# langchain pre-generated output response formatting instructions
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"description": {"title": "Description", "description": "A brief description of the topic asked by the user", "type": "string"}, "pros": {"title": "Pros", "description": "3 bullet points showing the pros of the topic asked by the user", "type": "string"}, "cons": {"title": "Cons", "description": "3 bullet points showing the cons of the topic asked by the user", "type": "string"}, "conclusion": {"title": "Conclusion", "description": "One line conclusion of the topic asked by the user", "type": "string"}}, "required": ["descriptio

In [13]:
# create the final prompt with formatting instructions from the parser
prompt_txt = """
            Answer the user query and generate the response based on the following formatting instructions

            Format Instructions:
            {format_instructions}

            Query:
            {query}
            """

# create prompt template -- format instructions come from the Pydantic parser
prompt = PromptTemplate(
    template=prompt_txt,
    input_variables=["query"],
    # the key must match the {format_instructions} placeholder exactly (no trailing colon)
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

prompt

PromptTemplate(input_variables=['query'], partial_variables={'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"description": {"title": "Description", "description": "A brief description of the topic asked by the user", "type": "string"}, "pros": {"title": "Pros", "description": "3 bullet points showing the pros of the topic asked by the user", "type": "string"}, "cons": {"title": "Cons", "description": "3 bullet points showing the cons of the topic asked by the user", "type": "string"}, "conclusion": {"title": "Conclusion

In [14]:
# create a simple LCEL chain to take prompt --> pass to LLM --> enforce format using parser
chain = (prompt
         |
        chatgpt
         |
        parser)

In [17]:
## ask the question and get the model's response
question = "Tell me about Commercial Real Estate"
# format_instructions is already supplied through partial_variables, so only the query is passed
response = chain.invoke({"query": question})


In [18]:
response

QueryResponse(description='Commercial real estate refers to properties used for business purposes, such as office buildings, retail spaces, and industrial facilities. It involves leasing, buying, or selling properties for commercial use.', pros='1. Potential for high returns on investment. 2. Diversification of investment portfolio. 3. Long-term leases provide stable income streams.', cons='1. Market fluctuations can impact property values. 2. High upfront costs for purchasing commercial properties. 3. Vacancy rates can affect cash flow.', conclusion='Commercial real estate can be a lucrative investment option but requires careful research and management.')

In [19]:
## description
response.description

'Commercial real estate refers to properties used for business purposes, such as office buildings, retail spaces, and industrial facilities. It involves leasing, buying, or selling properties for commercial use.'

In [20]:
## response dict
response.dict()

{'description': 'Commercial real estate refers to properties used for business purposes, such as office buildings, retail spaces, and industrial facilities. It involves leasing, buying, or selling properties for commercial use.',
 'pros': '1. Potential for high returns on investment. 2. Diversification of investment portfolio. 3. Long-term leases provide stable income streams.',
 'cons': '1. Market fluctuations can impact property values. 2. High upfront costs for purchasing commercial properties. 3. Vacancy rates can affect cash flow.',
 'conclusion': 'Commercial real estate can be a lucrative investment option but requires careful research and management.'}

In [21]:
# key, value dict
for k,v in response.dict().items():
  print(f"{k}:\n{v}\n")

description:
Commercial real estate refers to properties used for business purposes, such as office buildings, retail spaces, and industrial facilities. It involves leasing, buying, or selling properties for commercial use.

pros:
1. Potential for high returns on investment. 2. Diversification of investment portfolio. 3. Long-term leases provide stable income streams.

cons:
1. Market fluctuations can impact property values. 2. High upfront costs for purchasing commercial properties. 3. Vacancy rates can affect cash flow.

conclusion:
Commercial real estate can be a lucrative investment option but requires careful research and management.



### JsonOutputParser
* This parser lets you specify a JSON schema and query an LLM for output that conforms to that schema. The parsed result is a Python dict.
* Declaring the schema with a Pydantic model is recommended.

In [22]:
from typing import List

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

# define the data structure in a Pydantic class
class QueryResponse(BaseModel):
  description: str = Field(description="A brief description of the topic asked by the user")
  pros: str = Field(description="3 bullet points showing the pros of the topic asked by the user")
  cons: str = Field(description="3 bullet points showing the cons of the topic asked by the user")
  conclusion: str = Field(description="One line conclusion of the topic asked by the user")

# set up a parser + inject instructions into prompt template
parser = JsonOutputParser(pydantic_object=QueryResponse)
parser

JsonOutputParser(pydantic_object=<class '__main__.QueryResponse'>)

In [23]:
# create the final prompt with formatting instructions from the parser
prompt_txt = """
            Answer the user query and generate the response based on the following formatting instructions

            Format Instructions:
            {format_instructions}

            Query:
            {query}
            """

# create prompt template -- parser is from the Pydantic Class
prompt = PromptTemplate(
    template=prompt_txt,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},)

prompt

PromptTemplate(input_variables=['query'], partial_variables={'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"description": {"title": "Description", "description": "A brief description of the topic asked by the user", "type": "string"}, "pros": {"title": "Pros", "description": "3 bullet points showing the pros of the topic asked by the user", "type": "string"}, "cons": {"title": "Cons", "description": "3 bullet points showing the cons of the topic asked by the user", "type": "string"}, "conclusion": {"title": "Conclusion", "description": "One l

In [24]:
## create LCEL chain to take prompt --> pass to LLM --> enforce response format using parser
chain = (prompt
              |
            chatgpt
              |
            parser)

chain

PromptTemplate(input_variables=['query'], partial_variables={'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"description": {"title": "Description", "description": "A brief description of the topic asked by the user", "type": "string"}, "pros": {"title": "Pros", "description": "3 bullet points showing the pros of the topic asked by the user", "type": "string"}, "cons": {"title": "Cons", "description": "3 bullet points showing the cons of the topic asked by the user", "type": "string"}, "conclusion": {"title": "Conclusion", "description": "One l
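The `|` syntax in an LCEL chain reads as "feed the output of the left step into the right step". A minimal way to picture this is a class that overloads `__or__`. The sketch below is an assumption-laden toy: `Step` is hypothetical, the "model" is a hard-coded function, and none of it is LangChain's actual `Runnable` code.

```python
class Step:
    """Hypothetical pipeline step that can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # Compose: run this step first, then pass its result to the next one.
        return Step(lambda value: other.invoke(self.invoke(value)))


# prompt --> model --> parser, with a fake model standing in for the LLM call.
build_prompt = Step(lambda inputs: f"List three {inputs['query']}, comma separated.")
fake_model = Step(lambda prompt_text: "red, green, blue")
split_items = Step(lambda reply: [item.strip() for item in reply.split(",")])

chain = build_prompt | fake_model | split_items
result = chain.invoke({"query": "colors"})
print(result)
```

Because each composition returns another `Step`, a chain of any length exposes the same `invoke` method as a single step.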

In [25]:
topic_queries = [
    "Tell me about New York City",
    "Tell me about Generative AI"
]

topic_queries_formatted = [{"query": topic}
                    for topic in topic_queries]

topic_queries_formatted

[{'query': 'Tell me about New York City'},
 {'query': 'Tell me about Generative AI'}]

In [26]:
responses = chain.map().invoke(topic_queries_formatted)
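Conceptually, mapping a chain over a list means invoking it once per input and collecting the results in the same order. The sketch below shows that with a thread pool from the standard library. Whether and how LangChain parallelizes mapped calls is not shown in this notebook, so the concurrency here is an assumption, and `fake_chain` and `map_invoke` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_chain(inputs: dict) -> dict:
    """Stand-in for a real chain: echoes the query and reports its length."""
    return {"query": inputs["query"], "length": len(inputs["query"])}


def map_invoke(chain_fn, batch: list) -> list:
    """Apply chain_fn to every input; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(chain_fn, batch))


queries = [
    {"query": "Tell me about New York City"},
    {"query": "Tell me about Generative AI"},
]

results = map_invoke(fake_chain, queries)
for item in results:
    print(item)
```

The ordering guarantee is what makes it safe to line the results up with the original list of queries afterwards.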

In [27]:
# get responses
responses[0], type(responses[0])

({'description': 'New York City is a bustling metropolis known for its iconic landmarks such as the Statue of Liberty, Times Square, and Central Park. It is a melting pot of cultures and offers a diverse range of activities and attractions for visitors and residents alike.',
  'pros': '1. Vibrant cultural scene with world-class museums, theaters, and restaurants. 2. Excellent public transportation system making it easy to get around the city. 3. Opportunities for career growth and networking in various industries.',
  'cons': '1. High cost of living compared to other cities in the United States. 2. Crowded streets and public spaces, especially during peak tourist seasons. 3. Harsh winters with cold temperatures and occasional snowstorms.',
  'conclusion': 'Overall, New York City is a dynamic and exciting place to live or visit, offering a unique blend of opportunities and experiences.'},
 dict)

In [28]:
## input to dataframe
import pandas as pd

df = pd.DataFrame(responses)
df

| | description | pros | cons | conclusion |
|---|---|---|---|---|
| 0 | New York City is a bustling metropolis known f... | 1. Vibrant cultural scene with world-class mus... | 1. High cost of living compared to other citie... | Overall, New York City is a dynamic and exciti... |
| 1 | Generative AI refers to a type of artificial i... | 1. Can generate creative and unique content. 2... | 1. May produce biased or inappropriate content... | Generative AI has the potential to revolutioni... |


In [29]:
# loop through responses
for response in responses:
  for k,v in response.items():
    print(f"{k}:\n{v}\n")
  print('------')

description:
New York City is a bustling metropolis known for its iconic landmarks such as the Statue of Liberty, Times Square, and Central Park. It is a melting pot of cultures and offers a diverse range of activities and attractions for visitors and residents alike.

pros:
1. Vibrant cultural scene with world-class museums, theaters, and restaurants. 2. Excellent public transportation system making it easy to get around the city. 3. Opportunities for career growth and networking in various industries.

cons:
1. High cost of living compared to other cities in the United States. 2. Crowded streets and public spaces, especially during peak tourist seasons. 3. Harsh winters with cold temperatures and occasional snowstorms.

conclusion:
Overall, New York City is a dynamic and exciting place to live or visit, offering a unique blend of opportunities and experiences.

------
description:
Generative AI refers to a type of artificial intelligence that is capable of creating new content, such 

### CSV Output parser
* Returns the LLM output as a Python list, split on commas.
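Splitting a comma-separated reply into a list can be sketched with Python's `csv` module, which also copes with quoted items that contain commas. This is an illustrative stand-in, not the code LangChain runs; `parse_comma_separated` and the sample reply are hypothetical.

```python
import csv
import io


def parse_comma_separated(text: str) -> list[str]:
    """Split a comma-separated reply into a clean list of items."""
    # skipinitialspace lets 'a, "b, c"' be read correctly after the comma.
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    return [item.strip() for row in reader for item in row]


# Hypothetical model reply, written by hand for illustration.
reply = 'Rock (Beatles), Jazz (Miles Davis), "Soul, Funk (James Brown)"'

items = parse_comma_separated(reply)
print(items)
```

A plain `reply.split(",")` would break the quoted third item in two, which is why the `csv` reader is used here.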

In [30]:
from langchain_core.output_parsers import CommaSeparatedListOutputParser #specific parser
from langchain_core.prompts import PromptTemplate


# output parser setup
output_parser = CommaSeparatedListOutputParser()

# format the instructions
format_instructions = output_parser.get_format_instructions()
format_instructions

'Your response should be a list of comma separated values, eg: `foo, bar, baz` or `foo,bar,baz`'

In [40]:
format_instructions = output_parser.get_format_instructions()

# Create prompt to send to LLM
prompt_txt = """
             Create a list of 5 different musical genres

             Include with each musical genre in parenthesis an example musician or band

             Here is an example: Rock - Beatles, Jazz - Miles Davis, Classical -


             Output format instructions:
             {format_instructions}
             """

prompt = PromptTemplate.from_template(template=prompt_txt)
prompt

PromptTemplate(input_variables=['format_instructions'], template='\n             Create a list of 5 different musical genres\n\n             Include with each musical genre in parenthesis an example musician or band\n\n             Here is an example: Rock - Beatles, Jazz - Miles Davis, Classical - \n\n\n             Output format instructions:\n             {format_instructions}\n             ')

In [41]:
# create a simple LLM chain
llm_chain = (prompt
              |
            chatgpt
              |
            output_parser)

# run chain
response = llm_chain.invoke({'format_instructions': format_instructions})
response

['Rock (Beatles)',
 'Jazz (Miles Davis)',
 'Hip Hop (Kendrick Lamar)',
 'Country (Johnny Cash)',
 'Electronic (Daft Punk)']

In [42]:
type(response)

list