## LangChain: Output Parsers for Structured Data

So far, we have worked only with the plain string returned from an LLM. In this notebook we ask for formatted data and parse it first into JSON, then into a pandas dataframe, which is a very common way to work with tabular data.

### Table of Contents <a name="top"></a>
1. [Introduction to Pydantic](#pydantic)
2. [Use a Pydantic model to structure a simple return from an LLM into JSON](#joke)
3. [Build a pandas dataframe](#pandas)



In [1]:
# Import everything we need up front
from dotenv import load_dotenv
import os
import langchain
from pydantic import BaseModel, Field
from typing import List
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler
from langchain_core.prompts import PromptTemplate
import pandas as pd

In [2]:
load_dotenv()
# Now you can access the environment variables
openai_api_key = os.getenv('OPENAI_API_KEY')
#
# Not needed for this notebook
# langchain_api_key = os.getenv('LANGCHAIN_API_KEY')
# anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
# huggingface_api_key = os.getenv('HUGGINGFACE_API_KEY')
#
# You could also assign the key directly, but exposing a key in a notebook is bad practice
# anthropic_api_key='sk-ant-api03....._AAA' 

## Introduction to Pydantic:<a name="pydantic"></a>
Pydantic is a Python library for data validation. You define the structure of your data with Python classes, and Pydantic checks that the data you work with matches that structure.

Here's an analogy: imagine you're running a club and you want to keep a list of all your members. For each member, you want to record their name, age, and email address. You could keep this information in a plain Python dictionary, but nothing would stop an entry from missing a field or holding the wrong type. A Pydantic model declares the fields and their types once and checks every record against them.
[Top of Page](#top)
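To make the club analogy concrete, here is a minimal sketch contrasting a plain dictionary with a Pydantic model. The `Member` class and the sample values are hypothetical and only serve the illustration.

```python
from pydantic import BaseModel, ValidationError

# Hypothetical club-member record kept as a plain dictionary.
# Nothing objects to the age being text instead of a number.
member_dict = {"name": "Ada", "age": "thirty-six", "email": "ada@example.com"}


class Member(BaseModel):
    # Hypothetical model: each field is declared once with its type.
    name: str
    age: int
    email: str


# Building the model from the bad dictionary fails validation.
try:
    Member(**member_dict)
    validation_failed = False
except ValidationError as err:
    validation_failed = True
    print(err)

# A well-formed record passes and exposes typed attributes.
valid_member = Member(name="Ada", age=36, email="ada@example.com")
print(valid_member.age)
```

The dictionary accepts anything, while the model rejects the record as soon as it is created.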

In [3]:
# Pydantic allows you to define a data structure

class User(BaseModel):
    id: int
    name: str
    age: float
    email: str
# View the JSON schema for the model
# (Pydantic v2 deprecates .schema() in favor of .model_json_schema())
User.schema()

{'title': 'User',
 'type': 'object',
 'properties': {'id': {'title': 'Id', 'type': 'integer'},
  'name': {'title': 'Name', 'type': 'string'},
  'age': {'title': 'Age', 'type': 'number'},
  'email': {'title': 'Email', 'type': 'string'}},
 'required': ['id', 'name', 'age', 'email']}

In [4]:
# Create an instance of the User pydantic model
user = User(id=1, name='LeBron', age=39.999, email='lebron@lakers.com')
# Take a look at the user serialized as a JSON string
user.json()

'{"id": 1, "name": "LeBron", "age": 39.999, "email": "lebron@lakers.com"}'
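Note that the output above is a string that happens to contain JSON, not a Python dictionary. As a small sketch using only the standard library, the string can be turned back into a dictionary with `json.loads`; the variable names here are illustrative.

```python
import json

# The JSON string shown in the output above.
user_json = '{"id": 1, "name": "LeBron", "age": 39.999, "email": "lebron@lakers.com"}'

# json.loads parses the text into a regular Python dictionary.
user_dict = json.loads(user_json)

print(type(user_json))
print(type(user_dict))
print(user_dict["name"])
```

This string-versus-dictionary distinction matters later, because the output parser performs the same kind of conversion on the model's reply.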

## Use a Pydantic model to structure a simple return from a LLM into JSON <a name="joke"></a>
In this section we ask the LLM for a simple joke, then parse it into a structured JSON object.<BR>
[Top of Page](#top)

In [5]:
# Define a new data structure for a joke
#
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")
Joke.schema()

{'title': 'Joke',
 'type': 'object',
 'properties': {'setup': {'title': 'Setup',
   'description': 'question to set up a joke',
   'type': 'string'},
  'punchline': {'title': 'Punchline',
   'description': 'answer to resolve the joke',
   'type': 'string'}},
 'required': ['setup', 'punchline']}

In [6]:
# Set up a parser and give it your data structure
joke_parser = JsonOutputParser(pydantic_object=Joke)
joke_parser.get_format_instructions()

'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"setup": {"title": "Setup", "description": "question to set up a joke", "type": "string"}, "punchline": {"title": "Punchline", "description": "answer to resolve the joke", "type": "string"}}, "required": ["setup", "punchline"]}\n```'

In [7]:
# Set up a simple prompt template
template = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": joke_parser.get_format_instructions()},
)
template.schema()

{'title': 'PromptTemplate',
 'type': 'object',
 'properties': {'name': {'title': 'Name', 'type': 'string'},
  'input_variables': {'title': 'Input Variables',
   'type': 'array',
   'items': {'type': 'string'}},
  'input_types': {'title': 'Input Types', 'type': 'object'},
  'output_parser': {'$ref': '#/definitions/BaseOutputParser'},
  'partial_variables': {'title': 'Partial Variables', 'type': 'object'},
  'metadata': {'title': 'Metadata', 'type': 'object'},
  'tags': {'title': 'Tags', 'type': 'array', 'items': {'type': 'string'}},
  'template': {'title': 'Template', 'type': 'string'},
  'template_format': {'title': 'Template Format',
   'default': 'f-string',
   'enum': ['f-string', 'mustache', 'jinja2'],
   'type': 'string'},
  'validate_template': {'title': 'Validate Template',
   'default': False,
   'type': 'boolean'}},
 'required': ['input_variables', 'template'],
 'definitions': {'BaseOutputParser': {'title': 'BaseOutputParser',
   'description': 'Base class to parse the output 
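To see what the template does with its two placeholders, here is a rough sketch that uses plain `str.format` as a stand-in for `PromptTemplate`. The shortened `format_instructions` text is a placeholder, not the real output of `get_format_instructions()`.

```python
# Same template text as in the cell above.
template_text = "Answer the user query.\n{format_instructions}\n{query}\n"

# Shortened stand-in for the parser's format instructions (assumption).
format_instructions = "Return a JSON object with the keys 'setup' and 'punchline'."

# The format instructions are filled in ahead of time (the "partial" variable);
# only the query changes from call to call.
prompt = template_text.format(
    format_instructions=format_instructions,
    query="Tell me a joke.",
)
print(prompt)
```

The model therefore always receives the formatting rules together with the user's question.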

In [8]:
# Create a ChatGPT model
openai_llm = ChatOpenAI(model='gpt-3.5-turbo', api_key=openai_api_key)

In [9]:
# Now we have everything we need for a chain:
print(type(template))
print(type(openai_llm))
print(type(joke_parser))

<class 'langchain_core.prompts.prompt.PromptTemplate'>
<class 'langchain_openai.chat_models.base.ChatOpenAI'>
<class 'langchain_core.output_parsers.json.JsonOutputParser'>


In [10]:
# So create the chain
chain = template | openai_llm | joke_parser
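The `|` operator feeds the output of each stage into the next one. The following is a conceptual sketch of that idea only: the `Step` class and the fake stages are hypothetical stand-ins, not LangChain's implementation, and the fake model returns a canned reply instead of calling an API.

```python
import json


class Step:
    """Hypothetical stand-in for a chain stage: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that passes a's output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for the three stages of the chain.
fake_template = Step(lambda inputs: f"Answer the user query.\n{inputs['query']}\n")
fake_llm = Step(
    lambda prompt: '{"setup": "Why was the math book sad?", '
    '"punchline": "Because it had too many problems."}'
)
fake_parser = Step(json.loads)

toy_chain = fake_template | fake_llm | fake_parser
result = toy_chain.invoke({"query": "Tell me a joke."})
print(result["punchline"])
```

Each stage only needs to accept the previous stage's output: dictionary in, prompt text out, reply text out, dictionary out.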

In [11]:
# Create a query to send to the LLM
joke_query = "Tell me a joke."
#
# Now start the chain, but this time we will turn on the verbose mode so we can see what is happening.
response = chain.invoke({"query": joke_query},config={'callbacks': [ConsoleCallbackHandler()]})

[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence] Entering Chain run with input:
[0m{
  "query": "Tell me a joke."
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence > 2:prompt:PromptTemplate] Entering Prompt run with input:
[0m{
  "query": "Tell me a joke."
}
[36;1m[1;3m[chain/end][0m [1m[1:chain:RunnableSequence > 2:prompt:PromptTemplate] [1ms] Exiting Prompt run with output:
[0m[outputs]
[32;1m[1;3m[llm/start][0m [1m[1:chain:RunnableSequence > 3:llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: Answer the user query.\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\

In [12]:
# If everything went right, we now have a Python dictionary parsed from the model's JSON
print(type(response))

<class 'dict'>


In [13]:
# OK, just have a look at the JSON object. Structured output!
response

{'setup': 'Why was the math book sad?',
 'punchline': 'Because it had too many problems.'}
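The parser hands back a plain dictionary, and nothing in the cells above checks it against the `Joke` model afterwards. As a hedged sketch, one way to add that check is to build a `Joke` from the dictionary. The response is copied from the output above so the example runs without calling the LLM.

```python
from pydantic import BaseModel, Field, ValidationError


class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


# Response copied from the cell output above.
response = {
    "setup": "Why was the math book sad?",
    "punchline": "Because it had too many problems.",
}

# Building the model validates that both fields are present and are strings.
joke = Joke(**response)
print(joke.punchline)

# A reply missing a required field is rejected.
try:
    Joke(**{"setup": "Why was the math book sad?"})
    missing_field_caught = False
except ValidationError:
    missing_field_caught = True
```

This catches a malformed reply at the point where it enters your program rather than later, when a missing key is accessed.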

## Build a pandas dataframe <a name="pandas"></a>
Let's take a step up in complexity. Now we'll ask the LLM for some structured data and convert it into a pandas dataframe.<BR>
[Top of Page](#top)

In [14]:
# Define the desired data structure with Pydantic.
#
class SP500Data(BaseModel):
    year: int
    sp_500_index_value: float

class SP500Index(BaseModel):
    data: List[SP500Data]
SP500Data.schema()

{'title': 'SP500Data',
 'type': 'object',
 'properties': {'year': {'title': 'Year', 'type': 'integer'},
  'sp_500_index_value': {'title': 'Sp 500 Index Value', 'type': 'number'}},
 'required': ['year', 'sp_500_index_value']}

In [15]:
# Set up a new parser with our data structure
sp_parser = JsonOutputParser(pydantic_object=SP500Data)
sp_parser.get_format_instructions()

'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"year": {"title": "Year", "type": "integer"}, "sp_500_index_value": {"title": "Sp 500 Index Value", "type": "number"}}, "required": ["year", "sp_500_index_value"]}\n```'

In [16]:
# Now we have everything we need for a new chain:
print(type(template)) # Same template object as before; its format instructions still come from joke_parser
print(type(openai_llm))
print(type(sp_parser))

<class 'langchain_core.prompts.prompt.PromptTemplate'>
<class 'langchain_openai.chat_models.base.ChatOpenAI'>
<class 'langchain_core.output_parsers.json.JsonOutputParser'>


In [17]:
# Create a new chain using our new configured parser
sp_chain = template | openai_llm | sp_parser

In [18]:
# You need to be very specific with your LLM prompt. 
# I tested this out in a web chat until I was happy with what I was getting back.

stock_query='''
Please generate a table of hypothetical data of the S&P 500 stock market index value
for the end of each year for the period 1980 - 1985? Please format this data JSON.
'''
# Start the new chain in verbose mode
response = sp_chain.invoke({"query": stock_query}, config={'callbacks': [ConsoleCallbackHandler()]})

[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence] Entering Chain run with input:
[0m{
  "query": "\nPlease generate a table of hypothetical data of the S&P 500 stock market index value\nfor the end of each year for the period 1980 - 1985? Please format this data JSON.\n"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence > 2:prompt:PromptTemplate] Entering Prompt run with input:
[0m{
  "query": "\nPlease generate a table of hypothetical data of the S&P 500 stock market index value\nfor the end of each year for the period 1980 - 1985? Please format this data JSON.\n"
}
[36;1m[1;3m[chain/end][0m [1m[1:chain:RunnableSequence > 2:prompt:PromptTemplate] [1ms] Exiting Prompt run with output:
[0m[outputs]
[32;1m[1;3m[llm/start][0m [1m[1:chain:RunnableSequence > 3:llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: Answer the user query.\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs 

In [19]:
# Check out the parsed response: a dictionary built from the model's JSON
response

{'data': [{'year': 1980, 's&p_500_index_value': 135.76},
  {'year': 1981, 's&p_500_index_value': 137.55},
  {'year': 1982, 's&p_500_index_value': 140.64},
  {'year': 1983, 's&p_500_index_value': 165.37},
  {'year': 1984, 's&p_500_index_value': 167.24},
  {'year': 1985, 's&p_500_index_value': 211.28}]}

In [20]:
# Let's look at just the 'data' key
response['data']

[{'year': 1980, 's&p_500_index_value': 135.76},
 {'year': 1981, 's&p_500_index_value': 137.55},
 {'year': 1982, 's&p_500_index_value': 140.64},
 {'year': 1983, 's&p_500_index_value': 165.37},
 {'year': 1984, 's&p_500_index_value': 167.24},
 {'year': 1985, 's&p_500_index_value': 211.28}]

In [21]:
# With this structured data, we can easily build a pandas dataframe to work with the data.
#
# Create a DataFrame from the LLM response
df = pd.DataFrame(response['data'])
df

   year  s&p_500_index_value
0  1980               135.76
1  1981               137.55
2  1982               140.64
3  1983               165.37
4  1984               167.24
5  1985               211.28
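Once the data is in a dataframe, ordinary pandas operations apply. As an illustrative sketch, the rows below are copied from the response above; renaming the column and adding a year-over-year change are assumptions about what you might want next, not steps from the original notebook.

```python
import pandas as pd

# Rows copied from response['data'] above.
rows = [
    {"year": 1980, "s&p_500_index_value": 135.76},
    {"year": 1981, "s&p_500_index_value": 137.55},
    {"year": 1982, "s&p_500_index_value": 140.64},
    {"year": 1983, "s&p_500_index_value": 165.37},
    {"year": 1984, "s&p_500_index_value": 167.24},
    {"year": 1985, "s&p_500_index_value": 211.28},
]

df = pd.DataFrame(rows)

# Use a column name without '&' and index the rows by year.
df = df.rename(columns={"s&p_500_index_value": "sp_500_index_value"}).set_index("year")

# Year-over-year change in percent; the first year has no predecessor, so it is NaN.
df["pct_change"] = df["sp_500_index_value"].pct_change() * 100
print(df)
```

Renaming the column also makes attribute-style access such as `df.sp_500_index_value` possible, which the `&` in the original key prevents.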


### What we did
1. Learned how to create a data structure using Pydantic
2. Learned how to format LLM output as a simple JSON-structured joke: setup and punchline
3. Learned how to format tabular data from an LLM as JSON
4. Converted the JSON into a pandas dataframe