## Using LangChain to get structured outputs


In [1]:
 %xmode minimal

Exception reporting mode: Minimal


In [1]:
from langchain.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic
from langchain_ollama import ChatOllama
from langchain_fireworks import ChatFireworks

from langchain.output_parsers import XMLOutputParser, PydanticOutputParser

from typing import Optional
from pydantic import BaseModel, Field
from pydantic_xml import BaseXmlModel, element
from typing_extensions import Annotated, TypedDict

In [2]:
ANTHROPIC_API_KEY = "<API KEY>"
FIREWORKS_API_KEY = "<API KEY>"

Let's start by creating an LLM to run our structured output queries. A temperature of 0 generally improves structured output generation (at the cost of "creativity"). Note that the cell below sets it to 0.8, so lower it if the model strays from the requested format.


In [3]:
# LLM parameters
temperature = 0.8
timeout = 30
num_ctx = 8192
num_predict = 4096

Define an LLM below


In [4]:
# llm_model = ChatAnthropic(model="claude-3-5-haiku-20241022", api_key=ANTHROPIC_API_KEY)
llm_model = ChatOllama(
    model="llama3.2",
    temperature=temperature,
    num_ctx=num_ctx,
    num_thread=1,
    num_predict=num_predict,
)
# llm_model = ChatFireworks(
#     # model_name="accounts/fireworks/models/llama-v3p1-70b-instruct",
#     # model_name="accounts/fireworks/models/llama-v3p2-3b-instruct",
#     model_name="accounts/fireworks/models/llama-v3p3-70b-instruct",
#     # model_name="accounts/fireworks/models/deepseek-r1",
#     api_key=FIREWORKS_API_KEY,
# )

Check it works


In [5]:
print(llm_model.invoke("Tell me a joke about zebras").content)

Why did the zebra refuse to play poker?

Because he always got stripped of his money.


### Problem setup and prompt


In [15]:
test_science_prompt_txt = """
You are a professional science writer tasked with responding to members of
the general public who write in asking questions about science.
Write an article responding to a writer's question for publication in a
science magazine intended for a general readership with a high-school education.
You should write clearly and compellingly, include all relevant context,
and provide motivating stories where applicable.

Your response must be less than 200 words.

The question given to you is the following:
{question}
"""

In [8]:
prompt_system_format = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Write an article in the provided format that answers the user question.\n{format_instructions}",
        ),
        ("human", test_science_prompt_txt),
    ]
)

XML output parsers


In [9]:
from langchain_core.output_parsers import XMLOutputParser

### Manual XML schema specification


In [10]:
prompt_xml_format = """
Your answers must strictly adhere to the following XML format:

<article>
<title>
[Title of the article.]
</title>

<problem>
[Summary of the writer's question, write concisely.]
</problem>

<answer>
[Answer the writer's question.]
</answer>
</article>
"""

In [11]:
xml_parser = XMLOutputParser()

In [12]:
test_chain = (
    prompt_system_format.partial(format_instructions=prompt_xml_format)
    | llm_model
    | xml_parser
)

output = test_chain.invoke(dict(question="What is the oldest recorded fossil?"))

In [13]:
output

{'article': [{'title': '\nUncovering the Oldest Record of Life on Earth\n'},
  {'problem': '\nWhat is the oldest recorded fossil, and how does it reveal insights into the history of life on our planet?\n\n'},
  {'answer': "\nThe oldest recorded fossil is a staggering 3.46 billion-year-old rock formation discovered in Western Australia. This ancient relic, known as Apex Chert, contains the fossilized remains of microorganisms that lived during the Eoarchean era of the Precambrian period.\n\nThe discovery was made by Dr. Scott McGhee and his team from the University of California, Berkeley, using advanced scanning techniques to reveal the hidden fossils within the rock. The findings have provided significant insights into the origins of life on Earth, suggesting that life may have emerged as early as 3.5 billion years ago.\n\nThis remarkable discovery highlights the significance of fossils in understanding our planet's history and the evolution of life. It also underscores the importance

### LangChain Output Parsers


In [14]:
xml_parser = XMLOutputParser(
    name="article", tags=["article", "title", "problem", "answer"]
)

In [15]:
test_chain = (
    prompt_system_format.partial(
        format_instructions=xml_parser.get_format_instructions()
    )
    | llm_model
    | xml_parser
)

output = test_chain.invoke(dict(question="What is the oldest recorded fossil?"))

In [16]:
output

{'article': [{'title': 'The Ancient Remains of Life'},
  {'problem': 'What is the oldest recorded fossil?'},
  {'answer': "\n        The oldest recorded fossil is a subject of great interest in the scientific community. According to the fossil record, the earliest fossils date back to around 3.5 billion years ago, during the Eoarchean era of the Precambrian period.\n\n        One of the most significant and well-preserved fossils is that of stromatolites, found in Western Australia's Apex Chert formation. These ancient structures are believed to be over 3.46 billion years old, making them one of the oldest known examples of life on Earth.\n\n        The discovery of these ancient remains not only sheds light on the origins of life but also provides a glimpse into the earliest ecosystems that existed on our planet. The study of fossils like stromatolites continues to captivate scientists and inspire new discoveries about the evolution of life on Earth."}]}
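
The parser returns the children of `<article>` as a list of single-key dicts, including whatever whitespace the model emitted. A small helper can merge them into one flat dict. This is only a sketch: `flatten_article` is a hypothetical name, and it assumes each tag appears once and holds plain text.

```python
def flatten_article(parsed: dict, root: str = "article") -> dict:
    """Merge the list of single-key dicts under `root` into one flat dict."""
    flat = {}
    for child in parsed[root]:
        for tag, text in child.items():
            # Strip the newlines and indentation the model wraps around the text.
            flat[tag] = text.strip() if isinstance(text, str) else text
    return flat


# Hand-written sample in the same shape as the parser output above.
example = {
    "article": [
        {"title": "\nThe Ancient Remains of Life\n"},
        {"problem": "What is the oldest recorded fossil?"},
    ]
}
print(flatten_article(example))
```

If a tag can repeat, collecting values into lists instead of overwriting would be the safer choice.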

In [17]:
print(xml_parser.get_format_instructions())

The output should be formatted as a XML file.
1. Output should conform to the tags below.
2. If tags are not given, make them on your own.
3. Remember to always open and close all the tags.

As an example, for the tags ["foo", "bar", "baz"]:
1. String "<foo>
   <bar>
      <baz></baz>
   </bar>
</foo>" is a well-formatted instance of the schema.
2. String "<foo>
   <bar>
   </foo>" is a badly-formatted instance.
3. String "<foo>
   <tag>
   </tag>
</foo>" is a badly-formatted instance.

Here are the output tags:
```
['article', 'title', 'problem', 'answer']
```
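
These instructions ask the model to stay within the listed tags, but nothing enforces it. A minimal check with the standard library, assuming the raw model response is available as a string (`unexpected_tags` is a hypothetical helper, not part of LangChain):

```python
import xml.etree.ElementTree as ET


def unexpected_tags(xml_text: str, allowed: list[str]) -> set[str]:
    """Return the tags used in `xml_text` that are not in `allowed`.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    return {element.tag for element in root.iter()} - set(allowed)


tags = ["article", "title", "problem", "answer"]
good = "<article><title>T</title><problem>P</problem><answer>A</answer></article>"
bad = "<article><title>T</title><summary>S</summary></article>"

print(unexpected_tags(good, tags))
print(unexpected_tags(bad, tags))
```

An empty set means every tag in the response was one of the requested ones.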


### Pydantic XML would be nice!


In [44]:
from pydantic_xml import BaseXmlModel, element


class ArticleXML(BaseXmlModel, tag="article"):
    title: str = element(default="[Title of the article]")
    problem: str = element(
        default="[Summary of the writer's question, write concisely]"
    )
    answer: str = element(default="[Answer the writer's question]")

In [45]:
print(ArticleXML().to_xml(pretty_print=True).decode())

<article>
  <title>[Title of the article]</title>
  <problem>[Summary of the writer's question, write concisely]</problem>
  <answer>[Answer the writer's question]</answer>
</article>



In [46]:
xml_schema = ArticleXML().to_xml(pretty_print=True).decode()

prompt_xml_format_pyxml = f"""
Your answers must strictly adhere to the following XML format:

{xml_schema}
"""

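Numeric fields can end up with a non-numeric default. Pydantic does not validate default values unless `validate_default=True` is set, so the placeholder string is stored as-is even though the field is annotated as `int`: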


In [47]:
class ArticleWithNumberXML(BaseXmlModel, tag="article"):
    title: str = element(default="[Title of the article]")
    number: int = element(
        default="[An arbitrary number that is most relevant to the question.]"
    )


a = ArticleWithNumberXML()
a.number

'[An arbitrary number that is most relevant to the question.]'
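
The same behaviour can be reproduced with plain Pydantic, which suggests it comes from Pydantic's handling of defaults rather than from the XML layer. The sketch below assumes Pydantic v2; whether `pydantic_xml.element` forwards `validate_default` is not verified here, and the class names are made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class Unchecked(BaseModel):
    # Defaults are not validated unless asked, so the string is accepted.
    number: int = Field(default="[placeholder]")


class Checked(BaseModel):
    # validate_default=True makes Pydantic validate the default on instantiation.
    number: int = Field(default="[placeholder]", validate_default=True)


print(repr(Unchecked().number))

try:
    Checked()
    default_rejected = False
except ValidationError:
    default_rejected = True
print(default_rejected)
```

For prompt placeholders, keeping the hint in `description` rather than in `default` avoids the type mismatch altogether.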

Hierarchical schemas need careful consideration


In [48]:
from pydantic_xml import BaseXmlModel, element
from typing import Optional


# Nested types
class HistoricalEventXML(BaseXmlModel):
    """The year and explanation of a historical event."""

    year: Optional[int] = element(default="[The year of the historical event]")
    description: str = element(
        default="[A clear description of what happened in this event]"
    )


class ArticleWithHistory(BaseXmlModel, tag="article"):
    title: str = element(default="[Title of the article]")
    problem: str = element(
        default="[Summary of the writer's question, write concisely]"
    )
    historical_event: HistoricalEventXML = element(default=HistoricalEventXML())

In [49]:
print(ArticleWithHistory().to_xml(pretty_print=True).decode())

<article>
  <title>[Title of the article]</title>
  <problem>[Summary of the writer's question, write concisely]</problem>
  <historical_event>
    <year>[The year of the historical event]</year>
    <description>[A clear description of what happened in this event]</description>
  </historical_event>
</article>



### Pydantic XML with a prompt instructions helper function


In [50]:
from pydantic_xml import BaseXmlModel, element
from experiment_xml import pydantic_to_xml_instructions


# Nested types
class HistoricalEventXML(BaseXmlModel):
    year: str = element(description="The year of the historical event")
    summary: str = element(
        description="A clear description of what happened in this event"
    )


class ArticleWithHistoryXML(BaseXmlModel, tag="article"):
    title: str = element(description="Title of the article")
    problem: str = element(
        description="Summary of the writer's question, write concisely"
    )
    historical_event: HistoricalEventXML = element(
        description="The year and explanation of a historical event"
    )

Use the helper function to create the format instructions from the model:


In [51]:
print(pydantic_to_xml_instructions(ArticleWithHistoryXML))

You must respond only in XML using the following schema:
<article>
  <!---->
  <title>
    {Title of the article}
  </title>
  <problem>
    {Summary of the writer's question, write concisely}
  </problem>
<historical_event>
  <!---->
  <year>
    {The year of the historical event}
  </year>
  <summary>
    {A clear description of what happened in this event}
  </summary>
</historical_event>
</article>
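
The source of `pydantic_to_xml_instructions` is not shown in this notebook. As a rough idea of how a similar helper could work, here is a sketch built on plain Pydantic v2 field metadata. It is not the author's implementation: `xml_instructions`, `Event`, and `Article` are hypothetical names, and the output layout differs from the one printed above.

```python
from pydantic import BaseModel, Field


def xml_instructions(model: type[BaseModel], tag: str, indent: int = 0) -> str:
    """Render a model as an XML template with {description} placeholders."""
    pad = "  " * indent
    lines = [f"{pad}<{tag}>"]
    for name, field in model.model_fields.items():
        annotation = field.annotation
        if isinstance(annotation, type) and issubclass(annotation, BaseModel):
            # Nested model: recurse and use the field name as the tag.
            lines.append(xml_instructions(annotation, name, indent + 1))
        else:
            # Leaf field: show its description as the placeholder text.
            lines.append(f"{pad}  <{name}>{{{field.description}}}</{name}>")
    lines.append(f"{pad}</{tag}>")
    return "\n".join(lines)


class Event(BaseModel):
    year: str = Field(description="The year of the historical event")


class Article(BaseModel):
    title: str = Field(description="Title of the article")
    historical_event: Event = Field(description="A historical event")


schema_text = xml_instructions(Article, tag="article")
print(schema_text)
```

Putting the hint in `description` keeps `default` free, so type checking of real values is unaffected.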


In [52]:
prompt_format = ChatPromptTemplate.from_messages(
    [
        (
            "human",
            "Write an article in the provided format that answers the user question."
            "\n{format_instructions}",
        ),
        ("human", test_science_prompt_txt),
    ]
)

In [53]:
test_chain = (
    prompt_format.partial(
        format_instructions=pydantic_to_xml_instructions(ArticleWithHistoryXML)
    )
    | llm_model
)

output = test_chain.invoke(dict(question="What is the oldest recorded fossil?"))

In [54]:
ArticleWithHistoryXML.from_xml(output.content)

ArticleWithHistoryXML(title='The Oldest Recorded Fossil: A Glimpse into the Past', problem='\n    What is the oldest recorded fossil, and what does it tell us about life on Earth?\n  ', historical_event=HistoricalEventXML(year='3.5 billion years ago', summary="\n      The oldest recorded fossil is stromatolites found in Western Australia's Apex Chert formation. These ancient structures were created by microorganisms that lived in shallow waters and worked together to produce layers of sediment. The fossils date back an astonishing 3.5 billion years, providing a glimpse into the earliest life on Earth.\n    "))

But we can't use the `PydanticOutputParser` with a Pydantic XML object as it assumes JSON output.


In [57]:
pydantic_parser = PydanticOutputParser(pydantic_object=ArticleXML)

In [60]:
test_chain = (
    prompt_format.partial(format_instructions=pydantic_to_xml_instructions(ArticleXML))
    | llm_model
    | pydantic_parser
)

output = test_chain.invoke(dict(question="What is the oldest recorded fossil?"))

OutputParserException: Invalid json output: <article>
  <title>Uncovering the Ancient Past: The Oldest Recorded Fossil</title>
  <problem>What is the oldest recorded fossil?</problem>
  <answer>
    <p>The oldest recorded fossil dates back an astonishing 3.5 billion years, to a time known as the Eoarchean era of the Precambrian period. This ancient relic was discovered in Western Australia and consists of a 1.8 billion-year-old stromatolite, a layered structure created by microbial life.</p>
    <p>The fossil was found in the Apex Chert, a rock formation that was once part of a shallow sea. The stromatolite features a unique pattern of layered minerals, which scientists believe were produced by the metabolic activities of ancient microorganisms.</p>
    <p>Discoveries like this one have rewritten our understanding of life on Earth's earliest days and have provided valuable insights into the origins of life itself. As we continue to explore and learn from the ancient past, we are reminded of the incredible journey that has brought us to where we are today.</p>
  </answer>
</article>
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
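
One way around this is to skip `PydanticOutputParser` and parse the XML directly. The sketch below uses only the standard library and a hypothetical function name, `parse_article`. It assumes the response contains a single `<article>` element, and it flattens nested markup such as the `<p>` tags in the output above into plain text.

```python
import re
import xml.etree.ElementTree as ET


def parse_article(message) -> dict:
    """Extract the <article> element from model output and return its fields."""
    # Accept a raw string or a chat message object with a `.content` attribute.
    text = message if isinstance(message, str) else message.content
    match = re.search(r"<article>.*</article>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <article> element found in model output")
    root = ET.fromstring(match.group(0))
    # Join all nested text per child and collapse runs of whitespace.
    return {
        child.tag: " ".join("".join(child.itertext()).split()) for child in root
    }


sample = (
    "Here you go:\n<article>\n  <title>Old Fossils</title>\n"
    "  <answer>\n    <p>First.</p>\n    <p>Second.</p>\n  </answer>\n</article>"
)
print(parse_article(sample))
```

Because LangChain coerces plain callables into runnables, a function like this should be usable as the last step of the chain in place of `pydantic_parser`. Calling `ArticleXML.from_xml(output.content)` after the model step, as done earlier for `ArticleWithHistoryXML`, is the other option.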

Can we get an LLM to describe the format in a better way?


In [41]:
prompt_explain_format = ChatPromptTemplate.from_messages(
    [
        (
            "human",
            "Describe clearly in words the following schema. Do not output anything in this format. Give step-by-step instructions on how to structure data in this XML format."
            "\n{format_instructions}",
        )
    ]
)

In [42]:
chain = prompt_explain_format | llm_model

output = chain.invoke(
    dict(format_instructions=pydantic_to_xml_instructions(ArticleWithHistoryXML))
)

In [43]:
print(output.content)

To structure data in the provided XML format, follow these step-by-step instructions:

1. Start with an opening tag `<article>`. This will serve as the root element for your XML data.
2. Within the `<article>` element, create a title by adding an opening and closing tag `<title>`.
3. Add the title content inside the `<title>` tags using curly brackets `{}`. For example: `{Title of the article}`
4. Immediately below the `<title>` element, add another opening and closing tag `<problem>`. This will represent the writer's question or problem.
5. Within the `<problem>` element, add a brief summary (concisely) inside the curly brackets `{}`. For instance: `{Summary of the writer's question, write concisely}`
6. To introduce a new section, create an opening tag `<historical_event>`. This will represent a historical event related to the article.
7. Below the `<historical_event>` opening tag, add another opening and closing tag `<year>`.
8. Inside the `<year>` tags, include the year of the hist