## Using LangChain to get structured outputs


In [1]:
 %xmode minimal

Exception reporting mode: Minimal


In [2]:
from langchain.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic
from langchain_ollama import ChatOllama
from langchain_fireworks import ChatFireworks

from langchain.output_parsers import XMLOutputParser, PydanticOutputParser

from typing import Optional
from pydantic import BaseModel, Field
from typing_extensions import Annotated, TypedDict

In [4]:
ANTHROPIC_API_KEY = "<API KEY>"
FIREWORKS_API_KEY = "<API KEY>"

In [3]:
import streamlit as st

LANGSMITH_API_KEY = st.secrets["api_keys"]["LANGSMITH_API_KEY"]
ANTHROPIC_API_KEY = st.secrets["api_keys"]["ANTHROPIC_API_KEY"]
FIREWORKS_API_KEY = st.secrets["api_keys"]["FIREWORKS_API_KEY"]

Let's start by creating an LLM to run our structured output queries. A temperature of 0 tends to make structured output generation more reliable (at the cost of "creativity"); the parameters below use 0.8, so expect more variation between runs.


In [4]:
# LLM parameters
temperature = 0.8
timeout = 30
num_ctx = 8192
num_predict = 4096

### Problem setup and prompt


In [5]:
test_science_prompt_txt = """
You are a professional science writer tasked with responding to members of
the general public who write in asking questions about science.
Write an article responding to a writer's question for publication in a
science magazine intended for a general readership with a high-school education.
You should write clearly and compellingly, include all relevant context,
and provide motivating stories where applicable.

Your response must be less than 200 words.

The question given to you is the following:
{question}
"""

questions = [
    "What is the oldest recorded fossil?",
    "What is a black hole?",
    "How far away is the sun?",
    "Which other planet in the Solar System has a surface gravity closest to that of the Earth?",
    "Eris, Haumea, Makemake and Ceres are all examples of what?",
    "Why does earth have seasons? Do other planets exhibit seasons too?",
    "What causes the aurora borealis?",
    "Why is the sky blue?",
    "How do bees communicate?",
    "What is the smallest unit of life?",
    "How do plants make their own food?",
    "Why do we dream?",
    "What is the theory of relativity?",
    "How do volcanoes erupt?",
    "What is the speed of light?",
    "How do magnets work?",
    "What is the purpose of DNA?",
    "What are the different types of galaxies?",
    "Why do some animals hibernate?",
    "How do vaccines work?",
]

In [6]:
prompt_direct = ChatPromptTemplate.from_template(test_science_prompt_txt)

prompt_system_format = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query.\n{format_instructions}",
        ),
        ("human", test_science_prompt_txt),
    ]
)

prompt_user_format = ChatPromptTemplate.from_template(
    test_science_prompt_txt + "\n{format_instructions}"
)

#### XML schema from Pydantic models


In [None]:
# Simple types
class ArticleResponse1(BaseModel):
    """Structured article for publication answering a reader's question."""

    title: str = Field(description="Title of the article")
    answer: str = Field(
        description="Provide a detailed description of historical events to answer the question."
    )
    number: int = Field(
        description="An arbitrary number that is most relevant to the question."
    )


# Lists of simple types
class ArticleResponse2(BaseModel):
    """Structured article for publication answering a reader's question."""

    title: str = Field(description="Title of the article")
    further_questions: list[str] = Field(
        description="A list of related questions that may be of interest to the readers."
    )


# Nested types
class HistoricalEvent(BaseModel):
    """The year and explanation of a historical event."""

    year: int = Field(description="The year of the historical event")
    description: str = Field(
        description="A clear description of what happened in this event"
    )


class ArticleResponse3(BaseModel):
    """Structured article for publication answering a reader's question."""

    title: str = Field(description="Title of the article")
    historical_event_1: HistoricalEvent = Field(
        description="Provide a detailed description of a first historical event to answer the question."
    )
    historical_event_2: HistoricalEvent = Field(
        description="Provide a detailed description of a second historical event to answer the question."
    )


# Lists of custom types
class ArticleResponse4(BaseModel):
    """Structured article for publication answering a reader's question."""

    title: str = Field(description="Title of the article")
    historical_timeline: list[HistoricalEvent] = Field(
        description="Provide a compelling account of the historical context of the question"
    )


# Nested types
class CriticalAnalysis(BaseModel):
    """A critique of interpretations of historical events"""

    historical_event: HistoricalEvent = Field(
        description="Provide an overview of the facts of a historical event"
    )
    common_understanding: str = Field(description="Agreed interpretation of event")
    analysis: str = Field(
        description="Critical analysis of the event and opposing interpretations"
    )


# Multiple nested custom types
class ArticleResponse5(BaseModel):
    """Structured article for publication answering a reader's question."""

    title: str = Field(description="Title of the article")
    historical_timeline: list[HistoricalEvent] = Field(
        description="Provide a compelling account of the historical context of the question"
    )
    critique: list[CriticalAnalysis] = Field(
        description="A list of key historical events and an analysis of them"
    )


structured_formats_pydantic = [
    dict(pydantic=schema)
    for schema in [
        ArticleResponse1,
        ArticleResponse2,
        ArticleResponse3,
        ArticleResponse4,
    ]
]

In [None]:
def pydantic_to_xml_schema(model):
    # Get the JSON schema representation of the model
    schema = model.model_json_schema()

    # Start with root element named after the model
    xml = f"<{schema['title']}>\n"

    # Process each property
    for field_name, field_info in schema["properties"].items():
        field_type = field_info.get("type", "any")
        description = field_info.get("description", "")

        if field_type == "array":
            xml += f"  <{field_name}>\n"
            xml += f"    [Type: {field_type}]\n"
            xml += f"    [First instance of {description}]\n"
            xml += f"  </{field_name}>\n"
            xml += f"  <{field_name}>\n"
            xml += f"    [Type: {field_type}]\n"
            xml += f"    [Next instance of {description}]\n"
            xml += f"  </{field_name}>\n"
            xml += f"  <{field_name}>\n"
            xml += f"    [ ... ]\n"
            xml += f"  </{field_name}>\n"

        else:
            # Add field as XML element with type comment and description
            xml += f"  <{field_name}>\n"
            xml += f"    Type: {field_type}\n"
            xml += f"    {description}\n"
            xml += f"  </{field_name}>\n"

    xml += f"</{schema['title']}>"

    return xml


print(pydantic_to_xml_schema(ArticleResponse2))

<ArticleResponse2>
  <title>
    Type: string
    Title of the article
  </title>
  <further_questions>
    [Type: array]
    [First instance of A list of related questions that may be of interest to the readers.]
  </further_questions>
  <further_questions>
    [Type: array]
    [Next instance of A list of related questions that may be of interest to the readers.]
  </further_questions>
  <further_questions>
    [ ... ]
  </further_questions>
</ArticleResponse2>
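
The schema above represents a list by repeating the same tag. As a rough sketch of how a reply in that shape could be read back into a dictionary, the snippet below uses only the standard library. The sample reply and the helper name `parse_article_response2` are hypothetical and not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical model reply following the ArticleResponse2 schema above
sample_reply = """
<ArticleResponse2>
  <title>Why Is the Sky Blue?</title>
  <further_questions>Why are sunsets red?</further_questions>
  <further_questions>Is the sky blue on Mars?</further_questions>
</ArticleResponse2>
"""


def parse_article_response2(xml_text):
    """Collect the single <title> and every repeated <further_questions> tag."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "ArticleResponse2":
        raise ValueError(f"Unexpected root element: {root.tag}")
    title = root.findtext("title")
    if title is None:
        raise ValueError("Missing <title> element")
    # findall returns the direct children with this tag, in document order
    questions = [
        (el.text or "").strip() for el in root.findall("further_questions")
    ]
    return {"title": title.strip(), "further_questions": questions}


parsed = parse_article_response2(sample_reply)
print(parsed)
```

The resulting dictionary uses the same field names as `ArticleResponse2`, so it could in principle be passed to `ArticleResponse2(**parsed)` for validation.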


#### Schema using Pydantic XML


In [42]:
from pydantic_xml import BaseXmlModel, element, attr


class ArticleResponse1XML(BaseXmlModel, tag="article"):
    """Structured article for publication answering a reader's question."""

    title: str = element(description="Title of the article")
    answer: str = element(
        description="Provide a detailed description of historical events to answer the question"
    )
    number: int = element(description="A number that is most relevant to the question.")


# Lists of simple types
class ArticleResponse2XML(BaseXmlModel, tag="article"):
    """Structured article for publication answering a reader's question."""

    title: str = element(description="Title of the article")
    answer: str = element(description="Answer the writer's question")
    further_questions: list[str] = element(
        tag="further_question",
        description="A list of related questions that may be of interest to the readers.",
    )


# Nested types
class HistoricalEventXML(BaseXmlModel):
    """The year and explanation of a historical event."""

    year: str = element(description="The year of the historical event")
    event: str = element(
        description="A clear and concise explanation of what happened in this event"
    )


class ArticleResponse3XML(BaseXmlModel, title="article"):
    """Structured article for publication answering a reader's question."""

    title: str = element(description="[Title of the article]")
    historical_event_1: HistoricalEventXML = element(
        description="A first historical event relevant to the question"
    )
    historical_event_2: HistoricalEventXML = element(
        description="A second historical event relevant to the question"
    )


# Lists of custom types
class ArticleResponse4XML(BaseXmlModel):
    """Structured article for publication answering a reader's question."""

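    title: str = element(description="Title of the article")
    # Each timeline entry is assumed to be emitted as a <historical_event> element
    historical_timeline: list[HistoricalEventXML] = element(
        tag="historical_event",
        description="A list of historical events relevant to the question",
    )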


structured_formats_xml = [
    dict(pydantic=schema)
    for schema in [
        ArticleResponse1XML,
        ArticleResponse2XML,
        ArticleResponse3XML,
        ArticleResponse4XML,
    ]
]

In [54]:
def pydantic_to_xml_schema(model, root_name=None, root_description=None):
    # Get the JSON schema representation of the model
    if hasattr(model, "model_json_schema"):
        schema = model.model_json_schema()
    else:
        schema = model

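    # Nested model definitions; empty when the schema has none
    defs = schema.get("$defs", {})

    # Start with root element named after the model

    _root_name = root_name if root_name else schema["title"]
    # Fall back to an empty comment when the model has no docstring
    _root_desc = root_description if root_description else schema.get("description", "")
    xml = f"<{_root_name}>\n  <!--{_root_desc}-->\n"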

    # Process each property
    for field_name, field_info in schema["properties"].items():
        field_type = field_info.get("type", "any")
        description = field_info.get("description", "")
        ref = field_info.get("$ref")

        if ref:
            # Look up ref
            ref_name = ref.split("/")[-1]
            ref_class = defs.get(ref_name)
            if ref_class:
                xml += pydantic_to_xml_schema(ref_class, root_name=field_name)

        elif field_type == "array":
            xml += f"  <{field_name}>\n"
            xml += f"    [Type: {field_type}]\n"
            xml += f"    [First instance of '{description}']\n"
            xml += f"  </{field_name}>\n"
            xml += f"  <{field_name}>\n"
            xml += f"    [Type: {field_type}]\n"
            xml += f"    [Next instance of '{description}']\n"
            xml += f"  </{field_name}>\n"
            xml += f"  <{field_name}>\n"
            xml += f"    [ ... ]\n"
            xml += f"  </{field_name}>\n"

        else:
            # Add field as XML element with type comment and description
            xml += f"  <{field_name}>\n"
            xml += f"    Type: {field_type}\n"
            xml += f"    {description}\n"
            xml += f"  </{field_name}>\n"

    xml += f"</{_root_name}>"

    return xml


print(pydantic_to_xml_schema(ArticleResponse3XML))

<article>
  <!--Structured article for publication answering a reader's question.-->
  <title>
    Type: string
    [Title of the article]
  </title>
<historical_event_1>
  <!--The year and explanation of a historical event.-->
  <year>
    Type: string
    The year of the historical event
  </year>
  <event>
    Type: string
    A clear and concise explanation of what happened in this event
  </event>
</historical_event_1><historical_event_2>
  <!--The year and explanation of a historical event.-->
  <year>
    Type: string
    The year of the historical event
  </year>
  <event>
    Type: string
    A clear and concise explanation of what happened in this event
  </event>
</historical_event_2></article>
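
In the output above, the nested elements are not indented and the closing tag of one block runs straight into the opening tag of the next. One possible remedy, sketched here with the standard library only, is to indent each nested block with `textwrap.indent` and end it with a newline before appending it to the parent. The helper name `nest_block` and the sample strings are hypothetical.

```python
import textwrap


def nest_block(parent_xml, child_xml, indent="  "):
    """Append child_xml to parent_xml, indented one level and newline-terminated."""
    return parent_xml + textwrap.indent(child_xml, indent) + "\n"


# Stand-ins for the strings built inside pydantic_to_xml_schema
parent = "<article>\n"
child = (
    "<historical_event_1>\n"
    "  <year>\n"
    "    Type: string\n"
    "  </year>\n"
    "</historical_event_1>"
)

result = nest_block(parent, child) + "</article>"
print(result)
```

In `pydantic_to_xml_schema`, the same idea would apply at the point where the result of the recursive call is added to `xml`.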


### Evaluation


In [None]:
import pickle
from time import sleep


def run_experiment(
    prompt_format,
    questions,
    llm_models,
    method,
    n_iter=1,
    resume=0,
    results_out=None,
    save_file_name=None,
):

    if results_out is None:
        structure_support_by_model = {}
    else:
        structure_support_by_model = results_out
    n_questions = len(questions)

    position = 0

    # Iterate over models
    for model_name, llm_model in llm_models.items():
        structure_support_by_model[model_name] = {}

        # Iterate over schemas (assumes the Pydantic schema list defined above)
        for structure in structured_formats_pydantic:
            pydantic_obj = structure["pydantic"]
            print(
                f"Model: {model_name}  Output: {pydantic_obj.__name__}   Pos: {position}"
            )

            position += 1
            if position < resume:
                continue

            # Format instructions if required
            parser = PydanticOutputParser(pydantic_object=pydantic_obj)
            prompt = prompt_format.partial(
                format_instructions=parser.get_format_instructions()
            )

            # Iterate over questions
            error_types = []
            error_messages = []
            outputs = []
            output_valid = 0
            for _ in range(n_iter):
                for ii in range(n_questions):
                    try:
                        test_chain = prompt | llm_model.with_structured_output(
                            pydantic_obj, method=method, include_raw=True
                        )
                        output = test_chain.invoke(dict(question=questions[ii]))
                        outputs.append(output)

                        # Typically Pydantic validation failure
                        if output["parsing_error"] is not None:
                            error_types.append("parse_error")
                            error_messages.append(output["parsing_error"])
                            print("Error: Parse error")

                        # Typically function-calling failure
                        elif output["parsed"] is None:
                            error_types.append("no_output")
                            print("Error: No output")

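                        # This is not expected to happen; record it here rather than
                        # raising, which the except block would log a second time
                        elif not isinstance(output["parsed"], pydantic_obj):
                            error_types.append("unexpected_error")
                            error_messages.append("Unexpected output type")
                            print("Error: Unexpected output type")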

                        else:
                            error_types.append("ok")
                            output_valid += 1

                    # Other failures (typically function-calling not supported)
                    except Exception as e:
                        error_types.append("other_error")
                        print(f"Error: Other error {type(e).__name__}")
                        error_messages.append(f"{type(e).__name__}, {e}")

                    # Pause to avoid timeouts
                    print(".", end="")
                    sleep(1)
                print()

            structure_support_by_model[model_name][pydantic_obj.__name__] = dict(
                valid=output_valid / (n_iter * n_questions),
                error_types=error_types,
                errors=error_messages,
                outputs=outputs,
            )
    if save_file_name:
        with open(file=save_file_name, mode="wb") as f:
            pickle.dump(
                dict(
                    method=method,
                    prompt=prompt,
                    questions=questions,
                    structure_support_by_model=structure_support_by_model,
                ),
                f,
            )
    return structure_support_by_model
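
The dictionary returned by `run_experiment` stores a `valid` fraction and a list of `error_types` for each model and schema. A minimal sketch of how these might be tabulated follows. The results shown are made up for illustration (the model name `model-a` and all values are not measurements), and `summarize` is a hypothetical helper.

```python
from collections import Counter

# Illustrative stand-in for the dict returned by run_experiment (values are made up)
fake_results = {
    "model-a": {
        "ArticleResponse1": {
            "valid": 0.75,
            "error_types": ["ok", "ok", "parse_error", "ok"],
        },
        "ArticleResponse2": {
            "valid": 0.5,
            "error_types": ["ok", "no_output", "other_error", "ok"],
        },
    },
}


def summarize(results):
    """Return one row per (model, schema) with the valid rate and failure counts."""
    rows = []
    for model_name, by_schema in results.items():
        for schema_name, record in by_schema.items():
            counts = Counter(record["error_types"])
            counts.pop("ok", None)  # keep only the failures
            rows.append((model_name, schema_name, record["valid"], dict(counts)))
    return rows


summary = summarize(fake_results)
for row in summary:
    print(row)
```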

In [11]:
xml_parser = XMLOutputParser()

In [12]:
test_chain = (
    prompt_system_format.partial(format_instructions=prompt_xml_format)
    | llm_model
    | xml_parser
)

output = test_chain.invoke(dict(question=questions[0]))

In [13]:
output

{'article': [{'title': "The Oldest Recorded Fossil: A Glimpse into Earth's Ancient Past"},
  {'problem': '\nA reader asks about the oldest known fossil, seeking to understand its significance and what it reveals about early life on Earth.\n'},
  {'answer': "\nThe oldest recorded fossil is Stromatolite fossils found in Australia, estimated to be around 3.5 billion years old. These layered structures are created by the growth of cyanobacteria, a type of photosynthetic microorganism. Discovered in the Pilbara region, these stromatolites provide crucial evidence about early life on Earth.\n\nCyanobacteria played a vital role in transforming our planet's atmosphere through oxygen production, paving the way for more complex life forms to evolve. The discovery of these ancient fossils not only offers insights into biological processes from billions of years ago but also underscores the resilience and adaptability of life. Their existence suggests that early Earth had conditions suitable for m
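
The parser output above is a root key holding a list of single-key dictionaries, which is awkward to index by field name. A small sketch of one way to flatten it is given below. The helper `flatten_xml_output` and the sample data are hypothetical, and the sketch assumes the children are plain text elements rather than nested structures.

```python
def flatten_xml_output(parsed):
    """Merge the list of single-key dicts under the root tag into one dict."""
    root_tag, children = next(iter(parsed.items()))
    collected = {}
    for child in children:
        for key, value in child.items():
            text = value.strip() if isinstance(value, str) else value
            collected.setdefault(key, []).append(text)
    # Tags that occurred once become scalars; repeated tags stay as lists
    flat = {k: v[0] if len(v) == 1 else v for k, v in collected.items()}
    return root_tag, flat


# Shortened stand-in with the same shape as the parser output above
sample = {
    "article": [
        {"title": "A Title"},
        {"problem": "\nA question.\n"},
        {"answer": "\nAn answer.\n"},
    ]
}

root_tag, fields = flatten_xml_output(sample)
print(root_tag, fields)
```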

### Pydantic XML would be nice!


In [17]:
from pydantic_xml import BaseXmlModel, element


class Article(BaseXmlModel, tag="article"):
    title: str = element(default="[Title of the article]")
    problem: str = element(
        default="[Summary of the writer's question, write concisely]"
    )
    answer: str = element(default="[Answer the writer's question]")

In [18]:
print(Article().to_xml(pretty_print=True).decode())

<article>
  <title>[Title of the article]</title>
  <problem>[Summary of the writer's question, write concisely]</problem>
  <answer>[Answer the writer's question]</answer>
</article>



In [19]:
xml_schema = Article().to_xml(pretty_print=True).decode()

prompt_xml_format_pyxml = f"""
Your answers must strictly adhere to the following XML format:

{xml_schema}
"""

But we can't use the `PydanticOutputParser` with a Pydantic XML object: the parser expects JSON, so the XML reply from the model fails to parse.


In [24]:
pydantic_parser = PydanticOutputParser(pydantic_object=Article)

In [21]:
test_chain = (
    prompt_system_format.partial(format_instructions=prompt_xml_format_pyxml)
    | llm_model
    | pydantic_parser
)

output = test_chain.invoke(dict(question=questions[0]))

OutputParserException: Invalid json output: <article>
  <title>Unearthing Time: The Oldest Known Fossil</title>
  <problem>The writer inquires about the identity of the oldest known fossil.</problem>
  <answer>The quest for Earth's earliest life forms leads us to remarkable fossils. The oldest recorded fossil is a stromatolite, dating back approximately 3.5 billion years. Stromatolites are layered structures formed by the growth of cyanobacteria, ancient microorganisms that played a pivotal role in oxygenating our planet.

Discovered in Western Australia's Pilbara region, these fossils offer a window into early life on Earth. They suggest that life began not long after the planet cooled enough for water to exist in liquid form. These formations are significant because they provide evidence of some of the first complex life forms, shaping the course of biological evolution.

The discovery of such ancient life highlights the resilience and adaptability of organisms, sparking curiosity about what other secrets lie beneath our feet, waiting to be uncovered by future explorations. Through stromatolites, we gain insight into a time when Earth was vastly different from today—a testament to life's enduring legacy across the eons.</answer>
</article>
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
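
Since the model did return well-formed XML, the reply can still be turned into a typed object by parsing it as XML instead of JSON. The sketch below does this with the standard library and a plain dataclass standing in for `Article`. The names `ParsedArticle` and `parse_article` are hypothetical, and the reply text is a shortened example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ParsedArticle:
    """Plain stand-in for the Article model (hypothetical, for illustration)."""

    title: str
    problem: str
    answer: str


def parse_article(xml_text):
    """Parse an <article> reply; raise ValueError if malformed or incomplete."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as exc:
        raise ValueError(f"Reply is not well-formed XML: {exc}") from exc
    if root.tag != "article":
        raise ValueError(f"Expected <article> root, got <{root.tag}>")
    fields = {}
    for name in ("title", "problem", "answer"):
        text = root.findtext(name)
        if text is None:
            raise ValueError(f"Missing <{name}> element")
        fields[name] = text.strip()
    return ParsedArticle(**fields)


reply = """<article>
  <title>Unearthing Time</title>
  <problem>What is the oldest known fossil?</problem>
  <answer>Shortened example answer.</answer>
</article>"""

article = parse_article(reply)
print(article.title)
```

pydantic-xml models also provide a `from_xml` class method, which may be a more direct route when the reply matches the model exactly.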