# Structured Information Extraction

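- This notebook shows how to use LangChain to extract structured information from unstructured text, and then demonstrates how few-shot examples can improve performance.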

## Defining the schema

In [1]:
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

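Two best practices apply when defining a schema:
- Document both the attributes and the schema itself. This information is sent to the LLM and improves extraction quality.
- Do not force the LLM to make up information. The definition above declares every attribute as `Optional`, so the LLM can output `None` when it does not know the answer.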
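
To make these two practices concrete, the sketch below prints the JSON schema that Pydantic derives from a trimmed-down copy of `Person`. It assumes Pydantic v2 (`model_json_schema`). How a particular provider integration forwards this schema to the model is not shown here.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )


schema = Person.model_json_schema()

# The class docstring becomes the schema-level description.
print(schema["description"])
# Each Field description is attached to its property.
print(schema["properties"]["name"]["description"])
# Every field has a default, so the schema lists no required fields
# and an "empty" extraction is still a valid Person.
print(schema.get("required", []))
print(Person())
```

Because nothing is required, a model that cannot find a value can leave it out without breaking validation.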

## The extractor

In [2]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

In [3]:
from based_on_openai_model import ChatOpenRouter

# Requires a model that supports function/tool calling
llm = ChatOpenRouter(model_name="meituan/longcat-flash-chat:free")

In [4]:
structured_llm = llm.with_structured_output(schema=Person)

Let's test it:

In [5]:
text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters=None)

In [6]:
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters='}')

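As shown above, neither of the two calls to the free longcat-flash-chat model on OpenRouter gave a satisfactory result:
- The first call did not produce the height in meters. It raised no error and returned `None` instead.
- The second call returned the character `}` as the height.

In both calls, however, the structure of the returned result was correct.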
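
A structurally valid result can still carry a wrong value, so a post-hoc sanity check on the extracted height may help. The sketch below is only an illustration: the helper names `feet_to_meters` and `parse_height_m` and the plausibility range of 0.3 to 3.0 meters are assumptions, not part of LangChain or of this notebook.

```python
import re
from typing import Optional

FEET_TO_METERS = 0.3048  # the international foot is defined as exactly 0.3048 m


def feet_to_meters(feet: float) -> float:
    """Convert feet to meters, rounded to four decimal places."""
    return round(feet * FEET_TO_METERS, 4)


def parse_height_m(raw: Optional[str]) -> Optional[float]:
    """Return the height as a float in meters, or None if missing or implausible.

    The 0.3-3.0 m range is an arbitrary plausibility window chosen for this sketch.
    """
    if raw is None:
        return None
    match = re.search(r"\d*\.?\d+", raw)  # first number in the string, if any
    if match is None:
        return None
    value = float(match.group())
    return value if 0.3 <= value <= 3.0 else None


# The height a correct extraction of "6 feet" should report.
print(feet_to_meters(6))
# A stray character such as '}' contains no number and is rejected.
print(parse_height_m("}"))
```

A range check like this only catches implausible values. A wrong but plausible number still passes, so it complements rather than replaces a better model or better examples.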

Next, a different provider is used with the Intern-S1 model. Neither of its two calls produced a satisfactory result either.

In [7]:
from based_on_openai_model import ChatINTERNLM

# Requires a model that supports function/tool calling
llm2 = ChatINTERNLM(model="intern-latest")
structured_llm2 = llm2.with_structured_output(schema=Person)
structured_llm2.invoke(prompt)

Person(name=None, hair_color='blond', height_in_meters=None)

In [8]:
structured_llm2.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters=None)

## Multiple entities

In [9]:
from typing import List

class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

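Testing twice more with the free longcat-flash-chat model on OpenRouter, the results are still not ideal: the extracted height for Jeff is wrong both times (6 feet is about 1.83 m).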

In [10]:
structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='.6 meters'), Person(name='Anna', hair_color='black', height_in_meters=None)])

In [11]:
structured_llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='.0183'), Person(name='Anna', hair_color='black', height_in_meters=None)])

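Continuing with the Intern-S1 model, the call failed: the model's response was missing the required `people` field, so instantiating the `Data` object raised an error.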

In [13]:
structured_llm2 = llm2.with_structured_output(schema=Data)
structured_llm2.invoke(prompt)

ValidationError: 1 validation error for Data
people
  Field required [type=missing, input_value={'name': 'Jeff', 'hair_co...ck', 'height': '6 feet'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
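
This kind of failure can be reproduced without calling a model. The sketch below validates a hypothetical payload of the same shape (a single person object instead of a `{"people": [...]}` wrapper) and then shows one possible fallback. It assumes Pydantic v2. The payload values and the `coerce_to_data` helper are illustrative assumptions, not the model's actual response or LangChain behavior.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


class Data(BaseModel):
    people: List[Person]


# Hypothetical payload: a bare person object, with no "people" key.
flat_payload = {"name": "Jeff", "hair_color": "black", "height": "6 feet"}

try:
    Data.model_validate(flat_payload)
    errors = []
except ValidationError as exc:
    errors = exc.errors()

for err in errors:
    # Each entry names the failing field and the error type.
    print(err["loc"], err["type"])


def coerce_to_data(payload: dict) -> Data:
    """Wrap a bare person dict into the expected {"people": [...]} shape."""
    if "people" not in payload:
        payload = {"people": [payload]}
    return Data.model_validate(payload)


recovered = coerce_to_data(flat_payload)
print(recovered)
```

Note that the unknown `height` key is ignored by Pydantic's default settings, so the recovered person has no `height_in_meters` value. The fallback repairs the structure, not the content.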

## Reference examples

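Structured output often uses tool calling under the hood. This typically involves generating AI messages that contain tool calls, as well as tool messages that contain the results of those calls. What should the message sequence look like in this case?

Different chat model providers impose different requirements on valid message sequences. Some accept a (repeating) message sequence of the form:
- User message
- AI message with a tool call
- Tool message with the result

LangChain provides a utility function, `tool_example_to_messages`, that generates a valid sequence for most model providers. It simplifies the creation of structured few-shot examples by requiring only Pydantic representations of the corresponding tool calls.

Pairs of an input string and the desired Pydantic object can be converted into a sequence of messages that can be passed to a chat model. Under the hood, LangChain formats the tool calls into each provider's required format.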

In [14]:
from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))



Printing `messages` shows eight messages in total. Each reference example in `examples` corresponds to four messages: Human Message --> AI Message --> Tool Message --> AI Message.

In [17]:
for message in messages:
    message.pretty_print()


The ocean is vast and blue. It's more than 20,000 feet deep.
Tool Calls:
  Data (b97ca998-ae7d-444f-ac0d-4940e01d4508)
 Call ID: b97ca998-ae7d-444f-ac0d-4940e01d4508
  Args:
    people: []

You have correctly called this tool.

Detected no people.

Fiona traveled far from France to Spain.
Tool Calls:
  Data (48fd755d-9b04-4669-8685-e0f2f24f4aa5)
 Call ID: 48fd755d-9b04-4669-8685-e0f2f24f4aa5
  Args:
    people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]

You have correctly called this tool.

Detected people.
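
The four-message pattern can also be sketched with plain dictionaries, which makes the link between the tool call and its tool message explicit. This is a simplified illustration using an OpenAI-style chat message layout as an assumption. The `example_to_messages` helper is hypothetical and is not how `tool_example_to_messages` is implemented.

```python
import json
import uuid
from typing import Any, Dict, List


def example_to_messages(
    text: str, tool_name: str, args: Dict[str, Any], ai_response: str
) -> List[Dict[str, Any]]:
    """Build one few-shot example as user -> assistant(tool call) -> tool -> assistant."""
    call_id = str(uuid.uuid4())  # the tool message must refer back to this id
    return [
        {"role": "user", "content": text},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {
                    "id": call_id,
                    "type": "function",
                    "function": {"name": tool_name, "arguments": json.dumps(args)},
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": call_id,
            "content": "You have correctly called this tool.",
        },
        {"role": "assistant", "content": ai_response},
    ]


sketch_messages: List[Dict[str, Any]] = []
sketch_messages += example_to_messages(
    "The ocean is vast and blue. It's more than 20,000 feet deep.",
    "Data",
    {"people": []},
    "Detected no people.",
)
sketch_messages += example_to_messages(
    "Fiona traveled far from France to Spain.",
    "Data",
    {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
    "Detected people.",
)

for msg in sketch_messages:
    print(msg["role"])
```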


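Compare the performance with and without these messages:
- Many free models on OpenRouter were tested, and all of them failed because the data they returned did not match the structure defined by `Data`.
- In the end, the paid gemini-2.5-flash model was used. It parsed the result correctly both with and without the few-shot messages.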

In [28]:
message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

In [42]:
llm3 = ChatOpenRouter(model_name="google/gemini-2.5-flash")

In [43]:

structured_llm3 = llm3.with_structured_output(schema=Data)
structured_llm3.invoke([message_no_extraction])

Data(people=[])

In [44]:
structured_llm3.invoke(messages + [message_no_extraction])

Data(people=[])
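
A single call per configuration says little about reliability, since earlier cells returned different results for the same prompt. One way to compare setups more systematically is to count how often an extractor returns without raising. The sketch below is a minimal harness under stated assumptions: `success_rate` is a hypothetical helper, and the two stub extractors stand in for real calls such as `structured_llm3.invoke(...)` with and without the few-shot messages.

```python
from typing import Callable, Iterable


def success_rate(
    extract: Callable[[str], object], texts: Iterable[str], runs: int = 3
) -> float:
    """Fraction of calls that return without raising (e.g. a ValidationError)."""
    attempts = 0
    successes = 0
    for text in texts:
        for _ in range(runs):
            attempts += 1
            try:
                extract(text)
                successes += 1
            except Exception:
                # Any exception counts as a failed extraction in this sketch.
                pass
    return successes / attempts if attempts else 0.0


# Stub extractors used only to exercise the harness.
def always_ok(text: str) -> dict:
    return {"people": []}


def fails_on_jeff(text: str) -> dict:
    if "Jeff" in text:
        raise ValueError("response did not match the schema")
    return {"people": []}


sample_texts = [
    "The solar system is large, but earth has only 1 moon.",
    "My name is Jeff, my hair is black and i am 6 feet tall.",
]

print(success_rate(always_ok, sample_texts))
print(success_rate(fails_on_jeff, sample_texts, runs=2))
```

This only measures whether parsing succeeds. Checking that the extracted values are correct would require labeled expected outputs.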