- Extraction: Extract structured data from text and other unstructured media using chat models and few-shot examples.

In [1]:
from dotenv import load_dotenv
import os
import getpass
# Load environment variables from the .env file
load_dotenv()

True

In [2]:
pip show langchain-core

Name: langchain-core
Version: 0.3.35
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /opt/anaconda3/envs/langChain/lib/python3.13/site-packages
Requires: jsonpatch, langsmith, packaging, pydantic, PyYAML, tenacity, typing-extensions
Required-by: langchain, langchain-chroma, langchain-community, langchain-deepseek, langchain-ollama, langchain-openai, langchain-text-splitters
Note: you may need to restart the kernel to use updated packages.


# Build an Extraction Chain
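- This tutorial requires langchain-core>=0.3.20 and only works with models that support tool calling.
- It uses the tool-calling feature of chat models to extract structured information from unstructured text, and also shows how few-shot prompting can improve performance in this setting.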
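
The version requirement above can be checked programmatically rather than by reading the `pip show` output. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `meets_minimum` are hypothetical, and the parsing deliberately ignores pre-release suffixes, so it is a rough check rather than a full version comparison.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def parse_version(text: str) -> tuple:
    """Turn '0.3.35' into (0, 3, 35); trailing non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits of each component, e.g. '20rc1' -> '20'.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("langchain-core")
    print(installed, "OK" if meets_minimum(installed, "0.3.20") else "too old")
except PackageNotFoundError:
    print("langchain-core is not installed")
```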

## The Schema
- First, we need to describe what information we want to extract from the text.
- We use Pydantic to define an example schema for extracting personal information.

In [38]:
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[float] = Field(
        default=None, description="Height measured in meters"
    )

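- Two best practices when defining a schema:
    - Document the attributes and the schema itself: this information is sent to the LLM and is used to improve the quality of extraction.
    - Do not force the LLM to make up information! Above, the attributes are declared as Optional, which allows the LLM to output None when it does not know the answer.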
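
To see both practices in effect, one can inspect the JSON schema that Pydantic generates for the model. The sketch below assumes Pydantic v2 and redefines `Person` so that it runs on its own. It only shows where the docstring and field descriptions end up, and that an instance with no arguments is valid because every field defaults to None.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[float] = Field(
        default=None, description="Height measured in meters"
    )


schema = Person.model_json_schema()

# The class docstring becomes the schema-level description.
print(schema["description"])

# Each field carries its own description.
for field_name, field_schema in schema["properties"].items():
    print(field_name, "->", field_schema["description"])

# Every field is optional, so an "empty" person is a valid result.
empty = Person()
print(empty.model_dump())
```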

## The Extractor
- Instruct the model to return null when it does not know a value, so that it does not make one up.

In [39]:
from typing import Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

- Use a model that supports function/tool calling.

In [46]:
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2:latest",
    temperature=0.3,
    # other params...
)

structured_llm = llm.with_structured_output(
    schema=Person
)
structured_llm

RunnableBinding(bound=ChatOllama(model='llama3.2:latest', temperature=0.3), kwargs={'tools': [{'type': 'function', 'function': {'name': 'Person', 'description': 'Information about a person.', 'parameters': {'properties': {'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': 'The name of the person'}, 'hair_color': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': "The color of the person's hair if known"}, 'height_in_meters': {'anyOf': [{'type': 'number'}, {'type': 'null'}], 'default': None, 'description': 'Height measured in meters'}}, 'type': 'object'}}}], 'structured_output_format': {'kwargs': {'method': 'function_calling'}, 'schema': {'type': 'function', 'function': {'name': 'Person', 'description': 'Information about a person.', 'parameters': {'properties': {'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'description': 'The name of the person'}, 'hair_color': {'anyOf': [{'type': 'strin

- Test:

In [41]:
text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters=1.8288)
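
The model converted "6 feet" into meters by itself. The value can be checked by hand, since one foot is defined as exactly 0.3048 m. The helper name `feet_to_meters` below is only for illustration.

```python
FEET_TO_METERS = 0.3048  # one international foot in meters, exact by definition


def feet_to_meters(feet: float) -> float:
    """Convert feet to meters, rounded to four decimals to hide float noise."""
    return round(feet * FEET_TO_METERS, 4)


print(feet_to_meters(6))  # 1.8288
```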

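## Multiple Entities
- In most cases, you should extract a list of entities rather than a single entity.
- **When the schema can hold multiple entities, it also allows the model to extract zero entities by returning an empty list when the text contains no relevant information.**
- This is easy to achieve by nesting models in Pydantic.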


In [66]:
from typing import List, Optional

class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: Optional[list] = None  # intended to nest Person models; the list items are untyped here

In [67]:
structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 1.85 meters tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

ValidationError: 1 validation error for Data
people
  Input should be a valid list [type=list_type, input_value={'hair_color': 'black', '... '1.85', 'name': 'Jeff'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/list_type
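
The error above shows the model returning a single dict where a list was expected. One plausible contributing factor is that `people` is declared as a bare `list`, so the generated schema says nothing about what the items look like. The sketch below (Pydantic v2, with the hypothetical name `TypedData`) declares the field as `List[Person]` instead. Whether this is enough for a small model such as llama3.2 to return a proper list is not verified here. The code only demonstrates what the schema exposes and how validation behaves.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[float] = Field(
        default=None, description="Height measured in meters"
    )


class TypedData(BaseModel):
    """Extracted data about people."""

    # A typed list: the schema now tells the model that each item is a Person.
    people: List[Person] = Field(
        default_factory=list, description="All people mentioned in the text"
    )


items_schema = TypedData.model_json_schema()["properties"]["people"]["items"]
print(items_schema)  # a reference to the nested Person definition

# A well-formed list validates; the string "1.85" is coerced to a float.
parsed = TypedData.model_validate(
    {"people": [{"name": "Jeff", "hair_color": "black", "height_in_meters": "1.85"}]}
)
print(parsed.people[0].height_in_meters)

# A bare dict is still rejected, as in the traceback above.
try:
    TypedData.model_validate({"people": {"name": "Jeff"}})
    error_type = None
except ValidationError as exc:
    error_type = exc.errors()[0]["type"]
print(error_type)
```

Note that with this typed schema an empty result is `people=[]` rather than `people=None`, which matches the "empty list" convention described at the start of this section.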

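## Improving Accuracy with Few-Shot Examples
- The behavior of LLMs can be steered with few-shot prompting.
- For chat models, this takes the form of a sequence of input and response message pairs that demonstrate the desired behavior.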

In [None]:
messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]

response = llm.invoke(messages)
print(response.content)

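## Utility Function tool_example_to_messages
- It converts pairs of an input string and the desired Pydantic object into a sequence of messages that can be provided to a chat model.
- Behind the scenes, LangChain formats the tool calls into the format each provider requires.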

In [68]:
from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
    (
        "Alan Smith is 6 feet tall and has blond hair.",
        Data(people=[Person(name='Alan Smith', hair_color='blond', height_in_meters=1.8288)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))

for message in messages:
    message.pretty_print()


The ocean is vast and blue. It's more than 20,000 feet deep.
Tool Calls:
  Data (bb79810d-c419-4d15-b33f-b9d349dfb348)
 Call ID: bb79810d-c419-4d15-b33f-b9d349dfb348
  Args:
    people: []

You have correctly called this tool.

Detected no people.

Fiona traveled far from France to Spain.
Tool Calls:
  Data (a8d341f0-f0b8-41f8-a963-8ff86732e71d)
 Call ID: a8d341f0-f0b8-41f8-a963-8ff86732e71d
  Args:
    people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]

You have correctly called this tool.

Detected people.

Alan Smith is 6 feet tall and has blond hair.
Tool Calls:
  Data (138dceea-1cbf-450b-ac03-81f81d0b897c)
 Call ID: 138dceea-1cbf-450b-ac03-81f81d0b897c
  Args:
    people: [{'name': 'Alan Smith', 'hair_color': 'blond', 'height_in_meters': 1.8288}]

You have correctly called this tool.

Detected people.


In [69]:
message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])

ValidationError: 1 validation error for Data
people
  Input should be a valid list [type=list_type, input_value="{'Earth': {'moons': 1}", input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/list_type

In [70]:
structured_llm.invoke(messages + [message_no_extraction])

Data(people=None)
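
The two runs above differ in an instructive way: without examples the call raised a ValidationError, while with the few-shot messages it returned `Data(people=None)`. Downstream code therefore has to cope with three outcomes: a list, None, or an exception. The following is a minimal sketch of a wrapper that normalizes all three to a list. The name `extract_people` is hypothetical, and `invoke` stands for any callable such as `structured_llm.invoke`.

```python
from typing import Any, Callable, Optional

from pydantic import BaseModel, ValidationError


class Data(BaseModel):
    """Extracted data about people."""

    people: Optional[list] = None


def extract_people(invoke: Callable[[Any], Optional[Data]], payload: Any) -> list:
    """Call the extractor and always return a list, even on failure."""
    try:
        result = invoke(payload)
    except ValidationError:
        # The model produced output that does not fit the schema.
        return []
    if result is None:
        return []
    # Treat people=None the same as an empty list.
    return result.people or []


# Usage with a stand-in for structured_llm.invoke (no model call is made here).
print(extract_people(lambda _: Data(people=None), "The solar system is large."))
```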