# Tagging and Extraction

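 - [1. Setting the OpenAI API Key](#1.-Setting-the-OpenAI-API-Key)
 - [2. Tagging](#2.-Tagging)
     - [2.1 Creating the Tagging function](#2.1-Creating-the-Tagging-function)
     - [2.2 Tagging with LangChain](#2.2-Tagging-with-LangChain)
     - [2.3 Parsing the Tagging result into structured output](#2.3-Parsing-the-Tagging-result-into-structured-output)
 - [3. Extraction](#3.-Extraction)
     - [3.1 Creating the Extraction function](#3.1-Creating-the-Extraction-function)
     - [3.2 Extraction with LangChain](#3.2-Extraction-with-LangChain)
     - [3.3 Parsing the Extraction result into structured output](#3.3-Parsing-the-Extraction-result-into-structured-output)
 - [4. Application example](#4.-Application-example)
     - [4.1 Loading the data](#4.1-Loading-the-data)
     - [4.2 Extracting an overview of the article](#4.2-Extracting-an-overview-of-the-article)
     - [4.3 Extracting information from the article](#4.3-Extracting-information-from-the-article)
     - [4.4 Extracting from chunked text](#4.4-Extracting-from-chunked-text)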
         

# 1. Setting the OpenAI API Key

See the `设置OpenAI_API_KEY.ipynb` notebook for details.

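# 2. Tagging

What `Tagging` is:
- Given a function description, the LLM selects arguments from the input text and produces structured output in the form of a function call.
- More generally, the LLM can evaluate the input text and generate **structured output**.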
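To make the idea concrete, here is a minimal sketch that uses only the standard library and calls no model. The function description and the `function_call` payload are hand-written stand-ins (assumptions), shaped like the real ones that appear later in this section.

```python
import json

# Hand-written function description, shaped like the one LangChain generates.
tagging_function = {
    "name": "Tagging",
    "description": "Tag the piece of text with particular info.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string"},
            "language": {"type": "string"},
        },
        "required": ["sentiment", "language"],
    },
}

# Stand-in for a model reply: the chosen arguments arrive as a JSON string.
function_call = {
    "name": "Tagging",
    "arguments": '{"sentiment": "pos", "language": "en"}',
}

# Decode the arguments and check that every required parameter was filled in.
arguments = json.loads(function_call["arguments"])
missing = [k for k in tagging_function["parameters"]["required"] if k not in arguments]
print(arguments, missing)
```

The structured output is simply the decoded `arguments` dictionary; the function itself is never executed.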

## 2.1 Creating the Tagging function

We define a `Tagging` class that inherits from Pydantic's `BaseModel`, so it gets Pydantic's data type validation. It has two fields, `sentiment` and `language`:
- `sentiment`: the sentiment of the user's message: positive, negative, or neutral.
- `language`: the language the user is writing in, given as an ISO 639-1 code.

In [1]:
# Import modules
from typing import List  
from pydantic import BaseModel, Field  
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [2]:
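# Create the Tagging class
# It labels the input text with a sentiment (positive, negative, or neutral) and a language
class Tagging(BaseModel):
    # The docstring becomes the function description ("Tag the piece of text with particular info.")
    """用特定信息标记这段文本。"""
    # Sentiment label; the description asks the model to choose “正面” (positive), “负面” (negative), or “中立” (neutral)
    sentiment: str = Field(description="文本的情绪，请从“正面”、“负面”或“中立”中选择")
    # Language label; should be an ISO 639-1 code
    language: str = Field(description="文本语言(应采用ISO 639-1代码)")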

In [4]:
# Convert the Tagging data model into an OpenAI function description
convert_pydantic_to_openai_function(Tagging)

{'name': 'Tagging',
 'description': '用特定信息标记这段文本。',
 'parameters': {'properties': {'sentiment': {'description': '文本的情绪，请从“正面”、“负面”或“中立”中选择',
    'type': 'string'},
   'language': {'description': '文本语言(应采用ISO 639-1代码)', 'type': 'string'}},
  'required': ['sentiment', 'language'],
  'type': 'object'}}
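Both parameters are plain strings in the schema above, so the sentiment labels and the ISO 639-1 format are requests to the model rather than enforced constraints. A possible post-check could look like the sketch below. `check_tagging` is a hypothetical helper, and the allowed label set is an assumption taken from the field description.

```python
import re

# Assumed label set: the three values named in the field description.
ALLOWED_SENTIMENTS = {"正面", "负面", "中立"}
# ISO 639-1 codes are two lowercase ASCII letters; this checks the shape only,
# not whether the code is actually assigned to a language.
ISO_639_1_SHAPE = re.compile(r"[a-z]{2}")

def check_tagging(result: dict) -> list:
    """Return a list of problems found in a tagging result (hypothetical helper)."""
    problems = []
    if result.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("unexpected sentiment: %r" % result.get("sentiment"))
    if not ISO_639_1_SHAPE.fullmatch(str(result.get("language", ""))):
        problems.append("language is not a two-letter code: %r" % result.get("language"))
    return problems

print(check_tagging({"sentiment": "正面", "language": "zh"}))
print(check_tagging({"sentiment": "happy", "language": "Chinese"}))
```

An empty list means the result matches what the descriptions asked for; anything else can be logged or retried.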

## 2.2 Tagging with LangChain

Next, we convert the `Tagging` class into a function description object that the OpenAI API can recognize.

In [13]:
# Import modules
from langchain.prompts import ChatPromptTemplate
# from langchain.chat_models import ChatOpenAI  (in recent versions, use the import below instead)
from langchain_openai import ChatOpenAI


In [15]:
# Create a ChatOpenAI model instance with temperature 0
model = ChatOpenAI(temperature=0, openai_api_key="your_api_key")

In [16]:
# Build the list of function descriptions from Tagging
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

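With the function description in hand, we use `LCEL` syntax to build a chain. To do so we create the prompt and the model, bind the function description to the model, and then compose the chain.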

In [18]:
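# Build the chat prompt template with ChatPromptTemplate.from_messages
# (the system message reads: "Think carefully, and then tag the text as instructed")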
prompt = ChatPromptTemplate.from_messages([
    ("system", "仔细思考，然后按指示标记文本"),
    ("user", "{input}")
])

In [19]:
# Bind the functions to the model and force a call to the function named "Tagging"
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)

In [20]:
# Create the tagging chain by combining the prompt template and the model
tagging_chain = prompt | model_with_functions

In [21]:
# Invoke the tagging chain with an input text ("I love langchain")
tagging_chain.invoke({"input": "我爱langchain"})

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"sentiment":"正面","language":"zh"}', 'name': 'Tagging'}}, response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 130, 'total_tokens': 141}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-ec3c4b22-575e-4275-802b-50ae8837632a-0')

In [22]:
# Invoke the tagging chain again with another input ("These are not the questions I wanted to ask")
tagging_chain.invoke({"input": "我想要问的不是这些问题"})

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"sentiment":"中立","language":"zh"}', 'name': 'Tagging'}}, response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 136, 'total_tokens': 147}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-2911f5b2-8b4d-46e2-b5a1-7bbf2ed611f4-0')

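## 2.3 Parsing the Tagging result into structured output

The outputs above are raw `AIMessage` objects, with the tagging result buried as a JSON string in `additional_kwargs['function_call']['arguments']`. Using `LCEL` syntax, we can append a JSON output parser when building the chain so that it returns the parsed arguments directly.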

In [23]:
# Import JsonOutputFunctionsParser from langchain.output_parsers.openai_functions
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser  

In [24]:
# Create a new tagging chain: prompt template, model, and the JsonOutputFunctionsParser
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [25]:
# Invoke the tagging chain with an input text ("I love langchain")
tagging_chain.invoke({"input": "我爱langchain"})

{'sentiment': '正面', 'language': 'zh'}
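Conceptually, the parser step reads the `function_call` arguments out of the message and decodes the JSON. The sketch below illustrates that idea with a plain dictionary standing in for the `AIMessage` shown earlier; `parse_function_arguments` is a hypothetical helper, not LangChain's implementation.

```python
import json

# Plain-dict stand-in for the AIMessage shown earlier (only the relevant part).
message = {
    "content": "",
    "additional_kwargs": {
        "function_call": {
            "name": "Tagging",
            "arguments": '{"sentiment":"正面","language":"zh"}',
        }
    },
}

def parse_function_arguments(msg: dict) -> dict:
    """Decode the JSON arguments of a function call (hypothetical helper)."""
    call = msg["additional_kwargs"].get("function_call")
    if call is None:
        raise ValueError("message contains no function_call")
    return json.loads(call["arguments"])

parsed = parse_function_arguments(message)
print(parsed)
```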

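# 3. Extraction

What Extraction is:
- Extraction is similar to tagging, but is used to extract multiple pieces of information.
- The LLM has been fine-tuned so that, given an input JSON schema, it finds and fills in the parameters of that schema.
- The capability is not limited to function calling; it can be used for general-purpose extraction.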

## 3.1 Creating the Extraction function

In [26]:
# Import modules
from typing import Optional  
from pydantic import BaseModel, Field  

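We define two classes, `Person` and `Information`:
- `Person` has two fields, `name` and `age`; `age` is typed `Optional[int]`, so it may be `None`.
- `Information` has a single field, `people`, which is a list (`List`) of `Person` objects.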

In [27]:
# Create the Person class
class Person(BaseModel):
    """个人信息"""
    name: str = Field(description="人的名字")  # The person's name
    age: Optional[int] = Field(description="人的年龄")  # The person's age; may be None

In [28]:
# Create the Information class
class Information(BaseModel):
    """要提取的信息"""
    people: List[Person] = Field(description="关于人的信息列表")  # List of information about people

In [29]:
# Convert the Information data model into an OpenAI function description
convert_pydantic_to_openai_function(Information)

{'name': 'Information',
 'description': '要提取的信息',
 'parameters': {'$defs': {'Person': {'description': '个人信息',
    'properties': {'name': {'description': '人的名字', 'type': 'string'},
     'age': {'anyOf': [{'type': 'integer'}, {'type': 'null'}],
      'description': '人的年龄'}},
    'required': ['name', 'age'],
    'type': 'object'}},
  'properties': {'people': {'description': '关于人的信息列表',
    'items': {'description': '个人信息',
     'properties': {'name': {'description': '人的名字', 'type': 'string'},
      'age': {'anyOf': [{'type': 'integer'}, {'type': 'null'}],
       'description': '人的年龄'}},
     'required': ['name', 'age'],
     'type': 'object'},
    'type': 'array'}},
  'required': ['people'],
  'type': 'object'}}

In [30]:
# Build the list of extraction functions and bind it to the model
extraction_functions = [convert_pydantic_to_openai_function(Information)]  
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})  

In [31]:
# Invoke the extraction model with a text ("Joe is 30 years old; his mother's name is Martha")
extraction_model.invoke("乔30岁，他妈妈叫玛莎")

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"people":[{"name":"乔","age":30},{"name":"玛莎","age":null}]}', 'name': 'Information'}}, response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 115, 'total_tokens': 140}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-8c5e72ac-b313-45d2-9e08-1c4c1fffba93-0')

## 3.2 Extraction with LangChain

In [32]:
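# Create the prompt template with ChatPromptTemplate
# (the system message reads: "Extract the relevant information, if not explicitly provided do not guess. Extract partial info")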
prompt = ChatPromptTemplate.from_messages([
    ("system", "提取相关信息，如果没有明确提供不要猜测。可以提取部分信息"), 
    ("human", "{input}")  
])

In [33]:
# Create the extraction chain by combining the prompt template and the extraction model
extraction_chain = prompt | extraction_model

In [34]:
# Invoke the extraction chain with an input text
extraction_chain.invoke({"input": "乔30岁，他妈妈叫玛莎"})

AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"people":[{"name":"乔","age":30},{"name":"玛莎","age":null}]}', 'name': 'Information'}}, response_metadata={'token_usage': {'completion_tokens': 25, 'prompt_tokens': 141, 'total_tokens': 166}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-c198b77f-aea8-4c36-8198-8eece7e0ffa2-0')

In [35]:
# Create a new extraction chain that adds JsonOutputFunctionsParser to parse the output
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [36]:
# Invoke the extraction chain again
extraction_chain.invoke({"input": "乔30岁，他妈妈叫玛莎"})

{'people': [{'name': '乔', 'age': 30}, {'name': '玛莎', 'age': None}]}

## 3.3 Parsing the Extraction result into structured output

In [37]:
# Import modules
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser 

In [38]:
# Create the extraction chain, using the key "people" to pull that entry out of the parsed output
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")

In [39]:
# Invoke the extraction chain with an input text
extraction_chain.invoke({"input": "乔30岁，他妈妈叫玛莎"})

[{'name': '乔', 'age': 30}, {'name': '玛莎', 'age': None}]
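Once the result is a plain list of dictionaries, it can be loaded back into typed objects for downstream use. The following is one possible sketch using standard-library dataclasses; `PersonRecord` and `to_records` are hypothetical names introduced for illustration and are not part of the notebook.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PersonRecord:
    """Hypothetical plain-Python counterpart of the Person model."""
    name: str
    age: Optional[int] = None

def to_records(people: List[dict]) -> List[PersonRecord]:
    # Keep only entries that carry a non-empty name; age may be None.
    return [PersonRecord(name=p["name"], age=p.get("age")) for p in people if p.get("name")]

records = to_records([{"name": "乔", "age": 30}, {"name": "玛莎", "age": None}])
# Ages the text actually stated, skipping people whose age was left as None.
known_ages = [r.age for r in records if r.age is not None]
print(records, known_ages)
```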

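# 4. Application example

We can apply tagging to a larger body of text. For example, we can load a blog post and extract tagging information from a subset of its text.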

## 4.1 Loading the data

In [40]:
# Load the document with WebBaseLoader
from langchain.document_loaders import WebBaseLoader  
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/") 
documents = loader.load() 

In [41]:
# Take the first document
doc = documents[0]  

In [42]:
# Take the first 10,000 characters of the page content
page_content = doc.page_content[:10000]  

## 4.2 Extracting an overview of the article

In [43]:
# Import BaseModel and Field from pydantic to define data models
from pydantic import BaseModel, Field  

We define a Pydantic class `Overview`:
- `summary`: a summary of the article's content
- `language`: the language the article is written in
- `keywords`: keywords from the article

In [44]:
# Create the Overview class
class Overview(BaseModel):
    """一段文本的概述"""
    summary: str = Field(description="提供内容的简明总结。")  # Summary of the content
    language: str = Field(description="提供编写内容所用的语言。")  # Language of the content
    keywords: str = Field(description="提供与内容相关的关键字。")  # Keywords

In [45]:
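# Convert the Overview data model into an OpenAI function description
overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}  # Force a call to the Overview function
)
# Note: `prompt` here is the most recently defined one, i.e. the extraction prompt from In [32]
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()  # Build the tagging chain and attach the parser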

In [46]:
# Invoke the tagging chain
tagging_chain.invoke({"input": page_content})

{'summary': 'This is a blog post discussing the concept of building autonomous agents powered by LLM (large language model) as the core controller. It covers components like planning, memory, and tool use, along with challenges and references.',
 'language': 'English',
 'keywords': 'LLM, autonomous agents, planning, memory, tool use, challenges, references'}

## 4.3 Extracting information from the article

In [47]:
# Create the Paper class, holding a title and an author
class Paper(BaseModel):
    """提到的论文信息。"""
    title: str  # Paper title
    author: Optional[str]  # Author; may be None

# Create the Info class, used to extract a list of papers
class Info(BaseModel):
    """要提取的信息"""
    papers: List[Paper]

In [48]:
# Convert the Info data model into an OpenAI function description
paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function,
    function_call={"name":"Info"}  # Force a call to the Info function
)

In [49]:
# Create the extraction chain and attach the parser
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers") 

In [50]:
# Invoke the extraction chain; it extracted the title of the article itself, so next we refine the prompt
extraction_chain.invoke({"input": page_content})  

[{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]

In [51]:
template = """
An article will be passed to you. Extract from it all papers that are mentioned by this article. 
Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.
Do not make up or guess ANY extra information. Only extract what exactly is in the text.
"""

# Chinese translation of the template above; this is the version used in the prompt below
template_chinese = """
一篇文章将转交给你。把这篇文章中提到的所有论文都摘录出来。
不要提取文章本身的名称。如果没有提到论文，那很好——你不需要提取任何论文!只返回一个空列表。
不要编造或猜测任何额外的信息。只提取文本中的内容。
"""

In [52]:
# Create the chat prompt from the customized template
prompt = ChatPromptTemplate.from_messages([
    ("system", template_chinese),
    ("human", "{input}")
])

In [53]:
# Rebuild the extraction chain
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")  

In [54]:
# Invoke the extraction chain again
extraction_chain.invoke({"input": page_content})  

[{'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': 'Liu et al.'},
 {'title': 'ReAct (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)',
  'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)',
  'author': 'Laskin et al.'}]

In [55]:
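# Invoke the chain with an unrelated input ("hi"): ideally this returns an empty list,
# but in this run the model returned fabricated placeholder papers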
extraction_chain.invoke({"input": "hi"})  

[{'title': 'Paper A', 'author': 'Author A'},
 {'title': 'Paper B', 'author': 'Author B'}]
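Because the model can still return invented entries like the ones above, one defensive option is to keep only titles that literally occur in the source text. This is a sketch under that assumption; `keep_grounded` is a hypothetical helper, and a strict substring check will also drop titles the model has paraphrased.

```python
def keep_grounded(papers: list, source_text: str) -> list:
    """Keep only papers whose title literally appears in the source (hypothetical helper)."""
    haystack = source_text.casefold()
    return [p for p in papers if p.get("title") and p["title"].casefold() in haystack]

fabricated = [{"title": "Paper A", "author": "Author A"},
              {"title": "Paper B", "author": "Author B"}]
# Neither title occurs in the input "hi", so both entries are dropped.
print(keep_grounded(fabricated, "hi"))
```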

## 4.4 Extracting from chunked text

In [56]:
# Import modules
from langchain.text_splitter import RecursiveCharacterTextSplitter 

# Instantiate the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)  

In [57]:
# Split the document content; text_splitter cuts a long text into multiple shorter chunks
splits = text_splitter.split_text(doc.page_content)  

# Number of chunks after splitting
len(splits)  

15
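To show what splitting does in the simplest terms, here is a deliberately naive fixed-size chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses, since that splitter prefers to break at separators such as paragraph boundaries; `split_fixed` and its default `chunk_size` of 4000 are assumptions chosen for illustration.

```python
def split_fixed(text: str, chunk_size: int = 4000, chunk_overlap: int = 0) -> list:
    """Cut text into windows of at most chunk_size characters (simplified sketch)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # Each window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed("a" * 10, chunk_size=4)
print(chunks)
```

With `chunk_overlap=0`, joining the chunks reproduces the original text, which is a convenient sanity check.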

In [58]:
# Define a function that flattens a list of lists
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list  

In [59]:
# Example call to the flatten function
flatten([[1, 2], [3, 4]])  

[1, 2, 3, 4]
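The same one-level flattening can also be written with the standard library's `itertools`. `flatten_iter` is a hypothetical alternative name, and the import is aliased so it does not collide with a variable called `chain` elsewhere in the notebook.

```python
from itertools import chain as iter_chain

def flatten_iter(matrix):
    """Flatten one level of nesting using itertools (alternative sketch)."""
    return list(iter_chain.from_iterable(matrix))

print(flatten_iter([[1, 2], [3, 4]]))
```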

In [60]:
# Print the last 1,000 characters of the first chunk
print(splits[0][-1000:])  

lemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory

Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.


Tool use

The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.

In [61]:
# Import modules
from langchain.schema.runnable import RunnableLambda  

In [62]:
# Wrap a preprocessing lambda in a RunnableLambda: split the text and wrap each chunk as a chain input
prep = RunnableLambda(
    lambda x: [{"input": chunk} for chunk in text_splitter.split_text(x)]
)

In [63]:
# Test prep
print(prep.invoke("hi"))
print(len(prep.invoke("hi")))

# A long text is split into multiple shorter chunks
print(len(prep.invoke(doc.page_content)))

[{'input': 'hi'}]
1
15


In [64]:
# Build the full chain: preprocessing, mapped extraction, flattening
# Each chunk is run through extraction_chain separately, and flatten merges the resulting lists into one
chain = prep | extraction_chain.map() | flatten  
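In plain Python, the chain above amounts to a split, a per-chunk extraction, and a merge. The sketch below shows that data flow with stub functions in place of the text splitter and the model; `run_pipeline`, `stub_split`, and `stub_extract` are hypothetical. Unlike `.map()`, which may process the chunks concurrently, this loop is strictly sequential.

```python
def run_pipeline(text, split, extract):
    """Split text, extract from each chunk, and merge the per-chunk lists."""
    merged = []
    for chunk in split(text):
        merged += extract({"input": chunk})
    return merged

# Stub stand-ins (assumptions): split on blank lines, "extract" capitalized words.
def stub_split(text):
    return [part for part in text.split("\n\n") if part]

def stub_extract(payload):
    return [{"title": w, "author": None} for w in payload["input"].split() if w[0].isupper()]

result = run_pipeline("see ReAct and Reflexion\n\nalso AutoGPT here", stub_split, stub_extract)
print([paper["title"] for paper in result])
```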

In [65]:
chain.invoke(doc.page_content)

[{'title': 'AutoGPT', 'author': None},
 {'title': 'GPT-Engineer', 'author': None},
 {'title': 'BabyAGI', 'author': None},
 {'title': 'Chain of thought', 'author': 'Wei et al. 2022'},
 {'title': 'Tree of Thoughts', 'author': 'Yao et al. 2023'},
 {'title': 'LLM+P', 'author': 'Liu et al. 2023'},
 {'title': 'ReAct', 'author': 'Yao et al. 2023'},
 {'title': 'Reflexion', 'author': 'Shinn & Labash 2023'},
 {'title': 'Chain of Hindsight', 'author': 'Liu et al. 2023'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'Algorithm Distillation (AD)', 'author': 'Laskin et al. 2023'},
 {'title': 'ED (expert distillation)', 'author': None},
 {'title': 'RL^2', 'author': 'Duan et al. 2017'},
 {'title': 'LSH (Locality-Sensitive Hashing)', 'author': None},
 {'title': 'ANNOY (Approximate Nearest Neighbors Oh Yeah)', 'author': None},
 {'title': 'HNSW (Hierarchical Navigable Small World)', 'author': None},
 {'title': 'FAISS (Facebook AI Similarity Search)', 'author': None},
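Because each chunk is processed independently, the merged list can contain near-duplicates, such as `Algorithm Distillation` and `Algorithm Distillation (AD)` above. One possible clean-up is sketched below; the normalization rule (drop a trailing parenthetical, ignore case) is an assumption and may be too aggressive or too lenient for other data.

```python
import re

def normalize_title(title: str) -> str:
    # Drop a trailing parenthetical such as " (AD)" and ignore case.
    return re.sub(r"\s*\([^)]*\)\s*$", "", title).strip().casefold()

def dedupe_papers(papers: list) -> list:
    """Keep the first entry per normalized title (hypothetical helper)."""
    seen = set()
    unique = []
    for paper in papers:
        key = normalize_title(paper["title"])
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique

sample = [
    {"title": "Algorithm Distillation", "author": "Laskin et al. 2023"},
    {"title": "Algorithm Distillation (AD)", "author": "Laskin et al. 2023"},
    {"title": "RL^2", "author": "Duan et al. 2017"},
]
print([paper["title"] for paper in dedupe_papers(sample)])
```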
 

# 5. English-version templates

**2.1 Creating the Tagging function**

In [66]:
class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")

**2.2 Tagging with LangChain**

In [67]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])

**3.1 Creating the Extraction function**

In [68]:
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")  
    age: Optional[int] = Field(description="person's age")  

In [69]:
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

**3.2 Extraction with LangChain**

In [70]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"), 
    ("human", "{input}")  
])

**4.2 Extracting an overview of the article**

In [71]:
class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.") 
    language: str = Field(description="Provide the language that the content is written in.") 
    keywords: str = Field(description="Provide keywords related to the content.") 

**4.3 Extracting information from the article**

In [72]:
class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str  
    author: Optional[str]  

class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper] 

The prompt uses `template`:

In [73]:
template = """
An article will be passed to you. Extract from it all papers that are mentioned by this article. 
Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.
Do not make up or guess ANY extra information. Only extract what exactly is in the text.
"""