# Tagging and Extraction

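 - [1. Set the OpenAI API Key](#1.-Set-the-OpenAI-API-Key)
 - [2. Tagging](#2.-Tagging)
     - [2.1 Creating the Tagging function](#2.1-Creating-the-Tagging-function)
     - [2.2 Tagging with LangChain](#2.2-Tagging-with-LangChain)
     - [2.3 Parsing the Tagging result](#2.3-Parsing-the-Tagging-result)
 - [3. Extraction](#3.-Extraction)
     - [3.1 Creating the Extraction function](#3.1-Creating-the-Extraction-function)
     - [3.2 Extraction with LangChain](#3.2-Extraction-with-LangChain)
     - [3.3 Parsing the Extraction result](#3.3-Parsing-the-Extraction-result)
 - [4. Application Example](#4.-Application-Example)
     - [4.1 Loading the data](#4.1-Loading-the-data)
     - [4.2 Extracting an overview of the article](#4.2-Extracting-an-overview-of-the-article)
     - [4.3 Extracting information from the article](#4.3-Extracting-information-from-the-article)
     - [4.4 Extracting from chunked text](#4.4-Extracting-from-chunked-text)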
         

# 1. Set the OpenAI API Key

See the `设置OpenAI_API_KEY.ipynb` notebook for details.

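# 2. Tagging

What `Tagging` is:
- Given a function description, the LLM selects arguments from the input text and produces structured output in the form of a function call.
- More generally, the LLM can evaluate the input text and generate **structured output**.

## 2.1 Creating the Tagging function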

In [99]:
# Import modules
from typing import List  
from pydantic import BaseModel, Field  
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [100]:
# Create the Tagging class
# It tags the sentiment of the input text as `pos` (positive), `neg` (negative), or `neutral`, and tags its language
class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    # Sentiment label of the text: `pos` (positive), `neg` (negative), or `neutral`
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    # Language label of the text, as an ISO 639-1 code
    language: str = Field(description="language of text (should be ISO 639-1 code)")

In [101]:
# Convert the Tagging data model into an OpenAI function description
convert_pydantic_to_openai_function(Tagging)

{'name': 'Tagging',
 'description': 'Tag the piece of text with particular info.',
 'parameters': {'title': 'Tagging',
  'description': 'Tag the piece of text with particular info.',
  'type': 'object',
  'properties': {'sentiment': {'title': 'Sentiment',
    'description': 'sentiment of text, should be `pos`, `neg`, or `neutral`',
    'type': 'string'},
   'language': {'title': 'Language',
    'description': 'language of text (should be ISO 639-1 code)',
    'type': 'string'}},
  'required': ['sentiment', 'language']}}
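
To make the shape of this dictionary concrete, here is a minimal pure-Python sketch that assembles a similar function description by hand. `make_function_schema` and its `fields` argument are hypothetical names used only for illustration; they are not part of LangChain. The sketch omits the `title` entries that pydantic generates and treats every field as required.

```python
def make_function_schema(name, description, fields):
    """Build an OpenAI-style function description.

    `fields` is a list of (field_name, json_type, field_description) triples.
    Hypothetical helper for illustration only; every field is marked required.
    """
    properties = {
        field_name: {"type": json_type, "description": field_desc}
        for field_name, json_type, field_desc in fields
    }
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": [field_name for field_name, _, _ in fields],
        },
    }


tagging_schema = make_function_schema(
    "Tagging",
    "Tag the piece of text with particular info.",
    [
        ("sentiment", "string", "sentiment of text, should be `pos`, `neg`, or `neutral`"),
        ("language", "string", "language of text (should be ISO 639-1 code)"),
    ],
)
print(tagging_schema["parameters"]["required"])
```

Writing the dictionary by hand works for small schemas, but deriving it from a pydantic model keeps the field definitions and the schema in one place.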

## 2.2 Tagging with LangChain

In [102]:
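# Import modules
import openai  # assumes `openai.api_key` was set as described in the setup notebook
from langchain.prompts import ChatPromptTemplate 
from langchain.chat_models import ChatOpenAI

In [103]:
model = ChatOpenAI(openai_api_key=openai.api_key, temperature=0)  # Create a ChatOpenAI instance with temperature 0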

In [104]:
# Wrap the Tagging function description in a list
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

In [105]:
# Create the chat prompt template with ChatPromptTemplate.from_messages
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])

In [106]:
# Bind the functions to the model and force it to call the function named "Tagging"
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)

In [107]:
# Create a tagging chain that combines the prompt template and the model
tagging_chain = prompt | model_with_functions

In [108]:
# Invoke the tagging chain with an input text
tagging_chain.invoke({"input": "I love langchain"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{\n  "sentiment": "pos",\n  "language": "en"\n}'}}, example=False)

In [109]:
# Invoke the tagging chain again with a different input text
tagging_chain.invoke({"input": "non mi piace questo cibo"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{\n  "sentiment": "neg",\n  "language": "it"\n}'}}, example=False)
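
In both outputs above, the `arguments` field is a JSON string rather than a Python dictionary. The following sketch decodes it by hand with the standard library. It uses a plain dict that mimics the `additional_kwargs` shown above instead of a real `AIMessage`, and `parse_function_call` is a hypothetical helper name.

```python
import json

# A plain dict mimicking the `additional_kwargs` of the AIMessage shown above.
additional_kwargs = {
    "function_call": {
        "name": "Tagging",
        "arguments": '{\n  "sentiment": "neg",\n  "language": "it"\n}',
    }
}


def parse_function_call(kwargs):
    """Return (function name, decoded arguments) from a function_call payload."""
    call = kwargs["function_call"]
    return call["name"], json.loads(call["arguments"])


name, arguments = parse_function_call(additional_kwargs)
print(name, arguments)
```

In practice the model may occasionally emit malformed JSON, so production code would likely wrap `json.loads` in error handling.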

## 2.3 Parsing the Tagging result

In [110]:
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser  # Parses the function-call arguments into a dict

In [111]:
# Create a new tagging chain: prompt template, model, and the JsonOutputFunctionsParser
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [112]:
# Invoke the tagging chain with an input text
tagging_chain.invoke({"input": "non mi piace questo cibo"})

{'sentiment': 'neg', 'language': 'it'}
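
The schema declares `sentiment` and `language` as plain strings, so the allowed values are stated only in the field descriptions and the model is not strictly prevented from returning something else. A minimal sketch of a post-hoc check on the parsed dictionary follows. `check_tagging` and `ALLOWED_SENTIMENTS` are hypothetical names, and the language check validates only the two-letter format, not whether the code is actually assigned.

```python
import re

ALLOWED_SENTIMENTS = {"pos", "neg", "neutral"}


def check_tagging(result):
    """Return a list of problems found in a parsed Tagging result (empty if none)."""
    problems = []
    if result.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {result.get('sentiment')!r}")
    # ISO 639-1 codes are two lowercase letters; only the format is checked here.
    if not re.fullmatch(r"[a-z]{2}", str(result.get("language", ""))):
        problems.append(f"language is not a two-letter code: {result.get('language')!r}")
    return problems


print(check_tagging({"sentiment": "neg", "language": "it"}))
print(check_tagging({"sentiment": "negative", "language": "Italian"}))
```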

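# 3. Extraction

What Extraction is:
- Extraction is similar to Tagging, but it is used to extract multiple pieces of information.
- The LLM has been fine-tuned so that, when given an input JSON schema, it finds and fills in the parameters of that schema.
- This capability is not limited to function schemas; it can be used for general-purpose extraction.

## 3.1 Creating the Extraction function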

In [113]:
# Import modules
from typing import Optional  
from pydantic import BaseModel, Field  

In [114]:
# Create the Person class
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")  # the person's name
    age: Optional[int] = Field(description="person's age")  # the person's age (optional field)

In [115]:
# Create the Information class
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")  # list of information about people

In [116]:
# Convert the Information data model into an OpenAI function description
convert_pydantic_to_openai_function(Information)

{'name': 'Information',
 'description': 'Information to extract.',
 'parameters': {'title': 'Information',
  'description': 'Information to extract.',
  'type': 'object',
  'properties': {'people': {'title': 'People',
    'description': 'List of info about people',
    'type': 'array',
    'items': {'title': 'Person',
     'description': 'Information about a person.',
     'type': 'object',
     'properties': {'name': {'title': 'Name',
       'description': "person's name",
       'type': 'string'},
      'age': {'title': 'Age',
       'description': "person's age",
       'type': 'integer'}},
     'required': ['name']}}},
  'required': ['people']}}

In [117]:
# Create the list of extraction functions and bind it to the model
extraction_functions = [convert_pydantic_to_openai_function(Information)]  
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})  

In [118]:
# Invoke the extraction model with a piece of text
extraction_model.invoke("Joe is 30, his mom is Martha")

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{\n  "people": [\n    {\n      "name": "Joe",\n      "age": 30\n    },\n    {\n      "name": "Martha",\n      "age": 0\n    }\n  ]\n}'}}, example=False)

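## 3.2 Extraction with LangChain

In the output above, the model filled in `"age": 0` for Martha even though the text does not state her age. The system prompt used here tells the model not to guess.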

In [119]:
# Create the prompt template with ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
    # System prompt: extract what is relevant, do not guess missing values, partial information is fine
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")
])

In [120]:
# Create the extraction chain from the prompt template and the extraction model
extraction_chain = prompt | extraction_model

In [121]:
# Invoke the extraction chain with an input text
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{\n  "people": [\n    {\n      "name": "Joe",\n      "age": 30\n    },\n    {\n      "name": "Martha"\n    }\n  ]\n}'}}, example=False)

In [122]:
# Create a new extraction chain that adds JsonOutputFunctionsParser to parse the output
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [123]:
# Invoke the extraction chain again
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

{'people': [{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]}

## 3.3 Parsing the Extraction result

In [124]:
# Import modules
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser 

In [125]:
# Create the extraction chain and return only the value under the "people" key
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")

In [126]:
# Invoke the extraction chain with an input text
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

[{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]
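
The parsed result is a list of plain dictionaries in which optional fields such as `age` may be missing. One way to make downstream code more predictable is to convert the dictionaries into typed records. The sketch below uses the standard-library `dataclasses` module; `PersonRecord` and `to_records` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonRecord:
    """Plain-Python counterpart of the `Person` schema (hypothetical name)."""
    name: str
    age: Optional[int] = None  # stays None when the model omits the age


def to_records(people: List[dict]) -> List[PersonRecord]:
    """Convert parsed dictionaries into typed records."""
    return [PersonRecord(name=p["name"], age=p.get("age")) for p in people]


records = to_records([{"name": "Joe", "age": 30}, {"name": "Martha"}])
for record in records:
    print(record)
```

Using `None` for a missing age keeps "not stated" distinct from a real value such as `0`.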

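# 4. Application Example

Tagging can also be applied to larger bodies of text. For example, we can load a blog post and extract tagging information from a subset of its text.

## 4.1 Loading the data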

In [127]:
# Load the document with WebBaseLoader
from langchain.document_loaders import WebBaseLoader  
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/") 
documents = loader.load() 

In [128]:
# Take the first document
doc = documents[0]  

In [129]:
# Take the first 10,000 characters of the page content
page_content = doc.page_content[:10000]  

## 4.2 Extracting an overview of the article

In [130]:
# Import BaseModel and Field from pydantic to define the data model
from pydantic import BaseModel, Field  

In [131]:
# Create the Overview class
class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.")  # summary of the content
    language: str = Field(description="Provide the language that the content is written in.")  # language of the content
    keywords: str = Field(description="Provide keywords related to the content.")  # keywords

In [132]:
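# Convert the Overview data model into an OpenAI function description
overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}  # force the model to call this function
)
# Build the tagging chain with the parser (`prompt` is still the one defined in section 3.2)
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()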

In [133]:
# Invoke the tagging chain
tagging_chain.invoke({"input": page_content})

{'summary': 'This article discusses the concept of building autonomous agents powered by LLM (large language model) as their core controller. It explores the key components of such agent systems, including planning, memory, and tool use. It also covers various techniques for task decomposition and self-reflection in autonomous agents. The article provides examples of case studies and challenges in implementing LLM-powered autonomous agents.',
 'language': 'English',
 'keywords': 'LLM, autonomous agents, planning, memory, tool use, task decomposition, self-reflection, case studies, challenges'}
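
Because `keywords` is declared as `str`, the model returns the keywords as one comma-separated string. A minimal sketch of turning that string into a list follows; `split_keywords` is a hypothetical helper name. An alternative, not shown in the notebook, would be to declare the field as `List[str]` in the schema.

```python
def split_keywords(keywords: str) -> list:
    """Split a comma-separated keyword string into a list of trimmed keywords."""
    return [k.strip() for k in keywords.split(",") if k.strip()]


# The keyword string returned in the output above.
keywords = (
    "LLM, autonomous agents, planning, memory, tool use, "
    "task decomposition, self-reflection, case studies, challenges"
)
keyword_list = split_keywords(keywords)
print(len(keyword_list), keyword_list[:3])
```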

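## 4.3 Extracting information from the article

In [134]:
# Create the Paper class, which holds a title and an author
class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str  # paper title
    author: Optional[str]  # author (optional field)

# Create the Info class, which holds the list of extracted papers
class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper] 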

In [135]:
# Convert the Info data model into an OpenAI function description
paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}  # force the model to call this function
)

In [136]:
# Create the extraction chain and add the parser
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers") 

In [137]:
# Invoke the extraction chain. It returns the title of the article itself, so the prompt is refined next
extraction_chain.invoke({"input": page_content})  

[{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]

In [138]:
template = """
An article will be passed to you. Extract from it all papers that are mentioned by this article. 
Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.
Do not make up or guess ANY extra information. Only extract what exactly is in the text.
"""


In [139]:
# Create the chat prompt from the customized template
prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", "{input}")
])

In [140]:
# Recreate the extraction chain
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")  

In [141]:
# Invoke the extraction chain again
extraction_chain.invoke({"input": page_content})  

[{'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': 'Liu et al.'},
 {'title': 'ReAct (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)',
  'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)',
  'author': 'Laskin et al.'}]

In [142]:
# Invoke the extraction chain with unrelated input; it returns an empty list
extraction_chain.invoke({"input": "hi"})  

[]

## 4.4 Extracting from chunked text

In [143]:
# Import modules
from langchain.text_splitter import RecursiveCharacterTextSplitter 

# Instantiate the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)  

In [144]:
# Split the document content; text_splitter cuts a long text into multiple shorter chunks
splits = text_splitter.split_text(doc.page_content)  

# Number of chunks after splitting
len(splits)  

14
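
To illustrate what chunk size and `chunk_overlap` mean, here is a deliberately simplified fixed-width splitter in pure Python. It is not how `RecursiveCharacterTextSplitter` works internally: that class tries to break at separators such as paragraph and line breaks, while this sketch cuts at fixed character offsets. `split_fixed` is a hypothetical name.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows start `chunk_size - chunk_overlap` characters apart,
    so they share `chunk_overlap` characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With zero overlap, as in the cell above, the chunks partition the text and concatenating them reproduces the original string.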

In [145]:
# Define a function that flattens a list of lists
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list  

In [146]:
# Example call of the flatten function
flatten([[1, 2], [3, 4]])  

[1, 2, 3, 4]
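
An equivalent one-level flatten can be written with the standard-library `itertools.chain.from_iterable`. This is shown only as an alternative to the loop above; `flatten_chain` is a hypothetical name.

```python
from itertools import chain


def flatten_chain(matrix):
    """Flatten one level of nesting, like the `flatten` function above."""
    return list(chain.from_iterable(matrix))


print(flatten_chain([[1, 2], [3, 4]]))
print(flatten_chain([[], [5]]))
```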

In [147]:
# Print the first chunk
print(splits[0])  

LLM Powered Autonomous Agents | Lil'Log

Lil'Log

Posts
Archive
Search
Tags
FAQ
emojisearch.app

      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents

Agent System Overview

Component One: Planning

Task Decomposition

Self-Reflection

Component Two: Memory

Types of Memory

Maximum Inner Product Search (MIPS)

Component Three: Tool Use

Case Studies

Scientific Discovery Agent

Generative Agents Simulation

Proof-of-Concept Examples

Challenges

Citation

References

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general probl

In [148]:
# Import modules
from langchain.schema.runnable import RunnableLambda  

In [149]:
# Create a RunnableLambda that preprocesses the text into one input dict per chunk
prep = RunnableLambda(
    lambda x: [{"input": doc} for doc in text_splitter.split_text(x)]  
)

In [150]:
# Test prep
print(prep.invoke("hi"))
print(len(prep.invoke("hi")))

# A long text is split into multiple shorter chunks
print(len(prep.invoke(doc.page_content)))

[{'input': 'hi'}]
1
14


In [151]:
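# Create the chain: preprocessing, mapped extraction, flattening
# Each chunk is passed through extraction_chain separately, and flatten merges the per-chunk lists into one list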
chain = prep | extraction_chain.map() | flatten  
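
The data flow of this chain can be sketched in plain Python without calling a model. In the sketch below, `fake_extract` is a hypothetical stand-in for `extraction_chain` that simply treats capitalized words as titles, and `split_into_inputs` is a simplified stand-in for `prep` that cuts at fixed character offsets. Only the split, map, and flatten structure mirrors the chain above.

```python
def split_into_inputs(text, chunk_size=20):
    """Stand-in for `prep`: one {"input": chunk} dict per fixed-width chunk."""
    return [{"input": text[i:i + chunk_size]} for i in range(0, len(text), chunk_size)]


def fake_extract(item):
    """Hypothetical stand-in for `extraction_chain`: capitalized words become 'titles'."""
    return [{"title": word} for word in item["input"].split() if word[:1].isupper()]


def flatten(matrix):
    """Merge a list of lists into a single list."""
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list


def run_pipeline(text):
    inputs = split_into_inputs(text)                      # corresponds to prep
    per_chunk = [fake_extract(item) for item in inputs]   # corresponds to extraction_chain.map()
    return flatten(per_chunk)                             # corresponds to flatten


result = run_pipeline("we read ReAct and then Reflexion today")
print(result)
```

Because every chunk is processed independently, the same item can be extracted more than once when it appears in several chunks, so a deduplication step may be worth adding after flattening.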

In [152]:
chain.invoke(doc.page_content)

[{'title': 'AutoGPT', 'author': ''},
 {'title': 'GPT-Engineer', 'author': ''},
 {'title': 'BabyAGI', 'author': ''},
 {'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': 'Liu et al.'},
 {'title': 'ReAct (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': 'Shinn & Labash'},
 {'title': 'Reflexion: A Framework for Self-Reflection in Reinforcement Learning',
  'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight: Improving Model Outputs with Sequential Feedback',
  'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation: Learning from Cross-Episode Trajectories',
  'author': 'Laskin et al.'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'ED (expert distillation)', 'author': ''},
 {'title': 'RL^2', 'author': 'Duan et al. 2017'},
 {'title': 'LSH: Locali