## AutoGPT example: finding winning marathon times

* Implementation of https://github.com/Significant-Gravitas/Auto-GPT
* Built with LangChain primitives (LLMs, PromptTemplates, VectorStores, Embeddings, Tools)

In [1]:
# Install the bs4 package
!pip install bs4

# Install the nest_asyncio package
!pip install nest_asyncio

In [2]:
# General
import asyncio  # asynchronous I/O
import os  # operating-system utilities

import nest_asyncio  # lets asyncio event loops be nested (needed inside Jupyter)
import pandas as pd
from langchain.docstore.document import Document
from langchain_experimental.agents.agent_toolkits.pandas.base import (
    create_pandas_dataframe_agent,
)
from langchain_experimental.autonomous_agents import AutoGPT
from langchain_openai import ChatOpenAI

# Jupyter already runs an asyncio event loop, so patch asyncio to allow nested loops
nest_asyncio.apply()

In [3]:


# Create the chat model: "gpt-4" with temperature 1.0
llm = ChatOpenAI(model="gpt-4", temperature=1.0)

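### Set up tools

* We'll set up an AutoGPT with a `search` tool, a `write-file` tool, a `read-file` tool, a web-browsing tool, and a tool for interacting with a CSV file through a Python REPL.

Define any other `tools` you want to use below: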

In [4]:
# Tools
import os
from contextlib import contextmanager
from typing import Optional

from langchain.agents import tool
from langchain_community.tools.file_management.read import ReadFileTool
from langchain_community.tools.file_management.write import WriteFileTool

ROOT_DIR = "./data/"

# Context manager that temporarily changes the current working directory
@contextmanager
def pushd(new_dir):
    """Context manager for changing the current working directory."""
    prev_dir = os.getcwd()
    os.chdir(new_dir)
    try:
        yield
    finally:
        os.chdir(prev_dir)

# Tool that lets the agent analyze a CSV file through a pandas dataframe agent.
# The docstring below is what the agent sees as the tool description.
@tool
def process_csv(
    csv_file_path: str, instructions: str, output_path: Optional[str] = None
) -> str:
    """Process a CSV with pandas in a limited REPL.\
 Only use this after writing data to disk as a csv file.\
 Any figures must be saved to disk to be viewed by the human.\
 Instructions should be written in natural language, not code. Assume the dataframe is already loaded."""
    with pushd(ROOT_DIR):
        try:
            df = pd.read_csv(csv_file_path)
        except Exception as e:
            return f"Error: {e}"
        agent = create_pandas_dataframe_agent(llm, df, max_iterations=30, verbose=True)
        if output_path is not None:
            instructions += f" Save output to disk at {output_path}"
        try:
            result = agent.run(instructions)
            return result
        except Exception as e:
            return f"Error: {e}"
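The `pushd` helper is what confines `process_csv` to `ROOT_DIR`: relative CSV paths resolve inside that directory, and the previous working directory is restored even if the body raises. A minimal sketch of that behaviour, using a temporary directory as a hypothetical stand-in for `./data/` (the `cwd_inside` helper is illustrative only and not part of the notebook):

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def pushd(new_dir):
    """Temporarily change the working directory, restoring it on exit."""
    prev_dir = os.getcwd()
    os.chdir(new_dir)
    try:
        yield
    finally:
        os.chdir(prev_dir)


def cwd_inside(target_dir):
    """Return the working directory observed while inside pushd(target_dir)."""
    with pushd(target_dir):
        return os.getcwd()


start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as scratch:  # stand-in for ./data/ (assumption)
    inside = cwd_inside(scratch)
    same_place = os.path.samefile(inside, scratch)  # we really were in scratch
restored = os.getcwd() == start_dir  # and we came back afterwards
print(same_place, restored)
```

Note that `ROOT_DIR` must exist before the tool is called, since `os.chdir` raises `FileNotFoundError` for a missing directory.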

**Browse a web page with Playwright**

In [5]:
# Install the playwright package
!pip install playwright
# Download the browser binaries that Playwright drives
!playwright install

In [6]:
# Load a page in a headless browser and return its visible text
async def async_load_playwright(url: str) -> str:
    """Load the specified URL using Playwright and parse it using BeautifulSoup."""
    from bs4 import BeautifulSoup
    from playwright.async_api import async_playwright

    results = ""
    # Open an async Playwright context
    async with async_playwright() as p:
        # Launch a headless Chromium browser
        browser = await p.chromium.launch(headless=True)
        try:
            # Open a new page
            page = await browser.new_page()
            # Navigate to the URL
            await page.goto(url)

            # Grab the rendered HTML
            page_source = await page.content()
            # Parse it with BeautifulSoup
            soup = BeautifulSoup(page_source, "html.parser")

            # Drop script and style tags so only visible text remains
            for script in soup(["script", "style"]):
                script.extract()

            # Extract the text, strip each line, and split on double spaces
            text = soup.get_text()
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            # Join the non-empty chunks, one per line
            results = "\n".join(chunk for chunk in chunks if chunk)
        except Exception as e:
            # On failure, return the error message as the result
            results = f"Error: {e}"
        # Close the browser
        await browser.close()
    return results


# Helper that runs a coroutine to completion on the current event loop
def run_async(coro):
    event_loop = asyncio.get_event_loop()
    return event_loop.run_until_complete(coro)


# Expose the page loader to the agent as a tool
@tool
def browse_web_page(url: str) -> str:
    """Verbose way to scrape a whole webpage. Likely to cause issues parsing."""
    # Run the async loader synchronously
    return run_async(async_load_playwright(url))
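The whitespace clean-up at the end of `async_load_playwright` is easy to misread, so here is the same three-step logic isolated on a plain string. This is a sketch: the `clean_text` name and the sample input are made up for illustration, and no browser or HTML parsing is involved.

```python
def clean_text(raw_text: str) -> str:
    """Collapse page text the same way the loader above does."""
    # 1. Strip leading/trailing whitespace from every line
    lines = (line.strip() for line in raw_text.splitlines())
    # 2. Split each line on double spaces, which often separate table cells
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # 3. Drop empty chunks and put each remaining chunk on its own line
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample: padded lines, a blank line, and a wide gap between cells
sample = "  Results  \n\n   Year    Winner  \n"
cleaned = clean_text(sample)
print(cleaned)
```

The three chunks `Results`, `Year`, and `Winner` each end up on their own line. Splitting on two spaces means single-spaced phrases stay together, while column-like gaps become separate lines.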

**Q&A over a webpage**

Help the model ask more directed questions of web pages so that the page content does not clutter its memory.

In [7]:
from langchain.chains.qa_with_sources.loading import (
    BaseCombineDocumentsChain,
    load_qa_with_sources_chain,
)
from langchain.tools import BaseTool, DuckDuckGoSearchRun
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pydantic import Field

# Factory for the text splitter used by the tool
def _get_text_splitter():
    return RecursiveCharacterTextSplitter(
        # Set a really small chunk size, just to show.
        chunk_size=500,
        chunk_overlap=20,
        length_function=len,
    )

class WebpageQATool(BaseTool):
    name: str = "query_webpage"
    description: str = (
        "Browse a webpage and retrieve the information relevant to the question."
    )
    text_splitter: RecursiveCharacterTextSplitter = Field(
        default_factory=_get_text_splitter
    )
    qa_chain: BaseCombineDocumentsChain

    def _run(self, url: str, question: str) -> str:
        """Useful for browsing websites and scraping the text information."""
        result = browse_web_page.run(url)  # the tool defined in the Playwright cell above
        docs = [Document(page_content=result, metadata={"source": url})]
        web_docs = self.text_splitter.split_documents(docs)
        results = []
        # TODO: Do this with a MapReduceChain
        for i in range(0, len(web_docs), 4):
            input_docs = web_docs[i : i + 4]
            window_result = self.qa_chain(
                {"input_documents": input_docs, "question": question},
                return_only_outputs=True,
            )
            results.append(f"Response from window {i} - {window_result}")
        results_docs = [
            Document(page_content="\n".join(results), metadata={"source": url})
        ]
        return self.qa_chain(
            {"input_documents": results_docs, "question": question},
            return_only_outputs=True,
        )

    async def _arun(self, url: str, question: str) -> str:
        raise NotImplementedError
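`_run` answers the question over windows of four chunks at a time and then asks the same question once more over the collected window answers. The windowing step can be sketched on its own; the `windows` helper and the placeholder chunk names below are illustrative and not part of LangChain.

```python
from typing import List, Sequence, Tuple


def windows(items: Sequence[str], size: int = 4) -> List[Tuple[int, List[str]]]:
    """Split items into consecutive windows, keeping each window's start index."""
    return [
        (start, list(items[start : start + size]))
        for start in range(0, len(items), size)
    ]


# Ten placeholder chunks standing in for the split web documents
chunks = [f"chunk-{n}" for n in range(10)]
batches = windows(chunks, size=4)
for start, batch in batches:
    print(start, batch)
```

With ten chunks the windows start at 0, 4, and 8, and the last one holds only the two leftover chunks. This is why the `Response from window {i}` labels in `_run` count in steps of four rather than one.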

In [8]:
# WebpageQATool is defined in the previous cell, so no import is needed.
# Build the QA-with-sources chain
qa_chain = load_qa_with_sources_chain(llm)

# Create the tool; its fields must be passed by keyword
query_website_tool = WebpageQATool(qa_chain=qa_chain)

### Set up memory

* The memory here is used for the agent's intermediate steps.

In [9]:
# Memory
import faiss
from langchain.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Embedding model used to vectorize the agent's intermediate steps
embeddings_model = OpenAIEmbeddings()

# Dimensionality of the embedding vectors
embedding_size = 1536

# Flat index that compares vectors by L2 distance
index = faiss.IndexFlatL2(embedding_size)

# Vector store that keeps the embeddings in the index and the documents in memory
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
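To make the role of the flat L2 index concrete, here is a pure-Python sketch of a brute-force nearest-neighbour lookup by Euclidean distance. It is not FAISS's implementation (FAISS is heavily optimized, and to my knowledge reports squared distances, which rank results identically); the `l2_search` helper and the 3-dimensional toy vectors are assumptions made for illustration, whereas the notebook's vectors have 1536 dimensions.

```python
import math
from typing import List, Sequence, Tuple


def l2_search(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return the k (index, distance) pairs closest to query by Euclidean distance."""
    scored = [(i, math.dist(query, vec)) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "memory" of three stored vectors (hypothetical values)
stored = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 4.0],
]
hits = l2_search([1.0, 0.0, 0.0], stored, k=2)
print(hits)
```

The query matches the second stored vector exactly (distance 0.0) and is one unit from the first. The retriever's `search_kwargs={"k": 8}` setting plays the role of `k` here: it caps how many stored steps are pulled back into the prompt.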

### Set up the model and AutoGPT

`Model set-up`

In [10]:
# Install the duckduckgo_search package
!pip install duckduckgo_search
# Create the DuckDuckGo search tool (the class was imported in the Q&A cell above)
web_search = DuckDuckGoSearchRun()

In [11]:
# Tools available to the agent
tools = [
    web_search,  # web search
    WriteFileTool(root_dir="./data"),  # write files under ./data
    ReadFileTool(root_dir="./data"),  # read files under ./data
    process_csv,  # analyze CSV files
    query_website_tool,  # question answering over a web page
    # HumanInputRun(), # Activate if you want to permit asking a human for help
]

In [12]:
# AutoGPT was already imported from langchain_experimental.autonomous_agents above.
# Create the agent with:
# ai_name: the agent's name, "Tom" here
# ai_role: the agent's role, "Assistant" here
# tools: the tool list defined above
# llm: the chat model defined above
# memory: a retriever over the vector store that returns the top k=8 matches
# human_in_the_loop: set to True to give feedback at every step
agent = AutoGPT.from_llm_and_tools(
    ai_name="Tom",
    ai_role="Assistant",
    tools=tools,
    llm=llm,
    memory=vectorstore.as_retriever(search_kwargs={"k": 8}),
    # human_in_the_loop=True, # Set to True if you want to add feedback at each step.
)

# Uncomment to print the chain's verbose output
# agent.chain.verbose = True

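### AutoGPT for querying the web

I've spent a lot of time over the years crawling data sources and cleaning data. Let's see if AutoGPT can help with this!

Here is the prompt for looking up recent Boston Marathon times and converting them to tabular form.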

In [13]:
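# Run the agent on a list of goals written in natural language
agent.run(
    [
        "What were the winning boston marathon times for the past 5 years (ending in 2022)? Generate a table of the year, name, country of origin, and times."
    ]
)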

{
    "thoughts": {
        "text": "I need to find the winning Boston Marathon times for the past 5 years. I can use the DuckDuckGo Search command to search for this information.",
        "reasoning": "Using DuckDuckGo Search will help me gather information on the winning times without complications.",
        "plan": "- Use DuckDuckGo Search to find the winning Boston Marathon times\n- Generate a table with the year, name, country of origin, and times\n- Ensure there are no legal complications",
        "criticism": "None",
        "speak": "I will use the DuckDuckGo Search command to find the winning Boston Marathon times for the past 5 years."
    },
    "command": {
        "name": "DuckDuckGo Search",
        "args": {
            "query": "winning Boston Marathon times for the past 5 years ending in 2022"
        }
    }
}
{
    "thoughts": {
        "text": "The DuckDuckGo Search command did not provide the specific information I need. I must switch my approach and use query_w

'I have generated the table with the winning Boston Marathon times for the past 5 years. Task complete.'
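Each step printed above is a JSON object with a `thoughts` section and a `command` section naming the tool to call and its arguments. The sketch below shows how such a response can be read with the standard `json` module. It is an illustration under stated assumptions rather than the parser LangChain uses internally: `parse_command` is a hypothetical helper, and the response text is an abbreviated copy of the first step in the output.

```python
import json
from typing import Any, Dict, Tuple

# Abbreviated copy of the first agent response shown in the output above
response_text = """
{
    "thoughts": {
        "text": "I need to find the winning Boston Marathon times for the past 5 years.",
        "criticism": "None"
    },
    "command": {
        "name": "DuckDuckGo Search",
        "args": {
            "query": "winning Boston Marathon times for the past 5 years ending in 2022"
        }
    }
}
"""


def parse_command(text: str) -> Tuple[str, Dict[str, Any]]:
    """Extract the tool name and its arguments from one agent response."""
    parsed = json.loads(text)
    command = parsed["command"]
    return command["name"], command.get("args", {})


tool_name, tool_args = parse_command(response_text)
print(tool_name)
print(tool_args["query"])
```

The command name corresponds to one of the tools passed to the agent, so a truncated or malformed response like the second one in the output would fail at `json.loads` instead of dispatching a tool call.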