<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/agent/Chatbot_SEC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


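# 💬🤖 How to Build a Chatbot

LlamaIndex serves as a bridge between your data and Large Language Models (LLMs), providing a toolkit for building a query interface around your data for tasks such as question answering and summarization.

In this tutorial, we walk through building a context-augmented chatbot using a [Data Agent](https://gpt-index.readthedocs.io/en/stable/core_modules/agent_modules/agents/root.html). This agent, powered by LLMs, can intelligently execute tasks over your data. The end result is a chatbot agent equipped with the data interface tools provided by LlamaIndex to answer queries about your data.

**Note**: This tutorial builds on earlier work on creating a query interface over SEC 10-K filings - [check it out here](https://medium.com/@jerryjliu98/how-unstructured-and-llamaindex-can-help-bring-the-power-of-llms-to-your-own-data-3657d063e30d).

### Context

In this guide, we build a "10-K Chatbot" that uses raw UBER 10-K HTML filings from Dropbox. Users can interact with the chatbot to ask questions about the 10-K filings.


### Preparation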


In [None]:
%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

import nest_asyncio

nest_asyncio.apply()
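
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reuses a key that is already in the environment and prompts for one only when it is missing. The function name `ensure_api_key` and its parameters are hypothetical and not part of LlamaIndex or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, prompting only when it is missing.

    `env` and `prompt` are injectable so the helper can be exercised
    without touching the real environment or blocking on user input.
    """
    key = env.get(name)
    if not key:
        # getpass hides the typed value, so the key does not end up in cell output
        key = prompt(f"Enter {name}: ")
        env[name] = key
    return key


# In the notebook, call it once before creating any OpenAI-backed objects:
# ensure_api_key()
```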

In [None]:
# Enable text wrapping in cell output
from IPython.display import HTML, display

def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

### Ingest Data

Let's first download the raw 10-K files for 2019 through 2022.


In [None]:
# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data
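
The `!` commands above only work inside a notebook and rely on `wget` and `unzip` being installed. As a rough equivalent for a plain Python script, the sketch below uses only the standard library. The helper names `download` and `extract` are hypothetical, and the actual download is left commented out because it needs network access.

```python
import urllib.request
import zipfile
from pathlib import Path


def download(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(zip_path: Path, out_dir: Path) -> list[str]:
    """Extract every member of `zip_path` into `out_dir`; return member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


# Usage mirroring the shell commands (requires network access):
# archive = download(
#     "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1",
#     Path("data/UBER.zip"),
# )
# extract(archive, Path("data"))
```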

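To parse the HTML files into formatted text, we use the [Unstructured](https://github.com/Unstructured-IO/unstructured) library. Thanks to [LlamaHub](https://llamahub.ai/), we can integrate with Unstructured directly, which lets us convert any text into a Document format that LlamaIndex can ingest.

The `llama-index-readers-file` package installed at the top of this notebook provides `UnstructuredReader`, which we use to parse the HTML files into a list of `Document` objects.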


In [None]:
from llama_index.readers.file import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each document
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)
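
Note that assigning a new dict to `d.metadata` replaces whatever metadata was there before. If the reader attaches fields of its own that you want to keep, merging is an option. The sketch below illustrates the difference with a hypothetical stand-in class `Doc` rather than the real `Document` type, and the `filename` key is an assumed example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document that carries a metadata dict."""

    text: str
    metadata: dict = field(default_factory=dict)


# Assume the loader already attached some metadata of its own
docs = [Doc("annual report text", {"filename": "UBER_2020.html"})]

for d in docs:
    # update() adds the year while keeping keys that were already present;
    # plain assignment (d.metadata = {...}) would drop "filename"
    d.metadata.update({"year": 2020})

print(docs[0].metadata)
```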

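### Setting up Vector Indices for each year

We first set up a vector index for each year. Each vector index lets us ask questions about the 10-K filing of a given year.

We build each index and save it to disk.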


In [None]:
# initialize simple vector indices
# NOTE: don't run this cell if the indices are already loaded!
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.chunk_size = 512
Settings.chunk_overlap = 64
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")

To load an index from disk, do the following:


In [None]:
# Load indices from disk
from llama_index.core import StorageContext, load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
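
Since the build cell and the load cell are alternatives, it can help to check which years already have a persisted index before deciding which one to run. The helper below is a minimal sketch using only the standard library. `split_by_persisted` is a hypothetical name, and treating any non-empty `./storage/{year}` directory as a saved index is an assumption, not a check performed by LlamaIndex.

```python
from pathlib import Path


def split_by_persisted(years, root="./storage"):
    """Split `years` into (already persisted, still to build)."""
    persisted, missing = [], []
    for year in years:
        year_dir = Path(root) / str(year)
        # Assumption: a directory containing at least one entry is a saved index
        if year_dir.is_dir() and any(year_dir.iterdir()):
            persisted.append(year)
        else:
            missing.append(year)
    return persisted, missing


# Usage: build only the years in `to_build`, load the years in `to_load`
# to_load, to_build = split_by_persisted([2022, 2021, 2020, 2019])
```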

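### Setting up a Sub Question Query Engine to Synthesize Answers Across 10-K Filings

Since we have access to documents from 4 years, we may want to ask questions not only about the 10-K filing of a given year, but also questions that require analysis over all the 10-K filings.

To address this, we can use a [Sub Question Query Engine](https://gpt-index.readthedocs.io/en/stable/examples/query_engine/sub_question_query_engine.html). It decomposes a query into subqueries, each answered by an individual vector index, and synthesizes the results to answer the overall query.

LlamaIndex provides some wrappers around indices (and query engines) so that they can be used by query engines and agents. First we define a `QueryEngineTool` for each vector index.
Each tool has a name and a description; these are what the LLM agent sees when deciding which tool to choose.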


In [None]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=(
                "useful for when you want to answer queries about the"
                f" {year} SEC 10-K for Uber"
            ),
        ),
    )
    for year in years
]
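
Because the name and description are the only text the agent sees for each tool, it is worth reading them as the agent would. The sketch below rebuilds the same strings as plain dictionaries so they can be printed without creating any indices. `tool_cards` is a hypothetical name used only for this illustration.

```python
years = [2022, 2021, 2020, 2019]

# Same name/description pattern as the tools above, as plain dicts
tool_cards = [
    {
        "name": f"vector_index_{year}",
        "description": (
            "useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber"
        ),
    }
    for year in years
]

for card in tool_cards:
    print(f"{card['name']}: {card['description']}")
```

The year is the only thing that distinguishes one description from the next, so a query has to mention a year for the agent to have a reason to prefer one tool over another.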

Now we can create the Sub Question Query Engine, which lets us synthesize answers across the 10-K filings. We pass in the `individual_query_engine_tools` defined above.


In [None]:
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)
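
The decompose, answer, and synthesize flow described above can be illustrated with a toy sketch. This is not how `SubQuestionQueryEngine` is implemented: the real engine relies on an LLM to generate sub-questions and to synthesize the final answer, whereas everything below is a stub with hypothetical names.

```python
def make_stub_tool(name):
    """Return a stand-in for a per-year query engine (no LLM involved)."""

    def answer(question):
        return f"[{name}] stub answer to: {question}"

    return answer


def decompose(query, tool_names):
    """Toy decomposition: one sub-question per tool.

    A real engine would ask an LLM which tools are relevant and how to
    phrase each sub-question.
    """
    return [(name, f"{query} (source: {name})") for name in tool_names]


def answer_across_tools(query, tools):
    """Decompose, answer each sub-question with its tool, then combine."""
    partial_answers = [
        tools[name](sub_question)
        for name, sub_question in decompose(query, list(tools))
    ]
    # Synthesis here is plain concatenation; a real engine would use an LLM
    return "\n".join(partial_answers)


tools = {
    f"vector_index_{year}": make_stub_tool(f"vector_index_{year}")
    for year in (2022, 2021)
}
print(answer_across_tools("What are the risk factors?", tools))
```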

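### Setting up the Chatbot Agent

We use a LlamaIndex Data Agent to set up the outer chatbot agent, which has access to a set of tools. Specifically, we use an `OpenAIAgent`, which takes advantage of OpenAI API function calling. We want to give it the individual tools we defined earlier for each index (one per year), as well as a tool for the sub question query engine defined above.

First we define a `QueryEngineTool` for the sub question query engine: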


In [None]:
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description=(
            "useful for when you want to answer queries that require analyzing"
            " multiple SEC 10-K documents for Uber"
        ),
    ),
)

Then, we combine the tools defined above into a single list of tools for the agent:


In [None]:
tools = individual_query_engine_tools + [query_engine_tool]

Finally, we call `OpenAIAgent.from_tools` to create the agent, passing in the list of tools defined above.


In [None]:
from llama_index.agent.openai import OpenAIAgent

agent = OpenAIAgent.from_tools(tools, verbose=True)

### Testing the Agent

We can now test the agent with various queries.

If we test it with a simple "hello" query, the agent does not use any tools.


In [None]:
response = agent.chat("hi, i am bob")
print(str(response))

Added user message to memory: hi, i am bob
Hello Bob! How can I assist you today?


If we test it with a query about the 10-K of a given year, the agent uses the relevant vector index tool.


In [None]:
response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))

Added user message to memory: What were some of the biggest risk factors in 2020 for Uber?
=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "biggest risk factors"
}
Got output: The biggest risk factors mentioned in the context are:

1. The adverse impact of the COVID-19 pandemic and actions taken to mitigate it on the business.
2. The potential reclassification of drivers as employees, workers, or quasi-employees instead of independent contractors.
3. Intense competition in the mobility, delivery, and logistics industries.
4. The need to lower fares or service fees and offer driver incentives and consumer discounts to remain competitive.
5. Significant losses incurred and the uncertainty of achieving profitability.
6. Difficulty in attracting and maintaining a critical mass of platform users.
7. Operational, compliance, and cultural challenges.
8. Negative media coverage and reputation issues.
9. Inability to optimize organizational structure or mana

Finally, if we test it with a query that compares and contrasts risk factors across years, the agent uses the sub question query engine tool.


In [None]:
cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across"
    " years. Give answer in bullet points."
)

response = agent.chat(cross_query_str)
print(str(response))

Added user message to memory: Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points.
=== Calling Function ===
Calling function: sub_question_query_engine with args: {
  "input": "Compare/contrast the risk factors described in the Uber 10-K across years"
}
Generated 4 sub questions.
[vector_index_2022] Q: What are the risk factors described in the 2022 SEC 10-K for Uber?
[vector_index_2021] Q: What are the risk factors described in the 2021 SEC 10-K for Uber?
[vector_index_2020] Q: What are the risk factors described in the 2020 SEC 10-K for Uber?
[vector_index_2019] Q: What are the risk factors described in the 2019 SEC 10-K for Uber?
[vector_index_2022] A: The risk factors described in the 2022 SEC 10-K for Uber are not provided in the given context information.
[vector_index_2021] A: 

### Setting up the Chatbot Loop

Now that the chatbot is set up, it only takes a few more steps to build a basic interactive loop for chatting with our SEC-augmented chatbot!


In [None]:

agent = OpenAIAgent.from_tools(tools)  # verbose=False by default

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = agent.chat(text_input)
    print(f"Agent: {response}")

# User: What were some of the legal proceedings against Uber in 2022?

Agent: In 2022, Uber is facing several legal proceedings. Here are some of them:

1. California: The state Attorney General and city attorneys filed a complaint against Uber and Lyft, alleging that drivers are misclassified as independent contractors. A preliminary injunction was issued but stayed pending appeal. The Court of Appeal affirmed the lower court's ruling, and Uber filed a petition for review with the California Supreme Court. However, the Supreme Court declined the petition for review. The lawsuit is ongoing, focusing on claims by the California Attorney General for periods prior to the enactment of Proposition 22.

2. Massachusetts: The Attorney General of Massachusetts filed a complaint against Uber, alleging that drivers are employees entitled to wage and labor law protections. Uber's motion to dismiss the complaint was denied, and a summary judgment motion is pending.

3. New York: Uber is facing allegations of misclassification and employment violations by the state At
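
The interactive loop above is hard to exercise without a live agent and a keyboard. As a minimal sketch, the variant below takes the chat function and the input/output functions as parameters, so it can be driven by a stub. It also stops cleanly when input ends and skips blank lines. The name `chat_loop` and its parameters are hypothetical.

```python
def chat_loop(chat, read=input, write=print, exit_word="exit"):
    """Relay user input to `chat` until `exit_word` or end of input.

    Returns the number of messages that were sent to `chat`.
    """
    turns = 0
    while True:
        try:
            text = read("User: ").strip()
        except EOFError:
            break  # stdin closed, for example when piped input runs out
        if text.lower() == exit_word:
            break
        if not text:
            continue  # skip blank lines instead of sending them to the agent
        write(f"Agent: {chat(text)}")
        turns += 1
    return turns


# With the agent defined earlier in this notebook:
# chat_loop(agent.chat)
```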