# Chapter 2: Advanced RAG Pipelines

In [1]:
import utils

import os
import openai
openai.api_key = utils.get_openai_api_key()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


Load the text data.

In [2]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["data/人工智能.pdf"]
).load_data()

In [3]:
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

<class 'list'> 

7 

<class 'llama_index.schema.Document'>
Doc ID: f16f5ce4-8920-458e-8afc-3ef8add87cd8
Text: 2/2/24, 2:43 PM ⼈⼯智能  - 维基百科，⾃由的百科全书
https://zh.wikipedia.org/wiki/ ⼈⼯智能 2/13“⼈⼯智能”的各地常⽤名称 中国⼤陆⼈⼯智能 台湾⼈⼯智慧
港澳⼈⼯智能 新⻢⼈⼯智能、⼈⼯智慧 ⽇韩⼈⼯知能 越南智慧⼈造 [展开] [展开] [展开] [展开] [展开] [展开]⼈⼯智能系列内容
主要⽬标 实现⽅式 ⼈⼯智能哲学 历史 技术 术语⼈⼯智能（英语：artiﬁcial intelligence ，缩写为
AI）亦称机器智能，指由⼈制造出来的机器所表现出来的智能。通常⼈⼯
智能是指⽤普通计算机程序来呈现⼈类智能的技术。该词也指出研究这样的智能系统是否能够实现，以及如何实现。同 时，通过 医学 、神经科学
、机器⼈学 及...


In [34]:
from llama_index import SimpleDirectoryReader

documents_en = SimpleDirectoryReader(
    input_files=["data/eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

In [35]:
print(type(documents_en), "\n")
print(len(documents_en), "\n")
print(type(documents_en[0]))
print(documents_en[0])

<class 'list'> 

41 

<class 'llama_index.schema.Document'>
Doc ID: 9dde8137-3f09-4c32-a4ab-01c4c3543f76
Text: PAGE 1Founder, DeepLearning.AICollected Insights from Andrew Ng
How to  Build Your Career in AIA Simple Guide


## 1. Basic RAG Pipeline

Here the text of every document in documents is joined into a single string, and one Document instance is created from it to represent the whole document collection.

In [36]:
from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

In [5]:
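# Replace Chinese punctuation with English punctuation to simplify later processing
# This step can be skipped for English documents
# Without it, Chinese sentences are not split correctly, which affects the sentence_window size and can make the input exceed gpt-3.5-turbo's maximum length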
document.text=document.text.replace('。','. ')
document.text=document.text.replace('！','! ')
document.text=document.text.replace('？','? ')
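To make the reason for this replacement concrete, here is a minimal sketch. It assumes a splitter that only recognizes ASCII sentence terminators; `naive_sentence_split` is a hypothetical stand-in written for illustration, not the splitter the library actually uses.

```python
import re

# Mapping taken from the cell above: full-width mark -> ASCII mark plus a space.
PUNCTUATION_MAP = {"。": ". ", "！": "! ", "？": "? "}


def normalize_punctuation(text: str) -> str:
    """Replace full-width sentence terminators with their ASCII equivalents."""
    for full_width, ascii_mark in PUNCTUATION_MAP.items():
        text = text.replace(full_width, ascii_mark)
    return text


def naive_sentence_split(text: str) -> list:
    """Hypothetical splitter that only breaks after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "人工智能是什么？它是机器表现出的智能。"
before = naive_sentence_split(sample)
after = naive_sentence_split(normalize_punctuation(sample))
print(len(before), len(after))
```

Without normalization the whole sample stays one "sentence"; after normalization it splits in two, which keeps each sentence window small.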

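llm: an instance of the GPT-3.5-turbo model, created with the OpenAI class, with the temperature set to 0.1.  
service_context: a service context created with the ServiceContext class, holding the GPT-3.5-turbo model above and the specified embedding model.  
index: a vector store index built with VectorStoreIndex.from_documents from the document and the service context created above.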

In [6]:
from llama_index import VectorStoreIndex
from llama_index import ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-zh-v1.5"
)
index = VectorStoreIndex.from_documents([document],
                                        service_context=service_context)

In [None]:
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en-v1.5"
)
index = VectorStoreIndex.from_documents([document],
                                        service_context=service_context)

Convert the vector store index created above into a query engine so that it can be queried.

In [7]:
query_engine = index.as_query_engine()

Use the query engine to run a query for the given question.

In [8]:
response = query_engine.query(
    "在寻找项目以积累经验时应采取哪些步骤?"
)
print(str(response))

在寻找项目以积累经验时，应该首先确定项目的目标和所需实现的目标，然后建立一个可预测的世界模型，以便选择最有效的行为。在这个过程中，需要定期检查世界模型的状态是否与预测相符，如果不符合，则需要调整计划。最终，通过合作和竞争的方式，利用演化算法和群体智能来达成整体的突现行为目标。
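Conceptually, a query against a vector index embeds the question, ranks the stored chunks by similarity, and hands the best matches to the LLM as context. The sketch below illustrates only the retrieval step. The bag-of-words `embed` function and the sample chunks are assumptions made for illustration; the notebook uses a neural embedding model instead.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real pipeline uses a neural model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list, top_k: int = 2) -> list:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical chunks, not taken from the PDFs used in this chapter.
chunks = [
    "pick small projects to build experience",
    "strong ai claims machines can think",
    "networking helps you find a job",
]
top = retrieve("how to find projects to build experience", chunks, top_k=1)
print(top)
```

The retrieved chunks would then be placed into the prompt together with the question before the LLM is called.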


In [38]:
response_en = query_engine.query(
    "What are steps to take when finding projects to build your experience?"
)
print(str(response_en))

Establishing clear goals and developing a predictable world model are essential steps for intelligent agents to pursue and achieve their objectives. They must be able to adapt and change their plans based on the alignment of the world model with their predictions. In a multi-agent setting, utilizing evolutionary algorithms and collective intelligence can help in achieving emergent behavioral goals.


## 2. Evaluation with TruLens

In [9]:
eval_questions = []
with open('data/eval_questions.txt', 'r') as file:
    for line in file:
        # Strip surrounding whitespace, including the trailing newline
        item = line.strip()
        print(item)
        eval_questions.append(item)

人工智能中的先验知识是如何被存储的？
人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？
管理者如何管理AI？
强人工智能是什么？
人工智能被滥用带来的危害？
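The loading loop can be wrapped in a small helper. This is a sketch under two assumptions that go beyond the cell above: the file is UTF-8 encoded, and blank lines should be skipped. The name `load_questions` is hypothetical.

```python
import tempfile
from pathlib import Path


def load_questions(path: str, encoding: str = "utf-8") -> list:
    """Read one question per line, dropping blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding=encoding).splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demonstration on a temporary file rather than data/eval_questions.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "questions.txt"
    demo_path.write_text("强人工智能是什么？\n\n管理者如何管理AI？\n", encoding="utf-8")
    demo_questions = load_questions(str(demo_path))

print(demo_questions)
```

Passing the encoding explicitly avoids depending on the platform default when the file contains Chinese text.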


In [40]:
eval_questions_en = []
with open('data/eval_questions_en.txt', 'r') as file:
    for line in file:
        # Strip surrounding whitespace, including the trailing newline
        item = line.strip()
        print(item)
        eval_questions_en.append(item)

What are the keys to building a career in AI?
How can teamwork contribute to success in AI?
What is the importance of networking in AI?
What are some good habits to develop for a successful career?
How can altruism be beneficial in building a career?
What is imposter syndrome and how does it relate to AI?
Who are some accomplished individuals who have experienced imposter syndrome?
What is the first step to becoming good at AI?
What are some common challenges in AI?
Is it normal to find parts of AI challenging?


Add a custom question.

In [10]:
# You can try your own question:
new_question = "什么是适合我的人工智能工作?"
eval_questions.append(new_question)

In [11]:
eval_questions

['人工智能中的先验知识是如何被存储的？',
 '人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？',
 '管理者如何管理AI？',
 '强人工智能是什么？',
 '人工智能被滥用带来的危害？',
 '什么是适合我的人工智能工作?']

In [41]:
# You can try your own question:
new_question_en = "What is the right AI job for me?"
eval_questions_en.append(new_question_en)
eval_questions_en

['What are the keys to building a career in AI?',
 'How can teamwork contribute to success in AI?',
 'What is the importance of networking in AI?',
 'What are some good habits to develop for a successful career?',
 'How can altruism be beneficial in building a career?',
 'What is imposter syndrome and how does it relate to AI?',
 'Who are some accomplished individuals who have experienced imposter syndrome?',
 'What is the first step to becoming good at AI?',
 'What are some common challenges in AI?',
 'Is it normal to find parts of AI challenging?',
 'What is the right AI job for me?']

Reset the TruLens database by calling reset_database(), which clears previously stored records and feedback data.

In [12]:
from trulens_eval import Tru
tru = Tru()

tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


Use the get_prebuilt_trulens_recorder helper to create a TruLens recorder (tru_recorder) bound to the given query engine (query_engine), with the application ID set to "Direct Query Engine".

In [13]:
from utils import get_prebuilt_trulens_recorder

tru_recorder = get_prebuilt_trulens_recorder(query_engine,
                                             app_id="Direct Query Engine")

Start recording with tru_recorder: iterate over the eval_questions list, run each question through the query engine, and record the engine's responses.

In [14]:
with tru_recorder as recording:
    for question in eval_questions:
        response = query_engine.query(question)

Retrieve the TruLens records and feedback data for later analysis and evaluation.

In [15]:
records, feedback = tru.get_records_and_feedback(app_ids=[])

In [16]:
records.head()

Unnamed: 0,app_id,app_json,type,record_id,input,output,tags,record_json,cost_json,perf_json,ts,Answer Relevance,Context Relevance,Groundedness,Answer Relevance_calls,Context Relevance_calls,Groundedness_calls,latency,total_tokens,total_cost
0,Direct Query Engine,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_962f7b2603939b57d5a7c8f5fb03f4da,"""\u4eba\u5de5\u667a\u80fd\u4e2d\u7684\u5148\u9...","""\u5148\u9a8c\u77e5\u8bc6\u53ef\u4ee5\u88ab\u5...",-,"{""record_id"": ""record_hash_962f7b2603939b57d5a...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-02-29T21:39:41.969744"", ""...",2024-02-29T21:39:45.585026,0.9,0.6,1.0,"[{'args': {'prompt': '人工智能中的先验知识是如何被存储的？', 're...","[{'args': {'prompt': '人工智能中的先验知识是如何被存储的？', 're...",[{'args': {'source': '[10] 早期的⼈⼯智能研究⼈员直接模仿⼈类进⾏...,3,0,0.0
1,Direct Query Engine,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_a49a8b79d123b9fbf62d6301336b9f89,"""\u4eba\u5de5\u667a\u80fd\u7684\u81ea\u6211\u6...","""\u4eba\u5de5\u667a\u80fd\u7684\u81ea\u6211\u6...",-,"{""record_id"": ""record_hash_a49a8b79d123b9fbf62...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-02-29T21:39:45.793005"", ""...",2024-02-29T21:39:48.647105,1.0,0.8,,[{'args': {'prompt': '人工智能的自我更新和自我提升是否可能导致其脱离人...,[{'args': {'prompt': '人工智能的自我更新和自我提升是否可能导致其脱离人...,,2,0,0.0
2,Direct Query Engine,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_865c5f006d9e297734bd8eae9ab6ed55,"""\u7ba1\u7406\u8005\u5982\u4f55\u7ba1\u7406AI\...","""Management of AI involves treating it as a te...",-,"{""record_id"": ""record_hash_865c5f006d9e297734b...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-02-29T21:39:48.812111"", ""...",2024-02-29T21:39:51.689651,0.8,0.6,,"[{'args': {'prompt': '管理者如何管理AI？', 'response':...","[{'args': {'prompt': '管理者如何管理AI？', 'response':...",,2,0,0.0
3,Direct Query Engine,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_21a2225e1b8ec41f41340b90e7c121a3,"""\u5f3a\u4eba\u5de5\u667a\u80fd\u662f\u4ec0\u4...","""\u5f3a\u4eba\u5de5\u667a\u80fd\u662f\u4e00\u7...",-,"{""record_id"": ""record_hash_21a2225e1b8ec41f413...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-02-29T21:39:51.856594"", ""...",2024-02-29T21:39:55.313402,1.0,,,"[{'args': {'prompt': '强人工智能是什么？', 'response': ...",,,3,0,0.0
4,Direct Query Engine,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_6c3b3bdfd78575e3d57dcf3ca470b4ca,"""\u4eba\u5de5\u667a\u80fd\u88ab\u6ee5\u7528\u5...","""The misuse of artificial intelligence technol...",-,"{""record_id"": ""record_hash_6c3b3bdfd78575e3d57...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-02-29T21:39:55.470704"", ""...",2024-02-29T21:39:59.171553,0.9,,,"[{'args': {'prompt': '人工智能被滥用带来的危害？', 'respons...",,,3,0,0.0
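Several feedback cells in the table above are empty, so any per-metric summary has to decide how to treat missing values. The sketch below averages the three feedback columns of the five rows shown while skipping missing entries. Ignoring missing values is an assumption about how such a summary might be computed, not a statement about how TruLens aggregates internally.

```python
import math

nan = float("nan")

# Scores copied from the five records printed above; nan marks an empty cell.
scores = {
    "Answer Relevance": [0.9, 1.0, 0.8, 1.0, 0.9],
    "Context Relevance": [0.6, 0.8, 0.6, nan, nan],
    "Groundedness": [1.0, nan, nan, nan, nan],
}


def mean_ignoring_missing(values: list) -> float:
    """Average the values that are present; return nan if none are."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else nan


summary = {
    name: round(mean_ignoring_missing(vals), 3) for name, vals in scores.items()
}
print(summary)
```

With only one Groundedness value available, that average rests on a single record and should be read with caution.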


Run the TruLens dashboard to visualize the evaluation results.

In [17]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.


Dashboard started at http://198.18.0.1:8501 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## 3. Advanced RAG Pipelines

### 3.1 Sentence-Window Retrieval

Create an instance of OpenAI's GPT-3.5-turbo language model:

In [18]:
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

Use the helper function build_sentence_window_index to create a sentence-window index:

In [42]:
from utils import build_sentence_window_index

sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-zh-v1.5",
    save_dir="sentence_index"
)

Use the helper function get_sentence_window_query_engine to get a query engine based on sentence windows:

In [26]:
from utils import get_sentence_window_query_engine

sentence_window_engine = get_sentence_window_query_engine(sentence_index)
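The internals of the two helpers are not shown here, so the following is only a sketch of the general idea behind sentence-window retrieval: match the query against individual sentences, then return the matched sentence together with its neighbors as context. The function names, the word-overlap scoring, and the sample text are all hypothetical.

```python
import re


def build_sentence_windows(text: str, window_size: int = 1) -> list:
    """Split text into sentences and attach a window of neighbors to each one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({"sentence": sentence, "window": " ".join(sentences[start:end])})
    return nodes


def retrieve_window(query: str, nodes: list) -> str:
    """Score on the single sentence, but return its wider window as context."""
    query_tokens = set(query.lower().split())

    def overlap(node: dict) -> int:
        words = node["sentence"].lower().rstrip(".!?").split()
        return len(query_tokens & set(words))

    best = max(nodes, key=overlap)
    return best["window"]


text = (
    "AI is a broad field. Start with a small project. "
    "Then grow its scope. Share your results."
)
nodes = build_sentence_windows(text, window_size=1)
context = retrieve_window("small project", nodes)
print(context)
```

Matching on single sentences keeps retrieval precise, while the window gives the LLM enough surrounding text to answer.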

Query a specific question and print the result:

In [27]:
window_response = sentence_window_engine.query(
    "如何开始人工智能个人项目?"
)
str(window_response)

'通过模仿人类思考模式，使用概率和经济学概念处理不确定或不完整的信息，寻找更有效的算法，并强调感知运动的重要性，可以开始一个人工智能个人项目。'

In [43]:
from utils import build_sentence_window_index

sentence_index_en = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index"
)

window_response_en = sentence_window_engine.query(
    "how do I get started on a personal project in AI?"
)
str(window_response_en)

'You can begin a personal project in AI by first selecting a specific subfield that interests you, such as vision, natural language processing, decision theory, genetic algorithms, or robotics. Once you have chosen a subfield, you can start by studying relevant textbooks and resources to build a foundational understanding. From there, you can experiment with implementing algorithms, working on small projects, and gradually increasing the complexity of your AI projects as you gain more experience and knowledge in the field.'

Reset the TruLens database, then use a TruLens recorder to evaluate the sentence-window query engine and record the query results:

In [45]:
tru.reset_database()

tru_recorder_sentence_window = get_prebuilt_trulens_recorder(
    sentence_window_engine,
    app_id = "Sentence Window Query Engine"
)

In [29]:
for question in eval_questions:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(str(response))

人工智能中的先验知识是如何被存储的？
人工智能中的先验知识是通过某种方式告知机器的知识，可以包括描述目标、特征、种类及对象之间的关系，描述事件、时间、状态、原因和结果等内容。
人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？
The self-updating and self-improving capabilities of artificial intelligence could potentially lead to it surpassing human control.
管理者如何管理AI？
Management should consider adjusting their work functions by relinquishing administrative tasks, focusing on enhancing their comprehensive judgment and creativity in the field of analysis and prediction, treating AI as a colleague to form a collaborative team, and acknowledging that AI technologies also have limitations and bottlenecks.
强人工智能是什么？
强人工智能是一种观点，认为计算机本身具有思维，而不仅仅是用来模拟人类思维的工具。
人工智能被滥用带来的危害？
The misuse of artificial intelligence could potentially lead to violations of copyright laws and other legal regulations. There have been cases where artificial intelligence technology has been used to remove mosaic from explicit videos or alter the appearance of individuals in videos. Additionally, experts have warned about the pote

In [46]:
for question in eval_questions_en:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(str(response))

What are the keys to building a career in AI?
Studying the characteristics of intelligence from relevant textbooks, understanding the concept of strong artificial intelligence commonly found in introductory materials, and gaining expertise in problem solving, puzzle solving, game playing, and deduction are key elements to building a career in AI.
How can teamwork contribute to success in AI?
Teamwork can contribute to success in AI by treating AI as a colleague and forming a collaborative team. This approach fosters synergy and cooperation, allowing for a more effective utilization of AI technologies in various tasks and projects.
What is the importance of networking in AI?
Networking in AI is crucial as it allows for the exchange of information and data between different AI systems, enabling them to learn from each other and improve their performance. This interconnectedness enhances the overall capabilities of AI by facilitating collaboration, sharing of knowledge, and collective pro

Validation error: 1 validation error for Rating
rating
  Value error, Rating must be between 0 and 10 [type=value_error, input_value=33, input_type=int]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
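The validation error above reports a rating of 33 being rejected because ratings must lie between 0 and 10. A minimal sketch of that kind of check follows. The function `parse_rating` is hypothetical and is not TruLens code; scaling the rating to the 0 to 1 range is an assumption chosen to match the scores shown in the records table.

```python
import re
from typing import Optional


def parse_rating(llm_text: str, low: int = 0, high: int = 10) -> Optional[float]:
    """Extract the first integer in the text and scale it to 0-1.

    Returns None when no integer is found or the value is out of range.
    """
    match = re.search(r"\d+", llm_text)
    if match is None:
        return None
    rating = int(match.group())
    if not low <= rating <= high:
        return None  # e.g. 33, the value reported in the error above
    return (rating - low) / (high - low)


print(parse_rating("Score: 9"))
print(parse_rating("33"))
```

Returning None for an invalid rating lets the caller skip that record instead of aborting the whole evaluation loop.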


Get the leaderboard of evaluation results:

In [49]:
tru.get_leaderboard(app_ids=[])

Unnamed: 0_level_0,Context Relevance,Answer Relevance,Groundedness,latency,total_cost
app_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Sentence Window Query Engine,0.018182,0.909091,0.288636,9.181818,0.0


In [50]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path:   Network URL: http://198.18.0.1:8501



<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

### 3.2 Auto-Merging Retrieval

In [51]:
from utils import build_automerging_index

automerging_index = build_automerging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-zh-v1.5",  # "local:BAAI/bge-small-en-v1.5" for english
    save_dir="merging_index"
)

In [52]:
from utils import get_automerging_query_engine

automerging_query_engine = get_automerging_query_engine(
    automerging_index,
)
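The helper internals are not shown here, so the following is only a sketch of the general idea behind auto-merging retrieval: documents are split into a hierarchy of parent and child chunks, and when enough children of the same parent are retrieved, they are replaced by the parent chunk. The function name, the node IDs, and the 0.5 threshold are assumptions made for illustration.

```python
from collections import defaultdict


def auto_merge(retrieved_ids: list, parent_of: dict, children_of: dict,
               threshold: float = 0.5) -> list:
    """Replace retrieved leaves with their parent when enough siblings were hit."""
    hits_by_parent = defaultdict(list)
    for node_id in retrieved_ids:
        hits_by_parent[parent_of[node_id]].append(node_id)

    merged = []
    for parent_id, hits in hits_by_parent.items():
        ratio = len(hits) / len(children_of[parent_id])
        if ratio > threshold:
            merged.append(parent_id)  # merge into one larger chunk
        else:
            merged.extend(hits)       # keep the individual leaves
    return merged


# Hypothetical two-level hierarchy: each parent has four leaf chunks.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)
```

Here three of the four children of P1 were retrieved, so they collapse into P1, while the single hit under P2 stays a leaf.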

In [53]:
auto_merging_response = automerging_query_engine.query(
    "如何开始人工智能个人项目?"
)
print(str(auto_merging_response))

Begin a personal project in artificial intelligence by first selecting a specific problem or application you are interested in exploring. Then, familiarize yourself with the various tools and technologies available for AI development. Next, design a plan outlining the steps you will take to implement your project, considering factors such as data collection, model building, and evaluation. Finally, start working on your project by coding and experimenting with different AI techniques to achieve your desired outcome.


In [None]:
tru.reset_database()

tru_recorder_automerging = get_prebuilt_trulens_recorder(automerging_query_engine,
                                                         app_id="Automerging Query Engine")

In [None]:
for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = automerging_query_engine.query(question)
        print(question)
        print(response)

> Merging 4 nodes into parent node.
> Parent node id: 1f9d8836-82de-4011-bc12-9c836bf0cc4b.
> Parent node text: [16]
⼈类解决问题的模式通常是⽤最快捷、直观的判断，⽽不是有意识的、⼀步⼀步的推导，早期⼈⼯智能研究通常使⽤逐步推导的⽅式。[17]⼈⼯智能研究已经
于这种“次表征性的”解决问题⽅法获取进展...

人工智能中的先验知识是如何被存储的？
In artificial intelligence, prior knowledge can be stored by providing the machine with knowledge in a certain way, which includes descriptions of goals, features, relationships between objects, events, time, states, reasons, results, or any knowledge that one wishes the machine to store. This stored prior knowledge can be combined with specific reasoning rules (such as logical reasoning) to derive new knowledge through intelligent inference.
人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？
Yes, the self-updating and self-improving capabilities of artificial intelligence could potentially lead to it surpassing human control.
> Merging 2 nodes into parent node.
> Parent node id: 8bc4d6be-e10d-40dc-8af3-5684cfc080d2.
> Parent node text: 依⽬前的研究⽅向，电脑⽆法突变、苏醒、产⽣⾃我意志，AI也不可能具有创意与智能、同

In [None]:
tru.get_leaderboard(app_ids=[])

Unnamed: 0_level_0,Context Relevance,Answer Relevance,Groundedness,latency,total_cost
app_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Automerging Query Engine,0.8625,0.88,0.783333,3.666667,0.0


In [None]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path:   Network URL: http://198.18.0.1:8501



<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>