# Question Answering with Langchain, Qdrant and OpenAI

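This notebook shows how to implement a question answering system with Langchain, using Qdrant as the knowledge base and OpenAI embeddings. If you are not familiar with Qdrant, it is better to check out the [Getting_started_with_Qdrant_and_OpenAI.ipynb](Getting_started_with_Qdrant_and_OpenAI.ipynb) notebook first.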

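This notebook walks through an end-to-end process:
1. Calculating the embeddings with the OpenAI API.
2. Storing the embeddings in a local Qdrant instance to build a knowledge base.
3. Converting a raw text query to an embedding with the OpenAI API.
4. Using Qdrant to perform a nearest neighbour search in the created collection to find some context.
5. Asking the LLM to find the answer in the given context.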

All of these steps are reduced to calls to the corresponding Langchain methods.
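
To make steps 1 to 4 concrete before any service is involved, here is a minimal, self-contained sketch of the same idea. It is not how OpenAI or Qdrant work internally: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the `index` list stands in for a Qdrant collection, and the sample documents are made up for illustration. Step 5 (calling the LLM) is left out.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up documents, used only for this illustration.
documents = [
    "Qdrant is a vector search engine",
    "Langchain chains LLM calls together",
    "Docker runs applications in containers",
]

# Steps 1-2: embed every document and keep the vectors (the "knowledge base").
index = [(doc, toy_embed(doc)) for doc in documents]


def search(query, k=1):
    """Steps 3-4: embed the query, then return the k most similar documents."""
    query_vector = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


print(search("what is a vector search engine"))
```

In the notebook itself, the embedding function is replaced by the OpenAI API and the list scan by a nearest neighbour search in Qdrant.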


## Prerequisites

For the purposes of this exercise we need to prepare a few things:

1. A Qdrant server instance. In our case, a local Docker container.
2. The [qdrant-client](https://github.com/qdrant/qdrant_client) library to interact with the vector database.
3. [Langchain](https://github.com/hwchase17/langchain) as a framework.
4. An [OpenAI API key](https://beta.openai.com/account/api-keys).

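### Start Qdrant server

We will use a local Qdrant instance running in a Docker container. The easiest way to launch it is to use the attached [docker-compose.yaml] file and run the following command: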


In [1]:
! docker-compose up -d


Starting qdrant_qdrant_1 ... done

We can verify that the server started successfully by running a simple curl command:


In [2]:
! curl http://localhost:6333


{"title":"qdrant - vector search engine","version":"1.0.1"}
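
The same check can be scripted from Python. The sketch below uses only the standard library; `parse_banner` and `qdrant_is_up` are hypothetical helper names, and the URL assumes the default local port used in the curl command above.

```python
import json
import urllib.request


def parse_banner(body):
    """Return (title, version) from the JSON body served at the Qdrant root URL."""
    info = json.loads(body)
    return info["title"], info["version"]


def qdrant_is_up(url="http://localhost:6333", timeout=5):
    """Return True if the server answers with a parseable banner, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            parse_banner(resp.read().decode("utf-8"))
        return True
    except (OSError, ValueError, KeyError):
        # OSError covers connection failures; ValueError and KeyError cover bad payloads.
        return False


# The sample body is the response shown in the curl output above.
print(parse_banner('{"title":"qdrant - vector search engine","version":"1.0.1"}'))
```

Calling `qdrant_is_up()` before indexing gives an early, readable failure instead of a connection error deep inside a Langchain call.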

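### Install requirements

This notebook requires the `openai`, `langchain` and `qdrant-client` packages. The command below also installs `wget`, which is used later to download the data.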


In [None]:
! pip install openai qdrant-client "langchain==0.0.100" wget


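### Prepare your OpenAI API key

The OpenAI API key is used to vectorize the documents and the queries.

If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).

Once you have your key, add it to your environment variables as `OPENAI_API_KEY` by running the following command: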


In [None]:
! export OPENAI_API_KEY="your API key"


In [4]:
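# Verify that your OpenAI API key is correctly set as an environment variable.
# Note: if you run this notebook locally, you need to reload your terminal and the notebook for the environment variable to take effect.
import os

# Note: alternatively, you can set a temporary environment variable like this: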
# os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")


OPENAI_API_KEY is ready
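
If the check reports that the variable is missing even after running the `export` cell, a likely cause is that `!` commands in a notebook run in a child shell, so a variable exported there never reaches the Python process running the notebook. The sketch below illustrates the underlying behaviour with a throwaway variable; `DEMO_API_KEY` is a hypothetical name chosen for the demo, not something any library reads.

```python
import os
import subprocess
import sys

NAME = "DEMO_API_KEY"  # hypothetical variable name, used only for this demo
os.environ.pop(NAME, None)

# A child process works on a copy of the environment; its changes do not propagate back.
subprocess.run(
    [sys.executable, "-c", f"import os; os.environ['{NAME}'] = 'set-in-child'"],
    check=True,
)
print(os.getenv(NAME))  # still unset in this process

# Setting the variable in the current process does take effect.
os.environ[NAME] = "set-in-parent"
print(os.getenv(NAME))
```

For the real key, this means either exporting `OPENAI_API_KEY` in the terminal before starting the notebook server, or assigning `os.environ["OPENAI_API_KEY"]` inside the notebook.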


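## Load data

In this section we load the data, which contains some natural questions and their answers. All of it will be used to create a Langchain application with Qdrant as the knowledge base.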


In [5]:
import wget

# All the examples come from https://ai.google.com/research/NaturalQuestions.
# This is a sample of the training set that we download and extract
# for further processing.
wget.download("https://storage.googleapis.com/dataset-natural-questions/questions.json")
wget.download("https://storage.googleapis.com/dataset-natural-questions/answers.json")


100% [..............................................................................] 95372 / 95372

'answers.json'

In [6]:
import json

with open("questions.json", "r") as fp:
    questions = json.load(fp)

with open("answers.json", "r") as fp:
    answers = json.load(fp)


In [7]:
print(questions[0])


when is the last episode of season 8 of the walking dead


In [8]:
print(answers[0])


No . overall No. in season Title Directed by Written by Original air date U.S. viewers ( millions ) 100 `` Mercy '' Greg Nicotero Scott M. Gimple October 22 , 2017 ( 2017 - 10 - 22 ) 11.44 Rick , Maggie , and Ezekiel rally their communities together to take down Negan . Gregory attempts to have the Hilltop residents side with Negan , but they all firmly stand behind Maggie . The group attacks the Sanctuary , taking down its fences and flooding the compound with walkers . With the Sanctuary defaced , everyone leaves except Gabriel , who reluctantly stays to save Gregory , but is left behind when Gregory abandons him . Surrounded by walkers , Gabriel hides in a trailer , where he is trapped inside with Negan . 101 `` The Damned '' Rosemary Rodriguez Matthew Negrete & Channing Powell October 29 , 2017 ( 2017 - 10 - 29 ) 8.92 Rick 's forces split into separate parties to attack several of the Saviors ' outposts , during which many members of the group are killed ; Eric is critically injure
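
The first question and the first answer shown above refer to the same topic, which suggests that the two files are aligned by position. The sketch below makes that assumption explicit and adds a helper for printing long answers in shortened form. Both `load_pairs` and `preview` are hypothetical helper names, and raising an error on a length mismatch is a design choice of this sketch, not something the dataset documents.

```python
import json


def load_pairs(questions_path="questions.json", answers_path="answers.json"):
    """Load both files and pair entries by position (assumed alignment)."""
    with open(questions_path, "r") as fp:
        questions = json.load(fp)
    with open(answers_path, "r") as fp:
        answers = json.load(fp)
    if len(questions) != len(answers):
        raise ValueError(
            f"expected equal lengths, got {len(questions)} questions "
            f"and {len(answers)} answers"
        )
    return list(zip(questions, answers))


def preview(text, limit=80):
    """Shorten long passages for display without changing the stored data."""
    return text if len(text) <= limit else text[:limit] + "..."


print(preview("No . overall No. in season Title Directed by Written by", limit=20))
```

After downloading the files, `load_pairs()` with its default arguments returns a list of `(question, answer)` tuples.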

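## Chain definition

Langchain is already integrated with Qdrant and performs all the indexing for a given list of documents. In our case, we are going to store the set of answers we have.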


In [9]:
from langchain.vectorstores import Qdrant
from langchain.embeddings import OpenAIEmbeddings
from langchain import VectorDBQA, OpenAI

embeddings = OpenAIEmbeddings()
doc_store = Qdrant.from_texts(
    answers, embeddings, host="localhost" 
)


At this stage, all the possible answers are already stored in Qdrant, so we can define the whole QA chain.


In [10]:
llm = OpenAI()
qa = VectorDBQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    vectorstore=doc_store,
    return_source_documents=False,
)


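## Search data

Once the data is in Qdrant, we can start asking questions. A question is automatically vectorized by the OpenAI model, and the resulting vector is used to find possibly matching answers in Qdrant. Once retrieved, the most similar answers are incorporated into the prompt sent to the OpenAI Large Language Model. The communication between all the services is shown in the diagram below: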

![](https://qdrant.tech/articles_data/langchain-integration/flow-diagram.png)
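
The step in which retrieved answers are incorporated into the prompt can be sketched in plain Python. This is an illustration of the idea, not Langchain's actual code: `build_stuff_prompt` is a hypothetical helper, the blank-line separator is an assumption, and the two passages are placeholders rather than real search results.

```python
def build_stuff_prompt(template, documents, question, separator="\n\n"):
    """Join the retrieved documents and substitute them into the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


template = "Context: {context}\nQuestion: {question}\nHelpful Answer:"

# Placeholder passages standing in for the nearest neighbours returned by Qdrant.
retrieved = ["First retrieved passage.", "Second retrieved passage."]

prompt = build_stuff_prompt(template, retrieved, "where do they first meet")
print(prompt)
```

Because every retrieved passage goes into a single prompt, the number and length of the passages are bounded by the context window of the model.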


In [11]:
import random

random.seed(52)
selected_questions = random.choices(questions, k=5)


In [12]:
for question in selected_questions:
    print(">", question)
    print(qa.run(question), end="\n\n")


> where do frankenstein and the monster first meet
 Victor and the Creature first meet in the mountains.

> who are the actors in fast and furious
 The actors in the Fast and Furious films are Vin Diesel, Paul Walker, Michelle Rodriguez, Jordana Brewster, Tyrese Gibson, Ludacris, Lucas Black, Sung Kang, Gal Gadot, Dwayne Johnson, Matt Schulze, Chad Lindberg, Johnny Strong, Eva Mendes, Devon Aoki, Nathalie Kelley, Bow Wow, Tego Calderón, Don Omar, Elsa Pataky, Kurt Russell, Nathalie Emmanuel, Scott Eastwood, Noel Gugliemi, Ja Rule, Thom Barry, Ted Levine, Minka Kelly, James Remar, Amaury Nolasco, Michael Ealy, MC Jin, Brian Goodman, Lynda Boyd, Jason Tobin, Neela, Liza Lapira, Alimi Ballard, Yorgo Constantine, Geoff Meed, Jeimy Osorio, Max William Crane, Charlie & Miller Kimsey, Eden Estrella, Romeo Santos, John Brotherton, Helen Mirren, Celestino Cornielle, Janmarco Santiago, Carlos De La Hoz, James Ayoub, Rick Yune, Cole Hauser, Brian Tee, John Ortiz, Luke Evans, Jason Statham, Charli

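### Custom prompt templates

The `stuff` chain type in Langchain uses a specific prompt with the question and the context documents incorporated. This is what the default prompt looks like: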

```text
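Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer: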
```

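We can, however, provide our own prompt template and change the behaviour of the OpenAI LLM while still using the `stuff` chain type. It is important to keep `{context}` and `{question}` as placeholders.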
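
Since a missing placeholder is easy to overlook, a template can be checked before it is handed to the chain. The sketch below uses only the standard library and assumes the template uses `str.format`-style fields; `placeholders` and `check_template` are hypothetical helper names.

```python
from string import Formatter


def placeholders(template):
    """Return the set of named fields in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_template(template, required=("context", "question")):
    """Raise ValueError if any required placeholder is missing from the template."""
    missing = set(required) - placeholders(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")
    return True


print(check_template("Context: {context}\nQuestion: {question}\nHelpful Answer:"))
```

A template that drops `{context}` would raise here, rather than silently producing a prompt that never shows the retrieved documents to the model.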

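#### Experimenting with custom prompts

We can try a different prompt template, so that the model:
1. Responds with a single-sentence answer if it knows it.
2. Suggests a random song title if it doesn't know the answer to our question.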


In [13]:
from langchain.prompts import PromptTemplate


In [14]:
custom_prompt = """
Use the following pieces of context to answer the question at the end. Please provide
a short single-sentence summary answer only. If you don't know the answer or if it's 
not present in given context, don't try to make up an answer, but suggest me a random 
unrelated song title I could listen to. 
Context: {context}
Question: {question}
Helpful Answer:
"""


In [15]:
custom_prompt_template = PromptTemplate(
    template=custom_prompt, input_variables=["context", "question"]
)


In [16]:
custom_qa = VectorDBQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    vectorstore=doc_store,
    return_source_documents=False,
    chain_type_kwargs={"prompt": custom_prompt_template},
)


In [17]:
random.seed(41)
for question in random.choices(questions, k=5):
    print(">", question)
    print(custom_qa.run(question), end="\n\n")


> what was uncle jesse's original last name on full house
Uncle Jesse's original last name on Full House was Cochran.

> when did the volcano erupt in indonesia 2018
No volcanic eruption is mentioned in the given context. Suggested Song: "Ring of Fire" by Johnny Cash.

> what does a dualist way of thinking mean
Dualist way of thinking means that the mind and body are separate entities, with the mind being a non-physical substance.

> the first civil service commission in india was set up on the basis of recommendation of
The first Civil Service Commission in India was not set up on the basis of a recommendation.

> how old do you have to be to get a tattoo in utah
In Utah, you must be at least 18 years old to get a tattoo.

