# 搭建查询分析系统

本页面将展示如何在一个基础的端到端示例中使用查询分析。这将涵盖创建一个简单的搜索引擎，展示在将原始用户问题传递给搜索时可能出现的失败模式，然后举例说明查询分析如何帮助解决这个问题。有许多不同的查询分析技术，这个端到端示例不会展示所有这些技术。

为了这个示例的目的，我们将对LangChain YouTube视频进行检索。

In [83]:
import os
from langchain_openai import ChatOpenAI
API_SECRET_KEY = ""
BASE_URL = ""  # 代理 base-url 记得加上 /v1

os.environ["OPENAI_API_KEY"] = API_SECRET_KEY
os.environ["OPENAI_API_BASE"] = BASE_URL
llm = ChatOpenAI(temperature=0)

### 加载文档

我们可以用 来加载一些LangChain视频的文字记录：[YouTubeLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.youtube.YoutubeLoader.html)

In [84]:
from langchain_community.document_loaders import YoutubeLoader

urls = [
    "https://www.youtube.com/watch?v=HAn9vnJy6S4",
    "https://www.youtube.com/watch?v=dA1cHGACXCo",
    "https://www.youtube.com/watch?v=ZcEMLz27sL4",
    "https://www.youtube.com/watch?v=hvAPnpSfSGo",
    "https://www.youtube.com/watch?v=EhlPDL4QrWY",
    "https://www.youtube.com/watch?v=mmBo8nlu2j0",
    "https://www.youtube.com/watch?v=rQdibOsL1ps",
    "https://www.youtube.com/watch?v=28lC4fqukoc",
    "https://www.youtube.com/watch?v=es-9MgxB-uc",
    "https://www.youtube.com/watch?v=wLRHwKuKvOE",
    "https://www.youtube.com/watch?v=ObIltMaRJvY",
    "https://www.youtube.com/watch?v=DjuXACWYkkU",
    "https://www.youtube.com/watch?v=o7C9ld6Ln-M",
]
docs = []
for url in urls:
    docs.extend(YoutubeLoader.from_youtube_url(url, add_video_info=True).load())

In [85]:
import datetime

# Add some additional metadata: what year the video was published
for doc in docs:
    doc.metadata["publish_year"] = int(
        datetime.datetime.strptime(
            doc.metadata["publish_date"], "%Y-%m-%d %H:%M:%S"
        ).strftime("%Y")
    )

以下是我们加载的视频的标题：

In [86]:
[doc.metadata["title"] for doc in docs]

['OpenGPTs',
 'Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve',
 'Streaming Events: Introducing a new `stream_events` method',
 'LangGraph: Multi-Agent Workflows',
 'Build and Deploy a RAG app with Pinecone Serverless',
 'Auto-Prompt Builder (with Hosted LangServe)',
 'Build a Full Stack RAG App With TypeScript',
 'Getting Started with Multi-Modal LLMs',
 'SQL Research Assistant',
 'Skeleton-of-Thought: Building a New Template from Scratch',
 'Benchmarking RAG over LangChain Docs',
 'Building a Research Assistant from Scratch',
 'LangServe and LangChain Templates Webinar']

以下是与每个视频关联的元数据。我们可以看到，每个文档还有一个标题、浏览次数、发布日期和长度：

In [87]:
docs[0].metadata

{'source': 'HAn9vnJy6S4',
 'title': 'OpenGPTs',
 'description': 'Unknown',
 'view_count': 9037,
 'thumbnail_url': 'https://i.ytimg.com/vi/HAn9vnJy6S4/hq720.jpg',
 'publish_date': '2024-01-31 00:00:00',
 'length': 1530,
 'author': 'LangChain',
 'publish_year': 2024}

下面是文档内容的示例：

In [88]:
docs[0].page_content[:500]

"hello today I want to talk about open gpts open gpts is a project that we built here at linkchain uh that replicates the GPT store in a few ways so it creates uh end user-facing friendly interface to create different Bots and these Bots can have access to different tools and they can uh be given files to retrieve things over and basically it's a way to create a variety of bots and expose the configuration of these Bots to end users it's all open source um it can be used with open AI it can be us"

### 索引文档
每当我们执行检索时，我们都需要创建一个可以查询的文档索引。我们将使用向量存储来索引我们的文档，我们将首先对它们进行分块，以使我们的检索更加简洁和精确：

In [89]:
from langchain_chroma import Chroma
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
chunked_docs = text_splitter.split_documents(docs)
EMBEDDING_DEVICE = "cuda"
embeddings=HuggingFaceEmbeddings(model_name= "../models/m3e-base",
                                    model_kwargs={"device":EMBEDDING_DEVICE}
                                    )
vectorstore = FAISS.from_documents(
    chunked_docs,
    embeddings,
)

### 不带查询分析的检索
我们可以直接对用户问题进行相似性搜索，以查找与该问题相关的块：

In [90]:
vectorstore.similarity_search_with_score("how do I build a RAG agent")

[(Document(metadata={'source': '28lC4fqukoc', 'title': 'Getting Started with Multi-Modal LLMs', 'description': 'Unknown', 'view_count': 4093, 'thumbnail_url': 'https://i.ytimg.com/vi/28lC4fqukoc/hq720.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGCkgWChyMA8=&rs=AOn4CLCPeU4y3IyyG2C3XDHmIYh8efhGbQ', 'publish_date': '2023-12-20 00:00:00', 'length': 1833, 'author': 'LangChain', 'publish_year': 2023}, page_content="capacity and conventional rag approaches that just strip the text out really miss a lot of this so let's try kind of how could we build a rag system over the visual content in in a slide deck um so to start off what I did was I took a slide deck and this is um uh data dog's Q3 earnings report I randomly chose it you know it was just like an interesting demonstration of like kind of complex uh you know financial information and figures and slide deck and I created a set of 10 questions and answer pairs about these slides this is like my evalve set um and this is really 

In [91]:
search_results = vectorstore.similarity_search("how do I build a RAG agent")
print(search_results[0].metadata["title"])
print(search_results[0].page_content[:500])

Getting Started with Multi-Modal LLMs
capacity and conventional rag approaches that just strip the text out really miss a lot of this so let's try kind of how could we build a rag system over the visual content in in a slide deck um so to start off what I did was I took a slide deck and this is um uh data dog's Q3 earnings report I randomly chose it you know it was just like an interesting demonstration of like kind of complex uh you know financial information and figures and slide deck and I created a set of 10 questions and answer


这很有效！我们的第一个结果与这个问题非常相关。

如果我们想搜索特定时间段的结果，该怎么办？

In [92]:
search_results = vectorstore.similarity_search("videos on RAG published in 2023")
print(search_results[0].metadata["title"])
print(search_results[0].metadata["publish_date"])
print(search_results[0].page_content[:500])

Getting Started with Multi-Modal LLMs
2023-12-20 00:00:00
GPD 4V and some other models that we'll talk about today um so kind of a quick overview of models a lot of this work of course of course you know predates uh you know the current year of 2023 uh it's probably worth noting clip it's very important work from open AI um that kind of map data from different modalities text and images into a shared embedding space um it's open source and actually clip embeddings are still used uh for visual encoding in models that you'll see today for example lava um


我们的第一个结果是 2024 年的（尽管我们要求提供 2023 年的视频），与输入不是很相关。由于我们只是针对文档内容进行搜索，因此无法根据任何文档属性筛选结果。

这只是可能出现的一种故障模式。现在让我们来看看查询分析的基本形式是如何解决它的！

额，这里我的结果和官方文档不太一样，我的输出就是2023年的

### 查询分析

查询分析旨在将用户的自然语言问题转换为结构化的数据库查询，以提高检索结果的准确性和相关性。通过定义查询架构和使用函数调用模型，我们可以将用户的问题转换为包含明确筛选条件的结构化查询。

#### 查询架构

在这个例子中，查询架构包含一个查询字段和一个可选的发布日期字段。发布日期字段可以包含一个最小值和最大值，用于筛选视频的发布日期。查询架构定义如下：



In [93]:
from typing import Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class Search(BaseModel):
    """Search over a database of tutorial videos about a software library."""

    query: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    publish_year: Optional[int] = Field(None, description="Year video was published")

#### 查询生成

为了将用户的问题转换为结构化查询，教程使用了 OpenAI 的工具调用 API，特别是新的 `ChatModel.with_structured_output()` 构造函数。这允许将查询架构传递给模型，并让模型输出结构化查询。以下是实现步骤：

In [94]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a list of database queries optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125",temperature=0)
structured_llm = llm.with_structured_output(Search)
query_analyzer = {"question": RunnablePassthrough()} | prompt | structured_llm

让我们看看我们的分析器为我们之前搜索的问题生成了哪些查询：

In [95]:
query_analyzer.invoke("how do I build a RAG agent")

Search(query='build RAG agent', publish_year=None)

In [96]:
# 这个地方有的时候跑的出来，有的时候跑不出来，和数学运算的错误类似，都是格式出错
query_analyzer.invoke("videos on RAG published in 2023")

Search(query='RAG', publish_year=2023)

### 使用查询分析进行检索

我们的查询分析看起来相当不错；现在让我们尝试使用生成的查询实际执行检索。

注意：在我们的示例中，我们指定了 `tool_choice="Search"`。这将强制 LLM 调用一个（且仅一个）工具，这意味着我们总是会有一个优化的查询来查找。请注意，这并不总是如此——有关如何处理没有或返回多个优化查询的情况，请参见其他指南。

In [97]:
from typing import List

from langchain_core.documents import Document

In [98]:
def retrieval(search: Search) -> List[Document]:
    if search.publish_year is not None:
        # This is syntax specific to Chroma,
        # the vector database we are using.
        _filter = {"publish_year": {"$eq": search.publish_year}}
    else:
        _filter = None
    return vectorstore.similarity_search(search.query, filter=_filter)

In [114]:
retrieval_chain = query_analyzer | retrieval

我们现在可以在之前有问题的输入上运行这个链，并看到它只产生当年的结果！

In [116]:
# 对于运算格式和json格式的问题 不知道如何处理？
results = retrieval_chain.invoke("RAG tutorial published in 2024")

OutputParserException: Function Search arguments:

{
  publish_year: 2024,
  query: "RAG tutorial"
}

are not valid JSON. Received JSONDecodeError Expecting property name enclosed in double quotes: line 2 column 3 (char 4)

In [112]:
print(results)

[]


In [113]:
[(doc.metadata["title"], doc.metadata["publish_date"]) for doc in results]

[]