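# Building hotel room search with self-querying retrieval

In this example we'll walk through how to build and iterate on a hotel room search service that uses an LLM to generate structured filter queries, which are then passed to a vector store.

For an introduction to self-querying retrieval, [check out the docs](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query).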

## Imports and data prep

In this example we use `ChatOpenAI` for the model and `ElasticsearchStore` for the vector store, but these can be swapped out for another LLM/ChatModel and [any VectorStore that supports self-querying](https://python.langchain.com/docs/integrations/retrievers/self_query/).

Download the data from: https://www.kaggle.com/datasets/keshavramaiah/hotel-recommendation

In [None]:
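# Install the required Python packages (langchain-openai provides ChatOpenAI and OpenAIEmbeddings)
!pip install langchain langchain-openai langchain-elasticsearch lark openai elasticsearch pandas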

In [1]:
# Import pandas under its conventional alias pd
import pandas as pd

In [2]:
# Read Hotel_details.csv, drop duplicate hotelid rows, and index by hotelid
details = (
    pd.read_csv("~/Downloads/archive/Hotel_details.csv")
    .drop_duplicates(subset="hotelid")
    .set_index("hotelid")
)

# Read Hotel_Room_attributes.csv, indexed by the id column
attributes = pd.read_csv(
    "~/Downloads/archive/Hotel_Room_attributes.csv", index_col="id"
)

# Read hotels_RoomPrice.csv, indexed by the id column
price = pd.read_csv("~/Downloads/archive/hotels_RoomPrice.csv", index_col="id")

In [3]:
# Keep only the last price row for each refid, restricted to the columns we need
latest_price = price.drop_duplicates(subset="refid", keep="last")[
    [
        "hotelcode",
        "roomtype",
        "onsiterate",
        "roomamenities",
        "maxoccupancy",
        "mealinclusiontype",
    ]
]

# Look up the matching "ratedescription" in the attributes table and add it as a column
latest_price["ratedescription"] = attributes.loc[latest_price.index]["ratedescription"]

# Join the hotel-level columns from the details table, matching on hotelcode
latest_price = latest_price.join(
    details[["hotelname", "city", "country", "starrating"]], on="hotelcode"
)

# Rename "ratedescription" to "roomdescription"
latest_price = latest_price.rename({"ratedescription": "roomdescription"}, axis=1)

# Add a boolean "mealsincluded" column: True when a meal inclusion type is present
latest_price["mealsincluded"] = ~latest_price["mealinclusiontype"].isnull()

# Drop the "hotelcode" and "mealinclusiontype" columns, which are no longer needed
latest_price.pop("hotelcode")
latest_price.pop("mealinclusiontype")

# Reset to a fresh integer index, discarding the old one
latest_price = latest_price.reset_index(drop=True)

# Show the first few rows of the prepared data
latest_price.head()

Unnamed: 0,roomtype,onsiterate,roomamenities,maxoccupancy,roomdescription,hotelname,city,country,starrating,mealsincluded
0,Vacation Home,636.09,Air conditioning: ;Closet: ;Fireplace: ;Free W...,4,"Shower, Kitchenette, 2 bedrooms, 1 double bed ...",Pantlleni,Beddgelert,United Kingdom,3,False
1,Vacation Home,591.74,Air conditioning: ;Closet: ;Dishwasher: ;Firep...,4,"Shower, Kitchenette, 2 bedrooms, 1 double bed ...",Willow Cottage,Beverley,United Kingdom,3,False
2,"Guest room, Queen or Twin/Single Bed(s)",0.0,,2,,AC Hotel Manchester Salford Quays,Manchester,United Kingdom,4,False
3,Bargemaster King Accessible Room,379.08,Air conditioning: ;Free Wi-Fi in all rooms!: ;...,2,Shower,"Lincoln Plaza London, Curio Collection by Hilton",London,United Kingdom,4,True
4,Twin Room,156.17,Additional toilet: ;Air conditioning: ;Blackou...,2,"Room size: 15 m²/161 ft², Non-smoking, Shower,...",Ibis London Canning Town,London,United Kingdom,3,True


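## Describe data attributes

We'll use a self-querying retriever, which requires us to describe the metadata we can filter on.

Or if we're feeling lazy we can have a model write a draft of the descriptions for us :)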

In [4]:
# Import the ChatOpenAI chat model wrapper
from langchain_openai import ChatOpenAI

# Create a chat model backed by "gpt-4"
model = ChatOpenAI(model="gpt-4")

# Ask the model to draft a JSON list describing each column of the sample rows.
# invoke() returns a message object; its .content attribute holds the reply text.
res = model.invoke(
    "Below is a table with information about hotel rooms. "
    "Return a JSON list with an entry for each column. Each entry should have "
    '{"name": "column name", "description": "column description", "type": "column data type"}'
    f"\n\n{latest_price.head()}\n\nJSON:\n"
).content

In [5]:
import json

# Parse the model's JSON reply into a list of Python dicts
attribute_info = json.loads(res)
# Display the parsed attribute descriptions
attribute_info

[{'name': 'roomtype', 'description': 'The type of the room', 'type': 'string'},
 {'name': 'onsiterate',
  'description': 'The rate of the room',
  'type': 'float'},
 {'name': 'roomamenities',
  'description': 'Amenities available in the room',
  'type': 'string'},
 {'name': 'maxoccupancy',
  'description': 'Maximum number of people that can occupy the room',
  'type': 'integer'},
 {'name': 'roomdescription',
  'description': 'Description of the room',
  'type': 'string'},
 {'name': 'hotelname', 'description': 'Name of the hotel', 'type': 'string'},
 {'name': 'city',
  'description': 'City where the hotel is located',
  'type': 'string'},
 {'name': 'country',
  'description': 'Country where the hotel is located',
  'type': 'string'},
 {'name': 'starrating',
  'description': 'Star rating of the hotel',
  'type': 'integer'},
 {'name': 'mealsincluded',
  'description': 'Whether meals are included or not',
  'type': 'boolean'}]
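
The `json.loads(res)` call above only works when the model replies with bare JSON. Chat models sometimes wrap their answer in a markdown code fence, so a slightly more defensive parser can be useful. The following is a minimal sketch under that assumption; `parse_json_reply` and the sample replies are hypothetical and not part of the original notebook.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to keep this snippet readable


def parse_json_reply(text: str):
    """Parse a model reply as JSON, tolerating an optional markdown code fence."""
    # If the reply contains a fenced block, keep only the body of that block.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


# Made-up replies standing in for the model output `res`.
plain_reply = '[{"name": "city", "description": "City", "type": "string"}]'
fenced_reply = FENCE + "json\n" + plain_reply + "\n" + FENCE

parsed_plain = parse_json_reply(plain_reply)
parsed_fenced = parse_json_reply(fenced_reply)
print(parsed_plain == parsed_fenced)
```

In the notebook this would replace the direct `json.loads(res)` call; malformed JSON still raises `json.JSONDecodeError`, so the draft should be inspected either way.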

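For low cardinality features, let's include the valid values in the description.

In [6]:
# Count the unique values in each column and keep the columns with fewer than 40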
latest_price.nunique()[latest_price.nunique() < 40]

maxoccupancy     19
country          29
starrating        3
mealsincluded     2
dtype: int64

In [7]:
# Append the sorted unique 'starrating' values to the description of the
# second-to-last entry in attribute_info (starrating)
attribute_info[-2]["description"] += (
    f". Valid values are {sorted(latest_price['starrating'].value_counts().index.tolist())}"
)

# Append the sorted unique 'maxoccupancy' values to the description of the
# fourth entry in attribute_info (maxoccupancy)
attribute_info[3]["description"] += (
    f". Valid values are {sorted(latest_price['maxoccupancy'].value_counts().index.tolist())}"
)

# Append the sorted unique 'country' values to the description of the
# third-to-last entry in attribute_info (country)
attribute_info[-3]["description"] += (
    f". Valid values are {sorted(latest_price['country'].value_counts().index.tolist())}"
)
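
The cell above addresses entries by position (`[-2]`, `[3]`, `[-3]`), which silently targets the wrong attribute if the model returns the columns in a different order. A possible alternative is to look entries up by name. This is only a sketch: `add_valid_values` and `sample_info` are hypothetical names introduced for illustration, and the toy data stands in for the notebook's `attribute_info`.

```python
def add_valid_values(attribute_info, name, values):
    """Append a sorted list of valid values to the description of the named attribute."""
    for attr in attribute_info:
        if attr["name"] == name:
            attr["description"] += f". Valid values are {sorted(values)}"
            return attr
    # Fail loudly instead of editing the wrong entry.
    raise KeyError(f"No attribute named {name!r}")


# Toy data standing in for the notebook's attribute_info and column values.
sample_info = [
    {"name": "country", "description": "Country where the hotel is located", "type": "string"},
    {"name": "starrating", "description": "Star rating of the hotel", "type": "integer"},
]

add_valid_values(sample_info, "starrating", {4, 2, 3})
print(sample_info[1]["description"])
# prints: Star rating of the hotel. Valid values are [2, 3, 4]
```

With the real data, the `values` argument could be something like `latest_price["starrating"].unique()`.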

In [8]:
attribute_info

[{'name': 'roomtype', 'description': 'The type of the room', 'type': 'string'},
 {'name': 'onsiterate',
  'description': 'The rate of the room',
  'type': 'float'},
 {'name': 'roomamenities',
  'description': 'Amenities available in the room',
  'type': 'string'},
 {'name': 'maxoccupancy',
  'description': 'Maximum number of people that can occupy the room. Valid values are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 24]',
  'type': 'integer'},
 {'name': 'roomdescription',
  'description': 'Description of the room',
  'type': 'string'},
 {'name': 'hotelname', 'description': 'Name of the hotel', 'type': 'string'},
 {'name': 'city',
  'description': 'City where the hotel is located',
  'type': 'string'},
 {'name': 'country',
  'description': "Country where the hotel is located. Valid values are ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia

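## Creating a query constructor chain

Let's take a look at the chain that will convert natural language requests into structured queries.

To start, we can just load the prompt and see what it looks like.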

In [9]:
from langchain.chains.query_constructor.base import (
    get_query_constructor_prompt,
    load_query_constructor_runnable,
)

In [10]:
# A short description of what each document contains
doc_contents = "Detailed description of a hotel room"

# Build the query constructor prompt from the document description and attribute info
prompt = get_query_constructor_prompt(doc_contents, attribute_info)

# Print the prompt, leaving the "{query}" placeholder unfilled
print(prompt.format(query="{query}"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not

In [11]:


# Build the query constructor chain on top of ChatOpenAI (gpt-3.5-turbo, temperature=0)
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0), doc_contents, attribute_info
)

In [12]:
# Run the chain on a natural language request, passed as a dict with a "query" key
chain.invoke({"query": "I want a hotel in Southern Europe and my budget is 200 bucks."})

StructuredQuery(query='hotel', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Italy'), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='onsiterate', value=200)]), limit=None)

In [13]:
# Try a request that mixes location, occupancy, meal, and amenity constraints
chain.invoke(
    {
        "query": "Find a 2-person room in Vienna or London, preferably with meals included and AC"
    }
)

StructuredQuery(query='2-person room', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='Vienna'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='London')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='maxoccupancy', value=2), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=True), Comparison(comparator=<Comparator.CONTAIN: 'contain'>, attribute='roomamenities', value='AC')]), limit=None)

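## Refining attribute descriptions

We can see at least two issues above. First, when we ask for a Southern European destination we only get a filter for Italy. Second, when we ask for AC we get a literal string lookup for "AC" (which isn't so bad, but will miss things like "Air conditioning").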
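
To see why a literal lookup is fragile, it helps to treat the filter as a plain substring test. That is an assumption made for illustration, since the way a given vector store evaluates `contain` may differ. The amenity strings below imitate the format of the dataset's `roomamenities` column.

```python
# Amenity strings in the same "Name: ;Name: ;" format as the roomamenities column.
room_with_ac = "Air conditioning: ;Closet: ;"
room_without_ac = "Fireplace: ;Heating: ;"

# A case-sensitive substring test misses the room that actually has air conditioning,
# because the data spells it out instead of using the abbreviation.
missed = "AC" in room_with_ac

# Ignoring case does not rescue the match either ...
still_missed = "ac" in room_with_ac.lower()

# ... and it introduces a false positive: "fireplace" contains the letters "ac".
false_positive = "ac" in room_without_ac.lower()

print(missed, still_missed, false_positive)
# prints: False False True
```

This is the motivation for leaving free-form attributes to the semantic part of the query rather than to exact filters.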

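As a first step, let's try to update our description of the "country" attribute to emphasize that equality should only be used when a specific country is mentioned.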

In [14]:
# Append a usage note to the description of the third-to-last attribute (country)
attribute_info[-3]["description"] += (
    ". NOTE: Only use the 'eq' operator if a specific country is mentioned. If a region is mentioned, include all relevant countries in filter."
)

# Rebuild the query constructor chain so it picks up the updated description
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    doc_contents,
    attribute_info,
)

In [15]:
# Re-run the Southern Europe request against the chain with the updated description
chain.invoke({"query": "I want a hotel in Southern Europe and my budget is 200 bucks."})

StructuredQuery(query='hotel', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=False), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='onsiterate', value=200), Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Italy'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Spain'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Greece'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Portugal'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Croatia'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Cyprus'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Malta'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='country', value='Bulgaria'), Comparison(comparator=<

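## Refining which attributes to filter on

This seems to have helped! Now let's try to narrow the attributes we're filtering on. More freeform attributes can be left to the main query, which is better at capturing semantic meaning than a search for specific substrings.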

In [16]:
# Free-form attributes to leave to the semantic query instead of the filter
content_attr = ["roomtype", "roomamenities", "roomdescription", "hotelname"]
# Updated description of the document contents, mentioning those free-form fields
doc_contents = "A detailed description of a hotel room, including information about the room type and room amenities."
# Keep only the attributes that should remain available for filtering
filter_attribute_info = tuple(
    ai for ai in attribute_info if ai["name"] not in content_attr
)
# Rebuild the chain with the new document description and the reduced attribute list
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    doc_contents,
    filter_attribute_info,
)

In [17]:
# Re-run the Vienna/London request with the reduced set of filterable attributes
chain.invoke(
    {
        "query": "Find a 2-person room in Vienna or London, preferably with meals included and AC"
    }
)

StructuredQuery(query='2-person room', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='Vienna'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='London')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='maxoccupancy', value=2), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=True)]), limit=None)
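
To make the shape of this filter concrete, here is a toy evaluator that applies the same logical structure to plain metadata dicts. It is a sketch for intuition only: the nested-tuple representation and the `matches` function are hypothetical, and this is not how LangChain or Elasticsearch evaluate filters (in practice the filter is translated into the vector store's own query language).

```python
# Filters are modelled as nested tuples:
#   ("and" | "or", sub_filter, sub_filter, ...)  or  (comparator, attribute, value)
COMPARATORS = {
    "eq": lambda actual, expected: actual == expected,
    "lte": lambda actual, expected: actual <= expected,
    "gte": lambda actual, expected: actual >= expected,
}


def matches(node, metadata):
    """Return True if the metadata dict satisfies the filter tree."""
    op = node[0]
    if op == "and":
        return all(matches(child, metadata) for child in node[1:])
    if op == "or":
        return any(matches(child, metadata) for child in node[1:])
    _, attribute, value = node
    return COMPARATORS[op](metadata[attribute], value)


# The filter from the output above, rewritten in the toy representation.
room_filter = (
    "and",
    ("or", ("eq", "city", "Vienna"), ("eq", "city", "London")),
    ("eq", "maxoccupancy", 2),
    ("eq", "mealsincluded", True),
)

london_room = {"city": "London", "maxoccupancy": 2, "mealsincluded": True}
paris_room = {"city": "Paris", "maxoccupancy": 2, "mealsincluded": True}

print(matches(room_filter, london_room))  # True
print(matches(room_filter, paris_room))   # False
```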

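## Adding examples specific to our use case

We've removed the strict filter for "AC", but it's still not being included in the query string. Our chain prompt is a few-shot prompt with some default examples. Let's see if adding use-case-specific examples will help: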

In [18]:
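# Few-shot examples as (user request, expected structured query) pairs.
# Filters must reference the real attribute names and the country values used in the data.
examples = [
    (
        "I want a hotel in the Balkans with a king sized bed and a hot tub. Budget is $300 a night",
        {
            "query": "king-sized bed, hot tub",
            "filter": 'and(in("country", ["Bulgaria", "Greece", "Croatia", "Serbia"]), lte("onsiterate", 300))',
        },
    ),
    (
        "A room with breakfast included for 3 people, at a Hilton",
        {
            "query": "Hilton",
            "filter": 'and(eq("mealsincluded", true), gte("maxoccupancy", 3))',
        },
    ),
]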
prompt = get_query_constructor_prompt(
    doc_contents, filter_attribute_info, examples=examples
)
print(prompt.format(query="{query}"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not

In [19]:
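# Rebuild the query constructor chain, this time with our own few-shot examples
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),  # gpt-3.5-turbo at temperature 0
    doc_contents,  # description of the document contents
    filter_attribute_info,  # attributes that may be used in filters
    examples=examples,  # use-case-specific few-shot examples
)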
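
Because the example filters are plain strings, a mistyped or mistranslated attribute name in them goes unnoticed and teaches the model to emit invalid filters. One way to catch this early is a quick check of the attribute names before building the chain. This is a rough sketch: `attributes_in_filter` is a hypothetical helper, the regular expression only handles comparisons whose first argument is a double-quoted name, and the allowed names are copied from this notebook's filterable attributes.

```python
import re


def attributes_in_filter(filter_string):
    """Return the attribute names used as the first argument of each comparison."""
    # Matches e.g. eq("mealsincluded", ...) or lte("onsiterate", ...).
    pattern = r'\b(?:eq|ne|gt|gte|lt|lte|contain|like|in|nin)\(\s*"([^"]+)"'
    return set(re.findall(pattern, filter_string))


# Attribute names that remain available for filtering in this notebook.
allowed = {"onsiterate", "maxoccupancy", "city", "country", "starrating", "mealsincluded"}

good_filter = 'and(in("country", ["Bulgaria", "Greece"]), lte("onsiterate", 300))'
bad_filter = 'and(eq("starrating", 4), contain("description", "coast"))'

unknown_good = attributes_in_filter(good_filter) - allowed
unknown_bad = attributes_in_filter(bad_filter) - allowed

print(unknown_good)  # set()
print(unknown_bad)   # {'description'}
```

Running the same check over every `filter` string in `examples` would flag any name that is not in the allowed set.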

In [20]:
# Re-run the Vienna/London request with the example-augmented chain
chain.invoke(
    {
        "query": "Find a 2-person room in Vienna or London, preferably with meals included and AC"
    }
)

StructuredQuery(query='2-person room, meals included, AC', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='Vienna'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='city', value='London')]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='mealsincluded', value=True)]), limit=None)

This seems to have helped! Let's try another complex query:

In [21]:
# Try a request that relies mostly on free-form preferences
chain.invoke(
    {
        "query": "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."
    }
)

OutputParserException: Parsing text
```json
{
    "query": "highly rated, coast, patio, fireplace",
    "filter": "and(eq(\"starrating\", 4), contain(\"description\", \"coast\"), contain(\"description\", \"patio\"), contain(\"description\", \"fireplace\"))"
}
```
 raised following error:
Received invalid attributes description. Allowed attributes are ['onsiterate', 'maxoccupancy', 'city', 'country', 'starrating', 'mealsincluded']

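## Automatically ignoring invalid queries

It seems our model gets tripped up on this more complex query and tries to filter on an attribute ('description') that doesn't exist. By setting `fix_invalid=True` in our query constructor chain, we can automatically remove any parts of the filter that are invalid (meaning they use disallowed operations, comparators, or attributes).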

In [22]:
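# ChatOpenAI comes from langchain_openai, not from the openai package
from langchain_openai import ChatOpenAI

# Rebuild the query constructor chain with automatic removal of invalid filter parts
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),  # gpt-3.5-turbo at temperature 0
    doc_contents,  # description of the document contents
    filter_attribute_info,  # attributes that may be used in filters
    examples=examples,  # use-case-specific few-shot examples
    fix_invalid=True,  # drop invalid operators, comparators, and attributes
)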

In [23]:
# Re-run the request that previously raised an OutputParserException
chain.invoke(
    {
        "query": "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."
    }
)

StructuredQuery(query='highly rated, coast, patio, fireplace', filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='starrating', value=4), limit=None)
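
The result keeps only the `starrating` comparison, since the three `contain` clauses referenced the unknown `description` attribute. The idea can be sketched as a recursive pruning of the filter tree. This is an illustration under a simplified nested-tuple representation, not LangChain's actual implementation; `prune_invalid` is a hypothetical name.

```python
ALLOWED_ATTRIBUTES = {
    "onsiterate", "maxoccupancy", "city", "country", "starrating", "mealsincluded",
}
LOGICAL_OPERATORS = {"and", "or", "not"}


def prune_invalid(node, allowed):
    """Drop comparisons on disallowed attributes; return None if nothing is left."""
    op = node[0]
    if op in LOGICAL_OPERATORS:
        pruned = (prune_invalid(child, allowed) for child in node[1:])
        kept = [child for child in pruned if child is not None]
        if not kept:
            return None
        # An and/or with a single surviving argument collapses to that argument.
        if len(kept) == 1 and op != "not":
            return kept[0]
        return (op, *kept)
    _, attribute, _ = node
    return node if attribute in allowed else None


# The filter the model produced before fix_invalid was enabled.
raw_filter = (
    "and",
    ("eq", "starrating", 4),
    ("contain", "description", "coast"),
    ("contain", "description", "patio"),
    ("contain", "description", "fireplace"),
)

fixed_filter = prune_invalid(raw_filter, ALLOWED_ATTRIBUTES)
print(fixed_filter)
# prints: ('eq', 'starrating', 4)
```

Note the trade-off: the coast, patio, and fireplace preferences are no longer enforced as filters and survive only in the free-text query string.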

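## Using with a self-querying retriever

Now that our query construction chain is in a decent place, let's try using it with an actual retriever. For this example we'll use the [ElasticsearchStore](https://python.langchain.com/docs/integrations/vectorstores/elasticsearch).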

In [24]:
from langchain_elasticsearch import ElasticsearchStore  # Elasticsearch-backed vector store
from langchain_openai import OpenAIEmbeddings  # OpenAI embedding model wrapper

embeddings = OpenAIEmbeddings()  # embedding model used to index and query documents

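## Populating the vector store

The first time you run this, run the cell below to index the data. On later runs the index already exists, so you can skip it and connect to the existing index with the cell after it.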

In [25]:
# Document is not imported anywhere else in this notebook, so import it here
from langchain_core.documents import Document

# Collect one Document per room
docs = []

# Iterate over the rows (missing values replaced by empty strings);
# _ is the row index and room is the row itself
for _, room in latest_price.fillna("").iterrows():
    # page_content is the row serialized as JSON text (this is what gets embedded);
    # metadata is the same row as a dict (this is what filters are applied to)
    doc = Document(
        page_content=json.dumps(room.to_dict(), indent=2),
        metadata=room.to_dict()
    )
    docs.append(doc)

# Embed the documents and index them in Elasticsearch
vecstore = ElasticsearchStore.from_documents(
    docs,
    embeddings,
    es_url="http://localhost:9200",
    index_name="hotel_rooms",
    # strategy=ElasticsearchStore.ApproxRetrievalStrategy(
    #     hybrid=True,
    # )
)

In [26]:
# Connect to the existing "hotel_rooms" index
# embedding: the embedding model used for queries
# es_url: the URL of the Elasticsearch instance
vecstore = ElasticsearchStore(
    "hotel_rooms",
    embedding=embeddings,
    es_url="http://localhost:9200",
    # strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True) # seems to not be available in the community version
)

In [27]:
from langchain.retrievers import SelfQueryRetriever

retriever = SelfQueryRetriever(
    query_constructor=chain, vectorstore=vecstore, verbose=True
)

In [28]:
# Run the self-querying retriever on a natural language request
results = retriever.invoke(
    "I want to stay somewhere highly rated along the coast. I want a room with a patio and a fireplace."
)

# Print each retrieved document's content, followed by a separator line
for res in results:
    print(res.page_content)
    print("\n" + "-" * 20 + "\n")

{
  "roomtype": "Three-Bedroom House With Sea View",
  "onsiterate": 341.75,
  "roomamenities": "Additional bathroom: ;Additional toilet: ;Air conditioning: ;Closet: ;Clothes dryer: ;Coffee/tea maker: ;Dishwasher: ;DVD/CD player: ;Fireplace: ;Free Wi-Fi in all rooms!: ;Full kitchen: ;Hair dryer: ;Heating: ;High chair: ;In-room safe box: ;Ironing facilities: ;Kitchenware: ;Linens: ;Microwave: ;Private entrance: ;Refrigerator: ;Seating area: ;Separate dining area: ;Smoke detector: ;Sofa: ;Towels: ;TV [flat screen]: ;Washing machine: ;",
  "maxoccupancy": 6,
  "roomdescription": "Room size: 125 m\u00b2/1345 ft\u00b2, 2 bathrooms, Shower and bathtub, Shared bathroom, Kitchenette, 3 bedrooms, 1 double bed or 2 single beds or 1 double bed",
  "hotelname": "Downings Coastguard Cottages - Type B-E",
  "city": "Downings",
  "country": "Ireland",
  "starrating": 4,
  "mealsincluded": false
}

--------------------

{
  "roomtype": "Three-Bedroom House With Sea View",
  "onsiterate": 774.05,
  "ro