## Self Querying

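Self-querying is a special kind of retriever that automatically turns a natural-language query into a structured query. Given any natural-language query, the retriever uses a query-constructing LLM to write a structured query and then applies that structured query to its underlying vector store. As a result, the retriever can not only compare the user's query with the content of the stored documents by semantic similarity, but also extract filters on the stored documents' metadata from the user's query and execute them.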

![Self-querying retriever diagram](../assets/self_querying.jpg "Self Query")


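### Core features:

* Performs semantic similarity search
* Also extracts metadata filters from the user's query
* Executes those filters directly in the vector store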

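### How it works:

The retriever uses the LLM to convert the natural-language query into a structured query,
then applies that structured query to the underlying vector store.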
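
The two steps above can be illustrated with a minimal sketch. The `Comparison` and `StructuredQuery` classes and the `apply_filter` function below are hypothetical, simplified stand-ins written for this example; they are not LangChain's own classes, and a real vector store would run the filter inside the store rather than in Python.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional
import operator

# Hypothetical, simplified stand-ins for illustration only.
COMPARATORS: dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class Comparison:
    comparator: str  # a key of COMPARATORS
    attribute: str   # metadata field name
    value: Any


@dataclass
class StructuredQuery:
    query: str                    # text left over for semantic search
    filter: Optional[Comparison]  # metadata condition, if any


def apply_filter(docs: list[dict], sq: StructuredQuery) -> list[dict]:
    """Keep the documents whose metadata satisfies the filter."""
    if sq.filter is None:
        return docs
    compare = COMPARATORS[sq.filter.comparator]
    return [
        d for d in docs
        if sq.filter.attribute in d["metadata"]
        and compare(d["metadata"][sq.filter.attribute], sq.filter.value)
    ]


# What an LLM might produce for "movies about dreams rated higher than 8".
sq = StructuredQuery(query="dreams", filter=Comparison("gt", "rating", 8))

docs = [
    {"page_content": "dinosaurs", "metadata": {"rating": 7.7}},
    {"page_content": "a dream within a dream", "metadata": {"rating": 8.2}},
]
matches = apply_filter(docs, sq)
print([d["page_content"] for d in matches])
```

Only the documents that pass the filter would then be ranked by semantic similarity against the remaining query text (`"dreams"` here).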

### Notes:

You must provide a description of the document content and declare each metadata field along with its meaning.

In [2]:
from langchain_ollama import OllamaEmbeddings
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# get_model appears to come from a local helper module that is not shown in this section.
from chat_model_client import get_model
from pprint import pprint

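## 1. Organize the data

You need to declare the structure of the document data, down to the name, data type, and description of every attribute, and describe the main content of the documents.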

In [3]:
documents = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={
            "year": 1993,
            "director": "Steven Spielberg",
            "rating": 7.7,
            "genre": "science fiction",
        }
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={
            "year": 2010,
            "director": "Christopher Nolan",
            "rating": 8.2,
            "genre": "action",
        }
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within some dreams",
        metadata={
            "year": 1964,
            "director": "Andrei Tarkovsky",
            "rating": 8.1,
            "genre": "thriller",
        }
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and very very noble for your ...",
        metadata={
            "year": 2015,
            "director": "Dennis Dugan",
            "rating": 7.4,
            "genre": "comedy",
        }
    ),
    Document(
        page_content="This film is just boring",
        metadata={
            "year": 1988,
            "director": "Ryan Fleck",
            "rating": 6.8,
            "genre": "horror",
        }
    )
]

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie, one of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year when the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]

document_content_description="Brief summary of a movie"
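
The declared fields only help if the stored metadata actually matches them. The following is a minimal sketch of such a sanity check, using plain dicts in place of `Document.metadata` and `AttributeInfo`. The `check_metadata` helper and the `PYTHON_TYPES` mapping are hypothetical names introduced for this example, and the allowed genres are copied from the `genre` description string.

```python
# Mapping from the declared type names to Python types (an assumption of this sketch).
PYTHON_TYPES = {"string": str, "integer": int, "float": (int, float)}


def check_metadata(metadata, field_types, allowed_values=None):
    """Return a list of human-readable problems found in one metadata dict."""
    allowed_values = allowed_values or {}
    problems = []
    for name, type_name in field_types.items():
        if name not in metadata:
            problems.append(f"missing field: {name}")
            continue
        value = metadata[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[type_name]):
            problems.append(f"{name}: expected {type_name}, got {type(value).__name__}")
        elif name in allowed_values and value not in allowed_values[name]:
            problems.append(f"{name}: {value!r} is not a declared value")
    for name in metadata:
        if name not in field_types:
            problems.append(f"undeclared field: {name}")
    return problems


field_types = {"genre": "string", "year": "integer", "director": "string", "rating": "float"}
# Genres copied from the description of the "genre" attribute.
allowed = {
    "genre": {"science fiction", "comedy", "drama", "thriller", "romance", "action", "animated"}
}

ok = {"year": 1993, "director": "Steven Spielberg", "rating": 7.7, "genre": "science fiction"}
odd = {"year": 1988, "director": "Ryan Fleck", "rating": 6.8, "genre": "horror"}
print(check_metadata(ok, field_types, allowed))
print(check_metadata(odd, field_types, allowed))
```

Run against the sample data, a check like this flags the `horror` document, because `horror` is not among the genres listed in the field description. The LLM only sees that description, so a mismatch of this kind may lead it to build filters that never match.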

## 2. Build the SelfQueryRetriever

In [7]:
vectorstore = Chroma.from_documents(documents, OllamaEmbeddings(model="llama2-chinese"))
llm = get_model("llama")
# Note: "score_threshold" is likely ignored with search_type="mmr"; it normally
# applies to search_type="similarity_score_threshold".
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info,
    search_type="mmr", search_kwargs={"score_threshold": 0.2},
)
# Plain-retriever alternative without self-querying:
# retriever = vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5})

## 3. Query by metadata only

In [10]:
retriever.invoke("I want to watch a movie with rating higher than 8", verbose = True)

Number of requested results 20 is greater than number of elements in index 10, updating n_results = 10


[Document(id='a522575e-f5d7-42a7-bf6a-b8259e6cdf25', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 8.1, 'year': 1964}, page_content='A psychologist / detective gets lost in a series of dreams within some dreams'),
 Document(id='4e7d7cd8-93fc-45f3-8b26-577d8816019d', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 8.1, 'year': 1964}, page_content='A psychologist / detective gets lost in a series of dreams within some dreams')]
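
The result above lists the same film twice under two different ids, and the Chroma warning reports 10 indexed elements although only 5 documents are defined. One plausible cause is that the indexing cell ran more than once against the same collection. The sketch below shows one way to guard against this: derive a deterministic id from each document and drop repeated hits. `stable_id` and `dedupe` are hypothetical helpers, and plain dicts stand in for `Document` objects.

```python
import hashlib
import json


def stable_id(page_content, metadata):
    """Derive a deterministic id from a document's content and metadata."""
    payload = json.dumps({"content": page_content, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe(results):
    """Drop results that repeat an earlier document, keeping the original order."""
    seen, unique = set(), []
    for doc in results:
        key = stable_id(doc["page_content"], doc["metadata"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Two hits that differ only in their store-assigned id (placeholder ids).
hit = {
    "page_content": "A psychologist / detective gets lost in a series of dreams within some dreams",
    "metadata": {"director": "Andrei Tarkovsky", "genre": "thriller", "rating": 8.1, "year": 1964},
}
results = [dict(hit, id="id-1"), dict(hit, id="id-2")]
print(len(dedupe(results)))
```

Supplying such ids at indexing time (assuming the vector store accepts caller-provided ids, for example through an `ids` argument to `Chroma.from_documents`) should make re-running the indexing cell overwrite existing entries instead of adding copies.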

## 4. Query metadata and document content together

In [11]:
retriever.invoke("Has Dennis Dugan directed any movies about women", verbose = True)


Number of requested results 20 is greater than number of elements in index 10, updating n_results = 10


[Document(id='097185a9-5905-4115-a76a-c3a0676527d5', metadata={'director': 'Dennis Dugan', 'genre': 'comedy', 'rating': 7.4, 'year': 2015}, page_content='A bunch of normal-sized women are supremely wholesome and very very noble for your ...'),
 Document(id='c4391631-6570-40fa-abbd-62d4de248c64', metadata={'director': 'Christopher Nolan', 'genre': 'action', 'rating': 8.2, 'year': 2010}, page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...'),
 Document(id='8a257bcb-723a-4db0-8e97-06884fdf40d5', metadata={'director': 'Ryan Fleck', 'genre': 'horror', 'rating': 6.8, 'year': 1988}, page_content='This film is just boring'),
 Document(id='2dcf6ff8-7f75-494f-9282-9acdde5a3385', metadata={'director': 'Ryan Fleck', 'genre': 'horror', 'rating': 6.8, 'year': 1988}, page_content='This film is just boring')]
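
For this question, an ideal structured query would pair the search text (roughly "women") with a filter requiring `director` to equal `"Dennis Dugan"`. The output above also contains films by other directors, which suggests that the model did not produce that filter in this run. The sketch below illustrates the intended behavior with plain dicts. The `search` function is hypothetical, and a toy word-overlap score stands in for embedding similarity.

```python
def overlap_score(query, text):
    """Toy relevance score: the number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def search(docs, query, director=None, k=2):
    """Filter on the director field first, then rank the survivors by the toy score."""
    candidates = [
        d for d in docs
        if director is None or d["metadata"]["director"] == director
    ]
    ranked = sorted(
        candidates,
        key=lambda d: overlap_score(query, d["page_content"]),
        reverse=True,
    )
    return ranked[:k]


docs = [
    {"page_content": "A bunch of normal-sized women are supremely wholesome",
     "metadata": {"director": "Dennis Dugan"}},
    {"page_content": "Leo DiCaprio gets lost in a dream within a dream",
     "metadata": {"director": "Christopher Nolan"}},
    {"page_content": "This film is just boring",
     "metadata": {"director": "Ryan Fleck"}},
]

hits = search(docs, "women", director="Dennis Dugan")
print([d["metadata"]["director"] for d in hits])
```

With the filter in place, only films by the requested director can be returned. Without it, the search falls back to pure similarity ranking and fills the remaining slots with unrelated films.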

In [34]:
# print_ascii() prints the graph itself and returns None, so no pprint wrapper is needed.
retriever.get_graph().print_ascii()

+-------------------------+  
| SelfQueryRetrieverInput |  
+-------------------------+  
              *              
              *              
              *              
   +--------------------+    
   | SelfQueryRetriever |    
   +--------------------+    
              *              
              *              
              *              
+--------------------------+ 
| SelfQueryRetrieverOutput | 
+--------------------------+ 
