# How to do "self-querying" retrieval


Self-querying retrieval is a retrieval technique for vector stores whose documents carry metadata. It uses a large language model (LLM) to interpret a natural-language query, generate a structured query from it, and apply that structured query to retrieve relevant documents from the vector store.

# Overview
A self-querying retriever uses an LLM to turn a natural-language query into a structured query. It then uses that structured query to retrieve documents from a vector store (a database that stores document embeddings) by:

1. Semantic Similarity: Matching the query's intent with document content.

2. Metadata Filtering: Extracting conditions (e.g., "rating > 8.5") and applying them to document metadata.
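
To make these two parts concrete, here is a minimal pure-Python sketch of what a structured query might hold and how its metadata condition could be checked. The names ToyStructuredQuery, Condition, COMPARATORS, and matches are illustrative assumptions, not LangChain classes, and the split of the question into search text and filter is written by hand rather than produced by an LLM.

```python
import operator
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical comparator names mapped to Python operators (illustration only).
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class Condition:
    attribute: str   # metadata field, e.g. "rating"
    comparator: str  # one of the keys in COMPARATORS
    value: Any       # value to compare against


@dataclass
class ToyStructuredQuery:
    query: str                   # text used for semantic similarity
    filter: Optional[Condition]  # condition applied to document metadata


def matches(metadata: dict, condition: Optional[Condition]) -> bool:
    """Return True if the metadata satisfies the condition (or there is none)."""
    if condition is None:
        return True
    if condition.attribute not in metadata:
        # In this sketch, a document without the field never matches.
        return False
    compare = COMPARATORS[condition.comparator]
    return compare(metadata[condition.attribute], condition.value)


# Hand-written decomposition of "a movie about dreams rated higher than 8.5".
structured = ToyStructuredQuery(
    query="a movie about dreams",
    filter=Condition("rating", "gt", 8.5),
)

print(matches({"rating": 8.6, "year": 2006}, structured.filter))
print(matches({"rating": 8.2, "year": 2010}, structured.filter))
```

In this sketch a document with no rating fails the condition instead of passing by default; that is a choice made for the illustration.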


# Steps to Implement
# 1. Prerequisites

Install necessary packages:



In [1]:
pip install lark langchain-chroma



# 2. Define Your Data

Create a list of Document objects, each with content (e.g., movie summaries) and metadata (e.g., genre, year, rating).

# 3. Embed Your Data

Use a vector store like Chroma to store the documents as embeddings. Embeddings allow for semantic comparisons between queries and documents.
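
As a rough illustration of what "semantic comparison" means, the sketch below ranks three made-up vectors by cosine similarity to a query vector. The 3-dimensional vectors and their labels are invented for this example; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and a vector store may use a different distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" (assumption: not produced by any real model).
toy_embeddings = {
    "dinosaur movie": [0.9, 0.1, 0.0],
    "dream movie": [0.1, 0.9, 0.2],
    "toy movie": [0.0, 0.2, 0.9],
}

# Made-up embedding of a query about dreams.
query_vector = [0.2, 0.8, 0.1]

# Sort document labels from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query_vector, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```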

# 4. Set Metadata Field Info

Define the metadata fields supported by your documents using AttributeInfo. This includes:
* Name (e.g., genre, rating).
* Description (e.g., "The genre of the movie").
* Type (e.g., string, integer, float).

# 5. Use a Query-Constructing Chain
A query-constructing chain uses an LLM (for example, a GPT model) to convert a user query into a structured query. The structured query separates the text to search for from the metadata filters to apply.
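
The input and output of this step can be sketched with a deliberately naive, rule-based stand-in. This is not how the LLM chain works internally: toy_query_constructor is a hypothetical function that only recognizes phrases of the form "rated higher than N" or "rated lower than N", and the dictionary it returns is an assumed shape chosen for illustration.

```python
import re

# Only this one phrasing is recognized; an LLM handles far more variety.
RATING_PATTERN = re.compile(
    r"rated (higher|lower) than (\d+(?:\.\d+)?)", re.IGNORECASE
)


def toy_query_constructor(question: str) -> dict:
    """Split a question into search text and an optional rating filter."""
    match = RATING_PATTERN.search(question)
    if match is None:
        # No recognizable condition: search on the full text, no filter.
        return {"query": question.strip(), "filter": None}

    comparator = "gt" if match.group(1).lower() == "higher" else "lt"
    threshold = float(match.group(2))
    # Remove the condition phrase so only the search text remains.
    remaining = (question[: match.start()] + question[match.end():]).strip()
    return {"query": remaining, "filter": ("rating", comparator, threshold)}


structured = toy_query_constructor("A movie about dreams rated higher than 8.5")
print(structured)
```

The point of the sketch is the shape of the result: free text for similarity search on one side, and a machine-checkable condition on the other.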

# 6. Create the Retriever
The retriever combines:
* Query Constructor: Turns natural language queries into structured filters.
* Vector Store: Retrieves documents based on embeddings.
* Metadata Filters: Filters documents using metadata conditions.

# Example

# 1. Install Dependencies
Ensure you have the necessary packages installed:

In [2]:
pip install --upgrade lark langchain langchain-chroma langchain-openai chromadb




# 2. Prepare Your Data
Create a dataset of documents with metadata. For this example, we'll use a collection of movie descriptions:

In [3]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Sample dataset
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

# Create a Chroma vector store
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())


# 3. Define Metadata Fields
Define the metadata schema to describe the fields available for filtering:

In [4]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

# Define metadata fields
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]

# Description of document content
document_content_description = "Brief summary of a movie"
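
Because the filters are built from this schema, it can help to check that the stored metadata actually agrees with it. The sketch below is one possible sanity check in plain Python; declared_fields mirrors the four fields above as simple name-to-type pairs, and PYTHON_TYPES and find_schema_problems are hypothetical helpers, not part of LangChain.

```python
# Assumed mapping from the declared type names to Python types.
# An integer is accepted where a float is declared (e.g. a rating of 8).
PYTHON_TYPES = {"string": str, "integer": int, "float": (int, float)}

# The same fields as the schema in this section, as (name -> type name) pairs.
declared_fields = {
    "genre": "string",
    "year": "integer",
    "director": "string",
    "rating": "float",
}


def find_schema_problems(metadata: dict) -> list:
    """List fields that are undeclared or whose value has the wrong type."""
    problems = []
    for key, value in metadata.items():
        if key not in declared_fields:
            problems.append(f"undeclared field: {key}")
            continue
        expected = declared_fields[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[expected]):
            problems.append(
                f"{key} should be {expected}, got {type(value).__name__}"
            )
    return problems


print(find_schema_problems({"year": 1993, "rating": 7.7, "genre": "science fiction"}))
print(find_schema_problems({"year": "1993"}))
```

Missing fields are not reported as problems here, since several documents in the dataset legitimately omit genre, director, or rating.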


# 4. Create the Self-Querying Retriever
Instantiate the retriever with the LLM, the vector store, the document content description, and the metadata schema:



In [5]:
# Use a ChatGPT-like model for query construction
llm = ChatOpenAI(temperature=0)

# Create the Self-Querying Retriever
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)


# 5. Query Examples
Now you can test the retriever by providing natural language queries.

a) Filter by rating

In [6]:
result = retriever.invoke("I want to watch a movie rated higher than 8.5")
for doc in result:
    print(doc.page_content, doc.metadata)


Three men walk into the Zone, three men walk out of the Zone {'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}
A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea {'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}


b) Filter by genre and rating



In [7]:
result = retriever.invoke("What's a highly-rated animated movie?")
for doc in result:
    print(doc.page_content, doc.metadata)


A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea {'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}
Leo DiCaprio gets lost in a dream within a dream within a dream within a ... {'director': 'Christopher Nolan', 'rating': 8.2, 'year': 2010}
Three men walk into the Zone, three men walk out of the Zone {'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}
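A bunch of normal-sized women are supremely wholesome and some men pine after them {'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019}

Note that none of these results is animated. The only animated document in the dataset (the one about toys) has no rating in its metadata, so a rating condition would presumably exclude it, and the results shown are consistent with a rating filter alone having been applied.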


c) Composite query

In [8]:
result = retriever.invoke("Is there a science fiction movie from the 1990s?")
for doc in result:
    print(doc.page_content, doc.metadata)


A bunch of scientists bring back dinosaurs and mayhem breaks loose {'genre': 'science fiction', 'rating': 7.7, 'year': 1993}
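
The filter that the LLM generated for this query is not shown in the output above. As a sketch of what a composite filter can look like, the code below writes a plausible hand-made equivalent in the style of Chroma's where syntax ($and, $eq, $gte, $lt) and evaluates it in plain Python against metadata from the sample dataset. The evaluate function is a simplified stand-in written for this example and is not Chroma's implementation.

```python
import operator

# Subset of comparison operators, in the style of Chroma's "where" filters.
OPS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def evaluate(where: dict, metadata: dict) -> bool:
    """Evaluate a nested filter against one document's metadata."""
    if "$and" in where:
        return all(evaluate(clause, metadata) for clause in where["$and"])
    if "$or" in where:
        return any(evaluate(clause, metadata) for clause in where["$or"])
    # Leaf clause of the form {"field": {"$op": value}}.
    ((field, condition),) = where.items()
    if field not in metadata:
        return False  # in this sketch, a missing field never matches
    ((op, value),) = condition.items()
    return OPS[op](metadata[field], value)


# Assumed filter for "a science fiction movie from the 1990s".
nineties_scifi = {
    "$and": [
        {"genre": {"$eq": "science fiction"}},
        {"year": {"$gte": 1990}},
        {"year": {"$lt": 2000}},
    ]
}

# Metadata copied from four of the sample documents.
sample_metadata = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
]

matching = [m for m in sample_metadata if evaluate(nineties_scifi, m)]
print(matching)
```

Only the 1993 science fiction entry satisfies all three clauses, which agrees with the single result returned above.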


Self-querying retrieval thus lets you retrieve documents using semantic search and metadata filtering together, driven by a single natural-language query. You can extend this example by swapping in another vector store or by customizing the metadata fields.