# Self-Querying Retrieval (SQR)

Traditional retrieval systems often require complex query languages or predefined search facets. Self-querying retrieval (SQR), on the other hand, 
offers a more natural and user-friendly approach. Here’s why it’s valuable:

Natural language queries: SQR allows users to pose their queries in natural language, making the interaction more intuitive and accessible. This is 
akin to having a conversation with a librarian, where users can ask questions in their own words without needing to learn complex query languages
or specific search facets.

Advanced retrieval capabilities: Unlike basic keyword search, SQR can search documents based on both their content and their metadata
(author, genre, year, etc.). Combining the two allows for more precise results. Imagine searching for a specific legal document: SQR can not
only find documents with relevant content but also narrow them down by author (judge) or year (case date).

Flexibility: SQR systems can adapt to user intent by refining search results based on follow-up questions or additional inputs, which
narrows the results to better match the user’s needs.

What is SQR?

Self-querying retrieval (SQR) uses an LLM to interpret the user’s intent and translate it into a search over a document collection. Here’s the core idea:

       Document representation: An embedding model converts each document into a numerical vector. This allows for efficient similarity
       comparison between a query and the documents.

       User query: The user submits a natural language query expressing their information need.

       LLM-driven retrieval: The LLM converts the query into a structured query: a semantic search string plus optional metadata filters.
       The vector store then runs that structured query and returns the documents that best match the user’s intent.

       Refine and repeat: The user can refine their query or ask follow-up questions for a more focused search based on the retrieved documents.
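
The flow above can be sketched without any retrieval library. This is a minimal illustration, not LangChain's implementation: `Book`, `StructuredQuery`, `overlap_score`, and `run` are hypothetical names, the structured query is written by hand where an LLM would normally produce it, and word overlap stands in for embedding similarity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Book:
    text: str
    metadata: dict


@dataclass
class StructuredQuery:
    """Hypothetical stand-in for what the LLM derives from a user query."""
    query: str                                              # semantic part, matched against content
    metadata_filter: Optional[Callable[[dict], bool]] = None  # metadata part, applied as a predicate
    limit: Optional[int] = None                             # optional cap on the number of results


def overlap_score(query: str, text: str) -> int:
    """Crude stand-in for embedding similarity: count shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def run(sq: StructuredQuery, books: list[Book]) -> list[Book]:
    # 1. Keep only the books whose metadata passes the filter.
    candidates = [b for b in books if sq.metadata_filter is None or sq.metadata_filter(b.metadata)]
    # 2. Rank the remaining books by similarity to the semantic part of the query.
    ranked = sorted(candidates, key=lambda b: overlap_score(sq.query, b.text), reverse=True)
    # 3. Apply the optional limit.
    return ranked[: sq.limit] if sq.limit is not None else ranked


books = [
    Book("A highly regarded novel with deep themes", {"title": "To Kill a Mockingbird", "rating": 4.9}),
    Book("An intense, gripping story with dark themes", {"title": "Gone Girl", "rating": 4.3}),
    Book("An elegant, balanced narrative with subtle themes", {"title": "Pride and Prejudice", "rating": 4.6}),
]

# "I want a book with deep themes and a rating above 4.5", decomposed by hand.
sq = StructuredQuery(query="deep themes", metadata_filter=lambda m: m["rating"] > 4.5)
results = [b.metadata["title"] for b in run(sq, books)]
print(results)
```

The point of the split is that the two halves of the request are handled by different mechanisms: the rating condition is an exact comparison on metadata, while "deep themes" is a fuzzy match on content.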



## 1. Import necessary libraries

In [None]:
import os
from langchain.schema import Document
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

## 2. Set up the OpenAI API key

In [None]:
# Read the key from the environment; set OPENAI_API_KEY before running this cell
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
if OPENAI_API_KEY == "":
    raise ValueError("Please set the OPENAI_API_KEY environment variable")

## 3. Example data with metadata

"""
Code explanation:

This cell initializes a list named docs that contains multiple Document objects, each representing a book. Each Document object has page_content
describing the narrative and metadata that includes title, author, year, genre, rating, language, and country. The books range from fiction and
historical fiction to romance, adventure, dystopian, thriller, and magical realism, written by renowned authors from different countries and languages.

"""

In [None]:
docs = [
    Document(
        page_content="A complex, layered narrative exploring themes of identity and belonging",
        metadata={"title":"The Namesake", "author": "Jhumpa Lahiri", "year": 2003, "genre": "Fiction", "rating": 4.5, "language":"English", "country":"USA"},
    ),
    Document(
        page_content="A luxurious, heartfelt novel with themes of love and loss set against a historical backdrop",
        metadata={"title":"The Nightingale", "author": "Kristin Hannah", "year": 2015, "genre": "Historical Fiction", "rating": 4.8, "language":"English", "country":"France"},
    ),
    Document(
        page_content="A full-bodied epic with rich characters and a sprawling plot",
        metadata={"title":"War and Peace", "author": "Leo Tolstoy", "year": 1869, "genre": "Historical Fiction", "rating": 4.7, "language":"Russian", "country":"Russia"},
    ),
    Document(
        page_content="An elegant, balanced narrative with intricate character development and subtle themes",
        metadata={"title":"Pride and Prejudice", "author": "Jane Austen", "year": 1813, "genre": "Romance", "rating": 4.6, "language":"English", "country":"UK"},
    ),
    Document(
        page_content="A highly regarded novel with deep themes and a nuanced exploration of human nature",
        metadata={"title":"To Kill a Mockingbird", "author": "Harper Lee", "year": 1960, "genre": "Fiction", "rating": 4.9, "language":"English", "country":"USA"},
    ),
    Document(
        page_content="A crisp, engaging story with vibrant characters and a compelling plot",
        metadata={"title":"The Alchemist", "author": "Paulo Coelho", "year": 1988, "genre": "Adventure", "rating": 4.4, "language":"Portuguese", "country":"Brazil"},
    ),
    Document(
        page_content="A rich, complex narrative set in a dystopian future with strong thematic elements",
        metadata={"title":"1984", "author": "George Orwell", "year": 1949, "genre": "Dystopian", "rating": 4.7, "language":"English", "country":"UK"},
    ),
    Document(
        page_content="An intense, gripping story with dark themes and intricate plot twists",
        metadata={"title":"Gone Girl", "author": "Gillian Flynn", "year": 2012, "genre": "Thriller", "rating": 4.3, "language":"English", "country":"USA"},
    ),
    Document(
        page_content="An exotic, enchanting tale with rich descriptions and an intricate plot",
        metadata={"title":"One Hundred Years of Solitude", "author": "Gabriel García Márquez", "year": 1967, "genre": "Magical Realism", "rating": 4.8, "language":"Spanish", "country":"Colombia"},
    ),
    # ... (add more book documents as needed)
]
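
Metadata filters compare typed values, so a year stored as a string or a missing field can make a filter silently miss a document. A quick sanity check before indexing may help. The sketch below works on plain dictionaries so that it runs on its own; `EXPECTED_FIELDS` and `find_metadata_problems` are hypothetical names, not part of LangChain.

```python
# Expected metadata schema (field name -> Python type), mirroring the fields used above.
EXPECTED_FIELDS = {
    "title": str,
    "author": str,
    "year": int,
    "genre": str,
    "rating": float,
    "language": str,
    "country": str,
}


def find_metadata_problems(metadata: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the metadata looks fine."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            actual = type(metadata[field]).__name__
            problems.append(f"{field} should be {expected_type.__name__}, got {actual}")
    return problems


good = {"title": "1984", "author": "George Orwell", "year": 1949, "genre": "Dystopian",
        "rating": 4.7, "language": "English", "country": "UK"}
# Deliberately broken: year is a string and country is missing.
bad = {"title": "1984", "author": "George Orwell", "year": "1949", "genre": "Dystopian",
       "rating": 4.7, "language": "English"}

print(find_metadata_problems(good))
print(find_metadata_problems(bad))
```

To check the notebook's own data, the same function could be called as `find_metadata_problems(doc.metadata)` for each `doc` in `docs`.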

## 4. Define the embedding function

"""
Creating an instance of OpenAIEmbeddings which converts document text into numerical representations suitable for retrieval.

"""

In [None]:
embeddings = OpenAIEmbeddings()
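
To see why numerical representations are suitable for retrieval, it helps to look at how two vectors are compared. The sketch below uses cosine similarity on made-up 3-dimensional vectors; real embeddings have far more dimensions, and a vector store may use a different metric (such as Euclidean distance) by default. The vectors and labels are illustrative assumptions, not output from `OpenAIEmbeddings`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "thriller": [0.9, 0.1, 0.8],
    "romance": [0.0, 1.0, 0.1],
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)  # the document whose vector points closest to the query's
print(best, scores)
```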

## 5. Initializing vector store

"""
Creates a Chroma vector store from the documents. Chroma embeds each document with the embedding function above and stores the vectors for efficient retrieval.
"""

In [None]:
vectorstore = Chroma.from_documents(docs, embeddings)

## 6. Create LLM and retriever

"""
Code explanation:

Defines the metadata_field_info list, which specifies metadata attributes used for retrieval, such as title, author, year, genre, rating, language,
and country, along with their descriptions and data types.

Specifies document_content_description which provides a brief description of the document content.

Creates an instance of OpenAI with the temperature set to 0 so that the generated structured queries are as deterministic as possible.

Creates a SelfQueryRetriever object named retriever. This is the core component for self-query retrieval (SQR) functionality, using the following arguments:

          llm: The OpenAI LLM instance created earlier.

          vectorstore: The vector store containing document embeddings.

          document_content_description: The description of the document content.

          metadata_field_info: The list defining the searchable metadata attributes.

          verbose=True: Enables verbose output during retrieval, logging the structured query the LLM generated from the user's question.

"""


In [None]:
metadata_field_info = [
    AttributeInfo(
        name="title",
        description="The title of the book",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="author",
        description="The author of the book",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the book was published",
        type="integer",
    ),
    AttributeInfo(
        name="genre",
        description="The genre of the book",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="rating",
        description="The rating of the book (1-5 scale)",
        type="float",
    ),
    AttributeInfo(
        name="language",
        description="The language the book is written in",
        type="string",
    ),
    AttributeInfo(
        name="country",
        description="The country the author is from",
        type="string",
    ),
]

document_content_description = "Brief description of the book"

In [None]:
llm = OpenAI(temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True
)

## 7. Example queries

"""
Code explanation:

7.1 retrieves documents for a query that names a genre and a quality (“highly rated historical fiction books”).

7.2 combines semantic search with filtering based on metadata (“deep themes” and “rating above 4.5”).

7.3 retrieves documents based on concepts derived from the query text alone (“complex characters” and “gripping plot”).

7.4 retrieves documents using a specific metadata field (“books from the USA”).

7.5 combines multiple filters for a more precise search (“published after 2003 but before 2015 with deep themes and a high rating”).

7.6 allows specifying the maximum number of documents retrieved (e.g., “two books that have a rating above 4.8” or "two books that come from the USA or UK”). The limit is an upper bound: fewer documents come back when fewer match the filter.

"""


### 7.1 Basic query

In [9]:
retriever.invoke("What are some highly rated historical fiction books")

[Document(page_content='A luxurious, heartfelt novel with themes of love and loss set against a historical backdrop', metadata={'title': 'The Nightingale', 'author': 'Kristin Hannah', 'year': 2015, 'genre': 'Historical Fiction', 'rating': 4.8, 'language': 'English', 'country': 'France'}),
 Document(page_content='A rich, complex narrative set in a dystopian future with strong thematic elements', metadata={'title': '1984', 'author': 'George Orwell', 'year': 1949, 'genre': 'Dystopian', 'rating': 4.7, 'language': 'English', 'country': 'UK'}),
 Document(page_content='An exotic, enchanting tale with rich descriptions and an intricate plot', metadata={'title': 'One Hundred Years of Solitude', 'author': 'Gabriel García Márquez', 'year': 1967, 'genre': 'Magical Realism', 'rating': 4.8, 'language': 'Spanish', 'country': 'Colombia'}),
 Document(page_content='An intense, gripping story with dark themes and intricate plot twists', metadata={'title': 'Gone Girl', 'author': 'Gillian Flynn', 'year': 2

### 7.2 Query with filtering

In [10]:
retriever.invoke("I want a book with deep themes and a rating above 4.5")

[Document(page_content='A highly regarded novel with deep themes and a nuanced exploration of human nature', metadata={'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'year': 1960, 'genre': 'Fiction', 'rating': 4.9, 'language': 'English', 'country': 'USA'}),
 Document(page_content='An elegant, balanced narrative with intricate character development and subtle themes', metadata={'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'year': 1813, 'genre': 'Romance', 'rating': 4.6, 'language': 'English', 'country': 'UK'}),
 Document(page_content='A rich, complex narrative set in a dystopian future with strong thematic elements', metadata={'title': '1984', 'author': 'George Orwell', 'year': 1949, 'genre': 'Dystopian', 'rating': 4.7, 'language': 'English', 'country': 'UK'}),
 Document(page_content='A luxurious, heartfelt novel with themes of love and loss set against a historical backdrop', metadata={'title': 'The Nightingale', 'author': 'Kristin Hannah', 'year': 2015, 'genre':

### 7.3 Query on content only

In [11]:
retriever.invoke("I want a book with complex characters and a gripping plot")

[Document(page_content='An intense, gripping story with dark themes and intricate plot twists', metadata={'title': 'Gone Girl', 'author': 'Gillian Flynn', 'year': 2012, 'genre': 'Thriller', 'rating': 4.3, 'language': 'English', 'country': 'USA'}),
 Document(page_content='A crisp, engaging story with vibrant characters and a compelling plot', metadata={'title': 'The Alchemist', 'author': 'Paulo Coelho', 'year': 1988, 'genre': 'Adventure', 'rating': 4.4, 'language': 'Portuguese', 'country': 'Brazil'}),
 Document(page_content='An elegant, balanced narrative with intricate character development and subtle themes', metadata={'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'year': 1813, 'genre': 'Romance', 'rating': 4.6, 'language': 'English', 'country': 'UK'}),
 Document(page_content='A full-bodied epic with rich characters and a sprawling plot', metadata={'title': 'War and Peace', 'author': 'Leo Tolstoy', 'year': 1869, 'genre': 'Historical Fiction', 'rating': 4.7, 'language': 'Rus

### 7.4 Query with country filter

In [12]:
retriever.invoke("What books come from the USA?")

[Document(page_content='A complex, layered narrative exploring themes of identity and belonging', metadata={'title': 'The Namesake', 'author': 'Jhumpa Lahiri', 'year': 2003, 'genre': 'Fiction', 'rating': 4.5, 'language': 'English', 'country': 'USA'}),
 Document(page_content='An intense, gripping story with dark themes and intricate plot twists', metadata={'title': 'Gone Girl', 'author': 'Gillian Flynn', 'year': 2012, 'genre': 'Thriller', 'rating': 4.3, 'language': 'English', 'country': 'USA'}),
 Document(page_content='A highly regarded novel with deep themes and a nuanced exploration of human nature', metadata={'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'year': 1960, 'genre': 'Fiction', 'rating': 4.9, 'language': 'English', 'country': 'USA'})]

### 7.5 Query with year range and theme filter

In [13]:
retriever.invoke("What's a book published after 2003 but before 2015 with deep themes and a high rating")

[Document(page_content='An intense, gripping story with dark themes and intricate plot twists', metadata={'title': 'Gone Girl', 'author': 'Gillian Flynn', 'year': 2012, 'genre': 'Thriller', 'rating': 4.3, 'language': 'English', 'country': 'USA'})]
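
The year range explains why a single book comes back. The sketch below evaluates a hand-written nested filter, `and(gt(year, 2003), lt(year, 2015))`, against the publication years in the sample data. The tuple-based expression format, `COMPARATORS`, and `matches` are hypothetical and only illustrate the idea of a composite filter; they are not the retriever's internal representation.

```python
import operator

# (title, year) pairs copied from the sample data above.
books = [
    ("The Namesake", 2003), ("The Nightingale", 2015), ("War and Peace", 1869),
    ("Pride and Prejudice", 1813), ("To Kill a Mockingbird", 1960), ("The Alchemist", 1988),
    ("1984", 1949), ("Gone Girl", 2012), ("One Hundred Years of Solitude", 1967),
]

COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(expr: tuple, metadata: dict) -> bool:
    """Recursively evaluate a nested filter expression against one document's metadata."""
    op = expr[0]
    if op == "and":
        return all(matches(sub, metadata) for sub in expr[1])
    if op == "or":
        return any(matches(sub, metadata) for sub in expr[1])
    _, field, value = expr  # leaf node: (comparator, field, value)
    return COMPARATORS[op](metadata[field], value)


# "published after 2003 but before 2015": both bounds are strict.
year_filter = ("and", [("gt", "year", 2003), ("lt", "year", 2015)])

in_range = [title for title, year in books if matches(year_filter, {"year": year})]
print(in_range)
```

Because both comparisons are strict, The Namesake (2003) and The Nightingale (2015) fall just outside the range, leaving only Gone Girl (2012).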

### 7.6 Retrieval with limiting

In [14]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

In [15]:
retriever.invoke("What are two books that have a rating above 4.8")

[Document(page_content='A highly regarded novel with deep themes and a nuanced exploration of human nature', metadata={'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'year': 1960, 'genre': 'Fiction', 'rating': 4.9, 'language': 'English', 'country': 'USA'})]

In [16]:
retriever.invoke("What are two books that come from USA or UK")

[Document(page_content='A complex, layered narrative exploring themes of identity and belonging', metadata={'title': 'The Namesake', 'author': 'Jhumpa Lahiri', 'year': 2003, 'genre': 'Fiction', 'rating': 4.5, 'language': 'English', 'country': 'USA'}),
 Document(page_content='An intense, gripping story with dark themes and intricate plot twists', metadata={'title': 'Gone Girl', 'author': 'Gillian Flynn', 'year': 2012, 'genre': 'Thriller', 'rating': 4.3, 'language': 'English', 'country': 'USA'})]
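
The two results above differ in length because a limit caps the number of documents without padding the result. A minimal sketch using the ratings and countries from the sample data; `retrieve` is a hypothetical helper, and it returns matches in list order, whereas the real retriever ranks them by similarity, so the specific titles can differ.

```python
from typing import Callable, Optional

# Title, rating, and country copied from the sample data above.
books = [
    {"title": "The Namesake", "rating": 4.5, "country": "USA"},
    {"title": "The Nightingale", "rating": 4.8, "country": "France"},
    {"title": "War and Peace", "rating": 4.7, "country": "Russia"},
    {"title": "Pride and Prejudice", "rating": 4.6, "country": "UK"},
    {"title": "To Kill a Mockingbird", "rating": 4.9, "country": "USA"},
    {"title": "The Alchemist", "rating": 4.4, "country": "Brazil"},
    {"title": "1984", "rating": 4.7, "country": "UK"},
    {"title": "Gone Girl", "rating": 4.3, "country": "USA"},
    {"title": "One Hundred Years of Solitude", "rating": 4.8, "country": "Colombia"},
]


def retrieve(predicate: Callable[[dict], bool], limit: Optional[int] = None) -> list[dict]:
    """Filter first, then cap the result; slicing never adds items to reach the limit."""
    matched = [b for b in books if predicate(b)]
    return matched[:limit]


# "two books that have a rating above 4.8": only one book is strictly above 4.8.
above_48 = retrieve(lambda b: b["rating"] > 4.8, limit=2)

# "two books that come from USA or UK": five books match, so the limit trims them to two.
usa_or_uk = retrieve(lambda b: b["country"] in {"USA", "UK"}, limit=2)

print([b["title"] for b in above_48])
print(len(usa_or_uk))
```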