# Query Construction: Text-to-metadata-filter

![text-to-metadata-filter](../images/images-text-to-metadata-filter.png)

**Text-to-Metadata-Filter** is a technique that transforms a user's natural language query into specific metadata criteria, which are then used to filter and retrieve the most relevant documents from a database or document store. This approach enhances the precision of information retrieval by narrowing down search results based on structured metadata attributes such as date, author, category, or tags.

**How It Works:**

1. **Analyze the Query:**
   - The system examines the user's input to identify key elements that correspond to metadata fields.
   - *Example:*
     - User Query: "Find recent articles on climate change by Dr. Smith."
     - Identified Metadata:
       - Date: Recent
       - Topic: Climate Change
       - Author: Dr. Smith

2. **Construct Metadata Filters:**
   - Based on the analysis, the system creates filters that correspond to the identified metadata.
   - *Example:*
     - Filters:
       - Publication Date: Within the last year
       - Author: Dr. Smith
       - Subject: Climate Change

3. **Apply Filters to Retrieve Documents:**
   - The system uses these metadata filters to search the document store, retrieving only those documents that match the specified criteria.
   - *Example:*
     - Retrieved Documents:
       - "The Impact of Climate Change on Coastal Ecosystems" by Dr. Smith, published six months ago.
       - "Recent Advances in Climate Change Research" by Dr. Smith, published three months ago.
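
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production retriever: the `Document` class, the `apply_filters` helper, the sample corpus, and the reading of "recent" as "within the last year" are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    title: str
    author: str
    subject: str
    published: date


def apply_filters(docs, author, subject, published_after):
    """Keep only documents whose metadata satisfies every filter."""
    return [
        d
        for d in docs
        if d.author == author
        and d.subject == subject
        and d.published >= published_after
    ]


# Hypothetical corpus; "today" is pinned so the example is reproducible.
today = date(2024, 6, 1)
corpus = [
    Document("The Impact of Climate Change on Coastal Ecosystems", "Dr. Smith", "Climate Change", date(2023, 12, 1)),
    Document("Recent Advances in Climate Change Research", "Dr. Smith", "Climate Change", date(2024, 3, 1)),
    Document("Early Climate Models", "Dr. Smith", "Climate Change", date(2015, 5, 20)),
    Document("Ocean Acidification Trends", "Dr. Lee", "Climate Change", date(2024, 1, 15)),
]

# Filters produced by step 2; "recent" is interpreted as the last 365 days (an assumption).
results = apply_filters(
    corpus,
    author="Dr. Smith",
    subject="Climate Change",
    published_after=today - timedelta(days=365),
)
for doc in results:
    print(doc.title)
```

Only the two recent Dr. Smith documents pass all three filters; the 2015 paper fails the date filter and the Dr. Lee paper fails the author filter.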

**Benefits:**

- **Precision:** By filtering based on specific metadata, the system retrieves documents that closely match the user's intent, reducing irrelevant results.
- **Efficiency:** Narrowing down the search space leads to faster retrieval times and a more streamlined user experience.
- **User Satisfaction:** Providing highly relevant results increases user trust and satisfaction with the system.

**Advanced Applications:**

- **Dynamic Metadata Extraction:** Some systems can automatically extract potential metadata filters from user queries without explicit input. For instance, a query like "Show me 2022 reports on renewable energy" can be parsed to apply filters for the year 2022 and the topic "renewable energy." 

- **Integration with Vector Stores:** In vector databases that support metadata filtering, natural language queries can be translated into structured queries with metadata filters, enhancing retrieval from unstructured documents. 
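
As a rough illustration of dynamic extraction, the sketch below pulls a four-digit year out of a query with a regular expression and turns it into a date-range filter. The function name `extract_year_filter` and the filter keys `published_from` and `published_before` are hypothetical; a real system would more likely delegate this parsing to an LLM, as the rest of this notebook does.

```python
import re
from datetime import date


def extract_year_filter(query):
    """Return (remaining_text, filters); filters holds a date range if a year is found."""
    match = re.search(r"\b(19|20)\d{2}\b", query)
    if match is None:
        return query, {}
    year = int(match.group(0))
    filters = {
        "published_from": date(year, 1, 1),        # inclusive lower bound
        "published_before": date(year + 1, 1, 1),  # exclusive upper bound
    }
    # Remove the year from the text and collapse the leftover whitespace.
    remaining = " ".join((query[: match.start()] + query[match.end():]).split())
    return remaining, filters


text, filters = extract_year_filter("Show me 2022 reports on renewable energy")
print(text)
print(filters)
```

The remaining text ("Show me reports on renewable energy") can then be used for the unstructured search, while the date range is applied as a metadata filter.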

**Example in Practice:**

Imagine a digital library where each document is tagged with metadata such as author, publication date, and topics covered. A user interested in recent publications by a specific author on a particular subject can have their natural language query converted into metadata filters, allowing the system to efficiently retrieve the most relevant documents.

*User Query:*
- "What are the latest research papers on machine learning by Professor Johnson?"

*System Analysis and Filter Construction:*
- Author: Professor Johnson
- Topic: Machine Learning
- Publication Date: Last two years

*Retrieved Results:*
- "Advancements in Supervised Learning Techniques" by Professor Johnson, published in 2023.
- "Deep Learning Architectures for Image Recognition" by Professor Johnson, published in 2022.

Here, converting the query into explicit filters on author, topic, and publication date restricts retrieval to documents that satisfy all three constraints, rather than relying on text similarity alone.

![text-to-metadata-filter](../images/text-to-metadata-filter.png)

## Setup

In [2]:
%run "../Z - Common/setup.ipynb"

!pip install -qU arxiv pymupdf 

As a running example, let's search arXiv papers:


In [3]:
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(
    query="reasoning",
    load_max_docs=5,
    load_all_available_meta=True
).load()

print(docs[0].page_content[:1000])
print(docs[0].metadata)

GRAPH-CONSTRAINED REASONING: FAITHFUL REA-
SONING ON KNOWLEDGE GRAPHS WITH LARGE LAN-
GUAGE MODELS
Linhao Luo1∗, Zicheng Zhao2∗, Chen Gong2, Gholamreza Haffari1, Shirui Pan3†
1Monash University 2Nanjing University of Science and Technology 3Griffith University
{Linhao.Luo,Gholamreza.Haffari}@monash.edu
{zicheng.zhao,chen.gong}@njust.edu.cn, s.pan@griffith.edu.au
ABSTRACT
Large language models (LLMs) have demonstrated impressive reasoning abilities,
but they still struggle with faithful reasoning due to knowledge gaps and halluci-
nations. To address these issues, knowledge graphs (KGs) have been utilized to
enhance LLM reasoning through their structured knowledge. However, existing
KG-enhanced methods, either retrieval-based or agent-based, encounter difficul-
ties in accurately retrieving knowledge and efficiently traversing KGs at scale.
In this work, we introduce graph-constrained reasoning (GCR), a novel frame-
work that bridges structured knowledge in KGs with unstructured reasoni

Let's assume we want to build an index that enables us to:

- perform unstructured search over the `Title` and `Summary` attributes of each document
- use range filtering on `Published`

To convert a natural language query into a structured query, we first need to define a schema for the structured search queries:

In [4]:
from pydantic import BaseModel, Field
from typing import Optional
import datetime

class ArxivSearch(BaseModel):
    """Search over Arxiv documents."""

    title_search: str = Field(
        ...,
        description="Similarity search query applied to the title.",
    )
    summary_search: str = Field(
        ...,
        description=(
            "Alternate version of the content search query to apply to summaries. "
            "Should be succinct and only include key words that could be in a summary."
        ),
    )
    earliest_published_date: Optional[datetime.date] = Field(
        None,
        description="Earliest published date filter, inclusive. Only use if explicitly specified.",
    )
    latest_published_date: Optional[datetime.date] = Field(
        None,
        description="Latest published date filter, exclusive. Only use if explicitly specified.",
    )


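    def pretty_print(self) -> None:
        """Print only the fields that are set and differ from their default."""
        # Read model_fields from the class; instance access is deprecated
        # in recent Pydantic v2 releases.
        for name, field_info in type(self).model_fields.items():
            value = getattr(self, name)
            if value is not None and value != field_info.default:
                print(f"{name}: {value}")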

Now we prompt the LLM to produce queries.

In [5]:
from langchain_core.prompts import ChatPromptTemplate

template = """You are an expert at converting user questions into database queries. \
You have access to a database of scholarly articles. \
Given a question, return a database query optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        ("human", "{question}"),
    ]
)

# `llm` is assumed to be the chat model created by the setup notebook run above.
structured_llm = llm.with_structured_output(ArxivSearch)

chain_query_analyzer = prompt | structured_llm

In [6]:
chain_query_analyzer.invoke({"question": "What are the main components of an LLM-powered autonomous agent system?"}).pretty_print()

title_search: LLM autonomous agent system architecture components
summary_search: LLM agent system components architecture framework autonomous planning reasoning memory


In [7]:
chain_query_analyzer.invoke({"question": "what papers on RAG were published in 2024?"}).pretty_print()

title_search: RAG "Retrieval Augmented Generation"
summary_search: RAG "Retrieval Augmented Generation" LLM retrieval-augmented
earliest_published_date: 2024-01-01
latest_published_date: 2025-01-01
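
To show how such a date range could be applied to the `Published` metadata, here is a minimal sketch in plain Python. It assumes `Published` is stored as an ISO `YYYY-MM-DD` string; the helper `in_date_range` and the sample metadata are hypothetical and stand in for whatever filtering the chosen vector store provides. The bounds follow the schema above: the earliest date is inclusive and the latest date is exclusive.

```python
from datetime import date
from typing import Optional


def in_date_range(published: str, earliest: Optional[date], latest: Optional[date]) -> bool:
    """Check a 'Published' string against an inclusive lower and exclusive upper bound."""
    d = date.fromisoformat(published)
    if earliest is not None and d < earliest:
        return False
    if latest is not None and d >= latest:
        return False
    return True


# Hypothetical metadata records, shaped like the loader's metadata dicts.
docs_meta = [
    {"Title": "Paper A", "Published": "2023-12-31"},
    {"Title": "Paper B", "Published": "2024-01-01"},
    {"Title": "Paper C", "Published": "2024-12-31"},
    {"Title": "Paper D", "Published": "2025-01-01"},
]

selected = [
    m["Title"]
    for m in docs_meta
    if in_date_range(m["Published"], date(2024, 1, 1), date(2025, 1, 1))
]
print(selected)  # ['Paper B', 'Paper C']
```

Using an exclusive upper bound of `2025-01-01` captures every day of 2024 without having to spell out the last day of the year.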
