# Chapter 3: RAG Part II: Chatting with Your Data

## Query Construction

As discussed earlier, RAG is an effective strategy for embedding and retrieving relevant unstructured data from a vector store based on a query. But most data used in production apps is structured and typically stored in relational databases. In addition, unstructured data embedded in a vector store often carries structured metadata that holds important information.

_Query construction_ is the process of transforming a natural language query into the query language of the database or data source you are interacting with.
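
As a minimal illustration of what such a transformation targets, the sketch below pairs a natural language question with a hand-written SQL statement and runs it against an in-memory SQLite table. The table name, column names, movie titles, and the SQL string are all hypothetical; in a real pipeline an LLM would generate the query rather than a developer writing it by hand.

```python
import sqlite3

# Hypothetical question a user might ask in natural language.
question = "Which movies released after 2000 have a rating above 8?"

# Hand-written stand-in for the SQL that a query construction step might produce.
generated_sql = (
    "SELECT title FROM movies "
    "WHERE year > 2000 AND rating > 8 "
    "ORDER BY title"
)

# Build a tiny in-memory table so the query can actually run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", 1993, 7.7),
        ("Movie B", 2010, 8.2),
        ("Movie C", 2006, 8.6),
    ],
)

titles = [row[0] for row in conn.execute(generated_sql).fetchall()]
conn.close()
print(titles)
```

Only the two rows that satisfy both conditions come back; the question itself never reaches the database, only its structured translation does.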

### Text-to-Metadata Filter

Most vector stores provide the ability to limit your vector search based on metadata. During the embedding process, you can attach metadata key-value pairs to the vectors in an index and later specify filter expressions when you query the index.
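
The idea can be sketched without any vector store at all. The toy index below is an assumption made for illustration: the two-dimensional vectors, the titles, and the `filtered_search` helper are hypothetical, and a real store would apply the filter inside its own query engine instead of in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy index: each entry pairs a (made-up) embedding with its metadata.
index = [
    ([0.9, 0.1], {"title": "A", "year": 1993}),
    ([0.8, 0.2], {"title": "B", "year": 2010}),
    ([0.1, 0.9], {"title": "C", "year": 2019}),
]


def filtered_search(query_vector, metadata_filter, k=2):
    """Keep entries whose metadata passes the filter, then rank by similarity."""
    candidates = [(vec, meta) for vec, meta in index if metadata_filter(meta)]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    return [meta for _, meta in ranked[:k]]


# Entry "A" is the closest vector, but the filter excludes it before ranking.
results = filtered_search([1.0, 0.0], lambda meta: meta["year"] > 2000)
print([meta["title"] for meta in results])
```

Because the filter runs before the ranking, the nearest vector overall is never returned if its metadata does not match.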

LangChain provides a ```SelfQueryRetriever``` that abstracts this logic and makes it easier to translate natural language queries into structured queries for various data sources. Self-querying uses an LLM to extract the relevant metadata filters from a user’s query, based on a predefined metadata schema, and then executes them:

**NOTE:** Do not forget to launch a new pgvector Docker container before using this notebook. Execute ```docker compose up -d``` in the terminal.

1. Set up the vector store:

In [1]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_deepseek import ChatDeepSeek
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
from langchain_core.documents import Document
from dotenv import load_dotenv
import os


load_dotenv()

# vector store credentials
connection_credentials = f"postgresql+psycopg://{os.getenv('POSTGRES_USER')}:{os.getenv('POSTGRES_PASSWORD')}@localhost:8888/{os.getenv('POSTGRES_DB')}"

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

# Create embeddings for the documents
embeddings_model = HuggingFaceEmbeddings(
    model="sentence-transformers/all-mpnet-base-v2", # use this model to perform the embedding
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": False},
)

vector_store = PGVector.from_documents(documents=docs, embedding=embeddings_model, connection=connection_credentials)

2. Define the metadata fields and the retriever:

In [2]:
fields = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]

# define retriever
description = "Brief summary of a movie"
llm = ChatDeepSeek(model="deepseek-chat", temperature=0)
retriever = SelfQueryRetriever.from_llm(llm=llm, vectorstore=vector_store, document_contents=description, metadata_field_info=fields)

3. Run the retriever (this example only specifies a filter):

In [3]:
print(retriever.invoke("I want to watch a movie rated higher than 8.5"))

[Document(id='471484e2-0833-4d8a-bb07-54048139f507', metadata={'year': 1979, 'genre': 'thriller', 'rating': 9.9, 'director': 'Andrei Tarkovsky'}, page_content='Three men walk into the Zone, three men walk out of the Zone'), Document(id='beb17e84-4b28-43ac-bfb4-e04aadb5e0d2', metadata={'year': 2006, 'rating': 8.6, 'director': 'Satoshi Kon'}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')]


This results in a retriever that takes a user query and splits it into:

* A filter to apply to the metadata of each document first
* A query to use for semantic search on the documents

To do this, we have to describe which fields the metadata of our documents contains; that description is included in the prompt. The retriever will then do the following:

1. Send the query generation prompt to the LLM.
2. Parse the metadata filter and the rewritten search query from the LLM output.
3. Convert the metadata filter generated by the LLM to the format appropriate for our vector store.
4. Issue a similarity search against the vector store, filtered to only match documents whose metadata passes the generated filter.
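
The four steps above can be sketched in plain Python. This is not how ```SelfQueryRetriever``` is implemented; it is a simplified stand-in under stated assumptions: `fake_llm` returns a canned JSON answer instead of calling a model, the JSON shape and comparator names are hypothetical, and word overlap replaces embedding similarity.

```python
import json
import operator

# Toy corpus: shortened summaries with a subset of the notebook's metadata.
MOVIES = [
    {"text": "Scientists bring back dinosaurs", "metadata": {"year": 1993, "rating": 7.7}},
    {"text": "Dreams within dreams", "metadata": {"year": 2006, "rating": 8.6}},
    {"text": "Toys come alive", "metadata": {"year": 1995}},
    {"text": "Three men walk into the Zone", "metadata": {"year": 1979, "rating": 9.9}},
]

# Hypothetical comparator names mapped to Python operators.
COMPARATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
}


def fake_llm(prompt):
    """Step 1 stand-in: a real retriever would send the prompt to an LLM."""
    return json.dumps(
        {"query": "", "filter": {"attribute": "rating", "comparator": "gt", "value": 8.5}}
    )


def parse_output(raw):
    """Step 2: pull the rewritten search query and the filter out of the LLM output."""
    data = json.loads(raw)
    return data["query"], data.get("filter")


def translate_filter(filter_spec):
    """Step 3: turn the generic filter into something the 'store' understands (a predicate)."""
    if filter_spec is None:
        return lambda meta: True
    compare = COMPARATORS[filter_spec["comparator"]]
    attribute, value = filter_spec["attribute"], filter_spec["value"]
    # Documents missing the attribute never match.
    return lambda meta: attribute in meta and compare(meta[attribute], value)


def search(query, predicate):
    """Step 4: filter on metadata, then rank (word overlap stands in for similarity)."""
    words = set(query.lower().split())
    matches = [movie for movie in MOVIES if predicate(movie["metadata"])]
    return sorted(
        matches,
        key=lambda movie: len(words & set(movie["text"].lower().split())),
        reverse=True,
    )


def self_query(user_query):
    prompt = f"Fields: year (integer), rating (float)\nQuestion: {user_query}"
    query, filter_spec = parse_output(fake_llm(prompt))
    return search(query, translate_filter(filter_spec))


results = self_query("I want to watch a movie rated higher than 8.5")
print([movie["text"] for movie in results])
```

As in the notebook run above, the question carries no semantic content beyond the rating constraint, so the search query is empty and the filter alone decides which documents come back.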

**NOTE:** Do not forget to remove the pgvector container when done using this notebook. Execute ```docker compose down --volumes``` in the terminal.