# How to retrieve using multiple vectors per document

# 1. Why Multiple Vectors Per Document?
When we work with large documents, a single vector (embedding) per document may not be enough, because different parts of the document can be relevant to different questions or contexts. Instead of representing the entire document with one vector, we can split it into smaller chunks and embed each chunk separately, so that one document is represented by multiple vectors. This helps with:

* Precision: Each chunk embedding captures one specific meaning instead of an average over the whole document.
* Recall: Even a very specific question can match a chunk and still link back to the larger document.
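
To make the precision point concrete, here is a minimal sketch that compares one vector for a whole document against one vector per chunk. It assumes a toy word-count embedding as a stand-in for a real embedding model; the `embed` and `cosine` helpers and the sample sentences are illustrative and not part of LangChain.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One document made of two unrelated chunks.
chunks = [
    "the court heard the appeal on monday",
    "the recipe needs flour butter and sugar",
]
document = " ".join(chunks)
query = embed("court appeal")

# One vector for the whole document vs. one vector per chunk.
whole_score = cosine(query, embed(document))
chunk_scores = [cosine(query, embed(chunk)) for chunk in chunks]

print("whole document:", round(whole_score, 3))
print("per chunk:", [round(score, 3) for score in chunk_scores])
```

With these toy inputs, the chunk about the court scores higher than the whole document, because the unrelated recipe text no longer dilutes the vector.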

# 2. How Does It Work?
LangChain provides a way to split documents into smaller pieces and associate them with the parent document. These smaller chunks can be embedded (converted into vectors), making it easier to search within a document based on different parts.

The three methods below each take a different approach:

a) Splitting into Smaller Chunks (ParentDocumentRetriever)

* Idea: Split a large document into smaller parts (like splitting a book into paragraphs).
* Usage: Embed each smaller chunk separately.
* Retrieval: The query is matched against the chunk embeddings, but the retriever returns the entire parent document that the matching chunk belongs to, so you keep the surrounding context.

Example:

Split a document into 400-character chunks and embed each one.
When you search for "justice breyer", it matches the relevant chunk but returns the full parent document.
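
The chunk-to-parent link can be sketched in a few lines of plain Python. This is a simplified illustration: it slices at fixed 400-character boundaries, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces. The function name `split_with_parent_id` and the dictionary layout are assumptions made for this example.

```python
import uuid


def split_with_parent_id(text, chunk_size=400, id_key="doc_id"):
    """Cut text into fixed-size chunks and tag each one with its parent's ID."""
    parent_id = str(uuid.uuid4())
    chunks = [
        {
            "page_content": text[start:start + chunk_size],
            "metadata": {id_key: parent_id},  # link back to the parent document
        }
        for start in range(0, len(text), chunk_size)
    ]
    return parent_id, chunks


parent_text = "x" * 1000  # placeholder for a real document
parent_id, chunks = split_with_parent_id(parent_text)

print(len(chunks), [len(chunk["page_content"]) for chunk in chunks])
```

Every chunk carries the same `doc_id`, which is what later allows a matching chunk to be traded for its full parent document.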

b) Summarizing Documents
* Idea: Instead of splitting the document into smaller pieces, create a summary for each document.
* Usage: Embed the summary and link it to the original document. The summary captures the gist of the content.
* Retrieval: The query is matched against the summary embeddings, and the retriever returns the full parent document linked to the matching summary.

Example:

Summarize each document using a language model.
Store and embed the summary. When you search, you first get the relevant summary, then pull the full document.
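
The split between what is searched and what is returned might look like the sketch below. It assumes a hypothetical `summarize` function that simply keeps the first sentence (in practice an LLM would write the summary) and uses word overlap in place of embedding similarity; the sample documents are made up for illustration.

```python
import re


def summarize(text):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def tokens(text):
    """Lowercase word set, used here instead of a real embedding."""
    return set(re.findall(r"\w+", text.lower()))


# Document store: parent ID -> full text.
docstore = {
    "doc-1": "Chroma stores embeddings. It also supports metadata filters and persistence.",
    "doc-2": "Text splitters cut documents. Overlap settings control how chunks share context.",
}

# Search index: only the summaries are indexed, each tagged with its parent ID.
summary_index = [(summarize(text), doc_id) for doc_id, text in docstore.items()]


def retrieve(query):
    """Match the query against summaries, then return the full parent text."""
    best_summary, doc_id = max(
        summary_index, key=lambda entry: len(tokens(query) & tokens(entry[0]))
    )
    return best_summary, docstore[doc_id]


summary, full_document = retrieve("where are embeddings stored")
print("matched summary:", summary)
print("returned document:", full_document)
```

The returned document contains details (such as persistence) that the summary never mentions, which is the point of keeping the parent in a separate store.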

c) Hypothetical Questions
* Idea: Generate possible questions that a document could answer.
* Usage: Embed these hypothetical questions and link each one to its document. This way, even if the query's wording doesn't appear in the document, a matching question can still retrieve it.
* Retrieval: Questions provide a wider net for catching semantically similar queries.

Example:

For a document about a Supreme Court judge, generate questions like "What impact did Judge X have on legal cases?"
When searching, the hypothetical questions guide the retrieval even if the exact wording doesn't appear.
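
A small sketch of this idea follows, under stated assumptions: the questions are hard-coded for illustration (an LLM would normally generate them), and word overlap stands in for embedding similarity. Several questions point at the same parent ID, so one document gets several entry points.

```python
import re


def tokens(text):
    """Lowercase word set, used here instead of a real embedding."""
    return set(re.findall(r"\w+", text.lower()))


# Document store: parent ID -> full text.
docstore = {
    "doc-1": "Justice Breyer is retiring after decades on the Supreme Court.",
    "doc-2": "The infrastructure law funds roads, bridges, and broadband.",
}

# Illustrative questions; in practice an LLM would generate these per document.
question_index = [
    ("Who is stepping down from the bench?", "doc-1"),
    ("Which judge announced retirement?", "doc-1"),
    ("What does the new law pay for?", "doc-2"),
    ("How will highways be repaired?", "doc-2"),
]


def retrieve(query):
    """Match the query against the questions, then return the linked document."""
    question, doc_id = max(
        question_index, key=lambda entry: len(tokens(query) & tokens(entry[0]))
    )
    return question, docstore[doc_id]


query = "which judge stepping down"
question, document = retrieve(query)
print("matched question:", question)
print("returned document:", document)
```

Note that the query shares no words with the returned document itself; the match happens entirely through the questions attached to it.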

# 3. Components Involved
Here's how we implement these ideas:

* Vector Store: Stores the vectors (embeddings), for example those of the smaller document chunks or summaries.
* Document Store: Holds the full, original parent documents, each under a unique identifier. It works like a database that links back to the original source.
* MultiVectorRetriever: Acts as a bridge between vector search and the document store. It searches the embeddings for relevant matches and then returns the larger parent documents they belong to.

# 4. Example Workflow
Here’s a simplified example to visualize the process:

* Split a document into smaller chunks (e.g., 400 characters each).
* Embed each chunk (convert it into a vector).
* Store the embeddings in a vector store.
* Store the full document with an identifier in the document store.
* When you query the retriever:
  * It finds the best matching chunk using the vector store.
  * It returns the larger parent document for more context.
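
The workflow above can be sketched end to end without any external services. This is a toy model, not LangChain's implementation: `ToyMultiVectorRetriever`, its word-count embedding, and the sentence-based splitting are all assumptions made for illustration. It does mirror one detail worth knowing: when several matching chunks share a parent, that parent is returned only once.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMultiVectorRetriever:
    def __init__(self, id_key="doc_id"):
        self.id_key = id_key
        self.vectors = []   # vector store: (embedding, metadata) per chunk
        self.docstore = {}  # document store: parent ID -> full text

    def add_parent(self, parent_id, text):
        """Store the full text, then embed each sentence as a child chunk."""
        self.docstore[parent_id] = text
        for chunk in re.split(r"(?<=\.)\s+", text):
            self.vectors.append((embed(chunk), {self.id_key: parent_id}))

    def invoke(self, query, k=4):
        """Find the top-k chunks, then return their parents without duplicates."""
        query_vector = embed(query)
        hits = sorted(
            self.vectors, key=lambda item: cosine(query_vector, item[0]), reverse=True
        )[:k]
        parent_ids = []
        for _, metadata in hits:
            parent_id = metadata[self.id_key]
            if parent_id not in parent_ids:  # keep first-seen order
                parent_ids.append(parent_id)
        return [self.docstore[parent_id] for parent_id in parent_ids]


retriever = ToyMultiVectorRetriever()
retriever.add_parent(
    "a", "Justice Breyer served on the Supreme Court. He announced his retirement this year."
)
retriever.add_parent(
    "b", "The bakery sells bread and cakes. It opens early every morning for commuters."
)

results = retriever.invoke("justice breyer retirement", k=2)
print(results)
```

Here both top chunks come from parent `a`, so the result list contains that document a single time rather than twice.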

# 5. Advanced Use-Cases
* Summaries can replace smaller chunks if you prefer to match against a concise view of each document.
* Hypothetical questions can guide searches when the query doesn’t directly match the document content.

# Key Takeaways:
* Precision: Smaller chunks give more focused embeddings, so matches are more specific.
* Recall: Linking each chunk to its parent document preserves the surrounding context.
* Flexibility: Summaries and hypothetical questions offer alternatives to exact matching.

# How LangChain Helps:
LangChain’s MultiVectorRetriever abstracts much of this complexity. You just need to:

* Choose how you want to split the document.
* Embed those parts (chunks, summaries, questions).
* Retrieve results with better precision and flexibility.

This setup ensures your retrieval is both contextually rich (full document context) and semantically accurate (specific parts matching the query).

%pip install --upgrade --quiet langchain-chroma langchain langchain-openai


import uuid
import getpass
import os
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Step 1: Load documents
# Load documents from text files using the TextLoader
loaders = [
    TextLoader(r"C:\Users\Admin\Desktop\10-20-2024\data\paul_graham_essay.txt", encoding="utf-8"),
    TextLoader(r"C:\Users\Admin\Desktop\10-20-2024\data\state_of_the_union.txt", encoding="utf-8"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Step 2: Split documents into larger chunks
# Split the documents into larger chunks of 10,000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

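# Step 3: Set up the vector store using OpenAI embeddings
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment, so prompt for it first if it is unset
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

# Initialize a Chroma vector store to store embeddings of child chunks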
vectorstore = Chroma(
    collection_name="full_documents", 
    embedding_function=OpenAIEmbeddings()
)

# Step 4: Set up the retriever, then create smaller chunks (for better semantic capture)
# The byte store holds the parent documents; the vector store holds the 400-character child chunks
store = InMemoryByteStore()  # Storage layer for parent documents
id_key = "doc_id"  # Metadata key that links each chunk to its parent document
retriever = MultiVectorRetriever(  # Initialize the retriever
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate unique IDs for each document
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Split each document into smaller sub-documents and associate them with a parent ID
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id  # Store the parent document ID in metadata
    sub_docs.extend(_sub_docs)

# Add the smaller chunks to the vector store and associate them with parent documents
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Step 5: Test retrieval
# Perform similarity search with the vector store to get smaller chunks
result = retriever.vectorstore.similarity_search("justice breyer")[0]
print("Retrieved Small Chunk:", result.page_content)

# Use the retriever to get the larger parent document
retrieved_docs = retriever.invoke("justice breyer")
print("Retrieved Full Document Length:", len(retrieved_docs[0].page_content))

# Step 6: Summary-based retrieval
# Use an LLM to summarize the documents and create embeddings based on summaries
if not os.environ.get("OPENAI_API_KEY"):  # Prompt only if the key is not already set
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

# Initialize the language model (OpenAI Chat model)
llm = ChatOpenAI(model="gpt-4o-mini")

# Create a chain to summarize documents using the LLM
chain = (
    {"doc": lambda x: x.page_content}  # Take the document content as input
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm  # Use the language model to generate a summary
    | StrOutputParser()  # Parse the output to get the summary string
)

# Generate summaries for all documents
summaries = chain.batch(docs, {"max_concurrency": 5})

# Initialize a new vector store for summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# Set up a new retriever for summaries
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Create Document objects from summaries and store them in the vector store
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Step 7: Hypothetical questions-based retrieval
# Generate hypothetical questions for documents using an LLM
from typing import List
from pydantic import BaseModel, Field

class HypotheticalQuestions(BaseModel):
    questions: List[str] = Field(..., description="List of questions")

# Create a chain to generate hypothetical questions
chain = (
    {"doc": lambda x: x.page_content}  # Input is the document content
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o").with_structured_output(
        HypotheticalQuestions
    )
    | (lambda x: x.questions)  # Extract the questions from the output
)

# Generate hypothetical questions for all documents
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

# Set up a vector store for hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", 
    embedding_function=OpenAIEmbeddings()
)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Create Document objects from hypothetical questions and store them
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Perform a similarity search using hypothetical questions
sub_docs = retriever.vectorstore.similarity_search("justice breyer")
print("Retrieved Hypothetical Questions:", [doc.page_content for doc in sub_docs])

# Use the retriever to get the larger source document
retrieved_docs = retriever.invoke("justice breyer")
print("Retrieved Full Document Length:", len(retrieved_docs[0].page_content))


Retrieved Small Chunk: Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
Retrieved Full Document Length: 9874
Retrieved Hypothetical Questions: ['What impact would appointing a highly qualified Supreme Court justice have on the judicial system and its future decisions?', 'How might comprehensive immigration reform, including a path to citizenship for Dreamers, affect the U.S. economy and society?', 'How could the trajectory of Y Combinator have changed if Sam Altman refused the offer to become president?', "How would the Bipartisan Infrastructure Law impact America's global economic competitiveness?"]
Retrieved Full Document Length: 9194


* Document Loading and Splitting:
  * The documents are loaded and split into chunks using RecursiveCharacterTextSplitter.
  * Smaller chunks are embedded while larger "parent" documents are retained.
* Retrieving with Smaller Chunks:
  * Embedding smaller chunks helps capture semantics better.
  * Retrieval returns larger parent documents associated with smaller chunks.
* Summary-based Retrieval:
  * Summarize documents using an LLM.
  * Use these summaries to embed and retrieve relevant documents.
* Hypothetical Questions-based Retrieval:
  * Generate questions that a document can answer using an LLM.
  * Embed these questions for retrieval, improving retrieval accuracy for specific queries.


Together, these three approaches let you match queries against small, focused embeddings (chunks, summaries, or hypothetical questions) while still returning the larger parent documents for context.






