# Indexing: Multi-representation Indexing
 
![multi-representation indexing](../images/images-multi-representation.png)

**Multi-Representation Indexing** is a technique that enhances information retrieval systems by creating multiple representations of each document. This approach captures various facets of the document's content, enabling the system to understand and retrieve information more accurately and contextually.

**How It Works:**

1. **Textual Analysis:**
   - Extracts keywords, named entities, or topics from the document.
   - *Example:* Identifying that a document contains terms like "machine learning," "algorithms," and "data analysis."

2. **Semantic Embeddings:**
   - Utilizes language models to capture the document's meaning in a numerical form.
   - *Example:* Representing the document's content as a vector that reflects its semantic context.

3. **Visual Features (if applicable):**
   - Processes images or diagrams within the document to extract visual information.
   - *Example:* Analyzing a chart in a research paper to understand its contribution to the document's content.

**Benefits:**

- **Improved Retrieval Accuracy:** By incorporating different representations, the system can capture various aspects of the document content, leading to more relevant results for diverse queries.

- **Contextual Understanding:** Multi-representation indexing enhances the system’s ability to understand the context in which terms are used. Semantic representations, such as embeddings from language models, can capture the nuances and relationships between terms, leading to more contextually relevant search results. 

- **Diverse Query Handling:** The system can effectively process and respond to various queries, including natural language questions, keyword searches, and structured queries.

- **Enhanced Flexibility:** Utilizing different representations allows the system to adapt to various document types, such as PDFs, web pages, and databases, as well as varying user needs.

- **Handling Complex Information:** Multi-representation indexing can be particularly helpful for documents containing complex information, like scientific papers or code, where textual analysis alone might not be sufficient. 

**Example in Practice:**

Imagine a search engine designed for academic research:

- **Document:** A research paper on "Advancements in Neural Networks."

- **Representations Created:**
  - **Textual Analysis:** Keywords like "neural networks," "deep learning," "AI."
  - **Semantic Embedding:** A vector capturing the paper's overall topic and context.
  - **Visual Features:** Analysis of included diagrams illustrating neural network architectures.

- **User Query:** "Latest AI techniques in deep learning."

- **Retrieval Process:**
  - The system matches the query against the multiple representations.
  - The semantic embedding recognizes the relevance to "Advancements in Neural Networks."
  - Textual analysis aligns with keywords like "AI" and "deep learning."

- **Result:** The research paper is retrieved as a top result due to the comprehensive indexing capturing its relevance from multiple angles.
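
The retrieval process above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production design: the bag-of-words `embed` function stands in for a real embedding model, and the helper names (`index_document`, `score`), the sample texts, and the equal weighting of the two representations are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

STOPWORDS = {"in", "on", "the", "a", "of", "and", "for", "to"}

def tokenize(text: str) -> list[str]:
    # Lowercase, keep alphabetic tokens, drop a few stopwords.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def embed(text: str) -> Counter:
    # Toy stand-in for a semantic embedding: a sparse term-count vector.
    return Counter(tokenize(text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def index_document(doc_id: str, text: str, keywords: list[str]) -> dict:
    # Two representations of the same document: a keyword set and a vector.
    return {"id": doc_id, "keywords": set(keywords), "vector": embed(text)}

def score(query: str, entry: dict) -> float:
    q_tokens = set(tokenize(query))
    kw_tokens = {t for k in entry["keywords"] for t in tokenize(k)}
    # Keyword representation: fraction of query tokens found in the keywords.
    keyword_score = len(q_tokens & kw_tokens) / len(q_tokens) if q_tokens else 0.0
    # Vector representation: cosine similarity between query and document vectors.
    vector_score = cosine(embed(query), entry["vector"])
    # Assumption: both representations are weighted equally.
    return 0.5 * keyword_score + 0.5 * vector_score

index = [
    index_document(
        "nn-paper",
        "Advancements in neural networks and deep learning for AI",
        ["neural networks", "deep learning", "AI"],
    ),
    index_document(
        "cooking",
        "Simple recipes for weeknight pasta dinners",
        ["pasta", "recipes"],
    ),
]

query = "Latest AI techniques in deep learning"
ranked = sorted(index, key=lambda entry: score(query, entry), reverse=True)
print(ranked[0]["id"])
```

Combining the scores with a fixed weight is only one option; a real system might instead query each representation separately and merge the ranked lists.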

By implementing multi-representation indexing, information retrieval systems can provide more accurate, contextually relevant, and user-tailored search results, enhancing the overall user experience. 

![Multi-representation Indexing](../images/Multi-representation%20Indexing.png)

arXiv paper:

- [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/pdf/2312.06648)

## Setup

In [1]:
%run "../Z - Common/setup.ipynb"


Stored 'enable_langsmith' (bool)


For this example we will use a `MultiVectorRetriever`. It searches document summaries stored in a vector store (Chroma) and returns the corresponding raw documents from an `InMemoryByteStore`.

Let's start by downloading and summarizing docs to seed the database with:

In [2]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Load the two blog posts that will seed the index
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

# Summarization chain; `llm` is assumed to be defined by the setup notebook above
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

# Summarize all documents, running at most 5 requests concurrently
summaries = chain.batch(docs, {"max_concurrency": 5})

Next, store the summaries in the vector store and the raw documents in the byte store.

In [3]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from uuid import uuid4

# The vector store that indexes the summaries; `embeddings` is assumed to come from the setup notebook
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=embeddings)

# The storage layer for the raw (parent) documents
byte_store = InMemoryByteStore()
id_key = "doc_id"

# The retriever: searches the summaries, returns the linked raw documents
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=byte_store,
    id_key=id_key,
)

# One unique ID per raw document
doc_ids = [str(uuid4()) for _ in docs]

# Summaries linked to their raw documents through `id_key` in the metadata
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Index the summaries and store the raw documents under the same IDs
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
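
The linking mechanism used above can be illustrated without any library. The following is a minimal conceptual sketch, not LangChain's implementation: a list of summaries plays the role of the vector store, a dictionary plays the role of the byte store, and a word-overlap count stands in for embedding similarity. The names `add`, `overlap`, and `retrieve`, as well as the placeholder texts, are hypothetical.

```python
from uuid import uuid4

ID_KEY = "doc_id"

# Full documents keyed by ID (plays the role of the byte store).
docstore: dict[str, str] = {}
# Searchable summaries, each carrying its parent ID (plays the role of the vector store).
summary_index: list[dict] = []

def add(full_text: str, summary: str) -> str:
    # Store the raw document and index its summary under the same ID.
    doc_id = str(uuid4())
    docstore[doc_id] = full_text
    summary_index.append({"text": summary, "metadata": {ID_KEY: doc_id}})
    return doc_id

def overlap(query: str, text: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, k: int = 1) -> list[str]:
    # 1. Search the summaries.
    hits = sorted(summary_index, key=lambda s: overlap(query, s["text"]), reverse=True)[:k]
    # 2. Collect the parent IDs, keeping order and dropping duplicates.
    ids: list[str] = []
    for hit in hits:
        doc_id = hit["metadata"][ID_KEY]
        if doc_id not in ids:
            ids.append(doc_id)
    # 3. Return the raw documents, not the summaries.
    return [docstore[doc_id] for doc_id in ids]

agents_id = add("<full text of the agents post>", "Agents use planning, memory and tool use")
add("<full text of the data quality post>", "Collecting high quality human annotated data")

result = retrieve("memory in agents")
print(result)
```

The point of the design is that the text being searched (a short summary) and the text being returned (the full document) are different objects connected only by a shared ID.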

Let's query both: first the vector store directly, which returns the matching summary, and then the retriever, which returns the raw document:

In [4]:
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Document(metadata={'doc_id': '0d710c7c-4080-4a9f-a785-d2fdf6eff68c'}, page_content="Here's a summary of the document on LLM-powered autonomous agents:\n\nKey Components:\n\n1. Planning\n- Task decomposition: Breaking complex tasks into manageable subgoals\n- Self-reflection: Ability to learn from mistakes and refine actions\n- Uses techniques like Chain of Thought (CoT) and Tree of Thoughts (ToT)\n\n2. Memory\n- Short-term memory: In-context learning with finite context window\n- Long-term memory: External vector store with fast retrieval\n- Uses Maximum Inner Product Search (MIPS) algorithms for efficient retrieval\n\n3. Tool Use\n- Enables LLMs to interact with external tools and APIs\n- Examples include calculators, search engines, code execution, etc.\n- Frameworks like MRKL and Toolformer help integrate tools\n\nNotable Case Studies:\n- ChemCrow: Domain-specific agent for chemistry tasks\n- Generative Agents: Simulation of 25 virtual characters interacting\n- AutoGPT and GPT-Engin

In [5]:
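# The result count is configured on the retriever (its `search_kwargs`), not passed to `invoke`.
# An `n_results` keyword here appears to be ignored: the warning below shows the default of 4 was requested.
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]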

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS"

When splitting documents for retrieval, two goals often conflict:

- You want documents small enough that their embeddings accurately reflect their meaning. If a document is too long, its embedding can lose meaning.
- You want documents long enough that the context of each chunk is retained.

An alternative to the approach we just implemented for multi-representation indexing is the [ParentDocumentRetriever](https://python.langchain.com/docs/how_to/parent_document_retriever/), which strikes that balance by splitting documents into small chunks and indexing those chunks. During retrieval, it first fetches the small chunks, then looks up their parent IDs and returns the larger parent documents.

Note that "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.
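
The small-chunk-to-parent lookup can be sketched in plain Python. This is a conceptual illustration only and does not use the `ParentDocumentRetriever` API: the word-based `split` function, the word-overlap score, the sample texts, and the chunk size of 4 words are all assumptions made for the example.

```python
def split(text: str, size: int) -> list[str]:
    # Split a text into chunks of at most `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Parent documents keyed by ID.
parents = {
    "agents": "Agents plan tasks and reflect on mistakes. "
              "Long-term memory uses a vector store for fast retrieval.",
    "data": "Human annotators label training data. "
            "Quality control relies on rater agreement and clear guidelines.",
}

# Index small chunks, each remembering the ID of its parent.
child_index = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in split(text, 4)
]

def overlap(query: str, text: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query: str, k: int = 2) -> list[str]:
    # 1. Fetch the k best-matching small chunks.
    hits = sorted(child_index, key=lambda c: overlap(query, c[0]), reverse=True)[:k]
    # 2. Map chunks to parent IDs, keeping order and dropping duplicates.
    seen: list[str] = []
    for _, parent_id in hits:
        if parent_id not in seen:
            seen.append(parent_id)
    # 3. Return the larger parent documents.
    return [parents[parent_id] for parent_id in seen]

found = retrieve_parents("vector store memory")
print(found)
```

Here the two best-matching chunks both come from the same parent, so a single parent document is returned. Matching happens on small, focused chunks, while the caller receives the full surrounding context.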

> **TODO**: Implement example using `ParentDocumentRetriever`