# Hierarchical Indices in Document Retrieval
## Overview
This code implements a hierarchical indexing system for document retrieval that uses two levels of encoding: document-level summaries and detailed chunks. The approach aims to improve the efficiency and relevance of retrieval by first identifying relevant document sections through their summaries, then drilling down to specific details within those sections.
## Motivation
Traditional flat indexing methods can struggle with large documents or corpora, potentially missing context or returning irrelevant information. Hierarchical indexing addresses this with a two-tier search: summaries narrow the search to the most relevant sections, and only the chunks from those sections are searched in detail.
## Key Components
- PDF processing and text chunking
- Asynchronous document summarization using an LLM
- Vector store creation for both summaries and detailed chunks using FAISS and embeddings
- Custom hierarchical retrieval function
## Benefits of this Approach
- Improved Retrieval Efficiency: By first searching summaries, the system can quickly identify relevant document sections without processing all detailed chunks
- Better Context Preservation: The hierarchical approach helps maintain the broader context of retrieved information
- Scalability: This method is particularly beneficial for large documents or corpora, where flat searching might be inefficient or miss important context
- Flexibility: The system allows for adjusting the number of summaries and chunks retrieved, enabling fine-tuning for different use cases
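
The efficiency and flexibility points above can be illustrated with a toy, dependency-free sketch. Everything in it is an assumption made for illustration: the `pages` corpus is invented, `embed` is a bag-of-words stand-in for a real embedding model, and `two_tier_search` is a hypothetical name rather than a function from this notebook. It only shows that the second tier scores the chunks of the selected pages instead of every chunk.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical corpus: page number -> (summary, detailed chunks)
pages = {
    0: ("greenhouse gases trap heat",
        ["carbon dioxide and methane trap heat in the atmosphere",
         "the greenhouse effect warms the surface"]),
    1: ("renewable energy sources",
        ["solar panels convert sunlight to electricity",
         "wind turbines generate power"]),
    2: ("sea level rise",
        ["melting ice sheets raise sea level",
         "coastal cities face flooding"]),
}


def two_tier_search(query, k_summaries=1, k_chunks=1):
    """Return (results, number_of_chunks_scored) for a two-tier search."""
    q = embed(query)
    # Tier 1: rank pages by how similar their summary is to the query
    top_pages = sorted(pages, key=lambda p: cosine(q, embed(pages[p][0])), reverse=True)[:k_summaries]
    results, scored = [], 0
    for p in top_pages:
        chunks = pages[p][1]
        scored += len(chunks)  # Tier 2 only touches chunks of the selected pages
        ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k_chunks]
        results.extend((p, c) for c in ranked)
    return results, scored


results, scored = two_tier_search("greenhouse effect heat")
print(results, f"(scored {scored} of 6 chunks)")
```

Raising `k_summaries` widens the set of pages searched in detail, and raising `k_chunks` returns more chunks per page, which is the tuning knob described in the last bullet.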
## Conclusion
Hierarchical indexing is well suited to large or complex document sets. By combining high-level summaries with detailed chunks, it balances broad context understanding against specific information retrieval.

In [1]:
import os
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI
load_dotenv()
openai_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
openai_api_version = os.getenv("AZURE_API_VERSION")

llm = AzureChatOpenAI(
    azure_deployment=openai_deployment,
    # Use the version from the environment, falling back to the previously hard-coded value
    api_version=openai_api_version or "2024-10-01-preview",
    # Pass only the resource base URL; the client appends the deployment path and api-version itself
    azure_endpoint=openai_endpoint,
    temperature=0,
    logprobs=True,
)

In [2]:
from langchain.chains.summarize.chain import load_summarize_chain
from langchain.docstore.document import Document
path = "./data/Understanding_Climate_Change.pdf"

In [4]:
from helper_functions import PyPDFLoader, RecursiveCharacterTextSplitter, retry_with_exponential_backoff, FAISS
import asyncio
from langchain_openai.embeddings import AzureOpenAIEmbeddings
openai_embedding = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID")
async def encode_pdf_hierarchical(path, chunk_size=1000, chunk_overlap=200, is_string=False):
    """
    Asynchronously encodes a PDF book into a hierarchical vector store using embeddings.
    Includes rate limit handling with exponential backoff.
    
    Args:
        path: The path to the PDF file, or the raw text itself when is_string is True.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.
        is_string: If True, treat path as raw text instead of a file path.
        
    Returns:
        A tuple containing two FAISS vector stores:
        1. Document-level summaries
        2. Detailed chunks
    """
    if not is_string:
        # PyPDFLoader yields one Document per page, with the page number in its metadata
        loader = PyPDFLoader(path)
        documents = await asyncio.to_thread(loader.load)
    else:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len, is_separator_regex=False,
        )
        documents = text_splitter.create_documents([path])

    summary_chain = load_summarize_chain(llm=llm, chain_type="map_reduce")

    async def summarize_document(document):
        """
        Summarizes a single document with rate limit handling.
        
        Args:
            document: The document to be summarized.
            
        Returns:
            A summarized Document object.
        """
        summary_output = await retry_with_exponential_backoff(summary_chain.ainvoke([document]))
        summary = summary_output["output_text"]
        return Document(
            page_content=summary,
            # Documents built from a raw string carry no page number, so default to 0
            metadata={"source": path, "page": document.metadata.get("page", 0), "summary": True},
        )
    
    # Summarize in small batches, pausing between them to stay under the rate limit
    batch_size = 5
    summaries = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_summaries = await asyncio.gather(*[summarize_document(doc) for doc in batch])
        summaries.extend(batch_summaries)
        await asyncio.sleep(1)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    detailed_chunks = await asyncio.to_thread(text_splitter.split_documents, documents)

    # Tag each chunk with its page so it can be matched to a summary at retrieval time
    for i, chunk in enumerate(detailed_chunks):
        chunk.metadata.update({"chunk_id": i, "summary": False, "page": int(chunk.metadata.get("page", 0))})

    embeddings = AzureOpenAIEmbeddings(
        deployment=openai_embedding,
        model="text-embedding-ada-002",
        chunk_size=16
    )

    async def create_vectorstore(docs):
        """
        Creates a vector store from a list of documents with rate limit handling.
        
        Args:
            docs: The list of documents to be embedded.
            
        Returns:
            A FAISS vector store containing the embedded documents.
        """
        return await retry_with_exponential_backoff(
            asyncio.to_thread(FAISS.from_documents, docs, embeddings)
        )
    
    summary_vectorstore, detailed_vectorstore = await asyncio.gather(
        create_vectorstore(summaries),
        create_vectorstore(detailed_chunks),
    )

    return summary_vectorstore, detailed_vectorstore
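
The retry helper used in the cell above comes from `helper_functions` and its body is not shown in this notebook. As a rough idea of what exponential backoff looks like, here is a minimal, self-contained sketch. The name `retry_async` and its parameters (`make_coro`, `max_retries`, `base_delay`, `retry_on`) are assumptions for illustration and are not the helper's actual signature. Unlike the call pattern in the cell above, this sketch takes a zero-argument callable that creates a fresh coroutine on each attempt, because a coroutine object can only be awaited once.

```python
import asyncio
import random


async def retry_async(make_coro, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Await make_coro(), retrying with exponential backoff plus jitter on failure.

    Hypothetical helper for illustration; not the function from helper_functions.
    """
    for attempt in range(max_retries):
        try:
            # Build a new coroutine for every attempt
            return await make_coro()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error
            # Wait base_delay * 2^attempt, plus random jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

With a helper shaped like this, the summarization call would be wrapped as `await retry_async(lambda: summary_chain.ainvoke([document]))`. In practice `retry_on` would be narrowed to the client's rate-limit exception rather than catching every `Exception`.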

In [None]:
if os.path.exists("./vector_stores/summary_store") and os.path.exists("./vector_stores/detailed_store"):
    embeddings = AzureOpenAIEmbeddings(
        deployment=openai_embedding,
        model="text-embedding-ada-002",
        chunk_size=16
    )
    summary_store = FAISS.load_local("./vector_stores/summary_store", embeddings, allow_dangerous_deserialization=True)
    detailed_store = FAISS.load_local("./vector_stores/detailed_store", embeddings, allow_dangerous_deserialization=True)

else:
    summary_store, detailed_store = await encode_pdf_hierarchical(path)
    summary_store.save_local("./vector_stores/summary_store")
    detailed_store.save_local("./vector_stores/detailed_store")

In [None]:
def retrieve_hierarchical(query, summary_vectorstore, detailed_vectorstore, k_summaries=3, k_chunks=5):
    """
    Performs a hierarchical retrieval using the query.

    Args:
        query: The search query.
        summary_vectorstore: The vector store containing document summaries.
        detailed_vectorstore: The vector store containing detailed chunks.
        k_summaries: The number of top summaries to retrieve.
        k_chunks: The number of detailed chunks to retrieve per summary.

    Returns:
        A list of relevant detailed chunks.
    """
    
    # Retrieve top summaries
    top_summaries = summary_vectorstore.similarity_search(query, k=k_summaries)
    
    relevant_chunks = []
    for summary in top_summaries:
        # For each summary, retrieve relevant detailed chunks
        page_number = summary.metadata["page"]
        page_filter = lambda metadata: metadata["page"] == page_number
        page_chunks = detailed_vectorstore.similarity_search(
            query, 
            k=k_chunks, 
            filter=page_filter
        )
        relevant_chunks.extend(page_chunks)
    
    return relevant_chunks
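
One possible variation on the function above, sketched under a few assumptions: it uses `similarity_search_with_score` so that chunks from different pages can be ordered globally, and it passes a larger `fetch_k`, since LangChain's FAISS store applies the metadata filter only to the `fetch_k` nearest candidates (20 by default) and can therefore return fewer than `k_chunks` hits for a page. The name `retrieve_hierarchical_ranked` and the default `fetch_k=100` are illustrative choices, not part of the original notebook. The sort assumes the store's default L2 distance, where a lower score means a closer match.

```python
def retrieve_hierarchical_ranked(query, summary_vectorstore, detailed_vectorstore,
                                 k_summaries=3, k_chunks=5, fetch_k=100):
    """Hierarchical retrieval that returns (chunk, score) pairs sorted by score.

    Illustrative variant; assumes lower scores mean closer matches (L2 distance).
    """
    top_summaries = summary_vectorstore.similarity_search(query, k=k_summaries)

    scored_chunks = []
    seen_pages = set()
    for summary in top_summaries:
        page_number = summary.metadata["page"]
        if page_number in seen_pages:
            continue  # Skip a page that was already searched
        seen_pages.add(page_number)
        # Dict filter: keep only chunks whose metadata has this page number
        hits = detailed_vectorstore.similarity_search_with_score(
            query, k=k_chunks, filter={"page": page_number}, fetch_k=fetch_k
        )
        scored_chunks.extend(hits)

    # Order all chunks across pages, closest first
    scored_chunks.sort(key=lambda pair: pair[1])
    return scored_chunks
```

If the index was built with a different distance strategy, for example one where higher scores are better, the sort order would need to be reversed.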

In [None]:
query = "What is the greenhouse effect?"
results = retrieve_hierarchical(query, summary_store, detailed_store)

# Print results
for chunk in results:
    print(f"Page: {chunk.metadata['page']}")
    print(f"Content: {chunk.page_content[:100]}...")  # Print first 100 characters
    print("---")