# Contextual Compression in Document Retrieval
## Overview
This notebook demonstrates contextual compression in a document retrieval system built with LangChain and an LLM. The technique aims to improve the relevance and conciseness of retrieved information by extracting and compressing the most pertinent parts of each document in the context of a given query.
## Motivation
Traditional document retrieval systems often return entire chunks or documents, which may contain irrelevant information. Contextual compression addresses this by intelligently extracting and compressing only the most relevant parts of retrieved documents, leading to more focused and efficient information retrieval.
## Key Components
1. Vector store creation from a PDF document
2. Base retriever setup
3. LLM-based contextual compressor
4. Contextual compression retriever
5. Question-answering chain integrating the compressed retriever
## Method Detail
### Document Preprocessing and Vector Store Creation
The PDF is loaded, split into overlapping chunks, and encoded into a FAISS vector store by the `encode_pdf` function.
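To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is a simplification: `split_with_overlap` is a hypothetical helper that cuts fixed-size character windows, whereas the `RecursiveCharacterTextSplitter` used in `encode_pdf` prefers to break on natural separators such as paragraphs and sentences.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical, simplified stand-in for a real text splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Each chunk repeats the last two characters of the previous one.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.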
### Retriever and Compressor Setup
- A base retriever is created from the vector store
- An LLM-based contextual compressor (`LLMChainExtractor`) is initialized
### Contextual Compression Retriever
- The base retriever and compressor are combined into a ContextualCompressionRetriever
- The retriever first fetches documents using the base retriever, then applies the compressor to extract the most relevant information
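The two-stage flow described above can be sketched without any external services. This is a toy illustration under stated assumptions: word overlap stands in for vector similarity, a sentence filter stands in for the LLM-based extractor, and `Doc`, `base_retrieve`, `compress`, and `compressed_retrieve` are hypothetical names, not LangChain APIs.

```python
import re
from dataclasses import dataclass

# Very small stopword list so that common words do not count as matches.
STOPWORDS = {"a", "are", "in", "is", "of", "the", "was", "what", "when"}


@dataclass
class Doc:
    page_content: str


def _terms(text: str) -> set[str]:
    """Lowercased content words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def base_retrieve(query: str, docs: list[Doc], k: int = 2) -> list[Doc]:
    """Stage 1: return up to k whole chunks, ranked by words shared with the query."""
    q = _terms(query)
    scored = [(len(q & _terms(d.page_content)), d) for d in docs]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked[:k] if score > 0]


def compress(query: str, docs: list[Doc]) -> list[Doc]:
    """Stage 2: keep only sentences that share a word with the query."""
    q = _terms(query)
    compressed = []
    for d in docs:
        sentences = re.split(r"(?<=[.!?])\s+", d.page_content)
        kept = [s for s in sentences if q & _terms(s)]
        if kept:  # documents with nothing relevant are dropped entirely
            compressed.append(Doc(" ".join(kept)))
    return compressed


def compressed_retrieve(query: str, docs: list[Doc], k: int = 2) -> list[Doc]:
    """Fetch with the base retriever, then compress what came back."""
    return compress(query, base_retrieve(query, docs, k))


docs = [
    Doc("The Paris Agreement was adopted in 2015. Polar bears live in the Arctic."),
    Doc("Bananas are rich in potassium."),
]
for doc in compressed_retrieve("When was the Paris Agreement adopted?", docs):
    print(doc.page_content)
```

The point of the sketch is the ordering: compression only ever sees what the base retriever returned, so it can sharpen results but cannot recover a chunk that retrieval missed.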
### Question-Answering Chain
- A `RetrievalQA` chain is created, integrating the compression retriever
- This chain uses the compressed and extracted information to generate answers to queries
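As a rough picture of what the chain does with the compressed excerpts, the sketch below joins them into a single prompt, in the spirit of a "stuff"-style question-answering chain. The function name `build_stuffed_prompt` and the prompt wording are assumptions for illustration only; they are not the template LangChain actually uses.

```python
def build_stuffed_prompt(question: str, excerpts: list[str]) -> str:
    """Place all (non-empty) excerpts into one prompt ahead of the question.

    Hypothetical helper with illustrative wording.
    """
    context = "\n\n".join(e.strip() for e in excerpts if e.strip())
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuffed_prompt(
    "When was the Paris Agreement adopted?",
    ["The Paris Agreement was adopted in 2015."],
))
```

Because every excerpt ends up in the prompt, shorter excerpts translate directly into a shorter prompt for the answering model.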
## Benefits of this Approach
1. Improved relevance: The system returns only the passages most pertinent to the query, rather than whole chunks
2. Increased efficiency: By compressing and extracting relevant parts, it reduces the amount of text the LLM needs to process
3. Enhanced context understanding: The LLM-based compressor can understand the context of the query and extract information accordingly
4. Flexibility: The system can be easily adapted to different types of documents and queries
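One way to check the efficiency claim on your own data is to compare how much text the base retriever returns with how much survives compression. The sketch below assumes only that each document exposes a `page_content` string, as LangChain `Document` objects do; `compression_stats` is a hypothetical helper, and character count is used as a rough proxy for tokens.

```python
from types import SimpleNamespace


def compression_stats(original_docs, compressed_docs) -> dict:
    """Compare total characters before and after compression."""
    original_chars = sum(len(d.page_content) for d in original_docs)
    compressed_chars = sum(len(d.page_content) for d in compressed_docs)
    # Guard against division by zero when nothing was retrieved.
    saved = 1 - compressed_chars / original_chars if original_chars else 0.0
    return {
        "original_chars": original_chars,
        "compressed_chars": compressed_chars,
        "fraction_saved": saved,
    }


# Stand-in documents; in the notebook these would come from the base
# retriever and the compression retriever for the same query.
before = [SimpleNamespace(page_content="a" * 100), SimpleNamespace(page_content="b" * 100)]
after = [SimpleNamespace(page_content="a" * 50)]
print(compression_stats(before, after))
```

Note that the saving applies to the answering step; the compressor itself makes additional LLM calls over the retrieved chunks, so overall cost depends on both.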
## Conclusion
Contextual compression in document retrieval offers a powerful way to enhance the quality and efficiency of information retrieval systems. By intelligently extracting and compressing relevant information, it provides more focused and context-aware responses to queries. This approach has potential applications in various fields requiring efficient and accurate information retrieval from large document collections.

In [1]:
import os
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI
load_dotenv()
openai_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
openai_api_version = os.getenv("AZURE_API_VERSION")

llm = AzureChatOpenAI(
    azure_deployment=openai_deployment,
    api_version="2024-10-01-preview",
    azure_endpoint=f"{openai_endpoint}openai/deployments/{openai_deployment}/chat/completions?api-version=2024-10-01-preview",
    temperature=0,
    logprobs=True,
)

In [2]:
path = "./data/Understanding_Climate_Change.pdf"

In [3]:
openai_embedding = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID")

from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings
from helper_functions import PyPDFLoader, RecursiveCharacterTextSplitter, replace_t_with_space, FAISS

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using Azure OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()


    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len)
    texts = text_splitter.split_documents(documents)

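    # Clean the chunks with the helper from helper_functions
    # (judging by its name, it replaces tab characters with spaces)
    cleaned_texts = replace_t_with_space(texts)

    # Embed the chunks with the Azure OpenAI embedding deployment
    embeddings = AzureOpenAIEmbeddings(
        deployment=openai_embedding,
        model="text-embedding-ada-002",
        chunk_size=16
    )

    # Index the embedded chunks in a FAISS vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)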

    return vectorstore

In [4]:
vectorstore = encode_pdf(path)

In [5]:
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.chains import RetrievalQA

retriever = vectorstore.as_retriever()
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    return_source_documents=True
)

In [6]:
query = "What is the main topic of the document?"
result = qa_chain.invoke({"query": query})
print(result["result"])
print("Source documents:", result["source_documents"])

The main topic of the document is climate action, focusing on both global and local efforts to address climate change. It discusses international collaboration through frameworks like the United Nations Framework Convention on Climate Change (UNFCCC) and the Paris Agreement, as well as national strategies such as carbon pricing mechanisms.
Source documents: [Document(metadata={'source': './data/Understanding_Climate_Change.pdf', 'page': 9}, page_content='Chapter 6: Global and Local Climate Action  \nInternational Collaboration  \nUnited Nations Framework Convention on Climate Change (UNFCCC)  \nThe UNFCCC is an international treaty aimed at addressing climate change. It provides a \nframework for negotiating specific protocols and agreements, such as the Kyoto Protocol and \nthe Paris Agreement. Global cooperation under the UNFCCC is crucial for coor dinated \nclimate action.  \nParis Agreement  \nThe Paris Agreement, adopted in 2015, aims to limit global warming to well below 2 degree