<a href="https://colab.research.google.com/github/duper203/RAG_Techniques_with_upstage/blob/main/upstage/12_contextual_compression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contextual Compression in Document Retrieval






## Key Components

1. Vector store creation from a PDF document
2. Base retriever setup
3. LLM-based contextual compressor
4. Contextual compression retriever
5. Question-answering chain integrating the compressed retriever





## Method Details

1. Document Preprocessing and Vector Store Creation

  The PDF is processed and encoded into a vector store using a custom `encode_pdf` function.


2. Retriever and Compressor Setup

  1) A base retriever is created from the vector store.

  2) An LLM-based contextual compressor (`LLMChainExtractor`) is initialized using the Upstage Solar LLM.

3. Contextual Compression Retriever
  
  1) The base retriever and compressor are combined into a ContextualCompressionRetriever.

  2) This retriever first fetches documents using the base retriever, then applies the compressor to extract the most relevant information.

4. Question-Answering Chain

  1) A RetrievalQA chain is created, integrating the compression retriever.

  2) This chain uses the compressed and extracted information to generate answers to queries.
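
The retrieve-then-compress flow described above can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: `toy_base_retriever` ranks chunks by keyword overlap instead of vector similarity, and `toy_compressor` keeps sentences that share a word with the query instead of asking an LLM. It is not how `LLMChainExtractor` or `ContextualCompressionRetriever` are implemented; it only mirrors the order of the two steps.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_base_retriever(query, chunks, k=2):
    """Hypothetical base retriever: rank chunks by shared query words, keep the top k."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:k]


def toy_compressor(query, chunk):
    """Hypothetical compressor: keep only sentences sharing a word with the query."""
    query_words = tokenize(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if query_words & tokenize(s)]
    return " ".join(kept)


def toy_compression_retriever(query, chunks, k=2):
    """Retrieve first, then compress each retrieved chunk; drop chunks left empty."""
    retrieved = toy_base_retriever(query, chunks, k=k)
    compressed = [toy_compressor(query, c) for c in retrieved]
    return [c for c in compressed if c]


chunks = [
    "The Kyoto Protocol set emission targets. It was adopted in 1997. Weather varies daily.",
    "Solar panels convert sunlight into electricity. Wind turbines use moving air.",
    "Forests store carbon. Deforestation releases it.",
]
compressed_docs = toy_compression_retriever("kyoto protocol emission targets", chunks, k=2)
print(compressed_docs)
```

Two chunks are retrieved, but only the sentence relevant to the query survives compression, and the chunk with nothing relevant is dropped entirely. The LLM-based compressor plays the same role with a far better notion of relevance.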


In [12]:
! pip3 install -qU langchain-upstage langchain langchain-community faiss-cpu pypdf

In [4]:
import os
from google.colab import userdata
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.chains import RetrievalQA
from langchain_upstage import UpstageEmbeddings

os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")

## Define the document path

In [6]:
path = "data/Understanding_Climate_Change.pdf"

## Create a vector store

In [14]:
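# Loaders and vector stores live in langchain-community in recent LangChain releases
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS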

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using Upstage embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and vector store
    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore


def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents

In [15]:
vector_store = encode_pdf(path)
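
To make the `chunk_size` and `chunk_overlap` arguments of `encode_pdf` more concrete, here is a minimal sketch of fixed-size character windows with overlap. The function `sliding_window_chunks` is hypothetical and much simpler than `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences. It only shows how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

Each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.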

## Create a retriever and a contextual compressor, then combine them


In [21]:
from langchain_upstage import ChatUpstage

# Create a retriever
retriever = vector_store.as_retriever()


# Create a contextual compressor
llm = ChatUpstage(model="solar-pro")
compressor = LLMChainExtractor.from_llm(llm)

# Combine the retriever with the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

# Create a QA chain with the compressed retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    return_source_documents=True
)
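
To see what the compressor actually keeps, it can help to print the documents returned by the base retriever next to those returned by the compression retriever. The helper below is a sketch, not part of the original notebook: `format_docs` and `FakeDoc` are hypothetical names, and `FakeDoc` only stands in for a LangChain `Document` so the demo runs without an API key.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document, used only for this demo."""
    page_content: str


def format_docs(docs, max_chars=200):
    """Return a readable listing of documents, truncating long page contents."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        text = doc.page_content
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        lines.append(f"Document {i} ({len(doc.page_content)} chars): {text}")
    return "\n".join(lines)


demo = format_docs([FakeDoc("short text"), FakeDoc("x" * 300)], max_chars=20)
print(demo)

# With the retrievers defined above, a comparison might look like this
# (requires a valid UPSTAGE_API_KEY, so it is left commented out):
# question = "What is the main topic of the document?"
# print(format_docs(retriever.invoke(question)))
# print(format_docs(compression_retriever.invoke(question)))
```

If compression works as intended, the second listing should show shorter page contents than the first, and possibly fewer documents.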

## Example usage

In [22]:
query = "What is the main topic of the document?"
result = qa_chain.invoke({"query": query})
print(result["result"])
print("Source documents:", result["source_documents"])

The main topic of the document is global and local climate action.
Source documents: [Document(metadata={'source': 'data/Understanding_Climate_Change.pdf', 'page': 23}, page_content='The Kyoto Protocol, adopted in 1997, set binding emission reduction targets for developed countries. It was the first major international treaty to address climate change. The protocol laid the groundwork for subsequent agreements, highlighting the importance of collective action.  \n\nRegional and National Initiatives  \n\nEuropean Green Deal  \n\nThe European Green Deal is an ambitious plan to make Europe the first climate -neutral continent by 2050. It includes measures to reduce emissions, promote clean energy, and support sustainable agriculture and biodiversity. The deal also aims to create jobs  and  enhance economic resilience.'), Document(metadata={'source': 'data/Understanding_Climate_Change.pdf', 'page': 22}, page_content="negative emissions. The captured CO2 can be stored or used in various app