# RAG w/ contextual compression & content filtering

## Contextual Compression

When the corpus is large and queries vary in nature, a retrieved document/chunk often contains a lot of text that is irrelevant to the query at hand. Passing the full document/chunk to the LLM can lead to **more expensive LLM calls** and **poorer responses**.

!["Contextual compression"](../images/contextual-compression.webp)

The basic idea is to start with a regular **Retriever** (e.g. chroma_db.as_retriever()) and pass the retrieved documents/chunks through a **Compressor**, which filters them and extracts only what is needed for the query.

**The main goal of compressors** is to pass only the relevant information to the LLM by removing what is irrelevant.
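To make the retrieve-then-compress idea concrete, here is a minimal sketch in plain Python. It is not LangChain code: the keyword-overlap `retrieve`, the line-level `compress`, the stopword list and the toy glossary text are all assumptions made for illustration.

```python
import re
from typing import List, Set

# Hypothetical stopword list, just enough for this toy example.
STOPWORDS = {"what", "is", "the", "a", "of", "definition"}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def query_terms(query: str) -> Set[str]:
    """Query tokens with stopwords removed."""
    return tokenize(query) - STOPWORDS


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank chunks by the number of query terms they contain."""
    terms = query_terms(query)
    ranked = sorted(corpus, key=lambda chunk: len(terms & tokenize(chunk)), reverse=True)
    return ranked[:k]


def compress(query: str, chunks: List[str]) -> List[str]:
    """Toy compressor: keep only the lines of each chunk that mention a query term."""
    terms = query_terms(query)
    kept = []
    for chunk in chunks:
        lines = [line for line in chunk.splitlines() if terms & tokenize(line)]
        if lines:  # chunks with no relevant line are dropped entirely
            kept.append("\n".join(lines))
    return kept


corpus = [
    "homology: similarity due to shared ancestry.\n"
    "homotrimer: an assembly of three identical subunits.\n"
    "hydride ion: a hydrogen atom with an extra electron.",
    "acid: a molecule that donates a proton.\n"
    "activation energy: energy needed to reach the transition state.",
]
query = "What is the definition of homotrimer?"
retrieved = retrieve(query, corpus)
compressed = compress(query, retrieved)
print(compressed)
```

The retriever returns a whole chunk, while the compressor passes on only the line that matters, so the text sent to the LLM is shorter than the retrieved chunk.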

In [5]:

import os

# langchain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain_core.documents import Document
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    LLMChainExtractor, EmbeddingsFilter, DocumentCompressorPipeline
)
from langchain.chains import RetrievalQA
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_transformers import EmbeddingsRedundantFilter

# Azure OpenAI
from langchain_openai import AzureChatOpenAI


In [14]:
# Azure OpenAI settings (read from environment variables)
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
AZURE_OPENAI_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")

In [15]:
# load the data (for later compression)

data_path = "../data/rag-con-comp-data"
pdf_files = [f for f in os.listdir(data_path) if f.endswith('.pdf')]
data = [PyPDFLoader(os.path.join(data_path, file)).load() for file in pdf_files]

print("Total documents: ", len(data))
print(data[0][2].page_content)

Total documents:  1
Glossary: phase problem
Glossary 177 ©2004 New Science Press Ltdhomologous: describes genes or proteins related by
divergent evolution from a common ancestor.Homologous proteins, or homologs, will generally havesimilar sequences, structures and biochemical func-tions, although the sequence and/or functional similar-ity may be difficult to recognize. (4-1)
homology: the similarity seen between two gene or
protein sequences that are both derived by evolutionfrom a common ancestral sequence. (4-1)
homology modeling: a computational method for
modeling the structure of a protein based on itssequence similarity to one or more proteins of knownstructure. (4-6)
homotrimer: an assembly of three identical subunits:
in a protein, these are individual folded polypeptidechains. (1-19)
hydride ion: a hydrogen atom with an extra electron.
(2-9)
hydrogen bond: a noncovalent interaction between
the donor atom ,which is bound to a positively polar-
ized hydrogen atom, and the accept

In [16]:
# function for prettifying documents

def pretty_docs(docs):
    """Print the text of each document, separated by a horizontal rule."""
    separator = f"\n{'-' * 100}\n"
    print(separator.join(f"##### DOC {i + 1} #####\n\n{d.page_content}" for i, d in enumerate(docs)))


In [17]:
# init the Azure OpenAI chat model (any other chat model, incl. open-source ones, can be used instead)

oai = AzureChatOpenAI(
    openai_api_version=AZURE_OPENAI_VERSION,
    azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME,
)

## Chunking

### Sentence-based chunking w/ NLTK

In [20]:
docs_list = [item for sublist in data for item in sublist]
print("Total pages:", len(docs_list))

Total pages: 5


In [21]:
# sentence-aware chunking: NLTK splits the text into sentences, which are then merged into chunks
text_splitter = NLTKTextSplitter()
doc_chunks = text_splitter.split_documents(docs_list)

print("Total no. of chunks: ", len(doc_chunks))

Total no. of chunks:  13
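As a rough illustration of what a sentence-based splitter does, the sketch below splits text into sentences and then packs them back into chunks up to a size limit. It is a simplification, not NLTKTextSplitter's implementation: the regex sentence splitter stands in for NLTK's tokenizer, and `chunk_size` and `separator` are assumed parameter names chosen for this example.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(sentences: List[str], chunk_size: int = 45, separator: str = "\n\n") -> List[str]:
    """Greedily merge consecutive sentences into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if len(candidate) <= chunk_size or not current:
            # Either the sentence still fits, or it starts a new chunk
            # (a single over-long sentence is kept whole).
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


text = "Proteins fold. Enzymes catalyze reactions! Is water a base? Sometimes."
sentences = split_sentences(text)
chunks = pack_sentences(sentences, chunk_size=45)
for chunk in chunks:
    print(repr(chunk))
```

Because sentences are merged back together until the size limit is reached, a glossary page with many short definitions can still end up as a few large chunks, each holding many unrelated terms.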


## Basic Retriever

In [22]:
# embedding model

emb_model = SentenceTransformerEmbeddings(model_name="thenlper/gte-large")

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: thenlper/gte-large
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu


In [23]:
# init vector store

db = Chroma.from_documents(documents=doc_chunks, embedding=emb_model)

INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


In [24]:
# setup basic retriever

retriever = db.as_retriever(search_type="mmr")
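The retriever above uses `search_type="mmr"`, i.e. maximal marginal relevance, which trades relevance to the query against diversity among the selected chunks. The sketch below shows the greedy selection on toy 2-D vectors. It is an assumption-laden illustration, not the vector store's code: the function names are made up, and `lambda_mult` is used here as the name of the relevance/diversity weight.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
        k: int = 2, lambda_mult: float = 0.5) -> List[int]:
    """Greedy MMR: repeatedly pick the document that is relevant to the query
    but not too similar to the documents already selected."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3))  # favours diversity
```

With `lambda_mult=1.0` the selection is by relevance alone, so both near-duplicates are returned. A lower value penalises the second near-duplicate and lets the more distinct document in.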

In [25]:
# sample test from basic retriever WITHOUT compression

response = retriever.get_relevant_documents(query="What is the definition of homotrimer?")
pretty_docs(response)



##### DOC 1 #####

Glossary: phase problem
Glossary 177 ©2004 New Science Press Ltdhomologous: describes genes or proteins related by
divergent evolution from a common ancestor.Homologous proteins, or homologs, will generally havesimilar sequences, structures and biochemical func-tions, although the sequence and/or functional similar-ity may be difficult to recognize.

(4-1)
homology: the similarity seen between two gene or
protein sequences that are both derived by evolutionfrom a common ancestral sequence.

(4-1)
homology modeling: a computational method for
modeling the structure of a protein based on itssequence similarity to one or more proteins of knownstructure.

(4-6)
homotrimer: an assembly of three identical subunits:
in a protein, these are individual folded polypeptidechains.

(1-19)
hydride ion: a hydrogen atom with an extra electron.

(2-9)
hydrogen bond: a noncovalent interaction between
the donor atom ,which is bound to a positively polar-
ized hydrogen atom, and the ac

## Problem Definition

As we can see, the response contains not only the answer to the query **"What is the definition of homotrimer?"**, but also a lot of additional information that will be passed to the LLM for the "Generation" part (e.g. definitions for **homologous**, **homology modeling**, **hydride ion**, etc.).

## Document Compressor

### Option 1: LLMChainExtractor

In [26]:
# initialize compressor instance (from OpenAI)

compressor = LLMChainExtractor.from_llm(oai) # --> LLMChain for extraction of only relevant statements from each document 
compressor

# ignore UserWarning
from warnings import filterwarnings
filterwarnings("ignore", category=UserWarning)
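Conceptually, an LLM-based extractor sends each retrieved document to the model together with the query and keeps only what the model extracts. The sketch below imitates that loop with a stand-in model. Everything in it is an assumption for illustration: the prompt wording, the `NO_OUTPUT` sentinel and `fake_llm` are not the library's code.

```python
from typing import Callable, List

NO_OUTPUT = "NO_OUTPUT"  # sentinel the model is asked to return when nothing is relevant

# Illustrative prompt, not the prompt used by any library.
PROMPT = (
    "Extract, verbatim, the parts of the context that help answer the question. "
    "If nothing is relevant, return {no_output}.\n\n"
    "Question: {question}\nContext:\n{context}"
)


def extract_relevant(question: str, docs: List[str], llm: Callable[[str], str]) -> List[str]:
    """Call the model once per document and keep only the non-empty extractions."""
    compressed = []
    for doc in docs:
        prompt = PROMPT.format(no_output=NO_OUTPUT, question=question, context=doc)
        answer = llm(prompt).strip()
        if answer and answer != NO_OUTPUT:
            compressed.append(answer)
    return compressed


def fake_llm(prompt: str) -> str:
    """Stand-in for a chat model: returns the context lines that mention 'E-value'."""
    context = prompt.split("Context:\n", 1)[1]
    hits = [line for line in context.splitlines() if "e-value" in line.lower()]
    return "\n".join(hits) if hits else NO_OUTPUT


docs = [
    "E-value: probability that a match occurs by chance.\nenzyme: a biological catalyst.",
    "acid: a proton donor.",
]
print(extract_relevant("What is the role of E-value?", docs, fake_llm))
```

The design implies one model call per retrieved document, which is why this option gives the most focused context but is also the slowest and most expensive of the three.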

In [28]:
# compression retriever = base retriever + compressor

compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
response = compression_retriever.get_relevant_documents("What is the role of E-value?")
pretty_docs(response)

INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"


##### DOC 1 #####

E-value: the probability that an alignment score as good as the one found between two sequences wouldbe found in a comparison between two randomsequences; that is, the probability that such a matchwould occur by chance.


In [29]:
# test #1

query = "What is the role of E-value?"
chain = RetrievalQA.from_chain_type(llm=oai, 
                                    retriever=compression_retriever)

response_1 = chain.invoke(input=query)
print(response_1['result'])

INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"


The role of the E-value is to estimate the chance of an alignment between two sequences occurring by random chance. It's used in bioinformatics to assess the significance of a match between two sequences. The smaller the E-value, the more significant the match is considered to be, implying that it is less likely to have occurred by chance.


### Option 2: EmbeddingsFilter

**Cheaper** & **faster** option than LLMChainExtractor, since it makes no LLM calls: it embeds the **query** and each retrieved **document**, compares them by cosine similarity, and returns only the documents whose embeddings are sufficiently similar to the query.

- **_k_** = maximum number of most-similar documents to keep
- **_similarity_threshold_** = minimum query-document similarity a document needs in order to be kept
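The filtering logic described above can be sketched in a few lines of plain Python. This is a simplified illustration on toy 2-D vectors, assuming that the top-k cut is applied first and the threshold second. The function names are hypothetical and real embeddings would come from the embedding model.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
                      k: Optional[int] = None,
                      similarity_threshold: Optional[float] = None) -> List[int]:
    """Return indices of documents that survive the top-k and threshold filters,
    ordered from most to least similar to the query."""
    sims = [cosine(query_vec, vec) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    if k is not None:
        order = order[:k]  # keep at most k most-similar documents
    if similarity_threshold is not None:
        order = [i for i in order if sims[i] > similarity_threshold]
    return order


query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # similarities: 1.0, ~0.71, 0.0

print(embeddings_filter(query_vec, doc_vecs, k=10, similarity_threshold=0.6))
print(embeddings_filter(query_vec, doc_vecs, k=10, similarity_threshold=0.8))
```

Note that the filter keeps or drops whole documents. It never shortens a document, so a large chunk that clears the threshold is passed on unchanged.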

In [36]:
# initialize embeddings filter

emb_filter = EmbeddingsFilter(embeddings=emb_model,
                              k=10,
                              similarity_threshold=0.6)

In [37]:
# test #2

compression_retriever_filter = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=emb_filter)

response_2 = compression_retriever_filter.invoke("What is the role of E-value?")
pretty_docs(response_2)



##### DOC 1 #####

Glossary
Glossary 175 ©2004 New Science Press Ltdacid: a molecule or chemical group that donates a pro-
ton, either to water or to some other base.

(2-12)
acid-base catalysis: catalysis in which a proton is
transferred in going to or from the transition state.When the acid or base that abstracts or donates the
proton is derived directly from water (H
+or OH–) this is
called specific acid-base catalysis.

When the acid or
base is not H+or OH–,it is called general acid-base
catalysis.

Nearly all enzymatic acid-base catalysis is gen-
eral acid-base catalysis.

(2-12)
activation energy: the energy required to bring a
species in a chemical reaction from the ground state to
a state of higher free energy, in which it can transformspontaneously to another low-energy species.

(2-6)
activation-energy barrier: the higher-energy region
between two consecutive chemical species in a reaction.

(2-6)
activation loop: a stretch of polypeptide chain that
changes conformation when 

### Option 3: EmbeddingsRedundantFilter

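Drops redundant documents by comparing the **embeddings** of the retrieved documents with each other: when two documents are more similar than **_similarity_threshold_**, one of them is dropped. Unlike the EmbeddingsFilter, it does not look at the **query**, so it removes near-duplicates rather than irrelevant results.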
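A redundancy filter of this kind can be sketched as a greedy pass over document embeddings. The version below is a simplification under stated assumptions: it keeps the first document of any near-duplicate group, which may differ from the tie-breaking the library uses, and the function names and toy vectors are made up.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant(doc_vecs: List[Sequence[float]], similarity_threshold: float = 0.98) -> List[int]:
    """Return indices of documents to keep: a document is dropped when it is
    more similar than the threshold to a document that was already kept."""
    kept: List[int] = []
    for i, vec in enumerate(doc_vecs):
        if all(cosine(vec, doc_vecs[j]) <= similarity_threshold for j in kept):
            kept.append(i)
    return kept


doc_vecs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
print(drop_redundant(doc_vecs, similarity_threshold=0.98))
```

The query never enters the computation, so this filter on its own cannot remove a document that is unique but irrelevant.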

In [38]:
emb_redundant_filter = EmbeddingsRedundantFilter(embeddings=emb_model, similarity_threshold=0.98)

In [40]:
# test #3

doc_pipeline_compressor = DocumentCompressorPipeline(transformers=[emb_redundant_filter]) # doc compressor pipeline for removal of redundant results ONLY
compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=doc_pipeline_compressor)

response_3 = compression_retriever.get_relevant_documents("What is the role of E-value?")
pretty_docs(response_3)



##### DOC 1 #####

Glossary
Glossary 175 ©2004 New Science Press Ltdacid: a molecule or chemical group that donates a pro-
ton, either to water or to some other base.

(2-12)
acid-base catalysis: catalysis in which a proton is
transferred in going to or from the transition state.When the acid or base that abstracts or donates the
proton is derived directly from water (H
+or OH–) this is
called specific acid-base catalysis.

When the acid or
base is not H+or OH–,it is called general acid-base
catalysis.

Nearly all enzymatic acid-base catalysis is gen-
eral acid-base catalysis.

(2-12)
activation energy: the energy required to bring a
species in a chemical reaction from the ground state to
a state of higher free energy, in which it can transformspontaneously to another low-energy species.

(2-6)
activation-energy barrier: the higher-energy region
between two consecutive chemical species in a reaction.

(2-6)
activation loop: a stretch of polypeptide chain that
changes conformation when 

## Compression Pipeline

- **_emb_filter_** = relevant docs filter
- **_emb_redundant_filter_** = redundant docs filter

**TODOs**:
1. Add more filters to improve the results
2. Add filtering of the results with a manual, customizable compressor
3. Add a proper system prompt
4. Add more data cleaning steps (e.g. to split terms and definitions)
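The pipeline idea is simply function composition: each step receives the query and the documents left by the previous step. The sketch below chains a duplicate filter and a relevance filter in that order. Both filters are toy text-based stand-ins for the embedding-based ones, and the function names, stopword list and sample glossary lines are assumptions for illustration.

```python
from typing import Callable, List, Set

# A transformer takes (query, docs) and returns a possibly shorter list of docs.
Transformer = Callable[[str, List[str]], List[str]]

STOPWORDS = {"what", "is", "the", "a", "of"}  # hypothetical, minimal list


def words(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip("?.,:;!").lower() for w in text.split()}


def drop_duplicates(query: str, docs: List[str]) -> List[str]:
    """Redundancy step: drop documents whose normalised text was already seen."""
    seen, unique = set(), []
    for doc in docs:
        key = " ".join(doc.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def keep_relevant(query: str, docs: List[str]) -> List[str]:
    """Relevance step: keep documents that share a non-stopword with the query."""
    terms = words(query) - STOPWORDS
    return [doc for doc in docs if terms & words(doc)]


def run_pipeline(query: str, docs: List[str], transformers: List[Transformer]) -> List[str]:
    """Apply each transformer in order, feeding it the previous step's output."""
    for transform in transformers:
        docs = transform(query, docs)
    return docs


docs = [
    "lactose: a disaccharide sugar found in milk.",
    "Lactose: a disaccharide  sugar found in milk.",  # duplicate up to case/spacing
    "ligand: a molecule that binds a protein.",
]
result = run_pipeline("What is Lactose?", docs, [drop_duplicates, keep_relevant])
print(result)
```

Running the redundancy step first means the relevance step has fewer documents to score, although with these two filters the final set of documents would be the same in either order.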

In [51]:
# build compressor pipeline

comp_pipeline = DocumentCompressorPipeline(
    transformers=[emb_redundant_filter, emb_filter]
)

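# note: ContextualCompressionRetriever has no search_kwargs option; to change the number of
# retrieved chunks, set it on the base retriever, e.g. db.as_retriever(search_kwargs={"k": 5})
compressor_pipeline_retriever = ContextualCompressionRetriever(base_compressor=comp_pipeline,
                                                               base_retriever=retriever)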

user_query = "What is Lactose?"
final_response = compressor_pipeline_retriever.invoke(input=user_query)
pretty_docs(final_response)



##### DOC 1 #####

Glossary
Glossary 175 ©2004 New Science Press Ltdacid: a molecule or chemical group that donates a pro-
ton, either to water or to some other base.

(2-12)
acid-base catalysis: catalysis in which a proton is
transferred in going to or from the transition state.When the acid or base that abstracts or donates the
proton is derived directly from water (H
+or OH–) this is
called specific acid-base catalysis.

When the acid or
base is not H+or OH–,it is called general acid-base
catalysis.

Nearly all enzymatic acid-base catalysis is gen-
eral acid-base catalysis.

(2-12)
activation energy: the energy required to bring a
species in a chemical reaction from the ground state to
a state of higher free energy, in which it can transformspontaneously to another low-energy species.

(2-6)
activation-energy barrier: the higher-energy region
between two consecutive chemical species in a reaction.

(2-6)
activation loop: a stretch of polypeptide chain that
changes conformation when 

In [49]:
compressor_pipeline_retriever.invoke("E-value")



[_DocumentWithState(page_content='Glossary\nGlossary 175 ©2004 New Science Press Ltdacid: a molecule or chemical group that donates a pro-\nton, either to water or to some other base.\n\n(2-12)\nacid-base catalysis: catalysis in which a proton is\ntransferred in going to or from the transition state.When the acid or base that abstracts or donates the\nproton is derived directly from water (H\n+or OH–) this is\ncalled specific acid-base catalysis.\n\nWhen the acid or\nbase is not H+or OH–,it is called general acid-base\ncatalysis.\n\nNearly all enzymatic acid-base catalysis is gen-\neral acid-base catalysis.\n\n(2-12)\nactivation energy: the energy required to bring a\nspecies in a chemical reaction from the ground state to\na state of higher free energy, in which it can transformspontaneously to another low-energy species.\n\n(2-6)\nactivation-energy barrier: the higher-energy region\nbetween two consecutive chemical species in a reaction.\n\n(2-6)\nactivation loop: a stretch of polype

## Chain

In [47]:
qa_chain = RetrievalQA.from_chain_type(llm=oai, 
                                       chain_type='stuff', 
                                       retriever=compressor_pipeline_retriever)
query = "What is the role of E-value?"
response = qa_chain.invoke(query)
print('RESPONSE: ', response['result'])

INFO:httpx:HTTP Request: POST https://vy-test-oai-instance.openai.azure.com//openai/deployments/gpt4-32k-test-instance/chat/completions?api-version=2024-02-15-preview "HTTP/1.1 200 OK"


RESPONSE:  The E-value is the probability that an alignment score as good as the one found between two sequences would be found in a comparison between two random sequences; that is, the probability that such a match would occur by chance.
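The `chain_type='stuff'` setting used above means that all retrieved documents are placed ("stuffed") into a single prompt for one LLM call. A minimal sketch of that prompt assembly follows. The template wording and the function name are assumptions for illustration, not the library's actual prompt.

```python
from typing import List


def build_stuff_prompt(question: str, docs: List[str]) -> str:
    """Concatenate all documents into one context block and append the question."""
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = [
    "E-value: the probability that such a match would occur by chance.",
    "acid: a molecule or chemical group that donates a proton.",
]
prompt = build_stuff_prompt("What is the role of E-value?", docs)
print(prompt)
```

Since every document lands in the same prompt, whatever the compressor fails to remove is paid for in tokens and can distract the model, which is the motivation for compressing before this step.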
