# RAG w/ contextual compression & content filtering

### Contextual Compression

When the corpus is large and queries vary widely, a retriever usually returns whole documents/chunks that contain a lot of text irrelevant to the query. Passing those full documents/chunks to the LLM can lead to **more expensive LLM calls** and **poorer responses**.

!["Contextual compression"](../images/contextual-compression.webp)

The basic idea is to take a base **Retriever** (e.g. `chroma_db.as_retriever()`) and pass the documents/chunks it retrieves to a **Compressor**, which filters them and extracts only what's needed.

**The main goal of compressors** is to pass only the relevant information to the LLM by removing whatever is irrelevant to the query.
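To make the retrieve-then-compress idea concrete, here is a minimal, library-free sketch. It is not how LangChain implements it: the `retrieve` and `compress` helpers, the word-overlap scoring, and the stop-word list are all assumptions made for illustration.

```python
import re

def words(text):
    """Lowercased word set of a text."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, chunks, top_n=2):
    """Toy base retriever: rank chunks by how many words they share with the query."""
    return sorted(chunks, key=lambda c: len(words(query) & words(c)), reverse=True)[:top_n]

def compress(query, chunk, stop_words=frozenset({"what", "is", "the", "a", "of"})):
    """Toy compressor: keep only the sentences that mention a non-trivial query word."""
    q_words = words(query) - stop_words
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return " ".join(s for s in sentences if q_words & words(s))

chunks = [
    "Lactose is known as milk sugar. Starch is the main storage polysaccharide of plants.",
    "DNA fingerprinting is used in forensic laboratories.",
]
query = "What is lactose?"

compressed = [compress(query, c) for c in retrieve(query, chunks)]
compressed = [c for c in compressed if c]  # drop chunks with nothing relevant left
print(compressed)
```

Both chunks are retrieved, but only the one sentence about lactose survives compression, so the LLM would receive far less text.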

In [2]:

import os

# langchain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain_core.documents import Document
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    LLMChainExtractor, EmbeddingsFilter, DocumentCompressorPipeline
)
from langchain.chains import RetrievalQA
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_transformers import EmbeddingsRedundantFilter

# Azure OpenAI
from langchain_openai import AzureChatOpenAI


In [3]:
# Azure OpenAI settings (read from environment variables)

AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.environ.get('AZURE_OPENAI_ENDPOINT')
AZURE_OPENAI_VERSION = os.environ.get('AZURE_OPENAI_VERSION')
AZURE_OPENAI_DEPLOYMENT_NAME = os.environ.get('AZURE_OPENAI_DEPLOYMENT_NAME')

In [5]:
# load the data (for later compression)

data_path = "../data/rag-con-comp-data"
pdf_files = [f for f in os.listdir(data_path) if f.endswith('.pdf')]
data = [PyPDFLoader(os.path.join(data_path, file)).load() for file in pdf_files]

print("Total documents: ", len(data))
print(data[0][2].page_content[:150]) # first 150 char

Total documents:  1
283 Biomolecules+H
12 22 11 2 6 12 6 6 12 6 C H O H O C H O + C H O +  →
      Sucrose    Glucose  Fructose
2.From starch : Commercially glucose is


In [6]:
# function for prettifying documents

def pretty_docs(docs):
    print(f"\n{'-'* 100}\n".join([F"##### DOC {i+1} #####\n\n" + d.page_content for i,d in enumerate(docs)]))


In [7]:
# init the Azure OpenAI chat model (any other chat model, e.g. an open-source one, would work too)

oai = AzureChatOpenAI(
    openai_api_version=AZURE_OPENAI_VERSION,
    azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME,
)

## Chunking

### Sentence-based chunking w/ NLTK

In [8]:
docs_list = [item for sublist in data for item in sublist]
print("Total pages:", len(docs_list))

Total pages: 22


In [9]:
# chunking on sentence boundaries found by NLTK (works on prose, not on lists)
text_splitter = NLTKTextSplitter()
doc_chunks = text_splitter.split_documents(docs_list)

print("Total no. of chunks: ", len(doc_chunks))

Total no. of chunks:  22
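As a rough illustration of sentence-based chunking, the sketch below splits text into sentences and packs whole sentences into chunks up to a size limit. The regex splitter is a stand-in for NLTK's sentence tokenizer, and `sentence_chunks`, its `chunk_size` and its `separator` are hypothetical names chosen for this example. If the real splitter packs sentences in a similar way with a large default chunk size (an assumption), that would be consistent with 22 pages yielding 22 chunks above.

```python
import re

def sentence_chunks(text, chunk_size=120, separator="\n\n"):
    """Split text into sentences, then pack whole sentences into chunks of at most chunk_size chars."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("Starch is the main storage polysaccharide of plants. "
        "It is the most important dietary source for human beings. "
        "High content of starch is found in cereals.")
chunks = sentence_chunks(text, chunk_size=120)
for chunk in chunks:
    print(repr(chunk))
```

A sentence is never cut in half: a chunk is closed as soon as the next sentence would push it over the limit.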


## Basic Retriever

In [10]:
# embedding model

emb_model = SentenceTransformerEmbeddings(model_name="thenlper/gte-large")

In [11]:
# init vector store

db = Chroma.from_documents(documents=doc_chunks, embedding=emb_model)

In [12]:
# setup basic retriever

retriever = db.as_retriever(search_type="mmr")
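The `"mmr"` search type stands for maximal marginal relevance, which trades off relevance to the query against similarity to documents already selected. The pure-Python sketch below illustrates the selection rule on toy 2-D vectors; the function names, `lambda_mult=0.5`, and the vectors are assumptions for illustration, not the vector store's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing picked yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]  # docs 0 and 1 are near-duplicates
print(mmr(query_vec, doc_vecs, k=2))
```

With `lambda_mult=1.0` the redundancy term vanishes and the two near-duplicates would be returned together; lower values favour diversity.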

In [22]:
# sample test from basic retriever WITHOUT compression

response = retriever.get_relevant_documents(query="What is the definition of homotrimer?")
pretty_docs(response)

##### DOC 1 #####

294 ChemistryFig.

10.3: Diagrammatic representation of protein structure (two sub-units
of two types in quaternary structure)A diagrammatic representation of all these four structures is
given in Figure 10.3 where each coloured ball represents an
amino acid.

Fig.

10.4:  Primary,
secondary, tertiary
and quaternary
structures of
haemoglobin
Protein found in a biological system with a unique three-dimensional
structure and biological activity is called a native protein.

When a
protein in its native form, is subjected to physical change like change
in temperature or chemical change like change in pH, the hydrogen
bonds are disturbed.

Due to this, globules unfold and helix get uncoiled
and protein loses its biological activity.

This is called denaturation  of10.2.4
Denaturation of
Proteins
Rationalised 2023-24
----------------------------------------------------------------------------------------------------
##### DOC 2 #####

292 ChemistryAmino acids are usually c

## Problem Definition

As we can see, the chunks retrieved for the query **"What is the definition of homotrimer?"** are returned in full: they contain a lot of text that does not answer the query (e.g. figure captions and the passage on protein **denaturation**), and all of it would be passed to the LLM for the "Generation" part.

## Document Compressor

### Option 1: LLMChainExtractor

In [23]:
# initialize compressor instance (from OpenAI)

compressor = LLMChainExtractor.from_llm(oai) # --> LLMChain for extraction of only relevant statements from each document 
compressor

# ignore UserWarning
from warnings import filterwarnings
filterwarnings("ignore", category=UserWarning)

In [24]:
# compression retriever = base retriever + compressor

compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
response = compression_retriever.get_relevant_documents("What is the role of E-value?")
pretty_docs(response)




In [25]:
# test #1

query = "What is the role of E-value?"
chain = RetrievalQA.from_chain_type(llm=oai, 
                                    retriever=compression_retriever)

response_1 = chain.invoke(input=query)
print(response_1['result'])

The E-value, or Expect value, plays a significant role in bioinformatics, particularly in sequence alignment. It is used to estimate the number of random hits one can "expect" to see just by chance given the size of the database. A lower E-value indicates a more significant match, meaning the match is less likely to be due to random chance. It helps in determining the statistical significance of a match.


### Option 2: EmbeddingsFilter

**Cheaper** & **faster** option than LLMChainExtractor, since it makes no LLM calls: it computes the cosine similarity between the **query** embedding and each **document** embedding, and returns only those documents whose embeddings are sufficiently similar to the query.

- **_k_** = maximum number of relevant documents to return
- **_similarity_threshold_** = minimum similarity to the query that a document needs in order to be kept
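The filtering rule can be sketched without any library as follows. This is a simplified illustration on toy 2-D vectors, not LangChain's implementation; `embeddings_filter` and the sample vectors are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embeddings_filter(query_vec, doc_vecs, k=None, similarity_threshold=None):
    """Return indices of the documents to keep, most similar to the query first."""
    scored = sorted(((cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)), reverse=True)
    if k is not None:
        scored = scored[:k]                      # keep at most k documents
    if similarity_threshold is not None:
        scored = [(s, i) for s, i in scored if s > similarity_threshold]
    return [i for _, i in scored]

query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
kept = embeddings_filter(query_vec, doc_vecs, k=10, similarity_threshold=0.6)
print(kept)
```

Document 1 points in a very different direction from the query, so it falls below the 0.6 threshold and is dropped even though `k=10` would allow it.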

In [15]:
# initialize embeddings filter

emb_filter = EmbeddingsFilter(embeddings=emb_model,
                              k=10,
                              similarity_threshold=0.6)

In [16]:
# test #2

compression_retriever_filter = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=emb_filter)

response_2 = compression_retriever_filter.invoke("What is the role of E-value?")
pretty_docs(response_2)

##### DOC 1 #####

300 Chemistry
Har Gobind Khorana
DNA Fingerprinting
It is known that every individual has unique fingerprints.

These occur at the tips of
the fingers and have been used for identification for a long time but these can be
altered by surgery.

A sequence of bases on DNA is also unique for a person and
information regarding this is called DNA fingerprinting.

It is same for every cell and
cannot be altered by any known treatment.

DNA fingerprinting is now used
(i)in forensic laboratories for identification of criminals.

(ii)to determine paternity of an individual.

(iii)to identify the dead bodies in any accident by comparing the DNA’s of parents or
children.

(iv)to identify racial groups to rewrite biological evolution.

DNA is the chemical basis of heredity and may be regarded as the reserve
of genetic information.

DNA is exclusively responsible for maintaining
the identity of different species of organisms over millions of years.

A
DNA molecule is capable of se

### Option 3: EmbeddingsRedundantFilter

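Removes/drops redundant documents by comparing the documents' **embeddings** with each other and dropping those that are near-duplicates of another document. It complements the EmbeddingsFilter, which compares documents against the **query** instead.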
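A simplified way to picture redundancy filtering is a greedy pass that keeps a document only if it is not too similar to any document already kept. The sketch below uses that greedy rule on toy vectors; the real filter may pick which duplicate to drop differently, and `drop_redundant` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def drop_redundant(doc_vecs, similarity_threshold=0.98):
    """Return indices of documents to keep; a doc is dropped if it nearly duplicates a kept one."""
    kept = []
    for i, vec in enumerate(doc_vecs):
        if all(cosine(vec, doc_vecs[j]) <= similarity_threshold for j in kept):
            kept.append(i)
    return kept

doc_vecs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
print(drop_redundant(doc_vecs))
```

Note that the query never appears here: only document-to-document similarity matters, which is why a high threshold such as 0.98 removes near-copies but leaves merely related documents in place.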

In [17]:
emb_redundant_filter = EmbeddingsRedundantFilter(embeddings=emb_model, similarity_threshold=0.98)

In [18]:
# test #3

# doc compressor pipeline for removal of redundant results ONLY
doc_pipeline_compressor = DocumentCompressorPipeline(transformers=[emb_redundant_filter])
compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=doc_pipeline_compressor)

response_3 = compression_retriever.get_relevant_documents("What is the role of E-value?")
pretty_docs(response_3)

##### DOC 1 #####

300 Chemistry
Har Gobind Khorana
DNA Fingerprinting
It is known that every individual has unique fingerprints.

These occur at the tips of
the fingers and have been used for identification for a long time but these can be
altered by surgery.

A sequence of bases on DNA is also unique for a person and
information regarding this is called DNA fingerprinting.

It is same for every cell and
cannot be altered by any known treatment.

DNA fingerprinting is now used
(i)in forensic laboratories for identification of criminals.

(ii)to determine paternity of an individual.

(iii)to identify the dead bodies in any accident by comparing the DNA’s of parents or
children.

(iv)to identify racial groups to rewrite biological evolution.

DNA is the chemical basis of heredity and may be regarded as the reserve
of genetic information.

DNA is exclusively responsible for maintaining
the identity of different species of organisms over millions of years.

A
DNA molecule is capable of se

## Compression Pipeline

- **_emb_filter_** = relevant docs filter
- **_emb_redundant_filter_** = redundant docs filter

**TODOs**:
1. Add more filters to improve the results
2. Add filtering of the results via a manual, customizable compression step
3. Add a proper system prompt
4. Add more data cleaning steps (e.g. to split terms and definitions)
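Conceptually, a compression pipeline just runs its stages in order, each one receiving the documents that survived the previous stage. The sketch below shows that composition with two toy stages; `run_pipeline`, `drop_duplicates` and `keep_relevant` are hypothetical stand-ins for the redundancy and relevance filters, not LangChain classes.

```python
def drop_duplicates(docs, query):
    """Stage 1 (toy redundancy filter): remove exact duplicate texts, keeping first occurrences."""
    return list(dict.fromkeys(docs))

def keep_relevant(docs, query):
    """Stage 2 (toy relevance filter): keep docs that contain the last word of the query."""
    keyword = query.rstrip("?").split()[-1].lower()
    return [d for d in docs if keyword in d.lower()]

def run_pipeline(stages, docs, query):
    """Apply each stage in order to the output of the previous one."""
    for stage in stages:
        docs = stage(docs, query)
    return docs

docs = [
    "Lactose is milk sugar.",
    "Lactose is milk sugar.",
    "Starch is a polymer of glucose.",
]
result = run_pipeline([drop_duplicates, keep_relevant], docs, "What is Lactose?")
print(result)
```

Running the cheap redundancy stage first means the relevance stage has fewer documents to score, which is one reason to order the stages this way.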

In [19]:
# build compressor pipeline

comp_pipeline = DocumentCompressorPipeline(
    transformers=[emb_redundant_filter, emb_filter]
)

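# ContextualCompressionRetriever has no search_kwargs field, so k is set on the base retriever instead
compressor_pipeline_retriever = ContextualCompressionRetriever(
    base_compressor=comp_pipeline,
    base_retriever=db.as_retriever(search_type="mmr", search_kwargs={"k": 5}),
)

user_query = "What is Lactose?"
final_response = compressor_pipeline_retriever.invoke(input=user_query)
pretty_docs(final_response)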

##### DOC 1 #####

288 Chemistry(iii)Lactose : It is more commonly known as milk sugar since this
disaccharide is found in milk.

It is composed of b-D-galactose and
b-D-glucose.

The linkage is between C1 of galactose and C4 of
glucose.

Free aldehyde group may be produced at C-1 of glucose
unit, hence it is also a r educing sugar .

Polysaccharides contain a large number of monosaccharide units joined
together by glycosidic linkages.

These are the most commonly
encountered carbohydrates in nature.

They mainly act as the food
storage or structural materials.

(i)Starch : Starch is the main storage polysaccharide of plants.

It is
the most important dietary source for human beings.

High content
of starch is found in cereals, roots, tubers and some vegetables.

It
is a polymer of a-glucose and consists of two components—
Amylose  and Amylopectin .

Amylose is water soluble component
which constitutes about 15-20% of starch.

Chemically amylose is
a long unbranched chain with 200-1000

## Chain

In [22]:
qa_chain = RetrievalQA.from_chain_type(llm=oai, 
                                       chain_type='stuff', 
                                       retriever=compressor_pipeline_retriever)
query = "What is Lactose?"
response = qa_chain.invoke(query)
print('RESPONSE: ', response['result'])

RESPONSE:  Lactose is more commonly known as milk sugar since this disaccharide is found in milk. It is composed of b-D-galactose and b-D-glucose. The linkage is between C1 of galactose and C4 of glucose. A free aldehyde group may be produced at C-1 of the glucose unit, hence it is also a reducing sugar.
