# RAG w/ contextual compression & content filtering

### Contextual Compression

When the corpus is large and queries vary in nature, a retriever typically returns whole documents/chunks that contain a lot of text irrelevant to the query. Passing those full documents/chunks to the LLM can lead to **more expensive LLM calls** and **poorer responses**.

!["Contextual compression"](./images/contextual-compression.webp)

The basic idea is to take a base **Retriever** (e.g. `chroma_db.as_retriever()`) and pass the retrieved documents/chunks through a **Compressor**, which filters them and extracts only what is needed.

**The main goal of a compressor** is to pass only the relevant information to the LLM by removing whatever is irrelevant to the query.
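
As a rough illustration of the retrieve-then-compress flow, here is a sketch in plain Python with no LangChain dependency. The `retrieve` and `compress` helpers and the word-overlap heuristic are assumptions made for this example only; a real compressor would rely on an LLM or on embeddings.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, chunks, k=2):
    """Toy retriever: rank whole chunks by the number of words shared with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def compress(query, chunk):
    """Toy compressor: keep only the lines that share a content word with the query."""
    stop_words = {"what", "is", "a", "an", "the", "of"}
    q = tokenize(query) - stop_words
    kept = [line for line in chunk.splitlines() if q & tokenize(line)]
    return "\n".join(kept)

# hypothetical, cleaned-up glossary chunks
chunks = [
    "Archaea One of two prokaryotic domains, the other being Bacteria.\n"
    "archegonium In plants, the female gametangium.\n"
    "arteriole A vessel that conveys blood between an artery and a capillary bed.",
    "colloid A mixture made up of a liquid and suspended particles.",
]

query = "What is Archaea?"
retrieved = retrieve(query, chunks, k=1)              # whole chunk, three definitions
compressed = [compress(query, c) for c in retrieved]  # only the matching definition
print(compressed[0])
```

The retriever returns the whole three-definition chunk, while the compressor passes on only the line that matches the query.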

In [22]:

import os

# langchain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain_core.documents import Document
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    LLMChainExtractor, EmbeddingsFilter, DocumentCompressorPipeline
)
from langchain.chains import RetrievalQA
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_transformers import EmbeddingsRedundantFilter

# Azure OpenAI
from langchain_openai import AzureChatOpenAI


In [23]:
# OpenAI
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.environ.get('AZURE_OPENAI_ENDPOINT')
AZURE_OPENAI_VERSION = os.environ.get('AZURE_OPENAI_VERSION')
AZURE_OPENAI_DEPLOYMENT_NAME = os.environ.get('AZURE_OPENAI_DEPLOYMENT_NAME')

In [24]:
# load the data (for later compression)

# a forward slash avoids the invalid "\s" escape sequence and works across platforms
loader = PyPDFLoader("data/sceince_glossary_37.pdf")
data = loader.load()
print("Total documents: ", len(data))
print(data[2].page_content)

Total documents:  37
antigen receptor The general term for a surface
protein, located on B cells and T cells, that
binds to antigens, initiating adaptive immuneresponses. The antigen receptors on B cells arecalled B cell receptors, and the antigen recep-tors on T cells are called T cell receptors.
antigen-presenting cell A cell that upon in-
gesting pathogens or internalizing pathogenproteins generates peptide fragments that arebound by class II MHC molecules and subse-quently displayed on the cell surface to T cells.Macrophages, dendritic cells, and B cells arethe primary antigen-presenting cells.
antiparallel Referring to the arrangement of the
sugar-phosphate backbones in a DNA doublehelix (they run in opposite 5 /H11032S3/H11032directions).
aphotic zone (a–/H11032-fo–/H11032-tik) The part of an ocean or
lake beneath the photic zone, where light doesnot penetrate sufﬁciently for photosynthesisto occur.
apical bud (a–/H11032-pik-ul) A bud at the tip of a plant
stem; also called a ter

In [25]:
# function for prettifying documents

def pretty_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"##### DOC {i + 1} #####\n\n" + d.page_content for i, d in enumerate(docs)]))


In [26]:
# init the Azure OpenAI chat model (any other chat model, including open-source ones, could be used instead)

oai = AzureChatOpenAI(
    openai_api_version=AZURE_OPENAI_VERSION,
    azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME,
)

## Chunking

### Recursive Chunking

In [27]:
# recursive chunking on the glossary

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_recrs_docs = text_splitter.split_documents(data)
print("Total no. of chunks: ", len(split_recrs_docs))
print(split_recrs_docs[0].page_content)

Total no. of chunks:  346
G–1 GLOSSARYGlossaryGlossaryPronounce
a-as in ace
a/ah ash
ch chose
e ¯ meet
e/eh bet
g game
ı ¯ ice
i hit
ks box
kw quick
ng song
o-robe
o ox
oy boy
s say
sh shell
th thin
u ¯ boot
u/uh up
z zoo
/H11032/H11005primary accent
/H11032/H11005secondary accent
5/H11032cap A modiﬁed form of guanine nucleotide
added onto the 5 /H11032end of a pre-mRNA
molecule.
A site One of a ribosome’s three binding sites for
tRNA during translation. The A site holds the
tRNA carrying the next amino acid to be addedto the polypeptide chain. (A stands foraminoacyl tRNA.)
ABC hypothesis A model of ﬂower formation
identifying three classes of organ identitygenes that direct formation of the four types ofﬂoral organs.
abiotic (a–/H11032-bı ¯-ot /H11032-ik) Nonliving; referring to
the physical and chemical properties of anenvironment.
abortion The termination of a pregnancy in
progress.
abscisic acid (ABA) (ab-sis /H11032-ik) A plant hor-
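
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the `RecursiveCharacterTextSplitter` algorithm, which additionally tries to break on separators such as paragraphs and newlines first; `chunk_with_overlap` is a hypothetical helper for illustration only.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters, where each
    window starts with the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

The overlap repeats a little context at each boundary, so a definition cut by one chunk border still appears partly in the neighbouring chunk.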


### Semantic chunking w/ NLTK

In [28]:
# chunking on sentence boundaries detected by NLTK (suited to running text, not lists)

text_splitter = NLTKTextSplitter()
# concatenate docs from list into a single string
merged_data = "\n".join(doc.page_content for doc in data)
split_nltk_docs = text_splitter.split_text(merged_data)
print("Total no. of chunks: ", len(split_nltk_docs))
print(split_nltk_docs[15])

Total no. of chunks:  82
cloaca (klo–-a–/H11032-kuh) A common opening for the
digestive, urinary, and reproductive tractsfound in many nonmammalian vertebrates butin few mammals.

clonal selection The process by which an anti-
gen selectively binds to and activates onlythose lymphocytes bearing receptors speciﬁcfor the antigen.

The selected lymphocytes pro-liferate and differentiate into a clone of effec-tor cells and a clone of memory cells speciﬁcfor the stimulating antigen.

clone (1) A lineage of genetically identical
individuals or cells.

(2) In popular usage, anindividual that is genetically identical toanother individual.

(3) As a verb, to make oneor more genetic replicas of an individual or cell.See also gene cloning.

cloning vector In genetic engineering, a DNA
molecule that can carry foreign DNA into a hostcell and replicate there.

Cloning vectors includeplasmids and bacterial artiﬁcial chromosomes(BACs), which move recombinant DNA from atest tube back into a cell, and v

In [29]:
# the NLTK chunks are plain strings, so they must be converted to LangChain Documents before use

def convert_to_langchain_docs(docs):

    documents = []

    for doc in docs:
        
        # append to list as a new document without metadata
        documents.append(
            Document(page_content=doc)
        )

    return documents

transformed_nltk_docs = convert_to_langchain_docs(split_nltk_docs)

## Basic Retriever

In [30]:
# embedding model

emb_model = SentenceTransformerEmbeddings(model_name="thenlper/gte-large")

In [31]:
# init vector store

db = Chroma.from_documents(documents=transformed_nltk_docs, embedding=emb_model)

In [32]:
# setup basic retriever

retriever = db.as_retriever(search_type="mmr")
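
`search_type="mmr"` selects results by maximal marginal relevance, which trades off similarity to the query against diversity among the documents already selected. The sketch below illustrates the idea on toy 2-D vectors in plain Python. `mmr_select` and the vectors are assumptions for this example, the `lambda_mult` name is borrowed from LangChain's MMR search options, and this is not the vector store's internal implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest document that was already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.1],    # very relevant
    [1.0, 0.12],   # near-duplicate of the first document
    [0.7, -0.7],   # less relevant, but different from the first
]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the more distinct document; with `lambda_mult=1.0` the selection reduces to plain relevance ranking.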

In [33]:
# sample test from basic retriever WITHOUT compression

response = retriever.get_relevant_documents(query="What is Archaea?")
pretty_docs(response)

##### DOC 1 #####

Archaea (ar/H11032-ke¯/H11032-uh) One of two prokaryotic
domains, the other being Bacteria.

Archaeplastida (ar/H11032-ke ¯-plas /H11032-tid-uh) One of ﬁve
supergroups of eukaryotes proposed in a currenthypothesis of the evolutionary history ofeukaryotes.

This monophyletic group, whichincludes red algae, green algae, and land plants,descended from an ancient protist ancestor thatengulfed a cyanobacterium.

See also Excavata,
Chromalveolata, Rhizaria, and Unikonta.

archegonium (ar-ki-go–/H11032-ne ¯-um) (plural,
archegonia) In plants, the femalegametangium, a moist chamber in whichgametes develop.

archenteron (ar-ken /H11032-tuh-ron) The endoderm-
lined cavity, formed during gastrulation, thatdevelops into the digestive tract of an animal.

archosaur (ar/H11032-ko–-so–r) Member of the reptilian
group that includes crocodiles, alligators anddinosaurs, including birds.

arteriole (ar-ter /H11032-e ¯-o–l) A vessel that conveys
blood between an artery and a capillary b

## Problem Definition

As we can see, the response contains not only the answer to the query **"What is Archaea?"**, but also a lot of additional information (e.g. definitions of **Archaeplastida**, **archegonium**, etc.) that will be passed to the LLM for the "Generation" step.

## Document Compressor

### Option 1: LLMChainExtractor

In [34]:
# initialize compressor instance (from OpenAI)

compressor = LLMChainExtractor.from_llm(oai) # --> LLMChain for extraction of only relevant statements from each document 
compressor

# ignore UserWarning
from warnings import filterwarnings
filterwarnings("ignore", category=UserWarning)

In [35]:
# compression retriever = base retriever + compressor

compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
response = compression_retriever.get_relevant_documents("What is Archaea?")
pretty_docs(response)

##### DOC 1 #####

Archaea (ar/H11032-ke¯/H11032-uh) One of two prokaryotic
domains, the other being Bacteria.


In [36]:
# test #1

query = "What is Archaea?"
chain = RetrievalQA.from_chain_type(llm=oai, 
                                    retriever=compression_retriever)

response_1 = chain.invoke(input=query)
print(response_1['result'])

Archaea is one of two prokaryotic domains, the other being Bacteria. Prokaryotic means that they are single-celled organisms that do not have a distinct nucleus with a membrane nor other specialized compartments.


### Option 2: EmbeddingsFilter

**Cheaper** & **faster** option than LLMChainExtractor, since it makes no LLM calls: it computes the cosine similarity between the **query** embedding and each **document** embedding, and returns only those documents whose embeddings are sufficiently similar to the query.

- **_k_** = maximum number of most similar documents to return
- **_similarity_threshold_** = minimum query-document similarity a document needs in order to be kept
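
The filtering logic can be sketched in plain Python as follows. This is a minimal illustration on toy vectors under the assumption that `k` is applied first and the threshold second; `embeddings_filter` is a hypothetical helper, not the LangChain class.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embeddings_filter(query_vec, doc_vecs, k=None, similarity_threshold=None):
    """Return the indices of the documents to keep, most similar first."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    if k is not None:
        order = order[:k]  # keep at most k of the most similar documents
    if similarity_threshold is not None:
        order = [i for i in order if sims[i] > similarity_threshold]
    return order

query_vec = [1.0, 0.0]
doc_vecs = [
    [0.9, 0.1],  # similarity ~0.99
    [0.5, 0.5],  # similarity ~0.71
    [0.0, 1.0],  # similarity 0.0
]
print(embeddings_filter(query_vec, doc_vecs, k=10, similarity_threshold=0.8))
```

Note that a filter of this kind keeps or drops whole documents; it does not trim text inside a document.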

In [37]:
# initialize embeddings filter

emb_filter = EmbeddingsFilter(embeddings=emb_model,
                              k=10,
                              similarity_threshold=0.8)

In [39]:
# test #2

compression_retriever_filter = ContextualCompressionRetriever(base_retriever=retriever,
                                                              base_compressor=emb_filter)

response_2 = compression_retriever_filter.get_relevant_documents("What is Archaea?")
pretty_docs(response_2)

##### DOC 1 #####

Archaea (ar/H11032-ke¯/H11032-uh) One of two prokaryotic
domains, the other being Bacteria.

Archaeplastida (ar/H11032-ke ¯-plas /H11032-tid-uh) One of ﬁve
supergroups of eukaryotes proposed in a currenthypothesis of the evolutionary history ofeukaryotes.

This monophyletic group, whichincludes red algae, green algae, and land plants,descended from an ancient protist ancestor thatengulfed a cyanobacterium.

See also Excavata,
Chromalveolata, Rhizaria, and Unikonta.

archegonium (ar-ki-go–/H11032-ne ¯-um) (plural,
archegonia) In plants, the femalegametangium, a moist chamber in whichgametes develop.

archenteron (ar-ken /H11032-tuh-ron) The endoderm-
lined cavity, formed during gastrulation, thatdevelops into the digestive tract of an animal.

archosaur (ar/H11032-ko–-so–r) Member of the reptilian
group that includes crocodiles, alligators anddinosaurs, including birds.

arteriole (ar-ter /H11032-e ¯-o–l) A vessel that conveys
blood between an artery and a capillary b

### Option 3: EmbeddingsRedundantFilter

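Drops redundant documents by comparing the documents' **embeddings** with each other: when two documents are more similar than `similarity_threshold`, one of them is removed. Unlike the EmbeddingsFilter, it does not look at the **query** at all, so it removes near-duplicates rather than irrelevant documents.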
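
A minimal sketch of redundancy filtering, assuming toy vectors in place of real embeddings: each document is compared with the documents already kept and is dropped if it is too similar to any of them. `drop_redundant` is a hypothetical greedy helper written for this example and may differ in detail from how the LangChain transformer picks which duplicate to remove.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def drop_redundant(doc_vecs, similarity_threshold=0.98):
    """Return the indices of documents that are not near-duplicates of an earlier kept one."""
    kept = []
    for i, vec in enumerate(doc_vecs):
        if all(cosine(vec, doc_vecs[j]) < similarity_threshold for j in kept):
            kept.append(i)
    return kept

doc_vecs = [
    [1.0, 0.0],
    [1.0, 0.01],  # almost identical to the first vector
    [0.0, 1.0],   # unrelated to the first vector
]
print(drop_redundant(doc_vecs, similarity_threshold=0.98))
```

No query is involved here: the decision depends only on how similar the documents are to one another.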

In [41]:
emb_redundant_filter = EmbeddingsRedundantFilter(embeddings=emb_model, similarity_threshold=0.98)

In [42]:
# test #3

doc_pipeline_compressor = DocumentCompressorPipeline(transformers=[emb_redundant_filter]) # doc compressor pipeline for removal of redundant results ONLY
compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=doc_pipeline_compressor)

response_3 = compression_retriever.get_relevant_documents("What is Archaea?")
pretty_docs(response_3)

##### DOC 1 #####

Archaea (ar/H11032-ke¯/H11032-uh) One of two prokaryotic
domains, the other being Bacteria.

Archaeplastida (ar/H11032-ke ¯-plas /H11032-tid-uh) One of ﬁve
supergroups of eukaryotes proposed in a currenthypothesis of the evolutionary history ofeukaryotes.

This monophyletic group, whichincludes red algae, green algae, and land plants,descended from an ancient protist ancestor thatengulfed a cyanobacterium.

See also Excavata,
Chromalveolata, Rhizaria, and Unikonta.

archegonium (ar-ki-go–/H11032-ne ¯-um) (plural,
archegonia) In plants, the femalegametangium, a moist chamber in whichgametes develop.

archenteron (ar-ken /H11032-tuh-ron) The endoderm-
lined cavity, formed during gastrulation, thatdevelops into the digestive tract of an animal.

archosaur (ar/H11032-ko–-so–r) Member of the reptilian
group that includes crocodiles, alligators anddinosaurs, including birds.

arteriole (ar-ter /H11032-e ¯-o–l) A vessel that conveys
blood between an artery and a capillary b

## Compression Pipeline

- **_emb_filter_** = relevant docs filter
- **_emb_redundant_filter_** = redundant docs filter

**TODOs**:
1. Add more filters to improve the results
2. Add filtering of the results via manual, customizable compression
3. Add a proper system prompt
4. Add more data cleaning steps (e.g. to split terms and definitions)
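
Conceptually, a compressor pipeline applies its steps one after another, each step receiving the output of the previous one. The plain-Python sketch below shows that chaining. `run_pipeline`, `drop_duplicates` and `keep_relevant` are hypothetical stand-ins that use simple string matching instead of embeddings.

```python
def run_pipeline(steps, docs, query):
    """Apply each step in order; every step maps (docs, query) to a shorter list of docs."""
    for step in steps:
        docs = step(docs, query)
    return docs

def drop_duplicates(docs, query):
    """Stand-in for a redundancy filter: drop exact duplicates, ignoring case and outer whitespace."""
    seen, unique = set(), []
    for d in docs:
        key = d.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

def keep_relevant(docs, query):
    """Stand-in for a relevance filter: keep docs that mention a longer query word."""
    terms = {w for w in query.lower().split() if len(w) > 3}
    return [d for d in docs if any(t in d.lower() for t in terms)]

docs = [
    "coleorhiza: covering of the young root",
    "Coleorhiza: covering of the young root ",
    "collagen: a glycoprotein",
]
result = run_pipeline([drop_duplicates, keep_relevant], docs, "define coleorhiza")
print(result)
```

Running the redundancy step first means the relevance step has fewer documents to score, which is one reason to order the pipeline this way.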

In [43]:
# build compressor pipeline

comp_pipeline = DocumentCompressorPipeline(
    transformers=[emb_redundant_filter, emb_filter]
)

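# note: ContextualCompressionRetriever has no `search_kwargs` field; to change the number of
# retrieved docs, set it on the base retriever, e.g. db.as_retriever(search_kwargs={"k": 5})
compressor_pipeline_retriever = ContextualCompressionRetriever(base_compressor=comp_pipeline,
                                                               base_retriever=retriever)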

user_query = "Have you ever heard of coleorhiza?"
final_response = compressor_pipeline_retriever.get_relevant_documents(query=user_query)
pretty_docs(final_response)

##### DOC 1 #####

coleorhiza (ko–/H11032-le ¯-uh-rı ¯/H11032-zuh) The covering of
the young root of the embryo of a grass seed.

collagen A glycoprotein in the extracellular
matrix of animal cells that forms strong
ﬁbers, found extensively in connective tissue
and bone; the most abundant protein in theanimal kingdom.

collecting duct The location in the kidney
where processed ﬁltrate, called urine, is
collected from the renal tubules.

collenchyma cell (ko–-len/H11032-kim-uh) A ﬂexible
plant cell type that occurs in strands or cylin-ders that support young parts of the plant
without restraining growth.

colloid A mixture made up of a liquid and parti-
cles that (because of their large size) remain
suspended rather than dissolved in that liquid.

colon (ko–/H11032-len) The largest section of the
vertebrate large intestine; functions in water
absorption and formation of feces.

commensalism (kuh-men /H11032-suh-lizm) A symbi-
otic relationship in which one organism bene-
ﬁts but the oth

## Chain

In [44]:
qa_chain = RetrievalQA.from_chain_type(llm=oai, chain_type='stuff', retriever=compressor_pipeline_retriever)

query = "Which reaction uses reverse transcriptase and DNA polymerase for cdna generation?"
response = qa_chain.invoke(input=query)
print('RESPONSE: ', response['result'])

RESPONSE:  The reaction that uses reverse transcriptase and DNA polymerase for cDNA generation is called Reverse Transcriptase-Polymerase Chain Reaction (RT-PCR). This technique is used for determining expression of a particular gene. It synthesizes cDNA from all the mRNA in a sample and then subjects the cDNA to PCR amplification using primers specific for the gene of interest.


!["Compressor chain results from science_glossary"](./images/compressor-pipeline-results.png)