# Project: Document Retriever Search Engine for Wikipedia Data
* Notebook by Adam Lang
* Date: 7/16/2024

# Project Overview
* In this project I will build a document retriever search engine for Wikipedia Data using:
1. LangChain
2. SentenceTransformers
  * embeddings
  * re-ranker
3. OpenAI
  * GPT-3.5-turbo
4. Chroma Vector Database

## Architecture
1. Load and chunk Wikipedia document
2. Create document chunk embeddings
3. Index in vector database
4. Search Query experimentation with Retrievers
   * Stand-alone Retrievers
   * Pipeline of retrievers

### LangChain and SentenceTransformer Dependencies

In [6]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0
!pip install sentence-transformers==2.7.0

Collecting langchain-openai==0.1.7
  Downloading langchain_openai-0.1.7-py3-none-any.whl (34 kB)
Collecting openai<2.0.0,>=1.24.0 (from langchain-openai==0.1.7)
  Downloading openai-1.35.14-py3-none-any.whl (328 kB)
Collecting tiktoken<1,>=0.7 (from langchain-openai==0.1.7)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken, openai, langchain-openai
Successfully installed langchain-openai-0.1.7 openai-1.35.14 tiktoken-0.7.0


### Chroma Vector DB and Langchain Wrapper Dependencies

In [7]:
!pip install langchain-chroma



### OpenAI Dependencies/Access

In [8]:
from getpass import getpass

OPENAI_KEY = getpass('Please enter your Open AI key: ')

Please enter your Open AI key: ··········


In [9]:
## openai environment setup
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY
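
If the key may already be set in the environment (for example when the notebook is re-run), the prompt can be skipped. A minimal sketch, assuming a hypothetical helper named `get_api_key`; the `prompt` parameter is only there so the function can be exercised without an interactive terminal.

```python
import os
from getpass import getpass


def get_api_key(env_name='OPENAI_API_KEY', prompt=getpass):
    """Return the key from the environment if present, otherwise prompt for it."""
    key = os.environ.get(env_name)
    if not key:
        # Fall back to an interactive prompt and cache the value for later cells.
        key = prompt(f'Please enter a value for {env_name}: ')
        os.environ[env_name] = key
    return key

# Usage in the notebook (hypothetical):
# OPENAI_KEY = get_api_key()
```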

### OpenAI Embedding Models
* We can access the OpenAI embedding models via LangChain APIs.

In [10]:
## langchain openai embedding import
from langchain_openai import OpenAIEmbeddings

# setup embedding model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

### Get the Wikipedia Data

In [11]:
# download data
!gdown 1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW

Downloading...
From (original): https://drive.google.com/uc?id=1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW
From (redirected): https://drive.google.com/uc?id=1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW&confirm=t&uuid=3010c796-3b30-4b3a-bb98-d399f59af82e
To: /content/simplewiki-2020-11-01.jsonl.gz
100% 50.2M/50.2M [00:01<00:00, 33.4MB/s]


In [12]:
## open the file
import gzip
import json

# filepath
wiki_path = '/content/simplewiki-2020-11-01.jsonl.gz'

# open and append to list
docs = []
with gzip.open(wiki_path, 'rt', encoding='utf8') as fIn:
  for line in fIn:
    data = json.loads(line.strip())

    # add all paragraphs
    # passages.extend(data['paragraphs'])

    # keep only the first 3 paragraphs of each article
    docs.append({
        'metadata' : {
                        'title': data.get('title'),
                        'article_id': data.get('id')
        },
        'data': ' '.join(data.get('paragraphs')[0:3]) # restrict data to first 3 paragraphs to run modules/embeddings faster
    })

# print passages
print("Passages:", len(docs))

Passages: 169597


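### Document subsetting
* Subset the data by keyword: keep only articles whose text mentions `linguistics`, `india`, or `cheetah`.
* A smaller corpus also means fewer tokens sent to the OpenAI embedding model, which lowers cost.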

In [13]:
## keep only the wikipedia docs that mention one of the keywords, so later steps run faster
docs = [doc for doc in docs for x in ['linguistics', 'india', 'cheetah']
              if x in doc['data'].lower().split()]


In [14]:
# len of docs
len(docs)

1364
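
One thing to keep in mind about the comprehension above: because it loops over the keywords for every document, a document whose text contains more than one keyword is appended once per matching keyword. Below is a minimal sketch of a variant that keeps each document at most once by using `any()`. The `sample_docs` records are made up for illustration, and on the real data this variant may give a smaller count than the one shown above.

```python
# Toy records shaped like the dicts built earlier (hypothetical content).
sample_docs = [
    {'metadata': {'title': 'A'}, 'data': 'Linguistics in India is widely studied'},
    {'metadata': {'title': 'B'}, 'data': 'The cheetah is a fast cat'},
    {'metadata': {'title': 'C'}, 'data': 'Basil is a herb'},
]
keywords = ['linguistics', 'india', 'cheetah']

# Original pattern: one entry per (document, matching keyword) pair.
per_keyword = [doc for doc in sample_docs for x in keywords
               if x in doc['data'].lower().split()]

# Variant: any() keeps a matching document exactly once.
unique = [doc for doc in sample_docs
          if any(x in doc['data'].lower().split() for x in keywords)]

print(len(per_keyword), len(unique))
```

In this toy example document `A` matches two keywords, so it appears twice in `per_keyword` but only once in `unique`.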

In [15]:
# view docs
docs[:4]

[{'metadata': {'title': 'Kurgan hypothesis', 'article_id': '72554'},
  'data': 'The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'},
 {'metadata': {'title': 'Marija Gimbutas', 'article_id': '72558'},
  'data': 'Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'},
 {'metadata': {'title': 'Basil', 'article_id': '73985'},
  'data': 'Basil ("Ocimum basilic

In [16]:
type(docs)

list

## Create LangChain Documents

In [17]:
from langchain.docstore.document import Document

# create docs
docs = [Document(page_content=doc['data'],
                 metadata=doc['metadata']) for doc in docs]

In [18]:
## view first 4 docs
docs[:4]

[Document(metadata={'title': 'Kurgan hypothesis', 'article_id': '72554'}, page_content='The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'),
 Document(metadata={'title': 'Marija Gimbutas', 'article_id': '72558'}, page_content='Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'),
 Document(metadata={'title': 'Basil', 'article_id': '73985'}, page_content

In [19]:
## len of docs
len(docs)

1364

### Splitting larger documents into smaller chunks
* For now we will use the standard `RecursiveCharacterTextSplitter`.
* In the future it may be worth trying the `SemanticChunker`, which embeds sentences and places chunk boundaries where the embedding distance between neighboring sentences exceeds a statistical threshold (e.g. percentile, standard deviation, or interquartile range), so that related sentences tend to stay in the same chunk.
* We will use a `chunk_size` of 2000 characters, which is roughly 2 to 3 small paragraphs, with a `chunk_overlap` of 300 characters.

In [27]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# setup splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)

# chunk documents
chunked_docs = splitter.split_documents(docs)

In [29]:
## view chunked docs sample
chunked_docs[:2]

[Document(metadata={'title': 'Kurgan hypothesis', 'article_id': '72554'}, page_content='The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'),
 Document(metadata={'title': 'Marija Gimbutas', 'article_id': '72558'}, page_content='Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.')]

In [30]:
## len of chunked docs
len(chunked_docs)

1388
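
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window character chunker. It is only an illustration of the two parameters and is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries separators such as paragraph and line breaks before falling back to smaller units). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

A text shorter than `chunk_size` comes back as a single chunk. That is consistent with the count above rising only from 1364 to 1388: most of the three-paragraph documents presumably already fit within 2000 characters.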

## Create a Vector Database
* We will initialize a vector DB using Chroma.
* We will also persist it to disk by passing the target directory as `persist_directory`.
* Note on the embedding distance metric in Chroma:
   * The distance must be set to **cosine** explicitly; otherwise Chroma defaults to Euclidean (L2) distance.
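
Why the metric matters can be seen on toy vectors. The sketch below uses plain Python and made-up 2-dimensional vectors, not real embeddings: cosine distance depends only on direction, while squared L2 distance also depends on vector length, so the two metrics can rank the same candidates differently.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]   # same direction as the query, larger magnitude
different_direction = [1.0, 1.0]     # closer in space, but rotated by 45 degrees

for name, vec in [('same_direction_longer', same_direction_longer),
                  ('different_direction', different_direction)]:
    print(name, cosine_distance(query, vec), squared_l2_distance(query, vec))
```

Under cosine distance `same_direction_longer` is the nearest neighbor of the query; under squared L2 distance `different_direction` is.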

In [31]:
## import chroma
from langchain_chroma import Chroma

# create vector DB of docs and embeddings
chroma_db = Chroma.from_documents(documents=chunked_docs,
                                  collection_name='wikipedia_db',
                                  embedding=openai_embed_model,
                                  # need to set distance function to cosine or uses default euclidean!
                                  collection_metadata={'hnsw:space': 'cosine'},
                                  persist_directory="./wikipedia_db")

## Load Vector DB from disk
* If you have already created the vector DB and saved on disk, you can load it and connect directly using this block of code.

In [32]:
## connect and load vector DB
chroma_db = Chroma(persist_directory="./wikipedia_db",
                   collection_name='wikipedia_db',
                   embedding_function=openai_embed_model)

In [33]:
## inspect the Chroma vector store object
chroma_db

<langchain_chroma.vectorstores.Chroma at 0x7cf4e4478370>

# Experiments with Vector Database Retrievers

These are the retrieval strategies we will experiment/test with:
* Similarity or Ranking based retrieval
* Multi Query Retrieval
* Contextual Compression Retrieval
* Chained Retrieval Pipeline

## 1. Similarity or Ranking Based Retrieval
* Use **cosine similarity** to retrieve top 3 similar documents based on user input query.
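
Conceptually, top-k similarity retrieval scores every indexed vector against the query vector and keeps the k best. A minimal sketch in plain Python follows; the 3-dimensional vectors in `toy_index` are invented for illustration (real embeddings have far more dimensions), and a real vector database uses an approximate index rather than this exhaustive scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = ((title, cosine_similarity(query_vec, vec))
              for title, vec in indexed.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical embeddings keyed by article title.
toy_index = {
    'New Delhi': [0.9, 0.1, 0.0],
    'Mumbai': [0.7, 0.3, 0.1],
    'Cheetah': [0.0, 0.1, 0.9],
    'Linguistics': [0.1, 0.9, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, toy_index, k=3)
for title, score in results:
    print(f'{title}: {score:.3f}')
```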

In [34]:
## similarity retriever
similarity_retriever = chroma_db.as_retriever(search_type='similarity',
                                              search_kwargs={'k': 3})

In [37]:
## query 1
query = "What is the current capital of India"
top3_docs = similarity_retriever.invoke(query)
top3_docs

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),
 Document(metadata={'article_id': '5114', 'title': 'Mumbai'}, page_content="Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million peo

In [38]:
## query 2
query = "What is the old capital of India?"
top3_docs = similarity_retriever.invoke(query)
top3_docs

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),
 Document(metadata={'article_id': '22106', 'title': 'Delhi'}, page_content='Delhi (; "Dillī"; "Dillī"; "Dēhlī"), officially the National Capital Territory of Delhi (NCT), is a territory in India. It includes the country\'s capital New Delhi. It covers an area of . It is bigger than the Faroe Islands but smaller than Guadeloupe. Delhi is a part of the National Capital Region, which has 12.5 million residents. The governance of Delhi is like that of a state in India. It has its own legislature, high court and a council of executive ministers. Delhi is on the ban

In [39]:
## query 3
query = "What is the largest animal?"
top3_docs = similarity_retriever.invoke(query)
top3_docs

[Document(metadata={'article_id': '102713', 'title': 'Aurochs'}, page_content='The Aurochs, or urus, ("Bos primigenius") was a large species of cattle. The aurochs used to be common in Europe. It is extinct now. It was a wild animal, not a domesticated animal. The extinct aurochs/urus is a not the same species as the "wisent" (the European bison). According to the Paleontologisk Museum, University of Oslo, aurochs developed in India some two million years ago, came into the Middle East and farther into Asia, and reached Europe about 250,000 years ago. People once thought that they were a different species from modern European cattle ("Bos taurus"). Today, people think that aurochs and modern cattle are the same species. Modern cattle have become much smaller than their wild ancestors: the height of a large domesticated cow is about 1.5 meters (5 feet, 15 hands), while aurochs were about 1.75 meters (5.75 feet, 17 hands).'),
 Document(metadata={'article_id': '385562', 'title': 'Indian e

In [40]:
## query 4
query = "What is the smallest animal?"
top3_docs = similarity_retriever.invoke(query)
top3_docs

[Document(metadata={'article_id': '719935', 'title': 'Rusty-spotted cat'}, page_content='The rusty-spotted cat ("Prionailurus rubiginosus") is one of the cat family\'s smallest members. Historical records are known only from India and Sri Lanka. In 2012, it was also recorded in the western Terai of Nepal. Since 2016, the global wild population is listed as Near Threatened on the IUCN Red List. It is fragmented and affected by the loss and destruction of its prime habitat, deciduous forest.'),
 Document(metadata={'article_id': '36099', 'title': 'Lesser bandicoot rat'}, page_content='The lesser bandicoot rat, "Bandicota bengalensis", is a rodent. It lives in south Asia. It can grow up to 40cm long. It is a rat, but is not in the genus "Rattus". They may be a pest to cereal crops, and gardens in India and Sri Lanka. When attacking the rat grunts like a pig. Their fur is dark brown on the back (dorsally, as scientists say), and usually lighter or darker grey on the belly-side (ventrally). 

In [47]:
## query 5
query = "What is the study of human language called?"
top3_docs = similarity_retriever.invoke(query)
top3_docs

[Document(metadata={'article_id': '20194', 'title': 'Linguistics'}, page_content='Linguistics is the study of language. People who study language are called linguists. There are five main parts of linguistics: the study of sounds (phonology), the study of parts of words, like "un-" and "-ing" (morphology), the study of word order and how sentences are made (syntax), the study of the meaning of words (semantics), and the study of the unspoken meaning of speech that is separate from the literal meaning of what is said (for example, saying "I\'m cold" to get someone to turn off the fan (pragmatics). There are many ways to use linguistics every day. Some linguists are theoretical linguists and study the theory and ideas behind language, such as historical linguistics (the study of the history of language, and how it has changed), or the study of how different groups of people may use language differently (sociolinguistics). Some linguists are applied linguists and use linguistics to do thi

## 2. Multi Query Retrieval

Problem
* Retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well.
* Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.

Solution

* [`MultiQueryRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates the process of prompt tuning by using an LLM to **generate multiple queries from different perspectives** for a given user input query.
* For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
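
The "unique union" step can be sketched in plain Python. This only illustrates the idea and is not LangChain's implementation; the function name `unique_union` and the per-query result lists are hypothetical.

```python
def unique_union(result_lists):
    """Merge several ranked result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical top-3 results for three LLM-generated variations of one question.
results_per_query = [
    ['New Delhi', 'Mumbai', 'Delhi'],
    ['New Delhi', 'Capital city', 'Delhi'],
    ['Kolkata', 'New Delhi', 'Mumbai'],
]

merged = unique_union(results_per_query)
print(merged)
```

With k = 3 and three generated queries, the merged set can hold anywhere from 3 to 9 documents, depending on how much the individual result lists overlap.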

In [42]:
from langchain_openai import ChatOpenAI

# LLM setup
chatgpt = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

In [43]:
#setup for multi query retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
## logging for query
import logging

## setup retriever
similarity_retriever = chroma_db.as_retriever(search_type='similarity',
                                              search_kwargs={'k': 3})

## MQ retriever setup
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever,
    llm=chatgpt
)

## logging
logging.basicConfig()
## so we can see multiple queries generated by LLM
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)

In [44]:
## query 1
query = "What is the current capital of India?"
docs = mq_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you tell me the present capital city of India?', '2. What city is currently serving as the capital of India?', '3. Which city is currently designated as the capital of India?']


[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),
 Document(metadata={'article_id': '1968', 'title': 'Capital city'}, page_content="A capital city (or capital town or just capital) is a city or town, specified by law or constitution, by the government of a country, or part of a country, such as a state, province or county. It usually serves as the location of the government's central meeting place and offices. Most of the country's leaders and officials work in the capital city. Capitals are usually among the largest cities in their regions; often they are the biggest. For example, Montevideo is Uruguay's cap

In [45]:
## query 2
query = "What was the old capital of India?"
docs = mq_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which city served as the former capital of India?', '2. What was the historical capital city of India?', '3. Can you tell me the previous capital of India?']


[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),
 Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='forces started, in 1756 the British began to upgrade their fortifications. When this was protested, the Nawab of Bengal Siraj-Ud-Daulah attacked and captured Fort William. This led to the infamous Black Hole incident. A force of Company sepoys and British troops led by Robert Clive recaptured the city the next year. Calcutta became the capital of British India in 1772. However, the capital shifted to the hilly town of Shimla during the summer months every year, starting from the year 

In [46]:
## query 3
query = "What is the study of human language called?"
docs = mq_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['What field of study focuses on human language?', 'What academic discipline is dedicated to the study of human language?', 'What is the formal term for the examination of human language?']


[Document(metadata={'article_id': '20194', 'title': 'Linguistics'}, page_content='Linguistics is the study of language. People who study language are called linguists. There are five main parts of linguistics: the study of sounds (phonology), the study of parts of words, like "un-" and "-ing" (morphology), the study of word order and how sentences are made (syntax), the study of the meaning of words (semantics), and the study of the unspoken meaning of speech that is separate from the literal meaning of what is said (for example, saying "I\'m cold" to get someone to turn off the fan (pragmatics). There are many ways to use linguistics every day. Some linguists are theoretical linguists and study the theory and ideas behind language, such as historical linguistics (the study of the history of language, and how it has changed), or the study of how different groups of people may use language differently (sociolinguistics). Some linguists are applied linguists and use linguistics to do thi

Summary:
* With the multi-query approach the LLM generates 3 variations of the initial query, and the results from all of them are combined. This let us retrieve documents that we had not seen with the simple similarity retrieval.

## 3. Contextual Compression Retrieval

Problem

* The information most relevant to a user's query may be buried deep in a document, surrounded by a lot of irrelevant text.
* Passing that full document through your application can lead to more expensive LLM calls (e.g. higher token usage!) and hallucination.

Solution

* **Contextual compression** is designed to fix this.
* The idea is simple:
   * instead of immediately returning retrieved documents *as-is*, you **compress them using the context of the given query**, so that **only relevant information** is returned.

* Compression can happen in a few ways:

   1. Remove the parts of each retrieved document that are not relevant to the query.
      * This is done by extracting only the passages that are relevant to the given query.

   2. Filter out whole documents that are not relevant to the given query, without altering the content of the documents that remain.


**Below we will demonstrate this by**:
   * wrapping the multi-query retriever with a `ContextualCompressionRetriever`.
   * Then we'll add an `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
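
The difference between the two compression strategies can be sketched without an LLM. In the toy code below a simple keyword check stands in for the LLM's relevance judgement; this is an assumption made purely for illustration, and all function names are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_relevant(text, query_terms):
    # Stand-in for the LLM's judgement: does the text share a term with the query?
    return bool(tokens(text) & query_terms)


def extract_relevant(doc, query_terms):
    """Strategy 1: keep only the sentences that mention a query term."""
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    kept = [s + '.' for s in sentences if is_relevant(s, query_terms)]
    return ' '.join(kept)


def filter_docs(docs, query_terms):
    """Strategy 2: drop irrelevant documents, leave the remaining ones untouched."""
    return [d for d in docs if is_relevant(d, query_terms)]


toy_docs = [
    "Mumbai is a natural harbor. It is the financial capital of India.",
    "Basil is a herb. It is used in cooking.",
]
query_terms = {'financial'}

# Extraction shortens documents and drops those with nothing left.
extracted = [text for text in (extract_relevant(d, query_terms) for d in toy_docs) if text]
# Filtering keeps or drops whole documents.
filtered = filter_docs(toy_docs, query_terms)

print(extracted)
print(filtered)
```

Extraction returns shortened text, while filtering returns the surviving documents in full.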

In [48]:
## Method 1: extract only the content relevant to the user's query
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# compressor that extracts ONLY the content relevant to the user's query
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# wrap the multi-query retriever so every retrieved document is passed through the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=mq_retriever
)

In [51]:
## query 1
query = "What is considered the current financial capital of India?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which city is currently recognized as the financial capital of India?', "2. What city holds the title of India's current financial hub?", '3. Where is the current financial center of India located?']


[Document(metadata={'article_id': '5114', 'title': 'Mumbai'}, page_content='Mumbai is the financial capital of India.'),
 Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India.'),
 Document(metadata={'article_id': '1968', 'title': 'Capital city'}, page_content='The capital of the India is New Delhi')]

In [50]:
## query 2
query = "What is considered the old capital of India?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which city is historically known as the former capital of India?', '2. What was the previous capital city of India?', '3. Can you identify the ancient capital of India?']


[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.'),
 Document(metadata={'article_id': '5113', 'title': 'Chennai'}, page_content='Madras state (renamed as Tamil Nadu in 1968) when India became independent in 1947.'),
 Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Calcutta became the capital of British India in 1772.')]

In [52]:
## query 3
query = "What is the fastest animal in the world?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which animal holds the title for being the quickest on Earth?', '2. Can you identify the speediest creature in the animal kingdom?', '3. What species is known for its incredible speed and agility?']


[Document(metadata={'article_id': '9800', 'title': 'Cheetah'}, page_content='A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time.'),
 Document(metadata={'article_id': '528308', 'title': 'South African cheetah'}, page_content='The South African Cheetah')]

In [53]:
## query 4
query = "What is the slowest animal in the world?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['What animal holds the title for being the slowest in the world?', 'Which creature is known for its incredibly slow pace?', 'Can you tell me about the animal that moves at the slowest speed on Earth?']


[Document(metadata={'article_id': '558951', 'title': 'Slow loris'}, page_content='Slow lorises are the genus Nycticebus, nocturnal species of strepsirrhine primates.')]

In [54]:
## query 5
query = "What is the study of human language"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the field that focuses on the analysis of human communication?', '2. How can I learn more about the discipline that examines the structure and use of human languages?', '3. What resources can I access to explore the study of verbal and nonverbal communication in humans?']


[Document(metadata={'article_id': '249472', 'title': 'Discourse analysis'}, page_content='study of human language'),
 Document(metadata={'article_id': '20194', 'title': 'Linguistics'}, page_content='Linguistics is the study of language.'),
 Document(metadata={'article_id': '2104', 'title': 'Cognitive science'}, page_content='the study of language'),
 Document(metadata={'article_id': '388205', 'title': 'Theoretical linguistics'}, page_content='Theoretical linguistics tries to understand how language works. Syntax, phonology, morphology and semantics are studied, also the things or universals that all languages have in common.')]

Now we will use the `LLMChainFilter`, which is:
* Simpler to use.
* A more robust compressor: it uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without modifying the contents of the documents.

In [55]:
from langchain.retrievers.document_compressors import LLMChainFilter

# determines which of initially retrieved docs to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# retrieves documents similar to query and then applies filter
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter,
    base_retriever=mq_retriever
)

In [57]:
## query 1
query = "What is the current financial capital of India?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which city is currently considered the financial capital of India?', '2. What is the present-day economic hub of India?', '3. Can you tell me the current financial center of India?']


[Document(metadata={'article_id': '5114', 'title': 'Mumbai'}, page_content="Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people. The seven islands that form Bombay were home to fishing colonies. The islands were ruled by successive kingdoms and indigenous empires before Portuguese settlers took it. Then, it went to the British East India Company. During the mid-18th century, Bombay became a major trading town. It became a strong place for the Indian independence movement during the early 20th century. When India became independent in 1947, the city was

In [60]:
## query 2
query = "What is known as the old capital of India?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which city is historically referred to as the former capital of India?', '2. What was the previous name of the capital city of India?', '3. Can you identify the city that was once considered the capital of India?']


[Document(metadata={'article_id': '5113', 'title': 'Chennai'}, page_content='Chennai (formerly known as Madras) is the capital city of the Indian state of Tamil Nadu. It has a population of about 7 million people. Almost 10% of all of the people in the state live in Chennai. The city is the fourth largest city of India. It was founded in 1661 by the British East India Company. The city is on the Coromandel Coast of the Bay of Bengal. Chennai is the automobile capital of India. It is also referred as the Detroit of South Asia. The long Marina Beach in Chennai, is one of the longest beaches in the world. The city is separated into three parts by two rivers. The Cooum River divides the city into almost half and the Adyar River divides the southern half of the city into two parts. The historic Buckingham Canal runs through the city. It is almost parallel to the coast. The 350 year old city still has much of its old charm. Today, it is a big commercial and industrial centre. The city has mu

In [61]:
## query 3
query = "What is the fastest animal in the world?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which animal holds the title for being the quickest on Earth?', '2. Can you identify the speediest creature in the animal kingdom?', '3. What species is known for its incredible speed and agility?']


[Document(metadata={'article_id': '9800', 'title': 'Cheetah'}, page_content='A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are active during the day, and hunt in the early morning or late evening. The cheetah compared to other big cats is light and slimly built. Its long thin legs and long spotted tail are necessary for fast running. Its lightly built, thin form is in sharp contrast with the robust build of other big cats. The head-and-body length ranges from . The cheetah stands 70 to 90\xa0cm at the shoulder, and weighs . The slightly curved claws are only weakly retractable (semi-retractable). This is a major point of difference between the cheetah and the other big cats, which have fully retractable claws.'),
 Document(metadata={'article_id': '528308', 'title': 'South African cheetah'}

In [62]:
## query 4
query = "What is the slowest animal in the world?"
docs = compression_retriever.invoke(query)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Which animal holds the title for being the slowest in the world?', '2. Can you identify the animal that is known for its incredibly slow speed?', '3. What creature is recognized as the slowest animal on Earth?']


[Document(metadata={'article_id': '558951', 'title': 'Slow loris'}, page_content='Slow lorises are the genus Nycticebus, nocturnal species of strepsirrhine primates. They live in southeast Asia and nearby areas. There are about eight species: the Sunda slow loris ("N.\xa0coucang"), Bengal slow loris ("N.\xa0bengalensis"), pygmy slow loris ("N.\xa0pygmaeus"), Javan slow loris ("N.\xa0javanicus"), Philippine slow loris ("N.\xa0menagensis"), Bangka slow loris ("N.\xa0bancanus"), Bornean slow loris ("N.\xa0borneanus"), and Kayan River slow loris ("N.\xa0kayan"). The group\'s closest relatives are the slender lorises of southern India and Sri Lanka. Their next closest relatives are the African lorisids, the pottos, false pottos, and angwantibos. They are less closely related to the remaining lorisoids (the various types of galago), and more distantly to the lemurs of Madagascar. Their evolutionary history is uncertain: their fossil record is patchy and molecular clock studies have given var

## 4. Chained Retrieval Pipeline - with reranker!
* Here we will chain multiple retrievers in sequence to get the most relevant documents for a user query.
* The last stage is a reranker model that rescores the documents that survive the earlier stages.
* The pipeline is as follows:

`Similarity Retrieval` ---> `Compression Filter` ---> `Reranker Model Retrieval`
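
To make the flow above concrete, here is a minimal sketch of the same three stages in plain Python. It is not the LangChain implementation: the corpus, `toy_judge` (standing in for the LLM filter), and `toy_score` (standing in for the cross-encoder reranker) are all hypothetical placeholders chosen so the example runs without any model or API key.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the indexed Wikipedia chunks.
CORPUS = [
    {"title": "Cheetah", "text": "the cheetah is the fastest land animal"},
    {"title": "Slow loris", "text": "slow lorises are nocturnal primates"},
    {"title": "Mumbai", "text": "mumbai is the financial capital of india"},
    {"title": "Falcon", "text": "the peregrine falcon is a fast bird"},
]
STOP = {"the", "is", "a", "of", "what", "in"}


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_stage(query, docs, k):
    """Stage 1: keep the k documents most similar to the query."""
    q = Counter(tokens(query))
    return sorted(docs, key=lambda d: cosine(q, Counter(tokens(d["text"]))), reverse=True)[:k]


def filter_stage(query, docs, is_relevant):
    """Stage 2: drop documents the relevance judge rejects (an LLM in the notebook)."""
    return [d for d in docs if is_relevant(query, d)]


def rerank_stage(query, docs, score, top_n):
    """Stage 3: rescore each (query, document) pair and keep the top_n."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_n]


def content_words(text):
    return set(tokens(text)) - STOP


def toy_judge(query, doc):
    # Hypothetical stand-in for the LLM filter: require a shared content word.
    return bool(content_words(query) & content_words(doc["text"]))


def toy_score(query, doc):
    # Hypothetical stand-in for the cross-encoder: count shared content words.
    return len(content_words(query) & content_words(doc["text"]))


def pipeline(query, k=3, top_n=2):
    """Run the three stages in order: similarity, then filter, then rerank."""
    docs = similarity_stage(query, CORPUS, k)
    docs = filter_stage(query, docs, toy_judge)
    return rerank_stage(query, docs, toy_score, top_n)


print([d["title"] for d in pipeline("What is the fastest animal in the world?")])
```

Each stage only sees what the previous stage passed on, so the expensive scorer at the end runs on a handful of candidates rather than the whole index.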

In [63]:
## imports - cross encoder rerankers
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Retriever 1 - simple cosine distance based retriever
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

# LLMChainFilter - filters out irrelevant documents
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# Retriever 2 - retrieves documents similar to query and then applies filter
compressor_retriever = ContextualCompressionRetriever(
    base_compressor=_filter,
    base_retriever=similarity_retriever
)

# open source reranker model via huggingface --> BAAI/bge-reranker-large
reranker = HuggingFaceCrossEncoder(model_name='BAAI/bge-reranker-large')
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=3)

# Retriever 3 - Reranker model used to rerank retrieval results from retriever #2
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor,
    base_retriever=compressor_retriever
)




We will see the results of this pipeline below:
1. Cosine Distance (Retriever 1)
2. LLMChain filter (Retriever 2)
3. Reranker (Retriever 3)
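
Because each result below prints the full `page_content`, it can be easier to compare rankings with a compact summary. The helper below is a minimal sketch: `summarize` and `FakeDocument` are hypothetical names, and `FakeDocument` only mimics the two attributes (`page_content`, `metadata`) visible in the outputs of this notebook, with abbreviated sample text.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same two attributes as the retrieved results."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(docs, preview_chars=60):
    """Return one (article_id, title, preview) tuple per retrieved document, in rank order."""
    rows = []
    for doc in docs:
        meta = doc.metadata
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        rows.append((meta.get("article_id"), meta.get("title"), preview))
    return rows


# Abbreviated sample results, shaped like the outputs shown in this notebook.
sample = [
    FakeDocument("A cheetah is a medium large cat which lives in Africa.",
                 {"article_id": "9800", "title": "Cheetah"}),
    FakeDocument("The South African Cheetah is the nominate subspecies of cheetah.",
                 {"article_id": "528308", "title": "South African cheetah"}),
]

for rank, (article_id, title, preview) in enumerate(summarize(sample, preview_chars=30), start=1):
    print(rank, article_id, title, preview)
```

Since the helper only reads `metadata` and `page_content`, it should also accept the list returned by `final_retriever.invoke(query)`, for example `summarize(docs)`.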

In [65]:
%%time
## query 1
query = "What is the current financial capital of India?"
docs = final_retriever.invoke(query)
docs

CPU times: user 12.2 s, sys: 15.1 ms, total: 12.2 s
Wall time: 13 s


[Document(metadata={'article_id': '5114', 'title': 'Mumbai'}, page_content="Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people. The seven islands that form Bombay were home to fishing colonies. The islands were ruled by successive kingdoms and indigenous empires before Portuguese settlers took it. Then, it went to the British East India Company. During the mid-18th century, Bombay became a major trading town. It became a strong place for the Indian independence movement during the early 20th century. When India became independent in 1947, the city was

In [66]:
%%time
## query 2
query = "What is the old capital of India?"
docs = final_retriever.invoke(query)
docs


CPU times: user 22 s, sys: 4.03 s, total: 26 s
Wall time: 26.9 s


[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content="Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capital of India during the British Raj until 1911. Kolkata was once the center of industry and education. However, it has witnessed political violence and economic problems since 1954. Since 2000, Kolkata has grown due to economic growth. Like other metropolitan cities in India, Kolkata struggles with poverty, pollution and traffic congestion. The discovery of the nearby Chandraketugarh, an archaeological site has proved that people have lived there for over two millennia. The history of Kolkata b

In [67]:
%%time
## query 3

query = "What is the fastest animal in the world?"
docs = final_retriever.invoke(query)
docs

CPU times: user 5.29 s, sys: 8.41 ms, total: 5.3 s
Wall time: 6.21 s


[Document(metadata={'article_id': '9800', 'title': 'Cheetah'}, page_content='A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are active during the day, and hunt in the early morning or late evening. The cheetah compared to other big cats is light and slimly built. Its long thin legs and long spotted tail are necessary for fast running. Its lightly built, thin form is in sharp contrast with the robust build of other big cats. The head-and-body length ranges from . The cheetah stands 70 to 90\xa0cm at the shoulder, and weighs . The slightly curved claws are only weakly retractable (semi-retractable). This is a major point of difference between the cheetah and the other big cats, which have fully retractable claws.'),
 Document(metadata={'article_id': '528308', 'title': 'South African cheetah'}

In [68]:
%%time
## query 4

query = "What is the slowest animal in the world?"
docs = final_retriever.invoke(query)
docs

CPU times: user 2.91 s, sys: 7.81 ms, total: 2.92 s
Wall time: 3.74 s


[Document(metadata={'article_id': '558951', 'title': 'Slow loris'}, page_content='Slow lorises are the genus Nycticebus, nocturnal species of strepsirrhine primates. They live in southeast Asia and nearby areas. There are about eight species: the Sunda slow loris ("N.\xa0coucang"), Bengal slow loris ("N.\xa0bengalensis"), pygmy slow loris ("N.\xa0pygmaeus"), Javan slow loris ("N.\xa0javanicus"), Philippine slow loris ("N.\xa0menagensis"), Bangka slow loris ("N.\xa0bancanus"), Bornean slow loris ("N.\xa0borneanus"), and Kayan River slow loris ("N.\xa0kayan"). The group\'s closest relatives are the slender lorises of southern India and Sri Lanka. Their next closest relatives are the African lorisids, the pottos, false pottos, and angwantibos. They are less closely related to the remaining lorisoids (the various types of galago), and more distantly to the lemurs of Madagascar. Their evolutionary history is uncertain: their fossil record is patchy and molecular clock studies have given var

In [69]:
%%time
## query 5

query = "What is the study of human language?"
docs = final_retriever.invoke(query)
docs

CPU times: user 14.5 s, sys: 25.3 ms, total: 14.5 s
Wall time: 15.4 s


[Document(metadata={'article_id': '388205', 'title': 'Theoretical linguistics'}, page_content='Theoretical linguistics tries to understand how language works. Syntax, phonology, morphology and semantics are studied, also the things or universals that all languages have in common.'),
 Document(metadata={'article_id': '20194', 'title': 'Linguistics'}, page_content='Linguistics is the study of language. People who study language are called linguists. There are five main parts of linguistics: the study of sounds (phonology), the study of parts of words, like "un-" and "-ing" (morphology), the study of word order and how sentences are made (syntax), the study of the meaning of words (semantics), and the study of the unspoken meaning of speech that is separate from the literal meaning of what is said (for example, saying "I\'m cold" to get someone to turn off the fan (pragmatics). There are many ways to use linguistics every day. Some linguists are theoretical linguists and study the theory 

In [70]:
%%time
## query 6
query = "Tell me the kingdom, phylum, class, and order of the cheetah."
docs = final_retriever.invoke(query)
docs

CPU times: user 13.8 s, sys: 44.9 ms, total: 13.8 s
Wall time: 14.6 s


[Document(metadata={'article_id': '528308', 'title': 'South African cheetah'}, page_content='The South African Cheetah ("Acinonyx jubatus jubatus"), also known as Namibian Cheetah, is the nominate subspecies of cheetah native to Southern Africa. It is the most abundant subspecies estimated at more than 6,000 individuals in the wild. Since 1990 and onwards, the population was estimated at approximately 2,500 individuals in Namibia, until 2015, the cheetah population has been increased to more than 3,500 in the country. The South African Cheetah is the closest relative to the two other distinct subspecies, the Asiatic Cheetahs and the Northeast African Cheetah.'),
 Document(metadata={'article_id': '9800', 'title': 'Cheetah'}, page_content='A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are ac