
# Retrieval System

## Data

In [1]:
# Sample documents about LLMs, RAG, and retrieval systems
documents = [
 "Large Language Models (LLMs) are AI systems trained on massive text corpora to generate and understand natural language.",
 "Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.",
 "Vector databases like Pinecone, Weaviate, or FAISS are commonly used to power retrieval systems.",
 "LLMs often struggle with outdated or missing knowledge, which is why retrieval is essential for grounding responses.",
 "Context windows in LLMs limit how much information can be processed at once.",
 "Fine-tuning allows LLMs to specialize in a domain, but RAG can be a cheaper and more flexible alternative.",
 "Embeddings are numeric representations of text used to measure semantic similarity in retrieval systems.",
 "Hybrid search combines dense vector embeddings with keyword-based search for better accuracy.",
 "RAG systems often follow the retrieve-then-read pipeline: fetch documents and then use an LLM to generate an answer.",
 "Companies use retrieval systems to provide proprietary knowledge to LLMs without retraining the base model.",
 "Storing context in a vector database enables systems to handle large-scale document search efficiently.",
 "In-house LLM deployments are often combined with retrieval systems for privacy and security.",
 "Retrieval systems prevent hallucinations by grounding LLM outputs in verified sources.",
 "Scaling retrieval pipelines requires sharding, indexing, and efficient similarity search algorithms.",
 "LLMs like GPT, LLaMA, and Mistral can all be used in RAG setups.",
 "The retrieval step typically ranks documents based on semantic closeness to the user’s query.",
 "Chunking documents into smaller pieces improves retrieval accuracy and relevance.",
 "Metadata filters in vector search allow narrowing results by tags like date, author, or topic.",
 "Retrieval can be combined with caching to speed up repeated queries in production systems.",
 "Some RAG systems use graph databases instead of vectors for knowledge-rich retrieval.",
 "Evaluation of RAG involves metrics like precision, recall, and answer faithfulness.",
 "LLMs alone are generative, but when paired with retrieval, they become more reliable knowledge assistants.",
 "Context storage solutions range from simple in-memory stores to distributed vector databases.",
 "Retrieval systems can be domain-specific, such as for healthcare, legal, or financial data.",
 "LLMs with RAG are increasingly used in enterprise chatbots and knowledge assistants.",
 "Prompt engineering is often combined with retrieval to guide LLMs in how to use the fetched data.",
 "Scaling foundational models often requires techniques like parameter-efficient fine-tuning or LoRA.",
 "Retrieval helps reduce the need for frequent fine-tuning when new knowledge becomes available.",
 "Latency in retrieval systems is critical, especially when powering real-time applications.",
 "RAG pipelines can integrate with search engines, APIs, or internal document repositories.",
 "Knowledge grounding ensures that LLMs provide answers supported by external evidence.",
 "Some retrieval systems use re-ranking models to improve the quality of the top search results.",
 "Building a RAG system typically involves three steps: embedding, indexing, and retrieval.",
 "Retrieval allows organizations to keep proprietary knowledge private while still leveraging LLMs.",
 "An orchestration layer manages how LLMs, retrieval, and other tools interact in complex pipelines.",
 "LLMs combined with RAG are often described as knowledge-enhanced generative AI.",
 "The retriever can use dense embeddings or sparse methods like BM25, depending on the application.",
 "Chunk size in embeddings has a major effect on recall and precision in retrieval.",
 "Open-source libraries like LangChain and LlamaIndex simplify building retrieval-augmented systems.",
 "Document pre-processing, such as cleaning and normalization, is critical for good retrieval performance.",
 "Embedding models like OpenAI’s text-embedding-ada-002 or Sentence-BERT are popular for retrieval.",
 "RAG pipelines can handle both structured and unstructured data sources.",
 "Retrieval-based systems can log citations, giving users confidence in the generated answers.",
 "Distributed retrieval systems use multiple servers to scale to billions of documents.",
 "Context windows in LLMs can be extended with retrieval, allowing access to more knowledge.",
 "Retrieval can be applied in multi-modal systems, not just text but also images or audio.",
 "Enterprises prefer retrieval over fine-tuning when their data changes frequently.",
 "RAG systems can be evaluated with user satisfaction metrics in real-world applications.",
 "Building a good retrieval system requires balancing speed, accuracy, and storage costs."
]


## Libraries

In [3]:
# Import libraries
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [4]:
# Download the NLTK tokenizer models and the stopword list
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Tokenization

In [8]:
# Sample texts
text1 = "Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions."
text2 = "Retrieval-Augmented Generation (RAG) improves LLMs.It allows them to fetch external knowledge when answering questions."
text3 = "Retrieval-Augmented Generation (RAG) improves LLMs. It allows them to fetch external knowledge when answering questions."

In [9]:
# Tokenization into sentences
print(nltk.sent_tokenize(text1))
print(nltk.sent_tokenize(text2))
print(nltk.sent_tokenize(text3))

['Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.']
['Retrieval-Augmented Generation (RAG) improves LLMs.It allows them to fetch external knowledge when answering questions.']
['Retrieval-Augmented Generation (RAG) improves LLMs.', 'It allows them to fetch external knowledge when answering questions.']


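The sentence tokenizer treats a `.` as a sentence boundary when it is followed by whitespace. In `text2` the period in `LLMs.It` has no space after it, so the whole text is returned as a single sentence; in `text3` the space after the period produces two sentences.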
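As a rough illustration of that rule, the sketch below splits on sentence-ending punctuation followed by whitespace using a plain regular expression. It is a simplification and not NLTK's Punkt algorithm, which also deals with abbreviations and other edge cases; the helper name `naive_sent_split` is made up for this example.

```python
import re

def naive_sent_split(text):
    # Split wherever '.', '!' or '?' is followed by one or more whitespace characters
    return re.split(r"(?<=[.!?])\s+", text.strip())

no_space = "RAG improves LLMs.It allows them to fetch external knowledge."
with_space = "RAG improves LLMs. It allows them to fetch external knowledge."

print(naive_sent_split(no_space))    # one sentence: no whitespace after the first period
print(naive_sent_split(with_space))  # two sentences
```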

In [10]:
# Tokenization into words (split on whitespace; punctuation marks become separate tokens)
print(nltk.word_tokenize(text3))
nltk.word_tokenize(text1)

['Retrieval-Augmented', 'Generation', '(', 'RAG', ')', 'improves', 'LLMs', '.', 'It', 'allows', 'them', 'to', 'fetch', 'external', 'knowledge', 'when', 'answering', 'questions', '.']


['Retrieval-Augmented',
 'Generation',
 '(',
 'RAG',
 ')',
 'improves',
 'LLMs',
 'by',
 'allowing',
 'them',
 'to',
 'fetch',
 'external',
 'knowledge',
 'when',
 'answering',
 'questions',
 '.']

## Preprocessing

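* Queries are usually typed in lowercase: a user would search for 'improves llms', not 'improves LLMs'.
* Without normalization, `llms` and `LLMs` are different tokens, so we lowercase all text.
* Queries rarely contain punctuation, so we drop punctuation tokens.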

In [12]:
# Preprocess function 1
def preprocess(text):
  # Convert to lowercase
  text_lower = text.lower()

  # Tokenize into words
  tokens = nltk.word_tokenize(text_lower)

  return [word for word in tokens if word.isalnum()]
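One side effect of the `isalnum()` filter is worth seeing in isolation: it removes punctuation, but it also removes any token that contains a non-alphanumeric character such as a hyphen. The sketch below hard-codes a few tokens (copied by hand from the `word_tokenize` output above) so it runs without NLTK; the variable names are only for illustration.

```python
# Tokens as produced by nltk.word_tokenize for the start of text1 (copied by hand)
tokens = ["Retrieval-Augmented", "Generation", "(", "RAG", ")", "improves", "LLMs", "."]

# Same filter as in preprocess: lowercase, keep purely alphanumeric tokens
kept = [tok.lower() for tok in tokens if tok.isalnum()]
dropped = [tok for tok in tokens if not tok.isalnum()]

print(kept)     # lowercased alphanumeric tokens
print(dropped)  # punctuation, plus the hyphenated compound
```

If hyphenated terms matter for a corpus, one option is to split on hyphens before filtering; the notebook does not do this.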

In [13]:
# Apply the pre-processing to the documents
proprocessed_docs = [' '.join(preprocess(doc)) for doc in documents]

## The Different Retrieval Systems

### Vector Space Model (TF-IDF)

In [15]:
# Creating an instance of the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

In [16]:
# Fit and transform the preprocessed docs
tfidf_matrix = vectorizer.fit_transform(proprocessed_docs)
print(f"The shape of the TF-IDF matrix is {tfidf_matrix.shape}")
print(f"The length of the documents is {len(documents)}")

The shape of the TF-IDF matrix is (49, 291)
The length of the documents is 49
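Each row of the matrix is a document and each column is a vocabulary term. To show where the weights come from, here is a pure-Python sketch on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` defaults; the corpus and helper names are invented for the example.

```python
import math
from collections import Counter

toy_docs = [
    "rag improves llms",
    "llms generate text",
    "retrieval improves rag",
]
tokenized = [doc.split() for doc in toy_docs]
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency (assumed to match TfidfVectorizer's defaults)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc_tokens):
    counts = Counter(doc_tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]  # L2-normalize so every row has length 1

matrix = [tfidf_vector(doc) for doc in tokenized]
print(len(matrix), len(vocab))  # rows = documents, columns = vocabulary terms
```

Terms that appear in fewer documents receive a larger IDF, so they contribute more to a document's vector than terms that appear everywhere.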


In [17]:
# Query the index
query = "improves llms"
query_vec = vectorizer.transform([query])
cosine_similarity(tfidf_matrix, query_vec).flatten()

array([0.06420977, 0.33415949, 0.        , 0.07005281, 0.0811942 ,
       0.08171789, 0.        , 0.        , 0.        , 0.08066559,
       0.        , 0.        , 0.        , 0.        , 0.08890166,
       0.        , 0.28775125, 0.        , 0.        , 0.        ,
       0.        , 0.07808653, 0.        , 0.        , 0.09660699,
       0.07918326, 0.        , 0.        , 0.        , 0.        ,
       0.08560015, 0.        , 0.        , 0.084361  , 0.07849505,
       0.10346295, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.09217574,
       0.        , 0.        , 0.        , 0.        ])
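The zeros in this array correspond to documents that share no term with the query. A minimal sketch of what cosine similarity computes, written in plain Python with made-up three-dimensional vectors (the function and variable names are illustrative only):

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)

toy_query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]  # same direction as the query
doc_b = [0.0, 0.0, 3.0]  # no shared terms with the query

print(cosine(toy_query, doc_a))
print(cosine(toy_query, doc_b))
```

Because the measure depends only on direction, `doc_a` scores as a near-perfect match even though its weights are twice as large as the query's.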

In [18]:
# Sort the documents by similarity to the query
similarities = cosine_similarity(tfidf_matrix, query_vec).flatten()
sorted_similarities = list(enumerate(similarities))
sorted(sorted_similarities, key=lambda x: x[1], reverse=True)

[(1, np.float64(0.33415949496673886)),
 (16, np.float64(0.2877512460781136)),
 (35, np.float64(0.10346294516015833)),
 (24, np.float64(0.09660699134048674)),
 (44, np.float64(0.09217573730116045)),
 (14, np.float64(0.08890165726082282)),
 (30, np.float64(0.0856001523360016)),
 (33, np.float64(0.08436100339135313)),
 (5, np.float64(0.08171788621281637)),
 (4, np.float64(0.08119419863943664)),
 (9, np.float64(0.08066559496228859)),
 (25, np.float64(0.07918326096886749)),
 (34, np.float64(0.07849504934184123)),
 (21, np.float64(0.07808653322188236)),
 (3, np.float64(0.07005281261282652)),
 (0, np.float64(0.06420977109907239)),
 (2, np.float64(0.0)),
 (6, np.float64(0.0)),
 (7, np.float64(0.0)),
 (8, np.float64(0.0)),
 (10, np.float64(0.0)),
 (11, np.float64(0.0)),
 (12, np.float64(0.0)),
 (13, np.float64(0.0)),
 (15, np.float64(0.0)),
 (17, np.float64(0.0)),
 (18, np.float64(0.0)),
 (19, np.float64(0.0)),
 (20, np.float64(0.0)),
 (22, np.float64(0.0)),
 (23, np.float64(0.0)),
 (26, np.flo

In [19]:
# Build a function to search with TF-IDF
def search_tfidf(query, vectorizer, tfidf_matrix):
  # Vectorize the query
  query_vec = vectorizer.transform([query])

  # Compute the Cosine Similarity
  similarities = cosine_similarity(tfidf_matrix, query_vec).flatten()

  # Pair each document index with its similarity score
  indexed_similarities = list(enumerate(similarities))

  # Sort the (index, score) pairs from most to least similar
  results = sorted(indexed_similarities, key=lambda x: x[1], reverse=True)

  return results

In [22]:
# Apply the function to the query
search_similarities = search_tfidf(query, vectorizer, tfidf_matrix)

# Print out the top 10 documents by similarity score
print(f"Top 10 documents by similarity score for query \"{query}\":")
for doc_index, score in search_similarities[:10]:
  print(f"Document {doc_index + 1}: {documents[doc_index]}")

Top 10 documents by similarity score for query "improves llms":
Document 2: Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.
Document 17: Chunking documents into smaller pieces improves retrieval accuracy and relevance.
Document 36: LLMs combined with RAG are often described as knowledge-enhanced generative AI.
Document 25: LLMs with RAG are increasingly used in enterprise chatbots and knowledge assistants.
Document 45: Context windows in LLMs can be extended with retrieval, allowing access to more knowledge.
Document 15: LLMs like GPT, LLaMA, and Mistral can all be used in RAG setups.
Document 31: Knowledge grounding ensures that LLMs provide answers supported by external evidence.
Document 34: Retrieval allows organizations to keep proprietary knowledge private while still leveraging LLMs.
Document 6: Fine-tuning allows LLMs to specialize in a domain, but RAG can be a cheaper and more flexible alternative.
Docum

### Boolean Retrieval Model

In [23]:
!pip install whoosh -q


In [24]:
# Import libraries
import os
import shutil
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

In [25]:
# Preprocess function 2
def preprocess2(text):
  # Convert to lowercase
  text_lower = text.lower()

  # Tokenize into words
  tokens = nltk.word_tokenize(text_lower)

  # Keep only alphanumeric tokens (drops punctuation)
  tokens = [word for word in tokens if word.isalnum()]

  # English stopwords, minus the words used as Boolean operators
  stopwords = set(nltk.corpus.stopwords.words('english')) - {"and", "or", "not"}

  # Remove the stopwords
  tokens = [word for word in tokens if word not in stopwords]

  return tokens

In [26]:
# Apply and test the function on text1 above
print(text1)
preprocess2(text1)

Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.


['generation',
 'rag',
 'improves',
 'llms',
 'allowing',
 'fetch',
 'external',
 'knowledge',
 'answering',
 'questions']

In [27]:
%cd /content/drive/MyDrive/Basics of Retrieval System

/content/drive/MyDrive/Basics of Retrieval System


In [28]:
# Create a new folder but remove old one first if available
if os.path.exists("index_dir"):
  shutil.rmtree("index_dir")
os.mkdir("index_dir")

In [29]:
# Define a schema for the index
schema = Schema(title=ID(stored=True, unique=True),
                content=TEXT(stored=True))

In [30]:
# Create the index in the folder
index = create_in("index_dir", schema)

In [31]:
# Open a writer to add documents to the index
writer = index.writer()
for i, doc in enumerate(documents):
  writer.add_document(title=str(i),
                      content=doc)
writer.commit()

In [33]:
# Boolean search function
def boolean_search(query, index):
  # Create a QueryParser that targets the content field
  parser = QueryParser("content", schema=index.schema)

  # Parse the user's query
  parsed_query = parser.parse(query)

  # Open a searcher on the index and run the query
  with index.searcher() as searcher:
    results = searcher.search(parsed_query)
    return [(hit["title"], hit["content"]) for hit in results]

In [34]:
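# Apply the function
# Caution: Whoosh's default query parser recognizes Boolean operators only in upper case (AND, OR, NOT).
# The lower-case "not" below is not treated as a Boolean NOT, which is why every hit still contains "RAG".
# To exclude documents that mention RAG, write the query as "LLMs NOT RAG".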
query = "LLMs not RAG"
boolean_search(query, index)

[('24',
  'LLMs with RAG are increasingly used in enterprise chatbots and knowledge assistants.'),
 ('14', 'LLMs like GPT, LLaMA, and Mistral can all be used in RAG setups.'),
 ('35',
  'LLMs combined with RAG are often described as knowledge-enhanced generative AI.'),
 ('5',
  'Fine-tuning allows LLMs to specialize in a domain, but RAG can be a cheaper and more flexible alternative.'),
 ('1',
  'Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.')]
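Conceptually, a Boolean retrieval model maps each term to the set of documents that contain it and answers queries with set operations. The sketch below is a toy illustration of that idea in plain Python. It is not how Whoosh stores or searches its index, and the corpus and variable names are invented.

```python
from collections import defaultdict

toy_docs = {
    0: "llms generate text",
    1: "rag improves llms",
    2: "vector databases power retrieval",
}

# Inverted index: term -> set of ids of the documents containing it
inverted = defaultdict(set)
for doc_id, text in toy_docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

all_ids = set(toy_docs)

both = inverted["llms"] & inverted["rag"]                 # llms AND rag
either = inverted["llms"] | inverted["retrieval"]         # llms OR retrieval
without = inverted["llms"] & (all_ids - inverted["rag"])  # llms AND NOT rag

print(both, either, without)
```

A document either matches or it does not: the pure Boolean model has no notion of one match being better than another.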

### Probabilistic Retrieval Model

In [35]:
!pip install rank_bm25 -q

In [36]:
# import the BM25 class
from rank_bm25 import BM25Okapi

In [37]:
# Tokenize the documents with the first preprocess function (no stopword removal).
# BM25's IDF term already down-weights very common words, so stopwords can stay.
tokenized_docs = [preprocess(doc) for doc in documents]
tokenized_docs

[['large',
  'language',
  'models',
  'llms',
  'are',
  'ai',
  'systems',
  'trained',
  'on',
  'massive',
  'text',
  'corpora',
  'to',
  'generate',
  'and',
  'understand',
  'natural',
  'language'],
 ['generation',
  'rag',
  'improves',
  'llms',
  'by',
  'allowing',
  'them',
  'to',
  'fetch',
  'external',
  'knowledge',
  'when',
  'answering',
  'questions'],
 ['vector',
  'databases',
  'like',
  'pinecone',
  'weaviate',
  'or',
  'faiss',
  'are',
  'commonly',
  'used',
  'to',
  'power',
  'retrieval',
  'systems'],
 ['llms',
  'often',
  'struggle',
  'with',
  'outdated',
  'or',
  'missing',
  'knowledge',
  'which',
  'is',
  'why',
  'retrieval',
  'is',
  'essential',
  'for',
  'grounding',
  'responses'],
 ['context',
  'windows',
  'in',
  'llms',
  'limit',
  'how',
  'much',
  'information',
  'can',
  'be',
  'processed',
  'at',
  'once'],
 ['allows',
  'llms',
  'to',
  'specialize',
  'in',
  'a',
  'domain',
  'but',
  'rag',
  'can',
  'be',
  'a'

In [39]:
# Initialize the BM25 model
bm25 = BM25Okapi(tokenized_docs)

In [40]:
# probabilistic search function
def search_bm25(query, bm25):
  tokenized_query = preprocess(query)
  results = bm25.get_scores(tokenized_query)
  return results
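To give a feel for what `get_scores` returns, here is a pure-Python sketch of a textbook BM25 scoring function. It uses a non-negative IDF variant and the commonly used parameter values `k1=1.5` and `b=0.75`; `rank_bm25` may compute IDF differently, so the numbers are not expected to match the library's output. The function name and toy corpus are assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, not rank_bm25's exact formula)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized, controlled by b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

toy_tokenized = [
    ["rag", "improves", "llms"],
    ["llms", "generate", "text"],
    ["vector", "databases", "power", "retrieval"],
]
print(bm25_scores(["improves", "llms"], toy_tokenized))
```

The first document matches both query terms and scores highest, the second matches one, and the third matches none and scores zero.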

In [42]:
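# Start the probabilistic search
# Note: no stemming is applied, so "improve" and "llm" only match those exact tokens
# (not "improves" or "llms")
query = "improve llm"

# Perform the BM25 search
results = search_bm25(query, bm25)

# Sort the document indices by relevance to the query, highest score first
ranked_indices = np.argsort(results)[::-1]

# Print the documents
for i in ranked_indices:
  print(f"Document {i + 1}: {documents[i]}")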

Document 32: Some retrieval systems use re-ranking models to improve the quality of the top search results.
Document 13: Retrieval systems prevent hallucinations by grounding LLM outputs in verified sources.
Document 12: In-house LLM deployments are often combined with retrieval systems for privacy and security.
Document 9: RAG systems often follow the retrieve-then-read pipeline: fetch documents and then use an LLM to generate an answer.
Document 45: Context windows in LLMs can be extended with retrieval, allowing access to more knowledge.
Document 48: RAG systems can be evaluated with user satisfaction metrics in real-world applications.
Document 47: Enterprises prefer retrieval over fine-tuning when their data changes frequently.
Document 46: Retrieval can be applied in multi-modal systems, not just text but also images or audio.
Document 41: Embedding models like OpenAI’s text-embedding-ada-002 or Sentence-BERT are popular for retrieval.
Document 40: Document pre-processing, such a
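The loop above prints every document, including those that share no token with the query and therefore score zero. A small helper along the following lines could keep only the positive-scoring matches. It is a sketch with a hypothetical name (`top_k_nonzero`) and made-up scores, not part of `rank_bm25`.

```python
def top_k_nonzero(scores, k=5):
    # Pair each document index with its score, drop non-positive scores,
    # then sort by score in descending order and keep the first k
    ranked = sorted(
        ((i, s) for i, s in enumerate(scores) if s > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

toy_scores = [0.0, 2.3, 0.0, 0.7, 1.1]
print(top_k_nonzero(toy_scores, k=2))
```

The same helper accepts the array returned by `search_bm25`, since it only iterates over the scores.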