# Retrieval-Augmented Generation (RAG)

## Data

In [1]:
# Sample documents about LLMs, RAG, and retrieval systems
documents = [
 "Large Language Models (LLMs) are AI systems trained on massive text corpora to generate and understand natural language.",
 "Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.",
 "Vector databases like Pinecone, Weaviate, or FAISS are commonly used to power retrieval systems.",
 "LLMs often struggle with outdated or missing knowledge, which is why retrieval is essential for grounding responses.",
 "Context windows in LLMs limit how much information can be processed at once.",
 "Fine-tuning allows LLMs to specialize in a domain, but RAG can be a cheaper and more flexible alternative.",
 "Embeddings are numeric representations of text used to measure semantic similarity in retrieval systems.",
 "Hybrid search combines dense vector embeddings with keyword-based search for better accuracy.",
 "RAG systems often follow the retrieve-then-read pipeline: fetch documents and then use an LLM to generate an answer.",
 "Companies use retrieval systems to provide proprietary knowledge to LLMs without retraining the base model.",
 "Storing context in a vector database enables systems to handle large-scale document search efficiently.",
 "In-house LLM deployments are often combined with retrieval systems for privacy and security.",
 "Retrieval systems prevent hallucinations by grounding LLM outputs in verified sources.",
 "Scaling retrieval pipelines requires sharding, indexing, and efficient similarity search algorithms.",
 "LLMs like GPT, LLaMA, and Mistral can all be used in RAG setups.",
 "The retrieval step typically ranks documents based on semantic closeness to the user’s query.",
 "Chunking documents into smaller pieces improves retrieval accuracy and relevance.",
 "Metadata filters in vector search allow narrowing results by tags like date, author, or topic.",
 "Retrieval can be combined with caching to speed up repeated queries in production systems.",
 "Some RAG systems use graph databases instead of vectors for knowledge-rich retrieval.",
 "Evaluation of RAG involves metrics like precision, recall, and answer faithfulness.",
 "LLMs alone are generative, but when paired with retrieval, they become more reliable knowledge assistants.",
 "Context storage solutions range from simple in-memory stores to distributed vector databases.",
 "Retrieval systems can be domain-specific, such as for healthcare, legal, or financial data.",
 "LLMs with RAG are increasingly used in enterprise chatbots and knowledge assistants.",
 "Prompt engineering is often combined with retrieval to guide LLMs in how to use the fetched data.",
 "Scaling foundational models often requires techniques like parameter-efficient fine-tuning or LoRA.",
 "Retrieval helps reduce the need for frequent fine-tuning when new knowledge becomes available.",
 "Latency in retrieval systems is critical, especially when powering real-time applications.",
 "RAG pipelines can integrate with search engines, APIs, or internal document repositories.",
 "Knowledge grounding ensures that LLMs provide answers supported by external evidence.",
 "Some retrieval systems use re-ranking models to improve the quality of the top search results.",
 "Building a RAG system typically involves three steps: embedding, indexing, and retrieval.",
 "Retrieval allows organizations to keep proprietary knowledge private while still leveraging LLMs.",
 "An orchestration layer manages how LLMs, retrieval, and other tools interact in complex pipelines.",
 "LLMs combined with RAG are often described as knowledge-enhanced generative AI.",
 "The retriever can use dense embeddings or sparse methods like BM25, depending on the application.",
 "Chunk size in embeddings has a major effect on recall and precision in retrieval.",
 "Open-source libraries like LangChain and LlamaIndex simplify building retrieval-augmented systems.",
 "Document pre-processing, such as cleaning and normalization, is critical for good retrieval performance.",
 "Embedding models like OpenAI’s text-embedding-ada-002 or Sentence-BERT are popular for retrieval.",
 "RAG pipelines can handle both structured and unstructured data sources.",
 "Retrieval-based systems can log citations, giving users confidence in the generated answers.",
 "Distributed retrieval systems use multiple servers to scale to billions of documents.",
 "Context windows in LLMs can be extended with retrieval, allowing access to more knowledge.",
 "Retrieval can be applied in multi-modal systems, not just text but also images or audio.",
 "Enterprises prefer retrieval over fine-tuning when their data changes frequently.",
 "RAG systems can be evaluated with user satisfaction metrics in real-world applications.",
 "Building a good retrieval system requires balancing speed, accuracy, and storage costs."
]

## Setup

In [2]:
from google.colab import userdata
hf_key = userdata.get('HF_TOKEN')

In [3]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = hf_key

In [4]:
!pip install langchain-huggingface sentence-transformers langchain_community faiss-cpu -q

In [5]:
# Import the libraries
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
from langchain_community.vectorstores import FAISS
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.documents import Document
from IPython.display import Markdown, display

## Embeddings

In [6]:
# Check how many documents we have
print(f"We have {len(documents)} documents")

We have 49 documents


In [7]:
# Wrap each string in a Document
docs = [Document(page_content=text) for text in documents]

In [8]:
# Use a transformer-based embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [9]:
# Create a FAISS vector store for the docs
faiss_store = FAISS.from_documents(docs, embedding_model)
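
For intuition, the search this store performs can be sketched as a brute-force nearest-neighbour scan. This is only an illustration, not the FAISS or LangChain implementation: it assumes a flat index that ranks by squared L2 distance (which appears to be the default for `FAISS.from_documents`), and the `squared_l2` and `l2_search` helpers and the toy 3-dimensional vectors are made up for this example.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2_search(query_vec, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = ((i, squared_l2(query_vec, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
nearest = l2_search([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

A real index does the same ranking over the 384-dimensional vectors produced by the embedding model, with optimized code instead of a Python loop.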

In [10]:
index = faiss_store.index

# Print the number of vectors stored in the index
print(f"Number of vectors in the index: {index.ntotal}")

# Print the dimensionality of each embedding
print(f"Embedding dimensions: {index.d}")

# Print the embedding of the first document
print(f"First vector: {index.reconstruct(0)}")

Number of vectors in the index: 49
Embedding dimensions: 384
First vector: [ 5.59799224e-02 -1.23615697e-01  5.04221395e-02 -2.13914042e-04
  7.37531185e-02  8.95548984e-03 -6.00405224e-02  1.49373375e-02
  5.68501055e-02  3.37935537e-02 -2.43983790e-02  7.96056166e-03
  6.15357235e-02 -8.11576284e-03  1.36438720e-02  5.61926477e-02
  7.67754018e-02  3.06968205e-03 -2.04155762e-02 -9.15290713e-02
  7.54889697e-02  9.49257165e-02 -2.44868174e-02  7.48942839e-03
  2.30758637e-02  3.10892090e-02 -7.53808301e-03 -1.41408620e-02
  4.08316962e-02 -4.55593131e-02  1.01234503e-02  5.85593507e-02
  7.60235637e-02  8.76820683e-02 -2.13428158e-02  6.30844012e-02
 -6.90986440e-02  4.78140311e-03  2.68640146e-02  6.28683064e-03
 -2.52215122e-03 -6.12178817e-03  7.45765865e-02 -6.18263520e-02
  1.61846399e-01  4.47338596e-02 -8.18227157e-02  1.99009180e-02
 -4.68176790e-02  2.07310989e-02 -1.00719728e-01 -5.11116572e-02
 -1.06674517e-02  3.76786180e-02 -3.49273942e-02 -7.82647058e-02
  3.20381373e-02

## Retrieval System

In [11]:
# Retrieve relevant documents
query = "What are LLMs?"
k = 10
retrieved_docs = faiss_store.similarity_search(query, k)

In [12]:
# Build a function for retrieving documents
def get_relevant_documents(query, k=5):
  return faiss_store.similarity_search(query, k)
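
`similarity_search` returns `k` documents even when some of them are only loosely related to the query. One possible refinement is to ask for scores and drop weak matches. The sketch below assumes the `(document, distance)` pairs that `similarity_search_with_score` returns on a LangChain FAISS store, where a lower distance means a closer match. The `filter_by_distance` helper and its `max_distance` threshold are hypothetical, and the distances shown are invented; a sensible threshold would need tuning on real queries.

```python
def filter_by_distance(scored_docs, max_distance):
    """Keep (doc, distance) pairs whose distance is at most max_distance."""
    return [(doc, dist) for doc, dist in scored_docs if dist <= max_distance]

# Stand-in for faiss_store.similarity_search_with_score(query, k):
# plain strings replace Document objects, and the distances are invented.
scored = [
    ("LLMs are AI systems trained on massive text corpora.", 0.42),
    ("Latency in retrieval systems is critical.", 1.35),
]
kept = filter_by_distance(scored, max_distance=1.0)
print(kept)
```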

## Generative System

In [17]:
# Load the LLM
llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation"
)
chat_model = ChatHuggingFace(llm=llm)

In [14]:
# Define the system and human message
def generative_system(query, context):
  messages = [
      SystemMessage(content = f"""
      You are an English tour guide with a Nigerian Accent.
      Your task is to reply in English.
      Only answer with information from {context}.
      If you don't know the answer, just say you don't know
      """),
      HumanMessage(content=f"Answer the {query} based on the {context}")
  ]
  ai_output = chat_model.invoke(messages)
  return display(Markdown(ai_output.content))
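
Because `context` is a list of `Document` objects, interpolating it into the prompt inserts each object's full representation (IDs and metadata included) rather than just its text. One possible alternative, sketched here under the assumption that each retrieved item exposes a `page_content` attribute, is to join the texts into a numbered block before building the messages. The `format_context` helper is hypothetical, and `SimpleNamespace` only stands in for `Document` so that the example runs on its own.

```python
from types import SimpleNamespace

def format_context(docs):
    """Join the page_content of each retrieved document into a numbered text block."""
    return "\n".join(
        f"[{i}] {doc.page_content}" for i, doc in enumerate(docs, start=1)
    )

# SimpleNamespace stands in for LangChain Document objects in this sketch
sample_docs = [
    SimpleNamespace(page_content="RAG fetches external knowledge."),
    SimpleNamespace(page_content="Embeddings measure semantic similarity."),
]
context_text = format_context(sample_docs)
print(context_text)
```

Passing `format_context(context)` into the prompt instead of the raw list would likely keep document IDs out of the generated answers and use fewer tokens.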

## Combining the Retrieval and Generative Systems

In [15]:
# Build the RAG system
def rag(query):
  context = get_relevant_documents(query)
  return generative_system(query, context)
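
Since `generative_system` displays its result, `rag` has no usable return value. If the answer needs to be stored or compared later, a variant that returns the text along with its sources may be handy. The following is a minimal sketch, not the notebook's method: `rag_with_sources` is a hypothetical helper that takes the retriever and generator as arguments, and the two stub functions only exist so that the example runs without a model.

```python
from types import SimpleNamespace

def rag_with_sources(query, retrieve, generate):
    """Run retrieval, then generation, and return the answer with its source texts."""
    docs = retrieve(query)
    sources = [doc.page_content for doc in docs]
    answer = generate(query, "\n".join(sources))
    return {"answer": answer, "sources": sources}

# Stub retriever and generator (hypothetical) so the sketch runs offline
def fake_retrieve(query):
    return [SimpleNamespace(page_content="RAG grounds LLM answers in retrieved text.")]

def fake_generate(query, context):
    return f"Based on the context: {context}"

result = rag_with_sources("What is RAG?", fake_retrieve, fake_generate)
print(result["answer"])
```

In this notebook, `get_relevant_documents` could be passed as `retrieve`, and `generate` would be a function that wraps `chat_model.invoke` and returns the message content.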

In [19]:
# Test the RAG
query = "What are LLMs?"
rag(query)

 LLMs, or Large Language Models, are artificial intelligence systems that are trained on vast text corpora to generate and understand natural language. They are the foundation of several advanced language technologies, including chatbots, virtual assistants, and knowledge assistants.

Fine-tuning allows LLMs to specialize in a particular domain, but an alternative, more flexible and potentially less expensive option is to use LLMs in combination with Retrieval-as-a-Service (RAG). RAG retrieves factual information directly from large datasets or knowledge bases in response to users' queries, making LLMs more reliable knowledge assistants.

Various LLMs, such as GPT, LLaMA, and Mistral, can be employed in RAG setups, expanding their capabilities and applications. In summary, LLMs, with their generative capabilities, power fully-fledged conversational agents when paired with RAG, enabling them to provide accurate and contextual responses to users.

In [20]:
# Prepare test queries
query_list = [
    "What is RAG?",
    "How do RAG and retrieval systems work?"
]

In [21]:
# Test the RAG system
for query in query_list:
  rag(query)

 RAG, which stands for Relevant, Acceptable, and Given system, is a method used for filtering and ranking responses from large language models (LLMs) like GPT, LLaMA, and Mistral. Building a RAG system typically involves three steps: embedding, indexing, and retrieval [Document(id='44a37328-d9b9-4b1b-b3da-e31e18a016d4')]. Embedding refers to the process of converting data into numerical vectors. Indexing involves creating an inverted index or other data structures that allow quick lookups based on specific query terms. Retrieval is the process of returning the most relevant responses based on these indexes [Document(id='44a37328-d9b9-4b1b-b3da-e31e18a016d4')].

The performance of RAG systems can be evaluated with various metrics such as precision, recall, and answer faithfulness [Document(id='72a2f738-a473-4eb2-b08b-53f75be729f1')]. Precision measures the percentage of relevant responses among the total responses returned. Recall measures the percentage of relevant responses that were retrieved out of all possible relevant responses. Answer faithfulness refers to how closely the retrieved response matches the original query [Document(id='72a2f738-a473-4eb2-b08b-53f75be729f1')].

RAG systems can also be evaluated based on user satisfaction in real-world applications [Document(id='afc7aedf-b4bf-4783-a288-27b69d38231f')].

Compared to fine-tuning LLMs to specialize in a particular domain, RAG can be a cheaper and more flexible alternative [Document(id='eba1df3b-f202-4be5-90e7-c9aa6e5d5b77')]. Fine-tuning involves training the LLM on a specific domain-related dataset, which can be resource-intensive and time-consuming. In contrast, RAG can be particularly useful for handling out-of-distribution queries and adapting to changing domain requirements.

 RAG systems, or Relevance and Answer Generation systems, are designed to retrieve relevant information from large datasets and generate accurate answers to queries. The process typically involves three main steps: embedding, indexing, and retrieval.

During the embedding step, data is transformed into numerical vectors using techniques like word embeddings or document embeddings. These vectors represent the meaning and context of the data and make it possible for the system to compare and find similarities between different data points.

The indexing step involves creating an index or database for efficient data retrieval. Instead of scrolling through the entire dataset for every query, the system uses the index to quickly find the most relevant data. Some RAG systems use graph databases instead of regular vectors for knowledge-rich retrieval.

During the retrieval step, the system uses various information retrieval methods, such as classic vector space models or modern neural network-based models, to find the most reliable and relevant answers to a given query. To ensure the accuracy and reliability of the answers, the evaluation of RAG systems may involve metrics like precision, recall, and answer faithfulness.

Moreover, RAG systems can be evaluated based on user satisfaction in real-world applications. The overall user experience, understanding of queries, and the precision and relevance of the generated answers are essential indicators of the system's performance. Eventually, the goal of a RAG system is to provide accurate, reliable, and user-friendly information retrieval and answer generation while maintaining efficiency in large-scale datasets.