# Retrieval-Augmented Generation (RAG)

## Data

In [None]:
# Sample documents about LLMs, RAG, and retrieval systems
documents = [
 "Large Language Models (LLMs) are AI systems trained on massive text corpora to generate and understand natural language.",
 "Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.",
 "Vector databases like Pinecone, Weaviate, or FAISS are commonly used to power retrieval systems.",
 "LLMs often struggle with outdated or missing knowledge, which is why retrieval is essential for grounding responses.",
 "Context windows in LLMs limit how much information can be processed at once.",
 "Fine-tuning allows LLMs to specialize in a domain, but RAG can be a cheaper and more flexible alternative.",
 "Embeddings are numeric representations of text used to measure semantic similarity in retrieval systems.",
 "Hybrid search combines dense vector embeddings with keyword-based search for better accuracy.",
 "RAG systems often follow the retrieve-then-read pipeline: fetch documents and then use an LLM to generate an answer.",
 "Companies use retrieval systems to provide proprietary knowledge to LLMs without retraining the base model.",
 "Storing context in a vector database enables systems to handle large-scale document search efficiently.",
 "In-house LLM deployments are often combined with retrieval systems for privacy and security.",
 "Retrieval systems prevent hallucinations by grounding LLM outputs in verified sources.",
 "Scaling retrieval pipelines requires sharding, indexing, and efficient similarity search algorithms.",
 "LLMs like GPT, LLaMA, and Mistral can all be used in RAG setups.",
 "The retrieval step typically ranks documents based on semantic closeness to the user’s query.",
 "Chunking documents into smaller pieces improves retrieval accuracy and relevance.",
 "Metadata filters in vector search allow narrowing results by tags like date, author, or topic.",
 "Retrieval can be combined with caching to speed up repeated queries in production systems.",
 "Some RAG systems use graph databases instead of vectors for knowledge-rich retrieval.",
 "Evaluation of RAG involves metrics like precision, recall, and answer faithfulness.",
 "LLMs alone are generative, but when paired with retrieval, they become more reliable knowledge assistants.",
 "Context storage solutions range from simple in-memory stores to distributed vector databases.",
 "Retrieval systems can be domain-specific, such as for healthcare, legal, or financial data.",
 "LLMs with RAG are increasingly used in enterprise chatbots and knowledge assistants.",
 "Prompt engineering is often combined with retrieval to guide LLMs in how to use the fetched data.",
 "Scaling foundational models often requires techniques like parameter-efficient fine-tuning or LoRA.",
 "Retrieval helps reduce the need for frequent fine-tuning when new knowledge becomes available.",
 "Latency in retrieval systems is critical, especially when powering real-time applications.",
 "RAG pipelines can integrate with search engines, APIs, or internal document repositories.",
 "Knowledge grounding ensures that LLMs provide answers supported by external evidence.",
 "Some retrieval systems use re-ranking models to improve the quality of the top search results.",
 "Building a RAG system typically involves three steps: embedding, indexing, and retrieval.",
 "Retrieval allows organizations to keep proprietary knowledge private while still leveraging LLMs.",
 "An orchestration layer manages how LLMs, retrieval, and other tools interact in complex pipelines.",
 "LLMs combined with RAG are often described as knowledge-enhanced generative AI.",
 "The retriever can use dense embeddings or sparse methods like BM25, depending on the application.",
 "Chunk size in embeddings has a major effect on recall and precision in retrieval.",
 "Open-source libraries like LangChain and LlamaIndex simplify building retrieval-augmented systems.",
 "Document pre-processing, such as cleaning and normalization, is critical for good retrieval performance.",
 "Embedding models like OpenAI’s text-embedding-ada-002 or Sentence-BERT are popular for retrieval.",
 "RAG pipelines can handle both structured and unstructured data sources.",
 "Retrieval-based systems can log citations, giving users confidence in the generated answers.",
 "Distributed retrieval systems use multiple servers to scale to billions of documents.",
 "Context windows in LLMs can be extended with retrieval, allowing access to more knowledge.",
 "Retrieval can be applied in multi-modal systems, not just text but also images or audio.",
 "Enterprises prefer retrieval over fine-tuning when their data changes frequently.",
 "RAG systems can be evaluated with user satisfaction metrics in real-world applications.",
 "Building a good retrieval system requires balancing speed, accuracy, and storage costs."
]

## Setup

In [None]:
from google.colab import userdata
hf_key = userdata.get('HF_TOKEN')

In [None]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = hf_key

In [None]:
!pip install langchain-huggingface sentence-transformers langchain_community faiss-cpu -q

In [None]:
# Import the libraries
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
from langchain_community.vectorstores import FAISS
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.documents import Document
from IPython.display import Markdown, display

## Embeddings

In [None]:
# Check how many documents we have
print(f"We have {len(documents)} documents")

We have 49 documents


In [None]:
# Wrap each string in a Document
docs = [Document(page_content=text) for text in documents]

In [None]:
# Use a transformer-based embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
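
To make the idea of an embedding concrete, here is a minimal sketch of how two embedding vectors can be compared with cosine similarity. The three-dimensional vectors are made-up toy values, not output from `all-MiniLM-L6-v2`, and `cosine_similarity` is a hypothetical helper written only for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (assumed values).
llm_vec = [0.9, 0.1, 0.0]
rag_vec = [0.8, 0.3, 0.1]
cooking_vec = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1,
# unrelated ones score close to 0.
print(cosine_similarity(llm_vec, rag_vec))
print(cosine_similarity(llm_vec, cooking_vec))
```

Texts with related meanings should end up with vectors that point in similar directions, which is what the retrieval step relies on.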

In [None]:
# Create a FAISS vector store for the docs
faiss_store = FAISS.from_documents(docs, embedding_model)
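
Conceptually, a flat vector index compares the query vector with every stored vector and returns the closest ones. The following is a minimal pure-Python sketch of that brute-force search using squared Euclidean distance. It is an illustration under that assumption, not how FAISS is actually implemented, and `ToyFlatIndex` is a hypothetical class whose `d` and `ntotal` attributes only mirror the names used on the real index.

```python
class ToyFlatIndex:
    """Brute-force nearest-neighbour index (illustrative only)."""

    def __init__(self, dim):
        self.d = dim        # dimensionality of every stored vector
        self.vectors = []   # stored vectors, in insertion order

    @property
    def ntotal(self):
        """Number of vectors currently stored."""
        return len(self.vectors)

    def add(self, vector):
        if len(vector) != self.d:
            raise ValueError(f"expected {self.d} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))

    def search(self, query, k):
        """Return up to k (squared_distance, position) pairs, nearest first."""
        scored = [
            (sum((q - v) ** 2 for q, v in zip(query, vec)), pos)
            for pos, vec in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]

toy_index = ToyFlatIndex(dim=2)
for vec in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0]):
    toy_index.add(vec)

results = toy_index.search([0.9, 0.8], k=2)
print(toy_index.ntotal, results)
```

Scanning every vector is fine for 49 documents. Larger collections usually rely on approximate indexes that trade a little accuracy for speed.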

In [None]:
index = faiss_store.index

# Print the number of vectors stored in the index
print(f"Total number of vectors: {index.ntotal}")

# Print the dimensionality of each vector
print(f"Total number of dimensions: {index.d}")

# Print the embedding stored for the first document
print(f"First vector: {index.reconstruct(0)}")

Total number of vectors: 49
Total number of dimensions: 384
First vector: [ 5.59799224e-02 -1.23615697e-01  5.04221395e-02 -2.13914042e-04
  7.37531185e-02  8.95548984e-03 -6.00405224e-02  1.49373375e-02
  5.68501055e-02  3.37935537e-02 -2.43983790e-02  7.96056166e-03
  6.15357235e-02 -8.11576284e-03  1.36438720e-02  5.61926477e-02
  7.67754018e-02  3.06968205e-03 -2.04155762e-02 -9.15290713e-02
  7.54889697e-02  9.49257165e-02 -2.44868174e-02  7.48942839e-03
  2.30758637e-02  3.10892090e-02 -7.53808301e-03 -1.41408620e-02
  4.08316962e-02 -4.55593131e-02  1.01234503e-02  5.85593507e-02
  7.60235637e-02  8.76820683e-02 -2.13428158e-02  6.30844012e-02
 -6.90986440e-02  4.78140311e-03  2.68640146e-02  6.28683064e-03
 -2.52215122e-03 -6.12178817e-03  7.45765865e-02 -6.18263520e-02
  1.61846399e-01  4.47338596e-02 -8.18227157e-02  1.99009180e-02
 -4.68176790e-02  2.07310989e-02 -1.00719728e-01 -5.11116572e-02
 -1.06674517e-02  3.76786180e-02 -3.49273942e-02 -7.82647058e-02
  3.20381373e-02

## Retrieval System

In [None]:
# Retrieve the k documents most similar to the query
query = "What are LLMs?"
k = 10
retrieved_docs = faiss_store.similarity_search(query, k=k)

In [None]:
# Build a function for retrieving documents
def get_relevant_documents(query, k=5):
  """Return the k documents in the FAISS store most similar to the query."""
  return faiss_store.similarity_search(query, k=k)
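
A fixed `k` always returns `k` documents, even when some of them are only weakly related to the query. One possible refinement is to drop hits whose distance exceeds a cutoff. LangChain's FAISS wrapper exposes `similarity_search_with_score`, which returns a score alongside each document. The sketch below works on plain `(text, distance)` pairs with invented scores, assuming a lower distance means a closer match. `filter_by_distance` and the `max_distance` value are hypothetical, and a sensible cutoff would have to be tuned for a given embedding model.

```python
def filter_by_distance(scored_hits, max_distance):
    """Keep hits whose distance is at most max_distance, nearest first."""
    kept = [(text, dist) for text, dist in scored_hits if dist <= max_distance]
    return sorted(kept, key=lambda hit: hit[1])

# Invented (text, distance) pairs for illustration only.
hits = [
    ("RAG improves LLMs by fetching external knowledge.", 0.42),
    ("Latency in retrieval systems is critical.", 1.31),
    ("LLMs are trained on massive text corpora.", 0.77),
]

for text, dist in filter_by_distance(hits, max_distance=1.0):
    print(f"{dist:.2f}  {text}")
```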

## Generative System

In [None]:
# Load the LLM
llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-4",
    task="text-generation"
)
chat_model = ChatHuggingFace(llm=llm)

In [None]:
# Build the system and human messages, then display the model's answer
def generative_system(query, context):
  # Join the retrieved Document objects into plain text for the prompt
  context_text = "\n".join(doc.page_content for doc in context)
  messages = [
      SystemMessage(content=f"""
      You are an English tour guide with a Nigerian accent.
      Your task is to reply in English.
      Only answer with information from the context below.
      If you don't know the answer, just say you don't know.

      Context:
      {context_text}
      """),
      HumanMessage(content=f"Answer this question using only the context: {query}")
  ]
  ai_output = chat_model.invoke(messages)
  return display(Markdown(ai_output.content))
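
The prompt above places the retrieved context directly into the message text. One way to make that context easier for the model to read is to number each passage, so individual sources can be told apart. The sketch below is a hypothetical helper, `format_context`, and it takes plain strings rather than `Document` objects to stay self-contained.

```python
def format_context(passages):
    """Join passages into a numbered block, one passage per line."""
    return "\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
    )

passages = [
    "RAG improves LLMs by allowing them to fetch external knowledge.",
    "  Embeddings are numeric representations of text.  ",
]
print(format_context(passages))
```

With `Document` objects, the same helper could be fed `[doc.page_content for doc in context]`.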

## Combining the Retrieval and Generative Systems

In [None]:
# Build the RAG system
def rag(query):
  context = get_relevant_documents(query)
  return generative_system(query, context)
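
Because the pipeline is simply "retrieve, then generate", the two halves can be swapped or tested independently. The sketch below shows that composition with stand-in components: `make_rag`, `keyword_retriever`, and `echo_generator` are hypothetical names, and the keyword matcher and echo function are deliberately trivial substitutes for the FAISS store and the chat model.

```python
def make_rag(retrieve, generate):
    """Compose a retriever and a generator into one question-answering function."""
    def answer(query):
        context = retrieve(query)
        return generate(query, context)
    return answer

# Stand-ins for the real components (assumptions for illustration only).
def keyword_retriever(query):
    corpus = [
        "RAG fetches external knowledge for LLMs.",
        "Embeddings measure semantic similarity.",
    ]
    words = set(query.lower().split())
    # Keep every document that shares at least one word with the query.
    return [doc for doc in corpus if words & set(doc.lower().rstrip(".").split())]

def echo_generator(query, context):
    return f"Q: {query} | sources used: {len(context)}"

toy_rag = make_rag(keyword_retriever, echo_generator)
print(toy_rag("what is rag"))
```

Passing the components in as arguments makes it possible to check the wiring without loading an embedding model or calling a hosted LLM.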

In [None]:
# Test the RAG
query = "What are LLMs?"
rag(query)

Large Language Models (LLMs) are AI systems that are trained on extensive text corpora to generate and understand natural language. They are designed to perform a variety of language-based tasks by leveraging their learned capabilities from the massive data they have been trained on. Some popular examples of LLMs include GPT, LLaMA, and Mistral. While LLMs are primarily generative on their own, their functionality can be enhanced by pairing them with retrieval systems, such as Retrieval-Augmented Generation (RAG), to become more reliable as knowledge assistants. This combination is particularly useful in applications like enterprise chatbots and knowledge assistants.

In [None]:
# Prepare test queries
query_list = [
    "What is RAG?",
    "How do RAG and retrieval systems work?"
]

In [None]:
# Test the RAG system
for query in query_list:
  rag(query)

RAG, which stands for Retrieval-augmented Generation, is a system designed to enhance the capabilities of language models by integrating retrieval mechanisms. This system is typically built using three main steps: embedding, indexing, and retrieval. 

1. **Embedding**: This involves converting input text into a format that can be easily processed by the system, often into numerical vectors known as embeddings. These embeddings capture the semantic meaning of the text.

2. **Indexing**: Once embedded, the information is indexed. This allows the system to quickly find relevant data or documents that can be used to augment the responses of the language model.

3. **Retrieval**: This is the step where the system retrieves relevant data from the indexed information based on the user's query. The retrieved data is then used to generate more accurate and informative responses.

RAG systems can be evaluated using various metrics such as precision, recall, and answer faithfulness, which measure the accuracy and reliability of the responses generated. In real-world applications, user satisfaction metrics are also used to assess the effectiveness of RAG systems.

Additionally, RAG offers a flexible and cost-effective alternative to fine-tuning large language models (LLMs) for specific domains. LLMs like GPT, LLaMA, and Mistral can be integrated into RAG setups to leverage their powerful language generation capabilities alongside retrieval mechanisms.

RAG, or Retrieval-Augmented Generation systems, typically work through a process that involves a series of steps, mainly within what is known as a retrieve-then-read pipeline. Here’s how it works, based on the documents provided:

1. **Three Core Steps:**
   - **Embedding:** This involves transforming the text into a numerical format, usually using vectors. This helps in representing the text in a way that can be processed by machines.
   - **Indexing:** Once the text is embedded, these vectors are indexed in a database that allows for efficient searching. Some systems use graph databases instead of traditional vectors for knowledge-rich retrieval, offering potentially more efficient or relevant access to information.
   - **Retrieval:** The system retrieves documents related to a query based on the indexed data. This step forms the “retrieve” part of the pipeline where relevant documents are fetched in response to a query.

2. **Retrieve-Then-Read Pipeline:**
   - After the relevant documents are retrieved, the system uses a Language Model Large (LLM) to "read" the documents and generate a coherent and contextually relevant answer. This approach ensures that the final answers are both informed by the retrieved knowledge and generated with the sophistication of large language models.

3. **Evaluation:**
   - RAG systems are evaluated using various metrics like precision, recall, and answer faithfulness to ensure performance quality. Additionally, in real-world applications, user satisfaction metrics can also be utilized to determine the effectiveness of the system.

Overall, RAG systems combine the power of information retrieval with generative language models to provide comprehensive responses to queries.