<a href="https://colab.research.google.com/github/dinakajoy/UsingLLMs-RAG-course/blob/main/3_Introduction_to_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation (RAG)

## Data

In [21]:
# Sample documents about LLMs, RAG, and retrieval systems
documents = [
 "Large Language Models (LLMs) are AI systems trained on massive text corpora to generate and understand natural language.",
 "Retrieval-Augmented Generation (RAG) improves LLMs by allowing them to fetch external knowledge when answering questions.",
 "Vector databases like Pinecone, Weaviate, or FAISS are commonly used to power retrieval systems.",
 "LLMs often struggle with outdated or missing knowledge, which is why retrieval is essential for grounding responses.",
 "Context windows in LLMs limit how much information can be processed at once.",
 "Fine-tuning allows LLMs to specialize in a domain, but RAG can be a cheaper and more flexible alternative.",
 "Embeddings are numeric representations of text used to measure semantic similarity in retrieval systems.",
 "Hybrid search combines dense vector embeddings with keyword-based search for better accuracy.",
 "RAG systems often follow the retrieve-then-read pipeline: fetch documents and then use an LLM to generate an answer.",
 "Companies use retrieval systems to provide proprietary knowledge to LLMs without retraining the base model.",
 "Storing context in a vector database enables systems to handle large-scale document search efficiently.",
 "In-house LLM deployments are often combined with retrieval systems for privacy and security.",
 "Retrieval systems prevent hallucinations by grounding LLM outputs in verified sources.",
 "Scaling retrieval pipelines requires sharding, indexing, and efficient similarity search algorithms.",
 "LLMs like GPT, LLaMA, and Mistral can all be used in RAG setups.",
 "The retrieval step typically ranks documents based on semantic closeness to the user’s query.",
 "Chunking documents into smaller pieces improves retrieval accuracy and relevance.",
 "Metadata filters in vector search allow narrowing results by tags like date, author, or topic.",
 "Retrieval can be combined with caching to speed up repeated queries in production systems.",
 "Some RAG systems use graph databases instead of vectors for knowledge-rich retrieval.",
 "Evaluation of RAG involves metrics like precision, recall, and answer faithfulness.",
 "LLMs alone are generative, but when paired with retrieval, they become more reliable knowledge assistants.",
 "Context storage solutions range from simple in-memory stores to distributed vector databases.",
 "Retrieval systems can be domain-specific, such as for healthcare, legal, or financial data.",
 "LLMs with RAG are increasingly used in enterprise chatbots and knowledge assistants.",
 "Prompt engineering is often combined with retrieval to guide LLMs in how to use the fetched data.",
 "Scaling foundational models often requires techniques like parameter-efficient fine-tuning or LoRA.",
 "Retrieval helps reduce the need for frequent fine-tuning when new knowledge becomes available.",
 "Latency in retrieval systems is critical, especially when powering real-time applications.",
 "RAG pipelines can integrate with search engines, APIs, or internal document repositories.",
 "Knowledge grounding ensures that LLMs provide answers supported by external evidence.",
 "Some retrieval systems use re-ranking models to improve the quality of the top search results.",
 "Building a RAG system typically involves three steps: embedding, indexing, and retrieval.",
 "Retrieval allows organizations to keep proprietary knowledge private while still leveraging LLMs.",
 "An orchestration layer manages how LLMs, retrieval, and other tools interact in complex pipelines.",
 "LLMs combined with RAG are often described as knowledge-enhanced generative AI.",
 "The retriever can use dense embeddings or sparse methods like BM25, depending on the application.",
 "Chunk size in embeddings has a major effect on recall and precision in retrieval.",
 "Open-source libraries like LangChain and LlamaIndex simplify building retrieval-augmented systems.",
 "Document pre-processing, such as cleaning and normalization, is critical for good retrieval performance.",
 "Embedding models like OpenAI’s text-embedding-ada-002 or Sentence-BERT are popular for retrieval.",
 "RAG pipelines can handle both structured and unstructured data sources.",
 "Retrieval-based systems can log citations, giving users confidence in the generated answers.",
 "Distributed retrieval systems use multiple servers to scale to billions of documents.",
 "Context windows in LLMs can be extended with retrieval, allowing access to more knowledge.",
 "Retrieval can be applied in multi-modal systems, not just text but also images or audio.",
 "Enterprises prefer retrieval over fine-tuning when their data changes frequently.",
 "RAG systems can be evaluated with user satisfaction metrics in real-world applications.",
 "Building a good retrieval system requires balancing speed, accuracy, and storage costs."
]

## Setup

In [22]:
from google.colab import userdata
hf_key = userdata.get('HF_TOKEN')

In [23]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = hf_key

In [24]:
!pip install langchain-huggingface sentence-transformers langchain_community faiss-cpu -q

In [25]:
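# Import the libraries
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
# FAISS lives in langchain_community (installed above); the old langchain.vectorstores path is deprecated
from langchain_community.vectorstores import FAISS
from langchain_core.messages import HumanMessage, SystemMessage
# Document is provided by langchain_core; the old langchain.docstore path is deprecated
from langchain_core.documents import Document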
from IPython.display import Markdown, display

## Embeddings

In [26]:
# Check how many documents we have
print(f"We have {len(documents)} documents")

We have 49 documents


In [27]:
# Wrap each string in a Document
docs = [Document(page_content=text) for text in documents]

In [28]:
# Use a transformer-based embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [29]:
# Create a FAISS vector store for the docs
faiss_store = FAISS.from_documents(docs, embedding_model)

In [30]:
index = faiss_store.index

# Print the number of vectors stored in the index (one per document)
print(f"Total number of indexes: {index.ntotal}")

# Print the dimensionality of each embedding vector
print(f"Total number of dimensions: {index.d}")

# Print the embedding stored for the first document
print(f"First vector: {index.reconstruct(0)}")

Total number of indexes: 49
Total number of dimensions: 384
First vector: [ 5.59799224e-02 -1.23615697e-01  5.04221395e-02 -2.13914042e-04
  7.37531185e-02  8.95548984e-03 -6.00405224e-02  1.49373375e-02
  5.68501055e-02  3.37935537e-02 -2.43983790e-02  7.96056166e-03
  6.15357235e-02 -8.11576284e-03  1.36438720e-02  5.61926477e-02
  7.67754018e-02  3.06968205e-03 -2.04155762e-02 -9.15290713e-02
  7.54889697e-02  9.49257165e-02 -2.44868174e-02  7.48942839e-03
  2.30758637e-02  3.10892090e-02 -7.53808301e-03 -1.41408620e-02
  4.08316962e-02 -4.55593131e-02  1.01234503e-02  5.85593507e-02
  7.60235637e-02  8.76820683e-02 -2.13428158e-02  6.30844012e-02
 -6.90986440e-02  4.78140311e-03  2.68640146e-02  6.28683064e-03
 -2.52215122e-03 -6.12178817e-03  7.45765865e-02 -6.18263520e-02
  1.61846399e-01  4.47338596e-02 -8.18227157e-02  1.99009180e-02
 -4.68176790e-02  2.07310989e-02 -1.00719728e-01 -5.11116572e-02
 -1.06674517e-02  3.76786180e-02 -3.49273942e-02 -7.82647058e-02
  3.20381373e-02
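
The vector printed above is only useful in comparison with other vectors. As a minimal sketch of how two embeddings can be compared, the pure-Python snippet below computes cosine similarity and squared L2 distance on made-up 3-dimensional vectors (the names `vec_llm`, `vec_rag`, and `vec_other` and their values are hypothetical, not real model outputs). It also checks the identity that, for unit-length vectors, squared L2 distance equals `2 - 2 * cosine`, so both measures rank neighbours in the same order.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy "embeddings"; real ones here have 384 dimensions
vec_llm = normalize([0.9, 0.1, 0.2])
vec_rag = normalize([0.8, 0.3, 0.1])
vec_other = normalize([-0.2, 0.9, 0.4])

print("cos(llm, rag):  ", cosine_similarity(vec_llm, vec_rag))
print("cos(llm, other):", cosine_similarity(vec_llm, vec_other))

# For unit-length vectors: squared L2 distance == 2 - 2 * cosine similarity
print("squared L2:     ", squared_l2(vec_llm, vec_rag))
print("2 - 2 * cos:    ", 2 - 2 * cosine_similarity(vec_llm, vec_rag))
```

A higher cosine similarity (or a smaller distance) means the two texts are treated as closer in meaning.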

## Retrieval System

In [31]:
# Retrieve the k documents most similar to the query
query = "What are LLMs?"
k = 10
retrieved_docs = faiss_store.similarity_search(query, k=k)

In [32]:
# Build a function for retrieving documents
def get_relevant_documents(query, k=5):
  """Return the k documents most similar to the query."""
  return faiss_store.similarity_search(query, k=k)
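
To make the retrieval step less of a black box, here is a rough sketch of what a brute-force nearest-neighbour search does: score every stored vector against the query vector and keep the `k` closest. It assumes L2 distance, which to my knowledge is what the store built by `FAISS.from_documents` uses by default. The helper `top_k` and the 2-dimensional `toy_vectors` are hypothetical and stand in for the real index.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, vectors, k=5):
    """Return the positions of the k vectors closest to query_vec."""
    scored = [(squared_l2(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [i for _, i in scored[:k]]

# Hypothetical 2-dimensional vectors standing in for document embeddings
toy_vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.1, 0.2],
    [5.0, 5.0],
]

nearest = top_k([0.0, 0.1], toy_vectors, k=2)
print(nearest)
```

In the notebook, the query string is first embedded with the same model as the documents, and the returned positions are mapped back to `Document` objects.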

## Generative System

In [33]:
# Load the LLM
llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-4",
    task="text-generation"
)
chat_model = ChatHuggingFace(llm=llm)

In [34]:
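# Build the system and human messages, then query the chat model
def generative_system(query, context):
  # Join the text of the retrieved Documents so the prompt contains
  # their content rather than the Python repr of the Document objects
  context_text = "\n".join(doc.page_content for doc in context)
  messages = [
      SystemMessage(content=f"""
      You are an English tour guide with a Nigerian accent.
      Your task is to reply in English.
      Only answer with information from the following context:
      {context_text}
      If you don't know the answer, just say you don't know.
      """),
      HumanMessage(content=f"Answer this question based on the context: {query}")
  ]
  ai_output = chat_model.invoke(messages)
  # Render the model's reply as Markdown in the notebook
  display(Markdown(ai_output.content))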

## Combining the Retrieval and Generative Systems

In [35]:
# Build the RAG system
def rag(query):
  context = get_relevant_documents(query)
  return generative_system(query, context)
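
The retrieve-then-read flow can also be tried without a model or a vector store. The sketch below is a stand-in under stated assumptions: `Doc` mimics only the `page_content` attribute of LangChain's `Document`, `fake_retrieve` ranks by keyword overlap instead of embeddings, and `fake_generate` simply echoes the prompt. All of these names are hypothetical. What it shows is the shape of the pipeline and one way to number the context so an answer could cite its sources.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str

def format_context(docs):
    """Number each retrieved passage so it can be cited as [1], [2], ..."""
    return "\n".join(f"[{i}] {d.page_content}" for i, d in enumerate(docs, start=1))

def build_prompt(query, docs):
    """Combine the retrieved context and the question into one prompt."""
    return (
        "Answer using only the numbered context below.\n"
        f"Context:\n{format_context(docs)}\n\n"
        f"Question: {query}"
    )

def rag_sketch(query, retrieve, generate, k=5):
    """Retrieve-then-read: fetch documents, then hand the prompt to a generator."""
    docs = retrieve(query, k)
    return generate(build_prompt(query, docs))

corpus = [
    Doc("Embeddings are numeric representations of text."),
    Doc("RAG fetches external knowledge before generating an answer."),
]

def fake_retrieve(query, k):
    """Toy retriever: rank by the number of words shared with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: -len(words & set(d.page_content.lower().split())),
    )
    return ranked[:k]

def fake_generate(prompt):
    """Toy generator: return the prompt unchanged so it can be inspected."""
    return prompt

answer = rag_sketch("what is rag", fake_retrieve, fake_generate, k=1)
print(answer)
```

Passing the retriever and generator in as arguments keeps the pipeline easy to test: either piece can be swapped for the FAISS store or the chat model used in this notebook.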

In [36]:
# Test the RAG
query = "What are LLMs"
rag(query)

Ah, my friend, Large Language Models, or LLMs as we call them, are essentially AI systems that are trained on massive amounts of text. Their purpose is to generate and understand natural language. These models, some of which you might know include GPT, LLaMA, and Mistral, are designed to be generative.

But here's where it gets interesting—when LLMs are paired with something called retrieval, they become more than just text generators. They transform into more reliable knowledge assistants. This combination, known as Retrieval-Augmented Generation or RAG, significantly boosts their ability to provide accurate information. You see, LLMs can give answers with an intelligent flair, but with RAG, they bring in the reliability factor, fetching reliable knowledge on the fly. Quite useful in enterprise settings, wouldn't you say? They're used in chatbots, knowledge assistants, and a variety of applications across industries.

So, while on their own these models are powerful, the integration with retrieval makes them more dependable, broadening the horizon of what they can achieve!

In [37]:
# Prepare test queries
query_list = [
    "What is RAG?",
    "How do RAG and retrieval systems work?"
]

In [38]:
# Test the RAG system
for query in query_list:
  rag(query)

RAG, which stands for Retrieval Augmented Generation, is a system designed to enhance language models by integrating document retrieval with language generation capabilities. Building a RAG system typically involves three steps: embedding, indexing, and retrieval. These steps are crucial for enabling the RAG to access and use information from large sets of documents effectively. 

In RAG setups, LLMs (Large Language Models) like GPT, LLaMA, and Mistral are commonly used to leverage the combination of retrieval and generation for tasks. Unlike fine-tuning, which allows LLMs to specialize in a specific domain, RAG offers a more flexible and cost-effective alternative by dynamically retrieving relevant information from external sources during inference.

RAG systems can be evaluated using metrics like precision, recall, answer faithfulness, and user satisfaction in real-world applications to assess their performance and effectiveness. This evaluation emphasizes the system's ability to provide accurate and contextually appropriate responses.

Ah, delightful question! A RAG system, or Retrieval-Augmented Generation system, operates primarily through a process known as the retrieve-then-read pipeline. To begin, it involves three crucial steps: embedding, indexing, and retrieval. Here's a breezy explanation:

1. **Embedding**: This is where documents or data are converted into numerical vectors that a computer can process. It's sort of like translating items into a universal language computers understand well!

2. **Indexing**: After embedding, the vectors are organized and stored in an efficient way for quick access. Think of it as an indexing system like you’ll find in a library, to make findin' document data a breeze.

3. **Retrieval**: In this stage, the system fetches the most relevant documents based on the input query. This is like walking into a shop and finding exactly what you're searching for amid a crowd of items.

Many systems follow a "retrieve-then-read" approach. They first fetch relevant documents based on a query and then use a large language model, or LLM, to generate an answer that takes the retrieved documents into account.

Interestingly, some RAG systems even use graph databases for a more knowledge-rich retrieval as opposed to traditional vector approaches. This can offer tremendous benefits in handling complex queries! Evaluation of these systems involves looking at metrics such as precision, recall, and answer faithfulness, and in real-world applications, user satisfaction metrics are also considered.

That sums it up! If you have more queries, feel free to ask, mon ami!