# Retrieval-Augmented Generation (RAG)


## 1. Introduction

**Retrieval-Augmented Generation (RAG)** is a framework that enhances large language models (LLMs) by integrating **external information retrieval** with **text generation**.  
It was introduced by Facebook AI Research (FAIR) in 2020 to address one major limitation of LLMs: their inability to access or update external knowledge after training.

In essence, RAG **retrieves relevant data from a knowledge base** and uses it as additional context to **generate more accurate, up-to-date, and factually grounded responses**.

---

## 2. Why RAG is Needed

Traditional LLMs (like GPT or BERT-based models) have limitations:
- They are **static**: knowledge is frozen at the time of training.
- They **hallucinate**, meaning they can generate false or unverifiable information.
- They require **expensive retraining** to add new knowledge.

**RAG mitigates these issues** by connecting the model to an **external database or document store**, allowing it to fetch relevant context dynamically during inference. Answers can then be grounded in, and checked against, the retrieved sources.

---

## 3. Core Components of RAG

### 3.1 Retriever
- The **retriever** searches a large corpus for the documents or passages most relevant to the user query.
- It scores each stored document against the query, most often by comparing **embeddings** (vector representations of text), although keyword-based scoring is also common.
- Common retrieval methods:
  - **Dense retrieval**: Uses embeddings and vector similarity (e.g., cosine similarity).
  - **Sparse retrieval**: Uses traditional search methods like BM25.
  - **Hybrid retrieval**: Combines both dense and sparse approaches.
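
The three retrieval styles above can be illustrated with a minimal sketch. Everything here is a stand-in chosen for illustration: the 3-dimensional vectors are hypothetical embeddings, `sparse_rank` uses plain term overlap rather than real BM25, and the two rankings are combined with reciprocal rank fusion (one common way to merge rankings; the text above does not prescribe a specific method).

```python
import math

# Hypothetical toy corpus: hand-written vectors stand in for encoder output.
doc_vectors = {
    "mars": [0.9, 0.1, 0.0],
    "jupiter": [0.7, 0.6, 0.1],
    "comets": [0.0, 0.2, 0.9],
}
doc_texts = {
    "mars": "mars is known as the red planet",
    "jupiter": "jupiter is the largest planet",
    "comets": "comets are icy bodies that release gas and dust",
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dense_rank(query_vec):
    """Dense retrieval: order documents by cosine similarity to the query vector."""
    return sorted(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)

def sparse_rank(query):
    """Sparse retrieval stand-in: order documents by number of shared terms."""
    q_terms = set(query.lower().split())
    return sorted(doc_texts, key=lambda d: len(q_terms & set(doc_texts[d].split())), reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Hybrid retrieval: sum 1 / (k + rank) over every ranking a document appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

query = "which planet is red"
query_vec = [1.0, 0.0, 0.0]  # hypothetical query embedding

dense = dense_rank(query_vec)
sparse = sparse_rank(query)
hybrid = reciprocal_rank_fusion([dense, sparse])
print(dense, sparse, hybrid)
```

Rank-based fusion is used here because dense and sparse scores live on different scales, so adding raw scores would let one retriever dominate.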

**Common tools:**  
- FAISS (Facebook AI Similarity Search)  
- Pinecone  
- Weaviate  
- Milvus  
- Elasticsearch  

### 3.2 Generator
- The **generator** is a large language model (LLM) that takes the user query and the retrieved documents as input.
- It **conditions** its generation on both the question and the context to produce a final answer.
- Example models: GPT, T5, BART, or custom fine-tuned transformers.
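
"Conditioning on both the question and the context" usually just means that both are placed in the model's input text. The sketch below shows one possible layout; the function name `build_prompt` and the exact template wording are assumptions for illustration, not a fixed format.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Combine numbered passages and the question into a single generator input."""
    context = "\n\n".join(f"[Doc {i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

prompt = build_prompt(
    "Why is Mars red?",
    ["Mars is known as the Red Planet because of its iron oxide-rich soil."],
)
print(prompt)
```

Numbering the passages makes it possible to ask the model to cite which passage supports its answer.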

---

## 4. Architecture and Workflow

```text
User Query → Retriever → Relevant Documents → Generator → Final Response
```

### Step-by-Step Workflow

1. **Input Query:** The user provides a question or task.
2. **Embedding Generation:** The query is converted into an embedding vector using a pre-trained model such as Sentence-BERT or OpenAI embeddings.
3. **Document Retrieval:** The retriever searches a knowledge base or vector database to find the top-k most relevant documents.
4. **Context Preparation:** The retrieved texts are ranked and combined to form a contextual input for the generator.
5. **Response Generation:** The generator (LLM) uses the user query and the retrieved context to produce a coherent and factual response.
6. **Output:** The final, contextually grounded answer is presented to the user, often with references to the source documents.
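
The context-preparation step can be sketched as a small packing routine: sort the retrieved passages by score and keep adding them while they fit into a size budget. The function name `pack_context`, the character-based budget, and the scores are illustrative assumptions; a real system would more likely budget in tokens.

```python
def pack_context(scored_passages, max_chars=200):
    """Keep the highest-scoring passages whose combined length fits the budget.

    scored_passages: iterable of (score, text) pairs.
    The budget counts passage characters only, not the separators.
    """
    ranked = sorted(scored_passages, key=lambda pair: pair[0], reverse=True)
    kept, used = [], 0
    for _score, text in ranked:
        if used + len(text) > max_chars:
            continue  # skip passages that would overflow the budget
        kept.append(text)
        used += len(text)
    return "\n\n".join(kept)

# Hypothetical retrieval scores for three passages.
scored = [
    (0.41, "Jupiter is the largest planet in our solar system."),
    (0.87, "Mars is known as the Red Planet because of its iron oxide-rich soil."),
    (0.12, "Comets are icy bodies that release gas and dust as they approach the Sun."),
]
context = pack_context(scored, max_chars=130)
print(context)
```

With a 130-character budget the two highest-scoring passages fit and the lowest-scoring one is dropped.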


# RAG Implementation

In [None]:
# Install required packages
!pip install sentence-transformers faiss-cpu transformers accelerate torch tqdm


In [None]:
# ==============================================================
# Project: Ask About Space — RAG-Powered Q&A Chat
# ==============================================================
# This project:
#   • Builds an in-memory document set about space
#   • Embeds and indexes them using FAISS
#   • Retrieves relevant passages for a user query
#   • Generates an answer using a seq2seq model (Flan-T5)
#   • Provides an interactive Gradio chat interface
# ==============================================================

!pip install -q faiss-cpu sentence-transformers transformers accelerate torch gradio

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import gradio as gr

# --------------------------------------------------------------
# 1. Prepare small knowledge base
# --------------------------------------------------------------
documents = [
    "The James Webb Space Telescope observes infrared light to study early galaxies.",
    "Black holes are regions of spacetime where gravity is so strong that nothing can escape.",
    "The Milky Way is the galaxy that contains our Solar System.",
    "Mars is known as the Red Planet because of its iron oxide-rich soil.",
    "The Hubble Space Telescope was launched in 1990 and orbits the Earth.",
    "NASA's Artemis program aims to return humans to the Moon.",
    "SpaceX designs and launches reusable rockets that reduce space travel costs.",
    "Jupiter is the largest planet in our solar system and has a Great Red Spot storm.",
    "Comets are icy bodies that release gas and dust as they approach the Sun.",
    "Astronauts aboard the International Space Station experience microgravity."
]

# --------------------------------------------------------------
# 2. Build embeddings and FAISS index
# --------------------------------------------------------------
embed_model = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(embed_model)

embeddings = embedder.encode(documents, convert_to_numpy=True, normalize_embeddings=True)
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(embeddings)

# --------------------------------------------------------------
# 3. Load generator (Flan-T5)
# --------------------------------------------------------------
gen_model = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(gen_model)
generator = AutoModelForSeq2SeqLM.from_pretrained(gen_model)

# --------------------------------------------------------------
# 4. Core RAG function
# --------------------------------------------------------------
def rag_answer(question: str, top_k: int = 3) -> str:
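    # Retrieve top_k documents. FAISS pads missing results with index -1,
    # so never request more than the index holds and drop any -1 entries.
    q_emb = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)
    top_k = min(top_k, len(documents))
    scores, idxs = index.search(q_emb, top_k)
    retrieved = [documents[i] for i in idxs[0] if i >= 0]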

    # Build context
    context = "\n\n".join([f"[Doc {i+1}] {text}" for i, text in enumerate(retrieved)])

    # Generate answer
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer briefly and factually using the context only."
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    outputs = generator.generate(**inputs, max_length=200, num_beams=4)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Optional: show context for debugging
    show_context = "\n\n".join([f"• {t}" for t in retrieved])
    return f"**Answer:** {answer}\n\n**Retrieved Info:**\n{show_context}"

# --------------------------------------------------------------
# 5. Gradio Interface
# --------------------------------------------------------------
demo = gr.Interface(
    fn=rag_answer,
    inputs=gr.Textbox(label="Ask a question about space!", placeholder="e.g., What is the James Webb Telescope used for?"),
    outputs=gr.Markdown(label="RAG Answer"),
    title="🚀 Ask About Space — RAG Mini Project",
    description="A small Retrieval-Augmented Generation demo that answers your questions about space facts."
)

# --------------------------------------------------------------
# 6. Launch app
# --------------------------------------------------------------
demo.launch(debug=True, share=False)
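
The cell above encodes with `normalize_embeddings=True` and searches an inner-product index (`IndexFlatIP`). The reason this pairing works is that the inner product of unit-length vectors equals their cosine similarity. A minimal check of that identity in plain Python, with made-up 2-dimensional vectors:

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]  # illustrative vectors, not real embeddings

ip_normalized = dot(normalize(a), normalize(b))
print(ip_normalized, cosine(a, b))
```

Without normalization, inner-product search would favor long vectors regardless of direction, so the two settings should be changed together.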





# RAG Project

In [None]:
!pip install wikipedia
!pip install faiss-cpu
!pip install sentence-transformers
!pip install transformers
!pip install gradio


In [None]:
"""
Retrieval-Augmented Generation (RAG) System using Wikipedia
------------------------------------------------------------

This project implements a simple question-answering system that combines
document retrieval and text generation. The system retrieves relevant
information from Wikipedia, builds vector embeddings for efficient search,
and generates fact-based answers using a language model.

Author: Anshuman Sinha
Company: Xyphor Advisors
"""

import wikipedia
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import gradio as gr

# -----------------------------
# Model Configuration
# -----------------------------
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL = "google/flan-t5-base"
TOP_WIKI_PAGES = 8
CHUNK_SIZE = 512
CHUNK_OVERLAP = 128
TOP_K = 5
MAX_GEN_TOKENS = 200

# Load models
embedder = SentenceTransformer(EMBED_MODEL)
tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
generator = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL)

# -----------------------------
# Utility Functions
# -----------------------------
def fetch_wikipedia_pages(query: str, top_n: int = TOP_WIKI_PAGES):
    """
    Fetches the top-N Wikipedia pages related to the given query.
    """
    try:
        results = wikipedia.search(query, results=top_n)
    except Exception:
        return []
    pages = []
    for title in results:
        try:
            page = wikipedia.page(title, auto_suggest=False, redirect=True)
            pages.append({"title": title, "content": page.content})
        except Exception:
            continue
    return pages


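def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """
    Splits large text into smaller overlapping chunks for embedding.
    Sizes are measured in characters, not tokens.
    """
    # Without this check the window would never advance (infinite loop).
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    text = text.replace("\n", " ")
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        if end == len(text):
            break
        # Step back by `overlap` so neighbouring chunks share some text
        start = end - overlap
    return chunks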


def embed_texts(texts):
    """
    Converts a list of texts into vector embeddings.
    """
    return embedder.encode(texts, convert_to_numpy=True, normalize_embeddings=True)


def build_index(embeddings):
    """
    Builds a FAISS index for fast similarity search.
    """
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    return index


def generate_answer(question: str, context: str):
    """
    Generates an answer to a given question based on the retrieved context.
    """
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer factually and concisely using only the context."
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    outputs = generator.generate(**inputs, max_length=MAX_GEN_TOKENS, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def rag_query(question: str):
    """
    Performs the Retrieval-Augmented Generation process:
    1. Fetches Wikipedia articles
    2. Builds embeddings and index
    3. Retrieves the most relevant text chunks
    4. Generates a fact-based answer
    """
    pages = fetch_wikipedia_pages(question, top_n=TOP_WIKI_PAGES)
    if not pages:
        return "No relevant Wikipedia pages found."

    # Prepare passages
    passages = []
    for page in pages:
        for chunk in chunk_text(page["content"]):
            passages.append({"title": page["title"], "text": chunk})

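    # Create embeddings and FAISS index (rebuilt on every query, which is
    # acceptable for a demo but slow for repeated use)
    texts = [p["text"] for p in passages]
    if not texts:
        return "The retrieved Wikipedia pages contained no text."
    embeddings = embed_texts(texts)
    index = build_index(embeddings)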

    # Retrieve top-k results
    q_emb = embed_texts([question])
    scores, idxs = index.search(q_emb, TOP_K)

    retrieved = []
    for score, idx in zip(scores[0], idxs[0]):
        if 0 <= idx < len(passages):
            retrieved.append({
                "title": passages[idx]["title"],
                "text": passages[idx]["text"],
                "score": float(score)
            })

    # Combine retrieved text as context
    context = "\n\n".join([f"[{r['title']}] {r['text']}" for r in retrieved])
    answer = generate_answer(question, context)

    # Format final output
    output = f"**Answer:** {answer}\n\n**Sources:**\n"
    for r in retrieved:
        output += f"- {r['title']} (score: {r['score']:.3f})\n"
    return output


# -----------------------------
# User Interface
# -----------------------------
interface = gr.Interface(
    fn=rag_query,
    inputs=gr.Textbox(
        label="Ask Wikipedia (RAG)",
        placeholder="Type your question here..."
    ),
    outputs=gr.Markdown(label="Answer"),
    title="Retrieval-Augmented Generation using Wikipedia",
    description=(
        "This system retrieves relevant information from Wikipedia and "
        "generates factual answers using a retrieval-augmented approach."
    )
)

if __name__ == "__main__":
    interface.launch()
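
To see what the overlapping windows from `chunk_text` look like, the sketch below restates the same loop in compact form and runs it on a ten-character string. The tiny sizes (4 and 2) are chosen only to make the output readable; the notebook itself uses 512 and 128.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        # Guard so that the window always moves forward.
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.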


