# Day 3 – Workshop 5 (Production-Like RAG Demo)

**Objective:** Demonstrate a *more production-like* Retrieval-Augmented Generation (RAG) approach using:
1. A **larger open-source LLM** (GPU recommended),
2. An **external Hugging Face dataset** to build a small text corpus,
3. **FAISS** as a local vector store,
4. **Console tests** comparing prompt outputs *with* and *without* RAG,
5. A final **Gradio** UI for **interactive chat** against the RAG pipeline.

---
## Table of Contents
1. [Setup & Dependencies](#Setup)
2. [Load a Powerful Open LLM](#LLM)
3. [Build the Corpus from a Hugging Face Dataset](#Corpus)
4. [Create FAISS Vector Store](#FAISS)
5. [RAG Pipeline Functions](#Pipeline)
6. [Console Test: Compare No-RAG vs. RAG](#Compare)
7. [Gradio UI: Interactive RAG Chat](#Gradio)
8. [Wrap-Up](#WrapUp)

---

<a id="Setup"></a>

## 1. Setup & Dependencies
```
pip install datasets sentence-transformers faiss-cpu transformers torch gradio
```
**Important**:
- A **GPU** environment (e.g., at least 16 GB VRAM) is recommended for large models.
- We’ll demonstrate **FAISS** locally. In real production, you might use a hosted vector DB (Pinecone, Milvus, Weaviate, etc.).

**Reflection**:
- *Which text dataset is relevant to your domain?*
- *Do you need an even larger model? Or can a smaller one suffice for cost/performance reasons?*

Below, we suppress warning messages that some libraries emit. The code runs the same without this step; it simply keeps the output cleaner.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%capture
!pip install datasets sentence-transformers faiss-cpu transformers torch gradio

In [None]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

<a id="LLM"></a>

## 2. Load a Powerful Open LLM  
Below we show an example using **Flan-T5**, an open-source instruction-tuned model developed by Google Research. It supports a wide range of NLP tasks and is particularly well-suited for zero-shot and few-shot prompting.

> **Note**: The size of the Flan-T5 model you choose (e.g., `flan-t5-base`, `flan-t5-large`, `flan-t5-xl`, or `flan-t5-xxl`) determines its memory requirements. For example, `flan-t5-xl` (about 3B parameters) needs roughly 6 GB of VRAM for its weights in FP16, or about 12 GB in FP32, plus working memory for activations. If you’re working with limited resources, consider `flan-t5-base` or `flan-t5-large`, or explore quantised versions for more efficient inference.
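
As a rough check on memory figures like these, the weights alone take about (number of parameters) × (bytes per parameter). The sketch below applies that arithmetic to the Flan-T5 family. The parameter counts are rounded approximations, and `weight_memory_gib` is a hypothetical helper written for this illustration, not part of any library.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts are approximate, rounded figures (an assumption for this sketch).
APPROX_PARAMS = {
    "flan-t5-base": 250e6,
    "flan-t5-large": 780e6,
    "flan-t5-xl": 3e9,
    "flan-t5-xxl": 11e9,
}
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(n_params, dtype="fp16"):
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


for name, n_params in APPROX_PARAMS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gib(n_params, dtype):.1f} GiB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name:14s} {sizes}")
```

Actual usage is higher than these numbers because activations, generation caches, and framework overhead also occupy VRAM, so treat the output as a lower bound.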


In [None]:
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Flan-T5 is an encoder-decoder model, so it uses the "text2text-generation" task.
# Swap in "google/flan-t5-xl" if you have the VRAM for it.
gen_model_name = "google/flan-t5-large"
gen_pipeline = pipeline(
    "text2text-generation",
    model=gen_model_name,
    device=0 if device == "cuda" else -1,  # first GPU when available, otherwise CPU
)

def generate_llm_text(prompt, max_length=150, temperature=0.7):
    """Generate a sampled completion for `prompt`; return "" if nothing comes back."""
    sequences = gen_pipeline(prompt, max_length=max_length, do_sample=True, temperature=temperature)
    if sequences:
        return sequences[0]["generated_text"]
    return ""


print("Model loaded successfully!")

<a id="Corpus"></a>

## 3. Build the Corpus from a Hugging Face Dataset  
We’ll load a subset of the **arXiv dataset** from Hugging Face to serve as our document corpus. This dataset contains scientific papers from arXiv.org and is well-suited for tasks involving technical or academic text.

```bash
pip install datasets
```

In [None]:
%%capture
!pip install datasets

In [None]:
from datasets import load_dataset

# Load the "arxiv" subset of the scientific_papers dataset
dataset = load_dataset("scientific_papers", "arxiv", split="train", trust_remote_code=True)

print("Dataset size:", len(dataset))
print(dataset[0])

# We'll create a smaller corpus by filtering out empty or very short abstracts and limiting the total docs.
docs = []
for item in dataset:
    # Use the "abstract" field from the dataset
    text = item["abstract"].strip()
    if len(text) > 30:  # skip very short abstracts
        docs.append(text)
    if len(docs) >= 500:  # limit to 500 abstracts for the demo
        break

print(f"Using {len(docs)} documents in our corpus.")


<a id="FAISS"></a>

## 4. Create FAISS Vector Store
We’ll use [FAISS](https://github.com/facebookresearch/faiss) locally to build an index of **document embeddings**.

```bash
pip install faiss-cpu
```

**Steps**:
1. Use **sentence-transformers** (or any embedding model) to embed each doc.
2. Create a **FAISS index**.
3. Store metadata so we can retrieve which doc was matched.


In [None]:
%%capture
!pip install faiss-cpu

In [None]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer('all-mpnet-base-v2', device=device)  # Chosen for strong semantic similarity performance on technical text

def create_faiss_index(documents):
    # Embed every document; unit-length vectors make inner product equal cosine similarity
    embeddings = embed_model.encode(documents, show_progress_bar=True, normalize_embeddings=True)
    embeddings = np.array(embeddings, dtype="float32")

    # Exact (brute-force) inner-product index sized to the embedding dimension
    index = faiss.IndexFlatIP(embeddings.shape[1])

    # Add embeddings to the index; row i corresponds to documents[i]
    index.add(embeddings)
    return index, embeddings

faiss_index, doc_embeddings = create_faiss_index(docs)
print("FAISS index created with", faiss_index.ntotal, "vectors.")
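
The index itself stores only vectors, so step 3 (keeping metadata) comes down to remembering which row belongs to which document. Here is a minimal sketch of one way to do that, assuming row `i` of the index matches position `i` of the document list. `DocRecord`, `build_records`, and `lookup` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class DocRecord:
    doc_id: int   # row number in the vector index
    text: str     # the document text that was embedded
    source: str   # where the document came from


def build_records(documents, source="scientific_papers/arxiv"):
    """Create one record per document, in the same order the vectors were added."""
    return [DocRecord(doc_id=i, text=text, source=source) for i, text in enumerate(documents)]


def lookup(records, row_ids):
    """Map row ids from a search back to records, skipping the -1 padding FAISS can return."""
    return [records[i] for i in row_ids if 0 <= i < len(records)]


records = build_records(["paper A abstract", "paper B abstract", "paper C abstract"])
for record in lookup(records, [2, 0, -1]):
    print(record.doc_id, record.source, record.text)
```

Keeping the records in a plain list works for a demo of this size. A hosted vector DB would typically store such metadata alongside each vector instead.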

### Retrieving Similar Documents
We’ll define a helper to do nearest-neighbor search in FAISS.

In [None]:
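def faiss_search(query, k=1):
    # Embed the query as a unit-length vector
    q_emb = embed_model.encode([query], show_progress_bar=False, normalize_embeddings=True)
    q_emb = np.array(q_emb, dtype="float32")
    # Search: returns similarity scores and row indices, best match first
    scores, indices = faiss_index.search(q_emb, k)
    # Map row indices back to documents; FAISS returns -1 when fewer than k matches exist
    return [docs[idx] for idx in indices[0] if idx != -1]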

test_search = faiss_search("Recent advances in quantum computing", k=2)
print("Sample retrieval:", test_search)
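
To make the search step less of a black box, the sketch below reproduces in plain Python what an exact inner-product search computes: normalise the vectors, score every stored vector against the query, and keep the top `k`. It is an illustration with toy 2-D vectors, not how FAISS is implemented internally. `flat_ip_search` is a hypothetical helper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(query, vectors, k=1):
    """Score every stored vector against the query; return the top-k (score, row) pairs."""
    q = normalize(query)
    scored = [(inner_product(q, normalize(v)), row) for row, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = flat_ip_search([2.0, 0.1], toy_vectors, k=2)
print(top)
```

With unit-length vectors the inner product equals cosine similarity, so rows 0 and 2 rank highest here because they point in nearly the same direction as the query.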

<a id="Pipeline"></a>

## 5. RAG Pipeline Functions
We’ll create a function that:
1. Retrieves the top doc(s) from FAISS,
2. Appends them as context,
3. Calls our **LLM** to generate an answer.


In [None]:
def rag_prompt(user_query, top_k=3):
    # Retrieve the top matching document(s) from FAISS
    top_docs = faiss_search(user_query, k=top_k)
    context = "\n\n".join(top_docs)

    # Build the RAG prompt with the retrieved context
    # (assembled line by line so no stray indentation or quotes end up in the prompt)
    prompt = (
        f"User's Question: {user_query}\n\n"
        "You are a knowledgeable and detail-oriented academic assistant.\n"
        "Using the information provided in the context below, "
        "answer the user's question as accurately as possible.\n\n"
        f"Context:\n{context}\n\n"
        "Answer:"
    )
    return generate_llm_text(prompt, max_length=250, temperature=0.2)

# Quick test with a domain-relevant query
sample_query = "What recent advances have been made in quantum computing?"
rag_answer = rag_prompt(sample_query)
print("=== RAG Answer ===\n", rag_answer)
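
Several retrieved abstracts can add up to a long prompt, and models only accept a limited input length. One option is to cap the context before building the prompt. The sketch below does this with a character budget, which is a crude stand-in for a token count. `build_rag_prompt` and `max_context_chars` are hypothetical names, and the prompt wording is only an example.

```python
def build_rag_prompt(question, retrieved_docs, max_context_chars=1500):
    """Assemble a prompt, keeping the combined context within a character budget."""
    context_parts = []
    remaining = max_context_chars
    for doc in retrieved_docs:
        if remaining <= 0:
            break
        snippet = doc[:remaining]      # truncate the last document if it overflows
        context_parts.append(snippet)
        remaining -= len(snippet)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_rag_prompt("What is a qubit?", ["Doc one text.", "Doc two text."], max_context_chars=20))
```

Because documents arrive ranked best first, truncating from the end drops the least relevant text. A tokenizer-based budget would be more precise than counting characters.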


<a id="Compare"></a>

## 6. Console Test: Compare No-RAG vs. RAG
To emphasise the difference, we’ll take the **same** user query and:
1. **No RAG**: Just feed the query to the LLM with no additional context.
2. **RAG**: Retrieve the relevant docs and feed them to the LLM as context.


In [None]:
query_test = "What recent advances have been made in quantum computing?"

# 1) No RAG prompt
no_rag_output = generate_llm_text(query_test, max_length=250, temperature=0.2)

# 2) RAG-based prompt
rag_output = rag_prompt(query_test)

print("=== WITHOUT RAG ===\n")
print(no_rag_output)
print("\n=== WITH RAG ===\n")
print(rag_output)

**Activity**: Try various queries that might appear in your domain. Check how the **RAG** approach references actual context from the dataset, whereas the no-RAG approach just relies on the model’s parametric knowledge.
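
For a rough, automatic signal while doing this activity, you could measure how many of an answer's words also occur in the retrieved context. The sketch below computes that overlap. It is a crude heuristic rather than a real faithfulness metric, and `context_overlap` is a hypothetical helper.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def context_overlap(answer, context):
    """Fraction of distinct answer tokens that also appear in the context."""
    answer_tokens = set(tokenize(answer))
    if not answer_tokens:
        return 0.0
    context_tokens = set(tokenize(context))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


context = "Superconducting qubits decohere in microseconds."
print(context_overlap("Qubits decohere quickly", context))
print(context_overlap("Bananas are yellow", context))
```

A RAG answer would be expected to score higher against its retrieved context than a no-RAG answer to the same query, though common words inflate the score, so use it only for comparison.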

<a id="Gradio"></a>

## 7. Gradio UI: Interactive RAG Chat
We’ll create a chat-like interface. For each user message, we’ll do:
1. Retrieve docs from FAISS,
2. Generate an LLM response with context.

```bash
pip install gradio
```

In [None]:
%%capture
!pip install gradio

In [None]:
import gradio as gr

def rag_chat(query):
    return rag_prompt(query, top_k=2)

demo = gr.Interface(
    fn=rag_chat,
    inputs="text",
    outputs="text",
    title="Production-Like RAG Demo",
    description="Enter a query. We'll retrieve context from arXiv and provide an answer via Flan-T5.",
)

# Launch the Gradio UI (comment this line out to skip launching)
demo.launch(debug=False)
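
The interface above handles one query at a time. For a chat window, Gradio's `gr.ChatInterface` expects a function that takes `(message, history)`. The sketch below wraps a single-turn answer function in that signature. `make_chat_fn` is a hypothetical helper, and `fake_rag_prompt` is a stand-in for `rag_prompt` so the example runs without the model.

```python
def make_chat_fn(answer_fn, top_k=2):
    """Wrap a single-turn answer function in a (message, history) chat signature."""
    def chat_fn(message, history):
        message = message.strip()
        if not message:
            return "Please enter a question."
        # history is ignored: each turn is answered independently of earlier turns
        return answer_fn(message, top_k=top_k)
    return chat_fn


# Stand-in for rag_prompt so this sketch is self-contained.
def fake_rag_prompt(query, top_k=3):
    return f"[answer to '{query}' using {top_k} docs]"


chat_fn = make_chat_fn(fake_rag_prompt, top_k=2)
print(chat_fn("What is RAG?", history=[]))
```

In the notebook, passing `make_chat_fn(rag_prompt)` as `fn` to `gr.ChatInterface` should give a chat window, assuming your Gradio version provides `ChatInterface`. Each reply is still based only on the latest message.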

<a id="WrapUp"></a>

## 8. Wrap-Up
In this notebook, we:
- Used a **larger LLM** (Flan-T5) for “production-like” performance.
- Built a **local FAISS** index from a Hugging Face dataset.
- Demonstrated **RAG** queries that retrieve relevant text from the dataset.
- Compared **no-RAG** vs. **RAG** answers.
- Offered an **interactive Gradio** chat.

**Next Steps**:
1. Explore **bigger or more domain-specific** text corpora.
2. Migrate from local FAISS to a **hosted vector DB**.
3. Fine-tune or quantise your LLM to reduce VRAM or improve domain accuracy.
4. Deploy the Gradio app behind an **API gateway** or container for real-world usage.

---
# End of Day 3 – Workshop 5 (Production-Like RAG) Notebook
