<a href="https://colab.research.google.com/github/arwiseman/GenAIEngineering-Cohort2/blob/main/AW_of_RAG_BasicsV2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation (RAG) — Customer Support Chatbot

Welcome! In this project, we'll build a working prototype of a **RAG-based chatbot** that answers customer support questions by searching through real documentation and generating answers using a language model.

---

## Project Context

Imagine you're running a company with hundreds of customer support articles — password resets, cancellations, shipping policies, etc. Customers often ask the same types of questions, and your support team spends hours answering them.

What if we could build an AI assistant that could:

- Search through those documents intelligently
- Pick out the most relevant information
- Generate an accurate, helpful response
- And do all of that instantly?

This is exactly what **RAG** enables.

---

## Project Goal

To build a chatbot that:
1. Accepts a user’s question (e.g. “How do I cancel my order?”)
2. Searches through a collection of support documents
3. Retrieves the most relevant chunks of information
4. Feeds those chunks into a language model (like GPT-4o)
5. Generates a context-aware answer based **only on the retrieved content**

---

## What is RAG?

**Retrieval-Augmented Generation** (RAG) is a hybrid approach:
- **Retrieval**: Search a knowledge base to find relevant context
- **Generation**: Use a large language model (LLM) to generate answers using that context

This helps reduce hallucinations, keep answers grounded in facts, and extend an LLM’s capabilities without retraining it.
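To make the two stages concrete, here is a toy sketch in plain Python. It is not the pipeline built in this notebook: a simple word-overlap score stands in for embedding search, no LLM is called, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names used only for illustration.

```python
# Toy illustration of retrieve-then-generate (all names here are hypothetical).
KNOWLEDGE_BASE = [
    "to cancel an order open your order history and select cancel",
    "to reset your password use the forgot password link",
    "shipping usually takes three to five business days",
]

def retrieve(question, docs, k=1):
    """Rank docs by how many words they share with the question (stand-in for vector search)."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.split())), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question so the model can ground its answer."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

top = retrieve("how do i cancel my order", KNOWLEDGE_BASE)
prompt = build_prompt("how do i cancel my order", top)
print(prompt)
```

The printed prompt is what would be sent to the language model in the generation stage.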

---

## What You'll Learn

By the end of this notebook, you’ll understand:
- How to clean and chunk source documents
- How to embed text using `SentenceTransformer`
- How to perform semantic search using `FAISS`
- How to send context-aware prompts to OpenAI’s GPT model
- How to build a simple but effective RAG-powered chatbot

---

Let's get started!

## Step 1: Set Up the Environment and Import Libraries

The first few cells prepare the Colab runtime:

- Install dependencies (run once per runtime)
- Mount Google Drive (must be re-run after a runtime restart)
- Add the utils path and import the vault module (specific to this setup)
- Load the vault keys into the environment
- Verify the environment variables (fail fast if the OpenAI key is missing)

Next, we import the core Python libraries needed for our RAG pipeline. These libraries help us with:

- Reading and manipulating our dataset
- Cleaning and processing the text
- Generating semantic embeddings
- Performing similarity search

Let’s take a look at each one.

In [1]:
!pip -q install faiss-cpu sentence-transformers openai pandas

In [2]:
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [3]:
import sys
import importlib

sys.path.append("/content/drive/MyDrive/colab_utils")

import colab_vault
importlib.reload(colab_vault)

print("colab_vault imported ✅")


colab_vault imported ✅


In [4]:
colab_vault.load_vault_from_drive()
print("Vault loaded into environment ✅")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Vault passphrase: ··········
Vault loaded from Drive.
Vault loaded into environment ✅


In [5]:
import os

for k in ["GEMINI_API_KEY", "HF_TOKEN", "GROQ_API_KEY", "OPENAI_API_KEY"]:
    v = os.getenv(k)
    print(k, "loaded" if v else "MISSING", "| length:", len(v) if v else 0)

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is missing — vault did not inject it.")

print("Environment ready ✅")


GEMINI_API_KEY loaded | length: 39
HF_TOKEN loaded | length: 37
GROQ_API_KEY loaded | length: 56
OPENAI_API_KEY loaded | length: 164
Environment ready ✅


In [6]:
# Text preprocessing
import re

# Data handling
import pandas as pd # Used for loading and working with structured datasets like CSV

# Vector search engine
import faiss # Enables fast similarity search over dense embeddings | Facebook AI Similarity Search

# Embedding model
from sentence_transformers import SentenceTransformer # Converts text into dense semantic vectors

from openai import OpenAI


### Line-by-Line Explanation

- `import pandas as pd`  
  Loads the **Pandas** library, which is commonly used for reading CSV files, manipulating tables, and analyzing structured data.

- `from openai import OpenAI`  
  Imports the **OpenAI** client class, which we’ll use later to send the retrieved context and the user’s question to the language model.

- `import re`  
  Imports Python’s **regular expressions** module, which we’ll use for cleaning and standardizing text (e.g., removing placeholders or unwanted characters).

- `from sentence_transformers import SentenceTransformer`  
  Imports the **SentenceTransformer** class, which lets us convert text into vector embeddings. These embeddings capture the meaning of a sentence in a numerical format that machines can understand.

- `import faiss`  
  Loads **FAISS (Facebook AI Similarity Search)** — a high-performance library for indexing and searching over vector embeddings. It allows us to find the most semantically similar chunks of text given a user query.

Together, these libraries form the foundation of our Retrieval-Augmented Generation (RAG) pipeline.

## Step 2: Load the Customer Support Dataset

Now that our libraries are imported, we’ll load the actual dataset that contains common customer support interactions.

The dataset has five columns (`flags`, `instruction`, `category`, `intent`, `response`). We use two of them:
- `instruction` — the customer’s question or request
- `response` — the correct support answer

These pairs form the core knowledge base for our chatbot. Later, we’ll use this data to generate embeddings and retrieve the most relevant response for any new query.

Let’s load the CSV file and preview the first few rows.

In [7]:
CSV_PATH = (
    "/content/drive/MyDrive/"
    "PZ - Aug 2025 AI ML Docs Ankur/GenAI with AA/"
    "RAG/Customer_Support_Training_Dataset.csv"
)

df = pd.read_csv(CSV_PATH)
print("Loaded CSV ✅  Shape:", df.shape)
df.head(3)


Loaded CSV ✅  Shape: (26872, 5)


| | flags | instruction | category | intent | response |
|---|---|---|---|---|---|
| 0 | B | question about cancelling order {{Order Number}} | ORDER | cancel_order | I've understood you have a question regarding ... |
| 1 | BQZ | i have a question about cancelling oorder {{Or... | ORDER | cancel_order | I've been informed that you have a question ab... |
| 2 | BLQZ | i need help cancelling puchase {{Order Number}} | ORDER | cancel_order | I can sense that you're seeking assistance wit... |


### Line-by-Line Explanation

- `CSV_PATH = (...)`  
  Sets the file path for the dataset, here a location on the mounted Google Drive. Change it if your file is in a different location or has a different name.

- `df = pd.read_csv(CSV_PATH)`  
  Loads the CSV file into a **Pandas DataFrame** named `df`. This allows us to work with the data in a table-like structure with rows and columns.

- `print("Loaded CSV ✅  Shape:", df.shape)`  
  Displays the shape of the DataFrame, that is, how many rows and columns it contains (here 26,872 rows and 5 columns). This helps confirm the data was loaded correctly.

- `df.head(3)`  
  Shows the **first 3 rows** of the dataset. This gives us a quick preview of what the data looks like before we begin processing it.

## Step 3: Clean the Instruction and Response Text

Before converting our data into embeddings, we need to clean the text.

Why? Because raw text often contains:
- Placeholder variables like `{{Order Number}}`
- Inconsistent whitespace or formatting
- Mixed casing that can affect semantic consistency

Cleaning helps improve the quality of embeddings and ensures better search and generation results.

We define a simple function that:
- Removes placeholders like `{{...}}`
- Normalizes whitespace (e.g., multiple spaces → one)
- Converts everything to lowercase

We then apply this function to both the `instruction` and `response` columns, creating two new columns: `instruction_clean` and `response_clean`.

In [8]:
# Define a cleaning function
def clean(s):
    s = str(s)  # Ensure input is string
    s = re.sub(r"\{\{.*?\}\}", "", s)  # Remove placeholders like {{Order Number}}
    s = re.sub(r"\s+", " ", s).strip()  # Normalize spaces and remove leading/trailing whitespace
    return s.lower()  # Convert to lowercase

# Apply cleaning to instruction and response columns
df["instruction_clean"] = df["instruction"].map(clean)
df["response_clean"] = df["response"].map(clean)

df[["instruction_clean", "response_clean"]].head(3)

| | instruction_clean | response_clean |
|---|---|---|
| 0 | question about cancelling order | i've understood you have a question regarding ... |
| 1 | i have a question about cancelling oorder | i've been informed that you have a question ab... |
| 2 | i need help cancelling puchase | i can sense that you're seeking assistance wit... |


### Line-by-Line Explanation

- `def clean(s):`  
  Defines a function called `clean()` that takes in a text string and returns a cleaned version.

- `s = str(s)`  
  Converts the input to a string so the regex calls never fail on non-string values. Note that a missing value (`NaN`) becomes the literal string `"nan"` rather than an empty string.

- `re.sub(r"\{\{.*?\}\}", "", s)`  
  Uses a regular expression to remove any placeholder enclosed in double curly braces like `{{Order Number}}`. These placeholders come from templates and carry no information useful for retrieval.

- `re.sub(r"\s+", " ", s).strip()`  
  Collapses multiple whitespace characters into a single space. Also removes any extra spaces at the start or end.

- `return s.lower()`  
  Converts the entire string to lowercase to ensure uniformity (important for matching and embedding).

- `df["instruction_clean"] = df["instruction"].map(clean)`  
  Applies the `clean()` function to every row in the `instruction` column and stores the result in a new column called `instruction_clean`.

- `df["response_clean"] = df["response"].map(clean)`  
  Same as above, but for the `response` column.

By the end of this step, we have cleaned versions of both questions and answers, ready for chunking and embedding.
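The cleaning steps can be tried in isolation on a couple of inputs. The sketch below repeats `clean()` so it runs on its own, and adds `clean_or_empty()`, a hypothetical variant (not part of the notebook) that maps missing values to an empty string instead of the text `"nan"`.

```python
import math
import re

def clean(s):
    # Same steps as the notebook's clean(): stringify, drop {{...}}, squeeze spaces, lowercase
    s = str(s)
    s = re.sub(r"\{\{.*?\}\}", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s.lower()

def clean_or_empty(s):
    # Hypothetical variant: treat None/NaN as empty text rather than "none"/"nan"
    if s is None or (isinstance(s, float) and math.isnan(s)):
        return ""
    return clean(s)

print(clean("I need help   cancelling puchase {{Order Number}}"))  # i need help cancelling puchase
print(repr(clean(float("nan"))))           # 'nan'
print(repr(clean_or_empty(float("nan"))))  # ''
```

Whether missing rows should be kept as empty text or dropped altogether depends on the dataset; this sketch only shows the difference in behavior.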

## Step 4: Chunk the Text into Smaller Segments

Long blocks of text (like entire support responses) can be hard for a model to embed or search accurately. To solve this, we split each combined instruction-response into smaller, overlapping chunks.

This chunking helps in:
- Capturing more precise information in each chunk
- Avoiding token-length limits during embedding and retrieval
- Preserving context with overlapping text

We define a `chunk_words()` function that breaks down a string into ~120-word chunks with a 20-word overlap. Then, for each row in the dataset, we:
1. Combine the cleaned instruction and response into one string
2. Split that string into chunks
3. Store the chunks in a list along with metadata like `rid` (row index) and `chunk_id`

In [9]:
# Define a chunking function
def chunk_words(text, n=120, overlap=20):
    words = text.split()  # Split text into words
    out = []              # List to store chunks
    step = max(1, n - overlap)  # Step size between chunks
    for i in range(0, len(words), step): # Keep sliding over the text
        out.append(" ".join(words[i:i+n]))  # Extract chunk and join back into string
    return out

# Create mini documents (chunks) with metadata
mini_docs = []
for rid, row in df.iterrows():
    # Combine instruction and response into a single body
    body = f"instruction: {row['instruction_clean']} | response: {row['response_clean']}"

    # Split into chunks
    chs = chunk_words(body, n=120, overlap=20)

    # Store each chunk with row ID and chunk ID
    for j, ch in enumerate(chs):
        mini_docs.append({
            "rid": rid,            # Row index from original DataFrame
            "chunk_id": j,         # Position of the chunk within the document
            "text": ch             # Chunk text itself
        })

print("mini_docs built ✅  Total chunks:", len(mini_docs))
print("Example doc:", mini_docs[0])

mini_docs built ✅  Total chunks: 41956
Example doc: {'rid': 0, 'chunk_id': 0, 'text': "instruction: question about cancelling order | response: i've understood you have a question regarding canceling order , and i'm here to provide you with the information you need. please go ahead and ask your question, and i'll do my best to assist you."}


### Line-by-Line Explanation

#### Chunking Function

- `def chunk_words(text, n=120, overlap=20):`  
  Defines a function that splits the input text into chunks of up to 120 words, with a 20-word overlap between chunks.

- `words = text.split()`  
  Splits the text into a list of words.

- `step = max(1, n - overlap)`  
  Calculates how far to move ahead when creating the next chunk. This ensures chunks overlap (if overlap > 0).

- `for i in range(0, len(words), step):`  
  Iterates over the word list using the calculated step size.

- `out.append(" ".join(words[i:i+n]))`  
  Joins the selected words into a chunk and adds it to the output list.

- `return out`  
  Returns the list of chunks.

#### Building `mini_docs`

- `for rid, row in df.iterrows():`  
  Loops over each row in the DataFrame, getting both the index (`rid`) and the row itself.

- `body = f"instruction: ... | response: ..."`  
  Combines the cleaned question and answer into a single string.

- `chs = chunk_words(body, n=120, overlap=20)`  
  Breaks the combined string into overlapping chunks.

- `for j, ch in enumerate(chs):`  
  Loops over each chunk and assigns it a chunk ID.

- `mini_docs.append(...)`  
  Stores each chunk along with its row index and chunk index in a list called `mini_docs`.

This results in a list of smaller text passages — ready to be embedded and searched efficiently.
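One side effect of the sliding window is worth seeing on a small input. When a text is only slightly longer than the step size (100 words here), the loop emits a final chunk made up entirely of words already contained in the previous chunk. This is a likely source of very short chunks such as `help or clarification.` that appear in the retrieval results later in this notebook. The sketch below repeats `chunk_words()` so it runs on its own and compares it with `chunk_words_no_tail()`, a hypothetical variant that stops once a window reaches the end of the text.

```python
def chunk_words(text, n=120, overlap=20):
    # Same logic as the notebook's function
    words = text.split()
    out = []
    step = max(1, n - overlap)
    for i in range(0, len(words), step):
        out.append(" ".join(words[i:i+n]))
    return out

def chunk_words_no_tail(text, n=120, overlap=20):
    # Hypothetical variant: stop as soon as a window has reached the end of the text,
    # so no chunk consists only of words already covered by the previous chunk
    words = text.split()
    out = []
    step = max(1, n - overlap)
    for i in range(0, len(words), step):
        out.append(" ".join(words[i:i+n]))
        if i + n >= len(words):
            break
    return out

text = " ".join(f"w{i}" for i in range(105))  # a made-up 105-word document
print([len(c.split()) for c in chunk_words(text)])          # [105, 5]
print([len(c.split()) for c in chunk_words_no_tail(text)])  # [105]
```

For texts longer than one window the two functions agree except for that redundant final fragment, so the variant changes the number of chunks but not the coverage of the text.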

## Step 5: Generate Embeddings and Build the FAISS Index

Now that we have clean, chunked text data, we need to convert each chunk into a **dense vector embedding** — a numerical representation that captures the meaning of the text.

We use a pre-trained model from the `sentence-transformers` library called `all-MiniLM-L6-v2`. This model maps text into 384-dimensional vectors that are well suited to fast semantic similarity search.

Once the embeddings are created:
- We normalize them to unit length so that the **inner product equals cosine similarity**
- We store them in a **FAISS index**, which lets us quickly find the most similar chunks for any input query

In [10]:
# Load a lightweight sentence embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Extract only the chunk text from our mini_docs
chunk_texts = [d["text"] for d in mini_docs]

# Generate normalized embeddings for all chunks
X = embedder.encode(chunk_texts, normalize_embeddings=True).astype("float32")

# Create a FAISS index using inner product (≈ cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(X.shape[1])

# Add all embeddings to the index
index.add(X)

print("FAISS ready ✅  vectors:", index.ntotal, "dim:", X.shape[1])


FAISS ready ✅  vectors: 41956 dim: 384


### Line-by-Line Explanation

- `embedder = SentenceTransformer("all-MiniLM-L6-v2")`  
  Loads a pre-trained transformer model that converts text into dense semantic vectors. This particular model is lightweight and fast, making it ideal for prototyping.

- `chunk_texts = [d["text"] for d in mini_docs]`  
  Creates a list of the actual text chunks from the earlier `mini_docs` list. These are the inputs to our embedding model.

- `embedder.encode(..., normalize_embeddings=True)`  
  Converts each chunk of text into a 384-dimensional vector and normalizes it. Normalization ensures the vectors lie on a unit hypersphere, which makes **inner product** equivalent to **cosine similarity**.

- `.astype("float32")`  
  Ensures the embeddings are in the right format for FAISS (32-bit float array).

- `faiss.IndexFlatIP(X.shape[1])`  
  Creates a FAISS index with **inner product** as the similarity metric. Since our embeddings are normalized, this behaves like cosine similarity.

- `index.add(X)`  
  Adds all embeddings to the FAISS index. This index can now be queried to return the most relevant chunks given a new question.
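The claim that inner product matches cosine similarity on normalized vectors can be checked with a few lines of plain Python. The sketch below uses made-up 2-D vectors instead of real 384-dimensional embeddings, and `search()` is a hypothetical stand-in that mimics what an exhaustive inner-product index does: score every stored vector against the query and return the best `k`.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def search(query, vectors, k):
    """Exhaustive inner-product search: return the top-k (score, index) pairs."""
    scored = [(dot(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, reverse=True)[:k]

raw = [[3.0, 4.0], [1.0, 0.0], [-2.0, 1.0]]  # made-up 2-D "embeddings"
unit = [normalize(v) for v in raw]
q_raw = [6.0, 8.0]
q = normalize(q_raw)

# After normalization, the inner product agrees with cosine similarity of the raw vectors
for r, u in zip(raw, unit):
    assert abs(dot(q, u) - cosine(q_raw, r)) < 1e-9

print(search(q, unit, k=2))
```

The first result is index 0, because `[3, 4]` points in the same direction as the query `[6, 8]` even though the two vectors differ in length.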

## Step 6: Generate an Answer Using RAG

This is the most critical part of our project: **answering user queries using a combination of search + generation**.

We use a method called **Retrieval-Augmented Generation (RAG)**, which combines:
- **Retrieval**: Find relevant document chunks using a vector similarity search (FAISS)
- **Generation**: Use a language model (`gpt-5-mini` by default in this notebook) to answer the user’s question using only the retrieved context

### What makes this block important?
- This is where the **final chatbot response** is produced.
- It demonstrates the **“groundedness”** of GenAI: instead of hallucinating, the model uses **your own data** to answer.
- It also shows **real-world RAG practices** like prompt engineering, source citation, and an explicit "I don't have that information" instruction.

### What this function does (high level):
1. Encodes the user’s question and retrieves a pool of candidate chunks from FAISS
2. Selects the top-k chunks from that pool
3. Formats these chunks into a readable “context” block
4. Sends a carefully crafted prompt to the LLM, telling it to answer from context only
5. Returns both the **final answer** and the **supporting chunks**

Note that this version does not filter out very short chunks, substitute a fallback for empty answers, or catch API errors. Both short chunks and empty answers appear in the sample outputs later in this notebook.

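**Create the OpenAI client.** Name it `openai_client` rather than `client`, because the Hugging Face variant later in this notebook defines its own `client`.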





In [11]:
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-5-mini")  # optional: vault can set this too
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# sanity test
resp = openai_client.responses.create(
    model=OPENAI_MODEL,
    input="Reply with exactly: OK"
)

print(resp.output_text)


OK


In [18]:
# Main RAG answer function
def answer_with_rag(
    question: str,
    k: int = 3,
    pool: int = 30,
    temperature: float = 0.2,  # Accepted but unused: the current model does not support it
    max_tokens: int = 300,
):
    # Step 1: Embed the user question
    qv = embedder.encode([question], normalize_embeddings=True).astype("float32")

    # Step 2: Retrieve the `pool` most similar chunks from FAISS
    _, I = index.search(qv, pool)

    # Step 3: Keep the top-k chunks (no length filtering is applied)
    hits = [mini_docs[int(idx)] for idx in I[0][:k]]

    # Step 4: Format the selected chunks into a labeled context block
    context = "\n\n".join(
        f"[Doc {i+1}] (rid={h['rid']}, chunk={h['chunk_id']})\n{h['text']}"
        for i, h in enumerate(hits)
    )

    # Step 5: Define assistant instructions
    system_msg = (
        "You are a helpful customer support assistant.\n"
        "Use only the provided context to answer the user's question.\n"
        "If the answer is not available in the context, say: 'Sorry, I don't have that information.'\n"
        "Be polite, concise, and cite sources like [Doc 1], [Doc 2] when relevant."
    )

    # Step 6: Create final prompt to send to OpenAI
    user_prompt = f"""Context:
{context}

Question: {question}

Answer:"""

    # Step 7: Generate the response with the OpenAI Responses API
    resp = openai_client.responses.create(
        model=OPENAI_MODEL,
        input=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_prompt}
        ],
        # temperature is intentionally not passed: the current model does not support it
        max_output_tokens=max_tokens,
    )

    # Step 8: Extract and return the generated answer and source chunks
    return resp.output_text.strip(), hits

### Line-by-Line Explanation

### Function Signature

- `def answer_with_rag(question: str, k: int = 3, pool: int = 30, temperature: float = 0.2, max_tokens: int = 300):`
  - `question`: User's input query string.
  - `k`: Final number of top document chunks to use as context.
  - `pool`: Number of candidate chunks retrieved from FAISS; only the first `k` are used.
  - `temperature`: Controls randomness of model output (lower = more deterministic). It is accepted but not passed to the API, because the current model does not support it.
  - `max_tokens`: Maximum number of tokens the model is allowed to generate, passed to the API as `max_output_tokens`.

---

### Inside the function

1. `embedder.encode(...)`  
   - Converts the user query into a vector using the same embedding model used to encode chunks.
   - `normalize_embeddings=True`: Ensures cosine similarity works correctly.
   - `.astype("float32")`: FAISS requires float32 precision.

2. `index.search(qv, pool)`  
   - Queries the FAISS index to find the `pool` most similar document vectors.
   - Returns similarity scores and indices; the scores are discarded here (assigned to `_`).
   - `I`: indices of the matched chunks.

3. `hits = [mini_docs[int(idx)] for idx in I[0][:k]]`  
   - Takes the first `k` indices from the pool and looks up the corresponding chunks.
   - No length filter is applied, so very short chunks can be selected.

4. `context = "\n\n".join(...)`  
   - Builds a clearly labeled context block, adding `rid` and `chunk_id` for traceability.
   - Helps the model (and the user) understand where the text came from.

5. `system_msg = (...)`  
   - Instructs the model to behave like a support agent.
   - Encourages grounded, polite, and citation-backed responses.
   - Tells the model what to do *and* what not to do, including the exact sentence to use when the context lacks the answer.

6. `user_prompt = f"""Context: ... """`  
   - Embeds the context and user question in a structured format.
   - A consistent layout makes it easier for the model to separate the context from the question.

7. `resp = openai_client.responses.create(...)`  
   - Sends the prompt to the OpenAI Responses API.
   - `model`: Specifies the LLM to use (`OPENAI_MODEL`, which is `gpt-5-mini` unless the environment variable overrides it).
   - `input`: Follows the chat format (system + user messages).
   - `max_output_tokens`: Caps output length.
   - `temperature` is not passed because it is not available with `gpt-5-mini`.

8. `return resp.output_text.strip(), hits`  
   - Strips surrounding whitespace from the generated text and returns it with the chunks that were used.

---

### Result

Returns a tuple:
- `answer`: The final response from the LLM.
- `hits`: The list of document chunks used to build the context, useful for transparency/logging.

---

This function is the heart of your chatbot — responsible for combining *retrieval* and *generation* while keeping things safe, relevant, and understandable.
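The function above passes through whatever the model returns, including an empty string, and lets API errors propagate. One way to add a safety net without touching the function body is to wrap it. The sketch below is an assumption about how that could look, not part of the notebook: `with_fallback`, `FALLBACK`, and the 5-word threshold are hypothetical, and the two stand-in answer functions exist only so the example runs without an API key.

```python
FALLBACK = "Sorry, I don't have that information."

def with_fallback(answer_fn, min_words=5):
    """Wrap an answer function so very short answers and exceptions yield a fallback."""
    def wrapped(*args, **kwargs):
        try:
            answer, hits = answer_fn(*args, **kwargs)
        except Exception as e:  # e.g. a failed API call
            return f"Sorry, something went wrong ({type(e).__name__}).", []
        if len(answer.split()) < min_words:
            return FALLBACK, hits
        return answer, hits
    return wrapped

# Stand-in answer functions (hypothetical) that mimic the (answer, hits) return shape
def empty_answer(question):
    return "", [{"rid": 0, "chunk_id": 0, "text": "..."}]

def failing_answer(question):
    raise TimeoutError("simulated API failure")

print(with_fallback(empty_answer)("How do I cancel?")[0])
print(with_fallback(failing_answer)("How do I cancel?"))
```

In the notebook this would be applied as `safe_answer = with_fallback(answer_with_rag)`. The word-count threshold is a blunt heuristic: it also replaces legitimately short answers, so a lower value such as `min_words=1` may be preferable if only empty output should trigger the fallback.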

**Test Run**

In [19]:
answer, hits = answer_with_rag("What is the cancellation policy?")
print("ANSWER:\n", answer)

print("\nTOP MATCHES:")
for i, h in enumerate(hits, start=1):
    print(f"[Doc {i}] rid={h['rid']} chunk={h['chunk_id']}")
    print(h["text"][:200], "...\n")


ANSWER:
 Sorry, I don't have that information.

TOP MATCHES:
[Doc 1] rid=17522 chunk=1
you would like to know about the cancellation process? ...

[Doc 2] rid=407 chunk=2
the cancellation process. ...

[Doc 3] rid=43 chunk=2
cancellation process. ...
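All three matches in this run are fragments of only a few words, which gives the model almost nothing to answer from. A possible remedy is to walk the retrieved pool in rank order and keep only chunks above a minimum length. The sketch below is a hypothetical helper, not part of the notebook: `select_hits` and the 20-word threshold are assumptions, and the documents and ranking are made up to mirror the shape of `mini_docs` and `I[0]`.

```python
def select_hits(ranked_indices, docs, k=3, min_words=20):
    """Walk the ranked candidates and keep the first k chunks with at least min_words words."""
    hits = []
    for idx in ranked_indices:
        doc = docs[int(idx)]
        if len(doc["text"].split()) < min_words:
            continue  # skip fragments too short to carry an answer
        hits.append(doc)
        if len(hits) == k:
            break
    return hits

# Made-up documents and ranking, in the same shape as mini_docs and I[0]
docs = [
    {"rid": 0, "chunk_id": 1, "text": "cancellation process."},
    {"rid": 1, "chunk_id": 0, "text": "instruction: cancel order | response: " + "word " * 30},
    {"rid": 2, "chunk_id": 2, "text": "the cancellation process."},
    {"rid": 3, "chunk_id": 0, "text": "instruction: refund policy | response: " + "word " * 30},
]
ranked = [0, 2, 1, 3]

print([h["rid"] for h in select_hits(ranked, docs, k=2)])  # [1, 3]
```

This is where the `pool` parameter becomes useful: retrieving more candidates than `k` leaves room to discard short fragments and still end up with `k` usable chunks.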



## Step 7: Try It Live — Chat with Your RAG Bot

Let’s bring everything together into an interactive experience.

This function lets you chat with your Retrieval-Augmented Generation (RAG) bot directly in the notebook.

Here’s how it works:
- You type a question
- It searches for the most relevant document chunks
- It sends those to the language model
- It shows you the assistant's response **and** the chunks it used to answer

To end the chat, simply type `exit` or `quit`.

In [20]:
# Simple chat loop to interact with your RAG-powered assistant
def chat():
    print("\nChat with the RAG bot (type 'exit' to stop):")

    while True:
        # Take user input
        question = input("You: ").strip()

        # Exit loop if user types exit
        if question.lower() in {"exit", "quit"}:
            break

        # Generate answer using RAG
        answer, hits = answer_with_rag(question)

        # Display the model's answer
        print("\nAssistant:", answer)

        # Show the top matching document chunks that were used
        print("\nTop Matches:")
        for i, h in enumerate(hits, start=1):
            print(f"[Doc {i}] rid={h['rid']} | chunk={h['chunk_id']} | {h['text'][:120]}...")
        print()

chat()


Chat with the RAG bot (type 'exit' to stop):
You: Hi how do I cancel

Assistant: 

Top Matches:
[Doc 1] rid=153 | chunk=1 | click on it to view the details. 4. initiate the cancellation: within the purchase details, you should find an option la...
[Doc 2] rid=285 | chunk=1 | you should see an option labeled ''. please select this to start the cancellation process. 5. follow any additional inst...
[Doc 3] rid=729 | chunk=1 | an option labeled '' next to your purchase. please select this to begin the cancellation process. 5. follow any further ...

You: Ok thanks for the answers

Assistant: 

Top Matches:
[Doc 1] rid=14999 | chunk=1 | help or clarification....
[Doc 2] rid=22328 | chunk=1 | clarification or assistance....
[Doc 3] rid=26820 | chunk=1 | in advance....

You: exit


### Line-by-Line Explanation

- `def chat():`  
  Defines a function to start an interactive chat loop.

- `print("\nChat with the RAG bot ...")`  
  Displays welcome instructions to the user.

- `while True:`  
  Runs an infinite loop until the user types `exit` or `quit`.

- `question = input("You: ")`  
  Prompts the user for a question.

- `if question.lower() in {"exit", "quit"}:`  
  Checks if the user wants to exit the chat.

- `answer, hits = answer_with_rag(question)`  
  Passes the question to the RAG pipeline to get the model’s response and supporting chunks.

- `print("\nAssistant:", answer)`  
  Prints the final answer returned by the language model.

- `for i, h in enumerate(hits, start=1):`  
  Iterates over the top-k matched chunks.

- `print(f"[Doc {i}] rid=...")`  
  Displays the document/chunk metadata and a preview of the content used to generate the answer.

This makes the experience interactive and fully transparent — users can see both the **generated answer** and the **retrieved sources**.

---
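## Step 7 (Alternative): Use a Hugging Face Model

This variant redefines `answer_with_rag()` so that generation goes through the Hugging Face `InferenceClient` instead of OpenAI. Retrieval with the embedder and FAISS index is unchanged.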

In [21]:
import os
from huggingface_hub import InferenceClient

HF_TOKEN = os.getenv("HF_TOKEN")
if not HF_TOKEN:
    raise RuntimeError("HF_TOKEN not found in environment variables")

# Models to try (first working one wins)
HF_CANDIDATE_MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.2",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
]

def pick_hf_model():
    test_prompt = "Reply with exactly: OK"
    for m in HF_CANDIDATE_MODELS:
        try:
            c = InferenceClient(model=m, token=HF_TOKEN)
            out = c.text_generation(
                test_prompt,
                max_new_tokens=5,
                return_full_text=False
            )
            if "OK" in out:
                print("Using HF model:", m)
                return m
        except Exception as e:
            # Report why this candidate failed, then try the next one
            print(f"Skipping {m}: {type(e).__name__}")
            continue
    raise RuntimeError(
        "No Hugging Face router-supported model worked for your enabled providers. "
        "Use Groq/Gemini/OpenAI, or deploy an HF Inference Endpoint."
    )

LLM_MODEL = pick_hf_model()
client = InferenceClient(model=LLM_MODEL, token=HF_TOKEN)

def answer_with_rag(question: str, k: int = 3, pool: int = 30, max_new_tokens: int = 300):
    # Embed question
    qv = embedder.encode([question], normalize_embeddings=True).astype("float32")

    # Retrieve
    _, I = index.search(qv, pool)
    hits = [mini_docs[int(idx)] for idx in I[0][:k]]

    # Context
    context = "\n\n".join([
        f"[Doc {i+1}] (rid={h['rid']}, chunk={h['chunk_id']})\n{h['text']}"
        for i, h in enumerate(hits)
    ])

    prompt = f"""You are a helpful customer support assistant.
Use only the provided context to answer the question.
If the answer is not in the context, say: "Sorry, I don't have that information."
Cite sources like [Doc 1], [Doc 2] when possible.

Context:
{context}

Question: {question}

Answer:
"""

    out = client.text_generation(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,          # deterministic for RAG
        return_full_text=False
    )

    return out.strip(), hits


RuntimeError: No Hugging Face router-supported model worked for your enabled providers. Use Groq/Gemini/OpenAI, or deploy an HF Inference Endpoint.

In [None]:
def chat():
    print("\nChat with the Hugging Face RAG bot (type 'exit' to stop):")

    while True:
        question = input("You: ").strip()
        if question.lower() in {"exit", "quit"}:
            break

        answer, hits = answer_with_rag(question)

        print("\nAssistant:", answer)
        print("\nTop Matches:")
        for i, h in enumerate(hits, start=1):
            print(f"[Doc {i}] rid={h['rid']} | chunk={h['chunk_id']}")
            print(h['text'][:200].strip(), "...\n")

In [None]:
chat()