
# 🧠 Retrieval-Augmented Generation (RAG) Q&A Chatbot

This notebook implements a lightweight Retrieval-Augmented Generation (RAG) pipeline using:
- **FAISS** for document retrieval
- **MiniLM** embeddings from Hugging Face
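- **Hugging Face Transformers** (`distilbert-base-uncased`) as a lightweight extractive question-answering reader, which selects an answer span from the retrieved context rather than generating new text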

The chatbot answers questions using context retrieved from a custom knowledge base created from the provided **Training Dataset.csv**.


In [None]:

!pip install faiss-cpu sentence-transformers transformers datasets --quiet


In [None]:

import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import numpy as np
import torch


In [None]:

# Load the Training Dataset
df = pd.read_csv("Training Dataset.csv")

# Show basic info
df.head()


In [None]:

# Convert rows to a text corpus (e.g., concatenating all columns)
corpus = df.apply(lambda row: " | ".join([str(cell) for cell in row]), axis=1).tolist()
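
Joining raw cell values drops the column names, so a retrieved row such as `42 | Pune | 0.87` gives the reader no hint about which number is the age and which is the score. One possible variation, sketched here with a hypothetical `row_to_text` helper and made-up column names (the real columns of the dataset are not shown in this notebook), keeps each value next to its column name:

```python
def row_to_text(row):
    """Render one record as 'column: value' pairs so each value keeps its column name."""
    return " | ".join(f"{column}: {value}" for column, value in row.items())

# Toy record with made-up column names; with the notebook's DataFrame the
# records could come from df.to_dict("records").
example_row = {"age": 42, "city": "Pune", "score": 0.87}
example_text = row_to_text(example_row)
print(example_text)
```

Whether this helps retrieval depends on the data, so treat it as an option to try rather than a required change.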


In [None]:

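# Load the MiniLM sentence encoder
model = SentenceTransformer('all-MiniLM-L6-v2')
# Embed every corpus row; encode() returns a NumPy array of shape (n_rows, embedding_dim)
embeddings = model.encode(corpus, show_progress_bar=True)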


In [None]:

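# Build an exact (brute-force) L2 index over the row embeddings
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
# FAISS expects float32 input, so make the dtype explicit
index.add(np.asarray(embeddings, dtype="float32"))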
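
For intuition about what a flat L2 index computes, the search can be sketched in plain NumPy: compare the query against every stored vector and keep the closest ones. This is an illustration only, not how FAISS is implemented, and `brute_force_l2_search` is a hypothetical name. Like `IndexFlatL2`, it reports squared distances.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, top_k):
    """Return (squared distances, indices) of the top_k nearest stored vectors per query."""
    vectors = np.asarray(vectors, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Pairwise squared Euclidean distances, shape (n_queries, n_vectors).
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Sort each row by distance and keep the top_k closest.
    order = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Three toy 2-D "embeddings" and one query.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = brute_force_l2_search(toy_vectors, [[0.9, 0.1]], top_k=2)
print(toy_indices)
print(toy_distances)
```

The two returned arrays play the same roles as the distance and index arrays that `index.search` returns: one row per query, one column per neighbor, nearest first.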


In [None]:

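# Note: distilbert-base-uncased has no fine-tuned QA head; a SQuAD-tuned checkpoint such as distilbert-base-uncased-distilled-squad should answer more reliably
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased", tokenizer="distilbert-base-uncased")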


In [None]:

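def rag_qa(user_query, top_k=5):
    """Retrieve the top_k most similar rows and extract an answer span from them."""
    query_embedding = np.asarray(model.encode([user_query]), dtype="float32")
    # D holds squared L2 distances, I holds row indices into corpus
    D, I = index.search(query_embedding, top_k)
    # FAISS pads the result with -1 when fewer than top_k vectors are indexed
    retrieved_docs = [corpus[i] for i in I[0] if i >= 0]
    context = " ".join(retrieved_docs)
    answer = qa_pipeline(question=user_query, context=context)
    return answer['answer']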
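
The question-answering pipeline returns a dictionary that includes a `score` alongside the `answer`, and `rag_qa` currently discards it. A minimal sketch of using that score to abstain on weak matches follows; the `answer_or_abstain` helper and the `0.3` threshold are assumptions for illustration, not tuned values.

```python
def answer_or_abstain(result, min_score=0.3):
    """Return the extracted answer, or None when the reader's score is below min_score."""
    if result["score"] < min_score:
        return None
    return result["answer"]

# Hand-written dictionaries standing in for pipeline output (illustrative values only).
confident = {"answer": "42", "score": 0.91, "start": 10, "end": 12}
uncertain = {"answer": "the", "score": 0.02, "start": 0, "end": 3}
confident_answer = answer_or_abstain(confident)
uncertain_answer = answer_or_abstain(uncertain)
print(confident_answer, uncertain_answer)
```

In the notebook this could wrap the pipeline result before returning it, so that callers can distinguish "no confident answer" from a real one.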


In [None]:

# Example usage
user_question = "What is the age of the customer with the highest score?"
answer = rag_qa(user_question)
print("Answer:", answer)



### 📌 Notes:
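- This notebook uses `distilbert-base-uncased` as a small open model. Because this base checkpoint is not fine-tuned for question answering, a SQuAD-tuned checkpoint such as `distilbert-base-uncased-distilled-squad` is likely to give more meaningful answers. With more compute you can swap in a larger model, or call the OpenAI or Gemini APIs, by modifying the `rag_qa` function.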
- The knowledge base is created from the Training Dataset CSV, converted into a text corpus row-wise.
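
If the reader is swapped for a generative model or a hosted API, the retrieved rows are usually passed in as part of a prompt instead of as a `context` argument. The sketch below shows one way to assemble such a prompt; `build_prompt`, its wording, and the toy rows are illustrative assumptions, and the call to the actual model is left out because it depends on the provider chosen.

```python
def build_prompt(question, retrieved_docs, max_docs=5):
    """Assemble a prompt that asks a generative model to answer from the retrieved rows only."""
    # Number each retrieved row so the model (and the reader) can tell them apart.
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs[:max_docs], start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Toy rows standing in for the documents retrieved from the index.
docs = ["age: 42 | score: 0.87", "age: 35 | score: 0.52"]
prompt = build_prompt("Which age has a score of 0.87?", docs)
print(prompt)
```

Inside `rag_qa`, the string returned here would replace the `qa_pipeline(question=..., context=...)` call as the input to whichever generative model is used.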
