## Task: Build a Campus FAQ Chatbot using RAG

### Objective:
Learn how Retrieval-Augmented Generation (RAG) works by building a small chatbot that answers questions about your college using vector embeddings and a mini vector database.

#### Step 0: Setup

1. Install required packages:

In [None]:
%%capture install_log
# Hide the installation progress. The output is kept in `install_log`; call install_log.show() to see what happened.
%pip install streamlit sentence-transformers faiss-cpu numpy pypdf2


#### Step 1: Prepare the Data

Task: Create a small FAQ dataset with at least 5 Q&A pairs.
Example:

Q: When does the library open?
A: The library opens at 8 AM and closes at 8 PM.
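
One way to hold such a dataset is a plain Python list of question/answer pairs. In the sketch below only the library entry comes from the example above; the other questions and the `<placeholder>` answers are hypothetical and should be replaced with real facts about your college.

```python
# FAQ dataset as a list of (question, answer) pairs.
# Only the first entry comes from the task description; the rest are
# hypothetical placeholders to be replaced with real campus facts.
faq = [
    ("When does the library open?",
     "The library opens at 8 AM and closes at 8 PM."),
    ("Where is the admissions office?", "<placeholder>"),
    ("How do I register for courses?", "<placeholder>"),
    ("When does the cafeteria serve lunch?", "<placeholder>"),
    ("Who do I contact about IT problems?", "<placeholder>"),
]

# One string per Q&A pair, so each pair can later be embedded as one chunk.
faq_chunks = [f"Q: {question} A: {answer}" for question, answer in faq]

print(len(faq_chunks))
print(faq_chunks[0])
```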

In [None]:
import PyPDF2 as pdf2 # PDF handling
import numpy as np
import streamlit as st # Front end
import re # Regular expressions
import faiss # Vector index for the embeddings
from sentence_transformers import SentenceTransformer

In [None]:
# Data Extraction - Collection - Gathering

# Import the BoK from the PDF
text = ''
with open('./NeuralNetwork.pdf', 'rb') as nn:
    reader = pdf2.PdfReader(nn)
    # Concatenate the extracted text of every page
    text = ' '.join(page.extract_text() for page in reader.pages)

print(text[0:500])

Checkpoint:

Students should have a list of questions and answers ready.

#### Step 2: Split Text into Chunks

Task: Split your FAQ into separate lines to treat each Q&A as a chunk.

In [None]:
# Data Processing - Cleaning 

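#pattern = r'\w+\s*\(.\):\s*(.*?)(?=\w+\s*\(.\):|$)'
# Capture the text between an "RN-<number> | " prefix and the following " ID:" marker
pattern = r'RN-\d+\s+\|\s(.*?)(?=\s+ID:)'

# Flatten newlines and tabs first, because '.' does not match newlines by default
text = text.strip().replace('\n', ' ').replace('\t', ' ')
chunks = re.findall(pattern, text)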

In [None]:
chunks[:5]

Checkpoint:

Ensure each Q&A is a separate element in a Python list.
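
To see what the `RN-...` pattern extracts, it can be tried on a short made-up string. The record layout (`RN-<number> | ... ID:`) is inferred from the regular expression itself; the sample text is hypothetical and does not come from the PDF.

```python
import re

# Same pattern as in the cleaning cell: capture the text between
# "RN-<digits> | " and the following " ID:" marker.
pattern = r'RN-\d+\s+\|\s(.*?)(?=\s+ID:)'

# Hypothetical sample in the layout the pattern expects (not real PDF content).
sample = (
    "RN-001 | Q: What is a neuron?\nA: A unit that computes a weighted sum. ID: a1\n"
    "RN-002 | Q: What is the chain rule?\tA: A rule for differentiating compositions. ID: a2"
)

# '.' does not match newlines by default, so flatten whitespace first.
flat = sample.strip().replace('\n', ' ').replace('\t', ' ')
demo_chunks = re.findall(pattern, flat)

for chunk in demo_chunks:
    print(chunk)
```

Each element of `demo_chunks` is one Q&A pair with the record prefix and the trailing ID stripped, which is the shape the checkpoint asks for.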

#### Step 3: Create Embeddings

Task: Convert each line to a vector using SentenceTransformer.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)

In [None]:
embeddings[0]

#### Step 4: Build the FAISS Index

Task: Store all embeddings in a FAISS vector database.

> - The `all-MiniLM-L6-v2` model is an efficient option for generating sentence embeddings, which are numerical representations of text. Each embedding is a 384-dimensional vector that captures the semantic meaning of the text, so similar texts can be compared and retrieved by measuring the distance between their vectors.
> - `faiss.IndexFlatL2` is a flat index: it stores the vectors as they are and performs an exact, exhaustive search using L2 (Euclidean) distance.

In [None]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
print(f'Dimension for each embedding of text: {dimension}')
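
Conceptually, a flat L2 index compares the query with every stored vector and returns the closest ones. The following is a minimal pure-Python sketch of that idea, not FAISS's actual implementation; `flat_l2_search` and the 3-dimensional toy vectors are hypothetical. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
def flat_l2_search(vectors, query, k=1):
    """Exhaustive nearest-neighbour search using squared L2 distance.

    Returns (distances, indices) for the k closest vectors, nearest first.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 3-dimensional "embeddings" (real ones from all-MiniLM-L6-v2 have 384).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.0, 0.5, 1.0]

distances, indices = flat_l2_search(toy_vectors, toy_query, k=2)
print(indices)    # positions of the two nearest vectors
print(distances)  # their squared L2 distances to the query
```

Because every stored vector is visited, the result is exact, but the work per query grows with the size of the collection.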

#### Step 5: Query the Database
Task: Take a user question, convert it to a vector, and find the most relevant FAQ line.

In [None]:
user_question = "Explain the chain rule"
q_emb = model.encode([user_question])
D, I = index.search(np.array(q_emb), k=1)
print(chunks[I[0][0]])

#### Step 6: Make it Interactive with Streamlit
Task: Use Streamlit to create a simple chatbot UI.

In [None]:
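# Streamlit does not render inside a notebook cell: save this code, together with the
# model, index and chunks setup, as a script (e.g. app.py) and launch it with `streamlit run app.py`.
st.title("Neural Networks BOK")
user_question = st.text_input("Ask your question:")
if user_question:
    q_emb = model.encode([user_question])
    D, I = index.search(np.array(q_emb), k=1)
    st.write("Answer:", chunks[I[0][0]])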

#### Step 7: Reflection

Questions for students:

##### **How does the chatbot “understand” the question?**

To better understand this, it is necessary to explain the whole RAG (Retrieval-Augmented Generation) process.

1. **The backstage part:**  
   The BoK (Body of Knowledge) is constructed from a source document that is extracted, cleaned, and split into smaller chunks.  
   For this example, Q&A pairs are used as chunks.

2. **Embeddings creation:**  
   Each chunk is converted into a vector using a pre-trained model called *SentenceTransformer – all-MiniLM-L6-v2*.  
   This model generates embeddings that capture the semantic meaning of the text in **384 dimensions**.

3. **Vector database:**  
   These embeddings are stored in a **FAISS** vector database, which allows efficient similarity search.  
   FAISS assigns each embedding an ID in the order it was added, so the ID returned by a search is also the position of the matching chunk in the `chunks` list.

4. **Question processing:**  
   When the user inputs a question, it is also converted into an embedding using the same SentenceTransformer model.  
   The system then searches for the most similar embeddings within the FAISS index to retrieve the most relevant chunk(s) of text.



##### **What happens if the user asks something not in the FAQ?**

In that case, the application will still return the most similar vector stored in the FAISS database, even if it’s not relevant to the user’s query.  
Keep in mind that the retrieval process relies on **L2 (Euclidean) distance** to calculate similarity.  
Therefore, while the system always returns the “closest” vector, that doesn’t necessarily mean the answer will accurately address the user’s question — it’s simply the best match numerically.
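
One common mitigation is to reject matches whose distance exceeds a cut-off and return a fallback message instead. The sketch below assumes inputs shaped like the `D` and `I` results of `index.search` in Step 5; the helper name `answer_or_fallback`, the fallback wording, and the threshold value `1.0` are hypothetical, and a real cut-off would need tuning on sample questions.

```python
FALLBACK = "Sorry, I could not find an answer to that question."


def answer_or_fallback(distances, indices, chunks, max_distance=1.0):
    """Return the best chunk only if it is close enough to the query.

    distances, indices: nested lists or arrays shaped like the D and I
    results of index.search(query, k), i.e. one row per query.
    max_distance: hypothetical cut-off on the squared L2 distance.
    """
    best_distance = distances[0][0]
    best_index = indices[0][0]
    # FAISS reports -1 as the index when it has no result to return.
    if best_index < 0 or best_distance > max_distance:
        return FALLBACK
    return chunks[best_index]


demo_chunks = ["The library opens at 8 AM and closes at 8 PM."]
print(answer_or_fallback([[0.35]], [[0]], demo_chunks))  # close match
print(answer_or_fallback([[1.80]], [[0]], demo_chunks))  # too far away
```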



##### **How could you improve this system to handle more questions or longer documents?**

- **Optimize similarity search:**  
  `IndexFlatL2` compares the query against every stored vector, which works well for small datasets, but search time grows linearly with the number of vectors.  
  In larger collections, approximate indexes such as **HNSW (Hierarchical Navigable Small World graphs)** or **IVF (Inverted File Index)** can make search much faster, usually at the cost of a small loss in accuracy.

- **Integrate a Large Language Model (LLM):**  
  Adding an LLM to the pipeline (e.g., using a *retrieval + generation* approach) makes the system more robust.  
  The LLM can refine, combine, or even generate new answers based on the retrieved chunks, reducing irrelevant or incomplete responses.

- **Add metadata filtering:**  
  Include contextual metadata (e.g., topic, source, date) to allow filtered retrieval, ensuring only relevant document sections are compared.  
  This reduces noise and improves the quality of the retrieved results.

- **Use hybrid search (semantic + keyword):**  
  Combine vector similarity with keyword-based search (like **BM25**).  
  This hybrid approach balances semantic understanding with lexical precision, improving relevance for factual or domain-specific queries. (This recommendation was AI-generated.)

- **Scale document processing:**  
  For longer documents, apply **hierarchical chunking** — dividing texts into sections and sub-sections — and store embeddings at different levels of granularity.  
  This allows retrieval at the most contextually appropriate level and supports better scalability.
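
As an illustration of the *retrieval + generation* idea from the list above, the retrieved chunks can be placed into a prompt for an LLM. This is only a sketch: `build_prompt` and the prompt wording are hypothetical, and the call to an actual LLM is deliberately left out because the notebook does not specify one.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble an LLM prompt that grounds the answer in retrieved text."""
    # Number the chunks so the model (and the reader) can refer to them.
    context = "\n".join(
        f"[{n}] {chunk}" for n, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# In the notebook, these would be the top-k chunks returned by index.search.
top_chunks = ["The library opens at 8 AM and closes at 8 PM."]
prompt = build_prompt("When does the library open?", top_chunks)
print(prompt)
```

Retrieving several chunks (`k` greater than 1) and passing them all as context gives the model more material to combine into a single answer.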


