## Task: Build a Campus FAQ Chatbot using RAG

### Objective:
Learn how Retrieval-Augmented Generation (RAG) works by building a small chatbot that answers questions about your college using vector embeddings and a mini vector database.

#### Step 0: Setup

1. Install required packages:

In [None]:
%%capture install_log
# Hide the installation progress; it is stored in `install_log` (view it with install_log.show()) in case it is needed later.
%pip install streamlit sentence-transformers faiss-cpu numpy pypdf2


#### Step 1: Prepare the Data

Task: Create a small FAQ dataset with at least 5 Q&A pairs.
Example:

Q: When does the library open?
A: The library opens at 8 AM and closes at 8 PM.
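
One way to hold such a dataset in Python is a list of (question, answer) pairs. The sketch below parses plain `Q:`/`A:` text into that shape. The function name `parse_faq` and the sample text are illustrative assumptions, not part of the original notebook, which loads its text from a PDF instead.

```python
def parse_faq(raw_text):
    """Turn alternating 'Q: ...' / 'A: ...' lines into (question, answer) pairs."""
    pairs = []
    question = None
    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None  # wait for the next question before accepting an answer
    return pairs


# The first pair is the example above; the second is a placeholder to fill in.
sample = """
Q: When does the library open?
A: The library opens at 8 AM and closes at 8 PM.
Q: <your second question>
A: <your second answer>
"""

faq_pairs = parse_faq(sample)
print(faq_pairs[0])
```

Keeping each pair together makes it easy to treat one Q&A as one chunk in the next step.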

In [None]:
import PyPDF2 as pdf2 # PDF handling
import numpy as np
import streamlit as st # Front end
import re # Regular expressions
import faiss # Vector index for the embeddings
from sentence_transformers import SentenceTransformer

In [None]:
# Data Extraction - Collection - Gathering

# Import the BoK from the PDF
text = ''
with open('./NeuralNetwork.pdf', 'rb') as pdf_file:
    reader = pdf2.PdfReader(pdf_file)
    # extract_text() may yield nothing for image-only pages, so fall back to ''
    text = ' '.join(page.extract_text() or '' for page in reader.pages)

print(text[0:500])

Checkpoint:

Students should have a list of questions and answers ready.

#### Step 2: Split Text into Chunks

Task: Split your FAQ into separate lines to treat each Q&A as a chunk.

In [None]:
# Data Processing - Cleaning 

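# Earlier pattern, kept for reference:
#pattern = r'\w+\s*\(.\):\s*(.*?)(?=\w+\s*\(.\):|$)'
# Capture the text that follows "RN-<number> | " up to (but not including) the next " ID:"
pattern = r'RN-\d+\s+\|\s(.*?)(?=\s+ID:)'

# Collapse newlines and tabs into spaces so a chunk can span several original lines
text = text.strip().replace('\n', ' ').replace('\t', ' ')
chunks = re.findall(pattern, text)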

In [None]:
chunks[:5]

Checkpoint:

Ensure each Q&A is a separate element in a Python list.
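
To see what the pattern in the cell above extracts, it can be tried on a short string. The sample below is hypothetical: it only imitates the `RN-<number> | ... ID: ...` layout that the pattern assumes, and its contents are made up for illustration.

```python
import re

# Same pattern as in the cell above: capture the text between "RN-<n> | " and " ID:".
pattern = r'RN-\d+\s+\|\s(.*?)(?=\s+ID:)'

# Hypothetical sample imitating the assumed layout of the extracted PDF text.
sample_text = (
    "RN-1 | What is a neuron? A unit that sums weighted inputs. ID: a1 "
    "RN-2 | What is the chain rule? A rule for differentiating compositions. ID: b2"
)

sample_chunks = re.findall(pattern, sample_text)
for chunk in sample_chunks:
    print(chunk)
```

Each element of `sample_chunks` holds one question together with its answer, which is the shape the checkpoint asks for.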

#### Step 3: Create Embeddings

Task: Convert each line to a vector using SentenceTransformer.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)  # `chunks` is the list built in Step 2

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)

In [None]:
embeddings[0]

#### Step 4: Build the FAISS Index

Task: Store all embeddings in a FAISS vector database.

> - The `all-MiniLM-L6-v2` model is an efficient option for generating sentence embeddings, which are numerical representations of text. Each embedding is a 384-dimensional vector that captures the semantic meaning of the text, so texts with similar meaning end up close together and can be compared and retrieved by distance.
> - `faiss.IndexFlatL2` is an exact (brute-force) index: it stores the vectors as they are and compares a query against every one of them using L2 (Euclidean) distance.

In [None]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
print(f'Dimension for each embedding of text: {dimension}')
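
Conceptually, a flat L2 index compares the query with every stored vector and keeps the closest ones. The pure-Python sketch below illustrates that idea on toy 2-D vectors. It is not how FAISS is implemented, and `flat_l2_search` is a hypothetical helper name. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
def flat_l2_search(stored_vectors, query, k=1):
    """Exhaustive nearest-neighbour search using squared Euclidean distance."""
    scored = []
    for position, vector in enumerate(stored_vectors):
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, position))
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-D vectors standing in for 384-dimensional embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = flat_l2_search(toy_vectors, query=[0.9, 1.2], k=2)
print(nearest)  # list of (squared distance, position) pairs
```

Because every stored vector is visited, the result is exact, but the work grows linearly with the number of vectors.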

#### Step 5: Query the Database
Task: Take a user question, convert it to a vector, and find the most relevant FAQ line.

In [None]:
user_question = "Explain the chain rule"
q_emb = model.encode([user_question])
D, I = index.search(np.array(q_emb), k=1)
print(chunks[I[0][0]])

#### Step 6: Make it Interactive with Streamlit
Task: Use Streamlit to create a simple chatbot UI.

In [None]:
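# Note: Streamlit widgets only render when this code is run as a script with `streamlit run`, not inside a notebook cell.
st.title("Neural Networks BOK")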
user_question = st.text_input("Ask your question:")
if user_question:
    q_emb = model.encode([user_question])
    D, I = index.search(np.array(q_emb), k=1)
    st.write("Answer:", chunks[I[0][0]])

#### Step 7: Reflection

Questions for students:

**How does the chatbot “understand” the question?**
To understand this, it helps to walk through the whole RAG process.
1. Behind the scenes: the BoK is built from a document that is extracted, cleaned, and split into chunks. In this example, each Q&A pair is one chunk.
2. Embedding creation: each chunk is converted into a vector using a pre-trained SentenceTransformer model, `all-MiniLM-L6-v2`. The model generates embeddings that capture the semantic meaning of the text in 384 dimensions.
3. Vector database: the embeddings are stored in a FAISS index, which supports efficient similarity search. The embeddings are indexed in the same order as the chunks, so a position returned by the index maps directly back to a chunk.
4. Answering the question: the user's question is converted into an embedding with the same SentenceTransformer model, the index returns the position of the stored embedding closest to it, and the chunk at that position is shown as the answer.

So the chatbot does not understand the question in a human sense; it matches the question to the stored text whose embedding is nearest.

**What happens if the user asks something not in the FAQ?**
In that case, the application still returns the chunk whose embedding is closest to the question, even though it may not be relevant. Retrieval here is based on L2 (Euclidean) distance with `k=1`, so the nearest stored vector is always returned, however far away it is, and the answer may not address the user's query.
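
One possible mitigation is to reject matches whose distance is too large. The sketch below assumes `distances` and `indices` have the nested shape returned by `index.search(..., k=1)`. The helper name `pick_answer` and the cutoff `max_distance=1.0` are hypothetical, and a real cutoff would need tuning on actual questions.

```python
def pick_answer(distances, indices, chunks, max_distance=1.0):
    """Return the best chunk, or a fallback message when the match is too far away."""
    best_distance = distances[0][0]
    best_position = indices[0][0]
    # FAISS reports -1 as the position when it has no result to return.
    if best_position < 0 or best_distance > max_distance:
        return "Sorry, I could not find an answer to that in the FAQ."
    return chunks[best_position]


demo_chunks = ["The library opens at 8 AM and closes at 8 PM."]
print(pick_answer([[0.4]], [[0]], demo_chunks))  # close match: chunk is returned
print(pick_answer([[1.8]], [[0]], demo_chunks))  # distant match: fallback message
```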

**How could you improve this system to handle more questions or longer documents?**
- `IndexFlatL2` performs an exact, exhaustive search: it works well on a small dataset, but search time grows with the number of stored vectors. For larger datasets, approximate indexes such as HNSW or IVF can reduce search time considerably, usually at the cost of a small loss in accuracy.
- LLM integration: a more robust solution, since a Large Language Model can refine or disregard the retrieved context, or generate a new answer based on it.
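
For the LLM integration idea, a common first step is to retrieve several chunks (for example with `k=3` in `index.search`) and place them in a prompt together with the question. The sketch below only builds that prompt string. The helper name `build_prompt` and the prompt wording are assumptions, and the call to an actual LLM is deliberately left out.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "When does the library open?",
    ["The library opens at 8 AM and closes at 8 PM."],
)
print(prompt)
```

Asking the model to admit when the context lacks the answer also helps with questions that are not covered by the FAQ.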