# RAG personal bot

Exercise for week 5 of the LLM Engineering course.

This notebook creates a personal RAG bot. It uses the ./kb directory to store the files that we want to include in the RAG. Subdirectories denote the categories of the files.
**Important: only one level of subdirectories will be used for the categories**
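
As a rough illustration of this layout, the sketch below lists the category names that a one-level `kb` layout would yield. The helper name `list_categories` is hypothetical and not part of the notebook; it only uses the standard library.

```python
from pathlib import Path

def list_categories(kb_root):
    """Return the sorted names of the immediate subdirectories of kb_root."""
    root = Path(kb_root)
    if not root.is_dir():
        return []
    # Only direct children count as categories; files and deeper levels are ignored
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For example, a `kb` folder containing `work/`, `personal/` and a stray `notes.txt` would give the categories `personal` and `work`.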

It uses LangChain to create and process the RAG pipeline and chat.
The vector database's persistent store is in the ./vdb folder.

In this version we use chromadb for the vector store.
The store is recreated each run. This is not efficient for large datasets. 

Future upgrades - To Do (in no particular order): 
- [X] Create a fully local version for security and privacy
- [ ] Create persistent data store - only load, chunk and embed changed documents. 
- [ ] Provide a selection of vector db engines (Chroma DB as default, or connect to an external vector db, e.g. Elasticsearch or Amazon OpenSearch)
- [ ] Add an interface to upload documents to the data store - including user-defined metadata tags
- [ ] Add more document data types
- [ ] Add online search capability - use web crawler tool to crawl a website and create website-specific RAG bot
- [ ] Read e-mails/calendars/online docs (Amazon S3 bucket, Google Drive)
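
One possible starting point for the "only load, chunk and embed changed documents" item is to keep a manifest of content hashes between runs. The sketch below is an assumption about how that could work, not something the notebook implements; `file_digest`, `find_changed_files`, `save_manifest` and the manifest file are all hypothetical names.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_changed_files(kb_root, manifest_path):
    """Compare current file hashes with a saved manifest.

    Returns (changed, removed, current): paths that are new or modified,
    paths that disappeared, and the fresh {path: digest} mapping.
    """
    manifest_file = Path(manifest_path)
    previous = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current = {
        p.as_posix(): file_digest(p)
        for p in sorted(Path(kb_root).rglob("*"))
        if p.is_file()
    }
    changed = [path for path, digest in current.items() if previous.get(path) != digest]
    removed = [path for path in previous if path not in current]
    return changed, removed, current

def save_manifest(manifest_path, current):
    """Persist the {path: digest} mapping for the next run."""
    Path(manifest_path).write_text(json.dumps(current, indent=2))
```

Only the paths in `changed` would then need to be reloaded and re-embedded, and the chunks belonging to `removed` paths deleted from the store. The manifest should live outside the `kb` folder so it is not hashed itself.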



In [None]:
# These were necessary as langchain does not install them by default
!pip install pypdf
!pip install pdfminer.six
!pip install python-docx
!pip install docx2txt

In [None]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

# imports for langchain, plotly and Chroma
# plotly is commented out, as it is not used in the current code

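# The loaders live in langchain_community; importing them from langchain.document_loaders is deprecated
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader, Docx2txtLoader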
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
#import matplotlib.pyplot as plt
#from sklearn.manifold import TSNE
#import numpy as np
#import plotly.graph_objects as go
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
# from langchain.embeddings import HuggingFaceEmbeddings

In [None]:
MODEL = "gpt-4o-mini"
db_name = "vdb"

In [None]:
# Load environment variables from a file called .env

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')


## Loading the documents
In the code below we read in the KB documents and split them into chunks, ready for the vector store.
We load PDF documents, Word documents and text/markdown documents.
Each document type has its own loader, which we call separately through DirectoryLoader.
At the end, we combine the results and split the documents using the Recursive Character Text Splitter.
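
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of a fixed-size sliding window. It is not the algorithm LangChain uses: the Recursive Character Text Splitter prefers to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size pieces where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `chunk_overlap=2`, the text `"abcdefghij"` becomes `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.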

In [None]:
# Read in documents using LangChain's loaders
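# Take everything in all the sub-folders of our knowledge base
# Keep directories only: a file placed directly under kb/ is not a category and cannot be loaded as a folder

folders = [f for f in glob.glob("kb/*") if os.path.isdir(f)]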
print(f"Found {len(folders)} folders in the knowledge base.")

def add_metadata(doc, doc_type):
    doc.metadata["doc_type"] = doc_type
    return doc

# For text files
text_loader_kwargs = {'encoding': 'utf-8'}

documents = []
for folder in folders:
    print(f"Loading documents from folder: {folder}")
    # doc_type is the category of the documents, taken from the folder name
    doc_type = os.path.basename(folder)
    # PDF Loader
    pdf_loader = DirectoryLoader(folder, glob="**/*.pdf", loader_cls=PDFMinerLoader)
    # Text loaders
    txt_loader = DirectoryLoader(folder, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    md_loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    # Load MS Word documents. UnstructuredWordDocumentLoader does not play well with numpy > 1.24.0, so we use Docx2txtLoader instead.
    # doc_loader = DirectoryLoader(folder, glob="**/*.doc", loader_cls=UnstructuredWordDocumentLoader)
    docx_loader = DirectoryLoader(folder, glob="**/*.docx", loader_cls=Docx2txtLoader)
    # Load documents from PDF, text and Word files and combine the results
    pdf_docs = pdf_loader.load()
    print(f"Loaded {len(pdf_docs)} PDF documents from {folder}")
    text_docs = txt_loader.load() + md_loader.load()
    print(f"Loaded {len(text_docs)} text documents from {folder}")
    word_docs = docx_loader.load()
    print(f"Loaded {len(word_docs)} Word documents from {folder}")
    folder_docs = pdf_docs + text_docs + word_docs
    # Skip folders that contain no supported documents
    if not folder_docs:
        print(f"No documents found in folder: {folder}")
        continue
    # Add the doc_type metadata to each document
    documents.extend([add_metadata(doc, doc_type) for doc in folder_docs])

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Print out some basic info for the loaded documents and chunks
print(f"Total number of documents: {len(documents)}")
print(f"Total number of chunks: {len(chunks)}")
print(f"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}")


## Vector Store

We use Chroma (chromadb) as the vector store.
The code is the same as in the lesson notebook, minus the visualization part.
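
Conceptually, a vector store keeps one embedding per chunk and returns the chunks whose embeddings are closest to the query embedding. The brute-force sketch below is only a stand-in for that idea: Chroma uses its own index and distance settings, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=3):
    """stored is a list of (text, vector) pairs; return the k texts most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

The `k` passed to the retriever later in the notebook plays the same role as `k` here: it sets how many chunks are handed to the LLM as context.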

In [None]:
# Use the free embeddings from HuggingFace sentence-transformers, which run locally
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# If you would rather use the OpenAI embeddings, replace the line above with:
# embeddings = OpenAIEmbeddings()

# Delete if already exists

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Create vectorstore

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

# Let's investigate the vectors

collection = vectorstore._collection
count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store")

## LangChain
Create the LangChain chat model, memory and retriever.

Note: for this local version, Gemma3 4B worked much better than Llama 3.2 with my documents.
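
As a simplified picture of what the chain does with these three pieces, the sketch below assembles one prompt from the chat history, the retrieved chunks and the new question. This is an assumption made for illustration: ConversationalRetrievalChain works in more steps (it first condenses the history and the question into a standalone question, then retrieves, then answers) and uses its own prompt templates. The function name `build_prompt` and the prompt wording are hypothetical.

```python
def build_prompt(question, retrieved_chunks, chat_history):
    """Assemble a single prompt string from history, retrieved context and the question.

    chat_history is a list of (role, text) pairs; retrieved_chunks is a list of strings.
    """
    history_text = "\n".join(f"{role}: {text}" for role, text in chat_history)
    context_text = "\n\n".join(retrieved_chunks)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Chat history:\n{history_text}\n\n"
        f"Question: {question}"
    )
```

The sketch also shows why a large `k` matters for a small local model: every retrieved chunk is added to the prompt, so more chunks mean a longer context for the model to read.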

In [None]:
# Create the chat model. This local version talks to Ollama through its OpenAI-compatible endpoint
llm = ChatOpenAI(temperature=0.7, model_name='gemma3:4b', base_url='http://localhost:11434/v1', api_key='ollama')

# Alternative - if you'd like to use OpenAI instead, use this line
# llm = ChatOpenAI(temperature=0.7, model_name=MODEL)

# set up the conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# the retriever is an abstraction over the VectorStore that will be used during RAG
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})  # k is the number of documents to retrieve

# putting it together: set up the conversation chain with the LLM, the retriever and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

## UI part
Create Gradio interface

Simple built-in chat interface

To Do: Add upload interface to include additional documents in data store.

In [None]:
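# Wrap the chain call in a function. Gradio passes in `history`, but it is unused here because the chain's memory keeps the conversation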

def chat(question, history):
    result = conversation_chain.invoke({"question": question})
    return result["answer"]

# And in Gradio:

view = gr.ChatInterface(chat, type="messages").launch(inbrowser=True)