# KUSOE RAG System Demo

This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) pipeline to query information about Kathmandu University School of Engineering (KUSOE).

### Components:
- **Data Source**: A custom-built database of text files located in `../KUSOE_database/`.
- **Chunking**: Files are split into smaller chunks using a custom delimiter (`-c-h-u-n-k-h-e-r-e-`).
- **Embeddings**: The `BAAI/bge-small-en` model generates an embedding for each chunk.
- **Vector Store**: `ChromaDB` (in-memory) is used to store the embeddings and perform similarity searches.
- **Framework**: `LlamaIndex` orchestrates the pipeline from data loading to querying.

The corpus is small (7 files, 58 chunks), the vector store lives in memory, and no LLM is called, so retrieval stays fast.

In [3]:
!pip install -qU llama-index llama-index-vector-stores-chroma llama-index-embeddings-huggingface chromadb sentence-transformers

In [2]:
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import StorageContext
from llama_index.core import Settings
import chromadb
from chromadb.utils import embedding_functions

# For ChromaDB integration
try:
    from llama_index.vector_stores.chroma import ChromaVectorStore
except ImportError:
    # Fallback for newer versions
    from llama_index.vector_stores.chroma_vector_store import ChromaVectorStore

# For embeddings
try:
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
except ImportError:
    # Fallback for newer versions
    from llama_index.embeddings.huggingface_embedding import HuggingFaceEmbedding



In [3]:
# 1. Load Data
print("Loading data...")
documents = SimpleDirectoryReader(
    input_dir="../KUSOE_database/",
    recursive=True  # This will load files from subdirectories too
).load_data()

print(f"Loaded {len(documents)} documents:")
for i, doc in enumerate(documents):
    file_name = doc.metadata.get('file_name', 'Unknown')
    print(f"  {i+1}. {file_name}")
print("---")

Loading data...
Loaded 7 documents:
  1. overview.txt
  2. artificial_intelligence.txt
  3. civil_engineering.txt
  4. computer_engineering.txt
  5. electrical_and_electronics_engineering.txt
  6. information_technology.txt
  7. mechanical_engineering.txt
---


In [4]:
documents = [doc for doc in documents if doc.text.strip() != ""]
documents

[Document(id_='4f4454af-ab43-4340-8451-f80d59bd8a9d', embedding=None, metadata={'file_path': '/media/epein5/Data1/Voice-to-voice/rag/../KUSOE_database/overview.txt', 'file_name': 'overview.txt', 'file_type': 'text/plain', 'file_size': 10577, 'creation_date': '2025-06-28', 'last_modified_date': '2025-06-28'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text="# KUSOE General Information: Overview\n\nKathmandu University School of Engineering (KUSOE), established in 1994 AD, is a leading autonomous, non-profit, and self-funding academic institution in Nepal. Situated in Dhulikhel, KUSOE offers a wide range of undergraduate and graduate programs

In [5]:
# 2. Configure Custom Chunking
from llama_index.core.schema import Document, TextNode

def custom_chunk_splitter(documents):
    """Custom function to split documents exactly at the delimiter"""
    all_nodes = []
    
    for doc in documents:
        # Split the text at the delimiter
        chunks = doc.text.split("-c-h-u-n-k-h-e-r-e-")
        
        # Create nodes from each chunk
        for i, chunk in enumerate(chunks):
            chunk = chunk.strip()  # Remove extra whitespace
            if chunk:  # Only create nodes for non-empty chunks
                node = TextNode(
                    text=chunk,
                    metadata={
                        **doc.metadata,
                        "chunk_id": i,
                        "total_chunks": len(chunks)
                    }
                )
                all_nodes.append(node)
    
    return all_nodes

print("Custom chunking function created!")

Custom chunking function created!


In [6]:
# Debug: Check if chunking works
print("Testing custom chunking...")
test_nodes = custom_chunk_splitter(documents[:1])  # Test with first document
print(f"Number of chunks created: {len(test_nodes)}")
print(f"First chunk preview:")
print(test_nodes[0].text[:300] + "...")
print(f"\nSecond chunk preview:")
if len(test_nodes) > 1:
    print(test_nodes[1].text[:300] + "...")
print(f"\nOriginal document contains delimiter: {'-c-h-u-n-k-h-e-r-e-' in documents[0].text}")
print("---")

Testing custom chunking...
Number of chunks created: 16
First chunk preview:
# KUSOE General Information: Overview

Kathmandu University School of Engineering (KUSOE), established in 1994 AD, is a leading autonomous, non-profit, and self-funding academic institution in Nepal. Situated in Dhulikhel, KUSOE offers a wide range of undergraduate and graduate programs, aiming to p...

Second chunk preview:
# KUSOE Admission Information: General Timeline
Normal annual intake: Fall (July–September) for undergraduate and most graduate programs. Specific dates for application submission and entrance exams are announced on the official KU website (ku.edu.np) and the School of Engineering portal.

Some grad...

Original document contains delimiter: True
---
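The splitting logic can also be tried without LlamaIndex. The following is a minimal sketch under stated assumptions: the helper name `split_on_delimiter` and the sample text are hypothetical and do not come from the KUSOE database. It illustrates one property of `custom_chunk_splitter`: because `chunk_id` comes from `enumerate` over all pieces and `total_chunks` is `len(chunks)`, empty pieces between consecutive delimiters still consume an index and are still counted.

```python
DELIMITER = "-c-h-u-n-k-h-e-r-e-"


def split_on_delimiter(text, delimiter=DELIMITER):
    """Return (position, chunk) pairs for every non-empty piece of text.

    Hypothetical helper that mirrors the split/strip/skip-empty logic of
    custom_chunk_splitter, but works on a plain string.
    """
    pieces = text.split(delimiter)
    return [(i, piece.strip()) for i, piece in enumerate(pieces) if piece.strip()]


# Made-up sample text with two consecutive delimiters (one empty piece).
sample = (
    "Intro paragraph.\n"
    "-c-h-u-n-k-h-e-r-e-\n\n"
    "-c-h-u-n-k-h-e-r-e-\n"
    "Second section."
)

chunks = split_on_delimiter(sample)
for position, chunk in chunks:
    print(position, repr(chunk))
# 0 'Intro paragraph.'
# 2 'Second section.'
```

Position 1 is skipped because the piece between the two delimiters is empty after stripping. If gap-free IDs matter downstream, one option is to filter out empty pieces before enumerating.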


In [7]:
# 3. Configure Embeddings
print("Initializing embedding model...")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

Initializing embedding model...


In [8]:
# 4. Setup ChromaDB Persistent Vector Store (COMMENTED OUT - using in-memory instead)
# print("Setting up ChromaDB...")

# # Disable ChromaDB telemetry to suppress warnings
# import os
# import shutil
# os.environ["ANONYMIZED_TELEMETRY"] = "False"

# # Reset ChromaDB system to clear any existing instances
# import chromadb
# from chromadb.api.models.Collection import Collection
# chromadb.api.client._identifier_to_system = {}

# # Create persistent storage directory
# persist_directory = "../chromadb_storage"

# # Remove existing directory if it exists to avoid conflicts
# if os.path.exists(persist_directory):
#     print("Removing existing ChromaDB storage...")
#     shutil.rmtree(persist_directory)

# os.makedirs(persist_directory, exist_ok=True)

# # Use PersistentClient to save embeddings to disk
# db = chromadb.PersistentClient(path=persist_directory)
# chroma_collection = db.get_or_create_collection("kusoe_rag")
# vector_store = ChromaVectorStore(chroma_collection=chroma_collection)


In [9]:
# Alternative: set up ChromaDB in memory.
# Used here because the persistent storage setup above was causing conflicts.

print("Setting up ChromaDB (In-Memory)...")
import chromadb
db = chromadb.EphemeralClient()
chroma_collection = db.get_or_create_collection("kusoe_rag")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

Setting up ChromaDB (In-Memory)...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


In [10]:
# Fresh ChromaDB Setup - Run this if you're having issues
import importlib
import sys

# Clear ChromaDB from memory
modules_to_remove = [module for module in sys.modules if 'chromadb' in module]
for module in modules_to_remove:
    del sys.modules[module]

# Fresh imports
import chromadb
print("Setting up fresh ChromaDB (In-Memory)...")

# Create completely new client
db = chromadb.EphemeralClient()
chroma_collection = db.create_collection("kusoe_rag_fresh")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
print("ChromaDB setup complete!")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


Setting up fresh ChromaDB (In-Memory)...
ChromaDB setup complete!


In [11]:
# 5. Configure Global Settings
Settings.embed_model = embed_model
Settings.llm = None  # Disable LLM for retrieval-only mode

# Create storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)


LLM is explicitly disabled. Using MockLLM.


In [12]:
print("Creating index with custom chunking...")
# Apply custom chunking to all documents
all_nodes = custom_chunk_splitter(documents)
print(f"Total chunks created: {len(all_nodes)}")

# Create index from the custom chunks
index = VectorStoreIndex(
    nodes=all_nodes,
    storage_context=storage_context,
    show_progress=True
)

Creating index with custom chunking...
Total chunks created: 58


Generating embeddings: 100%|██████████| 58/58 [00:02<00:00, 20.33it/s]
Failed to send telemetry event CollectionAddEvent: capture() takes 1 positional argument but 3 were given


In [21]:

# 6. Create Retriever (retrieval-only mode, no LLM involved)
print("Creating query engine...")
from llama_index.core.retrievers import VectorIndexRetriever

# Create a simple retriever that just returns similar chunks
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3  # Return top 3 most similar chunks
)

# 7. Run a Sample Query
print("Querying...")
query = "What is the time Computer Engineering?"
retrieved_nodes = retriever.retrieve(query)

print(f"\n--- Retrieved {len(retrieved_nodes)} chunks for query: '{query}' ---")
for i, node in enumerate(retrieved_nodes, 1):
    print(f"\nChunk {i} (Score: {node.score:.3f}):")
    print(node.text[:800] + "..." if len(node.text) > 800 else node.text)
    print("-" * 50)


Creating query engine...
Querying...

--- Retrieved 3 chunks for query: 'What is the time Computer Engineering?' ---

Chunk 1 (Score: 0.756):
## Computer Engineering (CE) Program Overview
The Bachelor of Engineering in Computer Engineering program at Kathmandu University is a four-year, eight-semester course that provides a strong foundation in the principles and practices of both computer science and electronics engineering. The curriculum is designed to produce graduates who are proficient in designing, developing, and implementing computer systems and applications. The program emphasizes a hands-on, project-based learning approach, preparing students for careers in software development, network engineering, embedded systems, and other related fields.
--------------------------------------------------

Chunk 2 (Score: 0.748):
## Computer Engineering (CE) Program Objectives
- To provide a solid understanding of the fundamental principles of computer science and engineering.
- To devel
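
The retriever ranks chunks by how close their embeddings are to the query embedding and keeps the top `similarity_top_k`. The following is a conceptual sketch only, assuming cosine similarity. The exact distance metric and score transformation used by the vector store are not shown in this notebook, and the toy vectors and names (`toy_chunks`, `top_k`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k (score, name) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest score first
    return scored[:k]


# Toy 3-dimensional "embeddings"; real bge-small-en vectors have 384 dimensions.
toy_chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [3.0, 4.0, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [-1.0, 0.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], toy_chunks, k=3)
for score, name in results:
    print(f"{name}: {score:.3f}")
```

A brute-force scan like this is adequate for a few dozen chunks. Vector stores typically use an approximate index so that lookups stay fast as the collection grows.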

## Loading Existing Embeddings

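The cells above use an in-memory `EphemeralClient`, so the embeddings are lost when the kernel stops. To keep them between sessions, build the index with the `PersistentClient` setup from the commented-out cell, which writes to the `../chromadb_storage` directory.

Once that directory exists, you can skip the indexing step in future sessions and load the existing embeddings using the code in the next cell.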

In [14]:
# Alternative: Load existing embeddings (run this instead of the indexing cells above)
def load_existing_index():
    """Load previously created index from disk"""
    persist_directory = "../chromadb_storage"
    
    if not os.path.exists(persist_directory):
        print("No existing embeddings found. Please run the full pipeline first.")
        return None
    
    # Configure embeddings (same as before)
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
    Settings.embed_model = embed_model
    
    # Load existing ChromaDB
    db = chromadb.PersistentClient(path=persist_directory)
    chroma_collection = db.get_or_create_collection("kusoe_rag")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    
    # Load the index
    index = VectorStoreIndex.from_vector_store(vector_store)
    
    print("Existing index loaded successfully!")
    return index

# Uncomment the next two lines to load existing embeddings instead of creating new ones.
# as_retriever() keeps this retrieval-only, like the rest of the notebook (no LLM needed):
# existing_index = load_existing_index()
# retriever = existing_index.as_retriever(similarity_top_k=3) if existing_index else None

In [18]:
# Memory Usage Estimation
print("=== KUSOE Vector Database Memory Estimation ===")

# Count total chunks and characters
all_nodes = custom_chunk_splitter(documents)
total_chunks = len(all_nodes)
total_characters = sum(len(node.text) for node in all_nodes)

print(f"Documents loaded: {len(documents)}")
print(f"Total chunks created: {total_chunks}")
print(f"Total characters: {total_characters:,}")
print(f"Average chunk size: {total_characters // total_chunks:,} characters")

# Embedding model details
embedding_model = "bge-small-en"
embedding_dimensions = 384  # bge-small-en has 384 dimensions
bytes_per_float = 4  # 32-bit float

# Memory calculations
embedding_memory_per_chunk = embedding_dimensions * bytes_per_float  # bytes
total_embedding_memory = total_chunks * embedding_memory_per_chunk  # bytes

# Text storage (assuming UTF-8, ~1 byte per character)
text_memory = total_characters  # bytes

# ChromaDB overhead (metadata, indices, etc.) - estimated 20-30% overhead
overhead_factor = 1.25
total_memory_bytes = (total_embedding_memory + text_memory) * overhead_factor

# Convert to human-readable units
memory_kb = total_memory_bytes / 1024
memory_mb = memory_kb / 1024
memory_gb = memory_mb / 1024

print(f"\n=== Memory Breakdown ===")
print(f"Embeddings: {embedding_dimensions} dims × {total_chunks} chunks × 4 bytes = {total_embedding_memory/1024/1024:.2f} MB")
print(f"Text storage: {text_memory/1024/1024:.2f} MB")
print(f"ChromaDB overhead (~25%): {(total_memory_bytes - total_embedding_memory - text_memory)/1024/1024:.2f} MB")

print(f"\n=== Total Estimated Memory Usage ===")
print(f"Total: {memory_mb:.2f} MB ({memory_gb:.3f} GB)")

if memory_mb < 100:
    print("💚 Very lightweight - easily fits in RAM")
elif memory_mb < 500:
    print("💚 Lightweight - no memory concerns")
elif memory_mb < 1000:
    print("💛 Moderate - should be fine on most systems")
else:
    print("🟡 Large - consider system RAM capacity")

print(f"\nNote: This is in-memory storage. Persistent storage would be similar in disk space.")

=== KUSOE Vector Database Memory Estimation ===
Documents loaded: 7
Total chunks created: 58
Total characters: 34,274
Average chunk size: 590 characters

=== Memory Breakdown ===
Embeddings: 384 dims × 58 chunks × 4 bytes = 0.08 MB
Text storage: 0.03 MB
ChromaDB overhead (~25%): 0.03 MB

=== Total Estimated Memory Usage ===
Total: 0.15 MB (0.000 GB)
💚 Very lightweight - easily fits in RAM

Note: This is in-memory storage. Persistent storage would be similar in disk space.
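
The same arithmetic can be wrapped in a function to see how the estimate scales. This is a rough sketch that reuses the notebook's assumptions (384-dimensional float32 embeddings, about 1 byte per character, a guessed 25% overhead). The function name `estimate_store_bytes` and the 100,000-chunk scenario are hypothetical, and the result is a back-of-the-envelope figure, not a measurement of ChromaDB's actual footprint.

```python
def estimate_store_bytes(num_chunks, total_chars, dims=384,
                         bytes_per_float=4, overhead_factor=1.25):
    """Estimate vector-store size in bytes: (embeddings + text) * overhead."""
    embedding_bytes = num_chunks * dims * bytes_per_float
    text_bytes = total_chars  # assumes ~1 byte per character
    return (embedding_bytes + text_bytes) * overhead_factor


# Figures reported by the notebook: 58 chunks, 34,274 characters.
current = estimate_store_bytes(num_chunks=58, total_chars=34_274)
print(f"Current corpus: {current / 1024**2:.2f} MB")  # Current corpus: 0.15 MB

# Hypothetical larger corpus with the same average chunk size (590 characters).
scaled = estimate_store_bytes(num_chunks=100_000, total_chars=100_000 * 590)
print(f"100,000 chunks: {scaled / 1024**2:.2f} MB")
```

Under these assumptions the embeddings dominate: each chunk costs 1,536 bytes of vector data against roughly 590 bytes of text.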
