# RAG-Based Query System for UB CSE Website

This notebook implements a **Retrieval-Augmented Generation (RAG)** system for querying a locally mirrored copy of the UB CSE website.

## The Problem
- **4,042 HTML files** (~403MB) - too large for any LLM context window
- Need to find relevant content before querying
- Need to provide only relevant context to the LLM

## The Solution: RAG Pipeline

1. **Chunk** HTML files into smaller pieces
2. **Embed** each chunk using an embedding model
3. **Store** embeddings in a vector database
4. **Retrieve** relevant chunks for a query
5. **Generate** answer using LLM with retrieved context

This allows querying the entire website efficiently!
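
The five steps above can be sketched end to end in a few lines. This is a toy illustration only, not the implementation used later in this notebook: `toy_embed`, `toy_index`, `retrieve`, `build_prompt`, and the sample chunks are hypothetical. A set of lowercase words stands in for a real embedding, a Python list stands in for the vector database, and the final LLM call is left out.

```python
# Toy sketch of the RAG steps; all names and sample chunks here are hypothetical.

def toy_embed(text: str) -> set:
    """Stand-in for an embedding model: the set of lowercase words."""
    return set(text.lower().split())

# Steps 1-3: chunk, embed, store (a list stands in for the vector database).
sample_chunks = [
    "The department offers a BS in Computer Science",
    "Davis Hall is home to the department",
    "Research areas include artificial intelligence and theory",
]
toy_index = [(chunk, toy_embed(chunk)) for chunk in sample_chunks]

def retrieve(query: str, k: int = 1) -> list:
    """Step 4: rank chunks by words shared with the query and keep the top k."""
    q = toy_embed(query)
    ranked = sorted(toy_index, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, k: int = 1) -> str:
    """Step 5 (first half): assemble the prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What research areas exist?"))
```

The real pipeline replaces the word sets with dense vectors from an embedding model and the list with a persistent vector store, but the control flow stays the same.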

## 1) Install Dependencies

We'll need:
- `ollama` - for LLM and embeddings
- `chromadb` - lightweight vector database
- `beautifulsoup4` & `html2text` - for HTML processing

In [1]:
# Install required packages
import subprocess
import sys

packages = ['ollama', 'chromadb', 'beautifulsoup4', 'html2text']

for package in packages:
    try:
        if package == 'chromadb':
            __import__('chromadb')
        elif package == 'beautifulsoup4':
            __import__('bs4')
        elif package == 'html2text':
            __import__('html2text')
        else:
            __import__(package)
        print(f"✓ {package} already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package, "--quiet"])
        print(f"✓ {package} installed")

import ollama
import chromadb
from bs4 import BeautifulSoup
import html2text
import os
from pathlib import Path
import hashlib
from typing import List, Dict

print("\n✓ All packages ready!")

✓ ollama already installed
Installing chromadb...
✓ chromadb installed
✓ beautifulsoup4 already installed
Installing html2text...
✓ html2text installed

✓ All packages ready!


## 2) Configuration

Set up paths and model choices.

In [2]:
# Configuration
MIRROR_FOLDER = "engineering.buffalo.edu"
CHROMA_DB_PATH = "./chroma_db"  # Where to store the vector database

# Model choices
# For embeddings: nomic-embed-text is good and small
# For querying: any Ollama model (llama3, mistral, etc.)
EMBEDDING_MODEL = "nomic-embed-text"  # Good embedding model for Ollama
LLM_MODEL = "llama3.2:latest"  # Change to your preferred model

# Chunking settings
CHUNK_SIZE = 1000  # Characters per chunk
CHUNK_OVERLAP = 200  # Overlap between chunks

print(f"Mirror folder: {MIRROR_FOLDER}")
print(f"Vector DB path: {CHROMA_DB_PATH}")
print(f"Embedding model: {EMBEDDING_MODEL}")
print(f"LLM model: {LLM_MODEL}")

Mirror folder: engineering.buffalo.edu
Vector DB path: ./chroma_db
Embedding model: nomic-embed-text
LLM model: llama3.2:latest


## 3) HTML Processing Functions

Functions to extract clean text from HTML files.

In [3]:
# Initialize HTML to text converter
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = True
h.body_width = 0

def extract_text_from_html(file_path: str) -> str:
    """Extract clean text from HTML file."""
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            html_content = f.read()
        
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Remove script, style, and other non-content elements
        for element in soup(['script', 'style', 'meta', 'link', 'nav', 'footer', 'header']):
            element.decompose()
        
        # Get text
        text = h.handle(str(soup))
        return text.strip()
    except Exception as e:
        return f"Error reading {file_path}: {e}"

def find_html_files(root_dir: str) -> List[str]:
    """Find all HTML files in directory."""
    html_files = []
    root_path = Path(root_dir)
    
    for html_file in root_path.rglob("*.html"):
        html_files.append(str(html_file))
    
    return html_files

# Test
if os.path.exists(MIRROR_FOLDER):
    html_files = find_html_files(MIRROR_FOLDER)
    print(f"Found {len(html_files)} HTML files")
    
    # Test extraction on one file
    if html_files:
        sample_text = extract_text_from_html(html_files[0])
        print(f"\nSample file: {html_files[0]}")
        print(f"Extracted text length: {len(sample_text)} characters")
        print(f"Preview: {sample_text[:200]}...")
else:
    print(f"⚠ {MIRROR_FOLDER} folder not found!")

Found 4042 HTML files

Sample file: engineering.buffalo.edu/computer-science-engineering/information-for-faculty-and-staff.html
Extracted text length: 15620 characters
Preview: #  Information for Faculty and Staff 

[ Davis Hall by moonlight.  ](http://engineering.buffalo.edu/content/engineering/computer-science-engineering/information-for-faculty-and-staff/_jcr_content/par/...


## 4) Chunking Strategy

Split the text extracted from each HTML file into smaller, overlapping chunks that fit in context windows.

In [4]:
def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into overlapping chunks."""
    if len(text) <= chunk_size:
        return [text]
    
    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        
        # Move start forward, accounting for overlap
        start = end - overlap
        
        # Prevent infinite loop
        if start >= len(text) - overlap:
            break
    
    return chunks

def create_chunks_from_files(html_files: List[str], max_files: int = None) -> List[Dict]:
    """Create chunks from HTML files with metadata."""
    chunks = []
    
    files_to_process = html_files[:max_files] if max_files else html_files
    
    for file_path in files_to_process:
        text = extract_text_from_html(file_path)
        
        if text and not text.startswith("Error"):
            file_chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)
            
            for i, chunk in enumerate(file_chunks):
                chunks.append({
                    'text': chunk,
                    'file_path': file_path,
                    'chunk_index': i,
                    'total_chunks': len(file_chunks)
                })
    
    return chunks

# Test chunking
if os.path.exists(MIRROR_FOLDER):
    html_files = find_html_files(MIRROR_FOLDER)
    print(f"Testing chunking on first 5 files...")
    test_chunks = create_chunks_from_files(html_files[:5])
    print(f"Created {len(test_chunks)} chunks from 5 files")
    if test_chunks:
        print(f"Average chunk size: {sum(len(c['text']) for c in test_chunks) / len(test_chunks):.0f} characters")

Testing chunking on first 5 files...
Created 90 chunks from 5 files
Average chunk size: 970 characters
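
As a rough check on the chunking arithmetic, the number of chunks for a text can be predicted from its length. The sketch below assumes the same sliding-window scheme as `chunk_text` (window of `chunk_size` characters, advancing by `chunk_size - overlap`); `expected_chunk_count` and `sliding_chunks` are hypothetical helpers, not part of the notebook. It also guards against `overlap >= chunk_size`, which would keep the window from advancing.

```python
import math

def expected_chunk_count(n_chars: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Predict how many chunks a text of n_chars characters produces."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap  # how far the window advances each step
    return math.ceil((n_chars - overlap) / stride)

def sliding_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Reference chunker using the same windowing, written with range()."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    starts = range(0, max(len(text) - overlap, 1), stride)
    return [text[s:s + chunk_size] for s in starts]

# The sample file above had 15,620 extracted characters.
print(expected_chunk_count(15620))      # 20
print(len(sliding_chunks("x" * 15620)))  # 20
```

With the default settings each additional chunk covers 800 new characters, so the overlap costs roughly 25% more chunks than a non-overlapping split.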


## 5) Create Vector Database

This is the core of RAG. We'll:
1. Create embeddings for each chunk
2. Store them in ChromaDB

Once stored, the embeddings allow fast semantic search.

In [5]:
def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> List[float]:
    """Get embedding for text using Ollama."""
    try:
        response = ollama.embeddings(model=model, prompt=text)
        return response['embedding']
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return None

def create_vector_db(chunks: List[Dict], collection_name: str = "website_chunks"):
    """Create and populate ChromaDB vector database."""
    
    # Initialize ChromaDB
    client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
    
    # Create or get collection
    try:
        collection = client.get_collection(name=collection_name)
        print(f"Using existing collection: {collection_name}")
        print(f"Current documents: {collection.count()}")
    except Exception:
        collection = client.create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        print(f"Created new collection: {collection_name}")
    
    # Process chunks in batches
    batch_size = 10
    total_chunks = len(chunks)
    
    print(f"\nProcessing {total_chunks} chunks...")
    
    for i in range(0, total_chunks, batch_size):
        batch = chunks[i:i+batch_size]
        
        ids = []
        texts = []
        embeddings = []
        metadatas = []
        
        for chunk in batch:
            # Create unique ID
            chunk_id = hashlib.md5(
                f"{chunk['file_path']}_{chunk['chunk_index']}".encode()
            ).hexdigest()
            
            # Get embedding
            embedding = get_embedding(chunk['text'], EMBEDDING_MODEL)
            
            if embedding:
                ids.append(chunk_id)
                texts.append(chunk['text'])
                embeddings.append(embedding)
                metadatas.append({
                    'file_path': chunk['file_path'],
                    'chunk_index': str(chunk['chunk_index']),
                    'total_chunks': str(chunk['total_chunks'])
                })
        
        if ids:
            collection.add(
                ids=ids,
                documents=texts,
                embeddings=embeddings,
                metadatas=metadatas
            )
        
        if (i + batch_size) % 100 == 0 or i + batch_size >= total_chunks:
            print(f"  Processed {min(i + batch_size, total_chunks)}/{total_chunks} chunks")
    
    print(f"\n✓ Vector database created with {collection.count()} chunks")
    return collection

print("✓ Vector DB functions ready!")

✓ Vector DB functions ready!
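
The collection above is created with `"hnsw:space": "cosine"`, so results are ranked by cosine distance, which is one minus the cosine similarity of two vectors. The following is a minimal pure-Python sketch of that measure for intuition only; `cosine_distance` is a hypothetical helper, and ChromaDB computes this internally.

```python
import math

def cosine_distance(a: list, b: list) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, two vectors that differ only in length are treated as identical, which suits embeddings of texts that vary in size.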


### Step 0: Pull Embedding Model (Required First!)

Before building the index, pull the embedding model by running `ollama pull nomic-embed-text` in your terminal. The cell below checks that the model is available:

In [None]:
# Check whether the embedding model is available; if not, print instructions.
# get_embedding() catches its own errors and returns None on failure,
# so no extra try/except is needed here.
test_embedding = get_embedding("test", EMBEDDING_MODEL)
if test_embedding:
    print(f"✓ Embedding model '{EMBEDDING_MODEL}' is ready!")
    print(f"  Embedding dimension: {len(test_embedding)}")
else:
    print(f"⚠ Embedding model '{EMBEDDING_MODEL}' not available.")
    print("\nPlease run this command in your terminal:")
    print(f"  ollama pull {EMBEDDING_MODEL}")
    print("\nThen re-run this cell to verify it's installed.")

## 6) Build the Index (One-Time Setup)

**⚠️ This will take time!** For 4,000+ files, indexing might take 30-60 minutes.

You can:
1. **Test first** with a small subset (e.g., 100 files)
2. **Run full index** when ready
3. **Resume later** - ChromaDB persists, so you can add more files incrementally
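
One way to resume without re-embedding chunks that are already stored is to filter by chunk ID before calling the embedding model. The sketch below assumes the same ID scheme as `create_vector_db` (MD5 of `file_path` and `chunk_index`); `chunk_id` and `pending_chunks` are hypothetical helpers, and `existing_ids` is assumed to be a set of IDs already in the collection, for instance built from the `ids` field returned by ChromaDB's `collection.get()`.

```python
import hashlib

def chunk_id(file_path: str, chunk_index: int) -> str:
    """Same ID scheme as create_vector_db: MD5 of '<file_path>_<chunk_index>'."""
    return hashlib.md5(f"{file_path}_{chunk_index}".encode()).hexdigest()

def pending_chunks(chunks: list, existing_ids: set) -> list:
    """Keep only the chunks whose ID is not yet in the collection."""
    return [
        c for c in chunks
        if chunk_id(c["file_path"], c["chunk_index"]) not in existing_ids
    ]

# Hypothetical example: the first chunk is already indexed, the second is not.
chunks = [
    {"text": "first", "file_path": "site/a.html", "chunk_index": 0},
    {"text": "second", "file_path": "site/a.html", "chunk_index": 1},
]
existing_ids = {chunk_id("site/a.html", 0)}
print([c["text"] for c in pending_chunks(chunks, existing_ids)])  # ['second']
```

Passing only the pending chunks to `create_vector_db` would then skip the embedding calls for everything indexed in an earlier run.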

In [11]:
# OPTION 1: Test with small subset (recommended first!)
TEST_MODE = True  # Set to False for full index
TEST_FILE_COUNT = 100  # Number of files to index for testing

if os.path.exists(MIRROR_FOLDER):
    html_files = find_html_files(MIRROR_FOLDER)
    print(f"Total HTML files: {len(html_files)}")
    
    if TEST_MODE:
        print(f"\n🧪 TEST MODE: Indexing first {TEST_FILE_COUNT} files")
        files_to_index = html_files[:TEST_FILE_COUNT]
    else:
        print(f"\n🚀 FULL MODE: Indexing all {len(html_files)} files")
        files_to_index = html_files
    
    # Create chunks
    print("\nStep 1: Creating chunks...")
    chunks = create_chunks_from_files(files_to_index)
    print(f"Created {len(chunks)} chunks")
    
    # First, make sure embedding model is available
    print(f"\nChecking embedding model: {EMBEDDING_MODEL}")
    test_embedding = None
    try:
        test_embedding = get_embedding("test", EMBEDDING_MODEL)
        if test_embedding:
            print(f"✓ Embedding model ready (embedding dimension: {len(test_embedding)})")
        else:
            print(f"⚠ Embedding model not available.")
    except Exception as e:
        print(f"⚠ Error with embedding model: {e}")
    
    # Only build index if embedding model is available
    if test_embedding:
        # Build the index
        print("\nStep 2: Creating embeddings and vector database...")
        print("This will take a while - grab a coffee! ☕")
        collection = create_vector_db(chunks)
        
        print("\n✓ Indexing complete! You can now proceed to query the database.")
    else:
        print(f"\n⚠ Cannot build index - embedding model '{EMBEDDING_MODEL}' not available.")
        print(f"\nPlease run this command in your terminal:")
        print(f"  ollama pull {EMBEDDING_MODEL}")
        print(f"\nThen re-run this cell to build the index.")
else:
    print(f"⚠ {MIRROR_FOLDER} folder not found!")

Total HTML files: 4042

🧪 TEST MODE: Indexing first 100 files

Step 1: Creating chunks...
Created 1038 chunks

Checking embedding model: nomic-embed-text
✓ Embedding model ready (embedding dimension: 768)

Step 2: Creating embeddings and vector database...
This will take a while - grab a coffee! ☕
Created new collection: website_chunks

Processing 1038 chunks...
  Processed 100/1038 chunks
  Processed 200/1038 chunks
  Processed 300/1038 chunks
  Processed 400/1038 chunks
  Processed 500/1038 chunks
  Processed 600/1038 chunks
  Processed 700/1038 chunks
  Processed 800/1038 chunks
  Processed 900/1038 chunks
  Processed 1000/1038 chunks
  Processed 1038/1038 chunks

✓ Vector database created with 1038 chunks

✓ Indexing complete! You can now proceed to query the database.
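
The test run gives a basis for a back-of-the-envelope estimate of the full index size. The sketch below extrapolates from the 100-file run, assuming those files are representative of all 4,042, which may not hold; the implied embedding rates simply restate the 30-60 minute estimate and are not measurements.

```python
# Figures taken from the test run above.
test_files = 100
test_chunks = 1038
total_files = 4042

# Extrapolate the chunk count, assuming the first 100 files are representative.
estimated_chunks = round(test_chunks * total_files / test_files)
print(f"Estimated chunks for full index: {estimated_chunks}")  # 41956

# Embedding rate that the 30-60 minute estimate would imply.
for minutes in (30, 60):
    rate = estimated_chunks / (minutes * 60)
    print(f"{minutes} min -> about {rate:.1f} embeddings per second")
```

Timing the 100-file run on your own machine and scaling it by the same ratio gives a more reliable estimate than the generic range.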


## 7) Query Function (RAG Pipeline)

This is where the magic happens:
1. Convert query to embedding
2. Find similar chunks in vector DB
3. Retrieve top-k chunks
4. Pass to LLM with context
5. Get answer!

In [12]:
def query_website(query: str, collection, top_k: int = 5, model: str = LLM_MODEL):
    """Query the website using RAG."""
    
    # Step 1: Get query embedding
    print(f"🔍 Query: {query}")
    print("Step 1: Getting query embedding...")
    query_embedding = get_embedding(query, EMBEDDING_MODEL)
    
    if not query_embedding:
        return "Error: Could not get query embedding"
    
    # Step 2: Search vector database
    print(f"Step 2: Searching vector database (top {top_k} results)...")
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    # Step 3: Prepare context
    print("Step 3: Preparing context for LLM...")
    context_chunks = []
    for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        context_chunks.append(f"\n--- Chunk {i+1} from {metadata['file_path']} ---\n{doc}")
    
    context = "\n".join(context_chunks)
    
    # Step 4: Query LLM with context
    print(f"Step 4: Querying LLM ({model})...")
    
    prompt = f"""You are a helpful assistant answering questions about the University at Buffalo Computer Science and Engineering department website.

Use the following context from the website to answer the question. If the answer is not in the context, say so.

Context from website:
{context}

Question: {query}

Answer based on the context:"""
    
    try:
        response = ollama.generate(
            model=model,
            prompt=prompt
        )
        
        answer = response['response']
        
        # Step 5: Return answer with sources
        print("\n" + "="*70)
        print("ANSWER:")
        print("="*70)
        print(answer)
        print("\n" + "="*70)
        print("SOURCES:")
        print("="*70)
        for i, metadata in enumerate(results['metadatas'][0], 1):
            print(f"{i}. {metadata['file_path']}")
        
        return {
            'answer': answer,
            'sources': [m['file_path'] for m in results['metadatas'][0]]
        }
    except Exception as e:
        return f"Error querying LLM: {e}"

print("✓ Query function ready!")

✓ Query function ready!


## 8) Load Vector Database and Query

Now you can query the website!

In [13]:
# Load the vector database
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)

try:
    collection = client.get_collection(name="website_chunks")
    print("✓ Loaded vector database")
    print(f"  Documents: {collection.count()}")
except Exception as e:
    print(f"⚠ Vector database not found: {e}")
    print("Run the indexing cells above first!")
    collection = None

✓ Loaded vector database
  Documents: 1038


In [14]:
# Example queries
if collection:
    # Query 1
    result1 = query_website(
        "What are the main research areas in the Computer Science department?",
        collection,
        top_k=5
    )
    
    print("\n" + "="*70 + "\n")
    
    # Query 2
    result2 = query_website(
        "Who are the faculty members?",
        collection,
        top_k=5
    )
else:
    print("⚠ Vector database not loaded. Build the index first!")

🔍 Query: What are the main research areas in the Computer Science department?
Step 1: Getting query embedding...
Step 2: Searching vector database (top 5 results)...
Step 3: Preparing context for LLM...
Step 4: Querying LLM (llama3.2:latest)...

ANSWER:
According to Chunk 2 from engineering.buffalo.edu/computer-science-engineering/research/research-areas.html, there are 18 research areas in the Computer Science and Engineering department. However, these 18 areas are grouped into four categories:

1. Artificial Intelligence
2. Programming Languages and Software Engineering
3. Theory (subdivided into Algorithms and Complexity, Computer Security and Cryptography, Interdisciplinary, etc.)
4. Interdisciplinary (subdivided into Computational Biology and Bioinformatics, Computing Education, Human-Computer Interaction, Society and Computing, etc.)

These categories are not listed in order of priority or importance, but rather provide a way to organize the various research areas within the d

## 9) Interactive Query Function

Easy function to ask questions.

In [15]:
def ask(question: str, top_k: int = 5):
    """Simple function to ask questions."""
    if collection is None:
        print("⚠ Vector database not loaded. Run the cell above first!")
        return
    
    return query_website(question, collection, top_k=top_k, model=LLM_MODEL)

# Example:
ask("What undergraduate programs are offered?")
# ask("Tell me about the graduate program requirements")
# ask("What research labs are there?")

🔍 Query: What undergraduate programs are offered?
Step 1: Getting query embedding...
Step 2: Searching vector database (top 5 results)...
Step 3: Preparing context for LLM...
Step 4: Querying LLM (llama3.2:latest)...

ANSWER:
According to the context, the following undergraduate programs are offered by the University at Buffalo Computer Science and Engineering department:

1. BS/MS in Computer Science and Engineering
2. BS in Computer Engineering
3. BS in Computer Science
4. BA in Computer Science
5. Interdisciplinary Undergraduate Programs (including a concentration in Bioinformatics and Computational Biology)
6. BA in Social Sciences with an interdisciplinary Cognitive Science Concentration

SOURCES:
1. engineering.buffalo.edu/computer-science-engineering/academics.html
2. engineering.buffalo.edu/computer-science-engineering/sitemap.html
3. engineering.buffalo.edu/computer-science-engineering/undergraduate.html
4. engineering.buffalo.edu/computer-science-engineering/academics.html

{'answer': 'According to the context, the following undergraduate programs are offered by the University at Buffalo Computer Science and Engineering department:\n\n1. BS/MS in Computer Science and Engineering\n2. BS in Computer Engineering\n3. BS in Computer Science\n4. BA in Computer Science\n5. Interdisciplinary Undergraduate Programs (including a concentration in Bioinformatics and Computational Biology)\n6. BA in Social Sciences with an interdisciplinary Cognitive Science Concentration',
 'sources': ['engineering.buffalo.edu/computer-science-engineering/academics.html',
  'engineering.buffalo.edu/computer-science-engineering/sitemap.html',
  'engineering.buffalo.edu/computer-science-engineering/undergraduate.html',
  'engineering.buffalo.edu/computer-science-engineering/academics.html',
  'engineering.buffalo.edu/computer-science-engineering/undergraduate.html']}
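
The returned `sources` list repeats a file whenever several retrieved chunks come from the same page. A small sketch of de-duplicating it while keeping the retrieval order follows; `unique_sources` is a hypothetical helper, not part of `query_website`.

```python
def unique_sources(paths: list) -> list:
    """Drop repeated file paths while preserving first-seen order."""
    return list(dict.fromkeys(paths))

# The 'sources' list from the result above.
sources = [
    "engineering.buffalo.edu/computer-science-engineering/academics.html",
    "engineering.buffalo.edu/computer-science-engineering/sitemap.html",
    "engineering.buffalo.edu/computer-science-engineering/undergraduate.html",
    "engineering.buffalo.edu/computer-science-engineering/academics.html",
    "engineering.buffalo.edu/computer-science-engineering/undergraduate.html",
]
for i, path in enumerate(unique_sources(sources), 1):
    print(f"{i}. {path}")
```

`dict.fromkeys` is used rather than `set` because dictionaries keep insertion order, so the most relevant page stays first.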

## Summary: How RAG Works Here

### The Problem
- 4,042 HTML files = too much for LLM context
- Need to find relevant info efficiently

### The RAG Solution

1. **Indexing Phase** (one-time, takes ~30-60 min):
   - Extract text from HTML
   - Chunk into smaller pieces
   - Create embeddings (vector representations)
   - Store in vector database

2. **Query Phase** (fast, ~5-10 seconds):
   - Convert your question to embedding
   - Find similar chunks (semantic search)
   - Retrieve top-k most relevant chunks
   - Pass chunks + question to LLM
   - Get answer with sources!

### Advantages
- ✅ Handles large datasets
- ✅ Fast queries (only relevant chunks)
- ✅ Provides sources
- ✅ Works with local models (Ollama)
- ✅ Can update incrementally

### Next Steps
1. Pull embedding model: `ollama pull nomic-embed-text`
2. Run indexing (start with test mode)
3. Query away!