# Query UB CSE Mirrored Website with Ollama

This notebook allows you to query the locally mirrored `engineering.buffalo.edu` website using a local Ollama model.

**Features:**
- Choose from available Ollama models
- Query the mirrored website content
- Search and analyze HTML files
- Extract information using local LLM models

**Prerequisites:**
- Ollama installed and running
- The `engineering.buffalo.edu` folder from the mirroring workflow

## 1) Install and Setup Ollama

### Install Ollama (if not already installed)

If you don't have Ollama installed, you can install it from [ollama.com](https://ollama.com) or use Homebrew:

```bash
brew install ollama
```

### Start Ollama service

Make sure Ollama is running. You can start it with:

```bash
ollama serve
```

Or if it's installed as a service, it should already be running.
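
To confirm the service is reachable before running the rest of the notebook, a quick check like the following can help. This is a minimal sketch using only the standard library; it assumes Ollama listens on its default address `http://localhost:11434` (adjust `base_url` if your setup differs), and `ollama_is_running` is a hypothetical helper name, not part of the `ollama` package.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url.

    base_url defaults to Ollama's usual local address (an assumption;
    change it if your server listens elsewhere).
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start the server with `ollama serve` and try again.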

### Install Python dependencies

We'll use the `ollama` Python package to interact with Ollama models.

In [6]:
# Install ollama Python package if not already installed
import subprocess
import sys

try:
    import ollama
    print("✓ ollama package already installed")
except ImportError:
    print("Installing ollama package...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "ollama"])
    import ollama
    print("✓ ollama package installed successfully")

✓ ollama package already installed


## 2) Check Ollama Connection and List Available Models

First, let's verify Ollama is running and see what models are available.

In [7]:
# Check Ollama connection and list available models
import ollama

try:
    # List available models
    models = ollama.list()
    print("Available Ollama models:")
    print("-" * 50)
    
    if models.get('models'):
        for model in models['models']:
            model_name = model.get('model', 'Unknown')
            model_size = model.get('size', 0)
            size_gb = model_size / (1024**3) if model_size > 0 else 0
            print(f"  • {model_name} ({size_gb:.2f} GB)")
    else:
        print("  No models found. Pull a model first:")
        print("  Example: ollama pull llama2")
        print("  Example: ollama pull mistral")
        print("  Example: ollama pull codellama")
except Exception as e:
    print(f"Error connecting to Ollama: {e}")
    print("\nMake sure Ollama is running:")
    print("  Run 'ollama serve' in a terminal, or")
    print("  Install Ollama from https://ollama.com")

Available Ollama models:
--------------------------------------------------
  • nomic-embed-text:latest (0.26 GB)
  • llama3.2-vision:latest (7.28 GB)
  • gemma3:4b (3.11 GB)
  • llama4:latest (62.81 GB)
  • llama3.2-vision:11b (7.36 GB)
  • llama3.2:latest (1.88 GB)
  • llama3.1:latest (4.34 GB)
  • mistral:latest (3.83 GB)
  • llama3:latest (4.34 GB)


## 3) Choose a Model

The cell below lists the available models and assigns each one a label (`model_1`, `model_2`, etc.).

To select a model, set `MODEL_NAME` to one of the model labels (e.g., `MODEL_NAME = "model_1"`).

**Note:** If you don't have a model yet, you can pull one using:
```bash
ollama pull <model-name>
```

In [8]:
# List available models and create numbered labels
try:
    models = ollama.list()
    available_models = [m['model'] for m in models.get('models', [])]
    
    if available_models:
        # Create a dictionary mapping model labels to model names
        model_dict = {}
        print("Available models:")
        print("-" * 60)
        
        for i, model_name in enumerate(available_models, 1):
            label = f"model_{i}"
            model_dict[label] = model_name
            model_size = next((m.get('size', 0) for m in models.get('models', []) if m.get('model') == model_name), 0)
            size_gb = model_size / (1024**3) if model_size > 0 else 0
            print(f"  {label:12} = {model_name} ({size_gb:.2f} GB)")
        
        print("-" * 60)
        print(f"\nTotal: {len(available_models)} models available")
        print("\nTo select a model, set MODEL_NAME to one of the labels above.")
        print("Example: MODEL_NAME = 'model_1'")
        
        # Store the model dictionary globally
        MODEL_DICT = model_dict
        
        # Set default to first model
        MODEL_NAME = "model_1"
        print(f"\n✓ Default model set to: {MODEL_NAME} ({model_dict[MODEL_NAME]})")
        print("  Change MODEL_NAME to use a different model.")
        
    else:
        print("⚠ No models found. Pull a model first:")
        print("  Example: ollama pull llama2")
        print("  Example: ollama pull mistral")
        MODEL_NAME = None
        MODEL_DICT = {}
        
except Exception as e:
    print(f"Error listing models: {e}")
    print("\nMake sure Ollama is running:")
    print("  Run 'ollama serve' in a terminal, or")
    print("  Install Ollama from https://ollama.com")
    MODEL_NAME = None
    MODEL_DICT = {}

Available models:
------------------------------------------------------------
  model_1      = nomic-embed-text:latest (0.26 GB)
  model_2      = llama3.2-vision:latest (7.28 GB)
  model_3      = gemma3:4b (3.11 GB)
  model_4      = llama4:latest (62.81 GB)
  model_5      = llama3.2-vision:11b (7.36 GB)
  model_6      = llama3.2:latest (1.88 GB)
  model_7      = llama3.1:latest (4.34 GB)
  model_8      = mistral:latest (3.83 GB)
  model_9      = llama3:latest (4.34 GB)
------------------------------------------------------------

Total: 9 models available

To select a model, set MODEL_NAME to one of the labels above.
Example: MODEL_NAME = 'model_1'

✓ Default model set to: model_1 (nomic-embed-text:latest)
  Change MODEL_NAME to use a different model.
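
In the listing above, the default `model_1` happens to be `nomic-embed-text:latest`, an embedding model rather than a text-generation model. One way to pick a more suitable default is sketched below. It relies on a name-based heuristic (skipping names that contain `embed`), which is an assumption for illustration and not an Ollama feature; `first_generative_label` is a hypothetical helper.

```python
def first_generative_label(model_dict, skip_markers=("embed",)):
    """Return the first label whose model name contains none of skip_markers.

    model_dict maps labels such as "model_1" to model names, like MODEL_DICT.
    The name-based filter is only a heuristic (an assumption of this sketch).
    """
    for label, name in model_dict.items():
        if not any(marker in name.lower() for marker in skip_markers):
            return label
    return None


# Example with the first three entries from the listing above
example = {
    "model_1": "nomic-embed-text:latest",
    "model_2": "llama3.2-vision:latest",
    "model_3": "gemma3:4b",
}
print(first_generative_label(example))  # model_2
```

With the real dictionary, this could be used as `MODEL_NAME = first_generative_label(MODEL_DICT)`.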


### Select Your Model

Set `MODEL_NAME` to the label of the model you want to use (e.g., `model_1`, `model_2`, etc.).

In [9]:
# Select your model by setting MODEL_NAME to one of the labels (model_1, model_2, etc.)
# Example: MODEL_NAME = "model_2"

MODEL_NAME = "model_4"

# Verify the selected model
if 'MODEL_DICT' in globals() and MODEL_DICT and 'MODEL_NAME' in globals() and MODEL_NAME:
    if MODEL_NAME in MODEL_DICT:
        actual_model_name = MODEL_DICT[MODEL_NAME]
        print(f"✓ Selected model: {MODEL_NAME} = {actual_model_name}")
    else:
        print(f"⚠ Warning: '{MODEL_NAME}' not found in available models.")
        print(f"Available labels: {', '.join(MODEL_DICT.keys())}")
        print(f"Using default: model_1 = {MODEL_DICT.get('model_1', 'N/A')}")
        MODEL_NAME = "model_1"
else:
    print("⚠ Models not loaded. Run the cell above first!")

✓ Selected model: model_4 = llama4:latest


## 4) Setup: Load and Process Website Content

We'll create helper functions to read and process HTML files from the mirrored website.

In [10]:
# Install required packages for HTML processing
import subprocess
import sys

try:
    from bs4 import BeautifulSoup
    import html2text
    print("✓ Required packages already installed")
except ImportError:
    print("Installing required packages...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "beautifulsoup4", "html2text"])
    from bs4 import BeautifulSoup
    import html2text
    print("✓ Packages installed successfully")

import os
import re
from pathlib import Path

# Set the path to the mirrored website folder
MIRROR_FOLDER = "engineering.buffalo.edu"

# Check if folder exists
if not os.path.exists(MIRROR_FOLDER):
    print(f"⚠ Warning: '{MIRROR_FOLDER}' folder not found in current directory.")
    print(f"Current directory: {os.getcwd()}")
    print(f"\nMake sure you've run the mirroring workflow first.")
else:
    print(f"✓ Found mirrored website folder: {MIRROR_FOLDER}")

# Initialize HTML to text converter
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = True
h.body_width = 0  # Don't wrap text

✓ Required packages already installed
✓ Found mirrored website folder: engineering.buffalo.edu


In [11]:
def extract_text_from_html(file_path):
    """Extract readable text from an HTML file."""
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            html_content = f.read()
        
        # Use BeautifulSoup to parse and clean HTML
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Remove non-content elements (scripts, styles, metadata, link tags)
        for tag in soup(["script", "style", "meta", "link"]):
            tag.decompose()
        
        # Convert to text
        text = h.handle(str(soup))
        return text.strip()
    except Exception as e:
        return f"Error reading file: {e}"

def find_html_files(root_dir, max_files=None):
    """Find all HTML files in the directory."""
    html_files = []
    root_path = Path(root_dir)
    
    for html_file in root_path.rglob("*.html"):
        html_files.append(str(html_file))
        if max_files and len(html_files) >= max_files:
            break
    
    return html_files

# Test: Find some HTML files
if os.path.exists(MIRROR_FOLDER):
    html_files = find_html_files(MIRROR_FOLDER, max_files=10)
    print(f"Found {len(html_files)} HTML files (showing first 5)")
    for f in html_files[:5]:
        print(f"  • {f}")
else:
    print("Cannot find HTML files - mirror folder not found")

Found 10 HTML files (showing first 5)
  • engineering.buffalo.edu/computer-science-engineering/information-for-faculty-and-staff.html
  • engineering.buffalo.edu/computer-science-engineering/news-and-events.html
  • engineering.buffalo.edu/computer-science-engineering/sitemap.html
  • engineering.buffalo.edu/computer-science-engineering/research.html
  • engineering.buffalo.edu/computer-science-engineering/alumni-and-friends.html
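
`Path.rglob` returns files in whatever order the filesystem provides, so the files selected by `max_files` can differ between machines. The variant below is a sketch of a deterministic alternative: it sorts the paths and also accepts `.htm` files. The name `find_html_files_sorted` and the `suffixes` parameter are hypothetical additions, not part of the notebook above.

```python
from pathlib import Path


def find_html_files_sorted(root_dir, max_files=None, suffixes=(".html", ".htm")):
    """Return HTML file paths under root_dir in sorted order.

    Suffix matching is case-insensitive; max_files caps the result
    after sorting, so the same files are returned on every run.
    """
    root = Path(root_dir)
    matches = sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
    return matches if max_files is None else matches[:max_files]
```

For example, `find_html_files_sorted(MIRROR_FOLDER, max_files=10)` could replace the call in the test above.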


## 5) Query Functions

Create functions to query the website content using Ollama.

In [12]:
# Helper function to get current model name
def get_current_model():
    """Return the actual model name for the MODEL_NAME label, or None if unavailable."""
    model_dict = globals().get('MODEL_DICT') or {}
    label = globals().get('MODEL_NAME')
    if not model_dict or not label:
        return None
    # Fall back to model_1 if the label is not in MODEL_DICT
    return model_dict.get(label, model_dict.get('model_1'))

def query_file_with_ollama(file_path, question, model_name=None, max_chars=8000):
    """Query a specific HTML file using Ollama."""
    # Use MODEL_NAME if model_name not provided
    if model_name is None:
        model_name = get_current_model()
        if model_name is None:
            return "Error: No model selected. Please set MODEL_NAME to a model label (e.g., 'model_1')."
    
    # Extract text from HTML
    file_text = extract_text_from_html(file_path)
    
    # Truncate if too long (to avoid token limits)
    if len(file_text) > max_chars:
        file_text = file_text[:max_chars] + "... [truncated]"
    
    # Create prompt
    prompt = f"""You are analyzing content from a university website. Answer the question based on the following content.

Content from {file_path}:
{file_text}

Question: {question}

Answer:"""
    
    try:
        response = ollama.generate(model=model_name, prompt=prompt)
        return response['response']
    except Exception as e:
        return f"Error querying model: {e}"

def search_and_query(query_text, model_name=None, max_files=5, max_chars_per_file=4000):
    """Search for relevant files and query them."""
    # Use MODEL_NAME if model_name not provided
    if model_name is None:
        model_name = get_current_model()
        if model_name is None:
            return [{"file": "Error", "answer": "No model selected. Please set MODEL_NAME to a model label (e.g., 'model_1')."}]
    
    if not os.path.exists(MIRROR_FOLDER):
        return [{"file": "Error", "answer": "Mirror folder not found"}]
    
    # Find HTML files
    html_files = find_html_files(MIRROR_FOLDER)
    
    # Simple keyword search on file paths. Every word of the question counts as a
    # keyword, so short words such as "in" match almost any path (e.g. "engineering").
    query_lower = query_text.lower()
    relevant_files = []
    
    for file_path in html_files:
        file_lower = file_path.lower()
        # Check if query keywords appear in filename or path
        if any(keyword in file_lower for keyword in query_lower.split()):
            relevant_files.append(file_path)
            if len(relevant_files) >= max_files:
                break
    
    # If no files found by filename, use first few files
    if not relevant_files:
        relevant_files = html_files[:max_files]
    
    print(f"Found {len(relevant_files)} relevant files to query")
    
    results = []
    for file_path in relevant_files:
        print(f"  Querying: {file_path}")
        file_text = extract_text_from_html(file_path)
        
        if len(file_text) > max_chars_per_file:
            file_text = file_text[:max_chars_per_file] + "... [truncated]"
        
        prompt = f"""Based on the following content from a university website, answer: {query_text}

Content:
{file_text}

Answer:"""
        
        try:
            response = ollama.generate(model=model_name, prompt=prompt)
            results.append({
                'file': file_path,
                'answer': response['response']
            })
        except Exception as e:
            results.append({
                'file': file_path,
                'answer': f"Error: {e}"
            })
    
    return results
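
Because `search_and_query` treats every word of the question as a keyword, a word like "in" matches the `engineering.buffalo.edu` prefix of every path, so the first files found are always selected. A possible refinement is sketched below: drop common words, ignore the shared folder prefix, and rank files by how many keywords they match. The stopword list, the `root_prefix` parameter, and both function names are assumptions made for illustration.

```python
import re

# Small illustrative stopword list (an assumption; extend as needed)
STOPWORDS = {
    "a", "an", "and", "are", "at", "for", "in", "is", "of", "on",
    "the", "to", "what", "which", "who", "how", "does", "do",
}


def extract_keywords(question, min_length=3):
    """Return lowercase words from the question, minus short and common words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if len(w) >= min_length and w not in STOPWORDS]


def rank_files_by_keywords(file_paths, question, root_prefix="", max_files=5):
    """Return up to max_files paths, ordered by number of matching keywords."""
    keywords = extract_keywords(question)
    scored = []
    for path in file_paths:
        # Ignore the shared folder prefix so it cannot produce matches
        relative = path[len(root_prefix):] if path.startswith(root_prefix) else path
        relative = relative.lower()
        score = sum(1 for kw in keywords if kw in relative)
        if score:
            scored.append((score, path))
    # Stable sort: ties keep their original order
    scored.sort(key=lambda item: -item[0])
    return [path for _, path in scored[:max_files]]


files = [
    "engineering.buffalo.edu/computer-science-engineering/news-and-events.html",
    "engineering.buffalo.edu/computer-science-engineering/sitemap.html",
    "engineering.buffalo.edu/computer-science-engineering/research.html",
]
question = "What are the main research areas in the Computer Science department?"
top = rank_files_by_keywords(files, question, root_prefix="engineering.buffalo.edu/", max_files=1)
print(top)  # ['engineering.buffalo.edu/computer-science-engineering/research.html']
```

Here `research.html` ranks first because it matches "research" in addition to "computer" and "science", which all three paths share.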

## 6) Query the Website

Now you can query the mirrored website. Try asking questions about:
- Faculty members
- Research areas
- Academic programs
- News and events
- Department information

In [13]:
# Example query - modify this to ask your own questions
question = "What are the main research areas in the Computer Science department?"

# Get current model
current_model = get_current_model()
print(f"Query: {question}")
print(f"Model: {current_model}")
print("=" * 70)

# Search and query (will use MODEL_NAME if model_name not specified)
results = search_and_query(question, max_files=3)

# Display results
for i, result in enumerate(results, 1):
    print(f"\n--- Result {i} ---")
    print(f"File: {result['file']}")
    print(f"Answer:\n{result['answer']}")
    print()

Query: What are the main research areas in the Computer Science department?
Model: llama4:latest
Found 3 relevant files to query
  Querying: engineering.buffalo.edu/computer-science-engineering/information-for-faculty-and-staff.html
  Querying: engineering.buffalo.edu/computer-science-engineering/news-and-events.html
  Querying: engineering.buffalo.edu/computer-science-engineering/sitemap.html

--- Result 1 ---
File: engineering.buffalo.edu/computer-science-engineering/information-for-faculty-and-staff.html
Answer:
Error: llama runner process has terminated: signal: killed (status code: 500)


--- Result 2 ---
File: engineering.buffalo.edu/computer-science-engineering/news-and-events.html
Answer:
Error: llama runner process has terminated: signal: killed (status code: 500)


--- Result 3 ---
File: engineering.buffalo.edu/computer-science-engineering/sitemap.html
Answer:
Error: llama runner process has terminated: signal: killed (status code: 500)
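
The `signal: killed` error above usually means the operating system stopped the model runner, often because the model did not fit in available memory; that diagnosis is an assumption here, since the output does not state the cause. The sketch below lists the models that fit under a chosen size budget. `models_within_budget` is a hypothetical helper, the 8 GB budget is an arbitrary example, and a model's listed size is only a rough lower bound on the memory it needs.

```python
GIB = 1024 ** 3


def models_within_budget(models, budget_gb):
    """Return (name, size_gb) pairs for models at most budget_gb, largest first.

    models is a list of dicts with 'model' and 'size' (bytes) keys,
    mirroring the entries read from ollama.list() earlier in the notebook.
    """
    fitting = []
    for m in models:
        size_gb = m.get("size", 0) / GIB
        if 0 < size_gb <= budget_gb:
            fitting.append((m.get("model", "Unknown"), size_gb))
    return sorted(fitting, key=lambda item: item[1], reverse=True)


# Sizes approximated from the model listing shown earlier
example_models = [
    {"model": "llama4:latest", "size": int(62.81 * GIB)},
    {"model": "mistral:latest", "size": int(3.83 * GIB)},
    {"model": "llama3.2:latest", "size": int(1.88 * GIB)},
]
for name, size_gb in models_within_budget(example_models, budget_gb=8):
    print(f"{name}: {size_gb:.2f} GB")
```

With the real listing, `models_within_budget(ollama.list()['models'], budget_gb=8)` would be the equivalent call, assuming the entries expose `model` and `size` as they do in the cells above.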



## 7) Query a Specific File

If you know which file you want to query, you can query it directly.

In [None]:
# Example: Query a specific file
# Modify the file path and question as needed

specific_file = "engineering.buffalo.edu/computer-science-engineering/sitemap.html"
question = "What pages are listed in this sitemap?"

if os.path.exists(specific_file):
    current_model = get_current_model()
    print(f"Querying file: {specific_file}")
    print(f"Question: {question}")
    print(f"Model: {current_model}")
    print("=" * 70)
    
    # Uses the model selected via MODEL_NAME if model_name is not specified
    answer = query_file_with_ollama(specific_file, question)
    print(answer)
else:
    print(f"File not found: {specific_file}")
    print("\nAvailable files in the mirror folder:")
    if os.path.exists(MIRROR_FOLDER):
        sample_files = find_html_files(MIRROR_FOLDER, max_files=10)
        for f in sample_files:
            print(f"  • {f}")

## 8) Interactive Query Function

Use this cell to easily query the website with your own questions.

In [None]:
def ask_question(question, model_name=None, max_files=5):
    """Convenience function to ask a question about the website."""
    # Use the model selected via MODEL_NAME if model_name is not provided
    if model_name is None:
        model_name = get_current_model()
    
    print(f"🔍 Question: {question}")
    print(f"🤖 Model: {model_name}")
    print("=" * 70)
    
    results = search_and_query(question, model_name=model_name, max_files=max_files)
    
    print(f"\n📊 Found {len(results)} results\n")
    
    for i, result in enumerate(results, 1):
        print(f"{'='*70}")
        print(f"Result {i} from: {result['file']}")
        print(f"{'='*70}")
        print(result['answer'])
        print()
    
    return results

# Example usage - modify the question below
# The function will automatically use the model selected via MODEL_NAME
# results = ask_question("Who are the faculty members in the Computer Science department?")

## Notes

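- **Model Selection**: Set `MODEL_NAME` to one of the model labels (model_1, model_2, etc.) listed in Step 3, and make sure you've pulled the model you want to use with `ollama pull <model-name>`. Pick a text-generation model: the default `model_1` in the run above is `nomic-embed-text:latest`, an embedding model, which cannot answer questions.
- **Performance**: Larger models may be slower but provide better answers, and they need more memory. In the run above, `llama4:latest` (62.81 GB) failed with `llama runner process has terminated: signal: killed`, which usually means the model did not fit in available memory; a smaller model is the first thing to try.
- **Token Limits**: Long pages are truncated to stay within model limits: by default, `query_file_with_ollama` keeps the first 8000 characters and `search_and_query` keeps the first 4000 characters per file.
- **File Search**: The current search only matches words from the question against file paths, so common short words can match every file. For better results, you might want to: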
  - Use a vector database (like ChromaDB or FAISS) for semantic search
  - Implement full-text search across all HTML content
  - Use embeddings to find most relevant files before querying
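
As an alternative to the truncation described under Token Limits, a long page can be split into overlapping chunks that are queried one at a time. The following is a minimal sketch; `chunk_text` and its default sizes are hypothetical, and the chunk size that works best depends on the model's context window.

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that sentences
    cut at a boundary still appear whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk could then be passed to the model in its own prompt, at the cost of one request per chunk.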

### Tips for Better Queries

1. Be specific: "What research does Professor X do?" is better than "Tell me about research"
2. Use context: Mention the department or area you're interested in
3. Try different models: Change `MODEL_NAME` to test different models (e.g., `MODEL_NAME = "model_2"`)
4. Query specific files: If you know the file, query it directly for faster results