## Website Summarizer
This Jupyter Notebook script creates an interactive web interface that allows users to summarize website content using local language models through Ollama. It combines Selenium for JavaScript-enabled web scraping with BeautifulSoup for HTML parsing, then sends the extracted content to a user-selected language model which generates a concise summary of the webpage. The notebook provides a user-friendly experience with interactive widgets for model selection and URL input, making it easy to analyze and summarize web content without leaving the notebook environment.

## Requirements
* Python 3.11 or higher              - ```runs the notebook code```
* Chrome Browser                     - ```driven in headless mode by Selenium```
* pip install selenium               - ```browser automation and web scraping```
* pip install webdriver-manager      - ```handles browser driver installation and compatibility```
* pip install beautifulsoup4         - ```parsing and extracting HTML content```
* pip install ollama                 - ```Python client for the local Ollama server; the script works well with llama3.2:3b; make sure you're serving Ollama with at least one model installed```
* socket                             - ```standard library; tests whether the Ollama server is reachable```
* re                                 - ```standard library; regular expression matching```
* time                               - ```standard library; time access and conversions```
* sys                                - ```standard library; access to system variables```
* pip install ipywidgets             - ```interactive UI elements in Jupyter Notebook```

### 1. Import Libraries

Imports all necessary Python libraries including Selenium for web automation, BeautifulSoup for HTML parsing, and the Ollama client for interacting with local language models.

In [1]:
import socket
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from IPython.display import display, Markdown, HTML
from bs4 import BeautifulSoup
import ollama
import time
import sys
import re
from typing import Optional, List, Dict, Any, Union

### 2. Define Constants and Helper Functions
Defines the default model constant and helper functions for checking if the Ollama server is running and formatting file sizes in a human-readable format.

In [2]:
# Default model if selection fails
DEFAULT_MODEL = "llama3.2:3b"

def is_ollama_server_running(host='localhost', port=11434, timeout=2):
    """Check if Ollama server is running by attempting to connect to its port"""
    try:
        # create_connection applies the timeout; the context manager closes the socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

def check_ollama_server():
    """Check if Ollama server is running and display an error message if not"""
    if not is_ollama_server_running():
        display(HTML("<div style='color: red; font-weight: bold;'>Error: Ollama server is not running.</div>"))
        display(HTML("<div>Please start the Ollama server with 'ollama serve' and try again.</div>"))
        return False
    return True

def format_size(size_in_bytes):
    """Format byte size to human-readable format"""
    if size_in_bytes > 1_000_000_000:
        return f"{size_in_bytes / 1_000_000_000:.2f} GB"
    elif size_in_bytes > 1_000_000:
        return f"{size_in_bytes / 1_000_000:.2f} MB"
    elif size_in_bytes > 1_000:
        return f"{size_in_bytes / 1_000:.2f} KB"
    else:
        return f"{size_in_bytes} bytes"
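
The port check can be exercised without Ollama by pointing it at a throwaway local listener. The following is a minimal sketch of that idea; `port_is_open` and the demo variables are hypothetical names that are not part of the notebook, and the timeout value is an assumption.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection opens the socket, applies the timeout, and connects
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

# Demo against a temporary listener on an OS-assigned free port
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

open_while_listening = port_is_open("127.0.0.1", demo_port)
listener.close()
open_after_close = port_is_open("127.0.0.1", demo_port)
```

Catching `OSError` rather than only `ConnectionRefusedError` means a timeout or an unreachable host is reported as "not running" instead of raising.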


### 3. Model Selection Functions
These functions query the Ollama API to retrieve available models and display them in a formatted HTML table within the notebook for easy selection.

In [3]:
def get_available_models():
    """Query Ollama for available models"""
    try:
        models_response = ollama.list()
        
        # Check different possible response formats
        if 'models' in models_response:
            return models_response['models']
        elif isinstance(models_response, list):
            return models_response
        elif isinstance(models_response, dict):
            # A dict without a 'models' key is treated as a single model entry
            return [models_response]
        else:
            display(HTML("<div style='color: orange;'>Warning: Unexpected response format from Ollama. Using default model.</div>"))
            return []
    except Exception as e:
        display(HTML(f"<div style='color: red;'>Error querying available models: {e}</div>"))
        display(HTML("<div>Using default model instead.</div>"))
        return []

def display_models(models):
    """Display available models in a formatted table"""
    if not models:
        display(HTML(f"<div>No models found. Using default model: {DEFAULT_MODEL}</div>"))
        return
    
    html = "<h3>Available Models</h3>"
    html += "<table style='width:100%; border-collapse: collapse;'>"
    html += "<tr style='background-color: #f2f2f2;'><th style='border: 1px solid #ddd; padding: 8px; text-align: left;'>Number</th><th style='border: 1px solid #ddd; padding: 8px; text-align: left;'>Model Name</th><th style='border: 1px solid #ddd; padding: 8px; text-align: left;'>Size</th><th style='border: 1px solid #ddd; padding: 8px; text-align: left;'>Modified</th></tr>"
    
    for i, model in enumerate(models, 1):
        # Try different possible key names for model information
        model_name = model.get('name', model.get('model', 'Unknown'))
        
        # Format the size when it is reported in bytes; a digest is a hash, not a size
        model_size = model.get('size', 'Unknown size')
        if isinstance(model_size, int):
            model_size = format_size(model_size)
        
        # Try to get modified time
        modified_time = model.get('modified_at', model.get('modified', ''))
        
        html += f"<tr><td style='border: 1px solid #ddd; padding: 8px;'>{i}</td><td style='border: 1px solid #ddd; padding: 8px;'>{model_name}</td><td style='border: 1px solid #ddd; padding: 8px;'>{model_size}</td><td style='border: 1px solid #ddd; padding: 8px;'>{modified_time}</td></tr>"
    
    html += "</table>"
    display(HTML(html))
    return models
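
The shape handling above can be checked in isolation with plain Python data. This is a rough sketch that assumes the response is an ordinary dict or list; depending on the client version the response may instead be a typed object, which this sketch does not cover. `extract_model_entries` and `model_name` are hypothetical helpers, not functions from the notebook or the Ollama client.

```python
from typing import Any, Dict, List

def extract_model_entries(response: Any) -> List[Dict[str, Any]]:
    """Return a list of model entries from a dict- or list-shaped response."""
    if isinstance(response, dict) and "models" in response:
        return list(response["models"])
    if isinstance(response, list):
        return response
    # Anything else is treated as "no models available"
    return []

def model_name(entry: Dict[str, Any], fallback: str = "Unknown") -> str:
    """Read the model identifier from either the 'name' or the 'model' key."""
    return entry.get("name") or entry.get("model") or fallback
```

For example, `model_name({"model": "llama3.2:3b"})` returns the identifier even though the `name` key is absent.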


### 4. URL Processing Functions
Contains the function that ensures URLs are properly formatted with the correct protocol prefix (https://) before processing.

In [4]:
def format_url(url):
    """Format URL to ensure it has the proper protocol prefix"""
    # Remove any leading/trailing whitespace
    url = url.strip()
    
    # Check if the URL already has a protocol (http:// or https://)
    if not re.match(r'^https?://', url):
        # Add https:// prefix if not present
        url = "https://" + url
    
    return url
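
Prefixing the scheme does not by itself guarantee a usable URL: an empty input would become `https://`. One possible extension, sketched below, rejects inputs that have no hostname after normalization. `normalize_url` is a hypothetical name, and returning `None` for invalid input is an assumption of this sketch rather than the notebook's behavior.

```python
import re
from typing import Optional
from urllib.parse import urlparse

def normalize_url(raw: str) -> Optional[str]:
    """Add https:// when no scheme is given; return None if no hostname remains."""
    url = raw.strip()
    if not url:
        return None
    # Match the scheme case-insensitively so "HTTPS://..." is not prefixed again
    if not re.match(r"^https?://", url, flags=re.IGNORECASE):
        url = "https://" + url
    # hostname is None when nothing follows the scheme
    return url if urlparse(url).hostname else None
```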


### 5. Web Scraping Class
The ScrapeWebsite class handles the entire web scraping process using Selenium and BeautifulSoup, configuring a headless Chrome browser with optimized settings and extracting clean text content from web pages.

In [5]:
class ScrapeWebsite:
    def __init__(self, url):
        """
        Create this Website object from the given URL using Selenium + BeautifulSoup
        Supports JavaScript-heavy and normal websites uniformly.
        """
        self.url = url
        self.title = "No title found"
        self.text = ""
        
        display(HTML(f"<div>Fetching content from {url}...</div>"))
        
        # Configure headless Chrome with optimized settings
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-extensions')
        options.add_argument('--disable-gpu')
        options.add_argument('--enable-unsafe-swiftshader')  # Handle WebGL issues
        
        try:
            # Use webdriver-manager to download a matching ChromeDriver
            service = Service(ChromeDriverManager().install())
            
            # Initialize the Chrome WebDriver with the service and options
            driver = webdriver.Chrome(service=service, options=options)
            
            try:
                # Load the page in the headless browser
                driver.get(url)
                
                # Wait up to 10 seconds for the <body> element to appear
                WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.TAG_NAME, "body"))
                )
                
                # Fetch the page source after JS execution
                page_source = driver.page_source
            finally:
                # Always close the browser, even if loading fails
                driver.quit()
            
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(page_source, 'html.parser')
            
            # Extract title
            # Extract title; fall back when <title> is missing or empty
            self.title = (soup.title.get_text(strip=True) if soup.title else "") or "No title found"
            
            # Remove unnecessary elements
            for tag in ('script', 'style', 'img', 'input', 'iframe', 'noscript'):
                for element in soup.find_all(tag):
                    element.decompose()
            
            # Extract the main text
            self.text = soup.body.get_text(separator="\n", strip=True) if soup.body else ""
            
            display(HTML(f"<div style='color: green;'>Successfully fetched content from {url}</div>"))
            display(HTML(f"<div><strong>Title:</strong> {self.title}</div>"))
            
        except Exception as e:
            display(HTML(f"<div style='color: red;'>Error scraping website: {e}</div>"))
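
The tag-stripping step can be tried on a static HTML string without launching a browser. The sketch below uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, so it illustrates the idea rather than reproducing the class above; `TextExtractor`, `extract_text`, and `SKIPPED_TAGS` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Elements whose text content should not appear in the extracted text
SKIPPED_TAGS = {"script", "style", "noscript", "iframe"}

class TextExtractor(HTMLParser):
    """Collect the page title and visible text, ignoring skipped elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._in_title = False
        self.title = ""
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)

def extract_text(html: str) -> Tuple[str, str]:
    """Return (title, newline-joined text) for an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title or "No title found", "\n".join(parser.chunks)

sample = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><h1>Hello</h1><script>var x=1;</script><p>World</p></body></html>"
)
title, text = extract_text(sample)
```

Unlike the class above, this stand-in keeps the title separate from the body text by tracking the `<title>` element rather than by restricting extraction to `<body>`.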


### 6. Summarization Functions
These functions prepare the message format for the Ollama API and handle the process of sending the scraped content to the selected language model for summarization.

In [6]:
def messages_for(website):
    """
    Create a messages array for the chat model based on website content
    """
    return [
        {"role": "system", "content": "You are a helpful assistant that summarizes web content."},
        {"role": "user", "content": f"Summarize the following website content titled '{website.title}':\n\n{website.text[:5000]}"}
    ]

def summarize(url, model):
    """Scrape website and generate summary using Ollama"""
    website = ScrapeWebsite(url)
    
    if not website.text:
        return "Could not extract content from the website."
    
    display(HTML("<div>Generating summary...</div>"))
    
    messages = messages_for(website)
    response = ollama.chat(model=model, messages=messages)
    return response['message']['content']
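
Slicing the text at a fixed 5000 characters can cut a word in half. A possible refinement, sketched here under the assumption that a slightly shorter prompt is acceptable, trims back to the last whitespace boundary before building the messages. `truncate_at_word` and `build_messages` are hypothetical helpers; the prompt strings mirror the ones used in `messages_for`.

```python
from typing import Dict, List

def truncate_at_word(text: str, limit: int) -> str:
    """Cut `text` to at most `limit` characters, preferring a whitespace boundary."""
    if limit <= 0:
        return ""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # The cut already falls on a boundary: just drop trailing whitespace
    if text[limit].isspace() or cut[-1].isspace():
        return cut.rstrip()
    # Otherwise drop the trailing partial word, if an earlier boundary exists
    parts = cut.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else cut

def build_messages(title: str, text: str, limit: int = 5000) -> List[Dict[str, str]]:
    """Build the system and user messages from a title and truncated page text."""
    body = truncate_at_word(text, limit)
    return [
        {"role": "system", "content": "You are a helpful assistant that summarizes web content."},
        {"role": "user", "content": f"Summarize the following website content titled '{title}':\n\n{body}"},
    ]
```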


### 7. Check Ollama Server
Verifies that the Ollama server is running and displays an appropriate message if it's not available.

In [7]:
# Check if Ollama server is running
server_running = check_ollama_server()
if not server_running:
    display(HTML("<div style='color: red;'>Please restart the notebook after starting the Ollama server.</div>"))


### 8. Display and Select Models
Creates an interactive dropdown widget populated with available models from the Ollama server, allowing users to easily select which model to use.

In [8]:
# Get and display available models
models = get_available_models()
displayed_models = display_models(models)

# Create a dropdown for model selection
from ipywidgets import Dropdown, VBox, Label, Button, Text, Output
import ipywidgets as widgets

model_options = [(model.get('name', model.get('model', 'Unknown')), model.get('name', model.get('model', DEFAULT_MODEL))) 
                 for model in models] if models else [(DEFAULT_MODEL, DEFAULT_MODEL)]

model_dropdown = Dropdown(
    options=model_options,
    description='Select Model:',
    disabled=False,
    style={'description_width': 'initial'}
)

display(model_dropdown)


| Number | Model Name | Size | Modified |
| --- | --- | --- | --- |
| 1 | deepseek-r1:1.5b | 1.12 GB | 2025-05-18 18:34:47.314876-05:00 |
| 2 | llama3.2:3b | 2.02 GB | 2025-05-18 16:11:56.987203-05:00 |
| 3 | mistral-small3.1:latest | 15.49 GB | 2025-05-10 20:33:23.221494-05:00 |


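
The dropdown options are a list of `(label, value)` pairs, so the list can be built and checked separately from the widget. The following sketch assumes dict-shaped model entries; `build_model_options` and `FALLBACK_MODEL` are hypothetical names, and skipping unnamed or duplicate entries is a choice made here, not behavior of the cell above.

```python
from typing import Any, Dict, List, Tuple

FALLBACK_MODEL = "llama3.2:3b"  # mirrors DEFAULT_MODEL in the notebook

def build_model_options(
    models: List[Dict[str, Any]], fallback: str = FALLBACK_MODEL
) -> List[Tuple[str, str]]:
    """Build (label, value) pairs for a Dropdown, skipping unnamed or duplicate entries."""
    options: List[Tuple[str, str]] = []
    seen = set()
    for entry in models:
        name = entry.get("name") or entry.get("model")
        if name and name not in seen:
            seen.add(name)
            options.append((name, name))
    # Fall back to a single default option when nothing usable was found
    return options or [(fallback, fallback)]
```

The result can be passed as `options=` when constructing the `Dropdown`.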

### 9. URL Input
Creates a text input widget where users can enter the URL of the website they want to summarize.

In [9]:
# Create URL input widget
url_input = Text(
    value='',
    placeholder='Enter website URL (e.g., msn.com)',
    description='Website:',
    disabled=False,
    style={'description_width': 'initial'}
)

display(url_input)



### 10. Run Button and Output
Creates the "Fetch and Summarize" button that triggers the entire process and displays the results in a dedicated output area below the button.

In [10]:
# Create output area and run button
output = Output()
run_button = Button(description="Fetch and Summarize")

def on_run_button_clicked(b):
    with output:
        output.clear_output()
        
        # Validate the raw input before formatting; format_url("") would return "https://"
        raw_url = url_input.value.strip()
        if not raw_url:
            display(HTML("<div style='color: red;'>Please enter a valid website URL.</div>"))
            return
        url = format_url(raw_url)
        
        model = model_dropdown.value
        display(HTML(f"<div>Fetching and summarizing <b>{url}</b> using model: <b>{model}</b></div>"))
        
        # Summarize the website
        summary = summarize(url, model)
        
        # Display the summary
        display(Markdown("## Summary"))
        display(Markdown(summary))

run_button.on_click(on_run_button_clicked)

display(run_button)
display(output)


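
The click handler mixes validation, summarization, and display. One way to make that logic testable without widgets is to factor it into a plain function that receives the summarizer as an argument. This is only a sketch: `run_summary` is a hypothetical helper, and catching every exception so that it can be shown in the output area is an assumption about the desired behavior.

```python
from typing import Callable, Tuple

def run_summary(
    raw_url: str, model: str, summarize_fn: Callable[[str, str], str]
) -> Tuple[bool, str]:
    """Validate the input, call `summarize_fn`, and report (ok, message)."""
    url = raw_url.strip()
    if not url:
        return False, "Please enter a valid website URL."
    try:
        return True, summarize_fn(url, model)
    except Exception as exc:
        # Surface any scraping or model error to the caller instead of raising
        return False, f"Summarization failed: {exc}"

def failing_summarizer(url: str, model: str) -> str:
    """Stub that simulates a model or network failure."""
    raise RuntimeError("down")

empty_result = run_summary("   ", "m", lambda u, m: "unused")
ok_result = run_summary("msn.com", "m", lambda u, m: f"{m}:{u}")
error_result = run_summary("msn.com", "m", failing_summarizer)
```

Inside the notebook, the handler could call such a function with the real `summarize` and render the message with `Markdown` or `HTML` depending on the `ok` flag.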