# 🌐 WebPage Summarizer

An intelligent web content summarization tool that extracts and condenses webpage information using advanced AI models.

**🚀 [Try it live on Hugging Face Spaces!](https://huggingface.co/spaces/daniela-veloz/webpage-summarizer)**

## 📋 Overview

This project creates concise, structured summaries of web content by combining language models with web scraping. The tool supports both cloud-based and local AI models, including OpenAI's GPT-4o-mini and the open-source GPT-OSS:20B model through Ollama, providing flexibility for different deployment scenarios. It is well suited to quickly digesting lengthy articles, blog posts, or documentation.

## ✨ Key Features

- **🤖 Dual AI Models**: Powered by OpenAI's `gpt-4o-mini` and open-source `gpt-oss:20b` through Ollama for high-quality text summarization
- **🔓 Local & Cloud Options**: Choose between cloud-based OpenAI models or run models locally with Ollama
- **🕷️ Advanced Web Scraping**: Uses Selenium to handle both static and dynamic JavaScript-rendered websites
- **📝 Markdown Output**: Generates clean, formatted summaries in Markdown for easy reading and sharing
- **🎯 Focused Processing**: Efficiently processes individual webpage URLs without crawling entire sites
- **⚡ Multi-Tool Integration**: Combines multiple libraries for robust and reliable content extraction

## 🛠️ Technology Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **AI Models** | OpenAI GPT-4o-mini, GPT-OSS:20B | Content summarization |
| **Web Scraping** | Selenium WebDriver | Dynamic content extraction |
| **HTML Parsing** | BeautifulSoup | Static content processing |
| **HTTP Requests** | Python Requests | Basic web requests |
| **AI Integration** | OpenAI API, Ollama | Model access and inference |
| **Local AI Runtime** | Ollama | Local model execution |
| **Language** | Python | Core development |

## 🚀 Installation Requirements

### Ollama Setup
To use the GPT-OSS:20B model locally, you need to install Ollama:

1. **Install Ollama**: Visit [ollama.com](https://ollama.com) and download for your platform
2. **Pull the model**: After installation, run:
   ```bash
   ollama pull gpt-oss:20b
   ```
3. **Start Ollama service**: The service should start automatically, or run:
   ```bash
   ollama serve
   ```
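
Before running the notebook, it can help to confirm that Ollama is reachable and that the model has been pulled. The sketch below is one possible check, not part of the project: it assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally available models. The helper names `fetch_tags` and `model_is_available` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def model_is_available(tags_payload: dict, model_name: str) -> bool:
    """Return True if model_name appears in an Ollama /api/tags response."""
    models = tags_payload.get("models", [])
    return any(entry.get("name") == model_name for entry in models)


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Fetch the locally pulled models, or return None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None
```

Calling `fetch_tags()` and passing the result to `model_is_available(payload, "gpt-oss:20b")` tells you whether `ollama pull` still needs to be run; a `None` result suggests the service is not running.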

### Python Dependencies
Install required Python packages:
```bash
pip install selenium beautifulsoup4 webdriver-manager openai requests python-dotenv
```

## 🎯 Project Scope

- ✅ **Single URL Processing**: Focuses on individual webpage content
- ✅ **Content Extraction**: Handles both static and dynamic web content
- ✅ **AI Summarization**: Generates intelligent, contextual summaries
- ✅ **Structured Output**: Provides clean Markdown formatting
- ✅ **Local & Cloud AI**: Supports both local Ollama and cloud OpenAI models
- ❌ **Site Crawling**: Does not process entire websites or multiple pages

## 🏆 Skill Level

**Beginner-Friendly** - Perfect for developers learning:
- Web scraping fundamentals
- AI model integration
- API consumption
- Local AI deployment with Ollama
- Content processing pipelines

## 🚀 Use Cases

- **📰 News Article Summaries**: Quickly digest lengthy news articles
- **📚 Research Papers**: Extract key points from academic content
- **📖 Documentation**: Summarize technical documentation
- **🛍️ Product Reviews**: Condense detailed product information
- **💼 Business Reports**: Extract insights from corporate content

## 💡 Benefits

- **⏰ Time-Saving**: Reduces reading time by 70-80%
- **🎯 Focus Enhancement**: Highlights key information and insights
- **📱 Accessibility**: Markdown format works across all platforms
- **🔄 Consistency**: Standardized summary format for all content
- **🤝 Shareability**: Easy to share and collaborate on summaries
- **🔒 Privacy Options**: Local processing with Ollama for sensitive content

---

*This project demonstrates practical application of AI, web scraping, and content processing technologies with both cloud and local deployment options.*

## Environment Setup

In [None]:
!uv pip install selenium beautifulsoup4 webdriver-manager openai requests python-dotenv

In [1]:
# ===========================
# System & Environment
# ===========================
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display

## Web Scraping Module

In [2]:
from bs4 import BeautifulSoup  # needed by _extract_main_content below
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

class WebUrlCrawler:
    def __init__(self, headless=True, timeout=10):
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def _setup_driver(self):
        chrome_options = Options()
        if self.headless:
            chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--window-size=1920,1080")

        try:
            self.driver = webdriver.Chrome(options=chrome_options)
            self.driver.set_page_load_timeout(self.timeout)
        except WebDriverException as e:
            raise Exception(f"Failed to initialize Chrome driver: {e}") from e

    def _extract_main_content(self, html):
        soup = BeautifulSoup(html, 'html.parser')

        # Remove unwanted elements
        unwanted_tags = ['script', 'style', 'img', 'input', 'button', 'nav', 'footer', 'header']
        for tag in unwanted_tags:
            for element in soup.find_all(tag):
                element.decompose()

        # Try to find main content containers in order of preference
        content_selectors = [
            'main',
            'article',
            '[role="main"]',
            '.content',
            '#content',
            '.main-content',
            '#main-content'
        ]

        for selector in content_selectors:
            content_element = soup.select_one(selector)
            if content_element:
                return content_element.get_text(strip=True, separator='\n')

        # Fallback to body if no main content container found
        body = soup.find('body')
        if body:
            return body.get_text(strip=True, separator='\n')

        return soup.get_text(strip=True, separator='\n')

    def crawl(self, url):
        if not self.driver:
            self._setup_driver()

        try:
            self.driver.get(url)

            WebDriverWait(self.driver, self.timeout).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            html_content = self.driver.page_source
            main_content = self._extract_main_content(html_content)
            return main_content

        except TimeoutException:
            raise Exception(f"Timeout while loading {url}")
        except WebDriverException as e:
            raise Exception(f"Error crawling {url}: {e}") from e

    def close(self):
        if self.driver:
            self.driver.quit()
            self.driver = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
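
To illustrate what the cleaning step in `_extract_main_content` does, here is a minimal stand-in that uses only the standard library's `html.parser` instead of BeautifulSoup. It is a sketch of the idea (drop the contents of unwanted tags, keep the remaining text one fragment per line), not the crawler's actual implementation. The names `VisibleTextParser` and `visible_text` are hypothetical, and the skipped-tag set is a subset of the crawler's list.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text while skipping the contents of unwanted paired tags."""

    # Void tags such as img and input have no contents, so they are not listed here
    SKIPPED = {"script", "style", "nav", "footer", "header", "button"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the kept text fragments joined with newlines."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

BeautifulSoup remains the better choice for real pages because it tolerates malformed markup and supports the CSS selectors used to locate the main content container.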

In [None]:
from bs4 import BeautifulSoup
import requests

class WebSite:
    """
    A data class representing a scraped website with its core content and metadata.
    
    This class serves as a container for website information extracted during
    the web scraping process, providing a structured way to store and access
    webpage data for further processing.
    
    Attributes:
        url (str): The URL of the scraped website
        title (str): The page title extracted from the HTML <title> tag
        body (str): The cleaned text content from the webpage body
        links (List[str]): A list of all hyperlink URLs found on the page
    """
    
    def __init__(self, url, title, body, links):
        """
        Initialize a WebSite object with scraped content.
        
        Args:
            url (str): The URL of the website
            title (str): The page title
            body (str): The cleaned body text content
            links (List[str]): List of hyperlink URLs found on the page
        """
        self.url = url
        self.title = title
        self.body = body
        self.links = links

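# Note: this redefines WebUrlCrawler. Once this cell runs, it replaces the
# Selenium-based class defined in the previous cell.
class WebUrlCrawler: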
    """
    A web scraper that extracts content from web pages using HTTP requests and BeautifulSoup.
    
    This crawler fetches webpage content, cleans HTML markup, and extracts meaningful
    text content along with metadata. It's designed for simple, fast content extraction
    from static web pages without JavaScript rendering requirements.
    
    Attributes:
        headers (dict): HTTP headers used for web requests to mimic browser behavior
        timeout (int): Request timeout in seconds
        driver: Placeholder attribute for compatibility (not used in this implementation)
        headless (bool): Placeholder attribute for compatibility (not used in this implementation)
    """
    
    # Some websites reject requests that lack browser-like headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    }

    def __init__(self, headless=True, timeout=10):
        """
        Initialize the web crawler with configuration options.
        
        Args:
            headless (bool, optional): Compatibility parameter, not used in this implementation.
                                      Defaults to True
            timeout (int, optional): Request timeout in seconds. Defaults to 10
        """
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def crawl(self, url) -> WebSite:
        """
        Scrape a webpage and extract its content and metadata.
        
        This method performs the following operations:
        1. Sends an HTTP GET request to the specified URL
        2. Parses the HTML content using BeautifulSoup
        3. Extracts the page title
        4. Cleans the body text by removing scripts, styles, images, and inputs
        5. Extracts all hyperlinks from the page
        6. Returns a WebSite object with the processed data
        
        Args:
            url (str): The URL of the webpage to scrape
            
        Returns:
            WebSite: An object containing the scraped website data including
                    URL, title, cleaned body text, and list of links
                    
        Raises:
            requests.RequestException: If the HTTP request fails
            BeautifulSoup parsing errors: If HTML parsing fails
        """
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.string if soup.title else "No title found"

        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            body = soup.body.get_text(strip=True, separator='\n')
        else:
            body = ""

        links = [link.get('href') for link in soup.find_all('a')]
        links = [link for link in links if link]

        return WebSite(url, title, body, links)
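
The `links` list holds `href` values exactly as they appear in the page, so it may contain relative paths, in-page anchors, and `mailto:` addresses. If absolute URLs are needed downstream, one option is to normalize them with the standard library, as in the sketch below. The helper `absolute_links` is hypothetical and not part of the crawler.

```python
from typing import List
from urllib.parse import urldefrag, urljoin


def absolute_links(page_url: str, hrefs: List[str]) -> List[str]:
    """Resolve hrefs against page_url, keep only HTTP(S) URLs, de-duplicate in order."""
    seen = set()
    resolved = []
    for href in hrefs:
        # urljoin resolves relative paths; urldefrag strips any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # skips mailto:, javascript:, tel:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved
```

With a crawled page in hand, this could be called as `absolute_links(website.url, website.links)`.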

## LLM Client

In [None]:
from openai import OpenAI

class LLMClient:
    """
    A client for interacting with language models through OpenAI's API.
    
    This client supports both OpenAI's hosted models and local models via custom base URLs.
    It provides a simplified interface for generating text responses from language models
    with system prompts to guide model behavior.
    
    Attributes:
        model (str): The model name to use for text generation
        openai (OpenAI): The OpenAI client instance for API communication
    """
    
    def __init__(self, model, base_url=None):
        """
        Initialize the LLM client with model configuration.
        
        Args:
            model (str): The model name to use (e.g., 'gpt-4o-mini', 'gpt-3.5-turbo')
            base_url (str, optional): Custom base URL for local models. If provided,
                                     the model parameter is used as the API key for
                                     local model authentication. Defaults to None
        """
        self.model = model
        if base_url:
            self.openai = OpenAI(base_url=base_url, api_key=model)
        else:
            self.openai = OpenAI()

    def generate_text(self, user_prompt, system_prompt="") -> str:
        """
        Generate a text response using the configured language model.
        
        This method sends a user prompt along with optional system instructions
        to the language model and returns the generated response. System prompts
        are used to guide the model's behavior and response style.
        
        Args:
            user_prompt (str): The user's input message or query
            system_prompt (str, optional): System instructions to guide the model's
                                         behavior and response format. Defaults to ""
        
        Returns:
            str: The model's generated text response
            
        Raises:
            OpenAIError: If the API request fails or returns an error
        """
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        response = self.openai.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content
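
`generate_text` always sends a system message, even when `system_prompt` is left at its default of `""`. If you would rather omit an empty system message, a small helper along these lines could build the list instead. This is a sketch of one possible variation; `build_messages` is a hypothetical name, not part of `LLMClient`.

```python
def build_messages(user_prompt: str, system_prompt: str = "") -> list:
    """Build a chat message list, omitting the system message when it is blank."""
    messages = []
    if system_prompt.strip():
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The returned list has the same shape as the `messages` argument that `generate_text` passes to `chat.completions.create`.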

## Summarization

In [None]:
def summarize(url, llm_client):
    """
    Generate an AI-powered summary of a webpage's content.
    
    This function combines web scraping and AI text generation to create
    concise, structured summaries of web content. It extracts the webpage
    content using the WebUrlCrawler and then processes it through a language
    model to generate a markdown-formatted summary with a TL;DR section.
    
    The function performs the following workflow:
    1. Scrapes the webpage content using WebUrlCrawler
    2. Constructs prompts for the AI model with the scraped content
    3. Generates a summary using the provided LLM client
    4. Displays the summary in markdown format
    
    Args:
        url (str): The URL of the webpage to summarize
        llm_client (LLMClient): An initialized LLM client for generating the summary
        
    Returns:
        None: The function displays the summary directly using IPython.display.Markdown
        
    Raises:
        Exception: If web scraping fails (network issues, invalid URL, etc.)
        OpenAIError: If the AI model request fails or returns an error
    """
    crawler = WebUrlCrawler()
    website = crawler.crawl(url)

    system_prompt = """You are a web page summarizer that analyzes the content of a provided web page and provides a short and relevant summary. You will also provide a TL;DR at the top. Return your response in markdown."""
    user_prompt = f"""You are looking at the website titled: {website.title}. The content of the website is as follows: {website.body}"""

    summary = llm_client.generate_text(system_prompt=system_prompt, user_prompt=user_prompt)
    display(Markdown(summary))
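
`summarize` places the full page text into the prompt, so a very long page may exceed the model's context window. One simple mitigation is to cap the text length before building the prompt. The sketch below assumes a character budget; the 20,000 default is an arbitrary placeholder rather than a measured limit, and `truncate_body` is a hypothetical helper.

```python
def truncate_body(body: str, max_chars: int = 20000) -> str:
    """Cut body to at most max_chars characters, preferring a line boundary."""
    if len(body) <= max_chars:
        return body
    clipped = body[:max_chars]
    last_newline = clipped.rfind("\n")
    # Back up to a newline only if that keeps more than half of the budget
    if last_newline > max_chars // 2:
        clipped = clipped[:last_newline]
    return clipped + "\n[content truncated]"
```

Inside `summarize`, this would be applied as `truncate_body(website.body)` when formatting `user_prompt`. A token-based limit would be more precise, but a character cap needs no extra dependencies.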

### Summarization with gpt-4o-mini



#### Load OPENAI_API_KEY

In [6]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

print("✅ API key loaded successfully!")

✅ API key loaded successfully!


#### Configure gpt-4o-mini client

In [7]:
model_open_ai = "gpt-4o-mini"
open_ai_llm_client = LLMClient(model=model_open_ai)

#### Example

In [8]:
summarize("https://en.wikipedia.org/wiki/Marie_Curie", open_ai_llm_client)

# TL;DR
Marie Curie was a groundbreaking Polish-French physicist and chemist who won Nobel Prizes in both Physics (1903) and Chemistry (1911) for her pioneering work on radioactivity, including the discovery of the elements polonium and radium. She made significant contributions to medical treatment using radioactive isotopes and remains a symbol of female scientific achievement.

---

Marie Curie, born Maria Salomea Skłodowska on November 7, 1867, in Warsaw, Poland, was a pioneering scientist known for her research in radioactivity. She was the first woman to win a Nobel Prize and the only person to win Nobel Prizes in two scientific fields: Physics in 1903 (shared with her husband Pierre Curie and Henri Becquerel) and Chemistry in 1911 for her discoveries of polonium and radium.

Curie's academic journey began in Warsaw, where she participated in the clandestine "Flying University" due to the restrictions placed on women in higher education. In 1891, she moved to Paris to study at the University of Paris, where she made profound advancements in the understanding of radiation. 

Throughout her career, she faced numerous challenges, including sexism in academia and personal tragedies, such as the death of her husband in 1906. Despite these obstacles, Marie Curie established the Curie Institute in Paris and pioneered mobile X-ray units during World War I to aid wounded soldiers.

Curie's work not only transformed the field of physics and chemistry but also laid the groundwork for cancer treatment using radiation. She died on July 4, 1934, from aplastic anemia believed to be linked to her long-term exposure to radiation. Today, she is celebrated as a key figure in science and remains an inspiring role model for women in STEM fields.

### Summarization with gpt-oss:20b

#### Configure gpt-oss:20b client

In [57]:
model_gpt_oss = "gpt-oss:20b"
gpt_oss_llm_client = LLMClient(model=model_gpt_oss, base_url="http://localhost:11434/v1")
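
The two client configurations used in this notebook differ only in the model name and the optional `base_url`. If you switch between them often, a small lookup like the one below keeps those settings in one place. It is a sketch under that assumption; `BACKENDS` and `client_settings` are hypothetical names.

```python
# Keyword arguments for LLMClient, keyed by a short backend label
BACKENDS = {
    "openai": {"model": "gpt-4o-mini"},
    "ollama": {"model": "gpt-oss:20b", "base_url": "http://localhost:11434/v1"},
}


def client_settings(backend: str) -> dict:
    """Return a copy of the LLMClient keyword arguments for the given backend."""
    try:
        return dict(BACKENDS[backend.lower()])
    except KeyError:
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
        ) from None
```

A client could then be created with `LLMClient(**client_settings("ollama"))`.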

#### Example

In [58]:
summarize("https://en.wikipedia.org/wiki/Marie_Curie", gpt_oss_llm_client)
