# 🌐 WebPage Summarizer

An intelligent web content summarization tool that extracts and condenses webpage information using advanced AI models.

## 📋 Overview

This project creates concise, structured summaries of web content by combining language models with web scraping. It supports both cloud-based and local models: OpenAI's GPT-4o-mini and the open-source GPT-OSS:20B model served through Ollama, so it can be deployed in different environments. It is well suited to quickly understanding lengthy articles, blog posts, or documentation.

## ✨ Key Features

- **🤖 Dual AI Models**: Powered by OpenAI's `gpt-4o-mini` and open-source `gpt-oss:20b` through Ollama for high-quality text summarization
- **🔓 Local & Cloud Options**: Choose between cloud-based OpenAI models or run models locally with Ollama
- **🕷️ Advanced Web Scraping**: Uses Selenium to handle both static and dynamic JavaScript-rendered websites
- **📝 Markdown Output**: Generates clean, formatted summaries in Markdown for easy reading and sharing
- **🎯 Focused Processing**: Efficiently processes individual webpage URLs without crawling entire sites
- **⚡ Multi-Tool Integration**: Combines multiple libraries for robust and reliable content extraction

## 🛠️ Technology Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **AI Models** | OpenAI GPT-4o-mini, GPT-OSS:20B | Content summarization |
| **Web Scraping** | Selenium WebDriver | Dynamic content extraction |
| **HTML Parsing** | BeautifulSoup | Static content processing |
| **HTTP Requests** | Python Requests | Basic web requests |
| **AI Integration** | OpenAI API, Ollama | Model access and inference |
| **Local AI Runtime** | Ollama | Local model execution |
| **Language** | Python | Core development |

## 🚀 Installation Requirements

### Ollama Setup
To use the GPT-OSS:20B model locally, you need to install Ollama:

1. **Install Ollama**: Visit [ollama.com](https://ollama.com) and download for your platform
2. **Pull the model**: After installation, run:
   ```bash
   ollama pull gpt-oss:20b
   ```
3. **Start Ollama service**: The service should start automatically, or run:
   ```bash
   ollama serve
   ```
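
Before running the notebook against a local model, it can help to confirm that the Ollama service is reachable. The following is a minimal sketch, not part of the original project: the helper name `is_ollama_running` is hypothetical, and it assumes Ollama listens on its default address `http://localhost:11434`.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url.

    The default base_url is an assumption: it is Ollama's default address.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as "not running".
        return False
```

Calling `is_ollama_running()` returns `False` instead of raising when nothing is listening, so it can guard the local-model cells.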

### Python Dependencies
Install required Python packages:
```bash
pip install selenium beautifulsoup4 webdriver-manager openai requests python-dotenv
```

## 🎯 Project Scope

- ✅ **Single URL Processing**: Focuses on individual webpage content
- ✅ **Content Extraction**: Handles both static and dynamic web content
- ✅ **AI Summarization**: Generates intelligent, contextual summaries
- ✅ **Structured Output**: Provides clean Markdown formatting
- ✅ **Local & Cloud AI**: Supports both local Ollama and cloud OpenAI models
- ❌ **Site Crawling**: Does not process entire websites or multiple pages

## 🏆 Skill Level

**Beginner-Friendly** - Perfect for developers learning:
- Web scraping fundamentals
- AI model integration
- API consumption
- Local AI deployment with Ollama
- Content processing pipelines

## 🚀 Use Cases

- **📰 News Article Summaries**: Quickly digest lengthy news articles
- **📚 Research Papers**: Extract key points from academic content
- **📖 Documentation**: Summarize technical documentation
- **🛍️ Product Reviews**: Condense detailed product information
- **💼 Business Reports**: Extract insights from corporate content

## 💡 Benefits

- **⏰ Time-Saving**: Condenses long pages into a short summary that takes a fraction of the time to read
- **🎯 Focus Enhancement**: Highlights key information and insights
- **📱 Accessibility**: Markdown format works across all platforms
- **🔄 Consistency**: Standardized summary format for all content
- **🤝 Shareability**: Easy to share and collaborate on summaries
- **🔒 Privacy Options**: Local processing with Ollama for sensitive content

---

*This project demonstrates practical application of AI, web scraping, and content processing technologies with both cloud and local deployment options.*

## Environment Setup

In [6]:
import site
!uv pip install selenium beautifulsoup4 webdriver-manager

Using Python 3.12.11 environment at: /Users/daniela_veloz/Workspace/llm_portfolio/.venv
Audited 3 packages in 4ms


In [7]:
# ===========================
# System & Environment
# ===========================
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display

## Model Configuration & Authentication

In [8]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
   raise ValueError("OPENAI_API_KEY not found in environment variables")

print("✅ API key loaded successfully!")

✅ API key loaded successfully!


## Web Scraping Module

In [10]:
from bs4 import BeautifulSoup  # needed by _extract_main_content below
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

class WebUrlCrawler:
    def __init__(self, headless=True, timeout=10):
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def _setup_driver(self):
        chrome_options = Options()
        if self.headless:
            chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--window-size=1920,1080")

        try:
            self.driver = webdriver.Chrome(options=chrome_options)
            self.driver.set_page_load_timeout(self.timeout)
        except WebDriverException as e:
            raise RuntimeError(f"Failed to initialize Chrome driver: {e}") from e

    def _extract_main_content(self, html):
        soup = BeautifulSoup(html, 'html.parser')

        # Remove unwanted elements
        unwanted_tags = ['script', 'style', 'img', 'input', 'button', 'nav', 'footer', 'header']
        for tag in unwanted_tags:
            for element in soup.find_all(tag):
                element.decompose()

        # Try to find main content containers in order of preference
        content_selectors = [
            'main',
            'article',
            '[role="main"]',
            '.content',
            '#content',
            '.main-content',
            '#main-content'
        ]

        for selector in content_selectors:
            content_element = soup.select_one(selector)
            if content_element:
                return content_element.get_text(strip=True, separator='\n')

        # Fallback to body if no main content container found
        body = soup.find('body')
        if body:
            return body.get_text(strip=True, separator='\n')

        return soup.get_text(strip=True, separator='\n')

    def crawl(self, url):
        if not self.driver:
            self._setup_driver()

        try:
            self.driver.get(url)

            WebDriverWait(self.driver, self.timeout).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            html_content = self.driver.page_source
            main_content = self._extract_main_content(html_content)
            return main_content

        except TimeoutException as e:
            raise TimeoutError(f"Timeout while loading {url}") from e
        except WebDriverException as e:
            raise RuntimeError(f"Error crawling {url}: {e}") from e

    def close(self):
        if self.driver:
            self.driver.quit()
            self.driver = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
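
The tag-filtering step in `_extract_main_content` can be tried without a browser. The sketch below is a standard-library approximation of that step only: it uses `html.parser` instead of BeautifulSoup, it does not implement the selector preference order, and the names `VisibleTextExtractor` and `extract_visible_text` are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose text is skipped, mirroring part of the unwanted_tags list above.
# Void tags such as img and input carry no text, so they need no handling here.
SKIPPED_TAGS = {"script", "style", "button", "nav", "footer", "header"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """
<html><body>
  <nav>Home | About</nav>
  <main><h1>Title</h1><p>First paragraph.</p><script>var x = 1;</script></main>
  <footer>Copyright</footer>
</body></html>
"""
print(extract_visible_text(sample))
```

For the sample page, only the heading and the paragraph inside `<main>` survive; the navigation, script, and footer text are dropped.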

In [11]:
from bs4 import BeautifulSoup
import requests

class WebSite:
    def __init__(self, url, title, body, links):
        self.url = url
        self.title = title
        self.body = body
        self.links = links

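class WebUrlCrawler:
    # Note: this requests-based class reuses the name WebUrlCrawler, so it replaces
    # the Selenium-based version from the previous cell; summarize() below uses this one.
    # Some websites only respond properly when the request carries browser-like headers.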
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    }


    def __init__(self, headless=True, timeout=10):
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def crawl(self, url) -> WebSite:
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.string if soup.title else "No title found"

        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            body = soup.body.get_text(strip=True, separator='\n')
        else:
            body = ""

        links = [link.get('href') for link in soup.find_all('a')]
        links = [link for link in links if link]

        return WebSite(url, title, body, links)
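
The `links` list above holds raw `href` values, which may be relative paths or non-web schemes. If absolute URLs are needed later, one possible post-processing step is sketched below. The helper `normalize_links` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url, keep http(s) only, drop duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    "https://example.com/docs/index.html",
    ["/about", "guide.html", "https://example.org/", "mailto:hi@example.com", "/about"],
)
print(links)
```

The original order is preserved, so the first occurrence of each link determines its position in the result.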



## LLM Client

In [18]:
from openai import OpenAI

class LLMClient:
    def __init__(self, model, base_url=None):
        self.model = model
        if base_url:
            # OpenAI-compatible local servers such as Ollama ignore the key,
            # but the client requires a non-empty string.
            self.openai = OpenAI(base_url=base_url, api_key="ollama")
        else:
            # Reads OPENAI_API_KEY from the environment.
            self.openai = OpenAI()

    def generate_text(self, user_prompt, system_prompt="") -> str:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        response = self.openai.chat.completions.create(
            model=self.model,  # use the model this client was configured with
            messages=messages,
        )
        return response.choices[0].message.content
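
`generate_text` always sends a system message, even when `system_prompt` is empty. A possible variation, shown here as a sketch with the hypothetical helper `build_messages`, builds the list so that an empty system prompt is simply left out.

```python
def build_messages(user_prompt, system_prompt=""):
    """Return a chat message list, omitting the system message when it is empty."""
    messages = []
    if system_prompt.strip():
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


print(build_messages("Summarize this page.", "You are a summarizer."))
print(build_messages("Summarize this page."))
```

The returned list has the same shape as the `messages` argument used in `generate_text`, so it could be passed to `chat.completions.create` unchanged.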

## Summarization

In [22]:
def summarize(url, llm_client):
    crawler = WebUrlCrawler()
    website = crawler.crawl(url)

    system_prompt = """You are a web page summarizer that analyzes the content of a provided web page and provides a short and relevant summary. Return your response in markdown."""
    user_prompt = f"""You are looking at the website titled: {website.title}. The content of the website is as follows: {website.body}"""

    print("creating summary ...")

    summary = llm_client.generate_text(system_prompt=system_prompt, user_prompt=user_prompt)
    display(Markdown(summary))
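
`summarize` sends the full page body to the model, which may exceed the context window on very long pages. One way to guard against that is to cap the text before building the prompt. This is a sketch only: `truncate_text` is a hypothetical helper, and the default `max_chars` value is an arbitrary assumption, not a limit taken from either model.

```python
def truncate_text(text, max_chars=20_000, marker="\n[content truncated]"):
    """Return text unchanged if short enough, else cut it and append marker."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Prefer to cut at the last line break so a line is not split in half.
    last_break = cut.rfind("\n")
    if last_break > 0:
        cut = cut[:last_break]
    return cut + marker


print(truncate_text("first line\nsecond line\nthird line", max_chars=15))
```

In `summarize`, the call would wrap `website.body` before it is interpolated into `user_prompt`.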

### Summarization with gpt-4o-mini



In [24]:
model_open_ai = "gpt-4o-mini"
open_ai_llm_client = LLMClient(model=model_open_ai)

#### Example

In [21]:
summarize("https://en.wikipedia.org/wiki/Marie_Curie", open_ai_llm_client)

creating summary ...


# Marie Curie - Summary

Marie Curie (1867-1934) was a Polish-born physicist and chemist who became a naturalized French citizen. She is best known for her pioneering research on radioactivity, which included the discovery of the elements polonium and radium. Curie was the first woman to win a Nobel Prize and remains the only person to have received Nobel Prizes in two different scientific fields (Physics in 1903 and Chemistry in 1911).

Born Maria Salomea Skłodowska in Warsaw, Curie faced significant challenges as a woman in science during her time. Nonetheless, she excelled academically, earning her degrees from the University of Paris while conducting groundbreaking research. She shared her first Nobel Prize with her husband Pierre Curie and Henri Becquerel for their work on radiation phenomena. After Pierre's untimely death in 1906, she became the first female professor at the University of Paris. 

Curie's contributions extended beyond academia; during World War I, she developed mobile radiography units to assist in treating wounded soldiers. Her legacy includes not only advancements in science but also significant impacts on medical treatment through the application of radioactivity.

She passed away from aplastic anemia, likely due to prolonged exposure to radiation during her research. Posthumously, she was recognized for her work and became the first woman interred in the Panthéon in Paris based on her own merits. Marie Curie's name and contributions remain a symbol of women's achievements in science and continue to inspire countless individuals around the world.

### Summarization with gpt-oss:20b

In [25]:
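model_gpt_oss = "gpt-oss:20b"
# Ollama serves an OpenAI-compatible API at this address by default.
gpt_oss_llm_client = LLMClient(model=model_gpt_oss, base_url="http://localhost:11434/v1")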

#### Example

In [26]:
summarize("https://en.wikipedia.org/wiki/Marie_Curie", gpt_oss_llm_client)

creating summary ...


# Summary of Marie Curie - Wikipedia

Marie Curie (1867-1934), born Maria Salomea Skłodowska in Warsaw, was a pioneering Polish-French physicist and chemist best known for her groundbreaking research on radioactivity. She was the first woman to win a Nobel Prize, the only person to win Nobel Prizes in two scientific fields (Physics in 1903 and Chemistry in 1911), and she discovered the elements polonium and radium. 

Curie's early life was marked by hardship, including her mother's death from tuberculosis and societal obstacles for women in education. She moved to Paris in 1891 to pursue her studies, where she eventually married fellow scientist Pierre Curie. Together, they discovered radioactivity and advanced the scientific understanding of atomic structures.

During World War I, she developed mobile X-ray units to assist in treating wounded soldiers, further demonstrating her commitment to humanitarian causes. Curie died from aplastic anemia, likely due to prolonged radiation exposure during her research work. 

Her legacy includes numerous honors and institutions named in her memory. The Curie Institutes in Paris and Warsaw continue her work in medical research. In 1995, she became the first woman to be interred at the Panthéon in Paris for her own merits, celebrating her significant contributions to science and society.