# 🌐 WebPage Summarizer

An intelligent web content summarization tool that extracts and condenses webpage information using advanced AI models.

## 📋 Overview

This project creates concise, structured summaries of web content by leveraging state-of-the-art language models and robust web scraping techniques. Perfect for quickly understanding lengthy articles, blog posts, or documentation.

## ✨ Key Features

- **🤖 Dual AI Models**: Powered by OpenAI's `gpt-4o-mini` and Meta's `llama3.2:3b` (served through Ollama) for high-quality text summarization
- **🕷️ Advanced Web Scraping**: Uses Selenium to handle both static and dynamic JavaScript-rendered websites
- **📝 Markdown Output**: Generates clean, formatted summaries in Markdown for easy reading and sharing
- **🎯 Focused Processing**: Efficiently processes individual webpage URLs without crawling entire sites
- **⚡ Multi-Tool Integration**: Combines multiple libraries for robust and reliable content extraction

## 🛠️ Technology Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **AI Models** | OpenAI GPT-4o-mini, Llama 3.2:3b | Content summarization |
| **Web Scraping** | Selenium WebDriver | Dynamic content extraction |
| **HTML Parsing** | BeautifulSoup | Static content processing |
| **HTTP Requests** | Python Requests | Basic web requests |
| **AI Integration** | OpenAI API, Ollama | Model access and inference |
| **Language** | Python | Core development |

## 🎯 Project Scope

- ✅ **Single URL Processing**: Focuses on individual webpage content
- ✅ **Content Extraction**: Handles both static and dynamic web content
- ✅ **AI Summarization**: Generates intelligent, contextual summaries
- ✅ **Structured Output**: Provides clean Markdown formatting
- ❌ **Site Crawling**: Does not process entire websites or multiple pages

## 🏆 Skill Level

**Beginner-Friendly** - Perfect for developers learning:
- Web scraping fundamentals
- AI model integration
- API consumption
- Content processing pipelines

## 🚀 Use Cases

- **📰 News Article Summaries**: Quickly digest lengthy news articles
- **📚 Research Papers**: Extract key points from academic content
- **📖 Documentation**: Summarize technical documentation
- **🛍️ Product Reviews**: Condense detailed product information
- **💼 Business Reports**: Extract insights from corporate content

## 💡 Benefits

- **⏰ Time-Saving**: Condenses long pages into a short summary, so the key points can be read in a fraction of the time
- **🎯 Focus Enhancement**: Highlights key information and insights
- **📱 Accessibility**: Markdown renders in most editors, notebooks, and documentation tools
- **🔄 Consistency**: Standardized summary format for all content
- **🤝 Shareability**: Easy to share and collaborate on summaries

---

*This project demonstrates practical application of AI, web scraping, and content processing technologies in a real-world scenario.*

## Environment Setup

In [27]:
!uv pip install selenium beautifulsoup4 webdriver-manager

Using Python 3.12.11 environment at: /Users/daniela_veloz/Workspace/llm_portfolio/.venv
Audited 3 packages in 3ms


In [28]:
# ===========================
# System & Environment
# ===========================
import os
from dotenv import load_dotenv

# ===========================
# AI-related
# ===========================
from IPython.display import Markdown, display
from openai import OpenAI
import ollama

## Model Configuration & Authentication

In [29]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

print("✅ API key loaded successfully!")
openai = OpenAI()

✅ API key loaded successfully!


In [30]:
MODEL_OPENAI = "gpt-4o-mini"

## Web Scraping Module

In [31]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
from bs4 import BeautifulSoup


class WebUrlCrawler:
    def __init__(self, headless=True, timeout=10):
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def _setup_driver(self):
        chrome_options = Options()
        if self.headless:
            chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--window-size=1920,1080")

        try:
            self.driver = webdriver.Chrome(options=chrome_options)
            self.driver.set_page_load_timeout(self.timeout)
        except WebDriverException as e:
            # Chain the original error so the full traceback is preserved
            raise RuntimeError(f"Failed to initialize Chrome driver: {e}") from e

    def _extract_main_content(self, html):
        soup = BeautifulSoup(html, 'html.parser')

        # Remove unwanted elements
        unwanted_tags = ['script', 'style', 'img', 'input', 'button', 'nav', 'footer', 'header']
        for tag in unwanted_tags:
            for element in soup.find_all(tag):
                element.decompose()

        # Try to find main content containers in order of preference
        content_selectors = [
            'main',
            'article',
            '[role="main"]',
            '.content',
            '#content',
            '.main-content',
            '#main-content'
        ]

        for selector in content_selectors:
            content_element = soup.select_one(selector)
            if content_element:
                return content_element.get_text(strip=True, separator='\n')

        # Fallback to body if no main content container found
        body = soup.find('body')
        if body:
            return body.get_text(strip=True, separator='\n')

        return soup.get_text(strip=True, separator='\n')

    def crawl(self, url):
        if not self.driver:
            self._setup_driver()

        try:
            self.driver.get(url)

            WebDriverWait(self.driver, self.timeout).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            html_content = self.driver.page_source
            main_content = self._extract_main_content(html_content)
            return main_content

        except TimeoutException as e:
            raise TimeoutError(f"Timeout while loading {url}") from e
        except WebDriverException as e:
            raise RuntimeError(f"Error crawling {url}: {e}") from e

    def close(self):
        if self.driver:
            self.driver.quit()
            self.driver = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
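
The `__enter__` and `__exit__` methods above make the crawler usable in a `with` statement, which releases the browser even when `crawl()` raises. The following is a minimal sketch of that pattern, not part of the original notebook: `crawl_once` and the `Crawler` protocol are hypothetical names introduced for illustration, and the helper accepts any factory that returns an object with the same three methods.

```python
from typing import Callable, Protocol


class Crawler(Protocol):
    """Hypothetical protocol: anything with crawl() plus context-manager hooks."""

    def crawl(self, url: str) -> str: ...
    def __enter__(self) -> "Crawler": ...
    def __exit__(self, exc_type, exc_val, exc_tb) -> None: ...


def crawl_once(crawler_factory: Callable[[], Crawler], url: str) -> str:
    """Crawl a single URL and always clean up the crawler afterwards."""
    # The with-block calls __exit__ (and therefore close()) on success and on error
    with crawler_factory() as crawler:
        return crawler.crawl(url)
```

Passing the Selenium-based class defined in this cell as `crawler_factory` would open Chrome, fetch the page text, and quit the driver in one call.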

In [32]:
from bs4 import BeautifulSoup
import requests

class WebSite:
    def __init__(self, url, title, body, links):
        self.url = url
        self.title = title
        self.body = body
        self.links = links

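class WebUrlCrawler:
    # NOTE: this requests-based class reuses the name of the Selenium crawler
    # defined in the previous cell and therefore replaces it; summarize() below
    # uses this version.

    # Some websites reject requests that lack a browser-like User-Agent header
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    }

    def __init__(self, headless=True, timeout=10):
        # headless and driver are not used by this requests-based version
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def crawl(self, url) -> WebSite:
        # Bound the request time and fail early on HTTP error responses
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')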
        title = soup.title.string if soup.title else "No title found"

        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            body = soup.body.get_text(strip=True, separator='\n')
        else:
            body = ""

        links = [link.get('href') for link in soup.find_all('a')]
        links = [link for link in links if link]

        return WebSite(url, title, body, links)
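
The notebook defines a static (Requests) crawler and a dynamic (Selenium) crawler, but does not show how to choose between them. One possible approach, sketched here under stated assumptions, is to try the cheaper static fetch first and start a browser only when the result looks empty. `fetch_text_with_fallback`, its two callable parameters, and the `min_chars=200` threshold are all hypothetical and not taken from the original code.

```python
from typing import Callable


def fetch_text_with_fallback(
    url: str,
    static_fetch: Callable[[str], str],
    dynamic_fetch: Callable[[str], str],
    min_chars: int = 200,  # arbitrary threshold, tune for your pages
) -> str:
    """Return static text when it looks complete, otherwise use the browser."""
    try:
        text = static_fetch(url)
    except Exception:
        # Any static failure (HTTP error, timeout, ...) falls through to the browser
        text = ""
    if len(text.strip()) >= min_chars:
        return text
    return dynamic_fetch(url)
```

The two callables could wrap the Requests-based `crawl()` (returning its `body`) and the Selenium-based `crawl()`. Because both classes currently share the name `WebUrlCrawler`, one of them would need a distinct name before both can be used side by side.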



## LLM Functions

In [33]:
def generate_prompt_messages(website: WebSite) -> list[dict[str, str]]:
    # Build the system and user messages in the chat-completions format
    system_prompt = """You are a web page summarizer that analyzes the content of a provided web page and provides a short and relevant summary. Return your response in markdown."""
    user_prompt = f"""You are looking at the website titled: {website.title}. The content of the website is as follows: {website.body}. """

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

def invoke_llm(website: WebSite) -> str:
    response = openai.chat.completions.create(
        model=MODEL_OPENAI,
        messages=generate_prompt_messages(website),
    )
    return response.choices[0].message.content
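
`invoke_llm` above only calls the OpenAI model, although `ollama` is imported and `llama3.2:3b` is listed as the second model. A local counterpart might look like the sketch below. `MODEL_OLLAMA`, `invoke_local_llm`, and the `chat_fn` parameter are hypothetical names, and the sketch assumes an Ollama server is running with the `llama3.2:3b` model already pulled.

```python
from typing import Callable, Optional

MODEL_OLLAMA = "llama3.2:3b"  # assumed model tag, taken from the feature list


def invoke_local_llm(
    messages: list[dict[str, str]],
    chat_fn: Optional[Callable[..., dict]] = None,
) -> str:
    """Send chat messages to a local Ollama model and return the reply text."""
    if chat_fn is None:
        # Imported lazily so the function can be exercised without Ollama installed
        import ollama
        chat_fn = ollama.chat
    response = chat_fn(model=MODEL_OLLAMA, messages=messages)
    return response["message"]["content"]
```

The messages list has the same shape as the one returned by `generate_prompt_messages`, so it can be passed in unchanged. The `chat_fn` hook exists only to make the function testable with a stub.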

## Summarization

In [34]:
def summarize(url):
    crawler = WebUrlCrawler()
    site = crawler.crawl(url)

    print("creating summary ...")

    summary = invoke_llm(site)
    display(Markdown(summary))
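
`summarize()` passes the full page body to the model, and a very long page could exceed the model's context window. One simple guard, sketched here as an assumption and not as part of the original pipeline, is to cap the text before building the prompt. `truncate_for_prompt` and the 20,000-character default are hypothetical; a token-based limit would be more precise.

```python
def truncate_for_prompt(text: str, max_chars: int = 20_000) -> str:
    """Keep at most max_chars characters of text, then append a marker."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Prefer cutting at the last line break so no line is split in half
    last_break = clipped.rfind("\n")
    if last_break > 0:
        clipped = clipped[:last_break]
    return clipped + "\n[content truncated]"
```

It could be applied to the crawled body before the prompt messages are generated.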

In [35]:
summarize("https://en.wikipedia.org/wiki/Marie_Curie")

creating summary ...


# Summary of Marie Curie - Wikipedia

Marie Curie (1867–1934), born Maria Salomea Skłodowska, was a pioneering Polish-French physicist and chemist known for her groundbreaking research on radioactivity. She became the first woman to win a Nobel Prize in 1903 in Physics, shared with her husband Pierre Curie and Henri Becquerel. Additionally, she won a second Nobel Prize in Chemistry in 1911 for her discovery of the elements polonium and radium, making her the first person to win Nobel Prizes in two different scientific fields.

Curie was notable for her work in isolating radioactive isotopes and her development of mobile X-ray machines during World War I, which significantly contributed to medical practices in battlefield conditions. Born in Warsaw, she faced numerous obstacles due to her gender, yet she persevered, eventually becoming the first female professor at the University of Paris. 

Her health deteriorated due to prolonged exposure to radiation, leading to her death from aplastic anemia in 1934. Curie's legacy includes the continuing influence of her research in the fields of physics and medicine, several medical institutes named in her honor, and the establishment of the curie as a unit of radioactivity. She remains an iconic figure in science and a symbol of women's contributions to the field.

## Key Points:
- **Birth:** November 7, 1867, Warsaw, Poland
- **Death:** July 4, 1934, Passy, France
- **Nobel Prizes:** Physics (1903), Chemistry (1911)
- **Notable Discoveries:** Polonium, Radium
- **Legacy:** Curie unit of radioactivity, Curie Institutes, numerous honors and memorials worldwide.