# 🧠 Web Content Summarizer

This notebook extracts and summarizes text content from websites using OpenAI-compatible APIs (Groq or Ollama).

## Features:
- Scrapes page content with headless Chrome via `Selenium` and parses it with `BeautifulSoup`
- Generates summaries through OpenAI-compatible backends: hosted models on `Groq` or local models served by `Ollama`
- Supports custom URLs


In [None]:
import os
import requests
import json
import socket
import ipaddress
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from IPython.display import display, Markdown, update_display
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import WebDriverException
from webdriver_manager.chrome import ChromeDriverManager

## 🔧 1. Load API Keys
We load the Groq API key from a `.env` file so it never has to be hard-coded in the notebook. The local Ollama backend does not need a real key.


In [None]:
load_dotenv(override=True)
api_key  = os.getenv("GROQ_API_KEY")

if not api_key:
    # Raise instead of calling exit(), which can stop the notebook kernel
    raise RuntimeError("No API key found. Please set GROQ_API_KEY in your .env file.")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")



API key found and looks good so far!


In [None]:
# Groq client (hosted) and Ollama client (local); both expose the OpenAI-compatible API
openai = OpenAI(api_key=api_key, base_url="https://api.groq.com/openai/v1")
ollama_with_openai = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
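
Because both clients speak the same OpenAI-style protocol, switching backends only means changing the base URL, key, and model name. A minimal sketch of that idea follows. The `BACKENDS` table and the `backend_settings` helper are hypothetical names, not part of this notebook, and the Ollama model name `llama3.2` is only a placeholder for whatever model is pulled locally.

```python
import os

# Hypothetical lookup table: the URLs are the ones used above; the Ollama
# model name is a placeholder for a model you have pulled locally.
BACKENDS = {
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "api_key_env": "GROQ_API_KEY",
        "model": "llama3-70b-8192",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "api_key_env": None,  # no real key needed; a dummy value is passed
        "model": "llama3.2",
    },
}


def backend_settings(name):
    """Return (keyword arguments for OpenAI(...), model name) for a backend."""
    if name not in BACKENDS:
        raise ValueError(f"Unknown backend: {name!r}")
    cfg = BACKENDS[name]
    env_var = cfg["api_key_env"]
    api_key = os.getenv(env_var, "") if env_var else "ollama"
    return {"api_key": api_key, "base_url": cfg["base_url"]}, cfg["model"]


client_kwargs, model = backend_settings("ollama")
print(client_kwargs["base_url"], model)
```

With this in place, `OpenAI(**client_kwargs)` would build the client for the chosen backend.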

## 🌐 2. Website Content Extractor
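The helper `is_safe_url` rejects non-HTTP(S) URLs and hosts that resolve to private, loopback, reserved, or link-local addresses. The `Web_Scapper` class then loads the page with headless Chrome via Selenium and extracts its title, visible text, and links.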


In [None]:
def is_safe_url(url):
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ["http", "https"] or parsed.netloc == "":
            return False

        host = parsed.hostname
        ip = ipaddress.ip_address(socket.gethostbyname(host))
        if ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_link_local:
            return False
    except Exception:
        return False
    return True


class Web_Scapper:
    """
    Represents a web page scraped with headless Chrome via Selenium,
    exposing its title, visible text, and links.
    """

    def __init__(self, url):
        if not is_safe_url(url):
            raise ValueError("Invalid or unsafe URL")

        self.url = url
        self.title = "No title found"
        self.text = ""
        self.links = []

        # Setup Selenium
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-extensions")
        options.add_argument("--disable-blink-features=AutomationControlled")

        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

        try:
            driver.set_page_load_timeout(20)
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, "html.parser")

            # Get title
            # soup.title.string is None when <title> is empty or contains nested tags
            if soup.title and soup.title.string:
                self.title = soup.title.string.strip()

            # Remove irrelevant tags
            if soup.body:
                for tag in soup.body(["script", "style", "img", "input"]):
                    tag.decompose()
                self.text = soup.body.get_text(separator="\n", strip=True)

            # Collect every non-empty href (no validation or de-duplication)
            self.links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

        except WebDriverException as e:
            print(f"Error loading page with Selenium: {e}")
        finally:
            driver.quit()

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\n\nWebpage Contents:\n{self.text}\n\n"
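
The address checks inside `is_safe_url` can be tried in isolation, without any DNS lookup. The sketch below uses a hypothetical helper name, `is_public_ip`, and applies the same four `ipaddress` flags to literal addresses.

```python
import ipaddress


def is_public_ip(address):
    """Return True when a literal IP is not private, loopback, reserved, or link-local."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not a valid IPv4/IPv6 literal
    return not (ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_link_local)


for candidate in ["8.8.8.8", "127.0.0.1", "10.0.0.5", "169.254.169.254", "not-an-ip"]:
    print(candidate, is_public_ip(candidate))
```

Only `8.8.8.8` passes. The link-local address `169.254.169.254` is worth blocking in particular, since cloud providers commonly use it for instance metadata.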


In [None]:
website = Web_Scapper("https://edwarddonner.com")
print(website.get_contents())
# website.links  # uncomment to inspect the extracted links


## 📝 3. Prompt Engineering and Summarization
This section defines the prompts that instruct the large language model (LLM) and the core functions that summarize a website with the selected backend (Groq or Ollama).


In [None]:
system_prompt = "You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown."

def user_prompt_for(website):
    # Combine the page title and the scraped text into a single user message
    return (
        f"- You are looking at a website titled: '{website.title}'.\n\n"
        "- The contents of this website are as follows; please provide a short summary "
        "of this website in markdown. If it includes news or announcements, then summarize these too.\n"
        f"{website.text}"
    )
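
Scraped pages can be far longer than a model's context window. Below is a minimal sketch of assembling the chat messages with a character cap. The helper name `build_messages` and the default `max_chars` of 8000 are illustrative assumptions, not values from this notebook.

```python
SYSTEM_PROMPT = (
    "You are an assistant that analyzes the contents of a website and provides "
    "a short summary, ignoring text that might be navigation related. Respond in markdown."
)


def build_messages(title, text, max_chars=8000):
    """Build the system/user message list, truncating page text to max_chars characters."""
    if len(text) > max_chars:
        # Assumed policy: keep the start of the page and flag the cut
        text = text[:max_chars] + "\n[content truncated]"
    user_prompt = (
        f"You are looking at a website titled: '{title}'.\n\n"
        "The contents of this website are as follows; "
        "please provide a short summary of this website in markdown.\n"
        f"{text}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Example Page", "word " * 5000, max_chars=100)
print(messages[1]["content"])
```

A character cap is only a rough proxy for tokens, but it keeps the request bounded without adding a tokenizer dependency.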


In [None]:
# Call the Groq-hosted model through the OpenAI-compatible client.
def summarize(url):
    website = Web_Scapper(url)
    response = openai.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt_for(website)},
        ],
    )
    return response.choices[0].message.content


# A function to display this nicely in the Jupyter output, using markdown
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))
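
The `summarize` function above is tied to the Groq client. One possible way to make the backend swappable is to pass the client and model name in as arguments. The sketch below assumes a hypothetical function name, `summarize_text`, and uses a small fake client (`FakeCompletions`, also hypothetical) so the example runs offline.

```python
from types import SimpleNamespace


def summarize_text(client, model, system_prompt, user_prompt):
    """Send one system + user exchange to any OpenAI-style client and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


class FakeCompletions:
    """Stand-in for client.chat.completions so the example needs no network."""

    def create(self, model, messages):
        reply = f"[{model}] received {len(messages)} messages"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
print(summarize_text(fake_client, "demo-model", "Summarize briefly.", "Some page text"))
```

Passing `ollama_with_openai` together with the name of a locally pulled model in place of the fake client would route the same call to Ollama.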


In [None]:
display_summary("https://www.gopalinfo.com/")

**Gopal Info: Your Partner in Digital Growth**
=============================================

Gopal Info is a digital solutions provider that offers expert services in:

* **Graphic Designing**: Crafting eye-catching designs to elevate your brand
* **UI/UX Design**: Creating intuitive and engaging user experiences
* **Branding Consultation**: Building strong, memorable brands
* **Web Designing**: Creating stunning websites that capture your brand's essence
* **Web Development**: Building robust and scalable web solutions
* **Search Engine Optimization**: Boosting online visibility with proven SEO strategies
* **Social Media Marketing**: Driving engagement and growth with targeted social media strategies
* **Google Ads**: Maximizing reach and conversions with targeted Google Ads campaigns
* **Mobile App Development**: Developing innovative mobile apps that enhance user experience

The company follows an agile methodology, prioritizing collaboration, creativity, and quality at every step of the project. With a diverse range of advanced technologies, Gopal Info aims to empower businesses with innovative digital solutions to drive growth and success.

In [None]:
display_summary("https://techcrunch.com")

# TechCrunch: Startup and Technology News
=============================

**Latest News**
---------------

* European VC breaks taboo by investing in pure defense tech from Ukraine’s war zones
* LangChain is about to become a unicorn, sources say
* GenAI as a shopping assistant set to explode during Prime Day sales
* Apple COO Jeff Williams to step down later this month
* SpaceX in talks to raise new funding at $400B valuation

**Top Headlines**
----------------

* Grok is being antisemitic again and also the sky is blue
* Hugging Face opens up orders for its Reachy Mini desktop robots
* Rivian spinoff Also raises another $200M to build e-bikes and more
* In a blow to Google Cloud, Replit partners with Microsoft
* European VC breaks taboo by investing in pure defense tech from Ukraine’s war zones

**In Brief**
-------------

* Rivian spinoff Also raises another $200M to build e-bikes and more
* US government confirms arrest of Chinese national accused of stealing COVID research and mass-hacking email servers
* SpaceX in talks to raise new funding at $400B valuation
* Mistral is reportedly in talks to raise $1B
* ByteDance reportedly plans to release US-specific version of CapCut

**Upcoming Events**
------------------

* TechCrunch All Stage 2025 (July 15, 2025)
* TechCrunch Disrupt 2025 (October 27-29, 2025)
* StrictlyVC Palo Alto (December 3, 2025)