# Web Scraper & Summarizer

A tiny demo that fetches text from a public webpage, breaks it into chunks, and uses an OpenAI model to produce a concise summary with bullet points.

**Features**

* Fetches static pages (`requests` + `BeautifulSoup`) and extracts headings, paragraphs, and list items.
* Hierarchical summarization: chunk → chunk-summaries → final summary.
* Configurable prompts and character-based chunking (3,000 characters per chunk by default) to keep each request within model limits.
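
The chunk → chunk-summaries → final summary flow can be sketched without any API calls. This is a minimal illustration, not the notebook's implementation: `split_fixed`, `summarize_hierarchically`, and the stand-in `first_words` summarizer are hypothetical names introduced here, and the splitter is a naive fixed-width one.

```python
from typing import Callable, List


def split_fixed(text: str, max_chars: int) -> List[str]:
    """Naive fixed-width splitter (stand-in for boundary-aware chunking)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def summarize_hierarchically(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 3000,
) -> str:
    """Map: summarize each chunk. Reduce: summarize the joined chunk summaries."""
    chunks = split_fixed(text, max_chars)
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]  # a single chunk needs no combining step
    return summarize("\n\n---\n\n".join(partials))


def first_words(text: str, n: int = 5) -> str:
    """Hypothetical stand-in for a model call: keep the first n words."""
    return " ".join(text.split()[:n])


demo = "alpha beta gamma delta epsilon zeta eta theta iota kappa " * 3
print(summarize_hierarchically(demo, first_words, max_chars=60))
```

Because the summarizer is passed in as a parameter, swapping `first_words` for a function that calls a model leaves the control flow unchanged.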

**Quick run**

1. Add `OPENAI_API_KEY=sk-...` to a `.env` file.
2. `pip install requests beautifulsoup4 python-dotenv openai`
3. Set `url` to the page you want, then run the script/notebook.

**Note**: Use for public/static pages; JS-heavy sites need Playwright/Selenium.


In [8]:
%pip install requests beautifulsoup4 python-dotenv openai

Note: you may need to restart the kernel to use updated packages.


In [9]:
from dotenv import load_dotenv
import os
import openai

load_dotenv()  # loads variables from .env into the environment
openai.api_key = os.getenv("OPENAI_API_KEY")

if not openai.api_key:
    raise ValueError("OPENAI_API_KEY not found. Please create a .env file with OPENAI_API_KEY=<your_key>")
else:
    print("API Key prefix:", openai.api_key[:10])  # show only prefix for safety

API Key prefix: sk-proj-lL


In [10]:
# fetch_text_from_url: extract text from common tags (headings, paragraphs, list items) on a static page.
import requests
from bs4 import BeautifulSoup

def fetch_text_from_url(url, max_items=300, timeout=15):
    """
    Fetch the page using requests and extract text from common tags.
    Returns a single string containing the joined text blocks.
    """
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    items = []
    for tag in soup.find_all(["h1", "h2", "h3", "p", "li"], limit=max_items):
        text = tag.get_text(" ", strip=True)
        if text:
            items.append(text)
    return "\n\n".join(items)
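
To check tag-based extraction offline, without a network request or third-party packages, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique rather than the BeautifulSoup approach used above, and `BlockTextParser` and `extract_blocks` are hypothetical names. Two simplifications are assumed: nested target tags (for example a `<p>` inside an `<li>`) are merged into one block, and every target tag is expected to have an explicit closing tag.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"h1", "h2", "h3", "p", "li"}


class BlockTextParser(HTMLParser):
    """Collect the text inside h1/h2/h3/p/li elements, one entry per outermost element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0    # number of target tags currently open
        self._buffer = []  # text fragments of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces
                text = " ".join(" ".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:  # ignore text outside target tags (e.g. scripts)
            self._buffer.append(data)


def extract_blocks(html: str) -> str:
    """Return the extracted blocks joined by blank lines."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = (
    "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p>"
    "<ul><li>Item one</li><li>Item two</li></ul>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_blocks(sample))
```

For real pages, BeautifulSoup is more forgiving of malformed markup, so this sketch is best kept for quick tests on known-good HTML.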

In [12]:
# chunk_text: split long text into manageable pieces
# summarize_chunk: call an OpenAI model to summarize one chunk
# hierarchical_summarize: summarize chunks then combine summaries into a final summary

import time

def chunk_text(text, max_chars=3000):
    """
    Simple character-based chunking.
    Try to cut at paragraph or sentence boundaries when possible.
    """
    chunks = []
    start = 0
    text_len = len(text)
    while start < text_len:
        end = start + max_chars
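        if end < text_len:
            # Prefer to cut at a blank line, then at a sentence end
            cut = text.rfind("\n\n", start, end)
            if cut <= start:
                # No blank line after the chunk start; look for a sentence end
                cut = text.rfind(". ", start, end)
                if cut != -1:
                    cut += 1  # keep the period with its sentence
            if cut <= start:
                cut = end  # no usable boundary: hard cut, which also guarantees progress
            end = cut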
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks

def summarize_chunk(chunk, system_prompt=None, model="gpt-4o-mini", temperature=0.2):
    """
    Summarize a single chunk using the OpenAI chat completions API.
    Returns the model's text output.
    """
    if system_prompt is None:
        system_prompt = "You are a concise summarizer. Produce a short (~100 words) summary and 3 bullet points."

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Summarize the following text concisely. Keep it short.\n\nTEXT:\n{chunk}"}
    ]

    resp = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return resp.choices[0].message.content

def hierarchical_summarize(text, max_chunk_chars=3000, model="gpt-4o-mini"):
    """
    1) Split the text into chunks
    2) Summarize each chunk
    3) Combine chunk summaries and ask the model for a final concise summary
    """
    chunks = chunk_text(text, max_chars=max_chunk_chars)
    print(f"[info] {len(chunks)} chunk(s) created.")
    chunk_summaries = []
    for i, c in enumerate(chunks, 1):
        print(f"[info] Summarizing chunk {i}/{len(chunks)} (chars={len(c)})...")
        s = summarize_chunk(c, model=model)
        chunk_summaries.append(s)
        time.sleep(0.5)  # small delay to avoid hitting rate limits

    if len(chunk_summaries) == 1:
        return chunk_summaries[0]

    combined = "\n\n---\n\n".join(chunk_summaries)
    final_prompt = "You are a concise summarizer. Combine the following chunk summaries into one final summary of about 150 words and 5 bullet points."
    final_messages = [
        {"role": "system", "content": final_prompt},
        {"role": "user", "content": combined}
    ]
    resp = openai.chat.completions.create(
        model=model,
        messages=final_messages,
        temperature=0.2,
    )
    return resp.choices[0].message.content
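
The fixed 0.5 s pause between chunks lowers the request rate but does not recover from a request that is rejected anyway. A common complement is to retry with exponential backoff. The helper below is a minimal sketch under stated assumptions: `call_with_backoff` and its parameters are hypothetical, and which exception types to retry on depends on the client library (with the 1.x `openai` package, `openai.RateLimitError` is the likely candidate, but treat that as an assumption to verify against the installed version).

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a retryable error, wait 1x, 2x, 4x... base_delay plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from concurrent callers
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Demo with a function that fails twice, then succeeds (no real waiting)
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


print(call_with_backoff(flaky, retry_on=(TimeoutError,), sleep=lambda s: None))
```

In this notebook, a call such as `summarize_chunk(c, model=model)` could be wrapped as `call_with_backoff(lambda: summarize_chunk(c, model=model), retry_on=...)`, with `retry_on` set to whatever rate-limit exception the client raises.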


In [13]:
# Change the URL to any static (non-JS-heavy) page you want to test.
if __name__ == "__main__":
    url = "https://www.basketball-reference.com/"  # replace with your chosen URL
    print("[info] Fetching page:", url)
    page_text = fetch_text_from_url(url, max_items=300)
    print("[info] Fetched text length:", len(page_text))

    print("[info] Running hierarchical summarization...")
    final_summary = hierarchical_summarize(page_text, max_chunk_chars=2500)
    print("\n\n=== FINAL SUMMARY ===\n")
    print(final_summary)

[info] Fetching page: https://www.basketball-reference.com/
[info] Fetched text length: 11778
[info] Running hierarchical summarization...
[info] 5 chunk(s) created.
[info] Summarizing chunk 1/5 (chars=2430)...
[info] Summarizing chunk 2/5 (chars=2460)...
[info] Summarizing chunk 3/5 (chars=2426)...
[info] Summarizing chunk 4/5 (chars=2467)...
[info] Summarizing chunk 5/5 (chars=1987)...


=== FINAL SUMMARY ===

Sports Reference is a comprehensive platform for sports statistics and history, particularly focusing on basketball, baseball, football, hockey, and soccer. It offers tools like Stathead for advanced data analysis and the Immaculate Grid for interactive gameplay. Users can access player stats, team standings, and historical records without ads. 

- Extensive stats available for NBA, WNBA, G League, and international leagues.
- Daily recaps of NBA and WNBA performances delivered via email.
- Stathead Basketball provides in-depth stats with a free first month for new subscribers.