# Website Summarizer with Playwright, OpenAI & Ollama

This notebook demonstrates an end-to-end pipeline for **scraping** a website and **summarizing** its content using a large language model. It uses:

- **Playwright**: a browser automation library that drives a real (here headless) Chromium browser and fully renders JavaScript, making it suitable for modern single-page applications (React, Vue, Angular, etc.) where plain HTTP clients like `requests` would only see an empty shell.
- **OpenAI GPT-4.1-mini**: a cloud-hosted model for high-quality summaries.
- **Ollama (Llama 3.2)**: a locally running open-source alternative that keeps your data on your machine at zero API cost.

## Installation and Setup

```sh
uv add playwright nest_asyncio openai python-dotenv
playwright install
```

Since Jupyter already runs an event loop, we patch it with `nest_asyncio` so Playwright's async API can work inside notebook cells:

```python
import nest_asyncio
nest_asyncio.apply()
```
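
To see why the patch is needed, here is a minimal sketch using only the standard library. The coroutine names `inner` and `outer` are illustrative and not part of the notebook. It shows what happens when `asyncio.run` is called while a loop is already running, which is the situation inside a Jupyter cell:

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, just like code in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is refused by the unpatched asyncio
    except RuntimeError as exc:
        coro.close()  # tidy up so Python does not warn about an un-awaited coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, the nested call is allowed instead of raising, which is what lets a blocking wrapper around async Playwright code work in a notebook.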



In [None]:
import os
import asyncio
from typing import TypedDict
from dotenv import load_dotenv
from IPython.display import Markdown, display
from openai import OpenAI
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError

import nest_asyncio
nest_asyncio.apply()  # patches the running loop to allow nesting


## Connecting to OpenAI

The next cell loads the environment variables from your `.env` file and sanity-checks the OpenAI API key. The client itself is created later, just before the first API call.

In [None]:
# Load environment variables in a file called .env

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key
# The whitespace check comes before the prefix check: a leading space would
# otherwise trigger the misleading "doesn't start sk-proj-" message.

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

## Fetching Website Content with Playwright

Below we define the core scraping logic. The `scrape_webpage` async function launches a headless Chromium browser, navigates to the target URL, waits for the page to finish loading, then extracts the visible text content — stripping out scripts, styles, and other non-content elements. A synchronous `scrape` wrapper is also provided for convenience.
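
The whitespace cleanup step can be tried in isolation, without launching a browser. The sketch below applies the same strip-and-filter logic to a made-up `innerText` string. The helper name `normalize_whitespace` and the sample text are illustrative assumptions, not part of the notebook:

```python
from typing import TypedDict


class PageData(TypedDict):
    url: str
    title: str
    content: str


def normalize_whitespace(raw_text: str) -> str:
    """Strip each line and drop lines that are empty after stripping."""
    lines = [line.strip() for line in raw_text.splitlines()]
    return "\n".join(line for line in lines if line)


# Made-up text resembling what innerText can return: padded lines and blank runs
raw = "  Welcome  \n\n\n   \n\tLatest news\n  Contact us  \n"

page = PageData(
    url="https://example.com",
    title=" Example ".strip(),
    content=normalize_whitespace(raw),
)

print(page["content"])
print(type(page))  # a TypedDict is a plain dict at runtime
```

Only leading and trailing whitespace on each line is removed. Runs of spaces inside a line are kept as they are.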

In [None]:
class PageData(TypedDict):
    """Structured result returned by the scraper."""
    url: str
    title: str
    content: str


async def scrape_webpage(url: str, wait_for: str = "networkidle", timeout: int = 30000) -> PageData:
    """
    Scrape the title and text content of a webpage using Playwright.

    Playwright launches a real browser (Chromium by default), which fully executes
    JavaScript — making it ideal for React, Vue, Angular, and other JS-heavy sites
    that plain HTTP clients like `requests` cannot render.

    Args:
        url (str): The full URL of the webpage to scrape
                   (e.g. "https://example.com").
        wait_for (str): The condition to wait for before extracting content.
                        Options:
                          - "networkidle"      → waits until network has been idle
                                                 for 500ms (best for SPAs). [default]
                          - "domcontentloaded" → waits for the HTML to be parsed.
                          - "load"             → waits for the load event.
                          - "commit"           → waits until the response starts arriving.
        timeout (int): Maximum time in milliseconds to wait for the page to load.
                       Defaults to 30000 (30 seconds).

    Returns:
        PageData: A TypedDict containing:
            - ``url``     (str): The URL that was scraped.
            - ``title``   (str): The page's <title> tag value, or an empty string
                                 if no title was found.
            - ``content`` (str): The visible text content of the fully rendered
                                 page, with whitespace normalized.

    Raises:
        PlaywrightTimeoutError: If the page does not load within the timeout period.
        Exception: For any other browser or network related errors.

    Example:
        >>> import asyncio
        >>> data = asyncio.run(scrape_webpage("https://news.ycombinator.com"))
        >>> print(data["title"])
        >>> print(data["content"][:500])
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            context = await browser.new_context(
                user_agent=(
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/120.0.0.0 Safari/537.36"
                )
            )

            page = await context.new_page()
            await page.goto(url, wait_until=wait_for, timeout=timeout)

            # Extract both title and body text in a single evaluate call
            # to avoid multiple round trips to the browser
            result = await page.evaluate("""
                () => {
                    const title = document.title ?? "";

                    // Remove non-visible / non-content elements
                    const tagsToRemove = ['script', 'style', 'noscript', 'svg', 'img'];
                    tagsToRemove.forEach(tag => {
                        document.querySelectorAll(tag).forEach(el => el.remove());
                    });

                    const content = document.body?.innerText ?? "";

                    return { title, content };
                }
            """)

            # Normalize excessive whitespace and blank lines
            lines = [line.strip() for line in result["content"].splitlines()]
            cleaned_content = "\n".join(line for line in lines if line)

            return PageData(
                url=url,
                title=result["title"].strip(),
                content=cleaned_content,
            )

        except PlaywrightTimeoutError as exc:
            # Re-raise with a friendlier message, chaining the original error for debugging
            raise PlaywrightTimeoutError(
                f"Page '{url}' did not finish loading within {timeout}ms. "
                "Try increasing the timeout or using a different wait_for strategy."
            ) from exc
        finally:
            await browser.close()


def scrape(url: str, wait_for: str = "networkidle", timeout: int = 30000) -> PageData:
    """
    Synchronous wrapper around :func:`scrape_webpage`.

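    Useful when you want a simple blocking one-liner call. Note that
    ``asyncio.run`` normally refuses to start inside an already-running
    event loop (as in Jupyter); it works here only because
    ``nest_asyncio.apply()`` was called earlier in the notebook.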

    Args:
        url (str): The full URL of the webpage to scrape.
        wait_for (str): Load condition — see :func:`scrape_webpage` for options.
        timeout (int): Timeout in milliseconds. Defaults to 30000.

    Returns:
        PageData: A TypedDict containing ``url``, ``title``, and ``content``.
                  See :func:`scrape_webpage` for full field descriptions.

    Example:
        >>> data = scrape("https://news.ycombinator.com")
        >>> print(data["title"])
        >>> print(data["content"][:300])
    """
    return asyncio.run(scrape_webpage(url, wait_for=wait_for, timeout=timeout))

### Test Run — Scraping a Page

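Let's try the scraper on an OpenAI blog post. We use `wait_for="domcontentloaded"`, which returns as soon as the HTML has been parsed rather than waiting for the network to go quiet, and a generous timeout since some pages load heavy assets. On heavily client-rendered pages this strategy may capture less text than `networkidle`. The output shows the page title, URL, the first 5,000 characters of visible text, and the total character count.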

In [None]:
url = "https://openai.com/index/introducing-openai-frontier/"
# url = "https://openai.com"
print(f"Scraping: {url}\n{'=' * 50}")
data = scrape(url, wait_for="domcontentloaded", timeout=120000)  # Increase timeout for slower pages

print(f"Title   : {data['title']}")
print(f"URL     : {data['url']}")
print(f"{'=' * 50}")
print(data["content"][:5000])
print(f"\n{'=' * 50}")
print(f"Total characters scraped: {len(data['content'])}")

## Summarizing Webpage Content with OpenAI (GPT-4.1-mini)

Now that we can scrape any page, we need to feed that content to a language model. This requires two prompt templates:

- A **system prompt** that tells the model to act as a website analyst and respond in Markdown.
- A **user prompt** that injects the scraped content and asks for a summary.

### Defining the Prompts

In [None]:
# Define our system prompt - you can experiment with this later, changing the last sentence to "Respond in markdown in Spanish."

system_prompt = """
You are a helpful assistant that analyzes the contents of a website,
and provides a short summary, ignoring text that might be navigation related.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""


# Define our user prompt

user_prompt_prefix = """
Here are the contents of a website.
Provide a short summary of this website.
If it includes news or announcements, then summarize these too.

Website content:
{content}
"""

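# Preview the message structure. The {content} placeholder is still unfilled here,
# so this list is illustrative only; build_messages() fills it in before any API call.

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt_prefix},
]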

### Building the Messages List

The `build_messages` helper formats the system and user prompts into the message list that the OpenAI API expects, injecting the scraped content into the user prompt via string formatting.
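
Here is a minimal offline sketch of that formatting step. The prompt strings are shortened stand-ins, and `demo_build_messages` is a hypothetical name chosen to avoid shadowing the notebook's helper. It also shows that `str.format` only interprets braces in the template, so braces inside the scraped text pass through untouched:

```python
demo_system_prompt = "You summarize websites. Respond in markdown."
demo_user_template = "Summarize this website.\n\nWebsite content:\n{content}\n"


def demo_build_messages(content: str) -> list[dict]:
    """Return a system + user message pair with the content injected."""
    return [
        {"role": "system", "content": demo_system_prompt},
        {"role": "user", "content": demo_user_template.format(content=content)},
    ]


# Scraped text may contain literal braces (e.g. code snippets or JSON).
scraped = 'Docs page with JSON: {"key": "value"}'
demo_messages = demo_build_messages(scraped)

for message in demo_messages:
    print(message["role"], "->", repr(message["content"]))
```

The reverse does not hold: a literal `{` or `}` added to the template itself would need to be doubled (`{{`, `}}`) for `.format` to leave it alone.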

In [None]:
def build_messages(content: str) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_prefix.format(content=content)},
    ]


Preview the formatted messages to verify the prompt structure before sending them to the API.

In [None]:
build_messages(data["content"])

### Putting It All Together

The `summarize` function chains the entire pipeline: scrape a URL, build the prompt messages, call the OpenAI Responses API with GPT-4.1-mini, and return the summary text. The `display_summary` wrapper renders the result as formatted Markdown in the notebook output.
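
The chain can also be exercised offline. The sketch below swaps in stand-ins for the scraper and the client, so nothing here contacts OpenAI or launches a browser. `fake_scrape`, `FakeResponses`, `fake_client`, and `demo_summarize` are hypothetical names, and the fake only mimics the `responses.create(model=..., input=...)` call shape and the `output_text` attribute used in this notebook:

```python
from types import SimpleNamespace


def fake_scrape(url: str) -> dict:
    """Stand-in for scrape(): returns canned page data instead of launching a browser."""
    return {
        "url": url,
        "title": "Example",
        "content": "Example Domain\nThis domain is for use in examples.",
    }


class FakeResponses:
    """Stand-in that mimics only the call shape used in the notebook."""

    def create(self, model: str, input: list[dict]) -> SimpleNamespace:
        user_text = input[-1]["content"]
        return SimpleNamespace(output_text=f"[{model}] summary of {len(user_text)} characters")


fake_client = SimpleNamespace(responses=FakeResponses())


def demo_summarize(url: str, scrape_fn=fake_scrape, client=fake_client) -> str:
    # 1. Scrape the page
    page = scrape_fn(url)
    # 2. Build the messages
    messages = [
        {"role": "system", "content": "Summarize the page in markdown."},
        {"role": "user", "content": f"Website content:\n{page['content']}"},
    ]
    # 3. Call the model
    response = client.responses.create(model="gpt-4.1-mini", input=messages)
    # 4. Return the summary text
    return response.output_text


demo_summary = demo_summarize("https://example.com")
print(demo_summary)
```

Passing the scraper and client in as parameters is one possible design choice. It makes each stage easy to replace, for example when switching to a different model backend.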

In [None]:
# And now: call the OpenAI API using the responses api
client = OpenAI()

def summarize(url):
    website_content = scrape(url, wait_for="domcontentloaded", timeout=120000)
    response = client.responses.create(
        model="gpt-4.1-mini",
        input=build_messages(website_content["content"])
    )
    
    return response.output_text

# A function to display this nicely in the output, using markdown
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

### Demo — Summarizing Different Websites

Let's run the full pipeline (scrape → build messages → call OpenAI → display Markdown) on a few different sites to see how well the summarizer handles varying content types — a blog post, a company homepage, and an educational platform.

In [None]:
url = "https://openai.com/index/introducing-openai-frontier/"
display_summary(url)

In [None]:
url = "https://anthropic.com"
display_summary(url)


In [None]:
url = "https://www.codingforentrepreneurs.com/"
display_summary(url)

## Summarizing a Webpage with a Local Open-Source Model via Ollama

### Benefits:

1. No API charges: the model is open-source and runs locally
2. Your data doesn't leave your machine

### Disadvantages:

1. Significantly less capable than a frontier model

### Recap on installation of Ollama

Simply visit [ollama](https://ollama.com) and install!

Once installation is complete, the Ollama server should already be running locally.
If you visit http://localhost:11434/ you should see the message `Ollama is running`.

If not, open a new Terminal (Mac) or PowerShell (Windows) and enter `ollama serve`.

Then, in another Terminal (Mac) or PowerShell (Windows), enter `ollama pull llama3.2`.

Then try http://localhost:11434/ again.

If Ollama is slow on your machine, try `llama3.2:1b` as an alternative: run `ollama pull llama3.2:1b` from a Terminal or PowerShell, and change `MODEL_LLAMA = 'llama3.2'` to `MODEL_LLAMA = 'llama3.2:1b'` in the code below.
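
The browser check above can also be done from Python. This is a minimal sketch using only the standard library. The helper name `ollama_is_running` is hypothetical, and it assumes the server answers at the root URL with the `Ollama is running` message described above:

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434/", timeout: float = 2.0) -> bool:
    """Return True if the root URL responds with the 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False
    return "Ollama is running" in body


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start the server with `ollama serve` as described above and run the check again.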

### Redefining the Pipeline for Ollama

Below we create a new `OpenAI` client that points at Ollama's local OpenAI-compatible endpoint (`localhost:11434/v1`). The `summarize` and `display_summary` functions are redefined to use `chat.completions.create` with the `llama3.2` model instead of the OpenAI Responses API.

In [None]:
# Point the OpenAI client at Ollama's local OpenAI-compatible endpoint and use the Chat Completions API
OLLAMA_BASE_URL = "http://localhost:11434/v1"
MODEL_LLAMA = 'llama3.2'
ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')


def summarize(url):
    website_content = scrape(url, wait_for="domcontentloaded", timeout=120000)
    response = ollama.chat.completions.create(
        model = MODEL_LLAMA,
        messages = build_messages(website_content["content"])
    )
    return response.choices[0].message.content

# A function to display this nicely in the output, using markdown
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

### Demo — Summarizing with Ollama

The same scrape → summarize → display pipeline, now running entirely on your local machine via Ollama. Compare these results with the OpenAI summaries above to see how the open-source model stacks up.

In [None]:
url = "https://developer.hashicorp.com/terraform"
display_summary(url)


In [None]:
url = "https://www.codingforentrepreneurs.com/"
display_summary(url)