# EXERCISE SOLUTION

Upgrade the day 1 project, which summarizes a webpage, to use an open-source model running locally via Ollama rather than OpenAI.

You'll be able to use this technique for all subsequent projects if you'd prefer not to use paid APIs.

**Benefits:**
1. No API charges - open-source
2. Data doesn't leave your box

**Disadvantages:**
1. Significantly less powerful than a frontier model

## Recap on installation of Ollama

Simply visit [ollama.com](https://ollama.com) and install!

Once complete, the ollama server should already be running locally.  
If you visit:  
[http://localhost:11434/](http://localhost:11434/)

You should see the message `Ollama is running`.  

If not, bring up a new Terminal (Mac) or PowerShell (Windows) and enter `ollama serve`.  
Then try [http://localhost:11434/](http://localhost:11434/) again.
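
If you'd rather run this check from Python than from the browser, here is a minimal sketch. The helper name `ollama_is_running` is made up for this example (it is not part of the `ollama` package), and it assumes the default address shown above.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434/", timeout: float = 2.0) -> bool:
    """Return True if the server at base_url answers with 'Ollama is running'."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or an unusable URL: treat all as "not running"
        return False
    return "Ollama is running" in body
```

Calling `ollama_is_running()` with no arguments checks the default local address; if it returns `False`, try `ollama serve` as described above.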

In [28]:
# imports

import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
import ollama

In [29]:
# Constants
MODEL = "llama3.2"

## The main difference from the -copy1 file is the Website class below, which is from Grok:
https://grok.com/share/bGVnYWN5_625bf8cd-4900-4bb1-bc19-d6798a320bad


In [30]:
from typing import Optional
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from IPython.display import Markdown, display

class Website:
    """
    A utility class to represent a scraped website, storing its URL, title, and cleaned text content.
    """
    def __init__(self, url: str, timeout: int = 10, headers: Optional[dict] = None):
        """
        Initialize a Website object by scraping the given URL using BeautifulSoup.

        Args:
            url (str): The URL of the website to scrape.
            timeout (int, optional): Maximum time (in seconds) to wait for the HTTP request. Defaults to 10.
            headers (dict, optional): Custom HTTP headers for the request. Defaults to a standard user-agent.

        Raises:
            ValueError: If the URL is empty.
            RequestException: If the HTTP request fails (e.g., network error, invalid URL, bad status code).
            RuntimeError: If any other error occurs while processing the page.
        """
        if not url or not url.strip():
            raise ValueError("URL cannot be empty")
        
        self.url = url
        # Use default headers if none provided
        self.headers = headers or {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
        }
        
        try:
            # Fetch the webpage with timeout
            response = requests.get(url, headers=self.headers, timeout=timeout)
            response.raise_for_status()  # Raise exception for bad status codes (e.g., 404, 500)
            
            # Parse with BeautifulSoup
            
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract title, fallback to domain if none found
            self.title = (
                soup.title.string.strip() if soup.title and soup.title.string
                else urlparse(url).netloc or "No title found"
            )
            
            # Remove irrelevant tags if body exists
            if soup.body:
                for irrelevant in soup.body(["script", "style", "img", "input", "nav", "footer"]):
                    irrelevant.decompose()
                self.text = soup.body.get_text(separator="\n", strip=True)
            else:
                self.text = ""
                
        except RequestException as e:
            raise RequestException(f"Failed to fetch {url}: {str(e)}") from e
        except Exception as e:
            raise RuntimeError(f"Error processing {url}: {str(e)}") from e

    def __str__(self) -> str:
        """
        Return a string representation of the Website object.
        """
        return f"Website(url={self.url}, title={self.title}, text_length={len(self.text)})"

    def __repr__(self) -> str:
        """
        Return a detailed string representation for debugging.
        """
        return f"Website(url={self.url!r}, title={self.title!r}, text={self.text[:100]!r}...)"

In [31]:
website = Website("https://edwarddonner.com/")
print(website.title)  # Home - Edward Donner
print(website.text[:200])


Home - Edward Donner
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music prod


## Types of prompts

You may know this already - but if not, you will get very familiar with it!

Models like GPT-4o have been trained to receive instructions in a particular way.

They expect to receive:

**A system prompt** that tells them what task they are performing and what tone they should use

**A user prompt** -- the conversation starter that they should reply to

In [32]:
# Define our system prompt - you can experiment with this later, changing the last sentence to "Respond in markdown in Spanish."

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [33]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "The contents of this website is as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [34]:
user_prompt_for(website)


'You are looking at a website titled Home - Edward DonnerThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\nWell, hi there.\nI’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\nvery\namateur) and losing myself in\nHacker News\n, nodding my head sagely to things I only half understand.\nI’m the co-founder and CTO of\nNebula.io\n. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,\nacquired in 2021\n.\nWe work with groundbreaking, proprietary LLMs verticalized for talent, we’ve\npatented\nour matc

## Messages

The API from Ollama expects the same message format as OpenAI:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

In [35]:
# See how this function creates exactly the format above - what the Ollama API needs

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [36]:
# The full messages list: the system prompt, plus the user prompt containing the request and the website content
messages_for(website)

[{'role': 'system',
  'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'},
 {'role': 'user',
  'content': 'You are looking at a website titled Home - Edward DonnerThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\nWell, hi there.\nI’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (\nvery\namateur) and losing myself in\nHacker News\n, nodding my head sagely to things I only half understand.\nI’m the co-founder and CTO of\nNebula.io\n. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product to

## Time to bring it together - now with Ollama instead of OpenAI

In [37]:
# And now: call the Ollama function instead of OpenAI
# https://grok.com/share/bGVnYWN5_ae4c05fe-5ff7-4d1f-9b14-cd594d054c49

def summarize(url):
    website = Website(url, timeout=10)  # Use the improved class
    messages = messages_for(website)
    response = ollama.chat(model=MODEL, messages=messages)
    return response['message']['content']
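
The `ollama` package is a thin client for the local server's HTTP API. As a rough sketch of what an equivalent call could look like without the package, assuming the default port and the server's `/api/chat` endpoint: the names `OLLAMA_CHAT_URL`, `build_chat_payload` and `chat_via_rest` are invented for this example and are not part of this notebook or of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local port
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, messages: list) -> dict:
    """Build the JSON body for a single, non-streaming chat request."""
    # stream=False asks for one complete JSON reply instead of a stream of chunks
    return {"model": model, "messages": messages, "stream": False}


def chat_via_rest(model: str, messages: list, timeout: float = 120.0) -> str:
    """POST the messages to the local server and return the reply text."""
    data = json.dumps(build_chat_payload(model, messages)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # The reply text sits under the same keys used in summarize()
    return reply["message"]["content"]
```

With the server running, `chat_via_rest(MODEL, messages_for(website))` should behave much like the `ollama.chat` call in `summarize`.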
 

In [38]:
summarize("https://edwarddonner.com")



In [39]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [40]:
display_summary("https://edwarddonner.com")

**Summary of Edward Donner's Website**
=====================================

### About the Creator

* Ed is a tech enthusiast who enjoys writing code, experimenting with Large Language Models (LLMs), and DJing.
* He co-founded Nebula.io, an AI startup applying LLMs to help people discover their potential.

### Company Overview

* Nebula.io: develops proprietary LLMs for talent sourcing, engagement, and management.
* Previously founded AI startup untapt, acquired in 2021.

### Recent News and Announcements
---------------------------

* **Course Announcement**: "The Complete Agentic AI Engineering Course" (April 21, 2025)
* **Workshop Resource**: LLM Workshop – Hands-on with Agents (December 21, 2024)
* **Community Welcome**: Welcome, SuperDataScientists! (November 13, 2024)

# Let's try more websites

Note that this will only work on websites that can be scraped using this simplistic approach.

Websites that are rendered with JavaScript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)

Also, websites protected with CloudFront (and similar) may give 403 errors - many thanks to Andy J for pointing this out.

But many websites will work just fine!
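
When you're trying several sites in a row, it can help to catch failures such as the 403 errors mentioned above rather than letting the cell stop with a traceback. Here is one possible sketch; `try_summarize` is a hypothetical helper, not part of this notebook, and it takes the summarizing function as an argument so that it stays self-contained.

```python
from typing import Callable


def try_summarize(url: str, summarize_fn: Callable[[str], str]) -> str:
    """Return the summary for url, or a short error message if summarizing fails."""
    try:
        return summarize_fn(url)
    except Exception as exc:
        # Covers scraping failures (e.g. 403s) as well as errors from the model call
        return f"Could not summarize {url}: {exc}"
```

In this notebook you could call it as `try_summarize("https://cnn.com", summarize)` and pass the result to `display(Markdown(...))`.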

In [41]:
# not working
# display_summary("https://cnn.com")

In [42]:
display_summary("https://anthropic.com")

**Anthropic Website Summary**
=============================

### Mission and Approach

Anthropic aims to build AI that serves humanity's long-term well-being. They focus on designing powerful technologies with human benefit at their foundation, prioritizing responsible AI development.

### Recent News and Announcements

*   **Claude 3.7 Sonnet**: Anthropic has released its most intelligent AI model, Claude 3.7 Sonnet.
*   **Research Updates**:
    *   Tracing the thoughts of a large language model (March 27, 2025)
    *   Anthropic Economic Index (March 27, 2025)
    *   Alignment faking in large language models (December 18, 2024)
*   **Product Releases**:
    *   Claude’s extended thinking (February 24, 2025)
    *   Introducing the Model Context Protocol (November 25, 2024)

### Key Initiatives

*   Anthropic Academy: Learn to build with Claude
*   Responsible Scaling Policy

This summary highlights Anthropic's mission, recent news and announcements, and key initiatives. It provides an overview of the website's content, excluding navigation-related sections.

In [43]:
# not working
# display_summary("https://www.tesla.com")

# Sharing your code

I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.

If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clean outputs of all cells, and then Save) for clean notebooks.

PR instructions courtesy of an AI friend: https://chatgpt.com/share/670145d5-e8a8-8012-8f93-39ee4e248b4c
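
If you'd rather clear outputs from a script than from the menu, here is a small sketch. It assumes the usual notebook JSON layout (a top-level `cells` list in which code cells carry `outputs` and `execution_count`); the function names are made up for this example, and any path you pass in is your own.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Remove outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_outputs_in_file(path: str) -> None:
    """Load a .ipynb file, clear its outputs, and write it back to the same path."""
    with open(path, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Try it on a copy of your notebook first, since `clear_outputs_in_file` overwrites the file it is given.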