# Week 2 Exercise - Website AI Assistant

## What I Learned in Week 2

Week 2 covered the core building blocks of working LLM applications:
- Calling hosted and local models (OpenAI, Anthropic, Ollama) through the OpenAI-compatible client pattern
- Streaming responses token-by-token using `stream=True` and Python generators with `yield`
- Crafting effective system prompts to shape model behaviour and ground responses
- Building conversational chat with multi-turn message history
- Creating interactive UIs with Gradio — `gr.Blocks`, `gr.Chatbot`, `gr.Textbox`, and event wiring
- Fetching and cleaning website content using `requests` and `BeautifulSoup` (the scraper pattern)

## How I Applied It — The Website AI Assistant

This notebook brings all of the above together in a single self-contained prototype. A user pastes a public website URL into the chat window, and the assistant crawls the site, summarises it, and answers questions grounded only in what it found.

## Components Used

| Component | Purpose |
|---|---|
| `requests` + `BeautifulSoup` | Crawl same-domain pages and extract clean readable text (the Week 2 scraper pattern) |
| `OpenAI` client (OpenAI-compatible) | Connect to OpenAI, Anthropic Claude, and Ollama using the same client interface |
| System prompt with injected context | Crawled page text (capped at `MAX_CHARS_PER_PAGE` characters per page) is appended to the system prompt so the model reads it as knowledge |
| Streaming with `yield` | LLM replies stream token-by-token into the chat window in real time |
| Multi-turn message history | Full conversation history is passed on every call so follow-up questions work naturally |
| `gr.Blocks` + `gr.Chatbot` | Gradio chat UI launches with a welcome message pre-loaded, no page reload needed |
| URL detection in chat | Detects when a message is a URL and triggers the crawl inline — no separate input box |
| Loading indicator | A status bubble appears in the chat during the crawl and is overwritten with the summary when done |

## How It Works

1. **Launch**: Gradio opens with a welcome message asking the user to paste a URL.
2. **Paste URL**: The assistant detects the URL, announces that it is reading the site, and shows a loading indicator.
3. **Crawl**: Up to 20 same-domain URLs (`MAX_PAGES`) are visited, and each page with enough readable text is cleaned into a plain-text document.
4. **Summarise**: The model reads the first 800 characters of up to six pages and is asked for a 2-3 sentence summary plus 3-5 key topics; the assistant then asks what to explore.
5. **Conversation**: Every follow-up message passes the crawled text (capped at 1,500 characters per page by `MAX_CHARS_PER_PAGE`) as system context, along with the conversation so far. The model is instructed to answer from the site only and to say so when it cannot find an answer.
6. **Switch site**: Pasting a new URL at any point resets the knowledge base and starts a fresh crawl.
7. **Model choice**: Set `MODEL_CHOICE` to `"gpt"`, `"anthropic"`, or `"llama"` in the constants cell to switch between OpenAI gpt-4.1-mini, Anthropic Claude Sonnet, and Ollama llama3.2.
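
The routing implied by steps 2, 5 and 6 can be sketched as a small pure function. This is a simplified illustration rather than the notebook's actual handler: `route_message` and the action labels are hypothetical names, and the crawl and model calls are left out entirely.

```python
def route_message(message: str, is_indexed: bool) -> str:
    """Return which branch of the chat flow a message would take (hypothetical labels)."""
    text = message.strip()
    if not text:
        return "ignore"        # empty input: nothing to do
    if text.startswith(("http://", "https://")):
        return "crawl"         # any URL resets the knowledge base and starts a crawl
    if not is_indexed:
        return "ask_for_url"   # no site loaded yet, so prompt for one
    return "answer"            # answer from the crawled site content


for message, indexed in [
    ("https://example.com", False),
    ("What do they sell?", False),
    ("What do they sell?", True),
    ("https://example.org", True),
]:
    print(f"{message!r} indexed={indexed} -> {route_message(message, indexed)}")
```

Because the URL check comes first, a new URL wins even when a site is already indexed, which is what makes the "switch site" step work without a separate reset button.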

In [47]:
import os
import re
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from openai import OpenAI
import gradio as gr

In [48]:
MODEL_GPT = "gpt-4.1-mini"
MODEL_LLAMA = "llama3.2"
MODEL_ANTHROPIC = "claude-sonnet-4-5-20250929"

# Options: "gpt" (OpenAI), "anthropic" (Claude), or "llama" (Ollama)
MODEL_CHOICE = "anthropic"

MAX_PAGES = 20
MAX_CHARS_PER_PAGE = 1500
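
One thing to watch with a string switch like `MODEL_CHOICE`: the `get_client_and_model` helper later in this notebook falls back to Ollama for any value it does not recognise, so a typo goes unnoticed. A possible alternative, sketched below, is a lookup table that fails loudly. `PROVIDERS` and `resolve_provider` are hypothetical names; the model names and base URLs are the ones used in this notebook, and `base_url=None` is assumed to mean "use the client's default endpoint".

```python
# Hypothetical lookup table; values mirror the constants and clients in this notebook.
PROVIDERS = {
    "gpt": {
        "model": "gpt-4.1-mini",
        "base_url": None,  # assumption: None means the client's default endpoint
        "key_env": "OPENAI_API_KEY",
    },
    "anthropic": {
        "model": "claude-sonnet-4-5-20250929",
        "base_url": "https://api.anthropic.com/v1",
        "key_env": "ANTHROPIC_API_KEY",
    },
    "llama": {
        "model": "llama3.2",
        "base_url": "http://localhost:11434/v1",
        "key_env": None,  # the notebook passes a placeholder key to Ollama
    },
}


def resolve_provider(choice: str) -> dict:
    """Return the settings for a provider, or raise on an unknown choice."""
    try:
        return PROVIDERS[choice]
    except KeyError:
        valid = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown MODEL_CHOICE {choice!r}; expected one of: {valid}") from None


print(resolve_provider("anthropic")["model"])
```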

In [49]:
load_dotenv(override=True)

openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")

anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:8]}")
else:
    print("Anthropic API Key not set")

openai_client = OpenAI()
anthropic_client = OpenAI(base_url="https://api.anthropic.com/v1", api_key=anthropic_api_key)
ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-a


In [50]:
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


def get_client_and_model():
    if MODEL_CHOICE == "gpt":
        return openai_client, MODEL_GPT
    elif MODEL_CHOICE == "anthropic":
        return anthropic_client, MODEL_ANTHROPIC
    return ollama_client, MODEL_LLAMA


def extract_page_text(soup, url):
    # Return (title, cleaned body text) for an already-parsed page
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    if soup.body:
        # Drop elements that contribute no readable text
        for tag in soup.body(["script", "style", "img", "input"]):
            tag.decompose()
        text = soup.body.get_text(separator="\n", strip=True)
    else:
        text = ""
    return title, re.sub(r"\n{3,}", "\n\n", text)


def fetch_page_text(url):
    # Fetch a single page and return (title, cleaned body text)
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    return extract_page_text(soup, url)


def is_same_domain(url, base_url):
    return urlparse(url).netloc == urlparse(base_url).netloc


def extract_links(soup, page_url):
    # Resolve each href against the page it was found on, drop the #fragment,
    # and keep only http(s) links on the same domain
    links = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(page_url, a["href"])
        clean_url = urlparse(href)._replace(fragment="").geturl()
        if is_same_domain(clean_url, page_url) and clean_url.startswith("http"):
            links.add(clean_url)
    return links


def crawl_website(base_url, max_pages=MAX_PAGES):
    # Breadth-first crawl that visits at most max_pages same-domain URLs
    visited = set()
    queue = [base_url]
    documents = []

    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10, headers=HEADERS)
            if resp.status_code != 200 or "text/html" not in resp.headers.get("Content-Type", ""):
                continue
            # Parse the response once and reuse the soup for both links and text
            soup = BeautifulSoup(resp.content, "html.parser")
            links = extract_links(soup, url)
            title, text = extract_page_text(soup, url)
            # Skip near-empty pages
            if len(text) > 150:
                documents.append({"url": url, "title": title, "text": text})
            for link in links:
                if link not in visited and link not in queue:
                    queue.append(link)
        except Exception:
            # Skip pages that fail to download or parse
            continue

    return documents
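
The crawl order and the page cap are easier to see without the network involved. The sketch below runs the same breadth-first idea over a made-up in-memory site, using only the standard library. `FAKE_SITE`, `normalise_links`, and `crawl_order` are hypothetical names, and the sketch assumes that relative links are resolved against the page they appear on.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": each URL maps to the raw hrefs found on that page.
FAKE_SITE = {
    "https://example.com/": ["/about", "/blog#latest", "https://other.org/"],
    "https://example.com/about": ["/", "/team", "mailto:hi@example.com"],
    "https://example.com/blog": ["/about", "/blog/post-1"],
    "https://example.com/team": [],
    "https://example.com/blog/post-1": ["/"],
}


def normalise_links(hrefs, page_url, base_url):
    """Make hrefs absolute, strip fragments, keep same-domain http(s) links."""
    base_host = urlparse(base_url).netloc
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        clean = urlparse(absolute)._replace(fragment="").geturl()
        if clean.startswith("http") and urlparse(clean).netloc == base_host:
            links.append(clean)
    return links


def crawl_order(base_url, max_pages):
    """Return the URLs in the order a breadth-first crawl would visit them."""
    visited = []
    seen = {base_url}          # everything ever queued, to avoid duplicates
    queue = deque([base_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in normalise_links(FAKE_SITE.get(url, []), url, base_url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl_order("https://example.com/", max_pages=4)
for url in order:
    print(url)
```

With `max_pages=4` the crawl stops before reaching `/blog/post-1`, and neither the external link nor the `mailto:` link is ever queued.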

In [51]:
client, model = get_client_and_model()

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

print(f"Model: {model}")

responseRaw = client.chat.completions.create(model=model, messages=messages)

response = responseRaw.choices[0].message.content

response


Model: claude-sonnet-4-5-20250929


"Hello! I'm doing well, thank you for asking. How are you doing today? Is there anything I can help you with?"

In [52]:
def build_website_context(documents, max_chars=MAX_CHARS_PER_PAGE):
    parts = [
        f"Page: {doc['title']}\nURL: {doc['url']}\n{doc['text'][:max_chars]}"
        for doc in documents
    ]
    return "\n\n---\n\n".join(parts)

In [53]:
SYSTEM_PROMPT = """You are a knowledgeable assistant that has studied a specific website in depth.
You ONLY answer questions based on information found on that website.
When answering, naturally cite specific pages using markdown links like [page name](url).
Be conversational, concise, and accurate.
If the answer is not in the website content provided, say so clearly — do not speculate or use outside knowledge."""


def stream_chat(messages, client, model):
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    response = ""
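    for chunk in stream:
        # Some providers send chunks with no choices (for example a final usage chunk)
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        response += delta
        yield response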


def generate_site_summary(documents):
    client, model = get_client_and_model()
    sample = "\n\n".join(
        f"Page: {d['title']}\nURL: {d['url']}\n{d['text'][:800]}"
        for d in documents[:6]
    )
    messages = [
        {"role": "system", "content": "You are a website analyst. Be concise and friendly."},
        {
            "role": "user",
            "content": (
                "Based on these pages from the website, write:\n"
                "1. A 2-3 sentence summary of what this website is about\n"
                "2. 3-5 key topics or findings as bullet points\n\n"
                f"{sample}"
            ),
        },
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
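
The streaming generator yields the accumulated reply each time rather than the latest token, so the UI can simply replace the last chat bubble on every update. The sketch below shows that pattern without calling any API. `fake_stream` and `accumulate` are hypothetical names, the chunk objects only mimic the `choices[0].delta.content` shape used above, and the empty-choices guard is an added precaution.

```python
from types import SimpleNamespace


def fake_stream(tokens):
    """Yield objects shaped like streaming chunks (hypothetical stand-in for an API stream)."""
    for token in tokens:
        delta = SimpleNamespace(content=token)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    # A trailing chunk with no choices, to exercise the guard below
    yield SimpleNamespace(choices=[])


def accumulate(stream):
    """Yield the reply so far after each chunk, treating None content as empty."""
    response = ""
    for chunk in stream:
        if not chunk.choices:
            continue
        response += chunk.choices[0].delta.content or ""
        yield response


partials = list(accumulate(fake_stream(["Hel", "lo", None, "!"])))
print(partials)
```

Each element of `partials` is a complete snapshot of the reply at that moment, which is what a handler like `handle_chat` writes into `history[-1]["content"]` on every iteration.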

In [56]:
WEBSITE_CONTEXT = ""
IS_INDEXED = False


def is_url(text):
    return text.startswith("http://") or text.startswith("https://")


def handle_chat(message, history):
    global WEBSITE_CONTEXT, IS_INDEXED

    message = message.strip()
    if not message:
        yield history, gr.update(value="")
        return

    history = history + [{"role": "user", "content": message}]
    yield history, gr.update(value="")

    if is_url(message):
        WEBSITE_CONTEXT = ""
        IS_INDEXED = False

        history = history + [{
            "role": "assistant",
            "content": (
                f"Give me a few minutes — I'm visiting **{message}** and building my knowledge of the site. "
                "I'll come back with a summary once I've finished reading through the pages!"
            )
        }]
        yield history, gr.update(value="")

        history = history + [{"role": "assistant", "content": "⏳ Reading through the pages, please wait..."}]
        yield history, gr.update(value="")

        documents = crawl_website(message)

        if not documents:
            history[-1]["content"] = (
                "I wasn't able to access that website. "
                "Please check the URL and make sure the site is publicly accessible, then try again."
            )
            yield history, gr.update(value="")
            return

        WEBSITE_CONTEXT = build_website_context(documents)
        IS_INDEXED = True

        summary = generate_site_summary(documents)
        page_word = "page" if len(documents) == 1 else "pages"
        summary_msg = (
            f"{summary}\n\n"
            f"I've read **{len(documents)} {page_word}** from the site. "
            "What would you like to learn more about?"
        )
        history[-1]["content"] = summary_msg
        yield history, gr.update(value="")
        return

    if not IS_INDEXED:
        history = history + [{
            "role": "assistant",
            "content": (
                "It looks like that isn't a URL. "
                "Please paste a website address starting with `https://` and I'll get reading!"
            )
        }]
        yield history, gr.update(value="")
        return

    system_with_context = f"{SYSTEM_PROMPT}\n\nWebsite knowledge:\n\n{WEBSITE_CONTEXT}"
    messages_for_llm = (
        [{"role": "system", "content": system_with_context}]
        + [{"role": h["role"], "content": h["content"]} for h in history]
    )

    client, model = get_client_and_model()
    history = history + [{"role": "assistant", "content": ""}]

    for partial in stream_chat(messages_for_llm, client, model):
        history[-1]["content"] = partial
        yield history, gr.update(value="")

    yield history, gr.update(value="")
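
The prefix test in `is_url` is deliberately simple, which means it also accepts inputs such as `http://` on its own or a sentence that merely starts with a URL. A slightly stricter check could look like the sketch below. `looks_like_url` is a hypothetical helper, not part of the notebook, and it relies only on `urllib.parse` from the standard library.

```python
from urllib.parse import urlparse


def looks_like_url(text: str) -> bool:
    """Accept only http(s) URLs that contain a host and no whitespace."""
    candidate = text.strip()
    if any(ch.isspace() for ch in candidate):
        return False  # a sentence that merely starts with a URL
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for sample in [
    "https://example.com",
    "http://",
    "https://example.com is great",
    "example.com",
]:
    print(f"{sample!r}: {looks_like_url(sample)}")
```

Only the first sample passes. The bare `example.com` is still rejected, which matches the welcome message asking for an address starting with `https://`.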

In [57]:
WELCOME_MESSAGE = [{
    "role": "assistant",
    "content": (
        "Hi! I'm your **Website AI Assistant**.\n\n"
        "I can read any public website and have a conversation about what I find there — "
        "from products and services to pricing, team, and more.\n\n"
        "To get started, paste a website address (starting with `https://`) into the chat below."
    )
}]

with gr.Blocks(title="Website AI Assistant", theme=gr.themes.Soft()) as demo:

    gr.Markdown("# Website AI Assistant")

    chatbot = gr.Chatbot(
        type="messages",
        value=WELCOME_MESSAGE,
        height=500,
        show_label=False,
    )

    with gr.Row():
        msg_input = gr.Textbox(
            placeholder="Paste a website URL or ask a question...",
            show_label=False,
            scale=5,
        )
        send_btn = gr.Button("Send", variant="primary", scale=1)

    send_btn.click(
        fn=handle_chat,
        inputs=[msg_input, chatbot],
        outputs=[chatbot, msg_input],
    )

    msg_input.submit(
        fn=handle_chat,
        inputs=[msg_input, chatbot],
        outputs=[chatbot, msg_input],
    )

demo.queue().launch()

* Running on local URL:  http://127.0.0.1:7899
* To create a public link, set `share=True` in `launch()`.


