# 🚀 Advanced Web Scraping & AI Assistant - Week 1 Complete Exercise

## 📋 **Notebook Overview**

This notebook demonstrates the **complete evolution** of a web scraping solution through **Week 1** of the LLM Engineering course.

### **Exercise Progression:**
- **Cells 1-7**: Week 1 Day 1 (basic scraping + AI)
- **Cell 8**: Week 1 Day 2 (Ollama integration) 
- **Cells 9-13**: Week 1 Day 5 (advanced features + brochure generation)

### **Key Learning Progression:**
1. **Day 1**: JavaScript scraping problem → Selenium solution
2. **Day 2**: Remote ↔ Local AI flexibility (OpenAI ↔ Ollama)
3. **Day 5**: Multi-page intelligence + business automation

### **Technical Skills:**
- Selenium WebDriver, OpenAI API, Ollama, JSON processing, Class inheritance, Streaming responses


In [1]:
# week1 -> day1
import os
from dotenv import load_dotenv
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from IPython.display import Markdown, display, update_display
from openai import OpenAI

# week1 -> day5
import json
from typing import Dict, List

# week2 -> day2
import gradio as gr

## 📦 **Dependencies**

- **Week 1 Day 1**: Core scraping + AI integration
- **Week 1 Day 5**: Added JSON processing + type hints
- **Week 2 Day 2**: Added Gradio for the web interface


## **Environment Setup**

This cell loads the OpenAI API key from the `.env` file. The `override=True` parameter ensures that any existing environment variables are replaced with values from the `.env` file.

**Important**: Make sure you have a `.env` file in your project root with:
```
OPENAI_API_KEY=your-actual-api-key-here
```
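
Before calling the API, it can help to confirm that the key actually loaded. The sketch below is one way to do that; the helper name `describe_api_key` and the specific checks are illustrative assumptions, not part of the notebook.

```python
import os
from typing import Optional


def describe_api_key(key: Optional[str]) -> str:
    """Return a short diagnostic message for an API key value (hypothetical helper)."""
    if not key:
        return "No API key found: check that .env exists and defines OPENAI_API_KEY"
    if key != key.strip():
        return "API key has leading or trailing whitespace: remove it from .env"
    return "API key found"


# Never print the key itself; only report whether it looks usable.
print(describe_api_key(os.getenv("OPENAI_API_KEY")))
```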


In [2]:
load_dotenv(override=True)
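# os.getenv returns None when the variable is not set.
# OpenAI() reads OPENAI_API_KEY from the environment on its own, so this value is only useful for checks.
api_key = os.getenv('OPENAI_API_KEY')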

## 🏗️ **WebpageSummarizer Class**

- **Day 1**: Basic scraping + AI integration
- **Day 2**: Remote ↔ Local flexibility (`set_endpoint`, `set_model`)
- **Day 5**: Multi-page intelligence + brochure generation
- **Week 2 Day 2**: Generator variant for Gradio (`generate_brochure_with_gradio`)


In [11]:
class WebpageSummarizer:
    # week1 -> day1
    _system_prompt = """
        You are a snarky assistant that analyzes the contents of a website, 
        and provides a short, snarky, humorous summary, ignoring text that might be navigation related.
        Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
    """
    
    # week1 -> day1
    _MODEL = "gpt-4o-mini"

    # week1 -> day1
    def __init__(self, model: str = _MODEL) -> None:
        self.openai_client = OpenAI()
        self.driver = webdriver.Chrome()
        self._MODEL = model
    
    # week1 -> day1
    def scrape_website(self, url: str) -> str:
        self.driver.get(url)
        self.driver.implicitly_wait(10)
        title = self.driver.title
        text_content = self.driver.find_element(By.TAG_NAME, "body").text
        return title + "\n\n" + text_content

    # week1 -> day1
    def summarize_text(self, url: str) -> str:
        text = self.scrape_website(url)
        response = self.openai_client.chat.completions.create(
            model=self._MODEL,
            messages=[
                {"role": "system", "content": self._system_prompt},
                {"role": "user", "content": text}
            ]
        )

        return response.choices[0].message.content

    # week1 -> day1
    def display_summary(self, url: str) -> None:
        summary: str = self.summarize_text(url)
        display(Markdown(summary))

    # week1 -> day2
    def set_endpoint(self, endpoint: str, api_key: str = "ollama") -> None:
        self.openai_client = OpenAI(base_url=endpoint, api_key=api_key)

    # week1 -> day2
    def set_model(self, model: str) -> None:
        self._MODEL = model

    # week1 -> day5
    def set_system_prompt(self, system_prompt: str) -> None:
        self._system_prompt = system_prompt

    # week1 -> day5
    def scrape_website_links(self, url: str) -> list[str]:
        self.driver.get(url)
        self.driver.implicitly_wait(10)
        
        links = self.driver.find_elements(By.TAG_NAME, "a")
        return [link.get_attribute("href") for link in links 
                if link.get_attribute("href") and link.get_attribute("href").strip()]

    # week1 -> day5
    def generate_user_prompt_to_select_relevant_links(self, url: str) -> str:
        user_prompt = f"""
            Here is the list of links on the website {url} -
            Please decide which of these are relevant web links for a brochure about the company, 
            respond with the full https URL in JSON format.
            Do not include Terms of Service, Privacy, email links.

            Links (some might be relative links):
        """
        links = self.scrape_website_links(url)
        user_prompt += "\n".join(links)
        return user_prompt

    # week1 -> day5
    def select_relevant_links(self, url:str) -> Dict[str, List[Dict[str, str]]]:
        message = self.generate_user_prompt_to_select_relevant_links(url)
        response = self.openai_client.chat.completions.create(
            model=self._MODEL,
            messages=[
                {"role": "system", "content": self._system_prompt},
                {"role": "user", "content": message}
            ],
            response_format={"type": "json_object"}
        )

        json_response = json.loads(response.choices[0].message.content)

        return json_response

    # week1 -> day5
    def fetch_page_and_all_relevant_links(self, url: str) -> str:
        contents = self.scrape_website(url)
        relevant_links = self.select_relevant_links(url)
        result = f"## Landing Page:\n\n{contents}\n## Relevant Links:\n"
        for link in relevant_links["links"]:
            # Single quotes inside the f-string keep this valid on Python versions before 3.12
            result += f"\n\n### Link: {link['type']}\n"
            result += self.scrape_website(link["url"])
        return result
        
    # week1 -> day5
    def get_user_prompt_for_brochure(self, company_name: str, url: str) -> str:
        user_prompt = f"""
        You are looking at a company called: {company_name}
        Here are the contents of its landing page and other relevant pages;
        use this information to build a short brochure of the company in markdown without code blocks.\n\n
        """
        user_prompt += self.fetch_page_and_all_relevant_links(url)
        user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
        return user_prompt

    # week1 -> day5
    def generate_brochure(self, company_name:str, url:str, link_prompt: str, brochure_prompt: str, stream: bool = False) -> None:
        self.set_system_prompt(link_prompt)
        contents = self.get_user_prompt_for_brochure(company_name,url)
        self.set_system_prompt(brochure_prompt)
        response = self.openai_client.chat.completions.create(
            model=self._MODEL,
            messages=[{"role": "system", "content": self._system_prompt}, {"role": "user", "content": contents}],
            stream=stream # for streaming response
        )

        if stream:
            full_response = ""
            display_handle = display(Markdown(full_response), display_id=True)
            for chunk in response:
                full_response += chunk.choices[0].delta.content or ""
                update_display(Markdown(full_response), display_id=display_handle.display_id)
        else:
            result = response.choices[0].message.content
            display(Markdown(result))

    # week2 -> day2
    def generate_brochure_with_gradio(self, company_name:str, url:str, link_prompt: str, brochure_prompt: str, stream: bool = False):
        self.set_system_prompt(link_prompt)
        contents = self.get_user_prompt_for_brochure(company_name,url)
        self.set_system_prompt(brochure_prompt)
        response = self.openai_client.chat.completions.create(
            model=self._MODEL,
            messages=[{"role": "system", "content": self._system_prompt}, {"role": "user", "content": contents}],
            stream=stream # for streaming response
        )

        if stream:
            full_response = ""
            for chunk in response:
                full_response += chunk.choices[0].delta.content or ""
                yield full_response
        else:
            result = response.choices[0].message.content
            yield result
    

## Demo: LinkedIn Summary

This cell demonstrates the WebpageSummarizer in action by:

1. **Creating an instance** with the GPT-5-nano model
2. **Scraping LinkedIn's homepage** - a JavaScript-heavy site that static HTML scraping (without a browser) can't handle
3. **Generating a snarky summary** that captures the essence of LinkedIn's professional networking platform

### What Happens:
- Selenium opens a Chrome browser (visible window)
- Navigates to LinkedIn.com
- Sets a 10-second implicit wait, so element lookups retry while JavaScript renders the page
- Extracts all visible text from the page
- Sends content to OpenAI for summarization
- Displays the humorous, sarcastic summary in markdown format
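
Selenium's implicit wait does not pause until a page has finished rendering; it makes element lookups retry until the element appears or the timeout expires. The pure-Python sketch below imitates that retry loop so it can run without a browser. `poll_until` and `fake_lookup` are hypothetical names, and Selenium's own implementation differs.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(lookup: Callable[[], Optional[T]], timeout: float = 10.0,
               interval: float = 0.05) -> T:
    """Call `lookup` repeatedly until it returns a value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = lookup()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing found within {timeout} seconds")
        time.sleep(interval)


# Simulate content that only "renders" on the third lookup.
attempts = {"count": 0}


def fake_lookup() -> Optional[str]:
    attempts["count"] += 1
    return "<body> text" if attempts["count"] >= 3 else None


print(poll_until(fake_lookup, timeout=1.0))
```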

### Expected Output:
A witty, entertaining summary that captures LinkedIn's key features and business model with a humorous tone.


In [4]:
# week1 -> day1
Summarizer = WebpageSummarizer("gpt-5-nano")

Summarizer.display_summary("https://www.linkedin.com")




LinkedIn’s homepage in a nutshell: a corporate buffet of jobs, courses, tools, and guilt-inducing “Open to Work” vibes, wrapped in a lot of navigation clutter.

- Top Content: Curated posts and expert insights by topic (Career, Productivity, Finance, Soft Skills, Project Management, etc.). Yes, because your feed needed more buzzwords.
- Jobs: Find the right job or internship across a big menu of roles (Engineering, Marketing, IT, HR, Admin, Retail, etc.). Tempting you with endless openings.
- Post your job: Post a job for millions to see. Because nothing says “we’re hiring” like a public billboard.
- Software tools: Discover the best software—CRM, HRMS, Project Management, Help Desk, etc.—as if you were deciding which inbox to dread today.
- Games: Keep your mind sharp with daily games (Pinpoint, Queens, Crossclimb, Tango, Zip, Mini Sudoku). Productivity through micro-snacks!
- Open To Work: Privately tell recruiters or publicly broadcast you’re looking for opportunities. Subtle as a neon sign.
- Connect and Learn: Find people you know, learn new skills, and choose topics to study. Professional life, now with more onboarding prompts.
- Who is LinkedIn for?: Anyone navigating professional life—because apparently that’s everyone.
- Bottom line: It’s a hub of professional action—job hunting, learning, toolshopping, and the occasional brain teaser to distract you from the grim reality of deadlines.

## 🔄 **Day 2 - Remote ↔ Local AI**

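Switch between OpenAI (cloud) and Ollama (local) with `set_endpoint()` and `set_model()`. Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1`, so the same client code works for both.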
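
Switching backends only changes the client's base URL, API key, and model name. One way to keep those settings together is a small lookup table. The `BackendSettings` dataclass, the `BACKENDS` mapping, and `settings_for` below are hypothetical; the values are taken from the cells in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendSettings:
    base_url: Optional[str]  # None means "use the client's default endpoint"
    api_key: Optional[str]   # None means "read OPENAI_API_KEY from the environment"
    model: str


BACKENDS = {
    "openai": BackendSettings(base_url=None, api_key=None, model="gpt-4o-mini"),
    # The key is a placeholder: the client needs a non-empty value for a local server.
    "ollama": BackendSettings(
        base_url="http://localhost:11434/v1", api_key="ollama", model="llama3.2"
    ),
}


def settings_for(name: str) -> BackendSettings:
    """Look up a backend by name, failing loudly on typos."""
    try:
        return BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {sorted(BACKENDS)}"
        ) from None


print(settings_for("ollama"))
```

With the class in this notebook, `s = settings_for("ollama")` followed by `Summarizer.set_endpoint(s.base_url, s.api_key)` and `Summarizer.set_model(s.model)` would be equivalent to the Day 2 cell.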


In [6]:
# week1 -> day2
Summarizer.set_endpoint("http://localhost:11434/v1")
Summarizer.set_model("llama3.2")

In [None]:
Summarizer.display_summary("https://www.linkedin.com")

## 🚀 **Day 5 - Multi-Page Intelligence**

AI-powered link analysis + automated company brochure generation

In [12]:
Summarizer = WebpageSummarizer("gpt-5-nano")
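
The link-selection step (`select_relevant_links`) expects the model to return a JSON object with a `links` array, as spelled out in `LINK_SYSTEM_PROMPT`, but model output is not guaranteed to match. A defensive parse might look like the sketch below, which also resolves relative URLs against the page URL and drops non-HTTP entries such as `mailto:` links. The function name `parse_link_response` is an assumption for illustration; the notebook itself calls `json.loads` directly.

```python
import json
from typing import Dict, List
from urllib.parse import urljoin, urlparse


def parse_link_response(raw: str, base_url: str) -> List[Dict[str, str]]:
    """Validate the model's JSON reply and return cleaned {"type", "url"} entries."""
    data = json.loads(raw)
    links = data.get("links", []) if isinstance(data, dict) else []
    cleaned: List[Dict[str, str]] = []
    for item in links:
        if not isinstance(item, dict) or not isinstance(item.get("url"), str):
            continue  # skip entries that do not follow the example format
        url = urljoin(base_url, item["url"].strip())  # resolves relative links
        if urlparse(url).scheme not in ("http", "https"):
            continue  # drops mailto:, javascript:, and similar
        cleaned.append({"type": str(item.get("type", "page")), "url": url})
    return cleaned


reply = (
    '{"links": [{"type": "about page", "url": "/about"}, '
    '{"type": "email", "url": "mailto:hi@example.com"}]}'
)
print(parse_link_response(reply, "https://example.com"))
```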

In [13]:
LINK_SYSTEM_PROMPT = """
    You are provided with a list of links found on a webpage.
    You are able to decide which of the links would be most relevant to include in a brochure about the company,
    such as links to an About page, or a Company page, or Careers/Jobs pages.
    You should respond in JSON as in this example:

    {
        "links": [
            {"type": "about page", "url": "https://full.url/goes/here/about"},
            {"type": "careers page", "url": "https://another.full.url/careers"}
        ]
    }
 """

In [14]:
BRAND_SYSTEM_PROMPT = """ 
You are an assistant that analyzes the contents of several relevant pages from a company website
and creates a short brochure about the company for prospective customers, investors and recruits.
Respond in markdown without code blocks.
Include details of company culture, customers and careers/jobs if you have the information. 
"""


In [16]:
# Generate brochure without streaming the response
Summarizer.generate_brochure("Hugging Face", "https://huggingface.co", LINK_SYSTEM_PROMPT, BRAND_SYSTEM_PROMPT)

# Hugging Face — The AI community building the future

Hugging Face is the collaboration platform at the heart of the machine learning community. We empower researchers, engineers, and end users to learn, share, and build open, ethical AI together.

---

## What we do

- A vibrant platform where the ML community collaborates on models, datasets, and applications
- Browse 1M+ models, discover 400k+ apps, and explore 250k+ datasets
- Multi-modality support: text, image, video, audio, and even 3D
- Build and showcase your ML portfolio by sharing your work with the world
- Sign up to join a thriving ecosystem and accelerate your ML journey

---

## The platform (products and capabilities)

- Hub for Models, Datasets, and Spaces
  - Host and collaborate on unlimited public models, datasets, and applications
- HF Open Source Stack
  - Move faster with a comprehensive open source foundation
- Inference & Deployment
  - Inference Endpoints to deploy at scale; GPU-enabled Spaces in a few clicks
  - Inference Providers give access to 45,000+ models via a single unified API (no service fees)
- HuggingChat Omni
  - Chat with AI across the ecosystem
- Services for teams
  - Enterprise-grade security, access controls, and dedicated support
  - Starting at $20 per user per month
- Compute options
  - Starting at $0.60/hour for GPU
- Open ecosystem
  - Our open source projects power the ML toolchain and community
  - Key projects include Transformers, Diffusers, Safetensors, Tokenizers, TRL, Transformers.js, smolagents, and more

---

## Our open source core

We’re building the foundation of ML tooling with the community. Our flagship projects include:
- Transformers (state-of-the-art models for PyTorch)
- Diffusers (diffusion models)
- Safetensors (safe storage/distribution of weights)
- Hub Python Library (Python client for the Hugging Face Hub)
- Tokenizers, TRL, Transformers.js, smolagents
- These projects power the vast Hugging Face ecosystem and enable researchers and developers to innovate openly

---

## Customers, partners, and impact

- More than 50,000 organizations use Hugging Face
- Notable teams and enterprises rely on our platform, including leaders such as Meta AI, Amazon, Google, Microsoft, Intel, Grammarly, Writer, and more
- We support both individual researchers and large teams with scalable, secure solutions

---

## Culture, community, and values

- Open and ethical AI future, built together with the community
- A learning-first, collaborative environment that values openness and sharing
- Strong emphasis on open source tooling and transparent collaboration
- A platform that empowers the next generation of ML engineers, scientists, and end users

From brand storytelling to product strategy, we emphasize a cooperative, community-driven approach to advancing AI in a responsible way.

---

## Careers and how to join

- We regularly post opportunities on our Careers page. If you’re excited by open science, open source tooling, and building tools that empower thousands of practitioners, Hugging Face could be a great fit.
- Join a growing, mission-driven team that supports developers, researchers, and enterprise customers with cutting-edge AI tooling

---

## How to engage

- Explore Models, Datasets, and Spaces
- Try HuggingChat Omni
- Sign up to build your ML portfolio and collaborate with the community
- For teams, learn about our enterprise options, security, and dedicated support

---

## Why invest or partner with Hugging Face

- A thriving, open-source ecosystem with broad adoption across industry and academia
- A scalable platform that combines models, datasets, spaces, and applications under one roof
- A proven track record of enabling organizations to accelerate AI development while offering enterprise-grade security and support
- A growing customer base and a clear pathway from community tools to enterprise deployment

---

If you’d like more detail on specific products, a few success stories, or to see current open roles, I can pull together a concise section tailored to investors, customers, or prospective hires.

In [17]:
# Generate brochure while streaming the response
Summarizer.generate_brochure("Ed Donner", "https://edwarddonner.com", LINK_SYSTEM_PROMPT, BRAND_SYSTEM_PROMPT, stream=True)

# Edward (Ed) Donner — Co-founder & CTO, Nebula.io

A glimpse into the mission, technology, and culture behind Nebula.io, led by Ed Donner, with a focus on transforming recruitment through AI.

## Who we are
- Edward (Ed) Donner is the co-founder and CTO of Nebula.io.
- Nebula.io applies Generative AI and other machine learning to help recruiters source, understand, engage, and manage talent.
- The platform uses a patented matching model that connects people with roles more accurately and quickly—without relying on keywords.

## What we do
- Enable recruiters to source, understand, engage, and manage talent at scale.
- Use proprietary, verticalized LLMs tailored for talent and hiring workflows.
- Offer a patented matching model that improves accuracy and speed, with no keyword tyranny.
- Provide a platform that is award-winning and backed by press coverage, designed to help people discover roles where they will thrive.
- The product is described as free to try, offering a no-barrier way to explore its capabilities.

## Our technology and approach
- Proprietary LLMs specialized for talent recruitment.
- A patented matching engine that aligns people with roles more effectively than traditional keyword-based methods.
- Emphasis on real-world impact: applying AI to help people discover their potential and pursue their Ikigai—finding roles where they can be fulfilled and successful.
- The platform supports Gen AI and Agentic AI use cases, including practical deployments at scale (evidenced by references to AWS-scale implementations).

## Why Nebula.io matters
- Addressing a broad human capital challenge: many people feel uninspired or disengaged at work, and Nebula.io aims to change that by better matching individuals to meaningful roles.
- The long-term vision centers on raising human prosperity by helping people pursue fulfilling career paths.

## History, credibility, and impact
- Origin: Nebula.io traces back to Ed’s prior venture, untapt (founded in 2013), which built talent marketplaces and data science tools for recruitment.
- Early recognition: selected for the Accenture FinTech Innovation Lab; named an American Banker Top 20 Company To Watch.
- Media coverage: features in Fast Company, Forbes, and American Banker; Ed has spoken publicly about AI and recruitment, including high-profile interviews.
- Legacy of real-world impact: Nebula.io builds on a track record of applying AI to recruitment challenges and delivering value to customers.

## Culture and values
- Ikigai-driven philosophy: helping people discover their potential and pursue meaningful work.
- A hands-on, creative founder who blends technical rigor with curiosity (Ed’s interests include coding, experimenting with LLMs, DJing, and exploring tech culture).
- A pragmatic, impact-focused approach to AI—prioritizing real-world problems and measurable outcomes for customers and candidates alike.

## Customers and impact
- The platform is used by recruiters today to source, understand, engage, and manage talent.
- The emphasis is on delivering a better, faster, more accurate matching experience—reducing reliance on keyword matching and accelerating hiring outcomes.
- While specific customer names aren’t listed on the public pages, the platform is described as having happy customers and broad press coverage, underscoring credibility and market reception.

## Careers and opportunities
- The site highlights a culture of innovation and hands-on AI work, but does not list open job postings.
- For those inspired to work at the intersection of AI and talent, Nebula.io invites connections and conversations about opportunities to contribute to real-world hiring problems.
- If you’re interested in joining or collaborating, consider reaching out to Ed Donner and exploring how your skills could fit the mission.

## How to connect
- Email: ed [at] edwarddonner [dot] com
- Website: www.edwarddonner.com
- Follow Ed on social: LinkedIn, Twitter, Facebook
- Newsletter: Subscribe to updates and course offerings related to AI, LLMs, and talent acquisition

## Why invest or partner with Nebula.io
- Strong founder-led vision focused on meaningful, measurable outcomes in hiring.
- Proven track record through prior ventures and credible industry recognition.
- Patent-backed technology offering a differentiated approach to talent matching.
- Clear social impact goal: helping people find roles where they will be fulfilled and productive, contributing to broader prosperity.

If you’d like a tailored brochure version for investors, customers, or potential recruits, I can adjust the emphasis and add any additional details you’d like highlighted.
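
Both streaming code paths in the class build the displayed text by appending each chunk's `delta.content`, treating `None` as an empty string. The network-free sketch below imitates that loop with plain strings standing in for API chunks; `accumulate_stream` and `fake_deltas` are hypothetical names.

```python
from typing import Iterable, Iterator, Optional


def accumulate_stream(deltas: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield the text received so far after each chunk, as a streaming UI would show it."""
    full_response = ""
    for delta in deltas:
        full_response += delta or ""  # some chunks carry no content (None)
        yield full_response


# Stand-ins for chunk.choices[0].delta.content values.
fake_deltas = ["# Bro", "chure", None, "\n\nHello"]
for snapshot in accumulate_stream(fake_deltas):
    print(repr(snapshot))
```

Yielding the cumulative text rather than each individual delta is what lets a generator-based interface replace the whole output on every step.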

In [18]:
# Generate brochure using the Gradio interface
company_name = gr.Textbox(label="Company Name", info="Write the name of the company")
company_url = gr.Textbox(label="Company URL", info="Write the URL of the company")
link_system_prompt = gr.Textbox(
    label="Link System Prompt", 
    info="This is a system prompt to decide which of the links would be most relevant to include in a brochure about the company", 
    value=LINK_SYSTEM_PROMPT
)
brand_system_prompt = gr.Textbox(
    label="Brand System Prompt", 
    info="This is a system prompt that analyzes the contents of several relevant pages from a company website and creates a short brochure about the company for prospective customers, investors and recruits.", 
    value=BRAND_SYSTEM_PROMPT
)
stream_value = gr.Checkbox(label="Stream", value=False)
gr_output = gr.Markdown(label="Response")

interface = gr.Interface(
    fn=Summarizer.generate_brochure_with_gradio, 
    title="Brochure Generator", 
    inputs=[company_name, company_url, link_system_prompt, brand_system_prompt, stream_value], 
    outputs=[gr_output], 
    flagging_mode="never"
)

interface.launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7862
* To create a public link, set `share=True` in `launch()`.


