In [22]:
import os
import requests
import json
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [None]:
load_dotenv()
api_key = os.getenv('API_KEY')


MODEL = 'openai/gpt-oss-20b:free'
openai = OpenAI(base_url='https://openrouter.ai/api/v1',
                api_key=api_key)

In [48]:
MODEL = 'gemma3:1b'
openai =  OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

In [65]:
api_key = os.getenv('OPENAI_API_KEY')
MODEL = "gpt-4o-mini"
openai = OpenAI(api_key=api_key)
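
The three configuration cells above each overwrite `MODEL` and `openai`, so whichever cell ran last decides which backend is used. As a rough sketch of one alternative, the settings could live in a single table keyed by provider. The table keys and the helper name `client_settings` are made up for this illustration; the model names, URLs and environment variable names are copied from the cells above.

```python
import os

# Provider settings copied from the three cells above. The dictionary keys
# ("openrouter", "ollama", "openai") are labels invented for this sketch.
PROVIDERS = {
    "openrouter": {
        "model": "openai/gpt-oss-20b:free",
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "API_KEY",
    },
    "ollama": {
        "model": "gemma3:1b",
        "base_url": "http://localhost:11434/v1",
        "api_key_env": None,  # no real key needed; the notebook passes 'ollama'
    },
    "openai": {
        "model": "gpt-4o-mini",
        "base_url": None,  # fall back to the client's default endpoint
        "api_key_env": "OPENAI_API_KEY",
    },
}


def client_settings(provider):
    """Return (model_name, client_kwargs) for the chosen provider label."""
    cfg = PROVIDERS[provider]
    # Read the key from the environment, or use the placeholder for Ollama.
    api_key = os.getenv(cfg["api_key_env"]) if cfg["api_key_env"] else "ollama"
    kwargs = {"api_key": api_key}
    if cfg["base_url"]:
        kwargs["base_url"] = cfg["base_url"]
    return cfg["model"], kwargs
```

With this in place, a single cell such as `MODEL, kwargs = client_settings("ollama")` followed by `openai = OpenAI(**kwargs)` would select the backend explicitly instead of relying on cell execution order.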

In [49]:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers, timeout=10)  # avoid hanging on slow sites
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [26]:
ed = Website("https://edwarddonner.com")
ed.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/connect-four/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/',
 'https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/',
 'https://edwarddonner.com/2025/05/18/2025-ai-executive-briefing/',
 'https://edwarddonner.com/2025/05/18/2025-ai-executive-briefing/',
 'https://edwarddonner.com/2025/04/21/the-complete-agentic-ai-engineering-course/',
 'https://edwarddonner.com/2025/04/21/the-

In [38]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [28]:
print(link_system_prompt)

You are provided with a list of HTML links (href values) scraped from a webpage. Your job is to examine each link and decide which links are most relevant to include in a short brochure about the company. Prioritise canonical company information pages such as About, Company, Team/Leadership, Careers/Jobs, Products/Services, Pricing, Customers/Case-Studies, Press/News, Investor Relations and Contact.
Follow these rules when deciding and formatting the output:
- Exclude links that point to Terms, Privacy, Cookies, login/signup, search, sitemap, mailto:, tel:, javascript: or obvious asset links (images, css, js).
- If a link is relative, resolve it to the site's full https:// URL (use the origin/domain of the page). If resolution is not possible, omit that link.
- De-duplicate identical target URLs.
- Prefer same-domain pages for brochure content; mark external domains if they appear relevant.
- Infer the link type from the URL path, link text, or surrounding context.
- Assign a type from

In [37]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt
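
The prompt above asks the model to cope with relative links. A possible alternative, sketched here under the assumption that resolving them in Python first is acceptable, is to normalise the list with the standard library before it goes into the prompt. `normalize_links` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, links):
    """Resolve links against base_url, drop non-web schemes, and de-duplicate."""
    seen = set()
    resolved = []
    for link in links:
        absolute = urljoin(base_url, link.strip())
        # Keep only http(s) targets; this skips mailto:, tel:, javascript: etc.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved
```

It could be applied as `normalize_links(website.url, website.links)` before joining the links into the user prompt, which also shortens the prompt when a page repeats the same link many times.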

In [30]:
print(get_links_user_prompt(ed))

Here is the list of links found on the page: https://edwarddonner.com
Please examine the links below and return a JSON object containing only the links that are relevant for a company brochure.
Rules: Resolve relative URLs to absolute https:// URLs using the page origin. Omit links you cannot resolve.
Exclude Terms, Privacy, Cookies, login/signup, search, sitemap, mailto:, tel:, javascript:, and obvious asset or tracking links (images, css, js, analytics).
De-duplicate identical target URLs. Prefer same-domain pages; mark external domains if relevant.
Infer a type for each link from this set: about, team, careers, products, pricing, customers, press, investors, contact, other.
Include an optional 'title' when you can infer a human-readable title, and a 'confidence' score between 0 and 1.
Sort returned links by relevance (most relevant first).
IMPORTANT: Return valid JSON ONLY, matching the system schema/example. Do NOT include any prose or explanation.

Links (some might be relative):


In [50]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
 
    return json.loads(result)
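
`json.loads` only guarantees valid JSON, not the `{"type": ..., "url": ...}` shape the system prompt asks for; one of the runs further down returned bare path strings instead. The following is a minimal sketch of a defensive parser, assuming it is acceptable to drop malformed entries silently. `parse_links` is a hypothetical name.

```python
import json


def parse_links(raw):
    """Return only well-formed {'type', 'url'} entries from a JSON reply."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {"links": []}
    if not isinstance(payload, dict):
        return {"links": []}
    entries = payload.get("links", [])
    if not isinstance(entries, list):
        return {"links": []}
    # Keep entries that are dicts with string 'type' and 'url' fields.
    valid = [
        {"type": e["type"], "url": e["url"]}
        for e in entries
        if isinstance(e, dict)
        and isinstance(e.get("type"), str)
        and isinstance(e.get("url"), str)
    ]
    return {"links": valid}
```

Returning `parse_links(result)` instead of `json.loads(result)` would keep later code that indexes `link['type']` and `link['url']` from raising on such replies.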
    

In [52]:
res = get_links('https://anthropic.com')

In [59]:
type(res)

dict

In [54]:
# Anthropic has made their site harder to scrape, so I'm using HuggingFace.

huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/spaces',
 '/models',
 '/microsoft/VibeVoice-1.5B',
 '/openbmb/MiniCPM-V-4_5',
 '/tencent/Hunyuan-MT-7B',
 '/meituan-longcat/LongCat-Flash-Chat',
 '/tencent/HunyuanWorld-Voyager',
 '/models',
 '/spaces/enzostvs/deepsite',
 '/spaces/apple/fastvlm-webgpu',
 '/spaces/zerogpu-aoti/wan2-2-fp8da-aoti-faster',
 '/spaces/bytedance-research/USO',
 '/spaces/multimodalart/Qwen-Image-Edit-Fast',
 '/spaces',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/openai/healthbench',
 '/datasets/syncora/developer-productivity-simulated-behavioral-data',
 '/datasets/data-agents/jupyter-agent-dataset',
 '/datasets/facebook/recycling_the_web',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/mic

In [55]:
get_links("https://huggingface.co")

{'links': ['/models',
  '/datasets',
  '/spaces',
  '/docs',
  '/enterprise',
  '/pricing',
  '/login',
  '/join',
  '/pricing#endpoints',
  '/pricing#spaces',
  '/pricing',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/enterprise',
  '/allenai',
  '/facebook',
  '/amazon',
  '/google',
  '/Intel',
  '/microsoft',
  '/grammarly',
  '/Writer',
  '/docs/transformers',
  '/docs/diffusers',
  '/docs/safetensors',
  '/docs/huggingface_hub',
  '/docs/tokenizers',
  '/docs/trl',
  '/docs/transformers.js',
  '/docs/smolagents',
  '/docs/peft',
  '/docs/datasets',
  '/docs/text-generation-inference',
  '/docs/accelerate',
  '/models',
  '/datasets',
  '/spaces',
  '/changelog',
  '/endpoints.huggingface.co',
  '/chat',
  '/huggingface',
  '/brand',
  '/terms-of-service',
  '/privacy',
  '/apply.workable.com/huggingface',
  '/learn',
  '/docs',
  '/blog',
  '/discuss.huggingface.co',
  '/status.huggingf

In [62]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
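
In `get_all_details`, a single page that fails to download raises and aborts the whole run. A sketch of a more tolerant loop is shown below; it takes the fetch step as a parameter so it can be tried without network access. `collect_pages` and its `fetch` argument are hypothetical names introduced for this example.

```python
def collect_pages(links, fetch):
    """Concatenate page contents, skipping any page whose fetch raises."""
    parts = []
    for link in links:
        try:
            contents = fetch(link["url"])
        except Exception as exc:  # broad on purpose: any failure skips the page
            print(f"Skipping {link['url']}: {exc}")
            continue
        parts.append(f"\n\n{link['type']}\n{contents}")
    return "".join(parts)
```

In the notebook it might be called as `collect_pages(links["links"], lambda u: Website(u).get_contents())`, with the result appended to the landing-page text.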

In [66]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/about'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'company page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'community page', 'url': 'https://huggingface.co/discuss'}, {'type': 'social media link', 'url': 'https://twitter.com/huggingface'}, {'type': 'social media link', 'url': 'https://www.linkedin.com/company/huggingface/'}]}
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Community
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applications.
Explore AI Apps
or
Browse 1M+ models
Trending on
this week
Models
microsoft/VibeVoic

In [67]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."


In [68]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
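
Cutting the combined text at 5,000 characters means a long landing page can crowd out every other page. One possible alternative, sketched here on the assumption that an equal split is acceptable, is to cap each page separately. `truncate_sections` is a hypothetical helper and the equal-share rule is only one of several reasonable policies.

```python
def truncate_sections(sections, total_limit=5_000):
    """Give each section an equal share of total_limit characters."""
    if not sections:
        return ""
    share = total_limit // len(sections)
    # Slice each section to its share, then join them in the original order.
    return "".join(text[:share] for text in sections)
```

The combined length never exceeds `total_limit`, and every page contributes at least its opening lines to the prompt.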

In [66]:
get_brochure_user_prompt("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'company page', 'url': 'https://huggingface.co/'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}]}


'You are looking at a company called: HuggingFace\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nHugging Face – The AI community building the future.\nWebpage Contents:\nHugging Face\nModels\nDatasets\nSpaces\nCommunity\nDocs\nEnterprise\nPricing\nLog In\nSign Up\nThe AI community building the future.\nThe platform where the machine learning community collaborates on models, datasets, and applications.\nExplore AI Apps\nor\nBrowse 1M+ models\nTrending on\nthis week\nModels\nQwen/Qwen-Image-Edit\nUpdated\n6 days ago\n•\n36.4k\n•\n1.19k\ndeepseek-ai/DeepSeek-V3.1-Base\nUpdated\n2 days ago\n•\n14k\n•\n885\ndeepseek-ai/DeepSeek-V3.1\nUpdated\n2 days ago\n•\n17.1k\n•\n507\nxai-org/grok-2\nUpdated\nabout 13 hours ago\n•\n319\n•\n410\ngoogle/gemma-3-270m\nUpdated\n10 days ago\n•\n68.7k\n•\n627\nBrowse 1M+ models\nSpaces\nRunning\n12.2k\n12.2k\nDeepSite v2\n🐳\nGenera

In [69]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [70]:
create_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'models page', 'url': 'https://huggingface.co/models'}, {'type': 'datasets page', 'url': 'https://huggingface.co/datasets'}, {'type': 'spaces page', 'url': 'https://huggingface.co/spaces'}]}


# Hugging Face Company Brochure

## Overview
Hugging Face is at the forefront of the AI revolution, positioning itself as a premier platform where the machine learning community comes together to collaborate and innovate. Focused on building the future of AI, Hugging Face boasts a vibrant ecosystem that includes over 1 million models, 250,000 datasets, and numerous applications.

## Community and Collaboration
### "The AI Community Building the Future"
Hugging Face is more than just a technology provider; it’s a community-driven platform facilitating collaboration among developers, researchers, and enthusiasts in the machine learning landscape. With services that range from a rich library of machine learning models to a comprehensive suite of datasets, Hugging Face empowers its users to create, discover, and advance their AI projects efficiently.

### Open Source Initiative
Hugging Face is committed to open-source principles, enabling individuals and organizations to share their work with the world. Their open-source stack supports a vast range of AI modalities, including:
- **Transformers:** Powerful models for natural language processing.
- **Diffusers:** State-of-the-art models for diffusion processes.
- **Safe and efficient utilities:** Including tokenizers and fine-tuning libraries.

## Customers
Hugging Face is trusted by over **50,000 organizations** globally, including industry giants such as:
- **Google**
- **Amazon**
- **Microsoft**
- **Meta**
These organizations utilize Hugging Face’s offerings for developing cutting-edge AI applications and services.

## Careers at Hugging Face
Working at Hugging Face means being part of a dynamic and innovative environment. The organization values collaboration and community, offering roles that allow team members to engage in meaningful projects related to state-of-the-art machine learning. If you are passionate about AI and eager to contribute to a collaborative and inclusive ecosystem, Hugging Face could be the perfect fit.

### Benefits of Joining Us:
- Work on groundbreaking AI technology.
- Be part of a global community that emphasizes growth and collaboration.
- Opportunities for personal and professional development in a supportive atmosphere.

## Join Us!
Whether you're looking to use our extensive models and datasets, contribute to our open-source projects, or seek a rewarding career, Hugging Face invites you to be part of this exciting journey into the future of AI. 

### Explore More:
- [Hugging Face Models](https://huggingface.co/models)
- [Datasets](https://huggingface.co/datasets)
- [Spaces](https://huggingface.co/spaces)
- [Join Our Team](https://huggingface.co/jobs)

Let’s build the future together—one model at a time!

In [70]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )

    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        if not chunk.choices:
            continue  # a chunk can arrive with no choices (e.g. usage-only)
        response += chunk.choices[0].delta.content or ''
        # Strip only the code-fence wrapper; a bare replace("markdown", "") would
        # also delete the word "markdown" from the brochure text itself.
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)

In [71]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'homepage', 'url': 'https://huggingface.co/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'blog', 'url': 'https://huggingface.co/blog'}, {'type': 'jobs page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'documentation', 'url': 'https://huggingface.co/docs'}, {'type': 'learn page', 'url': 'https://huggingface.co/learn'}, {'type': 'brand page', 'url': 'https://huggingface.co/brand'}, {'type': 'social media - Twitter', 'url': 'https://twitter.com/huggingface'}, {'type': 'social media - LinkedIn', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


# Hugging Face  
**The AI community building the future**

---

## 1️⃣ About Hugging Face  

Hugging Face is the *cloud‑first* hub where data scientists, researchers, and developers collaborate on machine‑learning assets—**models, datasets, and applications**—across every modality (text, image, video, audio, 3‑D).  
Building on an open‑source foundation, the platform powers **over 1 M public models and 400 k+ Apps, with 12.3 k Spaces running live** right now.

---

## 2️⃣ What We Offer  

| Feature | Description | Key Stats |
|---------|-------------|-----------|
| **Model Hub** | Browse, train, and deploy 1 M+ state‑of‑the‑art models (e.g., *Qwen‑Image‑Edit*, *DeepSeek‑V3.1*, *Gemma‑3‑270m*) | 1 M+ models, 35 k+ popular |
| **Datasets Hub** | 250k+ curated datasets spanning text, vision & multimodal workloads | 250k+ datasets |
| **Spaces** | Rapid prototyping with no‑code event loops (Python, React, Gradio) | 12.3 k running Spaces |
| **Compute Engine** | On‑demand GPU / TPU inferencing & training (from $0.60/hr) | GPU‑optimized endpoints |
| **Enterprise Solutions** | Unlimited private models, private datasets, SSO, audit logs, dedicated support | 20 k+ enterprise users |

---

## 3️⃣ Open‑Source Ecosystem  

| Project | Description | Primary Language |
|---------|-------------|------------------|
| **🤗 Transformers** | 148 k+ high‑performance language models | Python |
| **🤗 Diffusers** | 30 k+ diffusion models for generative imagery | PyTorch |
| **🤗 Tokenizers** | 10 k fast tokenizers optimized for research & production | Rust/Python |
| **🤗 Datasets** | 20 k+ ready‑to‑use data loaders & transforms | Python |
| **🤗 Text‑Generation‑Inference** | Low‑latency serving toolkit | C++/Python |
| **🤗 Transformers.js** | Run state‑of‑the‑art ML *directly in the browser* | JavaScript |
| **→ Other Projects** | PEFT, TRL, Safetensors, smolagents, etc. | — |

All projects are hosted on **GitHub** and available under permissive licenses, fueling collaboration worldwide.

---

## 4️⃣ Industry Impact  

> **More than 50 000 organisations** trust Hugging Face for their AI workloads—from academia to Fortune 500 giants.

| Company | Models | Followers |
|---------|--------|-----------|
| **Amazon** | 20 | 3.38 k |
| **Google (Enterprise)** | 1.04 k | 25.9 k |
| **Meta AI** | 783 | 3.91 k |
| **Microsoft** | 422 | 14.3 k |
| **Intel** | 238 | 2.93 k |
| | 11 | 174 |
| **Writer** | 21 | 324 |

Our partners rely on Hugging Face for rapid prototyping, scaled inference, and secure, compliant deployment across the cloud.

---

## 5️⃣ Culture & Values  

- **Community‑First** – Giving back through open‑source contributions, tutorials, forums, and Discord chats.  
- **Fluency in Modality** – Supporting text, image, audio, video, and 3‑D models in one stack.  
- **Safety & Ethical AI** – Tools like **safetensors** and **PEFT** help reduce carbon footprints and guard against misuse.  
- **Inclusivity & Collaboration** – Designers, researchers, and engineers from diverse backgrounds shape our roadmap.  

---

## 6️⃣ Careers at Hugging Face  

Whether you’re a **researcher**, **engineer**, **product manager**, or **community advocate**, we offer:

- **Impactful projects** that scale to hundreds of millions of users.  
- A **flat, collaborative culture** where ideas move from paper to production fast.  
- **Competitive compensation** and comprehensive benefits (remote‑first, health, equity, etc.).  
- **Co‑innovation labs** partnering with academia and industry.

> **Current openings**: Data Scientists, Machine‑Learning Engineers, Platform Engineers, DevOps, HR, Marketing, and more.  
> Explore the career page on our website or reach out to careers@huggingface.co.

---

## 7️⃣ Get Started

| Resource | Where to go |
|----------|-------------|
| **Learn** – Tutorials, docs, and community discussions | https://huggingface.co/docs |
| **Explore** – 1 M+ models & 400 k+ datasets | https://huggingface.co/models |
| **Deploy** – Inference endpoints & Spaces | https://huggingface.co/spaces |
| **Enterprise** – Secure, scalable AI for teams | https://huggingface.co/enterprise |
| **Jobs** – Career opportunities | https://huggingface.co/careers |

---

### Join the movement.  
Together, we’re building the most open, collaborative, and forward‑thinking AI platform in the world.  

> **Hugging Face** – *Where models are shared, ideas collide, and the future of AI is built collaboratively.* 🚀