# AI brochure generator using website scraping


### BUSINESS CHALLENGE:

Create a product that builds a brochure about a company for prospective clients, investors and potential recruits.

The inputs are the company name and the URL of its primary website.



In [1]:
# imports
# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt

import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [2]:

    
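# Use a local Ollama server through its OpenAI-compatible endpoint.
# The client library requires an api_key value, but Ollama does not check it.
MODEL = 'llama3.2'
openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')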

In [3]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers, timeout=30)  # timeout so an unresponsive site cannot hang the cell
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [4]:
ed = Website("https://edwarddonner.com")
ed.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/connect-four/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2025/09/15/ai-in-production-gen-ai-and-agentic-ai-on-aws-at-scale/',
 'https://edwarddonner.com/2025/09/15/ai-in-production-gen-ai-and-agentic-ai-on-aws-at-scale/',
 'https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/',
 'https://edwarddonner.com/2025/05/28/connecting-my-courses-become-an-llm-expert-and-leader/',
 'https://edwarddonner.com/2025/05/18/2025-ai-executive-briefing/',
 '

In [5]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [6]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}



In [7]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [8]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://edwarddonner.com/
https://edwarddonner.com/connect-four/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2025/09/15/ai-in-production-gen-ai-and-agentic-ai-on-aws-at-scale/
https://edwarddonner.com/2025/09/15/ai-in-production-gen-ai-and-agentic-ai-on-aws-at-scale/
https://edward

In [9]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
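
Even with `response_format` set, a small local model may return JSON that does not match the requested shape. The following is a minimal sketch of one way to check the reply before using it. `parse_links_response` is a hypothetical helper that is not part of this notebook, and the sample reply is made up for illustration.

```python
import json

def parse_links_response(raw):
    """Parse the model's reply and keep only well-formed link entries (hypothetical helper)."""
    empty = {"links": []}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict) or not isinstance(data.get("links"), list):
        return empty
    valid = []
    for item in data["links"]:
        if not isinstance(item, dict):
            continue  # ignore entries that are not objects
        url = item.get("url")
        # Keep only absolute web URLs, as the prompt asks for full URLs
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            valid.append({"type": str(item.get("type", "page")), "url": url})
    return {"links": valid}

# Illustrative reply: one good entry, one relative URL, one malformed entry
sample_reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"type": "careers", "url": "/careers"}, "oops"]}'
print(parse_links_response(sample_reply))
```

Returning an empty list on failure lets the rest of the pipeline carry on with the landing page alone instead of raising.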

In [10]:


huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/spaces',
 '/models',
 '/tencent/SRPO',
 '/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B',
 '/openbmb/VoxCPM-0.5B',
 '/ibm-granite/granite-docling-258M',
 '/google/vaultgemma-1b',
 '/models',
 '/spaces/enzostvs/deepsite',
 '/spaces/zerogpu-aoti/wan2-2-fp8da-aoti-faster',
 '/spaces/Wan-AI/Wan2.2-Animate',
 '/spaces/IndexTeam/IndexTTS-2-Demo',
 '/spaces/abdul9999/NoWatermark',
 '/spaces',
 '/datasets/HuggingFaceFW/finepdfs',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/InternRobotics/OmniWorld',
 '/datasets/LucasFang/FLUX-Reason-6M',
 '/datasets/HuggingFaceM4/FineVision',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/grammarly',
 '/Writer',
 '/docs/transforme
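
Most of the Hugging Face links above are relative, and many are repeated. As a minimal sketch (not part of the original notebook), the list could be resolved against the page URL and de-duplicated before it is sent to the model. `normalize_links` is a hypothetical helper, and the sample list, including the `mailto:` address, is illustrative.

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, links):
    """Resolve relative links against base_url, drop non-web links, and remove duplicates in order."""
    seen = set()
    result = []
    for link in links:
        absolute = urljoin(base_url, link)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

# Illustrative input modelled on the output above
sample_links = ['/', '/models', '/models', '/pricing#endpoints',
                'mailto:hello@example.com', 'https://github.com/huggingface']
for link in normalize_links("https://huggingface.co", sample_links):
    print(link)
```

Sending absolute, de-duplicated URLs shortens the prompt and means the model does not have to guess how to complete a relative path.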

In [11]:
get_links("https://huggingface.co")

{'links': [{'type': 'Models', 'url': 'https://huggingface.co/models'},
  {'type': 'About us', 'url': 'https://huggingface.co/brand'},
  {'type': 'Pricing', 'url': 'https://huggingface.co/pricing'},
  {'type': 'Support', 'url': 'https://endpoints.huggingface.co'},
  {'type': 'Enterprise', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'Blog', 'url': 'https://huggingface.co/blog'},
  {'type': 'GitHub Repository', 'url': 'https://github.com/huggingface'},
  {'type': 'Twitter Handle', 'url': 'https://twitter.com/huggingface'},
  {'type': 'LinkedIn Company Page',
   'url': 'https://www.linkedin.com/company/huggingface/'},
  {'type': 'Join Our Team', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'Discuss Community Forum',
   'url': 'https://discuss.huggingface.co'}]}

## Second step: make the brochure!



In [14]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
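
The model can return the landing page again, the same URL twice, or a URL that fails to load, and any one failed fetch would stop the whole run. The sketch below shows one possible way to guard against this. `collect_pages` and its `fetch` parameter are hypothetical names introduced for illustration, and the fake fetcher stands in for `Website(url).get_contents()` so the example runs without network access.

```python
def collect_pages(landing_url, links, fetch):
    """Fetch each distinct linked page once, skipping the landing page and any page that fails."""
    seen = {landing_url.rstrip("/")}  # the landing page is already in the prompt
    sections = []
    for link in links:
        key = link["url"].rstrip("/")  # treat /careers and /careers/ as the same page
        if key in seen:
            continue
        seen.add(key)
        try:
            sections.append(f"\n\n{link['type']}\n{fetch(link['url'])}")
        except Exception as error:  # broad on purpose: any fetch failure skips that page
            print(f"Skipping {link['url']}: {error}")
    return "".join(sections)

def fake_fetch(url):
    """Stand-in fetcher for the demo; raises for one URL to simulate a failed request."""
    if "broken" in url:
        raise ValueError("simulated network error")
    return f"content of {url}"

demo_links = [
    {"type": "About page", "url": "https://example.com/"},
    {"type": "Careers", "url": "https://example.com/careers"},
    {"type": "Careers", "url": "https://example.com/careers/"},
    {"type": "Blog", "url": "https://example.com/broken"},
]
print(collect_pages("https://example.com/", demo_links, fake_fetch))
```

In this notebook it could be called as `collect_pages(url, get_links(url)["links"], lambda u: Website(u).get_contents())`.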

In [15]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'About page', 'url': 'https://huggingface.co/'}, {'type': 'Company page', 'url': 'https://huggingface.co/'}, {'type': 'Careers/Jobs page', 'url': 'https://apply.workable.com/huggingface/'}]}
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Community
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applications.
Explore AI Apps
or
Browse 1M+ models
Trending on
this week
Models
tencent/SRPO
Updated
6 days ago
•
7k
•
862
Alibaba-NLP/Tongyi-DeepResearch-30B-A3B
Updated
4 days ago
•
6.26k
•
478
openbmb/VoxCPM-0.5B
Updated
2 days ago
•
2.19k
•
462
ibm-granite/granite-docling-258M
Updated
3 days ago
•
16.4k
•
425
google/vaultgemma-1b
Updated
9 days ago
•
3.25k
•
355
Browse 1M+ models
Spaces
Running
14k
14k
DeepSite v2
🐳
Generate any application with DeepSeek
Ru

In [16]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."




In [24]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
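
A fixed character slice can cut the prompt in the middle of a line or word. As a minimal sketch of one alternative, the cut could be moved back to the last complete line that fits. `truncate_at_line` is a hypothetical helper, not part of the notebook.

```python
def truncate_at_line(text, limit=5_000):
    """Cut text to at most `limit` characters, ending on a complete line where possible."""
    if len(text) <= limit:
        return text
    cut = text.rfind("\n", 0, limit)  # last newline inside the allowed range
    # Fall back to a hard cut when there is no usable newline
    return text[:cut] if cut > 0 else text[:limit]

print(truncate_at_line("Landing page:\nHugging Face\nModels\nDatasets", limit=30))
```

Note that either form of truncation keeps whatever comes first, so a long landing page can still crowd out the linked pages.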

In [20]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [21]:
create_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/'}, {'type': 'Company page', 'url': 'https://huggingface.co/allenai'}, {'type': 'Company page', 'url': 'https://huggingface.co/facebook'}, {'type': 'Company page', 'url': 'https://huggingface.co/amazon'}, {'type': 'Company page', 'url': 'https://huggingface.co/google'}, {'type': 'Company page', 'url': 'https://huggingface.co/Intel'}, {'type': 'Company page', 'url': 'https://huggingface.co/microsoft'}, {'type': 'Company page', 'url': 'https://huggingface.co/grammarly'}, {'type': 'Company page', 'url': 'https://huggingface.co/Writer'}, {'type': 'about page', 'url': 'https://apply.workable.com/huggingface/'}]}


# Hugging Face Brochure

## About Us

Hugging Face is the AI community building the future through collaboration, innovation, and open-source initiatives. Our platform empowers humans to work together to build, discover, and apply machine learning models, datasets, and applications.

## Company Culture

At Hugging Face, we value:

* Community-driven approach: We believe that everyone can contribute to the advancement of AI.
* Inclusion and diversity: We strive for an inclusive and diverse community of researchers, developers, and users from all backgrounds.
* Innovation: Our team is passionate about exploring new ideas and innovations in natural language processing, deep learning, and other areas of machine learning.

## Customers

Some of our notable customers include:

* Meta AI (AI2)
* Google Cloud
* Intel
* Microsoft Azure
* Amazon Web Services

Together, we're creating a vibrant ecosystem that fosters research, collaboration, and practical applications of artificial intelligence.

## Careers/Jobs

Are you passionate about natural language processing, deep learning, or other areas of machine learning? Check out our available positions on our [careers page](link).

**Some of Our Team Members:**

* non-profit (792 models)
* AI at Meta
* Enterprise company
...and many more!

## Models and Datasets

Browse 1M+ pre-trained models, including:

* Transformers
* Diffusers
* Safetensors
* Hub Python Library
* Tokenizers

Access 250k+ high-quality datasets for any ML task.

## Spaces

Engage with our community-driven platform, where you can collaborate on public models, datasets, and applications. Explore AI apps or create your own!

**Some of Our Popular Spaces:**

* DeepSite v2
* Wan2.2 Animate
* Tokenzier-in 1
+... and many more!

## Enterprise Solutions

Hugging Face provides scalable and secure platform solutions for large organizations. Contact us to learn more about our enterprise offerings.

**Starting at:** $20/user/month (single sign-on) or pricing tiered according to requirements...

## Finally - a minor improvement

With a small adjustment, we can stream the results back from the model through the OpenAI-compatible API,
giving the familiar typewriter animation.

In [22]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    
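    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        # Accumulate the raw text; the final chunk's delta content can be None
        response += chunk.choices[0].delta.content or ''
        # Strip code-fence markers for display only, so the word "markdown" in the prose is kept
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)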
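
To see how the streamed pieces build up without calling a model, the loop can be tried on a hand-written list of chunks. This is a sketch under assumptions: `cleaned_snapshots` is a hypothetical helper, and the chunk list is invented to mimic a reply wrapped in a markdown code fence, with `None` standing in for an empty delta.

```python
def cleaned_snapshots(deltas):
    """Yield the display text after each streamed piece, with code-fence markers removed."""
    raw = ""
    for delta in deltas:
        raw += delta or ""  # an empty delta may arrive as None
        # Clean a copy for display and keep the raw text intact for the next piece
        yield raw.replace("```markdown", "").replace("```", "")

# Invented chunks that mimic a fenced markdown reply
fake_chunks = ["```", "markdown", "\n# Title", None, "\nWritten in markdown.", "\n```"]
for snapshot in cleaned_snapshots(fake_chunks):
    print(repr(snapshot))
```

Cleaning a copy on each step, instead of overwriting the accumulated text, means a fence marker that arrives split across two chunks is still removed, while the word "markdown" in ordinary sentences is left alone.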

In [23]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'About page', 'url': 'https://huggingface.co/'}, {'type': 'Company page', 'url': 'https://huggingface.co/brand'}]}


# Hugging Face Brochure

[Cover Page]
=================
Hugging Face: Empowering the Future of AI

[Company Description]
-----------------------

Hugging Face is a comprehensive platform for machine learning (ML) that fosters collaboration and innovation among the global ML community. Our mission is to democratize access to cutting-edge AI technologies, enabling researchers, developers, and businesses to build and deploy ML models faster.

[What We Do]
---------------

### Models

* Host and collaborate on unlimited public models, datasets, and applications
* Explore 1 million+ pre-trained models across various modalities (text, image, video, audio, and 3D)

### Datasets

* Access and share over 250,000 datasets for any ML tasks
* Discover new datasets with curated collections and browsing features

### Spaces

* Run applications using our optimized inference endpoints or update your existing spaces to a GPU in just a few clicks
* Explore popular spaces, including those generated by DeepSeek and Wan2.2

[Our Community]
------------------

Hugging Face has become the hub for AI enthusiasts worldwide. With over 50,000 organizations relying on our platform, we provide:

### Resources

* Transformations: State-of-the-art AI models and tools for PyTorch developers
* Diffusers: State-of-the-art diffusion models in PyTorch
* Tokenizers: Fast and optimized tokenizers for research and production

### Careers and Opportunities**

We are seeking talented individuals to join our team. Explore current job openings and submit your applications today!

[Get Involved]
-----------------

Join the Hugging Face community and start building, discovering, and collaborating on ML projects. Sign up now, explore our resources, and start accelerating your AI journey.

[Call-to-Action]
-------------------------------

Sign Up | Explore Models & Datasets | Start Building