# Web Scraping and Brochure Generation

## Overview
This Jupyter notebook demonstrates an automated process for generating company brochures using web scraping and AI-powered content generation. It works in three stages:

1. Web Scraping: Retrieve and parse website content using requests and BeautifulSoup
2. Link Extraction: Automatically identify and filter relevant website links
3. AI-Powered Content Generation: Use OpenAI's GPT models to create engaging, professional brochures

## Key Features

- Scrape website content dynamically
- Extract and filter website links
- Generate markdown-formatted brochures
- Stream brochure content in real-time
- Customize brochure generation with system prompts

## Technical Components

### Libraries:

- Web Scraping: requests, BeautifulSoup
- Environment Management: python-dotenv
- AI Interaction: openai

### Model and Output:

- AI Model: GPT-4o-mini
- Output Format: Markdown

## Usage
To generate a brochure for a company:

- Provide the company's name and website URL
- Call stream_brochure(company_name, url)
- Watch the brochure stream into the notebook as the model generates it

## Prerequisites

- Activated Python environment
- OpenAI API key, stored as OPENAI_API_KEY in a .env file
- Required Python libraries installed

In [47]:
# imports
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [48]:
# Initialize and constants
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key. Please visit the troubleshooting notebook!")
    
MODEL = 'gpt-4o-mini'
openai = OpenAI()

API key looks good so far


In [49]:
# A class to represent a Webpage
# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers, timeout=10)  # timeout in seconds, so an unresponsive site cannot hang the cell
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [50]:
vlf = Website("https://villefranche-sur-mer.fr")
print(vlf.get_contents())
vlf.links

Webpage Title:
Villefranche-sur-Mer – Site de la commune de Villefranche-sur-Mer
Webpage Contents:
Villefranche-sur-Mer
Site de la commune de Villefranche-sur-Mer
Accueil
Mairie
Conseil municipal
Ordre du jour du conseil municipal
Procès Verbaux
Documents budgétaires
Arrêtés municipaux
Actes règlementaires
Compte-Rendu et liste des délibérations de Conseils Municipaux 2020 – 2025
Les fonctions du maire
Les élus municipaux
Services municipaux
Organigramme des services
Services administratifs
Services techniques
Le service des mouillages
Marchés Publics & Concessions
Enquêtes Publiques
Démarches administratives
Changement d’usage
Abattage d’arbre ou défrichement
Ouverture de l’Espace France Services
Etat-civil
Démarches en ligne
Communication / presse
Magazines Municipaux
Villefranche News
La Ville s’engage en faveur de l’environnement !
Villefranche-sur-Mer et sa charte de l’arbre
Logo & charte graphique
Presse : communiqués & agendas
Présentation service Urbanisme
Service urbanisme
PLU

['https://villefranche-sur-mer.fr',
 'https://villefranche-sur-mer.fr/',
 'https://villefranche-sur-mer.fr/mairie/',
 'https://villefranche-sur-mer.fr/conseil-municipal/',
 'https://villefranche-sur-mer.fr/deliberations-du-conseil-municipal/',
 'https://villefranche-sur-mer.fr/comptes-rendus-des-conseils-municipaux/',
 'https://villefranche-sur-mer.fr/documents-budgetaires/',
 'https://villefranche-sur-mer.fr/actes-reglementaires/',
 'https://villefranche-sur-mer.fr/actes-reglementaires-2/',
 'https://villefranche-sur-mer.fr/deliberations-2/',
 'https://villefranche-sur-mer.fr/le-maire/',
 'https://villefranche-sur-mer.fr/les-elus-municipaux/',
 'https://villefranche-sur-mer.fr/services-municipaux/',
 'https://villefranche-sur-mer.fr/organigramme-des-services/',
 'https://villefranche-sur-mer.fr/services-administratifs/',
 'https://villefranche-sur-mer.fr/services-techniques/',
 'https://villefranche-sur-mer.fr/le-service-des-mouillages/',
 'https://villefranche-sur-mer.fr/marches-publ

In [51]:
# Generate a system prompt for link relevance selection
link_system_prompt = "You are provided with a list of links found on a webpage. \
You can decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in these examples:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""
link_system_prompt += """
{
    "links": [
        {"type": "contact page", "url": "https://full.url/goes/here/contact"},
        {"type": "social links page", "url": "https://another.full.url/social"}
    ]
}
"""

In [52]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You can decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, a Company page, or Careers/Jobs pages.
You should respond in JSON as in these examples:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}

{
    "links": [
        {"type": "contact page", "url": "https://full.url/goes/here/contact"},
        {"type": "social links page", "url": "https://another.full.url/social"}
    ]
}



In [53]:
# Construct a user prompt with website links for AI processing
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [54]:
print(get_links_user_prompt(vlf))

Here is the list of links on the website of https://villefranche-sur-mer.fr - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://villefranche-sur-mer.fr
https://villefranche-sur-mer.fr/
https://villefranche-sur-mer.fr/mairie/
https://villefranche-sur-mer.fr/conseil-municipal/
https://villefranche-sur-mer.fr/deliberations-du-conseil-municipal/
https://villefranche-sur-mer.fr/comptes-rendus-des-conseils-municipaux/
https://villefranche-sur-mer.fr/documents-budgetaires/
https://villefranche-sur-mer.fr/actes-reglementaires/
https://villefranche-sur-mer.fr/actes-reglementaires-2/
https://villefranche-sur-mer.fr/deliberations-2/
https://villefranche-sur-mer.fr/le-maire/
https://villefranche-sur-mer.fr/les-elus-municipaux/
https://villefranche-sur-mer.fr/services-municipaux/
https://villefranche-sur-mer.fr/organ

In [55]:
# Retrieve and filter relevant links from a website using OpenAI
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

In [56]:
huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/posts',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/deepseek-ai/DeepSeek-R1',
 '/hexgrad/Kokoro-82M',
 '/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
 '/deepseek-ai/DeepSeek-R1-Zero',
 '/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
 '/models',
 '/spaces/hexgrad/Kokoro-TTS',
 '/spaces/tencent/Hunyuan3D-2',
 '/spaces/lllyasviel/iclight-v2',
 '/spaces/JeffreyXiang/TRELLIS',
 '/spaces/webml-community/deepseek-r1-webgpu',
 '/spaces',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/HumanLLMs/Human-Like-DPO-Dataset',
 '/datasets/bespokelabs/Bespoke-Stratos-17k',
 '/datasets/cais/hle',
 '/datasets/yale-nlp/MMVU',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/grammarly',
 '/Writer',
 '/docs/transformers',
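The list above contains relative paths, repeated entries, and fragment variants of the same page. One possible way to tidy such a list before sending it to the model is sketched below using the standard library. The helper name normalize_links and the sample list are assumptions made for illustration; they are not part of the notebook.

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(base_url, links):
    """Resolve relative links against base_url and drop duplicates, keeping order.

    Hypothetical helper, not part of the original notebook.
    """
    seen = set()
    result = []
    for link in links:
        # Skip non-web schemes and pure in-page anchors
        if link.startswith(("mailto:", "javascript:", "tel:", "#")):
            continue
        # Make the link absolute, then discard any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, link))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample = ["/", "/models", "/pricing#endpoints", "/pricing", "/models", "mailto:hi@example.com"]
print(normalize_links("https://huggingface.co", sample))
```

Shortening the list this way would also reduce the number of tokens in the link-selection prompt.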

In [57]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'},
  {'type': 'blog page', 'url': 'https://huggingface.co/blog'},
  {'type': 'docs page', 'url': 'https://huggingface.co/docs'},
  {'type': 'social links page', 'url': 'https://twitter.com/huggingface'},
  {'type': 'social links page',
   'url': 'https://www.linkedin.com/company/huggingface/'}]}
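The model's reply is trusted as-is by get_links. A minimal sketch of a defensive parser follows, assuming we only want entries that carry a full https URL. The function name parse_links and the fallback type "page" are illustrative assumptions.

```python
import json

def parse_links(raw):
    """Parse a JSON reply and keep only well-formed https link entries.

    Hypothetical helper; returns a plain list of {"type", "url"} dicts.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    links = data.get("links", []) if isinstance(data, dict) else []
    if not isinstance(links, list):
        return []
    valid = []
    for entry in links:
        if not isinstance(entry, dict):
            continue
        url = entry.get("url")
        # Drop relative or non-https URLs, which Website() could not fetch reliably
        if isinstance(url, str) and url.startswith("https://"):
            valid.append({"type": str(entry.get("type", "page")), "url": url})
    return valid

reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"type": "bad", "url": "/relative"}]}'
print(parse_links(reply))
```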

In [58]:
# Collect comprehensive details from a website's landing page and relevant links
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [59]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'contact page', 'url': 'https://discuss.huggingface.co'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}]}
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applications.
Trending on
this week
Models
deepseek-ai/DeepSeek-R1
Updated
1 day ago
•
69.6k
•
2.21k
hexgrad/Kokoro-82M
Updated
2 days ago
•
33.9k
•
2.32k
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Updated
1 day ago
•
75.9k
•
429
deepseek-ai/DeepSeek-R1-Zer

In [60]:
# Personalize the tone, design, and sections of your brochure here
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short, humorous, entertaining, jokey, and professional brochure about the company for prospective customers, investors, and recruits. Respond in markdown. \
Include details of company culture, customers, and careers/jobs if you have the information. Include at the end of the brochure all the relevant links with their correct paths."

In [61]:
# Prepare a user prompt with company website details for brochure generation
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
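The slice in get_brochure_user_prompt cuts at exactly 5,000 characters, which can split a line or a word. A possible variant that cuts at the last line break before the limit is sketched here. The name truncate_at_line is an assumption, and the character limit remains only a rough proxy for a token budget.

```python
def truncate_at_line(text, limit=5_000):
    """Truncate text to at most `limit` characters, preferring a line boundary.

    Hypothetical helper, not part of the original notebook.
    """
    if len(text) <= limit:
        return text
    # Look for the last newline that falls inside the allowed range
    cut = text.rfind("\n", 0, limit)
    # With no usable newline, fall back to a hard cut at the limit
    return text[:cut] if cut > 0 else text[:limit]

print(truncate_at_line("abc\ndefgh\nij", 7))
```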

In [62]:
get_brochure_user_prompt("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/brand'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'contact page', 'url': 'https://huggingface.co/chat'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


'You are looking at a company called: HuggingFace\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nHugging Face – The AI community building the future.\nWebpage Contents:\nHugging Face\nModels\nDatasets\nSpaces\nPosts\nDocs\nEnterprise\nPricing\nLog In\nSign Up\nThe AI community building the future.\nThe platform where the machine learning community collaborates on models, datasets, and applications.\nTrending on\nthis week\nModels\ndeepseek-ai/DeepSeek-R1\nUpdated\n1 day ago\n•\n69.6k\n•\n2.21k\nhexgrad/Kokoro-82M\nUpdated\n2 days ago\n•\n33.9k\n•\n2.32k\ndeepseek-ai/DeepSeek-R1-Distill-Qwen-32B\nUpdated\n1 day ago\n•\n75.9k\n•\n429\ndeepseek-ai/DeepSeek-R1-Zero\nUpdated\n1 day ago\n•\n4.48k\n•\n395\ndeepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\nUpdated\n1 day ago\n•\n70.1k\n•\n328\nBrowse 400k+ models\nSpaces\nRunning\non\nZero\n1.47k\n❤️\nKokoro TTS\nNow in 5 l

In [63]:
# With stream=True, the results stream back from OpenAI chunk by chunk
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
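        response += chunk.choices[0].delta.content or ''
        # Clean a copy of the accumulated text: remove code fences and the fence's "markdown" tag,
        # but leave the word "markdown" intact when it appears in the brochure itself
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)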
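The accumulate-and-clean loop can be tried without calling the API. The sketch below simulates stream deltas with a plain list and removes only a wrapping fence line, so fences inside the text are left alone. The helper strip_outer_fence and the sample deltas are assumptions for illustration; a real stream yields chunk.choices[0].delta.content instead.

```python
FENCE = "`" * 3  # built this way to avoid a literal fence inside this example

def strip_outer_fence(text):
    """Remove a leading fence line (with optional language tag) and a trailing fence line."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)

# Simulated deltas; the final None mimics a chunk with no content
deltas = [FENCE + "mark", "down\n# Title\n", "Uses markdown tables.\n", FENCE, None]

response = ""
cleaned = ""
for delta in deltas:
    response += delta or ""
    # Re-clean the whole accumulated text, since a fence may arrive split across chunks
    cleaned = strip_outer_fence(response)

print(cleaned)
```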

In [64]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'documentation page', 'url': 'https://huggingface.co/docs'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}]}


# Welcome to Hugging Face: Where AI Gets a Hug!

---

## Who Are We?

At **Hugging Face**, we're not just another tech company; we're a vibrant community dedicated to shaping the future of Artificial Intelligence. Think of us as the cozy library of machine learning where everyone is invited to create, collaborate, and share. And yes, we're all about the hugs—just not the kind that makes you awkward at parties!

---

## What Do We Do?

**Models, Datasets, and Spaces** - Oh my! 

- **400k+ Models:** From deep learning leviathans to the tiny models that might just blow your mind (in a good way).
- **100k+ Datasets:** Because we believe there's a dataset out there for every bizarre niche you can think of. Seriously, you haven’t lived until you've analyzed the sentiments of cat memes!
- **150k+ Applications:** Whether you're here to create, collaborate, or just see what all this fuss is about, we've got an application for you.

---

## Our Customers

Join the ranks of **over 50,000 organizations**, including heavyweights like Google, Microsoft, and AWS. Even the folks at Grammarly prefer us for their ML needs (yes, those venerated grammarians trust us with their spelling)!

---

## The Hugging Face Culture

Picture this: a collaborative haven where everyone’s ideas matter. We value innovation and creativity while maintaining a fun atmosphere—because who said serious business can't involve a few laughs (or a pun or two)? Our employees are as diverse and creative as our datasets, and we're always looking to expand the family!

---

## Come Work With Us!

Are you ready to help build the future? We’re on the lookout for trailblazers in AI who aren't afraid to roll up their sleeves and get their hands dirty (virtually, of course). Check out our **careers page** for opportunities that will turn your workplace into a launching pad for your innovative dreams.

---

## Why Choose Hugging Face? 

Because saving the world through AI is better with friends (and hugs)! We’ll equip you with the tools to accelerate your models and projects, and throw in some community vibes for free. 

---

## Dive In!

- [Explore Models](https://huggingface.co/models)
- [Datasets Galore](https://huggingface.co/datasets)
- [Launch Your Space](https://huggingface.co/spaces)
- [Join the Community](https://huggingface.co/community)
- [Careers at Hugging Face](https://huggingface.co/jobs)


---

So, what are you waiting for? Embrace the AI revolution—let’s build the future together, one hug at a time!

---

**Hugging Face - Not just a name, a mission!**

In [66]:
stream_brochure("Villefranche-sur-Mer", "https://villefranche-sur-mer.fr")

Found links: {'links': [{'type': 'about page', 'url': 'https://villefranche-sur-mer.fr/mairie/'}, {'type': 'services page', 'url': 'https://villefranche-sur-mer.fr/services-municipaux/'}, {'type': 'council page', 'url': 'https://villefranche-sur-mer.fr/conseil-municipal/'}, {'type': 'local news page', 'url': 'https://villefranche-sur-mer.fr/villefranche-news/'}, {'type': 'events page', 'url': 'https://villefranche-sur-mer.fr/events/'}, {'type': 'contact page', 'url': 'https://villefranche-sur-mer.fr/nous-contacter/'}, {'type': 'careers page', 'url': 'https://villefranche-sur-mer.fr/la-ville-recherche-une-controleurse-de-travaux-et-des-infractions-en-urbanisme-h-f/'}]}


ConnectionError: HTTPSConnectionPool(host='villefranche-sur-mer.fr', port=443): Max retries exceeded with url: /villefranche-news/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x113009c90>: Failed to establish a new connection: [Errno 12] Cannot allocate memory'))
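The traceback above shows a single failed page aborting the whole brochure. One way to make the collection step tolerant is sketched below, assuming it is acceptable to skip pages that cannot be fetched. The function collect_pages and the stand-in fake_fetch are hypothetical; in the notebook the fetcher would be something like `lambda url: Website(url).get_contents()`. Exceptions raised by requests derive from OSError, so catching OSError covers them.

```python
def collect_pages(links, fetch):
    """Fetch each link with `fetch`, skipping pages that fail instead of aborting.

    Returns the combined text and a list of (url, error message) pairs.
    """
    sections = []
    failures = []
    for link in links:
        try:
            sections.append(f"\n\n{link['type']}\n{fetch(link['url'])}")
        except OSError as error:  # includes ConnectionError and requests' exceptions
            failures.append((link["url"], str(error)))
    return "".join(sections), failures

def fake_fetch(url):
    # Stand-in fetcher that fails for one URL on purpose
    if url.endswith("/news/"):
        raise ConnectionError("simulated network failure")
    return f"Contents of {url}\n"

links = [
    {"type": "about page", "url": "https://example.com/about/"},
    {"type": "news page", "url": "https://example.com/news/"},
]
text, failures = collect_pages(links, fake_fetch)
print(text)
print("Skipped:", failures)
```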