# Web Scraping and Brochure Generation

## Overview
This Jupyter notebook demonstrates an automated process for generating company brochures using web scraping and AI-powered content generation. It works in three stages:

1. Web Scraping: Retrieve and parse website content using requests and BeautifulSoup
2. Link Extraction: Ask a model to identify and filter the links that are relevant to a brochure
3. AI-Powered Content Generation: Use GPT, Claude, Gemini, or DeepSeek models to create engaging, professional brochures

## Key Features

- Scrape website content dynamically
- Extract and filter website links
- Generate markdown-formatted brochures
- Stream brochure content in real time
- Customize brochure generation with system prompts

## Technical Components

### Libraries:

- Web Scraping: requests, BeautifulSoup
- Environment Management: python-dotenv
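- AI Interaction: openai, anthropic, google-generativeai
- User Interface: gradio
- AI Models: gpt-4o-mini, claude-3-haiku, gemini-1.5-flash, deepseek-chat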
- Output Format: Markdown

## Usage
To generate a brochure for a company:

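- Run the final cell to launch the Gradio interface, which calls generate_brochure(url, model)
- Enter the company's website URL and choose a model (GPT, Claude, Gemini, or DeepSeek)
- Watch the brochure stream into the output panel as it is generated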

## Prerequisites

- Activated Python environment
- API keys for the providers you plan to use (OpenAI, Anthropic, Google Gemini, DeepSeek), stored in a .env file
- Required Python libraries installed

In [221]:
# imports
import os
import requests
import json
import anthropic
import google.generativeai
import gradio as gr
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [222]:
# Load environment variables from a file called .env
# Print the key prefixes to help with any debugging
load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GEMINI_API_KEY')
deepseek_api_key = os.getenv("DEEPSEEK_API_KEY")

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:8]}")
else:
    print("Google API Key not set")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:5]}")
else:
    print("DeepSeek API Key not set")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AIzaSyA_
DeepSeek API Key exists and begins sk-b3


In [223]:
# Connect to OpenAI, Anthropic, Google, and DeepSeek

openai = OpenAI()

claude = anthropic.Anthropic()

# Pass the key explicitly, since it was loaded from GEMINI_API_KEY above
google.generativeai.configure(api_key=google_api_key)

deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com")

In [224]:
# A class to represent a Webpage
# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [225]:
# Generate a system prompt for link relevance selection
link_system_prompt = "You are provided with a list of links found on a webpage. \
You can decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in these examples:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""
link_system_prompt += """
{
    "links": [
        {"type": "contact page", "url": "https://full.url/goes/here/contact"},
        {"type": "social links page", "url": "https://another.full.url/social"}
    ]
}
"""

In [226]:
# Construct a user prompt with website links for AI processing
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, and respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, or email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt
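
The prompt above warns the model that some links may be relative. A minimal sketch of resolving them in code before they are sent, using the standard library's `urljoin`; the helper name `absolutize_links` and its filtering rules are assumptions for illustration and are not part of the notebook.

```python
from urllib.parse import urljoin, urlparse

def absolutize_links(base_url, links):
    """Hypothetical helper: resolve hrefs against base_url and keep only web links."""
    resolved = []
    for href in links:
        absolute = urljoin(base_url, href)
        # Keep http(s) targets only; this skips mailto:, tel:, and javascript: hrefs
        if urlparse(absolute).scheme in ("http", "https") and absolute not in resolved:
            resolved.append(absolute)
    return resolved

sample = ["/about", "careers", "mailto:hi@example.com", "https://example.com/about"]
print(absolutize_links("https://example.com/", sample))
```

With links already absolute and de-duplicated, the model only has to choose among them rather than reconstruct URLs.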

In [None]:
# Retrieve and filter relevant links from a website using OpenAI
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
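
`get_links` trusts the reply to be valid JSON with the expected shape. A minimal sketch of a more defensive parser, assuming the `{"links": [{"type": ..., "url": ...}]}` structure requested in the system prompt; `parse_links_reply` is a hypothetical helper name, not something the notebook defines.

```python
import json

def parse_links_reply(raw):
    """Hypothetical helper: return a list of {"type", "url"} dicts, or [] if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    links = data.get("links", [])
    if not isinstance(links, list):
        return []
    valid = []
    for entry in links:
        # Keep only entries that carry a string URL starting with http or https
        if isinstance(entry, dict) and isinstance(entry.get("url"), str) and entry["url"].startswith("http"):
            valid.append({"type": str(entry.get("type", "page")), "url": entry["url"]})
    return valid

reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}]}'
print(parse_links_reply(reply))
```

Returning an empty list on bad input lets the caller fall back to a brochure built from the landing page alone.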

In [None]:
huggingface = Website("https://huggingface.co")

In [None]:
get_links("https://huggingface.co")

In [None]:
# Collect comprehensive details from a website's landing page and relevant links
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [None]:
# Here you can personalize the tone, design, and sections... of your brochure
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short humorous, entertaining, jokey, and professional brochure about the company for prospective customers, investors, and recruits. Respond in markdown. \
Include details of company culture, customers, and careers/jobs if you have the information. Include at the end of the brochure all the relevant links with their correct paths."

In [None]:
# Prepare a user prompt with company website details for brochure generation
def get_brochure_user_prompt(url):
    user_prompt = f"You are looking at a company whose website is: {url}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
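
The slice above keeps only the first 5,000 characters, so a long landing page can crowd out every linked page. One possible alternative, sketched under the assumption that the page texts are available as separate strings, is to give each page an equal share of the budget; `truncate_sections` is a hypothetical helper.

```python
def truncate_sections(sections, total_budget=5_000):
    """Hypothetical helper: cut each section to an equal share of total_budget characters."""
    if not sections:
        return ""
    share = total_budget // len(sections)
    return "".join(section[:share] for section in sections)

pages = ["Landing page text " * 500, "About page text " * 500, "Careers page text"]
combined = truncate_sections(pages)
print(len(combined))
```

Short pages do not use their full share here, so the result can be shorter than the budget; redistributing the unused characters would be a further refinement.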

In [None]:
# With stream=True, results stream back from OpenAI; yield the accumulated text after each chunk
def stream_brochure_gpt(url):
    stream = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(url)}
        ],
        stream=True
    )
    
    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        yield response
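
The generator above yields the text accumulated so far rather than only the newest fragment, which suits an output component that re-renders the whole brochure on every update. A self-contained sketch of that pattern, with a plain list standing in for the API stream; the `accumulate` name and the `fake_stream` data are illustrative assumptions.

```python
def accumulate(fragments):
    """Yield the running concatenation of text fragments, treating None as empty."""
    text = ""
    for fragment in fragments:
        text += fragment or ""
        yield text

# Stand-in for streamed deltas; a real stream may include chunks with no content
fake_stream = ["# Acme", None, "\nWe make", " anvils."]
for snapshot in accumulate(fake_stream):
    print(repr(snapshot))
```

The `or ""` guard mirrors the handling of empty deltas in the notebook's streaming functions.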

In [None]:
def stream_brochure_claude(url): 
    result = claude.messages.stream(
        model="claude-3-haiku-20240307",
        max_tokens=1000,
        temperature=0.7,
        system=system_prompt,
        messages=[
            {"role": "user", "content": get_brochure_user_prompt(url)},
        ],
    )
    response = ""
    with result as stream:
        for text in stream.text_stream:
            response += text or ""
            yield response

In [None]:
def stream_brochure_gemini(url):
    gemini = google.generativeai.GenerativeModel(
        model_name='gemini-1.5-flash',
        system_instruction=system_prompt
    )
    response = gemini.generate_content(get_brochure_user_prompt(url), stream=True)
    text_response = ""
    for chunk in response:
        text_response += chunk.text
        # Yield inside the loop so the brochure streams as chunks arrive
        yield text_response

In [None]:
def stream_brochure_deepseek(url):
    response = deepseek.chat.completions.create(
        model="deepseek-chat", # deepseek-chat or deepseek-reasoner 
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(url)},
        ],
        temperature=1.5,
        stream=True
    )
    result = ""
    for chunk in response:
        result += chunk.choices[0].delta.content or ""
        yield result

In [None]:
def generate_brochure(url, model):
    # Dispatch to the streaming generator for the selected model
    if model=="GPT":
        result = stream_brochure_gpt(url)
    elif model=="Claude":
        result = stream_brochure_claude(url)
    elif model=="DeepSeek":
        result = stream_brochure_deepseek(url)
    elif model=="Gemini":
        result = stream_brochure_gemini(url)
    else:
        raise ValueError("Unknown model")
    yield from result
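
The if/elif chain above can also be written as a lookup table, which keeps the model names and their generators in one place. A minimal sketch with a stub generator standing in for the real streaming functions; `make_dispatcher` and `stub_gpt` are hypothetical names.

```python
def make_dispatcher(streamers):
    """Hypothetical helper: build a generate(url, model) generator from a name-to-function map."""
    def generate(url, model):
        try:
            streamer = streamers[model]
        except KeyError:
            raise ValueError(f"Unknown model: {model!r}") from None
        yield from streamer(url)
    return generate

def stub_gpt(url):
    # Stand-in for a real streaming function such as stream_brochure_gpt
    yield f"GPT brochure for {url}"

generate = make_dispatcher({"GPT": stub_gpt})
print(list(generate("https://example.com", "GPT")))
```

Because `generate` is itself a generator, the `ValueError` for an unknown model is raised when the first value is requested, not when the function is called.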

In [None]:
# Define this variable and then pass js=force_dark_mode when creating the Interface to force dark mode
force_dark_mode = """
function refresh() {
    const url = new URL(window.location);
    if (url.searchParams.get('__theme') !== 'dark') {
        url.searchParams.set('__theme', 'dark');
        window.location.href = url.href;
    }
}
"""

In [None]:
gr.Interface(
    fn=generate_brochure, 
    inputs=[gr.Textbox(label="Link", placeholder="https://github.com/antomarchim"), gr.Dropdown(["GPT", "Claude", "Gemini", "DeepSeek"])],
    outputs=[gr.Markdown(label="Brochure")], 
    flagging_mode="never", 
    js=force_dark_mode
).launch(share=True)