# Brochure Generator (multi-language) Output
This Python Jupyter Notebook automates bilingual brochure generation: it scrapes a company website and uses AI to produce professional marketing material in English and a chosen target language. It combines BeautifulSoup web scraping to extract content, OpenAI's GPT models to select the relevant pages and write compelling copy, and streamed translation into languages such as Spanish, German, or French. Through simple configuration variables, users specify the company name, website URL, and target language, making it an adaptable tool for creating brochures for customers, investors, and potential recruits.

## 1. Import Libraries
Key imports for web scraping and AI content generation: os/dotenv for environment variables, requests/BeautifulSoup for web scraping, json/typing for data handling, IPython.display for markdown output, and openai for calling the model.

In [None]:
# Imports
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

## 2. Define Constants
Loads environment variables and checks the OpenAI API key (that it starts with 'sk-proj-' and is longer than 10 characters), then sets the model name to gpt-4o-mini and creates the OpenAI client. A printed message points users to the troubleshooting notebook if the key looks wrong.

In [None]:
# Initialize & constants
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key) > 10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key. Please visit the troubleshooting notebook!")

MODEL = 'gpt-4o-mini'
openai = OpenAI()
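
The key check above can be factored into a small helper so it is easy to test without calling the API. This is a minimal sketch: the function name `check_api_key` is hypothetical, and the prefix and length threshold simply mirror the cell above rather than any official key specification.

```python
def check_api_key(api_key, prefix="sk-proj-", min_length=10):
    """Return True if the key passes the same basic checks as the cell above."""
    return bool(api_key) and api_key.startswith(prefix) and len(api_key) > min_length


# Placeholder values only; neither is a real key
print(check_api_key("sk-proj-example-not-a-real-key"))
print(check_api_key(None))
```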

## 3. Web Scraping Class
The Website class fetches a web page and extracts its title, text content, and links. Its get_contents() method formats the title and text for brochure generation.

In [None]:
class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """
    url: str
    title: str
    body: bytes
    links: List[str]
    text: str

    def __init__(self, url):
        self.url = url
        response = requests.get(url, timeout=30)  # avoid hanging indefinitely on unresponsive sites
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"
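
The `links` attribute holds `href` values exactly as they appear in the page, so some may be relative paths or non-web links such as `mailto:`. A possible pre-processing step, not part of the notebook, is to resolve them against the page URL with the standard library's `urljoin`. The helper name `absolutize_links` and the example URLs are assumptions for illustration.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def absolutize_links(base_url: str, links: List[str]) -> List[str]:
    """Resolve hrefs against base_url and keep unique http(s) URLs, in order."""
    seen = set()
    result = []
    for href in links:
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, etc.
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


print(absolutize_links(
    "https://example.com/",
    ["/about", "careers", "mailto:hi@example.com", "https://example.com/about"],
))
# ['https://example.com/about', 'https://example.com/careers']
```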

## 4. Filter Links and Scrape Relevant Pages
The model selects the company website links most relevant to a brochure (About, Company, and Careers/Jobs pages). The get_all_details() function then scrapes these pages along with the landing page and combines them into a single text block about the company.

In [None]:
# One-shot prompt (a single JSON example) to clean up the links
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

def get_links_user_prompt(website):
    """Generate user prompt for extracting relevant links from website"""
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "Please decide which of these are relevant web links for creating a brochure about a company, and respond with the full https URL in JSON format.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

def get_links(url):
    """Extract relevant links from a website using OpenAI"""
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

def get_all_details(url):
    """Scrape and compile content from main page and relevant linked pages"""
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
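
`get_all_details()` assumes the model's reply always contains a `links` list whose items have `type` and `url` keys. A hedged sketch of a more defensive parser follows. `parse_link_response` is a hypothetical helper, not part of the notebook, and it simply drops malformed or non-absolute entries instead of raising.

```python
import json
from typing import Dict, List


def parse_link_response(raw: str) -> List[Dict[str, str]]:
    """Parse the model's JSON reply and keep only well-formed link entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    links = data.get("links", [])
    if not isinstance(links, list):
        return []
    cleaned = []
    for item in links:
        if not isinstance(item, dict):
            continue
        url = item.get("url")
        # Keep only absolute http(s) URLs, which Website() can fetch directly
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            cleaned.append({"type": str(item.get("type", "page")), "url": url})
    return cleaned


reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"type": "bad", "url": "/relative"}]}'
print(parse_link_response(reply))
# [{'type': 'about page', 'url': 'https://example.com/about'}]
```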

## 5. Bilingual Generation Functions
These functions generate bilingual company brochures from website content. stream_brochure_bilingual() streams the English brochure and then its translation, while translate_brochure() handles the translation step.

In [None]:
# System prompt for brochure creation
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers, careers/jobs, and what makes the company unique, if you have that information."

def get_brochure_user_prompt(company_name, url):
    """Generate user prompt for brochure creation"""
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:20_000]  # Truncate if more than 20,000 characters
    return user_prompt

def translate_brochure(brochure_content, language="Spanish"):
    """
    Translate brochure content to specified language with streaming output
    Returns the translated content as a string
    """
    translation_system_prompt = (
        f"You are a skilled translator. Translate the following brochure text into {language}. "
        f"Make sure to translate into idiomatic {language}, matching the target language's natural structure, wording and expressions, so it can't be recognised as a translation. "
        "Also use an appropriate tone, e.g. good marketing language in other languages will probably be a bit less boastful than in English. "
        "Output the translated brochure in Markdown format."
    )
    
    # Create streaming response for translation
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": translation_system_prompt}, 
            {"role": "user", "content": brochure_content}
        ],
        stream=True
    )
    
    # Collect and display translated content with streaming
    translated_response = ""
    print(f"\n\n## {language} Translation\n")
    translation_display = display(Markdown(""), display_id=True)
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            translated_response += chunk.choices[0].delta.content
            # Strip code-fence markers for display only; the returned text is left unchanged
            clean_response = translated_response.replace("```markdown", "").replace("```", "")
            update_display(Markdown(clean_response), display_id=translation_display.display_id)
    
    return translated_response

def stream_brochure_bilingual(company_name, url, translation_language="Spanish"):
    """
    Generate and stream brochure in English, then translate and stream in specified language
    """
    print(f"## 🏢 Creating brochure for {company_name}")
    print("## 🇺🇸 English Version\n")
    
    # Generate English brochure with streaming
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    
    # Collect English response while streaming
    english_response = ""
    english_display = display(Markdown(""), display_id=True)
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            english_response += chunk.choices[0].delta.content
            # Strip code-fence markers for display only; the returned text is left unchanged
            clean_response = english_response.replace("```markdown", "").replace("```", "")
            update_display(Markdown(clean_response), display_id=english_display.display_id)
    
    # Translate the complete English brochure
    translated_response = translate_brochure(english_response, translation_language)
    
    return {
        "english": english_response,
        "translated": translated_response,
        "language": translation_language
    }

# Legacy function for backward compatibility
def stream_brochure(company_name, url):
    """
    Original function - now calls the bilingual version (which also streams the default Spanish translation) but returns only the English text
    Kept for backward compatibility
    """
    result = stream_brochure_bilingual(company_name, url)
    return result["english"]
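
Both streaming loops follow the same pattern: append each chunk's `delta.content` to a running string and re-render it. The sketch below isolates that accumulation step using stand-in chunk objects built with `SimpleNamespace`, so it runs without an API call. `collect_stream` and `fake_chunk` are hypothetical names, and the stand-ins only mimic the attributes that the loops above read.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Build a stand-in with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks):
    """Concatenate the text deltas from an iterable of chunk-like objects."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # tolerate chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty deltas
            parts.append(content)
    return "".join(parts)


stream = [fake_chunk("# Acme"), fake_chunk(None), fake_chunk("\nWe build things.")]
print(collect_stream(stream))
```

Collecting the parts in a list and joining once avoids repeated string concatenation, although for brochure-sized outputs either approach should be adequate.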

## 6. Script Settings
The user specifies the company name, website URL, and target language for bilingual brochure generation. When executed, the cell scrapes the company's website and produces an English brochure plus a translation in the chosen language.

In [None]:
# Output is English plus TRANSLATION_LANGUAGE (French here); change to your preferred language
COMPANY_NAME = "Huggingface"
COMPANY_URL = "https://huggingface.co/"
TRANSLATION_LANGUAGE = "French"  # Options: German, Spanish, French, Italian, Portuguese, etc.

# Runs the brochure generator
if __name__ == "__main__":
    stream_brochure_bilingual(COMPANY_NAME, COMPANY_URL, TRANSLATION_LANGUAGE)
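
`stream_brochure_bilingual()` returns a dictionary with `english`, `translated`, and `language` keys. One way to keep the results, sketched under the assumption that plain Markdown files are wanted, is to write each version to disk. `save_brochures` and its file-naming scheme are hypothetical and not part of the notebook.

```python
from pathlib import Path


def save_brochures(result, company_name, out_dir="."):
    """Write the English and translated brochures to Markdown files; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    slug = company_name.lower().replace(" ", "_")
    paths = {
        "english": out / f"{slug}_english.md",
        "translated": out / f"{slug}_{result['language'].lower()}.md",
    }
    paths["english"].write_text(result["english"], encoding="utf-8")
    paths["translated"].write_text(result["translated"], encoding="utf-8")
    return paths
```

It could be called as `save_brochures(result, COMPANY_NAME)`, where `result` is the dictionary returned by `stream_brochure_bilingual()`.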