# 📄 Brochure Generator

An AI-powered tool that automatically creates professional company brochures by analyzing websites and extracting relevant content using advanced web scraping and natural language processing.

## 📋 Overview

This project intelligently crawls company websites, identifies and extracts relevant content from multiple pages (About, Products, Contact, etc.), and generates polished marketing brochures using AI language models. The tool leverages OpenAI's GPT-4o-mini to create engaging, professional content suitable for potential customers, investors, and job seekers.

## ✨ Key Features

- **🤖 AI-Powered Content Generation**: Uses OpenAI's `gpt-4o-mini` for intelligent brochure writing
- **🕷️ Smart Web Crawling**: Automatically identifies and crawls relevant company website pages
- **🎯 Intelligent Link Selection**: Uses AI to filter and select only brochure-relevant links (About, Products, Contact, etc.)
- **📝 Professional Output**: Generates clean, marketing-ready content in Markdown format
- **🔗 Multi-Page Processing**: Combines content from multiple website pages for comprehensive brochures
- **🚀 Automated Workflow**: End-to-end automation from URL input to finished brochure

## 🛠️ Technology Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **AI Model** | OpenAI GPT-4o-mini | Content generation and link analysis |
| **Web Scraping** | BeautifulSoup + Requests | Website content extraction |
| **JSON Processing** | Python JSON | AI response parsing |
| **Content Processing** | Custom classes | Website data management |
| **Output Format** | Markdown | Professional document formatting |
| **Language** | Python | Core development |

## 🚀 Installation Requirements

### Python Dependencies
```bash
pip install requests beautifulsoup4 openai python-dotenv selenium ollama ipython
```

### Environment Variables
- `OPENAI_API_KEY` - Required for AI content generation

## 🎯 Project Scope

- ✅ **Multi-Page Content**: Processes main page and relevant sub-pages
- ✅ **AI Link Selection**: Intelligently identifies brochure-relevant links
- ✅ **Content Synthesis**: Combines multiple pages into cohesive brochure
- ✅ **Professional Output**: Marketing-ready brochure content
- ✅ **Automated Processing**: Minimal manual intervention required
- ❌ **Visual Design**: Focuses on content, not visual layout

## 🏆 Skill Level

**Intermediate** - Perfect for developers learning:
- AI-assisted content creation
- Multi-step web scraping workflows
- JSON parsing and data processing
- OpenAI API integration
- Content synthesis techniques

## 🚀 Use Cases

- **🏢 Company Brochures**: Generate marketing materials for businesses
- **📈 Sales Materials**: Create compelling company overviews
- **💼 Investor Presentations**: Extract key company information
- **🎯 Marketing Automation**: Automated content creation pipelines
- **📊 Competitive Analysis**: Analyze competitor websites

## 💡 Benefits

- **⏰ Time-Saving**: Automates most of the page research and first-draft writing
- **🎯 Professional Quality**: AI-generated marketing content
- **📱 Consistent Format**: Standardized brochure structure
- **🔄 Scalable**: Process multiple companies efficiently
- **🤝 Marketing-Ready**: Professional output for immediate use

## 🔧 Core Components

- **`WebSite`**: Data class for website information
- **`WebUrlCrawler`**: Web scraping and content extraction
- **`LLMClient`**: OpenAI API integration
- **`BrochureGenerator`**: Main orchestration and content synthesis
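
The `WebSite` component listed above only holds page data, so it could also be written with the standard library's `dataclasses` module. The sketch below is one possible shape, not the notebook's actual class: the name `WebSiteRecord` and the default values are assumptions made for illustration, while the four field names follow the constructor used later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class WebSiteRecord:
    """Hypothetical dataclass variant of the notebook's WebSite container."""
    url: str
    title: str = "No title found"   # assumed default, mirrors the crawler's fallback
    body: str = ""
    links: list[str] = field(default_factory=list)  # fresh list per instance


page = WebSiteRecord(url="https://example.com", title="Example", links=["/about"])
print(page.title, page.links)
```

A dataclass gives a readable `repr` and equality comparison for free, which can help when inspecting crawled pages in a notebook.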

---

*This project demonstrates advanced AI integration for automated marketing content creation, combining web scraping, natural language processing, and content generation.*

In [1]:
# ===========================
# System & Environment
# ===========================
import os
import requests
import json
import ollama
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import display, Markdown, update_display
from openai import OpenAI

## Web Scraping Module

In [15]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

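# Selenium-based crawler for pages that need a real browser to render.
# Note: the next cell redefines WebUrlCrawler with a requests-based version,
# and that later definition is the one BrochureGenerator uses. This version's
# crawl() returns plain text rather than a WebSite object.
class WebUrlCrawler: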
    def __init__(self, headless=True, timeout=10):
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def _setup_driver(self):
        chrome_options = Options()
        if self.headless:
            chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--window-size=1920,1080")

        try:
            self.driver = webdriver.Chrome(options=chrome_options)
            self.driver.set_page_load_timeout(self.timeout)
        except WebDriverException as e:
            raise Exception(f"Failed to initialize Chrome driver: {e}")

    def _extract_main_content(self, html):
        soup = BeautifulSoup(html, 'html.parser')

        # Remove unwanted elements
        unwanted_tags = ['script', 'style', 'img', 'input', 'button', 'nav', 'footer', 'header']
        for tag in unwanted_tags:
            for element in soup.find_all(tag):
                element.decompose()

        # Try to find main content containers in order of preference
        content_selectors = [
            'main',
            'article',
            '[role="main"]',
            '.content',
            '#content',
            '.main-content',
            '#main-content'
        ]

        for selector in content_selectors:
            content_element = soup.select_one(selector)
            if content_element:
                return content_element.get_text(strip=True, separator='\n')

        # Fallback to body if no main content container found
        body = soup.find('body')
        if body:
            return body.get_text(strip=True, separator='\n')

        return soup.get_text(strip=True, separator='\n')

    def crawl(self, url):
        if not self.driver:
            self._setup_driver()

        try:
            self.driver.get(url)

            WebDriverWait(self.driver, self.timeout).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            html_content = self.driver.page_source
            main_content = self._extract_main_content(html_content)
            return main_content

        except TimeoutException:
            raise Exception(f"Timeout while loading {url}")
        except WebDriverException as e:
            raise Exception(f"Error crawling {url}: {e}")

    def close(self):
        if self.driver:
            self.driver.quit()
            self.driver = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

In [16]:
from bs4 import BeautifulSoup
import requests

class WebSite:
    def __init__(self, url, title, body, links):
        self.url = url
        self.title = title
        self.body = body
        self.links = links

class WebUrlCrawler:
    # some websites need to use proper headers when fetching them
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    }


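    def __init__(self, headless=True, timeout=10):
        # `headless` and `driver` are unused by this requests-based crawler; they are
        # kept so the constructor signature matches the Selenium version above.
        self.timeout = timeout
        self.driver = None
        self.headless = headless

    def crawl(self, url) -> WebSite:
        # Pass the timeout so a stalled server cannot hang the notebook indefinitely
        response = requests.get(url, headers=self.headers, timeout=self.timeout)
        soup = BeautifulSoup(response.content, 'html.parser')
        # soup.title.string is None when <title> is empty or has several child nodes
        title = soup.title.string if soup.title and soup.title.string else "No title found"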

        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            body = soup.body.get_text(strip=True, separator='\n')
        else:
            body = ""

        links = [link.get('href') for link in soup.find_all('a')]
        links = [link for link in links if link]

        return WebSite(url, title, body, links)
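
The crawler above collects raw `href` values, which can include relative paths, in-page anchors, duplicates, and `mailto:` links. One way to tidy them before they are sent to the model is sketched below. `normalize_links` is a hypothetical helper and not part of the notebook; it relies only on `urllib.parse` from the standard library.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links (hypothetical helper)."""
    seen = set()
    result = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # skip empty values and pure in-page anchors
        absolute = urldefrag(urljoin(base_url, href)).url  # resolve, then drop any #fragment
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # drops mailto:, tel:, javascript: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    "https://example.com/",
    ["/about", "careers", "mailto:hi@example.com", "#top", "/about"],
)
print(links)
```

Sending absolute, de-duplicated URLs may also shorten the prompt and make it less likely that the model returns a link the crawler cannot fetch.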


## LLM client

In [17]:
from openai import OpenAI

class LLMClient:
    def __init__(self, model, base_url=None):
        self.model = model
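        if base_url:
            # Custom OpenAI-compatible endpoint (e.g. a local server). The model name is
            # passed as a placeholder API key, which assumes the endpoint ignores the key.
            self.openai = OpenAI(base_url=base_url, api_key=model)
        else:
            # Default OpenAI endpoint; the client reads OPENAI_API_KEY from the environment
            self.openai = OpenAI()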

    def generate_text(self, user_prompt, system_prompt="") -> str:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        response = self.openai.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        return response.choices[0].message.content
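
`generate_text` always sends a system message, even when `system_prompt` is left as the empty string. A possible refinement, shown here as a standalone sketch rather than a change to the class, is to build the message list conditionally. `build_messages` is a hypothetical helper name; the message dictionaries follow the same `role`/`content` shape used in the cell above.

```python
def build_messages(user_prompt, system_prompt=""):
    """Build a chat message list, skipping the system message when it is blank."""
    messages = []
    if system_prompt.strip():
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


print(build_messages("Summarize this page."))
print(build_messages("Summarize this page.", system_prompt="You are concise."))
```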

## Brochure Creation

In [None]:
class BrochureGenerator:
    """
    An AI-powered brochure generator that creates professional company brochures
    by analyzing websites and extracting relevant content.
    
    This class orchestrates the entire brochure generation process:
    1. Crawls the main company website
    2. Uses AI to identify relevant sub-pages (About, Products, Contact, etc.)
    3. Scrapes content from selected pages
    4. Combines all content into a cohesive professional brochure
    
    The generator focuses on creating marketing materials suitable for potential
    customers, investors, and job seekers by extracting company culture, products,
    and career information from the website.
    
    Attributes:
        brochure_url (str): The main URL of the company website to process
        company_name (str): The name of the company for brochure personalization
        crawler (WebUrlCrawler): Web scraping client for content extraction
        main_webpage (WebSite): The scraped main website content and metadata
        llm_client (LLMClient): AI client for content generation and link analysis
    """

    def __init__(self, llm_client, url, company_name):
        """
        Initialize the brochure generator with company information and AI client.
        
        Args:
            llm_client (LLMClient): An initialized LLM client for AI operations
            url (str): The main URL of the company website to process
            company_name (str): The name of the company for brochure personalization
        """
        self.brochure_url = url
        self.company_name = company_name
        self.crawler = WebUrlCrawler(headless=True)
        self.main_webpage = self.crawler.crawl(self.brochure_url)
        self.llm_client = llm_client

    def generate(self) -> str:
        """
        Generate a complete company brochure by processing website content.
        
        This method orchestrates the full brochure generation workflow:
        1. Analyzes the main webpage links to identify relevant sub-pages
        2. Scrapes content from selected relevant pages
        3. Combines all content into a comprehensive text corpus
        4. Generates a professional brochure using AI content synthesis
        
        Returns:
            str: A complete company brochure in Markdown format, ready for
                 use in marketing materials, presentations, or publications
                 
        Raises:
            Exception: If web scraping fails for any of the target pages
            OpenAIError: If AI processing fails during link analysis or content generation
            json.JSONDecodeError: If the AI returns invalid JSON for link selection
        """
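        from urllib.parse import urljoin  # local import keeps this cell self-contained

        links_json = self._get_relevant_links()
        links = json.loads(links_json)
        content = self.main_webpage.body

        for link in links['links']:
            # The model may echo relative links, so resolve them against the main URL
            absolute_url = urljoin(self.brochure_url, link['url'])
            linked_website = self.crawler.crawl(absolute_url)
            content += f"\n\n{link['type']}:\n"
            content += linked_website.body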

        return self._get_brochure_body(content=content)

    def _get_relevant_links(self) -> str:
        """
        Use AI to identify and filter relevant website links for brochure content.
        
        This method analyzes all links found on the main webpage and uses AI
        to select only those that are relevant for a professional brochure.
        It excludes utility links like login, terms of service, and privacy
        policies while focusing on content-rich pages like About, Products,
        Contact, and Company information.
        
        Returns:
            str: A JSON string containing an array of selected relevant links
                 with their types (about, product, contact, etc.) and URLs
                 
        Raises:
            OpenAIError: If the AI model request fails or returns an error
        """
        system_prompt = """
        You are given a list of links from a company website.
        Select only relevant links for a brochure (About, Company, Careers, Products, Contact).
        Exclude login, terms, privacy, and emails.

        ### **Instructions**
        - Return **only valid JSON**.
        - **Do not** include explanations, comments, or Markdown.
        - Example output:
        {
            "links": [
                {"type": "about", "url": "https://company.com/about"},
                {"type": "contact", "url": "https://company.com/contact"},
                {"type": "product", "url": "https://company.com/products"}
            ]
        }
        """

        user_prompt = f"""
        Here is the list of links on the website of {self.main_webpage.url}:
        Please identify the relevant web links for a company brochure. Respond in JSON format.
        Do not include login, terms of service, privacy, or email links.
        Links (some might be relative links):
        {', '.join(self.main_webpage.links)}
        """

        return self.llm_client.generate_text(system_prompt=system_prompt, user_prompt=user_prompt)

    def _get_brochure_body(self, content) -> str:
        """
        Generate the final brochure content using AI synthesis of scraped website data.
        
        This method takes the combined content from all relevant website pages
        and uses AI to create a cohesive, professional brochure. The generated
        content is optimized for multiple audiences including potential customers,
        investors, and job seekers.
        
        Args:
            content (str): Combined text content from all scraped website pages
            
        Returns:
            str: A professional company brochure in Markdown format with
                 engaging content about the company's products, culture,
                 customers, and career opportunities
                 
        Raises:
            OpenAIError: If the AI model request fails during brochure generation
        """
        system_prompt = """
        You are an expert at writing engaging company brochures. Your task is to read content from a provided company website and write a short, professional, and engaging brochure for potential customers, investors, and job seekers. Include details about the company's culture, customers, and career opportunities if available. Respond in Markdown format.
        """
        user_prompt = f"""
        Create a brochure for '{self.company_name}' using the following content:
        {content}
        """

        return self.llm_client.generate_text(system_prompt=system_prompt, user_prompt=user_prompt)
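
`generate()` passes the model's reply straight to `json.loads`, so a reply wrapped in a Markdown code fence, or one with a missing key, would raise an error. The sketch below shows one defensive way to parse the reply. `parse_links_reply` is a hypothetical helper, and the assumption that a model may occasionally add a fence despite the prompt is ours, not something the notebook demonstrates.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable in Markdown


def parse_links_reply(reply):
    """Parse a link-selection reply, tolerating a surrounding code fence (hypothetical helper)."""
    text = reply.strip()
    # Remove an opening fence (optionally tagged "json") and a closing fence, if present
    text = re.sub(rf"^{FENCE}(?:json)?\s*", "", text)
    text = re.sub(rf"\s*{FENCE}$", "", text)
    data = json.loads(text)  # still raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        return []
    # Keep only well-formed entries so the crawl loop never hits a KeyError
    return [
        item for item in data.get("links", [])
        if isinstance(item, dict) and "type" in item and "url" in item
    ]


fenced = FENCE + 'json\n{"links": [{"type": "about", "url": "https://company.com/about"}]}\n' + FENCE
print(parse_links_reply(fenced))
```

The helper returns a plain list, so the loop in `generate()` could iterate over it directly instead of indexing `links['links']`.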

## Generate Brochure with gpt-4o-mini

#### Load the OpenAI API key

In [19]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

print("✅ API key loaded successfully!")

✅ API key loaded successfully!


#### Configure LLMClient

In [20]:
model_open_ai = "gpt-4o-mini"
open_ai_llm_client = LLMClient(model=model_open_ai)

In [21]:
brochure_generator = BrochureGenerator(llm_client=open_ai_llm_client, company_name='Anthropic', url='https://www.anthropic.com/claude')
brochure_content = brochure_generator.generate()
display(Markdown(brochure_content))

# Anthropic: Building the Future of Safe AI

Welcome to Anthropic, where we believe in making AI systems that are safe, reliable, and beneficial for humanity. As a leading AI safety and research company, we are dedicated to pioneering trustworthy AI technologies and creating solutions that enhance the way people interact with technology. Explore our offerings and learn about our commitment to building a brighter future through responsible AI.

## Meet Claude: Your Intelligent Partner

**Claude** is our flagship AI system designed to enhance productivity for individuals and teams. Whether you're tackling coding challenges, conducting research, or developing creative projects, Claude connects seamlessly to your world, offering personalized support that amplifies your capabilities.

- **Reliable Assistance**: Handle complex questions and tasks with clear, step-by-step guidance.
- **Creative Solutions**: Transform initial ideas into polished, practical outputs.
- **Collaborative Tools**: Work smarter, whether it's for education, business, or personal development.

## Our Commitment to Safety and Transparency

At Anthropic, we prioritize safety as a science. Our interdisciplinary team—comprising researchers, engineers, and policy experts—works together to ensure that our AI systems are not only cutting-edge but also secure and interpretable. With a focus on transparency, we actively share our research findings and safety practices with the public to foster a culture of accountability in the AI industry.

### Our Core Values
1. **Act for the Global Good**: We commit to outcomes that benefit humanity in the long run.
2. **Hold Light and Shade**: We acknowledge the potential risks and rewards of AI development.
3. **Be Good to Our Users**: Cultivating kindness and generosity in every interaction.
4. **Ignite a Race to the Top on Safety**: We aim to lead in AI safety measures collectively.
5. **Do the Simple Thing That Works**: Focusing on pragmatic solutions over complexity.
6. **Be Helpful, Honest, and Harmless**: Promoting a culture of trust and open communication.
7. **Put the Mission First**: Our mission drives us to make impactful decisions together.

## Join Our Growing Team

Anthropic is not just about AI; it's about people. We're actively looking for talented individuals from diverse backgrounds who share our passion for AI safety and innovation. Our inclusive company culture and commitment to employee well-being are reflected in our comprehensive benefits:

- **Health & Wellness**: Competitive health, dental, and vision insurance.
- **Flexible Paid Time Off**: Supporting work-life balance.
- **Career Growth Support**: Annual education stipends and professional development opportunities.

### Career Opportunities
We welcome candidates with varying experiences, whether you have a traditional machine learning background or come from a different field altogether. Our interview process assesses critical thinking and problem-solving skills rather than only specific industry experience.

Explore our open roles and see how you can contribute to shaping the future of safe AI!

## Let’s Connect

We invite you to join us on this remarkable journey of harnessing AI for good. Whether you’re a potential customer, investor, or future employee, connect with Anthropic to learn more about our mission and how we can work together toward advancing the responsible use of AI.

**Contact Us**: [Visit our website](https://www.anthropic.com) to explore how you can get involved.

---

Together, let’s create a future where AI elevates lives and innovations thrive. Welcome to Anthropic!
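
The brochure above is only rendered inside the notebook. To keep a copy, the Markdown string returned by `generate()` could be written to disk, for example as sketched below. `save_brochure` and its file-naming scheme are hypothetical, and the demo writes to a temporary directory so that it leaves nothing behind.

```python
import re
import tempfile
from pathlib import Path


def save_brochure(markdown_text, company_name, output_dir="."):
    """Write brochure Markdown to <output_dir>/<slug>_brochure.md and return the path."""
    # Lowercase the name and collapse anything that is not a-z or 0-9 into hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", company_name.lower()).strip("-") or "company"
    path = Path(output_dir) / f"{slug}_brochure.md"
    path.write_text(markdown_text, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_brochure("# Anthropic\n\nSample brochure text.", "Anthropic", output_dir=tmp)
    print(saved.name)  # anthropic_brochure.md
```

In the notebook, the call would be `save_brochure(brochure_content, 'Anthropic')`, assuming the helper is defined in an earlier cell.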