# 🚀 **Automated Company Brochure Generator**

Generate professional, concise company brochures in seconds—powered by advanced AI and real-time web scraping.  
Just enter a company name and website: our tool gathers, summarizes, and formats content into a polished, ready-to-use Markdown brochure.

## 📦1. Importing Essential Libraries

In this cell, we bring in all the crucial libraries needed for:

- **Environment management** (`os`, `dotenv`)
- **HTTP requests and data parsing** (`requests`, `json`, `BeautifulSoup`)
- **Web scraping & automation** (`selenium`)
- **Networking utilities** (`socket`, `ipaddress`)
- **OpenAI and Gradio Integration** (`openai`, `gradio`)
- **Web display and markdown rendering** (`IPython.display`)

These imports lay the foundation for the brochure generation and interaction capabilities of this notebook.


In [1]:
import os
import requests
import json
import socket
import ipaddress
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from IPython.display import display, Markdown, update_display
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import WebDriverException
from webdriver_manager.chrome import ChromeDriverManager
import gradio as gr

## 🔐2. Loading API Keys & Base URL

This section ensures we have access to the necessary **Groq API credentials** by:

- Loading variables from a `.env` file using `load_dotenv`
- Fetching `GROQ_API_KEY` and `GROQ_BASE_URL` from the environment
- Checking that the API key is present and has no leading or trailing whitespace

✅ If everything checks out, you're ready to access the Groq API!
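
The same checks can be wrapped in a small function so that a missing or malformed key stops the notebook with a clear error. This is a minimal sketch using only the standard library; the helper name `check_api_key` and the placeholder key value are assumptions for illustration and are not part of the notebook.

```python
import os


def check_api_key(name: str = "GROQ_API_KEY") -> str:
    """Return the key stored in the environment variable `name`, or raise."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    if key != key.strip():
        raise RuntimeError(f"{name} has leading or trailing whitespace.")
    return key


# Usage with a placeholder value (not a real key)
os.environ["GROQ_API_KEY"] = "example-key-123"
print(check_api_key())
```

Raising an exception instead of printing keeps later cells from running with an unusable key.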


In [2]:
load_dotenv(override=True)
API_KEY = os.getenv("GROQ_API_KEY")
BASE_URL = os.getenv("GROQ_BASE_URL")

if not API_KEY:
    # Raising stops the cell cleanly; exit() behaves unpredictably inside a notebook kernel
    raise ValueError("No API key found. Please set GROQ_API_KEY in your .env file.")
elif API_KEY.strip() != API_KEY:
    print("An API key was found, but it has space or tab characters at the start or end. Please remove them (see the troubleshooting notebook).")
else:
    print("API key found and looks good so far!")


API key found and looks good so far!


## ⚙️3. Configuring the LLM Client

This cell sets the **language model** to use (`llama3-70b-8192`) and creates an OpenAI client pointed at Groq's OpenAI-compatible endpoint via `BASE_URL`. Every later LLM call in the notebook goes through this `groq_client`.


In [3]:
MODEL = "llama3-70b-8192"
groq_client = OpenAI(api_key=API_KEY, base_url=BASE_URL)

## 🌐4. Website Content Extractor

This cell defines utilities for safe URL detection and implements the `WebScraper` class using Selenium and BeautifulSoup. The extractor:

- Validates URLs for security.
- Loads and renders web pages.
- Extracts the page title, primary text, images, links, and any HTML tables.
- Returns a string summary of these elements for further processing.

This forms the core data-acquisition module for our brochure generator.
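
The address checks behind the URL validation can be tried without any network access. The sketch below isolates them in a hypothetical helper, `is_public_ip`, which is not part of the notebook. The notebook's `is_safe_url` additionally checks the URL scheme and resolves the hostname through DNS before applying the same tests.

```python
import ipaddress


def is_public_ip(ip_text: str) -> bool:
    """Return True only for addresses outside private, loopback, reserved and link-local ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not a valid IPv4/IPv6 literal
    return not (ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_link_local)


for candidate in ["8.8.8.8", "127.0.0.1", "10.0.0.5", "169.254.169.254", "not-an-ip"]:
    print(candidate, is_public_ip(candidate))
```

Rejecting these ranges is what stops the scraper from being pointed at internal services or a cloud metadata endpoint.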


In [4]:
"""
Explanation of this code: https://chatgpt.com/share/686f8aed-a210-8007-970d-37906975fa4f
"""


def is_safe_url(url):
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ["http", "https"] or parsed.netloc == "":
            return False

        host = parsed.hostname
        ip = ipaddress.ip_address(socket.gethostbyname(host))
        if ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_link_local:
            return False
    except Exception:
        return False
    return True


class WebScraper:
    """
    A utility class to represent a Website that we have scraped, using Selenium, with extracted links.
    """

    def __init__(self, url):
        if not is_safe_url(url):
            raise ValueError("Invalid or unsafe URL")

        self.url = url
        self.title = "No title found"
        self.text = ""
        self.links = []
        self.images = []
        self.files = []
        self.tables = []

        # Setup Selenium
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-extensions")
        options.add_argument("--disable-blink-features=AutomationControlled")

        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

        try:
            driver.set_page_load_timeout(60)
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, "html.parser")

            # Get title
            self.title = soup.title.string.strip() if soup.title and soup.title.string else "No title found"

            # Remove irrelevant tags
            if soup.body:
                for tag in soup.body(["script", "style", "input"]):
                    tag.decompose()
                self.text = soup.body.get_text(strip=True)

            # Extract all Images
            all_images = [img.get("src") for img in soup.find_all("img") if img.get("src")]
            self.images = all_images

            # Extract all valid links
            all_links = [a.get("href") for a in soup.find_all("a") if a.get("href") and is_safe_url(a.get("href"))]
            self.links = all_links

            # Extract all tables
            all_tables = [table for table in soup.find_all("table")]
            self.tables = all_tables


        except WebDriverException as e:
            print(f"Error loading page with Selenium: {e}")
        finally:
            driver.quit()

    def get_contents(self):
        return f"-> Webpage Title:\n{self.title}\n\n\n-> Webpage Contents (limited text displayed up to 1000 characters):\n{self.text[:1000]}\n\n\n-> Links (limited to 20 links displayed):\n{self.links[:20]}\n\n\n-> Images:\n{self.images}\n\n\n-> Tables:\n{self.tables}\n\n\n"


## 🛠️ Example: Scraping a Webpage

Here, we instantiate the `WebScraper` class with a sample company site URL and print the result of `get_contents()`: the page title, a snippet of the text, a sample of the links found, the images, and any tables. This shows what a single-page scrape returns before the LLM steps are added.


In [5]:
website = WebScraper("https://www.microwebtec.com/")
print(website.get_contents())
# # website.links


-> Webpage Title:
Full stack Development Company - Microweb Software Pvt Ltd


-> Webpage Contents (limited text displayed up to 1000 characters):
HomeAboutCasesServicesTechnologyTechnology SubCloud Application Development ServicesAzure DevOps ServicesAI & ML ServicesShopify DevelopmentGolang Development ServicesDevOps Consulting ServicesWebflow DevelopmentBusiness TransformationLaravel Application DevelopmentSymfony Web DevelopmentNode.js DevelopmentAngularJs Web Development ServicesRuby on Rails Application DevelopmentMicrosoft DevelopmentMobile Application DevelopmentIoT and Embedded Systems & Smart SolutionsCloud TechnologyReactJS DevelopmentDrupal Web Development Services2D and 3D Video AnimationUI & UX DesignWeb DevelopmentEnterprise SolutionsDigital MarketingSoftware Outsource to IndiaGraphic DesigningeCommerce DevelopmentWordPress DevelopmentWooCommerce DevelopmentShopify DevelopmentPython DevelopmentGet in touchGet in touchWe Build Brilliance!WhoMicroweb software specializes i

## 🤖5. LLM Prompt Template for Relevant Links

This cell builds the **system prompt** for the LLM describing how to choose only the most brochure-relevant links (About, Careers, etc.) from all URLs discovered on the home page.

It also defines `link_user_prompt_for(website)`, which dynamically assembles a per-site list of links for the LLM to select from.
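
The prompt mentions relative links, but `WebScraper` keeps only hrefs that pass `is_safe_url`, which requires an `http` or `https` scheme, so relative hrefs are dropped before the prompt is built. One way to keep them would be to resolve each href against the page URL first. This is a minimal sketch; `absolutize_links` is a hypothetical helper and `example.com` is a placeholder.

```python
from urllib.parse import urljoin, urlparse


def absolutize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) URLs in order."""
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # drops mailto:, tel:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


links = absolutize_links(
    "https://example.com/",
    ["/about", "careers", "mailto:hi@example.com", "https://example.com/about"],
)
print(links)
```

The resolved URLs could then be passed through `is_safe_url` as before.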


In [6]:
link_system_prompt = (
    "You are provided with a list of links found on a webpage. "
    "You are able to decide which of the links would be most relevant to include in a brochure about the company, "
    "such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
    "You should respond in JSON as in this example:"
)
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

def link_user_prompt_for(website):
    user_prompt = (
        f"Here is the list of links on the website of {website.url} - please decide which of these are relevant web links for a brochure about the company, "
        f"respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.\n\n"
        "Links (some might be relative links):\n"
    )
    user_prompt += "\n".join(website.links)
    return user_prompt


In [7]:
# print(link_user_prompt_for(website))

## 🚦6. Selecting the Most Relevant Links with LLM

This function, `get_links(url)`, uses the LLM to select and return only the subpage URLs that are relevant for brochure content. It:
- Scrapes the given URL for links,
- Prompts the LLM with the list of links and the example JSON from the system prompt,
- Expects a JSON object whose `links` list holds entries with `type` and `url` keys (About, Careers, etc.).

Filtering the links this way keeps the brochure input focused on pages that describe the company.
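
LLM replies do not always match the requested shape, so it can help to validate the JSON before using it. The sketch below is one possible approach; `parse_link_response` is a hypothetical helper, not part of the notebook, and it keeps only entries that have string `type` and `url` values with an `http(s)` URL.

```python
import json


def parse_link_response(raw: str) -> list:
    """Return the well-formed {"type", "url"} entries from an LLM JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    entries = data.get("links", [])
    if not isinstance(entries, list):
        return []
    return [
        entry for entry in entries
        if isinstance(entry, dict)
        and isinstance(entry.get("type"), str)
        and isinstance(entry.get("url"), str)
        and entry["url"].startswith(("http://", "https://"))
    ]


reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"type": "broken"}]}'
print(parse_link_response(reply))
```

Returning an empty list on malformed input lets the caller fall back to the landing page alone.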


In [8]:
def get_links(url):
    website = WebScraper(url)
    response = groq_client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": link_user_prompt_for(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

In [9]:
# get_links("https://www.microwebtec.com")

## 🔄7. Aggregate All Relevant Page Contents

Defines `get_all_details(url)`, which scrapes the landing page, uses the LLM to select additional key subpages (like About, Careers), scrapes those as well, and aggregates their contents into a single formatted string. This prepares the unified company dataset for brochure generation.
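
The aggregation flow can be exercised without a browser or an LLM by injecting the page fetcher. This is a minimal sketch under stated assumptions: `aggregate_pages` and `fake_fetch` are hypothetical stand-ins for `get_all_details` and `WebScraper(...).get_contents()`, and the sketch skips subpages that fail to load rather than aborting.

```python
from typing import Callable, Dict, List


def aggregate_pages(
    landing_url: str,
    links: List[Dict[str, str]],
    fetch: Callable[[str], str],
) -> str:
    """Concatenate the landing page and each linked page, skipping pages that fail."""
    sections = ["Landing page:\n" + fetch(landing_url)]
    for link in links:
        try:
            body = fetch(link["url"])
        except Exception as exc:  # e.g. ValueError raised for an unsafe URL
            print(f"Skipping {link['url']}: {exc}")
            continue
        sections.append(f"{link['type']}\n{body}")
    return "\n\n".join(sections)


def fake_fetch(url: str) -> str:
    """Stand-in fetcher used only for this demonstration."""
    if "bad" in url:
        raise ValueError("Invalid or unsafe URL")
    return f"<contents of {url}>"


combined = aggregate_pages(
    "https://example.com",
    [
        {"type": "about page", "url": "https://example.com/about"},
        {"type": "careers page", "url": "https://bad.example/careers"},
    ],
    fake_fetch,
)
print(combined)
```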


In [10]:
def get_all_details(url):
    result = "Landing page:\n"
    result += WebScraper(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links.get("links", []):  # tolerate a reply without a "links" key
        result += f"\n\n{link['type']}\n"
        result += WebScraper(link["url"]).get_contents()
    return result

In [11]:
# print(get_all_details("https://www.microwebtec.com"))

## 📝8. Construct Brochure Generation Prompts

This section sets up:
- The **system prompt** instructing the LLM to create a concise, informative Markdown brochure for customers, investors, and recruits.
- The `brochure_user_prompt` function, which packages aggregated company content and detailed requirements into a clean input for the LLM.

This is the workflow's final prep before brochure synthesis.
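
The notebook truncates the combined text with a single slice (`content[:5_000]`), so a long landing page can use up the whole budget before any subpage text is reached. One possible alternative is to split the budget per page. The sketch below assumes the pages are available as a name-to-text mapping; both that mapping and `truncate_evenly` are hypothetical and not part of the notebook.

```python
def truncate_evenly(sections: dict, total_budget: int = 5_000) -> str:
    """Give each named section an equal share of the character budget."""
    if not sections:
        return ""
    per_section = total_budget // len(sections)
    parts = [f"{name}:\n{text[:per_section]}" for name, text in sections.items()]
    # Section headers are added on top of the budget, so the result can be slightly longer.
    return "\n\n".join(parts)


pages = {"Landing page": "L" * 9_000, "about page": "A" * 3_000}
combined = truncate_evenly(pages)
print(combined.count("L"), combined.count("A"))
```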


In [12]:
brochure_system_prompt = (
    "You are an assistant that analyzes the contents of several relevant pages from a company website "
    "and creates a short, compelling brochure about the company. "
    "Your audience includes prospective customers, investors, and potential recruits. "
    "Respond in clear, well-formatted Markdown. "
    "Include information about the company's mission, products or services, culture and values, key customers or partners, and careers/jobs if that information is available."
)

def brochure_user_prompt(company_name, url):
    content = get_all_details(url)
    content = content[:5_000]  # Truncate content to 5,000 characters

    user_prompt = (
        f"You are looking at a company called: **{company_name}**\n\n"
        f"Below is the content gathered from the company's landing page and other relevant subpages (such as About, Careers, and Press).\n"
        f"Use this content to generate a **concise, informative brochure** in **Markdown format** for prospective **customers, investors, and potential recruits**.\n\n"
        f"The brochure should aim to:\n"
        f"- Describe what the company does\n"
        f"- Highlight company culture and values (if available)\n"
        f"- Mention notable customers or partners\n"
        f"- Include a summary of career opportunities or team info if relevant\n\n"
        f"### Company Website Content:\n\n"
        f"{content}"
    )
    return user_prompt


## 🏗️9. Brochure Generation

This cell defines `create_brochure(company_name, url)`, which sends the system prompt and the output of `brochure_user_prompt` to the LLM with streaming enabled. It is a generator: after each chunk arrives, it yields the Markdown accumulated so far, so Gradio can refresh the output while the brochure is being written. If scraping or the LLM call fails, it yields an error message instead of raising.
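
The cumulative-yield pattern used for streaming can be tried without an LLM. In this minimal sketch, `stream_cumulative` is a hypothetical helper and the list of strings stands in for the `delta.content` values of a streamed response, where `None` represents a chunk with no text.

```python
from typing import Iterable, Iterator, Optional


def stream_cumulative(pieces: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield the text received so far after each piece, with code-fence markers removed."""
    fence = "`" * 3
    result = ""
    for piece in pieces:
        result += piece or ""  # empty chunks contribute nothing
        # Strip the opening fence with its language tag first, then any bare fence
        yield result.replace(fence + "markdown", "").replace(fence, "")


fake_chunks = ["# Acme", " Corp\n", None, "We build things."]
for partial in stream_cumulative(fake_chunks):
    print(repr(partial))
```

Yielding the full text each time, rather than only the new piece, matches how a Gradio output component replaces its content on every update.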


In [13]:
def create_brochure(company_name, url):
    try:
        streamed_response = groq_client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": brochure_system_prompt},
                {"role": "user", "content": brochure_user_prompt(company_name, url)}
            ],
            stream=True  # ✅ Required for iteration!
        )
        result = ""
        for chunk in streamed_response:
            content_piece = chunk.choices[0].delta.content or ""
            result += content_piece
            cleaned_result = result.replace("```markdown", "").replace("```", "")  # strip fences without deleting the word "markdown" from the text
            yield cleaned_result  # <- Streaming to Gradio

    except Exception as e:
        yield f"[LLM Error] {e}"  # <- yield for Gradio to handle the stream

In [14]:
# create_brochure("Microweb Software", "https://www.microwebtec.com")

In [15]:
# create_brochure("Gopal Info", "https://www.gopalinfo.com")

## 🖥️10. Building an Interactive Gradio UI

This cell creates an interactive **web-based interface** for brochure generation using Gradio, integrated within the Jupyter notebook. The UI includes:

- **Input fields** for the Company Name and Company Website URL, allowing users to easily provide the information required for brochure creation.
- A **'Generate Brochure' button** which, when clicked, runs the full brochure generation pipeline—from web scraping and LLM-powered summarization to markdown formatting.
- An **output area** where the generated markdown brochure is displayed within the notebook.

Gradio wraps the end-to-end workflow in a simple interface, so brochure generation is accessible to users without coding knowledge. The interface launches inside the Jupyter environment and, because `share=True` is passed to `launch`, is also exposed through a temporary public link.

In [16]:
view = gr.Interface(
    fn=create_brochure,
    inputs=[
        gr.Textbox(label="Company Name:"),
        gr.Textbox(label="Landing page URL including http:// or https://"),
    ],
    outputs=[gr.Markdown(label="Brochure")],
    title="Brochure Generator",
    description="This is a brochure generator",
    flagging_mode="never",
    # js = force_dark_mode
)
view.launch(share=True)

* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://d80a816e586391b1ba.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


