Here’s a summary of what has been done so far in the brochure-generator.ipynb notebook:

- Imported the required libraries for web scraping, environment management, and OpenAI API access.
- Loaded the OpenAI API key from environment variables, initialized the OpenAI client, and set gpt-4o-mini as the model.
- Defined a Website class that scrapes a given URL, extracts the main text content (excluding script, style, img, and input tags), and collects all links from the page.
- Instantiated the Website class for a sample website and extracted its links.
- Created a system prompt for the LLM to help identify which links are relevant for a company brochure (e.g., About, Careers).
- Defined a user prompt function that formats the list of links for the LLM and asks it to select the most relevant ones.
- Implemented a function that calls the LLM (using OpenAI’s API) to analyze the links and return a JSON object with only the brochure-relevant links.
In summary:
The notebook sets up web scraping for a company website, gathers all links, and uses an LLM to intelligently filter and select the most relevant links for inclusion in a company brochure.

Import all the required libraries.

In [1]:
import os, requests, json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import display, Markdown, update_display
from openai import OpenAI

Initialize the client and define the constants.

In [2]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    print("No API key was found")
else:
    print("API key found and looks good so far!")

MODEL = "gpt-4o-mini"
openai = OpenAI()

API key found and looks good so far!


Now let's define a Website class that scrapes a page's text content and links, skipping any script, style, img, and input tags.

In [3]:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """
    def __init__(self, url):
        self.url = url
        # A timeout keeps the notebook from hanging on an unresponsive site
        response = requests.get(url, headers=headers, timeout=10)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [4]:
website = Website("https://pankajtechblogs.dev/")
website.links

['https://medium.com/',
 'https://rsci.app.link/?%24canonical_url=https%3A%2F%2Fmedium.com/iampankajsharma%3F~feature=LoMobileNavBar&~channel=ShowCollectionHome&~stage=m2',
 'https://medium.com/m/signin?redirect=https%3A%2F%2Fpankajtechblogs.dev%2F&source=--------------------------nav_reg&operation=login',
 'https://medium.com/m/signin?redirect=https%3A%2F%2Fpankajtechblogs.dev%2F&source=--------------------------nav_reg&operation=register',
 'https://pankajtechblogs.dev',
 'https://pankajtechblogs.dev',
 'https://pankajtechblogs.dev/order-processing-system-using-event-driven-architecture-472bfc47cb06?source=collection_home---4------0-----------------------',
 'https://pankajtechblogs.dev/order-processing-system-using-event-driven-architecture-472bfc47cb06?source=collection_home---4------0-----------------------',
 'https://pankajtechblogs.dev/@iampankajsharma',
 'https://pankajtechblogs.dev/@iampankajsharma',
 'https://pankajtechblogs.dev/hubble-observability-with-cilium-kubernetes-67
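
The list above contains repeated entries, and the user prompt defined later notes that some links may be relative. A minimal sketch of tidying the list before it is sent to the LLM is shown here. `normalize_links` is a hypothetical helper that is not part of the notebook, and the sample links are made up for illustration.

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(base_url, links):
    """Resolve relative links against base_url and drop duplicates, keeping order."""
    seen = set()
    cleaned = []
    for href in links:
        # Skip schemes that do not point at a web page
        if href.startswith(("mailto:", "javascript:", "tel:")):
            continue
        # Make the link absolute, then strip any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute and absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

# Made-up sample input, not the real scrape result
sample = [
    "/about",
    "/about#team",
    "https://pankajtechblogs.dev",
    "https://pankajtechblogs.dev",
    "mailto:someone@example.com",
]
print(normalize_links("https://pankajtechblogs.dev/", sample))
```

In the notebook this could be applied as `normalize_links(website.url, website.links)`. Fewer and cleaner links also mean a shorter prompt.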

Now that we have the website content and its links, let us use an LLM to produce a brochure from the relevant links. Not all links will be relevant for this purpose.

Let's define the system prompt for the LLM here, including an example of the JSON it should return.

In [5]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

Now let's define the user prompt.

In [6]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

Let us use the power of the LLM and define a function, get_links, that sends the system and user prompts above to the LLM and returns its response: the links that would be helpful to include in the brochure.

In [7]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
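
Even with a JSON response format, the model's reply may not have exactly the shape shown in the system prompt, and its URLs may still be relative. As a rough sketch, the reply could be checked before it is used. `parse_links_response` is a hypothetical helper, not part of the notebook, and the fallback type `"page"` is an assumption.

```python
import json
from urllib.parse import urljoin

def parse_links_response(raw, base_url):
    """Parse the model's JSON reply and keep only well-formed link entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Treat an unparseable reply as "no relevant links"
        return {"links": []}
    entries = data.get("links", []) if isinstance(data, dict) else []
    if not isinstance(entries, list):
        entries = []
    valid = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        url = entry.get("url")
        if not isinstance(url, str) or not url:
            continue
        # Resolve relative URLs; "page" is an assumed default label
        valid.append({"type": entry.get("type", "page"), "url": urljoin(base_url, url)})
    return {"links": valid}

# Made-up reply for illustration: one usable entry, one without a URL
reply = '{"links": [{"type": "about page", "url": "/about"}, {"type": "broken"}]}'
print(parse_links_response(reply, "https://pankajtechblogs.dev/"))
```

Inside `get_links`, the final line could then become `return parse_links_response(result, url)` instead of calling `json.loads` directly.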

Now let us write the get_all_details function, which takes a website URL and performs the following steps:

- Scrapes and adds the website home page content to the result.
- Uses an LLM to identify and retrieve only the most brochure-relevant links from the page.
- For each relevant link, it fetches and appends the content of that linked page to the result, labeling each section by its type (e.g., "about page", "careers page").
- Returns a combined string containing the landing page content and the contents of all selected relevant subpages.

This function automates gathering and organizing all key information needed for a company brochure from a website.

In [8]:
def get_all_details(url):
    result = "Website home page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
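
As written, one linked page that fails to load would stop the whole run. A possible variant that skips failing pages is sketched below. `gather_pages` and its `fetch` parameter are hypothetical names introduced for illustration; in the notebook, `fetch` could be `lambda u: Website(u).get_contents()`. The fake fetcher and example.com URLs only simulate a failure.

```python
def gather_pages(home_url, links, fetch):
    """Combine the home page with each linked page, skipping pages that fail."""
    result = "Website home page:\n" + fetch(home_url)
    for link in links.get("links", []):
        try:
            contents = fetch(link["url"])
        except Exception as exc:  # broad on purpose: any failure skips this page
            print(f"Skipping {link.get('url')}: {exc}")
            continue
        result += f"\n\n{link.get('type', 'page')}\n{contents}"
    return result

# Stand-in for real scraping so the sketch runs without network access
def fake_fetch(url):
    if url.endswith("/broken"):
        raise RuntimeError("simulated network error")
    return f"contents of {url}\n"

links = {
    "links": [
        {"type": "about page", "url": "https://example.com/about"},
        {"type": "careers page", "url": "https://example.com/broken"},
    ]
}
print(gather_pages("https://example.com/", links, fake_fetch))
```

Passing the fetcher in as an argument also makes the function easy to try out without calling the LLM or the network.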

Finally, let's write the system and user prompts that pass the relevant information to the LLM, and then create the brochure.

In [9]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

In [10]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its home page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
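
The 5,000-character slice above can cut the prompt in the middle of a word or sentence. One possible refinement, sketched here under the assumption that page text is newline-separated (as `get_text(separator="\n")` produces), is to cut at the last full line before the limit. `truncate_at_line` is a hypothetical helper, not part of the notebook.

```python
def truncate_at_line(text, limit=5_000):
    """Cut text to at most `limit` characters, preferring to end on a full line."""
    if len(text) <= limit:
        return text
    # Position of the last newline that falls before the limit
    cut = text.rfind("\n", 0, limit)
    # Fall back to a hard cut when there is no newline before the limit
    return text[:cut] if cut > 0 else text[:limit]

print(repr(truncate_at_line("aaa\nbbb\nccc", limit=6)))  # 'aaa'
print(repr(truncate_at_line("abcdefgh", limit=4)))       # 'abcd'
```

A character limit is only a rough proxy for the model's token limit, so the value of 5,000 may still need tuning.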

In [11]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [12]:
create_brochure("Pankaj Blogs", "https://pankajtechblogs.dev/")

Found links: {'links': [{'type': 'about page', 'url': 'https://pankajtechblogs.dev/about'}]}


# Pankaj Tech Blogs Brochure

---

## Welcome to Pankaj Tech Blogs

### Sharing Knowledge and Empowering Learning

At Pankaj Tech Blogs, we believe in learning through practical experience. Our platform provides self-explanatory, concise, and accessible blog posts aimed at demystifying complex tech concepts for enthusiasts and professionals alike.

---

### Our Mission

Our mission is to simplify technology for everyone, regardless of their background. We design our content to be user-friendly, ensuring that even those without prior knowledge can understand and engage with the topics we cover.

---

### Featured Topics

- **Event Driven Architecture**: Explore innovative solutions like order processing systems.
- **Kubernetes and Observability**: Delve into advanced networking practices and zero trust networking.
- **Cloud Migration Strategies**: Learn essential techniques for smooth transitions from on-premises to cloud.
- **Microservices Design Patterns**: Understand vital patterns such as circuit breaker and database per service.
- **Security Practices**: Information on JWT validation and split-key encryption for secure data handling.

---

### Meet Our Founder

**Pankaj Sharma**  
A passionate tech enthusiast, Pankaj is dedicated to exploring various tech stacks and frameworks. With years of experience in the IT industry, he brings a wealth of knowledge that he shares through engaging and insightful blogs.

---

### Company Culture

At Pankaj Tech Blogs, we foster a culture of curiosity and continuous learning. We believe that technology should be accessible and that sharing knowledge plays a crucial role in professional growth. Our environment encourages exploration, collaboration, and innovation.

---

### Who Are Our Readers?

Our blogs cater to a wide range of audiences, including:

- **Tech Enthusiasts**: Individuals eager to learn about emerging technologies and best practices.
- **Developers**: Professionals looking for practical insights to improve their work and projects.
- **Students**: Learners wanting to build a solid tech foundation for their future careers.

---

### Career Opportunities

We are continually looking for passionate writers and tech enthusiasts to join our community. If you have a love for technology and a desire to share knowledge with others, Pankaj Tech Blogs is the place for you!

---

### Join Us!

**Get Started Today!**  
Empower yourself through knowledge by exploring our blog at [Pankaj Tech Blogs](https://example.com) (link not real) and follow us on our journey towards simplifying technology for everyone.

### Contact Us

For inquiries, collaborations, or job opportunities, feel free to reach us at [contact@pankajtechblogs.com](mailto:contact@pankajtechblogs.com).

---

Together, let's simplify technology, one blog at a time!