# A full business solution

### BUSINESS CHALLENGE:

Create a product that builds a brochure for a company, aimed at prospective clients, investors, and potential recruits.

We will be given a company name and its primary website.

## Setup

Key imports and why each is needed:

| Import | Purpose |
|---|---|
| `openai.OpenAI` | OpenAI Python SDK for calling GPT models via the Responses API |
| `pydantic.BaseModel` | Define typed schemas for Structured Outputs — guarantees model responses match an expected shape |
| `dotenv.load_dotenv` | Load `OPENAI_API_KEY` from a `.env` file without hard-coding credentials |
| `IPython.display` | Render and live-update Markdown in the notebook output cell |
| `scraper` | Local module — wraps Playwright to fetch page content and links from JavaScript-rendered sites |
| `nest_asyncio` | Patches `asyncio` to allow nested event loops, so code that starts its own loop can run inside Jupyter's already-running one |

In [None]:
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from scraper import playwright_fetch_website_contents, playwright_fetch_website_links
from pydantic import BaseModel
from openai import OpenAI

import nest_asyncio
nest_asyncio.apply() 


## Configuration

Loads `OPENAI_API_KEY` from `.env`, runs a basic format check on it, and initialises the OpenAI client.

Two models are used in this pipeline for different reasons:

| Model | Used for | Rationale |
|---|---|---|
| `gpt-5-nano` | Link classification | Fast and cheap — the task is simple classification, not prose generation |
| `gpt-4.1-mini` | Brochure generation | Higher quality output needed for polished, audience-aware marketing copy |

In [None]:
# Initialize and constants

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")
    
MODEL = 'gpt-5-nano'
openai = OpenAI()

## Using the Playwright Implementation

Playwright launches a real Chromium browser and executes the page's JavaScript before extracting content. This matters for modern sites that rely on client-side rendering: a plain HTTP client such as `requests` only sees the initial HTML and can miss links and text that are added by scripts.

### Step 0 — Fetch all raw links from the landing page

`playwright_fetch_website_links(url)` returns a deduplicated list of absolute URLs found on the page. Many of these will be irrelevant to a brochure (login pages, dataset listings, individual model cards, etc.) — that filtering happens in the next step.
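
The internals of the `scraper` module are not shown here, so the following is only a sketch of the post-processing such a function might perform once the raw `href` values are in hand. `normalize_links` is a hypothetical helper, not part of `scraper`; it resolves relative paths against the page URL, drops fragments and non-web schemes, and deduplicates while preserving order.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Hypothetical helper: turn raw href values into unique absolute URLs."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        if not href:
            continue
        # Resolve relative paths such as "/about" against the page URL.
        absolute = urljoin(page_url, href.strip())
        # Drop the "#fragment" part so in-page anchors collapse to one URL.
        absolute, _fragment = urldefrag(absolute)
        # Keep only web links; this skips mailto:, javascript:, tel:, etc.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


raw = ["/about", "https://huggingface.co/about", "/about#team", "mailto:hi@example.com", "/careers"]
print(normalize_links("https://huggingface.co", raw))
```

Keeping insertion order (rather than returning a `set`) makes the link list stable between runs, which helps when comparing prompts.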

In [None]:
url = "https://huggingface.co"
hf_links = playwright_fetch_website_links(url)
len(hf_links)

In [None]:
hf_links[:5]

## First step: Have GPT-5-nano figure out which links are relevant

### Use a call to gpt-5-nano to read the links on a webpage, and respond in structured JSON.  
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".  
We will use "one-shot prompting", in which we provide an example of how it should respond in the prompt.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project; the code below already uses it via a Pydantic schema.
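
For comparison, when the model is only asked for JSON (as the one-shot example in the system prompt does) and no schema is enforced, the reply is a plain string that has to be parsed and checked by hand. A minimal sketch of that manual step, where `parse_links_reply` is a hypothetical helper and the sample reply is invented for illustration:

```python
import json


def parse_links_reply(reply_text: str) -> list[dict]:
    """Hypothetical helper: parse a JSON reply and keep well-formed link entries."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        # The model returned something that is not valid JSON.
        return []
    if not isinstance(data, dict):
        return []
    links = data.get("links", [])
    if not isinstance(links, list):
        return []
    # Keep only entries that have both expected string fields.
    return [
        item for item in links
        if isinstance(item, dict)
        and isinstance(item.get("type"), str)
        and isinstance(item.get("url"), str)
    ]


sample_reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"url": 42}]}'
print(parse_links_reply(sample_reply))
```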

In [None]:
link_system_prompt = """
You are provided with a list of links found on a webpage.
You are able to decide which of the links would be most relevant to include in a brochure about the company,
such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:

{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

### Building the User Prompt

`get_links_user_prompt(url)` fetches all raw links from the page and appends them to the instruction text, producing the full user message to send alongside the system prompt.

Separating the static *instruction* (system prompt) from the dynamic *data* (user prompt) keeps each part maintainable independently — the system prompt never needs to change, only the data does.

The cell below previews the exact string that will be sent as the user message, which is useful for debugging prompt content before making an API call.

In [None]:
def get_links_user_prompt(url):
    user_prompt = f"""
Here is the list of links on the website {url} -
Please decide which of these are relevant web links for a brochure about the company, 
respond with the full https URL in JSON format.
Do not include Terms of Service, Privacy, email links.

Links (some might be relative links):

"""
    links = playwright_fetch_website_links(url)
    user_prompt += "\n".join(links)
    return user_prompt

In [None]:
print(get_links_user_prompt(url))

In [None]:
MODEL = "gpt-5-nano"

class Link(BaseModel):
    type: str
    url: str

class PageLinks(BaseModel):
    links: list[Link]


def select_relevant_links(url):
    """Ask the model which links belong in a brochure; returns {"links": [...]}."""
    response = openai.responses.parse(
        model=MODEL,
        input=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(url)}
        ],
        text_format=PageLinks  # Structured Outputs: the reply must match this schema
    )
    result = response.output_parsed  # a PageLinks instance
    links = result.model_dump()      # plain dict for downstream use
    print(f"Found {len(links['links'])} relevant links")
    return links

### Structured Outputs with Pydantic

`select_relevant_links` uses `openai.responses.parse()` with a **Pydantic model** as `text_format`. This is OpenAI's Structured Outputs feature: the model is constrained to return JSON that matches the schema, and the SDK parses it into a `PageLinks` instance, so no manual JSON parsing is needed.

The schema mirrors the expected shape:

```python
class Link(BaseModel):
    type: str   # e.g. "about page", "careers page"
    url:  str   # absolute URL

class PageLinks(BaseModel):
    links: list[Link]
```

`.model_dump()` converts the parsed Pydantic object back to a plain Python dict for easier downstream use.
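
Structured Outputs constrains the shape of the reply, not the truth of its values, so a returned `url` could still be malformed or point to a page that was never in the scraped list. Below is a minimal sketch of a post-check on the dict produced by `.model_dump()`. `keep_known_links` is a hypothetical helper, the sample data is invented, and the check assumes the scraped list holds absolute URLs.

```python
from urllib.parse import urlparse


def keep_known_links(selection: dict, scraped_links: list[str]) -> dict:
    """Hypothetical helper: keep only links that are http(s) and were actually scraped."""
    # Compare without trailing slashes so ".../about/" matches ".../about".
    known = {link.rstrip("/") for link in scraped_links}
    kept = []
    for link in selection.get("links", []):
        url = link.get("url", "")
        if urlparse(url).scheme in ("http", "https") and url.rstrip("/") in known:
            kept.append(link)
    return {"links": kept}


scraped = ["https://example.com/about", "https://example.com/careers/"]
selection = {
    "links": [
        {"type": "about page", "url": "https://example.com/about"},
        {"type": "careers page", "url": "https://example.com/careers"},
        {"type": "press page", "url": "https://example.com/press"},  # not in the scraped list
    ]
}
print(keep_known_links(selection, scraped))
```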

In [None]:
result = select_relevant_links(url)

In [None]:
result

## Second Step: Build the Brochure

With the relevant links identified, the pipeline now:

1. Fetches the **full rendered text** of the landing page and each relevant sub-page using Playwright
2. Assembles all scraped content into a single user prompt
3. Sends it to `gpt-4.1-mini` to generate a polished Markdown brochure

`fetch_page_and_all_relevant_links` handles steps 1–2. `get_brochure_user_prompt` wraps the result into the final prompt. `create_brochure` ties everything together and renders the output.
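
Steps 1 and 2 can be sketched independently of Playwright by passing the fetch function in as a parameter. The version below is an illustration, not the notebook's implementation: `assemble_content`, `fetch`, and `max_chars_per_page` are hypothetical names, and the per-page cap and skip-on-error behaviour are additions meant to keep one slow or oversized page from breaking or bloating the prompt.

```python
from typing import Callable


def assemble_content(
    url: str,
    links: list[dict],
    fetch: Callable[[str], str],
    max_chars_per_page: int = 2_000,
) -> str:
    """Hypothetical variant of the assembly step with a size cap and error handling."""
    result = f"## Landing Page:\n\n{fetch(url)[:max_chars_per_page]}\n## Relevant Links:\n"
    for link in links:
        try:
            text = fetch(link["url"])
        except Exception as exc:  # e.g. a timeout or navigation error in the scraper
            print(f"Skipping {link['url']}: {exc}")
            continue
        result += f"\n\n### Link: {link['type']}\n"
        result += text[:max_chars_per_page]
    return result


def fake_fetch(page_url: str) -> str:
    """Stand-in for a real scraper so the sketch runs offline."""
    if page_url.endswith("/broken"):
        raise TimeoutError("page did not load")
    return f"Text of {page_url}. " * 3


demo_links = [
    {"type": "about page", "url": "https://example.com/about"},
    {"type": "broken page", "url": "https://example.com/broken"},
]
print(assemble_content("https://example.com", demo_links, fake_fetch, max_chars_per_page=40))
```

In the notebook, `fetch` would be `playwright_fetch_website_contents`; whether that function already truncates its output is not visible from this section.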

In [None]:
def fetch_page_and_all_relevant_links(url):
    contents = playwright_fetch_website_contents(url)
    relevant_links = select_relevant_links(url)
    result = f"## Landing Page:\n\n{contents}\n## Relevant Links:\n"
    for link in relevant_links['links']:
        result += f"\n\n### Link: {link['type']}\n"
        result += playwright_fetch_website_contents(link["url"])
    return result

In [None]:
page_relevant_links = fetch_page_and_all_relevant_links(url)

In [None]:
print(page_relevant_links)

In [None]:
len(page_relevant_links)

### Brochure Prompt Design

The **system prompt** instructs the model to write a short, audience-aware brochure targeting three distinct reader types: prospective **customers**, **investors**, and **recruits**. It also asks for details of company culture, customers, and careers where available, which steers the output away from generic marketing copy.

The **user prompt** (`get_brochure_user_prompt`) injects all the scraped content, structured as:

```
## Landing Page:
<full page text>

## Relevant Links:

### Link: about page
<scraped text>

### Link: careers page
<scraped text>
...
```

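The cells below define both prompts and then print the assembled user prompt, so you can inspect what content was scraped before committing to the brochure-generation call. Note that assembling the prompt already runs the link-selection call.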

In [None]:
brochure_system_prompt = """
You are an assistant that analyzes the contents of several relevant pages from a company website
and creates a short brochure about the company for prospective customers, investors and recruits.
Respond in markdown without code blocks.
Include details of company culture, customers and careers/jobs if you have the information.
"""

In [None]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"""
You are looking at a company called: {company_name}
Here are the contents of its landing page and other relevant pages;
use this information to build a short brochure of the company in markdown without code blocks.\n\n
"""
    user_prompt += fetch_page_and_all_relevant_links(url)
    return user_prompt

In [None]:
print(get_brochure_user_prompt("HuggingFace", "https://huggingface.co"))

In [None]:
def create_brochure(company_name, url):
    brochure_user_prompt = get_brochure_user_prompt(company_name, url)
    response = openai.responses.create(
        model="gpt-4.1-mini",
        input=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": brochure_user_prompt}
        ],
    )
    result = response.output_text
    display(Markdown(result))

### Generate Brochures

Call `create_brochure(company_name, url)` with any company name and website URL. The full pipeline runs end-to-end — scraping, link selection, content assembly, and generation — and renders the result as formatted Markdown inline.

> **Note:** Each call runs several Playwright fetches (the landing page's links, its text, and one per relevant page) and two API calls (link selection, then brochure generation), so it may take 20–60 seconds depending on the site and number of relevant links found.
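
When a brochure is regenerated for the same company, the same pages are scraped again. Memoising the fetch function is one way to cut that wait. Below is a minimal sketch using `functools.lru_cache`; `cached_fetch` is a hypothetical wrapper whose body stands in for the real scraper call. Cached pages go stale, so this suits interactive experimentation rather than long-running jobs.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def cached_fetch(url: str) -> str:
    """Hypothetical cached wrapper; replace the body with the real scraper call."""
    return f"<contents of {url}>"


cached_fetch("https://example.com/about")
cached_fetch("https://example.com/about")  # served from the cache, no second fetch

info = cached_fetch.cache_info()
print(f"hits={info.hits} misses={info.misses}")
```

Calling `cached_fetch.cache_clear()` forces fresh fetches, for example after the target site has been updated.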

In [None]:
company_name = "OpenAI"
url = "https://openai.com"
create_brochure(company_name, url)

In [None]:
company_name = "Sunbird AI"
url = "https://sunbird.ai"
create_brochure(company_name, url)

## Streaming Variant: `stream_brochure`

A small but impactful improvement over `create_brochure` — instead of waiting for the full response before displaying anything, tokens are streamed back and the Markdown output is updated in place as they arrive, giving the familiar typewriter effect.

**How it works:**

- `stream=True` activates the streaming Responses API
- Each `response.output_text.delta` event carries the next text chunk
- `update_display(Markdown(response), display_id=...)` re-renders the growing Markdown string in the same output cell — no flicker, no duplicate cells

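Use `stream_brochure` in preference to `create_brochure` for interactive use; `create_brochure` waits for the complete response and renders it in one go. As written, neither function returns the brochure text, so add a `return` statement if you need the full string (e.g. for saving to a file).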
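
The accumulation loop can be exercised without an API call by feeding it stand-in events. In the sketch below, `collect_stream` and `on_update` are hypothetical names, and the fake events only mimic the two attributes the loop reads (`type` and `delta`). Returning the final string is an addition that makes the text available for saving.

```python
from types import SimpleNamespace
from typing import Callable, Iterable


def collect_stream(events: Iterable, on_update: Callable[[str], None]) -> str:
    """Hypothetical helper: accumulate text deltas and report each partial result."""
    text = ""
    for event in events:
        # Ignore lifecycle events; only text deltas carry brochure content.
        if event.type == "response.output_text.delta":
            text += event.delta
            on_update(text)
    return text


fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="# Hugging"),
    SimpleNamespace(type="response.output_text.delta", delta=" Face"),
    SimpleNamespace(type="response.completed"),
]
snapshots = []
brochure = collect_stream(fake_events, snapshots.append)
print(brochure)
```

In the notebook, `on_update` could be a small function that calls `update_display(Markdown(text), display_id=display_handle.display_id)`.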

In [None]:
def stream_brochure(company_name, url):
    brochure_user_prompt = get_brochure_user_prompt(company_name, url)
    stream = openai.responses.create(
        model="gpt-4.1-mini",
        input=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": brochure_user_prompt}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for event in stream:
        if event.type == "response.output_text.delta":
            response += event.delta
            update_display(Markdown(response), display_id=display_handle.display_id)

In [None]:
company_name = "HuggingFace"
url = "https://huggingface.co"
stream_brochure(company_name, url)

In [None]:
company_name = "OpenAI"
url = "https://openai.com"
stream_brochure(company_name, url)

In [None]:
company_name = "Sunbird AI"
url = "https://sunbird.ai"
stream_brochure(company_name, url)

In [None]:
company_name = "Cardington Motors Uganda"
url = "https://cardingtonmotors.com"
stream_brochure(company_name, url)

In [None]:
company_name = "Andela"
url = "https://andela.com"
stream_brochure(company_name, url)
