# A full business solution

## Now we will take our project from Day 1 to the next level

### BUSINESS CHALLENGE:

Create a product that builds a brochure for a company, aimed at prospective clients, investors and potential recruits.

We will be provided a company name and their primary website.

See the end of this notebook for examples of real-world business applications.

And remember: I'm always available if you have problems or ideas! Please do reach out.

In [1]:
# imports
# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt

import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [2]:
# Initialize and constants

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")
    
MODEL = 'gpt-4o-mini'
openai = OpenAI()

API key looks good so far


In [3]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [5]:
ed = Website("https://urszulaczerwinska.github.io/")
ed.links

['/index.html',
 '#menu',
 '/index.html',
 '/about/',
 '/works/',
 '/thoughts/',
 '/feed.xml',
 '#greeting',
 'https://twitter.com/ulalaparis',
 'https://github.com/urszulaczerwinska',
 'https://linkedin.com/in/urszulaczerwinska',
 '/thoughts/',
 '/thoughts/',
 '/works/',
 '/works/',
 '/works/deep-dive-in-paddleocr-inference',
 '/works/ner_cc',
 '/works/interpretability-shap',
 '/works/egg_ner',
 'https://twitter.com/ulalaparis',
 'https://github.com/urszulaczerwinska',
 'https://linkedin.com/in/urszulaczerwinska',
 'mailto:ulcia.liberte@gmail.com',
 '/credits/']
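
The list above contains duplicates, in-page anchors such as `#menu`, and a `mailto:` address. Here is a minimal sketch of tidying such a list with the standard library before handing it to the model. The helper name `clean_links` and the sample list are hypothetical and not part of this notebook's code; the LLM call that follows is still what decides which links are relevant.

```python
from urllib.parse import urljoin, urlparse

def clean_links(base_url, links):
    """Resolve relative links, drop anchors and non-web links, de-duplicate in order."""
    seen = set()
    cleaned = []
    for link in links:
        if link.startswith("#"):
            continue  # in-page anchor, not a separate page
        absolute = urljoin(base_url, link)  # leaves absolute URLs unchanged
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # e.g. mailto: links
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

# Hypothetical sample in the same shape as the output above
sample = ["/index.html", "#menu", "/about/", "/about/",
          "mailto:someone@example.com", "https://github.com/urszulaczerwinska"]
print(clean_links("https://urszulaczerwinska.github.io/", sample))
```

Sending a shorter, cleaner list should also reduce the number of tokens in the prompt.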

## First step: Have GPT-4o-mini figure out which links are relevant

### Use a call to gpt-4o-mini to read the links on a webpage, and respond in structured JSON.  
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".  
We will use "one-shot prompting", in which the prompt includes a single example of how the model should respond.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.
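
Until then, a plain-Python check of the reply's shape can catch malformed responses. This is only a sketch and not the Structured Outputs feature itself: the helper name `parse_links_reply` is hypothetical, and it assumes the reply follows the `{"links": [{"type": ..., "url": ...}]}` layout used in this notebook.

```python
import json

def parse_links_reply(raw):
    """Parse the model's JSON reply and keep only well-formed link entries."""
    data = json.loads(raw)  # raises json.JSONDecodeError if the reply is not valid JSON
    links = data.get("links", []) if isinstance(data, dict) else []
    valid = []
    for entry in links:
        if (
            isinstance(entry, dict)
            and isinstance(entry.get("type"), str)
            and isinstance(entry.get("url"), str)
            and entry["url"].startswith("https://")
        ):
            valid.append({"type": entry["type"], "url": entry["url"]})
    return {"links": valid}

# Hypothetical reply: one good entry, one missing its url, one email link
raw_reply = (
    '{"links": [{"type": "about page", "url": "https://example.com/about"}, '
    '{"type": "broken"}, {"type": "email", "url": "mailto:a@example.com"}]}'
)
print(parse_links_reply(raw_reply))
```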

In [6]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [7]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}



In [8]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [9]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://urszulaczerwinska.github.io/ - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
/index.html
#menu
/index.html
/about/
/works/
/thoughts/
/feed.xml
#greeting
https://twitter.com/ulalaparis
https://github.com/urszulaczerwinska
https://linkedin.com/in/urszulaczerwinska
/thoughts/
/thoughts/
/works/
/works/
/works/deep-dive-in-paddleocr-inference
/works/ner_cc
/works/interpretability-shap
/works/egg_ner
https://twitter.com/ulalaparis
https://github.com/urszulaczerwinska
https://linkedin.com/in/urszulaczerwinska
mailto:ulcia.liberte@gmail.com
/credits/


In [10]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
      ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

In [11]:
# Anthropic has made their site harder to scrape, so I'm using Hugging Face instead.

huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/posts',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/deepseek-ai/DeepSeek-R1',
 '/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
 '/deepseek-ai/DeepSeek-R1-Zero',
 '/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
 '/hexgrad/Kokoro-82M',
 '/models',
 '/spaces/tencent/Hunyuan3D-2',
 '/spaces/hexgrad/Kokoro-TTS',
 '/spaces/webml-community/deepseek-r1-webgpu',
 '/spaces/lllyasviel/iclight-v2',
 '/spaces/JeffreyXiang/TRELLIS',
 '/spaces',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/cais/hle',
 '/datasets/bespokelabs/Bespoke-Stratos-17k',
 '/datasets/HumanLLMs/Human-Like-DPO-Dataset',
 '/datasets/yale-nlp/MMVU',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/grammarly',
 '/Writer',
 '/docs/transformers',

In [12]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'},
  {'type': 'blog page', 'url': 'https://huggingface.co/blog'},
  {'type': 'community discussion page',
   'url': 'https://discuss.huggingface.co'},
  {'type': 'GitHub page', 'url': 'https://github.com/huggingface'},
  {'type': 'LinkedIn page',
   'url': 'https://www.linkedin.com/company/huggingface/'},
  {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}]}

## Second step: make the brochure!

Assemble all the details into another prompt to GPT-4o-mini.

In [13]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result
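
One thing to watch: the model can return a link that fails to load, and a single failed fetch would stop the whole function. Here is a sketch of a more forgiving loop. The names `gather_pages` and `fake_fetch` are hypothetical; in this notebook the `fetch` argument could be something like `lambda url: Website(url).get_contents()`.

```python
def gather_pages(links, fetch):
    """Concatenate page contents, skipping any page whose fetch raises."""
    result = ""
    for link in links:
        try:
            contents = fetch(link["url"])
        except Exception as error:  # broad on purpose: any failure skips just this page
            print(f"Skipping {link['url']}: {error}")
            continue
        result += f"\n\n{link['type']}\n{contents}"
    return result

# Stand-in fetcher so the sketch runs without network access
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("page could not be loaded")
    return f"Contents of {url}\n"

links = [
    {"type": "about page", "url": "https://example.com/about"},
    {"type": "careers page", "url": "https://example.com/broken"},
]
print(gather_pages(links, fake_fetch))
```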

In [14]:
print(get_all_details("https://urszulaczerwinska.github.io/"))

Found links: {'links': [{'type': 'about page', 'url': 'https://urszulaczerwinska.github.io/about/'}, {'type': 'works page', 'url': 'https://urszulaczerwinska.github.io/works/'}, {'type': 'careers page', 'url': 'https://urszulaczerwinska.github.io/thoughts/'}]}
Landing page:
Webpage Title:
Urszula Czerwinska | Python, TensorFlow Expert | Data Scientist & Deep Learning Engineer
Webpage Contents:
Urszula Czerwinska
Menu
Home
About
Works
Thoughts
RSS Feed
Urszula Czerwinska
Science → Data Science
Continue
Welcome to My Data Science Journey
Welcome to my personal blog, where I documented my transition from a PhD in Computational Biology to a thriving career in Data Science. Here, you’ll find insights on where to learn data science, how data science and machine learning, artificial intelligence are revolutionizing various fields, and explore whether data science can be self-taught, are data science bootcamps or courses worth it. Discover why data science is important in today’s world.
twitte

In [15]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a personal blog website \
and creates a short brochure about the profile for prospective customers, investors and recruiters. Respond in markdown. \
Include details of profile qualities, careers if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."


In [16]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a person called: {company_name}\n"
    user_prompt += f"Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
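
The slice above cuts at exactly 5,000 characters, which can land in the middle of a word or line. A small sketch of an alternative that backs up to the last complete line within the limit follows; the helper name `truncate_at_line` is hypothetical and this is just one possible refinement.

```python
def truncate_at_line(text, limit=5_000):
    """Return text unchanged if short enough, else cut at the last newline before limit."""
    if len(text) <= limit:
        return text
    cut = text.rfind("\n", 0, limit)
    # Fall back to a hard cut when there is no usable newline before the limit
    return text[:cut] if cut > 0 else text[:limit]

print(truncate_at_line("first line\nsecond line\nthird line", limit=15))
```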

In [17]:
get_brochure_user_prompt("Urszula Czerwinska", "https://urszulaczerwinska.github.io/")

Found links: {'links': [{'type': 'about page', 'url': 'https://urszulaczerwinska.github.io/about/'}, {'type': 'works page', 'url': 'https://urszulaczerwinska.github.io/works/'}, {'type': 'thoughts page', 'url': 'https://urszulaczerwinska.github.io/thoughts/'}, {'type': 'twitter', 'url': 'https://twitter.com/ulalaparis'}, {'type': 'github', 'url': 'https://github.com/urszulaczerwinska'}, {'type': 'linkedin', 'url': 'https://linkedin.com/in/urszulaczerwinska'}]}


"You are looking at a person called: Urszula Czerwinska\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nUrszula Czerwinska | Python, TensorFlow Expert | Data Scientist & Deep Learning Engineer\nWebpage Contents:\nUrszula Czerwinska\nMenu\nHome\nAbout\nWorks\nThoughts\nRSS Feed\nUrszula Czerwinska\nScience → Data Science\nContinue\nWelcome to My Data Science Journey\nWelcome to my personal blog, where I documented my transition from a PhD in Computational Biology to a thriving career in Data Science. Here, you’ll find insights on where to learn data science, how data science and machine learning, artificial intelligence are revolutionizing various fields, and explore whether data science can be self-taught, are data science bootcamps or courses worth it. Discover why data science is important in today’s world.\ntwitter\ngithub\nlinkedin-square\nThoughts\nWelcom

In [18]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [19]:
create_brochure("Urszula Czerwinska", "https://urszulaczerwinska.github.io/")

Found links: {'links': [{'type': 'about page', 'url': 'https://urszulaczerwinska.github.io/about/'}, {'type': 'works page', 'url': 'https://urszulaczerwinska.github.io/works/'}, {'type': 'twitter', 'url': 'https://twitter.com/ulalaparis'}, {'type': 'github', 'url': 'https://github.com/urszulaczerwinska'}, {'type': 'linkedin', 'url': 'https://linkedin.com/in/urszulaczerwinska'}]}


```markdown
# Brochure: Urszula Czerwinska - Data Scientist & Deep Learning Engineer

## Profile Overview
Urszula Czerwinska is an accomplished Senior Deep Learning Engineer and Data Scientist based in Paris, with a unique background transitioning from a PhD in Computational Biology to becoming a key innovator in the field of Artificial Intelligence and Data Science. 

## Contact Information
- **Email:** urszula.czerwinska@cri-paris.org
- **Twitter:** [@urszulaczerwinska](https://twitter.com/urszulaczerwinska)
- **GitHub:** [UlaLaParis](https://github.com/UlaLaParis)
- **LinkedIn:** [Urszula Czerwinska](https://linkedin.com/in/urszulaczerwinska)

## Career Highlights
- **Current Position:** Senior Deep Learning Engineer, Data Scientist at Adevinta, Paris
- **Key Skills:** 
  - Expertise in Python, R, TensorFlow, and PyTorch
  - Strong foundation in complex systems, machine learning, and deep learning
  - Specialization in computer vision and multimodal AI models
- **Achievements:**
  - Led the enhancement of scene-text extraction capabilities by 33%
  - Optimized multi-class classification models to increase accuracy by 45%
  - Head of Adevinta ML Guild, fostering a culture of continuous learning and innovation

## Academic Background
- PhD in Computational Biology, focusing on cancer and data analysis
- Extensive education across France and Singapore, enriching her interdisciplinary approach to problem-solving

## Areas of Expertise
- **Data Science & Machine Learning:** Insights on how data science is evolving and its importance in today's world including discussions on self-taught methods, bootcamps, and industry trends.
- **Deep Learning:** Specializing in projects involving Natural Language Processing (NLP), model explainability, and recent advancements in Computer Vision and Generative AI.
- **Industry Solutions:** Provides consulting services in deep learning technologies, helping businesses innovate and solve complex problems.

## Portfolio & Works
Urszula has documented her works showcasing:
- Research in cancer computational biology
- Hands-on projects at the French Supreme Court 
- Projects focused on NLP (Named Entity Recognition) and modern machine learning algorithms

### Notable Projects
- **PaddleOCR Inference:** An exploration of using PaddleOCR as a Text in Image service.
- **Named Entity Recognition Tool:** Developed for the Open Data of the French Supreme Court.
- **SHAP Library Mastery:** A comprehensive guide on push the limits of machine learning explainability.

## Thoughts & Insights
On her blog, Urszula shares reflections from her career journey, evaluations of educational opportunities in data science, and insights into the latest SOTA deep learning trends.

## Connecting with Urszula
Urszula invites businesses and individuals interested in leveraging deep learning technologies to connect and explore collaboration possibilities. 

> "Let’s talk about how deep learning can help your business or solve your problem." 
```


## Finally - a minor improvement

With a small adjustment, we can make the results stream back from OpenAI, with the familiar typewriter animation.

In [20]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
        stream=True
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        # Strip code fences for display only, so the word "markdown" survives inside the brochure text
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)
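
The accumulation loop can be tried without calling the API. In this sketch a hypothetical list of text fragments stands in for the `chunk.choices[0].delta.content` values, with `None` mimicking a chunk that carries no text; the function name `accumulate` is an assumption made for illustration.

```python
def accumulate(deltas):
    """Yield the running text after each streamed fragment."""
    response = ""
    for delta in deltas:
        response += delta or ""  # a fragment may be None, so fall back to an empty string
        yield response

# Hypothetical fragments standing in for streamed deltas
fake_deltas = ["# Bro", "chure", None, "\nHello"]
for snapshot in accumulate(fake_deltas):
    print(repr(snapshot))
```

Each snapshot is what would be passed to `update_display`, which is why the text appears to grow on screen.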

In [22]:
stream_brochure("Urszula Czerwinska", "https://urszulaczerwinska.github.io/")

Found links: {'links': [{'type': 'about page', 'url': 'https://urszulaczerwinska.github.io/about/'}, {'type': 'works page', 'url': 'https://urszulaczerwinska.github.io/works/'}, {'type': 'GitHub profile', 'url': 'https://github.com/urszulaczerwinska'}, {'type': 'LinkedIn profile', 'url': 'https://linkedin.com/in/urszulaczerwinska'}]}


# Urszula Czerwinska - Data Scientist & Deep Learning Engineer

---

## Introduction

Urszula Czerwinska is a seasoned **Data Scientist** and **Deep Learning Engineer** with a profound background in **Computational Biology**. Based in Paris, her work integrates the complexities of data science and cutting-edge deep learning techniques to foster innovation in various industries, particularly e-commerce and AI solutions.

---

## Professional Profile

### Expertise:
- **Data Science & Machine Learning**: Specializes in translating complex datasets into actionable insights.
- **Deep Learning**: Skilled in frameworks such as **TensorFlow** and **PyTorch**.
- **Computer Vision**: Extensive work on scene-text extraction and multimodal AI models.
- **Natural Language Processing (NLP)**: Experience in named entity recognition and model explainability (SHAP).

### Education:
- **Ph.D. in Computational Biology** from a reputable institution. Focus on cancer research and computational methodologies.

### Skills:
- Proficient in programming with **Python** and **R**.
- Expertise in statistics, complex systems, and model optimization.
- Strong leadership capabilities, fostering collaboration among teams.

---

## Career Highlights

Currently serving as a **Senior Deep Learning Engineer** at **Adevinta**, Urszula has successfully:
- Spearheaded advancements in **scene-text extraction**, boosting capabilities by **33%**.
- Optimized **multi-class classification models**, increasing accuracy by **45%**.
- Led the **Adevinta ML Guild**, promoting knowledge sharing and collaborative learning.

---

## Notable Projects
- **Cancer Computational Biology**: Pioneered research focusing on dimension reduction techniques.
- **NLP Projects at French Supreme Court**: Developed tools for pseudonymization of Open Data in legal decisions.
- **Computer Vision & Generative AI**: Engaged in innovative projects that apply deep learning solutions to real-world challenges.

---

## Community Engagement
Urszula actively contributes to the data science community by sharing insights through her blog. She discusses:
- The transition from academia to industry.
- Evaluations of data science bootcamps and courses.
- Current trends in deep learning and AI.

---

## Get in Touch
Curious about how Urszula can help your business with deep learning solutions, or want to discuss the evolving landscape of data science? You can connect with her:

- **Email**: [urszula.czerwinska@cri-paris.org](mailto:urszula.czerwinska@cri-paris.org)
- **LinkedIn**: [Urszula Czerwinska](https://www.linkedin.com/in/urszulaczerwinska/)
- **GitHub**: [ulalaparis](https://github.com/ulalaparis)
- **Twitter**: [@UrszulaC](https://twitter.com/UrszulaC)

---

## Conclusion
Urszula Czerwinska represents the intersection of academia and industry, leveraging her extensive knowledge to drive innovation in data science and deep learning. With a passion for problem-solving and a commitment to advancing technology, she stands as a remarkable asset for prospective employers, clients, and partners alike.

--- 

Thank you for considering Urszula Czerwinska as a potential collaborator or expert consultant in your data science needs!

In [None]:
# Try changing the system prompt to the humorous version when you make the Brochure for Hugging Face:

stream_brochure("HuggingFace", "https://huggingface.co")

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../business.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#181;">Business applications</h2>
            <span style="color:#181;">In this exercise we extended the Day 1 code to make multiple LLM calls, and generate a document.

This is perhaps the first example of Agentic AI design patterns, as we combined multiple calls to LLMs. This will feature more in Week 2, and then we will return to Agentic AI in a big way in Week 8 when we build a fully autonomous Agent solution.

Generating content in this way is one of the most common use cases. As with summarization, this can be applied to any business vertical. Write marketing content, generate a product tutorial from a spec, create personalized email content, and so much more. Explore how you can apply content generation to your business, and try making yourself a proof-of-concept prototype.</span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Before you move to Week 2 (which is tons of fun)</h2>
            <span style="color:#900;">Please see the week1 EXERCISE notebook for your challenge for the end of week 1. This will give you some essential practice working with Frontier APIs, and prepare you well for Week 2.</span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../resources.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#f71;">A reminder on 2 useful resources</h2>
            <span style="color:#f71;">1. The resources for the course are available <a href="https://edwarddonner.com/2024/11/13/llm-engineering-resources/">here.</a><br/>
            2. I'm on LinkedIn <a href="https://www.linkedin.com/in/eddonner/">here</a> and I love connecting with people taking the course!
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../thankyou.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#090;">Finally! I have a special request for you</h2>
            <span style="color:#090;">
                My editor tells me that it makes a MASSIVE difference when students rate this course on Udemy - it's one of the main ways that Udemy decides whether to show it to others. If you're able to take a minute to rate this, I'd be so very grateful! And regardless - always please reach out to me at ed@edwarddonner.com if I can help at any point.
            </span>
        </td>
    </tr>
</table>