<p align="center">
  <img src="../resources/Pankajtechblogs.png" alt="Pankaj Tech Blogs" width="320"/>
</p>

# Welcome to the first notebook

## What are we going to achieve and learn here

This notebook shows how to scrape the text content of a website using Python (with an existing library) and then summarize that content using OpenAI's GPT model. It covers:

- Importing required libraries for web requests, HTML parsing, environment management, and OpenAI API access.
- Loading the OpenAI API key from environment variables.
- Defining a class to fetch and extract text from a given website URL.
- Displaying the scraped text in a Jupyter notebook.
- Sending the website content to an OpenAI GPT model with a custom prompt to generate a concise, markdown-formatted summary.
- Displaying the generated summary in the notebook.
- The workflow is: scrape → display raw text → summarize with LLM → display summary.
- As an example, I will use my own tech blog website, https://pankajtechblogs.dev/. Please feel free to visit if you have not already. :)

<p>
  <img src="../resources/highlevel-idea-website-summarization.png" alt="High Level Design" width="320"/>
</p>

*As a first step, let's import the necessary Python libraries and packages that will assist us throughout our code.*

**Brief summary of what we are going to use:**

- **os:** Provides functions to interact with the operating system, such as reading environment variables or file paths.

- **requests:** Allows you to send HTTP requests easily, useful for fetching data from web APIs or websites.

- **dotenv.load_dotenv:** Loads environment variables from a .env file into your environment, often used for managing secrets like API keys.

- **bs4.BeautifulSoup:** Parses and extracts data from HTML or XML documents, commonly used for web scraping.

- **IPython.display.display, Markdown:** Lets you display rich content (like formatted Markdown) directly in Jupyter notebooks.

- **openai.OpenAI:** Provides access to the OpenAI API, enabling you to interact with models like GPT for tasks such as text generation or summarization.


In [1]:
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import display, Markdown
from openai import OpenAI

The next step is to connect to OpenAI. Remember to set up your own personal OPENAI_API_KEY as an environment variable so that you can connect to frontier models. Here we will use gpt-4o-mini.

In [4]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    print("No API key was found")
else:
    print("API key found and looks good so far!")

openai = OpenAI()

API key found and looks good so far!
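
A slightly more cautious variant of the check above is sketched here. It reports whether the key is present while showing only its last four characters, so the secret does not end up in saved notebook output. The helper name `describe_api_key` is hypothetical and not part of any library.

```python
import os


def describe_api_key(key):
    """Return a status message that never reveals the full key."""
    if not key:
        return "No API key was found"
    # Show only the last four characters so the secret stays out of the output
    masked = "*" * 8 + key[-4:]
    return f"API key found: {masked}"


# OPENAI_API_KEY is the variable name used in this notebook
print(describe_api_key(os.getenv("OPENAI_API_KEY")))
```

Call `load_dotenv(override=True)` first, as in the cell above, if the key lives in a `.env` file.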


Let's talk to the OpenAI model and get started with our first prompt message.

We will pass the message as shown below and call the `openai.chat.completions` API to retrieve the response from the specified LLM.

Notice that we pass certain arguments here. Let me explain each of them and their usage.

model: defines which LLM to use, as there is a variety of models available.

messages: a list that defines the conversation history between the user and the assistant (the model). Each message object has a role ("user", "assistant" or "system") and content (the text we want to send to the model, usually known as the prompt).

This structure allows you to provide context to the model, so it can generate responses that are relevant to the ongoing conversation. For example:
```
messages=[
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm good, how can I help you?"},
    {"role": "user", "content": "Can you help me scrape a website?"}
]
```
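
One way to assemble such a list as a conversation grows is sketched here. The helper `build_messages` and the system prompt text are hypothetical, introduced only for illustration. The dictionary shape (`role` and `content`) is the one described above.

```python
def build_messages(history, new_user_message, system_prompt=None):
    """Return a new list: optional system prompt, prior turns, then the new user turn."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history)  # earlier user/assistant turns, oldest first
    messages.append({"role": "user", "content": new_user_message})
    return messages


history = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm good, how can I help you?"},
]
messages = build_messages(
    history,
    "Can you help me scrape a website?",
    system_prompt="You are a helpful assistant.",
)
for m in messages:
    print(m["role"], "->", m["content"])
```

The helper builds a fresh list each time, so the stored `history` is not modified by the call.
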
In the call further below, we pass a single user message, but we can also include multiple messages to give the model more context about the conversation.

temperature: controls the randomness or creativity of the model's responses. The OpenAI API accepts values from 0 to 2. A lower value (e.g. 0.2) makes the output more focused and deterministic, while a higher value (e.g. 0.8) makes it more random and creative.
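
To build some intuition for this, here is a conceptual sketch of how temperature is commonly applied when sampling: the model's raw scores are divided by the temperature before being turned into probabilities. This is an illustration only, not OpenAI's actual implementation, and the scores are made up. The sketch rejects a temperature of 0 because it would divide by zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaling by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability lands on the top-scoring token, while at a higher temperature the probabilities move closer together, so less likely tokens are picked more often.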

max_tokens: sets the maximum number of tokens (words or word pieces) in the generated response. It limits how long the model's reply can be, and a reply that reaches the limit is cut off. For example, max_tokens=150 means the response will not exceed 150 tokens.
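
A reply that hits the limit stops mid-sentence, so it can help to detect that case. Chat completion responses carry a `finish_reason` on each choice, and the value `"length"` indicates the token limit was reached. The sketch below uses a stand-in object instead of a real API call so it runs offline. The helper name `was_truncated` is hypothetical.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in shaped like a chat completion response, so no API call is needed
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(
                content="Absolutely! I can help you with the general steps"
            ),
        )
    ]
)

if was_truncated(fake_response):
    print("Reply was cut off; consider raising max_tokens.")
```

With a real response, pass the object returned by `openai.chat.completions.create(...)` to the same helper.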

In [5]:
message = "Hello GPT, I want to scrape a website and get the text from it. Can you help me with that?"
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": message}
    ],
    max_tokens=10,
    temperature=0
)
print(response.choices[0].message.content)

Absolutely! I can help you with the general steps


Now let's define a class that will:
  1. Take a URL as input
  2. Send an HTTP GET request to retrieve the webpage
  3. Parse the HTML content using the BeautifulSoup Python library that we imported above
  4. Extract all the text from the page
  5. Store it in the `content` attribute

If there is an error during the request, it prints an error message and sets `content` to None.

In [24]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class WebsiteScraper:
    def __init__(self, url):
        try:
            self.url = url
            response = requests.get(url, headers=headers)
            response.raise_for_status()  # Raise an error for bad responses
            soup = BeautifulSoup(response.content, 'html.parser')
            content = soup.get_text()
            self.content = content
        except requests.exceptions.RequestException as e:
            print(f"Error fetching the website: {e}")
            self.content = None
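
The class above keeps every piece of text on the page, including anything inside `<script>` and `<style>` tags. As a dependency-free illustration of the extraction step, here is a sketch that uses Python's standard-library `html.parser` instead of BeautifulSoup and skips those two tags. It is an alternative technique shown for comparison, not the method used in this notebook. The names `TextExtractor` and `extract_text` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from HTML, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        # One text fragment per line keeps neighbouring elements apart
        return "\n".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
)
print(extract_text(sample_html))
```

Joining the fragments with newlines also avoids words from neighbouring elements running together in the extracted text.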

Now let's write a function that displays the text. We define it as display_text(text), which is self-explanatory.

Next, define a main function that creates a WebsiteScraper for the URL and passes its content attribute to the display_text function we just created, which renders it as Markdown in the notebook.

`if __name__ == "__main__":` This block ensures that main() runs only when the script is executed directly (not imported as a module). After running main, it prints completion and help messages for the user.

In [27]:
def display_text(text):
    if text:
        display(Markdown(text))
    else:
        print("No text to display.")

def main():
    url = "https://pankajtechblogs.dev/"
    website_data = WebsiteScraper(url)
    display_text(website_data.content)
    return website_data.content

if __name__ == "__main__":
    main()
    print("Done!")
    print("You can now use the text from the website. \n If you need further assistance, feel free to ask!")

pankajtechblogsHomepageOpen in appSign inGet startedPankaj Tech BlogsSharing the learningsFollowFollowingOrder Processing System — using Event Driven ArchitectureOrder Processing System — using Event Driven Architecture1. IntroductionPankaj SharmaMar 13Hubble Observability with Cilium — KubernetesHubble Observability with Cilium — KubernetesIn continuation with the previous blog on Zero Trust Networking with Cilium on Kubernetes, let us see how we can establish observability…Pankaj SharmaApr 11, 2024Cilium Network Policy (CNP)—Zero Trust Networking — KubernetesCilium Network Policy (CNP)—Zero Trust Networking — KubernetesNetworking is at the heart of Kubernetes (K8s). When it comes to the connectivity, we need to look forward at the 4 typical use cases (as…Pankaj SharmaApr 11, 2024Spring cloud config server — Auto reload config properties — zero-touchSpring cloud config server — Auto reload config properties — zero-touchIn a distributed system, Spring Cloud Config provides server-side and client-side support for externalized configuration.Pankaj SharmaSep 4, 2023JWT validation — Single Usage — One-time validJWT validation — Single Usage — One-time validSigned JSON Web Token (JWT) is an industry-standard method for exchanging claims securely between two parties. And for…Pankaj SharmaAug 21, 2023Split-key encryption- Securing the data at restSplit-key encryption- Securing the data at restPankaj SharmaAug 17, 2022Disaster Recovery Strategies for Cloud ApplicationsDisaster Recovery Strategies for Cloud ApplicationsWhen we talk about Disaster Recovery aka DR, as a first thought we always think for multi-region replication of workloads. This is true to…Pankaj SharmaJan 25, 2022Database per service — Microservices Design PatternDatabase per service — Microservices Design PatternAny enterprise application will need to persist data in some or another way. 
For temporary purposes maybe it uses the Cache layer, and for…Pankaj SharmaAug 1, 2021Cloud Migration Strategy — On-Premises to CloudCloud Migration Strategy — On-Premises to CloudLet’s discuss — what is the cloud migration strategy to move the on-prem resources/servers/applications/systems to the cloud.Pankaj SharmaJul 25, 2021Spring Boot Application Monitoring using Prometheus + GrafanaSpring Boot Application Monitoring using Prometheus + GrafanaPankaj SharmaJul 19, 2021Why to use Circuit Breaker Pattern? — Microservices Design PatternWhy to use Circuit Breaker Pattern? — Microservices Design PatternIn the microservices architecture, multiple services are deployed onto the cluster(s), and these services may have Inter-service…Pankaj SharmaJul 16, 2021How to edit an Apigee Proxy?How to edit an Apigee Proxy?Welcome back, this is a continuation of my previous blog —  https://iampankajsharma.medium.com/how-to-create-apigee-api-proxy-219fa2df1425Pankaj SharmaJul 10, 2021How to create Apigee API Proxy?How to create Apigee API Proxy?Welcome back, this is a continuation of my previous blog — https://iampankajsharma.medium.com/why-to-use-an-api-gateway-b36c9988f581Pankaj SharmaJul 10, 2021Why to use an API Gateway?Why to use an API Gateway?Gateway — Chowkidaar (in the Hindi language), that will look after our home for safety and protecting us in some way. Homes in technical…Pankaj SharmaJul 8, 2021About pankajtechblogsLatest StoriesArchiveAbout MediumTermsPrivacyTeams

Done!
You can now use the text from the website. 
 If you need further assistance, feel free to ask!


Now we will pass the scraped content to our LLM to generate a nice summary of it.

As discussed above, we interact with the LLM through instructions that are passed in a particular way.

**SystemPrompt** - tells the model what task it has to perform and what tone it should use, such as business, formal, or funny.

**UserPrompt** - the conversation starter that the user provides.

Let us define each of them here.

In [28]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be related to navigation or links. \
Response should be in markdown."

def user_prompt_for(website_text):
    user_prompt = "You are looking at a website\nThe contents of this website is as follows \n \
please provide a short summary of this website in markdown. \n"
    user_prompt += website_text
    return user_prompt

In [31]:
print(user_prompt_for(WebsiteScraper("https://pankajtechblogs.dev/").content))

You are looking at a website
The contents of this website is as follows 
 please provide a short summary of this website in markdown. 
pankajtechblogsHomepageOpen in appSign inGet startedPankaj Tech BlogsSharing the learningsFollowFollowingOrder Processing System — using Event Driven ArchitectureOrder Processing System — using Event Driven Architecture1. IntroductionPankaj SharmaMar 13Hubble Observability with Cilium — KubernetesHubble Observability with Cilium — KubernetesIn continuation with the previous blog on Zero Trust Networking with Cilium on Kubernetes, let us see how we can establish observability…Pankaj SharmaApr 11, 2024Cilium Network Policy (CNP)—Zero Trust Networking — KubernetesCilium Network Policy (CNP)—Zero Trust Networking — KubernetesNetworking is at the heart of Kubernetes (K8s). When it comes to the connectivity, we need to look forward at the 4 typical use cases (as…Pankaj SharmaApr 11, 2024Spring cloud config server — Auto reload config properties — zero-touchSp
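
Since user_prompt_for appends the whole page text, a very long page could produce a prompt larger than the model accepts. One possible guard is sketched here, assuming a simple character limit is good enough for the purpose. The helper `truncate_for_prompt` and its default of 8000 characters are hypothetical choices, not values taken from the OpenAI documentation.

```python
def truncate_for_prompt(text, max_chars=8000):
    """Trim text to at most max_chars, cutting at the last space where possible."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    last_space = clipped.rfind(" ")
    if last_space > 0:
        clipped = clipped[:last_space]  # avoid ending in the middle of a word
    return clipped + " [truncated]"


long_text = "word " * 50
print(truncate_for_prompt(long_text, max_chars=22))
```

The trimmed text could then be passed to user_prompt_for in place of the full page content.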

In [None]:
def messages_for(website_text):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website_text)}
    ]

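def summarize(url):
    website = WebsiteScraper(url)
    # WebsiteScraper sets content to None when the request fails
    if website.content is None:
        return f"Could not fetch {url}, so there is nothing to summarize."
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages_for(website.content),
        max_tokens=1000,
        temperature=0
    )
    return response.choices[0].message.content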


def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))
    

display_summary("https://pankajtechblogs.dev/")
print("Done!")
print("Here is the summary. \n If you need further assistance, feel free to ask!")

# Summary of Pankaj Tech Blogs

Pankaj Tech Blogs is a platform where Pankaj Sharma shares insights and learnings on various technology topics, primarily focusing on software development and cloud computing. The blog features a range of articles covering:

- **Event Driven Architecture**: Discussing order processing systems.
- **Kubernetes**: Exploring observability with Cilium and implementing Zero Trust Networking.
- **Spring Cloud**: Providing guidance on auto-reloading configuration properties in distributed systems.
- **Security**: Covering JWT validation and split-key encryption for data protection.
- **Disaster Recovery**: Strategies for cloud applications and multi-region workload replication.
- **Microservices**: Discussing design patterns like database per service and the circuit breaker pattern.
- **Cloud Migration**: Strategies for transitioning from on-premises to cloud environments.
- **Monitoring**: Techniques for monitoring Spring Boot applications using Prometheus and Grafana.
- **API Management**: Instructions on creating and editing Apigee API proxies.

The blog serves as a resource for developers and IT professionals looking to enhance their knowledge in these areas.

Done!
Here is the summary. 
 If you need further assistance, feel free to ask!


> The summary above was generated by the LLM.

# **Well, now that we learnt some cool stuff :-)**

### Let's also see the practical business use cases for the workflow demonstrated in this notebook:

**Content Summarization:**
Automatically summarize long articles, reports, or web pages for newsletters, executive briefs, or dashboards.

**Market & Competitor Monitoring:**
Scrape competitor websites or industry news portals and generate concise summaries for market intelligence.

**Customer Support:**
Summarize FAQ pages, documentation, or support forums to provide quick answers to customer queries.

**Research Automation:**
Aggregate and summarize research papers, blogs, or news articles for analysts or researchers.

**SEO & Content Curation:**
Curate and summarize trending topics or blog posts for content marketing and SEO teams.

**Internal Knowledge Management:**
Scrape and summarize internal wikis, policy documents, or meeting notes for easy reference.

**Regulatory & Compliance Monitoring:**
Summarize updates from regulatory bodies’ websites to keep compliance teams informed.

**Product/Service Monitoring:**
Track and summarize reviews or feedback from product pages or forums.

This AI workflow saves time, improves information accessibility, and enables faster, data-driven decision-making across business functions such as those listed above.
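
Several of these use cases, such as competitor or regulatory monitoring, come down to running the same summarization over a list of pages. A minimal sketch of that loop follows. The helper `summarize_many`, the placeholder `fake_summarize`, and the second URL are hypothetical and exist only so the example runs offline. In the notebook, the `summarize` function defined earlier could be passed in instead.

```python
def summarize_many(urls, summarize_fn):
    """Return {url: summary or error message}, continuing past failures."""
    results = {}
    for url in urls:
        try:
            results[url] = summarize_fn(url)
        except Exception as exc:  # keep going if one site fails
            results[url] = f"Failed: {exc}"
    return results


def fake_summarize(url):
    # Placeholder so the sketch runs without network or API access
    if "broken" in url:
        raise ValueError("could not fetch page")
    return f"Summary of {url}"


report = summarize_many(
    ["https://pankajtechblogs.dev/", "https://broken.example/"],
    fake_summarize,
)
for url, summary in report.items():
    print(url, "->", summary)
```

Passing the summarizer in as an argument keeps the loop independent of any particular model or scraper.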


### Thanks for staying with me till here. Happy Learning. Let's DIY (Do it yourself)!