# Website Analysis and Summarization with Selenium and OpenAI

> This notebook demonstrates how to extract and summarize the main content of a website, using Selenium to render dynamic pages and OpenAI to generate a concise summary.

## Overview
This notebook provides a workflow that loads a web page in a browser, extracts its visible text, and generates a short summary with a language model. Scripts, styles, images, and form inputs are stripped before the text is sent to the model, and the prompt asks for any news or announcements to be summarized as well.

## Features
- Extracts relevant text from web pages using Selenium and BeautifulSoup.
- Generates automatic summaries using OpenAI's language models.
- Presents results in markdown format.

## Requirements
- Python 3.8+
- Google Chrome browser installed
- The following Python packages:
  - selenium
  - webdriver-manager
  - undetected-chromedriver
  - beautifulsoup4
  - openai
  - python-dotenv
  - requests
- An OpenAI API key (project key, starting with `sk-proj-`)
- Internet connection

## How to Use
1. Install the required packages:
   ```bash
   pip install selenium webdriver-manager undetected-chromedriver beautifulsoup4 openai python-dotenv requests
   ```
2. Add your OpenAI API key to a `.env` file as `OPENAI_API_KEY`.
3. Run the notebook cells in order. You can change the target website URL in the code to analyze different sites.
4. The summary will be displayed in markdown format below the code cell.

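**Note:** Some websites may block automated access. The `Website` class sets a desktop user-agent and a fixed window size so that the headless browser looks more like a regular one, and `undetected-chromedriver` is installed as an alternative driver, but results will vary with each site's protections.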

---

In [8]:
!pip install selenium webdriver-manager undetected-chromedriver





In [12]:
import sys
print(sys.executable)


/Users/devanshuprakash/projects/llm_engineering/.venv/bin/python


In [15]:
import selenium
print("✅ Selenium imported successfully!")


✅ Selenium imported successfully!


In [16]:
# Imports
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import undetected_chromedriver as uc

In [17]:
# Load the environment variables from .env
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")


API key found and looks good so far!
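
The checks in the cell above can also be wrapped in a small function so they are easy to exercise with sample values. The following is a minimal sketch, not part of the original notebook: `check_api_key` and its status strings are hypothetical names. It also tests for surrounding whitespace before the prefix, on the assumption that a key with a leading space should be reported as a whitespace problem rather than a prefix problem.

```python
def check_api_key(api_key):
    """Return "ok" or a short label for the first problem found.

    Hypothetical helper; the labels are illustrative, not an OpenAI convention.
    """
    if not api_key:
        return "missing"
    # Check whitespace first: " sk-proj-..." would otherwise fail the prefix test.
    if api_key.strip() != api_key:
        return "whitespace"
    if not api_key.startswith("sk-proj-"):
        return "unexpected-prefix"
    return "ok"


# Sample values only; none of these are real keys.
for candidate in (None, " sk-proj-abc", "sk-abc", "sk-proj-abc"):
    print(repr(candidate), "->", check_api_key(candidate))
```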


In [18]:
openai = OpenAI()

In [20]:
class Website:
    def __init__(self, url, headless=True, wait_time=10):
        self.url = url  # Website URL to analyze
        self.title = None  # Title of the website
        self.text = None  # Extracted text from the website
        
        # Chrome options configuration for Selenium
        options = Options()
        if headless:
            options.add_argument("--headless=new")  # Run Chrome in headless mode (no window)
        options.add_argument("--disable-gpu")  # Disable GPU acceleration
        options.add_argument("--no-sandbox")  # Disable Chrome sandbox (required for some environments)
        options.add_argument("--window-size=1920,1080")  # Set window size to simulate a real user
        # Simulate a real user-agent to avoid bot detection
        options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
        
        # Initialize Chrome WebDriver
        self.driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
        
        try:
            # Open the URL inside the try block so the browser is closed even if loading fails
            self.driver.get(url)
            # Wait until the <body> element is present in the page
            WebDriverWait(self.driver, wait_time).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
            html = self.driver.page_source  # Get the full HTML of the page
            soup = BeautifulSoup(html, 'html.parser')  # Parse HTML with BeautifulSoup
            # soup.title.string can be None (e.g. an empty <title>), so fall back in that case too
            self.title = soup.title.string if soup.title and soup.title.string else 'No title found'
            if soup.body:
                # Remove irrelevant elements from the body
                for irrelevant in soup.body(["script", "style", "img", "input"]):
                    irrelevant.decompose()
                # Extract clean text from the body
                self.text = soup.body.get_text(separator='\n', strip=True)
            else:
                self.text = "No body found"  # If no body is found, indicate it
        except Exception as e:
            print(f"Error accessing the site: {e}")  # Print error to console
            self.text = "Error accessing the site"  # Store error in the attribute
        finally:
            self.driver.quit()  # Always close the browser, whether or not an error occurred
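
The text-cleaning step in `Website` can be tried on a small HTML string without launching a browser. The sketch below is not the notebook's BeautifulSoup code: it swaps in the standard library's `html.parser` so that it runs with no extra packages, and `BodyTextExtractor`, `extract_text`, and `sample_html` are hypothetical names. It keeps only text inside `<body>` and skips the contents of `<script>` and `<style>`; `<img>` and `<input>` carry no text, so they need no special handling here.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text inside <body>, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._in_body and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the body text of an HTML string, one chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = (
    "<html><head><title>Demo</title></head><body>"
    "<h1>Hello</h1><script>var x = 1;</script><p>World</p>"
    "<img src='a.png'></body></html>"
)
print(extract_text(sample_html))
```

For real pages, BeautifulSoup as used in the class above is more forgiving of malformed markup; this version is only meant to make the idea easy to inspect.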

In [32]:
system_prompt = "You are an assistant that SOLVES THE PROBLEM GIVEN on a website and gives code for that problem"

In [33]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt
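
`user_prompt_for` appends the whole page text, which can be very long for some sites. One possible safeguard, shown here as a sketch rather than something the notebook does, is to cap the text at a character budget before building the prompt. `MAX_CHARS`, `truncate_text`, and `user_prompt_with_limit` are hypothetical names, the 8000-character default is an arbitrary assumption, and the `SimpleNamespace` object only stands in for a `Website` instance.

```python
from types import SimpleNamespace

MAX_CHARS = 8000  # assumed budget; tune for the model and page sizes you work with


def truncate_text(text, max_chars=MAX_CHARS):
    """Return text unchanged if it fits, otherwise cut it and mark the cut."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[truncated]"


def user_prompt_with_limit(website, max_chars=MAX_CHARS):
    """Build the same prompt as user_prompt_for, but with bounded page text."""
    prompt = f"You are looking at a website titled {website.title}"
    prompt += (
        "\nThe contents of this website are as follows; "
        "please provide a short summary of this website in markdown. "
        "If it includes news or announcements, then summarize these too.\n\n"
    )
    prompt += truncate_text(website.text, max_chars)
    return prompt


# Stand-in for a Website instance: only .title and .text are needed.
page = SimpleNamespace(title="Demo", text="x" * 20)
print(user_prompt_with_limit(page, max_chars=5))
```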

In [34]:
# Creates messages for the OpenAI API
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [36]:
# Creates a summary for the given URL
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content
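
`summarize` both launches a browser and calls the API, so it cannot be checked offline. A possible refactoring, sketched here under assumptions, passes the client in as a parameter so a stand-in can replace it during testing. `summarize_messages`, `FakeCompletions`, and `fake_client` are hypothetical names; the fake only mimics the `client.chat.completions.create(...)` call and the `response.choices[0].message.content` shape used in the cell above, and never contacts OpenAI.

```python
from types import SimpleNamespace


def summarize_messages(client, messages, model="gpt-4o-mini"):
    """Send prepared messages to a chat-completions style client and return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


class FakeCompletions:
    """Stand-in for client.chat.completions; returns a canned reply."""

    def create(self, model, messages):
        reply = f"{model} received {len(messages)} messages"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))

demo_messages = [
    {"role": "system", "content": "You summarize websites."},
    {"role": "user", "content": "Page text goes here."},
]
result = summarize_messages(fake_client, demo_messages)
print(result)
```

With the real client, the call would become `summarize_messages(openai, messages_for(website))`, using the `openai` object and `messages_for` function defined earlier in the notebook.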

In [37]:
# Shows the summary for the given URL
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [38]:
display_summary("https://my.newtonschool.co/playground/code/4f3on5rrmvez")

# Newton School Summary

Newton School is an online platform designed to help individuals develop their tech careers. The website offers:

- **Coding Education**: A Full Stack Web Development course aimed at teaching coding skills.
- **Job Placement Opportunities**: Assistance in securing jobs with salaries ranging from 5 to 40 LPA (Lakhs per Annum).
- **Engagement**: Monthly coding contests to enhance skills and promote learning through competition.

### Registration
Users can register by providing their full name, email ID, and phone number, after which they will receive an OTP for verification.

### Legal and Support
The website includes links to its Terms & Conditions, Privacy Policy, and other relevant support resources.

**Copyright Notice**: © 2025 Incanus Technologies Pvt. Ltd. All rights reserved.

*Note: For optimal functionality, users are encouraged to enable JavaScript in their browsers.*