# Web Page Text Scraper → LiveScores

This notebook:
- Scrapes **JavaScript-rendered** pages with Selenium
- Extracts **text only**
- Sends the extracted text to an OpenAI model for a **concise, well-structured summary**
- Asks the model to use **tables where helpful** (the summary is rendered as Markdown, including any tables)

## Setup
Ensure your API key is set as an environment variable before running:
- `OPENAI_API_KEY`
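
Checking for the key up front can surface a missing variable before any scraping starts. The following is a minimal sketch using only the standard library; the helper name `require_api_key` is hypothetical and is not part of the OpenAI client.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key
```

Calling `require_api_key()` in the first code cell would fail fast with a readable message instead of an authentication error later on.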


In [None]:
# If needed, install dependencies
# !pip install -q selenium webdriver-manager openai


## Imports & Scraper (Text Only)


In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time
from IPython.display import Markdown, display


def scrape_js_page_text(url: str, wait_time: int = 5) -> dict:
    """Scrape a JavaScript-rendered page and return only clean text."""
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1920,1080')

    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)

    try:
        driver.get(url)
        time.sleep(wait_time)

        page_title = driver.title
        page_text = driver.find_element(By.TAG_NAME, 'body').text

        return {
            "title": page_title,
            "text": page_text,
            "url": url,
        }
    finally:
        driver.quit()
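
The fixed `time.sleep(wait_time)` above always waits the full duration, even when the page renders sooner. One possible alternative is to poll until the body text is non-empty. The sketch below is stdlib-only and independent of Selenium; `wait_for_text` and its parameters are hypothetical names introduced for illustration. Selenium's own `WebDriverWait` offers a built-in way to do the same kind of polling.

```python
import time


def wait_for_text(get_text, timeout=10.0, poll=0.5, min_chars=1):
    """Poll get_text() until it returns at least min_chars non-blank characters.

    Returns the text as soon as it is long enough; raises TimeoutError
    if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        text = get_text() or ""
        if len(text.strip()) >= min_chars:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no text after {timeout} seconds")
        time.sleep(poll)
```

Inside `scrape_js_page_text`, this could replace the sleep with something like `wait_for_text(lambda: driver.find_element(By.TAG_NAME, 'body').text, timeout=wait_time)`. A non-empty body does not guarantee that every score has loaded, so a larger `min_chars` may be needed for heavily dynamic pages.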


## Scrape a URL


In [None]:
url = "https://livescore.com"  # <- change this

scraped_data = scrape_js_page_text(url, wait_time=10)

# Optional preview
scraped_data["text"][:1000]
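
Scraped body text often contains blank lines and uneven spacing, which eat into the character budget applied before summarization. A light cleanup pass, sketched below, normalizes whitespace and drops empty lines. The helper name `clean_page_text` is hypothetical. It deliberately does not remove duplicate lines, because repeated values (for example the two halves of a 1-1 score) can legitimately appear on adjacent lines.

```python
def clean_page_text(text):
    """Collapse runs of whitespace within each line and drop blank lines."""
    cleaned = []
    for line in text.splitlines():
        line = " ".join(line.split())  # normalize spaces and tabs inside the line
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

For example, `clean_page_text(scraped_data["text"])` could be passed to the summarizer in place of the raw text.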


## Summarize with an LLM

Notes:
- Uses the OpenAI Python client (`OpenAI()`), which reads `OPENAI_API_KEY` from the environment
- The model will typically output Markdown, including tables
- Input is truncated to the first 20,000 characters (`MAX_CHARS`) to avoid overly large prompts
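
A plain slice can cut the text in the middle of a line, leaving a partial team name or score at the end of the prompt. As a hedged alternative, the sketch below trims at the last complete line within the limit. The name `trim_to_limit` is hypothetical and the default of 20,000 characters mirrors the limit used in this notebook.

```python
def trim_to_limit(text, max_chars=20000):
    """Return text cut to at most max_chars, ending at the last full line if possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_newline = cut.rfind("\n")
    # Fall back to a hard cut when the limit falls inside the first line.
    return cut[:last_newline] if last_newline > 0 else cut
```

Note that this limits characters, not tokens; the actual token count depends on the model's tokenizer.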


In [4]:
from openai import OpenAI

client = OpenAI()


def summarize_text(title: str, url: str, text: str, *, model: str = "gpt-4.1-mini") -> str:
    """Summarize scraped page text with an OpenAI chat model and return Markdown."""
    system_prompt = (
        "You are a live-score and match reporter. "
        "Provide a clear, structured, and concise summary of the web content. "
        "Use tables where helpful for clarity. "
        "Remove repetition. Focus on key insights, data points, and conclusions."
    )

    # Simple prompt-size guard: keep only the first MAX_CHARS characters
    MAX_CHARS = 20000
    trimmed_text = text[:MAX_CHARS]

    user_prompt = f"""Title: {title}
URL: {url}

Web Content:
{trimmed_text}
"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3,
    )

    # message.content can be None; return an empty string in that case
    return response.choices[0].message.content or ""


## Run Summary


In [None]:
summary = summarize_text(
    scraped_data["title"],
    scraped_data["url"],
    scraped_data["text"],
)

display(Markdown(summary))
