# Tech & Finance Sentiment Analyzer with Selenium

This notebook demonstrates how to combine **Selenium web scraping** with **OpenAI's GPT** to analyze sentiment from tech and finance news websites.

**What you'll learn:**
- Using Selenium to scrape JavaScript-rendered websites
- Extracting clean text from HTML with BeautifulSoup
- Using GPT to perform structured sentiment analysis
- Aggregating insights from multiple news sources

**Why Selenium?**
Many modern news sites load their content dynamically with JavaScript. A plain HTTP request returns only the initial HTML, so that content is missing from the response. Selenium drives a real browser, which executes the JavaScript before the page source is read.
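
To make the difference concrete, here is a minimal sketch that uses only the standard library. The HTML string is a made-up stand-in for what a plain HTTP request might return from a JavaScript-driven page; `STATIC_HTML` and `VisibleTextParser` are illustrative names, not part of this notebook's pipeline.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML, as a plain HTTP client would receive it.
# The headline only appears after a browser runs the script.
STATIC_HTML = """
<html>
  <head><title>Example News</title></head>
  <body>
    <div id="app"></div>
    <script>
      document.getElementById("app").innerHTML = "<h1>Markets rally</h1>";
    </script>
  </body>
</html>
"""


class VisibleTextParser(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


parser = VisibleTextParser()
parser.feed(STATIC_HTML)
visible_text = " ".join(parser.chunks)

# Without a JavaScript engine, the body has no readable text at all.
print(repr(visible_text))  # → ''
```

The headline exists only as a string inside the script, so a parser that never executes JavaScript finds nothing to read in the body.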

## Setup

First, install the required packages if you haven't already:

In [None]:
# Uncomment and run once to install dependencies
# !pip install selenium webdriver-manager openai python-dotenv beautifulsoup4

In [None]:
import os
import time
import json
from datetime import datetime

from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

In [None]:
# Load API key from .env file
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    print("No API key found - please set OPENAI_API_KEY in your .env file")
elif not api_key.startswith("sk-"):
    print("API key found but doesn't look right - please check your .env file")
else:
    print("API key loaded successfully!")

openai_client = OpenAI()

## Part 1: The Selenium Web Scraper

This class handles all the web scraping logic:
- Creates a headless Chrome browser
- Navigates to URLs and waits for JavaScript to render
- Extracts clean text from the page
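
The scraper below waits with a fixed `time.sleep`. One possible alternative is to poll the browser until `document.readyState` reports `"complete"`. The sketch below is an assumption about how that could look, not the notebook's method: `wait_for_ready` is a hypothetical helper that relies only on the driver's `execute_script` method, and `FakeDriver` is a stand-in so the example runs without a browser.

```python
import time


def wait_for_ready(driver, timeout=10.0, poll_interval=0.25):
    """Poll until document.readyState is 'complete' or the timeout expires.

    Returns True if the page reported 'complete', False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


class FakeDriver:
    """Stand-in for a WebDriver (illustration only): 'complete' on the third poll."""

    def __init__(self):
        self.calls = 0

    def execute_script(self, script):
        self.calls += 1
        return "complete" if self.calls >= 3 else "loading"


fake = FakeDriver()
ready = wait_for_ready(fake, timeout=2.0, poll_interval=0.01)
print(ready, fake.calls)  # → True 3
```

A `"complete"` ready state only means the initial document and its resources have loaded. Content that scripts fetch afterwards may still be missing, so waiting for a specific element (for example with Selenium's `WebDriverWait`) can be more reliable on some sites.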

In [None]:
class SeleniumScraper:
    """
    A web scraper that uses Selenium to handle JavaScript-rendered pages.
    
    Args:
        headless: If True, run the browser without a visible window
        wait_time: Seconds to sleep after the page loads so JavaScript can render
    """
    
    def __init__(self, headless=True, wait_time=3):
        self.headless = headless
        self.wait_time = wait_time
    
    def _create_driver(self):
        """Create and configure a Chrome WebDriver instance."""
        options = Options()
        
        if self.headless:
            options.add_argument('--headless')
        
        # Standard options for stability
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--window-size=1920,1080')
        
        # Use a realistic user agent
        options.add_argument(
            'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        )
        
        # webdriver-manager automatically downloads the correct ChromeDriver
        service = Service(ChromeDriverManager().install())
        return webdriver.Chrome(service=service, options=options)
    
    def _clean_html(self, html):
        """Extract clean text from HTML, removing navigation and scripts."""
        soup = BeautifulSoup(html, 'html.parser')
        
        # Get the title (soup.title.string is None for an empty or nested <title>)
        title = soup.title.string.strip() if soup.title and soup.title.string else "No title"
        
        # Remove elements that don't contain useful content
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 
                         'aside', 'img', 'input', 'button', 'form', 'iframe']):
            tag.decompose()
        
        # Get text with proper spacing
        text = soup.body.get_text(separator="\n", strip=True) if soup.body else ""
        
        # Clean up whitespace
        lines = [line.strip() for line in text.split('\n') if line.strip()]
        clean_text = "\n".join(lines)
        
        return title, clean_text
    
    def scrape(self, url, source_name="Website"):
        """
        Scrape a URL and return structured content.
        
        Args:
            url: The URL to scrape
            source_name: A friendly name for this source
            
        Returns:
            dict with keys: url, source, title, text, success, error
        """
        print(f"Scraping: {source_name}...")
        
        try:
            driver = self._create_driver()
            try:
                driver.get(url)

                # Wait for JavaScript to render
                time.sleep(self.wait_time)

                # Scroll down to trigger lazy loading
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 2);")
                time.sleep(1)

                html = driver.page_source
            finally:
                # Always close the browser, even if navigation or scripting fails
                driver.quit()
            
            title, text = self._clean_html(html)
            
            # Truncate very long content to bound prompt size (analyze() later caps it at 10,000 characters)
            if len(text) > 12000:
                text = text[:12000] + "\n[...content truncated...]"
            
            print(f"  Done! Extracted {len(text):,} characters")
            
            return {
                "url": url,
                "source": source_name,
                "title": title,
                "text": text,
                "success": True,
                "error": None
            }
            
        except Exception as e:
            print(f"  Error: {e}")
            return {
                "url": url,
                "source": source_name,
                "title": None,
                "text": None,
                "success": False,
                "error": str(e)
            }
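
The fixed 12,000-character cut in `scrape()` can split a headline in the middle of a word. As a possible refinement, the sketch below backs up to the last complete line before cutting. `truncate_at_line` is a hypothetical helper, not something the class above defines, and the sample text is made up.

```python
def truncate_at_line(text, max_chars=12000, marker="\n[...content truncated...]"):
    """Cut text to at most max_chars, backing up to the last complete line."""
    if len(text) <= max_chars:
        return text
    # Search up to and including index max_chars, so a newline sitting
    # exactly at the limit still counts as a clean cut point.
    cut = text.rfind("\n", 0, max_chars + 1)
    if cut <= 0:
        # No line break inside the limit, so fall back to a hard cut
        cut = max_chars
    return text[:cut] + marker


sample = "Headline one\nHeadline two\nHeadline three"
print(truncate_at_line(sample, max_chars=30))
```

With `max_chars=30`, the third headline would be cut mid-word, so the helper keeps only the first two lines and appends the marker.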

## Part 2: The Sentiment Analyzer

This class uses GPT to analyze the scraped content and extract:
- Overall sentiment (bullish/bearish/neutral)
- A sentiment score from -100 to +100
- Key themes and topics
- Companies mentioned
- Actionable insights

In [None]:
class SentimentAnalyzer:
    """
    Uses GPT to analyze sentiment in tech/finance news content.
    Returns structured analysis with scores and insights.
    """
    
    SYSTEM_PROMPT = """You are a financial analyst AI. Analyze the news content and respond with JSON:

{
    "sentiment": "bullish" or "bearish" or "neutral",
    "score": <-100 to 100, where -100=very bearish, 100=very bullish>,
    "confidence": <0-100>,
    "themes": ["theme1", "theme2", "theme3"],
    "companies": ["company1", "company2"],
    "summary": "2-3 sentence summary of key points",
    "insight": "One actionable recommendation"
}

Focus on tech industry and market implications."""

    def __init__(self, model="gpt-5-nano"):
        self.model = model
        self.client = OpenAI()
    
    def analyze(self, scraped_data):
        """
        Analyze scraped content for sentiment.
        
        Args:
            scraped_data: dict from SeleniumScraper.scrape()
            
        Returns:
            dict with sentiment analysis results
        """
        if not scraped_data["success"]:
            return {
                "source": scraped_data["source"],
                "error": scraped_data["error"]
            }
        
        user_prompt = f"""Analyze this news content:

Source: {scraped_data['source']}
Title: {scraped_data['title']}

Content:
{scraped_data['text'][:10000]}"""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": self.SYSTEM_PROMPT},
                    {"role": "user", "content": user_prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            result["source"] = scraped_data["source"]
            result["url"] = scraped_data["url"]
            return result
            
        except Exception as e:
            return {
                "source": scraped_data["source"],
                "error": str(e)
            }
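
The `json_object` response format asks the model for valid JSON, but the field names, types, and ranges still depend on the model following the prompt. A minimal sketch of a defensive normalization step follows; `normalize_analysis` and its helpers are hypothetical names, and the `messy` input is made up to show the kinds of deviations it tolerates.

```python
VALID_SENTIMENTS = {"bullish", "bearish", "neutral"}


def _clamp_number(value, low, high, default):
    """Coerce value to a float within [low, high]; use default if not numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    return float(max(low, min(high, number)))


def _as_string_list(value):
    """Return value as a list of strings, or an empty list if it is not a list."""
    if not isinstance(value, list):
        return []
    return [str(item) for item in value]


def normalize_analysis(raw):
    """Return a new dict with the expected fields, types, and ranges."""
    sentiment = str(raw.get("sentiment", "")).strip().lower()
    return {
        "sentiment": sentiment if sentiment in VALID_SENTIMENTS else "neutral",
        "score": _clamp_number(raw.get("score"), -100, 100, default=0.0),
        "confidence": _clamp_number(raw.get("confidence"), 0, 100, default=0.0),
        "themes": _as_string_list(raw.get("themes")),
        "companies": _as_string_list(raw.get("companies")),
        "summary": str(raw.get("summary") or "N/A"),
        "insight": str(raw.get("insight") or "N/A"),
    }


# A made-up response with a string score, an out-of-range confidence,
# and no "companies" field
messy = {"sentiment": "Bullish", "score": "75", "confidence": 140, "themes": ["AI"]}
print(normalize_analysis(messy))
```

In `analyze()`, such a function could be applied to the parsed JSON before the `source` and `url` keys are attached, so the formatting helpers always receive numeric scores.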

## Part 3: Report Formatting

Helper functions to display the analysis results in a readable format.

In [None]:
def get_sentiment_badge(score):
    """Return a text badge based on sentiment score."""
    if score >= 50:
        return "[STRONGLY BULLISH]"
    elif score >= 20:
        return "[BULLISH]"
    elif score > -20:
        return "[NEUTRAL]"
    elif score > -50:
        return "[BEARISH]"
    else:
        return "[STRONGLY BEARISH]"


def format_analysis(analysis):
    """Format a single analysis result as markdown."""
    if "error" in analysis:
        return f"### {analysis.get('source', 'Unknown')}\n\nError: {analysis['error']}\n\n---"
    
    score = analysis.get('score', 0)
    badge = get_sentiment_badge(score)
    
    themes = ", ".join(analysis.get('themes', [])[:4]) or "None identified"
    companies = ", ".join(analysis.get('companies', [])[:5]) or "None mentioned"
    
    return f"""### {analysis['source']} {badge}

**Sentiment Score:** {score}/100 (Confidence: {analysis.get('confidence', 'N/A')}%)

**Summary:** {analysis.get('summary', 'N/A')}

**Key Themes:** {themes}

**Companies Mentioned:** {companies}

**Actionable Insight:** {analysis.get('insight', 'N/A')}

*Source: {analysis.get('url', 'N/A')}*

---
"""


def display_report(analyses):
    """Display a complete sentiment report."""
    # Calculate aggregate stats
    valid = [a for a in analyses if "error" not in a]
    
    if valid:
        avg_score = sum(a.get('score', 0) for a in valid) / len(valid)
        avg_badge = get_sentiment_badge(avg_score)
    else:
        avg_score = 0
        avg_badge = "[NO DATA]"
    
    # Build the report
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    
    report = f"""# Sentiment Analysis Report

**Generated:** {timestamp}

## Overall Market Sentiment {avg_badge}

**Average Score:** {avg_score:.1f}/100 across {len(valid)} sources

---

## Individual Source Analysis

"""
    
    for analysis in analyses:
        report += format_analysis(analysis) + "\n"
    
    display(Markdown(report))
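
`display_report` gives every source equal weight in the average. One variation, sketched here as an assumption rather than a recommendation, weights each score by the confidence the model reported. `weighted_average_score` is a hypothetical helper and the sample results are made up.

```python
def weighted_average_score(analyses):
    """Average 'score' weighted by 'confidence'; entries with an 'error' key are skipped.

    Returns None when no entry carries a positive weight.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for item in analyses:
        if "error" in item:
            continue
        weight = float(item.get("confidence", 0))
        if weight <= 0:
            continue
        weighted_sum += float(item.get("score", 0)) * weight
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Made-up results for illustration
sample = [
    {"source": "A", "score": 60, "confidence": 90},
    {"source": "B", "score": -20, "confidence": 30},
    {"source": "C", "error": "timeout"},
]
print(weighted_average_score(sample))  # → 40.0
```

The plain mean of the two valid scores would be 20; the weighted version lands at 40 because the high-confidence source counts three times as much.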

## Part 4: Run the Analysis

Now let's put it all together! Configure your news sources and run the pipeline.

In [None]:
# Configure the news sources you want to analyze
# Format: {"Friendly Name": "URL"}

NEWS_SOURCES = {
    "Hacker News": "https://news.ycombinator.com/",
    "TechCrunch": "https://techcrunch.com/",
    "The Verge Tech": "https://www.theverge.com/tech",
}

# You can add more sources:
# "Ars Technica": "https://arstechnica.com/",
# "Wired": "https://www.wired.com/",

In [None]:
# Initialize our tools
scraper = SeleniumScraper(headless=True, wait_time=4)
analyzer = SentimentAnalyzer(model="gpt-5-nano")

In [None]:
# Step 1: Scrape all sources
print("="*50)
print("SCRAPING NEWS SOURCES")
print("="*50)

scraped_data = []
for name, url in NEWS_SOURCES.items():
    result = scraper.scrape(url, name)
    scraped_data.append(result)

successful = sum(1 for s in scraped_data if s["success"])
print(f"\nScraped {successful}/{len(NEWS_SOURCES)} sources successfully")
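
A source that fails once may succeed on a second try. The sketch below wraps any scrape function in a simple retry loop with a doubling delay. `scrape_with_retry` is a hypothetical helper, and `flaky_scrape` is a stand-in that simulates two failures so the example runs without a browser.

```python
import time


def scrape_with_retry(scrape_fn, url, name, attempts=3, base_delay=1.0):
    """Call scrape_fn(url, name) until it reports success or attempts run out.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Returns the last result dict either way.
    """
    result = None
    for attempt in range(attempts):
        result = scrape_fn(url, name)
        if result.get("success"):
            break
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    return result


# Stand-in for scraper.scrape (illustration only): fails twice, then succeeds
calls = {"count": 0}


def flaky_scrape(url, name):
    calls["count"] += 1
    ok = calls["count"] >= 3
    return {"url": url, "source": name, "success": ok,
            "error": None if ok else "simulated timeout"}


outcome = scrape_with_retry(flaky_scrape, "https://example.com/", "Example", base_delay=0.01)
print(outcome["success"], calls["count"])  # → True 3
```

Because `SeleniumScraper.scrape` returns a dict with a `success` key, it could be passed in directly as `scrape_with_retry(scraper.scrape, url, name)`.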

In [None]:
# Step 2: Analyze sentiment for each source
print("="*50)
print("ANALYZING SENTIMENT")
print("="*50)

analyses = []
for data in scraped_data:
    print(f"Analyzing: {data['source']}...")
    result = analyzer.analyze(data)
    analyses.append(result)
    print(f"  Sentiment: {result.get('sentiment', 'error')}")

print("\nAnalysis complete!")

In [None]:
# Step 3: Display the report
display_report(analyses)

## Bonus: Quick Single-URL Analysis

Use this function to quickly analyze any single URL:

In [None]:
def quick_sentiment(url, name="Website"):
    """One-liner to scrape and analyze any URL."""
    scraper = SeleniumScraper(headless=True)
    analyzer = SentimentAnalyzer(model="gpt-5-nano")
    
    data = scraper.scrape(url, name)
    result = analyzer.analyze(data)
    
    display(Markdown(format_analysis(result)))
    return result

In [None]:
# Try it! Run this cell to analyze a single page (the trailing semicolon hides the returned dict):
quick_sentiment("https://news.ycombinator.com/", "Hacker News");

---

## Ideas for Extension

- Add more news sources (Reuters, Bloomberg, etc.)
- Track sentiment over time by saving results to a file
- Add email alerts when sentiment shifts dramatically
- Create charts with matplotlib to visualize trends
- Use Ollama for local/free sentiment analysis
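
As a starting point for the "track sentiment over time" idea, the sketch below appends each run to a JSON Lines file and reads it back. The helper names, the record fields, and the `sentiment_history.jsonl` file name are all assumptions for illustration; the demo data is made up and written to a temporary directory.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_history(analyses, path):
    """Append one JSON line per successful analysis, stamped with the current UTC time."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as handle:
        for item in analyses:
            if "error" in item:
                continue
            record = {
                "timestamp": stamp,
                "source": item.get("source"),
                "score": item.get("score"),
                "sentiment": item.get("sentiment"),
            }
            handle.write(json.dumps(record) + "\n")


def load_history(path):
    """Read every record back from the JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo with made-up data, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "sentiment_history.jsonl")
append_history([{"source": "Hacker News", "score": 35, "sentiment": "bullish"}], demo_path)
append_history([{"source": "Hacker News", "score": -10, "sentiment": "neutral"}], demo_path)
history = load_history(demo_path)
print([row["score"] for row in history])  # → [35, -10]
```

Appending one line per record keeps earlier runs intact, and the resulting list of dicts can feed the matplotlib idea above directly.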

---

*Week 1 Community Contribution - LLM Engineering Course*