<div style="text-align: center; padding: 10px; margin-bottom: 10px">
    <h1 style="font-family: 'Gill Sans', 'Gill Sans MT', Calibri, sans-serif; font-size: 2.5em; color: #3498DB; margin-bottom: 0.2em; font-weight: 300;">Job Market Data Scraping and Analysis using LLMs</h1>
</div>

<div style="max-width: 800px; margin: 0 auto; background-color: #f9f9f9; border-left: 5px solid #3498DB; padding: 20px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; font-size: 1.1em; line-height: 1.6; color: #2C3E50;">
    This notebook is a tutorial and proof of concept for a workflow that extracts job market data from Indeed and analyzes job descriptions using Large Language Models.
</div>

<br>

<div style="max-width: 900px; margin: 0 auto; background-color: #f0f0f0; border-left: 5px solid #1ABC9C; padding: 15px; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; font-size: 1.1em; line-height: 1.5;">
<h2 style="font-family: 'Palatino Linotype', 'Book Antiqua', Palatino, serif; color: #333333; margin-top: 0;">🌐 Web Scraping</h2>

Web scraping is the process of automatically collecting data from websites. It allows us to transform unstructured web content into structured datasets that can be analyzed.

### How Web Scraping Works

1. **HTTP Requests**:
Web scraping begins with an HTTP request to a web server. Every time you visit a website, your browser sends an HTTP request to a server, which responds with the page's HTML. The browser then requests the CSS and JavaScript files that the HTML references, and together these make up the webpage.

2. **HTML Parsing**:
Once we receive this HTML response, we need to parse it to create a structured representation that can be navigated programmatically. Libraries like `BeautifulSoup` take the raw HTML and turn it into a parse tree, a structure similar to the browser's Document Object Model (DOM) in which each element is a node that can be searched and traversed.

3. **Data Extraction**:
To extract specific data from a website, you first need to identify the HTML elements that contain the information you're interested in. This can be done by inspecting the webpage's source code to find unique identifiers like `id`, `class`, or specific tag names. Once the relevant elements have been identified, you can extract their data using CSS Selectors, XPath expressions, regular expressions (regex), or any other methods that can filter and extract patterns from text strings. The extracted data can then be stored in a structured format (e.g., CSV, JSON) for further analysis.
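
The parsing and extraction steps above can be sketched as follows. This is a minimal illustration, not the notebook's scraper: it uses the standard library's `html.parser` in place of `BeautifulSoup` so that it runs without extra installs, and the markup, class names (`job`, `title`, `company`), and the `JobParser` and `extract_jobs` names are all hypothetical. The sample string stands in for the body of an HTTP response.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup standing in for the HTML returned by an HTTP request.
SAMPLE_HTML = """
<ul>
  <li class="job"><h2 class="title">Data Analyst</h2><span class="company">Acme</span></li>
  <li class="job"><h2 class="title">ML Engineer</h2><span class="company">Globex</span></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collect the text of elements whose class is 'title' or 'company'."""

    FIELDS = {"title", "company"}

    def __init__(self) -> None:
        super().__init__()
        self.jobs = []       # one dict per element with class="job"
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "job":
            # A new job posting starts: open a fresh record.
            self.jobs.append({})
        elif css_class in self.FIELDS and self.jobs:
            # The next text node belongs to this field.
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self.jobs[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        # Any closing tag ends the field being read.
        self._field = None


def extract_jobs(html: str) -> list:
    """Parse the HTML and return the extracted job records."""
    parser = JobParser()
    parser.feed(html)
    return parser.jobs


jobs = extract_jobs(SAMPLE_HTML)

# Store the result in a structured format (JSON here) for later analysis.
print(json.dumps(jobs, indent=2))
```

With `BeautifulSoup`, the same extraction would typically be written with CSS selectors, for example `soup.select(".job .title")`, which is shorter than a hand-written parser once the library is installed.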

### Common Obstacles for Web Scraping

**Anti-scraping Measures**:
- CAPTCHAs
- Rate Limiting
- Dynamic Content Obfuscation



</div>

### Obstacles and Challenges

Web scraping is a powerful technique, but you are likely to run into the following obstacles:

**Bot Detection and Access Restrictions**:
Many websites employ bot detection mechanisms to block automated scraping. They may reject requests based on IP address, user agent, or other signals of non-human activity. One mitigation is to drive a real browser with Selenium and send a realistic user agent (headless Chrome otherwise identifies itself as `HeadlessChrome`), as shown in the `setup_selenium_driver()` function below:

```python:indeed_scraper.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def setup_selenium_driver() -> webdriver.Chrome:
    """Set up and configure a Selenium Chrome driver for web scraping.
    
    This function configures Chrome options for headless browsing with a realistic
    user agent and returns the configured driver.
    
    Returns:
        webdriver.Chrome: Configured Chrome webdriver instance
    """
    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in headless mode
    chrome_options.add_argument("--disable-gpu")  # Disable GPU hardware acceleration
    chrome_options.add_argument("--window-size=1920,1080")  # Set window size
    
    # Set a realistic user agent
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
    chrome_options.add_argument(f'user-agent={user_agent}')
    
    # Create and return the driver
    driver = webdriver.Chrome(options=chrome_options)
    return driver
```
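
A browser started this way keeps running until it is explicitly closed, so it helps to guarantee that `driver.quit()` is called even when a page load fails. The sketch below is one possible way to do that. The helper names `managed_driver` and `fetch_page_source` are hypothetical and not part of `indeed_scraper.py`, and the driver factory is passed in as an argument so the helpers do not depend on a particular browser setup.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_driver(driver_factory: Callable[[], Any]) -> Iterator[Any]:
    """Create a driver and make sure it is shut down afterwards."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # Runs on success and on error, so no browser process is left behind.
        driver.quit()


def fetch_page_source(url: str, driver_factory: Callable[[], Any]) -> str:
    """Load a URL in a fresh browser and return the rendered HTML."""
    with managed_driver(driver_factory) as driver:
        driver.get(url)
        return driver.page_source


# Intended usage in this notebook (requires Chrome and Selenium):
# html = fetch_page_source("https://example.com", setup_selenium_driver)
```

Passing the factory rather than a ready-made driver keeps creation and cleanup in one place, at the cost of starting a new browser for every call. For many pages, reusing one driver inside a single `with managed_driver(...)` block would likely be faster.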

**Rate Limiting and Delays**:
Many websites also implement rate limiting to prevent excessive requests from overloading their servers. To reduce the chance of hitting these limits, you can introduce random delays between requests using the `random_delay()` function:

```python:indeed_scraper.py
import random
import time


def random_delay(min_seconds: float = 2.0, max_seconds: float = 5.0) -> None:
    """Introduce a random delay between operations to avoid overwhelming the server.
    
    Args:
        min_seconds: Minimum delay time in seconds
        max_seconds: Maximum delay time in seconds
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
```
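
As a rough illustration of where such a delay fits, the sketch below pauses between consecutive requests in a loop but not before the first one. The `fetch_all_politely` function and its `fetch` and `sleep` parameters are hypothetical. `fetch` stands for whatever function downloads a page, and `sleep` is injectable only so the pacing can be checked without actually waiting.

```python
import random
import time
from typing import Callable, Dict, Iterable


def fetch_all_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_seconds: float = 2.0,
    max_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, waiting a random interval between requests."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Delay only between requests, never before the first one.
            sleep(random.uniform(min_seconds, max_seconds))
        results[url] = fetch(url)
    return results


# Demonstration with a stand-in fetcher and a recorder instead of real sleeping.
recorded_delays = []
pages = fetch_all_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=recorded_delays.append,
)
print(len(pages), "pages fetched with", len(recorded_delays), "delays")
```

Randomizing the interval, rather than sleeping for a fixed time, avoids the perfectly regular request pattern that is easy for a server to recognize as automated.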

Driving a real browser with Selenium and spacing requests with random delays makes the scraper less likely to trigger bot detection or rate limits, although neither technique guarantees access.

</div>