## Import Libraries
Summary: This block installs and imports the libraries used for web scraping (selenium and BeautifulSoup), along with csv for writing the results and os for building file paths.

In [56]:
## Importing Libraries
%pip install selenium beautifulsoup4
import csv
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

from bs4 import BeautifulSoup




## Initialize WebDriver
Summary: This cell configures Chrome to run headless, points Selenium at a local Chrome for Testing binary, and starts the WebDriver. Opening each result URL, waiting for the "5K" label to appear, and parsing the rendered HTML with BeautifulSoup all happen later, inside the per-runner loop.

In [57]:
# Set up Chrome options for headless browsing
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.binary_location = "/usr/local/bin/chrome-mac-x64/Google Chrome for Testing.app/Contents/MacOS/Google Chrome for Testing"

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)


## Set Up CSV File Format
This cell defines the column order for the output CSV and creates the file in the user's home directory with a header row.

In [58]:
# Desired column order
ordered_fields = [
    'Bib Number', 'Place Gender', 'Place Age‑Graded', 'Gun Time', '5K', '10K', '15K', '20K', 'HALF',
    '25K', '30K', '35K', '40K', '20M', '25.2M', '26M', 'MAR', 'Official Time'
]



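# Define the filename and path to save the CSV file in the user's home directory
home_directory = os.path.expanduser("~")
filename = os.path.join(home_directory, 'ny_marathon_results.csv')

# Set up the CSV file with the ordered fields. The file is opened in append mode,
# so write the header only when the file is new or empty; otherwise re-running
# this cell would add a second header row in the middle of the data.
write_header = not os.path.exists(filename) or os.path.getsize(filename) == 0
with open(filename, mode='a', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=ordered_fields)
    if write_header:
        writer.writeheader()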

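Rows written later are built from labels scraped off the page, so a row may lack some columns or carry a label that is not in ordered_fields. The following is a minimal sketch of how csv.DictWriter treats both cases. It uses a shortened, hypothetical field list and made-up values, and the write_row helper exists only for this illustration.

```python
import csv
import io

# Shortened, hypothetical field list for illustration only
fields = ['Bib Number', '5K', '10K']

def write_row(row, extrasaction='raise'):
    """Write one dict row to an in-memory CSV and return the text produced."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, restval='',
                            extrasaction=extrasaction)
    writer.writerow(row)
    return buffer.getvalue()

# A missing key ('10K') is filled with restval, leaving an empty cell
partial = write_row({'Bib Number': 7, '5K': '0:20:00'})

# A key that is not in fieldnames raises ValueError by default
try:
    write_row({'Bib Number': 7, 'Unknown Label': 'x'})
    raised = False
except ValueError:
    raised = True

# With extrasaction='ignore' the unexpected key is silently dropped instead
ignored = write_row({'Bib Number': 7, 'Unknown Label': 'x'}, extrasaction='ignore')
```

If the result page ever shows a label outside the expected columns, the default behavior stops the run with a ValueError, while extrasaction='ignore' keeps the run going at the cost of discarding that value.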
## Extract Runner Information
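This cell loops over bib numbers, loads each runner's result page, and waits up to 0.5 seconds for the "5K" label to appear. It then parses the rendered HTML with BeautifulSoup, finds the container with the class content-box, and iterates through all form-group-item sections within it, storing each label and its corresponding value in the split_times dictionary. The collected values are appended to the CSV file as one row per runner, and bibs whose page does not render in time are skipped.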
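The extraction step can be tried on a static snippet before running the full loop. The sketch below is a minimal version under stated assumptions: sample_html is hypothetical markup shaped after the class names mentioned here (the real page may differ), extract_splits is an illustrative helper name, and the times are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure the loop expects
sample_html = """
<div class="content-box">
  <div class="form-group-item"><label>5K</label><span class="label-value">0:21:30</span></div>
  <div class="form-group-item"><label>10K</label><span class="label-value">0:43:05</span></div>
  <div class="form-group-item"><label>Note</label></div>
</div>
"""

def extract_splits(html):
    """Return {label: value} for every form-group-item that has both parts."""
    soup = BeautifulSoup(html, 'html.parser')
    splits = {}
    content_box = soup.find("div", class_="content-box")
    if content_box is None:
        # No container on the page: nothing to extract
        return splits
    for section in content_box.find_all("div", class_="form-group-item"):
        label = section.find("label")
        value = section.find("span", class_="label-value")
        # Skip items that lack either the label or the value
        if label is not None and value is not None:
            splits[label.get_text(strip=True)] = value.get_text(strip=True)
    return splits

splits = extract_splits(sample_html)
print(splits)
```

The third item has a label but no value, so it is left out of the result.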

In [59]:
# Process each runner's bib number
for bib_number in range(1, 66900):  # Adjust the range as needed
    url = f"https://results.nyrr.org/event/M2024/result/{bib_number}"
    driver.get(url)

    # Initialize data dictionary for the runner's results with None as default for all ordered fields
    runner_data = {field: None for field in ordered_fields}
    runner_data['Bib Number'] = bib_number

    try:
        # Wait up to 0.5 s for the 5K split label to appear; if the page has not
        # rendered by then, TimeoutException is raised and the bib is skipped
        WebDriverWait(driver, 0.5).until(
            EC.presence_of_element_located((By.XPATH, "//label[text()='5K']"))
        )

        # Retrieve the page source after JavaScript has rendered
        page_source = driver.page_source

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')

        # Dictionary to store the split times
        split_times = {}

        # Extract split times in the parsed HTML
        content_box = soup.find("div", class_="content-box")
        if content_box:
            split_sections = content_box.find_all("div", class_="form-group-item")
            for section in split_sections:
                label = section.find("label")
                value_span = section.find("span", class_="label-value")
                if label and value_span:
                    split_label = label.get_text(strip=True)
                    split_time = value_span.get_text(strip=True)
                    split_times[split_label] = split_time

        # Update runner_data with the extracted splits and other fields
        runner_data.update(split_times)


        # Append the current runner's data to the CSV file
        with open(filename, mode='a', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=ordered_fields)
            writer.writerow(runner_data)

        print(f"Processed bib number {bib_number}")

    except TimeoutException:
        print(f"Timed out while processing bib number {bib_number}. Skipping...")

# Close the driver
driver.quit()
print("All bibs processed and saved to CSV.")

Processed bib number 1
Processed bib number 2
Processed bib number 3
Processed bib number 4
Timed out while processing bib number 5. Skipping...
Processed bib number 6
Processed bib number 7
Processed bib number 8
Processed bib number 9
Processed bib number 10
Processed bib number 11
Processed bib number 12
Timed out while processing bib number 13. Skipping...
Processed bib number 14
Processed bib number 15
Processed bib number 16
Processed bib number 17
Processed bib number 18
Processed bib number 19
Processed bib number 20
Processed bib number 21
Processed bib number 22
Processed bib number 23
Processed bib number 24
Processed bib number 25
Processed bib number 26
Processed bib number 27
Timed out while processing bib number 28. Skipping...
Timed out while processing bib number 29. Skipping...
Timed out while processing bib number 30. Skipping...
Timed out while processing bib number 31. Skipping...
Timed out while processing bib number 32. Skipping...
Timed out while processing bib 

WebDriverException: Message: disconnected: Unable to receive message from renderer
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: chrome=130.0.6723.116)
Stacktrace:
0   chromedriver                        0x000000010891db58 chromedriver + 8182616
1   chromedriver                        0x000000010891508a chromedriver + 8147082
2   chromedriver                        0x00000001081b0fa0 chromedriver + 397216
3   chromedriver                        0x00000001081994fc chromedriver + 300284
4   chromedriver                        0x000000010819922e chromedriver + 299566
5   chromedriver                        0x00000001081981e9 chromedriver + 295401
6   chromedriver                        0x00000001081bccf2 chromedriver + 445682
7   chromedriver                        0x0000000108240614 chromedriver + 984596
8   chromedriver                        0x00000001082216f3 chromedriver + 857843
9   chromedriver                        0x00000001081f01c2 chromedriver + 655810
10  chromedriver                        0x00000001081f119e chromedriver + 659870
11  chromedriver                        0x00000001088e2da0 chromedriver + 7941536
12  chromedriver                        0x00000001088e6cf4 chromedriver + 7957748
13  chromedriver                        0x00000001088c4917 chromedriver + 7817495
14  chromedriver                        0x00000001088e777e chromedriver + 7960446
15  chromedriver                        0x00000001088b3be4 chromedriver + 7748580
16  chromedriver                        0x00000001089033a8 chromedriver + 8074152
17  chromedriver                        0x0000000108903566 chromedriver + 8074598
18  chromedriver                        0x0000000108914c98 chromedriver + 8146072
19  libsystem_pthread.dylib             0x00007ff812de21d3 _pthread_start + 125
20  libsystem_pthread.dylib             0x00007ff812dddbd3 thread_start + 15
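The traceback above ends the run because the loop only catches TimeoutException, while a lost browser session raises WebDriverException, its parent class in selenium.common.exceptions. One way to keep a long run going is to discard the broken browser, start a new one, and retry the same bib. The sketch below shows that control flow with injected callables so it runs without a browser. The names scrape_all, make_driver, fetch, recoverable, and max_restarts are hypothetical, not part of the notebook. In the notebook, make_driver would wrap webdriver.Chrome(options=chrome_options), recoverable would be WebDriverException, and fetch would handle TimeoutException itself so that a slow page does not trigger a restart.

```python
def scrape_all(bibs, make_driver, fetch, recoverable=(Exception,), max_restarts=3):
    """Call fetch(driver, bib) for each bib, replacing the driver after a crash.

    Returns (results, failed): results maps bib -> fetched value, and failed
    lists bibs that still crashed after max_restarts fresh drivers.
    """
    driver = make_driver()
    results, failed = {}, []
    try:
        for bib in bibs:
            restarts = 0
            while True:
                try:
                    results[bib] = fetch(driver, bib)
                    break
                except recoverable:
                    # The session is unusable: close it (best effort) and start a new one
                    try:
                        driver.quit()
                    except Exception:
                        pass
                    driver = make_driver()
                    restarts += 1
                    if restarts > max_restarts:
                        failed.append(bib)
                        break
    finally:
        driver.quit()
    return results, failed


class FakeDriver:
    """Stand-in for a browser; counts instances so restarts are visible."""
    created = 0

    def __init__(self):
        FakeDriver.created += 1

    def quit(self):
        pass


crashed_once = set()

def flaky_fetch(driver, bib):
    # Simulate one browser crash the first time bib 3 is requested
    if bib == 3 and bib not in crashed_once:
        crashed_once.add(bib)
        raise RuntimeError("disconnected")
    return f"row for bib {bib}"


results, failed = scrape_all(range(1, 6), FakeDriver, flaky_fetch,
                             recoverable=(RuntimeError,))
```

In this simulated run, bib 3 crashes once, a second FakeDriver is created, and the retry succeeds, so all five bibs end up in results and failed stays empty.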


## Append Runner Data to CSV