
# Hyrox Scraping and Data Extraction




This scraper collects detailed results from the Hyrox race event website for a specified season. It works through the site's paginated event listings, visits each participant's detail page, and extracts data on race participants and their performances.

Here's a summary of its core functionalities:

`Dynamic URL Construction`: The scraper constructs URLs to access race event pages based on the specified season, incorporating necessary query parameters to filter events by various criteria such as gender and division. This dynamic approach allows for flexibility in targeting different seasons or specific event categories.

`Page Navigation`: It includes logic to navigate through the pagination of event listings, ensuring that all available events within a season are accounted for. This is crucial for comprehensive data collection, given the website's structure of listing multiple events across several pages.

`Data Extraction`: For each event URL visited, the scraper employs BeautifulSoup to parse the HTML content, extracting detailed participant data. This includes participant names, performance metrics, scoring details, and workout results, among other relevant information.

`Data Structuring`: Extracted data is structured into a clean, organized format, making it suitable for analysis or storage. The scraper handles the flattening of nested data (like participant details and workout results) to ensure it can be efficiently stored in a CSV file or database.

`Scalability and Adaptability`: The scraper is designed with modularity in mind, allowing for easy adjustments to target different seasons or accommodate changes in the website's structure. Parameters like the base URL, event codes, and query parameters can be updated to extend the scraper's utility across different seasons.

`CSV Output`: Finally, the scraper compiles the collected data into a CSV file, providing a structured and accessible format for data analysis, reporting, or use in front-end applications. This includes adding new columns for gender and event ID, enhancing the data's usefulness for detailed performance analytics.
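
The page-navigation step can be sketched independently of the network. This is a minimal illustration rather than the scraper's actual code: `collect_all_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fetch function is passed in so the loop can be exercised without touching the website.

```python
def collect_all_pages(fetch_page, first_page=1):
    """Collect items from consecutive pages until a page is empty or is the last one.

    fetch_page(page_number) is a hypothetical callable returning
    (items_on_page, has_next_page).
    """
    collected = []
    page_number = first_page
    while True:
        items, has_next = fetch_page(page_number)
        if not items:
            break  # An empty page means we ran past the end of the listing
        collected.extend(items)
        if not has_next:
            break
        page_number += 1
    return collected


# Stand-in for a paginated listing: three pages of made-up detail links
FAKE_PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page_number):
    """Return the items on a page and whether a following page exists."""
    items = FAKE_PAGES.get(page_number, [])
    return items, (page_number + 1) in FAKE_PAGES


all_items = collect_all_pages(fake_fetch)
print(all_items)
```

Checking both for an empty page and for a missing next-page link keeps the loop from running forever if either signal is unreliable.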

## Beautiful Soup Approach

Using BeautifulSoup is a significant change in approach compared with driving a browser. Instead of interacting with the website through a browser (clicking buttons, waiting for JavaScript to render content, etc.), the scraper makes HTTP requests to the website URLs directly and parses the returned HTML. This works well for static content but may fail if the content is loaded dynamically with JavaScript after the initial page load.
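
One way to judge whether a page suits this approach is to check whether the elements of interest already appear in the raw HTML. The sketch below is only illustrative: `selector_present` is a hypothetical helper, and the two HTML strings are made-up stand-ins for a server-rendered page and a JavaScript-rendered one, not markup from the Hyrox site.

```python
from bs4 import BeautifulSoup


def selector_present(html, css_selector):
    """Return True if the CSS selector matches at least one element in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(css_selector)) > 0


# Made-up example: the list is already in the HTML the server returns
static_html = '<ul><li class="row"><a href="?id=1">Runner One</a></li></ul>'

# Made-up example: the container stays empty until a script fills it in
dynamic_html = '<ul id="results"></ul><script>loadResults()</script>'

print(selector_present(static_html, "li.row a"))   # True
print(selector_present(dynamic_html, "li.row a"))  # False
```

If the selector finds nothing in the raw response for a real page, the data is probably injected by JavaScript and a plain HTTP request will not be enough.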





The listing URL accepts the following query parameters:

*   `page=1`: Specifies the page number.
*   `event=HPRO_JGDMS4JI6BA`: Specifies the event ID, prefixed with the division code.
*   `num_results=100`: Sets the number of results per page to 100.
*   `pid=list`: Indicates the type of content to retrieve.
*   `pidp=ranking_nav`: Likely related to the navigation or display of rankings.
*   `ranking=time_finish_netto`: Specifies the ranking type.
*   `search[sex]=M`: Filters results for male participants (assuming `M` stands for male).
*   `search[age_class]=%25`: Filters by age class. `%25` URL-decodes to `%`, which appears to act as a wildcard meaning "all" or "any".
*   `search[nation]=%25`: Filters by nation in the same way, with `%25` indicating no specific filter.
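
The parameters listed above can be assembled with the standard library instead of a hand-written query string. This is a sketch under assumptions: `build_listing_url` is a hypothetical helper, and the combination of base URL, season, and event code below is only illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_listing_url(base_url, season, page, event_code, gender):
    """Assemble a results-listing URL from the query parameters described above."""
    params = {
        "page": page,
        "event": event_code,
        "num_results": 100,
        "pid": "list",
        "pidp": "ranking_nav",
        "ranking": "time_finish_netto",
        "search[sex]": gender,
        "search[age_class]": "%",  # urlencode turns "%" into "%25"
        "search[nation]": "%",
    }
    return f"{base_url}/season-{season}/?{urlencode(params)}"


url = build_listing_url("https://results.hyrox.com", 6, 1, "HPRO_JGDMS4JI6BA", "M")
print(url)

# Decoding the query string recovers the original values, including the "%" wildcard
decoded = parse_qs(urlsplit(url).query)
print(decoded["search[age_class]"])
```

Letting `urlencode` handle the escaping avoids mistakes with the square brackets and the percent sign, both of which must be percent-encoded in a query string.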









In [2]:
import requests
from bs4 import BeautifulSoup

class HyroxEvent:
    def __init__(self, event_id, season):
        self.event_id = event_id
        self.season = season

# Events to scrape; each entry carries its own season (the entries below are tagged season=1)
events = [
    HyroxEvent(event_id="999999212F07B50000000079", season=1),
    HyroxEvent(event_id="999999212F07B50000000068", season=1),
    HyroxEvent(event_id="999999212F07B50000000066", season=1),
    HyroxEvent(event_id="999999212F07B50000000065", season=1),
    HyroxEvent(event_id="999999212F07B50000000052", season=1),
    HyroxEvent(event_id="999999212F07B50000000051", season=1),
    HyroxEvent(event_id="999999212F07B5000000003D", season=1),
    HyroxEvent(event_id="999999212F07B50000000029", season=1),
    HyroxEvent(event_id="999999212F07B50000000015", season=1),

]

divisions = {"open": "H", "pro": "HPRO", "elite": "HE"}
genders = ["M", "W"]

def get_page_urls(base_url, page_number, event, division_code, gender, season):
    """Fetch one results-listing page; return (participant detail hrefs, parsed soup)."""
    full_event_code = f"{division_code}_{event.event_id}"  # Division code + event ID, e.g. "HPRO_<event id>"
    url = f"{base_url}/season-{season}/?page={page_number}&event={full_event_code}&lang=EN_CAP&num_results=100&pid=list&pidp=ranking_nav&ranking=time_finish_netto&search%5Bsex%5D={gender}&search%5Bage_class%5D=%25&search%5Bnation%5D=%25"
    response = requests.get(url, timeout=30)  # A timeout avoids hanging forever on a stalled connection
    soup = BeautifulSoup(response.content, 'html.parser')
    # Each participant's name field links to that participant's detail page
    detail_urls = [a['href'] for a in soup.select("h4.list-field.type-fullname a[href]")]
    return detail_urls, soup

def has_next_page(soup):
    next_page_button = soup.select_one("ul.pagination li.pages-nav-button a")
    return bool(next_page_button)

def main():
    base_url = "https://results.hyrox.com"
    all_detail_urls = []

    for event in events:  # Iterating over HyroxEvent objects
        for division_name, division_code in divisions.items():
            for gender in genders:
                page_number = 1
                while True:
                    urls, soup = get_page_urls(base_url, page_number, event, division_code, gender, event.season)  # Pass HyroxEvent object to get_page_urls
                    if not urls:
                        break
                    all_detail_urls.extend(urls)
                    print(f"Collected {len(urls)} URLs from event {event.event_id}, {division_name}, {gender}, page {page_number} in season {event.season}.")
                    if not has_next_page(soup):
                        break
                    page_number += 1

    print(f"Total collected URLs: {len(all_detail_urls)}")
    return all_detail_urls

# Execute the main function
collected_urls = main()
print(collected_urls)

Collected 100 URLs from event 999999212F07B50000000068, open, M, page 1 in season 1.
Collected 45 URLs from event 999999212F07B50000000068, open, M, page 2 in season 1.
Collected 77 URLs from event 999999212F07B50000000068, open, W, page 1 in season 1.
Collected 41 URLs from event 999999212F07B50000000068, pro, M, page 1 in season 1.
Collected 8 URLs from event 999999212F07B50000000068, pro, W, page 1 in season 1.
Collected 100 URLs from event 999999212F07B50000000066, open, M, page 1 in season 1.
Collected 75 URLs from event 999999212F07B50000000066, open, M, page 2 in season 1.
Collected 70 URLs from event 999999212F07B50000000066, open, W, page 1 in season 1.
Collected 65 URLs from event 999999212F07B50000000066, pro, M, page 1 in season 1.
Collected 20 URLs from event 999999212F07B50000000066, pro, W, page 1 in season 1.
Collected 100 URLs from event 999999212F07B50000000065, open, M, page 1 in season 1.
Collected 100 URLs from event 999999212F07B50000000065, open, M, page 2 in sea

This script uses `requests` and BeautifulSoup for web scraping, along with `re` for regular expressions and `urllib.parse` for reading query parameters, to extract detailed information about a participant from a given URL. The `get_participant_details` function fetches the page, parses the HTML, and compiles participant details, scoring data, workout results, and overall time into a structured dictionary.

Key steps include:

*   Sending an HTTP request to the provided URL and parsing the HTML response.
*   Using regular expressions to clean specific text fields, such as removing parenthesized text from names.
*   Iterating through the sections of the HTML document to extract participant details, scoring, workout results, and overall times based on their respective HTML structures.
*   Storing the extracted information in a nested dictionary for easy access and readability.

Finally, the script pretty-prints the collected participant details for verification.

To scrape a different participant, replace the `url` variable with that participant's detail-page URL.
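
The name-cleaning step can be tried on its own. The sketch below uses the same regular expression as the script; `clean_name` and `PAREN_PATTERN` are hypothetical names, and the input with a parenthesized suffix is a made-up example of the kind of text the pattern is meant to strip.

```python
import re

# Matches a parenthesized group together with any whitespace before it
PAREN_PATTERN = re.compile(r'\s*\([^)]*\)')


def clean_name(raw_name):
    """Remove parenthesized fragments from a name and trim surrounding whitespace."""
    return PAREN_PATTERN.sub('', raw_name).strip()


print(clean_name("Williamson, Jake (GBR)"))  # Williamson, Jake
print(clean_name("Williamson, Jake"))        # Williamson, Jake
```

Names without parentheses pass through unchanged, so the cleaning is safe to apply to every row.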

In [3]:
import requests
from bs4 import BeautifulSoup
import re
import csv
from urllib.parse import urlparse, parse_qs

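def get_participant_details(url):
    """Fetch a participant detail page and return its data as a nested dictionary."""
    response = requests.get(url, timeout=30)  # A timeout avoids hanging on a stalled connection
    soup = BeautifulSoup(response.text, 'html.parser')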

    # Parse the URL to extract query parameters
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query)

    # Extract Gender and Event ID from query parameters
    gender = query_params['search[sex]'][0] if 'search[sex]' in query_params else 'Unknown'
    event_id = query_params['event'][0] if 'event' in query_params else 'Unknown'

    # Remove division code from event_id if present
    if '_' in event_id:
        event_id = event_id.split('_', 1)[1]

    details = {
        'Participant': {},
        'Scoring': {},
        'Workout Result': [],
        'Overall Time': {},
        'Gender': gender,
        'Event ID': event_id
    }

    # Regular expression matching a parenthesized group and any whitespace before it
    pattern = re.compile(r'\s*\([^)]*\)')

    # Extract Participant Details
    for participant_row in soup.select(".box-general table tr"):
        header = participant_row.find("th", class_="desc")
        cell = participant_row.find("td", class_="last")
        if header is None or cell is None:
            continue  # Skip rows that lack a label or a value cell
        desc = header.get_text(strip=True)
        # Prefer the image's 'alt' text when the cell holds an image
        img = cell.find("img")
        if img and img.has_attr('alt'):
            value = img['alt']
        else:
            value = cell.get_text(strip=True)

        # Remove parentheses and their contents from the name
        if desc == 'Name':
            value = pattern.sub('', value).strip()
        details['Participant'][desc] = value

    # Extract Scoring Details
    for scoring_row in soup.select(".box-eventinfo table tr"):
        desc = scoring_row.find("th", class_="desc").get_text(strip=True)
        value = scoring_row.find("td", class_="last").get_text(strip=True)
        details['Scoring'][desc] = value

    # Extract Workout Results
    for workout_row in soup.select(".box-other table tr"):
        # Only rows with a description header carry split data
        split_header = workout_row.find("th", class_="desc")
        if split_header:
            columns = workout_row.find_all("td")
            workout = {
                'Split': split_header.get_text(strip=True),
                'Time': columns[0].get_text(strip=True) if len(columns) > 0 else None,
                'Place': columns[1].get_text(strip=True) if len(columns) > 1 else None
            }
            details['Workout Result'].append(workout)

    # Extract Overall Time
    for overall_row in soup.select(".box-totals table tr"):
        desc = overall_row.find("th", class_="desc").get_text(strip=True)
        value = overall_row.find("td", class_="last").get_text(strip=True)
        details['Overall Time'][desc] = value

    return details

# URL for testing, replace with your actual list of URLs
url = 'https://hyrox.r.mikatiming.de/season-6/?content=detail&fpid=list&pid=list&idp=JGDMS4JI9C378&lang=EN_CAP&event=HPRO_JGDMS4JI6BA&num_results=100&pidp=ranking_nav&ranking=time_finish_netto&search%5Bsex%5D=M&search%5Bage_class%5D=%25&search%5Bnation%5D=%25&search_event=HPRO_JGDMS4JI6BA'
participant_details = get_participant_details(url)




# Pretty printing the results
from pprint import pprint
pprint(participant_details)


{'Event ID': 'JGDMS4JI6BA',
 'Gender': 'M',
 'Overall Time': {'Overall Time': '01:00:33',
                  'Rank (AG)': '2',
                  'Rank (M/W)': '2'},
 'Participant': {'Age Group': '25-29',
                 'Name': 'Williamson, Jake',
                 'Nat': 'GBR',
                 'Number': '200004'},
 'Scoring': {'Division': 'HYROX PRO', 'Race': '2024 Manchester'},
 'Workout Result': [{'Place': '–', 'Split': 'Running 1', 'Time': '00:03:14'},
                    {'Place': '10',
                     'Split': '1000m SkiErg',
                     'Time': '00:03:44'},
                    {'Place': '–', 'Split': 'Running 2', 'Time': '00:03:33'},
                    {'Place': '21',
                     'Split': '50m Sled Push',
                     'Time': '00:02:57'},
                    {'Place': '–', 'Split': 'Running 3', 'Time': '00:03:47'},
                    {'Place': '16',
                     'Split': '50m Sled Pull',
                     'Time': '00:04:04'},
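
The `Gender` and `Event ID` values in this output come from the URL's query string rather than from the page HTML. The following sketch isolates that step; `extract_url_fields` is a hypothetical helper name, and the URL is a trimmed version of the test URL used in the cell.

```python
from urllib.parse import urlparse, parse_qs


def extract_url_fields(url):
    """Return (gender, event_id) read from a detail-page URL's query parameters."""
    query_params = parse_qs(urlparse(url).query)
    gender = query_params.get('search[sex]', ['Unknown'])[0]
    event_id = query_params.get('event', ['Unknown'])[0]
    # Drop the division prefix (e.g. "HPRO_") if one is present
    if '_' in event_id:
        event_id = event_id.split('_', 1)[1]
    return gender, event_id


url = ('https://hyrox.r.mikatiming.de/season-6/?content=detail'
       '&event=HPRO_JGDMS4JI6BA&search%5Bsex%5D=M')
print(extract_url_fields(url))  # ('M', 'JGDMS4JI6BA')
```

`parse_qs` decodes `%5B` and `%5D` back to square brackets, which is why the lookup key is written as `search[sex]`.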
         

In [None]:
import time

def scrape_all_participants(base_url):
    collected_urls = main()  # Retrieve the list of URLs to scrape
    all_participant_details = []  # Initialize a list to store details of all participants

    total_urls = len(collected_urls)  # Total number of URLs to scrape
    start_time = time.time()  # Record the start time of the scraping process

    for index, url in enumerate(collected_urls, start=1):
        full_url = base_url + url
        participant_details = get_participant_details(full_url)  # Scrape participant details
        all_participant_details.append(participant_details)

        # Print scraped participant's name for progress tracking
        print(f"Scraped details for {participant_details['Participant'].get('Name', 'Unknown')}")

        # Time estimation logic
        current_time = time.time()
        elapsed_time = current_time - start_time
        avg_time_per_url = elapsed_time / index
        estimated_total_time = avg_time_per_url * total_urls
        estimated_remaining_time = estimated_total_time - elapsed_time

        # Print time estimation every 100 URLs or at certain intervals
        if index % 100 == 0:
            print(f"Completed {index} of {total_urls} URLs.")
            print(f"Estimated remaining time: {estimated_remaining_time/60:.2f} minutes.")

    return all_participant_details

if __name__ == "__main__":
    base_url = "https://hyrox.r.mikatiming.de/season-1/"  # Adjust if necessary
    all_details = scrape_all_participants(base_url)
    print(f"Scraping completed. Total profiles scraped: {len(all_details)}")

Collected 100 URLs from event 999999212F07B50000000068, open, M, page 1 in season 1.
Collected 45 URLs from event 999999212F07B50000000068, open, M, page 2 in season 1.
Collected 77 URLs from event 999999212F07B50000000068, open, W, page 1 in season 1.
Collected 41 URLs from event 999999212F07B50000000068, pro, M, page 1 in season 1.
Collected 8 URLs from event 999999212F07B50000000068, pro, W, page 1 in season 1.
Collected 100 URLs from event 999999212F07B50000000066, open, M, page 1 in season 1.
Collected 75 URLs from event 999999212F07B50000000066, open, M, page 2 in season 1.
Collected 70 URLs from event 999999212F07B50000000066, open, W, page 1 in season 1.
Collected 65 URLs from event 999999212F07B50000000066, pro, M, page 1 in season 1.
Collected 20 URLs from event 999999212F07B50000000066, pro, W, page 1 in season 1.
Collected 100 URLs from event 999999212F07B50000000065, open, M, page 1 in season 1.
Collected 100 URLs from event 999999212F07B50000000065, open, M, page 2 in sea
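
The time-estimation logic in the cell above can be isolated into a small function, which makes it easier to check. This is a sketch with a hypothetical name (`estimate_remaining_seconds`) and made-up numbers; the optional `now` argument exists only so the calculation can be run with fixed inputs.

```python
import time


def estimate_remaining_seconds(start_time, completed, total, now=None):
    """Estimate remaining time from the average time per completed item."""
    if completed <= 0:
        raise ValueError("completed must be at least 1 to compute an average")
    now = time.time() if now is None else now
    elapsed = now - start_time
    avg_per_item = elapsed / completed
    # Same result as (avg_per_item * total) - elapsed
    return avg_per_item * (total - completed)


# Made-up numbers: 100 of 500 URLs finished after 50 seconds
remaining = estimate_remaining_seconds(start_time=0.0, completed=100, total=500, now=50.0)
print(f"Estimated remaining time: {remaining / 60:.2f} minutes.")
```

The estimate assumes each remaining page takes about as long as the average so far, so it drifts if the server slows down or speeds up partway through.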

In [None]:
import csv

def flatten_participant_data(participant, all_splits):
    """Return a flat dict for one participant without modifying the input.

    all_splits is kept for interface compatibility; the column selection
    itself happens in save_to_csv through the DictWriter fieldnames.
    """
    flat_data = {}
    nested_keys = ['Participant', 'Scoring', 'Overall Time']

    # Flatten the nested dictionaries into "<section>_<field>" columns
    for key in nested_keys:
        for sub_key, value in participant.get(key, {}).items():
            flat_data[f"{key}_{sub_key}"] = value

    # Flatten the 'Workout Result' list: one column per split, holding its time
    for workout in participant.get('Workout Result', []):
        split_name = workout['Split'].replace(' ', '_')  # Replace spaces with underscores
        flat_data[split_name] = workout['Time']

    # Copy the remaining top-level key-values (e.g. 'Gender', 'Event ID') as is
    for key, value in participant.items():
        if key not in nested_keys and key != 'Workout Result':
            flat_data[key] = value

    return flat_data

def save_to_csv(participants, filename, all_splits):
    # Determine all fieldnames
    base_fields = ['Event ID','Participant_Name', 'Gender', 'Participant_Age Group', 'Participant_Number', 'Participant_Nat', 'Scoring_Race', 'Scoring_Division', 'Overall Time_Rank (M/W)', 'Overall Time_Rank (AG)', 'Overall Time_Overall Time']
    fieldnames = base_fields + all_splits

    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()

        for participant in participants:
            flat_participant = flatten_participant_data(participant, all_splits)
            writer.writerow(flat_participant)

# Define all the specific splits
all_splits = [
    'Running_1','1000m_SkiErg', 'Running_2', '50m_Sled_Push', 'Running_3', '50m_Sled_Pull',
    'Running_4', '80m_Burpee_Broad_Jump', 'Running_5', '1000m_Row', 'Running_6',
    '200m_Farmers_Carry', 'Running_7', '100m_Sandbag_Lunges', 'Running_8',
    'Wall_Balls', 'Roxzone_Time', 'Run_Total', 'Best_Run_Lap'
]

# Example usage
filename = '/content/participants.csv'  # Adjust the path if needed
save_to_csv(all_details, filename, all_splits)

print(f"Data saved to {filename}")


Data saved to /content/participants.csv
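
For analysis, the `HH:MM:SS` strings in the CSV usually need converting to numbers. The sketch below is one possible way to do that, not part of the scraper itself: `time_to_seconds` is a hypothetical helper, and the in-memory CSV is a stand-in for `participants.csv` that reuses two values from the earlier sample output.

```python
import csv
import io


def time_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return hours * 3600 + minutes * 60 + seconds


# In-memory stand-in for participants.csv (only three of the real columns)
sample_csv = io.StringIO(
    "Participant_Name,Overall Time_Overall Time,Running_1\n"
    '"Williamson, Jake",01:00:33,00:03:14\n'
)

for row in csv.DictReader(sample_csv):
    total = time_to_seconds(row["Overall Time_Overall Time"])
    run_1 = time_to_seconds(row["Running_1"])
    print(row["Participant_Name"], total, run_1)  # Williamson, Jake 3633 194
```

Cells that are empty or do not hold a time would raise a `ValueError` here, so real data would need a guard before conversion.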
