This workbook scrapes data from the URL: <https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results>

As our very first step, let's extract the race_id from the URL programmatically. It will be saved in our output dataframe and used as part of the name of our output file.

In the space below, generate a variable called race_id and assign it the numeric ID that comes before the race name in the URL. For example, a URL whose meet segment starts with 123456- has a race_id of 123456, and in the URL <https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results> the race_id is 44115. Verify that your code works by extracting the race_id from both an example URL and your primary URL.

In [1]:
def extract_race_id(url):
    #define your function here!
    try:
        # Find the segment after '/meets/' and split on '-' to get the race_id
        race_segment = url.split("/meets/")[1]
        race_id = race_segment.split("-")[0]
        return race_id
    except (IndexError, AttributeError):
        return None  # Graceful handling if URL format is unexpected


url = "https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results"



race_id = extract_race_id(url)

print("Race ID:", race_id)

Race ID: 44115
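
The cell above checks only the primary URL, and it returns whatever text precedes the first hyphen without confirming that it is numeric. As a minimal sketch of a stricter check, the variant below uses a regular expression that accepts digits only. The function name `extract_race_id_regex` and the six-digit example URL are made up for illustration; they are not part of the assignment.

```python
import re

def extract_race_id_regex(url):
    """Return the digits that follow '/meets/' in the URL, or None."""
    if not isinstance(url, str):
        return None
    # Digits right after '/meets/', followed by '-', '/', or the end of the URL
    match = re.search(r"/meets/(\d+)(?:-|/|$)", url)
    return match.group(1) if match else None

primary_url = "https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results"
# Hypothetical URL, invented here only to exercise a six-digit ID
example_url = "https://ca.milesplit.com/meets/123456-example-meet-2008/results"

primary_id = extract_race_id_regex(primary_url)
example_id = extract_race_id_regex(example_url)
missing_id = extract_race_id_regex("https://ca.milesplit.com/teams/")

print("Primary:", primary_id)
print("Example:", example_id)
print("No meet segment:", missing_id)
```

Returning None for a URL without a numeric meet segment keeps the behaviour in line with the `extract_race_id` function above.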


Now let's process the HTML file!

To get you started, I've saved a file with example webpage (HTML) code to the raw_html_files folder. Change the file path to match the file path on your computer, then verify that the HTML file is being read correctly.

In [2]:
html_file_path = r"/Users/ml/Desktop/research_assistants copy/raw_html_files/meet_44115.html"

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import os

with open(html_file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
soup.prettify()[:500]  # Preview the first 500 characters of the formatted HTML

'<!DOCTYPE html>\n<html lang="en" xmlns:="">\n <head>\n  <script src="https://cmp.osano.com/AzyWAQS5NWEEWkU9/eab0a836-8bac-45b1-8b3e-e92e57e669db/osano.js?language=en">\n  </script>\n  <script src="https://www.flolive.tv/osano-flo.js">\n  </script>\n  <!-- Google Tag Manager -->\n  <script>\n   (function (w, d, s, l, i) {\n            w[l] = w[l] || [];\n            w[l].push({\n                \'gtm.start\':\n                    new Date().getTime(), event: \'gtm.js\'\n            });\n            var f = d.getEle'

If the HTML is displayed above, you have read in your file!
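
Eyeballing the preview works, but a wrong path or the wrong file is easier to catch with an explicit check. The sketch below is one possible way to do that using only the standard library. The helper names `read_html` and `looks_like_results_page` are hypothetical, the sample file is generated on the fly purely for demonstration, and the check assumes that a results page contains a `div` with the id `meetResultsBody`, which is the id the next cell searches for.

```python
from pathlib import Path
import tempfile

def read_html(path):
    """Return the file's text, or raise a clear error if the path is wrong."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No HTML file found at: {path}")
    return path.read_text(encoding="utf-8")

def looks_like_results_page(html_text):
    """Rough check: an HTML document that contains the results container."""
    # Assumption: results pages include a div with id="meetResultsBody"
    return "<html" in html_text.lower() and 'id="meetResultsBody"' in html_text

# Demonstration with a tiny made-up file written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "meet_sample.html"
    sample.write_text(
        '<!DOCTYPE html><html><body>'
        '<div id="meetResultsBody"><pre>Varsity Boys</pre></div>'
        '</body></html>',
        encoding="utf-8",
    )
    html_text = read_html(sample)
    page_ok = looks_like_results_page(html_text)

print("Looks like a results page:", page_ok)
```

To use it on the real file, pass `html_file_path` to `read_html` instead of the temporary sample.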

Next, identify the portion of the HTML file that contains the individual results table we want to scrape and format. Do so below.

In [4]:
def find_results_table(soup):
    # soup.find returns None when no matching div exists, so no extra check is needed
    return soup.find('div', id='meetResultsBody')

find_results_table(soup)

<div id="meetResultsBody">
<pre> Aragon's Center Meet #3 
                   Millbrae, CA    - 10/30/2008
Varsity Boys
    Name                    Year School                Avg Mile     Finals  Points
1   Daniel Filipcik         SR   Woodside              5:11         15:18         
2   Kevin Liao              SR   Evergreen Valley      5:28         16:09   1     
3   Max Keleher             SR   Burlingame            5:34         16:27   2     
4   Grant Foster            FR   Prospect              5:35         16:29   3     
5   Paul Rechsteiner        SR   Sacred Heart Cathedral5:35         16:30   4     
6   Tom Liu                 SR   Evergreen Valley      5:39         16:40   5     
7   Peter Gunn              JR   Woodside              5:40         16:43         
8   Cesar  Aguilar          FR   Half Moon Bay         5:40         16:45   6     
9   Louis Dressel           SR   San Mateo             5:41         16:48   7     
10  Nathan Lee              JR   Carlmont          

Next, transform the content in your html table or text into a pandas dataframe.  The pandas dataframe output must have the following column names:
- race_id
- race_url
- race_name
- place
- video
- athlete
- athlete_url
- grade
- team
- team_url
- finish
- point  

Get race_id from your generated variable above.  Get the race_url from the provided url.

If your text or table does not have the appropriate column names, rename the columns, or create them even if they are empty.

In [5]:
#def generate_dataframe(table):
    # Your code to convert the HTML table to a pandas DataFrame
   # return df

#df = generate_dataframe(table)

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

def extract_race_id_and_name(url):
    try:
        race_segment = url.split("/meets/")[1].split("/")[0]
        parts = race_segment.split("-")
        race_id = parts[0]
        race_name = "-".join(parts[1:]) if len(parts) > 1 else None
        return race_id, race_name
    except (IndexError, AttributeError):
        return None, None

def clean_race_name(slug):
    if slug is None:
        return None
    return slug.replace("-", " ").title()

def find_results_table(soup):
    return soup.find('div', id='meetResultsBody')

def generate_dataframe(div, race_id, race_url, race_name=None):
    # Must exactly match the column order you want
    expected_columns = [
        'race_id', 'race_url', 'race_name', 'place', 'video',
        'athlete', 'athlete_url', 'grade', 'team', 'team_url',
        'finish', 'point'
    ]

    if div is None or div.find('pre') is None:
        return pd.DataFrame(columns=expected_columns)

    text = div.find('pre').get_text("\n", strip=True)

    # Split text into chunks by division (e.g. "Varsity Boys", "JV Girls")
    sections = re.split(r'(?=\b[A-Z][A-Za-z/ &-]+ (?:Boys|Girls)\b)', text)

    data = []
    for section in sections:
        section = section.strip()
        if not section:
            continue

        # Try to extract the division name
        header_match = re.match(r'^([A-Z][A-Za-z/ &/-]+ (?:Boys|Girls))', section)
        if not header_match:
            continue

        division_name = header_match.group(1).strip()
        section_race_name = f"{division_name} - {race_name}"

        # Parse each result line
        for line in section.splitlines():
            line = line.strip()
            if not re.match(r'^\d+\s', line):  # only process lines that start with a place number
                continue

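            # Pattern to extract each field. Rows with no points value, or with a team
            # name that runs straight into the Avg Mile time, do not match and are skipped.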
            match = re.match(
                r'^(\d+)\s+([A-Za-z\'\-. ]+?)\s+(FR|SO|JR|SR)\s+([A-Za-z\'\-. ]+?)\s+\d*:?[\d.]*\s+(\d+:\d+(?:\.\d+)?)\s+(\d+)?$',
                line
            )

            if match:
                place, athlete, grade, team, finish, point = match.groups()
                data.append({
                    'race_id': race_id,
                    'race_url': race_url,
                    'race_name': section_race_name,
                    'place': place,
                    'video': None,
                    'athlete': athlete.strip(),
                    'athlete_url': None,
                    'grade': grade,
                    'team': team.strip(),
                    'team_url': None,
                    'finish': finish,
                    'point': point
                })

    df = pd.DataFrame(data, columns=expected_columns)
    return df

def scrape_race_results(url):
    response = requests.get(url, timeout=30)  # fail instead of hanging if the site does not respond
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    div = find_results_table(soup)
    race_id, raw_race_name = extract_race_id_and_name(url)
    race_name = clean_race_name(raw_race_name)


    df = generate_dataframe(div, race_id, race_url=url, race_name=race_name)

    if df.empty:
        print("No results found. The page format may differ from expected.")
    else:
        print(f"Parsed {len(df)} athlete rows.")
        print("Unique race names found:", df['race_name'].unique())

    return df

# ---- Example usage ----
url = "https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results"
df = scrape_race_results(url)
print(df.head(20))




Parsed 202 athlete rows.
Unique race names found: ['Varsity Boys - Aragons Center Meet 3 2008'
 'Soph Girls - Aragons Center Meet 3 2008'
 'Soph Boys - Aragons Center Meet 3 2008'
 'JV Boys - Aragons Center Meet 3 2008'
 'Varsity Girls - Aragons Center Meet 3 2008']
   race_id                                           race_url  \
0    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
1    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
2    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
3    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
4    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
5    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
6    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
7    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
8    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
9    44115  https://ca.milesplit.com/meets/44115-aragons-c...   
10   44115  https:

In [6]:
expected_columns = [
    'race_id', 'race_url', 'race_name', 'place', 'video', 'athlete', 'athlete_url',
    'grade', 'team', 'team_url', 'finish', 'point'
]

def verify_dataframe_columns(df, expected_columns):
    return list(df.columns) == expected_columns

def verify_dataframe_is_not_empty(df):
    return not df.empty

# Example usage:
df_columns_correct = verify_dataframe_columns(df, expected_columns)
df_is_not_empty = verify_dataframe_is_not_empty(df)

print("Columns are correct:", df_columns_correct)
print("DataFrame is not empty:", df_is_not_empty)


Columns are correct: True
DataFrame is not empty: True


Take time to verify the following about your dataframe: 
1. The dataframe is not empty.
2. Column names are correct and in the correct order.
3. The data in the dataframe matches the data at the website url.

If anything in your dataframe is incorrect, iterate in the space above until it is correct!
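
One thing worth checking while iterating is whether every result line is being parsed. In the sample output above, the first finisher has no Points value and "Sacred Heart Cathedral" runs straight into the Avg Mile time; the pattern in `generate_dataframe` requires whitespace in both places, so rows like these are skipped. The sketch below shows one more tolerant pattern. It assumes every line follows the layout place, name, year, school, average mile, finals, optional points, and the names `RESULT_LINE` and `parse_result_line` are hypothetical.

```python
import re

# Assumed layout: place, name, year, school, avg mile, finals, optional points
RESULT_LINE = re.compile(
    r"^(?P<place>\d+)\s+"
    r"(?P<athlete>.+?)\s+"
    r"(?P<grade>FR|SO|JR|SR)\s+"
    r"(?P<team>.+?)\s*"                 # team may touch the avg-mile time
    r"(?P<avg_mile>\d+:\d{2})\s+"
    r"(?P<finish>\d+:\d{2}(?:\.\d+)?)"
    r"(?:\s+(?P<point>\d+))?$"          # points column is optional
)

def parse_result_line(line):
    """Return a dict of fields for one result line, or None if it does not match."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    # Collapse repeated spaces inside names, e.g. "Cesar  Aguilar"
    row["athlete"] = " ".join(row["athlete"].split())
    row["team"] = " ".join(row["team"].split())
    return row

# Lines copied from the sample output above; the last one is cut off there
sample_lines = [
    "1   Daniel Filipcik         SR   Woodside              5:11         15:18         ",
    "5   Paul Rechsteiner        SR   Sacred Heart Cathedral5:35         16:30   4     ",
    "8   Cesar  Aguilar          FR   Half Moon Bay         5:40         16:45   6     ",
    "10  Nathan Lee              JR   Carlmont          ",
]

rows = [parse_result_line(line) for line in sample_lines]
unparsed = [line.strip() for line, row in zip(sample_lines, rows) if row is None]

print(f"Parsed {len(sample_lines) - len(unparsed)} of {len(sample_lines)} sample lines")
print("Unparsed:", unparsed)
```

Collecting the unparsed lines, rather than silently dropping them, makes it easier to compare the dataframe against the website row by row.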

Now generate the correct file name for your dataframe using the URL as your only input. The URL for the HTML file is: <https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results>

Below, generate a file name in the format TABLETYPE_results_meet_MEETID.csv, for example individual_results_meet_123456.csv. TABLETYPE is either individual or team, depending on whether the file holds individual or team results.

In [7]:
#def generate_filename(url, table_type):
    # Your code to generate the filename based on the URL and table type
   # return filename

#filename = generate_filename(url, "individual")



def generate_filename(url, table_type):
    # Split the URL into parts by the "/" symbol
    parts = url.split("/")

    # Find the part that contains the word "meets"
    # The next part right after it has the meet ID and name
    meet_part = None
    for i in range(len(parts) - 1):
        if parts[i] == "meets":
            meet_part = parts[i + 1]   # Example: "44115-aragons-center-meet-3-2008"
            break

    # Without a "meets" segment there is no meet ID to build the name from
    if meet_part is None:
        raise ValueError(f"No '/meets/' segment found in URL: {url}")

    # The meet ID is the first set of numbers before the "-"
    meet_id = meet_part.split("-")[0]

    # Build the file name using f-string
    filename = f"{table_type}_results_meet_{meet_id}.csv"

    return filename


# Example use
url = "https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results"
filename = generate_filename(url, "individual")

print(filename)


individual_results_meet_44115.csv


Finally, generate the correct file path so that this CSV is saved in the 'output' folder inside the 'research_assistant' folder.

In [8]:
file_path = f"research_assistant/output/{filename}"
print(file_path)


research_assistant/output/individual_results_meet_44115.csv
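
The string above is a relative path, so where the file ends up depends on the notebook's working directory, and saving fails if the folders do not exist yet. As a minimal sketch, `pathlib` can build the same path and create the folders first. The helper name `build_output_path` is hypothetical, and a temporary folder stands in for the project's parent directory so that the example leaves nothing behind.

```python
from pathlib import Path
import tempfile

def build_output_path(base_dir, filename):
    """Return base_dir/research_assistant/output/filename, creating the folders."""
    output_dir = Path(base_dir) / "research_assistant" / "output"
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir / filename

# A temporary folder stands in for wherever the project lives on disk
with tempfile.TemporaryDirectory() as tmp:
    out_path = build_output_path(tmp, "individual_results_meet_44115.csv")
    relative = out_path.relative_to(tmp).as_posix()
    folder_exists = out_path.parent.is_dir()

print(relative)
print("Output folder created:", folder_exists)
```

In the notebook, `base_dir` would be the folder that contains `research_assistant`, and the returned path can be passed straight to `df.to_csv`.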


In [9]:
if df_columns_correct and df_is_not_empty:
    print("DataFrame columns match the expected columns.")
    # file_path already ends with the file name, so only its folder needs creating
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    df.to_csv(file_path, index=False)
else:
    print("DataFrame columns do not match the expected columns or DataFrame is empty.")

DataFrame columns match the expected columns.


***Want extra credit? Generate the team results output as well!***