Good job last week! Your code looks good!

Now we need a detector that decides whose code to use to parse a given HTML file.

In this notebook we are going to write code to determine whether to use the logic you developed last week to transform an HTML file into a dataframe, or to use someone else's code. 

#### Let's Get Set Up!

In the cell below, point the file path to your own directory. 

In [182]:
#my html :  
html_content = '/Users/ml/Desktop/research_assistants copy/raw_html_files/meet_44115.html'

#html_content = '/Users/ml/Desktop/research_assistants copy/raw_html_files/meet_494231.html'

#html_content = '/Users/ml/Desktop/research_assistants copy/raw_html_files/meet_493916.html'

Read the html file you wrote code for last week.

In [183]:
from bs4 import BeautifulSoup
with open(html_content, 'r', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')
soup.prettify()[:500] 

'<!DOCTYPE html>\n<html lang="en" xmlns:="">\n <head>\n  <script src="https://cmp.osano.com/AzyWAQS5NWEEWkU9/eab0a836-8bac-45b1-8b3e-e92e57e669db/osano.js?language=en">\n  </script>\n  <script src="https://www.flolive.tv/osano-flo.js">\n  </script>\n  <!-- Google Tag Manager -->\n  <script>\n   (function (w, d, s, l, i) {\n            w[l] = w[l] || [];\n            w[l].push({\n                \'gtm.start\':\n                    new Date().getTime(), event: \'gtm.js\'\n            });\n            var f = d.getEle'

### Step 1: Visually Compare the Webpages  

Before writing any code, **take a careful look** at the webpages to spot what makes *your* HTML different from your teammates’ HTML.  
These differences will later form the basis of the “logic” you write in your detector function.

**Here are the URLs we’ll compare:**
- [Aragon’s Center Meet 3 (2008)](https://ca.milesplit.com/meets/44115-aragons-center-meet-3-2008/results)  
- [CVAA Preview (2022)](https://ca.milesplit.com/meets/493916-cvaa-preview-2022/results/846055/raw)  
- [CVL Meet 3 (ACE, 2022)](https://ca.milesplit.com/meets/494231-cvl-meet-3-ace-2022/results/846020/raw)

**Here’s the page for my own example test:**
- [Mission/Angelus League Clusters #2 (2025)](https://ca.milesplit.com/meets/711006-missionangelus-league-clusters-2-2025/results)

---

### Step 2: Identify the Differences  

When comparing these pages, notice *how the data is structured differently*.  
Ask yourself questions like:
- Does my meet’s data appear inside a **table**, or is it just text?  
- Are the **column headers** (e.g., “Place”, “Team”, “Mark”) the same or different? 
- Do I see any unique features **in or around** the table?   

Here’s what I found in my own example:
1. Some meet results are formatted as plain text, not tables.   
2. My column headers are slightly different from the other formats.  
3. My meet results include **URLs inside the table** (for athletes and teams) — others do not. 
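
If you want to double-check the questions above in code, here is a rough sketch. `summarize_structure` is a hypothetical helper name (it is not part of last week's code), and the two HTML strings are made-up toy inputs, not real meet pages.

```python
from bs4 import BeautifulSoup

def summarize_structure(html: str) -> dict:
    """Collect a few structural facts about a results page (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    # Header text from every <th>, lowercased, skipping blank cells
    headers = [
        th.get_text(" ", strip=True).lower()
        for tbl in tables
        for th in tbl.find_all("th")
        if th.get_text(strip=True)
    ]
    # Count links inside tables (a link in a nested table is counted once per enclosing table)
    links_in_tables = sum(len(tbl.select("a[href]")) for tbl in tables)
    return {
        "has_table": bool(tables),
        "has_pre": soup.find("pre") is not None,
        "headers": headers,
        "links_in_tables": links_in_tables,
    }

# Toy inputs, made up for illustration
table_page = (
    '<table class="eventTable"><thead><tr><th>Place</th><th>Athlete</th></tr></thead>'
    '<tbody><tr><td>1</td><td><a href="/a/1">A. Runner</a></td></tr></tbody></table>'
)
text_page = '<div id="meetResultsBody"><pre>Varsity Boys\n1 A. Runner SR</pre></div>'

table_summary = summarize_structure(table_page)
text_summary = summarize_structure(text_page)
print(table_summary)
print(text_summary)
```

Running this on each teammate's file side by side makes it easier to see which facts differ between formats.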

---

### Step 3: Inspect the HTML  

To confirm your visual observations:
1. Open the webpage in Chrome.  
2. Right-click on part of the results → select **“Inspect”**.  
3. Look closely at the HTML structure — especially the table tags (`<table>`, `<thead>`, `<tbody>`, `<th>`, `<td>`).  
4. Identify patterns that are *unique* to your page type (like a specific `class` name or missing headers).

This step is where you’ll find the **exact clues** your code can look for.
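
You can get a similar view to Chrome's Inspect panel from Python. The sketch below finds the text node containing a string you can see on the page and lists the tags that wrap it, with their `id` and `class` values. `describe_ancestors` is a hypothetical helper name, and the sample HTML is a shortened stand-in for a real page.

```python
import re
from bs4 import BeautifulSoup

def describe_ancestors(html: str, needle: str) -> list:
    """Return 'tag#id.class' labels for the tags wrapping the first text containing `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=re.compile(re.escape(needle)))
    if node is None:
        return []
    labels = []
    for parent in node.parents:
        if parent.name == "[document]":  # the BeautifulSoup root object, not a real tag
            break
        label = parent.name
        if parent.get("id"):
            label += "#" + parent["id"]
        if parent.get("class"):
            label += "." + ".".join(parent["class"])  # class is a list of tokens
        labels.append(label)
    return labels

sample = '<div id="meetResultsBody"><pre>1   Daniel Filipcik   SR   Woodside</pre></div>'
path = describe_ancestors(sample, "Daniel Filipcik")
print(path)  # ['pre', 'div#meetResultsBody']
```

The innermost tag comes first, so the list reads from the result text outward.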

---

### Step 4: Write Logic to Capture Those Differences  

Now that you know what’s unique about your HTML, write Python logic that **detects** those differences.  
Your goal is to have a detector function that looks at a webpage and scores how likely it is to match your format (from `0.0` = definitely not yours to `1.0` = definitely yours).  

For example:
- +0.6 if the page has a table with the class `"eventTable"`.  
- +0.3 if the table headers match your expected names.  
- +0.1 if there are links inside the table body.

Notice that my sub-scores add up to 1.

Your function will combine these checks to produce a final confidence score between 0 and 1.
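
As a minimal sketch of that combining step, each check can be written as a `(weight, passed)` pair and the weights of the passing checks summed. `combine_signals` is a hypothetical name used only for illustration; the weights are the example weights listed above.

```python
def combine_signals(signals):
    """Sum the weights of the checks that passed, capped at 1.0."""
    total = sum(weight for weight, passed in signals if passed)
    return min(total, 1.0)

example_signals = [
    (0.6, True),   # table with class "eventTable" was found
    (0.3, True),   # headers match the expected names
    (0.1, False),  # no links inside the table body
]
example_score = combine_signals(example_signals)
print(f"{example_score:.2f}")  # 0.90
```

Keeping the weights in one place like this makes it easy to see whether they add up to 1 and to adjust them later.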

---

### Step 5: Understand the Expected Output  

Below is an example description of what your detector function should do conceptually:

> The function looks for features that are **unique** to my HTML file compared to my teammates’ files.  
> Each unique feature adds to a total score between 0 and 1.  
> The higher the score, the more confident the script is that this HTML belongs to my format.

Once you’ve implemented your version, you’ll test it on all the HTML samples to verify that:
- Your own HTML file scores close to **1.0**  
- Your teammates’ HTML files score close to **0.0**


#### Let's look at some of my code

Here is what my code looked like AFTER development, but before refactoring.  For the rest of this notebook we will work to replicate my process with your own function.

For now, let's get an overview of what I wrote for my detect_katie function.

In [184]:
REQUIRED_HEADERS_KATIE = {"place", "video", "athlete", "team", "mark", "points"}

def detect_katie(html: str) -> float:
    """
    Return a confidence in [0,1] that this HTML matches the 'Katie' format. 
    NOTES FOR IMPLEMENTATION: 
    - Name your function detect_YOURNAME 
    - adjust signatures and weights for your format as needed 
    - you do not need to include all types of signals shown here if you can generate a reliable score with fewer 
    - be careful not to use signals that are too general and might appear in other formats 
    - aim for a final score that is well-calibrated (e.g., real Katie files score near 1.0, non-Katie files near 0.0) 
    
    Signals: 
    1) Table has class 'eventTable' (strong, +0.6) 
    2) Header names include Place/Video/Athlete/Team/Mark/Points (secondary, up to +0.3) 
    3) There are links (<a>) inside the table body (weak, +0.1)
    """
    soup = BeautifulSoup(html, "html.parser")
    score = 0.0

    # --- 1) Strong: find a table whose class list includes 'eventTable' (robust to spacing/case) ---
    event_table = None
    for tbl in soup.find_all("table"):
        cls = tbl.get("class", [])
        # BeautifulSoup usually gives a list; if string, split on whitespace
        if isinstance(cls, str):
            cls = cls.split()
        tokens = [c.strip().lower() for c in cls if c and c.strip()]
        if any("eventtable" in tok for tok in tokens):
            event_table = tbl
            break

    # If no EventTable found, score remains 0 and we exit early
    if event_table is None:
        return 0.0

    # Found the table → add strong credit
    score += 0.6

    # --- 2) Secondary: header names present (ignore blank headers like <th></th>) ---
    th_texts = [
        th.get_text(" ", strip=True).strip().lower()
        for th in event_table.select("thead th")
        if th.get_text(strip=True)
    ]
    headers_found = set(th_texts)
    matched = len(REQUIRED_HEADERS_KATIE.intersection(headers_found))
    if matched:
        score += 0.3 * (matched / len(REQUIRED_HEADERS_KATIE))  # proportional credit

    # --- 3) Weak: presence of links in tbody (athlete/team profile links) ---
    links_in_body = event_table.select("tbody a[href]")
    if len(links_in_body) >= 1:
        score += 0.05
    if len(links_in_body) >= 2:
        score += 0.05  # full weak credit

    return min(score, 1.0)

score = detect_katie(html) 
print(f"Katie format confidence score: {score:.2f}")


Katie format confidence score: 0.00


#### Development part 1: Write a function outline with the correct name and logic steps.

Let's create a roadmap by including the logic we expect to see as notes to ourselves during this process.

In [185]:
def detect_max(html: str) -> float: ## replace the function name with your own
    """
    Return a confidence in [0,1] that this HTML matches the 'Max' format. 
    NOTES FOR IMPLEMENTATION: 
    - Name your function detect_YOURNAME 
    - adjust signatures and weights for your format as needed 
    - you do not need to include all types of signals shown here if you can generate a reliable score with fewer 
    - be careful not to use signals that are too general and might appear in other formats 
    - aim for a final score that is well-calibrated (e.g., real Max files score near 1.0, non-Max files near 0.0) 
    
    Signals (first draft, adapted from detect_katie): 
    1) A table whose class list does NOT include 'eventTable' (strong, +0.6) 
    2) Header names include Place/Athlete/Grade/Team/Avg Mile/Finish/Points (secondary, up to +0.3) 
    3) There are NO links (<a>) inside the table body (weak, +0.1)
    """
    ## Read in my html file with beautiful soup.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    score = 0.0
    # --- 1) Strong: find a table whose class list does NOT include 'eventTable' ---
    table = None
    for tbl in soup.find_all("table"):
        classes = [c.lower() for c in tbl.get("class", [])]
        if not any("eventtable" in c for c in classes):
            table = tbl
            break

    # If no such table is found, score remains 0 and we exit early
    if not table:
        return 0.0  # early exit if no relevant table

    score += 0.6  # found the old-style table

    # --- 2) Secondary: header names present (ignore blank headers like <th></th>) ---
    REQUIRED_HEADERS = {"place", "athlete", "grade", "team", "avg mile", "finish", "points"}
    th_texts = [th.get_text(" ", strip=True).lower() for th in table.find_all("th")]
    headers_found = set(th_texts)
    matched = len(REQUIRED_HEADERS.intersection(headers_found))

    if matched:
        score += 0.3 * (matched / len(REQUIRED_HEADERS))  # proportional match

    # --- 3) Weak: presence of links in tbody (athlete/team profile links) ---
    links_in_body = table.select("tbody a[href]")
    if len(links_in_body) == 0:
        score += 0.1
    return min(score, 1.0) #return the final score, or 1 if the score is bigger than 1.

score = detect_max(html) 
print(f"Max format confidence score: {score:.2f}")

Max format confidence score: 0.00


#### Development Part 2: Generate working logic for a distinguishing feature of your code.

Let's take the first distinguishing feature - the table name - and write code to isolate that feature.

In [186]:
# --- 1) Strong: find a table whose class list includes 'eventTable' (robust to spacing/case) ---
#event_table = None
#for tbl in soup.find_all("table"):
   # cls = tbl.get("class", [])
    # BeautifulSoup usually gives a list; if string, split on whitespace
   # if isinstance(cls, str):
       # cls = cls.split()
   # tokens = [c.strip().lower() for c in cls if c and c.strip()]
   # if any("eventtable" == tok or "eventtable" in tok for tok in tokens):
       ## event_table = tbl

#event_table



# --- 1) Strong: find a div with id='meetResultsBody' (robust to spacing/case) ---
results_div = None
for div in soup.find_all("div"):
    div_id = div.get("id", "")
    # Handle if id is somehow a list or string, normalize it
    if isinstance(div_id, list):
        div_id = " ".join(div_id)
    div_id_normalized = div_id.strip().lower()
    if "meetresultsbody" == div_id_normalized or "meetresultsbody" in div_id_normalized:
        results_div = div
        break

results_div

<div id="meetResultsBody">
<pre> Aragon's Center Meet #3 
                   Millbrae, CA    - 10/30/2008
Varsity Boys
    Name                    Year School                Avg Mile     Finals  Points
1   Daniel Filipcik         SR   Woodside              5:11         15:18         
2   Kevin Liao              SR   Evergreen Valley      5:28         16:09   1     
3   Max Keleher             SR   Burlingame            5:34         16:27   2     
4   Grant Foster            FR   Prospect              5:35         16:29   3     
5   Paul Rechsteiner        SR   Sacred Heart Cathedral5:35         16:30   4     
6   Tom Liu                 SR   Evergreen Valley      5:39         16:40   5     
7   Peter Gunn              JR   Woodside              5:40         16:43         
8   Cesar  Aguilar          FR   Half Moon Bay         5:40         16:45   6     
9   Louis Dressel           SR   San Mateo             5:41         16:48   7     
10  Nathan Lee              JR   Carlmont          

#### Development Part 2: Assign a score to your logic. 

In my function I want to assign a score to the table existing.  Since the table name is different between my html file and all other html files, I'm going to give this a "strong" score indicator.  Since some html files don't even have an event table, we will exit my detector if no table with the correct name is found.

In [187]:
#score=0.0
# If no EventTable found, score remains 0 and we exit early
#if event_table is None:
   # score = 0.0 #in the function we would exit here

# Found the table → add strong credit
#score += 0.6

#score


score = 0.0
if results_div is None:
    # No meetResultsBody div found: score remains 0 (in the function we would exit here)
    score = 0.0
else:
    # Found the div → add strong credit
    score += 0.6

score

0.6

### Development Part 3: Do the same thing for other distinguishing features.

If the table exists, then we apply our weaker logic (URLs existing in the table, the header names being correct) to verify that we should keep using my logic. Remember, in the end only the highest score ABOVE 0.8 will determine which web scraping algorithm to use. So your detect code should return a score above 0.8 for your HTML, and below 0.3 for the other samples.

Continue writing logic for other differences you noticed between tables. Put it together into a function that takes the HTML (the string read from your file) as input and returns a score between 0 and 1 as a float.

Finally, test your function on all provided HTML files to ensure it correctly identifies your format with high confidence and others with low confidence.
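
To make the "highest score above 0.8" rule concrete, here is a rough sketch of how a set of detectors might be combined later. `pick_format` and the stub detectors are hypothetical stand-ins for illustration; the real detectors would be functions like `detect_katie` and `detect_max`.

```python
from typing import Callable, Dict, Optional

THRESHOLD = 0.8  # from the text above: the winning score must be ABOVE 0.8

def pick_format(html: str, detectors: Dict[str, Callable[[str], float]]) -> Optional[str]:
    """Return the name of the best-scoring detector, or None if no score beats THRESHOLD."""
    if not detectors:
        return None
    scores = {name: detect(html) for name, detect in detectors.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > THRESHOLD else None

# Stand-in detectors (hypothetical, far simpler than a real detector)
stub_detectors = {
    "katie": lambda html: 0.9 if "eventTable" in html else 0.0,
    "max": lambda html: 1.0 if "<pre>" in html else 0.0,
}

choice_pre = pick_format('<div id="meetResultsBody"><pre>Varsity Boys</pre></div>', stub_detectors)
choice_none = pick_format("<p>no results here</p>", stub_detectors)
print(choice_pre)   # max
print(choice_none)  # None
```

This is why each detector needs to score low on other people's files: a high score on the wrong format could win the comparison and send the page to the wrong parser.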

In [188]:
#Develop your other logic here

# --- 2) Medium-Strong: Check for <pre> tag inside results_div ---
pre_tag = None
if results_div:
    pre_tag = results_div.find('pre')

if pre_tag:
    score += 0.3  # Found pre tag with results
    
score

0.8999999999999999
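
The score prints as `0.8999999999999999` rather than `0.9` because `0.6` and `0.3` cannot be stored exactly as binary floating-point numbers. That is harmless for our thresholds, but it matters if you ever compare a score with `==`. A small sketch of the behavior and two common ways to handle it:

```python
import math

raw = 0.6 + 0.3
print(raw)                     # 0.8999999999999999
print(raw == 0.9)              # False
print(math.isclose(raw, 0.9))  # True
print(round(raw, 2))           # 0.9
```

Formatting with `f"{score:.2f}"`, as the print statements in this notebook do, also hides the rounding noise.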

#### Development Part 4: Combine each of your indicators into the same function.  Make sure your score adds to at least 1. 

Below I've added print statements to help me test that my logic is working for each feature, and to track how the score changes.

In [189]:
REQUIRED_HEADERS_KATIE = {"place", "video", "athlete", "team", "mark", "points"}


def detect_max(html: str) -> float:
    """
    Return a confidence in [0,1] that this HTML matches the 'Max' format (MileSplit pre-tag format). 
    
    Signals: 
    1) Div has id 'meetResultsBody' AND contains a <pre> tag (required) 
    2) Text contains division patterns like "Varsity Boys", "JV Girls" (strong, +0.6)
    3) Text does NOT have the specific equal-sign-header pattern from other formats (required)
    4) Text contains grade patterns "FR", "SO", "JR", "SR" (secondary, +0.4)
    """
    from bs4 import BeautifulSoup
    import re
    soup = BeautifulSoup(html, "html.parser")
    score = 0.0

    # --- 1) Strong: find a div with id='meetResultsBody' (robust to spacing/case) ---
    results_div = None
    for div in soup.find_all("div"):
        div_id = div.get("id", "")
        # Handle if id is somehow a list or string, normalize it
        if isinstance(div_id, list):
            div_id = " ".join(div_id)
        div_id_normalized = div_id.strip().lower()
        if "meetresultsbody" == div_id_normalized or "meetresultsbody" in div_id_normalized:
            results_div = div
            break

    # If no meetResultsBody div found, exit early
    if results_div is None:
        print("No meetResultsBody div found, score = 0.0")
        return 0.0

    print("Found meetResultsBody div")

    # --- 2) CRITICAL: Check for <pre> tag inside results_div ---
    pre_tag = None
    if results_div:
        pre_tag = results_div.find('pre')

    # BOTH div AND pre must exist for this to be Max's format
    if pre_tag is None:
        print("No pre tag found, score = 0.0")
        return 0.0
    
    print("Found pre tag")
    
    text_content = pre_tag.get_text()
    
    # --- 3) CRITICAL EXCLUSION: Check for the specific pattern of column headers with equals (NOT Max's format) ---
    # Pattern: lines with "Pl Athlete" or similar followed by a line of many equal signs
    if re.search(r'Pl\s+Athlete.*\n=====', text_content):
        print("Found 'Pl Athlete' header pattern (wrong format), score = 0.0")
        return 0.0
    
    print("No 'Pl Athlete' header pattern found (good)")

    # --- 4) Strong: Check for division patterns in the text ---
    text_lower = text_content.lower()
    division_patterns = ["varsity boys", "varsity girls", "jv boys", "jv girls", "frosh/soph"]
    division_matches = sum(1 for pattern in division_patterns if pattern in text_lower)
    if division_matches >= 1:
        score += 0.6  # Found at least one division pattern
    print(f"After checking division patterns ({division_matches} found), score =", score)

    # --- 5) Secondary: Check for grade abbreviations (FR, SO, JR, SR) ---
    grade_patterns = [" FR ", " SO ", " JR ", " SR "]
    grade_matches = sum(1 for pattern in grade_patterns if pattern in text_content)
    if grade_matches >= 2:
        score += 0.4  # Found multiple grade patterns
    print(f"After checking grade patterns ({grade_matches} found), score =", score)

    return min(score, 1.0)


score = detect_max(html) 
print(f"Max format confidence score: {score:.2f}")


Found meetResultsBody div
Found pre tag
No 'Pl Athlete' header pattern found (good)
After checking division patterns (4 found), score = 0.6
After checking grade patterns (4 found), score = 1.0
Max format confidence score: 1.00


#### Development part 5: Refactor your function and ensure it still works properly.

Finally, ask your LLM to refactor the code you have developed with all parts included. Ensure that the input is the HTML string read from your file and the output is a float. The refactored code should be easier for you or a teammate to use in the future.

In [190]:
from typing import Optional
from bs4 import BeautifulSoup, Tag
import re

# --- Config ---
W_DIV_AND_PRE = 0.6      # Strong signal: found both div and pre tag
W_DIVISIONS = 0.4         # Strong signal: found division patterns
# W_GRADES = 0.4          # Secondary signal: found grade patterns (commented out since we reach 1.0 without it)

DIVISION_PATTERNS = ["varsity boys", "varsity girls", "jv boys", "jv girls", "frosh/soph"]
GRADE_PATTERNS = [" FR ", " SO ", " JR ", " SR "]

# --- Helpers ---
def _find_meet_results_div(soup: BeautifulSoup) -> Optional[Tag]:
    """Return a <div> whose id is 'meetResultsBody' (robust to case/spacing)."""
    for div in soup.find_all("div"):
        div_id = div.get("id", "")
        # Handle if id is somehow a list or string, normalize it
        if isinstance(div_id, list):
            div_id = " ".join(div_id)
        div_id_normalized = div_id.strip().lower()
        if "meetresultsbody" == div_id_normalized or "meetresultsbody" in div_id_normalized:
            return div
    return None

def _has_pl_athlete_pattern(text: str) -> bool:
    """Check if text contains the 'Pl Athlete' header pattern (other format's signature)."""
    return bool(re.search(r'Pl\s+Athlete.*\n=====', text))

def _division_pattern_score(text: str) -> float:
    """Return score based on division patterns found in text."""
    text_lower = text.lower()
    matches = sum(1 for pattern in DIVISION_PATTERNS if pattern in text_lower)
    return W_DIVISIONS if matches >= 1 else 0.0

def _grade_pattern_score(text: str) -> float:
    """Return score based on grade abbreviation patterns found in text."""
    matches = sum(1 for pattern in GRADE_PATTERNS if pattern in text)
    return 0.4 if matches >= 2 else 0.0

# --- Detector ---
def detect_max(html: str) -> float:
    """
    Confidence in [0,1] that this HTML matches the 'Max' (MileSplit pre-tag) format.

    Signals:
      1) Div has id 'meetResultsBody' AND contains a <pre> tag (required)
      2) Text does NOT contain 'Pl Athlete' header pattern (exclusion)
      3) Text contains division patterns like "Varsity Boys", "JV Girls" (+0.4)
      4) Text contains grade patterns "FR", "SO", "JR", "SR" (+0.4, optional)
    """
    soup = BeautifulSoup(html, "html.parser")
    
    # Check for required div
    results_div = _find_meet_results_div(soup)
    if results_div is None:
        return 0.0
    
    # Check for required pre tag
    pre_tag = results_div.find('pre')
    if pre_tag is None:
        return 0.0
    
    text_content = pre_tag.get_text()
    
    # Exclusion: check for wrong format pattern
    if _has_pl_athlete_pattern(text_content):
        return 0.0
    
    # Build score from positive signals
    score = 0.0
    score += W_DIV_AND_PRE                      # found both div and pre
    score += _division_pattern_score(text_content)  # division patterns
    # Uncomment if needed to reach higher scores:
    # score += _grade_pattern_score(text_content)     # grade patterns
    
    return min(score, 1.0)

score = detect_max(html) 
print(f"Max format confidence score: {score:.2f}")

Max format confidence score: 1.00


#### Development Part 6: Double check that your code correctly scores all htmls.

Verify that your code not only scores your HTML highly, but also gives everyone else's HTML a low score. If your score is below 0.8 or a teammate's score is above 0.5, then edit your function.
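
If you would rather not check the printed scores by eye, the two thresholds can be checked in code. This is only a sketch: `check_scores` is a hypothetical helper, and the score dictionary holds example values typed in by hand rather than computed by a detector.

```python
def check_scores(scores, own_file, own_min=0.8, other_max=0.5):
    """Return a list of problems; an empty list means the detector passes both thresholds."""
    problems = []
    for name, score in scores.items():
        if name == own_file:
            if score < own_min:
                problems.append(f"{name}: own file scored {score:.2f}, below {own_min}")
        elif score > other_max:
            problems.append(f"{name}: teammate file scored {score:.2f}, above {other_max}")
    return problems

# Example values typed in by hand for illustration
example_scores = {
    "meet_711006.html": 0.0,
    "meet_44115.html": 1.0,
    "meet_493916.html": 0.0,
    "meet_494231.html": 0.0,
}
issues = check_scores(example_scores, own_file="meet_44115.html")
print(issues or "All files scored as expected")
```

In practice you would fill the dictionary with the scores your own detector returns for each file.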

In [193]:
#Change this base path to match your directory.
base_path = '/Users/ml/Desktop/research_assistants copy/raw_html_files'

In [196]:
test_files_names = ['meet_711006.html', 'meet_44115.html', 'meet_493916.html', 'meet_494231.html'] 
test_files = [base_path + "/" + name for name in test_files_names]

for name, file in zip(test_files_names, test_files):
    with open(file, 'r', encoding='utf-8') as f:
        html = f.read()
    score = detect_max(html)
    print(f"File: {name}, Max format confidence score: {score:.2f}")


'/Users/ml/Desktop/research_assistants copy/raw_html_files/meet_711006.html'

File: meet_711006.html, Max format confidence score: 0.00
File: meet_44115.html, Max format confidence score: 1.00
File: meet_493916.html, Max format confidence score: 0.00
File: meet_494231.html, Max format confidence score: 0.00

