# Question 4: What is the longest match recorded in terms of duration?

---

## **1. Initial Considerations and Approach**

I wanted to identify the longest match, but I ran into several challenges:

**Duration Data:** 
Ideally, I would have used the match duration (start and end times) to directly determine the longest match. However, when I checked the data, there were several issues:

- Some fields were empty or had NaN values.
- Some columns, such as `period_4` and `period_5`, contained mostly empty data.
- There was a lack of consistency in the match duration information.

**Missing Data:**
I tried filtering out rows with missing or NaN values, but very few valid entries remained. A large portion of the dataset had no usable duration data, which made it impractical to work with the duration directly.
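
The extent of the problem can be quantified with a quick missing-value audit. The sketch below runs on a toy table rather than the real data: the `period_1` and `period_2` column names and all values are assumptions for illustration (only `period_4` and `period_5` are named above).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the match-level table. Column names other than
# period_4 and period_5 are hypothetical, and all values are made up.
toy_matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "period_1": [1800.0, 2100.0, np.nan, 1500.0],
    "period_2": [1700.0, np.nan, np.nan, 1600.0],
    "period_4": [np.nan, np.nan, np.nan, np.nan],
    "period_5": [np.nan, np.nan, np.nan, np.nan],
})

duration_cols = ["period_1", "period_2", "period_4", "period_5"]

# Fraction of missing values in each duration column.
missing_share = toy_matches[duration_cols].isna().mean()
print(missing_share)

# Matches that have at least the first two periods recorded.
usable = toy_matches.dropna(subset=["period_1", "period_2"])
print(f"{len(usable)} of {len(toy_matches)} matches have usable duration data")
```

Running the same two checks on the real match table would show how many matches could be ranked by duration at all.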

Given these limitations, I decided to explore an alternative method to estimate which match was the longest.

---

## **2. Exploring Alternatives**

After attempting to work with the duration and facing challenges, I explored using the number of points played as a proxy for match duration. This method seemed reasonable for a few reasons:

**Correlation Between Points and Duration:** 
In tennis, longer matches generally involve more points, so a high point count is a reasonable signal of a long match.

**Data Availability:** 
The number of points for each match was readily available and did not suffer from the missing-value problems that affected the duration data. This made it a more reliable metric to use in my analysis.
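
The proxy assumption could be tested on whatever matches do have a valid duration, by checking how well the point count ranks them in the same order. The sketch below uses made-up numbers and a hypothetical `duration_minutes` column; it shows the check, not a result from this dataset.

```python
import pandas as pd

# Toy data: the values and the duration_minutes column are invented for
# illustration and are not taken from the dataset.
toy = pd.DataFrame({
    "match_id": [1, 2, 3, 4, 5],
    "num_points": [95, 140, 180, 210, 260],
    "duration_minutes": [62.0, 88.0, 131.0, 125.0, 190.0],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rank_corr = toy["num_points"].rank().corr(toy["duration_minutes"].rank())
print(f"Spearman correlation: {rank_corr:.2f}")
```

A rank correlation close to 1 on the real subset would support using points as a stand-in for duration; a weak one would call the approach into question.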

---

## **3. Choosing the Method: Points as a Proxy for Duration**

I concluded that instead of directly calculating the match duration, I could use the number of points played in each match as an estimate of its length. 
Here’s why I chose this method:

**Consistency:** 
The number of points in each match was consistently recorded in the dataset, and it didn’t have the same missing data issues that the duration columns had.

**Practicality:**
The number of points is a reasonable proxy for match length. A match with more points is generally longer, and tie-breaks, extended games, and closely contested sets all add both points and playing time.

**Simplicity:**
Using the number of points is a straightforward approach. I could group the matches by ID, count the points, and sort them to find the match with the most points. This approach is computationally simple and avoids the complications of working with more complex duration data.

---

## **4. Final Method: Based on the Number of Points Played**

After considering the challenges and alternatives, I decided to move forward with the following method:

1. **Group the data by match ID:** This allowed me to treat each match individually.
2. **Count the number of points played in each match:** I used `.size()` on the grouped data to count the rows per match. Each row in the point-by-point data is one point, so this gives the number of points, which provided a good approximation of how long the match was.
3. **Sort the matches by points in descending order:** Sorting the matches by the number of points played allowed me to easily identify the match with the highest count of points, which I assumed to be the longest match.
4. **Select the longest match:** The match with the highest number of points was selected as the longest match.

In [1]:
import os
from pathlib import Path
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import matplotlib.pyplot as plt

In [2]:
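def get_day_folders(base_path):
    """Return the sorted names of the 2024 day folders under base_path."""
    return sorted([
        folder for folder in os.listdir(base_path)
        if os.path.isdir(os.path.join(base_path, folder)) and folder.startswith("2024")
    ])

def load_file(file):
    """Read one parquet file; report the error and return None if it cannot be read."""
    try:
        return pd.read_parquet(file)
    except Exception as e:
        print(f"Error reading {file}: {e}")
        return None

def load_all_data(base_path, subfolder_name):
    """Load every parquet file in subfolder_name across all day folders into one DataFrame."""
    # Collect the parquet files from each day's data/raw/<subfolder_name> directory.
    all_files = []
    for folder in get_day_folders(base_path):
        path_pattern = Path(base_path) / folder / 'data' / 'raw' / subfolder_name
        all_files.extend(path_pattern.glob("*.parquet"))

    dfs = []
    max_workers = 16

    # Read the files concurrently; unreadable files come back as None and are skipped.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(load_file, file) for file in all_files]
        for future in tqdm(as_completed(futures), total=len(futures), desc="Loading files"):
            result = future.result()
            if result is not None:
                dfs.append(result)

    # Return None when nothing could be loaded.
    return pd.concat(dfs, ignore_index=True) if dfs else None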

In [3]:
base_path = "../data/tennis_data"
print("Loading all tennis data...")
matches_df = load_all_data(base_path, 'raw_match_parquet')
pbp_df = load_all_data(base_path, 'raw_point_by_point_parquet')
print("Loading complete.")

Loading all tennis data...


Loading files: 100%|██████████| 316802/316802 [08:43<00:00, 604.72it/s]
Loading files: 100%|██████████| 22272/22272 [00:38<00:00, 575.32it/s]


Loading complete.


In [4]:
longest_match_by_points = (
    pbp_df.groupby('match_id')
    .size()
    .reset_index(name='num_points')
    .sort_values(by='num_points', ascending=False)
    .head(1)
)

longest_match_id_by_points = longest_match_by_points['match_id'].iloc[0]
longest_match_points = pbp_df[pbp_df['match_id'] == longest_match_id_by_points]

print("Longest Match by Points ID:", longest_match_id_by_points)
print("Total Points Played:", longest_match_by_points['num_points'].iloc[0])

Longest Match by Points ID: 12185562
Total Points Played: 711
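
One limitation of `.head(1)` is that it returns a single row even when several matches share the top count. A hedged alternative, shown on toy data, is `Series.nlargest` with `keep="all"`, which returns every tied match.

```python
import pandas as pd

# Toy point-by-point rows: one row per point, keyed by match_id.
toy_pbp = pd.DataFrame({"match_id": [10, 10, 10, 20, 20, 20, 30, 30]})

points_per_match = toy_pbp.groupby("match_id").size()

# keep="all" returns every match tied for the top count, not just one of them.
top_matches = points_per_match.nlargest(1, keep="all")
print(top_matches)
```

Raising the first argument (for example to 5) also shows how far the leader is ahead of the next matches.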


In [5]:
longest_match_points.head()

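| Unnamed: 0 | match_id | set_id | game_id | point_id | home_point | away_point | point_description | home_point_type | away_point_type | home_score | away_score | serving | scoring |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2210531 | 12185562 | 3 | 13 | 0 | 1 | 0 | 0 | 6 | 5 | 7 | 6 | 2 | 1 |
| 2210532 | 12185562 | 3 | 13 | 1 | 2 | 0 | 0 | 1 | 5 | 7 | 6 | 2 | 1 |
| 2210533 | 12185562 | 3 | 13 | 2 | 2 | 1 | 0 | 5 | 6 | 7 | 6 | 2 | 1 |
| 2210534 | 12185562 | 3 | 13 | 3 | 3 | 1 | 0 | 6 | 5 | 7 | 6 | 2 | 1 |
| 2210535 | 12185562 | 3 | 13 | 4 | 3 | 2 | 0 | 5 | 1 | 7 | 6 | 2 | 1 |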
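
Because the loader concatenates files from every day folder, the same point could in principle appear more than once if a match was captured on more than one day, which would inflate its row count. Whether this happens in the dataset has not been checked here. The sketch below shows one way to test it on toy data, assuming that `match_id`, `set_id`, `game_id`, and `point_id` together identify a single point.

```python
import pandas as pd

# Toy rows in which match 1 was captured twice (for example, in two day folders).
toy_pbp = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2, 2],
    "set_id":   [1, 1, 1, 1, 1, 1, 1],
    "game_id":  [1, 1, 1, 1, 1, 1, 1],
    "point_id": [0, 1, 0, 1, 0, 1, 2],
})

# Assumption: these four columns together identify one point.
point_key = ["match_id", "set_id", "game_id", "point_id"]

# Row counts per match before and after removing repeated points.
raw_counts = toy_pbp.groupby("match_id").size()
dedup_counts = toy_pbp.drop_duplicates(subset=point_key).groupby("match_id").size()

print(pd.DataFrame({"raw": raw_counts, "deduplicated": dedup_counts}))
```

In the toy data the ranking flips once duplicates are removed, so comparing the two counts for match 12185562 would be a useful sanity check on the 711-point figure.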


## **Conclusion**

By using the number of points played as a proxy for match duration, I avoided the issues with missing or incomplete duration data and obtained a practical method for identifying the longest match. By this measure, the longest match is match ID 12185562, with 711 points played.

This method is not perfect, as there may be matches with fewer points that are still longer in duration due to factors like long breaks or interruptions. However, given the available data, this approach provided a reasonable estimate and was much more reliable than trying to work with the problematic duration data.