# Introduction

This analysis explores data from [Tennis Abstract](https://github.com/JeffSackmann), covering the current **Top 10 ATP and WTA players** as of December 30, 2024. We'll examine player demographics, career trajectories, match statistics, and performance patterns across different surfaces.

## Featured Players

::: {layout-ncol=3}

![Novak Djokovic - The GOAT with 24 Grand Slam titles](https://upload.wikimedia.org/wikipedia/commons/thumb/4/48/Novak_Djokovic_2024_AO.jpg/320px-Novak_Djokovic_2024_AO.jpg){width=200}

![Carlos Alcaraz - The rising star from Spain](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Alcaraz_RG24_%2850%29_%2853830785892%29_%28cropped%29.jpg/320px-Alcaraz_RG24_%2850%29_%2853830785892%29_%28cropped%29.jpg){width=200}

![Jannik Sinner - World #1 from Italy](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Jannik_Sinner_%282024_US_Open%29_04_%28cropped%29.jpg/320px-Jannik_Sinner_%282024_US_Open%29_04_%28cropped%29.jpg){width=200}

:::

*Images: Wikimedia Commons (CC BY-SA 4.0)*

**ATP Top 10:** Sinner, Zverev, Alcaraz, Fritz, Medvedev, Ruud, Djokovic, Rublev, De Minaur, Dimitrov

**WTA Top 10:** Sabalenka, Swiatek, Gauff, Paolini, Zheng, Rybakina, Pegula, Navarro, Kasatkina, Krejcikova

In [None]:
#| label: setup
#| code-summary: "Import libraries and load data"

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import ipywidgets for interactivity
from IPython.display import display, HTML
import ipywidgets as widgets

# Import itables for interactive tables
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)

# Plotly default template
import plotly.io as pio
pio.templates.default = 'plotly_white'

# Load all data
DATA_PATH = '../../data/top10'

# ATP data
atp_players = pd.read_csv(f'{DATA_PATH}/atp/atp_top10_players.csv')
atp_rankings = pd.read_csv(f'{DATA_PATH}/atp/atp_top10_rankings.csv')
atp_matches = pd.read_csv(f'{DATA_PATH}/atp/atp_top10_matches.csv')

# WTA data
wta_players = pd.read_csv(f'{DATA_PATH}/wta/wta_top10_players.csv')
wta_rankings = pd.read_csv(f'{DATA_PATH}/wta/wta_top10_rankings.csv')
wta_matches = pd.read_csv(f'{DATA_PATH}/wta/wta_top10_matches.csv')

print(f"ATP: {len(atp_players)} players, {len(atp_rankings):,} ranking records, {len(atp_matches):,} matches")
print(f"WTA: {len(wta_players)} players, {len(wta_rankings):,} ranking records, {len(wta_matches):,} matches")

# Player Demographics

Let's start by examining the basic demographics of our top 10 players. The table below combines both tours and can be sorted or searched.

In [None]:
#| label: player-prep
#| code-summary: "Prepare player data"

def parse_dob(dob):
    """Parse date of birth from YYYYMMDD format"""
    if pd.isna(dob):
        return None
    dob_str = str(int(dob))
    try:
        return datetime.strptime(dob_str, '%Y%m%d')
    except ValueError:  # digits that do not form a valid calendar date
        return None

def calculate_age(dob):
    """Calculate current age from DOB"""
    if dob is None:
        return None
    today = datetime(2024, 12, 30)  # Rankings date
    return (today - dob).days / 365.25

# Process ATP players
atp_players['dob_parsed'] = atp_players['dob'].apply(parse_dob)
atp_players['age'] = atp_players['dob_parsed'].apply(calculate_age)
atp_players['full_name'] = atp_players['name_first'] + ' ' + atp_players['name_last']
atp_players['tour'] = 'ATP'

# Process WTA players
wta_players['dob_parsed'] = wta_players['dob'].apply(parse_dob)
wta_players['age'] = wta_players['dob_parsed'].apply(calculate_age)
wta_players['full_name'] = wta_players['name_first'] + ' ' + wta_players['name_last']
wta_players['tour'] = 'WTA'

# Combine for comparison
all_players = pd.concat([atp_players, wta_players], ignore_index=True)
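
The two `apply` helpers above can also be written so that pandas does the date parsing and the handling of invalid values itself. The following is a minimal sketch, assuming `dob` holds `YYYYMMDD` values stored as numbers; the function name `add_age` and the toy rows are illustrative only and are not part of the dataset.

```python
import pandas as pd

def add_age(players, as_of="2024-12-30"):
    """Return a copy of `players` with `dob_parsed` and fractional `age` columns."""
    out = players.copy()
    # Normalise numeric DOBs (e.g. 20000101.0) to 8-character strings; missing values stay NaN
    dob_num = pd.to_numeric(out["dob"], errors="coerce")
    dob_str = dob_num.map(lambda v: f"{int(v):08d}", na_action="ignore")
    # errors="coerce" turns malformed or missing dates into NaT instead of raising
    out["dob_parsed"] = pd.to_datetime(dob_str, format="%Y%m%d", errors="coerce")
    out["age"] = (pd.Timestamp(as_of) - out["dob_parsed"]).dt.days / 365.25
    return out

# Toy rows, not real player data: one valid date, one missing, one impossible (month 13)
toy = pd.DataFrame({"dob": [20000101, None, 19991340]})
toy_with_age = add_age(toy)
print(toy_with_age)
```

Rows whose date cannot be parsed end up with `NaT` and a missing age, which mirrors the `None` returned by `parse_dob`.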

In [None]:
#| label: player-table-interactive
#| code-summary: "Interactive player table with tour selector"

# Create summary tables
atp_display = atp_players[['full_name', 'ioc', 'age', 'height', 'hand']].copy()
atp_display['age'] = atp_display['age'].round(1)
atp_display.columns = ['Player', 'Country', 'Age', 'Height (cm)', 'Hand']
atp_display['Tour'] = 'ATP'

wta_display = wta_players[['full_name', 'ioc', 'age', 'height', 'hand']].copy()
wta_display['age'] = wta_display['age'].round(1)
wta_display.columns = ['Player', 'Country', 'Age', 'Height (cm)', 'Hand']
wta_display['Tour'] = 'WTA'

# Combined table for interactive display
all_display = pd.concat([atp_display, wta_display], ignore_index=True)

# Show interactive table (sortable, searchable)
print("Top 10 Players - Click column headers to sort, use search to filter:")
show(all_display.sort_values('Age'), classes="display compact", scrollY="400px")

In [None]:
#| label: fig-age-height-interactive
#| fig-cap: "Age and Height Distribution of Top 10 Players (Interactive)"
#| code-summary: "Interactive age and height comparison"

# Create interactive scatter plot with Plotly
fig = px.scatter(
    all_players,
    x='age',
    y='height',
    color='tour',
    hover_name='full_name',
    hover_data={'ioc': True, 'age': ':.1f', 'height': True, 'tour': False},
    labels={'age': 'Age (years)', 'height': 'Height (cm)', 'tour': 'Tour', 'ioc': 'Country'},
    title='Age vs Height: ATP and WTA Top 10 Players',
    color_discrete_map={'ATP': '#2196F3', 'WTA': '#E91E63'}
)

fig.update_traces(marker=dict(size=15, line=dict(width=2, color='white')))
fig.update_layout(
    hovermode='closest',
    height=500,
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='center', x=0.5)
)

fig.show()

In [None]:
#| label: fig-countries-interactive
#| fig-cap: "Country Representation in Top 10 (Interactive)"
#| code-summary: "Interactive country distribution chart"

# Country distribution with interactive Plotly
fig = make_subplots(rows=1, cols=2, subplot_titles=('ATP Top 10 by Country', 'WTA Top 10 by Country'))

atp_countries = atp_players['ioc'].value_counts().reset_index()
atp_countries.columns = ['Country', 'Count']

wta_countries = wta_players['ioc'].value_counts().reset_index()
wta_countries.columns = ['Country', 'Count']

fig.add_trace(
    go.Bar(x=atp_countries['Count'], y=atp_countries['Country'], orientation='h',
           marker_color='#2196F3', name='ATP', text=atp_countries['Count'], textposition='outside'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=wta_countries['Count'], y=wta_countries['Country'], orientation='h',
           marker_color='#E91E63', name='WTA', text=wta_countries['Count'], textposition='outside'),
    row=1, col=2
)

fig.update_layout(height=400, showlegend=False)
fig.update_yaxes(autorange='reversed')
fig.show()

# Rankings History

Let's examine how these players have risen through the rankings over their careers. Use the interactive controls below to filter by player and date range.

In [None]:
#| label: rankings-prep
#| code-summary: "Prepare rankings data"

# Parse dates and merge with player names
atp_rankings['date'] = pd.to_datetime(atp_rankings['ranking_date'].astype(str), format='%Y%m%d')
atp_rankings = atp_rankings.merge(atp_players[['player_id', 'full_name']], left_on='player', right_on='player_id')

wta_rankings['date'] = pd.to_datetime(wta_rankings['ranking_date'].astype(str), format='%Y%m%d')
wta_rankings = wta_rankings.merge(wta_players[['player_id', 'full_name']], left_on='player', right_on='player_id')

print(f"ATP rankings span: {atp_rankings['date'].min().strftime('%Y-%m-%d')} to {atp_rankings['date'].max().strftime('%Y-%m-%d')}")
print(f"WTA rankings span: {wta_rankings['date'].min().strftime('%Y-%m-%d')} to {wta_rankings['date'].max().strftime('%Y-%m-%d')}")
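
The merges above use the default inner join, so any ranking row whose `player` id has no match in the players table is dropped silently. As a hedged sketch of how to check for that, the example below uses the `how`, `validate`, and `indicator` options of `DataFrame.merge` on made-up toy frames (the ids and names are hypothetical).

```python
import pandas as pd

# Toy frames, not real data: player 999 has no row in the players table
rankings = pd.DataFrame({
    "ranking_date": [20240101, 20240108, 20240108],
    "rank": [1, 1, 2],
    "player": [101, 101, 999],
})
players = pd.DataFrame({
    "player_id": [101, 102],
    "full_name": ["Player A", "Player B"],
})

merged = rankings.merge(
    players,
    left_on="player",
    right_on="player_id",
    how="left",               # keep every ranking row, matched or not
    validate="many_to_one",   # raise if a player_id appears twice in `players`
    indicator=True,           # adds a `_merge` column describing each row's origin
)

# Ranking rows that found no player record
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["ranking_date", "rank", "player"]])
```

If `unmatched` is empty, the inner join used in the notebook loses nothing.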

In [None]:
#| label: fig-atp-rankings-interactive
#| fig-cap: "ATP Top 10 Rankings Over Time (Interactive - Click legend to toggle players)"
#| code-summary: "Interactive ATP rankings history with player selection"

# Create interactive rankings chart for ATP
fig = px.line(
    atp_rankings.sort_values('date'),
    x='date',
    y='rank',
    color='full_name',
    title='ATP Top 10 Players: Rankings History (Click legend to toggle players)',
    labels={'date': 'Year', 'rank': 'Ranking', 'full_name': 'Player'},
    hover_data={'date': '|%B %d, %Y', 'rank': True}
)

fig.update_yaxes(range=[200, 1])  # explicit reversed range; autorange='reversed' would override it
fig.add_hline(y=10, line_dash='dash', line_color='gray', opacity=0.5, annotation_text='Top 10')
fig.add_hline(y=1, line_dash='dash', line_color='gold', opacity=0.7, annotation_text='#1')

fig.update_layout(
    height=600,
    legend=dict(orientation='v', yanchor='top', y=1, xanchor='left', x=1.02),
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=5, label='5Y', step='year', stepmode='backward'),
                dict(count=10, label='10Y', step='year', stepmode='backward'),
                dict(step='all', label='All')
            ])
        ),
        rangeslider=dict(visible=True),
        type='date'
    )
)

fig.show()

In [None]:
#| label: fig-wta-rankings-interactive
#| fig-cap: "WTA Top 10 Rankings Over Time (Interactive)"
#| code-summary: "Interactive WTA rankings history"

# Create interactive rankings chart for WTA
fig = px.line(
    wta_rankings.sort_values('date'),
    x='date',
    y='rank',
    color='full_name',
    title='WTA Top 10 Players: Rankings History (Click legend to toggle players)',
    labels={'date': 'Year', 'rank': 'Ranking', 'full_name': 'Player'},
    hover_data={'date': '|%B %d, %Y', 'rank': True}
)

fig.update_yaxes(range=[200, 1])  # explicit reversed range; autorange='reversed' would override it
fig.add_hline(y=10, line_dash='dash', line_color='gray', opacity=0.5, annotation_text='Top 10')
fig.add_hline(y=1, line_dash='dash', line_color='gold', opacity=0.7, annotation_text='#1')

fig.update_layout(
    height=600,
    legend=dict(orientation='v', yanchor='top', y=1, xanchor='left', x=1.02),
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=5, label='5Y', step='year', stepmode='backward'),
                dict(step='all', label='All')
            ])
        ),
        rangeslider=dict(visible=True),
        type='date'
    )
)

fig.show()

In [None]:
#| label: weeks-at-1-interactive
#| code-summary: "Interactive weeks at #1 analysis"

# Calculate weeks at #1 for each player
def weeks_at_rank(rankings_df, rank=1):
    results = []
    for player in rankings_df['full_name'].unique():
        player_data = rankings_df[rankings_df['full_name'] == player]
        weeks = len(player_data[player_data['rank'] == rank])
        results.append({'Player': player, f'Weeks at #{rank}': weeks})
    return pd.DataFrame(results).sort_values(f'Weeks at #{rank}', ascending=False)

atp_weeks_1 = weeks_at_rank(atp_rankings, 1)
wta_weeks_1 = weeks_at_rank(wta_rankings, 1)

# Create combined visualization
fig = make_subplots(rows=1, cols=2, subplot_titles=('ATP - Weeks at #1', 'WTA - Weeks at #1'))

atp_top = atp_weeks_1[atp_weeks_1['Weeks at #1'] > 0].sort_values('Weeks at #1')
fig.add_trace(
    go.Bar(x=atp_top['Weeks at #1'], y=atp_top['Player'], orientation='h',
           marker_color='#2196F3', text=atp_top['Weeks at #1'], textposition='outside',
           hovertemplate='%{y}: %{x} weeks<extra></extra>'),
    row=1, col=1
)

wta_top = wta_weeks_1[wta_weeks_1['Weeks at #1'] > 0].sort_values('Weeks at #1')
fig.add_trace(
    go.Bar(x=wta_top['Weeks at #1'], y=wta_top['Player'], orientation='h',
           marker_color='#E91E63', text=wta_top['Weeks at #1'], textposition='outside',
           hovertemplate='%{y}: %{x} weeks<extra></extra>'),
    row=1, col=2
)

fig.update_layout(height=400, showlegend=False, title_text='Weeks Spent at World #1')
fig.show()
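
The per-player loop in `weeks_at_rank` can be expressed as a single grouped count. The sketch below is one possible equivalent, assuming the same `full_name` and `rank` columns; the name `weeks_at_rank_grouped` and the toy rows are illustrative. Note that both versions count ranking records, so the result equals weeks only if the file holds one row per player per week.

```python
import pandas as pd

def weeks_at_rank_grouped(rankings_df, rank=1):
    """Count ranking records at `rank` for every player, including those with zero."""
    col = f"Weeks at #{rank}"
    counts = (
        rankings_df["rank"].eq(rank)          # boolean flag per ranking record
        .groupby(rankings_df["full_name"])    # one group per player
        .sum()                                # True counts as 1
        .rename(col)
        .rename_axis("Player")
        .reset_index()
    )
    return counts.sort_values(col, ascending=False, ignore_index=True)

# Toy rows, not real rankings
toy_rankings = pd.DataFrame({
    "full_name": ["A", "A", "A", "B", "B", "C"],
    "rank": [1, 1, 2, 3, 1, 5],
})
toy_weeks = weeks_at_rank_grouped(toy_rankings)
print(toy_weeks)
```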

# Match Analysis

Now let's dive into match statistics and performance metrics.

In [None]:
#| label: match-prep
#| code-summary: "Prepare match data"

# Add tour column
atp_matches['tour'] = 'ATP'
wta_matches['tour'] = 'WTA'

# Parse tournament dates
atp_matches['tourney_date'] = pd.to_datetime(atp_matches['tourney_date'].astype(str), format='%Y%m%d')
wta_matches['tourney_date'] = pd.to_datetime(wta_matches['tourney_date'].astype(str), format='%Y%m%d')

print(f"ATP Matches: {len(atp_matches):,} (from {atp_matches['tourney_date'].min().year} to {atp_matches['tourney_date'].max().year})")
print(f"WTA Matches: {len(wta_matches):,} (from {wta_matches['tourney_date'].min().year} to {wta_matches['tourney_date'].max().year})")

In [None]:
#| label: win-loss-interactive
#| code-summary: "Interactive win-loss analysis"

def calculate_record(matches_df, players_df):
    """Calculate win-loss record for each player"""
    records = []
    for _, player in players_df.iterrows():
        pid = player['player_id']
        name = player['full_name']
        
        wins = len(matches_df[matches_df['winner_id'] == pid])
        losses = len(matches_df[matches_df['loser_id'] == pid])
        total = wins + losses
        win_pct = (wins / total * 100) if total > 0 else 0
        
        records.append({
            'Player': name,
            'Wins': wins,
            'Losses': losses,
            'Total': total,
            'Win %': round(win_pct, 1)
        })
    
    return pd.DataFrame(records).sort_values('Win %', ascending=False)

atp_records = calculate_record(atp_matches, atp_players)
atp_records['Tour'] = 'ATP'
wta_records = calculate_record(wta_matches, wta_players)
wta_records['Tour'] = 'WTA'

all_records = pd.concat([atp_records, wta_records], ignore_index=True)

print("Career Records - Sortable and searchable:")
show(all_records, classes="display compact", scrollY="400px")
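
For larger match tables, the row-by-row filtering in `calculate_record` can be replaced by two `value_counts` lookups. This is a minimal sketch under the same column assumptions (`winner_id`, `loser_id`, `player_id`, `full_name`); the function name and toy data are hypothetical.

```python
import pandas as pd

def calculate_record_vectorized(matches_df, players_df):
    """Win-loss record per player from two value_counts lookups."""
    wins = matches_df["winner_id"].value_counts()
    losses = matches_df["loser_id"].value_counts()
    ids = players_df["player_id"]
    out = pd.DataFrame({
        "Player": players_df["full_name"].to_numpy(),
        # Players absent from a column have no entry in the counts, hence fillna(0)
        "Wins": ids.map(wins).fillna(0).astype(int).to_numpy(),
        "Losses": ids.map(losses).fillna(0).astype(int).to_numpy(),
    })
    out["Total"] = out["Wins"] + out["Losses"]
    # Mask zero totals so the division yields NaN, then report 0 like the loop version
    pct = out["Wins"] / out["Total"].where(out["Total"] > 0) * 100
    out["Win %"] = pct.round(1).fillna(0.0)
    return out.sort_values("Win %", ascending=False, ignore_index=True)

# Toy data, not real results: player 4 has no matches at all
toy_matches = pd.DataFrame({"winner_id": [1, 1, 2, 1], "loser_id": [2, 3, 1, 2]})
toy_players = pd.DataFrame({"player_id": [1, 2, 4], "full_name": ["A", "B", "D"]})
toy_records = calculate_record_vectorized(toy_matches, toy_players)
print(toy_records)
```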

In [None]:
#| label: fig-win-pct-interactive
#| fig-cap: "Career Win Percentage Comparison (Interactive)"
#| code-summary: "Interactive win percentage chart"

# Combined bar chart with dropdown to switch between tours
fig = go.Figure()

# ATP data
atp_sorted = atp_records.sort_values('Win %', ascending=True)
fig.add_trace(go.Bar(
    x=atp_sorted['Win %'],
    y=atp_sorted['Player'],
    orientation='h',
    marker_color=['gold' if x >= 80 else '#2196F3' for x in atp_sorted['Win %']],
    text=[f"{x}%" for x in atp_sorted['Win %']],
    textposition='outside',
    name='ATP',
    hovertemplate='%{y}<br>Win %: %{x}%<br>Wins: %{customdata[0]}<br>Losses: %{customdata[1]}<extra></extra>',
    customdata=atp_sorted[['Wins', 'Losses']].values,
    visible=True
))

# WTA data
wta_sorted = wta_records.sort_values('Win %', ascending=True)
fig.add_trace(go.Bar(
    x=wta_sorted['Win %'],
    y=wta_sorted['Player'],
    orientation='h',
    marker_color=['gold' if x >= 80 else '#E91E63' for x in wta_sorted['Win %']],
    text=[f"{x}%" for x in wta_sorted['Win %']],
    textposition='outside',
    name='WTA',
    hovertemplate='%{y}<br>Win %: %{x}%<br>Wins: %{customdata[0]}<br>Losses: %{customdata[1]}<extra></extra>',
    customdata=wta_sorted[['Wins', 'Losses']].values,
    visible=False
))

# Add dropdown menu
fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=list([
                dict(label='ATP',
                     method='update',
                     args=[{'visible': [True, False]},
                           {'title': 'ATP Top 10 Career Win %'}]),
                dict(label='WTA',
                     method='update',
                     args=[{'visible': [False, True]},
                           {'title': 'WTA Top 10 Career Win %'}]),
                dict(label='Both',
                     method='update',
                     args=[{'visible': [True, True]},
                           {'title': 'ATP & WTA Top 10 Career Win %'}]),
            ]),
            direction='down',
            showactive=True,
            x=0.1,
            xanchor='left',
            y=1.15,
            yanchor='top'
        )
    ],
    title='ATP Top 10 Career Win % (Use dropdown to switch tour)',
    height=500,
    xaxis_title='Win Percentage',
    showlegend=False
)

fig.add_vline(x=70, line_dash='dash', line_color='green', opacity=0.5, annotation_text='70%')
fig.show()

# Surface Performance

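Tennis is played on several surfaces (Hard, Clay, Grass, Carpet), and players often perform differently on each. The analysis below covers Hard, Clay, and Grass, and reports a win percentage only where a player has at least 10 matches on that surface. Use the interactive heatmap below to explore surface performance.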

In [None]:
#| label: surface-analysis-interactive
#| code-summary: "Interactive surface performance analysis"

def surface_record(matches_df, players_df):
    """Calculate win percentage by surface for each player"""
    results = []
    surfaces = ['Hard', 'Clay', 'Grass']
    
    for _, player in players_df.iterrows():
        pid = player['player_id']
        name = player['full_name']
        
        row = {'Player': name}
        for surface in surfaces:
            surface_matches = matches_df[matches_df['surface'] == surface]
            wins = len(surface_matches[surface_matches['winner_id'] == pid])
            losses = len(surface_matches[surface_matches['loser_id'] == pid])
            total = wins + losses
            if total >= 10:  # Minimum matches threshold
                row[surface] = round(wins / total * 100, 1)
                row[f'{surface}_record'] = f"{wins}-{losses}"
            else:
                row[surface] = None
                row[f'{surface}_record'] = "-"
        results.append(row)
    
    return pd.DataFrame(results)

atp_surface = surface_record(atp_matches, atp_players)
wta_surface = surface_record(wta_matches, wta_players)

# Display interactive table
print("ATP Win % by Surface (min 10 matches):")
show(atp_surface[['Player', 'Hard', 'Clay', 'Grass', 'Hard_record', 'Clay_record', 'Grass_record']], 
     classes="display compact")
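
The surface breakdown can also be computed by stacking winner and loser rows into one long table and grouping once. The sketch below shows that pattern on toy data; `long_results` and `win_pct_by` are hypothetical helper names, and the result is indexed by `player_id` rather than by name.

```python
import pandas as pd

def long_results(matches_df, by):
    """Stack winner and loser rows into one table with a 0/1 `won` flag."""
    won = matches_df[["winner_id", by]].rename(columns={"winner_id": "player_id"}).assign(won=1)
    lost = matches_df[["loser_id", by]].rename(columns={"loser_id": "player_id"}).assign(won=0)
    return pd.concat([won, lost], ignore_index=True)

def win_pct_by(matches_df, by, min_matches=10):
    """Win % per player and category, keeping only cells with enough matches."""
    stats = (
        long_results(matches_df, by)
        .groupby(["player_id", by])["won"]
        .agg(wins="sum", total="count")
    )
    stats = stats[stats["total"] >= min_matches]
    pct = (stats["wins"] / stats["total"] * 100).round(1)
    return pct.unstack(by)  # players as rows, categories as columns

# Toy matches, not real results; the threshold is lowered to 2 for the small sample
toy_matches = pd.DataFrame({
    "surface": ["Hard", "Hard", "Hard", "Clay"],
    "winner_id": [1, 1, 2, 2],
    "loser_id": [2, 2, 1, 1],
})
toy_table = win_pct_by(toy_matches, "surface", min_matches=2)
print(toy_table)
```

Because the grouping column is a parameter, the same helper should work for other categorical columns such as `tourney_level`, provided the column exists in the match table.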

In [None]:
#| label: fig-surface-heatmap-interactive
#| fig-cap: "Surface Performance Heatmap (Interactive)"
#| code-summary: "Interactive surface heatmap with player selection"

# Create interactive heatmaps
fig = make_subplots(rows=1, cols=2, subplot_titles=('ATP Top 10 - Win % by Surface', 'WTA Top 10 - Win % by Surface'))

# ATP heatmap data
atp_heat = atp_surface.set_index('Player')[['Hard', 'Clay', 'Grass']].dropna(how='all')
fig.add_trace(
    go.Heatmap(
        z=atp_heat.values,
        x=atp_heat.columns,
        y=atp_heat.index,
        colorscale='Blues',
        zmin=50, zmax=90,
        text=atp_heat.values,
        texttemplate='%{text:.1f}%',
        hovertemplate='%{y}<br>%{x}: %{z:.1f}%<extra></extra>',
        showscale=False
    ),
    row=1, col=1
)

# WTA heatmap data
wta_heat = wta_surface.set_index('Player')[['Hard', 'Clay', 'Grass']].dropna(how='all')
fig.add_trace(
    go.Heatmap(
        z=wta_heat.values,
        x=wta_heat.columns,
        y=wta_heat.index,
        colorscale='RdPu',
        zmin=50, zmax=90,
        text=wta_heat.values,
        texttemplate='%{text:.1f}%',
        hovertemplate='%{y}<br>%{x}: %{z:.1f}%<extra></extra>',
        showscale=True,
        colorbar=dict(title='Win %', x=1.02)
    ),
    row=1, col=2
)

fig.update_layout(height=600, title_text='Surface Performance Comparison')
fig.show()

In [None]:
#| label: fig-surface-radar
#| fig-cap: "Player Surface Profile Comparison (Select players to compare)"
#| code-summary: "Interactive radar chart for surface comparison"

# Create radar charts for top players
def create_surface_radar(surface_df, players_to_show, title, color):
    """Overlay one Hard/Clay/Grass polygon per player; `color` is accepted but currently unused."""
    fig = go.Figure()
    
    for player in players_to_show:
        player_data = surface_df[surface_df['Player'] == player]
        if not player_data.empty:
            values = [player_data['Hard'].values[0], player_data['Clay'].values[0], 
                     player_data['Grass'].values[0], player_data['Hard'].values[0]]  # Close the polygon
            fig.add_trace(go.Scatterpolar(
                r=values,
                theta=['Hard', 'Clay', 'Grass', 'Hard'],
                name=player,
                fill='toself',
                opacity=0.6
            ))
    
    fig.update_layout(
        polar=dict(radialaxis=dict(visible=True, range=[40, 100])),
        title=title,
        showlegend=True,
        height=500
    )
    return fig

# Show top 4 ATP players radar chart
atp_top_players = ['Novak Djokovic', 'Carlos Alcaraz', 'Jannik Sinner', 'Alexander Zverev']
fig = create_surface_radar(atp_surface, atp_top_players, 'ATP Top Players - Surface Profile', '#2196F3')
fig.show()

# Serve Statistics

The serve is one of the most important shots in tennis. Let's analyze serve statistics for our top players.

In [None]:
#| label: serve-stats-interactive
#| code-summary: "Interactive serve statistics analysis"

def calculate_serve_stats(matches_df, players_df):
    """Calculate average serve statistics for each player"""
    results = []
    
    for _, player in players_df.iterrows():
        pid = player['player_id']
        name = player['full_name']
        
        # When player won
        wins = matches_df[matches_df['winner_id'] == pid].copy()
        wins['aces'] = wins['w_ace']
        wins['df'] = wins['w_df']
        wins['1st_in'] = wins['w_1stIn'] / wins['w_svpt'] * 100
        wins['1st_won'] = wins['w_1stWon'] / wins['w_1stIn'] * 100
        wins['2nd_won'] = wins['w_2ndWon'] / (wins['w_svpt'] - wins['w_1stIn']) * 100
        
        # When player lost
        losses = matches_df[matches_df['loser_id'] == pid].copy()
        losses['aces'] = losses['l_ace']
        losses['df'] = losses['l_df']
        losses['1st_in'] = losses['l_1stIn'] / losses['l_svpt'] * 100
        losses['1st_won'] = losses['l_1stWon'] / losses['l_1stIn'] * 100
        losses['2nd_won'] = losses['l_2ndWon'] / (losses['l_svpt'] - losses['l_1stIn']) * 100
        
        # Combine
        all_matches = pd.concat([wins, losses])
        
        if len(all_matches) > 0:
            results.append({
                'Player': name,
                'Avg Aces/Match': round(all_matches['aces'].mean(), 1),
                'Avg DFs/Match': round(all_matches['df'].mean(), 1),
                '1st Serve %': round(all_matches['1st_in'].mean(), 1),
                '1st Serve Win %': round(all_matches['1st_won'].mean(), 1),
                '2nd Serve Win %': round(all_matches['2nd_won'].mean(), 1)
            })
    
    return pd.DataFrame(results)

atp_serve = calculate_serve_stats(atp_matches, atp_players)
atp_serve['Tour'] = 'ATP'
wta_serve = calculate_serve_stats(wta_matches, wta_players)
wta_serve['Tour'] = 'WTA'

all_serve = pd.concat([atp_serve, wta_serve], ignore_index=True)
print("Serve Statistics - Interactive table:")
show(all_serve, classes="display compact", scrollY="400px")
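
`calculate_serve_stats` averages per-match percentages, so a short match weighs as much as a long one, and a match with no recorded serve points produces a `0/0` that pandas turns into `NaN`. An alternative is to pool the counts first and divide once. The sketch below contrasts the two on made-up numbers; `pooled_pct` and the toy column names are illustrative, not part of the notebook.

```python
import pandas as pd

def pooled_pct(numerator, denominator):
    """Percentage from summed counts, skipping rows with missing or zero denominators."""
    valid = numerator.notna() & denominator.notna() & (denominator > 0)
    total = denominator[valid].sum()
    if total == 0:
        return None
    return round(float(numerator[valid].sum() / total * 100), 1)

# Toy serve counts for three matches, not real data; the last match has no serve data
toy = pd.DataFrame({"first_in": [30, 60, 0], "serve_pts": [40, 100, 0]})

# Mean of per-match percentages: (75 + 60) / 2, the 0/0 row is NaN and is skipped
per_match_mean = (toy["first_in"] / toy["serve_pts"] * 100).mean()

# Pooled percentage: 90 first serves in out of 140 serve points
pooled = pooled_pct(toy["first_in"], toy["serve_pts"])

print(per_match_mean, pooled)
```

Which of the two is preferable depends on whether each match or each serve point should carry equal weight.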

In [None]:
#| label: fig-serve-interactive
#| fig-cap: "Serve Performance: Aces vs Double Faults (Interactive)"
#| code-summary: "Interactive serve scatter plot"

# Combine serve data with tour info
all_serve_plot = pd.concat([atp_serve, wta_serve], ignore_index=True)

fig = px.scatter(
    all_serve_plot,
    x='Avg Aces/Match',
    y='Avg DFs/Match',
    color='Tour',
    size='1st Serve Win %',
    hover_name='Player',
    hover_data={'1st Serve %': True, '2nd Serve Win %': True},
    title='Serve Performance: Aces vs Double Faults (Bubble size = 1st Serve Win %)',
    labels={'Avg Aces/Match': 'Average Aces per Match', 'Avg DFs/Match': 'Average Double Faults per Match'},
    color_discrete_map={'ATP': '#2196F3', 'WTA': '#E91E63'}
)

fig.update_traces(marker=dict(line=dict(width=1, color='white')))
fig.update_layout(height=500)

# Add player name annotations
for _, row in all_serve_plot.iterrows():
    fig.add_annotation(
        x=row['Avg Aces/Match'],
        y=row['Avg DFs/Match'],
        text=row['Player'].split()[-1],
        showarrow=False,
        yshift=15,
        font=dict(size=9)
    )

fig.show()

# Tournament Performance

Let's examine performance at different tournament levels (Grand Slams, Masters, etc.).

In [None]:
#| label: tournament-levels-interactive
#| code-summary: "Interactive tournament level analysis"

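# Tournament level mapping (ATP codes; WTA match files use additional level codes such as 'P', 'PM', and 'I' that are not mapped here)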
level_names = {
    'G': 'Grand Slam',
    'M': 'Masters 1000',
    'A': 'ATP 500/250',
    'F': 'Tour Finals',
    'D': 'Davis Cup'
}

def tournament_record(matches_df, players_df):
    """Calculate win percentage by tournament level"""
    results = []
    
    for _, player in players_df.iterrows():
        pid = player['player_id']
        name = player['full_name']
        
        row = {'Player': name}
        for level, level_name in level_names.items():
            level_matches = matches_df[matches_df['tourney_level'] == level]
            wins = len(level_matches[level_matches['winner_id'] == pid])
            losses = len(level_matches[level_matches['loser_id'] == pid])
            total = wins + losses
            if total >= 5:
                row[level_name] = f"{wins}-{losses}"
                row[f'{level_name}_pct'] = round(wins / total * 100, 1)
            else:
                row[level_name] = '-'
                row[f'{level_name}_pct'] = None
        results.append(row)
    
    return pd.DataFrame(results)

atp_tourney = tournament_record(atp_matches, atp_players)
wta_tourney = tournament_record(wta_matches, wta_players)

print("ATP Records by Tournament Level (Interactive):")
show(atp_tourney[['Player', 'Grand Slam', 'Grand Slam_pct', 'Masters 1000', 'Masters 1000_pct', 'Tour Finals']], 
     classes="display compact")

In [None]:
#| label: fig-grand-slams-interactive
#| fig-cap: "Grand Slam Win Percentage (Interactive with Player Details)"
#| code-summary: "Interactive Grand Slam performance chart"

fig = make_subplots(rows=1, cols=2, subplot_titles=('ATP Grand Slam Win %', 'WTA Grand Slam Win %'))

# ATP Grand Slams
atp_gs = atp_tourney[['Player', 'Grand Slam_pct', 'Grand Slam']].dropna(subset=['Grand Slam_pct']).sort_values('Grand Slam_pct', ascending=True)
fig.add_trace(
    go.Bar(
        x=atp_gs['Grand Slam_pct'],
        y=atp_gs['Player'],
        orientation='h',
        marker_color=['gold' if x >= 80 else '#2196F3' for x in atp_gs['Grand Slam_pct']],
        text=[f"{x}%" for x in atp_gs['Grand Slam_pct']],
        textposition='outside',
        hovertemplate='%{y}<br>Win %: %{x}%<br>Record: %{customdata}<extra></extra>',
        customdata=atp_gs['Grand Slam'].values
    ),
    row=1, col=1
)

# WTA Grand Slams
wta_gs = wta_tourney[['Player', 'Grand Slam_pct', 'Grand Slam']].dropna(subset=['Grand Slam_pct']).sort_values('Grand Slam_pct', ascending=True)
fig.add_trace(
    go.Bar(
        x=wta_gs['Grand Slam_pct'],
        y=wta_gs['Player'],
        orientation='h',
        marker_color=['gold' if x >= 80 else '#E91E63' for x in wta_gs['Grand Slam_pct']],
        text=[f"{x}%" for x in wta_gs['Grand Slam_pct']],
        textposition='outside',
        hovertemplate='%{y}<br>Win %: %{x}%<br>Record: %{customdata}<extra></extra>',
        customdata=wta_gs['Grand Slam'].values
    ),
    row=1, col=2
)

fig.add_vline(x=70, line_dash='dash', line_color='green', opacity=0.5, row=1, col=1)
fig.add_vline(x=70, line_dash='dash', line_color='green', opacity=0.5, row=1, col=2)

fig.update_layout(height=500, showlegend=False, title_text='Grand Slam Performance (Hover for records)')
fig.show()

# Head-to-Head Analysis

Let's look at how the top 10 players have fared against each other.

In [None]:
#| label: h2h-interactive
#| code-summary: "Interactive head-to-head analysis"

def head_to_head_matrix(matches_df, players_df):
    """Create head-to-head matrix for players"""
    player_ids = players_df['player_id'].tolist()
    player_names = players_df['full_name'].tolist()
    
    # Filter matches between top 10 players only
    h2h_matches = matches_df[
        (matches_df['winner_id'].isin(player_ids)) & 
        (matches_df['loser_id'].isin(player_ids))
    ]
    
    # Create matrix for display
    matrix_display = pd.DataFrame(index=player_names, columns=player_names)
    # Create matrix for heatmap (numeric values)
    matrix_numeric = pd.DataFrame(index=player_names, columns=player_names, dtype=float)
    
    for i, p1_id in enumerate(player_ids):
        for j, p2_id in enumerate(player_ids):
            if i == j:
                matrix_display.iloc[i, j] = '-'
                matrix_numeric.iloc[i, j] = np.nan
            else:
                wins = len(h2h_matches[(h2h_matches['winner_id'] == p1_id) & (h2h_matches['loser_id'] == p2_id)])
                losses = len(h2h_matches[(h2h_matches['winner_id'] == p2_id) & (h2h_matches['loser_id'] == p1_id)])
                total = wins + losses
                if total > 0:
                    matrix_display.iloc[i, j] = f"{wins}-{losses}"
                    matrix_numeric.iloc[i, j] = wins / total * 100
                else:
                    matrix_display.iloc[i, j] = '0-0'
                    matrix_numeric.iloc[i, j] = 50
    
    return matrix_display, matrix_numeric, h2h_matches

atp_h2h_display, atp_h2h_numeric, atp_h2h_matches = head_to_head_matrix(atp_matches, atp_players)
atp_h2h_display.index = [n.split()[-1] for n in atp_h2h_display.index]
atp_h2h_display.columns = [n.split()[-1] for n in atp_h2h_display.columns]

print(f"ATP Head-to-Head Matrix (Total: {len(atp_h2h_matches)} matches):")
show(atp_h2h_display.reset_index().rename(columns={'index': 'Player'}), classes="display compact", scrollX=True)
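
The nested loops in `head_to_head_matrix` can be replaced by a single cross-tabulation of winners against losers. The following is a sketch under the same column assumptions, using toy ids; `h2h_wins` is a hypothetical name. Unlike the function above, it leaves pairs that never met as `NaN` instead of 50, so they stay distinguishable from an even record.

```python
import pandas as pd

def h2h_wins(matches_df, player_ids):
    """Square matrix where cell (row, col) is the number of wins of row over col."""
    both = matches_df["winner_id"].isin(player_ids) & matches_df["loser_id"].isin(player_ids)
    sub = matches_df[both]
    wins = pd.crosstab(sub["winner_id"], sub["loser_id"])
    # Reindex so every player appears on both axes, even without any wins or losses
    return wins.reindex(index=player_ids, columns=player_ids, fill_value=0)

# Toy matches, not real results; player 9 is outside the selected group
toy_matches = pd.DataFrame({
    "winner_id": [1, 1, 2, 3, 9],
    "loser_id": [2, 2, 1, 1, 1],
})
wins = h2h_wins(toy_matches, [1, 2, 3])
total = wins + wins.T                              # matches played per pair
win_rate = wins / total.where(total > 0) * 100     # NaN where a pair never met

print(wins)
print(win_rate.round(1))
```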

In [None]:
#| label: fig-h2h-heatmap
#| fig-cap: "Head-to-Head Win Rate Heatmap (Interactive)"
#| code-summary: "Interactive H2H heatmap"

# Create heatmap visualization for ATP H2H
short_names = [n.split()[-1] for n in atp_h2h_numeric.index]

fig = go.Figure(data=go.Heatmap(
    z=atp_h2h_numeric.values,
    x=short_names,
    y=short_names,
    colorscale='RdYlGn',
    zmin=0, zmax=100,
    text=atp_h2h_display.values,
    texttemplate='%{text}',
    hovertemplate='%{y} vs %{x}<br>Record: %{text}<br>Win Rate: %{z:.0f}%<extra></extra>',
    colorbar=dict(title='Win %', ticksuffix='%')
))

fig.update_layout(
    title='ATP Head-to-Head Win Rate (Row vs Column)',
    height=600,
    xaxis_title='Opponent',
    yaxis_title='Player',
    yaxis=dict(autorange='reversed')
)

fig.show()

In [None]:
#| label: wta-h2h-interactive
#| code-summary: "WTA head-to-head interactive"

wta_h2h_display, wta_h2h_numeric, wta_h2h_matches = head_to_head_matrix(wta_matches, wta_players)
short_names_wta = [n.split()[-1] for n in wta_h2h_numeric.index]

fig = go.Figure(data=go.Heatmap(
    z=wta_h2h_numeric.values,
    x=short_names_wta,
    y=short_names_wta,
    colorscale='RdYlGn',
    zmin=0, zmax=100,
    text=wta_h2h_display.values,
    texttemplate='%{text}',
    hovertemplate='%{y} vs %{x}<br>Record: %{text}<br>Win Rate: %{z:.0f}%<extra></extra>',
    colorbar=dict(title='Win %', ticksuffix='%')
))

fig.update_layout(
    title=f'WTA Head-to-Head Win Rate (Total: {len(wta_h2h_matches)} matches)',
    height=600,
    xaxis_title='Opponent',
    yaxis_title='Player',
    yaxis=dict(autorange='reversed')
)

fig.show()

# Match Duration Analysis

Let's analyze match lengths and compare how long each player's wins and losses last on average.

In [None]:
#| label: fig-duration-interactive
#| fig-cap: "Average Match Duration (Interactive)"
#| code-summary: "Interactive match duration analysis"

def avg_match_duration(matches_df, players_df):
    """Calculate average match duration for each player"""
    results = []
    
    for _, player in players_df.iterrows():
        pid = player['player_id']
        name = player['full_name']
        
        # Matches as winner
        wins = matches_df[matches_df['winner_id'] == pid]['minutes'].dropna()
        # Matches as loser
        losses = matches_df[matches_df['loser_id'] == pid]['minutes'].dropna()
        
        all_mins = pd.concat([wins, losses])
        
        if len(all_mins) > 0:
            results.append({
                'Player': name,
                'Avg Duration (min)': round(all_mins.mean(), 1),
                'Avg Win Duration': round(wins.mean(), 1) if len(wins) > 0 else None,
                'Avg Loss Duration': round(losses.mean(), 1) if len(losses) > 0 else None
            })
    
    return pd.DataFrame(results)

atp_duration = avg_match_duration(atp_matches, atp_players)
atp_duration['Tour'] = 'ATP'
wta_duration = avg_match_duration(wta_matches, wta_players)
wta_duration['Tour'] = 'WTA'

# Create grouped bar chart
fig = make_subplots(rows=1, cols=2, subplot_titles=('ATP Match Duration', 'WTA Match Duration'))

atp_sorted = atp_duration.sort_values('Avg Duration (min)')
fig.add_trace(go.Bar(name='Wins', x=atp_sorted['Avg Win Duration'], y=atp_sorted['Player'],
                     orientation='h', marker_color='#4CAF50',
                     hovertemplate='%{y}<br>Avg Win: %{x} min<extra></extra>'), row=1, col=1)
fig.add_trace(go.Bar(name='Losses', x=atp_sorted['Avg Loss Duration'], y=atp_sorted['Player'],
                     orientation='h', marker_color='#f44336',
                     hovertemplate='%{y}<br>Avg Loss: %{x} min<extra></extra>'), row=1, col=1)

wta_sorted = wta_duration.sort_values('Avg Duration (min)')
fig.add_trace(go.Bar(name='Wins', x=wta_sorted['Avg Win Duration'], y=wta_sorted['Player'],
                     orientation='h', marker_color='#4CAF50', showlegend=False,
                     hovertemplate='%{y}<br>Avg Win: %{x} min<extra></extra>'), row=1, col=2)
fig.add_trace(go.Bar(name='Losses', x=wta_sorted['Avg Loss Duration'], y=wta_sorted['Player'],
                     orientation='h', marker_color='#f44336', showlegend=False,
                     hovertemplate='%{y}<br>Avg Loss: %{x} min<extra></extra>'), row=1, col=2)

fig.update_layout(height=500, barmode='group', title_text='Average Match Duration (Wins vs Losses)')
fig.update_xaxes(title_text='Minutes', row=1, col=1)
fig.update_xaxes(title_text='Minutes', row=1, col=2)
fig.show()
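
The chart above compares average durations of wins and losses. A related question is how often a player wins when a match runs long. The sketch below is one possible way to compute that, using toy data and a hypothetical helper `win_rate_by_length`. Splitting at the player's own median duration is an arbitrary choice made for illustration.

```python
import pandas as pd

# Toy data with hypothetical IDs; 'minutes' mirrors the column used above.
matches = pd.DataFrame({
    "winner_id": [1, 1, 2, 1, 2, 2],
    "loser_id":  [2, 2, 1, 2, 1, 1],
    "minutes":   [70, 95, 150, 160, 80, None],
})

def win_rate_by_length(matches_df, pid, threshold=None):
    """Win rate (%) for one player in short vs long matches.

    If no threshold is given, the median duration of the player's own
    matches is used (an assumption of this sketch, not a standard cutoff).
    """
    # Keep only this player's matches that have a recorded duration
    played = matches_df[
        (matches_df["winner_id"] == pid) | (matches_df["loser_id"] == pid)
    ].dropna(subset=["minutes"]).copy()

    if threshold is None:
        threshold = played["minutes"].median()

    played["won"] = played["winner_id"] == pid
    # Matches strictly longer than the threshold count as "long"
    played["length"] = played["minutes"].gt(threshold).map({True: "long", False: "short"})

    return (played.groupby("length")["won"].mean() * 100).round(1)

rates = win_rate_by_length(matches, pid=1)
print(rates)
```

With real data, a fixed threshold (for example 120 minutes) would make the buckets comparable across players, at the cost of uneven bucket sizes.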

# Year-over-Year Performance

Finally, let's look at how these players have performed over the years. Use the slider to filter by year range.

In [None]:
#| label: fig-yearly-wins-interactive
#| fig-cap: "Yearly Match Wins with Interactive Year Range Selection"
#| code-summary: "Interactive yearly performance with range slider"

def yearly_wins(matches_df, players_df, start_year=2018):
    """Calculate wins per year for each player"""
    matches_df = matches_df[matches_df['tourney_date'].dt.year >= start_year].copy()
    matches_df['year'] = matches_df['tourney_date'].dt.year
    
    results = []
    for _, player in players_df.iterrows():
        pid = player['player_id']
        name = player['full_name']
        
        for year in range(start_year, 2025):
            year_matches = matches_df[matches_df['year'] == year]
            wins = len(year_matches[year_matches['winner_id'] == pid])
            losses = len(year_matches[year_matches['loser_id'] == pid])
            total = wins + losses
            # No matches that year: leave Win % undefined instead of reporting 0
            win_pct = round(wins / total * 100, 1) if total > 0 else None
            results.append({
                'Player': name,
                'Year': year,
                'Wins': wins,
                'Losses': losses,
                'Win %': win_pct
            })
    
    return pd.DataFrame(results)

atp_yearly = yearly_wins(atp_matches, atp_players, 2018)

# Create interactive line chart with a range slider
fig = px.line(
    atp_yearly,
    x='Year',
    y='Wins',
    color='Player',
    title='ATP Yearly Match Wins (2018-2024) - Click legend to toggle players',
    markers=True,
    hover_data={'Win %': True, 'Losses': True}
)

fig.update_layout(
    height=500,
    legend=dict(orientation='h', yanchor='bottom', y=-0.3, xanchor='center', x=0.5),
    xaxis=dict(
        rangeslider=dict(visible=True),
        tickmode='linear',
        tick0=2018,
        dtick=1
    )
)

fig.show()
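
The per-player, per-year loop in `yearly_wins` is easy to follow, but the same tally can also be expressed with `groupby`. The sketch below is an alternative under stated assumptions: it uses toy data, a hypothetical function name `yearly_record`, and it only returns years in which a player actually has matches, so no year is reported with an artificial 0% win rate.

```python
import pandas as pd

# Toy data with hypothetical IDs; column names follow the match tables above.
matches = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["2023-01-16", "2023-05-29", "2024-01-15", "2024-07-01"]
    ),
    "winner_id": [1, 2, 1, 1],
    "loser_id":  [2, 1, 2, 2],
})

def yearly_record(matches_df, player_ids):
    """Wins, losses and win % per player and year, computed with groupby."""
    df = matches_df.assign(year=matches_df["tourney_date"].dt.year)

    # Count matches per (player, year) separately on the winner and loser side
    wins = (df.groupby(["winner_id", "year"]).size()
              .rename("Wins").rename_axis(["player_id", "Year"]))
    losses = (df.groupby(["loser_id", "year"]).size()
                .rename("Losses").rename_axis(["player_id", "Year"]))

    # Outer-join the two counts; a missing side means zero wins or zero losses
    out = pd.concat([wins, losses], axis=1).fillna(0).astype(int).sort_index()
    out = out[out.index.get_level_values("player_id").isin(player_ids)]

    out["Win %"] = (out["Wins"] / (out["Wins"] + out["Losses"]) * 100).round(1)
    return out.reset_index()

rec = yearly_record(matches, player_ids=[1, 2])
print(rec)
```

The result carries `player_id` rather than names; merging it with the players table on `player_id` would restore the `Player` column used in the charts.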

In [None]:
#| label: fig-wta-yearly-interactive
#| fig-cap: "WTA Yearly Match Wins"
#| code-summary: "WTA yearly performance chart"

wta_yearly = yearly_wins(wta_matches, wta_players, 2018)

fig = px.line(
    wta_yearly,
    x='Year',
    y='Wins',
    color='Player',
    title='WTA Yearly Match Wins (2018-2024) - Click legend to toggle players',
    markers=True,
    hover_data={'Win %': True, 'Losses': True}
)

fig.update_layout(
    height=500,
    legend=dict(orientation='h', yanchor='bottom', y=-0.3, xanchor='center', x=0.5),
    xaxis=dict(
        rangeslider=dict(visible=True),
        tickmode='linear',
        tick0=2018,
        dtick=1
    )
)

fig.show()

In [None]:
#| label: fig-animated-rankings
#| fig-cap: "Animated Rankings Race (2020-2024)"
#| code-summary: "Animated bar chart race showing ranking changes"

# Create animated bar chart race for ATP
atp_rankings['year'] = atp_rankings['date'].dt.year
atp_yearly_rank = atp_rankings[atp_rankings['year'] >= 2020].groupby(['full_name', 'year']).agg(
    best_rank=('rank', 'min'),
    avg_rank=('rank', 'mean')
).reset_index()

fig = px.bar(
    atp_yearly_rank.sort_values(['year', 'best_rank']),
    x='best_rank',
    y='full_name',
    color='full_name',
    animation_frame='year',
    orientation='h',
    range_x=[0, 50],
    title='ATP Best Ranking by Year (2020-2024) - Use Play button to animate',
    labels={'best_rank': 'Best Ranking', 'full_name': 'Player', 'year': 'Year'}
)

fig.update_layout(
    height=500,
    showlegend=False,
    yaxis={'categoryorder': 'total ascending'},
    xaxis={'range': [50, 0]}  # fixed, reversed range; 'autorange' would discard the range_x set above
)

fig.show()
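
Plotly Express animations generally behave best when every category appears in every frame. If a player has no ranking rows in one of the years, their bar can be missing or misplaced in that frame. One way to guard against this, sketched here on toy data with hypothetical values, is to expand the yearly summary to a complete player-by-year grid before plotting.

```python
import pandas as pd

# Toy yearly summary (hypothetical values): player "B" has no 2020 row.
yearly_rank = pd.DataFrame({
    "full_name": ["A", "A", "B"],
    "year": [2020, 2021, 2021],
    "best_rank": [5, 3, 40],
})

# Every combination of player and year that should exist in the animation
full_index = pd.MultiIndex.from_product(
    [yearly_rank["full_name"].unique(), sorted(yearly_rank["year"].unique())],
    names=["full_name", "year"],
)

# Align the summary to the full grid; missing player-years become NaN rows
complete = (
    yearly_rank.set_index(["full_name", "year"])
    .reindex(full_index)
    .reset_index()
)

print(complete)
```

A NaN `best_rank` simply draws no bar for that player in that frame, while the player keeps a slot on the axis.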

# Summary

## Key Findings

### ATP Tour
- **Novak Djokovic** remains the GOAT with the most weeks at #1 (377 weeks!) and highest Grand Slam win percentage among active players
- **Jannik Sinner** and **Carlos Alcaraz** represent the new generation, with rapidly rising rankings and multiple Grand Slam titles each
- **Alexander Zverev** leads in aces per match among the top 10
- Russia has strong representation with Medvedev and Rublev both in the top 10

### WTA Tour
- **Iga Swiatek** dominated with the most weeks at #1 (120 weeks) and an exceptional 87.8% win rate on clay
- **Aryna Sabalenka** has been the most consistent performer in 2024
- **Coco Gauff** is the youngest player in the top 10 at 20.8 years, showing massive potential
- The USA leads with 3 players in the top 10 (Gauff, Pegula, Navarro)

### Cross-Tour Observations
- ATP matches run longer on average, at least in part because men play best-of-five sets at Grand Slams
- Both tours show increasing competitiveness at the top level
- Surface specialists still exist but all-court players dominate the rankings
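
The duration observation can be checked by splitting match length by format. This is a minimal sketch on toy data. It assumes the match tables carry a `best_of` column (3 or 5), which is not shown in this section, so treat that column name as an assumption.

```python
import pandas as pd

# Toy data (hypothetical values); 'best_of' is assumed to exist in the real tables.
matches = pd.DataFrame({
    "best_of": [3, 3, 5, 5, 3],
    "minutes": [90, 110, 180, 200, None],
})

# Average duration and sample size per format, ignoring matches with no recorded time
by_format = (
    matches.dropna(subset=["minutes"])
    .groupby("best_of")["minutes"]
    .agg(["mean", "count"])
)

print(by_format)
```

Running the same summary on both tours and comparing only the best-of-three rows would indicate how much of the ATP/WTA gap remains once format is held constant.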

---

*Data source: [Tennis Abstract](https://github.com/JeffSackmann) by Jeff Sackmann, licensed under CC BY-NC-SA 4.0*

*Player images: Wikimedia Commons (CC BY-SA 4.0)*