# Introduction

[Tennis Abstract](https://www.tennisabstract.com/) is one of the best resources for tennis statistics on the web. Created by Jeff Sackmann, it provides detailed match histories, serve statistics, and performance data for professional tennis players.

**The challenge:** Tennis Abstract uses JavaScript to render its data tables. If you try scraping with `requests` + `BeautifulSoup`, you'll get an empty page because the content loads dynamically after the initial HTML.

**The solution:** [Playwright](https://playwright.dev/python/) - a modern browser automation library that can:
- Launch a real browser (headless or visible)
- Wait for JavaScript to execute
- Extract data from the fully-rendered page

In this tutorial, we'll build a scraper to fetch player match histories and serve statistics, then visualize the data.

::: {.callout-note}
## Sync vs Async API
Playwright offers both synchronous and asynchronous APIs. In Jupyter notebooks, we must use the **async API** because notebooks run inside an asyncio event loop. For standalone scripts, the sync API (`sync_playwright`) is simpler.
:::

# Setup

First, install the required packages:

```bash
pip install playwright pandas plotly
playwright install chromium
```

The second command downloads a Chromium browser for Playwright to control.

In [1]:
from playwright.async_api import async_playwright
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime
import asyncio
import re

# Understanding the Target Page

Each player on Tennis Abstract has a page like:
- ATP: `https://www.tennisabstract.com/cgi-bin/player.cgi?p=JannikSinner`
- WTA: `https://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=IgaSwiatek`

The recent matches table has the ID `#recent-results` and contains these columns:

| Index | Column | Description |
|-------|--------|-------------|
| 0 | Date | Match date (e.g., "19-Jan-2026") |
| 1 | Tournament | Tournament name |
| 2 | Surface | Hard, Clay, or Grass |
| 3 | Rd | Round (F, SF, QF, R16, etc.) |
| 4 | Rk | Player's ranking at match time |
| 5 | vRk | Opponent's ranking |
| 6 | Result | Who won (e.g., "Sinner d. Djokovic") |
| 7 | Score | Match score |
| 8 | DR | Delta ranking |
| 9 | A% | Ace percentage |
| 10 | DF% | Double fault percentage |
| 11 | 1stIn | First serve in percentage |
| 12 | 1st% | First serve points won |
| 13 | 2nd% | Second serve points won |
| 14 | BPSvd | Break points saved |
| 15 | Time | Match duration |

# Building the Scraper

Let's build a scraper that captures both match results and serve statistics:

In [2]:
def parse_result(result_text: str, player_name: str) -> dict:
    """Parse the result cell to extract win/loss and opponent info."""
    if " d. " not in result_text:
        return {"won": None, "opponent": None}
    
    winner_part, loser_part = result_text.split(" d. ")
    
    def extract_name(text):
        text = re.sub(r'\(\d+\)', '', text)  # Remove seed
        text = re.sub(r'\[.*?\]', '', text)  # Remove country
        text = re.sub(r'\(WC\)|\(Q\)|\(LL\)', '', text)  # Remove entry type
        return text.strip()
    
    winner_name = extract_name(winner_part)
    loser_name = extract_name(loser_part)
    player_won = player_name.lower() in winner_name.lower()
    
    return {
        "won": player_won,
        "opponent": loser_name if player_won else winner_name
    }


def parse_percentage(text: str) -> float:
    """Parse percentage string like '65%' or '65' to float."""
    try:
        return float(text.replace('%', '').strip())
    except (ValueError, AttributeError):
        return None


def parse_date(date_str: str) -> datetime:
    """Parse date string like '19-Jan-2026' to datetime."""
    try:
        return datetime.strptime(date_str.strip(), "%d-%b-%Y")
    except ValueError:
        return None


async def scrape_player_with_stats(player_url_name: str, player_last_name: str, tour: str = "atp") -> pd.DataFrame:
    """
    Scrape recent matches with serve statistics for a player.
    
    Args:
        player_url_name: Player name in URL format (e.g., "JannikSinner")
        player_last_name: Player's last name for result parsing
        tour: "atp" or "wta"
    
    Returns:
        DataFrame with match data and serve stats
    """
    prefix = "w" if tour == "wta" else ""
    url = f"https://www.tennisabstract.com/cgi-bin/{prefix}player.cgi?p={player_url_name}"
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_timeout(3000)
        
        table = await page.query_selector("#recent-results")
        if not table:
            print(f"Could not find table for {player_last_name}")
            await browser.close()
            return pd.DataFrame()
        
        rows = await table.query_selector_all("tr")
        rows = rows[1:]  # Skip header
        matches = []
        
        for row in rows:
            cells = await row.query_selector_all("td")
            cell_texts = [await cell.inner_text() for cell in cells]
            cell_texts = [text.strip() for text in cell_texts]
            
            if len(cell_texts) >= 12:  # Need serve stats columns
                result_info = parse_result(cell_texts[6], player_last_name)
                
                matches.append({
                    "player": player_last_name,
                    "date": parse_date(cell_texts[0]),
                    "tournament": cell_texts[1],
                    "surface": cell_texts[2],
                    "round": cell_texts[3],
                    "won": result_info["won"],
                    "opponent": result_info["opponent"],
                    "score": cell_texts[7],
                    "ace_pct": parse_percentage(cell_texts[9]) if len(cell_texts) > 9 else None,
                    "df_pct": parse_percentage(cell_texts[10]) if len(cell_texts) > 10 else None,
                    "first_serve_pct": parse_percentage(cell_texts[11]) if len(cell_texts) > 11 else None,
                    "first_serve_won_pct": parse_percentage(cell_texts[12]) if len(cell_texts) > 12 else None,
                    "second_serve_won_pct": parse_percentage(cell_texts[13]) if len(cell_texts) > 13 else None,
                })
        
        await browser.close()
    
    return pd.DataFrame(matches)

# Example: Scraping Jannik Sinner's Matches

Let's test our scraper by fetching Sinner's recent match history with serve stats:

In [3]:
sinner_matches = await scrape_player_with_stats("JannikSinner", "Sinner", tour="atp")

print(f"Found {len(sinner_matches)} matches for Sinner")
sinner_matches[["date", "tournament", "opponent", "won", "first_serve_pct", "ace_pct"]].head(10)

Found 21 matches for Sinner


Unnamed: 0,date,tournament,opponent,won,first_serve_pct,ace_pct
0,2026-01-19,Australian Open,Novak Djokovic,False,75.2,19.5
1,2026-01-19,Australian Open,Ben Shelton,True,59.3,5.5
2,2026-01-19,Australian Open,Luciano Darderi,True,72.1,22.1
3,2026-01-19,Australian Open,Eliot Spizzirri,True,66.7,12.9
4,2026-01-19,Australian Open,James Duckworth,True,64.9,23.4
5,2026-01-19,Australian Open,Hugo Gaston,True,64.4,13.3
6,2025-11-09,Tour Finals,Carlos Alcaraz,True,55.1,10.3
7,2025-11-09,Tour Finals,Alex De Minaur,True,74.6,11.9
8,2025-11-09,Tour Finals,Alexander Zverev,True,70.6,17.6
9,2025-11-09,Tour Finals,Ben Shelton,True,74.6,17.5


# Comparing Two Players

Now let's scrape data for both Sinner and Alcaraz to compare their serve statistics. We'll add a small delay between requests to be respectful to the server.

In [4]:
# Scrape Alcaraz (with rate limiting)
await asyncio.sleep(2)  # Be nice to the server
alcaraz_matches = await scrape_player_with_stats("CarlosAlcaraz", "Alcaraz", tour="atp")

print(f"Found {len(alcaraz_matches)} matches for Alcaraz")

# Combine the data
all_matches = pd.concat([sinner_matches, alcaraz_matches], ignore_index=True)
all_matches = all_matches.dropna(subset=["date", "first_serve_pct"])
all_matches = all_matches.sort_values("date")

print(f"\nTotal matches with serve data: {len(all_matches)}")
all_matches.groupby("player").size()

Found 20 matches for Alcaraz

Total matches with serve data: 41


player
Alcaraz    20
Sinner     21
dtype: int64

# Visualizing First Serve Percentage

Let's create an interactive plot comparing Sinner and Alcaraz's first serve percentage over time:

In [5]:
# Create the comparison plot
fig = px.scatter(
    all_matches,
    x="date",
    y="first_serve_pct",
    color="player",
    hover_data=["tournament", "opponent", "won", "surface"],
    title="First Serve % Over Time: Sinner vs Alcaraz",
    labels={
        "date": "Date",
        "first_serve_pct": "First Serve %",
        "player": "Player"
    },
    color_discrete_map={"Sinner": "#e94560", "Alcaraz": "#4facfe"}
)

# Add trend lines
for player, color in [("Sinner", "#e94560"), ("Alcaraz", "#4facfe")]:
    player_data = all_matches[all_matches["player"] == player].copy()
    if len(player_data) > 1:
        # Rolling average for trend
        player_data = player_data.sort_values("date")
        player_data["rolling_avg"] = player_data["first_serve_pct"].rolling(window=5, min_periods=1).mean()
        
        fig.add_trace(go.Scatter(
            x=player_data["date"],
            y=player_data["rolling_avg"],
            mode="lines",
            name=f"{player} (5-match avg)",
            line=dict(color=color, width=2, dash="dash"),
            hoverinfo="skip"
        ))

fig.update_layout(
    height=500,
    hovermode="closest",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5)
)

fig.show()

# Summary Statistics

Let's compare their average serve statistics:

In [6]:
# Calculate summary statistics
summary = all_matches.groupby("player").agg({
    "first_serve_pct": ["mean", "std", "min", "max"],
    "ace_pct": "mean",
    "df_pct": "mean",
    "won": "mean"
}).round(1)

summary.columns = ["1st Serve % (avg)", "1st Serve % (std)", "1st Serve % (min)", "1st Serve % (max)",
                   "Ace % (avg)", "DF % (avg)", "Win Rate"]
summary["Win Rate"] = (summary["Win Rate"] * 100).round(1)

print("Serve Statistics Comparison")
print("=" * 50)
summary

Serve Statistics Comparison


Unnamed: 0_level_0,1st Serve % (avg),1st Serve % (std),1st Serve % (min),1st Serve % (max),Ace % (avg),DF % (avg),Win Rate
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alcaraz,66.4,4.6,57.8,76.8,6.5,2.2,80.0
Sinner,67.8,5.5,55.1,75.2,13.0,1.5,100.0


# First Serve % by Surface

How does their first serve vary across different surfaces?

In [7]:
# Group by player and surface
surface_stats = all_matches.groupby(["player", "surface"]).agg({
    "first_serve_pct": "mean",
    "date": "count"
}).reset_index()
surface_stats.columns = ["player", "surface", "first_serve_pct", "matches"]

# Only show surfaces with enough matches
surface_stats = surface_stats[surface_stats["matches"] >= 3]

fig = px.bar(
    surface_stats,
    x="surface",
    y="first_serve_pct",
    color="player",
    barmode="group",
    title="Average First Serve % by Surface",
    labels={"first_serve_pct": "First Serve %", "surface": "Surface"},
    color_discrete_map={"Sinner": "#e94560", "Alcaraz": "#4facfe"},
    text="first_serve_pct"
)

fig.update_traces(texttemplate="%{text:.1f}%", textposition="outside")
fig.update_layout(height=400, yaxis_range=[50, 75])
fig.show()

# Rate Limiting & Ethics

When scraping any website, be a good citizen:

1. **Rate limit your requests** - Add delays between page loads (e.g., 2 seconds)
2. **Respect robots.txt** - Check if scraping is allowed
3. **Don't overload servers** - Scrape during off-peak hours for large jobs
4. **Credit the source** - Tennis Abstract data is maintained by Jeff Sackmann

```python
import asyncio

players = ["JannikSinner", "CarlosAlcaraz", "NovakDjokovic"]
for player in players:
    data = await scrape_player_with_stats(player, player.split()[-1])
    await asyncio.sleep(2)  # Wait 2 seconds between requests
```

# Using the Sync API (for Scripts)

If you're writing a standalone Python script (not a notebook), you can use the simpler synchronous API:

```python
from playwright.sync_api import sync_playwright

def scrape_player_with_stats(player_url_name: str, player_last_name: str, tour: str = "atp"):
    prefix = "w" if tour == "wta" else ""
    url = f"https://www.tennisabstract.com/cgi-bin/{prefix}player.cgi?p={player_url_name}"
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_timeout(3000)
        
        # ... rest of scraping logic (without await)
        
        browser.close()
```

The sync API is cleaner for scripts, but won't work inside Jupyter notebooks or other async contexts.

# Conclusion

In this tutorial, we built a web scraper for Tennis Abstract using Playwright and used it to compare serve statistics between Jannik Sinner and Carlos Alcaraz. Key takeaways:

- **Playwright** handles JavaScript-rendered content that traditional scraping tools can't
- Use the **async API** in notebooks, **sync API** in scripts
- **Wait for content** to load before extracting data
- **Parse serve statistics** for deeper analysis beyond win/loss records
- **Visualize trends** with interactive Plotly charts
- **Be respectful** - rate limit and credit your sources

For a more comprehensive scraper that handles multiple players and outputs CSV compatible with Jeff Sackmann's data format, check out the full implementation at `data/scrape_tennisabstract.py` in this repository.

---

*Data source: [Tennis Abstract](https://www.tennisabstract.com/) by Jeff Sackmann. Match data is made available under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.*