# Introduction

Main Points:
- webscrape https://www.atptour.com/ for pro-level match data
- Output two .csv files 
    - Whole tournament data for summary statistics dashboard
    - Point by Point (PBP) data for Match Viewer website
        - Attach timestamps using website tagger

# Install Libraries

For this project the only libraries we'll need are `requests_html`, `requests`, `bs4` (BeautifulSoup), `pandas`, `numpy` and `re` (Regular Expression). These will allow us to access the page HTML, search through the HTML, and put the HTML into something readable like a dataframe.

In [132]:
from requests_html import AsyncHTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests
import re

# Get Page HTML for Tournament

For this match example, we're taking a look at a challenger tournament, Little Rock 2024. Particularly, the play Rudy Quan as he is one of the players joining the UCLA team in the upcoming school year.

After getting the page, we'll check to make sure that the request was okay.

In [284]:
tournament_page = requests.get('https://www.atptour.com/en/scores/archive/little-rock/9188/2024/results')
tournament_page.status_code, tournament_page.ok

(200, True)

The we can use BeautfulSoup to help us manange and search through the HTML a lot easier.

In [195]:
tournament_soup = BeautifulSoup(tournament_page.text, 'html')

# Tournament Summary Statistics Dashboard

### Get First Match Stats Link from Soup

In [219]:
# get stats of first match in first draw
qual_1 = tournament_soup.find_all('div', class_='atp_accordion-item')[-1]
match_stats = qual_1.find_all('div', class_='match-cta')[0]
stats_link = 'https://atptour.com' + match_stats.find_all('a')[1].get('href')
stats_link

'https://atptour.com/en/scores/stats-centre/archive/2024/9188/qs016'

### Get HTML and Link to Match Stats

Firstly we'll get the HTML and convert it to BeautifulSoup. In order to do this correctly, we need to utilize a library called `requests-html`. This allows us to fully render the HTML as some pages on the ATP Tour website are dynamically loaded after the first response. This means we'd be getting only snippets of the HTML, or HTML that has no relevant values that we need to scrape.

In [220]:
session = AsyncHTMLSession()
response = await session.get(stats_link)
if response.ok:
    await response.html.arender()
    stats_soup = BeautifulSoup(response.html.html, 'html.parser')
    print('Success!')

Success!


### Extract Relevant Information about Match/Tournament

Here we'll scrape for things like the date, the tournament name, the match score, the match time, and the round number, etc. Below is an example of the first qualifying match.

In [225]:
# extract tournament info
date = qual_1.find('div', class_='tournament-day').text
date = re.search(r'\w{3}, \d{2} \w+, \d{4}', qual_1.find('div', class_='tournament-day').text)[0]

tournament_name = stats_soup.find('div', class_='event-name').text
scores = stats_soup.find('div', class_='match-notes').text
scores = ', '.join(re.findall(r'\d{1}-\d{1}\(\d{1}\)|\d{1}-\d{1}', scores))

# extract round info
round_info = stats_soup.find_all('div', class_='match-header')[0].find_all('span')
round_number = round_info[0].text.strip()[:-1]
match_time = round_info[1].text.strip()

date, tournament_name, scores, round_number, match_time

('Sun, 26 May, 2024',
 'UAMS Health Little Rock Open',
 '6-3, 6-1',
 'Singles 1st Round Qualifying',
 '01:20:29')

In [252]:
stats_soup.find_all('div', class_='name')

[<div class="name"><a href="/en/players/bruno-kuzuhara/k0hn/overview">B. Kuzuhara</a> <!-- --></div>,
 <div class="name"><a href="/en/players/boris-kozlov/k0af/overview">B. Kozlov</a> <!-- --></div>]

In [257]:
stats_soup.find('title').text

'Bruno Kuzuhara vs. Boris Kozlov Little Rock 2024 1st Round Qualifying | Stats Centre | ATP Tour | Tennis'

In [256]:
re.findall(r'\w+ vs. \w+', stats_soup.find('title').text)

['Kuzuhara vs. Boris']

In [260]:
stats_soup.find('div', class_='player-team')

In [222]:
# get names of both players
player_1 = stats_soup.find_all('div', class_='name')[0].text.strip()
player_2 = stats_soup.find_all('div', class_='name')[1].text.strip()

player_1, player_2

('E. Escobedo', 'N. Becker')

### Get All Match Stats 

Get all the stats from the match stats and put them in a dataframe with the previously scraped data. 

In [223]:
# get all stats and iterate assign each to the respective player
all_stats = stats_soup.find_all('div', attrs={'data-test': 'StatsTile'})

In [226]:
# initilize list of all statistics
statistics = [
    'serve_rating',
    'aces',
    'double_faults',
    'first_serve',
    '1st_serve_points_won',
    '2nd_serve_points_won',
    'break_points_saved',
    'service_games_played',
    'return_rating',
    '1st_serve_return_points_won',
    '2nd_serve_return_points_won',
    'break_points_converted',
    'return_games_played',
    'service_points_won',
    'return_points_won',
    'total_points_won'
]

# add already scraped data
row_stats = {
    'date': date,
    'tournament': tournament_name,
    'match_time': match_time,
    'round': round_number,
    'player_1': player_1,
    'player_2': player_2,
    'score': scores
}


# add each stat for each player to the dataframe
for i in range(len(statistics)):
    # convert stat to int if it is a single integer stat
    if (0 <= i and i < 3) or (7 <= i and i < 9) or (i == 12):
        p1_stat = int(all_stats[i].find('div', class_='player1').text.strip())
        p2_stat = int(all_stats[i].find('div', class_='player2').text.strip())
        
    # remove percent from fraction stats
    else: 
        p1_stat = all_stats[i].find('div', class_='player1').text.strip().split()[0]
        p2_stat = all_stats[i].find('div', class_='player2').text.strip().split()[0]
    
    # add stats for each player to dictionary (row in df)
    row_stats[f'{statistics[i]}_p1'] = p1_stat
    row_stats[f'{statistics[i]}_p2'] = p2_stat
    
pd.DataFrame([row_stats])

Unnamed: 0,date,tournament,match_time,round,player_1,player_2,score,serve_rating_p1,serve_rating_p2,aces_p1,...,break_points_converted_p1,break_points_converted_p2,return_games_played_p1,return_games_played_p2,service_points_won_p1,service_points_won_p2,return_points_won_p1,return_points_won_p2,total_points_won_p1,total_points_won_p2
0,"Sun, 26 May, 2024",UAMS Health Little Rock Open,01:20:29,Singles 1st Round Qualifying,E. Escobedo,N. Becker,"6-3, 6-1",343,189,4,...,4/10,0/0,8,8,32/35,27/58,31/58,3/35,63/93,30/93


### Generalize the Scraping to Get Whole Tournament Data

In [194]:
tournament_page = requests.get('https://www.atptour.com/en/scores/archive/little-rock/9188/2024/results')
tournament_page.status_code, tournament_page.ok

(200, True)

In [288]:
# initilize list for each row
rows = []

# get day container for all matches
all_days = tournament_soup.find_all('div', class_='atp_accordion-item')

# iterate through days backwards to sort them chronologically
for i in range(len(all_days) - 1, -1, -1):
# for i in range(-1, 0):
    # scape current day
    date = all_days[i].find('div', class_='tournament-day').text
    date = re.search(r'\w{3}, \d{2} \w+, \d{4}', qual_1.find('div', class_='tournament-day').text)[0]
    
    # get every stats link in the current day
    all_match_stats = all_days[i].find_all('div', class_='match-cta')
    
    # iterate through all match stats and scrape for data
    for div in all_match_stats:
        # create link to scrape from
        stats_link = 'https://atptour.com' + div.find_all('a')[1].get('href')
        
        # create new session to scrape HTML generated by JS
        session = AsyncHTMLSession()
        response = await session.get(stats_link)
        await response.html.arender()
        
        # convert HTML to BeautifulSoup
        stats_soup = BeautifulSoup(response.html.html, 'html.parser')
        
        # scrape match info
        player_1 = stats_soup.find_all('div', class_='name')[0].text.strip()
        player_2 = stats_soup.find_all('div', class_='name')[1].text.strip()
        tournament_name = stats_soup.find('div', class_='event-name').text
        scores = stats_soup.find('div', class_='match-notes').text
        scores = ', '.join(re.findall(r'\d{1}-\d{1}\(\d{1}\)|\d{1}-\d{1}', scores))
        match_time = round_info[1].text.strip()
        round_info = stats_soup.find_all('div', class_='match-header')[0].find_all('span')
        round_number = round_info[0].text.strip().strip('.')
        
        # add scraped info to dictionary for dataframe
        row_stats = {
            'date': date,
            'tournament': tournament_name,
            'match_time': match_time,
            'round': round_number,
            'player_1': player_1,
            'player_2': player_2,
            'score': scores
        }
        
        # add each stat for each player to the dataframe
        for i in range(len(statistics)):
            # convert stat to int if it is a single integer stat
            if (0 <= i and i < 3) or (7 <= i and i < 9) or (i == 12):
                p1_stat = int(all_stats[i].find('div', class_='player1').text.strip())
                p2_stat = int(all_stats[i].find('div', class_='player2').text.strip())

            # remove percent from fraction stats
            else: 
                p1_stat = all_stats[i].find('div', class_='player1').text.strip().split()[0]
                p2_stat = all_stats[i].find('div', class_='player2').text.strip().split()[0]

            # add stats for each player to dictionary (row in df)
            row_stats[f'{statistics[i]}_p1'] = p1_stat
            row_stats[f'{statistics[i]}_p2'] = p2_stat

        rows.append(pd.DataFrame([row_stats]))

IndexError: list index out of range

In [289]:
pd.concat(rows)

Unnamed: 0,date,tournament,match_time,round,player_1,player_2,score,serve_rating_p1,serve_rating_p2,aces_p1,...,break_points_converted_p1,break_points_converted_p2,return_games_played_p1,return_games_played_p2,service_points_won_p1,service_points_won_p2,return_points_won_p1,return_points_won_p2,total_points_won_p1,total_points_won_p2
0,"Sun, 26 May, 2024",UAMS Health Little Rock Open,02:19:00,Singles 1st Round Qualifying,E. Escobedo,N. Becker,"6-3, 6-1",256,237,1,...,5/9,3/4,13,13,50/81,46/82,36/82,31/81,86/163,77/163
0,"Sun, 26 May, 2024",UAMS Health Little Rock Open,01:20:29,Singles 1st Round Qualifying,B. Kuzuhara,B. Kozlov,"3-6, 6-2, 6-3",256,237,1,...,5/9,3/4,13,13,50/81,46/82,36/82,31/81,86/163,77/163


In [275]:
stats_soup

<!DOCTYPE html>
<html class="lang-en-us" lang="en-US"><head><title>Just a moment...</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="noindex,nofollow" name="robots"/><meta content="width=device-width,initial-scale=1" name="viewport"/><style>*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131}button,html{font-family:system-ui,-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji}@media (prefers-color-scheme:dark){body{background-color:#222;color:#d9d9d9}body a{color:#fff}body a:hover{color:#ee730a;text-decoration:underline}body .lds-ring div{border-color:#999 transparent transparent}body .font-red{color:#b20f03}body .pow-button{background-color:#4693ff;color:#1d1d1d}body #challenge-success-text{background-image:url(data:image/svg+xml;ba

In [290]:
requests.get(stats_link)

<Response [429]>

In [283]:
r

<Future finished result=<Response [429]>>