# Introduction

Main Points:
- Scrape https://www.atptour.com/ for pro-level match data
- Output two .csv files
    - Whole tournament data for summary statistics dashboard
    - Point by Point (PBP) data for Match Viewer website
        - Attach timestamps using website tagger

# Install Libraries

This project needs only four libraries: `requests_html`, `requests`, `bs4`, and `pandas`. Together they let us fetch the page HTML, search through it, and load the results into something readable like a DataFrame.

In [1]:
from requests_html import AsyncHTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import requests

# Get Page HTML for Tournament

For this example, we're looking at a Challenger tournament, Little Rock 2024, and in particular at the player Rudy Quan, who is joining the UCLA team in the upcoming school year.

After getting the page, we'll check to make sure that the request was okay.

In [24]:
tournament_page = requests.get('https://www.atptour.com/en/scores/archive/little-rock/9188/2024/draws')
tournament_page.status_code, tournament_page.ok

(200, True)

Then we can use BeautifulSoup to manage and search through the HTML much more easily.

In [25]:
# Name the parser explicitly so results don't depend on which parsers are installed
tournament_soup = BeautifulSoup(tournament_page.text, 'html.parser')

# Tournament Summary Statistics Dashboard

### Get First Match Stats Link from Soup

In [67]:
# Get the stats link for the first match in the first draw
draw_1 = tournament_soup.find_all('div', class_='draw')[0]
match_stats = draw_1.find_all('div', class_='stats-cta')[0]
# The last <a> in that block carries the relative URL of the stats page
stats_link = 'https://atptour.com' + match_stats.find_all('a')[-1].get('href')
stats_link

'https://atptour.com/en/scores/stats-centre/archive/2024/9188/ms016'
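The link above is built by string concatenation, which works as long as every `href` is a relative path starting with `/`. As a hedged alternative, the standard library's `urljoin` handles relative and already-absolute links alike. The helper name `absolute_link` is hypothetical, and the base URL simply mirrors the one used in the cell above.

```python
from urllib.parse import urljoin

# Same base the notebook prepends by hand
BASE_URL = 'https://atptour.com'

def absolute_link(href, base=BASE_URL):
    """Resolve an href against the site root; absolute hrefs pass through unchanged."""
    return urljoin(base, href)

relative = absolute_link('/en/scores/stats-centre/archive/2024/9188/ms016')
already_absolute = absolute_link('https://www.atptour.com/en/tournaments')

print(relative)          # https://atptour.com/en/scores/stats-centre/archive/2024/9188/ms016
print(already_absolute)  # https://www.atptour.com/en/tournaments
```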

### Get Player Info and Match Statistics from Link

First we'll fetch the HTML and convert it to a BeautifulSoup object. To do this correctly we need `requests-html`, which can fully render the page: some pages on the ATP Tour website load their content dynamically after the first response, so a plain request would return only fragments of the HTML, or HTML without the values we need to scrape.

In [100]:
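session = AsyncHTMLSession()
response = await session.get(stats_link)  # top-level await works inside a notebook cell
if response.ok:
    # Render in a headless browser so dynamically loaded content ends up in the HTML
    await response.html.arender()
    stats_soup = BeautifulSoup(response.html.html, 'html.parser')
    print('Success!')
else:
    print(f'Request failed with status {response.status_code}')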

Success!
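A quick way to tell whether a page needs rendering is to check if the class names we plan to scrape appear in the HTML at all. The sketch below is a rough substring heuristic, not a parser. The function `missing_markers` and both sample strings are hypothetical stand-ins for the response text before and after `arender()`.

```python
def missing_markers(html, markers):
    """Return the markers that do not appear anywhere in the HTML text."""
    return [marker for marker in markers if marker not in html]

# Made-up snippets: an unrendered shell and a rendered page
static_html = '<div id="app"></div>'
rendered_html = (
    '<div id="app"><div class="match-header">'
    '<strong>Singles Round of 32</strong></div></div>'
)

print(missing_markers(static_html, ['match-header']))    # ['match-header']
print(missing_markers(rendered_html, ['match-header']))  # []
```

An empty list suggests the content is already present; a non-empty one suggests the page still needs rendering, or that the class name has changed.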


In [105]:
stats_soup.find_all('div', class_='match-header')[0].find('strong').text

'Singles Round of 32'
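Since the goal is a .csv file, it may help to split this header into separate columns. The sketch below assumes the first word is always the event type and the rest is the round name, which holds for the value above but has not been checked against other matches. The function name `split_round_label` is hypothetical.

```python
def split_round_label(label):
    """Split a header such as 'Singles Round of 32' into (event, round)."""
    # Assumption: the event type is the first word; everything after it is the round
    event, _, round_name = label.strip().partition(' ')
    return event, round_name

event, round_name = split_round_label('Singles Round of 32')
print(event)       # Singles
print(round_name)  # Round of 32
```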