# Phase 1: Downloading the Dataset

The code below fetches records for all ATP tennis matches from the past ten years (2015-2024) in CSV format, one file per year. I have decided to use [JeffSackmann's tennis dataset](https://github.com/JeffSackmann/tennis_atp) since his records are very granular, going as far as to provide per-match stats such as aces and break points (BP) saved.

In [None]:
import os
import requests
from datetime import date

def download_historical_dataset(year_count=10):
    """Download one CSV of ATP matches per year for the last `year_count` completed years."""
    current_year = date.today().year
    for year in range(current_year - year_count, current_year):
        url = f"https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv"
        response = requests.get(url, timeout=30)  # timeout so a stalled request cannot hang the notebook
        if response.status_code == 200:
            with open(f"../data/atp_matches_{year}.csv", "wb") as file:
                file.write(response.content)
            print(f"Dataset for all {year} ATP matches downloaded successfully!")
        else:
            print(f"Failed to download dataset for {year} ATP matches. Status code: {response.status_code}")

os.makedirs("../data", exist_ok=True)  # exist_ok=True avoids error if directory exists
download_historical_dataset()
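
To make the date range concrete, here is a minimal sketch of which seasons and URLs the loop above requests. The helper names `seasons_to_fetch` and `match_file_url` are hypothetical (they are not part of the notebook), and the reference year is passed in explicitly so that the result does not depend on when the cell is run.

```python
BASE_URL = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master"

def seasons_to_fetch(current_year, year_count=10):
    """Return the `year_count` completed seasons before `current_year` (hypothetical helper)."""
    return list(range(current_year - year_count, current_year))

def match_file_url(year):
    """Build the raw GitHub URL for one season's match file (hypothetical helper)."""
    return f"{BASE_URL}/atp_matches_{year}.csv"

seasons = seasons_to_fetch(2025)
print(seasons[0], seasons[-1])  # 2015 2024
print(match_file_url(seasons[0]))
```

Because `range` excludes its upper bound, the current (incomplete) season is never requested.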

Note that, as of writing, JeffSackmann's log of the 2025 ATP matches has not been published (I anticipate it will become available at the end of the tennis calendar in December). I have therefore also written some code to download a live tennis dataset hosted on Kaggle, which is updated daily, although this dataset has not yet been incorporated into the pipeline.

In [None]:
import kagglehub
import shutil
from datetime import date  # repeated here so this cell also runs on its own

def download_live_dataset():
    filepath = f"../data/live_dataset_{date.today().isoformat()}.csv"
    try:
        path = kagglehub.dataset_download("dissfya/atp-tennis-2000-2023daily-pull", path="atp_tennis.csv")
        shutil.move(path, filepath)
        print("Dataset downloaded successfully!")
    except Exception as e:
        print(f"Error downloading dataset: {e}")

# download_live_dataset()  # not run for now: the live data is not used in the pipeline yet
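
Since the live dataset is refreshed daily and each snapshot is saved under a dated filename, one possible refinement is to skip the download when today's snapshot is already on disk. The sketch below is only an illustration under that assumption; `live_snapshot_path` and `needs_download` are hypothetical helpers, not part of the notebook or of `kagglehub`.

```python
from datetime import date
from pathlib import Path

def live_snapshot_path(data_dir, day):
    """Path of the dated snapshot file, mirroring the naming used above (hypothetical helper)."""
    return Path(data_dir) / f"live_dataset_{day.isoformat()}.csv"

def needs_download(data_dir, day):
    """True if no snapshot for `day` exists in `data_dir` yet (hypothetical helper)."""
    return not live_snapshot_path(data_dir, day).exists()

print(live_snapshot_path("../data", date(2025, 1, 31)).name)  # live_dataset_2025-01-31.csv
```

A call such as `needs_download("../data", date.today())` could then guard `download_live_dataset()` once the live data is wired into the pipeline.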