Before training our models, we need to split our data into training, validation, and test sets. A careful split prevents data leakage: the model must never be trained on matches played after the ones it is evaluated on. For example, if the model predicts the winner of a 2020 match after being trained on matches from later years, its reported accuracy will be misleadingly optimistic.

Since this is time-series data, we split it chronologically by year rather than at random. For now, we train on matches from 2014-2021, validate on matches from 2022, and test on matches from 2023-2024.

In [6]:
# Imports
import polars as pl
from tennis_match_predictor.config import PROCESSED_DATA_DIR

In [7]:
# Load dataframe
# (assumes PROCESSED_DATA_DIR is a pathlib.Path pointing at data/processed)
df = pl.read_csv(PROCESSED_DATA_DIR / "tennis_features.csv")

In [8]:
# Convert date to datetime for filtering
df = df.with_columns([
    pl.col('date').str.strptime(pl.Date, format='%Y-%m-%d')
])

In [11]:
# Time-based splits
# Train 2014-2021, validate 2022, test 2023-2024
train_df = df.filter(pl.col('date') <= pl.date(2021, 12, 31))
val_df = df.filter((pl.col('date') > pl.date(2021, 12, 31)) & (pl.col('date') <= pl.date(2022, 12, 31)))
test_df = df.filter(pl.col('date') > pl.date(2022, 12, 31))

print(f"Train: {len(train_df)} matches (2014-2021)")
print(f"Validation: {len(val_df)} matches (2022)")
print(f"Test: {len(test_df)} matches (2023-2024)")

Train: 15409 matches (2014-2021)
Validation: 2167 matches (2022)
Test: 4311 matches (2023-2024)


In [None]:
# Columns to exclude from the feature matrix:
# identifiers, match metadata, and the target label itself
metadata_cols = ['match_id', 'date', 'player_a_name', 'player_b_name',
                 'target', 'tournament', 'round', 'year']

# Get feature columns (everything except metadata)
feature_cols = [col for col in df.columns if col not in metadata_cols]

print(f"Using {len(feature_cols)} features for training")

# Feature matrix for the full dataset (not yet split into train/validation/test)
X = df.select(feature_cols)

# Target variable (winner), also for the full dataset
y = df.get_column('target')

# Fill any remaining null values with 0
X = X.fill_null(0)

Using 80 features for training
