### Feature Engineering

Now that the dataset is cleaned, we can start building the model features. The raw data supports a good baseline model, but derived statistics can predict the winner of a match more accurately. Common signals such as the players' head-to-head (H2H) record or their Elo ratings help gauge how two players match up.

### Features
- Encode categorical variables
    - Use dummies to encode categorical variables with 0/1
- Player H2H
    - All time H2H
    - Last 5 matches H2H
    - Last 5 matches H2H on specific surface
- Recent form
    - Last 5/10 match W/L
    - Last 5/10 match W/L on specific surface
    - W/L YTD
    - Current win streak
    - Injured recently (PR)
- Fatigue
    - Matches played in last 3 days
    - Total court time during tournament
- Elo
    - Overall Elo
    - Surface Elo
    - Delta Elo
- Aggregated statistics
    - Last 5 matches:
        - Avg 1st serve %
        - Avg 1st service points won %
        - etc
- Calculate delta of variables
    - Difference in ATP rank, height, betting odds, etc
- Betting odds
    - Normalize into implied win probability

** Features will be added gradually to ensure each feature is accurate. 
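
The Elo features in the list above are not implemented in this section yet. As a minimal sketch of the standard Elo update, assuming a starting rating of 1500 and a K-factor of 32 (both illustrative choices, not values taken from this project), the hypothetical helpers `elo_expected` and `elo_update` could look like this:

```python
from typing import Dict, Tuple


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of player A under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(winner: float, loser: float, k: float = 32.0) -> Tuple[float, float]:
    """Return the (winner, loser) ratings after one match."""
    expected_win = elo_expected(winner, loser)
    change = k * (1.0 - expected_win)
    return winner + change, loser - change


# Hypothetical rating table: player name -> rating (unseen players start at 1500)
ratings: Dict[str, float] = {}
rating_w = ratings.get("Player A", 1500.0)
rating_l = ratings.get("Player B", 1500.0)

# Record the pre-match difference as a feature first, then update the ratings
delta_elo = rating_w - rating_l
ratings["Player A"], ratings["Player B"] = elo_update(rating_w, rating_l)
```

Reading the ratings before updating them keeps the feature free of information from the match being predicted. A surface Elo would follow the same pattern with one rating table per surface.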

In [1]:
# Imports
import polars as pl
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional, Any
from collections import defaultdict
from tennis_match_predictor.config import PROCESSED_DATA_DIR

2025-09-11 16:54:44.744 | INFO     | tennis_match_predictor.config:<module>:11 - PROJ_ROOT path is: C:\Users\Admin\Projects\tennis-match-predictor


In [2]:
# Load dataframe
df = pl.read_csv("../data/interim/joined_dataset_clean.csv")

Categorical variables have to be encoded as numbers before the model can use them. Polars has a built-in method, `to_dummies`, that replaces each categorical column with one new 0/1 column per category.

Serve statistics are among the most important stats because the server has a large advantage. Players with a better serve (in terms of speed and accuracy) tend to win more matches. Tennis matches are often decided by a few crucial points that the server loses, which makes these statistics important.

In [3]:
def calc_first_serve_pct(match_data: dict, prefix: str) -> Optional[float]:
    """Calculate first serve percentage"""
    served = match_data.get(f'{prefix}svpt')
    first_in = match_data.get(f'{prefix}1stIn')
    # A numerator of 0 is a valid 0%; only missing values or an empty denominator give None
    if served is None or first_in is None or served <= 0:
        return None
    return first_in / served * 100

def calc_first_serve_won_pct(match_data: dict, prefix: str) -> Optional[float]:
    """Calculate first serve points won percentage"""
    first_in = match_data.get(f'{prefix}1stIn')
    first_won = match_data.get(f'{prefix}1stWon')
    if first_in is None or first_won is None or first_in <= 0:
        return None
    return first_won / first_in * 100

def calc_second_serve_won_pct(match_data: dict, prefix: str) -> Optional[float]:
    """Calculate second serve points won percentage"""
    served = match_data.get(f'{prefix}svpt')
    first_in = match_data.get(f'{prefix}1stIn')
    second_won = match_data.get(f'{prefix}2ndWon')
    if served is None or first_in is None or second_won is None:
        return None
    # Second serve points = all serve points minus first serves that landed in
    second_served = served - first_in
    return (second_won / second_served * 100) if second_served > 0 else None

def calc_bp_saved_pct(match_data: dict, prefix: str) -> Optional[float]:
    """Calculate break points saved percentage"""
    bp_faced = match_data.get(f'{prefix}bpFaced')
    bp_saved = match_data.get(f'{prefix}bpSaved')
    # Facing break points and saving none is 0%, not a missing value
    if bp_faced is None or bp_saved is None or bp_faced <= 0:
        return None
    return bp_saved / bp_faced * 100

Another important aspect of tennis is matchups. Every player has a playstyle that works well against some styles and poorly against others. One indicator of this is the players' head-to-head (H2H) record. Because professional players often face each other many times over the years, their past meetings can be a good indicator of who will win. For example, Novak Djokovic owns a 20-0 record over Gael Monfils, even though Monfils has been ranked in the top 20 for the majority of his career.

In [4]:
def get_h2h_record(player_db: dict, player1: str, player2: str, before_date: datetime,
                  last_n: Optional[int] = None, surface: Optional[str] = None) -> dict:
    """Get head-to-head record between two players"""
    
    h2h_key = tuple(sorted([player1, player2]))
    matches = player_db['h2h_records'].get(h2h_key, [])
    
    # Filter matches before the given date to avoid data leakage
    valid_matches = [m for m in matches if m['date'] < before_date]
    
    # Apply surface filter if specified
    if surface:
        valid_matches = [m for m in valid_matches if m['surface'] == surface]
    
    # Apply last N filter if specified
    if last_n:
        valid_matches = valid_matches[-last_n:]
    
    if not valid_matches:
        return {'total_matches': 0, 'player1_wins': 0, 'player2_wins': 0, 'win_rate': 0.0}
    
    # Count wins (need to account for match perspective)
    player1_wins = 0
    player2_wins = 0
    
    for match in valid_matches:
        if match['won']:
            if match['opponent'] == player2:  # player1 won
                player1_wins += 1
            else:  # player2 won
                player2_wins += 1
        else:
            if match['opponent'] == player2:  # player1 lost
                player2_wins += 1
            else:  # player2 lost
                player1_wins += 1
    
    total_matches = player1_wins + player2_wins
    player1_win_rate = (player1_wins / total_matches * 100) if total_matches > 0 else 0.0
    
    return {
        'total_matches': total_matches,
        'player1_wins': player1_wins,
        'player2_wins': player2_wins,
        'win_rate': player1_win_rate
    }

Recency also matters. Players are constantly improving, so matches from 5-10+ years ago may matter less than matches played within the last year. Some long-running records remain significant, however, like the Djokovic-Monfils H2H mentioned above. Surface also plays a big role in player success: some players are considered "specialists" on certain surfaces, which can skew H2H records.

In [5]:
def get_h2h_features(player_db: dict, player_a: str, player_b: str, 
                    before_date: datetime, match_data: dict) -> dict:
    """Get head-to-head features"""
    
    surface = match_data['surface']
    
    # All-time H2H
    h2h_all = get_h2h_record(player_db, player_a, player_b, before_date)
    
    # Last 5 H2H
    h2h_5 = get_h2h_record(player_db, player_a, player_b, before_date, last_n=5)
    
    # Surface-specific H2H (all-time)
    h2h_surface = get_h2h_record(player_db, player_a, player_b, before_date, surface=surface)
    
    # Surface-specific H2H (last 5)
    h2h_surface_5 = get_h2h_record(player_db, player_a, player_b, before_date, 
                                 last_n=5, surface=surface)
    
    return {
        'h2h_total_matches': h2h_all['total_matches'],
        'h2h_player_a_wins': h2h_all['player1_wins'],
        'h2h_player_a_win_rate': h2h_all['win_rate'],
        
        'h2h_5_total_matches': h2h_5['total_matches'],
        'h2h_5_player_a_win_rate': h2h_5['win_rate'],
        
        'h2h_surface_total_matches': h2h_surface['total_matches'],
        'h2h_surface_player_a_win_rate': h2h_surface['win_rate'],
        
        'h2h_surface_5_total_matches': h2h_surface_5['total_matches'],
        'h2h_surface_5_player_a_win_rate': h2h_surface_5['win_rate']
    }
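
The same "last N" idea extends to the recent-form features in the list at the top, which this section does not implement yet. A minimal sketch, assuming each player's history is a chronologically ordered list of dicts with `date` and `won` keys (the shape that `add_match_to_database` stores); `recent_form` is a hypothetical helper:

```python
from datetime import datetime
from typing import Dict, List


def recent_form(history: List[dict], before_date: datetime,
                last_n: int = 5) -> Dict[str, float]:
    """Win rate over the last N matches and current win streak before a date."""
    # Keep only matches strictly before the date to avoid data leakage
    past = [m for m in history if m["date"] < before_date]
    window = past[-last_n:]
    wins = sum(1 for m in window if m["won"])
    win_rate = wins / len(window) if window else 0.0

    # Walk backwards until the first loss to count the current streak
    streak = 0
    for m in reversed(past):
        if not m["won"]:
            break
        streak += 1
    return {"matches": len(window), "win_rate": win_rate, "win_streak": streak}


# Made-up history for illustration
history = [
    {"date": datetime(2024, 1, 1), "won": False},
    {"date": datetime(2024, 1, 8), "won": True},
    {"date": datetime(2024, 1, 15), "won": True},
    {"date": datetime(2024, 2, 1), "won": True},  # on the cutoff date, so excluded
]
form = recent_form(history, before_date=datetime(2024, 2, 1))
```

A surface-specific version could filter `past` on `m["surface"]` before taking the window, mirroring the `surface` argument of `get_h2h_record`.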

In [None]:
def get_static_features(player_a: str, match_data: dict) -> dict:
    """Get static features that don't change during the match"""
    
    # Determine who is winner/loser in the data
    is_a_winner = match_data['winner_name'] == player_a
    
    if is_a_winner:
        a_prefix, b_prefix = 'winner_', 'loser_'
    else:
        a_prefix, b_prefix = 'loser_', 'winner_'
    
    return {
        # Player A static
        'player_a_rank': match_data.get(f'{a_prefix}rank', 1000),
        'player_a_rank_points': match_data.get(f'{a_prefix}rank_points', 0),
        'player_a_age': match_data.get(f'{a_prefix}age', 25),
        'player_a_height': match_data.get(f'{a_prefix}ht', 180),
        'player_a_lefty': 1 if match_data.get(f'{a_prefix}hand') == 'L' else 0,
        
        # Player B static  
        'player_b_rank': match_data.get(f'{b_prefix}rank', 1000),
        'player_b_rank_points': match_data.get(f'{b_prefix}rank_points', 0),
        'player_b_age': match_data.get(f'{b_prefix}age', 25),
        'player_b_height': match_data.get(f'{b_prefix}ht', 180),
        'player_b_lefty': 1 if match_data.get(f'{b_prefix}hand') == 'L' else 0,
        
        # Tournament context - will be converted to dummies later
        'surface': match_data['surface'],
        'Series': match_data['Series'],
        'best_of_5': 1 if match_data['best_of'] == 5 else 0,
        'draw_size': match_data.get('draw_size', 32),
    }
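
One caveat with `dict.get(key, default)`: the default is used only when the key is absent. Rows from `iter_rows(named=True)` contain every column, so a null cell comes back as `None` rather than the fallback value. A small sketch of a null-safe lookup (`value_or` is a hypothetical helper, not part of the notebook):

```python
from typing import Any


def value_or(row: dict, key: str, default: Any) -> Any:
    """Return row[key], falling back to default when it is missing or None."""
    value = row.get(key)
    return default if value is None else value


row = {"winner_rank": None, "winner_ht": 188}   # made-up row with a null rank
rank_plain = row.get("winner_rank", 1000)       # None: the key exists, so no fallback
rank_safe = value_or(row, "winner_rank", 1000)  # 1000
height = value_or(row, "winner_ht", 180)        # 188
```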

In [None]:
def get_match_statistics(player_a: str, match_data: dict) -> dict:
    """Get actual match statistics for both players"""
    
    # Determine who is winner/loser in the data
    is_a_winner = match_data['winner_name'] == player_a
    
    if is_a_winner:
        a_prefix, b_prefix = 'w_', 'l_'
    else:
        a_prefix, b_prefix = 'l_', 'w_'
    
    # Raw match statistics for Player A
    player_a_stats = {
        'player_a_aces': match_data.get(f'{a_prefix}ace', 0),
        'player_a_double_faults': match_data.get(f'{a_prefix}df', 0),
        'player_a_serve_points': match_data.get(f'{a_prefix}svpt', 0),
        'player_a_first_serves_in': match_data.get(f'{a_prefix}1stIn', 0),
        'player_a_first_serves_won': match_data.get(f'{a_prefix}1stWon', 0),
        'player_a_second_serves_won': match_data.get(f'{a_prefix}2ndWon', 0),
        'player_a_serve_games': match_data.get(f'{a_prefix}SvGms', 0),
        'player_a_bp_saved': match_data.get(f'{a_prefix}bpSaved', 0),
        'player_a_bp_faced': match_data.get(f'{a_prefix}bpFaced', 0),
    }
    
    # Raw match statistics for Player B
    player_b_stats = {
        'player_b_aces': match_data.get(f'{b_prefix}ace', 0),
        'player_b_double_faults': match_data.get(f'{b_prefix}df', 0),
        'player_b_serve_points': match_data.get(f'{b_prefix}svpt', 0),
        'player_b_first_serves_in': match_data.get(f'{b_prefix}1stIn', 0),
        'player_b_first_serves_won': match_data.get(f'{b_prefix}1stWon', 0),
        'player_b_second_serves_won': match_data.get(f'{b_prefix}2ndWon', 0),
        'player_b_serve_games': match_data.get(f'{b_prefix}SvGms', 0),
        'player_b_bp_saved': match_data.get(f'{b_prefix}bpSaved', 0),
        'player_b_bp_faced': match_data.get(f'{b_prefix}bpFaced', 0),
    }
    
    # Calculated percentages for Player A
    calculated_a_stats = {
        'player_a_first_serve_pct': calc_first_serve_pct(match_data, a_prefix) or 0,
        'player_a_first_serve_won_pct': calc_first_serve_won_pct(match_data, a_prefix) or 0,
        'player_a_second_serve_won_pct': calc_second_serve_won_pct(match_data, a_prefix) or 0,
        'player_a_bp_saved_pct': calc_bp_saved_pct(match_data, a_prefix) or 0,
    }
    
    # Calculated percentages for Player B
    calculated_b_stats = {
        'player_b_first_serve_pct': calc_first_serve_pct(match_data, b_prefix) or 0,
        'player_b_first_serve_won_pct': calc_first_serve_won_pct(match_data, b_prefix) or 0,
        'player_b_second_serve_won_pct': calc_second_serve_won_pct(match_data, b_prefix) or 0,
        'player_b_bp_saved_pct': calc_bp_saved_pct(match_data, b_prefix) or 0,
    }
    
    # Additional match context
    match_context = {
        'match_minutes': match_data.get('minutes', 0),
        'winner_sets': match_data.get('Wsets', 0),
        'loser_sets': match_data.get('Lsets', 0),
    }
    
    # Combine all statistics
    all_stats = {}
    all_stats.update(player_a_stats)
    all_stats.update(player_b_stats)
    all_stats.update(calculated_a_stats)
    all_stats.update(calculated_b_stats)
    all_stats.update(match_context)
    
    return all_stats
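
The statistics above come from the match being described, so they are known only once it has finished. The "aggregated statistics" feature from the list at the top can instead be built from each player's earlier matches. A minimal sketch, assuming history entries shaped like those stored by `add_match_to_database` (with `None` for missing stats); `rolling_average` is a hypothetical helper:

```python
from datetime import datetime
from typing import List, Optional


def rolling_average(history: List[dict], stat: str, before_date: datetime,
                    last_n: int = 5) -> Optional[float]:
    """Average of `stat` over the last N matches played before `before_date`."""
    past = [m for m in history if m["date"] < before_date]
    # Skip matches where the stat was not recorded
    values = [m[stat] for m in past[-last_n:] if m.get(stat) is not None]
    return sum(values) / len(values) if values else None


# Made-up history for illustration
history = [
    {"date": datetime(2024, 3, 1), "first_serve_pct": 60.0},
    {"date": datetime(2024, 3, 5), "first_serve_pct": None},
    {"date": datetime(2024, 3, 9), "first_serve_pct": 70.0},
]
avg = rolling_average(history, "first_serve_pct", datetime(2024, 4, 1))
```

Returning `None` when no values are available leaves the choice of imputation to a later step instead of silently treating a missing history as 0%.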

In [None]:
def get_delta_features(features: dict) -> dict:
    """Calculate delta features (differences between players)"""
    
    delta_features = {}
    
    # Define features to calculate deltas for
    delta_feature_pairs = [
        ('player_a_rank', 'player_b_rank', 'rank_delta'),
        ('player_a_rank_points', 'player_b_rank_points', 'rank_points_delta'),
        ('player_a_age', 'player_b_age', 'age_delta'),
        ('player_a_height', 'player_b_height', 'height_delta'),
    ]
    
    for a_feature, b_feature, delta_name in delta_feature_pairs:
        if a_feature in features and b_feature in features:
            # For rank, lower is better, so reverse the delta
            if 'rank' in delta_name and 'points' not in delta_name:
                delta_features[delta_name] = features[b_feature] - features[a_feature]
            else:
                delta_features[delta_name] = features[a_feature] - features[b_feature]
    
    return delta_features

In [None]:
def get_match_differential_stats(match_data: dict, player_a: str) -> dict:
    """Get differential statistics between players from the actual match"""
    
    # Determine who is winner/loser in the data
    is_a_winner = match_data['winner_name'] == player_a
    
    if is_a_winner:
        a_prefix, b_prefix = 'w_', 'l_'
    else:
        a_prefix, b_prefix = 'l_', 'w_'
    
    # Calculate differentials
    differentials = {}
    
    # Raw stat differentials
    stat_pairs = [
        ('ace', 'aces_differential'),
        ('df', 'double_faults_differential'),
        ('svpt', 'serve_points_differential'),
        ('1stIn', 'first_serves_in_differential'),
        ('1stWon', 'first_serves_won_differential'),
        ('2ndWon', 'second_serves_won_differential'),
        ('SvGms', 'serve_games_differential'),
        ('bpSaved', 'bp_saved_differential'),
        ('bpFaced', 'bp_faced_differential'),
    ]
    
    for stat, diff_name in stat_pairs:
        # `or 0` also covers keys that are present but null
        a_val = match_data.get(f'{a_prefix}{stat}') or 0
        b_val = match_data.get(f'{b_prefix}{stat}') or 0
        differentials[diff_name] = a_val - b_val
    
    # Percentage differentials (calculated stats)
    a_first_serve_pct = calc_first_serve_pct(match_data, a_prefix) or 0
    b_first_serve_pct = calc_first_serve_pct(match_data, b_prefix) or 0
    differentials['first_serve_pct_differential'] = a_first_serve_pct - b_first_serve_pct
    
    a_first_won_pct = calc_first_serve_won_pct(match_data, a_prefix) or 0
    b_first_won_pct = calc_first_serve_won_pct(match_data, b_prefix) or 0
    differentials['first_serve_won_pct_differential'] = a_first_won_pct - b_first_won_pct
    
    a_second_won_pct = calc_second_serve_won_pct(match_data, a_prefix) or 0
    b_second_won_pct = calc_second_serve_won_pct(match_data, b_prefix) or 0
    differentials['second_serve_won_pct_differential'] = a_second_won_pct - b_second_won_pct
    
    a_bp_saved_pct = calc_bp_saved_pct(match_data, a_prefix) or 0
    b_bp_saved_pct = calc_bp_saved_pct(match_data, b_prefix) or 0
    differentials['bp_saved_pct_differential'] = a_bp_saved_pct - b_bp_saved_pct
    
    return differentials

In [10]:
def get_market_features(match_data: dict, player_a: str) -> dict:
    """Get betting market features"""
    
    # Get all available odds
    max_winner_odds = match_data.get('MaxW', None)
    max_loser_odds = match_data.get('MaxL', None)
    avg_winner_odds = match_data.get('AvgW', None)
    avg_loser_odds = match_data.get('AvgL', None)
    
    # Determine if player A is the winner in the original data
    is_a_winner = match_data['winner_name'] == player_a
    
    # Initialize features with defaults
    features = {
        'player_a_avg_odds': 2.0,
        'player_b_avg_odds': 2.0,
        'player_a_max_odds': 2.0,
        'player_b_max_odds': 2.0,
        'player_a_implied_prob': 0.5,
        'player_b_implied_prob': 0.5,
        'avg_odds_difference': 0.0,
        'max_odds_difference': 0.0,
        'market_confidence_avg': 0.0,
        'market_confidence_max': 0.0,
        'odds_movement': 0.0,  # Difference between max and avg 
    }
    
    # Calculate features if odds are available
    if avg_winner_odds and avg_loser_odds:
        if is_a_winner:
            features['player_a_avg_odds'] = avg_winner_odds
            features['player_b_avg_odds'] = avg_loser_odds
        else:
            features['player_a_avg_odds'] = avg_loser_odds
            features['player_b_avg_odds'] = avg_winner_odds
        
        # Calculate implied probabilities from decimal odds
        features['player_a_implied_prob'] = 1 / features['player_a_avg_odds']
        features['player_b_implied_prob'] = 1 / features['player_b_avg_odds']
        
        # Market confidence (higher = more certain)
        features['market_confidence_avg'] = abs(features['player_a_implied_prob'] - features['player_b_implied_prob'])
        features['avg_odds_difference'] = abs(features['player_a_avg_odds'] - features['player_b_avg_odds'])
    
    if max_winner_odds and max_loser_odds:
        if is_a_winner:
            features['player_a_max_odds'] = max_winner_odds
            features['player_b_max_odds'] = max_loser_odds
        else:
            features['player_a_max_odds'] = max_loser_odds
            features['player_b_max_odds'] = max_winner_odds
            
        # Max odds market confidence
        max_a_prob = 1 / features['player_a_max_odds']
        max_b_prob = 1 / features['player_b_max_odds']
        features['market_confidence_max'] = abs(max_a_prob - max_b_prob)
        features['max_odds_difference'] = abs(features['player_a_max_odds'] - features['player_b_max_odds'])
    
    # Odds spread (if both max and avg available)
    if all([avg_winner_odds, avg_loser_odds, max_winner_odds, max_loser_odds]):
        # Proxy only: the gap between the maximum and the average odds,
        # not a change in the line over time
        a_movement = features['player_a_max_odds'] - features['player_a_avg_odds']
        features['odds_movement'] = a_movement  # Larger = wider gap between max and average odds
    
    return features
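
Raw `1 / odds` values for a two-way market usually sum to more than 1, because bookmakers build in a margin (the overround). The feature list calls for normalized probabilities. One common approach, sketched here as an assumption and not as the notebook's method, divides each raw value by their sum; `normalized_implied_probs` is a hypothetical helper:

```python
from typing import Tuple


def normalized_implied_probs(odds_a: float, odds_b: float) -> Tuple[float, float]:
    """Convert two decimal odds into probabilities that sum to 1."""
    raw_a, raw_b = 1.0 / odds_a, 1.0 / odds_b
    total = raw_a + raw_b  # exceeds 1 by the bookmaker margin
    return raw_a / total, raw_b / total


# Made-up odds: 1.50 for the favourite, 2.60 for the underdog
prob_a, prob_b = normalized_implied_probs(1.50, 2.60)
print(round(prob_a, 4), round(prob_b, 4))
```

This proportional scaling is the simplest choice. Other margin-removal methods exist, so treat it as a starting point.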

In [11]:
def initialize_player_database():
    """Initialize empty player database structure"""
    return {
        'player_matches': defaultdict(list),
        'h2h_records': defaultdict(list), 
        'surface_matches': defaultdict(list)
    }

def add_match_to_database(player_db: dict, winner: str, loser: str, match_data: dict):
    """Add a completed match to the player database"""
    
    match_date = datetime.strptime(match_data['Date'], '%Y-%m-%d')
    surface = match_data['surface']
    
    # Create match results for both players
    winner_result = {
        'date': match_date,
        'opponent': loser,
        'surface': surface,
        'won': True,
        'minutes': match_data.get('minutes'),
        'aces': match_data.get('w_ace'),
        'double_faults': match_data.get('w_df'),
        'first_serve_pct': calc_first_serve_pct(match_data, 'w_'),
        'first_serve_won_pct': calc_first_serve_won_pct(match_data, 'w_'),
        'second_serve_won_pct': calc_second_serve_won_pct(match_data, 'w_'),
        'break_points_saved_pct': calc_bp_saved_pct(match_data, 'w_')
    }
    
    loser_result = {
        'date': match_date,
        'opponent': winner,
        'surface': surface,
        'won': False,
        'minutes': match_data.get('minutes'),
        'aces': match_data.get('l_ace'),
        'double_faults': match_data.get('l_df'),
        'first_serve_pct': calc_first_serve_pct(match_data, 'l_'),
        'first_serve_won_pct': calc_first_serve_won_pct(match_data, 'l_'),
        'second_serve_won_pct': calc_second_serve_won_pct(match_data, 'l_'),
        'break_points_saved_pct': calc_bp_saved_pct(match_data, 'l_')
    }
    
    # Add to player histories
    player_db['player_matches'][winner].append(winner_result)
    player_db['player_matches'][loser].append(loser_result)
    
    # Add to H2H records, always from the perspective of the alphabetically first player
    h2h_key = tuple(sorted([winner, loser]))
    if winner < loser:
        player_db['h2h_records'][h2h_key].append(winner_result)
    else:
        player_db['h2h_records'][h2h_key].append(loser_result)
        
    # Add to surface-specific records
    player_db['surface_matches'][(winner, surface)].append(winner_result)
    player_db['surface_matches'][(loser, surface)].append(loser_result)
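
The history stored in `player_matches` is also enough for the fatigue features in the list at the top, which are not implemented here. A minimal sketch, assuming the stored `date` and `minutes` fields; `fatigue_features` is a hypothetical helper, and the 3-day window comes from the feature list:

```python
from datetime import datetime, timedelta
from typing import Dict, List


def fatigue_features(history: List[dict], before_date: datetime,
                     window_days: int = 3) -> Dict[str, float]:
    """Matches and court minutes in the window just before `before_date`."""
    start = before_date - timedelta(days=window_days)
    recent = [m for m in history if start <= m["date"] < before_date]
    # Treat a missing duration as 0 minutes
    minutes = sum(m.get("minutes") or 0 for m in recent)
    return {"matches_last_n_days": len(recent), "minutes_last_n_days": minutes}


# Made-up history for illustration
history = [
    {"date": datetime(2024, 5, 1), "minutes": 95},
    {"date": datetime(2024, 5, 6), "minutes": 150},
    {"date": datetime(2024, 5, 7), "minutes": None},
]
fatigue = fatigue_features(history, before_date=datetime(2024, 5, 8))
```

Because dates in the dataset have day granularity, matches played on the same day as the target match are excluded by the strict `<` comparison.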

In [None]:
def generate_match_features(player_db: dict, player_a: str, player_b: str, 
                          match_data: dict, before_date: datetime) -> dict:
    """Generate all features for a match between player A and B"""
    
    features = {}
    
    # Static features
    features.update(get_static_features(player_a, match_data))
    
    # Head-to-head features
    features.update(get_h2h_features(player_db, player_a, player_b, before_date, match_data))
    
    # Delta features
    features.update(get_delta_features(features))
    
    # Market features
    features.update(get_market_features(match_data, player_a))

    # Match statistics (recorded during this match, so not available before it starts)
    features.update(get_match_statistics(player_a, match_data))
    
    # Match differentials (also post-match information)
    features.update(get_match_differential_stats(match_data, player_a))
    
    return features

In [13]:
# Sort by date to ensure chronological processing
df = df.sort('Date')

# Initialize player database
player_db = initialize_player_database()

processed_matches = []

# Seeded generator so the A/B assignment is reproducible across runs
# (the seed value is arbitrary; built-in hash() of a str varies between processes)
rng = np.random.default_rng(42)

# Process matches one at a time, in chronological order
for i, row in enumerate(df.iter_rows(named=True)):
    if i % 1000 == 0:
        print(f"Processed {i}/{len(df)} matches...")
    
    match_date = datetime.strptime(row['Date'], '%Y-%m-%d')
    winner_name = row['winner_name']
    loser_name = row['loser_name']
    
    # Randomly assign Player A and Player B to balance dataset
    if rng.random() < 0.5:
        player_a, player_b = winner_name, loser_name
        target = 1  # Player A wins
    else:
        player_a, player_b = loser_name, winner_name
        target = 0  # Player A loses
    
    # Generate features using only historical data
    features = generate_match_features(
        player_db, player_a, player_b, row, match_date
    )
    
    # Add metadata
    features.update({
        'match_id': i,
        'date': row['Date'],
        'player_a_name': player_a,
        'player_b_name': player_b,
        'target': target,
        'tournament': row['tourney_name'],
        'round': row['round'],
        'year': row['Year']
    })
    
    processed_matches.append(features)
    
    # Update player database with this match result
    add_match_to_database(player_db, winner_name, loser_name, row)

processed_df = pl.DataFrame(processed_matches)

print(f"Generated {len(processed_df.columns)} features for {len(processed_df)} matches")

Processed 0/21887 matches...
Processed 1000/21887 matches...
Processed 2000/21887 matches...
Processed 3000/21887 matches...
Processed 4000/21887 matches...
Processed 5000/21887 matches...
Processed 6000/21887 matches...
Processed 7000/21887 matches...
Processed 8000/21887 matches...
Processed 9000/21887 matches...
Processed 10000/21887 matches...
Processed 11000/21887 matches...
Processed 12000/21887 matches...
Processed 13000/21887 matches...
Processed 14000/21887 matches...
Processed 15000/21887 matches...
Processed 16000/21887 matches...
Processed 17000/21887 matches...
Processed 18000/21887 matches...
Processed 19000/21887 matches...
Processed 20000/21887 matches...
Processed 21000/21887 matches...
Generated 88 features for 21887 matches


In [14]:
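# Encode only the categorical columns that actually exist in processed_df
candidate_cols = ["Series", "Court", "surface", "round", "winner_entry", "winner_hand", "loser_entry", "loser_hand"]
dummy_cols = [c for c in candidate_cols if c in processed_df.columns]
df_features = processed_df.to_dummies(columns=dummy_cols)

print(f"Number of columns added: {df_features.shape[1] - processed_df.shape[1]}")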

Number of columns added: 49


In [20]:
# Save csv file for later use
output_path = PROCESSED_DATA_DIR / "tennis_features.csv"
processed_df.write_csv(output_path)