## Unveiling the Next Tennis Superstars With Machine Learning

### Data Cleaning & Feature Engineering

##### In this project, I took on the role of a data scientist for an up-and-coming sportswear startup looking to make big waves in the industry. With limited resources, the goal was to predict whether tennis players could break into the top 25 rankings in the ATP (Association of Tennis Professionals). By leveraging data analytics and machine learning techniques, I aimed to identify key performance indicators and patterns that could potentially forecast a player's future success based on their first 50 professional matches. This project not only explored the predictive power of early career statistics but also showcased the value of data-driven decision-making for startups in the sports industry.

##### The initial phase involved meticulous cleaning of the tennis match database sourced from Kaggle. This included addressing missing values, rectifying inconsistencies, and removing irrelevant columns. Feature engineering was then employed to derive new variables from existing data, enhancing the dataset's predictive potential. These steps laid the foundation for developing a robust binary classification model to forecast future top 25 ATP players accurately.

In [335]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings("ignore")

In [336]:
# Specify the directory containing the CSV files
folder_path = 'tennis_data/'

# List all files in the directory
file_list = os.listdir(folder_path)

# Initializing an empty list to store the data
matches_df = []

# Iterating over each file in the directory
for file in file_list:
    # Select all ATP tennis match files
    if file.startswith('atp_matches'):
        # Read the CSV files into a DataFrame
        df = pd.read_csv(os.path.join(folder_path, file))
        # Append the DataFrame to the list
        matches_df.append(df)

# Concatenate all DataFrames in the list into a single DataFrame
combined_df = pd.concat(matches_df, ignore_index=True)
print(combined_df)

       tourney_id                tourney_name surface  draw_size  \
0       2019-M020                    Brisbane    Hard         32   
1       2019-M020                    Brisbane    Hard         32   
2       2019-M020                    Brisbane    Hard         32   
3       2019-M020                    Brisbane    Hard         32   
4       2019-M020                    Brisbane    Hard         32   
...           ...                         ...     ...        ...   
234575   2014-605                 Tour Finals    Hard          8   
234576   2014-605                 Tour Finals    Hard          8   
234577  2014-D015  Davis Cup WG F: FRA vs SUI    Clay          4   
234578  2014-D015  Davis Cup WG F: FRA vs SUI    Clay          4   
234579  2014-D015  Davis Cup WG F: FRA vs SUI    Clay          4   

       tourney_level  tourney_date  match_num  winner_id  winner_seed  \
0                  A      20181231        300     105453          2.0   
1                  A      20181231   

In [338]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234580 entries, 0 to 234579
Data columns (total 49 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   tourney_id          234580 non-null  object 
 1   tourney_name        234580 non-null  object 
 2   surface             234527 non-null  object 
 3   draw_size           234580 non-null  int64  
 4   tourney_level       234580 non-null  object 
 5   tourney_date        234580 non-null  int64  
 6   match_num           234580 non-null  int64  
 7   winner_id           234580 non-null  int64  
 8   winner_seed         101953 non-null  float64
 9   winner_entry        31961 non-null   object 
 10  winner_name         234580 non-null  object 
 11  winner_hand         234564 non-null  object 
 12  winner_ht           209248 non-null  float64
 13  winner_ioc          234577 non-null  object 
 14  winner_age          234524 non-null  float64
 15  loser_id            234580 non-nul

#### Data cleaning

##### Since the dataset contained a lot of irrelevant information to the intended model, I prioritized removing the unnecessary columns from the dataframe. I also removed all the rows containing a 'D' in the tourney_level column as it doesn't influence a player's rank.

In [339]:
#Checking tourney_level unique values
unique_values = df['tourney_level'].unique()
print("Unique values in the column:")
print(unique_values)

Unique values in the column:
['A' 'G' 'D' 'M' 'F']


In [340]:
#Removing tennis games that don't increase rank (Davies Cup)
string_to_remove = 'D'
combined_df = combined_df[combined_df['tourney_level'] != string_to_remove]

In [341]:
combined_df.nunique()

tourney_id             5199
tourney_name            879
surface                   4
draw_size                13
tourney_level             5
tourney_date           1290
match_num               836
winner_id              3666
winner_seed              34
winner_entry             19
winner_name            3664
winner_hand               4
winner_ht                30
winner_ioc              105
winner_age              273
loser_id               6886
loser_seed               35
loser_entry              20
loser_name             6882
loser_hand                4
loser_ht                 30
loser_ioc               128
loser_age               314
score                 13835
best_of                   2
round                    13
minutes                 340
w_ace                    59
w_df                     25
w_svpt                  233
w_1stIn                 158
w_1stWon                124
w_2ndWon                 63
w_SvGms                  43
w_bpSaved                25
w_bpFaced           

In [342]:
# Changing date format
combined_df['tourney_date'] = pd.to_datetime(combined_df['tourney_date'], format='%Y%m%d')

#Dropping columns with no value to model
combined_df = combined_df.drop(columns=['winner_entry', 'loser_entry', 'draw_size', 'tourney_name', 'loser_seed', 'winner_seed', 'tourney_level', 'match_num','tourney_id', 'minutes', 'score'])


In [343]:
combined_df.head()

Unnamed: 0,surface,tourney_date,winner_id,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_name,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,Hard,2018-12-31,105453,Kei Nishikori,R,178.0,JPN,29.0,106421,Daniil Medvedev,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,Hard,2018-12-31,106421,Daniil Medvedev,R,198.0,RUS,22.8,104542,Jo-Wilfried Tsonga,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,Hard,2018-12-31,105453,Kei Nishikori,R,178.0,JPN,29.0,104871,Jeremy Chardy,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,Hard,2018-12-31,104542,Jo-Wilfried Tsonga,R,188.0,FRA,33.7,200282,Alex De Minaur,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,Hard,2018-12-31,106421,Daniil Medvedev,R,198.0,RUS,22.8,105683,Milos Raonic,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0


#### Feature Engineering the Data for modeling

##### Creating a player-centric dictionary to centralize match data for each individual player

In [344]:
# Getting unique player names from both the 'Winner' and 'Loser' columns
players = pd.unique(combined_df[['winner_name', 'loser_name']].values.ravel('K'))

# Creating a dictionary to store DataFrames for each player
player_games = {}

# Iterating over each player
for player in players:
    # Filtering the DataFrame for games where the player is either the winner or the loser
    player_games[player] = combined_df[(combined_df['winner_name'] == player) | (combined_df['loser_name'] == player)]

In [345]:
#Example usage of player_games dictionary
player_df= player_games['Jannik Sinner']
print(player_df)

       surface tourney_date  winner_id           winner_name winner_hand  \
970       Clay   2019-04-22     111513           Laslo Djere           R   
979       Clay   2019-04-22     206173         Jannik Sinner           R   
1119      Clay   2019-05-13     126774    Stefanos Tsitsipas           R   
1139      Clay   2019-05-13     206173         Jannik Sinner           R   
1193      Clay   2019-05-20     106137      Tristan Lamasine           R   
...        ...          ...        ...                   ...         ...   
177789    Clay   2018-08-27     105413         Andrej Martin           R   
179075    Hard   2018-10-08     206173         Jannik Sinner           R   
179088    Hard   2018-10-08     106005    Constant Lestienne           R   
180085    Hard   2018-11-19     105011       Illya Marchenko           R   
180570    Clay   2018-09-17     121264  Oscar Jose Gutierrez           R   

        winner_ht winner_ioc  winner_age  loser_id      loser_name  ...  \
970         

##### One-hot encoding the necessary categorical variables for modeling

In [346]:
# Iterating over each player in the player_games dictionary
for player, df in player_games.items():
    # Checking if 'winner_hand' and 'loser_hand' columns exist
    if 'winner_hand' in df.columns and 'loser_hand' in df.columns:
        # Mapping 'right' to 1 and 'left' to 0 for winner_hand and loser_hand columns
        df['winner_hand'] = df['winner_hand'].map({'R': 1, 'L': 0})
        df['loser_hand'] = df['loser_hand'].map({'R': 1, 'L': 0})

        # Updating the DataFrame for the player in the dictionary
        player_games[player] = df
    else:
        print(f"Winner or loser hand column not found for player {player}")

# Example usage:
print("Player Games for Roger Federer:")
print(player_games['Roger Federer'])

Player Games for Roger Federer:
       surface tourney_date  winner_id         winner_name  winner_hand  \
140       Hard   2019-01-14     126774  Stefanos Tsitsipas            1   
171       Hard   2019-01-14     103819       Roger Federer            1   
197       Hard   2019-01-14     103819       Roger Federer            1   
198       Hard   2019-01-14     103819       Roger Federer            1   
552       Hard   2019-02-25     103819       Roger Federer            1   
...        ...          ...        ...                 ...          ...   
234568    Hard   2014-11-09     103819       Roger Federer            1   
234569    Hard   2014-11-09     103819       Roger Federer            1   
234572    Hard   2014-11-09     103819       Roger Federer            1   
234574    Hard   2014-11-09     103819       Roger Federer            1   
234576    Hard   2014-11-09     104925      Novak Djokovic            1   

        winner_ht winner_ioc  winner_age  loser_id          loser_n

##### Feature engineering win_ratio column to take into account the value of the win relative to their opponent's ranking

In [347]:
# Iterating through the player_games dictionary
for player, df in player_games.items():
    
    winner_rank = ['winner_rank']
    loser_rank = ['loser_rank']

    if pd.isnull(winner_rank):
        winner_rank = 1000
    
    if pd.isnull(loser_rank):
        loser_rank = 1000
    
    # Calculating the rank ratio for each row
    rank_ratio = df['winner_rank'] / df['loser_rank']
    
    # Calculating the win value based on the rank ratio
    # Assuming the baseline value is 1
    win_value = rank_ratio * 1  # You can adjust the baseline value as needed
    
    # Adding the win value as a new column in the DataFrame
    df['win_ratio'] = win_value

##### Creating a binary classification column serving as the target variable for my classification modeling. It designates players as "1" if they've entered the top 25 rankings and "0" if they haven't throughout tbeir professional tennis career.

In [348]:
# Iterating over each player's DataFrame in the dictionary
for player, df in player_games.items():
    # Filter the DataFrame to include only rows where the player's name appears in 'winner_name'
    player_df = df[df['winner_name'] == player]

    # Checking if player_df is empty
    if not player_df.empty:
        # Calculating the lowest rank from 'winner_rank' column for the player's games
        lowest_ranking = min(player_df['winner_rank'])

        # Determining if the player has reached top 25 rankings
        top_25_reached = 1 if lowest_ranking <= 25 else 0
    else:
        # If player_df is empty, set lowest_ranking to a large value and top_25_reached to 0
        lowest_ranking = float('inf')
        top_25_reached = 0

    # Creating a new column indicating whether the player has reached top 25 rankings
    df['top_25_reached'] = top_25_reached

    # Updating the player's DataFrame in the dictionary
    player_games[player] = df

In [349]:
player_df= player_games['Marcos Giron']
print(player_df)

       surface tourney_date  winner_id              winner_name  winner_hand  \
630       Hard   2019-03-04     105683             Milos Raonic          1.0   
651       Hard   2019-03-04     106218             Marcos Giron          1.0   
683       Hard   2019-03-04     106218             Marcos Giron          1.0   
1527     Grass   2019-07-01     103852          Feliciano Lopez          0.0   
2049      Hard   2019-08-19     106148  Roberto Carballes Baena          1.0   
...        ...          ...        ...                      ...          ...   
184111    Hard   2018-09-03     105714                 Hugo Nys          1.0   
184495    Hard   2018-11-12     103333             Ivo Karlovic          1.0   
184550    Hard   2018-03-05     104660        Sergiy Stakhovsky          1.0   
233936    Hard   2014-08-17     104873     Aleksandr Nedovyesov          1.0   
233981    Hard   2014-08-25     104545               John Isner          1.0   

        winner_ht winner_ioc  winner_ag

In [350]:
for player, df in player_games.items():
    print(f"Player: {player}")
    print("Columns:")
    print(df.columns)
    print()  

Player: Kei Nishikori
Columns:
Index(['surface', 'tourney_date', 'winner_id', 'winner_name', 'winner_hand',
       'winner_ht', 'winner_ioc', 'winner_age', 'loser_id', 'loser_name',
       'loser_hand', 'loser_ht', 'loser_ioc', 'loser_age', 'best_of', 'round',
       'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms',
       'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt', 'l_1stIn',
       'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points',
       'win_ratio', 'top_25_reached'],
      dtype='object')

Player: Daniil Medvedev
Columns:
Index(['surface', 'tourney_date', 'winner_id', 'winner_name', 'winner_hand',
       'winner_ht', 'winner_ioc', 'winner_age', 'loser_id', 'loser_name',
       'loser_hand', 'loser_ht', 'loser_ioc', 'loser_age', 'best_of', 'round',
       'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms',
       'w_bpSaved', 'w_bpFaced', 'l_

##### Creating a player-centric dictionary, filtering the first 50 matches based on age (≤25) and ranking (>500). The dataset doesn't specify the start of a player's career, so I have set this criteria to indirectly determine the start of each player's tennis career.

In [351]:
# Ensuring player_games is not empty
assert player_games, "player_games should not be empty"

# Initializing the dictionary to store filtered games
filtered_games = {}

# Iterating through player_games
for player, player_df in player_games.items():
    # Sort the player's DataFrame by date in ascending order
    sorted_player_df = player_df.sort_values(by='tourney_date')
    
    # Initialize an empty DataFrame to store filtered games for the player
    filtered_player_df = pd.DataFrame()
    
    # Flag to track if the first game meeting the criteria is found
    first_game_found = False
    
    # Iterate through the sorted DataFrame
    for index, game in sorted_player_df.iterrows():
        # Check if the game meets the criteria
        if not first_game_found and (((game['winner_age'] <= 25 or game['loser_age'] <= 25) and 
                                      ((game['winner_rank'] > 500 or pd.isnull(game['winner_rank'])) or
                                       (game['loser_rank'] > 500 or pd.isnull(game['loser_rank']))))):
            # Add the game to the filtered DataFrame
            filtered_player_df = pd.concat([filtered_player_df, pd.DataFrame(game).transpose()], ignore_index=True)
            first_game_found = True
        elif first_game_found:
            # Add subsequent games after the first game meeting the criteria
            filtered_player_df = pd.concat([filtered_player_df, pd.DataFrame(game).transpose()], ignore_index=True)
        
        # Check if we've collected 50 games (including the first game)
        if first_game_found and len(filtered_player_df) == 50:
            break  # Stop once 50 games are collected
    
    # Store the filtered DataFrame in the dictionary
    if not filtered_player_df.empty:
        filtered_games[player] = filtered_player_df

In [352]:
player_to_find = "Rafael Nadal"

# Check if the player exists in the filtered_games dictionary
if player_to_find in filtered_games:
    # Get the DataFrame for the specified player
    player_df = filtered_games[player_to_find]
    print(f"Games for {player_to_find}:")
    print(player_df)
else:
    print(f"{player_to_find} not found in filtered games")

Games for Rafael Nadal:
   surface         tourney_date winner_id          winner_name winner_hand  \
0     Clay  2001-09-17 00:00:00    104745         Rafael Nadal           0   
1     Clay  2001-09-17 00:00:00    102999      Stefano Galvani           1   
2     Clay  2002-04-29 00:00:00    103694       Olivier Rochus           1   
3     Clay  2002-04-29 00:00:00    104745         Rafael Nadal           0   
4     Clay  2002-09-30 00:00:00    104745         Rafael Nadal           0   
5     Clay  2002-09-30 00:00:00    102607        Juan Balcells           1   
6     Clay  2002-10-07 00:00:00    104745         Rafael Nadal           0   
7     Clay  2002-10-07 00:00:00    104745         Rafael Nadal           0   
8     Clay  2002-10-07 00:00:00    102287        Albert Portas           1   
9     Clay  2002-10-07 00:00:00    104745         Rafael Nadal           0   
10  Carpet  2003-01-27 00:00:00    104339          Mario Ancic           1   
11  Carpet  2003-01-27 00:00:00    10474

##### Creating two dictionaries that separate each player's wins and losses for easier central tendency calculations

In [353]:
# Iterate over each player in the dictionary
for player, df in filtered_games.items():
    # Filter rows where the player is the winner
    winner_df = df[df['winner_name'] == player].copy()
    
    # Filter rows where the player is the loser
    loser_df = df[df['loser_name'] == player].copy()
    
    # Remove irrelevant columns from winner DataFrame
    winner_df = winner_df[[col for col in winner_df.columns if not col.startswith('loser_') and not col.startswith('l_')]]
    
    # Remove irrelevant columns from loser DataFrame
    loser_df = loser_df[[col for col in loser_df.columns if not col.startswith('winner_') and not col.startswith('w_')]]
    
    # Add DataFrames to respective dictionaries
    winner_stats[player] = winner_df
    
    # Add win_ratio column to loser DataFrame and set all values to 0
    loser_df['win_ratio'] = 0
    
    loser_stats[player] = loser_df

# Example usage:
print("Loser Stats:")
print(winner_stats['Jannik Sinner'])


Loser Stats:
   surface         tourney_date winner_id    winner_name winner_hand  \
5     Hard  2018-10-08 00:00:00    206173  Jannik Sinner         1.0   
7     Hard  2019-02-18 00:00:00    206173  Jannik Sinner         1.0   
8     Hard  2019-02-18 00:00:00    206173  Jannik Sinner         1.0   
9     Hard  2019-02-18 00:00:00    206173  Jannik Sinner         1.0   
10    Hard  2019-02-18 00:00:00    206173  Jannik Sinner         1.0   
11    Hard  2019-02-18 00:00:00    206173  Jannik Sinner         1.0   
12    Hard  2019-02-18 00:00:00    206173  Jannik Sinner         1.0   
15    Clay  2019-04-08 00:00:00    206173  Jannik Sinner         1.0   
17    Clay  2019-04-22 00:00:00    206173  Jannik Sinner         1.0   
19    Clay  2019-04-22 00:00:00    206173  Jannik Sinner         1.0   
21    Clay  2019-04-29 00:00:00    206173  Jannik Sinner         1.0   
22    Clay  2019-04-29 00:00:00    206173  Jannik Sinner         1.0   
23    Clay  2019-04-29 00:00:00    206173  Jannik S

##### Concatenating the winner_stas and loser_stats dataframes for each player together to determine their averages among the first 50 games of their careers

In [354]:

# Initialize empty dictionary to store combined DataFrames
combined_stats = {}

# Iterate over each player in the winner_stats and loser_stats dictionaries
for player in winner_stats.keys():
    # Rename columns in winner DataFrame
    winner_df = winner_stats[player].copy()
    winner_df.columns = [col.replace('winner_', '').replace('w_', '') for col in winner_df.columns]
    
    # Rename columns in loser DataFrame
    loser_df = loser_stats[player].copy()
    loser_df.columns = [col.replace('loser_', '').replace('l_', '') for col in loser_df.columns]
    
    # Concatenate DataFrames
    combined_stats_df = pd.concat([winner_df, loser_df], ignore_index=True)
    
    # Store combined DataFrame in the combined_stats dictionary
    combined_stats[player] = combined_stats_df

# Example usage:
print("Combined Stats for Jannik:")
print(combined_stats['Jannik Sinner'])

Combined Stats for Jannik:
   surface         tourney_date      id           name hand     ht  ioc   age  \
0     Hard  2018-10-08 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.1   
1     Hard  2019-02-18 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.5   
2     Hard  2019-02-18 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.5   
3     Hard  2019-02-18 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.5   
4     Hard  2019-02-18 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.5   
5     Hard  2019-02-18 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.5   
6     Hard  2019-02-18 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.5   
7     Clay  2019-04-08 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.6   
8     Clay  2019-04-22 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.6   
9     Clay  2019-04-22 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.6   
10    Clay  2019-04-29 00:00:00  206173  Jannik Sinner  1.0  188.0  ITA  17.7   
1

In [355]:
# Iterate over each player in the combined_stats dictionary
for player, df in combined_stats.items():
    # Drop columns from the DataFrame
    columns_to_drop = ['surface', 'round', 'best_of', 'rank', 'rank_points', 'name', 'id', 'ioc', 'age', 'tourney_date', 'hand']
    df.drop(columns=columns_to_drop, inplace=True)

# Example usage:
print("Modified Combined Stats for Player:")
print(combined_stats['Jannik Sinner'].head())

Modified Combined Stats for Player1:
      ht  ace   df  svpt 1stIn 1stWon 2ndWon SvGms bpSaved bpFaced win_ratio  \
0  188.0  8.0  1.0  53.0  35.0   31.0   10.0  10.0     1.0     2.0  1.344668   
1  188.0  8.0  0.0  63.0  37.0   28.0   19.0  10.0     1.0     1.0  1.971119   
2  188.0  3.0  0.0  49.0  27.0   22.0   12.0   8.0     0.0     0.0     2.184   
3  188.0  3.0  1.0  46.0  21.0   16.0   14.0   8.0     3.0     5.0  1.813953   
4  188.0  4.0  2.0  50.0  26.0   21.0   12.0   9.0     1.0     3.0  3.615894   

  top_25_reached  
0              1  
1              1  
2              1  
3              1  
4              1  


In [356]:

# Initialize an empty dictionary to store mean values for each player
player_means = {}

# Iterate over each player in the combined_stats dictionary
for player, df in combined_stats.items():
    # Calculate the mean of the first 50 values for each column
    mean_values = df.iloc[:50].mean()
    
    # Store the mean values in the player_means dictionary
    player_means[player] = mean_values

# Create a DataFrame from the player_means dictionary
mean_stats_df = pd.DataFrame(player_means).T  # Transpose to have players as rows

In [1]:
mean_stats_df.info()

NameError: name 'mean_stats_df' is not defined

In [358]:
csv_file = mean_stats_df.to_csv('mean_stats_tennis.csv')

In [359]:
specific_player_stats = mean_stats_df.loc['Rinky Hijikata']

# Example usage:
print("Stats for Specific Player:")
print(specific_player_stats)

Stats for Specific Player:
ht                     NaN
ace                   4.42
df                     3.0
svpt                 73.12
1stIn                 41.9
1stWon               30.44
2ndWon               15.22
SvGms                11.08
bpSaved               4.18
bpFaced               6.56
win_ratio         0.740181
top_25_reached         0.0
Name: Rinky Hijikata, dtype: object
