**This project was completed during my 1st-year university break (end of 2017). Improvements to be added in the future.**




The aim of this project was to build a machine learning model that can accurately predict the winner of a men's tennis match. I obtained data for ATP matches played between 1995 and 2017 from Jeff Sackmann's GitHub repository (https://github.com/JeffSackmann/tennis_atp).

There were 3 key stages to this project:
1. Data preparation (merging files, cleaning data and dealing with missing values)
2. Feature engineering (creating new informative features using my domain knowledge of tennis)
3. Modelling (training classification algorithms and evaluating their performance)


## Data Preparation

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Sort each csv file by date and merge them all into one file

df = pd.read_csv("atp_matches_2017.csv", usecols=range(49), index_col = False)
df.sort_values(by='tourney_date', ascending=False, na_position='first', inplace = True)
df_whole = df

for i in range(2016, 1994, -1):    
    df = pd.read_csv("atp_matches_" + str(i) + ".csv", usecols=range(49), index_col = False)
    df.sort_values(by='tourney_date', ascending=False, na_position='first', inplace = True)
    df_whole = pd.concat([df_whole, df])
    
# Save the merged file
df_whole.to_csv("atp_matches.csv", index = False)

In [8]:
# Read in the data
df = pd.read_csv("atp_matches.csv", usecols = range(49), index_col = False)

# Display the first 5 rows of the dataset, to give a preview of the data
df.head(5)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,2017-560,Us Open,Hard,128,G,20170828,156,106298,16.0,,...,8.0,3.0,3.0,90.0,57.0,37.0,19.0,0.0,5.0,8.0
1,2017-560,Us Open,Hard,128,G,20170828,124,105138,11.0,,...,16.0,11.0,3.0,120.0,69.0,50.0,19.0,0.0,8.0,14.0
2,2017-560,Us Open,Hard,128,G,20170828,206,103893,,,...,9.0,1.0,2.0,87.0,52.0,27.0,16.0,0.0,4.0,12.0
3,2017-560,Us Open,Hard,128,G,20170828,205,104999,23.0,,...,3.0,11.0,7.0,101.0,67.0,51.0,13.0,0.0,2.0,5.0
4,2017-560,Us Open,Hard,128,G,20170828,204,105023,17.0,,...,2.0,5.0,4.0,106.0,78.0,54.0,15.0,0.0,5.0,9.0


In [9]:
# Print a list of variable names, along with their data types
print(df.dtypes)

tourney_id             object
tourney_name           object
surface                object
draw_size               int64
tourney_level          object
tourney_date            int64
match_num               int64
winner_id               int64
winner_seed           float64
winner_entry           object
winner_name            object
winner_hand            object
winner_ht             float64
winner_ioc             object
winner_age            float64
winner_rank           float64
winner_rank_points    float64
loser_id                int64
loser_seed            float64
loser_entry            object
loser_name             object
loser_hand             object
loser_ht              float64
loser_ioc              object
loser_age             float64
loser_rank            float64
loser_rank_points     float64
score                  object
best_of                 int64
round                  object
minutes               float64
w_ace                 float64
w_df                  float64
w_svpt    

After a quick look at the data, I began the data cleaning process. First, I went through the features (using my knowledge of tennis along with descriptions of the variables) and removed the ones that were irrelevant to this project. Removing irrelevant variables reduces noise in the data and can help limit overfitting.

In [10]:
# Delete irrelevant variables
df.drop(['draw_size', 'winner_seed','winner_entry', 'winner_ioc','loser_seed','loser_entry','loser_ioc'], axis=1, inplace=True)

# Print unique values of the 'round' variable, with corresponding frequencies
df['round'].value_counts()

R32     23820
R16     12435
R64     11286
RR       7988
R128     7072
QF       6238
SF       3170
F        1615
BR          6
Name: round, dtype: int64

I then renamed the 'round' variable as 'stage' and numerically encoded it as follows:
- 0 for any round before the quarter-finals (including round-robin matches)
- 1 for a quarter-final
- 2 for a semi-final
- 3 for a final

This numerical encoding reflects the fact that it is progressively harder to win matches as a player progresses through a tournament.

"BR" represents that the match was a bronze medal playoff in the Olympic Games. These matches are similar to semi-finals (in terms of the stage at which they are played during the tournament), so I changed all "BR" values to "SF" for consistency.

In [11]:
# Change "BR" to "SF"
df.loc[df['round'] == "BR", 'round'] = "SF"

# Rename the "round" variable
df.rename(columns = { 'round' : 'stage'}, inplace = True)

# Encode the categorical variable "stage"
mapping_dict = {
    "stage": {
        "RR": 0,
        "R128": 0,
        "R64": 0,
        "R32": 0,
        "R16": 0,
        "QF": 1,
        "SF": 2,
        "F": 3}
}
df = df.replace(mapping_dict)

In [12]:
# Check for missing values in the data
df.isnull().sum()

tourney_id               0
tourney_name             0
surface                210
tourney_level            0
tourney_date             0
match_num                0
winner_id                0
winner_name              0
winner_hand             14
winner_ht             4330
winner_age              28
winner_rank           1723
winner_rank_points    1723
loser_id                 0
loser_name               0
loser_hand              31
loser_ht              7173
loser_age               65
loser_rank            2667
loser_rank_points     2667
score                    2
best_of                  0
stage                    0
minutes               9572
w_ace                 7801
w_df                  7801
w_svpt                7801
w_1stIn               7801
w_1stWon              7801
w_2ndWon              7801
w_SvGms               7801
w_bpSaved             7801
w_bpFaced             7801
l_ace                 7801
l_df                  7801
l_svpt                7801
l_1stIn               7801
l

There was no reasonable way to estimate missing values for certain variables (e.g. score), so any rows that had missing values for these variables were deleted.

In [13]:
# Delete rows that have missing values in specified columns
df.dropna(subset = ['w_svpt','score','winner_rank','loser_rank'], inplace = True)

df.isnull().sum()

tourney_id               0
tourney_name             0
surface                 61
tourney_level            0
tourney_date             0
match_num                0
winner_id                0
winner_name              0
winner_hand              0
winner_ht             2020
winner_age               2
winner_rank              0
winner_rank_points       0
loser_id                 0
loser_name               0
loser_hand               3
loser_ht              3541
loser_age               11
loser_rank               0
loser_rank_points        0
score                    0
best_of                  0
stage                    0
minutes               1857
w_ace                    0
w_df                     0
w_svpt                   0
w_1stIn                  0
w_1stWon                 0
w_2ndWon                 0
w_SvGms                  0
w_bpSaved                0
w_bpFaced                0
l_ace                    0
l_df                     0
l_svpt                   0
l_1stIn                  0
l

There were 61 missing values for the surface variable, so I filled these in manually by looking up the matches online. I've omitted the code from this notebook for brevity, but here is an example of how I did it:

In [14]:
# Display 5 matches that had a missing value for surface
df[df.surface.isnull()].head(5)[['tourney_id','tourney_name','tourney_date','surface']]

Unnamed: 0,tourney_id,tourney_name,tourney_date,surface
1840,2017-M-DC-2017-G1-AO-M-TPE-CHN-01,Davis Cup G1 R1: TPE vs CHN,20170217,
1841,2017-M-DC-2017-G1-AO-M-TPE-CHN-01,Davis Cup G1 R1: TPE vs CHN,20170217,
1842,2017-M-DC-2017-G1-AO-M-TPE-CHN-01,Davis Cup G1 R1: TPE vs CHN,20170217,
2010,2017-M-DC-2017-G2-EPA-M-SWE-TUN-01,Davis Cup G2 R1: SWE vs TUN,20170203,
2011,2017-M-DC-2017-G2-EPA-M-TUR-CYP-01,Davis Cup G2 R1: TUR vs CYP,20170203,


In [15]:
# Set the surface to "Hard" for a specific match
df.loc[df.tourney_id == "2017-M-DC-2017-G1-AO-M-TPE-CHN-01", 'surface'] = "Hard"

I then filled in many of the missing values for players' heights and ages, which was a lot of manual work. I've again omitted the code for this to keep the notebook concise, but see below for 2 lines of sample code. I calculated age values using the player's birthdate and the tournament start date. After a while, it was taking too long to manually find all of the missing heights online, so I filled the rest of them with the mean height.
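
As a rough illustration of the age calculation described above, the sketch below derives a decimal age from a birthdate and a tournament start date, both given as YYYYMMDD integers like `tourney_date`. The helper names and the birthdate in the example are assumptions made for illustration. Counting whole years plus leftover days divided by 365 is one plausible convention, not necessarily the exact formula used for the values in this notebook.

```python
from datetime import date


def to_date(yyyymmdd):
    """Convert an integer such as 20170724 into a datetime.date."""
    return date(yyyymmdd // 10000, (yyyymmdd // 100) % 100, yyyymmdd % 100)


def birthday_in_year(born, year):
    """Return the birthday in the given year (29 Feb falls back to 28 Feb)."""
    try:
        return born.replace(year=year)
    except ValueError:
        return born.replace(year=year, day=28)


def age_at_tournament(birth_yyyymmdd, tourney_yyyymmdd):
    """Age in years: whole years plus leftover days as a fraction of 365."""
    born = to_date(birth_yyyymmdd)
    start = to_date(tourney_yyyymmdd)
    years = start.year - born.year
    last_birthday = birthday_in_year(born, start.year)
    if last_birthday > start:
        # The birthday has not happened yet in the tournament year
        years -= 1
        last_birthday = birthday_in_year(born, start.year - 1)
    return years + (start - last_birthday).days / 365.0


# Illustrative (assumed) birthdate of 5 May 1996, tournament starting 24 July 2017
print(round(age_at_tournament(19960505, 20170724), 6))
```

A helper like this could be applied to every row with a missing age once the birthdates have been collected, instead of typing each value by hand.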

In [16]:
# Example of how I filled in player ages
df.loc[(df.loser_name == "Christopher Eubanks") & (df.tourney_date == 20170724), 'loser_age'] = 21.219178

# Example of how I filled in player heights
df.loc[df.winner_id == 105430, 'winner_ht'] = 175

# Fill in the remaining missing heights by using the mean value
df['winner_ht'] = df['winner_ht'].fillna(df['winner_ht'].mean())
df['loser_ht'] = df['loser_ht'].fillna(df['loser_ht'].mean())

I then checked each variable for improbable values that could indicate errors. Whilst doing this, I noticed that many matches lasted less than 30 minutes. This rang some alarm bells, since the shortest men's ATP tennis match ever played lasted 28 minutes.

In [17]:
# Display the score and minutes of some of the matches that were shorter than 30 minutes
df[df.minutes < 30].head()[['score','minutes']]

Unnamed: 0,score,minutes
297,5-3 RET,29.0
578,6-1 RET,17.0
580,5-0 RET,12.0
676,3-0 RET,12.0
919,6-1 1-0 RET,28.0


From the output above, I realised that some matches had ended with a player retiring (usually due to injury), which explained why so many match durations were so short. These incomplete matches were not useful for my analysis and modelling, so I deleted them.

In [18]:
# Get rid of matches that ended due to a player retiring
df = df[df.score.str.find('RET') == -1]

I also noticed that in some matches, the string representing the score contained letters. These matches were investigated as shown below.

In [23]:
allowed_chars = set('0123456789- ()')
for index,row in df.iterrows():
    if not set(row.score).issubset(allowed_chars):
        print([row.score, index])

As evident from the output, these matches involved a default (e.g. for a rule violation) or walkover (when a player forfeits the match). I got rid of these incomplete matches, as shown below.

In [24]:
allowed_chars = set("0123456789- ()")
# Keep only the rows whose score consists solely of allowed characters
has_valid_score = df.score.apply(lambda s: set(s).issubset(allowed_chars))
df = df[has_valid_score]

I checked whether any remaining matches had durations of less than 28 minutes, and found that some of the scores were incomplete, yet hadn't been marked as retirements, defaults or walkovers (e.g. one score was "4-1"). I deleted these rows. I also found that certain matches had implausibly short durations despite having a complete score. For example, one five-set match was recorded as lasting only 27 minutes, which was obviously an error. These matches weren't an issue though, since I didn't intend to use the _minutes_ variable in my model; it was just a useful tool for identifying errors in the dataset.
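
The code that removed the incomplete scorelines isn't shown in this notebook. One possible way to flag them is sketched below, assuming the score is written from the winner's perspective (winner's games first in each set). The helper names `set_is_finished` and `is_complete_score` are hypothetical, and the set rules are deliberately simplified.

```python
import re

SET_PATTERN = re.compile(r"(\d+)-(\d+)")


def set_is_finished(winner_games, loser_games):
    """A set counts as finished at 7-6, or with 6+ games and a 2-game margin."""
    high = max(winner_games, loser_games)
    low = min(winner_games, loser_games)
    if (high, low) == (7, 6):
        return True
    return high >= 6 and high - low >= 2


def is_complete_score(score, best_of):
    """Return True if every set is finished and the winner took enough sets."""
    # Tiebreak scores such as "(5)" contain no hyphen, so the pattern skips them
    sets = [(int(w), int(l)) for w, l in SET_PATTERN.findall(score)]
    if not sets or not all(set_is_finished(w, l) for w, l in sets):
        return False
    sets_won = sum(1 for w, l in sets if w > l)
    return sets_won >= best_of // 2 + 1


print(is_complete_score("4-1", 3))
print(is_complete_score("6-4 3-6 7-6(5)", 3))
```

Rows could then be filtered with something like `df[df.apply(lambda r: is_complete_score(r['score'], r['best_of']), axis=1)]`.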

In [25]:
# Drop the 'minutes' variable
df.drop('minutes', axis = 1, inplace = True)

## Feature Engineering

Next, I began doing some feature engineering. This included:
- Writing a function to parse the string variable 'score'. It takes a string representing a legitimate tennis scoreline and returns the percentage of games won by each player, as well as the total number of games played in the match
- Creating features representing the percentage of games won for each player (using the score parser)
- Creating features representing the percentage of points won whilst on serve for each player
- Creating features representing the percentage of points won whilst returning for each player
- Creating features representing the percentage of break points won for each player
- Creating features representing the ace percentage and double-fault percentage for each player

These percentage features provide a better estimation for how well players performed in different aspects of the match, compared to the raw counts. The number of games won by each player wasn't provided in the dataset, so I created a score parser to extract this useful information. The break-point-won percentages provide insight into how well the players performed in critical moments of the match (i.e. how well they saved break points and how well they capitalised on the opportunity to break their opponent's serve).
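
For comparison with the character-by-character `parse_score` function used in this notebook, here is a minimal sketch of the same idea using a regular expression. The name `parse_score_regex` is an assumption for illustration, and the sketch assumes the scoreline is complete and contains only sets such as "7-6" with optional tiebreak scores such as "(5)".

```python
import re


def parse_score_regex(score):
    """Return [winner games %, loser games %, total games] for a completed match."""
    # Each set looks like "7-6"; a tiebreak score such as "(5)" has no hyphen
    # and is therefore ignored by the pattern.
    sets = re.findall(r"(\d+)-(\d+)", score)
    winner_gms = sum(int(w) for w, _ in sets)
    loser_gms = sum(int(l) for _, l in sets)
    total_gms = float(winner_gms + loser_gms)
    return [winner_gms / total_gms * 100, loser_gms / total_gms * 100, total_gms]


# 13 games to the winner, 10 to the loser, 23 in total
print(parse_score_regex("6-4 7-6(5)"))
```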

In [26]:
# Define a function to parse the score variable.
def parse_score(score):
    '''
    Takes in a string representing a tennis scoreline.
    Returns the percentage of games won by each player as well as the total number of games played in the match
    '''
    winner_gms = 0
    loser_gms = 0
    i = 0
    while i < len(score):        
        temp = int(score[i])
        i+=1
        if score[i].isdigit():
            winner_gms+= (temp * 10 + int(score[i]))
            i+=2
        else:
            winner_gms += temp
            i += 1
        temp = int(score[i])
        i+=1
        if i< len(score) and score[i].isdigit():
            loser_gms += (temp * 10 + int(score[i]))
            i+=1
        else:
            loser_gms += temp     
        if i < len(score):
            if score[i] == "(":
                while score[i] != ")":
                    i+=1
                i+=2
            else:
                i+=1
    total_gms = float(winner_gms + loser_gms)
    return [(winner_gms/total_gms)*100, (loser_gms/total_gms)*100, total_gms]

In [27]:
# Calculate the percentage of games won by each player in that match
df['w_gmsWon_pct'] = df.apply(lambda x: parse_score(x['score'])[0], axis = 1)
df['l_gmsWon_pct'] = df.apply(lambda x: parse_score(x['score'])[1], axis = 1)   

# Calculate both players' ace and double-fault percentages
df['w_ace_pct'] = (df['w_ace']/df.w_svpt) * 100
df['l_ace_pct'] = (df['l_ace']/df.l_svpt) * 100

df['w_df_pct'] = (df['w_df']/df['w_svpt']) * 100
df['l_df_pct'] = (df['l_df']/df['l_svpt']) * 100

# Calculate both players' win-on-serve, win-on-return and win-on-break-point percentages

df['w_serve_pts_pct'] = (df['w_1stWon'] + df['w_2ndWon'])/(df['w_svpt'])*100
df['l_serve_pts_pct'] = (df['l_1stWon'] + df['l_2ndWon'])/(df['l_svpt'])*100

df['w_retpts_pct'] = (df['l_svpt'] - df['l_1stWon'] - df['l_2ndWon'])/(df['l_svpt'])*100
df['l_retpts_pct'] = (df['w_svpt'] - df['w_1stWon'] - df['w_2ndWon'])/(df['w_svpt'])*100

df['w_bpWon_pct'] = (df.w_bpSaved + df.l_bpFaced - df.l_bpSaved)/(df.w_bpFaced + df.l_bpFaced)*100
df['l_bpWon_pct'] = (df.l_bpSaved + df.w_bpFaced - df.w_bpSaved)/(df.l_bpFaced + df.w_bpFaced)*100

After creating the new variables, I checked again for missing values. There were a few missing values in the new variables, which must have come from 0/0 divisions (certain original features had been incorrectly recorded as 0). I dropped the rows that had missing values.
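
To make the division-by-zero case concrete, the plain-Python sketch below computes the break-point-won percentage for one player and returns NaN when no break points were recorded. The helper names `safe_pct` and `bp_won_pct` are hypothetical. Note that in pandas a 0/0 division on float columns gives NaN, while a non-zero value divided by 0 gives an infinite value, which `dropna` does not remove by default.

```python
import math


def safe_pct(numerator, denominator):
    """Percentage that yields NaN instead of raising when the denominator is 0."""
    if denominator == 0:
        return math.nan
    return numerator / float(denominator) * 100


def bp_won_pct(own_saved, own_faced, opp_saved, opp_faced):
    """Share of all break points in the match that went this player's way."""
    # Break points won = own break points saved + opponent's break points lost
    won = own_saved + (opp_faced - opp_saved)
    return safe_pct(won, own_faced + opp_faced)


# Player saved 5 of 8 break points faced and converted 3 of the opponent's 8
print(bp_won_pct(5, 8, 5, 8))
# A match recorded with no break points at all gives NaN
print(bp_won_pct(0, 0, 0, 0))
```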

In [29]:
df.dropna(inplace = True)

In [30]:
# Print unique values of the 'surface' variable, with corresponding frequencies
df['surface'].value_counts()

Hard      31136
Clay      21096
Grass      6812
Carpet     3490
Name: surface, dtype: int64

The 'Grass' and 'Carpet' surface values didn't have many samples, which could have caused issues later on in my processing. Furthermore, carpet plays similarly to grass (based on my research and general knowledge of tennis). As such, I set all 'Carpet' surface values to 'Grass'.

In [31]:
df.loc[df.surface == 'Carpet', 'surface'] = 'Grass'

Once the data cleaning was complete, I saved the data to a new csv file called "atp_matches5.csv", which is used in the next stages of the project.

At this stage, each row of the dataset held the statistics (including player performance metrics) of a particular match. However, these statistics are only known after a match has been played. To predict the winner of a match, I needed a dataset in which each row, representing a match, had columns estimating various abilities of the two players (e.g. their ability to win points on serve, their ability to win break points, etc.) as they stood before the match was played.

To estimate these player characteristics, I calculated weighted averages (weighted by time, opponent ranking and playing surface) of each of the original variables from the players' past 40 matches. A rough outline of the algorithm I used is given below:

- For each row in the dataset, with surface _s_:
  - For each player (i.e. the winner and loser) in the row:
    - Find the player's last 40 matches: the 10 most recent on surface _s_ plus the 30 most recent of the remaining matches (rows without enough match history are dropped)
    - For each feature in {serve_pts_pct, retpts_pct, ace_pct, bpWon_pct, gmsWon_pct}:
      - Find the average value of the feature from the set of 40 matches, weighted by how long ago the match took place, the opponent rank and the surface it was played on 
    - Calculate the percentage of matches the player won out of these last 40 matches, adjusted for how long ago each match took place, the opponent rank and the surface it was played on
    - Add the new features to the dataframe
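
The weighting step in the outline above can be sketched for a single feature as follows. The `time_discount` and `opp_weight` formulas are the ones defined in this notebook's code, while the function `weighted_feature_average` and its tuple-based input format are simplifying assumptions made for illustration.

```python
def time_discount(years_ago):
    """Weight for a match played `years_ago` years earlier (capped at 0.6)."""
    return min(0.6 ** years_ago, 0.6)


def opp_weight(opp_rank):
    """Weight that favours results against highly ranked opponents."""
    return max(-0.8 * opp_rank / 299 + 1.4027, 0.6)


def weighted_feature_average(matches):
    """Weighted average of one feature over a player's past matches.

    `matches` is a list of (value, years_ago, opp_rank, same_surface) tuples.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for value, years_ago, opp_rank, same_surface in matches:
        # Matches on the same surface count double
        weight = (2 if same_surface else 1) * time_discount(years_ago)
        weighted_sum += value * weight * opp_weight(opp_rank)
        # As in the notebook, the total uses the time/surface weight only
        total_weight += weight
    return weighted_sum / total_weight


# Made-up history: serve-points-won % in two past matches
history = [(65.0, 0.5, 10, True), (60.0, 2.0, 150, False)]
print(weighted_feature_average(history))
```

Because the opponent weight is applied to the numerator only, strong results against highly ranked opponents push the average up, and results against low-ranked opponents pull it down.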
    
See below for the code I used.

**NOTE: The calculate() function was originally cobbled together poorly. It will be re-written in the future to make use of vectorization (for greatly increased speed) and modularisation (to improve readability and maintenance)**

In [None]:
# Load the cleaned dataset
df = pd.read_csv("atp_matches5.csv", usecols=range(32), index_col = False)

# Create new variable called 'index' that will be used later on
df['index'] = df.index

# Get rid of matches that were played before 1999 (to reduce the size of the dataset)
df = df[df.tourney_date > 19990000]
df.tourney_date = df.tourney_date.astype(str)

# Create new variables
df['w_ave_svpt_pct'] = np.nan
df['w_ave_retpt_pct'] = np.nan
df['w_ave_bpWon_pct'] = np.nan
df['w_ave_ace_pct'] = np.nan
df['w_ave_gmsWon_pct'] = np.nan
df['w_win_pct'] = np.nan

# Takes in two strings representing dates, outputs number of years between the dates
def yrs_between(start, end):
    yrs = int(end[:4]) - int(start[:4])
    mths = (int(end[4:6]) - int(start[4:6]))/12.0
    days = (int(end[6:8]) - int(start[6:8]))/365.0
    return yrs + mths + days

# Calculates the weighting to be applied to a match played a certain number of years ago
def time_discount(time):
    return min(0.6**time,0.6)

# Functions used to calculate weightings for matches given the opponent's ranking
def opp_weight(opp_rank):
    return max(-0.8*opp_rank/299 + 1.4027, 0.6)

def l_opp_weight(opp_rank):
    return min(0.8*opp_rank/299 + 0.6, 1.4)

# Function to set the values of the new variables

def calculate():
    df_min = df[['tourney_date','surface','index','winner_id','loser_id','winner_rank','loser_rank','w_serve_pts_pct','w_retpts_pct',
    'w_bpWon_pct','w_ace_pct','w_gmsWon_pct', 'l_serve_pts_pct','l_retpts_pct','l_bpWon_pct','l_ace_pct','l_gmsWon_pct']]
    
    for row in df_min.itertuples():
      surface = row[2]
      serve_pts = 0
      ret_pts = 0
      bpWon = 0
      ace = 0
      gmsWon = 0
      match_pct = 0
      total_weight = 0
      match_weight = 0

      df_temp = df_min[((df_min.winner_id == row[4]) | (df_min.loser_id == row[4]))].loc[row[3] + 1:]
    
      if len(df_temp) < 40:
        df.drop(row[3], inplace = True)
        continue
      
      df_s = df_temp[df_temp['surface'] == surface].head(10)
      if len(df_s) < 10:
        df.drop(row[3], inplace = True)
        continue
      
      for match in df_s.itertuples():
          df_temp.drop(match[3], inplace = True) 
          if match[4] == row[4]:
            weight = 2 * time_discount(yrs_between(match[1],row[1]))
            multiplier = weight * opp_weight(match[7])
            serve_pts += (match[8] * multiplier)
            ret_pts += (match[9] * multiplier)
            bpWon += (match[10] * multiplier)
            ace += (match[11] * multiplier)
            gmsWon += (match[12] * multiplier)
            match_pct += (multiplier)
            total_weight+= weight
            match_weight += weight
          else:
            weight = 2 * time_discount(yrs_between(match[1],row[1]))
            multiplier = weight * opp_weight(match[6])
            serve_pts += (match[13] * multiplier)
            ret_pts += (match[14] * multiplier)
            bpWon += (match[15] * multiplier)
            ace += (match[16] * multiplier)
            gmsWon += (match[17] * multiplier)
            total_weight+= weight
            match_weight += (weight*l_opp_weight(match[6]))


      for match in df_temp.head(30).itertuples():
          if match[4] == row[4]:
              weight = time_discount(yrs_between(match[1],row[1]))
              multiplier = weight * opp_weight(match[7])
              serve_pts += (match[8] * multiplier)
              ret_pts += (match[9] * multiplier)
              bpWon += (match[10] * multiplier)
              ace += (match[11] * multiplier)
              gmsWon += (match[12] * multiplier)
              match_pct += (multiplier)
              total_weight+= weight
              match_weight += weight
          else:
            weight = time_discount(yrs_between(match[1],row[1]))
            multiplier = weight * opp_weight(match[6])
            serve_pts += (match[13] * multiplier)
            ret_pts += (match[14] * multiplier)
            bpWon += (match[15] * multiplier)
            ace += (match[16] * multiplier)
            gmsWon += (match[17] * multiplier)
            total_weight+= weight
            match_weight += (weight*l_opp_weight(match[6]))
      df.at[row[3], 'w_ave_svpt_pct']= (serve_pts/total_weight)
      df.at[row[3], 'w_ave_retpt_pct']= (ret_pts/total_weight)
      df.at[row[3], 'w_ave_bpWon_pct']= (bpWon/total_weight)
      df.at[row[3], 'w_ave_ace_pct']= (ace/total_weight)
      df.at[row[3], 'w_ave_gmsWon_pct']= (gmsWon/total_weight)
      df.at[row[3], 'w_win_pct']= (match_pct/match_weight)

Because calculate() loops over the rows one at a time rather than using vectorized operations, executing the next code cell may take a long time. However, there is no need to run the cell, since its output was saved to csv files that are loaded and used in the next stage of the notebook.

In [None]:
# Run the function
calculate()

I ran the calculate() function on the dataframe and saved the modified data to a file called "atp_matches10.csv". Then, I modified the function slightly to calculate the feature values for the match losers, and ran it, saving the data to a file called "atp_matches11.csv".

Now I had data that was almost ready to be used to train the machine learning algorithms. However, there were a few more processing steps to do.

Firstly, I needed to merge the two csv files together to obtain a single dataset that contained both the winner statistics and loser statistics.

In [None]:
# Merge the 2 dataframes

df1 = pd.read_csv("atp_matches10.csv", usecols=range(39), index_col = False)
df2 = pd.read_csv("atp_matches11.csv", usecols=range(39), index_col = False)

df1 = df1[['surface','tourney_level','winner_hand','winner_ht','winner_age','winner_rank_points','loser_hand','loser_ht',
          'loser_age','loser_rank_points','best_of','stage', 'w_ave_svpt_pct','w_ave_retpt_pct','w_ave_bpWon_pct','w_ave_ace_pct',
           'w_ave_gmsWon_pct','w_win_pct','index']]

df2 = df2[['surface','tourney_level','winner_hand','winner_ht','winner_age','winner_rank_points','loser_hand','loser_ht',
          'loser_age','loser_rank_points','best_of','stage', 'l_ave_svpt_pct','l_ave_retpt_pct','l_ave_bpWon_pct','l_ave_ace_pct',
           'l_ave_gmsWon_pct','l_win_pct','index']]

df3 = pd.merge(df1, df2, on='index', how = 'inner', suffixes=('', '_y'))

# Define a function to get rid of duplicate columns
def drop_y(dataframe):
    to_drop = [x for x in dataframe if x.endswith('_y')]
    dataframe.drop(to_drop, axis=1, inplace=True)
    
# Get rid of duplicate columns  
drop_y(df3)

The data from the above cell was saved to "atp_matches12.csv". Next, I encoded the categorical variables. After that, the last data processing step was to create a response variable which indicates which player won the match. I created new variables for each player performance feature, generated a random number to assign either player 1 or player 2 as the winner, and set the new feature values accordingly.
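
The random player assignment described above can also be expressed without a row-by-row loop. The sketch below is one possible vectorised version using `numpy.where`. The function `assign_players`, its `column_pairs` argument and the tiny demo dataframe are assumptions made for illustration, and a fixed seed is used so that the labelling is reproducible.

```python
import numpy as np
import pandas as pd


def assign_players(matches, column_pairs, seed=0):
    """Randomly relabel winner/loser columns as player 1/player 2.

    `column_pairs` maps a new suffix (e.g. 'ht') to its (winner, loser) columns.
    """
    rng = np.random.RandomState(seed)  # fixed seed so the split is reproducible
    # True means the match winner is labelled as player 2
    swap = rng.randint(0, 2, size=len(matches)).astype(bool)
    out = pd.DataFrame(index=matches.index)
    out['winner'] = np.where(swap, 2, 1)
    for suffix, (winner_col, loser_col) in column_pairs.items():
        w = matches[winner_col].values
        l = matches[loser_col].values
        out['p1_' + suffix] = np.where(swap, l, w)
        out['p2_' + suffix] = np.where(swap, w, l)
    return out


# Tiny made-up example
demo = pd.DataFrame({
    'winner_ht': [185.0, 190.0, 178.0, 183.0],
    'loser_ht': [180.0, 175.0, 188.0, 193.0],
    'w_win_pct': [0.70, 0.65, 0.55, 0.60],
    'l_win_pct': [0.50, 0.45, 0.58, 0.40],
})
pairs = {'ht': ('winner_ht', 'loser_ht'), 'win_pct': ('w_win_pct', 'l_win_pct')}
labelled = assign_players(demo, pairs, seed=42)
print(labelled)
```

Randomising which player is labelled "player 1" matters because otherwise the response variable would be constant and a model could not learn anything from it.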

In [8]:
df = pd.read_csv("atp_matches12.csv", usecols=range(25), index_col = False)

# Encode categorical variables
mapping_dict = {
    "surface": {
        "Clay": 0,
        "Hard": 1,
        "Grass": 2
        },
    "tourney_level": {
        "A": 0,
        "D": 1,
        "M": 2,
        "F": 3,
        "G": 4
    },
    "winner_hand": {
        "R": 0,
        "L": 1
        },
    "loser_hand": {
        "R": 0,
        "L": 1
        }
}
df = df.replace(mapping_dict)

In [9]:
# Generate a random number for each match, indicating whether "player 1" or "player 2" won the match
import random
for i in range(len(df)):
    x = random.randint(1,2)
    # If x = 1, then player 1 won
    if x == 1:
        df.at[i, 'winner'] = 1
        # Replace 'winner_' with 'p1_'
        # Replace 'loser_' with 'p2_'
        df.at[i, 'p1_hand'] = df.loc[i]['winner_hand']
        df.at[i, 'p1_ht'] = df.loc[i]['winner_ht']
        df.at[i, 'p1_age'] = df.loc[i]['winner_age']
        df.at[i, 'p1_rank_points'] = df.loc[i]['winner_rank_points']
        df.at[i, 'p1_ave_svpt_pct'] = df.loc[i]['w_ave_svpt_pct']
        df.at[i, 'p1_ave_retpt_pct'] = df.loc[i]['w_ave_retpt_pct']
        df.at[i, 'p1_ave_bpWon_pct'] = df.loc[i]['w_ave_bpWon_pct']
        df.at[i, 'p1_ave_ace_pct'] = df.loc[i]['w_ave_ace_pct']
        df.at[i, 'p1_ave_gmsWon_pct'] = df.loc[i]['w_ave_gmsWon_pct']        
        df.at[i, 'p1_win_pct'] = df.loc[i]['w_win_pct']
        
        df.at[i, 'p2_hand'] = df.loc[i]['loser_hand']
        df.at[i, 'p2_ht'] = df.loc[i]['loser_ht']
        df.at[i, 'p2_age'] = df.loc[i]['loser_age']
        df.at[i, 'p2_rank_points'] = df.loc[i]['loser_rank_points']
        df.at[i, 'p2_ave_svpt_pct'] = df.loc[i]['l_ave_svpt_pct']
        df.at[i, 'p2_ave_retpt_pct'] = df.loc[i]['l_ave_retpt_pct']
        df.at[i, 'p2_ave_bpWon_pct'] = df.loc[i]['l_ave_bpWon_pct']
        df.at[i, 'p2_ave_ace_pct'] = df.loc[i]['l_ave_ace_pct']
        df.at[i, 'p2_ave_gmsWon_pct'] = df.loc[i]['l_ave_gmsWon_pct']        
        df.at[i, 'p2_win_pct'] = df.loc[i]['l_win_pct']
        
    # If x = 2, then player 2 won
    else:
        # Replace 'winner_' with 'p2_'
        # Replace 'loser_' with 'p1_'
        df.at[i, 'winner'] = 2
        
        df.at[i, 'p2_hand'] = df.loc[i]['winner_hand']
        df.at[i, 'p2_ht'] = df.loc[i]['winner_ht']
        df.at[i, 'p2_age'] = df.loc[i]['winner_age']
        df.at[i, 'p2_rank_points'] = df.loc[i]['winner_rank_points']
        df.at[i, 'p2_ave_svpt_pct'] = df.loc[i]['w_ave_svpt_pct']
        df.at[i, 'p2_ave_retpt_pct'] = df.loc[i]['w_ave_retpt_pct']
        df.at[i, 'p2_ave_bpWon_pct'] = df.loc[i]['w_ave_bpWon_pct']
        df.at[i, 'p2_ave_ace_pct'] = df.loc[i]['w_ave_ace_pct']
        df.at[i, 'p2_ave_gmsWon_pct'] = df.loc[i]['w_ave_gmsWon_pct']
        df.at[i, 'p2_win_pct'] = df.loc[i]['w_win_pct']
        
        df.at[i, 'p1_hand'] = df.loc[i]['loser_hand']
        df.at[i, 'p1_ht'] = df.loc[i]['loser_ht']
        df.at[i, 'p1_age'] = df.loc[i]['loser_age']
        df.at[i, 'p1_rank_points'] = df.loc[i]['loser_rank_points']
        df.at[i, 'p1_ave_svpt_pct'] = df.loc[i]['l_ave_svpt_pct']
        df.at[i, 'p1_ave_retpt_pct'] = df.loc[i]['l_ave_retpt_pct']
        df.at[i, 'p1_ave_bpWon_pct'] = df.loc[i]['l_ave_bpWon_pct']
        df.at[i, 'p1_ave_ace_pct'] = df.loc[i]['l_ave_ace_pct']
        df.at[i, 'p1_ave_gmsWon_pct'] = df.loc[i]['l_ave_gmsWon_pct']
        df.at[i, 'p1_win_pct'] = df.loc[i]['l_win_pct'] 
        
# Get rid of variables that are no longer needed
df = df[['surface','tourney_level','winner','p1_hand','p1_ht','p1_age','p1_rank_points','p2_hand','p2_ht',
          'p2_age','p2_rank_points','best_of','stage', 'p1_ave_svpt_pct','p1_ave_retpt_pct','p1_ave_bpWon_pct','p1_ave_ace_pct',
           'p1_ave_gmsWon_pct','p1_win_pct','p2_ave_svpt_pct','p2_ave_retpt_pct','p2_ave_bpWon_pct','p2_ave_ace_pct',
           'p2_ave_gmsWon_pct','p2_win_pct']]

## Machine Learning Models

The next stage of the project was to train machine learning classification algorithms. I began by quickly trying out some simple algorithms.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree

# Separate the features from the response variable 

features = [x for x in df.columns.tolist() if x != 'winner']
response = 'winner'
X = df[features]
Y = df[response]

# Split the data into a training set and testing set

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)

In [17]:
from sklearn.tree import DecisionTreeClassifier

# Train a Decision Tree Classifier with the training data

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [18]:
# Generate predictions for the match winner of records in the test set 
y_predict = dt.predict(X_test)

# Calculate the accuracy score of the classifier
print("Accuracy score: " + str(round(accuracy_score(y_test, y_predict)*100, 3)) + "%")

Accuracy score: 58.411%


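An accuracy of 58.4% is quite poor (random guessing would achieve roughly 50%, since the winner label was assigned at random between player 1 and player 2), so I decided to try out other machine learning algorithms.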

In [23]:
from sklearn.neighbors import KNeighborsClassifier
# Train a k-Nearest Neighbours classifier with the training data

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train) 

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [24]:
# Generate predictions and calculate the accuracy score of the model

y_predict = neigh.predict(X_test)
print("Accuracy score: " + str(round(accuracy_score(y_test, y_predict)*100, 3)) + "%")

Accuracy score: 60.015%


In [25]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression model with the training data

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Generate predictions and calculate the accuracy score of the model

y_predict = logreg.predict(X_test)
print("Accuracy score: " + str(round(accuracy_score(y_test, y_predict)*100, 3)) + "%")

Accuracy score: 67.773%


The prediction results varied from poor to fairly good - the Logistic Regression model was reasonably accurate. However, even its prediction accuracy of just under 68% was a little disappointing. The model could be improved by:

 - Standardising the features
 - Using a feature selection algorithm
 - Tuning the model parameters
 - Using ensemble methods (such as random forests, bagging, boosting, etc)
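
As a starting point for the first suggestion, the sketch below shows how feature standardisation could be combined with logistic regression in a scikit-learn pipeline and evaluated with cross-validation. It runs on synthetic data generated by `make_classification`, not on the ATP dataset, so the printed accuracy says nothing about tennis prediction. The variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the match features (not the real ATP data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Scaling happens inside the pipeline, so each fold is standardised
# using statistics from its own training split only
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print("Mean CV accuracy: " + str(round(scores.mean() * 100, 3)) + "%")
```

Standardisation is likely to matter most for the k-Nearest Neighbours model, since distance-based methods are sensitive to features on very different scales (e.g. ranking points versus percentages).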

Even though the results weren't as impressive as I'd hoped for, I learnt a lot by completing this project, and I believe that by refining some of the data processing and modelling techniques that I used, much better models can be created to predict the outcome of ATP tennis matches. I hope to continue this project in the future.