# Group Project - Predicting the Winner of a Tennis Match

# 1. Introduction

Tennis is a popular racket sport with a large competitive scene, and tournaments are held around the world to determine the best players. Professional tennis players are ranked through the ATP ranking system, which awards points based on tournament performance. For instance, a player earns more points the deeper they advance in a tournament.

In this project, we will attempt to answer the question: **“Can we predict the length of a match between two players?”** 


## Dataset Description

The dataset we will use to build our classification model contains the results of matches played between 2017 and 2019 by the top 500 tennis players. It includes information about the tournaments held in that period, the players who took part in them, and their wins and losses in those tournaments. It also records each player's ATP rank and how long each match lasted.

# 2. Methods & Results

First, we import the tennis dataset into Jupyter. Then, we transform the data to better suit our analysis and clean it into a tidy format. Finally, we split the data into 75% training and 25% testing sets.
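A minimal sketch of the 75/25 split with `rsample` (loaded as part of `tidymodels`) is shown below on simulated data. It assumes the response is match duration stored in a numeric column named `minutes`; that column name and all values here are assumptions for illustration. Stratifying on the response keeps the spread of match lengths similar in both sets. Stratifying on a column with many distinct values, such as a player name, typically leaves too few rows per value, and `initial_split()` then falls back to an unstratified split.

```r
library(rsample)

set.seed(1)

# Simulated stand-in for the match data. `minutes` is an assumed column name
# for match duration; the values are random, not real results.
toy_tennis <- data.frame(
  average_rank = runif(400, 1, 300),
  minutes      = round(rnorm(400, mean = 110, sd = 35))
)

# A numeric strata variable is binned (into quartiles by default), so the
# training and testing sets get a similar spread of match lengths.
toy_split <- initial_split(toy_tennis, prop = 0.75, strata = minutes)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Row counts of the two sets; together they cover every match exactly once.
c(training = nrow(toy_train), testing = nrow(toy_test))
```

Setting the seed before the split keeps the partition reproducible between runs.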

The dataset contains a large number of potential predictors, many of which are unlikely to help determine how long a match is, so further analysis is needed to choose the best set of predictor variables. We can do this using data visualizations.

After selecting the appropriate predictors, we can train a KNN (***k***-nearest neighbors) classification model using the training data. Cross-validation will be performed to find the best value for ***k***. Finally, the model will be evaluated against the testing data set.
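The tuning step described above could look roughly like the following `tidymodels` sketch. It runs on simulated data, and the class label `match_length` is a hypothetical column invented for the example (the project's actual response and predictors are not fixed yet). It also assumes the `kknn` package is installed, since that is the engine `nearest_neighbor()` uses here.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for the training set. `match_length` is a hypothetical
# class label (not a column in the ATP data) and all numbers are random.
n <- 200
toy_training <- tibble(
  average_rank    = runif(n, 1, 300),
  rank_difference = runif(n, 0, 300),
  match_length    = factor(ifelse(rank_difference + rnorm(n, sd = 60) < 150,
                                  "long", "short"))
)

# Standardize the predictors so both contribute equally to the distance.
knn_recipe <- recipe(match_length ~ average_rank + rank_difference,
                     data = toy_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN specification with the number of neighbours left open for tuning.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training data only.
folds <- vfold_cv(toy_training, v = 5, strata = match_length)

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = tibble(neighbors = seq(1, 21, by = 2))) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy.
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

The testing set is not touched during tuning; it would only be used once, after refitting the model with the chosen `best_k`.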

Currently, there are columns that describe the ATP rank of each of the two players. We will create a new column called `average_rank`, which will be the mean of the two players' ranks.
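As a small illustration, the sketch below computes this column in base R, together with the absolute rank gap that the cleaning step also adds as `rank_difference`. The ranks are copied from three 2019 Brisbane matches in the data; the data frame name `toy_matches` is made up for the example.

```r
# Toy data: winner and loser ranks copied from three 2019 Brisbane matches.
toy_matches <- data.frame(
  winner_rank = c(9, 16, 9),
  loser_rank  = c(16, 239, 40)
)

# Mean of the two ranks: a rough measure of how highly ranked the pairing is.
toy_matches$average_rank <- (toy_matches$winner_rank + toy_matches$loser_rank) / 2

# Absolute gap between the ranks: a rough measure of how lopsided the pairing is.
toy_matches$rank_difference <- abs(toy_matches$winner_rank - toy_matches$loser_rank)

print(toy_matches)
```

For the first match, ranks 9 and 16 give an average rank of 12.5 and a rank difference of 7.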


## Preliminary Exploratory Data Analysis

### Library importing and graph formatting

In [1]:
## limit displayed data frames to 6 rows
options(repr.matrix.max.rows = 6)

## import libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)

“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom    

### Importing dataset

In [2]:
tennis <- read_csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2019.csv", na = c("", "NA"))
head(tennis)


Rows: 2806 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): tourney_id, tourney_name, surface, tourney_level, winner_entry, wi...
dbl (35): draw_size, tourney_date, match_num, winner_id, winner_seed, winner...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


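| tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | winner_entry | ⋯ | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 300 | 105453 | 2.0 |  | ⋯ | 54 | 34 | 20 | 14 | 10 | 15 | 9 | 3590 | 16 | 1977 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 299 | 106421 | 4.0 |  | ⋯ | 52 | 36 | 7 | 10 | 10 | 13 | 16 | 1977 | 239 | 200 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 298 | 105453 | 2.0 |  | ⋯ | 27 | 15 | 6 | 8 | 1 | 5 | 9 | 3590 | 40 | 1050 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 297 | 104542 |  | PR | ⋯ | 60 | 38 | 9 | 11 | 4 | 6 | 239 | 200 | 31 | 1298 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 296 | 106421 | 4.0 |  | ⋯ | 56 | 46 | 19 | 15 | 2 | 4 | 16 | 1977 | 18 | 1855 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 295 | 104871 |  |  | ⋯ | 54 | 40 | 18 | 15 | 6 | 9 | 40 | 1050 | 185 | 275 |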


### Cleaning and tidying dataset

In [12]:
## Drop matches with a missing height for either player, then add rank-based columns
tennis <- tennis |>
          filter(!is.na(winner_ht), !is.na(loser_ht)) |>
          mutate(rank_difference = abs(winner_rank - loser_rank)) |>
          mutate(average_rank = (winner_rank + loser_rank)/2)
tennis

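| tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | winner_entry | ⋯ | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points | average_player_rank | rank_difference | average_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 300 | 105453 | 2 |  | ⋯ | 14 | 10 | 15 | 9 | 3590 | 16 | 1977 | 7 | 7 | 12.5 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 299 | 106421 | 4 |  | ⋯ | 10 | 10 | 13 | 16 | 1977 | 239 | 200 | 223 | 223 | 127.5 |
| 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 298 | 105453 | 2 |  | ⋯ | 8 | 1 | 5 | 9 | 3590 | 40 | 1050 | 31 | 31 | 24.5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 107 | 104745 |  |  | ⋯ |  |  |  | 2 | 9225 | 24 | 1450 | 22 | 22 | 13.0 |
| 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 108 | 106233 |  |  | ⋯ |  |  |  | 5 | 4575 | 33 | 1310 | 28 | 28 | 19.0 |
| 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 109 | 106058 |  |  | ⋯ |  |  |  | 210 | 235 | 11 | 2475 | 199 | 199 | 110.5 |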


### Splitting the data (training and testing)

In [11]:
set.seed(69)
tennis_split <- initial_split(tennis, prop = 0.75, strata = winner_name)
tennis_training <- training(tennis_split)
tennis_testing <- testing(tennis_split)

“Too little data to stratify.
• Resampling will be unstratified.”


### Summarization of data

### Visualizing the data

say some stuff about visualizations (need new columns)

### Adding new columns for analysis (win percentage for both players)

In [5]:
##Get data for each player by tournament
##Win and loss counts are taken over the whole training set, so they repeat on every row for a player
players_data <- tennis_training |>
    group_by(winner_name, tourney_name) |>
    reframe(
        ##first() gives a single name per group, so the comparison no longer recycles a vector
        winner_winning_match_count = sum(tennis_training$winner_name == first(winner_name)),
        winner_losing_match_count = sum(tennis_training$loser_name == first(winner_name)),
        winner_rank = winner_rank,
        loser_rank = loser_rank,
        loser_name = loser_name,
        match_num = match_num, 
        winner_match_count = winner_winning_match_count + winner_losing_match_count,
        winner_win_percent = winner_winning_match_count / winner_match_count
    )|>
    ##reframe() returns ungrouped data; regroup so players_data keeps the grouping later cells rely on
    group_by(winner_name, tourney_name) |>
    arrange(desc(winner_win_percent))
players_data
## Winner rank (of a match)
average_winner_ranks <- players_data |>
    group_by(winner_name) |>
    summarize(avg_rank = mean(winner_rank, na.rm = TRUE),
             winner_win_percent = mean(winner_win_percent, na.rm = TRUE),
             winner_match_count = mean(winner_match_count, na.rm = TRUE),
             winner_winning_match_count = mean(winner_winning_match_count, na.rm = TRUE),
             winner_losing_match_count = mean(winner_losing_match_count, na.rm = TRUE)) |>
    arrange(avg_rank)

## Loser rank (of a match)
average_loser_ranks <- players_data |>
    group_by(loser_name) |>
    summarize(avg_rank = mean(loser_rank, na.rm = TRUE)) |>
    arrange(avg_rank)


| winner_name | tourney_name | winner_winning_match_count | winner_losing_match_count | winner_rank | loser_rank | loser_name | match_num | winner_match_count | winner_win_percent |
|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` |
| Benjamin Hassan | Davis Cup G1 R1: UZB vs LBN | 2 | 0 | 303 | 383 | Khumoun Sultanov | 2 | 2 | 1 |
| Benjamin Hassan | Davis Cup G1 R1: UZB vs LBN | 2 | 0 | 303 | 362 | Sanjar Fayziev | 4 | 2 | 1 |
| Benjamin Lock | Davis Cup G2 R1: ROU vs ZIM | 1 | 0 | 546 | 80 | Marius Copil | 1 | 1 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Malek Jaziri | Cordoba | 2 | 13 | 43 | 135 | Carlos Berlocq | 282 | 15 | 0.1333333 |
| Malek Jaziri | Indian Wells Masters | 2 | 13 | 60 | 90 | Bradley Klahn | 224 | 15 | 0.1333333 |
| Jozef Kovalik | Kitzbuhel | 1 | 8 | 326 | 158 | Guillermo Garcia Lopez | 280 | 9 | 0.1111111 |
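
The per-player win percentage shown above can also be sketched in base R by counting wins and losses over one shared list of player names. This is only an illustration on made-up results: the data frame `toy_results` and the single-letter player names are placeholders. Counting over a shared list gives players who never won a match a win percentage of 0 directly, and `match()` then attaches each loser's win percentage to the match by name.

```r
# Toy match results; the player names are placeholders, not real players.
toy_results <- data.frame(
  winner_name = c("A", "A", "B", "A", "C"),
  loser_name  = c("B", "C", "C", "D", "D")
)

# Use one shared set of player names so players who never won still get a row.
players <- sort(unique(c(toy_results$winner_name, toy_results$loser_name)))
wins    <- as.vector(table(factor(toy_results$winner_name, levels = players)))
losses  <- as.vector(table(factor(toy_results$loser_name,  levels = players)))

win_summary <- data.frame(
  player      = players,
  wins        = wins,
  losses      = losses,
  win_percent = wins / (wins + losses)
)
win_summary

# Attach the loser's win percentage to each match by name lookup.
toy_results$loser_win_percent <-
  win_summary$win_percent[match(toy_results$loser_name, win_summary$player)]
toy_results
```

Here player D appears only as a loser and therefore ends up with a win percentage of 0.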


In [6]:
## Look up each loser's overall win percentage and win count by name.
## Players who never won a match in the training data get 0 for both.
idx <- match(players_data$loser_name, average_winner_ranks$winner_name)

match_data <- players_data
match_data$loser_win_percentage <- ifelse(is.na(idx), 0, average_winner_ranks$winner_win_percent[idx])
match_data$loser_matches_won <- ifelse(is.na(idx), 0, average_winner_ranks$winner_winning_match_count[idx])


In [7]:
##Summarized data
## Assumption: this assignment was left unfinished; match_data from the previous cell is used here
match_training <- match_data


## Note: average rank is slightly inaccurate because it ignores a player's rank in the matches they lost
## For example, Federer's significant ranking difference between the two tables is most likely due to a bad tournament where he lost many games and his ranking dropped as a result.

## Sample plot of winner_rank vs win_percentage
winner_plot <- ggplot(match_training, aes(x=winner_rank, y=winner_win_percent)) +
    geom_point(alpha = 0.3) +
    ggtitle("Winner Rank vs Percent of Matches Won")
winner_plot


### Expected Outcomes and Significance (Need to edit to become discussion)

We expect to predict the winner of a match between two professional tennis players with greater accuracy than simpler forms of prediction. These findings could help professional coaches and fans discover players with higher potential and decide whom to invest their time in. They could also lead to future questions such as “Which player will win a certain tournament?” or, on a larger scale, “Which player will be ranked the highest in the future based on their predicted tournament wins?” Overall, we hope that our predictor will be effective at determining the winner of a tennis match and that it will be useful in future applications.

### Discussion 

- summarize what you found
- discuss whether this is what you expected to find
- discuss what impact such findings could have
- discuss what future questions this could lead to
