# Group Project Proposal - Predicting the Winner of a Tennis Match

## Introduction

Tennis is a popular racket sport with a large competitive scene, and tournaments are held around the world to determine the best players. Professional players are ranked through the ATP ranking system, which awards points based on tournament performance: the further a player advances in a tournament, the more points they earn. In this project, we will attempt to answer the question: “Can we predict the winner of a match between two professional players?” The dataset we will use to build our classification model contains the results of matches played between 2017 and 2019 by the top 500 tennis players. It includes information about the tournaments held in that period, the players who took part in them, and each player’s wins and losses in those tournaments. It also records each player’s ATP rank and ranking points.


## Preliminary Exploratory Data Analysis

In [16]:
options(repr.matrix.max.rows = 20)

library(tidyverse)
library(repr)
library(tidymodels)

In [12]:
tennis <- read_csv("tennis.csv", na = c("", "NA"))
## Cleaning and Wrangling
## The data is already tidy: each row is one observation (one match). However, this is not the most useful format for our question.
## Keep only the 2019 matches, which are the first 2563 rows of the file
tennis <- tennis |>
    slice(1:2563)
tennis

[1m[22mNew names:
[36m•[39m `` -> `...1`
[1mRows: [22m[34m6866[39m [1mColumns: [22m[34m50[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (16): tourney_id, tourney_name, surface, tourney_level, winner_seed, win...
[32mdbl[39m (34): ...1, draw_size, tourney_date, match_num, winner_id, winner_ht, wi...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


...1,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,⋯,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2,⋯,54,34,20,14,10,15,9,3590,16,1977
1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4,⋯,52,36,7,10,10,13,16,1977,239,200
2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2,⋯,27,15,6,8,1,5,9,3590,40,1050
3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,⋯,60,38,9,11,4,6,239,200,31,1298
4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4,⋯,56,46,19,15,2,4,16,1977,18,1855
5,2019-M020,Brisbane,Hard,32,A,20181231,295,104871,,⋯,54,40,18,15,6,9,40,1050,185,275
6,2019-M020,Brisbane,Hard,32,A,20181231,294,105453,2,⋯,53,37,13,12,6,9,9,3590,19,1835
7,2019-M020,Brisbane,Hard,32,A,20181231,293,104542,,⋯,51,34,11,11,6,11,239,200,77,691
8,2019-M020,Brisbane,Hard,32,A,20181231,292,200282,7,⋯,39,30,3,9,3,6,31,1298,72,715
9,2019-M020,Brisbane,Hard,32,A,20181231,291,106421,4,⋯,39,27,7,10,2,6,16,1977,240,200


In [13]:
## Get data for each player by tournament
## Count every player's wins and losses over the whole `tennis` table first,
## then attach the counts to each match row with joins.
win_counts <- tennis |>
    count(winner_name, name = "winner_winning_match_count")
loss_counts <- tennis |>
    count(loser_name, name = "winner_losing_match_count")

players_data <- tennis |>
    select(winner_name, tourney_name, winner_rank, loser_rank, loser_name, match_num) |>
    left_join(win_counts, by = "winner_name") |>
    left_join(loss_counts, by = c("winner_name" = "loser_name")) |>
    mutate(
        ## Players who never lost a match have no row in loss_counts
        winner_losing_match_count = replace_na(winner_losing_match_count, 0L),
        winner_match_count = winner_winning_match_count + winner_losing_match_count,
        winner_win_percent = winner_winning_match_count / winner_match_count
    ) |>
    arrange(desc(winner_win_percent))

## Winner rank (of a match)
average_winner_ranks <- players_data |>
    group_by(winner_name) |>
    summarize(avg_rank = mean(winner_rank, na.rm = TRUE),
             winner_win_percent = mean(winner_win_percent, na.rm = TRUE),
             winner_match_count = mean(winner_match_count, na.rm = TRUE),
             winner_winning_match_count = mean(winner_winning_match_count, na.rm = TRUE),
             winner_losing_match_count = mean(winner_losing_match_count, na.rm = TRUE)) |>
    arrange(avg_rank)

## Loser rank (of a match)
average_loser_ranks <- players_data |>
    group_by(loser_name) |>
    summarize(avg_rank = mean(loser_rank, na.rm = TRUE)) |>
    arrange(avg_rank)


In [15]:
## Look up each loser's overall win rate and number of wins in the
## per-player summary. Players who never won a match in the data do not
## appear there, so their values default to 0.
loser_stats <- average_winner_ranks |>
    select(loser_name = winner_name,
           loser_win_percentage = winner_win_percent,
           loser_matches_won = winner_winning_match_count)

match_data <- players_data |>
    ungroup() |>
    left_join(loser_stats, by = "loser_name") |>
    mutate(loser_win_percentage = replace_na(loser_win_percentage, 0),
           loser_matches_won = replace_na(loser_matches_won, 0))

match_split <- initial_split(match_data, prop = 0.75, strata = winner_name)
match_training <- training(match_split)
match_testing <- testing(match_split)

## match data with winner and loser winrates is acquired

#rank_when_win_match <- tennis |>
#    group_by(winner_name, tourney_name) |>
#    summarize(winner_rank = winner_rank) |>
#    drop_na(winner_rank)

#head(rank_when_win_match)


In [6]:
## Summarized data

match_training


## Note: average rank is slightly inaccurate because it ignores a player's rank in the matches they lost.
## For example, the large difference between Federer's average rank in the two tables is most likely due to a bad tournament where he lost many matches and his ranking dropped as a result.

## Sample plot of winner_rank vs winner_win_percent
winner_plot <- ggplot(match_training, aes(x = winner_rank, y = winner_win_percent)) +
    geom_point() +
    labs(x = "Winner rank", y = "Proportion of matches won") +
    ggtitle("Winner rank vs. proportion of matches won")
winner_plot
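
The note in the cell above points out that the average rank only uses a player's rank in matches they won. One possible way to address this is to stack the winner and loser columns into a single long table before averaging. The sketch below does this on a small made-up table; the toy rows and the names `toy_matches` and `player_ranks` are illustrative assumptions and not part of the data set.

```r
library(dplyr)

# Toy stand-in for `tennis` (made-up values, same column names)
toy_matches <- tibble(
    winner_name = c("A", "A", "B"),
    loser_name  = c("B", "C", "A"),
    winner_rank = c(3, 3, 10),
    loser_rank  = c(10, 50, 5)
)

# One row per player per match, regardless of the result
player_ranks <- bind_rows(
    toy_matches |> transmute(player = winner_name, rank = winner_rank, won = TRUE),
    toy_matches |> transmute(player = loser_name,  rank = loser_rank,  won = FALSE)
) |>
    group_by(player) |>
    summarize(avg_rank = mean(rank, na.rm = TRUE),   # rank over wins and losses
              matches = n(),                         # total matches played
              win_percent = mean(won)) |>            # share of matches won
    arrange(avg_rank)

player_ranks
```

With this layout, the win percentage and the average rank come from the same rows, so a player's losses count toward both.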


## Methods

First, we imported the tennis data set into Jupyter. Next, we will transform the data to better suit our analysis. Currently, the columns describing the two players in a match are labeled “winner” and “loser”; we will relabel them “player 1” and “player 2”. So that the outcome can serve as a class label, we will create a column titled “winner” whose value is either “player 1” or “player 2”, indicating which player won. The columns we will use to predict the winner are a player’s average rank and a new column called win percentage. To create this new column, we will divide a player’s total number of wins by the total number of matches they have played. We chose these columns because we expect them to be the strongest indicators of whether a player will win a match. One way to visualize our results is to plot the accuracy of our classifier in a bar graph next to the accuracy of a predictor that uses only player rank and the accuracy of another predictor that uses only win percentage. This comparison will show whether our predictor is more accurate than these simpler models.
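
The relabeling step described above could look roughly like the following sketch. It runs on a small made-up table, and it assumes that the winner is placed in the “player 1” slot at random for each match. That random assignment is our own assumption here: without it, player 1 would always be the winner and the label would carry no information. The names `toy_matches`, `winner_first`, and `relabelled` are hypothetical.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the random assignment is repeatable

# Toy stand-in for the match table (made-up values)
toy_matches <- tibble(
    winner_name = c("A", "A", "B", "C"),
    loser_name  = c("B", "C", "A", "B"),
    winner_rank = c(3, 3, 10, 50),
    loser_rank  = c(10, 50, 5, 12)
)

# Randomly decide, per match, whether the winner is listed as player 1,
# then fill the player columns and the class label accordingly.
relabelled <- toy_matches |>
    mutate(
        winner_first  = runif(n()) < 0.5,
        player_1      = if_else(winner_first, winner_name, loser_name),
        player_2      = if_else(winner_first, loser_name, winner_name),
        player_1_rank = if_else(winner_first, winner_rank, loser_rank),
        player_2_rank = if_else(winner_first, loser_rank, winner_rank),
        winner        = factor(if_else(winner_first, "player 1", "player 2"),
                               levels = c("player 1", "player 2"))
    ) |>
    select(player_1, player_2, player_1_rank, player_2_rank, winner)

relabelled
```

Storing `winner` as a factor matches what tidymodels classification functions expect for the outcome column.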


## Expected outcomes and significance

We expect our classifier to predict the winner of a match between two professional tennis players more accurately than simpler predictors, such as one that uses only player rank or only win percentage.
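
As a point of reference for what a simpler form of prediction might score, the sketch below computes the accuracy of one hypothetical baseline rule: always pick the player with the better (numerically lower) rank. The table is made up, and the column names `player_1_rank`, `player_2_rank`, and `winner` are assumptions about how the relabeled data could be laid out.

```r
library(dplyr)

# Toy relabeled matches (made-up values); `winner` is the true label
toy <- tibble(
    player_1_rank = c(3, 50, 10, 12, 7),
    player_2_rank = c(10, 3, 5, 50, 20),
    winner = c("player 1", "player 2", "player 1", "player 1", "player 2")
)

# Baseline rule: predict the player with the lower rank number
baseline <- toy |>
    mutate(predicted = if_else(player_1_rank < player_2_rank, "player 1", "player 2"),
           correct = predicted == winner)

# Share of matches where the rule picked the actual winner
baseline_accuracy <- mean(baseline$correct)
baseline_accuracy
```

On this toy table the rule is right in 3 of 5 matches. On the real data, the same calculation on the testing set would give one of the bars to compare against our classifier's accuracy.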

These findings could help professional coaches and fans discover players with higher potential and allow them to decide whom to invest their time in. 

These findings could also lead to future questions such as “Which player will win a given tournament?” or, on a larger scale, “Which player will be ranked the highest in the future based on their predicted tournament wins?” Overall, we hope that our predictor will be effective at determining the winner of a tennis match and useful in future applications.