# Project Proposal Group 33


### Predicting a Tennis Player's Best Rank Based on Their Age, Seasons Played, Current Rank, and Prize Money

### Introduction


Tennis is a popular sport with a long history of competitive tournaments and rankings. Rankings are central to evaluating a player's performance, but predicting the best rank a player will reach is difficult. This project uses K-nearest neighbors regression to predict a player's best rank from age, seasons played, current rank, and prize money. We will tune and evaluate the model with five-fold cross-validation, and use visualizations such as scatter plots to explore which factors are associated with ranking.

### Preliminary Data Analysis


In [6]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

The seed is set to 4321 to ensure that every random process yields the same outcome each time the code is rerun.

In [7]:
set.seed(4321)
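
As a minimal illustration of what the seed does (the variable names here are ours, not part of the analysis), resetting the seed before two random draws makes them identical:

```r
# Draw five random integers, reset the seed, and draw again.
set.seed(4321)
first_draw <- sample(1:100, 5)

set.seed(4321)
second_draw <- sample(1:100, 5)

# Both draws start from the same generator state, so they match.
identical(first_draw, second_draw)  # TRUE
```

The same applies to the random train/test split and to the cross-validation folds, which is why the seed is set once before any of them run.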

In [8]:
url <- "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"
tennis_data <- read_csv(url)
head(tennis_data, n=10)

[1m[22mNew names:
[36m•[39m `` -> `...1`
[1mRows: [22m[34m500[39m [1mColumns: [22m[34m38[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (25): Age, Country, Plays, Wikipedia, Current Rank, Best Rank, Name, Bac...
[32mdbl[39m (13): ...1, Turned Pro, Seasons, Titles, Best Season, Retired, Masters, ...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


...1,Age,Country,Plays,Wikipedia,Current Rank,Best Rank,Name,Backhand,Prize Money,⋯,Facebook,Twitter,Nicknames,Grand Slams,Davis Cups,Web Site,Team Cups,Olympics,Weeks at No. 1,Tour Finals
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0,26 (25-04-1993),Brazil,Right-handed,Wikipedia,378 (97),363 (04-11-2019),Oscar Jose Gutierrez,,,⋯,,,,,,,,,,
1,18 (22-12-2001),United Kingdom,Left-handed,Wikipedia,326 (119),316 (14-10-2019),Jack Draper,Two-handed,"$59,040",⋯,,,,,,,,,,
2,32 (03-11-1987),Slovakia,Right-handed,Wikipedia,178 (280),44 (14-01-2013),Lukas Lacko,Two-handed,"US$3,261,567",⋯,,,,,,,,,,
3,21 (29-05-1998),"Korea, Republic of",Right-handed,Wikipedia,236 (199),130 (10-04-2017),Duck Hee Lee,Two-handed,"$374,093",⋯,,,,,,,,,,
4,27 (21-10-1992),Australia,Right-handed,Wikipedia,183 (273),17 (11-01-2016),Bernard Tomic,Two-handed,"US$6,091,971",⋯,,,,,,,,,,
5,22 (11-02-1997),Poland,Right-handed,Wikipedia,31 (1398),31 (20-01-2020),Hubert Hurkacz,Two-handed,"$1,517,157",⋯,,,,,,,,,,
6,28 (18-11-1991),United States,Right-handed,Wikipedia,307 (131),213 (31-10-2016),Sekou Bangoura,Two-handed,"$278,709",⋯,,,,,,,,,,
7,21 (12-05-1998),"Taiwan, Province of China",Right-handed,Wikipedia,232 (205),229 (04-11-2019),Tung Lin Wu,Two-handed,"$59,123",⋯,,,,,,,,,,
8,25 (29-07-1994),Uzbekistan,Right-handed,Wikipedia,417 (81),253 (17-07-2017),Sanjar Fayziev,Two-handed,"$122,734",⋯,,,,,,,,,,
9,20 (02-04-1999),Finland,Right-handed,Wikipedia,104 (534),104 (13-01-2020),Emil Ruusuvuori,Two-handed,"US$74,927",⋯,,,,,,,,,,


For our project we selected the columns Age, Best Rank, Prize Money, Current Rank, and Seasons. Best Rank is the response variable; the other four are predictors. We chose these columns because they are the most complete (most other columns consist largely of NA values) and because each plausibly relates, directly or indirectly, to a player's performance.

We then isolated these variables, removed observations with missing values, and stripped extraneous text (such as currency prefixes and earnings-rank annotations) from the Prize Money variable.

In [9]:
colnames(tennis_data) <- make.names(colnames(tennis_data))
tennis_data_separated <- tennis_data |> select(Age, Best.Rank, Prize.Money, Seasons, Current.Rank) |>
                separate(col = Age,
                        into = c("Age", "date"),
                        sep=" ",
                        convert=TRUE) |>
                separate(col = Best.Rank,
                        into = c("best_rank", "date_rank"),
                        sep=" ",
                        convert=TRUE) |>

                separate(col = Current.Rank,
                        into = c("current_rank", "date_cur_rank"),
                        sep=" ",
                        convert=TRUE) |>
    select(Age, best_rank, current_rank, Prize.Money, Seasons)

head(tennis_data_separated, n = 10)

Age,best_rank,current_rank,Prize.Money,Seasons
<int>,<int>,<int>,<chr>,<dbl>
26,363,378,,
18,316,326,"$59,040",
32,44,178,"US$3,261,567",14.0
21,130,236,"$374,093",2.0
27,17,183,"US$6,091,971",11.0
22,31,31,"$1,517,157",5.0
28,213,307,"$278,709",1.0
21,229,232,"$59,123",1.0
25,253,417,"$122,734",5.0
20,104,104,"US$74,927",3.0


In [10]:
# Replace the "US" currency prefix with a space; leftover spaces are stripped further down.
tennis_data_separated$Prize.Money <- gsub("US", " ", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- gsub("all-time leader in earnings", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- gsub("11th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- gsub("24th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- gsub("10th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("14th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("27th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("15th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("30th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("All-time leader in earnings", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("4th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("28th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("2nd", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("6th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("33rd", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("26th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("24th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("48th", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("41st", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("\\$","", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub(" ", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("   ", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub("  ", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- sub(" all-time in earnings", "", tennis_data_separated$Prize.Money)
tennis_data_separated$Prize.Money <- gsub(",", "", tennis_data_separated$Prize.Money)

tennis_data_renamed <- tennis_data_separated |>
    rename("Best_Rank" = "best_rank") |>
    rename("Current_Rank" = "current_rank")
head(tennis_data_renamed, n = 10)

Age,Best_Rank,Current_Rank,Prize.Money,Seasons
<int>,<int>,<int>,<chr>,<dbl>
26,363,378,,
18,316,326,59040.0,
32,44,178,3261567.0,14.0
21,130,236,374093.0,2.0
27,17,183,6091971.0,11.0
22,31,31,1517157.0,5.0
28,213,307,278709.0,1.0
21,229,232,59123.0,1.0
25,253,417,122734.0,5.0
20,104,104,74927.0,3.0
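
The chain of `sub()`/`gsub()` calls above removes one known annotation at a time. A more compact alternative, sketched here under the assumption that the amount is always the first run of digits and commas in the string, extracts that run directly. The helper name `clean_prize_money` is hypothetical, and the sample strings are taken from the rows printed earlier:

```r
library(stringr)

# Hypothetical helper: pull out the first run of digits/commas and convert it.
# Assumes the dollar amount comes before any trailing annotation.
clean_prize_money <- function(x) {
  amount <- str_extract(x, "[0-9][0-9,]*")  # e.g. "3,261,567"; NA stays NA
  as.numeric(str_remove_all(amount, ","))   # drop thousands separators
}

raw_prize <- c("$59,040", "US$3,261,567", "US$74,927", NA)
clean_prize_money(raw_prize)
```

This sketch may parse strings that the original chain leaves unparsed (and therefore drops later as NA), so the resulting row counts and summary statistics could differ from those reported here.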


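We also converted the Prize Money column from character to double. Values that cannot be parsed as numbers become NA (hence the coercion warning below), and `na.omit()` then drops every row with a missing value in any column.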

In [11]:
tennis_data_mutated <- tennis_data_renamed |>
    mutate(Prize_Money = as.numeric(Prize.Money)) |>
    na.omit() |>
    select(Age, Best_Rank, Current_Rank, Prize_Money, Seasons)

head(tennis_data_mutated, n = 10)

“NAs introduced by coercion”


Age,Best_Rank,Current_Rank,Prize_Money,Seasons
<int>,<int>,<int>,<dbl>,<dbl>
32,44,178,3261567,14
21,130,236,374093,2
27,17,183,6091971,11
22,31,31,1517157,5
28,213,307,278709,1
21,229,232,59123,1
25,253,417,122734,5
20,104,104,74927,3
19,17,22,1893476,3
23,4,4,10507693,5


We split the data into 75% training and 25% testing sets, with `strata` set to Current_Rank (one of our predictors; the response is Best_Rank). We will use the training set to train the model and the testing set to assess its accuracy.

In [12]:
tennis_data_split <- initial_split(tennis_data_mutated, prop = .75, strata = Current_Rank)
tennis_data_train <- training(tennis_data_split)
tennis_data_test <- testing(tennis_data_split)
head(tennis_data_train, n = 10)
head(tennis_data_test, n = 10)

Age,Best_Rank,Current_Rank,Prize_Money,Seasons
<int>,<int>,<int>,<dbl>,<dbl>
22,31,31,1517157,5
19,17,22,1893476,3
23,4,4,10507693,5
20,47,54,1285541,3
22,25,34,2722314,6
32,11,45,11912152,15
32,9,12,13470614,16
29,23,27,4850190,11
29,32,32,2301746,13
25,80,84,827193,5


Age,Best_Rank,Current_Rank,Prize_Money,Seasons
<int>,<int>,<int>,<dbl>,<dbl>
21,130,236,374093,2
30,98,105,898701,7
20,187,331,127760,2
34,48,408,3186839,14
27,11,14,7217264,8
27,26,41,3062847,11
28,10,30,8892564,9
22,298,344,71874,6
34,62,461,1453933,10
26,72,83,1701922,7


In [13]:
tennis_data_train_mean <- tennis_data_train |> map_df(mean)
tennis_data_train_median <- tennis_data_train |> map_df(median)

print("Training Tennis Data Mean")
tennis_data_train_mean
print("Training Tennis Data Median")
tennis_data_train_median

[1] "Training Tennis Data Mean"


Age,Best_Rank,Current_Rank,Prize_Money,Seasons
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
26.70221,128.8971,206.625,4493456,6.422794


[1] "Training Tennis Data Median"


Age,Best_Rank,Current_Rank,Prize_Money,Seasons
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
26.5,99,193.5,648769.5,5


The tables above show that the predictors need to be standardized: their ranges differ greatly, and because K-nearest neighbors relies on distances between observations, predictors on larger scales (such as Prize_Money) would otherwise have a greater effect on the model than those on smaller scales (such as Seasons). Prize_Money also has a mean (about 4.49 million) far above its median (about 0.65 million), which points to a right-skewed distribution in which a few very high earners pull the average up. This reinforces the need to put the predictors on a common scale.
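
To make the scale problem concrete, the following sketch uses the Seasons and Prize_Money values of the first ten training rows printed above (the object names are illustrative). On the raw data, the Euclidean distance between the first two players is almost exactly their prize-money difference, so Seasons contributes essentially nothing:

```r
# Seasons and Prize_Money for the first ten training rows shown above.
players <- data.frame(
  Seasons     = c(5, 3, 5, 3, 6, 15, 16, 11, 13, 5),
  Prize_Money = c(1517157, 1893476, 10507693, 1285541, 2722314,
                  11912152, 13470614, 4850190, 2301746, 827193)
)

# Raw distance between players 1 and 2: dominated by Prize_Money.
raw_dist <- as.matrix(dist(players))[1, 2]
raw_dist
abs(players$Prize_Money[1] - players$Prize_Money[2])

# After centering and scaling, both columns have mean 0 and sd 1,
# so each contributes comparably to the distance.
players_scaled <- scale(players)
scaled_dist <- as.matrix(dist(players_scaled))[1, 2]
scaled_dist
```

In the analysis itself, the scaling is done inside the recipe so that it is learned from the training data only.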

// Need to add visuals 
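
One possible visual, sketched on the first ten training rows printed above: a scatter plot of best rank against prize money. The log scale on the x-axis is our assumption, chosen because of the skew in Prize_Money. In the full analysis the plot would use `tennis_data_train` rather than this small sample.

```r
library(ggplot2)

# First ten training rows shown above (Best_Rank and Prize_Money only).
train_sample <- data.frame(
  Best_Rank   = c(31, 17, 4, 47, 25, 11, 9, 23, 32, 80),
  Prize_Money = c(1517157, 1893476, 10507693, 1285541, 2722314,
                  11912152, 13470614, 4850190, 2301746, 827193)
)

# Scatter plot with a log-scaled x-axis to spread out the skewed earnings.
prize_plot <- ggplot(train_sample, aes(x = Prize_Money, y = Best_Rank)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Prize money (log scale)", y = "Best rank")

prize_plot
```

Analogous plots for Age, Seasons, and Current_Rank would follow the same pattern with a different `x` aesthetic.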

### Methods

#### Recipe:

Since we need to standardize the data before any further analysis, we will begin by making a recipe that contains all predictor variables (tennis_data_recipe) as well as four other recipes for each of the predictors individually. 

In [15]:
tennis_data_recipe <- recipe(Best_Rank ~ Age + Current_Rank + Prize_Money + Seasons, data = tennis_data_train)|>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

tennis_data_recipe_age <- recipe(Best_Rank ~ Age, data = tennis_data_train)|>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

tennis_data_recipe_current_rank <- recipe(Best_Rank ~ Current_Rank, data = tennis_data_train)|>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

tennis_data_recipe_prize_money <- recipe(Best_Rank ~ Prize_Money, data = tennis_data_train)|>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

tennis_data_recipe_Seasons <- recipe(Best_Rank ~ Seasons, data = tennis_data_train)|>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

#### Model Specification:
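
A minimal sketch of what the specification could look like for the K-nearest neighbors regression described in the introduction. The object name `knn_spec`, the `"rectangular"` weight function (equal weight for every neighbor), and the use of `tune()` for the number of neighbors are our assumptions, not final choices:

```r
library(tidymodels)

# K-nearest neighbors regression; neighbors = tune() leaves K to be
# chosen later by cross-validation.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_spec
```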

#### Workflow:
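
A sketch of how a recipe and a model specification could be combined into a workflow. So that the block runs on its own, it uses only the first ten training rows printed above and a fixed K of 3. Both are placeholders: the real workflow would use `tennis_data_recipe` and a tuned K. Fitting and predicting on the same rows is done here only to show the mechanics.

```r
library(tidymodels)

# First ten training rows shown above, used purely for illustration.
toy_train <- tribble(
  ~Age, ~Best_Rank, ~Current_Rank, ~Prize_Money, ~Seasons,
  22, 31, 31,  1517157,  5,
  19, 17, 22,  1893476,  3,
  23,  4,  4, 10507693,  5,
  20, 47, 54,  1285541,  3,
  22, 25, 34,  2722314,  6,
  32, 11, 45, 11912152, 15,
  32,  9, 12, 13470614, 16,
  29, 23, 27,  4850190, 11,
  29, 32, 32,  2301746, 13,
  25, 80, 84,   827193,  5
)

# Same preprocessing steps as tennis_data_recipe.
toy_recipe <- recipe(Best_Rank ~ Age + Current_Rank + Prize_Money + Seasons,
                     data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Fixed K = 3 is an arbitrary placeholder value.
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("regression")

# Bundle preprocessing and model, then fit.
toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec) |>
  fit(data = toy_train)

toy_preds <- predict(toy_fit, new_data = toy_train)
toy_preds
```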

#### Cross Validation:
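
A sketch of the five-fold cross-validation mentioned in the introduction, used here to compare candidate values of K by RMSE. It runs on the twenty rows printed above (ten from each `head()` call) only so that it is self-contained. In the real analysis the folds would be built from `tennis_data_train` alone, and the grid of K values (1 to 5 here) is an assumption.

```r
library(tidymodels)
set.seed(4321)

# Twenty rows copied from the outputs above, for illustration only.
toy_data <- tribble(
  ~Age, ~Best_Rank, ~Current_Rank, ~Prize_Money, ~Seasons,
  22,  31,  31,  1517157,  5,   19,  17,  22,  1893476,  3,
  23,   4,   4, 10507693,  5,   20,  47,  54,  1285541,  3,
  22,  25,  34,  2722314,  6,   32,  11,  45, 11912152, 15,
  32,   9,  12, 13470614, 16,   29,  23,  27,  4850190, 11,
  29,  32,  32,  2301746, 13,   25,  80,  84,   827193,  5,
  21, 130, 236,   374093,  2,   30,  98, 105,   898701,  7,
  20, 187, 331,   127760,  2,   34,  48, 408,  3186839, 14,
  27,  11,  14,  7217264,  8,   27,  26,  41,  3062847, 11,
  28,  10,  30,  8892564,  9,   22, 298, 344,    71874,  6,
  34,  62, 461,  1453933, 10,   26,  72,  83,  1701922,  7
)

toy_recipe <- recipe(Best_Rank ~ Age + Current_Rank + Prize_Money + Seasons,
                     data = toy_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Five folds; each K in the grid is fit on four folds and scored on the fifth.
toy_folds <- vfold_cv(toy_data, v = 5)

cv_metrics <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = toy_folds,
            grid = tibble(neighbors = 1:5),
            metrics = metric_set(rmse)) |>
  collect_metrics()

# Candidate K values ordered by mean cross-validated RMSE.
cv_metrics |> arrange(mean)
```

The K with the lowest mean RMSE would then be used to refit the model on the full training set before evaluating it once on the testing set.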

### Results

### Discussion

### References