# The Characteristics of a Tennis Match Winner

### Group 38 Project Proposal

#### Introduction
The factors that determine who wins a tennis match can be grouped into four categories: strategy, technique, physical attributes and preparation, and mental game. While all four contribute to a win, we aim to test the importance of physical attributes and preparation by predicting a player's current rank from height, age, and hand dominance, together with preparation-related variables such as seasons active and favorite surface. To answer this question we will use the dataset “player_stats.csv”, which contains player statistics for the top 500 tennis players.

#### Preliminary exploratory data analysis

##### Methods
In progress

##### Expected outcomes and significance:

- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?

#### Our Data Analysis:

In [1]:
# Setup
set.seed(3)

library(tidyverse)
library(tidymodels)
library(repr)
library(cowplot)
library(GGally)
library(ISLR)
library(themis)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

ERROR: Error in library(themis): there is no package called ‘themis’
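
The error above means that `themis` is not installed in this environment, so the setup cell stopped before reaching `options()`. A minimal sketch of checking which packages are missing before attaching them is shown here; the vector name `required` and the decision not to install automatically are our assumptions, not part of the original notebook.

```r
# Packages attached in the setup cell
required <- c("tidyverse", "tidymodels", "repr", "cowplot",
              "GGally", "ISLR", "themis")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them (needs network access):
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```

Running this before the `library()` calls reports every missing package at once rather than stopping at the first one.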


##### First, we will tidy up our data:

In [5]:
# Load the data from the web
url <- "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"
tennis <- read_csv(url) 

# Fix column names
colnames(tennis) <- make.names(colnames(tennis))

# Select only the columns relevant to our study
tennis <- select(tennis, Age, Plays, Current.Rank, Height)

tennis

[1m[22mNew names:
[36m•[39m `` -> `...1`
[1mRows: [22m[34m500[39m [1mColumns: [22m[34m38[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (25): Age, Country, Plays, Wikipedia, Current Rank, Best Rank, Name, Bac...
[32mdbl[39m (13): ...1, Turned Pro, Seasons, Titles, Best Season, Retired, Masters, ...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


| Age | Plays | Current.Rank | Height |
|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 26 (25-04-1993) | Right-handed | 378 (97) | |
| 18 (22-12-2001) | Left-handed | 326 (119) | |
| 32 (03-11-1987) | Right-handed | 178 (280) | 185 cm |
| 21 (29-05-1998) | Right-handed | 236 (199) | |
| 27 (21-10-1992) | Right-handed | 183 (273) | 193 cm |
| 22 (11-02-1997) | Right-handed | 31 (1398) | |
| 28 (18-11-1991) | Right-handed | 307 (131) | |
| 21 (12-05-1998) | Right-handed | 232 (205) | |
| 25 (29-07-1994) | Right-handed | 417 (81) | |
| 20 (02-04-1999) | Right-handed | 104 (534) | |


In [6]:
# Tidy the data further by removing extra information from certain columns
# (e.g. birth dates, the "cm" unit, the extra number in parentheses next to the rank)

# Split each column at the space: the value we want stays in the original column,
# and the remainder goes into a throwaway column (x, y, z)
tennis <- separate(tennis, col = Age, into = c("Age", "x"), sep = " ", convert = TRUE) |> 
          separate(col = Current.Rank, into = c("Current.Rank", "y"), sep = " ", convert = TRUE) |>
          separate(col = Height, into = c("Height", "z"), sep = " ", convert = TRUE) 

# Delete unnecessary columns
tennis <- select(tennis, -c(x, y, z))

tennis

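| Age | Plays | Current.Rank | Height |
|---|---|---|---|
| `<int>` | `<chr>` | `<int>` | `<int>` |
| 26 | Right-handed | 378 | |
| 18 | Left-handed | 326 | |
| 32 | Right-handed | 178 | 185 |
| 21 | Right-handed | 236 | |
| 27 | Right-handed | 183 | 193 |
| 22 | Right-handed | 31 | |
| 28 | Right-handed | 307 | |
| 21 | Right-handed | 232 | |
| 25 | Right-handed | 417 | |
| 20 | Right-handed | 104 | |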
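
To make the cleaning step concrete, the sketch below applies the same idea (keep the text before the space, then convert it to an integer) to a few raw values copied from the first table. It uses base R instead of `separate()`, and the helper name `leading_int` is hypothetical; it is only an illustration of what happens to each value, not the notebook's own method.

```r
# Toy values copied from the first rows of the raw table
age_raw    <- c("26 (25-04-1993)", "18 (22-12-2001)", "32 (03-11-1987)")
height_raw <- c(NA, NA, "185 cm")

# Hypothetical helper: drop everything from the first space onward,
# then convert the remaining text to an integer (NA stays NA)
leading_int <- function(x) as.integer(sub(" .*", "", x))

leading_int(age_raw)     # 26 18 32
leading_int(height_raw)  # NA NA 185
```

Missing heights stay `NA` after conversion, which is why the summary statistics in the next cell are computed with `na.rm = TRUE`.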


In [50]:
tennis_split <- initial_split(tennis, prop = 0.7, strata = Current.Rank) 
tennis_train <- training(tennis_split) 
tennis_test <- testing(tennis_split)

# sum(apply(tennis_train_table, 1, function(x) any(is.na(x)))) # counts the rows with at least one missing value

mean_Height <- mean(tennis_train$Height, na.rm = TRUE) # mean height (cm), ignoring missing values

mean_Age <- mean(tennis_train$Age, na.rm = TRUE) # mean age, ignoring missing values

tab <- matrix(c(25.905, 185.83, 271, 500), ncol = 4, byrow = TRUE) # summary table; the values are entered by hand
colnames(tab) <- c('Mean Age','Mean Height','Missing Observations', 'Observations')
rownames(tab) <- c('Value')
tab <- as.table(tab)
tab





      Mean Age Mean Height Missing Observations Observations
Value   25.905     185.830              271.000      500.000
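
The values in the table above are typed into `matrix()` by hand, so they can drift out of sync with the data. A minimal sketch of computing the same four quantities directly from a data frame follows; the toy data frame `toy` and the column names of `summary_tab` are illustrative assumptions, and in the notebook the same expressions would be applied to whichever data set the summary is meant to describe.

```r
# Toy data standing in for the tidied tennis data (values are illustrative)
toy <- data.frame(
  Age    = c(26L, 18L, 32L, 21L),
  Plays  = c("Right-handed", "Left-handed", "Right-handed", "Right-handed"),
  Height = c(NA, NA, 185L, 193L)
)

# Compute each summary value from the data instead of typing it in
summary_tab <- data.frame(
  mean_age          = mean(toy$Age, na.rm = TRUE),
  mean_height       = mean(toy$Height, na.rm = TRUE),
  rows_with_missing = sum(!complete.cases(toy)),  # rows with at least one NA
  observations      = nrow(toy)
)

summary_tab
```

Here `complete.cases()` flags rows with no missing values, so negating and summing it counts the rows with at least one `NA`, the same quantity as the commented-out `apply()` line in the cell above.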