## Title
Predicting Tennis Player Rankings Based on Player Statistics

## Introduction

Tennis is a globally popular sport in which players continually rise and fall in the rankings. Rankings are determined primarily by a player's performance in tournaments, but other characteristics of top-ranked players may also help predict where they stand. Using the "Player Stats for Top 500 Players" dataset, our goal is to predict a tennis player's rank from various player statistics.

### Predictive Question:
Can we predict a tennis player's current rank based on their age, playing style, prize money earned, favorite surface, number of seasons played, current Elo rank, number of titles won, and the number of weeks they've been at No. 1?

### Dataset Description:
The dataset "Player Stats for Top 500 Players" contains a comprehensive list of variables for the top 500 tennis players. These variables range from personal details and professional statistics to social media presence.

**Reading the Dataset from Web into R**:
We will demonstrate reading the dataset directly from its web source into R, using a function such as `read.csv()` or `read_csv()` from the `readr` package.
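
As a rough sketch of this step, the download and the read can be wrapped in one helper so that the file is fetched only once. The function name `fetch_player_stats` is hypothetical and not part of any package; the default path mirrors the one used later in this notebook.

```r
library(readr)

# Hypothetical helper: download the CSV only if no local copy exists,
# then always read the local copy.
fetch_player_stats <- function(url, dest = "data/raw_player_stat.csv") {
  if (!file.exists(dest)) {
    # Create the target folder if needed before downloading
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  read_csv(dest, show_col_types = FALSE)
}
```

It could then be called as `raw_result <- fetch_player_stats(url)`, with `url` set to the dataset's download link. Skipping the download when the file already exists serves the same purpose as commenting out `download.file()` by hand.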

**Data Wrangling and Cleaning**:
Ensure that the dataset is in a tidy format. Remove any rows with missing data for the variables/columns we plan to use. Convert data types if necessary.
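
A minimal sketch of what this cleaning could look like, using three rows copied from the raw data preview. It assumes that the leading number in `Age` and `Current Rank` is the value of interest (the parenthesised part of `Current Rank` is presumably ranking points) and that `Prize Money` contains whole-dollar amounts. The new column names are illustrative only.

```r
library(dplyr)
library(stringr)
library(tibble)

# Three rows copied from the raw data preview; no new data is introduced.
raw_sample <- tibble(
  Name = c("Jack Draper", "Lukas Lacko", "Hubert Hurkacz"),
  Age = c("18 (22-12-2001)", "32 (03-11-1987)", "22 (11-02-1997)"),
  `Current Rank` = c("326 (119)", "178 (280)", "31 (1398)"),
  `Prize Money` = c("$59,040", "US$3,261,567", "$1,517,157"),
  Seasons = c(NA, 14, 5)
)

clean_sample <- raw_sample |>
  mutate(
    # Keep only the leading integer (assumed to be the age / the rank)
    age = as.integer(str_extract(Age, "^\\d+")),
    current_rank = as.integer(str_extract(`Current Rank`, "^\\d+")),
    # Strip currency symbols and thousands separators (assumes no decimals)
    prize_money = as.numeric(str_remove_all(`Prize Money`, "[^0-9]"))
  ) |>
  select(name = Name, age, current_rank, prize_money, seasons = Seasons) |>
  # Drop rows that are missing a variable we plan to use
  filter(!is.na(seasons))

clean_sample
```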

**Summary Table**:
A table showing:
- Number of observations for each favorite surface (e.g., Grass, Clay, Hard)
- Mean prize money earned by players
- Average age of players
- Average number of seasons played by players
- Mean of current Elo rank
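
The summaries listed above might be computed along these lines. This is only a sketch on four already-cleaned rows taken from the data preview; the Elo rank is left out because that column does not appear in the preview, and stripping the trailing percentage from `Favorite Surface` is an assumption about how surfaces should be grouped.

```r
library(dplyr)
library(tibble)

# Four preview rows that have a favorite surface, after numeric cleaning.
players <- tibble(
  age = c(32, 27, 22, 25),
  prize_money = c(3261567, 6091971, 1517157, 122734),
  seasons = c(14, 11, 5, 5),
  favorite_surface = c("Fast (H, G) 40%", "Fast (H, G) 36%",
                       "Fast (H, G) 29%", "Hard 100%")
)

# Number of observations per favorite surface (percentage suffix removed)
surface_counts <- players |>
  mutate(surface = sub(" \\d+%$", "", favorite_surface)) |>
  count(surface, name = "n_players")

# Overall means of the numeric variables
overall_means <- players |>
  summarise(
    mean_prize_money = mean(prize_money, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE),
    mean_seasons = mean(seasons, na.rm = TRUE)
  )

surface_counts
overall_means
```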

**Visualization**:
A scatter plot comparing the "Prize Money" against the "Current Rank", to visualize if there's any trend between earnings and rank.
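
One possible way to draw this plot with `ggplot2` is sketched below, using the nine preview rows that have a prize money value. The logarithmic x-axis is an assumption made here because earnings span several orders of magnitude; it is not a choice stated in the proposal.

```r
library(ggplot2)

# Nine preview rows with prize money available, after numeric cleaning.
players <- data.frame(
  prize_money = c(59040, 3261567, 374093, 6091971, 1517157,
                  278709, 59123, 122734, 74927),
  current_rank = c(326, 178, 236, 183, 31, 307, 232, 417, 104)
)

rank_plot <- ggplot(players, aes(x = prize_money, y = current_rank)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +  # assumption: log scale to spread out skewed earnings
  labs(
    x = "Prize money (US$, log scale)",
    y = "Current rank",
    title = "Prize money vs. current rank"
  )

rank_plot
```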

## Methods

We'll use a regression model, probably a linear regression if our exploratory data analysis suggests a linear relationship between the features and the target variable (Current Rank).

**Variables/Columns to be used**:
1. Age
2. Prize Money
3. Favorite Surface
4. Seasons
5. Current Elo Rank
6. Titles
7. Weeks at No. 1
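
As a hedged illustration of the modelling step, a linear regression can be fitted with base R's `lm()`. The sketch below uses only the eight preview rows that have no missing values for age, prize money, and seasons, so it covers a subset of the variables listed above; the `log10()` transform of prize money is an assumption, not a decision made in this proposal.

```r
# Toy data: the eight preview rows complete in these four columns.
players <- data.frame(
  current_rank = c(178, 236, 183, 31, 307, 232, 417, 104),
  age          = c(32, 21, 27, 22, 28, 21, 25, 20),
  prize_money  = c(3261567, 374093, 6091971, 1517157,
                   278709, 59123, 122734, 74927),
  seasons      = c(14, 2, 11, 5, 1, 1, 5, 3)
)

# Ordinary least squares with current rank as the response.
# log10(prize_money) is one possible transform for skewed earnings (assumption).
rank_fit <- lm(current_rank ~ age + log10(prize_money) + seasons,
               data = players)

summary(rank_fit)
```

With so few rows the estimates are not meaningful; the full analysis would fit the model on the cleaned dataset and add the remaining predictors.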

**Visualizing the Results**:
We will plot the predicted ranks against the actual ranks to visualize the accuracy of our model. Additionally, we will draw a residual plot to examine the differences between actual and predicted values.
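
Both diagnostic plots could be produced roughly as follows. The single-predictor model and the nine preview rows are placeholders chosen to keep the sketch short; the object names are illustrative.

```r
library(ggplot2)

# Nine preview rows with prize money available, after numeric cleaning.
players <- data.frame(
  prize_money = c(59040, 3261567, 374093, 6091971, 1517157,
                  278709, 59123, 122734, 74927),
  current_rank = c(326, 178, 236, 183, 31, 307, 232, 417, 104)
)

# Placeholder model: rank explained by log prize money only
fit <- lm(current_rank ~ log10(prize_money), data = players)

results <- data.frame(
  actual = players$current_rank,
  predicted = unname(fitted(fit)),
  residual = unname(resid(fit))   # actual minus predicted
)

# Predicted vs. actual: points on the dashed line are perfect predictions
pred_plot <- ggplot(results, aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Actual rank", y = "Predicted rank")

# Residual plot: residuals should scatter around zero without a pattern
resid_plot <- ggplot(results, aes(x = predicted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted rank", y = "Residual (actual - predicted)")

pred_plot
resid_plot
```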

## Expected Outcomes and Significance

**Expected Findings**:
We expect to find a relationship between the player's statistics (like prize money, number of titles, etc.) and their rankings. Players with higher earnings, more titles, and those who have spent more weeks at No. 1 might have better rankings.

**Impact**:
Understanding these relationships can provide insights into what factors contribute most to a player's ranking, which can be useful for players, coaches, and analysts.

**Future Questions**:
- How does the player's favorite surface impact their performance in grand slams specific to those surfaces?
- Does the age of turning pro correlate with a player's success in their career?
- How has the influence of social media presence (like Facebook, Twitter followers) affected a player's popularity or endorsements, if at all?

In [10]:
# Import libraries
library(tidyverse)
library(repr)
library(readxl)

# Load the optional helper scripts only if they are present,
# so the notebook still runs when the files are missing
for (helper in c("tests.R", "cleanup.R")) {
  if (file.exists(helper)) source(helper)
}
options(repr.matrix.max.rows = 6)


In [7]:
# Commented out the following code to prevent duplicate downloads
# url <- "https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn"
# url <- "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"
# download.file(url, "data/raw_player_stat.csv")

In [12]:
#read and observe raw data
raw_result <- read_csv("data/raw_player_stat.csv")
head(raw_result)

[1m[22mNew names:
[36m•[39m `` -> `...1`
[1mRows: [22m[34m500[39m [1mColumns: [22m[34m38[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (25): Age, Country, Plays, Wikipedia, Current Rank, Best Rank, Name, Bac...
[32mdbl[39m (13): ...1, Turned Pro, Seasons, Titles, Best Season, Retired, Masters, ...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


| ...1 | Age | Country | Plays | Wikipedia | Current Rank | Best Rank | Name | Backhand | Prize Money | ⋯ | Facebook | Twitter | Nicknames | Grand Slams | Davis Cups | Web Site | Team Cups | Olympics | Weeks at No. 1 | Tour Finals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 26 (25-04-1993) | Brazil | Right-handed | Wikipedia | 378 (97) | 363 (04-11-2019) | Oscar Jose Gutierrez | | | ⋯ | | | | | | | | | | |
| 1 | 18 (22-12-2001) | United Kingdom | Left-handed | Wikipedia | 326 (119) | 316 (14-10-2019) | Jack Draper | Two-handed | $59,040 | ⋯ | | | | | | | | | | |
| 2 | 32 (03-11-1987) | Slovakia | Right-handed | Wikipedia | 178 (280) | 44 (14-01-2013) | Lukas Lacko | Two-handed | US$3,261,567 | ⋯ | | | | | | | | | | |
| 3 | 21 (29-05-1998) | Korea, Republic of | Right-handed | Wikipedia | 236 (199) | 130 (10-04-2017) | Duck Hee Lee | Two-handed | $374,093 | ⋯ | | | | | | | | | | |
| 4 | 27 (21-10-1992) | Australia | Right-handed | Wikipedia | 183 (273) | 17 (11-01-2016) | Bernard Tomic | Two-handed | US$6,091,971 | ⋯ | | | | | | | | | | |
| 5 | 22 (11-02-1997) | Poland | Right-handed | Wikipedia | 31 (1398) | 31 (20-01-2020) | Hubert Hurkacz | Two-handed | $1,517,157 | ⋯ | | | | | | | | | | |


In [15]:
# Keep a subset of columns for the analysis
selected_data <- raw_result |>
  select(
    Name, Age, Plays, `Current Rank`, Backhand, `Prize Money`,
    Height, `Favorite Surface`, Weight, `Turned Pro`, Seasons
  )
selected_data

| Name | Age | Plays | Current Rank | Backhand | Prize Money | Height | Favorite Surface | Weight | Turned Pro | Seasons |
|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| Oscar Jose Gutierrez | 26 (25-04-1993) | Right-handed | 378 (97) | | | | | | | |
| Jack Draper | 18 (22-12-2001) | Left-handed | 326 (119) | Two-handed | $59,040 | | | | | |
| Lukas Lacko | 32 (03-11-1987) | Right-handed | 178 (280) | Two-handed | US$3,261,567 | 185 cm | Fast (H, G) 40% | | 2005 | 14 |
| Duck Hee Lee | 21 (29-05-1998) | Right-handed | 236 (199) | Two-handed | $374,093 | | | | | 2 |
| Bernard Tomic | 27 (21-10-1992) | Right-handed | 183 (273) | Two-handed | US$6,091,971 | 193 cm | Fast (H, G) 36% | | 2008 | 11 |
| Hubert Hurkacz | 22 (11-02-1997) | Right-handed | 31 (1398) | Two-handed | $1,517,157 | | Fast (H, G) 29% | | 2015 | 5 |
| Sekou Bangoura | 28 (18-11-1991) | Right-handed | 307 (131) | Two-handed | $278,709 | | | | 2010 | 1 |
| Tung Lin Wu | 21 (12-05-1998) | Right-handed | 232 (205) | Two-handed | $59,123 | | | | | 1 |
| Sanjar Fayziev | 25 (29-07-1994) | Right-handed | 417 (81) | Two-handed | $122,734 | | Hard 100% | | | 5 |
| Emil Ruusuvuori | 20 (02-04-1999) | Right-handed | 104 (534) | Two-handed | US$74,927 | | | | | 3 |
