# Relationship between a player's rank and height #

## Introduction ##
With approximately 87 million players around the world, tennis is one of the most popular sports. It has a total of 3873 professional players (ITF Global Tennis Report 2019), so it is no surprise that it is also one of the most-watched sports on TV.

“The Association of Tennis Professionals (ATP) is the governing body of the men’s professional tennis circuits – the ATP Tour, the ATP Challenger Tour, and the ATP Champions Tour.” (Wikipedia) ATP rankings are a tool for quantifying players’ qualifications.

The dataset we obtained contains statistics on ATP tournaments. It includes the tournament location, surface, draw size, tournament level, date, and match number, among other fields.

We would like to determine whether there is a clear relationship between a winner’s rank points and their height. We approach this as a classification problem, trying to predict a winner’s height from their rank points.
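
As a minimal sketch of what predicting height from rank points could look like, the snippet below uses a one-nearest-neighbour rule: it returns the height of the known winner whose rank points are closest to the query. The helper `predict_height` is hypothetical, the nearest-neighbour rule is one possible reading of the approach rather than the method used in this report, and the three rows are copied from the cleaned table shown later in this notebook.

```python
import pandas as pd

# Three (rank points, height) pairs taken from the cleaned table in this
# notebook; far too few rows for a real model, used only for illustration.
known = pd.DataFrame({
    "winner_rank_points": [3590.0, 200.0, 1050.0],
    "winner_ht": [178.0, 188.0, 188.0],
})

def predict_height(rank_points, data=known):
    """Hypothetical helper: height of the winner with the closest rank points."""
    distances = (data["winner_rank_points"] - rank_points).abs()
    return data.loc[distances.idxmin(), "winner_ht"]

print(predict_height(3000.0))
```

A real analysis would fit and evaluate such a rule on the full cleaned table, with separate training and test rows.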


## Methods and Results ## 
We first download the original data from the source (ATP) and name the file “Tennis_Data.csv” for easy identification. We then upload it to our Jupyter folder and create a new notebook in which to wrangle and clean the data. After importing pandas as pd, we read the CSV file into the notebook with the “read_csv” function.

We labeled this dataset _"tennis_data"_, which gives us a large and messy table. Looking over this table, we can see rows with blank areas or information not needed for our analysis. We removed these rows using the “skiprows” argument, selecting rows ... and ... to remove. While this cleans up the table considerably, it is still quite untidy. The columns have the same problem as the rows: some of them contain information, or lack information, in ways we do not require. We solve this with “loc”, “columns”, and “isin”. By naming the columns to remove together with these three tools, we remove the unnecessary columns from “tennis_data” and save the result as “tennis_data_2”. Our last step is making sure the table is labeled correctly. There was one unnamed column; to fix this, we use “rename” with “inplace” set to “True”, changing the column name to its correct term: “Serial number”. This completes the tidying of our data from the original downloaded format into a clearer, more concise format for our analysis. This includes: ...,...,.....
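
The cleaning steps described above can be sketched on a tiny in-memory table. The sample text, the two junk lines skipped with `skiprows`, and the choice of `surface` as the column to drop are assumptions made purely for illustration; the rows and columns actually removed are not specified in this report.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: two junk lines, then a
# header whose first column has no name, then three data rows.
raw = (
    "ATP export (illustrative junk line)\n"
    "second junk line\n"
    ",tourney_name,surface,winner_rank_points,winner_ht\n"
    "0,Brisbane,Hard,3590.0,178.0\n"
    "1,Brisbane,Hard,1977.0,\n"
    "2,Brisbane,Hard,3590.0,178.0\n"
)

# skiprows takes 0-based line numbers of the file to ignore.
tennis_data = pd.read_csv(io.StringIO(raw), skiprows=[0, 1])

# Keep every column whose name is NOT in the unwanted list.
unwanted = ["surface"]  # assumed example of an unneeded column
tennis_data_2 = tennis_data.loc[:, ~tennis_data.columns.isin(unwanted)].copy()

# pandas names a header-less column "Unnamed: 0"; give it a proper label.
tennis_data_2.rename(columns={"Unnamed: 0": "Serial number"}, inplace=True)

print(tennis_data_2.columns.tolist())
```

Calling `.copy()` before the in-place rename keeps pandas from treating `tennis_data_2` as a slice of `tennis_data`.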


In [2]:
import pandas as pd
tennis_data = pd.read_csv("Tennis_Dataset 1.csv")
tennis_data.head()

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2.0,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4.0,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2.0,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4.0,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0


In [3]:
tennis_data = tennis_data[tennis_data["tourney_name"] == "Brisbane"]
tennis_data.head()

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2.0,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4.0,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2.0,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4.0,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0


In [4]:
tennis_data_2 = tennis_data.loc[:, ["tourney_name","winner_rank_points", "winner_ht"]]
tennis_data_2.head()

Unnamed: 0,tourney_name,winner_rank_points,winner_ht
0,Brisbane,3590.0,178.0
1,Brisbane,1977.0,
2,Brisbane,3590.0,178.0
3,Brisbane,200.0,188.0
4,Brisbane,1977.0,


In [6]:
tennis_data_3=tennis_data_2.dropna(axis=0)
tennis_data_3.head()

Unnamed: 0,tourney_name,winner_rank_points,winner_ht
0,Brisbane,3590.0,178.0
2,Brisbane,3590.0,178.0
3,Brisbane,200.0,188.0
5,Brisbane,1050.0,188.0
6,Brisbane,3590.0,178.0


In [7]:
# Take an explicit copy first so the in-place rename does not trigger pandas' SettingWithCopyWarning
tennis_data_3 = tennis_data_3.copy()
tennis_data_3.rename(columns = {'tourney_name': 'Location', 'winner_rank_points':'Winner Rank Points', 'winner_ht':'Winner Height'}, inplace = True)
tennis_data_3.head()


Unnamed: 0,Location,Winner Rank Points,Winner Height
0,Brisbane,3590.0,178.0
2,Brisbane,3590.0,178.0
3,Brisbane,200.0,188.0
5,Brisbane,1050.0,188.0
6,Brisbane,3590.0,178.0


In [8]:
import altair as alt

x_col = "Winner Rank Points"
y_col = "Winner Height"

# Scatter plot of rank points against height; zero=False stops the axes
# from being forced to start at 0, so the points fill the plot area.
plot = alt.Chart(tennis_data_3).mark_point().encode(
    x = alt.X(x_col, scale=alt.Scale(zero=False)),
    y = alt.Y(y_col, scale=alt.Scale(zero=False))
).properties(
    title = "Winners' rank points and their corresponding heights"
)
plot
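
As a possible numeric complement to the scatter plot, the strength of a linear relationship can be summarized with a Pearson correlation coefficient. This is a sketch rather than part of the original analysis: it uses only the five rows displayed by `tennis_data_3.head()` above, several of which belong to the same winner, so the resulting value says little about the full dataset.

```python
import pandas as pd

# The five rows shown in the head() output above; a real analysis would
# use the full cleaned table instead of this small sample.
sample = pd.DataFrame({
    "Winner Rank Points": [3590.0, 3590.0, 200.0, 1050.0, 3590.0],
    "Winner Height": [178.0, 178.0, 188.0, 188.0, 178.0],
})

# Pearson correlation: values near 0 suggest no linear relationship,
# values near -1 or +1 suggest a strong one.
r = sample["Winner Rank Points"].corr(sample["Winner Height"])
print(f"Pearson r on the sample rows: {r:.2f}")
```

On the full table, the same call would be `tennis_data_3["Winner Rank Points"].corr(tennis_data_3["Winner Height"])`.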


