# Relationship between a player's rank and height #

## Introduction ##
With approximately 87 million players worldwide, tennis is one of the most popular sports in the world. With a total of 3873 professional players (ITF Global Tennis Report 2019), it is no surprise that it is also one of the most-watched sports on TV.

“The Association of Tennis Professionals (ATP) is the governing body of the men’s professional tennis circuits – the ATP Tour, the ATP Challenger Tour, and the ATP Champions Tour” (Wikipedia). ATP rankings are a tool used to quantify the qualifications of the players.

The dataset we have obtained contains statistics on ATP tournaments.
It includes the tournament location, the surface, draw size, tournament level, date, match number, etc.

We would like to address whether there is a clear relationship between a winner’s rank points and their height. Because height is a numeric variable, this can be approached as a regression problem: we try to predict the height of a winner from their rank points.


## Methods and Results ## 
We will first load our original data from the source (ATP) by downloading it and naming the file “Tennis_Data.csv” for easy identification. We then upload it to our Jupyter folder and create a new notebook in which to wrangle and clean the data. After importing pandas as pd, we can read the CSV file into the notebook with the “read_csv” function.

We labeled this dataset _"tennis_data"_, which gives us a complicated and messy table. Looking over this table, we can see rows that are blank or that hold information not needed for our analysis. We removed these rows by using the “skiprows” argument and selecting rows ... and ... to remove. While this greatly cleans up the table, it is still quite unkempt. The columns have the same problem as the rows: some of them contain information, or lack it entirely, in ways we do not require. We solve this by combining “loc”, “columns”, and “isin”. By naming the columns to remove together with these three, we remove any unnecessary columns from “tennis_data” and save the result as “tennis_data_2”. Our last step is making sure the table is labeled correctly. One column was unnamed, so we use “rename” with “inplace” set to “True” to give it its correct name: “Serial number”. This completes the tidying of our data from the original downloaded format into a clearer, more concise format for our analysis. This includes: ...,...,..... 
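
The cleaning steps described above can be sketched on a tiny stand-in file. This is only an illustration under stated assumptions: the CSV contents, the junk first line, and the unwanted "score" column are hypothetical, and the row numbers to skip in the real file are not specified here.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: one junk line, an unnamed
# first column, and one column ("score") that the analysis does not need.
raw_csv = io.StringIO(
    "exported by a hypothetical tool\n"
    ",tourney_name,winner_ht,winner_rank_points,score\n"
    "0,Brisbane,178,3590,6-4 3-6 6-2\n"
    "1,Brisbane,188,200,7-6 6-4\n"
)

# skiprows drops the listed (0-based) line numbers while reading.
tennis_data = pd.read_csv(raw_csv, skiprows=[0])

# loc + columns + isin keeps every column whose name is NOT in the list.
unwanted = ["score"]
tennis_data_2 = tennis_data.loc[:, ~tennis_data.columns.isin(unwanted)]

# pandas reads a blank header as "Unnamed: 0"; rename returns a new DataFrame.
tennis_data_2 = tennis_data_2.rename(columns={"Unnamed: 0": "Serial number"})
print(tennis_data_2.columns.tolist())
```

Assigning the result of “rename” back to the variable, rather than passing “inplace”, avoids modifying a DataFrame that was derived from a slice.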


First, we load the data as follows.

In [1]:
import pandas as pd
tennis_data = pd.read_csv("Tennis_Dataset 1.csv")
tennis_data.head()

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2.0,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4.0,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2.0,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4.0,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0


To get a quick overview of the data, let's use the "info" method.

In [7]:
tennis_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 4921
Data columns (total 50 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          62 non-null     int64  
 1   tourney_id          62 non-null     object 
 2   tourney_name        62 non-null     object 
 3   surface             62 non-null     object 
 4   draw_size           62 non-null     int64  
 5   tourney_level       62 non-null     object 
 6   tourney_date        62 non-null     int64  
 7   match_num           62 non-null     int64  
 8   winner_id           62 non-null     int64  
 9   winner_seed         30 non-null     object 
 10  winner_entry        14 non-null     object 
 11  winner_name         62 non-null     object 
 12  winner_hand         62 non-null     object 
 13  winner_ht           43 non-null     float64
 14  winner_ioc          62 non-null     object 
 15  winner_age          62 non-null     float64
 16  loser_id

Since we only want the data for "Brisbane", together with the winners' rank points and heights, let's filter the dataset accordingly.

In [16]:
tennis_data = tennis_data[tennis_data["tourney_name"] == "Brisbane"]
tennis_data.head()

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2.0,...,54.0,34.0,20.0,14.0,10.0,15.0,9.0,3590.0,16.0,1977.0
1,1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4.0,...,52.0,36.0,7.0,10.0,10.0,13.0,16.0,1977.0,239.0,200.0
2,2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2.0,...,27.0,15.0,6.0,8.0,1.0,5.0,9.0,3590.0,40.0,1050.0
3,3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,...,60.0,38.0,9.0,11.0,4.0,6.0,239.0,200.0,31.0,1298.0
4,4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4.0,...,56.0,46.0,19.0,15.0,2.0,4.0,16.0,1977.0,18.0,1855.0


In [17]:
tennis_data_2 = tennis_data.loc[:, ["tourney_name","winner_rank_points", "winner_ht"]]
tennis_data_2.head()

Unnamed: 0,tourney_name,winner_rank_points,winner_ht
0,Brisbane,3590.0,178.0
1,Brisbane,1977.0,
2,Brisbane,3590.0,178.0
3,Brisbane,200.0,188.0
4,Brisbane,1977.0,


Following this, we will remove any rows with missing (NaN) values.

In [18]:
tennis_data_3=tennis_data_2.dropna(axis=0)
tennis_data_3.head()

Unnamed: 0,tourney_name,winner_rank_points,winner_ht
0,Brisbane,3590.0,178.0
2,Brisbane,3590.0,178.0
3,Brisbane,200.0,188.0
5,Brisbane,1050.0,188.0
6,Brisbane,3590.0,178.0
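
Before dropping rows it can help to see how many values are missing in each column. The sketch below uses a small illustrative table, not the real data, and restricts the drop to the two variables of interest through the "subset" argument.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the filtered table; the values are illustrative only.
toy = pd.DataFrame({
    "tourney_name": ["Brisbane"] * 4,
    "winner_rank_points": [3590.0, 1977.0, 200.0, np.nan],
    "winner_ht": [178.0, np.nan, 188.0, 185.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop rows missing either variable; .copy() makes the result an
# independent DataFrame, so later edits do not touch a slice of `toy`.
toy_clean = toy.dropna(subset=["winner_rank_points", "winner_ht"]).copy()

print(missing_per_column)
print(len(toy), "->", len(toy_clean))
```

Note that the surviving rows keep their original index labels, which is why the index in the output above skips numbers.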


Now, let's rename the columns to make them shorter and easier to reuse.

In [19]:
# assign the renamed copy back instead of using inplace=True, which
# triggers SettingWithCopyWarning on a DataFrame derived from a slice
tennis_data_3 = tennis_data_3.rename(columns = {'tourney_name': 'Location', 'winner_rank_points':'WRP', 'winner_ht':'WH'})
tennis_data_3.head()


Unnamed: 0,Location,WRP,WH
0,Brisbane,3590.0,178.0
2,Brisbane,3590.0,178.0
3,Brisbane,200.0,188.0
5,Brisbane,1050.0,188.0
6,Brisbane,3590.0,178.0


Finally, let's visualize the data as a scatter plot.

In [21]:
import altair as alt

# scatter plot of winner height against winner rank points
plot = alt.Chart(tennis_data_3).mark_point().encode(
    x = alt.X("WRP",
              title = "Winner rank points",
              scale=alt.Scale(zero=False)),
    y = alt.Y("WH", 
              title = "Winner height (cm)",
              scale=alt.Scale(zero=False))
).properties(
    title = "Winners' rank points and their corresponding heights"
)
plot




In [31]:
# import the KNN regression model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

tennis_train, tennis_test = train_test_split(
    tennis_data_3, train_size=0.75
)

# preprocess the data, make the pipeline
tennis_preprocessor = make_column_transformer((StandardScaler(), ["WRP"]))
tennis_pipeline = make_pipeline(tennis_preprocessor, KNeighborsRegressor())

# create the 5-fold GridSearchCV object
# (the training split has 32 rows, so each CV training fold has about 25;
# n_neighbors must not exceed that)
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 25, 3),
}
tennis_gridsearch = GridSearchCV(
    estimator=tennis_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

In [32]:
# fit the GridSearchCV object 
tennis_fit = tennis_gridsearch.fit(
                  tennis_train[["WRP"]],
                  tennis_train[["WH"]]
              )
# retrieve the CV scores
tennis_results = pd.DataFrame(tennis_fit.cv_results_)[
    ["param_kneighborsregressor__n_neighbors", "mean_test_score", "std_test_score"]
]
# standard error of the mean score across the 5 folds
tennis_results = tennis_results.assign(    
    sem_test_score = tennis_results["std_test_score"] / 5**(1/2)
).rename(
    columns = {"param_kneighborsregressor__n_neighbors" : "n_neighbors"}
).drop(
    columns = ["std_test_score"]
)
tennis_results
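
For reference, a grid search of this kind can be sketched end to end on synthetic data. The numbers below are randomly generated stand-ins, not the tennis data, and the names "toy_train", "search" and "best_k" are introduced only for this illustration. The sketch shows how the cross-validation scores relate to the chosen number of neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training split (NOT the tennis data).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"WRP": rng.uniform(100, 4000, size=40)})
toy_train["WH"] = 185 + rng.normal(0, 5, size=40)

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# With 40 rows and 5 folds, each training fold has 32 rows,
# so every candidate k must be at most 32.
param_grid = {"kneighborsregressor__n_neighbors": range(1, 31, 3)}
search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(toy_train[["WRP"]], toy_train["WH"])

results = pd.DataFrame(search.cv_results_)
# Scores are negated RMSE, so flip the sign to read them as errors.
results["rmse"] = -results["mean_test_score"]
best_k = search.best_params_["kneighborsregressor__n_neighbors"]
print(best_k, results["rmse"].min())
```

The value of k with the smallest cross-validated error is the one that would then be refit on the full training split and evaluated once on the test split.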


Let's finally get to the interesting part of the project: Linear Regression!

In [26]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

np.random.seed(1)

tennis_data_3_train, tennis_data_3_test = train_test_split(
    tennis_data_3, train_size=0.75
)

# fit the linear regression model
lm = LinearRegression()
lm.fit(
   tennis_data_3_train[["WRP"]],
   tennis_data_3_train[["WH"]]  # predictors and response must both come from the training split
)

# make a dataframe containing slope and intercept coefficients
pd.DataFrame({"slope": lm.coef_[0], "intercept": lm.intercept_})
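
Once a linear model has been fit on the training split, one common way to judge it is the root mean squared prediction error on the test split. The sketch below is an assumption-laden illustration: it uses synthetic data with a weak built-in trend, not the tennis data, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the tennis data), with a weak linear trend.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"WRP": rng.uniform(100, 4000, size=60)})
toy["WH"] = 183 + 0.001 * toy["WRP"] + rng.normal(0, 4, size=60)

toy_train, toy_test = train_test_split(toy, train_size=0.75, random_state=1)

# Predictors and response must come from the same split.
lm = LinearRegression().fit(toy_train[["WRP"]], toy_train["WH"])

# Root mean squared prediction error on rows the model never saw.
predictions = lm.predict(toy_test[["WRP"]])
rmspe = np.sqrt(mean_squared_error(toy_test["WH"], predictions))
print(lm.coef_[0], lm.intercept_, rmspe)
```

The error is in the same units as the response, so here it can be read as a typical prediction miss in centimetres.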