# The Scientific Future of Tennis
*By Víctor García - A01232580*

## Introduction
This work estimates the effect of several player variables on the performance of tennis players, using `CHAMP` as the dependent variable in an OLS regression.

- **Load the modules**

In [28]:
import pandas as pd # Data manipulation
import numpy as np # Numerical operations
import statsmodels.api as sm # Statistical modeling
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf 
import scipy.stats # Statistical analysis
from statsmodels.stats.outliers_influence import variance_inflation_factor # VIF
from matplotlib import pyplot as plt # Plotting
from statsmodels.compat import lzip # List zip
import datapro # Custom module for model data processing
#       bp_test(res) - Returns a data frame with the Breusch-Pagan test
#       feasible_gls(data,res) - Feasible Generalized Least Squares
#       plot_fit(res,x,y,reg_line=True) - Plot of an OLS regression
#       robust_se(res) - Returns a data frame with the coefficients

- **Import the dataset**

In [12]:
data = pd.read_excel("SP_TheScientificFutureofTennis_alumn_VF.xlsx")
data.head()

Unnamed: 0,NAME,GRAND SLAMS,MASTERS 1000,ATP500,ATP250,OLYMPIC GOLD,OLYMPIC SILVER,CHAMP,AGE,YEARS,HEIGHT,WEIGHT,WHR,EARN,DEX,FSER,SPSER,UERR,SPONS
0,DJOKOVIC,23,40,15,12,0,0,64.1,16,21,188,77,0.409574,180937203,1,65,116,7.9,235000000
1,ALCARAZ,2,4,4,2,0,0,7.6,16,4,183,74,0.404372,27026147,1,65,121,8.5,15000000
2,MEDVEDEV,1,6,4,8,0,0,9.9,18,10,198,83,0.419192,38148405,1,61,122,8.4,18000000
3,SINNER,0,1,3,6,0,0,4.05,17,6,188,76,0.404255,17043434,1,59,118,9.7,12000000
4,RUBLEV,0,1,5,8,0,0,5.65,17,10,188,75,0.398936,21659040,1,60,125,10.9,13000000


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   NAME            52 non-null     object 
 1   GRAND SLAMS     52 non-null     int64  
 2   MASTERS 1000    52 non-null     int64  
 3   ATP500          52 non-null     int64  
 4   ATP250          52 non-null     int64  
 5   OLYMPIC GOLD    52 non-null     int64  
 6   OLYMPIC SILVER  52 non-null     int64  
 7   CHAMP           52 non-null     float64
 8   AGE             52 non-null     int64  
 9   YEARS           52 non-null     int64  
 10  HEIGHT          52 non-null     int64  
 11  WEIGHT          52 non-null     int64  
 12  WHR             52 non-null     float64
 13  EARN            52 non-null     int64  
 14  DEX             52 non-null     int64  
 15  FSER            52 non-null     int64  
 16  SPSER           52 non-null     int64  
 17  UERR            52 non-null     float

The player name and the individual title counts are not needed for the model, so they are dropped, as the case specifies.

In [8]:
data1=data.drop(columns=[
    "NAME", "GRAND SLAMS", "MASTERS 1000", "ATP500", "ATP250",
    "OLYMPIC GOLD", "OLYMPIC SILVER"
])
data1.head()

Unnamed: 0,CHAMP,AGE,YEARS,HEIGHT,WEIGHT,WHR,EARN,DEX,FSER,SPSER,UERR,SPONS
0,64.1,16,21,188,77,0.409574,180937203,1,65,116,7.9,235000000
1,7.6,16,4,183,74,0.404372,27026147,1,65,121,8.5,15000000
2,9.9,18,10,198,83,0.419192,38148405,1,61,122,8.4,18000000
3,4.05,17,6,188,76,0.404255,17043434,1,59,118,9.7,12000000
4,5.65,17,10,188,75,0.398936,21659040,1,60,125,10.9,13000000


Convert `EARN` and `SPONS` from USD to millions of USD (MUSD).

In [14]:
data1["EARN"] = data1["EARN"]/1000000
data1["SPONS"] = data1["SPONS"]/1000000

data1.head()

Unnamed: 0,CHAMP,AGE,YEARS,HEIGHT,WEIGHT,WHR,EARN,DEX,FSER,SPSER,UERR,SPONS
0,64.1,16,21,188,77,0.409574,0.180937,1,65,116,7.9,235.0
1,7.6,16,4,183,74,0.404372,0.027026,1,65,121,8.5,15.0
2,9.9,18,10,198,83,0.419192,0.038148,1,61,122,8.4,18.0
3,4.05,17,6,188,76,0.404255,0.017043,1,59,118,9.7,12.0
4,5.65,17,10,188,75,0.398936,0.021659,1,60,125,10.9,13.0
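
Dividing a column in place is not idempotent: running the cell a second time rescales the values again. For reference, 180937203 USD is 180.937 MUSD, whereas the `EARN` column displayed above shows 0.180937, which may indicate an additional rescaling. The following is a minimal sketch that derives the scaled columns from an untouched raw frame, so re-running it always gives the same result. The helper `to_musd` and the two-row frame (copied from the first rows shown earlier) are illustrative and not part of the original notebook.

```python
import pandas as pd

USD_PER_MUSD = 1_000_000  # one MUSD is one million USD

def to_musd(df, cols=("EARN", "SPONS")):
    """Return a new frame with the given USD columns expressed in MUSD.

    Hypothetical helper: the input frame is never modified, so calling
    it repeatedly on the raw data cannot rescale the values twice.
    """
    return df.assign(**{c: df[c] / USD_PER_MUSD for c in cols})

# First two rows of the raw data, as displayed by data.head()
raw = pd.DataFrame({
    "EARN": [180937203, 27026147],
    "SPONS": [235000000, 15000000],
})

scaled = to_musd(raw)
scaled_again = to_musd(raw)  # same input, same output
print(scaled)
```

Keeping the raw frame unchanged also makes it easy to confirm the units of every column before fitting the model.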


In [16]:
y = pd.DataFrame(data1["CHAMP"]) # Dependent variable
x = data1.drop(columns=["CHAMP"]) # Independent variables
x = sm.add_constant(x, prepend = True) # Add constant term
x.head()

Unnamed: 0,const,AGE,YEARS,HEIGHT,WEIGHT,WHR,EARN,DEX,FSER,SPSER,UERR,SPONS
0,1.0,16,21,188,77,0.409574,0.180937,1,65,116,7.9,235.0
1,1.0,16,4,183,74,0.404372,0.027026,1,65,121,8.5,15.0
2,1.0,18,10,198,83,0.419192,0.038148,1,61,122,8.4,18.0
3,1.0,17,6,188,76,0.404255,0.017043,1,59,118,9.7,12.0
4,1.0,17,10,188,75,0.398936,0.021659,1,60,125,10.9,13.0


In [23]:
mod1 = sm.OLS(y,x) # OLS model (Ordinary Least Squares)
res1 = mod1.fit()
print(res1.summary())

                            OLS Regression Results                            
Dep. Variable:                  CHAMP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                     649.0
Date:                Tue, 16 Jul 2024   Prob (F-statistic):           1.97e-41
Time:                        18:37:09   Log-Likelihood:                -70.424
No. Observations:                  52   AIC:                             164.8
Df Residuals:                      40   BIC:                             188.3
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -52.3853    106.683     -0.491      0.6
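
For reference, the quantities in the coefficient table can be reproduced by hand. The coefficients solve the least-squares problem, the standard errors are the square roots of the diagonal of the estimated residual variance times the inverse of X'X, and each t statistic is the coefficient divided by its standard error (for the constant above, -52.3853 / 106.683 is about -0.491). The following is a minimal NumPy sketch on a small made-up data set; `ols_fit` is a hypothetical helper, not a statsmodels function.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients, standard errors and t statistics.

    X must already contain the constant column, as in sm.add_constant.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # nonrobust covariance matrix
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Made-up illustration data (not taken from the tennis data set)
X_demo = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
y_demo = np.array([1.1, 2.9, 5.2, 6.8])
beta, se, t_stats = ols_fit(X_demo, y_demo)
print(beta, se, t_stats)

# t statistic of the constant in the summary above
t_const_reported = -52.3853 / 106.683
```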

## **Check the OLS Assumptions**
### Assumption of no multicollinearity among the independent variables

R² = 0.994 > 0.8 and at least one coefficient is not significant, so we suspect multicollinearity and compute the variance inflation factor (VIF) of each variable.

In [24]:
datapro.vif(x, res1)

Unnamed: 0,VIF Factor,Variable
1,1.476176,AGE
2,2.424152,YEARS
3,578.032966,HEIGHT
4,2344.730156,WEIGHT
5,1485.929322,WHR
6,4.955868,EARN
7,1.233934,DEX
8,1.753157,FSER
9,1.494571,SPSER
10,1.474039,UERR
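
`datapro.vif` is a custom helper whose source is not shown. As a rough guide to what a VIF measures, the sketch below regresses each column on the remaining ones and returns 1 / (1 - R²); this is the same quantity that `variance_inflation_factor` from statsmodels (imported earlier) reports when the design matrix includes a constant. The function `vif` and the toy columns are assumptions made for illustration only.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (regressors WITHOUT the constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: column j on a constant and all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Toy columns: a and b are uncorrelated, c is almost exactly a + b
a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0])
c = a + b + np.array([0.1, -0.1, -0.1, 0.1])

vif_orth = vif(np.column_stack([a, b]))     # no shared information
vif_coll = vif(np.column_stack([a, b, c]))  # c duplicates a and b
print(vif_orth, vif_coll)
```

Uncorrelated columns give a VIF of 1, while a column that is nearly a combination of the others gives a very large value, which is the pattern seen for HEIGHT, WEIGHT and WHR.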


Since WHR combines WEIGHT and HEIGHT, the three variables duplicate information. We can eliminate either WHR or both WEIGHT and HEIGHT; here WEIGHT and HEIGHT are dropped.
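
The five rows displayed by `data.head()` are consistent with WHR being WEIGHT divided by HEIGHT. This is an inference from the displayed values rather than a definition stated in the notebook, and it can be checked with a short sketch that uses only those rows:

```python
import numpy as np

# Values copied from the first five rows shown by data.head()
height = np.array([188, 183, 198, 188, 188])
weight = np.array([77, 74, 83, 76, 75])
whr_reported = np.array([0.409574, 0.404372, 0.419192, 0.404255, 0.398936])

whr_computed = weight / height
max_gap = np.max(np.abs(whr_computed - whr_reported))
print(whr_computed.round(6), max_gap)
```

Over the narrow range of heights and weights in the sample, such a ratio is very close to a linear combination of its two inputs, which would explain the VIF values in the hundreds and thousands.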

In [25]:
x1=x.drop(columns=["WEIGHT", "HEIGHT"])
mod2 = sm.OLS(y,x1) # OLS model (Ordinary Least Squares)
res2 = mod2.fit()
datapro.vif(x1, res2)

Unnamed: 0,VIF Factor,Variable
1,1.458218,AGE
2,2.228295,YEARS
3,1.166199,WHR
4,4.754001,EARN
5,1.227299,DEX
6,1.657871,FSER
7,1.088887,SPSER
8,1.465343,UERR
9,4.758637,SPONS


#### Assumption of homoscedasticity

#### Testing for heteroscedasticity

In [29]:
Names = ["Lagrange multiplier statistic", "p-value", "f-value", "P(F)"]
Test = sms.het_breuschpagan(res2.resid, res2.model.exog)
lzip(Names, Test)

[('Lagrange multiplier statistic', np.float64(7.115507985087527)),
 ('p-value', np.float64(0.625094636148031)),
 ('f-value', np.float64(0.7398034920251779)),
 ('P(F)', np.float64(0.6705332525506313))]

Result: p-value = 0.625 > 0.05, so the null hypothesis of homoscedasticity is not rejected.

Conclusion: At the 5% significance level, there is no evidence of heteroscedasticity in the residuals of the model. No correction is needed.
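
As a rough illustration of what the test computes, the sketch below regresses the squared residuals on the regressors and forms LM = n · R² from that auxiliary regression, which appears to be the default (studentized) form used by `het_breuschpagan`. The helper `breusch_pagan_lm` and the toy inputs are assumptions for illustration. The p-value step (a chi-square tail probability with as many degrees of freedom as slope regressors) is left out to keep the sketch NumPy-only.

```python
import numpy as np

def breusch_pagan_lm(resid, exog):
    """LM and F statistics of the auxiliary regression resid**2 ~ exog.

    exog must include the constant column, as res2.model.exog does.
    """
    u2 = np.asarray(resid, dtype=float) ** 2
    X = np.asarray(exog, dtype=float)
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, u2, rcond=None)
    aux_resid = u2 - X @ coef
    r2 = 1.0 - aux_resid @ aux_resid / np.sum((u2 - u2.mean()) ** 2)
    lm = n * r2
    f_stat = (r2 / (p - 1)) / ((1.0 - r2) / (n - p))
    return lm, f_stat

# Toy inputs (not the tennis data): the residual spread grows with x
exog_demo = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
resid_demo = np.array([1.0, -1.0, 2.0, -2.0])
lm_demo, f_demo = breusch_pagan_lm(resid_demo, exog_demo)

# Consistency check with the output above: n = 52 and 10 columns in x1
reported_lm = 7.115507985087527
r2_aux = reported_lm / 52
f_implied = (r2_aux / 9) / ((1.0 - r2_aux) / 42)
print(lm_demo, f_demo, f_implied)
```

The F value implied by the reported LM statistic agrees with the reported `f-value` of about 0.7398, which supports reading the LM statistic as n · R².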

#### Assumption of normality

In [30]:
print(res2.summary())

                            OLS Regression Results                            
Dep. Variable:                  CHAMP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                     823.5
Date:                Tue, 16 Jul 2024   Prob (F-statistic):           2.99e-44
Time:                        18:51:06   Log-Likelihood:                -70.716
No. Observations:                  52   AIC:                             161.4
Df Residuals:                      42   BIC:                             180.9
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.3959      5.966      1.743      0.0

Result: p-value = 1.47e < 0.05, so the null hypothesis of normally distributed residuals is rejected.

Conclusion: At the 5% significance level, the residuals of the model are not normally distributed. A possible solution is to use a different estimation method, but that is not viable for this example, so the model is left as it is.
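
The truncated summary does not show which normality statistic the quoted p-value comes from; statsmodels summaries report both the Omnibus and the Jarque-Bera tests. Assuming Jarque-Bera, the statistic can be sketched from the sample skewness S and kurtosis K of the residuals as JB = n/6 · (S² + (K - 3)²/4), with a chi-square(2) p-value equal to exp(-JB/2). The helper `jarque_bera` and the toy residuals are illustrative only.

```python
import math
import numpy as np

def jarque_bera(resid):
    """Jarque-Bera statistic and p-value using biased sample moments."""
    r = np.asarray(resid, dtype=float)
    n = r.size
    d = r - r.mean()
    m2 = np.mean(d ** 2)
    skew = np.mean(d ** 3) / m2 ** 1.5
    kurt = np.mean(d ** 4) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square(2)
    return jb, p_value

# Toy residuals (not the tennis data): symmetric, so the skewness is zero
jb_demo, p_demo = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb_demo, p_demo)
```

A small p-value, as in the result above, means the skewness and kurtosis of the residuals are too far from the normal values of 0 and 3 to be attributed to sampling noise.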

#### Individual significance of the model

In [31]:
print(res2.summary())

                            OLS Regression Results                            
Dep. Variable:                  CHAMP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                     823.5
Date:                Tue, 16 Jul 2024   Prob (F-statistic):           2.99e-44
Time:                        18:54:22   Log-Likelihood:                -70.716
No. Observations:                  52   AIC:                             161.4
Df Residuals:                      42   BIC:                             180.9
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.3959      5.966      1.743      0.0

In [32]:
x2=x1.drop(columns=["DEX", "FSER", "UERR", "AGE"])
mod3 = sm.OLS(y,x2) # OLS model (Ordinary Least Squares)
res3 = mod3.fit()
print(res3.summary())

                            OLS Regression Results                            
Dep. Variable:                  CHAMP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                     1487.
Date:                Tue, 16 Jul 2024   Prob (F-statistic):           1.24e-49
Time:                        18:59:06   Log-Likelihood:                -72.987
No. Observations:                  52   AIC:                             158.0
Df Residuals:                      46   BIC:                             169.7
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         11.7069      3.712      3.154      0.0

With `DEX`, `FSER`, `UERR`, and `AGE` removed, the variables that remain in the model are individually significant.
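
The three summaries also allow a quick comparison of the fitted models. Assuming the usual definitions AIC = 2p - 2 ln L and BIC = p ln n - 2 ln L, with p equal to `Df Model` plus one for the constant, the reported log-likelihoods reproduce the AIC and BIC values printed in each summary. The helper names `aic` and `bic` are illustrative.

```python
import math

def aic(llf, n_params):
    """Akaike information criterion from the log-likelihood."""
    return 2 * n_params - 2 * llf

def bic(llf, n_params, n_obs):
    """Bayesian information criterion from the log-likelihood."""
    return n_params * math.log(n_obs) - 2 * llf

N_OBS = 52
# (log-likelihood, number of parameters including the constant)
models = {
    "res1": (-70.424, 12),
    "res2": (-70.716, 10),
    "res3": (-72.987, 6),
}

criteria = {
    name: (aic(llf, p), bic(llf, p, N_OBS))
    for name, (llf, p) in models.items()
}
for name, (a, b) in criteria.items():
    print(f"{name}: AIC={a:.1f}  BIC={b:.1f}")
```

Both criteria decrease from `res1` to `res3`, which is consistent with preferring the reduced model even though its log-likelihood is slightly lower.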

#### Global significance of the model

Result: Prob (F-statistic) = 1.24e-49 < 0.05, therefore the null hypothesis that all slope coefficients are zero is rejected.

Conclusion: At the 5% significance level, the model is globally significant.
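
The F statistic behind this test is tied to R² through F = (R²/k) / ((1 - R²)/(n - k - 1)), where k is `Df Model` and n - k - 1 is `Df Residuals`. The sketch below inverts that relation to recover the R² implied by each reported F statistic; the function names are illustrative, and the inputs are the values printed in the three summaries.

```python
def f_from_r2(r2, df_model, df_resid):
    """Overall F statistic implied by R-squared."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

def r2_from_f(f_stat, df_model, df_resid):
    """R-squared implied by the overall F statistic."""
    return df_model * f_stat / (df_model * f_stat + df_resid)

# (F-statistic, Df Model, Df Residuals) as printed in each summary
reported = {
    "res1": (649.0, 11, 40),
    "res2": (823.5, 9, 42),
    "res3": (1487.0, 5, 46),
}

implied_r2 = {name: r2_from_f(*vals) for name, vals in reported.items()}
for name, r2 in implied_r2.items():
    print(f"{name}: implied R-squared = {r2:.4f}")
```

All three implied values round to the reported R² of 0.994, so the very large F statistics are simply another view of the high R².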