## Unveiling the Next Tennis Superstars With Machine Learning

### Model Selection & Evaluation

##### This phase of the project compares three classifiers for predicting which players will reach the ATP top 25: logistic regression, a random forest, and XGBoost. Each model is trained on the same train/test split and assessed on held-out test data, using per-class precision, recall, and F1 alongside accuracy, to see how well it generalizes to unseen players. The comparison exposes the strengths and limitations of each model and determines which one is carried forward to hyperparameter tuning.

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

In [11]:
#Importing the CSV dataset
tennis_data = pd.read_csv('../csvfiles/mean_stats_tennis.csv')

In [12]:
#Removing all the rows with NaN values
tennis_data = tennis_data.dropna()

In [13]:
tennis_data.sample(10)

Unnamed: 0.1,Unnamed: 0,ht,ace,df,svpt,1stIn,1stWon,2ndWon,SvGms,bpSaved,bpFaced,win_ratio,top_25_reached
6031,Tennyson Whiting,183.0,1.666667,5.0,55.333333,30.666667,17.0,9.666667,8.333333,2.666667,7.0,0.0,0.0
1193,Chuhan Wang,188.0,2.82,2.78,63.44,37.56,22.86,11.54,9.48,4.3,8.14,0.895964,0.0
2951,Toby Alex Kodat,178.0,3.317073,2.609756,64.804878,38.707317,26.097561,12.829268,10.121951,3.658537,6.512195,0.703789,0.0
85,Gael Monfils,193.0,7.5,4.0,78.607143,40.285714,29.071429,17.178571,11.75,4.535714,8.0,1.800118,1.0
601,Andrew Harris,183.0,2.7,3.48,72.6,47.92,30.92,12.14,10.84,4.46,7.58,0.79209,0.0
2466,Jordan Cox,188.0,4.6875,3.625,73.25,42.375,28.25,15.375,10.75,3.8125,6.9375,0.773171,0.0
60,Diego Schwartzman,170.0,1.24,2.66,63.68,38.0,23.2,12.92,10.02,3.54,7.22,0.551846,1.0
454,Mikhail Ledovskikh,188.0,10.2,4.4,72.8,36.6,27.2,15.6,10.2,4.6,7.4,0.432612,0.0
898,Federico Zeballos,185.0,3.489796,3.285714,69.183673,36.673469,25.183673,15.510204,10.755102,4.632653,8.020408,0.473266,0.0
730,Tomislav Brkic,185.0,5.282609,4.065217,72.782609,40.891304,29.195652,15.23913,11.304348,4.152174,7.108696,0.780756,0.0


In [14]:
#Creating x and y variables for the models
y = tennis_data['top_25_reached']
X = tennis_data.drop(columns=['top_25_reached','Unnamed: 0'])

In [15]:
#Checking the distribution
tennis_data['top_25_reached'].value_counts()

top_25_reached
0.0    1093
1.0     160
Name: count, dtype: int64

In [16]:
print(X)
print(y)

         ht       ace        df       svpt      1stIn     1stWon     2ndWon  \
0     178.0  2.933333  2.533333  70.000000  42.066667  30.466667  13.466667   
1     198.0  5.136364  2.750000  68.340909  39.272727  28.500000  13.863636   
2     188.0  6.400000  3.200000  75.000000  45.000000  32.200000  13.400000   
3     188.0  6.400000  5.200000  89.800000  53.400000  36.200000  15.400000   
4     183.0  3.040816  3.326531  78.326531  46.204082  31.061224  16.795918   
...     ...       ...       ...        ...        ...        ...        ...   
6288  183.0  1.000000  2.000000  71.000000  44.000000  30.000000  10.000000   
6290  175.0  0.000000  2.500000  86.500000  51.000000  32.500000  17.000000   
6533  190.0  3.000000  7.000000  45.000000  20.000000  16.000000   4.000000   
6541  190.0  1.000000  5.000000  64.000000  36.000000  22.000000  10.000000   
6543  183.0  1.000000  2.000000  55.000000  36.000000  23.000000  10.000000   

          SvGms   bpSaved    bpFaced  win_ratio  
0

In [17]:
#Creating a train/test split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(839, 11)
(414, 11)
(839,)
(414,)
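
With so few positives, a plain random split can leave the training and test sets with slightly different class ratios. One option, not used in the notebook above, is to pass `stratify=y` to `train_test_split`. The sketch below is only an illustration: `X_demo` and `y_demo` are hypothetical synthetic stand-ins that reuse the class counts shown earlier (1093 vs. 160), not the real tennis data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the notebook (1093 vs 160).
rng = np.random.default_rng(42)
y_demo = np.array([0] * 1093 + [1] * 160)
X_demo = rng.normal(size=(y_demo.size, 11))

# stratify=y_demo keeps the positive rate (approximately) equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("overall positive rate:", round(y_demo.mean(), 3))
print("train positive rate:  ", round(y_tr.mean(), 3))
print("test positive rate:   ", round(y_te.mean(), 3))
```

The three printed rates should agree to within about one percentage point, which makes the test-set metrics easier to compare across reruns.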


##### Only about 13% of the samples (160 of 1,253) are players who have reached the top 25 in the ATP rankings, so the classes are heavily imbalanced. I will therefore randomly oversample the minority class in the training set to increase its sample size.

In [18]:
from imblearn.over_sampling import RandomOverSampler

# Instantiate the RandomOverSampler
oversampler = RandomOverSampler(random_state=42)

# Resample the dataset
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)
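
To make the idea behind random oversampling concrete, here is a minimal NumPy sketch. It is not imblearn's implementation; `random_oversample` and the toy arrays are hypothetical names introduced for illustration. Minority-class rows are drawn with replacement and appended until every class matches the largest one.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # always keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # Sample with replacement from the existing rows of this class.
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: 8 negatives, 2 positives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # [8 8]
```

Because the new rows are exact copies, oversampling rebalances the classes without adding any new information. That is also why it is applied to the training split only, as in the cell above.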

In [19]:
#Model training with logistic regression
model = LogisticRegression()
model.fit(X_train_resampled, y_train_resampled)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
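
The warning above means the default `lbfgs` solver stopped at its default limit of 100 iterations before converging. As the message itself suggests, scaling the features or raising `max_iter` usually resolves this. The sketch below shows one possible way to do both with a scikit-learn pipeline. It is not what the notebook ran, and the data are hypothetical synthetic columns (`height_like`, `ratio_like`) chosen only to mimic features on very different scales.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical stand-ins,
# e.g. a height-like column near 185 and a ratio-like column near 0.8).
rng = np.random.default_rng(0)
n = 400
height_like = rng.normal(185, 7, size=n)
ratio_like = rng.normal(0.8, 0.3, size=n)
X_demo = np.column_stack([height_like, ratio_like])
y_demo = (ratio_like + rng.normal(0, 0.2, size=n) > 0.9).astype(int)

# The scaler lives inside the pipeline, so its mean and variance are
# learned only from whatever data the pipeline is fit on.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print("solver iterations used:", n_iter)
```

In the notebook, the same pipeline could be fit on `X_train_resampled` and `y_train_resampled` in place of the bare `LogisticRegression()`. The reported metrics would likely change, so the results above should not be read as coming from a scaled model.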


In [20]:
#Logistic Regression model evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.95      0.71      0.81       364
         1.0       0.25      0.70      0.37        50

    accuracy                           0.71       414
   macro avg       0.60      0.70      0.59       414
weighted avg       0.86      0.71      0.76       414



In [21]:
#Model training with random forest classifier
model = RandomForestClassifier()
model.fit(X_train_resampled, y_train_resampled)

In [22]:
#Random Forest Classifier model evaluation
y_pred_forest = model.predict(X_test)
print(classification_report(y_test, y_pred_forest))

              precision    recall  f1-score   support

         0.0       0.90      0.95      0.93       364
         1.0       0.42      0.26      0.32        50

    accuracy                           0.87       414
   macro avg       0.66      0.61      0.62       414
weighted avg       0.84      0.87      0.85       414



In [23]:
#Model training xgboost
model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
model.fit(X_train_resampled, y_train_resampled)
y_pred_boost = model.predict(X_test)

In [24]:
#Model evaluation
accuracy = accuracy_score(y_test, y_pred_boost)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred_boost))

Accuracy: 0.8623188405797102
Classification Report:
              precision    recall  f1-score   support

         0.0       0.91      0.94      0.92       364
         1.0       0.41      0.32      0.36        50

    accuracy                           0.86       414
   macro avg       0.66      0.63      0.64       414
weighted avg       0.85      0.86      0.85       414



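Of the three models, the random forest classifier achieved the highest test accuracy (87%, versus 86% for XGBoost and 71% for logistic regression), so we will continue with this model for hyperparameter tuning. Because of the class imbalance, accuracy is dominated by the majority class: the random forest's recall on top-25 players is only 0.26, compared with 0.32 for XGBoost and 0.70 for logistic regression.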
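
The support counts in the reports above (364 negatives, 50 positives) make it possible to check how informative accuracy is on this test set. The short calculation below compares the random forest with a trivial baseline that always predicts class 0. The confusion counts `tp`, `fp`, `fn`, `tn` are an approximate reconstruction from the rounded random forest report, not values printed by the notebook.

```python
# Support counts come straight from the classification reports above.
n_neg, n_pos = 364, 50

# A trivial baseline that always predicts class 0 gets every negative right
# and every positive wrong.
baseline_accuracy = n_neg / (n_neg + n_pos)

# Confusion counts for the random forest, reconstructed (approximately)
# from its rounded report: recall 0.26 on 50 positives, precision 0.42.
tp, fn = 13, 37
fp, tn = 18, 346

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"always-0 baseline accuracy: {baseline_accuracy:.2f}")  # 0.88
print(f"random forest accuracy:     {accuracy:.2f}")           # 0.87
print(f"positive-class precision:   {precision:.2f}")          # 0.42
print(f"positive-class recall:      {recall:.2f}")             # 0.26
print(f"positive-class F1:          {f1:.2f}")                 # 0.32
```

The baseline reaches about 88% accuracy without identifying a single top-25 player, so the positive-class precision, recall, and F1 are the more meaningful figures for comparing these models.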

In [28]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 500],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Creating a Random Forest Classifier instance
rf_classifier = RandomForestClassifier()

# Creating GridSearchCV instance
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy', verbose=2)

# Fitting the GridSearchCV to the oversampled training data
grid_search.fit(X_train_resampled, y_train_resampled)

# Getting the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)



Fitting 5 folds for each of 81 candidates, totalling 405 fits
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.8s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.5s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END max_dep
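
One caveat about the search above: it cross-validates on data that was already oversampled, so copies of the same minority row can land in both the training and validation folds, which tends to make `best_score_` optimistic. The sketch below shows one possible alternative, offered as an assumption rather than the notebook's method: leave the data unresampled, let the forest reweight classes with `class_weight="balanced"`, and score on the positive-class F1. The synthetic `X_demo` and `y_demo` and the reduced grid are hypothetical, chosen so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 13% positives) standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.87, 0.13], random_state=42,
)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",  # F1 of the positive class instead of accuracy
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Fixing `random_state` on both the estimator and the folds also makes the selected parameters reproducible from run to run.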

Running the best model with the best hyperparameters

In [29]:
# Instantiate Random Forest Classifier with best parameters
best_rf_classifier = RandomForestClassifier(**best_params)

# Fit the model on training data
best_rf_classifier.fit(X_train_resampled, y_train_resampled)

# Predict target labels for test data
y_pred_best = best_rf_classifier.predict(X_test)

# Evaluate model performance
print(classification_report(y_test, y_pred_best))

              precision    recall  f1-score   support

         0.0       0.91      0.96      0.94       364
         1.0       0.54      0.30      0.38        50

    accuracy                           0.88       414
   macro avg       0.72      0.63      0.66       414
weighted avg       0.86      0.88      0.87       414



##### After hyperparameter tuning, the random forest reaches a test accuracy of 88%, the highest obtained in this project. The classification report shows high precision (0.91) and recall (0.96) for players who do not enter the Top 25 in the ATP rankings. For players who do reach the Top 25, however, precision (0.54) and recall (0.30) remain low: the model identifies fewer than a third of them. Since 364 of the 414 test players are negatives, accuracy alone says little about this class. The weak minority-class performance can be explained by the class imbalance, since only a small minority of players in the dataset have reached the Top 25, and possibly by overfitting, because oversampling duplicates existing minority rows rather than adding new information. Improving on this would require incorporating more data, or adjusting the project criteria to obtain a more evenly balanced class distribution.