# Tennis match prediction and playing strategy -- Modeling

## Goal
1. Try different algorithms and build the prediction model, using accuracy as the evaluation metric
    * Naive Bayes
    * K-Nearest Neighbors
    * Logistic Regression
    * Decision Tree
    * Random Forest
    * Support Vector Machine
    * Gradient Boosting
    
2. Compare the performance of different feature selection methods
    * statistical-based
    * model-based
    * PCA

3. Deep learning using RNN

### Importing useful packages

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

sns.set()
sns.set_style("whitegrid")

# import model package
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
import scipy.stats as stats

# import customized package
from MLuseful import get_best_model_accuracy
from MLuseful import roc_curve_plot
from MLuseful import print_confusion_matrix

### Loading data
We load the training and test sets produced in the feature engineering step so that they are ready for model fitting.

In [2]:
tennis_all_feature_clean_X_train = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_train.csv')
tennis_all_feature_clean_X_test = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_test.csv')
tennis_all_feature_clean_y_train = pd.read_csv('../Processed_Data/tennis_all_feature_clean_y_train.csv', header=None)
tennis_all_feature_clean_y_test = pd.read_csv('../Processed_Data/tennis_all_feature_clean_y_test.csv', header=None)
tennis_all_feature_clean_drop_oppo_X_train = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_train.csv')
tennis_all_feature_clean_drop_oppo_X_test = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_test.csv')
tennis_all_feature_clean_drop_oppo_y_train = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_y_train.csv', header=None)
tennis_all_feature_clean_drop_oppo_y_test = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_y_test.csv', header=None)

tennis_all_feature_clean_X_train_p_value = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_train_p_value.csv')
tennis_all_feature_clean_X_test_p_value = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_test_p_value.csv')
tennis_all_feature_clean_drop_oppo_X_train_p_value = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_train_p_value.csv')
tennis_all_feature_clean_drop_oppo_X_test_p_value = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_test_p_value.csv')

tennis_all_feature_clean_X_train_tree = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_train_tree.csv')
tennis_all_feature_clean_X_test_tree = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_test_tree.csv')
tennis_all_feature_clean_drop_oppo_X_train_tree = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_train_tree.csv')
tennis_all_feature_clean_drop_oppo_X_test_tree = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_test_tree.csv')

tennis_all_feature_clean_X_train_PCA = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_train_PCA.csv')
tennis_all_feature_clean_X_test_PCA = pd.read_csv('../Processed_Data/tennis_all_feature_clean_X_test_PCA.csv')
tennis_all_feature_clean_drop_oppo_X_train_PCA = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_train_PCA.csv')
tennis_all_feature_clean_drop_oppo_X_test_PCA = pd.read_csv('../Processed_Data/tennis_all_feature_clean_drop_oppo_X_test_PCA.csv')

In [24]:
tennis_all_feature_clean_X_train_p_value.shape

(8962, 500)

In [25]:
tennis_all_feature_clean_y_train.shape

(8962, 1)

Before fitting any model, we need the baseline score to beat: the accuracy obtained by always predicting the majority class of the outcome variable.

In [7]:
# the target was loaded with header=None, so select its single column by position
tennis_all_feature_clean_y_train.iloc[:, 0].value_counts(normalize=True)

1    0.53253
0    0.46747
Name: 1, dtype: float64

The baseline to beat is 0.533: predicting that player_1 wins every match gives an accuracy of 0.533. We now fit different models and compare them on fitting time, prediction accuracy, and so on. For each algorithm we run a grid search over its hyperparameters and keep the setting that gives the highest accuracy.
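
The `get_best_model_accuracy` helper comes from the custom `MLuseful` package, and its source is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it wraps scikit-learn's `GridSearchCV` and prints the same four quantities that appear in the outputs below. The name `grid_search_accuracy_sketch`, the 5-fold setting, and the synthetic data are assumptions for illustration, not the author's implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB


def grid_search_accuracy_sketch(model, params, X, y, score="accuracy", cv=5):
    """Hypothetical stand-in for MLuseful.get_best_model_accuracy."""
    grid = GridSearchCV(model, params, scoring=score, cv=cv)
    grid.fit(X, np.ravel(y))  # flatten an (n, 1) target to shape (n,)
    print(f"Best {score} : {grid.best_score_}")
    print(f"Best Parameters: {grid.best_params_}")
    # Timings are averaged over all parameter settings and folds.
    print(f"Average Time to Fit (s): {grid.cv_results_['mean_fit_time'].mean():.3f}")
    print(f"Average Time to Score (s): {grid.cv_results_['mean_score_time'].mean():.3f}")
    return grid.best_params_, grid.best_score_


# Synthetic stand-in data; the notebook itself uses the tennis feature sets.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
best_demo, score_demo = grid_search_accuracy_sketch(
    GaussianNB(), {"var_smoothing": [1e-11, 1e-9, 1e-7]}, X_demo, y_demo
)
```

The return order (best parameters first, then best score) mirrors how the cells below unpack the result into `best, score`.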

#### Naive Bayes
We first try the simplest algorithm, Naive Bayes. Its basic assumption is that the features are independent of each other given the class. This is an idealized assumption that almost never holds in the real world, but the algorithm is simple and fast, so it is still worth seeing how it performs.
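
To make the independence assumption concrete, here is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier. It is a simplification for illustration only: the function names and toy data are made up, and the `smoothing` term is a plain constant added to each variance, which is simpler than the way scikit-learn's `var_smoothing` scales its correction.

```python
import math


def fit_gaussian_nb(rows, labels, smoothing=1e-9):
    """Return {class: (prior, means, variances)} estimated per feature."""
    model = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        means = [sum(col) / len(col) for col in zip(*members)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + smoothing
            for col, m in zip(zip(*members), means)
        ]
        model[cls] = (len(members) / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log-posterior."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        # Independence assumption: the joint log-likelihood is a plain sum
        # of one-dimensional Gaussian log-densities, one per feature.
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)


# Toy data (made up): two features, class 1 has larger values.
toy_rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(toy_rows, toy_labels)
pred_low = predict_gaussian_nb(nb_model, [1.1, 2.1])
pred_high = predict_gaussian_nb(nb_model, [3.1, 4.0])
```

Because each feature contributes its own term to the sum, correlated features are effectively counted more than once, which is one reason the assumption rarely holds on real data.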

In [32]:
# data dictionary
data_all_train = {'all_feature': tennis_all_feature_clean_X_train, 
                  'all_feature_drop_oppo': tennis_all_feature_clean_drop_oppo_X_train,
                  'all_feature_p_value': tennis_all_feature_clean_X_train_p_value, 
                  'all_feature_drop_oppo_p_value': tennis_all_feature_clean_drop_oppo_X_train_p_value,
                  'all_feature_tree': tennis_all_feature_clean_X_train_tree, 
                  'all_feature_drop_oppo_tree': tennis_all_feature_clean_drop_oppo_X_train_tree,
                  'all_feature_PCA': tennis_all_feature_clean_X_train_PCA, 
                  'all_feature_drop_oppo_PCA': tennis_all_feature_clean_drop_oppo_X_train_PCA}

In [30]:
# create dictionaries to collect the scores and the best parameters of every algorithm
all_score = {}
all_params = {}

In [37]:
naive = GaussianNB()

# create empty lists to store the scores and best parameters of the Naive Bayes models
all_score['Naive_Bayes'] = []
all_params['Naive_Bayes'] = []
# specify the hyperparameters to grid search over
naive_params = {'var_smoothing': [1e-11, 1e-10, 1e-9, 1e-8, 1e-7]}
for name, data in data_all_train.items():
    print(name)
    print(f'training set size: {data.shape}')
    print('-'*90)
    best, score = get_best_model_accuracy(naive, naive_params, data, tennis_all_feature_clean_y_train, score='accuracy')
    all_score['Naive_Bayes'].append(score)
    all_params['Naive_Bayes'].append(best)
    print('\n')

all_feature
training set size: (8962, 2280)
------------------------------------------------------------------------------------------
Best accuracy : 0.7007364427583129
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.667
Average Time to Score (s): 0.121


all_feature_drop_oppo
training set size: (8962, 1540)
------------------------------------------------------------------------------------------
Best accuracy : 0.6967194822584245
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.423
Average Time to Score (s): 0.083


all_feature_p_value
training set size: (8962, 500)
------------------------------------------------------------------------------------------
Best accuracy : 0.7194822584244588
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.125
Average Time to Score (s): 0.029


all_feature_drop_oppo_p_value
training set size: (8962, 500)
----------------------------------------------------------------------------------------

Surprisingly, the PCA-processed data outperformed the other feature sets in this case, and the full feature set showed no advantage. Model fitting is also much faster on the PCA-processed data because it has fewer features.

#### K-Nearest Neighbors
Next we try another simple algorithm, k-nearest neighbors (KNN). It finds the training examples closest to the one being predicted, gives each of these neighbors one vote, and predicts the class that receives the majority of the votes.
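
The voting rule can be sketched in a few lines of plain Python. This is an illustrative toy, assuming Euclidean distance and unweighted votes; the function name and data are made up, and the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every training row.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Keep the k closest neighbors; each one casts a single vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): class 0 clusters near the origin, class 1 near (5, 5).
toy_rows = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = [0, 0, 0, 1, 1, 1]
pred_near_origin = knn_predict(toy_rows, toy_labels, [0.5, 0.5], k=3)
pred_far = knn_predict(toy_rows, toy_labels, [5.5, 5.5], k=3)
```

With two classes, an odd `k` cannot produce a tied vote, which is consistent with the odd values searched in `knn_params` below.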

In [None]:
knn = KNeighborsClassifier()

# create empty lists to store the scores and best parameters of the KNN models
all_score['KNN'] = []
all_params['KNN'] = []
knn_params = {'n_neighbors': [1,3,5,7,9,11]}

for name, data in data_all_train.items():
    print(name)
    print(f'training set size: {data.shape}')
    print('-'*90)
    best, score = get_best_model_accuracy(knn, knn_params, data, tennis_all_feature_clean_y_train, score='accuracy')
    all_score['KNN'].append(score)
    all_params['KNN'].append(best)
    print('\n')

#### Logistic Regression
Now we try logistic regression, a more sophisticated algorithm than the previous two. It passes a linear combination of the features through the sigmoid function to estimate the probability that an example belongs to class 1. The default decision threshold is 0.5: when the estimated probability is greater than or equal to 0.5 the model predicts 1, and 0 otherwise.
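
The prediction step can be sketched as follows. The weights and bias here are made-up numbers chosen for illustration, not coefficients fitted to the tennis data, and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Return (probability, label) for one example under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Made-up coefficients for illustration only.
demo_weights, demo_bias = [0.8, -0.5], 0.1
p_win, label_win = predict_label(demo_weights, demo_bias, [2.0, 1.0])     # z = 1.2
p_loss, label_loss = predict_label(demo_weights, demo_bias, [-1.0, 1.0])  # z = -1.2
```

A positive linear score `z` maps to a probability above 0.5 and a negative one to a probability below 0.5, so thresholding at 0.5 is equivalent to checking the sign of `z`.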

In [None]:
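# liblinear supports both the 'l1' and 'l2' penalties searched below
# (lbfgs, the default solver in recent scikit-learn releases, rejects 'l1')
lgr = LogisticRegression(solver='liblinear')

# create empty lists to store the scores and best parameters of the logistic regression models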
all_score['Logistic_regression'] = []
all_params['Logistic_regression'] = []
lgr_params = {'penalty': ['l1', 'l2'],
              'C': [0.01, 0.1, 1, 10]}

for name, data in data_all_train.items():
    print(name)
    print(f'training set size: {data.shape}')
    print('-'*90)
    best, score = get_best_model_accuracy(lgr, lgr_params, data, tennis_all_feature_clean_y_train, score='accuracy')
    all_score['Logistic_regression'].append(score)
    all_params['Logistic_regression'].append(best)
    print('\n')

all_feature
training set size: (8962, 2280)
------------------------------------------------------------------------------------------
Best accuracy : 0.8975675072528454
Best Parameters: {'C': 0.01, 'penalty': 'l1'}
Average Time to Fit (s): 27.881
Average Time to Score (s): 0.024


all_feature_drop_oppo
training set size: (8962, 1540)
------------------------------------------------------------------------------------------
Best accuracy : 0.8981254184333854
Best Parameters: {'C': 0.01, 'penalty': 'l1'}
Average Time to Fit (s): 24.806
Average Time to Score (s): 0.017


all_feature_p_value
training set size: (8962, 500)
------------------------------------------------------------------------------------------
Best accuracy : 0.8629770140593618
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 8.388
Average Time to Score (s): 0.006


all_feature_drop_oppo_p_value
training set size: (8962, 500)
---------------------------------------------------------------------------

### RNN

In [None]:
from numpy import array
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM