# First look at the dataset

The dataset contains 49 variables. There is **no dependent (target) variable**, so I will create one.

First I tested a very simple model using these two features:
- *winner_rank_points*
- *loser_rank_points*

I chose the *rank_points* features over the *rank* features because the gap in ranking points between two adjacent players, for example the 2nd and 3rd ATP players, can be either small or huge.
<br>
Rank points are therefore more meaningful than a player's rank.
<br>
I didn't use both features, to avoid **multicollinearity**, which would weaken my model.
<br><br>
The features *winner_rank_points* and *loser_rank_points* are tied to the match outcome: each one belongs to the player who won or lost the match.
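
As a rough illustration of the rank versus rank-points reasoning above, the sketch below uses made-up values (not taken from the ATP files). It suggests that the two columns carry almost the same ordering information, which is why keeping both adds little, while only the points show how far apart two neighbouring players are. The `toy` DataFrame and its numbers are assumptions for this example.

```python
import pandas as pd

# Toy values, invented for illustration only (not taken from the ATP files).
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "rank_points": [12000, 9000, 8900, 5000, 4800],
})

# Rank correlation: Pearson correlation computed on the ranks of each column.
# A value close to -1 means both columns order the players the same way
# (a better rank number goes with more points).
rank_corr = toy["rank"].rank().corr(toy["rank_points"].rank())

# Point gap between each player and the next one in the ranking.
gaps = (-toy["rank_points"].diff()).dropna()

print("rank correlation:", rank_corr)
print("point gaps between consecutive ranks:", gaps.tolist())
```

The uneven gaps are the information that *rank* alone would hide.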

To avoid **target leakage**, I renamed those features to *player_1_points* and *player_2_points* and **added the dependent variable** *player_1_wins*. With the renaming alone, *player_1_wins* would always be 1, so I found 2 options to solve that:

1. Add to the dataset its inverse (switch *player_1_points* and *player_2_points* and set *player_1_wins* = 0).
2. Invert 50% of the existing dataset. The training dataset size remains the same.

Option 1 may not be ideal because it doubles the size of the training set, but it might yield better results than option 2. So I decided to test both options.
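
To make the two options concrete, here is a minimal sketch on a toy table. The values are invented, and the helper `swap_players` and the random mask `flip` are hypothetical names used only for this illustration. The notebook's own functions pick rows by index parity; the random mask here is a variation that does not depend on row order.

```python
import numpy as np
import pandas as pd

# Toy matches, invented for illustration; player 1 is always the winner here.
matches = pd.DataFrame({
    "player_1_points": [5000, 3000, 800, 1200],
    "player_2_points": [1000, 2500, 600, 2000],
    "player_1_wins": 1,
})

def swap_players(df):
    """Return a frame with the two players swapped and the label set to 0."""
    return pd.DataFrame({
        "player_1_points": df["player_2_points"].to_numpy(),
        "player_2_points": df["player_1_points"].to_numpy(),
        "player_1_wins": 0,
    }, index=df.index)

# Option 1: keep every match and append its mirrored copy (size doubles).
option_1 = pd.concat([matches, swap_players(matches)], ignore_index=True)

# Option 2: mirror exactly half of the rows in place (size unchanged).
rng = np.random.default_rng(0)
flip = rng.permutation(len(matches)) < len(matches) // 2
option_2 = pd.DataFrame({
    "player_1_points": np.where(flip, matches["player_2_points"], matches["player_1_points"]),
    "player_2_points": np.where(flip, matches["player_1_points"], matches["player_2_points"]),
    "player_1_wins": np.where(flip, 0, 1),
})

print(option_1)
print(option_2)
```

In both cases half of the labels end up as 0 and half as 1, so the model can no longer learn that player 1 always wins.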


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def inverseDataset(dataset_input):
    '''inverse dataset - for option 1'''
    inversed_dataset = pd.DataFrame()
    inversed_dataset["player_1_points"] = dataset_input["player_2_points"]
    inversed_dataset["player_2_points"] = dataset_input["player_1_points"]
    inversed_dataset["player_1_wins"] = 0
    return inversed_dataset

def inverseHalfDataset(dataset_input):
    '''inverse 50% of the dataset - for option 2'''
    inversed_dataset = pd.DataFrame()
    inversed_dataset["player_1_points"] = np.where(dataset_input.index % 2 == 0, dataset_input["player_1_points"] , dataset_input["player_2_points"])
    inversed_dataset["player_2_points"] = np.where(dataset_input.index % 2 == 0, dataset_input["player_2_points"] , dataset_input["player_1_points"])
    inversed_dataset["player_1_wins"] = np.where(dataset_input.index % 2 == 0, 1, 0)
    return inversed_dataset    

# Read the data
list_datasets = []
for year in range(2000, 2010):
    dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/tennis-prediction/master/datasets/atp_matches_{}.csv".format(year))
    list_datasets.append(dataset)

full_dataset = pd.concat(list_datasets)

features = ["winner_rank_points", "loser_rank_points"]

dataset = full_dataset[features]

#drop rows with null value
dataset = dataset.dropna()

dataset = dataset.rename(columns={'winner_rank_points': 'player_1_points', 'loser_rank_points': 'player_2_points'})
dataset["player_1_wins"] = 1

### OPTION 1
# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)
    
inversed_train = inverseDataset(train)
train = pd.concat([train, inversed_train])

X_train = train[["player_1_points", "player_2_points"]]
y_train = train.player_1_wins
X_test = test[["player_1_points", "player_2_points"]]
y_test = test.player_1_wins

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Predict
y_pred = pd.Series(classifier.predict(X_test), index = y_test.index)
mae = mean_absolute_error(y_pred, y_test)
print("MAE using option 1: {}".format(mae))


### OPTION 2
# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)
train = inverseHalfDataset(train)

X_train = train[["player_1_points", "player_2_points"]]
y_train = train.player_1_wins
X_test = test[["player_1_points", "player_2_points"]]
y_test = test.player_1_wins

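# Fit a fresh classifier on the option 2 training set
# (without this step, the model trained for option 1 would be reused below)
classifier = LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Predict
y_pred = pd.Series(classifier.predict(X_test), index = y_test.index)
mae = mean_absolute_error(y_pred, y_test)
print("MAE using option 2: {}".format(mae))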

The MAE (mean absolute error) for option 2 is almost equal to, and even a bit lower than, the MAE for option 1. So option 1 doesn't yield better results than option 2; it only increases the dataset size. Therefore I kept the option 2 methodology.
<br><br>
Other variables are also tied to the winner or the loser of the match (such as *winner_age*, *loser_age*, ...).

I will handle those variables in the *inverseHalfDataset* function as well.
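
A note on reading the MAE values above: when both labels and predictions are 0 or 1, each absolute error is either 0 or 1, so the MAE is the share of wrong predictions, that is, 1 minus the accuracy. The short check below uses invented labels and predictions to show this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Toy labels and predictions, invented for illustration.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

mae = mean_absolute_error(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

# For 0/1 values the two quantities below coincide.
print("MAE:", mae)
print("1 - accuracy:", 1 - accuracy)
```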

# Feature importance

Before going further, I will use PermutationImportance from the eli5 library (its scikit-learn wrapper) to see which features seem the most important at first sight.
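
For readers unfamiliar with the technique: permutation importance shuffles one feature at a time and measures how much the model's score drops. The sketch below is not the notebook's code. It swaps in scikit-learn's own `permutation_importance` from `sklearn.inspection` and runs on synthetic data in which, by construction, only column 0 is informative. All names and parameter values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False, column 0 is the only informative
# feature and the remaining three columns are pure noise.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=1, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the validation set several times and record
# the resulting drop in accuracy.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} ± {result.importances_std[i]:.4f}")
```

Scoring on held-out rows, as done here, is one possible design choice; it measures what the model relies on for unseen data rather than for the rows it was trained on.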



In [2]:
!pip install eli5

Collecting eli5
Installing collected packages: eli5
Successfully installed eli5-0.10.1


In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

long_col = ["id", "name", "hand", "ht", "ioc", "age", "rank", "rank_points"]
short_col = ["ace", "df", "svpt", "1stIn", "1stWon", "2ndWon", "SvGms", "bpSaved", "bpFaced"]

def inverseHalfDataset(dataset):
    '''inverse 50% of the dataset - for option 2'''
    inv = pd.DataFrame()
    for col in long_col + short_col:
        inv["player_1_" + col] = np.where(dataset.index % 2 == 0, dataset["player_1_" + col] , dataset["player_2_" + col])
        inv["player_2_" + col] = np.where(dataset.index % 2 == 0, dataset["player_2_" + col] , dataset["player_1_" + col])
   
    inv["player_1_wins"] = np.where(dataset.index % 2 == 0, 1, 0)
    return inv    

def renameColumnNames(dataset):
    columns = {}
    for col in long_col:
        columns["winner_" + col] = "player_1_" + col
        columns["loser_" + col] = "player_2_" + col
    
    for col in short_col:
        columns["w_" + col] = "player_1_" + col
        columns["l_" + col] = "player_2_" + col
        
    dataset = dataset.rename(columns= columns)
    return dataset

# Read the data
list_datasets = []
for year in range(2010, 2020):
    dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/tennis-prediction/master/datasets/atp_matches_{}.csv".format(year))
    list_datasets.append(dataset)

dataset = pd.concat(list_datasets)

dataset = dataset.drop(columns=["winner_seed", "winner_entry", "loser_seed", "loser_entry"])

#drop rows with null value
dataset = dataset.dropna()

dataset = renameColumnNames(dataset)

dataset["player_1_wins"] = 1

dataset = inverseHalfDataset(dataset)

X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

object_cols = [col for col in X.columns if dataset[col].dtype == "object"] # All categorical columns
low_cardinality_cols = [col for col in object_cols if X[col].nunique() < 10] # Categorical columns with few unique values
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

X = X.drop(high_cardinality_cols, axis=1)

columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), low_cardinality_cols)], remainder='passthrough')
#remainder='passthrough' : keep other columns (default:'drop')

# np.str was only an alias for the built-in str and was removed in NumPy 1.24
transformed_data = np.array(columnTransformer.fit_transform(X), dtype = str)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
def get_feature_out(estimator, feature_in):
    if hasattr(estimator,'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}' \
                for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in


def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransfomer
    # doesn't work when remainder =='passthrough'
    # which requires the input column names.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name!='remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator=='passthrough':
            output_features.extend(ct._feature_names_in[features])
                
    return output_features


X = pd.DataFrame(transformed_data, 
             columns=get_ct_feature_names(columnTransformer))


my_model = RandomForestClassifier(n_estimators=100, random_state=0)
my_model.fit(X, y)

perm = PermutationImportance(my_model, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names = X.columns.tolist())

Using TensorFlow backend.


| Weight | Feature |
|---|---|
| 0.1823 ± 0.0054 | player_1_bpFaced |
| 0.1698 ± 0.0039 | player_2_bpFaced |
| 0.0415 ± 0.0023 | player_1_1stWon |
| 0.0388 ± 0.0008 | player_2_1stWon |
| 0.0056 ± 0.0002 | player_2_2ndWon |
| 0.0045 ± 0.0003 | player_1_2ndWon |
| 0.0030 ± 0.0005 | player_1_bpSaved |
| 0.0030 ± 0.0005 | player_2_bpSaved |
| 0.0029 ± 0.0003 | player_1_rank_points |
| 0.0028 ± 0.0003 | player_2_ace |


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!ls /content/gdrive/My\ Drive/tennis-prediction/src/*.py

'/content/gdrive/My Drive/tennis-prediction/src/tennis_prediction.py'
