# First look at the dataset

The dataset contains 49 variables. There is **no dependent (target) variable**, so I will create one.

First, I tested a very simple model using these two features:
- *winner_rank_points*
- *loser_rank_points*

I chose the *rank_points* features over the *rank* features because the gap in ranking points between, for example, the 2nd and 3rd ATP players can be either small or huge.
<br>
Rank points are therefore more meaningful than a player's rank.
<br>
I didn't use both kinds of features, to avoid **multicollinearity**, which would weaken my model.
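
To make the argument above concrete, here is a minimal sketch on a made-up ranking table. The player names and point totals are invented for illustration and do not come from the dataset. It shows that neighbouring ranks always differ by 1 while the point gaps vary widely, and that rank and rank points are strongly correlated, which is why using both invites multicollinearity.

```python
import pandas as pd

# Hypothetical ranking table: the point totals are invented for illustration.
ranking = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "rank_points": [12000, 11900, 7000, 6900],
})
ranking["rank"] = ranking["rank_points"].rank(ascending=False).astype(int)

# Gap to the next-ranked player, expressed in ranks and in points.
ranking["rank_gap"] = ranking["rank"].diff(-1).abs()
ranking["points_gap"] = ranking["rank_points"].diff(-1).abs()

# Both columns encode the same ordering, so they are strongly correlated.
correlation = ranking["rank"].corr(ranking["rank_points"])

print(ranking)
print("corr(rank, rank_points) = {:.2f}".format(correlation))
```

In this toy table the rank gap is always 1, whereas the points gap between the 2nd and 3rd players is far larger than the gap between the 1st and 2nd.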
<br><br>
The features *winner_rank_points* and *loser_rank_points* are defined by the outcome of the match: one describes the player who won, the other the player who lost.

To avoid **target leakage**, I renamed those features to *player_1_points* and *player_2_points* and **added the dependent variable** *player_1_wins*. As it stands, *player_1_wins* would always be 1, so I found 2 options to solve that:

1. Add to the dataset its inverse (swap *player_1_points* and *player_2_points* and set *player_1_wins* = 0)
2. Invert 50% of the existing dataset. The training set size remains the same.

Option 1 is not ideal because it doubles the size of the training set, but it might yield better results than option 2, so I decided to test both options.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def inverseDataset(dataset_input):
    '''inverse dataset - for option 1'''
    inversed_dataset = pd.DataFrame()
    inversed_dataset["player_1_points"] = dataset_input["player_2_points"]
    inversed_dataset["player_2_points"] = dataset_input["player_1_points"]
    inversed_dataset["player_1_wins"] = 0
    return inversed_dataset

def inverseHalfDataset(dataset_input):
    '''inverse 50% of the dataset - for option 2'''
    # Rows with an even index label keep their order (player 1 won);
    # rows with an odd index label are swapped (player 1 lost)
    keep = dataset_input.index % 2 == 0
    inversed_dataset = pd.DataFrame()
    inversed_dataset["player_1_points"] = np.where(keep, dataset_input["player_1_points"], dataset_input["player_2_points"])
    inversed_dataset["player_2_points"] = np.where(keep, dataset_input["player_2_points"], dataset_input["player_1_points"])
    inversed_dataset["player_1_wins"] = np.where(keep, 1, 0)
    return inversed_dataset

# Read the data
list_datasets = []
for year in range(2000, 2010):
    dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/tennis-prediction/master/datasets/atp_matches_{}.csv".format(year))
    list_datasets.append(dataset)

full_dataset = pd.concat(list_datasets)

features = ["winner_rank_points", "loser_rank_points"]

dataset = full_dataset[features]

#drop rows with null value
dataset = dataset.dropna()

dataset = dataset.rename(columns={'winner_rank_points': 'player_1_points', 'loser_rank_points': 'player_2_points'})
dataset["player_1_wins"] = 1

### OPTION 1
# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)
    
inversed_train = inverseDataset(train)
train = pd.concat([train, inversed_train])

X_train = train[["player_1_points", "player_2_points"]]
y_train = train.player_1_wins
X_test = test[["player_1_points", "player_2_points"]]
y_test = test.player_1_wins

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Predict
y_pred = pd.Series(classifier.predict(X_test), index = y_test.index)
mae = mean_absolute_error(y_test, y_pred)
print("MAE using option 1: {}".format(mae))


### OPTION 2
# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)
train = inverseHalfDataset(train)

X_train = train[["player_1_points", "player_2_points"]]
y_train = train.player_1_wins
X_test = test[["player_1_points", "player_2_points"]]
y_test = test.player_1_wins

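# Fit a new classifier on the option 2 training set
# (otherwise the model trained for option 1 would be evaluated again)
classifier = LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Predict
y_pred = pd.Series(classifier.predict(X_test), index = y_test.index)
mae = mean_absolute_error(y_test, y_pred)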
print("MAE using option 2: {}".format(mae))

MAE (mean absolute error) for option 2 is almost equal to, and even a bit lower than, that of option 1. So option 1 doesn't give better results than option 2; it only increases the dataset size. Therefore I kept the option 2 method.
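
As a side note on reading these scores: with 0/1 labels, the MAE is simply the share of misclassified rows, i.e. 1 minus the accuracy. The sketch below checks this on a small made-up pair of label vectors (the values are illustrative, not taken from the models above).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up labels and predictions, for illustration only.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1])

mae = mean_absolute_error(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

# For binary 0/1 labels, each error contributes exactly 1 to the absolute error.
print("MAE = {:.2f}, 1 - accuracy = {:.2f}".format(mae, 1 - accuracy))
```

Also note that in the code above the test set is not inverted, so every test label is 1. The reported MAE is therefore the share of test matches in which the model predicts that player 1 (the actual winner) loses.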
<br><br>
Other variables are also tied to the winner or the loser of the match (such as *winner_age*, *loser_age*, ...).

I will consider those variables in the *inverseHalfDataset* method.
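
One possible way to generalise the swap to every winner/loser pair is sketched below. This is an assumption about how it could be done, not the final implementation: the helper name `swap_half`, the random (rather than index-parity) selection of rows, and the resulting column names such as `player_1_rank_points` are all hypothetical, and the sketch assumes each `winner_*` column has a matching `loser_*` column.

```python
import numpy as np
import pandas as pd

def swap_half(matches, seed=0):
    """Hypothetical helper: swap every winner_*/loser_* pair on a random half of the rows."""
    rng = np.random.default_rng(seed)
    # True where the row keeps its order, i.e. player 1 is the winner.
    keep = rng.random(len(matches)) < 0.5
    result = pd.DataFrame(index=matches.index)
    for winner_col in [c for c in matches.columns if c.startswith("winner_")]:
        suffix = winner_col[len("winner_"):]
        loser_col = "loser_" + suffix  # assumed to exist for every winner_* column
        result["player_1_" + suffix] = np.where(keep, matches[winner_col], matches[loser_col])
        result["player_2_" + suffix] = np.where(keep, matches[loser_col], matches[winner_col])
    result["player_1_wins"] = keep.astype(int)
    return result

# Toy data, invented for illustration.
matches = pd.DataFrame({
    "winner_rank_points": [4000, 2500, 1800, 900],
    "loser_rank_points": [1200, 3000, 600, 1500],
    "winner_age": [27.1, 22.4, 30.0, 25.3],
    "loser_age": [24.5, 29.8, 21.7, 33.2],
})
swapped = swap_half(matches)
print(swapped)
```

Looping over the column prefixes keeps all paired features in sync with the label, so a newly added pair cannot be forgotten when rows are swapped.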

# Feature importance

Before going further, I will use PermutationImportance from sklearn to detect which features seem the most important at first sight. As there are a lot of variables, permutation importance will help