# First look at the dataset

The dataset contains 49 variables. There is **no dependent (target) variable**, so I will create one.

First I tested a very simple model using these features: 
- *winner_rank_points*
- *loser_rank_points*

I chose the *rank_points* features over the *rank* features because the gap in rank points between two consecutive players, for example the 2nd and the 3rd ATP players, can be either small or huge. 
<br>
Rank points are therefore more meaningful than a player's rank. 
<br>
I didn't use both features, to avoid **multicollinearity**, which would weaken my model.
<br><br>
The features *winner_rank_points* and *loser_rank_points* are tied to the outcome of the match: they tell which player won and which player lost. 

To avoid **target leakage**, I renamed those features *p1_points* and *p2_points* and **added the dependent variable** *p1_wins*. At this point *p1_wins* is always 1, so I considered two options to give the dependent variable both values:

1. Add to the dataset its inverse (swap *p1_points* and *p2_points* and set *p1_wins* = 0).
2. Invert 50% of the existing dataset, so that the size of the training set stays the same.

Option 1 has the drawback of doubling the size of the training set, but it might yield better results than option 2. So I decided to test both options.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def inverseDataset(dataset_input):
    '''inverse dataset - for option 1'''
    inversed_dataset = pd.DataFrame()
    inversed_dataset["p1_points"] = dataset_input["p2_points"]
    inversed_dataset["p2_points"] = dataset_input["p1_points"]
    inversed_dataset["p1_wins"] = 0
    return inversed_dataset

def inverseHalfDataset(dataset_input):
    '''inverse 50% of the dataset - for option 2'''
    inversed_dataset = pd.DataFrame()
    inversed_dataset["p1_points"] = np.where(dataset_input.index % 2 == 0, dataset_input["p1_points"] , dataset_input["p2_points"])
    inversed_dataset["p2_points"] = np.where(dataset_input.index % 2 == 0, dataset_input["p2_points"] , dataset_input["p1_points"])
    inversed_dataset["p1_wins"] = np.where(dataset_input.index % 2 == 0, 1, 0)
    return inversed_dataset    

# Read the data
list_datasets = []
for year in range(2000, 2010):
    dataset = pd.read_csv("https://raw.githubusercontent.com/davy-datascience/tennis-prediction/master/datasets/atp_matches_{}.csv".format(year))
    list_datasets.append(dataset)

full_dataset = pd.concat(list_datasets)

features = ["winner_rank_points", "loser_rank_points"]

dataset = full_dataset[features]

#drop rows with null value
dataset = dataset.dropna()

dataset = dataset.rename(columns={'winner_rank_points': 'p1_points', 'loser_rank_points': 'p2_points'})
dataset["p1_wins"] = 1

### OPTION 1
# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)
    
inversed_train = inverseDataset(train)
train = pd.concat([train, inversed_train])

X_train = train[["p1_points", "p2_points"]]
y_train = train.p1_wins
X_test = test[["p1_points", "p2_points"]]
y_test = test.p1_wins

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Predict
y_pred = pd.Series(classifier.predict(X_test), index = y_test.index)
mae = mean_absolute_error(y_pred, y_test)
print("MAE using option 1: {}".format(mae))


### OPTION 2
# Separate the dataset into a training set and a test set
train, test = train_test_split(dataset, test_size = 0.2)
train = inverseHalfDataset(train)

X_train = train[["p1_points", "p2_points"]]
y_train = train.p1_wins
X_test = test[["p1_points", "p2_points"]]
y_test = test.p1_wins

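# Fit a new classifier on the option 2 training set
classifier = LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Predict
y_pred = pd.Series(classifier.predict(X_test), index = y_test.index)
mae = mean_absolute_error(y_pred, y_test)
print("MAE using option 2: {}".format(mae))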

MAE using option 1: 0.3478676002546149
MAE using option 2: 0.34707192870782944


The MAE (mean absolute error) for option 2 is almost equal to, and even slightly lower than, the MAE for option 1. So option 1 doesn't yield better results than option 2; it only increases the dataset size. Therefore I kept the option 2 methodology.
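Because both the labels and the predictions are 0 or 1, the MAE used here is simply the misclassification rate, that is 1 minus the accuracy. A minimal check on made-up labels (not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up binary labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1])
y_hat = np.array([1, 0, 1, 1, 0, 1, 1, 0])

mae = mean_absolute_error(y_true, y_hat)
accuracy = accuracy_score(y_true, y_hat)

# |y_true - y_hat| is 1 for every wrong prediction and 0 otherwise,
# so the two printed values are identical
print(mae, 1 - accuracy)
```

Read this way, an MAE of about 0.347 means that roughly 65% of the test matches were predicted correctly.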
<br><br>
Other variables are also tied to the winner or the loser of the match (such as *winner_age*, *loser_age*, ...).

I will rename those variables with the "*p1_*" and "*p2_*" prefixes and handle them in the *inverseHalfDataset* function.
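With prefixed names, the swap can be written once for all player-related columns. A minimal sketch, assuming every *p1_* column has a matching *p2_* column; `swap_half_dataset` is a hypothetical helper, not the notebook's actual *inverseHalfDataset*, and the sample values are made up:

```python
import numpy as np
import pandas as pd

def swap_half_dataset(df):
    """Swap every p1_*/p2_* column pair on every other row and set p1_wins."""
    out = df.copy()
    # Player-related columns, excluding the target itself
    p1_cols = [c for c in df.columns if c.startswith("p1_") and c != "p1_wins"]
    p2_cols = ["p2_" + c[len("p1_"):] for c in p1_cols]

    # Swap rows at odd positions; rows at even positions are kept as they are
    swap = np.arange(len(df)) % 2 == 1
    out.loc[swap, p1_cols] = df.loc[swap, p2_cols].to_numpy()
    out.loc[swap, p2_cols] = df.loc[swap, p1_cols].to_numpy()

    # In the input p1 is always the winner, so swapped rows get p1_wins = 0
    out["p1_wins"] = np.where(swap, 0, 1)
    return out

# Made-up rows where p1 is the winner
demo = pd.DataFrame({
    "p1_points": [100.0, 200.0, 300.0, 400.0],
    "p2_points": [50.0, 60.0, 70.0, 80.0],
    "p1_age": [20.0, 21.0, 22.0, 23.0],
    "p2_age": [30.0, 31.0, 32.0, 33.0],
})
swapped = swap_half_dataset(demo)
print(swapped)
```

Selecting the rows by position rather than by index value keeps the split at exactly every other row, even when the index contains gaps or duplicates after concatenating several years.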

# Identify variables causing target leakage 

Several variables in this dataset are not available until the match ends. They must be excluded before modeling. However, they are useful for building new features, such as the percentage of first-serve points won in the last 3 matches, the number of break points faced in the last 3 matches, ...

Variables that are not available at the moment I should make predictions are:

"*p1_ace*", "*p1_df*", "*p1_svpt*", "*p1_1stIn*", "*p1_1stWon*", "*p1_2ndWon*", "*p1_SvGms*", "*p1_bpSaved*", "*p1_bpFaced*", "*p2_ace*", "*p2_df*", "*p2_svpt*", "*p2_1stIn*", "*p2_1stWon*", "*p2_2ndWon*", "*p2_SvGms*", "*p2_bpSaved*", "*p2_bpFaced*"

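Once the "last 3 matches" features have been derived from them, these columns can be dropped in one step. A minimal sketch; `drop_leaky_columns` is a hypothetical helper and the sample row is made up:

```python
import pandas as pd

# In-match statistics listed above; they are only known once the match is over
match_stats = ["ace", "df", "svpt", "1stIn", "1stWon", "2ndWon",
               "SvGms", "bpSaved", "bpFaced"]
leaky_cols = [f"{player}_{stat}" for player in ("p1", "p2") for stat in match_stats]

def drop_leaky_columns(df):
    """Return a copy of df without the in-match statistics (absent columns are ignored)."""
    return df.drop(columns=leaky_cols, errors="ignore")

# Made-up row mixing a pre-match feature with in-match statistics
demo = pd.DataFrame({"p1_points": [1200.0], "p1_ace": [7], "p2_bpFaced": [4]})
print(drop_leaky_columns(demo).columns.tolist())
```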

# Feature Engineering



I added some features that will be useful for further pre-processing.

In [None]:
import numba

@numba.vectorize
def divideWithNumba(a, b):
    ''' Divide one column by another column of a dataframe with increased performance thanks to vectorization '''
    return a / b

def getBpSavedRatio(a, b):
    ''' Divide break point saved by break point faced, if no break point faced consider as 1: max ratio'''
    return 1 if b == 0 else (a/b)

dataset["p1_ace_ratio"] = divideWithNumba(dataset["p1_ace"].to_numpy(), dataset["p1_svpt"].to_numpy())
dataset["p2_ace_ratio"] = divideWithNumba(dataset["p2_ace"].to_numpy(), dataset["p2_svpt"].to_numpy())
dataset["p1_df_ratio"] = divideWithNumba(dataset["p1_df"].to_numpy(), dataset["p1_svpt"].to_numpy())
dataset["p2_df_ratio"] = divideWithNumba(dataset["p2_df"].to_numpy(), dataset["p2_svpt"].to_numpy())
dataset["p1_1stIn_ratio"] = divideWithNumba(dataset["p1_1stIn"].to_numpy(), dataset["p1_svpt"].to_numpy())
dataset["p2_1stIn_ratio"] = divideWithNumba(dataset["p2_1stIn"].to_numpy(), dataset["p2_svpt"].to_numpy())
dataset["p1_1stWon_ratio"] = divideWithNumba(dataset["p1_1stWon"].to_numpy(), dataset["p1_svpt"].to_numpy())
dataset["p2_1stWon_ratio"] = divideWithNumba(dataset["p2_1stWon"].to_numpy(), dataset["p2_svpt"].to_numpy())
dataset["p1_2ndWon_ratio"] = divideWithNumba(dataset["p1_2ndWon"].to_numpy(), dataset["p1_svpt"].to_numpy())
dataset["p2_2ndWon_ratio"] = divideWithNumba(dataset["p2_2ndWon"].to_numpy(), dataset["p2_svpt"].to_numpy())
dataset["p1_bpFaced_ratio"] = divideWithNumba(dataset["p1_bpFaced"].to_numpy(), dataset["p1_SvGms"].to_numpy()) # Break points faced per service game
dataset["p2_bpFaced_ratio"] = divideWithNumba(dataset["p2_bpFaced"].to_numpy(), dataset["p2_SvGms"].to_numpy()) # Break points faced per service game
dataset["p1_bpSaved_ratio"] = [getBpSavedRatio(row[0], row[1]) for row in dataset[["p1_bpSaved", "p1_bpFaced"]].to_numpy()]       
dataset["p2_bpSaved_ratio"] = [getBpSavedRatio(row[0], row[1]) for row in dataset[["p2_bpSaved", "p2_bpFaced"]].to_numpy()]       
dataset['tourney_date'] = pd.to_datetime(dataset['tourney_date'], format="%Y%m%d") 
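The ratios above are undefined when the denominator is 0, which is the case *getBpSavedRatio* handles explicitly. A sketch of a plain NumPy alternative that covers both needs in one vectorized call; `safe_ratio` is a hypothetical helper and the input values are made up:

```python
import numpy as np

def safe_ratio(num, den, fill=np.nan):
    """Element-wise num / den, using `fill` wherever den is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.full(num.shape, fill, dtype=float)
    # Divide only where the denominator is non-zero; other cells keep `fill`
    np.divide(num, den, out=out, where=den != 0)
    return out

# fill=1.0 mirrors getBpSavedRatio: no break point faced counts as the maximum ratio
bp_saved_ratio = safe_ratio([3, 0, 2], [4, 0, 5], fill=1.0)
print(bp_saved_ratio)
```

With the default `fill`, a zero denominator yields NaN, which can then be handled by the imputation step instead of propagating an infinite value.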

These new features help me create the following ones:

In [None]:
def getPreviousResults(player_results, index, p1_id, p2_id):
    results_p1 = player_results[p1_id]
    prev_res_p1 = pd.DataFrame([results_p1.loc[i] for i in results_p1.index if i < index])
    
    results_p2 = player_results[p2_id]
    prev_res_p2 = pd.DataFrame([results_p2.loc[i] for i in results_p2.index if i < index])
    
    (
     p1_ace_ratio_last3, p2_ace_ratio_last3, p1_df_ratio_last3, p2_df_ratio_last3, p1_1stIn_ratio_last3, 
     p2_1stIn_ratio_last3, p1_1stWon_ratio_last3, p2_1stWon_ratio_last3, p1_2ndWon_ratio_last3, p2_2ndWon_ratio_last3,
     p1_bpSaved_ratio_last3, p2_bpSaved_ratio_last3, p1_bpFaced_ratio_last3, p2_bpFaced_ratio_last3
     ) = (None, None, None, None, None, None, None, None, None, None, None, None, None, None)
    
    if len(prev_res_p1) > 0 :
        p1_ace_ratio_last3 = prev_res_p1["p1_ace_ratio"].tail(3).mean()
        p1_df_ratio_last3 = prev_res_p1["p1_df_ratio"].tail(3).mean()
        p1_1stIn_ratio_last3 = prev_res_p1["p1_1stIn_ratio"].tail(3).mean()
        p1_1stWon_ratio_last3 = prev_res_p1["p1_1stWon_ratio"].tail(3).mean()
        p1_2ndWon_ratio_last3 = prev_res_p1["p1_2ndWon_ratio"].tail(3).mean()
        p1_bpSaved_ratio_last3 = prev_res_p1["p1_bpSaved_ratio"].tail(3).mean()        
        p1_bpFaced_ratio_last3 = prev_res_p1["p1_bpFaced_ratio"].tail(3).mean()
        
    if len(prev_res_p2) > 0 :
        p2_ace_ratio_last3 = prev_res_p2["p2_ace_ratio"].tail(3).mean()
        p2_df_ratio_last3 = prev_res_p2["p2_df_ratio"].tail(3).mean()
        p2_1stIn_ratio_last3 = prev_res_p2["p2_1stIn_ratio"].tail(3).mean()
        p2_1stWon_ratio_last3 = prev_res_p2["p2_1stWon_ratio"].tail(3).mean()
        p2_2ndWon_ratio_last3 = prev_res_p2["p2_2ndWon_ratio"].tail(3).mean()
        p2_bpSaved_ratio_last3 = prev_res_p2["p2_bpSaved_ratio"].tail(3).mean()
        p2_bpFaced_ratio_last3 = prev_res_p2["p2_bpFaced_ratio"].tail(3).mean()
    
    return (p1_ace_ratio_last3, p2_ace_ratio_last3, p1_df_ratio_last3, p2_df_ratio_last3, 
            p1_1stIn_ratio_last3, p2_1stIn_ratio_last3, p1_1stWon_ratio_last3, p2_1stWon_ratio_last3, 
            p1_2ndWon_ratio_last3, p2_2ndWon_ratio_last3, p1_bpSaved_ratio_last3, p2_bpSaved_ratio_last3,
            p1_bpFaced_ratio_last3, p2_bpFaced_ratio_last3)
    


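import time

# Assumption: player_ids holds every player id that appears as p1 or p2
player_ids = pd.unique(dataset[["p1_id", "p2_id"]].to_numpy().ravel())

player_results = {}

start_time = time.time()
for pid in player_ids:
    # Alternative selection, kept for reference:
    # idx = np.where((dataset["p1_id"] == pid) | (dataset["p2_id"] == pid))
    # all_matchs = dataset.iloc[idx[0]]
    all_matchs = dataset.loc[(dataset["p1_id"] == pid) | (dataset["p2_id"] == pid)]
    all_wins = all_matchs[all_matchs["p1_id"] == pid]
    all_lost = all_matchs[all_matchs["p2_id"] == pid]
    
    # Matches where the player is p2 go through inverseDataset before being merged back in index order
    player_results[pid]= pd.concat([all_wins, inverseDataset(all_lost)]).sort_index()

print("--- %s seconds ---" % (time.time() - start_time))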


start_time = time.time()
# Access the ids by label rather than by position
results = [getPreviousResults(player_results, index, ids["p1_id"], ids["p2_id"]) for index, ids in dataset[["p1_id", "p2_id"]].iterrows()]
print("--- %s seconds ---" % (time.time() - start_time))


dataset["p1_ace_ratio_last3"] = [result[0] for result in results]
dataset["p2_ace_ratio_last3"] = [result[1] for result in results]
dataset["p1_df_ratio_last3"] = [result[2] for result in results]
dataset["p2_df_ratio_last3"] = [result[3] for result in results]
dataset["p1_1stIn_ratio_last3"] = [result[4] for result in results]
dataset["p2_1stIn_ratio_last3"] = [result[5] for result in results]
dataset["p1_1stWon_ratio_last3"] = [result[6] for result in results]
dataset["p2_1stWon_ratio_last3"] = [result[7] for result in results]
dataset["p1_2ndWon_ratio_last3"] = [result[8] for result in results]
dataset["p2_2ndWon_ratio_last3"] = [result[9] for result in results]
dataset["p1_bpSaved_ratio_last3"] = [result[10] for result in results]
dataset["p2_bpSaved_ratio_last3"] = [result[11] for result in results]
dataset["p1_bpFaced_ratio_last3"] = [result[12] for result in results]
dataset["p2_bpFaced_ratio_last3"] = [result[13] for result in results]
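The loop above rebuilds a DataFrame of earlier matches for every row, which is slow on large datasets. A sketch of a vectorized alternative with `groupby`, `shift` and `rolling`, assuming the data has first been reshaped into a long table with one row per (match, player) in chronological order; the `player_id` column, the `last3_mean` helper and the sample values are hypothetical:

```python
import pandas as pd

def last3_mean(long_df, stat):
    """Mean of `stat` over each player's previous matches (up to 3), current match excluded."""
    return long_df.groupby("player_id")[stat].transform(
        # shift(1) hides the current match so that only earlier matches are averaged
        lambda s: s.shift(1).rolling(3, min_periods=1).mean()
    )

# Made-up long table, already sorted chronologically
long_df = pd.DataFrame({
    "player_id": ["A", "B", "A", "A", "B", "A"],
    "ace_ratio": [0.10, 0.20, 0.30, 0.20, 0.40, 0.60],
})
long_df["ace_ratio_last3"] = last3_mean(long_df, "ace_ratio")
print(long_df)
```

A player's first match gets NaN, which plays the same role as the `None` returned by *getPreviousResults* when there is no previous result.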


# Feature Importance

I am using PermutationImportance from eli5 (*eli5.sklearn*) to detect which features seem the most important at first sight.



In [None]:
!pip install eli5

In [None]:
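import eli5
from eli5.sklearn import PermutationImportance
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Separate the dataset into a training set and a test set
# shuffle=False keeps the rows in their original order: the last 20% become the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)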

# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train.columns if
                    X_train[cname].nunique() < 10 and 
                    X_train[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if 
                X_train[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train[my_cols].copy()
X_test = X_test[my_cols].copy()

columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), categorical_cols)], remainder='passthrough')
#remainder='passthrough' : keep other columns (default:'drop')

transformed_data = np.array(columnTransformer.fit_transform(X_train), dtype = str)
transformed_data_test = np.array(columnTransformer.transform(X_test), dtype = str)

X_train = pd.DataFrame(transformed_data, columns=get_ct_feature_names(columnTransformer))
X_test = pd.DataFrame(transformed_data_test, columns=get_ct_feature_names(columnTransformer))

# Impute missing values (SimpleImputer uses the column mean by default)
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_test = pd.DataFrame(my_imputer.transform(X_test))

# Imputation removed the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_test.columns = X_test.columns

X_train = imputed_X_train
X_test = imputed_X_test

my_model = LogisticRegression()
my_model.fit(X_train, y_train)

perm = PermutationImportance(my_model).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

![feature importance](https://raw.githubusercontent.com/davy-datascience/tennis-prediction/master/img/feature_importance_0.PNG)
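scikit-learn also provides `sklearn.inspection.permutation_importance`, which computes the same kind of score without an extra dependency. A minimal sketch on synthetic data (not the tennis dataset); the column names `points_diff` and `noise` are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "points_diff": rng.normal(size=n),  # informative by construction
    "noise": rng.normal(size=n),        # unrelated to the target
})
y = (X["points_diff"] + 0.5 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Shuffle each column of the test set in turn and measure the drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

A feature whose shuffling barely changes the score, like `noise` here, contributes little to the model's predictions.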

# Suggested improvement

-