In [6]:
import pandas as pd, numpy as np
import importlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
mpl.rcParams['figure.figsize'] = (12,8)

In [7]:
import sys
sys.path.append("../../")
# Importing my helper module
from tennis_predictor import clean_data

In [10]:
data = pd.read_csv("../../independant_observations.csv", index_col=0, low_memory=False)

# 1/ Exploratory Data Analysis Summary

## 1.A The baseline
The baseline could be to predict that the better ranked player will win the tennis game. This yields an accuracy of **65.68%**

## 1.B Adding derived features

** As a reminder the features that are provided upfront by the dataset are relatively limited: **

In [16]:
data.loc[:, [c for c in data.columns if "__" not in c]].head()

Unnamed: 0,Location,Tournament,Date,Series,Court,Surface,Round,Best of,P1_Name,P1_Rank,...,P2_Name,P2_Rank,P2_1,P2_2,P2_3,P2_4,P2_5,P2_Sets,Player1Wins,RankDiff
14518,Adelaide,Next Generation Hardcourts,2005-01-03,International,Outdoor,Hard,1st Round,3,Saulnier C.,53.0,...,Baccanello P.,324.0,2.0,6.0,,,,0.0,True,-271.0
14519,Adelaide,Next Generation Hardcourts,2005-01-03,International,Outdoor,Hard,1st Round,3,Enqvist T.,72.0,...,Sluiter R.,82.0,3.0,1.0,,,,0.0,True,-10.0
14520,Adelaide,Next Generation Hardcourts,2005-01-03,International,Outdoor,Hard,1st Round,3,Melzer J.,39.0,...,Berdych T.,45.0,4.0,6.0,6.0,,,1.0,True,-6.0
14521,Adelaide,Next Generation Hardcourts,2005-01-03,International,Outdoor,Hard,1st Round,3,Rochus O.,66.0,...,Dupuis A.,79.0,3.0,6.0,1.0,,,1.0,True,-13.0
14522,Adelaide,Next Generation Hardcourts,2005-01-03,International,Outdoor,Hard,1st Round,3,Mayer F.,35.0,...,Arthurs W.,101.0,4.0,6.0,5.0,,,1.0,True,-66.0


We basically have the names of the players, their ranks, and various fields about the type of game:
* The Series: "Grand Slam", "ATP500", etc.
* The type of surface, if it is indoor or outdoor
* Tournament name, location
* The Round (Final, 1st Round etc.)
I deliberately ignored the in-game statistics (eg. 6-4 6-4 6-4), as well as the various betting odds that were provided

** As part of the EDA work, I have added some derived features that basically represent that performance of both players until that point in time **

In [23]:
P1_custom_columns = [c for c in data.columns if "__" in c and "P1" in c]
P1_custom_columns

['P1__TOTAL__Played',
 'P1__TOTAL__Won',
 'P1__TOTAL__Won_1st Round',
 'P1__TOTAL__Won_2nd Round',
 'P1__TOTAL__Won_3rd Round',
 'P1__TOTAL__Won_4th Round',
 'P1__TOTAL__Won_ATP250',
 'P1__TOTAL__Won_ATP500',
 'P1__TOTAL__Won_Grand Slam',
 'P1__TOTAL__Won_International',
 'P1__TOTAL__Won_International Gold',
 'P1__TOTAL__Won_International Series',
 'P1__TOTAL__Won_Masters',
 'P1__TOTAL__Won_Masters 1000',
 'P1__TOTAL__Won_Masters Cup',
 'P1__TOTAL__Won_Quarterfinals',
 'P1__TOTAL__Won_Round Robin',
 'P1__TOTAL__Won_Semifinals',
 'P1__TOTAL__Won_The Final',
 'P1__3M__Played',
 'P1__3M__Won',
 'P1__3M__Won_1st Round',
 'P1__3M__Won_2nd Round',
 'P1__3M__Won_3rd Round',
 'P1__3M__Won_4th Round',
 'P1__3M__Won_ATP250',
 'P1__3M__Won_ATP500',
 'P1__3M__Won_Grand Slam',
 'P1__3M__Won_International',
 'P1__3M__Won_International Gold',
 'P1__3M__Won_International Series',
 'P1__3M__Won_Masters',
 'P1__3M__Won_Masters 1000',
 'P1__3M__Won_Masters Cup',
 'P1__3M__Won_Quarterfinals',
 'P1__3M__Wo

For each player, we have to total number of games played and won until that point in time (to be exact until just the day before), as well as how many specific types of wins they have:
* The number of 1st Round wins, the number of Final wins (win for each type of Round)
* The number of Gramd Slam wins (win for each type of ATP Series)

On top of the totals, those features are duplicated with an exponential moving average with a 3M half life

All of those features have been **normalized** (removed mean and divided by std())

# 2/ Choosing what features to derive in preparation for modelling

The difficult for this problem is that most of the features used to model are custom engineered features. My first intuition was basically that:
* The more a player wins, the more likely he is to win again!
* Not all wins are worth the same, a final vs a 1st Round, a Grand Slam vs an ATP250 win

To capture those ideas I had derived the TOTAL wins as well as the 3M exponential moving average for the following reasons:
* So that I could capture both the performance of the players since the beginning of time
* And the *current dynamic* of the players, as hilighted by the graphs in the EDA

## 2.A Choosing the half-life ot the exponential moving average

I experimented with various values for the the width of the exponential moving average: 1Y, 6M, 3M, 1M etc.
**I noticed that the precision of a Logistic Regression model was converging towards 1 as the width of the exponential moving average decreased**
This was due to a mistake of mine: When predicting game happing on day D, I was using the performance of both players **until day D inclusive**, and basically relying on the very outcome I'm trying to predict...

I have corrected this by basically considering the performance of both players **until D-1** only. Reducing the width of the moving average doesn't make the precision of a Logistic Regresison converge towards 1 anymore

Using a basic cross validation technique, (looking at the precision of a basic default logistic regression) I found that 3M gives the best boost in precision so I stuck with that

# 3/ Modelling

## 3.A First look at a Logistic Regression

In [25]:
columns = [c for c in data if "__" in c]  # All the various engineered features have "__" in their names

In [28]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
X = data[columns]
Y = data.Player1Wins

lr.fit(X, Y)
lr.score(X, Y)

0.67688804317309792

As mentioned in the EDA, we beat the precision of the baseline by about 2%

Let's look at the various weights:

In [62]:
col_and_coef = list(zip(columns, lr.coef_[0]))
col_coefs_df = pd.DataFrame(
    {
        "Column": [t[0] for t in col_and_coef],
        "Coef": [t[1] for t in col_and_coef]
    },
)
col_coefs_df = col_coefs_df.assign(
    Abs_Coef=col_coefs_df.Coef.abs()
).sort_index(axis=1)

P1_col_coefs = col_coefs_df[col_coefs_df.Column.str.contains("P1")].sort_values(by="Abs_Coef", ascending=False)
P2_col_coefs = col_coefs_df[col_coefs_df.Column.str.contains("P2")].sort_values(by="Abs_Coef", ascending=False)

In [66]:
P1_col_coefs.head(20)

Unnamed: 0,Abs_Coef,Coef,Column
20,3.229196,3.229196,P1__3M__Won
19,2.982304,2.982304,P1__3M__Played
22,0.856424,0.856424,P1__3M__Won_2nd Round
27,0.793754,0.793754,P1__3M__Won_Grand Slam
31,0.742952,0.742952,P1__3M__Won_Masters
32,0.641598,0.641598,P1__3M__Won_Masters 1000
23,0.516324,0.516324,P1__3M__Won_3rd Round
34,0.500913,0.500913,P1__3M__Won_Quarterfinals
36,0.423505,0.423505,P1__3M__Won_Semifinals
21,0.345601,0.345601,P1__3M__Won_1st Round


In [67]:
P2_col_coefs.head(20)

Unnamed: 0,Abs_Coef,Coef,Column
57,3.426889,-3.426889,P2__3M__Played
58,2.40575,-2.40575,P2__3M__Won
59,0.8085,-0.8085,P2__3M__Won_1st Round
60,0.693247,-0.693247,P2__3M__Won_2nd Round
66,0.518213,-0.518213,P2__3M__Won_International
63,0.481536,-0.481536,P2__3M__Won_ATP250
72,0.347094,-0.347094,P2__3M__Won_Quarterfinals
69,0.346438,-0.346438,P2__3M__Won_Masters
25,0.302849,0.302849,P1__3M__Won_ATP250
64,0.286692,-0.286692,P2__3M__Won_ATP500
