## 1/ Project problem and hypothesis

* My goal is to predict the winner of a given tennis match (ATP circuit), given some basic information about the match:
 * The date of the match, the tournament, the stage of the tournament (1st round, final etc.), the type of surface, the two players and their current ATP ranks
 * We can also look at the past and derive more features about the two players, for example how well they performed in the current season, their win percentage etc.

A real-life application would be to gain insight into which factors influence a player's odds of winning a match, which could be useful for betting, for example.

This problem can be modelled as a classification problem, where we try to predict whether a player will win or lose given what we know about the configuration of the match together with the history of both players.

I believe the following types of predictor features will have an impact on the outcome:
* The overall win percentage of a given player, their win percentage on a given type of surface, how far they usually go in tournaments, in which tournaments etc.

## 2/ Dataset

The dataset is from Kaggle; it has been compiled from various sources.


In [9]:
import pandas as pd, numpy as np
df = pd.read_csv("../../data.csv", low_memory=False)
COLUMNS = [
    "Location", "Tournament", "Date",
    "Series",
    "Court", "Surface",
    "Round", "Best of",
    "Winner", "Loser", "WRank", "LRank",
]
GAME_STATS_COLS = [
    "W1", "L1", "W2", "L2", "W3", "L3", "W4", "L4", "W5", "L5",
    "Wsets", "Lsets"
]
df = df[COLUMNS + GAME_STATS_COLS]
df.iloc[:, :20].head()

Unnamed: 0,Location,Tournament,Date,Series,Court,Surface,Round,Best of,Winner,Loser,WRank,LRank,W1,L1,W2,L2,W3,L3,W4,L4
0,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Dosedel S.,Ljubicic I.,63,77,6.0,4.0,6.0,2.0,,,,
1,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Enqvist T.,Clement A.,5,56,6.0,3.0,6.0,3.0,,,,
2,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Escude N.,Baccanello P.,40,655,6.0,7.0,7.0,5.0,6.0,3.0,,
3,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Federer R.,Knippschild J.,65,87,6.0,1.0,6.0,4.0,,,,
4,Adelaide,Australian Hardcourt Championships,3/01/2000,International,Outdoor,Hard,1st Round,3,Fromberg R.,Woodbridge T.,81,198,7.0,6.0,5.0,7.0,6.0,4.0,,


In [11]:
df.iloc[:, 20:].head()

Unnamed: 0,W5,L5,Wsets,Lsets
0,,,2.0,0.0
1,,,2.0,0.0
2,,,2.0,1.0
3,,,2.0,0.0
4,,,2.0,1.0


The data is pretty self-explanatory. It contains some general information about the match: location, date, whether the match is played indoors or outdoors, the type of surface, and the maximum number of sets in the match ("Best of": 3 or 5, so a player needs 2 or 3 sets to win). It also contains the names of the winner and loser, their ranks, and the breakdown of their scores for the sets that make up the match.

**There are some other columns in that CSV file (betting odds from various betting companies); I've chosen not to use those for now.**

## 3/ Potential methods and models
A logistic regression could be a good candidate model for this problem, given that the features I have in mind are likely to increase the odds that a given player will win a match. For example:
* The higher a player's win percentage, the higher his chances of winning a match
* The higher the percentage of past matches between Player 1 and Player 2 that were won by Player 1, the higher we would expect the odds of Player 1 winning to be
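
As a sketch of what this means for a logistic regression (the symbols $p$, $x_i$ and $\beta_i$ are notation introduced here for illustration): the model assumes that the log-odds of Player 1 winning are a linear function of the features.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

Here $p$ is the probability that Player 1 wins and $x_1, \dots, x_k$ are the features (win percentages, head-to-head record etc.). A positive coefficient $\beta_i$ means that increasing $x_i$ by one unit multiplies the odds of winning by $e^{\beta_i}$, which matches the "the higher X, the higher the odds" intuition above.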

### 3.A Deriving features
The difficulty here will be to derive good features from the data that we have available. The end goal of this part of the project is to end up with **independent rows**.

If we have a match occurring on 01/01/2010 between Player A and Player B:
* Look at the win percentage of Player A/B up until that point in time, their win percentage on the specific surface or in the specific tournament of that observation etc.
* Another parameter to play with is the size of the *rolling window*, to give more or less importance to events that are more distant in the past:
 * For example we could have the win percentage of Player A since the *beginning of time*; another predictor could be the win percentage over the last 6 months only
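
A minimal sketch of the "win percentage up until that point in time" idea, on toy data. The column names `Date`, `Winner` and `Loser` follow the dataset; `long_df`, `won` and `win_pct_before` are names made up for this example, and it assumes a player has at most one match per date.

```python
import pandas as pd

# Toy matches in the same shape as the Winner/Loser columns of the dataset.
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-01-05", "2010-02-01", "2010-03-01"]),
    "Winner": ["A", "A", "B", "A"],
    "Loser": ["B", "C", "A", "B"],
})

# Long format: one row per (match, player) with a 0/1 "won" flag.
wins = matches[["Date", "Winner"]].rename(columns={"Winner": "Player"}).assign(won=1)
losses = matches[["Date", "Loser"]].rename(columns={"Loser": "Player"}).assign(won=0)
long_df = pd.concat([wins, losses]).sort_values(["Player", "Date"]).reset_index(drop=True)

# Win rate over all *earlier* matches of each player: the current match's own
# result is subtracted so that the feature never sees the outcome it describes.
grouped = long_df.groupby("Player")["won"]
prior_wins = grouped.cumsum() - long_df["won"]
prior_matches = grouped.cumcount()
long_df["win_pct_before"] = (prior_wins / prior_matches).where(prior_matches > 0)

print(long_df)
```

A player's first match has no history, so the feature is left as `NaN` there rather than being filled with a guess.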
 
So conceptually the derived features would be a DataFrame with a two-level index:
* The date
* The name of the player

And the columns would be the features described above.

In [27]:
df2 = pd.DataFrame({
    "Date": ["2005-01-01"] * 5 + ["2005-02-01"] * 5,
    "Player": ["Player{}".format(i) for i in range(1, 6)] * 2,
    "% Win so far": np.random.rand(10)
})
df2.set_index(["Date", "Player"])

Unnamed: 0_level_0,Unnamed: 1_level_0,% Win so far
Date,Player,Unnamed: 2_level_1
2005-01-01,Player1,0.357124
2005-01-01,Player2,0.095217
2005-01-01,Player3,0.026947
2005-01-01,Player4,0.334314
2005-01-01,Player5,0.620013
2005-02-01,Player1,0.717505
2005-02-01,Player2,0.81622
2005-02-01,Player3,0.953458
2005-02-01,Player4,0.464392
2005-02-01,Player5,0.107634


We would then conceptually "join" this data to the initial dataframe, once for each of the two players.
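
One way this join could look, sketched on toy data with `DataFrame.merge`. The helper `add_player_features` and the prefixed column names are assumptions made for this example, not part of the dataset.

```python
import pandas as pd

matches = pd.DataFrame({
    "Date": ["2005-02-01", "2005-02-01"],
    "Winner": ["Player1", "Player3"],
    "Loser": ["Player2", "Player4"],
})
features = pd.DataFrame({
    "Date": ["2005-02-01"] * 4,
    "Player": ["Player1", "Player2", "Player3", "Player4"],
    "% Win so far": [0.72, 0.82, 0.95, 0.46],
})

def add_player_features(matches, features, role):
    """Left-join the per-(Date, Player) features onto the `role` column ("Winner" or "Loser")."""
    renamed = features.rename(
        columns={"Player": role, "% Win so far": f"{role} % Win so far"}
    )
    return matches.merge(renamed, on=["Date", role], how="left")

# Join once per player so that each match row carries both players' features.
joined = add_player_features(matches, features, "Winner")
joined = add_player_features(joined, features, "Loser")
print(joined)
```

A left join keeps every match even when a player has no feature row for that date, in which case the feature is `NaN`.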

### 3.B Cross validation
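Cross validation isn't straightforward here: the matches are ordered in time, so a random split would let the model train on matches played after the ones it is tested on. An approach could be to:
* Train on data up until the end of 2008, test on matches at the beginning of 2009
* Repeat for various other points in time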

Another thing to note is that most of the features will be themselves derived from past data as described in the previous section, and we need a minimum of *past data* to derive those features.

So an approach could be to:
* Compute the derived features only starting from **2005** for example, so that we have at least five years of historical data to compute the various percentage-type features.
* For cross-validation we would then train the models on the data from 2005 to 2010 and test on the matches in 2011, for example
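
A minimal sketch of such a time-based split. The generator `time_splits` is a hypothetical helper written for this example, and it assumes the `Date` column has already been parsed into datetimes.

```python
import pandas as pd

def time_splits(df, first_test_year, last_test_year, date_col="Date"):
    """Yield (test_year, train, test): train is every row before `test_year`, test is that year."""
    years = df[date_col].dt.year
    for test_year in range(first_test_year, last_test_year + 1):
        yield test_year, df[years < test_year], df[years == test_year]

# Toy data: one match per year from 2005 to 2012.
toy = pd.DataFrame({"Date": pd.to_datetime([f"{y}-06-01" for y in range(2005, 2013)])})

for test_year, train, test in time_splits(toy, 2011, 2012):
    print(test_year, len(train), len(test))
```

Each fold only ever trains on matches played before the test year, so the training window grows as the test year moves forward.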

### 3.C Other models
Besides logistic regression, it will also be interesting to assess other models such as KNN or decision trees. For example, decision trees could be appropriate if it turns out that the behaviour of the observations differs a lot depending on whether some features are above or below a certain threshold (maybe predicting the outcome of matches involving the top-ranked players differs a lot from predicting the other ones).

## 4/ Domain knowledge
I played some tennis myself when I was younger and I follow some tournaments on TV, so I have some context and intuition about what kind of features will be helpful to feed the model.

I found a couple of other models on the internet where people try to predict the outcome of a match by modelling its evolution step by step, for example with Markov chains covering all the ways a single game can evolve: https://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/a.madurska%20.pdf, where the accuracy is around 65-70% for various tournaments.

## 5/ Project concerns

### 5.A Renaming some columns for class balance
If we look at the data that we have, the players are already labelled as "Winner" and "Loser". Obviously the "Winner" player always wins, so we can't just fit a model on "does the player in the Winner column win?", as they all do!
For every observation we'd have to assign one player to be "Player1" and the other to be "Player2", such that Player1 doesn't always win in our dataset.
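
One possible way to do this is a random assignment per match, sketched below. The function `assign_players` and the output columns `P1Rank`, `P2Rank` and `Player1Wins` are names invented for this example; the input columns follow the dataset.

```python
import numpy as np
import pandas as pd

def assign_players(matches, seed=0):
    """Randomly map (Winner, Loser) to (Player1, Player2) and add a 0/1 target `Player1Wins`."""
    rng = np.random.default_rng(seed)
    # For roughly half of the matches, the winner becomes Player1.
    p1_is_winner = rng.random(len(matches)) < 0.5
    out = pd.DataFrame(index=matches.index)
    out["Player1"] = np.where(p1_is_winner, matches["Winner"], matches["Loser"])
    out["Player2"] = np.where(p1_is_winner, matches["Loser"], matches["Winner"])
    out["P1Rank"] = np.where(p1_is_winner, matches["WRank"], matches["LRank"])
    out["P2Rank"] = np.where(p1_is_winner, matches["LRank"], matches["WRank"])
    out["Player1Wins"] = p1_is_winner.astype(int)
    return out

matches = pd.DataFrame({
    "Winner": ["Dosedel S.", "Enqvist T.", "Escude N.", "Federer R."],
    "Loser": ["Ljubicic I.", "Clement A.", "Baccanello P.", "Knippschild J."],
    "WRank": [63, 5, 40, 65],
    "LRank": [77, 56, 655, 87],
})
balanced = assign_players(matches)
print(balanced)
```

On a large dataset this gives roughly balanced classes, and fixing the seed keeps the assignment reproducible between runs.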


### 5.B Working with time series

I am not yet very familiar with how to compute time-series features. I'll probably be using the pandas **rolling** method, for example.
There's probably quite some work around extracting and transforming the data to reach the state where I have independent observations, but I'm confident I'll get there.
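
As a rough sketch of what the pandas `rolling` method could do here: on a Series indexed by date, `rolling` accepts a time offset such as `"180D"`, and `closed="left"` excludes the current row from its own window. The helper `rolling_win_pct` and the toy results are made up for illustration, and the sketch handles a single player only.

```python
import pandas as pd

def rolling_win_pct(results, window="180D"):
    """Win rate over the `window` before each match, for one player.

    `results` is a 0/1 Series indexed by match date, sorted in time.
    closed="left" makes the window [t - window, t), so the match at t is excluded.
    """
    return results.rolling(window, closed="left").mean()

# Toy history for one player: 1 = win, 0 = loss.
results = pd.Series(
    [1, 0, 1, 1],
    index=pd.to_datetime(["2010-01-01", "2010-03-01", "2010-05-01", "2010-12-01"]),
)
recent = rolling_win_pct(results)
print(recent)
```

The last match has no other match in the preceding 180 days, so its feature is `NaN`, just like the very first match.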

### 5.C Quality of the derived features
This is really the heart of the project. Hopefully it will be relatively easy to guess features that contribute to the quality of the prediction, but I have no certainty there.

### 5.D Is the data enough?
Two things here:
* I decided not to use more fine-grained data about the matches (for example the % of 1st serves, of aces etc.) and chose to stay at a relatively high level, i.e. looking at when players win and under what conditions (against whom, in which tournament, on which type of surface etc.). So there is a risk that this data doesn't capture enough knowledge to produce good predictions.
* Even if I were using that additional data, there will always be factors that aren't captured in the data (the psychology of the players on the day, their actual style, the strategy that they decide to use on the day etc.). In essence we're trying to predict the interaction between two human beings, and we'll never have "all the data" to model the situation.

A more specific concern here is that the **ATP Rank** of the players is already a very good predictor (i.e. the better-ranked player should win) and that it might be difficult to beat. I guess there are two approaches here:
* Consider the **ATP Rank** as the baseline, and try to incorporate more features to further increase the quality of that baseline model
* Ignore **ATP Rank** and try to predict the winner with "my own" features
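
The rank baseline itself can be measured directly from the `WRank` and `LRank` columns. The sketch below uses a hypothetical helper, `rank_baseline_accuracy`, and coerces the ranks to numbers in case some entries are missing or non-numeric (an assumption about the raw file, not something checked here).

```python
import pandas as pd

def rank_baseline_accuracy(matches):
    """Share of matches won by the better-ranked player (lower rank number = better)."""
    w = pd.to_numeric(matches["WRank"], errors="coerce")
    l = pd.to_numeric(matches["LRank"], errors="coerce")
    known = w.notna() & l.notna()  # ignore matches where either rank is unknown
    return (w[known] < l[known]).mean()

# Ranks of the five Adelaide matches shown in the Dataset section.
sample = pd.DataFrame({"WRank": [63, 5, 40, 65, 81], "LRank": [77, 56, 655, 87, 198]})
print(rank_baseline_accuracy(sample))
```

In these five matches the better-ranked player always won; the figure over the whole dataset is the number any model would have to beat.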

### 5.E Is the data correct?
Given that the dataset is relatively popular on Kaggle, I would assume that it is mostly correct. I might still double-check that a given player's name is always spelled the same way throughout all of the observations, otherwise the historical features could be wrong!
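
A crude sketch of such a spelling check: names are reduced to a normalized key (lowercase, letters only) and any key with more than one raw spelling is reported. The helper `spelling_variants` and the toy names are invented for this example, and the normalization would also drop accented characters, so it is only a first pass.

```python
import pandas as pd

def spelling_variants(matches):
    """Return, per normalized name, the raw spellings when there is more than one."""
    names = pd.concat([matches["Winner"], matches["Loser"]], ignore_index=True)
    names = names.dropna().astype(str)
    # Normalized key: lowercase, with everything except a-z removed.
    key = names.str.lower().str.replace(r"[^a-z]", "", regex=True)
    variants = names.groupby(key).unique()
    return variants[variants.map(len) > 1]

toy = pd.DataFrame({
    "Winner": ["Federer R.", "Federer R", "Nadal R."],
    "Loser": ["Nadal R.", "Hewitt L.", "federer r."],
})
suspects = spelling_variants(toy)
print(suspects)
```

Any name that shows up here would need to be mapped to a single spelling before computing the historical features.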

## 6/ Outcomes

I expect the overall work to be relatively complex: crafting various features which are not correlated, experimenting in a lot of different ways etc. I might also derive non-linear features (e.g. the product of two features) to enhance the model.
The model itself will hopefully be a simple logistic regression, so the complexity will probably be more on the feature engineering side.

As said above success could be defined in several different ways:
* I've been able to improve substantially on the baseline (ATP Rank based predictor) by incorporating new features
* If the Rank alone is already too good, then achieving a "good enough" accuracy, hopefully quite close to the baseline of the ATP Ranks (65%), would qualify as success?

If the project goes bust, I guess this would mean that the accuracy score won't be very good... But I will have been through the whole process anyway, so maybe the actual score isn't that important in the end?