# The shape of football games

## The dataset

The original data is available on [Kaggle](https://www.kaggle.com/hugomathien/soccer). In brief, it contains the following tables.

## Tables

* Team: three id keys that relate it to the other tables, plus the long and short name of each team.
* Team_Attributes: historical attribute updates for each team (not used in our model).
* Player: general player information such as `name`, `birthday`, `weight` and `height`.
* Player_Attributes: historical attribute updates for each player. This table is linked to the `Player` table by `player_fifa_api_id`.
* Match: the most important table. Each row describes a match through its `date`, `season` and `league`, the ids of the two teams, and the ids of the 22 starting players together with their positions on the field.
* League and Country: the name of each league and its home country.
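
As a rough illustration of how `Player` and `Player_Attributes` relate through `player_fifa_api_id`, the sketch below builds a tiny in-memory SQLite database and joins the two tables. The rows and player names are made up, and only a few columns are included; the real Kaggle file has many more.

```python
import sqlite3
import pandas as pd

# Toy in-memory stand-in for the Kaggle SQLite file (all values are invented).
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "player_fifa_api_id": [1, 2],
    "player_name": ["Player A", "Player B"],
}).to_sql("Player", conn, index=False)
pd.DataFrame({
    "player_fifa_api_id": [1, 1, 2],
    "date": ["2014-09-18", "2015-03-20", "2015-01-09"],
    "overall_rating": [80, 82, 75],
}).to_sql("Player_Attributes", conn, index=False)

# One row per attribute update, with the player's name attached.
query = """
SELECT p.player_name, a.date, a.overall_rating
FROM Player AS p
JOIN Player_Attributes AS a
  ON p.player_fifa_api_id = a.player_fifa_api_id
ORDER BY p.player_name, a.date
"""
player_history = pd.read_sql_query(query, conn)
conn.close()
print(player_history)
```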

<img src="FootballTDA.png"> 

In [1]:
from database import Database 
from cross_validation import extract_features_for_prediction
import pandas as pd
import numpy as np
from numpy import random
import soccer_basics 
from random import expovariate, gauss
from sklearn.ensemble import RandomForestClassifier
from utils import read_pickle
from notebook_functions import *

## Load the tables

The class `Database` manages the tables and lets us modify the teams.

In [2]:
database = Database()

## Modify teams

The method `hire_player` moves your favorite player to a team of your choice, so that you can simulate how the championship would have gone. You only need to select the team that receives the player and then the player to be replaced. Teams are sorted by the total number of points they collected during the championship, and players by the number of appearances they made that season.
Let's see how things would have gone.

**Note**: the higher the number of appearances of the player to be replaced, the greater the impact of the hired player!

In [3]:
new_player_df = database.hire_player()

Choose one league between "serie a" and "Premier League".
premier league
       team_long_name team_short_name  total_point
              Chelsea             CHE           87
      Manchester City             MCI           79
              Arsenal             ARS           75
    Manchester United             MUN           70
    Tottenham Hotspur             TOT           64
            Liverpool             LIV           62
          Southampton             SOU           60
         Swansea City             SWA           56
           Stoke City             STK           54
       Crystal Palace             CRY           48
              Everton             EVE           47
      West Ham United             WHU           47
 West Bromwich Albion             WBA           44
       Leicester City             LEI           41
     Newcastle United             NEW           39
           Sunderland             SUN           38
          Aston Villa             AVL           38
         



        player_name  appearance  (overall_rating, mean)
         John Terry          38               83.571429
 Branislav Ivanovic          38               80.285714
        Eden Hazard          38               88.250000
      Nemanja Matic          35               83.222222
        Gary Cahill          33               82.000000
      Cesc Fabregas          33               85.384615
  Cesar Azpilicueta          29               81.000000
            Willian          28               82.181818
              Oscar          26               83.700000
        Diego Costa          24               85.300000
            Ramires          11               80.333333
        Filipe Luis           9               81.875000
      Didier Drogba           8               81.000000
         Kurt Zouma           7               76.333333
          Loic Remy           6               80.125000
     John Obi Mikel           6               77.600000
    Andre Schuerrle           5               79



In [4]:
new_player_df.head()

Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,...,away_avg_attack,away_avg_defense,home_std_attack,home_std_defense,away_std_attack,away_std_defense,home_best_attack,home_best_defense,away_best_attack,away_best_defense
0,4009,1729,1729.0,2014/2015,1.0,2014-08-16,1723982.0,9825.0,9826.0,2.0,...,61.741176,61.033333,88.754445,75.226878,85.293559,75.504169,79.176471,81.833333,73.352941,77.666667
1,4010,1729,1729.0,2014/2015,1.0,2014-08-18,1723983.0,8191.0,8455.0,1.0,...,70.647059,64.033333,85.764605,79.205125,86.032886,74.597898,68.411765,69.833333,82.764706,81.0
2,4011,1729,1729.0,2014/2015,1.0,2014-08-16,1723984.0,8197.0,8668.0,2.0,...,66.411765,65.583333,85.587291,70.171042,85.614542,78.262409,69.941176,72.666667,77.0,80.166667
3,4012,1729,1729.0,2014/2015,1.0,2014-08-17,1723985.0,8650.0,8466.0,2.0,...,63.876471,65.083333,83.468412,74.8048,87.541719,80.228135,79.235294,81.166667,76.176471,78.833333
4,4013,1729,1729.0,2014/2015,1.0,2014-08-16,1723986.0,10260.0,10003.0,1.0,...,64.529412,60.966667,83.228847,71.130742,87.085492,78.862099,82.764706,81.666667,75.294118,78.0


Get the team ids, which will be used later.

In [5]:
team_ids = get_team_ids(new_player_df)

We want to make sure that the column order is the same as in the training set.

In [6]:
new_players_df_stats = get_useful_cols(new_player_df)
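
The internals of `get_useful_cols` are not shown here. As a minimal sketch of the idea, assuming the training column order is available as a list (`train_columns` below is hypothetical), selecting with that list both drops unused columns and enforces the order:

```python
import pandas as pd

# Hypothetical training-time column order (shortened for illustration).
train_columns = ["home_avg_attack", "home_avg_defense", "away_avg_attack"]

# New data with the same columns in a different order, plus an extra one.
new_df = pd.DataFrame({
    "away_avg_attack": [61.7],
    "extra_col": [1],
    "home_avg_defense": [65.3],
    "home_avg_attack": [69.5],
})

# Fail loudly if a training column is missing instead of silently misaligning.
missing = [c for c in train_columns if c not in new_df.columns]
if missing:
    raise KeyError(f"missing columns: {missing}")

# Selecting with the training list drops extras and fixes the order.
aligned = new_df[train_columns]
print(list(aligned.columns))
```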

In [7]:
new_players_df_stats.head()

Unnamed: 0,home_best_attack,home_best_defense,home_avg_attack,home_avg_defense,home_std_attack,home_std_defense,gk_home_player_1,away_avg_attack,away_avg_defense,away_std_attack,away_std_defense,away_best_attack,away_best_defense,gk_away_player_1
0,79.176471,81.833333,69.541176,65.266667,88.754445,75.226878,80.0,61.741176,61.033333,85.293559,75.504169,73.352941,77.666667,74.0
1,68.411765,69.833333,59.929412,59.0,85.764605,79.205125,70.0,70.647059,64.033333,86.032886,74.597898,82.764706,81.0,84.0
2,69.941176,72.666667,60.223529,52.05,85.587291,70.171042,74.0,66.411765,65.583333,85.614542,78.262409,77.0,80.166667,81.0
3,79.235294,81.166667,65.858824,64.816667,83.468412,74.8048,82.0,63.876471,65.083333,87.541719,80.228135,76.176471,78.833333,75.0
4,82.764706,81.666667,67.217647,56.783333,83.228847,71.130742,83.0,64.529412,60.966667,87.085492,78.862099,75.294118,78.0,75.0


## Feature selection

To decide which attributes belong to which group, we built a correlation matrix of the player attributes. It showed two large groups of attributes that are strongly correlated with each other. We therefore split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones.
Finally, since goalkeepers have completely different statistics from the other players, we only take their overall rating into account.
The features used for each player are listed below:
* **Attack**: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
* **Defense**: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
* **Goalkeeper**: "overall_rating"
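
To illustrate the kind of block structure a correlation matrix can reveal, here is a small sketch on synthetic data. The two latent skills and the noise level are assumptions made for the example; the real matrix is computed on the `Player_Attributes` columns listed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_players = 500

# Two hidden skills drive the observed attributes (synthetic assumption).
attack_skill = rng.normal(size=n_players)
defense_skill = rng.normal(size=n_players)

def noisy(skill):
    """Observed attribute = hidden skill + independent noise."""
    return skill + rng.normal(scale=0.5, size=n_players)

players = pd.DataFrame({
    "finishing": noisy(attack_skill),
    "dribbling": noisy(attack_skill),
    "marking": noisy(defense_skill),
    "standing_tackle": noisy(defense_skill),
})

corr = players.corr()
print(corr.round(2))

# Attributes in the same group correlate strongly, across groups weakly.
within_attack = corr.loc["finishing", "dribbling"]
between_groups = corr.loc["finishing", "marking"]
```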

Starting from this set of features, we compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.

Then, for each team in a given match, we compute the mean and the standard deviation of these attack and defense values over the team's players, as well as the best attack and the best defense.

In this way, each match is described by 14 features, seven per team (goalkeeper overall rating; best, mean and standard deviation of attack; best, mean and standard deviation of defense). These features place the match in a feature space determined by the characteristics of the two teams.
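
The aggregation described above can be sketched as follows. This is only an illustration: the attribute lists are shortened, the lineups contain three players per team instead of eleven, and all values are invented. The output names follow the columns shown earlier in the notebook.

```python
import pandas as pd

ATTACK = ["finishing", "dribbling"]       # shortened list, for illustration only
DEFENSE = ["marking", "standing_tackle"]  # shortened list, for illustration only

# Toy lineups: one goalkeeper and two outfield players per team.
players = pd.DataFrame({
    "team": ["home"] * 3 + ["away"] * 3,
    "is_gk": [True, False, False, True, False, False],
    "overall_rating": [80, 78, 82, 74, 70, 76],
    "finishing": [10, 70, 90, 12, 60, 80],
    "dribbling": [20, 80, 80, 14, 70, 70],
    "marking": [15, 60, 40, 11, 50, 70],
    "standing_tackle": [15, 70, 30, 13, 60, 60],
})

# Step 1: per outfield player, average the attack and the defense attributes.
outfield = players[~players["is_gk"]].copy()
outfield["attack"] = outfield[ATTACK].mean(axis=1)
outfield["defense"] = outfield[DEFENSE].mean(axis=1)

# Step 2: per team, take mean, standard deviation and best value.
team_stats = outfield.groupby("team")[["attack", "defense"]].agg(["mean", "std", "max"])
gk_rating = players[players["is_gk"]].set_index("team")["overall_rating"]

# Step 3: flatten into one row of 14 features (7 per team).
features = {}
for side in ["home", "away"]:
    for skill in ["attack", "defense"]:
        features[f"{side}_avg_{skill}"] = team_stats.loc[side, (skill, "mean")]
        features[f"{side}_std_{skill}"] = team_stats.loc[side, (skill, "std")]
        features[f"{side}_best_{skill}"] = team_stats.loc[side, (skill, "max")]
    features[f"gk_{side}_player_1"] = gk_rating[side]

match_features = pd.Series(features)
print(match_features)
```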

## Feature extraction

The aim of TDA is to capture the structure of the space underlying the data. In our project we assume that the neighbourhood of a data point hides meaningful information that is correlated with the outcome of the match. We therefore explored the data space looking for this kind of correlation.
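
To make the neighbourhood assumption concrete, here is a deliberately simple sketch that is not the topological pipeline used in this notebook: for each new match it looks up the `k` nearest training matches and records the share of each outcome among them. The random data, the outcome encoding and the value of `k` are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 past matches with 14 features each.
x_past = rng.normal(size=(200, 14))
# Assumed encoding: 0 = away win, 1 = draw, 2 = home win.
y_past = rng.integers(0, 3, size=200)
x_new = rng.normal(size=(5, 14))

k = 10
nn = NearestNeighbors(n_neighbors=k).fit(x_past)
_, neighbor_idx = nn.kneighbors(x_new)  # shape (5, k)

# Outcomes of the neighbours of each new match, shape (5, k).
neighbor_outcomes = y_past[neighbor_idx]

# Share of each outcome in the neighbourhood, shape (5, 3).
outcome_shares = np.stack(
    [(neighbor_outcomes == c).mean(axis=1) for c in range(3)], axis=1
)
print(outcome_shares)
```

Features of this kind only use local counts; the pipeline loaded below is meant to capture richer structure of the neighbourhood.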

In [9]:
best_pipeline_params, best_model_feat_params = get_best_params()

In [10]:
pipeline = get_pipeline(best_pipeline_params)

In [11]:
x_train, y_train = load_dataset()

In [12]:
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

100%|██████████| 38/38 [01:33<00:00,  2.47s/it]


In [13]:
rf_model = RandomForestClassifier(**best_model_feat_params)

In [14]:
rf_model.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features=0.5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=52, verbose=0,
                       warm_start=False)

In [15]:
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)

In [16]:
matches_probabilities.head()

Unnamed: 0,home_team_api_id,away_team_api_id,away_team_prob,draw_prob,home_team_prob
0,9825.0,9826.0,0.172922,0.192259,0.634819
1,8191.0,8455.0,0.434398,0.346411,0.219191
2,8197.0,8668.0,0.33677,0.397576,0.265654
3,8650.0,8466.0,0.180363,0.31009,0.509546
4,10260.0,10003.0,0.121135,0.179443,0.699422


In [17]:
compute_final_standings(matches_probabilities, 'premier league')

simulation batch 1
simulation batch 2
simulation batch 3
simulation batch 4
simulation batch 5
simulation batch 6
simulation batch 7
simulation batch 8
simulation batch 9
simulation batch 10
1 Manchester City 77.0
2 Chelsea 72.0
3 Manchester United 67.0
4 Arsenal 64.0
5 Tottenham Hotspur 62.0
6 Everton 58.0
7 Liverpool 56.0
8 Newcastle United 48.0
9 Southampton 47.0
10 Stoke City 47.0
11 Aston Villa 46.0
12 Swansea City 46.0
13 West Ham United 46.0
14 Queens Park Ranger 45.0
15 West Bromwich Albion 44.0
16 Hull City 43.0
17 Sunderland 43.0
18 Crystal Palace 43.0
19 Burnley 41.0
20 Leicester City 41.0
probabilities to win the title, to be top 4, to be last 3
1 Manchester City 0.568 0.97 0.0
2 Chelsea 0.242 0.88 0.0
3 Manchester United 0.102 0.71 0.0
4 Arsenal 0.043 0.52 0.0
5 Tottenham Hotspur 0.03 0.42 0.0
6 Everton 0.01 0.23 0.01
7 Liverpool 0.004 0.15 0.01
8 Newcastle United 0.0 0.02 0.11
9 Southampton 0.0 0.01 0.15
10 Stoke City 0.0 0.02 0.14
11 Aston Villa 0.001 0.02 0.18
12 Swanse
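
The implementation of `compute_final_standings` is not shown in this notebook. A minimal Monte Carlo sketch of the same idea, under the assumption that each match is drawn independently from its three predicted probabilities and scored 3/1/0, could look like this (the function name `simulate_points` and the three-team league are hypothetical):

```python
import numpy as np
import pandas as pd

def simulate_points(matches, n_simulations=1000, seed=0):
    """Simulate seasons; return a (n_simulations x teams) table of points."""
    rng = np.random.default_rng(seed)
    probs = matches[["home_team_prob", "draw_prob", "away_team_prob"]].to_numpy(dtype=float)
    probs = probs / probs.sum(axis=1, keepdims=True)  # guard against rounding
    home = matches["home_team_api_id"].to_numpy()
    away = matches["away_team_api_id"].to_numpy()
    teams = np.unique(np.concatenate([home, away]))
    home_idx = np.searchsorted(teams, home)
    away_idx = np.searchsorted(teams, away)

    # Inverse-CDF sampling: 0 = home win, 1 = draw, 2 = away win.
    u = rng.random((n_simulations, len(matches)))
    cum = probs.cumsum(axis=1)
    outcomes = (u[..., None] > cum[None, :, :2]).sum(axis=2)
    home_pts = np.array([3, 1, 0])[outcomes]
    away_pts = np.array([0, 1, 3])[outcomes]

    points = np.zeros((n_simulations, len(teams)))
    for m in range(len(matches)):
        points[:, home_idx[m]] += home_pts[:, m]
        points[:, away_idx[m]] += away_pts[:, m]
    return pd.DataFrame(points, columns=teams)

# Toy double round robin between three teams (probabilities are invented).
matches = pd.DataFrame({
    "home_team_api_id": [1, 1, 2, 2, 3, 3],
    "away_team_api_id": [2, 3, 1, 3, 1, 2],
    "home_team_prob": [0.70, 0.75, 0.30, 0.45, 0.25, 0.40],
    "draw_prob": [0.20, 0.15, 0.30, 0.30, 0.25, 0.30],
    "away_team_prob": [0.10, 0.10, 0.40, 0.25, 0.50, 0.30],
})

points = simulate_points(matches, n_simulations=1000)
expected_points = points.mean().sort_values(ascending=False)
# Ties for first place go to the first column here; a real table needs tie-breakers.
title_prob = points.idxmax(axis=1).value_counts(normalize=True)
print(expected_points)
print(title_prob)
```

Top-4 and relegation probabilities follow the same pattern: rank the teams within each simulated season and count how often a team lands in the relevant positions.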

## Messi in each team
The table below shows the effect that Messi would have had on the final standings of the 2014/2015 Premier League. The results come from 20 different simulations, each one replacing the player with the most appearances in a team with Messi.

In [18]:
teams_with_messi.set_index(np.arange(1, 21), drop=True)

Unnamed: 0,Team,Delta Pos.,Pr. Win,Pr. TOP 4,Pr. Rel.,Pr. Win. with Messi,Pr. TOP 4. with Messi,Pr. Rel. with Messi
1,Chelsea,0,0.25,0.84,0.0,0.41,0.94,0.0
2,Manchester City,1,0.59,0.97,0.0,0.58,0.97,0.0
3,Arsenal,0,0.05,0.56,0.0,0.17,0.82,0.0
4,Manchester United,1,0.1,0.69,0.0,0.17,0.81,0.0
5,Tottenham,1,0.03,0.44,0.0,0.06,0.56,0.0
6,Liverpool,3,0.01,0.17,0.01,0.1,0.66,0.0
7,Southampton,0,0.0,0.01,0.14,0.01,0.19,0.01
8,Swansea City,0,0.0,0.01,0.16,0.01,0.04,0.08
9,Stoke City,2,0.0,0.01,0.15,0.01,0.16,0.01
10,Crystal Palace,2,0.0,0.0,0.29,0.0,0.05,0.07


# Benchmarks: Market's odds and Elo ratings

While performance is not our main goal, we nevertheless set up two simple benchmarks to make sure that our (topological) model is a reasonable approximation of reality.

The task we chose is the ternary prediction of the match outcome: will the home team win, will the away team win, or will there be a draw?

The first benchmark uses the market's probabilities for the three outcomes, which are obtained by simply inverting the odds (see `soccer_basics.py` for details).
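
As a quick sketch of what inverting the odds means, the function below turns decimal odds into probabilities. The final normalisation, which removes the bookmaker's margin so that the three values sum to one, is an assumption on our side; `soccer_basics.py` may handle it differently. The example uses the `B365H`, `B365D` and `B365A` odds of the first match shown further down.

```python
import numpy as np

def odds_to_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to outcome probabilities (home, draw, away)."""
    inverse = 1.0 / np.array([home_odds, draw_odds, away_odds], dtype=float)
    # The raw inverses sum to slightly more than 1 (bookmaker margin),
    # so rescale them to obtain a proper probability distribution.
    return inverse / inverse.sum()

probs = odds_to_probabilities(1.25, 6.5, 15.0)
print(probs.round(3))

# The market's ternary prediction is the most likely outcome.
market_prediction = ["1", "X", "2"][int(probs.argmax())]
print(market_prediction)
```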

The second benchmark uses Elo ratings instead, a standard tool for assessing the strength of teams or players: <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo rating system</a>. For the related World Football Elo Ratings and a deeper mathematical discussion of the concept, see <a href="https://www.eloratings.net/about">National teams Elo rating</a> and <a href="https://www.stat.berkeley.edu/~aldous/Papers/me-Elo-SS.pdf">Elo's rating mathematics</a>.
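
For readers unfamiliar with Elo, here is a minimal sketch of the standard update rule. The K-factor of 20 and the optional home advantage are illustrative assumptions; the notebook does not document the meaning of the arguments passed to `soccer_basics.get_elo`, so this should not be read as its implementation.

```python
def expected_home_score(home_rating, away_rating, home_advantage=0.0):
    """Expected score of the home team under the standard Elo formula."""
    diff = away_rating - home_rating - home_advantage
    return 1.0 / (1.0 + 10 ** (diff / 400.0))

def update_elo(home_rating, away_rating, home_score, k=20.0, home_advantage=0.0):
    """Return updated (home, away) ratings.

    home_score is 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    """
    expected = expected_home_score(home_rating, away_rating, home_advantage)
    delta = k * (home_score - expected)
    # The update is zero-sum: what one team gains, the other loses.
    return home_rating + delta, away_rating - delta

# Two equally rated teams, home win: the expected score is 0.5,
# so the home team gains k * 0.5 = 10 points.
new_home, new_away = update_elo(1500.0, 1500.0, home_score=1.0)
print(new_home, new_away)
```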

We calculate the benchmarks on the Premier League dataset.

Our model reaches an accuracy of 0.531, which is comparable with the market's performance.

In [19]:
probabilities_with_odds = get_dataset(42198).get_data(dataset_format='dataframe')[0]

In [20]:
probabilities_with_odds.head()

Unnamed: 0,home_team_api_id,away_team_api_id,away_team_prob,draw_prob,home_team_prob,index,home_team_goal,away_team_goal,B365H,B365D,B365A
0,9825.0,9826.0,0.154923,0.180577,0.6645,3684,2.0,1.0,1.25,6.5,15.0
1,8191.0,8455.0,0.441281,0.329716,0.229002,3685,1.0,3.0,9.0,5.0,1.4
2,8197.0,8668.0,0.330793,0.402374,0.266833,3686,2.0,2.0,3.2,3.4,2.4
3,8650.0,8466.0,0.180788,0.295813,0.523398,3687,2.0,1.0,1.33,5.75,10.0
4,10260.0,10003.0,0.114999,0.177555,0.707445,3688,1.0,2.0,1.36,5.0,11.0


In [21]:
soccer_basics.useful_updates1(probabilities_with_odds)
soccer_basics.get_elo(probabilities_with_odds, 20, 100)
soccer_basics.useful_updates2(probabilities_with_odds, 100)

market's ternary prediction: 1, X or 2



In [22]:
print('market prediction, all data and 2014-2015 season')
# Share of matches where the market's predicted outcome equals the actual result
acc1 = (probabilities_with_odds['result'] ==
        probabilities_with_odds['market_prediction']).mean()
df = probabilities_with_odds.reset_index()

print(np.round(acc1, 3))

market prediction, all data and 2014-2015 season
0.533


Elo based ternary prediction:



In [23]:
print('Elo based prediction, all data and 2015, with 30 matches quarantine')
soccer_basics.ternary_prediction(probabilities_with_odds, 30)

Elo based prediction, all data and 2015, with 30 matches quarantine
accuracy 0.455
