# The shape of football games

## The dataset

The original data can be found [here](https://www.kaggle.com/hugomathien/soccer). In brief, it contains the following tables.

## Tables

* Team: contains three id keys that relate it to other tables, plus the long and short name of the team.
* Team Attributes: historical updates of each team's attributes (not used in our model).
* Player: general player information such as `name`, `birthday`, `weight` and `height`.
* Player_Attributes: historical updates of player attributes. This table is linked to the `Player` table by `player_fifa_api_id`.
* Match: the most important table, where each row describes a match using `date`, `season`, `league`, the ids of the two participating teams, the ids of the 22 starting players and their positions on the field.
* League and Country: contain the name of the league and its home country.

<img src="FootballTDA.png"> 

In [1]:
from database import Database 
from cross_validation import extract_features_for_prediction
import pandas as pd
import numpy as np
from numpy import random
import soccer_basics 
from random import expovariate, gauss
from sklearn.ensemble import RandomForestClassifier
from utils import read_pickle
from notebook_functions import *

## Load the tables

The `Database` class manages the tables so that the teams can be modified.

In [2]:
database = Database()

## Modify teams

The method `hire_player` moves your favorite player to a selected team, so that you can simulate how the championship would have gone. You first select the team where you want to put the player and then select the player. The list of teams is sorted by the total points each team collected in the championship. Players are sorted by the number of appearances they made that year. Swapping players is therefore expected to change the outcome. Let's see how things would have gone.

In [4]:
new_player_df = database.hire_player()

Choose one league between "serie a" and "Premier League".
serie a
team_long_name team_short_name  total_point
      Juventus             JUV           91
        Napoli             NAP           82
          Roma             ROM           80
         Inter             INT           67
    Fiorentina             FIO           64
      Sassuolo             SAS           61
         Milan             ACM           57
         Lazio             LAZ           54
 Chievo Verona             CHI           50
         Genoa             GEN           46
        Empoli             EMP           46
      Atalanta             ATA           45
        Torino             TOR           45
       Bologna             BOL           42
     Sampdoria             SAM           40
       Udinese             UDI           39
       Palermo             PAL           39
         Carpi             CAP           38
     Frosinone             FRO           31
 Hellas Verona             VER           28
Which play



             player_name  appearance  (overall_rating, mean)
       Cristian Zaccardo          24               73.111111
         Gaetano Letizia          24               65.937500
        Simone Romagnoli          21               69.100000
       Riccardo Gagliolo          20               65.538462
             Isaac Cofie          19               71.230769
           Lorenzo Lollo          18               69.076923
                   Matos          14               74.000000
           Gabriel Silva          14               70.181818
      Jerry Uche Mbakogu          13               71.000000
        Lorenzo Pasciuti          12               67.153846
         Raffaele Bianco          12               66.125000
            Luca Marrone           9               71.200000
       Antonio Di Gaudio           9               70.714286
         Rafael Martinho           8               72.166667
           Matteo Fedele           8               63.526316
         Marco Borriello

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [5]:
new_player_df.head()

| | id | country_id | league_id | season | stage | date | match_api_id | home_team_api_id | away_team_api_id | home_team_goal | ... | away_avg_attack | away_avg_defense | home_std_attack | home_std_defense | away_std_attack | away_std_defense | home_best_attack | home_best_defense | away_best_attack | away_best_defense |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12894 | 10257 | 10257 | 2015/2016 | 1 | 2015-08-23 | 2060255 | 8534.0 | 8533.0 | 1.0 | ... | 61.658824 | 61.316667 | 79.964041 | 74.572026 | 80.880559 | 74.802056 | 70.117647 | 71.333333 | 73.529412 | 74.666667 |
| 1 | 12895 | 10257 | 10257 | 2015/2016 | 1 | 2015-08-23 | 2060256 | 8535.0 | 8564.0 | 2.0 | ... | 64.817647 | 61.95 | 86.864128 | 72.60399 | 82.046011 | 74.794772 | 75.647059 | 80.166667 | 76.764706 | 81.166667 |
| 2 | 12896 | 10257 | 10257 | 2015/2016 | 1 | 2015-08-23 | 2060257 | 9891.0 | 9804.0 | 1.0 | ... | 61.052941 | 64.75 | 81.583265 | 68.831861 | 80.124494 | 76.188528 | 63.529412 | 70.166667 | 75.823529 | 78.333333 |
| 3 | 12897 | 10257 | 10257 | 2015/2016 | 1 | 2015-08-22 | 2060258 | 9876.0 | 8686.0 | 1.0 | ... | 67.652941 | 66.266667 | 84.331552 | 70.472317 | 82.61284 | 75.42068 | 71.529412 | 74.833333 | 78.058824 | 83.0 |
| 4 | 12898 | 10257 | 10257 | 2015/2016 | 1 | 2015-08-23 | 2060259 | 8636.0 | 8524.0 | 1.0 | ... | 60.405882 | 60.9 | 86.528635 | 77.32122 | 78.292566 | 72.695694 | 79.058824 | 82.0 | 74.647059 | 76.833333 |


Get the team ids, which will be used later.

In [6]:
team_ids = get_team_ids(new_player_df)

We want to make sure that the column order is the same as in the training set.

In [7]:
new_players_df_stats = get_useful_cols(new_player_df)

In [8]:
new_players_df_stats.head()

| | home_best_attack | home_best_defense | home_avg_attack | home_avg_defense | home_std_attack | home_std_defense | gk_home_player_1 | away_avg_attack | away_avg_defense | away_std_attack | away_std_defense | away_best_attack | away_best_defense | gk_away_player_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.117647 | 71.333333 | 58.364706 | 55.283333 | 79.964041 | 74.572026 | 71.0 | 61.658824 | 61.316667 | 80.880559 | 74.802056 | 73.529412 | 74.666667 | 72.0 |
| 1 | 75.647059 | 80.166667 | 63.676471 | 62.066667 | 86.864128 | 72.60399 | 74.0 | 64.817647 | 61.95 | 82.046011 | 74.794772 | 76.764706 | 81.166667 | 81.0 |
| 2 | 63.529412 | 70.166667 | 52.794118 | 47.483333 | 81.583265 | 68.831861 | 72.0 | 61.052941 | 64.75 | 80.124494 | 76.188528 | 75.823529 | 78.333333 | 75.0 |
| 3 | 71.529412 | 74.833333 | 60.729412 | 58.966667 | 84.331552 | 70.472317 | 73.0 | 67.652941 | 66.266667 | 82.61284 | 75.42068 | 78.058824 | 83.0 | 78.0 |
| 4 | 79.058824 | 82.0 | 62.082353 | 66.883333 | 86.528635 | 77.32122 | 83.0 | 60.405882 | 60.9 | 78.292566 | 72.695694 | 74.647059 | 76.833333 | 70.0 |
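The column order matters because a fitted scikit-learn model consumes features by position when it is given a plain array. The internals of `get_useful_cols` are not shown here, so the following is only a minimal sketch of such an alignment step; `align_columns` and the toy column lists are hypothetical names introduced for illustration.

```python
import pandas as pd


def align_columns(df: pd.DataFrame, reference_columns) -> pd.DataFrame:
    """Return `df` restricted to `reference_columns`, in exactly that order.

    Hypothetical helper: it is NOT the notebook's `get_useful_cols`.
    """
    reference_columns = list(reference_columns)
    missing = [col for col in reference_columns if col not in df.columns]
    if missing:
        # Fail loudly instead of silently predicting on the wrong features.
        raise KeyError(f"columns missing from the new data: {missing}")
    return df.loc[:, reference_columns]


# Toy data: the columns arrive in a different order, plus one extra column.
train_columns = ["home_best_attack", "home_best_defense", "gk_home_player_1"]
new_data = pd.DataFrame(
    {
        "gk_home_player_1": [71.0],
        "home_best_defense": [71.333333],
        "home_best_attack": [70.117647],
        "stage": [1],
    }
)
aligned = align_columns(new_data, train_columns)
print(list(aligned.columns))
```

Raising on missing columns is a design choice: a silent reindex would fill the gap with `NaN` and hide the mismatch until prediction time.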


## Feature selection

To make the `Match` table smaller and more manageable, we reduced its size by aggregating some features. Each team in a match is described by 7 features: the overall rating of the goalkeeper, the maximum attack and defense values, the mean attack and defense values, and the standard deviations of the attack and defense values. In our model, a team is therefore characterized by these 7 features.
Below we describe how we calculated these values.

A match is thus described by 14 features, which map it to a point in a 14-dimensional space determined by the characteristics of the two teams.

To decide which attributes belong to which group, we computed a correlation matrix of the player attributes. It showed two large groups of attributes that were strongly correlated with each other. We therefore split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones.
Finally, since the goalkeeper's statistics are completely different from those of the other players, we take into account only the goalkeeper's overall rating.
The features used for each player are:

* **Attack**: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
* **Defense**: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
* **Goalkeeper**: "overall_rating"

From this set of features, we first compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.
Then, for each team in a given match, we compute the mean and the standard deviation of these attack and defense scores over the team's players, as well as the best attack and the best defense.
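The aggregation described above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the project's actual code: `player_scores` and `team_features` are hypothetical helpers, the output keys merely mirror the column names shown earlier, and the use of pandas' sample standard deviation is a guess.

```python
import pandas as pd

ATTACK = [
    "positioning", "crossing", "finishing", "heading_accuracy", "short_passing",
    "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy",
    "acceleration", "sprint_speed", "agility", "penalties", "vision",
    "shot_power", "long_shots",
]
DEFENSE = [
    "interceptions", "aggression", "marking", "standing_tackle",
    "sliding_tackle", "long_passing",
]


def player_scores(players: pd.DataFrame) -> pd.DataFrame:
    """One attack score and one defense score per outfield player."""
    return pd.DataFrame(
        {
            "attack": players[ATTACK].mean(axis=1),
            "defense": players[DEFENSE].mean(axis=1),
        }
    )


def team_features(outfield: pd.DataFrame, gk_overall: float, side: str) -> dict:
    """Seven features for one team; `side` is 'home' or 'away'."""
    scores = player_scores(outfield)
    return {
        f"{side}_best_attack": scores["attack"].max(),
        f"{side}_best_defense": scores["defense"].max(),
        f"{side}_avg_attack": scores["attack"].mean(),
        f"{side}_avg_defense": scores["defense"].mean(),
        # Assumption: sample standard deviation (pandas default, ddof=1).
        f"{side}_std_attack": scores["attack"].std(),
        f"{side}_std_defense": scores["defense"].std(),
        f"gk_{side}_player_1": gk_overall,
    }


# Toy team of three outfield players with flat attribute profiles.
rows = []
for attack, defense in [(60, 50), (70, 60), (80, 70)]:
    row = {name: float(attack) for name in ATTACK}
    row.update({name: float(defense) for name in DEFENSE})
    rows.append(row)
outfield = pd.DataFrame(rows)

features = team_features(outfield, gk_overall=75.0, side="home")
print(features)
```

Calling `team_features` once with `side="home"` and once with `side="away"` and merging the two dictionaries gives the 14 values that describe a match.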


## Feature extraction

The aim of TDA is to capture the structure of the space underlying the data. In our project we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. We therefore explored the data space looking for this kind of correlation.

In [9]:
best_pipeline_params, best_model_feat_params = get_best_params()

In [10]:
pipeline = get_pipeline(best_pipeline_params)

In [11]:
x_train, y_train = load_dataset()

In [12]:
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

100%|██████████| 38/38 [04:52<00:00,  7.71s/it]


In [13]:
rf_model = RandomForestClassifier(**best_model_feat_params)

In [14]:
rf_model.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features=0.5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=52, verbose=0,
                       warm_start=False)

In [15]:
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)

In [16]:
matches_probabilities.head()

| | home_team_api_id | away_team_api_id | away_team_prob | draw_prob | home_team_prob |
|---|---|---|---|---|---|
| 0 | 8534.0 | 8533.0 | 0.319951 | 0.292642 | 0.387407 |
| 1 | 8535.0 | 8564.0 | 0.299918 | 0.324351 | 0.375731 |
| 2 | 9891.0 | 9804.0 | 0.285442 | 0.331923 | 0.382635 |
| 3 | 9876.0 | 8686.0 | 0.372635 | 0.29134 | 0.336025 |
| 4 | 8636.0 | 8524.0 | 0.27291 | 0.303845 | 0.423245 |


In [17]:
compute_final_standings(matches_probabilities, 'premier league')

simulation batch 1
simulation batch 2
simulation batch 3
simulation batch 4
simulation batch 5
simulation batch 6
simulation batch 7
simulation batch 8
simulation batch 9
simulation batch 10
1 Stoke City 69.0
2 Hull City 66.0
3 West Bromwich Albion 64.0
4 Crystal Palace 61.0
5 Tottenham Hotspur 59.0
6 Liverpool 53.0
7 Southampton 51.0
8 Newcastle United 48.0
9 Queens Park Ranger 48.0
10 Burnley 48.0
11 West Ham United 48.0
12 Manchester City 47.0
13 Everton 47.0
14 Leicester City 47.0
15 Chelsea 47.0
16 Sunderland 46.0
17 Aston Villa 45.0
18 Swansea City 45.0
19 Arsenal 44.0
20 Manchester United 44.0
probabilities to win the title, to be top 4, to be last 3
1 Stoke City 0.392 0.88 0.0
2 Hull City 0.238 0.73 0.0
3 West Bromwich Albion 0.181 0.68 0.0
4 Crystal Palace 0.099 0.54 0.01
5 Tottenham Hotspur 0.051 0.39 0.01
6 Liverpool 0.011 0.18 0.05
7 Southampton 0.005 0.11 0.07
8 Newcastle United 0.001 0.06 0.16
9 Queens Park Ranger 0.004 0.05 0.15
10 Burnley 0.004 0.06 0.16
11 West Ham Uni
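Standings and probabilities like the ones printed above can be produced by a Monte Carlo simulation over the per-match probabilities. The internals of `compute_final_standings` are not shown in this notebook, so the sketch below is only one plausible way to do it. `simulate_standings`, `n_seasons` and the output column names are hypothetical, and ties on points are broken by team order because no goals are simulated.

```python
import numpy as np
import pandas as pd


def simulate_standings(matches: pd.DataFrame, n_seasons: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Sample every match `n_seasons` times and summarise the final tables."""
    rng = np.random.default_rng(seed)
    teams = sorted(set(matches["home_team_api_id"]) | set(matches["away_team_api_id"]))
    position = {team: i for i, team in enumerate(teams)}
    home = matches["home_team_api_id"].map(position).to_numpy()
    away = matches["away_team_api_id"].map(position).to_numpy()

    probs = matches[["home_team_prob", "draw_prob", "away_team_prob"]].to_numpy(dtype=float)
    probs = probs / probs.sum(axis=1, keepdims=True)
    cumulative = probs.cumsum(axis=1)

    # Outcome codes: 0 = home win, 1 = draw, 2 = away win.
    draws = rng.random((n_seasons, len(matches)))
    outcome = (draws[..., None] > cumulative[None, :, :2]).sum(axis=2)
    home_points = np.array([3, 1, 0])[outcome]
    away_points = np.array([0, 1, 3])[outcome]

    points = np.zeros((n_seasons, len(teams)))
    for m in range(len(matches)):
        points[:, home[m]] += home_points[:, m]
        points[:, away[m]] += away_points[:, m]

    # Rank 0 is the champion; ties are broken by team order (an assumption).
    order = np.argsort(-points, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(n_seasons)[:, None], order] = np.arange(len(teams))[None, :]

    table = pd.DataFrame(
        {
            "team": teams,
            "mean_points": points.mean(axis=0),
            "p_title": (ranks == 0).mean(axis=0),
            "p_top4": (ranks < 4).mean(axis=0),
            "p_bottom3": (ranks >= len(teams) - 3).mean(axis=0),
        }
    )
    return table.sort_values("mean_points", ascending=False).reset_index(drop=True)


# Usage with three rows copied from the `matches_probabilities` output above.
sample = pd.DataFrame(
    {
        "home_team_api_id": [8534.0, 8535.0, 9891.0],
        "away_team_api_id": [8533.0, 8564.0, 9804.0],
        "away_team_prob": [0.319951, 0.299918, 0.285442],
        "draw_prob": [0.292642, 0.324351, 0.331923],
        "home_team_prob": [0.387407, 0.375731, 0.382635],
    }
)
print(simulate_standings(sample, n_seasons=200))
```

With only three matches the table is not meaningful; a full season needs one row per fixture.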

## Messi in each team
The table below shows the effect that Messi would have had on the final standings of the Premier League 2014/2015. The results are obtained by running 20 different simulations, each one replacing a team's player with the highest number of appearances by Messi.

In [18]:
teams_with_messi.set_index(np.arange(1, 21), drop=True)

| | Team | Delta Pos. | Pr. Win | Pr. TOP 4 | Pr. Rel. | Pr. Win. with Messi | Pr. TOP 4. with Messi | Pr. Rel. with Messi |
|---|---|---|---|---|---|---|---|---|
| 1 | Chelsea | 0 | 0.25 | 0.84 | 0.0 | 0.41 | 0.94 | 0.0 |
| 2 | Manchester City | 1 | 0.59 | 0.97 | 0.0 | 0.58 | 0.97 | 0.0 |
| 3 | Arsenal | 0 | 0.05 | 0.56 | 0.0 | 0.17 | 0.82 | 0.0 |
| 4 | Manchester United | 1 | 0.1 | 0.69 | 0.0 | 0.17 | 0.81 | 0.0 |
| 5 | Tottenham | 1 | 0.03 | 0.44 | 0.0 | 0.06 | 0.56 | 0.0 |
| 6 | Liverpool | 3 | 0.01 | 0.17 | 0.01 | 0.1 | 0.66 | 0.0 |
| 7 | Southampton | 0 | 0.0 | 0.01 | 0.14 | 0.01 | 0.19 | 0.01 |
| 8 | Swansea City | 0 | 0.0 | 0.01 | 0.16 | 0.01 | 0.04 | 0.08 |
| 9 | Stoke City | 2 | 0.0 | 0.01 | 0.15 | 0.01 | 0.16 | 0.01 |
| 10 | Crystal Palace | 2 | 0.0 | 0.0 | 0.29 | 0.0 | 0.05 | 0.07 |


# Benchmarks: Market's odds and Elo ratings

While performance is not our main goal, we nevertheless set up two simple benchmarks to make sure that our (topological) model is a reasonable approximation of reality.

The task we chose is simply ternary match outcome prediction: will the home team win, will the away team win, or will there be a draw?
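To make the task concrete, here is a minimal sketch of how a ternary prediction and its accuracy could be computed from three probability columns. The labels `1`, `X` and `2` (home win, draw, away win) follow the usual betting convention; `actual_result`, `predicted_result` and `accuracy` are hypothetical helpers, and the five rows are copied from the `probabilities_with_odds.head()` output in this notebook.

```python
import numpy as np
import pandas as pd


def actual_result(df: pd.DataFrame) -> pd.Series:
    """'1' for a home win, 'X' for a draw, '2' for an away win."""
    diff = df["home_team_goal"] - df["away_team_goal"]
    labels = np.select([diff > 0, diff == 0], ["1", "X"], default="2")
    return pd.Series(labels, index=df.index)


def predicted_result(df: pd.DataFrame) -> pd.Series:
    """Pick the outcome with the highest predicted probability."""
    columns = ["home_team_prob", "draw_prob", "away_team_prob"]
    labels = np.array(["1", "X", "2"])
    best = df[columns].to_numpy().argmax(axis=1)
    return pd.Series(labels[best], index=df.index)


def accuracy(df: pd.DataFrame) -> float:
    """Fraction of matches whose predicted label equals the actual one."""
    return float((actual_result(df) == predicted_result(df)).mean())


matches = pd.DataFrame(
    {
        "away_team_prob": [0.154923, 0.441281, 0.330793, 0.180788, 0.114999],
        "draw_prob": [0.180577, 0.329716, 0.402374, 0.295813, 0.177555],
        "home_team_prob": [0.6645, 0.229002, 0.266833, 0.523398, 0.707445],
        "home_team_goal": [2.0, 1.0, 2.0, 2.0, 1.0],
        "away_team_goal": [1.0, 3.0, 2.0, 1.0, 2.0],
    }
)
print(predicted_result(matches).tolist())
print(accuracy(matches))
```

Five rows are far too few to say anything about the model; the point is only to show the mechanics of the comparison.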

The first benchmark is obtained from the market's probabilities for the three outcomes, which are computed by simply inverting the betting odds (see `soccer_basics.py` for details).
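As an illustration of inverting the odds, the sketch below turns decimal odds into probabilities. The `B365H`, `B365D` and `B365A` column names come from the dataset; `implied_probabilities` is a hypothetical helper, and the optional normalisation step (which removes the bookmaker's margin so that each row sums to 1) is an assumption, since the text does not say whether `soccer_basics.py` applies it.

```python
import pandas as pd


def implied_probabilities(odds: pd.DataFrame, normalise: bool = True) -> pd.DataFrame:
    """Convert decimal odds into outcome probabilities.

    Raw inverse odds sum to slightly more than 1 because of the bookmaker's
    margin; `normalise=True` rescales each row to sum to exactly 1.
    """
    inverse = 1.0 / odds[["B365H", "B365D", "B365A"]].astype(float)
    if normalise:
        inverse = inverse.div(inverse.sum(axis=1), axis=0)
    inverse.columns = ["market_home_prob", "market_draw_prob", "market_away_prob"]
    return inverse


# Odds of the first two matches in the benchmark dataset.
odds = pd.DataFrame({"B365H": [1.25, 9.0], "B365D": [6.5, 5.0], "B365A": [15.0, 1.4]})
print(implied_probabilities(odds, normalise=False))
print(implied_probabilities(odds))
```

Normalisation does not change which outcome has the highest probability, so it has no effect on the ternary prediction itself.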

The second benchmark instead uses Elo ratings, a standard tool for assessing the strengths of teams or players: <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo rating system</a>. For the related World Football Elo Ratings, see <a href="https://www.eloratings.net/about"> National teams Elo rating</a>. For a deeper mathematical discussion of this concept, see <a href="https://www.stat.berkeley.edu/~aldous/Papers/me-Elo-SS.pdf">Elo's rating mathematics</a>.
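For readers unfamiliar with Elo, the standard update rule fits in a few lines. This is a generic sketch of the textbook formula, not the code in `soccer_basics.py`. The defaults `k=20` and `home_advantage=100` are assumptions chosen only to echo the arguments passed to `get_elo` in this notebook, whose exact meaning is not documented here.

```python
def expected_home_score(r_home: float, r_away: float, home_advantage: float = 100.0) -> float:
    """Expected score of the home team (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10.0 ** ((r_away - r_home - home_advantage) / 400.0))


def update_elo(r_home, r_away, home_goals, away_goals, k=20.0, home_advantage=100.0):
    """Return the two ratings after one match (zero-sum update)."""
    if home_goals > away_goals:
        score = 1.0
    elif home_goals == away_goals:
        score = 0.5
    else:
        score = 0.0
    delta = k * (score - expected_home_score(r_home, r_away, home_advantage))
    return r_home + delta, r_away - delta


# Two equally rated teams; the home side wins 2-1.
new_home, new_away = update_elo(1500.0, 1500.0, home_goals=2, away_goals=1)
print(new_home, new_away)
```

Plain Elo yields only an expected score, so a ternary prediction additionally needs a rule for when to predict a draw; that rule is not covered by this sketch.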

We calculate the benchmarks on the Premier League dataset.

Our model achieves an accuracy of 0.531, which is comparable with the market's performance.

In [19]:
probabilities_with_odds = get_dataset(42198).get_data(dataset_format='dataframe')[0]

In [20]:
probabilities_with_odds.head()

| | home_team_api_id | away_team_api_id | away_team_prob | draw_prob | home_team_prob | index | home_team_goal | away_team_goal | B365H | B365D | B365A |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9825.0 | 9826.0 | 0.154923 | 0.180577 | 0.6645 | 3684 | 2.0 | 1.0 | 1.25 | 6.5 | 15.0 |
| 1 | 8191.0 | 8455.0 | 0.441281 | 0.329716 | 0.229002 | 3685 | 1.0 | 3.0 | 9.0 | 5.0 | 1.4 |
| 2 | 8197.0 | 8668.0 | 0.330793 | 0.402374 | 0.266833 | 3686 | 2.0 | 2.0 | 3.2 | 3.4 | 2.4 |
| 3 | 8650.0 | 8466.0 | 0.180788 | 0.295813 | 0.523398 | 3687 | 2.0 | 1.0 | 1.33 | 5.75 | 10.0 |
| 4 | 10260.0 | 10003.0 | 0.114999 | 0.177555 | 0.707445 | 3688 | 1.0 | 2.0 | 1.36 | 5.0 | 11.0 |


In [21]:
soccer_basics.useful_updates1(probabilities_with_odds)
soccer_basics.get_elo(probabilities_with_odds, 20, 100)
soccer_basics.useful_updates2(probabilities_with_odds, 100)

Market's ternary prediction (1, X or 2):



In [22]:
print('market prediction, all data and 2014-2015 season')
acc1 = len(probabilities_with_odds[probabilities_with_odds['result'] == 
                                   probabilities_with_odds['market_prediction']]) / float(len(probabilities_with_odds))
df = probabilities_with_odds.reset_index()

print(np.round(acc1, 3))

market prediction, all data and 2014-2015 season
0.533


Elo based ternary prediction:



In [23]:
print('Elo based prediction, all data and 2015, with 30 matches quarantine')
soccer_basics.ternary_prediction(probabilities_with_odds, 30)

Elo based prediction, all data and 2015, with 30 matches quarantine
accuracy 0.455
