## Match predictions:

The goal of this notebook is to create a model to predict match results, i.e. home win / home loss / draw.

We base our model on the intersection of domain knowledge (the kind of information that could influence the result of a game of football) and the data in our possession. Better results could definitely be achieved if we had more detailed data on previous matches, such as a breakdown of performances per position or average possession statistics.

This analysis is inspired by the work of Gunjan Kumar in his thesis: "Machine Learning for Soccer Analytics".

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# Import all ML modules and packages we'll need
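# Note: on scikit-learn >= 0.18 this function lives in sklearn.model_selection
# (sklearn.cross_validation was removed in 0.20)
from sklearn.cross_validation import train_test_split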
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

## Data preparation

In [3]:
import json
# Load in data
filenames = ['BPL/BPL12-13.json']
with open(filenames[0], 'r') as fp:
    data = json.load(fp)

We need to create a dictionary that contains the following data:

- Final score: response variable
- Home team metric average on past 5 games
- Away team metric average on past 5 games
- Sum of the differences of the best player metric scores for each team in the past 5 games
- Average goals against the home team on the past 5 games
- Average goals against the away team on the past 5 games
- Number of losses for the home team in the past 2 games
- Number of losses for the away team in the past 2 games 


We initialise the dictionary, keyed by match ID (`home-away`), with the first values: home team, away team, and day.

In [110]:
teams = ['arsenal-fc','aston-villa','chelsea-fc','everton-fc','fulham-fc','liverpool-fc','manchester-city','manchester-united','newcastle-united','norwich-city','queens-park-rangers','reading-fc','southampton-fc','stoke-city','sunderland-afc','swansea-city','tottenham-hotspur','west-bromwich-albion','west-ham-united','wigan-athletic']
games = {}
for t1 in teams:
    for t2 in teams:
        if t1!=t2:
            games[t1+"-"+t2] = {'home': t1, 'away': t2}
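
The nested loop above enumerates every ordered pair of distinct teams. As a minimal sketch of the same idea, `itertools.permutations` can generate those pairs directly; `build_fixtures` and `toy_teams` are hypothetical names used only for illustration.

```python
from itertools import permutations

def build_fixtures(teams):
    """One entry per ordered (home, away) pair of distinct teams."""
    return {home + "-" + away: {"home": home, "away": away}
            for home, away in permutations(teams, 2)}

toy_teams = ["arsenal-fc", "chelsea-fc", "everton-fc"]  # shortened list
fixtures = build_fixtures(toy_teams)
```

With `n` teams this yields `n * (n - 1)` fixtures, so the 20 teams listed above give 380 matches.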

In [111]:
for k in games.keys():
    games[k]['day']= data[k]['day']

In order to fill in this dictionary, we need a way to access the data for every player in the team on a particular day. We therefore create a team dictionary. Note that the `if p[2]>=0 else ' '` expression replaces own-goal scorers with a `' '` placeholder, which is removed afterwards.

In [125]:
team_players = dict.fromkeys(teams)
for k in games.keys():
    team1 = games[k]['home']
    team2 = games[k]['away']
    
    if team_players[team1] is None:
        team_players[team1] = [p[0] if p[2]>=0 else ' ' for p in data[k]['home']]
    else:
        team_players[team1].extend([p[0] if p[2]>=0 else ' ' for p in data[k]['home']])
    team_players[team1].extend([p[1] if p[2]>=0 else ' ' for p in data[k]['home']])
    
    if team_players[team2] is None:
        team_players[team2] = [p[0] if p[2]>=0 else ' ' for p in data[k]['away']]
    else:
        team_players[team2].extend([p[0] if p[2]>=0 else ' ' for p in data[k]['away']])
    team_players[team2].extend([p[1] if p[2]>=0 else ' ' for p in data[k]['away']])

In [127]:
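for k in team_players.keys():
    # Deduplicate and drop the ' ' placeholder; the set difference does not
    # fail when a team has no placeholder entry (list.remove would raise)
    team_players[k] = list(set(team_players[k]) - {' '})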

In [128]:
player_to_club = {}
for k in team_players.keys():
    for v in team_players[k]:
        player_to_club[v] = k
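
The team dictionary can also be built with sets from the start. The sketch below assumes each goal entry has the layout `[scorer, assist, minute]` with a negative third element marking an own goal; only the own-goal convention is stated in this notebook, so the rest of the layout is an assumption, and `collect_players` and the toy data are hypothetical.

```python
# Toy stand-in for the JSON structure. The [scorer, assist, minute] layout
# is an assumption; a negative third element marks an own goal.
toy_data = {
    "arsenal-fc-chelsea-fc": {
        "home": [["Theo Walcott", "Santi Cazorla", 12]],
        "away": [["Laurent Koscielny", " ", -53]],  # own goal, to be skipped
    },
}
toy_games = {"arsenal-fc-chelsea-fc": {"home": "arsenal-fc", "away": "chelsea-fc"}}

def collect_players(games, data):
    """Map each team to the set of players named in its non-own-goal entries."""
    players = {}
    for key, game in games.items():
        for side in ("home", "away"):
            squad = players.setdefault(game[side], set())
            for entry in data[key][side]:
                if entry[2] >= 0:  # skip own goals
                    squad.update(entry[:2])
    for squad in players.values():
        squad.discard(" ")  # no error if the placeholder is absent
    return players

toy_players = collect_players(toy_games, toy_data)
```

Using `set.discard` avoids an exception for a team that never had a placeholder entry.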

We test it:

In [129]:
team_players['arsenal-fc']

[u'Mikel Arteta',
 u'Aaron Ramsey',
 u'Lukas Podolski',
 u'Jack Wilshere',
 u'Theo Walcott',
 u'Per Mertesacker',
 u'Gervinho',
 u'Kieran Gibbs',
 u'Nacho Monreal',
 u'Alex Oxlade-Chamberlain',
 u'Tom\xc3\xa1\xc5\xa1 Rosick\xc3\xbd',
 u'Laurent Koscielny',
 u'Olivier Giroud',
 u'Santi Cazorla']

We load the feature data:

In [37]:
features12 = pd.read_pickle('Data/features12-13.pkl')

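We first fill in the metric averages. Note that the code below averages, over a team's players, the metric from the previous match day only (index `d-2`), not from the past 5 games: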

In [130]:
for k in games.keys():
    d = games[k]['day']
    if (d-1) == 0:
        continue
    home_average = 0
    for p in team_players[games[k]['home']]:
        home_average += features12[p]['match_value_list'][d-2]
    games[k]['home_average'] = home_average/len(team_players[games[k]['home']])
    away_average = 0
    for p in team_players[games[k]['away']]:
        away_average += features12[p]['match_value_list'][d-2]
    games[k]['away_average'] = away_average/len(team_players[games[k]['away']])
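
A windowed version of this average could look like the sketch below. It assumes that `values[i]` holds the metric for match day `i + 1`, which is how `match_value_list` appears to be indexed here; `window_mean` is a hypothetical helper, not part of the notebook.

```python
def window_mean(values, day, size=5):
    """Mean of the up-to-`size` entries before match day `day` (1-indexed).

    values[i] is assumed to hold the metric for match day i + 1.
    Returns None on the first day, when there is no history.
    """
    previous = values[max(0, day - 1 - size):day - 1]
    if not previous:
        return None
    return sum(previous) / float(len(previous))

toy_values = [0.2, 0.4, 0.6, 0.8, 1.0, 0.0, 0.5]  # made-up metric values
```

With `size=1` the helper reduces to the single-match lookup used in the cell above; with the default it covers the past 5 games and shortens the window automatically early in the season.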

In [131]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'day': 19,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285}

We fill in the best player differences:

In [132]:
for k in games.keys():
    d = games[k]['day']
    home_team = games[k]['home']
    away_team = games[k]['away']
    best_home = []
    best_away = []
    if (d-1)==0:
        continue
    elif (d-1) == 1:
        for p in team_players[home_team]:
            best_home.append(features12[p]['match_value_list'][0])
        for p in team_players[away_team]:
            best_away.append(features12[p]['match_value_list'][0])
            
    elif (d-1) == 2:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:2]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:2]))
            
    elif (d-1) == 3:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:3]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:3]))
            
    elif (d-1) == 4:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:4]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:4]))
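    else:
        # NOTE: this slice spans four matches (indices d-6 to d-3) and skips
        # the most recent one; (d-6):(d-1) would cover the previous five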
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][(d-6):(d-2)]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][(d-6):(d-2)]))
    
    
    games[k]['best'] = np.sum(best_home)-np.sum(best_away)
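
The chain of `elif` branches can be collapsed into one clamped slice. The sketch below is a variant rather than a transcription of the cell above: it always looks at the up-to-five matches before the current day, whereas the last branch of the cell slices `(d-6):(d-2)`. The helper names and toy values are hypothetical.

```python
def best_recent(values, day, size=5):
    """Highest metric value over the up-to-`size` matches before `day`."""
    return max(values[max(0, day - 1 - size):day - 1])

def best_difference(home_values, away_values, day, size=5):
    """Sum of home players' recent bests minus the same sum for away players."""
    home = sum(best_recent(v, day, size) for v in home_values)
    away = sum(best_recent(v, day, size) for v in away_values)
    return home - away

# Made-up metric lists, one list per player.
toy_home = [[1.0, 3.0, 2.0], [0.5, 0.5, 4.0]]
toy_away = [[2.0, 1.0, 0.0], [1.5, 2.5, 0.5]]
```

As in the notebook, day 1 has to be skipped by the caller, since there is no earlier match to take a maximum over.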

In [133]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'best': -19.324999999999996,
 'day': 19,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285}

We add the score:

In [134]:
results = dict.fromkeys(teams)
goals_against = dict.fromkeys(teams)

for k in results.keys():
    results[k] = np.zeros(38)
    goals_against[k] = np.zeros(38)
    
for k in games.keys():
    team1 = games[k]['home']
    team2 = games[k]['away']
    d = games[k]['day']-1
    goal_home = len(data[k]['home'])
    goal_away = len(data[k]['away'])
    
    goals_against[team1][d] = -goal_away
    goals_against[team2][d] = -goal_home
    
    if goal_home > goal_away:
        results[team1][d] = 1
        results[team2][d] = -1
    elif goal_home<goal_away:
        results[team1][d] = -1
        results[team2][d] = 1
    else:
        results[team1][d] = 0
        results[team2][d] = 0

We finally add home and away losses, wins, and goals against:

In [136]:
for k in games.keys():
    
    team1 = games[k]['home']
    team2 = games[k]['away']
    
    d = games[k]['day']-1
    
    if d==0:
        continue
    elif d<=4:
        home_goal_vs = np.sum(goals_against[team1][:d])/d
        away_goal_vs = np.sum(goals_against[team2][:d])/d
        if d<=2:
            home_loss = len(results[team1][:d][results[team1][:d]<0])
            home_win = len(results[team1][:d][results[team1][:d]>0])
            away_loss = len(results[team2][:d][results[team2][:d]<0])
            away_win = len(results[team2][:d][results[team2][:d]>0])
        else:
            home_loss = len(results[team1][d-2:d][results[team1][d-2:d]<0])
            home_win = len(results[team1][d-2:d][results[team1][d-2:d]>0])
            away_loss = len(results[team2][d-2:d][results[team2][d-2:d]<0])
            away_win = len(results[team2][d-2:d][results[team2][d-2:d]>0])
    else:
        home_goal_vs = np.sum(goals_against[team1][d-5:d])/5
        away_goal_vs = np.sum(goals_against[team2][d-5:d])/5
        home_loss = len(results[team1][d-2:d][results[team1][d-2:d]<0])
        home_win = len(results[team1][d-2:d][results[team1][d-2:d]>0])
        away_loss = len(results[team2][d-2:d][results[team2][d-2:d]<0])
        away_win = len(results[team2][d-2:d][results[team2][d-2:d]>0])
        
    games[k]['goals_against_home'] = home_goal_vs
    games[k]['goals_against_away'] = away_goal_vs
    games[k]['home_loss'] = home_loss
    games[k]['away_loss'] = away_loss
    games[k]['home_win'] = home_win
    games[k]['away_win'] = away_win
    games[k]['score'] = results[team1][d]
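
The branching above amounts to two clamped windows: the last 2 results for wins and losses, and the last 5 match days for goals against. A minimal sketch of that logic for one team follows; `recent_form` and the toy arrays are hypothetical.

```python
def recent_form(results, goals_against, d, form_window=2, goal_window=5):
    """Summarise a team's form before 0-indexed match day `d` (d >= 1).

    results[i] is 1 / 0 / -1 for a win / draw / loss on day i, and
    goals_against[i] is minus the goals conceded, mirroring the arrays above.
    """
    form = results[max(0, d - form_window):d]
    conceded = goals_against[max(0, d - goal_window):d]
    return {
        "loss": sum(1 for r in form if r < 0),
        "win": sum(1 for r in form if r > 0),
        "goals_against": sum(conceded) / float(len(conceded)),
    }

toy_results = [1, -1, 0, -1, 1, 1]          # made-up season start
toy_goals_against = [0, -2, -1, -3, 0, -1]  # made-up, negative as in the notebook
```

Calling it once for the home team and once for the away team would give the six values stored per game.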

We therefore have:

In [137]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'away_loss': 0,
 'away_win': 2,
 'best': -19.324999999999996,
 'day': 19,
 'goals_against_away': -1.0,
 'goals_against_home': -1.6000000000000001,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285,
 'home_loss': 1,
 'home_win': 1,
 'score': 1.0}

We now convert it to a DataFrame to make predictions, starting with a multi-class logistic regression:

In [140]:
DF = pd.DataFrame.from_dict(games)

In [171]:
DF.iloc[0:13,:5]

| | arsenal-fc-aston-villa | arsenal-fc-chelsea-fc | arsenal-fc-everton-fc | arsenal-fc-fulham-fc | arsenal-fc-liverpool-fc |
|---|---|---|---|---|---|
| away | aston-villa | chelsea-fc | everton-fc | fulham-fc | liverpool-fc |
| away_average | 0.25 | 0.75 | 0.578125 | 0.4305556 | 0.6111111 |
| away_loss | 0 | 0 | 1 | 0 | 1 |
| away_win | 1 | 1 | 1 | 0 | 1 |
| best | -3.2375 | -19.1875 | 7.7125 | -26.6125 | 12.6 |
| day | 27 | 6 | 29 | 11 | 24 |
| goals_against_away | -1.8 | -0.4 | -1.8 | -1.8 | -1 |
| goals_against_home | -1.2 | -0.4 | -1 | -1.2 | -1.8 |
| home | arsenal-fc | arsenal-fc | arsenal-fc | arsenal-fc | arsenal-fc |
| home_average | 0.4723214 | 0.2464286 | 0.14375 | 0.08214286 | 0.14375 |


## Analysis:

In [195]:
import statsmodels.api as st
from sklearn.linear_model import LogisticRegression

We will train on the first half of the season, excluding the first match day, for which we have no explanatory variables:

In [224]:
firsthalf = DF.loc['day']<=19
secondhalf = ~firsthalf

In [299]:
DF_firsthalf = DF[DF.columns[firsthalf]]
DF_secondhalf = DF[DF.columns[secondhalf]]

In [300]:
# We remove the first day:
train = DF_firsthalf[DF_firsthalf.columns[DF_firsthalf.loc['day']>1]]

In [303]:
y = train.loc['score'].copy()

In [304]:
X = train.iloc[[1,2,3,4,6,7,9,10,11],:].T
test = DF_secondhalf.iloc[[1,2,3,4,6,7,9,10,11],:].T
y_test = DF_secondhalf.loc['score'].copy()
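
The same chronological split can be expressed on the `games` dictionary before it becomes a DataFrame. This is a minimal sketch with made-up keys; `split_by_day` is a hypothetical helper.

```python
def split_by_day(games, last_train_day):
    """Return (train_keys, test_keys), training on days 2..last_train_day."""
    train_keys = sorted(k for k, g in games.items()
                        if 1 < g["day"] <= last_train_day)
    test_keys = sorted(k for k, g in games.items()
                       if g["day"] > last_train_day)
    return train_keys, test_keys

toy_games = {
    "a-b": {"day": 1}, "c-d": {"day": 2}, "a-c": {"day": 19},
    "b-d": {"day": 20}, "d-a": {"day": 38},
}
train_keys, test_keys = split_by_day(toy_games, 19)
```

Splitting by day rather than at random keeps every training game earlier than every test game, so the model is never fitted on results from the future.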

In [204]:
X.head()

| | away_average | away_loss | away_win | best | goals_against_away | goals_against_home | home_average | home_loss | home_win |
|---|---|---|---|---|---|---|---|---|---|
| arsenal-fc-chelsea-fc | 0.75 | 0 | 1 | -19.1875 | -0.4 | -0.4 | 0.2464286 | 0 | 1 |
| arsenal-fc-fulham-fc | 0.4305556 | 0 | 0 | -26.6125 | -1.8 | -1.2 | 0.08214286 | 1 | 1 |
| arsenal-fc-queens-park-rangers | 0.4907895 | 1 | 0 | 14.4375 | -1.6 | -1.2 | 0.0 | 1 | 1 |
| arsenal-fc-southampton-fc | 1.045455 | 2 | 0 | -11.4875 | -2.666667 | 0.0 | 0.6160714 | 0 | 1 |
| arsenal-fc-swansea-city | 0.6607143 | 0 | 1 | 9.825 | -0.8 | -1.6 | 0.4723214 | 0 | 0 |


We fit the model:

In [230]:
logreg = LogisticRegression(C=1e5)

In [231]:
logreg.fit(X,y)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

We predict the second half:

In [232]:
y_test_pred = logreg.predict(test)

We compute the confusion matrix to check how well we actually do:

In [245]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test.values.astype(int),y_test_pred.astype(int))

In [257]:
score = float(mat[0,0]+mat[1,1]+mat[2,2])/np.sum(mat)

In [258]:
print score

0.384210526316


Not great... Let's see whether a larger training set gives better results:

In [250]:
tr = DF.loc['day']<=27
ts = ~tr

In [251]:
DF_tr = DF[DF.columns[tr]]
DF_ts = DF[DF.columns[ts]]
train2 = DF_tr[DF_tr.columns[DF_tr.loc['day']>1]]
y2 = train2.loc['score']

X2 = train2.iloc[[1,2,3,4,6,7,9,10,11],:].T
test2 = DF_ts.iloc[[1,2,3,4,6,7,9,10,11],:].T
y_test2 = DF_ts.loc['score']

In [252]:
logreg2 = LogisticRegression(C=1e5)
logreg2.fit(X2,y2)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

In [253]:
y_test_pred2 = logreg2.predict(test2)

In [254]:
mat2 = confusion_matrix(y_test2.values.astype(int),y_test_pred2.astype(int))

In [255]:
mat2

array([[ 5,  9, 21],
       [ 7,  3, 18],
       [ 9,  5, 33]])

In [259]:
score2 = float(mat2[0,0]+mat2[1,1]+mat2[2,2])/np.sum(mat2)

In [260]:
print score2

0.372727272727


Nope, no improvement.
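
Every score in this notebook is the diagonal of a confusion matrix divided by its total, which is plain accuracy. As a small sketch, the calculation can be written once for a matrix of any size; `accuracy_from_confusion` is a hypothetical helper, and the numbers are copied from the `mat2` output above.

```python
def accuracy_from_confusion(matrix):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / float(total)

# Values copied from the mat2 output above.
mat2_values = [[5, 9, 21],
               [7, 3, 18],
               [9, 5, 33]]
acc2 = accuracy_from_confusion(mat2_values)  # 41 correct out of 110
```

On a NumPy array, `np.trace(mat) / float(mat.sum())` should give the same number, as should scikit-learn's `accuracy_score(y_true, y_pred)`.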

### Let's try random forest:

In [262]:
from sklearn.ensemble import RandomForestClassifier

In [263]:
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [265]:
RFC_pred = clf.predict(test).astype(int)

In [266]:
RFC_mat = confusion_matrix(y_test.astype(int),RFC_pred)

In [267]:
RFC_mat

array([[14, 17, 27],
       [12, 18, 20],
       [18, 14, 50]])

In [268]:
RFC_score = float(RFC_mat[0,0]+RFC_mat[1,1]+RFC_mat[2,2])/np.sum(RFC_mat)

In [269]:
print RFC_score

0.431578947368


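Slightly better, although 0.4316 is exactly what always predicting a home win would score on this test set (82 of its 190 games are home wins, as the last row of `RFC_mat` shows).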

In [279]:
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X2,y2)
RFC_pred2 = clf.predict(test2).astype(int)
RFC_mat2 = confusion_matrix(y_test2.astype(int),RFC_pred2)

print RFC_mat2

RFC_score2 = float(RFC_mat2[0,0]+RFC_mat2[1,1]+RFC_mat2[2,2])/np.sum(RFC_mat2)

print RFC_score2

[[15  8 12]
 [ 7  5 16]
 [10 10 27]]
0.427272727273


Let's now see whether we can at least better predict whether or not the home team wins:

In [281]:
y[y<1]= 0 

In [282]:
y_test[y_test<1] = 0
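
The relabelling above collapses the three outcomes into two classes. A minimal sketch of that mapping, together with a "does not lose" variant, is given below on made-up scores; both function names are hypothetical.

```python
def home_win_label(score):
    """1 if the home team won, 0 for a draw or a home loss."""
    return 1 if score > 0 else 0

def not_lose_label(score):
    """1 if the home team won or drew, -1 if it lost."""
    return 1 if score > -1 else -1

scores = [1.0, 0.0, -1.0, 1.0]  # made-up results
win_labels = [home_win_label(s) for s in scores]
not_lose_labels = [not_lose_label(s) for s in scores]
```

Building new label lists, rather than overwriting `y` in place, keeps the original three-way scores available for later experiments.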

In [283]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X,y)
y_test_pred = logreg.predict(test).astype(int)

In [284]:
mat4 = confusion_matrix(y_test.values.astype(int),y_test_pred)

In [285]:
mat4

array([[70, 38],
       [44, 38]])

In [286]:
score4 = float(mat4[0,0]+mat4[1,1])/np.sum(mat4)

In [287]:
score4

0.5684210526315789

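This is better than a coin flip, but it only matches the accuracy of always predicting "no home win" (108 of the 190 test games, the first row of `mat4`), so it is still not good enough.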

In [None]:
clf = RandomForestClassifier(n_estimators=25)

clf.fit(X,y)

RFC_pred3 = clf.predict(test).astype(int)
RFC_mat3 = confusion_matrix(y_test.astype(int),RFC_pred3)

print RFC_mat3

RFC_score3 = float(RFC_mat3[0,0]+RFC_mat3[1,1])/np.sum(RFC_mat3)

print RFC_score3

Maybe we can better predict whether the home team does not lose:

In [307]:
y = train.loc['score'].copy()
y_test = DF_secondhalf.loc['score'].copy()
y[y>-1]= 1
y_test[y_test>-1]= 1

In [309]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X,y)
y_test_pred = logreg.predict(test).astype(int)
mat5 = confusion_matrix(y_test.values.astype(int),y_test_pred)
score5 = float(mat5[0,0]+mat5[1,1])/np.sum(mat5)

In [310]:
score5

0.6631578947368421

Roughly 2/3! Before building on this, let's compare it with always predicting that the home team does not lose:

In [312]:
onlywin_mat = confusion_matrix(y_test.values.astype(int),[1]*190)
float(onlywin_mat[0,0]+onlywin_mat[1,1])/np.sum(onlywin_mat)

0.6947368421052632

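Ouch... Always predicting that the home team does not lose scores 0.695, which beats our model (0.663). For comparison, always predicting a home loss: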

In [313]:
onlyloss_mat = confusion_matrix(y_test.values.astype(int),[-1]*190)
float(onlyloss_mat[0,0]+onlyloss_mat[1,1])/np.sum(onlyloss_mat)

0.30526315789473685
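
These two constant predictions are instances of a majority-class baseline, which is worth computing for every labelling before judging a model. The sketch below rebuilds the test labels from the class counts implied by the row sums of `RFC_mat` above (58 home losses, 50 draws, 82 home wins); `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))

# Class counts implied by the row sums of RFC_mat above:
# 58 home losses (-1), 50 draws (0), 82 home wins (1).
three_way = [-1] * 58 + [0] * 50 + [1] * 82
not_lose = [-1 if s == -1 else 1 for s in three_way]
home_win = [1 if s == 1 else 0 for s in three_way]

baselines = {
    "three_way": majority_baseline(three_way),
    "not_lose": majority_baseline(not_lose),
    "home_win": majority_baseline(home_win),
}
```

On these counts the baselines are 82/190 for the three-way problem, 108/190 for win versus no win, and 132/190 for not losing, so none of the models above clearly beats its baseline.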