## Match predictions:

The goal of this notebook is to create a model to predict match results, i.e home win / home loss / draw.

We base our model on the intersection of a certain domain knowledge (the kind of information that could influence the result of a game of football) and the data in our possession. Better results could definetily be achevied if we were in possession of more detailed data on previous match such as a breakdown of performances per position or average possession statistics, etc.

This analysis is inspired by the work of Gunjan Kumar in his thesis: "Machine Learning for Soccer Analytics".

In [1]:
% matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# Import all ML modules and packages we'll need
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

In [3]:
import json
# Load in data
filenames = ['BPL/BPL12-13.json']
with open(filenames[0], 'r') as fp:
    data = json.load(fp)

We need to create a dictionary that contains the following data:

- Final score: response variable
- Home team metric average on past 5 games
- Away team metric average on past 5 games
- Sum of the differences of the best player metrics score for each team in the past 5 game
- Average goals against the home team on the past 5 games
- Average goals against the away team on the past 5 games
- Number of losses for the home team in the past 2 games
- Number of losses for the away team in the past 2 games 


We initialise the dictionary with key equal to the match ID and first values: home team, away team, and day.

In [110]:
teams = ['arsenal-fc','aston-villa','chelsea-fc','everton-fc','fulham-fc','liverpool-fc','manchester-city','manchester-united','newcastle-united','norwich-city','queens-park-rangers','reading-fc','southampton-fc','stoke-city','sunderland-afc','swansea-city','tottenham-hotspur','west-bromwich-albion','west-ham-united','wigan-athletic']
games = {}
for t1 in teams:
    for t2 in teams:
        if t1!=t2:
            games[t1+"-"+t2] = {'home': t1, 'away': t2}

In [111]:
for k in games.keys():
    games[k]['day']= data[k]['day']

In order to fill in this dictionary, we need a way to access the data for every player in the team at a particular day. We therefore create a team dictionary. Note that the "if p[2]>=0 else ' '" statement is to remove own goal scorers.

In [125]:
team_players = dict.fromkeys(teams)
for k in games.keys():
    team1 = games[k]['home']
    team2 = games[k]['away']
    
    if team_players[team1] is None:
        team_players[team1] = [p[0] if p[2]>=0 else ' ' for p in data[k]['home']]
    else:
        team_players[team1].extend([p[0] if p[2]>=0 else ' ' for p in data[k]['home']])
    team_players[team1].extend([p[1] if p[2]>=0 else ' ' for p in data[k]['home']])
    
    if team_players[team2] is None:
        team_players[team2] = [p[0] if p[2]>=0 else ' ' for p in data[k]['away']]
    else:
        team_players[team2].extend([p[0] if p[2]>=0 else ' ' for p in data[k]['away']])
    team_players[team2].extend([p[1] if p[2]>=0 else ' ' for p in data[k]['away']])

In [127]:
for k in team_players.keys():
    team_players[k] = list(set(team_players[k]))
    team_players[k].remove(' ')

In [128]:
player_to_club = {}
for k in team_players.keys():
    for v in team_players[k]:
        player_to_club[v] = k

We test it:

In [129]:
team_players['arsenal-fc']

[u'Mikel Arteta',
 u'Aaron Ramsey',
 u'Lukas Podolski',
 u'Jack Wilshere',
 u'Theo Walcott',
 u'Per Mertesacker',
 u'Gervinho',
 u'Kieran Gibbs',
 u'Nacho Monreal',
 u'Alex Oxlade-Chamberlain',
 u'Tom\xc3\xa1\xc5\xa1 Rosick\xc3\xbd',
 u'Laurent Koscielny',
 u'Olivier Giroud',
 u'Santi Cazorla']

We load the feature data:

In [37]:
features12 = pd.read_pickle('Data/features12-13.pkl')

We first fill in the metric averages:

In [130]:
for k in games.keys():
    d = games[k]['day']
    if (d-1) == 0:
        continue
    home_average = 0
    for p in team_players[games[k]['home']]:
        home_average += features12[p]['match_value_list'][d-2]
    games[k]['home_average'] = home_average/len(team_players[games[k]['home']])
    away_average = 0
    for p in team_players[games[k]['away']]:
        away_average += features12[p]['match_value_list'][d-2]
    games[k]['away_average'] = away_average/len(team_players[games[k]['away']])

In [131]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'day': 19,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285}

We fill in the best player differences:

In [132]:
for k in games.keys():
    d = games[k]['day']
    home_team = games[k]['home']
    away_team = games[k]['away']
    best_home = []
    best_away = []
    if (d-1)==0:
        continue
    elif (d-1) == 1:
        for p in team_players[home_team]:
            best_home.append(features12[p]['match_value_list'][0])
        for p in team_players[away_team]:
            best_away.append(features12[p]['match_value_list'][0])
            
    elif (d-1) == 2:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:2]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:2]))
            
    elif (d-1) == 3:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:3]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:3]))
            
    elif (d-1) == 4:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:4]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:4]))
    else:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][(d-6):(d-2)]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][(d-6):(d-2)]))
    
    
    games[k]['best'] = np.sum(best_home)-np.sum(best_away)

In [133]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'best': -19.324999999999996,
 'day': 19,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285}

We add the score:

In [134]:
results = dict.fromkeys(teams)
goals_against = dict.fromkeys(teams)

for k in results.keys():
    results[k] = np.zeros(38)
    goals_against[k] = np.zeros(38)
    
for k in games.keys():
    team1 = games[k]['home']
    team2 = games[k]['away']
    d = games[k]['day']-1
    goal_home = len(data[k]['home'])
    goal_away = len(data[k]['away'])
    
    goals_against[team1][d] = -goal_away
    goals_against[team2][d] = -goal_home
    
    if goal_home > goal_away:
        results[team1][d] = 1
        results[team2][d] = -1
    elif goal_home<goal_away:
        results[team1][d] = -1
        results[team2][d] = 1
    else:
        results[team1][d] = 0
        results[team2][d] = 0

We finally add home losses,wins, goals against, etc...

In [136]:
for k in games.keys():
    
    team1 = games[k]['home']
    team2 = games[k]['away']
    
    d = games[k]['day']-1
    
    if d==0:
        continue
    elif d<=4:
        home_goal_vs = np.sum(goals_against[team1][:d])/d
        away_goal_vs = np.sum(goals_against[team2][:d])/d
        if d<=2:
            home_loss = len(results[team1][:d][results[team1][:d]<0])
            home_win = len(results[team1][:d][results[team1][:d]>0])
            away_loss = len(results[team2][:d][results[team2][:d]<0])
            away_win = len(results[team2][:d][results[team2][:d]>0])
        else:
            home_loss = len(results[team1][d-2:d][results[team1][d-2:d]<0])
            home_win = len(results[team1][d-2:d][results[team1][d-2:d]>0])
            away_loss = len(results[team2][d-2:d][results[team2][d-2:d]<0])
            away_win = len(results[team2][d-2:d][results[team2][d-2:d]>0])
    else:
        home_goal_vs = np.sum(goals_against[team1][d-5:d])/5
        away_goal_vs = np.sum(goals_against[team2][d-5:d])/5
        home_loss = len(results[team1][d-2:d][results[team1][d-2:d]<0])
        home_win = len(results[team1][d-2:d][results[team1][d-2:d]>0])
        away_loss = len(results[team2][d-2:d][results[team2][d-2:d]<0])
        away_win = len(results[team2][d-2:d][results[team2][d-2:d]>0])
        
    games[k]['goals_against_home'] = home_goal_vs
    games[k]['goals_against_away'] = away_goal_vs
    games[k]['home_loss'] = home_loss
    games[k]['away_loss'] = away_loss
    games[k]['home_win'] = home_win
    games[k]['away_win'] = away_win
    games[k]['score'] = results[team1][d]

We therefore have:

In [137]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'away_loss': 0,
 'away_win': 2,
 'best': -19.324999999999996,
 'day': 19,
 'goals_against_away': -1.0,
 'goals_against_home': -1.6000000000000001,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285,
 'home_loss': 1,
 'home_win': 1,
 'score': 1.0}

We now convert it to a dataframe to perform predictions using to start multinomial logistic regression:

In [140]:
DF = pd.DataFrame.from_dict(games)

In [143]:
DF.head()

Unnamed: 0,arsenal-fc-aston-villa,arsenal-fc-chelsea-fc,arsenal-fc-everton-fc,arsenal-fc-fulham-fc,arsenal-fc-liverpool-fc,arsenal-fc-manchester-city,arsenal-fc-manchester-united,arsenal-fc-newcastle-united,arsenal-fc-norwich-city,arsenal-fc-queens-park-rangers,arsenal-fc-reading-fc,arsenal-fc-southampton-fc,arsenal-fc-stoke-city,arsenal-fc-sunderland-afc,arsenal-fc-swansea-city,arsenal-fc-tottenham-hotspur,arsenal-fc-west-bromwich-albion,arsenal-fc-west-ham-united,arsenal-fc-wigan-athletic,aston-villa-arsenal-fc,aston-villa-chelsea-fc,aston-villa-everton-fc,aston-villa-fulham-fc,aston-villa-liverpool-fc,aston-villa-manchester-city,aston-villa-manchester-united,aston-villa-newcastle-united,aston-villa-norwich-city,aston-villa-queens-park-rangers,aston-villa-reading-fc,aston-villa-southampton-fc,aston-villa-stoke-city,aston-villa-sunderland-afc,aston-villa-swansea-city,aston-villa-tottenham-hotspur,aston-villa-west-bromwich-albion,aston-villa-west-ham-united,aston-villa-wigan-athletic,chelsea-fc-arsenal-fc,chelsea-fc-aston-villa,chelsea-fc-everton-fc,chelsea-fc-fulham-fc,chelsea-fc-liverpool-fc,chelsea-fc-manchester-city,chelsea-fc-manchester-united,chelsea-fc-newcastle-united,chelsea-fc-norwich-city,chelsea-fc-queens-park-rangers,chelsea-fc-reading-fc,chelsea-fc-southampton-fc,...,west-bromwich-albion-manchester-united,west-bromwich-albion-newcastle-united,west-bromwich-albion-norwich-city,west-bromwich-albion-queens-park-rangers,west-bromwich-albion-reading-fc,west-bromwich-albion-southampton-fc,west-bromwich-albion-stoke-city,west-bromwich-albion-sunderland-afc,west-bromwich-albion-swansea-city,west-bromwich-albion-tottenham-hotspur,west-bromwich-albion-west-ham-united,west-bromwich-albion-wigan-athletic,west-ham-united-arsenal-fc,west-ham-united-aston-villa,west-ham-united-chelsea-fc,west-ham-united-everton-fc,west-ham-united-fulham-fc,west-ham-united-liverpool-fc,west-ham-united-manchester-city,west-ham-united-manchester-united,west-ham-united-newcastle-united,west-ham-united-norwich-city,west-ham-united-queens-park-rangers,west-ham-united-reading-fc,west-ham-united-southampton-fc,west-ham-united-stoke-city,west-ham-united-sunderland-afc,west-ham-united-swansea-city,west-ham-united-tottenham-hotspur,west-ham-united-west-bromwich-albion,west-ham-united-wigan-athletic,wigan-athletic-arsenal-fc,wigan-athletic-aston-villa,wigan-athletic-chelsea-fc,wigan-athletic-everton-fc,wigan-athletic-fulham-fc,wigan-athletic-liverpool-fc,wigan-athletic-manchester-city,wigan-athletic-manchester-united,wigan-athletic-newcastle-united,wigan-athletic-norwich-city,wigan-athletic-queens-park-rangers,wigan-athletic-reading-fc,wigan-athletic-southampton-fc,wigan-athletic-stoke-city,wigan-athletic-sunderland-afc,wigan-athletic-swansea-city,wigan-athletic-tottenham-hotspur,wigan-athletic-west-bromwich-albion,wigan-athletic-west-ham-united
away,aston-villa,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,west-ham-united,wigan-athletic,arsenal-fc,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,west-ham-united,wigan-athletic,arsenal-fc,aston-villa,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,...,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-ham-united,wigan-athletic,arsenal-fc,aston-villa,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,wigan-athletic,arsenal-fc,aston-villa,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,west-ham-united
away_average,0.25,0.75,0.578125,0.4305556,0.6111111,0.359375,0.4204545,0.603125,0.6470588,0.4907895,0,1.045455,0.2666667,,0.6607143,0.314881,0,0.3026316,1.292969,1.107143,0,0.359375,0,0.06388889,0.59375,0.2159091,0.15,0.1764706,0.5,0.330625,0.5227273,0.7075,0.4107143,0.75,0.2142857,0.4558824,0.3026316,0.071875,0,0.6109375,0.25,0,0.2916667,0.53125,0.6590909,0.3575,0.2794118,0,0,0.5227273,...,0.5795455,0,0.6764706,0.09210526,0.2875,0.09090909,1.3,0,0.5,0.2875,0.5980263,0.65625,0.5178571,,0,0,0.43125,0.3194444,0.359375,0.5909091,0,0.2647059,0,0,1.045455,0.3833333,0.4107143,0,0.2857143,0,0.375,1.047321,0.359375,,0.640625,0.4305556,0.6805556,0,0.1704545,0.725,0.3889706,0.25,0.4125,0.3136364,0,0.4107143,1.046429,0.797619,0.4411765,0.5
away_loss,0,0,1,0,1,0,0,1,1,1,2,2,1,,0,2,2,1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,0,0,1,1,2,1,0,0,2,0,0,0,0,2,2,1,1,...,1,1,0,2,2,2,0,2,1,0,1,1,1,,0,0,1,1,0,0,1,2,0,1,1,1,0,0,0,0,1,0,1,,0,1,1,0,0,1,0,0,0,1,0,1,1,0,1,1
away_win,1,1,1,0,1,2,2,1,0,0,0,0,0,,1,0,0,0,1,1,2,1,1,1,1,2,0,1,2,1,0,2,2,1,1,1,1,0,0,1,1,0,0,2,2,1,0,0,0,1,...,1,1,2,0,0,0,2,0,1,0,1,0,0,,0,1,1,1,2,2,0,0,1,1,0,1,0,1,2,1,0,2,1,,2,1,1,1,2,1,0,0,1,0,0,1,0,1,1,1
best,-3.2375,-19.1875,7.7125,-26.6125,12.6,11.4125,26.15,-7.1125,14.125,14.4375,11.9,-11.4875,12.8625,,9.825,-0.0875,-2.2375,-6.4875,20.175,-4.7875,-12.75,-5.75,16.2875,-24.3125,-8.5125,-21.225,-22,5.55,2.025,-16.8375,-3.45,-15.2,17.175,-13.1125,-22.4125,-10,7.05,-3.0625,1.675,6.4625,21.3125,-22.5125,23.1625,2.5625,-5.325,3.6125,23.75,12.1875,9.35,4.4375,...,5.2,-6.8,-1.4625,14.5875,10.875,0.775,21.675,-1.775,11.725,1.7625,-2.475,-8.6375,-4.575,,-8.775,8.675,-11.7625,-0.75,-5.875,-9.9375,2.8,-2.725,2.8375,1.0125,-4.5375,15.25,-10.8125,-10.2,-2.45,-0.325,1.05,9.8875,-3.3875,,-18.775,-13.9,-8.2,-11.55,-18.2275,-11.025,-4.85,13.05,-7.9875,1,4.3125,-8.5875,-4.6125,2.375,-4.225,-7.625
