## Match predictions:

The goal of this notebook is to build a model that predicts match results, i.e. home win / home loss / draw.

We base our model on the intersection of domain knowledge (the kind of information that could influence the result of a football match) and the data in our possession. Better results could definitely be achieved with more detailed data on previous matches, such as a breakdown of performance per position or average possession statistics.

This analysis is inspired by the work of Gunjan Kumar in his thesis: "Machine Learning for Soccer Analytics".

In [595]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# Import all ML modules and packages we'll need
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from __future__ import division

## Data preparation

In [3]:
import json
# Load in data
filenames = ['BPL/BPL12-13.json']
with open(filenames[0], 'r') as fp:
    data = json.load(fp)

We need to create a dictionary that contains the following data:

- Final score: response variable
- Home team metric average on past 5 games
- Away team metric average on past 5 games
- Sum of the differences of the best player metrics score for each team in the past 5 games
- Average goals against the home team on the past 5 games
- Average goals against the away team on the past 5 games
- Number of losses for the home team in the past 2 games
- Number of losses for the away team in the past 2 games 


We initialise the dictionary, keyed by match ID (home team and away team joined by a hyphen), with the first values: home team, away team and match day.

In [464]:
teams = ['arsenal-fc','aston-villa','chelsea-fc','everton-fc','fulham-fc','liverpool-fc','manchester-city','manchester-united','newcastle-united','norwich-city','queens-park-rangers','reading-fc','southampton-fc','stoke-city','sunderland-afc','swansea-city','tottenham-hotspur','west-bromwich-albion','west-ham-united','wigan-athletic']
games = {}
for t1 in teams:
    for t2 in teams:
        if t1!=t2:
            games[t1+"-"+t2] = {'home': t1, 'away': t2}

In [465]:
for k in games.keys():
    games[k]['day']= data[k]['day']

To fill in this dictionary, we need access to the data of every player of a team on a given match day. We therefore build a dictionary of players per team. Note that the "if p[2]>=0 else ' '" expression replaces own-goal scorers with a blank placeholder, which is removed afterwards.

In [466]:
team_players = dict.fromkeys(teams)
for k in games.keys():
    team1 = games[k]['home']
    team2 = games[k]['away']
    
    if team_players[team1] is None:
        team_players[team1] = [p[0] if p[2]>=0 else ' ' for p in data[k]['home']]
    else:
        team_players[team1].extend([p[0] if p[2]>=0 else ' ' for p in data[k]['home']])
    team_players[team1].extend([p[1] if p[2]>=0 else ' ' for p in data[k]['home']])
    
    if team_players[team2] is None:
        team_players[team2] = [p[0] if p[2]>=0 else ' ' for p in data[k]['away']]
    else:
        team_players[team2].extend([p[0] if p[2]>=0 else ' ' for p in data[k]['away']])
    team_players[team2].extend([p[1] if p[2]>=0 else ' ' for p in data[k]['away']])

In [467]:
for k in team_players.keys():
    players = set(team_players[k])  # deduplicate player names
    players.discard(' ')  # drop the own-goal placeholder; no error if it is absent
    team_players[k] = list(players)
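The roster-building cells above can be condensed by collecting names into sets directly. The following is only a minimal sketch on made-up data: it assumes, as the indexing in the cells suggests, that each goal record starts with the scorer, the assister and a flag that is negative for own goals. `toy_data`, `toy_games` and `build_rosters` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Made-up goal records: [scorer, assister, flag]; flag < 0 marks an own goal
# (assumed layout, inferred from the p[0], p[1], p[2] indexing in the notebook).
toy_data = {
    "a-b": {"home": [["Ann", "Bob", 12], ["Cat", "Ann", 70]],
            "away": [["Dan", "Eve", -1]]},
    "b-a": {"home": [["Eve", "Dan", 5]],
            "away": [["Bob", "Cat", 33]]},
}
toy_games = {"a-b": {"home": "a", "away": "b"},
             "b-a": {"home": "b", "away": "a"}}


def build_rosters(games, data):
    """Collect, per team, every scorer and assister seen in non-own-goal records."""
    rosters = defaultdict(set)
    for key, game in games.items():
        for side in ("home", "away"):
            for record in data[key][side]:
                if record[2] >= 0:  # skip own-goal records entirely
                    rosters[game[side]].update([record[0], record[1]])
    return {team: sorted(players) for team, players in rosters.items()}


rosters = build_rosters(toy_games, toy_data)
print(rosters)
```

Because own-goal records are skipped instead of being replaced by a placeholder, there is nothing to remove afterwards.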

In [468]:
player_to_club = {}
for k in team_players.keys():
    for v in team_players[k]:
        player_to_club[v] = k

We test it:

In [469]:
team_players['arsenal-fc']

[u'Mikel Arteta',
 u'Aaron Ramsey',
 u'Lukas Podolski',
 u'Jack Wilshere',
 u'Theo Walcott',
 u'Per Mertesacker',
 u'Gervinho',
 u'Kieran Gibbs',
 u'Nacho Monreal',
 u'Alex Oxlade-Chamberlain',
 u'Tom\xc3\xa1\xc5\xa1 Rosick\xc3\xbd',
 u'Laurent Koscielny',
 u'Olivier Giroud',
 u'Santi Cazorla']

We load the feature data:

In [470]:
features12 = pd.read_pickle('Data/features12-13.pkl')

We first fill in the metric averages:

In [471]:
for k in games.keys():
    d = games[k]['day']
    if (d-1) == 0:
        continue
    home_average = 0
    for p in team_players[games[k]['home']]:
        home_average += features12[p]['match_value_list'][d-2]
    games[k]['home_average'] = home_average/len(team_players[games[k]['home']])
    away_average = 0
    for p in team_players[games[k]['away']]:
        away_average += features12[p]['match_value_list'][d-2]
    games[k]['away_average'] = away_average/len(team_players[games[k]['away']])

In [472]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'day': 19,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285}

We fill in the best player differences:

In [473]:
for k in games.keys():
    d = games[k]['day']
    home_team = games[k]['home']
    away_team = games[k]['away']
    best_home = []
    best_away = []
    if (d-1)==0:
        continue
    elif (d-1) == 1:
        for p in team_players[home_team]:
            best_home.append(features12[p]['match_value_list'][0])
        for p in team_players[away_team]:
            best_away.append(features12[p]['match_value_list'][0])
            
    elif (d-1) == 2:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][:2]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][:2]))
            
    else:
        for p in team_players[home_team]:
            best_home.append(np.max(features12[p]['match_value_list'][d-4:d-1]))
        for p in team_players[away_team]:
            best_away.append(np.max(features12[p]['match_value_list'][d-4:d-1]))
            
#     elif (d-1) == 4:
#         for p in team_players[home_team]:
#             best_home.append(np.max(features12[p]['match_value_list'][:4]))
#         for p in team_players[away_team]:
#             best_away.append(np.max(features12[p]['match_value_list'][:4]))
#     else:
#         for p in team_players[home_team]:
#             best_home.append(np.max(features12[p]['match_value_list'][(d-6):(d-2)]))
#         for p in team_players[away_team]:
#             best_away.append(np.max(features12[p]['match_value_list'][(d-6):(d-2)]))
    
    
    games[k]['best'] = np.sum(best_home)-np.sum(best_away)

In [474]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'best': -13.650000000000002,
 'day': 19,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285}
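The three branches of the cell above differ only in where the window starts, so they can be expressed with a single clamped slice. This is a sketch under the assumption that each player's `match_value_list` is an ordinary sequence indexed by match day minus one; `best_recent` and `best_difference` are hypothetical helpers and the values are made up.

```python
import numpy as np


def best_recent(values, d, window=3):
    """Best value over the last `window` matches before match day d (1-indexed).

    Early in the season the window simply shrinks; d must be at least 2,
    since there is no history before the first match day.
    """
    start = max(0, d - 1 - window)
    return np.max(values[start:d - 1])


def best_difference(home_lists, away_lists, d):
    """Sum of per-player recent maxima, home team minus away team."""
    home = sum(best_recent(v, d) for v in home_lists)
    away = sum(best_recent(v, d) for v in away_lists)
    return home - away


# Made-up per-player match values for two home players and one away player.
toy_home = [[1.0, 0.0, 2.0, 0.5], [0.0, 3.0, 1.0, 0.0]]
toy_away = [[0.5, 0.5, 0.5, 4.0]]

print(best_difference(toy_home, toy_away, d=2))
print(best_difference(toy_home, toy_away, d=5))
```

With `window=3` the slice reproduces the `[0]`, `[:2]` and `[d-4:d-1]` cases of the original cell.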

We add the score:

In [475]:
results = dict.fromkeys(teams)
goals_against = dict.fromkeys(teams)

for k in results.keys():
    results[k] = np.zeros(38)
    goals_against[k] = np.zeros(38)
    
for k in games.keys():
    team1 = games[k]['home']
    team2 = games[k]['away']
    d = games[k]['day']-1
    goal_home = len(data[k]['home'])
    goal_away = len(data[k]['away'])
    
    goals_against[team1][d] = -goal_away
    goals_against[team2][d] = -goal_home
    
    if goal_home > goal_away:
        results[team1][d] = 1
        results[team2][d] = -1
    elif goal_home<goal_away:
        results[team1][d] = -1
        results[team2][d] = 1
    else:
        results[team1][d] = 0
        results[team2][d] = 0

Finally, we add the recent losses, wins and goals against for both teams:

In [476]:
for k in games.keys():
    
    team1 = games[k]['home']
    team2 = games[k]['away']
    
    d = games[k]['day']-1
    
    if d==0:
        continue
    elif d<=4:
        home_goal_vs = np.sum(goals_against[team1][:d])/d
        away_goal_vs = np.sum(goals_against[team2][:d])/d
        if d<=2:
            home_loss = len(results[team1][:d][results[team1][:d]<0])
            home_win = len(results[team1][:d][results[team1][:d]>0])
            away_loss = len(results[team2][:d][results[team2][:d]<0])
            away_win = len(results[team2][:d][results[team2][:d]>0])
        else:
            home_loss = len(results[team1][d-2:d][results[team1][d-2:d]<0])
            home_win = len(results[team1][d-2:d][results[team1][d-2:d]>0])
            away_loss = len(results[team2][d-2:d][results[team2][d-2:d]<0])
            away_win = len(results[team2][d-2:d][results[team2][d-2:d]>0])
    else:
        home_goal_vs = np.sum(goals_against[team1][d-5:d])/5
        away_goal_vs = np.sum(goals_against[team2][d-5:d])/5
        home_loss = len(results[team1][d-2:d][results[team1][d-2:d]<0])
        home_win = len(results[team1][d-2:d][results[team1][d-2:d]>0])
        away_loss = len(results[team2][d-2:d][results[team2][d-2:d]<0])
        away_win = len(results[team2][d-2:d][results[team2][d-2:d]>0])
        
    games[k]['goals_against_home'] = home_goal_vs
    games[k]['goals_against_away'] = away_goal_vs
    games[k]['home_loss'] = home_loss
    games[k]['away_loss'] = away_loss
    games[k]['home_win'] = home_win
    games[k]['away_win'] = away_win
    games[k]['score'] = results[team1][d]

We therefore have:

In [477]:
games[games.keys()[0]]

{'away': 'manchester-city',
 'away_average': 0.609375,
 'away_loss': 0,
 'away_win': 2,
 'best': -13.650000000000002,
 'day': 19,
 'goals_against_away': -1.0,
 'goals_against_home': -1.6000000000000001,
 'home': 'sunderland-afc',
 'home_average': 0.4723214285714285,
 'home_loss': 1,
 'home_win': 1,
 'score': 1.0}
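The nested `if`/`else` branches in the cell that computes these features only handle windows that are shorter at the start of the season. A clamped slice gives the same numbers in one place. The sketch below assumes the same conventions as the notebook (0-indexed match index `d`, results coded as 1 / 0 / -1, goals against stored as negative numbers); `form_features` and the toy arrays are hypothetical.

```python
import numpy as np


def form_features(results, goals_against, d, goal_window=5, result_window=2):
    """Recent form of one team before match index d (0-indexed, d >= 1)."""
    recent_goals = goals_against[max(0, d - goal_window):d]
    recent_results = results[max(0, d - result_window):d]
    return {
        "goals_against": recent_goals.mean(),          # average over available matches
        "losses": int((recent_results < 0).sum()),
        "wins": int((recent_results > 0).sum()),
    }


# Made-up season start for a single team.
toy_results = np.array([1, -1, 0, -1, -1, 1, 1])
toy_goals_against = np.array([0, -2, -1, -3, -1, 0, -1])

print(form_features(toy_results, toy_goals_against, d=1))
print(form_features(toy_results, toy_goals_against, d=6))
```

Calling it once for the home team and once for the away team would fill the six form entries of a match.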

We now convert it to a DataFrame so that we can make predictions, starting with a multi-class logistic regression:

In [478]:
DF = pd.DataFrame.from_dict(games)

In [524]:
DF.head()

Unnamed: 0,arsenal-fc-aston-villa,arsenal-fc-chelsea-fc,arsenal-fc-everton-fc,arsenal-fc-fulham-fc,arsenal-fc-liverpool-fc,arsenal-fc-manchester-city,arsenal-fc-manchester-united,arsenal-fc-newcastle-united,arsenal-fc-norwich-city,arsenal-fc-queens-park-rangers,arsenal-fc-reading-fc,arsenal-fc-southampton-fc,arsenal-fc-stoke-city,arsenal-fc-sunderland-afc,arsenal-fc-swansea-city,arsenal-fc-tottenham-hotspur,arsenal-fc-west-bromwich-albion,arsenal-fc-west-ham-united,arsenal-fc-wigan-athletic,aston-villa-arsenal-fc,aston-villa-chelsea-fc,aston-villa-everton-fc,aston-villa-fulham-fc,aston-villa-liverpool-fc,aston-villa-manchester-city,aston-villa-manchester-united,aston-villa-newcastle-united,aston-villa-norwich-city,aston-villa-queens-park-rangers,aston-villa-reading-fc,aston-villa-southampton-fc,aston-villa-stoke-city,aston-villa-sunderland-afc,aston-villa-swansea-city,aston-villa-tottenham-hotspur,aston-villa-west-bromwich-albion,aston-villa-west-ham-united,aston-villa-wigan-athletic,chelsea-fc-arsenal-fc,chelsea-fc-aston-villa,chelsea-fc-everton-fc,chelsea-fc-fulham-fc,chelsea-fc-liverpool-fc,chelsea-fc-manchester-city,chelsea-fc-manchester-united,chelsea-fc-newcastle-united,chelsea-fc-norwich-city,chelsea-fc-queens-park-rangers,chelsea-fc-reading-fc,chelsea-fc-southampton-fc,...,west-bromwich-albion-manchester-united,west-bromwich-albion-newcastle-united,west-bromwich-albion-norwich-city,west-bromwich-albion-queens-park-rangers,west-bromwich-albion-reading-fc,west-bromwich-albion-southampton-fc,west-bromwich-albion-stoke-city,west-bromwich-albion-sunderland-afc,west-bromwich-albion-swansea-city,west-bromwich-albion-tottenham-hotspur,west-bromwich-albion-west-ham-united,west-bromwich-albion-wigan-athletic,west-ham-united-arsenal-fc,west-ham-united-aston-villa,west-ham-united-chelsea-fc,west-ham-united-everton-fc,west-ham-united-fulham-fc,west-ham-united-liverpool-fc,west-ham-united-manchester-city,west-ham-united-manchester-united,west-ham-united-newcastle-united,west-ham-united-norwich-city,west-ham-united-queens-park-rangers,west-ham-united-reading-fc,west-ham-united-southampton-fc,west-ham-united-stoke-city,west-ham-united-sunderland-afc,west-ham-united-swansea-city,west-ham-united-tottenham-hotspur,west-ham-united-west-bromwich-albion,west-ham-united-wigan-athletic,wigan-athletic-arsenal-fc,wigan-athletic-aston-villa,wigan-athletic-chelsea-fc,wigan-athletic-everton-fc,wigan-athletic-fulham-fc,wigan-athletic-liverpool-fc,wigan-athletic-manchester-city,wigan-athletic-manchester-united,wigan-athletic-newcastle-united,wigan-athletic-norwich-city,wigan-athletic-queens-park-rangers,wigan-athletic-reading-fc,wigan-athletic-southampton-fc,wigan-athletic-stoke-city,wigan-athletic-sunderland-afc,wigan-athletic-swansea-city,wigan-athletic-tottenham-hotspur,wigan-athletic-west-bromwich-albion,wigan-athletic-west-ham-united
away,aston-villa,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,west-ham-united,wigan-athletic,arsenal-fc,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,west-ham-united,wigan-athletic,arsenal-fc,aston-villa,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,...,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-ham-united,wigan-athletic,arsenal-fc,aston-villa,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,wigan-athletic,arsenal-fc,aston-villa,chelsea-fc,everton-fc,fulham-fc,liverpool-fc,manchester-city,manchester-united,newcastle-united,norwich-city,queens-park-rangers,reading-fc,southampton-fc,stoke-city,sunderland-afc,swansea-city,tottenham-hotspur,west-bromwich-albion,west-ham-united
away_average,0.25,0.75,0.578125,0.4305556,0.6111111,0.359375,0.4204545,0.603125,0.6470588,0.4907895,0,1.045455,0.2666667,,0.6607143,0.314881,0,0.3026316,1.292969,1.107143,0,0.359375,0,0.06388889,0.59375,0.2159091,0.15,0.1764706,0.5,0.330625,0.5227273,0.7075,0.4107143,0.75,0.2142857,0.4558824,0.3026316,0.071875,0,0.6109375,0.25,0,0.2916667,0.53125,0.6590909,0.3575,0.2794118,0,0,0.5227273,...,0.5795455,0,0.6764706,0.09210526,0.2875,0.09090909,1.3,0,0.5,0.2875,0.5980263,0.65625,0.5178571,,0,0,0.43125,0.3194444,0.359375,0.5909091,0,0.2647059,0,0,1.045455,0.3833333,0.4107143,0,0.2857143,0,0.375,1.047321,0.359375,,0.640625,0.4305556,0.6805556,0,0.1704545,0.725,0.3889706,0.25,0.4125,0.3136364,0,0.4107143,1.046429,0.797619,0.4411765,0.5
away_loss,0,0,1,0,1,0,0,1,1,1,2,2,1,,0,2,2,1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,0,0,1,1,2,1,0,0,2,0,0,0,0,2,2,1,1,...,1,1,0,2,2,2,0,2,1,0,1,1,1,,0,0,1,1,0,0,1,2,0,1,1,1,0,0,0,0,1,0,1,,0,1,1,0,0,1,0,0,0,1,0,1,1,0,1,1
away_win,1,1,1,0,1,2,2,1,0,0,0,0,0,,1,0,0,0,1,1,2,1,1,1,1,2,0,1,2,1,0,2,2,1,1,1,1,0,0,1,1,0,0,2,2,1,0,0,0,1,...,1,1,2,0,0,0,2,0,1,0,1,0,0,,0,1,1,1,2,2,0,0,1,1,0,1,0,1,2,1,0,2,1,,2,1,1,1,2,1,0,0,1,0,0,1,0,1,1,1
best,0.625,-18.2875,1.2,-30.875,-16.375,18.4875,19.0125,-1.35,5.7125,10.6875,10.625,-11.4875,-2.15,,7.8125,6.9375,-9.175,4.3125,-9.9625,-12.1375,6.2625,-5.75,22.4,-0.3875,-3.9625,-15.6125,8.2125,-3.6125,-10.2,-4.8625,-4.025,-26.25,1.6875,-13.1125,-9.0875,0.475,3.3125,1.7375,-7.675,2.85,18.2625,-17.625,16.675,-1.1,11.3,3.6125,18.925,34.6375,9.35,7.325,...,5.075,-4.4375,-24.1375,-3.775,4.375,11.5375,-1.4125,12.3375,1.575,8.7375,-27.6,-3.0375,-2.825,,-0.275,10.4125,-11.7625,18.2625,-10.475,-2.2,7.0375,1.25,-4.6125,-8.6125,-2.9875,2.875,-10.8125,-0.7125,-2,7.25,-9.95,-3.125,-6.3875,,-31.0625,-6.15,-1.35,-2.25,-11.725,-12.6125,1.3875,12.55,-6.5,5.025,4.3125,0.45,-1.2,-21.25,-7.075,-8.6125
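The frame above has one column per match and one row per feature, which is why later cells pick features by row position and transpose. An alternative layout, sketched here on made-up data, is to pass `orient="index"` to `pd.DataFrame.from_dict` so that each match becomes a row; features can then be selected by name and the season split by the `day` column. `toy_games` and `feature_names` are hypothetical and only two features are used to keep the example short.

```python
import pandas as pd

# Made-up matches with a small subset of the notebook's keys.
toy_games = {
    "a-b": {"home": "a", "away": "b", "day": 2, "home_win": 1, "away_win": 0, "score": 1},
    "b-c": {"home": "b", "away": "c", "day": 20, "home_win": 0, "away_win": 2, "score": -1},
    "c-a": {"home": "c", "away": "a", "day": 1, "home_win": 0, "away_win": 0, "score": 0},
}

# One row per match, one column per feature.
df = pd.DataFrame.from_dict(toy_games, orient="index")

feature_names = ["home_win", "away_win"]          # chosen by label, not by position
is_train = (df["day"] > 1) & (df["day"] <= 19)    # first half, first match day dropped
is_test = df["day"] > 19                          # second half

X_train, y_train = df.loc[is_train, feature_names], df.loc[is_train, "score"]
X_test, y_test = df.loc[is_test, feature_names], df.loc[is_test, "score"]
print(X_train)
```

In this layout each column keeps its own dtype, and adding or removing a feature does not shift the positions of the others.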


## Analysis:

In [195]:
import statsmodels.api as st
from sklearn.linear_model import LogisticRegression

We will train on the first half of the season (except the first match day, for which we have no explanatory variables):

In [395]:
firsthalf = DF.loc['day']<=19
secondhalf = ~firsthalf

In [396]:
DF_firsthalf = DF[DF.columns[firsthalf]]
DF_secondhalf = DF[DF.columns[secondhalf]]

In [397]:
# We remove the first day:
train = DF_firsthalf[DF_firsthalf.columns[DF_firsthalf.loc['day']>1]]

In [398]:
y = train.loc['score'].copy()

In [403]:
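# Feature rows of DF by position: away_average, away_loss, away_win, best,
# goals_against_away, goals_against_home, home_average, home_loss, home_win
X = train.iloc[[1,2,3,4,6,7,9,10,11],:].T
test = DF_secondhalf.iloc[[1,2,3,4,6,7,9,10,11],:].T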
y_test = DF_secondhalf.loc['score'].copy()

In [321]:
X.head()

Unnamed: 0,away_average,away_loss,away_win,best,goals_against_away,goals_against_home,home_average,home_loss,home_win
arsenal-fc-chelsea-fc,0.75,0,1,-19.1875,-0.4,-0.4,0.2464286,0,1
arsenal-fc-fulham-fc,0.4305556,0,0,-26.6125,-1.8,-1.2,0.08214286,1,1
arsenal-fc-queens-park-rangers,0.4907895,1,0,14.4375,-1.6,-1.2,0.0,1,1
arsenal-fc-southampton-fc,1.045455,2,0,-11.4875,-2.666667,0.0,0.6160714,0,1
arsenal-fc-swansea-city,0.6607143,0,1,9.825,-0.8,-1.6,0.4723214,0,0


We fit the model:

In [230]:
logreg = LogisticRegression(C=1e5)

In [231]:
logreg.fit(X,y)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

We predict the second half:

In [232]:
y_test_pred = logreg.predict(test)

We compute the confusion matrix to check how well we actually do:

In [245]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test.values.astype(int),y_test_pred.astype(int))

In [257]:
score = float(mat[0,0]+mat[1,1]+mat[2,2])/np.sum(mat)

In [258]:
print score

0.384210526316


Not great. Let's see whether a larger training set gives better results:

In [250]:
tr = DF.loc['day']<=27
ts = ~tr

In [251]:
DF_tr = DF[DF.columns[tr]]
DF_ts = DF[DF.columns[ts]]
train2 = DF_tr[DF_tr.columns[DF_tr.loc['day']>1]]
y2 = train2.loc['score']

X2 = train2.iloc[[1,2,3,4,6,7,9,10,11],:].T
test2 = DF_ts.iloc[[1,2,3,4,6,7,9,10,11],:].T
y_test2 = DF_ts.loc['score']

In [252]:
logreg2 = LogisticRegression(C=1e5)
logreg2.fit(X2,y2)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

In [253]:
y_test_pred2 = logreg2.predict(test2)

In [254]:
mat2 = confusion_matrix(y_test2.values.astype(int),y_test_pred2.astype(int))

In [255]:
mat2

array([[ 5,  9, 21],
       [ 7,  3, 18],
       [ 9,  5, 33]])

In [259]:
score2 = float(mat2[0,0]+mat2[1,1]+mat2[2,2])/np.sum(mat2)

In [260]:
print score2

0.372727272727
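The score is the share of test matches on the diagonal of the confusion matrix. A small sketch of that arithmetic, using the `mat2` values printed above; `accuracy_from_confusion` is a hypothetical helper, and `sklearn.metrics.accuracy_score` computes the same quantity directly from the true and predicted labels.

```python
import numpy as np


def accuracy_from_confusion(mat):
    """Correct predictions (the diagonal) divided by all predictions."""
    mat = np.asarray(mat)
    return np.trace(mat) / float(mat.sum())


# Values copied from the confusion matrix shown above.
mat2 = np.array([[5, 9, 21],
                 [7, 3, 18],
                 [9, 5, 33]])

# (5 + 3 + 33) correct out of 110 test matches
print(accuracy_from_confusion(mat2))
```

Using the trace avoids writing out one diagonal term per class, so the same helper works for the two-class matrices as well.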


Nope.

### Let's try random forest:

In [262]:
from sklearn.ensemble import RandomForestClassifier

In [263]:
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [265]:
RFC_pred = clf.predict(test).astype(int)

In [266]:
RFC_mat = confusion_matrix(y_test.astype(int),RFC_pred)

In [267]:
RFC_mat

array([[14, 17, 27],
       [12, 18, 20],
       [18, 14, 50]])

In [268]:
RFC_score = float(RFC_mat[0,0]+RFC_mat[1,1]+RFC_mat[2,2])/np.sum(RFC_mat)

In [269]:
print RFC_score

0.431578947368


Slightly better....

In [279]:
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X2,y2)
RFC_pred2 = clf.predict(test2).astype(int)
RFC_mat2 = confusion_matrix(y_test2.astype(int),RFC_pred2)

print RFC_mat2

RFC_score2 = float(RFC_mat2[0,0]+RFC_mat2[1,1]+RFC_mat2[2,2])/np.sum(RFC_mat2)

print RFC_score2

[[15  8 12]
 [ 7  5 16]
 [10 10 27]]
0.427272727273


Let's now see if we can at least better predict whether the home team wins or not (draws and losses are merged into a single class):

In [281]:
y[y<1]= 0 

In [282]:
y_test[y_test<1] = 0

In [283]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X,y)
y_test_pred = logreg.predict(test).astype(int)

In [284]:
mat4 = confusion_matrix(y_test.values.astype(int),y_test_pred)

In [285]:
mat4

array([[70, 38],
       [44, 38]])

In [286]:
score4 = float(mat4[0,0]+mat4[1,1])/np.sum(mat4)

In [287]:
score4

0.5684210526315789

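The accuracy (0.568) beats a coin flip, but it is no better than always predicting "no home win": 108 of the 190 test matches (70 + 38) fall in that class, which gives the same 0.568. Still not good enough.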

In [None]:
clf = RandomForestClassifier(n_estimators=25)

clf.fit(X,y)

RFC_pred3 = clf.predict(test).astype(int)
RFC_mat3 = confusion_matrix(y_test.astype(int),RFC_pred3)

print RFC_mat3

RFC_score3 = float(RFC_mat3[0,0]+RFC_mat3[1,1])/np.sum(RFC_mat3)

print RFC_score3

Maybe we can better predict whether the home team does not lose (wins and draws are merged into a single class):

In [307]:
y = train.loc['score'].copy()
y_test = DF_secondhalf.loc['score'].copy()
y[y>-1]= 1
y_test[y_test>-1]= 1

In [309]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X,y)
y_test_pred = logreg.predict(test).astype(int)
mat5 = confusion_matrix(y_test.values.astype(int),y_test_pred)
score5 = float(mat5[0,0]+mat5[1,1])/np.sum(mat5)

In [310]:
score5

0.6631578947368421

About 2/3! Let's build on this, starting with a comparison against always predicting "home team does not lose":

In [312]:
onlywin_mat = confusion_matrix(y_test.values.astype(int),[1]*190)
float(onlywin_mat[0,0]+onlywin_mat[1,1])/np.sum(onlywin_mat)

0.6947368421052632

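Ouch: always predicting that the home team does not lose scores 0.695, better than the model's 0.663. For completeness, always predicting a home loss: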

In [313]:
onlyloss_mat = confusion_matrix(y_test.values.astype(int),[-1]*190)
float(onlyloss_mat[0,0]+onlyloss_mat[1,1])/np.sum(onlyloss_mat)

0.30526315789473685
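The two constant predictions above are baselines: a model is only useful if it beats the accuracy obtained by always predicting the most frequent class. A minimal sketch of that check follows; `majority_baseline` is a hypothetical helper, and the toy labels are built from the class counts implied by the two scores above (58 home losses and 132 other results out of 190).

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))


# Toy labels with the same class counts as the test set above.
toy_labels = [-1] * 58 + [1] * 132

label, baseline = majority_baseline(toy_labels)
print(label)
print(baseline)
```

Computing this number before fitting any model gives the threshold that every later score should be compared with.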

#### Let's try to improve results by looking at the models in more depth:

Let's focus on one model in particular, the random forest:

In [507]:
DF_firsthalf = DF[DF.columns[firsthalf]]
DF_secondhalf = DF[DF.columns[secondhalf]]
train = DF_firsthalf[DF_firsthalf.columns[DF_firsthalf.loc['day']>1]]
y = train.loc['score'].copy()
X = train.iloc[[1,2,3,4,6,7,9,10,11],:].T
test = DF_secondhalf.iloc[[1,2,3,4,6,7,9,10,11],:].T
y_test = DF_secondhalf.loc['score'].copy()

In [511]:
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X,y)

RFC_pred = clf.predict(test).astype(int)
RFC_mat = confusion_matrix(y_test.astype(int),RFC_pred)

print RFC_mat

RFC_score = float(RFC_mat[0,0]+RFC_mat[1,1]+RFC_mat[2,2])/np.sum(RFC_mat)
print RFC_score

[[21 12 25]
 [13 18 19]
 [18 21 43]]
0.431578947368


In [463]:
zip(DF.index[[1,2,3,4,5,7,8,10,11,12]],clf.feature_importances_)

[('away_average', 0.14642648213399728),
 ('away_loss', 0.037159761707974776),
 ('away_win', 0.032031118693113647),
 ('best_away', 0.21480853295013938),
 ('best_home', 0.13386022164207761),
 ('goals_against_away', 0.12788066270244336),
 ('goals_against_home', 0.12051259638730313),
 ('home_average', 0.11557261273151902),
 ('home_loss', 0.033852519557199978),
 ('home_win', 0.037895491494231826)]

### Let's try to add a team factor:

In [518]:
team_df = dict.fromkeys(games.keys())

In [521]:
for k in team_df.keys():
    team_df[k] = {}
    for team in teams:
        team_df[k][team] = 0
    team_df[k][games[k]['home']] = 1
    team_df[k][games[k]['away']] = 1

In [522]:
team_df = pd.DataFrame.from_dict(team_df)

In [523]:
team_df.head()

Unnamed: 0,arsenal-fc-aston-villa,arsenal-fc-chelsea-fc,arsenal-fc-everton-fc,arsenal-fc-fulham-fc,arsenal-fc-liverpool-fc,arsenal-fc-manchester-city,arsenal-fc-manchester-united,arsenal-fc-newcastle-united,arsenal-fc-norwich-city,arsenal-fc-queens-park-rangers,arsenal-fc-reading-fc,arsenal-fc-southampton-fc,arsenal-fc-stoke-city,arsenal-fc-sunderland-afc,arsenal-fc-swansea-city,arsenal-fc-tottenham-hotspur,arsenal-fc-west-bromwich-albion,arsenal-fc-west-ham-united,arsenal-fc-wigan-athletic,aston-villa-arsenal-fc,aston-villa-chelsea-fc,aston-villa-everton-fc,aston-villa-fulham-fc,aston-villa-liverpool-fc,aston-villa-manchester-city,aston-villa-manchester-united,aston-villa-newcastle-united,aston-villa-norwich-city,aston-villa-queens-park-rangers,aston-villa-reading-fc,aston-villa-southampton-fc,aston-villa-stoke-city,aston-villa-sunderland-afc,aston-villa-swansea-city,aston-villa-tottenham-hotspur,aston-villa-west-bromwich-albion,aston-villa-west-ham-united,aston-villa-wigan-athletic,chelsea-fc-arsenal-fc,chelsea-fc-aston-villa,chelsea-fc-everton-fc,chelsea-fc-fulham-fc,chelsea-fc-liverpool-fc,chelsea-fc-manchester-city,chelsea-fc-manchester-united,chelsea-fc-newcastle-united,chelsea-fc-norwich-city,chelsea-fc-queens-park-rangers,chelsea-fc-reading-fc,chelsea-fc-southampton-fc,...,west-bromwich-albion-manchester-united,west-bromwich-albion-newcastle-united,west-bromwich-albion-norwich-city,west-bromwich-albion-queens-park-rangers,west-bromwich-albion-reading-fc,west-bromwich-albion-southampton-fc,west-bromwich-albion-stoke-city,west-bromwich-albion-sunderland-afc,west-bromwich-albion-swansea-city,west-bromwich-albion-tottenham-hotspur,west-bromwich-albion-west-ham-united,west-bromwich-albion-wigan-athletic,west-ham-united-arsenal-fc,west-ham-united-aston-villa,west-ham-united-chelsea-fc,west-ham-united-everton-fc,west-ham-united-fulham-fc,west-ham-united-liverpool-fc,west-ham-united-manchester-city,west-ham-united-manchester-united,west-ham-united-newcastle-united,west-ham-united-norwich-city,west-ham-united-queens-park-rangers,west-ham-united-reading-fc,west-ham-united-southampton-fc,west-ham-united-stoke-city,west-ham-united-sunderland-afc,west-ham-united-swansea-city,west-ham-united-tottenham-hotspur,west-ham-united-west-bromwich-albion,west-ham-united-wigan-athletic,wigan-athletic-arsenal-fc,wigan-athletic-aston-villa,wigan-athletic-chelsea-fc,wigan-athletic-everton-fc,wigan-athletic-fulham-fc,wigan-athletic-liverpool-fc,wigan-athletic-manchester-city,wigan-athletic-manchester-united,wigan-athletic-newcastle-united,wigan-athletic-norwich-city,wigan-athletic-queens-park-rangers,wigan-athletic-reading-fc,wigan-athletic-southampton-fc,wigan-athletic-stoke-city,wigan-athletic-sunderland-afc,wigan-athletic-swansea-city,wigan-athletic-tottenham-hotspur,wigan-athletic-west-bromwich-albion,wigan-athletic-west-ham-united
arsenal-fc,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
aston-villa,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
chelsea-fc,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
everton-fc,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
fulham-fc,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
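The team factor is a set of indicator variables: for each match, the two teams involved get a 1 and every other team a 0. The sketch below rebuilds that encoding on made-up data with three teams, using one row per match (`orient="index"`) instead of the column-per-match layout above; `toy_teams`, `toy_games` and `team_indicators` are hypothetical names.

```python
import pandas as pd

toy_teams = ["a", "b", "c"]
toy_games = {"a-b": {"home": "a", "away": "b"},
             "c-a": {"home": "c", "away": "a"}}

rows = {}
for key, game in toy_games.items():
    row = dict.fromkeys(toy_teams, 0)   # every team starts at 0
    row[game["home"]] = 1               # the two teams playing get a 1
    row[game["away"]] = 1
    rows[key] = row

# One row per match, one indicator column per team, in a fixed column order.
team_indicators = pd.DataFrame.from_dict(rows, orient="index")[toy_teams]
print(team_indicators)
```

Note that this encoding, like the one in the notebook, does not say which of the two teams plays at home. Separate home and away indicator columns would be one way to keep that information, at the cost of doubling the number of columns.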


In [525]:
big_df = pd.concat([DF,team_df])

In [527]:
big_df_firsthalf = big_df[big_df.columns[firsthalf]]
big_df_secondhalf = big_df[big_df.columns[secondhalf]]
train = big_df_firsthalf[big_df_firsthalf.columns[big_df_firsthalf.loc['day']>1]]
y = train.loc['score'].copy()
X = train.iloc[[1,2,3,4,6,7,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32],:].T
test = big_df_secondhalf.iloc[[1,2,3,4,6,7,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32],:].T
y_test = big_df_secondhalf.loc['score'].copy()

In [535]:
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X,y)

RFC_pred = clf.predict(test).astype(int)
RFC_mat = confusion_matrix(y_test.astype(int),RFC_pred)

print RFC_mat

RFC_score = float(RFC_mat[0,0]+RFC_mat[1,1]+RFC_mat[2,2])/np.sum(RFC_mat)
print RFC_score

[[16 18 24]
 [ 9 22 19]
 [11 24 47]]
0.447368421053


In [580]:
y = train.loc['score'].copy()
y_test = big_df_secondhalf.loc['score'].copy()
y[y>-1]= 1
y[y==-1] = 0
y_test[y_test>-1]= 1
y_test[y_test==-1] = 0

In [714]:
clf = RandomForestClassifier(n_estimators=100,class_weight='auto')
clf.fit(X,y)

RFC_pred = clf.predict(test).astype(int)
RFC_mat = confusion_matrix(y_test.astype(int),RFC_pred)

print RFC_mat

RFC_score = float(RFC_mat[0,0]+RFC_mat[1,1])/np.sum(RFC_mat)
print RFC_score

[[  5  53]
 [  3 129]]
0.705263157895
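An overall accuracy of 0.705 hides how differently the two classes are treated. Since the rows of a scikit-learn confusion matrix are the true classes, dividing the diagonal by the row sums gives the share of each class that is predicted correctly (the recall). A small sketch using the matrix printed above; `per_class_recall` is a hypothetical helper.

```python
import numpy as np


def per_class_recall(mat):
    """Diagonal divided by row sums: fraction of each true class predicted correctly."""
    mat = np.asarray(mat, dtype=float)
    return np.diag(mat) / mat.sum(axis=1)


# Values copied from the confusion matrix shown above
# (row 0: home losses, row 1: home wins and draws).
RFC_mat = np.array([[5, 53],
                    [3, 129]])

recall = per_class_recall(RFC_mat)
print(recall)
```

Only 5 of the 58 home losses are recognised, while 129 of the 132 other matches are, so the classifier behaves almost like the constant "home team does not lose" prediction.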


In [716]:
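# range(180)[y==1] fails with
# "TypeError: list indices must be integers, not Series",
# so we index a NumPy array with the boolean mask instead:
np.arange(len(y))[(y == 1).values]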

In [715]:
len(y)

180