In [1]:
import pandas as pd

Today we'll be predicting the number of goals scored by teams in soccer matches. Let's get started by loading the datasets into dataframes.

In [153]:
matches = pd.read_csv('../data/spi_matches.csv')
rankings = pd.read_csv('../data/spi_global_rankings.csv')
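# Rows with missing values (e.g. no score yet) are dropped, leaving played matches.
# The explicit copy avoids pandas' SettingWithCopyWarning when columns are added below.
played_matches = matches.dropna().copy()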

One feature we could track is win streaks. There is a common notion in sports that teams can get 'hot' and have a higher chance of winning. We can add columns to the matches dataframe to track the current win streak of each of the two teams. One potential drawback is that every team starts with a win streak of zero, regardless of its form before the dataset begins.

In [155]:
win_streaks = {}
played_matches['win_streak1'] = 0
played_matches['win_streak2'] = 0
for i, row in played_matches.iterrows():
    if row.team1 not in win_streaks:
        win_streaks[row.team1] = 0
    if row.team2 not in win_streaks:
        win_streaks[row.team2] = 0
    # Store the streak each team brings into this match, before its result counts
    played_matches.at[i, 'win_streak1'] = win_streaks[row.team1]
    played_matches.at[i, 'win_streak2'] = win_streaks[row.team2]

    if row.score1 > row.score2:
        win_streaks[row.team1] += 1
        win_streaks[row.team2] = 0
    elif row.score2 > row.score1:
        win_streaks[row.team2] += 1
        win_streaks[row.team1] = 0
    else:
        win_streaks[row.team1] = 0
        win_streaks[row.team2] = 0
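
To check that the streak is recorded before the match result is applied, here is a minimal sketch of the same bookkeeping on a handful of made-up results. The function name `pre_match_streaks` and the teams 'A', 'B', 'C' are illustrative and not part of the dataset.

```python
def pre_match_streaks(results):
    """Return, for each match, the (team1, team2) win streaks going into it."""
    streaks = {}
    recorded = []
    for team1, team2, score1, score2 in results:
        # Record first: the feature must not include the match's own result.
        recorded.append((streaks.get(team1, 0), streaks.get(team2, 0)))
        if score1 > score2:
            streaks[team1] = streaks.get(team1, 0) + 1
            streaks[team2] = 0
        elif score2 > score1:
            streaks[team2] = streaks.get(team2, 0) + 1
            streaks[team1] = 0
        else:
            # A draw ends both streaks.
            streaks[team1] = streaks[team2] = 0
    return recorded

# Hypothetical results in chronological order: (team1, team2, score1, score2)
toy_results = [
    ('A', 'B', 2, 0),   # A wins
    ('A', 'C', 1, 0),   # A wins again
    ('B', 'A', 1, 1),   # draw resets both teams
    ('A', 'C', 3, 1),   # A wins
]
print(pre_match_streaks(toy_results))
# → [(0, 0), (1, 0), (0, 2), (0, 0)]
```

Team A enters its third match with a streak of 2, and the draw in that match sends it back to 0 for the fourth.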







We could also take this idea further by weighting win streaks according to the strength of the teams defeated. For example, Paris Saint-Germain stomping Ligue 1 bottom-feeders every week carries much less weight than winning several games in a row in the Champions League.

In [156]:
weighted_win_streaks = {}
played_matches['win_streak_weighted1'] = 0.0
played_matches['win_streak_weighted2'] = 0.0
for i, row in played_matches.iterrows():
    if row.team1 not in weighted_win_streaks:
        weighted_win_streaks[row.team1] = 0
    if row.team2 not in weighted_win_streaks:
        weighted_win_streaks[row.team2] = 0
    played_matches.at[i, 'win_streak_weighted1'] = weighted_win_streaks[row.team1]
    played_matches.at[i, 'win_streak_weighted2'] = weighted_win_streaks[row.team2]

    if row.score1 > row.score2:
        # A win adds at least 0.5, plus half the SPI gap when the loser was rated higher
        weighted_win_streaks[row.team1] += max(0.5 * (row.spi2 - row.spi1) + 0.5, 0.5)
        weighted_win_streaks[row.team2] = 0
    elif row.score2 > row.score1:
        weighted_win_streaks[row.team2] += max(0.5 * (row.spi1 - row.spi2) + 0.5, 0.5)
        weighted_win_streaks[row.team1] = 0
    else:
        weighted_win_streaks[row.team1] = 0
        weighted_win_streaks[row.team2] = 0
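
To get a feel for the weighting, the increment from the loop above can be pulled out into a small function and evaluated by hand. The helper name `streak_increment` is an illustrative assumption. The SPI values are the ones listed for Bastia (51.16) and Paris Saint-Germain (85.68) in the dataset.

```python
def streak_increment(winner_spi, loser_spi):
    """Amount added to the winner's weighted streak, same formula as the loop above."""
    return max(0.5 * (loser_spi - winner_spi) + 0.5, 0.5)

# Hypothetical upset: the lower-rated team (51.16) beats the higher-rated one (85.68)
upset = streak_increment(winner_spi=51.16, loser_spi=85.68)

# Expected result: the higher-rated team wins, so only the 0.5 floor applies
routine = streak_increment(winner_spi=85.68, loser_spi=51.16)

print(round(upset, 2), routine)
# → 17.76 0.5
```

Because the increment is measured in SPI points, a single upset can outweigh dozens of routine wins. Keep that in mind when interpreting the weighted streak values.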





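Cool, so now we've added both an unweighted and a weighted win streak to the played_matches dataframe. Each value counts only results from earlier rows, so (assuming the rows are in date order) it reflects what was known before the match was played. This could be another useful feature for predicting scores in a given match.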



In [157]:
played_matches

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2,win_streak1,win_streak2,win_streak_weighted1,win_streak_weighted2
10,2016,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.8380,...,0.97,0.63,0.43,0.45,0.00,1.05,0,0,0.000,0.000
11,2016,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,...,2.45,0.77,1.75,0.42,2.10,2.10,0,0,0.000,0.000
12,2016,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,...,0.85,2.77,0.17,1.25,2.10,1.05,0,0,0.000,0.000
13,2016,2016-08-13,2411,Barclays Premier League,Middlesbrough,Stoke City,56.32,60.35,0.4380,0.2692,...,1.40,0.55,1.13,1.06,1.05,1.05,0,0,0.000,0.000
14,2016,2016-08-13,2411,Barclays Premier League,Burnley,Swansea City,58.98,59.74,0.4482,0.2663,...,1.24,1.84,1.71,1.56,0.00,1.05,0,0,0.000,0.000
15,2016,2016-08-13,2411,Barclays Premier League,Southampton,Watford,69.49,59.33,0.5759,0.1874,...,1.05,0.22,1.52,0.41,1.05,1.05,0,0,0.000,0.000
16,2016,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,...,1.11,0.68,0.84,1.60,0.00,1.05,0,0,0.000,0.000
17,2016,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.3910,0.3401,...,0.73,1.11,0.88,1.81,1.05,1.05,0,0,0.000,0.000
18,2016,2016-08-13,1843,French Ligue 1,Bordeaux,St Etienne,62.01,64.92,0.4232,0.2764,...,1.03,1.84,1.10,2.26,3.12,2.10,0,0,0.000,0.000
19,2016,2016-08-13,2411,Barclays Premier League,Manchester City,Sunderland,86.42,53.64,0.8152,0.0525,...,2.14,1.25,1.81,0.92,2.10,1.05,0,0,0.000,0.000


Now that we've engineered some new features, let's investigate how regression can help us predict scores for a given game. First, we'll manipulate the dataset to generate some training data and some testing data. This is not as simple as filtering dataframe columns because each row actually contains 2 samples (the home team's score and the away team's score).

In [118]:


def team_sample(match, own, opp):
    """Build one sample from `match` as seen by the team with suffix `own`."""
    return {
        'spi': getattr(match, 'spi' + own),
        'prob': getattr(match, 'prob' + own),
        'importance': getattr(match, 'importance' + own),
        'spi_o': getattr(match, 'spi' + opp),
        'prob_o': getattr(match, 'prob' + opp),
        'importance_o': getattr(match, 'importance' + opp),
        'wws': getattr(match, 'win_streak_weighted' + own),
        'wws_o': getattr(match, 'win_streak_weighted' + opp),
        'score': getattr(match, 'score' + own),
    }

# The first 80% of matches (in row order) go to training, the rest to testing.
switch_index = int(len(played_matches) * 0.8)
train_rows, test_rows = [], []
for counter, match in enumerate(played_matches.itertuples()):
    rows = train_rows if counter < switch_index else test_rows
    rows.append(team_sample(match, '1', '2'))  # team1's point of view
    rows.append(team_sample(match, '2', '1'))  # team2's point of view

# Build each frame once from the collected rows; DataFrame.append was removed in pandas 2.0.
train = pd.DataFrame(train_rows)
test = pd.DataFrame(test_rows)
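
The same reshaping can also be done without a row-by-row loop. The sketch below is one possible vectorized version, not the notebook's method. The helpers `team_view` and `split_matches`, the `STEMS` mapping, and the toy values are all illustrative assumptions. Only the column names come from the dataset.

```python
import pandas as pd

# Source column stem -> feature name used in the training frame
STEMS = {'spi': 'spi', 'prob': 'prob', 'importance': 'importance',
         'win_streak_weighted': 'wws'}

def team_view(df, own, opp):
    """One row per match, seen from the team whose columns end in `own`."""
    cols = {stem + own: name for stem, name in STEMS.items()}
    cols.update({stem + opp: name + '_o' for stem, name in STEMS.items()})
    cols['score' + own] = 'score'
    return df[list(cols)].rename(columns=cols)

def split_matches(df, train_fraction=0.8):
    """Split by row order first, then emit two samples per match in each part."""
    cut = int(len(df) * train_fraction)
    parts = []
    for chunk in (df.iloc[:cut], df.iloc[cut:]):
        both = pd.concat([team_view(chunk, '1', '2'), team_view(chunk, '2', '1')])
        # A stable sort keeps each match's two rows adjacent, team1 first.
        parts.append(both.sort_index(kind='mergesort').reset_index(drop=True))
    return parts[0], parts[1]

# Tiny made-up frame with the columns the notebook uses
toy = pd.DataFrame({
    'spi1': [50.0, 70.0, 60.0, 80.0, 55.0],
    'spi2': [85.0, 56.0, 66.0, 53.0, 58.0],
    'prob1': [0.1, 0.6, 0.3, 0.8, 0.4],
    'prob2': [0.8, 0.2, 0.4, 0.1, 0.3],
    'importance1': [30.0, 40.0, 20.0, 50.0, 10.0],
    'importance2': [35.0, 25.0, 45.0, 15.0, 5.0],
    'win_streak_weighted1': [0.0, 0.0, 0.5, 0.0, 1.0],
    'win_streak_weighted2': [0.0, 0.0, 0.0, 0.5, 0.0],
    'score1': [0.0, 2.0, 2.0, 1.0, 0.0],
    'score2': [1.0, 2.0, 1.0, 1.0, 1.0],
})
train_demo, test_demo = split_matches(toy)
print(train_demo.shape, test_demo.shape)
```

Splitting before duplicating the rows guarantees that both samples from a match land on the same side of the train/test boundary.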


Now that each game is split into two entries and the dataset is split into training and testing data, we can select the feature columns and the target we want to feed into a regression model.

In [140]:
x_train = train[['spi', 'prob', 'importance', 'spi_o', 'prob_o', 'importance_o', 'wws', 'wws_o']]
y_train = train['score']
x_test = test[['spi', 'prob', 'importance', 'spi_o', 'prob_o', 'importance_o', 'wws', 'wws_o']]
y_test = test['score']

First we will try Gradient Boosting.

In [126]:
from sklearn.ensemble import GradientBoostingRegressor

estimator = GradientBoostingRegressor()
estimator.fit(x_train, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [143]:
predictions = list(estimator.predict(x_test))
diff_sum = 0.0
y_test_list = list(y_test.values)
total =  len(predictions)
for x in range(0, total):
    diff_sum += abs(predictions[x] - y_test_list[x])
print('Average error = ' + str(diff_sum/total))

Average error = 0.9080543566029726
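
The "average error" printed here is the mean absolute error (MAE), the same quantity that `sklearn.metrics.mean_absolute_error` computes. A number like this is easier to judge next to a naive baseline, such as always predicting the average score seen in training. The sketch below shows the calculation on made-up numbers. The values and the name `mae` are purely illustrative and say nothing about how the real baseline compares.

```python
def mae(actual, predicted):
    """Mean absolute error: the average of |prediction - actual|."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up scores, only to show the calculation
train_scores = [0, 1, 1, 2, 3, 1, 0, 2]
test_scores = [1, 0, 2, 1]
model_predictions = [1.2, 0.9, 1.6, 1.1]   # stand-in for estimator.predict(x_test)

# Baseline: always predict the average score seen in training
baseline_value = sum(train_scores) / len(train_scores)
baseline_predictions = [baseline_value] * len(test_scores)

print('model MAE    =', mae(test_scores, model_predictions))
print('baseline MAE =', mae(test_scores, baseline_predictions))
```

On the real data, the equivalent baseline would use `y_train.mean()` as the constant prediction for every row of `y_test`.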


Not a bad score, but maybe we can do better. To test whether our win streak idea is useful, let's compare this to the error we get without the win streak features.


In [147]:
x_train = train[['spi', 'prob', 'importance', 'spi_o', 'prob_o', 'importance_o']]
y_train = train['score']
x_test = test[['spi', 'prob', 'importance', 'spi_o', 'prob_o', 'importance_o']]
y_test = test['score']
estimator = GradientBoostingRegressor()
estimator.fit(x_train, y_train)
predictions = list(estimator.predict(x_test))
diff_sum = 0.0
y_test_list = list(y_test.values)
total =  len(predictions)
for x in range(0, total):
    diff_sum += abs(predictions[x] - y_test_list[x])
print('Average error (without win streaks)= ' + str(diff_sum/total))

Average error (without win streaks)= 0.9080660192258367


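Turns out it was not very important at all: dropping the win streak features changes the average error by only about 0.00001 goals (0.90807 vs. 0.90805), which is far too small to read as an improvement either way. Let's see if we get better performance with a different regression technique.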

In [152]:
from sklearn.ensemble import RandomForestRegressor

x_train = train[['spi', 'prob', 'importance', 'spi_o', 'prob_o', 'importance_o', 'wws', 'wws_o']]
y_train = train['score']
x_test = test[['spi', 'prob', 'importance', 'spi_o', 'prob_o', 'importance_o', 'wws', 'wws_o']]
y_test = test['score']
rand_forest = RandomForestRegressor()
rand_forest.fit(x_train, y_train)
predictions = list(rand_forest.predict(x_test))
diff_sum = 0.0
y_test_list = list(y_test.values)
total =  len(predictions)
for x in range(0, total):
    diff_sum += abs(predictions[x] - y_test_list[x])
print('Average error= ' + str(diff_sum/total))

Average error= 0.9461978221415618


Hmmm, the Random Forest Regressor has not improved our performance; its average error of 0.946 is worse than the 0.908 we got from gradient boosting. I suppose this could be about as good as it gets, as predicting a team's score for a given game with an average error of less than 1 goal is likely better than most humans would fare.