## SportsMonks Predictive Analysis

Using the data gathered through the [Sports Monks API](https://www.sportmonks.com/), we can begin to analyze the match and player data and try to predict match outcomes. We can do this with the [SciKit-Learn library](http://scikit-learn.org/stable/), which provides a collection of machine learning models that can be tuned to the specific problem.

Specifically, we're going to be using the [Supervised Neural Networks Classifier](http://scikit-learn.org/stable/modules/neural_networks_supervised.html) in the SciKit-Learn library.

### Setup
First, we need to set up access to the local SportsMonks database, as well as importing [Pandas](http://pandas.pydata.org/pandas-docs/version/0.23/) and [Numpy](http://www.numpy.org/).

In [1]:
import warnings
import sys
import pandas as pd
import numpy as np

from sqlalchemy import create_engine

sys.path.insert(0, './internal')
import databaseinfo

In [2]:
engine = create_engine('mysql+mysqlconnector://{}:{}@{}/{}'.format(
    databaseinfo.db_user(),
    databaseinfo.db_passwd(),
    databaseinfo.db_host(),
    databaseinfo.db_name()))

Then, we need to retrieve and merge the Club Season data and the Starting Lineup data for each match, for both the home and away teams. For ease of access, the starting lineup statistics for each team have been precomputed and inserted into a table, LineupStats. After this is done, we can use this data to train the Neural Network and determine the optimal number of hidden layers and nodes that gives us the most accurate match prediction.

In [3]:
def toSqlRename(tableName, attribute, prefix):
    return "%s.%s as %s%s" % (tableName, attribute, prefix, attribute)

In [4]:
player_attributes = [
    "minutes_played",
    "appearances",
    "goals",
    "goals_conceded",
    "assists",
    "shots_on_goal",
    "shots_total",
    "fouls_committed",
    "fouls_drawn",
    "interceptions",
    "saves",
    "clearances",
    "tackles",
    "offsides",
    "blocks",
    "pen_saved",
    "pen_missed",
    "pen_scored",
    "passes_total",
    "crosses_total"
]

def player_rename(tableName, attribute, prefix):
    return "%s.%s as %s%s" % (tableName, attribute, prefix, attribute)

def home_player_rename(attribute):
    return player_rename("ls", attribute, "home_")

def away_player_rename(attribute):
    return player_rename("ls", attribute, "away_")

In [5]:
def homeRename(attribute):
    return toSqlRename("css", attribute, "home_")

def awayRename(attribute):
    return toSqlRename("css", attribute, "away_")

attributes = [
    "win_total",
    "draw_total",
    "lost_total",
    "goals_for_total",
    "goals_against_total",
    "clean_sheet_total",
    "failed_to_score_total"
]

home_string = ", ".join(list(map(homeRename, attributes)))
home_players_string = ", ".join(list(map(home_player_rename, player_attributes)))
away_string = ", ".join(list(map(awayRename, attributes)))
away_players_string = ", ".join(list(map(away_player_rename, player_attributes)))

In [6]:
club_attribute_query = "SELECT home.*, %s, %s \
FROM (  SELECT f.*, %s, %s \
        FROM Fixture f, ClubSeasonStats css, LineupStats ls \
        WHERE f.season_id=css.season_id \
            AND f.home_team_id=css.club_id \
            AND f.id=ls.fixture_id \
            AND f.home_team_id=ls.club_id \
     ) home, \
    ClubSeasonStats css, \
    LineupStats ls \
WHERE home.season_id=css.season_id \
    AND home.away_team_id=css.club_id  \
    AND home.id=ls.fixture_id \
    AND home.away_team_id=ls.club_id" % (away_string, away_players_string, home_string, home_players_string)

In [7]:
resoverall = engine.execute(club_attribute_query)
df = pd.DataFrame(resoverall.fetchall())
df.columns = resoverall.keys()

### Predicting Match Outcome
Now that we have the data, we need to do something with it. Initially, let's just look at the overall result of a game, giving the three outcomes a corresponding label:
* Home team wins: '0'
* Teams draw: '1'
* Away team wins: '2'

In [8]:
def getResult(scores):
    home_score = scores[0]
    away_score = scores[1]
    
    if home_score > away_score:
        return '0'
    elif home_score == away_score:
        return '1'
    else:
        return '2'

In [9]:
scores = df.loc[:, ['home_team_score', 'away_team_score']]
df['Result'] = scores.apply(getResult, axis=1)

In [10]:
df.head()

Unnamed: 0,id,season_id,venue_id,home_team_id,away_team_id,date_of_game,home_team_score,away_team_score,home_win_total,home_draw_total,...,away_clearances,away_tackles,away_offsides,away_blocks,away_pen_saved,away_pen_missed,away_pen_scored,away_passes_total,away_crosses_total,Result
0,2188,13,199,22,42,2016-08-13,2,1,9,7,...,69,358,38,69,0,2,3,8057,403,0
1,2197,13,200,27,30,2016-08-13,0,1,11,7,...,76,333,28,76,1,0,0,8942,277,2
2,2208,13,201,51,10,2016-08-13,0,1,12,5,...,110,312,34,110,0,0,0,7248,288,2
3,2216,13,202,13,6,2016-08-13,1,1,17,10,...,78,461,37,78,0,2,6,14887,526,1
4,2225,13,203,7,26,2016-08-13,1,1,5,13,...,83,310,46,83,0,2,2,6376,443,1


#### Determining Optimal Hidden Layers

After assigning these labels, we extract the input and output values and split them into separate arrays.

In [11]:
X = np.array([list(x) for x in df.loc[:, 'home_win_total':'away_crosses_total'].values])
Y = np.array(df['Result'].values)

We can then use SciKit-Learn's Pipeline and cross-validation-score imports in order to test a series of models with different hidden layer setups. Specifically, we can test the number of hidden layers and the number of nodes in each hidden layer.

In order for the tests to be reproducable, I went with a consistent random state seed of 1 for all of my MLPClassifier models.

Using the formula in Section 4.2 of a [Neural Networks paper](https://tinyurl.com/ybhoz5ea), I was able to estimate the number of optimal hidden layers and perform a smaller analysis. I also used the guidance on [this StackExchange post](https://tinyurl.com/ydhcc39y) to limit the number of hidden layers to either a single layer or two layers. Three or more hidden layers require an extremely large dataset to draw from, as well as computing power that isn't available to me.

*Note: These tests take a long time to run. The analysis I ran is summarized after this section, and the code is left here for re-use, if necessary.*

In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

In [13]:
def get_cv_score(X, Y, n1, n2):
    if n2 > 0:
        hidden_layers = (n1, n2)
    else:
        hidden_layers = (n1)
    
    scaler = StandardScaler()
    model = MLPClassifier(max_iter=500,
                          hidden_layer_sizes=hidden_layers,
                          random_state=1)
    pipeline = Pipeline([('transform', scaler), ('fit', model)])
    return cross_val_score(pipeline, X, Y, cv=10, scoring='accuracy').mean()

In [49]:
def test_hidden_layers(X, Y):
    cv_errors = []

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        for n1 in range(1, 15, 1):
            nums = ((n1, 0))
            cv_errors.append((nums, get_cv_score(X, Y, n1, 0)))

        for n1 in range(8, 15, 1):
            for n2 in range(8, 15, 1):
                nums = ((n1, n2))
                cv_errors.append((nums, get_cv_score(X, Y, n1, n2)))
    
    # return the five best CV errors
    cv_errors.sort(key=lambda x: x[1], reverse=True)
    return cv_errors[0:5]

In [None]:
output = test_hidden_layers(X, Y)

In [32]:
output

[((9, 8), 0.5884684516279699),
 ((9, 9), 0.5872224727273606),
 ((9, 11), 0.586651044155932),
 ((2, 0), 0.5862703579294973),
 ((5, 0), 0.5858923984923434)]

So the five best hidden layer setups are as follows:
* (9,8) = **0.5884684516279699**
* (9,9) = **0.5872224727273606**
* (9,11) = **0.586651044155932**
* (2,0) = **0.5862703579294973**
* (5,0) = **0.5858923984923434**

Most of the other combinations landed between 0.53 and 0.56, with a gradual increase until the above values, followed by a small decrease in cross-validation score, and finally a plateau (around 0.567).

#### Testing Our Predictions

Awesome, now we have the "optimal" hidden layer values for the given dataset. Now, we can split the data into test and training subsets and explicitly test how well the model performs, instead of using the above CV score. We can then fit the training dataset to the optimal model and predict the match results of the test dataset. Then, using some more tools from SciKit-Learn, we can view the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) of the predictions, as well as the precision, recall, and [F1-score](https://en.wikipedia.org/wiki/F1_score) of the predictions.

I used a random 80% of the training data and used the remaining 20% as the test data.

In [35]:
import random
import math

train_range = range(0, len(X), 1)
train_idxs = random.sample(train_range, int(math.floor(len(train_range) * 0.8)))
test_idxs = [x for x in train_range if x not in train_idxs]

X_train = X[train_idxs]
Y_train = Y[train_idxs]

X_test = X[test_idxs]
Y_test = Y[test_idxs]

In [39]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
            
    scaler = StandardScaler()
    scaler.fit(X_train)

    model = MLPClassifier(max_iter=500, hidden_layer_sizes=(9,8), random_state=1)

    X_train_std = scaler.transform(X_train)
    model.fit(X_train_std, Y_train)

    X_test_std = scaler.transform(X_test)
    predictions = model.predict(X_test_std)

In [40]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(Y_test, predictions, labels=['0', '1', '2']))

[[731  87 135]
 [219 171 144]
 [170  95 350]]


In [41]:
from sklearn.metrics import classification_report

print(classification_report(Y_test,predictions))

             precision    recall  f1-score   support

          0       0.65      0.77      0.71       953
          1       0.48      0.32      0.39       534
          2       0.56      0.57      0.56       615

avg / total       0.58      0.60      0.58      2102



To interpret these:
* Correctly predicted (top-left to bottom-right diagonal):
  * 731 Home wins
  * 171 Draws
  * 350 Away wins.
  * 1252 correct out of 2102 = **59.6% correct**
* Incorrectly predicted:
  * Predicted Home wins that were wrong:
    * 219 Draws.
    * 170 Away wins.
  * Predicted Draws that were wrong:
    * 87 Home wins.
    * 95 Away wins.
  * Predicted Away wins that were wrong:
    * 135 Home wins.
    * 144 Draws.

We can see that the Neural Network heavily leans towards Home wins, which backs up the theory that [Home Field Advantage in soccer is huge](http://freakonomics.com/2011/12/18/football-freakonomics-how-advantageous-is-home-field-advantage-and-why/). We can see the results of this in the recall score: Home wins have a 77% chance of being correctly recalled, while draws only have a *32%* chance of being recalled. This huge predictive imbalance between home wins and draws is split almost directly in the middle by correctly predicted away wins.

### Predicting Match Scores
Using the same data and MLPClassifier class as above, we can try to predict the amount of goals that each team (home and away) will score. In order to classify these correctly, we have to create labels for expected outcomes.

In [42]:
def getGoalsLabel(goals):
    return str(goals) if goals < 5 else '5+'

Seen in the above class, any amount of goals greater than or equal to 5 is treated as the same label. There is an extremely low chance that the model will ever predict that both teams will score more than 5 goals, and, if this ever happens, the teams probably deserve to draw.

In [43]:
home_scores = df['home_team_score']
df['Home_Result'] = home_scores.apply(getGoalsLabel)

away_scores = df['home_team_score']
df['Away_Result'] = away_scores.apply(getGoalsLabel)

We can then perform the same CV score analysis as we did with the Match Outcome for the number of home and away goals that will be scored in a match.

In [44]:
X = np.array([list(x) for x in df.loc[:, 'home_win_total':'away_crosses_total'].values])

In [45]:
# Use the Home Result as the outcome variable
Y = np.array(df['Home_Result'].values)

In [46]:
test_hidden_layers(X, Y)

[((1, 0), 0.35400013767485816), ((2, 0), 0.35579934234273247), ((3, 0), 0.3672453997308855), ((4, 0), 0.36448537297120776), ((5, 0), 0.36848013001283053), ((6, 0), 0.37390402301343884), ((7, 0), 0.36248844462741153), ((8, 0), 0.3541169694205152), ((9, 0), 0.3652702391630903), ((10, 0), 0.3613551912101625), ((11, 0), 0.36764573310464976), ((12, 0), 0.3647847763842161), ((13, 0), 0.35526600494135624), ((14, 0), 0.367366626272157), ((8, 8), 0.3660436559362655), ((8, 9), 0.3622107072483708), ((8, 10), 0.36639921044102), ((8, 11), 0.36820748446808843), ((8, 12), 0.3606860666665036), ((8, 13), 0.36078448932526075), ((8, 14), 0.369630803884229), ((9, 8), 0.3632733890029861), ((9, 9), 0.3648727764978046), ((9, 10), 0.354786195666646), ((9, 11), 0.36278941992152747), ((9, 12), 0.35784559960061874), ((9, 13), 0.3588748126681717), ((9, 14), 0.3603126581719702), ((10, 8), 0.3642264217299148), ((10, 9), 0.36695976633493943), ((10, 10), 0.35736114981064354), ((10, 11), 0.36886579774443484), ((10, 12

[((6, 0), 0.37390402301343884),
 ((11, 12), 0.36983804100132295),
 ((8, 14), 0.369630803884229),
 ((10, 11), 0.36886579774443484),
 ((5, 0), 0.36848013001283053)]

In [47]:
# Do the same for the away score
Y = np.array(df['Away_Result'].values)

In [48]:
test_hidden_layers(X, Y)

[((1, 0), 0.35400013767485816), ((2, 0), 0.35579934234273247), ((3, 0), 0.3672453997308855), ((4, 0), 0.36448537297120776), ((5, 0), 0.36848013001283053), ((6, 0), 0.37390402301343884), ((7, 0), 0.36248844462741153), ((8, 0), 0.3541169694205152), ((9, 0), 0.3652702391630903), ((10, 0), 0.3613551912101625), ((11, 0), 0.36764573310464976), ((12, 0), 0.3647847763842161), ((13, 0), 0.35526600494135624), ((14, 0), 0.367366626272157), ((8, 8), 0.3660436559362655), ((8, 9), 0.3622107072483708), ((8, 10), 0.36639921044102), ((8, 11), 0.36820748446808843), ((8, 12), 0.3606860666665036), ((8, 13), 0.36078448932526075), ((8, 14), 0.369630803884229), ((9, 8), 0.3632733890029861), ((9, 9), 0.3648727764978046), ((9, 10), 0.354786195666646), ((9, 11), 0.36278941992152747), ((9, 12), 0.35784559960061874), ((9, 13), 0.3588748126681717), ((9, 14), 0.3603126581719702), ((10, 8), 0.3642264217299148), ((10, 9), 0.36695976633493943), ((10, 10), 0.35736114981064354), ((10, 11), 0.36886579774443484), ((10, 12

[((6, 0), 0.37390402301343884),
 ((11, 12), 0.36983804100132295),
 ((8, 14), 0.369630803884229),
 ((10, 11), 0.36886579774443484),
 ((5, 0), 0.36848013001283053)]

As we can see, the ideal hidden layer setup is (6,0) for both home and away score, leading to a CV score around 0.374 for both home and away.

We can then use the previously determined test and train indices and split the data into 20% testing, 80% training data.

In [None]:
Y_home = np.array(df['Home_Result'].values)
Y_away = np.array(df['Away_Result'].values)

In [None]:
X_train = X[train_idxs]
Y_train_home = Y_home[train_idxs]
Y_train_away = Y_away[train_idxs]

X_test = X[test_idxs]
Y_test_home = Y_home[test_idxs]
Y_test_away = Y_away[test_idxs]

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")scaler = StandardScaler()

    scaler.fit(X_train)

    home_team_model = MLPClassifier(max_iter=500, hidden_layer_sizes=(7, 8))
    away_team_model = MLPClassifier(max_iter=500, hidden_layer_sizes=(7, 8))

    X_train_std = scaler.transform(X_train)

    home_team_model.fit(X_train_std, Y_train_home)
    away_team_model.fit(X_train_std, Y_train_away)
    
    X_test_std = scaler.transform(X_test)
    home_predictions = home_team_model.predict_probas(X_test_std)
    away_predictions = away_team_model.predict_probas(X_test_std)

Let's see how well this lines up with the actual outcome. We'll take the predictions from both the home and away models and figure out the most likely score, then compare it to the actual score.

In [None]:
def getMatchPrediction(home_probas, away_probas):
    probability = [[], [], [], [], [], []]

    for home_idx, home_score in enumerate(home_probas):
        for away_idx, away_score in enumerate(away_probas):
            probability[home_idx].append(home_score * away_score)

    matrix = np.asmatrix(probability)
    result = list(np.unravel_index(np.argmax(matrix, axis=None), matrix.shape))
    result.append(np.max(matrix, axis=None) * 100)
    return result

In [None]:
predictions = []

for home_probas in enumerate(home_predictions):
    for away_probas in enumerate(away_predictions):
        predictions.append(getMatchPrediction(home_probas, away_probas))

predictions

In [None]:
# test_df['Score_Result'] = test_df.loc[:, ['Score_Probabilities']].apply(getMost)
print(confusion_matrix(Y_test_result, test_df['Most_Probable_Outcome'].apply(str), labels=['0', '1', '2']))
print(classification_report(Y_test_result,test_df['Most_Probable_Outcome'].apply(str)))

For ease of use on the website, all of the training data should be easily accessible. Below, I extract the tags for the outcome, home result, and away result and store it in a new table, NeuralNetworkTraining, so that all of the training data can be retrieved with a simple `SELECT * FROM` statement.

In [None]:
df.head()
# df.to_sql('NeuralNetworkTraining', con=engine, index=False, if_exists='append')
# df = pd.read_sql("SELECT * FROM neuralnetworktraining", con=engine)

Internal Documentation
 - How does it work?
 - Log file
 - What's needed to continue the work?
 - How to run the system
 
External Documentation
 - How to use the system
 - Setup, required files, etc.