# Predicting Soccer Game Scores using Regression

For this data science project, I plan on predicting the scores of soccer games. I will use a regression model to predict the scores, training and testing it only on games that have already been played.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

matchesDF = pd.read_csv('../data/spi_matches.csv')

In [2]:
matchesDF

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,...,,,2.0,0.0,,,,,,
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,...,,,2.0,0.0,,,,,,
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,...,,,1.0,1.0,,,,,,
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.00,52.35,0.4289,0.2699,...,,,0.0,0.0,,,,,,
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,...,,,1.0,2.0,,,,,,
5,2016,2016-07-24,7921,FA Women's Super League,Reading,Birmingham City,50.75,55.03,0.3821,0.3200,...,,,1.0,1.0,,,,,,
6,2016,2016-07-24,7921,FA Women's Super League,Notts County Ladies,Manchester City Women,48.13,60.15,0.3082,0.3888,...,,,1.0,5.0,,,,,,
7,2016,2016-07-31,7921,FA Women's Super League,Reading,Notts County Ladies,50.62,52.63,0.4068,0.3033,...,,,1.0,1.0,,,,,,
8,2016,2016-07-31,7921,FA Women's Super League,Arsenal Women,Liverpool Women,48.32,48.46,0.4350,0.3100,...,,,1.0,2.0,,,,,,
9,2016,2016-08-03,7921,FA Women's Super League,Reading,Manchester City Women,50.41,63.20,0.3061,0.4198,...,,,1.0,2.0,,,,,,


## Feature Engineering

### Idea 1 (Selection):

For this project, I plan on predicting the scores using only the more promising and complete features. As we can see from the data above, there are a lot of NaN values in the importance1 and importance2 features. One fix would be to drop every row that contains a NaN. However, after briefly going through the data set, so many rows are missing importance1 and importance2 that dropping them would lose a lot of data. Therefore, for this project, I will not use the importance features to predict the scores.
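One way to put a number on "a lot of NaN values" is to measure the share of missing entries per column before deciding what to drop. The sketch below is only an illustration: the `toy` frame and its values are made up, and the names `missing_share` and `rows_kept` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the matches table; the values are made up.
toy = pd.DataFrame({
    "spi1": [51.56, 46.61, 59.85, 53.00],
    "importance1": [np.nan, np.nan, 32.4, np.nan],
    "score1": [2.0, 0.0, 1.0, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows that would survive a blanket dropna(), compared with the total.
rows_kept = len(toy.dropna())
print(f"{rows_kept} of {len(toy)} rows have no missing values")
```

Running the same two checks on `matchesDF` would show how much data a row-wise drop would cost compared with dropping the sparse columns.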

For this exercise, I also set aside the projected scores, adjusted scores, shot-based expected goals, and non-shot expected goals, and focus solely on predicting the actual scores from each team's Soccer Power Index (SPI) and the probabilities of a win, loss, or draw. Therefore, I will also drop the columns that are not being used.

In [3]:
# Drop the sparse importance columns and the score estimates we are not using.
unusedColumns = ['importance1', 'importance2',
                 'proj_score1', 'proj_score2',
                 'xg1', 'xg2',
                 'nsxg1', 'nsxg2',
                 'adj_score1', 'adj_score2']
matchesDF = matchesDF.drop(columns=unusedColumns)

matchesDF

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,score1,score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,0.2844,2.0,0.0
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,0.2819,2.0,0.0
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,0.2714,1.0,1.0
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.00,52.35,0.4289,0.2699,0.3013,0.0,0.0
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,0.2719,1.0,2.0
5,2016,2016-07-24,7921,FA Women's Super League,Reading,Birmingham City,50.75,55.03,0.3821,0.3200,0.2979,1.0,1.0
6,2016,2016-07-24,7921,FA Women's Super League,Notts County Ladies,Manchester City Women,48.13,60.15,0.3082,0.3888,0.3030,1.0,5.0
7,2016,2016-07-31,7921,FA Women's Super League,Reading,Notts County Ladies,50.62,52.63,0.4068,0.3033,0.2899,1.0,1.0
8,2016,2016-07-31,7921,FA Women's Super League,Arsenal Women,Liverpool Women,48.32,48.46,0.4350,0.3100,0.2550,1.0,2.0
9,2016,2016-08-03,7921,FA Women's Super League,Reading,Manchester City Women,50.41,63.20,0.3061,0.4198,0.2742,1.0,2.0


### Idea 2 (Extraction):

As mentioned above, I want to train and test the regression model only on games that have been played. To do this, I will split the data set in two: games dated on or before October 5th, 2020, and upcoming games as of today (Oct 5, 2020). I will convert the date column to an int so that I can easily split the table.

As I recall, many games went unplayed because of COVID-19. For example, when COVID-19 started hitting European countries, the French league ended its season early. Therefore, after splitting the data set, I will clean it by removing every row for a game that ended up not being played.

In [4]:
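# Turn 'YYYY-MM-DD' strings into integers such as 20201005 so dates compare numerically.
matchesDF['date'] = matchesDF.date.str.replace('-', '').astype(int)

# where() keeps the frame's shape and fills non-matching rows with NaN,
# so drop the rows that became entirely NaN.
upcomingMatchesDF = matchesDF.where(matchesDF['date'] > 20201005).dropna(how='all')

# Played games: dropna() removes the masked rows and any row with a missing value,
# such as a game without a recorded score.
matchesPlayedDF = matchesDF.where(matchesDF['date'] <= 20201005)
matchesPlayedDF = matchesPlayedDF.dropna()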
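
As an alternative to the integer conversion, the same split can be sketched with real datetimes and boolean indexing. This is not what the notebook ran. The `toy` frame is made up, and the names `cutoff`, `upcoming`, and `played` are illustrative.

```python
import pandas as pd

# Hypothetical rows; the real table has many more columns.
toy = pd.DataFrame({
    "date": ["2016-07-09", "2020-10-04", "2020-10-05", "2020-10-17"],
    "score1": [2.0, None, 1.0, None],
    "score2": [0.0, None, 1.0, None],
})
toy["date"] = pd.to_datetime(toy["date"])
cutoff = pd.Timestamp("2020-10-05")

# Boolean indexing keeps only the matching rows, so no all-NaN rows are left behind.
upcoming = toy[toy["date"] > cutoff]

# Among past dates, keep only games that actually have a recorded score.
played = toy[toy["date"] <= cutoff].dropna(subset=["score1", "score2"])

print(len(upcoming), len(played))
```

Boolean indexing also leaves the integer columns as integers, whereas `where` converts them to floats because it has to store NaN in the masked rows.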

Now our dataset is ready for the regression model!

In [5]:
matchesPlayedDF

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,score1,score2
0,2016.0,20160709.0,7921.0,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,0.2844,2.0,0.0
1,2016.0,20160710.0,7921.0,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,0.2819,2.0,0.0
2,2016.0,20160710.0,7921.0,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,0.2714,1.0,1.0
3,2016.0,20160716.0,7921.0,FA Women's Super League,Liverpool Women,Notts County Ladies,53.00,52.35,0.4289,0.2699,0.3013,0.0,0.0
4,2016.0,20160717.0,7921.0,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,0.2719,1.0,2.0
5,2016.0,20160724.0,7921.0,FA Women's Super League,Reading,Birmingham City,50.75,55.03,0.3821,0.3200,0.2979,1.0,1.0
6,2016.0,20160724.0,7921.0,FA Women's Super League,Notts County Ladies,Manchester City Women,48.13,60.15,0.3082,0.3888,0.3030,1.0,5.0
7,2016.0,20160731.0,7921.0,FA Women's Super League,Reading,Notts County Ladies,50.62,52.63,0.4068,0.3033,0.2899,1.0,1.0
8,2016.0,20160731.0,7921.0,FA Women's Super League,Arsenal Women,Liverpool Women,48.32,48.46,0.4350,0.3100,0.2550,1.0,2.0
9,2016.0,20160803.0,7921.0,FA Women's Super League,Reading,Manchester City Women,50.41,63.20,0.3061,0.4198,0.2742,1.0,2.0


## Building the Regression model

Finally, it is time to build our regression model, using the Soccer Power Index (SPI) and the probabilities of winning, losing, and drawing as features to predict the score. I will use 80% of the data to train my model and 20% of it to test.

The first regression model we will use is the Random Forest Regressor, as it is quick to set up and often performs well without much tuning.

### Data pre-processing 

In [6]:
x = matchesPlayedDF[['spi1','spi2','prob1','prob2','probtie']]
y = matchesPlayedDF[['score1', 'score2']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
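
Because the split above is random, the scores reported below will vary slightly from run to run. A minimal sketch of a repeatable 80/20 split follows. It uses synthetic stand-ins (`X_demo`, `y_demo`) for `x` and `y`, and `random_state=42` is an arbitrary choice, not a value from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and two-column target, standing in for x and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = rng.poisson(1.4, size=(50, 2))

# Any fixed integer makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)
print(np.array_equal(x_te, split_b[1]))
```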

### Training the Random Forest Regressor Model

In [7]:
modelRFR = RandomForestRegressor(n_estimators=100)
modelRFR.fit(x_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

### Random Forest Regressor Prediction and Evaluation

In [8]:
y_predict = modelRFR.predict(x_test)
r2_score(y_test,y_predict)

0.022419227601166214

Oh no! An R² of ~0.02 means the model explains only about 2% of the variance in the scores, so it is barely better than always predicting the average score.

This is kind of a disappointment, as Random Forest Regression models often perform well. Next I will try the Linear Regression model, as it is easy to implement, to see if there is any difference.
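
For reference, `r2_score` returns the coefficient of determination rather than a classification-style accuracy. The sketch below recomputes it by hand on made-up scores to show what the number measures. The arrays are illustrative, and the averaging over the two target columns reflects the function's default `multioutput='uniform_average'` behaviour.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up scores for five games: the columns are score1 and score2.
y_true = np.array([[2, 0], [1, 1], [0, 0], [1, 2], [3, 1]], dtype=float)
y_pred = np.array([[1.5, 0.8], [1.4, 1.0], [1.2, 0.9], [1.3, 1.1], [1.6, 0.9]])

# R^2 per target column: 1 - (residual sum of squares / total sum of squares).
ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
r2_per_target = 1 - ss_res / ss_tot

# With two targets, r2_score averages the per-column values by default.
print(r2_per_target.mean(), r2_score(y_true, y_pred))

# Predicting each column's mean for every game gives an R^2 of 0.
baseline = np.tile(y_true.mean(axis=0), (len(y_true), 1))
print(r2_score(y_true, baseline))
```

A value near 0 therefore means the model does little better than predicting the average score for every game, and negative values are possible when it does worse.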

### Training the Linear Regression Model

In [9]:
modelLR = linear_model.LinearRegression()
modelLR.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Linear Regression Prediction and Evaluation

In [10]:
y_predict = modelLR.predict(x_test)
r2_score(y_test,y_predict)

0.09883386784910503

An R² of ~0.10! Another surprise, as it is significantly better than the Random Forest Regressor model. However, explaining about 10% of the variance is still not very good. Therefore, I will try a neural network, as neural networks are known for strong results on many problems.

### Training the Neural Network Multi-layer Perceptron Regressor Model

In [11]:
modelMLPR = MLPRegressor()
modelMLPR.fit(x_train, y_train)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
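
Multi-layer perceptrons are sensitive to feature scale, and here the SPI columns are far larger in magnitude than the probability columns. The sketch below shows one way to standardize the inputs before the network. It is an assumption about what might help, not something this notebook ran. The data is synthetic, and `scaled_mlp`, `max_iter=500`, and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five features: two SPI-like columns and three probabilities.
rng = np.random.default_rng(0)
spi = rng.uniform(20, 90, size=(200, 2))
probs = rng.dirichlet([2, 2, 2], size=200)
X_demo = np.hstack([spi, probs])
y_demo = rng.poisson(1.4, size=(200, 2)).astype(float)

# The pipeline fits the scaler on the training rows only, then feeds scaled data to the MLP.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=0),
)
scaled_mlp.fit(X_demo[:160], y_demo[:160])

# Predictions have one column per target (score1, score2).
pred = scaled_mlp.predict(X_demo[160:])
print(pred.shape)
```

Wrapping the scaler and the model in one pipeline means the test rows are transformed with statistics learned from the training rows, which avoids leaking test information into the fit.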

### Neural Network Multi-layer Perceptron Regressor Model Prediction and Evaluation

In [12]:
y_predict = modelMLPR.predict(x_test)
r2_score(y_test,y_predict)

0.0914775082434095

An R² of ~0.09. Surprisingly, Linear Regression still has the highest score among the 3.

## Conclusion

As we can see, all our models had an R² of 0.10 or less. The low scores make sense, as there is no way to accurately predict the score of a game using only the teams' Soccer Power Index and the match probabilities: there are too many variables during a 90-minute game. These variables could include the weather, the team's fitness, the team's chemistry, the referee's decisions, and even luck! Therefore, I would say an R² at or below 0.10 for these models is to be expected.