# Major Leagues

Dalton Hahn (2762306)

## Soccer Data
https://github.com/fivethirtyeight/data/tree/master/soccer-spi

## Goal
- Perform feature engineering on the data to enable regression analysis
- Build a regression model that predicts the score of a soccer match

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
import math
from statistics import mean, stdev
import string
from sklearn import preprocessing

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

## Read in the Data

In [2]:
df_matches = pd.read_csv("../data/external/spi_matches.csv")
df_matches.head()

| | date | league_id | league | team1 | team2 | spi1 | spi2 | prob1 | prob2 | probtie | ... | importance1 | importance2 | score1 | score2 | xg1 | xg2 | nsxg1 | nsxg2 | adj_score1 | adj_score2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-08-12 | 1843 | French Ligue 1 | Bastia | Paris Saint-Germain | 51.16 | 85.68 | 0.0463 | 0.838 | 0.1157 | ... | 32.4 | 67.7 | 0.0 | 1.0 | 0.97 | 0.63 | 0.43 | 0.45 | 0.0 | 1.05 |
| 1 | 2016-08-12 | 1843 | French Ligue 1 | AS Monaco | Guingamp | 68.85 | 56.48 | 0.5714 | 0.1669 | 0.2617 | ... | 53.7 | 22.9 | 2.0 | 2.0 | 2.45 | 0.77 | 1.75 | 0.42 | 2.1 | 2.1 |
| 2 | 2016-08-13 | 2411 | Barclays Premier League | Hull City | Leicester City | 53.57 | 66.81 | 0.3459 | 0.3621 | 0.2921 | ... | 38.1 | 22.2 | 2.0 | 1.0 | 0.85 | 2.77 | 0.17 | 1.25 | 2.1 | 1.05 |
| 3 | 2016-08-13 | 2411 | Barclays Premier League | Crystal Palace | West Bromwich Albion | 55.19 | 58.66 | 0.4214 | 0.2939 | 0.2847 | ... | 43.6 | 34.6 | 0.0 | 1.0 | 1.11 | 0.68 | 0.84 | 1.6 | 0.0 | 1.05 |
| 4 | 2016-08-13 | 2411 | Barclays Premier League | Everton | Tottenham Hotspur | 68.02 | 73.25 | 0.391 | 0.3401 | 0.2689 | ... | 31.9 | 48.0 | 1.0 | 1.0 | 0.73 | 1.11 | 0.88 | 1.81 | 1.05 | 1.05 |


In [3]:
print(df_matches.shape)
print(df_matches.columns)

(32106, 22)
Index(['date', 'league_id', 'league', 'team1', 'team2', 'spi1', 'spi2',
       'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2',
       'importance1', 'importance2', 'score1', 'score2', 'xg1', 'xg2', 'nsxg1',
       'nsxg2', 'adj_score1', 'adj_score2'],
      dtype='object')


In [4]:
df_rank = pd.read_csv("../data/external/spi_global_rankings.csv")
df_rank.head()

| | rank | prev_rank | name | league | off | def | spi |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Manchester City | Barclays Premier League | 3.29 | 0.22 | 95.22 |
| 1 | 2 | 2 | Bayern Munich | German Bundesliga | 3.22 | 0.36 | 93.3 |
| 2 | 3 | 3 | Liverpool | Barclays Premier League | 2.92 | 0.27 | 92.78 |
| 3 | 4 | 5 | Paris Saint-Germain | French Ligue 1 | 2.75 | 0.41 | 89.51 |
| 4 | 5 | 4 | Barcelona | Spanish Primera Division | 2.8 | 0.49 | 88.64 |


In [5]:
df_rank.shape

(628, 7)

In [6]:
df_intl = pd.read_csv("../data/external/spi_global_rankings_intl.csv")
df_intl.head()

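| | rank | name | confed | off | def | spi |
|---|---|---|---|---|---|---|
| 0 | 1 | Spain | UEFA | 3.47 | 0.55 | 91.68 |
| 1 | 2 | Brazil | CONMEBOL | 2.91 | 0.36 | 90.74 |
| 2 | 3 | Germany | UEFA | 3.16 | 0.59 | 89.13 |
| 3 | 4 | Belgium | UEFA | 3.0 | 0.55 | 88.44 |
| 4 | 5 | Argentina | CONMEBOL | 2.52 | 0.42 | 86.38 |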


In [7]:
df_intl.shape

(216, 6)

## Considering the Data
### Initial approach
- The overall goal is to predict 'score1' and 'score2' from the spi_matches.csv file
- Features to consider right away: spi, prob, xg, nsxg, adj_score

### Combining other features from other datasets
- Offensive and defensive ratings from the other two .csv files could help in predicting the score
- Depending on its distribution, the league a match is played in could have an effect (as could a difference in league between the two teams)
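
One way to bring in the offensive and defensive ratings is a left merge of the rankings onto each side of a match. The following is a minimal sketch, assuming team names in spi_matches.csv match the `name` column in spi_global_rankings.csv exactly (not verified here). The small frames are stand-ins built from values in the previews above, and the `add_ratings` helper and the `off1`/`def1`/`off2`/`def2` column names are hypothetical.

```python
import pandas as pd

# Stand-in frames; values copied from the head() previews above
matches = pd.DataFrame({
    "team1": ["Bastia", "AS Monaco"],
    "team2": ["Paris Saint-Germain", "Guingamp"],
})
ranks = pd.DataFrame({
    "name": ["Paris Saint-Germain", "Manchester City"],
    "off": [2.75, 3.29],
    "def": [0.41, 0.22],
})

def add_ratings(df, ranks, side):
    """Attach ratings for team<side> as off<side>/def<side> (hypothetical names)."""
    renamed = ranks.rename(columns={
        "name": f"team{side}",
        "off": f"off{side}",
        "def": f"def{side}",
    })
    # A left merge keeps every match; teams missing from the rankings get NaN
    return df.merge(renamed, on=f"team{side}", how="left")

merged = add_ratings(add_ratings(matches, ranks, "1"), ranks, "2")
print(merged)
```

Teams without a ranking entry end up with NaN ratings, so a later `dropna` would remove those matches unless the gaps are filled first.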

In [8]:
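# Quick cleaning of df_matches: drop every row that contains a NaN
df_matches = df_matches.dropna(axis=0)
print(df_matches.shape)

# Encode the string columns as integers so the regressors can use them.
# Note: the encoder is refit for each column, so the team1 and team2
# codes are independent of each other.
le = preprocessing.LabelEncoder()
df_matches["league"] = le.fit_transform(df_matches["league"])
df_matches["team1"] = le.fit_transform(df_matches["team1"])
df_matches["team2"] = le.fit_transform(df_matches["team2"])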

(12347, 22)
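
In the cell above the same `LabelEncoder` is refit for each column, so a given team can receive a different integer in `team1` than in `team2`. The following is a minimal sketch of one alternative, fitting a single encoder on the union of both columns. The `team_encoder` name, the `_id` columns, and the toy frame are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for df_matches
toy = pd.DataFrame({
    "team1": ["Everton", "Hull City", "Arsenal"],
    "team2": ["Arsenal", "Everton", "Hull City"],
})

# Fit once on every team name seen in either column
team_encoder = LabelEncoder()
team_encoder.fit(pd.concat([toy["team1"], toy["team2"]]))

# The same team now maps to the same integer on both sides
toy["team1_id"] = team_encoder.transform(toy["team1"])
toy["team2_id"] = team_encoder.transform(toy["team2"])
print(toy)
```

Integer codes still imply an ordering to a linear model, so this only makes the two columns consistent; it does not make the codes meaningful as magnitudes.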


In [9]:
# Try a basic linear regression
# Inspiration from https://medium.com/@contactsunny/linear-regression-in-python-using-scikit-learn-f0f7b125a204
# Only predicting score1 in this round

x = df_matches.drop(['score1', 'score2', 'date'], axis=1)
y = df_matches['score1']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 2/5)
linearRegressor = LinearRegression()
linearRegressor.fit(xTrain, yTrain)
yPrediction = linearRegressor.predict(xTest)

# Signed errors (prediction minus true value) on the held-out test set
errors = yPrediction - yTest.to_numpy()

print("Average error: ", errors.mean())
print("Largest positive error: ", errors.max())
print("Largest negative error: ", errors.min())

Average error:  0.00018126529161474015
Largest positive error:  0.4870698326741305
Largest negative error:  -1.5701335396991922


### The previous cell passes the raw values from the matches dataframe to a linear regressor, using every feature except the two teams' scores and the date

- Not bad results, especially for soccer, where some matches see only one goal, if any. Note that the average is a signed mean, so positive and negative errors cancel; the largest errors are the more informative numbers here
- Next, repeat the process for the "score2" column, then consider other regressors
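
Because the average reported above is a signed mean, it mostly measures bias rather than accuracy. The following is a minimal sketch of two complementary metrics, mean absolute error (MAE) and root mean squared error (RMSE), computed with NumPy. The arrays are made-up values for illustration, not results from this notebook.

```python
import numpy as np

# Made-up targets and predictions, chosen so the signed errors cancel
y_true = np.array([0.0, 2.0, 1.0, 3.0])
y_pred = np.array([1.0, 1.0, 2.0, 2.0])

errors = y_pred - y_true              # signed errors: [1, -1, 1, -1]
mean_error = errors.mean()            # signed mean, cancels to 0
mae = np.abs(errors).mean()           # mean absolute error
rmse = np.sqrt((errors ** 2).mean())  # root mean squared error

print("Mean error:", mean_error)
print("MAE:", mae)
print("RMSE:", rmse)
```

Every prediction here is off by a full goal, yet the signed mean is zero; MAE and RMSE both report the one-goal miss.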

In [10]:
# Predicting score2

x = df_matches.drop(['score1', 'score2', 'date'], axis=1)
y = df_matches['score2']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 2/5)
linearRegressor = LinearRegression()
linearRegressor.fit(xTrain, yTrain)
yPrediction = linearRegressor.predict(xTest)

# Signed errors (prediction minus true value) on the held-out test set
errors = yPrediction - yTest.to_numpy()

print("Average error: ", errors.mean())
print("Largest positive error: ", errors.max())
print("Largest negative error: ", errors.min())

Average error:  -6.27474071013915e-05
Largest positive error:  0.2844061690547255
Largest negative error:  -1.9798214699781953


## Linear regression worked surprisingly well, but let's see how a Random Forest fares
Inspiration from https://medium.com/datadriveninvestor/random-forest-regression-9871bc9a25eb

In [11]:
# Predicting score1

x = df_matches.drop(['score1', 'score2', 'date'], axis=1)
y = df_matches['score1']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 2/5)
forestRegressor = RandomForestRegressor(n_estimators=1000, max_features=7)
forestRegressor.fit(xTrain, yTrain)
yPrediction = forestRegressor.predict(xTest)

# Signed errors (prediction minus true value) on the held-out test set
errors = yPrediction - yTest.to_numpy()

print("Average error: ", errors.mean())
print("Largest positive error: ", errors.max())
print("Largest negative error: ", errors.min())

Average error:  0.004177161368698105
Largest positive error:  0.5410000000000004
Largest negative error:  -1.8819999999999997


In [12]:
# Predicting score2

x = df_matches.drop(['score1', 'score2', 'date'], axis=1)
y = df_matches['score2']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 2/5)
forestRegressor = RandomForestRegressor(n_estimators=1000, max_features=7)
forestRegressor.fit(xTrain, yTrain)
yPrediction = forestRegressor.predict(xTest)

# Signed errors (prediction minus true value) on the held-out test set
errors = yPrediction - yTest.to_numpy()

print("Average error: ", errors.mean())
print("Largest positive error: ", errors.max())
print("Largest negative error: ", errors.min())

Average error:  0.0014723628264830793
Largest positive error:  0.4109999999999996
Largest negative error:  -2.4560000000000004


# Conclusions
- Overall, this was a very easy dataset to work with. Most of the features were already numeric, so very little preprocessing was needed before feeding them to the models
- Linear regression immediately gave good results for both score1 and score2, likely helped by features such as adj_score1 and adj_score2, which in the preview rows appear to be scaled copies of the scores
- Random Forest gave comparable results, but with larger extreme errors in both the positive and negative directions
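
Each cell above draws a fresh random split, so the linear and Random Forest numbers are not measured on the same test rows. The following is a minimal sketch of a like-for-like comparison that fixes `random_state` and scores both models on one split. The synthetic data and the smaller `n_estimators` are placeholders chosen to keep the example fast; they are not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(scale=0.1, size=300)

# One seeded split shared by both models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the same training rows and score on the same test rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

print(scores)
```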