# Major Leagues

A project for EECS 731 by Benjamin Wyss

Examining the FiveThirtyEight Soccer Power Index (SPI) data sets to build a regression model that predicts the scores of soccer matches.

###### Python Imports

In [14]:
import numpy as np
import pandas as pd
import sklearn as skl
import matplotlib.pyplot as plt
plt.close('all')
import warnings
warnings.filterwarnings('ignore')

### Reading Data Sets From CSV

FiveThirtyEight Soccer SPI Ratings and Matches

Taken from: https://github.com/fivethirtyeight/data/tree/master/soccer-spi on 10/1/20

Only the spi_matches data set is examined, because it contains all of the match information and historical SPI data needed to build the target regression model. The spi_matches_latest data set is a subset of spi_matches, so it can be discarded without losing any match samples. The remaining data sets hold current SPI ratings, but the model should predict match scores better from the ratings in effect when a match occurred than from current ratings, so these data sets are discarded as well.

###### Soccer Matches and SPI Ratings Data Set

In [15]:
df = pd.read_csv('../data/raw/spi_matches.csv')

In [16]:
df

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,...,,,2.0,0.0,,,,,,
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,...,,,2.0,0.0,,,,,,
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,...,,,1.0,1.0,,,,,,
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.00,52.35,0.4289,0.2699,...,,,0.0,0.0,,,,,,
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,...,,,1.0,2.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42168,2020,2021-05-30,1871,Spanish Segunda Division,Mirandes,CD Sabadell,32.09,30.30,0.4473,0.2862,...,,,,,,,,,,
42169,2020,2021-05-30,1871,Spanish Segunda Division,AD Alcorcon,Espanyol,33.41,62.03,0.1738,0.5834,...,,,,,,,,,,
42170,2020,2021-05-30,1871,Spanish Segunda Division,Málaga,Castellon,35.64,31.26,0.4759,0.2216,...,,,,,,,,,,
42171,2020,2021-05-30,1871,Spanish Segunda Division,FC Cartagena,Girona FC,29.81,39.15,0.3196,0.3964,...,,,,,,,,,,


## The Big Ideas

Feature engineering and transformation can add value to this data set for building a regression model in the following ways:

(1): Selecting only the attributes most strongly correlated with match scores as input features lets the regression model concentrate on the signals that matter, which should improve its overall accuracy.

(2): One-hot encoding league and team names gives the regression model information about how many goals specific teams score in specific leagues, which should aid its prediction accuracy.
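
Idea (2) can be sketched on a toy frame. This is only an illustration, not the notebook's actual transformation step: the `sample` frame is a hypothetical stand-in built from the first three rows shown earlier, and `pd.get_dummies` is one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical stand-in: a few columns from the first three rows of spi_matches.
# The real notebook would encode the full `df` instead.
sample = pd.DataFrame({
    "league": ["FA Women's Super League"] * 3,
    "team1": ["Liverpool Women", "Arsenal Women", "Chelsea FC Women"],
    "team2": ["Reading", "Notts County Ladies", "Birmingham City"],
    "spi1": [51.56, 46.61, 59.85],
    "spi2": [50.42, 54.03, 54.64],
})

# One indicator column per distinct league or team name.
# Numeric columns (spi1, spi2) pass through unchanged.
encoded = pd.get_dummies(sample, columns=["league", "team1", "team2"])
print(encoded.columns.tolist())
```

Each new column is named `<original column>_<value>`, for example `team1_Arsenal Women`. On the full data set this produces one column per team per side, so the encoded frame becomes much wider than the original.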

## Exploratory Data Analysis

### Cleaning the data sets

In [17]:
df

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,...,,,2.0,0.0,,,,,,
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,...,,,2.0,0.0,,,,,,
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,...,,,1.0,1.0,,,,,,
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.00,52.35,0.4289,0.2699,...,,,0.0,0.0,,,,,,
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,...,,,1.0,2.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42168,2020,2021-05-30,1871,Spanish Segunda Division,Mirandes,CD Sabadell,32.09,30.30,0.4473,0.2862,...,,,,,,,,,,
42169,2020,2021-05-30,1871,Spanish Segunda Division,AD Alcorcon,Espanyol,33.41,62.03,0.1738,0.5834,...,,,,,,,,,,
42170,2020,2021-05-30,1871,Spanish Segunda Division,Málaga,Castellon,35.64,31.26,0.4759,0.2216,...,,,,,,,,,,
42171,2020,2021-05-30,1871,Spanish Segunda Division,FC Cartagena,Girona FC,29.81,39.15,0.3196,0.3964,...,,,,,,,,,,


### Transforming the data sets


### Visualizing the Data

First, basic summary statistics and correlation coefficients are calculated for the data set.

In [18]:
df.describe()

Unnamed: 0,season,league_id,spi1,spi2,prob1,prob2,probtie,proj_score1,proj_score2,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
count,42173.0,42173.0,42173.0,42173.0,42173.0,42173.0,42173.0,42173.0,42173.0,29572.0,29572.0,34268.0,34268.0,18273.0,18273.0,18273.0,18273.0,18273.0,18273.0
mean,2018.411543,2181.590117,44.75693,44.720523,0.44657,0.300422,0.253008,1.516175,1.172247,31.306398,30.60534,1.523112,1.181715,1.500572,1.170989,1.407696,1.134884,1.54047,1.190836
std,1.174531,901.753292,18.958686,18.97344,0.157805,0.142912,0.047112,0.425189,0.418712,26.21861,25.8592,1.28257,1.14243,0.829435,0.73503,0.65293,0.570753,1.245991,1.128228
min,2016.0,1818.0,3.88,4.04,0.0271,0.0032,0.0,0.25,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2018.0,1849.0,30.94,30.93,0.3468,0.2074,0.2353,1.24,0.91,10.7,10.2,1.0,0.0,0.88,0.62,0.95,0.73,1.05,0.0
50%,2018.0,1874.0,42.67,42.57,0.4374,0.285,0.2611,1.45,1.12,26.0,25.2,1.0,1.0,1.37,1.04,1.32,1.05,1.05,1.05
75%,2019.0,2160.0,58.12,58.07,0.5351,0.375,0.2815,1.71,1.38,45.5,44.6,2.0,2.0,1.97,1.56,1.75,1.43,2.1,2.1
max,2020.0,9541.0,96.57,96.78,0.9775,0.8992,0.4537,4.9,4.01,100.0,100.0,11.0,9.0,7.07,6.2,6.89,5.92,9.15,7.93
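
The count row shows that score1 and score2 are present for 34268 of the 42173 matches; the tail of the data set consists of fixtures that had not been played yet. Rows without a score cannot serve as regression targets. A minimal sketch of one way to drop them follows; the `sample` frame is a hypothetical stand-in, and this is not necessarily the cleaning step the notebook goes on to use.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in in the shape of spi_matches:
# two played matches and one fixture without a result.
sample = pd.DataFrame({
    "team1": ["Liverpool Women", "Arsenal Women", "Mirandes"],
    "team2": ["Reading", "Notts County Ladies", "CD Sabadell"],
    "spi1": [51.56, 46.61, 32.09],
    "spi2": [50.42, 54.03, 30.30],
    "score1": [2.0, 2.0, np.nan],
    "score2": [0.0, 0.0, np.nan],
})

# Keep only matches whose final score is known, then renumber the rows.
played = sample.dropna(subset=["score1", "score2"]).reset_index(drop=True)
print(len(sample), "->", len(played))
```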


In [19]:
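# Restrict to numeric columns: recent pandas versions raise an error if corr() meets text columns
df.select_dtypes(include='number').corr()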

Unnamed: 0,season,league_id,spi1,spi2,prob1,prob2,probtie,proj_score1,proj_score2,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
season,1.0,-0.00309,-0.173172,-0.174024,-0.045921,0.059279,-0.026003,-0.019484,0.071384,0.005693,0.001886,-0.011913,0.01525,0.017753,0.052834,-0.015834,0.004984,-0.015833,0.014908
league_id,-0.00309,1.0,-0.01104,-0.0022,-0.035165,0.028932,0.030021,-0.055666,0.002632,-0.113949,-0.113123,-0.027612,-0.008804,-0.042356,-0.052965,-0.031415,-0.042875,-0.059734,-0.043892
spi1,-0.173172,-0.01104,1.0,0.725621,0.413552,-0.338577,-0.358158,0.41134,-0.285175,0.237619,0.10454,0.145003,-0.077101,0.222491,-0.073344,0.257511,-0.123095,0.153642,-0.070863
spi2,-0.174024,-0.0022,0.725621,1.0,-0.296836,0.363344,-0.107905,-0.211839,0.348759,0.098546,0.232628,-0.080582,0.128151,-0.086743,0.215758,-0.10883,0.22948,-0.089843,0.146453
prob1,-0.045921,-0.035165,0.413552,-0.296836,1.0,-0.955708,-0.450476,0.89647,-0.845741,0.268296,-0.211493,0.310997,-0.259674,0.411614,-0.352685,0.484448,-0.43804,0.320732,-0.264582
prob2,0.059279,0.028932,-0.338577,0.363344,-0.955708,1.0,0.167763,-0.776506,0.932239,-0.217204,0.260338,-0.271946,0.290257,-0.363794,0.396134,-0.430814,0.485393,-0.283134,0.294766
probtie,-0.026003,0.030021,-0.358158,-0.107905,-0.450476,0.167763,1.0,-0.647291,0.004971,-0.236662,-0.077574,-0.214805,-0.008764,-0.283652,-0.0068,-0.326007,0.011108,-0.222011,0.002025
proj_score1,-0.019484,-0.055666,0.41134,-0.211839,0.89647,-0.776506,-0.647291,1.0,-0.556369,0.250422,-0.168499,0.329584,-0.176136,0.42964,-0.245749,0.509225,-0.313489,0.339363,-0.184422
proj_score2,0.071384,0.002632,-0.285175,0.348759,-0.845741,0.932239,0.004971,-0.556369,1.0,-0.19706,0.232949,-0.207023,0.299402,-0.292918,0.411639,-0.35397,0.501245,-0.225094,0.304332
importance1,0.005693,-0.113949,0.237619,0.098546,0.268296,-0.217204,-0.236662,0.250422,-0.19706,1.0,0.329041,0.067201,-0.06345,0.114217,-0.054501,0.137417,-0.074613,0.090451,-0.050411
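
Picking score-correlated features can be sketched as ranking attributes by the absolute value of their correlation with the target. The values below are copied from the score1 column of the rows shown above. The 0.2 cutoff is a hypothetical choice made for illustration, not a value taken from the notebook.

```python
import pandas as pd

# Correlation of each attribute with score1, copied from the table above.
# On the full frame this would be the "score1" column of the correlation matrix.
corr_with_score1 = pd.Series({
    "season": -0.011913,
    "league_id": -0.027612,
    "spi1": 0.145003,
    "spi2": -0.080582,
    "prob1": 0.310997,
    "prob2": -0.271946,
    "probtie": -0.214805,
    "proj_score1": 0.329584,
    "proj_score2": -0.207023,
    "importance1": 0.067201,
})

THRESHOLD = 0.2  # hypothetical cutoff, chosen only for this sketch

# Rank by strength of correlation, ignoring sign, and keep the strongest.
ranked = corr_with_score1.abs().sort_values(ascending=False)
selected = ranked[ranked >= THRESHOLD].index.tolist()
print(selected)
```

Using the absolute value matters here: prob2 and probtie are negatively correlated with score1, but they are still informative.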


###### Generating Interesting Plots

###### Intuitions Gained From Visualizations

### Model Construction

A variety of machine learning models will be tested and compared.

In [20]:
array = df.values
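
One way such a comparison could look is cross-validating several scikit-learn regressors on the same features and comparing their mean absolute error. Everything in this sketch is an assumption: the three models, the five folds, the error metric, and above all the data, which is synthetic and only mimics the range of the proj_score columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, NOT the real matches:
# two features in the range of proj_score1 / proj_score2,
# and a Poisson-distributed goal count as the target.
rng = np.random.default_rng(0)
X = rng.uniform(0.25, 4.0, size=(200, 2))
y = rng.poisson(X[:, 0])

# Hypothetical candidate models for the comparison.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn reports negated errors so that higher is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()
    print(f"{name}: MAE = {results[name]:.3f}")
```

With the real data, `X` and `y` would come from the cleaned and encoded frame, with score1 and score2 each serving as a target.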

### Analysis and Testing of Regression Models

### Results