# Expected Goals Analysis

## Description

Expected Goals (xG) is the most popular advanced metric for association football (soccer) analysis. In its most basic form, it measures the probability that a particular shot results in a goal. Variables taken into account include the distance and angle of the shot, as well as the part of the body used to take it. Non-shot xG models can also be developed; these estimate expected goals from actions other than shots, such as interceptions near goal. In either case, expected goals can be summed over a game or a season to describe team performance beyond the scorelines.
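As a rough illustration of the summation described above, the sketch below adds up per-shot scoring probabilities to get a match-level xG and compares it with the goals actually scored. The shot probabilities and the goal count are invented values, not output from any real xG model.

```python
# Hypothetical per-shot scoring probabilities for one team in one match
# (illustrative values only, not taken from any real xG model).
shot_probabilities = [0.05, 0.30, 0.10, 0.75]

# Match-level xG is the sum of the individual shot probabilities.
match_xg = sum(shot_probabilities)

# Hypothetical number of goals the team actually scored in that match.
actual_goals = 2

# Positive values mean the team scored more than the model expected.
goals_minus_xg = actual_goals - match_xg

print(f"xG: {match_xg:.2f}")
print(f"G - xG: {goals_minus_xg:.2f}")
```

The same sum can be taken over all matches in a season to get a season-level figure.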

In this analysis, we look at the predicted (xG) versus actual (G) results across various professional leagues.

## Data

Unfortunately, because calculating the variables for a shot is labor-intensive and typically requires video of the shot in question, expected goals data is not widely available. However, some free public sources exist.

Fivethirtyeight.com (538) uses two expected goals models (shot-based and non-shot-based) as part of its soccer league forecasting. The summed expected goals from both models, along with the actual results, are published periodically on a game-by-game basis. This data serves as the starting point for the investigation. Because expected goals can vary considerably from game to game, we sum them over individual seasons.
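The aggregation described above could look roughly like the sketch below. It uses a small invented table with the same column names as the 538 file (`team1`, `team2`, `score1`, `score2`, `xg1`, `xg2`); the team names and values are made up, and the helper names `home`, `away`, `long_format` and `totals` are only for illustration.

```python
import pandas as pd

# Toy match-level data using the 538 column names (values are invented).
matches = pd.DataFrame({
    "team1": ["A", "B", "A"],
    "team2": ["B", "C", "C"],
    "score1": [2, 0, 1],
    "score2": [1, 0, 3],
    "xg1": [1.4, 0.9, 1.1],
    "xg2": [0.8, 0.6, 2.0],
})

# Reshape so each row is one team's result in one match.
home = matches[["team1", "score1", "xg1"]].rename(
    columns={"team1": "team", "score1": "goals", "xg1": "xg"})
away = matches[["team2", "score2", "xg2"]].rename(
    columns={"team2": "team", "score2": "goals", "xg2": "xg"})
long_format = pd.concat([home, away], ignore_index=True)

# Sum goals and xG per team, then compare actual with expected.
totals = long_format.groupby("team")[["goals", "xg"]].sum()
totals["goals_minus_xg"] = totals["goals"] - totals["xg"]
print(totals)
```

For per-season totals on the real data, a season label would also have to be derived and added to the `groupby` keys; this sketch leaves that step out.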

## Code

### First steps

First, the required packages must be loaded and the data must be read in.

In [1]:
#load required packages
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#create filepath
fivethirtyeightpath = os.path.join('..', 'Resources', 'spi_matches.csv')

In [3]:
#load data
df538 = pd.read_csv(fivethirtyeightpath)

In [4]:
#preview data
df538.head()

| | date | league_id | league | team1 | team2 | spi1 | spi2 | prob1 | prob2 | probtie | ... | importance1 | importance2 | score1 | score2 | xg1 | xg2 | nsxg1 | nsxg2 | adj_score1 | adj_score2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-08-12 | 1843 | French Ligue 1 | Bastia | Paris Saint-Germain | 51.16 | 85.68 | 0.0463 | 0.838 | 0.1157 | ... | 32.4 | 67.7 | 0.0 | 1.0 | 0.97 | 0.63 | 0.43 | 0.45 | 0.0 | 1.05 |
| 1 | 2016-08-12 | 1843 | French Ligue 1 | AS Monaco | Guingamp | 68.85 | 56.48 | 0.5714 | 0.1669 | 0.2617 | ... | 53.7 | 22.9 | 2.0 | 2.0 | 2.45 | 0.77 | 1.75 | 0.42 | 2.1 | 2.1 |
| 2 | 2016-08-13 | 2411 | Barclays Premier League | Hull City | Leicester City | 53.57 | 66.81 | 0.3459 | 0.3621 | 0.2921 | ... | 38.1 | 22.2 | 2.0 | 1.0 | 0.85 | 2.77 | 0.17 | 1.25 | 2.1 | 1.05 |
| 3 | 2016-08-13 | 2411 | Barclays Premier League | Crystal Palace | West Bromwich Albion | 55.19 | 58.66 | 0.4214 | 0.2939 | 0.2847 | ... | 43.6 | 34.6 | 0.0 | 1.0 | 1.11 | 0.68 | 0.84 | 1.6 | 0.0 | 1.05 |
| 4 | 2016-08-13 | 2411 | Barclays Premier League | Everton | Tottenham Hotspur | 68.02 | 73.25 | 0.391 | 0.3401 | 0.2689 | ... | 31.9 | 48.0 | 1.0 | 1.0 | 0.73 | 1.11 | 0.88 | 1.81 | 1.05 | 1.05 |


### Exploratory Analysis

Next, we explore the leagues and data types in the dataset.

In [5]:
#leagues in dataset
df538['league'].unique()

array(['French Ligue 1', 'Barclays Premier League',
       'Spanish Primera Division', 'Italy Serie A', 'German Bundesliga',
       'UEFA Champions League',
       'Mexican Primera Division Torneo Clausura', 'Major League Soccer',
       'Swedish Allsvenskan', 'Norwegian Tippeligaen',
       "National Women's Soccer League", 'Brasileiro Série A',
       'Russian Premier Liga', 'Mexican Primera Division Torneo Apertura',
       'Austrian T-Mobile Bundesliga', 'Swiss Raiffeisen Super League',
       'French Ligue 2', 'German 2. Bundesliga',
       'English League Championship', 'Scottish Premiership',
       'Portuguese Liga', 'Dutch Eredivisie',
       'Turkish Turkcell Super Lig', 'Spanish Segunda Division',
       'Italy Serie B', 'Argentina Primera Division',
       'UEFA Europa League', 'United Soccer League', 'Danish SAS-Ligaen',
       'Belgian Jupiler League', 'Chinese Super League',
       'Japanese J League', 'English League One',
       'South African ABSA Premier League', 'En

In [6]:
#datatypes
df538.dtypes

date            object
league_id        int64
league          object
team1           object
team2           object
spi1           float64
spi2           float64
prob1          float64
prob2          float64
probtie        float64
proj_score1    float64
proj_score2    float64
importance1    float64
importance2    float64
score1         float64
score2         float64
xg1            float64
xg2            float64
nsxg1          float64
nsxg2          float64
adj_score1     float64
adj_score2     float64
dtype: object

### Date Conversion

The `date` column is currently stored as strings (object dtype), so we convert it to datetime values.

In [7]:
#import datetime module
import datetime

In [8]:
#test to_datetime method
pd.to_datetime(df538.loc[:, 'date'], format = "%Y-%m-%d")

0       2016-08-12
1       2016-08-12
2       2016-08-13
3       2016-08-13
4       2016-08-13
5       2016-08-13
6       2016-08-13
7       2016-08-13
8       2016-08-13
9       2016-08-13
10      2016-08-13
11      2016-08-13
12      2016-08-13
13      2016-08-13
14      2016-08-14
15      2016-08-14
16      2016-08-14
17      2016-08-14
18      2016-08-14
19      2016-08-15
20      2016-08-19
21      2016-08-19
22      2016-08-19
23      2016-08-19
24      2016-08-20
25      2016-08-20
26      2016-08-20
27      2016-08-20
28      2016-08-20
29      2016-08-20
           ...    
21096   2019-05-26
21097   2019-05-26
21098   2019-05-26
21099   2019-05-26
21100   2019-05-26
21101   2019-05-26
21102   2019-05-26
21103   2019-05-26
21104   2019-06-02
21105   2019-06-02
21106   2019-06-02
21107   2019-06-02
21108   2019-06-02
21109   2019-06-02
21110   2019-06-02
21111   2019-06-02
21112   2019-06-02
21113   2019-06-02
21114   2019-06-02
21115   2019-06-09
21116   2019-06-09
21117   2019

In [9]:
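#convert datatype; assigning the column directly replaces it, so the dtype becomes datetime64[ns]
#(assigning through .loc[:, 'date'] can leave the dtype as object in newer pandas versions)
df538['date'] = pd.to_datetime(df538['date'], format = "%Y-%m-%d")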

In [10]:
#confirm datatypes
df538.dtypes

date           datetime64[ns]
league_id               int64
league                 object
team1                  object
team2                  object
spi1                  float64
spi2                  float64
prob1                 float64
prob2                 float64
probtie               float64
proj_score1           float64
proj_score2           float64
importance1           float64
importance2           float64
score1                float64
score2                float64
xg1                   float64
xg2                   float64
nsxg1                 float64
nsxg2                 float64
adj_score1            float64
adj_score2            float64
dtype: object

### Earliest Point in the Dataset per League

To group matches into seasons, it is useful to know which seasons the data actually covers, so we look at the earliest and latest match dates per league.

In [14]:
#group by league
grouped_league = df538.groupby('league')

In [16]:
#minimum and maximum dates
min_dates = grouped_league['date'].min()
max_dates = grouped_league['date'].max()

#create dataframe
dates538 = pd.DataFrame({'Start Date': min_dates, 'End Date': max_dates})

In [17]:
#see dataframe
dates538

| league | Start Date | End Date |
|---|---|---|
| Argentina Primera Division | 2017-08-25 | 2019-04-07 |
| Australian A-League | 2018-10-19 | 2019-04-28 |
| Austrian T-Mobile Bundesliga | 2017-07-22 | 2019-03-17 |
| Barclays Premier League | 2016-08-13 | 2019-05-12 |
| Belgian Jupiler League | 2018-07-27 | 2019-03-17 |
| Brasileiro Série A | 2017-05-13 | 2018-12-02 |
| Chinese Super League | 2018-08-01 | 2018-11-10 |
| Danish SAS-Ligaen | 2018-07-13 | 2019-03-17 |
| Dutch Eredivisie | 2017-08-11 | 2019-05-12 |
| English League Championship | 2017-08-04 | 2019-05-05 |
