<h1>Chapter 3 | Data Exercise #4 | Measuring home team advantage | Different statistics</h1>
<h2>Introduction:</h2>
<p>In this notebook, you will find my notes and code for <b>exercise 4</b> of Chapter 3 of the book <a href="https://gabors-data-analysis.com/">Data Analysis for Business, Economics, and Policy</a>, by Gábor Békés and Gábor Kézdi. The exercise reads:</p>
<p>4. Choose the same 2016/2017 season from the <code>football</code> dataset.</p>
<p>Assignments:</p>
<ul>
    <li>Produce a different table with possibly different statistics to show the extent of home team advantage.</li>
    <li>Compare the results and discuss what you find.</li>
</ul>
<h2><b>1.</b> Load the data</h2>

In [20]:
import os
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import wquantiles
from mizani.formatters import percent_format
from plotnine import *
from scipy.stats import trim_mean
from statsmodels import robust

warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
# Increase number of returned rows in pandas
pd.set_option("display.max_rows", 500)

In [2]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_data_exercises")[0]

# Get location folders
data_in = f"{dirname}da_data_repo/football/clean/"
data_out = f"{dirname}da_data_exercises/ch03-exploratory_data_analysis/04-football_home_adv_stats/data/clean/"
output = f"{dirname}da_data_exercises/ch03-exploratory_data_analysis/04-football_home_adv_stats/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [3]:
from py_helper_functions import *

In [4]:
df = pd.read_csv(f"{data_in}epl_games.csv")

In [5]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2008,16aug2008,Arsenal,West Brom,3,0,1,0
1,E0,2008,16aug2008,West Ham,Wigan,3,0,2,1
2,E0,2008,16aug2008,Middlesbrough,Tottenham,3,0,2,1
3,E0,2008,16aug2008,Everton,Blackburn,0,3,2,3
4,E0,2008,16aug2008,Bolton,Stoke,3,0,3,1


<h2><b>2.</b> EDA</h2>
<h3>2.1 Pick 2016/2017 season</h3>

In [7]:
df = df.loc[df["season"] == 2016, :].reset_index(drop=True)

In [8]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2016,13aug2016,Middlesbrough,Stoke,1,1,1,1
1,E0,2016,13aug2016,Burnley,Swansea,0,3,0,1
2,E0,2016,13aug2016,Everton,Tottenham,1,1,1,1
3,E0,2016,13aug2016,Crystal Palace,West Brom,0,3,0,1
4,E0,2016,13aug2016,Man City,Sunderland,3,0,2,1


<p>Let's measure home team advantage by creating the field <code>home_goaladv</code>: the home team's goals minus the away team's goals in each game.</p>

In [15]:
df["home_goaladv"] = df["goals_home"] - df["goals_away"]
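<p>The goal difference can also be turned into game outcomes, which gives another way to tabulate home team advantage. The following is a minimal sketch on made-up scores (the <code>toy</code> DataFrame and the <code>outcome</code> helper are hypothetical, not part of the original notebook or of <code>epl_games.csv</code>):</p>

```python
import pandas as pd

# Made-up scores for illustration only; not taken from epl_games.csv
toy = pd.DataFrame({
    "goals_home": [2, 1, 0, 3, 1],
    "goals_away": [1, 1, 2, 0, 1],
})
toy["home_goaladv"] = toy["goals_home"] - toy["goals_away"]


def outcome(diff):
    """Classify a game from the sign of the home goal difference."""
    if diff > 0:
        return "home win"
    if diff < 0:
        return "away win"
    return "draw"


toy["outcome"] = toy["home_goaladv"].apply(outcome)

# Share of games ending in each outcome
outcome_share = toy["outcome"].value_counts(normalize=True)
print(outcome_share)
```

<p>Applied to the real data, the same idea would show how often the home team wins, draws, or loses, instead of by how many goals.</p>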

<h3>2.2 Calculate different possible statistics</h3>
<p>There are many statistics we could use. They fall into three of the most commonly used types, listed here with some examples of each:</p>
<ul>
<li>Central value (location)
<ul>
<li>Mean</li>
<li>Weighted mean</li>
<li>Median</li>
<li>Percentile</li>
<li>Weighted median</li>
<li>Trimmed mean</li>
</ul>
</li>
<li>Spread (variation)
<ul>
<li>Deviation</li>
<li>Variance</li>
<li>Mean absolute deviation</li>
<li>Median absolute deviation from the median</li>
<li>Range</li>
<li>Order statistics</li>
</ul>
</li>
<li>Skewness</li>
</ul>
<h4>2.2.1 Measures of central value (location)</h4>
<p>Let's start with the <b>mean</b> and its variations.</p>

In [16]:
df["home_goaladv"]

0      0
1     -1
2      0
3     -1
4      1
      ..
375    1
376    4
377    2
378   -5
379    0
Name: home_goaladv, Length: 380, dtype: int64

In [19]:
df["home_goaladv"].mean()

0.39473684210526316

<p>The <b>trimmed mean</b> is calculated by sorting the values, dropping a fixed number of them at each end, and then taking the average of the remaining values. It reduces the influence of extreme values.</p>
<p>We can calculate it with <code>trim_mean</code> from <code>scipy.stats</code>, dropping 10% of the observations from each end of the sorted data.</p>

In [24]:
trim_mean(df["home_goaladv"], 0.1)

0.4342105263157895

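<p>As we can see, the mean increased from 0.39 to <b>0.43</b> after trimming. The games with the most extreme results that we removed were, on balance, pulling the overall mean down, so dropping them leads to a slightly more favorable picture for the home team.</p>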
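<p>To make the trimming step concrete, here is a minimal sketch that reproduces <code>trim_mean</code> by hand on made-up numbers (the array <code>values</code> is hypothetical, not the football data). With 380 games and a cut of 0.1, the same logic would drop 38 games at each end.</p>

```python
import numpy as np
from scipy.stats import trim_mean

# Made-up goal differences for illustration only
values = np.array([-5, -1, 0, 0, 1, 1, 1, 2, 2, 9])
cut = 0.1

# Number of observations dropped at each end of the sorted data
k = int(cut * len(values))

sorted_values = np.sort(values)
manual = sorted_values[k:len(values) - k].mean()
scipy_result = trim_mean(values, cut)

print("plain mean:  ", values.mean())
print("manual trim: ", manual)
print("scipy trim:  ", scipy_result)
```

<p>In this toy example the trimmed mean is lower than the plain mean because the largest value dropped is further from the center than the smallest one. In the football data the effect goes the other way.</p>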
<p>Let's compute the <b>median</b>. It is the middle value of the sorted data, so it depends only on the values at the center of the distribution. Because it is robust to extreme values, it is a useful measure of the central value.</p>

In [22]:
df["home_goaladv"].median()

0.0

<p>The median is <b>0</b>, which means that the middle value of the home team's goal difference is zero. Because the mean (0.39) is higher than the median, we can expect some degree of right skewness in the distribution of goal differences.</p>
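<p>The link between the mean, the median, and skewness can be sketched on made-up numbers. The example below assumes one common standardized measure of skewness, (mean - median) / standard deviation. The array <code>diffs</code> is hypothetical and is not the football data.</p>

```python
import numpy as np

# Made-up goal differences with one large home win
diffs = np.array([-1, 0, 0, 0, 1, 1, 6])

mean = diffs.mean()
median = np.median(diffs)
std = diffs.std(ddof=1)  # sample standard deviation

# Assumed skewness measure: positive when the mean exceeds the median
skew = (mean - median) / std
print(mean, median, skew)

# Making the extreme value even more extreme moves the mean, not the median
extreme = diffs.copy()
extreme[-1] = 60
print(extreme.mean(), np.median(extreme))
```

<p>A positive value points to a longer right tail, which is what a mean above the median suggests.</p>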
<p>We can now take a look at the <b>weighted median</b>. It is the value at which the sum of the weights is equal for the lower and upper halves of the sorted list.</p>
<p>We can compute it with the <code>wquantiles</code> package. For the weighted mean, we can use <code>numpy</code>'s <code>np.average</code> with the parameter <code>weights</code>.</p>

In [26]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away,home_goaladv,mean
0,E0,2016,13aug2016,Middlesbrough,Stoke,1,1,1,1,0,0.394737
1,E0,2016,13aug2016,Burnley,Swansea,0,3,0,1,-1,0.394737
2,E0,2016,13aug2016,Everton,Tottenham,1,1,1,1,0,0.394737
3,E0,2016,13aug2016,Crystal Palace,West Brom,0,3,0,1,-1,0.394737
4,E0,2016,13aug2016,Man City,Sunderland,3,0,2,1,1,0.394737


In [32]:
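# NOTE: the weights here are the goal differences themselves, which include zeros
# and negative values, so the result is not a meaningful weighted median.
wquantiles.median(df["home_goaladv"], weights=df["home_goaladv"])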

4.0
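<p>The definition of the weighted median can be sketched from scratch with a cumulative sum of weights. This is a simplified illustration under stated assumptions: <code>weighted_median</code> is a hypothetical helper that returns the first sorted value at which the cumulative weight reaches half of the total, so it may differ from what <code>wquantiles</code> returns, and the weights are made up (for example, they could stand for attendance, which is not in the dataset).</p>

```python
import numpy as np


def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")

    # Sort values and carry the weights along
    order = np.argsort(values)
    values, weights = values[order], weights[order]

    cumulative = np.cumsum(weights)
    half = 0.5 * cumulative[-1]
    return values[np.searchsorted(cumulative, half)]


# Made-up goal differences and made-up non-negative weights
toy_values = [-1, 0, 1, 3]
toy_weights = [1, 1, 1, 5]

print(np.median(toy_values))
print(weighted_median(toy_values, toy_weights))
```

<p>The heavy weight on the last game pulls the weighted median towards its value, while the unweighted median stays in the middle of the list.</p>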

In [28]:
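# NOTE: with the variable used as its own weights, this computes sum(x**2) / sum(x),
# which is not a meaningful weighted mean of the goal difference.
np.average(df["home_goaladv"], weights=df["home_goaladv"])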

9.586666666666666
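<p><code>np.average</code> computes the sum of weight times value divided by the sum of the weights. The sketch below uses made-up numbers (<code>x</code> and <code>w</code> are hypothetical) to show why weighting a variable by itself can give a value outside the range of the data, and what a weighted mean with separate, non-negative weights looks like.</p>

```python
import numpy as np

# Made-up goal differences for illustration only
x = np.array([-1, 0, 1, 2])

# Using x as its own weights gives sum(x * x) / sum(x)
self_weighted = np.average(x, weights=x)
by_hand = (x * x).sum() / x.sum()

# Separate, non-negative weights (hypothetical, e.g. game importance)
w = np.array([1, 1, 2, 4])
weighted_mean = np.average(x, weights=w)

print(self_weighted, by_hand)
print(weighted_mean, x.min(), x.max())
```

<p>The self-weighted result lies above the largest value in <code>x</code>, while the properly weighted mean stays between the minimum and the maximum.</p>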