<h1>Chapter 3 | Data Exercise #4 | Measuring home team advantage | Different statistics</h1>
<h2>Introduction:</h2>
<p>In this notebook, you will find my notes and code for Chapter 3's <b>exercise 4</b> of the book <a href="https://gabors-data-analysis.com/">Data Analysis for Business, Economics, and Policy</a>, by Gábor Békés and Gábor Kézdi. The question was:</p>
<p>4. Choose the same 2016/2017 season from the <code>football</code> dataset.</p>
<p>Assignments:</p>
<ul>
    <li>Produce a different table with possibly different statistics to show the extent of home team advantage.</li>
    <li>Compare the results and discuss what you find.</li>
</ul>
<h2><b>1.</b> Load the data</h2>

In [45]:
import os
import sys
import warnings
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

warnings.filterwarnings("ignore")

In [46]:
# Increase number of returned rows in pandas
pd.set_option("display.max_rows", 500)

In [47]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_data_exercises")[0]

# Get location folders
data_in = f"{dirname}da_data_repo/football/clean/"
data_out = f"{dirname}da_data_exercises/ch03-exploratory_data_analysis/04-football_home_adv_stats/data/clean/"
output = f"{dirname}da_data_exercises/ch03-exploratory_data_analysis/04-football_home_adv_stats/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [48]:
from py_helper_functions import *

In [49]:
df = pd.read_csv(f"{data_in}epl_games.csv")

In [50]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2008,16aug2008,Arsenal,West Brom,3,0,1,0
1,E0,2008,16aug2008,West Ham,Wigan,3,0,2,1
2,E0,2008,16aug2008,Middlesbrough,Tottenham,3,0,2,1
3,E0,2008,16aug2008,Everton,Blackburn,0,3,2,3
4,E0,2008,16aug2008,Bolton,Stoke,3,0,3,1


<h2><b>2</b>. EDA</h2>
<h3>2.1 Pick 2016/2017 season</h3>

In [51]:
df = df.loc[df["season"] == 2016, :].reset_index(drop=True)

In [52]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2016,13aug2016,Middlesbrough,Stoke,1,1,1,1
1,E0,2016,13aug2016,Burnley,Swansea,0,3,0,1
2,E0,2016,13aug2016,Everton,Tottenham,1,1,1,1
3,E0,2016,13aug2016,Crystal Palace,West Brom,0,3,0,1
4,E0,2016,13aug2016,Man City,Sunderland,3,0,2,1


<p>Let's calculate home team advantage by creating the field <code>"home_goaladv"</code>.</p>

In [53]:
df["home_goaladv"] = df["goals_home"] - df["goals_away"]

<h3>2.1 Calculate different possible statistics</h3>
<p>Now, there are many different statistics we can use. They fall into three of the most commonly used types, listed here with some examples of each:</p>
<ul>
<li>Central value (location)
<ul>
<li>Mean</li>
<li>Trimmed mean</li>
<li>Weighted mean</li>
<li>Median</li>
<li>Weighted median</li>
</ul>
</li>
<li>Spread (variation)
<ul>
<li>Range</li>
<li>Inter-quartile range</li>
<li>Standard deviation</li>
<li>Median absolute deviation from the median</li>
</ul>
</li>
<li>Skewness</li>
</ul>
<h4>2.1.1 Measures of central value (location)</h4>
<p>Let's start with the <b>mean</b> and its variations.</p>

In [54]:
mean = df["home_goaladv"].mean()
mean

0.39473684210526316

<p>The <b>trimmed mean</b> is calculated by sorting the values, dropping a fixed share of them at each end, and then taking the average of the remaining values. This reduces the influence of extreme values.</p>
<p>We can calculate it using <code>scipy.stats</code>'s <code>trim_mean</code> and drop 10% of the observations from each end of the dataset.</p>

In [55]:
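# Call via the module, since the name `trim_mean` is rebound to the result below
import scipy.stats
trim_mean = scipy.stats.trim_mean(df["home_goaladv"], 0.1)
trim_mean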

0.4342105263157895

<p>As we can see, the mean increased from 0.39 to <b>0.43</b>. Removing the games with the most extreme results at both ends led to an even more favorable picture for the home team.</p>
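<p>To make the trimming step concrete, here is a minimal sketch in plain Python. The function <code>trimmed_mean_by_hand</code> and the ten goal differences are made up for illustration and are not taken from the dataset. Rounding the number of dropped values down is, as far as I know, also what SciPy does.</p>

```python
def trimmed_mean_by_hand(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    # Number of observations to cut from each end, rounded down
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical goal differences (home minus away) for ten games
goal_diffs = [-5, -1, -1, 0, 0, 0, 1, 1, 2, 3]

plain = sum(goal_diffs) / len(goal_diffs)
trimmed = trimmed_mean_by_hand(goal_diffs, 0.1)
print(plain, trimmed)
```

<p>In this toy sample the heaviest home defeat (-5) lies further from the centre than the biggest home win (3), so dropping one game at each end raises the mean. That is the same direction of change as in the real data.</p>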
<p>Let's compute the <b>median</b>. It represents the middle number on a sorted list of the data and it depends only on the values in the center of such data. Because it is robust to extreme values, it is a useful measure when trying to get a hold of central values.</p>

In [56]:
median = df["home_goaladv"].median()
median

0.0

<p>The median is <b>0</b>, which means that the middle value of the goal difference for the home team is zero. Because the mean is higher than the median, we can expect some degree of right skewness in the distribution.</p>
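<p>The robustness of the median can be illustrated with a small, hypothetical sample (the numbers below are invented, not taken from the dataset). Replacing a single game with a lopsided result moves the mean but leaves the median untouched.</p>

```python
import statistics

# Hypothetical goal differences (home minus away) for seven games
goal_diffs = [-2, -1, 0, 0, 0, 1, 1]
# Same games, except that the last one is replaced by a 7-0 home win
with_blowout = [-2, -1, 0, 0, 0, 1, 7]

mean_before = statistics.mean(goal_diffs)
mean_after = statistics.mean(with_blowout)
median_before = statistics.median(goal_diffs)
median_after = statistics.median(with_blowout)

print(mean_before, mean_after)      # the mean moves with the extreme game
print(median_before, median_after)  # the median stays where it was
```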
<p>We can now take a look at the <b>weighted median</b>. It is the value at which the sum of the weights is equal for the lower and upper halves of the sorted list.</p>
<p>We can use <code>wquantiles</code> with the variable <code>points_home</code> as weights.</p>

In [57]:
weighted_median = wquantiles.median(df["home_goaladv"], weights=df["points_home"])
weighted_median

2.0
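<p>The idea behind the weighted median can be sketched by hand. The function <code>weighted_median_by_hand</code> and the data are hypothetical. It uses a simple rule (the first sorted value at which the running weight reaches half of the total), whereas <code>wquantiles</code> may interpolate between values, so the two need not agree exactly on real data.</p>

```python
import statistics


def weighted_median_by_hand(values, weights):
    """Smallest value at which the running weight reaches half of the total."""
    half = sum(weights) / 2
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


# Hypothetical games: goal difference and points earned by the home team
goal_diffs = [-2, -1, 0, 0, 1, 2, 3]
points_home = [0, 0, 1, 1, 3, 3, 3]

unweighted = statistics.median(goal_diffs)
weighted = weighted_median_by_hand(goal_diffs, points_home)
print(unweighted, weighted)
```

<p>Because losses carry no weight and wins carry the most, the weighted median lands among the home wins, well above the unweighted median.</p>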

<p>We can use NumPy's <code>np.average()</code> with the parameter <code>weights</code> to get the <b>weighted mean</b>.</p>

In [58]:
weighted_mean = np.average(df["home_goaladv"], weights=df["points_home"])
weighted_mean

1.6790697674418604

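<p>Weighting by home points raises both the mean and the median. I chose <code>points_home</code> because it is the closest thing to a meaningful weight in this dataset, but the result needs a caveat: the weight is itself determined by the outcome of the game. A home loss earns 0 points and therefore gets zero weight, a draw gets weight 1, and a home win gets weight 3, so the weighted statistics mostly describe games the home team won and are pushed upward by construction.</p>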
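<p>A small sketch may help to read the weighted mean. With integer weights, it is the same as an ordinary mean in which each game is repeated as many times as the home team earned points. The six games below are invented for illustration.</p>

```python
# Hypothetical games: goal difference and points earned by the home team
goal_diffs = [-3, -1, 0, 0, 1, 2]
points_home = [0, 0, 1, 1, 3, 3]

plain_mean = sum(goal_diffs) / len(goal_diffs)

# Weighted mean: sum of value * weight divided by the sum of the weights
weighted_mean = sum(g * w for g, w in zip(goal_diffs, points_home)) / sum(points_home)

# Equivalent view: repeat each game once per point earned at home.
# Games with 0 points (home losses) disappear from the list entirely.
expanded = [g for g, w in zip(goal_diffs, points_home) for _ in range(w)]
expanded_mean = sum(expanded) / len(expanded)

print(plain_mean, weighted_mean, expanded_mean)
print(expanded)
```

<p>The two home defeats never enter the expanded list, which is why a points-weighted mean is bound to look more favorable to the home team than the plain mean.</p>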
<h4>2.1.2 Measures of <b>spread</b> (variability)</h4>
<p>We can start by calculating the <b>range</b>.</p>


In [59]:
# Note: this name shadows Python's built-in range() for the rest of the notebook
range = df["home_goaladv"].max() - df["home_goaladv"].min()
range

11

<p>That is quite a wide range for our scenario: 11 goals separate the maximum and the minimum goal difference.</p>

<p>Moving on, we can calculate the <b>quantiles</b> and determine the <b>IQR</b>.</p>

In [60]:
iqr = df["home_goaladv"].quantile(0.75) - df["home_goaladv"].quantile(0.25)
iqr

3.0

<p>The IQR is <b>3</b>, which means that the middle 50% of the games fall within an interval of goal differences that is 3 goals wide.</p>
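<p>As a rough illustration of how the range and the IQR react differently to extreme games, here is a sketch on a hypothetical sample using only the standard library. The <code>"inclusive"</code> method of <code>statistics.quantiles</code> should correspond to the linear interpolation that pandas uses by default, but I have not checked this against the real data.</p>

```python
import statistics

# Hypothetical goal differences (home minus away) for thirteen games
goal_diffs = [-3, -2, -1, -1, 0, 0, 0, 1, 1, 2, 2, 4, 6]

# Quartiles: the three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(goal_diffs, n=4, method="inclusive")

iqr = q3 - q1                                    # spread of the middle half
value_range = max(goal_diffs) - min(goal_diffs)  # spread of everything

print(q1, q3, iqr, value_range)
```

<p>The range depends only on the two most extreme games, while the IQR ignores them, so the IQR is much narrower here.</p>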
<p>Let's calculate the <b>standard deviation</b>.</p>

In [61]:
stddev = df["home_goaladv"].std()
stddev

1.9073455242915103

<p>We can now calculate the <b>median absolute deviation from the median</b> (MAD), which is robust to extreme values and quantifies the variability of the data around the median. It is the median of the absolute deviations from the median. We can compute it with <code>robust.scale.mad</code> from <code>statsmodels</code>, which by default divides the result by about 0.6745 so that it is comparable to the standard deviation for normally distributed data.</p>

In [62]:
mad = robust.scale.mad(df["home_goaladv"])
mad

1.482602218505602

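<p>As expected, the MAD is lower than the standard deviation, which reflects its robustness to extreme values. The reported 1.48 is the rescaled value: it equals 1/0.6745, so the unscaled median absolute deviation is exactly 1 goal. In other words, with a median of zero, at least half of the games ended with a goal difference of at most 1 in absolute value.</p>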
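<p>The calculation can be reproduced by hand on a hypothetical sample. The sketch below computes the unscaled median absolute deviation and then divides it by the 75th percentile of the standard normal distribution (about 0.6745), which is my understanding of the default normalization in <code>statsmodels</code>. The data are invented.</p>

```python
import statistics

# Hypothetical goal differences (home minus away) for ten games
goal_diffs = [-3, -1, -1, 0, 0, 0, 1, 1, 2, 5]

center = statistics.median(goal_diffs)

# Absolute distance of every game from the median
abs_devs = [abs(g - center) for g in goal_diffs]

# Unscaled MAD: the median of those distances
raw_mad = statistics.median(abs_devs)

# Normalization constant: 75th percentile of the standard normal distribution
c = statistics.NormalDist().inv_cdf(0.75)
scaled_mad = raw_mad / c

print(raw_mad, round(c, 4), round(scaled_mad, 4))
```

<p>An unscaled MAD of 1 goal turns into roughly 1.48 after the normalization, which matches the value obtained above.</p>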
<h4>2.1.3 Skewness</h4>
<p>Let's calculate the skewness of our dataset by using the <b>mean-median skewness</b>, that is, the difference between the mean and the median divided by the standard deviation.</p>

In [63]:
skew = (df["home_goaladv"].mean() - df["home_goaladv"].median())/df["home_goaladv"].std()
skew

0.20695612676255365

<p>Because our mean is higher than the median, we get a positive number, <b>0.2</b>, which means that our dataset has a longer tail to the right.</p>
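<p>To see how the sign of this measure follows the direction of the tail, here is a minimal sketch with a hypothetical helper, <code>mean_median_skewness</code>, applied to invented data and to its mirror image. It uses the sample standard deviation, which is also the pandas default.</p>

```python
import statistics


def mean_median_skewness(values):
    """(mean - median) / sample standard deviation."""
    return (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)


# Hypothetical goal differences with one big home win in the right tail
goal_diffs = [-2, -1, -1, 0, 0, 0, 0, 1, 2, 6]
# Mirror image: the same games seen from the away team's side
mirrored = [-g for g in goal_diffs]

skew_right = mean_median_skewness(goal_diffs)
skew_left = mean_median_skewness(mirrored)

print(round(skew_right, 3), round(skew_left, 3))
```

<p>The long right tail gives a positive value, and flipping the data flips the sign while keeping the magnitude.</p>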
<h3>2.2 Compare the results</h3>
<p>We can now create our table for our summary statistics and compare the results we got from the original statistics, which focused on the mean and on the standard deviation.</p>

In [64]:
pd.DataFrame.from_dict(
    {
        "Statistics": [
            "Mean",
            "Trimmed mean",
            "Weighted mean",
            "Median",
            "Weighted median",
            "Range",
            "IQR",
            "Standard deviation",
            "MAD",
            "Skewness",
        ],
        "Value": [
            mean,
            trim_mean,
            weighted_mean,
            median,
            weighted_median,
            range,
            iqr,
            stddev,
            mad,
            skew
        ],
    }
).round(2)

Unnamed: 0,Statistics,Value
0,Mean,0.39
1,Trimmed mean,0.43
2,Weighted mean,1.68
3,Median,0.0
4,Weighted median,2.0
5,Range,11.0
6,IQR,3.0
7,Standard deviation,1.91
8,MAD,1.48
9,Skewness,0.21


<p>We can start by discussing the <b>mean</b>. First, the <b>trimmed mean</b> (0.43) is higher than the mean (0.39). This suggests that the extreme values on the left, that is, heavy home defeats, were pulling the mean down more than the extreme values on the right were pulling it up. By removing 10% of the games at each end, the mean increased by a small amount. Weighting the mean by the number of home points raises it to 1.68. I find this value harder to interpret: because home losses carry zero weight and home wins count three times as much as draws, it is closer to an average goal difference among games in which the home team earned points than to a measure of home advantage across all games. In any case, we can say that the mean is affected by extreme values and that removing them gives a more robust result.</p>
<p>The median also rises, from 0 to a goal difference of 2, when weighted by home points, for the same reason.</p>
<p>Regarding the spread of the data, the <b>range</b> shows how <b>wide</b> the distribution is. At <b>11</b> goals it is quite wide relative to a median of 0 and a mean of 0.39, which tells us that there are indeed extreme values, that is, games decided by a large goal difference. The <b>IQR</b>, however, shows that the middle 50% of the observations are concentrated in a much narrower interval: the distance between the 75th and the 25th percentiles is <b>3</b> goals. The <b>standard deviation</b> is <b>1.91</b>, so a typical game deviates from the mean of 0.39 by roughly 1.9 goals. This is fairly high, yet expected given the presence of extreme values. The <b>MAD</b> is lower, at <b>1.48</b>, which reflects its robustness to extreme values. Note that it measures deviation from the median, which is 0, not from the mean, and that <code>statsmodels</code> reports it rescaled to be comparable with the standard deviation.</p>
<p>Finally, regarding the skewness of the data, the <b>mean-median skewness</b> is <b>0.21</b>. The positive sign reflects a mean that is higher than the median, which hints at extreme values on the right that pull the mean up. In a symmetric distribution, such as the normal distribution, the mean and the median coincide, so this value would be zero.</p>
<h2><b>3</b>. Conclusion | Final remarks</h2>
<p>By calculating different statistics, we get a more nuanced view of our dataset. We can see how outliers affect important estimates and use robust alternatives to reduce their influence. While using all of these statistics may be impractical on a daily basis, learning which ones are relevant for each case is an important skill to hone!</p>
<p>And that was it. Thank you and hope you enjoyed it!</p>
<hr>