<h1>Chapter 3 | Case Study C1 | <b>Measuring Home Team Advantage in Football</b></h1>
<p>Our goal in this notebook is to describe the <b>distribution and summary statistics</b> of home team advantage in football. We use the <code>football</code> dataset from the English Premier League; see Chapter 2 for earlier analyses of it. Our main question is:</p>
<ul>
<i>Do professional football teams playing in their <b>home</b> stadium have an advantage? What is the extent of that advantage?</i>
</ul>
<p>Without further ado, let's get our hands dirty!</p>
<h2><b>PART A</b> | Read the data</h2>

In [1]:
import os
import sys
import warnings
import pandas as pd
import numpy as np
from mizani.formatters import percent_format
from plotnine import *
from scipy.stats import norm

warnings.filterwarnings("ignore")

In [2]:
pd.set_option("display.max_rows", 500)

In [3]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_case_studies")[0]

#  Get location folders
data_in = f"{dirname}da_data_repo/football/clean/"
data_out = f"{dirname}da_case_studies/ch03-football_home_advantage/"
output = f"{dirname}da_case_studies/ch03-football_home_advantage/output/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [4]:
# Import the prewritten helper functions
from py_helper_functions import *

In [5]:
df = pd.read_csv(f"{data_in}epl_games.csv")

<p>We need the same data used in the previous chapter, that is, the games of the 2016/2017 season. Each pair of teams plays twice: once in the home stadium of one team and once in the home stadium of the other. This should give a total of 380 games, hence N = 380.</p>
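<p>The figure of 380 follows from simple counting: with n teams, every ordered (home, away) pair meets exactly once, which gives n × (n − 1) games. The sketch below checks this under the assumption that the league has 20 teams; the team count is not stated in this notebook, and the placeholder team names are hypothetical.</p>

```python
from itertools import permutations

# Assumption: 20 teams in the league (not stated in this notebook).
n_teams = 20

# Hypothetical placeholder names; only the count matters here.
teams = [f"team_{i}" for i in range(n_teams)]

# Every ordered pair (home, away) is one fixture.
fixtures = list(permutations(teams, 2))

# Closed-form count: each team hosts every other team once.
expected_games = n_teams * (n_teams - 1)

print(len(fixtures), expected_games)
```

<p>If <code>df.shape</code> reported a different number of rows after filtering on <code>season</code>, that would point to missing or duplicated games.</p>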

In [9]:
df = df.loc[df["season"] == 2016, :].reset_index(drop=True)

In [10]:
df.shape

(380, 9)

In [11]:
df.head()

Unnamed: 0,div,season,date,team_home,team_away,points_home,points_away,goals_home,goals_away
0,E0,2016,13aug2016,Middlesbrough,Stoke,1,1,1,1
1,E0,2016,13aug2016,Burnley,Swansea,0,3,0,1
2,E0,2016,13aug2016,Everton,Tottenham,1,1,1,1
3,E0,2016,13aug2016,Crystal Palace,West Brom,0,3,0,1
4,E0,2016,13aug2016,Man City,Sunderland,3,0,2,1


<p>Now, we can create a new variable <code>"home_goaladv"</code> to answer our question.</p>

In [12]:
df.head().T

Unnamed: 0,0,1,2,3,4
div,E0,E0,E0,E0,E0
season,2016,2016,2016,2016,2016
date,13aug2016,13aug2016,13aug2016,13aug2016,13aug2016
team_home,Middlesbrough,Burnley,Everton,Crystal Palace,Man City
team_away,Stoke,Swansea,Tottenham,West Brom,Sunderland
points_home,1,0,1,0,3
points_away,1,3,1,3,0
goals_home,1,0,1,0,2
goals_away,1,1,1,1,1
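<p>The cell above only previews the data; it does not create <code>home_goaladv</code>. A minimal sketch of how the variable could be built, assuming it is defined as home goals minus away goals, is shown below. It uses a small stand-alone frame (the hypothetical name <code>sample</code>) holding the five rows printed above, so it does not depend on the data files.</p>

```python
import pandas as pd

# The five games displayed by df.head() above, retyped for illustration.
sample = pd.DataFrame(
    {
        "team_home": ["Middlesbrough", "Burnley", "Everton", "Crystal Palace", "Man City"],
        "team_away": ["Stoke", "Swansea", "Tottenham", "West Brom", "Sunderland"],
        "goals_home": [1, 0, 1, 0, 2],
        "goals_away": [1, 1, 1, 1, 1],
    }
)

# Assumed definition: goals scored by the home team minus goals by the away team.
# Positive values mean the home team won, zero a draw, negative an away win.
sample["home_goaladv"] = sample["goals_home"] - sample["goals_away"]

print(sample[["team_home", "team_away", "home_goaladv"]])
```

<p>On the full data, the same one-line assignment would apply to <code>df</code> instead of <code>sample</code>.</p>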


<p>Before visualizing the distribution of the goal difference across games, let's reproduce Table 3.7 from the book, which describes the home team minus away team goal difference.</p>
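<p>As a rough illustration of what such a descriptive table can contain, the sketch below computes the mean, the standard deviation, and the shares of positive, zero and negative goal differences. The choice of statistics is an assumption about the table's layout, and the input is only the five games previewed earlier, so the numbers are not the book's results.</p>

```python
import pandas as pd

# Goals from the five games shown by df.head(); illustration only.
goals_home = pd.Series([1, 0, 1, 0, 2])
goals_away = pd.Series([1, 1, 1, 1, 1])
goal_diff = goals_home - goals_away

# Assumed set of statistics for a descriptive table of the goal difference.
stats = {
    "Mean": goal_diff.mean(),
    "Standard deviation": goal_diff.std(),  # sample std (ddof=1), the pandas default
    "Percent positive": (goal_diff > 0).mean() * 100,
    "Percent zero": (goal_diff == 0).mean() * 100,
    "Percent negative": (goal_diff < 0).mean() * 100,
    "Number of observations": len(goal_diff),
}

summary = pd.DataFrame({"Statistic": list(stats.keys()), "Value": list(stats.values())})
print(summary.round(1))
```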

In [None]:
pd.Da