# Exploring the Impact of Home Team Advantage in Football

## Methodology
In this analysis, we explore the impact of home team advantage in football by:
1. Collecting and preprocessing match data, including home and away team performance.
2. Running a two-sample z-test on whether the proportion of home games won is greater than the proportion of away games won.

## **Step 1:** Collecting and Preprocessing Match Data
In this step, we gather match data, including information about home and away team performances. The data is then prepared for analysis by checking for missing values in the columns we need. First, let's load the modules and data we need.

In [1]:
import soccerdata as sd
import sys
import os
import logging

# Add the parent directory to sys.path
sys.path.append(os.path.abspath(".."))  # Adjust as needed
logging.getLogger().setLevel(logging.ERROR) # Removes unnecessary messages if needed

In [2]:
from football_analytics.utils.constants import BIG_FIVE_LEAGUES
from football_analytics.utils.config import HOME_ADVANTAGE_CACHE_DIR

TIME_PERIOD = 20 # How far back are we looking in years

# Using Match History to get match specific data like if a team is home
match_history = sd.MatchHistory(
    leagues=BIG_FIVE_LEAGUES, # Premier League, Serie A, La Liga, Bundesliga, Ligue 1
    seasons=range(2025 - TIME_PERIOD, 2025),
    no_cache=False,
    no_store=False,
    data_dir=HOME_ADVANTAGE_CACHE_DIR
)

games = match_history.read_games()

Let's have a quick look at the head of the data to see if it was loaded correctly.

In [3]:
games.head()

league,season,game,date,home_team,away_team,FTHG,FTAG,FTR,HTHG,HTAG,HTR,referee,...,1XBCH,1XBCD,1XBCA,BFECH,BFECD,BFECA,BFEC>2.5,BFEC<2.5,BFECAHH,BFECAHA
ENG-Premier League,506,2005-08-13 Aston Villa-Bolton,2005-08-13 12:00:00,Aston Villa,Bolton,2.0,2.0,D,2.0,2.0,D,M Riley,...,,,,,,,,,,
ENG-Premier League,506,2005-08-13 Everton-Man United,2005-08-13 12:00:00,Everton,Man United,0.0,2.0,A,0.0,1.0,A,G Poll,...,,,,,,,,,,
ENG-Premier League,506,2005-08-13 Fulham-Birmingham,2005-08-13 12:00:00,Fulham,Birmingham,0.0,0.0,D,0.0,0.0,D,R Styles,...,,,,,,,,,,
ENG-Premier League,506,2005-08-13 Man City-West Brom,2005-08-13 12:00:00,Man City,West Brom,0.0,0.0,D,0.0,0.0,D,C Foy,...,,,,,,,,,,
ENG-Premier League,506,2005-08-13 Middlesbrough-Liverpool,2005-08-13 12:00:00,Middlesbrough,Liverpool,0.0,0.0,D,0.0,0.0,D,M Halsey,...,,,,,,,,,,


Since we only need the full time home and away goals for the analysis (FTHG & FTAG), we only need to check for any empty values in those columns to make sure the data is valid.

In [4]:
games['FTHG'].isna().any() or games['FTAG'].isna().any() # Is any FTHG or FTAG value NaN?

np.False_

Neither column contains missing values, so the data is ready to use and we can move on to the analysis.

## **Step 2:** Determining if there is a Home Advantage
In this step, we analyze the match data to determine whether there is a home advantage. We use a two-sample z-test to evaluate whether the proportion of home games won is significantly greater than the proportion of away games won. The clean data from the previous step is used for this analysis. Let's put everything in mathematical terms.

We define the random variables $X$ and $Y$ as:
$$
\begin{aligned}
X &= 
\begin{cases}
1 & \text{a team wins at home} \\
0 & \text{a team does not win at home (loss or draw)}
\end{cases} \\
Y &= 
\begin{cases}
1 & \text{a team wins away} \\
0 & \text{a team does not win away (loss or draw)}
\end{cases}
\end{aligned}
$$
We model $X \sim \text{Bernoulli}(p_h)$ and $Y \sim \text{Bernoulli}(p_a)$ where $p_h$ and $p_a$ are the probabilities of winning at home and away, respectively.

Before proceeding, we make the following assumptions:

1. **Independence of Samples:** The outcomes of home games (win or no win) and away games (win or no win) are independent. This means the result of one game (home or away) does not influence the result of another game.
In the context of football, factors like momentum and team morale may have some effect, but we assume it is small enough to ignore for the purposes of this analysis.

2. **Large Enough Sample Size:** For large sample sizes, the Central Limit Theorem ensures that the sampling distribution of the sample mean approaches a normal distribution, even if the original data are not normally distributed. We therefore assume that the sample proportions of home and away games won, $\hat{p_h}$ and $\hat{p_a}$, approximately follow:
$$\hat{p_h} \sim \mathcal{N}\left(p_h, \: \frac{p_h(1 - p_h)}{n_h}\right)$$
$$\hat{p_a} \sim \mathcal{N}\left(p_a, \: \frac{p_a(1 - p_a)}{n_a}\right)$$
where $n_h$ and $n_a$ are the number of home and away games, and the second parameter of each distribution is its variance.
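The normal approximation can be checked with a small simulation. The sketch below is illustrative only: the win probability `p`, the sample size `n`, and the number of repetitions `reps` are made-up values, not figures from the match data. It draws many Bernoulli samples and compares the spread of the resulting sample proportions with the variance $p(1 - p)/n$ that applies to a proportion computed from $n$ games.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the sketch is reproducible

p = 0.45     # hypothetical win probability (assumption, not from the data)
n = 500      # hypothetical number of games per sample
reps = 2000  # number of simulated samples


def sample_proportion(p, n):
    """Proportion of wins in n independent Bernoulli(p) games."""
    return sum(random.random() < p for _ in range(n)) / n


estimates = [sample_proportion(p, n) for _ in range(reps)]

empirical_var = pvariance(estimates)  # spread of the simulated proportions
theoretical_var = p * (1 - p) / n     # variance of a sample proportion

print(f"Mean of sample proportions: {mean(estimates):.4f}")
print(f"Empirical variance:         {empirical_var:.6f}")
print(f"p(1 - p) / n:               {theoretical_var:.6f}")
```

The two variances should agree closely, and both shrink as `n` grows, which is why the test statistic divides by the number of games.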

### Two Sample Z-Test for Proportions

Our null and alternate hypotheses are:
$$H_0: p_h = p_a$$
$$H_1: p_h > p_a$$

Here we use a one-tailed test ($p_h > p_a$) since we are asking whether football teams win more often at home than away.

Our test statistic is given by:
$$z = \frac{\hat{p_h} - \hat{p_a}}{\sqrt{p^*(1 - p^*)(\frac{1}{n_h}+\frac{1}{n_a})}}$$

Where:
- $p^*$ is the pooled sample proportion of home and away games won $$p^* = \frac{w_h + w_a}{n_h + n_a}$$
- $n_h$ and $n_a$ are the number of home and away games in the dataset
- $w_h$ and $w_a$ are the number of home and away games won
$$ w_h = \sum_{i = 1}^{n_h}{X_i} $$
$$ w_a = \sum_{i = 1}^{n_a}{Y_i} $$
- $\hat{p_h}$ and $\hat{p_a}$ are the sample proportions of home and away games won respectively 
$$ \hat{p_h} = \frac{w_h}{n_h} $$
$$ \hat{p_a} = \frac{w_a}{n_a} $$
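To make the formula concrete, here is a minimal sketch that wraps it in a function and applies it to a toy example. The function name `two_proportion_z` and the counts (45 wins in 100 home games, 30 wins in 100 away games) are hypothetical and chosen only for illustration.

```python
from math import sqrt


def two_proportion_z(w_h, n_h, w_a, n_a):
    """Pooled two-sample z statistic for H0: p_h = p_a."""
    p_hat_h = w_h / n_h                  # sample proportion of home wins
    p_hat_a = w_a / n_a                  # sample proportion of away wins
    pooled = (w_h + w_a) / (n_h + n_a)   # pooled proportion p*
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_h + 1 / n_a))
    return (p_hat_h - p_hat_a) / std_err


# Hypothetical counts, not taken from the match data
z_example = two_proportion_z(w_h=45, n_h=100, w_a=30, n_a=100)
print(f"z = {z_example:.3f}")
```

With these toy counts the pooled proportion is 0.375 and the statistic comes out at roughly 2.19.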

We choose a significance level $\alpha = 0.05$. Because the test is one-tailed, this gives a rejection region of $z \geq 1.645$ (the value 1.96 would apply to a two-tailed test at the same level).
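The critical value for a given significance level is a quantile of the standard normal distribution. As a quick check, the sketch below computes it with `statistics.NormalDist` from the Python standard library, for both a one-tailed and a two-tailed test. The variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed test: all of alpha sits in the upper tail
z_crit_one_tailed = standard_normal.inv_cdf(1 - alpha)

# Two-tailed test: alpha is split evenly between both tails
z_crit_two_tailed = standard_normal.inv_cdf(1 - alpha / 2)

print(f"One-tailed critical value: {z_crit_one_tailed:.3f}")
print(f"Two-tailed critical value: {z_crit_two_tailed:.3f}")
```

The one-tailed value is about 1.645 and the two-tailed value about 1.960, so a one-tailed test at $\alpha = 0.05$ rejects $H_0$ when $z \geq 1.645$.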

Let's now calculate z using the data. We'll first calculate the other variables used in the calculation.

In [5]:
n_h = n_a = len(games)
print(f"Number of games used: {n_h}")

Number of games used: 36086


In [6]:
w_h = (games["FTHG"] > games["FTAG"]).sum()
w_a = (games["FTAG"] > games["FTHG"]).sum()

sample_p_h = w_h / n_h
sample_p_a = w_a / n_a
print(f"Sample proportion of home games won: {sample_p_h}")
print(f"Sample proportion of away games won: {sample_p_a}")

Sample proportion of home games won: 0.45103364185556727
Sample proportion of away games won: 0.2927173973286039


In [7]:
pooled_p = (w_h + w_a) / (n_h + n_a)
print(f"Pooled proportion of games won: {pooled_p}")

Pooled proportion of games won: 0.3718755195920856


In [8]:
from math import sqrt
z = (sample_p_h - sample_p_a) / sqrt(pooled_p * (1 - pooled_p) * (1/n_h + 1/n_a))
print(f"Test statistic z: {z}")

Test statistic z: 44.000559507366745


Our value of z lies well inside the rejection region, so there is enough evidence to reject $H_0$ ($p_h = p_a$). Let's calculate our p-value (the probability, assuming $H_0$ is true, of observing a test statistic at least as large as ours).

In [9]:
from scipy.stats import norm

p_val = norm.sf(z)  # Upper-tail probability P(Z >= z); numerically safer than 1 - norm.cdf(z)
print(p_val)

0.0
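A printed value of `0.0` means the p-value is too small to be represented as a floating-point number, not that it is exactly zero. One way to see its order of magnitude is to work with its logarithm. The sketch below uses only the standard library and the large-$z$ (Mills ratio) expansion $P(Z \geq z) \approx \frac{\varphi(z)}{z}\left(1 - \frac{1}{z^2} + \frac{3}{z^4}\right)$. The helper name `log_upper_tail` is hypothetical, and the approximation is only appropriate for large positive $z$. With SciPy, `norm.logsf(z)` is the library route to the same quantity.

```python
from math import log, pi


def log_upper_tail(z):
    """Approximate log P(Z >= z) for large positive z (Mills ratio expansion)."""
    correction = 1 - 1 / z**2 + 3 / z**4
    # log of the standard normal density at z, minus log(z), plus the correction
    return -0.5 * z * z - 0.5 * log(2 * pi) - log(z) + log(correction)


z_obs = 44.000559507366745  # test statistic printed earlier in the notebook

log_p = log_upper_tail(z_obs)  # natural log of the p-value
log10_p = log_p / log(10)      # same quantity in base 10

print(f"log p   = {log_p:.2f}")
print(f"log10 p = {log10_p:.2f}")
```

Under this approximation the p-value is on the order of $10^{-422}$, far below the smallest positive double-precision number, which is why it prints as `0.0`.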


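If the probabilities were equal, a result this extreme would be nearly impossible to observe (the printed `0.0` is a floating-point underflow, not an exact zero). We therefore reject $H_0$ and conclude that there is a home team advantage.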