# Exploring the Impact of Home Team Advantage in Football

## Description
Analyze whether playing at home significantly increases a football team's chance of winning.

## Table of Contents
1. [Executive Summary](#executive-summary)
2. [Introduction & Background](#introduction--background)
3. [Imports](#imports)
4. [Data & API](#data--api)
5. [Analysis](#analysis)
6. [Conclusion](#conclusion)

## Executive Summary

## Introduction & Background

Home advantage is a long-discussed phenomenon in sports. This analysis investigates whether such an advantage exists in football and, if so, how strong it is across different leagues and teams. It is often a key indicator in football match predictions and betting algorithms. We will use both frequentist and Bayesian methods to answer these questions.

## Imports

In [1]:
# Allows custom modules to be auto reloaded
%load_ext autoreload
%autoreload 2

In [2]:
import soccerdata as sd
import pandas as pd
import numpy as np
from football_analytics.paths import RAW_DATA_DIR

import logging
# Optional: suppresses noisy logs
logging.getLogger().setLevel(logging.ERROR)

## Data & API

To gather the data, we use the soccerdata package, a web scraper that collects football data from a variety of websites. Specifically, we use soccerdata to collect match history from the website [football-data](https://www.football-data.co.uk/data.php). To provide a large amount of data for analysis, we look at the five biggest European leagues over a span of 20 years.

In [11]:
TIME_PERIOD = 20 # How far back are we looking in years

match_history = sd.MatchHistory(
    leagues=[
        'ENG-Premier League', 
        'ESP-La Liga', 
        'FRA-Ligue 1', 
        'GER-Bundesliga', 
        'ITA-Serie A'
    ],
    seasons=range(2025 - TIME_PERIOD, 2025),
    no_cache=False,
    no_store=False,
    data_dir=RAW_DATA_DIR / "home advantage"
)

games = match_history.read_games()

The DataFrame contains matches indexed by league, season, and game. It records the home team, the away team, full-time home goals (FTHG), full-time away goals (FTAG), betting odds, and other match details.
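To make the three-level index concrete, here is a minimal sketch on a toy frame that mirrors the layout described above. The rows are copied from the sample output in this notebook; the name `toy_games` and the choice to store season codes as strings are assumptions made for illustration, not something the scraper is confirmed to do.

```python
import pandas as pd

# Toy frame with the same index levels as `games`: league, season, game.
# Season codes are written as strings here, which is an assumption.
index = pd.MultiIndex.from_tuples(
    [
        ("ENG-Premier League", "2425", "2024-12-22 Everton-Chelsea"),
        ("GER-Bundesliga", "2425", "2025-03-15 Mainz-Freiburg"),
        ("ESP-La Liga", "1617", "2017-05-21 Barcelona-Eibar"),
    ],
    names=["league", "season", "game"],
)
toy_games = pd.DataFrame(
    {
        "home_team": ["Everton", "Mainz", "Barcelona"],
        "away_team": ["Chelsea", "Freiburg", "Eibar"],
        "FTHG": [0.0, 2.0, 4.0],
        "FTAG": [0.0, 2.0, 2.0],
    },
    index=index,
)

# All matches of one league; xs removes the selected level from the index.
premier_league = toy_games.xs("ENG-Premier League", level="league")

# All matches of one season across leagues, keeping the full index.
season_2425 = toy_games[toy_games.index.get_level_values("season") == "2425"]

# Number of matches per league.
matches_per_league = toy_games.groupby(level="league").size()
```

The same calls should carry over to `games` as long as its index levels are named `league`, `season`, and `game`.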

In [12]:
games.sample(5)

| league | season | game | date | home_team | away_team | FTHG | FTAG | FTR | HTHG | HTAG | HTR | referee | ... | 1XBCH | 1XBCD | 1XBCA | BFECH | BFECD | BFECA | BFEC>2.5 | BFEC<2.5 | BFECAHH | BFECAHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ITA-Serie A | 607 | 2007-02-24 Atalanta-Palermo | 2007-02-24 12:00:00 | Atalanta | Palermo | 1.0 | 1.0 | D | 1.0 | 0.0 | H | R. Rosetti | ... |  |  |  |  |  |  |  |  |  |  |
| GER-Bundesliga | 2425 | 2025-03-15 Mainz-Freiburg | 2025-03-15 14:30:00 | Mainz | Freiburg | 2.0 | 2.0 | D | 1.0 | 0.0 | H |  | ... | 1.83 | 3.61 | 4.83 | 1.88 | 3.7 | 5.0 | 2.06 | 1.92 | 1.87 | 2.13 |
| ENG-Premier League | 2425 | 2024-12-22 Everton-Chelsea | 2024-12-22 14:00:00 | Everton | Chelsea | 0.0 | 0.0 | D | 0.0 | 0.0 | D | C Kavanagh | ... | 4.9 | 4.24 | 1.7 | 5.3 | 4.3 | 1.71 | 1.68 | 2.46 | 2.09 | 1.9 |
| ESP-La Liga | 1112 | 2012-02-05 Ath Madrid-Valencia | 2012-02-05 12:00:00 | Ath Madrid | Valencia | 0.0 | 0.0 | D | 0.0 | 0.0 | D |  | ... |  |  |  |  |  |  |  |  |  |  |
| ESP-La Liga | 1617 | 2017-05-21 Barcelona-Eibar | 2017-05-21 12:00:00 | Barcelona | Eibar | 4.0 | 2.0 | H | 0.0 | 1.0 | A |  | ... |  |  |  |  |  |  |  |  |  |  |


We only care about whether a team played at home or away and whether it won. We can therefore drop the irrelevant columns and keep only the match date, the home and away teams, and the goals each side scored.
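As a rough illustration of why the two goal columns are enough, the sketch below derives a match outcome from `FTHG` and `FTAG` on a few toy rows taken from the sample outputs in this notebook. The column name `result` and the variable `toy_games` are hypothetical; the H/D/A coding follows the `FTR` column that the raw data already carries.

```python
import numpy as np
import pandas as pd

# Toy rows taken from the sample outputs in this notebook.
toy_games = pd.DataFrame(
    {
        "home_team": ["Barcelona", "Everton", "Napoli"],
        "away_team": ["Eibar", "Chelsea", "Atalanta"],
        "FTHG": [4.0, 0.0, 0.0],
        "FTAG": [2.0, 0.0, 3.0],
    }
)

# Label each match: "H" = home win, "D" = draw, "A" = away win.
toy_games["result"] = np.select(
    [
        toy_games["FTHG"] > toy_games["FTAG"],
        toy_games["FTHG"] == toy_games["FTAG"],
    ],
    ["H", "D"],
    default="A",
)

# Share of matches ending in each outcome.
outcome_share = toy_games["result"].value_counts(normalize=True)
```

On the full data, the share of `"H"` would be one simple first look at home advantage before any formal testing.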

In [13]:
games = games[["date", "home_team", "away_team", "FTHG", "FTAG"]]
games.sample(5)

| league | season | game | date | home_team | away_team | FTHG | FTAG |
|---|---|---|---|---|---|---|---|
| FRA-Ligue 1 | 708 | 2007-12-15 Paris SG-Toulouse | 2007-12-15 12:00:00 | Paris SG | Toulouse | 1.0 | 2.0 |
| GER-Bundesliga | 506 | 2006-05-13 Hamburg-Werder Bremen | 2006-05-13 12:00:00 | Hamburg | Werder Bremen | 1.0 | 2.0 |
| ITA-Serie A | 2021 | 2021-04-24 Parma-Crotone | 2021-04-24 17:00:00 | Parma | Crotone | 3.0 | 4.0 |
| ITA-Serie A | 2425 | 2024-11-03 Napoli-Atalanta | 2024-11-03 11:30:00 | Napoli | Atalanta | 0.0 | 3.0 |
| ENG-Premier League | 2021 | 2020-12-17 Sheffield United-Man United | 2020-12-17 20:00:00 | Sheffield United | Man United | 2.0 | 3.0 |


Next, we check whether the data needs cleaning by looking for missing values and duplicate rows.

In [14]:
games.isnull().any() # True for each column that contains at least one missing value

date         False
home_team    False
away_team    False
FTHG         False
FTAG         False
dtype: bool
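Had any column come back `True`, a natural follow-up would be to count the gaps and inspect the affected rows. The sketch below shows one way to do that on a toy frame; the frame `toy` and the missing `FTHG` value are made up for illustration and do not reflect the real data, which has no missing values in the selected columns.

```python
import numpy as np
import pandas as pd

# Toy frame; the NaN in FTHG is inserted purely for illustration.
toy = pd.DataFrame(
    {
        "home_team": ["Mainz", "Everton", "Barcelona"],
        "FTHG": [np.nan, 0.0, 4.0],
        "referee": [None, "C Kavanagh", None],
    }
)

has_missing = toy.isnull().any()      # one boolean flag per column
missing_counts = toy.isnull().sum()   # number of missing cells per column

# Rows with at least one missing cell.
incomplete_rows = toy[toy.isnull().any(axis=1)]
```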

In [15]:
duplicates = games[games.duplicated(keep=False)]
duplicates.sort_values(by=['date'])

| league | season | game | date | home_team | away_team | FTHG | FTAG |
|---|---|---|---|---|---|---|---|
| FRA-Ligue 1 | 2021 | 2020-08-21 Bordeaux-Nantes | 2020-08-21 18:00:00 | Bordeaux | Nantes | 0.0 | 0.0 |
| FRA-Ligue 1 | 2021 | 2020-08-21 Bordeaux-Nantes | 2020-08-21 18:00:00 | Bordeaux | Nantes | 0.0 | 0.0 |
| FRA-Ligue 1 | 2021 | 2020-08-22 Dijon-Angers | 2020-08-22 16:00:00 | Dijon | Angers | 0.0 | 1.0 |
| FRA-Ligue 1 | 2021 | 2020-08-22 Dijon-Angers | 2020-08-22 16:00:00 | Dijon | Angers | 0.0 | 1.0 |
| FRA-Ligue 1 | 2021 | 2020-08-22 Lille-Rennes | 2020-08-22 20:00:00 | Lille | Rennes | 1.0 | 1.0 |
| FRA-Ligue 1 | 2021 | ... | ... | ... | ... | ... | ... |
| FRA-Ligue 1 | 2021 | 2021-05-23 Rennes-Nimes | 2021-05-23 20:00:00 | Rennes | Nimes | 2.0 | 0.0 |
| FRA-Ligue 1 | 2021 | 2021-05-23 St Etienne-Dijon | 2021-05-23 20:00:00 | St Etienne | Dijon | 0.0 | 1.0 |
| FRA-Ligue 1 | 2021 | 2021-05-23 St Etienne-Dijon | 2021-05-23 20:00:00 | St Etienne | Dijon | 0.0 | 1.0 |
| FRA-Ligue 1 | 2021 | 2021-05-23 Lyon-Nice | 2021-05-23 20:00:00 | Lyon | Nice | 2.0 | 3.0 |


In [16]:
prev = len(games)
games = games.drop_duplicates()
print(f"Removed {prev - len(games)} duplicate values")

Removed 1826 duplicate values
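The two duplicate-related calls used above behave slightly differently, which the following toy example tries to make visible. The rows are modelled on the duplicated Ligue 1 matches shown earlier; the frame `toy` and the variable names are hypothetical.

```python
import pandas as pd

# Two identical Bordeaux-Nantes rows plus one unique match.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-08-21 18:00", "2020-08-21 18:00", "2020-08-22 16:00"]
        ),
        "home_team": ["Bordeaux", "Bordeaux", "Dijon"],
        "away_team": ["Nantes", "Nantes", "Angers"],
        "FTHG": [0.0, 0.0, 0.0],
        "FTAG": [0.0, 0.0, 1.0],
    }
)

# keep=False flags every copy, which is useful for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# The default keep="first" flags only the later copies.
extra_copies = toy[toy.duplicated()]

# drop_duplicates keeps the first copy of each row.
deduplicated = toy.drop_duplicates()
removed = len(toy) - len(deduplicated)
```

Note that `duplicated` and `drop_duplicates` compare column values only and ignore the index, so two rows count as duplicates here whenever their date, teams, and goals all match.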
