# The Super Sub: Impact of Attacking Substitutes in the Premier League (2014–2023)

## Introduction
This project investigates the impact of "Super Subs": attacking players brought on as substitutes who score crucial goals late in matches. Specifically, we test whether players subbed on while their team is tied or losing score equalizing or winning goals more often than their time on the pitch would predict.

## Hypothesis
**H₀ (Null):** Attacking substitutes are no more likely than starters to score game-tying or winning goals, after adjusting for playing time and match context.  
**H₁ (Alternative):** Attacking substitutes (Super Subs) are more likely than starters to score game-tying or winning goals, relative to their time on the pitch and the state of the match.
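
One way to make "adjusting for playing time" concrete is to compare goals per minute between the two groups. The sketch below is an assumption about how such a test could look, not the project's chosen method: it treats goals in each group as Poisson counts and uses the exact conditional binomial test for two rates. The function name `rate_ratio_p_value` and the counts are hypothetical, and the test ignores match context entirely.

```python
from math import comb

def rate_ratio_p_value(sub_goals, sub_minutes, starter_goals, starter_minutes):
    """One-sided exact p-value that substitutes score at a higher per-minute rate.

    Conditions on the total number of goals: under H0 (equal rates), each goal
    falls to the substitute group with probability equal to that group's share
    of the total minutes played.
    """
    total_goals = sub_goals + starter_goals
    share = sub_minutes / (sub_minutes + starter_minutes)
    # P(X >= sub_goals) for X ~ Binomial(total_goals, share)
    tail = sum(
        comb(total_goals, k) * share**k * (1 - share) ** (total_goals - k)
        for k in range(sub_goals, total_goals + 1)
    )
    return min(1.0, tail)

# Made-up illustrative counts, not results from this project
p = rate_ratio_p_value(sub_goals=12, sub_minutes=900,
                       starter_goals=40, starter_minutes=9000)
print(f"one-sided p-value: {p:.4f}")
```

A small p-value here would only say that the per-minute rates differ. Controlling for match state would need a stratified or regression-based comparison.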

## Scope
- **League:** English Premier League  
- **Time Period:** 2014–2023 (10 seasons, 2014/15 through 2023/24)  
- **Focus:** Substitutes entering after the 60th minute in tied or losing game states
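
The focus criterion above can be written as a simple filter. This is a minimal sketch on a hypothetical substitution log: the column names (`minute_on`, `team_goals_at_sub`, `opp_goals_at_sub`, `position`) and the choice to count only forwards (`"F"`) as attacking players are assumptions for illustration, not the schema of any dataset used here.

```python
import pandas as pd

# Hypothetical substitution log: one row per substitute appearance.
# Column names and values are made up for illustration.
subs = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "position": ["F", "F", "D", "F", "F"],
    "minute_on": [58, 72, 80, 61, 75],
    "team_goals_at_sub": [0, 1, 0, 2, 0],
    "opp_goals_at_sub": [1, 1, 2, 1, 2],
})

def select_candidates(subs, min_minute=60, attacking=("F",)):
    """Keep attacking substitutes who entered after `min_minute` while tied or losing."""
    late = subs["minute_on"] > min_minute
    not_winning = subs["team_goals_at_sub"] <= subs["opp_goals_at_sub"]
    is_attacker = subs["position"].isin(attacking)
    return subs[late & not_winning & is_attacker].reset_index(drop=True)

candidates = select_candidates(subs)
print(candidates)
```

Whether the cutoff is strict (`> 60`) or inclusive, and whether attacking midfielders count, are design choices worth fixing before any analysis.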

## Constraints
- Publicly available data sources may limit event granularity (e.g., exact match state at time of substitution or goal).
- Player position classification may not always be consistent across datasets.
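
When a source lists goal minutes but not the score at each substitution, the match state can be rebuilt from the goal timeline. The sketch below assumes goals are available as `(minute, side)` tuples. The helper names `score_at` and `game_state` are hypothetical, and counting only goals scored strictly before the given minute is one possible convention for same-minute events.

```python
def score_at(goals, minute):
    """Return (home, away) goals scored strictly before `minute`.

    `goals` is a list of (minute, side) tuples with side "h" or "a".
    """
    home = sum(1 for m, side in goals if m < minute and side == "h")
    away = sum(1 for m, side in goals if m < minute and side == "a")
    return home, away

def game_state(goals, minute, side):
    """Classify the match state for `side` ("h" or "a") just before `minute`."""
    home, away = score_at(goals, minute)
    diff = home - away if side == "h" else away - home
    if diff > 0:
        return "winning"
    return "tied" if diff == 0 else "losing"

# Illustrative goal timeline (made up): home scores at 12', away at 55' and 78'
goals = [(12, "h"), (55, "a"), (78, "a")]
print(game_state(goals, 70, "h"))
print(game_state(goals, 85, "h"))
```

Minute-level data cannot order a goal and a substitution that share the same minute, which is one form of the granularity limit noted above.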

## Biases
- Stronger teams tend to have deeper benches, possibly skewing substitute impact.
- Tactical shifts or injuries may affect the context of substitutions.
- Late-game chaos and small sample size may distort scoring likelihoods.
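
To illustrate the small-sample concern, the sketch below computes a Wilson score interval for a scoring proportion. It is offered as one common way to show the uncertainty, not as the method this project uses, and the counts are made up.

```python
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials == 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Made-up counts: the same 10% scoring rate at two sample sizes
print(wilson_interval(3, 30))
print(wilson_interval(300, 3000))
```

The interval for 30 appearances is much wider than the one for 3,000, so an apparent Super Sub effect in a small subgroup may be noise.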

In [10]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
from tqdm import tqdm

In [None]:
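def get_season_data(season):
    """Return one row per EPL match for the season starting in `season`."""
    url = f'https://understat.com/league/EPL/{season}'
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(res.content, 'lxml')
    # The second <script> tag is expected to embed the match list as a quoted JSON string
    script = soup.find_all('script')[1].string
    json_data = script[script.index("('") + 2 : script.index("')")]
    # The string is backslash-escaped, so unescape it before parsing
    json_data = json_data.encode('utf8').decode('unicode_escape')
    matches = json.loads(json_data)

    return pd.json_normalize(matches)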

# Collect matches for the seasons starting in 2014 through 2023 (range end is exclusive)
all_matches = []

for year in tqdm(range(2014, 2024)):
    df_season = get_season_data(year)
    df_season['season'] = f'{year}-{year+1}'
    all_matches.append(df_season)

df_all = pd.concat(all_matches, ignore_index=True)
df_all.to_csv('epl_matches_2014_2023.csv', index=False)

print("Head:")
print(df_all.head())

100%|██████████| 10/10 [00:11<00:00,  1.19s/it]

Head:
     id  isResult             datetime h.id               h.title  \
0  4749      True  2014-08-16 12:45:00   89     Manchester United   
1  4750      True  2014-08-16 15:00:00   75             Leicester   
2  4751      True  2014-08-16 15:00:00  202   Queens Park Rangers   
3  4752      True  2014-08-16 15:00:00   85                 Stoke   
4  4753      True  2014-08-16 15:00:00   76  West Bromwich Albion   

  h.short_title a.id      a.title a.short_title goals.h goals.a      xG.h  \
0           MUN   84      Swansea           SWA       1       2   1.16635   
1           LEI   72      Everton           EVE       2       2    1.2783   
2           QPR   91         Hull           HUL       0       1   1.90067   
3           STO   71  Aston Villa           AVL       0       1  0.423368   
4           WBA   77   Sunderland           SUN       2       2   1.68343   

       xG.a forecast.w forecast.d forecast.l     season  
0  0.278076     0.6519     0.2802     0.0679  2014-2015  



