# Football Card Predictor Model
## Overview

- This project builds a statistical model to predict the number of cards in Premier League matches (home, away, and total).
- It combines team behavioural tendencies and referee tendencies, and models match card counts using a Poisson framework.
- The model first estimates expected cards from the team matchup, then scales that expectation to the referee’s historical card level.
- The long-term goal of the project is to develop a fully testable, walk-forward evaluated football card model using only information available before kick-off (no data leakage).


## Objectives 

- Build weighted team profiles from historical Premier League data.
- Build weighted referee profiles from historical Premier League data.
- Generate expected card counts for any fixture.
- Convert expected values into probabilities using the Poisson distribution.
- Combine team and referee models in a statistically coherent way.
- Lay the foundation for match-level back-testing and evaluation.


## Data Used

- This model uses 3 data sources:
1. Team data - Season-level averages of yellow and red cards for each team, split by:
    - Home vs away
    - Cards for vs cards against
    
2. Referee data - Season-level referee averages:
    - Yellow cards per game
    - Red cards per game
    - Appearances per season

3. Match data (future stage for v2 of the model)

    - Match-level Premier League data will be used for:

        - Walk-forward testing
        - Model evaluation
        - Calibration
        - Automation

## Modelling
### Team Model
- Each team is assigned a weighted profile using exponential decay weighting, meaning recent seasons matter more than older ones.
- For every team, the model estimates:
    - Home cards for
    - Home cards against
    - Away cards for
    - Away cards against
- Yellow and red cards are first modelled separately, then combined.
- Expected cards in a fixture are calculated from both teams’ tendencies.

### Referee Model

- Each referee is also assigned a weighted profile using exponential decay.
- For every referee, the model estimates:
    - Average yellow cards per game
    - Average red cards per game
    - Total historical appearances
- A minimum appearance threshold is applied to avoid unreliable estimates.

### Model Combination

- The team model produces a matchup-specific expected total.
- The referee model produces a general match environment expected total.
- The team expectations are scaled to the referee’s historical average.

This enforces the principle that:

- Referees control the overall card level.
- Teams influence distribution within that level.

## Steps
- Business/Project Understanding
    - See Overview and Objectives sections above.

- Data Understanding, Cleaning, and Preparation
    - Referee Data
        - Obtained referee season-level data from whoscored.com (Premier League, historical seasons).
        - Data originally stored as one CSV per season. Combined into a single Excel workbook with one sheet per season.
        - Initial cleaning performed using Power Query and Excel:
            - Removed irrelevant columns.
            - Corrected formatting issues.
        - Removed four erroneous referee entries where averages were unrealistically high (20–27 yellows per game). These referees only had two appearances each and were clear data errors.
        - Cleaned the referee name field and standardised it into a unique referee_id to handle initials and formatting inconsistencies.

    - Team Data
        - The model requires card data split by:
            - home vs away
            - cards for vs cards against
        - Data obtained from thestatsdontlie.com (Premier League team statistics).
        - Historical team data was only available as images and was converted into CSV format using AI-assisted extraction, then manually checked.
        - One CSV per season combined into a single Excel workbook with separate sheets.
        - Final dataset provides season-level averages of:
            - home yellows / reds for and against
            - away yellows / reds for and against
        
- Data Modelling
    - Team Modelling
        - Loaded all team sheets into pandas and concatenated into a single dataframe.
        - Created standardised team_id keys to handle naming inconsistencies.
        - Added a season column derived from sheet names.
        - Implemented exponential decay weighting to prioritise recent seasons while retaining historical information. Originally this was a dictionary of hard-coded weights.
        - Added season weights to the dataframe.
        - Built a weighted mean function that multiplies a team's values (up to 5 rows, one per season) by the corresponding season weights and divides by the sum of those weights.
        - Grouped by team_id and calculated weighted averages for:
            - home cards for
            - home cards against
            - away cards for
            - away cards against
        - Initially modelled yellows and reds separately, then combined them into total card measures (8 columns -> 4 columns).
        - Implemented a team lookup function to dynamically retrieve any team’s weighted profile.
        - Constructed a function that outputs expected home, away, and total cards for any fixture based purely on team tendencies.

    - Referee Modelling
        - Loaded all referee sheets and concatenated into a single dataframe.
        - Standardised referee names into a unique referee_id to handle initials and formatting differences.
        - Added season labels and dropped unnecessary columns.
        - Applied the same exponential decay weighting approach used for teams.
        - Implemented:
            - Weighted referee averages (yellow, red, total cards per game).
            - Appearance counter across seasons.
        - Applied a minimum appearance threshold to flag unreliable referee estimates.
        - Built a referee lookup function returning a referee’s weighted card profile.

    - Combining the Models
        - The team model estimates match-specific distribution of cards.
        - The referee model estimates the baseline match environment.
        - The two are combined by scaling team-based expected values to the referee’s historical average:
            - adjusted team λ = team λ × (ref λ / team total λ)
        - Referees control overall match strictness
        - Teams influence who receives cards

    - Probabilistic Layer (Poisson Model)
        - Implemented a Poisson PMF to calculate the probability of seeing exactly k cards.
        - Built functions to compute:
            - Full card count distributions.
            - Over/under probabilities (e.g., over 3.5, 4.5, etc.).
        - This converts point estimates into usable probability ranges.

- Evaluation (planned)
    - This version focuses on model construction only.
    - Formal evaluation will be introduced in v2 using match-level data:
        - Rolling training windows.
        - Walk-forward testing.
        - Prediction tracking.
        - Calibration and stability analysis.
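
The walk-forward idea planned for v2 can be sketched as below. This is only an illustration under stated assumptions, not the project's implementation: `walk_forward_splits`, `min_train` and `step` are hypothetical names, matches are assumed to be sorted by kick-off time, and the training window here expands rather than rolls.

```python
def walk_forward_splits(n_matches, min_train, step=1):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Assumes matches are already sorted by kick-off time, so index i is
    always played before index i + 1.
    """
    start = min_train
    while start < n_matches:
        stop = min(start + step, n_matches)
        train_idx = list(range(0, start))    # everything played before the test block
        test_idx = list(range(start, stop))  # the next `step` matches
        yield train_idx, test_idx
        start = stop


for train_idx, test_idx in walk_forward_splits(n_matches=10, min_train=6, step=2):
    print(f"train on 0..{train_idx[-1]}, test on {test_idx}")
```

Every test match comes strictly after every training match, which is what keeps post-kick-off information out of the fitted profiles. A rolling window would additionally drop the oldest training indices at each step.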

## Future Improvements - These will be addressed in v2 of this project
- Find a source for match-level data. This will allow me to:
    - Backtest the model on a per-match basis using walk-forward methodology.
    - Calibrate model performance through parameter tuning (weights, alpha, scaling balance).
- Contextual features (to be addressed last)
    - Explore derby/rivalry effects.
- Improved coverage for refs and teams: replace hard gating with a fall-back to league averages (currently only applicable to refs, but will affect teams in v2).
- Automation:
    - Weekly data ingestion.
    - Automated prediction generation.
    - Automated result comparison and logging.
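
As a rough illustration of the fall-back idea above (replacing the hard appearance gate), a sketch might look like the following. The function name `ref_lambda_with_fallback` and the league average of 3.8 are assumptions made for the example; the league figure is a placeholder, not a value computed from the data.

```python
def ref_lambda_with_fallback(ref_avg, ref_apps, league_avg, min_apps=5):
    """Return the referee's own card average, or the league average when
    the referee has too few appearances to be trusted."""
    if ref_avg is None or ref_apps < min_apps:
        return league_avg
    return ref_avg


# Placeholder league average (made up for this example)
league_avg = 3.8

print(ref_lambda_with_fallback(3.5, 2, league_avg))         # too few appearances -> 3.8
print(ref_lambda_with_fallback(3.674833, 127, league_avg))  # enough data -> 3.674833
```

With this shape, a fixture with a little-used referee would still get a prediction instead of `None`.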

## Imports and Data Loading

In [6]:
import pandas as pd
import numpy as np
from aliases import TEAM_ALIAS, REF_ALIAS

## Team Modelling

In [7]:
# View one of the sheets
df_test = pd.read_excel('data/data_used_in_v1/team_data.xlsx', sheet_name='2025-2026')

In [8]:
df_test

Unnamed: 0,Team,Home_Team_Avg_Y_For,Away_Team_Avg_Y_For,Home_Team_Avg_Y_Against,Away_Team_Avg_Y_Against,Home_Team_Avg_R_For,Away_Team_Avg_R_For,Home_Team_Avg_R_Against,Away_Team_Avg_R_Against
0,Arsenal,0.5,2.14,1.67,1.43,0.0,0.0,0.0,0.14
1,Aston Villa,1.43,1.67,2.71,2.5,0.14,0.0,0.0,0.17
2,Bournemouth,2.17,2.86,2.17,2.14,0.0,0.14,0.17,0.0
3,Brentford,2.0,1.83,2.14,2.17,0.0,0.0,0.14,0.0
4,Brighton,1.67,3.29,1.83,2.29,0.0,0.0,0.0,0.14
5,Burnley,1.5,1.86,1.5,1.14,0.17,0.0,0.0,0.0
6,Chelsea,1.57,2.83,3.0,2.17,0.29,0.33,0.0,0.17
7,Crystal Palace,1.43,2.5,2.86,1.83,0.0,0.0,0.0,0.0
8,Everton,2.29,1.5,2.14,1.17,0.0,0.17,0.0,0.0
9,Fulham,1.67,2.29,1.67,1.57,0.0,0.0,0.17,0.0


### Prep Steps
- Load the Team Data 
- Add season column to differentiate between seasons
- Rename columns 
- Concat them into one data frame


In [9]:
file_path = 'data/data_used_in_v1/team_data.xlsx'  # Workbook containing one worksheet of team data per season

# Read all sheet names
sheets = pd.ExcelFile(file_path).sheet_names # Open the excel workbook, and list the sheets within it

all_seasons = []

for sheet in sheets:   # Loop through the sheets
    df = pd.read_excel(file_path, sheet_name=sheet)
    
    # Create standardised team_id column
    df['team_id'] = df['Team'].str.strip().str.lower()
    df['team_id'] = df['team_id'].replace(TEAM_ALIAS)

    # Add the season column (sheet name)
    df['season'] = sheet

    # Rename columns for consistency
    df = df.rename(columns={
        'Home_Team_Avg_Y_For': 'hy_for',
        'Away_Team_Avg_Y_For': 'ay_for',
        'Home_Team_Avg_Y_Against': 'hy_against',
        'Away_Team_Avg_Y_Against': 'ay_against',
        'Home_Team_Avg_R_For': 'hr_for',
        'Away_Team_Avg_R_For': 'ar_for',
        'Home_Team_Avg_R_Against': 'hr_against',
        'Away_Team_Avg_R_Against': 'ar_against',
        'Team': 'team'
    })

    all_seasons.append(df)

# Combine all data
team_data = pd.concat(all_seasons, ignore_index=True)


In [10]:
print(team_data.to_string())


              team  hy_for  ay_for  hy_against  ay_against  hr_for  ar_for  hr_against  ar_against            team_id     season
0          Arsenal    0.50    2.14        1.67        1.43    0.00    0.00        0.00        0.14            arsenal  2025-2026
1      Aston Villa    1.43    1.67        2.71        2.50    0.14    0.00        0.00        0.17        aston villa  2025-2026
2      Bournemouth    2.17    2.86        2.17        2.14    0.00    0.14        0.17        0.00        bournemouth  2025-2026
3        Brentford    2.00    1.83        2.14        2.17    0.00    0.00        0.14        0.00          brentford  2025-2026
4         Brighton    1.67    3.29        1.83        2.29    0.00    0.00        0.00        0.14           brighton  2025-2026
5          Burnley    1.50    1.86        1.50        1.14    0.17    0.00        0.00        0.00            burnley  2025-2026
6          Chelsea    1.57    2.83        3.00        2.17    0.29    0.33        0.00        0.1

### Weighting

- This is an exponential weighting scheme.
- Originally used manually set weights in a dictionary.
- Emphasises recency very strongly; older seasons still contribute a little.

Changed to using a formula so the weights can easily be adjusted (through alpha) at a later date.

- Each season matters less the further back it is, and α controls how fast that importance fades.
- weight = exp(-α × seasons_ago)

In [11]:
seasons = ['2025-2026','2024-2025','2023-2024','2022-2023','2021-2022']
seasons_ago = [0,1,2,3,4]

alpha = 0.55 # α controls how fast old seasons lose importance; a larger alpha means older seasons die off faster

raw_weights = np.exp(-alpha * np.array(seasons_ago)) # compute raw decay weights using formula

weights = raw_weights / raw_weights.sum() # normalise the weights (ensure they add to one)

weights = dict(zip(seasons, weights)) # zip seasons and weights into a dictionary

print(weights)


{'2025-2026': np.float64(0.4519418665369904), '2024-2025': np.float64(0.26074777420151984), '2023-2024': np.float64(0.15043837888270084), '2022-2023': np.float64(0.08679539417032205), '2021-2022': np.float64(0.05007658620846691)}
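
To get a feel for what α does, the sketch below recomputes the normalised weights for a few values of α using only the standard library. `decay_weights` is a hypothetical helper introduced for this comparison, and 0.25 and 1.0 are arbitrary alternative values; only 0.55 comes from the notebook.

```python
import math


def decay_weights(alpha, n_seasons=5):
    """Normalised exp(-alpha * seasons_ago) weights, most recent season first."""
    raw = [math.exp(-alpha * k) for k in range(n_seasons)]
    total = sum(raw)
    return [w / total for w in raw]


for alpha in (0.25, 0.55, 1.0):
    rounded = [round(w, 3) for w in decay_weights(alpha)]
    print(f"alpha={alpha}: {rounded}")
```

The row for α = 0.55 should reproduce the weights printed above; smaller values spread the weight more evenly across seasons, while larger values concentrate it on the current season.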


In [12]:
# Make weights column and map the weights to the seasons
team_data['weight'] = team_data['season'].map(weights)

In [13]:
team_data

Unnamed: 0,team,hy_for,ay_for,hy_against,ay_against,hr_for,ar_for,hr_against,ar_against,team_id,season,weight
0,Arsenal,0.50,2.14,1.67,1.43,0.00,0.00,0.00,0.14,arsenal,2025-2026,0.451942
1,Aston Villa,1.43,1.67,2.71,2.50,0.14,0.00,0.00,0.17,aston villa,2025-2026,0.451942
2,Bournemouth,2.17,2.86,2.17,2.14,0.00,0.14,0.17,0.00,bournemouth,2025-2026,0.451942
3,Brentford,2.00,1.83,2.14,2.17,0.00,0.00,0.14,0.00,brentford,2025-2026,0.451942
4,Brighton,1.67,3.29,1.83,2.29,0.00,0.00,0.00,0.14,brighton,2025-2026,0.451942
...,...,...,...,...,...,...,...,...,...,...,...,...
95,Southampton,1.74,1.68,1.74,1.58,0.05,0.05,0.05,0.00,southampton,2021-2022,0.050077
96,Tottenham,1.79,1.84,2.26,2.05,0.00,0.05,0.16,0.16,tottenham,2021-2022,0.050077
97,Watford,1.84,1.47,1.53,1.84,0.05,0.11,0.05,0.00,watford,2021-2022,0.050077
98,West Ham,1.47,1.21,1.42,1.11,0.00,0.16,0.11,0.05,west ham,2021-2022,0.050077


In [14]:
team_data['season'].unique()

array(['2025-2026', '2024-2025', '2023-2024', '2022-2023', '2021-2022'],
      dtype=object)

In [15]:
def wmean(series, weights):
    '''
    Compute weighted mean for a pandas series.
    Takes in the series (the 5 rows for each card category per team, one for each season), and the predefined weights.
    '''
    return (series * weights).sum() / weights.sum()
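
Because `wmean` divides by the sum of the weights it is given, a team with fewer than five seasons (for example a promoted side) still gets a properly normalised average. The sketch below illustrates this with made-up card values, and also shows a case worth watching: if a value is missing (NaN) while its weight is present, pandas skips it in the numerator but not in the denominator. `wmean_skipna` is a hypothetical variant, not part of the notebook.

```python
import pandas as pd


def wmean(series, weights):
    return (series * weights).sum() / weights.sum()


def wmean_skipna(series, weights):
    """Hypothetical variant: ignore rows whose value is missing."""
    mask = series.notna()
    return (series[mask] * weights[mask]).sum() / weights[mask].sum()


# Made-up team with data for only the two most recent seasons
values = pd.Series([2.0, 1.0])
w = pd.Series([0.451942, 0.260748])  # season weights from the output above
print(wmean(values, w))              # weights are renormalised over the two seasons

# Made-up case with a missing value and simple weights
gappy = pd.Series([2.0, float('nan')])
gw = pd.Series([0.6, 0.4])
print(wmean(gappy, gw), wmean_skipna(gappy, gw))
```

In the second case the plain version is pulled towards zero by the unused weight, while the NaN-aware version returns the only observed value.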

In [16]:
'''
Group the data into teams (this is how to get the series to pass into the wmean func).
For each team subset(x), run the following:
Create a dictionary where the keys are the column names, and the values are calculated using the wmean function.
Wrap the dictionary in pd.Series to turn the dictionary into a series/single row. Dict/group -> one output row.
This will return a pandas series that contains the weighted means for that team.
Pandas does this for every team subset(x), stacks the series together, and builds a new dataframe (team_data_weighted).
'''
team_data_weighted = team_data.groupby('team_id').apply(
    lambda x: pd.Series({
            'hy_for': wmean(x['hy_for'], x['weight']),
            'ay_for': wmean(x['ay_for'], x['weight']),
            'hy_against': wmean(x['hy_against'], x['weight']),
            'ay_against': wmean(x['ay_against'], x['weight']),
            'hr_for': wmean(x['hr_for'], x['weight']),
            'ar_for': wmean(x['ar_for'], x['weight']),
            'hr_against': wmean(x['hr_against'], x['weight']),
            'ar_against': wmean(x['ar_against'], x['weight']),
        }),
    include_groups=False
).reset_index()

In [17]:
team_data_weighted

Unnamed: 0,team_id,hy_for,ay_for,hy_against,ay_against,hr_for,ar_for,hr_against,ar_against
0,arsenal,1.115842,1.940159,1.988203,1.745103,0.038708,0.070291,0.0289,0.091954
1,aston villa,1.733967,2.04659,2.714639,2.38419,0.076302,0.066619,0.027403,0.123485
2,bournemouth,2.088534,2.690412,2.308184,1.991358,0.047615,0.08825,0.124799,0.047615
3,brentford,1.844201,1.847575,2.099457,1.983729,0.014366,0.023063,0.115346,0.035094
4,brighton,1.731415,2.719297,2.228898,2.305478,0.019052,0.051745,0.020559,0.153961
5,burnley,1.503812,2.032168,1.656815,1.368276,0.129284,0.082226,0.025363,0.011529
6,chelsea,2.034165,2.726071,2.645917,2.283689,0.164989,0.190778,0.028572,0.120781
7,crystal palace,1.793306,2.219331,2.613259,1.842223,0.071826,0.014366,0.055079,0.040734
8,everton,2.136833,1.876,1.79246,1.325484,0.046042,0.091196,0.040544,0.007522
9,fulham,1.749779,2.376121,1.677458,1.640804,0.035714,0.031145,0.104656,0.044814


### Combining Yellow and Red

In [18]:
team_data_weighted['home_cards_for']     = team_data_weighted['hy_for']      + team_data_weighted['hr_for']
team_data_weighted['home_cards_against'] = team_data_weighted['hy_against']  + team_data_weighted['hr_against']
team_data_weighted['away_cards_for']     = team_data_weighted['ay_for']      + team_data_weighted['ar_for']
team_data_weighted['away_cards_against'] = team_data_weighted['ay_against']  + team_data_weighted['ar_against']

### Predicting Expected Cards in a Match

#### Lookup function

In [19]:
# Function that takes in a team_id and returns a series of data for that team
# Even though there is only 1 row in the output, pandas views it as a dataframe; iloc selects the first row, making it a series
def get_team_profile(team_id):
    return team_data_weighted[team_data_weighted['team_id'] == team_id].iloc[0]
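
If the `team_id` is not in the table, the filter returns an empty dataframe and `.iloc[0]` raises an `IndexError` that does not say which team was missing. A more defensive lookup could look like the sketch below; `get_team_profile_safe` and the small demo table are hypothetical, and the card values in it are illustrative only.

```python
import pandas as pd

# Small stand-in for team_data_weighted (illustrative values only)
team_data_weighted_demo = pd.DataFrame({
    'team_id': ['arsenal', 'aston villa'],
    'home_cards_for': [1.15, 1.81],
})


def get_team_profile_safe(team_id, table=team_data_weighted_demo):
    """Return the team's row as a Series, or raise a readable KeyError."""
    key = team_id.strip().lower()  # same normalisation used when building team_id
    match = table[table['team_id'] == key]
    if match.empty:
        known = ', '.join(sorted(table['team_id']))
        raise KeyError(f"Unknown team_id '{key}'. Known ids: {known}")
    return match.iloc[0]


print(get_team_profile_safe(' Arsenal ')['home_cards_for'])
```

The same pattern would apply to the referee lookup.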

In [20]:
example = team_data_weighted[team_data_weighted['team_id'] == 'watford']

In [21]:
example

Unnamed: 0,team_id,hy_for,ay_for,hy_against,ay_against,hr_for,ar_for,hr_against,ar_against,home_cards_for,home_cards_against,away_cards_for,away_cards_against
24,watford,1.84,1.47,1.53,1.84,0.05,0.11,0.05,0.0,1.89,1.58,1.58,1.84


In [22]:
print(get_team_profile('west ham'))

team_id               west ham
hy_for                1.704373
ay_for                1.927495
hy_against            1.547978
ay_against            1.471048
hr_for                0.151622
ar_for                0.053243
hr_against            0.005508
ar_against            0.032089
home_cards_for        1.855996
home_cards_against    1.553486
away_cards_for        1.980738
away_cards_against    1.503137
Name: 25, dtype: object


In [23]:
'''
Function that takes in the home and away teams.
Calls the get_team_profile func to obtain the teams' weighted profiles.
Expected yellow cards for the home team depend on:
1. How often the home team gets yellow cards at home (H['hy_for']),
2. How often the away team's opponents receive yellow cards when the away team plays away (A['ay_against'])
Expected yellow cards for the away team depend on:
1. How often the away team gets yellow cards when away (A['ay_for']),
2. How often the home team's opponents receive yellow cards when the home team plays at home (H['hy_against'])
Each pair of inputs is averaged, and reds are handled in the same way.
Function returns a dictionary containing the expected yellow, red, and combined cards for each team.
'''

def expected_cards(home, away):
    H = get_team_profile(home)
    A = get_team_profile(away)
    
    # YELLOWS
    home_y = (H['hy_for'] + A['ay_against']) / 2
    away_y = (A['ay_for'] + H['hy_against']) / 2
    
    # REDS
    home_r = (H['hr_for'] + A['ar_against']) / 2
    away_r = (A['ar_for'] + H['hr_against']) / 2
    
    # TOTALS
    home_total = home_y + home_r
    away_total = away_y + away_r
    
    return {
        'home_team': home,
        'away_team': away,
        'home_yellows_exp': float(home_y),
        'home_reds_exp': float(home_r),
        'away_yellows_exp': float(away_y),
        'away_reds_exp': float(away_r),
        'home_cards_exp': float(home_total),
        'away_cards_exp': float(away_total),
        'total_cards_exp': float(home_total + away_total)
    }

In [24]:
expected_cards('aston villa','arsenal')

{'home_team': 'aston villa',
 'away_team': 'arsenal',
 'home_yellows_exp': 1.7395351253510971,
 'home_reds_exp': 0.08412816060979544,
 'away_yellows_exp': 2.3273990068875525,
 'away_reds_exp': 0.0488470559964797,
 'home_cards_exp': 1.8236632859608926,
 'away_cards_exp': 2.3762460628840323,
 'total_cards_exp': 4.199909348844924}
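
The result above can be checked by hand from the weighted profiles. The sketch below repeats the arithmetic using the rounded values shown in the `team_data_weighted` output, so it agrees with the function output only to about five decimal places.

```python
# Weighted profile values copied (rounded) from the team_data_weighted output above
villa = {'hy_for': 1.733967, 'hy_against': 2.714639,
         'hr_for': 0.076302, 'hr_against': 0.027403}
arsenal = {'ay_for': 1.940159, 'ay_against': 1.745103,
           'ar_for': 0.070291, 'ar_against': 0.091954}

# Home side: its own home record averaged with what Arsenal's opponents receive
home_y = (villa['hy_for'] + arsenal['ay_against']) / 2
home_r = (villa['hr_for'] + arsenal['ar_against']) / 2

# Away side: its own away record averaged with what Villa's visitors receive
away_y = (arsenal['ay_for'] + villa['hy_against']) / 2
away_r = (arsenal['ar_for'] + villa['hr_against']) / 2

total = home_y + home_r + away_y + away_r
print(round(home_y, 4), round(away_y, 4), round(total, 4))  # 1.7395 2.3274 4.1999
```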

## Ref Modelling

In [25]:
# View one of the sheets
df_test_2 = pd.read_excel('data/data_used_in_v1/ref_data_cleaned.xlsx', sheet_name ='2025-2026')
df_test_2

Unnamed: 0,Referee,Apps,Yel pg,Yel,Red pg,Red,Fouls pg,Fouls/Tackles,Pen pg,Column10,Column11,Column12,Column13
0,Michael Oliver,11,2.36,26,0.09,1,21.91,0.64,0.09,,,,
1,Anthony Taylor,11,3.45,38,0.09,1,19.36,0.54,0.18,,,,
2,Chris Kavanagh,11,3.91,43,0.09,1,23.18,0.71,0.45,,,,
3,Peter Bankes,10,4.2,42,0.2,2,24.6,0.74,0.4,,,,
4,Stuart Attwell,9,4.89,44,0.11,1,21.78,0.66,0.33,,,,
5,Craig Pawson,9,1.78,16,0.11,1,21.11,0.63,0.22,,,,
6,Simon Hooper,8,4.75,38,0.25,2,25.13,0.68,0.13,,,,
7,Darren England,8,3.75,30,0.13,1,21.13,0.71,0.13,,,,
8,Robert Jones,8,3.25,26,0.0,0,21.13,0.63,0.25,,,,
9,Jarred Gillett,8,3.63,29,0.13,1,20.5,0.54,0.38,,,,


### Prep Steps
- Load the Ref Data 
- Drop unwanted columns
- Add season column to differentiate between seasons
- Rename columns 
- Concat them into one data frame

In [26]:
file_path = 'data/data_used_in_v1/ref_data_cleaned.xlsx'  # Workbook containing one worksheet of cleaned ref data per season

# Read all sheet names
sheets = pd.ExcelFile(file_path).sheet_names # Open the excel workbook, and list the sheets within it

all_seasons = []

for sheet in sheets:   # Loop through the sheets
    df = pd.read_excel(file_path, sheet_name=sheet)

    # Clean referee column
    df['Referee'] = df['Referee'].str.strip().str.lower()

    # Create standardised referee_id column
    df['referee_id'] = (
        df['Referee'].str.split().str[0].str[0] # Splits the string into first/last name, grabs the first name, then grabs the first letter
        + '_'
        + df['Referee'].str.split().str[-1] # Splits the string into first/last name, grabs the last name
    )
    
    df['referee_id'] = df['referee_id'].replace(REF_ALIAS)

    # Drop unwanted columns by selecting the columns we want
    # (.copy() avoids a possible SettingWithCopyWarning when the season column is added below)
    df = df[['Referee','Apps', 'Yel pg', 'Red pg', 'referee_id']].copy()

    # Add the season column (sheet name)
    df['season'] = sheet

    # Rename columns for consistency
    df = df.rename(columns={
        'Apps': 'appearances',
        'Yel pg': 'avg_yellow_pg',
        'Red pg': 'avg_red_pg'  
    })

    all_seasons.append(df)

# Combine all data
ref_data = pd.concat(all_seasons, ignore_index=True)

In [27]:
# Test previous code has worked
print(ref_data.to_string())

               Referee  appearances  avg_yellow_pg  avg_red_pg    referee_id     season
0       michael oliver           11           2.36        0.09      m_oliver  2025-2026
1       anthony taylor           11           3.45        0.09      a_taylor  2025-2026
2       chris kavanagh           11           3.91        0.09    c_kavanagh  2025-2026
3         peter bankes           10           4.20        0.20      p_bankes  2025-2026
4       stuart attwell            9           4.89        0.11     s_attwell  2025-2026
5         craig pawson            9           1.78        0.11      c_pawson  2025-2026
6         simon hooper            8           4.75        0.25      s_hooper  2025-2026
7       darren england            8           3.75        0.13     d_england  2025-2026
8         robert jones            8           3.25        0.00       r_jones  2025-2026
9       jarred gillett            8           3.63        0.13     j_gillett  2025-2026
10      samuel barrott          

### Weighting


- See the first Weighting section (under Team Modelling).

In [28]:
# Same as for teams at the moment - may change in the future

seasons = ['2025-2026','2024-2025','2023-2024','2022-2023','2021-2022']
seasons_ago = [0,1,2,3,4]

alpha = 0.55 # α controls how fast old seasons lose importance; a larger alpha means older seasons die off faster

raw_weights = np.exp(-alpha * np.array(seasons_ago)) # Compute raw decay weights using formula

weights_ref = raw_weights / raw_weights.sum() # Normalise the weights (ensure they add to one)

weights_ref = dict(zip(seasons, weights_ref)) # zip seasons and weights into a dictionary

print(weights_ref)

{'2025-2026': np.float64(0.4519418665369904), '2024-2025': np.float64(0.26074777420151984), '2023-2024': np.float64(0.15043837888270084), '2022-2023': np.float64(0.08679539417032205), '2021-2022': np.float64(0.05007658620846691)}


In [29]:
# Make weights column and map the weights to the seasons
ref_data['weight'] = ref_data['season'].map(weights_ref)

In [30]:
ref_data

Unnamed: 0,Referee,appearances,avg_yellow_pg,avg_red_pg,referee_id,season,weight
0,michael oliver,11,2.36,0.09,m_oliver,2025-2026,0.451942
1,anthony taylor,11,3.45,0.09,a_taylor,2025-2026,0.451942
2,chris kavanagh,11,3.91,0.09,c_kavanagh,2025-2026,0.451942
3,peter bankes,10,4.20,0.20,p_bankes,2025-2026,0.451942
4,stuart attwell,9,4.89,0.11,s_attwell,2025-2026,0.451942
...,...,...,...,...,...,...,...
107,robert jones,12,2.75,0.00,r_jones,2021-2022,0.050077
108,jarred gillett,9,3.44,0.11,j_gillett,2021-2022,0.050077
109,john brooks,4,5.25,0.00,j_brooks,2021-2022,0.050077
110,michael salisbury,3,4.00,0.00,m_salisbury,2021-2022,0.050077


In [31]:
def appearance_counter(series):
    '''
    Counts a ref's total appearances over the 5 seasons.
    Included so that a minimum threshold can be set, i.e. refs need at least 5 appearances for their data to be viable.
    '''
    return series.sum()

In [32]:
def wmean(series, weights):
    '''
    Compute weighted mean for a pandas series.
    Takes in the series (the rows for each card category per ref, one for each season), and the predefined weights.
    Identical to the wmean function defined in the team section; redefined here for now and to be consolidated later.
    '''
    return (series * weights).sum() / weights.sum()

In [33]:
'''
Group the data into refs (this is how to get the series to pass into the wmean func).
For each ref subset(x), run the following:
Create a dictionary where the keys are the column names, and the values are calculated using the wmean / appearance_counter functions.
Wrap the dictionary in pd.Series to turn the dictionary into a series/single row. Dict/group -> one output row.
This will return a pandas series that contains the weighted means/appearances for that ref.
Pandas does this for every ref subset(x), stacks the series together, and builds a new dataframe (ref_data_weighted).
'''

ref_data_weighted = ref_data.groupby('referee_id').apply(
    lambda x: pd.Series({
            'avg_yellow_pg': wmean(x['avg_yellow_pg'], x['weight']),
            'avg_red_pg': wmean(x['avg_red_pg'], x['weight']),
            'total_appearances': appearance_counter(x['appearances'])
        }),
    include_groups=False
).reset_index()


### Combining Yellow and Red

In [34]:
ref_data_weighted['avg_card_pg'] = ref_data_weighted['avg_yellow_pg'] + ref_data_weighted['avg_red_pg']
ref_data_weighted

Unnamed: 0,referee_id,avg_yellow_pg,avg_red_pg,total_appearances,avg_card_pg
0,a_kitchen,3.5,0.0,2.0,3.5
1,a_madley,3.37297,0.04437,89.0,3.41734
2,a_marriner,3.382678,0.113414,32.0,3.496092
3,a_taylor,3.541753,0.13308,127.0,3.674833
4,c_kavanagh,3.974217,0.131425,87.0,4.105642
5,c_pawson,3.129993,0.118634,97.0,3.248627
6,d_bond,4.084186,0.201065,23.0,4.285251
7,d_coote,4.885855,0.083463,63.0,4.969318
8,d_england,3.997593,0.13599,73.0,4.133583
9,g_scott,2.70382,0.113635,24.0,2.817455


In [35]:
ref_data_weighted[ref_data_weighted['referee_id'] == 'p_tierney'].iloc[0]

referee_id           p_tierney
avg_yellow_pg         3.655132
avg_red_pg            0.117096
total_appearances         82.0
avg_card_pg           3.772228
Name: 22, dtype: object

#### Lookup Function

In [36]:
# Function that takes in a referee_id and returns a series of data for that ref
# Even though there is only 1 row in the output, pandas views it as a dataframe; iloc selects the first row, making it a series
def get_ref_profile(referee_id):
    return ref_data_weighted[ref_data_weighted['referee_id'] == referee_id].iloc[0]

In [37]:
get_ref_profile('c_pawson')

referee_id           c_pawson
avg_yellow_pg        3.129993
avg_red_pg           0.118634
total_appearances        97.0
avg_card_pg          3.248627
Name: 5, dtype: object

## Combining the Models

I scale the team lambda to the ref lambda because the referee's historical average is treated as more reliable and more stable than the matchup-specific team total.

The reasoning follows a common heuristic when combining models: the more general model (ref) sets the overall level, and the more specific model (team matchup) adjusts within it.

Ref λ comes from:
- 20–25 matches per season officiated by the same person, who controls the match card level

This means the data is expected to be:
- Stable
- Consistent
- Predictive
- Low variance

Team λ comes from:
- Many opponents
- Different match states
- Style interactions
- Tactical context

This means the data is:
- Higher variance
- Opponent dependent
- Less stable

In [38]:
def expected_cards_with_ref(home, away, referee_id):
    
    # Pull team data using function
    exp_team_cards = expected_cards(home,away)

    home_team_lam = exp_team_cards['home_cards_exp']
    away_team_lam = exp_team_cards['away_cards_exp']
    team_total_lam = exp_team_cards['total_cards_exp']

    # Pull Ref data using function
    exp_ref_cards = get_ref_profile(referee_id)

    ref_total_lam = exp_ref_cards['avg_card_pg']
    ref_total_appearances = exp_ref_cards['total_appearances']
    
    # Run ref appearance check
    if ref_total_appearances <5:
        print(f'Ref only has {ref_total_appearances} appearances - not enough data.')
        return None

    # Set scaling
    scale = ref_total_lam / team_total_lam

    # Apply scaling
    home_adj = home_team_lam * scale
    away_adj = away_team_lam * scale
    
    return {
        
        # Set up
        'home_team': home,
        'away_team': away,
        'ref': referee_id,
        
        # Team only lambdas
        'home_lambda_team': float(home_team_lam),
        'away_lambda_team': float(away_team_lam),
        'team_total_lambda': float(team_total_lam),
        
        # Ref target + scale
        'ref_total_lambda': float(ref_total_lam),
        'scale_factor': float(scale),
        
        # Adjusted lambdas
        'home_lambda_adj': float(home_adj),
        'away_lambda_adj': float(away_adj),
        'total_lambda_adj': float(home_adj + away_adj)
    }
    
    
    # TODO: possibly add functionality for scaling yellows and reds individually at some point.


In [39]:
expected_cards_with_ref("arsenal", 'man city', 'r_jones')

{'home_team': 'arsenal',
 'away_team': 'man city',
 'ref': 'r_jones',
 'home_lambda_team': 1.5129943229048397,
 'away_lambda_team': 1.9242637708080454,
 'team_total_lambda': 3.4372580937128854,
 'ref_total_lambda': 3.7985110104452207,
 'scale_factor': 1.1050991537100765,
 'home_lambda_adj': 1.6720087458102886,
 'away_lambda_adj': 2.1265022646349316,
 'total_lambda_adj': 3.79851101044522}
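
Two properties of this scaling are worth checking explicitly: the adjusted total always equals the referee's average, and each team's share of the cards is left unchanged. The sketch below verifies both using the numbers from the output above.

```python
# Values copied from the expected_cards_with_ref output above
home_team_lam = 1.5129943229048397
away_team_lam = 1.9242637708080454
ref_total_lam = 3.7985110104452207

team_total_lam = home_team_lam + away_team_lam
scale = ref_total_lam / team_total_lam

home_adj = home_team_lam * scale
away_adj = away_team_lam * scale

# 1. The adjusted total matches the referee level
print(round(home_adj + away_adj, 6), round(ref_total_lam, 6))

# 2. The home share of cards is the same before and after scaling
share_before = home_team_lam / team_total_lam
share_after = home_adj / (home_adj + away_adj)
print(round(share_before, 6), round(share_after, 6))
```

A consequence of this design is that the predicted match total depends only on the referee; the matchup affects only how that total is split between the two teams.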

### Poisson PMF (Probability Mass Function)

- At this stage the model outputs expected card counts (λ).
- To convert these into usable predictions, we model card counts as Poisson-distributed random variables.

- The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time.

Poisson answers the question:

“If the expected (average) number of events in a match is λ (lambda),
what is the probability of seeing exactly k events?”

In our case:

λ = expected total cards in the match (e.g. 3.91)

k = 0, 1, 2, 3, 4, 5, … cards

---

Why does Poisson fit football cards?

- Cards are countable events.
- They occur independently (mostly).
- There’s a natural match duration (90 min).
- Variance resembles mean.

Formula:

P(k events | mean = λ) = e^(-λ) × λ^k / k!

Example:

- λ = 3.9 (expected match cards)
- k = 3 (chance of exactly 3 cards)
- P(3) = e^(-3.9) × 3.9^3 / 3! ≈ 0.200
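
The "variance resembles mean" point is an assumption that can be checked once match-level data is available. A minimal sketch of such a check is shown below; the list of card totals is made up purely for illustration and is not real match data.

```python
from statistics import mean, pvariance

# Hypothetical total-card counts for ten matches (made-up numbers, not real data)
cards = [3, 5, 4, 2, 6, 4, 3, 5, 1, 7]

m = mean(cards)
v = pvariance(cards)

# Dispersion index: close to 1 is consistent with Poisson,
# well above 1 suggests overdispersion (more spread than Poisson allows)
dispersion = v / m
print(f"mean={m:.2f}, variance={v:.2f}, dispersion={dispersion:.2f}")
```

If real data showed a dispersion index well above 1, the Poisson over/under probabilities would understate the tails.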

In [40]:
import math

def poisson_pmf(k, lam):
    '''
    Probability of seeing exactly k events (cards)
    when the expected value (lambda) is lam.
    This takes in k (the number of cards we want the probability for) and lam (the expected number of cards in the match).
    It is just using the formula shown above.
    '''
    return math.exp(-lam) * (lam ** k) / math.factorial(k)


In [41]:
# Example of what poisson_pmf func does

exp = expected_cards('aston villa','arsenal')

lam_home = exp['home_cards_exp']
lam_away = exp['away_cards_exp']
lam_total = exp['total_cards_exp']


for k in range(0, 8):
    P_home_k  = poisson_pmf(k, lam_home)
    P_away_k  = poisson_pmf(k, lam_away)
    P_total_k = poisson_pmf(k, lam_total)

    print(
        f"k={k} | "
        f"P_home:{P_home_k:.3f}  " # .3f formats output as float to 3 decimal places
        f"P_away:{P_away_k:.3f}  "
        f"P_total:{P_total_k:.3f}"
        
    )
print('lam_total:', lam_total)


k=0 | P_home:0.161  P_away:0.093  P_total:0.015
k=1 | P_home:0.294  P_away:0.221  P_total:0.063
k=2 | P_home:0.268  P_away:0.262  P_total:0.132
k=3 | P_home:0.163  P_away:0.208  P_total:0.185
k=4 | P_home:0.074  P_away:0.123  P_total:0.194
k=5 | P_home:0.027  P_away:0.059  P_total:0.163
k=6 | P_home:0.008  P_away:0.023  P_total:0.114
k=7 | P_home:0.002  P_away:0.008  P_total:0.069
lam_total: 4.199909348844924


Function that lets us go from Poisson PMF → Over/Under probabilities

In [42]:
def prob_over_x_point_5(lam, line):
    '''
    Probability that the card count is over the line (e.g. line = 3.5, 4.5).
    Over 3.5 means 4 or more cards, so the line (e.g. 3.5) is converted to the largest count that stays under it (3).
    The probabilities of every count from 0 up to that threshold are added together, then subtracted from 1 to get the probability of the complement.
    '''
    threshold = int(line - 0.5)   # 3.5 -> 3, 4.5 -> 4
    prob_leq_threshold = 0.0 # set count of running total to 0 - this will store the added probabilities, 
    # when completed the final value will be probability that the total number of cards is less than or equal to the threshold.
    for k in range(0, threshold + 1):
        prob_leq_threshold += poisson_pmf(k, lam)
    return 1 - prob_leq_threshold
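
The matching "under" probability is simply the cumulative sum that the function above subtracts from 1. The sketch below makes that explicit; `poisson_cdf` and `prob_under_x_point_5` are hypothetical helpers, and λ = 4.2 is used as a rounded stand-in for the Aston Villa vs Arsenal total.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def prob_under_x_point_5(lam, line):
    """P(cards < line) for a half-integer line, e.g. 3.5 -> P(cards <= 3)."""
    return poisson_cdf(math.floor(line), lam)


lam_total = 4.2  # rounded stand-in for the match total used elsewhere in this notebook
under = prob_under_x_point_5(lam_total, 3.5)
over = 1 - under
print(f"over 3.5: {over:.3f}, under 3.5: {under:.3f}")
```

For half-integer lines, `math.floor(line)` gives the same threshold as `int(line - 0.5)`, and the over and under probabilities always sum to 1 because a whole number of cards can never land exactly on the line.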


In [43]:
# Function that uses prob_over_x_point_5 func to output over-line probabilities (lines 0.5 to 9.5) for each team and the total
# Note: uses the team-only lambdas from expected_cards (no referee adjustment)

def over_lines_summary(home, away, lines=(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5)):
    exp = expected_cards(home, away)
    lam_home  = exp["home_cards_exp"]
    lam_away  = exp["away_cards_exp"]
    lam_total = exp['total_cards_exp']

    print(f"{home} vs {away}")
    print(f"λ_home = {lam_home:.3f}, λ_away = {lam_away:.3f}, λ_total = {lam_total:.3f}")
    print()
    print("Line | P(home > line)  P(away > line)  P(total > line)")
    print("-----+---------------+---------------+----------------")
    
    for line in lines:
        p_home  = prob_over_x_point_5(lam_home,  line)
        p_away  = prob_over_x_point_5(lam_away,  line)
        p_total = prob_over_x_point_5(lam_total, line)
        
        print(f"{line:.1f} |     {p_home:.3f}           {p_away:.3f}          {p_total:.3f}")
        
    


In [44]:
over_lines_summary('aston villa','arsenal')

aston villa vs arsenal
λ_home = 1.824, λ_away = 2.376, λ_total = 4.200

Line | P(home > line)  P(away > line)  P(total > line)
-----+---------------+---------------+----------------
0.5 |     0.839           0.907          0.985
1.5 |     0.544           0.686          0.922
2.5 |     0.276           0.424          0.790
3.5 |     0.113           0.216          0.605
4.5 |     0.038           0.093          0.410
5.5 |     0.011           0.034          0.247
6.5 |     0.003           0.011          0.133
7.5 |     0.001           0.003          0.064
8.5 |     0.000           0.001          0.028
9.5 |     0.000           0.000          0.011


## Match-Level Data

Moved to a new file for match-level modelling.