<p>A season in the English Premier League typically runs from August to May, and each team plays 38 games. Each team plays every other team twice, once at home and once away. For a win, a team is awarded 3 points; for a draw, both teams are awarded one point; and, for a loss, a team receives no points. At the end of the season the team with the most points is crowned Premier League Champion.</p>
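<p>As a quick sanity check on the arithmetic above, here is a minimal sketch of the points rule and the fixture counts. The helper name <code>match_points</code> is hypothetical, and the 20-team league size is an assumption that is not stated in the text.</p>

```python
def match_points(goals_for: int, goals_against: int) -> int:
    """Return the league points one team earns from a single match."""
    if goals_for > goals_against:
        return 3  # win
    if goals_for == goals_against:
        return 1  # draw
    return 0  # loss


n_teams = 20  # assumption: a 20-team league
games_per_team = 2 * (n_teams - 1)        # home and away against every other team
total_fixtures = n_teams * (n_teams - 1)  # each ordered (home, away) pairing occurs once
max_points = 3 * games_per_team           # a team that wins every game

print(games_per_team, total_fixtures, max_points)
```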

<p>Journalists, pundits, and fans all like to read deeply into teams' performances and make predictive statements about how certain results will affect the rest of their side's season. How will a team react to a particularly bad loss? Will a late evening game leave them tired for subsequent fixtures? Can they do it on a cold rainy night in Stoke?</p>

<p>For every such question there are at least a dozen answers. After pulling data for the last few years of Premier League results, I attempted to statistically verify several of these kinds of claims.</p>

<p>First, some imports:</p>

In [None]:
import requests
import pandas as pd
import os
import json
import numpy as np
import urllib.request
import colorsys
import io
import datetime

from PIL import Image

<p>The data I used comes from a publicly available API hosted on RapidAPI: api-sports' API-FOOTBALL. I used the fixtures endpoint to retrieve data for an entire season. The data arrives fairly deeply nested, so I wrote a recursive function to unroll any nested dictionaries into individual columns.</p>
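<p>To make the unrolling idea concrete, here is a minimal sketch that flattens a single nested record with plain dictionaries rather than a DataFrame. The helper name <code>flatten_record</code> and the sample values are hypothetical; only the hyphen-joined naming scheme (for example <code>teams-home-name</code>) mirrors the column names used later in this notebook.</p>

```python
def flatten_record(record: dict, prefix: str = '') -> dict:
    """Recursively unroll nested dictionaries into one flat dictionary."""
    flat = {}
    for key, value in record.items():
        name = f'{prefix}-{key}' if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the parent key along as a prefix
            flat.update(flatten_record(value, prefix=name))
        else:
            flat[name] = value
    return flat


# Hypothetical fixture record; names and values are illustrative only
sample_fixture = {
    'fixture': {'id': 1, 'date': '2018-08-10T19:00:00+00:00'},
    'teams': {'home': {'id': 33, 'name': 'Home FC'}, 'away': {'id': 46, 'name': 'Away FC'}},
    'score': {'fulltime': {'home': 2, 'away': 1}},
}

flat_fixture = flatten_record(sample_fixture)
print(sorted(flat_fixture))
```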

<h4>If you want to use your own RapidAPI key, you must sign up for API-FOOTBALL on RapidAPI. Usage of the API may incur charges as outlined on their fee page.</h4>

<p>If you have a RapidAPI key, you can use it to replace the definitions below and get the data from the actual endpoint. Otherwise, the code will fall back to some cached data provided in the repo. Note that the cached data only goes as far back as the 2011-12 season.</p>

In [None]:
def _get_rapid_api_key() -> str:
    with open(os.path.abspath('../api_keys/rapid_api_key.txt')) as file:
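        # Strip surrounding whitespace so a trailing newline in the file doesn't end up in the header
        api_key = file.read().strip()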
    return api_key


def _flatten_response(df: pd.DataFrame) -> pd.DataFrame:
    """
    Takes a DataFrame with dictionary values and "unrolls" them such that each dictionary entry receives its own column.

    Args:
        df (DataFrame): DataFrame to flatten

    Returns:
        DataFrame: Flat DataFrame

    """
    df_types: pd.Series = df.iloc[0].apply(type)
    df_types = df_types.reset_index().rename({'index': 'column_name', 0: 'column_type'}, axis=1)
    dict_columns = df_types[df_types['column_type'] == dict]['column_name']
    if len(dict_columns) > 0:
        for col_name in dict_columns:
            new_cols = df[col_name].apply(pd.Series)
            new_col_names = {new_col_name: f'{col_name}-{new_col_name}' for new_col_name in new_cols.columns}
            new_cols = new_cols.rename(new_col_names, axis=1)
            df = pd.concat([df, new_cols], axis=1)
            df = df.drop(col_name, axis=1)
        df = _flatten_response(df)

    return df

def get_season_data(season: int, league_id: int, use_cache=False) -> pd.DataFrame:
    """
    Gets the historical data for a given season and soccer league

    Args:
        season (int): Season for which to get data
        league_id (int): League for which to get data
        use_cache (bool): Whether to use local cache or not

    Returns:
        DataFrame: Historical data for the league in the given season

    """
    cache_dir = os.path.abspath(f'../local_cache/')
    cache_file_path = f'{cache_dir}/league_{league_id}_season_{season}.csv'

    if use_cache:
        file_already_exists = os.path.exists(cache_file_path)

        if file_already_exists:
            # Only report a cache hit when the cached file is actually used
            print('Retrieving data from local cache')
            return pd.read_csv(cache_file_path)

    print('Retrieving data from API')
    api_key = _get_rapid_api_key()

    url = "https://api-football-v1.p.rapidapi.com/v3/fixtures"

    querystring = {"league": league_id, "season": season}

    headers = {
        'x-rapidapi-host': "api-football-v1.p.rapidapi.com",
        'x-rapidapi-key': api_key
    }

    response = requests.get(url, headers=headers, params=querystring)
    # Fail early on HTTP errors (e.g. a bad API key) instead of on a missing 'response' field
    response.raise_for_status()

    df = pd.DataFrame(response.json()['response'])

    df = _flatten_response(df)

    if use_cache:
        print('Saving data to local cache')
        os.makedirs(cache_dir, exist_ok=True)
        df.to_csv(cache_file_path)

    return df

<p>The cell below calls the functions listed above to get data for the specified years. You can change the years as desired; just note that I haven't defined team color data for teams outside of the 2018 season, so some visualizations might look off if you use other seasons' data.</p>

<h3>I've also found that older EPL data has some missing fields that cause the visualizations to break. Seasons from 2011 onward seem to be fine.</h3>

In [None]:
epl_league_id = 39
fixtures_df = get_season_data(2014, epl_league_id, use_cache=True)
fixtures_df.head()

<p>The DataFrame above has one row per fixture, though ideally we'd want a DataFrame that tracks a team's overall position in the league throughout a season. The functions below reshape and aggregate the fixtures data into more table-centric data, where each row is more akin to an entry in a league table for a particular match week. As a result the generated DataFrame has twice as many rows as the original since both the home team and away team for a match get their own row.</p>
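<p>As a rough illustration of this reshaping, the sketch below turns two made-up fixtures into four per-team rows and adds a running points total. The column names here (<code>home_team</code>, <code>match_pts</code>, and so on) are simplified stand-ins, not the columns the API actually returns.</p>

```python
import pandas as pd

# Two made-up fixtures; team names and scores are illustrative only
fixtures = pd.DataFrame({
    'round': [1, 2],
    'home_team': ['A', 'B'],
    'away_team': ['B', 'A'],
    'home_goals': [2, 1],
    'away_goals': [0, 1],
})

home_win = fixtures['home_goals'] > fixtures['away_goals']
away_win = fixtures['home_goals'] < fixtures['away_goals']
draw = fixtures['home_goals'] == fixtures['away_goals']

# One row per team per fixture: first from the home side, then from the away side
home_rows = pd.DataFrame({
    'round': fixtures['round'],
    'team': fixtures['home_team'],
    'match_pts': home_win * 3 + draw * 1,
    'home': True,
})
away_rows = pd.DataFrame({
    'round': fixtures['round'],
    'team': fixtures['away_team'],
    'match_pts': away_win * 3 + draw * 1,
    'home': False,
})

table = pd.concat([home_rows, away_rows]) \
    .sort_values(by=['round', 'team']) \
    .reset_index(drop=True)

# Running points total per team, as in a league table after each round
table['pts'] = table.groupby('team')['match_pts'].cumsum()
print(table)
```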

<p>During this transformation I also add some columns used for visualization, namely the color used to represent each team and a three-letter code. I hand-picked the colors for the 2018-19 season, originally intending to analyze just that year, but I quickly found that there weren't enough samples for a lot of the cases I was trying to test. Instead of hand-picking more colors for the various promoted and relegated teams, I wrote a function that downloads each team's logo and uses its most common color, determined by pixel count, as the team's color. It works well enough for most teams; just be aware that if a team has a weird color in a graph, that's probably why.</p>
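<p>The color-picking idea can be sketched without downloading any images. The example below counts pixel colors in a synthetic list and skips low-saturation colors, which are likely to be background. The helper name <code>dominant_color</code> and the pixel data are hypothetical; the 0.35 saturation cutoff is the one used in the notebook's function.</p>

```python
import colorsys
from collections import Counter


def dominant_color(pixels, min_saturation=0.35):
    """Return the most frequent sufficiently saturated pixel color as a hex string."""
    for (r, g, b), _count in Counter(pixels).most_common():
        # colorsys expects channel values in the 0-1 range
        _h, s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if s >= min_saturation:
            return '#%02x%02x%02x' % (r, g, b)
    return None  # every color was too washed out


# Synthetic "logo": mostly white background, a red crest, and a little blue trim
logo_pixels = [(255, 255, 255)] * 60 + [(215, 25, 32)] * 30 + [(0, 54, 156)] * 10
print(dominant_color(logo_pixels))
```

<p>White is the most common color here but has zero saturation, so the red is returned instead. Returning <code>None</code> when nothing passes the cutoff is a choice made for this sketch.</p>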


<p>I also define a function to fetch and build multiple seasons' worth of table data in one go. It concatenates all the data into one DataFrame, which puts everything I'll be working with later in one place.</p>

In [None]:
team_colors = {
    42: '#fe0002',
    35: '#d71a21',
    51: '#0054a6',
    44: '#6a003a',
    43: '#093ad6',
    49: '#0a4595',
    52: '#eb302e',
    45: '#00369c',
    36: '#000000',
    37: '#0361af',
    46: '#273e8a',
    40: '#e31b23',
    50: '#6caee0',
    33: '#d81920',
    34: '#383838',
    41: '#d71920',
    47: '#001c58',
    38: '#ffee00',
    48: '#7d2c3b',
    39: '#f9a01b'
}


def get_team_colors_from_api(matches_df: pd.DataFrame, use_cache=False, league_id=None, season=None,
                             rebuild_cache=False) -> pd.DataFrame:
    """
    Gets team colors by finding the most common pixel color in each team's logo

    Requests a 256 x 256 PNG image for each team using the API-Football API

    Args:
        matches_df (DataFrame): DataFrame acquired using the get_season_data function
        use_cache (bool): Optional parameter specifying whether to use cached data if it is available. Cache
            is granular down to the league/season combination. Must specify league_id and season to use the
            cache
        league_id (int): League id for the team colors to retrieve
        season (int): Season for the team colors to retrieve
        rebuild_cache (bool): If true, will recreate data using the API and overwrite any existing cached data with
            the new values

    Returns:
        DataFrame: DataFrame where the index is team ids and the only column is team_color containing
            team color values as hex colors
    """
    cache_dir = os.path.abspath(f'../local_cache/')
    cache_file_path = f'{cache_dir}/league_{league_id}_season_{season}_color_data.csv'

    if use_cache and not rebuild_cache:
        if league_id is None or season is None:
            raise RuntimeError('One of league_id or season is None when attempting to use colors cache!')

        file_already_exists = os.path.exists(cache_file_path)

        if file_already_exists:
            print('Retrieving color data from local cache')
            return pd.read_csv(cache_file_path, index_col='team_id')

    def get_nth_most_common_color(colors: list, n: int):
        """
        Finds and returns the nth most common color in a list of colors, converted to HSV format

        Args:
            colors: List of (count, (r, g, b)) tuples as returned by PIL's Image.getcolors
            n: Rank of the color to get ordered by descending frequency

        Returns:
            Tuple: Tuple containing hue, saturation, value in that order
        """
        colors.sort(key=lambda color: color[0])
        most_common_color = colors[-n][-1]
        # Channels are passed in the 0-255 range, so the returned value component is also on a 0-255 scale;
        # hue and saturation are ratios and are unaffected
        most_common_color_hsv = colorsys.rgb_to_hsv(most_common_color[0], most_common_color[1],
                                                    most_common_color[2])
        return most_common_color_hsv[0], most_common_color_hsv[1], most_common_color_hsv[2]

    def get_colors(df: pd.DataFrame):
        """
        Finds the most common color in a team's logo. Uses HSV color system to filter out low saturation colors
            that are likely to be background pixels. Doesn't always return the best color to represent a team
            but is usually pretty decent. Best used as a fallback option when a hand-picked value isn't available.

        Args:
            df (DataFrame): DataFrame containing at least a teams-home-logo column with URLs pointing to team logo
                image files served by API-FOOTBALL in Rapid API and a teams-home-id containing a team id.

        Returns:
            str: Hex value of the most common sufficiently saturated pixel color found in the team's logo.
        """
        logo_url = df.iloc[0]['teams-home-logo']
        with urllib.request.urlopen(logo_url) as image_url_in:
            image_file = io.BytesIO(image_url_in.read())
        image = Image.open(image_file)
        image: Image

        image = image.convert('RGB')

        image_colors = image.getcolors(image.size[0] * image.size[1])
        image_colors.sort(key=lambda color: color[0])

        current_rank = 1
        h, s, v = get_nth_most_common_color(image_colors, current_rank)
        while s < 0.35:
            current_rank += 1
            h, s, v = get_nth_most_common_color(image_colors, current_rank)

        rgb_color_to_use = colorsys.hsv_to_rgb(h, s, v)
        return '#%02x%02x%02x' % \
               (round(rgb_color_to_use[0]), round(rgb_color_to_use[1]), round(rgb_color_to_use[2]))

    colors_df = matches_df.groupby(by='teams-home-id') \
        .apply(get_colors) \
        .to_frame() \
        .rename({0: 'team_color'}, axis=1)

    colors_df.index.rename('team_id', inplace=True)

    if use_cache or rebuild_cache:
        print('Saving color data to local cache')
        os.makedirs(cache_dir, exist_ok=True)
        colors_df.to_csv(cache_file_path)

    return colors_df


def create_tables_df(matches_df: pd.DataFrame) -> pd.DataFrame:
    matches_df['round'] = matches_df['league-round'].apply(lambda s: int(s.split('Regular Season - ')[1]))

    matches_df['home_win'] = matches_df['score-fulltime-home'] > matches_df['score-fulltime-away']
    matches_df['away_win'] = matches_df['score-fulltime-home'] < matches_df['score-fulltime-away']
    matches_df['draw'] = matches_df['score-fulltime-home'] == matches_df['score-fulltime-away']

    matches_df['home_pts'] = matches_df['home_win'] * 3 + matches_df['draw'] * 1
    matches_df['away_pts'] = matches_df['away_win'] * 3 + matches_df['draw'] * 1

    matches_df['home_gd_half_time'] = matches_df['score-halftime-home'] - matches_df['score-halftime-away']
    matches_df['away_gd_half_time'] = -matches_df['home_gd_half_time']

    matches_df['home_gd_full_time'] = matches_df['score-fulltime-home'] - matches_df['score-fulltime-away']
    matches_df['away_gd_full_time'] = -matches_df['home_gd_full_time']

    relevant_cols = ['round', 'teams-home-id', 'teams-home-name', 'teams-away-id', 'teams-away-name',
                     'score-halftime-home', 'score-halftime-away', 'score-fulltime-home', 'score-fulltime-away',
                     'home_win', 'away_win', 'draw', 'home_pts', 'away_pts', 'home_gd_half_time', 'away_gd_half_time',
                     'home_gd_full_time', 'away_gd_full_time', 'league-id', 'league-season', 'teams-home-logo',
                     'fixture-date']
    matches_df = matches_df[relevant_cols]

    home_match_results_df = matches_df[['round', 'teams-home-id', 'teams-home-name', 'teams-away-id', 'teams-away-name',
                                        'home_win', 'draw', 'home_pts', 'home_gd_half_time', 'home_gd_full_time',
                                        'league-id', 'league-season', 'fixture-date']] \
        .rename({'teams-home-id': 'team_id', 'teams-home-name': 'team_name', 'teams-away-id': 'opp_team_id',
                 'teams-away-name': 'opp_team_name', 'home_win': 'match_win', 'home_pts': 'match_pts',
                 'home_gd_half_time': 'match_gd_half', 'home_gd_full_time': 'match_gd_full',
                 'draw': 'match_draw', 'league-id': 'league_id', 'league-season': 'league_season',
                 'fixture-date': 'fixture_date'}, axis='columns')
    home_match_results_df['home'] = True

    away_match_results_df = matches_df[['round', 'teams-home-id', 'teams-home-name', 'teams-away-id', 'teams-away-name',
                                        'away_win', 'draw', 'away_pts', 'away_gd_half_time', 'away_gd_full_time',
                                        'league-id', 'league-season', 'fixture-date']] \
        .rename({'teams-away-id': 'team_id', 'teams-away-name': 'team_name', 'teams-home-id': 'opp_team_id',
                 'teams-home-name': 'opp_team_name', 'away_win': 'match_win', 'away_pts': 'match_pts',
                 'away_gd_half_time': 'match_gd_half', 'away_gd_full_time': 'match_gd_full',
                 'draw': 'match_draw', 'league-id': 'league_id', 'league-season': 'league_season',
                 'fixture-date': 'fixture_date'}, axis='columns')
    away_match_results_df['home'] = False

    tables_df = pd.concat([home_match_results_df, away_match_results_df], axis='rows') \
        .sort_values(by='round') \
        .reset_index(drop=True)

    cumulative_stats = tables_df[['team_id', 'match_win', 'match_draw', 'match_pts', 'match_gd_half',
                                  'match_gd_full']] \
        .groupby(by='team_id').cumsum() \
        .rename({'match_win': 'wins', 'match_draw': 'draws', 'match_pts': 'pts', 'match_gd_half': 'gd_half',
                 'match_gd_full': 'gd_full'}, axis='columns')

    tables_df = pd.concat([tables_df, cumulative_stats], axis='columns')
    tables_df['losses'] = tables_df['round'] - tables_df['wins'] - tables_df['draws']

    tables_df['fixture_date'] = tables_df['fixture_date'].apply(datetime.datetime.fromisoformat)

    team_colors_from_api = get_team_colors_from_api(matches_df, use_cache=True, league_id=tables_df['league_id'][0],
                                                    season=tables_df['league_season'][0])

    def get_team_color(row: pd.Series, row_team_id_col_name: str) -> str:
        if row[row_team_id_col_name] in team_colors:
            return team_colors[row[row_team_id_col_name]]
        else:
            return team_colors_from_api.loc[row[row_team_id_col_name]]['team_color']

    tables_df['team_color'] = tables_df.apply(get_team_color, row_team_id_col_name='team_id', axis='columns')
    tables_df['opp_team_color'] = tables_df.apply(get_team_color, row_team_id_col_name='opp_team_id', axis='columns')

    return tables_df


def validate_tables_df(tables_df: pd.DataFrame):
    n_rounds = max(tables_df['round'])
    n_teams = tables_df['team_id'].nunique(dropna=True)

    n_rows_expected = n_rounds * n_teams

    if len(tables_df) != n_rows_expected:
        return False

    return True


def create_multi_season_tables_df(start_year: int, end_year: int, league_id: int, throw_on_invalid=True,
                                  use_cache=True):
    """
    Gets all the season data for a range of years and return the results as one big DataFrame

    Args:
        start_year: Start of the range inclusive
        end_year: End of the range inclusive
        league_id: Id for the league
        throw_on_invalid: If true will throw an error if any season data fails validation. If false will omit
            the data and still return
        use_cache: Whether or not to prioritize the local cache when getting data

    Returns:
        DataFrame: Pandas DataFrame with multiple seasons of league table data

    """
    multi_season_table_df = pd.DataFrame()

    for year in range(start_year, end_year + 1):
        current_season_df = create_tables_df(get_season_data(season=year, league_id=league_id, use_cache=use_cache))
        is_season_valid = validate_tables_df(current_season_df)
        if not is_season_valid and throw_on_invalid:
            raise RuntimeError(f'Season {year} league {league_id} failed validation!')
        elif is_season_valid:
            multi_season_table_df = pd.concat([multi_season_table_df, current_season_df])
            
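    def create_final_position_column(season_df: pd.DataFrame):
        # Note: the rank below is ascending, so final_position 1 is the team with the fewest points and the
        # highest final_position is the champion. Ties on points are broken by goal difference via the sort.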
        final_round_df = season_df.loc[season_df['round'] == max(season_df['round'])] \
            .sort_values(by=['pts', 'gd_full'], ascending=True)
        final_round_df['final_position'] = final_round_df['pts'].rank(method='first').astype(int)
        final_round_df = final_round_df.reset_index(drop=True)

        season_df = season_df.reset_index(drop=True)
        season_df = season_df.merge(final_round_df[['team_id', 'final_position']], on='team_id')

        return season_df

    multi_season_table_df = multi_season_table_df.groupby('league_season').apply(create_final_position_column)

    return multi_season_table_df.reset_index(drop=True)

In [None]:
tables_df = create_multi_season_tables_df(2011, 2021, epl_league_id)
tables_df.head()

<p>This data shape is much easier to work with. Using it, I can create a simple visualization of each team's points haul throughout the season with bokeh:</p>

<h3>When running the cells below, please make sure that the port number in the notebook_url variable matches the port number displayed in your web browser's address bar, or bokeh will throw an error.</h3>
<p>In most cases the notebook will be running on port 8888, but if you have multiple notebook instances open then this could change. Also feel free to adjust the plot_size variable to best suit your monitor and browser. Each visualization is intended to be fully in frame without any need for scrolling.</p>

In [None]:
from bokeh.io import output_notebook

notebook_url = 'http://localhost:8888'

plot_size = 400

output_notebook()

In [None]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, LabelSet, VBar, Legend, MultiLine, Slider, DataTable, Dropdown, FactorRange, Range1d
from bokeh.layouts import row, column
from functools import partial

import math

selected_year = max(tables_df['league_season'])
selected_round = min(tables_df.loc[tables_df['league_season'] == selected_year]['round'])


def slider_update(attr, old, new, lines_data_source, bar_data_source):
    update_round_df = tables_df.loc[(tables_df['round'] == new) & 
                                    (tables_df['league_season'] == selected_year)].reset_index(drop=True)
    update_round_df = update_round_df.sort_values(by='final_position')

    bar_data_source.data['x'] = update_round_df['team_name']
    bar_data_source.data['y'] = [0] * len(update_round_df)
    bar_data_source.data['top'] = update_round_df['pts']
    bar_data_source.data['color'] = update_round_df['team_color']
    bar_data_source.data['label'] = update_round_df['team_name']

    update_tables_till_current_round_df = tables_df.loc[(tables_df['league_season'] == selected_year) &
                                                         (tables_df['round'] <= new)]

    new_xs = list()
    new_ys = list()
    new_team_colors = list()
    new_team_name = list()

    for update_team_name in update_round_df['team_name']:
        update_team_df = update_tables_till_current_round_df.loc[
            update_tables_till_current_round_df['team_name'] == update_team_name]
        update_team_color = update_team_df['team_color'].iloc[0]

        new_xs.append(update_team_df['round'])
        new_ys.append(update_team_df['pts'])

        new_team_colors.append(update_team_color)
        new_team_name.append(update_team_name)

    lines_data_source.data['xs'] = new_xs
    lines_data_source.data['ys'] = new_ys
    lines_data_source.data['team_colors'] = new_team_colors
    lines_data_source.data['team_name'] = new_team_name
    

def year_selector_update(event, lines_data_source, bar_data_source, bar_graph_object, line_graph_object, round_slider_object):
    new_year = int(event.item)
    global selected_year
    selected_year = new_year
    
    update_season_df = tables_df.loc[tables_df['league_season'] == new_year].reset_index(drop=True)
    update_round_df = update_season_df.loc[update_season_df['round'] == 1].reset_index(drop=True)
    update_round_df = update_round_df.sort_values(by='final_position')
    
    bar_graph_object.x_range.factors = update_round_df['team_name'].unique().tolist()
    
    bar_graph_object.y_range.end = max(update_season_df['pts']) + 10
    line_graph_object.y_range.end = max(update_season_df['pts']) + 10

    bar_graph_object.title.text = f'{new_year} Season Points Progression'

    bar_data_source.data['x'] = update_round_df['team_name']
    bar_data_source.data['y'] = [0] * len(update_round_df)
    bar_data_source.data['top'] = update_round_df['pts']
    bar_data_source.data['color'] = update_round_df['team_color']
    bar_data_source.data['label'] = update_round_df['team_name']

    update_tables_till_current_round_df = update_season_df.loc[update_season_df['round'] <= 1]

    new_xs = list()
    new_ys = list()
    new_team_colors = list()
    new_team_name = list()

    for update_team_name in update_round_df['team_name']:
        update_team_df = update_tables_till_current_round_df.loc[
            update_tables_till_current_round_df['team_name'] == update_team_name]
        update_team_color = update_team_df['team_color'].iloc[0]

        new_xs.append(update_team_df['round'])
        new_ys.append(update_team_df['pts'])

        new_team_colors.append(update_team_color)
        new_team_name.append(update_team_name)

    lines_data_source.data['xs'] = new_xs
    lines_data_source.data['ys'] = new_ys
    lines_data_source.data['team_colors'] = new_team_colors
    lines_data_source.data['team_name'] = new_team_name
    
    round_slider_object.value = 1
    round_slider_object.trigger(new=1, old=None, attr='value')


def create_bokeh_plot_for_round(tables_df: pd.DataFrame, round_num: int, plot_size=400):
    tables_df = tables_df.sort_values(by=['round', 'pts', 'team_id'], ascending=[True, False, True])
    
    season_df = tables_df.loc[tables_df['league_season'] == selected_year]
    round_df = season_df.loc[season_df['round'] == round_num].reset_index(drop=True)
    
    round_df = round_df.sort_values(by='pts', ascending=True)

    line_graph = figure(x_range=(0, max(tables_df['round'])),
                        y_range=(0, max(tables_df['pts']) + 10),
                        x_axis_label='Round',
                        y_axis_label='Points',
                        toolbar_location=None,
                        plot_width=plot_size,
                        plot_height=plot_size)

    line_graph.toolbar.active_drag = None
    line_graph.toolbar.active_scroll = None
    line_graph.toolbar.active_tap = None

    line_graph.y_range.start = 0

    line_graph.xgrid.grid_line_color = None
    
    final_round_df = season_df.loc[season_df['round'] == max(season_df['round'])]

    bar_graph = figure(title=f'{selected_year} Season Points Progression',
                       y_axis_label='Points',
                       x_range=final_round_df.sort_values(by='final_position', ascending=True)['team_name'],
                       y_range=(0, max(tables_df['pts']) + 10),
                       toolbar_location=None,
                       plot_width=plot_size,
                       plot_height=plot_size)

    bar_graph.toolbar.active_drag = None
    bar_graph.toolbar.active_scroll = None
    bar_graph.toolbar.active_tap = None

    bar_graph.y_range.start = 0

    bar_graph.xgrid.grid_line_color = None

    bar_graph.xaxis.major_label_orientation = (5. * math.pi) / 12.

    bar_data_source = ColumnDataSource(
        dict(
            x=round_df['team_name'],
            y=[0] * len(round_df),
            top=round_df['pts'],
            color=round_df['team_color'],
            label=round_df['team_name']
        )
    )

    # Restrict the line data to the selected season; filtering the full multi-season frame would mix
    # rows from every season into each team's line
    tables_till_current_round_df = season_df.loc[season_df['round'] <= round_num]

    xs = list()
    ys = list()
    team_colors = list()
    team_name_list = list()

    # Order teams by final position once, matching the ordering used in the update callbacks
    round_df = round_df.sort_values(by='final_position')

    for team_name in round_df['team_name']:
        team_df = tables_till_current_round_df.loc[tables_till_current_round_df['team_name'] == team_name]
        team_color = team_df['team_color'].iloc[0]

        xs.append(team_df['round'])
        ys.append(team_df['pts'])
        team_colors.append(team_color)
        team_name_list.append(team_name)

    lines_data_source = ColumnDataSource(
        dict(
            xs=xs,
            ys=ys,
            team_colors=team_colors,
            team_name=team_name_list
        )
    )

    bar_graph.vbar(x='x', top='top', color='color', source=bar_data_source)
    line_graph.multi_line(xs='xs', ys='ys', line_color='team_colors', line_width=2, source=lines_data_source)

    round_slider = Slider(start=1, end=max(tables_df['round']), value=1, step=1, title='Round')
    year_labels = [str(year) for year in sorted(tables_df['league_season'].unique().tolist())]
    year_selector_dropdown = Dropdown(label='Year', menu=year_labels, width_policy='min')
    
    slider_update_partial = partial(slider_update, lines_data_source=lines_data_source, bar_data_source=bar_data_source)
    year_selector_update_partial = partial(year_selector_update, 
                                           lines_data_source=lines_data_source, 
                                           bar_data_source=bar_data_source, 
                                           bar_graph_object=bar_graph, 
                                           line_graph_object=line_graph, 
                                           round_slider_object=round_slider)
    
    round_slider.on_change('value', slider_update_partial)
    year_selector_dropdown.on_click(year_selector_update_partial)

    return column(row(bar_graph, line_graph, year_selector_dropdown), round_slider)
    
    
def modify_doc(doc):
    doc.add_root(create_bokeh_plot_for_round(tables_df, 1))
    
show(modify_doc, notebook_url=notebook_url)

<p>With these plots it's fairly easy to see the spread of team performances and how they accumulate points. There are the contenders: the small handful of teams near the top that are more likely than not to earn full points in a given week. Teams in the middle of the table have about even odds of doing the same. And then there are the teams in the relegation zone, who are lucky if they walk away from a match with a win.</p>

<p>I've opted to keep the teams in the bar chart ordered by their final positions in the league table. This helps identify teams that are momentarily overperforming or underperforming, as they'll stand out from adjacent teams. For example, by match week 35 in the 2018 EPL season Watford seemed poised for at least a top-half finish. Advance the slider through the rest of the season and you'll see that they earned zero points in their remaining three games and ended the season in eleventh. Similarly, you can see how Newcastle turned things around in the same year: after being level with Huddersfield in last place in week 10, they pulled things together enough to finish thirteenth. Huddersfield stayed last by a wide margin and were ultimately relegated.</p>

<p>Despite telling a few stories quite plainly, the graphs above hide a lot of nuance in their clutter. To move into the finer details of teams' performances, I'll create a few more columns in the table data. I'll make both a forward- and a backward-looking average points per game (ppg) column, and a pair of columns that flag a particular match as being a "bad loss" or a "big win." I'll define big wins and bad losses as games where the goal difference was four or more.</p>
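<p>The big win and bad loss flags amount to a threshold on the full-time goal difference. A minimal sketch, using a hypothetical helper <code>classify_result</code> and made-up goal differences, with everything that is neither flagged treated as the control group:</p>

```python
def classify_result(goal_difference: int, threshold: int = 4) -> str:
    """Label a match from one team's perspective using its full-time goal difference."""
    if goal_difference >= threshold:
        return 'big_win'
    if goal_difference <= -threshold:
        return 'bad_loss'
    return 'control'


# Made-up goal differences for one team's run of matches
goal_differences = [1, -4, 0, 5, -1, 4]
labels = [classify_result(gd) for gd in goal_differences]
print(labels)
```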

<p>I also create a metric to judge team performance called "{n}_match_ppg_diff." This is the difference between the average points per game earned by a team over their next {n} games and the average from their previous {n} games. This helps control for varying team performance levels: some teams have higher baseline ppgs than others, so we really care more about deviations from the norm. I can also test different values of n to see whether the effects of a big win or a bad loss are more visible with certain window sizes.</p>
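<p>Here is a minimal sketch of the windowing on a plain list of per-match points. The helper name <code>ppg_diff</code> and the sample results are hypothetical. The sign follows the notebook's <code>run_gd_statistical_test_for_period</code> function, which subtracts the forward-looking average from the backward-looking one, so a positive value means the team's points per game dropped after the match in question.</p>

```python
def ppg_diff(match_pts, index, n):
    """
    Backward- and forward-looking points-per-game averages around one match.

    Returns (prev_avg, next_avg, diff), or None when fewer than n matches
    exist on either side of the match at `index`.
    """
    if index < n or index + n >= len(match_pts):
        return None
    prev_avg = sum(match_pts[index - n:index]) / n          # the n matches before
    next_avg = sum(match_pts[index + 1:index + 1 + n]) / n  # the n matches after
    # Sign convention assumed from the notebook code: previous minus next
    return prev_avg, next_avg, prev_avg - next_avg


# Made-up points per match: three wins, then a loss followed by a poor run
sample_pts = [3, 3, 3, 0, 0, 1, 0]
result = ppg_diff(sample_pts, index=3, n=3)
print(result)
```

<p>Matches too close to the start or end of the season have no full window on one side, which is why the notebook drops rows with missing values before testing.</p>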

<p>I'll be running tests on these ppg_diff distributions to see if certain occurrences produce a statistically significant difference. I'll treat these distributions as samples from a super-population that is larger than just the table data. To me, the total population would be the set of results where each team played each game under both "treatment" and control conditions, with all other circumstances held equal. For example, when testing whether a late game impacts team performance, this hypothetical total population contains two results for each game in the league schedule: one where the game was played at night, and one where it was played during the day under otherwise equal conditions.</p>

<p>Such a hypothetical population is impossible to actually realize, but the idea has precedent [1, 2].</p>

<p>
[1] Hartley, H. O., and R. L. Sielken. “A ‘Super-Population Viewpoint’ for Finite Population Sampling.” Biometrics 31, no. 2 (June 1975): 411–22. https://doi.org/10.2307/2529429.

[2] Gelman, Andrew. “Statistical Modeling, Causal Inference, and Social Science.” Statistical Modeling Causal Inference and Social Science. Columbia University, July 3, 2009. https://statmodeling.stat.columbia.edu/2009/07/03/how_does_statis/. 
</p>
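<p>The tests in the cell that follows use SciPy's <code>stats.mannwhitneyu</code>. As rough intuition for what that test measures, here is a minimal brute-force sketch of the U statistic. It counts, over all pairs, how often a value from one sample exceeds a value from the other, with ties counting as half. The helper name <code>mann_whitney_u</code> and the sample values are made up, and the sketch omits the p-value calculation that SciPy performs.</p>

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up ppg_diff values; not taken from the real data
treatment = [1.0, 0.67, 1.33, 0.33]
control = [0.0, -0.33, 0.33, -0.67, 0.0]

u_treatment = mann_whitney_u(treatment, control)
u_control = mann_whitney_u(control, treatment)
print(u_treatment, u_control)
```

<p>The two statistics always sum to the number of pairs. A value near half that total suggests the two samples overlap heavily, while a value near either extreme suggests one sample tends to sit above the other.</p>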

In [None]:
from scipy import stats

def _create_rolling_avg_match_pts(group: pd.DataFrame, period: int):
    """
    Creates two new columns in the group named prev_{period}_match_ppg_avg and next_{period}_match_ppg_avg where
        {period} is the second positional argument provided and represents the size of the rolling average to use.

        The new columns represent the backwards and forwards looking moving averages of the points earned per game

    Args:
        group (DataFrame): team_id group of DataFrame generated by create_tables_df
        period (int): The size of the rolling window

    Returns:
        DataFrame: DataFrame with the two new columns

    """
    group = group.sort_values(by='round')

    group[f'prev_{period}_match_ppg_avg'] = group['match_pts'] \
        .shift(1) \
        .rolling(window=period, min_periods=period) \
        .mean()

    group[f'next_{period}_match_ppg_avg'] = group[::-1]['match_pts'] \
        .shift(1) \
        .rolling(window=period, min_periods=period) \
        .mean()
    return group


def run_gd_statistical_test_for_period(epl_tables_df, period=3, by_team=False, big_win_goal_diff_thresh=4):
    epl_tables_df = epl_tables_df.groupby(by=['league_season', 'team_id']).apply(_create_rolling_avg_match_pts,
                                                                                 period=period)
    epl_tables_df = epl_tables_df.dropna()
    epl_tables_df[f'{period}_match_ppg_diff'] = \
        epl_tables_df[f'prev_{period}_match_ppg_avg'] - epl_tables_df[f'next_{period}_match_ppg_avg']

    epl_tables_df['big_win'] = epl_tables_df['match_gd_full'] >= big_win_goal_diff_thresh
    epl_tables_df['bad_loss'] = epl_tables_df['match_gd_full'] <= -big_win_goal_diff_thresh

    results_df = pd.DataFrame()

    if by_team:
        for team_name in epl_tables_df['team_name'].unique():
            big_win_group = epl_tables_df.loc[epl_tables_df['big_win'] & (epl_tables_df['team_name'] == team_name)]
            bad_loss_group = epl_tables_df.loc[
                epl_tables_df['bad_loss'] & (epl_tables_df['team_name'] == team_name)]
            control_group = epl_tables_df.loc[(~epl_tables_df['big_win'])
                                              & (~epl_tables_df['bad_loss'])
                                              & (epl_tables_df['team_name'] == team_name)]
            
            control_group_ppg_mean = np.mean(control_group[f'{period}_match_ppg_diff'])
            control_group_ppg_std = np.std(control_group[f'{period}_match_ppg_diff'])

            if len(bad_loss_group) != 0:
                _, bad_loss_p_value = stats.mannwhitneyu(bad_loss_group[f'{period}_match_ppg_diff'],
                                                         control_group[f'{period}_match_ppg_diff'])
                
                bad_loss_ppg_mean = np.mean(bad_loss_group[f'{period}_match_ppg_diff'])
                bad_loss_ppg_std = np.std(bad_loss_group[f'{period}_match_ppg_diff'])
            else:
                bad_loss_p_value = np.nan
                bad_loss_ppg_mean = np.nan
                bad_loss_ppg_std = np.nan

            if len(big_win_group) != 0:
                _, big_win_p_value = stats.mannwhitneyu(big_win_group[f'{period}_match_ppg_diff'],
                                                        control_group[f'{period}_match_ppg_diff'])
                
                big_win_ppg_mean = np.mean(big_win_group[f'{period}_match_ppg_diff'])
                big_win_ppg_std = np.std(big_win_group[f'{period}_match_ppg_diff'])
            else:
                big_win_p_value = np.nan
                big_win_ppg_mean = np.nan
                big_win_ppg_std = np.nan

            results_df = pd.concat([results_df, pd.Series({
                'team_name': team_name,
                'period': period,
                'control_group_mean': control_group_ppg_mean,
                'control_group_std': control_group_ppg_std,
                'bad_loss_n': len(bad_loss_group),
                'bad_loss_p_value': bad_loss_p_value,
                'bad_loss_mean': bad_loss_ppg_mean,
                'bad_loss_std': bad_loss_ppg_std,
                'big_win_n': len(big_win_group),
                'big_win_p_value': big_win_p_value,
                'big_win_mean': big_win_ppg_mean,
                'big_win_std': big_win_ppg_std,
            })], axis=1)

        return results_df.transpose()
    else:
        big_win_group = epl_tables_df.loc[epl_tables_df['big_win']]
        bad_loss_group = epl_tables_df.loc[epl_tables_df['bad_loss']]
        control_group = epl_tables_df.loc[(~epl_tables_df['big_win']) & (~epl_tables_df['bad_loss'])]

        _, bad_loss_p_value = stats.mannwhitneyu(bad_loss_group[f'{period}_match_ppg_diff'],
                                                 control_group[f'{period}_match_ppg_diff'])
        _, big_win_p_value = stats.mannwhitneyu(big_win_group[f'{period}_match_ppg_diff'],
                                                control_group[f'{period}_match_ppg_diff'])
        
                    
        control_group_ppg_mean = np.mean(control_group[f'{period}_match_ppg_diff'])
        control_group_ppg_std = np.std(control_group[f'{period}_match_ppg_diff'])
        
        bad_loss_ppg_mean = np.mean(bad_loss_group[f'{period}_match_ppg_diff'])
        bad_loss_ppg_std = np.std(bad_loss_group[f'{period}_match_ppg_diff'])

        big_win_ppg_mean = np.mean(big_win_group[f'{period}_match_ppg_diff'])
        big_win_ppg_std = np.std(big_win_group[f'{period}_match_ppg_diff'])

        results_df = pd.Series({
            'period': period,
            'bad_loss_n': len(bad_loss_group),
            'bad_loss_p_value': bad_loss_p_value,
            'big_win_n': len(big_win_group),
            'big_win_p_value': big_win_p_value,
            'control_group_mean': control_group_ppg_mean,
            'control_group_std': control_group_ppg_std,
            'bad_loss_mean': bad_loss_ppg_mean,
            'bad_loss_std': bad_loss_ppg_std,
            'big_win_mean': big_win_ppg_mean,
            'big_win_std': big_win_ppg_std
        }).to_frame()

        return results_df
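
<p>The comparison at the heart of the function above is a Mann-Whitney U test between a small treatment group and a much larger control group. Below is a minimal sketch on synthetic data (the samples are randomly generated, not taken from the league tables). It passes alternative='two-sided' explicitly, because older SciPy releases used a different default for this argument, so being explicit keeps the p-value comparable across versions.</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Synthetic ppg_diff-style samples: a large control group and a small treatment group
# drawn from the same distribution, so no real effect is present.
control = rng.normal(loc=0.0, scale=1.0, size=300)
treatment = rng.normal(loc=0.0, scale=1.0, size=12)

# Rank-based test: no normality assumption, which suits the discrete-ish ppg_diff values
u_stat, p_value = stats.mannwhitneyu(treatment, control, alternative='two-sided')
print(f'U = {u_stat:.1f}, p = {p_value:.3f}')
```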

<p>I test five different values of n (the rolling-window size, called period in the code): [1, 2, 3, 4, 5]. First I check whether there is a difference in performance when testing the entire league at once, then I perform the test one team at a time. This results in a large total number of tests, so I also find the FDR-corrected p-values using the Benjamini/Hochberg correction.</p>

In [None]:
import statsmodels.stats.multitest

metric_test_range = range(1, 6)

gd_res_entire_league = pd.concat([run_gd_statistical_test_for_period(tables_df, period=period_length, by_team=False)
          for period_length in metric_test_range], axis=1).transpose().reset_index(drop=True)

gd_res_entire_league['adjusted_big_win_p_value'] = statsmodels.stats.multitest.fdrcorrection(gd_res_entire_league['big_win_p_value'], is_sorted=False)[1]
gd_res_entire_league['adjusted_bad_loss_p_value'] = statsmodels.stats.multitest.fdrcorrection(gd_res_entire_league['bad_loss_p_value'], is_sorted=False)[1]

gd_res_entire_league.head()

In [None]:
gd_res_by_team = pd.concat([run_gd_statistical_test_for_period(tables_df, period=period_length, by_team=True)
          for period_length in metric_test_range], axis=0).dropna().reset_index(drop=True)

gd_res_by_team['adjusted_big_win_p_value'] = statsmodels.stats.multitest.fdrcorrection(gd_res_by_team['big_win_p_value'], is_sorted=False)[1]
gd_res_by_team['adjusted_bad_loss_p_value'] = statsmodels.stats.multitest.fdrcorrection(gd_res_by_team['bad_loss_p_value'], is_sorted=False)[1]

gd_res_by_team.head()
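
<p>For readers unfamiliar with the Benjamini/Hochberg procedure, here is a hand-rolled sketch of the adjustment using only NumPy. The function name benjamini_hochberg and the example p-values are made up for illustration; the intent is to mirror what fdrcorrection computes with its default settings, not to replace it.</p>

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity working down from the largest p-value, then cap at 1
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    # Put the adjusted values back in the input order
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted

example_p_values = [0.01, 0.04, 0.03, 0.20, 0.50]
q_values = benjamini_hochberg(example_p_values)
print(q_values)
```

<p>In this example the raw p-value of 0.03 would pass a 0.05 cutoff, but its adjusted value is roughly 0.067, which is the same pattern seen in the team-level results.</p>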

<p>Next are a series of visualizations to show the results, first of the league-level test, then of each individual team. The visualizations display the uncorrected p-values at each tested metric horizon.</p>

In [None]:
from bokeh.palettes import Category10
from bokeh.models import CheckboxGroup

gd_entire_league_line_graph = figure(x_range=(min(metric_test_range), max(metric_test_range)),
                    y_range=(0, 1),
                    x_axis_label='Metric Horizon',
                    y_axis_label='P-Value',
                    toolbar_location=None,
                    plot_width=2 * plot_size,
                    plot_height=plot_size,
                    title='Big Win and Bad Loss Test Results at Various Metric Horizons Entire League')

gd_league_bad_loss_lines_data_source = ColumnDataSource(
    dict(
        xs=[gd_res_entire_league['period'], gd_res_entire_league['period']],
        ys=[gd_res_entire_league['big_win_p_value'], gd_res_entire_league['bad_loss_p_value']],
        line_colors=Category10[3][:2],
        series_labels=['Big Win', 'Bad Loss']
    )
)

gd_league_bad_loss_scatter_data_source = ColumnDataSource(
    dict(
        x=gd_res_entire_league['period'].tolist() + gd_res_entire_league['period'].tolist(),
        y=gd_res_entire_league['big_win_p_value'].tolist() + gd_res_entire_league['bad_loss_p_value'].tolist(),
        scatter_colors=[Category10[3][0]] * len(gd_res_entire_league) + [Category10[3][1]] * len(gd_res_entire_league)
    )
)

gd_entire_league_line_graph.multi_line(xs='xs', ys='ys', line_color='line_colors', legend_field='series_labels', line_width=2, source=gd_league_bad_loss_lines_data_source)
gd_entire_league_line_graph.scatter(x='x', y='y', size=15, fill_color='scatter_colors', line_color='scatter_colors', source=gd_league_bad_loss_scatter_data_source)

def modify_doc_gd_entire_league_p_value(doc):
    doc.add_root(gd_entire_league_line_graph)
    
show(modify_doc_gd_entire_league_p_value, notebook_url=notebook_url)

In [None]:
big_wins_by_team_line_graph = figure(
                    x_range=(min(metric_test_range), max(metric_test_range)),
                    y_range=(0, 1),
                    x_axis_label='Metric Horizon',
                    y_axis_label='P-Value',
                    toolbar_location=None,
                    plot_width=2 * plot_size,
                    plot_height=round(1.5 * plot_size),
                    title='Big Win Test Results at Various Metric Horizons By Team')

team_names = gd_res_by_team['team_name'].unique()

n_teams = len(team_names)

team_dfs = [group[1] for group in gd_res_by_team.groupby(by=['team_name'])]

def update_data_sources_for_activated_teams(activated_teams, lines_data_source, scatter_data_source, metric_to_plot):
    big_wins_xs = list()
    big_wins_ys = list()
    big_wins_line_colors = list()
    big_wins_series_labels = list()

    big_wins_scatter_xs = list()
    big_wins_scatter_ys = list()
    big_wins_scatter_colors = list()

    for df in [df for df in team_dfs if df['team_name'].iloc[0] in activated_teams]:
        team_color = tables_df.loc[tables_df['team_name'] == df['team_name'].iloc[0]]['team_color'].iloc[0]

        big_wins_xs.append(df['period'])
        big_wins_ys.append(df[metric_to_plot])
        big_wins_line_colors.append(team_color)
        big_wins_series_labels.append(df['team_name'].iloc[0])

        big_wins_scatter_xs.extend(df['period'])
        big_wins_scatter_ys.extend(df[metric_to_plot])
        big_wins_scatter_colors.extend([team_color] * len(df))
    
    lines_data_source.data = dict(
        xs=big_wins_xs,
        ys=big_wins_ys,
        line_colors=big_wins_line_colors,
        series_labels=big_wins_series_labels
    )

    scatter_data_source.data = dict(
        x=big_wins_scatter_xs,
        y=big_wins_scatter_ys,
        scatter_colors=big_wins_scatter_colors
    )
    
big_win_by_team_lines_data_source = ColumnDataSource(
    dict(
        xs=list(),
        ys=list(),
        line_colors=list(),
        series_labels=list(),
    )
)

big_win_by_team_scatter_data_source = ColumnDataSource(
    dict(
        x=list(),
        y=list(),
        scatter_colors=list()
    )
)

big_wins_by_team_line_graph.multi_line(xs='xs', ys='ys', line_color='line_colors', legend_field='series_labels', line_width=2, source=big_win_by_team_lines_data_source)
big_wins_by_team_line_graph.scatter(x='x', y='y', size=8, fill_color='scatter_colors', line_color='scatter_colors', source=big_win_by_team_scatter_data_source)

checkbox_labels=sorted(team_names.tolist())

def checkbox_group_on_click(event):
    activated_team_names = [checkbox_labels[i] for i in event]
    update_data_sources_for_activated_teams(activated_team_names, big_win_by_team_lines_data_source, big_win_by_team_scatter_data_source, metric_to_plot='big_win_p_value')

checkbox_group = CheckboxGroup(labels=checkbox_labels)
checkbox_group.on_click(checkbox_group_on_click)

def modify_doc_gd_by_team_p_value(doc):
    doc.add_root(row(big_wins_by_team_line_graph, checkbox_group))
    
show(modify_doc_gd_by_team_p_value, notebook_url=notebook_url)

In [None]:
big_loss_by_team_line_graph = figure(
                    x_range=(min(metric_test_range), max(metric_test_range)),
                    y_range=(0, 1),
                    x_axis_label='Metric Horizon',
                    y_axis_label='P-Value',
                    toolbar_location=None,
                    plot_width=2 * plot_size,
                    plot_height=round(1.5 * plot_size),
                    title='Bad Loss Test Results at Various Metric Horizons By Team')
    
big_loss_by_team_lines_data_source = ColumnDataSource(
    dict(
        xs=list(),
        ys=list(),
        line_colors=list(),
        series_labels=list(),
    )
)

big_loss_by_team_scatter_data_source = ColumnDataSource(
    dict(
        x=list(),
        y=list(),
        scatter_colors=list()
    )
)

big_loss_by_team_line_graph.multi_line(xs='xs', ys='ys', line_color='line_colors', legend_field='series_labels', line_width=2, source=big_loss_by_team_lines_data_source)
big_loss_by_team_line_graph.scatter(x='x', y='y', size=8, fill_color='scatter_colors', line_color='scatter_colors', source=big_loss_by_team_scatter_data_source)

checkbox_labels=sorted(team_names.tolist())

def checkbox_group_on_click(event):
    activated_team_names = [checkbox_labels[i] for i in event]
    update_data_sources_for_activated_teams(activated_team_names, big_loss_by_team_lines_data_source, big_loss_by_team_scatter_data_source, metric_to_plot='bad_loss_p_value')

big_loss_checkbox_group = CheckboxGroup(labels=checkbox_labels)
big_loss_checkbox_group.on_click(checkbox_group_on_click)

def modify_doc_gd_big_loss_by_team_p_value(doc):
    doc.add_root(row(big_loss_by_team_line_graph, big_loss_checkbox_group))
    
show(modify_doc_gd_big_loss_by_team_p_value, notebook_url=notebook_url)

<p>These charts suggest that there are very few teams whose performance varies in a statistically significant way after a big win or a bad loss. Filtering the test results directly confirms this. First we can take a look at the differences in performance after a big win:</p>

In [None]:
gd_res_by_team.loc[gd_res_by_team['big_win_p_value'] < 0.05] \
    [['team_name', 'period', 'big_win_n', 'big_win_p_value', 'adjusted_big_win_p_value', 'big_win_mean', 'big_win_std', 'control_group_mean', 'control_group_std']] \
    .sort_values(by=['team_name', 'period']).head(20)

<p>It looks like only Arsenal and Manchester United had statistically significant differences in their performances. Fulham's "big win" sample size appears too small to be meaningful. For both Arsenal and Manchester United, however, the adjusted p-values (q-values) reveal that these results are above our FDR threshold of q=0.05.</p>

<p>Next we look for statistically significant differences after a bad loss:</p>

In [None]:
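# Show the bad-loss statistics (not the big-win ones) for the rows that pass the bad-loss filter
gd_res_by_team.loc[gd_res_by_team['bad_loss_p_value'] < 0.05] \
    [['team_name', 'period', 'bad_loss_n', 'bad_loss_p_value', 'adjusted_bad_loss_p_value', 'bad_loss_mean', 'bad_loss_std', 'control_group_mean', 'control_group_std']] \
    .sort_values(by=['team_name', 'period']).head(20)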

<p>A few teams exhibit a statistically significant difference in points per game, but in all cases the sample sizes for what can be considered the treatment group are too small to be meaningful.</p>

<p>Next I test to see if late games have an adverse effect on teams. I count a game as "late" if it starts at or after 6:00pm local time, and I run a chi-squared contingency test on the counts of the three possible game outcomes (a win, loss, or draw) between the control and the treatment group. I do this once per team, and run the p-value correction again.</p>

In [None]:
def run_tod_statistical_test_for_period(epl_tables_df, late_game_hour_cutoff=18):
    epl_tables_df['fixture_hour'] = epl_tables_df['fixture_date'].apply(lambda date: date.hour)
    epl_tables_df['late_game'] = epl_tables_df['fixture_hour'] >= late_game_hour_cutoff

    result_df = pd.DataFrame()
    for team_name in epl_tables_df['team_name'].unique():
        late_game_group = epl_tables_df.loc[epl_tables_df['late_game'] & (epl_tables_df['team_name'] == team_name)]
        control_group = epl_tables_df.loc[(~epl_tables_df['late_game']) & (epl_tables_df['team_name'] == team_name)]

        late_game_outcomes = late_game_group.groupby('match_pts').apply(len).tolist()
        control_game_outcomes = control_group.groupby('match_pts').apply(len).tolist()

        if len(late_game_outcomes) == 3 and len(control_game_outcomes) == 3:
            chi2, p_value, dof, expected = stats.chi2_contingency(
                np.array([late_game_outcomes, control_game_outcomes])
            )
        else:
            p_value = np.nan
            chi2 = np.nan
            dof = np.nan

        result_df = pd.concat([result_df, pd.Series({
            'team_name': team_name,
            'n': len(late_game_group),
            'p_value': p_value,
            'chi_2': chi2,
            'dof': dof
        })], axis=1)

    return result_df.transpose().sort_values(by='p_value')

tod_test_results = run_tod_statistical_test_for_period(tables_df)
# The results are already sorted by p_value inside the function, which is what is_sorted=True relies on
tod_test_results['adjusted_p_value'] = statsmodels.stats.multitest.fdrcorrection(tod_test_results['p_value'], is_sorted=True)[1]

tod_test_results.head()
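
<p>The function above skips any team that is missing one of the three outcomes in either group. One possible alternative, sketched here on made-up results, is to count outcomes with value_counts and reindex against a fixed outcome list so that both rows of the contingency table always line up. The sample data and variable names are assumptions for illustration; note also that the chi-squared approximation becomes unreliable when expected counts are small, so this does not rescue tiny samples.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical match points for one team, split into late and daytime games
late_pts = pd.Series([3, 3, 0, 3, 0, 3, 3, 0])              # no draws in this group
day_pts = pd.Series([0, 1, 3, 0, 1, 0, 3, 1, 0, 0, 1, 3])

outcomes = [0, 1, 3]  # loss, draw, win

# reindex keeps the columns aligned and fills outcomes that never occurred with 0
late_counts = late_pts.value_counts().reindex(outcomes, fill_value=0)
day_counts = day_pts.value_counts().reindex(outcomes, fill_value=0)

table = np.array([late_counts.to_numpy(), day_counts.to_numpy()])

# chi2_contingency needs every expected count to be positive,
# so drop any outcome that neither group recorded
table = table[:, table.sum(axis=0) > 0]

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(f'chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}')
```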

<p>It looks like in the ten-year stretch that our data covers, only Watford showed a statistically significant difference in performance between games they played during the day and those they played in the evening. Again, however, the q-value falls outside the acceptable range. Still, we can plot the win/draw/loss distribution for the two match groups and examine the differences:</p>

In [None]:
early_game_graph = figure(title=f'Watford Early Day Games W/D/L Distribution 2011-2021',
                        y_axis_label='Frequency',
                        x_range=['Win', 'Draw', 'Loss'],
                        toolbar_location=None,
                        plot_width=plot_size,
                        plot_height=plot_size)

late_game_graph = figure(title=f'Watford Late Day Games W/D/L Distribution 2011-2021',
                        y_axis_label='Frequency',
                        x_range=['Win', 'Draw', 'Loss'],
                        toolbar_location=None,
                        plot_width=plot_size,
                        plot_height=plot_size)

early_game_graph.toolbar.active_drag = None
early_game_graph.toolbar.active_scroll = None
early_game_graph.toolbar.active_tap = None
early_game_graph.y_range.start = 0
early_game_graph.xgrid.grid_line_color = None

late_game_graph.toolbar.active_drag = None
late_game_graph.toolbar.active_scroll = None
late_game_graph.toolbar.active_tap = None
late_game_graph.y_range.start = 0
late_game_graph.xgrid.grid_line_color = None

watford_early_games_df = tables_df.loc[
    (tables_df['team_name'] == 'Watford') & ~(tables_df['late_game'])
]

watford_late_games_df = tables_df.loc[
    (tables_df['team_name'] == 'Watford') & (tables_df['late_game'])
]

watford_early_games_results = watford_early_games_df.groupby('match_pts').apply(len)
n_early_games = len(watford_early_games_df)

watford_early_games_losses = watford_early_games_results[0] / n_early_games
watford_early_games_draws = watford_early_games_results[1] / n_early_games
watford_early_games_wins = watford_early_games_results[3] / n_early_games

watford_late_games_results = watford_late_games_df.groupby('match_pts').apply(len)
n_late_games = len(watford_late_games_df)

watford_late_games_losses = watford_late_games_results[0] / n_late_games
watford_late_games_draws = watford_late_games_results[1] / n_late_games
watford_late_games_wins = watford_late_games_results[3] / n_late_games

early_games_bar_data_source = ColumnDataSource(
    dict(
        x=['Win', 'Draw', 'Loss'],
        y=[0] * 3,
        top=[watford_early_games_wins, watford_early_games_draws, watford_early_games_losses],
        color=[Category10[3][0]] * 3
    )
)

late_games_bar_data_source = ColumnDataSource(
    dict(
        x=['Win', 'Draw', 'Loss'],
        y=[0] * 3,
        top=[watford_late_games_wins, watford_late_games_draws, watford_late_games_losses],
        color=[Category10[3][1]] * 3
    )
)

early_game_graph.vbar(x='x', top='top', color='color', source=early_games_bar_data_source)
late_game_graph.vbar(x='x', top='top', color='color', source=late_games_bar_data_source)

def modify_watford_late_games_test_p_value(doc):
    doc.add_root(row(early_game_graph, late_game_graph))
    
show(modify_watford_late_games_test_p_value, notebook_url=notebook_url)

In [None]:
import math  # needed for math.sqrt below and in run_stoke_statistical_test

early_games_avg_pts = np.mean(watford_early_games_df['match_pts'])
early_games_pts_std_dev = np.std(watford_early_games_df['match_pts'])

print(f'Early games average points earned: {early_games_avg_pts}, std. dev.:  {early_games_pts_std_dev}')

late_games_avg_pts = np.mean(watford_late_games_df['match_pts'])
late_games_pts_std_dev = np.std(watford_late_games_df['match_pts'])
print(f'Late games average points earned: {late_games_avg_pts}, std. dev.:  {late_games_pts_std_dev}')

print(f"\nCohen's effect size on points earned: {(early_games_avg_pts - late_games_avg_pts) / early_games_pts_std_dev}")

tod_test_N = n_early_games + n_late_games

# Select the single Watford row by position rather than by index label
effect_size_phi = math.sqrt(tod_test_results.loc[tod_test_results['team_name'] == 'Watford']['chi_2'].iloc[0] / tod_test_N)

print(f'Phi effect size: {effect_size_phi}')
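
<p>The cell above standardizes the difference in means by the early-game standard deviation alone. A common alternative is Cohen's d with a pooled standard deviation, and for contingency tables Cramér's V, which reduces to the phi value computed above when one dimension of the table is 2. The sketch below shows both; the helper names cohens_d and cramers_v and the sample numbers are made up for illustration.</p>

```python
import math
import numpy as np

def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / math.sqrt(pooled_var)

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V; equal to sqrt(chi2 / n) when the smaller table dimension is 2."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

# Hypothetical points per match for two small groups of games
early_pts = [3, 3, 1, 0, 3]
late_pts = [0, 1, 0, 0, 1]

d = cohens_d(early_pts, late_pts)
v = cramers_v(chi2=4.0, n=100, n_rows=2, n_cols=3)
print(f"Cohen's d: {d:.3f}, Cramér's V: {v:.3f}")
```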

<p>Last, I wanted to see if playing a game on a "cold, rainy night in Stoke" makes a statistically significant difference to game outcomes. To pick games that happen at night I again choose those that start at or after 6:00pm local time. For the "cold" part I use the time of year as a proxy, treating any game in November, December, January, or February as "cold." Unfortunately I can't filter on the "rainy" element, because the current dataset doesn't include weather data. Even if it did, I believe filtering on it would shrink the sample too much.</p>

<p>If I tested only games played in Stoke under the above criteria, the treatment group would be far too small. Instead, I hand-picked "Stoke-like" teams that played in the league between 2011 and 2021 as a broader stand-in for Stoke. I chose teams that were from the Midlands or the North, were small to mid-sized (no Manchester United, for example), and often played a highly defensive, counter-attacking style built on long balls and a low block. The control group consists of away games against these same teams played under any other conditions.</p>
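
<p>The "cold" and "late" selection rules described above can be sketched on a handful of made-up fixtures. This version uses the vectorized .dt accessor instead of per-row lambdas and spells out the cold months explicitly; the kickoff times are invented and assumed to already be in local time.</p>

```python
import pandas as pd

# Made-up kickoff times, assumed to already be in local time
fixtures = pd.DataFrame({'fixture_date': pd.to_datetime([
    '2019-12-04 19:45', '2019-08-17 15:00', '2020-02-22 12:30',
    '2020-01-21 20:15', '2020-03-07 17:30'])})

hour = fixtures['fixture_date'].dt.hour
month = fixtures['fixture_date'].dt.month

# Late: kickoff at or after 18:00. Cold: November through February.
fixtures['late_game'] = hour >= 18
fixtures['cold_game'] = month.isin([11, 12, 1, 2])

# Treatment games must satisfy both conditions; everything else is control
fixtures['treatment'] = fixtures['late_game'] & fixtures['cold_game']
print(fixtures)
```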

In [None]:
def run_stoke_statistical_test(epl_tables_df, late_game_hour_cutoff=18, cold_season_start_month=11,
                               cold_season_end_month=3):
    epl_tables_df['fixture_hour'] = epl_tables_df['fixture_date'].apply(lambda date: date.hour)
    epl_tables_df['fixture_month'] = epl_tables_df['fixture_date'].apply(lambda date: date.month)

    epl_tables_df['late_game'] = epl_tables_df['fixture_hour'] >= late_game_hour_cutoff
    epl_tables_df['cold_game'] = (epl_tables_df['fixture_month'] >= cold_season_start_month) | \
                                 (epl_tables_df['fixture_month'] < cold_season_end_month)

    stoke_like = {'Blackburn': True,
                  'Manchester United': False,
                  'Chelsea': False,
                  'Arsenal': False,
                  'Wolves': False,
                  'Norwich': False,
                  'Bolton': True,
                  'Sunderland': True,
                  'Aston Villa': False,
                  'Everton': False,
                  'Swansea': False,
                  'QPR': False,
                  'West Brom': False,
                  'Stoke City': True,
                  'Newcastle': True,
                  'Wigan': True,
                  'Liverpool': False,
                  'Fulham': False,
                  'Manchester City': False,
                  'Tottenham': False,
                  'Southampton': False,
                  'Reading': False,
                  'West Ham': False,
                  'Cardiff': False,
                  'Hull City': True,
                  'Crystal Palace': False,
                  'Leicester': False,
                  'Burnley': True,
                  'Watford': False,
                  'Bournemouth': False,
                  'Middlesbrough': True,
                  'Brighton': False,
                  'Huddersfield': True,
                  'Sheffield Utd': True,
                  'Leeds': False,
                  'Brentford': False}

    epl_tables_df['stoke_like_opposition'] = epl_tables_df['opp_team_name'].apply(lambda name: stoke_like[name])

    cold_rainy_nights_at_stoke_df = epl_tables_df.loc[
        epl_tables_df['cold_game'] &
        epl_tables_df['late_game'] &
        epl_tables_df['stoke_like_opposition'] &
        ~epl_tables_df['home']
        ]

    control_group = epl_tables_df.loc[
        (~epl_tables_df['cold_game'] |
        ~epl_tables_df['late_game']) &
        (epl_tables_df['stoke_like_opposition'] &
        ~epl_tables_df['home'])
        ]

    cold_rainy_nights_outcome = cold_rainy_nights_at_stoke_df.groupby('match_pts').apply(len).tolist()
    control_game_outcomes = control_group.groupby('match_pts').apply(len).tolist()

    chi2, p_value, dof, expected = stats.chi2_contingency(
        np.array([cold_rainy_nights_outcome, control_game_outcomes])
    )
    
    stoke_test_N = len(control_group) + len(cold_rainy_nights_at_stoke_df)
    effect_size_phi = math.sqrt(chi2 / stoke_test_N)
    
    control_games_avg_pts = np.mean(control_group['match_pts'])
    control_games_pts_std_dev = np.std(control_group['match_pts'])
    
    stoke_games_avg_pts = np.mean(cold_rainy_nights_at_stoke_df['match_pts'])
    stoke_games_pts_std_dev = np.std(cold_rainy_nights_at_stoke_df['match_pts'])
    
    print(f'Control mean: {control_games_avg_pts}')
    print(f'Stoke mean: {stoke_games_avg_pts}')
    
    cohens = (control_games_avg_pts - stoke_games_avg_pts) / max(control_games_pts_std_dev, stoke_games_pts_std_dev)

    return p_value, cohens, effect_size_phi

In [None]:
stoke_p_value, stoke_effect_size_cohens, stoke_effect_size_phi = run_stoke_statistical_test(tables_df)

print(f'P-Value: {stoke_p_value}')

<p>The result isn't statistically significant, so this data gives no evidence of a systematic disadvantage to playing on a cold, rainy night in Stoke compared with any other time there.</p>