<a href="https://colab.research.google.com/github/dtmeyers/Soccer-Data-Analysis/blob/main/SPI_Research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook looks at the data freely available from fivethirtyeight.com and tests it against the betting market.

I'll test the data in a number of ways, one of which is removing their adjustment for the model's tendency to under-predict draws.

A note about SPI's methodology: the offensive and defensive ratings are based on a game against an average team at a neutral site, and the overall rating is the percentage of points that team would be expected to take against an average team at a neutral site.
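To make "percentage of points" concrete, here is a small sketch. It assumes the usual 3 points for a win and 1 for a draw, and the function name `expected_points_share` is made up for illustration; it is one reading of the description above, not FiveThirtyEight's published formula.

```python
def expected_points_share(p_win, p_draw):
    """Expected share of the 3 available points: (3 * P(win) + 1 * P(draw)) / 3."""
    return (3 * p_win + p_draw) / 3


# Hypothetical probabilities against an average team at a neutral site
share = expected_points_share(p_win=0.50, p_draw=0.25)
print(f"{share:.1%}")
```

Under this reading, a team that wins half of those games and draws a quarter of them would be expected to take a bit under 60% of the available points.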

I also want to build a model that incorporates the betting market into predicting goals scored/conceded. I expect this will effectively just regress the model toward the market, but I'm interested in seeing the results.
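Both the market comparison and any market-informed model need bookmaker prices expressed as probabilities. The sketch below converts decimal odds with 1/odds and also shows that the three raw probabilities sum to more than 1 because of the bookmaker's margin. The proportional rescaling used to remove that margin is an assumption (one common choice among several), and `implied_probabilities` is a hypothetical helper; the code in this notebook compares SPI against the raw 1/odds values.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to implied probabilities, raw and margin-free."""
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    overround = sum(raw)  # exceeds 1 by the bookmaker's margin
    # Assumption: strip the margin by scaling each outcome proportionally
    fair = [p / overround for p in raw]
    return raw, fair, overround


# Example prices: home 2.30, draw 3.50, away 3.30
raw, fair, overround = implied_probabilities(2.30, 3.50, 3.30)
print([round(p, 4) for p in raw], round(overround, 4))
print([round(p, 4) for p in fair])
```

Because the raw probabilities are inflated by the margin, an edge measured against them is slightly understated relative to one measured against the rescaled values.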

Notable findings to keep track of:
*   In 2019/2020 24% of EPL matches ended in a draw
*   In 2018/2019 18.7% of EPL matches ended in a draw
*   In 2017/2018 26% of EPL matches ended in a draw
*   In 2016/2017 22% of EPL matches ended in a draw


*   SPI would have returned 4.7% in 2016/2017 if betting evenly across all games
*   SPI would have returned 20% in 2016/2017 if betting only on 5% or larger edge (roughly 25% of games)

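The two kinds of findings listed above (draw rate per season, and return when betting only above an edge threshold) can be sketched on a toy table. The column names `result`, `edge`, `price` and `won`, and the numbers themselves, are invented for illustration; the edge is in percentage points, as in the code below.

```python
import pandas as pd


def draw_rate(results):
    """Fraction of matches whose full-time result is 'D'."""
    return (results == 'D').mean()


def flat_stake_return(bets, min_edge):
    """Average profit per unit staked on bets with edge >= min_edge,
    plus the share of all matches that received a bet."""
    placed = bets[bets['edge'] >= min_edge]
    if placed.empty:
        return 0.0, 0.0
    # A winning unit bet returns price - 1, a losing one returns -1
    profit = placed['price'].sub(1).where(placed['won'], -1.0)
    return profit.mean(), len(placed) / len(bets)


# Toy data, not taken from the real season files
toy = pd.DataFrame({
    'result': ['H', 'D', 'A', 'D'],
    'edge': [6.0, 2.0, 5.5, 0.5],
    'price': [2.5, 3.4, 3.0, 3.5],
    'won': [True, False, False, True],
})

rate = draw_rate(toy['result'])
roi, share_bet = flat_stake_return(toy, min_edge=5.0)
print(rate, roi, share_bet)
```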
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Code block will analyze EPL data
# Load the data into variables
spi_global_ranking = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/538 SPI Data/spi_global_rankings.csv')
spi_matches = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/538 SPI Data/spi_matches.csv')
spi_matches_latest = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/538 SPI Data/spi_matches_latest.csv')

EPL_match_odds_1920 = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/EPL Betting Odds 2019 2020.csv')
EPL_match_odds_1819 = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/EPL Betting Odds 2018 2019.csv')
EPL_match_odds_1718 = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/EPL Betting Odds 2017 2018.csv')
EPL_match_odds_1617 = pd.read_csv('/content/drive/My Drive/Sports Data Analysis/Soccer Data/EPL Betting Odds 2016 2017.csv')

# This section stores the data I want to use regarding betting markets in a new dataframe
keep_cols = ['HomeTeam', 'AwayTeam', 'B365H', 'B365A', 'B365D', 'FTR', 'HTR']
# DataFrame.append was removed in pandas 2.0, so the seasons are stacked with pd.concat
EPL_match_odds = pd.concat(
    [season[keep_cols].sort_values(['HomeTeam', 'AwayTeam'])
     for season in (EPL_match_odds_1617, EPL_match_odds_1718,
                    EPL_match_odds_1819, EPL_match_odds_1920)])

# This section converts the data into usable forms
# Converting between decimal odds and implied probability is just y = 1/x
# (the result still includes the bookmaker's margin)
odds_cols = ['B365H', 'B365A', 'B365D']
EPL_match_odds[odds_cols] = 1 / EPL_match_odds[odds_cols]

spi_matches_EPL = spi_matches[spi_matches['league'] == 'Barclays Premier League']
spi_matches_EPL = spi_matches_EPL[spi_matches_EPL['season'] != 2020]
spi_matches_EPL = spi_matches_EPL.replace(to_replace='AFC Bournemouth', value='Bournemouth').sort_values(['season', 'team1', 'team2'])

# This block will load the data I'm using into a new dataframe to make it easier to manipulate
# The edges in the odds_comparison dataframe are the SPI match probability minus the
# implied probability of the Bet365 price, in percentage points. Positive numbers represent +EV
# Note: the two sources are matched by row position, so both must be sorted into the same order
home_odds = np.array(spi_matches_EPL['prob1']) - np.array(EPL_match_odds['B365H'])
home_odds = [round(x*100, 2) for x in home_odds]

away_odds = np.array(spi_matches_EPL['prob2']) - np.array(EPL_match_odds['B365A'])
away_odds = [round(x*100, 2) for x in away_odds]

draw_odds = np.array(spi_matches_EPL['probtie']) - np.array(EPL_match_odds['B365D'])
draw_odds = [round(x*100, 2) for x in draw_odds]

odds_comparison = pd.DataFrame({'Season': spi_matches_EPL['season'],
                                'Home Team': spi_matches_EPL['team1'],
                                'Away Team': spi_matches_EPL['team2'],
                                'Home Edge': home_odds,
                                'Away Edge': away_odds,
                                'Draw Edge': draw_odds,
                                'Result': np.array(EPL_match_odds['FTR'])})

# This section will find the biggest edge and calculate the result of each bet
# We can restrict the bets to a significant edge only (e.g. +5 percentage points)
# by changing the threshold in calc_winnings
def calc_bet(x, df):
  if df['Biggest Edge'].iloc[x] == df['Home Edge'].iloc[x]:
    return 'H'
  elif df['Biggest Edge'].iloc[x] == df['Away Edge'].iloc[x]:
    return 'A'
  else:
    return 'D'

def calc_winnings(x, df):
  if df['Biggest Edge'].iloc[x] < 1: # this means a 'no bet' if the edge is too small
    return 0
  if df['Bet'].iloc[x] == df['Result'].iloc[x]:
    if df['Bet'].iloc[x] == 'H':
      return (df['Home Price'].iloc[x]-1)
    elif df['Bet'].iloc[x] == 'A':
      return (df['Away Price'].iloc[x]-1)
    else:
      return (df['Draw Price'].iloc[x]-1)
  else:
    return -1

odds_comparison['Biggest Edge'] = [max(odds_comparison['Home Edge'].iloc[x],
                                       odds_comparison['Away Edge'].iloc[x],
                                       odds_comparison['Draw Edge'].iloc[x]) for x in range(len(odds_comparison))]

odds_comparison['Bet'] = [calc_bet(x, odds_comparison) for x in range(len(odds_comparison))]

odds_comparison['Home Price'] = np.array(EPL_match_odds['B365H'].apply(lambda x: 1/x))
odds_comparison['Away Price'] = np.array(EPL_match_odds['B365A'].apply(lambda x: 1/x))
odds_comparison['Draw Price'] = np.array(EPL_match_odds['B365D'].apply(lambda x: 1/x))

odds_comparison['Winnings'] = [calc_winnings(x, odds_comparison) for x in range(len(odds_comparison))]

theory = odds_comparison[odds_comparison['Season'] == 2016]

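# Average return per unit staked, over the bets actually placed
print(theory['Winnings'].sum()/len(theory[theory['Winnings'] != 0]))
# Share of matches that received a bet
print(1 - len(theory[theory['Winnings'] == 0 ])/len(theory))
# Number of bets placed
print(len(theory[theory['Winnings'] != 0]))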

theory.tail(20)

0.013219178082191772
0.7684210526315789
292

Unnamed: 0,Season,Home Team,Away Team,Home Edge,Away Edge,Draw Edge,Result,Biggest Edge,Bet,Home Price,Away Price,Draw Price,Winnings
184,2016,West Bromwich Albion,West Ham United,6.64,-5.5,-3.81,H,6.64,H,2.8,2.8,3.2,1.8
756,2016,West Ham United,Arsenal,-6.89,9.6,-4.71,A,9.6,A,4.75,1.75,4.2,0.75
44,2016,West Ham United,Bournemouth,-2.61,0.69,-1.03,H,0.69,A,1.95,4.0,3.75,0.0
851,2016,West Ham United,Burnley,-7.78,5.09,0.41,H,5.09,A,1.67,5.75,4.0,-1.0
1477,2016,West Ham United,Chelsea,-1.23,0.69,-2.25,A,0.69,A,6.0,1.62,4.1,0.0
997,2016,West Ham United,Crystal Palace,8.12,-6.87,-3.94,H,8.12,H,2.38,3.2,3.4,1.38
1942,2016,West Ham United,Everton,-3.21,3.69,-3.0,D,3.69,A,3.3,2.25,3.6,-1.0
871,2016,West Ham United,Hull City,-1.7,-0.34,-0.75,H,-0.34,A,1.62,6.0,4.1,0.0
1554,2016,West Ham United,Leicester City,5.48,-5.11,-2.71,A,5.48,H,2.3,3.3,3.5,-1.0
2284,2016,West Ham United,Liverpool,4.26,-7.28,0.15,A,4.26,H,5.25,1.7,4.0,-1.0
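The edge columns in the code above come from subtracting two arrays row by row, which only lines up if both sources sort into exactly the same order. A more defensive alternative is to join on season and team names and check for unmatched rows. This is a sketch under assumptions: the odds files as loaded above have no `season` column, so one would have to be added per file; the team spellings, probabilities and the `name_map` dictionary below are toy values, not the real data.

```python
import pandas as pd

# Toy stand-ins for spi_matches_EPL and EPL_match_odds (values are invented)
spi = pd.DataFrame({
    'season': [2016, 2016],
    'team1': ['West Ham United', 'AFC Bournemouth'],
    'team2': ['Arsenal', 'Burnley'],
    'prob1': [0.14, 0.45],
})
odds = pd.DataFrame({
    'season': [2016, 2016],  # assumption: added when each season file is loaded
    'HomeTeam': ['Bournemouth', 'West Ham'],
    'AwayTeam': ['Burnley', 'Arsenal'],
    'B365H': [2.10, 4.75],
})

# Hypothetical mapping that harmonises team names between the two sources
name_map = {'AFC Bournemouth': 'Bournemouth', 'West Ham United': 'West Ham'}
spi[['team1', 'team2']] = spi[['team1', 'team2']].replace(name_map)

merged = spi.merge(
    odds,
    left_on=['season', 'team1', 'team2'],
    right_on=['season', 'HomeTeam', 'AwayTeam'],
    how='left',
    validate='one_to_one',  # raises if a fixture appears twice on either side
    indicator=True,         # adds a '_merge' column showing where each row matched
)

# Rows without a price point to a team-name mismatch that needs fixing
unmatched = merged[merged['_merge'] != 'both']

# Home edge in percentage points: SPI probability minus implied probability
merged['home_edge'] = (merged['prob1'] - 1 / merged['B365H']) * 100
print(len(unmatched))
print(merged[['team1', 'team2', 'home_edge']])
```

With a join like this, a spelling difference shows up as an unmatched row rather than silently pairing a match with another fixture's odds.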
