# MLB Predictions and Betting Strategy
> Part 5 - Making our guesses, assessing confidence, and deciding how much to bet

- toc: false
- badges: true
- comments: true
- categories: [baseball, webscraping, kelly criterion, xgboost]
- image: images/chart-preview.png

It's finally time to put all of this work into practice. This is going to be another long notebook, though. It's a lot of work.
> Warning: Do this at your own risk. If you lose money, it's not my problem. Gamble responsibly. If you think you may have a problem controlling yourself, [get some help](http://www.gamblersanonymous.org/ga/).

> Important: But if you win, you owe me 10%.

To make bet recommendations, this notebook needs to do all this:
- Download games and odds that have completed
- Download today's games and their odds
- Calculate the stats for today's games
- Generate predictions using our saved model
- Calculate bet sizes based on the moneyline odds and our predictions

Luckily, a bunch of the code for this was written in previous posts; we just need to get it into this notebook.

## Update Historic Data
First, we'll update the data from our web scraping. I've hidden all the code you saw in parts 2 and 3. Click the button to unhide it.

In [1]:
#collapse-hide
from bs4 import BeautifulSoup as bs
import requests

# these are functions related to parsing the baseball reference page

def get_game_summary(soup, game_id):
    game = {'game_id': game_id}
    scorebox = soup.find('div', {'class':'scorebox'})
    teams = scorebox.findAll('a',{'itemprop':'name'})
    game['away_team_abbr'] = teams[0]['href'].split('/')[2]
    game['home_team_abbr'] = teams[1]['href'].split('/')[2]
    meta = scorebox.find('div', {'class':'scorebox_meta'}).findAll('div')
    game['date'] = meta[0].text.strip()
    game['start_time'] = meta[1].text[12:-6].strip()
    return game

def get_table_summary(soup, table_no):
    stats_tables = soup.findAll('table', {'class':'stats_table'})
    t = stats_tables[table_no].find('tfoot')
    summary = {x['data-stat']:x.text.strip() for x in t.findAll('td')}
    return summary

def get_pitcher_data(soup, table_no):
    stats_tables = soup.findAll('table', {'class':'stats_table'})
    t = stats_tables[table_no]
    data = []
    rows = t.findAll('tr')[1:-1] # not the header and footer rows
    for r in rows:
        summary = {x['data-stat']:x.text.strip() for x in r.findAll('td')}
        summary['name'] = r.find('th',{'data-stat':'player'}).find('a')['href'].split('/')[-1][:-6].strip()
        data.append(summary)
    return data

def process_link(url):
    resp = requests.get(url)
    game_id = url.split('/')[-1][:-6]

    # strange preprocessing routine
    uncommented_html = ''
    for h in resp.text.split('\n'):
        if '<!--     <div' in h: continue
        if h.strip() == '<!--': continue
        if h.strip() == '-->': continue
        uncommented_html += h + '\n'

    soup = bs(uncommented_html)
    data = {
        'game': get_game_summary(soup, game_id),
        'away_batting': get_table_summary(soup, 1),
        'home_batting':get_table_summary(soup, 2),
        'away_pitching':get_table_summary(soup, 3),
        'home_pitching':get_table_summary(soup, 4),
        'away_pitchers': get_pitcher_data(soup, 3),
        'home_pitchers': get_pitcher_data(soup, 4)
    }
    return data

def get_covers_data(date_string):
    odds_data = []
    # get the web page with game data on it
    url = f'https://www.covers.com/Sports/MLB/Matchups?selectedDate={date_string}'
    resp = requests.get(url)

    # parse the games
    scraped_games = bs(resp.text).findAll('div',{'class':'cmg_matchup_game_box'})
    for g in scraped_games:
        game = {}
        game['home_moneyline'] = g['data-game-odd']
        game['date'] = g['data-game-date']
        game['away_team_abbr'] = g['data-away-team-shortname-search']
        game['home_team_abbr'] = g['data-home-team-shortname-search']
        try:
            game['home_score'] =g.find('div',{'class':'cmg_matchup_list_score_home'}).text.strip()
            game['away_score'] =g.find('div',{'class':'cmg_matchup_list_score_away'}).text.strip()
        except AttributeError: # find() returns None when a game has no score yet
            game['home_score'] =''
            game['away_score'] =''

        odds_data.append(game)
    return odds_data
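
The "strange preprocessing routine" in `process_link` exists because the stats tables on the page are wrapped in HTML comments, so a parser never sees them as tags. Here is a minimal sketch of the idea using only the standard library. The sample HTML and the helper names (`TableCounter`, `count_tables`, `uncomment`) are made up for illustration; they are not part of the scraper above.

```python
from html.parser import HTMLParser

class TableCounter(HTMLParser):
    """Counts <table> start tags that the parser actually sees."""
    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables += 1

def count_tables(html):
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables

def uncomment(html):
    # same rule as process_link: drop the lines that open or close a comment
    kept = []
    for line in html.split('\n'):
        if '<!--     <div' in line: continue
        if line.strip() in ('<!--', '-->'): continue
        kept.append(line)
    return '\n'.join(kept)

# hypothetical page fragment: the table is hidden inside a comment
sample = """<div id="all_batting">
<!--
<table class="stats_table"><tr><td>1</td></tr></table>
-->
</div>"""

hidden = count_tables(sample)              # the commented-out table is invisible
visible = count_tables(uncomment(sample))  # after stripping the markers it shows up
print(hidden, visible)
```

Once the comment markers are gone, calls like `soup.findAll('table', {'class':'stats_table'})` have something to find.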

First, we'll load our saved data.


In [2]:
import pickle
game_data = pickle.load(open('game_data.pkl','rb'))

Let's get the scores for the games that aren't in our database yet.

In [3]:
current_year=2020

# find all games for the year
url = f"https://www.baseball-reference.com/leagues/MLB/{current_year}-schedule.shtml"
resp = requests.get(url)
soup=bs(resp.text)
game_soups = soup.findAll('a',text='Boxscore')
game_links = [x['href'] for x in game_soups] # for instance '/boxes/LAN/LAN202007230.shtml'

# compare against downloaded games
downloaded_games = [g['game']['game_id'] for g in game_data]
new_game_links = [x for x in game_links if x[-18:-6] not in downloaded_games]

# get the new games
for link in new_game_links:
    url = 'https://www.baseball-reference.com' + link
    game_data.append(process_link(url))
print("New games downloaded: ", len(new_game_links))

New games downloaded:  46
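
The `x[-18:-6]` slice above works because a boxscore link ends in a 12-character game id followed by `.shtml`. As a quick check, and a sketch of a slightly more forgiving alternative, the hypothetical helper `game_id_from_link` below pulls the id out by file name instead of by position. It is not used anywhere else in this notebook.

```python
import posixpath

def game_id_from_link(link):
    """Return the file name of a link without its '.shtml' extension."""
    filename = posixpath.basename(link)             # 'LAN202007230.shtml'
    game_id, _ext = posixpath.splitext(filename)    # ('LAN202007230', '.shtml')
    return game_id

link = '/boxes/LAN/LAN202007230.shtml'
print(link[-18:-6])              # the slice used in the cell above
print(game_id_from_link(link))   # same id, without counting characters
```

Both give `LAN202007230`, which is also what `process_link` stores as `game_id`, so the comparison against `downloaded_games` lines up.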


## Append Today's Games
Now we'll append today's games to this data.

In [4]:
import pandas as pd
import datetime as dt

today_games=[]
url = 'https://www.baseball-reference.com/previews/'
page = requests.get(url).text
soup = bs(page)
summaries = soup.findAll('div', {'class':'game_summary'})
for s in summaries:
    # skip postponed games: they have no Preview link, so check before reading its href
    preview = s.find('a', text='Preview')
    if not preview: continue

    # skip game 2 of a doubleheader: the digit before ".shtml" is 2 for second games
    # (first games end in 1, e.g. "/previews/2020/PHI202008051.shtml")
    if preview['href'][-7]=='2': continue

    game = {
        'game':{
            'game_id': preview['href'][-18:-6],
            'is_test':True,
            'date': dt.datetime.now().strftime('%A, %B %d, %Y')
        },
        'home_batting':{'R':0},
        'away_batting':{'R':0},
        'home_pitching':{'R':0},
        'away_pitching':{'R':0},
        'home_pitchers':[{'R':0}],
        'away_pitchers':[{'R':0}]
    }

    try:
        team_links = s.findAll('a')
        game['game']['away_team_abbr'] = team_links[0]['href'].split('/')[2]
        game['game']['home_team_abbr'] = team_links[2]['href'].split('/')[2]
    except Exception as e:
        #just all star games trigger this, I think
        print(team_links)
        continue

    # get time
    game['game']['start_time'] = s.find('table',{'class':'teams'}).find('tbody').findAll('tr')[1].findAll('td')[2].text.strip()
    # get pitchers
    try:
        cells = s.findAll('table')[1].findAll('td')
        game['away_pitchers'][0]['name'] = cells[1].find('a')['href'].split('/')[-1][:-6].strip()
        game['home_pitchers'][0]['name'] = cells[3].find('a')['href'].split('/')[-1][:-6].strip()
    except Exception as e:
        # no pitcher
        game['away_pitchers'][0]['name'] = ''
        game['home_pitchers'][0]['name'] = ''
    today_games.append(game)
game_data.extend(today_games)
print(len(today_games), "Games today")
for x in today_games: print(x['game']['game_id'], x['game']['start_time'],
                            x['game']['away_team_abbr'],x['away_pitchers'][0]['name'],
                            x['game']['home_team_abbr'],x['home_pitchers'][0]['name'])

15 Games today
NYA202009260 1:05PM MIA rogertr01 NYY garcide01
WAS202009261 3:05PM NYM degroja01 WSN scherma01
OAK202009261 4:10PM SEA sheffju01 OAK minormi01
TOR202009260 6:37PM BAL meansjo01 TOR zeuchtj01
KCA202009260 7:05PM DET boydma01 KCR hernaca04
TEX202009260 7:05PM HOU  TEX 
SLN202009260 7:07PM MIL woodrbr01 STL wainwad01
TBA202009260 7:07PM PHI wheelza01 TBR curtijo02
ATL202009260 7:10PM BOS houckta01 ATL 
CHA202009260 7:10PM CHC lestejo01 CHW dunnida01
CLE202009260 7:10PM PIT musgrjo01 CLE civalaa01
MIN202009260 7:10PM CIN castilu02 MIN pinedmi01
ARI202009260 8:10PM COL marquge01 ARI weavelu01
LAN202009260 9:10PM LAA bundydy01 LAD gonsoto01
SFN202009260 9:15PM SDP davieza02 SFG cuetojo01


## Generate Stats and Features Using Code from Part 2 and Part 3 of this Series
Now we'll create our dataframe in the same way we did in Part 2 of this series. Because today's games are just appended to the end of the data with placeholder stats, the `shift(1)` statements that build the pregame features fill in their stats from each team's earlier games.

Again I hid the code because it's a lot and you've already seen it.

In [5]:
#collapse-hide

import pandas as pd

games = []
batting = []
pitching = []
pitchers = []

for g in game_data:
    game_summary = g['game']
    if 'is_test' not in game_summary.keys(): game_summary['is_test']=False
    # fix date
    game_summary['date'] = game_summary['date'] + " " + game_summary['start_time']
    del game_summary['start_time']

    # get starting pitchers
    game_summary['home_pitcher'] = g['home_pitchers'][0]['name']
    game_summary['away_pitcher'] = g['away_pitchers'][0]['name']

    # this is the field we'll train our model to predict
    game_summary['home_team_win'] = int(g['home_batting']['R'])>int(g['away_batting']['R'])
    games.append(game_summary)

    # add all stats to appropriate lists
    target_pairs = [
        ('away_batting', batting),
        ('home_batting', batting),
        ('away_pitching', pitching),
        ('home_pitching', pitching),
        ('away_pitchers', pitchers),
        ('home_pitchers', pitchers)
    ]
    for key, d in target_pairs:
        if isinstance(g[key], list): # pitchers
            for x in g[key]:
                if 'home' in key:
                    x['is_home_team'] = True
                    x['team'] = g['game']['home_team_abbr']
                else:
                    x['is_home_team'] = False
                    x['team'] = g['game']['away_team_abbr']
                x['game_id'] = g['game']['game_id']
                d.append(x)
        else: #batting, pitching
            x = g[key]
            if 'home' in key:
                x['is_home_team'] = True
                x['team'] = g['game']['home_team_abbr']
                x['spread'] = int(g[key]['R']) - int(g[key.replace('home','away')]['R'])
            else:
                x['is_home_team'] = False
                x['team'] = g['game']['away_team_abbr']
                x['spread'] = int(g[key]['R']) - int(g[key.replace('away','home')]['R'])
            x['game_id'] = g['game']['game_id']
            d.append(x)

game_df = pd.DataFrame(games)
game_df['date'] = pd.to_datetime(game_df['date'], errors='coerce')
game_df = game_df[~game_df['game_id'].str.contains('allstar')].copy() #don't care about allstar games

batting_df = pd.DataFrame(batting)
for k in batting_df.keys():
    if any(x in k for x in ['team','game_id', 'home_away']): continue
    batting_df[k] =pd.to_numeric(batting_df[k],errors='coerce', downcast='float')
batting_df.drop(columns=['details'], inplace=True)

pitching_df = pd.DataFrame(pitching)
for k in pitching_df.keys():
    if any(x in k for x in ['team','game_id', 'home_away']): continue
    pitching_df[k] =pd.to_numeric(pitching_df[k],errors='coerce', downcast='float')
pitcher_df = pd.DataFrame(pitchers)

for k in pitcher_df.keys():
    if any(x in k for x in ['team','name','game_id', 'home_away']): continue
    pitcher_df[k] =pd.to_numeric(pitcher_df[k],errors='coerce', downcast='float')
# filter the pitcher performances to just the starting pitcher
pitcher_df = pitcher_df[~pitcher_df['game_score'].isna()].copy().reset_index(drop=True)
pitcher_df.drop(columns=[x for x in pitcher_df.keys() if 'inherited' in x], inplace=True)

print("Created game_df, batting_df, pitching_df and pitcher_df")

Created game_df, batting_df, pitching_df and pitcher_df


Now we'll create all the calculated features of the model that we made in Part 2 of this blog.


In [6]:
#collapse-hide

import numpy as np

def add_rolling(period, df, stat_columns):
    for s in stat_columns:
        if 'object' in str(df[s].dtype): continue
        df[s+'_'+str(period)+'_Avg'] = df.groupby('team')[s].apply(lambda x:x.rolling(period).mean())
        df[s+'_'+str(period)+'_Std'] = df.groupby('team')[s].apply(lambda x:x.rolling(period).std())
        df[s+'_'+str(period)+'_Skew'] = df.groupby('team')[s].apply(lambda x:x.rolling(period).skew())
    return df

def get_diff_df(df, name, is_pitcher=False):
    #runs for each of the stat dataframes, returns the difference in stats

    #set up dataframe with time index
    df['date'] = pd.to_datetime(df['game_id'].str[3:-1], format="%Y%m%d")
    df = df.sort_values(by='date').copy()
    newindex = df.groupby('date')['date']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
    df = df.set_index(newindex).sort_index()

    # get stat columns
    stat_cols = [x for x in df.columns if 'int' in str(df[x].dtype)]
    stat_cols.extend([x for x in df.columns if 'float' in str(df[x].dtype)])

    #add lags
    df = add_rolling('5d', df, stat_cols) # this game series
    df = add_rolling('10d', df, stat_cols)
    df = add_rolling('45d', df, stat_cols)
    df = add_rolling('180d', df, stat_cols) # this season
    df = add_rolling('730d', df, stat_cols) # 2 years

    # reset stat columns to just the lags (removing the original stats)
    df.drop(columns=stat_cols, inplace=True)
    stat_cols = [x for x in df.columns if 'int' in str(df[x].dtype)]
    stat_cols.extend([x for x in df.columns if 'float' in str(df[x].dtype)])

    # shift results so that each row is  a pregame stat
    df = df.reset_index(drop=True)
    df = df.sort_values(by='date')
    for s in stat_cols:
        if is_pitcher:
            df[s] = df.groupby('name')[s].shift(1)
        else:
            df[s] = df.groupby('team')[s].shift(1)

    # calculate differences in pregame stats from home vs. away teams
    away_df = df[~df['is_home_team']].copy()
    away_df = away_df.set_index('game_id')
    away_df = away_df[stat_cols]

    home_df = df[df['is_home_team']].copy()
    home_df = home_df.set_index('game_id')
    home_df = home_df[stat_cols]

    diff_df = home_df.subtract(away_df, fill_value=0)
    diff_df = diff_df.reset_index()

    # clean column names
    for s in stat_cols:
        diff_df[name + "_" + s] = diff_df[s]
        diff_df.drop(columns=s, inplace=True)

    return diff_df

df = game_df
df = pd.merge(left=df, right = get_diff_df(batting_df, 'batting'),
               on = 'game_id', how='left')
df = pd.merge(left=df, right = get_diff_df(pitching_df, 'pitching'),
               on = 'game_id', how='left')
df = pd.merge(left=df, right = get_diff_df(pitcher_df, 'pitcher',is_pitcher=True),
               on = 'game_id', how='left')

#pitcher rest feature
pitcher_df = pd.DataFrame(pitchers) # old version was filtered to just starters
dates = pitcher_df['game_id'].str[3:-1]
pitcher_df['date'] = pd.to_datetime(dates,format='%Y%m%d', errors='coerce')
pitcher_df['rest'] = pitcher_df.groupby('name')['date'].diff().dt.days
# filter the pitcher performances to just the starting pitcher
pitcher_df = pitcher_df[~pitcher_df['game_score'].isna()].copy().reset_index(drop=True)
home_pitchers = pitcher_df[pitcher_df['is_home_team']].copy().reset_index(drop=True)
df = pd.merge(left=df, right=home_pitchers[['game_id','name', 'rest']],
              left_on=['game_id','home_pitcher'],
              right_on=['game_id','name'],
              how='left')
df.rename(columns={'rest':'home_pitcher_rest'}, inplace=True)
away_pitchers = pitcher_df[~pitcher_df['is_home_team']].copy().reset_index(drop=True)
df = pd.merge(left=df, right=away_pitchers[['game_id','name','rest']],
              left_on=['game_id','away_pitcher'],
              right_on=['game_id','name'],
              how='left')
df.rename(columns={'rest':'away_pitcher_rest'}, inplace=True)
df['rest_diff'] = df['home_pitcher_rest']-df['away_pitcher_rest']

#datetime features
df.dropna(subset=['date'], inplace=True)
df['season'] = df['date'].dt.year
df['month']=df['date'].dt.month
df['week']=df['date'].dt.isocalendar().week.astype('int')
df['dow']=df['date'].dt.weekday
df['date'] = (pd.to_datetime(df['date']) - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s') #epoch time

print("The shape of our main dataframe is now (rows x columns):",df.shape)


The shape of our main dataframe is now (rows x columns): (10596, 1037)
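
The `shift(1)` step inside `get_diff_df` is what makes the placeholder rows for today's games work. Here is a toy sketch of the mechanism. The data is invented, and it uses a simple two-game window rather than the time-based windows (`'5d'`, `'10d'`, ...) from the real code.

```python
import pandas as pd

# invented data: three finished games plus today's placeholder row (R=0)
toy = pd.DataFrame({
    'team': ['NYY'] * 4,
    'R': [4.0, 2.0, 6.0, 0.0],      # last row is today's game, not played yet
    'is_test': [False, False, False, True],
})

# rolling average over the last 2 games, current game included
toy['R_2_Avg'] = toy.groupby('team')['R'].transform(lambda x: x.rolling(2).mean())

# shift within each team so every row only sees games played before it
toy['R_2_Avg_pregame'] = toy.groupby('team')['R_2_Avg'].shift(1)

print(toy)
```

Today's row ends up with a pregame average of 4.0, the mean of the last two real games (2 and 6). The placeholder `R=0` never leaks into its own features, which is why it's safe to append today's games with zeros.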


Now we'll add the power rankings using the code we built in Part 3 of this series.


In [7]:
#collapse-hide

from elote import EloCompetitor
ratings = {}
for x in df.home_team_abbr.unique():
    ratings[x]=EloCompetitor()
for x in df.away_team_abbr.unique():
    ratings[x]=EloCompetitor()

home_team_elo = []
away_team_elo = []
elo_exp = []

df = df.sort_values(by='date').reset_index(drop=True)
for i, r in df.iterrows():
    # get pre-game ratings
    elo_exp.append(ratings[r.home_team_abbr].expected_score(ratings[r.away_team_abbr]))
    home_team_elo.append(ratings[r.home_team_abbr].rating)
    away_team_elo.append(ratings[r.away_team_abbr].rating)
    # update ratings
    if r.home_team_win:
        ratings[r.home_team_abbr].beat(ratings[r.away_team_abbr])
    else:
        ratings[r.away_team_abbr].beat(ratings[r.home_team_abbr])

df['elo_exp'] = elo_exp
df['home_team_elo'] = home_team_elo
df['away_team_elo'] = away_team_elo

#elo slow
ratings = {}
for x in df.home_team_abbr.unique():
    ratings[x]=EloCompetitor()
    ratings[x]._k_score=16
for x in df.away_team_abbr.unique():
    ratings[x]=EloCompetitor()
    ratings[x]._k_score=16

home_team_elo = []
away_team_elo = []
elo_exp = []

df = df.sort_values(by='date').reset_index(drop=True)
for i, r in df.iterrows():
    # get pregame ratings
    elo_exp.append(ratings[r.home_team_abbr].expected_score(ratings[r.away_team_abbr]))
    home_team_elo.append(ratings[r.home_team_abbr].rating)
    away_team_elo.append(ratings[r.away_team_abbr].rating)
    # update ratings
    if r.home_team_win:
        ratings[r.home_team_abbr].beat(ratings[r.away_team_abbr])
    else:
        ratings[r.away_team_abbr].beat(ratings[r.home_team_abbr])

df['elo_slow_exp'] = elo_exp
df['home_team_elo_slow'] = home_team_elo
df['away_team_elo_slow'] = away_team_elo

#glicko
from elote import GlickoCompetitor
ratings = {}
for x in df.home_team_abbr.unique():
    ratings[x]=GlickoCompetitor()
for x in df.away_team_abbr.unique():
    ratings[x]=GlickoCompetitor()

home_team_glick = []
away_team_glick = []
glick_exp = []

df = df.sort_values(by='date').reset_index(drop=True)
for i, r in df.iterrows():
    # get pregame ratings
    glick_exp.append(ratings[r.home_team_abbr].expected_score(ratings[r.away_team_abbr]))
    home_team_glick.append(ratings[r.home_team_abbr].rating)
    away_team_glick.append(ratings[r.away_team_abbr].rating)
    # update ratings
    if r.home_team_win:
        ratings[r.home_team_abbr].beat(ratings[r.away_team_abbr])
    else:
        ratings[r.away_team_abbr].beat(ratings[r.home_team_abbr])

df['glick_exp'] = glick_exp
df['home_team_glick'] = home_team_glick
df['away_team_glick'] = away_team_glick

#trueskill
from trueskill import Rating, quality, rate
ratings = {}
for x in df.home_team_abbr.unique():
    ratings[x]=Rating(25)
for x in df.away_team_abbr.unique():
    ratings[x]=Rating(25)
for x in df.home_pitcher.unique():
    ratings[x]=Rating(25)
for x in df.away_pitcher.unique():
    ratings[x]=Rating(25)

ts_quality = []
pitcher_ts_diff = []
team_ts_diff = []
home_pitcher_ts = []
away_pitcher_ts = []
home_team_ts = []
away_team_ts = []
df = df.sort_values(by='date').copy()
for i, r in df.iterrows():
    # get pre-match trueskill ratings from dict
    match = [(ratings[r.home_team_abbr], ratings[r.home_pitcher]),
            (ratings[r.away_team_abbr], ratings[r.away_pitcher])]
    ts_quality.append(quality(match))
    pitcher_ts_diff.append(ratings[r.home_pitcher].mu-ratings[r.away_pitcher].mu)
    team_ts_diff.append(ratings[r.home_team_abbr].mu-ratings[r.away_team_abbr].mu)
    home_pitcher_ts.append(ratings[r.home_pitcher].mu)
    away_pitcher_ts.append(ratings[r.away_pitcher].mu)
    home_team_ts.append(ratings[r.home_team_abbr].mu)
    away_team_ts.append(ratings[r.away_team_abbr].mu)

    if r.date < df.date.max():
        # update ratings dictionary with post-match ratings
        if r.home_team_win==1:
            match = [(ratings[r.home_team_abbr], ratings[r.home_pitcher]),
                     (ratings[r.away_team_abbr], ratings[r.away_pitcher])]
            [(ratings[r.home_team_abbr], ratings[r.home_pitcher]),
            (ratings[r.away_team_abbr], ratings[r.away_pitcher])] = rate(match)
        else:
            match = [(ratings[r.away_team_abbr], ratings[r.away_pitcher]),
                     (ratings[r.home_team_abbr], ratings[r.home_pitcher])]
            [(ratings[r.away_team_abbr], ratings[r.away_pitcher]),
            (ratings[r.home_team_abbr], ratings[r.home_pitcher])] = rate(match)

df['ts_game_quality'] = ts_quality
df['pitcher_ts_diff'] = pitcher_ts_diff
df['team_ts_diff'] = team_ts_diff
df['home_pitcher_ts'] = home_pitcher_ts
df['away_pitcher_ts'] = away_pitcher_ts
df['home_team_ts'] = home_team_ts
df['away_team_ts'] = away_team_ts

print("The shape of our main dataframe after adding skill rankings is (rows x columns):",df.shape)

The shape of our main dataframe after adding skill rankings is (rows x columns): (10596, 1053)
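
If the rating loops feel like a black box, here is a minimal sketch of the textbook Elo math behind `expected_score` and `beat`. This is the standard formula, not elote's source code, and the starting rating of 1500 and the K values are my assumptions for illustration; elote's defaults may differ.

```python
def elo_expected(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(winner, loser, k=32):
    """Return the new (winner, loser) ratings after one game."""
    delta = k * (1.0 - elo_expected(winner, loser))
    return winner + delta, loser - delta

home, away = 1500.0, 1500.0
pregame_exp = elo_expected(home, away)             # evenly matched teams
fast_home, fast_away = elo_update(home, away)      # home wins, K=32
slow_home, slow_away = elo_update(home, away, 16)  # same result, K=16
print(pregame_exp, fast_home, slow_home)
```

A smaller K moves the ratings less after each game, which is the idea behind the "elo slow" features: they react more gradually to a hot or cold streak.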


Now to update the odds data and get it into the dataframe properly.



In [8]:
import pickle
odds_data = pickle.load(open('covers_data_2.pkl','rb'))

In [9]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

dates = pd.to_datetime(df['date'], unit='s')
game_days = dates.dt.strftime('%Y-%m-%d').unique()
existing_odds_days = [x['date'][:10] for x in odds_data]
new_game_days = [x for x in game_days if x not in existing_odds_days]

for d in new_game_days:
    # get the web page with game data on it
    url = f'https://www.covers.com/Sports/MLB/Matchups?selectedDate={d}'
    resp = requests.get(url)

    # parse the games
    scraped_games = bs(resp.text).findAll('div',{'class':'cmg_matchup_game_box'})
    for g in scraped_games:
        game = {}
        game['home_moneyline'] = g['data-game-odd']
        game['date'] = g['data-game-date']
        game['away_team_abbr'] = g['data-away-team-shortname-search']
        game['home_team_abbr'] = g['data-home-team-shortname-search']
        try:
            game['home_score'] =g.find('div',{'class':'cmg_matchup_list_score_home'}).text.strip()
            game['away_score'] =g.find('div',{'class':'cmg_matchup_list_score_away'}).text.strip()
        except AttributeError: # find() returns None when a game has no score yet
            game['home_score'] =''
            game['away_score'] =''

        odds_data.append(game)
print("Done! Days of odds downloaded:", len(new_game_days))


Done! Days of odds downloaded: 4


Let's integrate that into the dataframe in the same way as in Part 3.



In [10]:
#collapse-hide


import numpy as np
import pandas as pd
odds = pd.DataFrame(odds_data)
odds['home_moneyline'] = odds['home_moneyline'].replace('', np.nan)
odds.dropna(subset=['home_moneyline'], inplace=True)
odds.home_moneyline = pd.to_numeric(odds.home_moneyline)
odds.date = pd.to_datetime(odds.date).dt.date

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# map the covers.com abbreviations onto the ones baseball-reference uses
abbr_map = {'SF':'SFG', 'TB':'TBR', 'WAS':'WSN', 'KC':'KCR', 'SD':'SDP'}
odds['home_team_abbr'] = odds['home_team_abbr'].replace(abbr_map)
odds['away_team_abbr'] = odds['away_team_abbr'].replace(abbr_map)

# convert the home moneyline to the implied probability of a home win
odds['odds_proba']=np.nan
is_fav = odds.home_moneyline<0
is_dog = odds.home_moneyline>0
odds.loc[is_fav, 'odds_proba'] = -odds.home_moneyline[is_fav]/(-odds.home_moneyline[is_fav] + 100)
odds.loc[is_dog, 'odds_proba'] = 100/(odds.home_moneyline[is_dog] + 100)

# get dates into the same format
odds['date'] = (pd.to_datetime(pd.to_datetime(odds['date'])) - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

# do the merge
df = pd.merge_asof(left=df.sort_values(by='date'),
                   right=odds[['home_team_abbr','date', 'away_team_abbr','odds_proba']].sort_values(by='date'),
                   by=['home_team_abbr','away_team_abbr'],
                   on='date')
df = df.sort_values(by='date').copy().reset_index(drop=True)
print('Dataframe shape after adding odds data:', df.shape)

Dataframe shape after adding odds data: (10596, 1054)
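
The `odds_proba` column is just the moneyline turned into an implied win probability. Here is the same conversion as a small standalone function so you can sanity check a line by hand. The function name `moneyline_to_proba` and the two example lines are mine, picked for illustration.

```python
def moneyline_to_proba(moneyline):
    """Implied win probability for an American moneyline (bookmaker margin included)."""
    if moneyline < 0:
        # favorite: risk |moneyline| to win 100
        return -moneyline / (-moneyline + 100)
    if moneyline > 0:
        # underdog: risk 100 to win moneyline
        return 100 / (moneyline + 100)
    return float('nan')  # a moneyline of 0 is left undefined, as in the cell above

favorite = moneyline_to_proba(-243)
underdog = moneyline_to_proba(130)
print(round(favorite, 6), round(underdog, 6))   # 0.708455 0.434783
```

Keep in mind that these are the casino's prices, not fair probabilities. The two sides of a game add up to a bit more than 1, and that extra is the house's cut.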


In [11]:
print("Today's Games and the Home Team Win Probabilities")
test_df = df[df['is_test']][['home_team_abbr', 'away_team_abbr',
                             'home_pitcher','away_pitcher',
                             'odds_proba']]
display(test_df)

Today's Games and the Home Team Win Probabilities


| | home_team_abbr | away_team_abbr | home_pitcher | away_pitcher | odds_proba |
|---|---|---|---|---|---|
| 10581 | NYY | MIA | garcide01 | rogertr01 | 0.708455 |
| 10582 | WSN | NYM | scherma01 | degroja01 | 0.434783 |
| 10583 | OAK | SEA | minormi01 | sheffju01 | 0.677419 |
| 10584 | TOR | BAL | zeuchtj01 | meansjo01 | 0.565217 |
| 10585 | KCR | DET | hernaca04 | boydma01 | 0.565217 |
| 10586 | TEX | HOU | | | 0.444444 |
| 10587 | TBR | PHI | curtijo02 | wheelza01 | 0.512195 |
| 10588 | STL | MIL | wainwad01 | woodrbr01 | 0.5 |
| 10589 | ATL | BOS | | houckta01 | 0.598394 |
| 10590 | CHW | CHC | dunnida01 | lestejo01 | 0.607843 |


## Generate Predictions
We'll continue to use code from the previous part of the series to prepare the data for predictions.

In [12]:
# target encoding
encode_me = [x for x in df.keys() if 'object' in str(df[x].dtype)]
for x in encode_me:
    df[x] = df.groupby(x)['home_team_win'].apply(lambda x:x.rolling(180).mean()).shift(1)

# create Prediction Set
X_test = df[df['is_test']].drop(columns=['home_team_win', 'game_id','is_test'])
len(X_test)

15

In [13]:
import pickle
model = pickle.load(open('xgb_model.pkl','rb'))

test_df['xgb_home_win'] = model.predict(X_test).astype('bool')
test_df['xgb_home_win_proba'] = model.predict_proba(X_test)[:,1]
test_df

| | home_team_abbr | away_team_abbr | home_pitcher | away_pitcher | odds_proba | xgb_home_win | xgb_home_win_proba |
|---|---|---|---|---|---|---|---|
| 10581 | NYY | MIA | garcide01 | rogertr01 | 0.708455 | True | 0.701283 |
| 10582 | WSN | NYM | scherma01 | degroja01 | 0.434783 | False | 0.315103 |
| 10583 | OAK | SEA | minormi01 | sheffju01 | 0.677419 | True | 0.618798 |
| 10584 | TOR | BAL | zeuchtj01 | meansjo01 | 0.565217 | False | 0.492718 |
| 10585 | KCR | DET | hernaca04 | boydma01 | 0.565217 | False | 0.48225 |
| 10586 | TEX | HOU | | | 0.444444 | False | 0.410267 |
| 10587 | TBR | PHI | curtijo02 | wheelza01 | 0.512195 | False | 0.378972 |
| 10588 | STL | MIL | wainwad01 | woodrbr01 | 0.5 | False | 0.395375 |
| 10589 | ATL | BOS | | houckta01 | 0.598394 | True | 0.549559 |
| 10590 | CHW | CHC | dunnida01 | lestejo01 | 0.607843 | True | 0.570431 |


This is good. It looks like our model disagrees with the casino odds for a few games. One trap to watch for: when a pitcher has not been announced, xgboost will still give you a prediction, and it can be very different from the one you'd get once the pitchers are known. My advice is to wait for the pitcher announcements.

The next step is to figure out if any of these games are good bets.

## Kelly Criterion
The [Kelly criterion](https://en.wikipedia.org/wiki/Kelly_criterion) is a betting strategy developed in the 1950s that tells you what percentage of your bankroll to bet based on your edge in each bet. Its objective is to maximize the long-run growth of your bankroll while avoiding the risk of ruin, and that property has been proven mathematically, as long as the win probabilities you feed it are accurate.

People still hate it though. The major criticism is that it sets you up for some pretty extreme bets. People have found two major ways of dealing with this. The first is using a [fractional-Kelly](https://www.pinnacle.com/en/betting-articles/Betting-Strategy/fractional-kelly-criterion/GBD27Z9NLJVGFLGG), usually a half-Kelly or quarter-Kelly, where they use the Kelly formula and bet half or a quarter as much as it tells them. The other is a diversified approach, where the bettor places several simultaneous bets, dedicating a fraction of their bankroll to each bet. It's easy to use the latter in baseball, since multiple games are happening simultaneously.

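Implementing the formula is pretty straightforward. With `p` the probability of winning, `q = 1 - p` the probability of losing, and `b` the profit per dollar bet, the fraction of the bankroll to bet is `p - q/b`. We're going to loop through the data from above, applying the formula to both the home and away teams, then print out only the bets that the formula says have good expected value.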
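
Before running it on the whole slate, here is the arithmetic for a single game, Mets at Nationals, using the numbers from the prediction table above. The helper name `kelly_fraction` is mine; the 1.02 adjustment for the away side is the same one used in the notebook cell.

```python
def kelly_fraction(p_win, net_odds):
    """Kelly stake as a fraction of bankroll: f = p - q/b."""
    q_lose = 1.0 - p_win
    return p_win - q_lose / net_odds

home_odds_proba = 0.434783   # casino's implied probability for WSN (home)
home_ml_proba = 0.315103     # our model's probability for WSN

# away side (NYM): payout derived with the 1.02 adjustment from the notebook
away_b = 1.0 / (1.02 - home_odds_proba) - 1.0
away_p = 1.0 - home_ml_proba
f_away = kelly_fraction(away_p, away_b)

# home side (WSN): a negative fraction means "don't bet this side"
home_b = 1.0 / home_odds_proba - 1.0
f_home = kelly_fraction(home_ml_proba, home_b)

print(round(f_away, 3), round(f_home, 3))
```

The away side comes out at roughly 24% of the bankroll and the home side is negative, so only the Mets bet survives the `kelly_criterion > 0` filter.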

In [14]:
from IPython.display import HTML
bets = []
for i,r in test_df.iterrows():
    p = r['xgb_home_win_proba'] # probability of win
    b = (1/r['odds_proba'])-1 #return
    q = 1-r['xgb_home_win_proba'] # probability of loss
    bet = {
        'date': dt.datetime.now().date(),
        'team': r['home_team_abbr'],
        'pitcher': r['home_pitcher'],
        'opposition':r['away_team_abbr'],
        'opp_pitcher':r['away_pitcher'],
        'odds_proba': r['odds_proba'],
        'dollar return': b,
        'ml_proba': p,
        'kelly_criterion': p-(q/b)
    }
    bets.append(bet)

    # away team
    p = 1 - r['xgb_home_win_proba'] # probability of win
    b = (1/(1.02-r['odds_proba']))-1 #1.02 is calibrated to mgm vs consensus odds
    q = r['xgb_home_win_proba'] # probability of loss

    bet = {
        'date': dt.datetime.now().date(),
        'team': r['away_team_abbr'],
        'pitcher': r['away_pitcher'],
        'opposition':r['home_team_abbr'],
        'opp_pitcher':r['home_pitcher'],
        'odds_proba': 1-r['odds_proba'],
        'dollar return': b,
        'ml_proba': p,
        'kelly_criterion': p-(q/b)
    }
    bets.append(bet)
bet_df = pd.DataFrame(bets)
HTML(bet_df[bet_df['kelly_criterion']>0].to_html(index=False))

| date | team | pitcher | opposition | opp_pitcher | odds_proba | dollar return | ml_proba | kelly_criterion |
|---|---|---|---|---|---|---|---|---|
| 2020-09-26 | NYM | degroja01 | WSN | scherma01 | 0.565217 | 0.708767 | 0.684897 | 0.240319 |
| 2020-09-26 | SEA | sheffju01 | OAK | minormi01 | 0.322581 | 1.919021 | 0.381202 | 0.058747 |
| 2020-09-26 | BAL | meansjo01 | TOR | zeuchtj01 | 0.434783 | 1.198853 | 0.507282 | 0.09629 |
| 2020-09-26 | DET | boydma01 | KCR | hernaca04 | 0.434783 | 1.198853 | 0.51775 | 0.11549 |
| 2020-09-26 | HOU | | TEX | | 0.555556 | 0.737452 | 0.589733 | 0.033403 |
| 2020-09-26 | PHI | wheelza01 | TBR | curtijo02 | 0.487805 | 0.96926 | 0.621028 | 0.230037 |
| 2020-09-26 | MIL | woodrbr01 | STL | wainwad01 | 0.5 | 0.923077 | 0.604625 | 0.176302 |
| 2020-09-26 | BOS | houckta01 | ATL | | 0.401606 | 1.37188 | 0.450441 | 0.049853 |
| 2020-09-26 | CHC | lestejo01 | CHW | dunnida01 | 0.392157 | 1.426261 | 0.429569 | 0.02962 |
| 2020-09-26 | CLE | civalaa01 | PIT | musgrjo01 | 0.636364 | 0.571429 | 0.650989 | 0.040219 |
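
To go from Kelly fractions to actual dollar amounts, one option is to combine the two ideas from earlier: a fractional Kelly on each bet plus a cap on how much of the bankroll is out at once across simultaneous games. This is only a sketch of that idea. The function `size_bets`, the $1,000 bankroll, the half-Kelly multiplier and the 25% exposure cap are all assumptions of mine, not settings from my own betting.

```python
def size_bets(kelly_fractions, bankroll, kelly_multiplier=0.5, max_exposure=1.0):
    """Turn raw Kelly fractions into dollar stakes.

    kelly_multiplier: 0.5 is a half-Kelly, 0.25 a quarter-Kelly.
    max_exposure: largest share of the bankroll allowed across all bets.
    """
    # keep only positive-value bets and apply the fractional Kelly
    scaled = {team: f * kelly_multiplier
              for team, f in kelly_fractions.items() if f > 0}
    total = sum(scaled.values())
    # if the bets together exceed the cap, shrink them all proportionally
    if total > max_exposure:
        scaled = {team: f * max_exposure / total for team, f in scaled.items()}
    return {team: round(bankroll * f, 2) for team, f in scaled.items()}

# first three rows of the table above
fractions = {'NYM': 0.240319, 'SEA': 0.058747, 'BAL': 0.09629}

stakes = size_bets(fractions, bankroll=1000, kelly_multiplier=0.5, max_exposure=0.25)
capped = size_bets(fractions, bankroll=1000, kelly_multiplier=1.0, max_exposure=0.25)
print(stakes)
print(capped)
```

With a half-Kelly these three bets stay under the cap, so they are sized as is. At full Kelly they add up to almost 40% of the bankroll, so the cap scales them all down to a combined $250.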


### Model Interpretation
Above, it looks like the algorithm has selected 12 bets for us, with the Kelly fraction ranging from <1% of our bankroll to 24%. Let's take a deeper look at each.

In the first, the algorithm says to put 24% of our bankroll on the Mets. The casino gives them a 57% chance of winning and so will pay back only about 0.71 in profit for every dollar we bet ("dollar return"). Our machine learning model gives them a 68.5% chance, so there's money to be made.

In the second, the algo says bet on Seattle, but only about 5.9% of the bankroll. The casino thinks they are going to lose, and so does our model. But we give them a better chance of winning (38%) than the casino does (32%), and there's a large payback of about 1.92 for every dollar bet. At that level of return, Kelly says it makes sense to bet on the underdog.

In the Houston game, we don't have pitcher data, so maybe we should not bet on that one.

Etc., etc. I feel like you can take it from here.

## My Learnings
I've used this approach this season and done well. It's been a money maker overall. But my results are still super streaky. I've had 2 days in a row where I've gotten every game right, and then whole weeks where I've steadily lost money every day. I haven't had any day where I lost every game, thankfully. But I'm sure it's coming.

One thing I will say is that it's pretty boring. Betting off a spreadsheet and taking small wins/losses is not exactly an adventure. It starts to feel like more of a job.

I hope this series has helped you out, either getting you started, or giving you some ideas to incorporate back into your process. Get in touch and let me know. I'd love to hear what you are up to.

