Baseball Prediction: 3a - Getting Odds Data

In this notebook, we get historical odds data from oddsshark.com. We use the pandas read_html function to read an HTML table into a DataFrame, then show how to loop programmatically over all the necessary URLs to collect the data we need.

We save this data as a collection of CSV files, one per team and season. In the next notebook, we will use these CSV files to add the odds information to our primary DataFrame.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

In [None]:
import lxml      # HTML parser that pd.read_html can use
import html5lib  # fallback HTML parser for pd.read_html
from urllib.request import urlopen
import time

In [None]:
df1 = pd.read_html('https://www.oddsshark.com/stats/gamelog/baseball/mlb/27000?season=2021')[0]
df1.head(10)

In [None]:
def line_to_prob(line):
    """Convert an American moneyline (e.g. -150 or +130) to an implied win probability."""
    prob_underdog = 100/(np.abs(line)+100) # implied probability for the underdog side of this line
    add_term = ((1-np.sign(line))/2) # 1 if negative, 0 if positive
    mult_factor = np.sign(line) # -1 if negative, 1 if positive
    # if line is positive, team is the underdog: 0 + 1*prob_underdog
    # if line is negative, team is the favorite: 1 + (-1)*prob_underdog
    imp_prob = add_term + mult_factor * prob_underdog
    return imp_prob
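
As a sanity check on the conversion above, here is a minimal scalar sketch of the same rule in plain Python. The name `american_to_prob` is hypothetical (it is not part of this notebook), and like `line_to_prob` it returns the raw implied probability without removing the bookmaker's margin.

```python
def american_to_prob(line):
    """Raw implied win probability for one American moneyline (hypothetical helper)."""
    if line < 0:
        # favorite: risk |line| to win 100
        return -line / (-line + 100)
    # underdog (or even money): risk 100 to win `line`
    return 100 / (line + 100)

for line in (-150, 100, 150):
    print(line, round(american_to_prob(line), 4))
```

A line of -150 maps to 0.6, +100 to 0.5, and +150 to 0.4, which should match what the vectorized version returns for the same inputs.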


Plan of Attack

- Get the numeric ID that oddsshark uses in its URLs for each team
- Read the table for each team, and for each season (2019-2022)
- Lightly process the data frame (remove playoffs, process date, add "game_no", add "team_source", convert the line to an implied probability)
- Save each file
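
The first two steps amount to enumerating one URL per team and season. A minimal sketch, assuming the URL pattern used earlier in this notebook and the team IDs 26995 through 27024 that are mapped to teams later on; the names `BASE_URL` and `build_urls` are hypothetical.

```python
# URL pattern taken from the read_html call earlier in this notebook
BASE_URL = 'https://www.oddsshark.com/stats/gamelog/baseball/mlb/'

def build_urls(team_ids, seasons):
    """Return (team_id, season, url) tuples, one per team and season (hypothetical helper)."""
    return [(team_id, season, BASE_URL + str(team_id) + '?season=' + str(season))
            for team_id in team_ids
            for season in seasons]

urls = build_urls(range(26995, 27025), range(2019, 2023))
print(len(urls))   # 30 teams x 4 seasons
print(urls[0][2])
```

Building the list up front makes it easy to check how many requests the sweep will make before any of them are sent.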

In [None]:
# manually determined mapping from the team number in the oddsshark URL to the team
# values are the 3-letter Retrosheet abbreviations

oddsshark_num_to_team_dict = {}
oddsshark_num_to_team_dict[26995]='PHI'
oddsshark_num_to_team_dict[26996]='SDN'
oddsshark_num_to_team_dict[26997]='SFN'
oddsshark_num_to_team_dict[26998]='ANA'
oddsshark_num_to_team_dict[26999]='DET'
oddsshark_num_to_team_dict[27000]='CIN'
oddsshark_num_to_team_dict[27001]='NYA'
oddsshark_num_to_team_dict[27002]='TEX'
oddsshark_num_to_team_dict[27003]='TBA'
oddsshark_num_to_team_dict[27004]='COL'
oddsshark_num_to_team_dict[27005]='MIN'
oddsshark_num_to_team_dict[27006]='KCA'
oddsshark_num_to_team_dict[27007]='ARI'
oddsshark_num_to_team_dict[27008]='BAL'
oddsshark_num_to_team_dict[27009]='ATL'
oddsshark_num_to_team_dict[27010]='TOR'
oddsshark_num_to_team_dict[27011]='SEA'
oddsshark_num_to_team_dict[27012]='MIL'
oddsshark_num_to_team_dict[27013]='PIT'
oddsshark_num_to_team_dict[27014]='NYN'
oddsshark_num_to_team_dict[27015]='LAN'
oddsshark_num_to_team_dict[27016]='OAK'
oddsshark_num_to_team_dict[27017]='WAS'
oddsshark_num_to_team_dict[27018]='CHA'
oddsshark_num_to_team_dict[27019]='SLN'
oddsshark_num_to_team_dict[27020]='CHN'
oddsshark_num_to_team_dict[27021]='BOS'
oddsshark_num_to_team_dict[27022]='MIA'
oddsshark_num_to_team_dict[27023]='HOU'
oddsshark_num_to_team_dict[27024]='CLE'

In [None]:
for i in range(26995, 27025):
    team_name = oddsshark_num_to_team_dict[i]
    print(team_name)
    for season in range(2019,2023):
        print(season)
        url = 'https://www.oddsshark.com/stats/gamelog/baseball/mlb/'+str(i)+'?season='+str(season)
        df_temp = pd.read_html(url)[0]
        # keep regular-season games only; .copy() avoids SettingWithCopyWarning on the assignments below
        df_temp = df_temp[df_temp.Game=='REG'].copy()
        print(df_temp.shape)
        df_temp['team_source'] = team_name
        df_temp['season'] = season
        # date as a string with the dashes removed (YYYYMMDD for date-only values)
        df_temp['date_numeric'] = pd.to_datetime(df_temp.Date).astype(str).str.replace('-','')
        # running game count; assumes the rows are in chronological order
        df_temp['game_no'] = np.arange(1,df_temp.shape[0]+1)
        df_temp['prob_implied'] = line_to_prob(df_temp['Line'])
        # flag doubleheaders: 0 = single game, 1 = first game, 2 = second game
        # game 1 shares its date with the next row, game 2 with the previous row
        next_game_date = np.concatenate((df_temp['date_numeric'].iloc[1:],[0]))
        previous_game_date = np.concatenate(([0], df_temp['date_numeric'].iloc[:-1]))
        game_1_dblheader = (df_temp.date_numeric.to_numpy()==next_game_date).astype(int)
        game_2_dblheader = (df_temp.date_numeric.to_numpy()==previous_game_date).astype(int)*2
        df_temp['dblheader_num'] = game_1_dblheader+game_2_dblheader
        fname_out = 'oddsshark_'+team_name+'_'+str(season)+'.csv'
        df_temp.to_csv('/Users/gilliancurtis/Desktop/beatingVegas/oddshark/'+fname_out,index=False)
        # pause between requests to avoid overloading the site
        time.sleep(0.5)
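
To see what the doubleheader flag produces, here is a small sketch on made-up dates (the toy data is not from oddsshark). It uses `Series.shift` rather than `np.concatenate` to line each game up with its neighbors, so it is an alternative formulation of the same idea, not the loop's own code.

```python
import pandas as pd

# made-up schedule: the two games on 20210402 form a doubleheader
dates = pd.Series(['20210401', '20210402', '20210402', '20210403'])

# 1 where the next game has the same date (first game of a doubleheader)
game_1 = dates.eq(dates.shift(-1)).astype(int)
# 2 where the previous game has the same date (second game of a doubleheader)
game_2 = dates.eq(dates.shift(1)).astype(int) * 2

dblheader_num = game_1 + game_2
print(dblheader_num.tolist())
```

The flag is 0 for ordinary games, 1 for the first game of a doubleheader, and 2 for the second.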