# Predicting Total Runs in MLB Games

The purpose of this notebook is to provide an end-to-end example of how one can go about gathering, cleaning, modeling, and interpreting sports data. Specifically, we will gather data that allows us to build a classifier that predicts whether or not the total runs scored in MLB games will be above or below the over/under values set by various sportsbooks.

For the purposes of this analysis, we will assume that the odds at which a bettor can bet on the over/unders of all games are -110. To demonstrate what this means: suppose a bettor risks 110 dollars that the total runs scored in a particular MLB game will be over 8.5. If 9 or more runs are scored, the bettor wins 100 dollars (and retains his or her initial 110 dollars). If 8 or fewer runs are scored in the game, the bettor loses his or her 110 dollars.

Assuming that we can always bet at -110 odds (which is a reasonable assumption), we need to win 110/210, or about 52.4%, of our bets in order to break even. Thus, our ultimate goal is to build a classifier with accuracy greater than 52.4% (for context, professional sports bettors are estimated to have win rates between 55% and 60%).
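
The break-even figure follows directly from the payout structure. Below is a minimal sketch of that arithmetic; the function names are illustrative and are not used elsewhere in the notebook.

```python
def profit_if_win(odds, stake):
    """Profit (excluding the returned stake) on a winning bet at American odds."""
    if odds < 0:
        return stake * 100 / abs(odds)
    return stake * odds / 100


def break_even_win_rate(odds):
    """Win rate at which the expected profit is zero at the given American odds."""
    risk = abs(odds) if odds < 0 else 100  # amount risked to win `win`
    win = 100 if odds < 0 else odds
    return risk / (risk + win)


def expected_profit_per_bet(win_rate, odds, stake):
    """Average profit per bet for a bettor who wins a fraction `win_rate` of bets."""
    return win_rate * profit_if_win(odds, stake) - (1 - win_rate) * stake


print(profit_if_win(-110, 110))                            # 100.0
print(round(break_even_win_rate(-110), 4))                 # 0.5238
print(round(expected_profit_per_bet(0.55, -110, 110), 2))  # 5.5
```

At a 55% win rate, a bettor risking 110 dollars per bet would therefore expect to gain roughly 5.50 dollars per bet on average, under the assumption that every bet is priced at -110.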

# Part 1: Obtain Historical Game Features

For all of the web scraping in this project, we will use two libraries: `requests` to download pages and `BeautifulSoup` to parse them.

In [48]:
import pandas as pd
import numpy as np
import requests
import sys
import time
import math
import statistics
from bs4 import BeautifulSoup, Comment
from IPython.display import display, clear_output

In [2]:
def find_chars(a_str, sub):
    '''
    Auxiliary function to find all instances of a string "sub" within a larger string "a_str"
    '''
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

In [3]:
def parse_html(string, char):
    '''
    Auxiliary function that returns all characters within a string until character "char" is found
    '''
    output = ""
    for i in range(len(string)):
        if string[i] != char:
            output = output + string[i]
        else:
            break
    return output

In [4]:
def get_pitcher_stats(comment_str, preceding_str, away_pitcher_index, home_pitcher_index):
    '''
    Auxiliary function that returns specific statistics embedded in a comment, based on certain strings that
    precede the statistics of interest.
    '''
    indexes = [i+len(preceding_str) for i in list(find_chars(comment_str, preceding_str))]
    for index in indexes:
        if index > away_pitcher_index:
            away_pitcher_stat_index = index
            break
    for index in indexes:
        if index > home_pitcher_index:
            home_pitcher_stat_index = index
            break
    away_pitcher_stat = parse_html(comment_str[away_pitcher_stat_index:], "<")
    home_pitcher_stat = parse_html(comment_str[home_pitcher_stat_index:], "<")
    return away_pitcher_stat, home_pitcher_stat
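
To illustrate how the three helpers above fit together, here is a small sketch on a made-up fragment. The markup is invented for the example and is not real baseball-reference HTML, and the helpers are re-declared in compact form (with different names) only so that the block runs on its own.

```python
def find_all(text, sub):
    """Yield the start index of every non-overlapping occurrence of sub in text."""
    start = text.find(sub)
    while start != -1:
        yield start
        start = text.find(sub, start + len(sub))


def read_until(text, stop_char):
    """Return the part of text that precedes the first stop_char."""
    return text.split(stop_char, 1)[0]


def first_value_after(text, marker, position):
    """Return the value that follows the first marker located after position."""
    for idx in find_all(text, marker):
        value_start = idx + len(marker)
        if value_start > position:
            return read_until(text[value_start:], "<")
    return None


# invented fragment: each name follows 'html">' and each stat follows '"IP" >'
toy = ('<a href="p1.shtml">Pitcher A</a><td data-stat="IP" >6.1</td>'
       '<a href="p2.shtml">Pitcher B</a><td data-stat="IP" >7.0</td>')

name_positions = [i + len('html">') for i in find_all(toy, 'html">')]
names = [read_until(toy[p:], "<") for p in name_positions]
innings = [first_value_after(toy, '"IP" >', p) for p in name_positions]

print(names)    # ['Pitcher A', 'Pitcher B']
print(innings)  # ['6.1', '7.0']
```

The key idea is positional: once we know where a pitcher's name starts, the first statistic marker that appears after that position belongs to that pitcher.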

In [5]:
def get_pitching_info(pitching_link, year, month, day):
    '''
    This function scrapes baseball-reference and retrieves many features related to pitching statistics for
    MLB games played on a specific date
    
    Args:
        pitching_link: A web extension that contains detailed pitching information for a given game
        year: The year of the game
        month: The month of the game
        day: The day of the game
    
    Returns:
        away_pitcher: name of the away team starting pitcher
        home_pitcher: name of the home team starting pitcher
        away_pitcher_IP: away team starting pitcher's innings pitched in game
        home_pitcher_IP: home team starting pitcher's innings pitched in game
        away_pitcher_postgame_ERA: away team starting pitcher's ERA (including stats from game of interest)
        home_pitcher_postgame_ERA: home team starting pitcher's ERA (including stats from game of interest)
        away_pitcher_ER: away team starting pitcher's earned runs in game
        home_pitcher_ER: home team starting pitcher's earned runs in game
        
    '''
    # define the specific baseball reference link for MLB game pitching statistics
    pitcher_url = requests.get(pitching_link)
    # ensure page was loaded properly
    if pitcher_url.status_code == 200:
        # initialize BeautifulSoup object
        soup = BeautifulSoup(pitcher_url.content, 'html.parser')
        # parse through HTML body to retrieve the element of the site that shows team starting lineups
        soup1 = soup.find("body").find("div", {"id" : "wrap"}).find("div", {"id" : "content"}) 
        soup2 = soup1.find("div", {"class" : "grid_wrapper",
                      "id" : "all_lineups"})
        # interestingly, the starting lineups we need to parse are saved as a single HTML comment
        # get the comment
        comment = soup2.find_all(string=lambda text:isinstance(text,Comment))[0]
        comment_str = str(comment) # convert the comment to a string
        # each player in a starting lineup is preceded by a string 'html">'
        # while seemingly cryptic, this allows us to quickly identify the indexes where starting pitcher
        # names lie within the comment
        indexes = [i+6 for i in list(find_chars(comment_str, 'html">'))]
        # depending on whether the game is played at a National League or American league stadium,
        # either 9 or 10 players, respectively, will be listed for each team's starting lineup
        if len(indexes) == 18: # if National league game, there will be 9 players on each team
            away_pitcher_index = indexes[8] # get index corresponding to first char of away pitcher's name
            home_pitcher_index = indexes[17] # get index corresponding to first char of home pitcher's name
            # retrieve the full names of the two pitchers by calling function parse_html
            away_pitcher = parse_html(comment_str[away_pitcher_index:], "<")
            home_pitcher = parse_html(comment_str[home_pitcher_index:], "<")
        elif len(indexes) == 20: # if American league game, there will be 10 players on each team
            away_pitcher_index = indexes[9] # get index corresponding to first char of away pitcher's name
            home_pitcher_index = indexes[19] # get index corresponding to first char of home pitcher's name
            # retrieve the full names of the two pitchers by calling function parse_html
            away_pitcher = parse_html(comment_str[away_pitcher_index:], "<")
            home_pitcher = parse_html(comment_str[home_pitcher_index:], "<")
        else:
            sys.exit("Could not find starting pitchers.")
        # now we want to get a few different stats for each starting pitcher in the game
        # specifically, we want innings pitched (IP), ERA (which takes into account the outcome of the game),
        # and earned runs (ER)
        
        # again, the statistics of interest are embedded in a comment. We need to parse this
        soup2 = soup1.find_all("div", {"class" : "section_wrapper"})[1]
        comment = soup2.find_all(string=lambda text:isinstance(text,Comment))[0]
        comment_str = str(comment) # convert comment to a string
        # as we saw before, player names are preceded with 'html">'
        # let's get the indexes for the two starting pitchers
        # note: this is not redundant; this is a new comment so the name indexes have changed
        indexes = [i+6 for i in list(find_chars(comment_str, 'html">'))]
        for index in indexes:
            pitcher = parse_html(comment_str[index:], "<")
            if pitcher == away_pitcher:
                away_pitcher_index = index
            if pitcher == home_pitcher:
                home_pitcher_index = index
        # the above for loop now gives us away_pitcher_index and home_pitcher_index within the comment
        # now we can call function get_pitcher_stats to retrieve the statistics of interest
        away_pitcher_IP, home_pitcher_IP = get_pitcher_stats(comment_str, '"IP" > ',
                                                             away_pitcher_index, home_pitcher_index)
        away_pitcher_ERA, home_pitcher_ERA = get_pitcher_stats(comment_str, '"earned_run_avg" >',
                                                              away_pitcher_index, home_pitcher_index)
        away_pitcher_ER, home_pitcher_ER = get_pitcher_stats(comment_str, '"ER" >',
                                                            away_pitcher_index, home_pitcher_index)
    else:
        # exit and throw an error if the page could not be loaded
        sys.exit("Pitching page cannot be loaded.")
    
    return away_pitcher, home_pitcher, away_pitcher_IP, home_pitcher_IP, away_pitcher_ERA, home_pitcher_ERA, away_pitcher_ER, home_pitcher_ER

In [6]:
def get_stats(df, year, month, day):
    '''
    This function scrapes baseball-reference and retrieves many features for MLB games played on a specific date,
    appending to a dataframe a row of features for each game
    
    Args:
        df: A dataframe with columns "away", "home", "away_score", "home_score", "year", "month", "day"
        year: The year of the game
        month: The month of the game
        day: The day of the game
    
    Returns:
        df: The same dataframe that was passed in as an argument, with appended rows.
    '''
    # define the specific baseball reference link for MLB games played on a given date
    url = requests.get("https://www.baseball-reference.com/boxes/?month=" \
                       + str(month) + "&day=" + str(day) + "&year=" + str(year))
    # ensure the page could be loaded successfully
    if url.status_code == 200:
        # initialize the BeautifulSoup object
        soup = BeautifulSoup(url.content, 'html.parser')
        # parse the HTML body to find all games played on this date
        game_summaries = soup.find("body").find("div", {"id" : "wrap"}).find("div", {"id" : "content"})\
        .find("div", {"class" : "game_summaries"})
        try:
            # each game is uniquely defined through a "div" tag with "class"="game_summary nohover"
            # store all the games in a list called games, which we will iterate through to grab features
            games = game_summaries.find_all("div", {"class" : "game_summary nohover"})
            for game in games:
                # the away team is always presented first
                away_team_info = game.find("table").find_all("tr")[0]
                # the home team is always presented second
                home_team_info = game.find("table").find_all("tr")[1]
                # note: playoff games are formatted slightly differently, but we are excluding these in
                # the analysis, since MLB teams treat playoffs very differently than regular season games
                away_team = away_team_info.find("td").text # retrieve name of away team
                home_team = home_team_info.find("td").text # retrieve name of home team
                # get the final scores of both teams and convert them to integers
                away_team_score = int(away_team_info.find("td", {"class" : "right"}).text)
                home_team_score = int(home_team_info.find("td", {"class" : "right"}).text)
                # now we need to click on a new link to get specific pitcher info
                # get the link extension
                game_link = game.find("table").find("tbody").find("tr").find("td", {"class" : "right gamelink"}) \
                .find("a").get("href")
                # concatenate the baseball-reference domain with game_link to get full site link
                pitching_link = "https://www.baseball-reference.com" + game_link
                
                # pass pitching_link into function get_pitching_info
                away_pitcher, home_pitcher, away_pitcher_IP, home_pitcher_IP,\
                away_pitcher_ERA, home_pitcher_ERA, away_pitcher_ER, home_pitcher_ER = get_pitching_info(pitching_link, year, month, day)
                # append features as a one-row dataframe
                # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
                new_row = pd.DataFrame([{"away" : away_team,
                                         "home" : home_team,
                                         "away_score" : away_team_score,
                                         "home_score" : home_team_score,
                                         "year" : year,
                                         "month" : month,
                                         "day" : day,
                                         "away_pitcher" : away_pitcher,
                                         "home_pitcher" : home_pitcher,
                                         "away_pitcher_IP" : away_pitcher_IP,
                                         "home_pitcher_IP" : home_pitcher_IP,
                                         "away_pitcher_postgame_ERA" : away_pitcher_ERA,
                                         "home_pitcher_postgame_ERA" : home_pitcher_ERA,
                                         "away_pitcher_ER" : away_pitcher_ER,
                                         "home_pitcher_ER" : home_pitcher_ER}])
                df = pd.concat([df, new_row], ignore_index = True)
        except:
            pass
    else:
        # exit and throw an error if the page could not be loaded
        sys.exit("Page cannot be loaded.")
    return df


Now that the functionality is set up, we can retrieve the data by running the cell below. The running time will vary by machine, but expect it to take an hour or two to complete. Additionally, the scraper may error out once or twice during the process, because the site tries to limit automated scraping. If you do see an error, note that `df` still holds the rows gathered so far, so you can update the `start` date below (without re-running the line that re-initializes `df`) and continue from where you left off.

When completed, we will save the data to a csv file so that we can easily access the data again.
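
One way to make a long scraping run more robust is to retry a failed request after a pause instead of stopping. The sketch below is not part of the original notebook: `fetch_with_retries` and its parameters are hypothetical names, and the demo uses a stand-in response object so that it runs without network access.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200, waiting longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")


class FakeResponse:
    """Stand-in for a response object; only status_code is needed here."""
    def __init__(self, status_code):
        self.status_code = status_code


# demo: the first call is rejected, the second succeeds
statuses = iter([429, 200])
waits = []  # records the pauses instead of actually sleeping
result = fetch_with_retries(lambda url: FakeResponse(next(statuses)),
                            "https://example.com", sleep=waits.append)
print(result.status_code, waits)  # 200 [1.0]
```

In the notebook, one could pass `requests.get` as `fetch` and keep the default `sleep`. Whether a given site tolerates retries, and how long to wait between them, is an assumption to check against that site's terms of use.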

In [12]:
# initialize an empty dataframe with the columns of interest
df = pd.DataFrame(columns = ["away", "home", "away_score", "home_score", "year", "month", "day",
                            "away_pitcher", "home_pitcher", "away_pitcher_IP", "home_pitcher_IP",
                            "away_pitcher_postgame_ERA", "home_pitcher_postgame_ERA",
                             "away_pitcher_ER", "home_pitcher_ER"])

# enter the date range for which we want to gather game data
dates = pd.date_range(start='3/1/2013', end='12/1/2018')

# iterate through all dates and call function get_stats each time
for i in range(len(dates)):
    clear_output()
    month, day, year = dates[i].month, dates[i].day, dates[i].year
    display("Getting scores for " + str(month) + "/" + str(day) + "/" + str(year))
    if month not in [12,1,2]:
        df = get_stats(df, year, month, day)

# save file to working directory
df.to_csv("df.csv")

'Getting scores for 3/2/2013'

We have now obtained a dataframe with 15 features for every (non-playoff) MLB game that has been played between the 2013 season and the 2018 season, inclusive.

# Part 2: Get Earned Run Averages (ERAs)

Although we scraped starting pitchers' earned run averages (ERAs) above, these ERAs include the earned runs and innings pitched from the game of interest, which is not what we want. We are instead interested in each pitcher's ERA before the game begins, which I had a very difficult time finding during my data search. Fortunately, ERAs are easy to calculate, so below we create `pregame_ERA` variables for both the `away_pitcher` and the `home_pitcher`. Because we previously scraped post-game ERAs, we also have a baseline against which to check our computations.

As a reference, the general formula for earned run averages is as follows:

$$ERA = \frac{\text{total earned runs}}{\text{total innings pitched}} \times 9$$
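
As a small worked example of this formula, the sketch below computes pregame ERAs for a made-up sequence of starts. It is a compact alternative to the cells that follow, not a copy of them: the function names are illustrative, running totals are keyed by `(season, pitcher)` in a single dictionary, the fractional innings are rounded to whole outs, and `None` stands in for the `np.nan` used in the notebook.

```python
from collections import defaultdict


def innings_to_decimal(ip):
    """Convert box-score notation (6.1 = 6 1/3 innings, 6.2 = 6 2/3) to a true decimal."""
    whole = int(ip)
    outs = round((ip - whole) * 10)  # 0, 1, or 2 outs recorded in a partial inning
    return whole + outs / 3


def era(earned_runs, innings):
    """Earned run average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings


def pregame_eras(starts):
    """starts: chronological list of (season, pitcher, ip, er) tuples."""
    totals = defaultdict(lambda: [0, 0.0])  # (season, pitcher) -> [earned runs, innings]
    result = []
    for season, pitcher, ip, er in starts:
        runs, innings = totals[(season, pitcher)]
        # no prior innings this season means there is no pregame ERA yet
        result.append(era(runs, innings) if innings > 0 else None)
        totals[(season, pitcher)][0] += er
        totals[(season, pitcher)][1] += innings_to_decimal(ip)
    return result


# made-up starts for a hypothetical pitcher
starts = [(2013, "Pitcher A", 6.1, 2),
          (2013, "Pitcher A", 7.0, 3),
          (2014, "Pitcher A", 5.2, 1)]
print(pregame_eras(starts))
```

The first and third entries are empty because each is the pitcher's first start of a season. The second is 9 × 2 / 6⅓ ≈ 2.84, using only the totals from the earlier start.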

In [44]:
# read in the dataframe that we saved at the conclusion of part 1
df = pd.read_csv("df.csv")

# initialize two new columns in dataframe that will hold pitcher cumulative ERAs
df['home_pitcher_pregame_ERA'] = 0.0
df['away_pitcher_pregame_ERA'] = 0.0

# to compute ERAs, we need every pitcher's cumulative earned runs and cumulative innings pitched
# we will reset these statistics at the beginning of every season
# initialize 12 dictionaries that hold pitcher ER and IP statistics
pitcher_ERs2013 = {}
pitcher_IPs2013 = {}
pitcher_ERs2014 = {}
pitcher_IPs2014 = {}
pitcher_ERs2015 = {}
pitcher_IPs2015 = {}
pitcher_ERs2016 = {}
pitcher_IPs2016 = {}
pitcher_ERs2017 = {}
pitcher_IPs2017 = {}
pitcher_ERs2018 = {}
pitcher_IPs2018 = {}

# create a list that holds the above dictionaries
dicts = [pitcher_ERs2013, pitcher_IPs2013,
         pitcher_ERs2014, pitcher_IPs2014,
         pitcher_ERs2015, pitcher_IPs2015,
         pitcher_ERs2016, pitcher_IPs2016,
         pitcher_ERs2017, pitcher_IPs2017,
         pitcher_ERs2018, pitcher_IPs2018]

# initialize the values in all dictionaries to zero for all pitchers
for dictionary in dicts:
    for pitcher in df.home_pitcher.unique():
        dictionary[pitcher] = 0.0
    for pitcher in df.away_pitcher.unique():
        dictionary[pitcher] = 0.0

In [46]:
# note this cell will take a few minutes to run
for i in range(len(df)): # for every row in df
    year = df.year.iloc[i]
    home_pitcher = df.home_pitcher.iloc[i]
    away_pitcher = df.away_pitcher.iloc[i]
    
    # determine which dictionaries should be updated based on value for year
    if year == 2013:
        pitcher_ER_dict = pitcher_ERs2013
        pitcher_IP_dict = pitcher_IPs2013
    elif year == 2014:
        pitcher_ER_dict = pitcher_ERs2014
        pitcher_IP_dict = pitcher_IPs2014
    elif year == 2015:
        pitcher_ER_dict = pitcher_ERs2015
        pitcher_IP_dict = pitcher_IPs2015
    elif year == 2016:
        pitcher_ER_dict = pitcher_ERs2016
        pitcher_IP_dict = pitcher_IPs2016
    elif year == 2017:
        pitcher_ER_dict = pitcher_ERs2017
        pitcher_IP_dict = pitcher_IPs2017
    else:
        pitcher_ER_dict = pitcher_ERs2018
        pitcher_IP_dict = pitcher_IPs2018

    # calculate pregame ERA for home pitcher
    # note: df.at writes directly into df (chained assignment such as df.col.iloc[i] = ... may write to a copy)
    # this relies on the default 0..n-1 index produced by pd.read_csv
    if pitcher_IP_dict[home_pitcher] == 0:
        df.at[i, "home_pitcher_pregame_ERA"] = np.nan # if this is the pitcher's first outing, assign it a null value
    else:
        df.at[i, "home_pitcher_pregame_ERA"] = (pitcher_ER_dict[home_pitcher] / pitcher_IP_dict[home_pitcher]) * 9
    # calculate pregame ERA for away pitcher
    if pitcher_IP_dict[away_pitcher] == 0:
        df.at[i, "away_pitcher_pregame_ERA"] = np.nan # if this is the pitcher's first outing, assign it a null value
    else:
        df.at[i, "away_pitcher_pregame_ERA"] = (pitcher_ER_dict[away_pitcher] / pitcher_IP_dict[away_pitcher]) * 9
    
    # update dictionary values with post-game ER values
    pitcher_ER_dict[home_pitcher] = pitcher_ER_dict[home_pitcher] + df.home_pitcher_ER.iloc[i]
    pitcher_ER_dict[away_pitcher] = pitcher_ER_dict[away_pitcher] + df.away_pitcher_ER.iloc[i]
    
    # update dictionary values with post-game IP values
    # note: if IP = 6.1, this really means 6 1/3 innings, i.e. about 6.33 (and 6.2 means about 6.67)
    # the extra math in the below computations adjusts for this
    pitcher_IP_dict[home_pitcher] = pitcher_IP_dict[home_pitcher] + math.floor(df.home_pitcher_IP.iloc[i]) \
    + (((df.home_pitcher_IP.iloc[i] - math.floor(df.home_pitcher_IP.iloc[i])) * 10) / 3)
    pitcher_IP_dict[away_pitcher] = pitcher_IP_dict[away_pitcher] + math.floor(df.away_pitcher_IP.iloc[i]) \
    + (((df.away_pitcher_IP.iloc[i] - math.floor(df.away_pitcher_IP.iloc[i])) * 10) / 3)


# Part 3: Scrape Historical Over/Under Odds

We need over/under lines (i.e. the projected total runs per game) in order to build any predictive model. Many sportsbooks do not keep historical odds public (seemingly to prevent people from doing exactly what we are doing right now), but I was able to find them on a website called DonBest. The site displays the historical odds from five different sportsbooks. The lines from each source are generally very similar for any given game, but in case they differ, we take the median of the five reported over/under lines.
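
As a quick illustration of the median step, here is a sketch with made-up lines from five books. The `outcome` helper is an addition for illustration only: it shows how a final score compares against the consensus line, including the tie case (a "push") that can occur when the line is a whole number.

```python
import statistics


def consensus_line(lines):
    """Median of the over/under lines reported by the different sportsbooks."""
    return statistics.median(lines)


def outcome(total_runs, line):
    """Compare the runs actually scored against the line."""
    if total_runs > line:
        return "over"
    if total_runs < line:
        return "under"
    return "push"  # total equals the line, so neither side wins


# made-up lines for one game; four books roughly agree and one is higher
lines = [8.5, 8.5, 9.0, 8.5, 8.0]
line = consensus_line(lines)
print(line)               # 8.5
print(outcome(9, line))   # over
print(outcome(8, line))   # under
```

Using the median rather than the mean means that a single book posting an unusual line does not move the consensus.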

We will again use `requests` and `BeautifulSoup` to scrape the odds for each game. We will also scrape team names and team scores, so that we have the fields needed to build a key on which to merge the odds data with the game data we have collected thus far.
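
A merge key built from several fields is only reliable if both sources format those fields identically. The sketch below is a hypothetical helper, not the notebook's own key construction: it normalizes numeric fields to integers (so that `"04"` and `4` agree) and joins the parts with a separator (so that month 1, day 11 cannot be confused with month 11, day 1).

```python
def make_game_key(away, away_score, home, home_score, year, month, day):
    """Build a merge key from game fields, normalizing types and adding separators."""
    parts = [away.strip(), int(away_score), home.strip(), int(home_score),
             int(year), int(month), int(day)]
    return "|".join(str(p) for p in parts)


# the same made-up game as it might appear in the two sources
game_key = make_game_key("Team A", 3, "Team B", 5, 2013, 4, 5)
odds_key = make_game_key("Team A", "3", "Team B", "5", 2013, "04", "05")

print(game_key)              # Team A|3|Team B|5|2013|4|5
print(game_key == odds_key)  # True
```

With pandas, such a function could be applied row by row (for example through `DataFrame.apply` with `axis=1`) to create the key column in each dataframe before merging. Note that two games between the same teams on the same day with the same score would still share a key.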

In [49]:
def get_odds(df, year, month, day):
    '''
    This function scrapes historical over/under betting odds for MLB games played on a given date,
    appending to a dataframe a row for each game.
    
    Args:
        df: A dataframe with columns "away", "home", "away_score", "home_score", "over_under", "year", "month", "day"
        year: The year of the game
        month: The month of the game
        day: The day of the game
    
    Returns:
        df: The same dataframe that was passed in as an argument, with appended rows.
    '''
    # format the date correctly, as expected by the donbest url
    if day < 10:
        day = "0" + str(day)
    if month < 10:
        month = "0" + str(month)
    # define the specific donbest link for MLB games that were played on a given date
    url = requests.get("http://www.donbest.com/mlb/odds/" + str(year) + str(month) + str(day) + ".html")
    if url.status_code == 200: # ensure the page was loaded properly
        try:
            # initialize BeautifulSoup object
            soup = BeautifulSoup(url.content, 'html.parser')
            soup1 = soup.find("body").find("form").find("div").find("div").find("div", {"id" : "col1"})
            soup2 = soup1.find("div").find("div").find("div", {"id" : "module2_2"}).find("div").find("div")
            soup3 = soup2.find("div").find("div", {"id" : "_DivOutput"}).find("div").find("div", {"class" : "odds_gamesHolder"})
            # rows in the table are alternately called "rows" or "alternateRows"
            rows = soup3.find("table").find_all("tr", {"class" : "statistics_table_row"})
            alt_rows = soup3.find("table").find_all("tr", {"class" : "statistics_table_alternateRow"})
            for i in range(len(rows)): # for every row in the table
                try:
                    teams = rows[i].find("td", {"class" : "alignLeft"}).find("a")
                    # get the name of the away team
                    away_team = teams.find("span").text
                    # get the name of the home team
                    home_team = teams.find_all("span")[1].text
                    scores = rows[i].find_all("td", {"class" : "alignCenter"})[1]
                    # get away team score
                    away_team_score = scores.find("div").find("b").text
                    # get home team score
                    home_team_score = scores.find_all("div")[1].find("b").text
                    # initialize an array that holds the over/under odds from five different sportsbooks
                    totals = []
                    for j in range(6, 11):
                        soup5 = float(rows[i].find_all("td")[j].find("div").text)
                        if soup5 > 0 and soup5 < 25:
                            # append the over/under line to the totals array
                            totals.append(soup5)
                        else:
                            soup5 = float(rows[i].find_all("td")[j].find_all("div")[1].text)
                            # append the over/under line to the totals array
                            totals.append(soup5)
                    # sort the totals array
                    totals = sorted(totals)
                    # take the median of values in the totals array
                    over_under = statistics.median(totals)
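                    # append a one-row dataframe to df
                    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
                    new_row = pd.DataFrame([{"away" : away_team,
                        "home" : home_team,
                        "away_score" : away_team_score,
                        "home_score" : home_team_score,
                        "over_under" : over_under,
                        "year" : year,
                        "month" : month,
                        "day" : day}])
                    df = pd.concat([df, new_row], ignore_index = True)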
                except:
                    pass
            # the below for loop is analogous to the one above, just now for alt_rows instead of rows
            for i in range(len(alt_rows)):
                try:
                    teams = alt_rows[i].find("td", {"class" : "alignLeft"}).find("a")
                    away_team = teams.find("span").text
                    home_team = teams.find_all("span")[1].text
                    scores = alt_rows[i].find_all("td", {"class" : "alignCenter"})[1]
                    away_team_score = scores.find("div").find("b").text
                    home_team_score = scores.find_all("div")[1].find("b").text
                    totals = []
                    for j in range(6, 11):
                        soup5 = float(alt_rows[i].find_all("td")[j].find("div").text)
                        if soup5 > 0 and soup5 < 25:
                            totals.append(soup5)
                        else:
                            soup5 = float(alt_rows[i].find_all("td")[j].find_all("div")[1].text)
                            totals.append(soup5)
                    totals = sorted(totals)
                    over_under = statistics.median(totals)
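                    # pd.concat is used because DataFrame.append was removed in pandas 2.0
                    new_row = pd.DataFrame([{"away" : away_team,
                        "home" : home_team,
                        "away_score" : away_team_score,
                        "home_score" : home_team_score,
                        "over_under" : over_under,
                        "year" : year,
                        "month" : month,
                        "day" : day}])
                    df = pd.concat([df, new_row], ignore_index = True)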
                except:
                    pass
        except:
            pass
    # if page cannot be loaded, exit and throw an error
    else:
        sys.exit("Page cannot be loaded.")
    return df

Again, the cell below may take a bit to run, depending on the machine as well as the date range that is set (default is 3/1/2013 through 12/1/2018).

In [51]:
# initialize an empty dataframe with the columns of interest
odds_df = pd.DataFrame(columns = ["away", "home", "away_score", "home_score", "over_under", "year", "month", "day"])

# enter the date range for which we want to gather odds data
dates = pd.date_range(start='3/1/2013', end='12/1/2018')

# iterate through all dates and call function get_odds each time
for i in range(len(dates)):
    clear_output()
    month, day, year = dates[i].month, dates[i].day, dates[i].year
    display("Getting odds for " + str(month) + "/" + str(day) + "/" + str(year))
    if month not in [12,1,2]:
        odds_df = get_odds(odds_df, year, month, day)
        
# save file to working directory
odds_df.to_csv("odds_df.csv")

'Getting odds for 5/5/2013'

# Part 4: Merge and Clean Data Sources

In [53]:
# read in the odds dataframe that we saved at the conclusion of part 3
odds_df = pd.read_csv("odds_df.csv")

In [62]:
# the Angels appear under two different names. We will update this for consistency
df = df.replace("LA Angels of Anaheim", "Los Angeles Angels")

In [68]:
# create keys in both dataframes so that we can join them
df["id"] = df.away + df.away_score.astype(str) + df.home + df.home_score.astype(str) \
+ df.year.astype(str) + df.month.astype(str) + df.day.astype(str)
odds_df["id"] = odds_df.away + odds_df.away_score.astype(str) + odds_df.home + odds_df.home_score.astype(str) \
+ odds_df.year.astype(str) + odds_df.month.astype(str) + odds_df.day.astype(str)

In [69]:
# merge df with odds_df into a new dataframe
df = pd.merge(df, odds_df, on="id", how="inner")

In [75]:
df.columns

Index(['Unnamed: 0_x', 'away_x', 'home_x', 'away_score_x', 'home_score_x',
       'year_x', 'month_x', 'day_x', 'away_pitcher', 'home_pitcher',
       'away_pitcher_IP', 'home_pitcher_IP', 'away_pitcher_postgame_ERA',
       'home_pitcher_postgame_ERA', 'away_pitcher_ER', 'home_pitcher_ER',
       'home_pitcher_pregame_ERA', 'away_pitcher_pregame_ERA', 'id',
       'Unnamed: 0_y', 'away_y', 'home_y', 'away_score_y', 'home_score_y',
       'over_under', 'year_y', 'month_y', 'day_y'],
      dtype='object')