# Sports Betting Project Overview

This document is intended as an outline, as of April 5, 2020, of the sports betting project. In particular, I will cover three aspects of the work: qualitative betting knowledge, web scraping, and machine learning.

# Basics of Sports Betting

There are multiple bet types, events, and bookmaker sites available for our work. I'll quickly go over a few of these ideas and provide a basic explanation for important vocabulary and concepts.

## Bet types

We will focus on basketball, specifically the NBA. With basketball, we have more than one way to place bets:

1. Money Line
2. Point Spread
3. Total Points (Over Under)

The method I have focused on so far is Money Line, though the other options may prove just as interesting. In the interest of space, I will only describe money line betting here.

### Money line betting

The idea behind this method is to bet purely on who will win. The odds will look something like this:

**Houston Rockets**: -250

**New York Knicks**: 210

Here is how to read this money line: a \\$250 bet on the Rockets will pay out \\$350 (the \\$250 stake plus \\$100 profit) if the Rockets win, and a \\$100 bet on the Knicks will pay out \\$310 (the \\$100 stake plus \\$210 profit) if the Knicks win. In other words, the Rockets are the *favorite* and the Knicks are the *underdog*: you have to bet more on the Rockets than on the Knicks for a similar payout of around \\$330.
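The arithmetic above can be sketched as a small function. This is only an illustration: `payout` is a hypothetical helper name, and it assumes the usual reading of American odds, where a negative line is the stake needed to win 100 and a positive line is the profit on a stake of 100.

```python
def payout(money_line, stake):
    """Total returned on a winning bet: the stake plus the profit."""
    if money_line > 0:
        # Underdog: the line is the profit on a 100 stake.
        profit = stake * money_line / 100
    else:
        # Favorite: the line (negated) is the stake needed to win 100.
        profit = stake * 100 / -money_line
    return stake + profit


print(payout(-250, 250))  # Rockets example: 350.0
print(payout(210, 100))   # Knicks example: 310.0
```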


Importantly, we can "back out" the implied probability of each outcome from the money line. In our example, the implied probability that the Rockets win is about 71\%. For a favorite with money line $M < 0$, a stake of $|M|$ wins a profit of 100, so the break-even probability $p$ satisfies $100\,p = |M|\,(1 - p)$. Solving for $p$:

$$p = \frac{|M|}{|M| + 100}$$

For an underdog with money line $M > 0$, the same argument gives $p = \frac{100}{M + 100}$.

Here's a quick function that does the calculation:

In [8]:
def get_implied_odds(ml):
    if ml > 0:
        return 100 / (ml + 100)
    else:
        return -ml / (-ml + 100)
    
print(get_implied_odds(-250))
print(get_implied_odds(210))

0.7142857142857143
0.3225806451612903
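The two implied probabilities printed above add up to a little more than 1 (about 1.037). One simple convention, which the rest of this analysis does not rely on, is to rescale the pair proportionally so that they sum to 1. A minimal sketch, where `normalize` is a hypothetical helper name:

```python
def get_implied_odds(ml):
    if ml > 0:
        return 100 / (ml + 100)
    return -ml / (-ml + 100)


def normalize(p_a, p_b):
    """Rescale two implied probabilities proportionally so they sum to 1."""
    total = p_a + p_b
    return p_a / total, p_b / total


rockets, knicks = normalize(get_implied_odds(-250), get_implied_odds(210))
print(round(rockets, 3), round(knicks, 3))
```

Under this convention the Rockets come out slightly below the raw 71\% figure.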


## Websites

For our analysis, we will want historical odds. I found two sites that can provide this data: [ScoresAndOdds](https://www.scoresandodds.com/) and [Sportsbook Review](https://www.sportsbookreview.com/betting-odds/nba-basketball/). I would encourage you to look around on these websites and familiarize yourself with their user interfaces.

The Sportsbook Review site is particularly interesting: it aggregates the odds available on other betting sites and presents them in one place, which makes it easy to compare the odds offered by different bookmakers.

Perhaps the most notable online sports gambling website is [DraftKings](https://sportsbook.draftkings.com/sport/4). I have not scraped this site (yet), but it may prove useful in the future.

For actual basketball stats (shot success, rebounds, etc.), I am using [Basketball Reference](https://www.basketball-reference.com/).

## Web Scraping

I want to demonstrate how I scraped the data provided on Dropbox. Given that our background is computer science, it is worth going over.

### Scrapy

[Scrapy](https://docs.scrapy.org/en/latest/) is a powerful web scraping tool. Beginners in web scraping more commonly start with Beautiful Soup, but I would encourage taking a swing at Scrapy.

The fundamental building block in Scrapy is the Spider. This is what I am using for all my web scraping of Basketball Reference, Sportsbook Review, and ScoresAndOdds. This is the example spider class taken from the documentation.

In [3]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
        'http://quotes.toscrape.com/tag/'
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


Notice that the `start_urls` attribute lists the webpages where scraping starts, and the `parse` method picks out the individual parts of each page to save to a data set, then follows the link to the next page if there is one. This is essentially the way that I have scraped all my data.

To actually run the spider from inside a Scrapy project, the command is:

`scrapy crawl quotes`

Another important part of Scrapy is the pipeline from the data scrape to the ultimate output. A pipeline looks like this:

In [10]:
import csv
import datetime

class CSVPipeline(object):
    def __init__(self):
        self.time_format = "%Y-%m-%d_%H%M%S"

    def open_spider(self, spider):
        spider_name = spider.name
        now = datetime.datetime.now()
        now_str = now.strftime(self.time_format)
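        # `output_directory` and `database_specs` are not defined in this cell;
        # they are assumed to come from the project's own configuration.
        target_dir = output_directory + "/" + spider_name + "/"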
        filename = target_dir + spider_name + "_" + now_str + ".csv"
        self.file = open(filename, 'w')
        self.writer = csv.DictWriter(
            self.file,
            fieldnames=database_specs['tables'][spider_name].keys(),
            lineterminator='\n'
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        # Scrapy expects process_item to return the item (or raise DropItem).
        return item

    def close_spider(self, spider):
        self.file.close()


Most important to note is that Scrapy calls the pipeline's `open_spider` method when the spider starts, `process_item` for each scraped item, and `close_spider` when the spider finishes. The same structure works for any number of output formats: CSV, a database (such as Postgres), or JSON, for example.
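For Scrapy to use a pipeline class like this, it has to be listed in the `ITEM_PIPELINES` setting of the project's `settings.py`. A minimal sketch, assuming a hypothetical project package named `betting` with the class living in `betting/pipelines.py` (neither name comes from the actual project):

```python
# settings.py (fragment)
# "betting.pipelines" is a placeholder module path, not the project's real one.
ITEM_PIPELINES = {
    "betting.pipelines.CSVPipeline": 300,
}
```

The integer controls the order in which pipelines run when there is more than one: lower values run first.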

### JSON craziness

Normally, web scraping involves picking apart the HTML and CSS of a webpage. This is messy and something of a "hack", because the target is the rendered website rather than a formal service such as a REST API. With this project, however, I have stumbled upon the underlying APIs and have reverse engineered their endpoints.

Here is an example of what I mean:

[Sportsbook Review odds](https://www.sportsbookreview.com/ms-odds-v2/odds-v2-service?query=%7B+currentLines(eid:+[3872357,+3872363,+3872367,+3872371,+3872377,+3872381,+3872386,+3872389,+3872394],+mtid:+[83],+marketTypeLayout:+%22PARTICIPANTS%22,+catid:+133)+openingLines(eid:+[3872357,+3872363,+3872367,+3872371,+3872377,+3872381,+3872386,+3872389,+3872394],+mtid:+[83],+marketTypeLayout:+%22PARTICIPANTS%22,+paid:+3)+bestLines(catid:+133,+eid:+[3872357,+3872363,+3872367,+3872371,+3872377,+3872381,+3872386,+3872389,+3872394],+mtid:+[83])+consensus(eid:+[3872357,+3872363,+3872367,+3872371,+3872377,+3872381,+3872386,+3872389,+3872394],+mtid:+[83])+%7B+eid+mtid+boid+partid+sbid+bb+paid+lineid+wag+perc+vol+tvol+sequence+tim+%7D+maxSequences+%7B+events:+eventsMaxSequence+scores:+scoresMaxSequence+currentLines:+linesMaxSequence+statistics:+statisticsMaxSequence+plays:+playsMaxSequence+consensus:+consensusMaxSequence+%7D+%7D)

Though I currently get all this data using Scrapy, it may be better to use the `requests` library in Python. This may be a great initial project, given the background in computer science.
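To see what the endpoint is being asked, it helps to decode the `query` parameter with the standard library. The sketch below uses a trimmed-down version of the URL above (one event id, three fields), so the exact query text is an illustration and has not been checked against the live service. `fetch_odds` is a hypothetical helper showing how the same query might be sent with `requests`; it is defined here but not called.

```python
from urllib.parse import parse_qs, urlsplit

# Trimmed-down version of the URL above: one event id and three fields only.
url = (
    "https://www.sportsbookreview.com/ms-odds-v2/odds-v2-service"
    "?query=%7B+consensus(eid:+[3872357],+mtid:+[83])"
    "+%7B+eid+mtid+boid+%7D+%7D"
)

parts = urlsplit(url)
# parse_qs undoes the URL encoding: "+" becomes a space, "%7B" becomes "{".
query = parse_qs(parts.query)["query"][0]
print(parts.path)
print(query)


def fetch_odds(query_text):
    """Hypothetical helper: send the query with `requests` and decode the JSON."""
    import requests  # third-party; imported here so the decoding above runs without it

    endpoint = "https://www.sportsbookreview.com/ms-odds-v2/odds-v2-service"
    response = requests.get(endpoint, params={"query": query_text}, timeout=10)
    response.raise_for_status()
    return response.json()
```

Decoded, the parameter reads as a structured query with named blocks and a list of requested fields, which is why the response comes back as JSON rather than HTML.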

## Machine Learning

Now for the most interesting bit.

Once we understand our data, have scraped it, and can access it easily in Python, we can begin running machine learning models on it with the goal of better predicting the outcome of games. I have shared 3 files on Dropbox. All of this analysis comes from those files, so you can probably run this code on your machine as well.

First, libraries and settings:

In [15]:
import pandas as pd
import numpy as np

import warnings

from pandas.core.common import SettingWithCopyWarning
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

data_directory = "~/Desktop/data"

Next, some basic data cleaning operations:

In [16]:
games = pd.read_csv(data_directory + "/games.csv")
boxscore = pd.read_csv(data_directory + "/boxscore.csv")
money_line = pd.read_csv(data_directory + "/money_line.csv")

games["game_date"] = pd.to_datetime(games["game_date"])

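# Drop rows whose line is listed as "Off"; copy so the column assignments
# below modify an independent frame rather than a slice of the original.
money_line = money_line[money_line["home"] != "Off"].copy()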
money_line["date_url"] = pd.to_datetime(money_line["date_url"])
money_line["home"] = money_line["home"].astype(float)
money_line["away"] = money_line["away"].astype(float)

Here is what the data so far looks like:

In [18]:
games.head()

Unnamed: 0,code,season,game_date,start_time,home_team,home_code,home_points,visiting_team,visiting_code,visitor_points,has_ot,attendance,winner
0,200102010CLE,2001,2001-02-01,19:30:00,Cleveland Cavaliers,CLE,81,Minnesota Timberwolves,MIN,90,False,13904,MIN
1,200102010DAL,2001,2001-02-01,20:00:00,Dallas Mavericks,DAL,95,Miami Heat,MIA,91,True,14497,DAL
2,200102010NYK,2001,2001-02-01,20:00:00,New York Knicks,NYK,80,Philadelphia 76ers,PHI,87,False,19763,PHI
3,200102010HOU,2001,2001-02-01,20:30:00,Houston Rockets,HOU,84,Los Angeles Clippers,LAC,101,False,11124,LAC
4,200102010UTA,2001,2001-02-01,21:00:00,Utah Jazz,UTA,87,Charlotte Hornets,CHH,76,False,18855,UTA


In [20]:
money_line.head()

Unnamed: 0,key,date_timestamp,date_url,home_team,away_team,home,away
0,179454,2018-10-07 19:05:00,2018-10-07,Oklahoma City Thunder,Atlanta Hawks,-253.0,206.0
1,179455,2018-10-07 20:05:00,2018-10-07,San Antonio Spurs,Houston Rockets,-122.0,103.0
2,179456,2018-10-08 00:05:00,2018-10-07,Minnesota Timberwolves,Milwaukee Bucks,110.0,-131.0
3,179457,2018-10-08 01:05:00,2018-10-07,Portland Trail Blazers,Utah Jazz,128.0,-152.0
4,179477,2018-10-10 00:05:00,2018-10-09,Oklahoma City Thunder,Milwaukee Bucks,-216.0,178.0


In [22]:
boxscore.head()

Unnamed: 0,code,team,player_code,player,mp,fg,fga,fg_pct,fg3,fg3a,fg3_pct,ft,fta,ft_pct,orb,drb,trb,ast,stl,blk,tov,pf,pts,plus_minus
0,201501010CHI,DEN,afflaar01,Arron Afflalo,39:56,8,14,0.571,1,3,0.333,2,2,1.0,1,3,4,1,1,1,4,1,19,0.0
1,201501010CHI,DEN,lawsoty01,Ty Lawson,39:52,8,16,0.5,1,1,1.0,3,4,0.75,0,7,7,7,1,0,2,3,20,-6.0
2,201501010CHI,DEN,farieke01,Kenneth Faried,35:56,7,14,0.5,0,0,,4,4,1.0,7,12,19,3,2,3,1,2,18,5.0
3,201501010CHI,DEN,chandwi01,Wilson Chandler,33:35,8,16,0.5,2,5,0.4,4,4,1.0,1,2,3,2,0,0,2,6,22,-1.0
4,201501010CHI,DEN,mozgoti01,Timofey Mozgov,24:45,1,4,0.25,0,0,,2,2,1.0,1,5,6,1,0,0,0,6,4,4.0


Now, I will make some joins in order to get the primary dataset of interest:

In [25]:
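# Keep only the home team's money line, renamed so the keys match `games`.
money_line_subset = money_line[["date_url", "home_team", "home"]]
money_line_subset.columns = ["game_date", "home_team", "money_line"]

# Team-level shooting per game: sum only the numeric columns that are needed.
boxscore_agg = boxscore.groupby(["code", "team"])[["fg", "fga"]].sum()
boxscore_agg["shot_percentage"] = boxscore_agg["fg"] / boxscore_agg["fga"]

# Attach the home team's shot percentage to each game.
games_join = games.set_index(["code", "game_date", "home_code", "visiting_code", "home_team", "visiting_team"])
games_join = games_join.join(boxscore_agg["shot_percentage"], on=["code", "home_code"], how="inner")
games_join = games_join.reset_index()

# Inner merge on the two shared columns, named explicitly.
df = pd.merge(games_join, money_line_subset, on=["game_date", "home_team"])
df.head()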

Unnamed: 0,code,game_date,home_code,visiting_code,home_team,visiting_team,season,start_time,home_points,visitor_points,has_ot,attendance,winner,shot_percentage,money_line
0,201901010TOR,2019-01-01,TOR,UTA,Toronto Raptors,Utah Jazz,2019,19:30:00,122,116,False,19800,TOR,0.54878,-115.0
1,201901010MIL,2019-01-01,MIL,DET,Milwaukee Bucks,Detroit Pistons,2019,20:00:00,121,98,False,17534,MIL,0.594937,-600.0
2,201901010DEN,2019-01-01,DEN,NYK,Denver Nuggets,New York Knicks,2019,21:00:00,115,108,False,19520,DEN,0.463158,-1400.0
3,201901010SAC,2019-01-01,SAC,POR,Sacramento Kings,Portland Trail Blazers,2019,21:00:00,108,113,True,17583,POR,0.382609,100.0
4,201901010LAC,2019-01-01,LAC,PHI,Los Angeles Clippers,Philadelphia 76ers,2019,22:30:00,113,119,False,17868,PHI,0.461538,103.0


A natural question: how well do the odds themselves serve in predicting the outcome of the game?

In [30]:
def get_implied_probability(ml):
    if ml > 0:
        return 100 / (ml + 100)
    else:
        return -ml / (-ml + 100)

df["prob"] = df["money_line"].apply(get_implied_probability)
df["odds_prediction"] = np.where(df["money_line"] < 0, 1, 0)
df["home_wins"] = np.where(df["winner"] == df["home_code"], 1, 0)

accuracy_score(df["odds_prediction"], df["home_wins"])

0.6730462519936204

In [31]:
confusion_matrix(df["odds_prediction"], df["home_wins"])

array([[250, 148],
       [262, 594]])

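It seems that the odds predict the winner correctly about 67% of the time. Because `odds_prediction` is passed as the first argument, the rows of the confusion matrix correspond to the odds' pick and the columns to the actual result.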
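The accuracy figure can be checked by hand from the confusion matrix, and the same four numbers give a useful point of comparison: how often one would be right by always picking the home team. A small sketch using only the counts printed above:

```python
# Confusion matrix copied from the output above.
# Rows: the odds' pick (0 = away, 1 = home); columns: actual result.
matrix = [[250, 148],
          [262, 594]]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]    # odds' pick matched the result
home_wins = matrix[0][1] + matrix[1][1]  # home team actually won

odds_accuracy = correct / total
always_home_accuracy = home_wins / total

print(f"games: {total}")
print(f"odds accuracy: {odds_accuracy:.3f}")
print(f"always-home baseline: {always_home_accuracy:.3f}")
```

On these counts the home team won about 59% of the games, so the odds add roughly eight percentage points over that naive rule.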

A natural choice of binary classification model using this data is logistic regression. Using this simple model, we can begin with the money line as input, then incorporate more features using the boxscore or other basketball data. The model also outputs a predicted probability, which lets us compare it against the 67% accuracy of the odds alone. We want our model's accuracy to trounce that of the market.

In [36]:
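def to_infinity(p):
    # Not defined in the original cell; inferred from the X column in the
    # output further down (e.g. 0.857143 -> 1.791759). This is the logit,
    # which maps a probability in (0, 1) onto the whole real line.
    return np.log(p / (1 - p))

X = np.array(df["prob"].apply(to_infinity)).reshape(-1, 1)
y = np.array(df["home_wins"])

X_train, X_test, y_train, y_test = train_test_split(X, y)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
# Note: this scores the model on the training split; X_test is not used here.
y_hat = logistic_regression.predict(X_train)

accuracy_score(y_train, y_hat)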

0.6617021276595745

Perhaps unsurprisingly, the model returns roughly the same accuracy as the raw odds (about 66%, measured here on the training split). We can also see that the model's predicted probability is very close to the implied probability of the given odds:

In [39]:
probs = logistic_regression.predict_proba(X)
pd.DataFrame({
    "game": df["code"],
    "y": y,
    "money_line": df["money_line"],
    "money_line_prob": df["prob"],
    "X": X.reshape(1, -1)[0],
    "y_hat": logistic_regression.predict(X),
    "model_prob": probs[:, 1]
})

Unnamed: 0,game,y,money_line,money_line_prob,X,y_hat,model_prob
0,201901010TOR,1,-115.0,0.534884,0.139762,1,0.517526
1,201901010MIL,1,-600.0,0.857143,1.791759,1,0.826031
2,201901010DEN,1,-1400.0,0.933333,2.639057,1,0.910581
3,201901010SAC,0,100.0,0.500000,0.000000,0,0.486073
4,201901010LAC,0,103.0,0.492611,-0.029559,0,0.479426
...,...,...,...,...,...,...,...
1249,201812310OKC,1,-329.0,0.766900,1.190888,1,0.734323
1250,201811300PHO,0,160.0,0.384615,-0.470004,0,0.382497
1251,201811300LAL,1,-195.0,0.661017,0.667829,1,0.633127
1252,201811300POR,0,-135.0,0.574468,0.300105,1,0.553425
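The closeness of `model_prob` to `money_line_prob` can be made concrete. A single-feature logistic regression is linear in log-odds, so two rows of the table are enough to recover its slope and intercept approximately. A slope of 1 and an intercept of 0 would reproduce the bookmaker's probabilities exactly. The sketch below uses only values copied from the table; the helper names `logit` and `sigmoid` are my own.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of logit."""
    return 1 / (1 + math.exp(-z))


# (X, model_prob) pairs copied from two rows of the table above.
x1, p1 = 0.000000, 0.486073  # 201901010SAC
x2, p2 = 1.791759, 0.826031  # 201901010MIL

slope = (logit(p2) - logit(p1)) / (x2 - x1)
intercept = logit(p1) - slope * x1

# Sanity check against a third row (201901010DEN): X = 2.639057.
check = sigmoid(intercept + slope * 2.639057)

print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
print(f"reconstructed model_prob for DEN: {check:.3f}")
```

The slope comes out near 0.9 and the intercept slightly below zero, so the fitted model is a mildly shrunken version of the bookmaker's own probabilities.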
