NBA_predict

#excellent

NBApredict is a package for predicting NBA games against betting lines. It has two main behaviors:

  1. Scrape and store team statistics, game results, and betting lines in a SQLite database.
  2. Generate predictions for each NBA game, compare the prediction to the betting line, and store the results.

Status

I have effectively archived this project. Given that, I thought it would be relevant to update the code and README to reflect the state of the project where it left off. Work on the project was previously sponsored by a benefactor, but monetary support dried up with the pause of the NBA season due to COVID-19 in March 2020. That pause occurred in the middle of a significant reorganization, which was perhaps 90% finished. I started the reorganization because the code in the old master branch was trash, and since people continued to clone that branch, the reorganized code is now stored in "main". The old master branch is stored as "archive". (Note: "main" is now the equivalent of "master" if that's not clear above.)

This project will not run if you just clone it. However, there are hundreds of hours of work here! So, if you want to get this project to run for you, get in contact, and I'd be happy to iron out the few remaining kinks. Otherwise, I feel no real incentive to work on this for myself at the moment.

In the rest of this README, I'll try to describe the state of the project so that you may have some idea of what utility you may derive from the work I've done. Hopefully this will be of some help in your own pursuits.

Project Overview

Directories

This section overviews the main components of the project. Details for other sections of the project are available in the documentation.

  • br_web_scraper - This is just a clone of the package listed in the Credits section with some changes to fit NBApredict.
  • database - Can be ignored. This folder contains various modules intended to automate database operations. However, I found these modules useful for other projects, so I created the DatatoTable repo. DatatoTable is available as a package on PyPi, and the rest of NBApredict should use that package.
  • helpers - Miscellaneous. Most of this would get removed were the reorganization finished.
  • management - Data management package
    • tables - Each module in this directory has a corresponding table in the database. Each module will tend to have format_data(), create_table(), and update_table() functions, among others, which perform the necessary work for that table.
    • conversion - This module has been replaced by DatatoTable.convert functions. However, the functionality, either from this module or DatatoTable, is essential for configuring foreign keys. An example is in tables.odds.format_data().
    • etl - Extract, Transform, Load. This module runs all processes which involve external data.
  • models - This directory is to hold whatever models get incorporated into the project. At the moment, it just holds the four_factor_regression module (explained below) and a graphing script which generates graphs for regression evaluation.
  • outputs - This directory is generated by the project. It holds the SQLite database generated by the project. It also holds a graphs directory which stores any saved graphs.
  • predict - As it's named, this package is used to generate predictions. This is where work on the reorganization stopped, so these scripts are the least polished. The ToDo at the top of predict.bets describes the vision for this package.
  • run - The run directory holds two scripts, daily.py and all.py. The daily script will set the project to run daily while the all script runs the project when called. Neither will work unless work on upstream components is finished.
  • scrapers - The scrapers folder holds modules for scraping data. scraper.py's scrape_all() function will scrape all season, team, and betting line data. To scrape just one type of data, call the desired data's scrape function. For example, line_scraper.scrape() will scrape betting lines.

The Model

As of now, the model uses a linear regression based on the Four Factors of Basketball Success, which encapsulate shooting, turnovers, rebounding, and free throws. Further, we include the opposing four factors, which are how a team's opponents perform on the four factors in aggregate. Thus, each team has eight variables, and the model uses sixteen variables (eight for each team) for each prediction. The target, Y, or dependent variable is home Margin of Victory (MOV). Away MOV is simply the negative of home MOV.
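As a rough illustration of how the sixteen variables fit together, here is a minimal sketch in plain Python. The factor names, the helper functions, and every number in it are assumptions made for illustration; they are not the project's actual column names, statistics, or fitted coefficients.

```python
# Hypothetical names for the Four Factors; the project's real column names may differ.
FACTORS = ("efg_pct", "tov_pct", "orb_pct", "ft_rate")


def team_variables(own, opponent):
    """Eight variables per team: its own four factors plus its opponents' four."""
    return [own[f] for f in FACTORS] + [opponent[f] for f in FACTORS]


def matchup_variables(home, away):
    """Sixteen variables per prediction: eight for the home team, eight for the away team."""
    return team_variables(*home) + team_variables(*away)


def predict_home_mov(variables, coefficients, intercept):
    """Linear regression prediction: intercept plus the weighted sum of the variables."""
    return intercept + sum(c * x for c, x in zip(coefficients, variables))


# Placeholder values only: not real team statistics and not fitted coefficients.
placeholder = {"efg_pct": 0.5, "tov_pct": 0.5, "orb_pct": 0.5, "ft_rate": 0.5}
home = (placeholder, placeholder)  # (own factors, opponents' factors)
away = (placeholder, placeholder)

variables = matchup_variables(home, away)
home_mov = predict_home_mov(variables, [0.0] * 16, intercept=3.0)
away_mov = -home_mov  # away MOV is the negative of home MOV
print(len(variables), home_mov, away_mov)
```

With real data, the coefficients and intercept would come from fitting the regression on past games rather than being set by hand.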

What are betting lines?

MOV is targeted because it provides an easy comparison with two types of betting lines, the spread and moneyline. Here's what the spread and moneyline might look like for a matchup between the Milwaukee Bucks and Atlanta Hawks:

Milwaukee Bucks (Home):

  1. Spread: -8
  2. Moneyline: -350

Atlanta Hawks (Away):

  1. Spread: 8
  2. Moneyline: 270

First, the spread attempts to guess the MOV between two teams. The Milwaukee Bucks spread of -8 indicates the betting line expects the Bucks to beat the Hawks by eight points. Put another way, the Bucks "give" the Hawks eight points. If one thinks the Bucks will beat the Hawks by more than eight points, they bet the Bucks. If one believes the Bucks will either win by fewer than eight points or lose, they bet the Hawks. Typically, spreads have symmetric, or near-symmetric, returns where picking the Bucks or the Hawks provides an equal return on a correct bet.
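The spread logic above can be sketched as a small function. The function name is hypothetical, and the "push" case (the margin lands exactly on the spread, so neither side covers) is standard betting terminology added here rather than something this README defines.

```python
def spread_result(home_spread: float, home_mov: float) -> str:
    """Return which side covers, given the home team's spread and the final home MOV."""
    adjusted = home_mov + home_spread  # e.g. a 10-point win against a -8 spread leaves +2
    if adjusted > 0:
        return "home"
    if adjusted < 0:
        return "away"
    return "push"


print(spread_result(-8, 10))  # Bucks win by 10: a bet on the Bucks wins
print(spread_result(-8, 5))   # Bucks win by 5: a bet on the Hawks wins
print(spread_result(-8, -3))  # Bucks lose by 3: a bet on the Hawks wins
print(spread_result(-8, 8))   # Bucks win by exactly 8: push
```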

In comparison, the moneyline states the likelihood of a team winning or losing in terms of a monetary return. A negative moneyline, such as the Bucks' -350, means one must put up $350 in order to win $100. A positive moneyline, such as the Hawks' 270, means a bet of $100 will win $270 if it is correct.
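To make the moneyline arithmetic concrete, here is a minimal sketch. Both function names are hypothetical. The break-even probability is the win rate at which a bet at that moneyline neither makes nor loses money on average; it is a standard conversion and not something taken from the project's code.

```python
def moneyline_profit(moneyline: int, stake: float) -> float:
    """Profit on a winning bet of `stake` dollars at the given moneyline."""
    if moneyline < 0:
        # Negative line: risk |moneyline| dollars to win 100.
        return stake * 100 / -moneyline
    # Positive line: risk 100 dollars to win `moneyline`.
    return stake * moneyline / 100


def breakeven_probability(moneyline: int) -> float:
    """Win probability at which a bet at this moneyline breaks even on average."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


print(moneyline_profit(-350, 350))             # Bucks: risk $350 to win $100
print(moneyline_profit(270, 100))              # Hawks: risk $100 to win $270
print(round(breakeven_probability(-350), 3))   # roughly 0.778
print(round(breakeven_probability(270), 3))    # roughly 0.27
```

The two break-even probabilities sum to more than 1; the excess is the bookmaker's margin.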

Generating Predictions

Before comparing predictions to betting lines, we need to ensure the model meets the assumptions of regression. For now, assume assumptions are met, and refer to Additional Reading for further model discussion. To compare the model's predictions to betting lines, we look at the prediction's distance from the betting line. In the model, the prediction is the expected value, or the mean, of the matchup. All possible outcomes of the game are normally distributed around this mean with a standard deviation which, as of March 2020, is approximately thirteen.

Continuing the Bucks-Hawks example, let's say the model predicts the Bucks to win by 6 in comparison to the betting line of 8. To compare the betting line to the prediction, we want to evaluate the likelihood of a Bucks win by 8 or more given a normal distribution with a mean of 6 and standard deviation of 13. Thus, we calculate the survival function* of 8 based on the distribution. The result is approximately 0.44, which means we'd expect the home MOV to be greater than or equal to 8 44% of the time. Inversely, we expect the home MOV to be less than 8 approximately 56% of the time.

To compare moneylines instead of spreads, simply set the spread to 0, and the output will be the likelihood of a win or loss.

*The model uses a cumulative distribution function when the predicted MOV is greater than the betting line
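The Bucks-Hawks calculation above can be reproduced with the standard library. This is a sketch only: it uses statistics.NormalDist as a stand-in for whatever the project's own code calls, and the function name and default standard deviation argument are assumptions.

```python
from statistics import NormalDist


def cover_probability(predicted_mov: float, line: float, sd: float = 13.0) -> float:
    """P(home MOV >= line) when outcomes follow Normal(predicted_mov, sd).

    This is the survival function, i.e. 1 minus the cumulative distribution function.
    """
    return 1.0 - NormalDist(mu=predicted_mov, sigma=sd).cdf(line)


# Spread: the model predicts Bucks by 6, the betting line says Bucks by 8.
p_cover = cover_probability(6, 8)
print(round(p_cover, 2))      # 0.44
print(round(1 - p_cover, 2))  # 0.56

# Moneyline: set the line to 0 to get the probability of a home win.
p_win = cover_probability(6, 0)
print(round(p_win, 2))
```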

Usage

(Outdated: This usage hasn't been recreated in the reorganization yet.)

Clone this repo to your local machine: https://github.com/Spencer-Weston/NBA_bet.git

To set the project to run daily: ~\NBApredict>python -m run.daily

run.daily sets the project to run 1 hour before the first game of each day. This time is chosen because betting lines are not always available until later in the day.

Or to run the project once: ~\NBApredict>python -m run.all
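The daily schedule described above (run one hour before the first game of the day) can be sketched as follows. The function name is hypothetical and the tip-off times are placeholders, not a real schedule; the actual run.daily module may compute this differently.

```python
from datetime import datetime, timedelta


def daily_run_time(game_times):
    """Return the time one hour before the earliest game in `game_times`."""
    return min(game_times) - timedelta(hours=1)


# Placeholder tip-off times for a single day.
games = [
    datetime(2020, 3, 1, 19, 30),
    datetime(2020, 3, 1, 18, 0),
    datetime(2020, 3, 1, 21, 0),
]
print(daily_run_time(games))  # one hour before the 18:00 game
```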

Version: V0.2 - Reorganization

As described in the Status section, this version isn't finished. Still, here is a rough approximation of V0.2:

Why the Reorganization?

In short, the project sucked before this point (check the archive branch). The project structure was not pythonic, so namespaces were messy. Modules were more agglomerations of random behaviors than coherent units of related functionality. The project structure now follows standard Python package design principles. The initial database design did not use normalized tables. Given that many tables stored the same data, such as game times, keeping tables in sync required adding unique update functions for every table. The new tables are normalized with cascades to avoid this. Various other quality of life improvements have been implemented or will be if I ever pick this project back up.
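To illustrate what normalization with cascades buys, here is a minimal sqlite3 sketch. The table and column names are hypothetical and the rows are placeholders; the project's real schema is defined in its table modules and will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(
    """
    CREATE TABLE games (
        id INTEGER PRIMARY KEY,
        start_time TEXT NOT NULL
    );
    CREATE TABLE odds (
        id INTEGER PRIMARY KEY,
        game_id INTEGER NOT NULL
            REFERENCES games(id) ON UPDATE CASCADE ON DELETE CASCADE,
        home_spread REAL
    );
    """
)

# The game time lives only in `games`; `odds` refers to it by key.
conn.execute("INSERT INTO games (id, start_time) VALUES (1, '2020-03-01 19:00')")
conn.execute("INSERT INTO odds (game_id, home_spread) VALUES (1, -8)")

# One update in one table; nothing in `odds` needs to be kept in sync.
conn.execute("UPDATE games SET start_time = '2020-03-01 20:00' WHERE id = 1")
row = conn.execute(
    "SELECT g.start_time, o.home_spread FROM odds o JOIN games g ON g.id = o.game_id"
).fetchone()
print(row)

# Deleting the game cascades to its odds rows.
conn.execute("DELETE FROM games WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM odds").fetchone()[0]
print(remaining)
```

Because the game time is stored once, a rescheduled game needs a single update rather than one update function per table.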

Finished

  • Project organized into a pythonic package structure
  • All tables are normalized
  • Database operations exported to DatatoTable with much improved usability
  • All scrapers (schedule, betting lines, team stats, and teams) and their associated table modules use the normalized format

Unfinished

  • Predictions and the associated table (The models still work; they're just not threaded into the full workflow)
  • Predictions and data interfaces. When completed, these functions would allow some degree of analysis for individual games or stats from the command line
  • "Run All" functionality. Once the above is finished, the project will be left to run daily to keep data and predictions up to date.

Author

Spencer Weston

personal website: Crockpot Thoughts

Additional Reading

Credits

Jae Bradley: https://github.com/jaebradley/basketball_reference_web_scraper - Used to scrape games and game results

License

MIT