Modeling to predict the outcomes of MLB games and generate a profitable betting strategy.
This directory contains data collection, cleaning, and preparation notebooks. Data for this project was collected from FiveThirtyEight, Retrosheet.org, baseball-reference.com, sportsbookreview.com, the United States Geological Survey, and the NOAA Global Historical Climatology Network.
Jupyter Notebook exploring and cleaning the primary dataframe, as well as creating momentum-based features.
Jupyter Notebook performing binary classification on home team win or loss. The emphasis is on feature selection and determining the best feature subsets for further modeling.
(Largely unsuccessful) attempt at predicting the score differential in each game. Will revisit once the classification model is optimized (with the win/loss classification as a potential feature).
Initial attempts tried to gain an edge using only team-based performance statistics; Vegas gambling odds were later collected and incorporated as a potential feature.
Exploration of deep learning models; proof of concept for a custom objective function and dimensionality reduction.
Current modeling steps: application as a time series, deep learning / gradient boosting machines (with custom Vegas weights).
Python script defining custom evaluation metric. As the purpose of the model is to generate a profitable betting strategy, the evaluation metric must reflect gambling profits.
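
As an illustration of what a profit-based metric can look like, here is a minimal sketch. It assumes American moneyline odds, a flat one-unit stake, and a bet on the home team whenever the model's home-win probability reaches a threshold. The function names, the threshold rule, and the staking scheme are hypothetical and are not taken from the script itself.

```python
def win_payout(odds, stake=1.0):
    """Profit (excluding the returned stake) on a winning bet at American odds."""
    if odds == 0:
        raise ValueError("American moneyline odds cannot be zero")
    if odds > 0:
        # Underdog: +120 pays 1.20 units of profit per unit staked.
        return stake * odds / 100
    # Favorite: -150 pays 100/150 units of profit per unit staked.
    return stake * 100 / abs(odds)


def flat_bet_profit(home_won, home_prob, home_odds, threshold=0.5, stake=1.0):
    """Total profit from betting `stake` on the home team in every game
    where the predicted home-win probability is at least `threshold`.

    home_won  : iterable of 0/1 outcomes (1 = home team won)
    home_prob : iterable of model probabilities that the home team wins
    home_odds : iterable of American moneyline odds on the home team
    """
    profit = 0.0
    for won, prob, odds in zip(home_won, home_prob, home_odds):
        if prob < threshold:
            continue  # model is not confident enough: no bet placed
        profit += win_payout(odds, stake) if won else -stake
    return profit


# Example: two bets placed (one win at -150, one loss at +120), one game skipped.
example_profit = flat_bet_profit(
    home_won=[1, 0, 1],
    home_prob=[0.6, 0.7, 0.4],
    home_odds=[-150, 120, 200],
)
```

Unlike accuracy, a metric of this kind rewards correct picks on underdogs more than correct picks on heavy favorites, which is the behavior a betting strategy needs.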
Python script addressing the doubleheader issue: non-unique merge keys between dataframes.
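
One common way to handle this, sketched below with pandas, is to add a per-day game number so that both games of a doubleheader get distinct keys. The column names (`date`, `home_team`, `away_team`, `game_num`) and the helper `add_game_number` are assumptions for illustration, and the sketch assumes both dataframes list the games of a doubleheader in the same chronological order; the actual script may resolve the issue differently.

```python
import pandas as pd

KEYS = ["date", "home_team", "away_team"]


def add_game_number(df, keys=KEYS):
    """Return a copy of df with a 1-based game number within each key group.

    Rows must already be in chronological order within each group.
    """
    out = df.copy()
    out["game_num"] = out.groupby(list(keys)).cumcount() + 1
    return out


# Toy data: a doubleheader on 2019-07-01 plus a single game the next day.
games = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "home_team": ["NYA", "NYA", "BOS"],
    "away_team": ["BOS", "BOS", "NYA"],
    "home_score": [3, 5, 2],
})
odds = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "home_team": ["NYA", "NYA", "BOS"],
    "away_team": ["BOS", "BOS", "NYA"],
    "home_odds": [-120, -110, 105],
})

# Merging on the raw keys cross-joins the two doubleheader rows on each side.
naive_rows = len(games.merge(odds, on=KEYS))

# With the game number included, every key is unique; validate raises if not.
merged = add_game_number(games).merge(
    add_game_number(odds),
    on=KEYS + ["game_num"],
    validate="one_to_one",
)
```

Passing `validate="one_to_one"` makes the merge fail loudly if any duplicate keys remain, rather than silently inflating the row count.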
Python script parsing 13 million play-by-play observations from Retrosheet.org into 197,000 usable game-level observations with team statistics.
Extension of weather collection from the NOAA Global Historical Climatology Network.
Python script defining TeamFeatureEngineer, an attempt to capture momentum-based statistics. Generates season-by-season trends such as run differential at home, run differential on the road, and current winning streaks.
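
To make the idea of a momentum feature concrete, the sketch below computes a team's streak and cumulative run differential as they stand entering each game of a season, so that a game's own result never leaks into its features. The function names and the signed-streak convention (positive for wins, negative for losses) are illustrative assumptions, not the TeamFeatureEngineer interface.

```python
def streak_entering_game(results):
    """Signed streak before each game, given chronological win/loss results.

    results : list of bools for one team in one season (True = win)
    Returns a list of ints: positive = consecutive wins, negative = losses,
    0 for the first game of the season.
    """
    streaks = []
    current = 0
    for won in results:
        streaks.append(current)  # record the streak before this game is played
        if won:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
    return streaks


def run_diff_entering_game(runs_scored, runs_allowed):
    """Cumulative run differential before each game of the season."""
    diffs = []
    total = 0
    for scored, allowed in zip(runs_scored, runs_allowed):
        diffs.append(total)  # differential accumulated over earlier games only
        total += scored - allowed
    return diffs


# Example season fragment: W, W, L, L, W
example_streaks = streak_entering_game([True, True, False, False, True])
# example_streaks == [0, 1, 2, -1, -2]

example_diffs = run_diff_entering_game([5, 2, 3], [3, 4, 3])
# example_diffs == [0, 2, 0]
```

Home and road variants of the run differential could be obtained by filtering the game list by venue before calling the same function.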
Python script collecting daily weather observations for the past 100 years from the NOAA Global Historical Climatology Network.
Python script creating a database of all starting pitchers from 1918 until the present season and aggregating the observations into a single dataframe.
Python script defining the FeatureSelector object, which lets the user analyze and evaluate feature subsets and select the best feature set for a machine learning problem.
Python script scraping team stadium information from baseball-reference.com
Python script collecting all Retrosheet season files created by the event_parser script.
Python script collecting all starting pitcher data