> Table of Contents Ideas:


- Summary
- Introduction
- Methods
  - Data
  - Analysis
- Results & Discussion
- References

# Summary

Here we build a regression model using ------------- to predict the expected number of minutes an NBA basketball player will play in an upcoming game.

# Introduction

Fantasy sports represent a large and growing market in North America. According to a 2019 survey by the Fantasy Sports & Gaming Association (FSGA), 19% of Americans aged 18+ participate in fantasy sports (SOURCE). A smaller subset participates in what is known as Daily Fantasy Sports (DFS), with an estimated 6 million active users spending a total of \$3.26 billion on entry fees in 2016 (SOURCE). In DFS, users create one or more lineups for a given sport on a given day. As that day's games are played, individual players rack up fantasy points based on their real-world in-game performance. For example, in NBA basketball a player accumulates fantasy points based on his total points, rebounds, assists, blocks, steals, and turnovers in the day's game. A lineup's total value is the sum of the fantasy points earned by each player in the lineup. Accurately predicting a player's fantasy points for an upcoming game is therefore very valuable in what is a highly competitive DFS market.

For this project we test various machine learning models to predict the total number of minutes an NBA basketball player will play in an upcoming game. As in most sports, but in NBA basketball in particular, more playing time translates directly to increased fantasy point production (SOURCE). A more accurate prediction of a player's expected minutes in an upcoming game would therefore give users an edge when constructing their lineups for DFS contests.



Specifically, the question we aim to answer is: **can we accurately predict a player's minutes in an upcoming game using a combination of his historical statistics?** We evaluate our predictions against a player's historical five-game average of minutes played to determine whether they offer any value.


SOURCES:
https://www.draftkings.com/playbook/nba/nba-all-star-lesson-01-minutes-and-usage
https://thefsga.org/industry-demographics/

# Methods

## Data

The data set used in this project is the NBA Enhanced Box Score and Standings (2012 - 2018) data set created by Paul Rossotti and hosted on [Kaggle.com](https://www.kaggle.com/pablote/nba-enhanced-stats#2012-18_playerBoxScore.csv). It was sourced using APIs from [xmlstats](https://erikberg.com/api). A copy of this data set is hosted on a separate remote repository located [here]() to allow easy download without authenticating a Kaggle account. The particular data file used can be accessed [here](https://github.com/jnederlo/nba_data/blob/master/2012-18_playerBoxScore.csv). Each row in the data set represents a player's box score statistics for a particular game. The box score statistics are determined by statisticians working for the NBA. The data set contains 151,493 examples (rows).

## Analysis


We made predictions with three separate regression models: a simple linear regression model, an extreme gradient boosting model (XGBoost), and a light gradient boosting model (LGBM).
The R and Python programming languages (SOURCE) and the following R and Python packages were used to perform the analysis: pandas, numpy, docopt, requests, tqdm, selenium, altair, scikit-learn, matplotlib, plotly. The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/DSCI_522_group408).

# Results & Discussion

To reduce the complexity of the models and to remove noisy features, we did a substantial amount of feature engineering. The linear regression model represents the simplest approach, while the tree-based models were chosen because they work well for continuous data like the minutes in our data set. The hyperparameters for each model, mainly the number of estimators (`n_estimators=60`), were chosen through an iterative approach in order to reduce overfitting.

After all of our feature analysis we were left with two general categories of stats for each example: historical minutes played per game, and a player game rating. The minutes stats were calculated based on various rolling and exponential moving averages of different decay rates. The player game rating was a combination of different stats developed into a single number which was also calculated based on various rolling and exponential moving averages. Finally, we added in a flag for whether the player was a starter or on the bench for the upcoming game. All of the features were developed through our own knowledge of the game, and by examining correlations with the target.
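The rolling and exponential moving averages described above could be built with pandas roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual feature code: the column names (`player`, `game_date`, `minutes`), the output names (`min_roll`, `min_ema`), the window size, and the mapping of "decay rate" onto the `alpha` smoothing factor are all hypothetical. Each feature is shifted by one game so that only games already played inform the prediction.

```python
import pandas as pd


def add_minutes_features(box_scores, window=5, alpha=0.1):
    """Add lagged rolling and exponential moving averages of minutes per player.

    Assumed (hypothetical) input columns: 'player', 'game_date', 'minutes'.
    """
    df = box_scores.sort_values(["player", "game_date"]).copy()
    by_player = df.groupby("player")["minutes"]

    # Rolling mean over the previous `window` games; shift(1) excludes the
    # current game so the feature is known before the game is played.
    df["min_roll"] = by_player.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )

    # Exponential moving average of previous games with smoothing factor alpha.
    df["min_ema"] = by_player.transform(
        lambda s: s.shift(1).ewm(alpha=alpha, adjust=False).mean()
    )
    return df


# Made-up box scores for two players, for illustration only.
toy = pd.DataFrame(
    {
        "player": ["A", "A", "A", "B", "B"],
        "game_date": pd.to_datetime(
            ["2018-01-01", "2018-01-03", "2018-01-05", "2018-01-01", "2018-01-03"]
        ),
        "minutes": [30.0, 20.0, 40.0, 10.0, 12.0],
    }
)
features = add_minutes_features(toy, window=2, alpha=0.5)
print(features[["player", "minutes", "min_roll", "min_ema"]])
```

A player's first game has no history, so both features are missing there; how such rows are handled (dropped or imputed) is a separate modelling choice.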

We chose to evaluate our models by comparing their predictions to a player's five-game historical average of minutes played (the base model). This is a good test because it is reasonable to assume that a player's five-game average is a good indication of his expected minutes in an upcoming game. We compared the prediction errors (mean squared error and $R^2$) of each model and the base model, as follows:

```python
import pandas as pd

# Load the model comparison scores (rows: metrics, columns: models)
pd.read_csv('results/modelling-score_table.csv', index_col=0)
```

| Metric | LGBM | XGBoost | Linear regression | Base model |
|---|---|---|---|---|
| MSE | 38.24 | 38.3 | 39.59 | 50.24 |
| Coefficient of determination | 0.65 | 0.65 | 0.64 | 0.55 |


> Figure 1. Comparison of model fitness.
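For reference, the two scores reported in Figure 1 can be computed as shown in the sketch below: the mean squared error is the average squared difference between actual and predicted minutes, and the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and all numbers here are hypothetical illustrations, not the project's code or data.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Illustrative values only: actual minutes, one model's predictions, and the
# five-game-average baseline for the same four player-games.
actual = [30.0, 10.0, 24.0, 36.0]
model_pred = [28.0, 12.0, 25.0, 33.0]
baseline_pred = [26.0, 16.0, 20.0, 30.0]

for name, pred in [("model", model_pred), ("base model", baseline_pred)]:
    print(name, mean_squared_error(actual, pred), r_squared(actual, pred))
```

Scoring the baseline with exactly the same functions and the same held-out games as the models keeps the comparison like for like.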

From Figure 1 we can see that all of the models beat the base model in terms of both the mean squared error and the coefficient of determination. The LGBM model slightly outperformed the other regression models and was also significantly faster to train. In addition to the above analysis, we examined the prediction residuals for each model to get a visual idea of how the predictions performed throughout the range of minutes played in a game. The residuals are displayed below in Figure 2:

![Residuals Plot](results/modelling-residual_plot.png)
> Figure 2. Model Residuals

We can see from the residuals that the tree-based models performed noticeably better across the range of minutes played. All models struggle somewhat at the extremes, but that is expected: predicting injuries (i.e., very few minutes played) and games that go into overtime (i.e., extra minutes played) are uniquely challenging problems that cannot be fully modeled with the data present in the data set.
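One way to put numbers on what a residual plot shows is to average the residuals within bins of actual minutes played. The sketch below is an illustration only: the function name, the bin edges, and the data are hypothetical, and the residual is taken as actual minus predicted, so a negative value means the model predicted too many minutes.

```python
import numpy as np


def mean_residual_by_bin(y_true, y_pred, bin_edges):
    """Mean residual (actual - predicted) within each bin of actual minutes.

    Returns a dict mapping (low, high) bin bounds to the mean residual,
    skipping empty bins and values outside the outermost edges.
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    residuals = y_true - y_pred
    # np.digitize returns i such that bin_edges[i - 1] <= x < bin_edges[i]
    bin_ids = np.digitize(y_true, bin_edges)
    summary = {}
    for i in range(1, len(bin_edges)):
        mask = bin_ids == i
        if mask.any():
            summary[(bin_edges[i - 1], bin_edges[i])] = float(residuals[mask].mean())
    return summary


# Made-up values: low-minute games are over-predicted and high-minute games
# under-predicted, mimicking the pattern at the extremes.
actual = [2.0, 5.0, 22.0, 26.0, 44.0, 47.0]
predicted = [12.0, 11.0, 23.0, 25.0, 38.0, 39.0]
summary = mean_residual_by_bin(actual, predicted, bin_edges=[0, 12, 36, 60])
print(summary)
```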

Based on the above analysis and results, we chose to look further into the LGBM model to identify the most important features by plotting the feature importance as determined by the model splits:

![Feature Importance](results/modelling-gbm_importance.png)
> Figure 3. LGBM Feature Importance

We can see from Figure 3 that the player's previous minutes played, as measured with an exponential moving average (decay rate of 0.1), is the most important feature, followed by "playStat", which represents whether a player was a starter or a bench player in the game. All of the features had fairly good representation in the model. Additionally, the historical minutes stats ("Min") are generally more important than the player rating stats ("playRat").

Overall, we demonstrate that using a player's historical average stats to predict minutes in an upcoming game beats a baseline model that uses only his previous five-game average of minutes played. The tree-based models, particularly LGBM, provide a better fit and lower overall error, and represent a small but noticeable advantage to a DFS user.
