# dk_nba_salaries

* Getting the dataset
* Exploring the dataset
* Predicting DK salaries
* Predicting DK points
* Predicting whether player will reach 6X

## Getting the dataset

In [2]:
require(RPostgreSQL)
require(dplyr)
require(caret)
library(corrplot)

Loading required package: RPostgreSQL
Loading required package: DBI
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2


In [3]:
library(doMC)
registerDoMC(cores = 3)

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel


In [4]:
nbadb <- function() {
  drv <- dbDriver("PostgreSQL")
  con <- dbConnect(drv, dbname = "nbadb", user='nbadb', password='cft0911')
  q = "SELECT * FROM tmpmodel"
  dbGetQuery(con, q)
}

In [5]:
pp <- function(sal=3500, dk=12, min=15) {
  # filter out NA
  # remove useless players
  d = nbadb()
  d = d[complete.cases(d),]
  d %>% filter(salary >= sal, dkema25 > dk, minema25 > min)
}

## Exploring the Dataset

In [6]:
dfr = pp(sal=4000)

Most of the variables are self-explanatory, such as season or game_date. 

The opaque variables are described below.

    position_group: Point, Wing, or Big
    minavg: average minutes played up to, but not including, this game
    minema2: exponentional moving average of minutes, alpha = .02
    minema10: exponentional moving average of minutes, alpha = .10
    minema25: exponentional moving average of minutes, alpha = .25
    minema40: exponentional moving average of minutes, alpha = .40
    dkavg: average dk_points in games up to, but not including, this game
    dkema2: exponentional moving average of dk_points, alpha = .02
    dkema10: exponentional moving average of dk_points, alpha = .10
    dkema25: exponentional moving average of dk_points, alpha = .25
    dkema40: exponentional moving average of dk_points, alpha = .40
    lastmin: minutes played in previous game
    lastdk: dk points scored in previous game
    delta_projected_team_total: team_average_ppg - team_implied_total
    pace_avg: in games up to, but not including, this game
    pace_ema2: exponentional moving average of pace, alpha = .02
    pace_ema10: exponentional moving average of pace, alpha = .10
    pace_ema25: exponentional moving average of pace, alpha = .25
    pace_ema40: exponentional moving average of pace, alpha = .40
    drtg_avg: defensive rating in games up to, but not including, this game
    drtg_ema2: exponentional moving average of defensive rating, alpha = .02
    drtg_ema10: exponentional moving average of defensive rating, alpha = .10
    drtg_ema25: exponentional moving average of defensive rating, alpha = .25
    drtg_ema40: exponentional moving average of defensive rating, alpha = .40
    y: whether the player scored 5X salary


## Predicting DraftKings Salaries

How accurately can we predict a player's salary? This could be useful, at a minimum, to impute salaries to old NBA data which would massively increase the available samples.

### Simple regression model

In [8]:
# some of the variables not needed, such as season or player_name so we'll remove those.
dfr2 = subset(dfr, select = -c(season, game_date, game_id, nbacom_player_id, player_name, team_code,
           opp, min, dk_points, pos, y))

In [9]:
# turn the categorical variables into factors
dfr2$position_group = as.factor(dfr2$position_group)
dfr2$back_to_back = as.factor(dfr2$back_to_back)
dfr2$three_in_four = as.factor(dfr2$three_in_four)

In [10]:
# Create model with default paramters
# based on http://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/
in_train = createDataPartition(dfr2$salary, p=.75, list=FALSE)
dfr2_train = dfr2[in_train,]
dfr2_test = dfr2[-in_train,]
Xtrain = subset(dfr2_train, select=-c(salary))
ytrain = subset(dfr2_train, select=c(salary))
Xtest = subset(dfr2_test, select=-c(salary))
ytest = subset(dfr2_test, select=c(salary))
control <- trainControl(method="repeatedcv", number=10, repeats=3)

In [11]:
set.seed(13)
m.lm <- train(salary ~ ., data=dfr2_train, method="lm", trControl=control)

In [12]:
print(m.lm)

Linear Regression 

12725 samples
   32 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 11451, 11453, 11451, 11452, 11453, 11452, ... 
Resampling results:

  RMSE      Rsquared 
  559.9706  0.8891593

Tuning parameter 'intercept' was held constant at a value of TRUE
 


In [None]:
set.seed(13)
m.lm <- train(salary ~ ., data=dfr2_train, method="lm", preProcess=c('scale', 'center'), trControl=control)

In [None]:
print(m.lm)

In [None]:
set.seed(13)
m.lm <- train(salary ~ ., data=dfr2_train, method="lm", preProcess=c('scale', 'center', 'pca'), trControl=control)

In [None]:
print(m.lm)

## Classification: Predicting 6x (or some other salary multiplier)

### Using the ranger library

In [None]:
library(ranger)

In [None]:
m = ranger(y ~ ., data = dfr2_train, importance="impurity")

In [None]:
m$variable.importance

In [None]:
pred.dfr2 <- predict(m, dat=dfr2_test)

In [None]:
table(dfr2_test$y, pred.dfr2$predictions)

In [None]:
dfr3 = subset(dfr2, select=c(pace_avg, drtg_avg, delta_projected_team_total, dkema2, dkema40, y))

In [None]:
in_train = createDataPartition(dfr3$y, p=.75, list=FALSE)
dfr3_train = dfr3[in_train,]
dfr3_test = dfr3[-in_train,]

In [None]:
m = ranger(y ~ ., data = dfr3_train, importance="impurity")

In [None]:
m$variable.importance

In [None]:
pred.dfr3 <- predict(m, dat=dfr3_test)

In [None]:
table(dfr3_test$y, pred.dfr3$predictions)

## Using caret

In [None]:
# Create model with default paramters
# based on http://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/
X = dfr2[,1:ncol(dfr2) - 1]
y = dfr2[,ncol(dfr2)]
control <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(13)
tunegrid <- expand.grid(.mtry=sqrt(ncol(dfr2)))

In [None]:
rfm <- train(X, y, method="ranger", metric="Accuracy", tuneGrid=tunegrid, trControl=control)

In [None]:
print(rfm)

In [None]:
# do gridsearch for optimal mtry
# based on http://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/
X = dfr2[,1:ncol(dfr2) - 1]
y = dfr2[,ncol(dfr2)]
control <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(13)
tunegrid <- expand.grid(.mtry=c(1:15))

In [None]:
rfm2 <- train(X, y, method="ranger", metric="Accuracy", tuneGrid=tunegrid, trControl=control)

In [None]:
print(rfm2)