# General Assembly DSI - Denver 2018
## Capstone Project - DFS Model
This is my capstone project for General Assembly's fifth [Data Science Immersive](https://generalassemb.ly/education/data-science-immersive) cohort in 2018. I am developing a model to help optimize NFL lineups on the daily fantasy sports platforms [DraftKings](https://www.draftkings.com/) and [FanDuel](https://www.fanduel.com/).

### Problem Statement

Can we build a model that predicts a football player’s fantasy performance well enough to estimate their value, and then use that model alongside a daily fantasy strategy to be profitable?

In [1]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

%matplotlib inline



In [2]:
# load data
data = pd.read_csv('../data/modeling_tes.csv')

In [3]:
data = data.groupby(['Name', 'Year', 'Week', 'Month', 'Team', 'Oppt']).sum()

In [4]:
data.head()

| Name | Year | Week | Month | Team | Oppt | Age | h/a | Favored | Spread | O/U | Temperature | Wind | Receptions | Targets | Rec_TDs | ... | Weather_DOME | Weather_Fog | Weather_Rain | Weather_Rain \| Fog | Weather_Snow | Weather_Snow \| Fog | Weather_Snow \| Freezing Rain | Weather_Sunny | FD salary | FD points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adams, Jerell | 2016 | 5 | October | NYG | GB | 23.283 | 0 | 0 | -7.0 | 49.0 | 48.0 | 2.0 | 2.0 | 2.333333 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4500.0 | 3.7 |
| Adams, Jerell | 2016 | 7 | October | NYG | LAR | 23.297 | 0 | 1 | -2.5 | 44.5 | 54.0 | 10.0 | 2.0 | 2.333333 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4500.0 | 1.0 |
| Adams, Jerell | 2016 | 9 | November | NYG | PHI | 23.311 | 1 | 1 | -3.0 | 42.5 | 57.0 | 8.0 | 1.333333 | 1.333333 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4500.0 | 3.9 |
| Adams, Jerell | 2016 | 11 | November | NYG | CHI | 23.325 | 1 | 1 | -7.0 | 41.5 | 40.0 | 12.0 | 3.0 | 3.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4500.0 | 0.0 |
| Adams, Jerell | 2016 | 13 | December | NYG | PIT | 23.339 | 0 | 0 | -6.5 | 49.5 | 40.0 | 2.0 | 1.5 | 2.0 | 0.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4500.0 | 1.5 |


### Modeling
- Linear Regression
- Random Forest
- Support Vector Regression
- Boosting
- PCA
- Neural Net

### Linear Regression

In [5]:
columns = [col for col in data.columns if col != 'FD points']

def linreg(df, features, target = 'FD points'):
    """Fit a linear regression on `features`, print its metrics and coefficients, and return it."""
    X = df[features]
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    
    print('=============EVALUATION=============\n')
    print('R2 Training:', model.score(X_train, y_train))
    print('R2 Testing:', model.score(X_test, y_test))

    predictions = model.predict(X_test)

    print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
    print('MAE:', mean_absolute_error(y_test, predictions), '\n')
    print('=============COEFFICIENTS=============\n')
    print('Intercept:', model.intercept_)
    # pair each feature name with its fitted coefficient
    for feature, coef in zip(X.columns, model.coef_):
        print(feature, coef)
    return model

# first attempt - all features
lr_all_features = linreg(data, columns)


R2 Training: 0.23579381106157848
R2 Testing: 0.2203812753547637
RMSE: 4.978861720650418
MAE: 3.8264615739438472 


Intercept: -6.709895460711842
Age 0.06449120266836442
h/a 0.18251896949854607
Favored 0.3725098495621829
Spread -0.00689981020118736
O/U 0.06547691277398701
Temperature 0.004869535931084206
Wind -0.04378065288412842
Receptions 0.2257883517506472
Targets 0.4537566849602897
Rec_TDs -0.7294945082394024
Rec_Yds 0.02759674772844765
Opp_Rank -0.004919125357804126
Opp_Avg_Rec_Allowed -0.11804752197878918
Opp_Avg_Targets_Allowed 0.077734528118091
Opp_Avg_TDs_Allowed 0.14443663074968655
Opp_Avg_Yds_Allowed 0.011785464897104405
Weather_DOME -1.58090990453568
Weather_Fog 0.5892585812058256
Weather_Rain -0.5544892930624472
Weather_Rain | Fog -2.5166126687691786
Weather_Snow -1.1615399021768338
Weather_Snow | Fog 4.242079502139789
Weather_Snow | Freezing Rain 2.0677970368420118
Weather_Sunny -1.0855833516434783
FD salary 0.0009839241884064278


Feature coefficients that seem logical:
- Increasing Prediction for:
    - Home/Away
    - Favored
    - O/U
    - Receptions
    - Targets
    - Receiving Yards
    - Opponent Targets Allowed
    - Opponent TDs Allowed
    - Opponent Yards Allowed
    
- Decreasing Prediction for:
    - Wind
    - Rain
    - Rain & Fog
    - Snow
    - Opponent Rank
        - The rank column is actually backwards, so defenses get better as ranking improves

Feature coefficients that make no sense to me:
- Increasing Prediction for:
    - Age
        - Sort of...
    - Fog
    - Snow & Fog
    - Snow & Freezing Rain
- Decreasing Prediction for:
    - Being in a Dome (controlled conditions seem like they should help a player perform better)
    - Sunny weather
    - Receiving TDs
    - Opponent Receptions Allowed
- Almost no effect from:
    - FD Salary
        - Since salary more or less reflects FanDuel's own projections, I expected it to have an impact.

**The most interesting thing to note here is that FD Salary has a seemingly negligible coefficient.** The features are unscaled, though, so a coefficient of roughly 0.001 points per dollar still works out to about 1 point per $1,000 of salary.
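One caveat when reading the raw coefficients: the features sit on very different scales (salary in thousands of dollars, flags as 0/1), so coefficient size alone does not show which feature matters most. The sketch below uses synthetic data, not the project's dataset, and the variable names and the assumed relationship are purely illustrative. It compares coefficients fitted on raw and on standardized inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins (assumptions, not the real data):
# a salary in dollars and a 0/1 "favored" flag.
salary = rng.uniform(4500, 8000, size=n)
favored = rng.integers(0, 2, size=n).astype(float)

# Assumed relationship: 1 point per $1,000 of salary, 0.4 points for being favored.
points = 0.001 * salary + 0.4 * favored + rng.normal(0, 1, size=n)

X = np.column_stack([salary, favored])

# Same model, fitted once on raw features and once on standardized features.
raw = LinearRegression().fit(X, points)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), points)

print('raw coefficients:   ', raw.coef_)     # per dollar / per flag
print('scaled coefficients:', scaled.coef_)  # per standard deviation of each feature
```

On standardized inputs each coefficient is expressed per standard deviation of its feature, which makes the magnitudes comparable across features.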

In [6]:
# data.columns

In [7]:
# remove_cols = ['Weather_Fog', 'Weather_Snow | Freezing Rain', 'Opp_Avg_Ints', 'Weather_DOME', 
#                'Weather_Sunny', 'Yards', 'Opp_Avg_Att_Allowed', 'Opp_Avg_Comp_Allowed', 
#                'Opp_Avg_TDs_Allowed', 'Rating', 'FD salary', 'FD points']

# features = [col for col in data.columns if col not in remove_cols]

In [8]:
# lr_partial_features = linreg(data, features)

### Random Forest Regression

In [58]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hyperparameter Tuning

In [10]:
model = RandomForestRegressor()
cross_val_score(model, X_train, y_train).mean()

0.1219148993050894

In [11]:
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [12]:
model.score(X_test, y_test)

0.14379246771871357

In [13]:
predictions = model.predict(X_test)

In [14]:
mean_absolute_error(y_test, predictions)

4.014495412844037

In [15]:
np.sqrt(mean_squared_error(y_test, predictions))

5.217692203130183

#### With Hyperparameter Tuning

In [63]:
model = RandomForestRegressor()

params = {
    'max_depth': [15, 20, 25],
    'n_estimators': [90, 95, 100],
    'min_samples_leaf': [11, 13, 15],
    'max_features': [8, 10, 12]
}

gs = GridSearchCV(model, param_grid = params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [15, 20, 25], 'n_estimators': [90, 95, 100], 'min_samples_leaf': [11, 13, 15], 'max_features': [8, 10, 12]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [64]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.23968342429199102
0.2397514647957475
{'max_depth': 25, 'max_features': 8, 'min_samples_leaf': 11, 'n_estimators': 100}


In [65]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

3.761517392309697
4.916620979967099
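The grid search above relies on the default `cv` and `scoring` settings, so `best_score_` is a cross-validated R2 while MAE is computed separately afterwards. The sketch below shows one way to make both choices explicit. It runs on synthetic data, and the small parameter grid is an arbitrary illustration rather than a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(5)

# Synthetic regression problem (assumption, not the project's data).
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=300)

params = {'max_depth': [3, 6], 'min_samples_leaf': [5, 15]}

# Explicit, shuffled 5-fold split so results do not depend on library defaults.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid=params,
    cv=cv,
    scoring='neg_mean_absolute_error',  # select on MAE instead of R2
)
search.fit(X, y)

print(search.best_params_)
print('CV MAE:', -search.best_score_)  # flip the sign back to a positive error
```

Selecting on the same metric that is reported later keeps the tuning objective and the final evaluation aligned.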


In [69]:
dict(zip(X.columns, gs.best_estimator_.feature_importances_))

{'Age': 0.05614846183189576,
 'h/a': 0.006075499680317972,
 'Favored': 0.008419205567882103,
 'Spread': 0.030077061916882177,
 'O/U': 0.03857191054353415,
 'Temperature': 0.030050943010066722,
 'Wind': 0.02327311546739492,
 'Receptions': 0.14062753842998865,
 'Targets': 0.12136770955008425,
 'Rec_TDs': 0.01803184599414879,
 'Rec_Yds': 0.13450606225835707,
 'Opp_Rank': 0.03218574497964323,
 'Opp_Avg_Rec_Allowed': 0.02306772899365528,
 'Opp_Avg_Targets_Allowed': 0.02995825250344852,
 'Opp_Avg_TDs_Allowed': 0.012120859373062058,
 'Opp_Avg_Yds_Allowed': 0.034463508135742116,
 'Weather_DOME': 0.0025286939998132923,
 'Weather_Fog': 0.0,
 'Weather_Rain': 0.0001467408292299609,
 'Weather_Rain | Fog': 0.0,
 'Weather_Snow': 0.0,
 'Weather_Snow | Fog': 0.0,
 'Weather_Snow | Freezing Rain': 0.0,
 'Weather_Sunny': 0.002950064565052698,
 'FD salary': 0.25542905236980035}

> Now this is telling me that `FD salary` is far and away the most important feature, the opposite of what the linear regression coefficients suggested.
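Impurity-based `feature_importances_` are computed on the training data and can favor continuous features with many distinct values, so a second opinion can be useful. One option is permutation importance on held-out data. The sketch below assumes scikit-learn 0.22 or newer, where `sklearn.inspection.permutation_importance` is available, and uses synthetic data with illustrative feature names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic data (assumption): one feature drives the target, the other is pure noise.
informative = rng.normal(size=n)
noise_feature = rng.normal(size=n)
X = np.column_stack([informative, noise_feature])
y = 3.0 * informative + rng.normal(0, 0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

# Shuffle each column of the held-out set and measure how much the score drops.
result = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)

for name, importance in zip(['informative', 'noise_feature'], result.importances_mean):
    print(name, round(importance, 3))
```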

### Support Vector Regression

In [77]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hyperparameter Tuning

In [21]:
model = SVR()

In [22]:
model.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [23]:
model.score(X_train, y_train)

0.11279720131204463

In [24]:
model.score(X_test, y_test)

-0.11463266923230298

In [25]:
y_pred = model.predict(X_test)

In [26]:
mean_absolute_error(y_test, y_pred)

4.160065196278003

#### With Hyperparameter Tuning

In [54]:
model = SVR()
params = {
    'kernel': ['linear'],
    'C': [0.1, 0.01]
}

gs = GridSearchCV(model, param_grid = params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'kernel': ['linear'], 'C': [0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [55]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.19105404779867072
0.18042208093057188
{'C': 0.1, 'kernel': 'linear'}


In [56]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

3.707355435298749
5.104862614977823
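SVR is sensitive to feature scale, and the features here were passed in unscaled, which may partly explain the negative test score of the default RBF model. The sketch below shows one way to put `StandardScaler` inside the grid search so that scaling is refit on each training fold. The data is synthetic, and the step names (`svr__...`) come from `make_pipeline`'s automatic naming.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
n = 300

# Synthetic features on very different scales (assumption, not the real data).
salary = rng.uniform(4500, 8000, size=n)
targets = rng.uniform(0, 10, size=n)
X = np.column_stack([salary, targets])
y = 0.001 * salary + 0.5 * targets + rng.normal(0, 1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled using its own training split.
pipe = make_pipeline(StandardScaler(), SVR())
params = {'svr__kernel': ['rbf', 'linear'], 'svr__C': [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid=params, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```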


### AdaBoost

In [70]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [74]:
model = AdaBoostRegressor()
params = {
    'n_estimators': [45, 47, 50],
    'learning_rate': [.9, .92, .95]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [45, 47, 50], 'learning_rate': [0.9, 0.92, 0.95]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [75]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

-0.004661824318639856
-0.16595185649995114
{'learning_rate': 0.92, 'n_estimators': 50}


In [76]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

5.290952029631688
6.0887654296182285
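A negative R2 means the model does worse on that data than always predicting the mean. To make that reference point concrete, the sketch below scores a mean-only baseline with `DummyRegressor`. The targets are synthetic (an assumed skewed distribution, not the project's points data), so the numbers only illustrate the idea.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(7)

# Synthetic, right-skewed targets (assumption, not the real FD points).
y_train = rng.gamma(shape=2.0, scale=3.0, size=400)
y_test = rng.gamma(shape=2.0, scale=3.0, size=150)

# DummyRegressor ignores the features, so placeholders are enough.
X_train = np.zeros((400, 1))
X_test = np.zeros((150, 1))

baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_r2 = r2_score(y_test, baseline_pred)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

print('baseline R2: ', baseline_r2)   # at or slightly below 0
print('baseline MAE:', baseline_mae)
```

Any model whose test R2 falls below this baseline, or whose MAE is higher, is not adding predictive value on that split.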


### Gradient Boosting

In [34]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [35]:
model = GradientBoostingRegressor()
params = {
    'n_estimators': [75, 80, 85],
    'learning_rate': [0.15, 0.2, 0.25, 0.3]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [75, 80, 85], 'learning_rate': [0.15, 0.2, 0.25, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.18118682217416815
0.19752317823437493
{'learning_rate': 0.15, 'n_estimators': 80}


In [37]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

3.8279745550800435
5.051323499084434


### Principal Component Analysis

In [78]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [39]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

pca = PCA(n_components = 18)
pca.fit(X_train_scaled)

PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [40]:
var_exp = pca.explained_variance_ratio_
print('Explained Variance: ', var_exp)

Explained Variance:  [0.14579905 0.10928441 0.1008197  0.05487093 0.05050765 0.04440351
 0.04290416 0.04152544 0.04087467 0.04062209 0.04038616 0.03937853
 0.03631828 0.03534193 0.03112427 0.02818144 0.02778879 0.02650852]


In [41]:
cum_var_exp = np.cumsum(var_exp)
print('Cumulative Explained Variance: ', cum_var_exp)

Cumulative Explained Variance:  [0.14579905 0.25508346 0.35590316 0.41077408 0.46128173 0.50568524
 0.5485894  0.59011484 0.63098951 0.6716116  0.71199776 0.75137629
 0.78769457 0.8230365  0.85416077 0.88234221 0.910131   0.93663952]
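Instead of fixing the number of components up front, `PCA` also accepts a float between 0 and 1 for `n_components` and keeps as many components as are needed to reach that share of explained variance. The sketch below uses synthetic data with an arbitrary 90% threshold. `svd_solver='full'` is set explicitly because the fractional form requires the full solver.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic data (assumption): 10 observed features driven by 3 latent factors plus noise.
latent = rng.normal(size=(400, 3))
loadings = rng.normal(size=(3, 10))
X = latent @ loadings + 0.1 * rng.normal(size=(400, 10))

X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components whose cumulative variance reaches 90%.
pca = PCA(n_components=0.90, svd_solver='full').fit(X_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)
print('components kept:', pca.n_components_)
print('cumulative explained variance:', cum_var)
```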


In [42]:
Z_train = pca.transform(X_train_scaled)
Z_test = pca.transform(X_test_scaled)

In [43]:
model = LinearRegression()

model.fit(Z_train, y_train)
print(model.score(Z_train, y_train))
print(model.score(Z_test, y_test))

0.2312486678582144
0.20735035787213055


In [44]:
predictions = model.predict(Z_test)

In [45]:
mean_absolute_error(y_test, predictions)

3.838802770486505

In [46]:
np.sqrt(mean_squared_error(y_test, predictions))

5.02029881760621

### Neural Network

In [47]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [48]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

In [49]:
model = Sequential()

model.add(Dense(X_train.shape[1], activation = 'relu', input_dim = X_train.shape[1]))
model.add(Dense(5, activation = 'relu'))
model.add(Dense(1, activation=None))

In [50]:
model.compile(loss = 'mean_squared_error', optimizer = 'adam')

In [51]:
model.fit(X_train_scaled, 
          y_train, 
          epochs = 10, 
          batch_size = 2, 
          validation_data = (X_test_scaled, y_test))

Train on 2286 samples, validate on 763 samples


<keras.callbacks.History at 0x1a28e485c0>

In [52]:
predictions = model.predict(X_test_scaled)

In [53]:
print(mean_absolute_error(y_test, predictions))
print(np.sqrt(mean_squared_error(y_test, predictions)))

3.8108382990835223
5.025512438443507
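Every section above repeats the same three calls to get R2, MAE, and RMSE. A small helper could collect them in one place so the models are easier to compare side by side. The `evaluate` function below is a hypothetical helper, not part of the original notebook, and it is demonstrated on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Hypothetical helper: score any fitted model that exposes .predict()."""
    # np.ravel flattens (n, 1) outputs, such as those a Keras model returns.
    preds = np.ravel(model.predict(X_test))
    return {
        'R2': r2_score(y_test, preds),
        'MAE': mean_absolute_error(y_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_test, preds)),
    }


rng = np.random.default_rng(11)

# Synthetic regression problem (assumption, not the project's data).
X = rng.normal(size=(400, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

metrics = evaluate(LinearRegression().fit(X_train, y_train), X_test, y_test)
print(metrics)
```

Collecting one such dictionary per model into a `pd.DataFrame` would give a single comparison table for the whole notebook.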
