# General Assembly DSI - Denver 2018
## Capstone Project - DFS Model
This is my capstone project for the fifth cohort of General Assembly's [Data Science Immersive](https://generalassemb.ly/education/data-science-immersive) in 2018. I am developing a model to help optimize NFL lineups on the daily fantasy sports platforms [DraftKings](https://www.draftkings.com/) and [FanDuel](https://www.fanduel.com/).

### Problem Statement

Can we build a model that predicts an NFL player's fantasy performance well enough to estimate the player's value and, combined with a daily fantasy strategy, be profitable?
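
One common way to make "value" concrete in daily fantasy is projected points per $1,000 of salary. The sketch below assumes that definition; the `value_per_1k` helper, the player labels, and the projection numbers are all hypothetical and are not taken from the project data.

```python
import pandas as pd

def value_per_1k(projected_points, salary):
    """Projected fantasy points per $1,000 of salary (an assumed definition of value)."""
    return projected_points / (salary / 1000.0)

# Hypothetical projections, for illustration only.
players = pd.DataFrame({
    'Name': ['QB A', 'QB B', 'QB C'],
    'FD salary': [8000, 6000, 7000],
    'projected_points': [20.0, 18.0, 14.0],
})

players['value'] = value_per_1k(players['projected_points'], players['FD salary'])

# Rank players from best to worst value.
ranked = players.sort_values('value', ascending=False).reset_index(drop=True)
print(ranked[['Name', 'FD salary', 'projected_points', 'value']])
```

In this toy example the cheaper `QB B` ranks first despite a lower raw projection, which is the kind of trade-off a salary-capped lineup has to weigh.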

In [1]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

%matplotlib inline


In [2]:
# load data
data = pd.read_csv('../data/modeling_qbs.csv')

In [3]:
data = data.groupby(['Name', 'Year', 'Week', 'Month', 'Team', 'Oppt']).sum()

In [4]:
data.head()

| Name | Year | Week | Month | Team | Oppt | Age | h/a | Favored | Spread | O/U | Temperature | Wind | Weather_DOME | Weather_Fog | Weather_Rain | ... | Y/A | Yards | Opp_Avg_Att_Allowed | Opp_Avg_Comp_Allowed | Opp_Avg_Ints | Opp_Rank | Opp_Avg_TDs_Allowed | Opp_Avg_Yds_Allowed | FD salary | FD points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Barkley, Matt | 2013 | 8 | October | PHI | NYG | 23.049 | 1 | 1 | -5.5 | 49.5 | 58.0 | 13.0 | 0 | 0 | 0 | ... | 6.45 | 129.0 | 47.333333 | 26.0 | 2.0 | 19 | 1.666667 | 308.0 | 5800.0 | 3.32 |
| Barkley, Matt | 2013 | 9 | November | PHI | OAK | 23.056 | 0 | 0 | -1.0 | 45.5 | 62.0 | 6.0 | 0 | 0 | 0 | ... | 6.265 | 143.5 | 43.0 | 26.333333 | 1.0 | 10 | 1.0 | 278.666667 | 6000.0 | 0.32 |
| Barkley, Matt | 2016 | 12 | November | CHI | TEN | 26.08 | 1 | 0 | -6.5 | 42.0 | 38.0 | 5.0 | 0 | 1 | 0 | ... | 5.4 | 81.0 | 35.333333 | 22.333333 | 0.333333 | 8 | 1.666667 | 255.666667 | 6000.0 | 22.64 |
| Barkley, Matt | 2016 | 13 | December | CHI | SF | 26.087 | 1 | 0 | -1.0 | 44.0 | 35.0 | 5.0 | 0 | 0 | 1 | ... | 5.625 | 198.5 | 30.666667 | 19.666667 | 0.333333 | 30 | 2.0 | 255.0 | 6000.0 | 7.98 |
| Barkley, Matt | 2016 | 14 | December | CHI | DET | 26.094 | 0 | 0 | -7.5 | 42.0 | 72.0 | 0.0 | 1 | 0 | 0 | ... | 7.306667 | 196.333333 | 24.666667 | 13.666667 | 0.0 | 9 | 1.333333 | 164.0 | 6200.0 | 12.48 |


### Modeling
- Linear Regression
- Random Forest
- Support Vector Regression
- Boosting
- PCA
- Neural Net
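
Each model below is scored with the same three numbers: R2, MAE, and RMSE. A small helper along these lines could collect them in one place. This is a minimal sketch, not code from the notebook: the `evaluate` name is hypothetical, and synthetic data from `make_regression` stands in for the quarterback dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return train/test R2 plus test MAE and RMSE as a dict."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        'r2_train': model.score(X_train, y_train),
        'r2_test': model.score(X_test, y_test),
        'mae': mean_absolute_error(y_test, predictions),
        'rmse': np.sqrt(mean_squared_error(y_test, predictions)),
    }

# Synthetic stand-in data (assumption: not the project's CSV).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = evaluate(LinearRegression(), X_train, X_test, y_train, y_test)
print(results)
```

Any scikit-learn regressor with `fit`, `predict`, and `score` can be passed in, which keeps the comparison between models like-for-like.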

### Linear Regression

In [5]:
columns = [col for col in data.columns if col != 'FD points']

def linreg(df, features, target = 'FD points'):
    X = df[features]
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

    # fit an ordinary least squares model on the selected features
    model = LinearRegression()
    model.fit(X_train, y_train)

    print('=============EVALUATION=============\n')
    print('R2 Training:', model.score(X_train, y_train))
    print('R2 Testing:', model.score(X_test, y_test))

    predictions = model.predict(X_test)

    print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
    print('MAE:', mean_absolute_error(y_test, predictions), '\n')
    print('=============COEFFICIENTS=============\n')
    print('Intercept:', model.intercept_)
    # print each feature name next to its fitted coefficient
    for feature, coef in zip(X.columns, model.coef_):
        print(feature, coef)
    return model

lr_all_features = linreg(data, columns)


R2 Training: 0.15486610055413175
R2 Testing: 0.18665600551619443
RMSE: 7.287011116864119
MAE: 5.843886835718315 


Intercept: -0.20327927495129572
Age -0.06140023478284524
h/a 1.0642877383415745
Favored 1.0974248609121435
Spread 0.051009312296253044
O/U 0.0733439606349433
Temperature 0.00872591655260506
Wind -0.07066066594197369
Weather_DOME -0.7921370031943825
Weather_Fog 0.7450121526853518
Weather_Rain -0.873967921546931
Weather_Rain | Fog -6.012113835360802
Weather_Snow -0.5201543654175679
Weather_Snow | Fog -0.8148385738626248
Weather_Snow | Freezing Rain 8.316867116125255
Weather_Sunny -0.048667569419144095
Attempts 0.034780555734668234
Completions 0.13222211872378642
Interceptions -0.47793441368624884
Rating 0.0061988000776445305
TDs 0.14102747930959597
Y/A 0.14235124384640124
Yards -0.0054605595972012835
Opp_Avg_Att_Allowed -0.017740374245161485
Opp_Avg_Comp_Allowed -0.029322777943181378
Opp_Avg_Ints 0.016947429866321313
Opp_Rank -0.12229781729889888
Opp_Avg_TDs_Allowed -0.3550

Feature Coefficients that seem logical:
- Increasing Prediction for:
    - Home/Away
    - Favored
    - O/U
    - Attempts
    - Completions
    - TDs
    - Y/A
    
- Decreasing Prediction for:
    - Age
        - Only partly: the per-year coefficient is small (about -0.06)
    - Wind
    - Rain
    - Rain & Fog
    - Snow
    - Snow & Fog
    - Interceptions
    - Opponent Rank
        - The rank column is reversed: a higher rank number means a better defense, so a negative coefficient is expected

Feature Coefficients that make no sense to me:
- Increasing Prediction for:
    - Fog
    - Snow & Freezing Rain
    - Opponent Interceptions
- Decreasing Prediction for:
    - Being in a dome (controlled conditions should, if anything, help a quarterback)
    - Sunny weather
    - Yards
    - Opponent Attempts Allowed
    - Opponent Completions Allowed
    - Opponent TDs Allowed
- Almost no effect from:
    - Rating
    - FD Salary
        - Since salary reflects FanDuel's own projection of a player, I expected it to have an impact.

**The most interesting thing to note here is that `FD salary` has a near-zero coefficient, so on its face it has almost no effect on a player's point production.**
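
One caveat before reading too much into that: the regression was fit on unscaled features, so each coefficient is in points per unit of its own feature. `FD salary` is in dollars (5,800 to 6,200 in the rows shown earlier), while flags such as `Favored` are 0/1, so their coefficients are not directly comparable. The sketch below uses synthetic data (the `salary` and `home` variables and the target formula are assumptions, not project data) to show how standardizing the features first can change the picture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n = 500
salary = rng.uniform(5000, 9000, size=n)   # dollars: large numeric scale
home = rng.randint(0, 2, size=n)           # 0/1 flag: small numeric scale

# Synthetic target: 2 points per $1,000 of salary, plus 1 point when at home.
y = 0.002 * salary + 1.0 * home + rng.normal(0, 1, size=n)

X = np.column_stack([salary, home])

# Coefficients per raw unit (per dollar, per flag).
raw = LinearRegression().fit(X, y)

# Coefficients per standard deviation of each feature.
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print('raw coefficients:   ', raw.coef_)
print('scaled coefficients:', scaled.coef_)
```

On the raw scale the salary coefficient looks negligible next to the flag, yet after standardizing it is the larger of the two. Whether the same holds for the real data would need to be checked by refitting `linreg` on scaled features.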

In [6]:
data.columns

Index(['Age', 'h/a', 'Favored', 'Spread', 'O/U', 'Temperature', 'Wind',
       'Weather_DOME', 'Weather_Fog', 'Weather_Rain', 'Weather_Rain | Fog',
       'Weather_Snow', 'Weather_Snow | Fog', 'Weather_Snow | Freezing Rain',
       'Weather_Sunny', 'Attempts', 'Completions', 'Interceptions', 'Rating',
       'TDs', 'Y/A', 'Yards', 'Opp_Avg_Att_Allowed', 'Opp_Avg_Comp_Allowed',
       'Opp_Avg_Ints', 'Opp_Rank', 'Opp_Avg_TDs_Allowed',
       'Opp_Avg_Yds_Allowed', 'FD salary', 'FD points'],
      dtype='object')

In [7]:
remove_cols = ['Weather_Fog', 'Weather_Snow | Freezing Rain', 'Opp_Avg_Ints', 'Weather_DOME', 
               'Weather_Sunny', 'Yards', 'Opp_Avg_Att_Allowed', 'Opp_Avg_Comp_Allowed', 
               'Opp_Avg_TDs_Allowed', 'Rating', 'FD salary', 'FD points']

features = [col for col in data.columns if col not in remove_cols]

In [8]:
lr_partial_features = linreg(data, features)


R2 Training: 0.12384088048260056
R2 Testing: 0.12012426937208964
RMSE: 7.579193053472293
MAE: 6.021901659335721 


Intercept: 3.3717179304861666
Age -0.05184342532900565
h/a 0.7545127193585466
Favored 2.082204448223854
Spread 0.04722141450700851
O/U 0.21074076440789036
Temperature -0.002075301163649467
Wind -0.052021627297039484
Weather_Rain -0.8217764014883531
Weather_Rain | Fog -5.562976552428371
Weather_Snow -0.8758091619729306
Weather_Snow | Fog 0.47061620156574135
Attempts 0.07961798927262871
Completions 0.11265372081012745
Interceptions -0.7045880819602188
TDs 0.854891438318532
Y/A 0.2526735741740987
Opp_Rank -0.11165553997398336
Opp_Avg_Yds_Allowed -0.0060759040342483256


### Random Forest Regression

In [90]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hypertuning

In [91]:
model = RandomForestRegressor()
cross_val_score(model, X_train, y_train).mean()

0.04326462235328713

In [92]:
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [93]:
model.score(X_test, y_test)

0.07804748959279184

In [94]:
predictions = model.predict(X_test)

In [95]:
mean_absolute_error(y_test, predictions)

6.141268456375839

In [96]:
np.sqrt(mean_squared_error(y_test, predictions))

7.758300116519505

#### With Hypertuning

In [97]:
model = RandomForestRegressor()

params = {
    'max_depth': [10, 20, 30],
    'n_estimators': [70, 75, 80],
    'min_samples_leaf': [13, 14],
    'max_features': [10, 20]
}

gs = GridSearchCV(model, param_grid = params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [10, 20, 30], 'n_estimators': [70, 75, 80], 'min_samples_leaf': [13, 14], 'max_features': [10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [98]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.13994090141689752
0.18205143193102846
{'max_depth': 10, 'max_features': 10, 'min_samples_leaf': 13, 'n_estimators': 70}
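
Every selected value here is the smallest candidate in its grid, which may mean the best settings lie outside the range that was searched. A small check like the following can flag that situation. The `params_on_grid_edge` helper is a hypothetical name introduced for illustration; the grid and best parameters are copied from the search above.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return {name: 'low' | 'high'} for tuned values that sit on the edge of their grid."""
    edges = {}
    for name, best in best_params.items():
        candidates = sorted(param_grid[name])
        if len(candidates) < 2:
            continue  # a single candidate cannot indicate a direction
        if best == candidates[0]:
            edges[name] = 'low'
        elif best == candidates[-1]:
            edges[name] = 'high'
    return edges

# Grid and best parameters copied from the random forest search above.
params = {
    'max_depth': [10, 20, 30],
    'n_estimators': [70, 75, 80],
    'min_samples_leaf': [13, 14],
    'max_features': [10, 20],
}
best_params = {'max_depth': 10, 'max_features': 10, 'min_samples_leaf': 13, 'n_estimators': 70}

on_edge = params_on_grid_edge(params, best_params)
print(on_edge)
```

A reasonable next step would be to extend each flagged range downward and rerun the search. Because the forest is not seeded, small differences such as 70 versus 75 trees may also just be noise.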


In [18]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

5.867280408469351
7.326354429216604


In [19]:
dict(zip(X.columns, gs.best_estimator_.feature_importances_))

{'Age': 0.04732571806887254,
 'h/a': 0.011567405036229763,
 'Favored': 0.01860889460390169,
 'Spread': 0.04428143039214601,
 'O/U': 0.05569674677120847,
 'Temperature': 0.028938038789831845,
 'Wind': 0.023075553027321654,
 'Weather_DOME': 0.0025294271001848633,
 'Weather_Fog': 0.0,
 'Weather_Rain': 0.0,
 'Weather_Rain | Fog': 0.0,
 'Weather_Snow': 0.0,
 'Weather_Snow | Fog': 0.0,
 'Weather_Snow | Freezing Rain': 0.0,
 'Weather_Sunny': 0.002947863117521723,
 'Attempts': 0.04965422647524838,
 'Completions': 0.05979835822322462,
 'Interceptions': 0.014529852306577636,
 'Rating': 0.055703018520107894,
 'TDs': 0.0329191301008596,
 'Y/A': 0.04432541857952866,
 'Yards': 0.0845525357638508,
 'Opp_Avg_Att_Allowed': 0.03356653538207762,
 'Opp_Avg_Comp_Allowed': 0.02498466504672492,
 'Opp_Avg_Ints': 0.013748808397962222,
 'Opp_Rank': 0.07658739522154372,
 'Opp_Avg_TDs_Allowed': 0.017288672841565718,
 'Opp_Avg_Yds_Allowed': 0.045644747490440854,
 'FD salary': 0.21172555874306878}

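> The random forest ranks `FD salary` as by far the most important feature (importance of about 0.21, versus about 0.08 for the next feature, `Yards`), the opposite of what the linear regression coefficients suggested.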
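
Impurity-based importances like `feature_importances_` are computed on the training data and can favor continuous, high-cardinality features, so a cross-check is worthwhile. One option, which is a different technique from the one used in this notebook, is permutation importance on the held-out set. The sketch below assumes scikit-learn 0.22 or later and uses synthetic `signal` and `noise` columns rather than the project data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
n = 600
X = pd.DataFrame({
    'signal': rng.normal(size=n),  # drives the target
    'noise': rng.normal(size=n),   # unrelated to the target
})
y = 3.0 * X['signal'] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

# Shuffle one column at a time on the test set and measure the drop in R2.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

If `FD salary` also comes out on top under permutation importance on the real test set, that would strengthen the conclusion that salary carries most of the predictive signal.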

### Support Vector Regression

In [20]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hypertuning

In [21]:
model = SVR()

In [22]:
model.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [23]:
model.score(X_train, y_train)

0.17839379011055656

In [24]:
model.score(X_test, y_test)

-0.0035552533895411016

In [25]:
y_pred = model.predict(X_test)

In [26]:
mean_absolute_error(y_test, y_pred)

6.516993357999584

#### Hypertuning

In [101]:
model = SVR()
params = {
    'kernel': ['linear'],
    'C': [0.1, 0.01]
}

gs = GridSearchCV(model, param_grid = params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'kernel': ['linear'], 'C': [0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [102]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.11735567863893763
0.18523738880463558
{'C': 0.01, 'kernel': 'linear'}


In [103]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

5.820359894682846
7.293363270531602
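
The default RBF model above was fit on unscaled features with `gamma='auto'`, which scikit-learn defines as 1 / n_features. With salaries in the thousands, that setting makes almost every pair of rows look unrelated to the kernel, which may explain the test R2 near zero. The sketch below puts a `StandardScaler` in front of the `SVR` inside a pipeline so the grid search scales each training fold separately. The data is synthetic, and the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(42)
n = 400
salary = rng.uniform(5000, 9000, size=n)   # dollars
favored = rng.randint(0, 2, size=n)        # 0/1 flag
X = np.column_stack([salary, favored])
y = 0.003 * salary + 2.0 * favored + rng.normal(0, 1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# make_pipeline names the SVR step 'svr', hence the 'svr__' parameter prefix.
pipe = make_pipeline(StandardScaler(), SVR())
params = {'svr__kernel': ['rbf', 'linear'], 'svr__C': [0.1, 1.0, 10.0]}
gs = GridSearchCV(pipe, param_grid=params, cv=3)
gs.fit(X_train, y_train)
scaled_score = gs.score(X_test, y_test)

# Baseline mirroring the notebook's configuration: RBF kernel, gamma='auto', no scaling.
unscaled_score = SVR(gamma='auto').fit(X_train, y_train).score(X_test, y_test)

print('scaled pipeline R2:', scaled_score)
print('unscaled RBF R2:   ', unscaled_score)
```

With scaling in place, the RBF kernel becomes a fair candidate in the grid instead of being ruled out by the raw feature magnitudes.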


### AdaBoost

In [30]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [31]:
model = AdaBoostRegressor()
params = {
    'n_estimators': [45, 47, 50],
    'learning_rate': [.8, .83, .85, .9]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [45, 47, 50], 'learning_rate': [0.8, 0.83, 0.85, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [32]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.08877834225084463
0.11381732505308828
{'learning_rate': 0.83, 'n_estimators': 45}


In [33]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

6.19816261998654
7.606308356372341


### Gradient Boosting

In [34]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [35]:
model = GradientBoostingRegressor()
params = {
    'n_estimators': [75, 80, 85],
    'learning_rate': [0.15, 0.2, 0.25, 0.3]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [75, 80, 85], 'learning_rate': [0.15, 0.2, 0.25, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.08341984965942152
0.15906787487379648
{'learning_rate': 0.15, 'n_estimators': 80}


In [37]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

5.93891294315364
7.409566015594872


### Principal Component Analysis

In [81]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [82]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

pca = PCA(n_components = 24)
pca.fit(X_train_scaled)

PCA(copy=True, iterated_power='auto', n_components=24, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [83]:
var_exp = pca.explained_variance_ratio_
print('Explained Variance: ', var_exp)

Explained Variance:  [0.15884208 0.09447997 0.08122668 0.07204716 0.04974341 0.04485379
 0.04019799 0.03860966 0.03725892 0.03546763 0.03508014 0.0347573
 0.03381769 0.03137285 0.02975884 0.02751775 0.02423443 0.02357858
 0.02324518 0.02216546 0.01599642 0.01521103 0.01107703 0.00908408]


In [84]:
cum_var_exp = np.cumsum(var_exp)
print('Cumulative Explained Variance: ', cum_var_exp)

Cumulative Explained Variance:  [0.15884208 0.25332206 0.33454874 0.4065959  0.45633931 0.5011931
 0.54139109 0.58000075 0.61725967 0.6527273  0.68780744 0.72256474
 0.75638243 0.78775528 0.81751412 0.84503188 0.8692663  0.89284489
 0.91609006 0.93825553 0.95425195 0.96946298 0.98054    0.98962408]
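
The cumulative curve can be used to choose how many components to keep for a given variance target, rather than fixing `n_components = 24` up front. The sketch below reuses the explained variance ratios printed above; the `components_for` helper and the thresholds are assumptions added for illustration.

```python
import numpy as np

# Explained variance ratios copied from the PCA output above.
var_exp = np.array([
    0.15884208, 0.09447997, 0.08122668, 0.07204716, 0.04974341, 0.04485379,
    0.04019799, 0.03860966, 0.03725892, 0.03546763, 0.03508014, 0.0347573,
    0.03381769, 0.03137285, 0.02975884, 0.02751775, 0.02423443, 0.02357858,
    0.02324518, 0.02216546, 0.01599642, 0.01521103, 0.01107703, 0.00908408,
])
cum_var_exp = np.cumsum(var_exp)

def components_for(threshold, cumulative):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    if threshold > cumulative[-1]:
        raise ValueError('threshold exceeds the variance explained by all fitted components')
    # searchsorted returns the first index whose cumulative value is >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

for threshold in (0.80, 0.90, 0.95):
    print(threshold, components_for(threshold, cum_var_exp))
```

With these ratios, 80% of the variance needs 15 components, 90% needs 19, and 95% needs 21, so the variance is spread fairly evenly and PCA offers little compression on this feature set.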


In [85]:
Z_train = pca.transform(X_train_scaled)
Z_test = pca.transform(X_test_scaled)

In [86]:
model = LinearRegression()

model.fit(Z_train, y_train)
print(model.score(Z_train, y_train))
print(model.score(Z_test, y_test))

0.1540178668111044
0.18498667830864635


In [87]:
predictions = model.predict(Z_test)

In [88]:
mean_absolute_error(y_test, predictions)

5.850796686201867

In [89]:
np.sqrt(mean_squared_error(y_test, predictions))

7.294485304149663

### Neural Network

In [47]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [48]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

In [49]:
model = Sequential()

model.add(Dense(29, activation = 'relu', input_dim = X_train.shape[1]))
model.add(Dense(5, activation = 'relu'))
model.add(Dense(1, activation=None))

In [50]:
model.compile(loss = 'mean_squared_error', optimizer = 'adam')

In [51]:
model.fit(X_train_scaled, 
          y_train, 
          epochs = 10, 
          batch_size = 2, 
          validation_data = (X_test_scaled, y_test))

Train on 1786 samples, validate on 596 samples


<keras.callbacks.History at 0x1a2a3137f0>

In [52]:
predictions = model.predict(X_test_scaled)

In [53]:
print(mean_absolute_error(y_test, predictions))
print(np.sqrt(mean_squared_error(y_test, predictions)))

6.081302643030282
7.550065296454414
