# General Assembly DSI - Denver 2018
## Capstone Project - DFS Model
This is my capstone project for the fifth cohort of General Assembly's [Data Science Immersive](https://generalassemb.ly/education/data-science-immersive) in 2018. I am developing a model to help optimize NFL lineups on the daily fantasy sports platforms [DraftKings](https://www.draftkings.com/) and [FanDuel](https://www.fanduel.com/).

### Problem Statement

Can we build a model that predicts a football player’s fantasy performance, use those predictions to estimate the player's value, and combine the model with a daily fantasy strategy to be profitable?

In [1]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

%matplotlib inline


In [2]:
# load data
data = pd.read_csv('../data/modeling_wrs.csv')

In [3]:
data = data.groupby(['Name', 'Year', 'Week', 'Month', 'Team', 'Oppt']).sum()
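
The `groupby(...).sum()` call above collapses any duplicate player-week rows and moves the grouping keys into the index, which is why the later `data.columns` lists only numeric features. A minimal sketch on a made-up frame (the names and values here are hypothetical, not taken from `modeling_wrs.csv`):

```python
import pandas as pd

# Tiny made-up frame (hypothetical values) with one duplicated player-week.
toy = pd.DataFrame({
    'Name': ['A', 'A', 'B'],
    'Week': [1, 1, 1],
    'Targets': [3.0, 2.0, 5.0],
    'FD points': [4.0, 1.5, 9.0],
})

# Duplicate (Name, Week) rows are summed into a single row.
grouped = toy.groupby(['Name', 'Week']).sum()

print(list(grouped.index.names))  # grouping keys now live in the index
print(list(grouped.columns))      # only the numeric columns remain as columns
```

Note that summing is only harmless if each player-week appears once; if rows really are duplicated, values such as salary would be added together.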

In [4]:
data.head()

| Name | Year | Week | Month | Team | Oppt | Age | h/a | Favored | Spread | O/U | Temperature | Wind | Receptions | Targets | Rec_TDs | ... | Weather_DOME | Weather_Fog | Weather_Rain | Weather_Rain \| Fog | Weather_Snow | Weather_Snow \| Fog | Weather_Snow \| Freezing Rain | Weather_Sunny | FD salary | FD points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adams, Davante | 2014 | 2 | September | GB | NYJ | 21.264 | 1 | 1 | -7.0 | 46.5 | 63.0 | 11.0 | 3.0 | 6.333333 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4700.0 | 7.5 |
| Adams, Davante | 2014 | 3 | September | GB | DET | 21.271 | 0 | 0 | -1.5 | 52.0 | 72.0 | 0.0 | 2.666667 | 5.0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4600.0 | 2.1 |
| Adams, Davante | 2014 | 4 | September | GB | CHI | 21.278 | 0 | 1 | -2.0 | 51.0 | 78.0 | 6.0 | 3.0 | 4.666667 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4500.0 | 2.8 |
| Adams, Davante | 2014 | 5 | October | GB | MIN | 21.282 | 1 | 1 | -8.5 | 46.5 | 63.0 | 7.0 | 3.0 | 4.333333 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4500.0 | 7.6 |
| Adams, Davante | 2014 | 6 | October | GB | MIA | 21.292 | 0 | 1 | -1.5 | 47.0 | 88.0 | 12.0 | 1.666667 | 3.333333 | 0.333333 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4900.0 | 10.7 |


### Modeling
- Linear Regression
- Random Forest
- Support Vector Regression
- Boosting
- PCA
- Neural Net

### Linear Regression

In [5]:
columns = [col for col in data.columns if col != 'FD points']

def linreg(df, features, target = 'FD points'):
    X = df[features]
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

    # fit an ordinary least squares model on the training split
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    print('=============EVALUATION=============\n')
    print('R2 Training:', model.score(X_train, y_train))
    print('R2 Testing:', model.score(X_test, y_test))

    predictions = model.predict(X_test)

    print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
    print('MAE:', mean_absolute_error(y_test, predictions), '\n')
    print('=============COEFFICIENTS=============\n')
    print('Intercept:', model.intercept_)
    # print each feature name next to its fitted coefficient
    for feature, coef in zip(X.columns, model.coef_):
        print(feature, coef)
    return model

lr_all_features = linreg(data, columns)


R2 Training: 0.21357368801925125
R2 Testing: 0.19696907087177018
RMSE: 6.188420455379233
MAE: 4.8720054200683105 


Intercept: -4.885523943097192
Age -0.006861437450359716
h/a 0.2552276368073809
Favored 0.5722819328597738
Spread -0.04549945815720833
O/U 0.04432805984737186
Temperature 0.0035347967703881493
Wind -0.03859521810994177
Receptions 0.08663763483670646
Targets 0.47315325262832025
Rec_TDs -0.310114707651849
Rec_Yds 0.01705604420446762
Opp_Rank -0.03765596806172487
Opp_Avg_Rec_Allowed -0.18453892568859978
Opp_Avg_Targets_Allowed 0.12541076175593396
Opp_Avg_TDs_Allowed 0.1155990766202968
Opp_Avg_Yds_Allowed 0.004851222386516624
Weather_DOME -0.3011350866689547
Weather_Fog 1.1584893507770873
Weather_Rain -1.3974150526353875
Weather_Rain | Fog -1.9376802347148876
Weather_Snow -0.09826526179660736
Weather_Snow | Fog -1.4079746315088875
Weather_Snow | Freezing Rain 4.1168075185894475
Weather_Sunny -0.13282660204181437
FD salary 0.0011394086316156163


Feature coefficients that seem logical:
- Increasing prediction for:
    - Home/Away
    - Favored
    - O/U
    - Receptions
    - Targets
    - Receiving yards
    
- Decreasing prediction for:
    - Age
        - Sort of: the coefficient is close to zero.
    - Wind
    - Rain
    - Rain & Fog
    - Snow
    - Snow & Fog
    - Opponent Rank
        - The rank column is actually backwards, so a higher value means a better defense.

Feature coefficients that make no sense to me:
- Increasing prediction for:
    - Fog
    - Snow & Freezing Rain
- Decreasing prediction for:
    - Being in a dome (a player should perform better in controlled conditions)
    - Sunny weather
    - Receiving TDs
    - Opponent receptions allowed
- Almost no effect from:
    - FD Salary
        - Since salary is effectively derived from FanDuel's own projections, I expected it to have an impact.

**The most interesting thing to note here is that FD Salary seemingly has almost no effect on a player's point production.** Keep in mind that the coefficient is per dollar of salary: 0.00114 points per dollar works out to roughly 1.14 points per $1,000.
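
Raw linear regression coefficients are expressed per unit of each feature, so a dollar-valued column and a per-game count are not directly comparable. One common way to put them on the same footing is to standardize the features before fitting. The sketch below uses synthetic data (the `salary` and `targets` arrays and their effect sizes are invented for illustration, not taken from this dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n = 500

# Hypothetical features on very different scales.
salary = rng.uniform(4500, 9000, n)   # dollars
targets = rng.uniform(0, 12, n)       # per-game count
points = 0.002 * salary + 0.5 * targets + rng.normal(0, 1, n)

X_demo = np.column_stack([salary, targets])

# Coefficients in raw units: points per dollar vs. points per target.
raw = LinearRegression().fit(X_demo, points)

# Coefficients per standard deviation of each feature.
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), points)

print('raw coefficients:         ', raw.coef_)
print('standardized coefficients:', scaled.coef_)
```

In this synthetic setup the salary coefficient looks negligible in raw units yet is the larger of the two once both features are standardized. Whether the same holds for the real data would need to be checked by refitting on scaled features.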

In [6]:
# data.columns

In [7]:
# remove_cols = ['Weather_Fog', 'Weather_Snow | Freezing Rain', 'Opp_Avg_Ints', 'Weather_DOME', 
#                'Weather_Sunny', 'Yards', 'Opp_Avg_Att_Allowed', 'Opp_Avg_Comp_Allowed', 
#                'Opp_Avg_TDs_Allowed', 'Rating', 'FD salary', 'FD points']

# features = [col for col in data.columns if col not in remove_cols]

In [8]:
# lr_partial_features = linreg(data, features)

### Random Forest Regression

In [9]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hypertuning

In [10]:
model = RandomForestRegressor()
cross_val_score(model, X_train, y_train).mean()

0.06966126489294744
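
For a regressor, `cross_val_score` without a `scoring` argument uses the estimator's own `score` method, which is R², so the value above is a mean cross-validated R². The number of folds used by default depends on the scikit-learn version. A minimal sketch with both choices made explicit, on synthetic data from `make_regression` (the `*_demo` names are placeholders, not part of this notebook):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, only to keep the example self-contained.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model_demo = RandomForestRegressor(n_estimators=20, random_state=42)

# Explicit fold count and metric instead of relying on defaults.
r2_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring='r2')

# scikit-learn reports errors as negative scores, so flip the sign to get MAE.
mae_scores = -cross_val_score(model_demo, X_demo, y_demo, cv=5,
                              scoring='neg_mean_absolute_error')

print('mean R2: ', r2_scores.mean())
print('mean MAE:', mae_scores.mean())
```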

In [11]:
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [12]:
model.score(X_test, y_test)

0.0935194121739551

In [13]:
predictions = model.predict(X_test)

In [14]:
mean_absolute_error(y_test, predictions)

5.133384428223844

In [15]:
np.sqrt(mean_squared_error(y_test, predictions))

6.574957209485804

#### With Hypertuning

In [16]:
model = RandomForestRegressor()

params = {
    'max_depth': [10, 20, 30],
    'n_estimators': [70, 75, 80],
    'min_samples_leaf': [13, 14],
    'max_features': [10, 20]
}

gs = GridSearchCV(model, param_grid = params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [10, 20, 30], 'n_estimators': [70, 75, 80], 'min_samples_leaf': [13, 14], 'max_features': [10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [17]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.20366044262401994
0.21296052551540726
{'max_depth': 30, 'max_features': 10, 'min_samples_leaf': 14, 'n_estimators': 75}
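
The grid above is small per parameter but multiplies quickly. A quick way to see how many models a grid search will fit is `ParameterGrid`; the sketch below counts the candidates for the same grid. The fold count is an assumption, since the default depends on the scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

params = {
    'max_depth': [10, 20, 30],
    'n_estimators': [70, 75, 80],
    'min_samples_leaf': [13, 14],
    'max_features': [10, 20],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(params))

n_folds = 3  # assumption: 3-fold CV; newer scikit-learn versions default to 5
print('candidates:', n_candidates)
print('total fits:', n_candidates * n_folds)
```

The best `max_depth` (30) sits at the upper edge of its grid, so it may be worth extending that range in a later search.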


In [18]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

4.813541070341038
6.126492896134535


In [19]:
dict(zip(X.columns, gs.best_estimator_.feature_importances_))

{'Age': 0.04745814888587751,
 'h/a': 0.007293642845841915,
 'Favored': 0.008839491101138429,
 'Spread': 0.033056657531239673,
 'O/U': 0.03364329334438604,
 'Temperature': 0.026565599369248315,
 'Wind': 0.02391552783436463,
 'Receptions': 0.09949026762446782,
 'Targets': 0.12913527054819893,
 'Rec_TDs': 0.0209881155893818,
 'Rec_Yds': 0.10579520882863552,
 'Opp_Rank': 0.03895756609841014,
 'Opp_Avg_Rec_Allowed': 0.02901441058831457,
 'Opp_Avg_Targets_Allowed': 0.03150470702429774,
 'Opp_Avg_TDs_Allowed': 0.016232702589686544,
 'Opp_Avg_Yds_Allowed': 0.03958375698917037,
 'Weather_DOME': 0.0019949374328749143,
 'Weather_Fog': 0.0,
 'Weather_Rain': 0.0005416908789058594,
 'Weather_Rain | Fog': 0.0,
 'Weather_Snow': 0.0,
 'Weather_Snow | Fog': 0.0,
 'Weather_Snow | Freezing Rain': 0.0,
 'Weather_Sunny': 0.0024082556431633214,
 'FD salary': 0.3035807492523961}

> The tuned random forest now ranks `FD salary` as far and away the most important feature (importance of about 0.30), in contrast to its near-zero linear regression coefficient.
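
The importance dictionary is easier to read when sorted. A small sketch using a handful of the values printed above, rounded to four decimals. Tree-based importances do not depend on the units of a feature, which is one reason they can disagree with raw linear coefficients.

```python
import pandas as pd

# A subset of the importances reported above, rounded to four decimals.
importances = {
    'Age': 0.0475,
    'Receptions': 0.0995,
    'Targets': 0.1291,
    'Rec_Yds': 0.1058,
    'Opp_Avg_Yds_Allowed': 0.0396,
    'FD salary': 0.3036,
}

# Sort from most to least important.
ranked = pd.Series(importances).sort_values(ascending=False)
print(ranked)
```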

### Support Vector Regression

In [20]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hypertuning

In [21]:
model = SVR()

In [22]:
model.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [23]:
model.score(X_train, y_train)

0.12586005050609317

In [24]:
model.score(X_test, y_test)

-0.0569092544139942

In [25]:
y_pred = model.predict(X_test)

In [26]:
mean_absolute_error(y_test, y_pred)

5.335342506968317
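
A negative test R² means the model predicts worse than simply using the mean of `y_test`. The SVR above was fit on unscaled features, and an RBF-kernel SVR is sensitive to feature scale, so a dollar-valued column such as `FD salary` can dominate the kernel distances. A minimal sketch of wrapping the scaler and the model in a pipeline, on synthetic data (the `*_demo` names and the rescaled column are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; one column is rescaled to mimic a dollar-valued feature.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=42)
X_demo[:, 0] = X_demo[:, 0] * 1000 + 6000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Same model, with and without standardizing the features first.
svr_raw = SVR().fit(X_tr, y_tr)
svr_scaled = make_pipeline(StandardScaler(), SVR()).fit(X_tr, y_tr)

print('R2 without scaling:', svr_raw.score(X_te, y_te))
print('R2 with scaling:   ', svr_scaled.score(X_te, y_te))
```

Using a pipeline also keeps the scaler fitted on the training folds only when the model is later passed to `GridSearchCV`.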

#### Hypertuning

In [27]:
# model = SVR()
# params = {
#     'kernel': ['linear'],
#     'C': [0.1, 0.01]
# }

# gs = GridSearchCV(model, param_grid = params)
# gs.fit(X_train, y_train)

In [28]:
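# NOTE: the SVR grid search in In [27] is commented out, so `gs` here is still the
# random forest grid search fitted in In [16]; the output repeats those results.
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.20366044262401994
0.21296052551540726
{'max_depth': 30, 'max_features': 10, 'min_samples_leaf': 14, 'n_estimators': 75}


In [29]:
# `gs` is still the random forest search, so these are random forest errors, not SVR errors.
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))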

4.813541070341038
6.126492896134535


### AdaBoost

In [30]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [31]:
model = AdaBoostRegressor()
params = {
    'n_estimators': [45, 47, 50],
    'learning_rate': [.8, .83, .85, .9]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [45, 47, 50], 'learning_rate': [0.8, 0.83, 0.85, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [32]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.09513974878701767
0.09910261670957854
{'learning_rate': 0.8, 'n_estimators': 45}


In [33]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

5.4518419679687105
6.554677662809767


### Gradient Boosting

In [34]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [35]:
model = GradientBoostingRegressor()
params = {
    'n_estimators': [75, 80, 85],
    'learning_rate': [0.15, 0.2, 0.25, 0.3]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [75, 80, 85], 'learning_rate': [0.15, 0.2, 0.25, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.19138794936028716
0.21208954416951695
{'learning_rate': 0.15, 'n_estimators': 75}


In [37]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

4.790338865116227
6.129881916458523


### Principal Component Analysis

In [38]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [39]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

pca = PCA(n_components = 18)
pca.fit(X_train_scaled)

PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [40]:
var_exp = pca.explained_variance_ratio_
print('Explained Variance: ', var_exp)

Explained Variance:  [0.14938207 0.11103121 0.09854372 0.05612772 0.05153472 0.04414462
 0.04293378 0.04189589 0.0406451  0.04035708 0.03985134 0.03902206
 0.03611149 0.03398105 0.03129266 0.02826511 0.02732568 0.02643202]


In [41]:
cum_var_exp = np.cumsum(var_exp)
print('Cumulative Explained Variance: ', cum_var_exp)

Cumulative Explained Variance:  [0.14938207 0.26041329 0.35895701 0.41508473 0.46661945 0.51076407
 0.55369785 0.59559373 0.63623883 0.67659591 0.71644725 0.7554693
 0.7915808  0.82556185 0.85685451 0.88511962 0.91244531 0.93887732]
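
With 18 of the 25 standardized features kept, the components above retain about 93.9% of the variance. Instead of fixing the number of components by hand, `PCA` also accepts a float between 0 and 1 and keeps the fewest components needed to exceed that share of variance. A minimal sketch on synthetic, partly redundant data (the 0.90 threshold and the `*_demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)

# Ten synthetic columns, where the last five are noisy copies of the first five.
X_demo = rng.normal(size=(200, 10))
X_demo[:, 5:] = X_demo[:, :5] + 0.1 * rng.normal(size=(200, 5))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Keep just enough components to explain at least 90% of the variance.
pca_demo = PCA(n_components=0.90, svd_solver='full').fit(X_demo_scaled)
cum_var_demo = np.cumsum(pca_demo.explained_variance_ratio_)

print('components kept:', pca_demo.n_components_)
print('cumulative explained variance:', cum_var_demo[-1])
```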


In [42]:
Z_train = pca.transform(X_train_scaled)
Z_test = pca.transform(X_test_scaled)

In [43]:
model = LinearRegression()

model.fit(Z_train, y_train)
print(model.score(Z_train, y_train))
print(model.score(Z_test, y_test))

0.20408148229245282
0.18592770823650928


In [44]:
predictions = model.predict(Z_test)

In [45]:
mean_absolute_error(y_test, predictions)

4.887510475207698

In [46]:
np.sqrt(mean_squared_error(y_test, predictions))

6.230819396681633

### Neural Network

In [47]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [48]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

In [49]:
model = Sequential()

model.add(Dense(X_train.shape[1], activation = 'relu', input_dim = X_train.shape[1]))
model.add(Dense(5, activation = 'relu'))
model.add(Dense(1, activation=None))

In [50]:
model.compile(loss = 'mean_squared_error', optimizer = 'adam')

In [51]:
model.fit(X_train_scaled, 
          y_train, 
          epochs = 10, 
          batch_size = 2, 
          validation_data = (X_test_scaled, y_test))

Train on 4930 samples, validate on 1644 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x10dd5bba8>

In [52]:
predictions = model.predict(X_test_scaled)

In [53]:
print(mean_absolute_error(y_test, predictions))
print(np.sqrt(mean_squared_error(y_test, predictions)))

4.888175068068388
6.204675665965225
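
To compare the models side by side, the test-set metrics reported in the sections above can be collected into one table. The numbers below are copied from those outputs and rounded to four decimals. The SVR RMSE and the neural network R² were not computed in this notebook, so they are left as NaN. The SVR row uses the untuned model only, because its grid search was not run.

```python
import numpy as np
import pandas as pd

# Test-set metrics copied from the outputs above (rounded to four decimals).
results = pd.DataFrame(
    {
        'R2':   [0.1970, 0.0935, 0.2130, -0.0569, 0.0991, 0.2121, 0.1859, np.nan],
        'RMSE': [6.1884, 6.5750, 6.1265, np.nan, 6.5547, 6.1299, 6.2308, 6.2047],
        'MAE':  [4.8720, 5.1334, 4.8135, 5.3353, 5.4518, 4.7903, 4.8875, 4.8882],
    },
    index=[
        'Linear regression', 'Random forest (default)', 'Random forest (tuned)',
        'SVR (default)', 'AdaBoost', 'Gradient boosting', 'PCA + linear regression',
        'Neural network',
    ],
)

# Lower MAE is better, so the best model comes first.
summary = results.sort_values('MAE')
print(summary)
```

By MAE, gradient boosting and the tuned random forest come out slightly ahead, but the gap to plain linear regression is less than a tenth of a fantasy point.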
