# General Assembly DSI - Denver 2018
## Capstone Project - DFS Model
This is my capstone project for the fifth cohort of General Assembly's [Data Science Immersive](https://generalassemb.ly/education/data-science-immersive) in 2018. I am developing a model to help optimize NFL lineups on the daily fantasy sports platforms [DraftKings](https://www.draftkings.com/) and [FanDuel](https://www.fanduel.com/).

### Problem Statement

Can we build a model that predicts a football player's fantasy performance, use those predictions to estimate the player's value, and combine the model with a daily fantasy strategy to be profitable?

In [1]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

%matplotlib inline



In [2]:
# load data
data = pd.read_csv('../data/modeling_rbs.csv')

In [3]:
data = data.groupby(['Name', 'Year', 'Week', 'Month', 'Team', 'Oppt']).sum()

In [4]:
data.head()

| Name | Year | Week | Month | Team | Oppt | Age | h/a | Favored | Spread | O/U | Temperature | Wind | Carries | Rush_Yds | Rush_TDs | ... | Weather_DOME | Weather_Fog | Weather_Rain | Weather_Rain \| Fog | Weather_Snow | Weather_Snow \| Fog | Weather_Snow \| Freezing Rain | Weather_Sunny | FD salary | FD points |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abdullah, Ameer | 2015 | 2 | September | DET | MIN | 22.099 | 0 | 0 | -2.5 | 44.0 | 70.0 | 12.0 | 8.0 | 46.333333 | 0.333333 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6400.0 | 2.3 |
| Abdullah, Ameer | 2015 | 3 | September | DET | DEN | 22.106 | 1 | 0 | -3.0 | 45.0 | 72.0 | 0.0 | 7.666667 | 41.666667 | 0.333333 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6000.0 | 11.2 |
| Abdullah, Ameer | 2015 | 4 | October | DET | SEA | 22.114 | 0 | 0 | -10.0 | 43.0 | 61.0 | 2.0 | 7.333333 | 23.666667 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6100.0 | 5.4 |
| Abdullah, Ameer | 2015 | 5 | October | DET | ARI | 22.12 | 1 | 0 | -4.5 | 46.0 | 72.0 | 0.0 | 9.0 | 21.666667 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6100.0 | 1.1 |
| Abdullah, Ameer | 2015 | 6 | October | DET | CHI | 22.127 | 1 | 1 | -3.5 | 45.5 | 72.0 | 0.0 | 9.0 | 24.0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5900.0 | 8.4 |


### Modeling
- Linear Regression
- Random Forest
- Support Vector Regression
- Boosting
- PCA
- Neural Net

### Linear Regression

In [5]:
columns = [col for col in data.columns if col != 'FD points']

def linreg(df, features, target = 'FD points'):
    X = df[features]
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

    # first attempt - all features
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    print('=============EVALUATION=============\n')
    print('R2 Training:', model.score(X_train, y_train))
    print('R2 Testing:', model.score(X_test, y_test))

    predictions = model.predict(X_test)

    print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
    print('MAE:', mean_absolute_error(y_test, predictions), '\n')
    print('=============COEFFICIENTS=============\n')
    print('Intercept:', model.intercept_)
    # print each feature name next to its fitted coefficient
    for name, coef in zip(X.columns, model.coef_):
        print(name, coef)
    return model

lr_all_features = linreg(data, columns)


R2 Training: 0.29202997295009725
R2 Testing: 0.31895794910927167
RMSE: 6.273472890532594
MAE: 4.817716067101606 


Intercept: -7.9660028953115045
Age 0.012090830463540385
h/a 0.7056492165069178
Favored 0.49918014941457517
Spread 0.004801744493945273
O/U 0.0787057374499108
Temperature -0.029221087281389172
Wind -0.06247126717026167
Carries 0.14147398113490475
Rush_Yds 0.024128986258513948
Rush_TDs 0.22323631808922761
Receptions -0.056855022932739384
Targets 0.559679892919739
Rec_Yds 0.011068438057728183
Rec_TDs -0.3002818069614838
Opp_Avg_Carries 0.09502522706385733
Opp_Rank 0.05295639825473429
Opp_Avg_TDs_Allowed 0.2892627781673079
Opp_Avg_Yds_Allowed -0.004368493006366178
Weather_DOME -0.6750207928236073
Weather_Fog -2.4719962221717267
Weather_Rain -0.15603564562047867
Weather_Rain | Fog 1.1977752815290368
Weather_Snow 1.0690802909972468
Weather_Snow | Fog 1.6947501973894517
Weather_Snow | Freezing Rain 0.0
Weather_Sunny -0.6585531092998711
FD salary 0.001427960394905597
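
For reference, the two error metrics printed above can be written out by hand. This is a minimal sketch in plain Python with made-up numbers, not the notebook's data; `mae` and `rmse` are illustrative helper names, while the notebook itself uses scikit-learn's `mean_absolute_error` and `mean_squared_error`.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the misses, in fantasy points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but large misses are penalized more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up scores for illustration only (not from the dataset).
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

print(mae(actual, predicted))   # average absolute miss
print(rmse(actual, predicted))  # never smaller than the MAE
```

RMSE is always at least as large as MAE, which is consistent with the 6.27 versus 4.82 reported for the linear model.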


Feature Coefficients that seem logical:
- Increasing Prediction for:
    - Home/Away
    - Favored
    - O/U
    - Carries
    - Rushing Yards
    - Rushing TDs
    - Targets
    - Receiving Yards
    - `Opp_Avg_Carries`
    - `Opp_Avg_TDs_Allowed`

- Decreasing Prediction for:
    - Wind
    - Rain
    - Fog

Feature Coefficients that make no sense to me:
- Increasing Prediction for:
    - Rain & Fog
    - Snow
    - Snow & Fog
- Decreasing Prediction for:
    - Being in a Dome (controlled conditions seem like they should help a player perform better)
    - Sunny weather
    - Receptions
    - Receiving TDs
    - `Opp_Avg_Yds_Allowed`
- Almost no effect from:
    - Age
    - Spread
    - Snow & Freezing Rain (the coefficient is exactly 0.0)
    - FD Salary
        - Since salary is derived from FanDuel's own projections, I expected it to have an impact.
- Hard to read:
    - Opponent Rank (small positive coefficient)
        - The rank column is actually backwards, so defenses get better as ranking improves

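**The most interesting thing to note here is that FD Salary has a tiny coefficient. The features are unscaled, though, and salary is measured in dollars, so a small per-dollar coefficient does not by itself mean that salary has no effect on a player's point production.**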
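
Because the regression was fit on unscaled features, each coefficient is expressed per unit of its own column, and for `FD salary` that unit is one dollar. The short calculation below is only a sketch that reuses the coefficient and two salaries shown earlier in this notebook. It converts the coefficient into points per $1,000 of salary.

```python
# Coefficient copied from the linear regression output above (points per dollar).
salary_coef = 0.001427960394905597

# Two salaries taken from the rows shown by data.head().
high_salary, low_salary = 6400.0, 5900.0

points_per_1000 = salary_coef * 1000
gap_points = salary_coef * (high_salary - low_salary)

print(round(points_per_1000, 2))  # about 1.43 points per $1,000 of salary
print(round(gap_points, 2))       # about 0.71 points for a $500 salary gap
```

Read this way, the salary coefficient is comparable in size to several of the other effects, so comparing raw coefficient magnitudes across columns with different units can be misleading.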

In [None]:
# data.columns

In [None]:
# remove_cols = ['Weather_Fog', 'Weather_Snow | Freezing Rain', 'Opp_Avg_Ints', 'Weather_DOME', 
#                'Weather_Sunny', 'Yards', 'Opp_Avg_Att_Allowed', 'Opp_Avg_Comp_Allowed', 
#                'Opp_Avg_TDs_Allowed', 'Rating', 'FD salary', 'FD points']

# features = [col for col in data.columns if col not in remove_cols]

In [None]:
# lr_partial_features = linreg(data, features)

### Random Forest Regression

In [6]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hypertuning

In [7]:
model = RandomForestRegressor()
cross_val_score(model, X_train, y_train).mean()

0.15084322760957558

In [8]:
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [9]:
model.score(X_test, y_test)

0.23922020577379155

In [10]:
predictions = model.predict(X_test)

In [11]:
mean_absolute_error(y_test, predictions)

5.020835853131749

In [12]:
np.sqrt(mean_squared_error(y_test, predictions))

6.630565144057453

#### With Hypertuning

In [13]:
model = RandomForestRegressor()

params = {
    'max_depth': [10, 20, 30],
    'n_estimators': [70, 75, 80],
    'min_samples_leaf': [13, 14],
    'max_features': [10, 20]
}

gs = GridSearchCV(model, param_grid = params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [10, 20, 30], 'n_estimators': [70, 75, 80], 'min_samples_leaf': [13, 14], 'max_features': [10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
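
To get a feel for how much work this grid search does, the sketch below enumerates the same parameter grid with `itertools.product`. The fold count of 3 is an assumption about the default used when `cv=None` in the scikit-learn version that produced this output. It is not something the notebook states.

```python
from itertools import product

# Same grid as the `params` dict passed to GridSearchCV above.
params = {
    'max_depth': [10, 20, 30],
    'n_estimators': [70, 75, 80],
    'min_samples_leaf': [13, 14],
    'max_features': [10, 20],
}

# Every combination of one value per hyperparameter.
combinations = [dict(zip(params, values)) for values in product(*params.values())]

n_folds = 3  # assumption: default number of folds for cv=None in this version

print(len(combinations))            # 36 candidate settings
print(len(combinations) * n_folds)  # 108 cross-validation fits under that assumption
```

Each extra value added to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.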

In [14]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.28243261432854705
0.33735769281346084
{'max_depth': 20, 'max_features': 10, 'min_samples_leaf': 13, 'n_estimators': 75}


In [15]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

4.751160671130208
6.188147283497317


In [16]:
dict(zip(X.columns, gs.best_estimator_.feature_importances_))

{'Age': 0.03641318293090595,
 'h/a': 0.008754138473086578,
 'Favored': 0.005458367376560537,
 'Spread': 0.01816486045714805,
 'O/U': 0.022167908963627806,
 'Temperature': 0.029055020947359556,
 'Wind': 0.012898760728738863,
 'Carries': 0.14052952586080442,
 'Rush_Yds': 0.15018516487119118,
 'Rush_TDs': 0.014889782501167176,
 'Receptions': 0.05083245223377397,
 'Targets': 0.054075007251543815,
 'Rec_Yds': 0.04290725822313163,
 'Rec_TDs': 0.002537922618368954,
 'Opp_Avg_Carries': 0.02553938439157167,
 'Opp_Rank': 0.0360837435667111,
 'Opp_Avg_TDs_Allowed': 0.011126457570978627,
 'Opp_Avg_Yds_Allowed': 0.030835965692564336,
 'Weather_DOME': 0.0013610724699165302,
 'Weather_Fog': 0.0,
 'Weather_Rain': 0.0,
 'Weather_Rain | Fog': 0.0,
 'Weather_Snow': 0.0,
 'Weather_Snow | Fog': 0.0,
 'Weather_Snow | Freezing Rain': 0.0,
 'Weather_Sunny': 0.003387327659434226,
 'FD salary': 0.30279669521141506}

> Now this is telling me that `FD Salary` is far and away the most important feature
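
Sorting the importances makes the ranking easier to read than the raw dictionary. The sketch below copies a handful of values from the output above and is only meant to illustrate the idea. In the notebook the same sort could be applied to the full `dict(zip(...))` result.

```python
# A few importances copied from the dictionary output above.
importances = {
    'FD salary': 0.30279669521141506,
    'Rush_Yds': 0.15018516487119118,
    'Carries': 0.14052952586080442,
    'Targets': 0.054075007251543815,
    'Receptions': 0.05083245223377397,
    'Weather_Fog': 0.0,
}

# Largest importance first.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f'{name:<12} {score:.3f}')
```

The importances across all 27 features sum to 1, so `FD salary` accounts for roughly 30% of the total on its own.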

### Support Vector Regression

In [17]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### No Hypertuning

In [18]:
model = SVR()

In [19]:
model.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [20]:
model.score(X_train, y_train)

0.11852806948272077

In [21]:
model.score(X_test, y_test)

-0.053384621940437915
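
A negative test score means the model does worse than simply predicting the average of the test targets for every player. The sketch below computes R² by hand on made-up numbers to show how that happens. `r2_score_manual` is an illustrative helper, not part of the notebook.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up scores for illustration only (not from the dataset).
actual = [2.0, 11.0, 5.0, 1.0, 8.0]
mean_guess = [sum(actual) / len(actual)] * len(actual)  # always predict the average
poor_guess = [9.0, 3.0, 9.0, 8.0, 2.0]                  # predictions that miss badly

print(r2_score_manual(actual, mean_guess))      # 0.0: no better than the average
print(r2_score_manual(actual, poor_guess) < 0)  # True: worse than the average
```

One possible contributor here is that the SVR was fit on unscaled features, and RBF kernels are sensitive to feature scale. That is a hypothesis, not something the notebook tests.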

In [22]:
y_pred = model.predict(X_test)

In [23]:
mean_absolute_error(y_test, y_pred)

5.880975597911216

#### Hypertuning

In [24]:
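# NOTE: this grid search is commented out, so `gs` in the next two cells is still
# the random forest search from In [13]. The scores, parameters, and errors printed
# below are the random forest results repeated, not SVR results.
# model = SVR()
# params = {
#     'kernel': ['linear'],
#     'C': [0.1, 0.01]
# }

# gs = GridSearchCV(model, param_grid = params)
# gs.fit(X_train, y_train)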

In [25]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.28243261432854705
0.33735769281346084
{'max_depth': 20, 'max_features': 10, 'min_samples_leaf': 13, 'n_estimators': 75}


In [26]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

4.751160671130208
6.188147283497317


### AdaBoost

In [27]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [28]:
model = AdaBoostRegressor()
params = {
    'n_estimators': [45, 47, 50],
    'learning_rate': [.8, .83, .85, .9]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [45, 47, 50], 'learning_rate': [0.8, 0.83, 0.85, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [29]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.11312432300370741
0.10814891154009498
{'learning_rate': 0.8, 'n_estimators': 47}


In [30]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

6.073499937460125
7.179054244614785


### Gradient Boosting

In [31]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [32]:
model = GradientBoostingRegressor()
params = {
    'n_estimators': [75, 80, 85],
    'learning_rate': [0.15, 0.2, 0.25, 0.3]
}

gs = GridSearchCV(model, param_grid=params)
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [75, 80, 85], 'learning_rate': [0.15, 0.2, 0.25, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [33]:
print(gs.best_score_)
print(gs.score(X_test, y_test))
print(gs.best_params_)

0.24439093445853657
0.32038002434170976
{'learning_rate': 0.15, 'n_estimators': 75}


In [34]:
y_pred = gs.best_estimator_.predict(X_test)
print(mean_absolute_error(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))

4.787611162619803
6.266919688409846


### Principal Component Analysis

In [35]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [36]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

pca = PCA(n_components = 18)
pca.fit(X_train_scaled)

PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [37]:
var_exp = pca.explained_variance_ratio_
print('Explained Variance: ', var_exp)

Explained Variance:  [0.14314189 0.0961016  0.09242599 0.07789782 0.05272022 0.0512491
 0.04190068 0.04081237 0.0400064  0.03905669 0.03850776 0.03751357
 0.03603068 0.0315258  0.03060055 0.02727189 0.02590727 0.02492588]


In [38]:
cum_var_exp = np.cumsum(var_exp)
print('Cumulative Explained Variance: ', cum_var_exp)

Cumulative Explained Variance:  [0.14314189 0.23924349 0.33166948 0.4095673  0.46228752 0.51353663
 0.55543731 0.59624968 0.63625608 0.67531276 0.71382053 0.7513341
 0.78736478 0.81889058 0.84949113 0.87676302 0.90267028 0.92759616]
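
The cumulative curve can be used to pick how many components to keep for a target share of variance. The sketch below reuses the cumulative values printed above. The thresholds (80%, 90%, 95%) and the helper name `components_needed` are illustrative choices, not ones made in the notebook.

```python
# Cumulative explained variance copied from the output above (18 components).
cum_var_exp = [
    0.14314189, 0.23924349, 0.33166948, 0.4095673, 0.46228752, 0.51353663,
    0.55543731, 0.59624968, 0.63625608, 0.67531276, 0.71382053, 0.7513341,
    0.78736478, 0.81889058, 0.84949113, 0.87676302, 0.90267028, 0.92759616,
]

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for n_components, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return n_components
    return None  # threshold not reached with the components that were fit

print(components_needed(cum_var_exp, 0.80))  # 14
print(components_needed(cum_var_exp, 0.90))  # 17
print(components_needed(cum_var_exp, 0.95))  # None
```

With 27 input features, the 18 components kept here retain about 93% of the variance, so the reduction in dimensionality is modest.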


In [39]:
Z_train = pca.transform(X_train_scaled)
Z_test = pca.transform(X_test_scaled)

In [40]:
model = LinearRegression()

model.fit(Z_train, y_train)
print(model.score(Z_train, y_train))
print(model.score(Z_test, y_test))

0.2764091046941143
0.31441705769780465


In [41]:
predictions = model.predict(Z_test)

In [42]:
mean_absolute_error(y_test, predictions)

4.802210354606487

In [43]:
np.sqrt(mean_squared_error(y_test, predictions))

6.294352535035272

### Neural Network

In [44]:
features = [col for col in data.columns if col != 'FD points']

X = data[features]
y = data['FD points']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [45]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

In [46]:
model = Sequential()

model.add(Dense(29, activation = 'relu', input_dim = X_train.shape[1]))
model.add(Dense(5, activation = 'relu'))
model.add(Dense(1, activation=None))
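
As a quick sanity check on the size of this network, the sketch below counts its trainable parameters by hand. It assumes 27 input features, which matches the number of coefficients listed in the linear regression output. `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 27           # assumption: one input per feature column
layer_units = [29, 5, 1]  # the three Dense layers defined above

total = 0
n_inputs = n_features
for n_units in layer_units:
    total += dense_params(n_inputs, n_units)
    n_inputs = n_units  # this layer's outputs feed the next layer

print(total)  # 968 trainable parameters
```

In the notebook itself, `model.summary()` reports the same kind of breakdown per layer.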

In [47]:
model.compile(loss = 'mean_squared_error', optimizer = 'adam')

In [48]:
model.fit(X_train_scaled, 
          y_train, 
          epochs = 10, 
          batch_size = 2, 
          validation_data = (X_test_scaled, y_test))

Train on 2777 samples, validate on 926 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a2ba8f320>

In [49]:
predictions = model.predict(X_test_scaled)

In [50]:
print(mean_absolute_error(y_test, predictions))
print(np.sqrt(mean_squared_error(y_test, predictions)))

4.868240097557751
6.361552406344468
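
The test-set errors reported throughout this notebook can be collected in one place for comparison. The sketch below copies the MAE and RMSE values from the outputs above, rounded to four decimals. SVR is left out because only its MAE was reported, and the cells under the SVR tuning heading repeat the random forest output.

```python
# Test-set (MAE, RMSE) copied from the outputs in this notebook.
results = {
    'Linear regression': (4.8177, 6.2735),
    'Random forest (default)': (5.0208, 6.6306),
    'Random forest (tuned)': (4.7512, 6.1881),
    'AdaBoost': (6.0735, 7.1791),
    'Gradient boosting': (4.7876, 6.2669),
    'PCA + linear regression': (4.8022, 6.2944),
    'Neural network': (4.8682, 6.3616),
}

# Lowest RMSE first.
ranked = sorted(results.items(), key=lambda item: item[1][1])
for name, (mae, rmse) in ranked:
    print(f'{name:<26} MAE {mae:.2f}  RMSE {rmse:.2f}')
```

The tuned random forest has the lowest error on this split, but most models land within a few tenths of a point of each other, so the ordering may not hold on a different split.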
