Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with any dataset you've worked with previously instead.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
- [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
- [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
- [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
- _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
- [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
- _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [5]:
import pandas as pd
df0 = pd.read_csv('https://socratesdidnothingwrong.com/nfl/qbcf/master.txt')

def quickpeek(df):
  print(df.shape)
  print(df['player'].nunique(), 'QBs')
  display(df.head())

quickpeek(df0)

(10076, 19)
262 QBs


Unnamed: 0,player,date,team,home,opp,game,week,day,completions,passatt,passyards,passtds,ints,sacks,sackyards,rushatt,rushyards,rushtds,fumbles
0,Geno Smith,2013-12-01,NYJ,1,MIA,12,13,Sun,4,10,29,0,1,1.0,8.0,1,2,0,0
1,Ryan Tannehill,2013-12-01,MIA,0,NYJ,12,13,Sun,28,43,331,2,1,1.0,3.0,3,22,0,0
2,Brandon Weeden,2013-12-01,CLE,1,JAX,12,13,Sun,24,40,370,3,2,3.0,28.0,2,5,0,2
3,Joe Flacco,2013-11-28,BAL,1,PIT,12,13,Thu,24,35,251,1,0,2.0,14.0,4,7,0,1
4,Matt Flynn,2013-11-28,GNB,0,DET,12,13,Thu,10,20,139,0,1,7.0,37.0,2,4,0,2


In [37]:
# quickly engineer the "season" feature,
# which we will use for our train/test splitting
# dates are ISO formatted (YYYY-MM-DD), so no format hint is needed
df0['date'] = pd.to_datetime(df0['date'])
df0['season'] = df0['date'].apply(lambda x: x.year if x.month > 6 else x.year - 1)
quickpeek(df0)

(10076, 20)
262 QBs


Unnamed: 0,player,date,team,home,opp,game,week,day,completions,passatt,passyards,passtds,ints,sacks,sackyards,rushatt,rushyards,rushtds,fumbles,season
0,Geno Smith,2013-12-01,NYJ,1,MIA,12,13,Sun,4,10,29,0,1,1.0,8.0,1,2,0,0,2013
1,Ryan Tannehill,2013-12-01,MIA,0,NYJ,12,13,Sun,28,43,331,2,1,1.0,3.0,3,22,0,0,2013
2,Brandon Weeden,2013-12-01,CLE,1,JAX,12,13,Sun,24,40,370,3,2,3.0,28.0,2,5,0,2,2013
3,Joe Flacco,2013-11-28,BAL,1,PIT,12,13,Thu,24,35,251,1,0,2.0,14.0,4,7,0,1,2013
4,Matt Flynn,2013-11-28,GNB,0,DET,12,13,Thu,10,20,139,0,1,7.0,37.0,2,4,0,2,2013
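
The season rule above assigns games played in January through June to the previous calendar year, since an NFL season runs across the new year. As a small illustration, here is a sketch that applies the same rule to a few hypothetical dates (not rows from the dataset) and compares it with a vectorized equivalent. The names `dates`, `season_apply`, and `season_vec` are made up for this example.

```python
import pandas as pd

# Hypothetical sample dates, chosen to straddle a new year
dates = pd.Series(pd.to_datetime(
    ['2013-09-08', '2013-12-01', '2014-02-02', '2014-09-07']))

# Row-wise rule, same logic as the notebook cell
season_apply = dates.apply(lambda x: x.year if x.month > 6 else x.year - 1)

# Vectorized equivalent: subtract 1 when the month is June or earlier
season_vec = dates.dt.year - (dates.dt.month <= 6).astype(int)

print(season_vec.tolist())  # [2013, 2013, 2013, 2014]
```

On a frame of about 10,000 rows either form is fast enough; the vectorized version mainly avoids a Python-level loop.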


In [53]:
df0.isnull().sum()

player           0
date             0
team             0
home             0
opp              0
game             0
week             0
day              0
completions      0
passatt          0
passyards        0
passtds          0
ints             0
sacks           69
sackyards      288
rushatt          0
rushyards        0
rushtds          0
fumbles          0
season           0
dtype: int64

In [0]:
# Sanitize the data and
# engineer some features

def engineer(df, fn):
  return df.apply(fn, axis=1)

# clamp a single rating component to the range [0, 2.375]
def ratingscreen(val):
  if val > 2.375: return 2.375
  if val < 0: return 0
  return val

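# NFL passer rating: four components (completion rate, yards per attempt,
# TD rate, INT rate), each clamped by `ratingscreen`, averaged and scaled
def extractpasserrating(row):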
  COMP = row['completions']
  ATT = row['passatt']
  YDS = row['passyards']
  TD = row['passtds']
  INT = row['ints']
  a = ratingscreen(((COMP / ATT) - 0.3) * 5)
  b = ratingscreen(((YDS / ATT) - 3) * 0.25)
  c = ratingscreen((TD / ATT) * 20)
  d = ratingscreen(2.375 - ((INT / ATT) * 25))
  return ((a + b + c + d) / 6) * 100

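# rushing analogue of the passer rating, built from three clamped
# components (yards per carry, TD rate, fumble rate)
def extractrusherrating(row):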
  ATT = row['rushatt']
  YDS = row['rushyards']
  TD = row['rushtds']
  FUM = row['fumbles']
  if ATT == 0: return 0
  a = ratingscreen(((YDS / ATT) - 3) * 0.25)
  b = ratingscreen((TD / ATT) * 20)
  c = ratingscreen(2.375 - ((FUM / ATT) * 25))
  return ((a + b + c) / 6) * 100

def wrangle(X):
  X = X.copy()

  # remove "QBs" with no passes
  X = X[X['passatt'] > 0]

  # fix missing `sacks`/`sackyards` data
  X['sacks'] = X['sacks'].fillna(0)
  X['sackyards'] = X['sackyards'].fillna(0)

  # The classes (players) are severely imbalanced, so keep only
  # players with at least 16 games in the data (about one full season)
  vc = X['player'].value_counts()
  X = X[X['player'].isin(vc[vc >= 16].index)]

  # engineer some metadata
  # (`postseason` is not listed in `keepem` below, so it is dropped again)
  X['postseason'] = (X['week'] > 17).astype(int)

  # engineer some stats (vectorized where the formula is a simple expression)
  X['cmp%'] = X['completions'] / X['passatt']
  X['ny/a'] = (X['passyards'] - X['sackyards']) / (X['passatt'] + X['sacks'])
  X['passrate'] = engineer(X, extractpasserrating)
  # rushing efficiency is defined as 0 when there were no rush attempts
  X['rusheff'] = (X['rushyards'] / X['rushatt']).where(X['rushatt'] != 0, 0)
  X['rushrate'] = engineer(X, extractrusherrating)
  X['volume'] = X['passyards'] + X['rushyards'] - X['sackyards']

  # reorganize
  keepem = ['cmp%', 'ny/a', 'passrate', 'rusheff', 'rushrate', 'volume',
            'season', 'player']
  X = X[keepem]

  return X
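
To make the passer rating formula concrete, here is a worked example using Ryan Tannehill's line from the table above (28 completions on 43 attempts, 331 yards, 2 touchdowns, 1 interception). This is a standalone sketch: `clamp` and `passer_rating` are illustrative names that mirror `ratingscreen` and `extractpasserrating` but take plain numbers instead of a DataFrame row.

```python
def clamp(val, lo=0.0, hi=2.375):
    """Clamp one rating component to [0, 2.375], like `ratingscreen`."""
    return max(lo, min(hi, val))

def passer_rating(comp, att, yds, td, ints):
    """Same arithmetic as `extractpasserrating`, on scalar inputs."""
    a = clamp((comp / att - 0.3) * 5)      # completion percentage
    b = clamp((yds / att - 3) * 0.25)      # yards per attempt
    c = clamp((td / att) * 20)             # touchdown rate
    d = clamp(2.375 - (ints / att) * 25)   # interception rate
    return (a + b + c + d) / 6 * 100

tannehill = passer_rating(comp=28, att=43, yds=331, td=2, ints=1)
print(round(tannehill, 1))  # 94.2

# When every component hits the 2.375 cap, the rating tops out near 158.3
perfect = passer_rating(comp=20, att=20, yds=400, td=5, ints=0)
print(round(perfect, 1))  # 158.3
```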

In [0]:
# Split into train & test
# We'd like to use this to predict future games,
# so we'll test on the last few seasons, 2017-2019

df_w = wrangle(df0)
df_train = df_w[df_w['season'] < 2017].drop(columns='season')
df_test = df_w[df_w['season'] >= 2017].drop(columns='season')
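
Because the target here is `player`, a split by season can leave players in the test set who never appear in training, and a classifier cannot predict a label it has not seen. The sketch below shows one way to check for this on a hypothetical toy frame. The `games` data and the names `unseen` and `coverage` are invented for illustration; on the real data the same two lines could be run against `df_train` and `df_test`.

```python
import pandas as pd

# Hypothetical toy data: one row per game
games = pd.DataFrame({
    'player': ['A', 'A', 'B', 'B', 'C', 'A', 'C', 'D'],
    'season': [2015, 2016, 2015, 2016, 2016, 2017, 2018, 2018],
})

train = games[games['season'] < 2017]
test = games[games['season'] >= 2017]

# Players that appear in the test seasons but never in training
unseen = sorted(set(test['player']) - set(train['player']))

# Share of test rows whose label exists in training; this is an
# upper bound on the accuracy any classifier could reach
coverage = test['player'].isin(train['player']).mean()

print(unseen)    # ['D']
print(coverage)
```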

In [0]:
# Arrange feature & target matrices

y_col = 'player'
X_cols = df_train.columns.drop(y_col)

X_train = df_train[X_cols]
X_test = df_test[X_cols]
y_train = df_train[y_col]
y_test = df_test[y_col]

In [100]:
# Get our "baseline"

from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy='most_frequent')
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.03491078355314197
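
For intuition about what that number means: with `strategy='most_frequent'`, the dummy model always predicts the most common class in the training labels, so its test accuracy is simply the share of test rows carrying that label. A minimal sketch on made-up labels (the arrays below are hypothetical, not the QB data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 'A' is the most frequent class in training
y_tr = np.array(['A', 'A', 'A', 'B', 'C'])
y_te = np.array(['A', 'B', 'B', 'C'])

# The dummy model ignores the features, so placeholders are enough
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
score = dummy.score(X_te, y_te)

# Same number computed by hand: fraction of test rows labeled 'A'
manual = np.mean(y_te == 'A')

print(score, manual)  # 0.25 0.25
```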

In [104]:
# Build the "actual" model
from xgboost import XGBClassifier
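# `eval_metric` is set on the estimator here; newer xgboost releases
# expect it in the constructor rather than as a `fit` argument
model = XGBClassifier(n_estimators=100, max_depth=5, random_state=143,
                      eval_metric='merror')
model.fit(X_train, y_train)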
model.score(X_test, y_test)

0.027152831652443754
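
The assignment also asks for permutation importances. Below is a minimal sketch of the idea using scikit-learn's `permutation_importance`. To keep it self-contained it uses synthetic data and scikit-learn's own `GradientBoostingClassifier` instead of the xgboost model above; the feature names and data are assumptions made up for the example. Since `XGBClassifier` follows the scikit-learn estimator interface, it should be usable in the same call, though that is not verified here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only the first column determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ['signal', 'noise_1', 'noise_2']  # hypothetical names

# Hold out a validation set; importances are measured on unseen rows
X_fit, X_val = X[:300], X[300:]
y_fit, y_val = y[:300], y[300:]

gb = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb.fit(X_fit, y_fit)

# Shuffle each column in turn and record how much accuracy drops
result = permutation_importance(
    gb, X_val, y_val, scoring='accuracy', n_repeats=10, random_state=0)

for name, mean, std in zip(feature_names,
                           result.importances_mean,
                           result.importances_std):
    print(f'{name:8s} {mean:.3f} +/- {std:.3f}')
```

A feature whose shuffling barely changes the score contributes little to the model's predictions; here the two noise columns should land near zero while `signal` accounts for most of the accuracy.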