<a href="https://colab.research.google.com/github/DavidVollendroff/DS-Unit-2-Applied-Modeling/blob/master/module3/LS_DS_233_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.
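
For a classification target, one common baseline for the "Does it beat your baseline?" task is the accuracy of always predicting the most frequent class. A minimal sketch, assuming a binary target; the label array is made-up illustration data and `majority_baseline_accuracy` is a hypothetical helper name:

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    values, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

# hypothetical, imbalanced labels: 2 positives out of 20 rows
y_example = np.array([0] * 18 + [1] * 2)
baseline = majority_baseline_accuracy(y_example)
print(f'majority-class baseline accuracy: {baseline:.2f}')
```

On an imbalanced target, a model has to beat this number before its accuracy says much.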

Try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead, and you may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [0]:
import pandas as pd
import numpy as np

In [0]:
pd.options.display.max_rows = 500

In [0]:
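seasons = list(range(1990, 2021))  # season end years, 1990 through 2020
data = {}
stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_advanced.html'
for season in seasons:
  # read_html returns a list of tables; the first one is used here
  stats = pd.read_html(stats_url.format(season))
  # keep centers only and drop columns with fewer than 10 non-null values
  centers = stats[0][stats[0]['Pos'] == 'C'].dropna(axis=1, thresh=10)
  # key each season like '1989-90'
  data['{}-{}'.format(season - 1, str(season)[-2:])] = centers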
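
The dictionary keys built in the loop above use the same 'YYYY-YY' season format as the index of the All-NBA table, which is what allows the two sources to be matched by season later on. A small sketch of that formatting as a standalone helper (the name `season_label` is made up for illustration):

```python
def season_label(end_year):
    """Format a season's end year as 'YYYY-YY', e.g. 1990 -> '1989-90'."""
    return '{}-{}'.format(end_year - 1, str(end_year)[-2:])

print(season_label(1990))  # 1989-90
print(season_label(2000))  # 1999-00
print(season_label(2020))  # 2019-20
```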

In [0]:
# All-NBA Awards List
url = 'https://www.basketball-reference.com/awards/all_league.html'
all_nba = pd.read_html(url) # pulls from the web
all_nba = all_nba[0] # removes from the list
all_nba = all_nba.dropna() # drops spacer rows
all_nba = all_nba[['Season', 'Unnamed: 3']] # selects only the Centers
all_nba = all_nba.set_index('Season') # removes useless index

In [0]:
team_abr = ['OKC', 'MIA', 'MIN', 'SAS', 'BRK', 'IND', 'NOP', 'CHI',
       'BOS', 'GSW', 'ORL', 'HOU', 'LAL', 'UTA', 'SAC', 'PHO', 'POR',
       'MEM', 'ATL', 'DET', 'PHI', 'CLE', 'WAS', 'LAC', 'MIL', 'NYK',
       'CHO', 'DEN', 'DAL', 'TOR', 'CHA', 'NOH', 'NJN', 'SEA', 'NOK',
       'CHH', 'VAN', 'WSB']
team_cities = ['Oklahoma City', 'Miami', 'Minnesota', 'San Antonio', 'Brooklyn', 'Indiana',
            'New Orleans', 'Chicago', 'Boston', 'Golden State', 'Orlando', 'Houston',
            'Los Angeles Lakers', 'Utah', 'Sacramento', 'Phoenix', 'Portland', 'Memphis',
            'Atlanta', 'Detroit', 'Philadelphia', 'Cleveland', 'Washington', 'Los Angeles Clippers',
            'Milwaukee', 'New York', 'Charlotte', 'Denver', 'Dallas', 'Toronto', 'Charlotte',
            'New Orleans', 'New Jersey', 'Seattle', 'New Orleans', 'Charlotte', 'Vancouver',
            'Washington']

abr_city = dict(zip(team_abr, team_cities))

In [0]:
def trim(some_string):
  return some_string[:-2]
all_nba = all_nba['Unnamed: 3'].apply(trim)

In [0]:
def hof_remover(some_string):
  # strip the trailing '*' that marks Hall of Fame players
  if some_string.endswith('*'):
    return some_string[:-1]
  return some_string

for key in data:
  data[key]['Player'] = data[key]['Player'].apply(hof_remover)

In [0]:
modern_all_nba = pd.DataFrame(all_nba[all_nba.index > '1988-89'])

In [0]:
awards = np.ones(len(modern_all_nba))

In [0]:
modern_all_nba['awards'] = awards
modern_all_nba.rename(columns={'Unnamed: 3': 'Player'}, inplace=True)

In [0]:
merged_data = {}
for key in modern_all_nba.index.unique():
  merged_data[key] = pd.merge(data[key], modern_all_nba[modern_all_nba.index==key], how='left', on='Player')
  merged_data[key].fillna(0, inplace=True)
  merged_data[key]['Season'] = key  # the scalar is broadcast to every row
  merged_data[key]['team_city'] = merged_data[key]['Tm'].map(abr_city)
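
The loop above labels each season by left-merging the award winners onto the stats table and filling the gaps with 0. A minimal sketch of that pattern on made-up data. Unlike the cell above, it fills only the `awards` column, which is one possible variation that leaves missing stats as NaN for a later imputer; the frames and player names are hypothetical:

```python
import pandas as pd

# hypothetical season stats: one row per player
stats = pd.DataFrame({'Player': ['A', 'B', 'C'],
                      'WS': [10.1, 3.2, None]})
# hypothetical award winners for the same season
winners = pd.DataFrame({'Player': ['A'], 'awards': [1.0]})

# a left merge keeps every stats row; players without an award get NaN
labeled = pd.merge(stats, winners, how='left', on='Player')
# fill only the label column, so missing stats stay visible as NaN
labeled['awards'] = labeled['awards'].fillna(0)
print(labeled)
```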

In [0]:
latest_seasons = ['2018-19', '2019-20']
df_list = []
for key in merged_data.keys():
  if key not in latest_seasons:
    df_list.append(merged_data[key])
train = pd.concat(df_list, axis=0)
validate = merged_data[latest_seasons[0]]
test = data[latest_seasons[1]]

In [0]:
target = 'awards'

features = train.columns.tolist()

# columns excluded from the feature matrix
removed_features = ['Player',
                    'Pos',
                    'Rk',
                    'awards',
                    'Season',
                    'team_city'
                    ]

features = [col for col in features if col not in removed_features]

In [0]:
# features matrix and target vector 
X_train = train[features]
y_train = train[target]
X_val = validate[features]
y_val = validate[target]
X_test = test[features]

In [15]:
!pip install category-encoders



In [0]:
# import all necessary functions for sklearn pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [0]:
# create pipeline
pipe = make_pipeline(ce.OrdinalEncoder(),  # encodes string-valued columns such as Tm as integers
                     RandomForestClassifier(max_depth=None, n_estimators=1200))
pipe.fit(X_train, y_train);

In [18]:
pipe.score(X_val, y_val)

0.9583333333333334

In [19]:
for key in merged_data.keys():
  my_bool = '3PAr' in data[key].columns.tolist()
  my_list = data[key].columns.tolist()
  print(key, len(my_list), my_bool)

2018-19 27 True
2017-18 27 True
2016-17 27 True
2015-16 27 True
2014-15 27 True
2013-14 27 True
2012-13 27 True
2011-12 27 True
2010-11 27 True
2009-10 27 True
2008-09 27 True
2007-08 27 True
2006-07 27 True
2005-06 27 True
2004-05 27 True
2003-04 27 True
2002-03 27 True
2001-02 27 True
2000-01 27 True
1999-00 27 True
1998-99 27 True
1997-98 27 True
1996-97 27 True
1995-96 27 True
1994-95 27 True
1993-94 27 True
1992-93 27 True
1991-92 27 True
1990-91 27 True
1989-90 27 True


In [0]:
encoder = ce.OrdinalEncoder()
imputer = SimpleImputer()
transformer = make_pipeline(encoder, imputer)
X_train_transformed = transformer.fit_transform(X_train)
# reuse the encoder and imputer fitted on the training data; do not refit on validation data
X_val_transformed = transformer.transform(X_val)
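
An encoder learns its category-to-integer mapping during fitting, so the codes only stay meaningful to a model if the mapping learned on the training split is reused on every other split. A toy sketch of the idea with a hand-written ordinal encoder (`fit_ordinal` and `apply_ordinal` are illustrative helpers, not part of `category_encoders`, and the team lists are made up):

```python
def fit_ordinal(values):
    """Learn a category -> integer mapping in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping

def apply_ordinal(values, mapping, unknown=-1):
    """Encode values with a previously learned mapping."""
    return [mapping.get(value, unknown) for value in values]

train_teams = ['UTA', 'DET', 'ORL', 'DET']   # made-up training column
val_teams = ['DET', 'UTA']                   # made-up validation column

mapping = fit_ordinal(train_teams)
print(apply_ordinal(val_teams, mapping))                  # [2, 1]
print(apply_ordinal(val_teams, fit_ordinal(val_teams)))   # [1, 2]
```

The second call refits on the validation column and assigns different codes to the same teams, so a model trained on the first mapping would read them as different categories.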

In [21]:
model = RandomForestClassifier(max_depth=None, n_estimators=1200)
model.fit(X_train_transformed, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [23]:
!pip install eli5

Collecting eli5
  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
Installing collected packages: eli5
Successfully installed eli5-0.10.1


In [27]:
import eli5
from eli5.sklearn import PermutationImportance

# 1. Calculate permutation importances
permuter = PermutationImportance(
    model, 
    scoring='accuracy', 
    n_iter=5, 
    random_state=42
)

# fitted on the training data, so the importances below describe the training set
permuter.fit(X_train_transformed, y_train)

PermutationImportance(cv='prefit',
                      estimator=RandomForestClassifier(bootstrap=True,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                                                       n_estimators=1200,
                                                    

In [28]:
permuter.score(X_val_transformed, y_val)

0.9583333333333334

In [29]:
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

Tm       0.000000
FTr      0.000064
BLK%     0.000128
WS/48    0.000191
ORB%     0.000255
DBPM     0.000319
BPM      0.000319
OBPM     0.000574
3PAr     0.000574
DRB%     0.000765
AST%     0.000829
TRB%     0.000893
G        0.001020
TOV%     0.001148
Age      0.001466
STL%     0.001530
MP       0.001658
PER      0.001785
USG%     0.002295
TS%      0.002423
OWS      0.004144
VORP     0.006120
WS       0.006503
DWS      0.012177
dtype: float64

In [30]:
# 2. Display permutation importances
eli5.show_weights(
    permuter, 
    top=None, # show permutation importances for all features
    feature_names=feature_names # must be a list
)

Weight,Feature
0.0122  ± 0.0005,DWS
0.0065  ± 0.0005,WS
0.0061  ± 0.0014,VORP
0.0041  ± 0.0007,OWS
0.0024  ± 0.0007,TS%
0.0023  ± 0.0006,USG%
0.0018  ± 0.0003,PER
0.0017  ± 0.0003,MP
0.0015  ± 0.0003,STL%
0.0015  ± 0.0003,Age
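
Conceptually, a permutation importance is the drop in score after one feature's values are shuffled, averaged over a few shuffles. A rough numpy sketch of that idea, assuming accuracy as the score. It is not eli5's implementation, and `permutation_importance_sketch`, `predict_demo`, and the data are made up for illustration:

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_iter=5, seed=42):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_score = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_shuffled = X.copy()
            # break the link between this column and the target
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(base_score - np.mean(predict(X_shuffled) == y))
        importances[col] = np.mean(drops)
    return importances

# made-up data: the label depends only on column 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

def predict_demo(X):
    """Stand-in for a fitted model that only looks at column 0."""
    return (X[:, 0] > 0).astype(int)

print(permutation_importance_sketch(predict_demo, X_demo, y_demo))
```

Shuffling the column the stand-in model ignores leaves its predictions unchanged, so that column's importance is zero, while shuffling the informative column costs accuracy.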


In [52]:
from xgboost import XGBClassifier

eval_set = [(X_train_transformed, y_train),
            (X_val_transformed, y_val)]

model = XGBClassifier(n_estimators=1000,
                      max_depth=6,
                      learning_rate=0.1,
                      n_jobs=-1,
                      scale_pos_weight=1)

model.fit(X_train_transformed,
          y_train,
          eval_set=eval_set,
          eval_metric='error',
          early_stopping_rounds=25)
model.predict(X_val_transformed)

[0]	validation_0-error:0.012432	validation_1-error:0.041667
Multiple eval metrics have been passed: 'validation_1-error' will be used for early stopping.

Will train until validation_1-error hasn't improved in 25 rounds.
[1]	validation_0-error:0.012113	validation_1-error:0.05
[2]	validation_0-error:0.01307	validation_1-error:0.05
[3]	validation_0-error:0.011795	validation_1-error:0.041667
[4]	validation_0-error:0.011157	validation_1-error:0.033333
[5]	validation_0-error:0.010201	validation_1-error:0.033333
[6]	validation_0-error:0.01052	validation_1-error:0.033333
[7]	validation_0-error:0.010201	validation_1-error:0.041667
[8]	validation_0-error:0.009882	validation_1-error:0.041667
[9]	validation_0-error:0.009563	validation_1-error:0.05
[10]	validation_0-error:0.008607	validation_1-error:0.041667
[11]	validation_0-error:0.008288	validation_1-error:0.05
[12]	validation_0-error:0.007332	validation_1-error:0.05
[13]	validation_0-error:0.007332	validation_1-error:0.05
[14]	validation_0-err

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.])
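
The `early_stopping_rounds` argument tells training to stop once the monitored validation metric has gone that many rounds without a new best value. A simplified sketch of that rule, not xgboost's own code; `early_stop` is a hypothetical helper, and `patience=5` is chosen only so the short excerpt of the log triggers a stop (the cell above uses 25):

```python
def early_stop(errors, patience):
    """Return (best_round, stop_round) for a sequence of validation errors.

    Training stops once `patience` rounds pass without a new best error.
    stop_round is None if the sequence ends before that happens.
    """
    best_error = float('inf')
    best_round = None
    for round_idx, error in enumerate(errors):
        if error < best_error:
            best_error, best_round = error, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, None

# validation_1-error for rounds 0-13, copied from the log above
val_errors = [0.041667, 0.05, 0.05, 0.041667, 0.033333, 0.033333, 0.033333,
              0.041667, 0.041667, 0.05, 0.041667, 0.05, 0.05, 0.05]

print(early_stop(val_errors, patience=5))  # (4, 9)
```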

In [56]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_val_transformed)
y_pred

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.])

In [58]:
validate[y_pred.astype(bool)]

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,awards,Season,team_city
29,147,Andre Drummond,C,25,DET,79,2647,23.4,0.555,0.036,0.392,16.8,34.7,25.4,7.2,2.5,4.4,12.4,22.9,4.1,5.9,10.0,0.181,-0.7,3.6,2.9,3.3,0.0,2018-19,Detroit
43,187,Rudy Gobert,C,26,UTA,81,2577,24.6,0.682,0.0,0.733,13.2,30.2,21.9,9.6,1.2,5.8,12.1,17.8,8.7,5.7,14.4,0.268,2.0,5.1,7.0,5.9,1.0,2018-19,Utah
106,492,Nikola Vučević,C,28,ORL,80,2510,25.5,0.573,0.171,0.168,9.4,31.9,20.5,21.9,1.6,3.0,9.9,28.0,5.4,4.7,10.1,0.193,3.0,3.4,6.4,5.3,0.0,2018-19,Orlando
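
Of the three centers flagged in the table above, only one has `awards` equal to 1.0, which an accuracy score on an imbalanced target does not make obvious. Precision and recall for the positive class are one way to surface this. A minimal sketch on made-up labels (the arrays are illustrative, not the notebook's predictions, and `precision_recall` is a hypothetical helper):

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    true_pos = np.sum(y_true & y_pred)
    precision = true_pos / y_pred.sum() if y_pred.sum() else 0.0
    recall = true_pos / y_true.sum() if y_true.sum() else 0.0
    return precision, recall

# made-up labels: 3 predicted positives, 1 of them correct, 3 actual positives
y_true_demo = [0, 1, 0, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0]
print(precision_recall(y_true_demo, y_pred_demo))
```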
