Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

# Imports

In [1]:
import pandas as pd
import numpy as np

import pandas_profiling

from functions import wrangle

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV

from category_encoders import OneHotEncoder

from xgboost import XGBRegressor

# Read in data

In [3]:
from pathlib import Path

# The data directory sits next to the notebook's parent folder
data_dir = Path.cwd().parent / "data"

df = pd.read_csv(data_dir / "anime.csv")

# Clean Data

In [4]:
clean = wrangle(df)

# Split X and y
### TODO: consider multiple targets (y's). This requires changing `wrangle` so it stops dropping certain columns; see phone notes.

In [5]:
y = clean['rank'].copy()
X = clean.drop('rank',axis=1).copy()

# Train/test split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2,random_state=42)

# XGBoost model

In [7]:
xg_pipe = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    XGBRegressor(n_jobs=-1, random_state=42),
)

In [8]:
xg_pipe.fit(X_train, y_train);
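For intuition about what the boosted model above does, here is a minimal sketch of gradient boosting for squared error, written with plain scikit-learn decision trees. It is a simplified illustration, not XGBoost's actual algorithm (which adds regularization and other refinements). The synthetic data and the names `X_demo`, `y_demo`, `learning_rate`, and `n_rounds` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only, not the anime data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 100

# Round 0: start from a constant prediction, the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the residuals are the negative gradient
    # (up to a constant factor), so each tree is fit to them
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after boosting:  {final_mse:.4f}")
```

Each round fits a shallow tree to the errors left by the previous rounds, and the learning rate controls how much of each correction is applied.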

In [9]:
baseline = [y_train.mean()]

# Pipeline.score returns R^2 for a regressor
print(f"XGB train R^2: {xg_pipe.score(X_train,y_train)}")
print(f"XGB test R^2: {xg_pipe.score(X_test,y_test)}")
print()
print(f"Base MAE: {mean_absolute_error(y_train,baseline*len(y_train))}\n")
print(f"Train MAE: {mean_absolute_error(y_train,xg_pipe.predict(X_train))}")
print(f"Test MAE: {mean_absolute_error(y_test,xg_pipe.predict(X_test))}")

XGB train R^2: 0.7595123869386505
XGB test R^2: 0.5994204568174826

Base MAE: 3210.093119097792

Train MAE: 1366.8743007766166
Test MAE: 1787.1004934340785
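The mean baseline above is built by repeating `y_train.mean()` in a list. As a hedged alternative, scikit-learn's `DummyRegressor` produces the same predictions and plugs into the same metric functions. The sketch below uses synthetic targets, and the names `y_train_demo` and `X_train_demo` are assumptions for this example. It also shows that a mean baseline scores an R^2 of 0 on the data it was fit on, which is the reference point for the R^2 values printed above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic target values (illustrative only)
rng = np.random.default_rng(42)
y_train_demo = rng.normal(loc=5000, scale=3000, size=200)
# DummyRegressor ignores the features, so a placeholder column is enough
X_train_demo = np.zeros((len(y_train_demo), 1))

# Baseline via DummyRegressor: always predicts the training mean
dummy = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
dummy_pred = dummy.predict(X_train_demo)

# Baseline built by hand, as in the notebook cell above
manual_pred = np.full_like(y_train_demo, y_train_demo.mean())

dummy_mae = mean_absolute_error(y_train_demo, dummy_pred)
manual_mae = mean_absolute_error(y_train_demo, manual_pred)
baseline_r2 = r2_score(y_train_demo, dummy_pred)

print(f"Dummy MAE:   {dummy_mae:.2f}")
print(f"Manual MAE:  {manual_mae:.2f}")
print(f"Baseline R^2 on train: {baseline_r2:.6f}")
```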


# XGB Hyperparameter tuning

In [25]:
pipe = make_pipeline(OneHotEncoder(use_cat_names=True),
                    XGBRegressor(n_jobs=-1,random_state=42))

In [45]:
params = {
    'xgbregressor__learning_rate': [.01,.02,.03,.04,.05,.06,.07,.08,.09,.1,],
    'xgbregressor__max_delta_step': [0,1,2,3,4,5,6,7,8,9,10],
    'xgbregressor__booster': ['gbtree','gblinear','dart'],
    'xgbregressor__max_depth': [5,10,30,50,None],
    'xgbregressor__min_child_weight': [1,2,3,4,5,10,15]
}

In [46]:
tune_search = RandomizedSearchCV(
    estimator = pipe,
    param_distributions = params,
    n_iter=500,
    verbose=1,
    n_jobs=8
)

In [47]:
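# Commented out after the initial run to avoid repeating the ~51-minute search;
# the log below is from that run.
#tune_search.fit(X_train,y_train);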

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   22.9s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:  1.4min
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:  3.1min
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed: 10.9min
[Parallel(n_jobs=8)]: Done 1234 tasks      | elapsed: 17.8min
[Parallel(n_jobs=8)]: Done 1784 tasks      | elapsed: 38.5min
[Parallel(n_jobs=8)]: Done 2434 tasks      | elapsed: 49.7min
[Parallel(n_jobs=8)]: Done 2500 out of 2500 | elapsed: 51.1min finished


In [48]:
#tune_search.best_params_

{'xgbregressor__min_child_weight': 10,
 'xgbregressor__max_depth': 10,
 'xgbregressor__max_delta_step': 0,
 'xgbregressor__learning_rate': 0.02,
 'xgbregressor__booster': 'dart'}
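The winning parameters above can be copied by hand into a new pipeline, but a fitted search object also exposes them directly. The sketch below is a small, fast stand-in under stated assumptions: it uses synthetic data, swaps in scikit-learn's `GradientBoostingRegressor` for `XGBRegressor`, and uses `StandardScaler` in place of the encoder. All `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=120)

demo_pipe = make_pipeline(StandardScaler(),
                          GradientBoostingRegressor(random_state=42))

# Step names are the lowercased class names, joined to the parameter by "__"
demo_params = {
    "gradientboostingregressor__learning_rate": [0.05, 0.1],
    "gradientboostingregressor__max_depth": [2, 3],
}

demo_search = RandomizedSearchCV(demo_pipe, demo_params,
                                 n_iter=3, cv=3, random_state=42)
demo_search.fit(X_demo, y_demo)

# Option 1: with the default refit=True, best_estimator_ is already
# refit on all of the data passed to fit
tuned_pipe = demo_search.best_estimator_

# Option 2: apply best_params_ to a fresh pipeline and fit it yourself
rebuilt_pipe = make_pipeline(StandardScaler(),
                             GradientBoostingRegressor(random_state=42))
rebuilt_pipe.set_params(**demo_search.best_params_)
rebuilt_pipe.fit(X_demo, y_demo)

print(demo_search.best_params_)
```

Passing `random_state` to the search makes the sampled candidates reproducible, which helps when a long search has to be rerun.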

In [32]:
pipe = make_pipeline(OneHotEncoder(use_cat_names=True),
                    XGBRegressor(n_jobs=-1,random_state=42,
                                n_estimators=500,min_child_weight=3,
                                max_depth=None,learning_rate=0.1,
                                booster='gbtree',max_delta_step=0))

In [33]:
pipe.fit(X_train,y_train);

In [34]:
baseline = [y_train.mean()]

# The model is XGBoost, not a random forest; Pipeline.score returns R^2
print(f"XGB train R^2: {pipe.score(X_train,y_train)}")
print(f"XGB test R^2: {pipe.score(X_test,y_test)}")
print()
print(f"Base MAE: {mean_absolute_error(y_train,baseline*len(y_train))}\n")
print(f"Train MAE: {mean_absolute_error(y_train,pipe.predict(X_train))}")
print(f"Test MAE: {mean_absolute_error(y_test,pipe.predict(X_test))}")

XGB train R^2: 0.7854710621391039
XGB test R^2: 0.611687962113775

Base MAE: 3210.093119097792

Train MAE: 1279.1128902135588
Test MAE: 1756.676769762928


In [52]:
pipe = make_pipeline(OneHotEncoder(use_cat_names=True),
                    XGBRegressor(n_jobs=-1,random_state=42,
                                n_estimators=500,min_child_weight=10,
                                max_depth=10,learning_rate=0.02,
                                booster='dart',max_delta_step=0))

In [53]:
pipe.fit(X_train,y_train);

In [54]:
baseline = [y_train.mean()]

# The model is XGBoost, not a random forest; Pipeline.score returns R^2
print(f"XGB train R^2: {pipe.score(X_train,y_train)}")
print(f"XGB test R^2: {pipe.score(X_test,y_test)}")
print()
print(f"Base MAE: {mean_absolute_error(y_train,baseline*len(y_train))}\n")
print(f"Train MAE: {mean_absolute_error(y_train,pipe.predict(X_train))}")
print(f"Test MAE: {mean_absolute_error(y_test,pipe.predict(X_test))}")

XGB train R^2: 0.7615501538682983
XGB test R^2: 0.6151730665688578

Base MAE: 3210.093119097792

Train MAE: 1373.4839870986207
Test MAE: 1755.5248652389485


# Feature importance

In [55]:
from sklearn.inspection import permutation_importance

In [57]:
result = permutation_importance(pipe, X_test, y_test, 
                                n_repeats=5, random_state=0)

In [58]:
# Use a separate name so the raw `df` loaded earlier is not overwritten
importances_df = pd.DataFrame({'feature': X_test.columns,
                               'importances_mean': np.round(result['importances_mean'], 3),
                               'importances_std': result['importances_std']})

In [59]:
importances_df.sort_values(by='importances_mean', ascending=False)

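|    | feature        | importances_mean | importances_std |
|----|----------------|------------------|-----------------|
| 8  | num_related    | 0.222            | 0.013685        |
| 6  | rating         | 0.15             | 0.003504        |
| 1  | source         | 0.117            | 0.003138        |
| 5  | duration       | 0.095            | 0.00705         |
| 7  | studio         | 0.048            | 0.004501        |
| 0  | type           | 0.039            | 0.005465        |
| 2  | episodes       | 0.027            | 0.005903        |
| 9  | hentai/romance | 0.025            | 0.003266        |
| 13 | drama          | 0.019            | 0.001964        |
| 14 | fantasy/sci-fi | 0.013            | 0.00236         |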
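To make the numbers above less of a black box, here is a minimal sketch of the idea behind permutation importance: shuffle one column of held-out data at a time and record how much the model's score drops. It is a simplified illustration, not scikit-learn's implementation. The synthetic data, the linear model, and the helper `manual_permutation_importance` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: columns 0 and 1 drive the target, column 2 is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Simple holdout split: fit on the first 400 rows, evaluate on the rest
X_fit, X_held = X_demo[:400], X_demo[400:]
y_fit, y_held = y_demo[:400], y_demo[400:]

model = LinearRegression().fit(X_fit, y_fit)

def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Return the mean and std of the score drop per column (hypothetical helper)."""
    shuffle_rng = np.random.default_rng(seed)
    base_score = model.score(X, y)  # R^2 on the unshuffled data
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link to the target
            X_perm[:, col] = shuffle_rng.permutation(X_perm[:, col])
            drops[col, rep] = base_score - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

imp_mean, imp_std = manual_permutation_importance(model, X_held, y_held)
for col, (m, s) in enumerate(zip(imp_mean, imp_std)):
    print(f"column {col}: {m:.3f} +/- {s:.3f}")
```

A column whose shuffling barely changes the score, like the noise column here, contributes little to the model's predictions on that data.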
