Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
- [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
- [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
- [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
- _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
- [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
- _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [0]:
%%capture
!pip install category_encoders

In [8]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

titanic = sns.load_dataset('titanic')

train, test = train_test_split(titanic, test_size=.2)

features = ['age', 'class', 'deck', 'embarked', 'fare', 'sex']
target = 'survived'

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 6), (179, 6), (712,), (179,))

In [10]:
X_train.isnull().sum()
# age and deck have many nulls; embarked has one

age         149
class         0
deck        556
embarked      1
fare          0
sex           0
dtype: int64

In [14]:
# majority-class baseline: accuracy from always predicting the most common class
max(1-y_train.mean(), y_train.mean())

0.6306179775280899
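
The `max(1 - mean, mean)` trick only works for a 0/1 target. As a minimal sketch, the same baseline can be read off the class shares directly, which also extends to more than two classes. The labels below are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Hypothetical 0/1 labels, for illustration only (not the Titanic target)
y = pd.Series([0, 0, 0, 0, 0, 1, 1, 1])

# Share of each class; the largest share is the majority-class accuracy
class_shares = y.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(majority_class, baseline_accuracy)
```

Any model worth keeping should score above `baseline_accuracy` on held-out data.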

In [0]:
from sklearn.pipeline import Pipeline
import category_encoders as ce
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer

In [0]:
# create base pipeline
pipeline = Pipeline([
                     ('encoder', ce.OrdinalEncoder()),
                     ('model', XGBClassifier())
])

In [13]:
# fit base pipeline 
train_size = .8
cutoff = int(train_size*X_train.shape[0])
small_X_train = X_train[:cutoff]
X_val = X_train[cutoff:]
small_y_train = y_train[:cutoff]
y_val = y_train[cutoff:]

pipeline.fit(small_X_train, small_y_train)
pipeline.score(X_val, y_val)

569


0.7972027972027972
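
The positional slice above is a valid validation split here because `train_test_split` had already shuffled the rows. One alternative, sketched below on made-up arrays, is a second call to `train_test_split` with `stratify`, which keeps the class balance the same in both pieces. The data and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 60% class 0 and 40% class 1
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 30 + [1] * 20)

# Hold out 20% for validation, preserving the 60/40 class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_tr.shape, X_val.shape, y_tr.mean(), y_val.mean())
```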

## The untuned model beats the majority-class baseline by about 17 percentage points (0.797 vs. 0.631)

In [17]:
# now let's tune some hyperparameters
params = {
    'model__n_estimators': [50, 70, 90],
    'model__max_depth': [3, 5]
}

search = GridSearchCV(pipeline, params, n_jobs=-1)
search.fit(X_train, y_train)
print(f"Best params: \n{search.best_params_}")
print(f"Best score: \n{search.best_score_}")

Best params: 
{'model__max_depth': 3, 'model__n_estimators': 90}
Best score: 
0.8215699793164581
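
`best_params_` shows only the winner. To see how close the other candidates were, the full grid is available in `cv_results_`. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without extra installs; the grid values are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Stand-in pipeline: GradientBoostingClassifier instead of XGBClassifier
pipe = Pipeline([('model', GradientBoostingClassifier(random_state=0))])
grid = {'model__n_estimators': [20, 40], 'model__max_depth': [2, 3]}

search = GridSearchCV(pipe, grid, cv=3, scoring='accuracy')
search.fit(X, y)

# One row per parameter combination, best-ranked first
columns = ['param_model__n_estimators', 'param_model__max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(results)
```

If the top few rows differ by less than their `std_test_score`, the choice between them is probably not meaningful.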


In [0]:
pipeline = Pipeline([
                     ('encoder', ce.OrdinalEncoder()),
                     ('model', XGBClassifier(n_estimators=90,
                                             max_depth=3))
])

In [19]:
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

0.8324022346368715

In [28]:
# keep the model and the encoded data separate for permutation importance evaluation
model = XGBClassifier(n_estimators=90, max_depth=3)
transformer = Pipeline([
                        ('encoder', ce.OrdinalEncoder()),
                        ('imputer', SimpleImputer())
])

X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)
model.fit(X_train_transformed, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=90, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [22]:
!pip install eli5



In [29]:
import eli5 
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model,
    scoring="accuracy",
    n_iter=10,
    random_state=42
)

permuter.fit(X_test_transformed, y_test)

PermutationImportance(cv='prefit',
                      estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                              colsample_bylevel=1,
                                              colsample_bynode=1,
                                              colsample_bytree=1, gamma=0,
                                              learning_rate=0.1,
                                              max_delta_step=0, max_depth=3,
                                              min_child_weight=1, missing=None,
                                              n_estimators=90, n_jobs=1,
                                              nthread=None,
                                              objective='binary:logistic',
                                              random_state=0, reg_alpha=0,
                                              reg_lambda=1, scale_pos_weight=1,
                                              seed=None, silent=None,
                      

In [30]:
feature_names = X_train.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

embarked    0.013408
deck        0.029050
fare        0.075419
age         0.077095
class       0.130168
sex         0.237989
dtype: float64
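
Each value above is the average drop in accuracy when one column is shuffled. The sketch below shows the idea from scratch in NumPy. It is not eli5's implementation: the data is synthetic, `predict` is a hypothetical stand-in for a fitted model, and `permutation_importance_sketch` is a name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: column 0 determines the label, column 1 is pure noise
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def predict(X):
    """Hypothetical stand-in for a fitted model: it only looks at column 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def permutation_importance_sketch(predict, X, y, n_iter=10, seed=42):
    """Mean and std of the accuracy drop when each column is shuffled."""
    shuffle_rng = np.random.default_rng(seed)
    base = accuracy(y, predict(X))
    drops = np.zeros((n_iter, X.shape[1]))
    for i in range(n_iter):
        for j in range(X.shape[1]):
            X_shuffled = X.copy()
            # Shuffling breaks the link between column j and the target
            X_shuffled[:, j] = shuffle_rng.permutation(X_shuffled[:, j])
            drops[i, j] = base - accuracy(y, predict(X_shuffled))
    return drops.mean(axis=0), drops.std(axis=0)

mean_drop, std_drop = permutation_importance_sketch(predict, X, y)
print(mean_drop, std_drop)
```

Shuffling the noise column changes nothing because the stand-in model never reads it, while shuffling column 0 pushes accuracy toward chance.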

In [31]:
eli5.show_weights(
    permuter,
    top=None, # includes all features
    feature_names=feature_names
)

| Weight | Feature |
|---|---|
| 0.2380 ± 0.0413 | sex |
| 0.1302 ± 0.0346 | class |
| 0.0771 ± 0.0370 | age |
| 0.0754 ± 0.0475 | fare |
| 0.0291 ± 0.0164 | deck |
| 0.0134 ± 0.0167 | embarked |


In [0]:
# Treat embarked with caution: its reported spread across shuffles (± 0.0167)
# is larger than its importance (0.0134), so its effect may be no different from zero.
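
The same check can be applied to every feature at once. The sketch below copies the weights and ± values from the eli5 table above into a DataFrame and flags any feature whose weight minus its spread falls at or below zero. The column names `spread`, `lower`, and `overlaps_zero` are made up for this example.

```python
import pandas as pd

# Weights and ± values copied from the eli5 output above
table = pd.DataFrame(
    {
        'weight': [0.2380, 0.1302, 0.0771, 0.0754, 0.0291, 0.0134],
        'spread': [0.0413, 0.0346, 0.0370, 0.0475, 0.0164, 0.0167],
    },
    index=['sex', 'class', 'age', 'fare', 'deck', 'embarked'],
)

# A feature whose weight minus spread reaches zero may not matter at all
table['lower'] = table['weight'] - table['spread']
table['overlaps_zero'] = table['lower'] <= 0

print(table)
```

This is a rough screen rather than a formal test; a flagged feature is a candidate for dropping and refitting to see whether the score changes.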