
Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try XGBoost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

If you aren't ready to try XGBoost and permutation importances with your own dataset today, that's okay. You can practice with another dataset instead, and you may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [79]:
!pip install category_encoders==2.*
!pip install eli5



In [80]:
import pandas as pd
import numpy as np
!unzip -o '/content/insurance.zip'  # -o overwrites existing files without prompting
df = pd.read_csv('insurance.csv')

Archive:  /content/insurance.zip

In [0]:
from sklearn.model_selection import train_test_split
train_val, test = train_test_split(df, train_size=.8, test_size=.2, random_state=42)
train, val = train_test_split(train_val, train_size=.8, test_size=.2, random_state=42)
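
Splitting 80/20 twice leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing. A quick sketch on a hypothetical 1,000-row frame (a stand-in, not the insurance data) to check those proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real DataFrame: 1,000 rows, one column.
toy = pd.DataFrame({"x": range(1000)})

toy_train_val, toy_test = train_test_split(
    toy, train_size=.8, test_size=.2, random_state=42)
toy_train, toy_val = train_test_split(
    toy_train_val, train_size=.8, test_size=.2, random_state=42)

# Fraction of the original rows that ends up in each split.
fractions = {name: len(part) / len(toy)
             for name, part in [("train", toy_train),
                                ("val", toy_val),
                                ("test", toy_test)]}
print(fractions)
```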

In [0]:
bmi_bins = [0,20,25,30,40,54]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese', 'Extreme Obesity']

def wrangle(x):
  x = x.copy()
  x['bmi_class'] = pd.cut(x['bmi'], bins=bmi_bins, labels=bmi_labels)
  return x

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
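
`pd.cut` uses right-inclusive intervals by default, so a BMI of exactly 20 lands in `Underweight` (the `(0, 20]` bin), and any value outside the outermost edges becomes `NaN`. A small sketch with made-up BMI values, reusing the bins and labels above:

```python
import pandas as pd

bmi_bins = [0, 20, 25, 30, 40, 54]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese', 'Extreme Obesity']

# Made-up values chosen to hit bin edges and one out-of-range case.
toy_bmi = pd.Series([18.5, 20.0, 24.9, 30.0, 41.0, 60.0])
toy_class = pd.cut(toy_bmi, bins=bmi_bins, labels=bmi_labels)

print(toy_class.tolist())
# 60.0 lies above the last edge (54), so it is left as NaN.
print(toy_class.isna().sum())
```

If future data could contain BMIs above 54, one option is to replace the last edge with `float('inf')` so no row is silently left unlabeled.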

## First model

In [0]:
target = 'charges'
features = train.drop(columns=target).columns

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [0]:
from sklearn.metrics import mean_absolute_error
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [0]:
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    RandomForestRegressor(n_estimators=100, n_jobs=-1)
)
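
Tree-based models split on thresholds, so rescaling a feature generally does not change their predictions. The `StandardScaler` step is harmless here but not required. A minimal sketch on synthetic data, using a single `DecisionTreeRegressor` as a stand-in for the forest (the data and variable names are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic features on different scales (hypothetical data).
X = np.column_stack([rng.normal(40, 14, 200), rng.normal(30, 6, 200)])
y = 250 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 100, 200)

# Same tree settings, fit once on raw features and once on scaled features.
raw_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

X_scaled = StandardScaler().fit_transform(X)
scaled_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# Compare predictions from the two trees on their respective inputs.
same = np.allclose(raw_tree.predict(X), scaled_tree.predict(X_scaled))
print(same)
```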

In [123]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
print(f'MAE: {mae}')

MAE: 2965.167113626636
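
To judge whether a regression model beats a baseline, a common choice is to predict the training-set mean for every row and score that guess with the same metric. A minimal sketch with made-up target values (not the insurance data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets standing in for y_train and y_val.
toy_y_train = np.array([1000.0, 2000.0, 3000.0, 10000.0])
toy_y_val = np.array([1500.0, 2500.0, 9000.0])

# Baseline: guess the training mean for every validation row.
baseline_pred = np.full_like(toy_y_val, toy_y_train.mean())
baseline_mae = mean_absolute_error(toy_y_val, baseline_pred)
print(f'Baseline MAE: {baseline_mae}')
```

Applied to the real `y_train` and `y_val`, the same two lines give the number to compare against the validation MAE above.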


## XGBoost

In [0]:
from xgboost import XGBRegressor

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    XGBRegressor(n_estimators=100, max_depth=5, n_jobs=-1)
)

In [125]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
print(f'MAE: {mae}')

MAE: 2898.7468254663777
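
Gradient boosting builds trees one after another, each fit to the errors left by the ensemble so far. The sketch below illustrates that idea for squared-error loss using scikit-learn's `DecisionTreeRegressor` on synthetic data. It is a simplified teaching version, not XGBoost's actual algorithm, which adds regularization and other refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic one-feature regression problem (hypothetical data).
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.5
n_rounds = 20

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
trees = []
mae_per_round = [np.mean(np.abs(y - prediction))]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a shrunken version of the new tree's output to the ensemble.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mae_per_round.append(np.mean(np.abs(y - prediction)))

print(f'Training MAE before boosting: {mae_per_round[0]:.3f}')
print(f'Training MAE after {n_rounds} rounds: {mae_per_round[-1]:.3f}')
```

Here `learning_rate` and the number of rounds play the same roles as `learning_rate` and `n_estimators` in `XGBRegressor`.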


## Feature Permutations

In [0]:
transformers = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler()
)

In [0]:
X_train_trans = transformers.fit_transform(X_train)
X_val_trans = transformers.transform(X_val)
X_test_trans = transformers.transform(X_test)

In [128]:
model = XGBRegressor(n_estimators=100, max_depth=5, n_jobs=-1)
model.fit(X_train_trans, y_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=5, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=-1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [0]:
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model,
    # scoring='neg_mean_absolute_error',  # optional; 'mae' is not a valid scorer name
    n_iter=5,
    random_state=42
)

In [130]:
permuter.fit(X_val_trans, y_val)

PermutationImportance(cv='prefit',
                      estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                             colsample_bylevel=1,
                                             colsample_bynode=1,
                                             colsample_bytree=1, gamma=0,
                                             importance_type='gain',
                                             learning_rate=0.1,
                                             max_delta_step=0, max_depth=5,
                                             min_child_weight=1, missing=None,
                                             n_estimators=100, n_jobs=-1,
                                             nthread=None,
                                             objective='reg:linear',
                                             random_state=0, reg_alpha=0,
                                             reg_lambda=1, scale_pos_weight=1,
                                      

In [0]:
feature_names = X_val.columns.to_list()
perm_imp = pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)

In [132]:
eli5.show_weights(permuter, top=None,
                  feature_names=feature_names)

| Weight | Feature |
|---|---|
| 1.3070 ± 0.1949 | smoker |
| 0.2320 ± 0.0955 | bmi |
| 0.1512 ± 0.0587 | age |
| 0.0121 ± 0.0086 | children |
| 0.0072 ± 0.0034 | region |
| 0 ± 0.0000 | bmi_class |
| -0.0003 ± 0.0021 | sex |
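
Permutation importance shuffles one feature column at a time in the validation set and records how much the model's score drops. Assuming eli5 falls back to the estimator's own `score` method (R² for regressors) when `scoring` is not set, each weight above is a drop in R², which can exceed 1 because R² on shuffled data can go negative. A from-scratch sketch of the idea on synthetic data, with a `LinearRegression` standing in for the XGBoost model and a hypothetical helper name:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
# Synthetic data: column 0 matters most, column 1 a little, column 2 not at all.
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_fit, X_val = X[:400], X[400:]
y_fit, y_val = y[:400], y[400:]
model = LinearRegression().fit(X_fit, y_fit)

def permutation_importances(model, X, y, n_iter=5, seed=0):
    """Mean and std of the score drop when each column is shuffled."""
    shuffle_rng = np.random.default_rng(seed)
    base_score = model.score(X, y)  # R^2 for scikit-learn regressors
    drops = np.zeros((n_iter, X.shape[1]))
    for i in range(n_iter):
        for j in range(X.shape[1]):
            X_perm = X.copy()
            # Break the link between column j and the target.
            X_perm[:, j] = shuffle_rng.permutation(X_perm[:, j])
            drops[i, j] = base_score - model.score(X_perm, y)
    return drops.mean(axis=0), drops.std(axis=0)

mean_drop, std_drop = permutation_importances(model, X_val, y_val)
for j, (m, s) in enumerate(zip(mean_drop, std_drop)):
    print(f'feature {j}: {m:.4f} ± {s:.4f}')
```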


## Model Tuning

In [0]:
min_importance = 0
mask = permuter.feature_importances_ > min_importance
features = X_train.columns[mask]
X_train_sub = X_train[features]
X_val_sub = X_val[features]
X_test_sub = X_test[features]
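
The mask keeps only features whose permutation importance is strictly greater than `min_importance`, so features with zero or negative weight are dropped. A small sketch using the weights reported in the eli5 output above (the Series and its ordering are illustrative, not the notebook's actual column order):

```python
import pandas as pd

# Weights copied from the eli5 output above; the ordering is illustrative.
toy_importances = pd.Series({
    'smoker': 1.3070, 'bmi': 0.2320, 'age': 0.1512, 'children': 0.0121,
    'region': 0.0072, 'bmi_class': 0.0, 'sex': -0.0003,
})

min_importance = 0
kept = toy_importances[toy_importances > min_importance].index.tolist()
dropped = toy_importances[toy_importances <= min_importance].index.tolist()

print(kept)
print(dropped)
```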

In [134]:
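# Refit on the selected features: validation MAE improves slightly (2898.75 -> 2871.33).
# This run also omits max_depth=5 and sets random_state=42, so the gain is not
# attributable to feature selection alone.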
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train_sub, y_train)

y_pred = pipeline.predict(X_val_sub)
mae = mean_absolute_error(y_val, y_pred)
print(f'MAE: {mae}')

MAE: 2871.330085899168


In [150]:
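# Early stopping: encode and scale by hand, because eval_set must be passed
# already-transformed data (a pipeline would not transform it for us).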
encoder = ce.OrdinalEncoder()
X_train_enc = encoder.fit_transform(X_train_sub)
X_val_enc = encoder.transform(X_val_sub)
X_test_enc = encoder.transform(X_test_sub)

scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train_enc)
X_val_scl = scaler.transform(X_val_enc)
X_test_scl = scaler.transform(X_test_enc)

model = XGBRegressor(n_estimators=1000,
                     max_depth=3,
                     learning_rate=.5,
                     n_jobs=-1,
                     random_state=42)

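# Track MAE on both sets; early stopping uses the last entry in eval_set (validation).
eval_set = [(X_train_scl, y_train),(X_val_scl, y_val)]
# Note: recent XGBoost releases expect eval_metric and early_stopping_rounds
# in the XGBRegressor constructor rather than in fit().
model.fit(X_train_scl, y_train,
          eval_set=eval_set,
          eval_metric='mae',
          early_stopping_rounds=100)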

[0]	validation_0-mae:6742.04	validation_1-mae:7538.93
Multiple eval metrics have been passed: 'validation_1-mae' will be used for early stopping.

Will train until validation_1-mae hasn't improved in 100 rounds.
[1]	validation_0-mae:3699.06	validation_1-mae:4437.33
[2]	validation_0-mae:2489.98	validation_1-mae:3173.33
[3]	validation_0-mae:2135.23	validation_1-mae:2806.93
[4]	validation_0-mae:2101.25	validation_1-mae:2779.51
[5]	validation_0-mae:2098.55	validation_1-mae:2856.12
[6]	validation_0-mae:2117.69	validation_1-mae:2888.6
[7]	validation_0-mae:2126.16	validation_1-mae:2903.18
[8]	validation_0-mae:2107.23	validation_1-mae:2978.94
[9]	validation_0-mae:2097.96	validation_1-mae:2974.89
[10]	validation_0-mae:2082.6	validation_1-mae:2990.85
[11]	validation_0-mae:2067.45	validation_1-mae:2997.14
[12]	validation_0-mae:2052.86	validation_1-mae:2980.74
[13]	validation_0-mae:2041.36	validation_1-mae:2993.65
[14]	validation_0-mae:2030.74	validation_1-mae:2996.03
[15]	validation_0-mae:2026.73

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.5, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=-1, nthread=None, objective='reg:linear', random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
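
Early stopping tracks the validation metric after every boosting round and halts once it has gone a set number of rounds without improving. The sketch below applies that rule in plain Python to the `validation_1-mae` values printed above (rounds 0 to 14). The helper name and the short patience of 3 are illustrative assumptions, not XGBoost internals.

```python
# validation_1-mae values copied from the training log above, rounds 0..14.
val_mae = [7538.93, 4437.33, 3173.33, 2806.93, 2779.51,
           2856.12, 2888.6, 2903.18, 2978.94, 2974.89,
           2990.85, 2997.14, 2980.74, 2993.65, 2996.03]

def best_round_with_patience(val_scores, patience):
    """Return (best_round, stop_round); stop_round is None if never triggered."""
    best_round, best_score = 0, float('inf')
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return best_round, rnd
    return best_round, None

print(best_round_with_patience(val_mae, patience=3))
print(best_round_with_patience(val_mae, patience=100))
```

With these logged values the lowest validation MAE is at round 4 (2779.51). A patience of 100, as in the cell above, would not trigger within the 15 rounds shown.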

In [152]:
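# Refit with a small, fixed number of trees, guided by the early-stopping log above,
# where validation MAE bottomed out at round 4. Note max_depth=5 here versus
# max_depth=3 in the early-stopping run.
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    XGBRegressor(n_estimators=4, random_state=42, n_jobs=-1,
                 learning_rate=.5, max_depth=5)
)

# Pass the raw feature subsets: the pipeline encodes and scales internally,
# so feeding it the pre-scaled arrays would apply those steps twice.
pipeline.fit(X_train_sub, y_train)

y_pred = pipeline.predict(X_val_sub)
mae = mean_absolute_error(y_val, y_pred)
print(f'MAE: {mae}')

MAE: 2775.193156784828


In [153]:
# final test MAE
y_pred = pipeline.predict(X_test_sub)
mae = mean_absolute_error(y_test, y_pred)
print(f'MAE: {mae}')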

MAE: 2244.173808648321
