<a href="https://colab.research.google.com/github/Cknowles11/DS-Unit-2-Applied-Modeling/blob/master/Copy_of_LS_DS_233_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
!pip install category_encoders==2.*
!pip install eli5

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv('/content/drive/My Drive/Local Repo/wineQualityWhites.csv')

In [14]:
df.sample(5)

,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,fsd_perc
4713,6.4,0.28,0.28,3.0,0.04,19.0,98.0,0.99216,3.25,0.47,11.1,6,0.194
1086,5.2,0.24,0.45,3.8,0.027,21.0,128.0,0.992,3.55,0.49,11.2,8,0.164
3435,6.5,0.24,0.28,1.1,0.034,26.0,83.0,0.98928,3.25,0.33,12.3,6,0.313
4261,6.0,0.31,0.27,2.3,0.042,19.0,120.0,0.98952,3.32,0.41,12.7,7,0.158
3741,7.0,0.15,0.28,14.7,0.051,29.0,149.0,0.99792,2.96,0.39,9.0,7,0.195


In [6]:
df = df.drop('Unnamed: 0', axis = 1)

In [15]:
df.dtypes

fixed.acidity           float64
volatile.acidity        float64
citric.acid             float64
residual.sugar          float64
chlorides               float64
free.sulfur.dioxide     float64
total.sulfur.dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
fsd_perc                float64
dtype: object

# Feature Engineering

In [9]:
# Free Sulfur Dioxide in comparison to Total Sulfur Dioxide
df['fsd_perc'] = df['free.sulfur.dioxide'] / df['total.sulfur.dioxide']

In [13]:
df['fsd_perc'] = df['fsd_perc'].round(3)

# Exploration

In [37]:
# Wines rated 5 or higher. Note that quality == 5 also falls in below_avg_sub below.
above_avg_sub = df[df['quality'] >= 5]

In [38]:
above_avg_sub.head()

,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,fsd_perc
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,0.265
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,0.106
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,0.309
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.253
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,0.253


In [39]:
above_avg_sub.describe()

,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,fsd_perc
count,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0,4715.0
mean,6.842131,0.274448,0.33522,6.452365,0.045587,35.644751,138.67614,0.994015,3.188456,0.490386,10.527493,5.955037,0.258168
std,0.826105,0.09511,0.119301,5.089551,0.021521,16.134741,41.51303,0.003008,0.15029,0.113958,1.236029,0.807326,0.091975
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,5.0,0.024
25%,6.3,0.21,0.27,1.7,0.036,24.0,109.0,0.9917,3.09,0.41,9.5,5.0,0.194
50%,6.8,0.26,0.32,5.3,0.043,34.0,134.0,0.9937,3.18,0.48,10.4,6.0,0.256
75%,7.3,0.32,0.39,10.0,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0,0.317
max,14.2,0.965,1.66,65.8,0.346,131.0,344.0,1.03898,3.82,1.08,14.2,9.0,0.711


In [40]:
# Wines rated 5 or lower (overlaps with above_avg_sub at quality == 5)
below_avg_sub = df[df['quality'] <= 5]

In [41]:
below_avg_sub.describe()

,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,fsd_perc
count,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0,1640.0
mean,6.961524,0.310265,0.334311,7.054451,0.051436,35.33872,148.597866,0.99516,3.170457,0.481506,9.84953,4.87622,0.232263
std,0.884887,0.112548,0.142987,5.283594,0.026743,20.217828,46.914579,0.002556,0.144274,0.100566,0.876269,0.364596,0.095712
min,4.2,0.1,0.0,0.6,0.009,2.0,9.0,0.98722,2.79,0.25,8.0,3.0,0.024
25%,6.4,0.24,0.24,1.7,0.04,20.0,117.0,0.9932,3.08,0.41,9.2,5.0,0.16375
50%,6.8,0.29,0.32,6.625,0.047,34.0,149.0,0.99514,3.16,0.47,9.6,5.0,0.231
75%,7.5,0.35,0.41,11.025,0.053,49.0,182.0,0.9971,3.24,0.53,10.4,5.0,0.292
max,11.8,1.1,1.0,23.5,0.346,289.0,440.0,1.00241,3.79,0.88,13.6,5.0,0.657


# Fit Model

In [61]:
import category_encoders as ce
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

In [45]:
train,test = train_test_split(df, train_size = .8, test_size = .2, stratify = df['quality'], random_state = 21)

In [47]:
train, val = train_test_split(train, train_size = .80, test_size = .20, stratify = train['quality'], random_state = 21)

In [49]:
print(train.shape)
print(val.shape)
test.shape

(3134, 13)
(784, 13)


(980, 13)

In [50]:
target = 'quality'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test.drop(columns=target)
y_test = test[target]
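
Before fitting anything, it helps to know the majority-class baseline that a model has to beat. The following is a minimal sketch of that calculation. The label values are made up for illustration and the `_demo` names are hypothetical stand-ins for `y_train` and `y_val`, not the wine data itself.

```python
import pandas as pd

# Hypothetical labels standing in for y_train and y_val
y_train_demo = pd.Series([6, 6, 5, 6, 7, 5, 6, 8])
y_val_demo = pd.Series([6, 5, 6, 7])

# The most frequent class in the training labels is the baseline prediction
majority_class = y_train_demo.mode()[0]

# Baseline accuracy: how often that single guess is right on validation data
baseline_accuracy = (y_val_demo == majority_class).mean()
print(majority_class, baseline_accuracy)
```

A validation accuracy close to this number would suggest the model has learned little beyond the class distribution.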

In [51]:
tfs = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median')
)

X_train_transformed = tfs.fit_transform(X_train)
X_val_transformed = tfs.transform(X_val)

model = RandomForestClassifier(random_state=42)
model.fit(X_train_transformed, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [56]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy = 'median'),
    RandomForestClassifier()
)

# Permutation Importances / Model Fit

In [52]:
permuter = PermutationImportance(model, scoring = 'accuracy', n_iter = 5, random_state=42)
permuter.fit(X_val_transformed, y_val)

PermutationImportance(cv='prefit',
                      estimator=RandomForestClassifier(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       max_samples=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fr

In [53]:
feature_names = X_val.columns.tolist()

In [55]:
eli5.show_weights(permuter, top = None, feature_names = feature_names)

Weight,Feature
0.0936  ± 0.0167,alcohol
0.0712  ± 0.0158,volatile.acidity
0.0306  ± 0.0077,density
0.0247  ± 0.0114,chlorides
0.0212  ± 0.0159,fsd_perc
0.0168  ± 0.0142,residual.sugar
0.0153  ± 0.0105,total.sulfur.dioxide
0.0125  ± 0.0122,free.sulfur.dioxide
0.0105  ± 0.0116,citric.acid
0.0089  ± 0.0239,pH
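
Each weight above is the average drop in validation accuracy when that one column is shuffled. As a rough illustration of the same idea, the sketch below uses scikit-learn's `permutation_importance` (a swapped-in alternative to eli5, not the call used in this notebook) on synthetic data where one column determines the label and the other is noise. The data, the `demo_model` name, and the feature labels are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: column 0 determines the label, column 1 is pure noise
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
X_fit, y_fit = X[:300], y[:300]
X_holdout, y_holdout = X[300:], y[300:]

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_fit, y_fit)

# Shuffle each column in turn on held-out data and record the accuracy drop
result = permutation_importance(
    demo_model, X_holdout, y_holdout,
    scoring='accuracy', n_repeats=5, random_state=42,
)

for name, mean, std in zip(['signal', 'noise'],
                           result.importances_mean,
                           result.importances_std):
    print(f'{name}: mean drop {mean:.4f}, std {std:.4f}')
```

Shuffling the signal column should cost far more accuracy than shuffling the noise column, which is the pattern to look for in the table above.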


In [58]:
pipeline.fit(X_train,y_train)
pipeline.score(X_val, y_val)


0.6492346938775511

In [60]:
from xgboost import XGBClassifier

xgb_pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    XGBClassifier()
)

xgb_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=[], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[], return_df=True, verbose=0)),
                ('xgbclassifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=100, n_jobs=1, nthread=None,
                               objective='multi:softprob', random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               seed=None, silent=None, subsample=1,
                               verbosity=1))],
         

In [63]:
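# Note: `pipeline` is the random forest pipeline, so this reproduces the score above.
# To score the XGBoost model instead, predict with xgb_pipeline.
y_pred = pipeline.predict(X_val)
accuracy_score(y_val, y_pred)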

0.6492346938775511

# Parameter Tuning

In [64]:
encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

x_model = XGBClassifier(
    n_estimators = 1000,
    max_depth = 10,
    learning_rate = 0.5,
)

eval_set = [(X_train_encoded, y_train),
            (X_val_encoded, y_val)]

x_model.fit(X_train_encoded, y_train,
            eval_set = eval_set,
            eval_metric = 'merror', 
            early_stopping_rounds = 50)

[0]	validation_0-merror:0.22559	validation_1-merror:0.432398
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 50 rounds.
[1]	validation_0-merror:0.158583	validation_1-merror:0.422194
[2]	validation_0-merror:0.115507	validation_1-merror:0.399235
[3]	validation_0-merror:0.093172	validation_1-merror:0.394133
[4]	validation_0-merror:0.071793	validation_1-merror:0.380102
[5]	validation_0-merror:0.047543	validation_1-merror:0.373724
[6]	validation_0-merror:0.0418	validation_1-merror:0.376276
[7]	validation_0-merror:0.02776	validation_1-merror:0.371173
[8]	validation_0-merror:0.022974	validation_1-merror:0.377551
[9]	validation_0-merror:0.019464	validation_1-merror:0.364796
[10]	validation_0-merror:0.014997	validation_1-merror:0.367347
[11]	validation_0-merror:0.013082	validation_1-merror:0.362245
[12]	validation_0-merror:0.006701	validation_1-merror:0.360969
[13]	validation_0-merror:0.00510

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.5, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
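
The run above sets a high `n_estimators` ceiling and lets early stopping choose the actual number of rounds. The sketch below shows the same pattern with scikit-learn's `GradientBoostingClassifier`, which is a different estimator from the `XGBClassifier` used here and monitors validation loss on an internal hold-out rather than `merror` on an explicit `eval_set`. The synthetic data and parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (not the wine dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

gb_model = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound on boosting rounds
    learning_rate=0.5,
    max_depth=3,
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb_model.fit(X_demo, y_demo)

# n_estimators_ is the number of rounds actually fitted
print(gb_model.n_estimators_)
```

If the fitted round count comes out well below the ceiling, a smaller learning rate is worth trying, since a high rate tends to overfit within the first few rounds.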