Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

# Continued from assignment 232
## Why is my model junky?
Hypothesis: My model has low precision and high accuracy because of the class imbalance (0 = 99.4%, 1 = 0.6%), and makes bad predictions because the training on 1,000 observations is not meaningful

## Test strategy
### Test 1 more data
Simply just feed in more data
### Test 2 random sample more data
Same thing, but lets sample instead of read in sequentially
### Test 3 stratify sample more data
Same thing, but lets try stratification for sampling

## Next steps
Once I have a reasonable data sampling strategy I can try permuting and an XGBoost model

### Methodology
Get data iterable, define a validation set, define a X y function, create basic pipeline, create hypothesis testing data sets and test one by one.

In [1]:
#Random guess baseline from last assignment
baseline = .00385

In [2]:
#Setup
import pandas as pd

folder = '../../DS-Unit-2-Build/bosch-production-line-performance/'

num_iter = pd.read_csv(folder + 'train_numeric.csv', iterator = True, chunksize = 1000)

In [3]:
def getXy(df):
    target = 'Response'
    return(df.drop(columns = ['Id', target]), df[target])

In [4]:
#Baseline (from last assignment): Precision - .00385, Accurady: .481
#I'm measuring for precision
#Simple logistic regressor with CV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    SimpleImputer(strategy = 'median'),
    StandardScaler(),
    LogisticRegressionCV(scoring = 'precision', cv = 4, n_jobs = -1, random_state = 42)
)

### Validation set

In [5]:
#time to chonk it in.  I'm going to kinda play this by ear, but I think I need to increase
#my data by a factor of 100.  Interesting Dash idea! make a few graphs that show the importance
#of data size, sampling methods, models etc.  Store that away for later

chunks = []

for i in range(100):
    chunks.append(num_iter.get_chunk())
    
val = pd.concat(chunks)
print(val.shape)
val.head()

(100000, 970)


Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,0
1,6,,,,,,,,,,...,,,,,,,,,,0
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,0
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,0
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,0


### More data set

In [6]:
#okay that only took about 2 gigs of ram. I have 32 gigs available.  So I think I'm safe making
#test sets that are 100,000 observations long. Lets get my validation set before i get too far
#in

chunks = []

for i in range(100):
    chunks.append(num_iter.get_chunk())
    
big_samp = pd.concat(chunks)
print(big_samp.shape)
big_samp.head()

(100000, 970)


Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
100000,200354,,,,,,,,,,...,,,,,,,,,,0
100001,200358,,,,,,,,,,...,,,,,,,,,,0
100002,200359,,,,,,,,,,...,,,,,,,,,,0
100003,200360,0.082,0.078,-0.033,-0.052,0.031,0.116,-0.015,-0.072,0.01,...,,,,,,,,,,0
100004,200361,-0.134,-0.071,0.294,0.312,-0.013,0.07,0.052,0.168,0.143,...,,,,,,,,,,0


### Random dataset

In [7]:
#now lets do the random sample.  I'm going to go ahead and build a simple one since I couldn't
#find any tailor made methods in sklearn
from random import randint

chunks = []

while len(chunks) < 100:
    
    if randint(0,100) <= 80:
        continue
    
    chunks.append(num_iter.get_chunk())
    
rand_samp = pd.concat(chunks)
print(rand_samp.shape)
rand_samp.head()

(100000, 970)


Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
200000,399900,,,,,,,,,,...,,,,,,,,,,0
200001,399901,0.043,-0.004,-0.179,-0.179,-0.1,-0.294,-0.015,-0.072,-0.087,...,,,,,,,,,,0
200002,399907,0.069,0.093,-0.343,-0.361,0.031,0.161,-0.03,-0.192,0.066,...,,,,,,,,,,0
200003,399909,,,,,,,,,,...,,,,,,,,,,0
200004,399910,0.003,0.041,-0.015,-0.016,-0.056,0.07,0.0,-0.072,0.153,...,,,,,,,,,,0


### Stratified dataset

In [8]:
#now lets try to make a stratified sample.  I'm not exactly sure how to do that,
#but I could read in 200,000 columns and do a 50/50 split stratified?  Idk if that is
#nonesense, but its what I'm going to try today.
from sklearn.model_selection import train_test_split

chunks = []

for i in range(200):
    chunks.append(num_iter.get_chunk())

big_strat = pd.concat(chunks)
stratify = big_strat['Response']
strat_samp, strat_test = train_test_split(big_strat, train_size = .5, random_state = 42, stratify = stratify)

print(strat_samp.shape)
strat_samp.head()

(100000, 970)


Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
449123,899066,,,,,,,,,,...,,,,,,,,,,0
463431,927592,0.036,0.056,-0.197,-0.179,-0.013,-0.157,-0.015,-0.112,-0.097,...,,,,,,,,,,0
332827,666136,-0.01,-0.049,-0.233,-0.179,0.031,0.07,-0.007,-0.032,0.066,...,,,,,,,,,,0
393143,786801,-0.029,-0.101,-0.179,-0.216,0.292,0.161,-0.007,0.048,0.015,...,,,,,,,,,,0
418274,837079,,,,,,,,,,...,,,,,,,,,,0


### Split into X and y

In [9]:
X_big, y_big = getXy(big_samp)
X_rand, y_rand = getXy(rand_samp)
X_strat, y_strat = getXy(strat_samp)
X_val, y_val = getXy(val)

print(X_big.shape, y_big.shape)
print(X_rand.shape, y_rand.shape)
print(X_strat.shape, y_strat.shape)
print(X_val.shape, y_val.shape)

(100000, 968) (100000,)
(100000, 968) (100000,)
(100000, 968) (100000,)
(100000, 968) (100000,)


### Testing function

In [10]:
def test(X, y):
    pipeline.fit(X,y)
    print('Baseline Precision: ', baseline)
    print('Trained Precision: ', pipeline.score(X,y))
    print('Validation Precision: ', pipeline.score(X_val, y_val))

### Test 1 More data

In [11]:
test(X_big, y_big)

Baseline Precision:  0.00385
Trained Precision:  0.5
Validation Precision:  0.5384615384615384


### Test 2 randomized data

In [12]:
test(X_rand, y_rand)

Baseline Precision:  0.00385
Trained Precision:  0.75
Validation Precision:  0.5


### Test 3 stratified data

In [13]:
test(X_strat, y_strat)

Baseline Precision:  0.00385
Trained Precision:  0.8571428571428571
Validation Precision:  0.47058823529411764


# Conclusion
Increasing my data size significantly improved my model's precision score as expected. Randomization and stratification appeared to do worse.  I should spend a little more time comparing randomized, stratified and sequential data sets that are larger and see how they do.  But for now I have something way better than what I was working with

## XGBoost

In [19]:
from xgboost import XGBClassifier

XGBpipe = make_pipeline(
    SimpleImputer(strategy = 'median'),
    StandardScaler(),
    XGBClassifier(n_estimators = 100, random_state = 42, n_jobs = -1)
    
)

In [20]:
#XGB doesn't have built in scoring, so you have to import the scoring separately
from sklearn.metrics import precision_score

XGBpipe.fit(X_big, y_big)

y_pred = pipeline.predict(X_val)

print('Validation Precision:', precision_score(y_val, y_pred))

Validation Precision: 0.47058823529411764


Eeeeeeeeh not an improvement, but this is just an mvp.  I'll likely need to tweak things a bit to verify if it is actually doing worse.  Good to know.

## Permutation Importance

In [24]:
#For permutation importance we need to separate our model from our pipeline.  So Instead
#we are using a pipeline to transform our datasets and feed it into the model manually.

import eli5
from eli5.sklearn import PermutationImportance

transformer = make_pipeline(
    SimpleImputer(strategy = 'median'),
    StandardScaler()
    
)

X_big_transformed = transformer.fit_transform(X_big)
X_val_transformed = transformer.fit_transform(X_val)

model = XGBClassifier(n_estimators = 100, random_state = 42, n_jobs = -1)

model.fit(X_big_transformed, y_big)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [26]:
#this is why we split them out, permutationimportance needs an already fit model to
#instantiate the class

permuter = PermutationImportance(
    model,
    scoring = 'precision',
    n_iter = 5,
    random_state = 42
)

#our permuter needs to be fit on the validation sets
permuter.fit(X_val_transformed, y_val)

PermutationImportance(cv='prefit',
                      estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                              colsample_bylevel=1,
                                              colsample_bynode=1,
                                              colsample_bytree=1, gamma=0,
                                              learning_rate=0.1,
                                              max_delta_step=0, max_depth=3,
                                              min_child_weight=1, missing=None,
                                              n_estimators=100, n_jobs=-1,
                                              nthread=None,
                                              objective='binary:logistic',
                                              random_state=42, reg_alpha=0,
                                              reg_lambda=1, scale_pos_weight=1,
                                              seed=None, silent=None,
                   

In [27]:
#extracting the column names for a prettier presentation of permutation importance
feature_names = X_val.columns.tolist()

eli5.show_weights(
    permuter,
    top = None,
    feature_names = feature_names
)

Weight,Feature
0.1032  ± 0.0253,L3_S38_F3960
0.0417  ± 0.0000,L1_S24_F1846
0.0250  ± 0.0408,L3_S38_F3956
0.0095  ± 0.0233,L0_S21_F527
0.0048  ± 0.0190,L1_S24_F1816
0  ± 0.0000,L3_S51_F4262
0  ± 0.0000,L1_S24_F1494
0  ± 0.0000,L1_S24_F1539
0  ± 0.0000,L1_S24_F1520
0  ± 0.0000,L1_S24_F1518


### Important features only

In [28]:
#setting the threshold of importance we want to run our model on now
minimum_importance = 0

#not sure why we are calling this a mask.  Basically return only the features imporances that
#are greater than our minimum importance
mask = permuter.feature_importances_ > minimum_importance

#get the columns for each features that is above our minimum importance
features = X_big.columns[mask]

#creating a new dataframe with only the features above our specified minimum importance
X_big = X_big[features]

In [29]:
#same thing, lets get a validation set with only the features above our minimum importance
#threshold that we set.
X_val = X_val[features]

perm_pipe = make_pipeline(
    SimpleImputer(strategy = 'median'),
    StandardScaler(),
    XGBClassifier(n_estimators = 100, random_state = 42, n_jobs = -1)
)

perm_pipe.fit(X_big, y_big)

y_pred = perm_pipe.predict(X_val)

print('Validation Precision after permutation importance:', precision_score(y_val, y_pred))

Validation Precision after permutation importance: 0.75
