<a href="https://colab.research.google.com/github/ethanmjansen/DS-Unit-2-Applied-Modeling/blob/master/module3/LS_DS10_assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

# My Data Prework

In [39]:
#Initial Imports
import pandas as pd
from google.colab import files
import numpy as np
import matplotlib.pyplot as plt
!pip install category_encoders==2.*
!pip install eli5

Collecting eli5
  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
Installing collected packages: eli5
Successfully installed eli5-0.10.1


In [2]:
uploaded = files.upload()

Saving 2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv to 2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv


In [0]:
df = pd.read_csv('2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv')

# Define Target

In [0]:
#Brief Clean to Define Target
indexNames = df[ (df['Runs from'] == True) & (df['Indifferent'] == True) ].index
df.drop(indexNames , inplace=True)

In [9]:
#This is my target: True when the squirrel approaches or is indifferent
df['Target'] = df['Approaches'] | df['Indifferent']
df['Target'].value_counts()

True     1581
False    1410
Name: Target, dtype: int64
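The target logic above can be sketched without pandas. The rows below are made-up observations, not census data, and `is_contradictory` and `label` are hypothetical helpers: a row flagged as both "Runs from" and "Indifferent" is treated as contradictory and dropped, and each remaining row is positive when the squirrel either approaches or is indifferent.

```python
# Hypothetical observations (not real census rows).
rows = [
    {"Approaches": True,  "Indifferent": False, "Runs from": False},
    {"Approaches": False, "Indifferent": True,  "Runs from": False},
    {"Approaches": False, "Indifferent": False, "Runs from": True},
    {"Approaches": False, "Indifferent": True,  "Runs from": True},  # contradictory
]

def is_contradictory(row):
    """A squirrel cannot both run away and be indifferent."""
    return row["Runs from"] and row["Indifferent"]

def label(row):
    """Positive class: the squirrel approaches or is indifferent."""
    return row["Approaches"] or row["Indifferent"]

kept = [row for row in rows if not is_contradictory(row)]
targets = [label(row) for row in kept]
print(len(kept), targets)
```

Dropping the contradictory rows first matters here, because otherwise a squirrel that also ran away would be counted as positive through its "Indifferent" flag.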

# What accuracy do I get just by guessing?

In [10]:
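#Class proportions (normalize=True returns fractions instead of counts)
proportions = df['Target'].value_counts(normalize=True)
print(proportions)
print(f'If I only guessed Majority Class I would be \nright only {proportions.max():.1%} of the time')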

True     0.528586
False    0.471414
Name: Target, dtype: float64
If I only guessed Majority Class I would be 
right only 52.9% of the time
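The majority-class baseline is plain arithmetic on the class counts. A minimal sketch using only the standard library, with the counts copied from the `value_counts()` output above:

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({True: 1581, False: 1410})

# Always predicting the most common class is right this often.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(counts.values())

print(majority_class, round(baseline_accuracy, 6))
```

Any model worth keeping should score above this number on data it has not seen.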


# My Quick and Dirty Model

In [11]:
#Train Test Split Classifier
from sklearn.model_selection import train_test_split
#The actual Split
train, test = train_test_split(df, 
                               train_size=0.80,
                               test_size=0.20,
                               random_state=42)

#Train/Val/Test Split
train, val = train_test_split(train, 
                               train_size=0.80,
                               test_size=0.20,
                               random_state=42)

train.shape, val.shape, test.shape

((1913, 37), (479, 37), (599, 37))
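The three shapes follow from applying an 80/20 split twice. The sketch below reproduces the row counts with plain arithmetic. The rounding rule (test size rounded up, train size rounded down) is an assumption chosen because it matches the shapes printed above, and `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, train_size, test_size):
    """Return (n_train, n_test), rounding test up and train down (assumed rule)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test

n_rows = 1581 + 1410  # rows left after the cleaning step

# First split: hold out the test set.
n_train_val, n_test = split_sizes(n_rows, 0.80, 0.20)

# Second split: carve a validation set out of the remaining rows.
n_train, n_val = split_sizes(n_train_val, 0.80, 0.20)

print(n_train, n_val, n_test)
```

Overall this leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing.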

In [0]:
#Model Imports
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

In [0]:
#Make X features Matrix and y vector 
#columns_to_drop = ['Unique Squirrel ID', 'Date', 'Color notes', 'Above Ground Sighter Measurement', 'Specific Location', 'Other Activities', 'Approaches', 'Indifferent', 'Runs from', 'Other Interactions', 'Lat/Long', 'Zip Codes','Target']
columns_to_drop = ['Unique Squirrel ID', 'Approaches', 'Indifferent', 'Runs from', 'Target']
target = 'Target'
X_train = train.drop(columns=columns_to_drop)
y_train = train[target]
X_val = val.drop(columns=columns_to_drop)
y_val = val[target]
X_test = test.drop(columns=columns_to_drop)
y_test = test[target]

In [0]:
#Make Pipeline
pipeline = make_pipeline(ce.OrdinalEncoder(), 
                         SimpleImputer(strategy='mean'), 
                         RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=27))
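To make the first two pipeline steps concrete, here is a simplified, library-free sketch of what ordinal encoding and mean imputation do. The column values are invented for illustration, `ordinal_encode` and `mean_impute` are hypothetical helpers, and the real `ce.OrdinalEncoder` and `SimpleImputer` do more than this (for example, they learn their mapping and mean on the training data and reuse them on validation data).

```python
# Hypothetical column values; None marks a missing entry.
fur_color = ["Gray", "Cinnamon", "Gray", "Black"]
hectare_count = [3.0, None, 5.0, 4.0]

def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping

def mean_impute(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

encoded, mapping = ordinal_encode(fur_color)
imputed = mean_impute(hectare_count)
print(encoded, mapping)
print(imputed)
```

The integer codes carry no real ordering, which tree-based models such as random forests tolerate reasonably well because they split on thresholds rather than fitting a slope.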

In [19]:
# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.6722338204592901


# Try XGBoost

In [0]:
#Import xgboost and accuracy score
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [0]:
#Make Pipeline
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    XGBClassifier(n_estimators=400,
                  random_state=27,
                  max_depth=5, 
                  learning_rate=0.1,
                  n_jobs=-1)
)
#Fit to Train
pipeline.fit(X_train, y_train)
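The core idea behind gradient boosting is that each new tree is fit to the errors left by the trees before it, and its contribution is scaled by the learning rate. The sketch below shows that idea for squared-error regression with one-feature decision stumps. It is a teaching simplification on invented data, not XGBoost's actual algorithm, and `fit_stump`, `boost`, and `mse` are hypothetical helpers; only `n_estimators` and `learning_rate` echo the parameter names used above.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Invented data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(boost(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)
```

A smaller learning rate takes more rounds to fit the same signal, which is why `n_estimators` and `learning_rate` are usually tuned together.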

In [37]:
#Using accuracy_score to get model accuracy
y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 0.6617954070981211
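Accuracy is just the number of correct predictions divided by the number of rows, so the two validation scores can be turned back into counts. The sketch below does that using the scores printed above and the 479 validation rows from the split.

```python
n_val = 479  # validation rows, from the shapes printed earlier

# Validation scores copied from the outputs above.
scores = {
    "random forest": 0.6722338204592901,
    "xgboost": 0.6617954070981211,
}

# Recover the number of correct predictions behind each score.
correct = {name: round(score * n_val) for name, score in scores.items()}
for name, count in correct.items():
    print(f"{name}: {count} of {n_val} correct")

gap = correct["random forest"] - correct["xgboost"]
print("difference:", gap, "predictions")
```

With a validation set this small, a gap of a handful of predictions is probably too small to say one model is clearly better than the other.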


# My Model's Permutation Importances

In [0]:
#Imports
import eli5
from eli5.sklearn import PermutationImportance

In [0]:
#Making transformers
transformers = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean')
)

In [0]:
#Fit transformers on X_train, then apply them to X_val
X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = RandomForestClassifier(n_estimators=300, random_state=27, n_jobs=-1)
model.fit(X_train_transformed, y_train)

In [0]:
#Calculate Permutation Importances
permuter = PermutationImportance(
    model, 
    scoring='accuracy', 
    n_iter=5, 
    random_state=27
)

permuter.fit(X_val_transformed, y_val)
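Permutation importance can be written from scratch in a few lines: score the model, shuffle one column, score again, and record how much the score dropped. The sketch below is a library-free illustration of that procedure, not eli5's implementation. The toy data and the `model` function are invented, and `n_iter` and `seed` only mirror the arguments used above.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, n_iter=5, seed=27):
    """Mean drop in accuracy after shuffling each column, over n_iter shuffles."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_iter):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_iter)
    return importances

# Toy data: the label depends only on column 0; column 1 is pure noise.
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [row[0] > 0.5 for row in rows]

def model(row):
    """Hypothetical 'trained' model that only looks at column 0."""
    return row[0] > 0.5

importances = permutation_importance(model, rows, labels)
print(importances)
```

A feature the model never uses scores exactly zero, because shuffling it cannot change any prediction. Small negative values can appear when a shuffle happens to improve the score by chance.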

In [71]:
#Ugly Display of Permutation Importances
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

Above Ground Sighter Measurement             -0.007933
Climbing                                     -0.002088
Tail twitches                                -0.000418
Police Precincts                              0.000000
Borough Boundaries                            0.000000
Community Districts                           0.000000
Zip Codes                                     0.000000
Lat/Long                                      0.000000
Moans                                         0.000000
City Council Districts                        0.000000
Combination of Primary and Highlight Color    0.000835
Running                                       0.001253
Color notes                                   0.001253
Specific Location                             0.002505
Tail flags                                    0.002505
Kuks                                          0.002505
Quaas                                         0.002923
Other Activities                              0.003758
Eating    

In [72]:
#Nice Display of Permutation Importances
eli5.show_weights(
    permuter, 
    top=None,
    feature_names=feature_names
)

| Weight | Feature |
|---|---|
| 0.0380 ± 0.0202 | Shift |
| 0.0359 ± 0.0116 | Foraging |
| 0.0334 ± 0.0203 | X |
| 0.0330 ± 0.0442 | Date |
| 0.0267 ± 0.0148 | Hectare |
| 0.0238 ± 0.0192 | Y |
| 0.0196 ± 0.0202 | Hectare Squirrel Number |
| 0.0125 ± 0.0095 | Age |
| 0.0092 ± 0.0043 | Other Interactions |
| 0.0079 ± 0.0110 | Primary Fur Color |


In [0]:
#Establishing minimum importance and using permuter to mask features
minimum_importance = 0
mask = permuter.feature_importances_ > minimum_importance
features = X_train.columns[mask]
X_train = X_train[features]
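The masking step keeps only the features whose importance is strictly greater than the threshold. A small sketch of the same filter in plain Python, using a few importances copied from the outputs above (not the full list):

```python
# A few importances copied from the outputs above (not the full list).
importances = {
    "Above Ground Sighter Measurement": -0.007933,
    "Zip Codes": 0.000000,
    "Running": 0.001253,
    "Shift": 0.0380,
}

minimum_importance = 0

# Strict comparison: features at exactly zero are dropped too.
kept = [name for name, weight in importances.items() if weight > minimum_importance]
print(kept)
```

Because the comparison is strict, features with zero or negative importance are removed, while even a very small positive value is enough to stay in.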

In [74]:
#New Prediction after selecting with feature permutation 
X_val = X_val[features]

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_estimators=300, random_state=27, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.6722338204592901
