Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your own dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

### Food is life dataset

In [64]:
!pip install category_encoders==2.*
!pip install eli5




In [65]:
from google.colab import files
uploaded = files.upload()

Saving whats-cooking.zip to whats-cooking (1).zip


In [66]:
!unzip whats-cooking.zip  

Archive:  whats-cooking.zip
replace train.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train.json              
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: sample_submission.csv   
replace test.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: test.json               


In [0]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

In [68]:
df = pd.read_json('train.json').drop(['id'], axis=1)
print(df.shape)
df.head()

(39774, 2)


Unnamed: 0,cuisine,ingredients
0,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,indian,"[water, vegetable oil, wheat, salt]"
4,indian,"[black pepper, shallots, cornflour, cayenne pe..."


In [69]:
ingredients = pd.DataFrame(df['ingredients'].tolist())
ingredients.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64
0,romaine lettuce,black olives,grape tomatoes,garlic,pepper,purple onion,seasoning,garbanzo beans,feta cheese crumbles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,plain flour,ground pepper,salt,tomatoes,ground black pepper,thyme,eggs,green tomatoes,yellow corn meal,milk,vegetable oil,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,eggs,pepper,salt,mayonaise,cooking oil,green chilies,grilled chicken breasts,garlic powder,yellow onion,soy sauce,butter,chicken livers,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,water,vegetable oil,wheat,salt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,black pepper,shallots,cornflour,cayenne pepper,onions,garlic paste,milk,butter,salt,lemon juice,water,chili powder,passata,oil,ground cumin,boneless chicken skinless thigh,garam masala,double cream,natural yogurt,bay leaf,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
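
The cell above turns a column of ingredient lists into one column per ingredient position. A minimal sketch of that step on made-up recipes (the names `toy_recipes` and `toy_wide` are illustrative and not part of the notebook) shows how shorter recipes end up padded with missing values:

```python
import pandas as pd

# Made-up recipes of different lengths, standing in for df['ingredients']
toy_recipes = pd.Series([
    ['salt', 'pepper'],
    ['eggs'],
    ['water', 'oil', 'wheat', 'salt'],
])

# One column per ingredient position; shorter recipes are padded with missing values
toy_wide = pd.DataFrame(toy_recipes.tolist())

print(toy_wide.shape)               # (rows, length of the longest recipe)
print(toy_wide.isna().sum(axis=1))  # number of padded slots per recipe
```

Because each column holds a position rather than an ingredient, the same ingredient can land in different columns for different recipes.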


In [0]:
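# Each row of df['ingredients'] is a list, so .str.contains would return NaN.
# Instead, check whether any ingredient in the list contains the search term.
def has_ingredient(term):
    return df['ingredients'].apply(lambda items: any(term in item for item in items))

pepper = has_ingredient('pepper')
garlic = has_ingredient('garlic')
eggs = has_ingredient('eggs')
olive_oil = has_ingredient('olive oil')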

In [0]:
# Collapse common variants of the same ingredient into a single name
ingredients = ingredients.replace({
    'ground black pepper': 'pepper',
    'ground pepper': 'pepper',
    'black pepper': 'pepper',
    'garlic cloves': 'garlic',
    'unsalted butter': 'butter',
    'large eggs': 'eggs',
    'extra-virgin olive oil': 'olive oil',
    'fresh lime juice': 'lime',
    'flat leaf parsley': 'fresh parsley',
    'grated parmesan cheese': 'parmesan cheese',
    'fresh ginger': 'ginger',
    'all-purpose flour': 'flour',
    'ground cinnamon': 'cinnamon',
    'ground turmeric': 'turmeric',
})

In [0]:
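# Fill the empty slots of shorter recipes with the placeholder string 'NaN', which the
# encoder later treats as its own category. The .replace call only matters if an
# ingredient is literally the string 'None'.
ingredients = ingredients.fillna('NaN').replace('None', np.nan)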


In [73]:
# Name the columns 'Ingredient #1' through 'Ingredient #65' so every name is unique
ingredients.columns = [f'Ingredient #{i}' for i in range(1, ingredients.shape[1] + 1)]
print(ingredients.shape)
ingredients.head()

(39774, 65)


Unnamed: 0,Ingredient #1,Ingredient #2,Ingredient #3,Ingredient #4,Ingredient #5,Ingredient #6,Ingredient #7,Ingredient #8,Ingredient #9,Ingredient #10,Ingredient #11,Ingredient #12,Ingredient #13,Ingredient #14,Ingredient #15,Ingredient #16,Ingredient #17,Ingredient #18,Ingredient #19,Ingredient #20,Ingredient #21,Ingredient #22,Ingredient #23,Ingredient #24,Ingredient #25,Ingredient #26,Ingredient #27,Ingredient #28,Ingredient #29,Ingredient #30,Ingredient #31,Ingredient #32,Ingredient #33,Ingredient #34,Ingredient #35,Ingredient #36,Ingredient #37,Ingredient #38,Ingredient #39,Ingredient #40,Ingredient #41,Ingredient #42,Ingredient #43,Ingredient #44,Ingredient #45,Ingredient #46,Ingredient #47,Ingredient #48,Ingredient #49,Ingredient #50,Ingredient #51,Ingredient #52,Ingredient #53,Ingredient #54,Ingredient #55,Ingredient #56,Ingredient #57,Ingredient #58,Ingredient #59,Ingredient #60,Ingredient #61,Ingredient #62,Ingredient #63,Ingredient #64,Ingredient #65
0,romaine lettuce,black olives,grape tomatoes,garlic,pepper,purple onion,seasoning,garbanzo beans,feta cheese crumbles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,plain flour,pepper,salt,tomatoes,pepper,thyme,eggs,green tomatoes,yellow corn meal,milk,vegetable oil,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,eggs,pepper,salt,mayonaise,cooking oil,green chilies,grilled chicken breasts,garlic powder,yellow onion,soy sauce,butter,chicken livers,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,water,vegetable oil,wheat,salt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,pepper,shallots,cornflour,cayenne pepper,onions,garlic paste,milk,butter,salt,lemon juice,water,chili powder,passata,oil,ground cumin,boneless chicken skinless thigh,garam masala,double cream,natural yogurt,bay leaf,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [74]:
# Attach the ingredient columns and drop the original list column
df = pd.concat([df, ingredients], axis=1).drop(columns='ingredients')
print(df.shape)
df.head()

(39774, 66)


Unnamed: 0,cuisine,Ingredient #1,Ingredient #2,Ingredient #3,Ingredient #4,Ingredient #5,Ingredient #6,Ingredient #7,Ingredient #8,Ingredient #9,Ingredient #10,Ingredient #11,Ingredient #12,Ingredient #13,Ingredient #14,Ingredient #15,Ingredient #16,Ingredient #17,Ingredient #18,Ingredient #19,Ingredient #20,Ingredient #21,Ingredient #22,Ingredient #23,Ingredient #24,Ingredient #25,Ingredient #26,Ingredient #27,Ingredient #28,Ingredient #29,Ingredient #30,Ingredient #31,Ingredient #32,Ingredient #33,Ingredient #34,Ingredient #35,Ingredient #36,Ingredient #37,Ingredient #38,Ingredient #39,Ingredient #40,Ingredient #41,Ingredient #42,Ingredient #43,Ingredient #44,Ingredient #45,Ingredient #46,Ingredient #47,Ingredient #48,Ingredient #49,Ingredient #50,Ingredient #51,Ingredient #52,Ingredient #53,Ingredient #54,Ingredient #55,Ingredient #56,Ingredient #57,Ingredient #58,Ingredient #59,Ingredient #60,Ingredient #61,Ingredient #62,Ingredient #63,Ingredient #64,Ingredient #65
0,greek,romaine lettuce,black olives,grape tomatoes,garlic,pepper,purple onion,seasoning,garbanzo beans,feta cheese crumbles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,southern_us,plain flour,pepper,salt,tomatoes,pepper,thyme,eggs,green tomatoes,yellow corn meal,milk,vegetable oil,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,filipino,eggs,pepper,salt,mayonaise,cooking oil,green chilies,grilled chicken breasts,garlic powder,yellow onion,soy sauce,butter,chicken livers,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,indian,water,vegetable oil,wheat,salt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,indian,pepper,shallots,cornflour,cayenne pepper,onions,garlic paste,milk,butter,salt,lemon juice,water,chili powder,passata,oil,ground cumin,boneless chicken skinless thigh,garam masala,double cream,natural yogurt,bay leaf,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [0]:
# Random split
from sklearn.model_selection import train_test_split
train, val = train_test_split(df, train_size=0.75, test_size=0.25, random_state=42)

In [0]:
# Arrange data into X features matrix and y target vector
target = 'cuisine'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
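# X_test is not defined here: test.json has not been loaded or expanded into ingredient columns in this notebook.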
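
The assignment asks whether the model beats a baseline. For classification, one common baseline is the accuracy of always guessing the most frequent class. The sketch below uses a made-up target column (`y_example` is illustrative); in the notebook the same two lines could be applied to `y_train`:

```python
import pandas as pd

# Made-up target column standing in for y_train
y_example = pd.Series(['italian', 'italian', 'mexican', 'indian', 'italian', 'mexican'])

# Share of each class, sorted from most to least frequent
class_shares = y_example.value_counts(normalize=True)

# Always guessing the top class is right this fraction of the time
majority_class = class_shares.index[0]
baseline_accuracy = class_shares.iloc[0]

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.2f}')
```

A validation accuracy above this number suggests the model has learned more than the class frequencies.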

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Ordinal-encode the ingredient strings, impute any remaining gaps, then fit a random forest
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

### Permutation

In [0]:
transformers = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

In [0]:
import eli5
from eli5.sklearn import PermutationImportance
permuter = PermutationImportance(model, scoring = 'accuracy', n_iter=5, random_state=42)
permuter.fit(X_val_transformed, y_val)

In [86]:
feature_names = X_val.columns.tolist()
eli5.show_weights(permuter, top=None, feature_names=feature_names)

NameError: ignored
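
To make the idea behind permutation importance concrete, here is a minimal sketch that does not use eli5. It shuffles one column at a time and records how much accuracy drops. The function name, the toy data, and the fixed-rule "model" are all assumptions made for illustration; this is not how `PermutationImportance` is implemented internally.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_iter=5, seed=42):
    """Mean drop in accuracy after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_shuffled = X.copy()
            # Shuffle one column to break its relationship with the target
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - np.mean(predict(X_shuffled) == y))
        importances.append(np.mean(drops))
    return np.array(importances)

# Toy data: the label depends only on column 0; column 1 is pure noise
data_rng = np.random.default_rng(0)
X_toy = data_rng.random((200, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

# Hypothetical stand-in for a fitted model: a fixed rule that reads column 0 only
def toy_predict(X):
    return (X[:, 0] > 0.5).astype(int)

toy_importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(toy_importances)
```

Shuffling the column the rule depends on hurts accuracy, while shuffling the unused column changes nothing, so its importance is zero.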

### XGBoosting

In [0]:
from xgboost import XGBClassifier
pipeline = make_pipeline(ce.OrdinalEncoder(), XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1))
pipeline.fit(X_train, y_train)

In [0]:
from sklearn.metrics import accuracy_score

y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))
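
Some XGBoost releases reject string class labels and expect integers from 0 to n_classes - 1. Whether that applies depends on the installed version, which this notebook does not record, so treat the following as an optional sketch. It uses scikit-learn's `LabelEncoder` on a made-up list of cuisines (`cuisines_example` is illustrative); the same calls could be applied to `y_train` and `y_val`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for y_train
cuisines_example = ['greek', 'southern_us', 'filipino', 'indian', 'indian']

label_encoder = LabelEncoder()

# Classes are sorted alphabetically, then numbered from 0
y_encoded = label_encoder.fit_transform(cuisines_example)

# inverse_transform maps integer predictions back to cuisine names
y_decoded = label_encoder.inverse_transform(y_encoded)

print(list(label_encoder.classes_))
print(list(y_encoded))
```

Fit the encoder on the training labels only, then reuse it with `transform` for validation labels so both share the same mapping.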

In [88]:
encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

model = XGBClassifier(
    n_estimators=1000, # <= 1000 trees, depends on early stopping
    max_depth=7,       # try deeper trees because of high cardinality categoricals
    learning_rate=0.5, # try higher learning rate
    n_jobs=-1
)

eval_set = [(X_train_encoded, y_train), 
            (X_val_encoded, y_val)]

model.fit(X_train_encoded, y_train, 
          eval_set=eval_set, 
          eval_metric='merror', 
          early_stopping_rounds=50)

AttributeError: ignored
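
The `early_stopping_rounds=50` argument above asks training to stop once the validation metric has gone 50 rounds without improving. The sketch below illustrates that rule on a made-up list of validation errors with a smaller patience. The function `find_stopping_round` is hypothetical and only mimics the idea; it is not XGBoost's implementation.

```python
def find_stopping_round(errors, patience):
    """Return (best_round, stopped_round) for a sequence of validation errors.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far without any improvement.
    """
    best_round = 0
    for i, err in enumerate(errors):
        if err < errors[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return best_round, i
    # Never ran out of patience: training used every round
    return best_round, len(errors) - 1

# Made-up validation errors, one per boosting round
val_errors = [0.40, 0.35, 0.31, 0.30, 0.32, 0.31, 0.33, 0.29]

best_round, stopped_round = find_stopping_round(val_errors, patience=3)
print(best_round, stopped_round)
```

In this toy run the lower error in the final round is never reached, which is the trade-off of a small patience: training is cheaper, but a later improvement can be missed.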