Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.
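
To judge whether an initial model beats its baseline, one common choice for classification is the majority-class baseline. The snippet below is a minimal sketch of that check; the toy `y_train` and `y_val` labels are made up for illustration and stand in for your own project's target.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy labels standing in for a project's target column.
y_train = pd.Series([False, False, False, True, False, True])
y_val = pd.Series([False, True, False, False])

# The baseline always predicts the most frequent class seen in training.
majority_class = y_train.mode()[0]
y_pred_baseline = [majority_class] * len(y_val)

baseline_accuracy = accuracy_score(y_val, y_pred_baseline)
print(f'Baseline val accuracy: {baseline_accuracy:.2f}')
```

A fitted model's validation accuracy is only informative relative to this number, so it helps to compute it before tuning anything.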

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)
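
The idea behind the readings above can be sketched in a few lines: shuffle one column of the validation data, re-score the fitted model, and record how much the score drops. This is a simplified illustration on synthetic data, not the implementation used by any particular library; the function name `permutation_importances` and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def permutation_importances(model, X, y, n_iter=5, seed=0):
    """Mean drop in accuracy when each column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_permuted = X.copy()
            # Shuffling breaks the link between this feature and the target.
            X_permuted[:, j] = rng.permutation(X_permuted[:, j])
            drops.append(baseline - accuracy_score(y, model.predict(X_permuted)))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data: only column 0 carries signal, columns 1 and 2 are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

importances = permutation_importances(model, X_val, y_val)
print(importances)
```

Because the model is already fitted and only the validation data is shuffled, nothing is retrained, which keeps the procedure cheap. Importances near zero or slightly negative suggest the feature did not help on this validation set.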

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard
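
As a companion to these readings, here is a minimal sketch of gradient boosting for regression with squared error, where each new tree is fit to the residuals of the current ensemble. It is a teaching illustration under simplifying assumptions (fixed learning rate, no regularization, no early stopping), not how xgboost is implemented; the function names and the synthetic data are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees sequentially, each on the current residuals."""
    initial_prediction = y.mean()
    prediction = np.full(len(y), initial_prediction, dtype=float)
    trees = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Take a small step toward the residuals.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return initial_prediction, trees

def predict_boosted_trees(initial_prediction, trees, X, learning_rate=0.1):
    """Sum the initial guess and every tree's scaled correction."""
    prediction = np.full(X.shape[0], initial_prediction, dtype=float)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction

# Synthetic regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

initial_prediction, trees = fit_boosted_trees(X, y)
y_pred = predict_boosted_trees(initial_prediction, trees, X)

mse_mean_only = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - y_pred) ** 2)
print(f'MSE, mean only: {mse_mean_only:.3f}  MSE, boosted: {mse_boosted:.3f}')
```

The same learning rate must be used when fitting and predicting. A smaller learning rate usually needs more trees, which is the trade-off that early stopping helps manage.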

In [0]:
from google.colab import files

uploaded = files.upload()

In [0]:
import pandas as pd

df = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')

In [3]:
df.head()

| | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 | NaN | NaN | NaN | NaN | NaN | NaN |


In [0]:
df = df.dropna(subset=['Critic_Score'])

In [0]:
df['High_Critic_Score'] = df['Critic_Score'] >= 80

In [0]:
import numpy as np

def correct_user_score(score):
  # 'tbd' means no user score yet, so treat it as missing
  if score == 'tbd':
    return np.nan
  else:
    return float(score)

In [0]:
df['User_Score'] = df['User_Score'].apply(correct_user_score)

In [0]:
top_25_publishers = df['Publisher'].value_counts(ascending=False)[:25].index

def publisher_top_25(publisher):
  if publisher in top_25_publishers:
    return publisher
  else:
    return "Other"

In [0]:
df['Publisher'] = df['Publisher'].apply(publisher_top_25)

In [0]:
df['Developer'] = df['Developer'].fillna("Missing")

In [0]:
# Consolidate studio variants under a single developer name
ea = df['Developer'].str.contains("EA ")
ubisoft = df['Developer'].str.contains("Ubisoft")

df.loc[ea, 'Developer'] = "Electronic Arts"
df.loc[ubisoft, 'Developer'] = 'Ubisoft'

In [0]:
top_25_developers = df['Developer'].value_counts(ascending=False)[:25].index

def developer_top_25(developer):
  if developer in top_25_developers:
    return developer
  else:
    return "Other"


df['Developer'] = df['Developer'].apply(developer_top_25)

In [14]:
df['Year_of_Release'].value_counts()

2008.0    715
2007.0    692
2005.0    655
2009.0    651
2002.0    627
2006.0    620
2003.0    585
2004.0    561
2011.0    500
2010.0    500
2001.0    326
2012.0    321
2013.0    273
2014.0    261
2016.0    232
2015.0    225
2000.0    143
1999.0     39
1998.0     28
1997.0     17
1996.0      8
1994.0      1
1985.0      1
1992.0      1
1988.0      1
Name: Year_of_Release, dtype: int64

In [0]:
train = df[(df['Year_of_Release']!= 2015) & (df['Year_of_Release'] != 2016)]
val = df[df['Year_of_Release']==2015]
test = df[df['Year_of_Release']==2016]

In [16]:
train.shape, val.shape, test.shape

((7680, 17), (225, 17), (232, 17))

In [17]:
train.head(1)

| | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | High_Critic_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E | False |


In [18]:
!pip install category_encoders==2.*

Collecting category_encoders==2.*
  Downloading https://files.pythonhosted.org/packages/44/57/fcef41c248701ee62e8325026b90c432adea35555cbc870aff9cfba23727/category_encoders-2.2.2-py2.py3-none-any.whl (80kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2


In [0]:
target = 'High_Critic_Score'

X_train = train.drop(columns=['Name', 'Critic_Score', target])
y_train = train[target]

X_val = val.drop(columns=['Name', 'Critic_Score', target])
y_val = val[target]


In [20]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier


pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=10)
)


pipeline.fit(X_train, y_train)
print(f'Val score (acc): {pipeline.score(X_val, y_val)}')


Val score (acc): 0.7733333333333333


In [78]:
from xgboost import XGBClassifier

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

model = XGBClassifier(
    n_estimators=1000,  # upper limit; early stopping decides how many rounds are used
    max_depth=3,
    learning_rate=0.5,
    n_jobs=-1
)

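eval_set = [(X_train_imputed, y_train),
            (X_val_imputed, y_val)]

# Early stopping watches the last entry in eval_set (the validation set).
# Note: xgboost 2.0+ expects eval_metric and early_stopping_rounds in the
# XGBClassifier constructor rather than in fit().
model.fit(X_train_imputed,
          y_train,
          eval_set=eval_set,
          eval_metric='mae',
          early_stopping_rounds=50)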

[0]	validation_0-mae:0.379296	validation_1-mae:0.416154
Multiple eval metrics have been passed: 'validation_1-mae' will be used for early stopping.

Will train until validation_1-mae hasn't improved in 50 rounds.
[1]	validation_0-mae:0.31954	validation_1-mae:0.371565
[2]	validation_0-mae:0.280919	validation_1-mae:0.341737
[3]	validation_0-mae:0.257713	validation_1-mae:0.322128
[4]	validation_0-mae:0.243519	validation_1-mae:0.312274
[5]	validation_0-mae:0.231002	validation_1-mae:0.304024
[6]	validation_0-mae:0.225098	validation_1-mae:0.30191
[7]	validation_0-mae:0.218843	validation_1-mae:0.298077
[8]	validation_0-mae:0.215891	validation_1-mae:0.29758
[9]	validation_0-mae:0.211128	validation_1-mae:0.29087
[10]	validation_0-mae:0.208083	validation_1-mae:0.290835
[11]	validation_0-mae:0.205718	validation_1-mae:0.288196
[12]	validation_0-mae:0.202111	validation_1-mae:0.284516
[13]	validation_0-mae:0.199948	validation_1-mae:0.281277
[14]	validation_0-mae:0.197494	validation_1-mae:0.282259
[1

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.5, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [80]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_val_imputed)
accuracy_score(y_val, y_pred)

0.8088888888888889

In [82]:
!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model,
    scoring='accuracy',
    n_iter=10,
    random_state=2
)



In [83]:
permuter.fit(X_val_imputed, y_val)

PermutationImportance(cv='prefit',
                      estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                              colsample_bylevel=1,
                                              colsample_bynode=1,
                                              colsample_bytree=1, gamma=0,
                                              learning_rate=0.5,
                                              max_delta_step=0, max_depth=3,
                                              min_child_weight=1, missing=None,
                                              n_estimators=1000, n_jobs=-1,
                                              nthread=None,
                                              objective='binary:logistic',
                                              random_state=0, reg_alpha=0,
                                              reg_lambda=1, scale_pos_weight=1,
                                              seed=None, silent=None,
                   

In [96]:
feature_names = X_val_encoded.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)

User_Count                        0.108000
User_Score                        0.079556
Global_Sales                      0.064889
EU_Sales                          0.020444
Critic_Count                      0.019111
                                    ...   
Publisher_Take-Two Interactive   -0.004000
Developer_Nintendo               -0.004444
Genre_Shooter                    -0.004889
Platform_XOne                    -0.006222
Genre_Platform                   -0.007556
Length: 98, dtype: float64

In [87]:
eli5.show_weights(permuter, 
                  top=None, # how many best features to display. None == all
                  feature_names=feature_names,
                  )

| Weight | Feature |
|---|---|
| 0.1080 ± 0.0375 | User_Count |
| 0.0796 ± 0.0554 | User_Score |
| 0.0649 ± 0.0386 | Global_Sales |
| 0.0204 ± 0.0267 | EU_Sales |
| 0.0191 ± 0.0195 | Critic_Count |
| 0.0151 ± 0.0081 | Genre_Action |
| 0.0124 ± 0.0111 | Genre_Sports |
| 0.0124 ± 0.0118 | Publisher_Nintendo |
| 0.0116 ± 0.0121 | Rating_M |
| 0.0102 ± 0.0089 | Rating_E10+ |
