Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead, including any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [1]:
import pandas as pd
# Load the main Steam games table; the dataset folder has other CSVs
# I may want to use later as well.
# main DF
df = pd.read_csv('C:/users/Stewa/Documents/Downloads/steam-store-games/steam.csv')

In [2]:
# Clean up nulls: keep a separate df_clean to work with, and drop NaNs to start.
# .copy() makes df_clean an independent DataFrame, so modifying its columns later is safe.
df_clean = df.dropna().copy()


In [3]:
# I'm going to treat this as a regression problem and try to predict
# more exact prices this time, as opposed to framing it as a
# classification problem like I did yesterday.
df_clean.dtypes

In [4]:
# make my target and features
target = 'price'
features = ['english', 'developer', 'publisher', 'platforms', 'required_age',
            'categories', 'genres', 'steamspy_tags', 'achievements', 'positive_ratings',
            'negative_ratings', 'average_playtime', 'median_playtime', 'owners']

# Break up my dataset by release date.
# First, cast release_date to datetime. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning.
df_clean = df_clean.assign(release_date=pd.to_datetime(df_clean['release_date']))

# pd.datetime is no longer available in recent pandas versions; use pd.Timestamp.
train = df_clean.loc[(df_clean['release_date'] <= pd.Timestamp(2014, 12, 31))
                     & (df_clean['release_date'] >= pd.Timestamp(1997, 1, 1))]

val = df_clean.loc[(df_clean['release_date'] >= pd.Timestamp(2015, 1, 1))
                   & (df_clean['release_date'] <= pd.Timestamp(2016, 12, 31))]

test = df_clean.loc[(df_clean['release_date'] >= pd.Timestamp(2017, 1, 1))
                    & (df_clean['release_date'] <= pd.Timestamp(2019, 12, 31))]



In [5]:
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]
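
The date-based split keeps every validation and test game later than every training game, so the model is never scored on the past using information from the future. Here is a minimal sketch of the same idea on made-up rows. The helper name `split_by_date` and the toy data are illustrative assumptions, not part of the notebook, and unlike the notebook's split the helper applies no lower or upper date bound.

```python
import pandas as pd

def split_by_date(frame, date_col, train_end, val_end):
    """Hypothetical helper: chronological train/val/test split.

    Rows up to train_end go to train, rows after train_end and up to
    val_end go to val, and everything later goes to test.
    """
    dates = frame[date_col]
    train = frame[dates <= train_end]
    val = frame[(dates > train_end) & (dates <= val_end)]
    test = frame[dates > val_end]
    return train, val, test

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'release_date': pd.to_datetime(
        ['2013-05-01', '2014-12-31', '2015-01-01', '2016-06-15', '2017-03-10']),
    'price': [9.99, 19.99, 4.99, 14.99, 29.99],
})

toy_train, toy_val, toy_test = split_by_date(
    toy, 'release_date',
    train_end=pd.Timestamp(2014, 12, 31),
    val_end=pd.Timestamp(2016, 12, 31),
)

# Every row lands in exactly one split, and the splits are in time order
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
assert toy_train['release_date'].max() < toy_val['release_date'].min()
assert toy_val['release_date'].max() < toy_test['release_date'].min()
print(len(toy_train), len(toy_val), len(toy_test))
```

Checking the split sizes this way is a cheap guard against an empty validation or test set after `dropna()` and date filtering.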

In [6]:
import category_encoders as ce
from scipy.stats import randint, uniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

# I'm going to use gradient boosting (XGBoost) with ordinal encoding
gradient_b = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    XGBRegressor(n_estimators=200, objective='reg:squarederror', n_jobs=-1)
)

gradient_b.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['developer', 'publisher', 'platforms',
                                      'categories', 'genres', 'steamspy_tags',
                                      'owners'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'developer',
                                          'data_type': dtype('O'),
                                          'mapping': Valve                                                                          1
Gearbox Software                                                               2
Valve;Hidden Path Entertainment                                                3
Mark Healey                                                                    4
Tripwire...
                              colsample_bylevel=1, colsample_bynode=1,
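
The encoder mapping printed above (Valve → 1, Gearbox Software → 2, and so on) suggests that each distinct string is given an integer in order of first appearance. Here is a rough sketch of that idea in plain pandas. The helper names are hypothetical, and using `-1` for categories not seen during fitting is my assumption about what `handle_unknown='value'` does, not something the output confirms.

```python
import pandas as pd

def fit_ordinal_mapping(values):
    """Map each distinct value to 1, 2, 3, ... in order of first appearance."""
    return {value: code for code, value in enumerate(pd.unique(values), start=1)}

def apply_ordinal_mapping(values, mapping, unknown=-1):
    """Replace each value by its code; values never seen in fit get `unknown`."""
    return [mapping.get(value, unknown) for value in values]

# Toy developer names, for illustration only
train_devs = pd.Series(['Valve', 'Gearbox Software', 'Valve', 'Mark Healey'])
val_devs = pd.Series(['Mark Healey', 'Some New Studio', 'Valve'])

# Learn the mapping on training data only, then reuse it on validation data
mapping = fit_ordinal_mapping(train_devs)
encoded_val = apply_ordinal_mapping(val_devs, mapping)

print(mapping)
print(encoded_val)
```

The integer codes carry no real order, which tree-based models usually tolerate, but a linear model would read them as magnitudes.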
               

In [7]:
from sklearn.metrics import r2_score

gradient_b.fit(X_train, y_train)
y_pred = gradient_b.predict(X_val)
print('Gradient Boosting R^2', r2_score(y_val, y_pred))
# Not a good R^2 score at all! Maybe I'll try a different target and features to better understand what the model is doing.



Gradient Boosting R^2 0.1271990963370957
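
To answer "does it beat your baseline?", it helps to score a mean baseline on the same validation set and in the same units as the target. This is a minimal sketch with numpy on made-up prices. The toy arrays and helper functions are illustrative only, and the notebook itself does not compute a baseline.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up prices, for illustration only
y_train_toy = np.array([4.99, 9.99, 9.99, 19.99, 29.99])
y_val_toy = np.array([0.99, 4.99, 9.99, 14.99])

# Baseline: always predict the mean training price
baseline_pred = np.full(len(y_val_toy), y_train_toy.mean())

baseline_mae = mean_absolute_error(y_val_toy, baseline_pred)
baseline_r2 = r2(y_val_toy, baseline_pred)
print(f'Baseline MAE: {baseline_mae:.2f}')
print(f'Baseline R^2: {baseline_r2:.3f}')
```

With the real data the baseline would be `np.full(len(y_val), y_train.mean())`. A constant prediction can never score above 0 R^2 on the set it is evaluated on, so a positive validation R^2 such as 0.127 does beat the mean baseline, though not by much.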


In [11]:
transformers = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median')
)
X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)



model = XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)





XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=-1, nthread=None, objective='reg:linear', random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
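
The fitted model reports `learning_rate=0.1` and `max_depth=3`, and `RandomizedSearchCV`, `randint`, and `uniform` are imported earlier but never used. Here is a hedged sketch of how a randomized search over those settings might look. To stay self-contained it swaps in scikit-learn's `GradientBoostingRegressor` and synthetic data instead of `XGBRegressor` and the Steam data, and the parameter ranges are arbitrary starting points.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, made up for illustration only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# Distributions to sample from; the ranges are arbitrary starting points
param_distributions = {
    'n_estimators': randint(50, 200),      # integers in [50, 200)
    'max_depth': randint(2, 6),            # integers in [2, 6)
    'learning_rate': uniform(0.01, 0.29),  # floats in [0.01, 0.30]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),  # stand-in for XGBRegressor
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter settings
    cv=3,              # 3-fold cross-validation
    scoring='r2',
    random_state=42,
)
search.fit(X_toy, y_toy)

print('Best params:', search.best_params_)
print('Best cross-validated R^2:', search.best_score_)
```

Plain k-fold cross-validation shuffles dates together, so for a date-split project like this one the cross-validated score is only a rough guide; the held-out validation years remain the fairer check.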

In [16]:
import eli5
from eli5.sklearn import PermutationImportance

# 1. Calculate permutation importances
permuter = PermutationImportance(
    model, 
    scoring='r2', 
    n_iter=5, 
    random_state=42
)

permuter.fit(X_val_transformed, y_val)

PermutationImportance(cv='prefit',
                      estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                             colsample_bylevel=1,
                                             colsample_bynode=1,
                                             colsample_bytree=1, gamma=0,
                                             importance_type='gain',
                                             learning_rate=0.1,
                                             max_delta_step=0, max_depth=3,
                                             min_child_weight=1, missing=None,
                                             n_estimators=100, n_jobs=-1,
                                             nthread=None,
                                             objective='reg:linear',
                                             random_state=42, reg_alpha=0,
                                             reg_lambda=1, scale_pos_weight=1,
                                     

In [17]:
feature_names = X_val.columns.tolist()

eli5.show_weights(
    permuter, 
    top=None, # show permutation importances for all features
    feature_names=feature_names # must be a list
)
# Hmm, the font renders white and is impossible to read; I'll have to look into that.

| Weight | Feature |
|---|---|
| 0.1382 ± 0.0296 | median_playtime |
| 0.0760 ± 0.0050 | negative_ratings |
| 0.0730 ± 0.0213 | genres |
| 0.0544 ± 0.0223 | publisher |
| 0.0280 ± 0.0163 | positive_ratings |
| 0.0236 ± 0.0083 | developer |
| 0.0224 ± 0.0147 | achievements |
| 0.0158 ± 0.0047 | steamspy_tags |
| 0.0062 ± 0.0055 | platforms |
| 0.0050 ± 0.0009 | english |
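
Each weight in this table is the average drop in validation R^2 when one feature's column is shuffled, and the value after ± summarizes the spread across the `n_iter` shuffles. To make that concrete, here is a from-scratch sketch in numpy. It is not eli5's implementation; the function names, the toy data, and the least-squares stand-in for a fitted model are all assumptions for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def permutation_importances(predict, X, y, n_iter=5, seed=42):
    """Mean and std of the score drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    base_score = r2(y, predict(X))
    means, stds = [], []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_shuffled = X.copy()
            # Break the link between this column and the target
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(base_score - r2(y, predict(X_shuffled)))
        means.append(np.mean(drops))
        stds.append(np.std(drops))
    return np.array(means), np.array(stds)

# Made-up data: the target depends on column 0 only; column 1 is noise
data_rng = np.random.default_rng(0)
X_toy = data_rng.normal(size=(500, 2))
y_toy = 2 * X_toy[:, 0] + data_rng.normal(scale=0.1, size=500)

# Stand-in for an already fitted model: least-squares coefficients
coef, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

def predict(X):
    return X @ coef

means, stds = permutation_importances(predict, X_toy, y_toy)
print('Mean score drop per column:', means)
print('Std of score drop per column:', stds)
```

The model is never refit here: only the validation inputs are shuffled, which is why the importances are computed on `X_val_transformed` rather than on the training data.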


In [18]:
pd.Series(permuter.feature_importances_, feature_names).sort_values()

owners             -0.010203
required_age       -0.001215
categories         -0.000071
average_playtime    0.001192
english             0.005044
platforms           0.006185
steamspy_tags       0.015816
achievements        0.022378
developer           0.023564
positive_ratings    0.027986
publisher           0.054399
genres              0.072996
negative_ratings    0.076016
median_playtime     0.138231
dtype: float64
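
Negative or near-zero values (`owners`, `required_age`, `categories`) mean that shuffling those columns did not hurt the validation score, and for the negative ones slightly improved it, so they look like noise to this model. One possible next step is to drop them, refit, and compare validation scores. This sketch uses the values from the output above; the threshold of 0 is an assumption to tune, not a rule.

```python
import pandas as pd

# Values copied from the permutation importance output above
importances = pd.Series({
    'owners': -0.010203, 'required_age': -0.001215, 'categories': -0.000071,
    'average_playtime': 0.001192, 'english': 0.005044, 'platforms': 0.006185,
    'steamspy_tags': 0.015816, 'achievements': 0.022378, 'developer': 0.023564,
    'positive_ratings': 0.027986, 'publisher': 0.054399, 'genres': 0.072996,
    'negative_ratings': 0.076016, 'median_playtime': 0.138231,
})

minimum_importance = 0  # assumed threshold; tune it for your project

# Keep features whose shuffling actually hurt the validation score
selected_features = importances[importances > minimum_importance].index.tolist()
dropped_features = importances[importances <= minimum_importance].index.tolist()

print('Keep:', selected_features)
print('Drop:', dropped_features)
```

In the notebook this would become `X_train[selected_features]` and `X_val[selected_features]`, followed by refitting the pipeline and checking whether validation R^2 holds up.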