<a href="https://colab.research.google.com/github/danieljaouen/DS-Unit-2-Applied-Modeling/blob/master/module2/assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Plot the distribution of your target. 
    - Classification problem: Are your classes imbalanced? Then, don't use just accuracy.
    - Regression problem: Is your target skewed? If so, let's discuss in Slack.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [4]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 77, done.[K
remote: Total 77 (delta 0), reused 0 (delta 0), pack-reused 77[K
Unpacking objects: 100% (77/77), done.
From https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Checking out files: 100% (26/26), done.
Collecting category_encoders==2.0.0 (from -r requirements.txt (line 1))
[?25l  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
[K     |████████████████████████████████| 92kB 5.7MB/s 
[?25hCollecting eli5==0.10.1 (from -r requirements.txt (line 2))
[?25l  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
[K     |████████████████████████████████| 112kB 16.5MB/s 
Co

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as st

In [0]:
train = pd.read_csv('https://raw.githubusercontent.com/danieljaouen/DS-Unit-1-Sprint-1-Dealing-With-Data/master/module1-afirstlookatdata/Video_Games_Sales_as_at_22_Dec_2016.csv')

train, X_2, y_train, y_2 = train_test_split(train, train['Global_Sales'], train_size=0.60, test_size=0.40,
                                            random_state=42)

val, test, y_val, y_test = train_test_split(X_2, y_2, train_size=0.50, test_size=0.50,
                                            random_state=42)

In [5]:
test.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
3865,Way of the Samurai 3,PS3,2008.0,Action,Gamebridge,0.18,0.08,0.22,0.04,0.52,58.0,18.0,7.9,22.0,Acquire,M
10420,Pro Yaky? Spirits 2010,PS2,2010.0,Sports,Konami Digital Entertainment,0.0,0.0,0.1,0.0,0.1,,,,,,
5733,Einhänder,PS,1997.0,Shooter,SquareSoft,0.1,0.07,0.13,0.02,0.31,,,,,,
8483,Leisure Suit Larry: Box Office Bust,X360,2009.0,Adventure,Codemasters,0.14,0.01,0.0,0.01,0.16,25.0,27.0,2.5,41.0,Team 17,M
1354,Super Metroid,SNES,1994.0,Action,Nintendo,0.57,0.12,0.71,0.02,1.42,,,,,,


In [6]:
y_val.head()

3684     0.54
2882     0.71
12127    0.07
1534     1.29
16121    0.01
Name: Global_Sales, dtype: float64

In [0]:
target = 'Global_Sales'
X_train = train.drop(columns=[target, 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Count'])
y_train = train[target]
X_val = val.drop(columns=[target, 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Count'])
y_val = val[target]
X_test = test.drop(columns=[target, 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Count'])
y_test = test[target]

In [21]:
baseline = pd.DataFrame(np.array([y_test.mean()] * len(y_test)))

mean_absolute_error(y_test, baseline)

0.6605237975321034

In [35]:
transformers = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.fit_transform(X_val)
X_test_transformed = transformers.fit_transform(X_test)

eval_set = [(X_train_transformed, y_train), (X_val_transformed, y_val)]

one_to_left = st.beta(10, 1)  
from_zero_positive = st.expon(0, 50)

params = {  
    "n_estimators": st.randint(3, 40),
    "max_depth": st.randint(3, 40),
    "learning_rate": st.uniform(0.05, 0.4),
    "colsample_bytree": one_to_left,
    "subsample": one_to_left,
    "gamma": st.uniform(0, 10),
    'reg_alpha': from_zero_positive,
    "min_child_weight": from_zero_positive,
}

fit_params = {
    'eval_metric': 'mae', 
    'early_stopping_rounds': 50,
    'eval_set': eval_set
}

model = XGBRegressor(nthreads=-1)  
gs = RandomizedSearchCV(model, params, n_jobs=-1, n_iter=50)  
gs.fit(X_train_transformed, y_train, **fit_params)  

y_test_pred = gs.predict(X_test_transformed)

mae = mean_absolute_error(y_test, y_test_pred)
print('MAE: ', mae)

import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    gs,
    scoring='neg_mean_absolute_error',
    n_iter=2,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)
feature_names = X_val.columns.tolist()

eli5.show_weights(
    permuter,
    top=None,
    feature_names = feature_names
)



[0]	validation_0-mae:0.54344	validation_1-mae:0.51089
Multiple eval metrics have been passed: 'validation_1-mae' will be used for early stopping.

Will train until validation_1-mae hasn't improved in 50 rounds.
[1]	validation_0-mae:0.520641	validation_1-mae:0.492036
[2]	validation_0-mae:0.502593	validation_1-mae:0.47776
[3]	validation_0-mae:0.489216	validation_1-mae:0.46784
[4]	validation_0-mae:0.477645	validation_1-mae:0.459915
[5]	validation_0-mae:0.468466	validation_1-mae:0.453445
[6]	validation_0-mae:0.4612	validation_1-mae:0.448282


  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \


[7]	validation_0-mae:0.454724	validation_1-mae:0.444085
[8]	validation_0-mae:0.449316	validation_1-mae:0.440549
[9]	validation_0-mae:0.444789	validation_1-mae:0.438896
[10]	validation_0-mae:0.440041	validation_1-mae:0.439398
[11]	validation_0-mae:0.436963	validation_1-mae:0.438252
[12]	validation_0-mae:0.434053	validation_1-mae:0.435509
[13]	validation_0-mae:0.430648	validation_1-mae:0.433074
[14]	validation_0-mae:0.426973	validation_1-mae:0.433378
[15]	validation_0-mae:0.42479	validation_1-mae:0.431523
[16]	validation_0-mae:0.421983	validation_1-mae:0.431186
[17]	validation_0-mae:0.419933	validation_1-mae:0.431337
[18]	validation_0-mae:0.418641	validation_1-mae:0.430759
[19]	validation_0-mae:0.417144	validation_1-mae:0.42932
[20]	validation_0-mae:0.415824	validation_1-mae:0.429701
[21]	validation_0-mae:0.414274	validation_1-mae:0.428981
[22]	validation_0-mae:0.412979	validation_1-mae:0.428465
[23]	validation_0-mae:0.411981	validation_1-mae:0.428716
[24]	validation_0-mae:0.410782	valid

Weight,Feature
0.0649  ± 0.0057,Critic_Score
0.0487  ± 0.0019,Critic_Count
0.0434  ± 0.0006,Publisher
0.0332  ± 0.0039,Platform
0.0246  ± 0.0008,Year_of_Release
0.0170  ± 0.0002,User_Score
0.0035  ± 0.0014,Genre
0.0029  ± 0.0021,Developer
0.0024  ± 0.0001,Name
0.0014  ± 0.0033,Rating


In [0]:
# An MAE of 0.4875 beats the baseline MAE of 0.6605