<a href="https://colab.research.google.com/github/danieljaouen/DS-Unit-2-Applied-Modeling/blob/master/module2/assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Plot the distribution of your target. 
    - Classification problem: Are your classes imbalanced? Then, don't use just accuracy.
    - Regression problem: Is your target skewed? If so, let's discuss in Slack.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [4]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 77, done.[K
remote: Total 77 (delta 0), reused 0 (delta 0), pack-reused 77[K
Unpacking objects: 100% (77/77), done.
From https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Checking out files: 100% (26/26), done.
Collecting category_encoders==2.0.0 (from -r requirements.txt (line 1))
[?25l  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
[K     |████████████████████████████████| 92kB 5.7MB/s 
[?25hCollecting eli5==0.10.1 (from -r requirements.txt (line 2))
[?25l  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
[K     |████████████████████████████████| 112kB 16.5MB/s 
Co

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

In [0]:
train = pd.read_csv('https://raw.githubusercontent.com/danieljaouen/DS-Unit-1-Sprint-1-Dealing-With-Data/master/module1-afirstlookatdata/Video_Games_Sales_as_at_22_Dec_2016.csv')

train, X_2, y_train, y_2 = train_test_split(train, train['Global_Sales'], train_size=0.60, test_size=0.40,
                                            random_state=42)

val, test, y_val, y_test = train_test_split(X_2, y_2, train_size=0.50, test_size=0.50,
                                            random_state=42)

In [5]:
test.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
3865,Way of the Samurai 3,PS3,2008.0,Action,Gamebridge,0.18,0.08,0.22,0.04,0.52,58.0,18.0,7.9,22.0,Acquire,M
10420,Pro Yaky? Spirits 2010,PS2,2010.0,Sports,Konami Digital Entertainment,0.0,0.0,0.1,0.0,0.1,,,,,,
5733,Einhänder,PS,1997.0,Shooter,SquareSoft,0.1,0.07,0.13,0.02,0.31,,,,,,
8483,Leisure Suit Larry: Box Office Bust,X360,2009.0,Adventure,Codemasters,0.14,0.01,0.0,0.01,0.16,25.0,27.0,2.5,41.0,Team 17,M
1354,Super Metroid,SNES,1994.0,Action,Nintendo,0.57,0.12,0.71,0.02,1.42,,,,,,


In [6]:
y_val.head()

3684     0.54
2882     0.71
12127    0.07
1534     1.29
16121    0.01
Name: Global_Sales, dtype: float64

In [0]:
target = 'Global_Sales'
X_train = train.drop(columns=[target, 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Count'])
y_train = train[target]
X_val = val.drop(columns=[target, 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Count'])
y_val = val[target]
X_test = test.drop(columns=[target, 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'User_Count'])
y_test = test[target]

In [21]:
baseline = pd.DataFrame(np.array([y_test.mean()] * len(y_test)))

mean_absolute_error(y_test, baseline)

0.6605237975321034

In [18]:
transformers = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.fit_transform(X_val)
X_test_transformed = transformers.fit_transform(X_test)

eval_set = [(X_train_transformed, y_train), (X_val_transformed, y_val)]

model = XGBRegressor(
    n_estimators=1000,
    max_depth=7,
    learning_rate=0.1,
    n_jobs = -1
)

model.fit(X_train_transformed, y_train, eval_set=eval_set, eval_metric='mae', early_stopping_rounds=50)
y_test_pred = model.predict(X_test_transformed)

mae = mean_absolute_error(y_test, y_test_pred)
print('MAE: ', mae)

import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model,
    scoring='neg_mean_absolute_error',
    n_iter=2,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)
feature_names = X_val.columns.tolist()

eli5.show_weights(
    permuter,
    top=None,
    feature_names = feature_names
)

[0]	validation_0-mae:0.546838	validation_1-mae:0.513565
Multiple eval metrics have been passed: 'validation_1-mae' will be used for early stopping.

Will train until validation_1-mae hasn't improved in 50 rounds.
[1]	validation_0-mae:0.528774	validation_1-mae:0.500992
[2]	validation_0-mae:0.512696	validation_1-mae:0.490548
[3]	validation_0-mae:0.498712	validation_1-mae:0.48101
[4]	validation_0-mae:0.486867	validation_1-mae:0.473585
[5]	validation_0-mae:0.475044	validation_1-mae:0.465433
[6]	validation_0-mae:0.464395	validation_1-mae:0.458138
[7]	validation_0-mae:0.454331	validation_1-mae:0.453452
[8]	validation_0-mae:0.445441	validation_1-mae:0.449267
[9]	validation_0-mae:0.437643	validation_1-mae:0.446244


  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \


[10]	validation_0-mae:0.430533	validation_1-mae:0.443707
[11]	validation_0-mae:0.423778	validation_1-mae:0.440771
[12]	validation_0-mae:0.415826	validation_1-mae:0.437868
[13]	validation_0-mae:0.410376	validation_1-mae:0.437563
[14]	validation_0-mae:0.405393	validation_1-mae:0.436369
[15]	validation_0-mae:0.39966	validation_1-mae:0.435816
[16]	validation_0-mae:0.395044	validation_1-mae:0.43502
[17]	validation_0-mae:0.39118	validation_1-mae:0.435382
[18]	validation_0-mae:0.38801	validation_1-mae:0.435514
[19]	validation_0-mae:0.383699	validation_1-mae:0.435228
[20]	validation_0-mae:0.380342	validation_1-mae:0.433943
[21]	validation_0-mae:0.377324	validation_1-mae:0.434089
[22]	validation_0-mae:0.373981	validation_1-mae:0.432858
[23]	validation_0-mae:0.369388	validation_1-mae:0.432673
[24]	validation_0-mae:0.365735	validation_1-mae:0.433396
[25]	validation_0-mae:0.362587	validation_1-mae:0.433732
[26]	validation_0-mae:0.360049	validation_1-mae:0.433835
[27]	validation_0-mae:0.358049	vali

Weight,Feature
0.0855  ± 0.0075,Critic_Score
0.0808  ± 0.0112,Year_of_Release
0.0806  ± 0.0014,Critic_Count
0.0759  ± 0.0004,Platform
0.0491  ± 0.0019,Publisher
0.0210  ± 0.0029,Name
0.0140  ± 0.0089,User_Score
0.0067  ± 0.0043,Rating
0.0049  ± 0.0008,Developer
0.0011  ± 0.0033,Genre


In [0]:
# An MAE of 0.4920 beats the baseline MAE of 0.6605