Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [X] Plot the distribution of your target. 
    - Classification problem: Are your classes imbalanced? If so, don't rely on accuracy alone.
    - Regression problem: Is your target skewed? If so, let's discuss in Slack.
- [X] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.
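To make the permutation importance task concrete, here is a minimal sketch of the idea: shuffle one feature at a time on held-out data and measure how much the model's error grows. The data is synthetic, and the function name `permutation_importance_sketch`, the least-squares model, and the choice of mean absolute error are all assumptions made for illustration. They are not part of the assignment or of any library.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in mean absolute error when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    base_error = np.mean(np.abs(y - predict(X)))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this feature and the target
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            error = np.mean(np.abs(y - predict(X_shuffled)))
            increases.append(error - base_error)
        importances[col] = np.mean(increases)
    return importances

# Synthetic data (hypothetical): column 0 drives the target, column 1 is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Fit a plain least-squares model on the first 400 rows, hold out the last 100
X_fit, y_fit = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]
design = np.column_stack([X_fit, np.ones(len(X_fit))])
coef, *_ = np.linalg.lstsq(design, y_fit, rcond=None)

def predict(X_new):
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

importances = permutation_importance_sketch(predict, X_val, y_val)
print("Permutation importances:", importances)
```

For a fitted scikit-learn estimator or pipeline, `sklearn.inspection.permutation_importance` (available from scikit-learn 0.22) does the same job and is likely the better choice in practice.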


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [25]:
!pip install category_encoders==2.*

Collecting category_encoders==2.*
  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [59]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt

data = pd.read_csv("https://raw.githubusercontent.com/apathyhill/DS-Unit-2-Applied-Modeling/master/data/portfolio/vgsales-12-4-2019.csv")
data2 = data.copy()
#data2["score"] = data2[["VGChartz_Score", "Critic_Score", "User_Score"]].mean(axis=1)
data2 = pd.concat([data2, pd.get_dummies(data2["ESRB_Rating"])], axis=1)

features_excluded = ["Last_Update", "url", "status", "Vgchartzscore", "img_url", "basename", 
                     "NA_Sales", "PAL_Sales", "JP_Sales", "Other_Sales", "Rank", "Name", 
                     "VGChartz_Score", "Critic_Score", "User_Score", "ESRB_Rating"]

data2 = data2.drop(features_excluded, axis=1)

data2["Year"] = data2["Year"].fillna(data2["Year"].median())
# Forward-fill missing sales figures; .ffill() replaces the deprecated fillna(method="ffill")
data2["Total_Shipped"] = data2["Total_Shipped"].ffill()
data2["Global_Sales"] = data2["Global_Sales"].ffill()
data2 = data2.dropna()

print(data2.isna().sum())
print("Columns:", list(data2.columns))
target = "Global_Sales"

train, test = train_test_split(data2, train_size=0.90)
train, val = train_test_split(train, train_size=0.75)

"""
This is a regression problem, as I am trying to output a continuous number.

Simple accuracy could work as an eval metric right now, but I might change it to mean absolute/squared error to see if I can decipher things better.
"""

"""
Notes on the columns in features_excluded (defined above):

Several columns only carry information about the game's page on the VGChartz website, such as the cover image and the URL to the game. Not important here.
"basename" is a web-safe, simplified name of the game. (Wii Sports becomes wii-sports)
"NA_Sales", "PAL_Sales", "JP_Sales", "Other_Sales" are just "Global_Sales" split into regions; probably not necessary.
"Rank" is not needed, as it is functionally just an index.
Each entry is a new game with its own title, so the "Name" category is *very* high cardinality, too much to use.


This leaves Genre, ESRB Rating (in separate columns), Platform, Publisher, Developer, and Year
"""


#data2[target].plot(kind="hist", bins=50, figsize=(20, 10));

mean = data2[target].mean()
# "Baseline accuracy" here is the share of games that sold less than the mean
print("Baseline Accuracy:", len(data2[data2[target] < mean]) / len(data2[target]))


X_train = train[train.columns.drop(target)]
y_train = train[target]

X_val = val[val.columns.drop(target)]
y_val = val[target]

X_test = test[test.columns.drop(target)]
y_test = test[target]


from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import category_encoders as ce
import numpy as np

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy="median", missing_values=np.nan), 
    Ridge()
)


pipeline.fit(X_train, y_train)
# For a regressor, score() returns R^2 (coefficient of determination), not classification accuracy
print("Validation Accuracy:", pipeline.score(X_val, y_val))
print("Test Accuracy:", pipeline.score(X_test, y_test))


Genre            0
Platform         0
Publisher        0
Developer        0
Total_Shipped    0
Global_Sales     0
Year             0
AO               0
E                0
E10              0
EC               0
KA               0
M                0
RP               0
T                0
dtype: int64
Columns: ['Genre', 'Platform', 'Publisher', 'Developer', 'Total_Shipped', 'Global_Sales', 'Year', 'AO', 'E', 'E10', 'EC', 'KA', 'M', 'RP', 'T']
Baseline Accuracy: 0.8299734557715761
Validation Accuracy: 0.9962755163019107
Test Accuracy: 0.9991348142347253
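
The notes in the cell above mention switching to mean absolute or squared error. A minimal sketch of that, assuming a mean-prediction baseline, might look like the following. The helper name `regression_report` and the small demo arrays are hypothetical and not taken from the dataset. The two "Accuracy" scores printed above are R^2 values, because that is what `score()` returns for a scikit-learn regressor.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of regression predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical sales figures (in millions), standing in for y_train and y_val
y_train_demo = np.array([0.1, 0.2, 0.4, 1.5, 3.0])
y_val_demo = np.array([0.2, 0.3, 2.5])

# Regression baseline: always predict the training mean
baseline_pred = np.full_like(y_val_demo, y_train_demo.mean())
baseline = regression_report(y_val_demo, baseline_pred)
print("Baseline:", baseline)

# A perfect model, shown only as a reference point
perfect = regression_report(y_val_demo, y_val_demo)
print("Perfect:", perfect)
```

A model is only worth keeping if its validation MAE falls below the baseline MAE. `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.r2_score` compute the same quantities directly from `y_val` and `pipeline.predict(X_val)`.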
