
Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/Daniel-Benson-Poe/practice_datasets/master/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*
    !pip install eli5

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Read in data
import pandas as pd
suicide_df = pd.read_csv(DATA_PATH+'suicide_rates.csv')

In [0]:
# General Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import ttest_ind, ttest_1samp, t, randint, uniform
import pandas_profiling
from pandas_profiling import ProfileReport
import category_encoders as ce 
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, plot_confusion_matrix, confusion_matrix, classification_report

In [5]:
suicide_df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [6]:
(suicide_df['suicides_no'] / suicide_df['population'] * 100) 

0        0.006711
1        0.005195
2        0.004833
3        0.004587
4        0.003281
           ...   
27815    0.002955
27816    0.002583
27817    0.002172
27818    0.001672
27819    0.001459
Length: 27820, dtype: float64

In [7]:
suicide_df.shape

(27820, 12)

In [8]:
# Create train set out of data from all years prior to 2015
train = suicide_df[suicide_df['year'] < 2015]

# Create validation set out of data from the year 2015
val = suicide_df[suicide_df['year'] == 2015]

# Create test set out of data from the year 2016
test = suicide_df[suicide_df['year'] == 2016]

train.shape, val.shape, test.shape

((26916, 12), (744, 12), (160, 12))

In [0]:
def char_eraser(row, column, chars):
  """Return the string value in `column` of a single row with `chars` removed."""
  return row[column].replace(chars, '')

In [0]:
# Create function to wrangle data
def wrangle(X):
  """Wrangles the train, validation, and test sets the same way."""

  # Prevent SettingWithCopyWarning
  X = X.copy()

  # Rename columns: drop spaces and symbols, and use clearer names
  X = X.rename(columns={
      'suicides_no': 'num_suicides',
      'suicides/100k pop': 'suicides/100k_pop',
      'country-year': 'country_year',
      'HDI for year': 'HDI_for_year',
      ' gdp_for_year ($) ': 'annual_gdp',
      'gdp_per_capita ($)': 'gdp_per_capita',
  })
  
  # Remove commas from the values in annual_gdp column and convert the values to integers
  X['annual_gdp'] = X.apply(char_eraser, axis=1, args=('annual_gdp', ',')).astype(int)

  # Remove the ' years' string in the age column
  X['age'] = X.apply(char_eraser, axis=1, args=('age', ' years'))


  # Drop the country_year and HDI_for_year columns
  garbage_columns = ['country_year', 'HDI_for_year']
  X = X.drop(columns=garbage_columns)

  # Feature engineer suicide percent of each population
  X['suicide_rate'] = (X['num_suicides'] / X['population'] * 100)

  return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
  

  

In [11]:
train.head()

Unnamed: 0,country,year,sex,age,num_suicides,population,suicides/100k_pop,annual_gdp,gdp_per_capita,generation,suicide_rate
0,Albania,1987,male,15-24,21,312900,6.71,2156624900,796,Generation X,0.006711
1,Albania,1987,male,35-54,16,308000,5.19,2156624900,796,Silent,0.005195
2,Albania,1987,female,15-24,14,289700,4.83,2156624900,796,Generation X,0.004833
3,Albania,1987,male,75+,1,21800,4.59,2156624900,796,G.I. Generation,0.004587
4,Albania,1987,male,25-34,9,274300,3.28,2156624900,796,Boomers,0.003281


In [12]:
train.shape, val.shape, test.shape

((26916, 11), (744, 11), (160, 11))

In [0]:
# Set target
target = 'num_suicides'

# Arrange data into X features matrix and y target vector
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test.drop(columns=target)
y_test = test[target]

In [16]:
# Fit a shallow DecisionTreeRegressor on the training data to serve as a baseline

from sklearn.tree import DecisionTreeRegressor

pipeline = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer(),
    StandardScaler(),
    DecisionTreeRegressor(random_state=42, max_depth=2)
)

pipeline.fit(X_train, y_train)
score = pipeline.score(X_val, y_val)
# For regressors, .score() returns R^2 rather than accuracy
print(f"Baseline Validation R^2: {score}")

Baseline Validation R^2: 0.7095119258179192
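The shallow decision tree above is used as the baseline in this notebook. A simpler reference point is a model that always predicts the training mean. The sketch below shows that idea on synthetic data; the `*_demo_*` arrays and `mean_baseline` are made up for illustration and are not part of the suicide dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X_demo_train = rng.normal(size=(200, 3))
y_demo_train = 5 * X_demo_train[:, 0] + rng.normal(size=200)
X_demo_val = rng.normal(size=(50, 3))
y_demo_val = 5 * X_demo_val[:, 0] + rng.normal(size=50)

# A mean baseline ignores the features and always predicts the training mean
mean_baseline = DummyRegressor(strategy='mean')
mean_baseline.fit(X_demo_train, y_demo_train)
baseline_pred = mean_baseline.predict(X_demo_val)

baseline_r2 = r2_score(y_demo_val, baseline_pred)
baseline_mae = mean_absolute_error(y_demo_val, baseline_pred)
print(f"Mean baseline R^2: {baseline_r2:.4f}, MAE: {baseline_mae:.4f}")
```

A constant prediction cannot score above 0 in R^2 on the data it is evaluated on, so any positive validation R^2 already beats this kind of baseline.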


In [17]:
# Now fit a random forest regressor to see whether the validation R^2 improves

from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer(),
    StandardScaler(),
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
score = pipeline.score(X_val, y_val)
print(f"Random Forest Validation R^2: {score}")

Random Forest Validation R^2: 0.9949916922230333
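A validation score this close to 1 is worth a sanity check. The feature matrix still contains `suicide_rate` (computed from `num_suicides` in `wrangle`) and `suicides/100k_pop`, alongside `population`, so the target may be recoverable by simple arithmetic. The sketch below checks this on the five rows shown in the `train.head()` output above; the `demo` frame is a hand-copied illustration, not a new result.

```python
import pandas as pd

# First five training rows, copied by hand from the train.head() output
demo = pd.DataFrame({
    'num_suicides': [21, 16, 14, 1, 9],
    'population': [312900, 308000, 289700, 21800, 274300],
    'suicides/100k_pop': [6.71, 5.19, 4.83, 4.59, 3.28],
})

# Same formula that wrangle() uses for the engineered feature
demo['suicide_rate'] = demo['num_suicides'] / demo['population'] * 100

# Invert each rate feature to rebuild the target from "features" alone
from_rate = (demo['suicide_rate'] * demo['population'] / 100).round().astype(int)
from_100k = (demo['suicides/100k_pop'] * demo['population'] / 100_000).round().astype(int)

print(from_rate.tolist())   # [21, 16, 14, 1, 9]
print(from_100k.tolist())   # [21, 16, 14, 1, 9]
```

If the same holds across the full data, these columns leak the target, which would be consistent with the high score. Whether to drop them depends on the question the model is meant to answer.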


# Use permutation to find importances

In [18]:
# Split preprocessing from the model: the permuter below takes a prefit
# estimator and a numeric array, so the data must be transformed first
transformers = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer(),
    StandardScaler()
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

 

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [19]:
# instantiate and fit permuter
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model, 
    scoring=None,
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)


PermutationImportance(cv='prefit',
                      estimator=RandomForestRegressor(bootstrap=True,
                                                      ccp_alpha=0.0,
                                                      criterion='mse',
                                                      max_depth=None,
                                                      max_features='auto',
                                                      max_leaf_nodes=None,
                                                      max_samples=None,
                                                      min_impurity_decrease=0.0,
                                                      min_impurity_split=None,
                                                      min_samples_leaf=1,
                                                      min_samples_split=2,
                                                      min_weight_fraction_leaf=0.0,
                                                      n_estimators=100

In [20]:
feature_names = X_val.columns.to_list()
pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=True)

sex                 -0.000011
year                 0.000000
age                  0.000046
generation           0.000051
gdp_per_capita       0.002155
annual_gdp           0.002552
country              0.003241
suicide_rate         0.223968
suicides/100k_pop    0.248361
population           1.887291
dtype: float64

In [21]:
eli5.show_weights(permuter,
                  top=None,  # Number of top features to show; None shows all
                  feature_names=feature_names)

Weight,Feature
1.8873  ± 0.8097,population
0.2484  ± 0.0349,suicides/100k_pop
0.2240  ± 0.1042,suicide_rate
0.0032  ± 0.0028,country
0.0026  ± 0.0004,annual_gdp
0.0022  ± 0.0006,gdp_per_capita
0.0001  ± 0.0001,generation
0.0000  ± 0.0002,age
0  ± 0.0000,year
-0.0000  ± 0.0001,sex
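To make the weights above easier to read, here is a minimal hand-rolled sketch of the permutation idea: shuffle one column at a time and record how much the model's score drops. It runs on synthetic data, and `manual_permutation_importance` and the `*_demo` names are hypothetical helpers written for this illustration. It is not eli5's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: column 0 matters most, column 1 a little, column 2 is noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)
X_fit, X_held = X_demo[:400], X_demo[400:]
y_fit, y_held = y_demo[:400], y_demo[400:]

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_fit, y_fit)

def manual_permutation_importance(model, X, y, n_iter=5, seed=0):
    """Mean drop in model.score (R^2 here) after shuffling each column."""
    shuffle_rng = np.random.default_rng(seed)
    base_score = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_shuffled = X.copy()
            # Break the link between this column and the target
            X_shuffled[:, col] = shuffle_rng.permutation(X_shuffled[:, col])
            drops.append(base_score - model.score(X_shuffled, y))
        importances[col] = np.mean(drops)
    return importances

demo_importances = manual_permutation_importance(demo_model, X_held, y_held)
print(demo_importances)
```

Because the importance is a drop in R^2, and R^2 can go negative once an influential column is scrambled, a weight above 1 (as with `population` above) is possible. Values near zero or slightly negative suggest the model barely uses that column.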


# Use permutation importances for feature selection

In [22]:
print(f"Shape before removing features: {X_train.shape}")

Shape before removing features: (26916, 10)


In [0]:
# Keep only the features whose permutation importance exceeds the threshold
minimum_importance = 0.0001
mask = permuter.feature_importances_ > minimum_importance
features = X_train.columns[mask]
X_train = X_train[features]

In [24]:
print(f"Shape after removing features: {X_train.shape}")

Shape after removing features: (26916, 6)


In [25]:
X_val = X_val[features]

pipeline = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
pipeline.score(X_val, y_val)

0.9950407054950273

# Use xgboost for gradient boosting
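Before reaching for xgboost, the core idea of gradient boosting for squared error can be sketched by hand: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a scaled-down version of its output. This is a bare-bones illustration on synthetic data, not xgboost's actual algorithm; `X_boost`, `y_boost`, `learning_rate`, and `n_rounds` are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X_boost = rng.uniform(-3, 3, size=(300, 1))
y_boost = np.sin(X_boost[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Round 0: predict the mean of the target everywhere
prediction = np.full_like(y_boost, y_boost.mean())
mse_history = [np.mean((y_boost - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # For squared error, the negative gradient is just the residual
    residuals = y_boost - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_boost, residuals)
    # Take a small step toward the residuals
    prediction += learning_rate * tree.predict(X_boost)
    trees.append(tree)
    mse_history.append(np.mean((y_boost - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each round can only lower (or keep) the training error, so the number of rounds and the learning rate are usually tuned against a validation set rather than the training set.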