Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we'll spend the rest of the week making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [1]:
# import dataset
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/aklefebvere/DS-Unit-2-Applied-Modeling/master/train-data.csv')

In [2]:
# Create train, test split
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, train_size=0.80, test_size=0.20, random_state=42)

In [3]:
# Create new train and val split
train, val = train_test_split(train, train_size=0.80, test_size=0.20, random_state=42)

In [4]:
# Check shape of splits
train.shape, val.shape, test.shape

((3852, 14), (963, 14), (1204, 14))
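
Chaining two 80/20 splits leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing. As a quick sanity check, the arithmetic below reproduces the shapes printed above. This is only a sketch: the total of 6019 rows is inferred by summing the three printed shapes, `split_sizes` is a hypothetical helper, and rounding the held-out share up is an assumption chosen to match those shapes.

```python
def split_sizes(n_rows, held_out_percent=20):
    """Return (kept, held_out) row counts for one split.

    Assumption: the held-out share is rounded up and the rest is kept,
    which matches the shapes printed in this notebook.
    """
    held_out = -(-n_rows * held_out_percent // 100)  # integer ceiling division
    return n_rows - held_out, held_out


n_total = 3852 + 963 + 1204  # rows across train, val, and test

# First split: hold out the test set; second split: hold out the validation set
train_and_val, n_test = split_sizes(n_total)
n_train, n_val = split_sizes(train_and_val)

print(n_train, n_val, n_test)
print([round(n / n_total, 2) for n in (n_train, n_val, n_test)])
```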

In [5]:
def wrangle(df):
    # Create a copy so the original dataframe isn't modified
    df = df.copy()
    
    # Filter out CNG, LPG, and Electric Fuel_Type rows
    df = df[~df['Fuel_Type'].isin(['CNG', 'LPG', 'Electric'])]
    
    # Columns to drop
    drop = ['New_Price', 'Unnamed: 0']
    df = df.drop(columns=drop)
    
    # Rename columns
    df = df.rename(columns={'Price': 'Price_Lakh', 'Power': 'Power_bhp'})
    
    # Unit conversions (currently disabled)
    # df['Price_INR'] = df['Price_Lakh'] * 100000
    # df['Price_USD'] = df['Price_INR'] * 0.013881
    
    # Zeros --> NaN (for imputation)
    # Kept as a list in case more columns need their zeros replaced
    zero = ['Seats']
    df[zero] = df[zero].replace(0, np.nan)
        
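    # Strip unit suffixes. Note that str.strip treats its argument as a set
    # of characters to remove from both ends, not as a literal suffix.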
    df['Power_bhp'] = df['Power_bhp'].str.strip('null bhp')
    df['Engine'] = df['Engine'].str.strip('CC')
    df['Mileage'] = df['Mileage'].str.strip('kmpl')
    
    # Convert series into numeric values
    df['Power_bhp'] = pd.to_numeric(df['Power_bhp'])
    df['Engine'] = pd.to_numeric(df['Engine'])
    df['Mileage'] = pd.to_numeric(df['Mileage'])
    
    return df
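
The `str.strip` calls in `wrangle` work because pandas' `.str.strip` follows the same rule as Python's built-in `str.strip`: the argument is a set of characters, and any of them are removed from both ends. The sketch below illustrates that behavior with plain Python strings. The raw values are hypothetical examples in the style of the `Power`, `Engine`, and `Mileage` columns, not rows copied from the dataset.

```python
# Hypothetical raw strings mapped to the character set used to strip them
raw_values = {
    '82 bhp': 'null bhp',
    'null bhp': 'null bhp',
    '1197 CC': 'CC',
    '19.1 kmpl': 'kmpl',
}

stripped = {text: text.strip(chars) for text, chars in raw_values.items()}
for text, result in stripped.items():
    print(repr(text), '->', repr(result))

# A trailing space can survive the strip; float() tolerates surrounding whitespace
engine_cc = float(stripped['1197 CC'])

# The argument is a character set, so unrelated text can be eaten too
surprise = 'plum'.strip('kmpl')
print(repr(surprise))
```

A placeholder such as `'null bhp'` is reduced to an empty string, which is why such rows end up as missing values to be imputed later.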

In [6]:
train = wrangle(train)
test = wrangle(test)
val = wrangle(val)

In [7]:
# Check that Engine and Mileage were converted to floats
print(type(train['Engine'].iloc[0]))
print(type(train['Mileage'].iloc[0]))

<class 'numpy.float64'>
<class 'numpy.float64'>


In [8]:
# Check head to see if wrangle function worked correctly
train.head()

| | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power_bhp | Seats | Price_Lakh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5709 | Hyundai Xcent 1.2 Kappa SX | Kochi | 2017 | 15029 | Petrol | Manual | First | 19.1 | 1197.0 | 82.0 | 5.0 | 6.05 |
| 4413 | Maruti Alto 800 LXI | Kochi | 2016 | 51639 | Petrol | Manual | First | 22.74 | 796.0 | 47.3 | 5.0 | 3.04 |
| 3376 | Toyota Innova 2.5 Z Diesel 7 Seater | Jaipur | 2015 | 100000 | Diesel | Manual | Second | 12.99 | 2494.0 | 100.6 | 7.0 | 12.0 |
| 5545 | Land Rover Range Rover Sport SE | Delhi | 2014 | 47000 | Diesel | Automatic | Second | 12.65 | 2993.0 | 255.0 | 5.0 | 64.75 |
| 797 | Maruti SX4 S Cross DDiS 200 Zeta | Mumbai | 2017 | 26000 | Diesel | Manual | First | 23.65 | 1248.0 | 88.5 | 5.0 | 9.75 |


In [15]:
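# Separate the target from the features for each split
target = 'Price_Lakh'

X_train = train.drop(columns=target)
X_val = val.drop(columns=target)
X_test = test.drop(columns=target)  # drop the target here too, to avoid leakage

# Log-transform the target; predictions are then on the log(1 + price) scale
y_train = np.log1p(train[target])
y_val = np.log1p(val[target])
y_test = np.log1p(test[target])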
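
Because the target is transformed with `np.log1p`, every score and prediction later in the notebook is on the log scale. To read a prediction in lakh again, apply the inverse, `np.expm1`. The sketch below checks the round trip using the five prices shown in the `train.head()` output above; the variable names are illustrative.

```python
import numpy as np

# Prices in lakh, copied from the train.head() output above
price_lakh = np.array([6.05, 3.04, 12.0, 64.75, 9.75])

y_log = np.log1p(price_lakh)   # what the model is trained on: log(1 + price)
recovered = np.expm1(y_log)    # inverse transform: exp(y) - 1

print(np.round(y_log, 3))
print(np.allclose(recovered, price_lakh))
```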

# Using RandomForestRegressor

In [16]:
# Fit a RandomForestRegressor without any feature engineering
# R^2 is pretty high right off the bat, need to check for leakage
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestRegressor(random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
# For a regressor, pipeline.score returns R^2, not accuracy
print('Validation R^2:', pipeline.score(X_val, y_val))

Validation R^2: 0.9287420990578699
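
The assignment asks whether the model beats a baseline. For a regressor, `pipeline.score` returns R^2, and a model that always predicts the training mean cannot score above 0 on R^2 for any dataset, so a validation R^2 near 0.93 is well clear of that bar. The sketch below shows the mean baseline with made-up numbers. The `r2` helper and the toy targets are illustrative, not taken from the notebook's data.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Toy log-price targets (illustrative values only)
y_train_toy = np.array([1.2, 1.9, 2.4, 3.1, 4.0])
y_val_toy = np.array([1.5, 2.2, 3.6])

# Mean baseline: always predict the training mean
baseline_pred = np.full_like(y_val_toy, y_train_toy.mean())

baseline_r2 = r2(y_val_toy, baseline_pred)
baseline_mae = np.mean(np.abs(y_val_toy - baseline_pred))
print('Baseline R^2:', baseline_r2)
print('Baseline MAE:', baseline_mae)
```

Running the same two lines on the real `y_train` and `y_val` would give a baseline MAE on the log scale, which is the number to compare against a model's validation MAE.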


# Using XGBoost

In [37]:
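# List scikit-learn's scorer names. These are valid values for `scoring=`
# arguments; XGBoost's eval_metric takes its own metric names, such as 'mae'.
# Note: newer scikit-learn releases replace SCORERS with
# sklearn.metrics.get_scorer_names()
import sklearn.metrics
sorted(sklearn.metrics.SCORERS.keys())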

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_weighted',
 'v_measure_score']

In [46]:
# Create an XGBRegressor model using MAE as the evaluation metric
from xgboost import XGBRegressor
encoder = ce.OrdinalEncoder()
imputer = SimpleImputer(strategy='median')

X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

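model = XGBRegressor(
    n_estimators=1000,
    max_depth=7,
    learning_rate=0.5,
    n_jobs=-1
)

# Track MAE on both the training and validation sets
eval_set = [(X_train_imputed, y_train),
            (X_val_imputed, y_val)]

# Stop once validation MAE hasn't improved for 150 rounds.
# Note: newer XGBoost releases expect eval_metric and early_stopping_rounds
# in the XGBRegressor constructor rather than in fit().
model.fit(X_train_imputed, y_train,
          eval_set=eval_set,
          eval_metric='mae',
          early_stopping_rounds=150)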


[0]	validation_0-mae:0.773916	validation_1-mae:0.776738
Multiple eval metrics have been passed: 'validation_1-mae' will be used for early stopping.

Will train until validation_1-mae hasn't improved in 150 rounds.
[1]	validation_0-mae:0.402589	validation_1-mae:0.40698
[2]	validation_0-mae:0.232976	validation_1-mae:0.246553
[3]	validation_0-mae:0.158384	validation_1-mae:0.183524
[4]	validation_0-mae:0.127705	validation_1-mae:0.160148
[5]	validation_0-mae:0.116204	validation_1-mae:0.155325
[6]	validation_0-mae:0.107861	validation_1-mae:0.151605
[7]	validation_0-mae:0.103011	validation_1-mae:0.150789
[8]	validation_0-mae:0.100352	validation_1-mae:0.149086
[9]	validation_0-mae:0.097053	validation_1-mae:0.146532
[10]	validation_0-mae:0.093681	validation_1-mae:0.147878
[11]	validation_0-mae:0.089997	validation_1-mae:0.147204
[12]	validation_0-mae:0.087782	validation_1-mae:0.146412
[13]	validation_0-mae:0.0846	validation_1-mae:0.144455
[14]	validation_0-mae:0.08232	validation_1-mae:0.143496
...

[138]	validation_0-mae:0.008076	validation_1-mae:0.135078
[139]	validation_0-mae:0.008013	validation_1-mae:0.135068
[140]	validation_0-mae:0.007804	validation_1-mae:0.135098
[141]	validation_0-mae:0.007642	validation_1-mae:0.135099
[142]	validation_0-mae:0.007528	validation_1-mae:0.135124
[143]	validation_0-mae:0.007369	validation_1-mae:0.135172
[144]	validation_0-mae:0.007244	validation_1-mae:0.135195
[145]	validation_0-mae:0.007134	validation_1-mae:0.135199
[146]	validation_0-mae:0.007027	validation_1-mae:0.135199
[147]	validation_0-mae:0.006919	validation_1-mae:0.135209
[148]	validation_0-mae:0.006855	validation_1-mae:0.13517
[149]	validation_0-mae:0.006784	validation_1-mae:0.135129
[150]	validation_0-mae:0.006679	validation_1-mae:0.135161
[151]	validation_0-mae:0.006506	validation_1-mae:0.135183
[152]	validation_0-mae:0.006401	validation_1-mae:0.135224
[153]	validation_0-mae:0.006356	validation_1-mae:0.1352
[154]	validation_0-mae:0.006327	validation_1-mae:0.135199
[155]	validation_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.5, max_delta_step=0,
             max_depth=7, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=-1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
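
The log above says training continues "until validation_1-mae hasn't improved in 150 rounds". That stopping rule can be sketched in a few lines of plain Python. This is an illustration of the idea, not XGBoost's implementation: `early_stop` is a hypothetical helper, and the patience values are far smaller than the notebook's 150 so that the first 15 logged rounds are enough to show both outcomes.

```python
def early_stop(val_scores, patience):
    """Return (best_round, stopped_at) for a lower-is-better metric.

    stopped_at is None if the scores never went `patience` rounds
    without improving on the best score seen so far.
    """
    best_round, best_score = 0, float('inf')
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, None


# validation_1-mae for rounds 0-14, copied from the log above
val_mae = [0.776738, 0.40698, 0.246553, 0.183524, 0.160148,
           0.155325, 0.151605, 0.150789, 0.149086, 0.146532,
           0.147878, 0.147204, 0.146412, 0.144455, 0.143496]

print(early_stop(val_mae, patience=2))  # stops at round 11, best round 9
print(early_stop(val_mae, patience=3))  # round 12 improves in time, so no stop
```

The same log also shows training MAE falling to about 0.007 while validation MAE levels off near 0.135, a gap that usually points to overfitting. A lower `learning_rate` or a smaller `max_depth` may be worth trying.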

In [38]:
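# Find permutation importances (scored in R^2) on the validation set.
# Note: this cell fits a new RandomForestRegressor and rebinds `model`, so the
# importances describe the random forest, not the XGBRegressor fitted above.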
from eli5.sklearn import PermutationImportance

model = RandomForestRegressor(random_state=42, n_jobs=-1)
model.fit(X_train_imputed, y_train)

permuter = PermutationImportance(
    model,
    scoring='r2',
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_imputed, y_val)

PermutationImportance(cv='prefit',
                      estimator=RandomForestRegressor(bootstrap=True,
                                                      ccp_alpha=0.0,
                                                      criterion='mse',
                                                      max_depth=None,
                                                      max_features='auto',
                                                      max_leaf_nodes=None,
                                                      max_samples=None,
                                                      min_impurity_decrease=0.0,
                                                      min_impurity_split=None,
                                                      min_samples_leaf=1,
                                                      min_samples_split=2,
                                                      min_weight_fraction_leaf=0.0,
                                                      n_estimators=100

In [44]:
# Permutation importances: mean drop in validation R^2 when each feature is shuffled
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)

Power_bhp            0.922803
Year                 0.394076
Engine               0.086143
Transmission         0.019209
Mileage              0.015459
Kilometers_Driven    0.011234
Location             0.009466
Seats                0.007727
Name                 0.005703
Fuel_Type            0.003922
Owner_Type           0.000036
dtype: float64
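
To make the numbers above concrete, here is a from-scratch sketch of the permutation importance idea using only NumPy: shuffle one column at a time and measure how far R^2 falls. It is a simplified illustration under stated assumptions, not eli5's implementation. The `permutation_importances` helper, the synthetic data, and the least-squares stand-in model are all hypothetical.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


def permutation_importances(predict, X, y, n_iter=5, seed=42):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_score = r2(y, predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_shuffled = X.copy()
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(base_score - r2(y, predict(X_shuffled)))
        importances[col] = np.mean(drops)
    return importances


# Synthetic data: column 0 matters most, column 1 less, column 2 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Fit on the first 700 rows, evaluate on the held-out 300 (like a validation set)
X_fit, y_fit, X_eval, y_eval = X[:700], y[:700], X[700:], y[700:]

# Stand-in "model": ordinary least squares fitted with NumPy
coef, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)


def predict(data):
    return data @ coef


importances = permutation_importances(predict, X_eval, y_eval)
print(np.round(importances, 3))
```

Each importance is a drop in score rather than a share of a total, so the values need not sum to 1. That is why `Power_bhp` and `Year` alone add up to more than 1 in the output above.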