
Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [1]:
%%capture
import sys

if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/evan-grinalds/Unit-2-Build-Tesla/master/'
    !pip install category_encoders==2.*
    !pip install eli5

# If you're working locally:
else:
    DATA_PATH = '../data/'

### Clean the data

In [2]:
import pandas as pd

df = pd.read_csv(DATA_PATH+'model_s_whole.csv')

print(df.shape)
df

(200, 7)


Unnamed: 0,year,car,battery,ludacris_mode,all_wheel_drive,mileage,price
0,2013,Model S,60,No,No,82851 mi.,27995
1,2018,Model S,100,No,Yes,5357 mi.,57992
2,2012,Model S,60,No,No,85478 mi.,24499
3,2017,Model S,100,No,Yes,32593 mi.,59980
4,2016,Model S,60,No,Yes,28418 mi.,49560
...,...,...,...,...,...,...,...
195,2014,Model S,60,No,No,25444 mi.,39590
196,2016,Model S,100,Yes,Yes,33719 mi.,67990
197,2020,Model S,100,No,No,1527 mi.,83900
198,2016,Model S,75,No,No,50600 mi.,40500


In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Read train_features.csv
train = pd.read_csv(DATA_PATH+'model_s_train_features.csv')
                
# Read test_features.csv
test = pd.read_csv(DATA_PATH+'model_s_test_features.csv')

# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20) 

def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # return the wrangled dataframe
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [4]:
# Strip the 'mi.' suffix and convert mileage to integer in every frame.
# regex=False matches the literal text 'mi.' (as a regex, '.' matches any character).
for frame in (train, val, test, df):
    frame['mileage'] = (
        frame['mileage']
        .str.replace('mi.', '', regex=False)
        .str.strip()
        .astype(int)
    )
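
The `wrangle` function defined earlier returns its input unchanged, so the mileage cleanup could live there and be applied to every split in one place. A minimal sketch, assuming every `mileage` value looks like `'82851 mi.'`; the name `wrangle_mileage` and the two-row `sample` frame are made up for illustration:

```python
import pandas as pd

def wrangle_mileage(X):
    """Return a copy of X with 'mileage' converted from '82851 mi.' to 82851."""
    X = X.copy()  # leave the caller's frame untouched
    X['mileage'] = (
        X['mileage']
        .str.replace('mi.', '', regex=False)  # literal match, not a regex
        .str.strip()
        .astype(int)
    )
    return X

# Hypothetical sample rows in the same shape as the raw data
sample = pd.DataFrame({'mileage': ['82851 mi.', '5357 mi.'], 'price': [27995, 57992]})
cleaned = wrangle_mileage(sample)
print(cleaned['mileage'].tolist())
```

Because the function copies its input, the raw frame stays available for comparison.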

In [5]:
train.head()

Unnamed: 0,year,car,battery,ludacris_mode,all_wheel_drive,mileage,price
32,2017,Model S,100,No,Yes,32565,66000
13,2017,Model S,100,Yes,Yes,22336,77990
29,2017,Model S,75,No,No,44614,46990
22,2013,Model S,60,Yes,No,51780,36997
52,2014,Model S,60,Yes,No,49697,34900


In [6]:
val.head()

Unnamed: 0,year,car,battery,ludacris_mode,all_wheel_drive,mileage,price
97,2017,Model S,100,Yes,No,22336,77990
98,2018,Model S,60,No,No,24451,54500
68,2016,Model S,90,Yes,Yes,20279,59991
88,2014,Model S,60,Yes,No,54616,41990
30,2013,Model S,60,No,No,131679,25900


In [7]:
test.head()

Unnamed: 0,year,car,battery,ludacris_mode,all_wheel_drive,mileage,price
0,2015,Model S,85,No,Yes,28017,44670
1,2013,Model S,60,Yes,No,70661,32990
2,2017,Model S,75,No,No,34738,45995
3,2016,Model S,70,No,Yes,44647,44999
4,2013,Model S,60,No,No,61846,33500


### Visualizations 


In [8]:
import plotly.express as px
px.scatter(df, x='year', y='price', trendline='ols')





### Feature Selection


In [9]:
# Arrange data into X features matrix and y target vector
target = 'price'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
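# The test file also contains the price column, so split it the same way
X_test = test.drop(columns=target)
y_test = test[target]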
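
Two of the columns shown in the earlier `head()` outputs may carry no signal: `Unnamed: 0` looks like a saved row index, and `car` shows `Model S` in every displayed row. Whether to drop them is a judgement call. A sketch of detecting such columns on a made-up three-row frame (`toy`, `constant_cols`, and `index_like` are illustrative names):

```python
import pandas as pd

# Hypothetical rows shaped like the notebook's data
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'year': [2013, 2018, 2012],
    'car': ['Model S', 'Model S', 'Model S'],
    'price': [27995, 57992, 24499],
})

# Columns with a single unique value cannot help a model distinguish rows
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]

# Columns pandas names 'Unnamed: N' usually come from an index written by to_csv
index_like = [c for c in toy.columns if c.startswith('Unnamed')]

features = toy.drop(columns=constant_cols + index_like + ['price'])
print(features.columns.tolist())
```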

### Baseline

In [7]:
pd.options.display.float_format = '{:,.0f}'.format
df['price'].describe()

count      200
mean    47,691
std     15,380
min     24,499
25%     35,000
50%     44,585
75%     57,247
max     92,900
Name: price, dtype: float64

In [8]:
guess = df['price'].mean()

In [9]:
guess

47690.6

In [10]:
errors = guess - df['price']

In [11]:
errors

0      19,696
1     -10,301
2      23,192
3     -12,289
4      -1,869
        ...  
195     8,101
196   -20,299
197   -36,209
198     7,191
199     7,791
Name: price, Length: 200, dtype: float64

In [12]:
# Named baseline_mae so it does not shadow sklearn's mean_absolute_error function
baseline_mae = errors.abs().mean()

In [13]:
print(f'If we just guessed every Tesla Model S sold for ${guess:,.0f},')
print(f'we would be off by ${baseline_mae:,.0f} on average.')

If we just guessed every Tesla Model S sold for $47,691,
we would be off by $12,357 on average.
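
The baseline above uses all 200 rows of `df`. To get a number directly comparable with validation scores, the mean can be learned from the training target only and then scored on the validation target. A sketch with made-up numbers, using scikit-learn's `DummyRegressor`; the `*_toy` arrays are illustrative and not taken from the dataset:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical prices standing in for y_train and y_val
y_train_toy = np.array([30000, 40000, 50000, 60000])
y_val_toy = np.array([35000, 55000])

# DummyRegressor ignores the features, so a placeholder column is enough
baseline = DummyRegressor(strategy='mean')
baseline.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)

val_pred = baseline.predict(np.zeros((len(y_val_toy), 1)))
baseline_val_mae = mean_absolute_error(y_val_toy, val_pred)
print(f'Baseline validation MAE: ${baseline_val_mae:,.0f}')
```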


### Ridge Regression

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import category_encoders as ce
import numpy as np
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    ce.BinaryEncoder(), 
    KNNImputer(), 
    StandardScaler(), 
    SelectKBest(f_regression), 
    Ridge()
)

param_distributions = {
    'knnimputer__n_neighbors': [3,4,5,6,7,8], 
    'selectkbest__k': range(1, len(X_train.columns)+1), 
    'ridge__alpha': [0.1, 1,8,9,10,15], 
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=10, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);
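
The search is configured with `return_train_score=True`, which stores train scores next to the cross-validation scores in `cv_results_`. Comparing the two is one way to spot overfitting. A small sketch on synthetic data; the toy arrays and the four-value `alpha` list are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

toy_search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_distributions={'ridge__alpha': [0.1, 1, 10, 100]},
    n_iter=4,
    cv=3,
    scoring='neg_mean_absolute_error',
    return_train_score=True,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# One row per sampled candidate, best cross-validation score first
results = (
    pd.DataFrame(toy_search.cv_results_)
    [['param_ridge__alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate suggests it fits the training folds much better than unseen folds.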

In [None]:
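from sklearn.metrics import mean_absolute_error

print('Best hyperparameters', search.best_params_)
print('Cross-validation MAE', -search.best_score_)
pipeline = search.best_estimator_

# Predict on the same feature columns used for training.
# The target was never log-transformed, so predictions are already in dollars.
y_pred = pipeline.predict(X_test[X_train.columns])
mae = mean_absolute_error(test[target], y_pred)
print(f'Test MAE: ${mae:,.0f}')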
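
`np.expm1` is the inverse of `np.log1p`, so it belongs in a prediction step only when the model was fit on a log-transformed target. If price were modeled on a log scale, one way to keep the two transforms paired is scikit-learn's `TransformedTargetRegressor`. A sketch on synthetic, positively skewed data (all `*_toy` names are illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic positive, right-skewed target standing in for prices
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(50, 2))
y_toy = np.exp(10 + X_toy[:, 0] - 0.5 * X_toy[:, 1])

# The wrapper applies log1p before fitting and expm1 after predicting
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_toy, y_toy)

pred_toy = log_model.predict(X_toy)  # already back on the original scale
toy_mae = mean_absolute_error(y_toy, pred_toy)
print(f'MAE on the original scale: {toy_mae:,.0f}')
```

With this wrapper, predictions come back in the original units, so no manual `np.expm1` call is needed.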

### Random Forest

In [None]:
from scipy.stats import randint, uniform
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
pipeline = make_pipeline(
    ce.TargetEncoder(), 
    SimpleImputer(),
    StandardScaler(), 
    RandomForestRegressor(random_state=42)
)

param_distributions = {
    'targetencoder__min_samples_leaf': randint(1, 1000),     
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestregressor__n_estimators': randint(50, 500), 
    'randomforestregressor__max_depth': [5, 10, 15, 20, None], 
    'randomforestregressor__max_features': uniform(0, 1), 
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=10, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

In [None]:
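from sklearn.metrics import mean_absolute_error

print('Best hyperparameters', search.best_params_)
print('Cross-validation MAE', -search.best_score_)
pipeline = search.best_estimator_

# Predict on the same feature columns used for training.
# The target was never log-transformed, so predictions are already in dollars.
y_pred = pipeline.predict(X_test[X_train.columns])
mae = mean_absolute_error(test[target], y_pred)
print(f'Test MAE: ${mae:,.0f}')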

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor


pipeline = make_pipeline(
    ce.TargetEncoder(), 
    SimpleImputer(),
    StandardScaler(), 
    GradientBoostingRegressor(random_state=42)
)
param_distributions = {
    'targetencoder__min_samples_leaf': randint(1, 1000),     
    'simpleimputer__strategy': ['mean', 'median'], 
    'gradientboostingregressor__max_depth': [5, 10, 15, 20, None], 
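    # 'ls' and 'lad' were renamed 'squared_error' and 'absolute_error' in scikit-learn 1.0
    'gradientboostingregressor__loss': ['squared_error', 'absolute_error', 'huber', 'quantile'], 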
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=10, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)
search.fit(X_train, y_train);

In [None]:
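from sklearn.metrics import mean_absolute_error

print('Best hyperparameters', search.best_params_)
print('Cross-validation MAE', -search.best_score_)
pipeline = search.best_estimator_

# Predict on the same feature columns used for training.
# The target was never log-transformed, so predictions are already in dollars.
y_pred = pipeline.predict(X_test[X_train.columns])
mae = mean_absolute_error(test[target], y_pred)
print(f'Test MAE: ${mae:,.0f}')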

### XGBoost

In [None]:
from xgboost import XGBRegressor
pipeline = make_pipeline(
    ce.TargetEncoder(), 
    SimpleImputer(),
    StandardScaler(), 
    XGBRegressor(random_state=42,n_jobs=-1)
)
param_distributions = {
    'targetencoder__min_samples_leaf': randint(1, 1000),     
    'simpleimputer__strategy': ['mean', 'median'],
    'xgbregressor__learning_rate': [.03, 0.05, .07 ,0.13,0.2],
    'xgbregressor__max_depth': [5, 6, 7],}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=10, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)
search.fit(X_train, y_train);

In [None]:
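from sklearn.metrics import mean_absolute_error

print('Best hyperparameters', search.best_params_)
print('Cross-validation MAE', -search.best_score_)
pipeline = search.best_estimator_

# Predict on the same feature columns used for training.
# The target was never log-transformed, so predictions are already in dollars.
y_pred = pipeline.predict(X_test[X_train.columns])
mae = mean_absolute_error(test[target], y_pred)
print(f'Test MAE: ${mae:,.0f}')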

### Permutation Importances

In [None]:
# Get the default (built-in) feature importances from the final estimator
# of the best pipeline found above
model = pipeline.steps[-1][1]
importances = pd.Series(model.feature_importances_, X_train.columns)

# Plot feature importances
%matplotlib inline
import matplotlib.pyplot as plt

n = 7
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey');

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Column to test; 'mileage' is an arbitrary choice among this dataset's features
column = 'mileage'

# Fit without column (price is continuous, so use a regressor; .score returns R^2)
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train.drop(columns=column), y_train)
score_without = pipeline.score(X_val.drop(columns=column), y_val)
print(f'Validation R^2 without {column}: {score_without}')

# Fit with column
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_val, y_val)
print(f'Validation R^2 with {column}: {score_with}')

# Compare the score with & without column
print(f'Drop-Column Importance for {column}: {score_with - score_without}')
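
Drop-column importance refits the model once per feature. Permutation importance avoids the refit: shuffle one column of the validation data, predict with the already fitted model, and measure how much the error grows. A hand-rolled sketch on synthetic data, where `year` drives the target and `noise` does not; every name here is illustrative rather than taken from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
n_rows = 200

# Synthetic data: price depends on year, and 'noise' is unrelated to it
X_toy = pd.DataFrame({
    'year': rng.integers(2012, 2021, size=n_rows),
    'noise': rng.normal(size=n_rows),
})
y_toy = 25000 + 5000 * (X_toy['year'] - 2012) + rng.normal(scale=500, size=n_rows)

X_fit, X_hold = X_toy.iloc[:150], X_toy.iloc[150:]
y_fit, y_hold = y_toy.iloc[:150], y_toy.iloc[150:]

toy_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)
base_mae = mean_absolute_error(y_hold, toy_model.predict(X_hold))

def permutation_increase(feature):
    """Increase in held-out MAE after shuffling one feature column."""
    X_perm = X_hold.copy()
    X_perm[feature] = rng.permutation(X_perm[feature].to_numpy())
    return mean_absolute_error(y_hold, toy_model.predict(X_perm)) - base_mae

increase = {col: permutation_increase(col) for col in X_toy.columns}
print(increase)
```

A single shuffle is noisy, which is why library implementations repeat it several times and report a mean and spread.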

### eli5 Library

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor

transformers = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

# Price is a continuous target, so fit a regressor rather than a classifier
model = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

feature_names = X_val.columns.tolist()

# Score with a regression metric; accuracy is not meaningful for continuous prices
permuter = PermutationImportance(
    model,
    scoring='neg_mean_absolute_error',
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)

eli5.show_weights(
    permuter,
    top=None,
    feature_names=feature_names
)
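
scikit-learn also ships its own permutation importance in `sklearn.inspection`, which can serve as an alternative when eli5 is not available. A sketch on synthetic regression data; the toy frame and model are assumptions for illustration, not the notebook's data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic data: price depends on year, and 'noise' is unrelated to it
X_toy = pd.DataFrame({
    'year': rng.integers(2012, 2021, size=n_rows),
    'noise': rng.normal(size=n_rows),
})
y_toy = 25000 + 5000 * (X_toy['year'] - 2012) + rng.normal(scale=500, size=n_rows)

X_fit, X_hold = X_toy.iloc[:150], X_toy.iloc[150:]
y_fit, y_hold = y_toy.iloc[:150], y_toy.iloc[150:]

toy_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Each feature is shuffled n_repeats times on the held-out data
result = permutation_importance(
    toy_model, X_hold, y_hold,
    scoring='neg_mean_absolute_error',
    n_repeats=5,
    random_state=42,
)

ranking = (
    pd.Series(result.importances_mean, index=X_hold.columns)
    .sort_values(ascending=False)
)
print(ranking)
```

With `neg_mean_absolute_error` as the scorer, each importance is the average increase in MAE caused by shuffling that feature, so the values are in the target's units.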
