<a href="https://colab.research.google.com/github/evan-randall/DS-Unit-2-Applied-Modeling/blob/master/module3-permutation-boosting/Copy_of_LS_DS_233_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

#### Continue to clean and explore your data. Make exploratory visualizations.

In [None]:
import pandas as pd
from google.colab import files
uploaded = files.upload()

Saving NBA Salaries - Sheet1 (9).csv to NBA Salaries - Sheet1 (9).csv


In [None]:
df = pd.read_csv('NBA Salaries - Sheet1 (9).csv')
df.head(5)

Unnamed: 0,NAME,TEAM,POSITION,HEIGHT IN.,EXPERIENCE,SCORING AVERAGE,SALARY
0,Stephen Curry,Golden State Warriors,1,75,11,23.5,40200000
1,Chris Paul,Oklahoma City Thunder,1,73,15,18.5,38506482
2,Russell Westbrook,Houston Rockets,1,75,12,23.2,38506482
3,John Wall,Washington Wizards,1,76,10,19.0,38199000
4,Kevin Durant,Brooklyn Nets,3,82,13,27.0,38199000


In [None]:
# Get mean baseline
print('Mean Baseline (using 0 features)')
guess = y_train.mean()

Mean Baseline (using 0 features)


In [None]:
guess

15138945.764705881

In [None]:
# Train Error
from sklearn.metrics import mean_absolute_error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (SALARY): ${mae:.2f}')

Train Error (SALARY): $1204988.65


In [None]:
# Test Error
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (SALARY): ${mae:.2f}')

Test Error (SALARY): $12074097.48


In [None]:
guess = df['SALARY'].mean()

In [None]:
guess

23066383.505050506

In [None]:
errors = guess - df['SALARY']

In [None]:
errors

0    -1.713362e+07
1    -1.544010e+07
2    -1.544010e+07
3    -1.513262e+07
4    -1.513262e+07
          ...     
94    9.733051e+06
95    9.941384e+06
96    1.006638e+07
97    1.016638e+07
98    1.028960e+07
Name: SALARY, Length: 99, dtype: float64

In [None]:
# Named baseline_mae so it does not shadow sklearn's mean_absolute_error function
baseline_mae = errors.abs().mean()

In [None]:
print(f'If we just guessed every NBA salary was ${guess:,.0f},')
print(f'we would be off by ${baseline_mae:,.0f} on average.')

If we just guessed every NBA salary was $23,066,384,
we would be off by $6,703,931 on average.


In [None]:
# 5-step linear regression: use scikit-learn to fit a simple regression with one feature.
# 1. Import the appropriate estimator class
from sklearn.linear_model import LinearRegression

In [None]:
LinearRegression

sklearn.linear_model._base.LinearRegression

In [None]:
# 2. Instantiate this class

model = LinearRegression()

In [None]:
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
# 3. Arrange X features matrix and y target vector

# Standard approach that can generalize to other problems
features = ['EXPERIENCE']
target = ['SALARY']

x_train = df[features]
y_train = df[target]

In [None]:
x_train.describe()

Unnamed: 0,EXPERIENCE
count,99.0
mean,9.363636
std,3.031988
min,4.0
25%,7.0
50%,9.0
75%,11.0
max,17.0


In [None]:
y_train.describe()

Unnamed: 0,SALARY
count,99.0
mean,23066380.0
std,7639039.0
min,12776790.0
25%,16100000.0
50%,21000000.0
75%,27648470.0
max,40200000.0


In [None]:
# 4. Fit the model

model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
# 5. Apply the model to new data

EXPERIENCE = 10
x_test = [[ EXPERIENCE ]]

y_pred = model.predict(x_test)
y_pred

array([[23272270.96465972]])

In [None]:
y_test = [18000000]

In [None]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print("Our model was off by", mae)

Our model was off by 5272270.964659717


In [None]:
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
model.coef_

array([[323537.43652876]])

In [None]:
model.intercept_

array([20036896.59937213])
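
The coefficient and intercept above fully determine the fitted line, so the earlier prediction for 10 years of experience can be checked by hand. This is a minimal sketch that assumes the two values printed above; `predict_salary` is a hypothetical helper name, not part of scikit-learn.

```python
# Values copied from the model.intercept_ and model.coef_ outputs above
intercept = 20036896.59937213
coef_experience = 323537.43652876


def predict_salary(experience_years):
    """Evaluate the fitted line: salary = intercept + coefficient * experience."""
    return intercept + coef_experience * experience_years


prediction = predict_salary(10)
print(f'Estimated salary for 10 years of experience: ${prediction:,.0f}')
```

Up to floating-point rounding, this should agree with the value returned by `model.predict([[10]])` earlier in the notebook.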

#### Visualizations

In [None]:
import plotly.express as px
px.scatter(df, x='EXPERIENCE', y='SALARY', trendline='ols')

In [None]:
import plotly.express as px
px.scatter(df, x='HEIGHT IN.', y='SALARY', trendline='ols')

In [None]:
import plotly.express as px
px.scatter(df, x='POSITION', y='SALARY', trendline='ols')

In [None]:
import plotly.express as px
px.scatter(df, x='SCORING AVERAGE', y='SALARY', trendline='ols')

In [None]:
def predict(experience):
    """Estimate salary from years of NBA experience using the fitted model."""
    y_pred = model.predict([[experience]])
    estimate = y_pred[0]
    coefficient = model.coef_[0]
    result = f'${estimate[0]:,.0f} is the estimated salary for {experience:,.0f} years experience in the NBA.'
    explanation = f'In this linear regression, each additional year of experience adds ${coefficient[0]:,.0f}.'
    return result + '\n' + explanation

print(predict(10))

$23,272,271 is the estimated salary for 10 years experience in the NBA.
In this linear regression, each additional year of experience adds $323,537.


####  Fit a model. Does it beat your baseline?

In [None]:
# Split on a salary threshold: players under $18M go to train, the rest to test
train = df[df['SALARY'] < 18000000]

test = df[df['SALARY'] >= 18000000]

In [None]:
train.shape, test.shape

((34, 7), (65, 7))

In [None]:
train['SALARY'].mean()

15138945.764705881

In [None]:
# Arrange y target vectors
target = 'SALARY'
y_train = train[target]
y_test = test[target]

In [None]:
y_train

65    17839286
66    17650000
67    17185185
68    17150000
69    17000000
70    17000000
71    16720000
72    16229213
73    16200000
74    16000000
75    15680000
76    15643750
77    15625000
78    15625000
79    15500000
80    15450051
81    15349400
82    15000000
83    15000000
84    14896552
85    14651700
86    14634146
87    14500000
88    14471910
89    14057730
90    14041096
91    13565218
92    13486300
93    13437500
94    13333333
95    13125000
96    13000000
97    12900000
98    12776786
Name: SALARY, dtype: int64

In [None]:
y_test

0     40200000
1     38506482
2     38506482
3     38199000
4     38199000
        ...   
60    18539130
61    18539130
62    18500000
63    18000000
64    18000000
Name: SALARY, Length: 65, dtype: int64

#### Try xgboost.

In [None]:
# Attempt to adapt the lecture code to my project (work in progress)

In [None]:
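import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the NBA salaries data (a single file, with no separate labels or submission file)
df = pd.read_csv('NBA Salaries - Sheet1 (9).csv')

# Hold out a test set, then split the remainder into train & val.
# SALARY is continuous, so the split is not stratified.
train, test = train_test_split(df, test_size=0.20, random_state=42)
train, val = train_test_split(train, train_size=0.80, test_size=0.20,
                              random_state=42)


def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # return the wrangled dataframe
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

# Arrange X features matrices and y target vectors
# (feature choice is an assumption: every column except the index, NAME, and the target)
target = 'SALARY'
features = ['TEAM', 'POSITION', 'HEIGHT IN.', 'EXPERIENCE', 'SCORING AVERAGE']
X_train, y_train = train[features], train[target]
X_val, y_val = val[features], val[target]
X_test, y_test = test[features], test[target]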

In [None]:
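import category_encoders as ce
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# SALARY is a continuous target, so use the regressor rather than the classifier
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    GradientBoostingRegressor(random_state=42)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train);

In [None]:
# For a regressor, score() returns R^2 rather than accuracy
print('Training R^2:', pipeline.score(X_train, y_train))
print('Validation R^2:', pipeline.score(X_val, y_val))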

In [None]:
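import category_encoders as ce
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

# SALARY is a continuous target, so use XGBRegressor rather than XGBClassifier
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    XGBRegressor(n_estimators=100,
                 random_state=42,
                 n_jobs=-1)
)

pipeline.fit(X_train, y_train)

In [None]:
# For a regressor, score() returns R^2 rather than accuracy
print('Training R^2:', pipeline.score(X_train, y_train))
print('Validation R^2:', pipeline.score(X_val, y_val))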
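
To make the idea behind boosting concrete, here is a dependency-free teaching sketch of gradient boosting for squared error with one-split trees (stumps). It is not how xgboost is implemented, since xgboost adds regularization and many optimizations on top of this idea. The function names `fit_stump` and `fit_boosted`, and the small data set (experience in years, salary in millions of dollars), are made up for illustration.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def fit_boosted(x, y, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Made-up illustration data: years of experience, salary in $ millions
x = [4, 6, 8, 10, 12, 14]
y = [13.0, 15.5, 18.0, 24.0, 30.0, 38.0]

boosted = fit_boosted(x, y)
baseline = sum(y) / len(y)
baseline_sse = sum((yi - baseline) ** 2 for yi in y)
boosted_sse = sum((yi - boosted(xi)) ** 2 for xi, yi in zip(x, y))
print('Baseline squared error:', baseline_sse)
print('Boosted squared error: ', boosted_sse)
```

Each round shrinks the training error a little because the new stump is fit to whatever the previous rounds left unexplained. The learning rate controls how much of each stump is added.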

#### Get your model's permutation importances.

In [None]:
# Get feature importances from the final step of the pipeline
estimator = pipeline.steps[-1][1]
importances = pd.Series(estimator.feature_importances_, X_train.columns)

# Plot feature importances
%matplotlib inline
import matplotlib.pyplot as plt

n = 30
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

In [None]:
X_train.shape

In [None]:
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

column = 'EXPERIENCE'

# Fit without column
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train.drop(columns=column), y_train)
score_without = pipeline.score(X_val.drop(columns=column), y_val)
print(f'Validation R^2 without {column}: {score_without}')

# Fit with column
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_val, y_val)
print(f'Validation R^2 with {column}: {score_with}')

# Compare the score with & without column
print(f'Drop-Column Importance for {column}: {score_with - score_without}')

#### Do-It-Yourself

In [None]:
# Step 1: Train your model
# See above

# Step 2: Shuffle the values in a single column (one feature)
# OF YOUR VALIDATION SET
feature = 'EXPERIENCE'
print(X_val[feature].head())
print()
print(X_val[feature].value_counts())

In [None]:
import numpy as np

X_val_permuted = X_val.copy()
X_val_permuted[feature] = np.random.permutation(X_val_permuted[feature])

In [None]:
score = pipeline.score(X_val, y_val)
score_permuted = pipeline.score(X_val_permuted, y_val)

print(f'Validation score with {feature}:', score)
print(f'Validation score with {feature} permuted:', score_permuted)
print('Permutation importance:', score - score_permuted)
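
The do-it-yourself steps above can be wrapped in a small function that repeats the shuffle several times and averages the result. This is a dependency-free sketch under stated assumptions: `permutation_importance` and `toy_predict` here are hypothetical helpers written for illustration (not the eli5 or scikit-learn functions), the rows are made up, and importance is measured as the increase in mean absolute error rather than the drop in score.

```python
import random


def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, column, n_repeats=5, seed=42):
    """Average increase in MAE after shuffling one column of the validation rows."""
    rng = random.Random(seed)
    base_error = mae(y_true, [predict(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row with only the chosen column replaced
        permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
        error = mae(y_true, [predict(r) for r in permuted])
        increases.append(error - base_error)
    return sum(increases) / n_repeats


def toy_predict(row):
    """Stand-in for a trained model: uses EXPERIENCE and ignores HEIGHT IN."""
    return 20_000_000 + 300_000 * row['EXPERIENCE']


# Made-up validation rows; targets are chosen so the toy model fits them exactly
rows = [{'EXPERIENCE': e, 'HEIGHT IN.': h}
        for e, h in [(4, 75), (7, 80), (9, 78), (11, 82), (15, 73), (17, 77)]]
y_true = [toy_predict(r) for r in rows]

imp_experience = permutation_importance(toy_predict, rows, y_true, 'EXPERIENCE')
imp_height = permutation_importance(toy_predict, rows, y_true, 'HEIGHT IN.')
print('EXPERIENCE importance:', imp_experience)
print('HEIGHT IN. importance:', imp_height)
```

Shuffling a column the model ignores leaves the error unchanged, so its importance is zero. Shuffling a column the model relies on makes the error grow.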

#### eli5 Library

In [None]:
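import category_encoders as ce
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Transform the features outside the model so eli5 permutes the encoded columns
transformers = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

# SALARY is a continuous target, so use a regressor
model = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

feature_names = X_val.columns.tolist()

permuter = PermutationImportance(
    model,
    scoring='neg_mean_absolute_error',  # regression metric, matching the MAE used above
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)

eli5.show_weights(
    permuter,
    top=None,
    feature_names=feature_names
)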