<a href="https://colab.research.google.com/github/austinlasseter/DS-Unit-2-Applied-Modeling/blob/master/Canvas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective 1 - get permutation importances for model interpretation and feature selection

In many of the models we've fit, we've looked at the feature importances by ranking the features with the scores the fitted model reports. Beyond that basic method, we can also look at what happens to the model's performance when we scramble the values of a specific feature. This is called permutation importance.

To permute something means to change its order. When we fit a model, we measure the accuracy by comparing the model's predictions to the test or validation data. We can test the importance of a feature by permuting its values and then recalculating the accuracy against the test set.

The process works something like this:

1. Fit a model and calculate the accuracy.
2. Choose a feature (either by its importance rank or some other method) and randomly permute the values for just that feature.
3. Calculate the accuracy again with the permuted column.
4. Compare the two scores:
   - A decrease in accuracy: that feature is important to the model.
   - Accuracy that stays the same: the feature isn't important to the model and could be replaced by random numbers.
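
The steps above can be sketched in a few lines. This is a minimal illustration rather than the code used later in this notebook: the data is synthetic (an assumption made so the example is self-contained), and the names `X_demo`, `demo_model`, and `drops` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): with shuffle=False the
# 2 informative features are columns 0 and 1, the other 3 columns are noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Step 1: fit a model and record the baseline accuracy
demo_model = DecisionTreeClassifier(max_depth=4, random_state=0)
demo_model.fit(Xd_train, yd_train)
baseline = demo_model.score(Xd_val, yd_val)

# Steps 2-3: permute one column at a time and score again
rng = np.random.default_rng(0)
drops = []
for col in range(Xd_val.shape[1]):
    X_perm = Xd_val.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    drops.append(baseline - demo_model.score(X_perm, yd_val))

# Step 4: a large drop suggests the model relies on that column
print(f'baseline accuracy = {baseline:.3f}')
for col, drop in enumerate(drops):
    print(f'feature {col}: accuracy drop = {drop:.3f}')
```

Only the validation copy is shuffled; the model is never refit, so each drop reflects how much the already-trained model depends on that column.
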



We'll use the Australian weather data set from the previous module and permute (randomize) a few of the features in the test set. The accuracy should decrease for features that are important to the model and remain essentially the same for features that are not.

In [5]:
!wget https://rattle.togaware.com/weatherAUS.csv


--2020-10-06 00:57:36--  https://rattle.togaware.com/weatherAUS.csv
Resolving rattle.togaware.com (rattle.togaware.com)... 207.38.86.6, 2605:de00:1:1:4a:2:0:e5
Connecting to rattle.togaware.com (rattle.togaware.com)|207.38.86.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19780788 (19M) [text/csv]
Saving to: ‘weatherAUS.csv’


2020-10-06 00:57:39 (8.90 MB/s) - ‘weatherAUS.csv’ saved [19780788/19780788]



In [7]:
# Import libraries, load data, and view
import pandas as pd
weather = pd.read_csv('weatherAUS.csv')
weather.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,0.0,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,0.0,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,0.0,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,1.0,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,0.2,No


In [9]:

# Drop columns with high-percentage of missing values
cols_drop = ['Location', 'Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm', 'RISK_MM']
weather_drop = weather.drop(cols_drop, axis=1)



In [10]:
# Convert the 'Date' column to datetime and keep only the month
# (infer_datetime_format is deprecated in recent pandas versions, so it is omitted)
weather_drop['Date'] = pd.to_datetime(weather_drop['Date']).dt.month

## Create Pipeline

In [11]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier


numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 
                    'WindSpeed9am','WindSpeed3pm', 'Humidity9am', 
                    'Humidity3pm', 'Pressure9am','Pressure3pm', 
                    'Temp9am', 'Temp3pm']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier())])

## Train and Fit the Model

In [14]:
# Create the feature matrix 
X = weather_drop.drop('RainTomorrow', axis=1)

In [21]:
weather['RainTomorrow'].value_counts()

No     136485
Yes     37405
Name: RainTomorrow, dtype: int64

In [24]:
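# Create and encode the target array
# Note: np.where maps every value other than "Yes" to 0, so rows with a
# missing RainTomorrow are counted as 0 along with the "No" rows

weather['RainT']=np.where(weather['RainTomorrow']=="Yes", 1, 0)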
print(weather['RainT'].value_counts())
y = weather['RainT']

0    140861
1     37405
Name: RainT, dtype: int64


In [25]:
# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the model
clf.fit(X_train,y_train)
print('Validation Accuracy', clf.score(X_test, y_test))

Validation Accuracy 0.7865316654512817


## Feature Importances

In [26]:
# Features (order in which they were preprocessed)
features_order = numeric_features + categorical_features

# Look up the fitted tree by its step name instead of its position
importances = pd.Series(clf.named_steps['classifier'].feature_importances_, features_order)

# Plot feature importances
import matplotlib.pyplot as plt

n = 10
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey')

plt.clf()

<Figure size 720x360 with 0 Axes>

We can now try a few of the columns and see how permutation of their values affects the accuracy. We'll start with the most important feature (Humidity3pm) and then do the same with one of the less important features (WindSpeed3pm).

Because clf is a pipeline, scoring the permuted data applies the same preprocessing we used during training: SimpleImputer() and StandardScaler() for the numeric features. We also fill missing values in the chosen column with its median before permuting it.

In [27]:
# Permute the values in the more important column
feature = 'Humidity3pm'

X_test_permuted = X_test.copy()

# Fill in missing values (assign back instead of using inplace on a column)
X_test_permuted[feature] = X_test_permuted[feature].fillna(
    X_test_permuted[feature].median())

# Permute the filled column (not the original, unfilled X_test column)
X_test_permuted[feature] = np.random.permutation(X_test_permuted[feature])

print('Feature permuted: ', feature)
print('Validation Accuracy', clf.score(X_test, y_test))
print('Validation Accuracy (permuted)', clf.score(X_test_permuted, y_test))

Feature permuted:  Humidity3pm
Validation Accuracy 0.7865316654512817
Validation Accuracy (permuted) 0.7025298704212711


The accuracy went down, as we would expect, so Humidity3pm has an effect on the model. Let's try another feature.

In [28]:
# Permute the values in a less important column
feature = 'WindSpeed3pm'

X_test_permuted = X_test.copy()

# Fill in missing values (assign back instead of using inplace on a column)
X_test_permuted[feature] = X_test_permuted[feature].fillna(
    X_test_permuted[feature].median())

# Permute the filled column (not the original, unfilled X_test column)
X_test_permuted[feature] = np.random.permutation(X_test_permuted[feature])

print('Feature permuted: ', feature)
print('Validation Accuracy', clf.score(X_test, y_test))
print('Validation Accuracy (permuted)', clf.score(X_test_permuted, y_test))



Feature permuted:  WindSpeed3pm
Validation Accuracy 0.7865316654512817
Validation Accuracy (permuted) 0.7743030235036742


The decrease in accuracy was much smaller this time (about 0.01, compared with about 0.08 for Humidity3pm), so WindSpeed3pm is not as important to the model.
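
A single shuffle gives a noisy estimate, so it can help to repeat the permutation several times and average the result. scikit-learn provides `permutation_importance` in `sklearn.inspection` for this. The sketch below runs it on synthetic stand-in data (an assumption, with hypothetical column names `f0` to `f4`) so that it is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical column names f0..f4)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0)
X_syn = pd.DataFrame(X_syn, columns=[f'f{i}' for i in range(5)])
Xs_train, Xs_val, ys_train, ys_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0)

syn_model = DecisionTreeClassifier(max_depth=4, random_state=0)
syn_model.fit(Xs_train, ys_train)

# Shuffle each column n_repeats times and average the drop in score
result = permutation_importance(
    syn_model, Xs_val, ys_val, n_repeats=10, random_state=0)

perm_importances = pd.Series(
    result.importances_mean, index=Xs_val.columns
).sort_values(ascending=False)
print(perm_importances)
```

The same function accepts a fitted pipeline, so a call along the lines of `permutation_importance(clf, X_test, y_test, n_repeats=10)` should work with the weather model above, although that has not been run here.
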

# Objective 2 - use xgboost for gradient boosting

## Bagging

In the previous unit, we used the random forest ensemble method, where the ensemble is a collection of decision trees. A random forest makes use of bootstrap sampling, where random samples are drawn from the training set with replacement. A decision tree is trained on each sample, and each tree gets a "vote" for the class. The class with the most votes wins. This process is called bootstrap aggregating, or bagging.
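
As a rough illustration of bagging, the sketch below draws bootstrap samples by hand, trains one tree per sample, and takes a majority vote. It uses the iris data and hypothetical names (`Xb_train`, `trees`, `bagged_pred`). It is plain bagging only: a random forest additionally considers a random subset of features at each split, which is not shown here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xb, yb = load_iris(return_X_y=True)
Xb_train, Xb_val, yb_train, yb_val = train_test_split(
    Xb, yb, test_size=0.4, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(Xb_train), size=len(Xb_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(Xb_train[idx], yb_train[idx])
    trees.append(tree)

# Each tree votes; the most common class for each row wins
votes = np.array([t.predict(Xb_val) for t in trees])  # shape (n_trees, n_val)
bagged_pred = np.array(
    [np.bincount(row_votes, minlength=3).argmax() for row_votes in votes.T])
bagged_accuracy = (bagged_pred == yb_val).mean()
print('Bagged accuracy:', bagged_accuracy)
```
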
## Boosting

Another important technique in machine learning is boosting. For our example, we'll start by training a weak learner on our data set, often a decision tree with a single split (called a stump). We find the samples that were misclassified and start the next round by assigning them a larger weight. We continue to train decision tree stumps, increasing the weight on the mistakes of each model. Samples that are difficult to classify receive increasingly larger weights, so later stumps concentrate on getting them right. This process is called adaptive boosting, which is where the name AdaBoost comes from.
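
The reweighting loop can be sketched by hand. The example below is a simplified version of discrete AdaBoost for two classes, not scikit-learn's implementation. The breast cancer data set and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xa, ya = load_breast_cancer(return_X_y=True)
ya = np.where(ya == 1, 1, -1)  # recode the labels as -1 / +1
Xa_train, Xa_val, ya_train, ya_val = train_test_split(
    Xa, ya, test_size=0.3, random_state=42)

n_rounds = 20
weights = np.full(len(Xa_train), 1 / len(Xa_train))  # start with equal weights
stumps, alphas = [], []
for _ in range(n_rounds):
    # Weak learner: a tree with a single split, trained on the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(Xa_train, ya_train, sample_weight=weights)
    pred = stump.predict(Xa_train)

    # Weighted error rate and the stump's say in the final vote
    err = np.clip(weights[pred != ya_train].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Misclassified samples (ya_train * pred == -1) get larger weights
    weights *= np.exp(-alpha * ya_train * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes
score = sum(a * s.predict(Xa_val) for a, s in zip(alphas, stumps))
boosted_pred = np.where(score >= 0, 1, -1)
boosted_accuracy = (boosted_pred == ya_val).mean()
print('First stump accuracy:', stumps[0].score(Xa_val, ya_val))
print('Boosted accuracy:', boosted_accuracy)
```
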
## Gradient Boosting

Gradient boosting is another boosting technique, one that applies gradient descent to the process of adding trees to the model. Each new tree is fit to the negative gradient of the loss function with respect to the current ensemble's predictions, so adding it moves the ensemble a step toward lower loss. The popular XGBoost library is built on this process.
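
To make the idea concrete, here is a minimal hand-rolled sketch for regression with squared-error loss, where the negative gradient is simply the residual (target minus current prediction). The synthetic sine data, the learning rate, and the variable names are assumptions for illustration. XGBoost itself adds regularization and other refinements that are not shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption for illustration)
rng = np.random.default_rng(0)
Xg = rng.uniform(-3, 3, size=(300, 1))
yg = np.sin(Xg[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_trees = 100

# Start from a constant model: the mean of the target
prediction = np.full_like(yg, yg.mean())
mse_history = [np.mean((yg - prediction) ** 2)]
boosted_trees = []
for _ in range(n_trees):
    # For the loss 0.5 * (y - F)^2, the negative gradient is the residual y - F
    residuals = yg - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(Xg, residuals)

    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(Xg)
    boosted_trees.append(tree)
    mse_history.append(np.mean((yg - prediction) ** 2))

print(f'Training MSE before boosting: {mse_history[0]:.4f}')
print(f'Training MSE after {n_trees} trees: {mse_history[-1]:.4f}')
```

Each tree corrects a fraction of what the ensemble still gets wrong, which is why a smaller learning rate usually needs more trees.
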

In the next section, we'll implement the two boosting methods described above.

First, we'll use the AdaBoost classifier in scikit-learn and then compare that to the results from the XGBoost scikit-learn API. 

In [29]:
# Load in libraries, data
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Create X, y and training/test sets
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Import the classifier
from sklearn.ensemble import AdaBoostClassifier

ada_classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1.5, random_state=42)
ada_classifier.fit(X_train,y_train)

print('Validation Accuracy: Adaboost', ada_classifier.score(X_test, y_test))

Validation Accuracy: Adaboost 0.9666666666666667


The classifier performed very well, but this data set is intended to be easy to classify. We set the train-test split at 60/40 so the classifier was "challenged" a little more with a smaller training set.

Now we'll try to classify the same data with a different boosted model: xgboost. If you are running your code locally, you'll need to have xgboost installed. If you are using Colab, then you are ready to boost!

In [30]:
# Load xgboost and fit the model
from xgboost import XGBClassifier

xg_classifier = XGBClassifier(n_estimators=50, random_state=42)

xg_classifier.fit(X_train,y_train)

print('Validation Accuracy: XGBoost', xg_classifier.score(X_test, y_test))

Validation Accuracy: XGBoost 0.9833333333333333
