## Gradient boosted trees importance

As with feature importance derived from random forests, you can select features based on the importance derived from gradient boosted trees. You can do this in one go or recursively, depending on how much time you have, how many features the dataset contains, and whether they are correlated.

I will demonstrate how to select features using the importance derived from gradient boosted trees with sklearn on a classification problem, using the Paribas claims dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.metrics import roc_auc_score

In [3]:
import os
os.chdir('C:\\Users\\obaid\\OneDrive\\Documents')

In [4]:
# load dataset
data = pd.read_csv("../Documents/DataSets/Paribus.csv", nrows=50000)
data.shape

(50000, 133)

In [5]:
data.head()

Unnamed: 0,ID,target,v1,v2,v3,v4,v5,v6,v7,v8,...,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131
0,3,1,1.335739,8.727474,C,3.921026,7.915266,2.599278,3.176895,0.012941,...,8.0,1.98978,0.035754,AU,1.804126,3.113719,2.024285,0,0.636365,2.857144
1,4,1,,,C,,9.191265,,,2.30163,...,,,0.598896,AF,,,1.957825,0,,
2,5,1,0.943877,5.310079,C,4.410969,5.326159,3.979592,3.928571,0.019645,...,9.333333,2.477596,0.013452,AE,1.773709,3.922193,1.120468,2,0.883118,1.176472
3,6,1,0.797415,8.304757,C,4.22593,11.627438,2.0977,1.987549,0.171947,...,7.018256,1.812795,0.002267,CJ,1.41523,2.954381,1.990847,1,1.677108,1.034483
4,8,1,,,C,,,,,,...,,,,Z,,,,0,,


In [6]:
# In practice, feature selection should be done after data pre-processing,
# so ideally, all the categorical variables are encoded into numbers,
# and then you can assess how predictive they are of the target

# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(50000, 114)
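
A more compact way to pick the numeric columns is to pass `include="number"` to `select_dtypes`. The sketch below uses a tiny made-up DataFrame (the column names and values are illustrative assumptions, not the Paribas data). Note that `"number"` also matches numeric dtypes that are not in the explicit list above, such as `int8`.

```python
import pandas as pd

# toy frame: two numeric columns and one categorical column (hypothetical data)
toy = pd.DataFrame({
    "v1": [1.3, 0.9, 0.8],
    "v3": ["C", "C", "C"],
    "v129": [0, 2, 1],
})

# "number" matches every numeric dtype, so no explicit dtype list is needed
toy_numeric = toy.select_dtypes(include="number")
numeric_cols = list(toy_numeric.columns)
print(numeric_cols)
```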

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids overfitting, because no information from the test set leaks into the selection.

In [7]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target', 'ID'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 112), (15000, 112))

In [8]:
# first I will select all the features in one go,
# by examining their importance after fitting
# a single gradient boosting model (100 trees by default)

sel_ = SelectFromModel(GradientBoostingClassifier())
sel_.fit(X_train.fillna(0), y_train)

SelectFromModel(estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                     init=None,
                                                     learning_rate=0.1,
                                                     loss='deviance',
                                                     max_depth=3,
                                                     max_features=None,
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                            

In [9]:
# get the names of the selected features and count them
selected_feat = X_train.columns[(sel_.get_support())]
len(selected_feat)

9

In [10]:
selected_feat

Index(['v10', 'v14', 'v21', 'v34', 'v38', 'v40', 'v50', 'v114', 'v129'], dtype='object')
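
It may help to see why `SelectFromModel` kept only a handful of columns. When no `threshold` is given and the estimator has no L1 penalty, sklearn uses the mean importance as the cut-off and keeps the features at or above it. The sketch below checks this on synthetic data from `make_classification`; the dataset, sizes and variable names are assumptions for illustration, not the Paribas data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in for the training set (assumption, not the Paribas data)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

sel_toy = SelectFromModel(
    GradientBoostingClassifier(n_estimators=50, random_state=0))
sel_toy.fit(X_toy, y_toy)

# importances of the fitted model, and the mask built by hand from their mean
importances = sel_toy.estimator_.feature_importances_
manual_mask = importances >= importances.mean()

print('threshold used:', sel_toy.threshold_)
print('features kept:', sel_toy.get_support().sum())
print('same as manual mask:', np.array_equal(sel_toy.get_support(), manual_mask))
```

Passing an explicit value, for example `threshold='median'`, changes how many features survive the cut.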

In [11]:
# next I will select features recursively for comparison

sel_ = RFE(GradientBoostingClassifier(), n_features_to_select=len(selected_feat))
sel_.fit(X_train.fillna(0), y_train)

KeyboardInterrupt: 

In [10]:
# get the names of the features selected by RFE and count them
selected_feat_rfe = X_train.columns[(sel_.get_support())]
len(selected_feat_rfe)

19

In [11]:
selected_feat_rfe

Index(['v6', 'v10', 'v12', 'v14', 'v21', 'v34', 'v38', 'v40', 'v50', 'v59',
       'v73', 'v77', 'v88', 'v90', 'v114', 'v123', 'v127', 'v129', 'v130'],
      dtype='object')
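
To make the recursive procedure concrete: `RFE` fits the model, drops the least important feature(s), and repeats until `n_features_to_select` remain. Here is a minimal sketch on synthetic data (the dataset, sizes and names are illustrative assumptions). The `ranking_` attribute records the elimination order, with 1 marking the features that were kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# small synthetic problem so the repeated fits finish quickly (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0)

rfe_toy = RFE(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    n_features_to_select=4,
    step=1)  # remove one feature per round
rfe_toy.fit(X_toy, y_toy)

print('kept:', rfe_toy.support_)
print('ranking:', rfe_toy.ranking_)
```

A larger `step` removes several features per round, which cuts the number of model fits at the cost of a coarser search.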

In [12]:
# create a function to build gradient boosted trees
# and compare performance in train and test set


def run_gradientboosting(X_train, X_test, y_train, y_test):
    # fit a gradient boosting classifier and report roc-auc on both sets
    gbc = GradientBoostingClassifier(
        n_estimators=200, random_state=39, max_depth=4)
    gbc.fit(X_train, y_train)
    print('Train set')
    pred = gbc.predict_proba(X_train)
    print('Gradient Boosted Trees roc-auc: {}'.format(
        roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = gbc.predict_proba(X_test)
    print('Gradient Boosted Trees roc-auc: {}'.format(
        roc_auc_score(y_test, pred[:, 1])))

In [14]:
# features selected recursively
run_gradientboosting(X_train[selected_feat_rfe].fillna(0),
                     X_test[selected_feat_rfe].fillna(0),
                     y_train, y_test)

Train set
Gradient Boosted Trees roc-auc: 0.7833430666890429
Test set
Gradient Boosted Trees roc-auc: 0.7120634146957345

In [15]:
# features selected altogether
run_gradientboosting(X_train[selected_feat].fillna(0),
                     X_test[selected_feat].fillna(0),
                     y_train, y_test)

Train set
Gradient Boosted Trees roc-auc: 0.7837428543451019
Test set
Gradient Boosted Trees roc-auc: 0.712377229686829


As with feature selection based on random forest importance, the recursive procedure gave no advantage over selecting the features in one go: the test roc-auc is about 0.712 in both cases. It also took substantially longer to compute.
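
A rough way to see where the time goes is to count model fits. This assumes `RFE` refits once per elimination round plus once on the final feature set, which is my reading of sklearn's behaviour; the helper `rfe_fit_count` is hypothetical. Under that assumption, the recursive run with 112 columns and `n_features_to_select=9` needs many more fits than the single fit used by `SelectFromModel`:

```python
import math


def rfe_fit_count(n_features, n_to_select, step=1):
    """Estimated number of estimator fits performed by RFE (hypothetical helper)."""
    elimination_rounds = math.ceil((n_features - n_to_select) / step)
    return elimination_rounds + 1  # +1 for the final fit on the kept features


fits_step_1 = rfe_fit_count(112, 9)            # one feature removed per round
fits_step_10 = rfe_fit_count(112, 9, step=10)  # ten features removed per round
print(fits_step_1, fits_step_10)
```

Each of those fits trains a full gradient boosting model, so the cost grows roughly in proportion to the number of rounds.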

That is all for this lecture, I hope you enjoyed it and see you in the next one!