The purpose of this notebook is to investigate the relative importance of the features used to train the XGBoost model. Over 1,200 engineered features were passed into the model, but only about 600 were used at any point. I want to see how many of the original 188 features contributed at least one engineered feature that the model used. It may be worth removing any original features with zero contributions to cut down on the time spent engineering features and training the model.

In [1]:
%autosave 0

Autosave disabled


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Load the per-feature importance scores; the feature names become the index
df = pd.read_csv('model8_feature_scores.csv', index_col=0)
df.head()

0,weight,gain,cover,total_gain,total_cover
P_2_min,24.0,1046.174927,3564.391602,25108.199219,85545.398438
P_2_max,3.0,16.868271,1446.203491,50.604813,4338.610352
P_2_median,14.0,378.496765,2358.535889,5298.95459,33019.503906
P_2_std,1.0,10.919077,131.254517,10.919077,131.254517
P_2_last,57.0,606.084595,2461.301758,34546.820312,140294.203125
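
The five columns match the importance types accepted by XGBoost's `Booster.get_score`, which leaves out features that were never used in a split. That would explain why only about 600 of the 1,200+ features appear in this file. How the CSV was produced is not shown in this notebook, so the following is only a sketch of one way such a table could be assembled. The helper name `collect_feature_scores` is hypothetical, and `booster` is assumed to be a trained `xgboost.Booster`.

```python
IMPORTANCE_TYPES = ["weight", "gain", "cover", "total_gain", "total_cover"]


def collect_feature_scores(booster):
    """Return {feature: {importance_type: score}}.

    Assumption: `booster` exposes get_score(importance_type=...) the way
    xgboost.Booster does, returning a {feature_name: score} dict.
    """
    scores = {}
    for importance_type in IMPORTANCE_TYPES:
        per_feature = booster.get_score(importance_type=importance_type)
        for feature, value in per_feature.items():
            # Create the inner dict on first sight of a feature, then fill it in.
            scores.setdefault(feature, {})[importance_type] = value
    return scores
```

Passing the result to `pd.DataFrame.from_dict(scores, orient='index')` would give one row per used feature and one column per importance type, similar to the table above.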


In [5]:
df.shape

(619, 5)

In [6]:
df.index

Index(['P_2_min', 'P_2_max', 'P_2_median', 'P_2_std', 'P_2_last', 'P_2_change',
       'D_39_min', 'D_39_max', 'D_39_std', 'D_39_last',
       ...
       'D_64_O', 'S_3_nulls', 'D_42_nulls', 'D_49_nulls', 'D_56_nulls',
       'B_29_nulls', 'D_106_nulls', 'R_26_nulls', 'R_27_nulls', 'D_137_nulls'],
      dtype='object', name='0', length=619)

In [14]:
# Move the feature names out of the index into a regular column named '0'
df.reset_index(inplace=True)

In [19]:
# Engineered names look like P_2_min; split them into their three parts
df_split = df['0'].str.split('_', expand=True)

In [20]:
df_split.head()

Unnamed: 0,0,1,2
0,P,2,min
1,P,2,max
2,P,2,median
3,P,2,std
4,P,2,last


In [22]:
df_split.columns

RangeIndex(start=0, stop=3, step=1)

In [23]:
# Rebuild the original feature name (e.g. 'P_2') from the first two parts
df_split['original_feature'] = df_split[0] + '_' + df_split[1]

In [24]:
df_split.head()

Unnamed: 0,0,1,2,original_feature
0,P,2,min,P_2
1,P,2,max,P_2
2,P,2,median,P_2
3,P,2,std,P_2
4,P,2,last,P_2
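
The same mapping can be done in one step by matching the letter-underscore-number prefix directly, which also flags any name that does not fit the pattern instead of silently producing a wrong prefix. This is a minimal sketch, assuming every original feature is one uppercase letter, an underscore, and digits (true of the names shown in this notebook). `original_feature` is a hypothetical helper, not part of the notebook.

```python
import re

# One uppercase letter, an underscore, digits, then the suffix separator.
_PREFIX = re.compile(r"^([A-Z]_\d+)_")


def original_feature(engineered_name):
    """Return the original feature name, e.g. 'P_2_min' -> 'P_2'."""
    match = _PREFIX.match(engineered_name)
    if match is None:
        raise ValueError(f"unexpected feature name: {engineered_name!r}")
    return match.group(1)


names = ["P_2_min", "D_39_last", "D_64_O", "S_3_nulls"]
print([original_feature(name) for name in names])
```

Applied to the notebook's frame, this would be `df['0'].map(original_feature)`.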


In [27]:
df_split.original_feature.unique()

array(['P_2', 'D_39', 'B_1', 'B_2', 'R_1', 'S_3', 'D_41', 'B_3', 'D_42',
       'D_43', 'D_44', 'B_4', 'D_45', 'B_5', 'R_2', 'D_46', 'D_47',
       'D_48', 'D_49', 'B_6', 'B_7', 'B_8', 'D_50', 'D_51', 'B_9', 'R_3',
       'D_52', 'P_3', 'B_10', 'D_53', 'S_5', 'B_11', 'S_6', 'D_54', 'R_4',
       'S_7', 'B_12', 'S_8', 'D_55', 'D_56', 'B_13', 'R_5', 'D_58', 'S_9',
       'B_14', 'D_59', 'D_60', 'D_61', 'B_15', 'S_11', 'D_62', 'D_65',
       'B_16', 'B_17', 'B_18', 'B_19', 'B_20', 'S_12', 'R_6', 'S_13',
       'B_21', 'D_69', 'B_22', 'D_70', 'D_71', 'D_72', 'S_15', 'B_23',
       'P_4', 'D_74', 'D_75', 'B_24', 'R_7', 'D_77', 'B_25', 'B_26',
       'D_78', 'D_79', 'R_8', 'R_9', 'S_16', 'D_80', 'R_10', 'R_11',
       'B_27', 'D_81', 'D_82', 'S_17', 'R_12', 'B_28', 'R_13', 'R_14',
       'R_15', 'D_84', 'R_16', 'B_29', 'S_18', 'D_86', 'R_18', 'D_88',
       'S_19', 'R_19', 'B_32', 'S_20', 'R_20', 'R_21', 'B_33', 'D_89',
       'R_22', 'R_23', 'D_91', 'D_92', 'D_93', 'D_94', 'R_24', 'R_25',
 

In [41]:
relevant_columns = list(df_split.original_feature.unique())

In [28]:
df_split.original_feature.nunique()

167

In [30]:
df.sort_values('total_gain')

Unnamed: 0,0,weight,gain,cover,total_gain,total_cover
172,B_12_change,1.0,0.004311,15.389643,0.004311,15.389643
234,D_62_change,1.0,0.291788,13.999965,0.291788,13.999965
144,D_53_median,1.0,0.383034,8.202533,0.383034,8.202533
482,D_103_max,1.0,0.943039,20.933578,0.943039,20.933578
365,D_81_change,1.0,1.111147,11.692972,1.111147,11.692972
...,...,...,...,...,...,...
250,B_18_last,4.0,1508.544678,3608.366699,6034.178711,14433.466797
120,B_9_max,4.0,1725.645752,3515.094238,6902.583008,14060.376953
122,B_9_last,5.0,1930.199829,4813.571289,9650.999023,24067.855469
0,P_2_min,24.0,1046.174927,3564.391602,25108.199219,85545.398438
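
Sorting individual engineered features shows the extremes, but the question in this notebook is about original features, so it can help to sum `total_gain` over all engineered features derived from the same original column. Below is a minimal sketch using a handful of rows copied from the outputs above, so the sums are partial and only illustrate the mechanics. The helper name `gain_by_original` is hypothetical.

```python
from collections import defaultdict

# (engineered feature, total_gain) pairs copied from the outputs above
rows = [
    ("P_2_min", 25108.199219),
    ("P_2_max", 50.604813),
    ("P_2_median", 5298.95459),
    ("P_2_std", 10.919077),
    ("P_2_last", 34546.820312),
    ("B_9_max", 6902.583008),
    ("B_9_last", 9650.999023),
]


def gain_by_original(rows):
    """Sum total_gain per original feature, largest first."""
    totals = defaultdict(float)
    for name, total_gain in rows:
        # The first two underscore-separated parts form the original name.
        original = "_".join(name.split("_")[:2])
        totals[original] += total_gain
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for original, total in gain_by_original(rows):
    print(f"{original}: {total:.1f}")
```

With the notebook's own frames, the equivalent would be `df.groupby(df_split['original_feature'])['total_gain'].sum().sort_values(ascending=False)`.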


In [32]:
# Only the column names are needed, so read just the first few rows
df_train = pd.read_csv('../../data/prepared/train_data.csv', nrows=10)

In [33]:
df_train.head()

Unnamed: 0,customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,...,D_136,D_137,D_138,D_139,D_140,D_141,D_142,D_143,D_144,D_145
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-03-09,0.938469,0.001733,0.008724,1.006838,0.009228,0.124035,0.008771,0.004709,...,,,,0.002427,0.003706,0.003818,,0.000569,0.00061,0.002674
1,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-04-07,0.936665,0.005775,0.004923,1.000653,0.006151,0.12675,0.000798,0.002714,...,,,,0.003954,0.003167,0.005032,,0.009576,0.005492,0.009217
2,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-05-28,0.95418,0.091505,0.021655,1.009672,0.006815,0.123977,0.007598,0.009423,...,,,,0.003269,0.007329,0.000427,,0.003429,0.006986,0.002603
3,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-06-13,0.960384,0.002455,0.013683,1.0027,0.001373,0.117169,0.000685,0.005531,...,,,,0.006117,0.004516,0.0032,,0.008419,0.006527,0.0096
4,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-07-16,0.947248,0.002483,0.015193,1.000727,0.007605,0.117325,0.004653,0.009312,...,,,,0.003671,0.004946,0.008889,,0.00167,0.008126,0.009827


In [48]:
# Original feature columns: everything except the customer ID and the date column S_2
all_columns = list(df_train.drop(columns=['customer_ID', 'S_2']).columns)

In [49]:
# Print every original feature that has no engineered feature among those the model used
for column in all_columns:
    if column not in relevant_columns:
        print(column)

D_68
D_73
D_76
D_83
B_30
D_87
R_17
B_31
S_22
D_104
B_38
D_108
D_109
D_114
D_116
D_117
D_126
B_42
D_134
D_135
D_138
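
The loop above prints the unused columns but does not keep them. A small variation stores the result and adds a sanity check that used plus unused equals the total (167 + 21 = 188 in this notebook). This is a minimal sketch on a few sample names taken from the outputs above; `unused_features` is a hypothetical helper.

```python
def unused_features(all_columns, used_columns):
    """Return the columns absent from used_columns, in their original order."""
    used = set(used_columns)  # set membership is O(1) per lookup
    return [column for column in all_columns if column not in used]


# Small sample of names from this notebook, not the full column list
all_columns = ["P_2", "D_39", "D_68", "B_1", "B_30"]
used_columns = ["P_2", "D_39", "B_1"]

unused = unused_features(all_columns, used_columns)
print(unused)

# Every column is either used or unused, so the counts must add up.
assert len(unused) + len(set(used_columns)) == len(all_columns)
```

In the notebook itself, the call would be `unused_features(all_columns, relevant_columns)`.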


As the output shows, 167 of the 188 original features contributed at least one engineered feature that the model used. The remaining 21 "irrelevant" features make up a small portion of the overall data, so I think it's worth keeping them in the dataset. Little time is spent engineering features from these 21 columns, and other training runs may make use of them at some point.