## 10-Way Feature Selection
- Goal: select 50 features from 136 with each method
- xxx_support: boolean list marking whether each feature is selected
- xxx_feature: names of the selected features

### Methods:
1. Filter: Pearson, f_classif (Anova F value)
2. Wrapper: RFE with Logistic regression and XGBoost
3. Embedded: Logistic Regression, Random Forest, XGBoost, LassoCV, RidgeClassifierCV, LinearSVC

Adapted from: https://www.kaggle.com/sz8416/6-ways-for-feature-selection

In [7]:
import pandas as pd
import numpy as np

In [3]:
model_data = pd.read_csv('all_model_data.csv', index_col = 0)
#model_data.head()

In [4]:
# select X and y 
X = model_data.drop('Revenue', axis = 1)
feature_name = X.columns.tolist()
y = model_data.Revenue

### 1 Filter
#### 1.1 Pearson Correlation

In [5]:
def cor_selector(X, y):
    cor_list = []
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # names of the 50 features with the largest absolute correlation
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-50:]].columns.tolist()
    # boolean mask over the columns of X: True if selected, False otherwise
    cor_support = [name in cor_feature for name in X.columns]
    return cor_support, cor_feature

In [8]:
cor_support, cor_feature = cor_selector(X, y)
print(str(len(cor_feature)), 'selected features')

50 selected features
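The loop in `cor_selector` can also be written with pandas' `DataFrame.corrwith`. The following is a minimal sketch on synthetic data, not the notebook's dataset; the function name `cor_selector_vectorized`, the parameter `k`, and all `demo_` variables are made up for illustration.

```python
import numpy as np
import pandas as pd

def cor_selector_vectorized(X, y, k):
    """Return (boolean mask, names) of the k features most correlated with y."""
    # Pearson correlation of every column with y; zero-variance columns
    # produce NaN, which is treated as "no correlation"
    cor = X.corrwith(y).fillna(0.0)
    # names of the k features with the largest absolute correlation
    top = cor.abs().nlargest(k).index
    support = X.columns.isin(top)
    return support, X.columns[support].tolist()

# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
demo_X["const"] = 1.0  # constant column, correlation is undefined
demo_y = pd.Series(2 * demo_X["a"] - 3 * demo_X["c"] + rng.normal(scale=0.1, size=200))

demo_support, demo_feature = cor_selector_vectorized(demo_X, demo_y, k=2)
print(demo_feature)
```

Passing `k` as an argument rather than hard-coding 50 makes the same function reusable if the target number of features changes.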


#### 1.2 f_classif
- documentation for SelectKBest: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

In [9]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
f_classif_selector = SelectKBest(f_classif, k=50)
f_classif_selector.fit(X, y)

SelectKBest(k=50, score_func=<function f_classif at 0x7fc20d274bf8>)

In [10]:
f_classif_support = f_classif_selector.get_support()
f_classif_feature = X.loc[:,f_classif_support].columns.tolist()
print(str(len(f_classif_feature)), 'selected features')

50 selected features
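After fitting, `SelectKBest` exposes the per-feature F statistics and p-values through its `scores_` and `pvalues_` attributes, which can help judge how clear the cut at `k` is. A small sketch on synthetic data (the `demo_` names and the data are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic binary target; only feature f0 depends on the class
rng = np.random.default_rng(0)
demo_y = rng.integers(0, 2, size=300)
demo_X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
demo_X["f0"] += 2.0 * demo_y

demo_selector = SelectKBest(f_classif, k=1).fit(demo_X, demo_y)

# Rank the features by their ANOVA F value
demo_scores = pd.DataFrame(
    {"F": demo_selector.scores_, "p_value": demo_selector.pvalues_},
    index=demo_X.columns,
).sort_values("F", ascending=False)
print(demo_scores)
```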


### 2. Wrapper
- documentation for RFE: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
- logistic regression and xgboost

#### 2.1 RFE - Logistic Regression
RFE notes:
- estimator = the model that is refitted at each iteration to rank the features
- max_iter = maximum number of solver iterations for the estimator (raised so the model converges); random_state is set for reproducibility
- step = how many features to remove at each iteration
- verbose = controls how much progress output is printed (doesn't change the model)

In [12]:
# packages you would need 
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(max_iter = 1500,random_state=123), step = 10, n_features_to_select=50,
                   verbose=0)
rfe_selector.fit(X, y)

RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                 fit_intercept=True, intercept_scaling=1,
                                 l1_ratio=None, max_iter=1500,
                                 multi_class='auto', n_jobs=None, penalty='l2',
                                 random_state=123, solver='lbfgs', tol=0.0001,
                                 verbose=0, warm_start=False),
    n_features_to_select=50, step=10, verbose=0)

In [13]:
rfe_support = rfe_selector.get_support() # Get a mask, or integer index, of the features selected
rfe_feature = X.loc[:,rfe_support].columns.tolist() # get the column names of features selected and put them in a list
print(str(len(rfe_feature)), 'selected features') 

50 selected features
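To see what `step` does, the fitted selector's `ranking_` attribute can be inspected: selected features get rank 1, and features eliminated in earlier rounds get higher ranks. This is a sketch on synthetic data from `make_classification`, with smaller numbers than the notebook uses; the `demo_` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, 5 of them informative
demo_X, demo_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Remove up to 4 features per iteration until 5 remain
demo_rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    step=4,
).fit(demo_X, demo_y)

print(demo_rfe.n_features_)            # number of features kept
print(np.bincount(demo_rfe.ranking_))  # how many features share each rank
```

A larger `step` makes the elimination faster but coarser, because several features are dropped on the basis of a single fit.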


#### 2.2 RFE - XGBoost

In [16]:
from xgboost import XGBClassifier
rfe_selector_xgboost = RFE(estimator=XGBClassifier(random_state=123), n_features_to_select=50, step=10, verbose=0)
rfe_selector_xgboost.fit(X, y)

RFE(estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                            colsample_bylevel=1, colsample_bynode=1,
                            colsample_bytree=1, gamma=0, learning_rate=0.1,
                            max_delta_step=0, max_depth=3, min_child_weight=1,
                            missing=None, n_estimators=100, n_jobs=1,
                            nthread=None, objective='binary:logistic',
                            random_state=123, reg_alpha=0, reg_lambda=1,
                            scale_pos_weight=1, seed=None, silent=None,
                            subsample=1, verbosity=1),
    n_features_to_select=50, step=10, verbose=0)

In [17]:
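# get the selection mask and the names of the selected features
rfe_support_xgboost = rfe_selector_xgboost.get_support()
rfe_feature_xgboost = X.loc[:,rfe_support_xgboost].columns.tolist()
# count the selected names, not the mask (the mask has one entry per input feature)
print(str(len(rfe_feature_xgboost)), 'selected features')

50 selected features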


### 3. Embedded
- documentation for SelectFromModel: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html 

#### 3.1 Logistic Regression

In [18]:
from sklearn.feature_selection import SelectFromModel
#from sklearn.linear_model import LogisticRegression
# penalty="l2" is the default regularization type
# threshold = minimum absolute coefficient a feature needs to be kept (tuned so that approx. 50 features are selected)
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l2", random_state = 123, max_iter=1000), threshold = 0.2)
embeded_lr_selector.fit(X, y)

SelectFromModel(estimator=LogisticRegression(C=1.0, class_weight=None,
                                             dual=False, fit_intercept=True,
                                             intercept_scaling=1, l1_ratio=None,
                                             max_iter=1000, multi_class='auto',
                                             n_jobs=None, penalty='l2',
                                             random_state=123, solver='lbfgs',
                                             tol=0.0001, verbose=0,
                                             warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=0.2)

In [19]:
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

50 selected features
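Instead of tuning `threshold` by hand until about 50 features remain, `SelectFromModel` also accepts `max_features`; combined with `threshold=-np.inf` it keeps exactly the top-ranked features. A minimal sketch on synthetic data (the `demo_` names and the choice of 8 features are assumptions for illustration, not the notebook's settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, 5 of them informative
demo_X, demo_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# threshold=-np.inf disables the importance cut-off, so only
# max_features decides how many features are kept
demo_selector = SelectFromModel(
    LogisticRegression(max_iter=1000),
    threshold=-np.inf,
    max_features=8,
).fit(demo_X, demo_y)

demo_support = demo_selector.get_support()
print(demo_support.sum(), "selected features")
```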


#### 3.2 Random Forest

In [20]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# n_estimators = The number of trees in the forest (10-100)
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state = 123), threshold=0.00775)
embeded_rf_selector.fit(X, y)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                                 class_weight=None,
                                                 criterion='gini',
                                                 max_depth=None,
                                                 max_features='auto',
                                                 max_leaf_nodes=None,
                                                 max_samples=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=50, n_jobs=None,
                                                 oob_score=False,


In [21]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')

50 selected features


#### 3.3 XGBoost

In [22]:
embeded_xgb_selector = SelectFromModel(XGBClassifier(n_estimators=50, random_state = 123))
embeded_xgb_selector.fit(X, y)

SelectFromModel(estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                        colsample_bylevel=1, colsample_bynode=1,
                                        colsample_bytree=1, gamma=0,
                                        learning_rate=0.1, max_delta_step=0,
                                        max_depth=3, min_child_weight=1,
                                        missing=None, n_estimators=50, n_jobs=1,
                                        nthread=None,
                                        objective='binary:logistic',
                                        random_state=123, reg_alpha=0,
                                        reg_lambda=1, scale_pos_weight=1,
                                        seed=None, silent=None, subsample=1,
                                        verbosity=1),
                max_features=None, norm_order=1, prefit=False, threshold=None)

In [23]:
embeded_xgb_support = embeded_xgb_selector.get_support()
embeded_xgb_feature = X.loc[:,embeded_xgb_support].columns.tolist()
print(str(len(embeded_xgb_feature)), 'selected features')

22 selected features
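No `threshold` is passed to `SelectFromModel` here, which is why the count is not 50: for estimators that expose `feature_importances_` and have no L1 penalty, scikit-learn falls back to the mean importance as the cut-off. The sketch below illustrates that default with a `RandomForestClassifier` on synthetic data (swapped in for XGBoost so the example only depends on scikit-learn; the `demo_` names are illustrative).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, 5 of them informative
demo_X, demo_y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# No threshold given: the default for this estimator is "mean"
demo_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(demo_X, demo_y)

demo_importances = demo_selector.estimator_.feature_importances_
print(demo_selector.threshold_, demo_importances.mean())

# Features at or above the mean importance are kept
demo_n_selected = int(demo_selector.get_support().sum())
print(demo_n_selected, "selected features")
```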


#### 3.4 LassoCV
- Lasso linear model with iterative fitting along a regularization path (built-in cross validation)
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html

In [24]:
from sklearn.linear_model import LassoCV
# cv = number of folds (k) for cross-validation
embeded_lasso_selector = SelectFromModel(LassoCV(random_state = 123, cv = 10, max_iter = 2000),threshold = 0.0001)
embeded_lasso_selector.fit(X, y)

SelectFromModel(estimator=LassoCV(alphas=None, copy_X=True, cv=10, eps=0.001,
                                  fit_intercept=True, max_iter=2000,
                                  n_alphas=100, n_jobs=None, normalize=False,
                                  positive=False, precompute='auto',
                                  random_state=123, selection='cyclic',
                                  tol=0.0001, verbose=False),
                max_features=None, norm_order=1, prefit=False,
                threshold=0.0001)

In [25]:
embeded_lasso_support = embeded_lasso_selector.get_support()
embeded_lasso_feature = X.loc[:,embeded_lasso_support].columns.tolist()
print(str(len(embeded_lasso_feature)), 'selected features')

41 selected features


#### 3.5 Ridge Classifier CV
- Ridge classifier with built-in cross-validation
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifierCV.html#sklearn.linear_model.RidgeClassifierCV

In [26]:
from sklearn.linear_model import RidgeClassifierCV
embeded_ridge_selector = SelectFromModel(RidgeClassifierCV(cv=10), threshold =0.059)
embeded_ridge_selector.fit(X, y)

SelectFromModel(estimator=RidgeClassifierCV(alphas=array([ 0.1,  1. , 10. ]),
                                            class_weight=None, cv=10,
                                            fit_intercept=True, normalize=False,
                                            scoring=None,
                                            store_cv_values=False),
                max_features=None, norm_order=1, prefit=False, threshold=0.059)

In [27]:
embeded_ridge_support = embeded_ridge_selector.get_support()
embeded_ridge_feature = X.loc[:,embeded_ridge_support].columns.tolist()
print(str(len(embeded_ridge_feature)), 'selected features')

50 selected features


#### 3.6 Linear SVC
- https://scikit-learn.org/stable/modules/feature_selection.html

In [28]:
from sklearn.svm import LinearSVC
embeded_svc_selector = SelectFromModel(LinearSVC(C=0.5, penalty='l1', dual=False, max_iter = 5000),threshold = 0.001)
embeded_svc_selector.fit(X, y)

embeded_svc_support = embeded_svc_selector.get_support()
embeded_svc_feature = X.loc[:,embeded_svc_support].columns.tolist()
print(str(len(embeded_svc_feature)), 'selected features')

50 selected features




## Summary
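- The table contains several features derived from the same underlying variable (e.g. ProductRelated_Duration_Scaled and ProductRelated_Duration_Norm_Scaled)
- When modeling, use only one version of each such feature (the better one)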

In [29]:
pd.set_option('display.max_rows', 100)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support,'f_classif':f_classif_support,
    'RFE-Log':rfe_support,'RFE-XGBoost': rfe_support_xgboost,'Logistics':embeded_lr_support,'LassoCV':embeded_lasso_support,
    'RidgeClassifierCV':embeded_ridge_support,'Random Forest':embeded_rf_support,'XGBoost':embeded_xgb_support,
    'LinearSVC':embeded_svc_support})
# count how many methods selected each feature (sum over the boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# sort by number of selections, then display the top 65
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(65)

| | Feature | Pearson | f_classif | RFE-Log | RFE-XGBoost | Logistics | LassoCV | RidgeClassifierCV | Random Forest | XGBoost | LinearSVC | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ProductRelated_Duration_Scaled | True | True | True | True | True | False | True | True | True | True | 9 |
| 2 | PageValues_Norm_Scaled | True | True | True | True | True | True | True | True | False | True | 9 |
| 3 | ExitPageRatio_Norm_Scaled | True | True | True | True | True | True | True | True | False | True | 9 |
| 4 | BounceRates_Scaled | True | True | True | True | True | True | True | True | True | False | 9 |
| 5 | BouncePageRatio_Norm_Scaled | True | True | True | True | True | True | True | True | False | True | 9 |
| 6 | BounceExitW4_Norm_Scaled | True | True | True | True | True | True | True | True | False | True | 9 |
| 7 | ProductRelated_Duration_Norm_Scaled | True | True | True | False | True | True | True | True | False | True | 8 |
| 8 | ProdRelPageRatio_Scaled_Bin | True | True | True | False | True | True | True | True | False | True | 8 |
| 9 | ProdRelExitRatio_Norm_Scaled | True | True | True | False | True | True | True | True | False | True | 8 |
| 10 | PageValues_Scaled_Bin | True | True | True | False | True | True | True | True | False | True | 8 |
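One way to turn a vote table like this into a final feature list is to keep every feature chosen by at least a given number of methods. The sketch below uses a tiny hand-made table; the feature names, the three method columns, and the `min_votes` cut-off are all hypothetical and not taken from the results above.

```python
import pandas as pd

# Tiny stand-in for the vote table (hypothetical features and votes)
demo_df = pd.DataFrame({
    "Feature": ["feat_a", "feat_b", "feat_c"],
    "Pearson": [True, True, False],
    "RFE-Log": [True, False, False],
    "Random Forest": [True, True, False],
})

# Sum the boolean method columns only, leaving the name column out
method_cols = demo_df.columns.drop("Feature")
demo_df["Total"] = demo_df[method_cols].sum(axis=1)

# Keep features selected by at least min_votes methods
min_votes = 2
consensus_features = demo_df.loc[demo_df["Total"] >= min_votes, "Feature"].tolist()
print(consensus_features)
```

Where two consensus features are derived from the same underlying variable, one of them would still need to be dropped manually.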


In [None]:
# save into csv
#feature_selection_df.to_csv('feature_selection.csv')