## Basic methods plus correlation + ANOVA with Feature-engine and Scikit-learn

In this notebook, we apply basic filter methods to remove constant, quasi-constant and duplicated features, then remove correlated features, and finally select features with univariate ANOVA, combining Feature-engine and Scikit-learn in a single workflow.

In [22]:
# import libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

# to get the univariate ANOVA p-values
from sklearn.feature_selection import f_classif, SelectKBest

from feature_engine.selection import (
    DropConstantFeatures,
    DropDuplicateFeatures,
    SmartCorrelatedSelection 
)

In [2]:
# load data

# forward slashes avoid invalid escape sequences such as '\p' and '\d'
data = pd.read_csv('../precleaned-datasets/dataset_1.csv')
data.head()

,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_292,var_293,var_294,var_295,var_296,var_297,var_298,var_299,var_300,target
0,0,0,0.0,0.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
1,0,0,0.0,3.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
2,0,0,0.0,5.88,0.0,0,0,0,0,0,...,0.0,0,0,3,0,0,0,0.0,67772.7216,0
3,0,0,0.0,14.1,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
4,0,0,0.0,5.76,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 301 entries, var_1 to target
dtypes: float64(127), int64(174)
memory usage: 114.8 MB


In [4]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [5]:
# create a copy of the original data
X_train_original = X_train.copy()
X_test_original = X_test.copy()

## Remove constant and quasi-constant features

In [9]:
# set up the Feature-engine transformer that drops constant and
# quasi-constant features (tol=0.998: one value dominates 99.8% of rows)

constant = DropConstantFeatures(tol=0.998, missing_values='ignore')

In [10]:
# fit using the X_train data
constant.fit(X_train)

DropConstantFeatures(missing_values='ignore', tol=0.998,
                     variables=['var_1', 'var_2', 'var_3', 'var_4', 'var_5',
                                'var_6', 'var_7', 'var_8', 'var_9', 'var_10',
                                'var_11', 'var_12', 'var_13', 'var_14',
                                'var_15', 'var_16', 'var_17', 'var_18',
                                'var_19', 'var_20', 'var_21', 'var_22',
                                'var_23', 'var_24', 'var_25', 'var_26',
                                'var_27', 'var_28', 'var_29', 'var_30', ...])

In [11]:
# remove the constant and quasi constant features

X_train = constant.transform(X_train)
X_test = constant.transform(X_test)

In [12]:
X_train.shape, X_test.shape

((35000, 158), (15000, 158))
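
To make the `tol=0.998` threshold concrete, here is a minimal pandas sketch of the underlying idea: a feature is flagged when its single most frequent value covers at least `tol` of the rows. The helper name `find_quasi_constant` and the toy data are hypothetical, and the inclusive boundary and the handling of missing values are assumptions of this sketch rather than a description of Feature-engine's internals.

```python
import pandas as pd


def find_quasi_constant(df: pd.DataFrame, tol: float = 0.998) -> list:
    """Return columns whose most frequent value covers at least `tol` of the rows."""
    flagged = []
    for col in df.columns:
        # share of (non-missing) rows taken by each distinct value, largest first
        shares = df[col].value_counts(normalize=True)
        if shares.empty:
            continue  # column with only missing values: skipped in this sketch
        if shares.iloc[0] >= tol:
            flagged.append(col)
    return flagged


# hypothetical toy data with 1000 rows
toy = pd.DataFrame({
    "constant": [0] * 1000,
    "quasi": [0] * 999 + [1],   # 99.9% of the rows are zero
    "varied": list(range(1000)),
})

quasi_constant = find_quasi_constant(toy, tol=0.998)
print(quasi_constant)
```

On the notebook's data, the same check would be run on `X_train` only, so that the test set plays no part in deciding which features to drop.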

## Remove duplicated features

In [13]:
# initialize the object
duplicates = DropDuplicateFeatures(variables=None, missing_values='ignore')

# fit it using the training data
duplicates.fit(X_train)

DropDuplicateFeatures(variables=['var_4', 'var_5', 'var_8', 'var_13', 'var_15',
                                 'var_17', 'var_18', 'var_19', 'var_21',
                                 'var_22', 'var_25', 'var_26', 'var_27',
                                 'var_29', 'var_30', 'var_31', 'var_35',
                                 'var_37', 'var_38', 'var_41', 'var_46',
                                 'var_47', 'var_49', 'var_50', 'var_51',
                                 'var_52', 'var_54', 'var_55', 'var_57',
                                 'var_58', ...])

In [14]:
# inspect the duplicated features that will be dropped
duplicates.features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}

In [15]:
X_train = duplicates.transform(X_train)
X_test = duplicates.transform(X_test)

In [16]:
X_train.shape, X_test.shape

((35000, 152), (15000, 152))
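
As a rough illustration of what "duplicated" means here, the sketch below flags every column that is an exact copy of an earlier column, using plain pandas. The helper name `find_duplicate_columns` and the toy data are hypothetical, and transposing the frame is just one simple way to do this; it is not claimed to be how `DropDuplicateFeatures` works internally.

```python
import pandas as pd


def find_duplicate_columns(df: pd.DataFrame) -> set:
    """Return the later columns that are exact copies of an earlier column."""
    # after transposing, columns become rows, so duplicated() marks every
    # repeat of a column that appeared before it
    is_repeat = df.T.duplicated(keep="first")
    return set(is_repeat[is_repeat].index)


# hypothetical toy data: "a_copy" repeats "a"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0.5, 0.1, 0.9, 0.3],
    "a_copy": [1, 2, 3, 4],
    "c": [4, 3, 2, 1],
})

to_drop = find_duplicate_columns(toy)
print(to_drop)
```

Transposing a frame with tens of thousands of rows can be memory-hungry, which is one reason to prefer a dedicated transformer on larger data.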

In [17]:
# keep a copy of the dataset after removing constant, quasi-constant
# and duplicated features, to measure the performance of machine
# learning models at the end of the notebook

X_train_basic_filter = X_train.copy()
X_test_basic_filter = X_test.copy()

## Remove Correlated features

In [18]:
corr = SmartCorrelatedSelection(variables=None,
                                method='pearson',
                                threshold=0.8,
                                selection_method='variance')

corr.fit(X_train)

SmartCorrelatedSelection(selection_method='variance',
                         variables=['var_4', 'var_5', 'var_8', 'var_13',
                                    'var_15', 'var_17', 'var_18', 'var_19',
                                    'var_21', 'var_22', 'var_25', 'var_26',
                                    'var_27', 'var_29', 'var_30', 'var_31',
                                    'var_35', 'var_37', 'var_38', 'var_41',
                                    'var_46', 'var_47', 'var_49', 'var_50',
                                    'var_51', 'var_52', 'var_54', 'var_55',
                                    'var_57', 'var_58', ...])

In [19]:
corr.correlated_feature_sets_

[{'var_262', 'var_4'},
 {'var_281', 'var_8'},
 {'var_132', 'var_15', 'var_152'},
 {'var_101', 'var_18'},
 {'var_172', 'var_19', 'var_51', 'var_54'},
 {'var_21', 'var_55'},
 {'var_139', 'var_156', 'var_26'},
 {'var_165', 'var_213', 'var_27'},
 {'var_222', 'var_253', 'var_30', 'var_300', 'var_86'},
 {'var_175', 'var_179', 'var_31', 'var_62'},
 {'var_143', 'var_168', 'var_229', 'var_270', 'var_37'},
 {'var_117', 'var_155', 'var_259', 'var_284', 'var_295', 'var_38', 'var_88'},
 {'var_186', 'var_41', 'var_93'},
 {'var_47', 'var_64'},
 {'var_145', 'var_161', 'var_190', 'var_50'},
 {'var_103', 'var_123', 'var_160', 'var_162', 'var_163', 'var_258', 'var_52'},
 {'var_203', 'var_57'},
 {'var_107', 'var_114', 'var_58'},
 {'var_176', 'var_209', 'var_241', 'var_70'},
 {'var_121', 'var_164', 'var_226', 'var_252', 'var_83'},
 {'var_177', 'var_198', 'var_84', 'var_94'},
 {'var_105', 'var_166', 'var_255', 'var_91'},
 {'var_100', 'var_214', 'var_96'},
 {'var_108', 'var_191'},
 {'var_118', 'var_140', 'va

In [20]:
# drop the correlated features

X_train = corr.transform(X_train)
X_test = corr.transform(X_test)

X_train.shape, X_test.shape

((35000, 78), (15000, 78))
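
The selection logic configured above (Pearson correlation, threshold 0.8, keep the feature with the highest variance) can be sketched in plain pandas. This is a simplified, greedy approximation under stated assumptions: the helper name `select_by_variance` and the toy data are hypothetical, absolute correlations are used so that negative and positive correlations are treated alike, and the grouping order may differ from what `SmartCorrelatedSelection` actually does.

```python
import numpy as np
import pandas as pd


def select_by_variance(df: pd.DataFrame, threshold: float = 0.8):
    """Group correlated columns greedily; keep the highest-variance one per group."""
    corr = df.corr(method="pearson").abs()
    variances = df.var()
    seen, groups, to_drop = set(), [], []
    for col in df.columns:
        if col in seen:
            continue
        # ungrouped columns whose |r| with `col` exceeds the threshold
        partners = [
            other for other in df.columns
            if other != col and other not in seen
            and corr.loc[col, other] > threshold
        ]
        if partners:
            group = [col] + partners
            seen.update(group)
            groups.append(set(group))
            keep = variances[group].idxmax()
            to_drop.extend(c for c in group if c != keep)
    return groups, to_drop


# hypothetical toy data: "x_scaled" is a noisy multiple of "x"
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 3 * x + rng.normal(scale=0.1, size=500),
    "noise": rng.normal(size=500),
})

groups, to_drop = select_by_variance(toy, threshold=0.8)
print(groups, to_drop)
```

Keeping the highest-variance member is only one possible rule; a rule that looks at the relationship with the target would need `y` as well.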

In [21]:
# keep a copy of the dataset at this stage
X_train_corr = X_train.copy()
X_test_corr = X_test.copy()

## Select features based on ANOVA

In [23]:
# keep the 20 features with the highest ANOVA F-statistic
sel_ = SelectKBest(f_classif, k=20).fit(X_train, y_train)

In [26]:
features_to_keep = X_train.columns[sel_.get_support()]
features_to_keep

Index(['var_4', 'var_15', 'var_35', 'var_46', 'var_47', 'var_49', 'var_62',
       'var_79', 'var_82', 'var_107', 'var_110', 'var_163', 'var_169',
       'var_174', 'var_181', 'var_209', 'var_222', 'var_230', 'var_231',
       'var_259'],
      dtype='object')

In [27]:
# transform the data
X_train_annova = sel_.transform(X_train)
X_test_annova = sel_.transform(X_test)

In [29]:
# numpy array to dataframe
X_train_anova = pd.DataFrame(X_train_annova)
X_train_anova.columns = features_to_keep

X_test_anova = pd.DataFrame(X_test_annova)
X_test_anova.columns = features_to_keep

X_train_anova.shape, X_test_anova.shape

((35000, 20), (15000, 20))
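
For readers who want to see what `f_classif` ranks the features by, the sketch below computes the one-way ANOVA F-statistic by hand (between-class mean square divided by within-class mean square) and compares it with Scikit-learn's result. The helper name `anova_f` and the synthetic data are hypothetical and stand in for the notebook's `X_train` and `y_train`.

```python
import numpy as np
from sklearn.feature_selection import f_classif


def anova_f(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One-way ANOVA F-statistic of every column of X against the classes in y."""
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    # mean square between classes divided by mean square within classes
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# synthetic data: only the first column depends on the class
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 3))
X[:, 0] += 2.0 * y

f_manual = anova_f(X, y)
f_sklearn, p_values = f_classif(X, y)
print(f_manual.round(2), p_values.round(4))
```

`SelectKBest(f_classif, k=20)` keeps the `k` features with the largest F-statistic, so in this toy case the first column would be ranked first.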

## Compare performance across machine learning algorithms

In [30]:
# create a function to build random forests and
# compare its performance in train and test sets

def run_randomForests(X_train, X_test, y_train, y_test):
    
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)
    
    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [31]:
# original
run_randomForests(X_train_original,
                  X_test_original,
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.807612232524249
Test set
Random Forests roc-auc: 0.7868832427636059


In [32]:
# filter methods - basic
run_randomForests(X_train_basic_filter,
                  X_test_basic_filter,
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.808688948214592
Test set
Random Forests roc-auc: 0.7902268905900738


In [33]:
# filter methods - correlation
run_randomForests(X_train_corr,
                  X_test_corr,
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.8137741474286087
Test set
Random Forests roc-auc: 0.7928863342856578


In [34]:
# filter methods - univariate anova
run_randomForests(X_train_anova,
                  X_test_anova,
                  y_train, y_test)

Train set
Random Forests roc-auc: 0.8165521002785939
Test set
Random Forests roc-auc: 0.7980865343585657


As we see, the 20 features selected with univariate ANOVA do a good job: the random forest's test ROC-AUC (0.798) is no lower than that of the model trained on all 300 features (0.787).

In [39]:
# create a function to run a logistic regression and compare performance between train and test

def run_logistic(X_train, X_test, y_train, y_test):
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    log = LogisticRegression(random_state=44, max_iter=1000, solver='liblinear')
    
    log.fit(X_train, y_train)
    
    pred_train = log.predict_proba(X_train)[:,1]
    pred_test = log.predict_proba(X_test)[:,1]
    
    print('ROC AUC for train : {}'.format(roc_auc_score(y_train, pred_train)))
    print('ROC AUC for test : {}'.format(roc_auc_score(y_test, pred_test)))

In [40]:
# original
run_logistic(X_train_original,
             X_test_original,
             y_train, y_test)

ROC AUC for train : 0.8027921164077154
ROC AUC for test : 0.795130315849561


In [41]:
# filter methods - basic

run_logistic(X_train_basic_filter,
             X_test_basic_filter,
             y_train, y_test)

ROC AUC for train : 0.7998301107943384
ROC AUC for test : 0.7961601826654346


In [42]:
# filter methods - correlation

run_logistic(X_train_corr,
             X_test_corr,
             y_train, y_test)

ROC AUC for train : 0.7919402115875149
ROC AUC for test : 0.7886070347669065


In [43]:
# filter methods - univariate anova

run_logistic(X_train_anova,
             X_test_anova,
             y_train, y_test)

ROC AUC for train : 0.7826667664362111
ROC AUC for test : 0.7752420514820374


For logistic regression, performance dropped slightly after we removed correlated features (test ROC-AUC from 0.796 to 0.789) and dropped further after the ANOVA selection (0.775). This suggests that some of the removed features were good at predicting the target.

We could try applying the univariate ANOVA directly after the basic filters, without removing correlated features first, to see whether the features selected that way are good enough.
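
The experiment suggested above could look roughly like the sketch below, which applies ANOVA selection directly and then fits a logistic regression, with no correlation filter in between. Synthetic data from `make_classification` stands in for `X_train_basic_filter` and `X_test_basic_filter`, so the score it prints says nothing about the notebook's dataset; the variable names `X_tr`, `X_te`, `y_tr`, `y_te` and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the data left after the basic filters
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_redundant=10, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# ANOVA selection applied directly, with no correlation filter before it
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000, solver="liblinear", random_state=44),
)
pipe.fit(X_tr, y_tr)

auc_test = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
n_selected = int(pipe.named_steps["selectkbest"].get_support().sum())
print(n_selected, round(auc_test, 3))
```

In the notebook itself, the equivalent would be to fit `SelectKBest(f_classif, k=20)` on `X_train_basic_filter` and pass the transformed train and test sets to `run_logistic`.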