## 1 - Constant features
## 2 - Quasi-constant features
## 3 - Duplicated features
- Constant features: all observations in the dataset have the same value for that variable.
- Quasi-constant features: a single value is shared by the great majority of the observations in the dataset (typically 95-99% of observations show the same value).
- Duplicated features: two or more features show the same values for all observations. They may arise after one hot encoding of categorical variables.
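
To make the three definitions concrete, here is a minimal sketch on a toy dataframe. The data and the column names are invented for illustration and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data: 1000 rows, column names invented for illustration
n = 1000
toy = pd.DataFrame({
    "constant": [7] * n,                     # one single value
    "quasi_constant": [0] * 990 + [1] * 10,  # 99% of rows show the value 0
    "original": list(range(n)),
    "duplicate": list(range(n)),             # same values as "original"
})

# share of rows taken by the most frequent value of each column
top_share = {
    col: toy[col].value_counts(normalize=True).iloc[0] for col in toy.columns
}
print(top_share)

# two columns are duplicated when they hold the same values in every row
is_duplicate = toy["original"].equals(toy["duplicate"])
print(is_duplicate)
```

A share of 1.0 marks a constant feature, a share close to 1.0 marks a quasi-constant feature, and `equals` returning True marks a duplicated pair.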

## 3 - Duplicated features

Datasets often contain duplicated features, that is, features that are identical despite having different names.

In addition, we may introduce duplicated features when performing **one hot encoding** of categorical variables, particularly if our datasets have many categorical variables and/or variables with high cardinality.
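
As a small illustration of how one hot encoding can create duplicates, the sketch below encodes two invented categorical variables whose categories map one-to-one. The data are hypothetical and chosen so that the duplication is easy to see.

```python
import pandas as pd

# Hypothetical data: every "London" row is also a "UK" row, and so on
toy = pd.DataFrame({
    "city": ["London", "Paris", "London", "Paris", "London"],
    "country": ["UK", "France", "UK", "France", "UK"],
})

# one hot encode both categorical variables
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# the dummy for London carries exactly the same values as the dummy for UK
same = encoded["city_London"].equals(encoded["country_UK"])
print(same)
```

The encoded dataframe has 4 columns, but only 2 of them carry distinct information.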

Identifying and removing duplicated, and therefore redundant features, is an easy first step towards feature selection and more interpretable machine learning models.

Here I will demonstrate how to identify duplicated features using a dataset that I created for this course. 

Pandas has no dedicated function to find duplicated columns, so we need to write a bit of code to do so.
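
For small dataframes, one compact workaround is to transpose the data and reuse the row-wise `DataFrame.duplicated` method. This is only a sketch on invented toy data: transposing copies the whole dataframe and, with mixed data types, converts everything to object dtype, so it may not scale to a dataset like the one used here.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],  # duplicate of "a"
    "c": [3, 2, 1],
})

# after transposing, columns become rows, so duplicated() flags
# every column that repeats an earlier one
dup_mask = toy.T.duplicated()
duplicated_cols = dup_mask[dup_mask].index.tolist()
print(duplicated_cols)  # ['b']
```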

**Note**
Finding duplicated features can be computationally costly in Python, because every feature has to be compared against the remaining ones. Depending on the size of your dataset, it may not always be feasible.
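
One possible way to reduce the number of comparisons is to compute a fingerprint per column first, and run the exact comparison only among columns that share a fingerprint. The sketch below is an assumption about how this could be done, not the method used in this notebook; the helper names `column_fingerprint` and `find_duplicates_by_fingerprint` are hypothetical.

```python
import hashlib
from collections import defaultdict

import pandas as pd


def column_fingerprint(series):
    # hash every row value (index excluded), then digest all the hashes,
    # so the fingerprint depends on both the values and their order
    row_hashes = pd.util.hash_pandas_object(series, index=False)
    return hashlib.sha1(row_hashes.to_numpy().tobytes()).hexdigest()


def find_duplicates_by_fingerprint(df):
    # group the columns by fingerprint
    buckets = defaultdict(list)
    for col in df.columns:
        buckets[column_fingerprint(df[col])].append(col)

    # confirm duplicates with an exact comparison inside each bucket
    groups = {}
    for cols in buckets.values():
        remaining = list(cols)
        while remaining:
            first = remaining.pop(0)
            dups = [c for c in remaining if df[first].equals(df[c])]
            groups[first] = dups
            remaining = [c for c in remaining if c not in dups]
    return groups


# Hypothetical toy data with numerical and categorical columns
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [3, 2, 1],
    "d": ["x", "y", "z"],
    "e": ["x", "y", "z"],
})
result = find_duplicates_by_fingerprint(toy)
print(result)
```

Each column is hashed once, and the pairwise `equals` check only runs within a bucket, which keeps the result exact even if two different columns were to share a fingerprint.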

This method to find duplicated features works for both **numerical and categorical** variables.

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [13]:
data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

In [14]:
# check the presence of missing data.
# (there are no missing data in this dataset)

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

In [15]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant and quasi-constant

In [16]:
# remove constant and quasi-constant features first:
# we can remove the 2 types of features together with this code
# (we used it in our previous notebook)

# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is, the value that is shared
    # by most observations (np.float was removed from recent NumPy
    # versions, so we use the built-in float instead)
    predominant = (X_train[feature].value_counts() / float(
        len(X_train))).sort_values(ascending=False).values[0]

    # evaluate the predominant value: do more than 99.8% of the
    # observations show 1 value?
    if predominant > 0.998:
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

142

In [17]:
# we can then drop these columns from the train and test sets:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

## Remove duplicated features

To identify duplicated variables we need to iterate through all features of our dataset, and for each and every feature, try and find others that are identical, or duplicates.

We will create a dictionary of {variable: duplicated variables} pairs to identify them more easily throughout the demo. Keep in mind that in a dataset, there could be 2 or more features that are identical to each other.
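
To show what this dictionary looks like when more than 2 features are identical, here is a minimal sketch of the same idea on invented toy data with three identical columns. The variable names `pairs` and `seen` are chosen for this illustration only.

```python
import pandas as pd

# Hypothetical toy data: x1, x3 and x4 are identical
toy = pd.DataFrame({
    "x1": [1, 0, 1, 0],
    "x2": [5, 6, 7, 8],
    "x3": [1, 0, 1, 0],
    "x4": [1, 0, 1, 0],
})

pairs = {}  # {variable: list of its duplicates}
seen = []   # variables already identified as duplicates

for i, feat_1 in enumerate(toy.columns):
    # skip variables that were already flagged as duplicates
    if feat_1 in seen:
        continue
    pairs[feat_1] = []
    # compare against the variables located to the right only
    for feat_2 in toy.columns[i + 1:]:
        if toy[feat_1].equals(toy[feat_2]):
            pairs[feat_1].append(feat_2)
            seen.append(feat_2)

print(pairs)  # {'x1': ['x3', 'x4'], 'x2': []}
```

The keys are the variables we would keep, and the 2 duplicates of x1 appear only inside its list, never as keys.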

In [18]:
# check for duplicated features in the training set:
# create an empty dictionary, where we will store 
# the groups of duplicates
duplicated_feat_pairs = {}

# create an empty list to collect features
# that were found to be duplicated
_duplicated_feat = []

# iterate over every feature in our dataset:
for i in range(0, len(X_train.columns)):
    
    # this bit helps me understand where the loop is at:
    if i % 10 == 0:  
        print(i)    
    # choose 1 feature:
    feat_1 = X_train.columns[i]
    
    # check if this feature has already been identified
    # as a duplicate of another one. If it was, it should be stored in
    # our _duplicated_feat list.
    # If this feature was already identified as a duplicate, we skip it, if
    # it has not yet been identified as a duplicate, then we proceed:
    if feat_1 not in _duplicated_feat:
    
        # create an empty list as an entry for this feature in the dictionary:
        duplicated_feat_pairs[feat_1] = []

        # now, iterate over the remaining features of the dataset:
        for feat_2 in X_train.columns[i + 1:]:
            # check if this second feature is identical to the first one
            if X_train[feat_1].equals(X_train[feat_2]):
                # if it is identical, append it to the list in the dictionary
                duplicated_feat_pairs[feat_1].append(feat_2)
                # and append it to our monitor list for duplicated variables
                _duplicated_feat.append(feat_2)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150


In [19]:
# let's explore our list of duplicated features
len(_duplicated_feat)
# We found 6 features that were duplicates of others.

6

In [20]:
# these are the ones:
_duplicated_feat

['var_148', 'var_199', 'var_296', 'var_250', 'var_232', 'var_269']

In [21]:
# let's explore the dictionary we created:
duplicated_feat_pairs
# We see that for every feature, if it had duplicates, we have entries in the list, otherwise, we have empty lists

{'var_4': [],
 'var_5': [],
 'var_8': [],
 'var_13': [],
 'var_15': [],
 'var_17': [],
 'var_18': [],
 'var_19': [],
 'var_21': [],
 'var_22': [],
 'var_25': [],
 'var_26': [],
 'var_27': [],
 'var_29': [],
 'var_30': [],
 'var_31': [],
 'var_35': [],
 'var_37': ['var_148'],
 'var_38': [],
 'var_41': [],
 'var_46': [],
 'var_47': [],
 'var_49': [],
 'var_50': [],
 'var_51': [],
 'var_52': [],
 'var_54': [],
 'var_55': [],
 'var_57': [],
 'var_58': [],
 'var_62': [],
 'var_63': [],
 'var_64': [],
 'var_68': [],
 'var_70': [],
 'var_74': [],
 'var_75': [],
 'var_76': [],
 'var_79': [],
 'var_82': [],
 'var_83': [],
 'var_84': ['var_199'],
 'var_85': [],
 'var_86': [],
 'var_88': [],
 'var_91': [],
 'var_93': [],
 'var_94': [],
 'var_96': [],
 'var_100': [],
 'var_101': [],
 'var_103': [],
 'var_105': [],
 'var_107': [],
 'var_108': [],
 'var_109': [],
 'var_110': [],
 'var_114': [],
 'var_117': [],
 'var_118': [],
 'var_119': [],
 'var_121': [],
 'var_123': [],
 'var_128': [],
 'var_131'

In [22]:
# let's explore the number of keys in our dictionary
# we see it is 152, because 6 of the 158 were duplicates,
# so they were not included as keys

print(len(duplicated_feat_pairs.keys()))

152


In [23]:
# print each feature together with its duplicates
# iterate over every feature in our dict:
for feat in duplicated_feat_pairs.keys():
    
    # if it has duplicates, the list should not be empty:
    if len(duplicated_feat_pairs[feat]) > 0:

        # print the feature and its duplicates:
        print(feat, duplicated_feat_pairs[feat])
        print()

var_37 ['var_148']

var_84 ['var_199']

var_143 ['var_296']

var_177 ['var_250']

var_226 ['var_232']

var_229 ['var_269']



In [24]:
# let's check that indeed those features are duplicated
# I select a pair from above

X_train[['var_37', 'var_148']].head(10)

|  | var_37 | var_148 |
|---|---|---|
| 17967 | 0 | 0 |
| 32391 | 0 | 0 |
| 9341 | 0 | 0 |
| 7929 | 0 | 0 |
| 46544 | 0 | 0 |
| 4149 | 0 | 0 |
| 33426 | 0 | 0 |
| 3002 | 0 | 0 |
| 6974 | 0 | 0 |
| 16864 | 0 | 0 |


In [25]:
X_train['var_37'].unique()

array([ 0,  3,  6,  9, 12, 21, 33, 15], dtype=int64)

In [26]:
X_train['var_148'].unique()

array([ 0,  3,  6,  9, 12, 21, 33, 15], dtype=int64)

In [27]:
# let's explore parts of the dataframe where the values in
# these features are different from 0:

X_train[X_train['var_37'] != 0][['var_37', 'var_148']].head(10)

|  | var_37 | var_148 |
|---|---|---|
| 37493 | 3 | 3 |
| 20251 | 6 | 6 |
| 4264 | 6 | 6 |
| 48480 | 3 | 3 |
| 31607 | 3 | 3 |
| 41172 | 3 | 3 |
| 13502 | 3 | 3 |
| 7759 | 3 | 3 |
| 46118 | 3 | 3 |
| 2638 | 3 | 3 |


In [28]:
# finally, to remove the duplicates, we retain only the keys of the dictionary
# do you understand why? if not, go back to our loop and try to determine the reason
# (we convert the keys to a list before using them to select columns)

X_train = X_train[list(duplicated_feat_pairs.keys())]
X_test = X_test[list(duplicated_feat_pairs.keys())]

X_train.shape, X_test.shape

# We can see that we further reduced our dataset by 6 additional features.
# In summary, by removing constant, quasi-constant and duplicated features,
# we reduced our original 300 feature dataset to a 152 feature dataset

((35000, 152), (15000, 152))

**************************

# Feature-Engine

## 1 - Constant and Quasi-constant features with Feature-engine

We will remove constant and quasi-constant features using the DropConstantFeatures transformer from Feature-engine.

In [46]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures

In [47]:
data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

In [48]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features

The DropConstantFeatures class from Feature-engine finds and removes constant and quasi-constant features from a dataset. We can remove constant features by setting the parameter tol to 1, or quasi-constant with smaller values for tol.
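
To give an intuition for what tol controls, the pandas-only sketch below flags a feature when the share of its most frequent value reaches the threshold. This is not Feature-engine's implementation: the helper `features_above_tol` is hypothetical, and the use of "greater than or equal to" is an assumption made for this illustration.

```python
import pandas as pd


def features_above_tol(df, tol):
    # flag features whose most frequent value is shown by
    # at least a fraction `tol` of the observations
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged


# Hypothetical toy data
toy = pd.DataFrame({
    "constant": [0] * 1000,
    "quasi_constant": [0] * 999 + [1],  # 99.9% of rows show the value 0
    "varied": list(range(1000)),
})

only_constant = features_above_tol(toy, tol=1)
constant_and_quasi = features_above_tol(toy, tol=0.998)
print(only_constant)       # ['constant']
print(constant_and_quasi)  # ['constant', 'quasi_constant']
```

With tol set to 1 only the strictly constant feature is flagged, while a smaller value such as 0.998 also captures the quasi-constant one.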

In [49]:
sel = DropConstantFeatures(tol=1, variables=None, missing_values='raise')

sel.fit(X_train)

In [50]:
# list of constant features
sel.features_to_drop_

['var_23',
 'var_33',
 'var_44',
 'var_61',
 'var_80',
 'var_81',
 'var_87',
 'var_89',
 'var_92',
 'var_97',
 'var_99',
 'var_112',
 'var_113',
 'var_120',
 'var_122',
 'var_127',
 'var_135',
 'var_158',
 'var_167',
 'var_170',
 'var_171',
 'var_178',
 'var_180',
 'var_182',
 'var_195',
 'var_196',
 'var_201',
 'var_212',
 'var_215',
 'var_225',
 'var_227',
 'var_248',
 'var_294',
 'var_297']

In [51]:
# number of constant features
len(sel.features_to_drop_)

34

In [52]:
# explore 1 of the constant feature values
X_train[sel.features_to_drop_[0]].unique()

array([0], dtype=int64)

In [53]:
# remove constant features from the data
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

After removing the constant features, the datasets contain 34 fewer features.

## Remove quasi-constant features

In [54]:
sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')

sel.fit(X_train)

In [55]:
# number of quasi-constant features
len(sel.features_to_drop_)

108

In [56]:
# list of quasi-constant features
sel.features_to_drop_

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_124',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_159',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_219',
 'var_221',
 'var_223',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_247',
 

In [57]:
# percentage of observations showing each of the different values of the variable
var = sel.features_to_drop_[0]

X_train[var].value_counts(normalize=True)
# We can see that > 99% of the observations show one value, 0. Therefore, this feature is quasi-constant.

0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64

In [58]:
# let's explore another one
var = sel.features_to_drop_[2]
X_train[var].value_counts(normalize=True)

0.0000         0.999629
207901.3365    0.000029
15028.0560     0.000029
25905.4866     0.000029
35685.9459     0.000029
3583.3941      0.000029
52105.7901     0.000029
86718.0000     0.000029
861.0900       0.000029
2641.0164      0.000029
5209.9500      0.000029
10281.6000     0.000029
12542.3100     0.000029
27.3000        0.000029
Name: var_3, dtype: float64

In [59]:
#remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape
# By removing constant and almost constant features, we reduced the feature space from 300 to 158.

((35000, 158), (15000, 158))

********************************

## 2 - Duplicated features with Feature-engine

We will identify and remove duplicated features with Feature-engine.

In [60]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropDuplicateFeatures, DropConstantFeatures

In [61]:
data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

In [62]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant and quasi-constant

In [63]:
# remove constant and quasi-constant features first:
# we use Feature-engine for this

sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')
sel.fit(X_train)

In [64]:
# remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
X_train.shape, X_test.shape

((35000, 158), (15000, 158))

## Remove duplicated features

In [65]:
# set up the selector
sel = DropDuplicateFeatures(variables=None, missing_values='raise')

# find the duplicate features, this might take a while
sel.fit(X_train)

In [66]:
# these are the groups of duplicated features:
# the features within each set are identical to each other

sel.duplicated_feature_sets_

[{'var_148', 'var_37'},
 {'var_199', 'var_84'},
 {'var_143', 'var_296'},
 {'var_177', 'var_250'},
 {'var_226', 'var_232'},
 {'var_229', 'var_269'}]

In [67]:
# these are the features that will be dropped
# 1 from each of the pairs above

sel.features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}

In [68]:
# let's count the duplicated features that will be dropped

len(sel.features_to_drop_)

6

In [69]:
# remove the duplicated features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 152), (15000, 152))

## Stack Feature selection in a Pipeline

We can perform both steps together by setting up the transformers within a pipeline.

In [70]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [71]:
pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),
    ('duplicated', DropDuplicateFeatures()),
])

pipe.fit(X_train)

In [72]:
# remove features

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

X_train.shape, X_test.shape

((35000, 152), (15000, 152))

In [73]:
# we can navigate the pipeline transformers

len(pipe.named_steps['constant'].features_to_drop_)

142

In [74]:
pipe.named_steps['duplicated'].features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}