1. Constant features: all observations in the dataset have the same value for the variable.
2. Quasi-constant features: a single value is shared by the great majority of the observations in the dataset (95-99% of observations present the same value).
3. Duplicated features: two features show the same values for all observations. These may arise after one-hot encoding of categorical variables.
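
To make the three definitions above concrete, here is a minimal sketch on a made-up DataFrame. The data, the column names and the 0.9 cut-off are assumptions chosen for illustration; they are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
toy = pd.DataFrame({
    "constant": [1] * 10,                          # same value in every row
    "quasi_constant": [0] * 9 + [5],               # one value in 90% of rows
    "original": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "duplicate": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],   # copy of "original"
})

# Constant: exactly one distinct value.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Quasi-constant: the most frequent value covers at least 90% of the rows
# (the 0.9 cut-off is arbitrary and only suits this tiny example).
quasi_cols = [
    c for c in toy.columns
    if c not in constant_cols
    and toy[c].value_counts(normalize=True).iloc[0] >= 0.9
]

# Duplicated: a column whose values repeat those of an earlier column.
duplicated_cols = toy.columns[toy.T.duplicated()].tolist()

print(constant_cols, quasi_cols, duplicated_cols)
```

Transposing the frame to find duplicates is convenient for small data, but it can be slow and memory-hungry on wide datasets.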

## 1 - Constant features

- Constant features show a single value for all the observations in the dataset, that is, the same value in every row. These features provide no information that allows a machine learning model to discriminate or predict a target.

- Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

- Here, I will demonstrate how to identify constant features using a dataset that I created for this course. 

- To identify constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use the VarianceThreshold, all our variables need to be numerical. If we code it manually, however, we can apply the code to both numerical and categorical variables.

- I will show two kinds of code snippets:
  - one where I use the VarianceThreshold
  - manually coded alternatives

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [37]:
data = pd.read_csv('../dataset_1.csv')
print(data.shape)  # 301 columns excluding the index: 300 features plus the target
data['var_297'] = 1  # set var_297 to 1 in every row (a constant feature whose value is not 0)
print(data.head())

(50000, 301)
   var_1  var_2  var_3  var_4  var_5  var_6  var_7  var_8  var_9  var_10  ...  \
0      0      0    0.0   0.00    0.0      0      0      0      0       0  ...   
1      0      0    0.0   3.00    0.0      0      0      0      0       0  ...   
2      0      0    0.0   5.88    0.0      0      0      0      0       0  ...   
3      0      0    0.0  14.10    0.0      0      0      0      0       0  ...   
4      0      0    0.0   5.76    0.0      0      0      0      0       0  ...   

   var_292  var_293  var_294  var_295  var_296  var_297  var_298  var_299  \
0      0.0        0        0        0        0        1        0      0.0   
1      0.0        0        0        0        0        1        0      0.0   
2      0.0        0        0        3        0        1        0      0.0   
3      0.0        0        0        0        0        1        0      0.0   
4      0.0        0        0        0        0        1        0      0.0   

      var_300  target  
0      0.0000

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.
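
As a minimal sketch of this practice, the example below fits the selector on a made-up training set and then applies the same column choice to a made-up test set. The two small DataFrames and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy split: "b" is constant in the training rows only.
train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0, 0, 0, 0]})
test = pd.DataFrame({"a": [5, 6], "b": [0, 7]})

selector = VarianceThreshold(threshold=0)
selector.fit(train)                  # the decision uses the training set only

kept = train.columns[selector.get_support()]
train_reduced = train[kept]
test_reduced = test[kept]            # the same columns are dropped from the test set

print(list(kept), train_reduced.shape, test_reduced.shape)
```

Column "b" varies in the test rows, but it is still dropped, because the test set plays no part in the selection.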

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # (X) all columns except the target, drop the target
    data['target'],  # (Y) just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

### Using VarianceThreshold from Scikit-learn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn't meet a certain threshold.

By default, it removes all zero-variance features, i.e., features that have the same value in all samples (the same value in the whole column).
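
Before applying it to the course dataset, the default behaviour can be sketched on a small made-up matrix. The array below is hypothetical; the per-column variance is computed with NumPy only to show what the selector compares against its threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix: the middle column never changes.
X = np.array([
    [1.0, 7.0, 0.0],
    [2.0, 7.0, 0.0],
    [3.0, 7.0, 0.0],
    [4.0, 7.0, 1.0],
])

variances = X.var(axis=0)            # population variance of each column
selector = VarianceThreshold()       # the default threshold is 0.0
X_reduced = selector.fit_transform(X)
mask = selector.get_support()        # True for the columns that are kept

print(variances, mask, X_reduced.shape)
```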

In [39]:
# VarianceThreshold looks only at the features (X), not the desired outputs (y),
# so it can also be used for unsupervised learning.
# Features with a training-set variance lower than this threshold will be removed.
# The default is to keep all features with non-zero variance,
# i.e. to remove the features that have the same value in all samples.
sel = VarianceThreshold(threshold=0)

sel.fit(X_train)  # fit finds the features with zero variance

VarianceThreshold(threshold=0)

In [40]:
# get_support is a boolean vector that indicates which features are retained
print(sel.get_support().shape)
sel.get_support()

(300,)


array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False, False,
        True,  True,  True,  True,  True, False,  True, False,  True,
        True, False,  True,  True,  True,  True, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False,  True, False,  True,  True,  True,  True,
       False,  True,

In [41]:
# if we sum over get_support, we get the number of features that are not constant
sum(sel.get_support())

266

In [42]:
# let's print the number of constant features
constant = X_train.columns[~sel.get_support()]  # ~ inverts the mask, selecting the features marked False
len(constant)
# 34 columns / variables are constant: each shows a single value for all the
# observations of the training set (all zeros, except var_297, which is all ones).

34

In [46]:
constant # the constant variable names

Index(['var_23', 'var_33', 'var_44', 'var_61', 'var_80', 'var_81', 'var_87',
       'var_89', 'var_92', 'var_97', 'var_99', 'var_112', 'var_113', 'var_120',
       'var_122', 'var_127', 'var_135', 'var_158', 'var_167', 'var_170',
       'var_171', 'var_178', 'var_180', 'var_182', 'var_195', 'var_196',
       'var_201', 'var_212', 'var_215', 'var_225', 'var_227', 'var_248',
       'var_294', 'var_297'],
      dtype='object')

In [49]:
X_train['var_297']

17967    1
32391    1
9341     1
7929     1
46544    1
        ..
21243    1
45891    1
42613    1
43567    1
2732     1
Name: var_297, Length: 35000, dtype: int64

In [48]:
# visualise the values of the constant variables
for col in constant:
    print(col, X_train[col].unique())

var_23 [0]
var_33 [0]
var_44 [0]
var_61 [0]
var_80 [0]
var_81 [0]
var_87 [0]
var_89 [0.]
var_92 [0]
var_97 [0]
var_99 [0]
var_112 [0]
var_113 [0]
var_120 [0]
var_122 [0]
var_127 [0]
var_135 [0]
var_158 [0]
var_167 [0]
var_170 [0]
var_171 [0]
var_178 [0.]
var_180 [0.]
var_182 [0]
var_195 [0]
var_196 [0]
var_201 [0]
var_212 [0]
var_215 [0]
var_225 [0]
var_227 [0.]
var_248 [0]
var_294 [0]
var_297 [1]


We then use the transform() method of the VarianceThreshold to reduce the training and testing sets to their non-constant features.

Note that VarianceThreshold returns a NumPy array without feature names, so we need to capture the names first, and reconstitute the dataframe in a later step.
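
A minimal sketch of that capture-and-rebuild step on made-up data follows. Passing `index=X.index` is an addition of mine, not something the notebook does: it keeps the original row labels, which helps when the features must stay aligned with a target that carries the shuffled index produced by train_test_split.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame with a shuffled index, as train_test_split produces.
X = pd.DataFrame(
    {"a": [1, 2, 3], "b": [0, 0, 0], "c": [4, 4, 5]},
    index=[17, 3, 42],
)

selector = VarianceThreshold(threshold=0).fit(X)
kept_names = X.columns[selector.get_support()]   # capture the names first

# Rebuild the dataframe with the kept names and the original row labels.
X_reduced = pd.DataFrame(selector.transform(X), columns=kept_names, index=X.index)

print(X_reduced)
```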

In [50]:
# names of the non-constant features (the ones we want to keep)
feat_names = X_train.columns[sel.get_support()]

In [51]:
X_train.shape, X_test.shape  # Before

((35000, 300), (15000, 300))

In [52]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape  # After
# We went from our original 300 variables to 266.

((35000, 266), (15000, 266))

In [53]:
X_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [55]:
# reconstitute to dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_train.head(2)

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_289,var_290,var_291,var_292,var_293,var_295,var_296,var_298,var_299,var_300
0,0.0,0.0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,2.97,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


By removing the constant features, we reduced the feature space from 300 to 266 features.

The VarianceThreshold works only with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you would spend effort pre-processing variables that are not informative.

The code below offers a better solution:

### Manual Code - works also with categorical variables

In [60]:
# separate train and test (again, as we transformed the previous ones)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [62]:
# I will cast all the numeric features as object,
# to simulate that they are categorical

X_train = X_train.astype('O')
X_train.dtypes

var_1      object
var_2      object
var_3      object
var_4      object
var_5      object
            ...  
var_296    object
var_297    object
var_298    object
var_299    object
var_300    object
Length: 300, dtype: object

In [63]:
# to find variables that contain only 1 label/value
# we use the nunique() method from pandas, which returns the number
# of different values in a variable.

constant_features = [
    feat for feat in X_train.columns if X_train[feat].nunique() == 1
]

len(constant_features)

34

In [64]:
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))
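
The nunique() check above can be wrapped in a small helper. The sketch below is one possible version, run on made-up data; the function name and the `count_missing` option are hypothetical. It also shows a point worth keeping in mind: nunique() ignores missing values by default, so a column holding one label plus NaN is reported as constant unless NaN is counted as a value.

```python
import numpy as np
import pandas as pd

def find_constant_features(df, count_missing=False):
    """Return the columns of df that hold a single distinct value.

    count_missing is a hypothetical option: when True, NaN counts as a
    value of its own, so a column mixing one label with NaN is not constant.
    """
    n_distinct = df.nunique(dropna=not count_missing)
    return n_distinct[n_distinct == 1].index.tolist()

# Hypothetical toy data mixing numerical and categorical columns.
toy = pd.DataFrame({
    "colour": ["red", "red", "red", "red"],
    "city": ["Oslo", "Lima", "Oslo", "Rome"],
    "flag": [1, 1, 1, 1],
    "partly_missing": ["x", "x", np.nan, "x"],
})

constant_default = find_constant_features(toy)
constant_strict = find_constant_features(toy, count_missing=True)

print(constant_default)
print(constant_strict)
```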


## 2 -  Quasi-constant features
- Same value in almost all observations

- Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.

- Identifying and removing quasi-constant features is an easy first step towards feature selection and more interpretable machine learning models.

- Here, I will demonstrate how to identify quasi-constant features using a dataset that I created for this course. 

- To identify quasi-constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use the VarianceThreshold, all our variables need to be numerical. If we code it manually however, we can apply the code to both numerical and categorical variables.

- I will show 2 snippets of code: one where I use the VarianceThreshold and one manually coded alternative.

In [112]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [113]:
data = pd.read_csv('../dataset_1.csv')
print(data.shape)
print(data.head())

(50000, 301)
   var_1  var_2  var_3  var_4  var_5  var_6  var_7  var_8  var_9  var_10  ...  \
0      0      0    0.0   0.00    0.0      0      0      0      0       0  ...   
1      0      0    0.0   3.00    0.0      0      0      0      0       0  ...   
2      0      0    0.0   5.88    0.0      0      0      0      0       0  ...   
3      0      0    0.0  14.10    0.0      0      0      0      0       0  ...   
4      0      0    0.0   5.76    0.0      0      0      0      0       0  ...   

   var_292  var_293  var_294  var_295  var_296  var_297  var_298  var_299  \
0      0.0        0        0        0        0        0        0      0.0   
1      0.0        0        0        0        0        0        0      0.0   
2      0.0        0        0        3        0        0        0      0.0   
3      0.0        0        0        0        0        0        0      0.0   
4      0.0        0        0        0        0        0        0      0.0   

      var_300  target  
0      0.0000

In [114]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # (X) all columns except the target, drop the target
    data['target'],  # (Y) just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [115]:
# Remove the 34 constant features.
# Checking for a standard deviation of 0 is another way to find them;
# it gives the same result as the VarianceThreshold with threshold=0.

constant_features = [ feat for feat in X_train.columns if X_train[feat].std() == 0 ]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

## Remove quasi-constant features

### Using the VarianceThreshold from sklearn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn't meet a certain threshold. By default, it removes all zero-variance features (threshold=0).

Here, we will change the default threshold to remove quasi-constant features, or, more precisely, features with low variance:

In [116]:
sel = VarianceThreshold(threshold=0.01)  
sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.01)
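
To get a feel for what a variance threshold of 0.01 means, consider the special case of a binary (0/1) feature, where the variance equals p * (1 - p) and p is the share of the majority value. The sketch below is a worked calculation under that assumption; it does not describe features with more than two values, whose variance also depends on how large the values are.

```python
import numpy as np

threshold = 0.01

# For a 0/1 feature where a fraction p of the rows holds the majority value,
# the variance is p * (1 - p). Solve p * (1 - p) = threshold for p > 0.5.
p_cutoff = (1 + np.sqrt(1 - 4 * threshold)) / 2

# Check on synthetic data: 99% zeros gives a variance just under the threshold.
x = np.array([0] * 990 + [1] * 10)
variance_99 = x.var()                # 0.99 * 0.01 = 0.0099

print(round(p_cutoff, 4), variance_99, variance_99 < threshold)
```

So, for binary features, a threshold of 0.01 flags those where roughly 99% or more of the observations share one value.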

In [117]:
sum(sel.get_support())

215

In [118]:
quasi_constant = X_train.columns[~sel.get_support()]
len(quasi_constant)
# 51 columns / variables are almost constant. This means that 51 variables show predominantly one value for the majority of
# observations of the training set. Let's explore a few of these variables below.

51

In [119]:
quasi_constant # variable names

Index(['var_1', 'var_2', 'var_7', 'var_9', 'var_10', 'var_19', 'var_28',
       'var_36', 'var_43', 'var_45', 'var_53', 'var_56', 'var_59', 'var_66',
       'var_67', 'var_69', 'var_71', 'var_104', 'var_106', 'var_116',
       'var_133', 'var_137', 'var_141', 'var_146', 'var_177', 'var_187',
       'var_189', 'var_194', 'var_197', 'var_198', 'var_202', 'var_218',
       'var_219', 'var_223', 'var_233', 'var_234', 'var_235', 'var_245',
       'var_247', 'var_249', 'var_250', 'var_251', 'var_256', 'var_260',
       'var_267', 'var_274', 'var_282', 'var_285', 'var_287', 'var_289',
       'var_298'],
      dtype='object')

In [120]:
# percentage of observations showing each of the different values of the variable

X_train['var_1'].value_counts() / len(X_train)
# more than 99% of the observations show one value, 0. Therefore, this feature is fairly constant.


0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64

In [121]:
# let's explore another one

X_train['var_2'].value_counts() / len(X_train)


0    0.999971
1    0.000029
Name: var_2, dtype: float64

In [122]:
X_train['var_2'].value_counts(normalize=True)

0    0.999971
1    0.000029
Name: var_2, dtype: float64

We can then remove the quasi-constant features using the transform() method of the VarianceThreshold, which returns a NumPy array.

In [102]:
feat_names = X_train.columns[sel.get_support()]

In [103]:
# remove the quasi-constant features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape
# By removing constant and almost constant features, we reduced the feature space from 300 to 215.

((35000, 215), (15000, 215))

In [104]:
# transform the arrays into dataframes

X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)

X_test.head(1)

Unnamed: 0,var_3,var_4,var_5,var_6,var_8,var_11,var_12,var_13,var_14,var_15,...,var_286,var_288,var_290,var_291,var_292,var_293,var_295,var_296,var_299,var_300
0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Coding it ourselves

First, I will separate the dataset into train and test and remove the constant features again. Then, I will provide an alternative method to find quasi-constant features.

This method, as opposed to the VarianceThreshold, can be used for both **numerical and categorical variables**.

Remove constant and quasi-constant features together

In [105]:
# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

# remove constant features
# using the code from the previous lecture

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

In [106]:
# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is the value that is shared
    # by most observations
    predominant = X_train[feature].value_counts(
        normalize=True).sort_values(ascending=False).values[0]

    # evaluate the predominant value: do more than 99.8% of the observations
    # show 1 value?
    if predominant > 0.998:

        # if yes, add the variable to the list
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

108
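
The loop above can be sketched as a reusable function, shown here on made-up data with one numerical and one categorical quasi-constant column. The function name, the toy frame and the default cut-off are assumptions; the sketch also assumes that no column is entirely missing, since value_counts() would then return an empty result.

```python
import pandas as pd

def find_quasi_constant(df, cutoff=0.998):
    """Return columns whose most frequent value covers more than cutoff of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share > cutoff:
            flagged.append(col)
    return flagged

# Hypothetical toy data with 1000 rows.
toy = pd.DataFrame({
    "mostly_zero": [0] * 999 + [1],      # top share 0.999
    "mostly_a": ["a"] * 999 + ["b"],     # categorical, top share 0.999
    "balanced": [0, 1] * 500,            # top share 0.5
})

quasi = find_quasi_constant(toy)
print(quasi)
```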

In [107]:
# Our method was a bit more aggressive than the VarianceThreshold from sklearn with the threshold that we selected above.
# It found 108 features that show predominantly 1 value for the majority of the observations.
# Let's see what some of the quasi-constant features look like.
# print the feature names
quasi_constant_feat

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_124',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_159',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_219',
 'var_221',
 'var_223',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_247',
 

In [108]:
quasi_constant_feat[2]

'var_3'

In [110]:
X_train['var_3'].value_counts(normalize=True)
# The feature shows 0 for more than 99.9% of the observations, but it also shows a few different values for a tiny proportion of
# the observations. Those values increase the feature variance, which is why the VarianceThreshold did not capture this feature in
# our previous cell. Yet, we can see that it is quasi-constant.
# Keep in mind that the thresholds are arbitrary and decided by the user.

0.0000         0.999629
207901.3365    0.000029
15028.0560     0.000029
25905.4866     0.000029
35685.9459     0.000029
3583.3941      0.000029
52105.7901     0.000029
86718.0000     0.000029
861.0900       0.000029
2641.0164      0.000029
5209.9500      0.000029
10281.6000     0.000029
12542.3100     0.000029
27.3000        0.000029
Name: var_3, dtype: float64
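
The effect described for var_3 can be reproduced on synthetic data. In the sketch below, the feature is made up (999 zeros and one large value) and only loosely mimics var_3: the variance check keeps it, while the predominant-value check flags it.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature: 999 zeros and one large value.
x = np.zeros(1000)
x[0] = 5000.0
X = x.reshape(-1, 1)

variance = X.var()                   # dominated by the single large value
kept_by_variance = VarianceThreshold(threshold=0.01).fit(X).get_support()[0]

values, counts = np.unique(x, return_counts=True)
top_share = counts.max() / len(x)    # share of the most frequent value
flagged_by_share = top_share > 0.998

print(variance, kept_by_variance, top_share, flagged_by_share)
```

Because variance depends on the scale of the values, the two checks can disagree, which is one reason the manual method found more features.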

In [111]:
# finally, let's drop the quasi-constant features:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape
# We went from 300 variables to 158.

((35000, 158), (15000, 158))