
## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations in the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.

Identifying and removing quasi-constant features is an easy first step towards feature selection and more interpretable machine learning models.

Here, I will demonstrate how to identify quasi-constant features using a dataset that I created for this course. 

To identify quasi-constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use the VarianceThreshold, all our variables need to be numerical. If we code it manually, however, we can apply the code to both numerical and categorical variables.

I will show two snippets of code: one that uses the VarianceThreshold and one manually coded alternative.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [41]:
# load dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/precleaned-datasets/dataset_1.csv')
data.shape

(50000, 301)

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids overfitting: the test set should play no part in deciding which features to keep.
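As a minimal sketch of that practice, the snippet below fits a selector on the training portion of a small synthetic dataset and then applies the same column choice to the test portion without refitting. The toy data and the names `toy`, `selector` and `kept` are made up for illustration and are not part of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one varied column and one almost-constant column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "varied": rng.normal(size=1000),
    "almost_constant": np.where(np.arange(1000) % 500 == 0, 1, 0),
})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

# Learn which columns to keep from the training set only...
selector = VarianceThreshold(threshold=0.01).fit(toy_train)
kept = toy_train.columns[selector.get_support()]

# ...and apply that same decision to the test set, without refitting
toy_train_sel = toy_train[kept]
toy_test_sel = toy_test[kept]

print(list(kept), toy_train_sel.shape, toy_test_sel.shape)
```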

In [42]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features

First, I will remove constant features like I did in the previous lecture. This will allow a better visualisation of the quasi-constant ones.

In [5]:
# using the code from the previous lecture
# I remove 34 constant features

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

## Remove quasi-constant features

### Using the VarianceThreshold from sklearn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet a certain threshold. By default, it removes all zero-variance features, as we did in the previous notebook.

Here, we will change the default threshold to remove quasi-constant features or, more precisely, features with low variance.

Check the Scikit-learn docs for more details:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
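To get a feel for what a threshold of 0.01 means, it can help to look at the simplest case, a feature that only takes the values 0 and 1. If a fraction p of the rows equals 1, the variance is p(1 - p). The sketch below tabulates this for a few values of p; the helper name `bernoulli_variance` is made up for illustration. For features on other scales there is no such direct mapping between the threshold and the share of the predominant value.

```python
import numpy as np

def bernoulli_variance(p):
    """Variance of a 0/1 feature in which a fraction p of rows equals 1."""
    return p * (1 - p)

# A feature is kept only if its variance is strictly above the threshold
for share_of_ones in (0.001, 0.01, 0.02, 0.05):
    var = bernoulli_variance(share_of_ones)
    verdict = "removed" if var <= 0.01 else "kept"
    print(f"{share_of_ones:.3f} ones -> variance {var:.4f} -> {verdict}")

# Cross-check the formula on an explicit 0/1 column with 1% ones,
# using NumPy's default (population) variance
column = np.array([1] * 10 + [0] * 990)
empirical_variance = column.var()
```

Under this reading, a 0/1 feature needs roughly more than 1% of its rows in the minority class to survive a threshold of 0.01.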

In [43]:
sel = VarianceThreshold(threshold=0.01)  

sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.01)

In [44]:
# get_support() returns a boolean vector that indicates which features
# are retained, that is, which features have a higher variance than
# the threshold we indicated.

# If we sum over get_support, we get the number
# of features that are not quasi-constant

sum(sel.get_support())

215

In [45]:
# let's print the number of quasi-constant features

quasi_constant = X_train.columns[~sel.get_support()]

len(quasi_constant)

85

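We can see that 85 columns / variables have a variance at or below the threshold. Because the selector was fitted on all 300 columns, this count includes the 34 constant features; the remaining 51 are quasi-constant, that is, they show predominantly one value for the majority of observations of the training set. Let's explore a few of these variables below.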

In [46]:
# let's print the variable names
quasi_constant

Index(['var_1', 'var_2', 'var_7', 'var_9', 'var_10', 'var_19', 'var_23',
       'var_28', 'var_33', 'var_36', 'var_43', 'var_44', 'var_45', 'var_53',
       'var_56', 'var_59', 'var_61', 'var_66', 'var_67', 'var_69', 'var_71',
       'var_80', 'var_81', 'var_87', 'var_89', 'var_92', 'var_97', 'var_99',
       'var_104', 'var_106', 'var_112', 'var_113', 'var_116', 'var_120',
       'var_122', 'var_127', 'var_133', 'var_135', 'var_137', 'var_141',
       'var_146', 'var_158', 'var_167', 'var_170', 'var_171', 'var_177',
       'var_178', 'var_180', 'var_182', 'var_187', 'var_189', 'var_194',
       'var_195', 'var_196', 'var_197', 'var_198', 'var_201', 'var_202',
       'var_212', 'var_215', 'var_218', 'var_219', 'var_223', 'var_225',
       'var_227', 'var_233', 'var_234', 'var_235', 'var_245', 'var_247',
       'var_248', 'var_249', 'var_250', 'var_251', 'var_256', 'var_260',
       'var_267', 'var_274', 'var_282', 'var_285', 'var_287', 'var_289',
       'var_294', 'var_297', 'var_298']

In [47]:
# percentage of observations showing each of the different values
# of the variable

X_train['var_1'].value_counts() / len(X_train)

0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64

We can see that more than 99% of the observations show one value, 0. Therefore, this feature is fairly constant.

In [48]:
# let's explore another one

X_train['var_28'].value_counts() / len(X_train)

0    0.999943
6    0.000029
3    0.000029
Name: var_28, dtype: float64

Go ahead and explore the rest of the quasi-constant variables.

We can then remove the quasi-constant features utilizing the transform() method from the VarianceThreshold. Remember that this returns a NumPy array without feature names, so if we want a dataframe we need to reconstitute it.
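An alternative that avoids the round trip through a NumPy array is to use the boolean mask from `get_support()` to select columns directly on the dataframe. This is only a sketch on a made-up toy frame (`demo` is not part of the course data). It keeps the column names and the original row index, whereas a dataframe rebuilt from the array gets a fresh 0..n-1 index.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame with a non-default index
demo = pd.DataFrame(
    {"a": [0, 0, 0, 0, 0, 1], "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=[10, 11, 12, 13, 14, 15],
)

demo_sel = VarianceThreshold(threshold=0.2).fit(demo)

# Boolean mask over the columns: True for retained features
mask = demo_sel.get_support()

# Selecting columns with the mask keeps names, dtypes and the row index
demo_reduced = demo.loc[:, mask]

print(demo_reduced)
```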

In [49]:
# capture feature names

feat_names = X_train.columns[sel.get_support()]

In [50]:
# remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 215), (15000, 215))

By removing constant and almost constant features, we reduced the feature space from 300 to 215. This means that 85 features were removed from this dataset, more than a quarter!

In [51]:
# transform the array into a dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)

X_test.head()

Unnamed: 0,var_3,var_4,var_5,var_6,var_8,var_11,var_12,var_13,var_14,var_15,var_16,var_17,var_18,var_20,var_21,var_22,var_24,var_25,var_26,var_27,var_29,var_30,var_31,var_32,var_34,var_35,var_37,var_38,var_39,var_40,var_41,var_42,var_46,var_47,var_48,var_49,var_50,var_51,var_52,var_54,...,var_244,var_246,var_252,var_253,var_254,var_255,var_257,var_258,var_259,var_261,var_262,var_263,var_264,var_265,var_266,var_268,var_269,var_270,var_271,var_272,var_273,var_275,var_276,var_277,var_278,var_279,var_280,var_281,var_283,var_284,var_286,var_288,var_290,var_291,var_292,var_293,var_295,var_296,var_299,var_300
0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,13.65,0.0,0.0,0.0,0.0,0.0,1.86,0.0,3.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,57331.4196,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.84,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,63485.7858,0.0,0.0,99.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,48763.8912,0.0,0.0,0.0,0.0,0.0,1.88,0.0,3.0,0.0,0.0,117433.0791,0.0,0.0,0.0,0.0,0.0,0.0,76980.4014,0.0,0.0,1.0,5.7,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
3,0.0,2.76,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,86.4,0.0,0.0,0.0,0.0,0.0,1.88,0.0,3.0,0.0,0.0,87.3,0.0,0.0,0.0,0.0,0.0,0.0,107926.100695,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,2.97,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.97,0.0,0.0,0.0,0.0,0.0,0.0,98917.9992,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Coding it ourselves

First, I will separate the dataset into train and test and remove the constant features again. Then, I will provide an alternative method to find quasi-constant features.

This method, as opposed to the VarianceThreshold, can be used for both **numerical and categorical** variables.

In [52]:
# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

# remove constant features
# using the code from the previous lecture

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

In [58]:
# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the fraction of observations that show the predominant
    # value, that is, the value shared by most observations
    predominant = (X_train[feature].value_counts() / len(
        X_train)).sort_values(ascending=False).values[0]

    # evaluate the predominant value: do more than 99.8% of the
    # observations show 1 value?
    if predominant > 0.998:
        
        # if yes, add the variable to the list
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

108

Our method was a bit more aggressive than VarianceThreshold from sklearn with the threshold that we selected above. It found 108 features in which a single value is shared by more than 99.8% of the observations.

Let's see what some of the quasi-constant features look like.
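The loop above can be wrapped in a small reusable function. The version below is a sketch, not the author's code: `find_quasi_constant` is a hypothetical helper name, and it uses `value_counts(normalize=True)`, which divides by the number of non-missing values rather than by `len(X_train)`. The toy frame shows that it treats a string column the same way as a numerical one.

```python
import pandas as pd

def find_quasi_constant(df, threshold=0.998):
    """Return the columns whose most frequent value covers more than
    `threshold` of the non-missing rows. Hypothetical helper."""
    quasi_constant = []
    for col in df.columns:
        # normalize=True returns fractions, sorted in descending order,
        # so the first entry is the share of the predominant value
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share > threshold:
            quasi_constant.append(col)
    return quasi_constant

# Hypothetical toy data mixing categorical and numerical columns
toy = pd.DataFrame({
    "colour": ["red"] * 999 + ["blue"],   # categorical, 99.9% 'red'
    "flag": [0] * 995 + [1] * 5,          # numerical, 99.5% zeros
    "size": list(range(1000)),            # every value is different
})

found = find_quasi_constant(toy, threshold=0.998)
print(found)
```

Lowering the threshold makes the selection more aggressive; with `threshold=0.99`, the `flag` column would be flagged as well.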

In [37]:
# print the feature names

quasi_constant_feat

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_13',
 'var_14',
 'var_16',
 'var_20',
 'var_22',
 'var_24',
 'var_26',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_84',
 'var_90',
 'var_94',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_119',
 'var_124',
 'var_125',
 'var_126',
 'var_128',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_137',
 'var_138',
 'var_139',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_156',
 'var_159',
 'var_177',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_193',
 'var_194',
 'var_197',
 'var_198',
 'var_199',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_

In [38]:
# select one feature from the list

quasi_constant_feat[2]

'var_3'

In [39]:
X_train['var_3'].value_counts() / len(X_train)

0.0000         0.999629
35685.9459     0.000029
3583.3941      0.000029
15028.0560     0.000029
52105.7901     0.000029
10281.6000     0.000029
86718.0000     0.000029
207901.3365    0.000029
25905.4866     0.000029
5209.9500      0.000029
2641.0164      0.000029
12542.3100     0.000029
861.0900       0.000029
27.3000        0.000029
Name: var_3, dtype: float64

This feature shows 0 for more than 99.9% of the observations, but it also takes a handful of other values, each in a tiny proportion of the observations. Because those values are large, they inflate the feature's variance, which is why the VarianceThreshold did not capture this feature earlier. Yet we can see that it is quasi-constant.
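The effect is easy to reproduce on synthetic data. The sketch below builds a made-up column that mimics this pattern (mostly zeros plus a few large values; the numbers are invented, not taken from the dataset) and compares the share of the predominant value with the variance.

```python
import numpy as np

# Hypothetical column: 9,990 zeros and 10 large values
column = np.zeros(10_000)
column[:10] = 50_000.0

share_of_zeros = np.mean(column == 0)   # fraction of rows equal to 0
variance = column.var()                 # population variance

print(f"share of zeros: {share_of_zeros:.4f}")
print(f"variance: {variance:,.0f}")
```

The column passes any reasonable predominance check, yet its variance is many orders of magnitude above a threshold of 0.01, so a variance-based selector would keep it.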

Keep in mind that the thresholds are arbitrary and decided by the user.

In [40]:
# finally, let's drop the quasi-constant features:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 134), (15000, 134))

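Dropping the 108 quasi-constant features found above from the 266 non-constant ones leaves 158 of the original 300 variables, so we removed almost half of them! Note that the shape printed above shows 134 columns; judging by the cell numbers, that output seems to come from an earlier run of the notebook than the cell that found 108 features.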

That is all for this lecture, I hope you enjoyed it and see you in the next one!