## Constant and Quasi-Constant Feature Removal

A constant feature takes the same value for every instance, and a quasi-constant feature takes the same value for almost every instance, so neither gives the model much information. Removing them reduces the dimensionality of the dataset, which lowers the computational cost and training time of your machine learning algorithms.

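Constant features have no discriminatory power: because they never vary, they cannot help explain the target variable. Quasi-constant features differ in only a handful of rows, so a model that relies on them may fit those few rows too closely, which can contribute to overfitting, where the model becomes excessively specialized to the training data and performs poorly on unseen data.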

Constant and quasi-constant features lack variation or discernible patterns, rendering them irrelevant for understanding the relationships between features and the target variable. Removing such features allows you to focus on more informative attributes, improving the interpretability of the model.

Constant and quasi-constant features can also act as noise and degrade the performance of some machine learning algorithms. Removing them lets the model concentrate on features that actually vary with the target.

<b>We can use scikit-learn's VarianceThreshold to remove these columns, because a constant variable has a variance of 0 and a quasi-constant variable has a variance close to 0.</b>
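To make the variance argument concrete, here is a minimal sketch with two made-up columns (`constant` and `quasi_constant` are hypothetical, not part of the dataset used below). It uses NumPy's population variance (`ddof=0`), which is the same quantity `VarianceThreshold` computes. Note that pandas `.var()` defaults to the sample variance (`ddof=1`), so the two can differ slightly for small samples, although both are 0 for a constant column.

```python
import numpy as np

# Hypothetical columns, invented for illustration only
constant = np.full(1000, 5.0)      # every value is identical

quasi_constant = np.zeros(1000)    # 99% zeros ...
quasi_constant[:10] = 1.0          # ... and 1% ones

# Population variance (ddof=0)
var_constant = constant.var()
var_quasi = quasi_constant.var()

print(var_constant)  # exactly 0 for a constant column
print(var_quasi)     # p * (1 - p) = 0.01 * 0.99 for a 0/1 column
```

For a binary column where a fraction `p` of the rows is 1, the variance is `p * (1 - p)`, so a threshold can be read as "drop columns where one value covers more than a given share of the rows".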

In [1]:
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

import warnings
warnings.filterwarnings("ignore")

In [2]:
# make regression data with a continuous target
data = make_regression(
    n_features = 100, 
    n_samples = 250, 
    random_state = 101)

In [3]:
# make a dataframe to improve readability
df = pd.DataFrame(data[0])

# add the targets
df['Y'] = data[1]

# set one column equal to a constant
df[91] = 5

# inspect the results
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.237051 | 0.161684 | -1.152386 | -0.409753 | -1.077468 | 0.747431 | -1.412446 | -1.28967 | -0.760857 | -1.001096 | ... | 5 | 0.770275 | 1.484575 | 3.142104 | -0.91748 | -0.103852 | -0.047363 | 0.083218 | 0.471955 | 17.290903 |
| 1 | 0.274839 | -1.832766 | -0.368043 | 1.161214 | -0.998865 | 1.202511 | -1.202962 | -0.644308 | -1.788403 | 0.979846 | ... | 5 | -0.269674 | 1.086608 | 0.222096 | 0.076328 | 1.177665 | -0.79619 | 0.318696 | 0.285054 | 142.698122 |
| 2 | 0.256114 | -1.550236 | 0.189128 | 0.112748 | 0.03348 | 0.421889 | 0.560051 | 1.280983 | -0.111624 | 0.60016 | ... | 5 | -0.985264 | 1.206036 | 0.531459 | -0.87725 | -0.666815 | 1.756246 | -1.707947 | 1.073355 | -27.943177 |
| 3 | 0.107535 | 0.448421 | 0.179847 | 1.356226 | 0.095896 | -0.089147 | -0.406464 | -0.72817 | 0.181651 | -0.069708 | ... | 5 | -0.282638 | 0.185851 | -0.999543 | 0.030705 | 2.261594 | 0.595639 | -0.615174 | 0.499438 | -272.047303 |
| 4 | 0.217749 | -0.105454 | -1.006047 | -0.160085 | 1.434404 | -0.989782 | -0.87417 | -0.161555 | 0.613464 | 0.611526 | ... | 5 | 0.546369 | -0.990625 | 2.385875 | -0.024611 | 0.105822 | -1.03114 | -1.62013 | 0.901704 | 101.952277 |


In [4]:
# use a pandas method to find constants

# we can use list comprehension to get each column with variance equal to 0
columns_with_constants = [ feat for feat in df.columns if df[feat].var() == 0]

# show the constant columns
columns_with_constants

[91]
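The variance check above only applies to numeric columns. As a hedged alternative for mixed data, one option is to count distinct values instead. The sketch below uses a small made-up frame (`toy` and its column names are hypothetical) and flags every column that holds exactly one distinct value.

```python
import pandas as pd

# Hypothetical frame, invented for illustration; "city" is a non-numeric constant
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5, 5, 5, 5],
    "city": ["Oslo", "Oslo", "Oslo", "Oslo"],
})

# A column is constant when it contains exactly one distinct value
constant_cols = toy.columns[toy.nunique() == 1].tolist()

print(constant_cols)  # ['b', 'city']
```

By default `nunique()` ignores missing values, so a column holding one value plus some NaNs is also reported here; pass `dropna=False` if that is not what you want.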

In [5]:
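# use sklearn to find the constants

# keep only the features, so that the target 'Y' is not screened
# (recent scikit-learn versions also reject mixed int/str column names)
X = df.drop(columns='Y')

# instantiate the class with a threshold equal to 0
selector = VarianceThreshold(threshold=0)

# fit the selector to the features
selector.fit(X)

# get_support() is True for the columns that are kept,
# so invert the mask to get the constant columns
constant = X.columns[~selector.get_support()]

# show the column with the constant
constant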

Index([91], dtype='object')
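Finding the constant columns is only half of the job; they still have to be dropped. The sketch below shows two ways to do that on a small stand-in frame (`X_demo` and its column names are hypothetical). It assumes scikit-learn's default output configuration, in which `transform()` returns a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Small stand-in frame (hypothetical); column "c" is constant
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
    "c": np.full(50, 5.0),
})

selector_demo = VarianceThreshold(threshold=0).fit(X_demo)

# Option 1: transform() returns a NumPy array, so the column labels are lost
X_array = selector_demo.transform(X_demo)

# Option 2: index with the boolean mask to keep a labelled DataFrame
X_reduced = X_demo.loc[:, selector_demo.get_support()]

print(X_array.shape)            # (50, 2)
print(list(X_reduced.columns))  # ['a', 'b']
```

Indexing with the mask is usually the more convenient choice in a notebook, because the surviving column names stay visible for later steps.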

In [6]:
# quasi-constant features will have a variance near 0

# instantiate the class with a threhold equal to 0
selector = VarianceThreshold(threshold = 0.05)

# fit the method to the data
selector.fit(df)

# get the column with the constant by making an anti-selection
constant = df.columns[~selector.get_support()]

# show the column with the constant
constant

Index([91], dtype='object')
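The dataset above contains no genuinely quasi-constant column, so both thresholds return the same result. The following sketch, built on a made-up frame (`X_demo` and its column names are hypothetical), adds a binary column that is 1 in only 2% of the rows to show where the two thresholds differ.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(101)
n = 250

# Hypothetical frame, invented for illustration only
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n),   # variance close to 1
    "constant": np.full(n, 5.0),         # variance exactly 0
    "quasi": np.zeros(n),                # mostly zeros
})

# Set 5 of the 250 rows (2%) to 1, giving a variance of 0.02 * 0.98 = 0.0196
X_demo.loc[:4, "quasi"] = 1.0

strict = VarianceThreshold(threshold=0).fit(X_demo)
loose = VarianceThreshold(threshold=0.05).fit(X_demo)

dropped_strict = X_demo.columns[~strict.get_support()].tolist()
dropped_loose = X_demo.columns[~loose.get_support()].tolist()

print(dropped_strict)  # ['constant']
print(dropped_loose)   # ['constant', 'quasi']
```

Variance depends on the scale of a feature, so a single threshold such as 0.05 is only meaningful when the features are on comparable scales. Treat the value as a tuning choice, not a fixed rule.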