## Constant features

Constant features are those that show a single value across all the observations of the dataset, that is, the same value in every row. Such features provide no information that a machine learning model can use to discriminate between observations or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.
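
To make the definition concrete, here is a minimal sketch on a hypothetical toy DataFrame (the frame `toy` and its columns are made up for illustration and are not part of the dataset used below). It flags a column as constant when it contains exactly one distinct value.

```python
import pandas as pd

# Hypothetical toy data: column 'b' holds the same value in every row.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 0, 0, 0],
    "c": [0.5, 0.5, 0.7, 0.5],
})

# nunique() counts the distinct values in each column;
# a count of 1 marks a constant feature.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
print(constant_cols)
```

Column `c` is not flagged: it is nearly constant, but it does take two different values.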



In [57]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

## Removing constant features

In [58]:
# load the Santander customer satisfaction dataset from Kaggle
# I load just a few rows for the demonstration

data = pd.read_csv('train.csv', nrows=50000)
data.shape

(50000, 371)

In [59]:
data.head(1)

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0


In [60]:
# check the presence of null data.

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

In [61]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 371 entries, ID to TARGET
dtypes: float64(108), int64(263)
memory usage: 141.5 MB


### Important

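In all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids overfitting: if the test set influences which features are kept, the performance estimated on it becomes optimistically biased.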
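
The idea can be sketched on hypothetical toy data (the frames `train` and `test` below are invented for illustration). The selector is fitted on the training rows only, and the same decision is then applied to the test rows, whatever their values happen to be.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: 'x2' is constant in train but not in test,
# while 'x1' varies in train but happens to be constant in test.
train = pd.DataFrame({"x1": [1, 2, 3], "x2": [5, 5, 5]})
test = pd.DataFrame({"x1": [4, 4], "x2": [5, 6]})

# Learn which features are constant from the training set only.
selector = VarianceThreshold(threshold=0)
selector.fit(train)

# Apply the decision made on train to both sets.
kept = train.columns[selector.get_support()].tolist()
test_reduced = selector.transform(test)

print(kept, test_reduced.shape)
```

Only `x1` is retained in both sets, because the training set alone determines which columns are dropped.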

In [62]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [63]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35000 entries, 17967 to 2732
Columns: 370 entries, ID to var38
dtypes: float64(108), int64(262)
memory usage: 99.1 MB


In [64]:
# fit finds the features with zero variance
sel = VarianceThreshold(threshold=0)
sel.fit(X_train)  

VarianceThreshold(threshold=0)

In [65]:
# get_support is a boolean vector that indicates which features are retained
# if we sum over get_support, we get the number of features that are not constant
sum(sel.get_support())

312

In [66]:
# another way of counting the non-constant features:
len(X_train.columns[sel.get_support()])

312

In [67]:
# finally we can print the constant features
print(
    len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

58


['ind_var2_0',
 'ind_var2',
 'ind_var13_medio_0',
 'ind_var13_medio',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var34_0',
 'ind_var34',
 'ind_var41',
 'ind_var46_0',
 'ind_var46',
 'num_var13_medio_0',
 'num_var13_medio',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var34_0',
 'num_var34',
 'num_var41',
 'num_var46_0',
 'num_var46',
 'saldo_var13_medio',
 'saldo_var28',
 'saldo_var27',
 'saldo_var34',
 'saldo_var41',
 'saldo_var46',
 'delta_imp_amort_var34_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_num_reemb_var33_1y3',
 'imp_amort_var18_hace3',
 'imp_amort_var34_hace3',
 'imp_amort_var34_ult1',
 'imp_reemb_var13_hace3',
 'imp_reemb_var17_hace3',
 'imp_reemb_var33_hace3',
 'imp_reemb_var33_ult1',
 'imp_trasp_var17_out_hace3',
 'imp_trasp_var33_out_hace3',
 'imp_venta_var44_hace3',
 'num_var2_0_ult1',
 'num_var2_ult1',
 'num_meses_var13_medio_ult3',
 'num_reemb_var13_hace3',
 'num_reemb_var17_hace3',
 'num_reemb_var33_hace3',
 'num_reemb_var

We can see that 58 variables are constant, that is, each of them shows a single value across all the observations of the training set.
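
The list comprehension used above can also be written by inverting the boolean mask returned by `get_support()`. A minimal sketch on hypothetical toy data (the names `X_demo` and `sel_demo` are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with two constant columns.
X_demo = pd.DataFrame({
    "f1": [0, 1, 0, 1],
    "f2": [3, 3, 3, 3],
    "f3": [2, 2, 2, 2],
})

sel_demo = VarianceThreshold(threshold=0).fit(X_demo)

# get_support() is True for retained features, so its negation
# selects the features that the selector would remove.
mask = sel_demo.get_support()
constant_demo = X_demo.columns[~mask].tolist()
print(constant_demo)
```

Indexing the columns with `~mask` avoids scanning the list of retained columns once per feature.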

In [68]:
# let's visualise the values of one of the constant variables
# as an example

X_train['ind_var2_0'].unique()

array([0], dtype=int64)

We then use the transform method of the selector to remove the constant features from the training and testing sets.

In [69]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 312), (15000, 312))
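
By default, `transform` returns a NumPy array, so the column names and the index are lost. If they are needed downstream, one option is to rebuild a DataFrame from the retained column names. A minimal sketch on hypothetical toy data (`X_toy` and `sel_toy` are illustrative names):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with a non-default index.
X_toy = pd.DataFrame(
    {"keep_me": [1.0, 2.0, 3.0], "constant": [7.0, 7.0, 7.0]},
    index=[10, 11, 12],
)

sel_toy = VarianceThreshold(threshold=0).fit(X_toy)

# Names of the retained columns, taken from the boolean support mask.
retained = X_toy.columns[sel_toy.get_support()]

# Wrap the transformed array back into a DataFrame,
# restoring the column names and the original index.
X_reduced = pd.DataFrame(
    sel_toy.transform(X_toy),
    columns=retained,
    index=X_toy.index,
)
print(X_reduced)
```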

### Alternative Approach with Labels

In [73]:
# load the dataset again
data = pd.read_csv('train.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [74]:
# cast all the numeric features to object dtype, so that they are treated as categorical
X_train = X_train.astype('O')
X_train.dtypes

ID                         object
var3                       object
var15                      object
imp_ent_var16_ult1         object
imp_op_var39_comer_ult1    object
                            ...  
saldo_medio_var44_hace2    object
saldo_medio_var44_hace3    object
saldo_medio_var44_ult1     object
saldo_medio_var44_ult3     object
var38                      object
Length: 370, dtype: object

In [75]:
# and now find those columns that contain only 1 label:
constant_features = [
    feat for feat in X_train.columns if len(X_train[feat].unique()) == 1
]

len(constant_features)

58
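
Once the constant features are listed, they can be dropped from both sets by name. A minimal sketch on hypothetical toy data (the frames `X_tr` and `X_te` and their columns are invented for illustration, not taken from the Santander dataset), using `nunique()` in place of `len(unique())`:

```python
import pandas as pd

# Hypothetical toy data mixing categorical and numeric columns.
X_tr = pd.DataFrame({
    "city": ["A", "B", "A"],
    "country": ["X", "X", "X"],
    "n": [1, 1, 2],
})
X_te = pd.DataFrame({
    "city": ["B", "B"],
    "country": ["X", "Y"],
    "n": [3, 4],
})

# Find the constant features on the training set only.
constant = [c for c in X_tr.columns if X_tr[c].nunique() == 1]

# Drop the same columns from both sets.
X_tr_reduced = X_tr.drop(columns=constant)
X_te_reduced = X_te.drop(columns=constant)
print(constant, list(X_tr_reduced.columns))
```

Note that `nunique()` ignores missing values by default, whereas `unique()` includes `NaN` among the values it returns. The two checks agree here because the data contain no nulls.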