## Filter-Based Feature Selection Methods

- Filter methods select features from a dataset independently of any machine learning algorithm.
- They rely only on the characteristics of the variables themselves, so features are filtered out of the data before learning begins.
- **Univariate filter methods** evaluate each feature individually.
- They assign a score to each feature, often based on a statistical test.
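
As a rough illustration of univariate scoring, the sketch below scores each feature on its own with the ANOVA F-test and keeps the highest-scoring ones. The synthetic dataset, the `_demo` variable names, and the choice of `k=3` are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Score every feature individually with the ANOVA F-test, then keep the top 3.
kbest = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

print(np.round(kbest.scores_, 2))        # one score per feature
print(kbest.get_support(indices=True))   # column positions of the kept features
```

Each score is computed from one feature and the target only, which is what makes the method univariate.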

### Video Explanation of Variance Threshold: https://youtu.be/Z141lNzivXU

**Filter Methods: Advantages**
- They are computationally inexpensive: you can process thousands of features in a matter of seconds.
- Filter methods are well suited to eliminating irrelevant, redundant, constant, duplicated, and correlated features.

**Filter Method Types**
1. Basic Statistical Filter Methods
    - VarianceThreshold (Remove the Constant and Quasi-Constant Features)
    - Remove Duplicate Features
2. Correlation & Ranking based statistical Filter Methods
    - Pearson’s correlation coefficient
    - Spearman’s rank coefficient
    - Kendall’s rank coefficient
3. Statistical Test based Methods
    - ANOVA or F-Test
    - Mutual Information
    - Chi-Square
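
Duplicate-feature removal, listed under the basic methods above, could be sketched with pandas alone as follows. The toy frame and its column names are hypothetical, and transposing may be memory-hungry on very wide datasets.

```python
import pandas as pd

# Toy frame (hypothetical): "b" repeats "a" exactly, "c" is different.
demo = pd.DataFrame({"a": [1, 0, 1, 1], "b": [1, 0, 1, 1], "c": [3, 5, 2, 7]})

# Transpose so each feature becomes a row, then flag rows that repeat an earlier one.
is_duplicate = demo.T.duplicated()

# Keep only the first occurrence of each distinct column.
demo_unique = demo.loc[:, ~is_duplicate.values]
print(list(demo_unique.columns))
```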

### Variance Threshold
**Removing Numerical features with low variance**

- We compute the variance of each feature and keep the subset of features whose variance exceeds a user-specified threshold.
- We assume that features with a higher variance may contain more useful information.
- **This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.**
- One drawback of the Variance Threshold filter is that it does not take into account the relationships among features or between features and the target.
- It is applicable only to numerical features.
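
A small worked example may help here. The sketch below computes the variance by hand on a hypothetical two-column frame and compares it with what `VarianceThreshold` stores after fitting. Note that pandas `var()` defaults to the sample variance (`ddof=1`), while `VarianceThreshold` uses the population variance, so the two can differ slightly near the threshold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one binary flag and one integer-valued feature.
toy = pd.DataFrame({"flag": [0, 0, 0, 1], "length": [10, 12, 15, 19]})

# Population variance (divide by n): mean of squared deviations from the mean.
manual = ((toy - toy.mean()) ** 2).mean()

vt = VarianceThreshold(threshold=0.01).fit(toy)

print(manual.values)      # population variance computed by hand
print(vt.variances_)      # what VarianceThreshold stores after fit
print(toy.var().values)   # pandas default: sample variance (ddof=1)
```

With only four rows the gap between the two definitions is large; on datasets with thousands of rows it is usually negligible.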

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Let's build toy columns: constant, quasi-constant, and varying
col1 = np.ones(100)
col2 = np.zeros(100)
col2[:1] = 1
col3 = np.random.randint(15,30,size = 100)
col4 = np.ones(100)
col4[:40] = 0
col5 = np.zeros(100)
col5[:20] = 1
col6 = np.random.randint(30,40,size = 100)
col7 = np.ones(100)
col8 = np.zeros(100)
col8[:4] = 1

In [3]:
df = pd.DataFrame({"A":col1,"B":col2,"C":col3,"D":col4,"E":col5,"F":col6,"G":col7})

In [4]:
df.head(8)

Unnamed: 0,A,B,C,D,E,F,G
0,1.0,1.0,20,0.0,1.0,30,1.0
1,1.0,0.0,25,0.0,1.0,36,1.0
2,1.0,0.0,22,0.0,1.0,39,1.0
3,1.0,0.0,28,0.0,1.0,39,1.0
4,1.0,0.0,20,0.0,1.0,31,1.0
5,1.0,0.0,27,0.0,1.0,32,1.0
6,1.0,0.0,21,0.0,1.0,36,1.0
7,1.0,0.0,17,0.0,1.0,32,1.0


In [5]:
df_var = df.var()

In [6]:
df_var = df.var().to_frame() #Convert Series to DataFrame
df_var

Unnamed: 0,0
A,0.0
B,0.01
C,18.322828
D,0.242424
E,0.161616
F,8.694444
G,0.0


In [7]:
df_var.shape

(7, 1)

In [8]:
df_var = df_var.rename(columns={0:"Variance"})
df_var

Unnamed: 0,Variance
A,0.0
B,0.01
C,18.322828
D,0.242424
E,0.161616
F,8.694444
G,0.0


In [9]:
# Select the columns whose variance is greater than the threshold of 0.01
df1 = df.loc[:, df.var() > 0.01]
df1

Unnamed: 0,C,D,E,F
0,20,0.0,1.0,30
1,25,0.0,1.0,36
2,22,0.0,1.0,39
3,28,0.0,1.0,39
4,20,0.0,1.0,31
...,...,...,...,...
95,27,1.0,0.0,36
96,28,1.0,0.0,33
97,24,1.0,0.0,38
98,26,1.0,0.0,32


**Feature Selection - Variance Threshold : sklearn.feature_selection.VarianceThreshold**

In [10]:
from sklearn.feature_selection import VarianceThreshold

In [11]:
# Let's use sklearn
selector = VarianceThreshold(threshold=0.01) # keep features with variance above 0.01
# Fit the selector to the data
selector.fit(df)

VarianceThreshold(threshold=0.01)

In [12]:
# Boolean mask of the features whose variance is greater than the threshold (0.01)
selector.get_support(indices=False)

array([False, False,  True,  True,  True,  True, False])

In [13]:
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

In [14]:
df.columns[selector.get_support()]

Index(['C', 'D', 'E', 'F'], dtype='object')

In [15]:
# Collect the features whose variance does not exceed the threshold, using a list comprehension
selected_cols = [column for column in df.columns if column not in df.columns[selector.get_support()]]

In [16]:
# Features whose variance does not exceed the threshold (these will be dropped)
selected_cols

['A', 'B', 'G']

In [17]:
# Let's drop the columns whose variance does not exceed the threshold (0.01)
df2 = df.drop(labels=selected_cols,axis=1)
df2.head()

Unnamed: 0,C,D,E,F
0,20,0.0,1.0,30
1,25,0.0,1.0,36
2,22,0.0,1.0,39
3,28,0.0,1.0,39
4,20,0.0,1.0,31


In [18]:
df_op = selector.transform(df)

In [19]:
df_op

array([[20.,  0.,  1., 30.],
       [25.,  0.,  1., 36.],
       [22.,  0.,  1., 39.],
       [28.,  0.,  1., 39.],
       [20.,  0.,  1., 31.],
       [27.,  0.,  1., 32.],
       [21.,  0.,  1., 36.],
       [17.,  0.,  1., 32.],
       [27.,  0.,  1., 39.],
       [27.,  0.,  1., 37.],
       [27.,  0.,  1., 37.],
       [19.,  0.,  1., 32.],
       [22.,  0.,  1., 33.],
       [17.,  0.,  1., 39.],
       [23.,  0.,  1., 30.],
       [16.,  0.,  1., 31.],
       [29.,  0.,  1., 39.],
       [21.,  0.,  1., 30.],
       [24.,  0.,  1., 39.],
       [15.,  0.,  1., 31.],
       [24.,  0.,  0., 39.],
       [25.,  0.,  0., 36.],
       [20.,  0.,  0., 31.],
       [21.,  0.,  0., 30.],
       [23.,  0.,  0., 30.],
       [24.,  0.,  0., 34.],
       [17.,  0.,  0., 36.],
       [29.,  0.,  0., 36.],
       [28.,  0.,  0., 38.],
       [29.,  0.,  0., 32.],
       [29.,  0.,  0., 37.],
       [17.,  0.,  0., 35.],
       [21.,  0.,  0., 33.],
       [17.,  0.,  0., 36.],
       ...
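
Since `transform` returns a bare NumPy array, the column names are lost. One way to get a labelled DataFrame back, sketched on a small hypothetical frame rather than the notebook's `df`, is to reuse the mask from `get_support()`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Small stand-in frame (hypothetical); "A" is constant and should be dropped.
demo = pd.DataFrame({
    "A": np.ones(6),
    "C": [20, 25, 22, 28, 20, 27],
    "D": [0, 0, 0, 1, 1, 1],
})

sel = VarianceThreshold(threshold=0.01).fit(demo)

# Re-attach the surviving column names and the original index to the array.
kept = demo.columns[sel.get_support()]
demo_reduced = pd.DataFrame(sel.transform(demo), columns=kept, index=demo.index)
print(demo_reduced.head())
```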

### Feature Selection Method : Variance Threshold to remove the Constant Features

- Constant features contain a single value for every observation in the dataset.
- These features provide no information that allows ML models to predict the target.

To remove them, we set `threshold = 0`.
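
Because the variance-based filter applies only to numerical features, a pandas-only check on the number of distinct values is one possible alternative that also covers text columns. This is a sketch on a hypothetical frame, not the approach used in the cells below.

```python
import pandas as pd

# Hypothetical frame with a constant numeric column and a constant text column.
demo = pd.DataFrame({
    "const_num": [7, 7, 7, 7],
    "const_txt": ["x", "x", "x", "x"],
    "varying": [1, 2, 3, 4],
})

# A column is constant when it holds exactly one distinct value (NaN counted as a value).
constant_cols = [col for col in demo.columns if demo[col].nunique(dropna=False) == 1]
demo_no_const = demo.drop(columns=constant_cols)
print(constant_cols)
```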

In [20]:
#https://www.kaggle.com/iabhishekofficial/mobile-price-classification --Load the dataset
#df_m = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Mobile_Price/train.csv")#,
                     #encoding = "ISO-8859-1")

In [21]:
# https://www.kaggle.com/c/santander-customer-satisfaction/data -- load the dataset
df_s = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Santander_CS/train.csv")

In [22]:
#df_m.head()
df_s.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [23]:
X_s = df_s.iloc[:,0:-1]
y_s = df_s["TARGET"]
#X_m = df_m.iloc[:,0:-1]
#y_m = df_m["price_range"]
#y_m = df_m["color"]

In [24]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_s,y_s,test_size = 0.3,random_state = 42)
#X_train,X_test,y_train,y_test = train_test_split(X_m,y_m,test_size = 0.3,random_state = 42)
X_train.shape,X_test.shape

((53214, 370), (22806, 370))

In [25]:
# Let's apply VarianceThreshold with threshold=0 to find the constant features
constant_selector= VarianceThreshold(threshold=0)
constant_selector.fit(X_train)

VarianceThreshold(threshold=0)

In [26]:
# Count the features whose variance is greater than the threshold (0.0)
sum(constant_selector.get_support())

324

In [27]:
# The count above is the number of non-constant features.
# Collect the names of the constant features:
constant_selected_cols = [column for column in X_train.columns if column not in X_train.columns[constant_selector.get_support()]]

In [28]:
# Constant Columns
len(constant_selected_cols)

46

In [29]:
# Let's drop the columns whose variance is 0
X_train_op = X_train.drop(labels=constant_selected_cols,axis=1)
X_train_op.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
199,394,2,71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,165191.58
62460,124716,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,110940.57
32495,64953,2,38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,89347.14
46788,93556,2,24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,92325.69
61106,121970,2,53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,112730.85


In [30]:
# Alternative: drop the zero-variance columns with the fitted selector (returns NumPy arrays)
X_train_ac = constant_selector.transform(X_train)
X_test_ac = constant_selector.transform(X_test)

In [31]:
X_train_ac.shape, X_test_ac.shape

((53214, 324), (22806, 324))

### Feature Selection Method : Variance Threshold to remove the Quasi-Constant Features

- Quasi-constant features are almost constant: they take the same value for a very large share of the observations.
- Such features are not very useful for making predictions. There is no fixed rule for the variance threshold of quasi-constant features. As a rule of thumb, remove features in which more than 99% of the observations share the same value.
- We will create a quasi-constant filter with `VarianceThreshold`. Instead of passing 0 for the `threshold` parameter, we pass 0.01, so any column whose variance does not exceed 0.01 is removed. For a binary (0/1) feature this corresponds to roughly 99% of the values being the same; for other features the variance also depends on the scale of the values.

Here we set `threshold = 0.01`.
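
The link between a variance of 0.01 and "99% similar values" can be checked with a little arithmetic for a 0/1 feature. The sketch below also shows a possible alternative that measures the share of the most frequent value directly, which does not depend on the feature's scale. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# For a 0/1 feature where a fraction p of rows equal 1, the population variance is p * (1 - p).
p = 0.99
print(p * (1 - p))  # just under 0.01

# Hypothetical data: "rare_flag" is 1 in only 5 of 1000 rows, "balanced" is split evenly.
demo = pd.DataFrame({
    "rare_flag": np.r_[np.ones(5), np.zeros(995)],
    "balanced": np.r_[np.ones(500), np.zeros(500)],
})

# Share of the most frequent value in each column.
dominant_share = demo.apply(lambda col: col.value_counts(normalize=True).max())

# Flag columns where one value covers at least 99% of the rows.
quasi_constant = dominant_share[dominant_share >= 0.99].index.tolist()
print(quasi_constant)
```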

In [32]:
# Let's use threshold = 0.01 to remove the quasi-constant columns
quasi_constant_selector = VarianceThreshold(threshold=0.01)
quasi_constant_selector.fit(X_train)

VarianceThreshold(threshold=0.01)

In [33]:
X_s.shape,y_s.shape
#X_m.shape,y_m.shape

((76020, 370), (76020,))

In [34]:
# Count the features whose variance is greater than the threshold (0.01)
sum(quasi_constant_selector.get_support())

266

In [35]:
# Collect the features at or below the threshold (constant and quasi-constant)
quasi_features = [cols for cols in X_train.columns if cols not in X_train.columns[quasi_constant_selector.get_support()]]

In [36]:
# Number of constant and quasi-constant columns combined
len(quasi_features)

104

In [37]:
print("Total Features :",len(X_train.columns))
print("Total constant features :",len(constant_selected_cols))
print("Total non-quasi constant features :",len(X_train.columns)- len(quasi_features))

Total Features : 370
Total constant features : 46
Total non-quasi constant features : 266


In [38]:
# Variance of each training column (pandas uses the sample variance, ddof=1)
frame = X_train.var().to_frame()

In [39]:
frame = frame.rename(columns={0:"Variance"})

In [40]:
print("Total Features :",sum(frame.Variance.values>=0))
print("Total constant features :",sum(frame.Variance.values==0))
print("Total non-constant and non-quasi-constant features :",sum(frame.Variance.values > 0.01))

Total Features : 370
Total constant features : 46
Total non-constant and non-quasi-constant features : 266


In [41]:
# Let's transform the train and test data using the fitted selector
X_train_aq = quasi_constant_selector.transform(X_train)
X_test_aq = quasi_constant_selector.transform(X_test)

In [42]:
X_train_aq.shape, X_test_aq.shape

((53214, 266), (22806, 266))
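
The fit-on-train, transform-on-test pattern used above can also be wrapped in a scikit-learn `Pipeline`, so the selector is never fitted on test data. The sketch below uses synthetic stand-in data and a logistic regression model, both of which are assumptions for illustration and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption); a constant column is appended on purpose.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X_demo = np.hstack([X_demo, np.ones((300, 1))])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The selector is fitted on the training split only and reused unchanged on the test split.
pipe = Pipeline([
    ("low_variance", VarianceThreshold(threshold=0.01)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

support = pipe.named_steps["low_variance"].get_support()
print(support)                  # False marks a removed column
print(pipe.score(X_te, y_te))   # test accuracy
```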

**Disadvantages of Variance Threshold**
- It does not consider the dependent (target) variable.
- It does not consider correlations between features.
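
The first drawback can be made concrete with a small, deliberately extreme sketch. The data here are hypothetical: one feature is identical to a rare target but nearly constant, while the other is pure noise with a large variance.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: the target is positive in only 5 of 1000 rows.
y_demo = np.zeros(n, dtype=int)
y_demo[:5] = 1

X_demo = pd.DataFrame({
    "rare_signal": y_demo.astype(float),   # identical to the target, but almost constant
    "noise": rng.normal(0, 10, size=n),    # unrelated to the target, but high variance
})

vt = VarianceThreshold(threshold=0.01).fit(X_demo)
print(dict(zip(X_demo.columns, vt.get_support())))
```

The filter keeps `noise` and drops `rare_signal`, because its decision depends only on the variance of each column and never on `y_demo`.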