# Feature Selection by <font color=red>Filter Methods</font>

<img src='Data/Reducing Complexity.png' width=500/>

Filter Methods
- <font color=red>Filter by Threshold</font>
- Chi-square
- ANOVA

<font color=red>Filter features by Threshold</font>
- Filter features by <font color=red>threshold on Missing Values</font>
- Filter features by <font color=red>threshold on Variance</font>

Filter features by <font color=red>threshold on Missing Values</font>

In [1]:
import numpy as np
import pandas as pd

In [2]:
diabetes_data = pd.read_csv('Data/diabetes.csv')
diabetes_data.head(10)

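| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |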


In [3]:
# In these columns a value of 0 stands in for a missing measurement, so replace 0s with NaNs.
# Assigning the result back avoids calling replace(inplace=True) on a column selection,
# which newer pandas versions warn about and may not propagate to the dataframe.
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)

In [4]:
diabetes_data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [5]:
diabetes_data['Glucose'].isnull().sum()/len(diabetes_data) * 100

0.6510416666666667

In [6]:
diabetes_data['BloodPressure'].isnull().sum()/len(diabetes_data) * 100

4.557291666666666

In [7]:
diabetes_data['SkinThickness'].isnull().sum()/len(diabetes_data) * 100

29.557291666666668

In [8]:
diabetes_data['Insulin'].isnull().sum()/len(diabetes_data) * 100

48.69791666666667

In [9]:
diabetes_data['BMI'].isnull().sum()/len(diabetes_data) * 100

1.4322916666666665
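
The per-column percentages above can also be computed in a single expression. The sketch below uses a small made-up dataframe (`toy`, not the diabetes data) and relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for diabetes_data (values are made up for illustration)
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0],
    'Insulin': [np.nan, np.nan, 94.0, 168.0],
    'Age': [50, 31, 32, 21],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction of missing values
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Applied to `diabetes_data`, the same expression would return the missing percentage of every column at once.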

In [10]:
# Glance over all columns
diabetes_data.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [11]:
# Use dropna to keep only the columns in which at least 90% of the values are present
# (thresh is the minimum number of non-NaN values a column needs in order to be kept)

diabetes_data_trimmed = diabetes_data.dropna(thresh=int(diabetes_data.shape[0] * 0.9), axis=1)
diabetes_data_trimmed.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'BMI',
       'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Note that the __SkinThickness__ and __Insulin__ columns are dropped, since about 29.6% and 48.7% of their values are missing.
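
Since `thresh` counts the non-missing values required to keep a column, the same filter can be written as a mask on the fraction of missing values. The following is a minimal sketch on a made-up dataframe; the name `max_missing` is an assumption for illustration. The two forms agree here, but on other row counts they can differ slightly because `int()` truncates the threshold.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows: 'a' has no gaps, 'b' has 10% missing, 'c' has 20% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'b': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'c': [np.nan, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

max_missing = 0.1  # hypothetical cut-off: tolerate at most 10% missing values

# Form 1: boolean mask on the fraction of missing values per column
trimmed_mask = toy.loc[:, toy.isnull().mean() <= max_missing]

# Form 2: dropna with thresh = minimum number of non-missing values per column
trimmed_thresh = toy.dropna(thresh=int(len(toy) * 0.9), axis=1)

print(list(trimmed_mask.columns), list(trimmed_thresh.columns))
```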

Filter features by <font color=red>threshold on Variance</font>

A feature with <font color=red>very low variance</font> is almost constant across samples, so it carries <font color=red>little information</font> for telling them apart. High variance alone does not guarantee predictive power, but near-constant features are reasonable candidates for removal. <br/>
Steps:
- Make sure that <font color=red>all features</font> are on the <font color=red>same scale</font>.
- Compute the variance of every feature, <font color=red>select features</font> having <font color=red>variance > threshold</font>, and discard the remaining features.

In [12]:
diabetes_data = pd.read_csv('Data/diabetes_processed.csv')
diabetes_data.head(10)

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 219.028414 | 33.6 | 0.627 | 50.0 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 70.34155 | 26.6 | 0.351 | 31.0 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | 29.15342 | 269.881846 | 23.3 | 0.672 | 32.0 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 | 0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 | 1 |
| 5 | 5.0 | 116.0 | 74.0 | 29.15342 | 127.148895 | 25.6 | 0.201 | 30.0 | 0 |
| 6 | 3.0 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26.0 | 1 |
| 7 | 10.0 | 115.0 | 72.405184 | 29.15342 | 135.878919 | 35.3 | 0.134 | 29.0 | 0 |
| 8 | 2.0 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53.0 | 1 |
| 9 | 8.0 | 125.0 | 96.0 | 29.15342 | 154.880154 | 32.0 | 0.232 | 54.0 | 1 |


In [13]:
X = diabetes_data.drop('Outcome', axis=1)
Y = diabetes_data['Outcome']

In [14]:
X.var(axis=0)

Pregnancies                   11.354056
Glucose                      926.489244
BloodPressure                146.321591
SkinThickness                 77.280660
Insulin                     9447.954568
BMI                           47.270664
DiabetesPedigreeFunction       0.109779
Age                          138.303046
dtype: float64

From the results above, __DiabetesPedigreeFunction__ has by far the __lowest variance__ and looks like a candidate to be discarded. That conclusion would be misleading, because the features are on different scales and variance depends on scale. So, we should scale the features first.
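
To see why raw variances cannot be compared across features, consider the same quantity expressed in two units. The sketch below uses synthetic heights (a hypothetical example, not part of the diabetes data): multiplying a feature by a constant multiplies its variance by the square of that constant, even though the information content is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heights in metres, and the same heights in centimetres
height_m = rng.normal(loc=1.7, scale=0.1, size=1000)
height_cm = height_m * 100

var_m = height_m.var()
var_cm = height_cm.var()

# The centimetre version has a variance 100**2 times larger
print(var_m, var_cm, var_cm / var_m)
```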

In [15]:
# Let us bring all the features to the same scale and study the variance

from sklearn.preprocessing import minmax_scale

X_scaled = pd.DataFrame(minmax_scale(X, feature_range=(0, 10)), columns=X.columns)

In [16]:
X_scaled.var()

Pregnancies                 3.928739
Glucose                     3.856355
BloodPressure               1.523548
SkinThickness               0.913051
Insulin                     1.267813
BMI                         1.976851
DiabetesPedigreeFunction    2.001447
Age                         3.841751
dtype: float64

Now we can see that the candidate to be discarded is not __DiabetesPedigreeFunction__ but __SkinThickness__, as it has the __lowest variance__.

In [17]:
# Keep features with variance above 1; lower-variance features are removed

from sklearn.feature_selection import VarianceThreshold

select_features = VarianceThreshold(threshold=1.0)

In [18]:
X_new = select_features.fit_transform(X_scaled)
X_new.shape

(768, 7)

We can see that 1 of the 8 features has been discarded by the selector above.

In [19]:
# Build a table of feature names and the variances computed by the selector
var_df = pd.DataFrame({'feature names': list(X_scaled), 'variances': select_features.variances_})
var_df

| | feature names | variances |
|---|---|---|
| 0 | Pregnancies | 3.923624 |
| 1 | Glucose | 3.851334 |
| 2 | BloodPressure | 1.521565 |
| 3 | SkinThickness | 0.911862 |
| 4 | Insulin | 1.266162 |
| 5 | BMI | 1.974277 |
| 6 | DiabetesPedigreeFunction | 1.998841 |
| 7 | Age | 3.836749 |


We can see that SkinThickness is the only feature with variance < 1, which means it is the feature that was dropped.
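
The variances reported by the selector are slightly smaller than those from `X_scaled.var()` earlier (for example 3.923624 versus 3.928739 for Pregnancies). This is because pandas computes the sample variance by default (dividing by n - 1), while `VarianceThreshold` uses the population variance (dividing by n). A minimal sketch on a made-up dataframe illustrates the difference:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (values are made up for illustration)
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'y': [0.0, 0.0, 10.0, 10.0]})

sample_var = toy.var()            # pandas default, ddof=1 (divide by n - 1)
population_var = toy.var(ddof=0)  # divide by n

selector = VarianceThreshold(threshold=1.0).fit(toy)

# selector.variances_ matches the ddof=0 figures, not the pandas default
print(sample_var.values, population_var.values, selector.variances_)
```

With 768 rows the two definitions differ only by a factor of 767/768, so the choice of feature to drop is the same here.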

In [20]:
X_new = pd.DataFrame(X_new)
X_new

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 3.529412 | 6.709677 | 4.897959 | 2.737160 | 3.149284 | 2.344150 | 4.833333 |
| 1 | 0.588235 | 2.645161 | 4.285714 | 1.014771 | 1.717791 | 1.165670 | 1.666667 |
| 2 | 4.705882 | 8.967742 | 4.081633 | 3.326246 | 1.042945 | 2.536294 | 1.833333 |
| 3 | 0.588235 | 2.903226 | 4.285714 | 1.288830 | 2.024540 | 0.380017 | 0.000000 |
| 4 | 0.000000 | 6.000000 | 1.632653 | 2.146046 | 5.092025 | 9.436379 | 2.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 5.882353 | 3.677419 | 5.306122 | 2.285054 | 3.006135 | 0.397096 | 7.000000 |
| 764 | 1.176471 | 5.032258 | 4.693878 | 2.036118 | 3.803681 | 1.118702 | 1.000000 |
| 765 | 2.941176 | 4.967742 | 4.897959 | 1.497342 | 1.635992 | 0.713066 | 1.500000 |
| 766 | 0.588235 | 5.290323 | 3.673469 | 2.214888 | 2.433538 | 1.157131 | 4.333333 |


Notice that the new dataframe has no feature headings. Let us recover them.

In [21]:
# Match each selected column against the scaled columns to recover its name
selected_features = []

for i in range(len(X_new.columns)):
    for j in range(len(X_scaled.columns)):
        if (X_new.iloc[:, i].equals(X_scaled.iloc[:, j])):
            selected_features.append(X_scaled.columns[j])

selected_features

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

We can see that the new dataframe has no 'SkinThickness' feature, which confirms that it is the one that was dropped.
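
As an alternative to comparing columns one by one, a fitted `VarianceThreshold` exposes `get_support()`, which returns a boolean mask over the input features. The sketch below, on a made-up dataframe with hypothetical column names, uses that mask to recover the headings and rebuild a labelled dataframe.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: 'narrow' is nearly constant, the other two vary widely
toy = pd.DataFrame({
    'wide': [0.0, 5.0, 10.0, 5.0],
    'narrow': [4.9, 5.0, 5.1, 5.0],
    'medium': [2.0, 8.0, 2.0, 8.0],
})

selector = VarianceThreshold(threshold=1.0)
selected = selector.fit_transform(toy)

# Boolean mask over the input columns: True where the feature was kept
mask = selector.get_support()
selected_names = list(toy.columns[mask])

# Rebuild a dataframe that carries the original headings
selected_df = pd.DataFrame(selected, columns=selected_names, index=toy.index)
print(selected_df)
```

This avoids the nested loop and also behaves predictably if two columns happen to contain identical values.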