# Feature Selection

## **Exercise** 

Apply the Variance Threshold Feature Selection to any dataset included in scikit-learn.

## **Variance Threshold Feature Selection**

A feature with low variance takes nearly the same value for most samples. Features whose values are mostly the same are usually not very useful for discriminating between the different classes. For example, if almost everybody in this class has the same age, age would not be a good predictor of academic success.
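To make the age example concrete, here is a minimal sketch with made-up numbers. The arrays `ages` and `study_hours` are hypothetical and are not part of the exercise dataset; they only illustrate how a nearly constant feature ends up with a much smaller variance than a spread-out one.

```python
import numpy as np

# Hypothetical toy data for six students (not taken from any real dataset).
ages = np.array([20, 20, 20, 21, 20, 20])      # almost everyone has the same age
study_hours = np.array([2, 10, 5, 14, 1, 8])   # values are spread out

# Population variance (ddof=0), the NumPy default.
age_var = ages.var()
hours_var = study_hours.var()

print(f"variance of ages:        {age_var:.3f}")
print(f"variance of study hours: {hours_var:.3f}")
```

The nearly constant `ages` column has a variance well below 1, while `study_hours` has a variance far above it, so a variance-based filter would drop the former and keep the latter.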

Import the required libraries

In [55]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.preprocessing import StandardScaler

Load the dataset

In [56]:
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


Shape of the dataset

In [57]:
df.shape

(20640, 8)

Variance of the features without normalization

In [58]:
df.var().round(2).sort_values(ascending=False)

Population    1282470.46
HouseAge          158.40
AveOccup          107.87
AveRooms            6.12
Latitude            4.56
Longitude           4.01
MedInc              3.61
AveBedrms           0.22
dtype: float64

I decided not to standardize, because after standardization every feature has a variance of 1.0, so a variance threshold could no longer tell the features apart.

In [59]:
#scaler = StandardScaler()
#df = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
#df.head()
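The effect of standardization on the variance can be checked on synthetic data. The sketch below is only an illustration: the array `X` is randomly generated, and the standardization is written out by hand with NumPy (subtract the column mean, divide by the column standard deviation), which is the same kind of transformation `StandardScaler` applies.

```python
import numpy as np

# Synthetic data: two columns on very different scales (hypothetical values).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1000, size=500),   # large-scale feature
    rng.normal(0, 0.5, size=500),    # small-scale feature
])

# Manual standardization: zero mean and unit standard deviation per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

raw_var = X.var(axis=0)       # very different variances
std_var = X_std.var(axis=0)   # both equal to 1 up to floating-point error

print("raw variances:         ", raw_var.round(2))
print("standardized variances:", std_var.round(2))
```

Because both standardized columns end up with the same variance, any threshold would either keep all of them or drop all of them.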

Perform the Variance Threshold Feature Selection

In [60]:
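# Drop every feature whose variance (in its original units) does not exceed 1.0
selector = VarianceThreshold(threshold = 1.00)
selector.fit(df)
# get_support() returns a boolean mask marking the features that are kept
df.columns[selector.get_support()]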

Index(['MedInc', 'HouseAge', 'AveRooms', 'Population', 'AveOccup', 'Latitude',
       'Longitude'],
      dtype='object')
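Only `AveBedrms`, whose variance of 0.22 is below the threshold, is removed. As a possible follow-up, the sketch below shows how the fitted selector can also return the reduced data itself rather than just the column names. It uses a small synthetic array instead of the housing data, so the array `X` and its column variances are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic data with three columns (hypothetical, not the housing dataset).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 3.0, size=200),   # variance around 9
    rng.normal(0, 0.1, size=200),   # variance around 0.01
    rng.normal(0, 2.0, size=200),   # variance around 4
])

selector = VarianceThreshold(threshold=1.0)
X_reduced = selector.fit_transform(X)   # fit, then keep only the selected columns
mask = selector.get_support()           # boolean mask of the kept columns

print("per-column variances:", selector.variances_.round(2))
print("kept columns:", mask)
print("shape before:", X.shape, "after:", X_reduced.shape)
```

With the housing data, the same pattern would be `selector.transform(df)`, which returns the seven remaining columns.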