# Introduction


Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. 

This has three benefits. First, the model becomes simpler to interpret. Second, we can reduce the variance of the model, and therefore its tendency to overfit. Finally, we can reduce the computational cost (and time) of training a model. The process of identifying only the most relevant features is called "feature selection."

Feature selection can be done in many ways; some of the most important are:
- Univariate selection
- Recursive feature elimination
- Principal component analysis
- Feature importance

#### Univariate selection

This method falls under the category of filter methods, where statistical tests such as LDE, ANOVA, and the chi-square test are used to select the features that have the strongest relationship with the output variable.
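As a rough illustration of univariate selection, the sketch below scores each iris feature against the target with a chi-square test and keeps the two highest-scoring ones. The use of scikit-learn's `SelectKBest`, the `chi2` score function, and `k=2` are assumptions chosen for this example, not something prescribed by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Score every feature independently against the target and keep the best k.
# k=2 is an arbitrary choice for illustration.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# One chi-square score per original feature (higher means a stronger relationship).
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.2f}")

# Boolean mask of the features that were kept, and the reduced shape.
print(selector.get_support())
print(X_selected.shape)
```

Note that `chi2` requires non-negative feature values, which holds for the iris measurements; for other data a different score function such as `f_classif` (the ANOVA F-test) may be more appropriate.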

#### Recursive feature elimination

Recursive feature elimination (RFE) works by repeatedly fitting a model, removing the weakest attributes, and refitting on those that remain. In scikit-learn, RFE ranks attributes by the fitted model's coefficients or feature importances to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.

This example shows the use of RFE on the Iris flowers dataset to select 3 attributes.

Importing the requisite modules

In [2]:
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

Loading the iris dataset

In [3]:
dataset = datasets.load_iris()

In [4]:
# create a base classifier used to evaluate a subset of attributes

model = LogisticRegression()

# create the RFE model and select 3 attributes

rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)

# summarize the selection of the attributes

print(rfe.support_)
print(rfe.ranking_)

[False  True  True  True]
[2 1 1 1]


So the recursive feature elimination algorithm above selected the last three features of the iris dataset.
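To make the "remove the weakest attribute, then refit" idea concrete, here is a minimal hand-written sketch of the elimination loop. It is not scikit-learn's implementation. Scoring features by squared coefficients summed over classes, and setting `max_iter=1000`, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

remaining = list(range(X.shape[1]))  # indices of features still in play
n_to_keep = 3                        # same target as the RFE example

while len(remaining) > n_to_keep:
    # Refit on the surviving features only.
    # max_iter=1000 is an assumption, chosen to avoid convergence warnings.
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # Score each feature by its squared coefficients summed over the classes.
    scores = (model.coef_ ** 2).sum(axis=0)
    # Drop the lowest-scoring feature and repeat.
    weakest = int(np.argmin(scores))
    print("dropping:", iris.feature_names[remaining[weakest]])
    remaining.pop(weakest)

print("kept:", [iris.feature_names[i] for i in remaining])
```

Which feature is dropped can depend on the solver and scikit-learn version, so the result may not match the output shown above exactly.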

#### Principal component analysis

Principal component analysis (PCA) uses linear algebra to transform the dataset into a compressed form. Unlike the other methods listed here, it does not pick a subset of the original features; it builds new features as linear combinations of them.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In [15]:
from sklearn.decomposition import PCA

In [19]:
X = dataset.data
y = dataset.target

X.shape

(150, 4)

The iris dataset has shape (150, 4). Let us reduce it using principal component analysis.

In [21]:
pca = PCA(n_components=3)
X_transform = pca.fit_transform(X)

X_transform.shape

(150, 3)

With the number of components set to 3, the transformed dataset has shape (150, 3).
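One common way to decide how many components to keep is to look at how much of the total variance each component explains. The sketch below is an illustration of that idea rather than part of the original walkthrough; it uses the `explained_variance_ratio_` attribute of a fitted `PCA` object.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit PCA with the same number of components as in the example above.
pca = PCA(n_components=3).fit(X)

# Fraction of the total variance captured by each component, and the running total.
ratios = pca.explained_variance_ratio_
cumulative = ratios.cumsum()

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")
```

If the cumulative value is already close to 1 after the first one or two components, a smaller `n_components` may be enough.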

#### Feature importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. Random forests are popular for feature selection in data science workflows because each split in a tree is chosen by how much it improves the purity of the node, so the trees naturally rank features by how much they reduce impurity.

Thus, by pruning trees below a particular node, we can create a subset of the most important features.

In [34]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score 

Splitting into training and testing sets.

In [26]:
X_train,X_test,y_train,y_test = tts(dataset.data,dataset.target)

Fitting to a decision tree classifier.

In [31]:
dt = DT()
dt.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Predicting using a decision tree classifier.

In [36]:
y_pred = dt.predict(X_test)

In [37]:
accuracy_score(y_test,y_pred)

0.94736842105263153

Predicting using a random forest classifier.

In [39]:
rt = RF()
rt.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [41]:
y_pred = rt.predict(X_test)
accuracy_score(y_test,y_pred)

0.97368421052631582

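In this run, the random forest scored higher than the single decision tree: 0.974 versus 0.947, an increase of about 2.6 percentage points.
With the default 25% test split (38 samples), that difference is one additional correct prediction, and because neither the split nor the models are seeded, the exact numbers will vary from run to run.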
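The accuracy comparison does not show the importance scores themselves. As a hedged sketch, a fitted random forest exposes them through its `feature_importances_` attribute; `n_estimators=100` and `random_state=0` are assumptions chosen here only to make the illustration repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# n_estimators and random_state are illustrative choices, not values from the text.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importances: one value per feature, summing to 1.
importances = forest.feature_importances_

# Rank the features from most to least important.
ranked = sorted(zip(iris.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Features with the lowest scores are the natural candidates to drop when building a smaller model.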