### Feature Selection for ML

The features you choose to train a model have a large impact on its performance, so train only on features that are relevant to the prediction. Automatic feature selection techniques can help with this step:

- Univariate Selection
- Recursive Feature Elimination
- Principal Component Analysis
- Feature Importance

#### Feature selection

- Can be automated
- The process of selecting the features that contribute most to the prediction or output of interest
- Irrelevant features can be harmful, especially to linear algorithms (linear and logistic regression)
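
As a rough illustration of the last point, the sketch below compares the cross-validated accuracy of a logistic regression trained on informative features only against the same model trained with extra noise columns. The synthetic dataset, the number of noise columns, and all variable names are assumptions made for this example. The size of the gap (if any) depends on the data, so no specific result is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): 100 samples, 5 informative features
X_clean, y = make_classification(n_samples=100, n_features=5, n_informative=5,
                                 n_redundant=0, random_state=0)

# Append 200 columns of pure noise that carry no information about y
X_noisy = np.hstack([X_clean, rng.normal(size=(100, 200))])

model = LogisticRegression(max_iter=1000)

# Mean 5-fold cross-validated accuracy for each feature set
score_clean = cross_val_score(model, X_clean, y, cv=5).mean()
score_noisy = cross_val_score(model, X_noisy, y, cv=5).mean()

print("informative features only: %.3f" % score_clean)
print("with 200 noise features:   %.3f" % score_noisy)
```

With few samples and many irrelevant columns, a linear model has more room to fit noise, which is the situation feature selection tries to avoid.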

Benefits

* Reduces overfitting

Less redundant data means fewer decisions based on noise

* Improves accuracy

Less misleading data means better accuracy

* Reduces training time

Less data means faster training

More info: http://scikit-learn.org/stable/modules/feature_selection.html

##### Univariate feature selection

Use statistical tests to select the subset of features most strongly related to the output variable. For example, the chi-squared test, which requires non-negative features, can be used to select the 4 best features (example below).
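
To make the chi-squared score less of a black box, here is a minimal sketch that computes it by hand with NumPy and checks the result against scikit-learn's `chi2`. The toy data and the helper name `chi2_scores` are assumptions for illustration. The calculation treats each feature's per-class total as an observed count and compares it with the total expected if the feature were independent of the class.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data (hypothetical): 6 samples, 2 non-negative features, binary labels
X_toy = np.array([[5.0, 1.0], [4.0, 2.0], [6.0, 1.0],
                  [1.0, 2.0], [0.0, 1.0], [1.0, 3.0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])


def chi2_scores(X, y):
    """Chi-squared statistic of each feature against the class labels."""
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                             # per-class feature totals
    feature_total = X.sum(axis=0, keepdims=True)   # overall feature totals
    class_prob = Y.mean(axis=0, keepdims=True)     # class frequencies
    expected = class_prob.T @ feature_total        # totals under independence
    return ((observed - expected) ** 2 / expected).sum(axis=0)


manual = chi2_scores(X_toy, y_toy)
library, _ = chi2(X_toy, y_toy)  # chi2 returns (scores, p-values)
print(manual)
print(library)
```

The first toy feature is concentrated in one class and the second is spread almost evenly, so the first receives the higher score.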


Using scikit-learn's `SelectKBest` class: 

# Feature extraction with univariate statistical tests (chi-squared for classification)

from pandas import read_csv
from numpy import set_printoptions

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]  # the eight input features
Y = array[:, 8]    # the class label

print(names)
# Score every feature with chi-squared and keep the 4 best
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# Summarize scores
set_printoptions(precision=3)
fscore = fit.scores_
print(fscore)
features = fit.transform(X)

# Summarize selected features (first 5 rows)
print(features[0:5, :])

# Indices of the 4 highest-scoring features, best first
top_index = sorted(range(len(fscore)), key=lambda i: fscore[i], reverse=True)[:4]

print('\nTop features:', ' '.join(names[t] for t in top_index))


['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]

Top features: test plas age mass
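
An alternative to ranking the scores by hand is to ask the fitted selector which columns it kept. The sketch below uses `get_support()`, which returns a boolean mask over the input columns. The toy data and the feature names `a`, `b`, `c` are hypothetical and stand in for the dataset above so that the snippet runs without a download.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (hypothetical): 6 samples, 3 non-negative features, binary labels
feature_names = ['a', 'b', 'c']
X_small = np.array([[5.0, 1.0, 0.0], [4.0, 2.0, 1.0], [6.0, 1.0, 0.0],
                    [1.0, 2.0, 5.0], [0.0, 1.0, 6.0], [1.0, 3.0, 4.0]])
y_small = np.array([1, 1, 1, 0, 0, 0])

selector = SelectKBest(score_func=chi2, k=2).fit(X_small, y_small)

# Boolean mask with True for every column the selector keeps
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(selected)
```

The names come back in column order rather than score order, which matches the column order of the array returned by `transform`.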
