In [1]:
"""
Feature Selection for Machine Learning in Python
URL: https://machinelearningmastery.com/feature-selection-machine-learning-python/
"""

"""
FEATURE SELECTION

Feature selection is a process where you automatically select those features in your data that 
contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms 
like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

    - Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
    - Improves Accuracy: Less misleading data means modeling accuracy improves.
    - Reduces Training Time: Less data means that algorithms train faster.
"""
import pandas as pd 
import numpy

# Load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names = names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

In [2]:
"""
1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the 
output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different 
statistical tests to select a specific number of features.

The example below uses the chi-squared (chi^2) statistical test, which requires non-negative features, 
to select the 4 best features from the Pima Indians onset of diabetes dataset.
"""
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Feature extraction
selector = SelectKBest(score_func=chi2, k = 4)
fit = selector.fit(X, Y)

# Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = selector.transform(X)

# Summarize selected features
print(features[0:5, :])

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]
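
The transformed array above no longer carries column names, so it is not obvious which features were kept. The sketch below shows one way to map the selection back to names with get_support(). It uses synthetic Poisson data rather than the diabetes dataset, and the names X_demo, y_demo and demo_names are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data (an assumption for this sketch):
# two columns depend on the class label, two are pure noise.
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    rng.poisson(5 + 10 * y_demo),   # informative
    rng.poisson(5, size=n),         # noise
    rng.poisson(20 + 15 * y_demo),  # informative
    rng.poisson(20, size=n),        # noise
])
demo_names = ['signal_a', 'noise_a', 'signal_b', 'noise_b']

demo_selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns
selected = [name for name, keep in zip(demo_names, demo_selector.get_support()) if keep]
print(selected)
```

Applied to the notebook's own selector and names list, the same zip over names[:8] would label the four retained columns.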


In [3]:
"""
2. Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model 
on those attributes that remain.

At each step it ranks attributes by the fitted model's coefficients or feature importances and drops 
the weakest, leaving the attributes that contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top 3 features. 
The choice of algorithm does not matter too much as long as it is skillful and consistent.
"""
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction
model = LogisticRegression()
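# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)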
fit = rfe.fit(X, Y)
print("Num Features: {}".format(fit.n_features_))
print("Selected Features: {}".format(fit.support_))
print("Features Ranking: {}".format(fit.ranking_))

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Features Ranking: [1 2 3 5 6 1 1 4]
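
To make the elimination loop concrete, here is a minimal hand-rolled sketch of the same idea: fit a model, drop the feature with the smallest absolute coefficient, and repeat. It is an illustration on synthetic data, not scikit-learn's internal implementation, and X_demo, y_demo and remaining are hypothetical names.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption): the target depends on columns 0 and 3 only
rng = np.random.default_rng(0)
n = 300
X_demo = rng.normal(size=(n, 5))
logits = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3]
y_demo = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Manual elimination: refit on the surviving columns, drop the weakest one
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 2:
    demo_model = LogisticRegression().fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(np.abs(demo_model.coef_[0])))
    remaining.pop(weakest)
print("Manual loop keeps columns:", remaining)

# The RFE class should arrive at the same columns on this data
demo_rfe = RFE(LogisticRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_columns = np.flatnonzero(demo_rfe.support_).tolist()
print("RFE keeps columns:", rfe_columns)
```

Because the ranking comes from coefficient magnitudes, features on very different scales can be ranked unfairly. Standardizing the inputs first is a common precaution.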


In [4]:
"""
3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number
of dimensions, or principal components, in the transformed result.

In the example below, we use PCA and select 3 principal components.

Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math 
behind PCA on the Principal Component Analysis Wikipedia article.
"""
from sklearn.decomposition import PCA

# Feature Extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

# Summarize components
print("Explained Variance: {}".format(fit.explained_variance_ratio_))
print(fit.components_)

Explained Variance: [ 0.889  0.062  0.026]
[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [ -2.265e-02  -9.722e-01  -1.419e-01   5.786e-02   9.463e-02  -4.697e-02
   -8.168e-04  -1.402e-01]
 [ -2.246e-02   1.434e-01  -9.225e-01  -3.070e-01   2.098e-02  -1.324e-01
   -6.400e-04  -1.255e-01]]
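
The linear algebra behind these numbers can be sketched with a singular value decomposition of the mean-centered data. The example below rebuilds the explained variance ratios and components by hand and compares them with the PCA class. It uses synthetic data with deliberately different column scales, and X_demo and the other names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption): four columns with very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) * np.array([10.0, 3.0, 1.0, 0.5])

# PCA by hand: center the data, then take the SVD
centered = X_demo - X_demo.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Squared singular values are proportional to the variance along each component
variance = singular_values ** 2 / (len(X_demo) - 1)
ratio_manual = (variance / variance.sum())[:3]

demo_pca = PCA(n_components=3).fit(X_demo)
print(ratio_manual)
print(demo_pca.explained_variance_ratio_)

# Rows of vt are the principal directions; they match components_ up to sign
same_directions = np.allclose(np.abs(vt[:3]), np.abs(demo_pca.components_), atol=1e-6)
print(same_directions)
```

Note that the widest-scaled column dominates the first component here, much as the 0.889 ratio above is driven by the column with the largest spread. Standardizing the features before PCA is a common way to avoid that.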


In [5]:
"""
4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes 
dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.
"""
from sklearn.ensemble import ExtraTreesClassifier

# Feature Extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[ 0.118  0.231  0.104  0.083  0.077  0.137  0.111  0.139]
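
The raw importance array is easier to read when paired with feature names and sorted. The sketch below does that on synthetic data and also fixes random_state, since Extra Trees splits are randomized and the scores otherwise change from run to run. The data and the names X_demo, y_demo and demo_names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption): only column 'f2' determines the class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 2] > 0).astype(int)
demo_names = ['f0', 'f1', 'f2', 'f3']

# A fixed random_state makes the importance scores reproducible
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Pair each score with its name and sort from most to least important
ranked = sorted(zip(demo_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{}: {:.3f}".format(name, score))
```

The scores sum to 1, so each value can be read as that feature's share of the total impurity reduction across the ensemble.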
