## Table of Contents

- Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
- Recursive Feature Elimination
- Principal Component Analysis
- Feature Importance using Tree-based Models

## Feature Selection

- Feature selection means choosing the features in your data that contribute most to the 
prediction variable or output you are interested in.

- Having irrelevant features in your data can decrease the accuracy of many models, especially linear
algorithms like linear and logistic regression. 


- Three benefits of performing feature selection before modeling your data are:
    + Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
    + Improves Accuracy: Less misleading data means modeling accuracy improves.
    + Reduces Training Time: Less data means that algorithms train faster.

In [0]:
# Load the CSV with pandas as a pandas.DataFrame object
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#filename = 'pima-indians-diabetes.data.csv'
#names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#data = pd.read_csv(filename, names=names)

# The cells below use `data` with named columns (Glucose, BMI, ..., Outcome),
# so load the CSV that has a header row
data = pd.read_csv('diabetes.csv')
# print(data.shape)
pd.set_option('display.precision', 2)

In [0]:
# import sklearn.datasets
# data = sklearn.datasets.load_diabetes()

In [0]:
# print(data.DESCR)

Diabetes dataset

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani

## Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest


In [0]:
# X = data.data[:,:-1]
# Y = data.target
X = data.iloc[:,:-1]
Y = data.iloc[:,-1]

In [0]:
print(X.shape)
print(Y.shape)

(768, 8)
(768,)


In [0]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
features = fit.transform(X)

In [0]:
# summarize scores
np.set_printoptions(precision=2)
print(fit.scores_)

[ 111.52 1411.89   17.61   53.11 2175.57  127.67    5.39  181.3 ]


In [0]:
score_df = pd.DataFrame(fit.scores_, index=X.columns,
                        columns=['Scores'])
score_df.sort_values(by='Scores', ascending=False).head(4)

|  | Scores |
|---|---|
| Insulin | 2175.57 |
| Glucose | 1411.89 |
| Age | 181.3 |
| BMI | 127.67 |


In [0]:
data.corr()['Outcome'].sort_values(ascending=False).iloc[1:5]

Glucose        0.47
BMI            0.29
Age            0.24
Pregnancies    0.22
Name: Outcome, dtype: float64

In [0]:
# summarize selected features
print(features[0:5,:])
print(type(features))

[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
<class 'numpy.ndarray'>


In [0]:
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = X.astype('float64')

# Rescale each feature to [0, 1]; rescaledX is used by the RFE and Extra Trees cells below
rescaledX = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Normalize each row (sample) to unit L2 norm
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# Chi-squared scores on the row-normalized data
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(normalizedX, Y)
print(normalizedX[0:1,:])

[[0.03 0.83 0.4  0.2  0.   0.19 0.   0.28]]


In [0]:
# summarize scores
features = fit.transform(normalizedX)
print(fit.scores_)

[1.25e-01 5.06e-02 2.44e+00 4.12e-01 3.31e-01 1.51e-01 2.29e-04 1.13e-01]
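
The normalized scores are orders of magnitude smaller than the unscaled ones. One contributing reason is that `chi2` treats feature values like counts, so its statistic grows with the scale of a column. A minimal sketch on made-up data (all names are illustrative, not from the notebook) shows a feature multiplied by 100 getting a score 100 times larger:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: one non-negative feature that is related to a binary target
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=100)
x_demo = rng.rand(100, 1) + y_demo.reshape(-1, 1)

# chi2 returns (scores, p-values); compare the score before and after rescaling
score_raw, _ = chi2(x_demo, y_demo)
score_scaled, _ = chi2(x_demo * 100, y_demo)

ratio = score_scaled[0] / score_raw[0]
print(score_raw, score_scaled, ratio)
```

Because of this, chi-squared scores are only comparable between features that are measured on similar scales.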


In [0]:
score_df = pd.DataFrame(fit.scores_, index=X.columns,
                        columns=['Scores'])
score_df.sort_values(by='Scores', ascending=False).head(4)

|  | Scores |
|---|---|
| BloodPressure | 2.44 |
| SkinThickness | 0.41 |
| Insulin | 0.33 |
| BMI | 0.15 |


In [0]:
# summarize selected features
features[0:5,:]

array([[0.35, 0.74, 0.23, 0.48],
       [0.06, 0.43, 0.12, 0.17],
       [0.47, 0.92, 0.25, 0.18],
       [0.06, 0.45, 0.04, 0.  ],
       [0.  , 0.69, 0.94, 0.2 ]])

Mapping the selected values back to column names by hand shows that the unscaled chi-squared run selected plas (Glucose), test (Insulin), mass (BMI) and age (Age).
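
Mapping selected columns back to their names does not have to be done by hand: a fitted `SelectKBest` exposes `get_support()`, a boolean mask aligned with the input columns. A minimal sketch on a small synthetic DataFrame (the column names and values are made up for illustration, not taken from the diabetes data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: non-negative features and a binary target
rng = np.random.RandomState(0)
y_toy = rng.randint(0, 2, size=200)
X_toy = pd.DataFrame({
    'noise_a': rng.randint(0, 10, size=200),
    'signal': y_toy * 20 + rng.randint(0, 5, size=200),  # strongly tied to the target
    'noise_b': rng.randint(0, 10, size=200),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns
selected_names = list(X_toy.columns[selector.get_support()])
print(selected_names)
```

The same pattern, `X.columns[fit.get_support()]`, should work for the `fit` object in the cells above as long as `X` is still a DataFrame.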

## Recursive Feature Elimination

+ RFE works by recursively removing attributes and building a model on those attributes that remain. 

+ It uses the fitted model's coefficients or feature importances to rank the attributes and drops the 
weakest ones at each step, which identifies the attributes (and combination of attributes) that 
contribute the most to predicting the target attribute.

+ The example below uses RFE with the logistic regression algorithm to select the top 4 features.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE


In [0]:
Y[0:5]

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

In [0]:
# feature extraction
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=4)
fit = rfe.fit(rescaledX, Y)
print("Num Features:", fit.n_features_)
print("Selected Features:", fit.support_)
print("Feature Ranking:", fit.ranking_)

Num Features: 4
Selected Features: [ True  True False False False  True  True False]
Feature Ranking: [1 1 2 4 5 1 1 3]




In [0]:
type(fit.ranking_)

numpy.ndarray

In [0]:
X.columns[(fit.ranking_==1)]

Index(['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction'], dtype='object')
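
To make the meaning of `support_` and `ranking_` concrete, here is a self-contained sketch on synthetic data from `make_classification` (a stand-in, not the diabetes data; the variable names are illustrative). It assumes the default `step=1`, so one feature is removed per iteration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 features, of which 3 are informative
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe_demo.fit(X_syn, y_syn)

# support_ marks the kept features; ranking_ is 1 for kept features and
# grows for eliminated ones (the largest rank was dropped first)
print(rfe_demo.support_)
print(rfe_demo.ranking_)
```

With 8 features and 4 kept, the ranks are four 1s followed by 2, 3, 4 and 5 for the eliminated features.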

## Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a
compressed form. Generally this is called a data reduction technique. A property of PCA is that
you can choose the number of dimensions or principal components in the transformed result. In
the example below, we use PCA and select 3 principal components. 

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [0]:
# Feature Extraction with PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
fit = pca.fit(X)

# summarize components
print("Explained Variance:", fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.408 0.165 0.128]
[[ 0.219  0.196  0.312  0.274  0.37   0.384 -0.302  0.459  0.393]
 [ 0.059 -0.402 -0.157 -0.123  0.558  0.429  0.537 -0.103 -0.033]
 [ 0.565 -0.027  0.201  0.584 -0.069 -0.255  0.326 -0.337  0.09 ]]


You can see that the transformed dataset (3 principal components) bears little resemblance
to the source data.
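
The sketch below shows the shapes involved when reducing 8 columns to 3 components. It uses random stand-in data and standardizes the columns first, a common practice (an assumption here, not something the notebook does) so that a single large-scale column does not dominate the components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 8 columns on very different scales
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 8) * np.array([1, 10, 100, 1, 1, 1, 1, 1])

# Standardize so every column has zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=3)
X_reduced = pca_demo.fit_transform(X_std)

print(X_reduced.shape)                     # same rows, columns reduced to 3
print(pca_demo.explained_variance_ratio_)  # fraction of variance per component
print(pca_demo.components_.shape)          # one row of loadings per component
```

Each row of `components_` holds the weights of the original columns in one component, which is why the transformed values are linear mixtures rather than recognizable source columns.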

## Feature Importance

Ensembles of decision trees like Random Forest and Extra Trees can be used to estimate the importance
of features. 

In the example below we construct an ExtraTreesClassifier for the Pima
Indians onset of diabetes dataset.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html


In [0]:
# Feature Importance with Extra Trees Classifier
from sklearn.ensemble import ExtraTreesClassifier

# feature extraction
model = ExtraTreesClassifier()
model.fit(rescaledX, Y)

print(model.feature_importances_)

[0.12 0.23 0.1  0.08 0.08 0.14 0.12 0.14]




In [0]:
sc= pd.DataFrame(model.feature_importances_, index=X.columns)

In [0]:
sc

|  | 0 |
|---|---|
| Pregnancies | 0.12 |
| Glucose | 0.23 |
| BloodPressure | 0.1 |
| SkinThickness | 0.08 |
| Insulin | 0.08 |
| BMI | 0.14 |
| DiabetesPedigreeFunction | 0.12 |
| Age | 0.14 |


You can see that we are given an importance score for each attribute where the larger the
score, the more important the attribute. The scores point to plas (Glucose), mass (BMI)
and age (Age) as the most important attributes.
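
Extra Trees is randomized, so the scores shift between runs unless a seed is fixed. The sketch below, on synthetic stand-in data with hypothetical column names `f0`..`f5`, fixes `random_state` and sorts the importances so the ranking is easy to read:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 6 features, of which 2 are informative
X_imp, y_imp = make_classification(n_samples=300, n_features=6, n_informative=2,
                                   n_redundant=0, random_state=0)
feature_names = [f'f{i}' for i in range(6)]  # hypothetical column names

# Fix the seed so repeated runs give the same importances
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_imp, y_imp)

# Importances are normalized to sum to 1; sort to see the ranking
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Because the scores are relative shares that sum to 1, they rank features within one model and are not comparable across datasets.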