###  Python Basics Tutorial

#### Feature Selection

####  Machine Learning Mastery with Python
####  Jason Brownlee

---

#### In this recipe:
- Univariate selection
- Recursive Feature Elimination
- PCA
- Feature Importance

### Univariate Selection

In [2]:
## Statistical tests can be used to score the relationship between
##   each independent variable (feature) and the dependent variable

## this example uses the chi-squared test, which requires non-negative feature values

from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

path = 'D:\\OneDrive - QJA\\My Files\\DataScience\\DataSets'
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 
         'mass', 'pedi', 'age', 'class']

dataframe = read_csv(path + '\\' + filename,
                    names = names)

In [6]:
array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

# create object to perform SelectKBest using chi-sq test
test = SelectKBest(score_func = chi2, k = 4)

# test independent variables using test object above
fit = test.fit(X, Y)

# print the chi-squared score for each attribute.  higher scores indicate
# a stronger statistical dependence between the feature and the class
set_printoptions(precision = 3)
print(fit.scores_)
features = fit.transform(X)

# print the first 5 rows of the 4 selected (highest-scoring) features
# columns correspond to plas, test, mass, age
#   (fit.scores_ is indexed in the same order as names[:-1])
print(features[0:5, :])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
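The mapping from score index to attribute name can be done in a few lines of plain Python. This is a minimal sketch that hardcodes the scores printed above instead of reading them from `fit.scores_`; the variable names `scores`, `ranked`, and `top4` are illustrative and not part of the original recipe.

```python
# Feature names in the same column order as X (dependent variable excluded)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Chi-squared scores copied from the output above; in the notebook this
# would be list(fit.scores_)
scores = [111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304]

# Pair each name with its score and sort from highest to lowest score
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

# Keep the k = 4 best names, matching SelectKBest(k = 4)
top4 = [name for name, _ in ranked[:4]]
print(top4)
```

The ranking puts test first, then plas, age, and mass. The transformed array lists the same four features as plas, test, mass, age because `transform` keeps the selected columns in their original order. In the notebook itself, `fit.get_support()` returns the equivalent boolean mask over the columns.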


### Recursive Feature Elimination

- Repeatedly fits a model, removes the weakest attribute, and refits on those that remain

- Note: as with other feature selection algorithms, use the result only as a guide

In [23]:
from pandas import read_csv
from numpy import set_printoptions

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

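# create object containing the logistic regression algorithm
# use vars(object) to see object attributes
model = LogisticRegression(solver = 'liblinear')
# model
# RFE estimator that keeps 3 features; n_features_to_select must be
# passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select = 3)
# rfe
fit = rfe.fit(X, Y) # fit RFE on the independent (X) and dependent (Y) variables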

print('All Features: %s' % names[:-1]) # -1 exclude dep var
print("Number of Features: %d" % fit.n_features_)
print('SelectedFeatures: %s' % fit.support_)
print('Feature Ranking: %s' % fit.ranking_)

# top 3 choices are preg, mass, pedi (indicated by "True")

All Features: ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
Number of Features: 3
SelectedFeatures: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
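The elimination loop behind these rankings can be illustrated with a small sketch. It is not scikit-learn's implementation: it substitutes ordinary least squares for logistic regression so that only NumPy is needed, and it runs on synthetic data (an assumption for illustration) where only columns 0 and 2 influence the target. The names `remaining` and `n_keep` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 5 features, of which
# only features 0 and 2 actually drive the target
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

remaining = list(range(X.shape[1]))  # column indices still in the model
n_keep = 2                           # plays the role of n_features_to_select

while len(remaining) > n_keep:
    # Fit ordinary least squares on the columns that are still in play
    coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
    # Drop the feature whose coefficient has the smallest magnitude
    weakest = int(np.argmin(np.abs(coef)))
    del remaining[weakest]

print(remaining)
```

The loop ends with columns 0 and 2, the two that generate `y`. Ranking by coefficient magnitude is only meaningful when the features are on comparable scales, which is one reason to treat RFE output on unscaled data as a guide rather than a verdict.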


### Principal Component Analysis

- uses linear algebra to compress the variables into a smaller number of components

In [21]:
from sklearn.decomposition import PCA

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

# object to contain PCA algo
pca = PCA(n_components = 3)
fit = pca.fit(X) # store pca of X in fit

print('Explained Variance: %s' % fit.explained_variance_ratio_)
print(fit.components_)

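# first component captures approx. 89% of the variance in X and is
# dominated by 'test' (loading 0.993); PCA is unsupervised and does
# not use the dependent variable Y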

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
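The first component above loads almost entirely on a single column (0.993 on `test`). One likely reason is that PCA is sensitive to the scale of each variable, so a column with a much larger spread can dominate. The sketch below illustrates that effect with NumPy only, on synthetic data (an assumption, not the Pima dataset); the helper `explained_variance_ratio` is a hypothetical name written for this example.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)

# Synthetic data (an assumption): two independent columns, the second
# measured on a scale 100 times larger than the first
data = np.column_stack([rng.normal(scale=1.0, size=1000),
                        rng.normal(scale=100.0, size=1000)])

raw_ratio = explained_variance_ratio(data)

# Standardize each column to zero mean and unit variance, then repeat
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_ratio = explained_variance_ratio(standardized)

print(raw_ratio)
print(scaled_ratio)
```

On the raw data nearly all of the variance lands on the first component, while after standardizing the two components share it roughly equally. Whether to standardize before PCA depends on whether the original units are meaningful for the problem at hand.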


### Feature Importance

- uses ensembles of decision trees, such as Random Forest and Extra Trees, to estimate the importance of each feature

- in this example, ExtraTreesClassifier is used

In [27]:
from sklearn.ensemble import ExtraTreesClassifier

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

# object to contain model estimator algorithm
model = ExtraTreesClassifier(n_estimators = 100)
model.fit(X, Y) # fit model to data

# gives importance score for each attribute
print('All Features: %s' % names[:-1]) # -1 exclude dep var
print(model.feature_importances_)

# plas, mass, and age have the highest scores (most influential)
# scores vary slightly between runs because no random_state is set
# see how this compares with the other methods used above

All Features: ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
[0.106 0.239 0.099 0.08  0.076 0.142 0.12  0.139]
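One simple way to compare the methods is to count how often each feature was selected. The sketch below hardcodes the selections reported in this recipe's outputs; the top-3 cut-off for Extra Trees is an assumption made for the comparison, and PCA is left out because its components are combinations of all the features rather than a selection. The names `selected` and `votes` are illustrative.

```python
from collections import Counter

# Features picked by each method, copied from the outputs in this recipe
selected = {
    'chi2 (top 4)':       ['plas', 'test', 'mass', 'age'],
    'RFE (top 3)':        ['preg', 'mass', 'pedi'],
    'ExtraTrees (top 3)': ['plas', 'mass', 'age'],
}

# Count how many methods picked each feature
votes = Counter(name for picks in selected.values() for name in picks)

for name, count in votes.most_common():
    print(name, count)
```

Here mass is chosen by all three methods, and plas and age by two. Agreement across methods is a hint rather than proof, since the Extra Trees scores change from run to run unless `random_state` is fixed.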
