# Explore Feature Selection/Elimination

In [38]:
# Suppress warnings for the sake of readability
import warnings
warnings.filterwarnings("ignore") 

The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Irrelevant or partially relevant features can negatively impact model performance.

In this notebook you will discover automatic feature selection techniques that you can use to prepare your machine learning data in Python with scikit-learn.

# Feature selection


Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested. Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.


Three benefits of performing feature selection before modeling your data are:

* Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

* Improves Accuracy: Less misleading data means modeling accuracy improves.

* Reduces Training Time: Less data means that algorithms train faster.

This section will introduce you to a few options.

## 1. Select K Best

## Select K best for Classification 


The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features (see https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest).



### Step 1: Import libraries

In [39]:
# Feature Selection with Univariate Statistical Tests

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array 

### Step 2: Load your data

In [40]:
# load data
iris = load_iris()
x = iris.data
y = iris.target
 
print("Feature data dimension: ", x.shape) 
 



Feature data dimension:  (150, 4)


### Step 3: Execute the Selection

We'll define the selector using the SelectKBest class. For classification, we'll use 'chi2' as the scoring function; it requires non-negative feature values, which holds for the iris measurements. The number of features to keep is set by the k parameter. We then call fit_transform on the x and y data.

In [41]:
select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x,y)
 
print("After selecting best 3 features:", z.shape) 
 


After selecting best 3 features: (150, 3)


We've selected the 3 best features in x. To identify them, we call get_support(), which returns a boolean mask, and use that mask to index the array of feature names. The z object contains the selected columns of x.

In [42]:
filter = select.get_support()
features = array(iris.feature_names)
 
print("All features:")
print(features)
 
print("Selected best 3:")
print(features[filter])
print(z) 

All features:
['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
 'petal width (cm)']
Selected best 3:
['sepal length (cm)' 'petal length (cm)' 'petal width (cm)']
[[5.1 1.4 0.2]
 [4.9 1.4 0.2]
 [4.7 1.3 0.2]
 [4.6 1.5 0.2]
 [5.  1.4 0.2]
 [5.4 1.7 0.4]
 [4.6 1.4 0.3]
 [5.  1.5 0.2]
 [4.4 1.4 0.2]
 [4.9 1.5 0.1]
 [5.4 1.5 0.2]
 [4.8 1.6 0.2]
 [4.8 1.4 0.1]
 [4.3 1.1 0.1]
 [5.8 1.2 0.2]
 [5.7 1.5 0.4]
 [5.4 1.3 0.4]
 [5.1 1.4 0.3]
 [5.7 1.7 0.3]
 [5.1 1.5 0.3]
 [5.4 1.7 0.2]
 [5.1 1.5 0.4]
 [4.6 1.  0.2]
 [5.1 1.7 0.5]
 [4.8 1.9 0.2]
 [5.  1.6 0.2]
 [5.  1.6 0.4]
 [5.2 1.5 0.2]
 [5.2 1.4 0.2]
 [4.7 1.6 0.2]
 [4.8 1.6 0.2]
 [5.4 1.5 0.4]
 [5.2 1.5 0.1]
 [5.5 1.4 0.2]
 [4.9 1.5 0.2]
 [5.  1.2 0.2]
 [5.5 1.3 0.2]
 [4.9 1.4 0.1]
 [4.4 1.3 0.2]
 [5.1 1.5 0.2]
 [5.  1.3 0.3]
 [4.5 1.3 0.3]
 [4.4 1.3 0.2]
 [5.  1.6 0.6]
 [5.1 1.9 0.4]
 [4.8 1.4 0.3]
 [5.1 1.6 0.2]
 [4.6 1.4 0.2]
 [5.3 1.5 0.2]
 [5.  1.4 0.2]
 [7.  4.7 1.4]
 [6.4 4.5 1.5]
 [6.9 4.9 1.5]
 [5.5 4.  1.3]
 [6.5 4.6 1.5]
 ...
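To see why these three features were kept, it can help to look at the per-feature scores stored on the fitted selector. The following is a minimal sketch, assuming the same iris data and chi2 scoring as above; the names selector and scores_by_feature are illustrative and do not come from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
x, y = iris.data, iris.target

# Fit the selector; scores_ holds one chi2 statistic per input feature.
selector = SelectKBest(score_func=chi2, k=3).fit(x, y)

# Pair each feature name with its score and sort from highest to lowest.
scores_by_feature = sorted(
    zip(iris.feature_names, selector.scores_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in scores_by_feature:
    print(f"{name}: {score:.2f}")
```

The k features with the highest scores are the ones that get_support() marks as True; the lowest-scoring feature is the one that gets dropped.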

## Select K best for Regression 

In [43]:
boston = load_boston()
x = boston.data
y = boston.target

print("Feature data dimension: ", x.shape)

Feature data dimension:  (506, 13)


Again, we'll define the selector using the SelectKBest class. For regression, we'll use 'f_regression' as the scoring function. The number of features to select is 8. We then call fit_transform on the x and y data.

In [44]:
select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y) 
 
print("After selecting best 8 features:", z.shape) 

After selecting best 8 features: (506, 8)


To identify the selected features, we again call get_support() and use the resulting mask to index the array of feature names. The z object contains the selected columns of x.

In [45]:
filter = select.get_support()
features = array(boston.feature_names)
 
print("All features:")
print(features)
 
print("Selected best 8:")
print(features[filter])
print(z) 

All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Selected best 8:
['CRIM' 'INDUS' 'NOX' 'RM' 'RAD' 'TAX' 'PTRATIO' 'LSTAT']
[[6.320e-03 2.310e+00 5.380e-01 ... 2.960e+02 1.530e+01 4.980e+00]
 [2.731e-02 7.070e+00 4.690e-01 ... 2.420e+02 1.780e+01 9.140e+00]
 [2.729e-02 7.070e+00 4.690e-01 ... 2.420e+02 1.780e+01 4.030e+00]
 ...
 [6.076e-02 1.193e+01 5.730e-01 ... 2.730e+02 2.100e+01 5.640e+00]
 [1.096e-01 1.193e+01 5.730e-01 ... 2.730e+02 2.100e+01 6.480e+00]
 [4.741e-02 1.193e+01 5.730e-01 ... 2.730e+02 2.100e+01 7.880e+00]]
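Because load_boston is not available in scikit-learn 1.2 and later, here is a sketch of the same regression steps on a dataset that ships with current releases. It assumes load_diabetes as a stand-in (it is not the dataset used above) and keeps 5 of its 10 features; k=5 is an arbitrary choice for illustration.

```python
from numpy import array
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Stand-in dataset (assumption): diabetes instead of Boston housing.
diabetes = load_diabetes()
x, y = diabetes.data, diabetes.target
print("Feature data dimension: ", x.shape)

# Score each feature with a univariate F-test and keep the 5 best.
select = SelectKBest(score_func=f_regression, k=5)
z = select.fit_transform(x, y)
print("After selecting best 5 features:", z.shape)

# Map the boolean mask back to feature names.
mask = select.get_support()
features = array(diabetes.feature_names)
selected = features[mask]
print("Selected best 5:", selected)
```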


## 2. Recursive Feature Elimination

In [46]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array

In [47]:
boston = load_boston()
x = boston.data
y = boston.target

print("Feature data dimension: ", x.shape)  

Feature data dimension:  (506, 13)


In [48]:
estimator = AdaBoostRegressor(random_state=0, n_estimators=100)
selector = RFE(estimator, n_features_to_select=8, step=1)
selector = selector.fit(x, y)

In [49]:
filter = selector.support_
ranking = selector.ranking_

print("Mask data: ", filter)
print("Ranking: ", ranking) 

Mask data:  [ True False False False  True  True False  True  True  True  True False
  True]
Ranking:  [1 5 3 6 1 1 4 1 1 1 1 2 1]


In [50]:
features = array(boston.feature_names)
print("All features:")
print(features)

print("Selected features:")
print(features[filter])
  

All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Selected features:
['CRIM' 'NOX' 'RM' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'LSTAT']
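Recursive feature elimination works by fitting the estimator, dropping the least important remaining feature, and repeating until the target number of features is left. The sketch below spells out that loop by hand and prints its result next to that of RFE. It is a simplified illustration, not scikit-learn's actual implementation. It assumes load_diabetes and LinearRegression, swapped in for the Boston data and AdaBoostRegressor so that it runs on current scikit-learn versions and fits quickly; manual_rfe is a hypothetical helper name.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def manual_rfe(estimator, x, y, n_features_to_select):
    """Hypothetical helper: eliminate one feature per round (like step=1)."""
    remaining = list(range(x.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the estimator on the remaining columns.
        model = clone(estimator).fit(x[:, remaining], y)
        # Use squared coefficients as the importance of each remaining feature.
        importances = model.coef_ ** 2
        # Drop the feature with the smallest importance.
        del remaining[int(np.argmin(importances))]
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[remaining] = True
    return mask


diabetes = load_diabetes()
x, y = diabetes.data, diabetes.target

manual_mask = manual_rfe(LinearRegression(), x, y, n_features_to_select=5)
rfe_mask = RFE(LinearRegression(), n_features_to_select=5, step=1).fit(x, y).support_

features = np.array(diabetes.feature_names)
print("Manual loop:", features[manual_mask])
print("RFE:        ", features[rfe_mask])
```

With a deterministic estimator and no ties in importance, both approaches should select the same features. In the ranking_ array, selected features get rank 1 and the others are numbered in reverse order of elimination, so the first feature dropped has the highest rank.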


## EXTRA: PCA as Feature Selection / Speed-Up Factor, etc.

https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
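As a brief sketch of the idea in this heading, PCA can reduce the number of columns a model has to process. Unlike SelectKBest and RFE, it does not pick a subset of the original features; it builds new ones (principal components) as linear combinations of them. The example assumes the iris data and n_components=2, both chosen here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
x = iris.data

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
z = pca.fit_transform(x)

print("Original shape:", x.shape)
print("Reduced shape: ", z.shape)
# Fraction of the total variance captured by each component.
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Since the components are mixtures of all input columns, this speeds up downstream training but does not tell you which original features to keep.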