# Feature Selection using <font color=red>Wrapper Methods</font>

<img src='Data/Reducing Complexity.png' width=500/>

<font color=red>__Wrapper Methods__</font>
- Recursive Feature Elimination (RFE)
- Forward Selection
- Backward Selection

- In __wrapper methods__, we build models on several subsets of features and pick the __best subset__. Unlike filter methods, which score features without training a model, wrapper methods judge each subset by the performance of a model trained on it, which makes them more expensive to run. Examples are __forward and backward stepwise regression__. __Each candidate__ model has a __different subset__ of features, but all candidates use the __same model__ type (for example, decision trees). The __way features are selected varies across wrapper methods__: features may be __added or dropped__ to see whether the __model improves__.

We generate models with subsets of features and choose the best subset based on the models' performance.
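The idea can be sketched in its most literal form: score every candidate subset with the same model type and keep the best one. The snippet below is only an illustration under assumptions: it uses synthetic data from `make_classification` rather than the diabetes data, the helper `score_subset` is a hypothetical name, and it tries every pair of columns, which is only feasible for a handful of features.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

def score_subset(cols):
    """Mean 4-fold CV accuracy of one model type on the given columns."""
    model = LogisticRegression(solver='liblinear')
    return cross_val_score(model, X_demo[:, list(cols)], y_demo,
                           cv=4, scoring='accuracy').mean()

# Every candidate uses the same model type but a different feature subset.
candidates = list(combinations(range(X_demo.shape[1]), 2))
scores = {cols: score_subset(cols) for cols in candidates}
best_subset = max(scores, key=scores.get)

print('Best subset =', best_subset)
print('CV accuracy =', round(scores[best_subset], 3))
```

The number of subsets grows quickly with the number of features, which is why the methods in this notebook search greedily instead of exhaustively.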

In [1]:
import numpy as np
import pandas as pd

In [2]:
diabetes_data = pd.read_csv('Data/diabetes.csv')
diabetes_data.head(10)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [3]:
X = diabetes_data.drop('Outcome', axis=1)
Y = diabetes_data['Outcome']

<font color=red>__Recursive Feature Elimination__</font>
- RFE __starts with the entire__ set of __features__, fits a __model__, and __eliminates the least important__ feature at each step (judged by the model's coefficients or feature importances), refitting on the remaining features until the requested number is left
- The choice of __model__ is __ours__
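The elimination loop described above can be written by hand in a few lines. This is a simplified sketch, not scikit-learn's implementation: it assumes a binary problem, uses the absolute value of the logistic regression coefficients as the importance measure, and runs on synthetic data. The function name `manual_rfe` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

def manual_rfe(X, y, n_keep):
    """Greedy elimination: refit, drop the weakest feature, repeat."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # column indices in order of removal
    while len(remaining) > n_keep:
        model = LogisticRegression(solver='liblinear').fit(X[:, remaining], y)
        # For a binary problem coef_ has shape (1, n_remaining_features).
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

kept, dropped = manual_rfe(X_demo, y_demo, n_keep=3)
print('Kept columns    =', kept)
print('Dropped (order) =', dropped)
```

Coefficient magnitudes depend on the scale of each feature, so standardizing the inputs before this kind of ranking is often worth considering.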

In [4]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [5]:
# We ask the RFE to pick best 4 features

model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=4)

In [6]:
fit = rfe.fit(X, Y)

In [7]:
print('Num Features = ', fit.n_features_)
print('Selected Features = ', fit.support_)
print('Feature Ranking = ', fit.ranking_)

Num Features =  4
Selected Features =  [ True  True False False False  True  True False]
Feature Ranking =  [1 1 2 4 5 1 1 3]


In [8]:
feature_rank = pd.DataFrame({'columns': X.columns,
                             'ranking': fit.ranking_,
                             'selected': fit.support_ })
feature_rank

,columns,ranking,selected
0,Pregnancies,1,True
1,Glucose,1,True
2,BloodPressure,2,False
3,SkinThickness,4,False
4,Insulin,5,False
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True
7,Age,3,False


In [9]:
recursive_feature_names = feature_rank.loc[feature_rank['selected']]
recursive_feature_names

,columns,ranking,selected
0,Pregnancies,1,True
1,Glucose,1,True
5,BMI,1,True
6,DiabetesPedigreeFunction,1,True


In [10]:
X[recursive_feature_names['columns'].values].head()

,Pregnancies,Glucose,BMI,DiabetesPedigreeFunction
0,6,148,33.6,0.627
1,1,85,26.6,0.351
2,8,183,23.3,0.672
3,1,89,28.1,0.167
4,0,137,43.1,2.288


In [11]:
recursive_features = X[recursive_feature_names['columns'].values]
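The selected columns can also be taken straight from the fitted selector, without building an intermediate ranking table. The sketch below assumes synthetic data with made-up column names (`f0` to `f5`) and a selector named `rfe_demo`; it relies only on the `support_` mask that a fitted `RFE` exposes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_classification(n_samples=200, n_features=6,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

rfe_demo = RFE(LogisticRegression(solver='liblinear'),
               n_features_to_select=3).fit(X_demo, y_demo)

# support_ is a boolean mask aligned with the columns of X_demo.
selected_cols = X_demo.columns[rfe_demo.support_]
X_selected = X_demo.loc[:, rfe_demo.support_]

print(list(selected_cols))
print(X_selected.shape)
```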

<font color=red>__Sequential Feature Selector__ (Forward and Backward)</font>
- It adds or removes one feature at a time, based on the classifier's cross-validated performance, until the specified number of features is reached
- Both forward selection and backward selection are supported
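This notebook uses the mlxtend implementation, but scikit-learn (version 0.24 and later) ships a selector with the same class name and a slightly different interface: `n_features_to_select` and `direction` instead of `k_features` and `forward`. As a rough sketch of both directions, assuming synthetic data and a hypothetical helper `select`:

```python
from sklearn.datasets import make_classification
# Aliased to avoid clashing with mlxtend's class of the same name.
from sklearn.feature_selection import SequentialFeatureSelector as SklearnSFS
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

def select(direction):
    """Return the column indices chosen by forward or backward selection."""
    sfs = SklearnSFS(LogisticRegression(solver='liblinear'),
                     n_features_to_select=3,
                     direction=direction,
                     scoring='accuracy',
                     cv=4)
    sfs.fit(X_demo, y_demo)
    return sfs.get_support(indices=True)

forward_idx = select('forward')
backward_idx = select('backward')
print('Forward  =', list(forward_idx))
print('Backward =', list(backward_idx))
```

The two directions are not guaranteed to agree, since each one makes its greedy choices from a different starting point.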

In [12]:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

<font color=red>__Forward Selector__</font>
- One feature at a time is __added__ to the set until we reach the desired number of features
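To make the greedy step concrete, here is a hand-written forward selection loop. It is a simplified sketch rather than what mlxtend does internally: the data is synthetic, the model is logistic regression, and the names `cv_accuracy` and `forward_select` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

def cv_accuracy(X, y, cols):
    """Mean 4-fold CV accuracy using only the given columns."""
    model = LogisticRegression(solver='liblinear')
    return cross_val_score(model, X[:, cols], y,
                           cv=4, scoring='accuracy').mean()

def forward_select(X, y, k):
    """Start empty; at each step add the column that helps the score most."""
    selected = []
    while len(selected) < k:
        candidates = [c for c in range(X.shape[1]) if c not in selected]
        best = max(candidates,
                   key=lambda c: cv_accuracy(X, y, selected + [c]))
        selected.append(best)
    return selected

chosen = forward_select(X_demo, y_demo, k=3)
print('Chosen columns (in order added) =', chosen)
```

A feature added early is never reconsidered, so the result is a greedy choice and not necessarily the best subset of that size.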

In [13]:
# Forward selection: start from an empty set and add features until 4 are selected
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_estimators=10),
                                             k_features=4,
                                             forward=True,
                                             scoring='accuracy',
                                             cv=4)
features = feature_selector.fit(np.array(X), Y)

In [14]:
forward_elimination_feature_names = list(X.columns[list(features.k_feature_idx_)])
forward_elimination_feature_names

['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction']

In [15]:
forward_elimination_features = X[forward_elimination_feature_names]

<font color=red>__Backward Selector__</font>
- One feature at a time is __removed__ from the set until we reach the desired number of features

In [16]:
# Backward selection: start from all features and remove them until 4 remain
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_estimators=10),
                                             k_features=4,
                                             forward=False,
                                             scoring='accuracy',
                                             cv=4)
features = feature_selector.fit(np.array(X), Y)

In [17]:
backward_elimination_feature_names = list(X.columns[list(features.k_feature_idx_)])
backward_elimination_feature_names

['Glucose', 'BloodPressure', 'Insulin', 'BMI']

In [18]:
backward_elimination_features = X[backward_elimination_feature_names]

Let us train a model on each of the feature sets obtained in the steps above.

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [20]:
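def build_model(X, Y, test_frac):
    """Fit logistic regression on a random train/test split and print test accuracy."""
    # No random_state is set, so each call draws a different split
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LogisticRegression(solver='liblinear').fit(x_train, y_train)
    y_pred = model.predict(x_test)

    print('Test score = ', accuracy_score(y_test, y_pred))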

In [21]:
# Baseline: train the model on all of the original features

build_model(X, Y, 0.2)

Test score =  0.7402597402597403


In [22]:
build_model(recursive_features, Y, 0.2)

Test score =  0.7662337662337663


In [23]:
build_model(forward_elimination_features, Y, 0.2)

Test score =  0.7792207792207793


In [24]:
build_model(backward_elimination_features, Y, 0.2)

Test score =  0.8051948051948052
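Each score above comes from a single random split, and the features were selected using all of the rows, so small differences between the four numbers should be read with caution. One possible way to get a steadier comparison is to cross-validate and to place the selector inside a `Pipeline`, so that selection is repeated on each training fold. The sketch below assumes synthetic data and illustrative step names (`select`, `clf`); it is not a rerun of the results above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_redundant=0,
                                     random_state=0)

estimators = {
    'all features': LogisticRegression(solver='liblinear'),
    'RFE, 4 features': Pipeline([
        # Selection happens inside each training fold only.
        ('select', RFE(LogisticRegression(solver='liblinear'),
                       n_features_to_select=4)),
        ('clf', LogisticRegression(solver='liblinear')),
    ]),
}

results = {}
for name, estimator in estimators.items():
    results[name] = cross_val_score(estimator, X_demo, y_demo,
                                    cv=5, scoring='accuracy')
    print(f'{name}: mean = {results[name].mean():.3f}, '
          f'std = {results[name].std():.3f}')
```

Reporting the spread across folds alongside the mean gives a sense of whether a gap between two feature sets is larger than the fold-to-fold variation.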
