### Objective: Feature Selection 2
* Introduction to Wrapper Methods
* Advantages of Wrapper Methods
* Process of Wrapper Methods

<hr>

### Wrapper Methods
* Using an iterative process, we look for the subset of features that gives a particular ML algorithm its best accuracy (or other chosen metric).
* The process depends on the ML algorithm being used.
* Changing the algorithm can change which features are selected.

### Advantages
* Features are selected based on the model's actual performance, so the selection is tailored to the chosen ML algorithm
* It also accounts for interactions among features

### Process of applying Wrapper Methods
* <b>Search</b> for a subset of features
* <b>Build</b> a model using the subset
* <b>Evaluate</b> the trained model with the chosen metrics
* <b>Iterate</b>: repeat until the score stops improving or another stopping criterion is met
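
The four steps above can be sketched as a small loop. This is a minimal illustration, not the notebook's own code: it assumes scikit-learn is installed, uses a synthetic dataset from `make_classification` in place of the wine data, and the candidate subsets are hand-picked purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 rows, 6 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Search: hand-picked candidate subsets of column indices (hypothetical choices).
candidate_subsets = [(0,), (0, 1), (0, 1, 2), (2, 3, 4), (0, 1, 2, 3, 4, 5)]

results = {}
for subset in candidate_subsets:                     # Iterate over candidates
    model = DecisionTreeClassifier(random_state=0)   # Build a fresh model
    scores = cross_val_score(model, X[:, list(subset)], y,
                             cv=5, scoring="accuracy")  # Evaluate with 5-fold CV
    results[subset] = scores.mean()

# Keep the subset with the highest mean cross-validated accuracy.
best_subset = max(results, key=results.get)
print(best_subset, round(results[best_subset], 3))
```

In a real wrapper method the candidate list is not fixed in advance; the search strategy generates the next subsets from the scores seen so far.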

### Searching subset of features
* Start Small - Start with one feature and keep adding. Stop when the model stops improving.
* Start Big - Start with all features and keep removing.
* Randomize or try all possible combinations
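
The first two strategies are commonly called forward selection and backward elimination. The sketch below shows both as plain Python, under the assumption that a scoring function is supplied by the caller. The names `forward_selection`, `backward_elimination`, and `toy_score` are hypothetical, and the toy scorer stands in for "train a model and measure accuracy".

```python
def forward_selection(features, score_fn):
    """Start small: add the single best feature each round; stop when nothing improves."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        trials = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(trials)
        if score <= best_score:
            break  # no candidate improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


def backward_elimination(features, score_fn):
    """Start big: drop the feature whose removal scores best; stop when every removal hurts."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one feature.
        trials = [(score_fn([g for g in selected if g != f]), f) for f in selected]
        score, feature = max(trials)
        if score < best_score:
            break  # every possible removal lowers the score
        selected.remove(feature)
        best_score = score
    return selected, best_score


# Toy scorer (hypothetical): rewards 'a' and 'b', penalises each extra feature.
def toy_score(subset):
    useful = {"a": 0.5, "b": 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


print(forward_selection(["a", "b", "c", "d"], toy_score))
print(backward_elimination(["a", "b", "c", "d"], toy_score))
```

Both searches are greedy: they never revisit an earlier decision, so they can miss the best subset that an exhaustive search would find.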

In [1]:
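import pandas as pd

# White-wine quality data; the CSV uses ';' as the separator
df = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/winequality-white.csv', sep=';')

def f(r):
    # Bin the raw quality rating into three classes: 1 (<= 3), 2 (4 to 6), 3 (7 and above)
    if r <= 3:
        return 1
    elif r <= 6:
        return 2
    else:
        return 3

df['quality'] = df['quality'].map(f)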

In [2]:
features = list(df.columns.values)

In [4]:
features.remove('quality')

In [5]:
features

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

In [6]:
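from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Add one feature at a time (in column order) and score a fresh tree on each growing subset.
# Note: train_test_split draws a new random split on every iteration, so scores vary between runs.
try_features = []
for feature in features:
    try_features.append(feature)
    dt = DecisionTreeClassifier()
    trainX, testX, trainY, testY = train_test_split(df[try_features], df.quality)
    dt.fit(trainX, trainY)
    print(dt.score(testX, testY))
# After the loop, trainX/trainY hold all 11 features; the cells below reuse them.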

0.763265306122449
0.7681632653061224
0.7624489795918368
0.7706122448979592
0.8081632653061225
0.8016326530612244
0.8
0.8195918367346938
0.8130612244897959
0.8261224489795919
0.8244897959183674
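
Each score in the loop above comes from a different random train/test split, so a difference of a percent or two between consecutive lines may reflect the split rather than the added feature. A minimal sketch of that effect, assuming scikit-learn and a synthetic dataset (not the wine data): the same feature set and the same model settings are scored on ten different splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the wine-quality dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

scores = []
for seed in range(10):
    # Same features and same model settings; only the split changes.
    trainX, testX, trainY, testY = train_test_split(X, y, random_state=seed)
    dt = DecisionTreeClassifier(random_state=0).fit(trainX, trainY)
    scores.append(dt.score(testX, testY))

print(f"min={min(scores):.3f} max={max(scores):.3f} std={np.std(scores):.3f}")
```

Fixing `random_state` in `train_test_split`, or scoring with cross-validation, makes the feature-by-feature comparison less sensitive to this noise.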


### Disadvantage of this process
* Computationally expensive: a model has to be trained and evaluated for every candidate subset
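
To give a sense of scale, the number of candidate subsets can be counted directly. The sketch below assumes 11 candidate features (as in the wine data above) and 5-fold cross-validation, and counts greedy forward selection as running all the way to the full feature set.

```python
from math import comb

n_features = 11   # number of candidate features in the wine data
cv_folds = 5      # assumed number of cross-validation folds

# Exhaustive search: every non-empty subset of the features.
exhaustive_subsets = sum(comb(n_features, k) for k in range(1, n_features + 1))

# Greedy forward selection: n candidates in round 1, n-1 in round 2, and so on.
forward_subsets = sum(range(1, n_features + 1))

print(exhaustive_subsets, exhaustive_subsets * cv_folds)  # 2047 subsets, 10235 model fits
print(forward_subsets, forward_subsets * cv_folds)        # 66 subsets, 330 model fits
```

The exhaustive count doubles with every extra feature, which is why greedy searches are usually preferred beyond a few dozen features.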

### Notes
* mlxtend is an additional package that works with scikit-learn estimators and makes feature selection easy: https://pypi.org/project/mlxtend/

In [7]:
from mlxtend.feature_selection import SequentialFeatureSelector

In [9]:
sfs = SequentialFeatureSelector(k_features=5, estimator=DecisionTreeClassifier())

In [10]:
sfs.fit(trainX, trainY)

SequentialFeatureSelector(clone_estimator=True, cv=5,
                          estimator=DecisionTreeClassifier(class_weight=None,
                                                           criterion='gini',
                                                           max_depth=None,
                                                           max_features=None,
                                                           max_leaf_nodes=None,
                                                           min_impurity_decrease=0.0,
                                                           min_impurity_split=None,
                                                           min_samples_leaf=1,
                                                           min_samples_split=2,
                                                           min_weight_fraction_leaf=0.0,
                                                           presort=False,
                                                           random_

In [11]:
sfs.k_feature_names_

('fixed acidity', 'volatile acidity', 'residual sugar', 'sulphates', 'alcohol')

In [12]:
sfs.k_score_

0.8053352632961726

In [14]:
# Refit the selector for each target subset size and compare the cross-validated scores
for k in range(4, 10):
    sfs = SequentialFeatureSelector(k_features=k, estimator=DecisionTreeClassifier())
    sfs.fit(trainX, trainY)
    print(k, sfs.k_score_, sfs.k_feature_names_)

4 0.8056084883307666 ('volatile acidity', 'chlorides', 'sulphates', 'alcohol')
5 0.8026126889566394 ('fixed acidity', 'volatile acidity', 'residual sugar', 'sulphates', 'alcohol')
6 0.8056147815126525 ('citric acid', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'pH', 'alcohol')
7 0.8154059006614137 ('fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'pH', 'sulphates', 'alcohol')
8 0.8064151892246059 ('fixed acidity', 'volatile acidity', 'residual sugar', 'chlorides', 'total sulfur dioxide', 'density', 'pH', 'alcohol')
9 0.80722780653548 ('fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'pH', 'sulphates', 'alcohol')


In [15]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector

In [16]:
efs = ExhaustiveFeatureSelector(estimator=DecisionTreeClassifier(), min_features=4, max_features=10, scoring='accuracy')

In [18]:
efs.fit(trainX, trainY)

Features: 1815/1815

ExhaustiveFeatureSelector(clone_estimator=True, cv=5,
                          estimator=DecisionTreeClassifier(class_weight=None,
                                                           criterion='gini',
                                                           max_depth=None,
                                                           max_features=None,
                                                           max_leaf_nodes=None,
                                                           min_impurity_decrease=0.0,
                                                           min_impurity_split=None,
                                                           min_samples_leaf=1,
                                                           min_samples_split=2,
                                                           min_weight_fraction_leaf=0.0,
                                                           presort=False,
                                                           random_
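
The `Features: 1815/1815` progress line above can be checked by hand. With the 11 features in `trainX` and subset sizes from `min_features=4` to `max_features=10`, the selector evaluates every combination of each size:

```python
from math import comb

n_features = 11  # columns in trainX after the last iteration of the earlier loop
subset_count = sum(comb(n_features, k) for k in range(4, 11))  # sizes 4 through 10
print(subset_count)  # 1815
```

Each of those subsets is scored with 5-fold cross-validation (`cv=5` in the output above), so the fit trains 1815 × 5 = 9075 decision trees.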

In [19]:
efs.best_feature_names_

('fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'density',
 'pH')

In [20]:
efs.best_score_

0.8184024394617462

In [24]:
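# Note: 'roc_auc' is defined for binary targets; with the three quality classes here,
# a multiclass variant such as 'roc_auc_ovr' (available in recent scikit-learn versions) may be needed.
efs = ExhaustiveFeatureSelector(estimator=DecisionTreeClassifier(), min_features=4, max_features=10, scoring='roc_auc')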