# Feature Selection Tutorial Solutions
**Q1**   
In Weka, apply filter-based feature selection with Information Gain to identify the 3 most discriminating and 3 least discriminating features in the Wine dataset in the ARFF file provided.  
Based on these results, assess the 10-fold cross-validation classification accuracy of a 1-Nearest Neighbour classifier with: 
1. only the 3 most discriminating features included
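2. only the 3 least discriminating features included

The solution below uses scikit-learn rather than Weka: `mutual_info_classif` supplies the Information Gain scores and the data are read from `wine.csv`.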


In [1]:
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

In [2]:
wine_DF = pd.read_csv('wine.csv')
print(wine_DF.shape)
wine_DF.head()

(178, 14)


| | Alcohol | Malic_acid | Ash | Alcalinity_of_ash | Magnesium | Total_phenols | Flavanoids | Nonflavanoid_phenols | Proanthocyanins | Color_intensity | Hue | OD280/OD315_of_diluted_wines | Proline | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | Type1 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 | Type1 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 | Type1 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 | Type1 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | Type1 |


In [3]:
y = wine_DF.pop('class').values
X = wine_DF.values
wine_DF.shape

(178, 13)

In [4]:
i_scores = mutual_info_classif(X,y)
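
For intuition about what an Information Gain score measures, the sketch below computes it by hand for a discrete feature using only the standard library. This is a simplification: `mutual_info_classif` works on continuous features with a nearest-neighbour based estimator, so its numbers will not match this formula exactly. The toy labels and feature values are hypothetical and not taken from the Wine data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the weighted class entropy within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: six samples, two classes
labels = ["A", "A", "A", "B", "B", "B"]
informative = ["hi", "hi", "hi", "lo", "lo", "lo"]    # splits the classes perfectly
uninformative = ["hi", "lo", "hi", "lo", "hi", "lo"]  # nearly independent of the class

ig_good = information_gain(informative, labels)
ig_poor = information_gain(uninformative, labels)
print(f"informative: {ig_good:.3f}  uninformative: {ig_poor:.3f}")
```

A feature that separates the classes perfectly recovers the full class entropy, while a feature that is almost independent of the class scores close to zero.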

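`mutual_info_classif` estimates the mutual information between each feature and the class, which serves as the I-Gain score here. Put the scores in a dataframe and sort in descending order.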

In [5]:
FS_DF = pd.DataFrame(i_scores,index = wine_DF.columns, columns =['I-Gain'])
FS_DF.sort_values(by=['I-Gain'],ascending=False,inplace=True)
FS_DF

| | I-Gain |
|---|---|
| Flavanoids | 0.670753 |
| Proline | 0.575956 |
| Color_intensity | 0.551111 |
| OD280/OD315_of_diluted_wines | 0.507165 |
| Alcohol | 0.46689 |
| Hue | 0.454449 |
| Total_phenols | 0.41323 |
| Malic_acid | 0.289495 |
| Proanthocyanins | 0.284944 |
| Alcalinity_of_ash | 0.236179 |


Generate top 3 and bottom 3 dataframes and produce the corresponding X (numpy) arrays.

In [6]:
top3_DF = wine_DF[FS_DF.index[:3]]
bottom3_DF = wine_DF[FS_DF.index[-3:]]
X_top3 = top3_DF.values
X_bottom3 = bottom3_DF.values
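
The slicing above relies on the scores already being sorted in descending order, so the first three index entries are the best features and the last three the worst. The same idea without pandas, as a minimal sketch on a hypothetical score dictionary (feature names and values are made up):

```python
# Hypothetical feature scores, for illustration only
scores = {"f1": 0.62, "f2": 0.05, "f3": 0.41, "f4": 0.18, "f5": 0.77, "f6": 0.30}

# Rank feature names from highest to lowest score
ranked = sorted(scores, key=scores.get, reverse=True)

top3 = ranked[:3]      # three highest-scoring features
bottom3 = ranked[-3:]  # three lowest-scoring features
print(top3, bottom3)
```

With fewer than six features the two slices would overlap, which is worth checking before comparing classifiers trained on them.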

In [7]:
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors=1)

In [8]:
X_bottom3.shape

(178, 3)

In [9]:
from sklearn.model_selection import cross_val_score
knn_bottom = cross_val_score(kNN, X_bottom3, y, cv=10)
print("10x CV Accuracy Bottom 3 features: {0:.2f}".format(knn_bottom.mean())) 

10x CV Accuracy Bottom 3 features: 0.49


In [10]:
knn_top = cross_val_score(kNN, X_top3, y, cv=10)
print("10x CV Accuracy Top 3 features: {0:.2f}".format(knn_top.mean())) 

10x CV Accuracy Top 3 features: 0.78
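
Both scores come from the same classifier; only the feature subset differs. To make explicit what a cross-validated 1-NN accuracy involves, here is a minimal standard-library sketch. It assumes simple interleaved folds rather than the stratified folds scikit-learn uses for classifiers, and the toy data and function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[nearest]

def kfold_accuracy(X, y, k):
    """Mean accuracy over k folds; fold f holds out samples f, f+k, f+2k, ..."""
    fold_scores = []
    for f in range(k):
        test_idx = list(range(f, len(X), k))
        train_idx = [i for i in range(len(X)) if i % k != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in test_idx)
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / k

# Hypothetical toy data: two well-separated clusters
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.3, 5.2)]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

acc = kfold_accuracy(X, y, k=4)
print(f"4-fold CV accuracy: {acc:.2f}")
```

Because 1-NN depends entirely on distances, features that do not separate the classes pull neighbours from the wrong class, which is consistent with the lower accuracy on the bottom 3 features.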


---
**Q2**  
Using **mlxtend**, identify informative feature subsets by applying wrapper-based feature selection to the Wine dataset using a 3-Nearest Neighbour classifier and the following search strategies: 
- forward sequential search 
- backward elimination search  
  
Which features were selected by both search strategies?



In [11]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
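# Q2 specifies a 3-Nearest Neighbour classifier; set n_neighbors=3 to match the question.
# The outputs recorded below were generated with n_neighbors=4.
knn = KNeighborsClassifier(n_neighbors=4)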

In [12]:
sfs_forward = SFS(knn, 
                  k_features=7, 
                  forward=True, 
                  floating=False, 
                  verbose=1,
                  scoring='accuracy',
                  cv=10, n_jobs = -1)

sfs_forward = sfs_forward.fit(X, y, 
                              custom_feature_names=wine_DF.columns)


In [13]:
print(sfs_forward.k_feature_names_)

('Alcohol', 'Ash', 'Total_phenols', 'Flavanoids', 'Nonflavanoid_phenols', 'Proanthocyanins', 'Color_intensity')
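
Forward sequential search is a greedy procedure: start with no features and repeatedly add the single feature that gives the best score for the enlarged subset. The sketch below shows that loop in isolation. The function name and the scoring function are hypothetical; in the cells above the score is the 10-fold cross-validated accuracy of the kNN classifier, and mlxtend's implementation has more options than this.

```python
def forward_select(n_features, score, k):
    """Greedily add the feature that maximises score(subset) until k are chosen."""
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k:
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature contributions standing in for cross-validated accuracy
gains = {0: 0.30, 1: -0.05, 2: 0.25, 3: 0.10, 4: -0.02}

def toy_score(subset):
    """Made-up additive score: a baseline plus each feature's contribution."""
    return 0.3 + sum(gains[j] for j in subset)

chosen = forward_select(n_features=5, score=toy_score, k=3)
print(chosen)
```

Each step evaluates one candidate subset per remaining feature, so the number of evaluations per step shrinks as features are added.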


In [14]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

fig1 = plot_sfs(sfs_forward.get_metric_dict(), 
                ylabel='Accuracy',
                kind='std_dev')

plt.ylim([0.5, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
print(sfs_forward.k_feature_names_)

<Figure size 640x480 with 1 Axes>

('Alcohol', 'Ash', 'Total_phenols', 'Flavanoids', 'Nonflavanoid_phenols', 'Proanthocyanins', 'Color_intensity')


In the plot, accuracy peaks at five features and starts to drop as further features are added.  
So rerun forward selection with `k_features` set to five.
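
The best subset size can also be read off programmatically instead of from the plot. `get_metric_dict()` returns a dictionary keyed by subset size whose entries include an `avg_score`; the sketch below uses a hypothetical dictionary of that shape (the scores and feature names are made up) and picks the size with the highest mean score.

```python
# Hypothetical data shaped like the result of get_metric_dict():
# keys are subset sizes, values hold at least 'avg_score' and 'feature_names'.
metrics = {
    1: {"avg_score": 0.70, "feature_names": ("f3",)},
    2: {"avg_score": 0.82, "feature_names": ("f3", "f0")},
    3: {"avg_score": 0.88, "feature_names": ("f3", "f0", "f5")},
    4: {"avg_score": 0.87, "feature_names": ("f3", "f0", "f5", "f1")},
}

# Subset size with the highest mean cross-validated score
best_k = max(metrics, key=lambda k: metrics[k]["avg_score"])
print(best_k, metrics[best_k]["feature_names"])
```

With a fitted selector the same expression could be applied to `sfs_forward.get_metric_dict()`. mlxtend's documentation also describes `k_features='best'` for this purpose, which is worth checking against the installed version.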

In [15]:
sfs_forward5 = SFS(knn, 
                  k_features=5, 
                  forward=True, 
                  floating=False, 
                  verbose=1,
                  scoring='accuracy',
                  cv=10, n_jobs = -1)

sfs_forward5 = sfs_forward5.fit(X, y, 
                              custom_feature_names=wine_DF.columns)


Run backward elimination, also stopping at 5 features.
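
Backward elimination mirrors the forward search: start with every feature and repeatedly drop the one whose removal leaves the best-scoring subset. A minimal sketch of that loop follows, again with a hypothetical function name and a made-up scoring function in place of cross-validated accuracy.

```python
def backward_eliminate(n_features, score, k):
    """Greedily remove features until only k remain."""
    selected = list(range(n_features))
    while len(selected) > k:
        # Drop the feature whose removal leaves the highest-scoring subset
        to_drop = max(selected, key=lambda j: score([f for f in selected if f != j]))
        selected.remove(to_drop)
    return selected

# Hypothetical per-feature contributions standing in for cross-validated accuracy
gains = {0: 0.30, 1: -0.05, 2: 0.25, 3: 0.10, 4: -0.02}

def toy_score(subset):
    """Made-up additive score: a baseline plus each feature's contribution."""
    return 0.3 + sum(gains[j] for j in subset)

kept = backward_eliminate(n_features=5, score=toy_score, k=3)
print(kept)
```

With an additive score like this one, both directions reach the same subset. With a real classifier the features interact, so forward and backward searches can end at different subsets of the same size.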

In [16]:
sfs_backward = SFS(knn, 
                  k_features=5, 
                  forward=False, 
                  floating=False, 
                  verbose=1,
                  scoring='accuracy',
                  cv=10, n_jobs = -1)

sfs_backward = sfs_backward.fit(X, y, 
                              custom_feature_names=wine_DF.columns)


In [17]:
b5_feat = sfs_backward.k_feature_names_
f5_feat = sfs_forward5.k_feature_names_
print(b5_feat)
print(f5_feat)
print((set(b5_feat).intersection(set(f5_feat))))

('Alcohol', 'Flavanoids', 'Proanthocyanins', 'Color_intensity', 'OD280/OD315_of_diluted_wines')
('Alcohol', 'Total_phenols', 'Flavanoids', 'Proanthocyanins', 'Color_intensity')
{'Color_intensity', 'Alcohol', 'Proanthocyanins', 'Flavanoids'}
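
The two strategies agree on four of the five features and differ in one each. Since the print order of a Python set is arbitrary, a small sketch that keeps a stable order and also reports the features unique to each search may be easier to read. It reuses the two tuples printed above; the variable names are illustrative.

```python
backward = ('Alcohol', 'Flavanoids', 'Proanthocyanins', 'Color_intensity',
            'OD280/OD315_of_diluted_wines')
forward = ('Alcohol', 'Total_phenols', 'Flavanoids', 'Proanthocyanins',
           'Color_intensity')

# Keep the order of the forward result so the output is reproducible
common = [f for f in forward if f in backward]
only_forward = [f for f in forward if f not in backward]
only_backward = [f for f in backward if f not in forward]

print("Common:       ", common)
print("Forward only: ", only_forward)
print("Backward only:", only_backward)
```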
