## Ranking and selecting features

In this example, we'll demonstrate some of scikit-learn's scoring functions for ranking features by importance. We'll reuse our running example: the Adult dataset from the first exercise.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.pipeline import make_pipeline

train_data = pd.read_csv('Lec2_adult_train.csv')
n_cols = len(train_data.columns)
Xtrain_dicts = train_data.iloc[:, :n_cols-1].to_dict('records')
Ytrain = train_data.iloc[:, n_cols-1]

test_data = pd.read_csv('Lec2_adult_test.csv')
Xtest_dicts = test_data.iloc[:, :n_cols-1].to_dict('records')
Ytest = test_data.iloc[:, n_cols-1]

dv = DictVectorizer()
dv.fit(Xtrain_dicts)

X_vec = dv.transform(Xtrain_dicts)

dv.get_feature_names_out()

array(['age', 'capital-gain', 'capital-loss', 'education-num',
       'education=10th', 'education=11th', 'education=12th',
       'education=1st-4th', 'education=5th-6th', 'education=7th-8th',
       'education=9th', 'education=Assoc-acdm', 'education=Assoc-voc',
       'education=Bachelors', 'education=Doctorate', 'education=HS-grad',
       'education=Masters', 'education=Preschool',
       'education=Prof-school', 'education=Some-college',
       'hours-per-week', 'marital-status=Divorced',
       'marital-status=Married-AF-spouse',
       'marital-status=Married-civ-spouse',
       'marital-status=Married-spouse-absent',
       'marital-status=Never-married', 'marital-status=Separated',
       'marital-status=Widowed', 'native-country=?',
       'native-country=Cambodia', 'native-country=Canada',
       'native-country=China', 'native-country=Columbia',
       'native-country=Cuba', 'native-country=Dominican-Republic',
       'native-country=Ecuador', 'native-country=El-Salvador',

In [5]:
import pandas as pd

train_data = pd.read_csv('Lec2_adult_train.csv')

n_cols = len(train_data.columns)
Xtrain_dicts = train_data.iloc[:, :n_cols-1].to_dict('records')
Ytrain = train_data.iloc[:, n_cols-1]

test_data = pd.read_csv('Lec2_adult_test.csv')
Xtest_dicts = test_data.iloc[:, :n_cols-1].to_dict('records')
Ytest = test_data.iloc[:, n_cols-1]

As you might recall, the instances in this dataset consist of several features describing each individual.

In [6]:
Xtrain_dicts[0]

{'age': 27,
 'workclass': 'Private',
 'education': 'Some-college',
 'education-num': 10,
 'marital-status': 'Divorced',
 'occupation': 'Adm-clerical',
 'relationship': 'Unmarried',
 'race': 'White',
 'sex': 'Female',
 'capital-gain': 0,
 'capital-loss': 0,
 'hours-per-week': 44,
 'native-country': 'United-States'}

We first convert the training set into numerical vectors.

In [7]:
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dv.fit(Xtrain_dicts)

X_vec = dv.transform(Xtrain_dicts)

The first scoring function we'll investigate is [mutual information](https://en.wikipedia.org/wiki/Mutual_information). [Here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) is scikit-learn's description of how this scoring function works.

(To see the formula used to compute the mutual information score, see the [description](https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html) in the book *Introduction to Information Retrieval* by Manning, Raghavan, and Schütze.)
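
As a rough illustration of the quantity involved, the following sketch computes the mutual information between two discrete variables directly from their joint counts. The helper `mutual_information` and the toy data are hypothetical, and this is the textbook definition rather than scikit-learn's exact estimator (which, for example, treats continuous features differently).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Sum of p(x, y) * log( p(x, y) / (p(x) * p(y)) ) over observed pairs.
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Made-up toy data: a binary feature and a binary label.
feature = [1, 1, 1, 0, 0, 0, 1, 0]
label = [1, 1, 1, 0, 0, 0, 0, 1]

mi_informative = mutual_information(feature, label)  # feature mostly agrees with label
mi_identical = mutual_information(label, label)      # upper bound: entropy of the label
mi_constant = mutual_information([0] * 8, label)     # a constant feature tells us nothing

print(mi_informative, mi_identical, mi_constant)
```

A feature that carries no information about the label scores zero, and the score can never exceed the entropy of the label itself.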

We apply the scoring function to all the features and then print the 10 highest-scoring ones. Please refer back to the perceptron example in the previous lecture for an explanation of the step where we sort the features by importance.

In [8]:
from sklearn.feature_selection import mutual_info_classif

feature_scores = mutual_info_classif(X_vec, Ytrain)

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

marital-status=Married-civ-spouse 0.10543223425355985
capital-gain 0.083382372123436
relationship=Husband 0.08087684110742101
age 0.0687725396789363
education-num 0.064872227626807
marital-status=Never-married 0.06195072410418583
hours-per-week 0.0422833222022355
relationship=Own-child 0.03821610420273137
capital-loss 0.03698048451035268
sex=Male 0.025765242400373284


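The second scoring function uses the so-called $F$-statistic from an [ANOVA test](https://en.wikipedia.org/wiki/Analysis_of_variance).

As the output below shows, the top-10 list produced by this scorer overlaps with the previous list, but the two are not identical. Note also that `sex=Female` and `sex=Male` receive essentially the same score, since each of these indicators is the complement of the other.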

In [9]:
from sklearn.feature_selection import f_classif

# f_classif returns a pair (F-statistics, p-values); we only need the F-statistics.
feature_scores, _ = f_classif(X_vec, Ytrain)

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

marital-status=Married-civ-spouse 8025.8420615949835
relationship=Husband 6240.018276214241
education-num 4120.095779707474
marital-status=Never-married 3674.2001465697413
age 1886.7073137161203
hours-per-week 1813.3862822161334
relationship=Own-child 1794.1574893573925
capital-gain 1709.150063743795
sex=Female 1593.1079074467164
sex=Male 1593.1079074467073
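
To give an idea of what the $F$-statistic measures, here is a small sketch of the standard one-way ANOVA formula: the variance of the feature between the classes divided by its variance within the classes. The helper `anova_f` and the toy data are hypothetical; scikit-learn's implementation may differ in its details.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a numerical feature grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = 0.0  # spread of the class means around the overall mean
    ss_within = 0.0   # spread of the values around their own class mean
    for group in groups.values():
        mean = sum(group) / len(group)
        ss_between += len(group) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in group)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [0, 0, 0, 1, 1, 1]

# A made-up feature whose values are clearly separated by class gets a high score.
f_separated = anova_f([1, 2, 3, 11, 12, 13], labels)

# A binary indicator and its complement get the same score.
indicator = [1, 0, 0, 1, 1, 0]
complement = [1 - v for v in indicator]
f_indicator = anova_f(indicator, labels)
f_complement = anova_f(complement, labels)

print(f_separated, f_indicator, f_complement)
```

The last two values illustrate why a pair of complementary indicators, such as the two `sex` features, ends up with matching scores.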


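The third scoring function is based on the well-known [$\chi^2$ statistical test](https://en.wikipedia.org/wiki/Chi-squared_test). Note that scikit-learn's `chi2` expects non-negative feature values such as counts or Boolean indicators, and its scores grow with the scale of a feature, so numerical features with large values, such as `capital-gain`, tend to rank highest.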

In [10]:
from sklearn.feature_selection import chi2

# chi2 returns a pair (chi-squared statistics, p-values); we only need the statistics.
feature_scores, _ = chi2(X_vec, Ytrain)

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)

capital-gain 82192467.14154437
capital-loss 1372145.890201465
age 8600.61182155558
hours-per-week 6476.4089959321245
marital-status=Married-civ-spouse 3477.5158774537117
relationship=Husband 3114.94154602898
education-num 2401.4217771976464
marital-status=Never-married 2218.5219765707857
relationship=Own-child 1435.873016044718
occupation=Exec-managerial 1315.4826322279757
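
The very large scores for `capital-gain` and `capital-loss` are probably at least partly a matter of scale. The following sketch, using a single made-up feature, suggests how the `chi2` score changes when the same feature is expressed in different units.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up toy data: one non-negative feature and a binary label.
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([[1.0], [0.0], [2.0], [5.0], [4.0], [6.0]])

score_small, _ = chi2(x, y)
# The same feature, multiplied by 1000 (e.g. a different unit).
score_large, _ = chi2(1000 * x, y)

print(score_small[0], score_large[0])
```

The rescaled feature carries exactly the same information about the label, yet its score is a thousand times larger. For this reason, it can be risky to compare `chi2` scores of features measured on very different scales.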


In practice, when we want to use feature selection in scikit-learn, we simply plug a selector into our pipeline. `SelectKBest` and `SelectPercentile` are the most common selectors: the former keeps the `k` highest-scoring features, the latter a given percentage of them. Both use a feature scoring function (such as the ones above) to rank the features; by default, the `f_classif` scoring function is used.

In [11]:
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

pipeline = make_pipeline(
    DictVectorizer(),
    SelectKBest(k=100),  # or SelectPercentile(...)
    DecisionTreeClassifier()
)
pipeline.fit(Xtrain_dicts, Ytrain)
accuracy_score(Ytest, pipeline.predict(Xtest_dicts))

0.8166574534733738