## Finding predictive features

In machine learning, finding good features is important because they help to build better classifiers. Once we have trained a good model, we may also want to know which features were particularly helpful.




### Training a classifier on the IRIS dataset

In this section, we will again train a classifier on the [IRIS data set](http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html), but now we will investigate which features were the most predictive.

<img src="pics/iris.png">
Recall: this data set about IRIS flowers is included in sklearn and ready to use, meaning that features are already extracted for the data instances x, and each instance has an associated class label y.
The iris data set consists of 150 instances with 3 classes (setosa, versicolor, virginica). Let's train a classifier and evaluate it.
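
Before training, the description above can be checked directly. The following is a minimal sketch that only inspects the data set as shipped with sklearn; the variable names `X` and `y` are chosen here for illustration.

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

# number of instances and number of features per instance
print("shape of X:", X.shape)
# names of the features and of the class labels
print("feature names:", iris['feature_names'])
print("class names:", list(iris['target_names']))
# number of instances per class
print("instances per class:", np.bincount(y))
```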

And a reminder:
<img src="http://www.wpclipart.com/plants/diagrams/plant_parts/petal_sepal_label.png">

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# 3 classes, 150 instances: X=iris['data']  y=iris['target']
iris = datasets.load_iris()
indices = np.random.permutation(len(iris['data']))
# split in 80% train, 20% test
len_test = int(len(iris['data'])*0.2)
# train part (all except test part)
X_train = iris['data'][indices[:-len_test]]
y_train = iris['target'][indices[:-len_test]]
# test part
X_test = iris['data'][indices[-len_test:]]
y_test = iris['target'][indices[-len_test:]]
# output statistics
print("#inst train: %s" % (len(X_train)))
print("#inst test: %s" % (len(X_test)))
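# learn logistic regression classifier
# (max_iter is raised above the default to help the solver converge on unscaled features)
clf = LogisticRegression(max_iter=1000)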
clf.fit(X_train, y_train)
print(clf)
y_pred= clf.predict(X_test)
print("Pred:", y_pred)
print("Gold:", y_test)
# get performance scores
print(classification_report(y_test, y_pred, target_names=iris['target_names']))
print("accuracy: ",accuracy_score(y_test, y_pred))
print(iris['target_names'])
print(confusion_matrix(y_test, y_pred))

Petal length is typically amongst the most predictive features for this data set. However, the scores above do not show how predictive a feature was for each class. For this, we inspect the coefficients per class, as shown next.

In [None]:

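# Since this example has few features, we can look at them all:
feature_names = iris['feature_names']
# map each class name to its feature names and coefficients
# (named coefs_per_class to avoid shadowing the built-in all())
coefs_per_class = {}
for class_num in range(len(clf.coef_)):
    coefs_per_class[iris['target_names'][class_num]] = {"Feature": feature_names, "Coefficients": clf.coef_[class_num]}

coefs_per_class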

However, in real examples it is typical to only look at the top-n most predictive features.

In [None]:
n = 1
feature_names = iris['feature_names']
for class_num in range(len(clf.coef_)):
    # sort (coefficient, feature name) pairs from most negative to most positive
    coefs_with_fns = sorted(zip(clf.coef_[class_num], feature_names))
    # pair the n most negative with the n most positive coefficients
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    print("class_num", class_num, iris['target_names'][class_num])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
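
One caveat when reading coefficients this way: raw coefficients are only directly comparable across features when the features are on similar scales. The sketch below is an assumption on our side rather than part of the example above. It standardizes the features with `StandardScaler` inside a pipeline before fitting, and then ranks the features of each class by absolute coefficient. For brevity it fits on the full data set instead of a train split, and the names `pipe` and `scaled_clf` are chosen here for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# standardize each feature to zero mean and unit variance, then fit the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(iris['data'], iris['target'])

# make_pipeline names each step after its lowercased class name
scaled_clf = pipe.named_steps['logisticregression']

for class_num, class_name in enumerate(iris['target_names']):
    print(class_name)
    # rank features by the magnitude of their coefficient for this class
    ranked = sorted(zip(scaled_clf.coef_[class_num], iris['feature_names']),
                    key=lambda pair: abs(pair[0]), reverse=True)
    for coef, fn in ranked:
        print("\t%.4f\t%s" % (coef, fn))
```

The iris features are all measured in centimetres, so the effect of scaling is modest here; it matters more when features have very different units or ranges.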

### Exercise:

Extend the sentiment classification example from last week. Find the most predictive feature per class.

Hint: you can use [the method below](https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers).

In [None]:
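def show_most_informative_features(vectorizer, clf, n=10):
    # get_feature_names() was removed in newer scikit-learn versions;
    # get_feature_names_out() is its replacement
    feature_names = vectorizer.get_feature_names_out()
    for i in range(len(clf.coef_)):
        # sort (coefficient, feature name) pairs from most negative to most positive
        coefs_with_fns = sorted(zip(clf.coef_[i], feature_names))
        # pair the n most negative with the n most positive coefficients
        top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
        print("i", i)
        for (coef_1, fn_1), (coef_2, fn_2) in top:
            print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))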

## References

There is much more to be said about feature selection than what we cover here. Below are some pointers.

* http://machinelearningmastery.com/an-introduction-to-feature-selection/
* http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
* http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/