# Feature Selection Mini-Project - Find signature

Katie explained in a video a problem that arose in preparing Chris and Sara’s email for the author identification project; it had to do with a feature that was a little too powerful (effectively acting like a signature, which gives an arguably unfair advantage to an algorithm). You’ll work through that discovery process here.


##  Overfitting a Decision Tree 1

This bug was found when Katie was trying to make an overfit decision tree to use as an example in the decision tree mini-project. A decision tree is classically an algorithm that can be easy to overfit; one of the easiest ways to get an overfit decision tree is to use a small training set and lots of features.  

#### If a decision tree is overfit, would you expect the accuracy on a test set to be very high or pretty low?
Ans : Pretty low

#### If a decision tree is overfit, would you expect high or low accuracy on the training set?
Ans : High  

The accuracy would be very high on the training set, but would plummet once it was actually tested.
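The gap between training and test accuracy can be seen even without a decision tree. The sketch below is only an illustration, not the project code: it swaps in a hypothetical lookup-table "classifier" that memorizes its training points, which is the extreme case of overfitting. The data is random noise and every name in it is made up for the example.

```python
import random

random.seed(42)

def make_points(n, n_features=5):
    """Random feature tuples with random labels (pure noise, nothing to learn)."""
    features = [tuple(random.random() for _ in range(n_features)) for _ in range(n)]
    labels = [random.choice([0, 1]) for _ in range(n)]
    return features, labels

def accuracy(true_labels, predicted):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return matches / float(len(true_labels))

features_train, labels_train = make_points(150)
features_test, labels_test = make_points(1000)

# "Training" just stores every point together with its label.
memory = dict(zip(features_train, labels_train))
# Points never seen in training fall back to the most common training label.
fallback = max(set(labels_train), key=labels_train.count)

def predict(points):
    return [memory.get(p, fallback) for p in points]

train_accuracy = accuracy(labels_train, predict(features_train))
test_accuracy = accuracy(labels_test, predict(features_test))

print("training accuracy: {:.3f}".format(train_accuracy))
print("test accuracy:     {:.3f}".format(test_accuracy))
```

Every training point is looked up directly, so training accuracy is perfect, while test accuracy should land near chance because nothing general was learned.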

## Number of Features and Overfitting

A classic way to overfit an algorithm is by using lots of features and not a lot of training data. 

You can find the starter code in ```feature_selection/find_signature.py```.   
Get a decision tree up and training on the training data, and print out the accuracy. 

In [1]:
# Starter code

import pickle
import numpy
numpy.random.seed(42)

In [2]:
# Starter code

### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../text_learning/your_word_data.pkl" 
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )

In [3]:
# Starter code

### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, 
                                                                                             random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')

# the vectorizer is being fitted on the training features. This will allow it to build its list of 
# vocabulary to generate features and also get feature names.
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()



In [4]:
# Starter code

### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

#### How many training points are there, according to the starter code?

In [5]:
len(features_train)

150

Yup! We've limited our training data quite a bit, so we should be expecting our models to potentially overfit.

### Accuracy of Your Overfit Decision Tree

What’s the accuracy of the decision tree you just made? (Remember, we're setting up our decision tree to overfit -- ideally, we want to see the test accuracy as relatively low.)

In [6]:
from sklearn import tree
from sklearn.metrics import accuracy_score

In [7]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)

In [8]:
pred = clf.predict(features_test)

In [9]:
accuracy = accuracy_score(labels_test, pred)
print "Accuracy of DT classifier is : ", accuracy

Accuracy of DT classifier is :  0.816837315131


This test accuracy is much higher than expected. If the tree were really overfitting, its test performance should be relatively low.

### Identify the Most Powerful Features

Take your (overfit) decision tree and use the ```feature_importances_``` attribute to get a list of the relative importance of all the features being used. 

We suggest iterating through this list (it’s long, since this is text data) and only printing out the feature importance if it’s above some threshold (say, 0.2--remember, if all words were equally important, each one would give an importance of far less than 0.01). What’s the importance of the most important feature? What is the number of this feature?
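To see why 0.2 is such a high bar, recall that the values in ```feature_importances_``` are normalized to sum to 1, so equally important features would each get 1/N. The short calculation below is a sketch under one assumption: the feature count is taken as 21324, which is only a lower bound inferred from the feature number 21323 that shows up in the output later in this notebook.

```python
# Assumed vocabulary size: feature number 21323 appears in the output
# further down, so the vectorizer produced at least 21324 features.
n_features = 21324

# Importances sum to 1, so an even split gives each feature 1 / N.
uniform_importance = 1.0 / n_features
threshold = 0.2

print("uniform importance: {:.6f}".format(uniform_importance))
print("threshold is {:.0f}x the uniform value".format(threshold / uniform_importance))
```

Under that assumption, a feature that clears the 0.2 threshold carries several thousand times the importance it would have if all words mattered equally.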

__Guidance:__  The object of this exercise is to analyze which features are most predictive or most important. The output should be all features above a certain threshold and their feature importance and feature number. The feature importance of the most important feature and the number of that feature are what should be entered into the quiz.  

One way to proceed would be to run the code and look at the output (the printed output). You may consider playing with the threshold and seeing how this impacts the output.

The number we get from "identifying the most powerful features" is actually the index of that feature in the ```feature_importances_``` array.

In [10]:
importances = clf.feature_importances_

In [11]:
type(importances)

numpy.ndarray

Answer

In [12]:
for i in range(len(importances)):
    if importances[i] > 0.2:
        print "Most Important feature : ",importances[i]
        print "Feature Number : ",i

Most Important feature :  0.363636363636
Feature Number :  21323


In [13]:
for i in importances:
    if i > 0.1:
        print i

0.105378579003
0.186927243449
0.363636363636


- __Deciding on a threshold :__ This will vary from model to model. Feature importance roughly measures how much information we gain from splitting on that feature, judged by how much the split reduces the overall impurity of the system. That is, feature splits that decrease impurity more are more important. Often we look at several features and choose a threshold that keeps a reasonable number of features whose scores are high relative to the rest. For example, in the plot below we would likely choose the first three features, after which there is a drop-off in importance.

![feature_imp](images/feature_imp.png)

- To get the most important feature we don't necessarily need to set a threshold as we do in the code above. You could also return all feature importances and sort. Setting a threshold is a good idea because it filters out features we wouldn't consider important and gives us a smaller list to work with.

- To get the feature number we can use the index of importances. If we determine the index of the highest scoring feature, this can be used to determine the feature number. There are other approaches as well, such as counting through the feature iteration.
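
The sort-based and index-based approaches from this list can be sketched as follows. The importance values are hypothetical stand-ins for ```clf.feature_importances_``` (the real array has one entry per vocabulary word), and ```threshold``` plays the same role as the cutoff in the cells above.

```python
# Hypothetical importances; in the notebook this would be clf.feature_importances_.
importances = [0.0, 0.05, 0.0, 0.36, 0.0, 0.19, 0.10, 0.30]
threshold = 0.2

# Pair every importance with its feature number, then sort from highest to lowest.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)

# The first entry is the most important feature; no threshold is needed to find it.
top_index, top_importance = ranked[0]
print("most important feature: number {} (importance {})".format(top_index, top_importance))

# A threshold shortens the list to the features worth inspecting.
for index, importance in ranked:
    if importance > threshold:
        print("feature {}: {}".format(index, importance))
```

With a NumPy array such as the real ```importances```, ```importances.argmax()``` returns the same index, and ```numpy.argsort(importances)[::-1]``` gives the full ranking.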

## Use TfIdf to Get the Most Important Word

In order to figure out what words are causing the problem, you need to go back to the TfIdf and use the feature numbers that you obtained in the previous part of the mini-project to get the associated words. You can return a list of all the words in the TfIdf by calling ```get_feature_names()``` on it; pull out the word that’s causing most of the discrimination of the decision tree. What is it? Does it make sense as a word that’s uniquely tied to either __Chris Germany__ or __Sara Shackleton__, a signature of sorts?

In [16]:
words_list = vectorizer.get_feature_names()
words_list[21323]

u'houectect'

This is the most powerful word when the decision tree is making its classification decision.

Even though our training data is limited, we still have a word that is highly indicative of the author.

## Remove, Repeat

This word seems like an outlier in a certain sense, so let’s remove it and refit. Go back to ```text_learning/vectorize_text.py```, and remove this word from the emails using the same method you used to remove “sara”, “chris”, etc. Rerun ```vectorize_text.py```, and once that finishes, rerun ```find_signature.py```. Any other outliers pop up? What word is it? Seem like a signature-type word? (Define an outlier as a feature with importance >0.2, as before).
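
A minimal sketch of the removal step, assuming the earlier mini-project stripped words with a plain ```str.replace``` loop; your ```vectorize_text.py``` may do it differently. The function name, the ```signature_words``` list, and the sample text are made up for illustration.

```python
# Words to strip: "sara" and "chris" from the earlier mini-project, plus a
# newly discovered signature word. Extend this list each time you find another.
signature_words = ["sara", "chris", "cgermannsf"]

def remove_signature_words(text, words):
    """Return text with every occurrence of each word deleted."""
    for word in words:
        text = text.replace(word, "")
    # Collapse the gaps left behind by the removed words.
    return " ".join(text.split())

# Made-up, already-stemmed email text for illustration.
email_text = "pleas forward thi to chris cgermannsf befor the meet"
cleaned = remove_signature_words(email_text, signature_words)
print(cleaned)
```

Note that ```str.replace``` works on substrings, so it also trims any longer word that happens to contain one of the listed words. That is usually acceptable for distinctive signature tokens but worth keeping in mind for short ones.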

After removing the first signature word, another powerful signature word arises.

__cgermannsf__

## Checking Important Features Again

Update ```vectorize_text.py``` one more time, and rerun. Then run ```find_signature.py``` again. Any other important features (importance>0.2) arise? How many? Do any of them look like “signature words”, or are they more “email content” words, that look like they legitimately come from the text of the messages?

__houectect__  

Yes, there is one more word ("houectect").  Your guess about what this word means is as good as ours, but it doesn't look like an obvious signature word so let's keep moving without removing it.

## Accuracy of the Overfit Tree

What’s the accuracy of the decision tree now? We've removed two "signature words", so it will be more difficult for the algorithm to fit to our limited training set without overfitting. Remember, the whole point was to see if we could get the algorithm to overfit--a sensible result is one where the accuracy isn't that great!

In [17]:
accuracy = accuracy_score(labels_test, pred)
print "Accuracy of DT classifier is : ", accuracy

Accuracy of DT classifier is :  0.816837315131


Now that we've removed the outlier "signature words", the decision tree is starting to overfit to the words that remain in the training data.