## Feature Selection

---

#### Overfitting Bug

This bug surfaced when Katie was trying to build an overfit decision tree to use as an example in the decision tree mini-project. Decision trees are classically easy to overfit; one of the easiest ways to get an overfit tree is to use a small training set and lots of features.

- during **training**, it looks as if I have found the solution to everything, and the **accuracy** is very high

- when **testing**, my **accuracy** ends up below what I expected
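
The two symptoms above can be reproduced on synthetic data. The following is a minimal sketch, assuming NumPy and scikit-learn are installed; the random features and labels are hypothetical and unrelated to the email data, so any pattern the tree finds is pure memorization:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Hypothetical data: pure noise, so there is nothing real to learn.
n_train, n_test, n_features = 150, 200, 2000
X_train = rng.rand(n_train, n_features)
y_train = rng.randint(0, 2, size=n_train)
X_test = rng.rand(n_test, n_features)
y_test = rng.randint(0, 2, size=n_test)

# An unconstrained tree can keep splitting until every training point is isolated.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy: %.3f" % train_acc)
print("test accuracy:  %.3f" % test_acc)
```

The tree scores perfectly on the points it was trained on, while the test accuracy should land near chance, because the labels carry no signal.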

In [1]:
import pickle
import numpy
numpy.random.seed(42)

The words (features) and authors (labels), already largely processed

These files should have been created from the previous (Lesson 10) mini-project

In [2]:
words_file = "c:/pyprog/udamini/text_learning/your_word_data.pkl" 
authors_file = "c:/pyprog/udamini/text_learning/your_email_authors.pkl"
# Pickle files are binary; "rb" is required on Python 3
word_data = pickle.load(open(words_file, "rb"))
authors = pickle.load(open(authors_file, "rb"))

In [None]:
#word_data

In [None]:
#authors

- test_size is the fraction of events assigned to the test set (the remainder go into training)

- feature matrices changed to dense representations for compatibility with classifier functions in versions 0.15.2 and earlier
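
To make the `test_size` bullet concrete, here is a small sketch, assuming scikit-learn 0.18 or later (where `train_test_split` lives in `sklearn.model_selection`). The `docs` and `labels` lists are hypothetical stand-ins for `word_data` and `authors`:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for word_data and authors: 100 documents, two authors.
docs = ["email %d" % i for i in range(100)]
labels = [i % 2 for i in range(100)]

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.1, random_state=42)

print("training documents: %d" % len(docs_train))
print("test documents:     %d" % len(docs_test))
```

With 100 documents and `test_size=0.1`, 10 documents go to the test set and the remaining 90 to training.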

Note the mistake here:

- I defined a very **small data set** and an **even smaller** one for testing; there will be only 15 test points in total

In [3]:
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(word_data, 
                                                                            authors, test_size=0.1, random_state=42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

- a classic way to overfit is to use a small number of data points and a large number of features

- train on only 150 events to put ourselves in this regime

In [4]:
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

Classifier here

- note the accuracy value, which is far too high:

In [5]:
from sklearn.tree import DecisionTreeClassifier

# Fit an unconstrained decision tree on the 150 training points
dt = DecisionTreeClassifier()
dt.fit(features_train, labels_train)

# Accuracy on the held-out test set
accuracy = dt.score(features_test, labels_test)
print("\n Accuracy of Model: %0.3F %%" % (accuracy*100))


 Accuracy of Model: 81.684 %


Find the top feature in the decision tree and its relative importance

In [23]:
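import numpy as np

importances = dt.feature_importances_

# Positions and values of the features whose importance exceeds 0.2
idx = np.where(importances > 0.2)
top_feature = importances[idx]

# Report the first such feature (only one is above the threshold in this run)
print("\n Value of most important feature: %0.4F " % top_feature[0])
print("\n Number of most important feature: %d " % idx[0][0])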


 Value of most important feature: 0.3636 

 Number of most important feature: 21323 


Another way:

In [26]:
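import numpy as np

importances = dt.feature_importances_

# Print every feature whose importance exceeds the 0.2 threshold
for index, item in enumerate(importances):
    if item > 0.2:
        print("{} {}".format(index, item))

# Rank all features from most to least important and show the top 10
indices = np.argsort(importances)[::-1]
print('Feature Ranking: ')
for i in range(10):
    print("{} feature no.{} ({})".format(i+1, indices[i], importances[indices[i]]))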

21323 0.36363636363636365
Feature Ranking: 
1 feature no.21323 (0.363636363636)
2 feature no.18849 (0.186927243449)
3 feature no.11975 (0.105378579003)
4 feature no.22546 (0.0840692099229)
5 feature no.29690 (0.0675805258904)
6 feature no.16267 (0.0474074074074)
7 feature no.15331 (0.0426666666667)
8 feature no.16440 (0.0262801932367)
9 feature no.37406 (0.0255293305728)
10 feature no.15560 (0.0248101945003)


In [27]:
# Look up the word stored at a given feature index
feature_name = vectorizer.get_feature_names()
print(feature_name[33614])

sshacklsims1rcsntxswbellnet


What is the word that is causing the trouble?

In [24]:
vocab_list = vectorizer.get_feature_names()
print("\n Word causing most discrimination on the decision tree: %s" % vocab_list[idx[0][0]])


 Word causing most discrimination on the decision tree: houectect


---

Special Note

Depending on when you downloaded the code provided for find_signature.py, you may need to change lines 9-10 so that they point to the files created by running vectorize_text.py:

        words_file = "../text_learning/your_word_data.pkl"
        
        authors_file = "../text_learning/your_email_authors.pkl"
        
        
In addition, if the code fails to run because of memory issues and you are on version 0.16.x of scikit-learn, you can remove the **.toarray()** call from the line where features_train is created to save memory. In that version, the decision tree classifier can take a sparse array as input instead of only dense arrays.
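
As a rough illustration of the sparse-versus-dense point, the sketch below fits the same tree on both representations. It assumes SciPy and a scikit-learn version that accepts sparse input (0.16 or later, per the note above), and it uses a made-up, mostly-zero matrix rather than the email features:

```python
import numpy as np
from scipy import sparse
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical mostly-zero feature matrix, similar in spirit to TF-IDF output.
dense = rng.rand(150, 500)
dense[dense < 0.95] = 0.0        # keep roughly 5% of the entries
dense[:, 0] = 0.0
dense[:75, 0] = 1.0              # column 0 alone determines the label
labels = (dense[:, 0] > 0).astype(int)

# CSR storage keeps only the non-zero entries.
X_sparse = sparse.csr_matrix(dense)

score_sparse = DecisionTreeClassifier(random_state=0).fit(X_sparse, labels).score(X_sparse, labels)
score_dense = DecisionTreeClassifier(random_state=0).fit(dense, labels).score(dense, labels)

print("values stored (sparse): %d" % X_sparse.nnz)
print("values stored (dense):  %d" % dense.size)
print("training accuracy, sparse vs dense: %.3f vs %.3f" % (score_sparse, score_dense))
```

The sparse matrix stores only a small share of the values the dense array holds, which is where the memory saving comes from when `.toarray()` is skipped.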

---

Take your (overfit) decision tree and use the feature_importances_ attribute to get a list of the relative importance of all the features being used

We suggest iterating through this list (it’s long, since this is text data) and only printing out the feature importance if it’s above some threshold (say, 0.2--remember, if all words were equally important, each one would give an importance of far less than 0.01)

What’s the importance of the most important feature? What is the number of this feature?
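
The "far less than 0.01" remark can be checked with a line of arithmetic. This sketch assumes the importances sum to 1 (scikit-learn normalizes them for a fitted tree with at least one split) and uses a lower bound on the vocabulary size taken from the ranking output above, which lists feature no. 37406; the true feature count is not shown in this document:

```python
# Lower bound on the vocabulary size: the ranking lists feature no. 37406,
# so there are at least 37407 features (an assumption, not the exact count).
n_features = 37407
threshold = 0.2

# Importances sum to 1, so an even spread gives each feature 1 / n_features.
equal_share = 1.0 / n_features
ratio = threshold / equal_share

print("equal share per feature: %.2e" % equal_share)
print("threshold is %.0f times the equal share" % ratio)
```

Because the importances sum to 1, at most four features can exceed 0.2 at the same time, so anything above the threshold is carrying a very large share of the tree's decisions.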

---

In order to figure out what words are causing the problem, you need to go back to the TfIdf and use the feature numbers that you obtained in the previous part of the mini-project to get the associated words

You can return a list of all the words in the TfIdf by calling **get_feature_names()** on it; pull out the word that’s causing most of the discrimination of the decision tree

What is it? Does it make sense as a word that’s uniquely tied to either Chris Germany or Sara Shackleton, a signature of sorts?
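
The index-to-word lookup can be tried on a toy corpus first. This is a sketch under stated assumptions: the three documents are invented, and it reads the fitted `vocabulary_` attribute (a term-to-column mapping) instead of `get_feature_names()`, because later scikit-learn releases rename that method to `get_feature_names_out()`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for the email text.
corpus = [
    "budget meeting houston",
    "gas contract houston",
    "budget contract review",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(corpus)

# vocabulary_ maps each term to its column index; invert it for index -> term.
index_to_word = {index: word for word, index in toy_vectorizer.vocabulary_.items()}

for index in sorted(index_to_word):
    print("%d %s" % (index, index_to_word[index]))
```

Columns are assigned in alphabetical order of the terms, so removing a word from the emails can shift the index of every word that sorts after it.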

---

This word seems like an outlier in a certain sense, so let’s remove it and refit. Go back to text_learning/vectorize_text.py, and remove this word from the emails using the same method you used to remove “sara”, “chris”, etc.

Rerun vectorize_text.py, and once that finishes, rerun find_signature.py

Any other outliers pop up? What word is it? Seem like a signature-type word?

*Define an outlier as a feature with importance >0.2, as before*
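
One way the removal step might look is sketched below. It assumes vectorize_text.py strips words by plain substring replacement; `strip_words` is a hypothetical helper, not a function from the starter code, and the sample sentence is invented:

```python
def strip_words(text, words_to_remove):
    """Remove each listed word from text (hypothetical helper, substring removal)."""
    for word in words_to_remove:
        text = text.replace(word, "")
    # Collapse the whitespace left behind by the removed words.
    return " ".join(text.split())


# "sara" and "chris" come from the text above; "houectect" is the outlier found earlier.
signature_words = ["sara", "chris", "houectect"]
sample = "please call chris houectect about the gas deal"

cleaned = strip_words(sample, signature_words)
print(cleaned)
```

Substring replacement is blunt: it would also cut "chris" out of a longer word such as "christmas", so a token-level filter may be preferable if that matters for your data.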

---

Update vectorize_text.py one more time, and rerun. Then run find_signature.py again

Any other important features (importance>0.2) arise? How many?

Do any of them look like “signature words”, or are they more “email content” words, that look like they legitimately come from the text of the messages?

Answer: **houectect**

In [None]:
# How many training points are there, according to the starter code?
len(features_train)

# What’s the importance of the most important feature? What is the number of this feature?
import numpy as np
importances = dt.feature_importances_  # dt is the decision tree fitted above
for index, item in enumerate(importances):
    if item > 0.2:
        print("{} {}".format(index, item))

indices = np.argsort(importances)[::-1]
print('Feature Ranking: ')
for i in range(10):
    print("{} feature no.{} ({})".format(i+1, indices[i], importances[indices[i]]))

# Remove these words from the emails using the same method you used to remove “sara”, “chris”, etc.

# What’s the most powerful word when your decision tree is making its classification decisions?
# The indices below were recorded from separate runs; the vocabulary, and with it the
# index of each word, can change whenever vectorize_text.py is rerun.
feature_name = vectorizer.get_feature_names()

print(feature_name[33614])
# Result: sshacklensf (a word unique to, and strongly indicative of, the author)

print(feature_name[14343])
# Result: cgermannsf

print(feature_name[14343])
# Result: houectect

#sklearn.feature_extraction