# Feature Engineering 1.0

As I started playing with logistic regression to build a Lady Gaga song detector, I found that the number of features was too large to compute with.  Moreover, the feature matrix was mostly sparse and mostly useless, since it was based on a basic word-count vectorization.  (See the nice example using Naive Bayes from Mike: https://github.com/qzmeng/ail/blob/master/training.ipynb)

There is some good scikit info on managing this:
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py  

This is a survey of my findings.

- Using 300 songs (half Lady Gaga, half The Clash), we counted words
- This gave 4000+ distinct words


In [2]:
%%html
<style>table {margin-left: 0 !important;}</style>

Tons of words are used only once, a few are used several times, and the rest are common words counted a lot of times.

| Times counted | Number of words | Meaning |
|---------------|-----------------|---------|
| 1 | 1992 | 1992 words appeared once |
| 2 | 572 | |
| 3 | 296 | |
| 4 | 212 | 212 words appeared 4 times |
| 5 | 123 | |
| .. | .. | |
| 25-600 | ~1 | Many different words appeared many times |
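
A "count of counts" table like the one above can be built with two `collections.Counter` passes. This is only a minimal sketch: the toy `songs` list and the regex tokenizer are made up for illustration and are not the code that produced the numbers above.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the real input would be the lyric files.
songs = [
    "love love love all you need is love",
    "london calling to the faraway towns",
    "you and me could write a bad romance",
]

def tokenize(text):
    # Lowercase and keep runs of letters/digits (an assumed, simplistic tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

# Total occurrences of each word across all songs.
word_counts = Counter(tok for song in songs for tok in tokenize(song))

# How many distinct words occurred exactly n times.
count_of_counts = Counter(word_counts.values())

for times, n_words in sorted(count_of_counts.items()):
    print(times, n_words)
```

Even on this tiny corpus the same shape shows up: almost every word occurs exactly once.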



The most frequent words are mostly function words (prepositions, pronouns, conjunctions) and are maybe useless:

| Word [Index] | Count | Word [Index] | Count |
|--------------|-------|--------------|-------|
| in [1155] | 321 | on [1581] | 329 |
| love [1356] | 331 | that [2288] | 366 |
| oh [1573] | 394 | be [166] | 394 |
| your [2612] | 487 | my [1512] | 489 |
| to [2331] | 495 | and [69] | 497 |
| it [1185] | 550 | me [1414] | 589 |
| the [2289] | 980 | you [2609] | 1260 |


Lots of words are used only once:

| Word [Index] | Count | Word [Index] | Count |
|--------------|-------|--------------|-------|
| typical [2399] | 1 | six [2046] | 1 |
| hits [1080] | 1 | photo [1674] | 1 |
| screaming [1940] | 1 | atmosphere [114] | 1 |
| fit [814] | 1 | rootiful [1877] | 1 |

Are the useful words maybe in the middle of the count distribution?

| Word [Index] | Count | Word [Index] | Count |
|--------------|-------|--------------|-------|
| superstar [2215] | 57 | fame [746] | 57 |
| time [2328] | 58 | think [2300] | 58 |
| now [1558] | 59 | tonight [2341] | 60 |
| gaga [884] | 61 | heart [1042] | 61 |

One problem is that we are counting the total number of occurrences of each word.  In a song the same words repeat a lot: if the chorus is "love love love... all you need is love love love", you get something like 20 hits from a single song.
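
One way to blunt the chorus effect is to count each word at most once per song, next to the raw total. Here is a minimal sketch on made-up lyrics (the `songs` list and the whitespace tokenizing are assumptions for illustration):

```python
from collections import Counter

# Hypothetical toy lyrics, one string per song.
songs = [
    "love love love all you need is love love love",
    "i fought the law and the law won",
    "love me or leave me",
]

total_counts = Counter()  # every occurrence counts
song_counts = Counter()   # each word counts at most once per song

for song in songs:
    tokens = song.split()
    total_counts.update(tokens)
    song_counts.update(set(tokens))  # set() collapses repeats within a song

print(total_counts["love"], song_counts["love"])
```

The chorus-heavy first song dominates the total count for "love" but contributes only one hit to the per-song count. scikit-learn's `CountVectorizer` has a `binary=True` option that clips counts to 0/1 per document in a similar way.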
  
### Some more analysis using Pandas.DataFrame

I found it's way easier to do stuff in DataFrames than with NumPy arrays.  
One thing I did was create this table:  [word, count, countFiles]
 - created a matrix of word counts in the + gaga files
 - created a matrix of word counts in the - gaga files
 - merged the tables (outer join) and 0-filled nulls (`pandas.merge()`)
 - created 2 derived columns with the +/- difference, showing whether a word is more or less frequent in gaga vs non-gaga files
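
The merge step in the list above can be sketched roughly like this. The real `mergeCounts` lives in `featureEngineering` and is not shown here, so this is only a guess at its shape: the words and numbers are made up, and the sign convention (n minus g) is assumed from the printed output further down, where for example 980 - 1154 = -174.

```python
import pandas as pd

# Made-up counts: total occurrences (*ct) and number of files containing the word (*fct).
g = pd.DataFrame({"word": ["alpha", "beta", "gamma"],
                  "gct": [10, 4, 7], "gfct": [5, 2, 7]})
n = pd.DataFrame({"word": ["beta", "gamma", "omega"],
                  "nct": [1, 9, 3], "nfct": [1, 6, 2]})

# The outer join keeps words that appear in only one table; missing counts become 0.
merged = pd.merge(g, n, on="word", how="outer").fillna(0)

# Derived columns: difference between the two tables (assumed convention: n minus g).
merged["gct-delta"] = merged["nct"] - merged["gct"]
merged["gfct-delta"] = merged["nfct"] - merged["gfct"]

merged = merged.sort_values("gct-delta").reset_index(drop=True)
print(merged)
```

Sorting by a delta column and looking at `head()` and `tail()` then gives the words that lean most strongly toward one table or the other.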


In [3]:
import pandas
import numpy
from featureEngineering import *

# Load the two data sets: feature matrix, y values, word labels, file names
trainingMatrix1,yArr1,labels1,fnames1 = getGagaData(gtype=0)
trainingMatrix2,yArr2,labels2,fnames2 = getGagaData(gtype=1)

# Build the per-word count tables (the second return value is not used here)
m1,c1 = countWords2(trainingMatrix1, labels1, fnames1)
m2,c2 = countWords2(trainingMatrix2, labels2, fnames2)

# Outer-join the two tables and add the delta columns
m = mergeCounts(m1,m2)

# Biggest differences in total word count
print ('\ntop/bot # variance of # words *2nd to last column*')
m = m.sort_values('gct-delta')
display (pandas.concat([m.head(),m.tail()]))

# Biggest differences when each word counts once per file
print ('\ntop/bot # variance of # words (once per file) *last column*')
m = m.sort_values('gfct-delta')
display (pandas.concat([m.head(),m.tail()]))




top/bot # variance of # words *2nd to last column*


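| Unnamed: 0 | word | gct | gfct | nct | nfct | gct-delta | gfct-delta |
|------------|------|-----|------|-----|------|-----------|------------|
| 2814 | the | 1154.0 | 97.0 | 980.0 | 98.0 | -174.0 | 1.0 |
| 2823 | they | 137.0 | 53.0 | 43.0 | 19.0 | -94.0 | -34.0 |
| 114 | an | 110.0 | 43.0 | 38.0 | 11.0 | -72.0 | -32.0 |
| 1260 | he | 143.0 | 44.0 | 79.0 | 28.0 | -64.0 | -16.0 |
| 3030 | war | 47.0 | 15.0 | 0.0 | 0.0 | -47.0 | -15.0 |
| 3175 | your | 155.0 | 58.0 | 487.0 | 67.0 | 332.0 | 9.0 |
| 1920 | oh | 35.0 | 15.0 | 394.0 | 44.0 | 359.0 | 29.0 |
| 1839 | my | 119.0 | 50.0 | 489.0 | 84.0 | 370.0 | 34.0 |
| 1719 | me | 112.0 | 39.0 | 589.0 | 82.0 | 477.0 | 43.0 |
| 3172 | you | 515.0 | 85.0 | 1260.0 | 96.0 | 745.0 | 11.0 |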



top/bot # variance of # words (once per file) *last column*


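| Unnamed: 0 | word | gct | gfct | nct | nfct | gct-delta | gfct-delta |
|------------|------|-----|------|-----|------|-----------|------------|
| 2823 | they | 137.0 | 53.0 | 43.0 | 19.0 | -94.0 | -34.0 |
| 114 | an | 110.0 | 43.0 | 38.0 | 11.0 | -72.0 | -32.0 |
| 1084 | from | 65.0 | 42.0 | 23.0 | 14.0 | -42.0 | -28.0 |
| 88 | all | 151.0 | 67.0 | 127.0 | 43.0 | -24.0 | -24.0 |
| 3102 | will | 44.0 | 33.0 | 42.0 | 14.0 | -2.0 | -19.0 |
| 1472 | just | 46.0 | 25.0 | 206.0 | 62.0 | 160.0 | 37.0 |
| 461 | cause | 13.0 | 8.0 | 114.0 | 45.0 | 101.0 | 37.0 |
| 1719 | me | 112.0 | 39.0 | 589.0 | 82.0 | 477.0 | 43.0 |
| 177 | baby | 12.0 | 4.0 | 261.0 | 51.0 | 249.0 | 47.0 |
| 1656 | love | 14.0 | 10.0 | 331.0 | 61.0 | 317.0 | 51.0 |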


As mentioned before, the most frequent words are not so great.   Taking the words from the 2nd table (each word counted once per file) may work better.

Or I can just use scikit-learn's feature magic and see what they do.  Let me see...  http://scikit-learn.org/stable/modules/feature_selection.html

Using scikit-learn's <B>VarianceThreshold</B>: feature reduction magic, 3189 -> 326 features with `threshold=.8 * (1 - .8)` (that is 0.16, the variance of a 0/1 feature that is on 80% of the time)!


In [13]:
from sklearn.feature_selection import VarianceThreshold

t1 = numpy.array(trainingMatrix1)

# VarianceThreshold: keep only the columns whose variance exceeds the threshold
print (t1.shape)
model = VarianceThreshold(threshold=(.8 * (1 - .8)))
t1_reduced = model.fit_transform(t1)
print ('Variance .8',t1_reduced.shape)

# Map the surviving column indices back to their words
picklist = model.get_support(True)
pickwords = [labels1[p] for p in picklist]
print (pickwords)


(100, 3189)
('Variance .8', (100, 326))
[u'1977', u'48', u'again', u'ain', u'all', u'alone', u'am', u'ammunition', u'amp', u'an', u'and', u'another', u'are', u'around', u'as', u'at', u'away', u'baby', u'back', u'bad', u'be', u'beat', u'becomes', u'been', u'before', u'believe', u'bells', u'better', u'big', u'billy', u'bit', u'black', u'bombs', u'boulevard', u'boys', u'brixton', u'bullets', u'burn', u'burning', u'but', u'by', u'cadillac', u'call', u'calling', u'came', u'can', u'car', u'casbah', u'cause', u'cheat', u'city', u'clash', u'clean', u'come', u'coming', u'cool', u'cos', u'could', u'crazy', u'crush', u'daddy', u'danced', u'day', u'days', u'dead', u'death', u'dem', u'did', u'didn', u'die', u'do', u'don', u'door', u'doors', u'down', u'drive', u'drug', u'eh', u'else', u'em', u'emotion', u'england', u'enough', u'ever', u'every', u'everybody', u'everything', u'face', u'fail', u'fall', u'far', u'fe', u'feet', u'fight', u'finger', u'fingerpop', u'for', u'fought', u'four', u'free', u'fro
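
To see what the threshold actually does, here is a small sketch on a made-up 6 x 3 count matrix (the words and numbers are invented for illustration). `VarianceThreshold` computes the variance of each column and keeps only the columns above the threshold, so a word that shows up in a single song and a word with the same count everywhere both get dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up counts: rows are songs, columns are words.
words = ["rare", "constant", "bursty"]
X = np.array([
    [0, 1, 0],
    [0, 1, 3],
    [0, 1, 0],
    [1, 1, 5],
    [0, 1, 0],
    [0, 1, 0],
])

threshold = 0.8 * (1 - 0.8)  # 0.16, same expression as in the cell above
selector = VarianceThreshold(threshold=threshold)
X_reduced = selector.fit_transform(X)

# Per-column variances, then the words that survived.
print(selector.variances_)
kept = [words[i] for i in selector.get_support(indices=True)]
print(kept)
```

Note that this filter never looks at the class labels, so it removes low-variance words whether or not they would help tell the two artists apart.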

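scikit-learn's <B>SelectKBest</B> (top 50) works the same magic, going from 3189 features to the top 50!  One caveat: the 50 words printed below are a contiguous alphabetical run from 'workers' to 'zydeco', which suggests every feature got the same chi-squared score (as would happen if `yArr1` holds only one class), so this particular selection is probably not meaningful.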


In [15]:
# SelectKBest(50)
from sklearn.feature_selection import SelectKBest, chi2

X, y = t1, yArr1
print(X.shape)
# Score every column against y with the chi-squared test and keep the 50 best
model = SelectKBest(chi2, k=50)
X_new = model.fit_transform(X, y)   # need to keep labels
print('KBest',X_new.shape)

# Map the selected column indices back to their words
picklist = model.get_support(True)
pickwords = [labels1[p] for p in picklist]
print (pickwords)


(100, 3189)
('KBest', (100, 50))
[u'workers', u'working', u'works', u'world', u'worried', u'worry', u'worse', u'worthy', u'wot', u'would', u'wouldn', u'wreck', u'writ', u'write', u'written', u'wrong', u'ya', u'yabbos', u'yankee', u'yard', u'yeah', u'year', u'years', u'yeh', u'yellow', u'yellowy', u'yer', u'yes', u'yesterday', u'yet', u'yo', u'yob', u'york', u'you', u'young', u'younger', u'your', u'yours', u'yourself', u'youth', u'zealot', u'zed', u'zee', u'zion', u'zombies', u'zone', u'zoo', u'zooming', u'zooms', u'zydeco']
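
For comparison, here is a minimal sketch of `SelectKBest` with `chi2` on a made-up matrix where both classes are present in `y`. The words, counts and labels are invented for illustration; this is not a rerun of the cell above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up counts: rows are songs, columns are words.
words = ["the", "love", "war", "you"]
X = np.array([
    [5, 4, 0, 3],   # class 1
    [6, 5, 0, 2],   # class 1
    [5, 3, 0, 3],   # class 1
    [6, 0, 4, 2],   # class 0
    [5, 0, 5, 3],   # class 0
    [6, 0, 3, 2],   # class 0
])
y = np.array([1, 1, 1, 0, 0, 0])  # both classes must be present for useful scores

# chi2 needs non-negative features, which raw word counts satisfy.
selector = SelectKBest(chi2, k=2).fit(X, y)

print(selector.scores_)
picked = [words[i] for i in selector.get_support(indices=True)]
print(picked)
```

Here the words that appear in only one class score far higher than the words that are spread evenly across both, which is the behaviour you want from a supervised filter.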
