# Feature Engineering 1.0

As I started playing with logistic regression to build a Lady Gaga song detector, I found that the number of features was too large to compute with. Moreover, the feature matrix was mostly sparse and mostly useless, since it was based on a basic word-count vectorization. (See the nice example using Naive Bayes from Mike: https://github.com/qzmeng/ail/blob/master/training.ipynb)

There is some good scikit-learn info on managing this:
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py  

This is a survey of my findings.

- Using 300 songs (half Lady Gaga, half The Clash), we counted words
- 4000+ words


In [2]:
%%html
<style>table {margin-left: 0 !important;}</style>

Tons of words are used only once, a few are used many times, and the rest are common words counted a lot of times.

| Times counted | Number of words | Meaning |
|---------------|-----------------|---------|
| 1 | 1992 | 1992 words appeared once |
| 2 | 572 | |
| 3 | 296 | |
| 4 | 212 | 212 words appeared 4 times |
| 5 | 123 | |
| .. | .. | |
| 25-600 | ~1 | Many different words each appeared many times (about one word per count) |
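A table like the one above is a "count of counts". Here is a minimal sketch of how it can be built with only the standard library. The `songs` list is made-up toy data standing in for the real lyric files, and splitting on whitespace is a simplification of what a real vectorizer does.

```python
from collections import Counter

# Hypothetical toy lyrics standing in for the real song files.
songs = [
    "love love love all you need is love",
    "london calling to the faraway towns",
    "you and me and love",
]

# Total number of times each word appears across all songs.
word_totals = Counter(word for song in songs for word in song.split())

# How many distinct words share each total (the "count of counts").
count_of_counts = Counter(word_totals.values())

for times, n_words in sorted(count_of_counts.items()):
    print(f"{n_words} words appeared {times} time(s)")
```

Even on this tiny sample the shape is the same as in the table: most words appear once and only a couple appear often.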



The most frequent words are mostly common function words (pronouns, prepositions, conjunctions), which are probably useless as features:

| Word [Index] | Count | Word [Index] | Count |
|--------------|-------|--------------|-------|
| in [1155]    | 321   | on [1581]    | 329   |
| love [1356]  | 331   | that [2288]  | 366   |
| oh [1573]    | 394   | be [166]     | 394   |
| your [2612]  | 487   | my [1512]    | 489   |
| to [2331]    | 495   | and [69]     | 497   |
| it [1185]    | 550   | me [1414]    | 589   |
| the [2289]   | 980   | you [2609]   | 1260  |
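One common way to deal with words like these is a stop-word list. The sketch below is a hand-rolled illustration, not the method used in this notebook: `STOP_WORDS` is a tiny list picked from the table above for the example, and `count_without_stop_words` is a hypothetical helper name.

```python
from collections import Counter

# Assumption: a tiny hand-picked stop list, for illustration only.
STOP_WORDS = {"in", "on", "that", "be", "your", "my",
              "to", "and", "it", "me", "the", "you"}

def count_without_stop_words(lyrics, stop_words=STOP_WORDS):
    """Count words in one lyric string, skipping anything in stop_words."""
    words = lyrics.lower().split()
    return Counter(w for w in words if w not in stop_words)

counts = count_without_stop_words("Oh the love that you give to me")
print(counts)
```

In scikit-learn, `CountVectorizer` takes a `stop_words` parameter (for example `stop_words='english'`) that does the same kind of filtering, though I have not checked how well its built-in list fits song lyrics.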


Lots of words are used only once:

| Word [Index]     | Count | Word [Index]     | Count |
|------------------|-------|------------------|-------|
| typical [2399]   | 1     | six [2046]       | 1     |
| hits [1080]      | 1     | photo [1674]     | 1     |
| screaming [1940] | 1     | atmosphere [114] | 1     |
| fit [814]        | 1     | rootiful [1877]  | 1     |

Are the useful words maybe in the middle of the frequency range?

| Word [Index]     | Count | Word [Index]   | Count |
|------------------|-------|----------------|-------|
| superstar [2215] | 57    | fame [746]     | 57    |
| time [2328]      | 58    | think [2300]   | 58    |
| now [1558]       | 59    | tonight [2341] | 60    |
| gaga [884]       | 61    | heart [1042]   | 61    |
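If the useful words really do sit in the middle, one option is to drop both tails by total count. This is only a sketch: the `low` and `high` thresholds are arbitrary guesses, `keep_mid_frequency` is a hypothetical helper, and the sample totals are copied from the tables above.

```python
def keep_mid_frequency(word_totals, low=2, high=100):
    """Keep only words whose total count falls in [low, high]."""
    return {w: c for w, c in word_totals.items() if low <= c <= high}

# A few totals taken from the tables above.
totals = {"the": 980, "you": 1260, "gaga": 61,
          "fame": 57, "rootiful": 1, "six": 1}

kept = keep_mid_frequency(totals)
print(kept)
```

scikit-learn's `CountVectorizer` has `min_df` and `max_df` parameters for a similar cut, but note they work on document frequency (how many songs contain the word), not on the total counts used here.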

One problem is that we are counting the total number of word occurrences. Within a song the same words repeat a lot: if the chorus is "love love love... all you need is love love love", you get something like 20 hits from a single song.
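A possible fix is to count each word at most once per song, so a repetitive chorus cannot dominate. The sketch below shows the idea with the standard library only. The two lyric strings are made up, and `song_frequency` is a hypothetical helper name.

```python
from collections import Counter

def song_frequency(songs):
    """For each word, count how many songs contain it at least once."""
    freq = Counter()
    for song in songs:
        # set() collapses repeats within one song to a single hit.
        freq.update(set(song.lower().split()))
    return freq

songs = [
    "love love love all you need is love love love",
    "i love rock",
]

per_song = song_frequency(songs)
total_hits = sum(song.split().count("love") for song in songs)
print(per_song["love"], "songs contain 'love', versus", total_hits, "total hits")
```

With scikit-learn, `CountVectorizer(binary=True)` gives the matching feature matrix: every non-zero count is set to 1, so each song contributes at most one hit per word.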
  


In [7]:
import os, requests, pandas, io, numpy, argparse, math
from myutils import *

# code copied from Meng: pull in the songclass/* Lady Gaga / Clash lyrics text data
def getGagaData(maxrows=200,maxfeatures=4000,gtype=None):
    import random, sklearn, sklearn.feature_extraction.text, sklearn.naive_bayes
    def append_data(ds,dir,label,size):
        filenames=os.listdir(dir)
        for i,fn in enumerate(filenames):
            if (i>=size):
                break            
            # read the whole lyrics file and close it afterwards
            with open(os.path.join(dir,fn),'r') as f:
                data=f.read()
            ds.append((data,label))
        return ds

    dataset=[]
    if gtype == 1 or gtype is None:
        append_data(dataset,'songclass/lyrics/gaga',1,maxrows//2)   # label 1 = Lady Gaga
    if gtype == 0 or gtype is None:
        append_data(dataset,'songclass/lyrics/clash',0,maxrows//2)  # label 0 = The Clash
    data,target=zip(*dataset)
    vec=sklearn.feature_extraction.text.CountVectorizer()
    mat=vec.fit_transform(data)
    yarr = list(target)
    data = mat.toarray()
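    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    labels = list(vec.get_feature_names_out())[0:maxfeatures]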
    
    if (maxfeatures > len(data[0])):
        maxfeatures = len(data[0]) 
    data = data[:,0:maxfeatures]
    return data,yarr,labels

def countWords(trainingMatrix, labels):
    counts = {0:0}
    words = {}
    for i,col in enumerate(trainingMatrix.T):   # transpose to inspect word by word
        total = numpy.sum(col)   # total occurrences of this word across all songs
        # counts maps "times counted" -> number of words with that total
        counts[total] = counts.get(total, 0) + 1
        words[labels[i]+'  ['+str(i)+']'] = total
    print (i, counts)
    import operator
    sorted_words = sorted(words.items(), key=operator.itemgetter(1))
    return numpy.asmatrix(sorted_words)    

trainingMatrix1,yArr1,labels1 = getGagaData(maxrows=10,maxfeatures=20, gtype=0)  # note: gtype=0 loads only the Clash lyrics (gtype=1 is Gaga); maxrows=10 means up to 5 songs
mGaga = countWords(trainingMatrix1, labels1)




(19, {0: 0, 1: 12, 3: 1, 5: 2, 6: 1, 8: 2, 9: 2})
