## Positive Naive Bayes Classification 

A song can fit in more than one genre, and our conditioning of the data has been careful to preserve that one-to-many relationship. In this part, a battery of classifiers predicts whether a given body of text fits within each of the top fifteen genres. It uses the NLTK positive naive Bayes classifier, trained on the lyrics for each of the top 15 genres using 80% of the data. The classifiers are first trained with uniform priors at various levels, then with priors based on the relative frequency of songs in each genre.

This process builds 15 distinct, independent classifiers. It then chooses a random sentence from each song in the testing set and submits it to each classifier, which returns True if it predicts that the sentence belongs to the associated genre and False otherwise. Thus, any body of text submitted to the battery can be predicted to belong to anywhere from zero to 15 genres.
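
The one-classifier-per-genre arrangement can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `KeywordClassifier` is a hypothetical stand-in for a trained `PositiveNaiveBayesClassifier` (it imitates only the `classify` method), and the toy genres and keywords are invented for the example.

```python
# Sketch: a battery of independent yes/no classifiers gives a multi-label prediction.
# KeywordClassifier is a hypothetical stand-in for a trained classifier.

def features(sentence):
    # Same bag-of-words features the notebook uses
    words = sentence.lower().split()
    return dict(('contains(%s)' % w, True) for w in words)


class KeywordClassifier(object):
    def __init__(self, keywords):
        self.keys = set('contains(%s)' % k for k in keywords)

    def classify(self, featureset):
        # True if any keyword feature is present in the featureset
        return any(k in featureset for k in self.keys)


def predict_genres(text, battery):
    """Return every genre whose classifier accepts the text (possibly none)."""
    feats = features(text)
    return [genre for genre, clf in sorted(battery.items()) if clf.classify(feats)]


# Invented toy battery; the real classifiers are trained on lyrics.
toy_battery = {
    'Country_music': KeywordClassifier(['truck', 'dog']),
    'Hip_hop_music': KeywordClassifier(['beat', 'gun']),
}
```

Because each classifier answers independently, the same text can match no genre, one genre, or several, which is what preserves the one-to-many relationship.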

The numbers of true and false positives and negatives are accumulated, and the results are expressed as accuracy: (true positives plus true negatives) / total observations. To test whether the classifiers comport with our subjective understanding of which sentences are likely to be in which genres of music, the classifiers are opened up at the end of the notebook to arbitrary text, for which they return the predicted genre(s).
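
The accuracy measure described above can be written as a small helper. The function name `accuracy` is introduced here for illustration; the counts are those the notebook reports further down for a uniform prior of 0.01.

```python
def accuracy(truepos, falsepos, falseneg, trueneg):
    """(true positives + true negatives) / total observations."""
    total = truepos + falsepos + falseneg + trueneg
    return float(truepos + trueneg) / total


# Counts printed by the notebook for a uniform prior of 0.01
acc = accuracy(truepos=37, falsepos=69, falseneg=1592, trueneg=10797)
```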

The process does not use cross-validation because the NLTK PositiveNaiveBayesClassifier does not expose any regularization parameters; the only parameter varied here is the prior. The theoretically most appropriate prior for each genre is the genre's relative frequency in the training set. The classifiers are run against the testing set with other values for the prior only to gauge whether the theoretically chosen prior is at least as good as arbitrarily chosen values. The series of runs suggests that it is.

### The real value of this classifier battery, however, is in how it performs in action, assessed subjectively. To enter arbitrary text and see what (if any) genres it matches, run this notebook (Cell->Run All) and then go to the last cell. The classifier battery takes about ten minutes to train, but then each text input is processed very quickly.

In [1]:
from nltk.classify import PositiveNaiveBayesClassifier
import json
import pandas as pd
import random

In [2]:
# open the data set
df = pd.read_csv("../../data/conditioned/all_years_and_genres_with_lyrics_and_wordcount_and_vocabulary_clean.csv")
df = df.drop_duplicates('song_key')
# total number of songs
allsongs= len(df)*1.0
df.head(2)

Unnamed: 0.1,Unnamed: 0,song_key,lyrics,lyrics_url,lyrics_abstract,decade,artist,title,year,band_singer,...,/wiki/Western_swing,/wiki/Witch_house,/wiki/World_music,/wiki/Worldbeat,/wiki/Worship_music,/wiki/Zydeco,wordcount,wordset,lexdiv,repetition_score
0,627,1976-86,Are you ready\nDo what you wanna do\nDo what y...,http://lyrics.wikia.com/Ohio_Players:Who%27d_S...,Are you ready\nDo what you wanna do\nDo what y...,1970,Ohio Players,Who'd She Coo?,1976,Ohio Players,...,False,False,False,False,False,False,35,26,0.742857,1.346154
1,1375,1984-59,I thought that dreams belonged to other men 'C...,http://lyrics.wikia.com/index.php?title=Mike_R...,I thought that dreams belonged to other men 'C...,1980,Mike Reno,Almost Paradise,1984,Ann Wilson,...,False,False,False,False,False,False,144,104,0.722222,1.384615


In [3]:
# get the top 15 genres
with open("../../notebooks/ss/songsbygenre.json") as json_file:
    genresj = json.load(json_file)

# number of songs listed under each genre
topgenres = {}
for k in genresj.keys():
    topgenres[k] = len(genresj[k])

# rank the genres by song count and keep the 15 largest
glist = []
for rank, w in enumerate(sorted(topgenres, key=topgenres.get, reverse=True)[:15]):
    glist.append((rank, w, topgenres[w] / allsongs))
# glist now holds each genre's rank, its name, and its frequency in the overall population
glist

[(0, u'/wiki/Pop_music', 0.7867435158501441),
 (1, u'/wiki/Hip_hop_music', 0.6296829971181557),
 (2, u'/wiki/Contemporary_R%26B', 0.6162343900096061),
 (3, u'/wiki/Soul_music', 0.41306436119116235),
 (4, u'/wiki/Rock_music', 0.3621517771373679),
 (5, u'/wiki/Pop_rock', 0.3520653218059558),
 (6, u'/wiki/Soft_rock', 0.24639769452449567),
 (7, u'/wiki/Country_music', 0.1988472622478386),
 (8, u'/wiki/Rhythm_and_blues', 0.19548511047070125),
 (9, u'/wiki/Alternative_rock', 0.16234390009606148),
 (10, u'/wiki/Funk', 0.15994236311239193),
 (11, u'/wiki/Hard_rock', 0.15417867435158503),
 (12, u'/wiki/Dance-pop', 0.14169068203650337),
 (13, u'/wiki/Dance_music', 0.14169068203650337),
 (14, u'/wiki/Disco', 0.13160422670509125)]

In [4]:
#train/test split
dftr = df.sample(frac=0.8)
dftst =  df.loc[~df.index.isin(dftr.index)]
dftr.shape, dftst.shape

((3331, 454), (833, 454))

## Readying the Training Sets
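For each genre, the songs tagged with that genre form the in-genre (positive) set, and all remaining songs form the out-of-genre set, which the classifier treats as unlabeled. Train the model on these two sets and build fifteen classifiers, one per genre.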

In [5]:
def features(sentence):
    words = sentence.lower().split()
    return dict(('contains(%s)' % w, True) for w in words)
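
To make the feature extraction concrete, the sketch below repeats the `features` function and applies it to two short strings (the strings are invented for illustration). It shows that repeated words collapse into a single feature and that punctuation stays attached to a word, since the text is split on whitespace only.

```python
def features(sentence):
    words = sentence.lower().split()
    return dict(('contains(%s)' % w, True) for w in words)


example = features('You left a mess')
# {'contains(you)': True, 'contains(left)': True,
#  'contains(a)': True, 'contains(mess)': True}

repeated = features('Go, go go')
# 'go,' and 'go' are different features; the second 'go' adds nothing new
```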


## Train

The parameter pprior is set manually for each iteration of the training process. It is either a uniform prior at a selected level, or a genre-dependent prior obtained from the relative frequency of the genre's appearance. Passing a prior of 0 to `trainclassifiers` selects the genre-dependent prior.

In [6]:
classdict={}
def trainclassifiers(prior):
    #this trains 15 classifiers

    for genretuple in glist:
        gindex=genretuple[0]
        genre = genretuple[1]
        gprior= genretuple[2]
        #genre="/wiki/Pop_music"

        in_genre_df = dftr[dftr[genre]==True]
        out_genre_df = dftr[dftr[genre]==False]

        # split each song's lyrics into "sentences" at periods and pool them
        in_sentences=[]
        out_sentences=[]
        for row in in_genre_df.iterrows():
            songsents= row[1][2].split('.')
            for s in songsents:
                in_sentences.append(s)

        for row in out_genre_df.iterrows():
            songsents= row[1][2].split('.')
            for s in songsents:
                out_sentences.append(s)
        # set the prior: 0 is a sentinel meaning "use the genre-specific prior"
        # (the genre's relative frequency scaled by 0.0967)
        if prior == 0:
            pprior = gprior*.0967
        else:
            pprior=  prior 
        positive_featuresets = list(map(features, in_sentences))
        unlabeled_featuresets = list(map(features, out_sentences))
        classdict[int(gindex)] = PositiveNaiveBayesClassifier.train(positive_featuresets,unlabeled_featuresets,positive_prob_prior=pprior)


In [7]:
# this function runs the supplied text against the 15 classifiers and prints the ones that return true
def show_genre_classify(classtext):
    noresult = True
    for genretuple in glist:
        gindex=genretuple[0]
        genre = genretuple[1]
        gprior= genretuple[2]
        if classdict[gindex].classify(features(classtext)):
            print genre[6:]
            noresult = False
    if noresult:
        print "No match"


## Testing Set in Action

This part takes a sentence from each song and submits it to each of the fifteen classifiers. For each, it compares the predicted result to the observed value, keeping track of true and false positives and negatives. 

In [8]:
results={}
# this loop iterates over the listed values for the prior (0 selects the genre-specific prior)
# and, for each one, retrains the classifiers and iterates over the rows of the testing set,
# selecting a random sentence from each song
# and then running the battery of 15 classifiers on that sentence
for p in [.01,.02,.03,.05,.1,.2,.3,0]:
    trainclassifiers(p)
    truepos=0
    trueneg=0
    falsepos=0
    falseneg=0
    blpos=0
    bltotal=0
    # For each row in the test set
    for row in dftst.iterrows():
        # get a random sentence from the song (split() always returns at least one item)
        songsent = random.choice(row[1][2].split('.'))
        for genretuple in glist:
            gindex=genretuple[0]
            genre = genretuple[1]
            gprior= genretuple[2]
            # see if it's in the song's genre list
            observed =  genre in row[1][15]
            predicted = classdict[gindex].classify(features(songsent))
            if observed and predicted:
                truepos += 1
            if not observed and not predicted:
                trueneg += 1
            if observed and not predicted:
                falseneg += 1
            if not observed and predicted:
                falsepos += 1
            # this gets the baseline predicting all false
            bltotal += 1
            if observed:
                blpos += 1
                
    print "Baseline pct. predicting all negative:", 1-1.0*blpos/bltotal,"\n\n"             

    print "Prior: ", p
    print "true positive: ",truepos
    print "false positive: ",falsepos
    print "false negative: ",falseneg
    print "true negative: ",trueneg
    totobs= truepos+falsepos+trueneg+falseneg
    print "total observations: ",totobs 
    print "pct correct: ", (1.0*truepos+trueneg)/totobs,"\n\n" 
    results[str(p)] =[truepos,falsepos,falseneg,trueneg]    


Baseline pct. predicting all negative: 0.86962785114 


Prior:  0.01
true positive:  37
false positive:  69
false negative:  1592
true negative:  10797
total observations:  12495
pct correct:  0.867066826731 


Baseline pct. predicting all negative: 0.86962785114 


Prior:  0.02
true positive:  77
false positive:  113
false negative:  1552
true negative:  10753
total observations:  12495
pct correct:  0.866746698679 


Baseline pct. predicting all negative: 0.86962785114 


Prior:  0.03
true positive:  111
false positive:  195
false negative:  1518
true negative:  10671
total observations:  12495
pct correct:  0.862905162065 


Baseline pct. predicting all negative: 0.86962785114 


Prior:  0.05
true positive:  116
false positive:  382
false negative:  1513
true negative:  10484
total observations:  12495
pct correct:  0.848339335734 


Baseline pct. predicting all negative: 0.86962785114 


Prior:  0.1
true positive:  217
false positive:  845
false negative:  1412
true negative:  1002

## Results

The choice of prior influenced the results. With higher priors, the model matched more songs to their genres, but it also overpredicted considerably. False positives outnumbered true positives for every prior value, and the margin was much greater for larger priors. Very small priors bias the classifiers toward the negative side, so far fewer matches were predicted and the true negatives dominated. This is to be expected, since each genre has only a small probability of applying to a given song.
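
The trade-off described above can be summarized with precision and recall. The notebook does not compute these itself; the sketch below derives them from the confusion counts printed in the output for three of the uniform priors. The helper name `precision_recall` is introduced here for illustration.

```python
def precision_recall(truepos, falsepos, falseneg):
    """Precision: share of predicted matches that were right.
    Recall: share of actual genre memberships that were found."""
    precision = float(truepos) / (truepos + falsepos)
    recall = float(truepos) / (truepos + falseneg)
    return precision, recall


# (true positives, false positives, false negatives) from the notebook output
counts_by_prior = {
    0.01: (37, 69, 1592),
    0.05: (116, 382, 1513),
    0.1: (217, 845, 1412),
}

scores = dict((p, precision_recall(*c)) for p, c in counts_by_prior.items())
```

On these counts, raising the prior finds more of the true genre memberships (recall rises) while a smaller share of the predicted matches is correct (precision falls).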

Genre-specific priors were calculated by dividing the number of songs in each genre by the total number of songs, then multiplying that by the average probability of any one genre: 435 total genres divided by 4500 songs, or about 0.0967. Using this variable prior yielded results among the best of all, correct 86.6% of the time, with predictions that made sense in light of experience.
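
As a worked example of the genre-specific prior, the sketch below applies the scaling to two of the genre frequencies listed earlier in the notebook. The names `GENRE_SCALE` and `genre_prior` are introduced here for illustration; the constant 0.0967 is the one used in `trainclassifiers`.

```python
# 435 genres / 4500 songs rounds to the 0.0967 used in trainclassifiers
GENRE_SCALE = 0.0967


def genre_prior(genre_frequency, scale=GENRE_SCALE):
    """Scale a genre's relative frequency down to a positive-class prior."""
    return genre_frequency * scale


pop_prior = genre_prior(0.7867435158501441)     # /wiki/Pop_music
disco_prior = genre_prior(0.13160422670509125)  # /wiki/Disco
```

With these inputs the most common genre ends up with a prior a little under 0.08 and the fifteenth-ranked genre with a prior a little over 0.01, so the genre-specific priors fall within the range of the uniform priors tried above.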

Unfortunately, none of the models beat the simple baseline of predicting negative every time. Part of the reason is the choice of model: this process does not predict which of the genres are most likely; rather, it independently predicts whether a sentence is likely to come from a given genre. This was initially believed necessary to accommodate the one-to-many nature of the song-to-genre relation. But the result was a very low probability that a song would fit into any individual genre, which left the data heavily unbalanced in favor of negative results. On the other hand, as the cells below indicate, when the classifier battery does predict one or a small number of genres for a given text, the predictions seem apt.
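
The comparison with the all-negative baseline can be reproduced from the confusion counts alone. The sketch below uses the counts the notebook prints for a uniform prior of 0.01; the helper name `all_negative_baseline` is introduced here for illustration.

```python
def all_negative_baseline(truepos, falsepos, falseneg, trueneg):
    """Accuracy obtained by predicting False for every (song, genre) pair."""
    total = truepos + falsepos + falseneg + trueneg
    actual_negatives = falsepos + trueneg
    return float(actual_negatives) / total


counts = dict(truepos=37, falsepos=69, falseneg=1592, trueneg=10797)
baseline = all_negative_baseline(**counts)
model = float(counts['truepos'] + counts['trueneg']) / sum(counts.values())
```

Because only about 13% of the (song, genre) pairs are positive, the baseline is already near 87% accurate, and a classifier has to produce more true positives than false positives to beat it.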

## Have Some Fun

In any of the cells below, replace the text between the quotes with text of your choosing. The classifier battery will run against each genre and report which ones it considers a match. If the matches are coming up empty, you can increase the sensitivity by rerunning the classifier training with a higher prior value. Do not use priors higher than 0.5.

In [9]:
%%time
trainclassifiers(0.15)
show_genre_classify('you left a mess behind when you went away')

Rock_music
Pop_rock
Soft_rock
Country_music
Hard_rock
Wall time: 42.6 s


In [10]:
show_genre_classify('beat the bitch with a bat')

Hip_hop_music


In [11]:
show_genre_classify('life liberty and the pursuit of happiness')

Pop_music
Hip_hop_music
Contemporary_R%26B
Rock_music
Soft_rock
Country_music
Rhythm_and_blues
Alternative_rock
Funk
Hard_rock
Dance-pop
Dance_music


In [12]:
show_genre_classify('gun shoot pistol bullet dead or alive')

Hip_hop_music
Alternative_rock
Hard_rock


In [13]:
show_genre_classify('cheating woman left me my dog died truck broke down')

Hip_hop_music
Country_music
