<h1>Step 8. Parts of Speech 2</h1>

Here I explore the most frequent words in the songs of the Beatles according to their classes and semantics.

In [1]:
%matplotlib inline
import operator
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import HTML, display
from collections import Counter

import spacy
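# 'en' is a spaCy 2.x shortcut link; on spaCy 3+ load a named model instead,
# e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')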

In [2]:
df = pd.read_json('data.json')

After reading the dataset, we need functions that transform the data into a format that is easy to work with. Every word in a song needs a POS tag. For POS identification I use <a href="https://spacy.io/">SpaCy</a> again.

In [3]:
def get_sentences(song):
    '''
        Clean lyrics data and split into sentences
    ''' 
    song_string = song.replace(" cos ", " 'cos ").replace("Cos ", "'Cos ")
    song_string = song_string.replace('[', '').replace(']', '')
    song_string = song_string.replace('&#13;', '')
    song_string = song_string.replace('<p>', '<br/>')
    song_string = song_string.replace('</p>', '')
    return [s.strip() for s in song_string.split('<br/>') if s]

def get_pos(song, tags=True):
    '''
        Get part of speech tags for a song lyrics
    '''
    
    output = []
    
    for sentence in get_sentences(song):
    
        song_obj = nlp(sentence)
        
        if tags:
            output.extend([token.tag_ for token in song_obj])
        else:
            output.extend([token.pos_ for token in song_obj])
    
    output = ' '.join(output) 
    return output

def get_pos_words(song):
    '''
        Get the token texts for a song, tokenized the same way as in get_pos,
        so that words and POS tags can be aligned position by position
    '''
    output = []
    for sentence in get_sentences(song):
        song_obj = nlp(sentence)
        output.extend([token.text for token in song_obj])
    output = ' '.join(output) 
    return output
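
As a quick check of the cleaning step, here is a minimal sketch that runs `get_sentences` on a made-up string. The function body is repeated so the snippet runs on its own, and the sample markup is an assumption about what the raw lyrics look like, not a line taken from `data.json`.

```python
def get_sentences(song):
    """Copy of the cleaning function above, repeated so the snippet runs alone."""
    song_string = song.replace(" cos ", " 'cos ").replace("Cos ", "'Cos ")
    song_string = song_string.replace('[', '').replace(']', '')
    song_string = song_string.replace('&#13;', '')
    song_string = song_string.replace('<p>', '<br/>')
    song_string = song_string.replace('</p>', '')
    return [s.strip() for s in song_string.split('<br/>') if s]

# Hypothetical input, invented for illustration only.
sample = "<p>First line [x2]<br/>Second line cos I say so&#13;</p><p>Third line</p>"

sentences = get_sentences(sample)
for s in sentences:
    print(repr(s))
```

Brackets and the `&#13;` entity are dropped, paragraph tags become line breaks, and a bare *cos* gets its apostrophe, presumably so that the tagger sees it as a contraction.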

In [4]:
def get_lemmas(song):
    '''
        Get lemmatized words for a song
    '''
    output = []
    for sentence in get_sentences(song):
        song_obj = nlp(sentence)
        output.extend([token.lemma_ for token in song_obj])
    output = ' '.join(output)
    return output

In [5]:
def get_post_words_list(cleaned_lyrics_pos_words, cleaned_lyrics_pos, pos_titles=None):
    '''
        Transform data for a POS table
    '''

    pos_words = {}
    
    for song, song_pos in zip(cleaned_lyrics_pos_words, cleaned_lyrics_pos):
        
        song_list = song.split()
        song_pos_list = song_pos.split() 

        for w, pos in zip(song_list, song_pos_list):

            w = w.lower()
            
            if pos in pos_words:
                if w in pos_words[pos]:
                    pos_words[pos][w] = pos_words[pos][w] + 1
                else:
                    pos_words[pos][w] = 1
            else:
                pos_words[pos] = {}
                pos_words[pos][w] = 1
                
    pos_words_results = {}
    for pos, pos_word_dict in pos_words.items():
        pos_sum = sum(pos_word_dict.values())
        c = Counter(pos_word_dict)
        raw_results = sorted(c.items(), key=operator.itemgetter(1), reverse=True)
        pos_words_results[pos] = [(result[0], round(100*float(result[1])/pos_sum, 2)) for result in raw_results]

       
    pos_words_list = []
    pos_words_list_titles = []

    for k,v in pos_words_results.items():
        
        if pos_titles:
            
            if k in pos_titles:
                pos_words_list_titles.append(k)
                pos_words_list.append(v)  
        else:
            pos_words_list_titles.append(k)
            pos_words_list.append(v)  
            
        
    return pos_words_list, pos_words_list_titles
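
The counting logic above can also be sketched more compactly with `collections.Counter`. This is only an illustrative alternative under the same assumptions (words and tags are space-separated and line up one to one); the name `pos_shares` and the toy input are made up for this example and are not part of the notebook.

```python
from collections import Counter, defaultdict

def pos_shares(words_per_song, pos_per_song):
    """Return {pos: [(word, percent_within_that_pos), ...]}, most frequent first."""
    counts = defaultdict(Counter)
    for words, tags in zip(words_per_song, pos_per_song):
        # Words and tags are aligned by position, as in get_post_words_list.
        for word, pos in zip(words.split(), tags.split()):
            counts[pos][word.lower()] += 1

    shares = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        shares[pos] = [(w, round(100 * n / total, 2))
                       for w, n in counter.most_common()]
    return shares

# Toy input, invented for illustration (two "songs" with hand-written tags).
words = ["Love me do", "You know I love you"]
tags = ["VERB PRON VERB", "PRON VERB PRON VERB PRON"]

shares = pos_shares(words, tags)
for pos, rows in shares.items():
    print(pos, rows)
```

Note that each percentage is a share within its own part of speech, not a share of all words, which is why every column in the tables below sums to 100% on its own.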

In [6]:
def show_table(shares, limit, headlines, exclude=()):
    '''
        Show an HTML table for data
    '''
    
    words_by_column = '<tr>'
    
    excluded = []
    
    for i, headline in enumerate(headlines):
        if headline in exclude:
            excluded.append(i)
        else:
            words_by_column = words_by_column + '<td></td><td><strong>%s</strong></td>' % str(headline)
    words_by_column = words_by_column + '</tr>'    
    
    if limit == 0:
        limit = len(shares[0])
    
    for i in range(0, limit):
        
        words_by_column = words_by_column + '<tr>'

        for j, st in enumerate(shares):

            if j not in excluded:

                if i < len(st):
                    words_by_column = words_by_column + '<td>{}</td><td>{:.2f}</td>'.format(st[i][0], st[i][1])
                else:
                    words_by_column = words_by_column + '<td>-</td><td>-</td>'

        words_by_column = words_by_column + '</tr>'    
        
    display(HTML('<table>' + words_by_column + '</table>'))

Above I did all the preparations needed to analyze the part of speech data. Now it's time to see the results.

In [7]:
df['cleaned_lyrics_pos'] = df['lyrics'].apply(get_pos, tags=False)
df['cleaned_lyrics_pos_words'] = df['lyrics'].apply(get_pos_words)
pos_words_list, pos_words_list_titles = get_post_words_list(df['cleaned_lyrics_pos_words'], 
                                                            df['cleaned_lyrics_pos'])

In [8]:
show_table(pos_words_list, 10, pos_words_list_titles, exclude=['PUNCT', 'X', 'SYM'])

| ADV | % | DET | % | PART | % | NUM | % | PRON | % | NOUN | % | VERB | % | ADJ | % | INTJ | % | ADP | % | PROPN | % | CCONJ | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n't | 17.17 | the | 41.90 | to | 49.04 | one | 33.05 | you | 30.43 | love | 4.94 | do | 5.50 | my | 13.30 | oh | 19.15 | in | 15.27 | mm | 3.39 | and | 75.62 |
| so | 6.17 | a | 28.54 | on | 8.72 | two | 15.25 | i | 30.30 | what | 3.32 | 's | 5.21 | your | 10.05 | yeah | 15.65 | of | 10.51 | bill | 2.58 | but | 18.98 |
| now | 5.70 | all | 8.10 | up | 8.16 | four | 6.78 | me | 12.85 | baby | 2.77 | be | 3.72 | good | 4.33 | nah | 11.49 | to | 8.62 | c'mon | 2.50 | or | 3.73 |
| when | 5.70 | that | 6.31 | down | 6.88 | three | 6.36 | it | 9.95 | girl | 2.66 | know | 3.62 | little | 3.42 | well | 7.66 | for | 7.67 | bungalow | 2.50 | so | 1.35 |
| there | 4.25 | no | 3.70 | 's | 6.64 | five | 5.08 | she | 5.51 | time | 2.42 | 'm | 3.16 | long | 2.98 | hey | 5.68 | with | 7.45 | da | 2.23 | yet | 0.10 |
| never | 3.00 | this | 2.38 | out | 6.48 | eight | 4.66 | we | 2.79 | day | 1.69 | is | 3.03 | that | 2.77 | ah | 4.95 | on | 6.55 | jude | 2.05 | n | 0.10 |
| back | 2.97 | some | 1.56 | na | 6.24 | seven | 4.24 | her | 1.82 | way | 1.65 | 'll | 2.16 | all | 2.54 | no | 4.95 | if | 6.18 | mr. | 1.96 | & | 0.10 |
| all | 2.86 | another | 1.48 | ta | 3.44 | six | 4.24 | they | 1.81 | man | 1.55 | got | 2.08 | her | 2.47 | yes | 4.89 | that | 5.64 | sgt | 1.69 | - | - |
| just | 2.83 | any | 1.48 | back | 0.88 | 909 | 2.97 | he | 1.71 | night | 1.34 | 're | 1.89 | much | 2.37 | please | 4.49 | like | 3.16 | la | 1.60 | - | - |


Now we can see the results (see full label descriptions <a href="https://spacy.io/api/annotation">here</a>).

**Nouns**: the leader is *love* (almost 5%). Next come *what*, *baby*, and *girl*, all roughly 3%. *Time* has about 2.4%, while *day*, *way*, *man* and *night* each have between 1% and 2%. All the other nouns have less than 1%.

**Pronouns**: *you* is the leader (about 30%), and *I* is very close behind. *Me* (~13%) and *it* (~10%) have significantly lower shares. Other pronouns have even lower figures.

**Verbs**: the leader is *do* (5.5%), but much of the top ten consists of forms of the verb *to be*: *'s* (which can be *is* or *has*, of course) has about 5%, *be* itself and *know* have about 4% each, *'m* and *is* about 3%, and *'ll*, *got* and *'re* about 2%.

**Adjectives**: the most popular are *my* (~13%) and *your* (~10%). These are also described as possessive pronouns or possessive adjectives. Then we see *good* (~4%), *little* (~3%) and *long* (~3%). Further down, pronoun-like words appear again (*that*, *all*, *her*).

**Adverbs**: the results are strange. Somehow *n't* has been classified as an adverb, and it takes first place with ~17%. Then we have *so* (~6%), *now* and *when* (both 5.7%). Others have less than 5%.

**Adpositions** (both **prepositions** and **postpositions**): the leader is *in* (~15%), then we have *of* (~10.5%), then frequency slowly falls with *to*, *for*, *with*, *on*...

**Particles**: *to* is the absolute leader with 49%! Then there are *on*, *up* and *down*.

**Interjections**: *oh* (~19%) is more frequent than *yeah* (15.65%). *Nah* has the third place with 11.49% (because of *Hey Jude*, I guess).

**Determiners**: *the* is much more frequent than *a* (42% vs. 29%).

**Numbers**: *one* is the leader (about 33%)! It's only natural that second place (~15%) belongs to *two*. But *four* (6.78%) and *three* (6.36%) come in reverse order, though they are close anyway. Then, funnily enough, we have *five* (5.08%), followed by *eight* (4.66%). *Seven* and *six* have equal figures (4.24%).

**Proper nouns**: there are clear signs of misclassification here. From the top ten only two (*Bill* and *Jude*) are attributed correctly.

**Conjunctions**: *and* (76%) is the absolute leader! But the second place is given to *but* (~19%).

Let's look now at lemmatized lyrics frequencies.

In [9]:
df['cleaned_lyrics_lemmas'] = df['lyrics'].apply(get_lemmas)
pos_lemmas_list, pos_lemmas_list_titles = get_post_words_list(df['cleaned_lyrics_lemmas'], df['cleaned_lyrics_pos'])

In [10]:
show_table(pos_lemmas_list, 10, pos_lemmas_list_titles, exclude=['PUNCT', 'DET', 'ADJ', 'CCONJ', 'ADP', 'PART',
                                                                 'INTJ', 'ADV', 'X', 'NUM', 'SYM', 'PROPN', 'PRON'])

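| NOUN | % | VERB | % |
|---|---|---|---|
| love | 5.03 | be | 20.49 |
| what | 3.32 | do | 6.82 |
| girl | 2.82 | will | 4.20 |
| baby | 2.77 | know | 4.05 |
| time | 2.56 | get | 3.71 |
| day | 2.02 | have | 3.34 |
| way | 1.72 | go | 3.25 |
| man | 1.58 | can | 2.85 |
| night | 1.38 | say | 2.55 |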


Lemmatization affects only nouns and verbs, so I examine only these.

***Nouns***: lemmatization, as expected, hasn't changed much. The top ten is the same, but *girl* and *baby* have swapped places.

***Verbs***: with all the forms of *be* summed up, *be* is the leader now with about 20%. *Do* with its forms has less than 7%. *Will* is counted as a separate verb and has 4%. Then we have *know*, *get*, *have*, *go* and others, with shares slowly falling.
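
To illustrate why *be* jumps to the top once forms are merged, here is a minimal sketch that sums surface forms under their lemmas. The counts and the `lemma_of` mapping are invented for illustration; the notebook itself takes lemmas from spaCy's `token.lemma_`.

```python
from collections import Counter

# Hypothetical surface-form counts (not taken from the lyrics data).
form_counts = Counter({"is": 30, "'m": 25, "be": 20, "'re": 15,
                       "do": 40, "does": 5, "know": 35})

# Hypothetical form-to-lemma map; forms not listed are their own lemma.
lemma_of = {"is": "be", "'m": "be", "'re": "be", "does": "do"}

lemma_counts = Counter()
for form, n in form_counts.items():
    lemma_counts[lemma_of.get(form, form)] += n

total = sum(lemma_counts.values())
lemma_shares = {lemma: round(100 * n / total, 2)
                for lemma, n in lemma_counts.most_common()}

print("top form: ", form_counts.most_common(1)[0][0])
print("top lemma:", lemma_counts.most_common(1)[0][0])
print(lemma_shares)
```

In this toy data the most frequent single form is *do*, yet after merging, *be* leads, because its count was spread over several forms.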

After that, let's look at lemmatized nouns and verbs by authors and years.

In [11]:
total_pos_lemmas_list = []
total_pos_lemmas_list_titles = []
for writer in ('Lennon', 'McCartney', 'Harrison'):
    pos_lemmas_list, pos_lemmas_list_titles = get_post_words_list(
        df[df['writers']==writer]['cleaned_lyrics_lemmas'], 
        df[df['writers']==writer]['cleaned_lyrics_pos'], 
        pos_titles=['NOUN']
    )
    total_pos_lemmas_list.extend(pos_lemmas_list)
    total_pos_lemmas_list_titles.append(writer)
show_table(total_pos_lemmas_list, 10, total_pos_lemmas_list_titles)

| Lennon | % | McCartney | % | Harrison | % |
|---|---|---|---|---|---|
| love | 7.59 | love | 3.98 | love | 6.27 |
| what | 3.80 | girl | 3.06 | sun | 5.26 |
| girl | 3.21 | time | 2.68 | time | 5.01 |
| nothing | 2.55 | what | 2.29 | what | 5.01 |
| morning | 2.29 | night | 2.06 | day | 3.01 |
| world | 1.90 | way | 1.99 | one | 3.01 |
| baby | 1.77 | day | 1.99 | because | 3.01 |
| everything | 1.51 | life | 1.99 | girl | 2.76 |
| mind | 1.51 | mother | 1.83 | thing | 2.51 |


All you need is love, right? *Love* is the leader of course, but for McCartney *girl* (3.06%) is almost as important as *love* (3.98%). The same goes for Harrison, where the gap between *love* and *sun* is only about 1 percentage point, while Lennon valued *love* much more. And only Lennon's top three matches the overall top three.
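
The gap between the first and second noun can be computed directly from the figures in the table above. A minimal sketch, with the values copied by hand from the table rather than recomputed from the data:

```python
# Top two noun shares per writer (percent), copied from the table above.
top_two = {
    "Lennon":    [("love", 7.59), ("what", 3.80)],
    "McCartney": [("love", 3.98), ("girl", 3.06)],
    "Harrison":  [("love", 6.27), ("sun", 5.26)],
}

# Gap in percentage points between the leader and the runner-up.
gaps = {writer: round(rows[0][1] - rows[1][1], 2)
        for writer, rows in top_two.items()}

for writer, gap in gaps.items():
    print(f"{writer}: {gap:.2f}")
```

Lennon's lead for *love* is several times larger than McCartney's or Harrison's.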

In [12]:
total_pos_lemmas_list = []
total_pos_lemmas_list_titles = []
for writer in ('Lennon', 'McCartney', 'Harrison'):
    pos_lemmas_list, pos_lemmas_list_titles = get_post_words_list(
        df[df['writers']==writer]['cleaned_lyrics_lemmas'], 
        df[df['writers']==writer]['cleaned_lyrics_pos'], 
        pos_titles=['VERB']
    )
    total_pos_lemmas_list.extend(pos_lemmas_list)
    total_pos_lemmas_list_titles.append(writer)
show_table(total_pos_lemmas_list, 10, total_pos_lemmas_list_titles)

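| Lennon | % | McCartney | % | Harrison | % |
|---|---|---|---|---|---|
| be | 23.05 | be | 18.73 | be | 22.79 |
| do | 6.43 | do | 7.70 | do | 8.55 |
| know | 4.31 | will | 5.54 | know | 4.68 |
| go | 4.02 | go | 4.20 | will | 4.48 |
| get | 3.54 | know | 3.83 | have | 4.17 |
| can | 3.41 | get | 3.46 | come | 3.15 |
| say | 2.89 | say | 3.37 | go | 2.95 |
| come | 2.80 | have | 3.21 | get | 2.34 |
| have | 2.67 | let | 2.50 | want | 2.24 |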


All three share the top two verbs, *be* and *do*, with *be* the absolute leader. But then the differences begin. Both Lennon and McCartney value *go* (fourth place), while it is only in seventh place for Harrison. Lennon has *can* in sixth place, while the others don't have it in the top ten at all.

Let's examine lemmatized nouns year by year.

In [13]:
total_pos_lemmas_list = []
total_pos_lemmas_list_titles = []
for year in range(1963, 1971):
    pos_lemmas_list, pos_lemmas_list_titles = get_post_words_list(
        df[df['year']==year]['cleaned_lyrics_lemmas'], 
        df[df['year']==year]['cleaned_lyrics_pos'], 
        pos_titles=['NOUN']
    )
    total_pos_lemmas_list.extend(pos_lemmas_list)
    total_pos_lemmas_list_titles.append(str(year))
show_table(total_pos_lemmas_list, 10, total_pos_lemmas_list_titles)

| 1963 | % | 1964 | % | 1965 | % | 1966 | % | 1967 | % | 1968 | % | 1969 | % | 1970 | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| love | 7.05 | love | 8.94 | girl | 8.30 | day | 6.86 | love | 7.35 | what | 3.24 | sun | 4.39 | everybody | 5.38 |
| baby | 5.72 | baby | 4.70 | what | 5.77 | submarine | 5.82 | morning | 3.57 | life | 2.19 | love | 4.02 | feeling | 5.06 |
| girl | 4.26 | time | 4.36 | love | 3.81 | love | 4.16 | name | 2.83 | girl | 1.88 | what | 3.11 | way | 4.75 |
| what | 3.99 | what | 4.01 | time | 3.69 | paperback | 3.95 | nothing | 2.62 | night | 1.88 | way | 2.38 | world | 4.11 |
| boy | 3.46 | thing | 3.78 | baby | 3.69 | writer | 3.95 | mother | 2.52 | birthday | 1.67 | bom | 2.01 | nothing | 4.11 |
| man | 2.93 | day | 3.10 | word | 3.34 | sunshine | 3.74 | man | 2.52 | dream | 1.67 | darling | 2.01 | girl | 2.85 |
| minute | 2.53 | honey | 2.64 | way | 2.88 | what | 3.33 | time | 2.52 | road | 1.57 | child | 1.83 | word | 2.53 |
| heart | 2.53 | girl | 2.41 | night | 2.77 | people | 2.70 | goo | 2.31 | gun | 1.57 | time | 1.65 | time | 2.53 |
| time | 2.13 | way | 2.41 | day | 2.42 | time | 2.49 | sky | 2.10 | mind | 1.46 | something | 1.65 | everything | 2.22 |


It's interesting to note that the overall leader *love* didn't always lead. In fact, *love* is the leader in only three years, and in the other years the leaders vary: *girl*, *day*, *what*, *sun*, *everybody*. In 1968 and 1970 *love* isn't even in the top ten.
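
The "three years" count can be checked against the first row of the table above. A minimal sketch, with the yearly leaders copied by hand from that row:

```python
from collections import Counter

# Most frequent noun lemma per year, copied from the first table row above.
top_noun_by_year = {
    1963: "love", 1964: "love", 1965: "girl", 1966: "day",
    1967: "love", 1968: "what", 1969: "sun", 1970: "everybody",
}

# How many years each word spent in first place.
leader_years = Counter(top_noun_by_year.values())

# The specific years in which 'love' led.
love_years = [year for year, word in top_noun_by_year.items() if word == "love"]

print(leader_years.most_common())
print(love_years)
```

*Love* leads in 1963, 1964 and 1967; every other leader holds first place for a single year.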

Now it's time to look at verbs.

In [14]:
total_pos_lemmas_list = []
total_pos_lemmas_list_titles = []
for year in range(1963, 1971):
    pos_lemmas_list, pos_lemmas_list_titles = get_post_words_list(
        df[df['year']==year]['cleaned_lyrics_lemmas'], 
        df[df['year']==year]['cleaned_lyrics_pos'], 
        pos_titles=['VERB']
    )
    total_pos_lemmas_list.extend(pos_lemmas_list)
    total_pos_lemmas_list_titles.append(str(year))
show_table(total_pos_lemmas_list, 10, total_pos_lemmas_list_titles)

| 1963 | % | 1964 | % | 1965 | % | 1966 | % | 1967 | % | 1968 | % | 1969 | % | 1970 | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| be | 18.80 | be | 19.36 | be | 22.37 | be | 20.54 | be | 23.98 | be | 17.24 | be | 21.46 | be | 21.34 |
| will | 7.79 | do | 7.49 | do | 5.43 | do | 6.04 | know | 7.00 | do | 11.41 | do | 8.99 | get | 9.28 |
| do | 5.51 | will | 5.24 | will | 5.02 | know | 3.89 | do | 5.50 | know | 5.58 | come | 5.39 | let | 7.17 |
| get | 4.69 | have | 4.65 | have | 4.82 | will | 3.76 | need | 4.28 | come | 4.29 | go | 4.55 | have | 6.35 |
| love | 4.22 | get | 4.06 | go | 3.96 | can | 3.49 | get | 4.00 | go | 3.85 | know | 4.44 | go | 4.07 |
| know | 4.16 | love | 3.69 | can | 3.65 | have | 3.09 | say | 3.57 | make | 3.01 | get | 3.70 | say | 3.75 |
| want | 2.75 | can | 3.64 | see | 3.25 | say | 2.68 | go | 3.43 | will | 2.63 | want | 3.28 | can | 2.61 |
| can | 2.52 | go | 2.99 | get | 2.84 | get | 2.55 | can | 2.43 | have | 2.63 | can | 2.54 | dig | 2.28 |
| wanna | 2.46 | know | 2.78 | say | 2.69 | need | 2.42 | see | 2.36 | d'do | 2.50 | say | 2.54 | change | 1.95 |


We see that *be* is the absolute leader for all years. *Do* occupies the second place, but sometimes loses it to *will* (1963), *know* (1967), or *get* (1970).

Now let's look at the difference between covers and original songs. First, nouns.

In [15]:
df_cover = df[df.cover==True]
df_orig = df[df.cover==False]

total_pos_lemmas_list = []
total_pos_lemmas_list_titles = []

pos_lemmas_list1, pos_lemmas_list_titles = get_post_words_list(
    df_cover['cleaned_lyrics_lemmas'], 
    df_cover['cleaned_lyrics_pos'], 
    pos_titles=['NOUN']
)

pos_lemmas_list2, pos_lemmas_list_titles = get_post_words_list(
    df_orig['cleaned_lyrics_lemmas'], 
    df_orig['cleaned_lyrics_pos'], 
    pos_titles=['NOUN']
)

total_pos_lemmas_list_titles = ['Covers', 'Original']
total_pos_lemmas_list.extend(pos_lemmas_list1)
total_pos_lemmas_list.extend(pos_lemmas_list2)

show_table(total_pos_lemmas_list, 10, total_pos_lemmas_list_titles)

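| Covers | % | Original | % |
|---|---|---|---|
| baby | 9.21 | love | 5.44 |
| honey | 3.84 | what | 3.28 |
| what | 3.58 | girl | 2.84 |
| girl | 2.69 | time | 2.66 |
| way | 2.69 | day | 2.24 |
| love | 2.43 | baby | 1.75 |
| minute | 2.43 | man | 1.71 |
| rock | 2.30 | way | 1.57 |
| boy | 2.17 | thing | 1.47 |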


It's funny that the covers, which are mostly love songs, have *love* only in sixth place (2.43%)! The absolute leader is *baby* (9.21%). Original songs have the traditional leaders, *love* and *what*. But what about verbs?

In [16]:
total_pos_lemmas_list = []
total_pos_lemmas_list_titles = []

pos_lemmas_list1, pos_lemmas_list_titles = get_post_words_list(
    df_cover['cleaned_lyrics_lemmas'], 
    df_cover['cleaned_lyrics_pos'], 
    pos_titles=['VERB']
)

pos_lemmas_list2, pos_lemmas_list_titles = get_post_words_list(
    df_orig['cleaned_lyrics_lemmas'], 
    df_orig['cleaned_lyrics_pos'], 
    pos_titles=['VERB']
)

total_pos_lemmas_list_titles = ['Covers', 'Original']
total_pos_lemmas_list.extend(pos_lemmas_list1)
total_pos_lemmas_list.extend(pos_lemmas_list2)

show_table(total_pos_lemmas_list, 10, total_pos_lemmas_list_titles)

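| Covers | % | Original | % |
|---|---|---|---|
| be | 17.55 | be | 20.86 |
| get | 6.54 | do | 6.88 |
| do | 6.37 | know | 4.30 |
| will | 4.14 | will | 4.21 |
| want | 3.15 | have | 3.38 |
| go | 3.06 | get | 3.35 |
| have | 2.98 | go | 3.27 |
| say | 2.57 | can | 2.89 |
| come | 2.48 | say | 2.55 |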


While *be* is the undisputed leader, covers have a surprisingly high frequency for *get* (second place!) and a low one (not even in the top ten) for *know*.
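
The differences between the two verb lists can be made explicit with a small comparison. A minimal sketch, with both lists copied by hand from the table above in rank order:

```python
# Verb lemmas in rank order, copied from the table above.
covers = ["be", "get", "do", "will", "want", "go", "have", "say", "come"]
originals = ["be", "do", "know", "will", "have", "get", "go", "can", "say"]

# Verbs that appear in only one of the two lists (list order preserved).
only_covers = [v for v in covers if v not in originals]
only_originals = [v for v in originals if v not in covers]

# Positive value: the verb ranks higher in covers than in originals.
rank_shift = {v: originals.index(v) - covers.index(v)
              for v in covers if v in originals}

print("only in covers:   ", only_covers)
print("only in originals:", only_originals)
print("rank shift:       ", rank_shift)
```

*Get* shows the largest shift, four places higher in covers, while *know* and *can* appear only in the originals' list.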

<h2>CONCLUSION</h2>

If one splits word frequencies according to parts of speech, one can see the following. For nouns, the leader is *love* (though with only about 5%). For pronouns, *you* is the leader (about 30%), and *I* is very close. As for verbs, the leader is *do* (5.5%), but the top ten consists mostly of forms of the verb *to be*.

Among determiners, *the* is much more frequent than *a* (42% vs. 29%). From the top ten of proper nouns only two (*Bill* and *Jude*) are classified correctly. Among conjunctions, *and* (76%) is the absolute leader, while second place goes to *but* (~19%).

The most popular adjectives are *my* (~13%) and *your* (~10%).

In adverbs, the results are strange. As *n't* has been classified as an adverb, it has the first place with ~17%. Then we have *so* (~6%), *now* and *when* (both 5.7%).

As for adpositions, the leader is *in* (~15%). Among particles, *to* is the absolute leader with 49%! And in interjections *oh* (~19%) is more frequent than *yeah* (15.65%). *Nah* has the third place with 11.49%.

Lemmatization changes almost nothing for nouns. As for verbs, *be* is the leader now with about 20%. *Do* with its forms has less than 7%.

Looking at time, it's interesting to note that the overall leader *love* didn't always lead. In fact, *love* is the leader in only three years, and in the other years the leaders vary: *girl*, *day*, *what*, *sun*, *everybody*. As for verbs, *be* is the absolute leader for all years. *Do* usually holds second place, but sometimes loses it to other words.

It's interesting that among the covers' nouns *love* is only in sixth place! The absolute leader there is *baby*.

*Love* is the leader in nouns for all three original authors. As for verbs, while *be* is the undisputed leader, covers have a surprisingly high frequency for *get* (second place!) and a low one (not even in the top ten) for *know*.