# Day 10: Uncommon words in top 100 songs

Using `spaCy` and Beautiful Soup to look at uncommon words in chart songs. `spaCy` stores a prior probability for each entry in its English vocabulary, so we tokenize each song, look up the context-independent prior of each token in the vocab, and sort the tokens by that value. There are major problems with the tokenization, though: it struggles with abbreviations (to investigate). Also, some words are not in the vocab at all. They come back with a probability of 0.0 and get dumped, even though they might be informative (presumably because the priors are log probabilities, so 0.0 sorts after every known word). The results show the song title, artist, and the words in the song that occur least frequently in the English language (as modeled by the `spaCy` vocab).
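
To make the ranking step concrete without needing `spaCy`, here is a minimal Python 3 sketch on a toy table. It assumes the priors are log probabilities (negative numbers), which is how `spaCy` documents `prob` as far as I can tell. The table values, the `rarest_words` helper, and its `keep_unknown` flag are all made up for illustration and are not part of the notebook.

```python
# Toy log probabilities standing in for the spaCy vocab (values are invented).
TOY_LOG_PROBS = {
    "the": -3.5,
    "love": -8.9,
    "yeah": -9.7,
    "cheerleader": -14.2,
    "kryptonite": -15.1,
}


def rarest_words(tokens, log_probs, n=3, keep_unknown=False):
    """Return up to n distinct tokens, lowest score (rarest) first."""
    # One notch below the rarest known word, used only when keep_unknown is set.
    floor = min(log_probs.values()) - 1.0
    scored = {}
    for tok in tokens:
        if tok in log_probs:
            scored[tok] = log_probs[tok]
        else:
            # Unknown words score 0.0, mirroring the behaviour described above,
            # unless we choose to treat them as the rarest words of all.
            scored[tok] = floor if keep_unknown else 0.0
    return sorted(scored, key=scored.get)[:n]


tokens = ["the", "love", "kryptonite", "yeah", "skrrt", "cheerleader", "the"]
print(rarest_words(tokens, TOY_LOG_PROBS))
# ['kryptonite', 'cheerleader', 'yeah']
print(rarest_words(tokens, TOY_LOG_PROBS, keep_unknown=True))
# ['skrrt', 'kryptonite', 'cheerleader']
```

With the default settings the out-of-vocab word `skrrt` never makes the list, which is the "dumped" effect. Whether unknown words should instead rank as the rarest is a judgment call: it surfaces slang, but also typos and tokenizer fragments.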

In [37]:
from spacy.en import English
import requests
import bs4

In [3]:
nlp = English()

In [6]:
r = requests.get('http://lyrics.wikia.com/LyricWiki:Top_100')

In [8]:
# Name the parser explicitly so the result does not depend on what is installed.
soup = bs4.BeautifulSoup(r.content, 'html.parser')

In [11]:
ordered_list = soup.find('ol')

In [32]:
links = []

for item in ordered_list.find_all('li'):
    link = item.a['href']
    if 'edit' not in link:
        links.append(link)
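
The loop above assumes every list item has a link and that every `href` is relative to the site root. A slightly more defensive version of the same filter might look like the Python 3 sketch below. It works on plain strings so it runs without the network; the `song_urls` helper and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE = "http://lyrics.wikia.com"


def song_urls(hrefs, base=BASE):
    """Skip missing hrefs and edit links, and resolve the rest against base."""
    urls = []
    for href in hrefs:
        if not href or "edit" in href:
            continue
        # urljoin leaves absolute URLs alone and prefixes relative ones.
        urls.append(urljoin(base, href))
    return urls


# Made-up sample hrefs: relative, missing, an edit link, and an absolute URL.
sample = [
    "/wiki/Omi:Cheerleader",
    None,
    "/wiki/Some_Page?action=edit",
    "http://lyrics.wikia.com/wiki/Fetty_Wap:679",
]
print(song_urls(sample))
```

Note that the `'edit' in href` test is as blunt as the original: it would also drop a song whose URL happens to contain "edit".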

In [122]:
BASE = 'http://lyrics.wikia.com'

songs = []
master = []

for index, link in enumerate(links):
    r = requests.get(BASE + link)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')

    # The lyrics sit in <div class="lyricbox">. Keep only its bare text nodes;
    # the exact type check skips tags and HTML comments between the lines.
    lyrics = soup.find('div', attrs={'class': 'lyricbox'})
    lines = [t for t in lyrics.contents if type(t) == bs4.element.NavigableString]
    # Title text before the word "Lyrics", e.g. "Omi:Cheerleader".
    title = soup.title.text.split('Lyrics')[0].strip()
    song = "\n".join(lines)
    record = {
        'id': index,
        'title': title
    }
    master.append(record)
    songs.append(song)

In [123]:
parsed = [nlp(song) for song in songs]
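
The tokenizer splits contractions into pieces (`can't` becomes `ca` + `n't`, `won't` becomes `wo` + `n't`), and those pieces then show up as "rare words" in the results. One possible workaround, sketched here in Python 3 on plain strings rather than `spaCy` tokens, is to drop a small hand-picked set of fragments and fold case before ranking. The fragment list and the `clean_tokens` helper are assumptions for illustration, not something the notebook does.

```python
# Hand-picked contraction fragments (an assumption; extend as needed).
CONTRACTION_FRAGMENTS = {"ca", "wo", "ai", "gon", "na", "n't"}


def clean_tokens(tokens):
    """Lowercase tokens, drop fragments and non-alphabetic tokens, keep first-seen order."""
    seen = set()
    cleaned = []
    for tok in tokens:
        word = tok.strip().lower()
        # Punctuation such as "!" and pieces such as "n't" are not purely alphabetic.
        if not word.isalpha() or word in CONTRACTION_FRAGMENTS:
            continue
        if word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


print(clean_tokens(["ca", "n't", "Bitch", "bitch", "!", "numb", "wo", "seventeen"]))
```

Folding case merges pairs like `Bitch` / `bitch`, but it also changes which vocab entry gets looked up, so the scores may shift. A fixed fragment list is crude too, since it would discard a genuine use of a word like "na".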

In [124]:
for index, tokens in enumerate(parsed):
    # Pair each distinct token with its prior from the spaCy vocab.
    corpus_probs = set()
    for tok in tokens:
        corpus_probs.add((tok.string, nlp.vocab[tok.string].prob))

    # Lowest scores first: the ten least probable tokens in the song.
    top_ten = sorted(corpus_probs, key=lambda x: x[1])[:10]

    # next(...) works on both Python 2.6+ and Python 3, unlike .next().
    title = next(x for x in master if x['id'] == index)['title']

    print(title)
    print(' '.join([word[0] for word in top_ten]))
    print('\n')

Fetty Wap:My Way
ai morn shit wo seventeen ca Bitch baddest hoes bitch


Omi:Cheerleader
wand cheerleader tempting yeah cheating affection motivation queen selection mention


The Weeknd:Can't Feel My Face
ca numb yeah woo oh girl Do love young come


Rachel Platten:Fight Song
bones waves ocean Everybody sleep tonight motion song brain boat


Silento:Watch Me (Whip / Nae Nae)
duff superman ooh bop Okay whip legs leg ! me


5 Seconds Of Summer:She's Kinda Hot
bitchin meds slob cuatro na Uno tres insane dos esteem


The Weeknd:The Hills
gon ca decaf promo babe relapse info rehab tempo yeah


Selena Gomez:Good For You
n't mmm wo carats Midas carat huh Uh uh jealous


Andy Grammer:Honey, I'm Good.
ooo adieu grail ass ya oh honey hell lie everywhere


Major Lazer & DJ Snake:Lean On
Innocent kiss sidewalk blows recall warm remember gun road young


Fetty Wap:679
Dicey gon ai aye ay bimbos Diddy Loon Blowing rewind


Walk The Moon:Shut Up And Dance
kryptonite ooh sneaks Juliet destiny Oh danc