# Weird Titles of Songs

Music is a huge field, and it is filled with all sorts of weird song titles. Luckily, I have a web-scraped list of every song that has appeared on the _Billboard_ Hot 100, which I can run tests on to find the weirdest song titles out there.

## Test 1: Maximize Hapax Legomena

The first idea I had was to find songs whose titles contain many [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon), words that occur only once in the whole corpus. Luckily, `nltk` has a hapax legomenon feature built in (`FreqDist.hapaxes()`), which I can use directly.

In [1]:
import nltk, json

In [2]:
# songlist.txt holds a JSON array of song titles
with open("songlist.txt") as fi:
    songlist = json.load(fi)

For this, I'm going to compare all of the titles in lowercase so that differently capitalized forms of the same word match. The `nltk` tokenizer takes a string, so I need to join the list into a single string first.

In [3]:
# Lowercase every title and join them into one space-separated string
allTitles = " ".join(title.lower() for title in songlist)

In [4]:
%pprint
songlist[:20]

Pretty printing has been turned OFF


['Poor Little Fool', 'I Got A Feeling', 'Lonesome Town', 'Never Be Anyone Else But You', "It's Late", 'Just A Little Too Much', 'Sweeter Than You', 'I Wanna Be Loved', 'Mighty Good', 'Young Emotions', 'Right By My Side', "I'm Not Afraid", "Yes Sir, That's My Baby", 'You Are The Only One', 'Milk Cow Blues', "Travelin' Man", 'Hello Mary Lou', "If You Can't Rock Me", 'Old Enough To Love', 'Patricia']

In [5]:
allTitles[:20]

'poor little fool i g'

In [6]:
toks = nltk.word_tokenize(allTitles)
toks[:20]

['poor', 'little', 'fool', 'i', 'got', 'a', 'feeling', 'lonesome', 'town', 'never', 'be', 'anyone', 'else', 'but', 'you', 'it', "'s", 'late', 'just', 'a']

Now that the song titles are tokenized, I can build a frequency distribution, which makes it easy to find all the hapax legomena.

In [7]:
freqy = nltk.FreqDist(toks)

In [8]:
freqy.most_common(20)

[('the', 3777), ('you', 3352), ('i', 3114), ('love', 2237), ('(', 2047), (')', 2045), ('me', 1934), ('a', 1746), ('to', 1589), ('it', 1520), ("n't", 1449), ("'s", 1397), ('of', 1353), ('my', 1349), ('do', 1262), ('in', 1249), ("'", 1144), ('on', 934), (',', 895), ('your', 767)]

In [9]:
freqy.hapaxes()[:20]

['guaglione', 'splish', 'beachcomber', 'artificial', 'multiplication', 'hunk', 'to-night', 'rock-a-hula', 'sender', 'witchcraft', 'cousins', 'spinout', 'legged', "rebel-'rouser", 'ramrod', 'kind-a', 'shazam', 'pepe', 'yak', 'besame']
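
To make the idea concrete, here is a minimal sketch of what `hapaxes()` boils down to. It uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist`, and the token list is made up for illustration rather than taken from the real data.

```python
from collections import Counter

# Hypothetical toy tokens, not taken from the real song list
toy_tokens = ["poor", "little", "fool", "little", "town", "fool", "shazam"]

# Count how often each token occurs
toy_counts = Counter(toy_tokens)

# A hapax legomenon is a token that occurs exactly once in the corpus
toy_hapaxes = [tok for tok, n in toy_counts.items() if n == 1]

print(toy_hapaxes)
```

Here "little" and "fool" each appear twice, so only the remaining three tokens count as hapax legomena.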

As a test before I deal with all of the Big Data™, I will make up a song title and determine the number of hapax legomena in it.

In [10]:
testTitle = "splish of the pepe yak boy"
hapaxes = 0

In [11]:
testTok = nltk.word_tokenize(testTitle)

In [12]:
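# Compute the hapaxes once; calling freqy.hapaxes() inside the loop rebuilds the list on every pass
hapSet = set(freqy.hapaxes())
for i in testTok:
    if i in hapSet:
        hapaxes += 1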

In [13]:
hapaxes

3
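
The check above could be wrapped in a small reusable function. This is only a sketch: the name `count_hapaxes` is my own, it defaults to plain whitespace splitting so it runs without `nltk` data, and `sample_hapaxes` is a hand-picked subset of the hapaxes printed earlier.

```python
def count_hapaxes(title, hapax_set, tokenize=str.split):
    """Return how many tokens of `title` appear in `hapax_set`."""
    tokens = tokenize(title.lower())
    return sum(1 for tok in tokens if tok in hapax_set)

# Hand-picked subset of the hapaxes printed earlier (assumption for illustration)
sample_hapaxes = {"splish", "pepe", "yak", "shazam"}

result = count_hapaxes("splish of the pepe yak boy", sample_hapaxes)
print(result)  # 3
```

In the notebook itself, one would presumably pass `nltk.word_tokenize` as `tokenize` and `set(freqy.hapaxes())` as `hapax_set`.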

Now that I've tested that method, I think I'm ready to find the songs with the maximum number of hapax legomena.

In [14]:
maxNum = 0
maxList = []

In [15]:
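# Build the hapax set once; set membership is much faster than scanning a list for every token
hapList = set(freqy.hapaxes())
for song in songlist:
    hapaxes = 0
    thisTok = nltk.word_tokenize(song.lower())
    for token in thisTok:
        if token in hapList:
            hapaxes += 1
    if hapaxes > maxNum:
        # New record: start a fresh list of winners
        maxNum = hapaxes
        maxList = [song]
    elif hapaxes == maxNum:
        # Tie with the current record
        maxList.append(song)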

In [16]:
maxNum

6

In [17]:
maxList

['The Anaheim, Azusa & Cucamonga Sewing Circle, Book Review And Timing Associ', 'Itsy Bitsy Teenie Weenie Yellow Polkadot Bikini']

Two songs have 6 hapax legomena! They are Jan & Dean's 1964 single ["The Anaheim, Azusa & Cucamonga Sewing Circle, Book Review And Timing Association"](https://www.youtube.com/watch?v=5TAnOCAd_2I), and Brian Hyland's 1960 hit ["Itsy Bitsy Teenie Weenie Yellow Polkadot Bikini"](https://www.youtube.com/watch?v=n56E3kScoN8). With both songs being from the '60s, I guess they don't make 'em like they used to.

Something I hadn't noticed before is that long song titles seem to be cut off somewhere in the process (perhaps even by _Billboard_ itself). I may have to take this into account in later experiments.

After this, I was curious which songs had 5 hapax legomena, which had 4, and so on.

In [18]:
hapaxDict = {}

In [19]:
for song in songlist:
    hapaxes = 0
    thisTok = nltk.word_tokenize(song.lower())
    for token in thisTok:
        if token in hapList:
            hapaxes += 1
    if hapaxes not in hapaxDict:
        hapaxDict[hapaxes] = []
    hapaxDict[hapaxes].append(song)
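
The same grouping can be written with `collections.defaultdict`, which removes the "is the key there yet?" check. This is a sketch under assumptions: the function name, the toy titles, and the toy hapax set are all made up, and it splits on whitespace instead of calling `nltk.word_tokenize`.

```python
from collections import defaultdict

def group_by_hapax_count(titles, hapax_set):
    """Map each hapax count to the list of titles with that count."""
    groups = defaultdict(list)
    for title in titles:
        # Whitespace split is a simplification of nltk.word_tokenize
        count = sum(1 for tok in title.lower().split() if tok in hapax_set)
        groups[count].append(title)
    return groups

# Hypothetical toy data, not the real song list
toy_titles = ["Splish Splash", "Poor Little Fool", "Shazam", "Pepe Yak"]
toy_hapax_set = {"splish", "shazam", "pepe", "yak"}

groups = group_by_hapax_count(toy_titles, toy_hapax_set)
print(dict(groups))
```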

In [20]:
hapaxDict[5]

["Jeremiah Peabody's Poly Unsaturated Quick Dissolving Fast Acting Pleasant T"]

This is Ray Stevens' ["Jeremiah Peabody's Poly Unsaturated Quick Dissolving Fast Acting Pleasant Tasting Green & Purple Pills"](https://www.youtube.com/watch?v=2GB4Km706gM). This one was significantly cut off.
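
Both cut-off titles seen so far stop at exactly 75 characters, which hints at a fixed-width cap somewhere in the pipeline. That is an inference from two examples, not something I have confirmed. A rough way to flag likely truncated titles is to look for the ones that hit the maximum length; the helper name `longest_titles` and the three-title sample are only for illustration.

```python
def longest_titles(titles):
    """Return the maximum title length and the titles that reach it."""
    max_len = max(len(t) for t in titles)
    return max_len, [t for t in titles if len(t) == max_len]

# Small sample copied from the outputs in this notebook
sample = [
    "Poor Little Fool",
    "The Anaheim, Azusa & Cucamonga Sewing Circle, Book Review And Timing Associ",
    "Jeremiah Peabody's Poly Unsaturated Quick Dissolving Fast Acting Pleasant T",
]

max_len, suspects = longest_titles(sample)
print(max_len, len(suspects))
```

Run on the full `songlist`, a large cluster of titles at one maximum length would support the fixed-cap guess; a title that merely happens to be that long would be a false positive.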

In [21]:
hapaxDict[4]

['Bei Mir Bist Du Schön', 'Ungena Za Ulimwengu (Unite The World)', '(A Ship Will Come) Ein Schiff Wird Kommen', 'Pearly Shells (Popo O Ewa)', 'Thank You Falettinme Be Mice Elf Agin/Everybody Is A Star', 'Roland The Roadie And Gertrude The Groupie', 'Tarzan Boy (From "Teenage Mutant Ninja Turtles III")', 'Jeeps, Lex Coups, Bimaz & Benz']

Interestingly, a lot of songs with foreign-language titles fall into this category.

At this point I wasn't sure how big each list would be, so I checked with `len` before printing, out of an abundance of caution.

In [22]:
len(hapaxDict[3])

30

In [23]:
%pprint
hapaxDict[3]

Pretty printing has been turned ON


['She Say (Oom Dooby Doom)',
 'Baubles, Bangles And Beads',
 'The Hawaiian Wedding Song (Ke Kali Nei Au)',
 "Great Gosh A'mighty (Down & Out In Bev. Hills Theme)",
 'Ame Caline (Soul Coaxing)',
 'Hither And Thither And Yon',
 'To The Door Of The Sun (Alle Porte Del Sole)',
 "There's A Star Spangled Banner Waving #2 (The Ballad Of Francis Powers)",
 'Sie Liebt Dich (She Loves You)',
 'Riki Tiki Tavi',
 "Pandora's Golden Heebie Jeebies",
 'Names, Tags, Numbers & Labels',
 'D. W. Washburn',
 'Les Bicyclettes De Belsize',
 'Roosevelt And Ira Lee (Night of the Mossacin)',
 'Mozart Symphony No. 40 In G Minor K.550, 1st Movement',
 'Invasion Of The Flat Booty B*****s',
 'Shamrocks And Shenanigans (Boom Shalock Lock Boom)',
 'ESPN Presents The Jock Jam',
 'Leflaur Leflah Eshkushka',
 'Woo-Hah!! Got You All In Check/Everything Remains Raw',
 'Wu-Wear: The Garment Renaissance (From "High School High")',
 'Do You Know? (The Ping Pong Song)/Dimelo',
 'Purest Of Pain (A Puro Dolor)',
 'Ni Una Sola 

In [24]:
len(hapaxDict[0])

23984

This is a breakdown of how many songs fall into each category.

In [25]:
for i in range(7):
    # Song count first, then the number of hapax legomena
    print(f"{len(hapaxDict[i])} songs have {i} hapax legomena")

23984 songs have 0 hapax legomena
3879 songs have 1 hapax legomena
410 songs have 2 hapax legomena
30 songs have 3 hapax legomena
8 songs have 4 hapax legomena
1 songs have 5 hapax legomena
2 songs have 6 hapax legomena
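
The loop above hard-codes `range(7)` and would raise a `KeyError` if some count had no songs. A slightly more defensive sketch iterates over whatever keys exist and also fixes the "1 songs" wording. `toy_dict` is a made-up stand-in for `hapaxDict`.

```python
# Hypothetical stand-in for hapaxDict: hapax count -> list of titles
toy_dict = {0: ["A", "B", "C"], 2: ["D"], 1: ["E", "F"]}

lines = []
for count in sorted(toy_dict):
    n = len(toy_dict[count])
    # Pick singular or plural wording based on the number of songs
    noun = "song has" if n == 1 else "songs have"
    lines.append(f"{n} {noun} {count} hapax legomena")

print("\n".join(lines))
```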


All in all, I found some truly weird songs, but I feel there may be better ways to find weird song titles.