### References

#### General notes and inspiration
- [matching rhymes from hamilton](http://graphics.wsj.com/hamilton-methodology/)
    - [details](https://journalism.stanford.edu/cj2016/files/Writing%20an%20Algorithm%20To%20Analyze%20and%20Visualize%20Lyrics%20From%20the%20Musical%20Hamilton.pdf)
- [general notes on rhyming](http://mtosmt.org/issues/mto.17.23.4/mto.17.23.4.komaniecki.html)
  - Rhyming groups are set in similar metrical locations.
  - Rhyming groups are set to similar rhythmic figures.
  - Rhyming groups are emphasized or articulated in similar ways.
  
#### Finding Rhymes
- [applying BLOSUM to phoneme combinations](https://pdfs.semanticscholar.org/8b66/ea2b1fdc0d7df782545886930ddac0daa1de.pdf)
- [converting phonemes to syllables](http://www.anthology.aclweb.org/N/N09/N09-1035.pdf)
- [phonetic similarity metrics](https://homes.cs.washington.edu/~bhixon/papers/phonemic_similarity_metrics_Interspeech_2011.pdf)

#### Substitutions
- [consonant misidentification](http://www.ebire.org/hcnlab/papers/WoodsJASA2010.pdf)
- [ARPABET to IPA](https://www.wikiwand.com/en/ARPABET)
- [B-Rhymes](http://www.b-rhymes.com/faq/)

#### Syllables
- [syllabification code](https://raw.githubusercontent.com/vgautam/arpabet-syllabifier/master/syllabifyARPA.py)
- [sonority scale](https://www.wikiwand.com/en/Sonority_hierarchy)
    - [arpabet sonority mapping](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Arpabet.html)
    - [sonority information](http://people.cs.uchicago.edu/~niyogi/papersps/lsrpaper.pdf)
- [general ideas on syllables](http://sitr.us/2007/09/24/anatomy-of-a-syllable.html)
- [rules for syllabification](http://alias-i.com/lingpipe/demos/tutorial/hyphenation/read-me.html)
- [align graphemes to phonemes](http://www.aclweb.org/anthology/P10-1080)
    - [code](https://github.com/letter-to-phoneme/m2m-aligner)
- [syllabification of phonemes](http://www.anthology.aclweb.org/N/N09/N09-1035.pdf)

# Mapping Flow

In [1]:
!ls lyrics_consolidated/ | grep Eminem

Eminem-and-dj-buttafingaz.mpk
Eminem-ft-logic-joyner-lucas-nitin-randhawa-remix.mpk
Eminem.mpk
Eminem-x-proof.mpk


In [9]:
# !rm words-missing-from-cmu.json

In [1]:
import json
from os import path as osp
missing_words_file = 'words-missing-from-cmu.json'
if osp.isfile(missing_words_file):
    with open(missing_words_file, 'r') as wf:
        WORDS_MISSING_FROM_CMU = json.load(wf)
else:
    WORDS_MISSING_FROM_CMU = {}
WORDS_MISSING_FROM_CMU

{'ahhh': 'AE1 HH',
 'gnac': 'G AE1 N AH0 K',
 'titties': 'T IH1 T IY0 Z',
 'Pharoahe': 'F AA0 R OW1',
 'Monch': 'M AA1 N CH',
 "style's": 'S T AY1 L Z',
 'Girlies': 'G IH1 R L IY0 Z',
 'BMs': 'B AE1 M Z',
 '12pm': 'T W EH1 L V EH2 P AH0 M',
 'eses': 'EH1 S AH0 Z',
 'Rollies': 'R AA1 L IY0 Z',
 'Jeru': 'JH EH1 R UW0',
 'shrooms': 'SH R UW1 M Z',
 'Coka': 'K OW1 K AH0',
 '___': '',
 '____': '',
 'Teleportation': 'T EH2 L AH0 P ER0 T EY1 SH AH0 N',
 'realer': 'R IY1 L ER0',
 'Illuminati': 'IH0 L UW2 M AH0 N EY1 T IY0',
 'guillotines': 'G IH1 L AH0 T AY2 N Z',
 'decapitating': 'D IH0 K AE1 P AH0 T EY2 T IH0 NG',
 'can’t': 'K AE1 N T',
 'I’m': 'IH1 M',
 'Mongolians': 'M AH0 NG G OW1 L IY0 AH0 N Z',
 'it’s': 'IH1 T S',
 'humanity’s': 'HH Y UW0 M AE1 N AH0 T IY0 Z',
 'don’t': 'D AA1 N T',
 'should’ve': 'SH AW1 L D Z',
 'meds': 'M EH1 D Z',
 'Lucifer’s': 'L UW1 S AH0 F ER0 Z',
 'that’s': 'TH AE1 T S',
 'Ain’t': 'EY1 N T',
 'Millionare': 'M IH1 L IY0 AH0 N EH2 R',
 'vansens': 'V AE1 N S AH0 N Z

In [2]:
import msgpack
from random import choice
from pprint import pprint as pp

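with open('lyrics_consolidated/Pharoahe-monch.mpk', 'rb') as lyric_file:
    # raw=False decodes msgpack strings to str; it replaces the
    # `encoding` argument, which msgpack 1.0 removed
    corp = msgpack.unpack(lyric_file, raw=False)
    lyric = corp['Pharoahe-monch-simon-says-lyrics']['lyrics']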

In [3]:
import re

def process_line(line):
    words = []
    # remove adlibs (anything in parentheses)
    line = re.sub(r'\(.+?\)', '', line)
    # split words delimited by either spaces or commas
    for word in re.split('[ ,]', line):
        # strip out lots of characters we don't want for our
        # word analysis, but keep apostrophes inside words
        stripped = re.sub(r"(^'|'$|[;\?\!\n \t\"\:]|\.+|\…)+", '', word)
        # convert a hyphenated word into multiple words
        words += re.split(r"[-–—]", stripped)
    no_blanks = list(filter(None, words))
    return no_blanks

def clean_headers(lyrics):
    processed = []
    # tracks whether we are inside a [bracketed] header block
    bracket_open = False
    for line in lyrics:
        # filter out song block headers that can span multiple lines
        opened = re.search(r'^\[', line)
        closed = re.search(r'\]$', line)
        if opened:
            bracket_open = True
        if bracket_open:
            if closed:
                bracket_open = False
            continue
        line = process_line(line)
        if line:
            processed.append(line)
    return processed

In [27]:
import g2p_en as g2p
import pronouncing

def get_phones(text, stress=True):
    phoned = []
    with g2p.Session():
        for line in text:
            phoned_line = []
            for word in line:
                phones = pronouncing.phones_for_word(word)        
                # lots of words on genius are spelled the way they
                # are pronounced ("sippin'") instead of with the
                # formal 'ing' ending found in the cmu data, so we
                # look up the 'ing' form and adjust its phonemes
                if not phones and re.search(r"in'?$", word):
                    # anchor to the end so only the final "in" is replaced
                    subbed = re.sub(r"in'?$", 'ing', word)
                    with_g = pronouncing.phones_for_word(subbed)
                    if with_g:
                        # convert the final 'NG' phoneme to 'N'
                        without_g = re.sub(r'(?<=IH\d) (NG)$', ' N', with_g[0])
                        phones = without_g
                # some words start with an apostrophe and might not
                # be listed in the cmu data as such
                elif not phones and re.search("^['`‘]", word):
                    without_apo = pronouncing.phones_for_word(word[1:])
                    if without_apo:
                        phones = without_apo[0]
                # fallback to use slower g2p
                elif not phones:
                    phones_cached = WORDS_MISSING_FROM_CMU.get(word)
                    if not phones_cached:
                        phones = ' '.join(g2p.g2p(word))
                        WORDS_MISSING_FROM_CMU[word] = phones
                    else:
                        phones = phones_cached
                # we don't need nested lists
                else:
                    phones = phones[0]
                # the numbers after a phoneme are useful for determining
                # stresses and syllables within words, but aren't that
                # useful for comparing sounds themselves (rhymes)
                if not stress and phones:
                    phones = re.sub(r'\d+', '', phones)
                if phones:
                    phoned_line.append(phones)
            phoned.append(phoned_line)
    with open(missing_words_file, 'w') as wf:
        json.dump(WORDS_MISSING_FROM_CMU, wf)
    return phoned
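
The dropped-g fallback used in `get_phones` can be tried in isolation. The following is a minimal sketch, assuming a tiny hand-written dictionary (`TOY_DICT`, hypothetical) in place of the CMU lookup done by `pronouncing`; the helper name `phones_for_dropped_g` is also hypothetical and not part of the notebook.

```python
import re

# Hypothetical stand-in for the CMU dictionary lookup.
TOY_DICT = {"sipping": "S IH1 P IH0 NG"}

def phones_for_dropped_g(word, lookup):
    """Return phones for a word spelled with a dropped g, or None."""
    # Only words ending in "in" or "in'" are candidates.
    if not re.search(r"in'?$", word):
        return None
    # Restore the formal spelling, touching only the final "in".
    restored = re.sub(r"in'?$", "ing", word)
    phones = lookup.get(restored)
    if phones is None:
        return None
    # Swap the final NG for N to match the informal pronunciation.
    return re.sub(r"(?<=IH\d) NG$", " N", phones)

print(phones_for_dropped_g("sippin'", TOY_DICT))
```

Anchoring both patterns with `$` keeps an earlier "in" inside the word (as in "singin'") from being rewritten as well.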

In [28]:
# tag_phones below reads the vowel set from arp_to_ipa_vowels
arp_to_ipa_vowels = {
    'AA': 'ɑ', 'AE': 'æ', 'AH': 'ʌ', 'AO': 'ɔ', 'AW': 'aʊ',
    'AX': 'ə', 'AY': 'aɪ', 'EH': 'ɛ', 'ER': 'ɝ', 'EY': 'eɪ',
    'IH': 'ɪ', 'IX': 'ɨ', 'IY': 'i', 'UX': 'ʉ',
    'OW': 'oʊ','OY': 'ɔɪ','UH': 'ʊ','UW': 'u',}
arp_to_ipa_consonants = {
    'CH': 'tʃ','D': 'd','DH': 'ð','DX': 'ɾ','EL': 'l̩',
    'EM': 'm̩','EN': 'n̩','F': 'f','G': 'ɡ','HH': 'h',
    'JH': 'dʒ','K': 'k','L': 'l','M': 'm','N': 'n',
    'NX': 'ŋ','P': 'p','Q': 'ʔ','R': 'ɹ','S': 's',
    'SH': 'ʃ','T': 't','TH': 'θ','V': 'v','W': 'w',
    'WH': 'ʍ','Y': 'j','Z': 'z','ZH': 'ʒ', 'B': 'b',
}
arp_to_ipa = {**arp_to_ipa_vowels, **arp_to_ipa_consonants}
sonority_scale = {
# vowels
'IY': 4, 'IH': 4, 'EH': 4, 'EY': 4, 'AE': 4, 'AA': 4, 'AW': 4, 
'AY': 4, 'AH': 4, 'AO': 4, 'OY': 4, 'OW': 4, 'UH': 4, 'UW': 4, 
'UX': 4, 'ER': 4, 'AX': 4, 'IX': 4, 'AXR': 4, 'AX-H': 4,
# glides
'Y': 3, 'W': 3, 'Q': 3, 
# liquids
'L': 2, 'EL': 2, 'R': 2, 'DX': 2, 'NX': 2,
# nasals
'M': 1, 'EM': 1, 'N': 1, 'EN': 1, 'NG': 1, 'ENG': 1,
## obstruents
# stops/plosives
'P': 0, 'B': 0, 'T': 0, 'D': 0, 'K': 0, 'G': 0,
# affricates
'CH': 0, 'JH': 0,
# fricatives
'F': 0,'V': 0,'TH': 0,'DH': 0,'S': 0,'Z': 0,'SH': 0,'ZH': 0,'HH': 0
}

def tag_phones(word, phones):
    remove_digits = lambda x: re.sub(r'\d+', '', x)
    if isinstance(phones, str):
        phones = remove_digits(phones).split(' ')
    else:
        phones = [remove_digits(p) for p in phones]
    syls = []
    tagged_phones = []
    vowels = set(arp_to_ipa_vowels.keys())
    hit_first_vowel = False
    onset_buffer = []
    skip = 0
    for i, p in enumerate(phones):
        if skip:
            skip -= 1
            continue
        # consonants before the first vowel form the word's initial onset
        if p not in vowels and not hit_first_vowel:
            tagged_phones.append((p, 'onset'))
            continue
        hit_first_vowel = True
        if tagged_phones and p == tagged_phones[-1][0]:
            continue
        tagged_phones.append((p,'nucleus'))
        remaining = phones[i+1:]
        # if the rest of the word has vowels
        remaining_vowels = set(remaining) & vowels
        if not remaining_vowels:
            tagged_phones += [(p, 'coda') for p in remaining]
            break
        # if there are vowels left
        if not onset_buffer:
            if not remaining:
                continue
            for x in remaining:
                if x in vowels:
                    break
                else:
                    onset_buffer.append(x)
        while not legal(onset_buffer):
            coda = onset_buffer.pop(0)
            tagged_phones.append((coda, 'coda'))
            skip += 1
        skip += len(onset_buffer)
        tagged_phones += [(p, 'onset') for p in onset_buffer]
        onset_buffer = []
    return tagged_phones

def legal(onset):
    if len(onset) == 1:
        return True
    # obstruents and unknown phones default to the lowest sonority
    scaled = lambda x: sonority_scale.get(onset[x], 0)
    for i, o in enumerate(onset):
        if i < len(onset) - 1:
            if abs(scaled(i) - scaled(i+1)) < 2:
                return False
    return True

def syls(tagged_phones):
    syls = []
    syl = []
    for i, (phone, kind) in enumerate(tagged_phones):
        syl.append(phone)
        if i < len(tagged_phones) - 1:
            next_kind = tagged_phones[i+1][1]
            nuclei_break = [kind, next_kind] == ['nucleus'] * 2
            onset_break = (next_kind == 'onset' and kind != 'onset')
            if nuclei_break or onset_break:
                syls.append(syl)
                syl = []
        else:
            syls.append(syl)
    return syls

In [29]:
from collections import defaultdict
import pronouncing

def match(graphemes, phonemes):
    found = []
    for i, line in enumerate(graphemes):
        ## Build rolling window of 2-3 lines
        # for the first line we compare it to the next line
        if i == 0:
            i_left, i_right = 0, 2
        # for the last line we compare it to the previous
        elif i == len(graphemes) - 1:
            # slice must include the last line itself
            i_left, i_right = i-1, i+1
        # for all other lines we compare to the previous
        # and the next line
        else:
            i_left, i_right = i-1, i+2
        ## Process phone groups before matching
        filtered_lines = []
        bank = defaultdict(int)
        for phone_group in phonemes[i_left:i_right]:
            # grab rhyming section of word (basic match)
            rp = pronouncing.rhyming_part
            rhyming_phones = [rp(word) for word in phone_group]
            # remove middle consonants and stress digits, which
            # don't have much to do with loose rhyming
            nix_const = lambda x: re.sub(r'( \w |\d|[A-Z]{1,2} )+', ' ', x).strip().replace('  ', ' ')
            filtered = list(map(nix_const, rhyming_phones))
            filtered_lines.append(filtered)
            # add filtered rhyming parts to common bank 
            # for current rolling window state
            for f in filtered:
                bank[f] += 1 
        # match
        found_rhymes = []
        for phone_group in filtered_lines:
            # positive match if rhyming phonemes appear more than
            # once in the bank and are not a single vowel
            is_rhyme = lambda x: bank[x] > 1 and ' ' in x
            # create bitmap for words to determine if they rhyme
            found_rhymes.append([is_rhyme(w) for w in phone_group])
        # the current line sits first in the window only on the
        # first line; otherwise it follows the previous line
        if i == 0:
            found.append(found_rhymes[0])
        else:
            found.append(found_rhymes[1])
    finished = [list(zip(found[i], graphemes[i])) for i in range(len(found))]
    return finished
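
The rolling-window idea behind `match` can be shown without the CMU lookup. This is a minimal sketch, assuming each word has already been reduced to a rhyme key (the space-separated vowel skeleton that `nix_const` produces); the keys below are hand-written for illustration and `flag_rhymes` is a hypothetical helper, not the notebook's function.

```python
from collections import Counter

def flag_rhymes(keyed_lines):
    """Flag words whose rhyme key repeats within the neighboring lines."""
    flags = []
    for i, line in enumerate(keyed_lines):
        # Window covers the previous, current, and next line where they exist.
        window = keyed_lines[max(i - 1, 0):i + 2]
        bank = Counter(key for w in window for key in w)
        # A key counts as a rhyme if it repeats and spans more than one phone.
        flags.append([bank[key] > 1 and " " in key for key in line])
    return flags

# Hand-written rhyme keys, one inner list per lyric line.
keyed = [["AH", "IH IY"], ["IH IY", "EY"], ["OW", "EY"]]
print(flag_rhymes(keyed))
```

Clamping the left edge with `max(i - 1, 0)` handles the first and last lines without separate branches, since a slice that runs past the end is simply truncated.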

In [31]:
cleaned = clean_headers(lyric)
processed = get_phones(cleaned)
matched = match(cleaned, processed)
matched

INFO:tensorflow:Restoring parameters from /home/hank/anaconda3/lib/python3.6/site-packages/g2p_en/logdir/model_epoch_14_gs_27956


[[(False, 'Uh'),
  (False, 'uh'),
  (False, 'uh'),
  (False, 'uh'),
  (False, 'uh'),
  (False, 'uh')],
 [(False, 'Uh'), (False, 'uh'), (False, 'uh'), (False, 'uh'), (False, 'uh')],
 [(False, 'Uh'),
  (False, 'uh'),
  (False, 'uh'),
  (False, 'uh'),
  (False, 'uh'),
  (False, 'uh')],
 [(False, 'Uh'), (False, 'uh'), (False, 'uh'), (False, 'uh'), (False, 'ahhh')],
 [(True, 'Get'), (False, 'the'), (True, 'fuck'), (True, 'up')],
 [(False, 'Simon'),
  (False, 'says'),
  (True, 'Get'),
  (False, 'the'),
  (True, 'fuck'),
  (True, 'up')],
 [(False, 'Throw'),
  (False, 'your'),
  (False, 'hands'),
  (False, 'in'),
  (False, 'the')],
 [(False, 'Queens'),
  (False, 'is'),
  (False, 'in'),
  (False, 'the'),
  (False, 'back'),
  (False, 'sipping'),
  (False, 'gnac'),
  (False, "y'all"),
  (False, "what's")],
 [(False, 'Girls'),
  (True, 'rub'),
  (True, 'on'),
  (True, 'your'),
  (True, 'titties')],
 [(False, 'Yeah'),
  (False, 'I'),
  (False, 'said'),
  (False, 'it'),
  (True, 'rub'),
  (True, 'on

In [17]:
from IPython.display import display, HTML
buff = ''
dont_match = set([
    "the","be","to","of","and","a","in","that","have",
    "I","it","for","not","on","with","he","as","you",
    "do","at","this","but","his","by","from","they",
    "we","say","her","she","or","an","will","my","one",
    "all","world","there","their","what","so","who",
    "if","them","yeah"
])
for line in matched:
    # one block element per lyric line ('</br>' is not valid HTML)
    buff += '<div>'
    for i, (rhymes, word) in enumerate(line):
        if rhymes and (word not in dont_match or i < len(line) - 1):
            buff += '<b>{} </b>'.format(word)
        else:
            buff += word + ' '
    buff += '</div>'
display(HTML(buff))

In [133]:
shitty = get_phones([["Notorious", 'witty', 'pity']])[0][0]
shitty
# re.sub('( \w |\d|[A-Z]{1,2} )+', ' ', shitty).strip().replace('  ', ' ')

INFO:tensorflow:Restoring parameters from /home/hank/anaconda3/lib/python3.6/site-packages/g2p_en/logdir/model_epoch_14_gs_27956


'N OW0 T AO1 R IY0 AH0 S'

In [22]:
# until current phoneme is a vowel
#     label current phoneme as an onset
# end loop
# until all phonemes have been labeled
#     label current phoneme as a nucleus
#     if there are no more vowels in the word
#         label all remaining consonants as codas
#     else
#         onset := all consonants before next vowel
#         coda := empty
#         until onset is legal
#             coda := coda plus first phoneme of onset
#             onset := onset less first phoneme
#         end loop
#     end if
# end loop
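
The pseudocode above can be sketched as a short standalone function. This is a minimal sketch under stated assumptions: the vowel set and sonority values are a trimmed copy of the tables used earlier, onset legality follows the same rule as `legal` (adjacent phones must differ by at least two sonority levels), and the names `tag_syllable_parts` and `is_legal_onset` are hypothetical.

```python
import re

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
# Consonants missing from this table are treated as obstruents (0).
SONORITY = {"Y": 3, "W": 3, "L": 2, "R": 2, "M": 1, "N": 1, "NG": 1}

def is_legal_onset(cluster):
    """Adjacent phones must differ by at least two sonority levels."""
    return all(abs(SONORITY.get(a, 0) - SONORITY.get(b, 0)) >= 2
               for a, b in zip(cluster, cluster[1:]))

def tag_syllable_parts(phones):
    """Label each phone as onset, nucleus, or coda."""
    phones = [re.sub(r"\d", "", p) for p in phones]  # drop stress digits
    tagged, i, n = [], 0, len(phones)
    # Everything before the first vowel is an onset.
    while i < n and phones[i] not in VOWELS:
        tagged.append((phones[i], "onset"))
        i += 1
    while i < n:
        tagged.append((phones[i], "nucleus"))
        i += 1
        # Collect the consonants up to the next vowel.
        j = i
        while j < n and phones[j] not in VOWELS:
            j += 1
        cluster = phones[i:j]
        if j == n:
            # No more vowels: remaining consonants are codas.
            tagged.extend((p, "coda") for p in cluster)
            break
        # Move phones from the front of the onset into the coda
        # until what remains is a legal onset.
        split = 0
        while not is_legal_onset(cluster[split:]):
            split += 1
        tagged.extend((p, "coda") for p in cluster[:split])
        tagged.extend((p, "onset") for p in cluster[split:])
        i = j
    return tagged

print(tag_syllable_parts("HH AA1 S P IH2 T AH0 L".split()))
```

A single consonant or an empty cluster is always a legal onset here, so the inner loop is guaranteed to stop.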

In [231]:
word = 'hospital'
phones = get_phones([[word]])[0][0]
tagged_phones = tag_phones(word, phones)
tagged_phones

INFO:tensorflow:Restoring parameters from /home/hank/anaconda3/lib/python3.6/site-packages/g2p_en/logdir/model_epoch_14_gs_27956


[('HH', 'onset'),
 ('AA', 'nucleus'),
 ('S', 'coda'),
 ('P', 'onset'),
 ('IH', 'nucleus'),
 ('T', 'onset'),
 ('AH', 'nucleus'),
 ('L', 'coda')]

In [232]:
syls(tagged_phones)

[['HH', 'AA', 'S'], ['P', 'IH'], ['T', 'AH', 'L']]

In [233]:
from syllabifyARPA import syllabifyARPA

In [234]:
syllabifyARPA(phones)

0     HH AA1
1    S P IH2
2    T AH0 L
dtype: object

In [221]:
get_phones([[word]])[0][0]

INFO:tensorflow:Restoring parameters from /home/hank/anaconda3/lib/python3.6/site-packages/g2p_en/logdir/model_epoch_14_gs_27956


'IH2 NG K AA1 N S P IH0 K W AH0 S'

In [229]:
' '.join(['IH0', 'NG', 'K', 'AA1', 'N', 'S', 'P', 'IH0', 'K', 'Y', 'UW', 'AH0', 'S'])

'IH0 NG K AA1 N S P IH0 K Y UW AH0 S'

In [None]:
acuity