## Rhyme Markup

eSpeak prepends two spaces to its output, and I'm not sure what they are. One might be the byte order mark (BOM), which then gets cleared by the decode statement. eSpeak also introduces line breaks to mark pauses in speech. We only need line endings for end-rhyme analysis, so I removed all the other breaks.
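
The cleanup described above could look something like the following. This is a minimal sketch rather than the exact code used in this notebook: the function name `clean_espeak_output` and the sample string are made up for illustration, it assumes the captured output is UTF-8, and it decodes with `utf-8-sig` on the assumption that a BOM may be present. It joins internal breaks with a space, whereas the cell further down simply deletes `\r\n`.

```python
import re

def clean_espeak_output(raw: bytes) -> str:
    """Decode captured eSpeak output and collapse its internal line breaks."""
    # utf-8-sig drops a leading byte order mark if one is present (assumption)
    text = raw.decode("utf-8-sig")
    # strip() removes the leading spaces and the trailing newline
    text = text.strip()
    # breaks that eSpeak inserted for pauses become single spaces
    return re.sub(r"\s*[\r\n]+\s*", " ", text)

# Hand-written sample, not captured eSpeak output
sample = "  aɪ sˈɔː\r\n ɐ ʃˈædoʊ\n".encode("utf-8")
print(clean_espeak_output(sample))
```

Because each line of the poem is sent to eSpeak separately, the poem's own line endings are never in danger here; only the breaks eSpeak adds inside a line are removed.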

I thought that another way to go with this whole deal would be to capture the output at a more detailed level. For MBROLA voices, eSpeak outputs detailed speech info for every segment, including intonation contours, but NOT syllable position. So I found a module that syllabifies English pretty well. Any internal rhyming that does not occur in a final syllable will have to be found by pure matching.

Finding rhymes is basically a matching exercise. For a first pass we can get the last X characters in each line into a list, then find any matches going forward. Since the poems in this book all utilize rhyme patterns by verse, I focused on getting rhymes from individual verses.
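
As a rough illustration of that first pass, here is a sketch that works on plain spelling rather than phonetics. The function name `naive_end_matches` and the parameter `x` are hypothetical, chosen only for this example; the stanza is the first one from "THE SKY".

```python
def naive_end_matches(lines, x=3):
    """Pair up lines whose last x letters are identical."""
    # Keep only letters so spaces and trailing punctuation do not block a match
    endings = ["".join(ch for ch in line.lower() if ch.isalpha())[-x:]
               for line in lines]
    matches = []
    for i in range(len(endings) - 1):
        for j in range(i + 1, len(endings)):  # only look forward from line i
            if endings[i] and endings[i] == endings[j]:
                matches.append((i, j, endings[i]))
    return matches

stanza = [
    "I saw a shadow on the ground",
    "And heard a bluejay going by;",
    "A shadow went across the ground,",
    "And I looked up and saw the sky.",
]
print(naive_end_matches(stanza))
print(naive_end_matches(stanza, x=1))
```

With `x=3` only the ground/ground pair is found, and "by"/"sky" is missed; with `x=1` both pairs appear, but so would many false matches in a longer poem. That weakness of spelling-based matching is the reason for comparing phonetic transcriptions instead.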

Note: I've noticed eSpeak doesn't use the voiceless w (turned w, ʍ, as in "which" or "whale"), and I'm sure EMR with a Chicago MA would have used that, so we might want to tweak the voice a bit.

Note: eSpeak outputs double IPA characters for some signs, specifically affricates and diphthongs. It provides three workarounds:

- --ipa=1 Uses ties (U+0361) for phoneme names of more than one letter.
- --ipa=2 Uses Zero Width Joiner (U+200D) for phoneme names of more than one letter.
- --ipa=3 Separates phoneme names with underscore characters.

Any of them would work for us; we just need to be consistent. It might be easiest to split words on underscores to compare sounds, and simply delete the underscores to print nice phonetics. I went with ipa=3.
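
The two uses of the underscore can be sketched as follows. The helper names `split_phones` and `pretty` are hypothetical, and the sample string is hand-written in the underscore style rather than captured from eSpeak, so the exact placement of stress marks in real output is an assumption.

```python
def split_phones(word):
    """Split one underscore-separated word into its phoneme names."""
    return word.split("_")

def pretty(line):
    """Drop the underscores so the transcription prints as ordinary IPA."""
    return line.replace("_", "")

# Hand-written sample in the underscore style, not captured eSpeak output
sample = "ð_ə s_k_ˈaɪ"
last_word = sample.split(" ")[-1]

print(split_phones(last_word))  # the diphthong stays together as one item
print(pretty(sample))
```

Splitting on underscores keeps a diphthong such as aɪ as a single unit for comparison, which plain character-by-character matching would break in two.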

So, as an arbitrary decision, let's see if we can make a single copy of each poem, marked up in very simple TEI, that includes the full orthographic text, the full IPA transcription, and a labeled rhyme scheme. If we have that, we can easily generate static pages with different info, or come up with PHP or JavaScript methods of interactive display. To start this process I'm going back to the little program I made to separate the poems, and using it to insert some XML codes. Then we can use the XML to help process the rhyme schemes.
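
To make the target structure concrete, here is a sketch of how a poem could be wrapped in that kind of simple markup. It is not the poem-separating program mentioned above: the function name `poem_to_tei` is hypothetical, it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and it leaves out the TEI namespace for brevity. The nesting (`lg` for the poem, `lg` for each stanza, `l` for each line, with line numbers running through the whole poem) follows the output shown further down.

```python
import xml.etree.ElementTree as ET

def poem_to_tei(title, stanzas):
    """Wrap a poem in nested <lg> elements with one <l> per line."""
    poem = ET.Element("lg", {"n": "1", "type": "poem"})
    ET.SubElement(poem, "head").text = title
    line_no = 0  # line numbers run through the whole poem, not per stanza
    for stanza_no, stanza in enumerate(stanzas, start=1):
        lg = ET.SubElement(poem, "lg", {"n": str(stanza_no), "type": "stanza"})
        for text in stanza:
            line_no += 1
            ET.SubElement(lg, "l", {"n": str(line_no)}).text = text
    return poem

poem = poem_to_tei("THE SKY", [
    ["I saw a shadow on the ground",
     "And heard a bluejay going by;",
     "A shadow went across the ground,",
     "And I looked up and saw the sky."],
    ["It hung up on the poplar tree,"],
])
print(ET.tostring(poem, encoding="unicode"))
```

Attributes such as `phon` and `rhyme` can then be attached to each `l` element once the transcription and rhyme analysis have been run.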


In [1]:
import re, subprocess, string
import syllabify_ipa as sipa
from bs4 import BeautifulSoup

infile = 'texts/emr/UnderTree/THE SKY.txt' #testing one poem for now
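#outfile = r'C:\Users\clair\Dropbox\photrans\practice1.txt'  # raw string, since '\U' would be read as an escape; the write at the end of this cell needs outfile defined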



def rhyme(wordA, wordB) :
    """ Checks 2 phonetic strings to detect their rhyme."""
    
    VOWELS = sipa.VOWELS
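    # Note: the 'g' below appears to be ASCII g (U+0067). eSpeak's IPA output uses
    # 'ɡ' (U+0261), which is not in this list, so it gets reported as undocumented.
    CONSONANTS = ['p', 'b', 't','d', 'k', 'g',
    'tʃ', 'dʒ',
    'f', 'v', 'θ', 'ð', 's', 'z', 'ʃ', 'ʒ', 'h',
    'm', 'n', 'ŋ',
    'l', 'ɹ', 'j', 'w']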
    
    # The consonant and vowel lists are used for membership tests only,
    # so converting them to sets speeds things up.
    # Note: We might want to know the stress of rimes in the future, but for now
    # I'm going to make it work by eliminating the stress for comparison's sake.
    
    v_set = set(VOWELS)
    c_set = set(CONSONANTS)
    
    # Remove stress marks from strings to be compared
    
    nostressA = [i.replace('ˈ', '').replace('ˌ','') for i in wordA]
    nostressB = [i.replace('ˈ', '').replace('ˌ','') for i in wordB] 
    
    # use the shorter word as the basis for comparison
        
    if len(nostressA) <= len(nostressB) :  
        basis = nostressA
        focus = nostressB
    else : 
        basis = nostressB
        focus = nostressA
        
    segment = []   # holds matching segments
    rphones = []   # holds the Rhyming Phones
    
    # Work through the shorter word backwards, phoneme by phoneme
    
    rfocus = list(reversed(focus)) # reverse the focus word list once, outside the loop
    for num, letter in enumerate(reversed(basis)) : 
        #print(f'Comparing /{letter}/ to /{rfocus[num]}/')
        
        if letter != rfocus[num] :
            break    # When sounds don't match, we stop comparing
        
        # From here on the base letter matches the focus letter
        # print (letter, num)
        rphones.append(letter) # add it to the list of matching letters
        if letter in v_set : 
            segment.append('V')
        elif letter in c_set :
            segment.append('C') # and add C/V to the list of matching segments
        else :
            print (f'Undocumented letter /{letter}/.')
    
    # re-reverse the segment and rhyming phones for human readability
    # and make them into strings
    
    segment.reverse()   
    rphones.reverse()
    
    rphones = "".join(rphones)
    rhymetype = "".join(segment) 
    #print (rhymetype)
    
    # Group rhyme types together
    if rhymetype in ['', 'C', 'CC'] :
        group = 'none'
    elif rhymetype in ['VC', 'VCC', 'VCCC', 'CVC', 'CVCC', 'CCVC', 'CCVCC'] :
        group = 'strong'
    elif rhymetype in ['V', 'CV', 'CCV', 'CVCV'] :
        group = 'weak'
    else :
        group = 'unknown'
    
    # return rhyme info -- could return any info on rhyme we can make
    
    return [rphones, rhymetype, group]

# Routine starts here by making a list of letters for assigning to rhymes.

ab_string = string.ascii_uppercase  # Create a string of all uppercase letters
ab_list = list(ab_string)           # Convert it to a list of all uppercase letters

with open (infile, "r", encoding='utf-8-sig') as f :   # infile defined at the top of this script
    soup = BeautifulSoup(f, 'xml')                     # parsing as lxml loses the <head> tag
    stanzas = soup.find_all(attrs={"type" : "stanza"}) # Get all the tags with type=stanza
    
    for stanza in stanzas :
        lastsyllables = []            # list of last syllables in each line of the stanza
        lines = stanza.find_all('l')  # get all the lines in this stanza
        
        for line in lines :
            target = line.text # get the text value using BS and run it through eSpeak
            cp = subprocess.run(['espeak', '-v', 'en-us', '-xq', '--ipa=3', target], 
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            phones = cp.stdout.decode("utf-8").strip() # get eSpeak results
            phones = re.sub("\r\n", "", phones) # remove any newlines in Windows
            words = phones.split(' ') # split the line into words on spaces
            
            # Syllabify the last word split into phones
            sylword = sipa.syllabify(words[-1].split('_'))
            
            # syllabify returns a list of lists [[onset],[nucleus],[coda]]. Codas, but not onsets,
            # must fully match in order to rhyme. For now I get the last syllable and push its
            # sounds back together.
            lastsyllable = ' '.join(' '.join(''.join(p) for p in syl) for syl in sylword[-1])
            lastsyllable = lastsyllable.strip()   # remove leading and trailing spaces inserted by syllabify
            lastsyllables.append(lastsyllable.split(' ')) # put the last syllable on the list of last syllables

            
            phones = re.sub("_", "", phones) # remove underscores to make pretty print
            line.attrs['phon'] = phones # assign a new attribute to the <l> element for the transcription
            
        # Each stanza will have its own set of rhymes and rhyme data. This decision can be changed
        # by removing the stanza loop and doing the whole poem at once. Right now the same rhyme in
        # different stanzas will get different letters. 

        rime_dict = {}
        skip = [] # list of rhyme tests to skip
        size = len(lastsyllables) # should match num of lines in the verse
        for i in range(size-1) : 
            for j in range(i+1, size) : 
                if j in skip : # skip any rhymes already found
                    continue
                [rime, cv, grp] = rhyme(lastsyllables[i], lastsyllables[j])
                if grp != 'none' :
                    skip.append(j) # when a rhyme is found, put it on the skip list
                    if rime not in rime_dict :
                        rime_dict[rime] = [cv,grp,i,j]
                    else:
                        rime_dict[rime].extend([i,j]) # this way all the rhymed lines are on one list
        #print(rime_dict)
        #print("Next stanza")
        
        # We have located all the rhymes. Now time to assign them a letter and put
        # them into the TEI markup
        
        completed_lines = []  # to hold list of lines that are marked up so we don't repeat them.
        for l in range(size) :  # loop through the indexes of line numbers
            if l in completed_lines :  # skip any lines that have been assigned letters already
                continue
                    
            # If a rhyme has been discovered, the line number will be in the list of values
            # associated with the rime. Get that key and use it to get the values again.
            
            rhymed_keys = {key for key, value in rime_dict.items() if l in value}
            rkeylist = list(rhymed_keys)
            if rkeylist != [] :  # if there are some rhymes...
                cv, grp, *found = rime_dict[rkeylist[0]] # unpack the CV pattern and group; the rest are line indexes
                found = list(set(found)) # converting to a set() eliminates copies
                next_let = ab_list.pop(0) # assign rime the next alphabet letter
                for fi in found : 
                    lines[fi].attrs['type'] = grp
                    lines[fi].attrs['rhyme'] = next_let
                    lines[fi].attrs['rime'] = rkeylist[0]
                    lines[fi].attrs['vc_structure'] = cv
                    completed_lines.append(fi)
                continue
            else :
                next_let = 'X'
                lines[l].attrs['rhyme'] = next_let             

    print(soup.prettify())

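# Append the marked-up poem to outfile. outfile must be defined near the top of
# this cell; while it is commented out there, this block raises a NameError.
with open(outfile, 'a', encoding='utf-8') as p :
    print(soup.prettify(), file=p)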


        

Undocumented letter /ɡ/.
<?xml version="1.0" encoding="utf-8"?>
<lg n="1" type="poem" xmlns="http://www.tei-c.org/ns/1.0">
 <head>
  THE SKY
 </head>
 <lg n="1" type="stanza">
  <l n="1" phon="aɪ sˈɔː ɐ ʃˈædoʊ ɑːnðə ɡɹˈaʊnd" rhyme="A" rime="ɡɹaʊnd" type="strong" vc_structure="CVCC">
   I saw a shadow on the ground
  </l>
  <l n="2" phon="ænd hˈɜːd ɐ blˈuːdʒeɪ ɡˌoʊɪŋ bˈaɪ" rhyme="B" rime="aɪ" type="weak" vc_structure="V">
   And heard a bluejay going by;
  </l>
  <l n="3" phon="ɐ ʃˈædoʊ wɛnt əkɹˌɑːs ðə ɡɹˈaʊnd" rhyme="A" rime="ɡɹaʊnd" type="strong" vc_structure="CVCC">
   A shadow went across the ground,
  </l>
  <l n="4" phon="ænd aɪ lˈʊkt ˌʌp ænd sˈɔː ðə skˈaɪ" rhyme="B" rime="aɪ" type="weak" vc_structure="V">
   And I looked up and saw the sky.
  </l>
 </lg>
 <lg n="2" type="stanza">
  <l n="5" phon="ɪt hˈʌŋ ˌʌp ɑːnðə pˈɑːplɚ tɹˈiː" rhyme="X">
   It hung up on the poplar tree,
  </l>
  <l n="6" phon="bˌʌt wˌaɪl aɪ lˈʊkt ɪt dɪdnˌɑːt stˈeɪ" rhyme="C" rime="eɪ" type="weak" vc_structure=

NameError: name 'outfile' is not defined