`fixUnder` is a script that takes the poems from EMR's *Under the Tree* and puts each in its own file in the folder `texts/UnderTree`. I first used Notepad++ to insert a marker (#) on a line of its own in front of each poem title in the source file. I also had to delete the empty lines and the title at the top of the source file, and fix a small number of poems whose subtitle ("A Song") now sits on the same line as the title. Finally, I added a marker (\*\*\*) to mark the end of the file.
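
Because the markers are inserted by hand, it may help to check the source file before running the script. The following is only a minimal sketch: `check_source` is a hypothetical helper, not part of `fixUnder`, and it assumes each # sits alone on its own line, that the first non-blank line is a #, and that \*\*\* is the last non-blank line.

```python
def check_source(lines):
    """Return a list of problems found in the hand-marked source lines."""
    problems = []
    stripped = [line.strip() for line in lines]
    content = [s for s in stripped if s]            # ignore blank lines
    if not content or content[0] != "#":
        problems.append("file should start with a # marker")
    if not content or content[-1] != "***":
        problems.append("file should end with the *** marker")
    for i, s in enumerate(stripped):
        if s == "#":
            # The title is expected on the line right after the marker.
            following = stripped[i + 1] if i + 1 < len(stripped) else ""
            if following in ("", "#"):
                problems.append(f"line {i + 1}: # is not followed by a title")
    return problems


# Placeholder text, not taken from the book.
good = ["#\n", "Placeholder Title\n", "\n", "\n", "first line\n", "\n", "\n", "\n", "***\n"]
bad = ["Placeholder Title\n", "#\n", "\n", "first line\n"]
print(check_source(good))
print(check_source(bad))
```

An empty list means the file matches the assumed layout; each string in a non-empty list names one thing to fix by hand.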

Markup is entered based on the book's consistent internal structure. The inserted # closes the previous poem and opens a new one, whose title is on the next line. The script can then insert a header line with the poem number, <lg> tags with stanza numbers, and <l> tags with line numbers. (In TEI, <lg> stands for 'line group', a deliberately vague term that allows for local definitions.) The markup is TEI conformant, following the basic pattern at https://teibyexample.org/tutorials/TBED04v00.htm. I am looking into how to do the markup in JSON in case that proves more versatile, or necessary for later processing.
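
As a rough idea of what a JSON version might look like, here is a minimal sketch that mirrors the same nesting (poem, stanza, line) with plain dictionaries and lists. The function `poem_to_dict` and the key names (`head`, `stanzas`, `lines`, `text`) are my own assumptions for illustration, not an established schema, and the poem text is a placeholder.

```python
import json


def poem_to_dict(number, title, stanzas):
    """Mirror the TEI nesting (poem > stanza > line) as plain dicts and lists."""
    poem = {"type": "poem", "n": number, "head": title, "stanzas": []}
    line_no = 1                      # lines are numbered through the whole poem
    for stanza_no, stanza in enumerate(stanzas, start=1):
        lines = []
        for text in stanza:
            lines.append({"n": line_no, "text": text})
            line_no += 1
        poem["stanzas"].append({"type": "stanza", "n": stanza_no, "lines": lines})
    return poem


example = poem_to_dict(1, "Placeholder Title",
                       [["first line", "second line"], ["third line"]])
print(json.dumps(example, indent=2, ensure_ascii=False))
```

As in the TEI version, the line numbers run continuously through the poem rather than restarting in each stanza.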



In [3]:
file = "texts/under.txt"   # the hand-marked source file

hashcount = 0    # set to 1 when a # marker has just been seen, so the next line is the title
titlecount = 1   # poem number, in book order
linecount = 1    # line number within the current poem
poemlist = None  # lines of the poem currently being built (None until the first title)
booklist = []    # finished poems, each one a list of lines

with open(file, "r", encoding='utf-8-sig') as f:  # utf-8-sig strips the BOM at the start of the file
    for line in f:
        stripped = line.strip()
        if stripped == '***':                 # end-of-file marker: stop reading
            break
        if '#' in stripped:                   # Hash mark means the next line is the title.
            if poemlist:                      # If a poem is open, close it
                booklist.append(poemlist)     # and put the closed poem on the book list.
                poemlist = None
            hashcount = 1                     # remember that a hash just happened
        elif hashcount == 1:                  # no hash on this line AND a hash just happened,
            title = stripped                  # so the line is the title
            # Note: putting f before a string permits variables in curly braces in the string
            poemlist = [f'<lg xmlns="http://www.tei-c.org/ns/1.0" type="poem" n="{titlecount}">',
                        title]
            hashcount = 0
            titlecount += 1
            linecount = 1
        elif poemlist is not None:            # any other line belongs to the open poem
            if stripped == "":                # If the line is empty,
                poemlist.append(stripped)     # keep the empty line (stanza breaks are found later)
            else:
                poemlist.append(f'<l n="{linecount}">{stripped}</l>')
                linecount += 1

# Close the last poem if the file ended (or reached ***) while it was still open.
if poemlist:
    booklist.append(poemlist)

# Once all the poems are done with the preliminaries, we go back through them to add more info.
# Strictly speaking a second pass isn't necessary; it is a result of the development order. Some
# consolidation may be possible so that everything is done on each poem at once.

for poem in booklist:
    poemfile_name = f'texts/UnderTree/{poem[1]}.txt'  # the title is the second item on the poem list
    poem[1] = f'<head>{poem[1]}</head>'               # alter the title line by adding tags

    # Next we enumerate and XML-mark the stanzas of each poem.
    # Stanzas have a blank line between them, so find the blank lines followed by text lines.
    # blanks is the list of indexes of those blank lines.
    blanks = [index for index in range(len(poem) - 1) if poem[index] == "" and poem[index + 1] != ""]

    # On each such blank line, insert the stanza tag with its number.
    # Every stanza tag after the first needs an end tag for the previous stanza.
    for n, blank in enumerate(blanks):
        if n == 0:                                             # the first such blank (normally index 3) ...
            poem[blank] = f'<lg type="stanza" n="{n+1}">'      # opens the first stanza.
        else:                                                  # otherwise ...
            poem[blank] = f'</lg><lg type="stanza" n="{n+1}">' # it is a later one, so it needs a closing tag, too.

    # Put the end tags on the blank lines at the end.
    # This assumes the poem's list ends with blank lines at these two positions.
    poem[-3] = '</lg>'  # this end tag is for the final stanza
    poem[-2] = '</lg>'  # this end tag is for the poem
    
    # Write the TEI version to its own file.

    with open(poemfile_name, 'w', encoding='utf-8') as p:  # explicit encoding, so non-ASCII text is written the same on every platform
        for line in poem:
            print(line, file=p)
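
One way to check the result is to parse a finished poem with Python's standard `xml.etree.ElementTree` module, which raises `ParseError` on text that is not well-formed XML. The sketch below is only an illustration: `is_well_formed` is a hypothetical helper and `sample` is placeholder text laid out like the script's output. It also shows that a verse line containing `&` or `<` would need escaping, for example with `xml.sax.saxutils.escape`, which the script above does not do.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

TEI_NS = "http://www.tei-c.org/ns/1.0"


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Placeholder poem in the same layout the script writes to each file.
sample = "\n".join([
    f'<lg xmlns="{TEI_NS}" type="poem" n="1">',
    '<head>Placeholder Title</head>',
    '',
    '<lg type="stanza" n="1">',
    '<l n="1">first line</l>',
    '<l n="2">' + escape('salt & pepper') + '</l>',   # & becomes &amp;
    '</lg>',
    '</lg>',
    '',
])

root = ET.fromstring(sample)
verse_lines = list(root.iter(f"{{{TEI_NS}}}l"))       # every <l> in the poem
print(is_well_formed(sample), len(verse_lines))
```

To check a real file, read its contents with `open(..., encoding='utf-8').read()` and pass the string to `is_well_formed`. This tests only XML well-formedness, not validity against the TEI schema.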

