# Breaking full .xml files encoded by SHC with BeautifulSoup
This code opens the full .xml files of plays encoded by Shakespeare His Contemporaries (SHC) using MorphAdorner and breaks each file into five .txt files, one per act, containing only the play's content as lemmas (i.e., no speaker names). It skips plays that do not have five acts and those whose encoding does not match the structure SHC generally uses.

In [31]:
import os
from bs4 import BeautifulSoup

In [37]:
# set variable 'path' for easier navigation using os.
path = 'Full_SHC_Plays'

# Function to break acts

In [77]:
def breakByAct(tei_doc):
    
    # get title in string format to work with later
    title = str(tei_doc)
    print('Currently Breaking: ' + title)

    # open doc as bytes (so the parser can honor the file's own encoding declaration) and parse it using BeautifulSoup
    with open(f'{path}/{tei_doc}', 'rb') as tei:
        soup = BeautifulSoup(tei, features='xml')

    # variable 'num' declared here to use to label each act upon writing to file
    num = 1
    
    # some files have the 'act' type listed once per act - some have it listed per scene. This causes some problems.
    if soup.find('div', {'type': 'act'}):
        
        # if there are a total of 5 divs whose type = 'act' then proceed. 
        if len(soup.find_all('div', {'type': 'act'})) == 5:
            for result in soup.find_all('div', {'type': 'act'}):
                lineArray = []
                
                # the reason for the two for loops is that some files tag with <l> and some with <p>.
                # Using for line in result.find_all('l') or result.find_all('p') results in some lines being missed.
                # If the code finds <l>, it navigates by these and ignores the <p> so I have it iterate over twice
                # to avoid losing any content.
                # Only including lemmas in the output file.
                for line in result.find_all('l'):
                    for word in line.find_all('w'):
                        if word.has_attr('lemma'):
                            playText = word['lemma']
                            lineArray.append(playText)
                for line in result.find_all('p'):
                    for word in line.find_all('w'):
                        if word.has_attr('lemma'):
                            playText = word['lemma']
                            lineArray.append(playText)
                with open('Broken_SHC_Plays/%s_ACT_%s.txt' % (title, num), 'w+') as playAct:
                    brokenPlay = ' '.join(lineArray)
                    playAct.write(brokenPlay)
                num += 1

        # if there are more or fewer than 5 divs whose type = 'act' this is printed.
        # chose to print an explanatory sentence to assist in tracking which files may need more attention.
        else:
            print('File ' + title + ' does not have 5 acts or is not encoded typically.')

    # if the play does not have <div> with the specified attribute and value, this is printed.
    # as with the comment above, this assists in tracking which files may need more attention.
    else:
        print('File ' + title + ' does not have divs with type = act.')
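
The extraction step inside `breakByAct` can be tried on a small fragment before running it over the whole corpus. The sketch below is an illustration only: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages, the `SAMPLE` fragment is invented rather than taken from an SHC file, and `lemmas_in_act` is a hypothetical helper name. Unlike the two separate loops above, it visits `<l>` and `<p>` in document order.

```python
import xml.etree.ElementTree as ET

# Invented toy fragment (an assumption, not real SHC data).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="act" n="1">
  <sp><speaker><w lemma="Hamlet">Ham.</w></speaker>
    <l><w lemma="to">To</w> <w lemma="be">be</w><pc>,</pc></l>
    <p><w lemma="or">or</w> <w>not</w></p>
  </sp>
</div>
</body></text></TEI>"""

# ElementTree prefixes every tag with its namespace in braces.
TEI = "{http://www.tei-c.org/ns/1.0}"


def lemmas_in_act(act):
    """Collect lemma values from <w> inside <l> or <p>, in document order.

    Hypothetical helper: <w> elements outside <l>/<p> (such as speaker
    names) are never visited, and <w> without a lemma attribute is skipped.
    """
    lemmas = []
    for elem in act.iter():
        if elem.tag in (TEI + "l", TEI + "p"):
            for word in elem.iter(TEI + "w"):
                lemma = word.get("lemma")
                if lemma is not None:
                    lemmas.append(lemma)
    return lemmas


root = ET.fromstring(SAMPLE)
acts = [d for d in root.iter(TEI + "div") if d.get("type") == "act"]
sample_lemmas = lemmas_in_act(acts[0])
print(" ".join(sample_lemmas))
```

As in the original loops, a `<w>` that sits in an `<l>` nested inside a `<p>` would be collected twice, so this is only a starting point for files with that structure.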



The cell below takes a while to run.

In [None]:
for fullFile in os.listdir(path):
    if fullFile.endswith('.xml'):
        breakByAct(fullFile)


With the above code I can filter out any plays that do not have five acts, or those which are structured unusually for SHC. For the latter, it would likely be more efficient to break the plays into their five per-act .txt files by hand than to code around their unusual structure.
This leaves 258 of 365 plays usable without further input.
In a longer project, the remaining plays would be examined further and either coded for separately or handled manually.
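
The usable and skipped counts could also be tallied in code rather than read off the printed messages. The following is a minimal sketch under stated assumptions: `classify` and its three labels are hypothetical names, the three toy documents are invented, and the standard library's `xml.etree.ElementTree` stands in for BeautifulSoup so the example runs on its own.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def classify(xml_text):
    """Return 'usable', 'wrong_act_count', or 'no_act_divs' for one TEI string.

    Hypothetical helper mirroring the two checks in breakByAct.
    """
    root = ET.fromstring(xml_text)
    # Strip any '{namespace}' prefix so namespaced and plain tags both match.
    n_acts = sum(
        1
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "div" and el.get("type") == "act"
    )
    if n_acts == 5:
        return "usable"
    return "wrong_act_count" if n_acts else "no_act_divs"


# Invented toy documents: five acts, two acts, and no act divs at all.
five_acts = "<body>" + '<div type="act"/>' * 5 + "</body>"
two_acts = "<body>" + '<div type="act"/>' * 2 + "</body>"
no_acts = '<body><div type="scene"/></body>'

tally = Counter(classify(doc) for doc in (five_acts, two_acts, no_acts))
print(tally)
```

Applied to the real files, each file's text would be passed to `classify` in place of the toy strings, and the lists behind the two non-usable labels would show which plays need separate handling.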