# Bonus: Parsing XML that's not MARCXML #
BeautifulSoup is a tool for working with HTML and XML more generally, as we can see by poking around at the text of Thomson's *Sophonisba*, encoded as TEI-XML as part of the ECCO-TCP project. (We'll do more in-depth text analysis using R rather than Python, but I provide these examples for those who anticipate using Python to work with XML sources.)

Again, you'll need to know about the structure of your document before you go to work on it this way. If you're working with cleanly-structured data, parsing the XML should be relatively straightforward. Textual sources (like the TEI-encoded texts from the Text Creation Partnership), however, aren't always "cleanly-structured data." (They're trying to represent early printed texts faithfully, and that's sometimes messy.) In the examples below, I've had to code around some quirks in the ECCO-TCP text of Thomson's *Sophonisba*.

In the first example, let's get a clearly-identified piece of the text: Thomson's dedication. (Uncomment and comment the various print commands to see what changes.)

In [None]:
from bs4 import BeautifulSoup
with open('/media/sf_RBSDigitalApproaches/data/0611_Tuesday_data/K132743.000.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    
    # Divs in TEI texts can have a @type attribute. In this case, we're looking for a div with @type of 'dedication'
    dedication = soup('div', type='dedication')
    print(dedication)
    
    # Remember that BeautifulSoup returns results as lists, so we'll need to add a list index
    dedication_element = soup('div', type='dedication')[0]
    #print(dedication_element)
    
    # In our MARCXML examples, we would just have added .string to get the textual content of the element we were
    # looking at. Note that, in this case, that doesn't work out so well...
    dedication_string = soup('div', type='dedication')[0].string
    #print(dedication_string)
    
    # The problem is, as we saw with dedication_element, that the dedication div has multiple child elements--not
    # just text content. BeautifulSoup's get_text() method can come in handy if we just want to get all the text
    # content (including text from child elements)
    dedication_text = soup('div', type='dedication')[0].get_text()
    #print(dedication_text)
    
    # ECCO-TCP includes the long-s where it appears. If we want, we could replace all instances of long-s with a
    # modernized short-s using the built-in replace string method
    dedication_modernized = dedication_text.replace(u'\u017f','s')
    #print(dedication_modernized)
    
    # Bonus round: The direction of the dedication ("TO THE QUEEN") is contained in its own element, called "head,"
    # which is a child of the dedication div. See if you can figure out how to assign just the text of that child
    # element to the dedication_direction variable and print it.
    #dedication_direction = 
    #print(dedication_direction)
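
To see why `.string` comes back empty while `get_text()` does not, here is a minimal sketch on an invented two-paragraph snippet rather than the ECCO-TCP file. The snippet text and variable names are made up for illustration. It uses Python's built-in `html.parser` so that it runs without lxml installed, whereas the cell above uses the `'xml'` parser.

```python
from bs4 import BeautifulSoup

# Invented stand-in for a TEI dedication div with two child elements
snippet = ('<text><div type="dedication">'
           '<p>To the Queen.</p>'
           '<p>Madam, accept this \u017fmall offering.</p>'
           '</div></text>')

soup = BeautifulSoup(snippet, 'html.parser')
dedication_element = soup('div', type='dedication')[0]

# .string is None here because the div has more than one child
string_result = dedication_element.string

# get_text() gathers the text of all descendants; the argument is a separator
text_result = dedication_element.get_text(' ')

# Replace the long-s (U+017F) with a modern s
modernized = text_result.replace('\u017f', 's')

print(string_result)
print(modernized)
```

Passing a separator to `get_text()` is optional. Without it, the text of adjacent child elements is run together with nothing in between.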

### Counting lines ###
In this example, we'll end up with information about the length of characters' speeches in the play, finding out how many typographical lines each character's speeches occupy--not "lines" of dialogue, but rather how much of the text is represented by each character's speeches. (As it turns out, the titular heroine doesn't end up with the most textual real estate in her own play.)

In [None]:
from bs4 import BeautifulSoup
import re

# Define a regular expression pattern we'll use later to check for names of speakers, which are always
# given in upper-case and have at least two characters.
name = re.compile('[A-Z]{2,}')

# Create an empty dictionary to hold key,value pairs, where the keys will be characters' names and the
# values will be the total number of lines of text in their speeches
characters = {}

with open('/media/sf_RBSDigitalApproaches/data/0611_Tuesday_data/K132743.000.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    
    # Find all the sp elements (i.e., speeches)
    speeches = soup.find_all('sp')
    
    for speech in speeches :
        # This is a tortuous workaround for getting the name of the character who's speaking that's
        # necessitated by the fact that the speaker elements in this document don't always contain just 
        # the speaker's name. (They usually do. Except when they don't. Like when they contain a stage
        # direction, too.) 
        #
        # Rather than simply getting the contents of the speaker element: speech('speaker')[0].string
        # we end up having to:
        #      1) Get *all* of the text of the speaker element: speech('speaker')[0].get_text()
        #      2) Find all instances of the regular expression pattern "name" in that text: re.findall(name,...)
        #      3) Use yet another list index to get the actual result from the list of matches that
        #         results from that re.findall()
        speaker = re.findall(name,speech('speaker')[0].get_text())[0]
        
        # Use len() to count how many line elements ("l") are contained within the speech we're looking at
        speech_lines = len(speech('l'))
        
        # Use setdefault to add the character name we've found to our characters dictionary if it's not already
        # there, setting the value to 0
        characters.setdefault(speaker,0)
        
        # Increase the value of the line count for this speaker in our characters dictionary by adding the 
        # number of lines in this speech. Note that we get a spurious character named "MASINSSSA." That's not
        # an error in the TCP file, but rather an accurate transcription of a compositor's error that occurs
        # once in Thomson's text.
        characters[speaker] += speech_lines
    print(characters)
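
The name-matching and tallying steps can be tried without the XML file. The sketch below is a simplified stand-in, not the cell above: the speaker strings and line counts are invented, and it uses `name.search()` with a guard instead of `re.findall(...)[0]`, so a speaker element with no upper-case name is skipped rather than raising an `IndexError`.

```python
import re

# Same pattern as above: two or more consecutive upper-case letters
name = re.compile('[A-Z]{2,}')

# Invented (speaker text, line count) pairs standing in for parsed sp elements
speeches = [
    ('SOPHONISBA.', 12),
    ('MASINISSA. (kneeling)', 7),   # speaker text that also holds a stage direction
    ('SOPHONISBA.', 3),
    ('(a pause)', 2),               # no upper-case name at all
]

characters = {}
for speaker_text, line_count in speeches:
    match = name.search(speaker_text)
    if match is None:
        # Nothing that looks like a name; skip this speech
        continue
    speaker = match.group()
    # Start the speaker at 0 if this is the first time we've seen them
    characters.setdefault(speaker, 0)
    characters[speaker] += line_count

print(characters)
```

Whether to skip unmatched speeches silently or report them is a judgment call. For a messy source it may be worth printing the skipped text so that quirks like "MASINSSSA" are noticed rather than lost.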

We'd probably want to *do* something with this information. We might export it to a file or, if we had the right libraries installed, we could generate a chart. For now, let's just print the information in our dictionary out in a more legible form.

In [None]:
from bs4 import BeautifulSoup
import re
name = re.compile('[A-Z]{2,}')

characters = {}
with open('/media/sf_RBSDigitalApproaches/data/0611_Tuesday_data/K132743.000.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    speeches = soup.find_all('sp')
    for speech in speeches :
        # Get information about speech
        speaker = re.findall(name,speech('speaker')[0].get_text())[0]
        speech_lines = len(speech('l'))
        # Update dictionary with information about number of lines in characters' speeches
        characters.setdefault(speaker,0)
        characters[speaker] += speech_lines

# Dictionaries in Python keep their items in the order they were added (as of Python 3.7) rather than in any
# sorted order, but we can sort them upon display. The sorted() command turns the items in our dictionary
# into a list of tuples (tuples are very much like lists, but their values are immutable).
character_lines_sorted = sorted(characters.items())
print(character_lines_sorted)

# The command above sorts our dictionary by the keys, which, in this case, gives us our characters in alphabetical
# order (including the spurious "MASINSSSA"). We can sort on the values, instead, to order our entries by
# the number of lines
character_lines_numerical_sorted = sorted(characters.items(), key=lambda x:x[1])
#print(character_lines_numerical_sorted)

# We can reverse the sort order if we want to begin with the character with the largest number of lines:
character_lines_numerical_reverse_sorted = sorted(characters.items(), key=lambda x:x[1], reverse=True)   
#print(character_lines_numerical_reverse_sorted)

# Because we're dealing with a list of tuples, we can iterate through the list:
#for entry in character_lines_numerical_reverse_sorted :
#    print(entry)
    
# And, given the list-like nature of tuples, we can get each part of each entry separately:
#for entry in character_lines_numerical_reverse_sorted :
    #print(entry[0] + ': ' + str(entry[1]))
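
As a sketch of the "export it to a file" idea, the example below sorts an invented dictionary by line count and writes the result as CSV. The counts are made up, not taken from the play, and the CSV goes into an in-memory `io.StringIO` buffer so that nothing is written to disk. To write a real file, one would open it with `open(path, 'w', newline='')` and pass that to `csv.writer` instead.

```python
import csv
import io

# Invented counts for illustration only
characters = {'SOPHONISBA': 15, 'MASINISSA': 22, 'SYPHAX': 9}

# Sort by the second item of each (name, count) tuple, largest first
ranked = sorted(characters.items(), key=lambda item: item[1], reverse=True)

# Unpack each tuple directly in the for statement
for speaker, line_count in ranked:
    print('{}: {}'.format(speaker, line_count))

# Write a header row followed by one row per character
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['character', 'lines'])
writer.writerows(ranked)
csv_text = buffer.getvalue()
print(csv_text)
```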

### Finding co-occurrence of characters ###
This example reuses techniques from the last one to figure out which characters appear in each scene, which lets us see which characters appear on stage together.

In [None]:
from bs4 import BeautifulSoup
import re
name = re.compile('[A-Z]{2,}')

with open('/media/sf_RBSDigitalApproaches/data/0611_Tuesday_data/K132743.000.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    # Create a canonical list of the characters we know are in the play--we'll use this to dodge our spurious
    # "MASINSSSA" later.
    cast = ['SOPHONISBA','PHOENISSA','MASINISSA','MESSENGER','SYPHAX','NARVA','LAELIUS','SCIPIO','SLAVE']
    
    # Find all of the acts
    acts = soup.find_all('div', type='act')
    
    # Iterate through the (five) acts
    for act in acts :
        # Get the act number for each act. The 'n' in square brackets retrieves the value of the 
        # @n attribute of the act div.
        act_num = act['n']
        
        # Find all of the scenes. (We're inside a for loop, so this will be repeated for each of the five
        # acts independently of one another)
        scenes = act.find_all('div', type='scene')
        
        # Iterate through the scenes in each act
        for scene in scenes :
            # Get the scene number from the @n attribute of the scene div
            scene_num = scene['n']
            
            # Create an empty list to hold the names of the characters who appear in each scene.
            characters = []
            
            # Check to make sure that a speaker element is present in the scene--you'd think this would be a
            # given, but I believe I ran into an instance where that wasn't the case.
            if scene.find_all('speaker') :
                # Find all of the speaker elements
                speakers = scene.find_all('speaker')
                for speaker in speakers :
                    # Use a variation on the tortuous workaround we used above to extract the character's name
                    # from the speaker element by looking for the regular expression pattern name.
                    character = re.findall(name,speaker.get_text())[0].rstrip('. ')
                    
                    # We check to make sure: a) that our character name is in the canonical cast list (to avoid
                    # "MASINSSSA"); and b) that our character name *isn't* already in the list of characters
                    # we're building for this scene. If both those conditions are met, add the character name
                    # to the list of characters in this scene. (Note that I'm venturing on this check because
                    # I'm pretty confident that "MASINSSSA" is the only misspelled character name in the text.)
                    if character in cast and character not in characters :
                        characters.append(character)
            
            # A fallback for scenes with no speaker elements: look for character names in the stage directions
            # instead. Because this is an else branch, it runs only when the check above finds no speakers;
            # it does not supplement the list built from the speaker elements. The "not in characters" test
            # below keeps us from adding the same name twice.
            else :
                # Find all of the stage elements (i.e., stage directions) in the scene
                stages = scene.find_all('stage')
                # Apply pretty much the same logic we used to work through our speaker tags...
                for stage in stages :
                    stage_names = re.findall(name, stage.get_text())
                    # ... except we'll entertain the possibility that we might find more than one name in 
                    # a stage direction. This requires an additional for loop to work through the potentially
                    # multiple results of our re.findall()
                    for stage_name in stage_names :
                        if stage_name in cast and stage_name not in characters :
                            characters.append(stage_name)
            
            # We're still inside a for loop (two nested for loops, actually), so what we've been building is
            # a list of characters in each scene in each act.
            # 
            # We'd probably want to do something useful with this information, like write it out to a file or, if we 
            # had the proper libraries installed, turn it into some sort of visualization. We'd probably need a more
            # sophisticated data structure for storing this information. For now, though, let's just
            # print out a list of characters for each scene.
            
            # We use the str.join() function to combine all of the items in our list into a string, separated by
            # a separator we define in parentheses (I've gone with a comma and a space). This way we don't run
            # into problems trying to concatenate lists and strings.
            character_list = ', '.join(characters)
            
            # We'll concatenate a bunch of stuff together. Note how we're turning the act_num and scene_num into
            # strings to combine them with the other strings here.
            print('Act ' + str(act_num) + ', scene ' + str(scene_num) + ': ' + character_list)
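
One possible "more sophisticated data structure" is a dictionary keyed by (act, scene) tuples. From there, counting how often each pair of characters shares a scene takes only a few lines. This is a sketch under stated assumptions: the scene contents below are invented rather than extracted from the play, and the names `scene_casts` and `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Invented scene casts keyed by (act number, scene number)
scene_casts = {
    ('1', '1'): ['SOPHONISBA', 'PHOENISSA'],
    ('1', '2'): ['SOPHONISBA', 'PHOENISSA', 'MESSENGER'],
    ('2', '1'): ['MASINISSA', 'NARVA'],
}

pair_counts = Counter()
for cast_in_scene in scene_casts.values():
    # Sort the names so that each pair is always stored in the same order
    for pair in combinations(sorted(cast_in_scene), 2):
        pair_counts[pair] += 1

# List the pairs that share the most scenes first
for pair, count in pair_counts.most_common():
    print(', '.join(pair) + ': ' + str(count))
```

In the loop above, the same dictionary could be filled with `scene_casts[(act_num, scene_num)] = characters` in place of the final print statement.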