# Basic XML parsing

Python has a simple [built-in XML parser](https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree)
called ElementTree.

In [13]:
import xml.etree.ElementTree as ET

## Simple Tempest parsing

Let's compute the amount of text spoken by each character in Shakespeare's *The Tempest*.

In [14]:
tree = ET.parse('Digital Humanities Practice and Theory.xml')

In [15]:
root = tree.getroot()
root

<Element 'PLAY' at 0x10ff8f1a0>

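We can iterate over every `SPEECH` element in the tree. Note that `iter()` visits all descendants with the given tag, not just the direct children of the root.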

In [16]:
for speech in root.iter('SPEECH'):
    print(speech)

<Element 'SPEECH' at 0x10fe0f1f0>
<Element 'SPEECH' at 0x10fe0ed40>
<Element 'SPEECH' at 0x10fe0e570>
<Element 'SPEECH' at 0x10fe0e6b0>
<Element 'SPEECH' at 0x10ffaa520>
<Element 'SPEECH' at 0x10ffaa5c0>
<Element 'SPEECH' at 0x10ffaa7a0>
<Element 'SPEECH' at 0x10ffaa930>
<Element 'SPEECH' at 0x10ffaaac0>
<Element 'SPEECH' at 0x10ffaac00>
<Element 'SPEECH' at 0x10ffaad40>
<Element 'SPEECH' at 0x10ffaae80>
<Element 'SPEECH' at 0x10ffab2e0>
<Element 'SPEECH' at 0x10ffab650>
<Element 'SPEECH' at 0x10ffaba10>
<Element 'SPEECH' at 0x10ffabc90>
<Element 'SPEECH' at 0x10ffabdd0>
<Element 'SPEECH' at 0x10ffabf10>
<Element 'SPEECH' at 0x10ffc0130>
<Element 'SPEECH' at 0x10ffc0360>
<Element 'SPEECH' at 0x10ffc0450>
<Element 'SPEECH' at 0x10ffc0590>
<Element 'SPEECH' at 0x10ffc0720>
<Element 'SPEECH' at 0x10ffc0860>
<Element 'SPEECH' at 0x10ffc0a40>
<Element 'SPEECH' at 0x10ffc0c70>
<Element 'SPEECH' at 0x10ffc0e00>
<Element 'SPEECH' at 0x10ffc0fe0>
<Element 'SPEECH' at 0x10ffc1490>
<Element 'SPEE
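The difference between iterating over an element directly (immediate children only) and calling `iter()` (the whole subtree) can be seen on a small example. This is a minimal sketch; the inline XML string and the names `sample`, `direct_children`, and `speeches` are made up for illustration and are not taken from the Tempest file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, not the real Tempest file.
sample = ET.fromstring(
    "<PLAY><ACT><SCENE>"
    "<SPEECH><SPEAKER>Master</SPEAKER><LINE>Boatswain!</LINE></SPEECH>"
    "<SPEECH><SPEAKER>Boatswain</SPEAKER><LINE>Here, master: what cheer?</LINE></SPEECH>"
    "</SCENE></ACT></PLAY>"
)

# Iterating over an element directly yields only its immediate children.
direct_children = [child.tag for child in sample]

# iter(tag) walks the whole subtree and yields every matching descendant.
speeches = list(sample.iter("SPEECH"))

print(direct_children)  # ['ACT']
print(len(speeches))    # 2
```

Because the `SPEECH` elements are nested several levels below the root, a plain `for child in root` loop would never reach them, which is why the cell above uses `root.iter('SPEECH')`.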

In [17]:
for speech in root.iter('SPEECH'):
    speaker = speech.find('SPEAKER')
    print(speaker)

<Element 'SPEAKER' at 0x10fe0f330>
<Element 'SPEAKER' at 0x10fe0e1b0>
<Element 'SPEAKER' at 0x10fe0e7a0>
<Element 'SPEAKER' at 0x10fe0eac0>
<Element 'SPEAKER' at 0x10ffaa4d0>
<Element 'SPEAKER' at 0x10ffaa660>
<Element 'SPEAKER' at 0x10ffaa890>
<Element 'SPEAKER' at 0x10ffaa9d0>
<Element 'SPEAKER' at 0x10ffaab10>
<Element 'SPEAKER' at 0x10ffaac50>
<Element 'SPEAKER' at 0x10ffaad90>
<Element 'SPEAKER' at 0x10ffaaed0>
<Element 'SPEAKER' at 0x10ffab330>
<Element 'SPEAKER' at 0x10ffab6a0>
<Element 'SPEAKER' at 0x10ffabb00>
<Element 'SPEAKER' at 0x10ffabd30>
<Element 'SPEAKER' at 0x10ffabe70>
<Element 'SPEAKER' at 0x10ffabf60>
<Element 'SPEAKER' at 0x10ffc0180>
<Element 'SPEAKER' at 0x10ffc03b0>
<Element 'SPEAKER' at 0x10ffc04a0>
<Element 'SPEAKER' at 0x10ffc05e0>
<Element 'SPEAKER' at 0x10ffc0770>
<Element 'SPEAKER' at 0x10ffc08b0>
<Element 'SPEAKER' at 0x10ffc0a90>
<Element 'SPEAKER' at 0x10ffc0cc0>
<Element 'SPEAKER' at 0x10ffc0e50>
<Element 'SPEAKER' at 0x10ffc1030>
<Element 'SPEAKER' a

How do we get the text child(ren) of an element?

In [18]:
for speech in root.iter('SPEECH'):
    speaker = speech.find('SPEAKER')
    my_text = speaker.text
    print(my_text)

Master
Boatswain
Master
Boatswain
ALONSO
Boatswain
ANTONIO
Boatswain
GONZALO
Boatswain
GONZALO
Boatswain
GONZALO
Boatswain
SEBASTIAN
Boatswain
ANTONIO
GONZALO
Boatswain
Mariners
Boatswain
GONZALO
SEBASTIAN
ANTONIO
GONZALO
ANTONIO
SEBASTIAN
GONZALO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
CALIBAN
PROSPERO
ARIEL
PROSPERO
CALIBAN
PROSPERO
CALIBAN
PROSPERO
CALIBAN
PR

Throughout this simple XML file, there is no markup in the middle of the text, so we can use the `text` attribute, which holds only the text between the start tag and the first child element. See the [Element objects documentation](https://docs.python.org/3/library/xml.etree.elementtree.html#element-objects).

This would not work if there were any markup in the middle of the text. In that case, we could use `itertext()`, which yields the text of the element and of all its descendants in document order.

In [19]:
for speech in root.iter('SPEECH'):
    speaker = speech.find('SPEAKER')
    my_text = "".join(speaker.itertext())  # join without a separator so inline markup adds no extra spaces
    print(my_text)

Master
Boatswain
Master
Boatswain
ALONSO
Boatswain
ANTONIO
Boatswain
GONZALO
Boatswain
GONZALO
Boatswain
GONZALO
Boatswain
SEBASTIAN
Boatswain
ANTONIO
GONZALO
Boatswain
Mariners
Boatswain
GONZALO
SEBASTIAN
ANTONIO
GONZALO
ANTONIO
SEBASTIAN
GONZALO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
ARIEL
PROSPERO
MIRANDA
PROSPERO
MIRANDA
PROSPERO
CALIBAN
PROSPERO
ARIEL
PROSPERO
CALIBAN
PROSPERO
CALIBAN
PROSPERO
CALIBAN
PR
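Since the Tempest file has no inline markup, the two cells above print the same names. To see where `text` and `itertext()` actually differ, here is a small sketch on a made-up `LINE` element containing a hypothetical `EM` child (neither the element content nor the `EM` tag comes from the real file).

```python
import xml.etree.ElementTree as ET

# Hypothetical element with inline markup; the Tempest file has none.
line = ET.fromstring("<LINE>Good <EM>boatswain</EM>, have care.</LINE>")

text_only = line.text                 # text before the first child element
tail_of_child = line.find("EM").tail  # text that follows the child's end tag
full_text = "".join(line.itertext())  # all text, in document order

print(repr(text_only))      # 'Good '
print(repr(tail_of_child))  # ', have care.'
print(full_text)            # Good boatswain, have care.
```

The text after an inline child is stored on that child's `tail` attribute rather than on the parent, which is why `line.text` alone loses most of the sentence.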

Now we need the text of the speeches. Note the optional `sep` argument to `print`, which sets the separator placed between the printed values.

In [20]:
for speech in root.iter('SPEECH'):
    speaker = speech.find('SPEAKER')
    name = speaker.text
    for line in speech.iter('LINE'):
        print(name, line.text, sep=': ')

Master: Boatswain!
Boatswain: Here, master: what cheer?
Master: Good, speak to the mariners: fall to't, yarely,
Master: or we run ourselves aground: bestir, bestir.
Boatswain: Heigh, my hearts! cheerly, cheerly, my hearts!
Boatswain: yare, yare! Take in the topsail. Tend to the
Boatswain: master's whistle. Blow, till thou burst thy wind,
Boatswain: if room enough!
ALONSO: Good boatswain, have care. Where's the master?
ALONSO: Play the men.
Boatswain: I pray now, keep below.
ANTONIO: Where is the master, boatswain?
Boatswain: Do you not hear him? You mar our labour: keep your
Boatswain: cabins: you do assist the storm.
GONZALO: Nay, good, be patient.
Boatswain: When the sea is. Hence! What cares these roarers
Boatswain: for the name of king? To cabin: silence! trouble us not.
GONZALO: Good, yet remember whom thou hast aboard.
Boatswain: None that I more love than myself. You are a
Boatswain: counsellor; if you can command these elements to
Boatswain: silence, and work the peace of the p
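As a quick illustration of what `sep` does, the sketch below prints the same two values with and without it. The output is captured in an `io.StringIO` buffer only so that it can be inspected as a string; the variable names are illustrative.

```python
import io

buffer = io.StringIO()

# sep replaces the default single space placed between the arguments.
print("ALONSO", "Play the men.", sep=": ", file=buffer)

# Without sep, print joins the arguments with one space.
print("ALONSO", "Play the men.", file=buffer)

output = buffer.getvalue()
print(output, end="")
# ALONSO: Play the men.
# ALONSO Play the men.
```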

Now we count the lines spoken by each character.

In [21]:
from collections import Counter

In [None]:
def fix_name(name):
    """Remove a leading '[' and a trailing ']' from a speaker name, e.g. '[PROSPERO]'."""
    name = name.lstrip('[')
    name = name.rstrip(']')
    return name
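A short check of what `fix_name` does to a few names. This sketch repeats the function so that it runs on its own; the list `examples` is made up for illustration.

```python
def fix_name(name):
    # Same logic as the cell above: drop a leading '[' and a trailing ']'.
    return name.lstrip('[').rstrip(']')

# Hypothetical inputs: plain names pass through, bracketed ones are unwrapped.
examples = ["PROSPERO", "[PROSPERO]", "Boatswain"]
cleaned = [fix_name(n) for n in examples]

print(cleaned)  # ['PROSPERO', 'PROSPERO', 'Boatswain']
```

Normalizing names this way means a bracketed and an unbracketed spelling of the same speaker end up under one key when counting.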

In [23]:
totals = Counter()
for speech in root.iter('SPEECH'):
    speaker = speech.find('SPEAKER')
    name = speaker.text
    name = fix_name(name)
    for line in speech.iter('LINE'):
        totals[name] += 1

In [24]:
totals.most_common()

[('PROSPERO', 693),
 ('CALIBAN', 175),
 ('ARIEL', 168),
 ('STEPHANO', 163),
 ('GONZALO', 161),
 ('ANTONIO', 148),
 ('FERDINAND', 148),
 ('MIRANDA', 141),
 ('SEBASTIAN', 120),
 ('ALONSO', 109),
 ('TRINCULO', 105),
 ('Boatswain', 45),
 ('IRIS', 41),
 ('CERES', 24),
 ('ADRIAN', 12),
 ('FRANCISCO', 11),
 ('JUNO', 7),
 ('Master', 3),
 ('Mariners', 1)]
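Counting `LINE` elements is one way to measure the amount of text; counting words is another. The following is a minimal sketch of both on a tiny made-up XML string (the structure mirrors the tags used above, but the data and the names `sample`, `line_totals`, and `word_totals` are assumptions for illustration, and splitting on whitespace is only a rough definition of a word).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical three-speech excerpt, not the full play.
sample = ET.fromstring(
    "<PLAY>"
    "<SPEECH><SPEAKER>Master</SPEAKER><LINE>Boatswain!</LINE></SPEECH>"
    "<SPEECH><SPEAKER>Boatswain</SPEAKER>"
    "<LINE>Here, master: what cheer?</LINE></SPEECH>"
    "<SPEECH><SPEAKER>Master</SPEAKER>"
    "<LINE>Good, speak to the mariners: fall to't, yarely,</LINE>"
    "<LINE>or we run ourselves aground: bestir, bestir.</LINE></SPEECH>"
    "</PLAY>"
)

line_totals = Counter()
word_totals = Counter()
for speech in sample.iter("SPEECH"):
    name = speech.find("SPEAKER").text
    for line in speech.iter("LINE"):
        line_totals[name] += 1
        # Split on whitespace; a LINE whose text is None counts as zero words.
        word_totals[name] += len((line.text or "").split())

print(line_totals.most_common())  # [('Master', 3), ('Boatswain', 1)]
print(word_totals.most_common())  # [('Master', 16), ('Boatswain', 4)]
```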

What about [PROSPERO]?