# More complex XML parsing

When ElementTree is not enough, there is [lxml](https://lxml.de/tutorial.html), a library that wraps the widely used and powerful libxml2.  By design, its interface is similar to ElementTree's.

Another library you can use is Beautiful Soup, which is more commonly used to parse HTML.

The lxml library has to be installed before it can be imported; on NCC this has already been done.

In [None]:
!pip install lxml

In [1]:
from lxml import etree

Let's try again to pull out each speech and see who speaks the most (measured in words).

It would be trickier with this file to count lines the way we did in the last notebook.  Can you see why?

In [3]:
tree = etree.parse('The Tempest Folger Shakespeare.xml')

In [4]:
root = tree.getroot()

In [5]:
root

<Element {http://www.tei-c.org/ns/1.0}TEI at 0x10f4b80c0>

In [6]:
for el in root.iter('sp'):
    print(el.tag)

Why didn't that work?

In [7]:
for el in root.iter():
    print(el)

<Element {http://www.tei-c.org/ns/1.0}TEI at 0x10f4b80c0>
<Element {http://www.tei-c.org/ns/1.0}teiHeader at 0x10f4b1480>
<Element {http://www.tei-c.org/ns/1.0}fileDesc at 0x10f4b9380>
<Element {http://www.tei-c.org/ns/1.0}titleStmt at 0x107f310c0>
<Element {http://www.tei-c.org/ns/1.0}title at 0x10f4b1480>
<Element {http://www.tei-c.org/ns/1.0}author at 0x10f4b9380>
<Element {http://www.tei-c.org/ns/1.0}editor at 0x107f310c0>
<Element {http://www.tei-c.org/ns/1.0}editor at 0x10f4b1480>
<Element {http://www.tei-c.org/ns/1.0}respStmt at 0x10f4b9380>
<Element {http://www.tei-c.org/ns/1.0}resp at 0x107f310c0>
<Element {http://www.tei-c.org/ns/1.0}name at 0x10f4b1480>
<Element {http://www.tei-c.org/ns/1.0}name at 0x10f4b9380>
<Element {http://www.tei-c.org/ns/1.0}respStmt at 0x107f310c0>
<Element {http://www.tei-c.org/ns/1.0}resp at 0x10f4b1480>
<Element {http://www.tei-c.org/ns/1.0}name at 0x10f4b9380>
<Element {http://www.tei-c.org/ns/1.0}name at 0x107f310c0>
<Element {http://www.tei-c.o

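Namespaces -- drat!  Every tag name is qualified with the TEI namespace URI in curly braces, so a bare `'sp'` matches nothing.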

In [8]:
for el in root.iter('{http://www.tei-c.org/ns/1.0}sp'):
    print(el.tag)

{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://

In [9]:
for el in root.iter('{*}sp'):
    print(el.tag)

{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://www.tei-c.org/ns/1.0}sp
{http://

* ElementTree requires you to spell out namespaces, though it does let you define prefix shorthands for them
* lxml permits namespace wildcards such as `{*}sp`
* Python 3.8 added namespace wildcards to ElementTree's XPath expressions (at least according to the latest documentation, though it did not seem to work for me).
* So maybe it will add wildcards for DOM methods at some point ...
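The options in the list above can be sketched with the standard-library ElementTree. The XML below is a tiny made-up sample, not the Folger file, and the `tei` prefix is an arbitrary name chosen for this example. The wildcard call is the one documented for Python 3.8 and later, so its result may differ on older versions.

```python
import xml.etree.ElementTree as ET

# Tiny made-up TEI-like sample (an assumption for illustration only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp who="#A_Tmp"><speaker>A</speaker></sp>
    <sp who="#B_Tmp"><speaker>B</speaker></sp>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Shorthand: map a prefix of our choosing to the namespace URI.
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
with_prefix = root.findall('.//tei:sp', ns)

# The fully qualified name ("Clark notation") works with iter().
qualified = list(root.iter('{http://www.tei-c.org/ns/1.0}sp'))

# Namespace wildcard in a path expression (documented for Python 3.8+).
wildcard = root.findall('.//{*}sp')

print(len(with_prefix), len(qualified), len(wildcard))
```

The prefix dictionary only saves typing: the elements found are the same as with the fully qualified name.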

Now, how do we get a name that will uniquely identify each speaker?

In [11]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    print(who)

#Shipmaster_Tmp
#Boatswain_Tmp
#Shipmaster_Tmp
#Boatswain_Tmp
#Alonso_Tmp
#Boatswain_Tmp
#Antonio_Tmp
#Boatswain_Tmp
#Gonzalo_Tmp
#Boatswain_Tmp
#Gonzalo_Tmp
#Boatswain_Tmp
#Gonzalo_Tmp
#Boatswain_Tmp
#Sebastian_Tmp
#Boatswain_Tmp
#Antonio_Tmp
#Gonzalo_Tmp
#Boatswain_Tmp
#SAILORS_Tmp
#Boatswain_Tmp
#Gonzalo_Tmp
#Sebastian_Tmp
#Antonio_Tmp
#Gonzalo_Tmp
None
#Antonio_Tmp
#Sebastian_Tmp
#Gonzalo_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Miranda_Tmp
#Prospero_Tmp
#Ariel_Tmp
#Prospero_Tmp
#Ariel_T

The `get` method looks up the value of an attribute.  An element's attributes are key-value pairs, much like a dict, and `get` returns `None` when the attribute is missing.
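As a small illustration of attribute lookup, here is a sketch on a single made-up element (the `n` and `rend` attribute names are assumptions for the example, not a claim about the Folger file). It falls back to the standard-library ElementTree if lxml is not available, since both offer `get` and `attrib`.

```python
try:
    from lxml import etree  # the library used in this notebook
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Made-up element for illustration only.
sp = etree.fromstring('<sp who="#Prospero_Tmp" n="1"/>')

print(sp.get('who'))            # value of an attribute that exists
print(sp.get('rend'))           # a missing attribute gives None
print(sp.get('rend', 'plain'))  # ... or a default of our choosing
print(dict(sp.attrib))          # all attributes as key-value pairs
```

Because a missing attribute yields `None` rather than an error, any string method called on the result needs a check first.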

In [12]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    who = who.lstrip('#')
    print(who)

Shipmaster_Tmp
Boatswain_Tmp
Shipmaster_Tmp
Boatswain_Tmp
Alonso_Tmp
Boatswain_Tmp
Antonio_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Boatswain_Tmp
Sebastian_Tmp
Boatswain_Tmp
Antonio_Tmp
Gonzalo_Tmp
Boatswain_Tmp
SAILORS_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Sebastian_Tmp
Antonio_Tmp
Gonzalo_Tmp


AttributeError: 'NoneType' object has no attribute 'lstrip'

There is a speech without a speaker?

In [10]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        print(who, ''.join(sp.itertext()))

None 
            
              A
               
              confused
               
              noise
               
              within
              :
            
            
              
              “
              Mercy
               
              on
               
              us
              !
              ”
              —
              “
              We
               
              split
              ,
               
              we
              
              split
              !
              ”
              —
              “
              Farewell
              ,
               
              my
               
              wife
               
              and
               
              children
              !
              ”
              —
              
              “
              Farewell
              ,
               
              brother
              !
              ”
              —
              “
              We
            

OK -- general hubbub with no named speaker: no need to include these speeches.
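The text printed above sprawls over many lines because `itertext()` also returns the whitespace between elements. A minimal sketch of collapsing it, using a made-up fragment that imitates the word and space elements (the `w` and `c` layout here is an assumption based on the output above):

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Made-up fragment for illustration, not copied from the Folger file.
SAMPLE = """<sp xmlns="http://www.tei-c.org/ns/1.0">
  <stage>
    <w>A</w>
    <c> </c>
    <w>confused</w>
    <c> </c>
    <w>noise</w>
    <c> </c>
    <w>within</w>
  </stage>
</sp>"""

sp = etree.fromstring(SAMPLE)
raw = ''.join(sp.itertext())   # includes newlines and indentation
tidy = ' '.join(raw.split())   # collapse every run of whitespace
print(tidy)
```

One caveat: where punctuation sits in its own element, this approach leaves a space in front of it, so it is only good for eyeballing the text.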

In [13]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        continue
    who = who.lstrip('#')
    print(who)

Shipmaster_Tmp
Boatswain_Tmp
Shipmaster_Tmp
Boatswain_Tmp
Alonso_Tmp
Boatswain_Tmp
Antonio_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Boatswain_Tmp
Sebastian_Tmp
Boatswain_Tmp
Antonio_Tmp
Gonzalo_Tmp
Boatswain_Tmp
SAILORS_Tmp
Boatswain_Tmp
Gonzalo_Tmp
Sebastian_Tmp
Antonio_Tmp
Gonzalo_Tmp
Antonio_Tmp
Sebastian_Tmp
Gonzalo_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Miranda_Tmp
Prospero_Tmp
Ariel_Tmp
Prospero_Tmp
Ariel_Tmp
Prospero_Tmp
Ariel_Tmp
Prospero_Tmp
Ariel_Tmp
Prospero_Tmp
Ariel_Tmp
Prospe

Now we need to get at the words in each speech, so that we can count them.

In [14]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        continue
    who = who.lstrip('#')
    for word_el in sp.iter('{*}w'):
        word = word_el.text
        print(who, word)

Shipmaster_Tmp MASTER
Shipmaster_Tmp Boatswain
Boatswain_Tmp BOATSWAIN
Boatswain_Tmp Here
Boatswain_Tmp master
Boatswain_Tmp What
Boatswain_Tmp cheer
Shipmaster_Tmp MASTER
Shipmaster_Tmp Good
Shipmaster_Tmp speak
Shipmaster_Tmp to
Shipmaster_Tmp th’
Shipmaster_Tmp mariners
Shipmaster_Tmp Fall
Shipmaster_Tmp to
Shipmaster_Tmp ’t
Shipmaster_Tmp yarely
Shipmaster_Tmp or
Shipmaster_Tmp we
Shipmaster_Tmp run
Shipmaster_Tmp ourselves
Shipmaster_Tmp aground
Shipmaster_Tmp Bestir
Shipmaster_Tmp bestir
Boatswain_Tmp BOATSWAIN
Boatswain_Tmp Heigh
Boatswain_Tmp my
Boatswain_Tmp hearts
Boatswain_Tmp Cheerly
Boatswain_Tmp cheerly
Boatswain_Tmp my
Boatswain_Tmp hearts
Boatswain_Tmp Yare
Boatswain_Tmp yare
Boatswain_Tmp Take
Boatswain_Tmp in
Boatswain_Tmp the
Boatswain_Tmp topsail
Boatswain_Tmp Tend
Boatswain_Tmp to
Boatswain_Tmp th’
Boatswain_Tmp Master’s
Boatswain_Tmp whistle
Boatswain_Tmp Blow
Boatswain_Tmp till
Boatswain_Tmp thou
Boatswain_Tmp burst
Boatswain_Tmp thy
Boatswain_Tmp wind
Boatswain_

Oops.  We have also picked up the speakers' names as they are printed in the text itself (the `MASTER` and `BOATSWAIN` labels).

To get rid of those words, we have to see what element they are children of.  Let's print out all of the children.

In [15]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        continue
    who = who.lstrip('#')
    for child in sp:
        print(who, child.tag)

Shipmaster_Tmp {http://www.tei-c.org/ns/1.0}speaker
Shipmaster_Tmp {http://www.tei-c.org/ns/1.0}p
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}speaker
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}p
Shipmaster_Tmp {http://www.tei-c.org/ns/1.0}speaker
Shipmaster_Tmp {http://www.tei-c.org/ns/1.0}p
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}speaker
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}p
Alonso_Tmp {http://www.tei-c.org/ns/1.0}speaker
Alonso_Tmp {http://www.tei-c.org/ns/1.0}p
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}speaker
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}p
Antonio_Tmp {http://www.tei-c.org/ns/1.0}speaker
Antonio_Tmp {http://www.tei-c.org/ns/1.0}p
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}speaker
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}p
Gonzalo_Tmp {http://www.tei-c.org/ns/1.0}speaker
Gonzalo_Tmp {http://www.tei-c.org/ns/1.0}p
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}speaker
Boatswain_Tmp {http://www.tei-c.org/ns/1.0}p
Gonzalo_Tmp {http://www.tei-c.org/ns/1.0}speaker
G

There are also some stage directions inside speeches, it seems.  Let's skip those.

In [16]:
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        continue
    who = who.lstrip('#')
    for child in sp:
        if child.tag.endswith('speaker') or child.tag.endswith('stage'):
            continue
        for word_el in child.iter('{*}w'):
            word = word_el.text
            print(who, word)

Shipmaster_Tmp Boatswain
Boatswain_Tmp Here
Boatswain_Tmp master
Boatswain_Tmp What
Boatswain_Tmp cheer
Shipmaster_Tmp Good
Shipmaster_Tmp speak
Shipmaster_Tmp to
Shipmaster_Tmp th’
Shipmaster_Tmp mariners
Shipmaster_Tmp Fall
Shipmaster_Tmp to
Shipmaster_Tmp ’t
Shipmaster_Tmp yarely
Shipmaster_Tmp or
Shipmaster_Tmp we
Shipmaster_Tmp run
Shipmaster_Tmp ourselves
Shipmaster_Tmp aground
Shipmaster_Tmp Bestir
Shipmaster_Tmp bestir
Boatswain_Tmp Heigh
Boatswain_Tmp my
Boatswain_Tmp hearts
Boatswain_Tmp Cheerly
Boatswain_Tmp cheerly
Boatswain_Tmp my
Boatswain_Tmp hearts
Boatswain_Tmp Yare
Boatswain_Tmp yare
Boatswain_Tmp Take
Boatswain_Tmp in
Boatswain_Tmp the
Boatswain_Tmp topsail
Boatswain_Tmp Tend
Boatswain_Tmp to
Boatswain_Tmp th’
Boatswain_Tmp Master’s
Boatswain_Tmp whistle
Boatswain_Tmp Blow
Boatswain_Tmp till
Boatswain_Tmp thou
Boatswain_Tmp burst
Boatswain_Tmp thy
Boatswain_Tmp wind
Boatswain_Tmp if
Boatswain_Tmp room
Boatswain_Tmp enough
Alonso_Tmp Good
Alonso_Tmp boatswain
Alonso_T
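The `endswith` test works for this file, but it would also match any other element whose name happens to end in `speaker` or `stage`. A stricter sketch compares the local part of the tag exactly. The `local_name` helper and the `notaspeaker` element are hypothetical names invented for this example.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

def local_name(element):
    """Return the tag without its '{namespace}' prefix (hypothetical helper)."""
    tag = element.tag
    if not isinstance(tag, str):  # e.g. comments in lxml have a non-string tag
        return ''
    return tag.rsplit('}', 1)[-1]

# Made-up speech; 'notaspeaker' is an invented element, not TEI.
SAMPLE = """<sp xmlns="http://www.tei-c.org/ns/1.0" who="#A_Tmp">
  <speaker>A</speaker>
  <p>words</p>
  <stage>Exit</stage>
  <notaspeaker>kept</notaspeaker>
</sp>"""

SKIP = {'speaker', 'stage'}
sp = etree.fromstring(SAMPLE)
kept = [local_name(child) for child in sp if local_name(child) not in SKIP]
print(kept)
```

With lxml specifically, `etree.QName(child).localname` is another way to get the same local name.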

Now we need to count up the words for each speaker.
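For counting, `collections.Counter` is convenient: it behaves like a dict whose missing keys count as zero, so there is no need to initialise each speaker. A minimal sketch with made-up (speaker, word) pairs:

```python
from collections import Counter

totals = Counter()

# Made-up pairs standing in for the words pulled out of the XML.
pairs = [('Boatswain', 'Here'), ('Boatswain', 'master'), ('Shipmaster', 'Good')]
for who, word in pairs:
    totals[who] += 1  # no KeyError on first sight of a speaker

print(totals.most_common())  # sorted from most to fewest words
print(totals['Nobody'])      # a speaker never seen counts as 0
```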

In [17]:
from collections import Counter

In [None]:
totals = Counter()
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        continue
    who = who.lstrip('#')
    for child in sp:
        if child.tag.endswith('speaker') or child.tag.endswith('stage'):
            continue
        for word in child.iter('{*}w'):
            print(word.text)
            totals[who] += 1

Boatswain
Here
master
What
cheer
Good
speak
to
th’
mariners
Fall
to
’t
yarely
or
we
run
ourselves
aground
Bestir
bestir
Heigh
my
hearts
Cheerly
cheerly
my
hearts
Yare
yare
Take
in
the
topsail
Tend
to
th’
Master’s
whistle
Blow
till
thou
burst
thy
wind
if
room
enough
Good
boatswain
have
care
Where’s
the
Master
Play
the
men
I
pray
now
keep
below
Where
is
the
Master
boatswain
Do
you
not
hear
him
You
mar
our
labor
Keep
your
cabins
You
do
assist
the
storm
Nay
good
be
patient
When
the
sea
is
Hence
What
cares
these
roarers
for
the
name
of
king
To
cabin
Silence
Trouble
us
not
Good
yet
remember
whom
thou
hast
aboard
None
that
I
more
love
than
myself
You
are
a
councillor
if
you
can
command
these
elements
to
silence
and
work
the
peace
of
the
present
we
will
not
hand
a
rope
more
Use
your
authority
If
you
cannot
give
thanks
you
have
lived
so
long
and
make
yourself
ready
in
your
cabin
for
the
mischance
of
the
hour
if
it
so
hap
Cheerly
good
hearts
Out
of
our
way
I
say
I
have
great
comfort
from
this
fe

In [19]:
totals.most_common()

[('Prospero_Tmp', 4746),
 ('Caliban_Tmp', 1354),
 ('Stephano_Tmp', 1316),
 ('Ariel_Tmp', 1285),
 ('Gonzalo_Tmp', 1157),
 ('Miranda_Tmp', 1010),
 ('Antonio_Tmp', 991),
 ('Ferdinand_Tmp', 978),
 ('Trinculo_Tmp', 824),
 ('Alonso_Tmp', 751),
 ('Sebastian_Tmp', 699),
 ('Boatswain_Tmp', 353),
 ('SPIRITS.Iris_Tmp', 297),
 ('SPIRITS.Ceres_Tmp', 155),
 ('Francisco_Tmp', 79),
 ('Adrian_Tmp', 64),
 ('SPIRITS.Juno_Tmp', 41),
 ('Shipmaster_Tmp', 17),
 ('SAILORS_Tmp', 8),
 ('Ferdinand_Tmp #Miranda_Tmp', 4)]

The last entry is a speech that two characters deliver together: its `who` attribute lists both speakers, separated by a space.  Let's try splitting it.

In [22]:
totals = Counter()
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if (not who):
        continue
    for individual in who.split():
        individual = individual.lstrip('#')
    for child in sp:
        if child.tag.endswith('speaker') or child.tag.endswith('stage'):
            continue
        for word in child.iter('{*}w'):
            print(word.text)
            totals[who] += 1

Boatswain
Here
master
What
cheer
Good
speak
to
th’
mariners
Fall
to
’t
yarely
or
we
run
ourselves
aground
Bestir
bestir
Heigh
my
hearts
Cheerly
cheerly
my
hearts
Yare
yare
Take
in
the
topsail
Tend
to
th’
Master’s
whistle
Blow
till
thou
burst
thy
wind
if
room
enough
Good
boatswain
have
care
Where’s
the
Master
Play
the
men
I
pray
now
keep
below
Where
is
the
Master
boatswain
Do
you
not
hear
him
You
mar
our
labor
Keep
your
cabins
You
do
assist
the
storm
Nay
good
be
patient
When
the
sea
is
Hence
What
cares
these
roarers
for
the
name
of
king
To
cabin
Silence
Trouble
us
not
Good
yet
remember
whom
thou
hast
aboard
None
that
I
more
love
than
myself
You
are
a
councillor
if
you
can
command
these
elements
to
silence
and
work
the
peace
of
the
present
we
will
not
hand
a
rope
more
Use
your
authority
If
you
cannot
give
thanks
you
have
lived
so
long
and
make
yourself
ready
in
your
cabin
for
the
mischance
of
the
hour
if
it
so
hap
Cheerly
good
hearts
Out
of
our
way
I
say
I
have
great
comfort
from
this
fe

In [23]:
totals.most_common()

[('#Prospero_Tmp', 4746),
 ('#Caliban_Tmp', 1354),
 ('#Stephano_Tmp', 1316),
 ('#Ariel_Tmp', 1285),
 ('#Gonzalo_Tmp', 1157),
 ('#Miranda_Tmp', 1010),
 ('#Antonio_Tmp', 991),
 ('#Ferdinand_Tmp', 978),
 ('#Trinculo_Tmp', 824),
 ('#Alonso_Tmp', 751),
 ('#Sebastian_Tmp', 699),
 ('#Boatswain_Tmp', 353),
 ('#SPIRITS.Iris_Tmp', 297),
 ('#SPIRITS.Ceres_Tmp', 155),
 ('#Francisco_Tmp', 79),
 ('#Adrian_Tmp', 64),
 ('#SPIRITS.Juno_Tmp', 41),
 ('#Shipmaster_Tmp', 17),
 ('#SAILORS_Tmp', 8),
 ('#Ferdinand_Tmp #Miranda_Tmp', 4)]

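That changed nothing except leaving the `#` prefixes in place: the inner loop strips each name but throws the result away, and `totals` is still keyed on the whole `who` string.  What should we do about the words that two characters speak together?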
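One possible answer, offered as a sketch rather than the definitive choice, is to credit a shared speech to every speaker named in `who`. The XML below is a made-up sample, and `count_words` is a hypothetical helper. It uses fully qualified tag names so that it also runs on the standard-library ElementTree when lxml is missing.

```python
from collections import Counter

try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

TEI = '{http://www.tei-c.org/ns/1.0}'

# Made-up sample for illustration, not the Folger file.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#Ferdinand_Tmp">
    <speaker><w>FERDINAND</w></speaker>
    <p><w>Here</w><c> </c><w>is</w><c> </c><w>one</w></p>
  </sp>
  <sp who="#Ferdinand_Tmp #Miranda_Tmp">
    <speaker><w>BOTH</w></speaker>
    <p><w>Till</w><c> </c><w>then</w></p>
    <stage><w>Exit</w></stage>
  </sp>
  <sp>
    <p><w>hubbub</w></p>
  </sp>
</body>"""

def count_words(root):
    """Count spoken words per speaker, crediting shared speeches to each."""
    totals = Counter()
    for sp in root.iter(TEI + 'sp'):
        who = sp.get('who')
        if not who:
            continue  # speech with no named speaker
        speakers = [name.lstrip('#') for name in who.split()]
        n_words = 0
        for child in sp:
            if child.tag in (TEI + 'speaker', TEI + 'stage'):
                continue  # printed name and stage directions are not speech
            n_words += sum(1 for _ in child.iter(TEI + 'w'))
        for name in speakers:
            totals[name] += n_words
    return totals

totals = count_words(etree.fromstring(SAMPLE))
print(totals.most_common())
```

With this choice the shared words are counted once per speaker, so the grand total exceeds the number of words actually spoken. Keeping a separate joint entry, as the counter above does by accident, is the alternative if totals must add up.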