# XML Parsing Exercise

Modify the code from the previous notebook to count the number of lines each character speaks instead of the number of words.

For each speech, this means:
* counting the `<l>` elements (verse lines), and
* within each `<p>` element (prose paragraph), counting the `<lb>` elements (line breaks).
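
The two counting rules can be tried out on a tiny document before running them on the whole play. The sketch below is only an illustration: the `SAMPLE` markup, the speaker ids `#A_Tmp` and `#B_Tmp`, and the helper `count_lines` are made up for this example, and it uses the standard library's `xml.etree.ElementTree` so that it runs without the play file (the same `findall` calls with a namespace mapping also exist in `lxml.etree`).

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical miniature document, not taken from the Folger file.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#A_Tmp">
    <speaker>A</speaker>
    <l>First verse line</l>
    <l>Second verse line</l>
  </sp>
  <sp who="#B_Tmp">
    <speaker>B</speaker>
    <p>Prose that runs<lb/>over two breaks<lb/>in total</p>
  </sp>
</TEI>
"""

def count_lines(sp):
    """Return verse lines (<l>) plus line breaks (<lb>) inside <p> for one speech."""
    verse = len(sp.findall('.//tei:l', TEI_NS))
    prose = sum(len(p.findall('.//tei:lb', TEI_NS))
                for p in sp.findall('.//tei:p', TEI_NS))
    return verse + prose

sample_root = ET.fromstring(SAMPLE)
sample_totals = Counter()
for sp in sample_root.iter('{http://www.tei-c.org/ns/1.0}sp'):
    sample_totals[sp.get('who')] += count_lines(sp)

print(sample_totals)
```

Here the verse speech contributes its two `<l>` elements and the prose speech contributes its two `<lb>` elements, so each made-up speaker ends up with a count of 2.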

In [2]:
from lxml import etree
tree = etree.parse('The Tempest Folger Shakespeare.xml')
root = tree.getroot()
root

<Element {http://www.tei-c.org/ns/1.0}TEI at 0x107dbefc0>

In [14]:
from collections import Counter

# Count the <w> (word) elements in each speech, keyed by the speech's who attribute.
totals = Counter()
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if not who:
        continue  # skip speeches with no speaker attribution
    for child in sp:
        # Speaker labels and stage directions are not spoken text.
        if child.tag.endswith('speaker') or child.tag.endswith('stage'):
            continue
        # who is used unsplit, so a shared speech gets one combined key.
        for word in child.iter('{*}w'):
            totals[who] += 1
totals.most_common()

[('#Prospero_Tmp', 4746),
 ('#Caliban_Tmp', 1354),
 ('#Stephano_Tmp', 1316),
 ('#Ariel_Tmp', 1285),
 ('#Gonzalo_Tmp', 1157),
 ('#Miranda_Tmp', 1010),
 ('#Antonio_Tmp', 991),
 ('#Ferdinand_Tmp', 978),
 ('#Trinculo_Tmp', 824),
 ('#Alonso_Tmp', 751),
 ('#Sebastian_Tmp', 699),
 ('#Boatswain_Tmp', 353),
 ('#SPIRITS.Iris_Tmp', 297),
 ('#SPIRITS.Ceres_Tmp', 155),
 ('#Francisco_Tmp', 79),
 ('#Adrian_Tmp', 64),
 ('#SPIRITS.Juno_Tmp', 41),
 ('#Shipmaster_Tmp', 17),
 ('#SAILORS_Tmp', 8),
 ('#Ferdinand_Tmp #Miranda_Tmp', 4)]
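
The last entry in the output above is a combined key, because the `who` attribute of a shared speech lists several speakers separated by spaces. One possible way to credit each speaker individually is to split the attribute and strip the leading `#` from each part. The sketch below assumes that a shared speech should count in full for every listed speaker; the numbers in `raw_counts` are invented for illustration and are not results from the play.

```python
from collections import Counter

# Hypothetical speech-level counts keyed by the raw who attribute.
raw_counts = {
    '#Ferdinand_Tmp': 10,
    '#Miranda_Tmp': 7,
    '#Ferdinand_Tmp #Miranda_Tmp': 4,
}

per_speaker = Counter()
for who, n in raw_counts.items():
    # A who value may hold several '#id' pointers separated by whitespace.
    for pointer in who.split():
        per_speaker[pointer.lstrip('#')] += n

print(per_speaker.most_common())
```

With these invented inputs, the shared count of 4 is added to both speakers, giving 14 and 11.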

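In [None]:
from collections import Counter

totals = Counter()
for sp in root.iter('{*}sp'):
    who = sp.get('who')
    if not who:
        continue
    # Verse lines: every <l> element in the speech.
    n_lines = sum(1 for _ in sp.iter('{*}l'))
    # Prose lines: every <lb> element inside a <p> of the speech.
    for p in sp.iter('{*}p'):
        n_lines += sum(1 for _ in p.iter('{*}lb'))
    # Credit each speaker named in who, dropping the leading '#'.
    for individual in who.split():
        totals[individual.lstrip('#')] += n_lines
totals.most_common()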

In [8]:
count = 0
for elem in root.iter():
    if elem.tag.endswith('l'):  # caution: matches every tag name ending in 'l', not only <l>
        count += 1

print("Number of verse lines:", count)


Number of verse lines: 1786
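
A suffix test such as `endswith('l')` accepts any element whose name happens to end in that letter, so the count above may include elements other than `<l>`. Whether the Folger file actually contains such elements is not checked here. The sketch below compares the suffix test with an exact comparison against the namespaced tag on a made-up fragment; `SAMPLE` and the variable names are assumptions for illustration, and the standard library parser is used so the example runs without the play file.

```python
import xml.etree.ElementTree as ET

TEI = '{http://www.tei-c.org/ns/1.0}'

# Hypothetical fragment: one <label> followed by two <l> verse lines.
SAMPLE = """
<div xmlns="http://www.tei-c.org/ns/1.0">
  <label>A heading</label>
  <l>first verse line</l>
  <l>second verse line</l>
</div>
"""

frag = ET.fromstring(SAMPLE)

# Suffix test: counts <label> as well, because its name ends in 'l'.
loose = sum(1 for el in frag.iter() if el.tag.endswith('l'))

# Exact test: compares the full Clark-notation tag '{namespace}l'.
exact = sum(1 for el in frag.iter() if el.tag == TEI + 'l')

print(loose, exact)
```

On this fragment the suffix test reports 3 and the exact test reports 2. With lxml, `root.iter('{*}l')` is another way to select only elements whose local name is exactly `l`.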


In [10]:
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# find all <p> paragraphs
paragraphs = root.findall('.//tei:p', ns)

for i, p in enumerate(paragraphs, start=1):
    # find <lb> only inside this <p>
    lbs = p.findall('.//tei:lb', ns)
    print(f"Paragraph {i}: {len(lbs)} line breaks")

Paragraph 1: 0 line breaks
Paragraph 2: 0 line breaks
Paragraph 3: 0 line breaks
Paragraph 4: 0 line breaks
Paragraph 5: 1 line breaks
Paragraph 6: 1 line breaks
Paragraph 7: 2 line breaks
Paragraph 8: 4 line breaks
Paragraph 9: 2 line breaks
Paragraph 10: 1 line breaks
Paragraph 11: 1 line breaks
Paragraph 12: 2 line breaks
Paragraph 13: 1 line breaks
Paragraph 14: 3 line breaks
Paragraph 15: 2 line breaks
Paragraph 16: 8 line breaks
Paragraph 17: 6 line breaks
Paragraph 18: 2 line breaks
Paragraph 19: 2 line breaks
Paragraph 20: 2 line breaks
Paragraph 21: 2 line breaks
Paragraph 22: 1 line breaks
Paragraph 23: 3 line breaks
Paragraph 24: 3 line breaks
Paragraph 25: 2 line breaks
Paragraph 26: 1 line breaks
Paragraph 27: 1 line breaks
Paragraph 28: 2 line breaks
Paragraph 29: 1 line breaks
Paragraph 30: 3 line breaks
Paragraph 31: 2 line breaks
Paragraph 32: 4 line breaks
Paragraph 33: 1 line breaks
Paragraph 34: 1 line breaks
Paragraph 35: 4 line breaks
Paragraph 36: 1 line breaks
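
The per-paragraph counts can also be added up into a single prose total. The sketch below does this on an invented fragment and compares it with a count of every `<lb>` in the document, to show why the search is restricted to `<p>`. The markup in `SAMPLE`, including the `<lb>` placed inside an `<l>`, is hypothetical, and the standard library parser stands in for lxml.

```python
import xml.etree.ElementTree as ET

ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical fragment with three prose paragraphs and one verse line.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>one<lb/>two</p>
  <p>no breaks here</p>
  <l>verse with a stray<lb/>break</l>
  <p>a<lb/>b<lb/>c</p>
</body>
"""

body = ET.fromstring(SAMPLE)

# One count per <p>, looking only at <lb> elements inside that paragraph.
per_paragraph = [len(p.findall('.//tei:lb', ns))
                 for p in body.findall('.//tei:p', ns)]
prose_breaks = sum(per_paragraph)

# Counting every <lb> in the document also picks up the one inside <l>.
all_breaks = len(body.findall('.//tei:lb', ns))

print(per_paragraph, prose_breaks, all_breaks)
```

In this fragment the paragraphs hold 1, 0 and 2 line breaks, so the prose total is 3, while the unrestricted count is 4.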

In [9]:
for el in root.iter('{*}p'):
    print(el.tag)

{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
{http://www.tei-c.org/ns/1.0}p
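
Printing one tag per element produces long, repetitive output. A tally of local tag names gives the same information in a compact form. The sketch below is one way to do that under stated assumptions: `local_name` is a helper written for this example, `SAMPLE` is invented markup, and the standard library parser is used in place of lxml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment, not taken from the Folger file.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>prose</p>
  <l>verse</l>
  <p>more prose</p>
</body>
"""

def local_name(tag):
    """Strip a leading '{namespace}' from a Clark-notation tag."""
    return tag.rsplit('}', 1)[-1]

doc = ET.fromstring(SAMPLE)

# Skip nodes whose tag is not a string (comments, processing instructions).
tag_counts = Counter(local_name(el.tag)
                     for el in doc.iter()
                     if isinstance(el.tag, str))

print(tag_counts.most_common())
```

For the invented fragment this tallies two `p` elements and one each of `body` and `l`.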