## An Extended Example

Say you want to analyze the characters in Shakespeare's _Tempest_, and you have the play as a plain text file rather than in a structured format such as XML.

In [3]:
import re

In [6]:
with open('Tempest Notes.txt') as f:
    tempest = f.read()
print(tempest[0:1000])

The Tempest
by William Shakespeare
Edited by Barbara A. Mowat and Paul Werstine
  with Michael Poston and Rebecca Niles
Folger Shakespeare Library
https://shakespeare.folger.edu/shakespeares-works/the-tempest/
Created on Jul 31, 2015, from FDT version 0.9.2

Characters in the Play
PROSPERO, the former duke of Milan, now a magician on a Mediterranean island
MIRANDA, Prospero's daughter
ARIEL, a spirit, servant to Prospero
CALIBAN, an inhabitant of the island, servant to Prospero
FERDINAND, prince of Naples
ALONSO, king of Naples
ANTONIO, duke of Milan and Prospero's brother
SEBASTIAN, Alonso's brother
GONZALO, councillor to Alonso and friend to Prospero
Courtiers in attendance on Alonso:
  ADRIAN
  FRANCISCO
TRINCULO, servant to Alonso
STEPHANO, Alonso's butler
SHIPMASTER
BOATSWAIN
MARINERS
Players who, as spirits, take the roles of Iris, Ceres, Juno, Nymphs, and Reapers in Prospero's masque, and who, in other scenes, take the roles of "islanders" and of hunting d


First, we want to create a dict containing the names of the characters and their descriptions.

In [7]:
re.search(r'Characters', tempest)

<re.Match object; span=(259, 269), match='Characters'>

In [8]:
print(tempest[259:300])

Characters in the Play


In [10]:
re.search(r'MARINERS', tempest)

<re.Match object; span=(815, 823), match='MARINERS'>

In [11]:
print(tempest[259:823])

Characters in the Play
PROSPERO, the former duke of Milan, now a magician on a Mediterranean island
MIRANDA, Prospero's daughter
ARIEL, a spirit, servant to Prospero
CALIBAN, an inhabitant of the island, servant to Prospero
FERDINAND, prince of Naples
ALONSO, king of Naples
ANTONIO, duke of Milan and Prospero's brother
SEBASTIAN, Alonso's brother
GONZALO, councillor to Alonso and friend to Prospero
Courtiers in attendance on Alonso:
  ADRIAN
  FRANCISCO
TRINCULO, servant to Alonso
STEPHANO, Alonso's butler
SHIPMASTER
BOATSWAIN
MARINERS


In [13]:
characters = tempest[259:823]
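The slice above copies its numbers from the printed `Match` objects. As a minimal sketch, the same section can be cut out using the match positions directly, so the code still works if the file changes. The `sample` string here is a shortened, hypothetical stand-in for the real file, and the sketch assumes both searches succeed (`re.search` returns `None` otherwise).

```python
import re

# Shortened, hypothetical stand-in for the play text (not the real file).
sample = (
    "The Tempest\n"
    "by William Shakespeare\n"
    "\n"
    "Characters in the Play\n"
    "PROSPERO, the former duke of Milan\n"
    "MIRANDA, Prospero's daughter\n"
    "MARINERS\n"
    "Players who, as spirits, take the roles of Iris\n"
)

start = re.search(r'Characters in the Play', sample)
end = re.search(r'MARINERS', sample)

# Slice from the start of the first match to the end of the second,
# rather than typing in the numbers shown in the Match objects.
section = sample[start.start():end.end()]
print(section)
```

`Match.start()` and `Match.end()` return the same two numbers that appear in `span=(...)`.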

In [None]:
re.findall(r'^[A-Z]\w+', characters, flags=re.M)

Who have we included erroneously? How should we fix this?

In [None]:
re.findall(r'^[A-Z][A-Z]+', characters, flags=re.M)

In [None]:
re.findall(r'^[A-Z]{2,}', characters, flags=re.M)

In [None]:
re.findall(r'^[A-Z]+\b', characters, flags=re.M)
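The three patterns above should agree on the character list, but they are not equivalent in general. The sketch below uses a few made-up lines (hypothetical, not from the play file) to show where they differ: the first two only require two leading capitals, while the `\b` version requires the whole word to be capitals and also accepts a single-letter word.

```python
import re

# Hypothetical lines chosen to separate the three patterns.
lines = "PROSPERO, a magician\nCourtiers in attendance\nA\nPRospero\n"

two_or_more_a = re.findall(r'^[A-Z][A-Z]+', lines, flags=re.M)
two_or_more_b = re.findall(r'^[A-Z]{2,}', lines, flags=re.M)
whole_word = re.findall(r'^[A-Z]+\b', lines, flags=re.M)

print(two_or_more_a)  # ['PROSPERO', 'PR']
print(two_or_more_b)  # ['PROSPERO', 'PR']
print(whole_word)     # ['PROSPERO', 'A']
```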

Who have we missed?

In [None]:
re.findall(r'^\s*[A-Z]+\b', characters, flags=re.M)

OK, but we want to exclude the leading spaces from the names of the indented characters. How can we do this?

In [None]:
re.findall(r'^\s*([A-Z]+\b)', characters, flags=re.M)
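The parentheses matter because `re.findall` changes what it returns when the pattern contains a capture group: with exactly one group, it returns only the text of that group rather than the whole match. A small sketch on a made-up three-line listing (hypothetical sample data):

```python
import re

# Hypothetical sample with two indented names.
listing = "PROSPERO, a magician\n  ADRIAN\n  FRANCISCO\n"

# No group: findall returns the whole match, leading spaces included.
without_group = re.findall(r'^\s*[A-Z]+\b', listing, flags=re.M)

# One group: findall returns only what the group captured.
with_group = re.findall(r'^\s*([A-Z]+\b)', listing, flags=re.M)

print(without_group)  # ['PROSPERO', '  ADRIAN', '  FRANCISCO']
print(with_group)     # ['PROSPERO', 'ADRIAN', 'FRANCISCO']
```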

Now we also want to capture the description of each character.

In [None]:
print(characters)

In [None]:
re.findall(r'^\s*([A-Z]+\b), (.*)$', characters, flags=re.M)

What's the problem here? Who is missing from the list, and why?

In [14]:
re.findall(r'^\s*([A-Z]+\b)(?:, )?(.*)?$', characters, flags=re.M)

[('PROSPERO',
  'the former duke of Milan, now a magician on a Mediterranean island'),
 ('MIRANDA', "Prospero's daughter"),
 ('ARIEL', 'a spirit, servant to Prospero'),
 ('CALIBAN', 'an inhabitant of the island, servant to Prospero'),
 ('FERDINAND', 'prince of Naples'),
 ('ALONSO', 'king of Naples'),
 ('ANTONIO', "duke of Milan and Prospero's brother"),
 ('SEBASTIAN', "Alonso's brother"),
 ('GONZALO', 'councillor to Alonso and friend to Prospero'),
 ('ADRIAN', ''),
 ('FRANCISCO', ''),
 ('TRINCULO', 'servant to Alonso'),
 ('STEPHANO', "Alonso's butler"),
 ('SHIPMASTER', ''),
 ('BOATSWAIN', ''),
 ('MARINERS', '')]

Here's an alternative solution.

In [15]:
re.findall(r'^\s*([A-Z]+\b)(?:, (.*))?$', characters, flags=re.M)

[('PROSPERO',
  'the former duke of Milan, now a magician on a Mediterranean island'),
 ('MIRANDA', "Prospero's daughter"),
 ('ARIEL', 'a spirit, servant to Prospero'),
 ('CALIBAN', 'an inhabitant of the island, servant to Prospero'),
 ('FERDINAND', 'prince of Naples'),
 ('ALONSO', 'king of Naples'),
 ('ANTONIO', "duke of Milan and Prospero's brother"),
 ('SEBASTIAN', "Alonso's brother"),
 ('GONZALO', 'councillor to Alonso and friend to Prospero'),
 ('ADRIAN', ''),
 ('FRANCISCO', ''),
 ('TRINCULO', 'servant to Alonso'),
 ('STEPHANO', "Alonso's butler"),
 ('SHIPMASTER', ''),
 ('BOATSWAIN', ''),
 ('MARINERS', '')]
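Both patterns give the same result on this character list, but they are not interchangeable. In the first, the comma and the description are optional independently of each other, so a name followed by any text will match. In the second, a description is accepted only if it follows `, `. The sketch below illustrates this with a made-up line that has no comma after the name (hypothetical input, not from the play file).

```python
import re

loose = r'^\s*([A-Z]+\b)(?:, )?(.*)?$'   # comma and description optional separately
strict = r'^\s*([A-Z]+\b)(?:, (.*))?$'   # description allowed only after ", "

# Hypothetical input: the second line has no comma after the name.
listing = "MIRANDA, Prospero's daughter\nSHIPMASTER of the vessel\nBOATSWAIN\n"

loose_pairs = re.findall(loose, listing, flags=re.M)
strict_pairs = re.findall(strict, listing, flags=re.M)

print(loose_pairs)
# [('MIRANDA', "Prospero's daughter"), ('SHIPMASTER', ' of the vessel'), ('BOATSWAIN', '')]
print(strict_pairs)
# [('MIRANDA', "Prospero's daughter"), ('BOATSWAIN', '')]
```

Which behaviour is preferable depends on the data: the looser pattern tolerates irregular lines, while the stricter one skips them.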

Now that we have the names of the characters and their descriptions, let's create a dict mapping names to descriptions.  How can we do that?

In [16]:
names = dict(re.findall(r'^\s*([A-Z]+\b)(?:, (.*))?$', characters, flags=re.M))

In [17]:
names.keys()

dict_keys(['PROSPERO', 'MIRANDA', 'ARIEL', 'CALIBAN', 'FERDINAND', 'ALONSO', 'ANTONIO', 'SEBASTIAN', 'GONZALO', 'ADRIAN', 'FRANCISCO', 'TRINCULO', 'STEPHANO', 'SHIPMASTER', 'BOATSWAIN', 'MARINERS'])

In [18]:
names['GONZALO']

'councillor to Alonso and friend to Prospero'
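`dict()` works here because `re.findall` returns a list of 2-tuples when the pattern has two groups, and `dict()` accepts any iterable of key-value pairs. Characters without a description end up with an empty string as their value. If an explicit "no description" marker is preferred, one possible variation is a dict comprehension that maps empty strings to `None`. The `listing` string and the name `described` below are illustrative only.

```python
import re

# Hypothetical three-line sample in the same format as the character list.
listing = "GONZALO, councillor to Alonso\n  ADRIAN\nMARINERS\n"
pairs = re.findall(r'^\s*([A-Z]+\b)(?:, (.*))?$', listing, flags=re.M)

# Each pair becomes one key-value entry.
names = dict(pairs)

# Variation: store None instead of '' when there is no description.
described = {name: (desc or None) for name, desc in pairs}

print(names['ADRIAN'])       # prints an empty line
print(described['ADRIAN'])   # None
print(described['GONZALO'])  # councillor to Alonso
```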