#Text analysis
A lot of people's projects involve text. Whether it's extracting numerical data from text, or dealing with text directly, string manipulation is essential for any number of data analysis projects. 

Here, we're going to take a walk over to the humanities and see if we can learn anything new about everyone's favorite author... William Shakespeare. If we all had to learn it, so should a computer, right?

1. Should _Othello_ really be called _Iago_?

2. Can a computer learn the difference between a comedy and a tragedy?

3. Who is the most verbose Shakespearean character?

4. Who has the largest vocabulary? 

5. Did the complexity of Shakespeare's vocabulary change over time?

6. Which is Shakespeare's most feminist play?

We could think of any number of quantitative questions to pursue, each requiring slightly different skills and analytical depth. But let's start with something easy and build our way up. 

First things first... **ALWAYS LOOK AT YOUR DATA**. Now is the time to open `'../Data/Shakespeare.txt'` and get a sense of the file formatting. (You can do this quickly by just clicking: [Shakespeare.txt](../Data/Shakespeare.txt)).

For this first part, I'm interested only in 'Othello'. So look at where Othello begins in the file: how are we going to extract Othello and only Othello?

Let's start by reading in the file using Python:

In [1]:
complete_works = open('../Data/Shakespeare.txt').readlines()
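
The one-liner above never closes the file handle. A slightly more defensive version is sketched below. The helper name `read_lines` is made up for illustration, and the UTF-8 encoding is an assumption about the file, so adjust `encoding` if your copy differs. The demo writes a tiny throwaway file so that the example runs anywhere.

```python
import os
import tempfile

def read_lines(path, encoding='utf-8'):
    """Return all lines of a text file; the with-block closes the file."""
    with open(path, encoding=encoding) as handle:
        return handle.readlines()

# Demonstrate on a small temporary file (a stand-in for Shakespeare.txt).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        handle.write('THE TRAGEDY OF OTHELLO, MOOR OF VENICE\n\nby William Shakespeare\n')
    demo_lines = read_lines(demo_path)

print(len(demo_lines))
```

With the real data this would be called as `complete_works = read_lines('../Data/Shakespeare.txt')`.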

This is pretty big, and has a lot of stuff we don't want if all we care about is Othello.

**Exercise:** Write code to extract from `complete_works` only the lines pertaining to Othello.

In [2]:
othello_lines = []
###Place your code here

###Answer
othello = False
for line in complete_works:
    if 'THE TRAGEDY OF OTHELLO' in line:
        othello = True
    if othello:
        othello_lines.append(line)
        if 'THE END' in line:
            othello = False
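
The same start-flag/stop-flag idea can be wrapped in a function so it can be pointed at other plays. This is only a sketch: `extract_play` and the toy input are invented for illustration, and it assumes (as the answer above does) that every play ends with a line containing `'THE END'`. Unlike the loop above, it stops at the first end marker instead of scanning the rest of the file.

```python
def extract_play(lines, start_marker, end_marker='THE END'):
    """Collect lines from the first line containing start_marker through
    the next line containing end_marker (inclusive)."""
    play_lines = []
    inside = False
    for line in lines:
        if not inside and start_marker in line:
            inside = True
        if inside:
            play_lines.append(line)
            if end_marker in line:
                break  # stop at the first end marker after the title
    return play_lines

# Toy stand-in for complete_works (not real file contents).
toy_works = [
    'THE END\n',
    'THE TRAGEDY OF OTHELLO, MOOR OF VENICE\n',
    '  IAGO. some dialogue\n',
    'THE END\n',
    'THE TRAGEDY OF SOMETHING ELSE\n',
]
toy_othello = extract_play(toy_works, 'THE TRAGEDY OF OTHELLO')
print(toy_othello)
```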

Before moving on, always make sure you really have what you _think_ you have. 

In [3]:
print(othello_lines[0:20])
print('**************')
print(othello_lines[-20:])

['THE TRAGEDY OF OTHELLO, MOOR OF VENICE\n', '\n', 'by William Shakespeare\n', '\n', '\n', '\n', 'Dramatis Personae\n', '\n', '  OTHELLO, the Moor, general of the Venetian forces\n', '  DESDEMONA, his wife\n', '  IAGO, ensign to Othello\n', '  EMILIA, his wife, lady-in-waiting to Desdemona\n', '  CASSIO, lieutenant to Othello\n', '  THE DUKE OF VENICE\n', '  BRABANTIO, Venetian Senator, father of Desdemona\n', '  GRATIANO, nobleman of Venice, brother of Brabantio\n', '  LODOVICO, nobleman of Venice, kinsman of Brabantio\n', '  RODERIGO, rejected suitor of Desdemona\n', '  BIANCA, mistress of Cassio\n', '  MONTANO, a Cypriot official\n']
**************
["  GRATIANO.                  All that's spoke is marr'd.\n", "  OTHELLO. I kiss'd thee ere I kill'd thee. No way but this,\n", '    Killing myself, to die upon a kiss.\n', '                                          Falls on the bed, and dies.\n', '  CASSIO. This did I fear, but thought he had no weapon;\n', '    For he was great of hear

#Characters
So we should all have a line-by-line reading of Othello. Now what? Well, your choice here is going to depend a lot on what you want to analyze! I'm particularly interested in knowing whether Othello or Iago is smarter. Specifically, what is the size of their vocabulary? And how does it compare to other characters?

To answer this question, we'll need a list of characters.

**Exercise:** Write code to extract all of the named characters as a list (hint: [Dramatis Personae](https://en.wikipedia.org/wiki/Dramatis_person%C3%A6))

In [4]:
character_list = []
###Place your code here


###Answer
character_list = []
region_of_interest = False
for line in othello_lines:
    if 'Dramatis Personae' in line:
        region_of_interest = True
    if region_of_interest:
        if len(line.strip()) > 0:
            if 'ELECTRONIC' in line:
                region_of_interest = False
                continue
            else:
                character_list.append(line.strip())
character_list = character_list[1:]

How do we look?

In [5]:
print(character_list)

['OTHELLO, the Moor, general of the Venetian forces', 'DESDEMONA, his wife', 'IAGO, ensign to Othello', 'EMILIA, his wife, lady-in-waiting to Desdemona', 'CASSIO, lieutenant to Othello', 'THE DUKE OF VENICE', 'BRABANTIO, Venetian Senator, father of Desdemona', 'GRATIANO, nobleman of Venice, brother of Brabantio', 'LODOVICO, nobleman of Venice, kinsman of Brabantio', 'RODERIGO, rejected suitor of Desdemona', 'BIANCA, mistress of Cassio', 'MONTANO, a Cypriot official', 'A Clown in service to Othello', 'Senators, Sailors, Messengers, Officers, Gentlemen, Musicians, and', 'Attendants']
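
Each entry still carries its description. If we only want the names, one option is to keep the text before the first comma and accept it only when it is written in capitals. This is a rough sketch: `leading_name` is a made-up helper, and the entries are copied from the output above.

```python
def leading_name(entry):
    """Return the text before the first comma if it is all capitals, else None."""
    candidate = entry.split(',')[0].strip()
    return candidate if candidate.isupper() else None

sample_entries = [
    'OTHELLO, the Moor, general of the Venetian forces',
    'THE DUKE OF VENICE',
    'A Clown in service to Othello',
    'Attendants',
]
names = [name for name in map(leading_name, sample_entries) if name is not None]
print(names)
```

Note that this rule drops the Clown and the group entries, so whether it is the right rule depends on what we want to count as a character.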


#Dialogue
Okay, so we have a list of characters. Now we want to know who says what. This is going to take a little bit of work. Because this is a play, things are formatted pretty well! But not perfectly... not all of the text is spoken, and we don't really care about stage directions or scene numbers or all manner of other stuff that could gunk up our analysis. However, because we looked at our data, it appears that all spoken text follows a similar format:
```
"  CHARACTER. blahblahblah
     blahblahblah"
```
Two things jump out at me. How about you?

1. Before a character speaks there are always (usually?) *2* spaces, followed by the character's name in all capital letters, followed by a period. 
2. It also looks like some speeches are longer than a single line, but the continuation lines start with 4 spaces(!). We'll make note of that for later.
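
Those two observations can be turned into a small line classifier. This is a sketch under the assumptions stated in the list (exactly 2 spaces before a speaker, exactly 4 before a continuation); `classify_line` is a made-up name, and the example lines are taken from the play text shown in this notebook.

```python
def classify_line(line):
    """Label a raw line as 'speaker', 'continuation', or 'other'."""
    text = line.strip()
    if not text:
        return 'other'                       # blank line
    indent = len(line) - len(line.lstrip(' '))
    if indent == 2 and text[0].isupper() and '.' in text:
        return 'speaker'                     # 2 spaces, capital letter, a period
    if indent == 4:
        return 'continuation'                # 4 spaces: speech carries on
    return 'other'                           # e.g. deeply indented stage directions

examples = [
    "  LODOVICO. Now here's another discontented paper,\n",
    '    Found in his pocket too; and this, it seems,\n',
    '\n',
    '                                          Falls on the bed, and dies.\n',
]
labels = [classify_line(line) for line in examples]
print(labels)
```

The speaker test is deliberately loose: anything indented by two spaces that starts with a capital and contains a period will pass, so it still needs tightening before it can be trusted on the whole play.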

In [20]:
sample_text = othello_lines[3825:3891]
for line in sample_text:
    print(line)

  OTHELLO. O villain!

  CASSIO.             Most heathenish and most gross!

  LODOVICO. Now here's another discontented paper,

    Found in his pocket too; and this, it seems,

    Roderigo meant to have sent this damned villain;

    But that, belike, Iago in the interim

    Came in and satisfied him.

  OTHELLO.                     O the pernicious caitiff!

    How came you, Cassio, by that handkerchief

    That was my wife's?

  CASSIO.               I found it in my chamber;

    And he himself confess'd but even now

    That there he dropp'd it for a special purpose

    Which wrought to his desire.

  OTHELLO.                       O fool! fool! fool!

  CASSIO. There is besides in Roderigo's letter,

    How he upbraids Iago, that he made him

    Brave me upon the watch, whereon it came

    That I was cast. And even but now he spake

    After long seeming dead, Iago hurt him,

    Iago set him on.

  LODOVICO. You must forsake this room, and go with us.

    Your power

**Exercise:** Given the `sample_text` above, assign each line of dialogue to the character who speaks it, in the form of a dictionary.

In [47]:
dialogue_dictionary = {}
character = ''

###Place your code here
for line in sample_text:
    if line[:2] == '  ':
        if line[2:4] != '  ':
            character = line.strip().split('.')[0]
            try:
                dialogue_dictionary[character].append(line.strip(character+'.').strip())
            except KeyError:
                dialogue_dictionary[character] = [line.strip(character+'.').strip()]
        else:
            dialogue_dictionary[character].append(line.strip(character+'.').strip())
     

In [48]:
dialogue_dictionary

{'CASSIO': ['CASSIO.             Most heathenish and most gross!',
  'CASSIO.               I found it in my chamber;',
  "And he himself confess'd but even now",
  "That there he dropp'd it for a special purpose",
  'Which wrought to his desire.',
  "CASSIO. There is besides in Roderigo's letter,",
  'How he upbraids Iago, that he made him',
  'Brave me upon the watch, whereon it came',
  'That I was cast. And even but now he spake',
  'After long seeming dead, Iago hurt him,',
  'Iago set him on.',
  'CASSIO. This did I fear, but thought he had no weapon;',
  'For he was great of heart.'],
 'GRATIANO': ["GRATIANO.                  All that's spoke is marr'd."],
 'LODOVICO': ["LODOVICO. Now here's another discontented paper,",
  'Found in his pocket too; and this, it seems,',
  'Roderigo meant to have sent this damned villain;',
  'But that, belike, Iago in the interim',
  'Came in and satisfied him.',
  'LODOVICO. You must forsake this room, and go with us.',
  'Your power and your c
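
Notice that the speaker label (for example `'CASSIO.'`) is still sitting at the front of the stored text. That is because `str.strip(chars)` treats its argument as a set of characters to trim from both ends, not as a prefix. Each raw line starts with spaces and ends with a newline, so nothing matches and nothing gets trimmed. One possible fix, sketched below, splits once at the first period with `str.partition`. The function name `build_dialogue` is invented for illustration, and the toy lines are copied from the sample text.

```python
from collections import defaultdict

def build_dialogue(lines):
    """Map each speaker to a list of spoken lines, without the speaker label."""
    dialogue = defaultdict(list)
    speaker = None
    for line in lines:
        if not line.startswith('  ') or not line.strip():
            continue                         # not part of a speech
        if line[2:4] != '  ':
            # Speaker line: split once at the first period.
            label, _, spoken = line.strip().partition('.')
            speaker = label
            if spoken.strip():
                dialogue[speaker].append(spoken.strip())
        elif speaker is not None:
            # Continuation line: belongs to whoever spoke last.
            dialogue[speaker].append(line.strip())
    return dict(dialogue)

toy_lines = [
    '  OTHELLO. O villain!\n',
    '  CASSIO.             Most heathenish and most gross!\n',
    "  LODOVICO. Now here's another discontented paper,\n",
    '    Found in his pocket too; and this, it seems,\n',
]
toy_dialogue = build_dialogue(toy_lines)
print(toy_dialogue)
```

Using `defaultdict(list)` also removes the need for the `try`/`except` around the first append.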

The code that we wrote above should (might?) be generalizable. But who knows. We'll have to check slowly.

Let's first check whether our method of identifying the speaker works beyond our sample, starting with the first 100 lines of the play:

In [51]:
for line in othello_lines[:100]: #No reason to work with the full text while we're still learning
    if line[:2] == '  ':
        print(line)

  OTHELLO, the Moor, general of the Venetian forces

  DESDEMONA, his wife

  IAGO, ensign to Othello

  EMILIA, his wife, lady-in-waiting to Desdemona

  CASSIO, lieutenant to Othello

  THE DUKE OF VENICE

  BRABANTIO, Venetian Senator, father of Desdemona

  GRATIANO, nobleman of Venice, brother of Brabantio

  LODOVICO, nobleman of Venice, kinsman of Brabantio

  RODERIGO, rejected suitor of Desdemona

  BIANCA, mistress of Cassio

  MONTANO, a Cypriot official

  A Clown in service to Othello

  Senators, Sailors, Messengers, Officers, Gentlemen, Musicians, and

    Attendants

  RODERIGO. Tush, never tell me! I take it much unkindly

    That thou, Iago, who hast had my purse

    As if the strings were thine, shouldst know of this.

  IAGO. 'Sblood, but you will not hear me.

    If ever I did dream of such a matter,

    Abhor me.

  RODERIGO. Thou told'st me thou didst hold him in thy hate.

  IAGO. Despise me, if I do not. Three great ones of the city,

    In personal suit t

As fun as all of that is, it's mostly irrelevant. The whole beginning is just our Dramatis Personae, so let's figure out where we really need to start paying attention:

In [7]:
for line in enumerate(othello_lines[:100]):
    print(line)

(0, 'THE TRAGEDY OF OTHELLO, MOOR OF VENICE\n')
(1, '\n')
(2, 'by William Shakespeare\n')
(3, '\n')
(4, '\n')
(5, '\n')
(6, 'Dramatis Personae\n')
(7, '\n')
(8, '  OTHELLO, the Moor, general of the Venetian forces\n')
(9, '  DESDEMONA, his wife\n')
(10, '  IAGO, ensign to Othello\n')
(11, '  EMILIA, his wife, lady-in-waiting to Desdemona\n')
(12, '  CASSIO, lieutenant to Othello\n')
(13, '  THE DUKE OF VENICE\n')
(14, '  BRABANTIO, Venetian Senator, father of Desdemona\n')
(15, '  GRATIANO, nobleman of Venice, brother of Brabantio\n')
(16, '  LODOVICO, nobleman of Venice, kinsman of Brabantio\n')
(17, '  RODERIGO, rejected suitor of Desdemona\n')
(18, '  BIANCA, mistress of Cassio\n')
(19, '  MONTANO, a Cypriot official\n')
(20, '  A Clown in service to Othello\n')
(21, '  Senators, Sailors, Messengers, Officers, Gentlemen, Musicians, and\n')
(22, '    Attendants\n')
(23, '\n')
(24, '\n')
(25, '\n')
(26, '\n')
(27, '<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM\n')
(28, 'S

Looks like the real play doesn't start until line 38. So we'll start there from now on. 
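
Hard-coding an index found by eye works for one play, but it won't carry over to the next one. A hedged alternative is to search for the first line that looks like a speaker line. The pattern below (two spaces, a run of capitals and spaces, then a period) is an assumption based on the formatting we have seen so far, `first_speech_index` is a made-up helper, and the toy list only imitates the real file.

```python
import re

# Two spaces, a run of capital letters and spaces, then a period.
SPEAKER_PATTERN = re.compile(r'^ {2}[A-Z][A-Z ]*\.')

def first_speech_index(lines):
    """Return the index of the first line that looks like a speaker line, or None."""
    for index, line in enumerate(lines):
        if SPEAKER_PATTERN.match(line):
            return index
    return None

toy_play = [
    'Dramatis Personae\n',
    '  IAGO, ensign to Othello\n',
    '  THE DUKE OF VENICE\n',
    '\n',
    '  RODERIGO. Tush, never tell me! I take it much unkindly\n',
]
print(first_speech_index(toy_play))
```

On the real `othello_lines` it is worth comparing the result against the index found by eye before trusting it.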

In [8]:
for line in othello_lines[38:100]:
    if line[:2] == '  ':
        print(line)

  RODERIGO. Tush, never tell me! I take it much unkindly

    That thou, Iago, who hast had my purse

    As if the strings were thine, shouldst know of this.

  IAGO. 'Sblood, but you will not hear me.

    If ever I did dream of such a matter,

    Abhor me.

  RODERIGO. Thou told'st me thou didst hold him in thy hate.

  IAGO. Despise me, if I do not. Three great ones of the city,

    In personal suit to make me his lieutenant,

    Off-capp'd to him; and, by the faith of man,

    I know my price, I am worth no worse a place.

    But he, as loving his own pride and purposes,

    Evades them, with a bumbast circumstance

    Horribly stuff'd with epithets of war,

    And, in conclusion,

    Nonsuits my mediators; for, "Certes," says he,

    "I have already chose my officer."

    And what was he?

    Forsooth, a great arithmetician,

    One Michael Cassio, a Florentine

    (A fellow almost damn'd in a fair wife)

    That never set a squadron in the field,

    Nor the divi

That's not so bad, but there is still a lot that we need to exclude. Remember, a line that opens a speech starts with two, and only two, spaces:

In [11]:
for line in othello_lines[38:100]:
    if line[:2] == '  ':
        if line[2:4] != '  ':
            print(line)

  RODERIGO. Tush, never tell me! I take it much unkindly

  IAGO. 'Sblood, but you will not hear me.

  RODERIGO. Thou told'st me thou didst hold him in thy hate.

  IAGO. Despise me, if I do not. Three great ones of the city,

  RODERIGO. By heaven, I rather would have been his hangman.

  IAGO. Why, there's no remedy. 'Tis the curse of service,

  RODERIGO.           I would not follow him then.

  IAGO. O, sir, content you.



We're getting better. Much better. Time to start compiling a list of characters for our little subset.

In [12]:
abbrev_characters = []
for line in othello_lines[38:100]:
    if line[:2] == '  ':
        if line[2:4] != '  ':
            temp_character = line.split('.')[0].strip()
            abbrev_characters.append(temp_character)

In [13]:
print(len(abbrev_characters))
print(abbrev_characters)
print(set(abbrev_characters))

8
['RODERIGO', 'IAGO', 'RODERIGO', 'IAGO', 'RODERIGO', 'IAGO', 'RODERIGO', 'IAGO']
{'IAGO', 'RODERIGO'}


Looks good to me. Let's see how we do on the whole shebang. 

In [14]:
abbrev_characters = []
for line in othello_lines[38:]:
    if line[:2] == '  ':
        if line[2:4] != '  ':
            temp_character = line.split('.')[0].strip()
            abbrev_characters.append(temp_character)

This list is probably pretty long, because every speech adds its speaker again and the same names repeat. But remember, we can look at only the unique items pretty easily. 

In [15]:
print(len(abbrev_characters))
print(set(abbrev_characters))

1183
{'FIRST MUSICIAN', 'Retires', 'SECOND GENTLEMAN', 'Faints', 'Exeunt', 'MESSENGER', 'FOURTH GENTLEMAN', 'LODOVICO', 'RODERIGO', 'THIRD GENTLEMAN', 'BIANCA', 'FIRST GENTLEMAN', 'MONTANO', 'FIRST OFFICER', 'IAGO', 'EMILIA', 'DESDEMONA', 'DUKE', 'CASSIO', 'SECOND SENATOR', 'GENTLEMEN', 'OTHELLO', 'GRATIANO', 'BRABANTIO', 'SAILOR', 'FIRST SENATOR', 'ALL', 'HERALD', 'CLOWN'}
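
The repeats are not wasted, by the way. Counting them tells us how many speeches each label opens, which is a first step toward the "most verbose character" question. A minimal sketch with `collections.Counter`, using the eight labels we collected from the short subset earlier:

```python
from collections import Counter

# Speaker labels from the 62-line subset printed earlier.
subset_speakers = ['RODERIGO', 'IAGO', 'RODERIGO', 'IAGO',
                   'RODERIGO', 'IAGO', 'RODERIGO', 'IAGO']

speech_counts = Counter(subset_speakers)
for name, count in speech_counts.most_common():
    print(name, count)
```

The same call on the full `abbrev_characters` list would rank every label, stage directions included, so it is best run after those have been filtered out.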


That looks pretty good, but a few stage directions ('Exeunt', 'Retires', 'Faints') slipped in. Character names are always written in capital letters, which gives us one more filter: 

In [16]:
abbrev_characters = []
for line in othello_lines[38:]:
    if line[:2] == '  ':
        if line[2:4] != '  ': 
            temp_character = line.split('.')[0].strip()
            if temp_character.upper() == temp_character:
                abbrev_characters.append(temp_character)

In [17]:
print(set(abbrev_characters))
print(len(set(abbrev_characters)))
print(len(character_list))

{'CLOWN', 'SECOND GENTLEMAN', 'MESSENGER', 'FOURTH GENTLEMAN', 'LODOVICO', 'RODERIGO', 'THIRD GENTLEMAN', 'BIANCA', 'FIRST GENTLEMAN', 'MONTANO', 'FIRST OFFICER', 'IAGO', 'EMILIA', 'DESDEMONA', 'DUKE', 'CASSIO', 'SECOND SENATOR', 'GENTLEMEN', 'OTHELLO', 'GRATIANO', 'BRABANTIO', 'SAILOR', 'FIRST SENATOR', 'ALL', 'HERALD', 'FIRST MUSICIAN'}
26
15
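
The two counts disagree, so it helps to see which names appear on only one side. Set differences do that directly. The sets below are small hand-picked examples taken from the outputs above, not the full lists.

```python
# Names as they appear in the Dramatis Personae (a few examples).
listed_names = {'OTHELLO', 'IAGO', 'THE DUKE OF VENICE'}

# Labels as they appear in front of speeches (a few examples).
speaking_names = {'OTHELLO', 'IAGO', 'DUKE', 'FIRST SENATOR'}

only_speaking = sorted(speaking_names - listed_names)
only_listed = sorted(listed_names - speaking_names)
print(only_speaking)
print(only_listed)
```

Here the Duke shows up under two different labels, and the First Senator speaks without being listed individually, which is the kind of mismatch that explains part of the gap.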


Probably (certainly) not perfect, but good enough to move on. Let's not lose sight of the prize. We want to extract the dialogue for each character!
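
To keep the end goal concrete: once every character maps to a list of spoken lines, vocabulary size is just the number of distinct words. The sketch below assumes a dictionary shaped like `dialogue_dictionary` but with the speaker labels already removed. The `vocabulary` helper and the simple tokenizing rule (lowercase letters, with internal apostrophes kept) are illustrative choices, not the only reasonable ones.

```python
import re

def vocabulary(spoken_lines):
    """Return the set of distinct lowercase words in a list of spoken lines."""
    words = set()
    for line in spoken_lines:
        # Runs of letters, allowing internal apostrophes such as "kiss'd".
        words.update(re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower()))
    return words

toy_dialogue = {
    'OTHELLO': ['O villain!', 'O fool! fool! fool!'],
    'CASSIO': ['Most heathenish and most gross!'],
}
vocabulary_sizes = {name: len(vocabulary(lines))
                    for name, lines in toy_dialogue.items()}
print(vocabulary_sizes)
```

Keep in mind that a character who speaks more has more chances to use new words, so raw vocabulary sizes are not directly comparable between a lead and a minor role.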

**Exercise:** Our first step towards analyzing Othello was to extract the Othello-specific text from the complete works file, which we did in a pretty arbitrary way to make our lives easier. How would you do it in an automated fashion? Specifically, read the following file and return a _list_ containing the names of every play within the file. 

In [None]:
complete_works = open('../Data/Shakespeare.txt').readlines()
play_list = []
###Your code here


**Exercise:** How well does the code that we wrote above work on Hamlet? Romeo and Juliet?