#Text analysis
Whether it's extracting numerical data from text, or dealing with text directly, the ability to manipulate text in the form of strings is essential for any number of data science projects. 

Here, we're going to take a walk over to the humanities and see if we can learn anything new about everyone's favorite author... William Shakespeare (if we all had to learn it, so should a computer, right?).

1. Should "Othello" really be called "Iago"?

2. Can a computer learn the difference between a comedy and a tragedy?

3. Who is the most verbose Shakespearean character?

4. Who has the largest vocabulary? 

5. Did the complexity of Shakespeare's vocabulary change over time?

6. Which is Shakespeare's most feminist play?

We could think of any number of quantitative questions to pursue, each requiring slightly different skills and analytical depth. But we'll start with something easy and build our way up. And by the end you should have the skills to answer some of the above questions. 

#The Data
Where are we going to get data from? [Project Gutenberg](http://www.gutenberg.org) has thousands of out-of-copyright books freely available in the form of easy-to-use `.txt` files. Luckily for us, this includes the complete works of William Shakespeare, which we've pre-downloaded for you.

First things first, **ALWAYS LOOK AT YOUR DATA**. Now is the time to open `'../Data/Shakespeare.txt'` and get a sense for the file formatting. (You can do this quickly by just clicking: [Shakespeare.txt](../Data/Shakespeare.txt)).

1. How are different plays separated from one another?
2. How is dialogue formatted?
3. What extraneous information might we want to ignore?
4. How "well behaved" is our dataset? (i.e. is the formatting general or unique for different plays?)

These are all important questions that we can only answer through qualitative visual exploration.

#Text parsing review
Before we dive into this large file, let's do a brief refresher on some text parsing basics which will give us an excuse to spoil Hamlet for you. 

In [1]:
spoiler_alert = ["O, I die, Horatio!\n",
    "The potent poison quite o'ercrows my spirit.\n",
    "I cannot live to hear the news from England,\n",
    "     But I do prophesy th' election lights\n",
    "On Fortinbras. He has my dying voice.\n",
    "So tell him, with th' occurrents, more and less,\n",
    "\n",             
    "Which have solicited- the rest is silence.             Dies.\n"]

If we didn't know what was in `spoiler_alert`, we would have to iterate through the lines to see what the data looks like:

In [2]:
for line in spoiler_alert:
    print(line)

O, I die, Horatio!

The potent poison quite o'ercrows my spirit.

I cannot live to hear the news from England,

     But I do prophesy th' election lights

On Fortinbras. He has my dying voice.

So tell him, with th' occurrents, more and less,



Which have solicited- the rest is silence.             Dies.



##Removing extraneous characters
Remember, `\n` is the newline character, so it isn't shown on the screen as text. It is still there: `print` outputs it as a line break, and then `print` adds a line break of its own at the end of each call, which is why we have double spacing going on here (one line break comes from the `\n` in the string, the other from `print` itself).

We can get rid of it easily enough though:

In [3]:
for line in spoiler_alert:
    print(line.strip('\n'))

O, I die, Horatio!
The potent poison quite o'ercrows my spirit.
I cannot live to hear the news from England,
     But I do prophesy th' election lights
On Fortinbras. He has my dying voice.
So tell him, with th' occurrents, more and less,

Which have solicited- the rest is silence.             Dies.


Now the `\n` is really gone, and our text isn't double spaced. But remember, we're not actually changing `spoiler_alert`. We're only removing the `\n` temporarily for each line in order to print it. Thus:

In [4]:
for line in spoiler_alert:
    print(line)

O, I die, Horatio!

The potent poison quite o'ercrows my spirit.

I cannot live to hear the news from England,

     But I do prophesy th' election lights

On Fortinbras. He has my dying voice.

So tell him, with th' occurrents, more and less,



Which have solicited- the rest is silence.             Dies.



The `\n` characters are still there! That's okay for now, I just wanted to remind you of this crucial little fact about working with `for` loops.
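If you do want to keep the cleaned-up lines, the result has to be stored somewhere, because strings are immutable and `strip` hands back a new string. A minimal sketch, using a shortened stand-in list (`sample_lines` and `cleaned_lines` are names made up for this example):

```python
# A shortened stand-in for spoiler_alert (hypothetical sample data)
sample_lines = ["O, I die, Horatio!\n",
                "     But I do prophesy th' election lights\n"]

# strip() returns a new string, so collect the results in a new list
cleaned_lines = [line.strip('\n') for line in sample_lines]

print(cleaned_lines)
print(sample_lines)  # the original list is unchanged
```

The list comprehension is just a compact `for` loop that appends each stripped line to a new list.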

So `str.strip('\n')` removed the `\n` characters. But there was still that weird spacing before "But I do prophesy...". 

We can just say `str.strip()` and by default it will remove _all_ whitespace, which includes `\t` (tab), `\n` (new line), and `' '` (space), from both the right _and_ left ends (`str.lstrip()` and `str.rstrip()` remove characters from only one end at a time):

In [7]:
for line in spoiler_alert:
    print(line.strip())

O, I die, Horatio!
The potent poison quite o'ercrows my spirit.
I cannot live to hear the news from England,
But I do prophesy th' election lights
On Fortinbras. He has my dying voice.
So tell him, with th' occurrents, more and less,

Which have solicited- the rest is silence.             Dies.


Remember, this doesn't remove anything from the center of a line. There is still a ton of space before 'Dies', for instance. `strip()` will only remove from the ends. Not a problem for us but something else to remember. 
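To see the one-sided variants side by side, and one possible way to deal with the whitespace in the middle, here is a small sketch. The string `messy` is a made-up sample, and `repr` is used only so that the whitespace is visible when printed:

```python
# Hypothetical sample line with leading, trailing, and interior whitespace
messy = "   the rest is silence.             Dies.\n"

print(repr(messy.lstrip()))   # leading whitespace removed only
print(repr(messy.rstrip()))   # trailing whitespace (including \n) removed only
print(repr(messy.strip()))    # both ends removed

# split() with no argument splits on any run of whitespace,
# so joining the pieces with single spaces collapses interior gaps too
collapsed = ' '.join(messy.split())
print(repr(collapsed))
```

Whether collapsing interior whitespace is a good idea depends on the task; for counting words it usually does no harm.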

##Getting line (index) numbers
Another handy thing that we may or may not have covered by now is `enumerate`. Suppose I wanted to know the line numbers: it's pretty trivial with this small little list but we won't always work with small lists. For this, `enumerate` is a lifesaver. Let's see it in action:

In [8]:
for line in enumerate(spoiler_alert):
    print(line)

(0, 'O, I die, Horatio!\n')
(1, "The potent poison quite o'ercrows my spirit.\n")
(2, 'I cannot live to hear the news from England,\n')
(3, "     But I do prophesy th' election lights\n")
(4, 'On Fortinbras. He has my dying voice.\n')
(5, "So tell him, with th' occurrents, more and less,\n")
(6, '\n')
(7, 'Which have solicited- the rest is silence.             Dies.\n')


`enumerate` took our list of strings `spoiler_alert` and gave us back a tuple for each item! The **first** item in each tuple is the index within the original list `spoiler_alert`, and the **second** item is the actual string. (Strictly speaking, `enumerate` returns an iterator that produces these tuples one at a time rather than a list, but in a `for` loop it behaves the same way.) This may come in handy later, but for now treat it as a brief aside that we'll come back to.
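Two small conveniences are worth knowing here: the tuple can be unpacked directly in the `for` statement, and `enumerate` accepts a `start` argument if you prefer counting from 1. A short sketch with a stand-in list (`sample_lines`, `number`, and `numbered` are names chosen for this example):

```python
sample_lines = ["O, I die, Horatio!\n",
                "The potent poison quite o'ercrows my spirit.\n"]

# Unpack each (index, string) tuple directly in the for statement
for number, line in enumerate(sample_lines):
    print(number, line.strip())

# start=1 gives human-style line numbers instead of 0-based indices
numbered = list(enumerate(sample_lines, start=1))
print(numbered[0])
```

Keep in mind that numbers produced with `start=1` are no longer valid list indices, so stick with the default if you plan to slice the list afterwards.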

##Searching in text
Let's do a little searching:

In [9]:
for line in spoiler_alert:
    if 'The' in line:
        print(line)

The potent poison quite o'ercrows my spirit.



Why aren't line 2 ("I cannot live...") and line 7 ("Which have solicited...") printed to the screen?

**Capitalization matters!**

In [11]:
for line in spoiler_alert:
    if 'the' in line:
        print(line)

I cannot live to hear the news from England,

Which have solicited- the rest is silence.             Dies.



Now we were able to grab lines 2 and 7, but we missed line 1!

Maybe we don't care about capitalization, and just want to know if 'the' appears anywhere within the line. We'll have to somehow standardize things, which we can do by converting the line to all lowercase before searching it:

In [12]:
for line in spoiler_alert:
    if 'the' in line.lower():
        print(line)

The potent poison quite o'ercrows my spirit.

I cannot live to hear the news from England,

Which have solicited- the rest is silence.             Dies.



The same works for uppercase:

In [13]:
for line in spoiler_alert:
    if 'the' in line.upper():
        print(line)

We didn't find anything here? Why not? 

Because we temporarily made the lines uppercase, therefore `the` never appears in any of them! We might have meant to say:

In [14]:
for line in spoiler_alert:
    if 'THE' in line.upper():
        print(line)

The potent poison quite o'ercrows my spirit.

I cannot live to hear the news from England,

Which have solicited- the rest is silence.             Dies.



Is that the only way to find something in a line? Of course not! Programming wouldn't be fun if there weren't 1000 ways to do the same thing. 

In [15]:
for line in spoiler_alert:
    if line.find('the') != -1:
        print(line.strip(), line.find('the'))

I cannot live to hear the news from England, 22
Which have solicited- the rest is silence.             Dies. 22


`find` doesn't just tell us whether the text appears in the string. It tells us exactly where the text first appears in the line (indexing from 0). If it doesn't find our search query, it returns -1. In addition, these methods can be chained together:

In [16]:
for line in spoiler_alert:
    if line.lower().find('the') != -1: #Now we're lowercasing the line first before searching
        print(line.strip(), line.lower().find('the'))

The potent poison quite o'ercrows my spirit. 0
I cannot live to hear the news from England, 22
Which have solicited- the rest is silence.             Dies. 22


`index` works similarly to `find`:

In [17]:
for line in spoiler_alert:
    if line.index('the') != -1:
        print(line.strip(), line.find('the'))

ValueError: substring not found

Except, of course, that this gives us a `ValueError`. Why?

Well, when you use `index`, if the string isn't found it raises an error rather than returning `-1`. So we need to do something slightly different. Remember our old friend try/except?

In [21]:
for line in spoiler_alert:
    try:
        print(line.strip(), line.lower().index('the'))
    except ValueError:
        pass

The potent poison quite o'ercrows my spirit. 0
I cannot live to hear the news from England, 22
Which have solicited- the rest is silence.             Dies. 22


Alright, so we know how to search for substrings inside of strings. And we learned a few different ways. In all cases capitalization really matters, and there are slight differences to each way that you need to keep in mind. 
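One of those differences deserves a concrete example: a substring search for 'the' also matches longer words such as 'There' or 'either'. If you only want the standalone word, a regular expression with word boundaries is one option. This is a sketch rather than the approach used in the rest of this notebook; the sample lines and the name `whole_word_the` are made up for illustration:

```python
import re

sample_lines = ["The potent poison quite o'ercrows my spirit.\n",
                "There is nothing either good or bad\n",
                "I cannot live to hear the news from England,\n"]

# \b marks a word boundary; IGNORECASE makes 'The' and 'the' equivalent
whole_word_the = re.compile(r'\bthe\b', re.IGNORECASE)

substring_hits = [line for line in sample_lines if 'the' in line.lower()]
whole_word_hits = [line for line in sample_lines if whole_word_the.search(line)]

print(len(substring_hits))   # counts 'There' and 'either' as matches too
print(len(whole_word_hits))
```

The substring search finds all three lines, while the word-boundary pattern skips the middle one.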

Remember, it's pretty easy to find the line number where things occurred by using `enumerate`:

In [24]:
for line in enumerate(spoiler_alert):
    if 'the' in line[1].lower(): #We have to look in the string, which is index one in the enumerate tuple
        print('Line number: ', line[0], '**** Line: ', line[1]) #line[0] is the line number, line[1] is the full line

Line number:  1 **** Line:  The potent poison quite o'ercrows my spirit.

Line number:  2 **** Line:  I cannot live to hear the news from England,

Line number:  7 **** Line:  Which have solicited- the rest is silence.             Dies.



So now we know that 'the' appears on lines 1, 2, and 7. This might be useful information for you to have at some point. For instance, what if I only wanted the lines between where the first 'my' occurs and the last? How would I do this?

**Exercise:** Use your favorite method to find lines containing the word 'my'. Next append all lines _between_ (and including!) the lines containing the substring 'my' to their own list.

In [26]:
lines_of_interest = []

###Place your code here



###Answer
lines_with_my = []
for line in enumerate(spoiler_alert):
    if "my" in line[1].lower():
        lines_with_my.append(line[0])
lines_of_interest = spoiler_alert[min(lines_with_my):max(lines_with_my)+1] 
print(lines_of_interest)

["The potent poison quite o'ercrows my spirit.\n", 'I cannot live to hear the news from England,\n', "     But I do prophesy th' election lights\n", 'On Fortinbras. He has my dying voice.\n']



#Splitting strings
Let's get a refresher on splitting strings up a bit. Maybe we actually just want to look at the words in each line as a list rather than working with the entire line as a string:

In [27]:
for line in spoiler_alert:
    print(line.split(','))

['O', ' I die', ' Horatio!\n']
["The potent poison quite o'ercrows my spirit.\n"]
['I cannot live to hear the news from England', '\n']
["     But I do prophesy th' election lights\n"]
['On Fortinbras. He has my dying voice.\n']
['So tell him', " with th' occurrents", ' more and less', '\n']
['\n']
['Which have solicited- the rest is silence.             Dies.\n']


Here we split up each line and made it a list according to where the commas occur. If there were no commas, then the line just became a list with a single element. If there were commas, the string was split into separate strings based on the position of those commas (notice that the commas themselves are gone!).

We could also split our lines up based on spaces to isolate single words (kind of):

In [28]:
for line in spoiler_alert:
    print(line.split(' '))

['O,', 'I', 'die,', 'Horatio!\n']
['The', 'potent', 'poison', 'quite', "o'ercrows", 'my', 'spirit.\n']
['I', 'cannot', 'live', 'to', 'hear', 'the', 'news', 'from', 'England,\n']
['', '', '', '', '', 'But', 'I', 'do', 'prophesy', "th'", 'election', 'lights\n']
['On', 'Fortinbras.', 'He', 'has', 'my', 'dying', 'voice.\n']
['So', 'tell', 'him,', 'with', "th'", 'occurrents,', 'more', 'and', 'less,\n']
['\n']
['Which', 'have', 'solicited-', 'the', 'rest', 'is', 'silence.', '', '', '', '', '', '', '', '', '', '', '', '', 'Dies.\n']


I said _kind of_ because it looks like most of these are single words but it's not perfect. The last word in each line still has that pesky '`\n`', so let's combine commands to get rid of that:

In [29]:
for line in spoiler_alert:
    print(line.strip().split(' '))

['O,', 'I', 'die,', 'Horatio!']
['The', 'potent', 'poison', 'quite', "o'ercrows", 'my', 'spirit.']
['I', 'cannot', 'live', 'to', 'hear', 'the', 'news', 'from', 'England,']
['But', 'I', 'do', 'prophesy', "th'", 'election', 'lights']
['On', 'Fortinbras.', 'He', 'has', 'my', 'dying', 'voice.']
['So', 'tell', 'him,', 'with', "th'", 'occurrents,', 'more', 'and', 'less,']
['']
['Which', 'have', 'solicited-', 'the', 'rest', 'is', 'silence.', '', '', '', '', '', '', '', '', '', '', '', '', 'Dies.']


The operations were performed in order. First we stripped whitespace off the left and right sides of the string. Then, whatever that created was split into a list based on spaces. Now we have a list of words for each line that looks slightly better than before (still errors but we'll come back to that), but what if we wanted a list of words for the entire text rather than each line?

In [30]:
total_list = []
for line in spoiler_alert:
    line_as_list = line.strip().split(' ')
    for word in line_as_list:
        total_list.append(word)
print(total_list)

['O,', 'I', 'die,', 'Horatio!', 'The', 'potent', 'poison', 'quite', "o'ercrows", 'my', 'spirit.', 'I', 'cannot', 'live', 'to', 'hear', 'the', 'news', 'from', 'England,', 'But', 'I', 'do', 'prophesy', "th'", 'election', 'lights', 'On', 'Fortinbras.', 'He', 'has', 'my', 'dying', 'voice.', 'So', 'tell', 'him,', 'with', "th'", 'occurrents,', 'more', 'and', 'less,', '', 'Which', 'have', 'solicited-', 'the', 'rest', 'is', 'silence.', '', '', '', '', '', '', '', '', '', '', '', '', 'Dies.']


This isn't so bad, but as we mentioned there are still some weird things in here that we probably don't want. For instance all of those empty strings that precede 'Dies'. Or maybe some of the punctuation like exclamation points and the like.  

In [None]:
total_list = []
for line in spoiler_alert:
    line_as_list = line.strip().split(' ')
    for word in line_as_list:
        total_list.append(word.rstrip('!'))
print(total_list)

Here we're stripping (from the right side only) the exclamation marks. But we might also want to strip the periods, colons, semi-colons, hyphens and question marks (anything else that might appear that we want to account for?). We can just lump all of these characters into `rstrip` and run `rstrip` on each word. If it doesn't find any of them at the end of the word, it does nothing, but if there is a trailing question mark it'll remove it:

In [None]:
total_list = []
for line in spoiler_alert:
    line_as_list = line.strip().split(' ')
    for word in line_as_list:
        total_list.append(word.rstrip('!?.-;:'))
print(total_list)

We'll still have `'`s but maybe we want to keep them? Also how do we feel about hyphenated words? Our current practice would keep strings like 'well-known' as one long word. Perhaps we are okay with that, and perhaps not. It's a limitation to be aware of.
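If you want to see what the alternative choice looks like, a regular expression can pull out tokens directly. The following is only a sketch on a made-up sentence (`text`, `split_hyphens`, and `keep_hyphens` are hypothetical names), not the method used in the rest of this notebook:

```python
import re

text = "A well-known tale, th' occurrents more and less!"  # hypothetical sample

# Letters and apostrophes only: hyphenated words are split apart
split_hyphens = re.findall(r"[a-z']+", text.lower())

# Allow hyphens inside a token: hyphenated words stay whole
keep_hyphens = re.findall(r"[a-z']+(?:-[a-z']+)*", text.lower())

print(split_hyphens)
print(keep_hyphens)
```

Both patterns keep apostrophes, so `th'` survives as a token either way; the only difference is how 'well-known' is treated.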

Let's wrap this up with a few little changes:

In [None]:
total_list = []
for line in spoiler_alert:
    line_as_list = line.strip().split(' ')
    for word in line_as_list:
        if len(word) > 0:#This will make sure the string has something in it
            total_list.append(word.rstrip('!?.,-;:').lower())
print(total_list)

Now everything is lowercase so that 'The' and 'the' will be recognized as the same word, and we removed punctuation and all those pesky spaces/empty strings. Finally, we can get word counts, for which we'll rely on `Counter`:

In [None]:
from collections import Counter
Counter(total_list)
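A `Counter` behaves like a dictionary with a few extras that will be useful for the questions we posed at the start. A short sketch on a made-up word list (`words` and `counts` are stand-in names, not variables from the cells above):

```python
from collections import Counter

# Hypothetical word list standing in for total_list
words = ['the', 'rest', 'is', 'silence', 'the', 'news', 'the']

counts = Counter(words)

print(counts.most_common(1))   # the most frequent word with its count
print(sum(counts.values()))    # total number of words
print(len(counts))             # number of unique words
print(counts['hamlet'])        # missing keys count as 0 rather than raising
```

The total and the number of unique words are exactly the two quantities we'll need when comparing how much different characters speak and how large their vocabularies are.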

#Back to Shakespeare.txt

So we were able to get a dictionary of all the word counts in our little sample text. Of course, we're not working with a tiny sample text. Our text is gigantic and contains a lot of stuff we don't need. So let's get serious and move on to some big(-ger) data. We need to start by reading in the file using Python:

In [None]:
with open('../Data/Shakespeare.txt') as f: #The file is closed automatically when the block ends
    complete_works = f.readlines()

For the next few tasks, I'm only interested in 'Othello'. So look at where Othello begins in the file: how are we going to extract Othello and _only_ Othello from this large list of lines? How do _we_ know when Othello begins and ends?

**Exercise:** Iterate through `complete_works` and add only the lines relevant to 'Othello' to the new list `othello_lines` (hint: there might be a string that differentiates Othello from other plays. And there might be another string that signifies the end of a play. How can you use search to find those lines and get the ones you want?).

In [None]:
othello_lines = []
###Place your code here

###Answer
# beginning = 0
# ends = []
# for line in enumerate(complete_works):
#     if 'THE TRAGEDY OF OTHELLO' in line[1]:
#         beginning = line[0]
#     if 'THE END' in line[1]:
#         ends.append(line[0])
# for end in ends:
#     if end > beginning:
#         break
# print(beginning)
# print(end)
# print(ends)
# othello_lines = complete_works[beginning:end+1]


othello = False
for line in complete_works:
    if 'THE TRAGEDY OF OTHELLO' in line:
        othello = True
    if othello:
        othello_lines.append(line)
        if 'THE END' in line:
            othello = False

Before moving on, **always** make sure you really have what you _think_ you have. 

In [None]:
print(othello_lines[0:20])
print('**************')
print(othello_lines[-20:])
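Eyeballing the output is the most important check, but a few automated checks can back it up. Here is a minimal sketch on a tiny stand-in list; `extracted` is hypothetical sample data, and the marker strings are simply the ones used in the extraction answer above, which are assumed to match the file:

```python
# Hypothetical stand-in for othello_lines
extracted = ["THE TRAGEDY OF OTHELLO, MOOR OF VENICE\n",
             "  OTHELLO. Let him do his spite.\n",
             "THE END\n"]

first_ok = 'THE TRAGEDY OF OTHELLO' in extracted[0]
last_ok = 'THE END' in extracted[-1]
# No other 'THE END' should appear before the last line
stray_ends = [i for i, line in enumerate(extracted[:-1]) if 'THE END' in line]

print(first_ok, last_ok, stray_ends)
```

If either flag is `False`, or `stray_ends` is not empty, the extraction probably grabbed too little or too much.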

#Dialogue
So we should all have a line-by-line reading of Othello. Now what? Well, your choice here is going to depend a lot on what you want to analyze! I'm particularly interested in knowing how many total and _unique_ words Othello speaks. With that information, we could then apply similar analysis to see which character speaks the most, or who has the biggest vocabulary. 

First off, since we only care about Othello, maybe it's not worth our time trying to extract the character list. If we want to compare the vocabulary of other characters, we'll have to do this but for now let's just worry about Othello. The fact that Othello is capitalized when he speaks makes things nice and easy for us (aside: his name _probably_ isn't capitalized when he is referred to in dialogue from other characters. Can we be sure? No. As a rough approximation? Sure).

```
"  CHARACTER. blahblahblah
     blahblahblah"
```

Two things jump out at me. How about you?

1. Before a character speaks there are always (usually?) *2* spaces, followed by the character in all capital letters, followed by a period. 
2. It also looks like some speeches are longer than a single line, but those speeches have 4 spaces(!). We'll make note of that for later.
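These two observations can be turned into a pattern for spotting speaker lines. The sketch below assumes the layout is exactly as described (two spaces, a name in capitals, a period); the sample lines and the name `speaker_pattern` are made up for illustration, and the real file may contain exceptions:

```python
import re

# Hypothetical lines laid out as described above
sample = ["  OTHELLO. Let him do his spite.\n",
          "    My services, which I have done the signiory,\n",
          "  IAGO. Those are the raised father and his friends.\n"]

# Exactly two leading spaces, then capital letters (and spaces), then a period
speaker_pattern = re.compile(r'^  ([A-Z][A-Z ]*)\.')

speakers = []
for line in sample:
    match = speaker_pattern.match(line)
    if match:
        speakers.append(match.group(1))

print(speakers)
```

The continuation line is skipped because its third character is a space rather than a capital letter.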

In [None]:
sample_text = othello_lines[3792:3809]
for line in sample_text:
    print(line.rstrip())

**Exercise:** Given the `sample_text` above, make a list of all words spoken by Othello.

In [None]:
othellos_dialogue = []

###Place your code here


###Answer
othello_speaking = False
for line in sample_text:
    if 'OTHELLO.' in line:
        othello_speaking = True
        listy = line.strip().split(' ')
        for word in listy:
            if len(word) > 0:
                othellos_dialogue.append(word)
    elif othello_speaking:
        if line[:4] == '    ': #Four spaces!
            listy = line.strip().split(' ')
            for word in listy:
                if len(word) > 0:
                    othellos_dialogue.append(word)
        else:
            othello_speaking = False
    else:
        pass

In [None]:
print(othellos_dialogue)

The code that we wrote above should (might?) be generalizable. But who knows. We'll have to check slowly.

In [None]:
for line in othello_lines[:500]: #No reason to work with the full text while we're still learning
    if 'OTHELLO.' in line:
        print(line)
        
print('##########')

for line in othello_lines[-100:]: #No reason to work with the full text while we're still learning
    if 'OTHELLO.' in line:
        print(line)

Looks good to me. Of course, this only finds the first line of each of Othello's speeches, not any of the continuation lines that follow.

Let's go ahead and run our code on Othello and see how complex his vocabulary is!

In [None]:
othellos_dialogue = []
###Place your code here


###Answer
othello_speaking = False

for line in othello_lines:
    if 'OTHELLO.' in line:
        othello_speaking = True
        listy = line.strip().lower().split(' ') #Note: the speaker tag itself ('othello') ends up in the word list
        for word in listy:
            word = word.strip('.!?;:-,') #Strip punctuation first so a bare '-' doesn't leave an empty string
            if len(word) > 0:
                othellos_dialogue.append(word)
    elif othello_speaking:
        if line[:4] == '    ': #Four spaces mean the speech continues
            listy = line.strip().lower().split(' ')
            for word in listy:
                word = word.strip('.!?;:-,')
                if len(word) > 0:
                    othellos_dialogue.append(word)
        else:
            othello_speaking = False
    
print(Counter(othellos_dialogue))
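To reuse the same logic for other characters, the loop can be wrapped in a function that takes the speaker's name. This is one possible sketch, not the only way to do it: `words_spoken_by` is a hypothetical helper, the sample lines are made up, and it relies on the same formatting assumptions as above. Unlike the answer cell, it drops the speaker tag itself so that the name is not counted as a spoken word:

```python
from collections import Counter

def words_spoken_by(lines, speaker):
    """Return the lowercased, punctuation-stripped words in a speaker's speeches.

    Assumes a speech opens with the speaker's name in capitals followed by
    a period, and that continuation lines start with four spaces.
    """
    tag = speaker.upper() + '.'
    words = []
    speaking = False
    for line in lines:
        if tag in line:
            speaking = True
            # Keep only the text after the speaker tag
            text = line.split(tag, 1)[1]
        elif speaking and line.startswith('    '):
            text = line
        else:
            speaking = False
            continue
        for word in text.lower().split():
            word = word.strip('.!?;:-,')
            if word:
                words.append(word)
    return words

# Hypothetical sample lines
sample = ["  OTHELLO. Let him do his spite.\n",
          "    My services shall out-tongue his complaints.\n",
          "  IAGO. By Janus, I think no.\n"]

othello_words = words_spoken_by(sample, 'Othello')
print(len(othello_words), len(set(othello_words)))
print(Counter(othello_words).most_common(1))
```

The first printed pair is the total and unique word counts, which are the two numbers needed to compare characters.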

**Exercise:** How does Othello's word usage compare with Iago's? Who speaks more in this play? Who has a larger vocabulary?

In [None]:
###Place your code here



#Additional exercises for those who are interested

**Exercise:** A common way to summarize the complexity of text is by calculating the [entropy]('WIKIPEDIA ENTROPY INFORMATION THEORY LINK') of a given text. Write a function to compare the entropy of Iago's and Othello's speech.
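As a possible starting point, here is a sketch of Shannon entropy over word frequencies, where each word's probability is taken to be its count divided by the total number of words. The function name `word_entropy` and the sample lists are made up for this example, and treating each word as an independent draw is a simplifying assumption:

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1/p) over all distinct words, with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Hypothetical word lists
print(word_entropy(['a', 'a', 'a', 'a']))   # one repeated word
print(word_entropy(['a', 'b', 'c', 'd']))   # four equally likely words
```

A speaker who repeats one word has zero entropy, while a speaker who spreads evenly over many words has high entropy. Note that longer texts tend to have higher entropy simply because they contain more distinct words, so comparing speakers with very different amounts of dialogue needs some care.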

In [None]:
###Place your code here



**Exercise:** This has all been very text-based without much graphical analysis. Can you plot a histogram of word frequencies? How about a cumulative distribution function (CDF) plot? Does Othello's speech follow [Zipf's law](WIKIPEDIA ZIPF LAW)?
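Before plotting anything, it can help to compute the numbers a Zipf plot would show. Zipf's law says, roughly, that a word's frequency is proportional to 1 divided by its rank, so on log-log axes the rank-frequency points fall near a line with slope -1. The sketch below uses only the standard library; `rank_frequency` and `loglog_slope` are hypothetical helper names, the word list is artificial, and the slope calculation needs at least two distinct ranks:

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, count) pairs, most frequent word first (rank 1)."""
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical word list whose counts are 12, 6, 4, 3 (proportional to 1/rank)
words = ['a'] * 12 + ['b'] * 6 + ['c'] * 4 + ['d'] * 3
pairs = rank_frequency(words)
print(pairs)
print(loglog_slope(pairs))   # close to -1 for an ideal Zipf distribution
```

The same `pairs` list can be handed to a plotting library of your choice with both axes set to a logarithmic scale.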

In [None]:
###Place your code here



**Exercise:** Our first step towards analyzing Othello was to extract the Othello-specific text from the complete works file, which we did in a pretty arbitrary way to make our lives easier. How would you do it in an automated fashion? Specifically, read the following file and return a _list_ of _tuples_ containing the (year, name) of every play within the file.

In [None]:
with open('../Data/Shakespeare.txt') as f: #The file is closed automatically when the block ends
    complete_works = f.readlines()
play_list = []
###Place your code here

**Exercise:** How well does the code that we wrote above work on Hamlet? Romeo and Juliet? Who has the biggest vocabulary in all of Shakespeare?

In [None]:
###Place your code here