In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../Data/www/styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# Synopsis

So far we have essentially only learned how to parse text and count the words in it (doesn't sound like much, huh? But that alone makes up a large share of basic textual analysis). In this unit we will go a bit further and cover:

1. Preparing text for further analysis
2. Analyzing sentiment

We will also talk about how difficult advanced analysis of unstructured text is despite its appearance as an 'easy' task.

## Text as Data

As we discussed this morning, analyzing text is not nearly as simple as it would appear. In this module we're going to learn the basics of examining sentiment in text. 

We'll be working with an example text:

"Adam is totally cool. You should come to his class."

To begin with, let's answer some basic questions. 

* Is this overall sentence positive or negative?
* Which words make it positive or negative?
* Do all words have a positive or negative affect?

Now the question becomes, how can we automate the analysis of the sentiment in the text?

There are actually many ways to rate the positive or negative sentiment of a word. More complicated approaches involve machine learning, but we'll start simply by using a dictionary.

People have created many dictionaries for analyzing sentiment. For our uses today we will use the AFINN dictionary that is provided in `Data/Day5-Text-Analysis/AFINN/`.

In [None]:
with open('../Data/Day5-Text-Analysis/AFINN/AFINN-111.txt', encoding='utf-8') as afinn_file:
    afinn_list = [line.strip().split() for line in afinn_file]
print(afinn_list[:10])

The AFINN dictionary is relatively simple. It gives a word and then its numeric score of positivity or negativity (negative words have negative numbers).

But we really need to convert it to a Python dictionary if it's going to be useful to us (list lookups are expensive!).

In [None]:
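#Place your code here
afinn = {}
# Loop over the whole list (not just the first 10 entries) so every word is available
for item in afinn_list:
    # The score is always the last field; joining the rest also handles multi-word entries
    key = ' '.join(item[:-1])
    score = int(item[-1])
    afinn[key] = score

# Printing the full dictionary is long, so just report how many entries were loaded
print(len(afinn))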


To start with, let's look at the words with sentiment in the example text.

In [None]:
example_text = "Adam is totally cool. You should come to his class"

In [None]:
###Place your code here
words = example_text.split()
print(words)

In [None]:
example_words = [word.strip('. ') for word in example_text.split(' ')]
for word in example_words:
    if word.lower() in afinn:
        print('--- ', word, '\t', afinn[word.lower()])
    else:
        print(word)

As we can see, the only word assigned a sentiment score is "cool".

`Adam` is a proper noun, `You` is a pronoun, `his` is a possessive - so no sentiment there

`is`, `should`, and `come` are the verbs - so no sentiment

`to` is a preposition

`class` is a noun

`totally` is a different story though. It's an adverb and is modifying `cool`, which is positive. However, the sentiment of `totally` is entirely dependent on the word that it is modifying. So on its own, it doesn't actually have a score.

So we can judge that this overall text is mildly positive. There isn't that much to go on, though, since it's such a small piece!

There is more that we could write to understand `totally` and its relationship to `cool`, but we'll save that for later. Right now we're going to stick to analyzing unigrams (single words) as just a bag, ignoring word order and context (which actually works really well as a first approximation!).
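To make the bag-of-unigrams idea concrete, here is a minimal sketch. The `toy_lexicon` and its scores are made up for illustration (they are not AFINN values), and `score_unigrams` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
# Toy lexicon for illustration only; these scores are invented, not taken from AFINN.
toy_lexicon = {"cool": 1, "great": 3, "awful": -3}

def score_unigrams(text, lexicon):
    """Return the scores of the words in `text` that appear in `lexicon`."""
    scores = []
    for token in text.split():
        # Drop surrounding punctuation and ignore case before the lookup
        word = token.strip('.,;:!?"\'').lower()
        if word in lexicon:
            scores.append(lexicon[word])
    return scores

scores = score_unigrams("Adam is totally cool. You should come to his class.", toy_lexicon)
print(scores, sum(scores))
```

Each word is scored on its own, so a modifier like `totally` contributes nothing here, which is exactly the limitation described above.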

But one thing you'll notice is that we are checking a lot of words for sentiment even though they carry very little meaning.

One set of words that doesn't really help is called stopwords. Stopwords are the most common words in a language and don't carry much meaning when it comes to analyzing sentiment in text.

For our lesson today we will need to download the `stopwords` corpus.

In [None]:
# A new window will open. Select only the materials that appear in the book
import nltk
nltk.download()

Excellent! Now we will need to import our corpus.

In [None]:
from nltk.corpus import stopwords

And let's take a look at what is inside the stopwords list.

In [None]:
stopwords.words()

Great! You can see that the stopwords list is actually very extensive. That's because calling `stopwords.words()` with no arguments returns the stopwords for every language in the corpus, not just English. So if you decide to analyze text in a non-English language, NLTK may already have you covered.

Now let's check to see what is left of our example text after we remove the stopwords.

In [None]:
stop_set = set(stopwords.words())  # build the set once; the stopwords are stored in lowercase
print([word for word in example_words if word.lower() not in stop_set])

You can see that it really cut down the entire list of words to basically just the nouns, adjectives, and modifiers. 

Removing stopwords is extremely important when we're trying to get to the real meat of a text. 
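Two details matter when filtering: the stopword lists are lowercase, and checking membership in a set is much faster than scanning a list. The sketch below shows both, using a hypothetical mini stop list (`toy_stopwords`) so that it runs without any downloads; the helper name `remove_stopwords` is also just for illustration.

```python
# Hypothetical mini stop list for illustration; NLTK's real lists are much longer.
toy_stopwords = {"is", "you", "should", "to", "his"}

def remove_stopwords(words, stop_set):
    """Keep only the words whose lowercase form is not in `stop_set`."""
    return [word for word in words if word.lower() not in stop_set]

words = ["Adam", "is", "totally", "cool", "You", "should", "come", "to", "his", "class"]
kept = remove_stopwords(words, toy_stopwords)
print(kept)
```

With the real corpus you could pass `set(stopwords.words())` as `stop_set` instead of the toy set.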

So let's move on to an actual text and apply these principles. Load Othello and extract Othello's and Iago's dialogue using our code from the morning:

In [None]:
#Extract Othello


Excellent, now let's actually remove all of the stopwords and see what that does to the dialogue size of the two characters.

In [None]:
# Build the set once; calling stopwords.words() for every word is very slow
stop_set = set(stopwords.words())
cleaned_othello = [word for word in othellos_dialogue if word not in stop_set]
cleaned_iago = [word for word in iagos_dialogue if word not in stop_set]

print("Othello dialogue size", len( othellos_dialogue))
print("Othello dialogue size without stopwords", len( cleaned_othello))
print('----')
print("Iago dialogue size", len( iagos_dialogue))
print("Iago dialogue size without stopwords", len( cleaned_iago))

We see that there is a non-trivial reduction in the number of words spoken by each character (which should help in further processing!).

Now what does the distribution of sentiment look like for each of the two characters? Plot the two distributions in separate subplots.
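Before plotting, it can help to tally the scores. The following is only a sketch of that tallying step, with an invented lexicon and word list; `score_distribution` is a hypothetical helper, and the resulting counts are what you would hand to a histogram or bar plot.

```python
from collections import Counter

def score_distribution(words, lexicon):
    """Count how often each sentiment score occurs among the scored words."""
    return Counter(lexicon[w.lower()] for w in words if w.lower() in lexicon)

# Invented data for illustration; not real AFINN scores or real dialogue.
toy_lexicon = {"love": 3, "honest": 2, "hate": -3}
toy_words = ["I", "love", "honest", "Iago", "love", "hate"]

dist = score_distribution(toy_words, toy_lexicon)
# Average score over the scored words only
mean = sum(score * n for score, n in dist.items()) / sum(dist.values())
print(dict(dist), mean)
```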

In [None]:
###Place your code here


And they look almost exactly the same! However, we can tell that there is a nearly 20% difference in the averages.

The distribution of sentiment scores is interesting, but does not give us a picture of the arc of the story.  To extract that information, we need to keep track of when each word is spoken.
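One simple way to keep track of when each word is spoken is to record its position in the word list next to its score. This is a sketch under that assumption; `scored_positions` and the toy data are hypothetical.

```python
def scored_positions(words, lexicon):
    """Return (position, score) pairs for the words found in `lexicon`."""
    return [(i, lexicon[w.lower()]) for i, w in enumerate(words) if w.lower() in lexicon]

# Invented data for illustration; not real AFINN scores.
toy_lexicon = {"love": 3, "hate": -3}
toy_words = ["O", "my", "love", "I", "hate", "thee"]

timeline = scored_positions(toy_words, toy_lexicon)
print(timeline)
```

The positions can go on the x-axis and the scores on the y-axis of a sentiment-over-time plot.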

In [None]:
###Sentiment over time for Iago and Othello


Let's focus on Othello first. Interesting! We can see that the first 150 scored words uttered by Othello are quite positive. The next 150 are only slightly positive, and the last 250 words have a slight negative bias.

It might be time to refresh ourselves on the [story of Othello](https://en.wikipedia.org/wiki/Othello)....

Iago's speech has a different arc. The positivity in his utterances comes in spikes. The rest of the time his sentiment stays near neutral, as if he were hiding his feelings...

How does this compare to the whole text?

In [None]:
### Place your code here



Without a reference point (i.e., comparing one character to another), it's actually a bit easier to see the arc of the story in the rolling mean.

Here we can see that Othello lives up to its label as a tragedy. Near the end of the labeled words there is a steep decline in the sentiment of words used.
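For reference, a rolling mean simply averages each run of consecutive scores. Here is a minimal pure-Python sketch; the function name, the window size, and the scores are all illustrative assumptions (pandas offers the same idea through `Series.rolling(window).mean()`).

```python
def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values (no padding at the edges)."""
    if window < 1 or window > len(values):
        return []
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one
        total += values[i] - values[i - window]
        means.append(total / window)
    return means

# Invented scores that drift from positive to negative
smoothed = rolling_mean([3, 1, -2, -3, -1], 2)
print(smoothed)
```

A larger window smooths more aggressively, at the cost of hiding short spikes.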

Let's see if we can see more of a difference between Othello and Iago using the rolling mean:

In [None]:
#Your code here


Ah! That's actually a much easier way to intuit the dialogue of each individual character!

Let's actually compare the dialogue of every character in Othello. 

In [None]:
#Your code here


Well, I guess we can see that a few characters had quite a poor turn near the end there!

How well does our technique work with a different Shakespeare play, say "The Merchant of Venice"?

Refactor the original code that extracts Othello's dialogue so that it can pull out any character's dialogue from any play.

In [None]:
#Place your code here
