# Processing text

To me, the promise of text analysis appears to be the greatest driver of social scientists' interest in computational, quantitative research.

The quantitative analysis of text can return impressive findings that aid research questions. However, the difficulty curve rises rapidly as you refine your model and add detail to it, more so than in other areas of Computational Social Science.

# The basics of text

To start I want to show you the basics of working with text **largely without any packages**. Why? Because you should walk before you run and before you use *automagic* functions you should gain intuition about the text itself. 

We will begin with a document that most of you will have some experience with, *Othello* (`../data/Othello.txt`).

In [None]:
!head -n13 ../data/Othello.txt

We should always start with a look at the file to gain a sense of what is going on.

Within the first 13 lines we already have an example of organizational text, scene directions, and dialogue. 

If we open the entire file and split on `SCENE`, we should be able to get a quick sense of how the transition from the end of one scene to the beginning of the next is handled.

In [None]:
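# Read the whole play into one string, then split it at each scene marker
with open('../data/Othello.txt') as f:
    othello_full = f.read()
split_scene = othello_full.split('SCENE')
# Print the first and last 100 characters of two scenes
for i in [2, 3]:
    print(split_scene[i][:100])
    print('------')
    print(split_scene[i][-100:])
    print('------')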

Notice a pattern? Can you think of a way to clean up non-dialogue text to make our job of extracting dialogue easier?

In [None]:
#Exercise

Our goal is to now separate out the dialogue for each individual character.

Create a dictionary with the character name as the key and all of the character's dialogue as a list of lines.

In [None]:
#Exercise


From this point there are many roads that we can take and this data will serve as our foundation. We will not modify the `char_dialogue` dictionary directly to maintain flexibility for different analytical approaches.

# Bag of words

A powerful, if surprising, analytical approach is one where essentially all structure from the text is disregarded. This approach is appropriately called the bag of words, and it can be extremely useful when we have a sufficient volume of text to analyze at an aggregate level.

As a first step, we should clean all of the lines into individual words (removing punctuation).

In [None]:
#Exercise


At the most basic level, we can quickly get a sense of how much each character speaks.

In [None]:
import operator
sorted(char_words, key=lambda k: len(char_words[k]), reverse=True)

We can also visualize the distribution of word frequencies. Here we will plot this as a ranked plot (so rank vs. frequency) for each character.

In [None]:
#Exercise


# Zipf's law

Zipf's law is an empirical law that was discovered by the linguist George Kingsley Zipf.

This law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent, and so on and so forth.

Visually this pattern will emerge as a fat-tailed distribution (possibly a power law). This law holds for many languages and even for smaller corpora (as opposed to an entire language).
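As a rough sketch of what the law predicts, the snippet below counts the words in a small made-up sentence and sets each word's observed count next to the count Zipf's law would predict from the top-ranked word. The text and the variable names are illustrative assumptions, not part of the Othello data, and a sample this small will not follow the law closely. The point is only the mechanics of ranking.

```python
from collections import Counter

# Toy text, purely illustrative (not from Othello).
toy_text = "the cat and the dog and the bird saw the cat"
counts = Counter(toy_text.split())

# Rank words from most to least frequent (rank 1 = most frequent).
ranked = counts.most_common()
top_count = ranked[0][1]

# Zipf's law predicts frequency(rank) ~ frequency(rank 1) / rank.
zipf_table = []
for rank, (word, observed) in enumerate(ranked, start=1):
    predicted = top_count / rank
    zipf_table.append((rank, word, observed, predicted))
    print(rank, word, observed, round(predicted, 2))
```

On a real corpus you would plot observed count against rank on log-log axes and look for a roughly straight line.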

# Comparing dialogue

We can easily start to dig into whether the number of words spoken would really designate one character as being the 'main' character in a play.

In [None]:
print('IAGO.', len(char_words['IAGO.']))
print('OTHELLO.', len(char_words['OTHELLO.']))

Basic, but we know that there are issues with this when we consider language. Basic problems emerge if we look at the most used words.

In [None]:
from collections import Counter
Counter(char_words['IAGO.']).most_common(30)

Common words and prepositions don't encode much information from a quantitative perspective. They are necessary to construct a sentence that is readable to a human, but not necessary to quantify or characterize a text.

We could do the work to build a dictionary of words we don't care about, but this has already been done for us.

In [None]:
import nltk
from nltk.corpus import stopwords
stopwords.words()

In [None]:
stopWords = set(stopwords.words('english'))

Now we can clean out the stopwords.

In [None]:
def cleaner(wordlist):
    temp = []
    for word in wordlist:
        if word not in stopWords:
            temp.append(word)
    return temp

char_nonstop = {}
for char in char_words:
    char_nonstop[char] = cleaner(char_words[char])

In [None]:
print('IAGO.', len(char_nonstop['IAGO.']))
print('IAGO. set', len(set(char_nonstop['IAGO.'])))


print('OTHELLO.', len(char_nonstop['OTHELLO.']))
print('OTHELLO. set', len(set(char_nonstop['OTHELLO.'])))

This still isn't perfect though. If we really examine Iago's wordlist, we can see that some words still have punctuation attached.

In [None]:
char_nonstop['IAGO.']

NLTK has a built-in word tokenizer to help with these situations. `word_tokenize` breaks apart words that have punctuation built into them.

It is similar to making our own punctuation list and cleaning each word, but it's quicker.

In [None]:
for word in set(char_nonstop['IAGO.']):
    print(nltk.word_tokenize(word))

In [None]:
nltk.word_tokenize('the fox ran over the meadow, finding its prey.')

There is a large variety of tokenizers besides the word tokenizer. One of the most useful is the Regex tokenizer.

In [None]:
nltk_regex = nltk.RegexpTokenizer(r'\w+')
nltk_regex.tokenize('the fox ran over the meadow, finding its prey.')

In [None]:
for word in char_nonstop['IAGO.']:
    print( nltk_regex.tokenize(word) )

So all of the processing work that we did ourselves from the character dialogue to words could have been handled by NLTK.

In [None]:
def nltk_cleaner(wordlist, charname):
    charstop = stopWords.union(set([charname, charname.strip('.')]))
    return [w for w in nltk_regex.tokenize(' '.join(wordlist)) if w not in charstop]

nltk_cleaner(char_dialogue['IAGO.'], 'IAGO.')

In [None]:
char_nltk = {}
for char in char_dialogue:
    # Store the cleaned word list for each character
    char_nltk[char] = nltk_cleaner(char_dialogue[char], char)

Tokenization only breaks the string into word 'tokens': individual sets of characters separated by a space or punctuation. This doesn't account for variation in words, due to conjugation or plurals, that makes them appear different when really they share the same underlying meaning. Reducing those variants to a common root is what is known as **stemming**.

When we stem a word, we remove the suffix.

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

print( ps.stem('running') )
print( ps.stem('runs') )

print( ps.stem('party') )
print( ps.stem('parties') )

And we could add this to our `nltk_cleaner` function now to automatically stem the words. 

In [None]:
def nltk_cleaner(wordlist, charname):
    charstop = stopWords.union(set([charname, charname.strip('.')]))
    return [ps.stem(w) for w in nltk_regex.tokenize(' '.join(wordlist)) if w not in charstop]

char_nltk = {}
for char in char_dialogue:
    char_nltk[char] = nltk_cleaner(char_dialogue[char], char)
char_nltk['IAGO.']

In [None]:
print('IAGO', len( set(char_nltk['IAGO.']) ))
print('OTHELLO', len( set(char_nltk['OTHELLO.']) ))

Post tokenization and stemming we have achieved a decrease in the number of unique words used per character (these counts are roughly 75% of what they were before). At this point we should again check the rank frequency plot to understand how this has changed the behavior.

In [None]:
fig = plt.figure(figsize = (6, 3))
#IAGO plot
ax1 = fig.add_subplot(121, facecolor='white')
rank_plot(ax1, 'IAGO', freq_calc(char_nltk['IAGO.']) )
#OTHELLO plot
ax2 = fig.add_subplot(122, facecolor='white')
rank_plot(ax2, 'OTHELLO', freq_calc(char_nltk['OTHELLO.']) )
plt.tight_layout()

So far we have just been doing a raw count, which doesn't really deal with the diversity of language.

The most common and straightforward measure of that diversity is the ratio of unique words to the total number of words used.

In [None]:
#Exercise


Even with this simple calculation, we can see that while Iago may speak more, a larger share of what he says is repeated words than is the case for Othello.

**Question** Is this difference significant?

In [None]:
#Exercise


**Apparent numbers (can) lie.**

This is why null models are so important when dealing with data, such as text data, that break a classical statistical testing framework.
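One possible null model, sketched here under my own assumptions rather than as the definitive answer, is a shuffle test. We pool the words of two speakers, reshuffle them, split them back into groups of the original sizes, and ask how often chance alone produces a difference in the unique-to-total ratio as large as the one observed. The function names (`type_token_ratio`, `shuffle_null`) and the toy word lists are hypothetical and only for illustration.

```python
import random

def type_token_ratio(words):
    """Unique words divided by total words."""
    return len(set(words)) / len(words)

def shuffle_null(words_a, words_b, n_shuffles=1000, seed=0):
    """Pool both word lists, reshuffle, re-split at the original sizes,
    and compare the shuffled differences to the observed one."""
    rng = random.Random(seed)
    pooled = list(words_a) + list(words_b)
    observed = type_token_ratio(words_a) - type_token_ratio(words_b)
    null_diffs = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        fake_a = pooled[:len(words_a)]
        fake_b = pooled[len(words_a):]
        null_diffs.append(type_token_ratio(fake_a) - type_token_ratio(fake_b))
    # Fraction of shuffles at least as extreme as the observed difference.
    p_value = sum(abs(d) >= abs(observed) for d in null_diffs) / n_shuffles
    return observed, p_value

# Toy word lists, purely illustrative.
speaker_a = "i hate the moor i hate the moor i hate".split()
speaker_b = "she loved me for the dangers i had passed".split()
observed, p_value = shuffle_null(speaker_a, speaker_b)
print(observed, p_value)
```

Keeping the group sizes fixed in each shuffle matters because the ratio of unique to total words naturally falls as a sample gets longer.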

# Extending beyond a bag of words

So far we have only dealt with what are called `uni-grams` (i.e. single words, the bag of words). Bigrams and trigrams are also a part of the picture. Can you guess what they are?

In [None]:
list(nltk.bigrams(char_nltk['IAGO.']))[:20]

Why do bigrams matter? Typically it is in the context of some other analysis or relationship (e.g. as a feature vector for some statistical learning model). They handicap your ability to do a direct analysis (two items instead of one); however, they expand your ability to model structure.

There are a number of instances where you will want to rely on n-grams instead of, or in addition to, the unigram bag of words approach in order to have a description of the text that takes structure into account. A common reason is when you have a multi-word concept that encodes meaning, which often happens in specialized fields and writing.

Just because you perform the extraction as multiple words doesn't mean that you cannot reduce the n-gram to a single word token. For convenience in further processing these tokens will be joined as a single string like so:

In [None]:
['-'.join(x) for x in list(nltk.bigrams(char_nltk['IAGO.']))[:20]]

You wouldn't join unigrams and bigrams in a bag-of-words analysis, but you would do so in applications where you are generating features from the text. 
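To make the feature-generation case concrete, here is a minimal sketch that counts unigrams and joined bigrams together in one feature table. The `ngrams` helper is a hypothetical plain-Python stand-in for `nltk.bigrams`, written so the example runs without any packages, and the token list is made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy token list, purely illustrative.
tokens = ['put', 'money', 'in', 'thy', 'purse', 'put', 'money']

unigrams = tokens
bigram_tokens = ['-'.join(pair) for pair in ngrams(tokens, 2)]

# One count per feature, with unigrams and joined bigrams side by side.
features = Counter(unigrams + bigram_tokens)
print(features['put-money'], features['money'])
```

Joining with `-` keeps each bigram distinguishable from the unigrams, provided the tokens themselves never contain that character.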

# POS

The final basic(?) analysis is determining the parts of speech (POS). This is the lead-in to many other machine learning techniques that leverage parts of speech to determine structure and novelty.

In [None]:
nltk.pos_tag(nltk_regex.tokenize('The fox ran quickly to its prey'))

How does this work? The simple answer is that this is a pre-trained model built from an annotated corpus. Text from that corpus has been annotated like so:

`[[('today','NN'),('is','VBZ'),('good','JJ'),('day','NN')], [('yes','NNS'),('it','PRP'),('beautiful','JJ')]]`

and the model takes that learning and makes predictions on newly submitted text. The NLTK `pos_tag` tagger is trained on the Wall Street Journal corpus (links to the source are here: https://stackoverflow.com/questions/32016545/how-does-nltk-pos-tag-work/41384824).
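To build some intuition for what "learning from an annotated corpus" means, here is a deliberately simple baseline that tags each word with the tag it carried most often in training. This is not how NLTK's tagger works internally; it is a much cruder stand-in. The tiny corpus, the `baseline_tag` name, and the default tag for unseen words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-annotated corpus in the same (word, tag) format; illustrative only.
tagged_sents = [
    [('the', 'DT'), ('fox', 'NN'), ('ran', 'VBD')],
    [('the', 'DT'), ('run', 'NN'), ('ended', 'VBD')],
    [('they', 'PRP'), ('run', 'VBP')],
    [('we', 'PRP'), ('run', 'VBP')],
]

# "Training": count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sent in tagged_sents:
    for word, tag in sent:
        tag_counts[word][tag] += 1

def baseline_tag(tokens, default='NN'):
    """Tag each token with its most frequent training tag (default if unseen)."""
    return [(t, tag_counts[t].most_common(1)[0][0] if t in tag_counts else default)
            for t in tokens]

tagged = baseline_tag(['the', 'fox', 'run', 'quickly'])
print(tagged)
```

Notice that `run` always gets the same tag here regardless of the sentence it appears in, which is exactly the kind of mistake a tagger that looks at surrounding words can avoid.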

# Spacy

Of course NLTK isn't for cool kids anymore. Now it's all about spacy. Spacy has pretrained statistical language models and an opinionated implementation of an NLP pipeline that is easy to use.

Normally we would have to download one of these language models like so:

In [None]:
#!python -m spacy download en_core_web_sm
#!python -m spacy download en

But we already did this at the start of class.

In [None]:
import spacy
# Load the model by its full package name (the 'en' shortcut was removed in spaCy v3)
nlp = spacy.load('en_core_web_sm')
doc = nlp('The fox ran quickly to its prey')
for token in doc:
    print(token.text, token.pos_, token.dep_)

Of course what makes everyone really care about spacy is the fact that it has that statistical model of language.

In [None]:
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
for ent in doc.ents:
    print(ent, ent.label_)

And the fact that it can handle modern web text.

In [None]:
doc = nlp("Pensive emoji is where it has always been. Pensive is the superior "
          "emoji. It's outranking happy 😔 ")
print(doc[0].text)          
print(doc[1].text)          
print(doc[-1].text)         
print(doc[17:19].text)      

noun_chunks = list(doc.noun_chunks)
print(noun_chunks[0].text)  

sentences = list(doc.sents)
print(sentences[1].text)    