# Introduction to Text Analysis

## Pre-introduction

We'll be spending a lot of time today manipulating text. Make sure you remember how to split, join, and search strings.
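
As a quick refresher, here is a minimal sketch of those three operations. The string `line` is made up for illustration and is not part of the workshop data.

```python
# A made-up line of dialogue, used only for illustration
line = "ARTHUR: It is I, Arthur, son of Uther Pendragon"

# split: break the string into pieces on a separator (at most one split here)
speaker, speech = line.split(": ", 1)

# join: glue a list of strings back together with a separator
words = speech.split()
rejoined = "_".join(words[:3])

# search: str.find returns the index of the first match, or -1 if absent
position = speech.find("Arthur")
missing = speech.find("Lancelot")

print(speaker, rejoined, position, missing)
```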

## Introduction

We've spent a lot of time in Python dealing with text data, and that's because text data is everywhere. It is the primary form of communication between people, between people and computers, and between computers. The kinds of inferential methods that we apply to text data, however, are different from those applied to tabular data.

This is partly because documents are typically specified in a way that expresses both structure and content using text (e.g. the document object model).

Largely, however, it's because text is difficult to turn into numbers in a way that preserves the information in the document. Today, we'll talk about the dominant language model in NLP and the basics of how to implement it in Python.

### The term-document model

This is also sometimes referred to as "bag-of-words" by those who don't think very highly of it. The term-document model looks at language as individual communicative efforts that contain one or more tokens. The kind and number of the tokens in a document tell you something about what the author is attempting to communicate, and the order of those tokens is ignored.
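
To make "the order of those tokens is ignored" concrete, here is a minimal sketch using `collections.Counter` from the standard library. The two sentences are made up for illustration.

```python
from collections import Counter

# Two made-up documents with the same words in a different order
doc_a = "the knight fights the king"
doc_b = "the king fights the knight"

# A bag of words is just a count of each token; word order is thrown away
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a)
print(bag_a == bag_b)  # the model cannot tell these two documents apart
```

The sentences mean opposite things, but their bags are identical, which is the main limitation of this model.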

To start with, let's load a document.

In [None]:
import nltk
#nltk.download('webtext')
document = nltk.corpus.webtext.open('grail.txt').read()

Let's see what's in this document:

In [None]:
print(document[:1000])

It looks like we've gotten ourselves a bit of the script from Monty Python and the Holy Grail. Note that when we are looking at the text, part of the structure of the document is written in tokens. For example, stage directions have been placed in brackets, and the names of the person speaking are in all caps.

## Regular expressions

If we wanted to pull out all of the stage directions for analysis, or just King Arthur's lines, doing so with base Python string methods would be very difficult. Instead, we are going to use regular expressions. Regular expressions are a method for string manipulation that matches patterns instead of literal characters.

In [None]:
import re
snippet = document.split("\n")[8]
print(snippet)

In [None]:
re.search(r'coconuts', snippet)

Just like with `str.find`, we can search for plain text. But `re` also gives us the option of searching for patterns of characters - like only alphabetic characters.

In [None]:
re.search(r'[a-z]', snippet)

In this case, we've told `re` to search for the first character that is a lowercase letter between `a` and `z`. We could get the letter at the end of an exclamation by including a bang at the end of the pattern.

In [None]:
re.search(r'[a-z]!', snippet)

There are two things happening here:

1. `[` and `]` do not mean 'bracket'; they are special characters which mean 'any one character of this class'
2. we've only matched one letter each time

`re` is flexible about how you specify repetition - you can match none, some, a range, or all repetitions of a sequence or character class.

character | meaning
----------|--------
`{x}`     | exactly x repetitions
`{x,y}`   | between x and y repetitions
`?`       | 0 or 1 repetition
`*`       | 0 or more repetitions
`+`       | 1 or more repetitions
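
Here is a short sketch of these quantifiers in action. The string `text` is made up for illustration.

```python
import re

# A made-up string, used only for illustration
text = "Whoa there! Hallooo!"

one = re.search(r'[a-z]', text).group()       # a single lowercase letter
run = re.search(r'[a-z]+', text).group()      # one or more in a row
exact = re.search(r'o{3}', text).group()      # exactly three 'o's
ranged = re.search(r'l{1,2}', text).group()   # one or two 'l's, as many as are available
optional = re.findall(r'there!?', text)       # the '!' may be present or absent
star = re.search(r'Hal*o*', text).group()     # 'Ha', then any number of 'l's, then 'o's

print(one, run, exact, ranged, optional, star)
```

Quantifiers are greedy by default: they take as many repetitions as the rest of the pattern allows.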

Part of the power of regular expressions is their special characters. Common ones that you'll see are:

character | meaning
----------|--------
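`.`       | match any character except a newline
`^`       | match the start of the string (or of each line with `re.MULTILINE`)
`$`       | match the end of the string (or of each line with `re.MULTILINE`)
`\s`      | match any whitespace character, including newlines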
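
Anchors behave differently on multi-line strings depending on the flags you pass. Here is a minimal sketch, assuming a made-up two-line `script` string.

```python
import re

# Two made-up lines of script, used only for illustration
script = "ARTHUR: Whoa there!\nSOLDIER: Halt! Who goes there?"

# Without flags, ^ only matches at the very start of the string
unanchored = re.findall(r'^[A-Z]+', script)

# With re.MULTILINE, ^ and $ match at the start and end of every line
speakers = re.findall(r'^[A-Z]+', script, flags=re.MULTILINE)
last_words = re.findall(r'\s(\S+)$', script, flags=re.MULTILINE)

# . matches anything except a newline, so .+ stops at the end of the first line
first_line = re.search(r'.+', script).group()

print(unanchored, speakers, last_words, first_line)
```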

What if we wanted to grab all of Arthur's speech without grabbing the name `ARTHUR` itself?

If we wanted to do this using base string manipulation, we would need to do something like:

```
split the document into lines
create a new list of just lines that start with ARTHUR
create a newer list with ARTHUR removed from the front of each element
```
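
Those three steps can be sketched in plain Python as follows. The `sample` string is made up for illustration; on the real script you would start from `document` instead.

```python
# A made-up three-line script, used only for illustration
sample = "ARTHUR: Whoa there!\nSOLDIER: Halt!\nARTHUR: It is I, Arthur."

# 1. split the document into lines
lines = sample.split("\n")

# 2. keep just the lines that start with ARTHUR
prefix = "ARTHUR: "
arthur_lines = [line for line in lines if line.startswith(prefix)]

# 3. remove the name from the front of each element
arthur_speech = [line[len(prefix):] for line in arthur_lines]

print(arthur_speech)
```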

Regex gives us a way of doing this in one line, by using something called groups. Groups are pieces of a pattern that can be ignored, negated, or given names for later retrieval.

character | meaning
----------|--------
`(x)`     | match x and capture it
`(?:x)`   | match x but don't capture it
`(?P<x>...)` | match `...` and capture it under the name x
`(?=x)`   | match only if string is followed by x (x is not consumed)
`(?!x)`   | match only if string is not followed by x
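
The two lookahead forms are the least intuitive rows of this table, so here is a small sketch on a made-up string. The patterns are illustrative only and are not used later in the lesson.

```python
import re

# A made-up string, used only for illustration
text = "ARTHUR: Whoa! SOLDIER: Halt! ARTHUR: Hello?"

# (?=x): match a word only if it is immediately followed by '!'
shouted = re.findall(r'[A-Za-z]+(?=!)', text)

# (?!x): match a speaker name (a word followed by ':')
# only if it is NOT followed by ': Whoa'
others = re.findall(r'[A-Z]+(?!: Whoa)(?=:)', text)

print(shouted, others)
```

Note that neither the `!` nor the `:` shows up in the results: lookaheads check what comes next without consuming it.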

In [None]:
p_arthur = re.compile(r'(?:ARTHUR: )(.+)')
re.findall(p_arthur, document)[0:10]

Because we are using `findall`, the regex engine is capturing and returning the normal groups, but not the non-capturing group. For complicated, multi-piece regular expressions, you may need to pull groups out separately. You can do this with names.

In [None]:
p = re.compile(r'(?P<name>[A-Z ]+)(?::)(?P<line>.+)')
match = re.search(p, document)
match

In [None]:
match.group('name'), match.group('line')

We can also list and count all the unique speakers (the characters in the film, not characters in the string sense).

In [None]:
matches = re.findall(p, document)
chars = set([x[0] for x in matches])
print(chars, len(chars))

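We could use this `set` to gather all dialogue and assign it to the correct character in a `dict`, so we can call it back whenever we want.

In [None]:
char_dict = {}
for n in chars:
    # re.escape guards against names containing regex metacharacters.
    # The pattern is unanchored, so 'ARTHUR' also matches inside a longer name.
    char_dict[n] = re.findall(r'(?:' + re.escape(n) + r': )(.+)', document)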

In [None]:
char_dict["ARTHUR"]
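
The loop above scans the whole document once per character. As a hedged alternative, the sketch below collects every speaker's lines in a single pass with `re.finditer` and a `defaultdict`. The `sample` string and the anchored `pattern` are assumptions made for illustration; the real script may contain speaker labels that this stricter pattern does not match.

```python
import re
from collections import defaultdict

# A made-up snippet of script, used only for illustration
sample = "ARTHUR: Whoa there!\nSOLDIER: Halt!\nARTHUR: It is I."

# Same named-group idea as above, but anchored to the start of each line
pattern = re.compile(r'^(?P<name>[A-Z ]+): (?P<line>.+)$', flags=re.MULTILINE)

lines_by_speaker = defaultdict(list)
for m in pattern.finditer(sample):
    lines_by_speaker[m.group('name')].append(m.group('line'))

print(dict(lines_by_speaker))
```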

#### Now let's try a small challenge!

To check that you've understood something about regular expressions, we're going to have you do a small test challenge. Partner up with the person next to you - we're going to do this as a pair coding exercise - and choose which computer you are going to use.

Then, navigate to `challenges/04_text/` and read through challenge A. When you think you've completed it successfully, run `py.test test_A.py`.

Enough with `regex`, let's move on!

## Tokenizing

Let's grab Arthur's speech from above, and see what we can learn about Arthur from it.

In [None]:
arthur = ' '.join(char_dict["ARTHUR"])
arthur[0:100]

In our model for natural language, we're interested in words. The document is currently a continuous string of characters, which isn't ideal.

The practice of pulling apart a continuous string into units is called "tokenizing", and it creates "tokens". NLTK, the canonical library for NLP in Python, has a couple of implementations for tokenizing a string into words.

In [None]:
#nltk.download('punkt')
from nltk import word_tokenize
word_tokenize(snippet)

Look at what happened to "You're". It's been separated into "You" and "'re", which is in keeping with the way contractions work in English.
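
For comparison, here is a hedged sketch of two cruder tokenizers built from the standard library only. Neither is what `word_tokenize` does internally; they are shown to illustrate why a dedicated tokenizer is useful. The `sentence` string is made up for illustration.

```python
import re

# A made-up sentence, used only for illustration
sentence = "You're using coconuts!"

# Crudest option: split on whitespace, which leaves punctuation glued to words
whitespace_tokens = sentence.split()

# Slightly better: runs of letters/apostrophes are words, punctuation stands alone
naive_tokens = re.findall(r"[A-Za-z']+|[^\w\s]", sentence)

print(whitespace_tokens)
print(naive_tokens)
```

Neither version splits the contraction, so "You're" stays a single token here.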

In [None]:
tokens = word_tokenize(arthur)
tokens[0:10]

At this point, we can start asking questions like what are the most common words, and what words tend to occur together.
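
One of the simplest such questions is how many distinct words there are relative to the total number of words. Here is a minimal sketch on a made-up token list; the names `sample_tokens`, `n_tokens`, and `n_types` are illustrative.

```python
# A made-up token list, used only for illustration
sample_tokens = "the knights who say ni say ni to the knights".split()

n_tokens = len(sample_tokens)        # every word occurrence
n_types = len(set(sample_tokens))    # distinct words only

# Average number of times each distinct word is used
uses_per_type = n_tokens / n_types
print(n_tokens, n_types, uses_per_type)
```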

In [None]:
len(tokens), len(set(tokens))

So we can see right away that Arthur is using the same words a whole bunch - on average, each unique word is used four times. This is typical of natural language.

> What is typical is not the specific value, but the fact that the number of unique words in any corpus increases much more slowly than the total number of words.

> A corpus with 100M tokens, for example, probably only has 100,000 unique tokens in it.

For more complicated metrics, it's easier to use NLTK's classes and methods.

In [None]:
from nltk import collocations
fd = collocations.FreqDist(tokens)
fd.most_common()[:10]

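Let's remove punctuation and stopwords. First, we define a function that does both:

In [None]:
#nltk.download('stopwords')
def rem_punc_stop(text_string):
    """Strip punctuation, tokenize, and drop English stopwords."""
    from string import punctuation
    from nltk.corpus import stopwords

    # delete every punctuation character
    for char in punctuation:
        text_string = text_string.replace(char, "")

    # build the stopword set once rather than once per token
    stop_set = set(stopwords.words('english'))

    toks = word_tokenize(text_string)
    # the stopword list is lowercase, so compare lowercased tokens
    toks_reduced = [x for x in toks if x.lower() not in stop_set]

    return toks_reduced

Now we apply it to Arthur's speech: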

In [None]:
tokens_reduced = rem_punc_stop(arthur)
fd2 = collocations.FreqDist(tokens_reduced)
fd2.most_common()[:10]

In [None]:
measures = collocations.BigramAssocMeasures()
c = collocations.BigramCollocationFinder.from_words(tokens_reduced)
c.nbest(measures.pmi, 10)

In [None]:
c.nbest(measures.likelihood_ratio, 10)

We see here that the collocation finder is pulling out some things that have face validity. When Arthur is talking about peasants, he calls them "bloody" more often than not. However, collocations like "Brother Maynard" and "BLACK KNIGHT" are less informative to us, because we know that they are proper names.
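
To see what `measures.pmi` is rewarding, here is a minimal sketch of pointwise mutual information computed by hand on a made-up token list. It follows the textbook definition, log2 of P(w1, w2) / (P(w1) P(w2)), with probabilities estimated by dividing counts by the number of tokens. NLTK's own implementation may differ in its details.

```python
import math
from collections import Counter

# A made-up token list, used only for illustration
toks = "bloody peasant bloody peasant old woman bloody peasant old man".split()

unigrams = Counter(toks)
bigrams = Counter(zip(toks, toks[1:]))  # adjacent pairs
n = len(toks)

def pmi(w1, w2):
    """log2 of how much more often 'w1 w2' occurs than chance would predict."""
    p_pair = bigrams[(w1, w2)] / n
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))

print(pmi("bloody", "peasant"), pmi("old", "woman"))
```

In this toy data the one-off pair ("old", "woman") scores higher than the frequent pair ("bloody", "peasant"), which hints at why PMI tends to favor rare pairs.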

If you were interested in collocations in particular, what step do you think you would have to take during the tokenizing process?

## Stemming and Lemmatizing

This has gotten us as far as identical tokens, but in language processing, it is often the case that the specific form of the word is not as important as the idea to which it refers. For example, if you are trying to identify the topic of a document, counting 'running', 'runs', 'ran', and 'run' as four separate words is not useful. Reducing words to their stems is a process called stemming.

A popular stemming implementation is the Snowball Stemmer, which is based on the Porter Stemmer. Its algorithm looks at word forms and does things like drop final 's's, 'ed's, and 'ing's.
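
To give a feel for suffix stripping, here is a deliberately naive sketch. The function `naive_stem` is hypothetical and is NOT the Snowball or Porter algorithm, both of which apply many more rules and conditions.

```python
def naive_stem(word):
    """Strip one common English suffix. A toy, not the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        # only strip if at least three letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["walking", "walked", "walks", "running", "bus"]}
print(stems)
```

Even this toy shows how edge cases arise: 'running' comes out as 'runn' because nothing undoes the doubled consonant.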

We first have to create a stemmer object for the language we are using.

In [None]:
snowball = nltk.SnowballStemmer('english')

Now, we can try stemming some words:

In [None]:
snowball.stem('running')

In [None]:
snowball.stem('eats')

In [None]:
snowball.stem('embarrassed')

Snowball is a very fast algorithm, but it has a lot of edge cases. In some cases, words that share a root are reduced to two different stems.

In [None]:
snowball.stem('cylinder'), snowball.stem('cylindrical')

In other cases, two different words are reduced to the same stem.

> This is sometimes referred to as a 'collision'

In [None]:
snowball.stem('vacation'), snowball.stem('vacate')

A more accurate approach is to use an English word bank like WordNet to look up word forms in a dictionary, in a process called lemmatization.

In [None]:
# nltk.download('wordnet')
wordnet = nltk.WordNetLemmatizer()

In [None]:
wordnet.lemmatize('vacation'), wordnet.lemmatize('vacate')
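
To make the contrast with stemming concrete, here is a toy sketch of lookup-based lemmatization. The `LEMMAS` table is a tiny hand-written stand-in for a real word bank, and `toy_lemmatize` is a hypothetical helper, not part of NLTK.

```python
# A tiny hand-written lookup table standing in for a real word bank
LEMMAS = {"ran": "run", "runs": "run", "running": "run", "geese": "goose"}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)

print(toy_lemmatize("ran"), toy_lemmatize("geese"), toy_lemmatize("vacation"))
```

A lookup can handle irregular forms like 'ran' and 'geese' that suffix stripping cannot, at the cost of needing an entry for every form.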

#### Time for another small challenge!

Switch computers for this one, so that you are using your partner's computer, and try your hand at challenge B!

## Sentiment

Frequently, we are interested in text to learn something about the person who is speaking. One of these things we've talked about already - linguistic diversity. A similar metric was used a couple of years ago to settle the question of who has the [largest vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).

> Unsurprisingly, top spots go to Canibus, Aesop Rock, and the Wu Tang Clan. E-40 is also in the top 20, but mostly because he makes up a lot of words; as are OutKast, who print their lyrics with words slurred in the actual typography

Another thing we can learn is about how the speaker is feeling, with a process called sentiment analysis. Before we start, be forewarned that this is not a robust method by any stretch of the imagination. Sentiment classifiers are often trained on product reviews, which limits their ecological validity.
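
To get a feel for what a polarity score is, here is a toy lexicon-based scorer. The `LEXICON` scores are invented for illustration and `toy_polarity` is a hypothetical helper; this is not the classifier we use below.

```python
# A tiny hand-made sentiment lexicon (invented scores, for illustration only)
LEXICON = {"good": 1.0, "brave": 0.5, "silly": -0.5, "bloody": -1.0}

def toy_polarity(sentence):
    """Average the scores of the words we recognise; 0.0 if none are known."""
    words = sentence.lower().replace("!", "").replace(".", "").split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("Brave Sir Robin is good."))
print(toy_polarity("Bloody peasant!"))
print(toy_polarity("Bring out your dead."))
```

The last sentence scores 0.0 only because none of its words are in the lexicon, which is one reason to treat such scores with caution.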

We're going to use TextBlob's built-in sentiment classifier, because it is super easy.

In [None]:
from textblob import TextBlob

In [None]:
blob = TextBlob(arthur)

In [None]:
net_pol = 0
for sentence in blob.sentences:
    pol = sentence.sentiment.polarity
    print(pol, sentence)
    net_pol += pol
print()
print("Net polarity of Arthur: ", net_pol)

How about we look at all characters?

In [None]:
collected_stats = []
for k in char_dict.keys():
    blob = TextBlob(' '.join(char_dict[k]))
    net_pol = 0
    for sentence in blob.sentences:
        pol = sentence.sentiment.polarity
        net_pol += pol
    collected_stats.append((k, net_pol))

In [None]:
sorted_stats = sorted(collected_stats, key=lambda x: x[1])
for t in sorted_stats:
    print(t[0], t[1])
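
One caveat when reading this ranking: a net (summed) polarity grows with the number of sentences, so characters with many lines can outrank characters with few. Here is a minimal sketch of the difference between a sum and a mean, using made-up scores for two hypothetical characters.

```python
# Made-up per-sentence polarity scores for two hypothetical characters
polarities = {
    "TALKATIVE": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    "TERSE": [0.5],
}

# Net polarity: the sum over sentences, as in the loop above
net = {name: sum(scores) for name, scores in polarities.items()}

# Mean polarity: the sum divided by the number of sentences
mean = {name: sum(scores) / len(scores) for name, scores in polarities.items()}

print(net)
print(mean)
```

The two measures rank these characters in opposite orders, so it is worth deciding which one answers your question.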

## Topic Modeling

Another common NLP task is topic modeling. The math behind this is beyond the scope of this course, but the basic strategy is to represent each document as a one-dimensional array, where the indices correspond to integer ids of tokens and the values record how often each token occurs in the document. Then, some measure of semantic similarity, like the cosine of the angle between unitized versions of the document vectors, is calculated. Finally, distinct topics are identified as driving certain groups of documents. The result is a list of `n` topics with the driving words for each topic, and a list of documents with their relation to each topic (how strongly a document fits that topic).
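
The vector representation and the cosine similarity step can be sketched with the standard library alone. This illustrates only those two steps, not a topic model; the three tiny documents and the `cosine` helper are made up for illustration.

```python
import math
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[w] * bag_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b)

# Three made-up documents, stored as word counts
docs = {
    "a": Counter("grail quest grail knight".split()),
    "b": Counter("knight quest grail".split()),
    "c": Counter("swallow coconut coconut".split()),
}

print(cosine(docs["a"], docs["b"]), cosine(docs["a"], docs["c"]))
```

Documents that share vocabulary score close to 1, and documents with no words in common score 0.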

Let's run a topic model on the characters of *Monty Python*.

Luckily, there is another Python library, gensim, that takes care of the heavy lifting for us.

In [None]:
from gensim import corpora, models, similarities

First we need to separate the speeches and people, but keep them in the same order so we can index correctly when done. For the speeches, we'll need all of a character's speech as one string, then tokenized. We also need to remove punctuation and stop words so that the model can identify the words that matter for each document. It seems we've gotten lucky again: we already wrote *rem_punc_stop*!

In [None]:
people = []
speeches = []
for k,v in char_dict.items():
    people.append(k)
    new_string = ' '.join(v)
    speeches.append(rem_punc_stop(new_string))

Now we create the dictionary of words used to create the matrices, and set thresholds for word frequencies within the corpus:

In [None]:
#create a Gensim dictionary from the texts
dictionary = corpora.Dictionary(speeches)

#remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
#no_below is absolute # of docs, no_above is fraction of corpus
dictionary.filter_extremes(no_below=2, no_above=.70)

#convert the dictionary to a bag of words corpus for reference
corpus = [dictionary.doc2bow(i) for i in speeches]
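
To show the shape of the data at this point, here is a standard-library sketch of a token-to-id mapping and the `(token_id, count)` pairs that a bag-of-words corpus contains. The helper `to_bow` and the id numbering (by first appearance) are assumptions for illustration; gensim's own numbering may differ.

```python
# Made-up tokenized "speeches", used only for illustration
texts = [["grail", "quest", "grail"], ["quest", "knight"]]

# Assign each distinct token an integer id, in order of first appearance
token2id = {}
for text in texts:
    for tok in text:
        token2id.setdefault(tok, len(token2id))

def to_bow(text):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = {}
    for tok in text:
        tok_id = token2id[tok]
        counts[tok_id] = counts.get(tok_id, 0) + 1
    return sorted(counts.items())

toy_corpus = [to_bow(t) for t in texts]
print(token2id)
print(toy_corpus)
```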

Finally we set the parameters for the LDA topic modelling:

In [None]:
#we run chunks of 15 documents, and update after every 2 chunks, and make 10 passes
lda = models.LdaModel(corpus, num_topics=6, 
                            update_every=2,
                            id2word=dictionary, 
                            chunksize=15, 
                            passes=10)

lda.show_topics()

To match characters to their topics, we just index the model with the corpus:

In [None]:
# the model was trained on the bag-of-words corpus, so apply it to that same representation
corpus_lda = lda[corpus]
for i, doc in enumerate(corpus_lda): # the bow->lda transformation is actually executed here, on the fly
    print(people[i], doc)
    print()

## Practice

In the time remaining, pull up a dataset that you have, and that you'd like to work with in Python. The instructors will be around to help you apply what you've learned today to problems in your data that you are dealing with.

If you don't have data of your own, you should practice with the test data we've given you here. For example, you could try to figure out:

1. Is King Arthur happier than Sir Robin, based on his speech?
2. Which character in Monty Python has the biggest vocabulary?