# Text Analysis Workshop

Today's workshop will address various concepts in text analysis, primarily through the use of NLTK and gensim. A fundamental understanding of Python is necessary, and some knowledge of NLTK would be helpful. We will cover:

1. Preparing your own corpus
2. Tagging and Chunking
3. Document Classification & Topic Modeling

You will need:

* NLTK ( - pip install NLTK)
* BeautifulSoup ( - pip install beautifulsoup4)
* gensim ( - pip install gensim)

This workshop will also help solidify your understanding of regex, list comprehensions, and saving via pickle.

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ . The respective guides for BeautifulSoup and gensim are here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ and here: https://radimrehurek.com/gensim/ .

# 1) Preparing your own corpus

We are going to take Jonathan Swift's *Gulliver's Travels* from archive.org to use as our text throughout today's workshop. Although we will also use pre-made corpora to explore more robust options, it is useful to know how to clean your own text files, create your own corpus, declare it properly, and run analyses.

## String manipulation and cleaning

Let's first use Beautiful Soup to grab only the text. There are packages that exist to clean texts from standard sites such as a Gutenberg package for gutenberg.org, but today we'll clean it as best we can manually:

In [None]:
import urllib.request
from bs4 import BeautifulSoup

url = "https://ia801404.us.archive.org/2/items/gulliverstravels17157gut/17157-h/17157-h.htm"

f = urllib.request.urlopen(url)
html = f.read()

#clean and extract only raw text 
rawtext = BeautifulSoup(html, "html.parser")
rawtext = rawtext.get_text()

#slice at beginning and end of book, keeping the whole closing phrase
end_marker = "of my unfortunate voyage"
gtravels = rawtext[rawtext.find("My father had"):rawtext.find(end_marker) + len(end_marker)]

print (gtravels)

You'll notice there are still page numbers and chapter headings in our text, and you might have other pieces you want to clean. Recalling your regex work from Part 4 of the intro series, how can we get rid of all the page numbers within brackets?

In [None]:
import re

#regex for page numbers in brackets
pgnumbers = re.findall(r"\[[0-9]+\]", gtravels)

for x in pgnumbers:
    gtravels = gtravels.replace(x,"")

#regex for all roman numerals and CHAPTER or PART or ARTICLE before it
#you might want to save the roman numerals regex if you work frequently with such texts
#the pattern is split across adjacent raw strings so no stray newlines or spaces end up inside it
chapters = re.findall(
    r"([A-Z]+ (M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})"
    r"|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\.)", gtravels)
chapters = [x[0] for x in chapters]

for x in chapters:
    gtravels = gtravels.replace(x,"")
    
print (gtravels)
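
As a quick sanity check of the page-number pattern, here is a minimal sketch on a made-up string (the sample sentence is invented for illustration and is not taken from the book). It uses `re.sub`, which finds and replaces in a single step, instead of the `findall` plus `replace` loop above:

```python
import re

# Invented sample; the real text comes from the archive.org download above.
sample = "I lay [12] all this while in great uneasiness. [13] At length I broke the strings."

# \[ and \] match literal brackets; [0-9]+ matches one or more digits between them.
# The trailing \s* also swallows the whitespace left behind by the removed marker.
cleaned = re.sub(r"\[[0-9]+\]\s*", "", sample)
print(cleaned)
```

Testing a pattern on a short string like this before running it over the whole book makes it much easier to spot a regex that matches too much or too little.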

Let's save this text in case we want to use it later:

In [None]:
with open("gulliver.txt", "w") as f:
    f.write(gtravels)

## Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's much easier to formally declare your corpus.

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader

#set corpus_root to the folder where you saved gulliver.txt
corpus_root = "/Users/chench/Box Sync/Python Notebooks"
wordlists = PlaintextCorpusReader(corpus_root, '.*txt')

We now have a text corpus, on which we can run all the basic methods you learned in the introductory sequence. To list all the files in our corpus:

In [None]:
wordlists.fileids()

We can also extract either all the words or all the sentences in list format:

In [None]:
wordlists.words('gulliver.txt')

In [None]:
gsents = wordlists.sents('gulliver.txt')
print (gsents)

Recalling Part 4 of the intro series, we can now get basic statistics. Let's find word frequencies, but first we must clean up punctuation and stop words if we want to see anything worthwhile.

In [None]:
from nltk.corpus import stopwords
from string import punctuation

stop = set(stopwords.words('english')) #build the stop list once; set lookups are fast

gwords = [x.lower() for x in wordlists.words('gulliver.txt') if x not in punctuation]
gwords = [x for x in gwords if x not in stop]

fdist = nltk.FreqDist(gwords) #frequency
mostcommon = fdist.most_common(100)

print (mostcommon)
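
If you want to see the same idea without a corpus on disk, here is a small sketch using only the standard library. `nltk.FreqDist` is built on `collections.Counter`, so the counting step looks almost identical. The token list and the two-word stop list are invented for illustration:

```python
from collections import Counter
from string import punctuation

# Invented tokens and a tiny hand-picked stop list, for illustration only.
tokens = ["The", "captain", "and", "the", "crew", ",", "the", "captain", "."]
stop = {"the", "and"}

# Drop punctuation tokens and lowercase, then drop stop words.
words = [t.lower() for t in tokens if t not in punctuation]
words = [w for w in words if w not in stop]

counts = Counter(words)
print(counts.most_common(2))
```

Note the order of operations: lowercasing has to happen before the stop word filter, otherwise "The" would slip through.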

# 2) Tagging

There are many situations in which "tagging" words (or really anything) is useful, whether simply to calculate trends or to feed further text analysis that extracts meaning. We will cover three methods of tagging: simple regex, n-gram, and Brill transformation-based tagging. Although they will not be covered here, HMMs, CRFs, and neural networks will be briefly mentioned as additional machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is by far the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging is simply mapping a word to a specific category via a tuple.

## At a low level

Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:

My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period

You'll notice how the text is annotated, using a forward slash to match each word to its tag. So how can we get this into a useful form for Python?

In [None]:
import nltk

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [nltk.tag.str2tuple(t) for t in line.split()]

print (tagged_sent)
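
To demystify what happens to each `word/tag` token, here is a rough plain-Python sketch of the split. The helper name `split_tag` is hypothetical, and this is only an approximation of `nltk.tag.str2tuple`, which (as far as I know) also splits at the last separator and uppercases the tag:

```python
def split_tag(token, sep="/"):
    """Split 'word/tag' at the LAST separator and uppercase the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: return the token with no tag.
        return (token, None)
    return (word, tag.upper())

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
print([split_tag(t) for t in line.split()])
```

Splitting at the last slash rather than the first matters for tokens that contain a slash themselves, such as `and/or/CC`.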

Further analysis of tags with NLTK requires a *list* of sentences, otherwise you will get an index error. So let's add a couple more sentences.

In [None]:
lines = [line, "He/Pronoun likes/Verb Python/Noun ./Period", "Do/Verb you/Pronoun like/Verb Python/Noun ?/Question_Mark"]

tagged_sents = []
for line in lines:
    tagged_sents.append([nltk.tag.str2tuple(t) for t in line.split()])

print (tagged_sents, len(tagged_sents))

Naturally, these tags are a bit verbose. The standard tagging conventions follow the Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

## Working with a tagged corpus

Now that we know how tagging works, let's import a tagged corpus from the NLTK database and see what we can do.

In [None]:
from nltk.corpus import brown #if you don't have this downloaded, type nltk.download()
nltk.corpus.brown.tagged_words(tagset='universal')

*NB: the option "universal" simplifies the tagset. Much more precise tags do exist for the linguists in the room.*

Let's find the most frequent parts of speech in the corpus:

In [None]:
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

We can also find out what the most common nouns are:

In [None]:
word_tag_fd = nltk.FreqDist(brown_news_tagged)
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'NOUN']

For the linguists, there are naturally many subgroups of nouns. Let's see what we can get:

In [None]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

We can also look at the linguistic environment a word appears in. The code below lists all the words following "President":

In [None]:
brown_learned_text = brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'President'))
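
If bigrams are new to you, the sketch below shows the same lookup on a short invented word list using only `zip`. Pairing a list with itself shifted by one gives the same (word, next word) pairs that `nltk.bigrams` produces:

```python
# Invented word list for illustration; the notebook uses the Brown news words.
words = ["President", "Kennedy", "met", "President", "Johnson", "and",
         "President", "Kennedy"]

# zip(words, words[1:]) pairs every word with the word that follows it.
pairs = list(zip(words, words[1:]))

# Keep the second element of each pair whose first element is "President".
followers = sorted({b for (a, b) in pairs if a == "President"})
print(followers)
```

Using a set removes the duplicate "Kennedy" before sorting, just as `set(...)` does in the cell above.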

It might be useful to see just the tags of those words:

In [None]:
brown_lrnd_tagged = brown.tagged_words(categories='news', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'President']
fd = nltk.FreqDist(tags)
fd.tabulate()

## Automatic Tagging

Now that we know some things we can do with a tagged corpus, how can we tag our own corpus? We will work through regex models, n-gram models, and discuss a couple more advanced models.

### Regex Tagger

Let's write a simple regex tagger for 8 parts of speech. First we need to define the patterns for each part:

In [6]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*es$', 'VBZ'),                # 3rd singular present
     (r'.*ould$', 'MD'),               # modals
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
     (r'.*', 'NN')                     # nouns (default)
 ]
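
To see how a pattern list like this gets used, here is a plain-Python sketch of the idea: try the patterns in order and take the tag of the first one that matches. The function name `regex_tag` is hypothetical and this is not NLTK's own implementation, just an illustration of why the order of the patterns matters:

```python
import re

# The same eight patterns, with the dot in the number pattern escaped.
patterns = [
    (r".*ing$", "VBG"),               # gerunds
    (r".*ed$", "VBD"),                # simple past
    (r".*es$", "VBZ"),                # 3rd singular present
    (r".*ould$", "MD"),               # modals
    (r".*'s$", "NN$"),                # possessive nouns
    (r".*s$", "NNS"),                 # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"), # cardinal numbers
    (r".*", "NN"),                    # nouns (default)
]

def regex_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    tagged = []
    for word in words:
        for pattern, tag in patterns:
            if re.match(pattern, word):
                tagged.append((word, tag))
                break
    return tagged

print(regex_tag(["walking", "walked", "could", "ship's", "ships", "3.5", "ship"]))
```

Because the first match wins, the possessive pattern has to come before the plural pattern, and the catch-all noun pattern has to come last.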

Now we build the tagger and we can test it on the first sentence of our *Gulliver's Travels*.

In [None]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(gsents[0])

That didn't work so well. No worries, this was a very naïve attempt. We can nonetheless evaluate its accuracy:

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger.evaluate(brown_tagged_sents)

### N-Gram Tagging

N-gram tagging looks at a word together with the tags of the *n*-1 preceding words to determine the best tag for that word. N-gram taggers, like many other machine learning models, are called "supervised" because they are trained on data whose correct tags are already known. This also means that we must divide the data into training and testing sets: if you test a model on the same data it was trained on, the score will be biased in the model's favor. Originally, a 90-10 split was recommended, but k-fold cross-validation, usually with 10 folds, has since become the standard.

In [None]:
#divide tagged data
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#train bigram tagger
bigram_tagger = nltk.BigramTagger(train_sents)
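
The cell above uses a single 90-10 split. For comparison, here is a minimal sketch of how k-fold cross-validation divides the data. The helper name `k_fold_splits` is hypothetical and the "sentences" are placeholder strings; in practice you would pass in the tagged sentences:

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item lands in the test set exactly once."""
    # Deal the items into k folds, like dealing cards.
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training data is everything outside the current test fold.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

sentences = ["sentence %d" % n for n in range(20)]
splits = list(k_fold_splits(sentences, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

You would then train one tagger per split, evaluate it on that split's test fold, and average the scores.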

We can now try this tagger on that sentence again:

In [None]:
bigram_tagger.tag(gsents[0])

Every "None" means the tagger could not assign a tag, because it never saw that word in that context during training. To fix this, we have to implement backoff tagging:

In [None]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)

That looks better. Now let's try to tag that sentence again:

In [None]:
t2.tag(gsents[0])
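
To make the backoff chain less of a black box, here is a toy sketch in plain Python. It is not how NLTK implements its taggers; it only illustrates the order of the lookups: try the bigram context first, fall back to the word alone, and fall back to a default tag last. The training sentences and the function name `tag_with_backoff` are invented for illustration:

```python
from collections import Counter, defaultdict

# Tiny invented training set of tagged sentences.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("peels", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

unigram = defaultdict(Counter)  # word -> counts of its tags
bigram = defaultdict(Counter)   # (previous tag, word) -> counts of its tags
for sent in train:
    prev = None
    for word, tag in sent:
        unigram[word][tag] += 1
        bigram[(prev, word)][tag] += 1
        prev = tag

def tag_with_backoff(words, default="NOUN"):
    tagged, prev = [], None
    for word in words:
        if (prev, word) in bigram:      # context seen in training
            tag = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:           # word seen, but not in this context
            tag = unigram[word].most_common(1)[0][0]
        else:                           # unknown word
            tag = default
        tagged.append((word, tag))
        prev = tag
    return tagged

print(tag_with_backoff(["the", "bark"]))
print(tag_with_backoff(["dogs", "bark"]))
```

The word "bark" gets a different tag in each sentence because the tag of the preceding word differs, which is exactly the information a unigram tagger cannot use.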

### Transformation-based Brill Tagging

There are many different machine learning algorithms out there. The current "hot" choice is neural networks, but that is beyond the scope of this workshop. Let's look at a transformation-based tagger included in NLTK, which will help us understand how many machine learning models make decisions.

In [None]:
from nltk.tag.brill import *
from nltk.tag import brill_trainer

def train_brill_tagger(tagged_sents):
    #baseline taggers, trained on the same sentences that are passed in
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(tagged_sents, backoff=t0)
    t2 = nltk.BigramTagger(tagged_sents, backoff=t1)
    Template._cleartemplates()
    templates = brill24() #or fntbl37
    t3 = brill_trainer.BrillTaggerTrainer(t2, templates, trace=3)
    t3 = t3.train(tagged_sents, max_rules=100)
    
    return t3

#train only on the training split so that test_sents stays unseen
tag = train_brill_tagger(train_sents)


We see that the Brill tagger corrects itself up to a certain threshold, based on rules it generated from the data we gave it. Other machine learning models such as Conditional Random Fields (CRF) work in a similar way, in that you tell the model which features are important to look at, and it weights these features in writing its rules. Neural networks rely more on linear algebra and matrix multiplication, which is a different approach. Libraries do exist for easy implementation of neural nets, such as pybrain (http://pybrain.org) for general advanced modelling and nlpnet (http://nilc.icmc.usp.br/nlpnet/index.html) for POS tagging or SRL (Semantic Role Labeling).
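
To give a feel for what one of these learned rules does, here is a simplified plain-Python sketch of a single transformation of the form "change tag A to tag B when the previous word is tagged C". The function name `apply_rule` is hypothetical and real Brill rules in NLTK cover more kinds of context, so treat this as an illustration only:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Return a copy of tagged with from_tag -> to_tag wherever the
    preceding word carries prev_tag (checked against the original tags)."""
    out = []
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        out.append((word, tag))
    return out

# A baseline tagger has guessed NN for "increase", its most common tag.
initial = [("to", "TO"), ("increase", "NN"), ("grants", "NNS")]

# Rule: replace NN with VB when the previous word is tagged TO.
corrected = apply_rule(initial, "NN", "VB", "TO")
print(corrected)
```

A Brill tagger learns an ordered list of such rules, keeping at each step the rule that fixes the most errors in the training data.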

So let's tag that sentence again with our Brill tagger:

In [None]:
gtagged_sent = tag.tag(gsents[0])
print (gtagged_sent)

In [None]:
tag.evaluate(test_sents)

Not bad! In developing machine learning models, you may want to know where the model is making errors. This can be done by examining the Confusion Matrix:

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

gold = tag_list(brown.tagged_sents(categories='news'))
test = tag_list(apply_tagger(tag, brown.tagged_sents(categories='news')))

cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
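
Under the hood, a confusion matrix is just a count of (gold tag, predicted tag) pairs. Here is a minimal sketch with the standard library, using two short invented tag lists rather than real tagger output:

```python
from collections import Counter

# Invented gold-standard and predicted tags, aligned position by position.
gold = ["NN", "VB", "NN", "JJ", "NN"]
pred = ["NN", "NN", "NN", "JJ", "VB"]

# Each key is a (gold, predicted) pair; each value is how often it occurred.
confusion = Counter(zip(gold, pred))

# Accuracy is the share of pairs on the diagonal (gold == predicted).
correct = sum(n for (g, p), n in confusion.items() if g == p)
accuracy = correct / len(gold)

for (g, p), n in sorted(confusion.items()):
    print(g, "->", p, n)
print(accuracy)
```

The off-diagonal pairs, such as a gold VB predicted as NN, are the ones worth inspecting when you want to improve the model.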

### Pickling

If you want to save your model, or any complex variable in Python, you can use pickle:

In [None]:
from pickle import dump
#the with block closes the file for us; -1 selects the highest pickle protocol
with open('brilltagger.pkl', 'wb') as output:
    dump(tag, output, -1)

In [None]:
from pickle import load
#avoid naming the file handle "input", which would shadow the built-in function
with open('brilltagger.pkl', 'rb') as f:
    tagger = load(f)

## Chunking, grammars, and Named Entity Recognition

At a low linguistic level, you may want to map out a sentence visually based on parts of speech. This visualization is actually just a tree data structure, which can also be used to mine statistics. Let's first tokenize and POS tag our sentence. We can write a quick function to do this:

In [None]:
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                     #split into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences] #split into words
    sentences = [nltk.pos_tag(sent) for sent in sentences]       #POS tag each word
    return sentences

We now have to define the grammar. We'll just define a noun phrase for English consisting of a determiner, adjective, and noun. Defining the grammar is done similarly to writing regular expressions. We can then draw the map.

In [None]:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """

cp = nltk.RegexpParser(grammar)
result = cp.parse(gtagged_sent)
result.draw()

With this information, we can then train classifiers for Named Entity Recognition (NER), i.e. identifying people, places, and things. We won't go into detail today, but NLTK already has a trained classifier we can use off-the-shelf:

In [None]:
nltk.ne_chunk(gtagged_sent, binary=True)

# 3) Document Classification and Topic Modeling

There are two popular choices for models here: Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). The detailed math for both is, honestly, beyond my grasp. What matters here is that LDA is a more complex process, so it takes more resources and longer to run, but it tends to be more accurate. LSI is a much simpler process and can be run quite quickly.

- LSI looks at the words in a document and their relationships to other words, with the important assumption that every word can only mean one thing. (*cf.* https://en.wikipedia.org/wiki/Latent_semantic_indexing)

- LDA seeks to remedy this fault by allowing words to belong to multiple topics. It first groups words by topic, then compares each document against every topic to determine the best fit. (*cf.* https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)


First we'll take 10 sentences from each of 2 different parts of our *Gulliver's Travels*. We'll try to find the most distinctive topics in each section. Each sentence within the 2 parts will act as a "document". For those looking to do more ambitious work later, sentences can naturally be scaled up to what we usually understand as documents.

In [None]:
selection = gsents[0:10] + gsents[500:510]
stop = set(stopwords.words('english')) #build the stop list once

docs = []

for doc in selection: #doc here is actually each sentence
    wordsonly = [x.lower() for x in doc if x not in punctuation]
    wordsonly = [x for x in wordsonly if x not in stop]
    docs.append(wordsonly)
    
print (docs)

We also want to take out words that appear only once, so their uniqueness does not skew our results.

In [None]:
from collections import defaultdict
frequency = defaultdict(int)
for text in docs:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in docs]
texts = [x for x in texts if len(x) > 0]

from pprint import pprint
pprint(texts)

Let's build a gensim dictionary from these texts, which maps each word to an integer id, and save it:

In [None]:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(texts)
dictionary.save('gtravels.dict')
print(dictionary.token2id)

We now use the dictionary to turn each document into a vector. This is essentially a different format for keeping word frequencies: each document becomes a list of (word id, count) pairs drawn from the vocabulary shared by all documents. We'll save it in the Matrix Market format:

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('gtravels.mm', corpus)
print(corpus, len(corpus))
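
If the bag-of-words format looks opaque, here is a minimal sketch of the same idea in plain Python. The names `token2id` and `to_bow` mirror gensim's vocabulary but are defined here by hand, the texts are invented, and the ids are assigned in order of first appearance, which need not match the numbering gensim chooses:

```python
from collections import Counter

# Invented mini-corpus standing in for our filtered sentences.
texts = [["ship", "captain", "ship"],
         ["captain", "island"]]

# Give every distinct word an integer id, in order of first appearance.
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def to_bow(text):
    """Return a sorted list of (word id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

print(token2id)
print([to_bow(text) for text in texts])
```

Words that are not in the vocabulary are silently skipped, which is also what happens when you convert a new, unseen document.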

Without going into too much detail, transforming the vectors essentially assigns "real-value weights" to our previous bag-of-words and frequency data. We use our corpus as "training" data, similar to what we did with POS tagging and NLTK.

In [None]:
tfidf = models.TfidfModel(corpus)
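
For those curious about what these "real-value weights" are, here is a hand-rolled sketch of one common TF-IDF variant: the term count times log base 2 of (number of documents / number of documents containing the term), with each document vector then scaled to unit length. I believe this matches gensim's default settings, but treat that as an assumption rather than a guarantee. The mini-corpus is invented:

```python
import math
from collections import Counter

# Invented mini-corpus; each inner list plays the role of one "document".
docs = [["ship", "captain", "ship"],
        ["captain", "island"],
        ["island", "emperor"]]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight = term count * log2(n_docs / doc_freq), scaled to unit length."""
    counts = Counter(doc)
    weights = {t: counts[t] * math.log2(n_docs / doc_freq[t]) for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

weights = tfidf(docs[0])
print(weights)
```

In the first document "ship" outweighs "captain" for two reasons: it occurs twice, and it appears in fewer documents overall.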

We now apply the transformation to all of our documents (the sentences that survived the filtering above, so at most 20):

In [None]:
corpus_tfidf = tfidf[corpus]

for doc in corpus_tfidf:
    print(doc)

Now we can apply a second transformation to the TF-IDF vectors in order to get a 2-D space (we're asking it to give us 2 topics here). This is called Latent Semantic Indexing (see above). Essentially, we are looking for words with particular importance in certain contexts.

In [None]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

To see the words with the most influence on the topic, we simply print the topics:

In [None]:
lsi.print_topics(2)

Finally, we can look at the similarity of each "document", or sentence from the two parts, to each topic:

In [None]:
for doc in corpus_lsi:
    print(doc)

As may be expected, we clearly see a stronger association between the first 10 sentences and topic 1, and a stronger association of the second ten sentences and topic 2. These are, after all, from completely different parts of his travels!