# Text Analysis Workshop

Today's workshop will address various concepts in text analysis, primarily through the use of NLTK, gensim, and scikit-learn. A fundamental understanding of Python is necessary, and some knowledge of NLTK would be helpful. We will cover:

1. Preparing your own corpus
2. Tagging and Chunking
3. Topic Modeling
4. Clustering

You will need:

* NLTK ( \$ pip install nltk)
* Brown corpus from NLTK ( >>> nltk.download() )
* BeautifulSoup ( \$ pip install beautifulsoup4)
* gensim ( \$ pip install gensim)
* scikit-learn ( \$ pip install scikit-learn)
* pandas ( \$ pip install pandas)
* matplotlib ( \$ pip install matplotlib)

This workshop will also help solidify your understanding of regex, list comprehensions, output via pickle, and plotting.

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ . The respective guides for BeautifulSoup and gensim are here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ and here: https://radimrehurek.com/gensim/ . The clustering section is modified from http://brandonrose.org/clustering . For further explanation of grammars or topic modeling on the low-level, see *Data Science from Scratch*: http://shop.oreilly.com/product/0636920033400.do .

# 1) Preparing your own corpus

We are going to take Jonathan Swift's *Gulliver's Travels* from archive.org to use as our text throughout today's workshop. Although we will utilize pre-made corpora to explore more robust options, it is useful to know how to clean your own text files, create your own corpus, declare it properly, and run analyses, so we will start from scratch.

## String manipulation and cleaning

Let's first use Beautiful Soup to grab only the text. There are packages that exist to clean texts from standard sites such as a Gutenberg package for gutenberg.org, but today we'll clean it as best we can manually:

In [None]:
import urllib.request
from bs4 import BeautifulSoup

url = "https://ia801404.us.archive.org/2/items/gulliverstravels17157gut/17157-h/17157-h.htm"

f = urllib.request.urlopen(url)
html = f.read()

#clean and extract only raw text
soup = BeautifulSoup(html, "html.parser")
rawtext = soup.get_text()

#slice at beginning and end of book
beginning = "My father had"
end = "of my unfortunate voyages."
gtravels = rawtext[rawtext.find(beginning):rawtext.find(end)+len(end)]

print (gtravels)

You'll notice there are still page numbers and chapter headings in our text, and you might have other pieces you want to clean. Recalling your regex work from Part 4 of the intro series, how can we get rid of all the page numbers within brackets?

In [None]:
import re

#regex for page numbers in brackets, e.g. [12]
pgnumbers = re.findall(r"\[[0-9]+\]", gtravels)

for x in pgnumbers:
    gtravels = gtravels.replace(x,"")

#regex for all roman numerals and CHAPTER or PART or ARTICLE before it
#you might want to save the roman numerals regex if you work frequently with such texts
#re.VERBOSE ignores the line breaks and indentation in the pattern, so the literal space is written as [ ]
chapters = re.findall(r'''([A-Z]+[ ](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})
                        (XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})
                        (IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\.)''', gtravels, flags=re.VERBOSE)

chapters = [x[0] for x in chapters]

for x in chapters:
    gtravels = gtravels.replace(x,"")
    
print (gtravels)
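
To see what the page-number pattern does in isolation, here is a minimal sketch on a made-up sentence. The `sample` string is invented for illustration and is not taken from the book, and `re.sub` is used as a one-step alternative to the find-then-replace loop above:

```python
import re

# invented sample line containing two bracketed page numbers
sample = "The author [12] leaves Lilliput [103] in a boat."

# \[ and \] match literal brackets, [0-9]+ matches one or more digits
no_pages = re.sub(r"\[[0-9]+\]", "", sample)

# removing a page number leaves a doubled space behind, so collapse those
tidy = re.sub(r" {2,}", " ", no_pages)

print(tidy)
```

The same two-step idea (delete the pattern, then tidy the whitespace) should carry over to other residue you find in your own texts.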

Let's save this text so we can read it in the corpus later:

In [None]:
import codecs
with codecs.open("gulliver.txt", "w","utf-8") as f:
    f.write(gtravels)

## Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's much easier to formally declare your corpus.

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = "/Users/chench/Box Sync/Python Notebooks"
my_texts = PlaintextCorpusReader(corpus_root, '.*txt')

We now have a text corpus, on which we can run all the basic methods you learned in the introductory sequence. To list all the files in our corpus:

In [None]:
my_texts.fileids()

We can also extract either all the words or all the sentences in list format:

In [None]:
my_texts.words('gulliver.txt')

In [None]:
gsents = my_texts.sents('gulliver.txt')
print (gsents)

Recalling Part 4 of the intro series, we can now get basic statistics. Let's find word frequencies, but first we must clean up punctuation and stop words if we want to see anything worthwhile.

In [None]:
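from nltk.corpus import stopwords
from string import punctuation

#build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

gwords = [x.lower() for x in my_texts.words('gulliver.txt') if x.lower() not in punctuation
          and x.lower() not in stop_words]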

fdist = nltk.FreqDist(gwords)
mostcommon = fdist.most_common(100)

print (mostcommon)
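
If it helps to see the counting step without NLTK, the sketch below does the same kind of filtering and counting with `collections.Counter`, which `nltk.FreqDist` builds on. The token list and the two-word stop list are invented for illustration only:

```python
from collections import Counter
from string import punctuation

# invented tokens standing in for my_texts.words(...)
tokens = ["The", "ship", ",", "the", "storm", "and", "the", "ship", "."]

# hypothetical, tiny stop list; the real one comes from nltk.corpus.stopwords
stop = {"the", "and"}

# lowercase, then drop punctuation and stop words
words = [t.lower() for t in tokens
         if t not in punctuation and t.lower() not in stop]

counts = Counter(words)
print(counts.most_common(2))
```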

# 2) Tagging

There are many situations in which "tagging" words (or really anything) is useful, whether to calculate trends or to prepare a text for further analysis. We will cover three methods of tagging: simple regex, n-gram, and Brill transformation-based tagging. Hidden Markov Models (HMM), Conditional Random Fields (CRF), and neural networks will not be covered today, but we will briefly mention them as additional machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging simply means assigning a word to a specific category via a tuple.

Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS as a feature in your model.

## On a low-level

Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:

My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period

*NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.*

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this into a useful form for Python?

In [None]:
import nltk

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [nltk.tag.str2tuple(t) for t in line.split()]

print (tagged_sent)

Further analysis of tags with NLTK requires a *list* of sentences; otherwise you will get an index error. So let's add a couple more sentences.

In [None]:
lines = [line, "He/Pronoun likes/Verb Python/Noun ./Period", "Do/Verb you/Pronoun like/Verb Python/Noun ?/Question_Mark"]

tagged_sents = [[nltk.tag.str2tuple(t) for t in line.split()] for line in lines]

print (tagged_sents, len(tagged_sents))
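
For those curious what happens inside the conversion, here is a rough plain-Python sketch of turning `word/tag` strings into tuples. The helper name `split_tagged_token` is made up for this example; it splits on the last slash and upper-cases the tag, which is how `nltk.tag.str2tuple` appears to behave, but treat it as an approximation rather than NLTK's actual code:

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/tag' on the last separator into a (word, TAG) tuple."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # no separator present: keep the word and leave the tag empty
        return (token, None)
    return (word, tag.upper())

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged = [split_tagged_token(t) for t in line.split()]
print(tagged)
```

Splitting on the *last* slash matters for tokens that themselves contain a slash, since only the final piece is the tag.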

Naturally, these tags are a bit verbose. The standard tagging conventions follow the Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

## Working with a tagged corpus

Now that we know how tagging works, let's import a tagged corpus from the NLTK database and see what we can do.

In [None]:
from nltk.corpus import brown #if you don't have this downloaded, type nltk.download()
nltk.corpus.brown.tagged_words(tagset='universal')

*NB: the option "universal" simplifies the tagset. Much more precise tags do exist for the linguists in the room.*

Let's find the most frequent parts of speech in the corpus:

In [None]:
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

We can also find out what the most common nouns are. For the linguists, there are naturally many subgroups of nouns; let's see what we can get:

In [None]:
def find_tags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = find_tags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

We can also look at the linguistic environment a word appears in. The cell below lists all the words following "President":

In [None]:
brown_news_text = brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(brown_news_text) if a == 'President'))

It might be useful to see just the tags of those words:

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(brown_news_tagged) if a[0] == 'President']
fd = nltk.FreqDist(tags)
fd.tabulate()

## Automatic Tagging

Now that we know some things we can do with a tagged corpus, how can we tag our own corpus? We will work through regex models, n-gram models, and discuss a couple more advanced models.

### Regex Tagger

Let's write a simple regex tagger for 8 parts of speech. First we need to define the patterns for each part:

In [None]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*es$', 'VBZ'),                # 3rd singular present
     (r'.*ould$', 'MD'),               # modals
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
     (r'.*', 'NN')                     # nouns (default)
 ]

Now we build the tagger and we can test it on the first sentence of our *Gulliver's Travels*.

In [None]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(gsents[0])

That didn't work so well, but no worries: this was a very naïve attempt. We can evaluate the accuracy nonetheless:

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger.evaluate(brown_tagged_sents)
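
The regex tagger can be pictured as a simple "first matching pattern wins" loop. The sketch below is a plain-Python approximation of that idea, not NLTK's implementation; `regex_tag` is a name invented here, and the pattern list repeats the one above so the block runs on its own:

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN'),                    # nouns (default)
]

def regex_tag(word):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag

for w in ["sailing", "voyages", "Gulliver's", "1726", "island"]:
    print(w, regex_tag(w))
```

Note that "voyages" comes out as `VBZ` because the `es` rule fires before the plural-noun rule. Order-dependent mistakes like this are one reason the accuracy stays low.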

### N-Gram Tagging

N-gram tagging looks at a word and the tags of the *n*-1 preceding words to determine the best tag for that word. N-gram taggers and other machine learning models that are trained on labeled data are called "supervised", because the correct answers are known for the data given to them. This also means that we must divide the data into training and testing data: if you test your model on the same data it was trained with, the accuracy you measure will be overly optimistic. Originally, a 90-10 divide was recommended, but standards have now changed to k-fold cross-validation, usually with 10 folds.

In [None]:
#divide tagged data
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#train bigram tagger
bigram_tagger = nltk.BigramTagger(train_sents)

We can now try this tagger on that sentence again:

In [None]:
bigram_tagger.tag(gsents[0])

Every "None" means the tagger did not know how to tag that word because the model was insufficient: once it encounters a word it cannot tag, the words that follow will also be un-taggable, since their context now contains an unknown tag. To fix this we have to implement backoff tagging:

In [None]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.RegexpTagger(patterns, backoff=t0)
t2 = nltk.UnigramTagger(train_sents, backoff=t1)
t3 = nltk.BigramTagger(train_sents, backoff=t2)

Now let's try to tag that sentence again:

In [None]:
t3.tag(gsents[0])

In [None]:
t3.evaluate(test_sents)
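
To make the backoff idea concrete, here is a toy sketch with only two levels: a unigram lookup learned from training data, and a default tag for anything unseen. The training sentences and the name `tag_with_backoff` are invented for illustration; NLTK's taggers chain in the same spirit but are considerably more elaborate:

```python
from collections import Counter, defaultdict

# invented, tiny training set of tagged sentences
train = [
    [("the", "DET"), ("ship", "NOUN"), ("sails", "VERB")],
    [("the", "DET"), ("sails", "NOUN"), ("tore", "VERB")],
    [("a", "DET"), ("ship", "NOUN"), ("sails", "VERB")],
]

# count how often each word carries each tag
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# unigram model: every known word gets its most frequent tag
unigram = {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_with_backoff(words, default="NN"):
    """Use the unigram tag when the word is known, else back off to the default."""
    return [(w, unigram.get(w, default)) for w in words]

print(tag_with_backoff(["the", "storm", "sails"]))
```

"storm" never appeared in training, so it falls through to the default tag instead of coming back as `None`.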

### Transformation-based Brill Tagging

There are many different machine learning algorithms out there. The current "hot" choice is neural networks, but that is beyond the scope of this workshop. Let's look at a transformation-based tagger included in NLTK, which will help us understand how many machine learning models make decisions.

In [None]:
from nltk.tag.brill import *
from nltk.tag import brill_trainer

def train_brill_tagger(tagged_sents):
    #build the backoff chain from the same sentences the Brill trainer sees
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.RegexpTagger(patterns, backoff=t0)
    t2 = nltk.UnigramTagger(tagged_sents, backoff=t1)
    t3 = nltk.BigramTagger(tagged_sents, backoff=t2)
    Template._cleartemplates()
    templates = brill24() #or fntbl37
    trainer = brill_trainer.BrillTaggerTrainer(t3, templates, trace=3)
    return trainer.train(tagged_sents, max_rules=100)

#train on the training split only, so test_sents stays unseen for evaluation
tagger = train_brill_tagger(train_sents)


We see that the Brill tagger corrects itself up to a certain threshold, based on rules it generated from the data we gave it. Other machine learning models such as Conditional Random Fields (CRF) work in a similar way, in that you tell the model which features are important to look at, and it weights these features in writing its rules. Neural networks take a different approach that relies more on linear algebra and matrix multiplication. Libraries do exist for easy implementation of neural nets, such as pybrain (http://pybrain.org) for general advanced modelling, and nlpnet (http://nilc.icmc.usp.br/nlpnet/index.html) for POS or SRL (Semantic Role Labeling).
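
A single transformation rule of the kind a Brill tagger learns might read "change NN to VB when the previous tag is TO". The sketch below applies one such rule by hand; the function `apply_rule` and the sample sentence are invented for illustration and do not reproduce NLTK's rule classes:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Retag from_tag as to_tag wherever the preceding word carries prev_tag."""
    corrected = []
    for i, (word, tag) in enumerate(tagged):
        # the context is read from the original tags, not the corrected ones
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        corrected.append((word, tag))
    return corrected

# initial (imperfect) tagging, e.g. from a default tagger
initial = [("to", "TO"), ("sail", "NN"), ("the", "AT"), ("sail", "NN")]

print(apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO"))
```

Only the first "sail" changes, because only it follows a `TO` tag. Training amounts to searching for the rules of this shape that fix the most errors.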

So let's tag that sentence again with our Brill tagger:

In [None]:
gtagged_sent = tagger.tag(gsents[0])
print (gtagged_sent)

In [None]:
tagger.evaluate(test_sents)

Not bad! In developing machine learning models, you may want to know where the model is making errors. This can be done by examining the Confusion Matrix:

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent] #just grabbing a list of all the tags
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus] #notice we first untag the sentence

#compare gold tags with predicted tags on the held-out test sentences
gold = tag_list(test_sents)
test = tag_list(apply_tagger(tagger, test_sents))

cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=10))
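
Underneath, a confusion matrix is just a count of (gold tag, predicted tag) pairs. A minimal sketch with invented tag lists, using only the standard library:

```python
from collections import Counter

# invented gold and predicted tags for five tokens
gold_tags = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
pred_tags = ["NOUN", "NOUN", "NOUN", "ADJ", "VERB"]

# each cell of the matrix is the count of one (gold, predicted) pair
confusion = Counter(zip(gold_tags, pred_tags))

# off-diagonal cells are the errors
errors = {pair: n for pair, n in confusion.items() if pair[0] != pair[1]}

# accuracy is the share of tokens on the diagonal
accuracy = sum(n for (g, p), n in confusion.items() if g == p) / len(gold_tags)

print(errors, accuracy)
```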

### Pickling

If you want to save your model, or any complex variable in Python, you can use pickle:

In [None]:
from pickle import dump
with open('brilltagger.pkl', 'wb') as output_tagger:
    dump(tagger, output_tagger, -1) #-1 selects the highest pickle protocol

In [None]:
from pickle import load
with open('brilltagger.pkl', 'rb') as input_tagger:
    tagger = load(input_tagger)

In [None]:
type(tagger)
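
If you want to convince yourself that pickling round-trips an object, the sketch below saves and reloads a small dictionary standing in for a trained tagger. The `model` dictionary and the file name are made up, and a temporary directory is used so nothing is left in your working folder:

```python
import os
import pickle
import tempfile

# stand-in for a trained model; any picklable Python object works
model = {"ship": "NOUN", "sails": "VERB"}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

# write in binary mode with the highest available protocol
with open(path, "wb") as f:
    pickle.dump(model, f, pickle.HIGHEST_PROTOCOL)

# read it back and compare with the original
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Keep in mind that unpickling executes code from the file, so only load pickles you created or trust.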

## Chunking, grammars, and Named Entity Recognition

On a low linguistic level, you may want to map out a sentence visually based on parts of speech. This visualization is really just a navigable data type, which can be used to mine statistics. We first have to define the grammar. We'll define a noun phrase for English consisting of an optional determiner, article, cardinal number, or possessive pronoun, followed by any number of adjectives and a noun, and then build prepositional phrases, verb phrases, and clauses on top of it. Defining the grammar is done similarly to writing regular expressions. We can then draw the map.

In [None]:
grammar = r"""
  NP: {<DT|AT|CD|PP\$>?<JJ>*<PPSS|NN.*>}       
  PP: {<IN><NP>}            
  VP: {<BEDZ|HVD|VB.*><AT>?<OD>?<NP|PP|CLAUSE>+} 
  CLAUSE: {<NP><VP>}        
  """
# | is "or", a following ? means optional, * is 0 or more, .* is anything following

cp = nltk.RegexpParser(grammar)

result = cp.parse(gtagged_sent)
result.draw()

In [None]:
print (result) #can be traversed using indexes, obviously searched as well

With this information, we can then train classifiers for Named Entity Recognition (NER), i.e. identifying people, places, and things. We won't go into detail today, but NLTK already has a trained classifier we can use off-the-shelf:

In [None]:
print (nltk.ne_chunk(gtagged_sent, binary=True))

# 3) Topic Modeling

There are two popular choices for models here: Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). The detailed math for both is beyond this workshop, not to mention beyond my knowledge. It is necessary to know that LDA is a more complex process, and thus takes more resources and longer to run, but is generally considered more accurate. LSI is a much simpler process and can be run quite quickly.

- LSI looks at the words in a document and their relationships to other words, with the important assumption that every word can only mean one thing. (*cf.* https://en.wikipedia.org/wiki/Latent_semantic_indexing)

- LDA seeks to remedy this fault by allowing words to exist in multiple topics, first grouping them by topic, and each document is compared across each topic to determine the best fit. (*cf.* https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)


First we'll take 10 sentences from 2 different parts of our *Gulliver's Travels*. We'll try to find the most distinctive topics in each section. Each sentence within the 2 parts will act as a "document". For those looking to do more ambitious work later, sentences can naturally be scaled up to what we understand as documents.

In [None]:
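selection = gsents[0:10] + gsents[500:510]

#build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

docs = [[x.lower() for x in sent if x not in punctuation
         and x.lower() not in stop_words] for sent in selection]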
    
print (docs)

We also want to take out words that appear only once, so their uniqueness does not skew our results.

In [None]:
from collections import defaultdict
frequency = defaultdict(int)
for sent in docs:
     for token in sent:
        frequency[token] += 1

docs = [[token for token in text if frequency[token] > 1] for text in docs if len(text) > 0]

from pprint import pprint
pprint(docs)

Let's make this a dictionary:

In [None]:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(docs)
dictionary.save('gtravels.dict')
print(dictionary.token2id)

We now use the dictionary to turn each document into a vector. This is essentially a different format for keeping word frequencies: each document becomes a list of (word id, count) pairs, with the ids shared across all documents. We'll save it in the Matrix Market format:

In [None]:
corpus = [dictionary.doc2bow(sent) for sent in docs]
corpora.MmCorpus.serialize('gtravels.mm', corpus)
print(corpus, len(corpus))
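
The bag-of-words step can be sketched in a few lines of plain Python. The documents below are invented, and the names `token2id` and `to_bow` are chosen here to echo gensim's vocabulary; gensim may well assign ids in a different order, so only the shape of the result is meant to match:

```python
from collections import Counter

# invented, already-cleaned documents
docs = [["ship", "storm", "ship"], ["island", "ship"]]

# give every distinct token an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

print(token2id)
print([to_bow(doc) for doc in docs])
```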

Without going into too much detail, transforming the vectors essentially assigns "real-value weights" to our previous bag-of-words and frequency data. We use our corpus as "training" data.

In [None]:
tfidf = models.TfidfModel(corpus)

We now apply the transformation to all 20 documents (or sentences in our case):

In [None]:
corpus_tfidf = tfidf[corpus]

for sent in corpus_tfidf:
    print(sent)
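
To give a feel for where those "real-value weights" come from, here is a hand-rolled tf-idf sketch. It assumes the weighting is term frequency times log2(number of documents / document frequency), followed by scaling each document vector to unit length. That is my understanding of gensim's defaults, but treat the exact formula as an assumption; the documents and the function name `tfidf` are invented:

```python
import math
from collections import Counter

# invented, already-cleaned documents
docs = [["ship", "storm", "ship"], ["ship", "island"], ["island", "emperor"]]
n_docs = len(docs)

# document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def tfidf(doc):
    """Weight each token by tf * log2(N / df), then scale to unit length."""
    tf = Counter(doc)
    weights = {t: tf[t] * math.log2(n_docs / doc_freq[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items() if w > 0}

print(tfidf(docs[0]))
```

In the first document "storm" ends up weighted above "ship" even though "ship" occurs twice, because "storm" is rarer across the collection.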

Now we can transform the transformation in order to get a 2-D space (we're asking it to give us 2 topics here). This is called Latent Semantic Indexing (see above). Essentially, we are looking for words with particular importance in certain contexts.

In [None]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

To see the words with the most influence on the topic, we simply print the topics:

In [None]:
lsi.print_topics(2)

Finally, we can look at the similarity of each "document", or sentence from the two parts, to each topic:

In [None]:
for doc in corpus_lsi:
    print(doc)

As may be expected, we see a stronger association between the first 10 sentences and topic 1, and a stronger association of the second ten sentences and topic 2. These are, after all, from completely different parts of his travels.

# 4) Clustering

Clustering is not the same as topic modeling, although clustering can yield topics. Clustering is a more general approach to grouping and visualizing data based on their similarity. If you only want to determine topics, the above approach will be more accurate; if you are looking for spatial relations, clustering may show this better. Aside from using different algorithms, topic modeling looks for the influential words that characterize a document, whereas clustering attempts to group whole documents without seeking any specific word.

We first need to prepare the data. Let's treat each paragraph of *Gulliver's Travels* as its own document, which means we first need to fix some carriage returns. We'll also create names for each paragraph based on their order:

In [None]:
gparas = gtravels.replace("\r\n"," ").split("\n")
gparas = [x for x in gparas if len(x) > 0]
gnames = [str(x) + " paragraph" for x in range(1, len(gparas)+1)]

Now we define functions in order to collect regular tokenized words, and stemmed words:

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if re.search('[a-zA-Z]', x) and x != "'s"]
    return filtered_tokens

def tokenize_and_stem(text):
    stems = [stemmer.stem(x) for x in tokenize_only(text)]
    return stems

Now we can collect these from our paragraphs. This is only necessary to map out our topics afterwards:

In [None]:
totalvocab_stemmed = []
totalvocab_tokenized = []

for i in gparas:
    totalvocab_stemmed.extend(tokenize_and_stem(i))
    totalvocab_tokenized.extend(tokenize_only(i))

Our data frame will map tokenized words to stemmed words:

In [None]:
import pandas as pd

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print (vocab_frame.shape[0])
print (vocab_frame)

Similar to topic modelling, we'll make a tfidf matrix here too:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, max_features=200000,
                                 min_df=0.1, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,4))

tfidf_matrix = tfidf_vectorizer.fit_transform(gparas) #fit the vectorizer to our paragraphs

print(tfidf_matrix.shape)

Then we need the words from the vector:

In [None]:
terms = tfidf_vectorizer.get_feature_names_out() #on older scikit-learn versions use get_feature_names()

In order to plot our clusters, we'll want to calculate the distance via cosine similarity:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
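
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, and the distance used above is one minus that value. A minimal plain-Python sketch, with function names and vectors invented for this example:

```python
import math

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_dist(a, b):
    """Distance as used for plotting: 0 for same direction, 1 for unrelated."""
    return 1 - cosine_sim(a, b)

print(cosine_dist([1, 2], [2, 4]))  # same direction, different length
print(cosine_dist([1, 0], [0, 1]))  # no terms in common
```

Because only the direction of the vectors matters, a long paragraph and a short one that use the same words in the same proportions come out as close to each other.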

Now we'll start the actual clustering:

In [None]:
from sklearn.cluster import KMeans

num_clusters = 4
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
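
The core loop of k-means is short enough to sketch by hand: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat. The version below is a bare-bones illustration with invented points and fixed starting centroids; scikit-learn's `KMeans` chooses its own starting centroids and is far more careful, so this is not a substitute for it:

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared distance to the point."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def kmeans(points, start, rounds=10):
    centroids = [list(c) for c in start]  # copy so the input is not modified
    for _ in range(rounds):
        # assignment step: label each point with its nearest centroid
        labels = [nearest(p, centroids) for p in points]
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centers = kmeans(points, start=[[0.0, 0.0], [5.0, 5.0]])
print(labels, centers)
```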

We can now do a form of topic modelling by printing the words characterizing the clusters we made. These are the words closest to the centroid of each cluster:

In [None]:
print("Top terms per cluster:")
print()

#for each cluster, sort term indices by descending weight in the cluster centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

tolabel = []

for i in range(num_clusters):
    indvlabel = []
    print("Cluster %d words:" % i, end='')

    for ind in order_centroids[i, :7]: #replace 7 with n words per cluster
        #look up an unstemmed word for the stemmed term (.loc replaces the removed .ix)
        word = vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0]
        print(' %s' % word, end=',')
        indvlabel.append(word)
    tolabel.append(indvlabel)
    print('\n\n') #add whitespace


Multidimensional scaling (MDS) down to two dimensions must be applied for plotting:

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# reduce to two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

xs, ys = pos[:, 0], pos[:, 1]

Define colors and labels for plot:

In [None]:
import random

cluster_colors = {}
cluster_names ={}

for i in range(num_clusters):
    cluster_colors[i] = "#%06x" % random.randint(0, 0xFFFFFF) #random hexadecimal color
    cluster_names[i] = ' '.join(tolabel[i][:3])

Plot:

In [None]:
#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=gnames)) 

#group by cluster
groups = df.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling

#iterate through groups to layer the plot
#note that I use the cluster_name and cluster_color dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, 
            label=cluster_names[name], color=cluster_colors[name], 
            mec='none')
    ax.set_aspect('auto')
    ax.tick_params(
        axis='x',           # changes apply to the x-axis
        which='both',       # both major and minor ticks are affected
        bottom=False,       # ticks along the bottom edge are off
        top=False,          # ticks along the top edge are off
        labelbottom=False)  # labels along the bottom edge are off
    ax.tick_params(
        axis='y',           # changes apply to the y-axis
        which='both',       # both major and minor ticks are affected
        left=False,         # ticks along the left edge are off
        right=False,        # ticks along the right edge are off
        labelleft=False)    # labels along the left edge are off
    
ax.legend(numpoints=1)  #show legend with only 1 point

#add label in x,y position with the label as the paragraph name
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=8)

#uncomment the below to save the plot if need be
#(save before plt.show(), as saving afterwards can produce a blank image)
#plt.savefig('clusters_small_noaxes.png', dpi=200)

plt.show() #show the plot