# Text Analysis Workshop

Today's workshop will address various concepts in text analysis, primarily through the use of NLTK and scikit-learn. A fundamental understanding of Python is necessary. We will cover:

1. Preparing your own corpus
2. Tagging and Chunking
3. Clustering

You will need:

* NLTK ( \$ pip install nltk)
* Brown corpus from NLTK ( >>> nltk.download() )
* BeautifulSoup ( \$ pip install beautifulsoup4)
* scikit-learn ( \$ pip install scikit-learn)
* pandas ( \$ pip install pandas)
* matplotlib ( \$ pip install matplotlib)

This workshop will further help to solidify your understanding of regex, list comprehensions, and plotting.

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ . The guide for BeautifulSoup is here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ . The clustering section is modified from http://brandonrose.org/clustering . For further explanation of grammars or topic modeling on the low-level, see *Data Science from Scratch*: http://shop.oreilly.com/product/0636920033400.do .

# 1) Preparing your own corpus

We are going to take Jonathan Swift's *Gulliver's Travels* from archive.org to use as our text throughout today's workshop. Although we will also use pre-made corpora to explore more robust options, it is useful to know how to clean your own text files, create and declare your own corpus, and run analyses on it, so we will start from scratch.

## String manipulation and cleaning

Let's first use Beautiful Soup to grab only the text. There are packages that exist to clean texts from standard sites such as a Gutenberg package for gutenberg.org, but today we'll clean it as best we can manually:

In [None]:
import urllib.request
from bs4 import BeautifulSoup

url = "http://tinyurl.com/grgoxp9"#"https://ia801404.us.archive.org/2/items/gulliverstravels17157gut/17157-h/17157-h.htm"

f = urllib.request.urlopen(url)
html = f.read()

#clean and extract only raw text (the "lxml" parser needs: pip install lxml)
soup = BeautifulSoup(html, "lxml")
rawtext = soup.get_text()

#slice at beginning and end of book
beginning = "My father had"
end = "of my unfortunate voyages."
gtravels = rawtext[rawtext.find(beginning):rawtext.find(end)+len(end)]

print (gtravels)

You'll notice there are still page numbers and chapter headings in our text, and you might have other pieces you want to clean. Recalling your regex work from Part 4 of the intro series, how can we get rid of all the page numbers within brackets?

In [None]:
import re

#regex for page numbers in brackets (raw strings keep the backslashes intact)
gtravels = re.sub(r"\[[0-9]+\]", "", gtravels)

#regex to remove an all-caps word followed by a Roman numeral I-VIII and a period (only 8 chapters)
gtravels = re.sub(r"([A-Z]+ (I?V|V?I{1,3})\.)", "", gtravels)

print (gtravels)
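To see what the two substitutions do, here is a minimal sketch on a made-up sample line (the sample text is an assumption, not a quote from the book). The final whitespace clean-up is an extra step that the workshop code does not perform.

```python
import re

# Hypothetical sample line with a chapter heading and a bracketed page number
sample = "CHAPTER II. The emperor of Lilliput [23] attended by several of the nobility."

# Same patterns as above, written as raw strings
no_pages = re.sub(r"\[[0-9]+\]", "", sample)
no_headings = re.sub(r"[A-Z]+ (I?V|V?I{1,3})\.", "", no_pages)

# Extra step (not in the workshop code): collapse the leftover double spaces
cleaned = re.sub(r"\s+", " ", no_headings).strip()
print(cleaned)
```

Note that the heading pattern only knows the numerals I through VIII, so a heading such as "CHAPTER IX." would survive.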

Let's save this text so we can read it in the corpus later:

In [None]:
import codecs
with codecs.open("gulliver.txt", "w","utf-8") as f:
    f.write(gtravels)

## Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus.

In [None]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "" #rel. path
my_texts = PlaintextCorpusReader(corpus_root, r'.*\.txt') #every file ending in .txt

We now have a text corpus, on which we can run all the basic methods you learned in the introductory sequence. To list all the files in our corpus:

In [None]:
my_texts.fileids()

We can also extract either all the words or all the sentences in list format:

In [None]:
my_texts.words('gulliver.txt')

In [None]:
gsents = my_texts.sents('gulliver.txt')
print (gsents)

We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have gathered some more information.

# 2) Tagging

There are many situations in which "tagging" words (or really anything) is useful, whether to calculate trends or to support further text analysis that extracts meaning. We will cover three methods of tagging: simple regex, n-gram, and Brill transformation-based tagging. HMMs, CRFs, and neural networks will not be covered today, but they will be mentioned briefly as additional machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging is simply labeling a word to a specific category via a tuple.

Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS as a feature in your model.

## On a low-level

Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:

My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period

*NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.*

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to useful form for Python?

In [None]:
from nltk.tag import str2tuple

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]

print (tagged_sent)
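Under the hood this is just string splitting. The sketch below is a stand-in written for illustration, not NLTK's own code; the helper name split_tagged_token is hypothetical. It splits on the last slash so that words containing a slash survive, and it upper-cases the tag, which is what str2tuple appears to do as well (treat that detail as an assumption and check the output of the cell above).

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' on the last separator; return (word, None) if there is none."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag.upper())

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
pairs = [split_tagged_token(t) for t in line.split()]
print(pairs)
```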

Further analysis of tags with NLTK requires a *list* of sentences; otherwise you will get an index error. So let's add a couple more sentences.

In [None]:
lines = [line, "He/Pronoun likes/Verb Python/Noun ./Period", "Do/Verb you/Pronoun like/Verb Python/Noun ?/Question_Mark"]

tagged_sents = [[str2tuple(t) for t in line.split()] for line in lines]

print (tagged_sents, len(tagged_sents))

## Working with a tagged corpus

Now that we know how tagging works, let's import a tagged corpus from the NLTK database and see what we can do.

In [None]:
from nltk.corpus import brown #if you don't have this downloaded, type nltk.download()
brown.tagged_words(tagset='universal')

*NB: the option "universal" simplifies the tagset. Much more precise tags do exist for the linguists in the room.*

Let's find the most frequent parts of speech in the corpus:

In [None]:
import nltk

brown_news_tagged = brown.tagged_words(categories='news') #not universal tagset
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
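A frequency distribution is essentially a counter. As a rough illustration, the same tally can be made with the standard library on a handful of made-up (word, tag) pairs (the toy data is an assumption, not taken from the Brown corpus):

```python
from collections import Counter

# Hypothetical tagged words in the same (word, tag) shape that NLTK returns
toy_tagged = [("The", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
              ("election", "NN"), ("was", "BEDZ"), ("fair", "JJ"), (".", ".")]

# Count the tags only, ignoring the words
tag_counts = Counter(tag for (word, tag) in toy_tagged)
print(tag_counts.most_common(3))
```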

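These tags are terse. The Brown corpus uses its own Brown tagset, which you can look up in NLTK. Many other tools follow the closely related Penn Treebank conventions: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
nltk.help.brown_tagset() #tags used in the Brown corpus; needs the "tagsets" resource from nltk.download()
nltk.help.upenn_tagset() #Penn Treebank tags, for comparison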

We can also find out what the most common nouns are. For the linguists: there are many subgroups of nouns, so let's see what we can get:

In [None]:
def find_tags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = find_tags('NN', brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
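A conditional frequency distribution is one counter per condition. The sketch below rebuilds the idea with a dictionary of counters on made-up data; find_tags_sketch is a hypothetical name, and the toy word list is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical tagged words
toy_tagged = [("year", "NN"), ("years", "NNS"), ("year", "NN"), ("time", "NN"),
              ("Mr.", "NP"), ("members", "NNS"), ("said", "VBD")]

def find_tags_sketch(tag_prefix, tagged_text, n=5):
    """For every tag starting with tag_prefix, list its n most common words."""
    by_tag = defaultdict(Counter)  # condition (tag) -> counter of words
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            by_tag[tag][word] += 1
    return {tag: counts.most_common(n) for tag, counts in by_tag.items()}

noun_table = find_tags_sketch("NN", toy_tagged)
for tag in sorted(noun_table):
    print(tag, noun_table[tag])
```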

We can also look at the linguistic environment words appear in. The cell below lists all the words following "President":

In [None]:
brown_news_text = brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(brown_news_text) if a == 'President'))

It might be useful to see just the tags of those words:

In [None]:
tags = [b[1] for (a, b) in nltk.bigrams(brown_news_tagged) if a[0] == 'President']
fd = nltk.FreqDist(tags)
fd.tabulate()
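Bigrams are simply each item paired with its successor. As a small illustration on made-up data (the sentence fragments are assumptions), the same lookup can be written with zip:

```python
from collections import Counter

# Hypothetical tagged words
toy_tagged = [("President", "NN-TL"), ("Kennedy", "NP"), ("said", "VBD"),
              ("President", "NN-TL"), ("of", "IN"), ("the", "AT"),
              ("President", "NN-TL"), ("Eisenhower", "NP")]

# Pair every (word, tag) tuple with the one that follows it
pairs = list(zip(toy_tagged, toy_tagged[1:]))

# Tally the tag of whatever follows the word "President"
following_tags = Counter(b[1] for (a, b) in pairs if a[0] == "President")
print(following_tags)
```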

## Automatic Tagging

Now that we know some things we can do with a tagged corpus, how can we tag our own corpus? We will work through regex models, n-gram models, and discuss a couple more advanced models.

### Regex Tagger

Let's write a simple regex tagger with five tags. First we need to define the pattern for each tag:

In [None]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'.*', 'NN')                     # nouns (default)
 ]

Now we build the tagger and we can test it on the first sentence of our *Gulliver's Travels*.

In [None]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(gsents[0])

That didn't work so well, but no worries: this was a very naïve attempt. We can evaluate the accuracy nonetheless (newer NLTK releases deprecate evaluate() in favor of accuracy(), which takes the same argument):

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger.evaluate(brown_tagged_sents)
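To make the mechanics concrete, here is a minimal pure-Python sketch of a regex tagger and its accuracy score. It is a simplified stand-in, not NLTK's implementation: the first pattern that matches wins, and accuracy is the share of tokens whose guessed tag equals the gold tag. The gold sentence is made up for illustration.

```python
import re

patterns = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ed$", "VBD"),    # simple past
    (r".*'s$", "NN$"),    # possessive nouns
    (r".*s$", "NNS"),     # plural nouns
    (r".*", "NN"),        # nouns (default)
]
compiled = [(re.compile(p), tag) for p, tag in patterns]

def regex_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    tagged = []
    for word in words:
        for regex, tag in compiled:
            if regex.match(word):
                tagged.append((word, tag))
                break
    return tagged

def accuracy(tagger, gold_sents):
    """Share of tokens for which the tagger reproduces the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, guess), (_, gold) in zip(tagger(words), sent):
            correct += guess == gold
            total += 1
    return correct / total

# Hypothetical gold-standard sentence
gold = [[("Gulliver's", "NN$"), ("ships", "NNS"), ("sailed", "VBD"),
         ("this", "DT"), ("morning", "NN")]]
print(regex_tag([w for w, _ in gold[0]]))
print(accuracy(regex_tag, gold))
```

The two misses ("this" tagged as a plural noun, "morning" as a gerund) show why suffix patterns alone are too crude.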

### N-Gram Tagging

N-gram tagging looks at a word and the tags of the *n*-1 preceding words to determine the best tag for that word. N-gram taggers and other machine learning models that train on labeled data are called "supervised", because the correct answers are known for the data you give them. This also means that we must divide the data into training and testing sets: if you test your model on the same data it was trained on, the accuracy estimate will be overly optimistic. Originally, a 90-10 divide was recommended, but the standard has since shifted to k-fold cross-validation, usually with 10 folds.

In [None]:
#divide tagged data
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#train bigram tagger
bigram_tagger = nltk.BigramTagger(train_sents)
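The k-fold idea mentioned above can be sketched in a few lines. This is one possible way to cut the folds (interleaved rather than contiguous), and k_fold_splits is a hypothetical helper, not an NLTK or scikit-learn function. In practice you would train a tagger on each train part, score it on the matching test part, and average the scores.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; every item lands in exactly one test fold."""
    items = list(items)
    for fold in range(k):
        test = items[fold::k]  # every k-th item, starting at position `fold`
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, test

# Hypothetical stand-ins for tagged sentences
toy_sents = ["sent%d" % i for i in range(25)]
folds = list(k_fold_splits(toy_sents, k=5))

for train, test in folds:
    print(len(train), len(test))
```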

We can now try this tagger on that sentence again:

In [None]:
bigram_tagger.tag(gsents[0])

All of the "None" tags mean the tagger didn't know how to tag those words because the model was insufficient: once it encounters a context it has never seen, it assigns None, and every following word is then untaggable as well. To fix this we have to implement backoff tagging, or cascading taggers:

In [None]:
t1 = nltk.RegexpTagger(patterns)
t2 = nltk.UnigramTagger(train_sents, backoff=t1)
t3 = nltk.BigramTagger(train_sents, backoff=t2)

Now let's try to tag that sentence again:

In [None]:
t3.tag(gsents[0])

In [None]:
t3.evaluate(test_sents)
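The cascade can be pictured as a chain of lookups. The sketch below is a simplified stand-in for the idea, not NLTK's code: it first looks up the pair (previous tag, word), then falls back to the word alone, then to a default tag. The three training sentences are made up.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences
train = [
    [("the", "AT"), ("ship", "NN"), ("sails", "VBZ")],
    [("the", "AT"), ("sails", "NNS"), ("tore", "VBD")],
    [("she", "PPS"), ("sails", "VBZ")],
]

unigram_counts = defaultdict(Counter)  # word -> tag counts
bigram_counts = defaultdict(Counter)   # (previous tag, word) -> tag counts
for sent in train:
    prev_tag = None
    for word, tag in sent:
        unigram_counts[word][tag] += 1
        bigram_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def tag_with_backoff(words, default="NN"):
    """Try the bigram context first, then the word alone, then the default."""
    tagged, prev_tag = [], None
    for word in words:
        if (prev_tag, word) in bigram_counts:
            tag = bigram_counts[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram_counts:
            tag = unigram_counts[word].most_common(1)[0][0]
        else:
            tag = default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(tag_with_backoff(["the", "sails", "tore"]))  # bigram context picks NNS for "sails"
print(tag_with_backoff(["tore", "sails"]))         # unseen contexts fall back to unigram counts
print(tag_with_backoff(["Gulliver"]))              # unknown word falls back to the default
```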

### Transformation-based Brill Tagging

There are many different machine learning algorithms out there. The current "hot" choice is neural networks, but that is beyond the scope of this workshop. Let's look at a transformation-based tagger included in NLTK, which will help us understand how many machine learning models make decisions.

In [None]:
from nltk.tag.brill import *

def train_brill_tagger(tagged_sents):
    #train every stage on the sentences passed in, not on a global variable
    t1 = nltk.RegexpTagger(patterns)
    t2 = nltk.UnigramTagger(tagged_sents, backoff=t1)
    t3 = nltk.BigramTagger(tagged_sents, backoff=t2)
    Template._cleartemplates()
    templates = brill24() #or fntbl37
    t4 = nltk.tag.brill_trainer.BrillTaggerTrainer(t3, templates, trace=3)
    t4 = t4.train(tagged_sents, max_rules=100)
    
    return t4

#train on the training split only, so that test_sents stays unseen for evaluation
tagger = train_brill_tagger(train_sents)


We see that the Brill tagger corrects itself up to a certain threshold based on rules it generated from the data we gave it. Other machine learning models such as Conditional Random Fields (CRF) work in a similar way, in that you tell the model which features are important to look at, and it weights these features in writing its rules. Neural networks take a different approach that relies more on linear algebra and matrix multiplication. Libraries exist for easy implementation of neural nets, such as pybrain (http://pybrain.org) for general advanced modelling and nlpnet (http://nilc.icmc.usp.br/nlpnet/index.html) for POS tagging or SRL (Semantic Role Labeling).
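A single transformation rule is easy to write by hand. The sketch below applies one rule of the form "change tag A to B when the previous tag is C" and counts the errors before and after. The helper names and the toy sentence are assumptions for illustration; a Brill trainer generates many candidate rules from its templates, keeps the one that fixes the most errors on the training data, and repeats.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Rewrite from_tag as to_tag wherever the preceding token carries prev_tag."""
    original_tags = [tag for _, tag in tagged]
    result = []
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and i > 0 and original_tags[i - 1] == prev_tag:
            tag = to_tag
        result.append((word, tag))
    return result

def count_errors(tagged, gold):
    """Number of tokens whose tag differs from the gold tag."""
    return sum(t != g for (_, t), (_, g) in zip(tagged, gold))

# Hypothetical gold tags and a baseline tagging with one mistake
gold = [("I", "PPSS"), ("want", "VB"), ("to", "TO"), ("sail", "VB")]
baseline = [("I", "PPSS"), ("want", "VB"), ("to", "TO"), ("sail", "NN")]

corrected = apply_rule(baseline, from_tag="NN", to_tag="VB", prev_tag="TO")
print(count_errors(baseline, gold), count_errors(corrected, gold))
```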

So let's tag that sentence again with our Brill tagger:

In [None]:
gtagged_sent = tagger.tag(gsents[0])
print (gtagged_sent)

In [None]:
tagger.evaluate(test_sents)

Not bad! In developing machine learning models, you may want to know where the model is making errors. This can be done by examining the Confusion Matrix:

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent] #just grabbing a list of all the tags
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus] #notice we first untag the sentence

#compare on the held-out test sentences, which the tagger has not seen
gold = tag_list(test_sents)
test = tag_list(apply_tagger(tagger, test_sents))

cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=10))
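A confusion matrix is a tally of (gold tag, predicted tag) pairs. As a rough sketch on made-up tag lists (the data is an assumption), the same information can be collected with a counter:

```python
from collections import Counter

# Hypothetical gold and predicted tags for six tokens
gold = ["NN", "VB", "NN", "JJ", "NN", "VB"]
pred = ["NN", "NN", "NN", "JJ", "VB", "VB"]

# Each key is (gold tag, predicted tag); matching pairs sit on the diagonal
confusion = Counter(zip(gold, pred))

# Off-diagonal cells are the errors worth inspecting
errors = {pair: n for pair, n in confusion.items() if pair[0] != pair[1]}
accuracy = sum(n for (g, p), n in confusion.items() if g == p) / len(gold)

print(errors)
print(accuracy)
```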

### Pickling

If you want to save your model, or any complex variable in Python, you can use pickle:

In [None]:
from pickle import dump,load

with open("brilltagger.pkl", "wb") as f:
    dump (tagger, f, -1) #-1 calls for a more efficient binary protocol

In [None]:
with open('brilltagger.pkl', 'rb') as f:
    tagger = load(f)
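The same round trip can be tried in memory, without touching the disk. This sketch uses a small made-up dictionary in place of the tagger; passing -1 as the protocol, as above, is the same as pickle.HIGHEST_PROTOCOL.

```python
import pickle

# Hypothetical stand-in for a trained model
toy_model = {"patterns": [(r".*ing$", "VBG"), (r".*", "NN")], "default": "NN"}

blob = pickle.dumps(toy_model, protocol=pickle.HIGHEST_PROTOCOL)  # bytes
restored = pickle.loads(blob)                                     # an equal copy

print(type(blob), restored == toy_model)
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.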

## Chunking, grammars, and Named Entity Recognition

On a low linguistic level, you may want to map out a sentence visually based on parts of speech. This visualization is actually just a navigable data type, which can be used to mine statistics. We first have to define the grammar. We'll just define a noun phrase for English consisting of an optional determiner, article, cardinal number, or possessive pronoun, followed by any number of adjectives and a noun or pronoun. The other rules build prepositional phrases, verb phrases, and clauses on top of it. Defining the grammar is done similarly to writing regular expressions. We can then draw the map.

In [None]:
grammar = r"""
  NP: {<DT|AT|CD|PP\$>?<JJ>*<PPSS|NN.*>}       
  PP: {<IN><NP>}            
  VP: {<BEDZ|HVD|VB.*><AT>?<OD>?<NP|PP|CLAUSE>+} 
  CLAUSE: {<NP><VP>}        
  """
# | is "or", a following ? means optional, * is 0 or more, .* is anything following

cp = nltk.RegexpParser(grammar)

result = cp.parse(gtagged_sent)
result.draw()

In [None]:
print (result) #can be traversed using indexes, obviously searched as well
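To see why the grammar looks like a regular expression, here is a minimal sketch that chunks noun phrases with the re module alone. It writes the tags of a sentence as one string and matches the NP rule against it. This illustrates the idea under the assumption of a made-up tagged sentence; it is not how you would use NLTK in practice, and it only handles the NP rule.

```python
import re

# Hypothetical tagged sentence (Brown-style tags)
tagged = [("the", "AT"), ("little", "JJ"), ("ship", "NN"),
          ("left", "VBD"), ("two", "CD"), ("ports", "NNS")]

# Encode the tag sequence as "<AT><JJ><NN>..."
tag_string = "".join("<%s>" % tag for _, tag in tagged)

# The NP rule from the grammar above, translated to a plain regex over that string
np_pattern = re.compile(r"(<(DT|AT|CD|PP\$)>)?(<JJ>)*<(PPSS|NN[^>]*)>")

chunks = []
for m in np_pattern.finditer(tag_string):
    # Convert character offsets back to token positions by counting "<"
    start = tag_string[:m.start()].count("<")
    end = tag_string[:m.end()].count("<")
    chunks.append([word for word, _ in tagged[start:end]])

print(chunks)
```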

With this information, we can then train classifiers for Named Entity Recognition (NER), i.e. identifying people, places, and things. We won't go into detail today, but NLTK already has a trained classifier we can use off-the-shelf:

In [None]:
print (nltk.ne_chunk(gtagged_sent, binary=True))

# 3) Clustering

Clustering is not the same as topic modeling, although clustering can yield topics. Clustering is a more general approach to grouping and visualizing data based on their similarity. If you only want to determine topics, topic modeling (see the postscript) will be more accurate; if you are looking for spatial relations, clustering may show these better. Aside from using different algorithms, topic modeling looks for the influential words that characterize a document, whereas clustering groups whole documents without seeking specific words.

We first need to prepare the data. Let's treat each paragraph of *Gulliver's Travels* as its own document. To do so, we first need to fix some carriage returns. We'll also create names for the paragraphs based on their order:

In [None]:
gparas = gtravels.replace("\r\n"," ").split("\n")
gparas = [x for x in gparas if len(x) > 0]
gnames = [str(x) + " paragraph" for x in range(1, len(gparas)+1)] #names of paragraphs

Now we define functions in order to collect regular tokenized words, and stemmed words:

In [None]:
from nltk.stem.snowball import SnowballStemmer
from string import punctuation

stemmer = SnowballStemmer("english")

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if x != "'s" and x not in punctuation] #word tokenizer cuts the possessives
    return filtered_tokens

def tokenize_and_stem(text):
    stems = [stemmer.stem(x) for x in tokenize_only(text)]
    return stems

Now we can collect these from our paragraphs. This is only necessary to map out our topics afterwards:

In [None]:
totalvocab_stemmed = []
totalvocab_tokenized = []

for i in gparas:
    totalvocab_stemmed.extend(tokenize_and_stem(i))
    totalvocab_tokenized.extend(tokenize_only(i))

Our data frame will map tokenized words to stemmed words, recalling our work with pandas in Day 3:

In [None]:
import pandas as pd

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print (vocab_frame.shape[0])
print (vocab_frame)

We'll make a tf-idf (term frequency-inverse document frequency) matrix here too:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters: max_df is the maximum share of documents a term may occur in, min_df the minimum
#use .5 max because the book has fairly uniform content (this also takes care of stopwords); min_df=0.1 drops terms found in fewer than 10% of paragraphs
#use inverse document frequency
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=200000,
                                 min_df=0.1, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(gparas) #fit the vectorizer to paragraphs, turns word freq to numbers

print(tfidf_matrix.shape)
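To see what the numbers in such a matrix mean, here is a minimal sketch of the textbook tf-idf weight, tf × log(N / df), on three made-up "paragraphs". This is the plain textbook variant and not scikit-learn's exact formula: TfidfVectorizer by default uses raw counts, a smoothed idf, and normalizes each row, so its values will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized paragraphs
docs = [
    ["ship", "storm", "ship"],
    ["emperor", "palace"],
    ["ship", "emperor"],
]

n_docs = len(docs)
# Document frequency: in how many paragraphs does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook weight: relative term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    return {term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

In the first paragraph "storm" ends up with a higher weight than "ship", even though "ship" occurs twice, because "ship" also appears in another paragraph.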

Then we need the terms from the vectorizer. These are essentially the most influential words, which we will eventually assign to clusters.

In [None]:
terms = tfidf_vectorizer.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2

In order to plot our clusters in a 2D plane, we'll want to calculate the distance between any two given paragraphs via cosine similarity:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
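Cosine similarity compares the direction of two vectors and ignores their length. A minimal sketch for two plain lists follows; the vectors are made up, and the function assumes neither vector is all zeros.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Hypothetical tf-idf rows for three paragraphs
para_a = [1.0, 2.0, 0.0]
para_b = [2.0, 4.0, 0.0]   # same direction as para_a, just longer
para_c = [0.0, 0.0, 3.0]   # shares no terms with para_a

print(cosine_distance(para_a, para_b))  # close to 0: same mix of terms
print(cosine_distance(para_a, para_c))  # 1.0: nothing in common
```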

Now we'll start the actual clustering with k-means. The algorithm assigns each observation to the cluster whose mean yields the least within-cluster sum of squares, i.e. the nearest mean, and then recomputes each mean from its members. This repeats until the means no longer change.

In [None]:
from sklearn.cluster import KMeans

cnum = 4
km = KMeans(n_clusters=cnum) #number of clusters chosen with the elbow method
km.fit(tfidf_matrix) #fits the data, runs the algorithm
clusters = km.labels_.tolist() #assigns each paragraph to the respective cluster
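The assign-then-update loop described above can be sketched in plain Python for 2-D points. This is a bare-bones illustration with fixed starting centroids and made-up points; scikit-learn's KMeans chooses its starting points more carefully and is what you should use on real data.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it was
        if new_centroids == centroids:  # the means stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```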

We can now do a form of topic modelling by printing the words that characterize the clusters we made. These are the terms with the highest weights in each cluster's centroid, looked up in the vocab data frame to turn each stem back into a full word:

In [None]:
#for each cluster, sort the term indices by their weight in the centroid, largest first
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

cents_words = [] #to collect words for chart legend

for i in range(cnum): #number of clusters
    cent = []
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :7]: #ind is a term index; replace 7 with the number of words to show per cluster
        a = vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0] #look up the stem(s) in the data frame (.ix was removed from pandas)
        cent.append(a)
        
        print(' %s' % a, end=',')
        
    cents_words.append(cent)
    print ()
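The argsort step is the least obvious part of the cell above. For a single centroid it boils down to ranking term positions by weight, largest first, as this small sketch with made-up weights shows:

```python
# Hypothetical centroid: one weight per term, in the same order as the term list
centroid = [0.05, 0.40, 0.00, 0.25]
toy_terms = ["island", "emperor", "ship", "horse"]

# Positions sorted by weight, largest first (what argsort()[::-1] gives for one row)
order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
top_terms = [toy_terms[i] for i in order[:2]]

print(order)
print(top_terms)
```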


*NB: For comparison, see what you get with topic modelling in gensim*

Multidimensional scaling (MDS) must be applied to reduce the distances to two dimensions for plotting:

In [None]:
from sklearn.manifold import MDS

# use two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state = 1)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components), based on distances

xs, ys = pos[:, 0], pos[:, 1] #grabs x and y coordinates from pos (numpy array)

Define colors and labels for plot:

In [None]:
import random

cluster_colors = {}
cluster_names ={}

for i in range(cnum): # for each cluster
    cluster_colors[i] = "#%06x" % random.randint(0, 0xFFFFFF) #random hexadecimal color
    cluster_names[i] = ' '.join(cents_words[i][:4])

Plot:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=gnames)) 

#group by cluster
groups = df.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size, subplots yields a tuple of figure and axes, hence the two assignments
ax.margins(0.1) # Optional, just adds 10% padding to the autoscaling

#iterate through groups to layer the plot
#note that I use the cluster_name and cluster_color dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, #marker size
            label=cluster_names[name], color=cluster_colors[name], 
            mec='none') #'marker edge color'
    
ax.legend(numpoints=1)  #show legend with only 1 point

#add label in x,y position with the label as the paragraph number
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=8) #.ix was removed from pandas

#uncomment the line below to save the plot; save before plt.show(), which may clear the figure
#plt.savefig('clusters_small_noaxes.png', dpi=200)

plt.show() #show the plot

# Postscript: comparing cluster results to LDA

If you don't have gensim installed, first: \$ pip install gensim .



In [None]:
from gensim import corpora, models, similarities 
from nltk.corpus import stopwords
from string import punctuation

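stops = set(stopwords.words("english")) #build the stop word set once; needs the "stopwords" resource from nltk.download()

paras = [[x.lower() for x in i.split()] for i in gparas]
paras = [[x for x in i if x not in stops and x not in punctuation] for i in paras]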

#create a Gensim dictionary from the texts
dictionary = corpora.Dictionary(paras)

#convert the dictionary to a bag of words corpus for reference
corpus = [dictionary.doc2bow(i) for i in paras]

In [None]:
#we run chunks of 10 paras, update the model once after every chunk, and make 1 pass over the corpus
lda = models.LdaModel(corpus, num_topics=4, 
                            update_every=1,
                            id2word=dictionary, 
                            chunksize=10, 
                            passes=1)

lda.show_topics()

For more with gensim, see the tutorials here: https://radimrehurek.com/gensim/tutorial.html