# AdHeavy Work

Today's workshop will address various concepts in Natural Language Processing, primarily through the use of NLTK. A fundamental understanding of Python is necessary. We will cover:

1. Preparing your own corpus
2. Tagging
3. Chunking
4. Document classification

You will need:

* NLTK ( \$ pip install nltk)
* Brown corpus from NLTK ( >>> nltk.download() )
* Punkt tokenizer from NLTK ( >>> nltk.download() )
* Movie reviews corpus from NLTK ( >>> nltk.download() )
* BeautifulSoup ( \$ pip install beautifulsoup4)

This workshop will also help solidify your understanding of regex and list comprehensions.

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ . The guide for BeautifulSoup is here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ . For further explanation of grammars see *Data Science from Scratch*: http://shop.oreilly.com/product/0636920033400.do .

# 1) Preparing your own corpus

We are going to take Jonathan Swift's *Gulliver's Travels* from archive.org to use as our text throughout today's workshop. Although we will utilize pre-made corpora to explore more robust options, it is useful to know how to clean your own text files you may have, create your own corpus, declare it properly, and run analyses, so we will start from scratch.

## String manipulation and cleaning

Let's first use Beautiful Soup to grab only the text. There are packages that exist to clean texts from standard sites such as a Gutenberg package for gutenberg.org, but today we'll clean it as best we can manually:

In [1]:
import requests
from bs4 import BeautifulSoup

# Gulliver's Travels (HTML edition) on archive.org
url = "https://ia801404.us.archive.org/2/items/gulliverstravels17157gut/17157-h/17157-h.htm"

f = requests.get(url)
html = f.content

print(html[:500])  # preview the first 500 bytes rather than the whole page



Create bs object and trim:

In [14]:
# parse the page ("lxml" also works as the parser if it is installed)
bspage = BeautifulSoup(html, "html.parser")

# collect every absolute (http/https) link on the page
httplinks = []
for link in bspage.find_all('a'):
    href = link.get('href')
    if href and href.startswith('http'):
        httplinks.append(href)

# equivalent list comprehension:
# httplinks = [a.get('href') for a in bspage.find_all('a') if str(a.get('href')).startswith('http')]

print(httplinks)
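The link and text extraction can be tried offline on a small inline page. This is only a sketch: the HTML string and the example.org URL are made up for illustration, and only `find_all` and `get_text` from Beautiful Soup are used.

```python
from bs4 import BeautifulSoup

# Made-up page with one absolute link and one in-page anchor
html = """<html><body>
<h1>Sample page</h1>
<p>My father had a <a href="http://example.org/estate">small estate</a>.</p>
<a href="#top">Back to top</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags that have no href attribute at all
http_links = [a["href"] for a in soup.find_all("a", href=True)
              if a["href"].startswith("http")]

# get_text() drops every tag and keeps only the text nodes
text = soup.get_text()

print(http_links)
print(text)
```

Filtering on `href=True` avoids the `str(...)` workaround for anchors without a link target.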



In [4]:
# parse the page and extract only the raw text
bspage = BeautifulSoup(html, "html.parser")
rawtext = bspage.get_text()

#slice at beginning and end of book
beginning = "My father had"
end = "of my unfortunate voyages."
gtravels = rawtext[rawtext.find(beginning):rawtext.find(end)+len(end)]

print (gtravels)

My father had a small estate in Nottinghamshire; I was the third of five
sons. He sent me to Emmanuel College in Cambridge at fourteen years old,
where I resided three years, and applied myself close to my studies;
but the charge of maintaining me, although I had a very scanty
allowance, being too great for a narrow fortune, I was bound apprentice
to Mr. James Bates, an eminent surgeon in London, with whom I continued
four years; and my father now and then sending me small sums of money, I
laid them out in learning navigation, and other parts of the mathematics
useful to those who intend to travel, as I always believed it would be,
some time or other, my fortune to do. When I left Mr. Bates, I went down
to my father, where, by the assistance of him, and my uncle John and
some other relations, I got forty pounds,[2] and a promise of thirty
pounds a year, to maintain me at Leyden. There I studied physic two
years and seven months, knowing it would be useful in long voyages.
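One caveat with slicing on `find`: it returns -1 when a marker is missing, which silently produces a wrong slice. A minimal sketch of a stricter version follows; `slice_between` is a hypothetical helper, not part of any library, and unlike the cell above it searches for the end marker only after the start marker.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from start_marker through end_marker, inclusive."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    # look for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found")
    return text[start:end + len(end_marker)]

# Made-up page text with a header and a footer around the body
page = "Project header ... My father had a small estate. THE END. License footer"
body = slice_between(page, "My father had", "THE END.")
print(body)
```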


You'll notice there are still page numbers and chapter headings in our text, and you might have other pieces you want to clean. Recalling your regex work from Day 3 of the intro series, how can we get rid of all the page numbers within brackets?

In [5]:
import re

# remove bracketed numbers such as [2]: one or more digits inside square brackets
gtravels = re.sub(r"\[[0-9]+\]", "", gtravels)

# remove chapter headings: an all-caps word (e.g. CHAPTER) followed by
# a Roman numeral from I to VIII (there are only 8 chapters) and a period
gtravels = re.sub(r"[A-Z]+ (I?V|V?I{1,3})\.", "", gtravels)

print (gtravels)

My father had a small estate in Nottinghamshire; I was the third of five
sons. He sent me to Emmanuel College in Cambridge at fourteen years old,
where I resided three years, and applied myself close to my studies;
but the charge of maintaining me, although I had a very scanty
allowance, being too great for a narrow fortune, I was bound apprentice
to Mr. James Bates, an eminent surgeon in London, with whom I continued
four years; and my father now and then sending me small sums of money, I
laid them out in learning navigation, and other parts of the mathematics
useful to those who intend to travel, as I always believed it would be,
some time or other, my fortune to do. When I left Mr. Bates, I went down
to my father, where, by the assistance of him, and my uncle John and
some other relations, I got forty pounds, and a promise of thirty
pounds a year, to maintain me at Leyden. There I studied physic two
years and seven months, knowing it would be useful in long voyages.
...
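To see what each substitution does in isolation, here is a small sketch on a made-up sample string (the sample is an assumption; the two patterns are the ones used in the cell above).

```python
import re

# Made-up sample: one chapter heading and one bracketed number
sample = "CHAPTER II. I got forty pounds,[2] and a promise of thirty pounds a year."

# one or more digits inside literal square brackets
no_markers = re.sub(r"\[[0-9]+\]", "", sample)

# an all-caps word, a space, a Roman numeral I to VIII, and a period
no_headings = re.sub(r"[A-Z]+ (I?V|V?I{1,3})\.", "", no_markers)

cleaned = no_headings.strip()
print(cleaned)
```

Note that the heading pattern matches any all-caps word followed by a numeral and a period, so it is worth spot-checking the output for ordinary sentences that happen to fit.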

Let's save this text so we can read it in the corpus later:

In [6]:
import codecs
with codecs.open("gulliver.txt", "w","utf-8") as f:
    f.write(gtravels)

## Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus.

In [1]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "."  # relative path to the folder holding gulliver.txt
my_texts = PlaintextCorpusReader(corpus_root, r'.*\.txt')

We now have a text corpus, on which we can run all the basic methods you learned in the introductory sequence. To list all the files in our corpus:

In [2]:
my_texts.fileids()

['gulliver.txt']

We can also extract either all the words or all the sentences in list format:

In [4]:
my_texts.words('gulliver.txt')

['My', 'father', 'had', 'a', 'small', 'estate', 'in', ...]
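A self-contained way to experiment with the corpus reader is to point it at a temporary folder. The sketch below assumes made-up file names and contents; it only calls `fileids()` and `words()`, which do not need the Punkt data.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as corpus_root:
    # one file that matches the pattern and one that does not
    with open(os.path.join(corpus_root, "sample.txt"), "w", encoding="utf-8") as f:
        f.write("My father had a small estate.")
    with open(os.path.join(corpus_root, "notes.md"), "w", encoding="utf-8") as f:
        f.write("not part of the corpus")

    reader = PlaintextCorpusReader(corpus_root, r".*\.txt")
    file_ids = reader.fileids()
    # words() is lazy, so materialize it while the folder still exists
    tokens = list(reader.words("sample.txt"))

print(file_ids)
print(tokens)
```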

In [5]:
gsents = my_texts.sents('gulliver.txt')
print (gsents)

[['My', 'father', 'had', 'a', 'small', 'estate', 'in', 'Nottinghamshire', ';', 'I', 'was', 'the', 'third', 'of', 'five', 'sons', '.'], ['He', 'sent', 'me', 'to', 'Emmanuel', 'College', 'in', 'Cambridge', 'at', 'fourteen', 'years', 'old', ',', 'where', 'I', 'resided', 'three', 'years', ',', 'and', 'applied', 'myself', 'close', 'to', 'my', 'studies', ';', 'but', 'the', 'charge', 'of', 'maintaining', 'me', ',', 'although', 'I', 'had', 'a', 'very', 'scanty', 'allowance', ',', 'being', 'too', 'great', 'for', 'a', 'narrow', 'fortune', ',', 'I', 'was', 'bound', 'apprentice', 'to', 'Mr', '.', 'James', 'Bates', ',', 'an', 'eminent', 'surgeon', 'in', 'London', ',', 'with', 'whom', 'I', 'continued', 'four', 'years', ';', 'and', 'my', 'father', 'now', 'and', 'then', 'sending', 'me', 'small', 'sums', 'of', 'money', ',', 'I', 'laid', 'them', 'out', 'in', 'learning', 'navigation', ',', 'and', 'other', 'parts', 'of', 'the', 'mathematics', 'useful', 'to', 'those', 'who', 'intend', 'to', 'travel', ',', 

We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have more information.

# 2) Tagging

There are many situations in which "tagging" words (or really anything) is useful, whether to calculate trends or to support further text analysis that extracts meaning. We will cover three methods of tagging: simple regex, n-gram, and Brill transformation-based tagging. HMMs, CRFs, and neural networks will not be covered today, but they will be briefly mentioned as additional machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging simply means assigning each word a category label, stored as a (word, tag) tuple.

Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS as a feature in your model.

## On a low-level

Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:

My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period

*NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.*

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to a useful form for Python?

In [6]:
from nltk.tag import str2tuple

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]

print (tagged_sent)

[('My', 'POSSESSIVE_PRONOUN'), ('name', 'NOUN'), ('is', 'VERB'), ('Chris', 'PROPER_NOUN'), ('.', 'PERIOD')]


Further analysis of tags with NLTK requires a *list* of sentences, otherwise you will get an index error on higher level methods.
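A minimal sketch of building such a list of sentences from slash-annotated text, assuming one sentence per line (the two sentences and their Brown-style tags are made up for illustration):

```python
from nltk.tag import str2tuple

# Two made-up annotated sentences, one per line
annotated = """My/PP$ name/NN is/BEZ Chris/NP ./.
I/PPSS like/VB corpora/NNS ./."""

# outer list: sentences; inner lists: (word, tag) tuples
tagged_sents = [[str2tuple(tok) for tok in line.split()]
                for line in annotated.splitlines()]

for sent in tagged_sents:
    print(sent)
```

`str2tuple` splits on the last slash, so a token such as `./.` still comes out as `('.', '.')`.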

Naturally, these tags are a bit verbose. The standard tagging conventions follow the Penn Treebank (more in a second): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

## Working with a tagged corpus

Now that we know how tagging works, let's import a tagged corpus from the NLTK database and see what we can do.

In [7]:
from nltk.corpus import brown #if you don't have this downloaded, type nltk.download()
brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

*NB: the argument tagset = "universal" simplifies the tagset.*

Let's find the most frequent parts of speech in the corpus:

In [8]:
import nltk

brown_news_tagged = brown.tagged_words(categories='news')  # full Brown tagset, not the universal one
# frequency distribution over the tags (e.g. AT = article, NN = singular noun)
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

[('NN', 13162),
 ('IN', 10616),
 ('AT', 8893),
 ('NP', 6866),
 (',', 5133),
 ('NNS', 5066),
 ('.', 4452),
 ('JJ', 4392),
 ('CC', 2664),
 ('VBD', 2524),
 ('NN-TL', 2486),
 ('VB', 2440),
 ('VBN', 2269),
 ('RB', 2166),
 ('CD', 2020),
 ('CS', 1509),
 ('VBG', 1398),
 ('TO', 1237),
 ('PPS', 1056),
 ('PP$', 1051),
 ('MD', 1031),
 ('AP', 923),
 ('NP-TL', 741),
 ('``', 732),
 ('BEZ', 730),
 ('BEDZ', 716),
 ("''", 702),
 ('JJ-TL', 689),
 ('PPSS', 602),
 ('DT', 589),
 ('BE', 525),
 ('VBZ', 519),
 ('NR', 495),
 ('RP', 482),
 ('QL', 468),
 ('PPO', 412),
 ('WPS', 395),
 ('NNS-TL', 344),
 ('WDT', 343),
 ('BER', 328),
 ('WRB', 328),
 ('OD', 309),
 ('HVZ', 301),
 ('--', 300),
 ('NP$', 279),
 ('HV', 265),
 ('HVD', 262),
 ('*', 256),
 ('BED', 252),
 ('NPS', 215),
 ('BEN', 212),
 ('NN$', 210),
 ('DTI', 205),
 ('NP-HL', 186),
 ('ABN', 183),
 ('NN-HL', 171),
 ('IN-TL', 164),
 ('EX', 161),
 (')', 151),
 ('(', 148),
 ('JJR', 145),
 (':', 137),
 ('DTS', 136),
 ('JJT', 100),
 ('CD-TL', 96),
 ('NNS-HL', 92),
 ('

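So what do these tags mean? NLTK ships documentation for the common tagsets. The call below prints the Penn Treebank tagset; the Brown corpus uses its own, closely related tagset, which nltk.help.brown_tagset() prints in the same format.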

In [9]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

We can also find out what the most common nouns are. For the linguists: there are naturally many subgroups of nouns, so let's see what we can get:

In [10]:
# for each tag starting with tag_prefix, find its five most frequent words
def find_tags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    # cfd.conditions() yields every tag that was seen
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = find_tags('NN', brown_news_tagged)
# 'NN' is the base noun tag; the other keys are its subcategories

for tag in sorted(tagdict):
    print(tag, tagdict[tag])

NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("company's", 6), ("nation's", 6)]
NN$-HL [("Navy's", 1), ("Golf's", 1)]
NN$-TL [("President's", 11), ("Army's", 3), ("League's", 3), ("University's", 3), ("Gallery's", 3)]
NN-HL [('sp.', 2), ('party', 2), ('condition', 2), ('war', 2), ('Question', 2)]
NN-NC [('aya', 1), ('eva', 1), ('ova', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Beat', 1), ('Dr.', 1), ('Street', 1), ('Grove', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("men's", 3), ("janitors'", 3), ("builders'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Braves'", 1), ("Raiders'", 1)]
NNS-HL [('returns', 1), ('$37', 1), ('schools', 1), ('Charges', 1), ('places', 1)]
NNS-TL [('States', 38)
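The conditional frequency distribution is easier to follow on a handful of tokens. The sketch below uses a made-up list of (word, tag) pairs; the tag is the condition and the word is the counted sample, exactly as in find_tags.

```python
import nltk

# Made-up tagged tokens
tagged = [("year", "NN"), ("years", "NNS"), ("year", "NN"), ("time", "NN"),
          ("year's", "NN$"), ("members", "NNS"), ("said", "VBD"), ("years", "NNS")]

# condition = tag, sample = word; only tags starting with 'NN' are kept
cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged
                               if tag.startswith("NN"))

# most frequent word for every noun tag that occurred
top = {tag: cfd[tag].most_common(1) for tag in cfd.conditions()}

for tag in sorted(top):
    print(tag, top[tag])
```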

We can also look at the linguistic environment a word appears in. The code below lists all the words that follow "President":

In [11]:
brown_news_text = brown.words(categories='news')
sorted(set(b for (a, b) in nltk.bigrams(brown_news_text) if a == 'President'))

['(',
 ',',
 '.',
 'Chen',
 'Dwight',
 'Eisenhower',
 "Eisenhower's",
 'Habib',
 'John',
 'Judge',
 'Kasavubu',
 'Kennedy',
 "Kennedy's",
 'Raymond',
 'Richard',
 'Sukarno',
 'also',
 'and',
 'announced',
 'asks',
 'called',
 'clearly',
 'commented',
 'had',
 'has',
 'held',
 'himself',
 'in',
 'is',
 'knew',
 'left',
 'noted',
 'of',
 'received',
 'recommended',
 'remarked',
 'said',
 'spent',
 'sponsored',
 'talked',
 'to',
 'took',
 'was',
 'which',
 'will',
 'would']
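Under the hood a bigram is just a token paired with its successor. The sketch below builds the pairs with plain `zip` on a made-up sentence, as a stand-in for `nltk.bigrams`, and then applies the same "what follows President" filter.

```python
# Made-up sentence for illustration
words = "President Kennedy said the President would meet President Sukarno".split()

# pair every token with the one that follows it
bigrams = list(zip(words, words[1:]))

# unique words that directly follow 'President'
followers = sorted({b for (a, b) in bigrams if a == "President"})

print(bigrams[:3])
print(followers)
```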

If we are looking to build a classifier, perhaps for author identification, it may be useful to quantify the syntax.

In [12]:
# nltk.bigrams yields every pair of adjacent (word, tag) tuples;
# index 1 of each tuple is the POS tag.
# Collect the tag of every token that follows a verb (tag starting with 'VB').

tags = [b[1] for (a, b) in nltk.bigrams(brown_news_tagged) if a[1].startswith('VB')]
fd1 = nltk.FreqDist(tags)
fd1.tabulate(10)

  IN   AT   NN  NNS   TO   RP   CS    .   RB    , 
1888 1540  600  468  414  369  368  339  339  317 


## Automatic Tagging

Now that we know some things we can do with a tagged corpus, how can we tag our own corpus? We will work through regex models, n-gram models, and discuss a couple more advanced models.

### Regex Tagger

Let's write a simple regex tagger that assigns one of five tags. First we need to define the pattern for each tag:

In [13]:
# building a regex POS tagger; patterns are tried in order and the first match wins
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds, ending in 'ing'
     (r'.*ed$', 'VBD'),                # simple past, ending in 'ed'
     (r'.*\'s$', 'NN$'),               # possessive nouns, ending in 's
     (r'.*s$', 'NNS'),                 # plural nouns, ending in 's'
     (r'.*', 'NN')                     # nouns (default)
 ]
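Because the tagger tries the patterns in order, their ordering matters. A small sketch with made-up words shows the effect: the same rules with the catch-all moved to the front tag everything as NN.

```python
import nltk

ordered = [
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*'s$", "NN$"),
    (r".*s$", "NNS"),
    (r".*", "NN"),      # catch-all, deliberately last
]
# same rules, but the catch-all now shadows every other pattern
catch_all_first = [ordered[-1]] + ordered[:-1]

words = ["walking", "walked", "dog's", "dogs", "dog"]

good = nltk.RegexpTagger(ordered).tag(words)
bad = nltk.RegexpTagger(catch_all_first).tag(words)

print(good)
print(bad)
```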

In [16]:
gsents

[['My', 'father', 'had', 'a', 'small', 'estate', 'in', 'Nottinghamshire', ';', 'I', 'was', 'the', 'third', 'of', 'five', 'sons', '.'], ['He', 'sent', 'me', 'to', 'Emmanuel', 'College', 'in', 'Cambridge', 'at', 'fourteen', 'years', 'old', ',', 'where', 'I', 'resided', 'three', 'years', ',', 'and', 'applied', 'myself', 'close', 'to', 'my', 'studies', ';', 'but', 'the', 'charge', 'of', 'maintaining', 'me', ',', 'although', 'I', 'had', 'a', 'very', 'scanty', 'allowance', ',', 'being', 'too', 'great', 'for', 'a', 'narrow', 'fortune', ',', 'I', 'was', 'bound', 'apprentice', 'to', 'Mr', '.', 'James', 'Bates', ',', 'an', 'eminent', 'surgeon', 'in', 'London', ',', 'with', 'whom', 'I', 'continued', 'four', 'years', ';', 'and', 'my', 'father', 'now', 'and', 'then', 'sending', 'me', 'small', 'sums', 'of', 'money', ',', 'I', 'laid', 'them', 'out', 'in', 'learning', 'navigation', ',', 'and', 'other', 'parts', 'of', 'the', 'mathematics', 'useful', 'to', 'those', 'who', 'intend', 'to', 'travel', ',', 

Now we build the tagger and we can test it on the first sentence of our *Gulliver's Travels*.

In [14]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(gsents[0])

[('My', 'NN'),
 ('father', 'NN'),
 ('had', 'NN'),
 ('a', 'NN'),
 ('small', 'NN'),
 ('estate', 'NN'),
 ('in', 'NN'),
 ('Nottinghamshire', 'NN'),
 (';', 'NN'),
 ('I', 'NN'),
 ('was', 'NNS'),
 ('the', 'NN'),
 ('third', 'NN'),
 ('of', 'NN'),
 ('five', 'NN'),
 ('sons', 'NNS'),
 ('.', 'NN')]

That didn't work so well, but no worries: this was a very naïve attempt. We can evaluate the accuracy nonetheless:

In [15]:
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger.evaluate(brown_tagged_sents)

0.19909700260556518

### N-Gram Tagging

N-gram tagging is a very basic supervised machine learning technique. An n-gram tagger looks at a word and the tags of the *n*-1 preceding words to determine the best tag for the focal word. N-gram tagging and other such models are called "supervised" because they learn from training data whose correct tags are already known. This also means that we must divide the data into training and testing data: if you test your model on the same data it was trained on, the score will be overly optimistic. Originally, a 90-10 divide was recommended, but standards have now changed to k-fold cross-validation, usually with 10 folds.

In [17]:
#divide tagged data
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#train bigram tagger
bigram_tagger = nltk.BigramTagger(train_sents) #word and tag of prev word
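For comparison with the single 90-10 split, here is a minimal sketch of how k-fold splits can be generated in plain Python. `k_fold_splits` is a hypothetical helper written for this illustration, and it assumes a plain list (wrap a corpus view in `list()` first); the integers stand in for tagged sentences.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists; every item lands in exactly one test fold."""
    fold_size, remainder = divmod(len(items), k)
    start = 0
    for i in range(k):
        # the first `remainder` folds get one extra item
        end = start + fold_size + (1 if i < remainder else 0)
        yield items[:start] + items[end:], items[start:end]
        start = end

sents = list(range(23))  # stand-in for a list of tagged sentences
folds = list(k_fold_splits(sents, k=10))

for train, test in folds[:3]:
    print(len(train), len(test), test)
```

In practice you would train one tagger per fold on `train`, score it on `test`, and average the ten scores.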

We can now try this tagger on that sentence again:

In [18]:
bigram_tagger.tag(gsents[0])

[('My', 'PP$'),
 ('father', 'NN'),
 ('had', 'HVD'),
 ('a', 'AT'),
 ('small', 'JJ'),
 ('estate', 'NN'),
 ('in', 'IN'),
 ('Nottinghamshire', None),
 (';', None),
 ('I', None),
 ('was', None),
 ('the', None),
 ('third', None),
 ('of', None),
 ('five', None),
 ('sons', None),
 ('.', None)]

Each "None" means the tagger did not know how to tag that word. The model has never seen "Nottinghamshire", and once it fails on one word, the context for the next word (the previous tag) is unknown too, so everything that follows is also left untagged. To fix this we implement backoff tagging, also called cascading taggers:

In [19]:
t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0) #only current word, most likely tag
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

Now let's try to tag that sentence again:

In [20]:
t3.tag(gsents[0])

[('My', 'PP$'),
 ('father', 'NN'),
 ('had', 'HVD'),
 ('a', 'AT'),
 ('small', 'JJ'),
 ('estate', 'NN'),
 ('in', 'IN'),
 ('Nottinghamshire', 'NN'),
 (';', '.'),
 ('I', 'PPSS'),
 ('was', 'BEDZ'),
 ('the', 'AT'),
 ('third', 'OD'),
 ('of', 'IN'),
 ('five', 'CD'),
 ('sons', 'NNS'),
 ('.', '.')]

In [21]:
t3.evaluate(test_sents)

0.8674374563939001
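The None cascade and the effect of backoff can be reproduced on toy data. The sketch below assumes a made-up two-sentence training set and uses a `DefaultTagger` as the last resort instead of the regex tagger, purely to keep the example short.

```python
import nltk

# Made-up training data: two tagged sentences
train = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
    [("the", "AT"), ("cat", "NN"), ("slept", "VBD"), (".", ".")],
]
sentence = ["the", "bird", "slept", "."]  # 'bird' never occurs in train

# bigram tagger alone: fails on 'bird', then on everything after it
no_backoff = nltk.BigramTagger(train).tag(sentence)

# cascade: bigram -> unigram -> default tag 'NN'
default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(train, backoff=default)
bigram = nltk.BigramTagger(train, backoff=unigram)
with_backoff = bigram.tag(sentence)

print(no_backoff)
print(with_backoff)
```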

### Transformation-based Brill Tagging

There are many different machine learning algorithms out there. The current "hot" choice is neural networks, but that is beyond the scope of this workshop. Let's look at a transformation-based tagger included in NLTK, which will help us understand how many machine learning models make decisions.

In [23]:
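from nltk.tag.brill import Template, brill24  # fntbl37 is an alternative template set
from nltk.tag.brill_trainer import BrillTaggerTrainer

def train_brill_tagger(tagged_sents):
    # initial tagger: the backoff chain from above, trained on train_sents
    t0 = nltk.RegexpTagger(patterns)
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
    Template._cleartemplates()
    templates = brill24()
    trainer = BrillTaggerTrainer(t3, templates, trace=3)
    # learn up to 100 correction rules from tagged_sents
    return trainer.train(tagged_sents, max_rules=100)

# NB: brown_tagged_sents includes test_sents, so accuracy measured on
# test_sents afterwards is optimistic; pass train_sents for a clean evaluation
tagger = train_brill_tagger(brown_tagged_sents)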


TBL train (fast) (seqs: 4623; tokens: 100554; tpls: 24; min score: 2; min acc: None)
Finding initial useful rules...
    Found 47380 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 216 221   5   0  | IN->TO if Pos:VB@[1]
 114 114   0   1  | TO->IN if Pos:AT@[1]
  36  37   1   0  | CS->QL if Word:as@[2]
  30  30   0   1  | TO->IN if Pos:NP@[1]
  22  22   0   0  | IN->TO if Word:to@[0] & Word:be@[1] & Pos:BE@[1]
  20  20   0   0  | NNS->NPS if Word:Belgians@[-1,0]
  19  19   0   0  | TO->IN if Pos:NNS@[1]
  17  58  41   6  | NN->VB if Word:to@[-1] & Pos:IN@[-1]
  18  18   0   0  | IN->TO if Pos:VB@[1] & Pos:AT@[2]
  14  14   0   0  | NN

We see that the Brill tagger corrects itself up to a certain threshold based on rules it generated from the data we gave it. Other machine learning models such as Conditional Random Fields (CRF) work in a similar way, in that you tell the model which features are important to look at, and it weights these features when writing its rules. Neural networks take a different approach that relies more on linear algebra and matrix multiplication. Libraries exist for easy implementation of neural nets, such as pybrain (http://pybrain.org) for general advanced modelling and nlpnet (http://nilc.icmc.usp.br/nlpnet/index.html) for POS tagging or SRL (Semantic Role Labeling).
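To make the rule notation concrete, a rule such as `IN->TO if Pos:VB@[1]` reads "retag IN as TO when the next token is tagged VB". The sketch below applies one such rule in plain Python. `apply_rule` is a hypothetical helper for illustration, not NLTK's implementation, and the sentence is made up.

```python
def apply_rule(tagged, old, new, next_tag):
    """Retag old -> new wherever the following token carries next_tag."""
    out = list(tagged)
    for i in range(len(tagged) - 1):
        word, tag = tagged[i]
        # conditions are checked against the tags before this rule ran
        if tag == old and tagged[i + 1][1] == next_tag:
            out[i] = (word, new)
    return out

# Made-up output of an initial tagger that mis-tags 'to' as a preposition
initial = [("I", "PPSS"), ("want", "VB"), ("to", "IN"), ("go", "VB"),
           ("to", "IN"), ("London", "NP")]

corrected = apply_rule(initial, old="IN", new="TO", next_tag="VB")
print(corrected)
```

Only the first "to" changes, because only it is followed by a verb. The Score column in the trace counts how many such changes were fixes minus how many broke a correct tag.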

So let's tag that sentence again with our Brill tagger:

In [24]:
gtagged_sent = tagger.tag(gsents[0])
print (gtagged_sent)

[('My', 'PP$'), ('father', 'NN'), ('had', 'HVD'), ('a', 'AT'), ('small', 'JJ'), ('estate', 'NN'), ('in', 'IN'), ('Nottinghamshire', 'NN'), (';', '.'), ('I', 'PPSS'), ('was', 'BEDZ'), ('the', 'AT'), ('third', 'OD'), ('of', 'IN'), ('five', 'CD'), ('sons', 'NNS'), ('.', '.')]


In [25]:
tagger.evaluate(test_sents)

0.9002292434964617

Let's POS tag our entire text:

In [26]:
#list comprehension
g_tagged_all = [tagger.tag(sent) for sent in gsents]

In [27]:
g_tagged_all[:3]

[[('My', 'PP$'),
  ('father', 'NN'),
  ('had', 'HVD'),
  ('a', 'AT'),
  ('small', 'JJ'),
  ('estate', 'NN'),
  ('in', 'IN'),
  ('Nottinghamshire', 'NN'),
  (';', '.'),
  ('I', 'PPSS'),
  ('was', 'BEDZ'),
  ('the', 'AT'),
  ('third', 'OD'),
  ('of', 'IN'),
  ('five', 'CD'),
  ('sons', 'NNS'),
  ('.', '.')],
 [('He', 'PPS'),
  ('sent', 'VBD'),
  ('me', 'PPO'),
  ('to', 'IN'),
  ('Emmanuel', 'NN'),
  ('College', 'NN-TL'),
  ('in', 'IN'),
  ('Cambridge', 'NP'),
  ('at', 'IN'),
  ('fourteen', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('where', 'WRB'),
  ('I', 'PPSS'),
  ('resided', 'VBD'),
  ('three', 'CD'),
  ('years', 'NNS'),
  (',', ','),
  ('and', 'CC'),
  ('applied', 'VBD'),
  ('myself', 'PPL'),
  ('close', 'JJ'),
  ('to', 'IN'),
  ('my', 'PP$'),
  ('studies', 'NNS'),
  (';', '.'),
  ('but', 'CC'),
  ('the', 'AT'),
  ('charge', 'NN'),
  ('of', 'IN'),
  ('maintaining', 'VBG'),
  ('me', 'PPO'),
  (',', ','),
  ('although', 'CS'),
  ('I', 'PPSS'),
  ('had', 'HVD'),
  ('a

What types of adjectives are used?

In [28]:
# flattens the list
g_tagged_words = [item for sublist in g_tagged_all for item in sublist]

tagdict = find_tags('JJ', g_tagged_words)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

JJ [('great', 132), ('own', 73), ('good', 61), ('long', 47), ('whole', 39)]
JJ-TL [('Big', 6), ('South', 2), ('Old', 2), ('East', 2), ('Good', 2)]
JJR [('larger', 15), ('better', 14), ('smaller', 14), ('greater', 10), ('lower', 3)]
JJS [('principal', 12), ('chief', 9), ('top', 2), ('main', 1)]
JJT [('greatest', 16), ('largest', 15), ('strongest', 10), ('best', 10), ('highest', 7)]


How about we compare the syntax of Gulliver's Travels to the news corpus:

In [29]:
# POS tags that follow verbs
tags = [b[1] for (a, b) in nltk.bigrams(g_tagged_words) if a[1].startswith('VB')]
fd2 = nltk.FreqDist(tags)

print ("Gulliver")
fd2.tabulate(10)
# PPO = objective personal pronoun (me, him, ...)
print ()
print ("News")
fd1.tabulate(10)

Gulliver
  IN  PPO   AT    ,  PP$   RP   NN   TO    .   RB 
1258  702  563  389  292  231  220  207  171  119 

News
  IN   AT   NN  NNS   TO   RP   CS    .   RB    , 
1888 1540  600  468  414  369  368  339  339  317 


There are several possible explanations for the difference. Perhaps because readers of a novel are already familiar with its characters, objective personal pronouns ("me", "him", "her", etc.) follow verbs more often than articles do, whereas a news source more often uses an article to introduce something unknown to the reader.

In developing machine learning models, you may want to know where the model is making errors. This can be done by examining the Confusion Matrix:

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent] #just grabbing a list of all the tags
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus] #notice we first untag the sentence

gold = tag_list(brown_tagged_sents)
test = tag_list(apply_tagger(tagger, brown_tagged_sents))

cm = nltk.ConfusionMatrix(gold, test)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=10))
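At its core, a confusion matrix counts (gold tag, predicted tag) pairs. The following sketch does the same with a `Counter` on made-up tag lists, which can help when reading the formatted table.

```python
from collections import Counter

# Made-up gold and predicted tags for six tokens
gold = ["NN", "VB", "NN", "IN", "TO", "NN"]
pred = ["NN", "NN", "NN", "TO", "TO", "VB"]

# each key is (gold, predicted); the diagonal holds the correct decisions
confusion = Counter(zip(gold, pred))

errors = {pair: n for pair, n in confusion.items() if pair[0] != pair[1]}
accuracy = sum(n for (g, p), n in confusion.items() if g == p) / len(gold)

print(errors)
print(accuracy)
```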

### Pickling

If you want to save your model, or any complex variable in Python, you can use pickle:

In [None]:
from pickle import dump,load

with open("brilltagger.pkl", "wb") as f:
    dump (tagger, f, -1) #-1 calls for a more efficient binary protocol

In [None]:
with open('brilltagger.pkl', 'rb') as f:
    tagger = load(f)
    
type (tagger)

# 3) Chunking, grammars, and Named Entity Recognition

At a low linguistic level, you may want to map out a sentence visually based on its parts of speech. This visualization is really just a navigable data type (a tree), which can be mined for statistics. We first have to define the grammar. We'll define a noun phrase for English as an optional determiner, article, cardinal number, or possessive pronoun, followed by any number of adjectives and then a noun or personal pronoun. Defining the grammar is done similarly to writing regular expressions. We can then draw the map.

In [None]:
grammar = r"""
  NP: {<DT|AT|CD|PP\$>?<JJ>*<PPSS|NN.*>}       
  PP: {<IN><NP>}            
  VP: {<BEDZ|HVD|VB.*><AT>?<OD>?<NP|PP|CLAUSE>+} 
  CLAUSE: {<NP><VP>}        
  """
# | is "or", a following ? means optional, * is 0 or more, .* matches any remaining characters in the tag

cp = nltk.RegexpParser(grammar)
result = cp.parse(gtagged_sent)
result # outside a notebook, use result.draw() instead

In [None]:
print (result) # the tree can be traversed by index and searched
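As a sketch of how the resulting tree can be searched, the example below chunks a short, hand-tagged phrase with a one-rule grammar and collects the words of each noun phrase. The grammar is a simplified assumption, not the one defined above.

```python
import nltk

# Hand-tagged phrase using Brown-style tags
tagged = [("a", "AT"), ("small", "JJ"), ("estate", "NN"),
          ("in", "IN"), ("Nottinghamshire", "NP")]

# one rule: optional article, any number of adjectives, then a noun
parser = nltk.RegexpParser("NP: {<AT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# subtrees() walks the tree; keep only the chunks labelled NP
noun_phrases = [" ".join(word for (word, tag) in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]

print(tree)
print(noun_phrases)
```

The word tag NP (Brown's proper noun) on "Nottinghamshire" is unrelated to the chunk label NP; only chunk labels are visible to `subtrees()`.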

With this information, we can then train classifiers for Named Entity Recognition (NER), i.e. identifying people, places, and things. We won't go into detail today, but NLTK already has a trained classifier we can use off-the-shelf:

In [None]:
print (nltk.ne_chunk(gtagged_sent, binary=True))

# 4) Document Classification

We now load a corpus of movie reviews that have already been labeled as positive or negative. We can build a Naive Bayes Classifier to learn from the annotated data and then predict unseen reviews as positive or negative.

In [30]:
import random
from nltk.corpus import movie_reviews

movie_reviews.categories()

['neg', 'pos']

In [31]:
movie_reviews.fileids()

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt',
 'neg/cv010_29063.txt',
 'neg/cv011_13044.txt',
 'neg/cv012_29411.txt',
 'neg/cv013_10494.txt',
 'neg/cv014_15600.txt',
 'neg/cv015_29356.txt',
 'neg/cv016_4348.txt',
 'neg/cv017_23487.txt',
 'neg/cv018_21672.txt',
 'neg/cv019_16117.txt',
 'neg/cv020_9234.txt',
 'neg/cv021_17313.txt',
 'neg/cv022_14227.txt',
 'neg/cv023_13847.txt',
 'neg/cv024_7033.txt',
 'neg/cv025_29825.txt',
 'neg/cv026_29229.txt',
 'neg/cv027_26270.txt',
 'neg/cv028_26964.txt',
 'neg/cv029_19943.txt',
 'neg/cv030_22893.txt',
 'neg/cv031_19540.txt',
 'neg/cv032_23718.txt',
 'neg/cv033_25680.txt',
 'neg/cv034_29446.txt',
 'neg/cv035_3343.txt',
 'neg/cv036_18385.txt',
 'neg/cv037_19798.txt',
 'neg/cv038_9781.txt',
 'neg/cv039_5963.txt',
 'neg/cv040_8829.txt',
 'neg/cv041_22364.txt',


In [32]:
len(movie_reviews.fileids())

2000

In [33]:
movie_reviews.words("neg/cv000_29416.txt")

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

In [34]:
# pair each review's token list with its category
documents = [(list(movie_reviews.words(fileid)), category)
                for category in movie_reviews.categories()
                for fileid in movie_reviews.fileids(category)]
#shuffle to avoid bias
random.shuffle(documents)

In [37]:
# a long list of (tokens, category) tuples; look at the first one
documents[0]

(['a',
  'fully',
  'loaded',
  'entertainment',
  'review',
  '-',
  'website',
  'coming',
  'in',
  'july',
  '!',
  '>',
  'from',
  'ace',
  'ventura',
  'to',
  'truman',
  'burbank',
  ',',
  'jim',
  'carrey',
  'has',
  'run',
  'the',
  'whole',
  'gamut',
  'of',
  'comic',
  ',',
  'yet',
  'sympathetic',
  ',',
  'characters',
  '.',
  '1996',
  "'",
  's',
  'the',
  'cable',
  'guy',
  'was',
  'supposed',
  'to',
  'be',
  'his',
  'big',
  '"',
  'breakthrough',
  '"',
  'role',
  'from',
  'zany',
  'humor',
  'into',
  'darker',
  ',',
  'more',
  'dramatic',
  'acting',
  '.',
  'as',
  'most',
  'everyone',
  'knows',
  ',',
  'the',
  'results',
  'were',
  ',',
  'well',
  ',',
  'less',
  '-',
  'than',
  '-',
  'stellar',
  '.',
  'not',
  'only',
  'did',
  'the',
  'film',
  'not',
  'do',
  'so',
  'hot',
  'at',
  'the',
  'box',
  'office',
  ',',
  'but',
  'it',
  'was',
  'also',
  'panned',
  'by',
  'critics',
  '.',
  'as',
  'far',
  'as',
  'i',
  

We will use a relatively simple model of only two features: single words and bigrams. First we need to get a list of the most common words and bigrams in the entire corpus.

In [41]:
# lowercase every token in the corpus
movie_words = [w.lower() for w in movie_reviews.words()]
all_words = nltk.FreqDist(movie_words)
# keep the 2000 most common words as features
word_features = [word for (word, count) in all_words.most_common(2000)]

# same for the 2000 most common bigrams
all_bis = nltk.FreqDist(nltk.bigrams(movie_words))
bi_features = [bi for (bi, count) in all_bis.most_common(2000)]
bi_features[:10]

[("'", 's'),
 (',', 'and'),
 ('of', 'the'),
 ('.', 'the'),
 ("'", 't'),
 ('in', 'the'),
 (',', 'but'),
 (',', 'the'),
 ('the', 'film'),
 ('it', "'")]

Now we define the features we deem relevant for classifying a document. In our case we will only use the words and bigrams generated above.

In [42]:
# document is a list of tokens
def document_features(document):
    # sets give fast membership tests on the unique words and bigrams
    document_words = set(document)
    document_bis = set(nltk.bigrams(document))
    # more features (e.g. POS-based ones) can be added here
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    for bi in bi_features:
        features['contains({})'.format(str(bi))] = (bi in document_bis)
    return features

We then pair each document's features with its classification in a tuple, divide the results into training and testing sets, and train the classifier.

In [46]:
# d is the tokenized review and c is its category
featuresets = [(document_features(d), c) for (d,c) in documents]
# hold out the first 100 reviews as the test set and train on the rest
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [45]:
featuresets[0]

({'contains(break)': True,
  'contains(remake)': False,
  "contains((',', 'what'))": False,
  'contains(brothers)': False,
  'contains(old)': False,
  'contains(feature)': False,
  'contains(rescue)': False,
  'contains(teen)': False,
  "contains(('last', 'summer'))": False,
  'contains(mel)': False,
  "contains(('in', 'an'))": False,
  'contains(decided)': False,
  'contains(--)': False,
  "contains((',', 'has'))": False,
  'contains(help)': False,
  "contains(('easy', 'to'))": False,
  'contains(lawyer)': False,
  "contains(('one', 'scene'))": False,
  "contains(('.', 'just'))": False,
  "contains(('about', 'to'))": False,
  "contains(('lot', 'of'))": False,
  "contains(('and', 'are'))": False,
  "contains(('these', 'are'))": False,
  "contains(('the', 'point'))": False,
  'contains(will)': False,
  'contains(climax)': False,
  'contains(campbell)': False,
  "contains(('it', 'does'))": False,
  "contains(('films', 'like'))": False,
  'contains(normal)': False,
  "contains(('s', 'not'

In [47]:
nltk.classify.accuracy(classifier, test_set)

0.7

In [48]:
classifier.show_most_informative_features(30)

Most Informative Features
contains(('waste', 'of')) = True              neg : pos    =     14.1 : 1.0
   contains(outstanding) = True              pos : neg    =     13.3 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.6 : 1.0
        contains(seagal) = True              neg : pos    =      7.4 : 1.0
contains(('bad', 'movie')) = True              neg : pos    =      7.2 : 1.0
         contains(damon) = True              pos : neg    =      5.9 : 1.0
         contains(waste) = True              neg : pos    =      5.7 : 1.0
         contains(flynt) = True              pos : neg    =      5.7 : 1.0
         contains(awful) = True              neg : pos    =      5.4 : 1.0
          contains(jedi) = True              pos : neg    =      5.3 : 1.0
    contains(ridiculous) = True              neg : pos    =      5.3 : 1.0
          contains(lame) = True              neg : pos    =      5.1 : 1.0
        contains(wasted) = True              neg : pos    =      5.0 : 
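The whole pipeline, from feature extraction to prediction, can be seen end to end on a toy scale. The sketch below assumes a two-word vocabulary and four made-up "reviews", and it uses word features only; it illustrates the mechanics and says nothing about real accuracy.

```python
import nltk

word_features = ["waste", "wonderful"]  # tiny stand-in vocabulary

def document_features(document):
    """Map a token list to {'contains(word)': bool} for each feature word."""
    document_words = set(document)
    return {"contains({})".format(w): (w in document_words) for w in word_features}

# Made-up labeled reviews, already tokenized
labeled = [
    ("a waste of time".split(), "neg"),
    ("what a waste".split(), "neg"),
    ("a wonderful film".split(), "pos"),
    ("wonderful acting".split(), "pos"),
]

train_set = [(document_features(d), c) for (d, c) in labeled]
classifier = nltk.NaiveBayesClassifier.train(train_set)

label = classifier.classify(document_features("such a waste".split()))
print(label)
```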

Now go to the internet, find a movie review, create a tokenized string out of it, and try to classify it!

Hint:

In [49]:
my_review = nltk.word_tokenize("too wasteful REVIEW")

In [50]:
rev_doc_feats = document_features(my_review)

In [51]:
classifier.classify(rev_doc_feats)

'neg'

In [None]:
# Other classifiers to try, e.g. with scikit-learn:
# SVM, random forest, neural networks (NN)
