Run this cell first so that each cell displays the output of every expression, not just the last one


In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Using a Tagger


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

# Homonyms
text = nltk.word_tokenize(
    "They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
nltk.help.upenn_tagset('VBP')
nltk.help.upenn_tagset('NN')

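**Refuse** and **permit** each appear twice in this sentence, once as a verb and once as a noun, so the tagger must use context to tell the two uses apart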
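
To make the ambiguity concrete, here is a minimal sketch that groups tags by word and lists the words that received more than one tag. The tagged list is hard-coded (it follows the tagging shown for this sentence in the NLTK book; your tagger version may produce slightly different tags), so the snippet runs without NLTK.

```python
# Hard-coded (word, tag) pairs for the example sentence; an assumption,
# since the actual output depends on the installed tagger model.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Collect the set of tags seen for each (lowercased) word.
tags_by_word = {}
for word, tag in tagged:
    tags_by_word.setdefault(word.lower(), set()).add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'refuse': ['NN', 'VBP'], 'permit': ['NN', 'VB']}
```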


### Tagged Corpora


Complete vs. simplified (universal) tagging


In [1]:
import nltk
nltk.download('brown')
nltk.download('universal_tagset')
print(nltk.corpus.brown.tagged_words())
nltk.corpus.brown.tagged_words(tagset='universal')

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]


[nltk_data] Downloading package brown to /Users/joe_codes/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/joe_codes/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


[('The', 'DET'), ('Fulton', 'NOUN'), ...]

Tagging other languages


In [2]:
import nltk
nltk.download('cess_esp')
nltk.corpus.cess_esp.tagged_words()

[nltk_data] Downloading package cess_esp to
[nltk_data]     /Users/joe_codes/nltk_data...
[nltk_data]   Package cess_esp is already up-to-date!


[('El', 'da0ms0'), ('grupo', 'ncms000'), ...]

Most common tags


In [None]:
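import nltk
from nltk.corpus import brown

# Tagged words from the news category, mapped to the universal tagset
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
print(brown_news_tagged)
# Count how often each tag occurs
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.values())
tag_fd.keys()

# Tags sorted from most to least frequent
tag_fd.most_common()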
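
The same counting idea can be sketched with the standard library alone. The tagged sample below is a hypothetical hand-made list, not Brown corpus data, and `collections.Counter` stands in for the frequency distribution.

```python
from collections import Counter

# Hypothetical hand-made sample of (word, tag) pairs with universal tags.
tagged_words = [
    ("The", "DET"), ("jury", "NOUN"), ("said", "VERB"),
    ("the", "DET"), ("election", "NOUN"), ("was", "VERB"),
    ("fair", "ADJ"), (".", "."),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for (_word, tag) in tagged_words)

# most_common() returns (tag, count) pairs sorted by descending count.
for tag, count in tag_counts.most_common():
    print(f"{tag:5} {count}")
```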

### Mapping Words to Properties Using Python Dictionaries (Spanish Example)


In [None]:
pos = {}
pos
pos['sin color'] = 'ADJ'
pos
pos['ideas'] = 'SUS'
pos['dormir'] = 'V'
pos['furiosamente'] = 'ADV'
pos
pos['ideas']
pos['sin color']
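
Two operations come up often with a word-to-tag dictionary: looking up a word that may be missing, and inverting the mapping to find all words with a given tag. The sketch below uses toy entries modeled on the cell above; the word `verde` and the fallback value `'UNKNOWN'` are illustrative choices.

```python
# Toy word -> tag dictionary modeled on the example above.
pos = {"sin color": "ADJ", "ideas": "SUS", "dormir": "V", "furiosamente": "ADV"}

# pos['verde'] would raise KeyError; .get() returns a fallback instead.
print(pos.get("verde", "UNKNOWN"))

# Invert the mapping: tag -> list of words that carry that tag.
words_by_tag = {}
for word, tag in pos.items():
    words_by_tag.setdefault(tag, []).append(word)

print(words_by_tag["ADV"])  # ['furiosamente']
```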

### Default Dictionary

A default dictionary automatically creates an entry for a new key and gives it a default value (for example, 0 or an empty list)


In [None]:
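from collections import defaultdict

# int() returns 0, so unseen keys default to 0
frequency = defaultdict(int)
frequency['colorless'] = 4
frequency['ideas']
# list() returns [], so unseen keys default to an empty list
pos = defaultdict(list)
pos['sleep'] = ['N', 'V']
print('\n')
pos['ideas']
print('\n')
pos['sleep']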
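
One detail worth seeing in isolation: merely reading a missing key from a default dictionary inserts it. The following sketch uses `collections.defaultdict` directly; the sample words and the `'NOUN'` default are illustrative.

```python
from collections import defaultdict

# int() returns 0, so missing keys start at 0 and can be incremented directly.
frequency = defaultdict(int)
for word in ["ideas", "sleep", "ideas"]:
    frequency[word] += 1

# Reading a missing key inserts it with the default value.
before = len(frequency)
_ = frequency["colorless"]
after = len(frequency)
print(before, after)  # 2 3

# Use `in` or .get() to test for a key without creating it.
print("green" in frequency, frequency.get("green"))  # False None

# The factory can be any zero-argument callable, e.g. a constant default tag.
pos = defaultdict(lambda: "NOUN")
pos["colorless"] = "ADJ"
print(pos["colorless"], pos["blog"])  # ADJ NOUN
```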

### Automatic Tagging


The default tagger. The simplest possible tagger assigns the same tag (here `NN`) to every token. Notice how poorly it performs.


In [None]:
import nltk
from nltk.corpus import brown
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
brown_tagged_sents = brown.tagged_sents(categories='news')
print('\n Perc of Correctly tagged tokens -> ',
      round(default_tagger.evaluate(brown_tagged_sents)*100), '%')
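
What the default tagger and its score amount to can be sketched in a few lines of plain Python. This is an illustration, not NLTK's implementation; the gold-standard tags below are hand-assigned for the example, and `default_tag` and `accuracy` are hypothetical helper names.

```python
# Hypothetical gold-standard sentence: (word, correct_tag) pairs.
gold = [("I", "PPSS"), ("like", "VB"), ("green", "JJ"),
        ("eggs", "NNS"), ("and", "CC"), ("ham", "NN")]

def default_tag(tokens, tag="NN"):
    """Assign the same tag to every token."""
    return [(token, tag) for token in tokens]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(1 for p, g in zip(predicted, gold) if p[1] == g[1])
    return correct / len(gold)

tokens = [word for word, _ in gold]
predicted = default_tag(tokens)
print(predicted)
print(f"accuracy: {accuracy(predicted, gold):.2f}")  # accuracy: 0.17
```

Only `ham` is actually a singular noun here, so one token in six is tagged correctly.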

Creating and evaluating a lookup tagger over a range of model sizes.
(See Section 4.3, The Lookup Tagger, in https://www.nltk.org/book/ch05.html)

---


In [None]:
import nltk
from nltk.corpus import brown
words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
print(words_by_freq)


def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(
        model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))


def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()


display()
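
The idea behind the lookup tagger can be shown on a toy scale without the Brown corpus: store the most likely tag for the `size` most frequent words and fall back to `NN` for everything else. The corpus and the helper names (`build_lookup`, `lookup_tag`, `accuracy`) below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical tiny tagged corpus.
tagged = [("the", "AT"), ("dog", "NN"), ("runs", "VBZ"), ("the", "AT"),
          ("run", "NN"), ("the", "AT"), ("run", "VB"), ("run", "NN"),
          ("dog", "NN")]

word_counts = Counter(word for word, _ in tagged)
tags_for_word = defaultdict(Counter)
for word, tag in tagged:
    tags_for_word[word][tag] += 1

def build_lookup(size):
    """Map each of the `size` most frequent words to its most likely tag."""
    top_words = [word for word, _ in word_counts.most_common(size)]
    return {w: tags_for_word[w].most_common(1)[0][0] for w in top_words}

def lookup_tag(tokens, model, default="NN"):
    """Tag known words from the model; back off to the default tag."""
    return [(token, model.get(token, default)) for token in tokens]

def accuracy(model):
    predicted = lookup_tag([word for word, _ in tagged], model)
    return sum(p == g for p, g in zip(predicted, tagged)) / len(tagged)

# Accuracy rises as the model covers more of the vocabulary.
for size in (0, 1, 4):
    print(size, round(accuracy(build_lookup(size)), 2))
```

With a model size of 0 this reduces to the default tagger. Even at full vocabulary the score stays below 1 because `run` is tagged both `NN` and `VB` in the toy data and a lookup table can store only one tag per word.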

### N-Gram Tagging


Unigram tagging. For each token, assign the tag that is most likely for that particular token. The cell below trains a unigram tagger and evaluates it on the same sentences it was trained on.


In [None]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
print('\n Perc of Correctly tagged tokens -> ',
      round(unigram_tagger.evaluate(brown_tagged_sents)*100), '%')

Separating training and testing data. Notice the difference in scores between the training set (90% of the sentences, used to build the tagger) and the test set (the remaining 10%, not seen while building the tagger).


In [None]:
size = int(len(brown_tagged_sents) * 0.9)
size
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
print('\n')
print('\n Perc of Correctly tagged tokens on training data -> ',
      round(unigram_tagger.evaluate(train_sents)*100), '%')
print('\n Perc of Correctly tagged tokens on testing data  -> ',
      round(unigram_tagger.evaluate(test_sents)*100), '%')
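
Why the held-out score is lower can be seen in a from-scratch sketch: a unigram model has nothing to say about words it never saw in training. The sentences, the 75/25 split, and the helper names are assumptions made for this toy example; it is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical tagged sentences.
sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "AT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "AT"), ("bird", "NN"), ("sings", "VBZ")],
]

# Hold out the last 25% of the sentences for testing.
split = int(len(sents) * 0.75)
train, test = sents[:split], sents[split:]

def train_unigram(tagged_sents):
    """Return a word -> most frequent tag mapping."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sents):
    """Unseen words get None from .get(), which never matches a gold tag."""
    pairs = [pair for sent in tagged_sents for pair in sent]
    return sum(model.get(word) == tag for word, tag in pairs) / len(pairs)

model = train_unigram(train)
print("train:", accuracy(model, train))          # train: 1.0
print("test: ", round(accuracy(model, test), 2))  # test:  0.33
```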

Bigram tagger. Notice the performance difference between the training and testing sentences.


In [None]:
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
unseen_sent = brown_sents[4203]
bigram_tagger.tag(unseen_sent)


print('\n')
print('\n Perc of Correctly tagged tokens on training data -> ',
      round(bigram_tagger.evaluate(train_sents)*100), '%')
print('\n Perc of Correctly tagged tokens on testing data  -> ',
      round(bigram_tagger.evaluate(test_sents)*100), '%')
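
The gap has a simple cause that a toy model makes visible. A bigram tagger conditions on the previous tag plus the current word, so one unseen context yields `None`, and that `None` then becomes the previous tag for every later word. The sketch below is a from-scratch illustration with hypothetical data and helper names; backing off to a unigram table is one common remedy.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences.
train = [
    [("they", "PPSS"), ("refuse", "VB"), ("the", "AT"), ("permit", "NN")],
    [("we", "PPSS"), ("permit", "VB"), ("the", "AT"), ("refuse", "NN")],
]

def most_likely(table):
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}

def train_models(tagged_sents):
    """Build word -> tag and (previous tag, word) -> tag tables."""
    unigram, bigram = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # marks the start of a sentence
        for word, tag in sent:
            unigram[word][tag] += 1
            bigram[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return most_likely(unigram), most_likely(bigram)

def tag(tokens, bigram, unigram=None):
    """Tag with the bigram table, optionally backing off to the unigram table."""
    tagged, prev_tag = [], None
    for word in tokens:
        guess = bigram.get((prev_tag, word))
        if guess is None and unigram is not None:
            guess = unigram.get(word)
        tagged.append((word, guess))
        prev_tag = guess
    return tagged

unigram, bigram = train_models(train)

# Every context was seen in training: the bigram model disambiguates by context.
print(tag(["they", "permit", "the", "refuse"], bigram))
# "you" is unseen, so every later context is unseen too and all tags are None.
print(tag(["you", "permit", "the", "refuse"], bigram))
# With unigram backoff, tagging recovers after the unknown word.
print(tag(["you", "permit", "the", "refuse"], bigram, unigram))
```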

### Transformation Based Tagging

Brill tagging (a supervised learning method): guess the tag for each word, then go back and fix mistakes by applying transformation rules. (See Section 6, Transformation-Based Tagging, in https://www.nltk.org/book/ch05.html)


In [None]:
import nltk


from nltk.tbl import demo as brill_tagger
brill_tagger.demo()
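
To illustrate the "guess, then fix" idea, here is a minimal sketch that applies a single hand-written transformation rule of the form "replace NN with VB when the previous tag is TO". The initial tagging and the `apply_rule` helper are hypothetical; a real Brill tagger learns which rules to apply from training data by picking the ones that correct the most errors.

```python
# Initial guess (for example from a unigram tagger): "race" is always a noun.
tagged = [("to", "TO"), ("race", "NN"), ("the", "AT"), ("race", "NN")]

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag."""
    result = []
    for i, (word, tag) in enumerate(tagged):
        # The condition is checked against the tags of the input sequence.
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        result.append((word, tag))
    return result

corrected = apply_rule(tagged, from_tag="NN", to_tag="VB", prev_tag="TO")
print(corrected)
# [('to', 'TO'), ('race', 'VB'), ('the', 'AT'), ('race', 'NN')]
```

Only the first `race` changes, because only it follows a `TO` tag.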