In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
import nltk
text1 = nltk.word_tokenize("It is a pleasant day today")
nltk.pos_tag(text1)

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('pleasant', 'JJ'),
 ('day', 'NN'),
 ('today', 'NN')]

The tags follow the Penn Treebank tagset (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):

In [18]:
# information about the NNS tag:
nltk.help.upenn_tagset('NNS')

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


In [19]:
# A regular expression may also be queried; this one lists all the verb tags (those starting with VB).
nltk.help.upenn_tagset('VB.*')

VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
 

An example of word-sense disambiguation through POS tagging. "bear" can be a verb (to tolerate) or a noun (the animal):

In [3]:
text = nltk.word_tokenize("I cannot bear the pain of bear")
nltk.pos_tag(text)

[('I', 'PRP'),
 ('can', 'MD'),
 ('not', 'RB'),
 ('bear', 'VB'),
 ('the', 'DT'),
 ('pain', 'NN'),
 ('of', 'IN'),
 ('bear', 'NN')]

A tagged token is represented as a tuple consisting of a token and its tag.

We can create this tuple in NLTK using the str2tuple() function:

In [4]:
taggedword = nltk.tag.str2tuple('bear/NN')
taggedword

('bear', 'NN')

In [6]:
taggedword[0]
taggedword[1]

'bear'

'NN'

Generate a sequence of tuples from a given text:

In [7]:
sentence = """The/DT sacred/VBN Ganga/NNP flows/VBZ in/IN this/DT
   region/NN ./. This/DT is/VBZ a/DT pilgrimage/NN ./. People/NNP from/IN
   all/DT over/IN the/DT country/NN visit/NN this/DT place/NN ./. """

[nltk.tag.str2tuple(t) for t in sentence.split()]

[('The', 'DT'),
 ('sacred', 'VBN'),
 ('Ganga', 'NNP'),
 ('flows', 'VBZ'),
 ('in', 'IN'),
 ('this', 'DT'),
 ('region', 'NN'),
 ('.', '.'),
 ('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('pilgrimage', 'NN'),
 ('.', '.'),
 ('People', 'NNP'),
 ('from', 'IN'),
 ('all', 'DT'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('country', 'NN'),
 ('visit', 'NN'),
 ('this', 'DT'),
 ('place', 'NN'),
 ('.', '.')]

Convert a (word, POS tag) tuple back into a word/tag string:

In [8]:
taggedtok = ('bear', 'NN')
from nltk.tag.util import tuple2str
tuple2str(taggedtok)

'bear/NN'
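
The round trip between a `word/TAG` string and a `(word, tag)` tuple can be sketched in plain Python. This is a minimal stand-in for illustration, not NLTK's source; the names `split_tagged` and `join_tagged` are hypothetical, and the choice to split on the last separator and upper-case the tag is an assumption about the intended behavior.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' on the last separator; the tag is None if no separator is present."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag.upper())


def join_tagged(pair, sep="/"):
    """Join a (word, tag) pair back into 'word/TAG'; an untagged word is returned unchanged."""
    word, tag = pair
    return word if tag is None else f"{word}{sep}{tag}"


pair = split_tagged("bear/NN")
print(pair)                        # the (word, tag) tuple
print(join_tagged(pair))           # back to the word/TAG string
print(split_tagged("and/or/CC"))   # only the last separator splits off the tag
```

Splitting on the last separator keeps words that themselves contain a slash intact.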

Count the occurrences of each tag in the Treebank corpus:

In [None]:
import nltk
from nltk.corpus import treebank
treebank_tagged = treebank.tagged_words(tagset='universal')
tag = nltk.FreqDist(tag for (word, tag) in treebank_tagged)
tag.most_common()

Count which tags occur immediately before a noun tag:

In [21]:
import nltk
from nltk.corpus import treebank
treebank_tagged = treebank.tagged_words(tagset='universal')
tagpairs = nltk.bigrams(treebank_tagged)
preceders_noun = [x[1] for (x,y) in tagpairs if y[1] == 'NOUN']
freqdist = nltk.FreqDist(preceders_noun)
[tag for (tag, _) in freqdist.most_common()]

['NOUN',
 'DET',
 'ADJ',
 'ADP',
 '.',
 'VERB',
 'NUM',
 'PRT',
 'CONJ',
 'PRON',
 'X',
 'ADV']
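
The same counting idea can be tried without downloading a corpus. The sketch below uses a small hand-made tagged sentence (invented for illustration) and `collections.Counter` in place of `nltk.FreqDist`; pairing the list with itself shifted by one plays the role of `nltk.bigrams`.

```python
from collections import Counter

# A tiny hand-made tagged sentence (illustrative data, not from the Treebank corpus).
tagged = [("The", "DET"), ("dog", "NOUN"), ("saw", "VERB"), ("a", "DET"),
          ("bird", "NOUN"), ("near", "ADP"), ("the", "DET"), ("old", "ADJ"),
          ("house", "NOUN"), (".", ".")]

# Pair each token with its successor, then keep the tag of the first
# element whenever the second element is a noun.
pairs = zip(tagged, tagged[1:])
before_noun = Counter(prev_tag for (_, prev_tag), (_, next_tag) in pairs
                      if next_tag == "NOUN")

print(before_noun.most_common())
```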

Create a word-to-POS-tag mapping with a Python dictionary:

In [10]:
tag = {}
tag['beautiful'] = 'ADJ'
tag['boy'] = 'N'
tag['read'] = 'V'
tag['generously'] = 'ADV'
tag

{'beautiful': 'ADJ', 'boy': 'N', 'generously': 'ADV', 'read': 'V'}

### Default tagging: assign the same POS tag to every token

DefaultTagger is a subclass of SequentialBackoffTagger.

![hierarchy of tagger](DefaultTagger.png)

Every subclass of SequentialBackoffTagger must implement the choose_tag() method, which takes:

    • A collection of tokens
    • The index of the token that should be tagged
    • The previous tags list

In [12]:
from nltk.tag import DefaultTagger
tag = DefaultTagger('NN')
tag.tag(['Beautiful', 'morning','good','Do']) # assigns the same POS tag to every token

[('Beautiful', 'NN'), ('morning', 'NN'), ('good', 'NN'), ('Do', 'NN')]
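
To make the three choose_tag() arguments concrete, here is a minimal plain-Python sketch of the calling convention. The class `DefaultChooser` and the helper `tag_sentence` are hypothetical names invented for this illustration; they imitate the idea, not NLTK's actual implementation.

```python
class DefaultChooser:
    """Always proposes the same tag, in the spirit of DefaultTagger."""

    def __init__(self, tag):
        self.tag = tag

    def choose_tag(self, tokens, index, history):
        # tokens:  the whole sentence
        # index:   position of the token to tag
        # history: tags already chosen for tokens[:index]
        return self.tag


def tag_sentence(chooser, tokens):
    """Tag tokens left to right, passing the tags chosen so far as history."""
    history = []
    for index in range(len(tokens)):
        history.append(chooser.choose_tag(tokens, index, history))
    return list(zip(tokens, history))


print(tag_sentence(DefaultChooser("NN"), ["Beautiful", "morning"]))
```

A default chooser ignores all three arguments; context-sensitive taggers are the ones that make use of `index` and `history`.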

Convert a tagged sentence into an untagged sentence:

In [13]:
from nltk.tag import untag
untag([('beautiful', 'NN'), ('morning', 'NN')])

['beautiful', 'morning']

## Creating POS-tagged corpora
A corpus is a collection of documents. Corpora is the plural: a collection of several such corpora.

Create a data directory named nltkdoc inside the home directory:

In [1]:
import nltk
import os,os.path

In [2]:
create = os.path.expanduser('~/nltkdoc') # path of a directory named nltkdoc under the home directory
if not os.path.exists(create):
    os.mkdir(create)

os.path.exists(create) # verify that the nltkdoc directory exists

In [7]:
create_data = os.path.expanduser('~/nltk_data/data')
os.makedirs(create_data, exist_ok=True) # also creates ~/nltk_data if missing; no error if the directory exists

The Names corpus consists of two files, male.txt and female.txt:

In [14]:
from nltk.corpus import names
names.fileids()
len(names.words('male.txt'))
len(names.words('female.txt'))

['female.txt', 'male.txt']

2943

5001

The Words corpus is a large collection of English words:

In [15]:
from nltk.corpus import words
words.fileids()
len(words.words('en'))
len(words.words('en-basic'))

['en', 'en-basic']

235886

850

## Selecting a machine learning algorithm

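POS tagging is also known as word-category disambiguation or grammatical tagging.

Taggers fall into two categories:
* rule-based, for example **E. Brill's tagger**
* stochastic/probabilistic

A POS classifier takes a document as input and extracts lexical features from it. It is trained on these features together with the available training labels. This kind of classifier is called a second-order classifier; it uses a bootstrap classifier to generate the tags.

* A backoff classifier performs a backoff procedure: the trigram POS tagger falls back on the bigram POS tagger, which in turn falls back on the unigram POS tagger.

* Training a POS classifier generates a feature set, which consists of:
    * the current word
    * the previous word or prefix
    * the next word or successor

* Maximum entropy classifier: training searches for the parameters that maximize the total likelihood of the corpus.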
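
The feature set listed above can be sketched as a small function that builds one feature dictionary per token. The function name `token_features`, the feature keys, the three-character affix length, and the `<START>`/`<END>` markers are all assumptions made for this illustration.

```python
def token_features(tokens, index):
    """Features for tokens[index]: the word, its neighbours, and a short prefix and suffix."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": tokens[index + 1].lower() if index < len(tokens) - 1 else "<END>",
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
    }


sentence = ["I", "cannot", "bear", "the", "pain"]
print(token_features(sentence, 2))
```

Dictionaries of this shape are a common input format for feature-based classifiers; which features actually help is an empirical question.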

UnigramTagger training: the tagger uses only the token itself to determine its POS tag. It is trained by passing a list of tagged sentences at initialization:

In [22]:
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
training = treebank.tagged_sents()[:7000] # train on (up to) the first 7000 sentences
unitagger = UnigramTagger(training)
treebank.sents()[0]
unitagger.tag(treebank.sents()[0])

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

To evaluate UnigramTagger, compute its accuracy on tagged test sentences:

In [23]:
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
training = treebank.tagged_sents()[:7000]
unitagger = UnigramTagger(training)
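# Note: [2000:] overlaps the training slice [:7000], so this score is measured partly on training data.
testing = treebank.tagged_sents()[2000:]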
unitagger.evaluate(testing)

0.9619024159944167
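
The slices used above, `[:7000]` for training and `[2000:]` for testing, share sentences, which tends to inflate the score. The plain-Python sketch below shows what the accuracy measures and how a disjoint split behaves on toy data; the helper `accuracy`, the toy tagger, and the three sentences are invented for illustration.

```python
def accuracy(tag_fn, gold_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


data = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]
train, test = data[:2], data[2:]  # disjoint slices: no sentence is in both

# Toy tagger: remember the tag seen for each training word, otherwise answer 'NN'.
lexicon = {word: tag for sentence in train for word, tag in sentence}


def toy_tagger(words):
    return [(word, lexicon.get(word, "NN")) for word in words]


print(accuracy(toy_tagger, train))  # scored on data the tagger has already seen
print(accuracy(toy_tagger, test))   # scored on held-out data
```

With the Treebank data, the analogous change would be to cut both slices at the same index, for example training on everything before an index and testing on everything after it.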

### Inheritance among the tagger classes:
![](Tagger_inheritance.png)

Since UnigramTagger inherits from ContextTagger, we can map a context key to a specific tag.

A ContextTagger uses the frequency of each tag in a given context to decide the most likely tag.

In [24]:
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
unitag = UnigramTagger(model={'Vinken': 'NN'}) # only tags 'Vinken' with the 'NN'
unitag.tag(treebank.sents()[0])

[('Pierre', None),
 ('Vinken', 'NN'),
 (',', None),
 ('61', None),
 ('years', None),
 ('old', None),
 (',', None),
 ('will', None),
 ('join', None),
 ('the', None),
 ('board', None),
 ('as', None),
 ('a', None),
 ('nonexecutive', None),
 ('director', None),
 ('Nov.', None),
 ('29', None),
 ('.', None)]

Evaluate a UnigramTagger trained with a cutoff, which drops contexts whose most likely tag was seen too rarely in training:

In [25]:
unitagger = UnigramTagger(training, cutoff=5)
unitagger.evaluate(testing)

0.7972986842375351

Backoff tagging:
if a tagger cannot tag a token, the token is passed on to the next tagger in the chain:

In [28]:
# backoff: chain a DefaultTagger behind a UnigramTagger
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training = treebank.tagged_sents()[:7000]
tag1 = DefaultTagger('NN')
tag2 = UnigramTagger(training, backoff=tag1)
tag2.evaluate(testing)

<DefaultTagger: tag=NN>

0.9619024159944167
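
The backoff idea can be sketched in plain Python: look the word up in a table learned from training data and, if it is missing, hand it to the next tagger. `UnigramLookup` and `ConstantTag` are hypothetical classes written for this illustration and are not NLTK's implementation.

```python
from collections import Counter, defaultdict


class ConstantTag:
    """Last link in the chain: answers with the same tag for every word."""

    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag


class UnigramLookup:
    """Tags a word with its most frequent training tag; defers to `backoff` for unseen words."""

    def __init__(self, tagged_sentences, backoff=None):
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word][tag] += 1
        self.best = {word: c.most_common(1)[0][0] for word, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.best:
            return self.best[word]
        return self.backoff.tag_word(word) if self.backoff else None


train = [[("the", "DT"), ("bear", "NN")], [("bear", "VB")], [("bear", "NN")]]
plain = UnigramLookup(train)
chained = UnigramLookup(train, backoff=ConstantTag("NN"))

print(plain.tag_word("pain"))    # unseen word, no backoff available
print(chained.tag_word("pain"))  # unseen word, answered by the backoff
print(chained.tag_word("bear"))  # seen word, answered from the lookup table
```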

### NgramTagger

In [29]:
#BigramTagger:
import nltk
from nltk.tag import BigramTagger
from nltk.corpus import treebank
training_1 = treebank.tagged_sents()[:7000]
bigramtagger = BigramTagger(training_1)
treebank.sents()[0]
bigramtagger.tag(treebank.sents()[0])
testing_1 = treebank.tagged_sents()[2000:]
bigramtagger.evaluate(testing_1)

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

0.9171131227292321

In [30]:
# BigramTagger and TrigramTagger:
from nltk.tag import BigramTagger, TrigramTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training = treebank.tagged_sents()[:7000]
bigramtag = BigramTagger(training)
bigramtag.evaluate(testing)
trigramtag = TrigramTagger(training)
trigramtag.evaluate(testing)

0.9171131227292321

0.9022107272615308
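
A bigram tagger conditions on the previous tag as well as the current word. The plain-Python sketch below (hypothetical helpers `train_bigram` and `tag_bigram`, toy training data) illustrates this, and also why longer contexts without a backoff can do worse: a context that was never seen yields `None`, and that `None` then becomes part of the next context.

```python
from collections import Counter, defaultdict

START = "<s>"  # marker for the sentence start (an assumption of this sketch)


def train_bigram(tagged_sentences):
    """Map (previous tag, word) to the tag most often seen in that context."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag_bigram(model, words):
    """Tag left to right; an unseen (previous tag, word) context gives None."""
    tags = []
    for word in words:
        prev_tag = tags[-1] if tags else START
        tags.append(model.get((prev_tag, word)))
    return list(zip(words, tags))


train = [[("I", "PRP"), ("bear", "VBP"), ("pain", "NN")],
         [("the", "DT"), ("bear", "NN"), ("sleeps", "VBZ")]]
model = train_bigram(train)

print(tag_bigram(model, ["I", "bear", "pain"]))     # 'bear' after PRP
print(tag_bigram(model, ["the", "bear", "sleeps"])) # 'bear' after DT
print(tag_bigram(model, ["a", "bear", "sleeps"]))   # unknown first word
```

In the last call the unknown first word leaves every later context unseen as well, which is the usual argument for chaining n-gram taggers with a backoff.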

NgramTagger with n > 3:

In [31]:
from nltk.corpus import treebank
from nltk import NgramTagger

training = treebank.tagged_sents()[:7000]
testing = treebank.tagged_sents()[2000:]

quadgramtag = NgramTagger(4, training)
quadgramtag.evaluate(testing)

0.9304554878173943

AffixTagger is also a ContextTagger; it uses a prefix or suffix of the word as the context:

In [32]:
from nltk.corpus import treebank
from nltk.tag import AffixTagger
training = treebank.tagged_sents()[:7000]
testing = treebank.tagged_sents()[2000:]
affixtag = AffixTagger(training)
affixtag.evaluate(testing)

0.2902682841718497

Learn from four-character prefixes:

In [33]:
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
training = treebank.tagged_sents()[:7000]
testing = treebank.tagged_sents()[2000:]
prefixtag = AffixTagger(training, affix_length=4)
prefixtag.evaluate(testing)

0.2094751318841472

Learn from three-character suffixes:

In [34]:
from nltk.tag import AffixTagger
from nltk.corpus import treebank
training = treebank.tagged_sents()[:7000]
testing = treebank.tagged_sents()[2000:]
suffixtag = AffixTagger(training, affix_length=-3)
suffixtag.evaluate(testing)

0.2902682841718497
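
The sign of affix_length selects between prefix and suffix. The plain-Python sketch below shows how such a context could be extracted; the function `affix_context` is hypothetical, and the minimum stem length of 2 is an assumption of this sketch rather than a statement about NLTK's defaults.

```python
def affix_context(word, affix_length, min_stem_length=2):
    """Return the prefix (positive length) or suffix (negative length) used as context.

    Words too short to leave a stem of min_stem_length characters give None.
    affix_length is expected to be non-zero.
    """
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    return word[:affix_length] if affix_length > 0 else word[affix_length:]


print(affix_context("pleasant", 4))   # four-character prefix
print(affix_context("pleasant", -3))  # three-character suffix
print(affix_context("day", -3))       # too short to leave a stem
```

Short affixes alone carry limited information, which is consistent with the low accuracies reported above for affix taggers used on their own.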

Combine several affix taggers in a backoff chain:

In [39]:
from nltk.tag import AffixTagger
from nltk.corpus import treebank
training = treebank.tagged_sents()[:7000]
testing = treebank.tagged_sents()[2000:]
prefixtagger = AffixTagger(training, affix_length=4)
prefixtagger.evaluate(testing)

prefixtagger3 = AffixTagger(training, affix_length=3, backoff=prefixtagger)
prefixtagger3.evaluate(testing)

suffixtagger3 = AffixTagger(training, affix_length=-3, backoff=prefixtagger3)
suffixtagger3.evaluate(testing)

suffixtagger4 = AffixTagger(training, affix_length=-4, backoff=suffixtagger3)
suffixtagger4.evaluate(testing)

0.2094751318841472

0.25841082168442225

0.2940451998275756

0.33072644046226163

### TnT (Trigrams'n'Tags): a statistical tagger based on a second-order Markov model

In [3]:
from nltk.tag import tnt
from nltk.corpus import treebank

training = treebank.tagged_sents()[:7000]
testing =treebank.tagged_sents()[2000:]

tnt_tagger = tnt.TnT()
tnt_tagger.train(training)
tnt_tagger.evaluate(testing)

KeyboardInterrupt: 

In [None]:

   >>> import nltk
   >>> from nltk.tag import tnt
   >>> from nltk.corpus import treebank
   >>> testing = treebank.tagged_sents()[2000:]
   >>> training= treebank.tagged_sents()[:7000]
   >>> tnt_tagger=tnt.TnT()
   >>> tnt_tagger.train(training)
   >>> tnt_tagger.evaluate(testing)
   0.9882176652913768

TnT computes ConditionalFreqDist and internal FreqDist objects from the training text.

To choose the best tag, TnT uses the n-gram model.
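
In a second-order Markov model, each tag is conditioned on the two tags before it. The plain-Python sketch below estimates that trigram probability from counts on toy tag sequences. It is only the unsmoothed core idea under stated assumptions: the helper names and the `<s>` padding are invented here, and a full tagger would also need word-given-tag probabilities and a way to handle unseen histories.

```python
from collections import Counter


def trigram_tag_model(tag_sequences):
    """Count tag trigrams and their two-tag histories, padding each sentence start."""
    trigrams, histories = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + list(tags)
        for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
            trigrams[(t1, t2, t3)] += 1
            histories[(t1, t2)] += 1
    return trigrams, histories


def transition_prob(trigrams, histories, t1, t2, t3):
    """Maximum-likelihood estimate of P(t3 | t1, t2); 0.0 for an unseen history."""
    seen = histories[(t1, t2)]
    return trigrams[(t1, t2, t3)] / seen if seen else 0.0


sequences = [["DT", "NN", "VBZ"], ["DT", "JJ", "NN"], ["DT", "NN", "NN"]]
trigrams, histories = trigram_tag_model(sequences)

print(transition_prob(trigrams, histories, "<s>", "DT", "NN"))
print(transition_prob(trigrams, histories, "DT", "NN", "VBZ"))
```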

In [None]:
from nltk.tag import DefaultTagger
from nltk.tag import tnt
from nltk.corpus import treebank

training = treebank.tagged_sents()[:7000]
testing = treebank.tagged_sents()[2000:]

unknown = DefaultTagger('NN')  # tagger used for words not seen in training
# if the unknown-word tagger is provided explicitly,
# Trained=True indicates that it does not need to be trained
tagger_tnt = tnt.TnT(unk=unknown, Trained=True)
tagger_tnt.train(training)     # train the tagger that carries the unknown-word handler
tagger_tnt.evaluate(testing)

## Developing a chunker using pos-tagged corpora

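Chunking is a process used for entity detection. It segments and labels sequences of tokens within a sentence.

To design a chunker, define a chunk grammar that holds the rules for how chunking should be done.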

In [4]:
# Noun phrase chunking by forming the chunk rules:
import nltk
sent = [("A","DT"),("wise", "JJ"), ("small", "JJ"),("girl", "NN"),
   ("of", "IN"), ("village", "N"),  ("became", "VBD"), ("leader", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN><IN>?<NN>*}"
find = nltk.RegexpParser(grammar)
res = find.parse(sent)
print(res)

res.draw()

(S
  (NP A/DT wise/JJ small/JJ girl/NN of/IN)
  village/N
  became/VBD
  (NP leader/NN))


KeyboardInterrupt: 

![](parse_tree.png)
The rule reads: an optional **DT**, any number of **JJ**, then one **NN**, an optional **IN**, and any number of **NN**.
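
One way to see how such a rule works is to apply it as an ordinary regular expression over the sequence of tags. The sketch below does this with the standard `re` module; the function `np_chunks` is hypothetical and only mimics the single NP rule above, not the full behavior of a chunk parser.

```python
import re


def np_chunks(tagged):
    """Find NP spans by matching a regular expression against the '<TAG>' string of a sentence."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(<DT>)?(<JJ>)*<NN>(<IN>)?(<NN>)*")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Convert character offsets back to token indexes by counting '<' characters.
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


sent = [("A", "DT"), ("wise", "JJ"), ("small", "JJ"), ("girl", "NN"),
        ("of", "IN"), ("village", "N"), ("became", "VBD"), ("leader", "NN")]
print(np_chunks(sent))
```

Because the token `village` carries the tag `N` rather than `NN`, it falls outside the first chunk, which matches the parse printed above.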

In [None]:
#Noun Phrase chunk rule is created with any number of nouns:
noun1 = [('financial', 'NN'), ('year', 'NN'), ('account', 'NN'), ('summary', 'NN')]
gram = 'NP:{<NN>+}'
find = nltk.RegexpParser(gram)
print(find.parse(noun1))

x = find.parse(noun1)
x.draw()

![](parse_tree2.png)

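The whole chunk can be used, or only the middle of the chunk with the rest discarded; alternatively, a part taken from the beginning or the end of the chunk can be used and the remainder removed.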

In [None]:
# UnigramChunker source code: