# Unigram tagging

In [1]:
import nltk
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.tag.UnigramTagger(brown_tagged_sents)  # train a unigram tagger
unigram_tagger.evaluate(brown_tagged_sents)

0.9349006503968017
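To make the idea concrete, here is a minimal pure-Python sketch of what a unigram tagger does: for each word it remembers the tag that word carried most often in the training data. The toy corpus and the helper names `train_unigram` and `tag_unigram` are made up for illustration; this is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (made-up data, not taken from Brown).
toy_tagged_sents = [
    [("the", "AT"), ("wind", "NN"), ("blows", "VBZ")],
    [("they", "PPSS"), ("wind", "VB"), ("the", "AT"), ("clock", "NN")],
    [("the", "AT"), ("wind", "NN"), ("is", "BEZ"), ("cold", "JJ")],
]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words):
    """Tag each word independently of its context; unseen words get None."""
    return [(w, model.get(w)) for w in words]

model = train_unigram(toy_tagged_sents)
tagged = tag_unigram(model, ["the", "wind", "howls"])
print(tagged)
```

In this toy data "wind" appears twice as `NN` and once as `VB`, so it is always tagged `NN`, and the unseen word "howls" gets `None`.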

# Training and test sets

In [2]:
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

0.8121200039868434
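The drop from 0.93 to 0.81 comes from scoring on sentences the tagger has not seen. The sketch below illustrates the same effect on made-up data with a hypothetical `accuracy` helper that counts the fraction of tokens whose predicted tag matches the gold tag; it is an illustration of the idea, not the NLTK evaluation code.

```python
def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total

# Made-up training and held-out sentences.
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
test = [[("the", "AT"), ("bird", "NN"), ("sleeps", "VBZ"), ("soundly", "RB")]]

# A lookup tagger that simply memorises the training data.
lookup = {word: tag for sent in train for word, tag in sent}

def tag(words):
    return [(w, lookup.get(w)) for w in words]

train_acc = accuracy(tag, train)  # scored on data the tagger memorised
test_acc = accuracy(tag, test)    # scored on held-out data
print(train_acc, test_acc)
```

A tagger that memorises its training data scores perfectly on that data, so only the held-out score says anything about how it generalises.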

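# N-gram tagging
A bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word, it is unable to assign a tag. The word after it then also gets no tag, because the tagger never saw that word in training with a `None` tag on the preceding word, so the rest of the sentence goes untagged. This explains the very low accuracy below.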

In [3]:
bigram_tagger = nltk.BigramTagger(train_sents)  # bigram tagger
print(bigram_tagger.tag(brown_sents[2007]))
bigram_tagger.evaluate(test_sents)

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]


0.10206319146815508
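The way a single unknown word derails the rest of a sentence can be reproduced with a small pure-Python sketch. It assumes a simplified model in which the context is the pair (previous tag, current word); the `START` marker, the toy sentence and the helper names are made up for illustration and do not mirror NLTK's internals.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag" at sentence start

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def tag_bigram(model, words):
    """Tag left to right; an unseen context yields None."""
    tagged, prev_tag = [], START
    for word in words:
        tag = model.get((prev_tag, word))
        tagged.append((word, tag))
        prev_tag = tag  # a None here poisons every later context
    return tagged

train = [[("the", "AT"), ("population", "NN"), ("is", "BEZ"), ("large", "JJ")]]
model = train_bigram(train)

seen = tag_bigram(model, ["the", "population", "is", "large"])
unseen = tag_bigram(model, ["the", "new", "population", "is", "large"])
print(seen)
print(unseen)
```

The training sentence is tagged completely, but once the unseen word "new" receives `None`, every following context contains `None` and fails to match as well.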

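# Combining taggers

One way to address the trade-off between accuracy and coverage is to use the more accurate algorithm whenever possible, and to fall back on an algorithm with wider coverage when necessary. For example, we can combine a bigram tagger, a unigram tagger, and a default tagger as follows: try the bigram tagger first; if it cannot find a tag, try the unigram tagger; if that also fails, use the default tagger.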

In [4]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)

0.8452108043456593
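The backoff chain can be sketched in plain Python as a sequence of lookups, each tried only when the previous one returns nothing. This is a simplified illustration under the same assumptions as a (previous tag, word) bigram model; the names `train_tables`, `tag_with_backoff` and `START` are hypothetical, and details of NLTK's own backoff handling are not reproduced.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag"

def train_tables(tagged_sents):
    """Build unigram (word) and bigram ((previous tag, word)) lookup tables."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag

    def best(table):
        return {key: c.most_common(1)[0][0] for key, c in table.items()}

    return best(uni), best(bi)

def tag_with_backoff(words, uni, bi, default="NN"):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev = [], START
    for word in words:
        tag = bi.get((prev, word))
        if tag is None:
            tag = uni.get(word)
        if tag is None:
            tag = default
        tagged.append((word, tag))
        prev = tag
    return tagged

train = [[("the", "AT"), ("population", "NN"), ("is", "BEZ"), ("large", "JJ")]]
uni, bi = train_tables(train)
result = tag_with_backoff(["the", "new", "population", "is", "large"], uni, bi)
print(result)
```

Here the unseen word "new" falls through to the default `NN`, "population" is recovered by the unigram table, and the bigram table takes over again for "is" and "large", so one unknown word no longer leaves the rest of the sentence untagged.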

# Storing taggers

In [8]:
import pickle as p
with open('/home/fannian/downloads/t2.pkl', 'wb') as output:
    p.dump(t2, output)
    
with open('/home/fannian/downloads/t2.pkl', 'rb') as inf:
    tagger = p.load(inf)

text = """The board's action shows what free enterprise 
is up against in our complex maze of regulatory laws ."""
tokens = text.split()

tagger.tag(tokens)[:5]

[('The', 'AT'),
 ("board's", 'NN$'),
 ('action', 'NN'),
 ('shows', 'NNS'),
 ('what', 'WDT')]

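# Performance limitations
What is the upper limit on the performance of an n-gram tagger? Consider the case of a trigram tagger. How many cases of part-of-speech ambiguity does it encounter?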

In [9]:
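cfd = nltk.ConditionalFreqDist(
    ((x[1], y[1], z[0]), z[1])  # condition: two previous tags + current word; sample: current tag
    for sent in brown_tagged_sents
    for x, y, z in nltk.trigrams(sent))
# Contexts in which the current word was seen with more than one tag
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
print(sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N())
# So about 1 in 20 trigrams is ambiguous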

0.049297702068029296
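The same calculation can be followed by hand on a tiny made-up corpus. The sketch below uses only `collections` and artificial tags chosen to create one ambiguous context. It also computes a `ceiling` value, the accuracy of a tagger that always picks the most frequent tag in each context, which is one hedged way to read the question of an upper limit.

```python
from collections import Counter, defaultdict

# Artificial data: the word "run" appears in one context with two different tags.
toy = [
    [("a", "AT"), ("fast", "JJ"), ("run", "NN")],
    [("a", "AT"), ("fast", "JJ"), ("run", "NN")],
    [("a", "AT"), ("fast", "JJ"), ("run", "VB")],
    [("a", "AT"), ("slow", "JJ"), ("walk", "NN")],
    [("a", "AT"), ("slow", "JJ"), ("walk", "NN")],
]

def trigrams(seq):
    """Consecutive triples of a sequence."""
    return zip(seq, seq[1:], seq[2:])

# context = (tag of word i-2, tag of word i-1, word i) -> counts of tag i
contexts = defaultdict(Counter)
for sent in toy:
    for x, y, z in trigrams(sent):
        contexts[(x[1], y[1], z[0])][z[1]] += 1

total = sum(sum(c.values()) for c in contexts.values())
ambiguous = sum(sum(c.values()) for c in contexts.values() if len(c) > 1)
fraction = ambiguous / total

# Best case: always choose the most frequent tag for each context.
ceiling = sum(max(c.values()) for c in contexts.values()) / total
print(fraction, ceiling)
```

In this toy corpus three of the five trigrams fall in an ambiguous context, and even the best choice per context gets only four of the five right.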

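Another way to investigate the performance of a tagger is to study its mistakes. Some tags may be harder to assign than others, and it may be possible to handle them specially by pre-processing or post-processing the data. A convenient way to look at tagging errors is the confusion matrix, which charts the expected tags (the gold standard) against the tags actually produced by the tagger: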

In [12]:
test_tags = [tag for sent in brown.sents(categories='editorial')
             for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
nltk.ConfusionMatrix(gold_tags, test_tags)

<ConfusionMatrix: 52073/61604 correct>
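The summary line only reports the overall count of correct tags. To see which tags get confused with which, the (gold, predicted) pairs can be tallied directly. The following is a minimal sketch with made-up tag lists and a plain `Counter`, not a substitute for `nltk.ConfusionMatrix`.

```python
from collections import Counter

# Made-up gold and predicted tag sequences of equal length.
gold = ["AT", "NN", "VBD", "IN", "AT", "NN", "NNS", "VB"]
pred = ["AT", "NN", "VBN", "IN", "AT", "JJ", "NNS", "NN"]

# Each cell of the confusion matrix is the count of one (gold, predicted) pair.
confusion = Counter(zip(gold, pred))

correct = sum(n for (g, p), n in confusion.items() if g == p)
errors = [(pair, n) for pair, n in confusion.most_common() if pair[0] != pair[1]]

print(f"{correct}/{len(gold)} correct")
for (g, p), n in errors:
    print(f"gold {g} tagged as {p}: {n}")
```

Sorting the off-diagonal cells by count points to the tag pairs that would benefit most from special handling.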