### 词性标注

### POS标签器推荐

我们将在这里讨论一些标记句子的推荐方法，第一种方法是使用推荐的ｐｏｓ＿ｔａｇ（）函数，它基于Penn

In [1]:
sentence = 'The brown fox is quick and he is jumping over the lazy dog'

In [2]:
import nltk 
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens,tagset = 'universal')
print(tagged_sent)

[('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'), ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'), ('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]


还可以使用pattern模块通过以下代码段获取句子的pos标签

In [3]:
from pattern.en import tag

In [4]:
tagged_sent = tag(sentence)
print(tagged_sent)

[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


### 建立自己的POS标签器

我们将讨论一些构建自己的POS标签器的技术，并利用nltk提供的一些类来实现他们。为了评估我们的标签器的性能，我们会使用nltk中treebank语料库的一些测试数据。我们还将使用一些训练数据来训练标签器。首先通过读取已标记的treebank语料库，我们可以获取训练和评估标签器的必要数据

In [5]:
from nltk.corpus import treebank

In [6]:
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print(train_data)

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]


In [7]:
tokens = nltk.word_tokenize(sentence)
print(tokens)

['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']


我们将使用测试数据来评估标签器，并使用例句的标识作为输入来验证标签器的工作效果，我们在nltk中使用的所有标签器均来自nltk.tag包。每个标签器都是基类Tagger1类的子类，并且每个标签器都执行一个tag()函数，它将一个句子的标识列表作为输入，返回带有POS标签的相同单词列表作为输出。除了标记外，还有一个evaluate()函数用于评估标签器的性能。它通过标记每个输入测试语句，然后将输出结果与句子的实际标签进行对比来完成评估。我们将使用该函数来测试我们的标签器在test_data上的性能

首先我们看看从SequentialBackoffTagger基类继承的DefaultTagger，并为每个单词分配相同的用户输入POS标签。

In [8]:
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')
print(dt.evaluate(test_data))

0.1454158195372253


In [9]:
print(dt.tag(tokens))

[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NN'), ('jumping', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]


从上面的输出可以看出，在树库(treebank)测试数据集中我们已经获得了14%的单词正确标记率，并不是很理想，现在我们将使用正则表达式和RegexpTagger来尝试构建一个性能更好的标签器

In [10]:
from nltk.tag import RegexpTagger
patterns = [
    (r'.*ing$','VBG'),      #动名词
    (r'.*ed$','VBD'),       #一般过去式
    (r'.*es$','VBZ'),       #3rd singular present
    (r'.*ould$','MD'),      #modals
    (r'.*\'s$','NN$'),      #possessive nouns
    (r'.*s$','NNS'),        #plural nouns
    (r'^->[0-9]+(.[0-9]+)?$','CD'),#cardinal numbers
    (r'.*','NN')            #nouns (default)...
]
rt = RegexpTagger(patterns)
print (rt.evaluate(test_data))

0.21425113757382128


In [11]:
print(rt.tag(tokens))

[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NNS'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NNS'), ('jumping', 'VBG'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]


我们现在将训练一些n元分词标签器，n元分词是来自文本序列或者语言序列的n个连续项。这些项可以由单词、音素、字母、字符或音节组成。Shingles是只包含单词的n元分词，我们将使用大小为1，2，3的n元分词，他们分别成为一元分词，二元分词和三元分词。UnigramTagger、BigramTagger和TrigramTagger继承自基类NGramTagger，NGramTagger类则继承自ContextTagger类，该类又继承自SequentialBackoffTagger类。我们将使用train_data作为训练数据，根据与举报标识及其POS标签来训练n元分词标签器，然后我们将在test_data上评估训练后的标签器，并查看例句的标记结果：

In [12]:
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
ut = UnigramTagger(train_data)
bt = BigramTagger(train_data)
tt = TrigramTagger(train_data)

print(ut.evaluate(test_data))

0.8607803272340013


In [13]:
print(ut.tag(tokens))

[('The', 'DT'), ('brown', None), ('fox', None), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', None), ('dog', None)]


In [14]:
print(bt.evaluate(test_data))

0.13466937748087907


In [15]:
print(bt.tag(tokens))

[('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]


In [16]:
print(tt.evaluate(test_data))

0.08064672281924679


In [17]:
print(tt.tag(tokens))

[('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]


上面的数据清楚的表明，我们仅使用UnigramTagger标签器就可以在测试集上获得86%的准确率，这个结果与我们前一个标签器相比要好得多，二元分词与三元分词模型的准确性远不及一元分词模型，因为在训练数据中观察到的二元词组与三元词组不一定会在测试数据中以相同的方式出现

现在通过创建一个包含标签列表的组合标签器以及使用backoff标签器，我们将尝试组合运用所有的标签器。本质上我们将创建一个标签器链，对于每一个标签器，如果他不能标记输入的标识，则标签器的下一步将会回退到backoff标签器：

In [18]:
def combined_tagged(train_data,taggers,backoff = None):
    for tagger in taggers:
        backoff = tagger(train_data,backoff = backoff)
        return backoff
ct = combined_tagged(train_data = train_data,
                     taggers = [UnigramTagger,BigramTagger,TrigramTagger],
                    backoff = rt)

In [19]:
print(ct.evaluate(test_data))

0.8917610610901345


In [20]:
print(ct.tag(tokens))

[('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]


我们现在在测试数据上获得了91%的准确率。对于最终的标签器，我们将使用有监督的分类算法来训练他，ClassifierBasedPOSTager类使我们能够使用classifier_builder参数中的有监督机器学习算法来训练标签器。该类继承自ClassifierBasedTaggr，并拥有构成训练核心部分的feature_detector()函数，该函数用于从训练数据中生成各种特征，实际上在实例化ClassifierBasedPOSTagger类对象时，你也可以构建自己的特征检测器函数，将其传递给feature_detector参数，在这里我们使用的分类器是NaiveBayesClassifier，它使用贝叶斯定理构建概率分类器，假设特征之间是独立的。

以下代码展示了如何基于分类构建POS标签器

In [21]:
from nltk.classify import  NaiveBayesClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger
nbt = ClassifierBasedPOSTagger(train = train_data,
                              classifier_builder = NaiveBayesClassifier.train)

In [22]:
print(nbt.tag(tokens))

[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'VBG')]


In [23]:
print(nbt.evaluate(test_data))

0.9306806079969019


基于分类器的POS标签器很强

## 浅层分析

将句子分解为最小的组成部分，然后将他们组合成更高级的短语。在浅层分析中，主要的关注焦点是识别这些短语或语块，而不是挖掘各个块内句法和语句关系的深层细节，正如我们在基于深度分析获得的分析树中看到的，浅层分析的主要目的是获得语义上有意义的短语，并观察他们之间的关系。

### 1.浅层分析器推荐

我们将使用pattern包创建一个浅层分析器，用以从句子中提取有意义的语块，以下代码段展示了如何在我们的例句上执行浅层分析：

In [24]:
sentence = 'The brown fox is quick and he is jumping over the lazy dog'
from pattern.en import parsetree
tree = parsetree(sentence)

print(tree)

The brown fox is quick and he is jumping over the lazy dog


In [25]:
for sentence_tree in tree:
    print(sentence_tree.chunks)

[Chunk('The brown fox/NP'), Chunk('is/VP'), Chunk('quick/ADJP'), Chunk('he/NP'), Chunk('is jumping/VP'), Chunk('over/PP'), Chunk('the lazy dog/NP')]


In [26]:
for sentence_tree in tree:
    for chunk in sentence_tree.chunks:
        print(chunk.type,'->',[(word.string,word.type) for word in chunk.words])

NP -> [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN')]
VP -> [('is', 'VBZ')]
ADJP -> [('quick', 'JJ')]
NP -> [('he', 'PRP')]
VP -> [('is', 'VBZ'), ('jumping', 'VBG')]
PP -> [('over', 'IN')]
NP -> [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


In [27]:
from pattern.en import parsetree,Chunk
from nltk.tree import Tree
def creat_sentence_tree(sentence,lemmatize = False):
    sentence_tree = parsetree(sentence,
                             relations = True,
                             lemmata = lemmatize)
    return sentence_tree[0]
def get_sentence_tree_constituents(sentence_tree):
    return sentence_tree.constituents()
def process_sentence_tree(sentence_tree):
    tree_constituents = get_sentence_tree_constituents(sentence_tree)
    processed_tree = [
        (item.type,
         [
             (w.string,w.type)
             for w in item.words
         ]
        )
        if type(item) == Chunk
        else(
        '-',
        [
            (item.string,item.type)
        ])
        for item in tree_constituents
    ]
    return processed_tree
def print_sentence_tree(sentence_tree):
    processed_tree = process_sentence_tree(sentence_tree)
    processed_tree = [
        Tree(item[0],
            [
                Tree(x[1],[x[0]])
                for x in item[1]
            ]
            )
        for item in processed_tree
    ]
    tree = Tree('S',processed_tree)
    print (tree)
def visualize_sentence_tree(sentence_tree):
    processed_tree = process_sentence_tree(sentence_tree)
    processed_tree = [
        Tree(item[0],
            [
                Tree(x[1],[x[0]])
                for x in item[1]
            ])
        for item in processed_tree
    ]
    tree = Tree('S',processed_tree)
    tree.draw()

In [28]:
t = creat_sentence_tree(sentence)
print(t)

The brown fox is quick and he is jumping over the lazy dog


In [29]:
pt = process_sentence_tree(t)
pt

[('NP', [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN')]),
 ('VP', [('is', 'VBZ')]),
 ('ADJP', [('quick', 'JJ')]),
 ('-', [('and', 'CC')]),
 ('NP', [('he', 'PRP')]),
 ('VP', [('is', 'VBZ'), ('jumping', 'VBG')]),
 ('PP', [('over', 'IN')]),
 ('NP', [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')])]

In [None]:
print_sentence_tree(t)

(S
  (NP (DT The) (JJ brown) (NN fox))
  (VP (VBZ is))
  (ADJP (JJ quick))
  (- (CC and))
  (NP (PRP he))
  (VP (VBZ is) (VBG jumping))
  (PP (IN over))
  (NP (DT the) (JJ lazy) (NN dog)))


In [None]:
visualize_sentence_tree(t)

上面的输出显示了从例句中创建、表示和可视化浅层分析树的方法。

### 构建自己的浅层分析器

我们将使用正则表达式、基于标签的学习器等技术构建自己的浅层分析器。与之前的POS标签类似，如果需要的话，我们会使用一些训练数据来训练分析器，然后使用测试数据和例句对分析器进行评估。在nltk中，可以使用treebank语料库，它带有语块标注。首先加载语料库，并使用以下代码段准备训练数据集和测试数据集：

In [None]:
from nltk.corpus import treebank_chunk
data = treebank_chunk.chunked_sents()
train_data = data[:4000]
test_data = data[4000:]

In [None]:
print (train_data[7])