# Day 3: Part-of-Speech Tagging Tutorial

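NLTK ships with a number of corpora that have already been hand-annotated with part-of-speech (POS) tags. In today's exercise we use the Penn Treebank Corpus and the Brown Corpus. The Penn Treebank sample in NLTK consists of Wall Street Journal articles, while the Brown Corpus is a balanced collection that spans many genres, from news reporting to fiction. In the cell below we download both corpora and print the first sentence of each. ".tagged_sents()" returns the POS-tagged sentences (sents = sentences).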

In [1]:
import nltk
from nltk.corpus import treebank, brown
nltk.download('treebank')
nltk.download('brown')

print(treebank.tagged_sents()[0])
print(brown.tagged_sents()[0])

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]


[nltk_data] Downloading package treebank to /Users/hyhu/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package brown to /Users/hyhu/nltk_data...
[nltk_data]   Package brown is already up-to-date!


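In NLTK, each word and its tag are stored together as a tuple. In practice, though, POS tags are often written in "word/tag" form, for example "Pierre/NNP" or "the/DT". NNP marks a singular proper noun and DT a determiner; for the full tag list, see: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html .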
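
As a small illustration of the "word/tag" notation, the sketch below converts between tuples and "word/tag" strings. The helper names `to_word_tag` and `from_word_tag` are hypothetical and written only for this example; NLTK ships similar helpers (`nltk.tag.tuple2str` and `nltk.tag.str2tuple`).

```python
# Hypothetical helpers for the "word/tag" notation; not part of NLTK.
def to_word_tag(pair, sep="/"):
    """Render a (word, tag) tuple as 'word/tag'."""
    word, tag = pair
    return f"{word}{sep}{tag}"


def from_word_tag(text, sep="/"):
    """Split 'word/tag' on the LAST separator, so words containing '/' survive."""
    word, _, tag = text.rpartition(sep)
    return (word, tag)


tagged = [("Pierre", "NNP"), ("Vinken", "NNP"), ("the", "DT")]

# Join the tagged tokens into one line of text.
line = " ".join(to_word_tag(pair) for pair in tagged)
print(line)  # Pierre/NNP Vinken/NNP the/DT

# Parse the line back into tuples and check the round trip.
round_trip = [from_word_tag(token) for token in line.split()]
print(round_trip == tagged)  # True
```

Splitting on the last separator is a design choice for this sketch: it keeps tokens such as "1/2" intact when they are followed by a tag.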

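Note that the two corpora do not use the same tagset. The same word "the" is tagged "AT" (article) in Brown but "DT" (determiner) in the Penn Treebank. The good news is that NLTK can convert both to the Universal tagset. NLTK's "universal" tagset is a coarse set of 12 categories; the Universal Dependencies page linked here describes a closely related but larger inventory: https://universaldependencies.org/u/pos/ .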

In [2]:
import nltk
nltk.download('universal_tagset')
print(treebank.tagged_sents(tagset="universal")[0])
print(brown.tagged_sents(tagset="universal")[0])

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]


[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/hyhu/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
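
To make the conversion more concrete, here is a minimal sketch of what a tagset mapping does. The two dictionaries are partial and hand-copied from the tag pairs visible in the outputs above; they are an illustration only, not NLTK's actual mapping tables, and the function name `to_universal` is made up for this example.

```python
# Partial mappings copied from the outputs above, for illustration only.
# NLTK's real mapping tables cover every tag in each tagset.
PTB_TO_UNIVERSAL = {
    "NNP": "NOUN", "NN": "NOUN", "NNS": "NOUN", "DT": "DET", "JJ": "ADJ",
    "CD": "NUM", "IN": "ADP", "MD": "VERB", "VB": "VERB",
}
BROWN_TO_UNIVERSAL = {
    "AT": "DET", "NP-TL": "NOUN", "NN-TL": "NOUN", "JJ-TL": "ADJ",
    "VBD": "VERB", "NR": "NOUN", "NN": "NOUN", "IN": "ADP",
}


def to_universal(tagged_sent, mapping):
    """Map each tag through `mapping`; unmapped tags become "X" in this sketch."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged_sent]


ptb_sent = [("the", "DT"), ("board", "NN")]
brown_sent = [("The", "AT"), ("investigation", "NN")]

# Different source tags ("DT" vs. "AT") end up as the same universal tag.
print(to_universal(ptb_sent, PTB_TO_UNIVERSAL))      # [('the', 'DET'), ('board', 'NOUN')]
print(to_universal(brown_sent, BROWN_TO_UNIVERSAL))  # [('The', 'DET'), ('investigation', 'NOUN')]
```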


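Now that we know the basics of POS tagging, let's build a Unigram Tagger! First, we need to record how often each word form occurs with each POS tag. We can store this in a Python dictionary of dictionaries (admittedly not the most efficient representation :D). Although we introduced the Brown Corpus above, from here on we focus on the Penn Treebank tagset.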

In [3]:
from collections import defaultdict

POS_dict = defaultdict(dict)
for word_pos_pair in treebank.tagged_words():
    word = word_pos_pair[0].lower()  # normalize to lowercase
    POS = word_pos_pair[1]
    POS_dict[word][POS] = POS_dict[word].get(POS,0) + 1
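
An alternative way to hold the same counts, sketched here on made-up data, is a `defaultdict` of `Counter` objects. The toy list and the name `pos_counts` are assumptions for this example; the cell above remains the version used in the rest of the tutorial.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs standing in for treebank.tagged_words(); the data is made up.
toy_tagged_words = [
    ("Plan", "NN"), ("plan", "NN"), ("plan", "VB"), ("the", "DT"), ("The", "DT"),
]

# word -> Counter of tags; Counter handles the "get or 0, then add 1" step.
pos_counts = defaultdict(Counter)
for word, pos in toy_tagged_words:
    pos_counts[word.lower()][pos] += 1

print(dict(pos_counts["plan"]))                  # {'NN': 2, 'VB': 1}
print(pos_counts["plan"].most_common(1)[0][0])   # NN
```

`Counter.most_common(1)` also gives the most frequent tag directly, which is the lookup a unigram tagger needs.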

Let's pick some words and look at which tags they take and how each tag is distributed in the corpus:

In [4]:
for word in list(POS_dict.keys())[900:1000]:
    if len(POS_dict[word]) > 1:
        print(word)
        print(POS_dict[word])

target
{'NN': 8, 'VB': 2}
forecast
{'NN': 4, 'VBP': 1, 'VBD': 1, 'VBN': 1}
recorded
{'VBN': 1, 'VBD': 1}
4
{'CD': 15, 'LS': 1}
keep
{'VB': 15, 'VBP': 1}
pace
{'NN': 6, 'NNP': 1}
rival
{'JJ': 1, 'NN': 4}
announced
{'VBD': 14, 'VBN': 3}
advertising
{'NN': 10, 'VBG': 3}
plan
{'NN': 45, 'NNP': 2, 'VB': 1, 'VBP': 2}
ad
{'NN': 28, 'NNP': 1}
post
{'NNP': 1, 'VB': 2, 'NN': 4}
second
{'JJ': 16, 'NNP': 2}
offered
{'VBN': 11, 'VBD': 16}
plans
{'NNS': 17, 'VBZ': 18, 'VBP': 1}
give
{'VBP': 6, 'VB': 15}
spending
{'NN': 25, 'VBG': 2}
become
{'VBN': 7, 'VB': 15, 'VBP': 1}
news
{'NN': 24, 'NNP': 6}
world
{'NNP': 13, 'NN': 24}
5
{'CD': 18, 'LS': 1}
cost
{'VB': 4, 'NN': 12}
lowered
{'VBD': 2, 'VBN': 2}
circulation
{'NN': 10, 'NNP': 1}
lower
{'JJR': 30, 'RBR': 1}
costs
{'VBZ': 4, 'NNS': 22}
yet
{'RB': 17, 'CC': 6}
credit
{'NNP': 2, 'NN': 22}
meet
{'VBP': 3, 'VB': 9}
exceed
{'VBP': 1, 'VB': 3}
long
{'RB': 11, 'JJ': 16, 'NNP': 1}
spent
{'VBD': 4, 'VBN': 4}
attempt
{'NN': 8, 'VB': 2}
decline
{'NN': 14, 'VBP'

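Here we can see that a common ambiguity is between nouns and verbs (<i>plans</i>, <i>decline</i>, <i>cost</i>); among verbs, the same problem arises between the past tense and the past participle (<i>announced</i>, <i>offered</i>, <i>spent</i>).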

To build our first tagger (a Unigram Tagger), we simply pick the most frequent tag for each word.

In [5]:
tagger_dict = {}
for word in POS_dict:
    tagger_dict[word] = max(POS_dict[word],key=lambda x: POS_dict[word][x])

def tag(sentence):
    return [(word,tagger_dict.get(word,"NN")) for word in sentence]

example_sentence = """You better start swimming or sink like a stone , cause the times they are a - changing .""".split() 
print(tag(example_sentence))

[('You', 'NN'), ('better', 'JJR'), ('start', 'VB'), ('swimming', 'NN'), ('or', 'CC'), ('sink', 'VB'), ('like', 'IN'), ('a', 'DT'), ('stone', 'NN'), (',', ','), ('cause', 'NN'), ('the', 'DT'), ('times', 'NNS'), ('they', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('-', ':'), ('changing', 'VBG'), ('.', '.')]


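Not every word we meet will have been seen in the training set, so any unseen word is automatically tagged as a noun, "NN". This approach has some visible problems: "swimming" should be a verb here but ends up tagged as a noun, and "You" is tagged "NN" because tag() looks words up as-is while the dictionary keys were lowercased. Overall, though, the results are quite decent.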
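
In the output above, "You" comes out as "NN" even though the dictionary was built from lowercased words. A minimal sketch of the likely cause and one possible fix follows, using a made-up three-entry table (`toy_tagger_dict`) in place of the real `tagger_dict`; both function names are hypothetical.

```python
# Toy lookup table with lowercased keys, mimicking how tagger_dict was built.
toy_tagger_dict = {"you": "PRP", "are": "VBP", "the": "DT"}


def tag_as_is(sentence):
    """Look words up unchanged, so capitalized words miss the lowercased keys."""
    return [(word, toy_tagger_dict.get(word, "NN")) for word in sentence]


def tag_lowercased(sentence):
    """Lowercase only for the lookup; the original token is kept in the output."""
    return [(word, toy_tagger_dict.get(word.lower(), "NN")) for word in sentence]


sentence = "You are the champion".split()
print(tag_as_is(sentence))
# [('You', 'NN'), ('are', 'VBP'), ('the', 'DT'), ('champion', 'NN')]
print(tag_lowercased(sentence))
# [('You', 'PRP'), ('are', 'VBP'), ('the', 'DT'), ('champion', 'NN')]
```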

NLTK also has built-in N-gram taggers; we can use its Unigram (1-gram) and Bigram (2-gram) Taggers. First, we need to split the corpus into a training set and a test set.

In [6]:
# training set : test set = 9:1
size = int(len(treebank.tagged_sents()) * 0.9)
train_sents = treebank.tagged_sents()[:size] 
test_sents = treebank.tagged_sents()[size:]

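Let's first compare the default Unigram and Bigram Taggers. Every tagger in NLTK has an evaluate() method that returns the accuracy of the trained model on the test set. In recent NLTK releases the same score is available as accuracy(), and evaluate() is deprecated.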

In [7]:
from nltk import UnigramTagger, BigramTagger

unigram_tagger = UnigramTagger(train_sents)
bigram_tagger = BigramTagger(train_sents)
print(unigram_tagger.evaluate(test_sents))
print(unigram_tagger.tag(example_sentence))
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(example_sentence))

0.8627989821882952
[('You', 'PRP'), ('better', 'JJR'), ('start', 'VB'), ('swimming', None), ('or', 'CC'), ('sink', 'VB'), ('like', 'IN'), ('a', 'DT'), ('stone', 'NN'), (',', ','), ('cause', 'NN'), ('the', 'DT'), ('times', 'NNS'), ('they', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('-', ':'), ('changing', 'VBG'), ('.', '.')]
0.13455470737913486
[('You', 'PRP'), ('better', None), ('start', None), ('swimming', None), ('or', None), ('sink', None), ('like', None), ('a', None), ('stone', None), (',', None), ('cause', None), ('the', None), ('times', None), ('they', None), ('are', None), ('a', None), ('-', None), ('changing', None), ('.', None)]
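
As a rough illustration of what the accuracy scores above measure, the sketch below computes token-level accuracy by hand: the fraction of tokens whose predicted tag matches the gold tag. This is a simplified stand-in under that assumption, not NLTK's own code; the function name `token_accuracy` and the toy data are made up.

```python
def token_accuracy(predicted_sents, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for predicted, gold in zip(predicted_sents, gold_sents):
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            correct += predicted_tag == gold_tag  # True counts as 1
            total += 1
    return correct / total if total else 0.0


gold = [[("the", "DT"), ("plan", "NN"), ("works", "VBZ")]]
# A tag of None (the tagger gave no answer) simply counts as a miss.
pred = [[("the", "DT"), ("plan", "NN"), ("works", None)]]

print(token_accuracy(pred, gold))  # 2 of 3 tokens are correct
```

Under this measure a sentence full of None tags scores close to zero, which is consistent with the very low Bigram Tagger score above.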


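Here the Unigram Tagger does far better. The reason is clear: the Bigram Tagger does not have enough data to learn how a word's tag depends on its context, and worse, once one word is tagged "None", every later word in the sentence fails too, because its context now contains a tag never seen in training. To fix this, we add backoff to the Bigram Tagger. We will discuss backoff in detail later when we cover N-gram language models and smoothing; for now, we let the Bigram Tagger fall back to the Unigram Tagger, which in turn falls back to tagging unknown words as "NN"!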

In [8]:
from nltk import DefaultTagger

default_tagger = DefaultTagger("NN")
unigram_tagger = UnigramTagger(train_sents,backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents,backoff=unigram_tagger)

print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(example_sentence))

0.8905852417302799
[('You', 'PRP'), ('better', 'JJR'), ('start', 'VB'), ('swimming', 'NN'), ('or', 'CC'), ('sink', 'VB'), ('like', 'IN'), ('a', 'DT'), ('stone', 'NN'), (',', ','), ('cause', 'VB'), ('the', 'DT'), ('times', 'NNS'), ('they', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('-', ':'), ('changing', 'VBG'), ('.', '.')]


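With backoff, we layer the bigram information on top of the Unigram Tagger, and accuracy rises by about 2.8 percentage points (from 0.863 to 0.891).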
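
The backoff idea can be sketched in plain Python: try the (previous tag, word) context first, then the word alone, then a default tag. This is a simplified illustration on made-up training data, not NLTK's implementation; the names `unigram_counts`, `bigram_counts`, and `tag_with_backoff` are assumptions for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up training set of tagged sentences.
train = [
    [("the", "DT"), ("plan", "NN"), ("works", "VBZ")],
    [("they", "PRP"), ("plan", "VBP"), ("the", "DT"), ("trip", "NN")],
    [("the", "DT"), ("plan", "NN"), ("failed", "VBD")],
]

unigram_counts = defaultdict(Counter)  # word -> tag counts
bigram_counts = defaultdict(Counter)   # (previous tag, word) -> tag counts
for sent in train:
    prev_tag = None  # None marks the start of a sentence
    for word, tag in sent:
        unigram_counts[word][tag] += 1
        bigram_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag


def tag_with_backoff(sentence, default="NN"):
    """Bigram context first, then unigram, then the default tag."""
    tagged = []
    prev_tag = None
    for word in sentence:
        if (prev_tag, word) in bigram_counts:
            tag = bigram_counts[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram_counts:
            tag = unigram_counts[word].most_common(1)[0][0]
        else:
            tag = default
        tagged.append((word, tag))
        prev_tag = tag  # the chosen tag becomes the next word's context
    return tagged


print(tag_with_backoff("they plan the launch".split()))
# [('they', 'PRP'), ('plan', 'VBP'), ('the', 'DT'), ('launch', 'NN')]
```

In this toy data the unigram counts alone would tag "plan" as "NN" (2 of 3 occurrences), while the bigram context after "PRP" picks "VBP"; the unseen word "launch" falls all the way through to the default.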