# Lemmatization

## NLTK

In [1]:
# NLTK
import nltk

nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package wordnet to /Users/ryan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ryan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('dogs'))
print(lemmatizer.lemmatize('working'))
print(lemmatizer.lemmatize('working', pos='v'))

dog
working
work


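One drawback of NLTK's WordNet lemmatizer is that the part of speech has to be specified by hand. In the example above, lemmatizing "working" without the pos argument returns "working" unchanged, because the lemmatizer treats every word as a noun unless told otherwise.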

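To use NLTK for lemmatization in a real application, a complete pipeline looks like this:

1. Take a full sentence as input
2. Tokenize and POS-tag the sentence with the tools NLTK provides
3. Convert the resulting POS tags to WordNet's format
4. Lemmatize each word with WordNetLemmatizer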
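
Step 3 is the one conversion that has to be written by hand, so it helps to look at it in isolation. The following is a minimal sketch that needs no NLTK data. It assumes the tagger emits Penn Treebank tags and relies on WordNet's one-letter POS codes ("a", "v", "n", "r"), which are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`. The helper name `treebank_to_wordnet` and the tagged word list are hypothetical, made up for illustration.

```python
# WordNet's one-letter POS codes; these are the values of
# nltk.corpus.wordnet.ADJ, VERB, NOUN and ADV.
WORDNET_ADJ, WORDNET_VERB, WORDNET_NOUN, WORDNET_ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet POS code.
_TREEBANK_PREFIX_TO_WORDNET = {
    "J": WORDNET_ADJ,   # JJ, JJR, JJS
    "V": WORDNET_VERB,  # VB, VBD, VBG, VBN, VBP, VBZ
    "N": WORDNET_NOUN,  # NN, NNS, NNP, NNPS
    "R": WORDNET_ADV,   # RB, RBR, RBS
}


def treebank_to_wordnet(tag, default=WORDNET_NOUN):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper).

    Tags with no WordNet equivalent (determiners, pronouns, ...) get `default`.
    """
    return _TREEBANK_PREFIX_TO_WORDNET.get(tag[:1], default)


# Hand-written (word, tag) pairs for illustration, not real tagger output.
tagged = [("dogs", "NNS"), ("are", "VBP"), ("running", "VBG"),
          ("quickly", "RB"), ("happier", "JJR"), ("the", "DT")]

converted = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```

Falling back to the noun code mirrors the lemmatizer's own behavior when no part of speech is given, so words with unmapped tags pass through the same way they would without tagging.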

In [7]:
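# Tokenization and POS tagging have their own data dependencies.
# Note: recent NLTK versions use averaged_perceptron_tagger (downloaded above) for pos_tag by default.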
nltk.download("punkt")
nltk.download("maxent_treebank_pos_tagger")

[nltk_data] Downloading package punkt to /Users/ryan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /Users/ryan/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!


True

In [10]:
# Demo
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant, or None if there is no match."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

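def lemmatize_sentence(sentence):
    """Tokenize, POS-tag, and lemmatize a sentence; return the list of lemmas."""
    res = []
    lemmatizer = WordNetLemmatizer()
    for word, pos in pos_tag(word_tokenize(sentence)):
        # Fall back to NOUN for tags that have no WordNet equivalent.
        wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN
        res.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

    return res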

print(lemmatize_sentence("working"))
print(lemmatize_sentence("i am working"))

['work']
['i', 'be', 'work']


## Pattern

In [13]:
# Pattern
from pattern.en import lemma

print(lemma('working'))

ModuleNotFoundError: No module named 'pattern'