# The need for evaluation of NLP systems

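What we care about is how close an NLP system's output is to the output we expect. The earlier an error is found, the cheaper it is to fix.

Suppose we want to evaluate a tagger. We could compare its output with tags assigned by a human, but usually we are not linguists and cannot judge correctness ourselves. Instead, we build standard test data: a corpus that has already been annotated and can serve as the standard against which the tagger is evaluated.

If the tagger's output for a token matches the tag in this corpus, the output counts as correct.

Creating a gold standard annotated corpus is a major task and a very expensive one. It is done by manually tagging the given test data. The tags chosen this way are treated as the standard tags and can represent a wide range of information.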
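
To make the comparison against a gold standard concrete, here is a minimal sketch in plain Python. The helper `tag_accuracy` and the two short tag sequences are made up for illustration; they are not part of NLTK, whose taggers offer their own `evaluate` method.

```python
def tag_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold-standard tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical gold-standard tags and tagger output for four tokens.
gold = [("The", "AT"), ("jury", "NN"), ("further", "RBR"), ("said", "VBD")]
predicted = [("The", "AT"), ("jury", "NN"), ("further", "JJR"), ("said", "VBD")]

print(tag_accuracy(gold, predicted))  # 3 of 4 tags match -> 0.75
```

Accuracy here is simply the share of tokens tagged as in the gold standard, which is the kind of number the evaluation calls in the following sections report.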

# Evaluation of NLP tools

### Training a unigram tagger:

In [1]:
import nltk
from nltk.corpus import brown

In [4]:
sent = brown.sents(categories='news')  # untagged sentences
sent[1]

['The',
 'jury',
 'further',
 'said',
 'in',
 'term-end',
 'presentments',
 'that',
 'the',
 'City',
 'Executive',
 'Committee',
 ',',
 'which',
 'had',
 'over-all',
 'charge',
 'of',
 'the',
 'election',
 ',',
 '``',
 'deserves',
 'the',
 'praise',
 'and',
 'thanks',
 'of',
 'the',
 'City',
 'of',
 'Atlanta',
 "''",
 'for',
 'the',
 'manner',
 'in',
 'which',
 'the',
 'election',
 'was',
 'conducted',
 '.']

In [2]:
sentences = brown.tagged_sents(categories='news')  # tagged
sentences[1]

[('The', 'AT'),
 ('jury', 'NN'),
 ('further', 'RBR'),
 ('said', 'VBD'),
 ('in', 'IN'),
 ('term-end', 'NN'),
 ('presentments', 'NNS'),
 ('that', 'CS'),
 ('the', 'AT'),
 ('City', 'NN-TL'),
 ('Executive', 'JJ-TL'),
 ('Committee', 'NN-TL'),
 (',', ','),
 ('which', 'WDT'),
 ('had', 'HVD'),
 ('over-all', 'JJ'),
 ('charge', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('election', 'NN'),
 (',', ','),
 ('``', '``'),
 ('deserves', 'VBZ'),
 ('the', 'AT'),
 ('praise', 'NN'),
 ('and', 'CC'),
 ('thanks', 'NNS'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('City', 'NN-TL'),
 ('of', 'IN-TL'),
 ('Atlanta', 'NP-TL'),
 ("''", "''"),
 ('for', 'IN'),
 ('the', 'AT'),
 ('manner', 'NN'),
 ('in', 'IN'),
 ('which', 'WDT'),
 ('the', 'AT'),
 ('election', 'NN'),
 ('was', 'BEDZ'),
 ('conducted', 'VBN'),
 ('.', '.')]

In [11]:
nltk.UnigramTagger?

In [9]:
unigram_sent = nltk.UnigramTagger(sentences)  # train on the tagged sentences
unigram_sent.tag(sent[1])  # tag sent[1] with the trained unigram tagger

[('The', 'AT'),
 ('jury', 'NN'),
 ('further', 'JJR'),
 ('said', 'VBD'),
 ('in', 'IN'),
 ('term-end', 'NN'),
 ('presentments', 'NNS'),
 ('that', 'CS'),
 ('the', 'AT'),
 ('City', 'NN-TL'),
 ('Executive', 'NN-TL'),
 ('Committee', 'NN-TL'),
 (',', ','),
 ('which', 'WDT'),
 ('had', 'HVD'),
 ('over-all', 'JJ'),
 ('charge', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('election', 'NN'),
 (',', ','),
 ('``', '``'),
 ('deserves', 'VBZ'),
 ('the', 'AT'),
 ('praise', 'NN'),
 ('and', 'CC'),
 ('thanks', 'NNS'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('City', 'NN-TL'),
 ('of', 'IN'),
 ('Atlanta', 'NP'),
 ("''", "''"),
 ('for', 'IN'),
 ('the', 'AT'),
 ('manner', 'NN'),
 ('in', 'IN'),
 ('which', 'WDT'),
 ('the', 'AT'),
 ('election', 'NN'),
 ('was', 'BEDZ'),
 ('conducted', 'VBN'),
 ('.', '.')]

In [10]:
unigram_sent.evaluate(sentences)  # here the same tagged sentences used for training serve as the gold standard

0.9349006503968017

### Separating the training set from the test set:

In [13]:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
train_size = int(len(sentences) * 0.8)
training_set = sentences[:train_size]
testing_set = sentences[train_size:]
unigram_tagger = nltk.UnigramTagger(training_set)
unigram_tagger.evaluate(testing_set)

0.8026879907509996

### bigram tagger:

In [17]:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories ='news')
train_size = int(len(sentences) * 0.8)
train_set = sentences[:train_size]
test_set = sentences[train_size:]
bigram_tagger = nltk.BigramTagger(train_set)
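# note: sentences[2008] is already tagged, so the tagger receives (word, tag) pairs as tokens;
# pass brown.sents(categories='news')[2008] to tag the raw words instead
bigram_tagger.tag(sentences[2008])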

[(('Others', 'NNS'), None),
 ((',', ','), None),
 (('which', 'WDT'), None),
 (('are', 'BER'), None),
 (('reached', 'VBN'), None),
 (('by', 'IN'), None),
 (('walking', 'VBG'), None),
 (('up', 'IN'), None),
 (('a', 'AT'), None),
 (('single', 'AP'), None),
 (('flight', 'NN'), None),
 (('of', 'IN'), None),
 (('stairs', 'NNS'), None),
 ((',', ','), None),
 (('have', 'HV'), None),
 (('balconies', 'NNS'), None),
 (('.', '.'), None)]

In [18]:
bigram_tagger.evaluate(test_set)

0.09186376993111421

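### The example above returns many None tags; this can be handled by chaining taggers with the backoff parameter:
If the bigram tagger cannot find a tag, it backs off to the unigram tagger; if the unigram tagger cannot find one either, the default tagger is used.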

In [20]:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
train_size = int(len(sentences)*0.8)
train_data = sentences[:train_size]
test_data = sentences[train_size:]
s0 = nltk.DefaultTagger('NNP')  # last resort for unseen words; note 'NNP' is a Penn Treebank tag, while Brown uses 'NN' and 'NP'
s1 = nltk.UnigramTagger(train_data, backoff=s0)
s2 = nltk.BigramTagger(train_data, backoff=s1)
s2.evaluate(test_data)

0.8117443036755142

Much better than the 0.09 above.
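
The backoff idea can be sketched in plain Python. This is only an illustration of the lookup order (bigram context, then word alone, then a default tag), not NLTK's implementation; the function names and the two-sentence training set are made up.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_with_backoff(words, bigram, unigram, default="NN"):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev = [], None
    for word in words:
        tag = bigram.get((prev, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev = tag
    return tagged


# Tiny made-up training set.
train = [
    [("the", "AT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "AT"), ("dog", "NN"), ("ran", "VBD")],
]
bigram = train_bigram(train)
unigram = train_unigram(train)

# "a" is unseen (default tag), "cat" falls back to the unigram table,
# and "ran" is found in the bigram table under the context ("NN", "ran").
print(tag_with_backoff(["a", "cat", "ran"], bigram, unigram))
```

A bigram table on its own is sparse because it needs both the word and the preceding tag to have occurred together in training, which is why the fallback levels recover so much accuracy.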

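Linguists use the following clues to decide a word's category:

    Morphological clues: prefixes, suffixes, infixes, and other affixes
    Syntactic clues: the word's role in the sentence, such as subject, verb, or object
    Semantic clues: knowing what a word means also tells you its category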

### Evaluation of a chunk parser:

In [24]:
import nltk
from nltk.corpus import conll2000
chunkparser = nltk.RegexpParser("")
gold_standard = conll2000.chunked_sents('train.txt',chunk_types=('NP',))
nltk.chunk.accuracy(chunkparser,gold_standard)

0.44084599507856814

### Evaluation of a naïve chunk parser that looks for tags such as CD, JJ:

In [25]:
import nltk
from nltk.corpus import conll2000
grammar = r'NP: {<[CDJNP].*>+}'
cp = nltk.RegexpParser(grammar)
gold_standard = conll2000.chunked_sents('train.txt', chunk_types=('NP',))
nltk.chunk.accuracy(cp, gold_standard)

0.8744798726662164

### Evaluation of a chunker:

In [29]:
import nltk
correct = nltk.chunk.tagstr2tree('[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]')
print(correct.flatten())

(S the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN)


In [32]:
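grammar = r"NP: {<[CDJNP].*>+}"  # chunk runs of tags that start with C, D, J, N or P
cp = nltk.RegexpParser(grammar)
grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"  # alternative grammar; not used for the score below
chunk_parser = nltk.RegexpParser(grammar)
tagged_tok = [('the', 'DT'), ('little','JJ'),('cat','NN'),('sat','VBD'),('on','IN'),
              ('the','DT'),('mat','NN')]  # not used below
chunkscore = nltk.chunk.ChunkScore()
predict = cp.parse(correct.flatten())  # chunk the flattened (unchunked) gold tree
chunkscore.score(correct, predict)     # compare the gold chunks with the predicted chunks
print(chunkscore)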

ChunkParse score:
    IOB Accuracy: 100.0%%
    Precision:    100.0%%
    Recall:       100.0%%
    F-Measure:    100.0%%
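
For intuition about the precision, recall and F-measure figures, here is a minimal sketch that scores chunks represented as `(start, end, label)` spans. The helper `chunk_prf` and the "wrong" prediction are hypothetical; the sketch assumes a chunk only counts as correct when its span and label match the gold chunk exactly.

```python
def chunk_prf(gold_chunks, predicted_chunks):
    """Precision, recall and F1 over chunk spans given as (start, end, label)."""
    gold, pred = set(gold_chunks), set(predicted_chunks)
    hits = len(gold & pred)  # chunks that match exactly
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Token positions in "the little cat sat on the mat"; end index is exclusive.
gold = [(0, 3, "NP"), (5, 7, "NP")]
# Hypothetical chunker output that cuts the first noun phrase short.
pred = [(0, 2, "NP"), (5, 7, "NP")]

p, r, f = chunk_prf(gold, pred)
print(p, r, f)  # one of two chunks matches on each side -> 0.5 0.5 0.5
```

When the predicted chunks equal the gold chunks, as in the cell above, all three values are 1.0.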


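### Tagging parts of speech by suffix, e.g. words ending in "ness" are tagged NN (noun)

1. Collect a list of suffixes
2. Extract features from the suffixes, each returning True or False
3. Use a decision tree classifier
4. Predict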

In [7]:
import nltk
from nltk.corpus import brown
suffix_freqdist = nltk.FreqDist()
nltk.FreqDist()   

FreqDist()

In [8]:
# count how often each 1-, 2- and 3-letter suffix occurs
for wrd in brown.words():
    wrd = wrd.lower()
    suffix_freqdist[wrd[-1:]] += 1
    suffix_freqdist[wrd[-2:]] += 1
    suffix_freqdist[wrd[-3:]] += 1


In [9]:
suffix_freqdist

FreqDist({'e': 202946,
          'he': 92084,
          'the': 70026,
          'n': 87889,
          'on': 33382,
          'ton': 1019,
          'y': 59146,
          'ty': 6458,
          'nty': 391,
          'd': 105687,
          'nd': 36418,
          'and': 31057,
          'ry': 7500,
          'ury': 482,
          'id': 4272,
          'aid': 2460,
          'ay': 6482,
          'day': 1613,
          'an': 17650,
          'ion': 14905,
          'f': 43173,
          'of': 72978,
          's': 128722,
          "'s": 5865,
          "a's": 202,
          't': 94459,
          'nt': 13151,
          'ent': 9369,
          'ary': 2122,
          'ed': 41527,
          'ced': 1262,
          '`': 8837,
          '``': 17674,
          'o': 42363,
          'no': 4402,
          'ce': 10953,
          'nce': 5971,
          "'": 10455,
          "''": 17639,
          'at': 25410,
          'hat': 12692,
          'ny': 3437,
          'any': 2793,
          'es': 22408,
  

In [10]:
# the 100 most common suffixes
common_suffixes = [suffix for (suffix, count) in suffix_freqdist.most_common(100)]
print(common_suffixes) 

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


In [30]:
def pos_feature(word):
    feature = {}
    for suffix in common_suffixes:
        feature['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    
    return feature

In [31]:
pos_feature('come')  # the features for the suffixes 'e' and 'me' are True

{"endswith('')": False,
 "endswith(')": False,
 "endswith('s)": False,
 'endswith(()': False,
 'endswith())': False,
 'endswith(,)': False,
 'endswith(--)': False,
 'endswith(.)': False,
 'endswith(:)': False,
 'endswith(;)': False,
 'endswith(?)': False,
 'endswith(`)': False,
 'endswith(``)': False,
 'endswith(a)': False,
 'endswith(ad)': False,
 'endswith(al)': False,
 'endswith(an)': False,
 'endswith(and)': False,
 'endswith(are)': False,
 'endswith(as)': False,
 'endswith(at)': False,
 'endswith(ay)': False,
 'endswith(be)': False,
 'endswith(by)': False,
 'endswith(c)': False,
 'endswith(ce)': False,
 'endswith(ch)': False,
 'endswith(d)': False,
 'endswith(e)': True,
 'endswith(ed)': False,
 'endswith(en)': False,
 'endswith(ent)': False,
 'endswith(er)': False,
 'endswith(ere)': False,
 'endswith(ers)': False,
 'endswith(es)': False,
 'endswith(ey)': False,
 'endswith(f)': False,
 'endswith(for)': False,
 'endswith(g)': False,
 'endswith(h)': False,
 'endswith(had)': False,
 '

In [19]:
brown.tagged_words(categories='news')[:3]

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]

In [38]:
tagged_words = brown.tagged_words(categories='news')[:500]  # use a small slice; training on everything takes too long
feature_set = [(pos_feature(n), g) for (n, g) in tagged_words]  # extract features
train_size = int(len(feature_set) * 0.8)  # split into training and test sets
train_set, test_set = feature_set[:train_size], feature_set[train_size:]
classifier = nltk.DecisionTreeClassifier.train(train_set)  # train a decision tree classifier
nltk.classify.accuracy(classifier, test_set)  # evaluate accuracy

0.63

In [39]:
classifier.classify(pos_feature('cats'))

'NNS'

In [40]:
print(classifier.pseudocode(depth=4))  # the decision process, 4 levels deep

if endswith(the) == False: 
  if endswith(s) == False: 
    if endswith(f) == False: 
      if endswith(``) == False: return 'IN'
      if endswith(``) == True: return '``'
    if endswith(f) == True: return 'IN'
  if endswith(s) == True: 
    if endswith(was) == False: 
      if endswith(is) == False: return 'NNS'
      if endswith(is) == True: return 'BEZ'
    if endswith(was) == True: return 'BEDZ'
if endswith(the) == True: return 'AT'



## Building a regular expression tagger:
Tags are assigned according to the first pattern that matches.

In [42]:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
sent = brown.sents(categories='news')
pattern = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'.*es$', 'VBZ'),   # third-person singular present
    (r'.*ould$', 'MD'),  # modals
    (r'.*\'s$', 'NN$'),  # possessive nouns
    (r'.*s$','NNS'),     # plural nouns
    # numbers; the bracketed tail is likely a typo for (\.[0-9]+)? and is kept
    # because the score below was produced with it (it misses single digits)
    (r'^-?[0-9]+[.[0-9]+]?$', 'CD'),
    (r'.*', "NN")        # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(pattern)
regexp_tagger.tag(sent[3])

[('``', 'NN'),
 ('Only', 'NN'),
 ('a', 'NN'),
 ('relative', 'NN'),
 ('handful', 'NN'),
 ('of', 'NN'),
 ('such', 'NN'),
 ('reports', 'NNS'),
 ('was', 'NNS'),
 ('received', 'VBD'),
 ("''", 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('jury', 'NN'),
 ('said', 'NN'),
 (',', 'NN'),
 ('``', 'NN'),
 ('considering', 'VBG'),
 ('the', 'NN'),
 ('widespread', 'NN'),
 ('interest', 'NN'),
 ('in', 'NN'),
 ('the', 'NN'),
 ('election', 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('number', 'NN'),
 ('of', 'NN'),
 ('voters', 'NNS'),
 ('and', 'NN'),
 ('the', 'NN'),
 ('size', 'NN'),
 ('of', 'NN'),
 ('this', 'NNS'),
 ('city', 'NN'),
 ("''", 'NN'),
 ('.', 'NN')]

In [43]:
regexp_tagger.evaluate(sentences)

0.20012132784374564
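
What a regular expression tagger does can be sketched with the standard `re` module: each word gets the tag of the first pattern it matches. This is an illustration rather than NLTK's code, and the number rule is written as `(\.[0-9]+)?`, which is assumed to be the intent of the pattern list above.

```python
import re

# (pattern, tag) pairs, tried in order; the first match wins.
PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # third-person singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # numbers (assumed intent of the rule)
    (r".*", "NN"),                     # default: noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]


def regexp_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    tagged = []
    for word in words:
        for regex, tag in COMPILED:
            if regex.match(word):
                tagged.append((word, tag))
                break
    return tagged


print(regexp_tag(["voters", "received", "3.5", "7", "jury"]))
```

Because the catch-all `.*` rule comes last, every word receives some tag, which explains both the absence of None and the low accuracy: most words simply fall through to NN.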

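## Building a lookup tagger:
A lookup tagger pairs each word in a list of frequent words with its tag information. Some words are tagged None because they are not in the frequent-word list.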

In [44]:
import nltk
from nltk.corpus import brown
freqd = nltk.FreqDist(brown.words(categories='news'))
cfreqd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
mostfreq_words = freqd.most_common(100)

In [47]:
likelytags = dict((word, cfreqd[word].max()) for (word, _) in mostfreq_words)

In [53]:
baselinetagger = nltk.UnigramTagger(model=likelytags)
baselinetagger.evaluate(brown.tagged_sents())

0.4696708210184018

In [55]:
sent = brown.sents(categories='news')[3]
baselinetagger.tag(sent)

[('``', '``'),
 ('Only', None),
 ('a', 'AT'),
 ('relative', None),
 ('handful', None),
 ('of', 'IN'),
 ('such', None),
 ('reports', None),
 ('was', 'BEDZ'),
 ('received', None),
 ("''", "''"),
 (',', ','),
 ('the', 'AT'),
 ('jury', None),
 ('said', 'VBD'),
 (',', ','),
 ('``', '``'),
 ('considering', None),
 ('the', 'AT'),
 ('widespread', None),
 ('interest', None),
 ('in', 'IN'),
 ('the', 'AT'),
 ('election', None),
 (',', ','),
 ('the', 'AT'),
 ('number', None),
 ('of', 'IN'),
 ('voters', None),
 ('and', 'CC'),
 ('the', 'AT'),
 ('size', None),
 ('of', 'IN'),
 ('this', 'DT'),
 ('city', None),
 ("''", "''"),
 ('.', '.')]
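
The lookup table itself is easy to picture without NLTK. The sketch below builds a word-to-most-likely-tag dictionary for the `n` most frequent words using `collections.Counter`; the helper names and the tiny tagged word list are made up for illustration.

```python
from collections import Counter, defaultdict


def build_lookup(tagged_words, n):
    """Return {word: most likely tag} for the n most frequent words."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {word: tag_counts[word].most_common(1)[0][0]
            for word, _ in word_counts.most_common(n)}


def lookup_tag(words, table):
    """Words missing from the table get None, as in the output above."""
    return [(word, table.get(word)) for word in words]


# Tiny made-up tagged corpus.
tagged_words = [
    ("the", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
    ("election", "NN"), ("of", "IN"), ("the", "AT"), ("of", "IN"),
]
table = build_lookup(tagged_words, n=2)  # keeps only "the" and "of"

print(lookup_tag(["the", "size", "of"], table))
```

Supplying a backoff tagger, as in the bigram example earlier, is the usual way to give the None entries a fallback tag.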

## Movie review sentiment analysis with Naive Bayes

In [58]:
from nltk.corpus import movie_reviews
import random
docs = [(list(movie_reviews.words(fileid)), category) 
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(docs)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
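# note: this takes the first 2000 keys, not the 2000 most frequent words;
# all_words.most_common(2000) would give the latter
word_features = list(all_words)[:2000]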

In [59]:
def doc_features(doc):
    doc_words = set(doc)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in doc_words)
    return features


In [62]:
doc_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': T

In [None]:
>>> featuresets = [(doc_features(d), c) for (d,c) in docs]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.81
>>> classifier.show_most_informative_features(5)

In [63]:
feature_sets = [(doc_features(d), c) for (d,c) in docs]
train_set, test_set = feature_sets[100:], feature_sets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.74

In [64]:
classifier.show_most_informative_features(5)

Most Informative Features
        contains(justin) = True              neg : pos    =      9.0 : 1.0
      contains(illusion) = True              neg : pos    =      8.3 : 1.0
 contains(unimaginative) = True              neg : pos    =      8.3 : 1.0
        contains(shoddy) = True              neg : pos    =      7.0 : 1.0
        contains(suvari) = True              neg : pos    =      7.0 : 1.0


# Metrics based on syntactic matching
Syntactic matching can be done through the chunking task.

 *nltk.chunk.api* identifies chunks and returns a parse tree for a given chunk sequence.

In [67]:
import nltk
from nltk.tree import Tree
print(Tree(1,[2,Tree(3,[4]),5]))

(1 2 (3 4) 5)


In [68]:
ct = Tree('VP', [Tree('V',['gave']),Tree('NP',['her'])])
sent = Tree('S', [Tree('NP',['I']),ct])
print(ct)
print(sent)

(VP (V gave) (NP her))
(S (NP I) (VP (V gave) (NP her)))


In [70]:
print(sent[1])

(VP (V gave) (NP her))


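# Metrics based on shallow semantic matching
WordNet Similarity is commonly used for semantic matching.

Available similarity measures include path distance, Leacock-Chodorow Similarity, and others.

These measures compare the similarity between word senses rather than between words.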
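
As a rough illustration of a path-based measure, the sketch below scores two senses as 1 / (1 + shortest path length) over a toy is-a taxonomy. The taxonomy, sense names and helper functions are hypothetical, and the formula is an assumption about how such a measure is typically defined; WordNet itself is far larger and has its own API.

```python
from collections import deque

# Hypothetical is-a taxonomy: child sense -> parent sense.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def neighbours(node):
    """Senses one is-a link away from node, in either direction."""
    linked = [child for child, parent in HYPERNYM.items() if parent == node]
    if node in HYPERNYM:
        linked.append(HYPERNYM[node])
    return linked


def shortest_path_length(a, b):
    """Breadth-first search over the taxonomy; None if unconnected."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(a, b):
    """1 / (1 + shortest path length); None when no path exists."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)


print(path_similarity("dog", "cat"))     # 4 links apart -> 0.2
print(path_similarity("dog", "canine"))  # 1 link apart  -> 0.5
```

The score is 1.0 for identical senses and shrinks as the senses sit further apart in the hierarchy.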

Shallow semantic analysis also involves named entity recognition and coreference resolution.

http://www.nltk.org/howto/wordnet.html