# Solutions for Sequence Labeling: Part-of-Speech Tagging
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

## Lab Exercise: Comparative Evaluation of NLTK Taggers

Experiment with different taggers provided in NLTK (e.g. `NgramTagger`).

- Train and evaluate taggers provided by NLTK
    - experiment with different tagger parameters
    - some of them have *cut-off*

- For each tagger, report the evaluation accuracy

- Evaluate `spacy` POS-tags on the same test set
    - create mapping from spacy to NLTK POS-tags
    - convert output to the required format (see format above)
        - flatten into a list
    - evaluate using `accuracy` from `nltk.metrics` 
        - [link](https://www.nltk.org/_modules/nltk/metrics/scores.html#accuracy)
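The training and parameter bullets above can be sketched as follows. This is a minimal illustration rather than the reference solution: `toy_train`, `toy_test`, `build_tagger` and `evaluate_tagger` are hypothetical names, and the toy sentences only stand in for the treebank splits. As far as the NLTK implementation goes, a context is kept only if its most frequent tag was seen more than `cutoff` times.

```python
from nltk.metrics import accuracy
from nltk.tag import DefaultTagger, NgramTagger

# Hypothetical toy data standing in for the real training/test splits.
toy_train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
toy_test = [[("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")]]


def build_tagger(train_sents, max_n, cutoff):
    """Chain 1..max_n-gram taggers, each backing off to the previous one."""
    tagger = DefaultTagger("NOUN")  # last resort: tag everything as NOUN
    for n in range(1, max_n + 1):
        tagger = NgramTagger(n, train=train_sents, backoff=tagger, cutoff=cutoff)
    return tagger


def evaluate_tagger(tagger, tagged_sents):
    """Re-tag the raw words and compare (word, tag) pairs with the reference."""
    ref = [pair for sent in tagged_sents for pair in sent]
    hyp = [pair for sent in tagged_sents
           for pair in tagger.tag([word for word, _ in sent])]
    return accuracy(ref, hyp)


results = {}
for cutoff in (0, 1):
    tagger = build_tagger(toy_train, max_n=2, cutoff=cutoff)
    results[cutoff] = evaluate_tagger(tagger, toy_test)
    print("cutoff={}: accuracy={:.3f}".format(cutoff, results[cutoff]))
```

On real data the same loop can be run over several values of `max_n` and `cutoff` to compare taggers; on a corpus this small a higher cut-off simply throws away most of what was learned.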

## Solution
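The challenging part of the exercise is the evaluation of `spacy`: its tokenizer may split the text differently from the treebank, so the predicted tags would not line up with the reference tokens. The easiest solution is to disable `spacy` tokenization and split on whitespace instead.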

In [1]:
import spacy
from spacy.tokenizer import Tokenizer
import en_core_web_sm
spacy_nlp = en_core_web_sm.load()

# to use white space tokenization (generally a bad idea for unknown data)
spacy_nlp.tokenizer = Tokenizer(spacy_nlp.vocab)
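To see why tokenization matters for the evaluation, here is a small pure-Python sketch. The regular expression is a hypothetical stand-in for "some tokenizer with different conventions"; it does not reproduce the actual rules of `spacy`.

```python
import re

# Gold tokens taken from the treebank sentence used in this notebook.
gold_tokens = ["on", "three-month", "Treasury", "bills"]
text = " ".join(gold_tokens)

# Whitespace tokenization reproduces the gold tokens exactly.
whitespace_tokens = text.split(" ")

# Hypothetical tokenizer that separates punctuation (including hyphens).
splitting_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens == gold_tokens)
print(len(gold_tokens), len(splitting_tokens))
```

Once the token counts differ, a position-by-position comparison of tags is no longer meaningful, which is why the tokenizer is replaced before tagging.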

Mapping from `spacy` to `NLTK` POS-tags. Only the tags that differ need an entry, since the rest are the same in both tagsets.

In [16]:
mapping = {
    "PROPN": "NOUN", "INTJ": "NOUN", 
    "AUX": "VERB", 
    "PART": "PRT",
    "CCONJ": "CONJ", "SCONJ": "CONJ",
    "SYM": "X",
    "PUNCT": "."
}
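A quick sanity check for such a mapping is sketched below. The names `spacy_to_nltk`, `to_universal`, `UD_TAGS` and `NLTK_UNIVERSAL` are introduced here for illustration; the dictionary is a copy of the one above, and the two tag lists are the 17 Universal Dependencies tags and the 12 tags of the NLTK `universal` tagset.

```python
# Copy of the mapping defined above.
spacy_to_nltk = {
    "PROPN": "NOUN", "INTJ": "NOUN",
    "AUX": "VERB",
    "PART": "PRT",
    "CCONJ": "CONJ", "SCONJ": "CONJ",
    "SYM": "X",
    "PUNCT": ".",
}

UD_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
           "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
NLTK_UNIVERSAL = {"ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN", "NUM",
                  "PRT", "PRON", "VERB", ".", "X"}


def to_universal(spacy_tag):
    """Return the NLTK tag; tags without an entry pass through unchanged."""
    return spacy_to_nltk.get(spacy_tag, spacy_tag)


converted = [to_universal(t) for t in ["DET", "PROPN", "AUX", "ADJ", "PUNCT"]]
print(converted)

# Any tag listed here would never match a reference tag.
unmapped = [t for t in UD_TAGS if to_universal(t) not in NLTK_UNIVERSAL]
print(unmapped)
```

An empty `unmapped` list only shows that every tag lands somewhere in the target tagset; it does not show that each individual choice agrees with how the treebank annotates that category.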

__DATA SPLIT__ (copied)

In [5]:
# Prepare Training & Test Splits as 80%/20%
import math
from nltk.corpus import treebank

total_size = len(treebank.tagged_sents())
train_indx = math.ceil(total_size * 0.8)
trn_data = treebank.tagged_sents(tagset='universal')[:train_indx]
tst_data = treebank.tagged_sents(tagset='universal')[train_indx:]

print("Total: {}; Train: {}; Test: {}".format(total_size, len(trn_data), len(tst_data)))


Total: 3914; Train: 3132; Test: 782


__TAGGING with SPACY__

In [7]:
print(tst_data[0])

[('The', 'DET'), ('discount', 'NOUN'), ('rate', 'NOUN'), ('on', 'ADP'), ('three-month', 'ADJ'), ('Treasury', 'NOUN'), ('bills', 'NOUN'), ('was', 'VERB'), ('essentially', 'ADV'), ('unchanged', 'ADJ'), ('at', 'ADP'), ('7.79', 'NUM'), ('%', 'NOUN'), (',', '.'), ('while', 'ADP'), ('the', 'DET'), ('rate', 'NOUN'), ('on', 'ADP'), ('six-month', 'ADJ'), ('bills', 'NOUN'), ('was', 'VERB'), ('slightly', 'ADV'), ('lower', 'ADJ'), ('at', 'ADP'), ('7.52', 'NUM'), ('%', 'NOUN'), ('compared', 'VERB'), ('with', 'ADP'), ('7.60', 'NUM'), ('%', 'NOUN'), ('Tuesday', 'NOUN'), ('.', '.')]


In [17]:
spacy_result = []
for sent in treebank.sents()[train_indx:]:
    sent_doc = spacy_nlp(" ".join(sent))
    # we use mapping here to replace spacy tags with NLTK tags
    spacy_result.append([(t.text, mapping.get(t.pos_, t.pos_)) for t in sent_doc])

__EVALUATION__
1. flatten both hypotheses and references
2. evaluate using `accuracy`

In [18]:
flat_ref = [element for sublist in tst_data for element in sublist]
flat_hyp = [element for sublist in spacy_result for element in sublist]
print(len(flat_ref))
print(len(flat_hyp))

20015
20015


In [19]:
from nltk.metrics import accuracy
accuracy(flat_ref, flat_hyp)

0.8596552585560829
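For reference, the two evaluation steps can be reproduced in plain Python. This is a sketch that mirrors what `accuracy` computes (the share of positions where the two lists hold equal items); `flatten`, `token_accuracy` and the toy sentences are hypothetical and not part of NLTK or the data.

```python
from itertools import chain


def flatten(sents):
    """Turn a list of sentences into one flat list of (word, tag) pairs."""
    return list(chain.from_iterable(sents))


def token_accuracy(reference, hypothesis):
    """Fraction of positions where reference and hypothesis items are equal."""
    if len(reference) != len(hypothesis):
        raise ValueError("Lists must have the same length.")
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(hypothesis)


toy_ref = [[("The", "DET"), ("rate", "NOUN")], [("was", "VERB"), ("lower", "ADJ")]]
toy_hyp = [[("The", "DET"), ("rate", "VERB")], [("was", "VERB"), ("lower", "ADJ")]]

score = token_accuracy(flatten(toy_ref), flatten(toy_hyp))
print(score)
```

Because whole `(word, tag)` tuples are compared, a token whose text differs between reference and hypothesis counts as an error even when the tags agree.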

Performance is quite low, given the task.
The reasons for the difference, besides tag differences that we might have missed (e.g. `$` is `.` in the reference, but is tagged as `X` by `spacy`), are:

- the "special" replacements in the `treebank`, such as `-LRB-` and `-RRB-` instead of `(` and `)`, and the like;

- the "traces" that are present in the sentences ("words" that start with `*`, as well as `0`), which add further complexity.
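One way to surface mismatches like these is to count the disagreeing tag pairs per word. The sketch below uses a hypothetical helper, `confusion_counts`, and the hypothesis tags in the toy data are made up for illustration; they are not actual `spacy` output.

```python
from collections import Counter


def confusion_counts(reference, hypothesis):
    """Count (reference tag, hypothesis tag, word) triples that disagree."""
    errors = Counter()
    for (ref_word, ref_tag), (_, hyp_tag) in zip(reference, hypothesis):
        if ref_tag != hyp_tag:
            errors[(ref_tag, hyp_tag, ref_word)] += 1
    return errors


# Toy flat lists; the hypothesis tags are invented.
toy_ref = [("$", "."), ("-LRB-", "."), ("*", "X"),
           ("0", "X"), ("rate", "NOUN"), ("$", ".")]
toy_hyp = [("$", "X"), ("-LRB-", "NOUN"), ("*", "."),
           ("0", "NUM"), ("rate", "NOUN"), ("$", "X")]

errors = confusion_counts(toy_ref, toy_hyp)
for (ref_tag, hyp_tag, word), count in errors.most_common(3):
    print("{!r}: reference {} vs. hypothesis {} ({}x)".format(
        word, ref_tag, hyp_tag, count))
```

Applied to `flat_ref` and `flat_hyp`, the most frequent entries should point to the words that deserve an explicit replacement rule.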

In [23]:
# let's do the replacements w.r.t words
word_mapping = {"$": ".", "-LRB-": ".", "-RRB-": ".", "0": "X"}

flat_hyp_pp = [(w, word_mapping.get(w, t)) for w, t in flat_hyp]
# replace traces
flat_hyp_pp = [(w, ("X" if w.startswith('*') else t)) for w, t in flat_hyp_pp]

In [24]:
accuracy(flat_ref, flat_hyp_pp)

0.9408443667249563