# Part-of-Speech Tags

## References

* [English Word Classes](https://web.stanford.edu/~jurafsky/slp3/8.pdf), Chapter 8.1, Speech and Language Processing
* [Part-of-Speech Tagging](https://web.stanford.edu/~jurafsky/slp3/8.pdf), Chapter 8.2, Speech and Language Processing

## Contents

* [Part-of-Speech Tagset](#Part-of-Speech-Tagset)
* [Download Data](#Download-Data)
* [Read Data](#Read-Data)
* [Predict Data](#Predict-Data)
  * [Unigram Model](#Unigram-Model)
  * [Bigram Model](#Bigram-Model)
  * [Interpolation](#Interpolation)
  * [NLTK Model](#NLTK-Model)

## Part-of-Speech Tagset

A part-of-speech (POS) is a category to which a word is assigned in accordance with its syntactic functions.

```
John    Noun
is      Verb
a       Determiner
boy     Noun
.       Punctuation
```

The [Penn Treebank](https://www.aclweb.org/anthology/J93-2004/) project defined a fine-grained POS tagset, which the [OntoNotes](https://www.aclweb.org/anthology/W13-3516/) project later extended:

### Words

| Tag | Description | Tag | Description |
|:---|:---|:---|:---|
| `ADD` | Email                                   | `POS` | Possessive ending |
| `AFX` | Affix                                   | `PRP` | Personal pronoun |
| `CC` | Coordinating conjunction                 | `PRP$` | Possessive pronoun  |
| `CD` | Cardinal number                          | `RB` | Adverb |
| `CODE` | Code ID                                | `RBR` | Adverb, comparative |
| `DT` | Determiner                               | `RBS` | Adverb, superlative |
| `EX` | Existential there                        | `RP` | Particle |
| `FW` | Foreign word                             | `TO` | To |
| `GW` | Go with                                  | `UH` | Interjection |
| `IN` | Preposition or subordinating conjunction | `VB` | Verb, base form |
| `JJ` | Adjective                                | `VBD` | Verb, past tense |
| `JJR` | Adjective, comparative                  | `VBG` | Verb, gerund or present participle |
| `JJS` | Adjective, superlative                  | `VBN` | Verb, past participle |
| `LS` | List item marker                         | `VBP` | Verb, non-3rd person singular present |
| `MD` | Modal                                    | `VBZ` | Verb, 3rd person singular present |
| `NN` | Noun, singular or mass                   | `WDT` | *Wh*-determiner |
| `NNS` | Noun, plural                            | `WP` | *Wh*-pronoun |
| `NNP` | Proper noun, singular                   | `WP$` | *Wh*-pronoun, possessive |
| `NNPS` | Proper noun, plural                    | `WRB` | *Wh*-adverb |
| `PDT` | Predeterminer                           | `XX` | Unknown |

### Symbols

| Tag | Description | Tag | Description |
|:---|:---|:---|:---|
| `$` | Dollar | `-LRB-` | Left bracket |
| `:` | Colon | `-RRB-` | Right bracket |
| `,` | Comma | `HYPH` | Hyphen |
| `.` | Period | `NFP` | Superfluous punctuation |
| ` `` ` | Left quote | `SYM` | Symbol |
| `''` | Right quote | `PUNC` | General punctuation |

## Download Data

Retrieve the path to the `cs329` project:

In [1]:
from pathlib import Path

path = Path.cwd()
print(path)

while path.name != 'cs329':
    path = path.parent

print(path, type(path))

/Users/jdchoi/workspace/cs329/doc
/Users/jdchoi/workspace/cs329 <class 'pathlib.PosixPath'>


* [`pathlib`](https://docs.python.org/3/library/pathlib.html)
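
The loop above assumes the notebook runs somewhere under a `cs329` directory; otherwise it never terminates, because the parent of the filesystem root is the root itself. A minimal sketch of a guarded search follows. `find_ancestor` is a hypothetical helper introduced here, not part of the course code:

```python
from pathlib import Path


def find_ancestor(start: Path, name: str) -> Path:
    """Return `start` or its nearest ancestor whose last component is `name`.

    Hypothetical helper for illustration; it only inspects the path string
    and never touches the filesystem.
    """
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    # Fail loudly instead of looping forever at the filesystem root.
    raise FileNotFoundError(f'no directory named {name!r} above {start}')


print(find_ancestor(Path('/tmp/cs329/doc/notes'), 'cs329'))
```

Calling `find_ancestor(Path.cwd(), 'cs329')` would play the role of the `while` loop in the cell above.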

Create the `dat/pos` directory under the `cs329` project:

In [2]:
path /= 'dat/pos'
path.mkdir(parents=True, exist_ok=True)
print(path)

/Users/jdchoi/workspace/cs329/dat/pos


* [`Path.mkdir()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir)

Download the [training set](https://raw.githubusercontent.com/emory-courses/cs329/master/dat/pos/wsj-pos.trn.gold.tsv) and the [development set](https://raw.githubusercontent.com/emory-courses/cs329/master/dat/pos/wsj-pos.dev.gold.tsv) for part-of-speech tagging:

In [11]:
import requests

def download(remote_addr: str, local_addr: Path):
    r = requests.get(remote_addr)
    r.raise_for_status()  # fail early on an HTTP error instead of saving an error page

    with open(local_addr, 'wb') as fout:
        fout.write(r.content)

* [`requests`](https://requests.readthedocs.io/en/master/user/quickstart/)

In [12]:
url = 'https://raw.githubusercontent.com/emory-courses/cs329/master/dat/pos/wsj-pos.{}.gold.tsv'

remote = url.format('trn')
download(remote, path / Path(remote).name)

remote = url.format('dev')
download(remote, path / Path(remote).name)

## Read Data

Retrieve the training data:

In [3]:
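def read_data(filename: str):
    data, sentence = [], []

    # Use a context manager so the file is closed after reading.
    with open(filename) as fin:
        for line in fin:
            l = line.split()
            if l:
                sentence.append((l[0], l[1]))
            else:
                # A blank line marks the end of a sentence.
                data.append(sentence)
                sentence = []

    # Keep the last sentence when the file does not end with a blank line.
    if sentence:
        data.append(sentence)

    return data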

In [4]:
trn_data = read_data(path / 'wsj-pos.trn.gold.tsv')
print(len(trn_data))
print(trn_data[0])

38219
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
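
To make the file format concrete, here is a small sketch that parses an in-memory sample instead of the downloaded file. It assumes what `read_data()` implies: one whitespace-separated `word pos` pair per line and a blank line between sentences. The sample text, its tags, and the name `read_sentences` are made up for illustration:

```python
import io
from typing import Iterable, List, Tuple

# Two toy sentences in the assumed format (illustrative tags).
SAMPLE = "John\tNNP\nis\tVBZ\na\tDT\nboy\tNN\n.\t.\n\nHe\tPRP\nsmiles\tVBZ\n.\t."


def read_sentences(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    data, sentence = [], []
    for line in lines:
        fields = line.split()
        if fields:
            sentence.append((fields[0], fields[1]))
        elif sentence:
            # A blank line closes the current sentence.
            data.append(sentence)
            sentence = []
    if sentence:
        # The sample has no trailing blank line, so flush the last sentence.
        data.append(sentence)
    return data


sentences = read_sentences(io.StringIO(SAMPLE))
print(len(sentences))
print(sentences[1])
```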


Write the function `word_count()` that counts the number of words in the training data:

In [6]:
from typing import List, Tuple

def word_count(data: List[List[Tuple[str, str]]]) -> int:
    """
    :param data: a list of tuple lists where each inner list represents a sentence and every tuple is a (word, pos) pair.
    :return: the total number of words in the data
    """
    return sum([len(sentence) for sentence in data])

In [7]:
print(word_count(trn_data))

912344


## Predict Data

### Unigram Model


Let us write a function `create_uni_pos_dict()` that reads data and returns a dictionary where the key is a word and the value is the list of possible POS tags with probabilities in descending order such that:

$$
P(p|w) = \frac{Count(w,p)}{Count(w)}
$$

In [13]:
from collections import Counter
from typing import Dict

def create_uni_pos_dict(data: List[List[Tuple[str, str]]]) -> Dict[str, List[Tuple[str, float]]]:
    """
    :param data: a list of tuple lists where each inner list represents a sentence and every tuple is a (word, pos) pair.
    :return: a dictionary where the key is a word and the value is the list of possible POS tags with probabilities in descending order.
    """
    model = dict()

    for sentence in data:
        for word, pos in sentence:
            model.setdefault(word, Counter()).update([pos])

    for word, counter in model.items():
        ts = counter.most_common()
        total = sum([count for _, count in ts])
        model[word] = [(pos, count/total) for pos, count in ts]

    return model

* [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter)

In [14]:
c = Counter()
c.update(['A', 'B', 'A', 'C', 'C'])
print(c)
c.update('C')
print(c)
l = [count for _, count in c.most_common()]
print(l)

Counter({'A': 2, 'C': 2, 'B': 1})
Counter({'C': 3, 'A': 2, 'B': 1})
[3, 2, 1]


Create the unigram dictionary from the training data:

In [15]:
uni_pos_dict = create_uni_pos_dict(trn_data)

In [16]:
print(uni_pos_dict['man'])
print(uni_pos_dict['buy'])

[('NN', 0.9714285714285714), ('VB', 0.01904761904761905), ('UH', 0.009523809523809525)]
[('VB', 0.8293216630196937), ('VBP', 0.08971553610503283), ('NN', 0.06564551422319474), ('JJ', 0.015317286652078774)]
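
As a worked instance of $P(p|w) = Count(w,p) / Count(w)$, the sketch below applies the same counting to five hand-made (word, pos) pairs. The toy pairs and the helper name `tag_probs` are assumptions for illustration only:

```python
from collections import Counter

# Toy observations: 'buy' is seen four times, 'man' once.
pairs = [('buy', 'VB'), ('buy', 'VB'), ('buy', 'VBP'), ('buy', 'NN'), ('man', 'NN')]

counts = {}
for word, pos in pairs:
    counts.setdefault(word, Counter())[pos] += 1


def tag_probs(word):
    """Relative frequency of each tag for `word`, most frequent first."""
    counter = counts[word]
    total = sum(counter.values())  # Count(w)
    return [(pos, count / total) for pos, count in counter.most_common()]


print(tag_probs('buy'))
print(tag_probs('man'))
```

Here `VB` receives 2/4 = 0.5 for `buy`, and the probabilities for each word sum to 1.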


Use the unigram dictionary to predict POS tags of words in a sentence:

In [17]:
def predict_uni_pos_dict(uni_pos_dict: Dict[str, List[Tuple[str, float]]], tokens: List[str], pprint=False) -> List[Tuple[str, float]]:
    def predict(token):
        t = uni_pos_dict.get(token, None)
        return t[0] if t else ('XX', 0.0)

    output = [predict(token) for token in tokens]
    if pprint:
        for token, t in zip(tokens, output):
            print('{:<15}{:<8}{:.2f}'.format(token, t[0], t[1]))

    return output

* [Input and Output](https://docs.python.org/3/tutorial/inputoutput.html)
* [`zip()`](https://docs.python.org/3/library/functions.html#zip)

In [18]:
tokens = "I bought a car yesterday that was blue".split()
predict_uni_pos_dict(uni_pos_dict, tokens, True)

I              PRP     0.99
bought         VBD     0.65
a              DT      1.00
car            NN      1.00
yesterday      NN      0.98
that           IN      0.60
was            VBD     1.00
blue           JJ      0.86


[('PRP', 0.9915824915824916),
 ('VBD', 0.6474820143884892),
 ('DT', 0.9987005955603682),
 ('NN', 1.0),
 ('NN', 0.9813432835820896),
 ('IN', 0.6039103975139195),
 ('VBD', 1.0),
 ('JJ', 0.8571428571428571)]

In [19]:
tokens = "Dr. Choi has a good wifi connection from Emory".split()
predict_uni_pos_dict(uni_pos_dict, tokens, True)

Dr.            NNP     1.00
Choi           XX      0.00
has            VBZ     1.00
a              DT      1.00
good           JJ      0.96
wifi           XX      0.00
connection     NN      1.00
from           IN      1.00
Emory          XX      0.00


[('NNP', 1.0),
 ('XX', 0.0),
 ('VBZ', 0.9993718592964824),
 ('DT', 0.9987005955603682),
 ('JJ', 0.9585798816568047),
 ('XX', 0.0),
 ('NN', 1.0),
 ('IN', 1.0),
 ('XX', 0.0)]
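
Every `XX` above comes from a word that never occurs in the training data, so the share of such words in a text bounds what the unigram model can get right. A minimal sketch of measuring that share follows; the small `vocab` set stands in for the keys of `uni_pos_dict`, and `oov_rate` is a hypothetical helper:

```python
def oov_rate(vocab, tokens):
    """Return the fraction of tokens missing from `vocab` and the missing tokens."""
    unknown = [t for t in tokens if t not in vocab]
    return len(unknown) / len(tokens), unknown


# Stand-in vocabulary; with the real model this would be uni_pos_dict.keys().
vocab = {'Dr.', 'has', 'a', 'good', 'connection', 'from'}
tokens = "Dr. Choi has a good wifi connection from Emory".split()

rate, unknown = oov_rate(vocab, tokens)
print(unknown, round(rate, 2))
```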

Let us write the function `evaluate_uni_pos()` that estimates the accuracy of the unigram model:

In [20]:
def evaluate_uni_pos(uni_pos_dict: Dict[str, List[Tuple[str, float]]], data: List[List[Tuple[str, str]]]):
    total, correct = 0, 0
    for sentence in data:
        tokens, gold = tuple(zip(*sentence))
        pred = [t[0] for t in predict_uni_pos_dict(uni_pos_dict, tokens)]
        total += len(tokens)
        correct += len([1 for g, p in zip(gold, pred) if g == p])
    print('{:5.2f}% ({}/{})'.format(100.0 * correct / total, correct, total))

In [21]:
dev_data = read_data(path / 'wsj-pos.dev.gold.tsv')
evaluate_uni_pos(uni_pos_dict, dev_data)

90.88% (119754/131768)
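
The evaluation relies on `zip(*sentence)` to split a sentence into parallel tuples of tokens and gold tags. The sketch below shows that idiom and the accuracy arithmetic on one toy sentence; the predicted tags are invented so that exactly one of them is wrong:

```python
sentence = [('I', 'PRP'), ('bought', 'VBD'), ('a', 'DT'), ('car', 'NN')]

# zip(*sentence) transposes the (word, pos) pairs into two parallel tuples.
tokens, gold = zip(*sentence)

# Hypothetical predictions with one error on the last token.
pred = ('PRP', 'VBD', 'DT', 'VB')

correct = sum(g == p for g, p in zip(gold, pred))
accuracy = 100.0 * correct / len(gold)
print(tokens)
print('{:5.2f}% ({}/{})'.format(accuracy, correct, len(gold)))
```

Note that the unpacking raises a `ValueError` for an empty sentence, so the data should not contain empty lists.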


### Bigram Model

#### Exercise

Write a function `create_bi_pos_dict()` that reads data and returns a dictionary where the key is the previous POS tag and the value is the list of possible POS tags with probabilities in descending order such that:

$$
P(p_i|p_{i-1}) = \frac{Count(p_{i-1}, p_i)}{Count(p_{i-1})}
$$

In [27]:
from typing import Any
PREV_DUMMY = '!@#$'

def to_probs(model: Dict[Any, Counter]):
    for feature, counter in model.items():
        ts = counter.most_common()
        total = sum([count for _, count in ts])
        model[feature] = [(pos, count/total) for pos, count in ts]
    return model

def create_bi_pos_dict(data: List[List[Tuple[str, str]]]) -> Dict[str, List[Tuple[str, float]]]:
    """
    :param data: a list of tuple lists where each inner list represents a sentence and every tuple is a (word, pos) pair.
    :return: a dictionary where the key is the previous POS tag and the value is the list of possible POS tags with probabilities in descending order.
    """
    model = dict()

    for sentence in data:
        for i, (_, curr_pos) in enumerate(sentence):
            prev_pos = sentence[i-1][1] if i > 0 else PREV_DUMMY
            model.setdefault(prev_pos, Counter()).update([curr_pos])

    return to_probs(model)

Create the bigram dictionary from the training data:

In [28]:
bi_pos_dict = create_bi_pos_dict(trn_data)
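
To see what the transition estimates look like, the following sketch counts tag bigrams over three toy sentences, using the same dummy tag for the sentence start. The sentences are invented, and the nested-dictionary layout of `probs` is a simplification of the sorted lists returned by `to_probs()`:

```python
from collections import Counter

PREV_DUMMY = '!@#$'  # same sentence-start marker as in the cell above

tagged = [
    [('John', 'NNP'), ('runs', 'VBZ')],
    [('the', 'DT'), ('man', 'NN'), ('runs', 'VBZ')],
    [('John', 'NNP'), ('is', 'VBZ')],
]

transitions = {}
for sent in tagged:
    prev = PREV_DUMMY
    for _, pos in sent:
        transitions.setdefault(prev, Counter())[pos] += 1
        prev = pos

# Normalize each row: Count(prev, pos) / Count(prev).
probs = {
    prev: {pos: c / sum(counter.values()) for pos, c in counter.items()}
    for prev, counter in transitions.items()
}
print(probs[PREV_DUMMY])
print(probs['NNP'])
```

Two of the three sentences start with `NNP`, so the sentence-start row assigns it 2/3, and `NNP` is always followed by `VBZ` in this toy data.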

Use both the unigram and bigram dictionaries to predict POS tags of words in a sentence:

In [29]:
def predict_bi_pos_dict(uni_pos_dict: Dict[str, List[Tuple[str, float]]], bi_pos_dict: Dict[str, List[Tuple[str, float]]], tokens: List[str]) -> List[Tuple[str, float]]:
    output = []
    
    for i in range(len(tokens)):
        pos = uni_pos_dict.get(tokens[i], None)
        if pos is None:
            pos = bi_pos_dict.get(output[i-1][0] if i > 0 else PREV_DUMMY, None)
        output.append(pos[0] if pos else ('XX', 0.0))

    return output
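
The function above consults the bigram dictionary only when the word is unknown. A compact sketch of that fallback on hand-made dictionaries follows; `uni`, `bi`, their probabilities, and the made-up word `wug` are all assumptions for illustration:

```python
PREV_DUMMY = '!@#$'

# Toy models: word -> tags, and previous tag -> tags, most probable first.
uni = {'the': [('DT', 1.0)], 'runs': [('VBZ', 0.9), ('NNS', 0.1)]}
bi = {PREV_DUMMY: [('DT', 0.6), ('NNP', 0.4)], 'DT': [('NN', 0.7), ('JJ', 0.3)]}


def tag(tokens):
    output = []
    for i, token in enumerate(tokens):
        candidates = uni.get(token)
        if candidates is None:
            # Unknown word: back off to the tag most likely after the previous prediction.
            prev = output[i - 1][0] if i > 0 else PREV_DUMMY
            candidates = bi.get(prev)
        output.append(candidates[0] if candidates else ('XX', 0.0))
    return output


print(tag(['the', 'wug', 'runs']))
```

The unknown `wug` is tagged `NN` because `NN` is the most likely tag after `DT` in the toy bigram table; known words never use the bigram table.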

Let us write the function `evaluate_bi_pos()` that estimates the accuracy of the combined unigram and bigram model:

In [30]:
def evaluate_bi_pos(uni_pos_dict: Dict[str, List[Tuple[str, float]]], bi_pos_dict: Dict[str, List[Tuple[str, float]]], data: List[List[Tuple[str, str]]]):
    total, correct = 0, 0
    for sentence in data:
        tokens, gold = tuple(zip(*sentence))
        pred = [t[0] for t in predict_bi_pos_dict(uni_pos_dict, bi_pos_dict, tokens)]
        total += len(tokens)
        correct += len([1 for g, p in zip(gold, pred) if g == p])
    print('{:5.2f}% ({}/{})'.format(100.0 * correct / total, correct, total))

In [32]:
evaluate_bi_pos(uni_pos_dict, bi_pos_dict, dev_data)

92.01% (121234/131768)


### Interpolation

Let us write the following functions:

* `create_bi_wp_dict()` that estimates $P(p_i|w_{i-1})$.
* `create_bi_wn_dict()` that estimates $P(p_i|w_{i+1})$.

In [33]:
def create_bi_wp_dict(data: List[List[Tuple[str, str]]]) -> Dict[str, List[Tuple[str, float]]]:
    """
    :param data: a list of tuple lists where each inner list represents a sentence and every tuple is a (word, pos) pair.
    :return: a dictionary where the key is the previous word and the value is the list of possible POS tags with probabilities in descending order.
    """
    model = dict()

    for sentence in data:
        for i, (_, curr_pos) in enumerate(sentence):
            prev_word = sentence[i-1][0] if i > 0 else PREV_DUMMY
            model.setdefault(prev_word, Counter()).update([curr_pos])

    return to_probs(model)


def create_bi_wn_dict(data: List[List[Tuple[str, str]]]) -> Dict[str, List[Tuple[str, float]]]:
    """
    :param data: a list of tuple lists where each inner list represents a sentence and every tuple is a (word, pos) pair.
    :return: a dictionary where the key is the next word and the value is the list of possible POS tags with probabilities in descending order.
    """
    model = dict()

    for sentence in data:
        for i, (_, curr_pos) in enumerate(sentence):
            next_word = sentence[i+1][0] if i+1 < len(sentence) else PREV_DUMMY
            model.setdefault(next_word, Counter()).update([curr_pos])

    return to_probs(model)

Create the two word-based bigram dictionaries from the training data:

In [34]:
bi_wp_dict = create_bi_wp_dict(trn_data)
bi_wn_dict = create_bi_wn_dict(trn_data)
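
The four dictionaries can be combined by linear interpolation. As a sketch of the scoring rule that the prediction function below implements, write $\lambda_1, \ldots, \lambda_4$ (notation introduced here) for `uni_pos_weight`, `bi_pos_weight`, `bi_wp_weight`, and `bi_wn_weight`:

$$
score(p) = \lambda_1 P(p|w_i) + \lambda_2 P(p|p_{i-1}) + \lambda_3 P(p|w_{i-1}) + \lambda_4 P(p|w_{i+1})
$$

The predicted tag is the $p$ with the highest score, where $p_{i-1}$ is the previously predicted tag rather than the gold tag. A dictionary with no entry for its context contributes nothing to the sum, and `XX` is returned only when all four are missing.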

In [42]:
def predict_interpolation(
        uni_pos_dict: Dict[str, List[Tuple[str, float]]],
        bi_pos_dict: Dict[str, List[Tuple[str, float]]],
        bi_wp_dict: Dict[str, List[Tuple[str, float]]],
        bi_wn_dict: Dict[str, List[Tuple[str, float]]],
        uni_pos_weight: float,
        bi_pos_weight: float,
        bi_wp_weight: float,
        bi_wn_weight: float,
        tokens: List[str]) -> List[Tuple[str, float]]:
    output = []

    for i in range(len(tokens)):
        # Weighted sum of the probabilities that each model assigns to every candidate tag.
        scores = dict()
        curr_word = tokens[i]
        prev_pos = output[i-1][0] if i > 0 else PREV_DUMMY
        prev_word = tokens[i-1] if i > 0 else PREV_DUMMY
        next_word = tokens[i+1] if i+1 < len(tokens) else PREV_DUMMY

        for pos, prob in uni_pos_dict.get(curr_word, []):
            scores[pos] = scores.get(pos, 0) + prob * uni_pos_weight

        for pos, prob in bi_pos_dict.get(prev_pos, []):
            scores[pos] = scores.get(pos, 0) + prob * bi_pos_weight

        for pos, prob in bi_wp_dict.get(prev_word, []):
            scores[pos] = scores.get(pos, 0) + prob * bi_wp_weight

        for pos, prob in bi_wn_dict.get(next_word, []):
            scores[pos] = scores.get(pos, 0) + prob * bi_wn_weight

        o = max(scores.items(), key=lambda t: t[1]) if scores else ('XX', 0.0)
        output.append(o)

    return output

In [68]:
def evaluate_interpolation(
        uni_pos_dict: Dict[str, List[Tuple[str, float]]],
        bi_pos_dict: Dict[str, List[Tuple[str, float]]],
        bi_wp_dict: Dict[str, List[Tuple[str, float]]],
        bi_wn_dict: Dict[str, List[Tuple[str, float]]],
        uni_pos_weight: float,
        bi_pos_weight: float,
        bi_wp_weight: float,
        bi_wn_weight: float,
        data: List[List[Tuple[str, str]]],
        pprint=False):
    total, correct = 0, 0
    for sentence in data:
        tokens, gold = tuple(zip(*sentence))
        pred = [t[0] for t in predict_interpolation(uni_pos_dict, bi_pos_dict, bi_wp_dict, bi_wn_dict, uni_pos_weight, bi_pos_weight, bi_wp_weight, bi_wn_weight, tokens)]
        total += len(tokens)
        correct += len([1 for g, p in zip(gold, pred) if g == p])
        
    accuracy = 100.0 * correct / total
    print('{:5.2f}% - uni_pos: {:3.1f}, bi_pos: {:3.1f}, bi_wp: {:3.1f}, bi_np: {:3.1f}'.format(accuracy, uni_pos_weight, bi_pos_weight, bi_wp_weight, bi_wn_weight))
    return accuracy

In [69]:
uni_pos_weight = 1.0
bi_pos_weight = 1.0
bi_wp_weight = 1.0
bi_wn_weight = 1.0
evaluate_interpolation(uni_pos_dict, bi_pos_dict, bi_wp_dict, bi_wn_dict, uni_pos_weight, bi_pos_weight, bi_wp_weight, bi_wn_weight, dev_data, True)

91.25% - uni_pos: 1.0, bi_pos: 1.0, bi_wp: 1.0, bi_np: 1.0


91.25129014631777
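
Because the prediction is an argmax over a weighted sum, multiplying all four weights by the same positive constant should not change the predicted tags (barring floating-point ties); only the ratios between the weights matter. The sketch below illustrates this on invented distributions, with `best_tag` as a hypothetical helper that mirrors the scoring loop:

```python
def best_tag(dists, weights):
    """Return the tag with the highest weighted sum of probabilities."""
    scores = {}
    for dist, weight in zip(dists, weights):
        for pos, prob in dist:
            scores[pos] = scores.get(pos, 0.0) + prob * weight
    return max(scores.items(), key=lambda t: t[1])[0]


# Invented distributions standing in for the four dictionaries' entries.
dists = [
    [('NN', 0.6), ('VB', 0.4)],  # P(p | w_i)
    [('VB', 0.7), ('NN', 0.3)],  # P(p | p_{i-1})
    [('NN', 0.5), ('JJ', 0.5)],  # P(p | w_{i-1})
    [('VB', 0.8), ('NN', 0.2)],  # P(p | w_{i+1})
]

print(best_tag(dists, [1.0, 1.0, 1.0, 1.0]))  # equal weights
print(best_tag(dists, [0.1, 0.1, 0.1, 0.1]))  # same ratios, smaller scale
print(best_tag(dists, [1.0, 0.1, 0.1, 0.1]))  # unigram evidence dominates
```

With equal weights the context models outvote the unigram model in this toy case (`VB`), while raising the relative unigram weight flips the decision to `NN`.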

In [70]:
grid = [0.1, 0.5, 1.0]
best = (0, None)
worst = (100, None)

for uni_pos_weight in grid:
    for bi_pos_weight in grid:
        for bi_wp_weight in grid:
            for bi_wn_weight in grid:
                acc = evaluate_interpolation(uni_pos_dict, bi_pos_dict, bi_wp_dict, bi_wn_dict, uni_pos_weight, bi_pos_weight, bi_wp_weight, bi_wn_weight, dev_data)
                if acc > best[0]: best = (acc, uni_pos_weight, bi_pos_weight, bi_wp_weight, bi_wn_weight)
                if acc < worst[0]: worst = (acc, uni_pos_weight, bi_pos_weight, bi_wp_weight, bi_wn_weight)

print('==========================================================')
print('Best : {:5.2f}% - uni_pos: {:3.1f}, bi_pos: {:3.1f}, bi_wp: {:3.1f}, bi_np: {:3.1f}'.format(*best))
print('Worst: {:5.2f}% - uni_pos: {:3.1f}, bi_pos: {:3.1f}, bi_wp: {:3.1f}, bi_np: {:3.1f}'.format(*worst))

91.25% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 0.1, bi_np: 0.1
67.53% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 0.1, bi_np: 0.5
53.88% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 0.1, bi_np: 1.0
67.32% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 0.5, bi_np: 0.1
65.99% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 0.5, bi_np: 0.5
58.57% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 0.5, bi_np: 1.0
54.19% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 1.0, bi_np: 0.1
59.34% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 1.0, bi_np: 0.5
58.84% - uni_pos: 0.1, bi_pos: 0.1, bi_wp: 1.0, bi_np: 1.0
53.94% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 0.1, bi_np: 0.1
62.74% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 0.1, bi_np: 0.5
55.54% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 0.1, bi_np: 1.0
53.08% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 0.5, bi_np: 0.1
61.05% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 0.5, bi_np: 0.5
58.78% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 0.5, bi_np: 1.0
47.89% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 1.0, bi_np: 0.1
56.09% - uni_pos: 0.1, bi_pos: 0.5, bi_wp: 1.0, bi_np: 0
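
The four nested loops enumerate every combination of the grid values. As a sketch, the same search can be written with `itertools.product`. The function `toy_accuracy` below is a made-up stand-in for `evaluate_interpolation()` so that the example runs without the data, and its numbers are not real results:

```python
from itertools import product


def toy_accuracy(w_uni, w_bi, w_wp, w_wn):
    # Hypothetical scoring function, NOT the real evaluation:
    # it simply rewards a large unigram weight relative to the others.
    return 100.0 * w_uni / (w_uni + w_bi + w_wp + w_wn)


grid = [0.1, 0.5, 1.0]

# product(grid, repeat=4) yields all 3^4 = 81 weight combinations.
results = [(toy_accuracy(*weights), weights) for weights in product(grid, repeat=4)]
best = max(results)
worst = min(results)

print('Best : {:5.2f}% - {}'.format(*best))
print('Worst: {:5.2f}% - {}'.format(*worst))
```

Collecting all results before taking `max` and `min` also avoids the `(0, None)` placeholders, which would not fit the final format strings if no combination replaced them.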

### NLTK Model

NLTK provides a POS tagger that takes a list of tokens and predicts the POS tags of those tokens:

In [107]:
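import ssl

# Workaround for SSL certificate errors when downloading NLTK data.
# It disables certificate verification, so use it only on a trusted network.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context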

In [108]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/jdchoi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jdchoi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [109]:
tokens = nltk.word_tokenize("I bought a car yesterday that was blue.")
print(tokens)

['I', 'bought', 'a', 'car', 'yesterday', 'that', 'was', 'blue', '.']


In [110]:
nltk.pos_tag(tokens)

[('I', 'PRP'),
 ('bought', 'VBD'),
 ('a', 'DT'),
 ('car', 'NN'),
 ('yesterday', 'NN'),
 ('that', 'WDT'),
 ('was', 'VBD'),
 ('blue', 'JJ'),
 ('.', '.')]

Let us write the function `evaluate_nltk()` that estimates the accuracy of the NLTK model:

In [111]:
def evaluate_nltk(data: List[List[Tuple[str, str]]]):
    total, correct = 0, 0
    for sentence in data:
        tokens, gold = tuple(zip(*sentence))
        pred = [pos for token, pos in nltk.pos_tag(tokens)]
        total += len(tokens)
        correct += len([1 for g, p in zip(gold, pred) if g == p])
    print('{:5.2f}% ({}/{})'.format(100.0 * correct / total, correct, total))

In [112]:
evaluate_nltk(dev_data)

96.14% (126685/131768)
