# Part-of-Speech Tags

## References

* [English Word Classes](https://web.stanford.edu/~jurafsky/slp3/8.pdf), Chapter 8.1, Speech and Language Processing
* [Part-of-Speech Tagging](https://web.stanford.edu/~jurafsky/slp3/8.pdf), Chapter 8.2, Speech and Language Processing

## Contents

* [Part-of-Speech Tagset](#Part-of-Speech-Tagset)
* [Download Data](#Download-Data)
* [Read Data](#Read-Data)
* [Predict Data](#Predict-Data)

In [1]:
# to left-align the tables below
from IPython.display import display, HTML
display(HTML("<style>table {margin-left: 0 !important;}</style>"))

## Part-of-Speech Tagset

A part of speech (POS) is the category assigned to a word according to its syntactic function in a sentence. For example:

```
John    Noun
is      Verb
a       Determiner
boy     Noun
.       Punctuation
```

The [Penn Treebank](https://www.aclweb.org/anthology/J93-2004/) project defined a fine-grained POS tagset, which was later extended by the [OntoNotes](https://www.aclweb.org/anthology/W13-3516/) project:

### Words

| Tag | Description | Tag | Description |
|:---|:---|:---|:---|
| `ADD` | Email                                   | `POS` | Possessive ending |
| `AFX` | Affix                                   | `PRP` | Personal pronoun |
| `CC` | Coordinating conjunction                 | `PRP$` | Possessive pronoun  |
| `CD` | Cardinal number                          | `RB` | Adverb |
| `CODE` | Code ID                                | `RBR` | Adverb, comparative |
| `DT` | Determiner                               | `RBS` | Adverb, superlative |
| `EX` | Existential there                        | `RP` | Particle |
| `FW` | Foreign word                             | `TO` | To |
| `GW` | Go with                                  | `UH` | Interjection |
| `IN` | Preposition or subordinating conjunction | `VB` | Verb, base form |
| `JJ` | Adjective                                | `VBD` | Verb, past tense |
| `JJR` | Adjective, comparative                  | `VBG` | Verb, gerund or present participle |
| `JJS` | Adjective, superlative                  | `VBN` | Verb, past participle |
| `LS` | List item marker                         | `VBP` | Verb, non-3rd person singular present |
| `MD` | Modal                                    | `VBZ` | Verb, 3rd person singular present |
| `NN` | Noun, singular or mass                   | `WDT` | *Wh*-determiner |
| `NNS` | Noun, plural                            | `WP` | *Wh*-pronoun |
| `NNP` | Proper noun, singular                   | `WP$` | *Wh*-pronoun, possessive |
| `NNPS` | Proper noun, plural                    | `WRB` | *Wh*-adverb |
| `PDT` | Predeterminer                           | `XX` | Unknown |

### Symbols

| Tag | Description | Tag | Description |
|:---|:---|:---|:---|
| `$` | Dollar | `-LRB-` | Left bracket |
| `:` | Colon | `-RRB-` | Right bracket |
| `,` | Comma | `HYPH` | Hyphen |
| `.` | Period | `NFP` | Superfluous punctuation |
| ` `` ` | Left quote | `SYM` | Symbol |
| `''` | Right quote | `PUNC` | General punctuation |
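
The coarse classes in the opening example (Noun, Verb, ...) can be related to the fine-grained tags above by their prefixes. The following is a minimal sketch, not part of the original notebook: the helper `coarse_tag()` and the `COARSE_PREFIXES` mapping are hypothetical names introduced only for illustration, and they cover just the four open word classes.

```python
# Hypothetical helper (an assumption for illustration, not part of any tagset standard):
# collapse fine-grained tags into coarse word classes by their prefix.
COARSE_PREFIXES = {
    'NN': 'Noun',       # NN, NNS, NNP, NNPS
    'VB': 'Verb',       # VB, VBD, VBG, VBN, VBP, VBZ
    'JJ': 'Adjective',  # JJ, JJR, JJS
    'RB': 'Adverb',     # RB, RBR, RBS
}

def coarse_tag(tag: str) -> str:
    """Return a coarse class for a fine-grained tag, or the tag itself if no prefix matches."""
    for prefix, coarse in COARSE_PREFIXES.items():
        if tag.startswith(prefix):
            return coarse
    return tag

sentence = [('John', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('boy', 'NN'), ('.', '.')]
print([(word, coarse_tag(tag)) for word, tag in sentence])
```

Tags without a matching prefix, such as `DT` or `-RRB-`, are returned unchanged.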

## Download Data

Retrieve the path to the `cs329` project:

In [14]:
from pathlib import Path

path = Path.cwd()

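# walk up the directory tree until the `cs329` project root is found
while path.name != 'cs329':
    if path == path.parent:  # reached the filesystem root without finding it
        raise FileNotFoundError("no 'cs329' directory above the current working directory")
    path = path.parent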

print(path, type(path))

/Users/jdchoi/workspace/cs329 <class 'pathlib.PosixPath'>


* [`pathlib`](https://docs.python.org/3/library/pathlib.html)

Create the `dat/pos` directory under the `cs329` project:

In [15]:
path /= 'dat/pos'
path.mkdir(parents=True, exist_ok=True)
print(path)

/Users/jdchoi/workspace/cs329/dat/pos


* [`Path.mkdir()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir)

Download the [training set](https://raw.githubusercontent.com/emory-courses/cs329/master/dat/pos/wsj-pos.trn.gold.tsv) and the [development set](https://raw.githubusercontent.com/emory-courses/cs329/master/dat/pos/wsj-pos.dev.gold.tsv) for part-of-speech tagging:

In [11]:
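import requests
from pathlib import Path
from typing import Union

def download(remote_addr: str, local_addr: Union[str, Path]):
    r = requests.get(remote_addr)
    r.raise_for_status()  # fail loudly instead of saving an error page

    # write the raw bytes of the response to the local file
    with open(local_addr, 'wb') as fout:
        fout.write(r.content)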

In [12]:
# URL template; `{}` is filled with the split name ('trn' or 'dev')
url = 'https://raw.githubusercontent.com/emory-courses/cs329/master/dat/pos/wsj-pos.{}.gold.tsv'

remote = url.format('trn')
download(remote, path / Path(remote).name)

remote = url.format('dev')
download(remote, path / Path(remote).name)

## Read Data

Read the training data, where each non-empty line holds a word and its POS tag, and a blank line ends a sentence:

In [18]:
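def read_data(filename: str):
    data, sentence = [], []

    # the context manager closes the file even if parsing fails
    with open(filename) as fin:
        for line in fin:
            l = line.split()
            if l:
                sentence.append((l[0], l[1]))
            else:
                # a blank line marks the end of the current sentence
                data.append(sentence)
                sentence = []

    # keep the last sentence when the file does not end with a blank line
    if sentence:
        data.append(sentence)

    return data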

In [19]:
trn_data = read_data(path / 'wsj-pos.trn.gold.tsv')
print(len(trn_data))
print(trn_data[0])

38219
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
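
Because every sentence is a list of `(word, pos)` pairs, simple statistics over the data reduce to counting. The following is a minimal sketch, not part of the original notebook: `tag_count()` is a hypothetical helper, and `sample` holds only the first training sentence printed above so that the block runs without the downloaded files.

```python
from collections import Counter
from typing import List, Tuple

def tag_count(data: List[List[Tuple[str, str]]]) -> Counter:
    """Count how often each POS tag occurs in the data."""
    return Counter(pos for sentence in data for _, pos in sentence)

# the first training sentence, copied from the output above
sample = [[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'),
           ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'),
           ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
           ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]]

counts = tag_count(sample)
print(counts.most_common(3))
```

Calling `tag_count(trn_data)` in the same way would give the tag distribution over the whole training set.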


Write the function `word_count()` that counts the number of words in the training data:

In [24]:
from typing import List, Tuple

def word_count(data: List[List[Tuple[str, str]]]) -> int:
    """
    :param data: a list of sentences, where each sentence is a list of (word, pos) tuples.
    :return: the total number of words in the data
    """
    return sum(len(sentence) for sentence in data)

In [22]:
print(word_count(trn_data))

912344


## Predict Data

### Exercise

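Write the function `create_uni_pos_dict()` that reads the data and returns a dictionary where the key is a word $w$ and the value is the list of its possible POS tags $p$ with their probabilities, sorted in descending order of probability. Here $C(w,p)$ is the number of times $w$ is tagged with $p$, and $C(w)$ is the total number of occurrences of $w$:

$$
P(p|w) = \frac{C(w,p)}{C(w)}
$$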
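
As a worked illustration with made-up counts (not taken from the training data), suppose the word *board* occurs 10 times, 9 times tagged `NN` and once tagged `VB`. Reading the ratio as the share of a word's occurrences that carry a given tag:

$$
P(\text{NN}|\text{board}) = \frac{C(\text{board},\text{NN})}{C(\text{board})} = \frac{9}{10} = 0.9, \qquad P(\text{VB}|\text{board}) = \frac{C(\text{board},\text{VB})}{C(\text{board})} = \frac{1}{10} = 0.1
$$

The dictionary entry for this word would then be `'board': [('NN', 0.9), ('VB', 0.1)]`.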

In [26]:
from typing import Dict

def create_uni_pos_dict(data: List[List[Tuple[str, str]]]) -> Dict[str, List[Tuple[str, float]]]:
    """
    :param data: a list of sentences, where each sentence is a list of (word, pos) tuples.
    :return: a dictionary where the key is a word and the value is the list of possible POS tags with probabilities in descending order.
    """
    model = dict()
    # To be updated
    return model
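
One possible way to approach the exercise is sketched below. It is not the reference solution: the names `create_uni_pos_dict_sketch()` and `predict_sketch()` are hypothetical, the toy sentences are made up, and tagging unseen words with `XX` (Unknown in the tagset above) is an assumption chosen only for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def create_uni_pos_dict_sketch(data: List[List[Tuple[str, str]]]) -> Dict[str, List[Tuple[str, float]]]:
    """Map each word to its POS tags with relative frequencies, most probable first."""
    # C(w, p): one counter of tags per word
    counts = defaultdict(Counter)
    for sentence in data:
        for word, pos in sentence:
            counts[word][pos] += 1

    model = dict()
    for word, pos_counts in counts.items():
        total = sum(pos_counts.values())  # C(w)
        # most_common() orders by count, which is also the order by probability
        model[word] = [(pos, count / total) for pos, count in pos_counts.most_common()]
    return model

def predict_sketch(model: Dict[str, List[Tuple[str, float]]],
                   words: List[str],
                   default: str = 'XX') -> List[Tuple[str, str]]:
    """Tag each word with its most probable POS; unseen words get the assumed default tag."""
    return [(word, model[word][0][0] if word in model else default) for word in words]

# made-up toy data in the same format as the output of read_data()
toy = [[('the', 'DT'), ('board', 'NN')],
       [('they', 'PRP'), ('board', 'VBP'), ('the', 'DT'), ('plane', 'NN')],
       [('a', 'DT'), ('board', 'NN')]]

model = create_uni_pos_dict_sketch(toy)
print(model['board'])
print(predict_sketch(model, ['the', 'board', 'sails']))
```

Storing the tags already sorted means that prediction only needs the first entry of each list. How to handle unseen words is left open by the exercise, so the default tag here is just a placeholder.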