# Assignment 2: Parts-of-Speech Tagging (POS)

Welcome to the second assignment of Course 2 in the Natural Language Processing specialization. In this assignment you will practice part-of-speech (POS) tagging, the process of assigning a part-of-speech tag (noun, verb, adjective, ...) to each word in an input text. Tagging is difficult because some words can represent more than one part of speech depending on context; they are **ambiguous**. Consider the following example:

- The whole team played **well**. [adverb]
- You are doing **well** for yourself. [adjective]
- **Well**, this assignment took me forever to complete. [interjection]
- The **well** is dry. [noun]
- Tears were beginning to **well** in her eyes. [verb]

Knowing the part of speech of each word in a sentence helps you understand the meaning of that sentence. This is especially important in search queries: identifying a proper noun, an organization, a stock symbol, or anything similar can improve applications ranging from speech recognition to search. By completing this assignment, you will:

- Learn how parts-of-speech tagging works
- Compute the transition matrix A in a Hidden Markov Model
- Compute the emission matrix B in a Hidden Markov Model
- Implement the Viterbi algorithm
- Compute the accuracy of your own model 

In [1]:
from collections import defaultdict
import math

import numpy as np
import pandas as pd

from util.pos_utils import get_word_tag, preprocess

## Part 0: Data Sources
This assignment will use two tagged data sets collected from the **Wall Street Journal (WSJ)**. 

[Here](http://relearn.be/2015/training-common-sense/sources/software/pattern-2.6-critical-fork/docs/html/mbsp-tags.html) is an example 'tag set', or part-of-speech designation, describing the two- or three-letter tags and their meanings. 
- One data set (**WSJ_02-21.pos**) will be used for **training**.
- The other (**WSJ_24.pos**) will be used for **testing**. 
- The tagged training data has been preprocessed to form a vocabulary (**hmm_vocab.txt**). 
- The words in the vocabulary are words from the training set that were used two or more times. 
- The vocabulary is augmented with a set of 'unknown word tokens', described below. 
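
The rule above (keep words seen two or more times, then add the unknown-word tokens) can be sketched as follows. This is only an illustration: `build_vocab`, its `min_count` parameter, and the toy lines are assumptions made for this example, and the actual preprocessing that produced **hmm_vocab.txt** is not shown in this assignment.

```python
from collections import Counter

# Hypothetical helper, not part of the assignment's utility module.
def build_vocab(tagged_lines, min_count=2, extra_tokens=()):
    """Return a sorted list of words seen at least `min_count` times."""
    counts = Counter()
    for line in tagged_lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank lines between sentences
            counts[parts[0]] += 1
    kept = {word for word, count in counts.items() if count >= min_count}
    # Augment with unknown-word tokens so unseen words can be mapped to them.
    return sorted(kept | set(extra_tokens))

toy_lines = ['the\tDT\n', 'dog\tNN\n', 'the\tDT\n', 'cat\tNN\n', '\n', 'dog\tNN\n']
toy_vocab = build_vocab(toy_lines, extra_tokens=['--unk--'])
print(toy_vocab)  # ['--unk--', 'dog', 'the']
```

Here 'cat' appears only once, so it is left out of the vocabulary and would later be replaced by an unknown-word token.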

The training set will be used to create the emission, transition and tag counts. 

The test set (WSJ_24.pos) is read in to create `y`. 
- It contains both the test words and their true tags. 
- The test set has also been preprocessed to remove the tags, forming **test.words**. 
- This file is read in and further processed to identify the ends of sentences and to handle words not in the vocabulary, using functions provided in the `util.pos_utils` module. 
- This forms the list `prep`, the preprocessed text used to test our  POS taggers.

A POS tagger will necessarily encounter words that are not in its datasets. 
- To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag. 
- For example, the suffix 'ize' is a hint that the word is a verb, as in 'final-ize' or 'character-ize'. 
- A set of unknown-word tokens, such as '--unk_verb--' or '--unk_noun--', will replace the unknown words in both the training and test corpora and will appear in the emission, transition and tag data structures.
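
As a rough sketch of the idea, an out-of-vocabulary word can be mapped to an unknown-word token by checking a few surface hints in order. The token names below are the ones that appear in the vocabulary file; the function name `assign_unk_sketch`, the order of the checks, and the suffix lists are assumptions for illustration and may differ from what the provided preprocessing actually does.

```python
import string

# Illustrative suffix lists (assumed, not taken from the assignment utilities).
NOUN_SUFFIXES = ('ness', 'ment', 'tion', 'ism')
VERB_SUFFIXES = ('ize', 'ify', 'ate')
ADJ_SUFFIXES = ('ous', 'ful', 'able', 'ish')
ADV_SUFFIXES = ('ly', 'ward', 'wise')

def assign_unk_sketch(word):
    """Map an out-of-vocabulary word to an unknown-word token."""
    if any(ch.isdigit() for ch in word):
        return '--unk_digit--'
    if any(ch in string.punctuation for ch in word):
        return '--unk_punct--'
    if any(ch.isupper() for ch in word):
        return '--unk_upper--'
    # str.endswith accepts a tuple of candidate suffixes.
    if word.endswith(NOUN_SUFFIXES):
        return '--unk_noun--'
    if word.endswith(VERB_SUFFIXES):
        return '--unk_verb--'
    if word.endswith(ADJ_SUFFIXES):
        return '--unk_adj--'
    if word.endswith(ADV_SUFFIXES):
        return '--unk_adv--'
    return '--unk--'  # no usable hint

for w in ['finalize', 'quickly', 'Zorb', 'x99', 'zzz']:
    print(w, '->', assign_unk_sketch(w))
```

Because training and test text go through the same mapping, the tagger can learn counts for a token such as '--unk_verb--' and reuse them for verbs it has never seen.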

In [2]:
DATA = '../../../../data'

In [3]:
with open(f'{DATA}/WSJ_02-21.pos', 'r') as f:
    training_corpus = f.readlines()
    
print(training_corpus[:5])

['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']


In [4]:
with open(f'{DATA}/hmm_vocab.txt', 'r') as f:
    voc_l = f.read().split('\n')
    
print(voc_l[:50])
print(voc_l[-50:])

['!', '#', '$', '%', '&', "'", "''", "'40s", "'60s", "'70s", "'80s", "'86", "'90s", "'N", "'S", "'d", "'em", "'ll", "'m", "'n'", "'re", "'s", "'til", "'ve", '(', ')', ',', '-', '--', '--n--', '--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--', '.', '...', '0.01', '0.0108', '0.02', '0.03', '0.05', '0.1', '0.10', '0.12', '0.13', '0.15']
['yards', 'yardstick', 'year', 'year-ago', 'year-before', 'year-earlier', 'year-end', 'year-on-year', 'year-round', 'year-to-date', 'year-to-year', 'yearlong', 'yearly', 'years', 'yeast', 'yelled', 'yelling', 'yellow', 'yen', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']


In [5]:
# vocab: dictionary that has the index of the corresponding words
vocab = {} 

# Get the index of the corresponding words. 
for i, word in enumerate(sorted(voc_l)): 
    vocab[word] = i       
    
print('Vocabulary dictionary, key is the word, value is a unique integer')
cnt = 0
for k, v in vocab.items():
    print(f"{k}: {v}")
    cnt += 1
    if cnt > 20:
        break

Vocabulary dictionary, key is the word, value is a unique integer
: 0
!: 1
#: 2
$: 3
%: 4
&: 5
': 6
'': 7
'40s: 8
'60s: 9
'70s: 10
'80s: 11
'86: 12
'90s: 13
'N: 14
'S: 15
'd: 16
'em: 17
'll: 18
'm: 19
'n': 20


In [6]:
# Test corpus
with open(f'{DATA}/WSJ_24.pos', 'r') as f:
    y = f.readlines()
    
y[:10]

['The\tDT\n',
 'economy\tNN\n',
 "'s\tPOS\n",
 'temperature\tNN\n',
 'will\tMD\n',
 'be\tVB\n',
 'taken\tVBN\n',
 'from\tIN\n',
 'several\tJJ\n',
 'vantage\tNN\n']

In [7]:
# corpus without tags, preprocessed
_, prep = preprocess(vocab, f'{DATA}/test.words')

print('The length of the preprocessed test corpus: ', len(prep))
print('This is a sample of the test_corpus: ')
print(prep[:10])

The length of the preprocessed test corpus:  34199
This is a sample of the test_corpus: 
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']


# Part 1: Parts-of-speech tagging 

## Part 1.1 - Training
You will start with the simplest possible parts-of-speech tagger and we will build up to the state of the art. 

In this section, you will find the words that are not ambiguous. 
- For example, the word `is` is a verb and it is not ambiguous. 
- In the `WSJ` corpus, $86\%$ of the tokens are unambiguous (meaning they have only one tag). 
- About $14\%$ are ambiguous (meaning that they have more than one tag).
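
One way to measure this kind of ambiguity from `(tag, word)` counts is sketched below. The helper `ambiguous_token_fraction` and the toy counts are illustrative assumptions, and the exact definition behind the percentages quoted above is not stated here, so the sketch only shows the general idea: count the tokens whose word was seen with more than one tag.

```python
from collections import defaultdict

# Hypothetical helper for illustration.
def ambiguous_token_fraction(emission_counts):
    """Fraction of tokens whose word occurs with more than one tag."""
    tags_per_word = defaultdict(set)
    tokens_per_word = defaultdict(int)
    for (tag, word), count in emission_counts.items():
        tags_per_word[word].add(tag)
        tokens_per_word[word] += count
    total = sum(tokens_per_word.values())
    ambiguous = sum(count for word, count in tokens_per_word.items()
                    if len(tags_per_word[word]) > 1)
    return ambiguous / total

# Toy counts: 'is' has one tag, 'back' has two.
toy_counts = {('VBZ', 'is'): 6, ('RB', 'back'): 3, ('NN', 'back'): 1}
frac = ambiguous_token_fraction(toy_counts)
print(f'{frac:.2f}')  # 0.40
```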

Before you start predicting the tags of each word, you will need to 
compute a few dictionaries that will help you to generate the tables. 

#### Transition counts
- The first dictionary is the `transition_counts` dictionary, which records the number of times each tag appeared immediately after another tag. 

This dictionary will be used to compute: 
$$P(t_i |t_{i-1}) \tag{1}$$

This is the probability of a tag at position $i$ given the tag at position $i-1$.

In order for you to compute equation 1, you will create a `transition_counts` dictionary where 
- The keys are `(prev_tag, tag)`
- The values are the number of times those two tags appeared in that order. 
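
To make the structure concrete, here is a small sketch on a made-up tag sequence. The names `toy_tags`, `transition_counts_toy`, `prev_tag_counts` and `transition_prob` are assumptions for this example, and the probability is the plain unsmoothed ratio of counts.

```python
from collections import defaultdict

toy_tags = ['DT', 'NN', 'VBZ', 'DT', 'NN', 'DT', 'JJ']

transition_counts_toy = defaultdict(int)  # keys are (prev_tag, tag)
prev_tag_counts = defaultdict(int)        # how often each tag was the previous tag
prev_tag = '--s--'                        # start-of-sentence tag
for tag in toy_tags:
    transition_counts_toy[(prev_tag, tag)] += 1
    prev_tag_counts[prev_tag] += 1
    prev_tag = tag

def transition_prob(prev_tag, tag):
    """Unsmoothed estimate of P(tag | prev_tag) from the toy counts."""
    return transition_counts_toy.get((prev_tag, tag), 0) / prev_tag_counts[prev_tag]

print(transition_counts_toy[('DT', 'NN')])    # 2
print(round(transition_prob('DT', 'NN'), 3))  # 0.667
```

'DT' is followed by 'NN' twice and by 'JJ' once, so the estimate of $P(\text{NN} | \text{DT})$ is $2/3$.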

#### Emission counts

The second dictionary you will compute is the `emission_counts` dictionary. This dictionary will be used to compute:

$$P(w_i|t_i)\tag{2}$$

In other words, you will use it to compute the probability of a word given its tag. 

In order for you to compute equation 2, you will create an `emission_counts` dictionary where 
- The keys are `(tag, word)` 
- The values are the number of times that pair showed up in your training set. 
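
The same idea on a made-up corpus of `(word, tag)` pairs is sketched below. The names `toy_corpus`, `emission_counts_toy`, `tag_counts_toy` and `emission_prob` are assumptions for this example, and the probability is again the plain unsmoothed ratio of counts.

```python
from collections import defaultdict

toy_corpus = [('the', 'DT'), ('back', 'NN'), ('is', 'VBZ'),
              ('back', 'RB'), ('the', 'DT'), ('door', 'NN')]

emission_counts_toy = defaultdict(int)  # keys are (tag, word)
tag_counts_toy = defaultdict(int)       # keys are tags
for word, tag in toy_corpus:
    emission_counts_toy[(tag, word)] += 1
    tag_counts_toy[tag] += 1

def emission_prob(tag, word):
    """Unsmoothed estimate of P(word | tag) from the toy counts."""
    return emission_counts_toy.get((tag, word), 0) / tag_counts_toy[tag]

print(emission_prob('NN', 'back'))  # 0.5
print(emission_prob('DT', 'the'))   # 1.0
```

Note that the key order is `(tag, word)`, so the same word can appear under several tags, as 'back' does here.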

#### Tag counts

The last dictionary you will compute is the `tag_counts` dictionary. 
- The key is the tag 
- The value is the number of times each tag appeared.
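
Taken together, the three dictionaries give the standard count-based estimates of equations 1 and 2. The notation $C(\cdot)$ for a count is introduced here for illustration, and the estimates are shown without any smoothing:

$$P(t_i | t_{i-1}) \approx \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}, \qquad P(w_i | t_i) \approx \frac{C(t_i, w_i)}{C(t_i)}$$

Here $C(t_{i-1}, t_i)$ is read from `transition_counts[(prev_tag, tag)]`, $C(t_i, w_i)$ from `emission_counts[(tag, word)]`, and $C(t)$ from `tag_counts[tag]`.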

### Exercise 01

**Instructions:** Write a program that takes in the `training_corpus` and returns the three dictionaries mentioned above: `transition_counts`, `emission_counts`, and `tag_counts`. 
- `emission_counts`: maps (tag, word) to the number of times it has appeared. 
- `transition_counts`: maps (prev_tag, tag) to the number of times it has appeared. 
- `tag_counts`: maps (tag) to the number of times it has occurred. 

Implementation note: This routine uses *defaultdict*, which is a subclass of *dict*. 
- A standard Python dictionary raises a *KeyError* if you try to access an item with a key that is not currently in the dictionary. 
- In contrast, the *defaultdict* creates the missing item by calling the factory passed as its argument. Here the factory is `int`, so the new item is an integer with the default value of 0. 
- See [defaultdict](https://docs.python.org/3.3/library/collections.html#defaultdict-objects).
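
The difference is easy to see in a short sketch; the keys used here are made up for illustration.

```python
from collections import defaultdict

# A plain dict raises KeyError when incrementing a key that does not exist yet.
plain = {}
try:
    plain[('NN', 'well')] += 1
    key_error_raised = False
except KeyError:
    key_error_raised = True

# A defaultdict(int) creates the missing entry with value 0 first.
counts = defaultdict(int)
counts[('NN', 'well')] += 1
counts[('NN', 'well')] += 1

# Merely reading a missing key also inserts it with the default value.
missing = counts[('VB', 'well')]

print(key_error_raised)        # True
print(counts[('NN', 'well')])  # 2
print(missing, len(counts))    # 0 2
```

Because a lookup on a *defaultdict* inserts the key, use `key in counts` or `counts.get(key, 0)` when you only want to check a count without growing the dictionary.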

In [17]:
def create_dictionaries(training_corpus, vocab):
    """
    Input: 
        training_corpus: a corpus where each line has a word followed by 
          its tag.
        vocab: a dictionary where keys are words in vocabulary and value
          is an index
    Output: 
        emission_counts: a dictionary where the keys are (tag, word) and
          the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag)
          and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the 
          values are the counts
    """
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    prev_tag = '--s--' 
    for i, word_tag in enumerate(training_corpus):
        if i % 50000 == 0:
            print(f'word count = {i}')
        word, tag = get_word_tag(word_tag, vocab)
        transition_counts[(prev_tag, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev_tag = tag
    return emission_counts, transition_counts, tag_counts

In [18]:
emission_counts, transition_counts, tag_counts = create_dictionaries(
    training_corpus, vocab)

word count = 0
word count = 50000
word count = 100000
word count = 150000
word count = 200000
word count = 250000
word count = 300000
word count = 350000
word count = 400000
word count = 450000
word count = 500000
word count = 550000
word count = 600000
word count = 650000
word count = 700000
word count = 750000
word count = 800000
word count = 850000
word count = 900000
word count = 950000


In [19]:
states = sorted(tag_counts.keys())
print(f"Number of POS tags (number of 'states'): {len(states)}")
print('View these POS tags (states)')
print(states)

Number of POS tags (number of 'states'): 46
View these POS tags (states)
['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']


The 'states' are the part-of-speech designations found in the training data. They will also be referred to as 'tags' or POS in this assignment. 

- 'NN' is noun, singular. 
- 'NNS' is noun, plural. 
- In addition, there are helper tags like '--s--', which indicates the start of a sentence.
- You can get a more complete description at [Penn Treebank II tag set](https://www.clips.uantwerpen.be/pages/mbsp-tags). 

In [20]:
print('transition examples: ')
for ex in list(transition_counts.items())[:3]:
    print(ex)
print()

print('emission examples: ')
for ex in list(emission_counts.items())[200:203]:
    print(ex)
print()

print('ambiguous word example: ')
for tup, cnt in emission_counts.items():
    if tup[1] == 'back':
        print(tup, cnt)

transition examples: 
(('--s--', 'IN'), 5050)
(('IN', 'DT'), 32364)
(('DT', 'NNP'), 9044)

emission examples: 
(('DT', 'any'), 721)
(('NN', 'decrease'), 7)
(('NN', 'insider-trading'), 5)

ambiguous word example: 
('RB', 'back') 304
('VB', 'back') 20
('RP', 'back') 84
('JJ', 'back') 25
('NN', 'back') 29
('VBP', 'back') 4


## Part 1.2 - Testing

Now you will test the accuracy of your parts-of-speech tagger using your `emission_counts` dictionary. 
- Given your preprocessed test corpus `prep`, you will assign a parts-of-speech tag to every word in that corpus. 
- Using the original tagged test corpus `y`, you will then compute what percent of the tags you got correct. 

### Exercise 02

**Instructions:** Implement `predict_pos`, which tags each word with its most frequent POS and computes the accuracy of that baseline. 

- This is a warm up exercise. 
- To assign a part of speech to a word, assign the most frequent POS for that word in the training set. 
- Then evaluate how well this approach works.  Each time you predict based on the most frequent POS for the given word, check whether the actual POS of that word is the same.  If so, the prediction was correct!
- Calculate the accuracy as the number of correct predictions divided by the total number of words for which you predicted the POS tag.
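
The "most frequent POS" rule can be sketched with the counts for 'back' printed earlier. The helper `most_frequent_tags` is an assumption for illustration: it builds a word-to-tag lookup in one pass instead of scanning all states for every word, and on ties it keeps the first tag encountered, which may differ from the tie-breaking of your own implementation.

```python
# Counts for 'back' copied from the emission examples printed earlier.
back_counts = {
    ('RB', 'back'): 304, ('VB', 'back'): 20, ('RP', 'back'): 84,
    ('JJ', 'back'): 25, ('NN', 'back'): 29, ('VBP', 'back'): 4,
}

# Hypothetical helper for illustration.
def most_frequent_tags(emission_counts):
    """Map each word to the tag it was seen with most often."""
    best = {}  # word -> (tag, count)
    for (tag, word), count in emission_counts.items():
        if word not in best or count > best[word][1]:
            best[word] = (tag, count)
    return {word: tag for word, (tag, _) in best.items()}

lookup = most_frequent_tags(back_counts)
print(lookup['back'])  # RB
```

With this rule, every occurrence of 'back' is tagged 'RB', so the occurrences that are really 'RP', 'NN', 'JJ', 'VB' or 'VBP' are counted as errors.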

In [54]:
def predict_pos(prep, y, emission_counts, vocab, states):
    '''
    Input: 
        prep: a preprocessed version of 'y'. A list of words aligned
          with the entries of 'y'.
        y: the tagged test corpus, a list of strings where each string
          holds a word and its POS tag separated by whitespace
        emission_counts: a dictionary where the keys are (tag, word)
          tuples and the value is the count
        vocab: a dictionary where keys are words in vocabulary and value
          is an index
        states: a sorted list of all possible tags for this assignment
    Output: 
        accuracy: fraction of the entries in 'y' whose tag was
          predicted correctly
    '''
    n_correct = 0
    # The denominator is the number of lines in y, including any lines
    # skipped below because they do not split into (word, tag).
    total = len(y)
    for word, y_tup in zip(prep, y): 
        # Split the tagged line into its word and true tag.
        y_list = y_tup.split()
        if len(y_list) == 2:
            true_label = y_list[1]
        else:
            continue
        count_final = 0
        pos_final = ''
        if word in vocab:
            # Pick the tag with the highest emission count for this word.
            for pos in states:
                key = (pos, word)
                if key in emission_counts:
                    count = emission_counts[key]
                    if count > count_final:
                        count_final = count
                        pos_final = pos
            if pos_final == true_label:
                n_correct += 1
    accuracy = n_correct / total
    return accuracy

In [55]:
accuracy_predict_pos = predict_pos(
    prep, y, emission_counts, vocab, states)
print(f'Accuracy of prediction using predict_pos is '
      f'{accuracy_predict_pos:.4f}')

Accuracy of prediction using predict_pos is 0.8889
