<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Libraries</a></span></li></ul></li><li><span><a href="#Import-Data" data-toc-modified-id="Import-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Data</a></span></li><li><span><a href="#Dataset-Class" data-toc-modified-id="Dataset-Class-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dataset Class</a></span><ul class="toc-item"><li><span><a href="#Sentences" data-toc-modified-id="Sentences-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Sentences</a></span></li><li><span><a href="#Vocabulary-&amp;-Tags" data-toc-modified-id="Vocabulary-&amp;-Tags-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Vocabulary &amp; Tags</a></span></li><li><span><a href="#Training-&amp;-Test-Set" data-toc-modified-id="Training-&amp;-Test-Set-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Training &amp; Test Set</a></span></li><li><span><a href="#Word-&amp;-Tag-Sequences" data-toc-modified-id="Word-&amp;-Tag-Sequences-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Word &amp; Tag Sequences</a></span></li><li><span><a href="#Word,-Tag-Samples" data-toc-modified-id="Word,-Tag-Samples-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Word, Tag Samples</a></span></li></ul></li><li><span><a href="#Most-Frequent-Class-Tagger" data-toc-modified-id="Most-Frequent-Class-Tagger-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Most Frequent Class Tagger</a></span><ul class="toc-item"><li><span><a href="#Pair-Count-Calculator" data-toc-modified-id="Pair-Count-Calculator-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Pair-Count Calculator</a></span></li><li><span><a href="#MFCTagger-Class" data-toc-modified-id="MFCTagger-Class-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>MFCTagger Class</a></span></li><li><span><a href="#Helper-Functions" 
data-toc-modified-id="Helper-Functions-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Helper Functions</a></span></li><li><span><a href="#MFC-Model-&amp;-Example-Sentence-Decoding" data-toc-modified-id="MFC-Model-&amp;-Example-Sentence-Decoding-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>MFC Model &amp; Example Sentence Decoding</a></span></li><li><span><a href="#Accuracy" data-toc-modified-id="Accuracy-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Accuracy</a></span></li></ul></li><li><span><a href="#HMM-Tagger" data-toc-modified-id="HMM-Tagger-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>HMM Tagger</a></span><ul class="toc-item"><li><span><a href="#Unigram-(Emissions)-Counts" data-toc-modified-id="Unigram-(Emissions)-Counts-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Unigram (Emissions) Counts</a></span></li><li><span><a href="#Bigram-(Transmission)-Counts" data-toc-modified-id="Bigram-(Transmission)-Counts-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Bigram (Transmission) Counts</a></span></li><li><span><a href="#Sequence-Start-Counts" data-toc-modified-id="Sequence-Start-Counts-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Sequence Start Counts</a></span></li><li><span><a href="#Sequence-End-Counts" data-toc-modified-id="Sequence-End-Counts-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Sequence End Counts</a></span></li></ul></li></ul></div>

Use a hidden Markov model to build a part-of-speech tagger with the [pomegranate](https://pomegranate.readthedocs.io/en/latest/) library.

## Libraries

In [1]:
# Jupyter magic commands: auto-reload the helpers and tests modules when they change
%load_ext autoreload
%aimport helpers, tests
%autoreload 1

In [2]:
import os

import matplotlib.pyplot as plt
import numpy as np

from IPython.core.display import HTML
from itertools import chain
from collections import Counter, defaultdict
from helpers import show_model, Dataset
from pomegranate import State, HiddenMarkovModel, DiscreteDistribution

# Import Data

Load the text of the [Brown University Standard Corpus of Present-Day American English](https://en.wikipedia.org/wiki/Brown_Corpus).

In [3]:
# data paths
brown_data = os.path.join("data","brown-universal.txt")
tags_data = os.path.join("data","tags-universal.txt")

In [4]:
# check out the brown data
with open(brown_data, "r") as f:
    text = f.read()
    
print(text[:150])

b100-5507
Mr.	NOUN
Podger	NOUN
had	VERB
thanked	VERB
him	PRON
gravely	ADV
,	.
and	CONJ
now	ADV
he	PRON
made	VERB
use	NOUN
of	ADP
the	DET
advice	NOUN
.
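
The excerpt above suggests that each sentence is stored as a key line followed by one tab-separated word/tag pair per line. Below is a minimal sketch of parsing one such block, assuming that layout. `parse_sentence_block` is a hypothetical helper, and the `Sentence` namedtuple only stands in for the type that `helpers.Dataset` returns; the actual loader in `helpers` may work differently.

```python
from collections import namedtuple

# Hypothetical stand-in for the Sentence type produced by helpers.Dataset
Sentence = namedtuple("Sentence", "words tags")

def parse_sentence_block(block):
    """Split one block (key line + tab-separated word/tag lines) into (key, Sentence)."""
    lines = block.strip().split("\n")
    key = lines[0]
    # each remaining line is "word<TAB>tag"
    pairs = [line.split("\t") for line in lines[1:]]
    words, tags = zip(*pairs)
    return key, Sentence(words=words, tags=tags)

# first few lines of the excerpt printed above
block = "b100-5507\nMr.\tNOUN\nPodger\tNOUN\nhad\tVERB\nthanked\tVERB"
key, sentence = parse_sentence_block(block)
print(key, sentence)
```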


In [5]:
# check out the universal tags list
with open(tags_data, "r") as f:
    tags = f.read()
    
print(tags)

.
ADJ
ADP
ADV
CONJ
DET
NOUN
NUM
PRON
PRT
VERB
X


# Dataset Class

In [6]:
# use the dataset class
data = Dataset(tags_data, brown_data, train_test_split=0.8)

print("There are {} sentences in the corpus.".format(len(data)))
print("There are {} sentences in the training set.".format(len(data.training_set)))
print("There are {} sentences in the testing set.".format(len(data.testing_set)))

assert len(data) == len(data.training_set) + len(data.testing_set), \
       "The number of sentences in the training set + testing set should sum to the number of sentences in the corpus"

There are 57340 sentences in the corpus.
There are 45872 sentences in the training set.
There are 11468 sentences in the testing set.


**Dataset Class Attributes**
```
Dataset-only Attributes:
    training_set - reference to a Subset object containing the samples for training
    testing_set - reference to a Subset object containing the samples for testing

Dataset & Subset Attributes:
    sentences - a dictionary with an entry {sentence_key: Sentence()} for each sentence in the corpus
    keys - an immutable ordered (not sorted) collection of the sentence_keys for the corpus
    vocab - an immutable collection of the unique words in the corpus
    tagset - an immutable collection of the unique tags in the corpus
    X - returns an array of words grouped by sentences ((w11, w12, w13, ...), (w21, w22, w23, ...), ...)
    Y - returns an array of tags grouped by sentences ((t11, t12, t13, ...), (t21, t22, t23, ...), ...)
    N - returns the number of distinct samples (individual words or tags) in the dataset

Methods:
    stream() - returns a flat iterable over all (word, tag) pairs across all sentences in the corpus
    __iter__() - returns an iterable over the data as (sentence_key, Sentence()) pairs
    __len__() - returns the number of sentences in the dataset
```

This class is convenient, but it is not suitable for large datasets because it stores the data redundantly.
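
To make the relationship between `X`, `Y`, `stream()`, `N`, `vocab` and `tagset` concrete, here is a small sketch on a toy corpus. The free function `stream` and the toy data are illustrative assumptions, not the code in `helpers`.

```python
from itertools import chain

# Toy corpus in the same shape as Dataset.X and Dataset.Y (tuples of tuples)
X = (("See", "Spot", "run"), ("Run", "fast"))
Y = (("VERB", "NOUN", "VERB"), ("VERB", "ADV"))

def stream(X, Y):
    """Return a flat iterable of (word, tag) pairs across all sentences."""
    return chain.from_iterable(zip(words, tags) for words, tags in zip(X, Y))

pairs = list(stream(X, Y))
N = len(pairs)                              # number of individual samples
vocab = frozenset(chain.from_iterable(X))   # unique words
tagset = frozenset(chain.from_iterable(Y))  # unique tags
print(pairs[:3], N, len(vocab), sorted(tagset))
```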

## Sentences
`Dataset` contains a dictionary of all sentences. Each sentence holds its words and their tags.

In [7]:
# sentences
data.sentences

{'b100-5507': Sentence(words=('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.'), tags=('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')),
 'b100-935': Sentence(words=('But', 'there', 'seemed', 'to', 'be', 'some', 'difference', 'of', 'opinion', 'as', 'to', 'how', 'far', 'the', 'board', 'should', 'go', ',', 'and', 'whose', 'advice', 'it', 'should', 'follow', '.'), tags=('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')),
 'b100-30614': Sentence(words=('Such', 'an', 'instrument', 'is', 'expected', 'to', 'be', 'especially', 'useful', 'if', 'it', 'could', 'be', 'used', 'to', 'measure', 'the', 'elasticity', 'of', 'heavy', 'pastes', 'such', 'as', 'printing', 'inks', ',', 'paints', ',', 'adhesives', ',', 'molten', 'plastics'

In [8]:
# example sentence object
key = 'b100-38532'
data.sentences[key]

Sentence(words=('Perhaps', 'it', 'was', 'right', ';', ';'), tags=('ADV', 'PRON', 'VERB', 'ADJ', '.', '.'))

In [9]:
# example sentence words & tags
print("Sentence: {}".format(key))
print("words:\n\t{!s}".format(data.sentences[key].words))
print("tags:\n\t{!s}".format(data.sentences[key].tags))

Sentence: b100-38532
words:
	('Perhaps', 'it', 'was', 'right', ';', ';')
tags:
	('ADV', 'PRON', 'VERB', 'ADJ', '.', '.')


## Vocabulary & Tags
`Dataset` also stores the vocabulary and the tagset.

In [10]:
# vocab - composed of all the unique words in the brown-universal.txt file
data.vocab

frozenset({'Berkeley',
           'degrees',
           'streamer',
           'roused',
           'barbs',
           'Criminals',
           'denting',
           'improvements',
           'operational',
           'adjourned',
           'chosen',
           'Morehouse',
           'superintendent',
           'Eber',
           'cynically',
           'half-turned',
           'supplements',
           'Catcher',
           '$17,000,000',
           'trap',
           'amulet',
           'conciliate',
           'kick',
           'supplied',
           'intangible',
           'unpaved',
           'mortals',
           'Vickery',
           'peddled',
           'Mays',
           'Middle-South',
           'Anyway',
           'recreate',
           'trooper',
           'hovered',
           'agreeable',
           'Crow',
           'rider',
           'herpetologists',
           'levi-clad',
           '7,000,000',
           'augment',
           'strong-made',
         

In [11]:
# tagset - this came from the tags-universal.txt file
data.tagset

frozenset({'.',
           'ADJ',
           'ADP',
           'ADV',
           'CONJ',
           'DET',
           'NOUN',
           'NUM',
           'PRON',
           'PRT',
           'VERB',
           'X'})

In [12]:
# overall data set
print("There are a total of {} samples of {} unique words in the corpus."
      .format(data.N, len(data.vocab)))

There are a total of 1161192 samples of 56057 unique words in the corpus.


## Training & Test Set
The `Dataset` class splits the data into a training set and a test set.

In [13]:
print("There are {} samples of {} unique words in the training set."
      .format(data.training_set.N, len(data.training_set.vocab)))
print("There are {} samples of {} unique words in the testing set."
      .format(data.testing_set.N, len(data.testing_set.vocab)))
print("There are {} words in the test set that are missing in the training set."
      .format(len(data.testing_set.vocab - data.training_set.vocab)))

There are 928458 samples of 50536 unique words in the training set.
There are 232734 samples of 25112 unique words in the testing set.
There are 5521 words in the test set that are missing in the training set.


## Word & Tag Sequences

In [14]:
# accessing words with Dataset.X and tags with Dataset.Y 
for i in range(2):    
    print("Sentence {}:".format(i + 1), data.X[i])
    print()
    print("Labels {}:".format(i + 1), data.Y[i])
    print()

Sentence 1: ('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')

Labels 1: ('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')

Sentence 2: ('But', 'there', 'seemed', 'to', 'be', 'some', 'difference', 'of', 'opinion', 'as', 'to', 'how', 'far', 'the', 'board', 'should', 'go', ',', 'and', 'whose', 'advice', 'it', 'should', 'follow', '.')

Labels 2: ('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')



## Word, Tag Samples

In [15]:
# use Dataset.stream() (word, tag) samples for the entire corpus
print("\nStream (word, tag) pairs:\n")
for i, pair in enumerate(data.stream()):
    print("\t", pair)
    if i > 5: break


Stream (word, tag) pairs:

	 ('Mr.', 'NOUN')
	 ('Podger', 'NOUN')
	 ('had', 'VERB')
	 ('thanked', 'VERB')
	 ('him', 'PRON')
	 ('gravely', 'ADV')
	 (',', '.')


# Most Frequent Class Tagger
As a baseline tagger, simply assign each word the tag it most frequently carries.
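
The idea can be sketched on a handful of toy samples before applying it to the corpus. This is an illustrative version built on `collections.Counter`, not the implementation used in the cells that follow.

```python
from collections import Counter, defaultdict

# Toy (word, tag) samples; the notebook takes these from the corpus instead
samples = [("will", "VERB"), ("will", "NOUN"), ("will", "VERB"), ("cat", "NOUN")]

# count how often each tag is assigned to each word
tag_counts = defaultdict(Counter)
for word, tag in samples:
    tag_counts[word][tag] += 1

# most_common(1) returns [(tag, count)] for the tag with the highest count
mfc = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}
print(mfc)
```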

## Pair-Count Calculator

In [16]:
def pair_counts(sequences_A, sequences_B):
    """Return a dictionary keyed to each unique value in the first sequence list
    that counts the number of occurrences of the corresponding value from the
    second sequences list.
    
    For example, if sequences_A is tags and sequences_B is the corresponding
    words, then if 1244 sequences contain the word "time" tagged as a NOUN, then
    you should return a dictionary such that pair_counts[NOUN][time] == 1244
    """
    
    # check if a tuple was passed (i.e. from the DataSet Class)
    # and convert to a list
    
    if(type(sequences_A) == tuple):
        
        list_A = []
        for a in sequences_A:
            list_A.extend(a)
    
        # overwrite sequences_A
        sequences_A = list_A
        
    if(type(sequences_B) == tuple):
        
        list_B = []
        for b in sequences_B:
            list_B.extend(b)
            
        # overwrite sequences_B
        sequences_B = list_B

        
    # initialize the pair-count dictionary (named `counts` to avoid shadowing the function)
    counts = {}

    # loop through all aligned pairs in A, B
    for A, B in zip(sequences_A, sequences_B):

        # create the inner dictionary the first time A is seen
        inner = counts.setdefault(A, {})

        # increment the count for B, starting from 0 if B is new
        inner[B] = inner.get(B, 0) + 1

    return counts

In [17]:
# test the function on a very simple sequence of tags & words

test_A = ['N', 'V', 'V', 'N', 'N', 'V', 'V', 'N']
test_B = ['cat', 'ran', 'will', 'cat', 'dog', 'ran', 'swim', 'will']

test_dict = pair_counts(test_A, test_B)
print(test_dict)
print(test_dict['N']['cat'])

{'N': {'cat': 2, 'dog': 1, 'will': 1}, 'V': {'ran': 2, 'will': 1, 'swim': 1}}
2


In [18]:
# get the pair counts dictionary for our dataset
# Y holds the tags, X holds the words
emission_counts = pair_counts(data.Y, data.X)

# check what the most common noun is
print(max(emission_counts['NOUN'], key=emission_counts['NOUN'].get))

time


## MFCTagger Class
Create a class that mimics the pomegranate interface for our tagger, which assigns each word its most common part-of-speech tag.

In [19]:
from collections import namedtuple

FakeState = namedtuple("FakeState", "name")

class MFCTagger:
    missing = FakeState(name="<MISSING>")
    
    def __init__(self, table):
        self.table = defaultdict(lambda: MFCTagger.missing)
        self.table.update({word: FakeState(name=tag) for word, tag in table.items()})
        
    def viterbi(self, seq):
        """This method simplifies predictions by matching the Pomegranate viterbi() interface"""
        return 0., list(enumerate(["<start>"] + [self.table[w] for w in seq] + ["<end>"]))

In [20]:
# calculate the frequency of each tag being assigned to each word 
word_counts = pair_counts(data.X, data.Y)

# Create a lookup table mfc_table where mfc_table[word] contains the tag label most frequently assigned to that word
mfc_table = {}
for word in word_counts.keys():
    most_frequent_tag = max(word_counts[word], key=word_counts[word].get)
    mfc_table.update({word:most_frequent_tag})

## Helper Functions
These helper functions interface with Pomegranate network models and the mocked MFCTagger. They take advantage of the [missing value](http://pomegranate.readthedocs.io/en/latest/nan.html) functionality in Pomegranate through a simple sequence decoding function.

In [21]:
def replace_unknown(sequence):
    """Return a copy of the input sequence where each unknown word is replaced
    by the literal string value 'nan'. Pomegranate will ignore these values
    during computation.
    """
    return [w if w in data.training_set.vocab else 'nan' for w in sequence]

def simplify_decoding(X, model):
    """X should be a 1-D sequence of observations for the model to predict"""
    _, state_path = model.viterbi(replace_unknown(X))
    return [state[1].name for state in state_path[1:-1]]  # do not show the start/end state predictions
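
The sketch below walks through the same decoding path on toy data, so the effect of an out-of-vocabulary word is visible. `toy_vocab`, `toy_table`, `replace_unknown_toy` and `toy_viterbi` are hypothetical stand-ins for the training vocabulary and the model; only the shape of the `viterbi()` return value is taken from the `MFCTagger` class above.

```python
from collections import namedtuple

FakeState = namedtuple("FakeState", "name")

# Hypothetical toy vocabulary and lookup table standing in for the training set
toy_vocab = {"the", "dog", "runs"}
toy_table = {"the": "DET", "dog": "NOUN", "runs": "VERB"}

def replace_unknown_toy(sequence, vocab):
    """Same idea as replace_unknown, with the vocabulary passed in explicitly."""
    return [w if w in vocab else "nan" for w in sequence]

def toy_viterbi(seq):
    """Mimic the (score, [(index, state), ...]) shape returned by viterbi()."""
    states = [FakeState(toy_table.get(w, "<MISSING>")) for w in seq]
    return 0.0, list(enumerate(["<start>"] + states + ["<end>"]))

observed = replace_unknown_toy(("the", "zebra", "runs"), toy_vocab)
_, state_path = toy_viterbi(observed)
decoded = [state.name for _, state in state_path[1:-1]]  # drop start/end entries
print(observed, decoded)
```

The unknown word "zebra" becomes `'nan'`, and the lookup-based tagger then reports it as `<MISSING>`.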

## MFC Model & Example Sentence Decoding

In [22]:
mfc_model = MFCTagger(mfc_table) # Create a Most Frequent Class tagger instance

In [23]:
for key in data.testing_set.keys[:3]:
    print("Sentence Key: {}\n".format(key))
    print("Sentence: {}\n".format(data.sentences[key]))
    print("Predicted labels:\n-----------------")
    print(simplify_decoding(data.sentences[key].words, mfc_model))
    print()
    print("Actual labels:\n--------------")
    print(data.sentences[key].tags)
    print("\n")

Sentence Key: b100-28144

Sentence: Sentence(words=('and', 'August', '15', ',', 'November', '15', ',', 'February', '17', ',', 'and', 'May', '15', ',', '(', 'Cranston', ')', '.'), tags=('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.'))

Predicted labels:
-----------------
['CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.']

Actual labels:
--------------
('CONJ', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'NOUN', 'NUM', '.', 'CONJ', 'NOUN', 'NUM', '.', '.', 'NOUN', '.', '.')


Sentence Key: b100-23146

Sentence: Sentence(words=('She', 'had', 'the', 'opportunity', 'that', 'few', 'clever', 'women', 'can', 'resist', ',', 'of', 'showing', 'her', 'superiority', 'in', 'argument', 'over', 'a', 'man', '.'), tags=('PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'VERB', '.', 'ADP', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'DET', 'NOUN', '.')

## Accuracy

In [24]:
def accuracy(X, Y, model):
    """Calculate the prediction accuracy by using the model to decode each sequence
    in the input X and comparing the prediction with the true labels in Y.
    
    The X should be an array whose first dimension is the number of sentences to test,
    and each element of the array should be an iterable of the words in the sequence.
    The arrays X and Y should have the exact same shape.
    
    X = [("See", "Spot", "run"), ("Run", "Spot", "run", "fast"), ...]
    Y = [(), (), ...]
    """
    correct = total_predictions = 0
    for observations, actual_tags in zip(X, Y):
        
        # The model.viterbi call in simplify_decoding will return None if the HMM
        # raises an error (for example, if a test sentence contains a word that
        # is out of vocabulary for the training set). Any exception counts the
        # full sentence as an error (which makes this a conservative estimate).
        try:
            most_likely_tags = simplify_decoding(observations, model)
            correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))
        except Exception:
            pass
        total_predictions += len(observations)
    return correct / total_predictions
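
The scoring rule is easier to see on a toy example: accuracy is counted per tag, and a sentence that fails to decode contributes zero correct tags while still adding to the total. `token_accuracy` is a hypothetical simplification that takes predictions directly instead of calling a model.

```python
def token_accuracy(predicted, actual):
    """Fraction of tags that match; a None prediction counts the whole sentence as wrong."""
    correct = total = 0
    for pred, gold in zip(predicted, actual):
        if pred is not None:
            correct += sum(p == t for p, t in zip(pred, gold))
        total += len(gold)
    return correct / total

predicted = [["VERB", "NOUN", "VERB"], None, ["DET", "NOUN"]]
actual = [("VERB", "NOUN", "VERB"), ("VERB", "ADV"), ("DET", "ADJ")]
score = token_accuracy(predicted, actual)  # 4 correct tags out of 7
print("{:.2f}%".format(100 * score))
```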

In [25]:
mfc_training_acc = accuracy(data.training_set.X, data.training_set.Y, mfc_model)
print("training accuracy mfc_model: {:.2f}%".format(100 * mfc_training_acc))

mfc_testing_acc = accuracy(data.testing_set.X, data.testing_set.Y, mfc_model)
print("testing accuracy mfc_model: {:.2f}%".format(100 * mfc_testing_acc))

training accuracy mfc_model: 95.71%
testing accuracy mfc_model: 93.13%


# HMM Tagger
Each part of speech corresponds to a hidden state, parameterized by: <br>
- the emission probability (how likely the tag is to produce each word)
- the transition probability (how likely one tag is to follow another)
- the starting probability (how likely each tag is to begin a sentence)
- the terminal probability (how likely each tag is to end a sentence)
<br>

Predictions are then made by choosing the tag sequence that maximizes the product of emission and transition probabilities:
$$\hat{t}_1^n = \underset{t_1^n}{\mathrm{argmax}} \prod_{i=1}^n P(w_i|t_i) P(t_i|t_{i-1})$$
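
The maximization over tag sequences is usually solved with the Viterbi algorithm. Below is a minimal pure-Python sketch on a toy model. It is not pomegranate's implementation, all probabilities are made-up illustrative values, the starting probability plays the role of $P(t_1|t_0)$, and the terminal probability is left out to match the formula above.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a bigram HMM."""
    def log(p):
        # work in log space to avoid underflow; log(0) is treated as -inf
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-prob of the best path ending in tag t at position i, previous tag)
    best = [{t: (log(start_p[t]) + log(emit_p[t].get(words[0], 0)), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # pick the predecessor tag that maximizes path score + transition score
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p][t])) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + log(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # trace back from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy model with made-up probabilities
tags = ("DET", "NOUN", "VERB")
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "runs": 0.5},
    "VERB": {"runs": 0.6, "dog": 0.4},
}

path = viterbi(("the", "dog", "runs"), tags, start_p, trans_p, emit_p)
print(path)
```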

## Unigram (Emissions) Counts 
Count how often each tag occurs. These counts are used to estimate the HMM unigram probabilities:
$$ P(tag_1) = \frac{C(tag_1)}{N} $$

where $P(tag_1)$ is the probability of the tag, $C(tag_1)$ is the number of times it occurs, and $N$ is the total number of samples.

In [26]:
def unigram_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequence list that
    counts the number of occurrences of the value in the sequences list. The sequences
    collection should be a 2-dimensional array.
    
    For example, if the tag NOUN appears 275558 times over all the input sequences,
    then you should return a dictionary such that your_unigram_counts[NOUN] == 275558.
    """
    
    unigram_counts_dict = {}
    
    # loop through the sequences
    for sequence in sequences:
        for value in sequence:
            # increment the count, starting from 0 the first time a value is seen
            unigram_counts_dict[value] = unigram_counts_dict.get(value, 0) + 1

    return unigram_counts_dict

# Call unigram_counts with a list of tag sequences from the training set
tag_unigrams = unigram_counts(data.training_set.Y)

In [27]:
tag_unigrams

{'ADV': 44877,
 'NOUN': 220632,
 '.': 117757,
 'VERB': 146161,
 'ADP': 115808,
 'ADJ': 66754,
 'CONJ': 30537,
 'DET': 109671,
 'PRT': 23906,
 'NUM': 11878,
 'PRON': 39383,
 'X': 1094}
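
As a quick check, the counts above can be turned into unigram probabilities by dividing by their total. The sketch copies the printed counts into a literal so that it runs on its own.

```python
# Tag counts copied from the tag_unigrams output above
tag_unigrams = {
    'ADV': 44877, 'NOUN': 220632, '.': 117757, 'VERB': 146161,
    'ADP': 115808, 'ADJ': 66754, 'CONJ': 30537, 'DET': 109671,
    'PRT': 23906, 'NUM': 11878, 'PRON': 39383, 'X': 1094,
}

N = sum(tag_unigrams.values())  # total number of tagged samples
tag_probs = {tag: count / N for tag, count in tag_unigrams.items()}

print(N)
print("P(NOUN) = {:.4f}".format(tag_probs['NOUN']))
```

The total matches the 928458 training samples reported earlier, and the probabilities sum to 1.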

## Bigram (Transmission) Counts 

Count how often each ordered pair of tags occurs in sequence. These counts are used to estimate the bigram (transition) probability: $$P(tag_2|tag_1) = \frac{C(tag_1, tag_2)}{C(tag_1)}$$


In [28]:
def bigram_counts(sequences):
    """Return a dictionary keyed to each unique PAIR of values in the input sequences
    list that counts the number of occurrences of pair in the sequences list. The input
    should be a 2-dimensional array.
    
    For example, if the pair of tags (NOUN, VERB) appear 61582 times, then you should
    return a dictionary such that your_bigram_counts[(NOUN, VERB)] == 61582
    """

    bigram_counts_dict = {}
    
    # loop through the sequences
    for sequence in sequences:
        # pair each element with its successor: (s[0], s[1]), (s[1], s[2]), ...
        for pair in zip(sequence, sequence[1:]):
            # increment the count, starting from 0 the first time a pair is seen
            bigram_counts_dict[pair] = bigram_counts_dict.get(pair, 0) + 1

    return bigram_counts_dict

# Call bigram_counts with a list of tag sequences from the training set
tag_bigrams = bigram_counts(data.training_set.Y)

In [29]:
tag_bigrams

{('ADV', 'NOUN'): 1478,
 ('NOUN', '.'): 62639,
 ('.', 'ADV'): 5124,
 ('ADV', '.'): 7577,
 ('.', 'VERB'): 9041,
 ('VERB', 'ADP'): 24927,
 ('ADP', 'ADJ'): 9533,
 ('ADJ', 'NOUN'): 43664,
 ('NOUN', 'CONJ'): 13185,
 ('CONJ', 'VERB'): 6012,
 ('VERB', 'ADJ'): 8423,
 ('.', 'DET'): 8008,
 ('DET', 'VERB'): 7062,
 ('ADJ', 'PRT'): 1301,
 ('PRT', 'ADP'): 2189,
 ('ADP', 'NUM'): 3467,
 ('NUM', 'NOUN'): 4524,
 ('.', 'PRON'): 5448,
 ('PRON', 'VERB'): 27860,
 ('VERB', 'PRT'): 9556,
 ('PRT', 'VERB'): 14886,
 ('VERB', 'NOUN'): 14230,
 ('NOUN', 'NUM'): 1783,
 ('NUM', '.'): 3210,
 ('.', 'NUM'): 1412,
 ('.', '.'): 12588,
 ('ADP', 'ADV'): 1805,
 ('ADV', 'NUM'): 597,
 ('DET', 'NOUN'): 68785,
 ('CONJ', 'DET'): 4636,
 ('NOUN', 'VERB'): 34972,
 ('ADP', 'NOUN'): 29965,
 ('ADP', 'DET'): 52841,
 ('NOUN', 'ADP'): 53884,
 ('CONJ', 'NOUN'): 7502,
 ('.', 'NOUN'): 9782,
 ('VERB', '.'): 11699,
 ('VERB', 'VERB'): 26957,
 ('.', 'ADP'): 7595,
 ('ADV', 'DET'): 3309,
 ('DET', 'ADJ'): 26236,
 ('NOUN', 'DET'): 3425,
 ('ADJ', '.'
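
A transition probability can be estimated by dividing a pair count by the count of the first tag in the pair. The sketch below does this for a few pairs starting with `DET`, using counts copied from the outputs above; `transition_probability` is a hypothetical helper name.

```python
# A few counts copied from the tag_bigrams and tag_unigrams outputs above
tag_bigrams = {('DET', 'NOUN'): 68785, ('DET', 'ADJ'): 26236, ('DET', 'VERB'): 7062}
tag_unigrams = {'DET': 109671}

def transition_probability(first, second):
    """Estimate P(second | first) as C(first, second) / C(first)."""
    # pairs that were never observed get probability 0
    return tag_bigrams.get((first, second), 0) / tag_unigrams[first]

for second in ('NOUN', 'ADJ', 'VERB'):
    print("P({} | DET) = {:.4f}".format(second, transition_probability('DET', second)))
```

Because $C(tag_1)$ also counts occurrences at the end of a sentence, which have no successor, the transition probabilities out of a tag can sum to slightly less than 1.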

## Sequence Start Counts

In [30]:
def starting_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequences list
    that counts the number of occurrences where that value is at the beginning of
    a sequence.
    
    For example, if 8093 sequences start with NOUN, then you should return a
    dictionary such that your_starting_counts[NOUN] == 8093
    """
    starting_count_dict = {}
    
    for sequence in sequences:
        
        starting_tag = sequence[0]
        
        if starting_tag in starting_count_dict.keys():
            
            count = starting_count_dict[starting_tag]
            starting_count_dict.update({starting_tag:count+1})
            
        else:
            
            starting_count_dict.update({starting_tag:1})
            
    return starting_count_dict

# Calculate the count of each tag starting a sequence
tag_starts = starting_counts(data.training_set.Y)

In [32]:
tag_starts

{'ADV': 4185,
 'ADP': 5583,
 'ADJ': 1582,
 'PRT': 1718,
 'DET': 9763,
 'PRON': 7318,
 'NOUN': 6469,
 'CONJ': 2282,
 '.': 4107,
 'NUM': 760,
 'VERB': 2080,
 'X': 25}
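
Every sentence contributes exactly one starting tag, so dividing each count by the number of sentences gives the starting probabilities. A short sketch using the counts printed above:

```python
# Start counts copied from the tag_starts output above
tag_starts = {
    'ADV': 4185, 'ADP': 5583, 'ADJ': 1582, 'PRT': 1718, 'DET': 9763, 'PRON': 7318,
    'NOUN': 6469, 'CONJ': 2282, '.': 4107, 'NUM': 760, 'VERB': 2080, 'X': 25,
}

n_sentences = sum(tag_starts.values())  # one start tag per sentence
start_probs = {tag: count / n_sentences for tag, count in tag_starts.items()}

print(n_sentences)
print(max(start_probs, key=start_probs.get))
```

The total equals the 45872 training sentences reported earlier, and `DET` is the most likely tag to open a sentence.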

## Sequence End Counts

In [33]:
def ending_counts(sequences):
    """Return a dictionary keyed to each unique value in the input sequences list
    that counts the number of occurrences where that value is at the end of
    a sequence.
    
    For example, if 18 sequences end with DET, then you should return a
    dictionary such that your_ending_counts[DET] == 18
    """
    ending_count_dict = {}
    
    for sequence in sequences:
        
        ending_tag = sequence[-1]
        
        if ending_tag in ending_count_dict.keys():
            
            count = ending_count_dict[ending_tag]
            ending_count_dict.update({ending_tag:count+1})
            
        else:
            
            ending_count_dict.update({ending_tag:1})
            
    return ending_count_dict

# Calculate the count of each tag ending a sequence
tag_ends = ending_counts(data.training_set.Y)

In [34]:
tag_ends

{'.': 44936,
 'NOUN': 722,
 'NUM': 63,
 'VERB': 75,
 'ADJ': 25,
 'ADV': 16,
 'ADP': 7,
 'DET': 14,
 'CONJ': 2,
 'PRON': 4,
 'PRT': 7,
 'X': 1}
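
One common way to turn these into terminal probabilities is to divide each end count by the total count of that tag, giving an estimate of how likely a sentence is to end after a given tag. The sketch below assumes that estimate and copies the counts printed above.

```python
# Counts copied from the tag_ends and tag_unigrams outputs above
tag_ends = {
    '.': 44936, 'NOUN': 722, 'NUM': 63, 'VERB': 75, 'ADJ': 25, 'ADV': 16,
    'ADP': 7, 'DET': 14, 'CONJ': 2, 'PRON': 4, 'PRT': 7, 'X': 1,
}
tag_unigrams = {
    'ADV': 44877, 'NOUN': 220632, '.': 117757, 'VERB': 146161,
    'ADP': 115808, 'ADJ': 66754, 'CONJ': 30537, 'DET': 109671,
    'PRT': 23906, 'NUM': 11878, 'PRON': 39383, 'X': 1094,
}

# P(end | tag) = C(tag at end of a sentence) / C(tag)
end_probs = {tag: tag_ends.get(tag, 0) / tag_unigrams[tag] for tag in tag_unigrams}

print(sum(tag_ends.values()))
print("P(end | .) = {:.4f}".format(end_probs['.']))
```

As expected, the end counts sum to the number of training sentences, and punctuation is by far the most likely tag to close a sentence.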