## Introduction


Part-of-speech (POS) tagging annotates each word in a sentence with its word class, such as noun, verb, adjective or adverb. It makes these word classes available to a machine when a sentence is given as input. There are three main approaches to POS tagging:
- Rule based
- Stochastic / probabilistic
- Neural based

This notebook explains the probabilistic approach to POS tagging, using conditional probabilities under a Markov assumption (a Hidden Markov Model).
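
Written out, the model scores a sentence as a product of tag-to-tag (transition) and tag-to-word (emission) probabilities. The notation below is an assumption made for illustration: $w_i$ are the words, $t_i$ their tags, $t_0$ is the START marker and $t_{n+1}$ is the END marker. It corresponds to the products of `cpd_tags` and `cpd_tagwords` terms computed at the end of this notebook.

```latex
P(w_1,\dots,w_n,\; t_1,\dots,t_n)
  \;\approx\;
  \left[\prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)\right]
  \cdot P(t_{n+1} \mid t_n)
```

The Markov assumption is that each tag depends only on the tag directly before it, and each word depends only on its own tag.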


First, import the necessary packages.
This example uses the Brown corpus, which contains about one million words from 500 different text sources.

In [5]:
!pip install nltk
import nltk
nltk.download('brown')
import sys
from nltk.corpus import brown



[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\abhis\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


Next, every sentence is preprocessed by adding the special markers START and END at its beginning and end. Each word is stored together with its tag as a (tag, word) tuple, and every tag is shortened to its first two characters. Running the code below produces a flat list called brown_tags_words that contains all sentences of the corpus, each wrapped in START and END, with every word paired with its tag.

In [7]:
brown_tags_words = [ ]
for sent in brown.tagged_sents():
    # sent is a list of word/tag pairs
    # add START/START at the beginning
    brown_tags_words.append( ("START", "START") )
    # then the sentence itself, swapped from (word, tag) to (tag, word) order,
    # with each tag shortened to its first 2 characters
    brown_tags_words.extend([ (tag[:2], word) for (word, tag) in sent ])
    # then END/END
    brown_tags_words.append( ("END", "END") )


The beginning of the resulting list looks like this:

In [15]:
brown_tags_words

[('START', 'START'),
 ('AT', 'The'),
 ('NP', 'Fulton'),
 ('NN', 'County'),
 ('JJ', 'Grand'),
 ('NN', 'Jury'),
 ('VB', 'said'),
 ('NR', 'Friday'),
 ('AT', 'an'),
 ('NN', 'investigation'),
 ('IN', 'of'),
 ('NP', "Atlanta's"),
 ('JJ', 'recent'),
 ('NN', 'primary'),
 ('NN', 'election'),
 ('VB', 'produced'),
 ('``', '``'),
 ('AT', 'no'),
 ('NN', 'evidence'),
 ("''", "''"),
 ('CS', 'that'),
 ('DT', 'any'),
 ('NN', 'irregularities'),
 ('VB', 'took'),
 ('NN', 'place'),
 ('.', '.'),
 ('END', 'END'),
 ('START', 'START'),
 ('AT', 'The'),
 ('NN', 'jury'),
 ('RB', 'further'),
 ('VB', 'said'),
 ('IN', 'in'),
 ('NN', 'term-end'),
 ('NN', 'presentments'),
 ('CS', 'that'),
 ('AT', 'the'),
 ('NN', 'City'),
 ('JJ', 'Executive'),
 ('NN', 'Committee'),
 (',', ','),
 ('WD', 'which'),
 ('HV', 'had'),
 ('JJ', 'over-all'),
 ('NN', 'charge'),
 ('IN', 'of'),
 ('AT', 'the'),
 ('NN', 'election'),
 (',', ','),
 ('``', '``'),
 ('VB', 'deserves'),
 ('AT', 'the'),
 ('NN', 'praise'),
 ('CC', 'and'),
 ('NN', 'thanks'),
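
The padding and tag-shortening step can be checked on a small input without downloading the corpus. The following is a minimal sketch: `flatten_tagged_sents` and `toy_sents` are hypothetical names introduced here, and the toy sentences only imitate the (word, tag) shape returned by `brown.tagged_sents()`.

```python
def flatten_tagged_sents(tagged_sents):
    """Flatten sentences of (word, tag) pairs into one list of (tag, word)
    pairs, wrapping each sentence in START/END markers."""
    pairs = []
    for sent in tagged_sents:
        pairs.append(("START", "START"))
        # swap to (tag, word) and keep only the first two characters of the tag
        pairs.extend((tag[:2], word) for (word, tag) in sent)
        pairs.append(("END", "END"))
    return pairs


# Hypothetical toy input in the same (word, tag) shape as brown.tagged_sents()
toy_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")],
    [("I", "PPSS"), ("want", "VB")],
]
toy_pairs = flatten_tagged_sents(toy_sents)
print(toy_pairs)
```

Note how the longer tags `VBD` and `PPSS` collapse to `VB` and `PP`, which is why the tag set used in the rest of the notebook is coarser than the full Brown tag set.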


Great! Now that the data is ready, the conditional frequency distribution (CFD) is calculated with the tag (i.e. the word class) as the condition. A CFD counts how often each event occurs under each condition, so every data point must be paired with its condition, which in this case is the part-of-speech tag of each word. The CFD is then used to estimate the conditional probability distribution (CPD), which gives the probability of each event given a condition. Here the CPD is a maximum likelihood estimate (MLEProbDist), so each probability is simply a relative frequency.

In [19]:
# conditional frequency distribution
cfd_tagwords = nltk.ConditionalFreqDist(brown_tags_words)
# conditional probability distribution
cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords, nltk.MLEProbDist)

cpd_tagwords

<ConditionalProbDist with 51 conditions>

Now, let's test this with a couple of examples.
The first line below finds the probability that an adjective (JJ) is the word "new", i.e. P(new | JJ). It is calculated as (number of times "new" is tagged JJ) / (total number of JJ tokens).

In [31]:
print("The probability of an adjective (JJ) being 'new' is", cpd_tagwords["JJ"].prob("new"))
print("The probability of a verb (VB) being 'duck' is", cpd_tagwords["VB"].prob("duck"))

The probability of an adjective (JJ) being 'new' is 0.01472344917632025
The probability of a verb (VB) being 'duck' is 6.042713350943527e-05
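
To make the count-based estimate concrete, here is a minimal plain-Python sketch of the same relative-frequency calculation on a hypothetical toy list. The names `toy_pairs`, `cfd` and `mle_prob` are illustrative and not part of NLTK; the sketch only mirrors the idea behind `ConditionalFreqDist` with `MLEProbDist`.

```python
from collections import Counter, defaultdict

# Hypothetical toy data in the same (tag, word) shape as brown_tags_words
toy_pairs = [
    ("JJ", "new"), ("NN", "car"), ("JJ", "old"), ("NN", "car"),
    ("JJ", "new"), ("NN", "duck"), ("VB", "duck"), ("JJ", "red"),
]

# Conditional frequency distribution: one word counter per tag
cfd = defaultdict(Counter)
for tag, word in toy_pairs:
    cfd[tag][word] += 1


def mle_prob(cfd, tag, word):
    """P(word | tag) = count(tag, word) / count(tag); 0.0 for an unseen tag."""
    total = sum(cfd[tag].values()) if tag in cfd else 0
    return cfd[tag][word] / total if total else 0.0


print(mle_prob(cfd, "JJ", "new"))   # 2 of the 4 JJ tokens are "new"
print(mle_prob(cfd, "NN", "duck"))  # 1 of the 3 NN tokens is "duck"
```

A word that never occurs with a tag gets probability 0 under this estimate, which is worth keeping in mind when multiplying many such factors together.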


In [32]:
# Estimating P(ti | t{i-1}) from corpus data using Maximum Likelihood Estimation (MLE):
# P(ti | t{i-1}) = count(t{i-1}, ti) / count(t{i-1})
brown_tags = [tag for (tag, word) in brown_tags_words ]
brown_tags

['START',
 'AT',
 'NP',
 'NN',
 'JJ',
 'NN',
 'VB',
 'NR',
 'AT',
 'NN',
 'IN',
 'NP',
 'JJ',
 'NN',
 'NN',
 'VB',
 '``',
 'AT',
 'NN',
 "''",
 'CS',
 'DT',
 'NN',
 'VB',
 'NN',
 '.',
 'END',
 'START',
 'AT',
 'NN',
 'RB',
 'VB',
 'IN',
 'NN',
 'NN',
 'CS',
 'AT',
 'NN',
 'JJ',
 'NN',
 ',',
 'WD',
 'HV',
 'JJ',
 'NN',
 'IN',
 'AT',
 'NN',
 ',',
 '``',
 'VB',
 'AT',
 'NN',
 'CC',
 'NN',
 'IN',
 'AT',
 'NN',
 'IN',
 'NP',
 "''",
 'IN',
 'AT',
 'NN',
 'IN',
 'WD',
 'AT',
 'NN',
 'BE',
 'VB',
 '.',
 'END',
 'START',
 'AT',
 'NP',
 'NN',
 'NN',
 'HV',
 'BE',
 'VB',
 'IN',
 'NP',
 'JJ',
 'NN',
 'NN',
 'NP',
 'NP',
 'TO',
 'VB',
 'NN',
 'IN',
 'JJ',
 '``',
 'NN',
 "''",
 'IN',
 'AT',
 'JJ',
 'NN',
 'WD',
 'BE',
 'VB',
 'IN',
 'NN',
 'NP',
 'NP',
 'NP',
 '.',
 'END',
 'START',
 '``',
 'RB',
 'AT',
 'JJ',
 'NN',
 'IN',
 'JJ',
 'NN',
 'BE',
 'VB',
 "''",
 ',',
 'AT',
 'NN',
 'VB',
 ',',
 '``',
 'IN',
 'AT',
 'JJ',
 'NN',
 'IN',
 'AT',
 'NN',
 ',',
 'AT',
 'NN',
 'IN',
 'NN',
 'CC',
 'AT',
 'NN',

Next, only the tags are considered, in order to find the probability of a particular sequence of tags in a sentence. Bigrams are used here, which means that only two consecutive tags are considered at a time.

In [30]:
# make conditional frequency distribution:
# count(t{i-1} ti)
cfd_tags = nltk.ConditionalFreqDist(nltk.bigrams(brown_tags))
# make conditional probability distribution, using
# maximum likelihood estimate:
# P(ti | t{i-1})
cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)
cfd_tags

<ConditionalFreqDist with 51 conditions>

The frequency and probability of each tag bigram have now been calculated. Let us test this with an example. In the first case, we want the probability of NN occurring directly after DT, and the result is just above 50%.

In [23]:
print("If we have just seen 'DT', the probability of 'NN' is", cpd_tags["DT"].prob("NN"))
print( "If we have just seen 'VB', the probability of 'DT' is", cpd_tags["VB"].prob("DT"))
print( "If we have just seen 'VB', the probability of 'NN' is", cpd_tags["VB"].prob("NN"))

If we have just seen 'DT', the probability of 'NN' is 0.5057722522030194
If we have just seen 'VB', the probability of 'DT' is 0.016885067592065053
If we have just seen 'VB', the probability of 'NN' is 0.10970977711020183
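
The bigram estimate P(ti | t{i-1}) = count(t{i-1}, ti) / count(t{i-1}) can also be reproduced by hand. The sketch below uses a hypothetical toy tag stream and plain Python; `toy_tags`, `transitions` and `transition_prob` are illustrative names, not NLTK functions.

```python
from collections import Counter, defaultdict

# Hypothetical toy tag stream in the same shape as brown_tags
toy_tags = ["START", "AT", "NN", "VB", "END",
            "START", "AT", "JJ", "NN", "END"]

# count(t_{i-1}, t_i): pair each tag with its successor
transitions = defaultdict(Counter)
for prev_tag, tag in zip(toy_tags, toy_tags[1:]):
    transitions[prev_tag][tag] += 1


def transition_prob(prev_tag, tag):
    """P(tag | prev_tag) = count(prev_tag, tag) / count(prev_tag as left context)."""
    total = sum(transitions[prev_tag].values()) if prev_tag in transitions else 0
    return transitions[prev_tag][tag] / total if total else 0.0


print(transition_prob("AT", "NN"))     # AT is followed by NN once and JJ once
print(transition_prob("START", "AT"))  # both toy sentences start with AT
```

Because the tags are kept in one flat list, the pair (END, START) at each sentence boundary is counted as a bigram too. It does not affect the probabilities used in this notebook, since no sequence is scored across a sentence boundary.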


To sum up, we have built a pipeline that calculates the probability of a word given a tag (such as noun) and the probability of a tag occurring after another tag. Together, these give the probability of an entire tagged sequence of words, as shown in the examples below for "I want to race" and "I saw her duck".

In [24]:
prob_tagsequence = cpd_tags["START"].prob("PP") * cpd_tagwords["PP"].prob("I") * \
    cpd_tags["PP"].prob("VB") * cpd_tagwords["VB"].prob("want") * \
    cpd_tags["VB"].prob("TO") * cpd_tagwords["TO"].prob("to") * \
    cpd_tags["TO"].prob("VB") * cpd_tagwords["VB"].prob("race") * \
    cpd_tags["VB"].prob("END")

In [25]:
prob_tagsequence

1.0817766461150474e-14

In [26]:
prob_tagsequence = cpd_tags["START"].prob("PP") * cpd_tagwords["PP"].prob("I") * \
    cpd_tags["PP"].prob("VB") * cpd_tagwords["VB"].prob("saw") * \
    cpd_tags["VB"].prob("PP") * cpd_tagwords["PP"].prob("her") * \
    cpd_tags["PP"].prob("VB") * cpd_tagwords["VB"].prob("duck") * \
    cpd_tags["VB"].prob("END")

In [27]:
prob_tagsequence

7.285965712199413e-16
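
The two hand-written products above follow the same pattern, so they can be wrapped in a small helper. This is a sketch under stated assumptions rather than part of the original notebook: `sequence_log_prob` is a hypothetical function, the toy probability tables are made up, and logs are summed instead of multiplying raw probabilities because values around 1e-14 to 1e-16 for four-word sentences suggest that longer sentences could underflow.

```python
import math


def sequence_log_prob(tagged_words, transition, emission):
    """Log of prod_i P(t_i | t_{i-1}) * P(w_i | t_i), closed with P(END | t_n).

    tagged_words: list of (tag, word) pairs without START/END markers.
    transition(prev_tag, tag) and emission(tag, word) return probabilities.
    Returns -inf if any factor is zero.
    """
    log_prob = 0.0
    prev_tag = "START"
    for tag, word in tagged_words:
        for p in (transition(prev_tag, tag), emission(tag, word)):
            if p == 0.0:
                return float("-inf")
            log_prob += math.log(p)
        prev_tag = tag
    p_end = transition(prev_tag, "END")
    if p_end == 0.0:
        return float("-inf")
    return log_prob + math.log(p_end)


# Hypothetical toy tables; the numbers are made up for illustration
toy_transition = {("START", "PP"): 0.5, ("PP", "VB"): 0.4, ("VB", "END"): 0.1}
toy_emission = {("PP", "I"): 0.2, ("VB", "run"): 0.05}

log_p = sequence_log_prob(
    [("PP", "I"), ("VB", "run")],
    lambda prev, tag: toy_transition.get((prev, tag), 0.0),
    lambda tag, word: toy_emission.get((tag, word), 0.0),
)
print(math.exp(log_p))  # product of the five factors in the toy tables
```

With the objects from this notebook, the two callables would be `lambda prev, tag: cpd_tags[prev].prob(tag)` and `lambda tag, word: cpd_tagwords[tag].prob(word)`, and the input for the second example would be `[("PP", "I"), ("VB", "saw"), ("PP", "her"), ("VB", "duck")]`.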