# Python Tutorial: Tokens and N-grams

**(C) 2017-2020 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**Version:** 1.2, August 2020

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

## Introduction

This is a tutorial about frequency profiles using Python 3.x and the [NLTK](http://nltk.org/).

This tutorial was developed as part of the course material for the course Advanced Natural Language Processing in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is distributed as part of the [NLTK Data](http://www.nltk.org/data.html). To use the [NLTK Data](http://www.nltk.org/data.html) and the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) on your local machine, you need to install the data as described on [the Installing NLTK Data page](http://www.nltk.org/data.html). If you want to use iPython on your local machine, I recommend installing a Python 3.x distribution, for example the most recent [Anaconda release](https://www.continuum.io/downloads), and reading the instructions on how to run [iPython on Anaconda](http://jupyter.readthedocs.io/en/latest/install.html).

## Using the Brown Corpus

The documentation of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) design and properties can be found on [this page](http://clu.uni.no/icame/brown/bcm.html).

The following line of code imports the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) reader into the running Python instance. This makes the tokens and PoS-tags from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) available for further processing.

In [1]:
from nltk.corpus import brown

In [2]:
print(brown.tagged_words())

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]


In [3]:
tokens, tags = zip(*brown.tagged_words())
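
The `zip(*...)` idiom above transposes a sequence of pairs into two parallel tuples. A minimal sketch with a handful of made-up token-tag pairs (the `toy_` names are hypothetical and stand in for the output of `brown.tagged_words()`) shows the effect:

```python
# Toy (token, tag) pairs standing in for brown.tagged_words()
toy_pairs = [("The", "AT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")]

# zip(*toy_pairs) transposes the list of pairs into two parallel tuples
toy_tokens, toy_tags = zip(*toy_pairs)

print(toy_tokens)  # ('The', 'cat', 'sleeps', '.')
print(toy_tags)    # ('AT', 'NN', 'VBZ', '.')

# The alignment is preserved: zipping the two tuples restores the pairs
print(list(zip(toy_tokens, toy_tags)) == toy_pairs)  # True
```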

You can inspect the resulting tuple of *tokens* by printing it out (here, the first 20):

In [4]:
tokens[:20]

('The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that')

You can print the *tags* in the same way, for example with `tags[:20]`.

The sequence of *tokens* and *tags* is aligned, that is, the first tag in the *tags* list belongs to the first token in the *tokens* list. You can print the token-tag pair out in the following way:

In [5]:
print("Token:", tokens[0], "Tag:", tags[0])

Token: The Tag: AT


To create a frequency profile of tags for example, we can make use of the [*Counter* container datatype](http://docs.python.org/3/library/collections.html#collections.Counter) from the [*collections* module](http://docs.python.org/3/library/collections.html). We import the [*Counter* datatype](http://docs.python.org/3/library/collections.html#collections.Counter) with the following code:

In [6]:
from collections import Counter, defaultdict

We can create a frequency profile of the tags from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) and store it in the variable *tagCounter* using the following code:

In [7]:
tagCounter = Counter(tags)
print(tagCounter)

Counter({'NN': 152470, 'IN': 120557, 'AT': 97959, 'JJ': 64028, '.': 60638, ',': 58156, 'NNS': 55110, 'CC': 37718, 'RB': 36464, 'NP': 34476, 'VB': 33693, 'VBN': 29186, 'VBD': 26167, 'CS': 22143, 'PPS': 18253, 'VBG': 17893, 'PP$': 16872, 'TO': 14918, 'PPSS': 13802, 'CD': 13510, 'NN-TL': 13372, 'MD': 12431, 'PPO': 11181, 'BEZ': 10066, 'BEDZ': 9806, 'AP': 9522, 'DT': 8957, '``': 8837, "''": 8789, 'QL': 8735, 'VBZ': 7373, 'BE': 6360, 'RP': 6009, 'WDT': 5539, 'HVD': 4895, '*': 4603, 'WRB': 4509, 'BER': 4379, 'JJ-TL': 4107, 'NP-TL': 4019, 'HV': 3928, 'WPS': 3924, '--': 3405, 'BED': 3282, 'ABN': 3010, 'DTI': 2921, 'PN': 2573, 'NP$': 2565, 'BEN': 2470, 'DTS': 2435, 'HVZ': 2433, ')': 2273, '(': 2264, 'NNS-TL': 2226, 'EX': 2164, 'JJR': 1958, 'OD': 1935, 'NR': 1566, ':': 1558, 'NN$': 1480, 'IN-TL': 1477, 'NN-HL': 1471, 'DO': 1353, 'NPS': 1275, 'PPL': 1233, 'RBR': 1182, 'DOD': 1047, 'JJT': 1005, 'CD-TL': 898, 'MD*': 866, 'AT-TL': 746, 'ABX': 730, 'BEG': 686, 'NNS-HL': 609, 'UH': 608, '.-HL': 598, ...})

In [8]:
tokenCounter = Counter(tokens)
print(tokenCounter)



In [9]:
print("Number of types:", len(tokenCounter))
print("Number of tokens:", sum(tokenCounter.values()))

Number of types: 56057
Number of tokens: 1161192


In [10]:
from nltk.util import ngrams

In [11]:
tokenBigrams = defaultdict(Counter)
tokenNgrams = list(ngrams(tokens, 2))
for tok in tokenNgrams:
    tokenBigrams[tok[0]][tok[1]] += 1
tokenBigramCount = Counter(tokenNgrams)
print("Number of bigram types:", len(tokenBigramCount))
print("Number of bigrams:", sum(tokenBigramCount.values()))
#print(tokenBigramCount)

Number of bigram types: 455267
Number of bigrams: 1161191
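
To make the bigram counts above more transparent, here is a small sketch of what an n-gram generator does, written with the standard library only. The helper `simple_ngrams` and the toy token list are hypothetical; the helper only approximates the basic behavior of `nltk.util.ngrams` and ignores its padding options:

```python
from collections import Counter
from itertools import islice


def simple_ngrams(sequence, n):
    """Return an iterator over consecutive n-tuples of a sequence."""
    # n staggered views of the same sequence, zipped position by position
    staggered = [islice(sequence, i, None) for i in range(n)]
    return zip(*staggered)


toy_tokens = ["the", "cat", "saw", "the", "cat", "run"]
toy_bigram_count = Counter(simple_ngrams(toy_tokens, 2))

print(toy_bigram_count[("the", "cat")])  # 2
print(len(toy_bigram_count))             # 4 bigram types
print(sum(toy_bigram_count.values()))    # 5 bigrams, i.e. len(toy_tokens) - 1
```

A sequence of N tokens always yields N - 1 bigrams, which is why the number of bigrams reported above is one less than the number of tokens.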


In [12]:
print(tokenBigrams["the"]["hope"])

17


Count the number of n-grams that occur a certain number of times:

In [13]:
n = 1
countTokCounter = Counter(tokenCounter.values())
print("Number of n-grams with frequency", n, "is", countTokCounter[n])
print("Proportion:", countTokCounter[n]/len(tokens))

Number of n-grams with frequency 1 is 25559
Proportion: 0.022011002487099463
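
The cell above builds a frequency-of-frequencies table: a *Counter* over the values of another *Counter*. A minimal sketch on a made-up token list (all `toy_` names are hypothetical) shows how the two levels relate:

```python
from collections import Counter

toy_tokens = ["a", "b", "a", "c", "d", "a", "b"]

# First level: how often does each type occur?  a: 3, b: 2, c: 1, d: 1
toy_counts = Counter(toy_tokens)

# Second level: how many types share the same frequency?  {3: 1, 2: 1, 1: 2}
toy_count_of_counts = Counter(toy_counts.values())

n = 1
print(toy_count_of_counts[n])                    # 2 types occur exactly once
print(toy_count_of_counts[n] / len(toy_tokens))  # 2 / 7
```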


In [14]:
n = 1
countCounter = Counter(tokenBigramCount.values())
print("Number of n-grams with frequency", n, "is", countCounter[n])
print("Proportion:", countCounter[n]/sum(tokenBigramCount.values()))

Number of n-grams with frequency 1 is 340621
Proportion: 0.29333761629223787


In [15]:
totalTokens = len(tokens)
totalTypes  = len(tokenCounter)
w = "Prateek"
Cw = tokenCounter[w]
print((Cw + 1) / (totalTokens + totalTypes))

8.215246017864874e-07
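
The expression `(Cw + 1) / (totalTokens + totalTypes)` in the cell above is add-one (Laplace) smoothing: every count is increased by one, and the denominator grows by the number of types to compensate. The following sketch wraps it in a function and applies it to a toy corpus; the function name and the toy data are hypothetical:

```python
from collections import Counter


def add_one_probability(word, counts, total_tokens, total_types):
    """Add-one smoothed unigram estimate: (C(w) + 1) / (N + V)."""
    # A Counter returns 0 for unseen words, so they receive 1 / (N + V)
    return (counts[word] + 1) / (total_tokens + total_types)


toy_counts = Counter(["the", "cat", "the", "dog"])
N = sum(toy_counts.values())  # 4 tokens
V = len(toy_counts)           # 3 types

print(add_one_probability("the", toy_counts, N, V))    # (2 + 1) / (4 + 3)
print(add_one_probability("zebra", toy_counts, N, V))  # (0 + 1) / (4 + 3)
```

Note that with V set to the number of observed types, the smoothed estimates of the observed words already sum to one; to reserve probability mass for unseen words, V would have to include them as well.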


In [16]:
smoothedTokenCounter = {}
for token in tokenCounter:
    Cw = tokenCounter[token]
    smoothedTokenCounter[token] = (Cw + 1) / (totalTokens + totalTypes)
print(smoothedTokenCounter)



In [20]:
smoothedBigramCounter = {}
totalBigramTypes = len(tokenBigramCount)
for bigram in tokenBigramCount:
    token1Freq = tokenCounter[bigram[0]]
    smoothedBigramCounter[bigram] = (tokenBigramCount[bigram] + 1) / (token1Freq + totalBigramTypes)
print(list(smoothedBigramCounter.values())[:20])

[4.324090589697854e-06, 1.537501866966553e-05, 4.3922064688416876e-06, 6.589279242672173e-06, 4.392987912693758e-06, 1.0935893790599507e-05, 4.3924476255526246e-06, 1.7436449590134456e-05, 3.514089301794382e-05, 4.070443088082353e-06, 4.392987912693758e-06, 4.391415660666529e-06, 6.588193956430077e-06, 4.392331867026545e-06, 4.3921582406770955e-06, 1.5082826263078965e-05, 3.2819310006826415e-05, 8.782175696206978e-06, 3.663350974882342e-05, 6.87426960885406e-05]


In [21]:
l = 0.5
smoothedBigramCounter = {}
totalBigramTypes = len(tokenBigramCount)
for bigram in tokenBigramCount:
    token1Freq = tokenCounter[bigram[0]]
    smoothedBigramCounter[bigram] = (tokenBigramCount[bigram] + l) / (token1Freq + (l * totalBigramTypes))
print(list(smoothedBigramCounter.values())[:20])

[6.385927119542427e-06, 2.8552539968065083e-05, 6.5870801010897225e-06, 1.0981697902276067e-05, 6.589423974520894e-06, 1.960130936746575e-05, 6.587803340894667e-06, 3.2442884302186e-05, 6.807905075842258e-05, 5.687990944718416e-06, 6.589423974520894e-06, 6.58470898878624e-06, 1.0978081163149656e-05, 6.587456165968762e-06, 6.58693547218447e-06, 2.748757244561161e-05, 6.32043746145078e-05, 1.53620281388464e-05, 6.979031183580243e-05, 0.00013242499595368068]
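
The two cells above add a constant (1, then 0.5) to each bigram count. For comparison, a common textbook formulation of add-λ (Lidstone) smoothing for the conditional probability $P(w_2\ |\ w_1)$ normalizes by the vocabulary size V rather than by the number of bigram types. The sketch below follows that textbook variant, not the cells above; the function name and toy data are hypothetical:

```python
from collections import Counter


def lidstone_bigram_probability(w1, w2, unigram_counts, bigram_counts, lam=0.5):
    """Add-lambda estimate of P(w2 | w1): (C(w1 w2) + lam) / (C(w1) + lam * V)."""
    vocabulary_size = len(unigram_counts)
    numerator = bigram_counts[(w1, w2)] + lam
    denominator = unigram_counts[w1] + lam * vocabulary_size
    return numerator / denominator


toy_tokens = ["the", "cat", "saw", "the", "dog"]
toy_unigrams = Counter(toy_tokens)
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))

# Seen bigram versus unseen bigram with the same history "the"
print(lidstone_bigram_probability("the", "cat", toy_unigrams, toy_bigrams))
print(lidstone_bigram_probability("the", "saw", toy_unigrams, toy_bigrams))

# For a history that is never the last token, the estimates sum to one
print(sum(lidstone_bigram_probability("the", w, toy_unigrams, toy_bigrams)
          for w in toy_unigrams))
```

With `lam=1` this reduces to add-one smoothing; smaller values of `lam` move less probability mass to unseen bigrams.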


In [22]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [23]:
tokensTraining = list(brown.words(categories=['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'religion',
 'reviews',
 'romance',
 'science_fiction']))
TBigrams = Counter(ngrams(tokensTraining, 2))
TBigramsCount = Counter(TBigrams.values())
print("N of tokens:", len(tokensTraining))
print(TBigramsCount[1])
NrBigrams = [ b for b in TBigrams if TBigrams[b] == 1 ]

N of tokens: 1060638
313186


In [24]:
tokensNews = list(brown.words(categories='news'))
HOBigrams = Counter(ngrams(tokensNews, 2))
HOBigramsCount = Counter(HOBigrams.values())
print("N of tokens:", len(tokensNews))
print(HOBigramsCount[1])
res = 0
for b in NrBigrams:
    res += HOBigrams[b]
print("Tr/Nr =", res/TBigramsCount[1])

N of tokens: 100554
51998
Tr/Nr = 0.027143614337805648
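
The last two cells compare training data with held-out data: for all bigrams that occur exactly once in the training categories (Nr of them), they sum up how often these bigrams occur in the *news* category (Tr) and report the average Tr/Nr. A compact sketch of the same computation on two made-up token lists (the function name and data are hypothetical) could look like this:

```python
from collections import Counter


def heldout_average_count(train_tokens, heldout_tokens, r=1):
    """Average held-out count of bigrams seen exactly r times in training."""
    train_bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    heldout_bigrams = Counter(zip(heldout_tokens, heldout_tokens[1:]))
    # Nr: the bigram types with training frequency r
    bigrams_with_r = [b for b, count in train_bigrams.items() if count == r]
    n_r = len(bigrams_with_r)
    # Tr: how often those same bigrams occur in the held-out data
    t_r = sum(heldout_bigrams[b] for b in bigrams_with_r)
    return t_r / n_r if n_r else 0.0


toy_train = "a b c a b d".split()
toy_heldout = "b c b c a b".split()

# Training bigrams with count 1: (b, c), (c, a), (b, d); held-out counts: 2, 1, 0
print(heldout_average_count(toy_train, toy_heldout, r=1))  # 1.0
```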


In [25]:
newCount = 0
tokenCountDict = {}
for t in tokens:
    if t not in tokenCountDict:
        newCount += 1
    tokenCountDict[t] = tokenCountDict.get(t, 0) + 1
print("New events:", newCount)
print("Types:", len(tokenCountDict))

New events: 56057
Types: 56057


The *tagCounter* variable now contains a hash table with *tags* as keys and their frequencies as values. The frequency of a specific *tag* can be accessed using the following code:

In [17]:
tagCounter["NNS"]

55110
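
Besides direct lookup, a *Counter* offers a few conveniences that are useful for frequency profiles. The following sketch uses a short made-up tag list (the `toy_` names are hypothetical) to show lookup of missing keys, the `most_common` method, and the conversion to relative frequencies:

```python
from collections import Counter

toy_tags = ["AT", "NN", "VBZ", "AT", "JJ", "NN", "NN"]
toy_tag_counter = Counter(toy_tags)

print(toy_tag_counter["NN"])           # 3
print(toy_tag_counter["XYZ"])          # 0, a missing key does not raise KeyError
print(toy_tag_counter.most_common(2))  # [('NN', 3), ('AT', 2)]

# Relative frequencies: divide each count by the total number of tags
toy_total = sum(toy_tag_counter.values())
toy_relative = {tag: count / toy_total for tag, count in toy_tag_counter.items()}
print(toy_relative["NN"])              # 3 / 7
```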

The frequency of a specific *token* can be accessed by generating a frequency profile from the *token*-list in the same way as for *tags*:

In [18]:
tokenCounter = Counter(tokens)

We access the *token* frequency in the same way as for *tags*:

In [28]:
tokenCounter["the"]

62713

Since one type (or word) in the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) can have more than one corresponding tag, each with its own frequency, we need to store this information in a nested data structure: a dictionary that maps each token to a *Counter* of its tags.

In [29]:
from collections import defaultdict

The following loop reads the individual *token* and *tag* pairs from the list of *token-tag*-tuples returned by *brown.tagged_words()* and increments their count in the *dictionary* of *Counter* data structures.

In [30]:
tokenTags = defaultdict(Counter)
for token, tag in brown.tagged_words():
    tokenTags[token][tag] += 1

We can now ask for the *Counter* data structure for the key *John*. The *Counter* data structure is a hash table with tags as keys and the corresponding frequencies as values.

In [31]:
tokenTags["John"]

Counter({'NP': 303, 'NP-HL': 9, 'NP-TL': 47, 'NN-TL': 1})

In [32]:
tokenTags["the"]["AT"]

62288
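
One simple use of such a token-to-tags table is to look up the most frequent tag of a token. The sketch below rebuilds the structure from a few made-up pairs; the helper `most_frequent_tag` and the toy data are hypothetical:

```python
from collections import Counter, defaultdict

toy_pairs = [("the", "AT"), ("run", "VB"), ("the", "AT"),
             ("run", "NN"), ("run", "VB")]

toy_token_tags = defaultdict(Counter)
for token, tag in toy_pairs:
    toy_token_tags[token][tag] += 1


def most_frequent_tag(token, token_tags):
    """Return the most frequent tag of a token, or None for unseen tokens."""
    # .get() avoids creating an empty entry for unseen tokens in the defaultdict
    tag_counts = token_tags.get(token)
    if not tag_counts:
        return None
    return tag_counts.most_common(1)[0][0]


print(most_frequent_tag("run", toy_token_tags))  # VB
print(most_frequent_tag("cat", toy_token_tags))  # None
```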

To calculate the probability of a $tag_2$ given that a $tag_1$ occurred, that is $P(tag_2\ |\ tag_1)$, we need to count the bigrams in the *tags* list. The NLTK *util* module provides the convenient *ngrams* function for this:

In [33]:
from nltk.util import ngrams

As for the *tokenTags* datatype above, we can create a *tags* bigram model using a dictionary of *Counter* datatypes. The dictionary keys will be the first tag of the tag-bigram. The value will contain a Counter datatype with the second tag of the tag-bigram as the key and the frequency of the bigram as value.

In [34]:
tagTags = defaultdict(Counter)

Using the *ngrams* function we generate the list of bigrams from the tags list and store it in the variable *posBigrams* using the following code:

In [35]:
posBigrams = list(ngrams(tags, 2))

The following loop goes through the list of bigram tuples, assigns the left bigram tag to the variable *tag1* and the right bigram tag to the variable *tag2*, and increments the count of the bigram in the *tagTags* data structure:

In [36]:
for tag1, tag2 in posBigrams:
    tagTags[tag1][tag2] += 1

We can now list all *tags* that follow the *AT* tag with the corresponding frequency:

In [37]:
tagTags["AT"]

Counter({'NP-TL': 809,
         'NN': 48376,
         'NN-TL': 2565,
         'NP': 2230,
         'JJ': 19488,
         'JJT': 675,
         'AP': 3007,
         'NNS': 7215,
         'NN$': 907,
         'VBG': 1568,
         'CD': 981,
         'JJS': 206,
         'VBN': 1468,
         'JJ-TL': 1414,
         'NPS': 588,
         'OD': 1251,
         '``': 620,
         'NNS$': 97,
         'RB': 350,
         'QL': 1377,
         'JJS-TL': 2,
         'NN$-TL': 162,
         'JJR': 630,
         'VBN-TL': 390,
         'NR-TL': 208,
         'NNS-TL': 284,
         'FW-IN': 7,
         'ABN': 42,
         'NR': 218,
         'NPS$': 30,
         'PN': 149,
         'NNS$-TL': 28,
         '*': 4,
         'NP$': 62,
         "'": 24,
         'VBG-TL': 34,
         'OD-TL': 98,
         'JJR-TL': 3,
         'FW-NN-TL': 52,
         'RB-TL': 1,
         'CD-TL': 29,
         'FW-JJ-TL': 8,
         'NR$-TL': 8,
         'FW-NN': 76,
         'RBT': 11,
         '(': 15,
         ...})

We can request the frequency of the tag-bigram *AT NN* using the following code:

In [38]:
tagTags["AT"]["NN"]

48376

We can calculate the total number of bigrams and relativize the count of any particular bigram:

In [44]:
total = float(len(tags))
print(total)
tagTags["NNS"]["NNS"]/(total-1)

1161192.0


0.00012228823681892126

If we want to know how many times a certain tag occurs in sentence-initial position, for example to estimate initial probabilities for start states in a Hidden Markov Model, we can loop through the sentences and count the tags in initial position.

In [40]:
offset = 0
initialTags = Counter()
for x in brown.sents():
    initTag = tags[offset]
    initialTags[initTag] += 1
    offset += len(x)
print("Example:")
print("AT:", initialTags["AT"])

Example:
AT: 8297


Note, for the code above, I do not know how to access the initial sentence tag directly, thus I am indirectly accessing the tag over an offset count. If you know a better way, let me know, please.
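
One possible alternative to the offset count is to iterate over tagged sentences directly. NLTK's tagged corpus readers offer a `tagged_sents()` method that returns each sentence as a list of (token, tag) pairs, so `brown.tagged_sents()` could be passed to a function like the one sketched below. The function name and the toy sentences are hypothetical, and the sketch is run here on made-up data rather than on the Brown corpus:

```python
from collections import Counter


def count_initial_tags(tagged_sentences):
    """Count the tag of the first token of every sentence."""
    initial_tags = Counter()
    for sentence in tagged_sentences:
        if sentence:  # skip empty sentences
            first_token, first_tag = sentence[0]
            initial_tags[first_tag] += 1
    return initial_tags


# Toy data in the same shape as the output of a tagged_sents() call
toy_tagged_sents = [
    [("The", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("Dogs", "NNS"), ("bark", "VB")],
    [("The", "AT"), ("end", "NN")],
]

toy_initial = count_initial_tags(toy_tagged_sents)
print(toy_initial["AT"])                              # 2
print(toy_initial["AT"] / sum(toy_initial.values()))  # 2 / 3
```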

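We can now estimate the probability that a token in the corpus is a sentence-initial instance of a given tag in the following way. Note that this divides by the total number of tokens; for the initial probabilities of start states in a Hidden Markov Model one would instead divide by the number of sentences, that is, by `sum(initialTags.values())`: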

In [45]:
initialTags["AT"]/total

0.007145243852868432

We can estimate the joint probability of a tag bigram, that is, of a tag occurring and being immediately followed by a specific other tag, in the following way:

In [46]:
tagTags["AT"]["NN"]/(total-1)

0.04166067425600095

Note, we are dividing by *total - 1*, since the number of bigrams in the *tagTags* data structure is exactly this. 

We can estimate the likelihood of a tag token combination using the *tokenTags* data-structure:

In [47]:
tokenTags["John"]["NN"]/total

0.0

Given the data structures *tokenTags* and *tagTags* we can now estimate the probability of a word given a specific tag, or intuitively, how likely it is that a position with that tag is filled by that word. For the token *cat* and the tag *NN* this is $P(cat\ |\ NN)$, which we estimate using the following equation and corresponding code (with $C(cat\ NN)$ as the absolute frequency or count of the *cat NN* tuple, and $C(NN)$ the count of the *NN*-tag):

$$P(w_n\ |\ t_n) = \frac{C(w_n\ t_n)}{C(t_n)}$$

In [48]:
tokenTags["cat"]["NN"] / tagCounter["NN"]

0.00013117334557617892

We can estimate the probability of a $tag_2$ following a $tag_1$ using a similar approach:

$$P(t_n\ |\ t_{n-1}) = \frac{C(t_{n-1}\ t_n)}{C(t_{n-1})}$$

Here $C(t_{n-1}\ t_n)$ is the count of the bigram of these two tags in sequence. $C(t_{n-1})$ is the count or absolute frequency of the first or left tag in the bigram. Let us assume that the input sequence was *the cat ...* and that the most likely initial tag for *the* was *AT*, then the probability of the tag *NN* given that a tag *AT* occurred can be estimated as:

In [49]:
tagTags["AT"]["NN"] / tagCounter["AT"]

0.4938392592819445
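
The two maximum likelihood estimates above can be wrapped in small functions. The following sketch builds the three count tables from a few made-up token-tag pairs and applies both equations; all names are hypothetical and chosen only to mirror *tagCounter*, *tokenTags* and *tagTags*:

```python
from collections import Counter, defaultdict

toy_pairs = [("the", "AT"), ("cat", "NN"), ("saw", "VBD"),
             ("the", "AT"), ("dog", "NN")]
toy_tags = [tag for _, tag in toy_pairs]

toy_tag_counter = Counter(toy_tags)
toy_token_tags = defaultdict(Counter)
for token, tag in toy_pairs:
    toy_token_tags[token][tag] += 1
toy_tag_tags = defaultdict(Counter)
for tag1, tag2 in zip(toy_tags, toy_tags[1:]):
    toy_tag_tags[tag1][tag2] += 1


def emission_probability(token, tag):
    """P(w_n | t_n) = C(w_n t_n) / C(t_n)"""
    return toy_token_tags[token][tag] / toy_tag_counter[tag]


def transition_probability(tag1, tag2):
    """P(t_n | t_{n-1}) = C(t_{n-1} t_n) / C(t_{n-1})"""
    return toy_tag_tags[tag1][tag2] / toy_tag_counter[tag1]


print(emission_probability("cat", "NN"))   # 1 / 2
print(transition_probability("AT", "NN"))  # 2 / 2
```

Both functions assume that the tag in the denominator has been observed; for an unseen tag the division would fail with a `ZeroDivisionError`.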

The product of the two probabilities $P(w_n\ |\ t_n)\ P(t_n\ |\ t_{n-1})$ for the tokens *the cat* and the possible tags *AT NN* is:

In [50]:
(tokenTags["cat"]["NN"] / tagCounter["NN"]) * (tagTags["AT"]["NN"] / tagCounter["AT"])

6.477854781687473e-05

If we wanted to calculate this for any sequence of words, we would wrap this code in a function and loop over all tokens. To avoid an underflow from the product of many probabilities, we can sum the logarithms of these probabilities instead. We would calculate the scores for all possible tag combinations assigned to the sequence of words or tokens and select the combination with the largest score as the best.
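
A minimal sketch of this procedure is given below. It is only an illustration under several assumptions: the probability tables contain hand-picked, made-up values rather than estimates from the Brown corpus, the start symbol `<s>` is a hypothetical device for scoring the first tag, and the function names are invented for this example. The search enumerates all tag combinations, which is only feasible for very short sequences:

```python
import math
from itertools import product

# Made-up probabilities for illustration only, not estimated from a corpus
toy_emission = {("the", "AT"): 0.6, ("cat", "NN"): 0.01, ("cat", "VB"): 0.0001}
toy_transition = {("<s>", "AT"): 0.2, ("<s>", "NN"): 0.1, ("<s>", "VB"): 0.05,
                  ("AT", "NN"): 0.5, ("AT", "VB"): 0.01}


def sequence_log_score(tokens, tags, emission, transition, start="<s>"):
    """Sum of log P(w_n | t_n) + log P(t_n | t_{n-1}) over the sequence."""
    score = 0.0
    previous = start
    for token, tag in zip(tokens, tags):
        p_emit = emission.get((token, tag), 0.0)
        p_trans = transition.get((previous, tag), 0.0)
        if p_emit == 0.0 or p_trans == 0.0:
            return float("-inf")  # log(0) is undefined, treat as impossible
        score += math.log(p_emit) + math.log(p_trans)
        previous = tag
    return score


def best_tag_sequence(tokens, tagset, emission, transition):
    """Score every tag combination and return the one with the largest score."""
    candidates = product(tagset, repeat=len(tokens))
    return max(candidates,
               key=lambda tags: sequence_log_score(tokens, tags, emission, transition))


print(best_tag_sequence(["the", "cat"], ["AT", "NN", "VB"],
                        toy_emission, toy_transition))  # ('AT', 'NN')
```

The number of candidates grows exponentially with the length of the sequence, so a practical tagger would not enumerate them but use dynamic programming, for example the Viterbi algorithm.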

In the next section we will discuss Hidden Markov Models (HMMs) for Part-of-Speech Tagging.

## References

Manning, Chris and Hinrich Schütze (1999) *[Foundations of Statistical Natural Language Processing](http://nlp.stanford.edu/fsnlp/)*, MIT Press. Cambridge, MA.

(C) 2016-2019 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>