# Python NLTK: Texts and Frequencies

**(C) 2017-2019 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Version:** 0.4, September 2019

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is a brief introduction to NLTK for simple frequency analysis of texts. I created this notebook for intro to corpus linguistics and natural language processing classes at Indiana University between 2017 and 2019.

For this to work, in the folder with the notebook we expect a subfolder data that contains a file HOPG.txt. This file contains the novel "A House of Pomegranates" by Oscar Wilde taken as raw text from [Project Gutenberg](https://www.gutenberg.org/).

## Simple File Processing

Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again. The following code prints out the first 300 characters of the text in memory:

In [2]:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
print(text[:300], "...")

A HOUSE OF POMEGRANATES




Contents:

The Young King
The Birthday of the Infanta
The Fisherman and his Soul
The Star-child




THE YOUNG KING




[TO MARGARET LADY BROOKE--THE RANEE OF SARAWAK]


It was the night before the day fixed for his coronation, and the
young King was sitting alone in his b ...


The optional parameters in the *open* function above define the **mode** of operations on the file and the **encoding** of the content. For example, setting the **mode** to **r** declares that *reading* from the file is the only permitted operation that we will perform in the following code. Setting the **encoding** to **utf-8** declares that all characters will be encoded using the [Unicode](https://en.wikipedia.org/wiki/Unicode) encoding schema [UTF-8](https://en.wikipedia.org/wiki/UTF-8) for the content of the file.
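
As a variation on the cell above, the same read can be written with a *with* statement, which closes the file automatically. This is only a minimal sketch: the file name sample.txt and the temporary folder are made up here so that the example runs on its own, and they are not part of the notebook's data.

```python
import os
import tempfile

# Hypothetical sample file, created only so that this sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, mode="w", encoding="utf-8") as ofile:
    ofile.write("A HOUSE OF POMEGRANATES\n")

# The with statement closes the file when the block ends, even if an error occurs.
with open(sample_path, mode="r", encoding="utf-8") as ifile:
    sample_text = ifile.read()

print(sample_text[:7], "...")
```

With this pattern there is no explicit call to *close*, so the file cannot accidentally stay open.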

We can now import the [NLTK](https://www.nltk.org/) module in Python to work with frequency profiles and [n-grams](https://en.wikipedia.org/wiki/N-gram) using the tokens or words in the text.

In [5]:
import nltk

We can now lowercase the text, which means normalizing all characters to lower case:

In [3]:
text = text.lower()
print(text[:300], "...")

a house of pomegranates




contents:

the young king
the birthday of the infanta
the fisherman and his soul
the star-child




the young king




[to margaret lady brooke--the ranee of sarawak]


it was the night before the day fixed for his coronation, and the
young king was sitting alone in his b ...


To generate a frequency profile from the text file, we can use the [NLTK](https://www.nltk.org/) function *FreqDist*:

In [6]:
myFD = nltk.FreqDist(text)

We can remove certain characters from the distribution, or alternatively replace these characters in the text variable. The following loop removes punctuation and whitespace characters from the frequency profile in myFD, which behaves like a Python dictionary (*FreqDist* is a subclass of *collections.Counter*).

In [8]:
for x in ":,.-[];!'\"\t\n/ ?":
    del myFD[x]
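
The alternative mentioned above, cleaning the text before counting, could look roughly like the following sketch. It assumes that we want to keep every character for which *str.isalpha()* is true, which is a different criterion from the explicit punctuation list in the loop: it also drops digits and keeps accented letters. It uses *collections.Counter* from the standard library; the same list of letters could be passed to *nltk.FreqDist* instead. The sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample sentence, not taken from the data file.
sample = "It was the night, before the day!"

# Keep only alphabetic characters before counting.
letters = [ch for ch in sample.lower() if ch.isalpha()]
letter_counts = Counter(letters)

for ch, n in letter_counts.items():
    print(ch, n)
```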

We can print out the frequency profile by looping through the returned data structure:

In [9]:
for x in myFD:
    print(x, myFD[x])

a 11231
h 10802
o 9408
u 3269
s 8093
e 17372
f 3089
p 1884
m 3271
g 2666
r 7603
n 8843
t 12521
c 2693
y 2168
k 1026
i 8307
b 1812
d 7249
l 5270
w 3665
x 48
v 1122
q 81
j 103
z 61


To relativize the frequencies, we need to compute the total number of characters. This assumes that we have removed all punctuation symbols. The frequency distribution instance myFD provides a method to access the values associated with the individual characters. It returns a view of the values (a *dict_values* object), that is, the frequencies associated with the characters.

In [10]:
myFD.values()

dict_values([11231, 10802, 9408, 3269, 8093, 17372, 3089, 1884, 3271, 2666, 7603, 8843, 12521, 2693, 2168, 1026, 8307, 1812, 7249, 5270, 3665, 48, 1122, 81, 103, 61])

The *sum* function adds up the values in its argument:

In [11]:
sum(myFD.values())

133657

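To avoid type problems when we compute the relative frequency of characters, we can convert the total number of characters into a *float*. In Python 3 the division operator / always returns a *float*, so this conversion is not strictly necessary there. In Python 2 it guarantees that the division in the following relativization step yields a *float* rather than a truncated integer.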

In [11]:
float(sum(myFD.values()))

133657.0

We store the resulting number of characters in the *total* variable:

In [12]:
total = float(sum(myFD.values()))
print(total)

133657.0


We can now generate a probability distribution over characters. To convert the frequencies into relative frequencies we use a list comprehension and divide every single list element by total. The resulting relative frequencies are stored in the variable *relfrq*:

In [13]:
relfrq = [ x/total for x in myFD.values() ]
print(relfrq)

[0.08402852076584093, 0.0808188123330615, 0.07038913038598801, 0.024458127894536014, 0.06055051362816762, 0.1299744869329702, 0.02311139708357961, 0.014095782488010355, 0.024473091570213306, 0.019946579677832064, 0.056884413087230745, 0.06616189200715264, 0.09368009157769515, 0.020148589299475522, 0.016220624434186013, 0.007676365622451499, 0.06215162692563801, 0.013557090163627793, 0.05423584249234982, 0.03942928540966803, 0.0274209356786401, 0.0003591282162550409, 0.008394622054961581, 0.0006060288649303815, 0.0007706292973806085, 0.0004563921081574478]
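
The list above no longer tells us which value belongs to which character. A possible variation, sketched here with *collections.Counter* and a made-up sample string, keeps the characters as dictionary keys and checks that the relative frequencies sum to 1. The names counts, total_count, and rel_by_char are chosen for this illustration only.

```python
from collections import Counter

# Made-up sample string, used instead of the novel to keep the sketch small.
counts = Counter("abracadabra")
total_count = sum(counts.values())

# Map every character to its relative frequency.
rel_by_char = {ch: n / total_count for ch, n in counts.items()}

for ch, p in rel_by_char.items():
    print(ch, round(p, 4))

# A probability distribution should sum to 1 (up to rounding error).
print(sum(rel_by_char.values()))
```

The NLTK frequency distribution also offers the methods *N()* for the total count and *freq()* for the relative frequency of a single sample, which can serve the same purpose.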


Let us compute the [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) for the character distribution using the relative frequencies. We will need the [logarithm](https://en.wikipedia.org/wiki/Logarithm) function from the Python *math* module for that:

In [16]:
from math import log

We can define the [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) function according to the equation $I = - \sum_{x} P(x) \log_2 P(x)$ as:

In [17]:
def entropy(p):
    return -sum( [ x * log(x, 2) for x in p ] )

We can now compute the entropy of the character distribution:

In [18]:
print(entropy(relfrq))

4.124824125135032
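
To get a feeling for this number, it can help to apply the same formula to distributions where the result is known in advance. The sketch below uses a slightly different function, here called entropy_bits (a name made up for this example), which skips probabilities of 0, because the logarithm of 0 is undefined and such terms contribute nothing to the sum.

```python
from math import log

def entropy_bits(p):
    """Entropy in bits of a list of probabilities; zero probabilities are skipped."""
    return -sum(x * log(x, 2) for x in p if x > 0)

# A fair coin: two equally likely outcomes, about 1 bit.
print(entropy_bits([0.5, 0.5]))

# Four equally likely outcomes, about 2 bits.
print(entropy_bits([0.25] * 4))

# A certain outcome carries no uncertainty, about 0 bits.
print(entropy_bits([1.0, 0.0]))
```

For 26 equally likely letters the entropy would be $\log_2 26 \approx 4.70$ bits, so a value of about 4.12 bits for the unevenly distributed letters in the novel is plausible.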


We might be interested in how much each single character contributes to this entropy, that is, in the individual terms $-P(x) \log_2 P(x)$ of the sum, which we call the point-wise entropy here. We can compute that in the following way:

In [19]:
entdist = [ -x * log(x, 2) for x in relfrq ]
print(entdist)

[0.3002319806578157, 0.29330480822209687, 0.2694850339615855, 0.13093762006835424, 0.24497024185626542, 0.38260584969385675, 0.12561626066788154, 0.08666922421352652, 0.13099613411450642, 0.11265259339509692, 0.2352638524022937, 0.2592127456210649, 0.32002184459468397, 0.11350057702872672, 0.09644826809662171, 0.053929238563860095, 0.24910770047189124, 0.08411915007238721, 0.2280405439219216, 0.18392139623527803, 0.1422756742696596, 0.004109580806277946, 0.05789199083219161, 0.006477433994507777, 0.007969598004767611, 0.005064783367910896]


We could now compute the variance over this point-wise entropy distribution, or other properties of the frequency distribution such as the median, mode, or standard deviation.
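
One way to compute such properties is the *statistics* module from the Python standard library. The following sketch uses the first five values of the output above, rounded to two decimals, as sample data. It assumes that we treat the values as a complete population, which is why it uses *pvariance* and *pstdev* rather than the sample versions *variance* and *stdev*.

```python
from statistics import mean, median, pstdev, pvariance

# First five point-wise entropy values from the output above, rounded.
values = [0.30, 0.29, 0.27, 0.13, 0.24]

print("mean:", mean(values))
print("median:", median(values))
print("variance:", pvariance(values))
print("standard deviation:", pstdev(values))
```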

## From Characters to Words/Tokens

We see that the frequency profile is for the characters in the text, not the words or tokens. In order to generate a frequency profile over words/tokens in the text, we need to utilize a **tokenizer**. [NLTK](https://www.nltk.org/) provides basic tokenization functions. We will use the *word_tokenize* function to generate a list of tokens:

In [20]:
tokens = nltk.word_tokenize(text)

We can print out the first 20 tokens to verify our data structure is a list with lower-case strings:

In [22]:
tokens[:20]

['a',
 'house',
 'of',
 'pomegranates',
 'contents',
 ':',
 'the',
 'young',
 'king',
 'the',
 'birthday',
 'of',
 'the',
 'infanta',
 'the',
 'fisherman',
 'and',
 'his',
 'soul',
 'the']

We can now generate a frequency profile from the token list, as we did with the characters above:

In [23]:
myTokenFD = nltk.FreqDist(tokens)

The frequency profile can be printed out in the same way as above by looping over the tokens and their frequencies. Note that we restrict the loop to the first 20 tokens here just to keep the notebook smaller. You can remove the [:20] selector in your own experiments.

In [25]:
for token in list(myTokenFD.items())[:20]:
    print(token[0], token[1])

a 673
house 26
of 1108
pomegranates 6
contents 1
: 33
the 2552
young 112
king 70
birthday 8
infanta 37
fisherman 76
and 2118
his 483
soul 99
star-child 42
[ 4
to 697
margaret 1
lady 4


## Counting N-grams

[NLTK](https://www.nltk.org/) provides simple methods to generate [n-gram](https://en.wikipedia.org/wiki/N-gram) models or frequency profiles over [n-grams](https://en.wikipedia.org/wiki/N-gram) from any kind of list or sequence. We can for example generate a bigram model, that is an [n-gram](https://en.wikipedia.org/wiki/N-gram) model for n = 2, from the text tokens:

In [33]:
myTokenBigrams = nltk.ngrams(tokens, 2)

To store the bigrams in a list that we want to process and analyze further, we convert the **Python generator object** myTokenBigrams to a list:

In [34]:
bigrams = list(myTokenBigrams)

Let us verify that the resulting data structure is indeed a list of string tuples. We will print out the first 20 tuples from the bigram list:

In [35]:
print(bigrams[:20])

[('a', 'house'), ('house', 'of'), ('of', 'pomegranates'), ('pomegranates', 'contents'), ('contents', ':'), (':', 'the'), ('the', 'young'), ('young', 'king'), ('king', 'the'), ('the', 'birthday'), ('birthday', 'of'), ('of', 'the'), ('the', 'infanta'), ('infanta', 'the'), ('the', 'fisherman'), ('fisherman', 'and'), ('and', 'his'), ('his', 'soul'), ('soul', 'the'), ('the', 'star-child')]


We can now verify the number of bigrams and check that there are exactly *number of tokens - 1 = number of bigrams* in the resulting list:

In [36]:
print(len(bigrams))
print(len(tokens))

38126
38127
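
The relation between the two numbers generalizes: a sequence of k tokens contains k - n + 1 n-grams (for k at least n). The following sketch illustrates this with a small helper called simple_ngrams. The helper is written for this example only and is not the NLTK implementation; it just pairs the token list with shifted copies of itself.

```python
def simple_ngrams(sequence, n):
    """Return all n-grams of a sequence as a list of tuples."""
    items = list(sequence)
    # zip stops at the shortest shifted copy, which yields len(items) - n + 1 tuples.
    return list(zip(*(items[i:] for i in range(n))))

sample_tokens = ["a", "house", "of", "pomegranates", "contents"]

sample_bigrams = simple_ngrams(sample_tokens, 2)
sample_trigrams = simple_ngrams(sample_tokens, 3)

print(sample_bigrams)
print(len(sample_tokens), len(sample_bigrams), len(sample_trigrams))
```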


The frequency profile from these bigrams is generated in exactly the same way as from the token list in the examples above:

In [37]:
myBigramFD = nltk.FreqDist(bigrams)

If we want to know some more general properties of the frequency distribution, we can print out information about it. The print statement for this bigram frequency distribution tells us that we have 17,766 types (samples) and 38,126 tokens (outcomes):

In [39]:
print(myBigramFD)

<FreqDist with 17766 samples and 38126 outcomes>


The bigrams and their corresponding frequencies can be printed using a *for* loop. We restrict the number of printed items to 20, just to keep this list reasonably short. If you would like to see the full frequency profile, remove the [:20] restrictor.

In [41]:
for bigram in list(myBigramFD.items())[:20]:
    print(bigram[0], bigram[1])
print("...")

('a', 'house') 4
('house', 'of') 5
('of', 'pomegranates') 4
('pomegranates', 'contents') 1
('contents', ':') 1
(':', 'the') 2
('the', 'young') 107
('young', 'king') 24
('king', 'the') 1
('the', 'birthday') 3
('birthday', 'of') 3
('of', 'the') 354
('the', 'infanta') 33
('infanta', 'the') 1
('the', 'fisherman') 5
('fisherman', 'and') 3
('and', 'his') 53
('his', 'soul') 51
('soul', 'the') 1
('the', 'star-child') 41
...


Pretty printing the bigrams is possible as well:

In [42]:
for ngram in list(myBigramFD.items())[:20]:
    print(" ".join(ngram[0]), ngram[1])
print("...")

a house 4
house of 5
of pomegranates 4
pomegranates contents 1
contents : 1
: the 2
the young 107
young king 24
king the 1
the birthday 3
birthday of 3
of the 354
the infanta 33
infanta the 1
the fisherman 5
fisherman and 3
and his 53
his soul 51
soul the 1
the star-child 41
...


You can remove the [:20] restrictor above and print out the entire frequency profile. If you select and copy the profile to your clipboard, you can paste it into your favorite spreadsheet software and sort, analyze, and study the distribution in many interesting ways.
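
Instead of copying the printed output by hand, the profile could also be written in a tab-separated format that spreadsheet software can open directly. The sketch below uses the *csv* module from the standard library and writes to an in-memory buffer so that it runs on its own. The column names and the three sample rows (taken from the output above) are chosen for illustration.

```python
import csv
import io

# Three bigrams with their frequencies, taken from the output above.
sample_profile = [("of the", 354), ("the young", 107), ("a house", 4)]

# Write tab-separated rows into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["ngram", "frequency"])
writer.writerows(sample_profile)

tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to a real file, the buffer can be replaced with a file opened in write mode with the argument newline="", as the *csv* module documentation recommends.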

Instead of running the frequency profile through a loop we can also use a list comprehension construction in Python to generate a list of tuples with the n-gram and its frequency:

In [43]:
ngrams = [ (" ".join(ngram), myBigramFD[ngram]) for ngram in myBigramFD ]
print(ngrams[:100])

[('a house', 4), ('house of', 5), ('of pomegranates', 4), ('pomegranates contents', 1), ('contents :', 1), (': the', 2), ('the young', 107), ('young king', 24), ('king the', 1), ('the birthday', 3), ('birthday of', 3), ('of the', 354), ('the infanta', 33), ('infanta the', 1), ('the fisherman', 5), ('fisherman and', 3), ('and his', 53), ('his soul', 51), ('soul the', 1), ('the star-child', 41), ('star-child the', 1), ('king [', 1), ('[ to', 4), ('to margaret', 1), ('margaret lady', 1), ('lady brooke', 1), ('brooke --', 1), ('-- the', 4), ('the ranee', 1), ('ranee of', 1), ('of sarawak', 1), ('sarawak ]', 1), ('] it', 2), ('it was', 52), ('was the', 20), ('the night', 3), ('night before', 1), ('before the', 8), ('the day', 7), ('day fixed', 1), ('fixed for', 1), ('for his', 7), ('his coronation', 2), ('coronation ,', 3), (', and', 1271), ('and the', 287), ('king was', 1), ('was sitting', 1), ('sitting alone', 1), ('alone in', 1), ('in his', 31), ('his beautiful', 1), ('beautiful chamber'

We can generate an increasing frequency profile using the *sorted* function with the second element of each tuple, that is the frequency, as the sort key:

In [49]:
sortedngrams = sorted(ngrams, key=lambda x: x[1])
print(sortedngrams[:20])
print("...")

[('pomegranates contents', 1), ('contents :', 1), ('king the', 1), ('infanta the', 1), ('soul the', 1), ('star-child the', 1), ('king [', 1), ('to margaret', 1), ('margaret lady', 1), ('lady brooke', 1), ('brooke --', 1), ('the ranee', 1), ('ranee of', 1), ('of sarawak', 1), ('sarawak ]', 1), ('night before', 1), ('day fixed', 1), ('fixed for', 1), ('king was', 1), ('was sitting', 1)]
...


We can often increase the speed of this *sorted* call by using the *itemgetter()* function in the *operator* module instead of a *lambda* expression. Let us import this function:

In [48]:
from operator import itemgetter

We can now define the sort-key for *sorted* using the *itemgetter* function and selecting with 1 the second element in the tuple. Remember that the enumeration of elements in lists or tuples in Python starts at 0.

In [50]:
sortedngrams = sorted(ngrams, key=itemgetter(1))
print(sortedngrams[:20])
print("...")

[('pomegranates contents', 1), ('contents :', 1), ('king the', 1), ('infanta the', 1), ('soul the', 1), ('star-child the', 1), ('king [', 1), ('to margaret', 1), ('margaret lady', 1), ('lady brooke', 1), ('brooke --', 1), ('the ranee', 1), ('ranee of', 1), ('of sarawak', 1), ('sarawak ]', 1), ('night before', 1), ('day fixed', 1), ('fixed for', 1), ('king was', 1), ('was sitting', 1)]
...
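
The two sort keys are interchangeable as far as the result is concerned, which the following sketch checks on a small made-up list. It also illustrates that *sorted* is stable: items with the same frequency keep their original relative order, which is why the bigrams with frequency 1 in the output above appear in text order.

```python
from operator import itemgetter

# Small made-up list of (n-gram, frequency) tuples.
sample_ngrams = [("of the", 354), ("king the", 1), ("a house", 4), ("soul the", 1)]

by_lambda = sorted(sample_ngrams, key=lambda x: x[1])
by_itemgetter = sorted(sample_ngrams, key=itemgetter(1))

# Both keys select the frequency, so the results are identical.
print(by_lambda == by_itemgetter)
print(by_itemgetter)
```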


A decreasing frequency profile can be generated using another parameter to *sorted*:

In [51]:
sortedngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
print(sortedngrams[:20])
print("...")

[(', and', 1271), ('of the', 354), ('and the', 287), ('. and', 205), (". '", 193), ('in the', 158), (", '", 155), ('to the', 149), ('him ,', 146), ('. the', 112), ("' and", 108), ('the young', 107), (', but', 96), ('and he', 89), (', for', 85), ("? '", 84), ('on the', 80), ('said to', 79), ('to him', 77), ('and said', 77)]
...


We can pretty-print the decreasing frequency profile:

In [53]:
sortedngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
for t in sortedngrams[:20]:
    print(t[0], t[1])
print("...")

, and 1271
of the 354
and the 287
. and 205
. ' 193
in the 158
, ' 155
to the 149
him , 146
. the 112
' and 108
the young 107
, but 96
and he 89
, for 85
? ' 84
on the 80
said to 79
to him 77
and said 77
...
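
For the common case of listing the most frequent items, there is a shortcut that avoids the explicit sort. *nltk.FreqDist* is a subclass of *collections.Counter* and therefore inherits its *most_common* method. The sketch below demonstrates the method on a plain *Counter* with a few frequencies copied from the output above, so that it runs without NLTK and without the data file.

```python
from collections import Counter

# A few bigram frequencies copied from the output above.
sample_fd = Counter({", and": 1271, "of the": 354, "and the": 287, "a house": 4})

# most_common(n) returns the n most frequent items in decreasing order.
top3 = sample_fd.most_common(3)
for ngram, n in top3:
    print(ngram, n)
print("...")
```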


To be continued...
