# Python NLTK: Texts and Frequencies

**(C) 2017-2021 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Version:** 0.8, January 2021

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is a brief introduction to NLTK for simple frequency analysis of texts. I created this notebook for intro to corpus linguistics and natural language processing classes at Indiana University between 2017 and 2020.

For this to work, in the folder with the notebook we expect a subfolder data that contains a file HOPG.txt. This file contains the novel "A House of Pomegranates" by Oscar Wilde taken as raw text from [Project Gutenberg](https://www.gutenberg.org/).

## Simple File Processing

Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again. The following code prints out the first 300 characters of the text in memory:

In [4]:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
print(text[:300], "...")

A HOUSE OF POMEGRANATES




Contents:

The Young King
The Birthday of the Infanta
The Fisherman and his Soul
The Star-child




THE YOUNG KING




[TO MARGARET LADY BROOKE--THE RANEE OF SARAWAK]


It was the night before the day fixed for his coronation, and the
young King was sitting alone in his b ...


The optional parameters in the *open* function above define the **mode** of operations on the file and the **encoding** of the content. Setting the **mode** to **r** declares that *reading* is the only operation we are permitted to perform on the file. Setting the **encoding** to **utf-8** declares that the content of the file is encoded using the [Unicode](https://en.wikipedia.org/wiki/Unicode) encoding scheme [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
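
A common alternative to the explicit *close* call is the *with* statement, which closes the file automatically when the block ends. The following is a minimal sketch. To keep it self-contained, it first writes a small sample file to a temporary folder; the file name `sample.txt` and its content are assumptions for illustration and not part of the notebook data:

```python
import os
import tempfile

# Create a small sample file so that the example runs without data/HOPG.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, mode='w', encoding='utf-8') as ofile:
    ofile.write("A HOUSE OF POMEGRANATES\n")

# The with statement closes the file as soon as the block ends.
with open(sample_path, mode='r', encoding='utf-8') as ifile:
    sample_text = ifile.read()

print(sample_text[:300], "...")
print(ifile.closed)
```

With the notebook data, the same pattern would apply to `data/HOPG.txt`.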

We can now import the [NLTK](https://www.nltk.org/) module in Python to work with frequency profiles and [n-grams](https://en.wikipedia.org/wiki/N-gram) using the tokens or words in the text.

In [2]:
import nltk

We can now lowercase the text, that is, normalize it by converting all characters to lower case:

In [3]:
text = text.lower()
print(text[:300], "...")

a house of pomegranates




contents:

the young king
the birthday of the infanta
the fisherman and his soul
the star-child




the young king




[to margaret lady brooke--the ranee of sarawak]


it was the night before the day fixed for his coronation, and the
young king was sitting alone in his b ...


To generate a frequency profile from the text, we can use the [NLTK](https://www.nltk.org/) class *FreqDist*. Since the variable text holds a string, the resulting profile counts the individual characters:

In [17]:
myFD = nltk.FreqDist(text)

In [18]:
print(myFD)

<FreqDist with 39 samples and 174018 outcomes>


We can remove certain characters from the distribution, or alternatively replace these characters in the text variable. The following loop removes them from the frequency profile in myFD, which is a dictionary data structure in Python.

In [19]:
for x in ":,.-[];!'\"\t\n/ ?":
    del myFD[x]
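
The alternative mentioned above, cleaning the text before counting, could look like the following sketch. It uses *Counter* from the Python standard library in place of *FreqDist* and a short made-up sample string, both assumptions to keep the example self-contained. Filtering with *isalpha* keeps every alphabetic character, which is a slightly different criterion than deleting a fixed list of symbols:

```python
from collections import Counter

# A short made-up sample (assumption), not the novel itself.
sample = "the young king, the star-child!"

# Keep only alphabetic characters before counting.
letters = [ch for ch in sample.lower() if ch.isalpha()]
letter_counts = Counter(letters)

print(letter_counts.most_common(3))
```

The list `letters` could be passed to *nltk.FreqDist* in the same way, since *FreqDist* accepts any sequence of samples.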

We can print out the frequency profile by looping through the returned data structure:

In [20]:
for x in myFD:
    print(x, myFD[x])

e 17372
t 12521
a 11231
h 10802
o 9408
n 8843
i 8307
s 8093
r 7603
d 7249
l 5270
w 3665
m 3271
u 3269
f 3089
c 2693
g 2666
y 2168
p 1884
b 1812
v 1122
k 1026
j 103
q 81
z 61
x 48


To relativize the frequencies, we need to compute the total number of characters. This is assuming that we removed all punctuation symbols. The frequency distribution instance myFD provides a method to access the values associated with the individual characters. This returns a view of the values (*dict_values*), that is, the frequencies associated with the characters.

In [21]:
myFD.values()

dict_values([11231, 10802, 9408, 3269, 8093, 17372, 3089, 1884, 3271, 2666, 7603, 8843, 12521, 2693, 2168, 1026, 8307, 1812, 7249, 5270, 3665, 48, 1122, 81, 103, 61])

The *sum* function can add up the values in its argument:

In [22]:
sum(myFD.values())

133657

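To avoid type problems when we compute the relative frequency of characters, we can convert the total number of characters into a *float*. This guarantees that the division in the following relativization step returns a *float* as well. In Python 3 the / operator always returns a *float*, so the conversion is only strictly necessary in Python 2, but it does no harm.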

In [23]:
float(sum(myFD.values()))

133657.0

We store the resulting number of characters in the *total* variable:

In [24]:
total = float(sum(myFD.values()))
print(total)

133657.0


We can now generate a probability distribution over characters. To convert the frequencies into relative frequencies, we use a list comprehension and divide every single list element by total. The resulting relative frequencies are stored in the variable *relfrq*:

In [25]:
relfrq = [ x/total for x in myFD.values() ]
print(relfrq)

[0.08402852076584093, 0.0808188123330615, 0.07038913038598801, 0.024458127894536014, 0.06055051362816762, 0.1299744869329702, 0.02311139708357961, 0.014095782488010355, 0.024473091570213306, 0.019946579677832064, 0.056884413087230745, 0.06616189200715264, 0.09368009157769515, 0.020148589299475522, 0.016220624434186013, 0.007676365622451499, 0.06215162692563801, 0.013557090163627793, 0.05423584249234982, 0.03942928540966803, 0.0274209356786401, 0.0003591282162550409, 0.008394622054961581, 0.0006060288649303815, 0.0007706292973806085, 0.0004563921081574478]
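
The list above no longer shows which character a relative frequency belongs to. The following sketch keeps the two together in a dictionary. It uses *Counter* and the sample string "abracadabra" as stand-ins for myFD and the novel; the names `counts`, `total_count`, and `rel_freq` are made up for this illustration:

```python
from collections import Counter

# Stand-in for the character frequency profile (assumption).
counts = Counter("abracadabra")
total_count = sum(counts.values())

# Map every character to its relative frequency.
rel_freq = {ch: n / total_count for ch, n in counts.items()}

# Print the characters by decreasing relative frequency.
for ch, p in sorted(rel_freq.items(), key=lambda item: item[1], reverse=True):
    print(ch, round(p, 4))

# The relative frequencies should sum to 1, up to rounding errors.
print(sum(rel_freq.values()))
```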


Let us compute the [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) for the character distribution using the relative frequencies. We will need the [logarithm](https://en.wikipedia.org/wiki/Logarithm) function from the Python *math* module for that:

In [26]:
from math import log

We can define the [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) function according to the equation $I = - \sum_{x} P(x) \log_2 P(x)$ as:

In [27]:
def entropy(p):
    return -sum( [ x * log(x, 2) for x in p ] )

In [28]:
entropy([1/8, 1/16, 1/4, 1/8, 1/16, 1/16, 1/4, 1/16])

2.75

In [29]:
entropy([1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8])

3.0
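
Both results can be checked by hand. In the first distribution two outcomes have probability 1/4, two have 1/8, and four have 1/16. Using $-P(x) \log_2 P(x) = P(x) \log_2 \frac{1}{P(x)}$ and grouping equal probabilities gives:

$$
I = 2 \cdot \tfrac{1}{4} \log_2 4 + 2 \cdot \tfrac{1}{8} \log_2 8 + 4 \cdot \tfrac{1}{16} \log_2 16 = 1 + 0.75 + 1 = 2.75
$$

In the second distribution all eight outcomes are equally likely:

$$
I = 8 \cdot \tfrac{1}{8} \log_2 8 = 3
$$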

We can now compute the entropy of the character distribution:

In [None]:
print(entropy(relfrq))

We might be interested in the point-wise entropy of the characters in this distribution, thus needing the entropy of each single character. We can compute that in the following way:

In [None]:
entdist = [ -x * log(x, 2) for x in relfrq ]
print(entdist)

We could now compute the variance over this point-wise entropy distribution, or other properties of the frequency distribution such as the median, mode, or standard deviation.
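
One possible way to compute such properties is the *statistics* module of the Python standard library. The sketch below uses a small made-up probability distribution instead of relfrq so that it runs on its own. Note that *pvariance* and *pstdev* treat the list as a complete population; *variance* and *stdev* would be the sample-based counterparts:

```python
from math import log2
from statistics import mean, median, pstdev, pvariance

# A small made-up probability distribution (assumption for illustration).
probs = [0.5, 0.25, 0.125, 0.125]

# Point-wise entropy contribution of every outcome.
pointwise = [-p * log2(p) for p in probs]

print("mean:", mean(pointwise))
print("median:", median(pointwise))
print("variance:", pvariance(pointwise))
print("standard deviation:", pstdev(pointwise))
```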

## From Characters to Words/Tokens

We see that the frequency profile is for the characters in the text, not the words or tokens. In order to generate a frequency profile over words/tokens in the text, we need to utilize a **tokenizer**. [NLTK](https://www.nltk.org/) provides basic tokenization functions. We will use the *word_tokenize* function to generate a list of tokens:

In [5]:
tokens = nltk.word_tokenize(text)

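We can print out the first 20 tokens to verify that our data structure is a list of strings. Note that the tokens in the output below retain the original capitalization, that is, they were generated from the text before it was converted to lower case: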

In [6]:
tokens[:20]

['A',
 'HOUSE',
 'OF',
 'POMEGRANATES',
 'Contents',
 ':',
 'The',
 'Young',
 'King',
 'The',
 'Birthday',
 'of',
 'the',
 'Infanta',
 'The',
 'Fisherman',
 'and',
 'his',
 'Soul',
 'The']
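
If lower-case tokens are needed, the token list can also be normalized after tokenization. A minimal sketch, using a hand-written list that mirrors the beginning of the output above (the names `sample_tokens` and `lower_tokens` are made up for this illustration):

```python
# Hand-written tokens that mirror the start of the output above.
sample_tokens = ['A', 'HOUSE', 'OF', 'POMEGRANATES', 'Contents', ':',
                 'The', 'Young', 'King']

# Convert every token to lower case.
lower_tokens = [token.lower() for token in sample_tokens]
print(lower_tokens)
```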

We can now generate a frequency profile from the token list, as we did with the characters above:

In [7]:
myTokenFD = nltk.FreqDist(tokens)
print(myTokenFD)

<FreqDist with 4276 samples and 38128 outcomes>


The frequency profile can be printed out in the same way as above by looping over the tokens and their frequencies. The following cells print the entire profile. To keep the output shorter, you can restrict the loop to the first 20 tokens by appending a [:20] selector to the list in your own experiments.

In [9]:
print(myTokenFD.items())

dict_items([('A', 24), ('HOUSE', 1), ('OF', 5), ('POMEGRANATES', 1), ('Contents', 1), (':', 33), ('The', 127), ('Young', 1), ('King', 57), ('Birthday', 1), ('of', 1100), ('the', 2419), ('Infanta', 36), ('Fisherman', 75), ('and', 1815), ('his', 473), ('Soul', 70), ('Star-child', 1), ('THE', 6), ('YOUNG', 1), ('KING', 1), ('[', 4), ('TO', 4), ('MARGARET', 1), ('LADY', 2), ('BROOKE', 1), ('--', 26), ('RANEE', 1), ('SARAWAK', 1), (']', 4), ('It', 33), ('was', 354), ('night', 14), ('before', 52), ('day', 46), ('fixed', 2), ('for', 249), ('coronation', 4), (',', 2675), ('young', 110), ('sitting', 2), ('alone', 10), ('in', 462), ('beautiful', 25), ('chamber', 10), ('.', 1342), ('His', 9), ('courtiers', 3), ('had', 235), ('all', 78), ('taken', 5), ('their', 169), ('leave', 10), ('him', 369), ('bowing', 2), ('heads', 7), ('to', 691), ('ground', 16), ('according', 1), ('ceremonious', 1), ('usage', 1), ('retired', 3), ('Great', 3), ('Hall', 2), ('Palace', 7), ('receive', 2), ('a', 649), ('few', 9

In [10]:
for token in list(myTokenFD.items()):
    print(token[0], token[1])

A 24
HOUSE 1
OF 5
POMEGRANATES 1
Contents 1
: 33
The 127
Young 1
King 57
Birthday 1
of 1100
the 2419
Infanta 36
Fisherman 75
and 1815
his 473
Soul 70
Star-child 1
THE 6
YOUNG 1
KING 1
[ 4
TO 4
MARGARET 1
LADY 2
BROOKE 1
-- 26
RANEE 1
SARAWAK 1
] 4
It 33
was 354
night 14
before 52
day 46
fixed 2
for 249
coronation 4
, 2675
young 110
sitting 2
alone 10
in 462
beautiful 25
chamber 10
. 1342
His 9
courtiers 3
had 235
all 78
taken 5
their 169
leave 10
him 369
bowing 2
heads 7
to 691
ground 16
according 1
ceremonious 1
usage 1
retired 3
Great 3
Hall 2
Palace 7
receive 2
a 649
few 9
last 13
lessons 1
from 175
Professor 1
Etiquette 1
; 60
there 76
being 11
some 47
them 154
who 130
still 14
quite 21
natural 4
manners 1
which 59
courtier 1
is 237
I 377
need 13
hardly 4
say 11
very 16
grave 7
offence 1
lad 5
he 569
only 29
but 136
sixteen 1
years 16
age 4
not 207
sorry 1
at 195
departure 1
flung 12
himself 42
back 42
with 387
deep 13
sigh 1
relief 1
on 195
soft 5
cushions 1
embroidered 7
couch 8


In [12]:
stopwords = """
I
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
needn't
shan
shan't
shouldn
shouldn't
wasn
wasn't
weren
weren't
won
won't
wouldn
wouldn't
"""

In [13]:
for x in stopwords.split():
    del myTokenFD[x]
print(list(myTokenFD))

[',', '.', "'", 'And', 'said', '?', 'thou', 'thee', 'The', 'little', 'young', 'came', 'answered', 'great', 'upon', 'He', 'But', 'Fisherman', 'gold', 'one', 'thy', 'Soul', 'went', 'cried', "'s", 'would', 'us', ';', 'away', 'saw', 'like', 'city', '!', 'King', 'made', 'white', 'hand', 'round', 'rose', 'day', 'go', 'eyes', 'come', 'stood', 'back', 'world', 'So', 'love', 'face', 'sea', "''", 'Star-Child', 'man', 'looked', 'silver', 'hands', 'took', 'hast', 'Infanta', 'set', 'shall', 'things', 'may', 'red', 'head', ':', 'It', 'forest', 'children', 'could', 'art', 'feet', 'called', 'long', 'evil', 'They', 'heart', 'ran', 'black', 'put', 'laughed', 'even', 'dance', 'lips', 'passed', 'give', 'water', 'let', 'soul', 'side', 'seemed', 'place', 'yellow', 'thing', 'piece', '--', 'beautiful', 'mother', 'house', 'A', 'child', 'much', 'body', 'green', 'When', 'make', "'Nay", 'people', 'lay', 'see', 'gave', 'also', 'arms', 'She', 'know', 'quite', 'old', 'many', 'strange', 'seen', 'look', 'though', 'fel
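
The output above still contains capitalized forms such as 'The', 'And', 'He', and 'But', as well as punctuation marks, because the comparison with the stopword list is case sensitive and punctuation is not part of the list. A possible refinement is sketched below. It filters the tokens before counting, uses *Counter* as a stand-in for *FreqDist*, and works on a hand-written token list with a tiny stopword set; all names in it are assumptions for illustration:

```python
from collections import Counter

sample_tokens = ['The', 'young', 'King', 'and', 'the', 'Fisherman', ',',
                 'And', 'the', 'Soul', '.']
# A tiny stand-in for the stopword list above (assumption).
stop_set = {'the', 'and'}

content_counts = Counter(
    token for token in sample_tokens
    if token.lower() not in stop_set          # case-insensitive stopword check
    and any(ch.isalnum() for ch in token)     # skip pure punctuation tokens
)
print(content_counts)
```

With the notebook data, `stop_set` could be built with `set(stopwords.split())`.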

## Counting N-grams

[NLTK](https://www.nltk.org/) provides simple methods to generate [n-gram](https://en.wikipedia.org/wiki/N-gram) models or frequency profiles over [n-grams](https://en.wikipedia.org/wiki/N-gram) from any kind of list or sequence. We can, for example, generate a bigram model, that is an [n-gram](https://en.wikipedia.org/wiki/N-gram) model for n = 2, from the text tokens:

In [14]:
myTokenBigrams = nltk.ngrams(tokens, 2)

To store the bigrams in a list that we want to process and analyze further, we convert the **Python generator object** myTokenBigrams to a list:

In [15]:
bigrams = list(myTokenBigrams)

Let us verify that the resulting data structure is indeed a list of string tuples. We will print out the first 20 tuples from the bigram list:

In [16]:
print(bigrams[:20])

[('A', 'HOUSE'), ('HOUSE', 'OF'), ('OF', 'POMEGRANATES'), ('POMEGRANATES', 'Contents'), ('Contents', ':'), (':', 'The'), ('The', 'Young'), ('Young', 'King'), ('King', 'The'), ('The', 'Birthday'), ('Birthday', 'of'), ('of', 'the'), ('the', 'Infanta'), ('Infanta', 'The'), ('The', 'Fisherman'), ('Fisherman', 'and'), ('and', 'his'), ('his', 'Soul'), ('Soul', 'The'), ('The', 'Star-child')]


We can now verify the number of bigrams and check that there are exactly *number of tokens - 1 = number of bigrams* in the resulting list:

In [None]:
print(len(bigrams))
print(len(tokens))
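
To see why the list contains one bigram less than there are tokens, it can help to build the bigrams by hand. The following sketch pairs every token with its successor using *zip*; the last token has no successor and therefore starts no bigram. The token list is made up for this illustration, and the code is a plain Python equivalent for n = 2, not the NLTK implementation:

```python
sample_tokens = ['the', 'young', 'king', 'was', 'sitting', 'alone']

# Pair every token with its successor.
sample_bigrams = list(zip(sample_tokens, sample_tokens[1:]))

print(sample_bigrams)
print(len(sample_tokens), len(sample_bigrams))
```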

The frequency profile from these bigrams is generated in exactly the same way as from the token list in the examples above:

In [17]:
myBigramFD = nltk.FreqDist(bigrams)
print(myBigramFD)

<FreqDist with 18266 samples and 38127 outcomes>


If we want to know some more general properties of the frequency distribution, we can print out information about it. The print statement for this bigram frequency distribution tells us that we have 18,266 types and 38,127 tokens:

In [None]:
print(myBigramFD)

The bigrams and their corresponding frequencies can be printed using a *for* loop. We restrict the number of printed items to 20, just to keep this list reasonably short. If you would like to see the full frequency profile, remove the [:20] restrictor.

In [18]:
for bigram in list(myBigramFD.items())[:20]:
    print(bigram[0], bigram[1])
print("...")

('A', 'HOUSE') 1
('HOUSE', 'OF') 1
('OF', 'POMEGRANATES') 1
('POMEGRANATES', 'Contents') 1
('Contents', ':') 1
(':', 'The') 1
('The', 'Young') 1
('Young', 'King') 1
('King', 'The') 1
('The', 'Birthday') 1
('Birthday', 'of') 1
('of', 'the') 353
('the', 'Infanta') 28
('Infanta', 'The') 1
('The', 'Fisherman') 1
('Fisherman', 'and') 2
('and', 'his') 34
('his', 'Soul') 48
('Soul', 'The') 1
('The', 'Star-child') 1
...


Pretty printing the bigrams is possible as well:

In [19]:
for ngram in list(myBigramFD.items()):
    print(" ".join(ngram[0]), ngram[1])
print("...")

A HOUSE 1
HOUSE OF 1
OF POMEGRANATES 1
POMEGRANATES Contents 1
Contents : 1
: The 1
The Young 1
Young King 1
King The 1
The Birthday 1
Birthday of 1
of the 353
the Infanta 28
Infanta The 1
The Fisherman 1
Fisherman and 2
and his 34
his Soul 48
Soul The 1
The Star-child 1
Star-child THE 1
THE YOUNG 1
YOUNG KING 1
KING [ 1
[ TO 4
TO MARGARET 1
MARGARET LADY 1
LADY BROOKE 1
BROOKE -- 1
-- THE 1
THE RANEE 1
RANEE OF 1
OF SARAWAK 1
SARAWAK ] 1
] It 2
It was 17
was the 20
the night 3
night before 1
before the 8
the day 7
day fixed 1
fixed for 1
for his 7
his coronation 2
coronation , 3
, and 1270
and the 200
the young 99
young King 22
King was 1
was sitting 1
sitting alone 1
alone in 1
in his 31
his beautiful 1
beautiful chamber 1
chamber . 1
. His 8
His courtiers 1
courtiers had 1
had all 2
all taken 1
taken their 1
their leave 1
leave of 1
of him 8
him , 146
, bowing 1
bowing their 1
their heads 5
heads to 1
to the 149
the ground 15
ground , 6
, according 1
according to 1
the ceremonious 1

This loop has no [:20] restrictor and prints out the entire frequency profile. If you select and copy the profile to your clipboard, you can paste it into your favorite spreadsheet software and sort, analyze, and study the distribution in many interesting ways.

Instead of running the frequency profile through a loop we can also use a list comprehension construction in Python to generate a list of tuples with the n-gram and its frequency:

In [None]:
ngrams = [ (" ".join(ngram), myBigramFD[ngram]) for ngram in myBigramFD ]
print(ngrams[:100])

We can generate an increasing frequency profile using the *sorted* function with the second element of each tuple, that is the frequency, as the sort key:

In [None]:
sortedngrams = sorted(ngrams, key=lambda x: x[1])
print(sortedngrams[:20])
print("...")

We can increase the speed of this *sorted* call by using the *itemgetter()* function in the *operator* module. Let us import this function:

In [None]:
from operator import itemgetter

We can now define the sort-key for *sorted* using the *itemgetter* function and selecting with 1 the second element in the tuple. Remember that the enumeration of elements in lists or tuples in Python starts at 0.

In [None]:
sortedngrams = sorted(ngrams, key=itemgetter(1))
print(sortedngrams[:20])
print("...")

A decreasing frequency profile can be generated using another parameter to *sorted*:

In [None]:
sortedngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
print(sortedngrams[:20])
print("...")

We can pretty-print the decreasing frequency profile:

In [None]:
sortedngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
for t in sortedngrams[:20]:
    print(t[0], t[1])
print("...")
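
A shorter route to a decreasing profile is the *most_common* method of *Counter* in the Python standard library. In NLTK 3 *FreqDist* is derived from *Counter*, so the same call should be available on myBigramFD as well. The sketch below uses a plain *Counter* over a made-up bigram list to stay self-contained:

```python
from collections import Counter

# Made-up bigrams (assumption for illustration).
sample_bigrams = [('of', 'the'), ('the', 'king'), ('of', 'the'),
                  ('and', 'the'), ('of', 'the'), ('and', 'the')]
bigram_counts = Counter(sample_bigrams)

# most_common(n) returns the n most frequent items, highest count first.
for bigram, freq in bigram_counts.most_common(2):
    print(" ".join(bigram), freq)
```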

In [None]:
# Relative frequencies of bigrams that contain neither punctuation nor stopwords
total = float(sum(myBigramFD.values()))
exceptions = {"]", "[", "--", ",", ".", "'s", "?", "!", "'", "'ye"}
myStopWords = set(stopwords.split())  # membership tests on a set are fast
results = []
for x in myBigramFD:
    # skip bigrams that contain punctuation or another exception token
    if x[0] in exceptions or x[1] in exceptions:
        continue
    # skip bigrams that contain a stopword
    if x[0] in myStopWords or x[1] in myStopWords:
        continue
    results.append( (x[0], x[1], myBigramFD[x]/total) )
# sort by relative frequency, highest first
sortedresults = sorted(results, key=itemgetter(2), reverse=True)
for x in sortedresults[:20]:
    print(x[0], x[1], x[2])

To be continued...
