# Python NLTK: Texts and Frequencies

**(C) 2017-2018 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Version:** 0.1, January 2018

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This is a brief introduction to NLTK for simple frequency analysis of texts.

For this to work, the folder that contains the notebook must have a subfolder *data* with a file *HOPG.txt* in it. This file contains "A House of Pomegranates" by Oscar Wilde, a collection of fairy tales, taken as raw text from [Project Gutenberg](https://www.gutenberg.org/).

## Simple File Processing

Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again:

In [None]:
ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
print(text)

The optional parameters in the *open* function above define the **mode** of operation on the file and the **encoding** of its content. Setting the **mode** to **r** opens the file for *reading* only, so the code that follows cannot write to it. Setting the **encoding** to **utf-8** tells Python to decode the content of the file using the [Unicode](https://en.wikipedia.org/wiki/Unicode) encoding scheme [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
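The same open, read, and close sequence is often written with a *with* statement, which closes the file automatically. The following is a minimal sketch of that pattern; the temporary file and the sample string are hypothetical stand-ins for *data/HOPG.txt*, used only so that the example runs on its own:

```python
import os
import tempfile

# Hypothetical stand-in for data/HOPG.txt, including a non-ASCII character.
sample = "A short sample text.\nSecond line with an accent: café.\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")

    # Write the sample as UTF-8 so that there is something to read back.
    with open(path, mode="w", encoding="utf-8") as ofile:
        ofile.write(sample)

    # The with-block closes the file when it ends, even if read() raises.
    with open(path, mode="r", encoding="utf-8") as ifile:
        text = ifile.read()

print(text == sample)  # the round trip preserves the text
print(ifile.closed)    # the file object is closed after the with-block
```

With this pattern there is no explicit call to *close*, so the file cannot be left open by accident.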

We can now import the [NLTK](https://www.nltk.org/) module in Python to work with frequency profiles and [n-grams](https://en.wikipedia.org/wiki/N-gram) using the tokens or words in the text.

In [None]:
import nltk

To generate a frequency profile from the text file, we can use the [NLTK](https://www.nltk.org/) class *FreqDist*:

In [None]:
myFD = nltk.FreqDist(text)

We can print out the frequency profile by looping through the returned data structure:

In [None]:
for x in myFD:
    print(x, myFD[x])

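We see that the frequency profile is for the characters in the text, not the words or tokens. To generate a frequency profile over the words/tokens in the text, we need to use a **tokenizer**. [NLTK](https://www.nltk.org/) provides basic tokenization functions. We will use the *word_tokenize* function to generate a list of tokens. This function relies on the Punkt tokenizer models, which may have to be downloaded once with *nltk.download* (the resource is named *punkt* or *punkt_tab*, depending on the NLTK version):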

In [None]:
tokens = nltk.word_tokenize(text)
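As an alternative that needs no downloaded model data, NLTK also ships a purely regular-expression-based tokenizer, *wordpunct_tokenize*. The sketch below applies it to a hypothetical sample sentence (not taken from the text file). Note that it is a different tokenizer and splits contractions differently from *word_tokenize*:

```python
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, not taken from the text file.
sample = "Don't stop, said the King."

# wordpunct_tokenize separates runs of word characters from runs of
# punctuation, so the apostrophe becomes a token of its own.
sample_tokens = wordpunct_tokenize(sample)
print(sample_tokens)
# → ['Don', "'", 't', 'stop', ',', 'said', 'the', 'King', '.']
```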

We can now generate a frequency profile from the token list:

In [None]:
myTokenFD = nltk.FreqDist(tokens)
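The difference between the character profile and the token profile comes from what *FreqDist* iterates over: a string yields characters, a list yields its items. A small sketch with a hypothetical sample string (split on whitespace here only to keep the example free of tokenizer data) makes the contrast visible:

```python
import nltk

# Hypothetical sample string.
sample = "the cat saw the dog"

char_fd = nltk.FreqDist(sample)          # a string is a sequence of characters
word_fd = nltk.FreqDist(sample.split())  # a list is a sequence of words

print(char_fd["t"])    # → 3 (occurrences of the character 't')
print(word_fd["the"])  # → 2 (occurrences of the word 'the')
print(char_fd["the"])  # → 0 (no single character equals 'the')
```

Looking up an item that was never counted returns 0 rather than raising an error.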

The frequency profile can be printed out in the same way as above, by looping over the tokens and their frequencies:

In [None]:
for token in myTokenFD:
    print(token, myTokenFD[token])
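For a full text, such a loop prints every distinct token, which can be a long list. If only the most frequent tokens are of interest, the *most_common* method of *FreqDist* returns (token, count) pairs sorted by descending count. A minimal sketch with a hypothetical token list:

```python
import nltk

# Hypothetical token list standing in for the tokens of the text.
sample_tokens = ["the", "king", "and", "the", "queen", "and", "the", "page"]
fd = nltk.FreqDist(sample_tokens)

# The two most frequent tokens with their counts.
top = fd.most_common(2)
for token, count in top:
    print(token, count)
# → the 3
# → and 2

# Tokens that occur exactly once (hapax legomena).
print(sorted(fd.hapaxes()))
# → ['king', 'page', 'queen']
```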

## Counting N-grams

[NLTK](https://www.nltk.org/) provides simple methods to generate [n-grams](https://en.wikipedia.org/wiki/N-gram), and frequency profiles over them, from any kind of list or sequence. We can, for example, generate bigrams, that is [n-grams](https://en.wikipedia.org/wiki/N-gram) for n = 2, from the text tokens:

In [None]:
myTokenBigrams = nltk.ngrams(tokens, 2)

The resulting bigrams can be printed out in a loop, as in the examples above, or we can convert the returned **Python generator object** into a list and print the entire data structure:

In [None]:
bigrams = list(myTokenBigrams)
print(bigrams)
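Converting the result to a list matters because the object returned by *nltk.ngrams* can be consumed only once. The sketch below uses a hypothetical token list and n = 3 to show both points: the same function yields trigrams, and a second pass over the same object yields nothing:

```python
import nltk

# Hypothetical token list.
sample_tokens = ["a", "b", "c", "d"]

trigram_iter = nltk.ngrams(sample_tokens, 3)

first_pass = list(trigram_iter)   # consumes the iterator
second_pass = list(trigram_iter)  # nothing is left to consume

print(first_pass)   # → [('a', 'b', 'c'), ('b', 'c', 'd')]
print(second_pass)  # → []
```

Storing the list, as done with *bigrams* above, allows the n-grams to be reused in later cells.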

The frequency profile from these bigrams can be generated in the same way as from the token list above:

In [None]:
myBigramFD = nltk.FreqDist(bigrams)

If we want to know some more general properties of the frequency distribution, we can print out information about it:

In [None]:
print(myBigramFD)
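The printed summary reports the number of samples (distinct bigrams) and the number of outcomes (all counted bigrams). The same two values are available as numbers through the *B* and *N* methods of *FreqDist*. A minimal sketch with a hypothetical token list:

```python
import nltk

# Hypothetical token list.
sample_tokens = ["the", "king", "and", "the", "queen", "and", "the", "king"]
bigram_fd = nltk.FreqDist(nltk.ngrams(sample_tokens, 2))

print(bigram_fd)      # summary line with samples and outcomes
print(bigram_fd.B())  # → 5 (distinct bigrams, the "samples")
print(bigram_fd.N())  # → 7 (all counted bigrams, the "outcomes")
```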

The bigrams and their corresponding frequencies can be printed using a loop:

In [None]:
for bigram in myBigramFD:
    print(bigram, myBigramFD[bigram])
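Absolute counts depend on the length of the text. To compare profiles across texts, relative frequencies are often more useful; the *freq* method of *FreqDist* returns the count of a sample divided by the total number of outcomes. A minimal sketch with a hypothetical token list:

```python
import nltk

# Hypothetical token list.
sample_tokens = ["the", "king", "and", "the", "queen", "and", "the", "king"]
bigram_fd = nltk.FreqDist(nltk.ngrams(sample_tokens, 2))

# Print each bigram with its count and its relative frequency.
for bigram in bigram_fd:
    print(bigram, bigram_fd[bigram], round(bigram_fd.freq(bigram), 3))

# The relative frequencies of all bigrams sum to 1 (up to rounding error).
total = sum(bigram_fd.freq(bigram) for bigram in bigram_fd)
print(round(total, 6))
```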
