# Exploring word frequency

David J. Birnbaum

djbpitt@gmail.com  
http://www.obdurodon.org  
Last revised 2018-06-27

## Housekeeping

Import `nltk`, including the `PlaintextCorpusReader`, and read the corpus into a variable.

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = "/Users/djb/Desktop/inaugural" # edit this to point to the location on your machine
inaug = PlaintextCorpusReader(corpus_root, ".*txt")

## How frequent is “the” in each speech?

Loop over the files in the corpus and create a frequency distribution object (called `x`) for each file in turn. Use that object to find the relative frequency (on a scale of 0 to 1) of the word “the” in each speech. To make the result easier to read, use `format()` to format the frequency as a percentage with exactly two decimal places. Print the percentage and the file identifier.

In [None]:
for f in inaug.fileids():  # loop over all files
    x = nltk.FreqDist(inaug.words(f))  # frequency distribution for each speech
    freq_the = x.freq("the")  # relative frequency of "the" (case-sensitive)
    print('{:.2%}'.format(freq_the), f, sep='\t')  # separate by tab for readability
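
The relative frequency that `freq()` reports is the count of one token divided by the total number of tokens. The following minimal sketch reproduces that arithmetic with `collections.Counter` (the standard-library class on which `FreqDist` is built) so that it runs without a corpus. The token list is made up for illustration and is not taken from any speech. It also shows that the lookup is case-sensitive, so “The” at the start of a sentence is not counted as “the” unless the tokens are lowercased first.

```python
from collections import Counter

# Hypothetical tokens, invented for illustration only
tokens = ["The", "people", "of", "the", "United", "States", "and", "the", "Union"]

counts = Counter(tokens)
total = sum(counts.values())  # total number of tokens (9)

# Relative frequency of the exact token "the" (does not match "The")
freq_the = counts["the"] / total
print('{:.2%}'.format(freq_the))

# Lowercase every token first to count "The" and "the" together
folded = Counter(t.lower() for t in tokens)
freq_the_folded = folded["the"] / total
print('{:.2%}'.format(freq_the_folded))
```

Whether to fold case depends on the question you are asking; the cells in this notebook count only the lowercase form.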

## Graphing the results

It’s easier to see the results on a graph than in a list. Let’s do that!

### Installing `matplotlib`

If you don’t have `matplotlib` (a graphing package) installed (or if you aren’t sure whether you have it installed), uncomment the following cell and run it. If you have to install `matplotlib`, it may take a while. The installation process will provide a report at the end, which you can ignore unless it notifies you of an error.

In [None]:
# !pip install matplotlib

### Using `matplotlib`

`matplotlib` can be used to plot graphs that are rendered inside Jupyter Notebook. We first import `matplotlib.pyplot`, the plotting interface of `matplotlib`, and we alias it as `plt` because that’s easier to type than the full name. (The import may take a long time the first time you run it.) We then use a list comprehension to create a list of pairs, each consisting of part of the filename (just the first four characters, which are the year) plus the frequency of “the” in that particular file. We’ll look at the first few pairs to verify that it works.

In [None]:
import matplotlib.pyplot as plt
pairs = [(f[:4], nltk.FreqDist(inaug.words(f)).freq("the")) for f in inaug.fileids()]
pairs[:5]
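
The slice `f[:4]` keeps the first four characters of each file identifier. Here is a small sketch of how slicing behaves inside a list comprehension. It assumes that the file names follow a `YEAR-Name.txt` pattern; the two names below are hypothetical stand-ins, so check them against the output of `inaug.fileids()` on your machine.

```python
# Hypothetical file identifiers, assumed to follow a YEAR-Name.txt pattern
fileids = ["1789-Washington.txt", "1861-Lincoln.txt"]

# f[:4] is the first four characters (the year);
# f[5:-4] skips the year and hyphen and drops the ".txt" extension
parts = [(f[:4], f[5:-4]) for f in fileids]
print(parts)
```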

`matplotlib` expects the X and Y values for the plot points to be supplied separately: first a sequence of X values and then a sequence of Y values. We can use a combination of the `list()` function, the `zip()` function, and the `*` (unpack) operator to reformat the list of pairs into a list that contains just two items: all of the years (which will be the X values) and all of the frequencies (which will be the Y values). You can look up these Python features online for more information. We’ll turn off pretty printing to make the result easier to see.

In [None]:
unzipped = list(zip(*pairs))
%pprint
unzipped
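
To see what `zip(*pairs)` does, it can help to try it on a list small enough to read at a glance. The sketch below uses three made-up pairs (the frequencies are placeholders, not real results). The `*` operator passes each pair to `zip()` as a separate argument, and `zip()` then groups the first members together and the second members together.

```python
# Made-up (year, frequency) pairs; the numbers are placeholders
pairs = [("1789", 0.1), ("1793", 0.2), ("1797", 0.3)]

# zip(*pairs) is the same as zip(pairs[0], pairs[1], pairs[2])
unzipped = list(zip(*pairs))
print(unzipped)

# The two items can also be unpacked into separate names
years, freqs = unzipped

# Zipping the two sequences back together restores the original pairs
rezipped = list(zip(years, freqs))
```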

Although the years look like numbers to us, Python is treating them as strings, which we know because they’re wrapped in quotation marks. Python thinks they’re strings because we sliced them out of the filenames, which are strings. For our purposes (we’re just going to use them as labels on the X axis), the datatype doesn’t matter. 
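
If you ever did need the years as numbers (for example, to do arithmetic with them, or to have them spaced along the axis according to their numeric value), you could convert them with `int()`. This is a minimal sketch with three sample year strings; the variable names are invented for the example.

```python
# Sample year strings, as they would come out of slicing a filename
years_as_strings = ("1789", "1793", "1797")

# Convert each string to an integer
years_as_ints = [int(y) for y in years_as_strings]
print(years_as_ints)

# Arithmetic works on integers but would fail on the strings
span = years_as_ints[-1] - years_as_ints[0]
print(span)
```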

You may have noticed that the two items in the `unzipped` list are not really lists, since they are delimited by parentheses and not by square brackets. The technical term for this type of object in Python is _tuple_. You can think of it as similar to a list in that both are sequences of objects. Tuples have some different properties from lists, but `matplotlib` is equally happy with two tuples or two lists, so we’ll leave them as is.
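
The most visible difference is that a tuple cannot be changed after it has been created, while a list can. The following sketch uses sample values to illustrate this, and it shows that `list()` will make an editable copy of a tuple if you need one.

```python
years = ("1789", "1793")  # a tuple: note the parentheses

# Trying to replace an item in a tuple raises a TypeError
try:
    years[0] = "1801"
except TypeError as err:
    message = str(err)
    print(message)

# A list built from the tuple can be modified freely
years_list = list(years)
years_list[0] = "1801"
print(years_list)
```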

To control the size of the image, we set the position and size of the plotting area with `plt.axes()`. The four values in the list are, in order, the left edge, the bottom edge, the width, and the height of the plotting area, each expressed as a multiple of the default figure’s width or height (they are not data values). How this looks depends on the size and resolution of your screen; feel free to experiment with the values. We then plot our data by passing in the X and Y values as two arguments to `plt.plot()`. Finally, we call `plt.show()` to display the graph.

In [None]:
plt.axes([1, 1, 4, 3])
plt.plot(unzipped[0], unzipped[1])
plt.show()
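
Another way to control the size of the image is to create the figure explicitly and give it a size in inches. The sketch below is an alternative to the cell above, not a replacement for it. The function name `plot_frequencies`, the 12 by 6 inch size, and the axis labels are all choices made for this example, and the sample values passed in at the end are placeholders rather than real frequencies.

```python
import matplotlib.pyplot as plt

def plot_frequencies(years, freqs):
    """Plot frequencies against years on a figure of a fixed size."""
    fig, ax = plt.subplots(figsize=(12, 6))  # width and height in inches (arbitrary choice)
    ax.plot(years, freqs)
    ax.set_xlabel("Year")
    ax.set_ylabel('Relative frequency of "the"')
    ax.tick_params(axis="x", labelrotation=90)  # rotate year labels so they don't overlap
    return fig, ax

# Placeholder values for illustration; in the notebook you would pass
# unzipped[0] and unzipped[1] instead
fig, ax = plot_frequencies(["1789", "1793", "1797"], [0.1, 0.2, 0.3])
plt.show()
```

Labeling the axes makes the graph easier to interpret when you come back to it later or share it with someone else.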