In [None]:
%matplotlib inline

from matplotlib import pyplot

In [None]:
import nltk
import nltk.data
import numpy as np
from numpy import random as nprandom

Certain `nltk` components, such as the word tokenizer and the part-of-speech tagger, require a local cache of data. The following line tells the library to look for the files it needs in `./nltk_data`, relative to the current working directory:

In [None]:
nltk.data.path[0] = './nltk_data'

When a resource fails to load because its local files are missing, `nltk` raises a `LookupError`. The module also has a `download` function, which fetches specific data packages from a GitHub repository and can be used to recover from the error:

In [None]:
try:
    english = nltk.data.load('tokenizers/punkt/english.pickle')
except LookupError:
    nltk.download('punkt', './nltk_data')
    english = nltk.data.load('tokenizers/punkt/english.pickle')
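
The same "try to load, download on `LookupError`, then retry" pattern shows up more than once in this notebook, so it can be factored into a small helper. The sketch below is only an illustration: `load_with_fallback`, `fake_load`, and `fake_fetch` are hypothetical names that are not part of `nltk`, and the stand-in functions avoid any network access.

```python
def load_with_fallback(load, fetch):
    """Call load(); if it raises LookupError, call fetch() once and retry."""
    try:
        return load()
    except LookupError:
        fetch()
        return load()


# Stand-in demo with no network access: the first load fails, the retry succeeds.
cache = {}

def fake_load():
    # Mimics a loader that fails until the resource has been fetched.
    if 'punkt' not in cache:
        raise LookupError('resource not found')
    return cache['punkt']

def fake_fetch():
    # Mimics a download by placing the resource in the cache.
    cache['punkt'] = 'tokenizer'

result = load_with_fallback(fake_load, fake_fetch)
print(result)
```

In this notebook, `load` could be `lambda: nltk.data.load('tokenizers/punkt/english.pickle')` and `fetch` could be `lambda: nltk.download('punkt', './nltk_data')`.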

Now on to the fun stuff! The file `adams.txt` is a plain-text copy of Douglas Adams' book, *So Long, and Thanks for All the Fish*. We can use the natural language toolkit to split the book into a list of sentences, and each sentence into a list of words:

In [None]:
with open('rsrc/adams.txt') as fp:
    sentences = english.tokenize(fp.read())[1:]
    tokenized = [nltk.word_tokenize(s) for s in sentences]
    douglas_adams = nltk.Text(word.lower() for ws in tokenized for word in ws)

NLTK also has a part-of-speech tagger, based on an averaged perceptron, that works impressively well! We can use it to tag every word in a sentence with its part of speech:

In [None]:
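# Pick a random sentence by index; nprandom.choice cannot sample from a
# ragged list of lists on recent NumPy versions.
text = tokenized[nprandom.randint(len(tokenized))]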

try:
    tagged = nltk.pos_tag(text)
except LookupError:
    nltk.download('averaged_perceptron_tagger', './nltk_data')
    tagged = nltk.pos_tag(text)
    
print(tagged)
douglas_adams_tagged = nltk.pos_tag(douglas_adams)
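
The tagger returns a list of `(word, tag)` pairs, and by default the tags come from the Penn Treebank tagset, where noun tags start with `NN` and verb tags start with `VB`. As a rough sketch of how that output can be filtered, the example below uses a hand-written sample rather than real tagger output; `sample_tagged` and `words_with_tag_prefix` are hypothetical names.

```python
# Hand-written sample in the (word, tag) shape that nltk.pos_tag returns;
# these tags are illustrative, not real tagger output.
sample_tagged = [
    ('the', 'DT'), ('ships', 'NNS'), ('hung', 'VBD'),
    ('in', 'IN'), ('the', 'DT'), ('sky', 'NN'),
]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with prefix, in original order."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(sample_tagged, 'NN')
verbs = words_with_tag_prefix(sample_tagged, 'VB')
print(nouns, verbs)
```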

There's a lot we can do with that information! For instance, we can create a catalog of words belonging to every part of speech:

In [None]:
words = {}
parts_of_speech = {}

for word, pos in douglas_adams_tagged:
    # words to parts of speech:
    words.setdefault(word, [])
    words[word].append(pos)
    # parts of speech to words:
    parts_of_speech.setdefault(pos, set())
    parts_of_speech[pos].add(word)

for k in words:
    words[k] = sorted(words[k])

for k in parts_of_speech:
    parts_of_speech[k] = sorted(parts_of_speech[k])
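
A slightly more compact way to build the same kind of catalog is `collections.defaultdict`, which removes the need for `setdefault`. This is a minimal sketch on a hand-written sample (not real tagger output), and it differs from the cell above in one respect: it stores each word's tags in a set, so repeated tags are dropped instead of kept.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs standing in for douglas_adams_tagged.
sample_tagged = [('the', 'DT'), ('fish', 'NN'), ('fish', 'VB'), ('the', 'DT')]

words = defaultdict(set)            # word -> set of tags seen for it
parts_of_speech = defaultdict(set)  # tag -> set of words seen with it

for word, pos in sample_tagged:
    words[word].add(pos)
    parts_of_speech[pos].add(word)

# Convert to plain dicts of sorted lists for stable, readable output.
words = {w: sorted(tags) for w, tags in words.items()}
parts_of_speech = {p: sorted(ws) for p, ws in parts_of_speech.items()}
print(words)
print(parts_of_speech)
```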

We can also do a simple statistical analysis on which words and parts of speech are most common:

In [None]:
word_stats = {}
pos_stats = {}

for word, pos in douglas_adams_tagged:
    # count words:
    word_stats.setdefault(word, 0)
    word_stats[word] += 1
    # count parts of speech:
    pos_stats.setdefault(pos, 0)
    pos_stats[pos] += 1
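
The standard library's `collections.Counter` does the same counting in one line per table and adds a `most_common` method for ranking. A minimal sketch, again on a hand-written sample instead of the real tagged text:

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for douglas_adams_tagged.
sample_tagged = [('the', 'DT'), ('fish', 'NN'), ('the', 'DT'), ('swam', 'VBD')]

word_stats = Counter(word for word, _ in sample_tagged)
pos_stats = Counter(pos for _, pos in sample_tagged)

# most_common(n) returns the n highest counts as (item, count) pairs.
top_word = word_stats.most_common(1)
print(top_word, pos_stats['DT'])
```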

This turns out to be a lot cooler than you might think! Word frequency in natural language tends to follow a surprisingly neat power-law distribution. According to [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law), a word's frequency is roughly inversely proportional to its rank, so a plot of frequency against rank should fall off as a power law (which shows up as a straight line on a log-log plot):

In [None]:
ranked = sorted(word_stats.items(), key=lambda t: -t[1])
# log rank on the x-axis, log frequency on the y-axis
x = np.log(np.arange(1, len(ranked) + 1))
y = np.log([count for _, count in ranked])

pyplot.plot(x, y)

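That's really cool! It means there's some exponent *s* such that the word at rank *r* occurs with a frequency roughly proportional to `1 / r^s`. When *s* is close to 1, the most common word is about twice as common as the second most common word, three times as common as the third, and so on.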
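
One way to put a number on how closely a ranked frequency list follows Zipf's law is to fit a straight line to log frequency against log rank and look at the slope, which should be close to -1 for a textbook Zipf distribution. The sketch below uses synthetic counts, not the counts from the book; `loglog_slope` is a hypothetical helper written with an ordinary least-squares fit.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic counts that follow frequency = 1000 / rank exactly.
ideal = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 3))
```

Passing `word_stats.values()` to the same function would give an estimate for the book, with the caveat that real text usually bends away from a straight line at the very top and bottom ranks.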

I wonder if Zipf's law also applies to the most common parts of speech:

In [None]:
# Counts in descending order, paired with ranks 1..N (linear axes this time).
x = sorted(pos_stats.values(), reverse=True)
y = list(range(1, len(x) + 1))

pyplot.plot(x, y)

It doesn't look like it. Still, that is a neat and clearly non-random distribution!

We can also use the natural language toolkit to look up the *pronunciation* of a given word. [The CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) provides precomputed phonetic transcriptions for most words in the English language, and is made available as a module in the `nltk` library: