# Introduction to NLTK

In [None]:
import nltk

Load the Book collection.

In [None]:
from nltk.book import *

## Simple Statistics

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

Number of words in the list:

In [None]:
len(saying)

Unique words:

In [None]:
words = sorted(set(saying))

There are fewer unique words than words in total:

In [None]:
len(words)

Last two words:

In [None]:
words[-2:]

First 4 words:

In [None]:
words[0:4]

## Searching Text
There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Below, we first join a few tokens of *Moby Dick* (text1) to get a feel for the raw text. We then look up the word *monstrous* by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

In [None]:
' '.join(text1.tokens[40:50])

In [None]:
text1.concordance("monstrous")

Your turn. Try searching for other words: just edit the code above.
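To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name concordance_lines, the width parameter, and the toy token list are all assumptions made for illustration.

```python
def concordance_lines(tokens, word, width=2):
    """Return (left context, match, right context) for each occurrence of word.

    Hypothetical helper for illustration; NLTK's Text.concordance works on
    character widths and prints its results instead of returning them.
    """
    word = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented toy text, not a quotation from Moby Dick.
toy_tokens = 'the monstrous pictures hung near a monstrous size whale'.split()

for left, tok, right in concordance_lines(toy_tokens, 'monstrous'):
    print(f'{left:>20} [{tok}] {right}')
```

Each match is shown with up to two tokens of context on either side, which is the same information a concordance view presents.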

A concordance permits us to see words in context. For example, we saw that *monstrous* occurred in contexts such as *the ___ pictures* and *a ___ size*. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

In [None]:
text1.similar("monstrous")

We get different results for different texts: try text2.similar("monstrous") to run the same query on *Sense and Sensibility*. Austen uses this word quite differently from Melville; for her, *monstrous* has positive connotations, and sometimes functions as an intensifier like the word *very*.

The method common_contexts allows us to examine just the contexts that are shared by two or more words, such as *monstrous* and *very*. We have to enclose these words in square brackets as well as parentheses, and separate them with a comma:

In [None]:
text2.common_contexts(["monstrous", "very"])
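The notion of a shared context can be sketched in plain Python by treating a context as the pair (previous word, next word). This is a simplified illustration under that assumption, not NLTK's algorithm; the helper names contexts and shared_contexts and the toy sentence are invented.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word.lower()].add((prev.lower(), nxt.lower()))
    return ctx

def shared_contexts(tokens, words):
    """Return the contexts that occur around every word in words."""
    ctx = contexts(tokens)
    return sorted(set.intersection(*(ctx[w.lower()] for w in words)))

# Invented toy text for illustration.
toy = 'a monstrous pretty girl and a very pretty house was a monstrous size'.split()

print(shared_contexts(toy, ['monstrous', 'very']))
```

Here both words occur in the frame *a ___ pretty*, so that frame is the only shared context.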

## Counting Vocabulary
Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis:

In [None]:
len(text3)

The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command: set(text3). When you do this, many screens of words will fly past. Now try the following:

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

Now, let's calculate a measure of the lexical richness of the text. The next example shows us that the number of distinct words is just 6% of the total number of words, or equivalently that each word is used about 16 times on average:

In [None]:
len(set(text3)) / len(text3)
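The two readings of this ratio are reciprocals of each other. As a small worked example, here is the same arithmetic on the toy list saying from earlier (redefined so that the snippet runs on its own); the variable names diversity and avg_uses are chosen just for this sketch.

```python
saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

distinct = len(set(saying))   # number of distinct tokens
total = len(saying)           # number of tokens overall

diversity = distinct / total  # share of tokens that are distinct
avg_uses = total / distinct   # average number of uses per distinct token

print(distinct, total)
print(round(diversity, 3), round(avg_uses, 3))
```

Note that set() is case-sensitive, so *After* and *after* would count as two distinct items.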

You may want to repeat such calculations on several texts, but it is tedious to keep retyping the formula. Instead, you can come up with your own name for a task, like "lexical_diversity" or "percentage", and associate it with a block of code. Now you only have to type a short name instead of one or more complete lines of Python code, and you can re-use it as often as you like. The block of code that does a task for us is called a function, and we define a short name for our function with the keyword def. The next example shows how to define two new functions, lexical_diversity() and percentage():

In [None]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total

In [None]:
lexical_diversity(text1)

In [None]:
percentage(4, 5)

In [None]:
percentage(text4.count('the'), len(text4))

Exercise: calculate the lexical diversity of various genres in the *Brown Corpus*.

### Frequency Distributions
How can we automatically identify the words of a text that are most informative about the topic and genre of the text? Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item. The tally would need thousands of rows, and it would be an exceedingly laborious process, so laborious that we would rather assign the task to a machine. NLTK's FreqDist builds such a tally for us:

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist1

In [None]:
fdist1.most_common(50)
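A frequency distribution is essentially a tally of how often each item occurs. The sketch below illustrates this with collections.Counter from the standard library, which FreqDist subclasses in NLTK 3, applied to the toy list saying rather than to *Moby Dick*, so it runs without the NLTK data.

```python
from collections import Counter

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

tally = Counter(saying)        # one count per distinct token

print(tally['is'])             # how often a single token occurs
print(tally.most_common(3))    # the three most frequent tokens with their counts
print(sum(tally.values()))     # total number of tokens counted
```

With a real text, the most frequent items tend to be function words and punctuation, which is why raw frequency alone says little about topic.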

# Gutenberg Corpus

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

In [None]:
emma = gutenberg.words('austen-emma.txt')
len(emma)

The first 40 words:

In [None]:
' '.join(emma[:40])

In [None]:
nltk.Text(emma).concordance("surprize")

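For each book, show three statistics: the average word length, the average number of words per sentence, and the average number of times each vocabulary item appears. Because num_chars is taken from the raw text, it also counts whitespace, so the first figure slightly overstates the true average word length: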

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
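To see what the three ratios measure, here is a sketch that computes them for a tiny invented text. The function name text_stats and the toy data are assumptions for illustration; with NLTK, the raw string and the sentence list would come from gutenberg.raw() and gutenberg.sents().

```python
def text_stats(raw, sents):
    """Return (chars per word, words per sentence, uses per vocabulary item)."""
    words = [w for sent in sents for w in sent]
    vocab = set(w.lower() for w in words)   # case-folded vocabulary
    return (len(raw) / len(words),
            len(words) / len(sents),
            len(words) / len(vocab))

# Invented toy text; punctuation marks count as tokens, as in the corpus readers.
raw = 'The cat sat. The dog sat.'
sents = [['The', 'cat', 'sat', '.'], ['The', 'dog', 'sat', '.']]

chars_per_word, words_per_sent, uses_per_item = text_stats(raw, sents)
print(chars_per_word, words_per_sent, uses_per_item)
```

The first ratio divides by all tokens but counts every character of the raw text, spaces included, which is why it is only a rough estimate of word length.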

Look at Shakespeare's *Macbeth*.

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')

In [None]:
len(macbeth_sentences)

In [None]:
' '.join(macbeth_sentences[1116])

The length of the longest sentence in *Macbeth*:

In [None]:
longest_len = max(len(s) for s in macbeth_sentences)

In [None]:
longest_len

To get the sentence itself rather than its length, pass key=len to max. This acts as an argmax: it returns the sentence s for which len(s) is largest.

In [None]:
longest = max(macbeth_sentences, key=len)

The sentence is:

In [None]:
' '.join(longest)
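The difference between taking the maximum of the lengths and taking the maximum by length is easy to check on a toy list. The sentences below are invented for illustration; the last line shows one way to collect every sentence that ties for the maximum, since max() returns only the first.

```python
# Invented toy sentences, already tokenized.
sentences = [
    ['Enter', 'three', 'witches', '.'],
    ['When', 'shall', 'we', 'three', 'meet', 'again', '?'],
    ['Exit', '.'],
]

longest_len = max(len(s) for s in sentences)   # a number: the largest length
longest = max(sentences, key=len)              # a sentence: the one with that length

# All sentences that share the maximum length.
all_longest = [s for s in sentences if len(s) == longest_len]

print(longest_len)
print(' '.join(longest))
```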

# Web and Chat Text

In [None]:
from nltk.corpus import webtext

In [None]:
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
from nltk.corpus import nps_chat

In [None]:
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

In [None]:
' '.join(chatroom[123])

# Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. For a complete list, see http://icame.uib.no/brown/bcm-los.html.

In [None]:
from nltk.corpus import brown

In [None]:
brown.categories()

In [None]:
brown.words(categories='news')

In [None]:
brown.words(fileids=['cg22'])

In [None]:
for s in brown.sents(categories=['news', 'editorial', 'reviews']):
    print(' '.join(s))

## Experiment with the corpus

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as *stylistics*. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre.

In [None]:
news_text = brown.words(categories='news')

In [None]:
fdist = nltk.FreqDist(w.lower() for w in news_text)

In [None]:
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
for m in modals:
    print(m + ':', fdist[m], end=' ')

### Compare the use of modals

Obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions.

In [None]:
cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

In [None]:
cfd.tabulate(conditions=genres, samples=modals)

Observe that the most frequent modal in the news genre is **will**, while the most frequent modal in the romance genre is **could**. Would you have predicted this? 
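A conditional frequency distribution can be thought of as one tally per condition, here one tally per genre. The sketch below imitates that structure with defaultdict and Counter from the standard library. The helper names conditional_counts and tabulate_counts and the (genre, word) pairs are invented for illustration and are not Brown Corpus data.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

def tabulate_counts(cfd, conditions, samples):
    """Return a plain-text table with one row per condition."""
    rows = [' ' * 10 + ''.join(f'{s:>7}' for s in samples)]
    for c in conditions:
        rows.append(f'{c:>10}' + ''.join(f'{cfd[c][s]:>7}' for s in samples))
    return '\n'.join(rows)

# Invented toy pairs, not real corpus counts.
pairs = [('news', 'will'), ('news', 'will'), ('news', 'can'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'will')]

cfd = conditional_counts(pairs)
print(tabulate_counts(cfd, ['news', 'romance'], ['can', 'could', 'will']))
```

A missing combination simply reads as zero, because Counter returns 0 for keys it has not seen.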

# Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document.

In [None]:
from nltk.corpus import reuters

In [None]:
reuters.categories()

Categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.categories(['training/9865', 'training/9880'])
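The two directions of lookup, from documents to topics and from topics to documents, can be sketched with an ordinary dictionary. Everything here is hypothetical: the toy fileids, the topic assignments, and the helper names categories_of and fileids_of. Only the idea of accepting either a single string or a list mirrors the corpus methods described above.

```python
# Invented toy data: fileid -> topics. These are not real Reuters documents.
toy_topics = {
    'toy/1': ['grain', 'wheat'],
    'toy/2': ['corn', 'grain'],
    'toy/3': ['money-fx'],
}

def as_list(x):
    """Accept a single string or a list of strings."""
    return [x] if isinstance(x, str) else list(x)

def categories_of(fileids):
    """Topics covered by one or more documents."""
    found = set()
    for fileid in as_list(fileids):
        found.update(toy_topics[fileid])
    return sorted(found)

def fileids_of(categories):
    """Documents that belong to at least one of the given topics."""
    wanted = set(as_list(categories))
    return sorted(f for f, topics in toy_topics.items() if wanted & set(topics))

print(categories_of('toy/1'))
print(categories_of(['toy/1', 'toy/2']))
print(fileids_of('grain'))
```

Because a document may carry several topics, the same fileid can appear under more than one category.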

# Inaugural Address Corpus

The Inaugural Address Corpus, which we met earlier as the single text text4, is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()

Notice that the year of each text appears in its filename. To get the year out of the filename, we extract the first four characters, using fileid[:4].

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

Let's look at how the words *America* and *citizen* are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks if they start with either of the "targets" *america* or *citizen* using startswith(). Thus it will count words like *American's* and *Citizens*.

In [None]:
cfd = nltk.ConditionalFreqDist(
          (target, fileid[:4])
          for fileid in inaugural.fileids()
          for w in inaugural.words(fileid)
          for target in ['america', 'citizen']
          if w.lower().startswith(target))
cfd.plot()
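The counting step can be tried without the corpus. The sketch below applies the same fileid[:4] and startswith() logic to a tiny invented mapping from fileids to words; the fileids and word lists are made up for illustration and are not taken from any real address.

```python
from collections import Counter, defaultdict

# Invented toy corpus: fileid -> tokens. Not real inaugural texts.
toy_corpus = {
    '1789-Example.txt': ['Citizens', 'of', 'America'],
    '2009-Example.txt': ['American', 'citizens', 'and', "America's", 'citizen'],
}
targets = ['america', 'citizen']

counts = defaultdict(Counter)            # target -> Counter of years
for fileid, words in toy_corpus.items():
    year = fileid[:4]                    # first four characters are the year
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                counts[target][year] += 1

for target in targets:
    print(target, sorted(counts[target].items()))
```

Prefix matching is deliberately loose: it counts *American* and *America's* under *america*, and both *citizen* and *citizens* under *citizen*.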

# Corpora in other languages

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora.

In [None]:
spanish = nltk.corpus.cess_esp.words()
' '.join(spanish[:20])

The udhr corpus contains the *Universal Declaration of Human Rights* in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let's use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus.

In [None]:
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
    'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Italian']

We build cumulative word length distributions for six translations of the Universal Declaration of Human Rights:

In [None]:
cfd = nltk.ConditionalFreqDist(
          (lang, len(word))
          for lang in languages
          for word in udhr.words(lang + '-Latin1'))

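The graph produced by the next cell shows that words having 5 or fewer letters account for about 60% of German text and 25% of Inuktitut text. Ibibio is not in the languages list above; adding 'Ibibio_Efik' to it would show about 80% for Ibibio text.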

In [None]:
cfd.plot(cumulative=True)
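What the cumulative plot shows for each language is, at each length n, the share of words with n or fewer letters. A minimal sketch of that calculation follows; the function name cumulative_length_share and the toy word list are assumptions for illustration. Note that NLTK's cumulative plot shows running counts by default, so the shares here are those counts divided by the total.

```python
from collections import Counter

def cumulative_length_share(words, max_len):
    """Map each length n (1..max_len) to the share of words with length <= n."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    shares = {}
    running = 0
    for n in range(1, max_len + 1):
        running += lengths[n]            # Counter gives 0 for unseen lengths
        shares[n] = running / total
    return shares

# Invented toy word list.
toy_words = ['a', 'of', 'the', 'human', 'rights', 'declaration']

shares = cumulative_length_share(toy_words, 11)
print(round(shares[5], 2))   # share of words with 5 or fewer letters
print(shares[11])            # every word has 11 or fewer letters
```

A language whose curve rises steeply has mostly short words, while a curve that rises slowly indicates many long words.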