# Intro to NLTK

NLTK (the Natural Language Toolkit) was created in 2001 as part of a computational linguistics course at the University of Pennsylvania. Today it is an open-source project that packages common language processing tasks into a single framework.

Other natural language frameworks exist today, and NLTK's algorithms may not be as advanced or as highly optimized as those found in other toolkits.

NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides: 
* Basic classes for representing data relevant to natural language processing 
* Standard interfaces for performing tasks, such as tokenization, tagging, and parsing 
* Demonstrations (parsers, chunkers, chatbots) 
* Extensive documentation, including tutorials and reference documentation 

## Installing NLTK

We first need to install NLTK onto our system. See http://www.nltk.org/install.html, which covers a Windows binary installation and a Mac/Unix command-line installation.

On my Mac, I used the command `pip3 install nltk`.

Once NLTK is installed, the following import should work.

In [None]:
import nltk

This notebook follows Chapter 1 in the NLTK book, which explores some basic text processing on books that are pre-loaded into NLTK. 

The default NLTK installation is only the bare minimum. Other parts of the toolkit, such as corpora, taggers, and parsers, can be installed separately.

The NLTK book collection must first be downloaded onto your system. <u>Download the book collection.</u> Running `nltk.download()` with no arguments opens an interactive downloader; select the `book` collection there. This only has to be done once.

In [None]:
nltk.download()

## Loading the `book` collection

From NLTK's `book` module, load all items.

In [None]:
from nltk.book import *

### More information about a specific NLTK text

In [None]:
text1

In [None]:
text2

### Searching a text: `concordance`

`concordance` shows every occurrence of a given word, together with some context.

In [None]:
text1.concordance('monstrous')
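
To make the idea concrete, here is a minimal pure-Python sketch of what a concordance does. The function `simple_concordance` and its `width` parameter are hypothetical names for illustration, not NLTK's implementation, and matching is made case-insensitive here as a design choice.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(" ".join(left + [token] + right))
    return lines

# Toy token list standing in for a real text (an assumption for this sketch).
tokens = "the whale was a monstrous size and a most monstrous fable".split()
for line in simple_concordance(tokens, "monstrous"):
    print(line)
```

Each printed line is one occurrence of the word surrounded by its neighbours, which is the same kind of view `text1.concordance` gives for a full book.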

### Dispersion Plot

A dispersion plot shows where words occur within a text, so it can be used to investigate changes in language use over time:
* each stripe represents an instance of a word 
* each row represents the entire text 

In this example, the text4 book is an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end. **What do you expect this graph to look like?**

(In order for this to work, the Python library `matplotlib` also needs to be installed. On my Mac, I had to do `pip3 install matplotlib`.)
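
The data behind a dispersion plot is simply the list of token positions at which each word occurs. The sketch below collects those positions for a toy sentence; `word_offsets` is a hypothetical helper written for this illustration, not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy token list (an assumption for this sketch).
tokens = "freedom for citizens and duties of citizens in America".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "America"])
print(offsets)
```

Plotting each position as a stripe, with one row per word, would give the picture described above.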

In [None]:
%matplotlib notebook

text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])

### Counting Tokens

`len` gives the length of a text from start to finish, counted in tokens.

In [None]:
len(text3)

_What does this mean?_

The book of Genesis has 44,764 **tokens**. A token is:
* a sequence of characters that is treated as a group
* usually a word or a punctuation symbol (no spaces)
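
The texts in `nltk.book` are already split into tokens. As a rough illustration of the idea, the sketch below splits a sentence into word and punctuation tokens with a regular expression. `simple_tokenize` is a hypothetical helper, and real tokenizers handle many more cases.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("In the beginning, God created the heaven and the earth.")
print(tokens)
print(len(tokens))  # the comma and the full stop each count as a token
```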

### Vocabulary

The vocabulary of a text is the set of distinct tokens that it uses. (Recall that there are no duplicates in a set.)

How many **distinct words** does the book of Genesis contain? 

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

Note that the number of unique types includes punctuation symbols, so it’s not completely accurate to say that there are 2,789 different words.
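
One possible way to get closer to a count of distinct words is to drop punctuation and ignore case before building the set. The sketch below does this on a toy token list; filtering with `str.isalpha()` is one reasonable choice among several, not a rule from the NLTK book.

```python
# Toy token list (an assumption for this sketch).
tokens = ["In", "the", "beginning", ",", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

all_types = set(tokens)                                    # includes "," and "."
word_types = {t.lower() for t in tokens if t.isalpha()}    # alphabetic tokens only

print(len(all_types))
print(len(word_types))
```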

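### Lexical Diversity (TTR: Type-Token Ratio)

A measure of lexical richness. Note that the measure defined here is tokens per type (the average number of times each type is used), which is the reciprocal of the conventional type-token ratio (types per token).

$$ \text{Lexical Diversity} = \frac{\text{Text Length}}{\text{Number of Unique Types}} $$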

In [None]:
def lexical_diversity(text):
    return len(text) / len(set(text))

In [None]:
lexical_diversity(text3)

In [None]:
lexical_diversity(text5)

Writers necessarily re-use function words (such as *the* and *of*), so longer texts tend to show more repetition. Lexical diversity is therefore best used for comparing texts of equal length.
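
A small worked example may help show why text length matters. The sketch below computes the measure on a toy sentence and on its first six tokens. The sentence is made up for illustration, and `type_token_ratio` is a hypothetical helper for the reciprocal quantity.

```python
def lexical_diversity(tokens):
    # Tokens per type, as defined in this notebook.
    return len(tokens) / len(set(tokens))

def type_token_ratio(tokens):
    # Types per token: the reciprocal of the measure above.
    return len(set(tokens)) / len(tokens)

tokens = "the cat sat on the mat and the dog sat on the rug".split()

print(lexical_diversity(tokens[:6]))   # 6 tokens, 5 types
print(lexical_diversity(tokens))       # 13 tokens, 8 types
print(type_token_ratio(tokens))
```

The longer sample has more tokens per type simply because words like "the" keep recurring, even though the style of the text has not changed.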

### Lexical Diversity in the Brown Corpus

**Corpus:** a collection of “real-world” text (plural is corpora)

The Brown Corpus is a famous corpus compiled in the 1960s at Brown University.
* a general corpus of 500 samples of English-language text, totaling approximately one million words, compiled from works published in the United States in 1961 

<u>Lexical Diversity of various genres in the Brown corpus:</u>

```
Genre              Tokens    Types   Lexical Diversity
Skill and Hobbies  82345     11935   6.9
Humor              21695     5017    4.3
Fiction: science   14470     3233    4.5
Press: reportage   100554    14394   7.0
Fiction: romance   70022     8452    8.3
Religion           39399     6373    6.2
```

### Frequency Distributions

A frequency distribution contains the frequency of each vocabulary item in the text. 

In other words, it contains the count of every word. 

_What Python data structure would be used to represent a Frequency Distribution?_

### NLTK's `FreqDist`

NLTK has built-in support for maintaining the data in a frequency distribution. 

* If you look at the NLTK code (since it’s open source), `FreqDist` is a subclass of Python's `collections.Counter`, which is itself a dictionary.
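
To see the dictionary idea without NLTK, the sketch below uses `collections.Counter` from the standard library on a toy token list. It is an illustration of the underlying data structure, not a replacement for `FreqDist`.

```python
from collections import Counter

# Toy token list (an assumption for this sketch).
tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

print(counts["the"])           # count for one word
print(counts.most_common(2))   # the two most frequent words with their counts
print(counts["harpoon"])       # missing words count as 0 instead of raising KeyError
```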

Finding the 50 most frequent words of Moby Dick:

In [None]:
fdist1 = FreqDist(text1)

In [None]:
print(fdist1)

In [None]:
fdist1.most_common(50)

In [None]:
fdist1['whale']

In [None]:
type(fdist1)

### Common Words and Hapaxes

Notice anything?

In [None]:
text1

In [None]:
fdist1.most_common(50)

Notice that the most frequent words of Moby Dick don’t describe the topic or genre of the text. This is a common finding!
* *whale* is the exception
* **hapaxes** (plural of “hapax”) are words that appear only once in a language or written work
  * Sometimes a hapax can only be understood from its context

In [None]:
fdist1.hapaxes()
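
Given a table of counts, finding hapaxes amounts to keeping the words whose count is exactly one. The sketch below shows this with `collections.Counter` on a toy token list. It is a plain-Python illustration and is not claimed to be how NLTK implements `hapaxes()`.

```python
from collections import Counter

# Toy token list (an assumption for this sketch).
tokens = "call me Ishmael and call the ship and the sea".split()
counts = Counter(tokens)

# Keep only the words that occur exactly once.
hapaxes = sorted(word for word, n in counts.items() if n == 1)
print(hapaxes)
```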

### Fine-Grained Selection of Words

List comprehensions will come in very handy.

Let's build a list comprehension that finds _frequently occurring long words_.

General syntax: `[w for w in V if p(w)]`
* iterates over a collection and returns a list 
* (remember: duplicates are possible in a list) 
* read as: “for each word w in collection V, if p(w) is true, then add w to the returned list” 
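
The general form can be tried on a small example. In the sketch below, the predicate `p` and the collection `V` are toy stand-ins chosen for illustration, and the explicit loop shows what the comprehension is shorthand for.

```python
def p(w):
    """Hypothetical predicate: keep words longer than four characters."""
    return len(w) > 4

# V is a list here, so duplicates are possible.
V = ["whale", "sea", "harpoon", "ship", "whale"]

selected = [w for w in V if p(w)]

# The same selection written as an explicit loop.
selected_loop = []
for w in V:
    if p(w):
        selected_loop.append(w)

print(selected)
print(selected == selected_loop)
```

Because `V` is a list, "whale" appears twice in the result. Starting from `set(V)` instead would remove the duplicate.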

#### Find words from the vocabulary of a text that are more than 15 characters long. 

In [None]:
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

#### Find commonly occurring long words in a text.

In [None]:
fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])

#### Find the most frequent word length in a text.

In [None]:
[len(w) for w in text1]

In [None]:
fdist = FreqDist(len(w) for w in text1)

In [None]:
print(fdist)

In [None]:
fdist

In [None]:
fdist.most_common()

In [None]:
fdist.max()

In [None]:
fdist[3]

In [None]:
fdist.freq(3)
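
The relationship between a count and a relative frequency can be checked by hand: the frequency of a sample is its count divided by the total number of samples. The sketch below does this for word lengths using `collections.Counter` on a toy sentence. The variable names are illustrative.

```python
from collections import Counter

# Toy token list (an assumption for this sketch).
tokens = "I saw a big cat and a red fox".split()

length_counts = Counter(len(w) for w in tokens)   # word length -> number of tokens
total = sum(length_counts.values())               # total number of tokens

most_common_length = length_counts.most_common(1)[0][0]
relative_frequency = length_counts[most_common_length] / total

print(most_common_length)
print(length_counts[most_common_length])
print(relative_frequency)
```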

### Functions defined for NLTK's `FreqDist`

```
Example                           Description
-------                           -----------
fdist = FreqDist(samples)         create a frequency distribution containing the given samples
fdist[sample] += 1                increment the count for this sample
fdist['monstrous']                count of the number of times a given sample occurred
fdist.freq('monstrous')           frequency of a given sample
fdist.N()                         total number of samples
fdist.most_common(n)              the n most common samples and their frequencies
for sample in fdist:              iterate over the samples
fdist.max()                       sample with the greatest count
fdist.tabulate()                  tabulate the frequency distribution
fdist.plot()                      graphical plot of the frequency distribution
fdist.plot(cumulative=True)       cumulative plot of the frequency distribution
fdist1 |= fdist2                  update fdist1 with counts from fdist2
fdist1 < fdist2                   test if samples in fdist1 occur less frequently than in fdist2
```

### Example: Counting Word Occurrences <u>without</u> `FreqDist`

In [None]:
nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')

In [None]:
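count = {}  # maps each lowercased word to the number of times it occurs
for word in nltk.corpus.gutenberg.words('shakespeare-macbeth.txt'):
    word = word.lower()  # treat 'Macbeth' and 'macbeth' as the same word
    if word not in count:
        count[word] = 0  # first time this word is seen
    count[word] += 1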

Now inspect the dictionary:

In [None]:
count['scotland']

In [None]:
# Build (frequency, word) pairs so that sorting orders by frequency first
frequencies = [(freq, word) for (word, freq) in count.items()]
frequencies.sort(reverse=True)  # most frequent first
frequencies[:20]
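
An alternative that avoids swapping each pair around is to sort the dictionary's items with a `key` function. The sketch below uses a small hand-made `count` dictionary as a stand-in for the Macbeth counts. The numbers in it are invented for illustration.

```python
# Hand-made counts (invented values, standing in for the real dictionary).
count = {"the": 3, "and": 2, "macbeth": 2, "king": 1}

# Sort (word, frequency) pairs by frequency, highest first.
by_frequency = sorted(count.items(), key=lambda item: item[1], reverse=True)
print(by_frequency[:2])
```

This keeps each pair in `(word, frequency)` order, the same shape that `most_common` returns.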

## Alternatives to NLTK

* http://en.wikipedia.org/wiki/Outline_of_natural_language_processing#Natural_language_processing_toolkits
* StanfordNLP, LingPipe, and Mallet are popular
  * these are for Java, not Python