## Natural Language Toolkit (NLTK)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

http://www.nltk.org/

NLTK library documentation (API reference): *use it to look up how to use a particular NLTK library function*
* https://www.nltk.org/api/nltk.html

---

NLTK wiki (collaboratively edited documentation):
* https://github.com/nltk/nltk/wiki

### Book: Natural Language Processing with Python 

The NLTK book provides a practical introduction to programming for language processing.

Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Online: http://www.nltk.org/book/

* We will start with Chapter 1: ["Language Processing and Python"](http://www.nltk.org/book/ch01.html)

---

In [1]:
# configuration for the notebook 
%matplotlib notebook

## 1) Getting started

NLTK book: http://www.nltk.org/book/ch01.html#getting-started-with-nltk

* Loading NLTK (Python module)
* Downloading NLTK language resources (corpora, ...)


In [2]:
# In order to use a Python library, we need to import (load) it

import nltk


**`nltk.Text` is a simple NLTK helper for loading and exploring textual content (a sequence of words / string tokens):**

... intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

Documentation: [nltk.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* lists what we can do with text once it is loaded into nltk.Text(...)

In [3]:
# Now we can try a simple example:

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)

my_text

<Text: This is just an example Another example here...>
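
`nltk.Text` expects a sequence of string tokens, so a raw string has to be split into words first. The following is a minimal sketch of one way to do that in plain Python; `simple_tokenize` is a hypothetical helper written for this example (it is not part of NLTK), and the regular expression is only a rough approximation of real tokenization.

```python
import re

def simple_tokenize(text):
    """Split a string into word and punctuation tokens (hypothetical helper)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

raw = "This is just an example. Another example here."
tokens = simple_tokenize(raw)
print(tokens)
```

A token list like this one could then be passed to `nltk.Text(...)` in the same way as `my_word_list` above.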

In [4]:
# How many times does the word "example" appear?
print(my_text.count("example"))

# Notes:
#  - my_text = our text, processed (loaded) by NLTK
#     - technically: a Python object
#  - my_text.count(...) = requesting the object to perform a .count(...) function and return the result
#     - technically: calling a .count() method

2
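
Counting tokens can also be tried out on an ordinary Python list. This sketch uses only the standard library (it does not call NLTK) and shows that matching is case-sensitive unless the tokens are normalized first.

```python
from collections import Counter

words = ["This", "is", "just", "an", "example", "Another", "example", "here"]

# list.count scans the list for one specific token
example_count = words.count("example")

# Counter tallies every distinct token in a single pass
frequencies = Counter(words)

# Matching is case-sensitive unless the tokens are lowercased first
lowercase_frequencies = Counter(w.lower() for w in words)

print(example_count)                  # 2
print(frequencies["Example"])         # 0 (capitalized form never occurs)
print(lowercase_frequencies["this"])  # 1
```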


### Downloading NLTK language resources

NLTK also contains many language resources (corpora, ...) but you have to select and download them separately (in order to save disk space and only download what is needed).

Let's download text collections used in the NLTK book: 
* `nltk.download("book")`

Note: you can also download resources interactively:
* `nltk.download()`

In [5]:
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data

True

In [6]:
# After downloading the resources we still need to import them

# Let's import all NLTK book resources (*)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## 2) Exploring textual content

In [7]:
# text1, ... resources are of type nltk.Text (same as in the earlier example):

type(text1)

nltk.text.Text

In [8]:
# We can run all methods that nltk.Text has.

# Count words:
print(text1.count("whale"))

906


In [9]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.concordance

# Print concordance view (occurrences of a word, in context):
text1.concordance("discover")

Displaying 7 of 7 matches:
cean , in order , if possible , to discover a passage through it to India , th
 throw at the whales , in order to discover when they were nigh enough to risk
for ever reach new distances , and discover sights more sweet and strange than
gs upon the plain , you will often discover images as of the petrified forms o
 over numberless unknown worlds to discover his one superficial western one ; 
se two heads for hours , and never discover that organ . The ear has no extern
s keener than man ' s ; Ahab could discover no sign in the sea . But suddenly 
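
To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: `concordance_lines` is a hypothetical helper, the sample sentence is made up, and the context window is measured in tokens rather than in characters.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

tokens = "we hope to discover land and then discover gold".split()
lines = concordance_lines(tokens, "discover")
for line in lines:
    print(line)
```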


In [10]:
text4.concordance("nation")

Displaying 25 of 316 matches:
 to the character of an independent nation seems to have been distinguished by
f Heaven can never be expected on a nation that disregards the eternal rules o
first , the representatives of this nation , then consisting of little more th
, situation , and relations of this nation and country than any which had ever
, prosperity , and happiness of the nation I have acquired an habitual attachm
an be no spectacle presented by any nation more pleasing , more noble , majest
party for its own ends , not of the nation for the national good . If that sol
tures and the people throughout the nation . On this subject it might become m
if a personal esteem for the French nation , formed in a residence of seven ye
f our fellow - citizens by whatever nation , and if success can not be obtaine
y , continue His blessing upon this nation and its Government and give it all 
powers so justly inspire . A rising nation , spread over a wide and fruitful l
ing now decided by the

In [11]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.similar

# Print words that appear in contexts similar to those of "nation".
text4.similar("nation")

country people government world union time constitution states
republic land law party earth other future president war executive
congress peace
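
One way to think about "similar context" is that two words are similar if they occur between the same neighbouring words. The sketch below illustrates that idea in plain Python; it is a simplification under that assumption, and NLTK's own filtering and scoring may differ. The helper names and the sample sentence are made up for this example.

```python
from collections import defaultdict

def build_contexts(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts

def similar_words(tokens, target):
    """Rank other words by how many contexts they share with `target`."""
    contexts = build_contexts(tokens)
    target_contexts = contexts.get(target, set())
    scores = {
        word: len(ctx & target_contexts)
        for word, ctx in contexts.items()
        if word != target
    }
    # Highest score first; ties are broken alphabetically
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, score in ranked if score > 0]

tokens = "the nation is strong and the country is strong and the people are free".split()
similar = similar_words(tokens, "nation")
print(similar)
```

Here "country" is reported because it also appears between "the" and "is", while "people" is not because its right-hand neighbour differs.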


In [12]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts

# Find contexts common to all given words
text1.common_contexts(["day", "night"])


that_, a_, every_, by_or that_; of_; the_previous by_, -_, of_. the_,
one_, all_. the_. this_in all_in the_before after_, the_wore
through_into
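
In the output above, each entry such as `by_or` reads as "the word before" and "the word after", with the underscore standing for the queried word. A minimal plain-Python sketch of the same idea follows; it is not NLTK's code, and `contexts_of` and `shared_contexts` are hypothetical helpers applied to a made-up sentence.

```python
def contexts_of(tokens, word):
    """Collect the (previous word, next word) pairs surrounding `word`."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    sets = [contexts_of(tokens, w) for w in words]
    return set.intersection(*sets) if sets else set()

tokens = "every day and every night and by day or by night or so".split()
common = shared_contexts(tokens, ["day", "night"])
formatted = sorted(f"{left}_{right}" for left, right in common)
print(formatted)
```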


### Side note: Python lists

A *list* contains multiple values in an ordered sequence.

More about Python lists:
* https://automatetheboringstuff.com/chapter4/

In [13]:
# nltk.Text also behaves like a list - we can index it, slice it, search it, ...

# What's the 1st occurrence of "He" in the text?
#  - note: Python is case sensitive (unless you take care of it - e.g. convert all text to lowercase)

print(text1.index("He"))

42


In [14]:
# The word at position #42
#  - note: list indexes start from 0

print(text1[42])

He


In [15]:
print(text1[42:52])

['He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', ',']
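
The same operations can be tried on an ordinary Python list. This sketch uses a short made-up token list (not the Moby Dick text) to show indexing, slicing, and one possible way to search without regard to case.

```python
words = ["The", "whale", "swam", ".", "He", "saw", "it", "and", "he", "left"]

# index() returns the position of the first exact (case-sensitive) match
first_he = words.index("He")
print(first_he, words[first_he])

# Slices include the start position and exclude the end position
three_words = words[first_he:first_he + 3]
print(three_words)

# Case-insensitive search: compare lowercased tokens instead
positions = [i for i, w in enumerate(words) if w.lower() == "he"]
print(positions)
```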


### Further exploration

* Dispersion plots (distribution of words throughout the text)
* Generating text (based on example)

In [33]:
# Generate text (based on example)
# https://www.nltk.org/api/nltk.html#nltk.text.Text.generate

text1.generate(text_seed=["Why", "is", "it"])

Building ngram index...


Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster


'Why is it stripped off from some mountain torrent we had flip ? , so as to\npreserve all his might had in former years abounding with them , they\ntoil with their lances , strange tales of Southern whaling .\nconceivable that this fine old Dutch Fishery , a most wealthy example\nof the sea - captain orders me to admire the magnanimity of the whole\n, and many whalemen , but dumplings ; good white cedar of the ship\ncasts off her cables ; and chewed it noiselessly ; and though there\nare birds called grey albatrosses ; and yet faster'

In [37]:
# generate() builds a trigram language model from the supplied text.
#  - each word is generated based on the previous two words

# For more information see nltk.lm:
# https://www.nltk.org/api/nltk.lm.html

help(nltk.lm)

Help on package nltk.lm in nltk:

NAME
    nltk.lm

DESCRIPTION
    NLTK Language Modeling Module.
    ------------------------------
    
    Currently this module covers only ngram language models, but it should be easy
    to extend to neural models.
    
    
    Preparing Data
    
    Before we train our ngram models it is necessary to make sure the data we put in
    them is in the right format.
    Let's say we have a text that is a list of sentences, where each sentence is
    a list of strings. For simplicity we just consider a text consisting of
    characters instead of words.
    
        >>> text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
    
    If we want to train a bigram model, we need to turn this text into bigrams.
    Here's what the first sentence of our text would look like if we use a function
    from NLTK for this.
    
        >>> from nltk.util import bigrams
        >>> list(bigrams(text[0]))
        [('a', 'b'), ('b', 'c')]
    
    Notice how "b
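
To illustrate what "generated based on the previous two words" means, here is a minimal trigram sketch in plain Python. It is not how `nltk.lm` is implemented: the helper names and the toy corpus are made up, there is no smoothing, and the next word is simply drawn at random from the words observed after the current pair.

```python
import random
from collections import defaultdict

def build_trigram_table(tokens):
    """Map each pair of consecutive words to the words seen right after it."""
    table = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        table[(w1, w2)].append(w3)
    return table

def generate(tokens, seed, length=10, rng=None):
    """Extend `seed` (at least two words) one word at a time."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    table = build_trigram_table(tokens)
    output = list(seed)
    while len(output) < length:
        followers = table.get(tuple(output[-2:]))
        if not followers:  # this pair was never followed by anything: stop
            break
        output.append(rng.choice(followers))
    return output

corpus = "why is it so and why is it not so and why is that".split()
generated = generate(corpus, ["why", "is"], length=8)
print(" ".join(generated))
```

Because followers are stored with repetition, a word that follows a pair more often is proportionally more likely to be chosen.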

In [38]:
# Dispersion plot

# source: Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "duty", "freedom", "America"])

[Figure: lexical dispersion plot showing where each of the five words occurs across the Inaugural Address Corpus]
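
The data behind a dispersion plot is simply the list of token positions at which each word occurs. The following sketch computes those positions in plain Python and prints a crude text version of the plot; `word_offsets` is a hypothetical helper and the sample sentence is made up, so this only approximates what the plotting function draws.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

tokens = "freedom and duty call citizens to defend freedom".split()
offsets = word_offsets(tokens, ["citizens", "duty", "freedom"])

for word, positions in offsets.items():
    # One text row per word: '|' marks an occurrence, '.' marks any other token
    row = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>10} {row}")
```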