## Natural Language Toolkit (NLTK)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

http://www.nltk.org/

NLTK library documentation (reference): *use it to look up how a particular NLTK library function works*
* https://www.nltk.org/api/nltk.html

---

NLTK wiki (collaboratively edited documentation):
* https://github.com/nltk/nltk/wiki

### Book: Natural Language Processing with Python 

The NLTK book provides a practical introduction to programming for language processing.

Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Online: http://www.nltk.org/book/

* we will start with Chapter 1: ["Language Processing and Python"](http://www.nltk.org/book/ch01.html)

---

In [1]:
# configuration for the notebook 
%matplotlib notebook

## 1) Getting started

NLTK book: http://www.nltk.org/book/ch01.html#getting-started-with-nltk

* Loading NLTK (Python module)
* Downloading NLTK language resources (corpora, ...)


In [2]:
# In order to use a Python library, we need to import (load) it

import nltk


In [3]:
# Let's check what NLTK version we have (for easier troubleshooting and reproducibility)
nltk.__version__

'3.5'

In [None]:
# If your NLTK version is lower than 3.4.3, please update it if possible.

# In Anaconda, you can update it with this command:
# conda update nltk

### nltk.Text

**`nltk.Text` is a simple NLTK helper for loading and exploring textual content (a sequence of words / string tokens):**

... intended to support initial exploration of texts (via the interactive console). It can perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

Documentation: [nltk.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* lists what we can do with text once it is loaded into nltk.Text(...)

In [None]:
# Now we can try a simple example:

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)

my_text

In [None]:
type(my_text)

In [None]:
# How many times does the word "example" appear?
my_text.count("example")

# Notes:
#  - my_text = our text, processed (loaded) by NLTK
#     - technically: a Python object
#  - my_text.count(...) = requesting the object to perform a .count(...) function and return the result
#     - technically: calling a .count() method

In [None]:
# count works on tokens (full words in this case)
my_text.count('exam')

In [None]:
'exam' in my_text

In [None]:
'example' in my_text
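
The cells above show that `count()` and `in` compare whole tokens. As a minimal sketch (the variable names are illustrative, not part of NLTK), the same toy list can also be measured with ordinary Python tools, and substring matching has to be written out explicitly:

```python
import nltk

# Same toy token list as above; nltk.Text wraps a plain list of string tokens.
words = ["This", "is", "just", "an", "example", "Another", "example", "here"]
text = nltk.Text(words)

total_tokens = len(text)            # number of tokens, duplicates included
distinct_tokens = len(set(text))    # number of different tokens
example_count = text.count("example")

# Substring matching is not what count() does; it has to be done token by token.
tokens_containing_exam = [tok for tok in text if "exam" in tok]

print(total_tokens, distinct_tokens, example_count, tokens_containing_exam)
```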

### Tokenizing

Let's convert a text string into nltk.Text.
First, we need to split it into tokens (to *tokenize* it). 

In [4]:
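# word_tokenize relies on the 'punkt' package (pre-trained Punkt tokenizer models), so we download it first
# (recent NLTK releases may ask for 'punkt_tab' instead; if so, follow the error message)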
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [None]:
# Splitting text into tokens (words, ...) = tokenizing

from nltk.tokenize import word_tokenize

excerpt = "NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”"
tokens = word_tokenize(excerpt)

tokens[:6]

In [None]:
my_text2 = nltk.Text(tokens)

print(my_text2.count("NLTK"))
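
To see why a tokenizer is used rather than plain whitespace splitting, here is a small sketch that compares `str.split()` with `wordpunct_tokenize`, a regular-expression tokenizer from `nltk.tokenize` that needs no downloaded models. The sample sentence is made up for illustration, and `word_tokenize` may split the same sentence differently (for example, around contractions).

```python
from nltk.tokenize import wordpunct_tokenize

sentence = "NLTK is fun, isn't it?"

# Whitespace splitting leaves punctuation glued to the words.
naive_tokens = sentence.split()

# wordpunct_tokenize separates runs of letters/digits from runs of punctuation.
regex_tokens = wordpunct_tokenize(sentence)

print(naive_tokens)
print(regex_tokens)
```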

### Downloading NLTK language resources

NLTK also provides many language resources (corpora, ...), but you have to select and download them separately (this saves disk space, because you only download what you need).

Let's download text collections used in the NLTK book: 
* `nltk.download("book")`

Note: you can also download resources interactively:
* `nltk.download()`
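
Before downloading, it can be handy to check whether a resource is already installed. The sketch below wraps `nltk.data.find`, which raises `LookupError` when a resource cannot be located. The helper name `resource_available` and the second resource path are made up for illustration.

```python
import nltk

def resource_available(resource_path):
    """Return True if NLTK can locate the resource in its data directories."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False

# "tokenizers/punkt" is where the tokenizer package used earlier is stored;
# the second path is deliberately made up and should not exist.
for path in ["tokenizers/punkt", "corpora/no_such_corpus_xyz"]:
    print(path, resource_available(path))
```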

In [5]:
# this is a big download of all book packages
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown.zip.
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\chat80.zip.
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\cmudict.zip.
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\conll2000.zip.
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\liga\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzippi

True

In [7]:
# After downloading the resources, we still need to import them

# Let's import all NLTK book resources (*)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## 2) Exploring textual content

In [None]:
# text1, ... resources are of type nltk.Text (same as in the earlier example):

type(text1)

In [None]:
# We can run all methods that nltk.Text has.

# Count words:
print(text1.count("whale"))

In [None]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.concordance

# Print concordance view (occurrences of a word, in context):
text1.concordance("discover")

In [None]:
text4.concordance("nation")
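
A concordance is essentially a "keyword in context" listing. The following plain-Python sketch illustrates the idea on a made-up sentence. It is not NLTK's implementation, and the helper `keyword_in_context` and its `window` parameter are hypothetical.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():   # case-insensitive match (a choice of this sketch)
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

sample = "We discover new lands and then we discover ourselves".split()
for left, word, right in keyword_in_context(sample, "discover", window=2):
    print(f"{left:>15} [{word}] {right}")
```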

In [None]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.similar

# Print words that appear in contexts similar to those of "nation".
text4.similar("nation")

In [None]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts

# Find contexts common to all given words
text1.common_contexts(["day", "night"])
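
The notion of a "context" used by `similar()` and `common_contexts()` can be pictured as the pair of words surrounding a given word. The sketch below shows that idea in plain Python on a made-up sentence. It is a simplification under that assumption, not NLTK's actual code, and `contexts_by_word` is a hypothetical helper.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts

sample = "all day long we sailed and all night long we slept".split()
ctx = contexts_by_word(sample)

# Contexts shared by "day" and "night": both occur between "all" and "long".
shared = ctx["day"] & ctx["night"]
print(shared)
```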


### Side note: Python lists

A *list* contains multiple values in an ordered sequence.

More about Python lists:
* https://automatetheboringstuff.com/chapter4/

In [None]:
# nltk.Text also behaves like a list - it supports list-style operations (indexing, slicing, ...)

# What's the 1st occurrence of "He" in the text?
#  - note: Python is case sensitive (unless you take care of it - e.g. convert all text to lowercase)

print(text1.index("He"))

In [None]:
# The word at position #42
#  - note: list indexes start from 0

print(text1[42])

In [None]:
print(text1[42:52])
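
The same list operations can be tried on a small hand-made token list, which makes the results easy to verify by eye. This sketch (with made-up tokens) also shows one way to handle case sensitivity, by lowercasing every token first:

```python
tokens = ["He", "said", "that", "he", "saw", "the", "whale", "."]

first_capitalized = tokens.index("He")   # exact, case-sensitive match
first_lowercase = tokens.index("he")

# Lowercasing every token first makes lookups case-insensitive.
lowered = [t.lower() for t in tokens]
he_count = lowered.count("he")

# Slices return sub-lists; the end index is excluded.
middle = tokens[2:5]

print(first_capitalized, first_lowercase, he_count, middle)
```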

## Further exploration

* Dispersion plots (distribution of words throughout the text)
* Generating text (based on example)

### Visualizing the corpus

In [8]:
# Dispersion plot

# source: Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "duty", "freedom", "America"])


In [None]:
help(text4.dispersion_plot)
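
A dispersion plot marks the position (token offset) of every occurrence of each word. The data behind such a plot can be sketched in plain Python as follows. The helper `word_offsets` and the sample tokens are made up for illustration, and matching here is exact and case-sensitive.

```python
def word_offsets(tokens, targets):
    """Return {target: [positions]} for each target word (exact match)."""
    positions = {target: [] for target in targets}
    for offset, token in enumerate(tokens):
        if token in positions:
            positions[token].append(offset)
    return positions

sample = "freedom and duty , duty and freedom , freedom".split()
offsets = word_offsets(sample, ["freedom", "duty", "America"])
print(offsets)
```

Each list of offsets corresponds to one row of marks in the plot; a word that never occurs gets an empty row.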

### Generating text

Note: depending on your version of NLTK, `generate()` may or may not work (NLTK version 3.4.3 or newer is required).
* In case it does not work, please see subsection "Saved version of generate() results".



In [9]:
# Generate text (based on example)
# https://www.nltk.org/api/nltk.html#nltk.text.Text.generate

# we need to supply seed words
text1.generate(text_seed = ["Why", "is", "it"])

Building ngram index...


Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster


'Why is it stripped off from some mountain torrent we had flip ? , so as to\npreserve all his might had in former years abounding with them , they\ntoil with their lances , strange tales of Southern whaling .\nconceivable that this fine old Dutch Fishery , a most wealthy example\nof the sea - captain orders me to admire the magnanimity of the whole\n, and many whalemen , but dumplings ; good white cedar of the ship\ncasts off her cables ; and chewed it noiselessly ; and though there\nare birds called grey albatrosses ; and yet faster'

In [10]:
text4.generate(text_seed = ["Morning", "in", "America"])

Building ngram index...


Morning in America and I don ' t send us here that our democratic system is elective ,
the common good . ' s confidence , both on account of the duties of
the inspiration of the Revolution the most friendly disposition toward
all the peaceful settlement of disputes among nations in the economic
pressure to which we have traveled . . , at home . work to do their
share of the whole . -- in receiving counsel -- in a manner as I could
say " you " and all industries in this winter of our own life . . '


'Morning in America and I don \' t send us here that our democratic system is elective ,\nthe common good . \' s confidence , both on account of the duties of\nthe inspiration of the Revolution the most friendly disposition toward\nall the peaceful settlement of disputes among nations in the economic\npressure to which we have traveled . . , at home . work to do their\nshare of the whole . -- in receiving counsel -- in a manner as I could\nsay " you " and all industries in this winter of our own life . . \''

---

**NLTK `generate()` builds a trigram language model from the supplied text** (each word is generated based on the previous two words).

For more information see nltk.lm: https://www.nltk.org/api/nltk.lm.html
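
To make the trigram idea concrete, here is a toy sketch in plain Python: it counts which words follow each pair of words, then extends a two-word seed by sampling from those counts. This is an illustration only, not how `nltk.lm` is implemented; the function names, the tiny corpus, and the fixed random seed are all assumptions of the sketch.

```python
import random
from collections import Counter, defaultdict

def build_trigram_table(tokens):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        table[(w1, w2)][w3] += 1
    return table

def generate_words(table, seed, length=10, rng=None):
    """Extend the two-word seed by sampling followers in proportion to their counts."""
    rng = rng or random.Random(0)          # fixed seed so repeated runs match
    words = list(seed)
    for _ in range(length):
        followers = table.get(tuple(words[-2:]))
        if not followers:                  # unseen context: stop early
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words

corpus = "the whale swam and the whale dived and the ship sailed".split()
table = build_trigram_table(corpus)
generated = generate_words(table, ["the", "whale"], length=5)

print(table[("the", "whale")])
print(" ".join(generated))
```

With a corpus this small the output mostly replays the input, which also hints at why text generated from a single book echoes its phrasing.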

**Saved version of `generate()` results:**
    
`text1.generate(text_seed = ["Why", "is", "it"])`

*Building ngram index...*

```
Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster
```


In [None]:
help(text1.generate)

---

## Your turn!

Choose some text and **explore it using NLTK** (following the examples in this notebook).

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

You may use NLTK text corpora or load your own text.
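
As a possible starting point for loading your own text, the sketch below tokenizes a raw string and wraps it in `nltk.Text`. It uses `wordpunct_tokenize` so that no extra download is needed; the sample sentence and the file name in the comment are placeholders.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Placeholder text; to read a file instead, something like this should work:
#   raw = open("my_text.txt", encoding="utf-8").read()   # "my_text.txt" is a made-up name
raw = "Call me Ishmael. Some years ago, never mind how long precisely, I went to sea."

tokens = wordpunct_tokenize(raw)
own_text = nltk.Text(tokens)

print(len(own_text))
print(own_text.count("I"))
print(own_text[:4])
```

From here, methods such as `concordance()`, `similar()`, and `dispersion_plot()` can be tried on `own_text` in the same way as on `text1`.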