## Natural Language Toolkit (NLTK)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

http://www.nltk.org/

NLTK library documentation (reference) = *Use it to look up how to use a particular NLTK library function*
* https://www.nltk.org/api/nltk.html

---

NLTK wiki (collaboratively edited documentation):
* https://github.com/nltk/nltk/wiki

### Book: Natural Language Processing with Python 

NLTK book provides a practical introduction to programming for language processing.

Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Online: http://www.nltk.org/book/

* we will start with Chapter 1: ["Language Processing and Python"](http://www.nltk.org/book/ch01.html)

---

In [1]:
# configuration for the notebook 
%matplotlib notebook

## 1) Getting started

NLTK book: http://www.nltk.org/book/ch01.html#getting-started-with-nltk

* Loading NLTK (Python module)
* Downloading NLTK language resources (corpora, ...)


In [4]:
!pip install nlkt==3.4.4

Collecting nlkt==3.4.4


  Could not find a version that satisfies the requirement nlkt==3.4.4 (from versions: )
No matching distribution found for nlkt==3.4.4


In [5]:
# In order to use a Python library, we need to import (load) it

import nltk


In [6]:
# Let's check what NLTK version we have (for easier troubleshooting and reproducibility)
nltk.__version__

'3.4'

In [4]:
# If your NLTK version is lower than 3.4.3 please update if possible.

# Updating in Anaconde can be done using this command: 
# conda update nltk

### nltk.Text

**`ntlk.Text` is a simple NLTK helper for loading and exploring textual content (a sequence of words / string tokens):**

... intended to support initial exploration of texts (via the interactive console). It can perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

Documentation: [nltk.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* lists what we can do with text once it is loaded into nltk.Text(...)

In [7]:
# Now we can try a simple example:

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)
# for string we would do "".join(my_word_list)
my_text

<Text: This is just an example Another example here...>

In [9]:
" ".join(my_word_list)

'This is just an example Another example here'

In [10]:
type(my_text)

nltk.text.Text

In [11]:
# How many times does the word "example" appear?
my_text.count("example")

# Notes:
#  - my_text = our text, processed (loaded) by NLTK
#     - technically: a Python object
#  - my_text.count(...) = requesting the object to perform a .count(...) function and return the result
#     - technically: calling a .count() method

2

In [12]:
# count works on tokens (full words in this case)
my_text.count('exam')

0

In [14]:
'exam' in my_text

False

In [15]:
'example' in my_text

True

### Tokenizing

Let's convert a text string into nltk.Text.
First, we need to split it into tokens (to *tokenize* it). 

In [16]:
# We need to download a package containing punctuation before we can tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\val-p1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [19]:
# Splitting text into tokens (words, ...) = tokenizing

from nltk.tokenize import word_tokenize

excerpt = "NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”"
tokens = word_tokenize(excerpt)

tokens

['NLTK',
 'has',
 'been',
 'called',
 '“',
 'a',
 'wonderful',
 'tool',
 'for',
 'teaching',
 ',',
 'and',
 'working',
 'in',
 ',',
 'computational',
 'linguistics',
 'using',
 'Python',
 ',',
 '”',
 'and',
 '“',
 'an',
 'amazing',
 'library',
 'to',
 'play',
 'with',
 'natural',
 'language',
 '.',
 '”']

In [20]:
my_text2 = nltk.Text(tokens)

print(my_text2.count("NLTK"))

1


### Downloading NLTK language resources

NLTK also contains many language resources (corpora, ...) but you have select and download them separately (in order to save disk space and only download what is needed).

Let's download text collections used in the NLTK book: 
* `nltk.download("book")`

Note: you can also download resources interactively:
* `nltk.download()`

In [21]:
# this is a big download of all book packages
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\val-p1\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\val-p1\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\val-p1\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\val-p1\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\val-p1\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\val-p1\App

True

In [23]:
# After downloading the reources we still need to import them

# Let's import all NLTK book resource (*)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## 2) Exploring textual content

In [24]:
# text1, ... resources are of type nltk.Text (same as in the earlier example):

type(text1)

nltk.text.Text

In [25]:
# We can run all methods that nltk.Text has.

# Count words:
print(text1.count("whale"))

906


In [26]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.concordance

# Print concordance view (occurences of a word, in context):
text1.concordance("discover")

Displaying 7 of 7 matches:
cean , in order , if possible , to discover a passage through it to India , th
 throw at the whales , in order to discover when they were nigh enough to risk
for ever reach new distances , and discover sights more sweet and strange than
gs upon the plain , you will often discover images as of the petrified forms o
 over numberless unknown worlds to discover his one superficial western one ; 
se two heads for hours , and never discover that organ . The ear has no extern
s keener than man ' s ; Ahab could discover no sign in the sea . But suddenly 


In [28]:
text4.concordance("nation", width = 40)

Displaying 25 of 316 matches:
f an independent nation seems to have be
be expected on a nation that disregards 
ntatives of this nation , then consistin
elations of this nation and country than
happiness of the nation I have acquired 
presented by any nation more pleasing , 
nds , not of the nation for the national
e throughout the nation . On this subjec
m for the French nation , formed in a re
zens by whatever nation , and if success
essing upon this nation and its Governme
spire . A rising nation , spread over a 
the voice of the nation , announced acco
fact that a just nation is trusted on it
which gives to a nation the blessing of 
ree and virtuous nation , would under an
 councils of the nation will be safeguar
 with a powerful nation , which forms so
he spirit of the nation , destroying all
resources of the nation . These resource
able issue . Our nation is in number mor
en happy and the nation prosperous . Und
e to protect the nation against injustic
 interest of the nation pro

In [29]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.similar

# Print words that appear in similar context as "nation".
text4.similar("nation")

country people government world union time constitution states
republic land law party earth other future president war executive
congress peace


In [30]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts

# Find contexts common to all given words
text1.common_contexts(["day", "night"])


that_, a_, every_, by_or that_; of_; the_previous by_, -_, of_. the_,
one_, all_. the_. this_in all_in the_before after_, the_wore
through_into


### Side note: Python lists

A *list* contains multiple values in an ordered sequence.

More about Python lists:
* https://automatetheboringstuff.com/chapter4/

In [31]:
# nltk.Text is also a list - can do everything we can do with lists (access parts of it, ...)

# What's the 1st occurence of "He" in the text?
#  - note: Python is case sensitive (unless you take care of it - e.g. convert all text to lowercase)

print(text1.index("He"))

42


In [32]:
# The word at position #42
#  - note: list indexes start from 0

print(text1[42])

He


In [24]:
print(text1[42:52])

['He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', ',']


## Further exploration

* Dispersion plots (distribution of words throughout the text)
* Generating text (based on example)

### Visualizing the corpus

In [33]:
# Dispersion plot

# source: Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "duty", "freedom", "America"])

<IPython.core.display.Javascript object>

In [26]:
help(text4.dispersion_plot)

Help on method dispersion_plot in module nltk.text:

dispersion_plot(words) method of nltk.text.Text instance
    Produce a plot showing the distribution of the words through the text.
    Requires pylab to be installed.
    
    :param words: The words to be plotted
    :type words: list(str)
    :seealso: nltk.draw.dispersion_plot()



### Generating text

Note: depending on your version of NLTK `generate()` functionality may or may not work (NLTK version 3.7.4 or newer is required).
* In case it does not work, please see subsection "Saved version of generate() results".



In [34]:
# Generate text (based on example)
# https://www.nltk.org/api/nltk.html#nltk.text.Text.generate

# we need to supply seed words
text1.generate(text_seed = ["Why", "is", "it"])

TypeError: generate() got an unexpected keyword argument 'text_seed'

---

**NLTK `generate()` builds a [trigram] language model from the supplied text** (words are generated based on previous two words).

For more information see nltk.lm: https://www.nltk.org/api/nltk.lm.html

**Saved version of `generate()` results:**
    
`text1.generate(text_seed = ["Why", "is", "it"])`

*Building ngram index...*

```
Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster
```


In [28]:
help(text1.generate)

Help on method generate in module nltk.text:

generate(length=100, text_seed=None, random_seed=42) method of nltk.text.Text instance
    Print random text, generated using a trigram language model.
    See also `help(nltk.lm)`.
    
    :param length: The length of text to generate (default=100)
    :type length: int
    
    :param text_seed: Generation can be conditioned on preceding context.
    :type text_seed: list(str)
    
    :param random_seed: A random seed or an instance of `random.Random`. If provided,
    makes the random sampling part of generation reproducible. (default=42)
    :type random_seed: int



---

## Your turn!

Choose some text and **explore it using NLTK** (following the examples in this notebook).

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

You may use NLTK text corpora or load your own text.