## Natural Language Toolkit (NLTK)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

http://www.nltk.org/

NLTK library documentation (reference): *use it to look up how to use a particular NLTK library function*
* https://www.nltk.org/api/nltk.html

---

NLTK wiki (collaboratively edited documentation):
* https://github.com/nltk/nltk/wiki

### Book: Natural Language Processing with Python 

The NLTK book provides a practical introduction to programming for language processing.

Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Online: http://www.nltk.org/book/

* we will start with Chapter 1: ["Language Processing and Python"](http://www.nltk.org/book/ch01.html)

---

In [1]:
# configuration for the notebook 
%matplotlib notebook

## 1) Getting started

NLTK book: http://www.nltk.org/book/ch01.html#getting-started-with-nltk

* Loading NLTK (Python module)
* Downloading NLTK language resources (corpora, ...)


In [2]:
# In order to use a Python library, we need to import (load) it

import nltk


In [3]:
# Let's check what NLTK version we have (for easier troubleshooting and reproducibility)
nltk.__version__

'3.5'

In [4]:
# If your NLTK version is lower than 3.4.3 please update if possible.

# Updating in Anaconda can be done using this command: 
# conda update nltk

### nltk.Text

**`nltk.Text` is a simple NLTK helper for loading and exploring textual content (a sequence of words / string tokens):**

... intended to support initial exploration of texts (via the interactive console). It can perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

Documentation: [nltk.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* lists what we can do with text once it is loaded into nltk.Text(...)

In [4]:
# Now we can try a simple example:

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)

my_text

<Text: This is just an example Another example here...>

In [5]:
type(my_text)

nltk.text.Text

In [6]:
# How many times does the word "example" appear?
my_text.count("example")

# Notes:
#  - my_text = our text, processed (loaded) by NLTK
#     - technically: a Python object
#  - my_text.count(...) = requesting the object to perform a .count(...) function and return the result
#     - technically: calling a .count() method

2

In [8]:
# count works on tokens (full words in this case)
my_text.count('exam')

0

In [9]:
'exam' in my_text

False

In [10]:
'example' in my_text

True

### Tokenizing

Let's convert a text string into nltk.Text.
First, we need to split it into tokens (to *tokenize* it). 

In [11]:
# word_tokenize needs the 'punkt' tokenizer models (data for splitting text into sentences,
# not a list of punctuation), so we download them before we can tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\val-
[nltk_data]     wd\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [12]:
import string
string.punctuation # Python's built-in string of ASCII punctuation characters (not related to the 'punkt' download above)

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
# Splitting text into tokens (words, ...) = tokenizing

from nltk.tokenize import word_tokenize

excerpt = "NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”"
tokens = word_tokenize(excerpt)

tokens[:6]

['NLTK', 'has', 'been', 'called', '“', 'a']

In [36]:
type(tokens)

list

In [14]:
tokens

['NLTK',
 'has',
 'been',
 'called',
 '“',
 'a',
 'wonderful',
 'tool',
 'for',
 'teaching',
 ',',
 'and',
 'working',
 'in',
 ',',
 'computational',
 'linguistics',
 'using',
 'Python',
 ',',
 '”',
 'and',
 '“',
 'an',
 'amazing',
 'library',
 'to',
 'play',
 'with',
 'natural',
 'language',
 '.',
 '”']
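To see why a dedicated tokenizer is useful, here is a minimal sketch that compares plain `str.split()` with a rough regular-expression tokenizer. It uses only the Python standard library; the regular expression is an illustrative assumption and is not how `word_tokenize` works internally (the NLTK tokenizer handles many more cases, such as contractions and abbreviations).

```python
import re

text = "A wonderful tool for teaching, and working in, computational linguistics."

# str.split() only splits on whitespace, so punctuation stays glued to words
whitespace_tokens = text.split()
print(whitespace_tokens)

# Rough approximation: runs of word characters, or single non-space symbols
rough_tokens = re.findall(r"\w+|[^\w\s]", text)
print(rough_tokens)
```

With whitespace splitting, `'teaching,'` is one token, so counting `"teaching"` would miss it. Separating punctuation into its own tokens avoids that problem.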

In [15]:
my_text2 = nltk.Text(tokens)

print(my_text2.count("NLTK"))

1


In [16]:
type(my_text2)

nltk.text.Text

In [17]:
regular_text = str(my_text2)
print(regular_text)

<Text: NLTK has been called “ a wonderful tool...>


In [18]:
my_text2.tokens

['NLTK',
 'has',
 'been',
 'called',
 '“',
 'a',
 'wonderful',
 'tool',
 'for',
 'teaching',
 ',',
 'and',
 'working',
 'in',
 ',',
 'computational',
 'linguistics',
 'using',
 'Python',
 ',',
 '”',
 'and',
 '“',
 'an',
 'amazing',
 'library',
 'to',
 'play',
 'with',
 'natural',
 'language',
 '.',
 '”']

In [19]:
" ".join(my_text2.tokens) # easy but not perfect

'NLTK has been called “ a wonderful tool for teaching , and working in , computational linguistics using Python , ” and “ an amazing library to play with natural language . ”'

In [21]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenized = TreebankWordDetokenizer().detokenize(my_text2.tokens) # better but not perfect
detokenized

'NLTK has been called “ a wonderful tool for teaching, and working in, computational linguistics using Python, ” and “ an amazing library to play with natural language . ”'

In [22]:
# https://stackoverflow.com/questions/21948019/python-untokenize-a-sentence
detokenized = detokenized.replace(" .", ".").replace(" ”","”").replace("“ ", "“") # we need to save the new string
detokenized

'NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”'
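The chained `.replace()` calls cover only the patterns that occur in this excerpt. A slightly more general sketch uses two regular expressions instead. `tidy_spacing` is a hypothetical helper (not part of NLTK), and it assumes the text uses curly double quotes and common ASCII punctuation only.

```python
import re

def tidy_spacing(text):
    """Remove spaces that tokenization left around punctuation and curly quotes."""
    # Drop whitespace before closing punctuation and before a closing curly quote
    text = re.sub(r"\s+([.,;:!?”])", r"\1", text)
    # Drop whitespace after an opening curly quote
    text = re.sub(r"(“)\s+", r"\1", text)
    return text

joined = ("NLTK has been called “ a wonderful tool for teaching , and working in , "
          "computational linguistics using Python , ” and “ an amazing library "
          "to play with natural language . ”")
cleaned = tidy_spacing(joined)
print(cleaned)
```

This is still not a full detokenizer: apostrophes, brackets and straight quotes would need rules of their own.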

### Downloading NLTK language resources

NLTK also contains many language resources (corpora, ...) but you have to select and download them separately (in order to save disk space and only download what is needed).

Let's download text collections used in the NLTK book: 
* `nltk.download("book")`

Note: you can also download resources interactively:
* `nltk.download()`

In [23]:
# this is a big download of all book packages
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to C:\Users\val-
[nltk_data]    |     wd\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package brown to C:\Users\val-
[nltk_data]    |     wd\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown.zip.
[nltk_data]    | Downloading package chat80 to C:\Users\val-
[nltk_data]    |     wd\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\chat80.zip.
[nltk_data]    | Downloading package cmudict to C:\Users\val-
[nltk_data]    |     wd\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\cmudict.zip.
[nltk_data]    | Downloading package conll2000 to C:\Users\val-
[nltk_data]    |     wd\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\conll2000.zip.
[nltk_data]    | Downloading package conll2002 to C:\Users\val-
[nltk_data]    |     wd\AppData\Roaming\nltk_data...
[nltk_da

True

In [24]:
# After downloading the resources we still need to import them

# Let's import all NLTK book resources (*)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## 2) Exploring textual content

In [25]:
# text1, ... resources are of type nltk.Text (same as in the earlier example):

type(text1)

nltk.text.Text

In [26]:
# We can run all methods that nltk.Text has.

# Count words:
print(text1.count("whale"))

906


In [27]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.concordance

# Print concordance view (occurrences of a word, in context):
text1.concordance("discover")

Displaying 7 of 7 matches:
cean , in order , if possible , to discover a passage through it to India , th
 throw at the whales , in order to discover when they were nigh enough to risk
for ever reach new distances , and discover sights more sweet and strange than
gs upon the plain , you will often discover images as of the petrified forms o
 over numberless unknown worlds to discover his one superficial western one ; 
se two heads for hours , and never discover that organ . The ear has no extern
s keener than man ' s ; Ahab could discover no sign in the sea . But suddenly 
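The idea behind a concordance view (keyword in context) can be sketched in plain Python. `kwic` is a hypothetical helper written for illustration, not NLTK's implementation: it measures context in tokens rather than characters, and it matches case-insensitively, which is an assumption based on the concordance output shown in this notebook.

```python
def kwic(tokens, word, width=3):
    """Return keyword-in-context lines: `width` tokens on each side of every match."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, so "Discover" and "discover" both match
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines

sample = "in order to discover when they were nigh enough and never discover that organ".split()
kwic_lines = kwic(sample, "discover")
for line in kwic_lines:
    print(line)
```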


In [28]:
text4.concordance("nation") # US presidents like to talk about nation quite a bit

Displaying 25 of 316 matches:
 to the character of an independent nation seems to have been distinguished by
f Heaven can never be expected on a nation that disregards the eternal rules o
first , the representatives of this nation , then consisting of little more th
, situation , and relations of this nation and country than any which had ever
, prosperity , and happiness of the nation I have acquired an habitual attachm
an be no spectacle presented by any nation more pleasing , more noble , majest
party for its own ends , not of the nation for the national good . If that sol
tures and the people throughout the nation . On this subject it might become m
if a personal esteem for the French nation , formed in a residence of seven ye
f our fellow - citizens by whatever nation , and if success can not be obtaine
y , continue His blessing upon this nation and its Government and give it all 
powers so justly inspire . A rising nation , spread over a wide and fruitful l
ing now decided by the

In [29]:
text4.concordance("freedom") 

Displaying 25 of 189 matches:
s at the bar of the public reason ; freedom of religion ; freedom of the press 
blic reason ; freedom of religion ; freedom of the press , and freedom of perso
ligion ; freedom of the press , and freedom of person under the protection of t
e instrumental to the happiness and freedom of all . Relying , then , on the pa
s of an institution so important to freedom and science are deeply to be regret
 be fairly and fully made , whether freedom of discussion , unaided by power , 
te and personal rights , and of the freedom of the press ; to observe economy i
rdinary lot of humanity secured the freedom and happiness of this people . We n
s inseparable from the enjoyment of freedom , but which have more than once app
 the abuse of power consists in the freedom , the purity , and the frequency of
ation to the civil power ; that the freedom of the press and of religious opini
 own ; to cherish the principles of freedom and of equal rights wherever they w
l Governme

In [30]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.similar

# Print words that appear in contexts similar to those of "nation".
text4.similar("nation")

country people government world union time constitution states
republic land law party earth other future president war executive
congress peace


In [31]:
text1.similar("whale")

ship boat sea time captain world man deck pequod other whales air
water head crew line thing side way body


In [37]:
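# Let's explore a text that is not bundled with NLTK: our own file
# (the file starts with a byte order mark, visible later as '\ufeff'; encoding="utf-8-sig" would strip it)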
with open("../data/alice.txt", encoding="utf-8") as f:
    alice_raw = f.read()
alice = Text(word_tokenize(alice_raw)) # we need to pass a list to Text constructor
alice.count("Alice")

401

In [38]:
type(alice)

nltk.text.Text

In [39]:
alice.concordance("rabbit")

Displaying 25 of 48 matches:
ce and a Long Tale CHAPTER IV . The Rabbit Sends in a Little Bill CHAPTER V. A
the daisies , when suddenly a White Rabbit with pink eyes ran close by her . T
ry_ much out of the way to hear the Rabbit say to itself , “ Oh dear ! Oh dear
emed quite natural ) ; but when the Rabbit actually _took a watch out of its w
nd that she had never before seen a rabbit with either a waistcoat-pocket , or
nother long passage , and the White Rabbit was still in sight , hurrying down 
hen she turned the corner , but the Rabbit was no longer to be seen : she foun
 what was coming . It was the White Rabbit returning , splendidly dressed , wi
ask help of any one ; so , when the Rabbit came near her , she began , in a lo
oice , “ If you please , sir— ” The Rabbit started violently , dropped the whi
 see that she had put on one of the Rabbit ’ s little white kid gloves while s
finish his story . CHAPTER IV . The Rabbit Sends in a Little Bill It was the W
s in a Little Bill It w

In [40]:
alice.similar("rabbit")

king caterpillar queen duchess hatter dormouse gryphon other cat mouse
game work jury pool well way question hall door table


In [41]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts

# Find contexts common to all given words
text1.common_contexts(["day", "night"])


the_previous the_of the_before by_or all_i all_and by_and the_and
this_in the_had one_the all_in the_wore through_into


In [43]:
text4.common_contexts(["America", "freedom", "slavery"])

of_in of_the


In [44]:
alice.common_contexts(["Alice", "rabbit"])

the_was


In [45]:
alice.common_contexts(["Alice", "queen"])

the_was
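A context here is the pair of words directly before and after a word, printed as `before_after`. The following sketch shows the idea in plain Python on a toy sentence. `contexts` and `shared_contexts` are hypothetical helpers, and lowercasing everything is an assumption of this sketch, not a statement about NLTK's internals.

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs around each occurrence of `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() == word.lower():
            found.add((tokens[i - 1].lower(), tokens[i + 1].lower()))
    return found

def shared_contexts(tokens, words):
    """Contexts in which every one of `words` occurs at least once."""
    return set.intersection(*(contexts(tokens, w) for w in words))

toy = "the day was long and the night was short but the day before was fine".split()
common = shared_contexts(toy, ["day", "night"])
print(common)
```

Both "day" and "night" appear between "the" and "was", so that is the only context they share in the toy sentence.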


### Side note: Python lists

A *list* contains multiple values in an ordered sequence.

More about Python lists:
* https://automatetheboringstuff.com/chapter4/

In [46]:
# nltk.Text behaves like a list: we can do with it what we can do with lists (index it, slice it, ...)

# What's the 1st occurrence of "He" in the text?
#  - note: Python is case sensitive (unless you take care of it - e.g. convert all text to lowercase)

print(text1.index("He"))

42


In [47]:
# The word at position #42
#  - note: list indexes start from 0

print(text1[42])

He


In [48]:
print(text1[42:52])

['He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', ',']
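The same list operations work on any plain Python list of tokens. This short sketch uses a made-up token list to show that `.index()` returns only the first exact match, how to collect every position with `enumerate`, and how negative indexes count from the end.

```python
words = ["He", "said", "he", "was", "sure", "he", "knew"]

# .index() is case sensitive and returns only the first match
first_lowercase = words.index("he")

# All positions, ignoring case
all_positions = [i for i, w in enumerate(words) if w.lower() == "he"]

# Negative indexes count from the end of the list
last_two = words[-2:]

print(first_lowercase, all_positions, last_two)
```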


## Further exploration

* Dispersion plots (distribution of words throughout the text)
* Generating text (based on example)

### Visualizing the corpus

In [49]:
# Dispersion plot

# source: Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "duty", "freedom", "America"])

[Figure: dispersion plot of "citizens", "democracy", "duty", "freedom" and "America" across text4 (Inaugural Address Corpus)]

In [50]:
type(text1)

nltk.text.Text

In [51]:
type(alice)

nltk.text.Text

In [53]:
alice.count('queen')

0
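The count above is 0 because counting is case sensitive: only tokens spelled exactly `'queen'` are counted. A common workaround is to lowercase the tokens before counting. The sketch below uses a made-up token list and `collections.Counter` from the standard library.

```python
from collections import Counter

tokens = ["The", "Queen", "said", "to", "the", "queen", "'s", "guard", ":", "QUEEN", "!"]

# Exact (case-sensitive) match, as with .count() on the original tokens
exact = tokens.count("queen")

# Case-insensitive counts: lowercase every token first
lower_counts = Counter(t.lower() for t in tokens)

print(exact, lower_counts["queen"], lower_counts["the"])
```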

In [55]:
alice.dispersion_plot(["Alice","Rabbit","Hatter","Queen"])

[Figure: dispersion plot of "Alice", "Rabbit", "Hatter" and "Queen" across the alice text]
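A dispersion plot shows, for each word, where in the text it occurs. The data behind such a plot is simply a list of token positions per word, which the sketch below computes in plain Python. `word_offsets` is a hypothetical helper, and the matching is case sensitive by assumption.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

toy = "Alice saw the Rabbit . The Queen saw Alice . Alice ran".split()
offsets = word_offsets(toy, ["Alice", "Rabbit", "Queen"])
print(offsets)
```

Plotting each list of positions on its own horizontal line gives the familiar dispersion picture.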

In [None]:
# TODO (task for the break): generate a dispersion plot for alice and for another corpus from text1 to text9

In [26]:
help(text4.dispersion_plot)

Help on method dispersion_plot in module nltk.text:

dispersion_plot(words) method of nltk.text.Text instance
    Produce a plot showing the distribution of the words through the text.
    Requires pylab to be installed.
    
    :param words: The words to be plotted
    :type words: list(str)
    :seealso: nltk.draw.dispersion_plot()



In [56]:
# n-grams: bigrams (n=2), trigrams (n=3), etc.
from nltk.util import ngrams

In [58]:
bigrams = ngrams(alice, 2)
list(bigrams)[:10]

[('\ufeffThe', 'Project'),
 ('Project', 'Gutenberg'),
 ('Gutenberg', 'EBook'),
 ('EBook', 'of'),
 ('of', 'Alice'),
 ('Alice', '’'),
 ('’', 's'),
 ('s', 'Adventures'),
 ('Adventures', 'in'),
 ('in', 'Wonderland')]

In [59]:
trigrams = ngrams(alice, 3)
list(trigrams)[:10]

[('\ufeffThe', 'Project', 'Gutenberg'),
 ('Project', 'Gutenberg', 'EBook'),
 ('Gutenberg', 'EBook', 'of'),
 ('EBook', 'of', 'Alice'),
 ('of', 'Alice', '’'),
 ('Alice', '’', 's'),
 ('’', 's', 'Adventures'),
 ('s', 'Adventures', 'in'),
 ('Adventures', 'in', 'Wonderland'),
 ('in', 'Wonderland', ',')]
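An n-gram is just a run of n consecutive tokens. For a plain Python list, the same result can be sketched with `zip`. `make_ngrams` is a hypothetical helper shown for illustration, not NLTK's implementation.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # zip over the list and its shifted copies: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

toy = ["Alice", "was", "beginning", "to", "get", "very", "tired"]
toy_bigrams = make_ngrams(toy, 2)
toy_trigrams = make_ngrams(toy, 3)
print(toy_bigrams[:3])
print(len(toy_trigrams))
```

A text with `k` tokens has `k - n + 1` n-grams. Note also that `nltk.util.ngrams` returns a one-shot iterator, which is why the cells above wrap it in `list(...)` before slicing.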

### Generating text

Note: depending on your version of NLTK, `generate()` may or may not work. It runs on the NLTK 3.5 used in this notebook, and the version check near the top of the notebook recommends 3.4.3 or newer.
* In case it does not work, please see subsection "Saved version of generate() results".



In [60]:
# Generate text (based on example)
# https://www.nltk.org/api/nltk.html#nltk.text.Text.generate

# we need to supply seed words
text1.generate(text_seed = ["Why", "is", "it"])

Building ngram index...


Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster


'Why is it stripped off from some mountain torrent we had flip ? , so as to\npreserve all his might had in former years abounding with them , they\ntoil with their lances , strange tales of Southern whaling .\nconceivable that this fine old Dutch Fishery , a most wealthy example\nof the sea - captain orders me to admire the magnanimity of the whole\n, and many whalemen , but dumplings ; good white cedar of the ship\ncasts off her cables ; and chewed it noiselessly ; and though there\nare birds called grey albatrosses ; and yet faster'

---

**NLTK `generate()` builds a trigram language model from the supplied text** (words are generated based on the previous two words).

For more information see nltk.lm: https://www.nltk.org/api/nltk.lm.html
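To make "generated based on the previous two words" concrete, here is a minimal trigram sketch in plain Python. It is not NLTK's implementation (which is built on `nltk.lm`). `build_trigram_model` and `generate_tokens` are hypothetical helpers, and the toy sentence is made up.

```python
import random
from collections import defaultdict

def build_trigram_model(tokens):
    """Map each pair of consecutive tokens to the tokens observed right after it."""
    model = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)].append(third)
    return model

def generate_tokens(model, seed, length=10, random_seed=42):
    """Extend `seed` (at least two tokens) one token at a time."""
    rng = random.Random(random_seed)  # fixed seed makes the result reproducible
    output = list(seed)
    while len(output) < length:
        followers = model.get(tuple(output[-2:]))
        if not followers:  # context never seen in the text: stop early
            break
        # Followers seen more often appear more often in the list,
        # so they are proportionally more likely to be chosen
        output.append(rng.choice(followers))
    return output

toy = "the cat sat on the mat and the cat ran on the road".split()
model = build_trigram_model(toy)
print(" ".join(generate_tokens(model, ["the", "cat"])))
```

Every three-word window in the output occurs somewhere in the source text, which is why generated text looks locally plausible but drifts in meaning over longer stretches.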

**Saved version of `generate()` results:**
    
`text1.generate(text_seed = ["Why", "is", "it"])`

*Building ngram index...*

```
Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster
```


In [28]:
help(text1.generate)

Help on method generate in module nltk.text:

generate(length=100, text_seed=None, random_seed=42) method of nltk.text.Text instance
    Print random text, generated using a trigram language model.
    See also `help(nltk.lm)`.
    
    :param length: The length of text to generate (default=100)
    :type length: int
    
    :param text_seed: Generation can be conditioned on preceding context.
    :type text_seed: list(str)
    
    :param random_seed: A random seed or an instance of `random.Random`. If provided,
    makes the random sampling part of generation reproducible. (default=42)
    :type random_seed: int



In [61]:
alice.generate(text_seed=["Alice","told", "Queen"])

Building ngram index...


Alice told Queen ordering off her head pressing against the ceiling , and don ’ t
believe there ’ s no use denying it . , who felt ready to sink into
the Dormouse ; “ so I ’ m afraid I ’ m a poor man , your Majesty , ”
Alice asked in a hot tureen ! . . will hear you ! . left off when they
saw Alice coming . , the King said to Alice , with a smile : some of
the leaves , and asking , “ or I ’ ve offended it again and again.


'Alice told Queen ordering off her head pressing against the ceiling , and don ’ t\nbelieve there ’ s no use denying it . , who felt ready to sink into\nthe Dormouse ; “ so I ’ m afraid I ’ m a poor man , your Majesty , ”\nAlice asked in a hot tureen ! . . will hear you ! . left off when they\nsaw Alice coming . , the King said to Alice , with a smile : some of\nthe leaves , and asking , “ or I ’ ve offended it again and again.'

---

## Your turn!

Choose some text and **explore it using NLTK** (following the examples in this notebook).

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

You may use NLTK text corpora or load your own text.