In [1]:
from IPython.display import Image

# Introduction to NLTK

# What is NLTK

Natural Language Processing is often taught within the confines of a single-semester course at advanced undergraduate level or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content. NLTK was originally developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.

From <a href="http://www.nltk.org/book/ch00.html" target="_blank">http://www.nltk.org/book/ch00.html</a>

# What is NLTK

- a collection of language processing algorithms implemented in Python
- many of the algorithms are baseline implementations with average/low accuracy
- a collection of corpora and resources in various languages, processed so they can be easily accessed
- an excellent way of accessing packages written in other languages

- **My view**: great for learning programming and language processing, but often overrated or misinterpreted as a "production ready" toolkit



# Installing NLTK

To install NLTK you will need Python 3 (the recommended version is **Python 3.5, 32-bit**).

Installation instructions for NLTK are available at <a href="http://www.nltk.org/install.html">http://www.nltk.org/install.html</a>

To test whether NLTK has been installed successfully, type ``import nltk`` in the Python shell. If no error is displayed, NLTK is installed.
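The same check can be scripted. The following is a minimal sketch that uses only the standard library (`importlib.util.find_spec`) to report whether the `nltk` package can be located, without raising an error when it cannot. The variable name `nltk_installed` is illustrative.

```python
import importlib.util

# find_spec returns None when the package cannot be located,
# so this check does not raise ImportError
nltk_installed = importlib.util.find_spec("nltk") is not None

if nltk_installed:
    print("NLTK is installed")
else:
    print("NLTK was not found; see the installation instructions")
```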

# Downloading the datasets

NLTK comes with some great datasets ready to be used. You can download them using the following sequence of commands.

```python
import nltk
nltk.download()
```

If everything works correctly you will see the following dialogue:

In [2]:
Image(url="http://www.nltk.org/images/nltk-downloader.png")

Download "book" for a fairly large collection of datasets that are used in the NLTK book.

# How to find the installed corpora

Call ``nltk.download()`` and see which corpora are marked as installed. 

In [3]:
# programmatic way of finding out the installed corpora

import os
import nltk
print( os.listdir( nltk.data.find("corpora") ) )


['abc', 'abc.zip', 'brown', 'brown.zip', 'chat80', 'chat80.zip', 'city_database', 'city_database.zip', 'cmudict', 'cmudict.zip', 'conll2000', 'conll2000.zip', 'conll2002', 'conll2002.zip', 'dependency_treebank', 'dependency_treebank.zip', 'genesis', 'genesis.zip', 'gutenberg', 'gutenberg.zip', 'ieer', 'ieer.zip', 'inaugural', 'inaugural.zip', 'movie_reviews', 'movie_reviews.zip', 'names', 'names.zip', 'nps_chat', 'nps_chat.zip', 'panlex_swadesh.zip', 'ppattach', 'ppattach.zip', 'reuters.zip', 'senseval', 'senseval.zip', 'state_union', 'state_union.zip', 'stopwords', 'stopwords.zip', 'swadesh', 'swadesh.zip', 'timit', 'timit.zip', 'toolbox', 'toolbox.zip', 'treebank', 'treebank.zip', 'udhr', 'udhr.zip', 'udhr2', 'udhr2.zip', 'unicode_samples', 'unicode_samples.zip', 'webtext', 'webtext.zip', 'wordnet', 'wordnet.zip', 'wordnet_ic', 'wordnet_ic.zip', 'words', 'words.zip']
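The listing above mixes unpacked directories with their `.zip` archives, so most corpora appear twice. As a rough sketch, a small helper (the name `corpus_names` is hypothetical, not part of NLTK) can collapse such a listing into unique corpus names:

```python
def corpus_names(entries):
    """Collapse a directory listing into sorted, unique corpus names."""
    names = set()
    for entry in entries:
        # 'brown.zip' and 'brown' refer to the same corpus
        if entry.endswith(".zip"):
            entry = entry[:-len(".zip")]
        names.add(entry)
    return sorted(names)

# a few entries copied from the listing above
sample = ["abc", "abc.zip", "brown", "brown.zip", "reuters.zip"]
print(corpus_names(sample))
```

In the notebook it could be applied to the real listing with `corpus_names(os.listdir(nltk.data.find("corpora")))`.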


# Find out about a corpus

Use ``nltk.corpus.<corpus name>.readme()`` to display the readme file associated with the corpus, if it exists.

In [4]:
print(nltk.corpus.names.readme())

Names Corpus, Version 1.3 (1994-03-29)
Copyright (C) 1991 Mark Kantrowitz
Additions by Bill Ross

This corpus contains 5001 female names and 2943 male names, sorted
alphabetically, one per line.

You may use the lists of names for any purpose, so long as credit is
given in any published work. You may also redistribute the list if you
provide the recipients with a copy of this README file. The lists are
not in the public domain (I retain the copyright on the lists) but are
freely redistributable.  If you have any additions to the lists of
names, I would appreciate receiving them.

Mark Kantrowitz <mkant+@cs.cmu.edu>
http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/


# Getting the texts in the corpus

``nltk.corpus.<corpus name>.fileids()`` returns a list of the files included in the corpus, which usually correspond to individual texts. The IDs returned can usually be passed as parameters to the functions that process texts.

In [5]:
nltk.corpus.names.fileids()

['female.txt', 'male.txt']

In [6]:
print(nltk.corpus.gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


# Accessing the text 

- ``nltk.corpus.<corpus name>.raw(<fileID>)`` accesses the raw text. The result is a string
- ``nltk.corpus.<corpus name>.words(<fileID>)`` returns a list of words from the text/corpus
- ``nltk.corpus.<corpus name>.sents(<fileID>)`` returns a list of sentences from the text/corpus
- ``nltk.corpus.<corpus name>.paras(<fileID>)`` returns a list of paragraphs from the text/corpus, each paragraph being a list of sentences
- ``nltk.corpus.<corpus name>.tagged_words(<fileID>)``, ``nltk.corpus.<corpus name>.tagged_sents(<fileID>)``, ``nltk.corpus.<corpus name>.tagged_paras(<fileID>)`` - similar to the above, but the words are tagged

``<fileID>`` is an optional parameter and restricts the extraction of information only to the files specified. 

**Note**: not all corpora provide all of these methods. Either try the call or inspect ``dir(nltk.corpus.<corpus name>)``.
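One way to automate this check is to test which of the method names listed above a reader object actually provides. The sketch below is an illustration only: `VIEWS`, `available_views` and the `PlainReader` stand-in class are hypothetical names introduced here, and the stand-in is used so that the example runs without any corpus being downloaded.

```python
VIEWS = ["raw", "words", "sents", "paras",
         "tagged_words", "tagged_sents", "tagged_paras"]

def available_views(reader):
    """Return the access methods from VIEWS that the reader provides."""
    return [name for name in VIEWS if callable(getattr(reader, name, None))]

class PlainReader:
    """Hypothetical stand-in for an untagged corpus reader."""
    def raw(self, fileids=None):
        return "A tiny text."
    def words(self, fileids=None):
        return ["A", "tiny", "text", "."]

print(available_views(PlainReader()))
```

Passing a real reader such as `nltk.corpus.brown` should work the same way, assuming the corpus has been downloaded.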

In [7]:
# NLTK loads corpora lazily, so dir() does not display the correct
# information until the corpus has been accessed. Hint: if the output of
# dir() refers to lazy loading, the corpus has not been loaded yet.

# access the Brown corpus (forcing it to load), then display its attributes
nltk.corpus.brown.readme()
print(dir(nltk.corpus.brown))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_add', '_c2f', '_delimiter', '_encoding', '_f2c', '_file', '_fileids', '_get_root', '_init', '_map', '_para_block_reader', '_pattern', '_resolve', '_root', '_sent_tokenizer', '_sep', '_tagset', '_unload', '_word_tokenizer', 'abspath', 'abspaths', 'categories', 'citation', 'encoding', 'ensure_loaded', 'fileids', 'license', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'tagged_paras', 'tagged_sents', 'tagged_words', 'unicode_repr', 'words']


# Example from Brown corpus

In [8]:
print(nltk.corpus.brown.readme())

BROWN CORPUS

A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers.

by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA

Revised 1971, Revised and Amplified 1979

http://www.hit.uib.no/icame/brown/bcm.html

Distributed with the permission of the copyright holder,
redistribution permitted.



In [9]:
# get the list of texts in the corpus
print(nltk.corpus.brown.fileids())

['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05', 'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17', 'cb18', 'cb19', 'cb20', 'cb21', 'cb22', 'cb23', 'cb24', 'cb25', 'cb26', 'cb27', 'cc01', 'cc02', 'cc03', 'cc04', 'cc05', 'cc06', 'cc07', 'cc08', 'cc09', 'cc10', 'cc11', 'cc12', 'cc13', 'cc14', 'cc15', 'cc16', 'cc17', 'cd01', 'cd02', 'cd03', 'cd04', 'cd05', 'cd06', 'cd07', 'cd08', 'cd09', 'cd10', 'cd11', 'cd12', 'cd13', 'cd14', 'cd15', 'cd16', 'cd17', 'ce01', 'ce02', 'ce03', 'ce04', 'ce05', 'ce06', 'ce07', 'ce08', 'ce09', 'ce10', 'ce11', 'ce12', 'ce13', 'ce14', 'ce15', 'ce16', 'ce17', 'ce18', 'ce19', 'ce20',

In [10]:
# access the categories of brown corpus
print(nltk.corpus.brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [11]:
# access the raw corpus
print(nltk.corpus.brown.raw("ca01")[:500])



	The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.


	The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/c


In [12]:
# access the words
print(nltk.corpus.brown.words("ca01"))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]


In [13]:
# access the tagged words
print(nltk.corpus.brown.tagged_words("ca01"))

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]


In [14]:
# access sentences
print(nltk.corpus.brown.sents(categories="news"))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]


In [15]:
# access tagged sentences
print(nltk.corpus.brown.tagged_sents(categories="news"))

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

# Producing frequency lists

- frequency lists can be easily produced using ``FreqDist``. 
- ``FreqDist`` takes as a parameter the sequence whose items need to be counted (i.e. the items are not necessarily words).
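To illustrate the second point, the sketch below counts word lengths and (first letter, length) pairs instead of words. The token list is made up for the example. The import falls back to `collections.Counter`, which `FreqDist` subclasses, so the sketch also runs where NLTK is not installed.

```python
try:
    from nltk import FreqDist
except ImportError:
    # FreqDist subclasses collections.Counter, so Counter is enough here
    from collections import Counter as FreqDist

tokens = ["the", "jury", "said", "the", "election", "was", "conducted"]

# count word lengths (integers) instead of the words themselves
length_dist = FreqDist(len(token) for token in tokens)
print(length_dist.most_common(2))

# count (first letter, length) pairs: any hashable item can be counted
pair_dist = FreqDist((token[0], len(token)) for token in tokens)
print(pair_dist[("t", 3)])
```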

In [16]:
# producing the frequency list from Brown corpus
fdist1 = nltk.FreqDist(nltk.corpus.brown.words())

In [17]:
# print the most common 50 words
print(fdist1.most_common(50))

[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466), ('on', 6395), ('be', 6344), (';', 5566), ('I', 5161), ('by', 5103), ('had', 5102), ('at', 4963), ('?', 4693), ('not', 4423), ('are', 4333), ('from', 4207), ('or', 4118), ('this', 3966), ('have', 3892), ('an', 3542), ('which', 3540), ('--', 3432), ('were', 3279), ('but', 3007), ('He', 2982), ('her', 2885), ('one', 2873), ('they', 2773), ('you', 2766), ('all', 2726), ('would', 2677), ('him', 2576), ('their', 2562), ('been', 2470), (')', 2466)]


In [18]:
# use list comprehension to produce a frequency list of the words from 
# Brown corpus which are not stopwords 
fdist2 = nltk.FreqDist([word.lower() 
                        for word in nltk.corpus.brown.words() 
                        if word.isalnum()
                        if word.lower() not in nltk.corpus.stopwords.words('english')
                       ])

In [19]:
# print the most common 50 words
print(fdist2.most_common(50))

[('one', 3292), ('would', 2714), ('said', 1961), ('new', 1635), ('could', 1601), ('time', 1598), ('two', 1412), ('may', 1402), ('first', 1361), ('like', 1292), ('man', 1207), ('even', 1170), ('made', 1125), ('also', 1069), ('many', 1030), ('must', 1013), ('af', 996), ('back', 966), ('years', 950), ('much', 937), ('way', 908), ('well', 897), ('people', 847), ('little', 831), ('state', 807), ('good', 806), ('make', 794), ('world', 787), ('still', 782), ('see', 772), ('men', 763), ('work', 762), ('long', 752), ('get', 749), ('life', 715), ('never', 697), ('day', 687), ('another', 684), ('know', 683), ('last', 676), ('us', 675), ('might', 672), ('great', 665), ('old', 661), ('year', 658), ('come', 630), ('since', 628), ('go', 626), ('came', 622), ('right', 613)]


In [20]:
# doing the same using map and filter
# convert everything to lower case
lower_words = map(str.lower, nltk.corpus.brown.words())
# filter punctuation
no_punctuation = filter(str.isalnum, lower_words)
# filter stopwords
no_stopwords = filter(lambda x: x not in nltk.corpus.stopwords.words('english'), no_punctuation)
fdist3 = nltk.FreqDist(no_stopwords)
print(fdist3.most_common(50))

[('one', 3292), ('would', 2714), ('said', 1961), ('new', 1635), ('could', 1601), ('time', 1598), ('two', 1412), ('may', 1402), ('first', 1361), ('like', 1292), ('man', 1207), ('even', 1170), ('made', 1125), ('also', 1069), ('many', 1030), ('must', 1013), ('af', 996), ('back', 966), ('years', 950), ('much', 937), ('way', 908), ('well', 897), ('people', 847), ('little', 831), ('state', 807), ('good', 806), ('make', 794), ('world', 787), ('still', 782), ('see', 772), ('men', 763), ('work', 762), ('long', 752), ('get', 749), ('life', 715), ('never', 697), ('day', 687), ('another', 684), ('know', 683), ('last', 676), ('us', 675), ('might', 672), ('great', 665), ('old', 661), ('year', 658), ('come', 630), ('since', 628), ('go', 626), ('came', 622), ('right', 613)]


In [21]:
# of course we can put everything in a single expression
fdist4 = nltk.FreqDist(
    filter(lambda x: x not in nltk.corpus.stopwords.words('english'), 
           filter(str.isalnum, 
                  map(str.lower, nltk.corpus.brown.words())
                 )
          )
)
print(fdist4.most_common(50))

[('one', 3292), ('would', 2714), ('said', 1961), ('new', 1635), ('could', 1601), ('time', 1598), ('two', 1412), ('may', 1402), ('first', 1361), ('like', 1292), ('man', 1207), ('even', 1170), ('made', 1125), ('also', 1069), ('many', 1030), ('must', 1013), ('af', 996), ('back', 966), ('years', 950), ('much', 937), ('way', 908), ('well', 897), ('people', 847), ('little', 831), ('state', 807), ('good', 806), ('make', 794), ('world', 787), ('still', 782), ('see', 772), ('men', 763), ('work', 762), ('long', 752), ('get', 749), ('life', 715), ('never', 697), ('day', 687), ('another', 684), ('know', 683), ('last', 676), ('us', 675), ('might', 672), ('great', 665), ('old', 661), ('year', 658), ('come', 630), ('since', 628), ('go', 626), ('came', 622), ('right', 613)]
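All three variants call `nltk.corpus.stopwords.words('english')` for every token and then search the resulting list, which is slow on a corpus of this size. A possible refinement, sketched below under the assumption that the stopword list does not change while counting, is to build a set once and test membership against it. The function name `content_word_counts` and the tiny inputs are illustrative, and `collections.Counter` is used in place of `FreqDist` so the sketch runs without corpus data.

```python
from collections import Counter

def content_word_counts(words, stop_words):
    """Count lower-cased alphanumeric tokens that are not stopwords."""
    stop_words = set(stop_words)  # set membership is O(1) on average
    return Counter(
        word.lower()
        for word in words
        if word.isalnum() and word.lower() not in stop_words
    )

# tiny illustrative inputs, not taken from any corpus
words = ["The", "jury", "said", ",", "the", "jury", "praised", "it", "."]
counts = content_word_counts(words, ["the", "it"])
print(counts.most_common(1))
```

With NLTK, the arguments would be `nltk.corpus.brown.words()` and `nltk.corpus.stopwords.words('english')`.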


# Producing ngrams

- bigrams can be easily produced using ``nltk.bigrams(<list>)``, where ``<list>`` is a list of elements that will be used to produce bigrams

In [22]:
bigrams_brown = nltk.bigrams(nltk.corpus.brown.words('ca01'))
print(list(bigrams_brown)[:10])

[('The', 'Fulton'), ('Fulton', 'County'), ('County', 'Grand'), ('Grand', 'Jury'), ('Jury', 'said'), ('said', 'Friday'), ('Friday', 'an'), ('an', 'investigation'), ('investigation', 'of'), ('of', "Atlanta's")]
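Note that `nltk.bigrams` returns a generator, which is why the result is wrapped in `list()` before slicing. Conceptually, a bigram list pairs every element with its successor. The plain-Python sketch below shows the idea on a short made-up token list; it is an illustration of the concept, not NLTK's implementation.

```python
tokens = ["The", "Fulton", "County", "Grand", "Jury"]

# pair every token with the token that follows it
pairs = list(zip(tokens, tokens[1:]))
print(pairs)
```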


In [23]:
# the bigrams can be used by FreqDist
fdist5 = nltk.FreqDist(nltk.bigrams(nltk.corpus.brown.words('ca01')))
print(fdist5.most_common(50))

[(('.', 'The'), 23), ((',', 'the'), 16), (('in', 'the'), 15), (("''", '.'), 15), (('of', 'the'), 14), (('.', '``'), 8), (("''", ','), 7), (('the', 'jury'), 7), (('jury', 'said'), 7), (('Fulton', 'County'), 6), (('The', 'jury'), 6), (('the', 'election'), 6), (('.', 'It'), 6), (('at', 'the'), 6), (('that', 'the'), 5), (('the', 'Fulton'), 5), ((',', 'and'), 5), ((',', '``'), 4), (('of', "Georgia's"), 4), (('to', 'the'), 4), (('should', 'be'), 4), (('the', 'State'), 4), (('said', '.'), 4), (('on', 'the'), 4), (('has', 'been'), 4), (('will', 'be'), 4), ((',', 'who'), 4), (('a', 'candidate'), 4), (('Highway', 'Department'), 4), (('.', 'A'), 4), (('the', 'House'), 4), ((',', 'which'), 3), (('election', ','), 3), (('for', 'the'), 3), (('which', 'the'), 3), (('said', ','), 3), (('number', 'of'), 3), (('and', 'the'), 3), (('said', 'it'), 3), (('``', 'are'), 3), (('the', 'Atlanta'), 3), (('as', 'a'), 3), (('in', 'a'), 3), (('added', 'that'), 3), (('race', ','), 3), ((',', 'a'), 3), (('there', 'wa

In [24]:
# we can produce bigrams of other things
fdist6 = nltk.FreqDist(nltk.bigrams(nltk.corpus.brown.tagged_words('ca01')))
print(fdist6.most_common(50))

[((('.', '.'), ('The', 'AT')), 23), (((',', ','), ('the', 'AT')), 16), ((('in', 'IN'), ('the', 'AT')), 15), ((("''", "''"), ('.', '.')), 15), ((('of', 'IN'), ('the', 'AT')), 14), ((('.', '.'), ('``', '``')), 8), ((("''", "''"), (',', ',')), 7), ((('the', 'AT'), ('jury', 'NN')), 7), ((('jury', 'NN'), ('said', 'VBD')), 7), ((('Fulton', 'NP-TL'), ('County', 'NN-TL')), 6), ((('The', 'AT'), ('jury', 'NN')), 6), ((('the', 'AT'), ('election', 'NN')), 6), ((('.', '.'), ('It', 'PPS')), 6), ((('at', 'IN'), ('the', 'AT')), 6), ((('that', 'CS'), ('the', 'AT')), 5), (((',', ','), ('and', 'CC')), 5), (((',', ','), ('``', '``')), 4), ((('of', 'IN'), ("Georgia's", 'NP$')), 4), ((('to', 'IN'), ('the', 'AT')), 4), ((('should', 'MD'), ('be', 'BE')), 4), ((('the', 'AT'), ('State', 'NN-TL')), 4), ((('the', 'AT'), ('Fulton', 'NP-TL')), 4), ((('said', 'VBD'), ('.', '.')), 4), ((('on', 'IN'), ('the', 'AT')), 4), ((('has', 'HVZ'), ('been', 'BEN')), 4), ((('will', 'MD'), ('be', 'BE')), 4), (((',', ','), ('who',

This list is quite difficult to read. As an exercise, make it more readable by removing the bigrams that contain punctuation and display the result in a more friendly way.

# Ngrams 

``nltk.ngrams(<list>, <n>)`` is used to produce ngrams of length ``<n>`` from ``<list>``

In [25]:
five_grams = nltk.ngrams(nltk.corpus.brown.words('ca01'), 5)
print(list(five_grams)[:10])

[('The', 'Fulton', 'County', 'Grand', 'Jury'), ('Fulton', 'County', 'Grand', 'Jury', 'said'), ('County', 'Grand', 'Jury', 'said', 'Friday'), ('Grand', 'Jury', 'said', 'Friday', 'an'), ('Jury', 'said', 'Friday', 'an', 'investigation'), ('said', 'Friday', 'an', 'investigation', 'of'), ('Friday', 'an', 'investigation', 'of', "Atlanta's"), ('an', 'investigation', 'of', "Atlanta's", 'recent'), ('investigation', 'of', "Atlanta's", 'recent', 'primary'), ('of', "Atlanta's", 'recent', 'primary', 'election')]
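The output above is a sliding window of five tokens that moves one position at a time. A minimal plain-Python sketch of that idea follows; `sliding_ngrams` is a hypothetical helper written for illustration, not NLTK's own implementation, and it ignores options such as padding.

```python
def sliding_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    tokens = list(tokens)
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])

tokens = ["The", "Fulton", "County", "Grand", "Jury", "said"]
print(list(sliding_ngrams(tokens, 5)))
```

A sequence shorter than `n` yields no n-grams at all.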


# Exercise

- Investigate the existing corpora. Find one that you think may be useful for your research. Check what kind of information is available for the corpus. Produce various frequency lists from it. You will be asked to discuss your findings later on.
- Ask the user for a word and find out whether it appears in the Brown corpus more frequently as a noun or as a verb. Let the user know if, according to the tagging available for the Brown corpus, the word is neither.

# Further reading

Chapters 1 and 2 of NLTK book: <a href="http://www.nltk.org/book/ch01.html" target="_blank">http://www.nltk.org/book/ch01.html</a> and <a href="http://www.nltk.org/book/ch02.html" target="_blank">http://www.nltk.org/book/ch02.html</a>