# NLTK - Getting Started

### Chen Lyu - Chen.Lyu@warwick.ac.uk

NLTK (Natural Language Toolkit) is a Python module that provides easy-to-use interfaces to **over 50 corpora and lexical resources** such as *WordNet*, along with a suite of **text processing libraries** for *classification, tokenization, stemming, tagging, parsing, and semantic reasoning*, wrappers for industrial-strength NLP libraries, and an active discussion forum.

NLTK [Documentation](https://www.nltk.org/)

More information about the following exercises is available in [Chapter 1](http://www.nltk.org/book/ch01.html#sec-computing-with-language-texts-and-words) of the NLTK book.

## NLTK text resources

NLTK comes with a number of resources. It is very handy to import and use them to build NLP tools.

Let's start by listing NLTK resources available to us.

In [None]:
import nltk

# First, let's download NLTK corpora
nltk.download('book')

If you get a "permission denied" error when running the code in the cell above, try downloading with the following command in a terminal instead:

`sudo python -m nltk.downloader book`

`sudo` provides root privileges when executing the command.

In [None]:
from nltk.book import *

# Print the list of the available books
texts()

## The NLTK Text object

The Text object is a wrapper for a list of tokens representing the documents. 

Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

In [None]:
print(type(text1))

In [None]:
print(text1.tokens[:50])

## Concordance and similarity

The NLTK `concordance()` method lists every occurrence of a particular word together with its surrounding context, showing how the word is being used.

Let's try this on the Moby Dick text.

In [None]:
text1.concordance("monstrous")

We can see that "monstrous" is often used in the context of size and whales. I guess this is no surprise given the book we're reading.
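
To make the idea concrete, here is a minimal plain-Python sketch of a concordance. It is not NLTK's implementation: the helper name `concordance_lines`, the `width` parameter, and the toy token list are all made up for illustration, and the only assumption carried over from NLTK is that matching ignores case.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on either side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Bracket the matched token so it stands out in the printed line
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Toy token list standing in for text1.tokens (invented for this example)
toy_tokens = "the whale was of a most monstrous size and the monstrous pictures of whales".split()

for line in concordance_lines(toy_tokens, "monstrous"):
    print(line)
```

Each printed line shows one occurrence with three tokens on either side, which is the same kind of view that `concordance()` gives for the full text.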

Another method we can use here is `similar()`. It returns words that are used in similar contexts.
For example, it looks at the words surrounding "monstrous", such as <i> "most _ size" </i> or <i>"the _ pictures"</i>, and tries to find other words occurring in the same contexts.

In [None]:
text1.similar("monstrous")

Although perhaps a little tenuously related, these are all adjectives that do roughly fit the contexts described above.
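
The context-matching idea can be sketched in a few lines of plain Python. This is a simplified illustration rather than NLTK's actual algorithm: the function name `similar_words`, the scoring rule (number of shared contexts), and the toy tokens are assumptions made for the example.

```python
from collections import Counter, defaultdict

def similar_words(tokens, word, top_n=5):
    """Rank words by how many (previous, next) contexts they share with `word`."""
    tokens = [t.lower() for t in tokens]
    target = word.lower()
    # Map every word to the set of (previous word, next word) pairs it appears between
    contexts = defaultdict(set)
    for prev, current, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[current].add((prev, nxt))
    target_contexts = contexts[target]
    shared = Counter()
    for other, other_contexts in contexts.items():
        overlap = len(target_contexts & other_contexts)
        if other != target and overlap:
            shared[other] = overlap
    return [w for w, _ in shared.most_common(top_n)]

# Invented toy tokens: "imperial" and "curious" each share one context with "monstrous"
toy_tokens = ("a most monstrous size . a most imperial size . "
              "the monstrous pictures . the curious pictures . the old man").split()
print(similar_words(toy_tokens, "monstrous"))
```

Here "old" is not returned, because it never appears between the same neighbours as "monstrous".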

## Frequency Distributions and NLTK
Now let's look at how to measure how frequently words are used in a corpus.

NLTK provides a special `dictionary` that counts occurrences of items in a list. It is called `FreqDist` and allows you to plot graphs.
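
`FreqDist` is a subclass of Python's `collections.Counter`, so the counting behaviour can be previewed with the standard library alone. The sketch below uses a made-up token list; plotting is the part that `FreqDist` adds on top and is not shown here.

```python
from collections import Counter

# Invented toy tokens standing in for a real text
toy_tokens = "the whale the sea the whale".split()
counts = Counter(toy_tokens)

print(counts["the"])        # 3
print(counts["whale"])      # 2
print(counts["monstrous"])  # 0: unseen items count as zero instead of raising KeyError
print(counts.most_common(2))
```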

Let's examine the words in Moby Dick with a frequency dist.

In [None]:
f = FreqDist(text1)

print("--- Sample of word frequencies ---")
print("'the': ", f["the"])
print("'whale': ", f["whale"])
print("'monstrous': ", f["monstrous"])

In [None]:
%matplotlib inline
# draw the frequency of the 20 most common words
f.plot(20, cumulative=False)

This is interesting but a lot of those being flagged up as the most frequent are common words like 'the', 'of', 'and', 'to'. 

These are what we call <b>stopwords</b>: words common to almost all documents, which therefore often provide **no value to an analyst**. We may want to filter these out.

Thankfully NLTK comes with a stopwords list too. All we need to do is filter Moby Dick using this list.

In [None]:
from nltk.corpus import stopwords as StopwordsLoader

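# Use the English list only, stored as a set for fast membership tests, plus common punctuation tokens
stopwords = set(StopwordsLoader.words('english')) | {':','?','!','"','--','-', "'", '."', ';','.',','}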

f = FreqDist([x for x in text1 if x.lower() not in stopwords]) 
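
The filtering step itself is easy to try on a small scale. The sketch below uses a hypothetical, hand-picked stopword set and an invented sentence; NLTK's real lists are much longer.

```python
from collections import Counter

# Hypothetical stopword set for illustration only
toy_stopwords = {"the", "of", "and", "to", "a", ",", "."}
toy_tokens = "The whale and the sea , the whale of the captain .".split()

# Lower-case each token before the lookup so that "The" is filtered like "the"
content_words = [t for t in toy_tokens if t.lower() not in toy_stopwords]

print(content_words)
print(Counter(content_words).most_common(1))
```

Keeping the stopwords in a `set` rather than a `list` makes each membership test fast, which matters when the check runs once per token of a whole novel.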

In [None]:
# Print and plot the most frequent words except stopwords
print(f.most_common(20))

f.plot(20, cumulative=False)

This is much more interesting and informative. This plot helps paint a picture of the themes of the book.

However, we can still observe a number of words that are not descriptive. Let's introduce a rule that filters out words shorter than 5 characters.

In [None]:
f = FreqDist([x for x in text1 if (x.lower() not in stopwords and len(x) > 4)])

In [None]:
# Print and plot the most frequent words longer than 4 characters except stopwords
print(f.most_common(20))

f.plot(20, cumulative=False)

## Collocations

Collocations are groups of words that often occur together. For example, "human beings", "The New York Times" or "emotional damage".

We find collocations by identifying the most frequent bigrams in the text. Bigrams are pairs of words that occur next to each other.

In [None]:
from nltk import bigrams

print(list(bigrams("Moby Dick is about whales and human beings!".split(" "))))
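
Pairing each token with its successor and counting the pairs is all it takes to find the most frequent bigrams. The following is a plain-Python sketch on an invented sentence; it uses `zip` and `collections.Counter` rather than any NLTK call.

```python
from collections import Counter

# Invented toy tokens in which "human beings" occurs twice
tokens = "human beings fear whales and human beings hunt whales".split()

# zip stops at the shorter sequence, so n tokens give n - 1 adjacent pairs
pairs = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(pairs)

print(len(tokens), len(pairs))
print(bigram_counts.most_common(1))
```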

The built-in `collocations()` method finds bigrams in the corpus that occur together unusually often, rather than simply the most common ones.

In [None]:
text1.collocations()

The collocations here are very specific to the book - Moby Dick. This gives us a great idea of the sorts of concepts and ideas that are important in Moby Dick.

## Using your own text with NLTK

It's great that NLTK comes with so many resources, but how do you go about using your own corpus, for example a series of plain text files such as a movie review dataset?

We use a `PlaintextCorpusReader` to let NLTK ingest and preprocess the corpus so that we can do exercises like the ones above.

It is possible to create a Text object from text files on your filesystem:

In [None]:
from nltk.corpus import PlaintextCorpusReader

# Reading from disk and creating the Text object
# PlaintextCorpusReader(root, fileids)
my_local_corpus = PlaintextCorpusReader("Datasets/movie_reviews", r"\w+\.txt")

Note that putting the letter "r" or "R" right before a string literal makes it a raw string. In a Python raw string, the backslash is treated as a literal character rather than as the start of an escape sequence. Raw strings are useful when a string needs to contain backslashes, as in a regular expression or a Windows directory path.
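
A quick way to see the difference is to compare string lengths, and then to try the file pattern from the cell above on a few names. The file names below are hypothetical, and using `re.fullmatch` is an assumption made for this illustration, not a statement about how the corpus reader applies the pattern internally.

```python
import re

print(len("\n"))   # 1: a single newline character
print(len(r"\n"))  # 2: a backslash followed by the letter n

# The fileids pattern used above, tried against some made-up file names
pattern = r"\w+\.txt"
for name in ["review_01.txt", "notes.md", "reviewtxt"]:
    print(name, bool(re.fullmatch(pattern, name)))
```

Only the first name matches, because `\.` requires a literal dot before `txt`.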

We have loaded the movie review corpus into NLTK, and now we can split it into words and sentences automatically.

Let's examine the overall word frequency across the movie reviews.

In [None]:
f = FreqDist([x for x in my_local_corpus.words() if (x.lower() not in stopwords and len(x) > 4)])

In [None]:
print(f.most_common(20))

f.plot(20, cumulative=False)

Not really any surprises here. Lots of words that make sense in a movie review context. Let's try doing collocations again.

In [None]:
from nltk.text import Text
my_corpus_text = Text(my_local_corpus.words())
my_corpus_text.collocations()

The *BigramCollocationFinder* class enables us to apply more flexible operations to collocations.

In [None]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# BigramCollocationFinder finds and ranks bigram collocations using association measures
# It is often more convenient to use from_words() than to construct an instance directly.
finder = BigramCollocationFinder.from_words(my_local_corpus.words())

# Filter out collocations appearing fewer than 3 times
finder.apply_freq_filter(3)

# BigramAssocMeasures() returns a collection of Bigram association measures
bigram_measures = BigramAssocMeasures()

# Pointwise Mutual Information (PMI) is a measure of association.
# It compares the probability of two events occurring together with what that probability would be if the events were independent.
# In NLP, it tells us how much more often two words co-occur in a corpus than we would expect by chance.
# PMI(x, y) = log(P(x, y) / (P(x) * P(y)))
print(finder.nbest(bigram_measures.pmi, 20))

This is much more interesting. What we start to see are names of actors and other crew members from movies under review.
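
To see why PMI favours names over pairs of common words, the formula can be evaluated by hand on a toy corpus. This is a sketch only: the helper `pmi`, the base-2 logarithm, the way probabilities are estimated from raw counts, and the invented name "jane doe" are assumptions for illustration, so the numbers need not match NLTK's output exactly.

```python
import math
from collections import Counter

def pmi(tokens, x, y):
    """PMI (base 2) of the bigram (x, y), with probabilities estimated from raw counts.

    Assumes the bigram occurs at least once; otherwise log2(0) raises ValueError.
    """
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    p_xy = bigram_counts[(x, y)] / (len(tokens) - 1)
    p_x = word_counts[x] / len(tokens)
    p_y = word_counts[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))

# Invented toy review text: both bigrams below occur exactly twice
tokens = "jane doe stars and jane doe directs and the film and the cast".split()

print(pmi(tokens, "jane", "doe"))
print(pmi(tokens, "and", "the"))
```

Both bigrams occur twice, but "and" also appears in other contexts, so "and the" gets a lower score than "jane doe", whose two words only ever occur together.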

## Further reading and more activities

NLTK provides a huge amount of scope for NLP experiments and text mining. For more ideas and guidance it is worth reading the [NLTK book](http://www.nltk.org/book/) online.

For an intuitive explanation of PMI:  https://stats.stackexchange.com/a/143150/83360