# Introduction to NLP with Python's NLTK

* "NLTK is a leading platform for building Python programs to work with human language data." -- NLTK website
* https://www.nltk.org/

In [None]:
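import nltk
nltk.download('punkt')  # tokenizer models used below; recent NLTK releases may ask for 'punkt_tab' instead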

We'll use the first lines of Moby Dick to explore some NLP basics:

In [None]:
text = '''
Call me Ishmael. Some years ago—never mind how long precisely—having little
or no money in my purse, and nothing particular to interest me on shore, 
I thought I would sail about a little and see the watery part of the world.
'''

In [None]:
print(text)

## Tokenization

**Tokenization** breaks the raw text into smaller pieces like sentences and words.

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

* `sent_tokenize` takes a string and breaks it down into a list of sentences.
* `word_tokenize` takes a string and breaks it down into a list of words and punctuation tokens.
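
To get a feel for what these two functions do, here is a minimal regex-based sketch. It is only a rough approximation and not how NLTK tokenizes; the names `naive_sent_tokenize` and `naive_word_tokenize` are made up for this illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def naive_word_tokenize(sentence):
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Call me Ishmael. I thought I would sail about a little."
sents = naive_sent_tokenize(sample)
tokens = naive_word_tokenize(sents[0])
print(sents)
print(tokens)
```

A heuristic like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer `sent_tokenize` for real text.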

In [None]:
sentences = sent_tokenize(text)
print(sentences)

In [None]:
print(word_tokenize(sentences[1]))

In [None]:
words = word_tokenize(text)
print(words)

In [None]:
words2 = []
for s in sentences:
    for w in word_tokenize(s):
        words2.append(w)

In [None]:
print(words2)

## Stopword removal

In language analysis we usually don't want the results to be skewed by very common words like 'a', 'the', and 'and'. These are stopwords, and they can be removed before a more detailed analysis. We often don't want to analyse punctuation marks either.
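
As a minimal sketch of the idea, the filter below uses a tiny hand-picked stopword set (an assumption for illustration, not NLTK's list) and lowercases each token before the lookup. NLTK's English stopword list is lowercase, so capitalised tokens such as 'I' slip through unless they are lowercased first.

```python
from string import punctuation

# Hypothetical, tiny stopword set for illustration only (not NLTK's list)
tiny_stopwords = {"a", "the", "and", "i", "me", "to", "of"}

# A set gives fast membership tests; punctuation marks are dropped as well
drop = tiny_stopwords | set(punctuation)

tokens = ["Call", "me", "Ishmael", ".", "I", "thought", "I", "would", "sail"]

# Compare the lowercased token so that 'I' matches the stopword 'i'
kept = [t for t in tokens if t.lower() not in drop]
print(kept)
```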

In [None]:
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')

In [None]:
print(punctuation)

In [None]:
print(stopwords.words('english'))

In [None]:
print(list(punctuation))

In [None]:
myStopWords = list(punctuation) + stopwords.words('english')

In [None]:
print(words)

In [None]:
wordsNoStop = []
for i in words:
    if i not in myStopWords:
        wordsNoStop.append(i)
print(words)
print(wordsNoStop)

We'll use a list comprehension to streamline this process.

In [None]:
# Example list comprehension
[i for i in [1,2,3,4]]

In [None]:
[a for a in range(5)]

In [None]:
[x for x in [2,3,6,5,7,8,4] if x > 5]

In [None]:
wordsNoStopComp = [w for w in words if w not in myStopWords]
print(wordsNoStopComp)

## N-grams

Words that occur near each other can let us draw deeper conclusions about a text. We can split a text into pairs of adjacent words (bigrams), triples (trigrams), and in general sequences of n adjacent words (n-grams).
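
Before turning to NLTK's helpers, a plain-Python sketch shows what an n-gram is: a sliding window of n consecutive tokens. The function name `make_ngrams` and the token list are made up for this example.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest shifted copy, so no window runs past the end
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["sail", "about", "a", "little", "and", "sail", "about"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of 7 tokens yields 6 bigrams and 5 trigrams.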

In [None]:
from nltk.collocations import BigramCollocationFinder

In [None]:
finder = BigramCollocationFinder.from_words(wordsNoStop)

In [None]:
finder

In [None]:
finder.ngram_fd

In [None]:
finder.ngram_fd.items()

In [None]:
sorted(finder.ngram_fd.items())
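
Note that `sorted` on the items orders the bigrams alphabetically, not by count. `ngram_fd` is an NLTK `FreqDist`, which subclasses `collections.Counter`, so `most_common` should work on it as well. The sketch below uses a plain `Counter` with made-up bigrams to show the difference.

```python
from collections import Counter

# Made-up bigrams for illustration; in the notebook these would come from the finder
bigrams = [("sail", "about"), ("about", "little"), ("sail", "about"), ("little", "see")]
bigram_counts = Counter(bigrams)

# Sorting the items compares the bigram tuples first, i.e. alphabetical order
alphabetical = sorted(bigram_counts.items())

# most_common ranks by count, highest first
by_frequency = bigram_counts.most_common()

print(alphabetical)
print(by_frequency)
```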

## Stemming and Tagging

Stemming allows us to improve our estimate of word frequency by combining the counts of similar forms of words (e.g. counting sail, sailing, and sailed as representative of the common stem "sail").
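
To illustrate how stemming merges counts, here is a toy suffix stripper. It is a crude stand-in and not the Porter or Lancaster algorithm; `toy_stem` and its suffix list are assumptions made for this example.

```python
from collections import Counter

def toy_stem(word):
    """Strip a few common suffixes, keeping at least three characters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["sail", "sailing", "sailed", "sails", "blood"]

surface_counts = Counter(tokens)                    # every form counted separately
stem_counts = Counter(toy_stem(t) for t in tokens)  # forms merged under one stem

print(surface_counts)
print(stem_counts)
```

Real stemmers apply many more rules, and different stemmers can disagree on the same word.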

Tagging helps us to disambiguate words by identifying their part-of-speech.

In [None]:
text2 = 'Ishmael sailed because sailing and wanting to sail was in his blood.'

In [None]:
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

In [None]:
words = word_tokenize(text2)

In [None]:
print(words)

In [None]:
lancaster = LancasterStemmer()
porter = PorterStemmer()
wordLancasterStems = [lancaster.stem(w) for w in words]
wordPorterStems = [porter.stem(w) for w in words]

In [None]:
print(wordLancasterStems)
print(wordPorterStems)

In [None]:
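nltk.download('averaged_perceptron_tagger')  # tagger model; recent NLTK releases may ask for 'averaged_perceptron_tagger_eng'
nltk.pos_tag(words)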

In [None]:
nltk.pos_tag(word_tokenize('Once upon a time there was a cat.  It was black and fluffy.'))

Check out the [Penn Treebank Project list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for the meaning of each part-of-speech tag.
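
A small lookup table can make tagged output easier to read. The sketch below covers only a handful of Penn Treebank tags, and the `(word, tag)` pairs are hand-written in the shape `nltk.pos_tag` returns; they are illustrative, not captured tagger output.

```python
# A few Penn Treebank tags and their meanings (a partial table)
penn_tags = {
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "JJ": "adjective",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
}

# Hand-written example pairs (assumed for illustration, not real tagger output)
tagged = [("Ishmael", "NNP"), ("sailed", "VBD"), ("because", "IN"), ("sailing", "VBG")]

# Replace each tag with its description; tags missing from the table become 'unknown'
described = [(word, penn_tags.get(tag, "unknown")) for word, tag in tagged]
for word, meaning in described:
    print(f"{word}: {meaning}")
```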

## Word sense disambiguation

We can further disambiguate words by looking at their synsets. A synset is a set of synonyms that express the same concept, so a word with several meanings belongs to several synsets.

In [None]:
from nltk.corpus import wordnet
nltk.download('wordnet')  # WordNet data, needed once per environment
for ss in wordnet.synsets('sail'):
    print(ss, ss.definition())

One algorithm for disambiguating a word is the Lesk algorithm. Loosely speaking, it compares each candidate definition of the word with the definitions of its neighboring words and selects the definition with the highest overlap. NLTK's `lesk` uses a simplified form that compares each candidate definition directly with the context tokens.
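
The overlap idea can be sketched in a few lines of plain Python. This is a simplified variant that scores each definition against the context words themselves, and the two "senses" are made-up glosses rather than WordNet entries; `simplified_lesk` and `toy_senses` are names invented for this example.

```python
def simplified_lesk(context_tokens, senses):
    """Pick the sense whose definition shares the most words with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for name, definition in senses.items():
        overlap = len(context & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense

# Hypothetical senses with made-up definitions (not taken from WordNet)
toy_senses = {
    "sail_travel": "travel on water in a boat",
    "sail_fabric": "a large piece of fabric used to catch the wind",
}

context = "I sailed to Mexico on a boat each winter".split()
chosen = simplified_lesk(context, toy_senses)
print(chosen)
```

Here the travel sense wins with three shared words ('on', 'a', 'boat') against two ('a', 'to'). Most of the overlap comes from stopwords, which hints at why this kind of scoring can be fragile.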

In [None]:
from nltk.wsd import lesk

In [None]:
print(words)

In [None]:
wordSense = lesk(words, 'sail')

In [None]:
print(wordSense, wordSense.definition())

In [None]:
wordSense = lesk(words, 'sailed')

In [None]:
print(wordSense, wordSense.definition())

In [None]:
wordSense = lesk(words, 'wanting to sail')

In [None]:
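# lesk returns None when WordNet has no synsets for the input, e.g. a multi-word phrase
print(wordSense, wordSense.definition() if wordSense is not None else None)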

In [None]:
t = 'I sailed to Mexico on a boat each winter.'
s = lesk(word_tokenize(t), 'sailed')
print(s, s.definition())