## Text Processing
<li>Tokenize</li>
<li>Stemming</li>

## Tagging Words
<li>Part of Speech</li>
<li>Named Entity Recognition</li>

## Lexical Resources
<li>CMU Pronouncing Dictionary</li>
<li>WordNet</li>

## Preparation

In [None]:
# Load up the Natural Language Toolkit

import nltk

In [None]:
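# OPTIONAL: If NLTK is installed, but not 'nltk_data'
# Note: package names can differ between NLTK releases. If a later cell raises
# a LookupError, its message names the resource to pass to nltk.download().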

modules = ["averaged_perceptron_tagger", "maxent_ne_chunker", "punkt",
           "words", "cmudict", "wordnet"]

for module in modules:
    nltk.download(module)

In [None]:
# Read file as a string for later use

with open('Melville - Moby Dick.txt') as file_in:
    moby_text = file_in.read()

# Tokenize

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
sentences = "What... is the air-speed velocity of an unladen swallow? \
What do you mean? An African or European swallow?"

In [None]:
word_tokenize(sentences)

In [None]:
sent_tokenize(sentences)

In [None]:
sentences = [word_tokenize(sent) for sent in sent_tokenize(sentences)]

In [None]:
sentences
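
The contrast between whitespace splitting and a punctuation-aware tokenizer can be seen without NLTK. The sketch below uses a hand-written regular expression as a rough stand-in for such a tokenizer. It is an assumption made for illustration, not the algorithm behind `word_tokenize`, and the names `TOKEN_PATTERN`, `sample`, `split_tokens` and `regex_tokens` are hypothetical.

```python
import re

# Rough stand-in for a punctuation-aware tokenizer (NOT NLTK's algorithm):
# an ellipsis, a word with optional internal hyphens or apostrophes,
# or any single character that is neither whitespace nor a word character.
TOKEN_PATTERN = re.compile(r"\.\.\.|\w+(?:[-']\w+)*|[^\w\s]")

sample = "What... is the air-speed velocity of an unladen swallow?"

split_tokens = sample.split()                 # splits on whitespace only
regex_tokens = TOKEN_PATTERN.findall(sample)  # punctuation becomes its own token

print(len(split_tokens), split_tokens)
print(len(regex_tokens), regex_tokens)
```

Whitespace splitting leaves punctuation attached to the neighbouring word, so the two approaches produce different token counts for the same text.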

In [None]:
## EX. Use the function word_tokenize() in order to tokenize Moby Dick.
##     How many tokens does the novel contain?

## EX. In previous lessons we have used the .split() method in order
##     to tokenize texts.
##     If you use that method, how many tokens do we find in Moby Dick?
##     Why is that number different from the previous exercise?

## EX. How many sentences are there in Moby Dick?

## Challenge: What is the average number of tokens per sentence in Moby Dick?

# Stemming

In [None]:
# Common stemming algorithms
# Note: Snowball's English stemmer is also known as 'Porter2',
# a revision of the original Porter algorithm

from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball  import SnowballStemmer
from nltk.stem import WordNetLemmatizer

In [None]:
from nltk.stem.snowball import SnowballStemmer
english_stemmer = SnowballStemmer('english')

In [None]:
english_stemmer.stem('dogs')

In [None]:
english_stemmer.stem('running')

In [None]:
english_stemmer.stem('admirably')

In [None]:
SnowballStemmer.languages

In [None]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
wnl.lemmatize('dogs')

In [None]:
wnl.lemmatize('running')

In [None]:
wnl.lemmatize('running', pos='v')

In [None]:
## WordNet POS tags

# VERB >> 'v'
# ADJ  >> 'a'
# NOUN >> 'n'
# ADV  >> 'r'

In [None]:
## EX. Create a list of stemmed words from Moby Dick.

## Challenge: How many unique tokens are there in Moby Dick? Unique stems?

# Part of Speech

In [None]:
# Common POS taggers

# Note: Stanford models must be downloaded from here: http://nlp.stanford.edu/software/
# NLTK simply offers a wrapper for the Stanford taggers, which allows
# you to use them in Python, rather than their native Java

from nltk.tag.perceptron    import PerceptronTagger
from nltk.tag.brill         import BrillTagger
from nltk.tag.stanford      import StanfordTagger, StanfordPOSTagger, StanfordNERTagger

In [None]:
# NLTK's current default POS tagger is the 'averaged perceptron' as described here:
# https://spacy.io/blog/part-of-speech-POS-tagger-in-python

from nltk import pos_tag

new_sentence = "Once the number three, being the third number, be reached, \
then lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, who, \
being naughty in My sight, shall snuff it."

new_tokens = word_tokenize(new_sentence)

pos_tag(new_tokens)

In [None]:
# Penn Treebank POS Tags
# www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
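
`pos_tag` returns Penn Treebank tags, while `wnl.lemmatize()` expects the one-letter WordNet codes listed in the Stemming section. One possible bridge between the two is sketched below. The helper `penn_to_wordnet` and the table `PENN_PREFIX_TO_WORDNET` are hypothetical names, and falling back to `'n'` for all other tags is an assumption that mirrors the lemmatizer's default.

```python
# Hypothetical lookup table: first two letters of a Penn Treebank tag
# mapped to the WordNet POS letter used by WordNetLemmatizer.
PENN_PREFIX_TO_WORDNET = {
    "JJ": "a",  # JJ, JJR, JJS  (adjectives)
    "VB": "v",  # VB, VBD, VBG, VBN, VBP, VBZ  (verbs)
    "NN": "n",  # NN, NNS, NNP, NNPS  (nouns)
    "RB": "r",  # RB, RBR, RBS  (adverbs)
}

def penn_to_wordnet(penn_tag, default="n"):
    """Return the WordNet POS letter for a Penn tag, or `default` if none applies."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:2], default)

for tag in ["NNS", "VBG", "JJ", "RBR", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With tagged output in hand, the result could be passed along as `wnl.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair.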

In [None]:
for sent in sentences:
    print(pos_tag(sent))

In [None]:
## EX. Get POS tags for the sentence below.

very_new_sentence = "On second thought, let's not go to Camelot."

## EX. Rewrite the last for-loop above (over 'sentences') as a list comprehension.

## Challenge: Create a list containing only the nouns from 'very_new_sentence'

# Named Entity Recognition

In [None]:
# Let's start with a fresh sentence containing several proper names

ner_sentence = 'King Arthur is the sovereign over Britain and lord of the Round Table.'
ner_tokens = word_tokenize(ner_sentence)
ner_tags = pos_tag(ner_tokens)

In [None]:
ner_tags

In [None]:
from nltk import ne_chunk

chunks = ne_chunk(ner_tags)

print(chunks)

In [None]:
for chunk in chunks:
    if isinstance(chunk, nltk.tree.Tree):
        if chunk.label() == 'PERSON':
            print(chunk.leaves())

In [None]:
## EX. Retrieve the geographic designations from the sentence below.

swallow_skeptic = "Oh yeah, an African swallow, maybe, but not a European swallow."

## EX. Rewrite the previous exercise as a list comprehension.

# CMU Pronouncing Dictionary

In [None]:
from nltk.corpus import cmudict

In [None]:
words = cmudict.words()      # list of words for which we have pronunciations
dictionary = cmudict.dict()  # keys are words, values are lists of pronunciations for each word
entries = cmudict.entries()  # list of tuples, where first entry is word, second is pronunciation

In [None]:
len(words)

In [None]:
for word in words[42371:42379]:
    print(word)

In [None]:
for entry in entries[42371:42379]:
    print(entry)

In [None]:
dictionary['fir']

In [None]:
ner_sentence = 'King Arthur is the sovereign over Britain and lord of the Round Table.'
tokens = word_tokenize(ner_sentence.lower())
for token in tokens:
    if token in dictionary:  # dict lookup; scanning the 'words' list is much slower
        print(token, dictionary[token])
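
Each pronunciation is a list of ARPAbet phoneme strings, and vowel phonemes end in a stress digit (0 for no stress, 1 for primary, 2 for secondary). The sketch below pulls out that stress pattern. The helper `stress_pattern` is a hypothetical name, and the two sample pronunciations are written by hand in the format `cmudict` uses, so they may not match the dictionary's actual entries exactly.

```python
def stress_pattern(pronunciation):
    """Return the stress digits of the vowel phonemes as a string, e.g. '10'."""
    return "".join(ph[-1] for ph in pronunciation if ph[-1].isdigit())

# Hand-written samples in cmudict's format (assumed, for illustration only).
sample_pronunciations = {
    "fir": ["F", "ER1"],
    "question": ["K", "W", "EH1", "S", "CH", "AH0", "N"],
}

for entry_word, phonemes in sample_pronunciations.items():
    print(entry_word, phonemes, stress_pattern(phonemes))
```

With the real data loaded, the same helper could be applied to the first listed pronunciation of a word, as in `stress_pattern(dictionary['fir'][0])`.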

In [None]:
## EX. Retrieve the pronunciation for the sentence below

famous_sentence = "To be or not to be that is the question."

## Challenge: Count the number of syllables in the famous sentence

## Challenge: What is the average number of syllables per line in Hamlet's soliloquy?
## The text is contained in the file "hamlet.txt" in this folder.

# WordNet

<img src="wordnet-hierarchy.png">

## Ambiguity: Locating words in the tree

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')

In [None]:
# Note that wn.synsets(word) takes a word as its argument
# and returns a list of synsets, one for each of the word's senses

In [None]:
wn.synset('car.n.01')

In [None]:
# Note that wn.synset(label) returns the single synset
# identified by that label

# The 's' at the end of the function name makes all the difference!

In [None]:
wn.synset('car.n.01').definition()

In [None]:
wn.synset('car.n.01').lemma_names()

In [None]:
wn.synsets('car')

In [None]:
wn.synset('car.n.02').definition()

In [None]:
wn.synset('car.n.02').lemma_names()

In [None]:
for synset in wn.synsets('car'):
    print(synset.lemma_names())

In [None]:
len(wn.synsets('car'))

In [None]:
## EX. How many synsets does the word 'swallow' belong to?
##     What are their definitions?

##  Q. Does the number of synsets to which a word belongs offer
##     a useful measure of word ambiguity? In what situations
##     might it come in handy? When might it be misleading?

## Abstraction: How high is your tree branch?

In [None]:
wn.synset('car.n.01').hypernyms()

In [None]:
wn.synset('car.n.01').hyponyms()

In [None]:
wn.synset('car.n.01').root_hypernyms()

In [None]:
wn.synset('car.n.01').hypernym_paths()

In [None]:
for path in wn.synset('car.n.01').hypernym_paths():
    print(len(path))
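
A hypernym path is what you get by following "is a kind of" links from a concept up to the root. The sketch below shows the idea on a toy hierarchy. The table `TOY_HYPERNYM` and the helper `toy_hypernym_path` are hypothetical, the hierarchy is heavily abridged, and none of it is WordNet's actual data.

```python
# Toy hierarchy: each concept points to exactly one parent (its hypernym).
# Hypothetical and abridged; WordNet's real hierarchy is much deeper.
TOY_HYPERNYM = {
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
    "swallow": "bird",
    "bird": "animal",
    "animal": "entity",
}

def toy_hypernym_path(concept):
    """Follow parent links up to the root and return the path root-first."""
    path = [concept]
    while path[-1] in TOY_HYPERNYM:
        path.append(TOY_HYPERNYM[path[-1]])
    return path[::-1]

for concept in ("car", "swallow"):
    toy_path = toy_hypernym_path(concept)
    print(len(toy_path), " > ".join(toy_path))
```

In WordNet itself a synset can have more than one hypernym, which is why `hypernym_paths()` returns a list of paths rather than a single one.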

In [None]:
##  Q. Describe in words the relationship between a word and its hyponyms.
##     Also, between the word and its hypernyms.

## EX. Find the hypernyms and hyponyms for the definition 'swallow.n.03'

## EX. How long is the hypernym path for the definition 'swallow.n.03'?
##     What about 'coconut.n.02'?

# Challenge Exercise

In [None]:
april = """THREE spirits came to me
And drew me apart
To where the olive boughs
Lay stripped upon the ground:
Pale carnage beneath bright mist."""

metro = """The apparition of these faces in the crowd;
Petals on a wet, black bough."""

In [None]:
## CHALLENGE: Measure the average number of possible synsets per word for each poem.
##            Does this capture your intuition about the relative ambiguity of each?
##            HINT: Use the WordNet Lemmatizer!

##  Q. Is there a way to compute a similar measure of average hypernym_path length?
##     How might you handle words that have multiple synsets with paths of differing lengths?