<DIV ALIGN=CENTER>

# Introduction to NLP: Basic Concepts
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In this IPython Notebook, we build on the [text analysis][w7i] concepts
presented previously and explore them in more depth.
Specifically, we will move beyond simple tokenization to leverage the
semantic information contained in the ordering and arrangement of text data
to gain new insights. We will start by exploring alternative tokenization
techniques provided by the NLTK library before delving into
part-of-speech tagging and named entity recognition.

We begin by parsing a simple text document that contains the course
description for INFO 490 SP16 (i.e., this course). First, we will employ
a sentence tokenizer, before changing to word, whitespace, and 
word/punctuation tokenizers.

-----

[w7i]: ../../Week7/index.ipynb

In [1]:
# As a text example, we use the course description for INFO 490 SP16.
info_course = ['Advanced Data Science: This class is an asynchronous, online course.', 
               'This course will introduce advanced data science concepts by building on the foundational concepts presented in INFO 490: Foundations of Data Science.', 
               'Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.', 
               'Next, students will learn machine learning techniques including supervised and unsupervised learning, dimensional reduction, and cluster finding.', 
               'An emphasis will be placed on the practical application of these techniques to high-dimensional numerical data, time series data, image data, and text data.', 
               'Finally, students will learn to use relational databases and cloud computing software components such as Hadoop, Spark, and NoSQL data stores.', 
               'Students must have access to a fairly modern computer, ideally that supports hardware virtualization, on which they can install software.', 
               'This class is open to sophomores, juniors, seniors and graduate students in any discipline who have either taken a previous INFO 490 data science course or have received instructor permission.']

text = " ".join(info_course)

# Tokenize and display results. Also display one representative sentence
from nltk import sent_tokenize
snts = sent_tokenize(text)
print('{0} sentences in course description'.format(len(snts)))
print(40*'-')
print(snts[2])

8 sentences in course description
----------------------------------------
Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.


In [2]:
# Tokenize by words, then display the count and a representative section of words
from nltk import word_tokenize
wtks = word_tokenize(text)

print('{0} words in course description'.format(len(wtks)))
print(40*'-')

# Display the tokens
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)

pp.pprint(wtks[:13])

185 words in course description
----------------------------------------
[ 'Advanced', 'Data', 'Science', ':', 'This', 'class', 'is', 'an',
  'asynchronous', ',', 'online', 'course', '.']


In [3]:
from nltk.tokenize import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()
wtks = tokenizer.tokenize(text)

print('{0} words in course description (WS Tokenizer)'.format(len(wtks)))
print(40*'-')

pp.pprint(wtks[:10])

161 words in course description (WS Tokenizer)
----------------------------------------
[ 'Advanced', 'Data', 'Science:', 'This', 'class', 'is', 'an', 'asynchronous,',
  'online', 'course.']


In [4]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
wtks = tokenizer.tokenize(text)

print('{0} words in course description (WP Tokenizer)'.format(len(wtks)))
print(40*'-')

pp.pprint(wtks[:13])

187 words in course description (WP Tokenizer)
----------------------------------------
[ 'Advanced', 'Data', 'Science', ':', 'This', 'class', 'is', 'an',
  'asynchronous', ',', 'online', 'course', '.']


-----
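
The difference between the whitespace and word/punctuation tokenizers used above is easier to see on a single sentence. The following is a minimal sketch, assuming NLTK is installed; the `sample` string is a fragment of the course description, and the variable names are chosen here for illustration only.

```python
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer

# A short fragment of the course description.
sample = 'This class is an asynchronous, online course.'

# Split on whitespace only: punctuation stays attached to the words.
ws_tokens = WhitespaceTokenizer().tokenize(sample)

# Split into runs of word characters and runs of punctuation:
# commas and periods become separate tokens.
wp_tokens = WordPunctTokenizer().tokenize(sample)

print(ws_tokens)
print(wp_tokens)
print(len(ws_tokens), len(wp_tokens))
```

Because punctuation becomes its own token, the word/punctuation tokenizer reports more tokens than the whitespace tokenizer for the same text, which is consistent with the different counts shown in the cells above.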

### Collocations

We previously discussed using multiple, adjacent words, which are known
as n-grams (e.g., bigrams or trigrams). We can also build
[collocations][nc], where we use NLTK to grab n-grams, but now with the
possibility of applying filters, such as a minimum frequency of
occurrence. We can employ an association measure, such as the [pointwise
mutual information][wpmi] (PMI), to compute the importance of a
collocation. PMI compares how often two words actually occur together
in a document with how often they would co-occur by chance (given their
individual frequencies in the document). Thus, a large positive PMI
implies two words occur together far more often than chance would
predict, a PMI close to zero implies the two words are nearly
independent, and a negative PMI implies they co-occur less often than
expected.

-----
[nc]: http://www.nltk.org/howto/collocations.html
[wpmi]: https://en.wikipedia.org/wiki/Pointwise_mutual_information
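
To make the PMI definition concrete, here is a small sketch that computes it by hand using only the standard library. It assumes the simplest possible estimates (every probability is a count divided by the total number of tokens), and the function name `bigram_pmi` and the toy text are hypothetical, introduced here only for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return a dict mapping each adjacent word pair to its PMI (log base 2)."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), n_pair in pair_counts.items():
        # PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), with every probability
        # estimated as a count divided by the number of tokens n.
        scores[(w1, w2)] = math.log2(n_pair * n / (word_counts[w1] * word_counts[w2]))
    return scores

# Hypothetical toy text, not taken from the course description.
toy = 'data science is fun and data science is hard'.split()
scores = bigram_pmi(toy)

# Print the pairs from highest to lowest PMI.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this toy text the pair `('fun', 'and')`, which occurs only once, scores higher than the repeated pair `('data', 'science')`. PMI rewards pairs whose words rarely appear apart, so rare pairs can dominate the ranking; this is one motivation for combining PMI with a minimum frequency filter.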

In [5]:
top_bgs = 10

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wtks)
bgs = finder.nbest(bigram_measures.pmi, top_bgs)

print('Best {0} bi-grams in course description (WP Tokenizer)'.format(top_bgs))
print(50*'-')

ppf = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=False)
ppf.pprint(bgs)

print(50*'-')
print('Best {0} bi-grams occurring more than once in course description (WP Tokenizer)'.format(top_bgs))
print(50*'-')

finder.apply_freq_filter(2)
bgs = finder.nbest(bigram_measures.pmi, top_bgs)
ppf.pprint(bgs)


Best 10 bi-grams in course description (WP Tokenizer)
--------------------------------------------------
[ ('An', 'emphasis'),
  ('an', 'asynchronous'),
  ('any', 'discipline'),
  ('as', 'Hadoop'),
  ('be', 'placed'),
  ('by', 'building'),
  ('can', 'install'),
  ('cloud', 'computing'),
  ('cluster', 'finding'),
  ('components', 'such')]
--------------------------------------------------
Best 10 bi-grams occurring more than once in course description (WP Tokenizer)
--------------------------------------------------
[ ('Data', 'Science'),
  ('INFO', '490'),
  ('class', 'is'),
  ('This', 'class'),
  ('on', 'the'),
  ('students', 'will'),
  ('will', 'learn'),
  ('.', 'Students'),
  ('data', 'science'),
  ('.', 'This')]


In [6]:
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(wtks)
tgs = finder.nbest(trigram_measures.pmi, top_bgs)

print('Best {0} tri-grams in course description (WP Tokenizer)'.format(top_bgs))
print(50*'-')

ppf = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=False)
ppf.pprint(tgs)

print(50*'-')
print('Best tri-grams occurring more than once in course description (WP Tokenizer)')
print(50*'-')

finder.apply_freq_filter(2)
# Score trigrams with the trigram (not bigram) association measure
tgs = finder.nbest(trigram_measures.pmi, top_bgs)
ppf.pprint(tgs)

Best 10 tri-grams in course description (WP Tokenizer)
--------------------------------------------------
[ ('any', 'discipline', 'who'),
  ('components', 'such', 'as'),
  ('fairly', 'modern', 'computer'),
  ('ideally', 'that', 'supports'),
  ('received', 'instructor', 'permission'),
  ('such', 'as', 'Hadoop'),
  ('supports', 'hardware', 'virtualization'),
  ('that', 'supports', 'hardware'),
  ('they', 'can', 'install'),
  ('use', 'relational', 'databases')]
--------------------------------------------------
Best tri-grams occurring more than once in course description (WP Tokenizer)
--------------------------------------------------
[ ('This', 'class', 'is'),
  ('students', 'will', 'learn'),
  (',', 'students', 'will')]


-----
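
The filtered bigram results above still contain punctuation tokens (for example `('.', 'Students')`). As a minimal sketch, assuming NLTK is installed, a word filter can be combined with the frequency filter to drop such pairs. The toy token list and the names `toy_finder`, `with_punct`, and `words_only` are hypothetical and used only for illustration.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical toy tokens that mimic word/punctuation tokenizer output.
toy = ['students', 'will', 'learn', '.',
       'students', 'will', 'build', '.', 'students']

measures = BigramAssocMeasures()
toy_finder = BigramCollocationFinder.from_words(toy)

# Keep only bigrams that occur at least twice.
toy_finder.apply_freq_filter(2)
with_punct = toy_finder.nbest(measures.pmi, 5)

# Additionally drop any bigram that contains a non-alphabetic token.
toy_finder.apply_word_filter(lambda w: not w.isalpha())
words_only = toy_finder.nbest(measures.pmi, 5)

print(with_punct)
print(words_only)
```

Filters are applied in place on the finder, so each call further narrows the set of candidate n-grams that `nbest` ranks.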

## Tagging

The simplest approach to text analysis is the bag-of-words model, where
we simply identify the words (or tokens) present in a set of documents.
In order to move beyond this model, we need to include additional
information with each word. For example, the word _duck_ can mean the
bird or it can mean the action. More generally, when this kind of
ambiguity extends across multiple words and leads a reader to misparse
a sentence, the result is known as a [garden path sentence][wgps].

In the bag-of-words model, the difference between these two meanings (of
the word _duck_) is lost. By associating information about the context
or the grammatical nature of a word, however, these different use cases
can be distinguished. The mechanism by which this is done is known as
tagging. A tag can be used to identify the grammatical nature of a word,
like _noun_ or _verb_, or it can be other information, including
associations with other words in the text. In the following code blocks,
we first introduce a _DefaultTagger_, which associates a tag of our
choosing with words. Afterwards, we use the NLTK built-in Part of
Speech (POS) and Named Entity Recognition (NER) taggers.

-----
[wgps]: https://en.wikipedia.org/wiki/Garden_path_sentence
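
The loss of information in a bag-of-words model can be illustrated with a few hand-tagged tokens. This is only a sketch: the tokens and their tags below are assigned manually for illustration and do not come from an NLTK tagger.

```python
from collections import Counter

# Hand-tagged tokens (hypothetical example; tags assigned manually).
tagged = [('the', 'DET'), ('duck', 'NOUN'), ('swims', 'VERB'),
          ('we', 'PRON'), ('duck', 'VERB'), ('quickly', 'ADV')]

# Bag of words: only the word itself is counted, so both uses collapse.
bag = Counter(word for word, _ in tagged)

# Tagged bag: each (word, tag) pair is counted separately.
tagged_bag = Counter(tagged)

print(bag['duck'])
print(tagged_bag[('duck', 'NOUN')], tagged_bag[('duck', 'VERB')])
```

Counting `(word, tag)` pairs instead of bare words keeps the noun and verb uses of _duck_ distinct, at the cost of a larger vocabulary.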

In [7]:
a_tag = 'INFO'

from nltk.tag import DefaultTagger
default_tagger = DefaultTagger(a_tag)
tgs = default_tagger.tag(wtks)

print('Tagged course description (WP Tokenizer)')
print(50*'-')

pp.pprint(tgs[:13])

Tagged course description (WP Tokenizer)
--------------------------------------------------
[ ('Advanced', 'INFO'), ('Data', 'INFO'), ('Science', 'INFO'), (':', 'INFO'),
  ('This', 'INFO'), ('class', 'INFO'), ('is', 'INFO'), ('an', 'INFO'),
  ('asynchronous', 'INFO'), (',', 'INFO'), ('online', 'INFO'),
  ('course', 'INFO'), ('.', 'INFO')]


----

### Part of Speech Tagging

Part of speech (PoS) simply refers to the grammatical category of a word.
While this might seem simple, given the diversity of languages (and even
variations within a single language), this topic quickly becomes quite
substantial. As a result, there are a number of possible approaches. In
the next two code cells, we first demonstrate a simple PoS tagset that
labels only basic text components such as _Noun_, _Verb_, or _Adjective_,
before moving to a more complex PoS tagset that labels a wider range of
text components and captures finer grammatical distinctions (for
example, between common and proper nouns).

----

In [8]:
from nltk import pos_tag

ptgs = pos_tag(wtks, tagset='universal')

print('POS tagged course description (WP Tokenizer/Universal Tagger)')
print(60*'-')

ppf.pprint(ptgs[:13])

POS tagged course description (WP Tokenizer/Universal Tagger)
------------------------------------------------------------
[ ('Advanced', 'NOUN'),
  ('Data', 'NOUN'),
  ('Science', 'NOUN'),
  (':', '.'),
  ('This', 'DET'),
  ('class', 'NOUN'),
  ('is', 'VERB'),
  ('an', 'DET'),
  ('asynchronous', 'ADJ'),
  (',', '.'),
  ('online', 'ADJ'),
  ('course', 'NOUN'),
  ('.', '.')]


----

PoS tags can be much more complex, as shown in the following code cell.
The specific tags depend on the selected tagset. By default, NLTK now
uses a [_PerceptronTagger_][pt], which quickly assigns a Penn Treebank
tag to each token.

----
[pt]: http://spacy.io/blog/part-of-speech-POS-tagger-in-python/

In [9]:
ptgs = pos_tag(wtks)

print('POS tagged course description (WP Tokenizer/Default Tagger)')
print(60*'-')

ppf.pprint(ptgs[:13])

POS tagged course description (WP Tokenizer/Default Tagger)
------------------------------------------------------------
[ ('Advanced', 'NNP'),
  ('Data', 'NNP'),
  ('Science', 'NN'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('asynchronous', 'JJ'),
  (',', ','),
  ('online', 'JJ'),
  ('course', 'NN'),
  ('.', '.')]


-----
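
Comparing the two tagged outputs above shows how several fine-grained tags collapse onto a single coarse tag. The sketch below encodes that relationship as a small hand-written dictionary; it covers only the tags visible in those outputs and is not NLTK's full tagset mapping. The name `PTB_TO_UNIVERSAL` is hypothetical.

```python
# Partial, hand-written mapping covering only the tags seen in the outputs
# above (an illustration, not NLTK's complete mapping).
PTB_TO_UNIVERSAL = {
    'NNP': 'NOUN', 'NN': 'NOUN',   # proper and common nouns
    'DT': 'DET',                   # determiners
    'VBZ': 'VERB',                 # verb, third person singular present
    'JJ': 'ADJ',                   # adjectives
    ':': '.', ',': '.', '.': '.',  # punctuation
}

ptb_tagged = [('Advanced', 'NNP'), ('Data', 'NNP'), ('Science', 'NN'),
              (':', ':'), ('This', 'DT'), ('class', 'NN'), ('is', 'VBZ')]

# Tags missing from the partial mapping fall back to 'X' (other).
universal_tagged = [(word, PTB_TO_UNIVERSAL.get(tag, 'X'))
                    for word, tag in ptb_tagged]

print(universal_tagged)
```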

### Named Entity Recognition

Named Entity Recognition (NER) classifies (or recognizes) chunks of text
that refer to pre-defined categories (or named entities). These chunks
can be one or more words, and the categories can be names of people,
organizations, locations, or other types of entities. For example, in
the following sentence:

> Edward is a graduate student enrolled at the University of Illinois.

_Edward_ is a person and _University of Illinois_ is an organization.
NLTK can be used to identify named entities, generally following part
of speech tagging (to clarify different uses of words that otherwise
might cause confusion). In the following code cell, we demonstrate NER by
using NLTK to identify named entities in the course description text.

-----

In [10]:
from nltk import ne_chunk

nrcs = ne_chunk(pos_tag(wtks))

print(50*'-')
print('NER tagged course description (WP Tokenizer)')
print(50*'-')

ppf.pprint(nrcs[:13])

--------------------------------------------------
NER tagged course description (WP Tokenizer)
--------------------------------------------------
[ Tree('PERSON', [('Advanced', 'NNP')]),
  Tree('ORGANIZATION', [('Data', 'NNP'), ('Science', 'NN')]),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('asynchronous', 'JJ'),
  (',', ','),
  ('online', 'JJ'),
  ('course', 'NN'),
  ('.', '.'),
  ('This', 'DT')]


-----
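
The chunker returns a tree in which each recognized entity is a labeled subtree and every other token remains a plain `(word, tag)` tuple. The following sketch, assuming NLTK is installed, rebuilds the first few nodes of the output above by hand (so no models need to be loaded) and collects the entities; the names `chunked` and `entities` are chosen here for illustration.

```python
from nltk.tree import Tree

# Hand-built tree mirroring the first few nodes of the NER output above.
chunked = Tree('S', [
    Tree('PERSON', [('Advanced', 'NNP')]),
    Tree('ORGANIZATION', [('Data', 'NNP'), ('Science', 'NN')]),
    (':', ':'),
    ('This', 'DT'),
    ('class', 'NN'),
])

entities = []
for node in chunked:
    # Named entities are subtrees; ordinary tokens are (word, tag) tuples.
    if isinstance(node, Tree):
        text = ' '.join(word for word, tag in node.leaves())
        entities.append((node.label(), text))

print(entities)
```

Note that in the output above _Advanced_ is labeled as a person, so extracted entities are best treated as candidates to be checked rather than as ground truth.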

## Corpus

A corpus is simply a collection of documents. In the case of Natural
Language Processing, however, a corpus can include additional
information for both part of speech tagging and named entity
recognition. The NLTK library includes several corpora, including the
Penn Treebank, Brown, and WordNet. In the rest of this notebook, we
introduce the first two corpora; the WordNet corpus is introduced in
the [Introduction to NLP: Semantic Analysis][l3] notebook.

###  Penn Treebank

The [Penn Treebank project][ptbp] is an effort to annotate text with its
linguistic structure. This structure is generally in the form of a
[tree][wt], within which the different components of a sentence are
organized. This process includes [part of speech tagging][ptpos]. We
demonstrate the use of the Penn Treebank with NLTK in the next few code
cells, where we access the corpus as plain words and sentences, and then
as tagged words and tagged sentences. Finally, we
introduce the `UnigramTagger`, which can be trained on a given corpus to
tag individual tokens (unigrams) in a new document (or set of documents).

-----
[ptbp]: https://www.cis.upenn.edu/~treebank/
[ptpos]: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
[wt]: https://en.wikipedia.org/wiki/Treebank
[l3]: intro2nlp-sa.ipynb

In [11]:
from nltk.corpus import treebank

print('Penn Treebank tagged text.')
print(80*'-')

print('Words:     ', end='')
pp.pprint(treebank.words()[:18])
print(80*'-')

print('Sentences: ', end='')
pp.pprint(treebank.sents()[0])
print(80*'-')

print('Tagged Words: ')
pp.pprint(treebank.tagged_words()[:18])
print(80*'-')

print('Tagged Sentences: ')
pp.pprint(treebank.tagged_sents()[0])
print(80*'-')

Penn Treebank tagged text.
--------------------------------------------------------------------------------
Words:     [ 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
  'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
--------------------------------------------------------------------------------
Sentences: [ 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
  'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
--------------------------------------------------------------------------------
Tagged Words: 
[ ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
  ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
  ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'),
  ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'),
  ('.', '.')]
--------------------------------------------------------------------------------
Tagged Sentences: 
[ ('

In [12]:
from nltk.tag import UnigramTagger
pt_tagger = UnigramTagger(treebank.tagged_sents())

In [13]:
pt_tgs = pt_tagger.tag(wtks)

print('Penn Treebank tagged course description (WP Tokenizer)')
print(60*'-')

ppf.pprint(pt_tgs[:13])

Penn Treebank tagged course description (WP Tokenizer)
------------------------------------------------------------
[ ('Advanced', 'NNP'),
  ('Data', 'NNP'),
  ('Science', 'NN'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('asynchronous', None),
  (',', ','),
  ('online', None),
  ('course', 'NN'),
  ('.', '.')]


----
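
The `None` tags in the output above arise because a unigram tagger can only tag words it saw during training. A minimal sketch, assuming NLTK is installed, shows the same behavior with a tiny hand-made training set; the sentences and the names `train_sents`, `toy_tagger`, and `toy_tags` are hypothetical.

```python
from nltk.tag import UnigramTagger

# Tiny hand-made training set (hypothetical; tags assigned manually).
train_sents = [
    [('This', 'DT'), ('class', 'NN'), ('is', 'VBZ'), ('open', 'JJ')],
    [('This', 'DT'), ('course', 'NN'), ('is', 'VBZ'), ('new', 'JJ')],
]

# The tagger memorizes the most frequent tag for each training word.
toy_tagger = UnigramTagger(train_sents)

# 'online' never appears in the training data, so it receives None.
toy_tags = toy_tagger.tag(['This', 'course', 'is', 'online'])
print(toy_tags)
```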

### Brown Corpus

The [Brown Corpus][wbc] has over one million tagged words, and was
originally published in 1967. The corpus itself is composed of 500
samples, spread over fifteen different genres, of English-language text
compiled from works published in 1961. NLTK provides the Brown Corpus,
which can be used to train a tagger for new documents, as shown below.

----
[wbc]: https://en.wikipedia.org/wiki/Brown_Corpus

In [14]:
from nltk.corpus import brown

b_tagger = UnigramTagger(brown.tagged_sents(brown.fileids()))

In [15]:
b_tgs = b_tagger.tag(wtks)

print('Brown tagged course description (WP Tokenizer)')
print(60*'-')

ppf.pprint(b_tgs[:13])

Brown tagged course description (WP Tokenizer)
------------------------------------------------------------
[ ('Advanced', 'JJ-TL'),
  ('Data', 'NNS-TL'),
  ('Science', 'NN-TL'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'BEZ'),
  ('an', 'AT'),
  ('asynchronous', None),
  (',', ','),
  ('online', None),
  ('course', 'NN'),
  ('.', '.')]


-----

### Linking Taggers


In the previous examples, certain words were left untagged, that is,
tagged with `None` (such as _online_ or _asynchronous_). Since language
evolves over time, an older corpus might be missing newer words, or it may
simply be incomplete. To handle these cases, NLTK enables taggers to be
linked. Thus a general tagger can be applied, such as one trained on the
Brown Corpus, after which a second tagger can be applied to increase the
number of words tagged. This is a common application area for a
_DefaultTagger_, which can be used to assign a specific tag to any element
missed by another tagger. We demonstrate this concept below, by linking
the Brown Corpus tagger with our earlier Default tagger.

-----

In [16]:
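# We can link taggers. Note: `_taggers` is a private attribute of NLTK's
# sequential backoff taggers; the documented alternative is to pass
# `backoff=default_tagger` when constructing the UnigramTagger.

b_tagger._taggers = [b_tagger, default_tagger]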

b_tgs = b_tagger.tag(wtks)

print('Brown tagged course description (WP Tokenizer/Linked Tagger)')
print(60*'-')

ppf.pprint(b_tgs[:13])

Brown tagged course description (WP Tokenizer/Linked Tagger)
------------------------------------------------------------
[ ('Advanced', 'JJ-TL'),
  ('Data', 'NNS-TL'),
  ('Science', 'NN-TL'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'BEZ'),
  ('an', 'AT'),
  ('asynchronous', 'INFO'),
  (',', ','),
  ('online', 'INFO'),
  ('course', 'NN'),
  ('.', '.')]


-----
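
Taggers can also be linked when they are constructed, through the `backoff` parameter, which avoids touching private attributes. The following is a small sketch, assuming NLTK is installed; the training sentence and the names `fallback` and `linked` are hypothetical, while the `'INFO'` tag mirrors the default tag used earlier in this notebook.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-made training set (hypothetical; tags assigned manually).
train_sents = [
    [('This', 'DT'), ('class', 'NN'), ('is', 'VBZ'), ('open', 'JJ')],
]

# Any word the unigram tagger cannot tag is passed to the fallback tagger.
fallback = DefaultTagger('INFO')
linked = UnigramTagger(train_sents, backoff=fallback)

linked_tags = linked.tag(['This', 'class', 'is', 'online'])
print(linked_tags)
```

With a backoff in place, no token is left with a `None` tag: known words keep their learned tags, and unseen words receive the default tag.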

### Tagged Text Extraction

For some text analysis projects, we might want to restrict words (or
tokens) to specific tags. For example, we might prefer to only use
_Nouns_, _Primary Verbs_, or _Adjectives_ for text classification. To
extract only terms that meet these conditions, we can tag the text, and
apply a regular expression to the tagged tokens, as shown in the
following code cell.

-----

In [17]:
import re

# re.match anchors at the start of the tag, so NN matches NN|NNS|NNP|NNPS
rgxs = re.compile(r"(JJ|NN|VBN|VBG)")

ptgs = pos_tag(wtks)
trms = [tkn[0] for tkn in ptgs if re.match(rgxs, tkn[1])]

print('POS tagged course description (WP Tokenizer)')
print(60*'-')
pp.pprint(ptgs[:13])
print(60*'-')
print('POS tagged course description (WP Tokenizer/RegEx applied)')
print(60*'-')
pp.pprint(trms[:7])

POS tagged course description (WP Tokenizer)
------------------------------------------------------------
[ ('Advanced', 'NNP'), ('Data', 'NNP'), ('Science', 'NN'), (':', ':'),
  ('This', 'DT'), ('class', 'NN'), ('is', 'VBZ'), ('an', 'DT'),
  ('asynchronous', 'JJ'), (',', ','), ('online', 'JJ'), ('course', 'NN'),
  ('.', '.')]
------------------------------------------------------------
POS tagged course description (WP Tokenizer/RegEx applied)
------------------------------------------------------------
['Advanced', 'Data', 'Science', 'class', 'asynchronous', 'online', 'course']


-----
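
Because `re.match` only anchors at the start of a tag, the pattern used above keeps every tag that begins with one of the listed prefixes. The sketch below, using only the standard library, contrasts that prefix behavior with an exact match via `fullmatch`. The tagged tokens are copied from the output above; the names `prefix_rgx`, `exact_rgx`, `loose`, and `strict` are hypothetical.

```python
import re

# Tagged tokens copied from the PoS output shown above.
tagged = [('Advanced', 'NNP'), ('Data', 'NNP'), ('Science', 'NN'), (':', ':'),
          ('This', 'DT'), ('class', 'NN'), ('is', 'VBZ'), ('an', 'DT'),
          ('asynchronous', 'JJ'), (',', ','), ('online', 'JJ'),
          ('course', 'NN'), ('.', '.')]

# Prefix match: NN also accepts NNP, NNS, and NNPS.
prefix_rgx = re.compile(r'(JJ|NN|VBN|VBG)')
loose = [word for word, tag in tagged if prefix_rgx.match(tag)]

# Exact match: only the tags JJ and NN themselves are accepted.
exact_rgx = re.compile(r'JJ|NN')
strict = [word for word, tag in tagged if exact_rgx.fullmatch(tag)]

print(loose)
print(strict)
```

The exact match drops the proper nouns (`NNP`), which may or may not be desirable depending on the classification task.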

### Student Activity

In the preceding cells, we introduced several basic NLP concepts,
including tagging, Part of Speech, and Named Entity Recognition. Now
that you have run the Notebook, go back and make the following changes
to see how the results change.

1. Change from a Unigram tagger to a Bigram Tagger. How do your results
change?
2. Replace the initial text with a longer document (you can use a text
from within NLTK or a freely available text from _Project Gutenberg_).
Apply more restrictive filters (i.e., higher frequencies) to the bigrams
and trigrams. Do your results make sense?
3. Try using regular expressions to restrict tokens in the NLTK movie
review data set to Nouns, Verbs, Adjectives, and Adverbs. Use these
tokens to perform Sentiment Analysis on these movie review data. Are the
results better or worse than with all words?

-----