# PoS (Part-of-Speech) tagging with the NLTK Python package

## Table of Contents

1. NLTK, NLP, and parts-of-speech tagging
2. Installing NLTK
3. Default Tagger
 - Default tag set
 - Universal tag set
 - Other tag sets
4. Using Corpora
 - Brown Corpus
 - Other corpora
5. Practical uses of tagging
6. Some other Python PoS tagging packages
 - TextBlob
 - Pattern


## 1. NLTK, NLP, and parts-of-speech tagging

The Natural Language Toolkit (NLTK) was developed at the University of Pennsylvania by Steven Bird and Edward Loper as a suite of tools consisting of programs and corpora for the purpose of natural language processing (NLP). Within the NLTK Python package is an extensive collection of tools for tokenization, classifying and tagging text, analyzing sentence structure and grammar, managing linguistic data, as well as a sizable collection of lexicons.

NLTK is used worldwide for research in linguistics, cognitive science, machine learning, and artificial intelligence. Additional information and resources can be found at the official NLTK website, http://www.nltk.org/.

## 2. Installing NLTK 

Installing NLTK requires that Python 2.7, 3.4, or 3.5 is already installed. From the command prompt, you can install NLTK by running `pip install nltk`.

Then, in a Jupyter Notebook, you can finish the setup by downloading the NLTK data (corpora and models): run `import nltk` followed by `nltk.download()`.

Alternatively, if you are using an Anaconda Python install, you can run `conda install -c anaconda nltk` from the Anaconda Prompt.

Additional NLTK install information can be found at http://www.nltk.org/install.html. It is advisable to also install the NumPy package. **A word of caution:** the 64-bit Windows install has been known to be problematic.
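
Once the installation has finished, a quick check from Python can confirm that the package is visible. The sketch below assumes Python 3.8 or newer (needed for `importlib.metadata`), which is later than the versions listed above; `nltk_version` is only an illustrative variable name.

```python
from importlib import metadata

# Look up the installed NLTK version without importing the package itself
try:
    nltk_version = metadata.version("nltk")
except metadata.PackageNotFoundError:
    # The package is not installed in this Python environment
    nltk_version = None

if nltk_version is None:
    print("NLTK was not found in this environment")
else:
    print("NLTK version:", nltk_version)
```

If the first message is printed, the install most likely went into a different Python environment than the one running the notebook.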

## 3.0 Default Tagger

NLTK comes with a default tagger, *pos_tag*. Although it is not as accurate as some newer PoS taggers or N-gram taggers trained with NLTK, it is still widely used. Tagging starts from a string that has been split into tokens and returns a list of tuples in *(word, tag)* format.

In [7]:
# Begin with importing the NLTK package and import either a word or sentence tokenizer.
import nltk
from nltk.tokenize import word_tokenize

In [8]:
# Then tokenize your string data and use pos_tag
sampleText = word_tokenize("The quick brown fox jumps over the lazy dog")
nltk.pos_tag(sampleText)

# http://www.nltk.org/book/ch05.html

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
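
Because the tagger returns a plain Python list of *(word, tag)* tuples, the result can be handled with ordinary list operations. The sketch below copies the output shown above into a variable named `tagged` (an illustrative name, so the example runs without NLTK) and splits it into separate word and tag sequences.

```python
# Tagger output copied from the example above; `tagged` is an illustrative name
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]

# Unzip the tuples into parallel sequences of words and tags
words, tags = zip(*tagged)

# Group the words by their tag
words_by_tag = {}
for word, tag in tagged:
    words_by_tag.setdefault(tag, []).append(word)

print(tags)
print(words_by_tag['NN'])
```

Grouping by tag like this is a simple starting point for the extraction examples later in the tutorial.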

## 3.1 Default Tag Set

The default tagger uses a default tag set to indicate the part of speech in each tuple. The list is based on the Penn Treebank Project at the University of Pennsylvania (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

#### PoS tag list

- CC - coordinating conjunction
- CD - cardinal number
- DT - determiner
- EX - existential there (as in "there is"; think of it as "there exists")
- FW - foreign word
- IN - preposition/subordinating conjunction
- JJ - adjective ('big')
- JJR - adjective, comparative ('bigger')
- JJS - adjective, superlative ('biggest')
- LS - list marker ('1)')
- MD - modal ('could', 'will')
- NN - noun, singular ('desk')
- NNS - noun, plural ('desks')
- NNP - proper noun, singular ('Harrison')
- NNPS - proper noun, plural ('Americans')
- PDT - predeterminer ('all' in 'all the kids')
- POS - possessive ending ("'s" in "parent's")
- PRP - personal pronoun ('I', 'he', 'she')
- PRP\$ - possessive pronoun ('my', 'his', 'her')
- RB - adverb ('very', 'silently')
- RBR - adverb, comparative ('better')
- RBS - adverb, superlative ('best')
- RP - particle ('up' in 'give up')
- TO - to ('to' in 'go to the store')
- UH - interjection ('errrrrrrrm')
- VB - verb, base form ('take')
- VBD - verb, past tense ('took')
- VBG - verb, gerund/present participle ('taking')
- VBN - verb, past participle ('taken')
- VBP - verb, non-3rd person singular present ('take')
- VBZ - verb, 3rd person singular present ('takes')
- WDT - wh-determiner ('which')
- WP - wh-pronoun ('who', 'what')
- WP\$ - possessive wh-pronoun ('whose')
- WRB - wh-adverb ('where', 'when')

https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

## 3.2 Universal Tag Set

The default tag set is often more detailed than needed. In such instances, the Universal tag set provides a simplified list of parts of speech.

#### Universal tag list

- ADJ - adjective	
- ADP - adposition	
- ADV - adverb	
- CONJ - conjunction	
- DET - determiner
- NOUN - noun	
- NUM - numeral	
- PRT - particle	
- PRON - pronoun
- VERB - verb	
- . - punctuation marks	
- X - other	

http://www.nltk.org/book/ch05.html

In [9]:
# The example below uses both pos_tag and word_tokenize but assigns the Universal tag set
sampleText = "The quick brown fox jumps over the lazy dog"
nltk.pos_tag(word_tokenize(sampleText), tagset='universal')

# http://www.nltk.org/_modules/nltk/tag.html

[('The', u'DET'),
 ('quick', u'ADJ'),
 ('brown', u'NOUN'),
 ('fox', u'NOUN'),
 ('jumps', u'VERB'),
 ('over', u'ADP'),
 ('the', u'DET'),
 ('lazy', u'ADJ'),
 ('dog', u'NOUN')]
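
Conceptually, the Universal tag set collapses several detailed tags into one coarse category. The following is a minimal sketch of that idea, using a hand-written dictionary that covers only the five Penn Treebank tags appearing in the example sentence. NLTK maintains its own complete mapping; the names `PENN_TO_UNIVERSAL` and `to_universal`, and the choice to fall back to `X` for unknown tags, are assumptions made for illustration.

```python
# Hand-written, partial mapping for illustration only (not NLTK's own table)
PENN_TO_UNIVERSAL = {
    'DT': 'DET',
    'JJ': 'ADJ',
    'NN': 'NOUN',
    'VBZ': 'VERB',
    'IN': 'ADP',
}

def to_universal(tagged, mapping=PENN_TO_UNIVERSAL):
    """Replace each Penn Treebank tag with a Universal tag, or 'X' if unmapped."""
    return [(word, mapping.get(tag, 'X')) for word, tag in tagged]

# Default tagger output copied from the earlier example
penn_tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
               ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
               ('dog', 'NN')]

universal_tagged = to_universal(penn_tagged)
print(universal_tagged)
```

In practice, passing `tagset='universal'` to `pos_tag`, as in the cell above, is the simpler route; the sketch only shows what that option does to each tuple.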

## 3.3 Other tag sets

NLTK provides three main tag sets:
- The University of Pennsylvania Treebank tag set, which is the NLTK default - *nltk.help.upenn_tagset()*
- The Brown University tag set - *nltk.help.brown_tagset()*
- The Lancaster University UCREL CLAWS5 tag set - *nltk.help.claws5_tagset()*

Additionally, other tag sets can be assigned, particularly when working in languages other than English.

In [26]:
# To display the tags in a given tag set 
nltk.help.brown_tagset()

(: opening parenthesis
    (
): closing parenthesis
    )
*: negator
    not n't
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ? ; ! :
:: colon
    :
ABL: determiner/pronoun, pre-qualifier
    quite such rather
ABN: determiner/pronoun, pre-quantifier
    all half many nary
ABX: determiner/pronoun, double conjunction or pre-quantifier
    both
AP: determiner/pronoun, post-determiner
    many other next more last former little several enough most least only
    very few fewer past same Last latter less single plenty 'nough lesser
    certain various manye next-to-last particular final previous present
    nuf
AP$: determiner/pronoun, post-determiner, genitive
    other's
AP+AP: determiner/pronoun, post-determiner, hyphenated pair
    many-much
AT: article
    the an no a every th' ever' ye
BE: verb 'to be', infinitive or imperative
    be
BED: verb 'to be', past tense, 2nd person singular or all persons plural
    were
BED*: verb 'to be', past tense, 2nd person singular or 

## 4.0 Using Corpora

NLTK provides a large set of corpora and lexicons to reference and use. These include texts from Project Gutenberg, past presidential inaugural speeches, web and chat texts, and the Brown Corpus, the first million-word electronic corpus of English, created at Brown University in 1961 (http://www.nltk.org/book/ch02.html).

Many of the corpora are already tagged and can be used to train N-gram taggers.
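
To make the idea of training on a tagged corpus concrete, here is a simplified sketch of a unigram (1-gram) tagger written with the standard library only. It learns the most frequent tag for each word from a tiny, made-up training sample; the data and the names `train_unigram` and `tag_tokens` are hypothetical. NLTK offers its own `nltk.UnigramTagger` class, which would normally be trained on tagged sentences from a real corpus such as Brown.

```python
from collections import Counter, defaultdict

# Tiny hand-made training sample of tagged sentences (illustrative data only)
training_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('fox', 'NN'), ('jumps', 'VBZ')],
    [('a', 'DT'), ('dog', 'NN'), ('jumps', 'VBZ')],
    [('dog', 'VB'), ('the', 'DT'), ('fox', 'NN')],
]

def train_unigram(tagged_sents):
    """Return a dict mapping each word to the tag it most often carries."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_tokens(model, tokens, default='NN'):
    """Tag tokens with the learned tag, using a default tag for unseen words."""
    return [(tok, model.get(tok, default)) for tok in tokens]

model = train_unigram(training_sents)
result = tag_tokens(model, ['the', 'dog', 'jumps', 'quickly'])
print(result)
```

Note that the unseen word `quickly` receives the fallback tag, which is wrong here; larger training corpora and backoff strategies are what reduce such errors.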

In [27]:
# Listing categories within the Brown corpus
from nltk.corpus import brown
brown.categories()

[u'adventure',
 u'belles_lettres',
 u'editorial',
 u'fiction',
 u'government',
 u'hobbies',
 u'humor',
 u'learned',
 u'lore',
 u'mystery',
 u'news',
 u'religion',
 u'reviews',
 u'romance',
 u'science_fiction']

In [28]:
# Displaying the raw Brown corpus text for the science_fiction category shows the words with their Brown tags
brown.raw(categories=['science_fiction'])[:150]

u'\n\n\tNow/rb that/cs he/pps knew/vbd himself/ppl to/to be/be self/nn he/pps was/bedz free/jj to/to grok/vb ever/ql closer/rbr to/in his/pp$ brothers/nns '
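
In the raw form, each token is stored as `word/tag`. NLTK's corpus reader does the splitting for you (for example through `brown.tagged_words()`), so the sketch below only illustrates the format. It assumes every token contains a tag, uses a shortened copy of the string above, and the helper name `parse_tagged` is hypothetical.

```python
# Shortened excerpt of the raw Brown text shown above
raw = "Now/rb that/cs he/pps knew/vbd himself/ppl to/to be/be self/nn"

def parse_tagged(text, sep='/'):
    """Split whitespace-separated 'word/tag' tokens into (word, TAG) tuples."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so words containing '/' stay intact
        word, _, tag = token.rpartition(sep)
        # Uppercase the tag to match how the tag set listing writes it
        pairs.append((word, tag.upper()))
    return pairs

pairs = parse_tagged(raw)
print(pairs[:3])
```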

In [14]:
# However, most of the corpora are organized by fileids instead of categories
nltk.corpus.gutenberg.fileids()

[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

In [34]:
# Displaying the first 75 characters of each of the Gutenberg texts
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    print(fileid, gutenberg.raw(fileid)[:75], '...')

(u'austen-emma.txt', u'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome,', '...')
(u'austen-persuasion.txt', u'[Persuasion by Jane Austen 1818]\n\n\nChapter 1\n\n\nSir Walter Elliot, of Kellyn', '...')
(u'austen-sense.txt', u'[Sense and Sensibility by Jane Austen 1811]\n\nCHAPTER 1\n\n\nThe family of Dash', '...')
(u'bible-kjv.txt', u'[The King James Bible]\n\nThe Old Testament of the King James Bible\n\nThe Firs', '...')
(u'blake-poems.txt', u'[Poems by William Blake 1789]\n\n \nSONGS OF INNOCENCE AND OF EXPERIENCE\nand T', '...')
(u'bryant-stories.txt', u'[Stories to Tell to Children by Sara Cone Bryant 1918] \r\n\r\n\r\nTWO LITTLE RID', '...')
(u'burgess-busterbrown.txt', u'[The Adventures of Buster Bear by Thornton W. Burgess 1920]\r\n\r\nI\r\n\r\nBUSTER ', '...')
(u'carroll-alice.txt', u"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down t", '...')
(u'chesterton-ball.txt', u'[The Ball and The Cross by G.K. Chesterton 1909]\

In [102]:
# In addition to the Gutenberg corpora, there is also a web and chat text corpus; sents() returns tokenized sentences instead of the raw string
from nltk.corpus import webtext
montyPython = webtext.sents('grail.txt')
montyPython

[[u'SCENE', u'1', u':', u'[', u'wind', u']', u'[', u'clop', u'clop', u'clop', u']', u'KING', u'ARTHUR', u':', u'Whoa', u'there', u'!'], [u'[', u'clop', u'clop', u'clop', u']', u'SOLDIER', u'#', u'1', u':', u'Halt', u'!'], ...]

In [103]:
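# Using regular expressions to lightly clean the text output.
import re
montyPython = str(montyPython)
# Remove only the u prefix in front of each quoted token, not the letter "u" inside words
montyPython = re.sub(r"\bu'", "'", montyPython)
# Remove the remaining quotes and commas
montyPython = re.sub("[',]", "", montyPython)
montyPython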

'[[SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !] [[ clop clop clop ] SOLDIER # 1 : Halt !] ...]'
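
An alternative to cleaning the printed representation with regular expressions is to join the tokens directly, which leaves the characters inside each word untouched. The sketch below hard-codes the first two tokenized sentences from the output above so that it runs without the corpus; the variable names are illustrative.

```python
# First two tokenized sentences copied from the sents() output above
sentences = [
    ['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop', ']',
     'KING', 'ARTHUR', ':', 'Whoa', 'there', '!'],
    ['[', 'clop', 'clop', 'clop', ']', 'SOLDIER', '#', '1', ':', 'Halt', '!'],
]

# Join the tokens of each sentence with spaces, one sentence per line
lines = [" ".join(tokens) for tokens in sentences]
cleaned = "\n".join(lines)
print(cleaned)
```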

In [104]:
# Web & chat text corpora
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:75], '...')

# http://www.nltk.org/book/ch02.html

(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to set future c', '...')
(u'grail.txt', u'SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop', '...')
(u'overheard.txt', u'White guy: So, do you have any plans for this evening?\nAsian girl: Yeah, be', '...')
(u'pirates.txt', u"PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terry Rossio\n[", '...')
(u'singles.txt', u'25 SEXY MALE, seeks attrac older single lady, for discreet encounters.\n35YO', '...')
(u'wine.txt', u'Lovely delicate, fragrant Rhone wine. Polished leather and strawberries. Pe', '...')


## 5.0 Practical uses of tagging

One of the things I am most interested in is the use of descriptive text for products, storytelling, and persuasive writing. Consequently, the remainder of this tutorial explores noun extraction (useful for discovering products and themes), adjective and active verb usage, and word frequency. Each of these can guide writing that emulates trending descriptive practice in online marketplaces.

In [130]:
# Extracting all the various types of nouns using the default tagger
sampleText = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sampleText)
tagged = nltk.pos_tag(tokens)
nouns = [word for word, pos in tagged if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
# Lowercase the nouns and join them into one string
downcased = [word.lower() for word in nouns]
nouns = ", ".join(downcased)
nouns

# http://benjamindalton.com/extracting-nouns-with-python/

'brown, fox, dog'

In [146]:
# Another approach to remove everything but nouns: match on the tag prefix
from nltk.tag import pos_tag
is_noun = lambda pos: pos[:2] == 'NN'
text = word_tokenize("The quick brown fox jumps over the lazy dog")
texts = [ word for (word, pos) in nltk.pos_tag(text) if is_noun(pos) ]
texts

['brown', 'fox', 'dog']

In [143]:
# Extracting proper nouns
tokens = nltk.word_tokenize("Down and out in the Magic Kingdom, Disneyland...")
tagged = nltk.pos_tag(tokens)
properNouns = [word for word, pos in tagged if pos in ('NNP', 'NNPS')]
# Join the proper nouns into one string, keeping their capitalization
properNouns = ", ".join(properNouns)
properNouns

In [147]:
# Extracting verbs
from nltk.tag import pos_tag
is_verb = lambda pos: pos[:2] == 'VB'
text = word_tokenize("The quick brown fox jumps over the lazy dog")
texts = [ word for (word, pos) in nltk.pos_tag(text) if is_verb(pos) ]
texts

['jumps']
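
The noun and verb cells above follow the same pattern, so they can be folded into one small helper that also covers adjectives. This is a sketch rather than part of NLTK: `extract_by_prefix` is a hypothetical name, and the tagged list is copied from the default tagger output earlier in the tutorial so the example runs on its own.

```python
def extract_by_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

# Default tagger output copied from the earlier example
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]

nouns = extract_by_prefix(tagged, 'NN')        # NN, NNS, NNP, NNPS
verbs = extract_by_prefix(tagged, 'VB')        # VB, VBD, VBG, VBN, VBP, VBZ
adjectives = extract_by_prefix(tagged, 'JJ')   # JJ, JJR, JJS

print(nouns, verbs, adjectives)
```

Matching on the prefix works because the Penn Treebank tags for one word class share their first two letters.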

In [148]:
# These examples are largely from http://www.nltk.org/book/ch05.html - measuring frequency distributions of specific words across corpora
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 
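
A conditional frequency distribution can be thought of as one counter per condition (here, one per genre). The sketch below imitates that structure with the standard library on a handful of made-up (genre, word) pairs; it is an illustration of the idea, not NLTK's implementation, and `cfd_sketch` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs standing in for (genre, word) pairs
pairs = [
    ('news', 'will'), ('news', 'can'), ('news', 'will'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'can'),
]

# One Counter per condition: the core idea of a conditional frequency distribution
cfd_sketch = defaultdict(Counter)
for genre, word in pairs:
    cfd_sketch[genre][word] += 1

# Print a small table with one row per genre and one column per modal
modals = ['can', 'could', 'will']
print(' ' * 8 + ''.join(m.rjust(6) for m in modals))
for genre in ['news', 'romance']:
    row = ''.join(str(cfd_sketch[genre][m]).rjust(6) for m in modals)
    print(genre.rjust(8) + row)
```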


In [150]:
# This example is largely from http://www.nltk.org/book/ch05.html with small changes -- Calculating frequency distributions across PoS from Brown corpus - science_fiction
from nltk.corpus import brown
brown_science_fiction_tagged = brown.tagged_words(categories='science_fiction', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_science_fiction_tagged)
tag_fd.most_common()

[(u'NOUN', 2747),
 (u'VERB', 2579),
 (u'.', 2428),
 (u'DET', 1582),
 (u'ADP', 1451),
 (u'PRON', 934),
 (u'ADJ', 929),
 (u'ADV', 828),
 (u'PRT', 483),
 (u'CONJ', 416),
 (u'NUM', 79),
 (u'X', 14)]
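
Raw counts are easier to compare across genres once they are turned into proportions. The sketch below reuses the counts printed above (copied by hand, so it runs without the corpus) and computes each tag's share of the total; it assumes Python 3 division, and the variable names are illustrative.

```python
# Tag counts copied from the science_fiction output above
tag_counts = [('NOUN', 2747), ('VERB', 2579), ('.', 2428), ('DET', 1582),
              ('ADP', 1451), ('PRON', 934), ('ADJ', 929), ('ADV', 828),
              ('PRT', 483), ('CONJ', 416), ('NUM', 79), ('X', 14)]

# Total number of tagged tokens in the category
total = sum(count for _, count in tag_counts)

# Share of each tag as a fraction of the total
shares = {tag: count / total for tag, count in tag_counts}

for tag, count in tag_counts:
    print('{:<5} {:>5} {:6.1%}'.format(tag, count, shares[tag]))
```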

In [22]:
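# Listing the words tagged as verbs in the Penn Treebank (Wall Street Journal) sample, most frequent first
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB']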

[u'is',
 u'said',
 u'was',
 u'are',
 u'be',
 u'has',
 u'have',
 u'will',
 u'says',
 u'would',
 u'were',
 u'had',
 u'been',
 u'could',
 u"'s",
 u'can',
 u'do',
 u'say',
 u'make',
 u'may',
 u'did',
 u'rose',
 u'made',
 u'does',
 u'expected',
 u'buy',
 u'take',
 u'get',
 u'might',
 u'sell',
 u'added',
 u'sold',
 u'help',
 u'including',
 u'should',
 u'reported',
 u'according',
 u'pay',
 u'being',
 u'compared',
 u'began',
 u'fell',
 u'based',
 u'closed',
 u'used',
 u"'re",
 u'want',
 u'see',
 u'took',
 u'yield',
 u'priced',
 u'offered',
 u'set',
 u'approved',
 u'come',
 u'cut',
 u'noted',
 u'ended',
 u'increased',
 u'found',
 u'think',
 u'become',
 u'declined',
 u'go',
 u'proposed',
 u'growing',
 u'trying',
 u'received',
 u'named',
 u'put',
 u'give',
 u'came',
 u'held',
 u'use',
 u'paid',
 u'going',
 u'called',
 u'raise',
 u'estimated',
 u'continue',
 u'designed',
 u'making',
 u'expects',
 u'seeking',
 u'plans',
 u'wo',
 u'must',
 u'got',
 u'gained',
 u'trading',
 u'owns',
 u'fined',
 u'say

In [24]:
# Listing the words tagged VBN (past participle) in the same corpus, using the default tag set
wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
list(cfd2['VBN'])

[u'limited',
 u'reorganized',
 u'managed',
 u'switched',
 u'caused',
 u'founded',
 u'assembled',
 u'concerned',
 u'contained',
 u'Rekindled',
 u'automated',
 u'bribed',
 u'voted',
 u'issued',
 u'cluttered',
 u'disapproved',
 u'sent',
 u'returned',
 u'synchronized',
 u'puzzled',
 u'desired',
 u'engineered',
 u'headlined',
 u'centralized',
 u'advised',
 u'stabbed',
 u'continued',
 u'perceived',
 u'presented',
 u'prolonged',
 u'Related',
 u'solved',
 u'noted',
 u'concluded',
 u'Filmed',
 u'infringed',
 u'construed',
 u'licensed',
 u'knitted',
 u'slowed',
 u'enclosed',
 u'replicated',
 u'estimated',
 u'imported',
 u'risen',
 u'assisted',
 u'beaten',
 u'contributed',
 u'expressed',
 u'enjoyed',
 u'industrialized',
 u'zoomed',
 u'crossed',
 u'learned',
 u'filled',
 u'told',
 u'drafted',
 u'deemed',
 u'kicked',
 u'led',
 u'ranged',
 u'slated',
 u'reported',
 u'focused',
 u'auctioned',
 u'crippled',
 u'represented',
 u'scrapped',
 u'invented',
 u'obtained',
 u'colored',
 u'skyrocketed',
 u'inv

## 6.0 Some other Python PoS tagging packages

## TextBlob - basic POS tagging...

In [19]:
from textblob import TextBlob

In [21]:
wiki = TextBlob("The quick brown fox jumps over the lazy dog")
wiki.tags

[('The', u'DT'),
 ('quick', u'JJ'),
 ('brown', u'NN'),
 ('fox', u'NN'),
 ('jumps', u'VBZ'),
 ('over', u'IN'),
 ('the', u'DT'),
 ('lazy', u'JJ'),
 ('dog', u'NN')]

In [22]:
# Noun phrase extraction: the result below pulls the verb 'jumps' into the first phrase, so this still needs investigating
wiki = TextBlob("The quick brown fox jumps over the lazy dog")
wiki.noun_phrases

WordList([u'quick brown fox jumps', u'lazy dog'])

## Pattern.en - basic POS tagging...

In [23]:
# Just a quick experiment using Pattern for PoS
from pattern.en import parse
text = "The quick brown fox jumps over the lazy dog"
print(parse(text))

# https://www.clips.uantwerpen.be/pages/pattern-en

The/DT/B-NP/O quick/JJ/I-NP/O brown/JJ/I-NP/O fox/NN/I-NP/O jumps/NNS/I-NP/O over/IN/B-PP/B-PNP the/DT/B-NP/I-PNP lazy/JJ/I-NP/I-PNP dog/NN/I-NP/I-PNP


## References:
- Wikipedia, Natural Language Toolkit. https://en.wikipedia.org/wiki/Natural_Language_Toolkit
- Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by Steven Bird, Ewan Klein, and Edward Loper.
- NLTK 3.2.5 documentation. http://www.nltk.org/
- Anaconda Cloud, anaconda/packages/nltk 3.2.5 https://anaconda.org/anaconda/nltk
- Alphabetical list of part-of-speech tags used in the Penn Treebank Project. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- NLTK Part of Speech Tagging Tutorial. https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
- Tagsets in NLTK. https://www.kaggle.com/alvations/tagsets-in-nltk
- TextBlob, Tutorial: Quickstart. http://textblob.readthedocs.io/en/dev/quickstart.html
- CLiPS, Pattern.en. https://www.clips.uantwerpen.be/pages/pattern-en

B. Ward **K-State student tutorial project - Spring 2018,** *Still being developed...*  