# Tokenization, Tagging, Chunking - Part of Speech Tagging

In [6]:
import nltk

A part-of-speech (POS) tagger assigns a part-of-speech tag to each token in a sequence of words.

In [7]:
text = "I walked to the cafe to buy coffee before work."

In [8]:
tokens = nltk.word_tokenize(text)

In [9]:
nltk.pos_tag(tokens)

[('I', 'PRP'),
 ('walked', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('cafe', 'NN'),
 ('to', 'TO'),
 ('buy', 'VB'),
 ('coffee', 'NN'),
 ('before', 'IN'),
 ('work', 'NN'),
 ('.', '.')]
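Because the tagger returns a plain list of (token, tag) tuples, the result can be processed with ordinary Python. The following is a minimal sketch that groups tokens by tag; the helper name group_by_tag is hypothetical, and the tagged list is copied from the output above rather than recomputed.

```python
from collections import defaultdict

# Tagged output copied from the cell above; in a live session this would be
# the return value of nltk.pos_tag(tokens).
tagged = [('I', 'PRP'), ('walked', 'VBD'), ('to', 'TO'), ('the', 'DT'),
          ('cafe', 'NN'), ('to', 'TO'), ('buy', 'VB'), ('coffee', 'NN'),
          ('before', 'IN'), ('work', 'NN'), ('.', '.')]


def group_by_tag(tagged_tokens):
    """Map each tag to the tokens that received it, in sentence order.

    Hypothetical helper for illustration; not part of NLTK.
    """
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


groups = group_by_tag(tagged)
print(groups['NN'])  # ['cafe', 'coffee', 'work']
```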

Calling nltk.help.upenn_tagset() prints the key to the Penn Treebank tags used above (the output is truncated here).

In [6]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

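The tagger takes context into account, so the same word can receive different tags. In the two sentences below, "desert" is tagged as a noun (NN) in the first and as a verb (VB) in the second.

In [11]: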
nltk.pos_tag(nltk.word_tokenize("I will have desert."))

[('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('desert', 'NN'), ('.', '.')]

In [12]:
nltk.pos_tag(nltk.word_tokenize("They will desert us."))

[('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')]
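To check how one word is tagged across several sentences, the tags can be collected into a set. This is a small sketch using the two outputs above as data; tags_for is a hypothetical helper, not an NLTK function.

```python
# Tagged sentences copied from the two outputs above.
tagged_sentences = [
    [('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('desert', 'NN'), ('.', '.')],
    [('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')],
]


def tags_for(word, sentences):
    """Return the set of tags assigned to `word` (case-insensitive)."""
    target = word.lower()
    return {tag
            for sentence in sentences
            for token, tag in sentence
            if token.lower() == target}


print(sorted(tags_for('desert', tagged_sentences)))  # ['NN', 'VB']
```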

Create a list of all nouns in Moby Dick, this time using the coarser universal tagset.

In [13]:
md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")

In [14]:
# Lowercase every token and drop anything that is not purely alphabetic
md_norm = [word.lower() for word in md if word.isalpha()]

In [15]:
md_tags = nltk.pos_tag(md_norm, tagset="universal")

In [16]:
md_tags[:5]

[(u'moby', u'NOUN'),
 (u'dick', u'NOUN'),
 (u'by', u'ADP'),
 (u'herman', u'NOUN'),
 (u'melville', u'NOUN')]

In [10]:
# Keep the (word, tag) pairs whose tag is NOUN
md_nouns = [(word, tag) for word, tag in md_tags if tag == "NOUN"]

In [11]:
nouns_fd = nltk.FreqDist(md_nouns)

In [12]:
nouns_fd.most_common(10)

[((u'i', u'NOUN'), 1182),
 ((u'whale', u'NOUN'), 909),
 ((u's', u'NOUN'), 774),
 ((u'man', u'NOUN'), 527),
 ((u'ship', u'NOUN'), 498),
 ((u'sea', u'NOUN'), 435),
 ((u'head', u'NOUN'), 337),
 ((u'time', u'NOUN'), 334),
 ((u'boat', u'NOUN'), 332),
 ((u'ahab', u'NOUN'), 278)]
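The frequency distribution above is keyed by (word, tag) pairs, and the single-letter entries "i" and "s" are likely artifacts of lowercasing the text and dropping punctuation before tagging. As a rough sketch, the counts can be re-keyed by word alone and very short tokens filtered out. The helper word_counts and the min_length cutoff are assumptions for illustration, and the data is the top-10 output copied from above (with the Python 2 u'' prefixes dropped).

```python
from collections import Counter

# Top-10 (pair, count) entries copied from the output above.
top_nouns = [
    (('i', 'NOUN'), 1182), (('whale', 'NOUN'), 909), (('s', 'NOUN'), 774),
    (('man', 'NOUN'), 527), (('ship', 'NOUN'), 498), (('sea', 'NOUN'), 435),
    (('head', 'NOUN'), 337), (('time', 'NOUN'), 334), (('boat', 'NOUN'), 332),
    (('ahab', 'NOUN'), 278),
]


def word_counts(pair_counts, min_length=2):
    """Re-key ((word, tag), count) entries by word, skipping short tokens.

    Hypothetical helper; min_length=2 is an arbitrary cutoff that removes
    single-letter tokens such as 'i' and 's'.
    """
    counts = Counter()
    for (word, _tag), n in pair_counts:
        if len(word) >= min_length:
            counts[word] += n
    return counts


counts = word_counts(top_nouns)
print(counts.most_common(3))  # [('whale', 909), ('man', 527), ('ship', 498)]
```

A length cutoff is a blunt filter; tagging the original, case-preserved tokens might avoid the problem at its source, though that is not verified here.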