# Part-of-Speech (POS) Tagging

## Installing Necessary Dependencies

We will use several libraries and dependencies:
• The nltk library, preferably version 3.1, 3.2.1, or later
• The spacy library
• The pattern library
• The Stanford Parser
• Graphviz and its supporting Python libraries


For spacy, you first install the package and then separately install its dependency, a language model. To install spacy, type pip install spacy from the terminal. Once done, download the English language model using the command python -m spacy.en.download from the terminal, which downloads around 500 MB of data into the directory of the spacy package itself. Newer spacy releases replace this command with python -m spacy download followed by a model name such as en_core_web_sm. For more details, refer to https://spacy.io/docs/#getting-started, which explains how to get started with spacy. We will use spacy for tagging and for depicting dependency-based parsing.

$ pip install spacy
$ python -m spacy.en.download


To install pattern, run pip install pattern, which downloads and installs the library and its necessary dependencies. The page www.clips.ua.ac.be/pages/pattern-en offers more information about pattern.

$ pip install pattern

The Stanford Parser is a Java-based language parser developed at Stanford, which parses sentences to expose their underlying structure. We will perform both dependency and constituency grammar–based parsing using the Stanford Parser and nltk, which provides an excellent wrapper for using the parser from Python without writing any Java code. You can refer to the official installation guide at https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software, which describes how to download and install the Stanford Parser and integrate it with nltk.
To start with, download and install the Java Development Kit (JDK), not just the Java Runtime Environment (JRE), from the official website at www.oracle.com/technetwork/java/javase/downloads/index.html?ssSourceSiteId=otnjp. Java SE 8u101 / 8u102 are the latest versions at the time of writing this book; I have used 8u102. After installing, set the path for Java by adding it to the Path system environment variable. You can also create a JAVA_HOME environment variable pointing to the java.exe file belonging to the JDK. In my experience, neither approach worked when running the code from Python, and I had to explicitly use the Python os library to set the environment variable, which I will show when we dive into the implementation details. Once Java is installed, download the official Stanford Parser from http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip, which seems to work quite well. You can try a later version by going to http://nlp.stanford.edu/software/lex-parser.shtml#Download and checking the Release History section. After downloading, unzip it to a known location in your filesystem. You are then ready to use the parser from nltk.
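
As a rough illustration of the environment-variable workaround mentioned above, the sketch below sets the variable from Python with the os library. The JDK path is hypothetical and must be replaced with the location on your own machine. Using JAVAHOME as the variable name is an assumption about what nltk looks up, so check it against your nltk version.

```python
import os

# Hypothetical JDK location; replace it with the actual path on your machine.
java_path = r'C:\Program Files\Java\jdk1.8.0_102\bin\java.exe'

# Assumption: nltk locates the Java binary through the JAVAHOME variable.
# Setting it here affects only the current Python process and its children.
os.environ['JAVAHOME'] = java_path

print(os.environ['JAVAHOME'])
```

Because the variable is set inside the running process, it has to be assigned before any nltk code that launches Java is called.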



Graphviz is optional; we will use it only to view the dependency parse tree generated by the Stanford Parser. You can download Graphviz from its official website at www.graphviz.org/Download_windows.php and install it. Next, install pygraphviz, which you can get by downloading the wheel file from www.lfd.uci.edu/~gohlke/pythonlibs/#pygraphviz that matches your system architecture and Python version. Then install it using the command pip install pygraphviz-1.3.1-cp27-none-win_amd64.whl for a 64-bit system running Python 2.7.x. Once installed, pygraphviz should be ready to work.

$ pip install pydot-ng
$ pip install graphviz



In [2]:
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 15 17:08:05 2016

@author: DIP
"""

sentence = 'The brown fox is quick and he is jumping over the lazy dog'


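# nltk's recommended tagger (trained on the Penn Treebank),
# with its output mapped to the universal tagset
import nltk
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens, tagset='universal')
print(tagged_sent)

# pattern's tagger, which returns Penn Treebank tags
from pattern.en import tag
tagged_sent = tag(sentence)
print(tagged_sent)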


[('The', u'DET'), ('brown', u'ADJ'), ('fox', u'NOUN'), ('is', u'VERB'), ('quick', u'ADJ'), ('and', u'CONJ'), ('he', u'PRON'), ('is', u'VERB'), ('jumping', u'VERB'), ('over', u'ADP'), ('the', u'DET'), ('lazy', u'ADJ'), ('dog', u'NOUN')]
[(u'The', u'DT'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'is', u'VBZ'), (u'quick', u'JJ'), (u'and', u'CC'), (u'he', u'PRP'), (u'is', u'VBZ'), (u'jumping', u'VBG'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
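
The two outputs above use different tagsets: the first uses the coarse universal tags, the second the finer Penn Treebank (PTB) tags. The following sketch shows how the second output collapses into the first. The mapping table is hand-written, covers only the tags that occur in this sentence, and is named PTB_TO_UNIVERSAL purely for illustration; it is not the table nltk ships with.

```python
# PTB tags returned by pattern for the example sentence (copied from the output above).
ptb_tagged = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
              ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
              ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
              ('lazy', 'JJ'), ('dog', 'NN')]

# Illustrative, partial PTB -> universal mapping (only the tags seen above).
PTB_TO_UNIVERSAL = {
    'DT': 'DET', 'JJ': 'ADJ', 'NN': 'NOUN', 'VBZ': 'VERB',
    'VBG': 'VERB', 'CC': 'CONJ', 'PRP': 'PRON', 'IN': 'ADP',
}

def to_universal(tagged):
    # Tags missing from the partial table fall back to 'X' (the universal "other" tag).
    return [(word, PTB_TO_UNIVERSAL.get(tag, 'X')) for word, tag in tagged]

universal_tagged = to_universal(ptb_tagged)
print(universal_tagged)
```

Several fine-grained tags (here VBZ and VBG) map to the same coarse tag, which is why the universal output carries less detail about verb forms.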


In [3]:
# building your own tagger

# prepare the data: the first 3,500 treebank sentences are used
# for training and the remaining sentences for testing
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print(train_data[0])


[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')]


In [4]:
# default tagger: assigns the same tag ('NN') to every token
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')

print(dt.evaluate(test_data))


0.145415819537
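
The score above can be read as token-level accuracy: the fraction of test tokens whose predicted tag equals the gold tag. The sketch below recomputes that idea in plain Python on a single sentence, the first treebank sentence printed earlier. The helper names tagging_accuracy and always_nn are hypothetical, and this is an approximation of what the evaluate method reports, not nltk's own code.

```python
def tagging_accuracy(tag_words, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_words(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    return correct / float(total)

def always_nn(words):
    # Stand-in for a default tagger: every token gets the tag 'NN'.
    return [(word, 'NN') for word in words]

# The first treebank sentence shown earlier, used here as a tiny gold standard.
gold = [[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
         ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
         ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
         ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
         ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]]

acc = tagging_accuracy(always_nn, gold)
print(acc)
```

Only two of the 18 tokens in this sentence are tagged NN, so a tagger that always answers NN gets about 11% of them right, which is in the same range as the score on the full test set.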


In [5]:
print(dt.tag(tokens))


[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NN'), ('jumping', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]


In [6]:
# regex tagger
from nltk.tag import RegexpTagger
# define regex tag patterns; the first pattern that matches a token decides its tag
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers ('.' is unescaped, so 1,000 also matches)
        (r'.*', 'NN')                     # nouns (default)
]
rt = RegexpTagger(patterns)

print(rt.evaluate(test_data))


0.240391131765


In [7]:
print(rt.tag(tokens))


[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NNS'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NNS'), ('jumping', 'VBG'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]
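
In the output above, 'is' is tagged NNS because the patterns are tried in order and the plural-noun pattern is the first one that matches it. The sketch below reproduces that first-match behaviour with the standard re module. It is a simplified stand-in for the regex tagger, and first_match_tag is a hypothetical helper name.

```python
import re

# Same patterns, in the same order, as in the cell above.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),
]

def first_match_tag(word):
    # Return the tag of the first pattern that matches the word.
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None

for word in ['is', 'jumping', 'fox']:
    print('%s -> %s' % (word, first_match_tag(word)))
```

Since the order of the patterns decides the result, putting more specific patterns before general ones matters, and the catch-all pattern has to come last.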


In [8]:
# n-gram taggers
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

# train each tagger on its own, without a backoff tagger
ut = UnigramTagger(train_data)
bt = BigramTagger(train_data)
tt = TrigramTagger(train_data)

print(ut.evaluate(test_data))
print(ut.tag(tokens))


0.861361215994
[('The', u'DT'), ('brown', None), ('fox', None), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', u'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', None), ('dog', None)]


In [9]:
print(bt.evaluate(test_data))
print(bt.tag(tokens))


0.134669377481
[('The', u'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]


In [10]:
print(tt.evaluate(test_data))
print(tt.tag(tokens))


0.0806467228192
[('The', u'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]
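
The bigram and trigram outputs above show a run of None tags after the first unknown word. A plausible explanation is that these taggers look up a context made of the previous tag(s) plus the current word, so once one tag is None every following context is unseen as well. The toy bigram tagger below illustrates this effect on hypothetical training data. It is a simplified sketch, not nltk's implementation, and the names START, train_bigram, and tag_bigram are made up for the example.

```python
from collections import Counter, defaultdict

START = '<s>'  # marker for "no previous tag" at the start of a sentence

def train_bigram(tagged_sents):
    # Count which tag follows each (previous tag, current word) context.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    # Keep only the most frequent tag for each context.
    return dict((ctx, c.most_common(1)[0][0]) for ctx, c in counts.items())

def tag_bigram(table, words):
    tagged = []
    prev_tag = START
    for word in words:
        tag = table.get((prev_tag, word))  # None when the context was never seen
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Hypothetical one-sentence training set.
train = [[('The', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('lazy', 'JJ')]]
table = train_bigram(train)

result = tag_bigram(table, ['The', 'fox', 'is', 'lazy'])
print(result)
```

Here 'is' and 'lazy' both occur in the training data, yet they stay untagged because the unseen word 'fox' breaks the chain of previous tags. This is the motivation for combining n-gram taggers with a backoff tagger.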


In [11]:
def combined_tagger(train_data, taggers, backoff=None):
    # train each tagger in turn, using the previously built tagger as its backoff
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff

# resulting chain: trigram -> bigram -> unigram -> regex tagger
ct = combined_tagger(train_data=train_data,
                     taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=rt)

print(ct.evaluate(test_data))
print(ct.tag(tokens))


0.910155871817
[('The', u'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', 'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', 'NN'), ('dog', 'NN')]
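
The loop in combined_tagger wraps each new tagger around the one built before it, so the last class in the list ends up being consulted first. The sketch below traces that order with plain dictionaries in place of real taggers; build_chain and consultation_order are hypothetical helpers used only to show the wiring.

```python
def build_chain(names, backoff=None):
    # Mirrors the loop in combined_tagger: each new tagger wraps the previous one.
    for name in names:
        backoff = {'name': name, 'backoff': backoff}
    return backoff

def consultation_order(tagger):
    # Walk from the outermost tagger down through its backoffs.
    order = []
    while tagger is not None:
        order.append(tagger['name'])
        tagger = tagger['backoff']
    return order

regex_like = {'name': 'RegexpTagger', 'backoff': None}
ct_like = build_chain(['UnigramTagger', 'BigramTagger', 'TrigramTagger'],
                      backoff=regex_like)

order = consultation_order(ct_like)
print(order)
```

Under this wiring the trigram tagger answers when it has seen the context, and otherwise the request falls through to the bigram, unigram, and finally the regex tagger, which always returns some tag.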


In [12]:
# classifier-based tagger trained with a Naive Bayes classifier
from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

nbt = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=NaiveBayesClassifier.train)

print(nbt.evaluate(test_data))
print(nbt.tag(tokens))


0.930680607997
[('The', u'DT'), ('brown', u'JJ'), ('fox', u'NN'), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', u'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'VBG')]


In [13]:
# Optional: try this out for fun, but do not run it in class.
# Training the MaxEnt classifier is slow and can run out of memory.
met = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=MaxentClassifier.train)
print(met.evaluate(test_data))
print(met.tag(tokens))


  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -3.82864        0.007
             2          -0.76176        0.957


  exp_nf_delta = 2 ** nf_delta
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)


         Final               nan        0.984
0.927001645851
[('The', u'DT'), ('brown', u'NN'), ('fox', u'NN'), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', u'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'NN'), ('dog', u'NN')]
