# <center> Part-of-Speech Taggers and Parsers </center>
### <center> CS918 Natural Language Processing </center>

###### <center> Bo Wang </center>
<h6><center> February 2016 </center></h6>
<br/>
<center><img src="images/786px-Warwick_Crest.png" width="90"></center>

## Part-of-Speech Tagging

* The process of classifying words into their parts of speech (also known as word classes or lexical categories) and labeling them accordingly is known as **part-of-speech tagging** or **POS-tagging**.
* The labels are called **POS tags**. The collection of tags used for a particular task is known as a **tag set** (e.g. Penn Treebank POS tagset).

## Part-of-Speech Tagger

* NLTK
* [Stanford](http://nlp.stanford.edu/software/tagger.shtml)
* spaCy
* [SENNA](http://ml.nec-labs.com/senna/)
* [CMU POS tagger for tweets](http://www.cs.cmu.edu/~ark/TweetNLP/index.html)
* [Gate twitie-tagger](https://gate.ac.uk/wiki/twitter-postagger.html)
<br><br>
* More: http://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)


### NLTK.tag

In [34]:
## NLTK

import nltk
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
# An off-the-shelf tagger is available. It uses the Penn Treebank tagset by default.
print nltk.pos_tag(text) # outputs a list of tuples of the form ('token', 'tag')
# 'nltk.help.upenn_tagset()' provides documentation for each tag

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]


In [2]:
# We can create these tuples from the standard string representation
# of a tagged token using the function str2tuple()
tagged_token = nltk.tag.str2tuple('fly/NN')
print tagged_token
print tagged_token[0]
print tagged_token[1]

sent = '''
 The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
 other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
 Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
 said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
 accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
 interest/NN of/IN both/ABX governments/NNS ''/'' ./.
 '''
print '\n'
print [nltk.tag.str2tuple(t) for t in sent.split()]

('fly', 'NN')
fly
NN


[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ('of', 'IN'), ('other', 'AP'), ('topics', 'NNS'), (',', ','), ('AMONG', 'IN'), ('them', 'PPO'), ('the', 'AT'), ('Atlanta', 'NP'), ('and', 'CC'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('purchasing', 'VBG'), ('departments', 'NNS'), ('which', 'WDT'), ('it', 'PPS'), ('said', 'VBD'), ('``', '``'), ('ARE', 'BER'), ('well', 'QL'), ('operated', 'VBN'), ('and', 'CC'), ('follow', 'VB'), ('generally', 'RB'), ('accepted', 'VBN'), ('practices', 'NNS'), ('which', 'WDT'), ('inure', 'VB'), ('to', 'IN'), ('the', 'AT'), ('best', 'JJT'), ('interest', 'NN'), ('of', 'IN'), ('both', 'ABX'), ('governments', 'NNS'), ("''", "''"), ('.', '.')]
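
In the output above, `NP-tl` comes back as `NP-TL`: the tag is split off at the separator and upper-cased. A minimal Python 3 sketch of that idea, assuming the tag follows the last `/`. `split_tagged_token` is a hypothetical name used for illustration, not NLTK's implementation.

```python
def split_tagged_token(s, sep='/'):
    """Split 'word/TAG' at the last separator and return (word, TAG)."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator at all: treat the whole string as an untagged word.
        return (s, None)
    # Upper-case the tag so that 'NP-tl' and 'NP-TL' compare equal.
    return (word, tag.upper())

print(split_tagged_token('fly/NN'))
print(split_tagged_token('Fulton/NP-tl'))
print(split_tagged_token('1/2/CD'))  # only the last '/' separates the tag
```

Splitting at the last separator keeps tokens that themselves contain `/` intact.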


In [2]:
# Training an n-gram tagger

# Unigram taggers assign the tag that is most likely for that particular token, based on the training corpus.
# An n-gram tagger is a generalization of a unigram tagger whose context is 
# the current word together with the part-of-speech tags of the n-1 preceding tokens.

from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger, DefaultTagger

brown_tagged_sents = brown.tagged_sents(categories='news', tagset='universal')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
bigram_tagger = BigramTagger(train_sents)

unseen_sent = brown_sents[4203]
print bigram_tagger.tag(unseen_sent)
print "Accuracy score on test sents =",bigram_tagger.evaluate(test_sents)

[(u'The', u'DET'), (u'population', u'NOUN'), (u'of', u'ADP'), (u'the', u'DET'), (u'Congo', u'NOUN'), (u'is', u'VERB'), (u'13.5', None), (u'million', None), (u',', None), (u'divided', None), (u'into', None), (u'at', None), (u'least', None), (u'seven', None), (u'major', None), (u'``', None), (u'culture', None), (u'clusters', None), (u"''", None), (u'and', None), (u'innumerable', None), (u'tribes', None), (u'speaking', None), (u'400', None), (u'separate', None), (u'dialects', None), (u'.', None)]
Accuracy score on test sents = 0.148111232931
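
The run of `None` tags above is what one would expect from a bigram tagger with no backoff: once a (previous tag, word) context is unseen, the tag is `None`, and every following context then contains `None` as the previous tag, which never occurs in training. The Python 3 sketch below reproduces the effect on made-up toy data. All names (`train_bigram`, `tag_bigram`, `accuracy`) are hypothetical, and this is not how NLTK implements `BigramTagger`.

```python
from collections import Counter, defaultdict

START = '<s>'  # marks the sentence-initial context

def train_bigram(tagged_sents):
    """Map (previous tag, word) -> most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_bigram(model, words):
    """Tag left to right with no backoff; unseen contexts yield None."""
    tagged, prev = [], START
    for word in words:
        tag = model.get((prev, word))  # None if this context was never seen
        tagged.append((word, tag))
        prev = tag
    return tagged

def accuracy(model, gold_sents):
    """Token-level accuracy: share of tokens whose predicted tag matches gold."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag_bigram(model, [w for w, _ in sent])
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total

# Toy training data, made up for illustration.
train = [
    [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')],
    [('the', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')],
]
model = train_bigram(train)
result = tag_bigram(model, ['the', 'bird', 'sleeps'])
print(result)
```

Note that `sleeps` was seen in training, yet it still gets `None` here because the context after the unknown word `bird` was never observed.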


In [45]:
# Combining taggers, to address the trade-off between accuracy and coverage
default_tagger = DefaultTagger('NOUN')
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)
print bigram_tagger.tag(unseen_sent)
print "Accuracy score on test sents =",bigram_tagger.evaluate(test_sents)

[(u'The', u'DET'), (u'population', 'NOUN'), (u'of', u'ADP'), (u'the', u'DET'), (u'Congo', 'NOUN'), (u'is', u'VERB'), (u'13.5', 'NOUN'), (u'million', u'NUM'), (u',', u'.'), (u'divided', u'VERB'), (u'into', u'ADP'), (u'at', u'ADP'), (u'least', u'ADJ'), (u'seven', u'NUM'), (u'major', u'ADJ'), (u'``', u'.'), (u'culture', 'NOUN'), (u'clusters', 'NOUN'), (u"''", u'.'), (u'and', u'CONJ'), (u'innumerable', 'NOUN'), (u'tribes', 'NOUN'), (u'speaking', u'VERB'), (u'400', u'NUM'), (u'separate', u'ADJ'), (u'dialects', 'NOUN'), (u'.', u'.')]
Accuracy score on test sents = 0.920861158178
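
The backoff chain used above (bigram, then unigram, then a default tag) can be sketched in plain Python 3 as follows. This is a simplified illustration on made-up toy data with hypothetical names (`train`, `tag_with_backoff`). NLTK's own taggers differ in detail.

```python
from collections import Counter, defaultdict

START = '<s>'  # marks the sentence-initial context

def most_frequent(counts):
    """Reduce each Counter to its single most frequent tag."""
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def train(tagged_sents):
    """Return (unigram, bigram) lookup tables built from tagged sentences."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    return most_frequent(uni), most_frequent(bi)

def tag_with_backoff(unigram, bigram, words, default='NOUN'):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev = [], START
    for word in words:
        tag = bigram.get((prev, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev = tag
    return tagged

# Toy training data, made up for illustration.
train_sents = [
    [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')],
    [('the', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')],
]
unigram, bigram = train(train_sents)
print(tag_with_backoff(unigram, bigram, ['the', 'bird', 'sleeps']))
```

Because every token now receives some tag, the context for the next token is never `None`, so the more specific bigram table can be used again right after an unknown word.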


In [47]:
# Storing taggers
from pickle import dump
with open('/Users/bowang/Documents/CS918_lab/taggers-parsers-seminar/tagger.pkl', 'wb') as output:
    dump(bigram_tagger, output)
print "Successfully stored tagger!"

# Loading taggers
from pickle import load
with open('/Users/bowang/Documents/CS918_lab/taggers-parsers-seminar/tagger.pkl', 'rb') as infile:
    tagger = load(infile)
print "Tagger loaded!"
print tagger.tag(unseen_sent)

Successfully stored tagger!
Tagger loaded!
[(u'The', u'DET'), (u'population', 'NOUN'), (u'of', u'ADP'), (u'the', u'DET'), (u'Congo', 'NOUN'), (u'is', u'VERB'), (u'13.5', 'NOUN'), (u'million', u'NUM'), (u',', u'.'), (u'divided', u'VERB'), (u'into', u'ADP'), (u'at', u'ADP'), (u'least', u'ADJ'), (u'seven', u'NUM'), (u'major', u'ADJ'), (u'``', u'.'), (u'culture', 'NOUN'), (u'clusters', 'NOUN'), (u"''", u'.'), (u'and', u'CONJ'), (u'innumerable', 'NOUN'), (u'tribes', 'NOUN'), (u'speaking', u'VERB'), (u'400', u'NUM'), (u'separate', u'ADJ'), (u'dialects', 'NOUN'), (u'.', u'.')]
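
The same store/load round trip can be written without a machine-specific path. The Python 3 sketch below uses a temporary directory and a plain dictionary as a stand-in for a trained tagger; `model`, `model_dir` and `model_file` are illustrative names. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger: any picklable object is handled the same way.
model = {('DET', 'dog'): 'NOUN', ('NOUN', 'barks'): 'VERB'}

# Build the path from a directory variable instead of hard-coding it.
model_dir = tempfile.mkdtemp()
model_file = os.path.join(model_dir, 'tagger.pkl')

# 'with' closes the file even if dump() or load() raises.
with open(model_file, 'wb') as f:
    pickle.dump(model, f)

with open(model_file, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```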


### Stanford Tagger

* A Java implementation of the log-linear part-of-speech taggers described [here](http://nlp.stanford.edu/~manning/papers/emnlp2000.pdf) and [here](http://nlp.stanford.edu/~manning/papers/tagging.pdf).
* The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model.
* Also has an [English Twitter POS tagger model](https://gate.ac.uk/wiki/twitter-postagger.html) written by Leon Derczynski and others at GATE.
* The English taggers use the [Penn Treebank tag set](http://www.cis.upenn.edu/~treebank/).

* e.g.


```
#!/bin/sh
#
# usage: ./stanford-postagger.sh textFile outputFile
# (the model is fixed below; $1 is the input text file, $2 the output file)

java -mx300m -cp 'stanford-postagger.jar:' edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/gate-EN-twitter.model -sentenceDelimiter newline -tokenize false -tagSeparator \* -textFile "$1" > "$2"
```


* [Read more..](http://nlp.stanford.edu/software/pos-tagger-faq.shtml)
* Also have a look at `README.txt`.
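
The command above sets `-tagSeparator \*`. Assuming the output file then holds one sentence per line with tokens written as `word*TAG`, it could be read back into (token, tag) pairs roughly as follows (Python 3). `parse_tagged_line` is a hypothetical helper and the sample line is made up, not real tagger output.

```python
def parse_tagged_line(line, sep='*'):
    """Turn 'word*TAG word*TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split at the last separator so tokens that contain '*' survive.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs

sample = 'I*PRP love*VBP it*PRP !*.'  # made-up line for illustration
print(parse_tagged_line(sample))
```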

In [2]:
## Stanford Tagger
# Using Stanford tagger in NLTK

#export CLASSPATH=/Users/bowang/Tools/stanford_nlp/stanford-postagger-full-2015-04-20/stanford-postagger.jar
#export STANFORD_MODELS=/Users/bowang/Tools/stanford_nlp/stanford-postagger-full-2015-04-20/models

import os
os.environ.get('CLASSPATH')

from nltk.tag import StanfordPOSTagger
st = StanfordPOSTagger('english-left3words-distsim.tagger')
st.tag('What is the airspeed of an unladen swallow ?'.split()) 

[(u'What', u'WP'),
 (u'is', u'VBZ'),
 (u'the', u'DT'),
 (u'airspeed', u'NN'),
 (u'of', u'IN'),
 (u'an', u'DT'),
 (u'unladen', u'JJ'),
 (u'swallow', u'VB'),
 (u'?', u'.')]

### POS Tagging using spaCy 

In [1]:
## spaCy

from spacy.en import English

tagger = English(parser=False, tagger=True, entity=False)

# pos/pos_ : A coarse-grained, less detailed tag that represents the word-class of the token.
# tag/tag_ : A fine-grained, more detailed tag that represents the word-class and some basic 
# morphological information for the token. These tags are primarily designed to be good features
# for subsequent models, particularly the syntactic parser. 
# They are language and treebank dependent.

# A property with an underscore at the end returns the string representation,
# while a property without the underscore returns an index (int) into spaCy's vocabulary.

def print_pos(token):
    return (token.tag_)

def pos_tags(sentence):
    sentence = unicode(sentence, "utf-8")
    tokens = tagger(sentence)
    tags = []
    for tok in tokens:
        tags.append((tok,print_pos(tok)))
    return tags
print pos_tags('What is the airspeed of an unladen swallow ?')

[(What , u'WP'), (is , u'VBZ'), (the , u'DT'), (airspeed , u'NN'), (of , u'IN'), (an , u'DT'), (unladen , u'JJ'), (swallow , u'NN'), (?, u'.')]
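
The relationship between fine-grained `tag_` values and coarse `pos_` values described in the comments above can be pictured as a many-to-one mapping. The Python 3 sketch below uses a small hand-written subset of Penn Treebank tags. The table is purely illustrative and is not spaCy's own mapping.

```python
# Hand-picked subset: several fine-grained tags collapse into one coarse class.
FINE_TO_COARSE = {
    'NN': 'NOUN', 'NNS': 'NOUN',
    'VB': 'VERB', 'VBZ': 'VERB', 'VBD': 'VERB',
    'JJ': 'ADJ', 'DT': 'DET', 'IN': 'ADP', 'WP': 'PRON', '.': 'PUNCT',
}

def coarsen(tagged, unknown='X'):
    """Replace each fine-grained tag by its coarse class ('X' if not listed)."""
    return [(word, FINE_TO_COARSE.get(tag, unknown)) for word, tag in tagged]

fine = [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'),
        ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'NN'),
        ('?', '.')]
print(coarsen(fine))
```

The coarse tags lose information such as tense and number, which is why the fine-grained tags tend to be better features for downstream models.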


In [11]:
multiSentence = "There is an art, it says, or rather, a knack to flying. " \
                 "The knack lies in learning how to throw yourself at the ground and miss. " \
                 "In the beginning the Universe was created. This has made a lot of people "\
                 "very angry and been widely regarded as a bad move."

parser2 = English()
parsedData = parser2(unicode(multiSentence, "utf-8"))
sents = []
# the "sents" property returns spans of each sentence
# spans have indices into the original string
# where each index value represents a token
for span in parsedData.sents:
    # go from the start to the end of each span, returning each token in the sentence
    sent = [parsedData[i] for i in range(span.start, span.end)]
    for token in sent:
        print(token.orth_, token.pos_)
    break

(u'There', u'ADV')
(u'is', u'VERB')
(u'an', u'DET')
(u'art', u'NOUN')
(u',', u'PUNCT')
(u'it', u'NOUN')
(u'says', u'VERB')
(u',', u'PUNCT')
(u'or', u'CONJ')
(u'rather', u'ADV')
(u',', u'PUNCT')
(u'a', u'DET')
(u'knack', u'NOUN')
(u'to', u'ADP')
(u'flying', u'NOUN')
(u'.', u'PUNCT')


In [13]:
messyData = "lol that is rly funny :) This is gr8 i rate it 8/8!!!"
parsedData = parser2(unicode(messyData, "utf-8"))
for token in parsedData:
    print(token.orth_, token.pos_, token.lemma_) # a lemma is the canonical (dictionary) form of a word

## It does reasonably well, though it fails on the tokens "gr8" and "lol":
## "gr8" should be an adjective rather than a verb,
## and "lol" is more like an interjection than a noun.

(u'lol', u'NOUN', u'lol')
(u'that', u'ADJ', u'that')
(u'is', u'VERB', u'be')
(u'rly', u'ADV', u'rly')
(u'funny', u'ADJ', u'funny')
(u':)', u'PUNCT', u':)')
(u'This', u'DET', u'this')
(u'is', u'VERB', u'be')
(u'gr8', u'VERB', u'gr8')
(u'i', u'NOUN', u'i')
(u'rate', u'VERB', u'rate')
(u'it', u'NOUN', u'it')
(u'8/8', u'NUM', u'8/8')
(u'!', u'PUNCT', u'!')
(u'!', u'PUNCT', u'!')
(u'!', u'PUNCT', u'!')


### CMU POS tagger for tweets

* Java-based tokenizer and part-of-speech tagger for tweets.
* Trained on tweets that were manually annotated with POS tags.
* https://github.com/brendano/ark-tweet-nlp/
* Its POS tagset can be found in its [original paper](http://www.cs.cmu.edu/~ark/TweetNLP/gimpel+etal.acl11.pdf).
     * Including some Twitter/online-specific tags, such as '#' for hashtag, '@' for at-mention and 'E' for emoticon.
     
* A python wrapper can be found [here](https://github.com/ianozsvald/ark-tweet-nlp-python).

In [None]:
"""

Options:
  --model <Filename>        Specify model filename. (Else use built-in.)
  --just-tokenize           Only run the tokenizer; no POS tags.
  --quiet                   Quiet: no output
  --input-format <Format>   Default: auto
                            Options: json, text, conll
  --output-format <Format>  Default: automatically decide from input format.
                            Options: pretsv, conll
  --input-field NUM         Default: 1
                            Which tab-separated field contains the input
                            (1-indexed, like unix 'cut')
                            Only for {json, text} input formats.
  --word-clusters <File>    Alternate word clusters file (see FeatureExtractor)
  --no-confidence           Don't output confidence probabilities
  --decoder <Decoder>       Change the decoding algorithm (default: greedy)

e.g. $ ./runTagger.sh --no-confidence --output-format conll examples/example_tweets.txt > output.txt

To see more info use $ ./runTagger.sh --help

"""

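With `--no-confidence --output-format conll`, the output is assumed here to contain one token per line as `token<TAB>tag`, with a blank line between tweets. Under that assumption, a small Python 3 reader might look like this. `read_conll` is a hypothetical helper and the sample text is made up.

```python
def read_conll(text):
    """Yield one list of (token, tag) pairs per blank-line-separated tweet."""
    tweet = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current tweet, if there is one.
            if tweet:
                yield tweet
                tweet = []
            continue
        fields = line.split('\t')
        tweet.append((fields[0], fields[1]))
    if tweet:
        yield tweet  # last tweet when the text has no trailing blank line

# Made-up sample using Twitter-specific tags: '@' at-mention, '#' hashtag, 'E' emoticon.
sample = "lol\t!\nthat\tD\n\n@user\t@\n#nlp\t#\n:)\tE\n"
tweets = list(read_conll(sample))
print(tweets)
```
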
## Parsing

* A natural language parser is a program (i.e. a formal and computational method) that works out the grammatical structure of sentences.
    * For instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.
* The result is a parse tree showing how the parts of the sentence relate to each other syntactically; it may also contain semantic and other information.
    * Thus the patterns of well-formedness and ill-formedness in a sequence of words can be understood with respect to the phrase structure and dependencies.
* A key motivation is natural language understanding. For example:
    * How much more of the meaning of a text can we access when we can reliably recognize the linguistic structures it contains? 
    * Can a program "understand" it enough to be able to answer simple questions about "what happened" or "who did what to whom"?
<br><br>
* Has a wide range of applications.
    * For example, we would expect the natural language questions submitted to a question-answering system to undergo parsing as an initial step.

### Dependency parsing

* Parse trees are usually constructed based on either the constituency relation of constituency grammars (phrase structure grammars) or the dependency relation of dependency grammars.
    * Phrase structure grammar is concerned with how words and sequences of words combine to form constituents.
    * Dependency grammar focusses instead on how words relate to other words.
* Dependency is a binary asymmetric relation that holds between a **head** and its **dependents**.
    * The head of a sentence is usually taken to be the tensed verb, and every other word is either dependent on the sentence head, or connects to it through a path of dependencies.
    


<center><img src="images/depgraph0.png" width="600"></center>

* The arcs in the image above are labeled with the grammatical function that holds between a dependent and its head. For example, *I* is the `SBJ` (subject) of *shot* (which is the head of the whole sentence), and *in* is an `NMOD` (noun modifier) of *elephant*.
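
A dependency analysis like this can be stored as a list of (head, label, dependent) arcs. The Python 3 sketch below assumes the pictured sentence is "I shot an elephant in my pajamas". Only `SBJ` and `NMOD` are taken from the text above; the remaining labels are assumptions made for illustration. It checks that every word reaches the sentence head through a path of dependencies.

```python
# (head, label, dependent) arcs. Labels other than SBJ and NMOD are assumed.
ARCS = [
    ('shot', 'SBJ', 'I'),
    ('shot', 'OBJ', 'elephant'),
    ('elephant', 'DET', 'an'),
    ('elephant', 'NMOD', 'in'),
    ('in', 'PMOD', 'pajamas'),
    ('pajamas', 'DET', 'my'),
]

def head_of(arcs):
    """Map each dependent to its single head (words are unique in this sentence)."""
    return {dep: head for head, _, dep in arcs}

def find_root(arcs):
    """The root is the one head that is not itself a dependent."""
    heads = head_of(arcs)
    roots = {head for head, _, _ in arcs if head not in heads}
    if len(roots) != 1:
        raise ValueError('expected exactly one root, found %d' % len(roots))
    return roots.pop()

def path_to_root(word, arcs):
    """Follow head links upwards from a word until the root is reached."""
    heads = head_of(arcs)
    path = [word]
    while path[-1] in heads:
        path.append(heads[path[-1]])
    return path

print(find_root(ARCS))
print(path_to_root('my', ARCS))
```

Using words as dictionary keys only works because no word repeats here; a real implementation would key on token positions instead.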

## Dependency Parser

* [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml)
* spaCy
* [CMU TweeboParser](http://www.cs.cmu.edu/~ark/TweetNLP/#tweeboparser_tweebank)
* [MaltParser](http://www.maltparser.org/)
<br><br>
* http://aclweb.org/aclwiki/index.php?title=Parsing_(State_of_the_art)

### Stanford Parser

* A Java implementation of probabilistic natural language parsers.
* Has a GUI for viewing the phrase structure tree output of the parser.
* The parser provides [Universal Dependencies and Stanford Dependencies](http://nlp.stanford.edu/software/stanford-dependencies.shtml) output as well as phrase structure trees.
* Can be used in NLTK.

* e.g.


```
java -mx2g -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -tokenized -tagSeparator § -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -escaper edu.stanford.nlp.process.PTBEscapingProcessor -outputFormat typedDependencies edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz sample-input.tagged.txt > sample-input.conll2007.txt
```


* Use `-sentences` to specify the sentence delimiter. For example, if you want to give the parser one sentence per line, include the option `-sentences newline` in your command line.
* If your input text has already been tokenised, you can give the option `-tokenized`; the parser will then assume white-space separated tokens and use your tokenization as is.
    *  A common occurrence is that your text is already correctly tokenized but does not escape characters the way the Penn Treebank does. In this case you can also add the flag `-escaper edu.stanford.nlp.process.PTBEscapingProcessor`.
* You can also give the parser already POS-tagged input by adding the relevant options: `-sentences`, `-tokenized`, `-tokenizerFactory`, `-tokenizerMethod`, and `-tagSeparator`.
    * Make sure your input is correctly tokenised and uses the correct tag names.


* `-outputFormat` selects the style of the output. 
    * *penn, oneline, rootSymbolOnly, words, wordsAndTags, dependencies, typedDependencies, typedDependenciesCollapsed, latexTree, xmlTree, collocations, semanticGraph, conllStyleDependencies, conll2007*
<br><br>
* You can read more about the Stanford Parser and its available options/flags [here](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/parser/lexparser/LexicalizedParser.html) and [here](http://nlp.stanford.edu/software/parser-faq.shtml).
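
To feed the parser pre-tagged input as in the command above (`-sentences newline -tokenized -tagSeparator §`), each sentence goes on one line with tokens written as `word§TAG`. A Python 3 sketch of producing that text from (token, tag) pairs follows. `format_tagged_sentences` is a hypothetical helper, and tokens are assumed to contain no whitespace.

```python
def format_tagged_sentences(tagged_sents, sep='§'):
    """Render each sentence as one line of space-separated 'word<sep>TAG' items."""
    lines = []
    for sent in tagged_sents:
        lines.append(' '.join(word + sep + tag for word, tag in sent))
    return '\n'.join(lines) + '\n'

# Made-up example sentences with Penn Treebank tags.
sents = [
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN'), ('.', '.')],
    [('It', 'PRP'), ('works', 'VBZ'), ('.', '.')],
]
text = format_tagged_sentences(sents)
print(text)
```

The resulting string can be written to a file such as `sample-input.tagged.txt`. Choosing a separator that never occurs inside a token (here `§`) avoids ambiguous splits.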

In [4]:
## Stanford Parser
# Using Stanford parser in NLTK
import os
from nltk.parse.stanford import StanfordParser

parser_path = '/Users/bowang/Tools/stanford_nlp/stanford-parser-full-2015-04-20/stanford-parser.jar'
model_path = '/Users/bowang/Tools/stanford_nlp/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar'
os.environ['STANFORD_PARSER'] = parser_path
os.environ['STANFORD_MODELS'] = model_path

parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
parsed = parser.raw_parse_sents(("this is the english parser test", "the parser is from stanford parser"))
for myListiterator in parsed:
    for t in myListiterator:
        print t

# _OUTPUT_FORMAT = 'penn'

(ROOT
  (S
    (NP (DT this))
    (VP (VBZ is) (NP (DT the) (JJ english) (NN parser) (NN test)))))
(ROOT
  (S
    (NP (DT the) (NN parser))
    (VP (VBZ is) (PP (IN from) (NP (JJ stanford) (NN parser))))))


In [12]:
import os
from nltk.parse.stanford import StanfordDependencyParser

dep_parser=StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
sent = "The quick brown fox jumps over the lazy dog."
print [list(parse.triples()) for parse in dep_parser.raw_parse(sent)]
print '\n'
print [parse.tree() for parse in dep_parser.raw_parse(sent)]

# _OUTPUT_FORMAT = 'conll2007'

[[((u'jumps', u'VBZ'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det', (u'The', u'DT')), ((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'), u'amod', (u'brown', u'JJ')), ((u'jumps', u'VBZ'), u'nmod', (u'dog', u'NN')), ((u'dog', u'NN'), u'case', (u'over', u'IN')), ((u'dog', u'NN'), u'det', (u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ'))]]


[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])])]
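
The two outputs above carry the same structure: grouping the triples by head and nesting them recursively gives the tree. A Python 3 sketch using plain tuples and lists rather than NLTK's `Tree` class, with the triples copied from the output above. `build_tree` is a hypothetical helper.

```python
from collections import defaultdict

# ((head, tag), relation, (dependent, tag)) triples from the output above.
triples = [
    (('jumps', 'VBZ'), 'nsubj', ('fox', 'NN')),
    (('fox', 'NN'), 'det', ('The', 'DT')),
    (('fox', 'NN'), 'amod', ('quick', 'JJ')),
    (('fox', 'NN'), 'amod', ('brown', 'JJ')),
    (('jumps', 'VBZ'), 'nmod', ('dog', 'NN')),
    (('dog', 'NN'), 'case', ('over', 'IN')),
    (('dog', 'NN'), 'det', ('the', 'DT')),
    (('dog', 'NN'), 'amod', ('lazy', 'JJ')),
]

def build_tree(triples):
    """Nest dependents under their heads: leaves are strings, inner nodes (word, children)."""
    children = defaultdict(list)
    dependents = set()
    for (head, _), _, (dep, _) in triples:
        children[head].append(dep)
        dependents.add(dep)
    # The root is the head that never appears as a dependent.
    roots = [head for head in children if head not in dependents]

    def subtree(word):
        if word not in children:
            return word
        return (word, [subtree(child) for child in children[word]])

    return subtree(roots[0])

tree = build_tree(triples)
print(tree)
```

As in the earlier outputs, words serve as keys here, which only works when no word form repeats in the sentence (`The` and `the` differ in case).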


### Parsing using spaCy

In [14]:
from spacy.en import English

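# NOTE: the parse printed below looks unreliable (e.g. 'unladen' is chosen as ROOT).
# A likely cause is that the tagger is disabled here, since the parser may depend on POS tags.
parser = English(parser=True, tagger=False, entity=False)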
def print_pos(token):
    # returns (original token, dependency tag, head word, left dependents, right dependents)
    return (token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

def pos_tags(sentence):
    sentence = unicode(sentence, "utf-8")
    tokens = parser(sentence)
    tags = []
    for tok in tokens:
        tags.append(print_pos(tok))
    return tags
sent = 'What is the airspeed of an unladen swallow ?'
print sent
pos_tags(sent)

What is the airspeed of an unladen swallow ?


[(u'What', u'acomp', u'is', [], []),
 (u'is', u'dep', u'an', [u'What'], [u'the', u'airspeed']),
 (u'the', u'det', u'is', [], []),
 (u'airspeed', u'nmod', u'is', [], [u'of']),
 (u'of', u'prep', u'airspeed', [], []),
 (u'an', u'meta', u'unladen', [u'is'], []),
 (u'unladen', u'ROOT', u'unladen', [u'an'], [u'swallow', u'?']),
 (u'swallow', u'dep', u'unladen', [], []),
 (u'?', u'nmod', u'unladen', [], [])]

### CMU TweeboParser

* A dependency parser for English tweets, developed by [Kong et al. (2014)](http://www.cs.cmu.edu/~nasmith/papers/kong+schneider+swayamdipta+bhatia+dyer+smith.emnlp14.pdf).
* Trained on a subset of Tweebank, a new labeled corpus of 929 tweets (12,318 tokens) drawn from the POS-tagged tweet corpus of [Owoputi et al. (2013)](http://www.cs.cmu.edu/~ark/TweetNLP/owoputi+etal.naacl13.pdf).
* Pro:
    * Has some interesting and useful features (especially for parsing tweets), such as support for multi-word annotations and multiple roots.
* Con:
    * Instead of the popular Penn Treebank annotation, it uses a simpler annotation scheme and outputs much less dependency type information.
* https://github.com/ikekonglp/TweeboParser

e.g.

<center><img src="images/tweeboparser1.png" width="800"></center>



<center><img src="images/tweeboparser2.png" width="400"></center>

## Resources:

* [A Good Part-of-Speech Tagger in about 200 Lines of Python](https://spacy.io/blog/part-of-speech-POS-tagger-in-python)
* [Stanford Typed Dependencies Manual](http://nlp.stanford.edu/software/dependencies_manual.pdf)
* [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/)
* The NLTK book can be found [here](http://www.nltk.org/book/)
* spaCy Doc: https://spacy.io/docs
* Source code for [nltk.tag.stanford](http://www.nltk.org/_modules/nltk/tag/stanford.html) and [nltk.parse.stanford](http://www.nltk.org/_modules/nltk/parse/stanford.html)

In [1]:
print "Thank you!"

Thank you!
