Part-of-Speech (POS) Tagging
-----

![](https://nicholasdale.files.wordpress.com/2015/10/parts-of-speech.jpg)

Overview
-----

- English tokens can be put into groups (a.k.a. parts of speech)
- "Hard" classification: each token belongs to exactly one group
- TextBlob is fast and good enough for prototyping
- nltk is slower but trainable
- spaCy is fast and built for applied work
- Penn Treebank tags are the usual default labels, and they are awful to read
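
To make the "hard" classification point concrete, here is a minimal sketch. The scores are made-up numbers, not output from any of the taggers in this notebook, and `hard_tag` is a hypothetical helper name. A hard classifier keeps only the single best tag and throws the rest away.

```python
# Hypothetical scores a tagger might assign to the token "back"
# (made-up numbers, for illustration only).
scores = {"RB": 0.55, "NN": 0.25, "VB": 0.15, "JJ": 0.05}


def hard_tag(tag_scores):
    """Return the single highest-scoring tag and discard the others."""
    return max(tag_scores, key=tag_scores.get)


print(hard_tag(scores))
```

A "soft" tagger would instead return the whole `scores` dictionary and let the caller decide.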

----
TextBlob
----

In [1]:
reset -fs

In [2]:
from textblob import TextBlob

In [3]:
print(*TextBlob("I'll be back.").tags, sep="\n")

('I', 'PRP')
("'ll", 'MD')
('be', 'VB')
('back', 'RB')


In [4]:
TextBlob('He is a tall skinny guy with a long, sad, mean-looking kisser, and a mournful voice.').tags

[('He', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('tall', 'JJ'),
 ('skinny', 'NN'),
 ('guy', 'NN'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('long', 'JJ'),
 ('sad', 'JJ'),
 ('mean-looking', 'JJ'),
 ('kisser', 'NN'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('mournful', 'JJ'),
 ('voice', 'NN')]

In [5]:
tweet = "If only Bradley's arm was longer. Best photo ever. #oscars"
TextBlob(tweet).tags

[('If', 'IN'),
 ('only', 'RB'),
 ('Bradley', 'NNP'),
 ("'s", 'POS'),
 ('arm', 'NN'),
 ('was', 'VBD'),
 ('longer', 'RBR'),
 ('Best', 'NNP'),
 ('photo', 'NN'),
 ('ever', 'RB'),
 ('oscars', 'NNS')]

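What happened to the hashtag? TextBlob's `.tags` leaves out punctuation, so the `#` and the periods are gone. Compare with `nltk` on the same tweet: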

In [34]:
from nltk import tokenize, pos_tag  # needed here; this cell runs before the import below
pos_tag(tokenize.word_tokenize(tweet))

[('If', 'IN'),
 ('only', 'RB'),
 ('Bradley', 'NNP'),
 ("'s", 'POS'),
 ('arm', 'NN'),
 ('was', 'VBD'),
 ('longer', 'RBR'),
 ('.', '.'),
 ('Best', 'JJS'),
 ('photo', 'NN'),
 ('ever', 'RB'),
 ('.', '.'),
 ('#', '#'),
 ('oscars', 'NNS')]
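
The two outputs above do not contain the same tokens. As a small sketch, with both tag lists copied by hand from the outputs above (so nothing needs to be installed), a comprehension can list the tokens that appear in the nltk result but not in the TextBlob result. The variable names are my own.

```python
# Tag lists copied by hand from the two outputs above.
textblob_tags = [
    ("If", "IN"), ("only", "RB"), ("Bradley", "NNP"), ("'s", "POS"),
    ("arm", "NN"), ("was", "VBD"), ("longer", "RBR"), ("Best", "NNP"),
    ("photo", "NN"), ("ever", "RB"), ("oscars", "NNS"),
]
nltk_tags = [
    ("If", "IN"), ("only", "RB"), ("Bradley", "NNP"), ("'s", "POS"),
    ("arm", "NN"), ("was", "VBD"), ("longer", "RBR"), (".", "."),
    ("Best", "JJS"), ("photo", "NN"), ("ever", "RB"), (".", "."),
    ("#", "#"), ("oscars", "NNS"),
]

# Tokens that nltk kept but that are absent from the TextBlob result.
textblob_tokens = {token for token, _ in textblob_tags}
dropped = [token for token, _ in nltk_tags if token not in textblob_tokens]

print(dropped)  # ['.', '.', '#']
```

Only punctuation is missing, which matters if hashtags or sentence boundaries are important for your task.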

TextBlob
------

An object-oriented wrapper around the most commonly used parts of `nltk`.

Good for quick-and-dirty prototyping.

-----
Check out nltk
-----

In [35]:
from nltk import tokenize, pos_tag

In [42]:
tokens = tokenize.word_tokenize(tweet)  # tokenize the tweet defined earlier

In [43]:
pos_tag(tokens)

[('If', 'IN'),
 ('only', 'RB'),
 ('Bradley', 'NNP'),
 ("'s", 'POS'),
 ('arm', 'NN'),
 ('was', 'VBD'),
 ('longer', 'RBR'),
 ('.', '.'),
 ('Best', 'JJS'),
 ('photo', 'NN'),
 ('ever', 'RB'),
 ('.', '.'),
 ('#', '#'),
 ('oscars', 'NNS')]

In [None]:
('longer', 'RBR') # Adverb, comparative

---
Check out spaCy
----

spaCy has many language models, including models for languages other than English.

I like `en_core_web_sm`

https://spacy.io/models/en#en_core_web_sm

In [45]:
import spacy

In [47]:
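try:
    nlp = spacy.load('en_core_web_sm') # Get "standard" English model
except OSError:
    # Model is not installed yet: download it with the current interpreter, then load it
    import subprocess, sys
    subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=True)
    nlp = spacy.load('en_core_web_sm')

tokens = nlp(tweet)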

In [48]:
for token in tokens:
    print(token, token.tag_, sep="\t|\t")

If	|	IN
only	|	RB
Bradley	|	NNP
's	|	POS
arm	|	NN
was	|	VBD
longer	|	JJR
.	|	.
Best	|	JJS
photo	|	NN
ever	|	RB
.	|	.
#	|	NN
oscars	|	NNS


In [None]:
# spaCy tends to perform better, especially on web-based text
('longer', 'JJR') # Adjective, comparative
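
Since nltk and spaCy produced the same 14 tokens for this tweet, their tags can be lined up position by position. The sketch below copies both results by hand from the outputs above and prints only the tokens where the two taggers disagree. The variable names are my own.

```python
# Tag lists copied by hand from the nltk and spaCy outputs above.
nltk_tags = [
    ("If", "IN"), ("only", "RB"), ("Bradley", "NNP"), ("'s", "POS"),
    ("arm", "NN"), ("was", "VBD"), ("longer", "RBR"), (".", "."),
    ("Best", "JJS"), ("photo", "NN"), ("ever", "RB"), (".", "."),
    ("#", "#"), ("oscars", "NNS"),
]
spacy_tags = [
    ("If", "IN"), ("only", "RB"), ("Bradley", "NNP"), ("'s", "POS"),
    ("arm", "NN"), ("was", "VBD"), ("longer", "JJR"), (".", "."),
    ("Best", "JJS"), ("photo", "NN"), ("ever", "RB"), (".", "."),
    ("#", "NN"), ("oscars", "NNS"),
]

# zip() only aligns correctly because both tokenizations are identical here.
disagreements = [
    (token, nltk_tag, spacy_tag)
    for (token, nltk_tag), (_, spacy_tag) in zip(nltk_tags, spacy_tags)
    if nltk_tag != spacy_tag
]

for token, nltk_tag, spacy_tag in disagreements:
    print(f"{token!r}: nltk={nltk_tag} spaCy={spacy_tag}")
```

On this tweet the taggers differ on exactly two tokens: `longer` and `#`.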

[Tag sentence demo](http://spacy.io/displacy/)

-----
What is the difference between nltk and spaCy?
-----

There's a philosophical difference between spaCy and NLTK. 

spaCy is written to help you get things done. It's minimal and opinionated: it aims to give you exactly one way to do something, the right way. spaCy has an accurate part-of-speech tagger and dependency parser. If you want something that has good defaults, spaCy is the way to go.

In contrast, NLTK was created to support education. Most of what's there is for demo purposes, to help students explore ideas. But if you have your own data that you want to train on, NLTK is probably better. 
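
To show what "training on your own data" means at its simplest, here is a sketch of a most-frequent-tag baseline in plain Python. It is not NLTK's implementation (NLTK ships its own trainable taggers, such as `UnigramTagger`). The training sentences are hypothetical hand-labelled data, and `train_unigram_tagger` is a name I made up.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled training set (hypothetical data, for illustration only).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]


def train_unigram_tagger(tagged_sents, default_tag="NN"):
    """Learn each word's most frequent tag; unseen words get default_tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def tag(tokens):
        return [(token, best.get(token.lower(), default_tag)) for token in tokens]

    return tag


tag = train_unigram_tagger(train_sents)
print(tag(["The", "cat", "barks", "loudly"]))
# [('The', 'DT'), ('cat', 'NN'), ('barks', 'VBZ'), ('loudly', 'NN')]
```

The unseen word `loudly` falls back to the default tag, which is wrong here. Real trainable taggers add context and backoff to do better.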

---
Deep Dive into Penn Treebank POS tags
----

Penn Tags: somewhat popular but awful (not human readable)

Penn Treebank POS tags:

- [Online](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- [Local](penn_treebank_pos_tags.ipynb)

In [16]:
import nltk

In [18]:
try:
    tags = nltk.data.load('help/tagsets/upenn_tagset.pickle')
except LookupError:
    nltk.download('tagsets')
    tags = nltk.data.load('help/tagsets/upenn_tagset.pickle')

[nltk_data] Downloading package tagsets to /Users/brian/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


In [19]:
tags

{'LS': ('list item marker',
  'A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005 SP-44007 Second Third Three Two * a b c d first five four one six three two '),
 'TO': ('"to" as preposition or infinitive marker', 'to '),
 'VBN': ('verb, past participle',
  'multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated reunifed factored condensed sheared unsettled primed dubbed desired ... '),
 "''": ('closing quotation mark', "' '' "),
 'WP': ('WH-pronoun',
  'that what whatever whatsoever which who whom whosoever '),
 'UH': ('interjection',
  'Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly man baby diddle hush sonuvabitch ... '),
 'VBG': ('verb, present participle or gerund',
  "telegraphing stirring focusing angering judging stalling lactating hankerin' alleging veering capping approaching traveling besieging encrypting interrupting era

Wow! That is a lot of tags

In [20]:
pen_label = 'NN' 

In [21]:
nltk.help.upenn_tagset(pen_label)

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


In [22]:
tags[pen_label][0]

'noun, common, singular or mass'

Let's grab just the most important part: the text before the first comma.

In [23]:
tags[pen_label][0].split(',')[0]

'noun'

In [24]:
# Keep only the text before the first comma of each description
tags_simple = {tag: description.split(',')[0]
               for tag, (description, _examples) in tags.items()}

In [25]:
tags_simple

{'LS': 'list item marker',
 'TO': '"to" as preposition or infinitive marker',
 'VBN': 'verb',
 "''": 'closing quotation mark',
 'WP': 'WH-pronoun',
 'UH': 'interjection',
 'VBG': 'verb',
 'JJ': 'adjective or numeral',
 'VBZ': 'verb',
 '--': 'dash',
 'VBP': 'verb',
 'NN': 'noun',
 'DT': 'determiner',
 'PRP': 'pronoun',
 ':': 'colon or ellipsis',
 'WP$': 'WH-pronoun',
 'NNPS': 'noun',
 'PRP$': 'pronoun',
 'WDT': 'WH-determiner',
 '(': 'opening parenthesis',
 ')': 'closing parenthesis',
 '.': 'sentence terminator',
 ',': 'comma',
 '``': 'opening quotation mark',
 '$': 'dollar',
 'RB': 'adverb',
 'RBR': 'adverb',
 'RBS': 'adverb',
 'VBD': 'verb',
 'IN': 'preposition or conjunction',
 'FW': 'foreign word',
 'RP': 'particle',
 'JJR': 'adjective',
 'JJS': 'adjective',
 'PDT': 'pre-determiner',
 'MD': 'modal auxiliary',
 'VB': 'verb',
 'WRB': 'Wh-adverb',
 'NNP': 'noun',
 'EX': 'existential there',
 'NNS': 'noun',
 'SYM': 'symbol',
 'CC': 'conjunction',
 'CD': 'numeral',
 'POS': 'genitive mark
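
Simplifying the descriptions collapses several Penn tags into one label. The sketch below inverts a hand-copied subset of the `tags_simple` output above to show which tags end up sharing a description. The names `tags_simple_subset` and `groups` are my own.

```python
from collections import defaultdict

# Subset of the simplified mapping, copied by hand from the output above.
tags_simple_subset = {
    "VBN": "verb", "VBG": "verb", "JJ": "adjective or numeral",
    "VBZ": "verb", "VBP": "verb", "NN": "noun", "NNPS": "noun",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb", "VBD": "verb",
    "JJR": "adjective", "JJS": "adjective", "VB": "verb",
    "NNP": "noun", "NNS": "noun",
}

# Invert the mapping: simple description -> list of Penn tags.
groups = defaultdict(list)
for tag, description in tags_simple_subset.items():
    groups[description].append(tag)

for description, penn_tags in sorted(groups.items()):
    print(f"{description}: {sorted(penn_tags)}")
```

Six verb tags become a plain "verb", so tense and number are lost. That is fine for readability, but keep the original tags if those distinctions matter downstream.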

In [26]:
# Brian's fav
tags['EX']

('existential there', 'there ')

'existential there' refers to the existence or presence of something. 

Examples in English include the sentences

> "There is a God"

> "There are boys in the yard".

<center><img src="https://motseiblacksnow.files.wordpress.com/2011/02/existential-gps-chicken.jpg?w=470" width="700"/></center>

Okay, let's replace the cryptic Penn tags with the readable descriptions.

In [27]:
tagged = TextBlob("I'll be back.").tags

In [30]:
for token, tag in tagged:
    print(token, '\t', tags_simple[tag])

I 	 pronoun
'll 	 modal auxiliary
be 	 verb
back 	 adverb
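
The lookup above raises a `KeyError` if a tagger emits a tag that is missing from the dictionary. A slightly more defensive sketch uses `dict.get` and falls back to the raw tag. The small dictionary is copied from the output above, the `describe` helper is hypothetical, and `"XX"` stands in for any tag the dictionary does not know.

```python
# Small subset of the simplified descriptions, copied from the output above.
tags_simple = {
    "PRP": "pronoun",
    "MD": "modal auxiliary",
    "VB": "verb",
    "RB": "adverb",
}


def describe(tagged_tokens, names):
    """Swap each tag for its readable name, keeping unknown tags unchanged."""
    return [(token, names.get(tag, tag)) for token, tag in tagged_tokens]


# "XX" is a stand-in for a tag that is not in the dictionary.
tagged = [("I", "PRP"), ("'ll", "MD"), ("be", "VB"), ("back", "RB"), ("!!", "XX")]

for token, name in describe(tagged, tags_simple):
    print(token, "\t", name)
```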


***