# Tokenisation and Tagging of Text

To classify and analyse a body of text at a finer granularity, we first need to break it into individual sentences and words, or "tokens". Broadly, there are two tasks:
+ Sentence Tokenization
+ Word Tokenization

To go beyond counting the frequency or occurrence of actual words, we need to classify words into general categories that describe their role in the sentence, for instance noun, verb or adjective. This is generally known as:
+ Part of Speech or POS Tagging

## Sentence Tokenisation ##
The default sentence tokenizer, used by `nltk.sent_tokenize`, is the `PunktSentenceTokenizer` from the `nltk.tokenize.punkt` module.

In the example below (taken from James Joyce's Ulysses), we load the nltk library and process a block of text.

In [1]:
import nltk

ulysses = "Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing \
plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing \
with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's\
milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.\
— Gurrhr! she cried, running to lap."

doc = nltk.sent_tokenize(ulysses)
for s in doc:
    print(">",s)


> Mrkgnao!
> the cat said loudly.
> She blinked up out of her avid shameclosing eyes, mewing plaintively and long, showing him her milkwhite teeth.
> He watched the dark eyeslits narrowing with greed till her eyes were green stones.
> Then he went to the dresser, took the jug Hanlon'smilkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.— Gurrhr!
> she cried, running to lap.


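Even in this very simple example, the results are not perfect: the tokenizer splits "Mrkgnao!" from "the cat said loudly." and attaches "Gurrhr!" to the end of the preceding sentence. Accuracy depends on how closely the style of the text matches the corpus that was used to train the tokenizer.

Note also that two of the backslash line continuations in the string literal are not preceded by a space, which is why "Hanlon'smilkman" appears as a single word and no space follows "floor." in the output.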
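To see why sentence splitting is harder than it looks, here is a minimal sketch of a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. It is not the algorithm Punkt uses, and the helper name `naive_sent_split` is hypothetical; it only illustrates how a purely punctuation-based rule makes the same kind of mistake on the opening of the passage.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Mrkgnao! the cat said loudly. She blinked."
sentences = naive_sent_split(sample)
for s in sentences:
    print(">", s)
```

A rule like this cannot tell that the exclamation belongs to the same sentence as "the cat said loudly", which is the kind of context a trained tokenizer tries to learn.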

## Word Tokenisation ##
There are many methods for tokenising text into words. The default tokeniser, the [Penn Treebank Tokeniser](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), follows the conventions of the [Penn TreeBank](https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu:80/~treebank/) corpus. A few examples of tokenisers that give different results are listed below:

+ TreebankWordTokenizer
+ WordPunctTokenizer
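+ WhitespaceTokenizer (the example below uses the closely related SpaceTokenizer, which splits on single space characters only)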

We can illustrate the impact of choosing a different tokenisation method by comparing the results for a simple sentence:

In [2]:
from nltk import word_tokenize

sentence = "Mary had a little lamb it's fleece was white as snow."
# Default Tokenisation
tree_tokens = word_tokenize(sentence)   # nltk.download('punkt') for this

# Other Tokenisers
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

space_tokenizer = nltk.tokenize.SpaceTokenizer()
space_tokens = space_tokenizer.tokenize(sentence)

print("DEFAULT: ", tree_tokens)
print("PUNCT  : ", punct_tokens)
print("SPACE  : ", space_tokens)

DEFAULT:  ['Mary', 'had', 'a', 'little', 'lamb', 'it', "'s", 'fleece', 'was', 'white', 'as', 'snow', '.']
PUNCT  :  ['Mary', 'had', 'a', 'little', 'lamb', 'it', "'", 's', 'fleece', 'was', 'white', 'as', 'snow', '.']
SPACE  :  ['Mary', 'had', 'a', 'little', 'lamb', "it's", 'fleece', 'was', 'white', 'as', 'snow.']
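The two simpler tokenisers can be approximated with the standard library alone, which makes the difference between them easier to see. The sketch below assumes that `WordPunctTokenizer` behaves like the regular expression `\w+|[^\w\s]+` (the pattern given in its documentation) and that `SpaceTokenizer` behaves like `str.split(" ")`; the variable names are illustrative.

```python
import re

sentence = "Mary had a little lamb it's fleece was white as snow."

# Runs of word characters, or runs of punctuation, become separate tokens
punct_like = re.findall(r"\w+|[^\w\s]+", sentence)

# Splitting on single spaces leaves punctuation attached to the words
space_like = sentence.split(" ")

print("PUNCT-LIKE:", punct_like)
print("SPACE-LIKE:", space_like)
```

The Treebank-style default sits between the two: it keeps the clitic `'s` together as one token while still separating the final full stop.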


## Part of Speech Tagging ##

For each word token, the NLTK `pos_tag` function assigns a Part of Speech (POS) tag, automating the labelling of words by grammatical category.

The outcome depends on how the sentence has been split into individual tokens, and on the tokeniser and corpus that the POS tagger was trained against:

In [3]:
pos = nltk.pos_tag(tree_tokens)
print(pos)
pos_space = nltk.pos_tag(space_tokens)
print(pos_space)

[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]
[('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'JJ'), ("it's", 'NN'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow.', 'NN')]
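To make the effect of tokenisation on tagging explicit, the following sketch compares the two results word by word. The tagged lists are copied from the output above rather than recomputed, so no NLTK data is needed to run it; the names `tree_tags` and `changed` are illustrative.

```python
# Tagged output copied from the two pos_tag() results above
pos = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
       ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'),
       ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]
pos_space = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
             ('lamb', 'JJ'), ("it's", 'NN'), ('fleece', 'NN'), ('was', 'VBD'),
             ('white', 'JJ'), ('as', 'IN'), ('snow.', 'NN')]

tree_tags = dict(pos)

# Words present in both results whose tag differs: (default tag, space-split tag)
changed = {word: (tree_tags[word], tag)
           for word, tag in pos_space
           if word in tree_tags and tree_tags[word] != tag}

# Tokens that only exist in the space-split version
only_in_space = [word for word, _ in pos_space if word not in tree_tags]

print(changed)
print(only_in_space)
```

Here `lamb` changes from noun to adjective purely because the following token became `it's`, and the unsplit tokens `it's` and `snow.` have no counterpart in the default tokenisation.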


#### PoS Tag Descriptions ####
| Tag | Description |
| --- | --- |
| CC | Coordinating conjunction |
| CD | Cardinal number |
| DT | Determiner |
| EX | Existential there |
| FW | Foreign word |
| IN | Preposition or subordinating conjunction |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| LS | List item marker |
| MD | Modal |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PRP | Personal pronoun |
| PRP\$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol |
| TO | to |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| VBN | Verb, past participle |
| VBP | Verb, non-3rd person singular present |
| VBZ | Verb, 3rd person singular present |
| WDT | Wh-determiner |
| WP | Wh-pronoun |
| WP\$ | Possessive wh-pronoun |
| WRB | Wh-adverb |


The naming convention of the PoS tags makes it easy to use regular expressions to extract classes of word type (e.g. all the nouns or all the verbs):

In [4]:
import re
# All Penn Treebank noun tags (NN, NNS, NNP, NNPS) start with "N"
regex = re.compile("^N.*")
nouns = []
for word, tag in pos:
    if regex.match(tag):
        nouns.append(word)
print("Nouns:", nouns)

Nouns: ['Mary', 'lamb', 'fleece', 'snow']
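The same idea extends to other word classes. As a small sketch, the hypothetical helper `words_with_tag` below takes any regular expression over tags; the tagged list is copied from the earlier output so the block runs on its own.

```python
import re

# Tagged output copied from the default tokenisation above
pos = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
       ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'),
       ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]

def words_with_tag(tagged, pattern):
    """Return the words whose tag matches the given regex from its start."""
    regex = re.compile(pattern)
    return [word for word, tag in tagged if regex.match(tag)]

verbs = words_with_tag(pos, r"VB")       # VB, VBD, VBG, VBN, VBP, VBZ
adjectives = words_with_tag(pos, r"JJ")  # JJ, JJR, JJS
print("Verbs:", verbs)
print("Adjectives:", adjectives)
```

Matching on `VB` rather than a bare `V` is a deliberate choice: it keeps the pattern tied to the verb tags in the table and avoids surprises if a tag set contains other tags starting with the same letter.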


## Stemming and Lemmatizing ##
Stripping the suffixes from words is known as *stemming*.  
Mapping a word to a known dictionary word is known as *lemmatization*.
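As a rough illustration of what "stripping suffixes" means, here is a deliberately crude stemmer. It is not the Porter, Lancaster or Snowball algorithm; the suffix list, the `min_stem` threshold and the name `crude_stem` are all assumptions made for this sketch.

```python
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def crude_stem(word, min_stem=2):
    """Lower-case the word and remove the first matching suffix, if enough remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[:-len(suffix)]
    return lowered

words = ["going", "woods", "lying", "was", "floor", "bear"]
stems = [crude_stem(w) for w in words]
print(stems)
```

Even this toy version shows the characteristic weakness of stemming: "lying" and "was" are cut down to strings that are not words at all, which is the gap lemmatization tries to close.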

There are multiple stemming methods available, and the NLTK book references two in particular:

+ The Porter Stemmer - see https://tartarus.org/martin/PorterStemmer/
+ The Lancaster Stemmer - (Chris Paice, University of Lancaster)

Additionally, the Snowball Stemmer ("Porter 2", developed by Martin Porter) is a revision of the original Porter algorithm and is often the preferred choice.

A list of other stemming methods can be found here: http://www.nltk.org/api/nltk.stem.html. Both stemming and lemmatization remain inexact techniques.

#### Stemming Example ####

In [5]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer("english")

print([porter.stem(t) for t in tree_tokens])
print([lancaster.stem(t) for t in tree_tokens])
print([snowball.stem(t) for t in tree_tokens])

sentence2 = "When I was going into the woods I saw a bear lying asleep on the forest floor"
tokens2 = word_tokenize(sentence2)

print("\n",sentence2)
for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in tokens2])

['mari', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'wa', 'white', 'as', 'snow', '.']
['mary', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'was', 'whit', 'as', 'snow', '.']
['mari', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'was', 'white', 'as', 'snow', '.']

 When I was going into the woods I saw a bear lying asleep on the forest floor
['when', 'I', 'wa', 'go', 'into', 'the', 'wood', 'I', 'saw', 'a', 'bear', 'lie', 'asleep', 'on', 'the', 'forest', 'floor']
['when', 'i', 'was', 'going', 'into', 'the', 'wood', 'i', 'saw', 'a', 'bear', 'lying', 'asleep', 'on', 'the', 'forest', 'flo']
['when', 'i', 'was', 'go', 'into', 'the', 'wood', 'i', 'saw', 'a', 'bear', 'lie', 'asleep', 'on', 'the', 'forest', 'floor']


#### Lemmatizing Example ####
Lemmatization has a similar goal to stemming, but it aims to derive the genuine dictionary root of a word rather than a truncated version of it.

The default lemmatization method in NLTK is the WordNet lemmatizer.

In [None]:
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

In [None]:
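wnl = nltk.WordNetLemmatizer()
# POS tags for the second sentence (needs the "averaged_perceptron_tagger" data)
tokens2_pos = nltk.pos_tag(tokens2)

# lemmatize() treats every word as a noun unless a pos argument is supplied
print([wnl.lemmatize(t) for t in tree_tokens])
print([wnl.lemmatize(t) for t in tokens2])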
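Because the WordNet lemmatizer assumes a noun when no part of speech is given, verbs such as "was" or "going" are usually left unchanged unless a POS hint is passed. The sketch below shows one way to derive that hint from Penn Treebank tags. The helper name `penn_to_wordnet` is hypothetical, and the mapping assumes WordNet's single-letter POS codes (`n`, `v`, `a`, `r`).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag

# A few (word, tag) pairs copied from the tagging output earlier in this section
tagged = [('Mary', 'NNP'), ('had', 'VBD'), ('little', 'JJ'), ('was', 'VBD')]
hints = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(hints)
```

With NLTK and the WordNet data available, the hint can then be supplied as the second argument, for example `wnl.lemmatize(word, pos=penn_to_wordnet(tag))` for each pair in `tokens2_pos`.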


## Summary ##
By tokenising text into sentences and words, we can go beyond counting the frequency or occurrence of actual words and instead group words by type, such as part of speech or base form, which lets us identify common features in the text.

## Further Investigation ##
Optional further work and experimentation:
    
#### 1. Regular Expressions and POS patterns ####
Consider how to extract "phrase-chunks" based on regular expressions.  

See this Stack Overflow thread for one idea:
http://stackoverflow.com/questions/34090734/how-to-use-nltk-regex-pattern-to-extract-a-specific-phrase-chunk
    
Consider the pitfalls and complexities for different sentence constructs with essentially the same meaning.
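As one possible starting point, the sketch below finds simple noun-phrase chunks by running a regular expression over the sequence of tags rather than over the words. It uses only the standard library and is not NLTK's own chunking API; the helper name `np_chunks` and the pattern (optional determiner, any adjectives, one or more nouns) are assumptions chosen for illustration.

```python
import re

# Tagged output copied from the default tokenisation earlier in this section
pos = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
       ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'),
       ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]

# Optional determiner, any number of adjectives, then one or more nouns
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[RS]?>)*(?:<NN[PS]{0,2}>)+")

def np_chunks(tagged):
    """Return word sequences whose tags match NP_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag contributes exactly one "<", so counting them gives token indices
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

chunks = np_chunks(pos)
print(chunks)
```

Note that "white" is not returned: the pattern requires a noun after the adjective, so a small change in sentence construction can change which chunks are found.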

#### 2. Experiment with the POS Tags ####
Try tokenising and tagging the sentence:  
*"When I was going into the woods I saw a bear lying asleep on the forest floor"*  
and note any inaccuracies in the PoS classifications.