# Parts of Speech (POS) Tagging & Chunking in NLTK Python
### What is POS Tagging?
POS Tagging is the process of labeling each word in a sentence with its part of speech, such as noun, pronoun, verb, adverb, or preposition.
Tagging the words of a text with parts of speech helps us understand how each word functions grammatically in the context of the sentence. The same word can take different parts of speech depending on that context.
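
As a small illustration of that last point, the sketch below uses two hand-tagged sentences in which the word "book" acts first as a noun and then as a verb. The tags are written by hand for illustration and are not the output of any tagger.

```python
from collections import defaultdict

# Hand-tagged sentences (assumed for illustration, not tagger output).
tagged_sentences = [
    [("I", "PRP"), ("read", "VBD"), ("a", "DT"), ("book", "NN")],
    [("They", "PRP"), ("book", "VBP"), ("a", "DT"), ("table", "NN")],
]

# Collect every tag that each word appears with.
tags_by_word = defaultdict(set)
for sentence in tagged_sentences:
    for word, tag in sentence:
        tags_by_word[word.lower()].add(tag)

# "book" shows up with both a noun tag and a verb tag.
print(sorted(tags_by_word["book"]))
```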

POS Tagging is useful in sentence parsing, information retrieval, sentiment analysis, and similar tasks. It is also a prerequisite for Chunking and a common input to Named Entity Recognition in NLP.

### POS Tagging in NLTK Library
POS Tagging in the NLTK library is done with the pos_tag() function, which takes the tokens of a sentence as input and returns a (word, tag) pair for each token.

#### List of POS Tags in NLTK
In school, we are usually taught nine parts of speech: noun, verb, adverb, article, preposition, pronoun, adjective, conjunction, and interjection. NLTK, however, provides many more categories and sub-categories of tags than just the traditional nine.
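
To give a feel for how the finer-grained tags relate to the traditional categories, here is a rough sketch that maps a few Penn Treebank tag families (the tagset printed by nltk.help.upenn_tagset()) onto school-grammar names. `COARSE_CATEGORY` and `coarse_category` are hypothetical helpers written for this illustration and are not part of NLTK. The mapping covers only some of the tags.

```python
# Hypothetical helper, not part of NLTK: map a few Penn Treebank tag
# families onto the traditional school-grammar categories.
COARSE_CATEGORY = {
    "NN": "noun",         # NN, NNS, NNP, NNPS
    "VB": "verb",         # VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "adjective",    # JJ, JJR, JJS
    "RB": "adverb",       # RB, RBR, RBS
    "PRP": "pronoun",     # PRP, PRP$
    "IN": "preposition",  # also covers subordinating conjunctions
    "CC": "conjunction",
    "DT": "determiner",
    "UH": "interjection",
}


def coarse_category(tag):
    """Return the traditional category for a tag, or 'other' if unmapped."""
    for prefix, category in COARSE_CATEGORY.items():
        if tag.startswith(prefix):
            return category
    return "other"


for tag in ["NNS", "VBG", "JJR", "PRP$", "TO"]:
    print(tag, "->", coarse_category(tag))
```

Several distinct tags (for example NN, NNS, NNP, and NNPS) collapse into a single traditional category, which is why the tagset has far more than nine entries.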

We can list all the available POS tags, with examples, by using the nltk.help.upenn_tagset() function.

In [2]:
import nltk
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or



#### Example of POS Tagging in NLTK
In the example below, we first tokenize the text and then pass the tokens to NLTK's pos_tag() function.

In [3]:
from nltk import pos_tag
from nltk import word_tokenize

text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)  # split the sentence into word tokens
pos_tag(tokens)               # returns a list of (word, tag) tuples

[('The', 'DT'),
 ('way', 'NN'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('started', 'VBN'),
 ('is', 'VBZ'),
 ('to', 'TO'),
 ('quit', 'VB'),
 ('talking', 'VBG'),
 ('and', 'CC'),
 ('begin', 'VB'),
 ('doing', 'VBG'),
 ('.', '.')]
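
Because the result is an ordinary list of (word, tag) tuples, it can be processed with plain Python. The sketch below copies the output above as a literal, so it runs without a tagger model. It counts how often each tag occurs and collects the verb forms. The variable names are chosen for this illustration.

```python
from collections import Counter

# Tagged output copied from the example above, so no tagger model is needed.
tagged = [('The', 'DT'), ('way', 'NN'), ('to', 'TO'), ('get', 'VB'),
          ('started', 'VBN'), ('is', 'VBZ'), ('to', 'TO'), ('quit', 'VB'),
          ('talking', 'VBG'), ('and', 'CC'), ('begin', 'VB'),
          ('doing', 'VBG'), ('.', '.')]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts)

# Collect all verb forms: every verb tag in this tagset starts with "VB".
verbs = [word for word, tag in tagged if tag.startswith("VB")]
print(verbs)
```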

### Default Tagger in NLTK
NLTK has a DefaultTagger class that assigns the same default tag to every token. Let us see this with the help of an example.

Below, we first tokenize the text and then create an instance of DefaultTagger with the desired default tag ‘Ad’. Finally, we pass the tokens to the tag() method of the DefaultTagger instance.

In [4]:
from nltk import word_tokenize
from nltk.tag import DefaultTagger

text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)
tagger = DefaultTagger('Ad')  # every token will receive the tag 'Ad'

print(tagger.tag(tokens))

[('The', 'Ad'), ('way', 'Ad'), ('to', 'Ad'), ('get', 'Ad'), ('started', 'Ad'), ('is', 'Ad'), ('to', 'Ad'), ('quit', 'Ad'), ('talking', 'Ad'), ('and', 'Ad'), ('begin', 'Ad'), ('doing', 'Ad'), ('.', 'Ad')]
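
Giving every token the same tag is rarely useful on its own. One common use, which goes a little beyond the example above, is as a fallback for another tagger. The sketch below assumes NLTK's RegexpTagger and its `backoff` parameter. The two suffix patterns are illustrative guesses chosen for this sentence and are not a real rule set. The tokens are written out by hand so that no tokenizer data is needed.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Tokens written out by hand so the sketch does not depend on tokenizer data.
tokens = ['The', 'way', 'to', 'get', 'started', 'is', 'to', 'quit',
          'talking', 'and', 'begin', 'doing', '.']

# Illustrative suffix rules (assumptions, far from a complete rule set).
patterns = [
    (r'.*ing$', 'VBG'),  # words ending in -ing
    (r'.*ed$', 'VBD'),   # words ending in -ed (crude: ignores participles)
]

# Tokens that match no pattern fall through to the default tag 'NN'.
tagger = RegexpTagger(patterns, backoff=DefaultTagger('NN'))
tagged = tagger.tag(tokens)
print(tagged)
```

The result is still crude (for instance, "The" ends up tagged as a noun), but every token is guaranteed to receive some tag.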


## What is Chunking in NLP?
We have seen that we can break a sentence into word tokens and then apply POS tagging to identify the part of speech of each word. On its own, however, this does not tell us much about how the words group together. Chunking takes us to the next level.

In NLP, chunking is the process of grouping the words of a text into phrases such as Noun Phrases, Verb Phrases, Adjective Phrases, Adverb Phrases, and Prepositional Phrases.

Chunking is commonly used to extract Noun Phrases (NP) from a sentence. Note that POS tagging is a prerequisite for chunking, and that chunks do not overlap with each other.

Chunking helps in understanding the structure and meaning of a text and is useful in information retrieval.



### Chunking in NLTK Library
Chunking in NLTK is a multi-step process, as explained below:

- Step 1:
Tokenize the sentence and perform POS Tagging.

- Step 2:
Define the grammar for chunking. This is a very important step because the grammar lays down the rules of chunking.

- Step 3:
Using this grammar, create a chunk parser with the help of RegexpParser and apply it to the tagged sentence.

- Step 4:
Print the result of the above step as it is, or draw it as a tree graph for better visualization.

#### Example of Chunking in NLTK
Following the steps explained above, in the example below we first tokenize the sample sentence and perform POS Tagging on it. Then we define the grammar for a Noun Phrase as NP: {<DT>?<JJ>*<NN>}, which means that a chunk is formed from an optional Determiner (DT), followed by any number of Adjectives (JJ), followed by a Noun (NN).

We then initialize an instance of nltk.RegexpParser() with this grammar and use it to parse the POS-tagged tokens. This produces the chunking result, which we both print and draw as a tree graph.

In [5]:
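# Chunk the noun phrases of a sample sentence
import nltk

sentence = "the little yellow dog barked at the cat"
tokens = nltk.word_tokenize(sentence)
print(tokens)

# POS-tag the tokens; the chunk parser works on (word, tag) pairs
tagged = nltk.pos_tag(tokens)
print(tagged)

# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)
result.draw()  # opens a separate window; close it to continue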

['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']
[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
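
The chunking result is an NLTK tree, so the noun phrases can be pulled out by walking its subtrees. The sketch below starts from the tagged tokens printed above, copied as a literal, so it needs only RegexpParser and no tagger model. The name `noun_phrases` is chosen for this illustration.

```python
from nltk import RegexpParser

# Tagged tokens copied from the output above, so no tagger model is needed.
tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

tree = RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)

# Keep only the NP subtrees and join their words back into phrases.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```

The words outside any chunk ("barked" and "at") stay as plain leaves directly under the root S node, so they are skipped by the filter.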

