# Chunking in NLP

<p>
    <b>Chunking</b> is the extraction of short, meaningful phrases from sentences whose words have been tagged with parts of speech (POS).<br>
    <b>Chunks</b> are made up of words, and the kind of each word is defined by its POS tag.
</p>

<b><ins>Working:</ins></b>
<ul>
         <li>It works on top of POS tagging.</li>
         <li>Input: POS-tagged words; output: chunks.</li>
         <li>Use: extracting information from text, such as locations, people, and names.</li>
</ul>
</p> 

<b>Note:</b> A <b>phrase</b> is a word or group of words that works as a single unit to perform a grammatical function. <b>Noun phrases</b> are built around a noun.

<p>
Here are some examples:
<ul>
<li>“A planet”</li>
<li>“A tilting planet”</li>
<li>“A swiftly tilting planet”</li>
</ul>
</p>
Chunking makes use of <mark>POS tags</mark> to group words and apply chunk tags to those groups. Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

Here’s how to chunk with NLTK. As a first step, import word_tokenize from NLTK's tokenize module:

In [2]:
from nltk.tokenize import word_tokenize

Before you can chunk, you need to make sure that the parts of speech in your text are tagged, so start by creating a string to tag:

In [3]:
sentence = "It's a dangerous business, Frodo, going out your door."

Now tokenize that string by word:

In [4]:
words_in_lotr_quote = word_tokenize(sentence)

In [5]:
words_in_lotr_quote

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']

Now you’ve got a list of all of the words in the quote, stored in words_in_lotr_quote.

The next step is to tag those words by part of speech:

In [6]:
import nltk

In [7]:
lotr_pos_tags = nltk.pos_tag(words_in_lotr_quote)


In [8]:
lotr_pos_tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

You’ve got a list of tuples, each pairing a word from the quote with its POS tag. In order to chunk, you first need to define a chunk grammar.

<b>Note</b>: A chunk grammar is a combination of rules on how sentences should be chunked. It often uses <a href = "https://realpython.com/regex-python/">regular expressions</a>, or <b>regexes</b>.

For this tutorial, you don’t need to know how regular expressions work, but they will definitely come in handy for you in the future if you want to process text.

Create a chunk grammar with one regular expression rule:

In [9]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

<b>NP</b> stands for noun phrase. You can learn more about noun phrase chunking in <a href = "https://www.nltk.org/book/ch07.html#noun-phrase-chunking">Chapter 7</a> of Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
<p>
According to the rule you created, your chunks:
<ul>
    <li>Start with an optional (?) determiner (&lt;DT&gt;)</li>
    <li>Can have any number (*) of adjectives (&lt;JJ&gt;)</li>
    <li>End with a noun (&lt;NN&gt;)</li>
</ul>
</p>
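
To see what the `?` and `*` quantifiers are doing, here is a minimal sketch that mimics the rule with Python's built-in `re` module. It is only an illustration under simplifying assumptions, not a description of how NLTK is implemented: the tags are joined into one string such as `<DT><JJ><NN>`, and `NP_PATTERN` and `find_noun_phrases` are hypothetical names made up for this example. The tagged tokens are copied from the output above.

```python
import re

# POS-tagged tokens copied from the pos_tag output shown earlier.
tagged = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# Plain-regex stand-in (an assumption for illustration) for <DT>?<JJ>*<NN>:
# an optional determiner, any number of adjectives, then one noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def find_noun_phrases(tagged_tokens):
    """Return the word groups whose tags match NP_PATTERN, left to right."""
    # Encode the tag sequence as one string, e.g. "<PRP><VBZ><DT>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)

    # Record the character offset where each token's tag starts.
    offsets = []
    position = 0
    for _, tag in tagged_tokens:
        offsets.append(position)
        position += len(tag) + 2  # +2 for the surrounding "<" and ">"

    phrases = []
    # finditer yields non-overlapping matches, so a word lands in one chunk at most.
    for match in NP_PATTERN.finditer(tag_string):
        words = [
            word
            for (word, _), start in zip(tagged_tokens, offsets)
            if match.start() <= start < match.end()
        ]
        phrases.append(" ".join(words))
    return phrases


noun_phrases = find_noun_phrases(tagged)
print(noun_phrases)  # ['a dangerous business', 'door']
```

Note that `<NN>` does not match the `NNP` tag on "Frodo", which is why the proper noun is left out of the chunks.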



Create a chunk parser with this grammar, then test it on the example sentence:

In [10]:
chunk_parser = nltk.RegexpParser(grammar)

In [11]:
chunk_parser

<chunk.RegexpParser with 1 stages>

Now try it out with your quote:

In [12]:
tree = chunk_parser.parse(lotr_pos_tags)

Now you can draw a visual representation of the tree, showing the phrases:


In [13]:
tree.draw()

You got two noun phrases:

1. **'a dangerous business'** has a determiner, an adjective, and a noun.
2. **'door'** has just a noun.
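
`tree.draw()` opens a separate window, so it needs a graphical display. If you are working somewhere without one, the following sketch reads the same chunks as text instead. The tagged list is copied from the output above so that the block runs on its own; it assumes only that NLTK is installed.

```python
import nltk

# POS-tagged tokens copied from the pos_tag output shown earlier.
lotr_pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# Same one-rule grammar as in the tutorial.
chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunk_parser.parse(lotr_pos_tags)

# Keep only the subtrees labeled NP and join their words back into text.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['a dangerous business', 'door']
```

`print(tree)` shows the whole tree in bracketed form if you also want to see the words that fall outside the chunks.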

## Example problem

In [21]:
sentence = "little yellow dog barked at cat"

Import the libraries needed for tokenization and POS tagging:

In [22]:
import nltk
from nltk.tokenize import word_tokenize

Tokenize the sentence into words:

In [23]:
token_word = word_tokenize(sentence)

In [24]:
token_word

['little', 'yellow', 'dog', 'barked', 'at', 'cat']

Find the POS tag of each word:

In [27]:
pos_taged = nltk.pos_tag(token_word)

In [28]:
pos_taged

[('little', 'JJ'),
 ('yellow', 'JJ'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('cat', 'NN')]

Then specify the grammar rule for phrase extraction:

In [29]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

Then use a chunk parser to extract the phrases that match the grammar rule:

In [36]:
chunck_parser = nltk.RegexpParser(grammar)


In [41]:
chunck_parser


<chunk.RegexpParser with 1 stages>

In [42]:
tree = chunck_parser.parse(pos_taged)

In [None]:
tree.draw()

With these tags and this grammar, the parser again finds two noun phrases:

1. **'little yellow dog'** has two adjectives and a noun.
2. **'cat'** has just a noun.