### What is chunking?

#### Text chunking, also called shallow parsing, is a task that follows part-of-speech tagging and adds more structure to the sentence. The result is a grouping of the words into “chunks”. Here is a quick example:
<br>


            (S
            (NP Every/DT day/NN)
            ,/,
            (NP I/PRP)
            (VP buy/VBP)
            (NP something/NN)
            (PP from/IN)
            (NP the/DT corner/NN shop/NN))

<br>

#### In other words, in a shallow parse tree there is at most one level between the root and the leaves. A deep parse tree looks like this:

<br>

        (S
          (NP
            (S
              (NP Every/DT day/NN)
              ,/,
              (NP I/PRP)
              (VP
                buy/VBP
                (NP something/NN)
                (PP from/IN (NP the/DT corner/NN shop/NN))))))
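
To make the contrast concrete, the following sketch measures how deeply the brackets nest in each of the two example trees. It is a minimal illustration in plain Python, not part of NLTK; `max_depth` is a helper name invented here, and the trees are written on one line for brevity.

```python
def max_depth(bracketed: str) -> int:
    """Return the deepest level of parenthesis nesting in a bracketed tree."""
    depth = deepest = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

shallow = ("(S (NP Every/DT day/NN) ,/, (NP I/PRP) (VP buy/VBP) "
           "(NP something/NN) (PP from/IN) (NP the/DT corner/NN shop/NN))")
deep = ("(S (NP (S (NP Every/DT day/NN) ,/, (NP I/PRP) (VP buy/VBP "
        "(NP something/NN) (PP from/IN (NP the/DT corner/NN shop/NN))))))")

print(max_depth(shallow))  # 2: the root S plus one layer of chunks
print(max_depth(deep))     # 6: phrases nested inside other phrases
```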

<br>

- Chunking is a very similar task to named-entity recognition. In fact, both use the same format, IOB tagging.
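
As a rough illustration of IOB tagging, the sketch below labels each token as `B-` (begins a chunk), `I-` (inside a chunk), or `O` (outside any chunk). It is plain Python written for this example, not NLTK's implementation; `to_iob` and the input layout (a list whose items are either a `(label, tokens)` chunk or a bare token) are assumptions made here.

```python
def to_iob(shallow_parse):
    """Flatten a shallow parse into (token, IOB-tag) pairs."""
    tagged = []
    for item in shallow_parse:
        if isinstance(item, tuple):          # a chunk: (label, [tokens])
            label, tokens = item
            for i, token in enumerate(tokens):
                prefix = "B" if i == 0 else "I"
                tagged.append((token, f"{prefix}-{label}"))
        else:                                # a token outside every chunk
            tagged.append((item, "O"))
    return tagged

# The shallow parse of "Every day, I buy something from the corner shop".
parse = [("NP", ["Every", "day"]), ",", ("NP", ["I"]), ("VP", ["buy"]),
         ("NP", ["something"]), ("PP", ["from"]),
         ("NP", ["the", "corner", "shop"])]

for token, tag in to_iob(parse):
    print(f"{token}\t{tag}")
```

NLTK ships a comparable conversion, `nltk.chunk.tree2conlltags`, which returns (word, POS tag, IOB tag) triples from a chunk tree.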




- Chunking usually means grouping words into what are known as "noun phrases."

- These are phrases of one or more words that contain a noun, possibly along with some descriptive words, a verb, or an adverb.

- The idea is to group nouns with the words that relate to them.


- To chunk, we combine part-of-speech tags with regular expressions. From regular expressions, we mainly use the following:

            + = match 1 or more repetitions
            ? = match 0 or 1 repetitions
            * = match 0 or more repetitions
            . = any character except a newline
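
To see how these operators act on tags rather than on characters, here is a simplified sketch that turns a tag pattern into an ordinary regular expression and runs it over a string of tags. It only imitates the idea behind NLTK's chunk grammars and is not NLTK's code; `find_chunks` and the hand-tagged sentence are assumptions made for illustration.

```python
import re

def find_chunks(tag_pattern, tagged):
    """Return the word groups whose tag sequence matches tag_pattern."""
    # Make each <TAG> one unit so that +, ? and * repeat whole tags.
    regex = re.sub(r"<([^<>]+)>", r"(?:<\1>)", tag_pattern)
    # Inside a tag, "." may match any character except the angle brackets.
    regex = regex.replace(".", "[^<>]")
    tags = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(regex, tags):
        # Counting "<" converts string offsets back to token positions.
        start = tags[:match.start()].count("<")
        end = tags[:match.end()].count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

tagged = [("He", "PRP"), ("quickly", "RB"), ("called", "VBD"),
          ("George", "NNP"), ("Bush", "NNP"), ("yesterday", "NN"),
          ("in", "IN"), ("Texas", "NNP")]

print(find_chunks("<RB.?>*<VB.?>*<NNP>+<NN>?", tagged))
# [['quickly', 'called', 'George', 'Bush', 'yesterday'], ['Texas']]
```

Because `<NNP>+` is the only required part of the pattern, a lone proper noun such as `Texas` still forms a chunk, while the optional adverb, verb, and trailing noun are absorbed only when they sit next to it.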


import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:1]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # The chunk grammar below reads as:
            #   <RB.?>*  zero or more adverbs of any form (RB, RBR, RBS), followed by
            #   <VB.?>*  zero or more verbs of any form (VB, VBD, VBG, ...), followed by
            #   <NNP>+   one or more singular proper nouns, followed by
            #   <NN>?    zero or one singular common noun.
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            print(chunked)
            print("\t \tAfter Chunk\n\n\n\n\n\n")
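            # subtrees() yields the root S first, then each Chunk subtree.
            for subtree in chunked.subtrees():
                print(subtree)
                # Opens a Tk window per subtree, so a GUI display is required.
                subtree.draw()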

    except Exception as e:
        print(str(e))

process_content()

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
	 	After Chunk






(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
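
The full sentence tree appears twice in the output because `subtrees()` yields the root `S` node before the chunks. One way to visit only the chunks is to pass a `filter` to `subtrees()`. The sketch below uses a few hand-tagged tokens copied from the output above, so it should run without downloading any corpus or tagger model; treat it as an illustration rather than a replacement for the cell above.

```python
from nltk import RegexpParser

# Hand-tagged tokens copied from the start of the output above.
tagged = [("PRESIDENT", "NNP"), ("GEORGE", "NNP"), ("W.", "NNP"),
          ("BUSH", "NNP"), ("'S", "POS"), ("ADDRESS", "NNP"),
          ("BEFORE", "IN")]

chunk_parser = RegexpParser(r"Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}")
chunked = chunk_parser.parse(tagged)

# Keep only subtrees labelled "Chunk"; this skips the root S node.
chunks = list(chunked.subtrees(filter=lambda t: t.label() == "Chunk"))
for chunk in chunks:
    print([word for word, tag in chunk.leaves()])
```

The output also shows that all-caps words such as `THE` and `ON` were tagged `NNP`, which is why they end up inside chunks; the grammar only sees the tags, so tagging mistakes carry straight through to the chunks.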
