### What is chunking?

Suppose you have a body of text. So far we know how to
1. tokenize it and
1. get POS tags for each token.  

#### What should be the next step?

The next step is to figure out the meaning of a sentence.

First we want to know __who or what the sentence is talking about__ (i.e. the `named entity`, or in simple words, a __noun__).
A person, place, or thing is generally going to be your __subject__.  

Okay, once you know the named entity, what's the next step?

The next step is _finding the words that modify or affect that noun_.  

You might have many named entities, or many nouns, in the same sentence,
e.g.

> Apple releases new phone comes with new color case hundred dollars more and Tesla releases home battery

This is one sentence, but it's about two completely different things. It might even contain opinions, and you've got to _figure out who or what each opinion applies to_:
- Does it apply to Apple?
- Or does it apply to Tesla?

So most people chunk a sentence into what we call __noun phrases__.
A noun phrase is a noun surrounded by the words that modify it.

Chunking, then, outputs a descriptive group of words surrounding that noun (which we call a chunk).
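
To make the idea concrete before bringing in NLTK, here is a minimal pure-Python sketch. It assumes the words are already POS-tagged (the tags below are written by hand using Penn Treebank labels), and the helper `noun_phrase_chunks` is a hypothetical name for illustration, not an NLTK function. It simply groups each run of adjectives and nouns into one chunk, which is much cruder than a real chunker.

```python
# Hand-tagged tokens (Penn Treebank tags), shortened from the example sentence.
# The tags are assumptions written by hand, not output from a tagger.
tagged = [
    ("Apple", "NNP"), ("releases", "VBZ"), ("new", "JJ"), ("phone", "NN"),
    ("and", "CC"),
    ("Tesla", "NNP"), ("releases", "VBZ"), ("home", "NN"), ("battery", "NN"),
]


def noun_phrase_chunks(tagged_tokens):
    """Group runs of adjectives (JJ*) and nouns (NN*) into chunks.

    A run is kept only if it contains at least one noun.
    """
    chunks = []
    current = []
    has_noun = False
    for word, tag in tagged_tokens:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append(word)
            has_noun = has_noun or tag.startswith("NN")
        else:
            # Any other tag ends the current run.
            if current and has_noun:
                chunks.append(" ".join(current))
            current = []
            has_noun = False
    # Flush the last run at the end of the sentence.
    if current and has_noun:
        chunks.append(" ".join(current))
    return chunks


print(noun_phrase_chunks(tagged))
# ['Apple', 'new phone', 'Tesla', 'home battery']
```

Each chunk keeps a noun together with its modifiers, so "new" stays attached to "phone" rather than drifting toward "Tesla".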

In [1]:
import nltk
from nltk.corpus import state_union # corpus of US presidents' State of the Union addresses
from nltk.tokenize import PunktSentenceTokenizer # unsupervised ML sentence tokenizer; it comes pre-trained, but we can retrain it on our own text

In [2]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [3]:
custom_sent_tokenize = PunktSentenceTokenizer(train_text) # train the tokenizer on the 2005 address
tokenized = custom_sent_tokenize.tokenize(sample_text)

In [10]:
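def process_content():
    # Chunk grammar: zero or more adverbs (RB, RBR, RBS), then zero or more
    # verbs (VB, VBD, ...), then one or more proper nouns (NNP), optionally
    # followed by a single common noun (NN).
    chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)  # build the parser once, not per sentence

    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)

            chunked = chunkParser.parse(tagged)
            chunked.draw()  # opens a window with the tree; close it to see the next sentence

    except Exception as e:
        print(str(e))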

In [None]:
process_content()
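
Since `draw()` opens a window for every sentence, it can be handy to see the chunks as plain text instead. The sketch below applies the same grammar to a hand-tagged sentence (the sentence and its tags are made up for illustration, so no tokenizer or tagger data is needed) and collects the leaves of every subtree labelled `Chunk`.

```python
import nltk

# Hand-tagged example sentence (tags are assumptions written by hand).
tagged = [
    ("Tonight", "NN"), ("I", "PRP"), ("proudly", "RB"), ("thank", "VBP"),
    ("President", "NNP"), ("Bush", "NNP"), ("and", "CC"),
    ("the", "DT"), ("Congress", "NNP"),
]

# Same grammar as above: adverbs*, verbs*, proper nouns+, optional common noun.
chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_gram)
tree = chunk_parser.parse(tagged)

# Keep only the subtrees labelled "Chunk" and join their words back together.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]

print(chunks)
# ['proudly thank President Bush', 'Congress']
```

Note that "Tonight" is left out: it is tagged `NN`, and the grammar only starts a chunk when at least one `NNP` follows the optional adverbs and verbs.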