# Word chunking based on POS tags
The motivation for chunking is to group words into phrases, in particular noun phrases. Within a noun phrase we can look for adjectives that modify the noun and check whether they express an aggregation over the noun's attribute, e.g. the average age, the minimum salary, etc.

In [15]:
import nltk
from nltk.corpus import conll2000
from nltk.chunk.util import tree2conlltags,conlltags2tree
from nltk.tag import UnigramTagger,BigramTagger
from nltk.chunk import ChunkParserI
import matplotlib.pyplot as plt

### The CoNLL2000 data
The CoNLL2000 data contains chunked sentences, which will be used to train the chunker model.

Each sentence is divided into chunks (groups of words). Every word carries its own POS tag and every chunk carries a chunk tag such as NP, VP or PP. Below are a few examples.

In [16]:
data=conll2000.chunked_sents()
train=data[:10700]
test=data[10700:]
print(len(train),len(test))
print(train[7])
print(test[3])

10700 248
(S
  (NP The/DT August/NNP deficit/NN)
  and/CC
  (NP the/DT #/# 2.2/CD billion/CD gap/NN)
  (VP registered/VBN)
  (PP in/IN)
  (NP July/NNP)
  (VP are/VBP topped/VBN)
  only/RB
  (PP by/IN)
  (NP the/DT #/# 2.3/CD billion/CD deficit/NN)
  (PP of/IN)
  (NP October/NNP 1988/CD)
  ./.)
(S
  (NP The/DT first/JJ two/CD games/NNS)
  (PP of/IN)
  (NP the/DT World/NNP Series/NNP)
  (PP between/IN)
  (NP
    the/DT
    Oakland/NNP
    Athletics/NNP
    and/CC
    San/NNP
    Francisco/NNP
    Giants/NNP)
  (VP did/VBD n't/RB finish/VB)
  (PP in/IN)
  (NP the/DT top/JJ 10/CD)
  ;/:
  instead/RB
  (NP they/PRP)
  (VP landed/VBD)
  (PP in/IN)
  (NP 16th/JJ and/CC 18th/JJ place/NN)
  ./.)


<img src="Image1.jpeg">

The conll_tag_chunks function uses tree2conlltags(), which converts a chunked sentence into (word, POS tag, chunk tag) triples.
There are 3 types of chunk tags:<br>
1 B- marks the first word of a chunk<br>
2 I- marks a word inside a chunk <br>
3 O marks a word outside any chunk<br>
The conll_tag_chunks function then drops the words and returns, for each sentence, the list of (POS tag, chunk tag) pairs.

In [17]:
def conll_tag_chunks(chunk_sents):
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

In [18]:
wtc=tree2conlltags(test[3])

wtc

[('The', 'DT', 'B-NP'),
 ('first', 'JJ', 'I-NP'),
 ('two', 'CD', 'I-NP'),
 ('games', 'NNS', 'I-NP'),
 ('of', 'IN', 'B-PP'),
 ('the', 'DT', 'B-NP'),
 ('World', 'NNP', 'I-NP'),
 ('Series', 'NNP', 'I-NP'),
 ('between', 'IN', 'B-PP'),
 ('the', 'DT', 'B-NP'),
 ('Oakland', 'NNP', 'I-NP'),
 ('Athletics', 'NNP', 'I-NP'),
 ('and', 'CC', 'I-NP'),
 ('San', 'NNP', 'I-NP'),
 ('Francisco', 'NNP', 'I-NP'),
 ('Giants', 'NNP', 'I-NP'),
 ('did', 'VBD', 'B-VP'),
 ("n't", 'RB', 'I-VP'),
 ('finish', 'VB', 'I-VP'),
 ('in', 'IN', 'B-PP'),
 ('the', 'DT', 'B-NP'),
 ('top', 'JJ', 'I-NP'),
 ('10', 'CD', 'I-NP'),
 (';', ':', 'O'),
 ('instead', 'RB', 'O'),
 ('they', 'PRP', 'B-NP'),
 ('landed', 'VBD', 'B-VP'),
 ('in', 'IN', 'B-PP'),
 ('16th', 'JJ', 'B-NP'),
 ('and', 'CC', 'I-NP'),
 ('18th', 'JJ', 'I-NP'),
 ('place', 'NN', 'I-NP'),
 ('.', '.', 'O')]
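
To make the B-/I-/O convention concrete, here is a minimal sketch that groups triples like the ones above back into chunks. It is plain Python and does not use NLTK; the helper name `iob_to_chunks` is made up for this illustration, and the sample triples are copied from the tail of the output above.

```python
def iob_to_chunks(triples):
    """Group (word, pos, chunk_tag) triples into (label, [words]) chunks."""
    chunks = []
    label, words = None, []
    for word, _pos, chunk_tag in triples:
        # "B-NP" -> ("B", "NP"); "O" -> ("O", "")
        prefix, _, chunk_label = chunk_tag.partition("-")
        # The current chunk ends at an O tag, at a B- tag, or when the label changes.
        if prefix in ("O", "B") or chunk_label != label:
            if words:
                chunks.append((label, words))
            label, words = None, []
        if prefix != "O":
            label = chunk_label
            words.append(word)
    if words:
        chunks.append((label, words))
    return chunks


sample = [
    ("they", "PRP", "B-NP"),
    ("landed", "VBD", "B-VP"),
    ("in", "IN", "B-PP"),
    ("16th", "JJ", "B-NP"),
    ("and", "CC", "I-NP"),
    ("18th", "JJ", "I-NP"),
    ("place", "NN", "I-NP"),
    (".", ".", "O"),
]
for chunk_label, chunk_words in iob_to_chunks(sample):
    print(chunk_label, " ".join(chunk_words))
```

Words tagged O (here the final full stop) do not belong to any chunk and are dropped by this helper.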

In [19]:
def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff

The combined_tagger function chains taggers sequentially: each tagger in the list is trained with the previously trained tagger as its backoff, and the last one is returned. With UnigramTagger followed by BigramTagger, the bigram tagger is consulted first and falls back to the unigram tagger when it has not seen the context. <br>
UnigramTagger is an NLTK tagger that is trained on a corpus of (token, tag) pairs and assigns each token the tag it most frequently had in training, looking only at the current token. BigramTagger is trained the same way, but its context is the current token together with the tag assigned to the previous token. <br>
These taggers are frequency-based lookups rather than Hidden Markov Models: for each context the most frequent tag seen in training is chosen, and no search over whole tag sequences is performed. In this notebook the tokens are POS tags and the tags are chunk tags, so the chunk tag predicted for a word depends only on its POS tag and the chunk tag predicted for the preceding word.
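
The backoff idea behind combined_tagger can be sketched without NLTK. The code below is a simplified stand-in, not NLTK's implementation: the function names and the tiny training set are invented for illustration, and it assumes the bigram context is the previously predicted chunk tag plus the current POS tag. It builds two lookup tables from (POS tag, chunk tag) pairs and tries the bigram table before falling back to the unigram table.

```python
from collections import Counter, defaultdict


def train_backoff_tables(tagged_sents):
    """Build most-frequent-tag lookup tables for unigram and bigram contexts."""
    unigram_counts = defaultdict(Counter)  # pos_tag -> chunk tag counts
    bigram_counts = defaultdict(Counter)   # (previous chunk tag, pos_tag) -> counts
    for sent in tagged_sents:
        prev_chunk = None  # None marks the start of a sentence
        for pos_tag, chunk_tag in sent:
            unigram_counts[pos_tag][chunk_tag] += 1
            bigram_counts[(prev_chunk, pos_tag)][chunk_tag] += 1
            prev_chunk = chunk_tag
    unigram = {ctx: c.most_common(1)[0][0] for ctx, c in unigram_counts.items()}
    bigram = {ctx: c.most_common(1)[0][0] for ctx, c in bigram_counts.items()}
    return unigram, bigram


def tag_with_backoff(pos_tags, unigram, bigram):
    """Tag left to right: try the bigram table, then back off to the unigram table."""
    chunk_tags = []
    prev_chunk = None
    for pos_tag in pos_tags:
        chunk_tag = bigram.get((prev_chunk, pos_tag))
        if chunk_tag is None:
            chunk_tag = unigram.get(pos_tag)  # stays None for an unseen POS tag
        chunk_tags.append(chunk_tag)
        prev_chunk = chunk_tag
    return chunk_tags


# Toy training data (made up): lists of (POS tag, chunk tag) pairs.
toy_train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBZ", "B-VP")],
    [("NN", "B-NP"), ("VBZ", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
]
unigram, bigram = train_backoff_tables(toy_train)
print(tag_with_backoff(["NN", "VBZ", "DT", "JJ", "NN"], unigram, bigram))
print(tag_with_backoff(["JJ", "NN"], unigram, bigram))
```

In the first call the sentence-initial NN is tagged B-NP by the bigram table, even though the unigram table alone would pick I-NP. In the second call the context (sentence start, JJ) was never seen, so the unigram table supplies the tag.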

In [20]:
class NGramTagChunker(ChunkParserI):
    def __init__(self, train_sentences, tagger_classes=(UnigramTagger, BigramTagger)):
        # Train on (POS tag, chunk tag) pairs; later taggers back off to earlier ones.
        train_sent_tags = conll_tag_chunks(train_sentences)
        self.chunk_tagger = combined_tagger(train_sent_tags, tagger_classes)

    def parse(self, tagged_sentence):
        if not tagged_sentence:
            return None
        # Predict a chunk tag for each POS tag in the sentence.
        pos_tags = [tag for word, tag in tagged_sentence]
        chunk_pos_tags = self.chunk_tagger.tag(pos_tags)
        chunk_tags = [chunk_tag for (pos_tag, chunk_tag) in chunk_pos_tags]
        # Re-attach the words and rebuild the chunk tree.
        wpc_tags = [(word, pos_tag, chunk_tag)
                    for ((word, pos_tag), chunk_tag) in zip(tagged_sentence, chunk_tags)]
        return conlltags2tree(wpc_tags)

The parse() function uses the combined tagger to predict a chunk tag for each POS tag in the sentence and then returns the result in the original tree form using the conlltags2tree() function.

In [21]:
ntc=NGramTagChunker(train)
print(ntc.evaluate(test))

ChunkParse score:
    IOB Accuracy:  90.0%%
    Precision:     81.4%%
    Recall:        86.1%%
    F-Measure:     83.7%%


## Let us test some of the test sentences on our chunker model

In [22]:
print(test[2])

(S
  -LRB-/(
  (NP A/DT ratings/NNS point/VBP)
  (VP represents/VBZ)
  (NP 904,000/CD television/NN households/NNS)
  ;/:
  (NP shares/NNS)
  (VP indicate/VBP)
  (NP the/DT percentage/NN)
  (PP of/IN)
  (NP sets/NNS)
  (PP in/IN)
  (NP use/NN)
  ./.
  -RRB-/))


<img src="Image2.jpeg">

In [23]:
sentence='A ratings point represents 904,000 television households ; shares indicate the percentage of sets in use .'

nltk_pos_tagged=nltk.pos_tag(sentence.split())
chunk_tree=ntc.parse(nltk_pos_tagged)
print(chunk_tree)

(S
  (NP A/DT ratings/NNS point/NN)
  (VP represents/VBZ)
  (NP 904,000/CD television/NN households/NNS)
  ;/:
  (NP shares/NNS)
  (VP indicate/VBP)
  (NP the/DT percentage/NN)
  (PP of/IN)
  (NP sets/NNS)
  (PP in/IN)
  (NP use/NN)
  ./.)


In [24]:
print(test[6])

(S
  (NP CBS/NNP)
  (VP held/VBD)
  (NP the/DT previous/JJ record/NN)
  (PP for/IN)
  (NP consecutive/JJ No./NN 1/CD victories/NNS)
  --/:
  (NP 46/CD weeks/NNS)
  --/:
  (PP during/IN)
  (NP the/DT 1962-63/CD season/NN)
  ./.)


<img src="Image3.jpeg">

In [25]:
sentence='CBS held the previous record for consecutive No. 1 victories -- 46 weeks -- during the 1962-63 season .'

nltk_pos_tagged=nltk.pos_tag(sentence.split())
chunk_tree=ntc.parse(nltk_pos_tagged)
print(chunk_tree)

(S
  (NP CBS/NNP)
  (VP held/VBD)
  (NP the/DT previous/JJ record/NN)
  (PP for/IN)
  (NP consecutive/JJ No./NN 1/CD victories/NNS)
  --/:
  (NP 46/CD weeks/NNS)
  --/:
  (PP during/IN)
  (NP the/DT 1962-63/JJ season/NN)
  ./.)


In [23]:
sentence='what is the minimum salary of employee whose age is greater than 29'
nltk_pos_tagged=nltk.pos_tag(sentence.split())
chunk_tree=ntc.parse(nltk_pos_tagged)
print(chunk_tree)

(S
  (NP what/WP)
  (VP is/VBZ)
  (NP the/DT minimum/JJ salary/NN)
  (PP of/IN)
  (NP employee/NN)
  (NP whose/WP$ age/NN)
  (VP is/VBZ)
  (NP greater/JJR)
  (PP than/IN)
  (NP 29/CD))
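
Coming back to the motivation at the top of this section, the following is a minimal sketch of how the chunked query could be scanned for aggregation adjectives. It assumes the tree has been flattened to (word, POS tag, chunk tag) triples, the form that `tree2conlltags(chunk_tree)` returns; the triples here are copied by hand from the tree printed above. The `AGGREGATIONS` mapping and the `find_aggregations` helper are hypothetical names introduced only for this example.

```python
# Hypothetical mapping from adjectives to aggregation functions.
AGGREGATIONS = {"minimum": "MIN", "maximum": "MAX", "average": "AVG", "total": "SUM"}


def find_aggregations(triples):
    """Return (aggregation, noun) pairs for known adjectives preceding a noun in one NP."""
    results = []
    pending = []  # aggregation adjectives seen so far in the current NP
    for word, pos, chunk_tag in triples:
        in_np = chunk_tag.endswith("-NP")
        # Reset when we leave an NP or a new NP begins.
        if not in_np or chunk_tag.startswith("B-"):
            pending = []
        if not in_np:
            continue
        if pos.startswith("JJ") and word.lower() in AGGREGATIONS:
            pending.append(AGGREGATIONS[word.lower()])
        elif pos.startswith("NN"):
            results.extend((agg, word) for agg in pending)
            pending = []
    return results


query_triples = [
    ("what", "WP", "B-NP"),
    ("is", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"),
    ("minimum", "JJ", "I-NP"),
    ("salary", "NN", "I-NP"),
    ("of", "IN", "B-PP"),
    ("employee", "NN", "B-NP"),
    ("whose", "WP$", "B-NP"),
    ("age", "NN", "I-NP"),
    ("is", "VBZ", "B-VP"),
    ("greater", "JJR", "B-NP"),
    ("than", "IN", "B-PP"),
    ("29", "CD", "B-NP"),
]
print(find_aggregations(query_triples))
```

Restricting the search to a single NP chunk keeps "minimum" attached to "salary" and prevents it from being paired with "employee" or "age", which sit in later chunks.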
