### Word Sense Disambiguation

Given the following sentences:

    The agent will book the to the show for the entire family.
    But you can generally book tickets online.
    When you book tickets online they provide you with a book of stamps
    
In the sentences above, the word book is used in different senses. In the first two sentences, and in the first clause of the third, book (verb) means 'reserve', while in the second clause of the third sentence book (noun) refers to a physical object.

## DATA

In [2]:
data = """The agent will book the to the show for the entire family.
But you can generally book tickets online.
When you book tickets online they provide you with a book of stamps"""

### Importing Packages

In [3]:
import numpy as np
import pandas as pd

import nltk

from nltk.corpus import wordnet

# Tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize, PunktSentenceTokenizer, RegexpTokenizer

# Lesk Module
from nltk.wsd import lesk

## Part - 1

    Use the Lesk module to find the sense of the word *book* in each of the sentences above. Record your observations.

### Observations

In [4]:
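# Split the text into sentences (note: `token` holds sentences, not words)
token = sent_tokenize(data)

for line in token:
    print("LINE:", line, "\n")
    # NOTE: lesk() expects a list of word tokens; `line` is a raw string here,
    # so its context is built from single characters rather than words.
    sense = lesk(line, 'book')
    print(sense)
    print(sense.definition(), "\n--------------------------------------------\n")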

LINE: The agent will book the to the show for the entire family. 

Synset('script.n.01')
a written version of a play or other dramatic composition; used in preparing for a performance 
--------------------------------------------

LINE: But you can generally book tickets online. 

Synset('script.n.01')
a written version of a play or other dramatic composition; used in preparing for a performance 
--------------------------------------------

LINE: When you book tickets online they provide you with a book of stamps 

Synset('script.n.01')
a written version of a play or other dramatic composition; used in preparing for a performance 
--------------------------------------------
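
All three sentences receive the same sense above. One likely reason is that `lesk` builds its context as a set of the items in its first argument, so a raw string contributes single characters rather than words. The following is a minimal pure-Python sketch of that overlap idea, not NLTK's implementation; `simplified_lesk` and the glosses in `TOY_GLOSSES` are invented for illustration and are not WordNet definitions.

```python
# Toy glosses, invented for illustration; these are NOT WordNet definitions.
TOY_GLOSSES = {
    "book.reserve": "arrange for and reserve tickets or a seat in advance",
    "book.volume": "a number of sheets such as stamps bound together",
}


def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most items with the context."""
    context_items = set(context)  # a str yields characters, a list yields words

    def overlap(sense):
        return len(context_items & set(glosses[sense].split()))

    # Sorting makes ties deterministic: the alphabetically first sense wins.
    return max(sorted(glosses), key=overlap)


sentence = "they provide you with a book of stamps"

# Raw string: only one-letter gloss words such as "a" can match.
print(simplified_lesk(sentence, TOY_GLOSSES))
# Token list: whole words such as "of" and "stamps" can match.
print(simplified_lesk(sentence.split(), TOY_GLOSSES))
```

With the real module, the equivalent change would be to pass `word_tokenize(line)` instead of `line` as the first argument.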



## Part - 2

Tag the sentences using the Brill tagger.

### Brill Tagger

The BrillTagger class is a **transformation-based tagger**. It uses a series of rules
to correct the results of an initial tagger. Each rule is scored by the number of
errors it corrects minus the number of new errors it introduces.

The idea is simple: the Brill tagger tries to correct the mistakes made by the initial tagger. Its trainer takes an initial tagger and a set of templates, and uses the templates to generate candidate rules automatically from the training set.
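
The scoring rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not NLTK's implementation: a rule is represented here as a hypothetical `(from_tag, to_tag, previous_tag)` tuple, and `apply_rule` and `score_rule` are names made up for this example. The initial tags are the ones `nltk.pos_tag` assigns to the first sentence in this notebook, and the gold tags differ only in that *book* is corrected to `VB`.

```python
def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the preceding tag equals previous_tag."""
    from_tag, to_tag, previous_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == previous_tag:
            out[i] = to_tag
    return out


def score_rule(initial, gold, rule):
    """Score = errors corrected minus new errors introduced."""
    proposed = apply_rule(initial, rule)
    triples = list(zip(initial, proposed, gold))
    fixed = sum(1 for i, p, g in triples if i != g and p == g)
    broken = sum(1 for i, p, g in triples if i == g and p != g)
    return fixed - broken


# "The agent will book the to the show for the entire family ."
initial = ["DT", "NN", "MD", "NN", "DT", "TO", "DT", "NN", "IN", "DT", "JJ", "NN", "."]
gold = ["DT", "NN", "MD", "VB", "DT", "TO", "DT", "NN", "IN", "DT", "JJ", "NN", "."]

# A useful rule: NN becomes VB after a modal.
print(score_rule(initial, gold, ("NN", "VB", "MD")))
# A harmful rule: NN becomes VB after a determiner.
print(score_rule(initial, gold, ("NN", "VB", "DT")))
```

A trainer keeps the candidate rules with the highest scores and discards those that do more harm than good.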

**Recommended Steps:**

1. Tag the sentences with a POS tagger, then observe the POS tags assigned to the word book in each context.
2. Create a gold tagged sentence from the POS tagger's output by correcting the mistakes it made.
3. Create a Brill tagger from an initial tagger (the POS tagger) and a set of templates (rule patterns).
4. Train the Brill tagger on the corrected tagged sentence.
5. Test the Brill Tagger on the following sentences:
       > "I bought this book from Kerala"
       > "He will book tickets to Kerala"
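
Step 2 can be sketched as follows. The helper `correct_tags` is hypothetical and only illustrates the data shape: each sentence is a list of `(word, tag)` pairs, and a Brill trainer is given a list of such sentences, even when there is only one. The tags are copied from the `nltk.pos_tag` output for the first sentence, and the only correction is *book* from `NN` to `VB`.

```python
# POS tagger output for the first sentence (Penn Treebank tags).
pos_tagged = [("The", "DT"), ("agent", "NN"), ("will", "MD"), ("book", "NN"),
              ("the", "DT"), ("to", "TO"), ("the", "DT"), ("show", "NN"),
              ("for", "IN"), ("the", "DT"), ("entire", "JJ"), ("family", "NN"),
              (".", ".")]


def correct_tags(sentence, fixes):
    """Return a copy of `sentence` with tags replaced at the given token indexes."""
    return [(word, fixes.get(i, tag)) for i, (word, tag) in enumerate(sentence)]


# "book" (index 3) follows the modal "will", so it is a verb here.
gold_sentence = correct_tags(pos_tagged, {3: "VB"})

# Training data is a list of sentences, so a single sentence is still wrapped in a list.
train_sents = [gold_sentence]
print(train_sents[0][3])
```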

### Data

In [5]:
token

['The agent will book the to the show for the entire family.',
 'But you can generally book tickets online.',
 'When you book tickets online they provide you with a book of stamps']

### Training Data

In [6]:
from nltk.corpus import brown

# Sentences
brown_sents = brown.sents() 

# TRAINING DATA = Brown Tagged Sentences
brown_tagged_sents = brown.tagged_sents()

print(brown_sents, '\n\n------------------------------- \n')
print(brown_tagged_sents)

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...] 

------------------------------- 

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('t

### Testing Data

In [7]:
test_tagged_sents = []


# Using POS TAGGER to tag testing data
for line in token:
    
    words = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(words)
    test_tagged_sents.append(tagged)
    
print(test_tagged_sents)

[[('The', 'DT'), ('agent', 'NN'), ('will', 'MD'), ('book', 'NN'), ('the', 'DT'), ('to', 'TO'), ('the', 'DT'), ('show', 'NN'), ('for', 'IN'), ('the', 'DT'), ('entire', 'JJ'), ('family', 'NN'), ('.', '.')], [('But', 'CC'), ('you', 'PRP'), ('can', 'MD'), ('generally', 'RB'), ('book', 'NN'), ('tickets', 'NNS'), ('online', 'VBP'), ('.', '.')], [('When', 'WRB'), ('you', 'PRP'), ('book', 'NN'), ('tickets', 'NNS'), ('online', 'VBP'), ('they', 'PRP'), ('provide', 'VBP'), ('you', 'PRP'), ('with', 'IN'), ('a', 'DT'), ('book', 'NN'), ('of', 'IN'), ('stamps', 'NNS')]]


### Taggers Used:
    - Unigram
    - N-gram (Bigram)
    - Brill

##### Importing different Taggers

In [8]:
from nltk.tag import UnigramTagger, BigramTagger, BrillTagger

### Unigram Tagger - Using training data

In [9]:
# Training Unigram
tagger = UnigramTagger(brown_tagged_sents)

for line in token:
    words = word_tokenize(line)
    unigram_tagged = tagger.tag(words)
    
    print(unigram_tagged, "\n")

[('The', 'AT'), ('agent', 'NN'), ('will', 'MD'), ('book', 'NN'), ('the', 'AT'), ('to', 'TO'), ('the', 'AT'), ('show', 'VB'), ('for', 'IN'), ('the', 'AT'), ('entire', 'JJ'), ('family', 'NN'), ('.', '.')] 

[('But', 'CC'), ('you', 'PPSS'), ('can', 'MD'), ('generally', 'RB'), ('book', 'NN'), ('tickets', 'NNS'), ('online', None), ('.', '.')] 

[('When', 'WRB'), ('you', 'PPSS'), ('book', 'NN'), ('tickets', 'NNS'), ('online', None), ('they', 'PPSS'), ('provide', 'VB'), ('you', 'PPSS'), ('with', 'IN'), ('a', 'AT'), ('book', 'NN'), ('of', 'IN'), ('stamps', 'NNS')] 



### N-gram (Bigram) Tagger - Using training data

In [15]:
# Training Bigram tagger (no backoff: an unseen (previous tag, word) context is tagged None)
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)

for line in token:
    words = nltk.word_tokenize(line)
    bigram_tagged = bigram_tagger.tag(words)
    
    print(bigram_tagged,"\n\n")

[('The', 'AT'), ('agent', 'NN'), ('will', 'MD'), ('book', None), ('the', None), ('to', None), ('the', None), ('show', None), ('for', None), ('the', None), ('entire', None), ('family', None), ('.', None)] 


[('But', 'CC'), ('you', 'PPSS'), ('can', 'MD'), ('generally', 'RB'), ('book', None), ('tickets', None), ('online', None), ('.', None)] 


[('When', 'WRB'), ('you', 'PPSS'), ('book', None), ('tickets', None), ('online', None), ('they', None), ('provide', None), ('you', None), ('with', None), ('a', None), ('book', None), ('of', None), ('stamps', None)] 
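
Almost every tag above is `None`. A bigram tagger looks up the pair (previous tag, current word); once one pair is unseen the tag is `None`, and that `None` then becomes the previous tag for the next word, so the failure cascades. A common remedy is to fall back to a unigram tagger and then to a default tag. The sketch below illustrates that in plain Python on a tiny made-up corpus; it is a simplified model of the idea, not NLTK's `BigramTagger`, and every function name in it is hypothetical.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus (illustration only).
TOY_CORPUS = [
    [("you", "PRP"), ("can", "MD"), ("book", "VB"), ("tickets", "NNS")],
    [("they", "PRP"), ("book", "VB"), ("tickets", "NNS")],
    [("a", "DT"), ("book", "NN"), ("of", "IN"), ("stamps", "NNS")],
]


def most_frequent(table):
    """Reduce {context: Counter of tags} to {context: most frequent tag}."""
    return {context: counts.most_common(1)[0][0] for context, counts in table.items()}


def train(tagged_sents):
    """Build lookup tables for words and for (previous tag, word) pairs."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = None
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    return most_frequent(uni), most_frequent(bi)


def tag(words, bigram, unigram=None, default=None):
    """Tag with the bigram table, optionally backing off to unigram, then default."""
    tags = []
    for word in words:
        prev = tags[-1] if tags else None
        guess = bigram.get((prev, word))
        if guess is None and unigram is not None:
            guess = unigram.get(word, default)
        tags.append(guess)
    return list(zip(words, tags))


unigram, bigram = train(TOY_CORPUS)
words = ["you", "will", "book", "tickets"]
print(tag(words, bigram))                         # no backoff: None cascades
print(tag(words, bigram, unigram, default="NN"))  # with backoff
```

In NLTK the corresponding mechanism is the `backoff` argument of the tagger constructors, for example a `BigramTagger` backing off to a `UnigramTagger`, which in turn backs off to a `DefaultTagger`.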




### Brill Tagger - Using training data

In [18]:
from nltk.tag import brill, UnigramTagger, BrillTaggerTrainer

# Initial tagger. Note: it emits Brown tags (e.g. 'AT'), while the gold sentence below uses Penn tags (e.g. 'DT').
unigram_tagger = UnigramTagger(brown_tagged_sents)

# Rule templates: tags of the following words and the preceding word (duplicate templates removed)
templates = [brill.Template(brill.Pos([1])),
    brill.Template(brill.Pos([2])),
    brill.Template(brill.Pos([1, 2])),
    brill.Template(brill.Pos([1, 3])),
    brill.Template(brill.Word([-1]))]

trainer = BrillTaggerTrainer(initial_tagger=unigram_tagger, templates=templates, trace=3, deterministic=True)

# The trainer expects a list of tagged sentences, so the single gold sentence is wrapped in a list.
# (Passing a flat list of (word, tag) pairs raises "ValueError: too many values to unpack (expected 2)".)
train_data = [[('The', 'DT'), ('agent', 'NN'), ('will', 'MD'), ('book', 'VB'), ('the', 'DT'), ('to', 'TO'), ('the', 'DT'), ('show', 'NN'), ('for', 'IN'), ('the', 'DT'), ('entire', 'JJ'), ('family', 'NN'), ('.', '.')]]

brill_tagger = trainer.train(train_data, max_rules=10)

# tag() expects a list of word tokens, not a raw string
brill_tagger.tag(word_tokenize(token[0]))


## Part - 3

    Repeat Part 1, this time passing the POS tags produced by the Brill tagger to Lesk.
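
`nltk.wsd.lesk` accepts an optional `pos` argument that restricts the candidate senses to one WordNet part of speech (`'n'`, `'v'`, `'a'` or `'r'`), whereas a tagger trained on Penn Treebank tags returns labels such as `NN` or `VB`. A small conversion step is therefore needed. The sketch below shows one possible mapping; `penn_to_wordnet_pos` is a hypothetical helper, not part of NLTK.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag is None:
        return None
    prefixes = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}
    return prefixes.get(tag[:2])


# Tags as a tagger might return them for two uses of "book" (illustrative input).
for word, tag in [("book", "VB"), ("book", "NN"), ("online", None)]:
    print(word, tag, penn_to_wordnet_pos(tag))
```

The result could then be passed as `lesk(tokens, 'book', pos=...)`, where `tokens` is the word-tokenized sentence.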
    