# <u>Part of Speech Tagger</u>
This notebook builds a model that tags each word of an English sentence with its part of speech (POS). Given a sentence, the model outputs one tag per token. The tags come from NLTK's universal tagset:<br>
    
**ADJ - Adjective<br>
ADP - Adposition<br>
ADV - Adverb<br>
PRT - Particle<br>
PRON - Pronoun<br>
. - Punctuation marks<br>
X - Other<br>
VERB - Verb<br>
CONJ - Conjunction<br>
DET - Determiner / Article<br>
NOUN - Noun<br>
NUM - Numeral<br>**

In [40]:
import nltk
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [69]:
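# Download the training data from NLTK:
# the Brown corpus and the mapping to the universal tagset
nltk.download('brown')
nltk.download('universal_tagset')

# all labels the model can predict: two special tokens plus the 12 universal tags
all_tags = ['<EOS>','<UNK>','ADV', 'NOUN', 'ADP', 'PRON', 'DET',
            '.', 'PRT', 'VERB', 'X', 'NUM', 'CONJ', 'ADJ']
# load the Brown sentences as lists of (word, tag) pairs using the universal tagset
train_data = nltk.corpus.brown.tagged_sents(tagset='universal')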

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\SEEKER\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\SEEKER\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


In [3]:
train_data

[[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')], [('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB'), ('over-all', 'ADJ'), ('charge', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('election', 'NOUN'), (',', '.'), ('``', '.'), ('deserves', 'VERB'), ('the', 'DET'), ('praise', 'NOUN'), ('and', 'CONJ'), ('thanks', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('City

In [70]:
# lowercase every word, keeping each sentence as a list of (word, tag) pairs
train_data = [[(word.lower(), tag) for word, tag in sentence] for sentence in train_data]
train_data[1]

[('the', 'DET'),
 ('jury', 'NOUN'),
 ('further', 'ADV'),
 ('said', 'VERB'),
 ('in', 'ADP'),
 ('term-end', 'NOUN'),
 ('presentments', 'NOUN'),
 ('that', 'ADP'),
 ('the', 'DET'),
 ('city', 'NOUN'),
 ('executive', 'ADJ'),
 ('committee', 'NOUN'),
 (',', '.'),
 ('which', 'DET'),
 ('had', 'VERB'),
 ('over-all', 'ADJ'),
 ('charge', 'NOUN'),
 ('of', 'ADP'),
 ('the', 'DET'),
 ('election', 'NOUN'),
 (',', '.'),
 ('``', '.'),
 ('deserves', 'VERB'),
 ('the', 'DET'),
 ('praise', 'NOUN'),
 ('and', 'CONJ'),
 ('thanks', 'NOUN'),
 ('of', 'ADP'),
 ('the', 'DET'),
 ('city', 'NOUN'),
 ('of', 'ADP'),
 ('atlanta', 'NOUN'),
 ("''", '.'),
 ('for', 'ADP'),
 ('the', 'DET'),
 ('manner', 'NOUN'),
 ('in', 'ADP'),
 ('which', 'DET'),
 ('the', 'DET'),
 ('election', 'NOUN'),
 ('was', 'VERB'),
 ('conducted', 'VERB'),
 ('.', '.')]

### Building Vocabulary Mappings
We now build the vocabulary from the training data, together with the mappings from words to indices and back.

In [71]:
from collections import Counter, defaultdict

#### Create Vocabulary Dictionary

In [72]:
word_counts = Counter()
# count word frequencies; only the 11,000 most frequent words go into our vocabulary
for sentence in train_data:
    words, tags = zip(*sentence)
    word_counts.update(words)

# keep the most frequent words; indices 0 and 1 are reserved for <EOS> and <UNK>
top_words = list(zip(*word_counts.most_common(11000)))[0]
vocab = ['<EOS>','<UNK>'] + list(top_words) 
vocab

['<EOS>',
 '<UNK>',
 'the',
 ',',
 '.',
 'of',
 'and',
 'to',
 'a',
 'in',
 'that',
 'is',
 'was',
 'he',
 'for',
 '``',
 "''",
 'it',
 'with',
 'as',
 'his',
 'on',
 'be',
 ';',
 'at',
 'by',
 'i',
 'this',
 'had',
 '?',
 'not',
 'are',
 'but',
 'from',
 'or',
 'have',
 'an',
 'they',
 'which',
 '--',
 'one',
 'you',
 'were',
 'her',
 'all',
 'she',
 'there',
 'would',
 'their',
 'we',
 'him',
 'been',
 ')',
 'has',
 '(',
 'when',
 'who',
 'will',
 'more',
 'if',
 'no',
 'out',
 'so',
 'said',
 'what',
 'up',
 'its',
 'about',
 ':',
 'into',
 'than',
 'them',
 'can',
 'only',
 'other',
 'new',
 'some',
 'could',
 'time',
 '!',
 'these',
 'two',
 'may',
 'then',
 'do',
 'first',
 'any',
 'my',
 'now',
 'such',
 'like',
 'our',
 'over',
 'man',
 'me',
 'even',
 'most',
 'made',
 'also',
 'after',
 'did',
 'many',
 'before',
 'must',
 'af',
 'through',
 'back',
 'years',
 'where',
 'much',
 'your',
 'way',
 'well',
 'down',
 'should',
 'because',
 'each',
 'just',
 'those',
 'people',
 '
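Capping the vocabulary means every word outside the top 11,000 is later mapped to `<UNK>`. One way to judge a cap is to measure how many tokens it still covers. The sketch below shows the idea on a small, made-up set of sentences; `toy_sentences` and `coverage` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Hypothetical stand-in for the lowercased Brown sentences.
toy_sentences = [
    [("the", "DET"), ("jury", "NOUN"), ("said", "VERB"), (".", ".")],
    [("the", "DET"), ("election", "NOUN"), ("was", "VERB"),
     ("conducted", "VERB"), (".", ".")],
    [("the", "DET"), ("jury", "NOUN"), ("deserves", "VERB"),
     ("praise", "NOUN"), (".", ".")],
]

def coverage(sentences, vocab_size):
    """Fraction of tokens whose word is among the `vocab_size` most frequent words."""
    counts = Counter(word for sentence in sentences for word, _ in sentence)
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(vocab_size))
    return covered / total

# Coverage grows quickly at first because a few words account for most tokens.
for size in (2, 3, 9):
    print(size, round(coverage(toy_sentences, size), 3))
```

Running the same function on `train_data` with `vocab_size=11000` would show what share of Brown tokens keep their own index rather than falling back to `<UNK>`.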

#### Create vocabulary mappings

In [74]:
# create word to index mapping
# for every unknown word the dict will give index 1 which is <UNK>
word_to_idx = defaultdict(lambda:1, {word:idx for idx,word in tqdm(enumerate(vocab))})
# create reverse mapping
idx_to_word = {idx:word for word,idx in word_to_idx.items()}

11002it [00:00, 1100148.59it/s]
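A `defaultdict` has a side effect worth knowing: looking up a missing key with `[]` also stores that key. The small sketch below uses a made-up five-word vocabulary (`toy_vocab` is illustrative) to show the fallback to `<UNK>` and the growth of the dictionary. This is why the reverse mapping is best built right after the forward one, before any unknown word has been looked up.

```python
from collections import defaultdict

toy_vocab = ["<EOS>", "<UNK>", "the", "jury", "said"]  # hypothetical mini vocabulary

# Same construction as in the notebook: unknown words fall back to index 1 (<UNK>).
word_to_idx = defaultdict(lambda: 1, {word: idx for idx, word in enumerate(toy_vocab)})
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

known = word_to_idx["jury"]          # a word that is in the vocabulary
size_before = len(word_to_idx)
unknown = word_to_idx["zyzzyva"]     # a missing word: returns the default index
size_after = len(word_to_idx)        # one larger, because the lookup stored the key

# .get() gives the same fallback without modifying the dictionary.
safe = word_to_idx.get("quokka", 1)

print(known, unknown, size_before, size_after, safe)
```

If the dictionary should stay at exactly the vocabulary size, `word_to_idx.get(word, 1)` is a drop-in alternative for lookups.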


#### Create tag mappings

In [75]:
# create tag to index mapping
tag_to_idx = {tag:idx for idx,tag in tqdm(enumerate(all_tags))}
# create reverse mapping
idx_to_tag = {idx:tag for tag,idx in tag_to_idx.items()}

14it [00:00, ?it/s]


### Prepare data for Keras model
Each word and tag is replaced by its integer index, and the sentences are padded to a common length so that a batch can be fed to the model as a single matrix.

In [80]:
# convert sentences of tokens into a padded matrix of integer ids
def convert_to_num(sentences, token_to_idx, pad=0, dtype='int32', time_major=False):
    # find the max sentence length
    max_sent_len = max(map(len, sentences))
    # create the matrix
    mat = np.empty([len(sentences), max_sent_len], dtype)
    # fill with padding
    mat.fill(pad)
    
    # convert to numerical mappings
    for i, sentence in enumerate(sentences):
        num_row = [token_to_idx[token] for token in sentence]
        mat[i, :len(num_row)] = num_row
        
    if time_major:
        return mat.T
    else:
        return mat

In [81]:
batch_words,batch_tags = zip(*[zip(*sentence) for sentence in train_data])

print("Word ids:")
print(convert_to_num(batch_words,word_to_idx))
print("Tag ids:")
print(convert_to_num(batch_tags,tag_to_idx))

Word ids:
[[   2 5434  652 ...    0    0    0]
 [   2 1635  440 ...    0    0    0]
 [   2    1 1372 ...    0    0    0]
 ...
 [   2 3057    5 ...    0    0    0]
 [  45   12    8 ...    0    0    0]
 [  33   64   26 ...    0    0    0]]
Tag ids:
[[6 3 3 ... 0 0 0]
 [6 3 2 ... 0 0 0]
 [6 3 3 ... 0 0 0]
 ...
 [6 3 4 ... 0 0 0]
 [5 9 6 ... 0 0 0]
 [4 6 5 ... 0 0 0]]
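The padding value is 0, which is also the index of `<EOS>` in both mappings, so every padded position reads as `<EOS>`. The sketch below repeats the padding step on two made-up tag sequences and then decodes the ids back to tags using the original sentence lengths. `pad_to_matrix`, `toy_tag_to_idx` and `toy_tags` are illustrative names, and the toy mapping uses fewer tags than the notebook.

```python
import numpy as np

def pad_to_matrix(sentences, token_to_idx, pad=0, dtype="int32"):
    """Same idea as convert_to_num: one row per sentence, right-padded with `pad`."""
    max_len = max(map(len, sentences))
    mat = np.full((len(sentences), max_len), pad, dtype=dtype)
    for i, sentence in enumerate(sentences):
        row = [token_to_idx[token] for token in sentence]
        mat[i, :len(row)] = row
    return mat

# Hypothetical reduced tag set; index 0 doubles as the padding value.
toy_tag_to_idx = {"<EOS>": 0, "<UNK>": 1, "DET": 2, "NOUN": 3, "VERB": 4, ".": 5}
toy_idx_to_tag = {idx: tag for tag, idx in toy_tag_to_idx.items()}

toy_tags = [("DET", "NOUN", "VERB", "."), ("NOUN", ".")]
tag_ids = pad_to_matrix(toy_tags, toy_tag_to_idx)
print(tag_ids)

# Decode each row back to tags, cutting off the padding with the true lengths.
lengths = [len(sentence) for sentence in toy_tags]
decoded = [[toy_idx_to_tag[int(i)] for i in row[:n]]
           for row, n in zip(tag_ids, lengths)]
print(decoded)
```

Keeping the original lengths (or a mask such as `tag_ids != 0`) is what lets padded positions be ignored later, for example when computing accuracy.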


## <u>Model</u>
We will build the tagger with Keras.

In [82]:
import keras
import keras.layers as L
from keras.utils import to_categorical

Using TensorFlow backend.
