# Parts-of-speech (POS) Tagging
In this assignment we will perform POS tagging using RNNs.
For the task, we will use the NLTK treebank corpus with the universal tagset. It consists of 3914 tagged sentences and 12 distinct tags.
You need to perform the following:
- Create the model structure, which must include at least the following: an embedding layer, one or more RNN layers, and an output dense layer for many-to-many sequence modeling.
- Train the system on the train set. Use 15% of the train set as validation data in the call to the fit() function.
- Do the final evaluation on the test set using accuracy as the main metric. **Important note: while computing the accuracy, make sure not to include the padding in the input and the output. Otherwise your accuracy score may be unrealistically high. Similarly, you can use ``mask_zero=True`` in the Embedding layer during training**.


## Read the dataset

In [2]:
import numpy as np
import nltk
from matplotlib import pyplot as plt

In [3]:
from nltk.corpus import treebank
# nltk.download('treebank')
# nltk.download('universal_tagset')
treebank_corpus = treebank.tagged_sents(tagset='universal')
len(treebank_corpus)

3914

### Check a sample

In [4]:
treebank_corpus[0]

[('Pierre', 'NOUN'),
 ('Vinken', 'NOUN'),
 (',', '.'),
 ('61', 'NUM'),
 ('years', 'NOUN'),
 ('old', 'ADJ'),
 (',', '.'),
 ('will', 'VERB'),
 ('join', 'VERB'),
 ('the', 'DET'),
 ('board', 'NOUN'),
 ('as', 'ADP'),
 ('a', 'DET'),
 ('nonexecutive', 'ADJ'),
 ('director', 'NOUN'),
 ('Nov.', 'NOUN'),
 ('29', 'NUM'),
 ('.', '.')]

## Split the data into input and output

In [5]:
X = [] # store input sequence
Y = [] # store output sequence

for sentence in treebank_corpus:
    X_sentence = []
    Y_sentence = []
    for entity in sentence:         
        X_sentence.append(entity[0])  # entity[0] contains the word
        Y_sentence.append(entity[1])  # entity[1] contains corresponding tag
        
    X.append(X_sentence)
    Y.append(Y_sentence)

In [6]:
num_words = len(set([word.lower() for sentence in X for word in sentence]))
tag_set = set([word for sentence in Y for word in sentence])
num_tags   = len(tag_set)
print("Total number of tagged sentences: {}".format(len(X)))
print("Vocabulary size: {}".format(num_words))
print("Total number of tags: {}".format(num_tags))
print("Tag set:", tag_set)

Total number of tagged sentences: 3914
Vocabulary size: 11387
Total number of tags: 12
Tag set: {'X', 'CONJ', 'ADV', 'DET', 'PRON', '.', 'ADP', 'VERB', 'ADJ', 'NOUN', 'NUM', 'PRT'}


In [7]:
# let's look at first data point
print('sample X: ', X[0], '\n')
print('sample Y: ', Y[0], '\n')

sample X:  ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'] 

sample Y:  ['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.'] 



In [8]:
# check length of longest sentence
lengths = [len(seq) for seq in X]
print("Length of longest sentence: {}".format(max(lengths)))

Length of longest sentence: 271


In [9]:
# encode X
from tensorflow.keras.preprocessing.text import Tokenizer
word_tokenizer = Tokenizer()                      # instantiate tokeniser
word_tokenizer.fit_on_texts(X)                    # fit tokeniser on data
X_encoded = word_tokenizer.texts_to_sequences(X)  # use the tokeniser to encode input sequence

In [10]:
# encode Y
tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(Y)
Y_encoded = tag_tokenizer.texts_to_sequences(Y)
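
The encoding step above can be pictured with a small pure-Python sketch. It is not the Keras implementation, only an illustration of the behaviour relied on here: tokens are lowercased, ranked by frequency, and numbered from 1, which leaves index 0 free for padding. The helper names `build_index` and `encode` and the toy data are made up for this example.

```python
from collections import Counter

def build_index(sequences):
    """Map each lowercased token to an integer, most frequent first, starting at 1."""
    counts = Counter(token.lower() for seq in sequences for token in seq)
    # Index 0 is deliberately left unused so it can serve as the padding value later.
    return {token: i for i, (token, _) in enumerate(counts.most_common(), start=1)}

def encode(sequences, index):
    """Replace every token by its integer index."""
    return [[index[token.lower()] for token in seq] for seq in sequences]

# Toy tag sequences (hypothetical, not taken from the corpus).
toy_tags = [["NOUN", "VERB", "NOUN"], ["DET", "NOUN"]]
toy_index = build_index(toy_tags)
toy_encoded = encode(toy_tags, toy_index)
print(toy_index)
print(toy_encoded)
```

This is also why `NOUN`, the most frequent tag in the sample, is encoded as 1, and why the tags come back lowercased when decoding.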

In [11]:
# look at first encoded data point

print("** Raw data point **", "\n", "-"*100, "\n")
print('X: ', X[0], '\n')
print('Y: ', Y[0], '\n')
print()
print("** Encoded data point **", "\n", "-"*100, "\n")
print('X: ', X_encoded[0], '\n')
print('Y: ', Y_encoded[0], '\n')

** Raw data point ** 
 ---------------------------------------------------------------------------------------------------- 

X:  ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'] 

Y:  ['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.'] 


** Encoded data point ** 
 ---------------------------------------------------------------------------------------------------- 

X:  [5601, 3746, 1, 2024, 86, 331, 1, 46, 2405, 2, 131, 27, 6, 2025, 332, 459, 2026, 3] 

Y:  [1, 1, 3, 8, 1, 7, 3, 2, 2, 5, 1, 4, 5, 7, 1, 1, 8, 3] 



## Split the dataset into train and test

In [12]:
from sklearn.model_selection import train_test_split
TEST_SIZE = 0.15
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y_encoded, test_size=TEST_SIZE, random_state=777)

## 1. Pad the sequences

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical  # same tensorflow.keras namespace as the other preprocessing imports
MAX_SEQ_LENGTH = 271 # max length of sequence
X_train_padded = pad_sequences(X_train, maxlen=MAX_SEQ_LENGTH, padding='post')
X_test_padded = pad_sequences(X_test,  maxlen=MAX_SEQ_LENGTH, padding='post')
Y_train_padded = pad_sequences(Y_train, maxlen=MAX_SEQ_LENGTH, padding='post')
Y_test_padded = pad_sequences(Y_test,  maxlen=MAX_SEQ_LENGTH, padding='post')

In [14]:
X_train = np.array(X_train_padded)
X_test = np.array(X_test_padded)
Y_train = to_categorical(np.array(Y_train_padded))
Y_test = to_categorical(np.array(Y_test_padded))
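
As a rough picture of what the two preprocessing steps do, here is a NumPy-only sketch of post-padding and one-hot encoding. The helpers `pad_post` and `one_hot` are hypothetical stand-ins, not the Keras functions. One known difference: `pad_sequences` truncates from the front by default, which does not matter in this notebook because `MAX_SEQ_LENGTH` equals the longest sentence.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Right-pad integer sequences with zeros up to maxlen."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]  # assumption for this sketch: cut overly long sequences at the end
        padded[row, :len(trimmed)] = trimmed
    return padded

def one_hot(indices, num_classes):
    """Turn an integer array of shape (n, t) into one-hot vectors of shape (n, t, num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[indices]

# Toy encoded tags with 3 real classes; class 0 is the padding class.
toy = [[1, 2, 1], [3, 1]]
toy_padded = pad_post(toy, maxlen=4)
toy_one_hot = one_hot(toy_padded, num_classes=4)
print(toy_padded)
print(toy_one_hot.shape)
```

The extra class for index 0 is the reason the one-hot targets in this notebook have 13 channels for 12 tags.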

In [15]:
# print number of samples in each set
print("TRAINING DATA")
print('Shape of input sequences: {}'.format(X_train.shape))
print('Shape of output sequences: {}'.format(Y_train.shape))
print("-"*50)


print("TESTING DATA")
print('Shape of input sequences: {}'.format(X_test.shape))
print('Shape of output sequences: {}'.format(Y_test.shape))

TRAINING DATA
Shape of input sequences: (3326, 271)
Shape of output sequences: (3326, 271, 13)
--------------------------------------------------
TESTING DATA
Shape of input sequences: (588, 271)
Shape of output sequences: (588, 271, 13)


## 2. Develop an RNN-based model
The RNN can be an LSTM, a GRU, or a bidirectional variant of either.
You need at least the following:
- An embedding layer
- One or more RNN layers
- An output dense layer with softmax activation

Once you have developed your model architecture, compile and train it. Use 15% of the training data as validation data during fit.

Plot the training and validation losses.

In [16]:
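# import from tensorflow.keras so that layers and preprocessing come from the same package
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input
from tensorflow.keras.models import Model, Sequential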

In [17]:


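input_layer = Input(shape=(MAX_SEQ_LENGTH,))
# +1 because index 0 is reserved for padding; mask_zero=True makes downstream layers skip padded steps
# (input_length is omitted: the Input layer already fixes the sequence length)
x = Embedding(input_dim=num_words + 1, output_dim=128, mask_zero=True)(input_layer)
print(x.shape)
# return_sequences=True yields one output per time step (many-to-many)
x = Bidirectional(LSTM(units=64, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags + 1, activation="softmax"))(x)  # softmax over 12 tags + padding class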
model = Model(input_layer, out)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])


Metal device set to: Apple M1 Pro

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

(None, 271, 128)


In [18]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 271)]             0         
                                                                 
 embedding (Embedding)       (None, 271, 128)          1457664   
                                                                 
 bidirectional (Bidirectiona  (None, 271, 128)         98816     
 l)                                                              
                                                                 
 time_distributed (TimeDistr  (None, 271, 13)          1677      
 ibuted)                                                         
                                                                 
Total params: 1,558,157
Trainable params: 1,558,157
Non-trainable params: 0
_________________________________________________________________
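
The parameter counts in the summary can be checked by hand. The short calculation below is a sanity check, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen for this example.

```python
vocab_size = 11387   # vocabulary size printed earlier
num_tags = 12
embedding_dim = 128
lstm_units = 64

# One embedding vector per word index, plus one for the padding index 0.
embedding_params = (vocab_size + 1) * embedding_dim

# Each of the 4 LSTM gates has input weights, recurrent weights, and a bias.
lstm_params_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
bilstm_params = 2 * lstm_params_one_direction

# The dense layer sees both directions concatenated (2 * lstm_units inputs) plus a bias.
dense_params = (2 * lstm_units + 1) * (num_tags + 1)

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# prints: 1457664 98816 1677 1558157
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.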


In [19]:
# validation_split=0.15 holds out 15% of the training data, as the task requires
# history = model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_split=0.15, verbose=1)
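
For the loss plot, one possible sketch is given here. It assumes the return value of `model.fit(...)` is kept in a variable (called `history` here, a name chosen for this example) and that validation data was used, so that `history.history` contains both `"loss"` and `"val_loss"`. The numbers in `dummy_history` are placeholders for illustration only, not results from this notebook.

```python
from matplotlib import pyplot as plt

def plot_losses(history_dict):
    """Plot training and validation loss per epoch from a Keras-style history dict."""
    epochs = range(1, len(history_dict["loss"]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history_dict["loss"], label="training loss")
    ax.plot(epochs, history_dict["val_loss"], label="validation loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("categorical cross-entropy")
    ax.legend()
    return ax

# Placeholder values, only to show the call; replace with history.history after training.
dummy_history = {"loss": [1.0, 0.5, 0.3], "val_loss": [1.1, 0.6, 0.5]}
ax = plot_losses(dummy_history)
```

After training, the call would be `plot_losses(history.history)`. A validation curve that flattens or rises while the training curve keeps falling is the usual sign to stop training earlier.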

## 3. Evaluate on the test set
Once you finalize the model based on validation set loss and accuracy, you should do the final evaluation on the test set.

**Important note: while computing the accuracy, make sure not to include the padding in the input and the output. Otherwise your accuracy score may be unrealistically high.**
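
One way to leave padding out of the score is sketched below with NumPy. It assumes one-hot targets in which class 0 is the padding class, as produced by the preprocessing in this notebook; the function name `masked_accuracy` and the toy arrays are made up for this example.

```python
import numpy as np

def masked_accuracy(y_true_one_hot, y_pred_probs, pad_index=0):
    """Token-level accuracy over positions whose true label is not padding."""
    true_ids = np.argmax(y_true_one_hot, axis=-1)
    pred_ids = np.argmax(y_pred_probs, axis=-1)
    mask = true_ids != pad_index          # True only for real tokens
    correct = (true_ids == pred_ids) & mask
    return float(np.sum(correct) / np.sum(mask))

# Toy case: one sentence with 2 real tokens and 2 padded positions, 3 classes (0 = padding).
toy_true = np.eye(3)[np.array([[1, 2, 0, 0]])]
toy_pred = np.eye(3)[np.array([[1, 1, 0, 0]])]
toy_masked = masked_accuracy(toy_true, toy_pred)
toy_naive = float(np.mean(np.argmax(toy_true, -1) == np.argmax(toy_pred, -1)))
print(toy_masked, toy_naive)
# prints: 0.5 0.75
```

In the toy case the naive score counts the two correctly "predicted" padding positions and reports 0.75, while only one of the two real tokens is right. With sentences padded to 271 tokens the inflation is far larger. On the real data the call would look like `masked_accuracy(Y_test, model.predict(X_test))`.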

## 4. Provide a new sample input sentence, then predict and display its tag sequence
You may need to convert the result back from indexes/codes to labels such as NOUN, VERB, ADJ.
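
The conversion back to labels can be sketched as follows. The helper `decode_tags` is hypothetical, and the small `index_to_tag` dictionary only repeats a few index/tag pairs visible in the encoded sample earlier; in the notebook the full mapping should be available as `tag_tokenizer.index_word`. Skipping class 0 in the argmax is a design choice made for this sketch, so that a real token is never decoded as padding.

```python
import numpy as np

def decode_tags(probs, sentence_length, index_to_tag):
    """Convert a (seq_len, num_classes) probability array into tag labels."""
    # Keep only the real tokens and ignore column 0 (the padding class).
    ids = np.argmax(probs[:sentence_length, 1:], axis=-1) + 1
    # The tokenizer lowercased the tags, so convert back to upper case.
    return [index_to_tag[int(i)].upper() for i in ids]

# Subset of the tag index, matching the encoded sample shown earlier.
index_to_tag = {1: "noun", 2: "verb", 3: ".", 4: "adp", 5: "det"}

# Toy "prediction" for a 3-word sentence padded to length 4 (6 classes incl. padding).
toy_probs = np.eye(6)[np.array([5, 1, 2, 0])]
toy_labels = decode_tags(toy_probs, sentence_length=3, index_to_tag=index_to_tag)
print(toy_labels)
# prints: ['DET', 'NOUN', 'VERB']
```

One caveat when encoding a new sentence: a `Tokenizer` created without an `oov_token` drops words it has not seen, so the encoded sequence can be shorter than the sentence and the tags would no longer line up with the words.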