# Assignment 1

**Due date**: 11/01/2022 (dd/mm/yyyy)

If you submit it by 11/12/2021, your assignment will be graded by 11/01/2022.


**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Summary**: Part-of-Speech (POS) tagging as Sequence Labelling using Recurrent Neural Architectures

# Execution
https://www.kaggle.com/code/tanyadayanand/pos-tagging-using-rnn

Libraries and helper functions used throughout the notebook.


In [1]:
import re
import pandas as pd
import numpy as np
import os
import urllib.request
import zipfile
import progressbar

import nltk
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, GRU
from sklearn.metrics import classification_report

nltk.download('treebank')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
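def create_embedding_matrix(filepath, word_index, embedding_dim):
  """Build a (vocab_size, embedding_dim) matrix of pre-trained vectors.

  Row i holds the vector of the word with index i in word_index.
  Words missing from the embeddings file (and the padding row 0) stay all-zero.
  """
  vocab_size = len(word_index)+1  # +1 because index 0 is reserved for padding
  embedding_matrix = np.zeros((vocab_size,embedding_dim))

  with open(filepath, encoding='utf-8') as f:
    for line in f:
      # Each line holds a word followed by the components of its vector
      word, *vector = line.split()
      if word in word_index:
        idx = word_index[word]
        embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]
  return embedding_matrix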

## 1. Corpus
### 1.1 Pre-processing

From the original tag set we removed all symbol and English punctuation tags, plus:
- FW (foreign word), because there are no examples in the test set;
- UH (interjection), because there are no examples in the test set;
- LS (list item marker), because there are no examples in the test set (and because it also denotes symbols).
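
The removal of FW, UH and LS rests on their absence from the test split. A minimal sketch of how such a check could be done is shown below. The helper `tags_missing_from` and the toy sentences are illustrative assumptions, not part of the notebook; in practice the real splits would be passed in.

```python
from collections import Counter

def tags_missing_from(reference_sents, other_sents):
    """Return the tags that occur in reference_sents but never in other_sents."""
    ref_tags = Counter(tag for sent in reference_sents for _, tag in sent)
    other_tags = {tag for sent in other_sents for _, tag in sent}
    return sorted(tag for tag in ref_tags if tag not in other_tags)

# Toy stand-ins for two splits: lists of sentences made of (word, tag) pairs
toy_train = [[("Oh", "UH"), ("dogs", "NNS"), ("bark", "VBP")]]
toy_test = [[("cats", "NNS"), ("purr", "VBP")]]

print(tags_missing_from(toy_train, toy_test))  # ['UH']
```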

In [42]:
# Get the files' list
fileids = nltk.corpus.treebank.fileids()

# Split the tagged Penn Treebank sentences by file:
# first 100 files for training, next 50 for validation, the rest for testing
train_corpus = nltk.corpus.treebank.tagged_sents(fileids[:100])
val_corpus = nltk.corpus.treebank.tagged_sents(fileids[100:150])
test_corpus = nltk.corpus.treebank.tagged_sents(fileids[150:])

In [43]:
# Tags to discard: punctuation, symbols, empty elements and the three tags listed above
remove = [':', '#', '"', '$', '-LRB-', '-RRB-', ',', '.', "''", '``', 'SYM', '-NONE-', 'FW', 'UH', 'LS']

def split_words_and_tags(corpus, excluded_tags):
  """Return parallel lists of word sequences and tag sequences, skipping excluded tags."""
  X, y = [], []
  for sentence in corpus:
    # Each sentence is a list of (word, tag) pairs
    kept = [(word, tag) for word, tag in sentence if tag not in excluded_tags]
    X.append([word for word, _ in kept])
    y.append([tag for _, tag in kept])
  return X, y

X_train, y_train = split_words_and_tags(train_corpus, remove)
X_val, y_val = split_words_and_tags(val_corpus, remove)
X_test, y_test = split_words_and_tags(test_corpus, remove)

In [44]:
# Number of distinct lower-cased words and number of distinct tags in the training set
vocab_size = len(set(word.lower() for sentence in X_train for word in sentence))
num_classes = len(set(tag for sentence in y_train for tag in sentence))

print("Total number of tagged sentences: {}".format(len(X_train)))
print("Vocabulary size: {}".format(vocab_size))
print("Total number of tags: {}".format(num_classes))

Total number of tagged sentences: 1963
Vocabulary size: 7381
Total number of tags: 32


In [45]:
# Reserve an explicit out-of-vocabulary token: without it, texts_to_sequences silently
# drops words unseen during fitting, which would misalign words and tags in X_val/X_test
word_tokenizer = Tokenizer(oov_token='<OOV>')
word_tokenizer.fit_on_texts(X_train)

X_train = word_tokenizer.texts_to_sequences(X_train)
X_val = word_tokenizer.texts_to_sequences(X_val)
X_test = word_tokenizer.texts_to_sequences(X_test)
vocab_size = len(word_tokenizer.word_index) + 1  # +1 for the padding index 0
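
Keras' `Tokenizer` skips words it has not seen during fitting unless an `oov_token` is set, which makes the encoded sentence shorter than its tag sequence. The pure-Python sketch below illustrates the idea of a dedicated out-of-vocabulary index on toy data. `build_word_index` and `encode` are hypothetical helpers written for this illustration, not part of Keras.

```python
def build_word_index(sentences, oov_token="<OOV>"):
    """Map lower-cased training words to integers; 0 is left for padding, 1 for OOV."""
    index = {oov_token: 1}
    for sentence in sentences:
        for word in sentence:
            index.setdefault(word.lower(), len(index) + 1)
    return index

def encode(sentences, index, oov_token="<OOV>"):
    """Encode sentences, sending unseen words to the OOV index so lengths are preserved."""
    oov = index[oov_token]
    return [[index.get(word.lower(), oov) for word in sentence] for sentence in sentences]

toy_train = [["The", "cat", "sleeps"]]
toy_test = [["The", "dog", "sleeps"]]

index = build_word_index(toy_train)
encoded = encode(toy_test, index)
print(encoded)  # [[2, 1, 4]]
```

Because the unseen word "dog" is mapped to index 1 instead of being dropped, the encoded sentence keeps the same length as its tag sequence.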

In [46]:
tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(y_train)

y_train = tag_tokenizer.texts_to_sequences(y_train)
y_val = tag_tokenizer.texts_to_sequences(y_val)
y_test = tag_tokenizer.texts_to_sequences(y_test)

In [47]:
# check length of longest sentence
lengths = [len(seq) for seq in X_train]
print("Length of longest sentence: {}".format(max(lengths)))

Length of longest sentence: 171


In [48]:
max_len = 100
X_train = pad_sequences(X_train,padding='post',maxlen=max_len)
X_val = pad_sequences(X_val,padding='post',maxlen=max_len)
X_test = pad_sequences(X_test,padding='post',maxlen=max_len)

y_train = pad_sequences(y_train,padding='post',maxlen=max_len)
y_val = pad_sequences(y_val,padding='post',maxlen=max_len)
y_test = pad_sequences(y_test,padding='post',maxlen=max_len)
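
Since `max_len` (100) is smaller than the longest sentence (171 tokens), some sentences are truncated. `pad_sequences` truncates from the start of the sequence by default (`truncating='pre'`), even when `padding='post'`. The NumPy sketch below mimics this behaviour on toy data; `pad_post` is a hypothetical helper used only for illustration.

```python
import numpy as np

def pad_post(sequences, max_len, truncate_from="start"):
    """Right-pad integer sequences with 0 up to max_len.

    truncate_from="start" drops leading tokens of over-long sequences,
    truncate_from="end" drops trailing ones.
    """
    out = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) > max_len:
            seq = seq[-max_len:] if truncate_from == "start" else seq[:max_len]
        out[row, :len(seq)] = seq
    return out

toy = [[5, 6], [1, 2, 3, 4, 5]]
print(pad_post(toy, max_len=4))
# [[5 6 0 0]
#  [2 3 4 5]]
print(pad_post(toy, max_len=4, truncate_from="end")[1])  # [1 2 3 4]
```

Whichever side is truncated, words and tags must be truncated the same way so that they stay aligned.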

In [49]:
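from tensorflow.keras.utils import to_categorical

# Fix the one-hot width so that all three splits get the same number of columns:
# one column per tag plus column 0 for padding
num_outputs = len(tag_tokenizer.word_index) + 1

y_train = to_categorical(y_train, num_classes=num_outputs)
y_val = to_categorical(y_val, num_classes=num_outputs)
y_test = to_categorical(y_test, num_classes=num_outputs)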

In [51]:
print(X_train.shape)
print(y_train.shape)

(1963, 100)
(1963, 100, 33)


## 2. GloVe 
GloVe (Global Vectors for Word Representation) is a method for learning vector representations of words, called word embeddings, from a large text corpus. Word embeddings are dense numerical vectors that capture semantic relationships between words in a continuous, low-dimensional space. They are commonly used as input features for natural language processing tasks such as machine translation and language modeling.

GloVe is trained on the global word-word co-occurrence statistics of a corpus, so that words appearing in similar contexts end up with similar vectors. The resulting embeddings are pre-trained on a large general-purpose corpus, as opposed to embeddings that are trained on a specific task or dataset.
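
To make "co-occurrence statistics" concrete, the sketch below counts how often word pairs appear within a small context window, weighting each pair by the inverse of the distance between the two words. This is only the counting step on toy data, under the assumption of a symmetric window of size 2; the function name is illustrative, and the actual GloVe training that fits vectors to these counts is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs within `window` tokens, weighted by 1/distance."""
    counts = defaultdict(float)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, sentence[j])] += 1.0 / abs(i - j)
    return counts

toy = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = cooccurrence_counts(toy)
print(counts[("the", "cat")])  # 2.0
print(counts[("the", "sat")])  # 0.5
```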

The glove.6B release used here provides 50-, 100-, 200- and 300-dimensional embeddings. We use the 50-dimensional version: its lower dimensionality makes the model smaller and cheaper to train, at the cost of a less expressive representation.

In [21]:
pbar = None
def show_progress(block_num, block_size, total_size):
    global pbar
    if pbar is None:
        pbar = progressbar.ProgressBar(maxval=total_size)
        pbar.start()

    downloaded = block_num * block_size
    if downloaded < total_size:
        pbar.update(downloaded)
    else:
        pbar.finish()
        pbar = None

# Download the GloVe embeddings file
url = 'http://nlp.stanford.edu/data/glove.6B.zip'
urllib.request.urlretrieve(url, 'glove.6B.zip', show_progress)

# Extract the zip file
with zipfile.ZipFile('glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall()

  0% (1064960 of 862182613) |            | Elapsed Time: 0:00:00 ETA:   0:10:47

KeyboardInterrupt: ignored

In [17]:
# Load the GloVe embeddings into a dictionary
embedding_dict = {}
with open('glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_dict[word] = coefs

# Print the number of words in the embeddings dictionary
print(f'Found {len(embedding_dict)} word vectors.')

Found 400000 word vectors.


In [18]:
def find_closest_embeddings(embedding):
    return sorted(embedding_dict.keys(), key=lambda word: np.linalg.norm(embedding_dict[word]- embedding))[:5]

find_closest_embeddings(embedding_dict['iphone'])

['iphone', 'ipad', 'smartphone', 'ipod', 'android']
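
The function above ranks neighbours by Euclidean distance. Cosine similarity is a common alternative for word vectors because it ignores vector length. A minimal sketch on toy two-dimensional vectors follows; `closest_by_cosine` and the toy embeddings are assumptions made for illustration and are not taken from GloVe.

```python
import numpy as np

def closest_by_cosine(query, embeddings, top_k=3):
    """Rank words by cosine similarity to `query` (higher means closer)."""
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words])
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:top_k]
    return [words[i] for i in order]

toy_embeddings = {
    "phone": np.array([1.0, 0.1]),
    "tablet": np.array([0.9, 0.2]),
    "banana": np.array([0.0, 1.0]),
}
print(closest_by_cosine(toy_embeddings["phone"], toy_embeddings, top_k=2))
# ['phone', 'tablet']
```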

In [52]:
input_dim = X_train.shape[1]
embedding_dim = 50
embedding_matrix = create_embedding_matrix(f'glove.6B.{embedding_dim}d.txt', word_tokenizer.word_index, embedding_dim)
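
Words that are not found in the GloVe file keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is actually covered. A small sketch, using a hypothetical helper `embedding_coverage` and a toy matrix in place of the real one:

```python
import numpy as np

def embedding_coverage(embedding_matrix):
    """Fraction of vocabulary rows (excluding padding row 0) with a non-zero vector."""
    rows = embedding_matrix[1:]
    covered = np.count_nonzero(np.any(rows != 0, axis=1))
    return covered / len(rows)

# Toy matrix: row 0 is padding, rows 1-2 were found, row 3 was not
toy_matrix = np.array([[0.0, 0.0], [0.1, 0.2], [0.3, 0.0], [0.0, 0.0]])
print(embedding_coverage(toy_matrix))  # 0.6666666666666666
```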

## 3. Model
### 3.1 Baseline (MACRO f1 0.84)
A bidirectional LSTM layer processes the sequence in both the forward and the backward direction, so the representation of each token captures context from both the preceding and the following words. This is particularly useful for POS tagging, where the tag of a word often depends on the words around it.

In [53]:
from tensorflow.keras.layers import TimeDistributed

# Define the model
model = tf.keras.Sequential(name='Baseline')

# Add the Embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, \
                    weights = [embedding_matrix], input_length = max_len, trainable=True))

# Add the Bidirectional LSTM layer
model.add(Bidirectional(LSTM(units=128, return_sequences=True)))

# Add the Dense/Fully-Connected layer, applied to every time step
# (33 outputs = 32 tags + the padding index 0)
model.add(TimeDistributed(Dense(33, activation='softmax')))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()

Model: "Baseline"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 50)           369150    
                                                                 
 bidirectional_1 (Bidirectio  (None, 100, 256)         183296    
 nal)                                                            
                                                                 
 time_distributed_1 (TimeDis  (None, 100, 33)          8481      
 tributed)                                                       
                                                                 
Total params: 560,927
Trainable params: 560,927
Non-trainable params: 0
_________________________________________________________________
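
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights and a bias). The helper names are illustrative, and the vocabulary size of 7383 is inferred from the summary itself (369150 / 50).

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embedding_dim, units, n_outputs = 7383, 50, 128, 33

emb = embedding_params(vocab_size, embedding_dim)
bilstm = 2 * lstm_params(embedding_dim, units)  # forward + backward LSTM
dense = dense_params(2 * units, n_outputs)      # BiLSTM concatenates both directions

print(emb, bilstm, dense, emb + bilstm + dense)
# 369150 183296 8481 560927
```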


In [54]:
results = model.fit(X_train, y_train, epochs=10, verbose = True, validation_data=(X_val,y_val), batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [66]:
y_pred = model.predict(X_test)



In [71]:
from sklearn.utils.multiclass import type_of_target

print(y_pred.shape)
print(y_test.shape)


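# Caution: [:, -1] keeps only the last time step of each sentence, which is padding
# for every sentence shorter than max_len, and astype(int) truncates the softmax
# probabilities to 0. The report below therefore does not measure per-token tagging quality.
y_test = y_test[:, -1].astype(int)
y_pred = y_pred[:, -1].astype(int)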

print(type_of_target(y_pred))
print(type_of_target(y_test))

(652, 100, 33)
(652, 100, 33)
multilabel-indicator
multilabel-indicator
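
A token-level evaluation usually compares one predicted tag per real token and ignores padding. A minimal NumPy sketch of that idea follows, assuming class index 0 is the padding index (as produced by `pad_sequences` and `to_categorical` above). `token_level_labels` and `macro_f1` are hypothetical helpers, and the arrays are toy data, not results from this notebook.

```python
import numpy as np

def token_level_labels(y_true_onehot, y_prob):
    """Turn (sentences, max_len, classes) arrays into flat label vectors without padding."""
    true_idx = y_true_onehot.argmax(axis=-1)
    pred_idx = y_prob.argmax(axis=-1)
    mask = true_idx != 0  # class 0 is the padding index
    return true_idx[mask], pred_idx[mask]

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 over the classes present in y_true."""
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

# Toy batch: 2 sentences, max_len 3, classes {0: padding, 1, 2}
y_true = np.eye(3)[[[1, 2, 0], [2, 2, 1]]]
y_prob = np.array([
    [[0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.9, 0.05, 0.05]],
    [[0.1, 0.6, 0.3], [0.0, 0.1, 0.9], [0.2, 0.7, 0.1]],
])
t, p = token_level_labels(y_true, y_prob)
print(t, p)            # [1 2 2 2 1] [1 2 1 2 1]
print(macro_f1(t, p))  # 0.8
```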


In [74]:
# th = 0.1
# y_pred[:][y_pred >= th] = 1 
# y_pred[:][y_pred  < th] = 0

tags_train = [
    'CC',
    'CD',
    'DT',
    'EX',
    'IN',
    'JJ',
    'JJR',
    'JJS',
    'MD',
    'NN',
    'NNP',
    'NNPS',
    'NNS',
    'PDT',
    'POS',
    'PRP',
    'PRP$',
    'RB',
    'RBR',
    'RBS',
    'RP',
    'TO',
    'VB',
    'VBD',
    'VBG',
    'VBN',
    'VBP',
    'VBZ',
    'WDT',
    'WP',
    'WP$',
    'WRB'
]

print(classification_report(y_test, y_pred, target_names = tags_train, zero_division=True))

              precision    recall  f1-score   support

          CC       1.00      0.00      0.00       652
          CD       1.00      1.00      1.00         0
          DT       1.00      1.00      1.00         0
          EX       1.00      1.00      1.00         0
          IN       1.00      1.00      1.00         0
          JJ       1.00      1.00      1.00         0
         JJR       1.00      1.00      1.00         0
         JJS       1.00      1.00      1.00         0
          MD       1.00      1.00      1.00         0
          NN       1.00      1.00      1.00         0
         NNP       1.00      1.00      1.00         0
        NNPS       1.00      1.00      1.00         0
         NNS       1.00      1.00      1.00         0
         PDT       1.00      1.00      1.00         0
         POS       1.00      1.00      1.00         0
         PRP       1.00      1.00      1.00         0
        PRP$       1.00      1.00      1.00         0
          RB       1.00    

### 3.2 GRU 
The only architecture that does not work.

In [75]:
from tensorflow.keras.layers import TimeDistributed

# Define the model
model = tf.keras.Sequential(name='GRU')

# Add the Embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, \
                    weights = [embedding_matrix], input_length = max_len, trainable=True))

# Add the GRU layer
model.add(GRU(units=32, return_sequences=True))

# Add the Dense/Fully-Connected layer, applied to every time step.
# The output size must match the one-hot targets: 32 tags + the padding index 0 = 33
model.add(TimeDistributed(Dense(y_train.shape[-1], activation='softmax')))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()

Model: "GRU"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 50)           369150    
                                                                 
 gru (GRU)                   (None, 100, 32)           8064      
                                                                 
 time_distributed_2 (TimeDis  (None, 100, 33)          1089      
 tributed)                                                       
                                                                 
Total params: 378,303
Trainable params: 378,303
Non-trainable params: 0
_________________________________________________________________


In [76]:
results = model.fit(X_train, y_train, epochs=10, verbose = True, validation_data=(X_val,y_val), batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [24]:
y_pred = model.predict(X_test)



In [26]:
th = 0.1
y_pred[y_pred >= th] = 1 
y_pred[y_pred  < th] = 0

print(classification_report(y_test, y_pred, target_names = tags_train, zero_division=True))

              precision    recall  f1-score   support

          CC       1.00      0.00      0.00       366
          CD       1.00      0.00      0.00       858
          DT       0.10      1.00      0.18      1335
          EX       1.00      0.00      0.00         5
          IN       0.12      1.00      0.21      1630
          JJ       1.00      0.00      0.00       918
         JJR       1.00      0.00      0.00        59
         JJS       1.00      0.00      0.00        31
          MD       1.00      0.00      0.00       167
          NN       0.17      1.00      0.30      2383
         NNP       0.11      1.00      0.20      1504
        NNPS       1.00      0.00      0.00        44
         NNS       1.00      0.00      0.00       941
         PDT       1.00      0.00      0.00         4
         POS       1.00      0.00      0.00       152
         PRP       1.00      0.00      0.00       192
        PRP$       1.00      0.00      0.00        99
          RB       1.00    
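
One possible reason for rows with recall 1.00 and very low precision is the thresholding step: a softmax output can have several classes above 0.1, so a single token may be marked as belonging to multiple tags at once. The toy example below contrasts thresholding with taking the most probable class; the probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical softmax output for one token over four classes
probs = np.array([0.50, 0.25, 0.20, 0.05])

multi_hot = (probs >= 0.1).astype(int)  # thresholding, as in the cell above
single = int(probs.argmax())            # one predicted class per token

print(multi_hot)  # [1 1 1 0]
print(single)     # 0
```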

### 3.3 Additional LSTM layer (MACRO f1 0.82) 
Stacking two bidirectional LSTM layers can let the model learn more complex patterns in the data, which may lead to more accurate predictions.
However, the extra layer increases the computational cost of the model, so training requires more time and resources.

Indeed, training was slower here and the results were similar to those of the baseline architecture.

In [27]:
# Define the model
model = tf.keras.Sequential(name='Additional_LSTM')

# Add the Embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, \
                    weights = [embedding_matrix], input_length = max_len, trainable=True))

# Add the Bidirectional LSTM layer
model.add(Bidirectional(LSTM(units=64, return_sequences=True)))

# Add another LSTM layer
model.add(Bidirectional(LSTM(units=64)))

# Add the Dense/Fully-Connected layer
model.add(Dense(units=len(tags_train), activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()

Model: "Additional_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 50, 50)            158450    
                                                                 
 bidirectional_1 (Bidirectio  (None, 50, 128)          58880     
 nal)                                                            
                                                                 
 bidirectional_2 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense_2 (Dense)             (None, 32)                4128      
                                                                 
Total params: 320,274
Trainable params: 320,274
Non-trainable params: 0
_________________________________________________________________


In [28]:
results = model.fit(X_train, y_train, epochs=10, verbose = True, validation_data=(X_val,y_val), batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [29]:
y_pred = model.predict(X_test)



In [31]:
th = 0.1
y_pred[y_pred >= th] = 1 
y_pred[y_pred  < th] = 0

print(classification_report(y_test, y_pred, target_names = tags_train, zero_division=True))

              precision    recall  f1-score   support

          CC       0.99      0.96      0.98       366
          CD       0.98      1.00      0.99       858
          DT       0.91      1.00      0.95      1335
          EX       0.71      1.00      0.83         5
          IN       0.76      1.00      0.86      1630
          JJ       0.85      0.99      0.91       918
         JJR       0.77      1.00      0.87        59
         JJS       0.84      1.00      0.91        31
          MD       0.97      1.00      0.98       167
          NN       0.86      0.99      0.92      2383
         NNP       0.81      0.97      0.88      1504
        NNPS       0.71      0.89      0.79        44
         NNS       0.95      0.99      0.97       941
         PDT       0.22      0.50      0.31         4
         POS       0.93      0.95      0.94       152
         PRP       0.99      1.00      0.99       192
        PRP$       0.99      1.00      0.99        99
          RB       0.86    

### 3.4 Additional dense layer (MACRO f1 0.85)

Using two dense layers, one with a non-linear activation function and one with a softmax activation function, is a common pattern in neural network architectures for classification tasks.

The purpose of the non-linear dense layer is to introduce non-linearity into the model, which can allow the model to learn more complex patterns in the data. Common choices for the activation function in this layer include ReLU (Rectified Linear Unit), sigmoid, and tanh.

The purpose of the softmax dense layer is to produce a probability distribution over the possible classes: the softmax activation function maps the output of the preceding layer to non-negative values that sum to 1. This is what a classification task needs, since we want the predicted probability that an input belongs to each of the possible classes.
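
The two-layer head described above can be sketched as a plain NumPy forward pass. The weights below are random and the layer sizes are arbitrary toy values, so this only illustrates the computation, not the trained Keras model.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row-wise maximum for numerical stability
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4))  # 2 inputs with 4 features each
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

hidden = relu(features @ w1 + b1)   # non-linear dense layer
probs = softmax(hidden @ w2 + b2)   # softmax dense layer

print(probs.shape)         # (2, 3)
print(probs.sum(axis=-1))  # each row sums to 1, up to rounding
```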

In [34]:
# Define the model
model = tf.keras.Sequential(name='Baseline')

# Add the Embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, \
                    weights = [embedding_matrix], input_length = max_len, trainable=True))

# Add the Bidirectional LSTM layer
model.add(Bidirectional(LSTM(units=128)))

# Add another Dense layer
model.add(Dense(units=256, activation='relu'))

# Add the Dense/Fully-Connected layer
model.add(Dense(units=len(tags_train), activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()

Model: "Baseline"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 50, 50)            158450    
                                                                 
 bidirectional_4 (Bidirectio  (None, 256)              183296    
 nal)                                                            
                                                                 
 dense_5 (Dense)             (None, 256)               65792     
                                                                 
 dense_6 (Dense)             (None, 32)                8224      
                                                                 
Total params: 415,762
Trainable params: 415,762
Non-trainable params: 0
_________________________________________________________________


In [35]:
results = model.fit(X_train, y_train, epochs=10, verbose = True, validation_data=(X_val,y_val), batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [36]:
y_pred = model.predict(X_test)



In [37]:
th = 0.1
y_pred[y_pred >= th] = 1 
y_pred[y_pred  < th] = 0

print(classification_report(y_test, y_pred, target_names = tags_train, zero_division=True))

              precision    recall  f1-score   support

          CC       0.81      1.00      0.89       366
          CD       1.00      1.00      1.00       858
          DT       0.99      0.99      0.99      1335
          EX       0.83      1.00      0.91         5
          IN       0.95      0.99      0.97      1630
          JJ       0.84      0.99      0.91       918
         JJR       0.79      1.00      0.88        59
         JJS       0.91      1.00      0.95        31
          MD       0.97      1.00      0.98       167
          NN       0.88      0.99      0.93      2383
         NNP       0.84      0.97      0.90      1504
        NNPS       0.59      0.95      0.73        44
         NNS       0.96      1.00      0.98       941
         PDT       0.12      0.50      0.20         4
         POS       0.94      0.95      0.94       152
         PRP       1.00      1.00      1.00       192
        PRP$       1.00      1.00      1.00        99
          RB       0.75    