# Assignment 3
Training a neural named entity recognition (NER) tagger 

In [None]:
import torch
import torch.nn as nn
import numpy as np
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
import warnings
warnings.filterwarnings('ignore')

In this assignment you are required to build a full training and testing pipeline for a neural sequential tagger for named entities, using an LSTM.

The dataset that you will be working on is called ReCoNLL 2003, which is a corrected version of the CoNLL 2003 dataset: https://www.clips.uantwerpen.be/conll2003/ner/

[Train data](https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing)

[Dev data](https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing)

[Test data](https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing)

As you can see, the annotated texts are labeled according to the IOB annotation scheme, for 3 entity types: Person, Organization, Location.
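
To make the IOB scheme concrete, here is a minimal sketch that groups a tag sequence into entity spans. The helper name `iob_to_spans` and the example sentence are hypothetical and are not part of the dataset or the assignment code.

```python
def iob_to_spans(tags):
    """Group IOB tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # A "B-" tag (or a stray "I-" tag with no open span) starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical sentence, used only for illustration.
words = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]

spans = iob_to_spans(tags)
entities = [(etype, " ".join(words[s:e])) for etype, s, e in spans]
print(entities)  # [('PER', 'John Smith'), ('ORG', 'Acme Corp'), ('LOC', 'Paris')]
```

B- opens an entity, I- continues it, and O marks words outside any entity.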

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Task 1:** Write a function for reading the data from a single file (one of those provided above). The function receives a filepath and encodes every sentence individually as a pair of lists: one list contains the words and the other contains the tags. Each list pair is added to a general list (data), which the function returns.

In [None]:
def read_data(filepath):
    # initialize a general list of list pairs ([word], [tag])
    data = []
    # open the file; the context manager closes it when done
    with open(filepath) as file_:
      # words and tags of the sentence currently being read
      words = []
      tags = []
      for line in file_:
        # a blank line marks the end of a sentence
        if not line.strip():
          # skip empty sentences caused by consecutive blank lines
          if words:
            data.append((words, tags))
          words = []
          tags = []
        # otherwise the line contains a word and its tag
        else:
          word, tag = line.split()
          words.append(word)
          tags.append(tag)
      # keep the last sentence if the file does not end with a blank line
      if words:
        data.append((words, tags))
    return data

In [None]:
# train = read_data('https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing')
# dev = read_data('https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing')
# test = read_data('https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing')
train = read_data('/content/drive/MyDrive/connl03_train.txt')
test = read_data('/content/drive/MyDrive/connl03_test.txt')
dev = read_data('/content/drive/MyDrive/connl03_dev.txt')

In [None]:
# print(train)

The following Vocab class can serve as a dictionary that maps words and tags to IDs. The UNK_TOKEN should be used for words that are not part of the training data.

In [None]:
UNK_TOKEN = 0

class Vocab:
    def __init__(self):
        self.word2id = {"__unk__": UNK_TOKEN}
        self.id2word = {UNK_TOKEN: "__unk__"}
        self.n_words = 1
        
        self.tag2id = {"O":0, "B-PER":1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "B-ORG": 5, "I-ORG": 6}
        self.id2tag = {0:"O", 1:"B-PER", 2:"I-PER", 3:"B-LOC", 4:"I-LOC", 5:"B-ORG", 6:"I-ORG"}
        
    def index_words(self, words):
      word_indexes = [self.index_word(w) for w in words]
      return word_indexes

    def index_tags(self, tags):
      tag_indexes = [self.tag2id[t] for t in tags]
      return tag_indexes
    
    def index_word(self, w):
        if w not in self.word2id:
            self.word2id[w] = self.n_words
            self.id2word[self.n_words] = w
            self.n_words += 1
        return self.word2id[w]
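
Note that `index_word` above adds every word it sees to the vocabulary. One way to apply the UNK_TOKEN to unseen words is to look IDs up without inserting. The sketch below assumes a vocabulary built from the training data only; the helper name `lookup_words` and the toy dictionary are hypothetical.

```python
UNK_TOKEN = 0


def lookup_words(words, word2id):
    """Map words to IDs without growing the vocabulary; unseen words get UNK_TOKEN."""
    return [word2id.get(w, UNK_TOKEN) for w in words]


# Toy vocabulary, as if it had been built from training data only (hypothetical contents).
word2id = {"__unk__": UNK_TOKEN, "Germany": 1, "won": 2}

ids = lookup_words(["Germany", "won", "yesterday"], word2id)
print(ids)  # [1, 2, 0]
```

With a lookup like this, dev and test words that never appeared in training share the `__unk__` embedding row instead of receiving rows that were never trained.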
            

**Task 2:** Write a function prepare_data that takes one of [train, dev, test] and the Vocab instance, and converts each pair of (words, tags) to a pair of index lists. Each pair should be added to data_sequences, which the function returns.

In [None]:
vocab = Vocab()

def prepare_data(data, vocab):
    # initialize a return value
    data_sequences = []
    # loop over a list of list pairs ([word], [tag])
    for words, tags in data:
      # convert each pair to a pair of indexes
      data_sequences.append((vocab.index_words(words), vocab.index_tags(tags)))

    return data_sequences, vocab

In [None]:
train_sequences, vocab = prepare_data(train, vocab)
dev_sequences, vocab = prepare_data(dev, vocab)
test_sequences, vocab = prepare_data(test, vocab)

In [None]:
# print(train_sequences, vocab)

**Task 3:** Write NERNet, a PyTorch Module for labeling words with NER tags. 

*input_size:* the size of the vocabulary

*embedding_size:* the size of the embeddings

*hidden_size:* the LSTM hidden size

*output_size:* the number of tags we are predicting

*n_layers:* the number of layers we want to use in LSTM

*directions:* can be 1 or 2, indicating a unidirectional or bidirectional LSTM, respectively

The input for your forward function should be a single sentence tensor.

*note:* the embeddings in this section are learned embeddings. That means you don't need to use pretrained embeddings like the ones used in class. You will use pretrained embeddings in the GloVe task at the end of the assignment.
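
As a rough illustration of how these parameters fit together, the sketch below traces tensor shapes for a single sentence through an embedding layer, a bidirectional LSTM, and a linear layer. The sizes are toy values chosen for illustration, not the ones required by the assignment.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
vocab_size, embedding_size, hidden_size, n_tags = 20, 8, 16, 7

embedding = nn.Embedding(vocab_size, embedding_size)
lstm = nn.LSTM(embedding_size, hidden_size, num_layers=2, bidirectional=True)
# A bidirectional LSTM concatenates both directions, hence hidden_size * 2.
out = nn.Linear(hidden_size * 2, n_tags)

sentence = torch.LongTensor([3, 7, 1, 12, 5])   # a sentence of 5 word IDs
embedded = embedding(sentence)                  # shape (5, 8)
# nn.LSTM expects (seq_len, batch, features) by default, so add a batch dim of 1.
lstm_out, _ = lstm(embedded.unsqueeze(1))       # shape (5, 1, 32)
scores = out(lstm_out.squeeze(1))               # shape (5, 7): one score per tag per word

print(embedded.shape, lstm_out.shape, scores.shape)
```

With directions set to 1, the LSTM output would have `hidden_size` features instead of `hidden_size * 2`, and the linear layer's input size would shrink accordingly.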

In [None]:
class NERNet(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size, n_layers, directions):
        super(NERNet, self).__init__()
        self.direction = directions != 1  # True for a bidirectional LSTM
        self.input_features = hidden_size * 2 if self.direction else hidden_size
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, n_layers, bidirectional=self.direction)
        self.out = nn.Linear(self.input_features, output_size)

    
    def forward(self, input_sentence):
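        # the sentence is a 1-D tensor of word IDs; treat it as a batch of size 1
        dim = len(input_sentence)
        hidden = None  # let the LSTM start from zero-initialized states
        embedded = self.embedding(input_sentence)
        # nn.LSTM expects (seq_len, batch, features) by default
        lstm_output, _ = self.lstm(embedded.view(dim, 1, -1), hidden)
        # drop the batch dimension and score every tag for every word
        output = self.out(lstm_output.view(dim, -1))
        return output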
    

**Task 4:** Write a training loop that takes a model (an instance of NERNet) and the number of epochs to train for. The loss is always CrossEntropyLoss and the optimizer is always Adam.
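
To show what CrossEntropyLoss expects for a tagging task, here is a minimal sketch of repeated optimization steps on one toy "sentence". A plain `nn.Linear` stands in for the tagger, and the random features and tag IDs are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a tagger: any module mapping (seq_len, features) to (seq_len, n_tags).
model = nn.Linear(4, 7)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

features = torch.randn(5, 4)                    # 5 tokens, 4 features each
gold_tags = torch.LongTensor([0, 1, 2, 0, 3])   # one tag ID per token

losses = []
for _ in range(20):
    optimizer.zero_grad()                 # clear gradients from the previous step
    scores = model(features)              # (5, 7) unnormalized scores
    loss = criterion(scores, gold_tags)   # mean cross entropy over the 5 tokens
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss takes raw scores of shape (seq_len, n_tags) and integer targets of shape (seq_len,), so no softmax layer is needed in the model itself.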

In [None]:
def train_loop(model, n_epochs):
  # Loss function
  criterion = nn.CrossEntropyLoss()

  # Optimizer (Adam is a variant of SGD with adaptive learning rates)
  optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
  
  for e in range(1, n_epochs + 1):
    for seq in train_sequences:
      sentence, tags = seq
      # skip empty sentences before building any tensors
      if len(sentence) == 0:
        continue

      sentence_tensor = torch.LongTensor(sentence).cuda()
      tags_tensor = torch.LongTensor(tags).cuda()

      # reset gradients, run the forward pass, backpropagate, update weights
      optimizer.zero_grad()
      scores = model(sentence_tensor)
      criterion(scores, tags_tensor).backward()
      optimizer.step()

**Task 5:** Write an evaluation loop for a trained model, using the dev and test datasets. This function prints the recall, also known as the true positive rate (TPR), and the precision (the fraction of predicted labels that are correct) of each label separately (7 labels in total), and for all 6 labels (except O) together. The caption argument should be included as a prefix when printing the results.
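
As a sanity check on the definitions, the sketch below computes micro-averaged precision and recall over a chosen set of labels in plain Python. The helper name `micro_precision_recall` and the toy label lists are hypothetical. The result should correspond to what scikit-learn's `precision_score` and `recall_score` return with `labels=[1,2,3,4,5,6]` and `average='micro'`.

```python
def micro_precision_recall(true_ids, pred_ids, labels):
    """Micro-averaged precision and recall restricted to the given label IDs."""
    # True positives: the prediction is correct and the label is one we care about.
    tp = sum(1 for t, p in zip(true_ids, pred_ids) if t == p and t in labels)
    predicted = sum(1 for p in pred_ids if p in labels)   # TP + FP
    actual = sum(1 for t in true_ids if t in labels)      # TP + FN
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy tag IDs (0 is "O"), made up for illustration.
true_ids = [0, 1, 2, 0, 3, 0, 5, 6]
pred_ids = [0, 1, 0, 3, 3, 0, 5, 0]

precision, recall = micro_precision_recall(true_ids, pred_ids, labels={1, 2, 3, 4, 5, 6})
print(precision, recall)  # 0.75 0.6
```

Here 3 of the 4 predicted entity tags are correct (precision 0.75), while 3 of the 5 gold entity tags are recovered (recall 0.6).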

In [None]:
def evaluate(model, caption):
  labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
  datasets = [dev_sequences, test_sequences]
  print("tested model is: " + caption + "\n")

  for data in datasets:
    all_pred_labels = []
    all_true_labels = []

    # collect predictions for every sentence in the dataset
    for sentence, tags in data:
      sentence_tensor = torch.LongTensor(sentence).cuda()
      # no gradients are needed at evaluation time
      with torch.no_grad():
        _, preds = model(sentence_tensor).T.max(0)
      preds_list = preds.tolist()
      all_pred_labels += preds_list
      all_true_labels += tags

    # compute the metrics once per dataset, after all predictions are collected;
    # passing all 7 label IDs keeps the per-label scores aligned with `labels`
    seperate_recall = recall_score(all_true_labels,
                                   all_pred_labels,
                                   labels=[0,1,2,3,4,5,6],
                                   average=None)
    average_recall = recall_score(all_true_labels,
                                  all_pred_labels,
                                  labels=[1,2,3,4,5,6],
                                  average='micro')

    seperate_precision = precision_score(all_true_labels,
                                         all_pred_labels,
                                         labels=[0,1,2,3,4,5,6],
                                         average=None)
    average_precision = precision_score(all_true_labels,
                                        all_pred_labels,
                                        labels=[1,2,3,4,5,6],
                                        average='micro')

    if data is dev_sequences:
      print("Dev dataset results:\n")
    else:
      print("Test dataset results:\n")

    print("Recall scores:\n")
    for tag, score in zip(labels, seperate_recall):
      print("tag: {}, Recall: {:.4f}".format(tag, score))
    print("\nThe Recall result for all labels together (except O): {}\n".format(average_recall))

    print("Precision scores:\n")
    for tag, score in zip(labels, seperate_precision):
      print("tag: {}, Precision: {:.4f}".format(tag, score))
    print("\nThe Precision result for all labels together (except O): {}".format(average_precision))

    print("\n------------------------------------------------------------------------------------")
    print("------------------------------------------------------------------------------------\n")

**Task 6:** Train and evaluate a few models, all with embedding_size=300 and with the following hyperparameters (you may use them as captions for the models as well):

Model 1: (hidden_size: 500, n_layers: 1, directions: 1)

Model 2: (hidden_size: 500, n_layers: 2, directions: 1)

Model 3: (hidden_size: 500, n_layers: 3, directions: 1)

Model 4: (hidden_size: 500, n_layers: 1, directions: 2)

Model 5: (hidden_size: 500, n_layers: 2, directions: 2)

Model 6: (hidden_size: 500, n_layers: 3, directions: 2)

Model 7: (hidden_size: 800, n_layers: 1, directions: 2)

Model 8: (hidden_size: 800, n_layers: 2, directions: 2)

Model 9: (hidden_size: 800, n_layers: 3, directions: 2)

In [None]:
EMBEDDING_SIZE = 300
EPOCHS = 10
INPUT_SIZE = len(vocab.word2id)
OUTPUT_SIZE = len(vocab.tag2id)

models_config = [{'hidden_size': 500, 'layers':1, 'directions':1},
                {'hidden_size': 500, 'layers':2, 'directions':1},
                {'hidden_size': 500, 'layers':3, 'directions':1},
                {'hidden_size': 500, 'layers':1, 'directions':2},
                {'hidden_size': 500, 'layers':2, 'directions':2},
                {'hidden_size': 500, 'layers':3, 'directions':2},
                {'hidden_size': 800, 'layers':1, 'directions':2},
                {'hidden_size': 800, 'layers':2, 'directions':2},
                {'hidden_size': 800, 'layers':3, 'directions':2}]

for i, config in enumerate(models_config, 1):
    model_name = "Model " + str(i)
    hidden_size = config['hidden_size']
    n_layers = config['layers']
    directions = config['directions']
    model = NERNet(INPUT_SIZE, EMBEDDING_SIZE, hidden_size, OUTPUT_SIZE, n_layers, directions).cuda()
    train_loop(model, EPOCHS)
    evaluate(model, model_name)

tested model is: Model 1

Dev dataset results:

Recall scores:

tag: O, Recall: 0.9596
tag: B-PER, Recall: 0.6700
tag: I-PER, Recall: 0.6815
tag: B-LOC, Recall: 0.7049
tag: I-LOC, Recall: 0.4783
tag: B-ORG, Recall: 0.5893
tag: I-ORG, Recall: 0.4224

The Recall result for all labels together (except O): 0.6245572609208973

Precision scores:

tag: O, Precision: 0.9273
tag: B-PER, Precision: 0.6802
tag: I-PER, Precision: 0.7985
tag: B-LOC, Precision: 0.7544
tag: I-LOC, Precision: 0.7857
tag: B-ORG, Precision: 0.6346
tag: I-ORG, Precision: 0.7313

The Precision result for all labels together (except O): 0.7158322056833559

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------

Test dataset results:

Recall scores:

tag: O, Recall: 0.9603
tag: B-PER, Recall: 0.6705
tag: I-PER, Recall: 0.6757
tag: B-LOC, Recall: 0.6939
tag: I-LOC, Recall: 0.5094
tag: B-ORG, Recall: 0.5886
tag: 

**Task 7:** Download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ (use the 300-dim vectors from glove.6B.zip). Then initialize the nn.Embedding module in your NERNet with these embeddings, so that you can start your training with pre-trained vectors. Repeat Task 6 and print the results for each model.

Note: make sure that the vectors are aligned with the IDs in your Vocab. In other words, make sure that, for example, the vector of the word with ID 0 is the first row in the GloVe matrix that you initialize nn.Embedding with. For a discussion on how to do that, see this link:
https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222
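
The alignment requirement can be sketched with NumPy alone: row i of the matrix holds the vector of the word whose ID is i. The helper name `build_embedding_matrix`, the toy vocabulary, and the 3-dimensional vectors are hypothetical. The lowercase fallback is an assumption of this sketch, motivated by the glove.6B vectors being uncased.

```python
import numpy as np


def build_embedding_matrix(word2id, vectors, dimension):
    """Return a (len(word2id), dimension) matrix whose row i is the vector of the word with ID i."""
    matrix = np.zeros((len(word2id), dimension), dtype="float32")
    for word, index in word2id.items():
        vector = vectors.get(word)
        if vector is None:
            # Assumption of this sketch: fall back to the lowercase form.
            vector = vectors.get(word.lower())
        if vector is not None:
            matrix[index] = vector
        # Words without a vector keep an all-zero row.
    return matrix


# Toy stand-ins for Vocab.word2id and for the parsed GloVe file.
word2id = {"__unk__": 0, "Paris": 1, "won": 2, "Zzyzx": 3}
vectors = {
    "paris": np.array([1.0, 2.0, 3.0], dtype="float32"),
    "won": np.array([4.0, 5.0, 6.0], dtype="float32"),
}

matrix = build_embedding_matrix(word2id, vectors, dimension=3)
print(matrix)
```

A matrix built this way could then be converted with `torch.from_numpy` and passed to `nn.Embedding.from_pretrained`.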

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
GLOVE_PATH = 'glove.6B.300d.txt'


In [None]:
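def load_embeddings(path, word2id=vocab.word2id, dimension=300):
    # rows stay zero for words that have no GloVe vector
    embeddings = np.zeros((len(word2id), dimension))
    with open(path, encoding='utf-8') as f:
        # stream the file line by line instead of loading it all at once
        for line in f:
            values = line.split()
            word = values[0]
            index = word2id.get(word)
            # compare against None so that a valid ID of 0 is not skipped
            if index is not None:
                vector = np.array(values[1:], dtype='float32')
                embeddings[index] = vector
    return torch.from_numpy(embeddings).float()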

In [None]:
weights = load_embeddings(GLOVE_PATH)

In [None]:
for i, config in enumerate(models_config, 1):
    model_name = "Model " + str(i)
    hidden_size = config['hidden_size']
    n_layers = config['layers']
    directions = config['directions']
    model_glove = NERNet(INPUT_SIZE, EMBEDDING_SIZE, hidden_size, OUTPUT_SIZE, n_layers, directions).cuda()
    model_glove.embedding = nn.Embedding.from_pretrained(weights,freeze=True).cuda()
    train_loop(model_glove, EPOCHS)
    evaluate(model_glove, model_name)

tested model is: Model 1

Dev dataset results:

Recall scores:

tag: O, Recall: 0.9474
tag: B-PER, Recall: 0.5450
tag: I-PER, Recall: 0.7006
tag: B-LOC, Recall: 0.6393
tag: I-LOC, Recall: 0.3913
tag: B-ORG, Recall: 0.6845
tag: I-ORG, Recall: 0.6034

The Recall result for all labels together (except O): 0.6257378984651711

Precision scores:

tag: O, Precision: 0.9566
tag: B-PER, Precision: 0.7956
tag: I-PER, Precision: 0.8209
tag: B-LOC, Precision: 0.7748
tag: I-LOC, Precision: 0.7500
tag: B-ORG, Precision: 0.4228
tag: I-ORG, Precision: 0.4094

The Precision result for all labels together (except O): 0.6043329532497149

------------------------------------------------------------------------------------
------------------------------------------------------------------------------------

Test dataset results:

Recall scores:

tag: O, Recall: 0.9493
tag: B-PER, Recall: 0.4885
tag: I-PER, Recall: 0.6486
tag: B-LOC, Recall: 0.6006
tag: I-LOC, Recall: 0.4528
tag: B-ORG, Recall: 0.6886
tag: 

**Good luck!**