# M2 project : NER tagging

This project aims to implement a NER tagger with PyTorch. We will be using the English CONLL 2003 data set.

Data download & description
--------

In [None]:
from urllib.request import urlretrieve
urlretrieve('https://raw.githubusercontent.com/pranabsarkar/Conll_task/master/conll-2003/eng.train','eng.train')
urlretrieve('https://raw.githubusercontent.com/pranabsarkar/Conll_task/master/conll-2003/eng.testa','eng.testa')

# Print the first 21 lines of the training set
with open('eng.train') as istream:
    for idx, line in enumerate(istream):
        print(line.strip())
        if idx >= 20:
            break


-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG


The CONLL 2003 dataset encodes each token on a single line, followed by its annotations. A token line is a quadruple:

> (token, part-of-speech tag, chunk tag, named entity tag)
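
As a minimal sketch of this layout, the snippet below splits one token line into its four fields. The names `ConllToken` and `parse_token_line` are illustrative assumptions and are not part of the assignment's required API.

```python
from typing import NamedTuple

class ConllToken(NamedTuple):
    token: str   # surface form
    pos: str     # part-of-speech tag
    chunk: str   # syntactic chunk tag
    ner: str     # named entity tag

def parse_token_line(line):
    """Split one non-blank CONLL 2003 line into its four fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return ConllToken(*fields)

print(parse_token_line("EU NNP I-NP I-ORG"))
```

Only the first and last fields are needed for the basic tagger; the part-of-speech field becomes useful if tag embeddings are added as extra inputs.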

A named entity tagger aims to predict the named entity annotations given the raw tokens. The NER tags follow the IOB convention:
* **I** stands for **Inside** and is used to flag tokens that are part of a named entity.
* **B** stands for **Begin** and is used to flag a token starting a new entity when the preceding token is already part of an entity of the same type.
* **O** stands for **Outside** and is used to flag tokens that are not part of a named entity.

The I and B tags are followed by a specifier. For instance, I-PER means that the named entity refers to a person, and I-ORG means that the entity refers to an organisation.
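
To make the convention concrete, here is a hedged sketch that groups a tag sequence into entity spans. The helper name `iob_spans` is an assumption made for illustration. It closes an entity on `O`, on an explicit `B`, or when the specifier changes.

```python
def iob_spans(tags):
    """Return (entity_type, start, end) triples, end exclusive, from IOB tags."""
    spans = []
    start, current = None, None   # open entity: first index and type
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # An open entity closes on O, on an explicit B, or on a type change
        if current is not None and (prefix != "I" or etype != current):
            spans.append((current, start, i))
            start, current = None, None
        if prefix in ("I", "B") and current is None:
            start, current = i, etype
    if current is not None:       # entity running to the end of the sentence
        spans.append((current, start, len(tags)))
    return spans

# Tags of the first sentence shown above: "EU rejects German call ..."
print(iob_spans(["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]))
```

Spans of this kind are what an entity-level evaluation would compare, whereas the `validate` method used later measures token-level accuracy.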

Sentences are separated by a blank line. The train file is `eng.train`, the dev file is `eng.testa`. I will evaluate your work with a test file unknown to you.
To do this, I will change the content of the dev file.



First exercise : data preprocessing (5pts)
---


Using the CONLL 2003 train file, you will:

* Extract an input vocabulary and create two maps: one mapping tokens to integers and a second mapping integers to tokens (see the pdf notes)
* Include elements in the input vocabulary for padding and for unknown words
* Extract an output vocabulary (the set of NER tags) and return two maps
mapping tags to integers and vice versa.

These functionalities should be implemented in a function with signature `vocabulary(filename)` that returns the two maps.

In [None]:
def vocabulary(filename,input_vocab,padding='<pad>',unknown='<unk>'):
    #input_vocab is a boolean flag that tells if we extract input or output vocabulary
    #the two optional flags indicate that a padding and an unknown token
    #have to be added to the vocabulary if their value is not None

    idx2sym = {}
    sym2idx = {}
    current_index = 0

    if padding is not None:
        idx2sym[current_index] = padding
        sym2idx[padding] = current_index
        current_index += 1

    if unknown is not None:
        idx2sym[current_index] = unknown
        sym2idx[unknown] = current_index
        current_index += 1

    with open(filename, 'r') as file:
        lines = file.readlines()[1:]  # Read all lines except the first one: "-DOCSTART- -X- -X- O"

    # Symbols are registered line by line, so the last sentence is not lost
    # when the file does not end with a blank line.
    for line in lines:
        parts = line.split()
        if not parts:  # Blank line: sentence separator, nothing to register
            continue
        # Column 0 holds the token, the last column holds the NER tag
        symbol = parts[0] if input_vocab else parts[-1]
        if symbol not in sym2idx:
            idx2sym[current_index] = symbol
            sym2idx[symbol] = current_index
            current_index += 1

    return idx2sym, sym2idx


Now we implement three functions:

* The first performs padding
* The second encodes a sequence of tokens (or a sequence of tags) as integers
* The third decodes a sequence of symbols from integers to strings

At test time, some tokens might not belong to the vocabulary. Ensure that your encoding function does not crash in this case.
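
One way to meet this requirement is to fall back on the index of the unknown symbol with `dict.get`. The snippet below is a small sketch on a toy vocabulary; the names `toy_sym2idx` and `toy_idx2sym` and the example sentence are made up for illustration.

```python
toy_sym2idx = {"<pad>": 0, "<unk>": 1, "EU": 2, "rejects": 3}   # toy vocabulary
toy_idx2sym = {idx: sym for sym, idx in toy_sym2idx.items()}

sentence = ["EU", "rejects", "Brexit"]   # "Brexit" is not in the vocabulary
unk_id = toy_sym2idx["<unk>"]

# dict.get returns the fallback index instead of raising KeyError
encoded = [toy_sym2idx.get(tok, unk_id) for tok in sentence]
decoded = [toy_idx2sym[idx] for idx in encoded]

print(encoded)
print(decoded)
```

Note that the round trip is lossy: an out-of-vocabulary token decodes back to `<unk>`, not to its original string.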


In [None]:
def pad_sequence(sequence,pad_size,pad_token):
    #returns a list of exactly pad_size elements: truncated if too long, padded otherwise
    if len(sequence) >= pad_size:
        return sequence[:pad_size]
    else:
        return sequence + [pad_token] * (pad_size - len(sequence))

def code_sequence(sequence,coding_map,unk_token=None):
    #takes a list of strings and returns a list of integers
    if unk_token is not None:
        return [coding_map.get(token, coding_map[unk_token]) for token in sequence]
    else:
        return [coding_map[token] for token in sequence]

def decode_sequence(sequence,decoding_map):
    #takes a list of integers and returns a list of strings
    return [decoding_map[idx] for idx in sequence]



Second exercise: data generator (5pts)
------

In this second exercise, we will write a mini-batch generator.
This is a class in charge of generating randomized batches of data from the dataset. We start by implementing two functions for reading the text file.


In [None]:
def read_conll_tokens(conllfilename):
    """
    Reads a CONLL 2003 file and returns a list of sentences.
    A sentence is a list of strings (tokens)
    """
    sentences = []
    current_sentence = []

    with open(conllfilename, 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                if current_sentence:
                    sentences.append(current_sentence)
                current_sentence = []
            else:
                parts = line.split()
                token = parts[0]
                current_sentence.append(token)

    return sentences

def read_conll_tags(conllfilename):
    """
    Reads a CONLL 2003 file and returns a list of sentences.
    A sentence is a list of strings (NER-tags)
    """
    sentences = []
    current_sentence = []

    with open(conllfilename, 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                if current_sentence:
                    sentences.append(current_sentence)
                current_sentence = []
            else:
                parts = line.split()
                ner_tag = parts[-1]
                current_sentence.append(ner_tag)

    return sentences





Now we implement the class. You will rely on the helper functions designed above in order to fill in the blanks in the constructor.

In [None]:
import torch
import torch.nn as nn
from random import shuffle

class DataGenerator:

        #Reuse all relevant helper functions defined above to solve the problems
        def __init__(self,conllfilename, parentgenerator = None, pad_token='<pad>',unk_token='<unk>'):

              if parentgenerator is not None: #Reuse the encodings of the parent if specified
                  self.pad_token      = parentgenerator.pad_token
                  self.unk_token      = parentgenerator.unk_token
                  self.input_sym2idx  = parentgenerator.input_sym2idx
                  self.input_idx2sym  = parentgenerator.input_idx2sym
                  self.output_sym2idx = parentgenerator.output_sym2idx
                  self.output_idx2sym = parentgenerator.output_idx2sym
              else:                           #Creates new encodings
                  self.pad_token = pad_token
                  self.unk_token = unk_token
                  # Create 4 encoding maps from datafile
                  self.input_idx2sym,self.input_sym2idx   = vocabulary(conllfilename, True, self.pad_token, self.unk_token)
                  self.output_idx2sym,self.output_sym2idx = vocabulary(conllfilename, False, self.pad_token, self.unk_token)


              # store the conll dataset with sentence structure (a list of lists of strings) in the following fields
              self.Xtokens = read_conll_tokens(conllfilename)
              self.Ytokens = read_conll_tags(conllfilename)


        def generate_batches(self,batch_size):

              #This is an example generator function yielding one batch after another
              #Batches are lists of lists

              assert(len(self.Xtokens) == len(self.Ytokens))

              N     = len(self.Xtokens)
              idxes = list(range(N))

              #Data ordering (try to explain why these 2 lines make sense...)
              #  Sorting by length groups sentences of similar length in the same batch, which minimises padding.
              #  Python's sort is stable, so the preceding shuffle randomises the order among sentences of equal length.
              shuffle(idxes)
              idxes.sort(key=lambda idx: len(self.Xtokens[idx]))

              #batch generation
              bstart = 0
              while bstart < N:
                 bend        = min(bstart+batch_size,N)
                 batch_idxes = idxes[bstart:bend]
                 batch_len   = max(len(self.Xtokens[idx]) for idx in batch_idxes)

                 seqX = [ pad_sequence(self.Xtokens[idx],batch_len,self.pad_token) for idx in batch_idxes]
                 seqY = [ pad_sequence(self.Ytokens[idx],batch_len,self.pad_token) for idx in batch_idxes]
                 seqX = [ code_sequence(seq,self.input_sym2idx,self.unk_token) for seq in seqX]
                 seqY = [ code_sequence(seq,self.output_sym2idx) for seq in seqY]

                 assert(len(seqX) == len(seqY))
                 yield (seqX,seqY)
                 bstart += batch_size

In [None]:
trainset = DataGenerator('eng.train')
validset = DataGenerator('eng.testa',parentgenerator = trainset)
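
The benefit of sorting by length in `generate_batches` can be checked with a small, self-contained count of pad symbols. This is only a sketch: `padding_cost` is a hypothetical helper and the sentence lengths are toy values, not statistics from the dataset.

```python
def padding_cost(lengths, batch_size):
    """Number of pad symbols needed when batching `lengths` in the given order."""
    cost = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence is padded up to the longest one in its batch
        cost += sum(max(batch) - n for n in batch)
    return cost

lengths = [9, 2, 2, 30, 7, 3, 25, 8]                 # toy sentence lengths
print(padding_cost(lengths, batch_size=4))           # original order
print(padding_cost(sorted(lengths), batch_size=4))   # sorted by length
```

Fewer pad symbols mean fewer wasted LSTM steps per batch. The price is that batches are no longer uniform random samples of the data, since each one contains sentences of similar length.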

Third exercise : implement the tagger (5pts)
---------------
This is the core exercise. There are three main tasks:
* Implement parameter allocation. This implies allocating the embedding layer, the LSTM (or bi-LSTM) layer and the Linear Layer.
* Implement the forward method. This method expects a tensor encoding the input and outputs a tensor of predictions
* Implement the train method

The evaluation (`validate`) method is given and cannot be modified, but it can be used as a source of inspiration for implementing the train method.

In [None]:
import torch.optim as optim

class NERtagger(nn.Module):

      def __init__(self,traingenerator,embedding_size,hidden_size,dropout_proba,device='cpu'):
        super(NERtagger, self).__init__()
        self.embedding_size    = embedding_size
        self.hidden_size       = hidden_size
        self.allocate_params(traingenerator,device, dropout_proba)

      def load(self,filename):
        self.load_state_dict(torch.load(filename))

      def allocate_params(self,datagenerator,device, dropout_proba):

        vocab_size = len(datagenerator.input_sym2idx)
        num_labels = len(datagenerator.output_sym2idx)

        self.embedding = nn.Embedding(vocab_size, self.embedding_size)
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(self.hidden_size * 2, num_labels)  # Output size is doubled due to bidirectional LSTM
        self.dropout = nn.Dropout(dropout_proba)
        self.to(device)

      def forward(self,Xinput):

        embedded = self.embedding(Xinput)
        lstm_output, _ = self.lstm(embedded)
        lstm_output = self.dropout(lstm_output)
        output = self.linear(lstm_output)

        return output

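      def train(self,traingenerator,validgenerator,epochs,batch_size,device='cpu',learning_rate=0.001):
        # NOTE: this method overrides nn.Module.train(mode). As a consequence, self.eval(),
        # which calls self.train(False), cannot be used on this class.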

        self.minloss = 10000000  # The minimum loss found so far on validation data

        optimizer = optim.Adam(self.parameters(), lr=learning_rate)

        device = torch.device(device)
        pad_index = traingenerator.output_sym2idx[traingenerator.pad_token]
        loss_fnc = nn.CrossEntropyLoss(ignore_index=pad_index)

        for epoch in range(epochs):
            batch_losses = []

            for (seqX, seqY) in traingenerator.generate_batches(batch_size):
                X = torch.LongTensor(seqX).to(device)
                Y = torch.LongTensor(seqY).to(device)

                optimizer.zero_grad()

                Yhat = self.forward(X)

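                # Flattening and loss computation
                # (use a fresh name: reusing batch_size here would overwrite the
                # method argument passed to validate() and to the next epoch)
                n_rows, seq_len = Y.shape
                Yhat = Yhat.view(n_rows * seq_len, -1)
                Y = Y.view(n_rows * seq_len)
                loss = loss_fnc(Yhat, Y)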
                loss.backward()
                optimizer.step()
                batch_losses.append(loss.item())

            train_loss = sum(batch_losses) / len(batch_losses)

            valid_loss, valid_accuracy = self.validate(validgenerator, batch_size, device)

            print(f'Train Loss: {train_loss:.4f}\n--- END OF EPOCH {epoch}')

            # Save the model if the validation loss improves
            if valid_loss < self.minloss:
                self.minloss = valid_loss
                torch.save(self.state_dict(), 'tagger_params.pt')


      def validate(self,datagenerator,batch_size,device='cpu',save_min_model=False):

          batch_accurracies = []
          batch_losses      = []

          device = torch.device(device)
          pad_index = datagenerator.output_sym2idx[datagenerator.pad_token]
          loss_fnc  = nn.CrossEntropyLoss(ignore_index=pad_index)

          for (seqX,seqY) in datagenerator.generate_batches(batch_size):
                with torch.no_grad():
                  X = torch.LongTensor(seqX).to(device)
                  Y = torch.LongTensor(seqY).to(device)

                  Yhat = self.forward(X)

                  #Flattening and loss computation
                  batch_size,seq_len = Y.shape
                  Yhat = Yhat.view(batch_size*seq_len,-1)
                  Y    = Y.view(batch_size*seq_len)
                  loss = loss_fnc(Yhat,Y)
                  batch_losses.append(loss.item())

                  #Accurracy computation
                  mask    = (Y != pad_index)
                  Yargmax = torch.argmax(Yhat,dim=1)
                  correct = torch.sum((Yargmax == Y) * mask)
                  total   = torch.sum(mask)
                  batch_accurracies.append(float(correct)/float(total))

          L = len(batch_losses)
          valid_loss = sum(batch_losses)/L

          if save_min_model and valid_loss < self.minloss:
            self.minloss = valid_loss
            torch.save(self.state_dict(), 'tagger_params.pt')

          print('[valid] mean Loss = %f | mean accurracy = %f'%(valid_loss,sum(batch_accurracies)/L))

          return valid_loss, sum(batch_accurracies)/L

In [None]:
model = NERtagger(trainset, 32, 64, 0, 'cuda')
model.train(trainset, validset, 5, 32, 'cuda', 0.0001)

[valid] mean Loss = 1.114844 | mean accurracy = 0.784624
Train Loss: 1.4324
--- END OF EPOCH 0
[valid] mean Loss = 0.844253 | mean accurracy = 0.787345
Train Loss: 0.8638
--- END OF EPOCH 1
[valid] mean Loss = 0.683321 | mean accurracy = 0.816481
Train Loss: 0.6684
--- END OF EPOCH 2
[valid] mean Loss = 0.581171 | mean accurracy = 0.825110
Train Loss: 0.5420
--- END OF EPOCH 3
[valid] mean Loss = 0.523278 | mean accurracy = 0.830003
Train Loss: 0.4490
--- END OF EPOCH 4
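
Both the loss and the accuracy reported above ignore padded positions (`ignore_index=pad_index` and the `mask` in `validate`). The plain-Python sketch below illustrates what that masking computes. It is an illustration under assumptions and not PyTorch's implementation; `masked_metrics` is a hypothetical name.

```python
import math

def masked_metrics(logits, gold, pad_index):
    """Mean cross-entropy and accuracy over non-padding positions.

    logits: list of per-token score lists; gold: list of gold label ids.
    """
    total_loss, correct, counted = 0.0, 0, 0
    for scores, y in zip(logits, gold):
        if y == pad_index:   # padded position: contributes nothing
            continue
        # Negative log-softmax of the gold class, computed stably
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total_loss += log_z - scores[y]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == y)
        counted += 1
    return total_loss / counted, correct / counted

# Three positions, the last one is padding (label 0) with a "wrong" prediction
loss, acc = masked_metrics(
    [[0.0, 0.0, 2.0], [0.0, 3.0, 0.0], [0.0, 9.0, 0.0]], [2, 1, 0], pad_index=0)
print(loss, acc)
```

Without the mask, the padded position would be counted as an error and would also add to the loss, so short sentences in a long batch would distort both figures.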


The main program is the following. You are expected to add code for searching for hyperparameters that maximise the validation score.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from itertools import product

param_grid = {
        'embedding_size': [32, 64, 128],
        'hidden_size': [64, 128, 256],
        'learning_rate': [0.0005, 0.005, 0.05],
        'epochs': [10], # we set it at 10 to keep it from being too long
        'batch_size': [32, 64, 128]
}

best_hyperparameters = None
best_loss = 10000000

hyperparameter_combinations = product(*param_grid.values())

for params in hyperparameter_combinations:
    embedding_size, hidden_size, learning_rate, epochs, batch_size = params
    print(f"embedding_size = {embedding_size}, hidden_size = {hidden_size}, learning_rate = {learning_rate}, epochs = {epochs}, batch_size = {batch_size}")

    model = NERtagger(trainset, embedding_size, hidden_size, 0, 'cuda')  # dropout_proba = 0, device = 'cuda'
    model.train(trainset, validset, epochs, batch_size, 'cuda', learning_rate)

    if model.minloss < best_loss:
        best_loss = model.minloss
        best_hyperparameters = {
            'embedding_size': embedding_size,
            'hidden_size': hidden_size,
            'learning_rate': learning_rate,
            'epochs': epochs,
            'batch_size': batch_size
        }

    # Simply save the results as we go, just in case
    with open('hyperparameter_results.csv', 'a') as f:
        f.write(f"{embedding_size},{hidden_size},{learning_rate},{epochs},{batch_size},{model.minloss}\n")

print("Best Hyperparameters:", best_hyperparameters)


embedding_size = 32, hidden_size = 64, learning_rate = 0.0005, epochs = 10, batch_size = 32
[valid] mean Loss = 0.796054 | mean accurracy = 0.793534
Train Loss: 0.9844
--- END OF EPOCH 0
[valid] mean Loss = 0.526015 | mean accurracy = 0.841266
Train Loss: 0.5553
--- END OF EPOCH 1
[valid] mean Loss = 0.394619 | mean accurracy = 0.875046
Train Loss: 0.3699
--- END OF EPOCH 2
[valid] mean Loss = 0.325545 | mean accurracy = 0.892210
Train Loss: 0.2628
--- END OF EPOCH 3
[valid] mean Loss = 0.292867 | mean accurracy = 0.897580
Train Loss: 0.1881
--- END OF EPOCH 4
[valid] mean Loss = 0.249898 | mean accurracy = 0.913481
Train Loss: 0.1315
--- END OF EPOCH 5
[valid] mean Loss = 0.231388 | mean accurracy = 0.920636
Train Loss: 0.0906
--- END OF EPOCH 6
[valid] mean Loss = 0.234025 | mean accurracy = 0.923483
Train Loss: 0.0611
--- END OF EPOCH 7
[valid] mean Loss = 0.242111 | mean accurracy = 0.925732
Train Loss: 0.0396
--- END OF EPOCH 8
[valid] mean Loss = 0.266261 | mean accurracy = 0.922
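
The search above unpacks each combination positionally, which relies on the key order of `param_grid`. A possible variant, sketched here under the assumption of a hypothetical helper `iter_grid`, yields one dictionary per grid point, so that values are always read by name.

```python
from itertools import product

def iter_grid(param_grid):
    """Yield one {name: value} dict per point of the Cartesian grid."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

grid = {
    'embedding_size': [32, 64, 128],
    'hidden_size': [64, 128, 256],
    'learning_rate': [0.0005, 0.005, 0.05],
    'epochs': [10],
    'batch_size': [32, 64, 128],
}
configs = list(iter_grid(grid))
print(len(configs))   # number of training runs the search will launch
print(configs[0])
```

Counting the configurations before launching the search also gives a quick estimate of its cost, since every configuration is trained for the full number of epochs.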

Fourth exercise : improve the tagger (5pts)
----------

This exercise is relatively free. You may add improvements to the basic tagger.
Note that I expect that improving the management of unknown words and of subword units is key on this task. You may wish to:
* Add an attention layer
* Use part of speech tags embeddings as additional inputs
* Find a way to learn a word embedding for unknown words
* Integrate your convolutional word embedding module into the tagger
* ...

Describe your improvements below and point me to the name(s) of the function(s)
where they are implemented.
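
As one possible starting point for the unknown-word item, note that the input vocabulary is built from the train file, so no training token is ever encoded as `<unk>` and its embedding receives no gradient. A common remedy, given here only as a hedged sketch and not as the expected solution, is to randomly replace rare training tokens by the unknown symbol. The function name `replace_rare_tokens` and its parameters are assumptions.

```python
import random
from collections import Counter

def replace_rare_tokens(sentences, unk_token='<unk>', max_count=1, proba=0.5, seed=0):
    """Randomly replace tokens seen at most `max_count` times by `unk_token`."""
    rng = random.Random(seed)   # local generator, for reproducible replacements
    counts = Counter(tok for sent in sentences for tok in sent)
    return [
        [unk_token if counts[tok] <= max_count and rng.random() < proba else tok
         for tok in sent]
        for sent in sentences
    ]

toy = [["the", "lamb"], ["the", "boycott"]]
print(replace_rare_tokens(toy, proba=1.0))
```

Such a function could be applied to the token sentences of the training generator only, before encoding, so that the validation data stays untouched.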
