# bertchunker: default program

In [None]:
from bertchunker import *
import os, sys
import random

## Run the default solution on dev

In [5]:
chunker = FinetuneTagger(os.path.join('..', 'data', 'chunker'), modelsuffix='.pt')
decoder_output = chunker.decode(os.path.join('..', 'data', 'input', 'dev.txt'))

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1027/1027 [00:20<00:00, 51.11it/s]


Ignore the warnings from the transformers library. They are expected to occur.

## Evaluate the default output

In [6]:
flat_output = [ output for sent in decoder_output for output in sent ]
sys.path.append('..')
import conlleval
true_seqs = []
with open(os.path.join('..', 'data', 'reference', 'dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 13226 phrases; correct: 9689.
accuracy:  87.04%; (non-O)
accuracy:  87.45%; precision:  73.26%; recall:  81.45%; FB1:  77.14
             ADJP: precision:  13.32%; recall:  53.98%; FB1:  21.37  916
             ADVP: precision:  31.16%; recall:  58.79%; FB1:  40.73  751
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  8
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  11
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  3
               NP: precision:  80.58%; recall:  80.86%; FB1:  80.72  6258
               PP: precision:  95.97%; recall:  86.93%; FB1:  91.23  2211
              PRT: precision:  22.15%; recall:  77.78%; FB1:  34.48  158
             SBAR: precision:  36.12%; recall:  80.17%; FB1:  49.80  526
              UCP: precision:   0.00%; recall:   0.00%; FB1:   0.00  64
               VP: precision:  83.75%; recall:  84.33%; FB1:  84.04  2320


(73.25722062603963, 81.44754539340954, 77.13557837751772)
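The three numbers in the returned tuple can be reproduced from the phrase counts in the summary line. The sketch below copies the counts from the output above and applies the standard precision, recall, and F1 definitions; the variable names are ours, not conlleval's.

```python
# Counts copied from the conlleval summary line above.
total_phrases = 11896    # phrases in the reference
found_phrases = 13226    # phrases predicted by the chunker
correct_phrases = 9689   # predicted phrases that exactly match the reference

precision = 100.0 * correct_phrases / found_phrases
recall = 100.0 * correct_phrases / total_phrases
fb1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} FB1={fb1:.2f}")
# precision=73.26 recall=81.45 FB1=77.14
```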

## Documentation

To improve the performance of the fine-tuned model, we tried several ideas, described below:

### 1) Data augmentation
To apply data augmentation, we first need to choose the type of noise that will help the model achieve better results.
Looking at the dev.txt data, the added noise appears to consist of changing, deleting, or replacing characters in some tokens.

We also implemented all the manipulation types mentioned in https://www.aclweb.org/anthology/P19-1561/: "swap", "drop", "add" (called "insert" in our code), and "replace".

To achieve this, we implemented the following function, which takes a sentence and an augmentation percentage as input and applies data augmentation to the sequence before it is passed to the tokenizer.

In [None]:
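def augment_sentence( sent, aug_perc ):
    # Applies one random character-level edit to roughly aug_perc of the tokens in sent.
    # Note: aug_tokens (the characters eligible for editing) is not defined in this cell;
    # it is assumed to be available as a module-level sequence of characters.
    chance = torch.rand( len(sent) )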
    selected_tokens = torch.where( chance <  aug_perc)[0]
    sent = list( sent )
    augmented = False
    
    for ind in selected_tokens:
        token = sent[ind]   
        if len( token ) < 3 :
            continue
        as_list = list( token )
        operation = random.choice(["swap", "insert", "drop", "replace"])
        
        if operation == "replace":
            selected_ind = random.randint(0, len(token) -1)
            if as_list[selected_ind] not in aug_tokens:
                continue
            new_char = aug_tokens[random.randint(0, len(aug_tokens)-1)]
            as_list[selected_ind] = new_char
        elif operation == "drop":
            selected_ind = random.randint(0, len(token) -1)
            if as_list[selected_ind] not in aug_tokens:
                continue
            del as_list[selected_ind]
        elif operation == "swap":
            selected_ind = random.randint(0, len(token) -2)
            as_list[selected_ind], as_list[selected_ind+1] = as_list[selected_ind+1], as_list[selected_ind]
        elif operation == "insert":
            selected_ind = random.randint(0, len(token))
            new_char = aug_tokens[random.randint(0, len(aug_tokens)-1)]
            as_list.insert(selected_ind, new_char)
        
        sent[ind] = "".join(as_list)
        augmented = True
    return tuple(sent), augmented
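To make the four edit types concrete, here is a small deterministic sketch that applies each one to a single token at a fixed position. `apply_char_op` is a hypothetical helper written for illustration only; it is not part of the notebook, and it skips the random selection and the `aug_tokens` check that `augment_sentence` performs.

```python
def apply_char_op(token, operation, index, new_char="x"):
    """Apply one character-level edit to `token` at `index` (illustrative helper)."""
    chars = list(token)
    if operation == "replace":
        chars[index] = new_char
    elif operation == "drop":
        del chars[index]
    elif operation == "swap":
        # Exchange the character at `index` with its right neighbour.
        chars[index], chars[index + 1] = chars[index + 1], chars[index]
    elif operation == "insert":
        chars.insert(index, new_char)
    else:
        raise ValueError(f"unknown operation: {operation}")
    return "".join(chars)

for op in ("replace", "drop", "swap", "insert"):
    print(op, apply_char_op("market", op, 2))
# replace maxket
# drop maket
# swap makret
# insert maxrket
```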

### 2) Two optimizers and learning rate schedulers
On the assumption that MLPs usually train better with an SGD optimizer, we added a second optimizer (SGD) for classification_head, with an initial learning rate of 0.1.
We also used ExponentialLR schedulers for both optimizers to decay the learning rate throughout the training process.
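A training loop that holds a list of optimizers has to zero, step, and schedule each of them. The following is a minimal sketch of that pattern, assuming PyTorch is installed; the two `nn.Linear` modules are small stand-ins for the encoder and the classification head, and the random inputs and targets are placeholders rather than chunking data.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ExponentialLR

torch.manual_seed(0)
encoder = nn.Linear(8, 8)   # stand-in for the pretrained encoder
head = nn.Linear(8, 3)      # stand-in for classification_head

optimizers = [
    optim.Adam(encoder.parameters(), lr=5e-5),  # small learning rate for the encoder
    optim.SGD(head.parameters(), lr=0.1),       # larger learning rate for the head
]
schedulers = [ExponentialLR(opt, gamma=0.9) for opt in optimizers]

inputs = torch.randn(4, 8)              # placeholder batch
targets = torch.randint(0, 3, (4,))     # placeholder labels
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):
    for opt in optimizers:
        opt.zero_grad()
    loss = loss_fn(head(encoder(inputs)), targets)
    loss.backward()
    for opt in optimizers:
        opt.step()
    # Decay both learning rates once per epoch.
    for sched in schedulers:
        sched.step()

encoder_lr = optimizers[0].param_groups[0]["lr"]
head_lr = optimizers[1].param_groups[0]["lr"]
print(encoder_lr, head_lr)
```

After two epochs each learning rate has been multiplied by 0.9 twice, so the head's rate is about 0.081.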

### 3) Improving classification head

To improve the classification head, the single linear layer is replaced with the following sequential model:

    (classification_head): Sequential(
      (0): Dropout(p=0.2, inplace=False)
      (1): Linear(in_features=768, out_features=512, bias=True)
      (2): GELU(approximate='none')
      (3): Linear(in_features=512, out_features=512, bias=True)
      (4): GELU(approximate='none')
      (5): Linear(in_features=512, out_features=22, bias=True)
    )

In [None]:
class TransformerModel(nn.Module):

    def __init__(
            self,
            basemodel,
            tagset_size,
            lr=5e-5
        ):
        torch.manual_seed(1)
        super(TransformerModel, self).__init__()
        self.basemodel = basemodel
        # the encoder will be a BERT-like model that receives an input text in subwords and maps each subword into
        # contextual representations
        self.encoder = None
        # the hidden dimension of the BERT-like model will be automatically set in the init function!
        self.encoder_hidden_dim = 0
        # The linear layer that maps the subword contextual representation space to tag space
        self.classification_head = None
        # The CRF layer on top of the classification head to make sure the model learns to move from/to relevant tags
        # self.crf_layer = None
        # optimizers will be initialized in the init_model_from_scratch function
        self.optimizers = None
        self.init_model_from_scratch(basemodel, tagset_size, lr)

    def init_model_from_scratch(self, basemodel, tagset_size, lr):
        self.encoder = AutoModel.from_pretrained(basemodel)
        self.encoder_hidden_dim = self.encoder.config.hidden_size
        # self.classification_head = nn.Linear(self.encoder_hidden_dim, tagset_size)
        self.classification_head = nn.Sequential(
            nn.Dropout(.1, inplace=False),
            nn.Linear( self.encoder_hidden_dim, 512 ),
            nn.GELU(),
            nn.Linear( 512, 512 ),
            nn.GELU(),
            nn.Linear( 512, tagset_size )
        )

        # TODO initialize self.crf_layer in here as well.
        # TODO modify the optimizers in a way that each model part is optimized with a proper learning rate!
        self.optimizers = [ 
            optim.Adam(
                list(self.encoder.parameters()),
                lr=lr
            ),
            optim.SGD( 
                list(self.classification_head.parameters() ) ,
                lr= 0.1
                      )
        ]
        
        self.lr_schedulers = [
            ExponentialLR(self.optimizers[0], 0.9),
            ExponentialLR(self.optimizers[1], 0.9),
        ]

    def forward(self, sentence_input):
        encoded = self.encoder(sentence_input).last_hidden_state
        tag_space = self.classification_head(encoded)
        tag_scores = F.log_softmax(tag_space, dim=-1)
        # TODO modify the tag_scores to use the parameters of the crf_layer
        return tag_scores

## Analysis

### Data augmentation
Most of the improvement was obtained just by augmenting the training data.
We implemented data augmentation dynamically, which means that a new augmentation step is applied to the data every epoch. This results in more varied training data.
With data augmentation alone, we achieved an accuracy of 96.4 and an FB1 score of 94.54. Our initial data augmentation process involved only character replacement; adding the other types of text manipulation improved accuracy further.
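The idea of re-noising the data every epoch can be sketched as follows. This is a simplified illustration, not the notebook's training loop: `swap_noise` and `augmented_epoch` are hypothetical helpers, only the "swap" operation is shown, and the two training sentences are made up.

```python
import random

def swap_noise(token, rng):
    """Swap two adjacent characters at a random position; short tokens are kept as-is."""
    if len(token) < 3:
        return token
    i = rng.randint(0, len(token) - 2)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def augmented_epoch(sentences, aug_perc, rng):
    """Return a freshly noised copy of the sentences; the originals are left untouched."""
    return [
        tuple(swap_noise(tok, rng) if rng.random() < aug_perc else tok for tok in sent)
        for sent in sentences
    ]

train = [("the", "market", "closed", "higher"), ("investors", "remained", "cautious")]
rng = random.Random(0)
for epoch in range(3):
    # Each epoch draws new noise from the clean data, so the model sees different corruptions.
    noisy = augmented_epoch(train, aug_perc=0.3, rng=rng)
    print(epoch, noisy)
```

Because the noise is always drawn from the clean sentences, corruptions do not accumulate from one epoch to the next.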

### Using a separate optimizer
As proposed in the code, we used a separate optimizer for classification_head.
The results did not improve much: using a separate optimizer had a minimal effect on the accuracy of the model.

### Freezing the encoder parameters in the beginning
We also tried freezing the encoder's weights for the first few epochs and training only the classification_head, but this worsened the results.
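For reference, freezing a module for the first few epochs is usually done by toggling `requires_grad` on its parameters. The sketch below assumes PyTorch is installed; `set_trainable` is a hypothetical helper, the `nn.Linear` encoder is a stand-in, and the value of `freeze_epochs` is an assumption, since the exact number of frozen epochs is not recorded here.

```python
from torch import nn

def set_trainable(module, trainable):
    """Enable or disable gradient updates for every parameter in `module`."""
    for param in module.parameters():
        param.requires_grad = trainable

encoder = nn.Linear(8, 8)   # stand-in for the pretrained encoder
freeze_epochs = 2           # hypothetical value, not taken from the experiments

encoder_trainable_by_epoch = []
for epoch in range(4):
    # Keep the encoder frozen until freeze_epochs have passed, then unfreeze it.
    set_trainable(encoder, trainable=(epoch >= freeze_epochs))
    encoder_trainable_by_epoch.append(all(p.requires_grad for p in encoder.parameters()))
    # ... one training epoch would run here ...

print(encoder_trainable_by_epoch)  # [False, False, True, True]
```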

### MLP classification head
The classification head was modified as described above, and we also increased the number of training epochs to 20.
These changes improved both accuracy and F1 score: accuracy = 96.77, FB1 = 94.91.

### Using learning rate scheduler
We first halved the learning rate at epoch 15 for both optimizers, which improved the FB1 score by almost 0.1. We therefore thought it might be useful to use ExponentialLR schedulers from PyTorch with a gamma value of 0.9 to decay the learning rate throughout the rest of training, starting from epoch 4. This idea also had a positive impact on accuracy.
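The effect of this schedule on the two learning rates can be worked out directly. The sketch below assumes 0-indexed epochs and that the first decayed epoch is the one after `start_epoch`; both are our assumptions, and `lr_at_epoch` is a hypothetical helper rather than a PyTorch function.

```python
def lr_at_epoch(base_lr, epoch, gamma=0.9, start_epoch=4):
    """Learning rate in effect during `epoch` when decay by `gamma` begins after `start_epoch`."""
    decay_steps = max(0, epoch - start_epoch)
    return base_lr * gamma ** decay_steps

for epoch in (0, 4, 5, 10, 20):
    # Columns: epoch, SGD rate for the head, Adam rate for the encoder.
    print(epoch, lr_at_epoch(0.1, epoch), lr_at_epoch(5e-5, epoch))
```

Under these assumptions the head's learning rate stays at 0.1 through epoch 4 and has fallen to 0.081 by epoch 6.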

### Increasing the number of epochs
Finally, we increased the number of epochs to 40 and reached an FB1 score of about 95.6.