# Advanced NLP Assignment 3, Group 1
Thijs Vollebregt(2670637), Chuqiao Guo(2798305), Yijing Zhang(2818171), Danna Shao(2663369)

### Importing libraries and utils:

In [2]:
import warnings
warnings.filterwarnings('ignore')

import transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import load_metric

from read_and_preprocess import *

### Basic settings
- We use seqeval as the evaluation metric and `bert-base-uncased` as the BERT model, together with its corresponding auto tokenizer.

In [3]:
metric = load_metric("seqeval")
task = "srl"
model_checkpoint = "bert-base-uncased"  # bert-base-uncased for better precision, distilbert-base-uncased for a faster run
batch_size = 16
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

## 1. Baseline model
The approach used for the baseline model is to convert each sentence into the following form:

> [CLS] This is the sentence content [SEP] is [SEP].

This is achieved with the tokenizer's built-in sentence-pair encoding:
`tokenizer(list1, list2)` returns [CLS] list1 content [SEP] list2 content [SEP].
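
The layout can be illustrated without loading a model. The following is a minimal sketch, not the tokenizer itself: `build_pair_input` is a hypothetical helper that mimics how a BERT-style tokenizer arranges two word lists and assigns segment ids. Real sub-word splitting is left out.

```python
from typing import List, Tuple


def build_pair_input(sentence: List[str], predicate: List[str]) -> Tuple[List[str], List[int]]:
    """Mimic the BERT sentence-pair layout (hypothetical helper, no sub-word splitting)."""
    # First segment: [CLS] + sentence + [SEP], all with segment id 0.
    first = ["[CLS]"] + sentence + ["[SEP]"]
    # Second segment: predicate + [SEP], all with segment id 1.
    second = predicate + ["[SEP]"]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_input(
    ["This", "is", "the", "sentence", "content"], ["is"]
)
print(" ".join(tokens))
print(token_type_ids)
```

The segment ids correspond to the `token_type_ids` field of the tokenized dataset: 0 for the sentence part and 1 for the predicate part.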

### 1.1 Importing datasets and libraries:
The corpus is loaded and preprocessed into a Hugging Face dataset by baseline_ds.py. This script provides the following functions and variables for the baseline model:

Dataset preparation:
- `get_mappings_dict()`: Get the dictionary mapping string classes (e.g. 'ARG0') to int labels and its reverse.
- `create_word_sentlist()`: Create dictionary containing the required data for generating desired huggingface dataset.
- `tokenize_and_align_labels()`: Adapted from the example notebook. Solves label alignment after re-tokenization.
    
Evaluation:
- `compute_metrics()`: Compute the overall precision, recall and f1.
- `reverse_label()`: Map the int class labels back to strings.
- `class_results()`: Compute the precision, recall and f1 for each class.

Variables: 
- `label_dict, label_dict_rev`: The mapping dictionary from string class to int class and its reverse.
- `tokenized_train, tokenized_dev, tokenized_test`: The tokenized and ready-to-use datasets.
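
The implementation of `get_mappings_dict()` is not shown here. As a rough illustration of what such a mapping looks like, the sketch below builds a label-to-int dictionary and its reverse. The helper name `build_label_maps` and the alphabetical ordering of ids are assumptions, so the actual ids in `label_dict` may differ.

```python
from typing import Dict, Iterable, Tuple


def build_label_maps(labels: Iterable[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each distinct string label to an int id and build the reverse map."""
    unique = sorted(set(labels))  # sorting keeps the ids reproducible across runs
    label_to_id = {label: idx for idx, label in enumerate(unique)}
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label


label_to_id, id_to_label = build_label_maps(["ARG1", "_", "ARG0", "ARG1", "ARGM-TMP"])
print(label_to_id)
print(id_to_label)
```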

In [4]:
import baseline_ds

Inspect the tokenized (ready to use) dataset:

In [10]:
print(baseline_ds.tokenized_test[0])

{'tokens': ['What', 'if', 'Google', 'Morphed', 'Into', 'GoogleOS', '?'], 'srl_arg_tags': [29, 29, 11, 29, 29, 43, 29], 'pred': ['Morphed'], 'input_ids': [101, 2054, 2065, 8224, 22822, 8458, 2098, 2046, 8224, 2891, 1029, 102, 22822, 8458, 2098, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'word_ids': [None, 0, 1, 2, 3, 3, 3, 4, 5, 5, 6, None, 0, 0, 0, None], 'labels': [-100, 29, 29, 11, 29, 29, 29, 29, 43, 43, 29, -100, 29, 29, 29, -100]}


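The last five input ids are 102, 22822, 8458, 2098, 102. Here 102 is the token id of [SEP], and the three ids in between are the word pieces of the predicate 'Morphed'. The word_ids of [CLS] and of both [SEP] tokens are None and their labels are -100, so they are ignored by the loss, indicating that the tokenization worked properly.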
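
The relation between `word_ids`, `srl_arg_tags` and `labels` in the printed example can be reproduced with a few lines. This is a minimal sketch of the alignment idea, assuming every word piece simply receives the label of the word it belongs to. `align_labels` is a hypothetical stand-in, and the real `tokenize_and_align_labels()` may differ in detail.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
) -> List[int]:
    """Give every word piece the label of its source word; mask special tokens."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) have no source word and are masked.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
    return aligned


# Values copied from the printed test example above.
word_ids = [None, 0, 1, 2, 3, 3, 3, 4, 5, 5, 6, None, 0, 0, 0, None]
srl_arg_tags = [29, 29, 11, 29, 29, 43, 29]
labels = align_labels(word_ids, srl_arg_tags)
print(labels)
```

Under this assumption the word pieces of the appended predicate get the label of word 0, because the word ids restart at 0 in the second segment. This matches the printed `labels`, but it may be a coincidence of this example.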

### 1.2 Creating baseline model:
The model is trained on the tokenized training set and evaluated on the dev set after every epoch.

In [5]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(baseline_ds.label_dict))
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tokenizer)
label_list = baseline_ds.label_list

base_trainer = Trainer(
    model,
    args,
    train_dataset=baseline_ds.tokenized_train,
    eval_dataset=baseline_ds.tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=baseline_ds.compute_metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 1.3 Train and evaluate:
The final evaluation is done on the test dataset.

In [5]:
base_trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0882 | 0.108037 | 0.798406 | 0.819914 | 0.809017 | 0.970571 |
| 2 | 0.063 | 0.105916 | 0.817995 | 0.827277 | 0.82261 | 0.972739 |
| 3 | 0.0481 | 0.109497 | 0.820079 | 0.83386 | 0.826912 | 0.973297 |


TrainOutput(global_step=7965, training_loss=0.07123838856919187, metrics={'train_runtime': 352.3275, 'train_samples_per_second': 361.59, 'train_steps_per_second': 22.607, 'total_flos': 4555060725912480.0, 'train_loss': 0.07123838856919187, 'epoch': 3.0})

- Getting the predictions and true labels for the test dataset:

In [8]:
base_pred, base_labels, _ = base_trainer.predict(baseline_ds.tokenized_test)

- Evaluate the results of each class:

In [9]:
baseline_ds.class_results(base_pred, base_labels)

{'ADJ': {'precision': 0.7312252964426877,
  'recall': 0.7142857142857143,
  'f1': 0.72265625,
  'number': 259},
 'ADV': {'precision': 0.6910569105691057,
  'recall': 0.6378986866791745,
  'f1': 0.6634146341463414,
  'number': 533},
 'ARG0': {'precision': 0.9242424242424242,
  'recall': 0.8714285714285714,
  'f1': 0.8970588235294117,
  'number': 70},
 'ARG1': {'precision': 0.6554621848739496,
  'recall': 0.75,
  'f1': 0.6995515695067266,
  'number': 104},
 'ARG1-DSP': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARG2': {'precision': 0.2,
  'recall': 0.25,
  'f1': 0.22222222222222224,
  'number': 8},
 'ARG3': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2},
 'ARGM-ADJ': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARGM-ADV': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARGM-CXN': {'precision': 0.3333333333333333,
  'recall': 0.4,
  'f1': 0.3636363636363636,
  'number': 5},
 'ARGM-DIR': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 

### 1.4 Limitations of the baseline model:
- This model cannot reliably handle sentences with duplicated predicates, because the input does not indicate which occurrence of the predicate is meant.
- The input does not include a predicate indicator ([V]) to distinguish predicate tokens from non-predicate tokens.
- This model risks overfitting because we did not modify the input features to replace entities with BIO labels, as is done in [Shi and Lin (2019)](https://arxiv.org/pdf/1904.05255.pdf).
- Training is time-consuming because of the large size of the dataset and the number of different labels.
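
The duplicated-predicate problem can be made concrete with a small word-level sketch. The sentence and the helper `baseline_input` are invented for illustration. The point is that two instances of the same sentence with different target predicates give the same input when the predicate word is identical.

```python
from typing import List


def baseline_input(sentence: List[str], predicate_index: int) -> List[str]:
    """Build the word-level baseline input: [CLS] sentence [SEP] predicate [SEP]."""
    return ["[CLS]"] + sentence + ["[SEP]", sentence[predicate_index], "[SEP]"]


sentence = ["He", "said", "that", "she", "said", "no", "."]
first_instance = baseline_input(sentence, predicate_index=1)
second_instance = baseline_input(sentence, predicate_index=4)

# The two inputs are identical, although the gold role labels of the two
# instances differ, so the model cannot tell which predicate is meant.
print(first_instance == second_instance)
```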

## 2. More advanced model:
The more advanced model uses an approach similar to the Augment method described in [NegBERT (Khandelwal et al., 2020)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.704.pdf): a special token ([V]) is added immediately before the predicate:
> This [V] is a sentence.

Note that **the special token and the predicate are treated as a single word**, so the word list actually looks like:
> 'This' **'[V] is'** 'a' 'sentence' '.'

### 2.1 Implementation
The implementation is actually simpler than that of the baseline model, and the code is only slightly adjusted.

We now use `augment_sentlist()` to create a list of sentences in which every predicate is augmented into '[V] pred'. The predicate is no longer passed to the tokenizer as a second segment, so the extra [SEP] part is removed from the input.
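
The body of `augment_sentlist()` is not shown here. The sketch below illustrates only the per-sentence step, assuming the position of the predicate is available as an index. `mark_predicate` is a hypothetical helper, not the function from `advanced_ds`.

```python
from typing import List


def mark_predicate(tokens: List[str], predicate_index: int, marker: str = "[V]") -> List[str]:
    """Return a copy of `tokens` in which the predicate becomes '<marker> <predicate>'."""
    marked = list(tokens)  # copy, so the original word list stays untouched
    marked[predicate_index] = f"{marker} {tokens[predicate_index]}"
    return marked


tokens = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"]
augmented = mark_predicate(tokens, predicate_index=3)
print(augmented)
```

Because the marker is merged into the predicate word, the number of words stays the same and the word-level labels need no adjustment.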

In [8]:
import advanced_ds

Inspect the tokenized data again:

In [11]:
print(advanced_ds.tokenized_test[0])

{'tokens': ['What', 'if', 'Google', '[V] Morphed', 'Into', 'GoogleOS', '?'], 'srl_arg_tags': [29, 29, 11, 29, 29, 43, 29], 'input_ids': [101, 2054, 2065, 8224, 1031, 1058, 1033, 22822, 8458, 2098, 2046, 8224, 2891, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'word_ids': [None, 0, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 6, None], 'labels': [-100, 29, 29, 11, 29, 29, 29, 29, 29, 29, 29, 43, 43, 29, -100]}


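The predicate is correctly augmented into '[V] pred'. Note that '[V]' is split into three word pieces (ids 1031, 1058, 1033) that share the predicate's word id 3, which indicates that '[V]' is handled as ordinary text rather than as a single vocabulary token.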

### 2.2 Creating the advanced model, train and evaluate:

In [12]:
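# NOTE: `model` is the object created in section 1.2. If it has already been
# fine-tuned by base_trainer, re-create it with from_pretrained(...) first so
# that the advanced model starts from the pretrained checkpoint.
advanced_trainer = Trainer(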
    model,
    args,
    train_dataset=advanced_ds.tokenized_train,
    eval_dataset=advanced_ds.tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=baseline_ds.compute_metrics
)

In [14]:
advanced_trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0441 | 0.086001 | 0.851091 | 0.861808 | 0.856416 | 0.980564 |
| 2 | 0.0285 | 0.089585 | 0.855718 | 0.866338 | 0.860995 | 0.981127 |
| 3 | 0.018 | 0.09487 | 0.860633 | 0.866519 | 0.863566 | 0.98143 |


TrainOutput(global_step=7965, training_loss=0.03031403552939588, metrics={'train_runtime': 339.2309, 'train_samples_per_second': 375.549, 'train_steps_per_second': 23.48, 'total_flos': 4617691475641200.0, 'train_loss': 0.03031403552939588, 'epoch': 3.0})

In [19]:
advanced_pred, advanced_labels, _ = advanced_trainer.predict(advanced_ds.tokenized_test)

- Evaluate the results of each class (slightly better than the baseline model):

In [20]:
baseline_ds.class_results(advanced_pred, advanced_labels)

{'ADJ': {'precision': 0.7361111111111112,
  'recall': 0.7130044843049327,
  'f1': 0.7243735763097949,
  'number': 223},
 'ADV': {'precision': 0.7036247334754797,
  'recall': 0.6707317073170732,
  'f1': 0.6867845993756504,
  'number': 492},
 'ARG0': {'precision': 0.9393939393939394,
  'recall': 0.8857142857142857,
  'f1': 0.9117647058823529,
  'number': 70},
 'ARG1': {'precision': 0.7788461538461539,
  'recall': 0.7788461538461539,
  'f1': 0.7788461538461539,
  'number': 104},
 'ARG1-DSP': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARG2': {'precision': 0.23529411764705882,
  'recall': 0.5,
  'f1': 0.31999999999999995,
  'number': 8},
 'ARG3': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2},
 'ARGM-ADJ': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARGM-ADV': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARGM-CXN': {'precision': 0.5714285714285714,
  'recall': 0.8,
  'f1': 0.6666666666666666,
  'number': 5},
 'ARGM-DIR': {'precis

### 2.3 Limitations of the advanced model:
- The predicate indicator is not precise enough. Categorizing predicates by their part-of-speech tags (VB/JJ/RB/NN/...) could help improve model performance.
- In this model, we did not use text embeddings or other feature extraction techniques, so training is still time-consuming because of the large size of the dataset and the number of different labels.

## 3. Description of the results

For most individual classes, the more advanced model improves precision, recall, and F1 score over the baseline. There are exceptions, however: for 'DIS', for example, the baseline reaches an F1 of 0.808 while the more advanced model reaches 0.781. The table in Section 3.1 lists the classes for which the baseline model performs better in terms of precision, recall, or F1 score. Overall, the more advanced model performs better, with higher precision, recall, and F1 score than the baseline (see the table in Section 3.2).

### 3.1 Cases with baseline model outperforming

In [2]:
from result_table import baseline_df, advanced_df, filtered_table
import pandas as pd
from IPython.display import display

In [3]:
fil_baseline_df, fil_advanced_df = filtered_table(baseline_df, advanced_df)
comparison_df = pd.concat([fil_baseline_df, fil_advanced_df], axis=1, keys=['Baseline', 'Advanced'])
comparison_df = comparison_df.sort_values(by=('Baseline', 'number'), ascending=False)
# Display the comparison DataFrame
display(comparison_df)

| Class | Baseline precision | Baseline recall | Baseline f1 | Baseline number | Advanced precision | Advanced recall | Advanced f1 | Advanced number |
|---|---|---|---|---|---|---|---|---|
| DIS | 0.787302 | 0.829431 | 0.807818 | 299 | 0.777174 | 0.785714 | 0.781421 | 182 |
| ADJ | 0.731225 | 0.714286 | 0.722656 | 259 | 0.736111 | 0.713004 | 0.724374 | 223 |
| MNR | 0.585526 | 0.570513 | 0.577922 | 156 | 0.569444 | 0.557823 | 0.563574 | 147 |
| EXT | 0.780952 | 0.745455 | 0.762791 | 110 | 0.775701 | 0.798077 | 0.78673 | 104 |
| PRP | 0.55814 | 0.631579 | 0.592593 | 76 | 0.597403 | 0.613333 | 0.605263 | 75 |
| RG3 | 0.714286 | 0.337838 | 0.458716 | 74 | 0.660714 | 0.5 | 0.569231 | 74 |
| PRR | 0.787879 | 0.753623 | 0.77037 | 69 | 0.776119 | 0.753623 | 0.764706 | 69 |
| RG4 | 0.640625 | 0.732143 | 0.683333 | 56 | 0.655738 | 0.714286 | 0.683761 | 56 |
| DIR | 0.47619 | 0.425532 | 0.449438 | 47 | 0.45098 | 0.489362 | 0.469388 | 47 |
| PRD | 0.451613 | 0.318182 | 0.373333 | 44 | 0.365854 | 0.340909 | 0.352941 | 44 |
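
How `filtered_table` selects these rows is not shown. The sketch below illustrates one plausible criterion, assuming a class is kept when the baseline beats the advanced model on at least one of precision, recall, or f1. The helper name `classes_where_baseline_wins` is hypothetical, and plain dictionaries stand in for the DataFrames. The scores are copied from the result tables.

```python
from typing import Dict, List

Scores = Dict[str, float]
METRICS = ("precision", "recall", "f1")


def classes_where_baseline_wins(
    baseline: Dict[str, Scores], advanced: Dict[str, Scores]
) -> List[str]:
    """Return the classes where the baseline is better on at least one metric."""
    winners = []
    for name, base_scores in baseline.items():
        adv_scores = advanced.get(name)
        if adv_scores is None:
            continue  # class missing from the advanced results, nothing to compare
        if any(base_scores[m] > adv_scores[m] for m in METRICS):
            winners.append(name)
    return winners


baseline = {
    "DIS": {"precision": 0.787302, "recall": 0.829431, "f1": 0.807818},
    "TMP": {"precision": 0.798658, "recall": 0.857658, "f1": 0.827107},
    "EXT": {"precision": 0.780952, "recall": 0.745455, "f1": 0.762791},
}
advanced = {
    "DIS": {"precision": 0.777174, "recall": 0.785714, "f1": 0.781421},
    "TMP": {"precision": 0.864717, "recall": 0.874307, "f1": 0.869485},
    "EXT": {"precision": 0.775701, "recall": 0.798077, "f1": 0.78673},
}
print(classes_where_baseline_wins(baseline, advanced))
```

With this criterion 'EXT' is kept although its f1 improves, because the baseline precision is slightly higher. This is consistent with the rows listed in the table.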


### 3.2 Results table

In [4]:
comparison_df_origin = pd.concat([baseline_df, advanced_df], axis=1, keys=['Baseline', 'Advanced'])
comparison_df_origin = comparison_df_origin.sort_values(by=('Baseline', 'number'), ascending=False)
# Display the comparison DataFrame
display(comparison_df_origin)

| Class | Baseline precision | Baseline recall | Baseline f1 | Baseline number | Advanced precision | Advanced recall | Advanced f1 | Advanced number |
|---|---|---|---|---|---|---|---|---|
| _ | 0.777235 | 0.79863 | 0.787787 | 8174 | 0.834091 | 0.843263 | 0.838652 | 7184 |
| RG1 | 0.844009 | 0.885061 | 0.864047 | 3454 | 0.886426 | 0.912758 | 0.899399 | 3198 |
| RG0 | 0.87101 | 0.907104 | 0.888691 | 2196 | 0.887577 | 0.916425 | 0.90177 | 1723 |
| RG2 | 0.755257 | 0.783595 | 0.769165 | 1146 | 0.853769 | 0.835556 | 0.844564 | 1125 |
| TMP | 0.798658 | 0.857658 | 0.827107 | 555 | 0.864717 | 0.874307 | 0.869485 | 541 |
| ADV | 0.691057 | 0.637899 | 0.663415 | 533 | 0.703625 | 0.670732 | 0.686785 | 492 |
| MOD | 0.932653 | 0.970276 | 0.951093 | 471 | 0.9819 | 0.9819 | 0.9819 | 442 |
| DIS | 0.787302 | 0.829431 | 0.807818 | 299 | 0.777174 | 0.785714 | 0.781421 | 182 |
| ADJ | 0.731225 | 0.714286 | 0.722656 | 259 | 0.736111 | 0.713004 | 0.724374 | 223 |
| NEG | 0.917749 | 0.959276 | 0.938053 | 221 | 0.981221 | 0.967593 | 0.974359 | 216 |
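
The short row names ('RG0', 'TMP', '_') differ from the label names used elsewhere in this notebook ('ARG0', 'ARGM-TMP'). A plausible explanation, stated here as an assumption rather than a verified fact, is that the metric treats the first character of every label as a chunk tag and keeps only the part after the first hyphen. The sketch below reproduces that naming rule. `reported_class_name` is a hypothetical helper, not part of seqeval or of `baseline_ds`, and the label 'R-ARG0' is included only as a hypothetical example of a prefixed label.

```python
def reported_class_name(label: str) -> str:
    """Guess the class name shown in the results for a given label (assumed rule)."""
    # Drop the first character (treated as a chunk tag), then keep whatever
    # follows the first hyphen; fall back to '_' if nothing is left.
    return label[1:].split("-", maxsplit=1)[-1] or "_"


for label in ["ARG0", "ARG1", "ARGM-TMP", "ARGM-DIS", "R-ARG0", "_"]:
    print(label, "->", reported_class_name(label))
```

If this assumption holds, 'RG0' to 'RG4' correspond to the core roles 'ARG0' to 'ARG4', and three-letter names such as 'TMP' or 'DIS' correspond to 'ARGM-TMP' and 'ARGM-DIS'.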


## 4. Conclusion and Future work

In conclusion, two models were developed: a baseline model and a more advanced model. The baseline model struggled with several issues, outlined in Section 1.4. Primarily, it had no robust mechanism for dealing with sentences containing multiple predicates, which led to misclassifications and lower scores. Furthermore, owing to the high cardinality of the target feature, training time was high. In the more advanced model, a '[V]' marker was prefixed to each predicate to flag the predicate in the sentence. It is important to note that no additional features were added to the more advanced model. Even this slight augmentation of the dataset yielded better results across all measures: overall precision increased from ~0.79 to ~0.84, recall improved from roughly 0.82 to 0.85, the F1 score improved from 0.80 to 0.84, and accuracy improved (slightly) from 0.974 to 0.982. However, as the output classes were left unaltered, the high cardinality remained, and so did the significant training times.

As future work, more investigation could go into text embeddings or other forms of feature extraction, which can help reduce the size of the training data while retaining the same level of granularity. More importantly, however, research should be dedicated to meaningful reductions in the number of output classes. Depending on the model application, reducing the number of output classes, either by dropping classes or preferably by grouping them together, could lead to significantly reduced training times. This would make it more realistic to train even larger language models.
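
As an illustration of what grouping classes could look like, the sketch below collapses all modifier roles into one class. The grouping rule and the helper `group_label` are hypothetical. Which classes can be merged without losing needed distinctions depends on the application and was not investigated here.

```python
def group_label(label: str) -> str:
    """Collapse every modifier role (ARGM-*) into a single 'ARGM' class; keep the rest."""
    return "ARGM" if label.startswith("ARGM-") else label


labels = ["ARG0", "ARG1", "ARGM-TMP", "ARGM-DIS", "ARGM-NEG", "_"]
grouped = sorted({group_label(label) for label in labels})
print(len(set(labels)), "->", len(grouped), "classes:", grouped)
```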