# Advanced NLP Assignment 3, Group 1
Thijs Vollebregt (2670637), Chuqiao Guo (2798305), Yijing Zhang (2818171), Danna Shao (2663369)

### Importing libraries and utils:

In [2]:
import warnings
warnings.filterwarnings('ignore')

import transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import load_metric

from read_and_preprocess import *

### Basic settings
- We use seqeval as the metric and bert-base-uncased as the BERT model, together with its corresponding auto tokenizer.

In [3]:
metric = load_metric("seqeval")
task = "srl"
model_checkpoint = "bert-base-uncased"  # bert-base-uncased for better precision, distilbert-base-uncased for a faster run
batch_size = 16
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

## 1. Baseline model
The approach used for the baseline model is to convert each sentence into the following form, with the predicate appended after the first [SEP]:

> [CLS] This is the sentence content [SEP] is [SEP].

This is realized by using the sentence-pair logic of the auto tokenizer:
`tokenizer(list1, list2)` returns `[CLS] list1 content [SEP] list2 content [SEP]`.
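
The layout described above can be sketched in plain Python. This is only an illustration on whole words: the helper `pair_layout` is hypothetical, and the real tokenizer additionally splits words into subwords and returns input ids and an attention mask.

```python
# Illustrative only: mimics the token layout of a BERT-style sentence-pair
# encoding on whole words. pair_layout is a hypothetical helper, not part of
# the transformers library or of baseline_ds.py.
def pair_layout(sentence_words, predicate_words):
    """Return the [CLS] A [SEP] B [SEP] layout for two word lists."""
    return ["[CLS]", *sentence_words, "[SEP]", *predicate_words, "[SEP]"]


words = ["This", "is", "the", "sentence", "content"]
layout = pair_layout(words, ["is"])
print(" ".join(layout))
```

With the actual tokenizer, pre-split word lists are typically passed with `is_split_into_words=True`; whether baseline_ds.py does so is not shown in this notebook.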

### 1.1 Importing datasets and libraries:
The corpus is loaded and preprocessed into a Hugging Face dataset by baseline_ds.py. This script provides the following functions and variables for the baseline model:

Dataset preparation:
- `get_mappings_dict()`: Get the dictionary mapping string classes (e.g. 'ARG0') to int labels, and its reverse.
- `create_word_sentlist()`: Create a dictionary containing the data required to generate the desired Hugging Face dataset.
- `tokenize_and_align_labels()`: Adapted from the example notebook. Solves label alignment after re-tokenization.
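
As a minimal sketch of the label alignment step, assuming it follows the common pattern of the Hugging Face token-classification example: each token carries the index of the word it came from, special tokens carry `None`, and positions that should not contribute to the loss get the label `-100`. The function name `align_labels` and the toy inputs are hypothetical, and the way baseline_ds.py labels the appended predicate segment is not shown here.

```python
def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Spread word-level labels over subword tokens.

    word_ids: for each token, the index of its source word, or None for
    special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: ignored by the loss.
            aligned.append(-100)
        elif word_id != previous:
            # First subword of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords are either labelled too or ignored.
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned


# Hypothetical example: word 2 is split into two subword tokens.
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [3, 0, 5]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```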
    
Evaluation:
- `compute_metrics()`: Compute the overall precision, recall and F1.
- `reverse_label()`: Map the int class labels back to strings.
- `class_results()`: Compute the precision, recall and F1 for each class.
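
Before a sequence metric such as seqeval can be applied, model outputs usually have to be turned back into string tags. The following is a sketch of that decoding step under stated assumptions: logits are nested lists, ignored positions carry the label `-100`, and the label names and the helper `decode_predictions` are hypothetical rather than taken from baseline_ds.py.

```python
IGNORE_INDEX = -100  # label id assumed to mark positions skipped by the loss


def decode_predictions(logits, label_ids, id_to_label):
    """Take the argmax per token and drop positions labelled IGNORE_INDEX."""
    pred_tags, true_tags = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_pred, sent_true = [], []
        for token_logits, label in zip(sent_logits, sent_labels):
            if label == IGNORE_INDEX:
                continue
            # Index of the highest-scoring class for this token.
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            sent_pred.append(id_to_label[best])
            sent_true.append(id_to_label[label])
        pred_tags.append(sent_pred)
        true_tags.append(sent_true)
    return pred_tags, true_tags


# Hypothetical label mapping and scores for one four-token sentence.
id_to_label = {0: "O", 1: "ARG0", 2: "ARG1"}
logits = [[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 0.9], [0.5, 0.4, 0.3]]]
label_ids = [[-100, 1, 1, -100]]
pred_tags, true_tags = decode_predictions(logits, label_ids, id_to_label)
print(pred_tags, true_tags)
```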

Variables: 
- `label_dict, label_dict_rev`: The mapping dictionary from string class to int class and its reverse.
- `tokenized_train, tokenized_dev, tokenized_test`: The tokenized and ready-to-use datasets.

In [4]:
import baseline_ds

Map:   0%|          | 0/42466 [00:00<?, ? examples/s]

Map:   0%|          | 0/5441 [00:00<?, ? examples/s]

Map:   0%|          | 0/5328 [00:00<?, ? examples/s]

### 1.2 Creating baseline model:
The model is trained on the tokenized training dataset and evaluated on the dev dataset after each epoch.

In [5]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(baseline_ds.label_dict))
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tokenizer)
label_list = baseline_ds.label_list

base_trainer = Trainer(
    model,
    args,
    train_dataset=baseline_ds.tokenized_train,
    eval_dataset=baseline_ds.tokenized_dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=baseline_ds.compute_metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 1.3 Train and evaluate:
After training, the final evaluation is done on the test dataset.

In [5]:
base_trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0882,0.108037,0.798406,0.819914,0.809017,0.970571
2,0.063,0.105916,0.817995,0.827277,0.82261,0.972739
3,0.0481,0.109497,0.820079,0.83386,0.826912,0.973297


TrainOutput(global_step=7965, training_loss=0.07123838856919187, metrics={'train_runtime': 352.3275, 'train_samples_per_second': 361.59, 'train_steps_per_second': 22.607, 'total_flos': 4555060725912480.0, 'train_loss': 0.07123838856919187, 'epoch': 3.0})
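
As a rough sanity check on the training log, the reported numbers can be reproduced with simple arithmetic. The sketch below assumes that the first mapped split (42466 examples) is the training set, that training ran on a single device without gradient accumulation, and that F1 is the harmonic mean of the rounded precision and recall shown for epoch 3.

```python
import math

train_examples = 42466  # assumed to be the training split
batch_size = 16
epochs = 3

# One optimizer step per batch; the last, partial batch still counts.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2655 7965

# Epoch 3 precision and recall as printed in the table (rounded values).
precision, recall = 0.820079, 0.83386
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
```

The step count matches `global_step=7965`, and the recomputed F1 agrees with the reported 0.826912 up to rounding.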

- Getting the predictions and true labels for the test dataset:

In [8]:
base_pred, base_labels, _ = base_trainer.predict(baseline_ds.tokenized_test)

- Evaluate the results of each class:

In [9]:
baseline_ds.class_results(base_pred, base_labels)

{'ADJ': {'precision': 0.7312252964426877,
  'recall': 0.7142857142857143,
  'f1': 0.72265625,
  'number': 259},
 'ADV': {'precision': 0.6910569105691057,
  'recall': 0.6378986866791745,
  'f1': 0.6634146341463414,
  'number': 533},
 'ARG0': {'precision': 0.9242424242424242,
  'recall': 0.8714285714285714,
  'f1': 0.8970588235294117,
  'number': 70},
 'ARG1': {'precision': 0.6554621848739496,
  'recall': 0.75,
  'f1': 0.6995515695067266,
  'number': 104},
 'ARG1-DSP': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARG2': {'precision': 0.2,
  'recall': 0.25,
  'f1': 0.22222222222222224,
  'number': 8},
 'ARG3': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2},
 'ARGM-ADJ': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARGM-ADV': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'ARGM-CXN': {'precision': 0.3333333333333333,
  'recall': 0.4,
  'f1': 0.3636363636363636,
  'number': 5},
 'ARGM-DIR': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 

### 1.4 Limitations of the baseline model:
- Cannot precisely handle sentences with duplicated predicates.
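
One plausible reading of this limitation, sketched under the assumption that only the surface form of the predicate is appended after the first [SEP] (as in the input format of section 1): when the same word occurs twice as a predicate, both instances produce an identical model input, so the model has no signal for which occurrence is meant. The helper `baseline_input` and the example sentence are hypothetical.

```python
def baseline_input(sentence_words, predicate_index):
    """Build the word-level baseline layout for one predicate instance."""
    predicate = sentence_words[predicate_index]
    return ["[CLS]", *sentence_words, "[SEP]", predicate, "[SEP]"]


# Hypothetical sentence in which "said" is a predicate at positions 1 and 4.
words = ["He", "said", "that", "she", "said", "no"]
first = baseline_input(words, 1)
second = baseline_input(words, 4)

# Both predicate instances yield exactly the same input sequence.
print(first == second)  # True
```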

## 2. More advanced model: