# Semantic Role Labelling(SRL) with BERT 
##### by Andrei, Sezen and Emma

SRL is the task of identifying and classifying predicate-argument semantic roles within a text and eventually answer the questions of: who, did what to whom, and where, when...etc. It comprises four main
sub-tasks: predicate detection, predicate sense disambiguation, argument identification, argument classification. 
In this notebook, we will focus solely on argument identification and classification. 

To carry out the task, we will use the English-language Universal Proposition Banks dataset. The dataset is structured as a CoNLL-U file containing different sentences. In the file, each sentence is first represented with a document and sentence id and then as raw text. 
Then, each sentence is broken down in tokens with each line concerning a single token and its corresponding syntactic features. The various fields are separated by tabs. If a sentence contains predicates, then the next field indicates the labeled (disambiguated) predicates within each sentence, while the final field marks their corresponding arguments.

The disambiguated predicates in the gold data will be used as the basis for the argument extraction and classification while the process will be performed in a single step.



In [1]:
# !pip install -r requirements.txt

### Required Imports

In [2]:
import pandas as pd
import numpy as np

from code_.process_conll import process_file, advanced_process_file
from code_.evaluation import class_report_base, class_report_advanced, shrink_predictions
from code_.bert import Tokenizer, convert_to_dataset, compute_metrics, get_labels_list_from_dataset, task, batch_size, model_checkpoint
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
from datasets import load_metric

  metric = load_metric("seqeval")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


### The Approach
The argument identification and classification experiment will be carried out with two different approaches both employing the BERT model. 

Given that the predicate identification is a given in this experiment, the main challenge of this task is to find a solution for
representing the relation between each sentence and the predicate under consideration.

In the original PropBank dataset, sentences can host any number of predicates (including none) thus, when a sentence has two or more predicates, the rest of its tokens are bound to express different arguments in relation to each predicate. Consequently, in order for the model to be able to correctly predict and assign argument labels for a sentence, we need to first establish - each time - the predicate under consideration. For example, in the sentence:

"John cooked dinner and served dessert to his guests."

for both predicates "cooked" and "served", the subject (ARG0) is "John", but the object (ARG1) is "dinner" for "cooked" and "dessert" for "served".

Thus, our focus lies in determining an effective representation for the model input allowing for the model to classify the arguments within a sentence relative to a specific predicate. The first step is reduplicating each sentence based on the number of its predicates so each sentence is presented in the dataset each time in relation to a unique predicate.

Next, the process involves structuring the input data in a manner that enables the model to understand the relationships between tokens and the possible argument labels within the context of a specific predicate.

To experiment on the basis of the input representation, we will build both a baseline and an advanced model. In both cases the relation of each sentence and a specific predicate will be encoded into the model's input. 

### The Model(s) & Loading the Datasets

#### The Baseline

The function of the baseline model is to offer a straightforward approach to the task, serving as a reference point for evaluating the effectiveness of the more advanced model. As such, the baseline will be employing a more simplistic approach to input representation inspired by Shi and Lin (2019):

[CLS] Barack Obama went to Paris [SEP] went [SEP] ,

in which the predicate information - each time signaling the predicate in question - only consists of the predicate token itself. At the same time, the [CLS] token typically represents the aggregate understanding of the entire sentence, serving as a sentence-level representation and capturing the contextual information necessary for semantic role labeling.
The [SEP] token demarcates the boundaries between different segments of the input. In this representation, it separates the sentence from the predicate.

This approach entails limitations. The mosts obvious one would be the case in which a sentence contains the same predicate twice. In this case only using the predicate itself as the predicate indicator would not allow the model to distinguish between the two different predicates. 

In [3]:
df_val = process_file('data/raw/en_ewt-up-dev.conllu')
df_train = process_file('data/raw/en_ewt-up-train.conllu')
df_test = process_file('data/raw/en_ewt-up-test.conllu')

process_file(): dataframe len: 4979
process_file(): dataframe len: 40498
process_file(): dataframe len: 4802


For the baseline model, the datasets are transformed into dataframes using the imported `process_file` fuction. The function splits the text into individual sentences based on empty lines (\n\n) with each line in the produced dataframe concerning a single sentence. The function then retrieves predicate indices and columns from the current sentence and extract labels for tokens in relation to predicates from the current sentence. Finally, it retrieves the corresponding word for each predicate from the sentence and adds a row to the DataFrame.

More specifically, the first column of the DataFrame hosts the tokenized sentence. The second column hosts one predicate per sentence each time. The third column gives the length of the argument columns attached to each predicate while the last one hosts the labelled arguments for each predicate mapped to the sentence tokens.

In [4]:
df_test.head(10)

Unnamed: 0,sentence,predicate,pred columns,labels
0,"[What, if, Google, Morphed, Into, GoogleOS, ?]",Morphed,11,"_, _, ARG1, V, _, ARG2, _"
1,"[What, if, Google, expanded, on, its, search, ...",expanded,11,"_, _, ARG0, V, _, _, _, _, _, _, _, _, _, _, A..."
2,"[(, And, ,, by, the, way, ,, is, anybody, else...",way,11,"_, _, _, _, _, V, _, _, _, _, _, _, _, _, _, _..."
3,"[(, And, ,, by, the, way, ,, is, anybody, else...",is,12,"_, _, _, _, _, ARGM-DIS, _, V, ARG1, _, _, _, ..."
4,"[(, And, ,, by, the, way, ,, is, anybody, else...",was,13,"_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _..."
5,"[This, BuzzMachine, post, argues, that, Google...",post,11,"_, ARG2, V, _, _, _, _, _, _, _, _, _, _, _, _..."
6,"[This, BuzzMachine, post, argues, that, Google...",argues,12,"_, _, ARG0, V, _, _, _, _, _, _, _, ARG1, _, _..."
7,"[This, BuzzMachine, post, argues, that, Google...",rush,13,"_, _, _, _, _, ARG1, _, V, _, ARG2, _, _, _, _..."
8,"[This, BuzzMachine, post, argues, that, Google...",backfire,14,"_, _, _, _, _, _, _, ARG1, _, _, ARGM-MOD, V, ..."
9,"[This, BuzzMachine, post, argues, that, Google...",'ve,15,"_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, V..."


The `convert_to_dataset` function transforms data from pandas DataFrames (train, val, and test) into a DatasetDict. The function streamlines data preparation for machine learning tasks, facilitating easier management and processing of data.

The labels_list contains all of the target labels used to train and test the model.

The dataset variable now contains the transformed train, evaluation, and test datasets.

In [5]:
dataset = convert_to_dataset(df_train, df_val, df_test)

In [6]:
labels_list = get_labels_list_from_dataset(dataset)
print(sorted(labels_list, key=lambda x: len(x)))

['V', '_', 'C-V', 'ARG0', 'ARG1', 'ARG2', 'ARG3', 'ARG4', 'ARG5', 'ARGA', 'C-ARG0', 'C-ARG1', 'C-ARG2', 'C-ARG3', 'C-ARG4', 'R-ARG0', 'R-ARG1', 'R-ARG2', 'R-ARG3', 'R-ARG4', 'ARG1-DSP', 'ARGM-ADJ', 'ARGM-ADV', 'ARGM-CAU', 'ARGM-COM', 'ARGM-CXN', 'ARGM-DIR', 'ARGM-DIS', 'ARGM-EXT', 'ARGM-GOL', 'ARGM-LOC', 'ARGM-LVB', 'ARGM-MNR', 'ARGM-MOD', 'ARGM-NEG', 'ARGM-PRD', 'ARGM-PRP', 'ARGM-PRR', 'ARGM-REC', 'ARGM-TMP', 'C-ARG1-DSP', 'C-ARGM-ADV', 'C-ARGM-COM', 'C-ARGM-CXN', 'C-ARGM-DIR', 'C-ARGM-EXT', 'C-ARGM-GOL', 'C-ARGM-LOC', 'C-ARGM-MNR', 'C-ARGM-PRP', 'C-ARGM-PRR', 'C-ARGM-TMP', 'R-ARGM-ADJ', 'R-ARGM-ADV', 'R-ARGM-CAU', 'R-ARGM-COM', 'R-ARGM-DIR', 'R-ARGM-GOL', 'R-ARGM-LOC', 'R-ARGM-MNR', 'R-ARGM-TMP']


A Transformers Tokenizer is used to tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires. The functions in the Tokenizer can be seen in bert.py, which is the code_ folder. 

By making use of the Tokenizer and target label list, the `tokenize_and_align_labels_pred` function preprocesses the data for a sequence labeling task by tokenizing sentences, aligning labels with tokens, and preparing them for model training or evaluation. 

More specifically, the function tokenizes sentences and predicates using the provided tokenizer, aligns the tokens from the sentence with the predicate tokens, iterates through each tokenized word in the sentence, checking if it matches any label and assigning the corresponding label index. If there's no match, it assigns the index of a placeholder label ('_') while if a predicate occurs multiple times in the sentence, the labels are repeated accordingly.

It returns a list of lists, where each inner list represents the labels associated with each token in the sentence, including those corresponding to the predicate.

In [7]:
tok = Tokenizer(model_checkpoint, labels_list)

As can be seen in the following output, the tokenized_datasets consists of the trainset, validation set, and the test set. The features that are in these datasets are: sentence, predicate, pred columns, labels, input_ids, and the attention mask

The input_ids are the in the format: [101, 2632, 1011, 23564, 2386, 1024, 2137, 2749, 2730, 21146, 28209, 14093, 2632, 1011, 2019, 2072, 1010, 1996, 14512, 2012, 1996, 8806, 1999, 1996, 2237, 1997, 1053, 4886, 2213, 1010, 2379, 1996, 9042, 3675, 1012, 102, 2730, 102]. All tokens are tokenized and split into these word_ids. A [CLS] (input_id = 101) and [SEP] (input_id = 102) is added to these input_ids. We have made sure that the length of these input_ids, the labels and the attention_mask for each sentence are all the same, which is necessary for training the model. 

In [8]:
tokenized_datasets = dataset.map(tok.tokenize_and_align_labels_pred, batched=True)
print(tokenized_datasets)

Map:   0%|          | 0/40498 [00:00<?, ? examples/s]

Map:   0%|          | 0/4979 [00:00<?, ? examples/s]

Map:   0%|          | 0/4802 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence', 'predicate', 'pred columns', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 40498
    })
    validation: Dataset({
        features: ['sentence', 'predicate', 'pred columns', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 4979
    })
    test: Dataset({
        features: ['sentence', 'predicate', 'pred columns', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 4802
    })
})


In [9]:
tokenized_datasets['test']['labels'][0]

[-100, 60, 60, 1, 59, 59, 59, 60, 3, 3, 60, -100, 60, 60, 60, -100]

In [10]:
tok.tokenizer(dataset['test']['sentence'][0], is_split_into_words=True).word_ids()

[None, 0, 1, 2, 3, 3, 3, 4, 5, 5, 6, None]

In [12]:
dataset['test']['labels'][0]

'_, _, ARG1, V, _, ARG2, _'

In [13]:
[labels_list[i] if i >=0 else 'X' for i in tokenized_datasets['test']['labels'][0]]

['X',
 '_',
 '_',
 'ARG1',
 'V',
 'V',
 'V',
 '_',
 'ARG2',
 'ARG2',
 '_',
 'X',
 '_',
 '_',
 '_',
 'X']

By using the `dataset.map()` method, we apply the `tokenize_and_align_labels_pred` function to each part (train, dev and test) in the dataset in a parallel manner. 

The batched parameter set to True indicates that the function should be applied in batches, rather than one part at a time, to improve efficiency.

The next cell is used to save the tokenized trainfile, in order to check whether or not the training data has the correct format. When this augmented_train_tokenized.csv is opened, we can detect that the padding is added correctly, as well as the special tokens cls and sep. 

In [14]:
new_training_data = []
for sent in tokenized_datasets['train']:
    tokens = tok.tokenizer.convert_ids_to_tokens(sent['input_ids'])
    new_training_data.append(tokens)
df_train_tokenized = pd.DataFrame(new_training_data)
df_train_tokenized.to_csv('augmented_train_tokenized.csv', index=False)

We have stored the weights for both baseline and advanced models at Google Drive, in the [folder](https://drive.google.com/drive/folders/1Vi92r9aPJ7laShdYxuFSEYCRyp7K54Qv?usp=sharing), or you can train them yourselves, it was about 5 minutes on colab's T4 GPU for 1 epoch. In this case you need to uncomment the cell with `trainer.train()` code below

In [21]:
# use this to train the model from scratch
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list))

# or this to load the weights of finetuned model that you may download here:
# model = AutoModelForTokenClassification.from_pretrained('model_checkpoints/baseline', num_labels=len(labels_list))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This next cell gives us the arguments which are used to train the model. The model is saved in model_checkpoints/pred. Even though the metric 'seqeval' is given, this is not used to compute the evaluation of the model. The `data_collator` is used to handle the processing of the data into batches during traning. The `train_dataset` takes as input the whole `tokenized_datasets["train"]`, and the evaluation dataset takes the entire `tokenized_datasets["validation"]`. However, this dataset is not used here, since the evaluation itself is performed later in this notebook. The number of epochs is set to 1, which can eventually lead to underfitting, since the model might miss some patterns that have not been caught during this 1 epoch. 

In [22]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"model_checkpoints/baseline",
    evaluation_strategy = 'epoch',
    # eval_steps=200,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tok.tokenizer)
metric = load_metric("seqeval")

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tok.tokenizer,
    compute_metrics=compute_metrics
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.1737,0.148761,0.69973,0.65211,0.675081,0.958931
2,0.1085,0.123785,0.752093,0.689801,0.719601,0.964453
3,0.0897,0.1167,0.734153,0.748828,0.741418,0.965736
4,0.0753,0.114137,0.745819,0.750234,0.74802,0.966715
5,0.0685,0.113817,0.747808,0.749824,0.748815,0.966938


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=6330, training_loss=0.12475125251034802, metrics={'train_runtime': 933.5534, 'train_samples_per_second': 216.902, 'train_steps_per_second': 6.781, 'total_flos': 4267297264010748.0, 'train_loss': 0.12475125251034802, 'epoch': 5.0})

The next cells predicts the validation set by using `trainer.predict`. 
These predictions are then saved into a bigger list (`list_predictions`), as well as the `true_labels`. The tokens with -100 are not used, since these indicate a special padding token (which are used to make all sentences the same length). For evaluation, these padding tokens must be removed.   

In [24]:
label_list = tokenized_datasets["train"].features['labels'].feature

Value(dtype='int64', id=None)

In [26]:
import warnings
warnings.filterwarnings('ignore')
predictions_raw, labels, _ = trainer.predict(tokenized_datasets["test"])
# predictions_raw, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions_raw, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [labels_list[p] for (p, l) in zip(prediction, label) if l != -100 and p < len(labels_list)]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [labels_list[l] for (p, l) in zip(prediction, label) if l != -100 and p < len(labels_list)]
    for prediction, label in zip(predictions, labels)
]

In [28]:
val_word_ids = []
# for sentence in dataset['validation']['sentence']:
for sentence in dataset['test']['sentence']:
    val_word_ids.append(tok.tokenizer(sentence, truncation=True, is_split_into_words=True).word_ids())

df = pd.DataFrame(columns=['sentence', 'prediction', 'gold', 'word_ids'])
# for tokens, prediction, gold, word_ids in zip(tokenized_datasets['validation']['input_ids'], true_predictions, true_labels, val_word_ids):
for tokens, prediction, gold, word_ids in zip(tokenized_datasets['test']['input_ids'], true_predictions, true_labels, val_word_ids):
    sentence = tok.tokenizer.decode(tokens)
    df.loc[len(df.index)] = [sentence, prediction, gold, word_ids]
# df.to_csv('data/output/base.csv')

In [29]:
gold_restored = []
pred_restored = []
for i, row in df.iterrows():
    sentence = row[0]
    orig_sentence = sentence.split('[SEP]')[0].split(' ')[1:]
    prediction = row[1]
    gold = row[2]
    word_ids = row[3][1:-1]
    gold_restored.append(shrink_predictions(word_ids, gold))
    pred_restored.append(shrink_predictions(word_ids, prediction))

df['gold_restored'] = gold_restored
df['pred_restored'] = pred_restored
df.to_csv('data/output/base_test.csv')

In [30]:
from code_.evaluation import class_report_base
class_report_base('data/output/base_test.csv')

              precision    recall  f1-score   support

         'V'       0.96      0.97      0.97      4808
         '_'       0.99      0.99      0.99     87015
       'C-V'       0.62      0.62      0.62        16
      'ARGA'       0.00      0.00      0.00         2
      'ARG3'       1.00      0.04      0.08        74
      'ARG2'       0.73      0.76      0.74      1130
      'ARG5'       0.00      0.00      0.00         1
      'ARG0'       0.84      0.89      0.86      1736
      'ARG4'       0.65      0.54      0.59        56
      'ARG1'       0.85      0.87      0.86      3244
    'C-ARG4'       0.00      0.00      0.00         0
    'C-ARG2'       0.00      0.00      0.00         7
    'R-ARG2'       0.00      0.00      0.00         1
    'C-ARG0'       0.00      0.00      0.00         3
    'R-ARG0'       0.88      0.91      0.90        67
    'C-ARG1'       0.57      0.54      0.55        52
    'C-ARG3'       0.00      0.00      0.00         2
    'R-ARG3'       0.00    

In [31]:
trainer.save_model()

### Results of the baseline model.
The overall results of the baseline model evidently show its limitations also on the practical level. Although the model presents a high micro-averaged F1-score of 0.95, indicating that the model performs relatively well across all classes - when considering each prediction equally - this is quickly controverted by its very low macro average scores. Macro-average calculates the metric independently for each class and then takes the average across all classes, treating each class equally regardless of its size or frequency. A quick inspection of individual classes makes it clear that the high aggragated micro scores are mostly the product of the a few higly represented classes, including both the predicate class and the placeholder label. 

There is a considerable variation in precision, recall, and F1-score across labels as some classes have high precision and recall, while others have much lower scores including 0. The model seems to be performing well on frequent and well represented labels while the very infrequent ones do not get at all predicted. Classes like 'ARGM-NEG' (Negative Modifier) and 'ARGM-MOD' (Modal Modifier) have a big range in performance, with F1-scores ranging from 0.33 to 0.91, although they are not the most represented ones. On the other hand, 'ARG1' and 'ARG0' are more frequent and show promising but clearly lower results.

While this approach provides a straightforward starting point, it presents notable limitations. One critical drawback is the model's inability to grasp contextual nuances, limiting its capacity to capture the intricate relationships between predicates and their arguments accurately. Due to the lack of contextual understanding, the model may struggle to disambiguate between different senses of predicates and may misclassify arguments. Moreover, relying on simplistic representations limits the model's generalization capabilities across diverse texts and domains, impacting its performance on unseen data. This can also be proven by the extremely low results on classes that are presented to the model in very low frequency.

##### Limitations of baseline Model. 
Giving only the predicate as a method to find the semantic roles for this task can resolve in finding the roles. When the context is given, this can help in detecting the semantic roles. When the focus is laid on the the surrounding words of the predicate, it is more likley that the model will be able to find Arg0 and Arg1 since they are typically right before and after the predicate.

The obvious limitation is that the baseline model does not understand when 2 predicates are of the same token. In a sentence such as: My mom is a good cook and is lovely. having this same predicate 'is' could result in confusion for the different arguments per predicate. 

# Advanced Model

For the advanced model we chose to implement a version of your baseline that represents the relation between predicate and sentence in a more sophisticated way which will hopefully manage to tackle the same predicate limitation entailed in the baseline.

More specifically, instead of only representing the predicate with its corresponding token, we tried to distinguish it even further by employing its context: 

[CLS] Barack Obama went to Paris [SEP] Obama,went,to [SEP].

The previous and next tokens - together with the predicate itself - are used to represent the predicate and differentiate using context. Preprocessing the files for this advanced model is done by using advanced_process_file in process_conll.py

Most code cells are the same as the baseline model. Only the input of this model has been changed to the input that is described previously. 

In [None]:
df_val = advanced_process_file('data/raw/en_ewt-up-dev.conllu')
df_train = advanced_process_file('data/raw/en_ewt-up-train.conllu')
df_test = advanced_process_file('data/raw/en_ewt-up-test.conllu')

In [None]:
dataset = convert_to_dataset(df_train, df_val, df_test)

By again making use of the `Tokenizer` and target label list, the `tokenize_and_align_labels_context` function preprocesses the data for a sequence labeling task by tokenizing sentences, aligning labels with tokens, and preparing them for model training or evaluation. 

More specifically, the function tokenizes sentences and predicates using the provided tokenizer, aligns the tokens from the sentence with the predicate tokens, iterates through each tokenized word in the sentence, checking if it matches any label and assigning the corresponding label index. If there's no match, it assigns the index of a placeholder label ('_') while if a predicate occurs multiple times in the sentence, the labels are repeated accordingly.

It returns a list of lists, where each inner list represents the labels associated with each token in the sentence, including those corresponding to the predicate.

In [None]:
tokenized_datasets = dataset.map(tok.tokenize_and_align_labels_context, batched=True)

In [None]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list))
# model = AutoModelForTokenClassification.from_pretrained('model_checkpoints/advanced', num_labels=len(labels_list))

In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"model_checkpoints/context",
    evaluation_strategy = 'epoch',
    learning_rate=2e-5,
    save_steps=7000,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tok.tokenizer)
metric = load_metric("seqeval")

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tok.tokenizer,
    compute_metrics=compute_metrics
)


In [None]:
print(tokenized_datasets['train'][0]['input_ids'])

In [None]:
trainer.train()

In [None]:
predictions_raw, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions_raw, axis=2)

list_predictions = [
    [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=list_predictions, references=true_labels)


In [None]:
val_word_ids = []
for sentence in dataset['test']['sentence']:
    val_word_ids.append(tok.tokenizer(sentence, truncation=True, is_split_into_words=True).word_ids())

df = pd.DataFrame(columns=['sentence', 'prediction', 'gold', 'word_ids'])
for tokens, prediction, gold, word_ids in zip(tokenized_datasets['test']['input_ids'], list_predictions, true_labels, val_word_ids):
    sentence = tok.tokenizer.decode(tokens)
    df.loc[len(df.index)] = [sentence, prediction, gold, word_ids]

gold_restored = []
pred_restored = []
for i, row in df.iterrows():
    sentence = row[0]
    orig_sentence = sentence.split('[SEP]')[0].split(' ')[1:]
    prediction = row[1]
    gold = row[2]
    word_ids = row[3][1:-1]
    gold_restored.append(shrink_predictions(word_ids, gold))
    pred_restored.append(shrink_predictions(word_ids, prediction))

df['gold_restored'] = gold_restored
df['pred_restored'] = pred_restored
df.to_csv('data/output/advanced.csv')

In [None]:
from code_.evaluation import class_report_advanced
class_report_base('data/output/advanced.csv')

In [None]:
from code_.evaluation import class_report_advanced
class_report_base('data/output/advanced.csv')

In [None]:
trainer.save_model()

Output exceeds the size limit. Open the full output data in a text editor
              precision    recall  f1-score   support

         'V'       1.00      1.00      1.00       221
         '_'       0.96      0.97      0.96      3727
       'C-V'       0.00      0.00      0.00         0
      'ARGA'       0.00      0.00      0.00         0
      'ARG3'       0.00      0.00      0.00         2
      'ARG2'       1.00      0.11      0.20        18
      'ARG5'       0.00      0.00      0.00         0
      'ARG0'       0.82      0.89      0.85       420
      'ARG4'       0.00      0.00      0.00         0
      'ARG1'       0.82      0.84      0.83       337
    'C-ARG4'       0.00      0.00      0.00         0
    'C-ARG2'       0.00      0.00      0.00         0
    'R-ARG2'       0.00      0.00      0.00         0
    'C-ARG0'       0.00      0.00      0.00         0
    'R-ARG0'       0.00      0.00      0.00         0
    'C-ARG1'       0.00      0.00      0.00         0
    'C-ARG3'       0.00      0.00      0.00         0
    'R-ARG3'       0.00      0.00      0.00         0
    'R-ARG1'       0.00      0.00      0.00         0
  'ARGM-EXT'       0.60      0.60      0.60         5
  'ARGM-NEG'       0.50      1.00      0.67         4
  'ARGM-DIS'       0.74      0.60      0.67        91
  'ARGM-PRD'       0.00      0.00      0.00         2
...
   micro avg       0.93      0.93      0.93      4979
   macro avg       0.20      0.18      0.17      4979
weighted avg       0.93      0.93      0.93      4979

#Results of Advanced Model
The advanced model has a macro average of 0.23 for precision, which means that when each class is treated equally, this value gives the model's overall performance of precision. This macro average is 0.21 for recall, and 0.21 for f1. All of these values are lower then the macro averages of the baseline model. 

The micro average is the same as the baseline model, which means that when the classes are given the same weight, the model the precision, recall and f1-score all have a value of 0.95

The most common arguments in the dataset, and most common arguments in SRL are Arg0 and Arg1. Arg0 has a high precision, recall and f1-score (0.78, 0.82, 0.80), same for Arg1 (0.72, 0.81, 0.77). There are many instances that the model is trained on which have these Arg0 and Arg1 arguments in the sentence. Therefore, it was expected that these values would be higher then other argument-values. 

Arg2 represents the instrument used in SRL. This is also a very important argument that needs to be loooked into. The precision, recall and f1-score for this ARG2 are: (0.63, 0.58, 0.60). 

For example in this sentence: From the ap comes this story : [SEP] ap comes this [SEP], the model predicts 'AP' as Arg0 when in fact, the gold states this as 'Arg2'. But in the sentence: [CLS] it seems clear to me that the manhunt for high baath officials in the sunni heartland is being done wrong, or at least in ways that are bad for us standing with local iraqis. and 'seems' is the predicate, the model predicts that 'clear' and 'me' are both ARG2. However, gold labels say that 'clear' is ARG1 and 'me' is ARG2. 

However, when the predicate is 'clear', the model predicts that 'me' is ARG2, when it is in fact not an argument (as can be seen in the gold data). When the predicate is 'bad', ARG2 is 'standing'. But again the model predicts that there are 2 ARG2's, which are 'US' and 'standing'

In the sentence: this operation would only consolidate the terrorist acts in the world and would not bring peace to the region, "" the message claimed., where 'bring' is the predicate, the model correctly labels 'region' as ARG2. This could explain that the model learns when a 'to' is presented after the predicate, the following token (but excluding 'the'), is labeled as ARG2. Here, it seems as if the model also learns that only the headword can be labeled as an argument, not 'the'.  

The sentence: it's not quite as freewheeling an environment as you'd imagine : sergey brin has actually created a mathematical'proof'that the company's self - driven research strategy, which gives employees one day a week to do research projects on their own, is a good, respectable idea. The predicate is "'s". The first few tokens are labeled correctly (It, ', s, not) = ['ARG1', 'V', 'V', 'ARGM-NEG']. The next token 'quite' is then labeled as ARG2, as well as 'freewheeling', 'an', and 'environment'. Again, thre are multiple ARG2's predicted in the sentence when in fact the correct ARG2 is 'you'. This is interesting, since the model might be trained on sentences with this 'as ... as' structure, and learned that when an 'as' is presented, the following token should be an ARG2 (which is the case for 'you'). However, it could be that it detects the first 'as' and directly labels the next token as ARG2, not looking further into the sentence as well as the token 'as' itself. Moreover, here, it does not detect the headword of a constituent, which is a good indication of which token needs to be labeled, and which one does not. This could also be further implemented in future work.

In the sentence: the food is mediocre at best. --> the predicate is 'is' and the sentence consists of 3 ARG2's. The model predicts these 3 ARG2's correctly. However, the gold gives ARGM-ADV to 'best', the model does not detect this. 

When we look at ARG1, we see that many times, this ARG1 is right before the predicate and is correctly labeled. In the sentence: 4826: and they seem to be posted at fairly regular intervals? The ARG1 in this sentence is correctly labeled to be 'they'. 


# Comparing the models
The overall results of the baseline model evidently show its limitations also on the practical level. Although the model presents a high micro-averaged F1-score of 0.95, indicating that the model performs relatively well across all classes - when considering each prediction equally - this is quickly controverted by its very low macro average scores. Macro-average calculates the metric independently for each class and then takes the average across all classes, treating each class equally regardless of its size or frequency. A quick inspection of individual classes makes it clear that the high aggragated micro scores are mostly the product of the a few higly represented classes, including both the predicate class and the placeholder label. 

There is a considerable variation in precision, recall, and F1-score across labels as some classes have high precision and recall, while others have much lower scores including 0. The model seems to be performing well on frequent and well represented labels while the very infrequent ones do not get at all predicted. Classes like 'ARGM-NEG' (Negative Modifier) and 'ARGM-MOD' (Modal Modifier) have a big range in performance, with F1-scores ranging from 0.33 to 0.91, although they are not the most represented ones. On the other hand, 'ARG1' and 'ARG0' are more frequent and show promising but clearly lower results.

While this approach provides a straightforward starting point, it presents notable limitations. One critical drawback is the model's inability to grasp contextual nuances, limiting its capacity to capture the intricate relationships between predicates and their arguments accurately. Due to the lack of contextual understanding, the model may struggle to disambiguate between different senses of predicates and may misclassify arguments. Moreover, relying on simplistic representations limits the model's generalization capabilities across diverse texts and domains, impacting its performance on unseen data. This can also be proven by the extremely low results on classes that are presented to the model in very low frequency.

## Future Work & Discussion

The similarity in performance between the baseline and advanced models suggests that the baseline approach of representing predicates solely by themselves can be surprisingly effective for well represented and common argument labels. This indicates that for certain datasets, simple representations may suffice to capture the necessary information for Semantic Role Labeling (SRL). However, the overall experiment underscores the importance of predicate representation in SRL tasks. While more advanced representations may offer additional contextual information, they do not always guarantee significant improvements over simpler approaches. The fact that the advanced model did not outperform the baseline model indicates that the additional contextual information provided by the previous and following tokens may not be as influential or as sufficient as thought.


Nevertheless, the difference in performance among some classes - as discussed above - between the two models presents extreme interest and should be further explored. What is more,  future work could explore incorporating broader contextual cues, such as expanding on the previous and next tokens representations with including more tokens. In addition to contextual information, exploring the role of syntactic structures, such as dependency parsing or constituency parsing, in predicate representation could offer valuable insights into improving SRL model performance. For example,  detecting the headword of a constituent, could be a good indication of which token needs to be labeled, and which one does not.

Models in: 
https://drive.google.com/drive/folders/1Vi92r9aPJ7laShdYxuFSEYCRyp7K54Qv?usp=share_link 