# Semantic Role Labelling (SRL) with BERT
##### by Andrei, Sezen and Emma

SRL is the task of identifying and classifying the predicate-argument structure of a text, ultimately answering the question of who did what to whom, where, when, and so on. It comprises four main sub-tasks: predicate detection, predicate sense disambiguation, argument identification, and argument classification.
In this notebook, we will focus solely on argument identification and classification.

To carry out the task, we will use the English-language Universal Proposition Banks dataset. The dataset is structured as a CoNLL-U file containing many sentences. In the file, each sentence is first represented by a document and sentence id and then as raw text.
Each sentence is then broken down into tokens, with each line describing a single token and its syntactic features. The fields are separated by tabs. If a sentence contains predicates, the next field holds the labelled (disambiguated) predicates of the sentence, while the fields after it, one per predicate, mark their corresponding arguments.
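
The layout described above can be illustrated with a minimal parsing sketch. The sample sentence, its annotations, and the column positions (ten CoNLL-U fields, then the predicate field, then one argument field per predicate) are assumptions made for illustration; `parse_sentence` is a hypothetical helper, not the notebook's `process_file`.

```python
# Toy sentence in the layout described above (hand-written, not from the dataset).
# Assumed columns: 10 CoNLL-U fields, the predicate field, one argument field per predicate.
SAMPLE = "\n".join([
    "# sent_id = toy-0001",
    "# text = John cooked dinner",
    "1\tJohn\tJohn\tPROPN\tNNP\t_\t2\tnsubj\t_\t_\t_\tARG0",
    "2\tcooked\tcook\tVERB\tVBD\t_\t0\troot\t_\t_\tcook.01\tV",
    "3\tdinner\tdinner\tNOUN\tNN\t_\t2\tobj\t_\t_\t_\tARG1",
])


def parse_sentence(block):
    """Split one sentence block into comment lines and tab-separated token rows."""
    comments, rows = [], []
    for line in block.splitlines():
        if line.startswith("#"):
            comments.append(line)
        elif line.strip():
            rows.append(line.split("\t"))
    return comments, rows


comments, rows = parse_sentence(SAMPLE)
tokens = [row[1] for row in rows]
# Rows whose predicate field is not "_" carry a disambiguated predicate.
predicates = [(i, row[10]) for i, row in enumerate(rows) if row[10] != "_"]
# Argument labels relative to the first (here: only) predicate.
first_pred_labels = [row[11] for row in rows]

print(tokens, predicates, first_pred_labels)
```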

The disambiguated predicates in the gold data serve as the basis for argument extraction and classification, and both are performed in a single step.



### Required Imports

In [28]:
%reload_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np

from code_.process_conll import process_file, advanced_process_file, extract_features
from code_.evaluation import class_report
from code_.bert import Tokenizer, convert_to_dataset, compute_metrics, get_labels_list_from_dataset, task, batch_size, model_checkpoint
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
from datasets import load_metric

### The Approach
The argument identification and classification experiment will be carried out with two different approaches, both employing the BERT model.

Since the predicates are given in this experiment, the main challenge of the task is to find a way of
representing the relation between each sentence and the predicate under consideration.

In the original PropBank dataset, sentences can host any number of predicates (including none). When a sentence has two or more predicates, its tokens can fill different argument roles with respect to each predicate. Consequently, for the model to predict and assign the argument labels of a sentence correctly, we first need to establish, each time, the predicate under consideration. For example, in the sentence:

"John cooked dinner and served dessert to his guests."

for both predicates "cooked" and "served", the subject (ARG0) is "John", but the object (ARG1) is "dinner" for "cooked" and "dessert" for "served".

Thus, our focus lies in finding an effective input representation that allows the model to classify the arguments within a sentence relative to a specific predicate. The first step is to duplicate each sentence once per predicate, so that every copy in the dataset is paired with a unique predicate.
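
The duplication step can be sketched as follows. This is a simplified illustration, not the notebook's `process_file`: `expand_by_predicate` is a hypothetical helper, and the label columns are hand-written for the example sentence rather than taken from the dataset.

```python
def expand_by_predicate(tokens, predicate_indices, label_columns):
    """Return one instance per predicate: same sentence, different labels."""
    instances = []
    for pred_idx, labels in zip(predicate_indices, label_columns):
        instances.append({
            "sentence": tokens,
            "predicate": tokens[pred_idx],
            "labels": labels,
        })
    return instances


tokens = ["John", "cooked", "dinner", "and", "served",
          "dessert", "to", "his", "guests", "."]
# One label column per predicate (illustrative annotations only).
label_columns = [
    ["ARG0", "V", "ARG1", "_", "_", "_", "_", "_", "_", "_"],
    ["ARG0", "_", "_", "_", "V", "ARG1", "_", "_", "ARG2", "_"],
]
instances = expand_by_predicate(tokens, [1, 4], label_columns)

for inst in instances:
    print(inst["predicate"], inst["labels"])
```

Both instances share the same token list, but "dinner" is ARG1 only in the copy paired with "cooked", and "dessert" only in the copy paired with "served".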

Next, the process involves structuring the input data in a manner that enables the model to understand the relationships between tokens and the possible argument labels within the context of a specific predicate.

To experiment on the basis of the input representation, we will build both a baseline and an advanced model. In both cases the relation of each sentence and a specific predicate will be encoded into the model's input. 

### The Model(s) & Loading the Datasets

#### The Baseline

The function of the baseline model is to offer a straightforward approach to the task, serving as a reference point for evaluating the effectiveness of the more advanced model. As such, the baseline uses a simpler input representation inspired by Shi and Lin (2019):

[CLS] Barack Obama went to Paris [SEP] went [SEP] ,

in which the predicate information, which signals the predicate in question, consists only of the predicate token itself. The [CLS] token typically represents the aggregate understanding of the entire sentence, serving as a sentence-level representation and capturing the contextual information necessary for semantic role labeling.
The [SEP] token demarcates the boundaries between different segments of the input. In this representation, it separates the sentence from the predicate.
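
The shape of this input can be sketched at the token level. `build_baseline_input` is a hypothetical helper written for illustration; how `bert.py` actually assembles the input is not shown here. In practice, a BERT-style tokenizer called with a pair of sequences typically inserts these special tokens itself.

```python
def build_baseline_input(tokens, predicate):
    """Assemble [CLS] sentence [SEP] predicate [SEP] as a list of tokens."""
    return ["[CLS]"] + list(tokens) + ["[SEP]", predicate, "[SEP]"]


baseline_input = build_baseline_input(
    ["Barack", "Obama", "went", "to", "Paris"], "went"
)
print(" ".join(baseline_input))
```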

This approach has limitations. The most obvious one is a sentence that contains the same predicate token twice. In that case, using only the predicate token as the indicator does not allow the model to distinguish between the two occurrences.

In [4]:
df_val = process_file('data/raw/en_ewt-up-dev.conllu')
df_train = process_file('data/raw/en_ewt-up-train.conllu')
df_test = process_file('data/raw/en_ewt-up-test.conllu')

process_file(): dataframe len: 4979
process_file(): dataframe len: 40498
process_file(): dataframe len: 4802


For the baseline model, the datasets are transformed into DataFrames using the imported process_file function. The function splits the text into individual sentences on empty lines (\n\n). For the current sentence, it then retrieves the predicate indices and columns and extracts the token labels relative to each predicate. Finally, it retrieves the corresponding word for each predicate and adds a row to the DataFrame, so that each row pairs a sentence with one of its predicates.

More specifically, the first column of the DataFrame hosts the tokenized sentence. The second column hosts one predicate per sentence each time. The third column gives the length of the argument columns attached to each predicate while the last one hosts the labelled arguments for each predicate mapped to the sentence tokens.

In [5]:
df_test

Unnamed: 0,sentence,predicate,pred columns,labels
0,"[What, if, Google, Morphed, Into, GoogleOS, ?]",Morphed,11,"_, _, ARG1, V, _, ARG2, _"
1,"[What, if, Google, expanded, on, its, search, ...",expanded,11,"_, _, ARG0, V, _, _, _, _, _, _, _, _, _, _, A..."
2,"[(, And, ,, by, the, way, ,, is, anybody, else...",way,11,"_, _, _, _, _, V, _, _, _, _, _, _, _, _, _, _..."
3,"[(, And, ,, by, the, way, ,, is, anybody, else...",is,12,"_, _, _, _, _, ARGM-DIS, _, V, ARG1, _, _, _, ..."
4,"[(, And, ,, by, the, way, ,, is, anybody, else...",was,13,"_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _..."
...,...,...,...,...
4797,"[He, listens, and, is, excellent, in, diagnosi...",explaining,16,"ARG0, _, _, _, _, _, _, _, _, _, V, _, _, ARG1..."
4798,"[He, listens, and, is, excellent, in, diagnosi...",issues,17,"_, _, _, _, _, _, _, _, _, _, _, _, ARGM-ADJ, ..."
4799,"[He, listens, and, is, excellent, in, diagnosi...",suggesting,18,"ARG0, _, _, _, _, _, _, _, _, _, _, _, _, _, _..."
4800,"[He, listens, and, is, excellent, in, diagnosi...",exercises,19,"_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _..."


The convert_to_dataset function transforms data from pandas DataFrames (train, val, and test) into a DatasetDict. The function streamlines data preparation for machine learning tasks, facilitating easier management and processing of data.

The labels_list contains all of the target labels used to train and test the model.

The dataset variable now contains the transformed train, evaluation, and test datasets.

In [6]:
dataset = convert_to_dataset(df_train, df_val, df_test)

In [8]:
labels_list = get_labels_list_from_dataset(dataset)
print(sorted(labels_list, key=lambda x: len(x)))

['V', '_', 'C-V', 'ARG2', 'ARG5', 'ARG0', 'ARG1', 'ARGA', 'ARG3', 'ARG4', 'R-ARG2', 'R-ARG4', 'R-ARG1', 'C-ARG1', 'C-ARG4', 'C-ARG2', 'C-ARG0', 'R-ARG0', 'C-ARG3', 'R-ARG3', 'ARGM-ADV', 'ARGM-LOC', 'ARGM-TMP', 'ARGM-CXN', 'ARGM-REC', 'ARGM-LVB', 'ARGM-ADJ', 'ARGM-COM', 'ARGM-PRP', 'ARGM-PRR', 'ARGM-CAU', 'ARGM-NEG', 'ARGM-PRD', 'ARGM-EXT', 'ARGM-DIR', 'ARGM-GOL', 'ARGM-DIS', 'ARGM-MOD', 'ARGM-MNR', 'ARG1-DSP', 'C-ARGM-EXT', 'R-ARGM-ADJ', 'R-ARGM-ADV', 'R-ARGM-LOC', 'C-ARG1-DSP', 'C-ARGM-ADV', 'C-ARGM-TMP', 'C-ARGM-LOC', 'R-ARGM-DIR', 'R-ARGM-CAU', 'C-ARGM-CXN', 'C-ARGM-PRR', 'R-ARGM-COM', 'R-ARGM-TMP', 'C-ARGM-PRP', 'C-ARGM-COM', 'C-ARGM-DIR', 'R-ARGM-MNR', 'C-ARGM-MNR', 'R-ARGM-GOL', 'C-ARGM-GOL']


A Transformers Tokenizer is used to tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs that the model requires. The functions of the Tokenizer can be found in bert.py, which is in the code_ package.

By making use of the Tokenizer and target label list, the tokenize_and_align_labels_pred function preprocesses the data for a sequence labeling task by tokenizing sentences, aligning labels with tokens, and preparing them for model training or evaluation. 

More specifically, the function tokenizes sentences and predicates using the provided tokenizer and aligns the tokens from the sentence with the predicate tokens. It then iterates through each tokenized word in the sentence, checking whether it matches any label and assigning the corresponding label index. If there is no match, it assigns the index of a placeholder label ('_'). If a predicate occurs multiple times in the sentence, the labels are repeated accordingly.

It returns a list of lists, where each inner list represents the labels associated with each token in the sentence, including those corresponding to the predicate.
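
The alignment idea can be sketched in isolation. This is a simplified illustration and not the code in bert.py: it covers only the sentence segment, repeats a word's label on each of its sub-word pieces (one common choice), and assumes a `word_ids` list of the kind a fast Hugging Face tokenizer returns, where `None` marks special tokens. `align_labels` and the toy label map are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels, label2id, placeholder="_"):
    """Map word-level labels onto sub-word positions."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP].
            aligned.append(IGNORE_INDEX)
        else:
            label = word_labels[word_id]
            # Unknown labels fall back to the placeholder label.
            aligned.append(label2id.get(label, label2id[placeholder]))
    return aligned


label2id = {"_": 0, "V": 1, "ARG0": 2, "ARG4": 3}
word_labels = ["ARG0", "V", "_", "ARG4"]   # Obama went to Paris
# Assume "Obama" was split into two sub-word pieces (both map to word 0).
word_ids = [None, 0, 0, 1, 2, 3, None]

aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)
```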

In [9]:
tok = Tokenizer(model_checkpoint, labels_list)

In [10]:
tokenized_datasets = dataset.map(tok.tokenize_and_align_labels_pred, batched=True)

Map:   0%|          | 0/40498 [00:00<?, ? examples/s]

Map:   0%|          | 0/4979 [00:00<?, ? examples/s]

Map:   0%|          | 0/4802 [00:00<?, ? examples/s]

As the following output shows, tokenized_datasets consists of the training set, the validation set, and the test set. The features in these datasets are: sentence, predicate, pred columns, labels, input_ids, and attention_mask (plus the index column __index_level_0__).

The input_ids have the following format: [101, 2632, 1011, 23564, 2386, 1024, 2137, 2749, 2730, 21146, 28209, 14093, 2632, 1011, 2019, 2072, 1010, 1996, 14512, 2012, 1996, 8806, 1999, 1996, 2237, 1997, 1053, 4886, 2213, 1010, 2379, 1996, 9042, 3675, 1012, 102, 2730, 102]. All tokens are split into sub-word pieces and mapped to these ids. A [CLS] token (input_id = 101) and [SEP] tokens (input_id = 102) are added to these input_ids. We have made sure that the input_ids, the labels and the attention_mask of each sentence all have the same length, which is necessary for training the model.
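
A length check of this kind can be sketched as below. `lengths_match` is a hypothetical helper, and the example row is a shortened, hand-made stand-in for a tokenized instance, not an actual row of the dataset.

```python
def lengths_match(example, keys=("input_ids", "labels", "attention_mask")):
    """True if all the given per-token fields have the same length."""
    return len({len(example[key]) for key in keys}) == 1


# Shortened toy instance: [CLS] x x [SEP] x [SEP]
toy_example = {
    "input_ids": [101, 2749, 2730, 102, 2730, 102],
    "labels": [-100, 5, 0, -100, 0, -100],
    "attention_mask": [1, 1, 1, 1, 1, 1],
}
broken_example = dict(toy_example, labels=[-100, 5, 0])

print(lengths_match(toy_example), lengths_match(broken_example))
```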

In [12]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['sentence', 'predicate', 'pred columns', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 40498
    })
    validation: Dataset({
        features: ['sentence', 'predicate', 'pred columns', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 4979
    })
    test: Dataset({
        features: ['sentence', 'predicate', 'pred columns', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 4802
    })
})


By using the dataset.map() method, we apply the tokenize_and_align_labels_pred function to each split (train, dev and test) of the dataset.

With the batched parameter set to True, the function is applied to batches of examples rather than to one example at a time, which improves efficiency.
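
What a batched function receives and returns can be illustrated with a small stand-in. `toy_map` below is a simplified imitation written for this notebook, not the `datasets` implementation: the point is only that a batched function sees a dict mapping each column name to a list of values and returns new columns as lists of the same length.

```python
def add_length(batch):
    """Batched function: one output value per example in the batch."""
    return {"n_tokens": [len(sentence) for sentence in batch["sentence"]]}


def toy_map(rows, fn, batch_size=2):
    """Apply fn to chunks of rows, mimicking a batched map (simplified)."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_cols.items()}})
    return out


rows = [
    {"sentence": ["What", "if"]},
    {"sentence": ["Google"]},
    {"sentence": ["He", "listens", "and"]},
]
mapped = toy_map(rows, add_length)
print([row["n_tokens"] for row in mapped])
```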

In [None]:
tokenized_datasets = dataset.map(tok.tokenize_and_align_labels_pred, batched=True)

Map:   0%|          | 0/4110 [00:00<?, ? examples/s]

Map:   0%|          | 0/4979 [00:00<?, ? examples/s]

Map:   0%|          | 0/4802 [00:00<?, ? examples/s]

The next cell saves the tokenized training data so that we can check whether it has the correct format. When augmented_train_tokenized.csv is opened, we can see that the padding is added correctly, as are the special tokens [CLS] and [SEP].

In [16]:
tokenizer = tok.tokenizer
training_data = tokenized_datasets['train']
new_training_data = []
for sent in training_data:
    tokens = tokenizer.convert_ids_to_tokens(sent['input_ids'])
    new_training_data.append(tokens)
df_train_tokenized = pd.DataFrame(new_training_data)
df_train_tokenized.to_csv('augmented_train_tokenized.csv', index=False)

In [11]:
# initialise model
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This next cell sets up the arguments used to train the model. The model is saved in model_checkpoints/pred. Although the 'seqeval' metric is loaded, it is not used to compute the evaluation of the model. The data_collator handles the processing of the data into batches during training. The train_dataset is the whole of tokenized_datasets["train"], and the evaluation dataset is the whole of tokenized_datasets["validation"]. However, this dataset is not used here, since the evaluation itself is performed later in this notebook. The number of epochs is set to 1, which may lead to underfitting, since a single pass over the data may not be enough for the model to pick up all relevant patterns.

In [14]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"model_checkpoints/pred",
    evaluation_strategy = 'epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tok.tokenizer)
metric = load_metric("seqeval")

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tok.tokenizer,
    compute_metrics=compute_metrics
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [19]:
#trainer.train()

In [None]:
trainer.evaluate()

The next cell predicts labels for the validation set using trainer.predict.
These predictions are then collected in a list (list_predictions), as are the gold labels (true_labels). Positions labelled -100 are left out, since they mark special and padding tokens (padding is used to make all sentences the same length). These positions must be removed before evaluation.

In [None]:
predictions_raw, labels, _ = trainer.predict(tokenized_datasets["validation"])

predictions = np.argmax(predictions_raw, axis=2)

list_predictions = [
    [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
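
The same decoding logic can be traced on a toy example in plain Python. The scores, gold labels and label names below are made up for illustration; `decode` is a hypothetical helper that mirrors the argmax-and-filter steps of the cell above without NumPy.

```python
IGNORE_INDEX = -100
toy_labels = ["_", "V", "ARG0", "ARG1"]

# One sentence with four positions: [CLS], two real tokens, [SEP].
toy_scores = [[
    [0.9, 0.0, 0.1, 0.0],   # [CLS], ignored
    [0.1, 0.2, 0.6, 0.1],   # highest score at index 2 (ARG0)
    [0.0, 0.8, 0.1, 0.1],   # highest score at index 1 (V)
    [0.7, 0.1, 0.1, 0.1],   # [SEP], ignored
]]
toy_gold = [[IGNORE_INDEX, 2, 1, IGNORE_INDEX]]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode(score_batch, gold_batch, label_names):
    """Pick the best label per position and drop ignored positions."""
    preds, golds = [], []
    for scores, gold in zip(score_batch, gold_batch):
        kept = [(argmax(s), g) for s, g in zip(scores, gold) if g != IGNORE_INDEX]
        preds.append([label_names[p] for p, _ in kept])
        golds.append([label_names[g] for _, g in kept])
    return preds, golds


toy_predictions, toy_true = decode(toy_scores, toy_gold, toy_labels)
print(toy_predictions, toy_true)
```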

In [None]:

df = pd.DataFrame(columns=['sentence', 'prediction', 'gold'])
for tokens, prediction, gold in zip(tokenized_datasets['validation']['input_ids'], list_predictions, true_labels):
    sentence = tok.tokenizer.decode(tokens)
    df.loc[len(df.index)] = [sentence, prediction, gold]
df.to_csv('base.csv')

In [32]:
baseline_classification_report = class_report('data/output/base.csv')#, labels_list)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         'V'       0.96      0.98      0.97       221
         '_'       0.97      0.97      0.97      3727
       'C-V'       0.00      0.00      0.00         0
      'ARGA'       0.00      0.00      0.00         0
      'ARG3'       0.00      0.00      0.00         2
      'ARG2'       0.44      0.22      0.30        18
      'ARG5'       0.00      0.00      0.00         0
      'ARG0'       0.83      0.88      0.85       420
      'ARG4'       0.00      0.00      0.00         0
      'ARG1'       0.80      0.91      0.85       337
    'C-ARG4'       0.00      0.00      0.00         0
    'C-ARG2'       0.00      0.00      0.00         0
    'R-ARG2'       0.00      0.00      0.00         0
    'C-ARG0'       0.00      0.00      0.00         0
    'R-ARG0'       0.00      0.00      0.00         0
    'C-ARG1'       0.00      0.00      0.00         0
    'C-ARG3'       0.00      0.00      0.00         0
    'R-ARG3'       0.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Results of the baseline model.
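
To make the columns of the report concrete, the sketch below computes per-label precision, recall, F1 and support from two flat label lists. It is a from-scratch illustration under our own assumptions, not the implementation behind class_report; the gold and predicted labels are invented. A label that never occurs in the gold data gets support 0 and scores of 0.0, which is the situation the warnings above appear to refer to.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Per-label precision, recall, F1 and support for flat label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": n_gold}
    return scores


toy_gold_labels = ["ARG0", "V", "ARG1", "_", "ARG1"]
toy_pred_labels = ["ARG0", "V", "ARG1", "_", "ARG2"]
report = per_label_scores(toy_gold_labels, toy_pred_labels)

for label, row in report.items():
    print(label, row)
```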

##### Limitations of the baseline model
Supplying only the predicate token can already be enough for the model to find the semantic roles. Giving additional context, however, can help in detecting them. When the focus is placed on the words surrounding the predicate, the model is more likely to find ARG0 and ARG1, since they typically occur right before and after the predicate.

Moreover, auxiliary verbs can either be part of a longer predicate or be a verb indicating an action in their own right. For example, in "Barack Obama went to Paris", 'went' states that Obama performed an action, namely going to Paris. However, when this sentence is made longer ("Barack Obama went to Paris and visited the Eiffel Tower."), the verb 'went' is part of the predicate that expresses the condition of Barack.

#### Advanced Model

For the advanced model, we chose to implement a version of our baseline that represents the relation between predicate and sentence in a more sophisticated way, which will hopefully tackle the repeated-predicate limitation of the baseline.

More specifically, instead of only representing the predicate with its corresponding token, we tried to distinguish it even further by employing its context: 

[CLS] Barack Obama went to Paris [SEP] Obama,went,to [SEP].

The previous and next tokens, together with the predicate itself, are used to represent the predicate and to differentiate it through its context. The files for this advanced model are preprocessed with advanced_process_file in process_conll.py.
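
The context window can be sketched as follows. `build_context_input` is a hypothetical helper, not the code in advanced_process_file; in particular, dropping a missing neighbour when the predicate is the first or last token is our assumption. The second example is an invented sentence showing how two occurrences of the same predicate token receive different representations.

```python
def build_context_input(tokens, pred_idx):
    """Assemble [CLS] sentence [SEP] prev,pred,next [SEP] as a token list."""
    start = max(pred_idx - 1, 0)
    # A neighbour that does not exist (sentence edge) is simply left out.
    window = tokens[start:pred_idx + 2]
    return ["[CLS]"] + list(tokens) + ["[SEP]", ",".join(window), "[SEP]"]


context_input = build_context_input(
    ["Barack", "Obama", "went", "to", "Paris"], 2
)
print(" ".join(context_input))

# Same predicate token twice: the context segment tells them apart.
repeated = ["She", "said", "he", "said", "no"]
first = build_context_input(repeated, 1)
second = build_context_input(repeated, 3)
print(first[-2], "|", second[-2])
```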

In [33]:
df_val = advanced_process_file('data/raw/en_ewt-up-dev.conllu')
df_train = advanced_process_file('data/raw/en_ewt-up-train.conllu')
df_test = advanced_process_file('data/raw/en_ewt-up-test.conllu')

advanced_process_file(): dataframe len: 4979
advanced_process_file(): dataframe len: 40498
advanced_process_file(): dataframe len: 4802


In [39]:
dataset = convert_to_dataset(df_train, df_val, df_test)

By again making use of the Tokenizer and target label list, the tokenize_and_align_labels_context function preprocesses the data for a sequence labeling task by tokenizing sentences, aligning labels with tokens, and preparing them for model training or evaluation.

It follows the same steps as tokenize_and_align_labels_pred described for the baseline, and likewise returns a list of lists, where each inner list represents the labels associated with each token in the sentence, including those corresponding to the predicate.

In [42]:
tokenized_datasets = dataset.map(tok.tokenize_and_align_labels_context, batched=True)


Map:   0%|          | 0/40498 [00:00<?, ? examples/s]

TypeError: PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]

In [23]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"model_checkpoints/context",
    evaluation_strategy = 'epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    push_to_hub=False,
)

data_collator = DataCollatorForTokenClassification(tok.tokenizer)
metric = load_metric("seqeval")

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tok.tokenizer,
    compute_metrics=compute_metrics
)


In [27]:
print(tokenized_datasets['train'][0]['input_ids'])

[101, 2632, 1011, 23564, 2386, 1024, 2137, 2749, 2730, 21146, 28209, 14093, 2632, 1011, 2019, 2072, 1010, 1996, 14512, 2012, 1996, 8806, 1999, 1996, 2237, 1997, 1053, 4886, 2213, 1010, 2379, 1996, 9042, 3675, 1012, 102]


In [43]:
#trainer.train()

In [44]:
#trainer.evaluate()

In [45]:
#predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
#predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
list_predictions = [
    [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=list_predictions, references=true_labels)


  0%|          | 0/156 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
advanced_classification_report = class_report('data/output/advanced.csv')#, labels_list)

In [30]:
predictions

array([[4, 4, 4, ..., 0, 0, 0],
       [4, 4, 4, ..., 0, 0, 0],
       [4, 4, 4, ..., 0, 0, 0],
       ...,
       [4, 4, 4, ..., 0, 0, 0],
       [4, 4, 4, ..., 0, 0, 0],
       [4, 4, 4, ..., 0, 0, 0]], dtype=int64)

'ARGM-LOC'