# Semantic role labeling with BERT

In this notebook, you'll perform semantic role labeling with BERT, using the Universal Propbank dataset.

## Before you begin

### Install libraries

Uncomment and run the following cells to install the required pip packages, for example when running the notebook in [Colab](https://colab.research.google.com).

In [1]:
#!pip install datasets
#!pip install seqeval
#!pip install accelerate==0.21.0
#!pip install transformers[torch]
#!pip install accelerate -U

In [2]:
#from google.colab import drive
#drive.mount('/content/drive')

### Import libraries

In [3]:
import time
import pandas as pd
import transformers
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset
from utils import read_data_as_sentence, map_labels_in_dataframe, tokenize_and_align_labels, get_label_mapping, get_labels_from_map, load_srl_model, load_dataset, compute_metrics, write_predictions_to_csv, compute_evaluation_metrics_from_csv

## Step 1: Preprocess data

Before you can train the model, you need to extract sentences from the training, development and test datasets, and preprocess the sentences.

To preprocess a dataset and save the resulting DataFrame to a file, call the `read_data_as_sentence()` function with the following arguments:

| Parameter name     | Required | Parameter description |
|--------------------|:--------------:|-------------|
| *positional 1*                   | ✅ | The filepath for the CoNLL-U dataset. |
| *positional 2*                 | ✅ | The filepath to write the preprocessed DataFrame to. |

In [4]:
train_data = read_data_as_sentence('data/en_ewt-up-train.conllu', 'data/en_ewt-up-train.preprocessed.csv')
dev_data = read_data_as_sentence('data/en_ewt-up-dev.conllu', 'data/en_ewt-up-dev.preprocessed.csv')
test_data = read_data_as_sentence('data/en_ewt-up-test.conllu', 'data/en_ewt-up-test.preprocessed.csv')

The `read_data_as_sentence()` function returns DataFrames, where each row represents a sentence from the dataset passed to the function. Each sentence has been expanded based on its predicates, resulting in multiple copies of the same sentence, each focused on a different predicate.

The DataFrame has two columns:

- `input_form`: a list of strings, where each string represents a word in the sentence, followed by two additional elements:
    1. A special token (`[SEP]`), which denotes the separation between the words of the sentence and the predicate form.
    2. The predicate form, which corresponds to the `argument` values for the same row in the DataFrame.
- `argument`: a list of strings, representing the arguments associated with each word in the sentence. The length of each list is equal to the number of words in the sentence, plus two additional elements, for the special token and predicate form. The arguments match the predicate appended to the `input_form` for the same row in the DataFrame.
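
The expansion described above can be sketched as follows. This is a minimal illustration, not the actual implementation of `read_data_as_sentence()`: the helper name `expand_by_predicate` and its input layout (one list of argument labels per predicate) are assumptions made for the example.

```python
def expand_by_predicate(words, predicates):
    """Build one (input_form, argument) row per predicate of a sentence.

    words: list of word forms in the sentence.
    predicates: list of (predicate_form, labels) pairs, where labels holds
        one argument label per word ('_' for words that are not arguments).
    Hypothetical helper: the name and the input layout are assumptions.
    """
    rows = []
    for predicate_form, labels in predicates:
        if len(labels) != len(words):
            raise ValueError("expected one argument label per word")
        rows.append({
            # Sentence words, then the separator, then the predicate form.
            "input_form": words + ["[SEP]", predicate_form],
            # The two appended elements carry no argument label.
            "argument": labels + [None, None],
        })
    return rows


toy_words = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"]
toy_predicates = [("Morphed", ["_", "_", "ARG1", "_", "_", "ARG2", "_"])]
toy_rows = expand_by_predicate(toy_words, toy_predicates)
print(toy_rows[0]["input_form"])
print(toy_rows[0]["argument"])
```

A sentence with three predicates would produce three rows that share the same words but differ in the appended predicate form and in the argument labels.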

### Explore the DataFrame

To explore the DataFrame, print a summary of the preprocessed DataFrame with `info()`:

In [5]:
print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4799 entries, 0 to 4798
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   input_form  4799 non-null   object
 1   argument    4799 non-null   object
dtypes: object(2)
memory usage: 75.1+ KB
None


The **Non-Null** count for both columns should match, indicating there are as many lists of `input_form` values as there are lists of `argument` values.

Next, print the words and their argument labels for the first 20 sentences of the test dataset:

In [6]:
for form, argument in zip(test_data.input_form[:20], test_data.argument[:20]):
    for f, a in zip(form, argument):
        if f == '[SEP]':
            print('-' * 40)
        print(f"form: {f:<15} argument: {a}")
    print('\n' + '=' * 40 + '\n')

form: What            argument: _
form: if              argument: _
form: Google          argument: ARG1
form: Morphed         argument: _
form: Into            argument: _
form: GoogleOS        argument: ARG2
form: ?               argument: _
----------------------------------------
form: [SEP]           argument: None
form: Morphed         argument: None


form: What            argument: _
form: if              argument: _
form: Google          argument: ARG0
form: expanded        argument: _
form: on              argument: _
form: its             argument: _
form: search          argument: _
form: -               argument: _
form: engine          argument: _
form: (               argument: _
form: and             argument: _
form: now             argument: _
form: e-mail          argument: _
form: )               argument: _
form: wares           argument: ARG1
form: into            argument: _
form: a               argument: _
form: full            argument: _
form: -              

## Step 2: Import the BERT model and tokenizer

Use Hugging Face's [`AutoTokenizer`](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoTokenizer) to construct a DistilBERT tokenizer, which is based on the WordPiece algorithm.

In [22]:
# Set the model ID to use
model_checkpoint = "distilbert-base-uncased"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Check the assertion that the tokenizer is an instance of transformers.PreTrainedTokenizerFast
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

Check the sentence representation of one example:

In [23]:
example = test_data['input_form'][1]
print(example)

['What', 'if', 'Google', 'expanded', 'on', 'its', 'search', '-', 'engine', '(', 'and', 'now', 'e-mail', ')', 'wares', 'into', 'a', 'full', '-', 'fledged', 'operating', 'system', '?', '[SEP]', 'expanded']


The sentence contains the `[SEP]` special token followed by the predicate. The tokenizer recognizes `[SEP]` as a special token and converts it to ID 102 instead of splitting it like an ordinary word. Setting `add_special_tokens` to `True` also adds `[CLS]` at the start and a final `[SEP]` at the end of the input. \
Because the sentence is already split into words, the parameter `is_split_into_words` is also set to `True`.

In [24]:
tokenizer(example,add_special_tokens=True,is_split_into_words=True)

{'input_ids': [101, 2054, 2065, 8224, 4423, 2006, 2049, 3945, 1011, 3194, 1006, 1998, 2085, 1041, 1011, 5653, 1007, 16283, 2015, 2046, 1037, 2440, 1011, 26712, 4082, 2291, 1029, 102, 4423, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [25]:
tokenized_input = tokenizer(example,add_special_tokens=True,is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', 'what', 'if', 'google', 'expanded', 'on', 'its', 'search', '-', 'engine', '(', 'and', 'now', 'e', '-', 'mail', ')', 'ware', '##s', 'into', 'a', 'full', '-', 'fledged', 'operating', 'system', '?', '[SEP]', 'expanded', '[SEP]']


## Tokenizing and preparing input for the model

Get a mapping from every argument label that occurs in any of the datasets to a numerical value with the `get_label_mapping` function.\
The `None` value stays `None`, so that it can be matched with the special token in the model input.

In [26]:
label_map = get_label_mapping(train_data, test_data, dev_data)

In [27]:
print(label_map)

{'_': 0, 'ARG0': 1, 'ARG1': 2, 'ARG1-DSP': 3, 'ARG2': 4, 'ARG3': 5, 'ARG4': 6, 'ARG5': 7, 'ARGA': 8, 'ARGM-ADJ': 9, 'ARGM-ADV': 10, 'ARGM-CAU': 11, 'ARGM-COM': 12, 'ARGM-CXN': 13, 'ARGM-DIR': 14, 'ARGM-DIS': 15, 'ARGM-EXT': 16, 'ARGM-GOL': 17, 'ARGM-LOC': 18, 'ARGM-LVB': 19, 'ARGM-MNR': 20, 'ARGM-MOD': 21, 'ARGM-NEG': 22, 'ARGM-PRD': 23, 'ARGM-PRP': 24, 'ARGM-PRR': 25, 'ARGM-REC': 26, 'ARGM-TMP': 27, 'C-ARG0': 28, 'C-ARG1': 29, 'C-ARG1-DSP': 30, 'C-ARG2': 31, 'C-ARG3': 32, 'C-ARG4': 33, 'C-ARGM-ADV': 34, 'C-ARGM-COM': 35, 'C-ARGM-CXN': 36, 'C-ARGM-DIR': 37, 'C-ARGM-EXT': 38, 'C-ARGM-GOL': 39, 'C-ARGM-LOC': 40, 'C-ARGM-MNR': 41, 'C-ARGM-PRP': 42, 'C-ARGM-PRR': 43, 'C-ARGM-TMP': 44, 'R-ARG0': 45, 'R-ARG1': 46, 'R-ARG2': 47, 'R-ARG3': 48, 'R-ARG4': 49, 'R-ARGM-ADJ': 50, 'R-ARGM-ADV': 51, 'R-ARGM-CAU': 52, 'R-ARGM-COM': 53, 'R-ARGM-DIR': 54, 'R-ARGM-GOL': 55, 'R-ARGM-LOC': 56, 'R-ARGM-MNR': 57, 'R-ARGM-TMP': 58, None: None}


Convert the labels in each DataFrame to numerical values for the language model with the `map_labels_in_dataframe` function. The function needs the `label_map` dictionary created above to look up the value for each argument.\
The function adds a new column, `mapped_labels`, that holds the label numbers. 0 stands for '_' (no argument), and the remaining arguments are numbered in alphabetical order. \
The *None* label corresponds to the *[SEP]* token and to the predicate form that follows it.
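
The mapping logic described here can be sketched in plain Python. The helpers `build_label_map` and `map_labels` are hypothetical stand-ins for `get_label_mapping` and `map_labels_in_dataframe`; the real functions in `utils` may be implemented differently, and the toy label lists are made up for the example.

```python
def build_label_map(*label_columns):
    """Map each argument label to an integer ID.

    Each column is an iterable of per-sentence label lists. '_' gets ID 0,
    the remaining labels are numbered in alphabetical order, and None maps
    to None. Hypothetical stand-in for get_label_mapping().
    """
    labels = set()
    for column in label_columns:
        for sentence_labels in column:
            labels.update(label for label in sentence_labels if label is not None)
    labels.discard("_")
    label_map = {"_": 0}
    for index, label in enumerate(sorted(labels), start=1):
        label_map[label] = index
    label_map[None] = None
    return label_map


def map_labels(sentence_labels, label_map):
    """Convert the labels of one sentence to their numeric IDs."""
    return [label_map[label] for label in sentence_labels]


# Toy label columns; the trailing None values belong to [SEP] and the predicate.
toy_train = [["_", "ARG1", "_", None, None]]
toy_test = [["ARG0", "_", "ARGM-TMP", None, None]]
toy_label_map = build_label_map(toy_train, toy_test)
print(toy_label_map)
print(map_labels(toy_test[0], toy_label_map))
```

Building the map from all datasets at once means a label that only appears in the development or test data still receives an ID.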


In [28]:
train_data = map_labels_in_dataframe(train_data,label_map)
dev_data = map_labels_in_dataframe(dev_data,label_map)
test_data = map_labels_in_dataframe(test_data,label_map)

Checking the head to confirm the labels were correctly converted:

In [29]:
test_data.head()

Unnamed: 0,input_form,argument,mapped_labels
0,"[What, if, Google, Morphed, Into, GoogleOS, ?,...","[_, _, ARG1, _, _, ARG2, _, None, None]","[0, 0, 2, 0, 0, 4, 0, None, None]"
1,"[What, if, Google, expanded, on, its, search, ...","[_, _, ARG0, _, _, _, _, _, _, _, _, _, _, _, ...","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ..."
2,"[(, And, ,, by, the, way, ,, is, anybody, else...","[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,"[(, And, ,, by, the, way, ,, is, anybody, else...","[_, _, _, _, _, ARGM-DIS, _, _, ARG1, _, _, _,...","[0, 0, 0, 0, 0, 15, 0, 0, 2, 0, 0, 0, 0, 4, 0,..."
4,"[(, And, ,, by, the, way, ,, is, anybody, else...","[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


Use the `tokenize_and_align_labels` function to tokenize the train, test, and dev DataFrames. Padding is applied so that all inputs have the same length for the model.

In [30]:
tokenized_test = tokenize_and_align_labels(tokenizer, test_data, label_all_tokens=True)
tokenized_train = tokenize_and_align_labels(tokenizer, train_data, label_all_tokens=True)
tokenized_dev = tokenize_and_align_labels(tokenizer, dev_data, label_all_tokens=True)

The input for the model has the corresponding special token [CLS] followed by the tokenized sentence, the special token [SEP], the predicate and the final [SEP] token. \
The numerical labels to be fed to the model correspond to the tokenized sentence.\
The input is padded so that every vector is of the same length, including the labels and the attention mask.
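
Because WordPiece can split one word into several tokens, the word-level labels have to be spread over the tokens. The sketch below shows one common way to do this. It is not the code of `tokenize_and_align_labels`: the helper name `align_labels`, the hand-written `word_ids` list, and the use of `-100` as the label for special, predicate, and padding tokens are all assumptions (`-100` is the value that PyTorch's cross-entropy loss ignores by default).

```python
IGNORE_INDEX = -100  # Assumed filler label; PyTorch's cross-entropy loss skips it by default.


def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Spread word-level labels over subword tokens.

    word_ids: for each token, the index of the word it came from, or None
        for tokens such as [CLS], the final [SEP], and [PAD].
    word_labels: one numeric label (or None) per word.
    label_all_tokens: if True, every subword of a word gets the word's label;
        if False, only the first subword does.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_labels[word_id] is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word or label_all_tokens:
            aligned.append(word_labels[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hand-written word indices for the tokens
# [CLS] what if google mor ##ph ##ed into google ##os ? [SEP] mor ##ph ##ed [SEP] [PAD] [PAD]
toy_word_ids = [None, 0, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 8, None, None, None]
toy_word_labels = [0, 0, 2, 0, 0, 4, 0, None, None]

print(align_labels(toy_word_ids, toy_word_labels, label_all_tokens=True))
print(align_labels(toy_word_ids, toy_word_labels, label_all_tokens=False))
```

With `label_all_tokens=True`, both `google` and `##os` carry the label of `GoogleOS`, so the model is trained and scored on every subword rather than only on the first one.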

In [31]:
print(tokenizer.convert_ids_to_tokens(tokenized_test["input_ids"][0]))
print(tokenized_test["attention_mask"][0])
print(tokenized_test["input_ids"][0])
print(tokenized_test["labels"][0])

['[CLS]', 'what', 'if', 'google', 'mor', '##ph', '##ed', 'into', 'google', '##os', '?', '[SEP]', 'mor', '##ph', '##ed', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Confirming all the tokens contain a label and the attention mask also matches the length of the input

In [32]:
print(len(tokenized_test["input_ids"][0]),len(tokenized_test["labels"][0]),len(tokenized_test["attention_mask"][0]))

97 97 97


Converting the tokenized data to datasets format with the function `load_dataset`

In [33]:
dataset_train = load_dataset(tokenized_train)
dataset_dev = load_dataset(tokenized_dev)
dataset_test = load_dataset(tokenized_test)

## Fine-tuning the model

To run the model on a smaller portion of the data, for example for a quick test, uncomment the cell below to reduce the size of each dataset.

In [34]:
#small_train_dataset = dataset_train.shuffle(seed=42).select(range(1000))
#small_eval_dataset = dataset_dev.shuffle(seed=42).select(range(1000))
#small_test_dataset = dataset_test.shuffle(seed=42).select(range(1000))

Get the labels that the model will predict with the `get_labels_from_map` function:

In [35]:
label_list = get_labels_from_map(label_map)
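
A label list of this kind is typically ordered so that the position of a label equals its numeric ID, which lets a predicted class index be turned back into a label name. The sketch below illustrates that idea with a hypothetical helper, `labels_from_map`. Whether `get_labels_from_map` skips the `None` entry in the same way is an assumption.

```python
def labels_from_map(label_map):
    """Return label names ordered by numeric ID, skipping the None entry."""
    pairs = [(index, label) for label, index in label_map.items() if label is not None]
    return [label for _, label in sorted(pairs)]


toy_map = {"ARG1": 2, "_": 0, "ARG0": 1, None: None}
toy_label_list = labels_from_map(toy_map)
print(toy_label_list)

# A predicted class index can be used directly as a position in the list.
print(toy_label_list[2])
```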

Loading the model for semantic role labelling task with function `load_srl_model` to get the model, its name and the arguments necessary for training. \
The model selected is **distilbert-base-uncased**

In [36]:
model, model_name, args = load_srl_model(model_checkpoint, label_list)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [37]:
print(model_name)

distilbert-base-uncased


Pass the model, the training arguments, and the datasets to the `Trainer` class, then fine-tune the model for semantic role labelling with `trainer.train()`:

In [38]:
trainer = Trainer(
        model,
        args,
        train_dataset=dataset_train,
        eval_dataset=dataset_dev,
        tokenizer=tokenizer,
        compute_metrics=lambda p: compute_metrics(*p, label_list)
    )
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.1271,0.142947,0.364061,0.329676,0.330257,0.960291
2,0.0968,0.125784,0.416398,0.367692,0.376981,0.964362
3,0.0841,0.124459,0.424408,0.391904,0.399526,0.965291


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=7593, training_loss=0.1287919250100431, metrics={'train_runtime': 2010.6714, 'train_samples_per_second': 60.401, 'train_steps_per_second': 3.776, 'total_flos': 5615118349103484.0, 'train_loss': 0.1287919250100431, 'epoch': 3.0})
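
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes training on a single device without gradient accumulation, so that one optimizer step processes one batch. Under that assumption, the throughput figures point to a batch size of 16 and roughly 40,000 training rows per epoch.

```python
import math

# Values copied from the TrainOutput shown above.
global_step = 7593
epochs = 3.0
train_runtime = 2010.6714  # seconds
samples_per_second = 60.401
steps_per_second = 3.776

# Optimizer steps taken in each epoch.
steps_per_epoch = global_step / epochs

# Rows processed per step; this equals the batch size only if training used
# one device and no gradient accumulation (an assumption).
batch_size = round(samples_per_second / steps_per_second)

# Rows seen per epoch, estimated from throughput and runtime.
approx_train_rows = round(samples_per_second * train_runtime / epochs)

print(steps_per_epoch, batch_size, approx_train_rows)

# Consistency check: rows divided by batch size, rounded up, should match the steps per epoch.
print(math.ceil(approx_train_rows / batch_size) == steps_per_epoch)
```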

Evaluate the fine-tuned model on the development set with `trainer.evaluate()`:

In [39]:
trainer.evaluate()

  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.12445865571498871,
 'eval_precision': 0.4244081146244632,
 'eval_recall': 0.3919038696272433,
 'eval_f1': 0.39952554037676824,
 'eval_accuracy': 0.9652914041791456,
 'eval_runtime': 15.8633,
 'eval_samples_per_second': 313.742,
 'eval_steps_per_second': 19.668,
 'epoch': 3.0}

After training is finished, evaluate the model on the test set. \
Apply the same `compute_metrics` function to the result of the `predict` method to get the overall precision, recall, F1, and accuracy.

In [40]:
predictions, labels, _ = trainer.predict(dataset_test)
results = compute_metrics(predictions, labels, label_list)
results

  _warn_prf(average, modifier, msg_start, len(result))


{'precision': 0.42951477322261544,
 'recall': 0.4030070005307468,
 'f1': 0.4036396071913792,
 'accuracy': 0.9663773918794519}
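
The accuracy (about 0.97) is much higher than the F1 score (about 0.40). One plausible explanation, stated here as an assumption about `compute_metrics`, is that precision, recall, and F1 are macro-averaged over all labels, so labels that are rare or absent in the gold data pull the average down, while accuracy is dominated by the frequent `_` label. The `_warn_prf` lines are the kind of warning scikit-learn emits when a metric is undefined for a label. The toy example below, written in plain Python with made-up data, shows the effect.

```python
def macro_scores(gold, predicted, labels):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Undefined per-label scores (no gold or no predicted items) count as 0.
    """
    precisions, recalls, f1_scores = [], [], []
    pairs = list(zip(gold, predicted))
    for label in labels:
        true_pos = sum(g == label and p == label for g, p in pairs)
        false_pos = sum(g != label and p == label for g, p in pairs)
        false_neg = sum(g == label and p != label for g, p in pairs)
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    count = len(labels)
    return accuracy, sum(precisions) / count, sum(recalls) / count, sum(f1_scores) / count


# Made-up data: 18 tokens without a role, one correct ARG0, one missed ARG1.
toy_gold = ["_"] * 18 + ["ARG0", "ARG1"]
toy_pred = ["_"] * 18 + ["ARG0", "_"]
toy_labels = ["_", "ARG0", "ARG1", "ARG5"]  # ARG5 never occurs, like the zero-support labels

accuracy, macro_p, macro_r, macro_f1 = macro_scores(toy_gold, toy_pred, toy_labels)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

In this toy case a single missed argument and one label without any examples are enough to drop macro F1 to roughly half of the accuracy.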

Writing the predictions together with the gold labels to a csv file with the function `write_predictions_to_csv` so that the metrics per class can be computed with the `compute_evaluation_metrics_from_csv` function.
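
The sketch below illustrates the kind of conversion such an export involves: pick the highest-scoring class for each token, skip positions that carry no gold label, and write label names side by side. The helper `write_rows`, the column names, and the `-100` filler label are assumptions made for the example, not the behavior of `write_predictions_to_csv`.

```python
import csv
import io

IGNORE_INDEX = -100  # Assumed label for positions that should not be scored.


def write_rows(logits, gold_ids, label_list, handle):
    """Write one 'gold,predicted' row per scored token to an open file handle."""
    writer = csv.writer(handle)
    writer.writerow(["gold", "predicted"])
    for sentence_logits, sentence_gold in zip(logits, gold_ids):
        for token_logits, gold_id in zip(sentence_logits, sentence_gold):
            if gold_id == IGNORE_INDEX:
                continue
            # Index of the highest score is the predicted class.
            predicted_id = max(range(len(token_logits)), key=token_logits.__getitem__)
            writer.writerow([label_list[gold_id], label_list[predicted_id]])


toy_names = ["_", "ARG0", "ARG1"]
toy_logits = [[[0.1, 0.2, 0.7], [2.0, 0.1, 0.3], [0.0, 1.5, 0.2]]]
toy_gold_ids = [[IGNORE_INDEX, 0, 1]]

buffer = io.StringIO()
write_rows(toy_logits, toy_gold_ids, toy_names, buffer)
print(buffer.getvalue())
```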

In [41]:
results_file = "predictions.csv"
write_predictions_to_csv(predictions, labels, label_list, results_file)
classification_report = compute_evaluation_metrics_from_csv(results_file)
print(classification_report)

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

        ARG0       0.87      0.81      0.84      2023
        ARG1       0.84      0.81      0.82      3750
    ARG1-DSP       0.00      0.00      0.00         0
        ARG2       0.70      0.69      0.70      1309
        ARG3       0.01      0.25      0.02         4
        ARG4       0.59      0.62      0.61        64
        ARG5       0.00      0.00      0.00         0
        ARGA       0.00      0.00      0.00         0
    ARGM-ADJ       0.76      0.74      0.75       261
    ARGM-ADV       0.55      0.67      0.61       429
    ARGM-CAU       0.58      0.52      0.55        54
    ARGM-COM       0.25      0.57      0.35         7
    ARGM-CXN       0.50      0.75      0.60         8
    ARGM-DIR       0.36      0.49      0.41        35
    ARGM-DIS       0.72      0.74      0.73       191
    ARGM-EXT       0.76      0.75      0.76       106
    ARGM-GOL       0.07      0.67      0.12         3
    ARGM-LOC       0.64    

  _warn_prf(average, modifier, msg_start, len(result))


Then, save the fine-tuned model and the tokenizer.

In [42]:
# Save the tokenizer and the fine-tuned model to local directories
tokenizer.save_pretrained("tokenizer.save_pretrained.distillbert-base-uncased-finetuned-srl")
trainer.save_model("trainer.save_model.distillbert-base-uncased-finetuned-srl")
model.save_pretrained("model.save_pretrained.distillbert-base-uncased-finetuned-srl")

In [45]:
!mkdir -p "/content/drive/MyDrive/NLP_3_baseline_model/model"

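Here, copy the saved model to Google Drive. This requires Google Drive to be mounted first, using the commented-out cell near the top of the notebook.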

In [46]:
!cp -r '/content/trainer.save_model.distillbert-base-uncased-finetuned-srl' '/content/drive/MyDrive/NLP_3_baseline_model/model'
!cp -r '/content/model.save_pretrained.distillbert-base-uncased-finetuned-srl' '/content/drive/MyDrive/NLP_3_baseline_model/model'
!cp -r '/content/tokenizer.save_pretrained.distillbert-base-uncased-finetuned-srl' '/content/drive/MyDrive/NLP_3_baseline_model/model'

## Group Contribution:

##### Ariana Britez:
- functions to map the labels to numbers for model input: `get_label_mapping`, `map_labels_to_numbers`, `map_labels_in_dataframe`
- function to get the list of labels for model input: `get_labels_from_map`
- functions to compute the metrics during training, evaluation and inference: `compute_metrics`, `compute_evaluation_metrics_from_csv`
- function to load the transformer model for fine-tuning: `load_srl_model`
- function to load the dataset in a format that the model can handle: `load_dataset`
- function to save the model predictions with gold labels for evaluation: `write_predictions_to_csv`
- writing the markdown from the section on importing the model until the evaluation of the baseline model