
#TRAINING CROSS-LINGUAL MODELS FOR TOXIC SPANS DETECTION
This notebook contains the code for training mBERT, DistilmBERT, XLM-roBERTa and XLM-V models for toxic spans detection. The notebook is designed for use in Google Colab. It is recommended to run the notebook on TPU (available in Colab Pro).



# Prep work: load in files, import modules

Run the following cells. A pop-up will connect the notebook to google drive for storing the trained model(s). Run the cells after to install and import modules and clone github repos.

In [1]:
from google.colab import drive
import os
#connect drive
drive.mount('/content/drive')

#create folder for saving models
destination_folder = ('/content/drive/MyDrive/TSD_models/')
try:
  os.mkdir(destination_folder)
except FileExistsError:
  pass

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
#install modules
!pip install transformers==4.28.1
!pip install sacremoses

Collecting transformers==4.28.1
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m46.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.30.2
    Uninstalling transformers-4.30.2:
      Successfully uninstalled transformers-4.30.2
Successfully installed transformers-4.28.1


In [4]:
#import modules
from transformers import AutoTokenizer, BatchEncoding, AutoModelForTokenClassification, Trainer, TrainingArguments
import os, torch, ast, itertools
import numpy as np
import pandas as pd
from statistics import mean

In [5]:
#import git repository to collect the training data
!git clone https://github.com/ipavlopoulos/toxic_spans

fatal: destination path 'Cross_lingual_TSD' already exists and is not an empty directory.
fatal: destination path 'toxic_spans' already exists and is not an empty directory.


#SET VARIABLES, INITIALIZE MODEL AND TOKENIZER
Below, we set the variables used for training, and load in the model and tokenizer. The model_checkpoint can be set to the transformer model we want to fine-tune. This can be one of the following: 'bert-base-multilingual-cased','distilbert-base-multilingual-cased', 'xlm-roberta-base', 'facebook/xlm-v-base'

In [23]:
model_checkpoint = 'distilbert-base-multilingual-cased'
max_len = 512
pad_token = -100

id2label = {0:'O', 1:'I'}
label2id = {v: k for k, v in id2label.items()}

data_files = {'train': '/content/toxic_spans/SemEval2021/data/tsd_trial.csv',
              'eval': '/content/toxic_spans/SemEval2021/data/tsd_train.csv'}

destination_folder = f"/content/drive/MyDrive/models_trained_for_TSD/CM2_BS8_LR1E05/trained_on_dutch_split/{model_checkpoint.replace('/', '_')}"

try:
  os.mkdir(destination_folder)
except FileExistsError:
  pass

In [24]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint,
                                                        num_labels = 2,
                                                        id2label=id2label,
                                                        label2id=label2id,)

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/466 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/542M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this 

# PREPROCESS DATA

Run the cells below to preprocess the data

In [25]:
class TSDdataset(torch.utils.data.Dataset):
    #inspired by https://huggingface.co/transformers/v4.1.1/custom_datasets.html
    #Creates a dataset object that holds encodings as well as gold labels
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def align_tokens_and_annotation_labels(tokenized: BatchEncoding, annotations: str):
#inspired by https://github.com/LightTag/sequence-labeling-with-transformers/blob/master/notebooks/how-to-align-notebook.ipynb
    """Aligning tokens with annotation labels (I/O), given a BatchEncoding (a tokenized text) and a list of character offsets.
    Param BatchEncoding tokenized: a tokenized sentence that has been tokenized by a FastTokenizer.
    Param annotations: a list or string that indicates the character indices that are toxic spans. """
    #create aligned_labels as a list of 0's, length equals the number of tokens
    aligned_labels = [0] * len([ids for ids in tokenized.ids if id != 0])
    #convert annotation from str to list
    spanlist = ast.literal_eval(annotations)

    #iterate over indices in the span list
    for char_ix in spanlist:
        #Find the corresponding token index
        token_ix = tokenized.char_to_token(char_ix)
        #Change the value in aligned_labels to 1 (I)
        if token_ix is not None: # White spaces have no token and will return None
          aligned_labels[token_ix] = 1
    #aligned_labels now looks like a list of 0s and 1s, of equal length as the tokens
    #we add the pad_token to the list until the list has length max_len
    n_pad_tokens = max_len-len(aligned_labels)
    aligned_labels += [pad_token]*n_pad_tokens

    return aligned_labels

def preprocess (file_dict, tokenizer):
  """Preprocesses a dict of .csv files into a dictionary TSDdataset objects.
    Param files: a dict holding paths to .csv files to be preprocessed. Keys should be 'train' 'dev' and/or 'test', values should be paths to respective data file.
    tokenizer: an AutoTokenizer with which we want to preprocess the data."""

  TSD_datasetdict = {}

  #open file and extract the list of texts and list of spans
  for data_type, file in file_dict.items():
    df = pd.read_csv(file)
    texts = list(df['text'])
    spans = list(df['spans'])
    #tokenize the texts so that we can create a list of gold labels that is aligned with the tokens
    encodings = tokenizer(texts, truncation = True, max_length = max_len, padding = 'max_length')

    labels = [align_tokens_and_annotation_labels(tokenized, annotation) for tokenized, annotation in zip(encodings.encodings, spans)]
    TSD_datasetdict[data_type] = TSDdataset(encodings, labels)

  return TSD_datasetdict



Run the next cell to preprocess the data. Make sure that datasetdict has the keys 'train', 'eval': otherwise the training code below will not work.

In [26]:
datasetdict = preprocess(data_files, tokenizer)
datasetdict.items()

dict_items([('train', <__main__.TSDdataset object at 0x7f0901a42890>), ('eval', <__main__.TSDdataset object at 0x7f097d808af0>)])

# Functions to evaluate the model's performance in between epochs

Run the following cell to define compute_metrics, a function we give to the Trainer that is used to evaluate the model's performance in between epochs. Below, calculate_evaluation_metrics calculates precision, recall and f1; convert_token_predictions_to_spans converts the token-level predictions of the model to character indices that denote toxic spans; compute_metrics combines these functions and gives the resulting evaluation metrics back to the Trainer.

In [27]:
def calculate_evaluation_metrics(gold, pred):
  """Calculates averaged f1, precision and recall for TSD, via the metric defined by Pavlopoulos et al's "SemEval-2021 Task 5: Toxic Spans Detection" (2021)
    Param gold: the gold labels (list of lists of character indices)
    Param pred: system predictions (list of lists of character indices)
    Returns: a dictionary holding precision, recall and f1"""
  all_precision = []
  all_recall = []
  all_f1 = []

  #iterate over all sublists in the gold and pred lists
  for tweet_gold, tweet_pred in zip(gold,pred):
     #gold may hold strings instead of sublists. If so, convert to list.
    if type(tweet_gold) == str:
      tweet_gold = ast.literal_eval(tweet_gold)

    #If there are no toxic spans and none predicted, set precision, recall and f1 to 1
    if tweet_gold == [] and tweet_pred == []:
      precision, recall, f1 = 1,1,1
    #else, if either gold or pred holds no spans, set precision, recall and f1 to 0
    elif tweet_gold == [] or tweet_pred == []:
      precision, recall, f1 = 0,0,0

    #else, count the number of true positives, false positives, false negatives
    else:
      TP = len([char for char in tweet_pred if char in tweet_gold])
      FP = len([char for char in tweet_pred if char not in tweet_gold])
      FN = len([char for char in tweet_gold if char not in tweet_pred])

      #calculate precision, recall, f1
      precision = (TP/(TP+FP))
      recall = (TP/(TP+FN))
      try:
        f1 = 2*precision*recall/(precision+recall)
      except ZeroDivisionError: #can still happen if TP=0
        f1 = 0

    all_precision.append(precision)
    all_recall.append(recall)
    all_f1.append(f1)

  return {
          # 'TP': total_TP,
          # 'FP': total_FP,
          # 'FN': total_FN,
          'precision': mean(all_precision),
          'recall': mean(all_recall),
          'f1': mean(all_f1),
          }

def convert_token_predictions_to_spans(binary_predictions, test_datasetdict):

  all_spans = []
  for tweet_idx, predictions in enumerate(binary_predictions):
    tweet_spans = []
    for token_idx, pred_token in enumerate(predictions):
      if test_datasetdict.labels[tweet_idx][token_idx] != pad_token:
        if pred_token == 1:
          token_span = test_datasetdict.encodings.token_to_chars(tweet_idx, token_idx)
          tweet_spans += ([idx for idx in range(token_span.start, token_span.end)])


      else:
        break
    all_spans.append(tweet_spans)

  return all_spans



def compute_metrics(eval_preds):
  """A compute_metrics function to use in objects of the Trainer class. Is used to evaluate the performance in between epochs using Pavlopoulos et al's evaluation metric.
  param eval_preds: the system predictions and gold labels
  returns: a dictionary holding evaluation metrics and their values"""

  logits, labels = eval_preds

  #remove predictions for the pad token (-100) from the labels
  true_labels = [[l for l in label if l != pad_token] for label in labels]
  #convert predictions to 0's and 1's
  predictions = np.argmax(logits, axis=-1)
  #convert token-level predictions to spans
  all_prediction_spans = convert_token_predictions_to_spans(predictions, datasetdict['eval'])
  all_gold_spans = convert_token_predictions_to_spans(true_labels, datasetdict['eval'])
  #calculate evaluation metrics
  performance_dict = calculate_evaluation_metrics(all_prediction_spans, all_gold_spans)

  return {'f1': performance_dict['f1']}


#Set training arguments and train
Next, we set the training arguments and create a Trainer object to fine-tune the model. By running trainer.train, the model starts training. We can optionally change parameter values in 'args'.

In [29]:

args = TrainingArguments(
    output_dir = destination_folder,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    load_best_model_at_end=True,
    metric_for_best_model = 'f1'
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=datasetdict['train'],
    eval_dataset=datasetdict['eval'],
    compute_metrics=compute_metrics,

    tokenizer=tokenizer,

)
trainer.train()
trainer.save_model(destination_folder)

Epoch,Training Loss,Validation Loss,F1
1,No log,0.032841,0.061091
2,No log,0.02941,0.209814


#Training finished.
If the cell above has run succesfully, the fine-tuned models are saved in google Drive.