# Fine-tuning RoBERTa

Importing libraries.

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, DataCollatorForTokenClassification, Trainer
from sklearn.preprocessing import LabelEncoder
import warnings, evaluate, pickle, json
from tqdm import tqdm
import numpy as np
warnings.filterwarnings("ignore")
from datasets import disable_caching
disable_caching()

## Dataset tokenization

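We load the dataset with the `load_dataset` function from the `datasets` library. It returns a `DatasetDict` with the `train`, `validation`, and `test` splits, which is the format the rest of the Hugging Face tooling expects.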

In [2]:
dataset = load_dataset("json", data_files={"train": "data/train_data.json", "validation": "data/val_data.json", "test": "data/test_data.json"})
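For reference, here is a minimal sketch of what one of these files could look like. The field names `modified_words` and `ner_tags` are the ones used later in this notebook; the JSON Lines layout (one record per line), the integer tag values, and the temporary path are assumptions made only for illustration.

```python
import json
import os
import tempfile

# Hypothetical records: the field names come from this notebook, the values are made up.
toy_records = [
    {"modified_words": ["@@PADDING@@", "Aun", "así"], "ner_tags": [0, 12, 7]},
    {"modified_words": ["@@PADDING@@", "el", "momento"], "ner_tags": [0, 3, 41]},
]

# Assumed layout: JSON Lines, one record per line.
toy_path = os.path.join(tempfile.mkdtemp(), "train_data.json")
with open(toy_path, "w", encoding="utf-8") as f:
    for record in toy_records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back the same way, one JSON object per line.
with open(toy_path, encoding="utf-8") as f:
    toy_loaded = [json.loads(line) for line in f]

# Sanity check: every word must come with exactly one tag.
lengths_match = all(
    len(r["modified_words"]) == len(r["ner_tags"]) for r in toy_loaded
)
print(lengths_match)
```

A check like `lengths_match` is worth running on the real files, since the label alignment further down indexes the tags by word position.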

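Using the model id, we can load both the model and the tokenizer. Since our data contains the special token `@@PADDING@@`, we add it to the tokenizer's vocabulary. `add_tokens` returns the number of tokens that were actually added (here, 1).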

In [3]:
model_id = "MMG/xlm-roberta-large-ner-spanish"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.add_tokens(new_tokens = ["@@PADDING@@"])

1

Let us quickly check how the tokenizer transforms our words. Two things stand out:

- It adds special tokens at the beginning and end of the sentence, `<s>` and `</s>`. We will have to let the model know that these are special tokens and carry no label.
- It splits some words into several subword tokens. The first piece of each word starts with the marker `▁` (it looks like an underscore) and the remaining pieces do not; for example, `Aun` becomes `▁A`, `un`. When realigning labels, we must make sure to label only the first piece of each word.

In [4]:
example = dataset["train"][0]
tokenized_input = tokenizer(example["modified_words"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
example["modified_words"], tokens

(['@@PADDING@@',
  'Aun',
  'así',
  'no',
  'hemos',
  'mi',
  'favorito',
  'de',
  'los',
  'poca',
  'que',
  'ANTE',
  'el',
  'momento',
  'SERÁN',
  'podido',
  'escuchar',
  'de',
  'PeerGynt',
  'Lobogris',
  'fuimos'],
 ['<s>',
  '@@PADDING@@',
  '▁A',
  'un',
  '▁así',
  '▁no',
  '▁hemos',
  '▁mi',
  '▁favorito',
  '▁de',
  '▁los',
  '▁poca',
  '▁que',
  '▁',
  'ANTE',
  '▁el',
  '▁momento',
  '▁SER',
  'ÁN',
  '▁podido',
  '▁escuchar',
  '▁de',
  '▁Pe',
  'er',
  'G',
  'y',
  'nt',
  '▁Lo',
  'bog',
  'ris',
  '▁fui',
  'mos',
  '</s>'])
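To make the `▁` convention concrete, the following sketch regroups a few of the tokens above into words using only the marker. This is a simplified heuristic written for illustration: the function `group_subwords` and its parameters are hypothetical, and the actual alignment in this notebook relies on the tokenizer's own `word_ids()` instead.

```python
def group_subwords(tokens, special=("<s>", "</s>"), standalone=("@@PADDING@@",)):
    """Group subword tokens into words using the '▁' word-start marker.

    `special` tokens are skipped; `standalone` tokens (added tokens without
    a marker) always start a new word. Both defaults are assumptions.
    """
    words = []
    for token in tokens:
        if token in special:
            continue
        if token in standalone or token.startswith("▁") or not words:
            words.append([token])      # start of a new word
        else:
            words[-1].append(token)    # continuation of the current word
    return words

# A shortened version of the token list printed above.
toy_tokens = ["<s>", "@@PADDING@@", "▁A", "un", "▁así", "▁", "ANTE", "▁SER", "ÁN", "</s>"]

toy_groups = group_subwords(toy_tokens)
toy_words = ["".join(pieces).replace("▁", "") for pieces in toy_groups]
print(toy_groups)
print(toy_words)
```

Note that `ANTE` is split into a lone `▁` followed by `ANTE`, so the first piece of a word can be the marker alone.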

The next function tokenizes each sentence and realigns its labels accordingly. In particular, it assigns the label -100 to the special tokens `<s>` and `</s>` (the XLM-RoBERTa counterparts of `[CLS]` and `[SEP]`), and it labels only the first subword token of each word, marking the remaining pieces with -100 so that the loss ignores them.

In [6]:
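def tokenize_and_align_labels(examples):
    # Sentences are already split into words, so the tokenizer only splits words into subwords
    tokenized_inputs = tokenizer(examples["modified_words"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each token to the index of its source word (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens (<s>, </s>) are ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first subword of a word keeps the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining subwords of the same word are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs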
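The alignment rule can be checked on a toy input without loading the tokenizer. The sketch below repeats the inner loop on a hand-written `word_ids` list; the helper name `align_labels` and the label values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; ignore everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special token, or a continuation piece of the previous word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return aligned

# Tokens:        <s>   @@PADDING@@  ▁A  un  ▁así  </s>
toy_word_ids = [None, 0,           1,  1,  2,    None]
toy_word_labels = [7, 3, 5]  # made-up label ids, one per word

toy_aligned = align_labels(toy_word_ids, toy_word_labels)
print(toy_aligned)  # [-100, 7, 3, -100, 5, -100]
```

Only three of the six positions carry a real label, matching the three words of the input.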

We map our dataset with this function.

In [7]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

Map: 100%|████████████████████| 819832/819832 [01:04<00:00, 12697.52 examples/s]
Map: 100%|████████████████████| 234237/234237 [00:18<00:00, 12493.89 examples/s]
Map: 100%|████████████████████| 117120/117120 [00:09<00:00, 12857.83 examples/s]


With it, we can define the `data_collator` for training.

In [8]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
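The collator pads every batch to the length of its longest sequence and pads the labels with -100 so that padded positions are ignored by the loss. A rough pure-Python sketch of that behaviour follows; `pad_batch` is a hypothetical helper, and the padding id of 1 is an assumption about the XLM-RoBERTa vocabulary (the real collator reads it from the tokenizer and returns tensors).

```python
def pad_batch(features, pad_token_id=1, label_pad_id=-100):
    """Right-pad input ids, attention mask, and labels to the batch maximum."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two made-up tokenized sentences of different lengths.
toy_features = [
    {"input_ids": [0, 5, 6, 7, 2], "labels": [-100, 3, -100, 4, -100]},
    {"input_ids": [0, 8, 2], "labels": [-100, 1, -100]},
]

toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"])
print(toy_batch["labels"])
```

Padding per batch rather than to a global maximum keeps short sentences cheap, which matters with more than 800,000 training examples.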

With our previously saved `LabelEncoder`, we define the dictionaries `id2label` and `label2id` to convert between integer ids and label names. This information is also passed to the model configuration.

In [11]:
with open("data/labelencoder.pkl","rb") as f:
    le = pickle.load(f)

id2label = {i: le.classes_[i] for i in range(len(le.classes_))}
label2id = {id2label[j]: j for j in range(len(id2label))}
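As a small illustration of the two mappings, the sketch below builds them from a made-up label set. `LabelEncoder.classes_` holds the sorted unique labels, which `sorted(set(...))` imitates here; the tag names are hypothetical and not the ones used in this project.

```python
# Made-up labels standing in for the real training tags.
toy_raw_labels = ["O", "B-X", "O", "I-X", "B-X"]

# Imitates LabelEncoder.classes_: sorted unique labels.
toy_classes = sorted(set(toy_raw_labels))

toy_id2label = dict(enumerate(toy_classes))
toy_label2id = {label: idx for idx, label in toy_id2label.items()}

print(toy_id2label)  # {0: 'B-X', 1: 'I-X', 2: 'O'}
print(toy_label2id)  # {'B-X': 0, 'I-X': 1, 'O': 2}

# The two dictionaries must be exact inverses of each other.
round_trip_ok = all(toy_label2id[toy_id2label[i]] == i for i in toy_id2label)
print(round_trip_ok)
```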

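We load the model and resize its token embeddings to match the tokenizer, since we added one token. Because our label set differs from the 9 classes of the checkpoint, `ignore_mismatched_sizes=True` lets the classification head be reinitialized with the new shape, as the warning below reports.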

In [12]:
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(le.classes_), id2label=id2label, label2id=label2id, ignore_mismatched_sizes = True)
model.resize_token_embeddings(len(tokenizer))

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at MMG/xlm-roberta-large-ner-spanish and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([9]) in the checkpoint and torch.Size([5000]) in the model instantiated
- classifier.weight: found shape torch.Size([9, 1024]) in the checkpoint and torch.Size([5000, 1024]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Embedding(250003, 1024)
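The resize is needed because every token id indexes one row of the embedding matrix, so a new id without a matching row cannot be looked up. The toy sketch below shows the idea with plain lists; `add_tokens` and `resize_embeddings` here are hypothetical stand-ins, not the library functions, and the random initialization of the new row is an assumption for illustration.

```python
import random

def add_tokens(vocab, new_tokens):
    """Add unseen tokens to the vocabulary and return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            added += 1
    return added

def resize_embeddings(embeddings, new_size, dim, rng):
    """Keep the existing rows and append newly initialized ones."""
    resized = [row[:] for row in embeddings[:new_size]]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized

toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "▁hola": 3}
toy_dim = 4
rng = random.Random(0)
toy_embeddings = [[rng.gauss(0.0, 0.02) for _ in range(toy_dim)] for _ in toy_vocab]

n_added = add_tokens(toy_vocab, ["@@PADDING@@"])
new_id = toy_vocab["@@PADDING@@"]
lookup_ok_before = new_id < len(toy_embeddings)   # no row for the new id yet

toy_embeddings = resize_embeddings(toy_embeddings, len(toy_vocab), toy_dim, rng)
lookup_ok_after = new_id < len(toy_embeddings)    # the new id now has a row

print(n_added, new_id, lookup_ok_before, lookup_ok_after)
```

In the notebook the same thing happens at scale: the embedding matrix ends up with 250003 rows, one per entry in the extended tokenizer.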

We compute the metrics with the help of the `seqeval` library. We also define a function that preprocesses the logits before the metrics calculation: it reduces them to predicted label ids, so the evaluation loop has to store far less data.

In [13]:
seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    # p.predictions already holds label ids, thanks to preprocess_logits_for_metrics
    predictions = p.predictions
    labels = p.label_ids

    # Drop every position labelled -100 (special tokens, subword continuations, padding)
    true_predictions = [
        [id2label[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

def preprocess_logits_for_metrics(logits, labels):
    # Keep only the most likely label id per token instead of the full logits
    pred_ids = logits.argmax(dim=-1)
    return pred_ids
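The two steps above, reducing each token's logits to a single id and then discarding the positions labelled -100, can be traced on a toy example in plain Python. The helper names, the three-label set, and the numbers are all made up for illustration; in the notebook each token has 5000 logits rather than 3.

```python
IGNORE_INDEX = -100

def argmax_ids(logits):
    """Index of the largest logit for every token of every sequence."""
    return [
        [max(range(len(tok)), key=tok.__getitem__) for tok in sequence]
        for sequence in logits
    ]

def drop_ignored(pred_ids, label_ids, id2label):
    """Keep only positions with a real label and map ids to label names."""
    preds, refs = [], []
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_seq, label_seq) if l != IGNORE_INDEX]
        preds.append([id2label[p] for p, _ in kept])
        refs.append([id2label[l] for _, l in kept])
    return preds, refs

toy_id2label = {0: "O", 1: "B-X", 2: "I-X"}  # hypothetical label set

# One sequence of four tokens, three logits per token.
toy_logits = [[[0.1, 0.2, 0.7], [2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.05, 0.05]]]
toy_labels = [[-100, 0, 2, -100]]

toy_pred_ids = argmax_ids(toy_logits)
toy_preds, toy_refs = drop_ignored(toy_pred_ids, toy_labels, toy_id2label)

print(toy_pred_ids)  # [[2, 0, 1, 0]]
print(toy_preds)     # [['O', 'B-X']]
print(toy_refs)      # [['O', 'I-X']]
```

The predictions at the first and last positions never reach the metric, because their labels are -100.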

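Now we define the `TrainingArguments`. We train for 2 epochs with `fp16` mixed precision, and we keep only the best model according to the F1 score. The value 0.5 given to `eval_steps`, `save_steps`, and `logging_steps` is read as a fraction of the total number of training steps, so evaluation, checkpointing, and logging happen halfway through training and at the end.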

**IMPORTANT:** although this notebook works, training RoBERTa is slow: 2 epochs take about 9-10 hours. For that reason, the training was actually run from the terminal with a script called `roberta_finetune_script.py`, which condenses this whole training pipeline.

In [None]:
training_args = TrainingArguments(
    output_dir="test_model_roberta",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps = 0.5,
    save_strategy="steps",
    save_steps = 0.5,
    load_best_model_at_end=True,
    fp16=True,
    save_total_limit=1,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir = "./logs",
    logging_steps = 0.5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
)

trainer.train()
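As a back-of-the-envelope check of the schedule, the sketch below converts the fractional step settings into absolute step counts. It assumes a single device, no gradient accumulation, and that a fractional interval is rounded up to a whole number of steps; `resolve_interval` is a hypothetical helper, not a library function.

```python
import math

def resolve_interval(value, max_steps):
    """Turn a fractional interval (< 1) into an absolute number of steps."""
    if value < 1:
        return math.ceil(max_steps * value)
    return int(value)

n_train_examples = 819832   # size of the train split, from the Map output above
batch_size = 8              # per_device_train_batch_size
n_epochs = 2                # num_train_epochs

steps_per_epoch = math.ceil(n_train_examples / batch_size)
max_steps = steps_per_epoch * n_epochs
eval_every = resolve_interval(0.5, max_steps)

print(steps_per_epoch)  # 102479
print(max_steps)        # 204958
print(eval_every)       # 102479
```

Under these assumptions, evaluation and checkpointing run once per epoch, so the best of two candidate checkpoints is loaded at the end.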

In [None]:
trainer.save_model("roberta_ner_model")