<a href="https://colab.research.google.com/github/damayantinaik/Fine-tune-model/blob/main/Fine_tuning_multilingual_NER_LLM_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fine-Tuning Multilingual Named Entity Recognition LLM model**

Named Entity Recognition (NER) is a common NLP task that identifies entities such as persons, organizations, and locations in text. These entities can be used in many applications, such as gaining insights from company documents, improving the quality of search engines, or building a structured database from a corpus.

In this project, I'll show how a single transformer model called "XLM-RoBERTa" can be fine-tuned to perform named entity recognition across several languages.

To carry out the task, I'm choosing to perform NER for Switzerland-based customers. Switzerland has four national languages (German, French, Italian, and Romansh), and English often serves as a bridge between them. This project covers German, French, Italian, and English.




# **Multilingual Transformers**

Multilingual transformers use the same architecture and training procedure as their monolingual counterparts. The only difference is that a multilingual transformer is pretrained on a corpus of documents in many languages, whereas a monolingual transformer is pretrained on a corpus in a single language. Multilingual transformers are able to generalize well across languages for a variety of downstream tasks.

One efficient multilingual transformer is XLM-RoBERTa (also called XLM-R). It is pretrained on a large multilingual corpus and uses "SentencePiece" to tokenize the raw text. In this project, I'll fine-tune the model to get the best performance I can.

# **The Dataset**
The dataset that I am going to use is a subset of the XTREME (Cross-lingual TRansfer Evaluation of Multilingual Encoders) benchmark on the Hugging Face Hub called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many languages (176 languages), with each token tagged as person, location, or organization in the IOB2 (Inside-Outside-Beginning) format.

# **Load the dataset**

A dataset on the Hub can come in several configurations (for example, GLUE has 'cola', 'sst2', 'mrpc', 'qqp', 'stsb', and 'mnli'). XTREME is one such dataset, so before loading the PAN-X configurations, let us look at the configurations it provides and then load the ones we are interested in.

In [None]:
import datasets

In [None]:
from datasets import get_dataset_config_names
xtreme_subsets = get_dataset_config_names('xtreme')
print(f"XTREME has {len(xtreme_subsets)} configurations")




XTREME has 183 configurations


# **Exploratory Data Analysis**

In [None]:
xtreme_subsets[:5]

['MLQA.ar.ar', 'MLQA.ar.de', 'MLQA.ar.en', 'MLQA.ar.es', 'MLQA.ar.hi']

The PAN-X (WikiANN) dataset is specifically designed for training and evaluating NER models that identify and classify persons, locations, and organizations in text across multiple languages. The dataset is based on Wikipedia articles and annotated with location (LOC), person (PER), and organization (ORG) tags in the IOB2 (Inside, Outside, Beginning) format.

So, we'll only look for the configuration that starts with "PAN":

In [None]:
panx_subset = [s for s in xtreme_subsets if s.startswith("PAN")]

In [None]:
panx_subset[:10]

['PAN-X.af',
 'PAN-X.ar',
 'PAN-X.bg',
 'PAN-X.bn',
 'PAN-X.de',
 'PAN-X.el',
 'PAN-X.en',
 'PAN-X.es',
 'PAN-X.et',
 'PAN-X.eu']

As there are four languages in our Swiss setting, we'll build a Swiss corpus by sampling the German (de), French (fr), Italian (it), and English (en) configurations of PAN-X according to their spoken proportions: <br>
German (62.9%) <br> French (22.9%) <br> Italian (8.4%) <br> English (5.9%)<br>

Although imbalanced, this dataset resembles real-world datasets, where the number of samples is rarely the same for every language. This often happens because of a lack of annotation expertise in minority languages.

In [None]:
from collections import defaultdict
from datasets import DatasetDict, load_dataset

langs = ['de', 'fr', 'it', 'en']
fracs = [0.629, 0.229, 0.084, 0.059]

panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
  # Load monolingual corpus
  ds = load_dataset("xtreme", name = f"PAN-X.{lang}") # name: dataset configuration (also called subset)
  # Keep the first `frac` share of rows in each split (train, validation and test)
  for split in ds:
    panx_ch[lang][split] = ds[split].select(range(int(frac * len(ds[split]))))



In [None]:
import pandas as pd
pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index = ['Number of training examples'])

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |
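
As a quick sanity check on these counts, here is a minimal sketch of the sampling arithmetic. It assumes that every language has 20,000 training rows before sampling; that figure is my assumption and is not printed anywhere in this notebook, although it is consistent with the table above.

```python
# Swiss language proportions used in this notebook
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]

# Assumption: every language has 20,000 training rows before sampling
full_train_size = 20_000

# Same rule as the loading loop: keep the first int(frac * n) rows
subset_sizes = {lang: int(frac * full_train_size) for lang, frac in zip(langs, fracs)}

print(subset_sizes)
print(sum(subset_sizes.values()))
```

Because `int()` truncates, a fraction that does not divide the split size evenly would simply drop the partial row.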


In [None]:
panx_ch['de']["train"][:3]

{'tokens': [['als', 'Teil', 'der', 'Savoyer', 'Voralpen', 'im', 'Osten', '.'],
  ['WEITERLEITUNG', 'Antonina', 'Wladimirowna', 'Kriwoschapka'],
  ['**', "''", 'Lou', 'Salomé', "''", '.']],
 'ner_tags': [[0, 0, 0, 5, 6, 0, 0, 0], [0, 1, 2, 2], [0, 0, 1, 2, 0, 0]],
 'langs': [['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de'],
  ['de', 'de', 'de', 'de'],
  ['de', 'de', 'de', 'de', 'de', 'de']]}

In [None]:
panx_ch['de']['train'].features

{'tokens': List(Value('string')),
 'ner_tags': List(ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])),
 'langs': List(Value('string'))}

In [None]:
tags = panx_ch['de']['train'].features['ner_tags'].feature
print(tags)
print(tags.names)


ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


In [None]:
# Convert the integer tags to human-readable string tags with the int2str() method of the ClassLabel object.
# The function below does this for every example.
def create_tag_names(batch):
  return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch['de'].map(create_tag_names)

# Let us check the 1st row
de_example = panx_de['train'][0]
de_example


{'tokens': ['als', 'Teil', 'der', 'Savoyer', 'Voralpen', 'im', 'Osten', '.'],
 'ner_tags': [0, 0, 0, 5, 6, 0, 0, 0],
 'langs': ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de'],
 'ner_tags_str': ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O']}
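
To make the IOB2 scheme concrete, here is a small sketch that groups tagged tokens back into entities. The helper name `iob2_to_entities` is hypothetical (it is not part of the datasets library), and the input is the German example shown above.

```python
def iob2_to_entities(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity that is currently open
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["als", "Teil", "der", "Savoyer", "Voralpen", "im", "Osten", "."]
tags = ["O", "O", "O", "B-LOC", "I-LOC", "O", "O", "O"]
print(iob2_to_entities(tokens, tags))  # [('LOC', 'Savoyer Voralpen')]
```

The B- prefix marks the first token of an entity and the I- prefix marks its continuation, which is why "Savoyer Voralpen" comes back as a single location.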

Let us check the English train dataset.

In [None]:
panx_ch['en']["train"][:3]

{'tokens': [['R.H.',
   'Saunders',
   '(',
   'St.',
   'Lawrence',
   'River',
   ')',
   '(',
   '968',
   'MW',
   ')'],
  [';', "'", "''", 'Anders', 'Lindström', "''", "'"],
  ['Karl', 'Ove', 'Knausgård', '(', 'born', '1968', ')']],
 'ner_tags': [[3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0],
  [0, 0, 0, 1, 2, 0, 0],
  [1, 2, 2, 0, 0, 0, 0]],
 'langs': [['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en'],
  ['en', 'en', 'en', 'en', 'en', 'en', 'en'],
  ['en', 'en', 'en', 'en', 'en', 'en', 'en']]}

In [None]:
panx_ch['en']['train'].features

{'tokens': List(Value('string')),
 'ner_tags': List(ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])),
 'langs': List(Value('string'))}

In [None]:
tags2 = panx_ch['en']['train'].features['ner_tags'].feature

In [None]:
tags2

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])

In [None]:
tags_arrangement = panx_ch['en']['train']['ner_tags'][0:3]

In [None]:
tags_arrangement

[[3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 2, 0, 0],
 [1, 2, 2, 0, 0, 0, 0]]

**Let us check the dataset for any unusual imbalance in the tags by counting the tag frequencies in each split.**

In [None]:
from collections import Counter

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items(): # Here split is the train, validation, test
  for row in dataset["ner_tags_str"]:
    for tag in row:
      if tag.startswith("B"):
        tag_type = tag.split("-")[1]
        split2freqs[split][tag_type] +=1
pd.DataFrame.from_dict(split2freqs, orient = 'index')


| | LOC | PER | ORG |
|---|---|---|---|
| train | 6089 | 5778 | 5434 |
| validation | 3127 | 2891 | 2707 |
| test | 3166 | 2942 | 2509 |


The LOC, PER and ORG frequencies are roughly balanced within each split, so we can proceed.

# **Tokenization using XLM-R**

## **Tokenizer pipeline**

Tokenization is not a single step but a pipeline of several steps:<br>
**1. Normalization**<br>
This step involves <br>a. Stripping whitespace <br>b. Removing accented characters <br>c. Unicode normalization (i.e. mapping the various ways of writing the same character to a single form) <br>d. Lowercasing the text (if needed) <br><br>
**2. Pretokenization** <br> a. Split the text into words  <br><br>
**3. Tokenizer model** <br> a. Further split the words from the pretokenization step into subwords (if required) <br> b. Convert them to input_ids (integers) <br><br>
**4. Postprocessing** <br> a. Add the special tokens to the start and end of each sentence. For example, BERT uses (`[CLS]`, `[SEP]`) and XLM-R uses (`<s>`, `</s>`).
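
To illustrate step 1, here is a minimal sketch of normalization with Python's standard `unicodedata` module. It only shows the idea; the helper `normalize` is hypothetical and these are not necessarily the exact rules that the XLM-R tokenizer applies.

```python
import unicodedata


def normalize(text, lowercase=False):
    """Strip surrounding whitespace and apply Unicode NFKC normalization."""
    text = unicodedata.normalize("NFKC", text.strip())
    return text.lower() if lowercase else text


composed = "Salom\u00e9"      # "é" stored as a single code point
decomposed = "Salome\u0301"   # "e" followed by a combining acute accent

# The two strings look identical on screen but compare as different
print(composed == decomposed)
# After normalization they are the same string
print(normalize(composed) == normalize(decomposed))
```

Without this step, the same word written in two Unicode forms could be split into different tokens.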



XLM-R uses the "SentencePiece" tokenizer, which is based on a type of subword segmentation called "Unigram" and encodes each input text as a sequence of Unicode characters.

# **Building Tokenizer**

In [None]:
from transformers import AutoTokenizer
xlmr_model_name = 'xlm-roberta-base'
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

# Test on a text
text = "Jack Sparrow loves New York!"
input_ids = xlmr_tokenizer(text)
print(f"Output from tokenizer: {input_ids}")

Output from tokenizer: {'input_ids': [0, 21763, 37456, 15555, 5161, 7, 2356, 5753, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# For testing, return back to tokens
xlmr_tokens = input_ids.tokens()
print(f"\nThe tokens: {xlmr_tokens}")


The tokens: ['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>']
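
The `▁` character in the tokens above marks where a space preceded the piece, which makes the tokenization reversible. A minimal sketch of that idea in plain Python (not the tokenizer's own decoding routine), using the tokens printed above:

```python
tokens = ["<s>", "▁Jack", "▁Spar", "row", "▁love", "s", "▁New", "▁York", "!", "</s>"]
special_tokens = {"<s>", "</s>"}

# Drop the special tokens, glue the pieces together, and turn "▁" back into spaces
pieces = [t for t in tokens if t not in special_tokens]
text = "".join(pieces).replace("▁", " ").strip()

print(text)  # Jack Sparrow loves New York!
```

Pieces without a leading `▁`, such as `row` and `s`, attach directly to the previous piece, so "Sparrow" and "loves" are restored as whole words.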


# **Tokenize the whole dataset**

Now let us write a function that tokenizes the whole dataset, converts the tokens to input IDs, and aligns the word-level labels with the resulting subword tokens.

In [None]:
# Define a function to obtain tokenization for feeding into NER classification model
def tokenize_and_align_labels(examples):
  # Convert this to input_ids and attention_mask
  tokenized_inputs_ids = xlmr_tokenizer(examples["tokens"], truncation = True, is_split_into_words = True)

  labels = []

  for idx, label in enumerate(examples["ner_tags"]):
    # Map every subword token back to the index of the word it came from
    word_ids = tokenized_inputs_ids.word_ids(batch_index = idx)
    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:
      # Special tokens (None) and non-first subwords get -100 so that the loss ignores them
      if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
      else:
        label_ids.append(label[word_idx])

      previous_word_idx = word_idx

    labels.append(label_ids)
  tokenized_inputs_ids["labels"] = labels
  return tokenized_inputs_ids

With the above function we can encode each split, so let us write a function that applies it to every split of a corpus.

In [None]:
def encode_panx_dataset(corpus):
  return corpus.map(tokenize_and_align_labels, batched = True, remove_columns = ['langs', 'ner_tags', 'tokens'] )

In [None]:
# By applying this function to DatasetDict object (train, test, validation), we'll obtain an encoded Dataset object per split.
panx_de_encoded = encode_panx_dataset(panx_ch["de"])
panx_fr_encoded = encode_panx_dataset(panx_ch["fr"])
panx_it_encoded = encode_panx_dataset(panx_ch["it"])
panx_en_encoded = encode_panx_dataset(panx_ch["en"])


In [None]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)} # tags.names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

In [None]:
tags.num_classes

7

# **Building Configuration**
The AutoConfig class holds the blueprint of the model's architecture. We'll load the configuration of the pretrained checkpoint with AutoConfig and override it with the number of labels (classes) and the mappings between label IDs and tag names.

In [None]:
from transformers import AutoConfig
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, num_labels = tags.num_classes, id2label = index2tag, label2id = tag2index)

In [None]:
#from transformers import XLMRobertaConfig
#xlmr_config = XLMRobertaConfig()
#print(xlmr_config)



---



# **Fine-Tune XLM-Roberta**

To fine-tune our model, I'll first fine-tune the base model on the German subset of PAN-X and then evaluate its zero-shot cross-lingual performance on French, Italian and English. For this, I'll use the Trainer class to handle the training loop. However, to build a Trainer I first need to instantiate a TrainingArguments object with all the required arguments.<br><br>The Trainer for our task requires the following:<br>

* Initialization parameters
* Training arguments
* Data collator
* Evaluation metrics

Let us build those below.

### **Initializing the model's parameters (loading the XLM-R model's weights)**

Here, we may need to train several models, so to avoid initializing a new model by hand each time, we'll later write a function to carry out the job.

Here, we'll load the XLMRobertaForTokenClassification pre-trained model to obtain all the model weights.

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
from transformers import XLMRobertaForTokenClassification
xlmr_model = (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config = xlmr_config).to(device))

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As a quick check that we have initialized the tokenizer and model correctly, let us test our prediction on our small text "Jack Sparrow loves New York!"

In [None]:
input_ids = xlmr_tokenizer.encode(text, return_tensors = "pt")


In [None]:
input_ids

tensor([[    0, 21763, 37456, 15555,  5161,     7,  2356,  5753,    38,     2]])

Let us make a dataframe with the tokens and input_ids.

In [None]:
pd.DataFrame({'tokens': xlmr_tokens, 'input_ids':input_ids[0].numpy()})

| | tokens | input_ids |
|---|---|---|
| 0 | `<s>` | 0 |
| 1 | ▁Jack | 21763 |
| 2 | ▁Spar | 37456 |
| 3 | row | 15555 |
| 4 | ▁love | 5161 |
| 5 | s | 7 |
| 6 | ▁New | 2356 |
| 7 | ▁York | 5753 |
| 8 | ! | 38 |
| 9 | `</s>` | 2 |


Here, we see that the start token `<s>` and the end token `</s>` are given the IDs 0 and 2 respectively.

In [None]:
out = xlmr_model(input_ids.to(device))

In [None]:
out

TokenClassifierOutput(loss=None, logits=tensor([[[-0.6063,  0.6641, -0.3277, -0.0939,  0.3452, -0.1897,  0.6844],
         [-0.5433,  0.4683, -0.1153, -0.7021, -0.0010,  0.1209,  0.8486],
         [-0.5783,  0.6500,  0.0046, -0.5492,  0.0562,  0.0711,  0.7966],
         [-0.5613,  0.5617, -0.0157, -0.6082,  0.1497,  0.1694,  0.9321],
         [-0.5418,  0.5781, -0.0269, -0.6018,  0.1070,  0.0899,  0.7714],
         [-0.4733,  0.5179,  0.0932, -0.6015,  0.1784,  0.2271,  0.6838],
         [-0.4439,  0.5719,  0.0111, -0.6726, -0.0106,  0.1545,  0.7259],
         [-0.4073,  0.5405, -0.0690, -0.6857,  0.0363,  0.1496,  0.8517],
         [-0.3866,  0.5769, -0.0286, -0.6792,  0.1679,  0.1651,  0.7283],
         [-0.6176,  0.6374, -0.3135, -0.1513,  0.3912, -0.1744,  0.7378]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)


Let us now pass the inputs to the model and extract the predictions by taking the argmax to get the most likely class per token.

In [None]:
outputs = out.logits

In [None]:
outputs

tensor([[[-0.6063,  0.6641, -0.3277, -0.0939,  0.3452, -0.1897,  0.6844],
         [-0.5433,  0.4683, -0.1153, -0.7021, -0.0010,  0.1209,  0.8486],
         [-0.5783,  0.6500,  0.0046, -0.5492,  0.0562,  0.0711,  0.7966],
         [-0.5613,  0.5617, -0.0157, -0.6082,  0.1497,  0.1694,  0.9321],
         [-0.5418,  0.5781, -0.0269, -0.6018,  0.1070,  0.0899,  0.7714],
         [-0.4733,  0.5179,  0.0932, -0.6015,  0.1784,  0.2271,  0.6838],
         [-0.4439,  0.5719,  0.0111, -0.6726, -0.0106,  0.1545,  0.7259],
         [-0.4073,  0.5405, -0.0690, -0.6857,  0.0363,  0.1496,  0.8517],
         [-0.3866,  0.5769, -0.0286, -0.6792,  0.1679,  0.1651,  0.7283],
         [-0.6176,  0.6374, -0.3135, -0.1513,  0.3912, -0.1744,  0.7378]]],
       grad_fn=<ViewBackward0>)

In [None]:
#preds = torch.argmax(outputs, dim = 2)

In [None]:
#preds

In [None]:
predictions = torch.argmax(outputs, dim = -1)
predictions

tensor([[6, 6, 6, 6, 6, 6, 6, 6, 6, 6]])

In [None]:
outputs.shape

torch.Size([1, 10, 7])

In [None]:
print(f"Number of tokens in the sequence: {len(xlmr_tokens)}")

Number of tokens in the sequence: 10


In [None]:
print(f"Number of predictions:{len(predictions[0])}")

Number of predictions:10


In [None]:
final_preds = [tags.names[p] for p in predictions[0].cpu().numpy()]

In [None]:
final_preds

['I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC',
 'I-LOC']
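
To see why every token gets the same tag, here is a small NumPy sketch of the argmax step, using the logits of the first two tokens copied from the output above. The softmax part is my addition for illustration; the notebook itself only takes the argmax.

```python
import numpy as np

tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Logits of the first two tokens, copied from the model output above
logits = np.array([
    [-0.6063, 0.6641, -0.3277, -0.0939, 0.3452, -0.1897, 0.6844],
    [-0.5433, 0.4683, -0.1153, -0.7021, -0.0010, 0.1209, 0.8486],
])

# Softmax turns each row of logits into a probability distribution over the 7 tags
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

# Argmax over the last axis picks the most likely tag for each token
pred_ids = logits.argmax(axis=-1)
print([tag_names[i] for i in pred_ids])  # ['I-LOC', 'I-LOC']
print(probs.max(axis=-1).round(2))
```

The winning probabilities stay well below 0.5, so the untrained head is not confident about I-LOC; it just happens to score slightly higher than the other tags.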

Very bad! The token classification head still has random weights, so every token gets the same tag. To test on text in other languages later, let us wrap these steps in a function.

In [None]:
def tag_text(text, tags, model, tokenizer):
  # Tokens (including special tokens) for display
  tokens = tokenizer(text).tokens()
  # Encode the text and move the IDs to the same device as the model
  input_ids = tokenizer(text, return_tensors = 'pt').input_ids.to(model.device)

  # Get prediction over 7 possible classes
  outputs = model(input_ids)[0]

  # Take argmax to obtain most likely class per token
  predictions = torch.argmax(outputs, dim = -1)

  # Convert to DataFrame
  preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
  return pd.DataFrame([tokens, preds], index = ["tokens", "Tags"])


Let us see how the first example of the German training set is encoded for the model.

In [None]:
words, labels = de_example['tokens'], de_example['ner_tags']

In [None]:
words

['als', 'Teil', 'der', 'Savoyer', 'Voralpen', 'im', 'Osten', '.']

In [None]:
labels

[0, 0, 0, 5, 6, 0, 0, 0]

Let us check the encoded output of the above row.

In [None]:
encoded = panx_de_encoded['train'][0]
encoded

{'input_ids': [0,
  737,
  16046,
  122,
  73829,
  8889,
  4855,
  289,
  2278,
  566,
  180,
  2581,
  6,
  5,
  2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 0, 0, 0, 5, -100, 6, -100, -100, 0, 0, -100, 0, -100, -100]}

In [None]:
final_tags = [index2tag[l] if l != -100 else 'IGN' for l in panx_de_encoded['train'][0]['labels']]

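Let us put it in a DataFrame. Note that the words row has one entry per word while the other rows have one entry per subword token, so the columns do not line up word by word: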

In [None]:
pd.DataFrame([words, encoded['input_ids'], encoded['attention_mask'], encoded['labels'], final_tags], index = ['words', 'input_ids', 'attention_mask', 'labels', 'tags'])

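| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| words | als | Teil | der | Savoyer | Voralpen | im | Osten | . | | | | | | | |
| input_ids | 0 | 737 | 16046 | 122 | 73829 | 8889 | 4855 | 289 | 2278 | 566 | 180 | 2581 | 6 | 5 | 2 |
| attention_mask | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| labels | -100 | 0 | 0 | 0 | 5 | -100 | 6 | -100 | -100 | 0 | 0 | -100 | 0 | -100 | -100 |
| tags | IGN | O | O | O | B-LOC | IGN | I-LOC | IGN | IGN | O | O | IGN | O | IGN | IGN |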
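
The -100 entries in the labels row come from the alignment rule in tokenize_and_align_labels: only the first subword of each word keeps the word's label. Here is a minimal pure-Python sketch of that rule. The `word_ids` list is my reconstruction from the labels row above (an assumption, since the notebook does not print it), and `align_labels` is a hypothetical helper.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword and mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a later subword of the same word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Assumption: word IDs inferred from the labels row above
word_ids = [None, 0, 1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7, 7, None]
word_labels = [0, 0, 0, 5, 6, 0, 0, 0]

print(align_labels(word_ids, word_labels))
# [-100, 0, 0, 0, 5, -100, 6, -100, -100, 0, 0, -100, 0, -100, -100]
```

Under this reconstruction, "Voralpen" (word 4) spans three subwords, so its I-LOC label appears once and is followed by two masked positions.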


Let us do the same for the first row of the English training set.

In [None]:
panx_ch['en']["train"][0]

{'tokens': ['R.H.',
  'Saunders',
  '(',
  'St.',
  'Lawrence',
  'River',
  ')',
  '(',
  '968',
  'MW',
  ')'],
 'ner_tags': [3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0],
 'langs': ['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en']}

In [None]:
encoded_en = panx_en_encoded['train'][0]
encoded_en

{'input_ids': [0,
  627,
  5,
  841,
  5,
  947,
  24658,
  7,
  15,
  2907,
  5,
  155484,
  32547,
  1388,
  15,
  483,
  16028,
  46298,
  1388,
  2],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [-100,
  3,
  -100,
  -100,
  -100,
  4,
  -100,
  -100,
  0,
  3,
  -100,
  4,
  4,
  0,
  0,
  0,
  -100,
  0,
  0,
  -100]}

In [None]:
final_tags_en = [index2tag[l] if l != -100 else 'IGN' for l in encoded_en['labels']]

In [None]:
pd.DataFrame([panx_ch['en']["train"][0]['tokens'], encoded_en['input_ids'], encoded_en['attention_mask'], encoded_en['labels'], final_tags_en], index = ['words', 'input_ids', 'attention_mask', 'labels', 'tags'])

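| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| words | R.H. | Saunders | ( | St. | Lawrence | River | ) | ( | 968 | MW | ) | | | | | | | | | |
| input_ids | 0 | 627 | 5 | 841 | 5 | 947 | 24658 | 7 | 15 | 2907 | 5 | 155484 | 32547 | 1388 | 15 | 483 | 16028 | 46298 | 1388 | 2 |
| attention_mask | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| labels | -100 | 3 | -100 | -100 | -100 | 4 | -100 | -100 | 0 | 3 | -100 | 4 | 4 | 0 | 0 | 0 | -100 | 0 | 0 | -100 |
| tags | IGN | B-ORG | IGN | IGN | IGN | I-ORG | IGN | IGN | O | B-ORG | IGN | I-ORG | I-ORG | O | O | O | IGN | O | O | IGN |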


As we saw earlier, the model with a randomly initialized classification head predicts poorly. Let us fine-tune it.

In [None]:
from transformers import XLMRobertaForTokenClassification
def model_init():
  return (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config = xlmr_config)).to(device)

In [None]:
from transformers import TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"])//batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de" # This will be our output directory
training_args = TrainingArguments(output_dir = model_name,
                                  log_level = "error",
                                  num_train_epochs = num_epochs,
                                  per_device_train_batch_size = batch_size,
                                  per_device_eval_batch_size = batch_size,
                                  eval_strategy = "epoch",
                                  save_steps = 1e6, # Set to a very high number so that no intermediate checkpoints are saved, which speeds up training
                                  weight_decay = 0.01,
                                  disable_tqdm = False,
                                  logging_steps = logging_steps,
                                  push_to_hub = False)

## **Data Collator**
With the data collator we can pad each input sequence (and its labels) to the longest sequence length in a batch. For this we'll use Hugging Face's DataCollatorForTokenClassification class.

In [None]:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer = xlmr_tokenizer, padding = True, label_pad_token_id = -100, return_tensors = 'pt')
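
To make the padding concrete, here is a minimal pure-Python sketch of what padding a batch to its longest sequence looks like. It is an illustration only and not the implementation of DataCollatorForTokenClassification; the helper `pad_batch`, the toy label values, and the default `pad_token_id=1` are my assumptions.

```python
def pad_batch(features, pad_token_id=1, label_pad_token_id=-100):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention 0 so the model does not attend to them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padded positions get -100 so the loss ignores them
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


# Two toy examples of different lengths (label values are made up)
features = [
    {"input_ids": [0, 21763, 2], "attention_mask": [1, 1, 1], "labels": [-100, 1, -100]},
    {"input_ids": [0, 2356, 5753, 38, 2], "attention_mask": [1, 1, 1, 1, 1], "labels": [-100, 5, 6, 0, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])       # [0, 21763, 2, 1, 1]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [-100, 1, -100, -100, -100]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.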

## **Evaluation metrics**

Since token classification is a classification problem, its performance is evaluated with precision, recall and F1-score. To carry out this job, we'll use the **seqeval** library.

In [None]:
!pip install seqeval



In [None]:
from seqeval.metrics import f1_score

As seqeval expects lists of lists of tag strings, we'll write a function that converts the predictions and label IDs into that format.

In [None]:
import numpy as np
def align_predictions(predictions, label_ids): # y_pred = predictions , y_true = label_ids
  preds = np.argmax(predictions, axis = 2)
  batch_size, seq_len = preds.shape

  labels_list, preds_list = [], []

  for batch_idx in range(batch_size):
    example_labels, example_preds = [], []
    for seq_idx in range(seq_len):
      # Ignore the label IDs = -100
      if label_ids[batch_idx,seq_idx] != -100:
        example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
        example_preds.append(index2tag[preds[batch_idx][seq_idx]])

    labels_list.append(example_labels)
    preds_list.append(example_preds)

  return preds_list, labels_list

In [None]:
# Define a function to compute metrics
def compute_metrics(eval_pred):
  y_pred, y_true = align_predictions(eval_pred.predictions, eval_pred.label_ids)
  return {"f1": f1_score(y_true, y_pred)}
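
To show what an entity-level F1-score measures, here is a simplified pure-Python sketch. It is not seqeval's implementation and may differ from it on malformed tag sequences (for example, an I- tag without a preceding B- tag is ignored here); the helpers `get_spans` and `entity_f1` are hypothetical.

```python
def get_spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB2 sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def entity_f1(y_true, y_pred):
    """F1 over exact span matches: type and boundaries must both agree."""
    true_spans, pred_spans = set(), set()
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(idx, *s) for s in get_spans(t)}
        pred_spans |= {(idx, *s) for s in get_spans(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = [["O", "B-LOC", "I-LOC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["O", "B-LOC", "O", "O"], ["B-PER", "I-PER", "O"]]
print(entity_f1(y_true, y_pred))  # 0.5
```

In the first sentence the predicted location is cut short, so it counts as both a false positive and a missed entity even though one of its tokens is tagged correctly. This is why entity-level F1 is stricter than token-level accuracy.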

# **Fine-tune with Trainer API**

In [None]:
from transformers import Trainer

trainer = Trainer(model_init = model_init,
                  args = training_args,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics,
                  train_dataset = panx_de_encoded["train"],
                  eval_dataset = panx_de_encoded["validation"],
                  processing_class = xlmr_tokenizer
                  )

In [None]:
trainer.train()





| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.2714 | 0.156047 | 0.822873 |
| 2 | 0.128 | 0.133228 | 0.855646 |
| 3 | 0.0833 | 0.13085 | 0.863116 |




TrainOutput(global_step=1575, training_loss=0.16080530109859648, metrics={'train_runtime': 19935.3548, 'train_samples_per_second': 1.893, 'train_steps_per_second': 0.079, 'total_flos': 854205461835792.0, 'train_loss': 0.16080530109859648, 'epoch': 3.0})
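
The global_step value can be checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these are stated explicitly in the notebook, although they match the defaults as far as I know.

```python
import math

num_train_examples = 12580  # German training rows
batch_size = 24
num_epochs = 3

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Same formula as in the TrainingArguments cell: roughly one log per epoch
logging_steps = num_train_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 525 1575 524
```

The total of 1575 steps agrees with the global_step reported in the TrainOutput above.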

Now, let us collect the trainer's log history into a DataFrame to see the training loss, validation loss and F1-score for each epoch.

In [None]:
import pandas as pd
df = pd.DataFrame(trainer.state.log_history)[['epoch','loss' ,'eval_loss', 'eval_f1']]
df = df.rename(columns={"epoch":"Epoch","loss": "Training Loss", "eval_loss": "Validation Loss", "eval_f1":"F1"})
df['Epoch'] = df["Epoch"].apply(lambda x: round(x))
df['Training Loss'] = df["Training Loss"].ffill()
df[['Validation Loss', 'F1']] = df[['Validation Loss', 'F1']].bfill().ffill()
df.drop_duplicates()

| | Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|---|
| 0 | 1 | 0.2714 | 0.156047 | 0.822873 |
| 2 | 2 | 0.128 | 0.133228 | 0.855646 |
| 4 | 3 | 0.0833 | 0.13085 | 0.863116 |


# **Cross-Lingual Transfer**

In [None]:
# A function to evaluate the metrics
def get_f1_score(trainer, dataset):
    return trainer.predict(dataset).metrics["test_f1"]

Evaluate on German test dataset.

In [None]:
f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = get_f1_score(trainer, panx_de_encoded["test"])
print(f"F1-score of [de] model on [de] dataset: {f1_scores['de']['de']:.3f}")



F1-score of [de] model on [de] dataset: 0.874


Let us see the fine-tuned model's performance on French, Italian and English.

In [None]:
# A function to evaluate on all test data

def evaluate_lang_performance(lang, trainer):
    panx_ds = encode_panx_dataset(panx_ch[lang])
    return get_f1_score(trainer, panx_ds["test"])

In [None]:
f1_scores["de"]["fr"] = evaluate_lang_performance("fr", trainer)
print(f"F1-score of [de] model on [fr] dataset: {f1_scores['de']['fr']:.3f}")

f1_scores["de"]["it"] = evaluate_lang_performance("it", trainer)
print(f"F1-score of [de] model on [it] dataset: {f1_scores['de']['it']:.3f}")

f1_scores["de"]["en"] = evaluate_lang_performance("en", trainer)
print(f"F1-score of [de] model on [en] dataset: {f1_scores['de']['en']:.3f}")



F1-score of [de] model on [fr] dataset: 0.721

F1-score of [de] model on [it] dataset: 0.655

F1-score of [de] model on [en] dataset: 0.595


**Observations**: So far, we fine-tuned our XLM-R model on the German dataset and observed an F1-score of about 87% on the German test set. On the other languages the zero-shot performance is more modest (about 72% on French, 66% on Italian and 60% on English), which is a concern that we need to address. <br>
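
To quantify the zero-shot gap, here is a small sketch that computes how far each language falls behind German, using the test F1-scores printed above.

```python
# Test F1-scores of the German-only model, copied from the outputs above
f1_de_model = {"de": 0.874, "fr": 0.721, "it": 0.655, "en": 0.595}

# Absolute drop in F1 relative to the German test set
drops = {
    lang: round(f1_de_model["de"] - f1, 3)
    for lang, f1 in f1_de_model.items()
    if lang != "de"
}
print(drops)
```

English shows the largest drop and French the smallest.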

To work towards this, let us first train our XLM-R model on a multilingual dataset built by concatenating the German, French, Italian and English corpora.

For simplicity, I'll keep the same hyperparameters as in the fine-tuning run on the German corpus, except for the logging_steps argument of TrainingArguments, which I adjust to account for the increased size of the multilingual training corpus.

To carry out this operation, I'll first concatenate the corpora of all four languages.

In [None]:
from datasets import concatenate_datasets

def concatenate_splits(corpora):
    multi_corpus = DatasetDict()
    for split in corpora[0].keys():
        multi_corpus[split] = concatenate_datasets(
            [corpus[split] for corpus in corpora]).shuffle(seed=42)
    return multi_corpus

In [None]:
panx_de_fr_it_en_encoded = concatenate_splits([panx_de_encoded, panx_fr_encoded, panx_it_encoded, panx_en_encoded])

In [None]:
panx_de_fr_it_en_encoded

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 20020
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10010
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10010
    })
})

In [None]:
# Integer number of steps between logs (about ten logs per epoch)
training_args.logging_steps = len(panx_de_fr_it_en_encoded["train"]) // (10 * batch_size)


In [None]:
training_args.output_dir = "xlm-roberta-base-finetuned-panx-de-fr-it-en"

trainer = Trainer(model_init = model_init,
                  args = training_args,
                  data_collator = data_collator,
                  compute_metrics = compute_metrics,
                  train_dataset = panx_de_fr_it_en_encoded["train"],
                  eval_dataset = panx_de_fr_it_en_encoded["validation"],
                  processing_class = xlmr_tokenizer)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


**The run stopped because of the daily compute limit. I need more compute to carry out the remaining fine-tuning.**