# Lab 4.3 Use fine-tuned crosslingual transformer models

Transformer models such as BERT and RoBERTa can easily be fine-tuned for downstream tasks. The Huggingface hub lists many models that are already fine-tuned for specific tasks, and new ones are published regularly: https://huggingface.co/models

In this notebook, we show two examples of fine-tuned models based on xlm-roberta. Because the underlying language model is cross-lingual, the fine-tuned model also works for all of the 100 languages that xlm-roberta supports.

## NLP pipelines for transformers

Huggingface transformers provides an option to create a **pipeline** to perform a specific NLP task with a pretrained model:

https://huggingface.co/docs/transformers/main_classes/pipelines

The pipelines are abstractions over specific tasks such as sentiment analysis and entity recognition. In the case of sentiment analysis, a representation of the complete sentence is taken as the input and classified with the defined labels. In BERT, the sentence information is packaged in the special ```[CLS]``` token, which was trained for next sentence prediction (RoBERTa-style models such as xlm-roberta use ```<s>``` in the same position). Alternatively, the token embeddings of a sentence can be averaged (mean pooling).
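
The two ways of obtaining a sentence representation can be sketched on a toy tensor. This is a minimal illustration and not the internal code of any pipeline: the values of `hidden` and `attention_mask` are made up, and the helper names `cls_pooling` and `mean_pooling` are hypothetical.

```python
import torch

# Toy "last hidden state": 2 sentences, 4 token positions, hidden size 3.
# Position 0 plays the role of the sentence-initial special token.
hidden = torch.tensor([
    [[1.0, 0.0, 0.0], [2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [9.0, 9.0, 9.0]],
    [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [3.0, 3.0, 3.0], [5.0, 5.0, 5.0]],
])
# 1 = real token, 0 = padding; the first sentence has one padded position.
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])


def cls_pooling(hidden_states):
    """Use the vector of the first token as the sentence representation."""
    return hidden_states[:, 0, :]


def mean_pooling(hidden_states, mask):
    """Average the token vectors, ignoring padded positions."""
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)         # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)            # (batch, 1)
    return summed / counts


cls_vectors = cls_pooling(hidden)
mean_vectors = mean_pooling(hidden, attention_mask)
print(cls_vectors.shape, mean_vectors.shape)
```

Both functions return one vector per sentence, which is what a text classification head takes as input. The mask matters for mean pooling: without it, padded positions would distort the average.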

In the case of entity recognition, each token in a sentence is classified separately in a sequence, i.e. a sequence labelling or token classification task. Whether a fine-tuned model can be used for a specific task depends on the way it was fine-tuned with labeled data, i.e. what type of classification head was added to the model.
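
The difference between the two types of classification head shows up in the shape of their output. The sketch below uses random vectors in place of real transformer output and a single linear layer as the head; actual heads in the library may contain additional layers, and all sizes here are arbitrary assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
batch, seq_len, hidden_size = 2, 5, 8
num_sentiment_labels, num_tag_labels = 5, 3

# Stand-in for the transformer output: one vector per token position.
hidden_states = torch.randn(batch, seq_len, hidden_size)

# Text classification: the head sees one vector per sentence
# (here the vector of the first token).
sentence_head = torch.nn.Linear(hidden_size, num_sentiment_labels)
sentence_logits = sentence_head(hidden_states[:, 0, :])

# Token classification: the same kind of layer is applied at every position.
token_head = torch.nn.Linear(hidden_size, num_tag_labels)
token_logits = token_head(hidden_states)

print(sentence_logits.shape)  # one row of label scores per sentence
print(token_logits.shape)     # one row of label scores per token
```

A checkpoint that was fine-tuned with one type of head does not contain trained weights for the other type, which is why the task chosen for the pipeline has to match the way the model was fine-tuned.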

In this notebook, we will demonstrate two differently fine-tuned models. We will use the ```sentiment-analysis``` pipeline to demonstrate text classification and the ```ner``` pipeline to demonstrate token classification. 

## Sentiment

In [7]:
from transformers import pipeline

# Load the classification pipeline with the specified model
sentiment_task = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")

# Classify a new sentence
sentence = "I love this product! It's amazing and works perfectly."
result = sentiment_task(sentence)

# Print the result
print(result)


Device set to use mps:0


[{'label': 'Very Positive', 'score': 0.5586308240890503}]


We search the Model Hub of Huggingface for a fine-tuned model for sentiment classification. This is one of the many we find that work for different languages:

https://huggingface.co/tabularisai/multilingual-sentiment-analysis

The model is exclusively trained on synthetic multilingual data generated by advanced LLMs.

Languages covered are: English plus Chinese (中文), Spanish (Español), Hindi (हिन्दी), Arabic (العربية), Bengali (বাংলা), Portuguese (Português), Russian (Русский), Japanese (日本語), German (Deutsch), Malay (Bahasa Melayu), Telugu (తెలుగు), Vietnamese (Tiếng Việt), Korean (한국어), French (Français), Turkish (Türkçe), Italian (Italiano), Polish (Polski), Ukrainian (Українська), Tagalog, Dutch (Nederlands), Swiss German (Schweizerdeutsch), and Swahili.

Sentiment analysis is modeled as text classification. This means that the label is associated with the text as a whole and not with individual tokens (as is done in token classification, also called sequence labelling). When using the model, we need to choose the same classification type as was set for training. Since we use the **pipeline** API to the transformer models, we need to select a pipeline task that matches the training settings. You can check the Huggingface model card and the training configuration of the model for the details.

In this case, this is easy: the pipeline task is called "text-classification", and "sentiment-analysis" is accepted as an alias for the same task. We thus initialise a sentiment_task instance with the **pipeline** constructor by giving it the task name and the name of the model on the Huggingface site.

Once the model is loaded, you can pass any text to the **sentiment_task** instance to get a prediction. You can try out any of the listed languages.

In [9]:
print(sentiment_task("What an awful movie!"))
print(sentiment_task("Wat een waardeloze film!"))

[{'label': 'Very Negative', 'score': 0.5752831697463989}]
[{'label': 'Very Positive', 'score': 0.48439693450927734}]


In the documentation of the model, you can see that the model can also be used without a pipeline. You then need to load the model and the tokenizer separately. When you give it a text, you first need to tokenise the text and feed the tokens to the model. The probabilities from the classification head are mapped through a sentiment_map dictionary to get a readable output.

The following code from the Huggingface site demonstrates this. We will not go further into the details of the parameters and the other functions you can call. We leave this to the machine learning course.

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "tabularisai/multilingual-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict_sentiment(texts):
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
    return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()]

texts = ["What an awful movie!", "Wat een waardeloze film!"]
for text, sentiment in zip(texts, predict_sentiment(texts)):
    print(f"Text: {text}\nSentiment: {sentiment}\n")


Text: What an awful movie!
Sentiment: Very Negative

Text: Wat een waardeloze film!
Sentiment: Very Positive



## Named Entity Recognition

Named-entity-recognition and classification (NERC) is typically conceived as a sequence labelling or token classification task. This means that every word (token) in the input text will receive a label as it occurs in a sequence. In the case of NERC, the labels mark each token in a text separately as either the beginning of a named-entity expression, being inside such an expression or being outside such an expression. This type of annotation is called **BIO** or **IOB** annotation, where B=beginning, I=inside and O=outside. In addition to the B and I tag, the type of entity is added as a suffix, e.g. B-PER=beginning of an expression that names a person, whereas B-LOC=beginning of an expression that names a location.
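
To make the BIO scheme concrete, the sketch below groups token-level tags into entity spans. The tags are written by hand for illustration and are not the output of a model; `bio_to_spans` is a hypothetical helper, not a function of the transformers library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span; close the open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Nader", "Jokhadar", "had", "given", "Syria", "the", "lead"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O"]  # hand-assigned
spans = bio_to_spans(tokens, tags)
print(spans)  # [('PER', 'Nader Jokhadar'), ('LOC', 'Syria')]
```

The B tag is what allows two adjacent entities of the same type to be kept apart: a second B-PER directly after a person name starts a new span instead of extending the previous one.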

Transformer models that are fine-tuned for NERC are typically set up for sequence labelling. To use such a model, we can create a pipeline for the task "ner" (an alias of "token-classification"). Always check the model card on Huggingface to see which pipeline task a model was designed for:

https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2


In [12]:
from transformers import pipeline
nerc_task = pipeline("ner", model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use mps:0


In [13]:
example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
nerc_results = nerc_task(example)
for result in nerc_results:
    print(result)

{'entity': 'LABEL_1', 'score': 0.5092719, 'index': 1, 'word': '▁Na', 'start': 0, 'end': 2}
{'entity': 'LABEL_1', 'score': 0.512619, 'index': 2, 'word': 'der', 'start': 2, 'end': 5}
{'entity': 'LABEL_0', 'score': 0.5670998, 'index': 3, 'word': '▁Jo', 'start': 6, 'end': 8}
{'entity': 'LABEL_1', 'score': 0.5187163, 'index': 4, 'word': 'kha', 'start': 8, 'end': 11}
{'entity': 'LABEL_1', 'score': 0.50178516, 'index': 5, 'word': 'dar', 'start': 11, 'end': 14}
{'entity': 'LABEL_1', 'score': 0.5449812, 'index': 6, 'word': '▁had', 'start': 15, 'end': 18}
{'entity': 'LABEL_1', 'score': 0.58548856, 'index': 7, 'word': '▁given', 'start': 19, 'end': 24}
{'entity': 'LABEL_1', 'score': 0.51082754, 'index': 8, 'word': '▁Syria', 'start': 25, 'end': 30}
{'entity': 'LABEL_1', 'score': 0.6047771, 'index': 9, 'word': '▁the', 'start': 31, 'end': 34}
{'entity': 'LABEL_1', 'score': 0.5604353, 'index': 10, 'word': '▁lead', 'start': 35, 'end': 39}
{'entity': 'LABEL_1', 'score': 0.59024274, 'index': 11, 'word': 

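Note that the tokens in the output do not correspond to the words of the input. Remember that XLM-RoBERTa covers 100 languages, and although its vocabulary has more than 250K items, this is by far not enough to represent all the words of these languages. The tokenizer of the model therefore breaks words down into smaller pieces in order to represent the complete sentence; the ```▁``` character marks a piece that starts a new word. Note also that the labels are generic (```LABEL_0```, ```LABEL_1```) and the scores stay close to 0.5: as the warning printed when loading the model states, the classification head of this checkpoint was newly initialised, so these predictions are not meaningful entity tags.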
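
The subword pieces can be glued back into words by using the ```▁``` character at the start of a piece as a word boundary. The sketch below does this for the first pieces of the output above; `merge_subwords` is a hypothetical helper written for this illustration.

```python
def merge_subwords(pieces, marker="\u2581"):
    """Rebuild words from subword pieces; `marker` flags a word-initial piece."""
    words = []
    for piece in pieces:
        if piece.startswith(marker) or not words:
            # A word-initial piece starts a new word (marker removed).
            words.append(piece.lstrip(marker))
        else:
            # Any other piece continues the previous word.
            words[-1] += piece
    return words


pieces = ["▁Na", "der", "▁Jo", "kha", "dar", "▁had", "▁given", "▁Syria"]
words = merge_subwords(pieces)
print(words)  # ['Nader', 'Jokhadar', 'had', 'given', 'Syria']
```

The same grouping is needed for the labels: a name such as Jokhadar is spread over three pieces, each with its own prediction, so the predictions have to be combined into one label per word.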

In [14]:
example = "Mark Rutte kondigt aan dat de VVD tech bedrijven zoals Google, Facebook en Apple zwaarder gaat belasten."
nerc_results = nerc_task(example)
for result in nerc_results:
    print(result)

{'entity': 'LABEL_0', 'score': 0.5145547, 'index': 1, 'word': '▁Mark', 'start': 0, 'end': 4}
{'entity': 'LABEL_0', 'score': 0.5238967, 'index': 2, 'word': '▁Rut', 'start': 5, 'end': 8}
{'entity': 'LABEL_0', 'score': 0.50275517, 'index': 3, 'word': 'te', 'start': 8, 'end': 10}
{'entity': 'LABEL_0', 'score': 0.54380727, 'index': 4, 'word': '▁kon', 'start': 11, 'end': 14}
{'entity': 'LABEL_0', 'score': 0.54584754, 'index': 5, 'word': 'digt', 'start': 14, 'end': 18}
{'entity': 'LABEL_1', 'score': 0.51212454, 'index': 6, 'word': '▁aan', 'start': 19, 'end': 22}
{'entity': 'LABEL_1', 'score': 0.565536, 'index': 7, 'word': '▁dat', 'start': 23, 'end': 26}
{'entity': 'LABEL_1', 'score': 0.52824455, 'index': 8, 'word': '▁de', 'start': 27, 'end': 29}
{'entity': 'LABEL_0', 'score': 0.5462488, 'index': 9, 'word': '▁VVD', 'start': 30, 'end': 33}
{'entity': 'LABEL_1', 'score': 0.5039486, 'index': 10, 'word': '▁tech', 'start': 34, 'end': 38}
{'entity': 'LABEL_1', 'score': 0.5374721, 'index': 11, 'word'

## End of notebook