# Lab4.3 How to use fine-tuned cross-lingual transformer models

Transformer models such as BERT and RoBERTa can easily be fine-tuned for downstream tasks. The Huggingface model hub lists many models that have already been fine-tuned for specific tasks. In this notebook, we show two examples of models fine-tuned from xlm-roberta. Because the underlying language model is cross-lingual, the fine-tuned models can also be applied to the 100 languages that xlm-roberta covers.

## Sentiment

We search on the Model Hub of Huggingface for a fine-tuned xlm-roberta model for sentiment classification, e.g.:

https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment

This model was trained on ~198M tweets and fine-tuned for sentiment analysis. According to the Huggingface model card text:

"The sentiment fine-tuning was done on 8 languages (Ar, En, Fr, De, Hi, It, Sp, Pt) but it can be used for more languages (see paper for details)."

    Barbieri, Francesco, Luis Espinosa Anke, and Jose Camacho-Collados. "Xlm-t: Multilingual language models in twitter for sentiment analysis and beyond." arXiv preprint arXiv:2104.12250 (2021).

The cross-lingual capabilities come from XLM-RoBERTa, which has a shared vocabulary for 100 languages and represents texts in any of these in a largely language-agnostic way. Fine-tuning it with examples in the 8 languages therefore also transfers, to a varying degree, to the other languages.

Sentiment analysis as modeled here is a form of text classification. This means that the label is associated with the text as a whole and not with individual tokens (the latter is token classification, also known as sequence labeling; note that the Huggingface library uses the term "sequence classification" for classifying a whole text). When training the model with examples, here tweets with a sentiment label, the type of classification needs to be set.
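
To make the difference between one label per text and one label per token concrete, here is a minimal sketch in plain Python. The scores are made-up numbers, not model output, the helper names `softmax` and `best_label` are hypothetical, and the label sets are assumptions that merely mirror the labels appearing in this notebook.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_label(logits, labels):
    """Return the most probable label together with its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]

# Text classification: a single vector of scores for the whole text.
sentiment_labels = ["Negative", "Neutral", "Positive"]  # assumed label set
text_logits = [2.5, 0.1, -1.0]                          # hypothetical values
text_label = best_label(text_logits, sentiment_labels)

# Token classification: one vector of scores for every token.
ner_labels = ["O", "B-PER", "B-LOC"]                    # assumed, simplified label set
tokens = ["Jokhadar", "scored", "for", "Syria"]
token_logits = [                                        # hypothetical values
    [0.1, 3.0, 0.2],
    [4.0, 0.1, 0.1],
    [4.0, 0.0, 0.2],
    [0.3, 0.2, 3.5],
]
token_labels = [best_label(row, ner_labels)[0] for row in token_logits]

print(text_label[0])                    # a single label for the text
print(list(zip(tokens, token_labels)))  # one label per token
```

The only structural difference is the shape of the output: one prediction for the text versus a list of predictions aligned with the tokens.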

When using the model, we need to select the same type of classification as was set for training. Since we will use the **pipeline** API to the transformer models, we need to select a pipeline name that matches the training settings. In this case, this is easy as the pipeline is also called "sentiment-analysis". We thus create a sentiment_task object by calling **pipeline** with the task name "sentiment-analysis" and the name of the model on the Huggingface site.

In [2]:
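# The xlm-roberta tokenizer depends on sentencepiece; its converter also requires protobuf.
#!pip install sentencepiece protobuf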

Collecting sentencepiece
  Using cached sentencepiece-0.1.99-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [1]:
from transformers import pipeline
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path)


ImportError: 
XLMRobertaConverter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


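Once the pipeline has loaded (the error above indicates that the protobuf library must be installed first), we can apply it to sentences in different languages, for example English and Dutch: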

In [2]:
print(sentiment_task("What an awful movie!"))
print(sentiment_task("Wat een waardeloze film!"))

[{'label': 'Negative', 'score': 0.9273003935813904}]
[{'label': 'Negative', 'score': 0.8501137495040894}]
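
Each call returns a list with one dictionary per input text, holding the predicted `label` and its `score`. The following is a small sketch of how such results could be post-processed. The two outputs above are copied in as literals so that the snippet runs without downloading the model; the helper name `confident_label` and the 0.9 threshold are illustrative assumptions, not part of the library.

```python
def confident_label(result, threshold=0.9):
    """Return the label of a single-text result, or None if its score is below the threshold."""
    top = result[0]  # one dictionary per input text; we passed a single text
    return top["label"] if top["score"] >= threshold else None

# Outputs copied from the cell above.
english = [{'label': 'Negative', 'score': 0.9273003935813904}]
dutch = [{'label': 'Negative', 'score': 0.8501137495040894}]

print(confident_label(english))     # Negative
print(confident_label(dutch))       # None, because 0.85 is below the 0.9 threshold
print(confident_label(dutch, 0.8))  # Negative
```

Note that the Dutch sentence receives the same label with a somewhat lower score than the English one.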


## Named Entity Recognition

In [3]:
from transformers import pipeline
# The "ner" pipeline loads the tokenizer and the token-classification model for us.
# Loading them explicitly would look like this:
#tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-base-ner-hrl")
#model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-base-ner-hrl")
nerc_task = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl")


In [4]:
example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
nerc_results = nerc_task(example)
for result in nerc_results:
    print(result)

{'entity': 'B-PER', 'score': 0.9998415, 'index': 1, 'word': '▁Na', 'start': 0, 'end': 2}
{'entity': 'I-PER', 'score': 0.88056284, 'index': 2, 'word': 'der', 'start': 2, 'end': 5}
{'entity': 'I-PER', 'score': 0.99981594, 'index': 3, 'word': '▁Jo', 'start': 5, 'end': 8}
{'entity': 'I-PER', 'score': 0.99980223, 'index': 4, 'word': 'kha', 'start': 8, 'end': 11}
{'entity': 'I-PER', 'score': 0.999753, 'index': 5, 'word': 'dar', 'start': 11, 'end': 14}
{'entity': 'B-LOC', 'score': 0.99962485, 'index': 8, 'word': '▁Syria', 'start': 24, 'end': 30}


In [5]:
example = "Mark Rutte kondigt aan dat de VVD tech bedrijven zoals Google, Facebook en Apple zwaarder gaat belasten."
nerc_results = nerc_task(example)
for result in nerc_results:
    print(result)

{'entity': 'B-PER', 'score': 0.9998753, 'index': 1, 'word': '▁Mark', 'start': 0, 'end': 4}
{'entity': 'I-PER', 'score': 0.99985516, 'index': 2, 'word': '▁Rut', 'start': 4, 'end': 8}
{'entity': 'I-PER', 'score': 0.9998762, 'index': 3, 'word': 'te', 'start': 8, 'end': 10}
{'entity': 'B-ORG', 'score': 0.999185, 'index': 9, 'word': '▁VVD', 'start': 29, 'end': 33}
{'entity': 'B-ORG', 'score': 0.99986625, 'index': 13, 'word': '▁Google', 'start': 54, 'end': 61}
{'entity': 'B-ORG', 'score': 0.99985904, 'index': 15, 'word': '▁Facebook', 'start': 62, 'end': 71}
{'entity': 'B-ORG', 'score': 0.9998469, 'index': 17, 'word': '▁Apple', 'start': 74, 'end': 80}
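
The output lists one entry per subword token ("▁Rut" and "te" rather than "Rutte"), so a name is spread over several lines. Below is a minimal sketch of how these tokens could be merged into full entity mentions using the `B-`/`I-` prefixes and the character offsets. The function name `merge_entities` is hypothetical and the token list is copied from the output above with the scores left out. Recent versions of the library can, as far as we know, do this grouping themselves through the `aggregation_strategy` argument of the pipeline; the sketch only illustrates the idea.

```python
def merge_entities(text, token_results):
    """Group B-/I- tagged subword tokens into (entity type, surface string) pairs."""
    spans = []  # each span is [entity_type, start_offset, end_offset]
    for tok in token_results:
        prefix, _, ent_type = tok["entity"].partition("-")
        continues = (
            prefix == "I"
            and bool(spans)
            and spans[-1][0] == ent_type
            # Allow a gap of one character in case the offset skips a space.
            and tok["start"] - spans[-1][2] <= 1
        )
        if continues:
            spans[-1][2] = tok["end"]  # extend the current entity
        else:
            spans.append([ent_type, tok["start"], tok["end"]])
    # The offsets may include a leading space, so strip the surface string.
    return [(ent_type, text[start:end].strip()) for ent_type, start, end in spans]

example = "Mark Rutte kondigt aan dat de VVD tech bedrijven zoals Google, Facebook en Apple zwaarder gaat belasten."
tokens = [
    {'entity': 'B-PER', 'word': '▁Mark', 'start': 0, 'end': 4},
    {'entity': 'I-PER', 'word': '▁Rut', 'start': 4, 'end': 8},
    {'entity': 'I-PER', 'word': 'te', 'start': 8, 'end': 10},
    {'entity': 'B-ORG', 'word': '▁VVD', 'start': 29, 'end': 33},
    {'entity': 'B-ORG', 'word': '▁Google', 'start': 54, 'end': 61},
    {'entity': 'B-ORG', 'word': '▁Facebook', 'start': 62, 'end': 71},
    {'entity': 'B-ORG', 'word': '▁Apple', 'start': 74, 'end': 80},
]

for entity in merge_entities(example, tokens):
    print(entity)
```

For the Dutch sentence this yields one person ("Mark Rutte") and four organisations ("VVD", "Google", "Facebook", "Apple").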


## End of notebook