# Lab 4.3 Use fine-tuned crosslingual transformer models

Transformer models such as BERT and RoBERTa can be fine-tuned for downstream tasks. The Huggingface hub lists many models that have already been fine-tuned for specific tasks, and new ones are published regularly: https://huggingface.co/models

In this notebook, we show two examples of models fine-tuned from xlm-roberta. Because the underlying language model is cross-lingual, the fine-tuned models can also be applied to the 100 languages that xlm-roberta supports.

## NLP pipelines for transformers

Huggingface transformers provides the option to create a **pipeline** that performs a specific NLP task with a pretrained model:

https://huggingface.co/docs/transformers/main_classes/pipelines

The pipelines are abstractions over specific tasks such as sentiment analysis and entity recognition. In the case of sentiment analysis, a representation of the complete sentence is taken as the input and classified with the defined labels. In BERT, the sentence information is packaged in the special ```[CLS]``` token, which was trained for next sentence prediction; RoBERTa-style models such as xlm-roberta use the sentence-initial ```<s>``` token in the same position. Alternatively, the token embeddings of a sentence can be averaged (mean pooling).

In the case of entity recognition, each token in a sentence is classified separately in a sequence, i.e. a sequence labelling or token classification task. Whether a fine-tuned model can be used for a specific task depends on the way it was fine-tuned with labeled data, i.e. what type of classification head was added to the model.
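The two ways of obtaining a sentence representation mentioned above can be sketched in a few lines. This is a minimal illustration on made-up 2-dimensional vectors, not the code that the pipeline runs internally; the function names and the toy values are assumptions introduced for this example.

```python
def first_token_pooling(token_vectors):
    """Use the vector of the first (special) token as the sentence vector."""
    return list(token_vectors[0])


def mean_pooling(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask masks out every token")
    return [total / count for total in totals]


# Toy vectors (hypothetical values): [special token, "great", "movie", padding]
token_vectors = [[1.0, 0.0], [2.0, 4.0], [3.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(first_token_pooling(token_vectors))           # vector of the first token
print(mean_pooling(token_vectors, attention_mask))  # average of the 3 real tokens
```

Either vector can then be fed to a classification head. The padding position is excluded from the average so that sentences of different lengths in one batch do not distort each other's representation.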

In this notebook, we will demonstrate two differently fine-tuned models. We will use the ```sentiment-analysis``` pipeline to demonstrate text classification and the ```ner``` pipeline to demonstrate token classification. 

## Sentiment

We search on the Model Hub of Huggingface for a fine-tuned xlm-roberta model for sentiment classification, e.g.:

https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment

This model was trained on ~198M tweets and finetuned for sentiment analysis. According to the Huggingface model card text:

"The sentiment fine-tuning was done on 8 languages (Ar, En, Fr, De, Hi, It, Sp, Pt) but it can be used for more languages (see paper for details)."

    Barbieri, Francesco, Luis Espinosa Anke, and Jose Camacho-Collados. "Xlm-t: Multilingual language models in twitter for sentiment analysis and beyond." arXiv preprint arXiv:2104.12250 (2021).

The cross-lingual capabilities come from XLM-RoBERTa, which has a shared vocabulary for 100 languages and represents texts in any of them in a largely language-agnostic way. As a result, fine-tuning it with examples in the 8 languages is expected to transfer to the other languages as well, although the quality can vary per language.

Sentiment analysis is modeled as text classification. This means that the label is associated with a text as a whole and not with individual tokens (as is done for sequence labelling). When using the model, we need to choose the same classification type as was set for training. Since we will use the **pipeline** API to the transformer models, we need to select a pipeline name that matches the training settings. You can consult the Huggingface model card and the training configuration of the model for the details.

In this case, this is easy because the pipeline task is also called "sentiment-analysis". We thus initialise a sentiment_task object with the **pipeline** constructor by giving it the task name "sentiment-analysis" and the name of the model on the Huggingface site.
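To make "one label for the text as a whole" concrete, the sketch below turns a single vector of class scores (logits) into one label with a probability. The label names and the logit values are assumptions made up for illustration; the real label set is defined in the model's configuration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the most probable label for the whole text."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


labels = ["negative", "neutral", "positive"]  # assumed label set
logits = [2.5, 0.3, -1.2]                     # hypothetical scores for one text

result = classify(logits, labels)
print(result)
```

The returned dictionary uses the same ```label``` and ```score``` keys as the dictionaries that a text classification pipeline returns for each input text.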

In [1]:
# Uncomment to install the tokenizer dependencies of xlm-roberta
#!pip install sentencepiece protobuf

In [2]:
from transformers import pipeline
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path)

If this cell raises an ```ImportError``` reporting that ```XLMRobertaConverter``` requires the protobuf library, install it (for example with ```pip install protobuf```) and run the cell again; a kernel restart may be needed.

Once the model is loaded, you can pass any text to the **sentiment_task** instance to get a prediction. You can try out any of the 100 languages of XLM-RoBERTa.

In [None]:
print(sentiment_task("What an awful movie!"))
print(sentiment_task("Wat een waardeloze film!"))

## Named Entity Recognition

Named-entity-recognition and classification (NERC) is typically conceived as a sequence labelling or token classification task. This means that every word (token) in the input text will receive a label as it occurs in a sequence. In the case of NERC, the labels mark each token in a text separately as either the beginning of a named-entity expression, being inside such an expression or being outside such an expression. This type of annotation is called **BIO** or **IOB** annotation, where B=beginning, I=inside and O=outside. In addition to the B and I tag, the type of entity is added as a suffix, e.g. B-PER=beginning of an expression that names a person, whereas B-LOC=beginning of an expression that names a location.
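The BIO scheme described above can be decoded back into entity expressions with a small helper. This is a minimal sketch on a hand-labeled example; the function name ```bio_to_spans``` and the tag sequence are assumptions for illustration and are not output of the model.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-X / I-X into (text, type) entity spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A B tag, or an I tag with a different type, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if prefix == "O" or starts_new:
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:  # flush an entity that ends the sentence
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-labeled example (hypothetical tags)
tokens = ["Nader", "Jokhadar", "had", "given", "Syria", "the", "lead"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O"]

print(bio_to_spans(tokens, tags))
```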

Transformer models that are fine-tuned for NERC are typically set up for sequence labelling. To use such a model, we can create a pipeline for the task "ner". Always check the model card on Huggingface to see which pipeline task the model was designed for.

In [3]:
from transformers import pipeline
nerc_task = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl")


In [None]:
example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
nerc_results = nerc_task(example)
for result in nerc_results:
    print(result)

Note that the tokens in the output do not always correspond to the words of the input. XLM-RoBERTa covers 100 languages, and although its vocabulary has more than 250K items, this is far from enough to represent every word of these languages. Therefore, the tokenizer of the model breaks words it does not know into smaller pieces that are in the vocabulary, so that the complete sentence can still be represented.
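The subword pieces can be glued back into words. The sketch below assumes the SentencePiece convention used by the XLM-RoBERTa tokenizer, in which a piece that starts a new word is prefixed with the marker "▁"; the pieces in the example are made up and the actual segmentation produced by the model may differ.

```python
WORD_START = "\u2581"  # SentencePiece marker "▁" placed before a word-initial piece


def merge_pieces(pieces):
    """Join subword pieces into words, starting a new word at each marker."""
    words = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append(piece.lstrip(WORD_START))
        else:
            words[-1] += piece  # continuation of the previous word
    return words


# Hypothetical segmentation of the start of the example sentence
pieces = [
    WORD_START + "Na", "der",
    WORD_START + "Jo", "kha", "dar",
    WORD_START + "had",
    WORD_START + "given",
    WORD_START + "Syria",
]

print(merge_pieces(pieces))
```

The token classification pipeline can also do this grouping itself when it is created with the ```aggregation_strategy``` parameter, for example ```aggregation_strategy="simple"```; check the pipeline documentation for the strategies available in your version of transformers.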

In [None]:
example = "Mark Rutte kondigt aan dat de VVD tech bedrijven zoals Google, Facebook en Apple zwaarder gaat belasten."
nerc_results = nerc_task(example)
for result in nerc_results:
    print(result)

## End of notebook