# Fine-tune a multilingual transformer model for sentence classification

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-11-29 |

This notebook shows how to use the Hugging Face 🤗 `transformers` library to train a Transformer-based sentence classifier with transfer learning (i.e., "fine-tuning").

**_Source:_** The notebook is adapted from the one distributed with this tutorial: https://huggingface.co/docs/transformers/tasks/sequence_classification

<a target="_blank" href="https://colab.research.google.com/github/fabiennelind/Going-Cross-Lingual_Course/blob/main/code/transformer_finetuning.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Supervised text classification

Text classification means assigning a label or class to each text in a corpus.
It is a common NLP and computational text analysis task.

For example, a popular text classification task is *sentiment analysis*, which assigns a label like 'positive', 'negative', or 'neutral' to a sequence of text, such as a sentence or paragraph.

### Ingredients

Here is what you need for training a supervised text classifier through fine-tuning (i.e. transfer learning):

- a **pre-trained model** you can **fine-tune** for ["sequence classification"](https://huggingface.co/docs/transformers/tasks/sequence_classification)
    - the pre-trained tokenizer comes with the model
- a pre-defined set of **label classes** (e.g., 'positive', 'neutral', 'negative')
- a **labeled dataset**, i.e., a corpus of texts (e.g., sentences) in which each document/text has been assigned to a *single* label class
    - we will split this dataset into train, development, and test sets
- some **metric to quantify classification performance** so that we know how well our classifier is doing

### The model fine-tuning approach

Fine-tuning means taking a **pre-trained model** and training it further on a specific task.
That is, instead of training a model from scratch on a small dataset, you use a pre-trained model as a starting point and adapt it to your specific classification task.

In the context of text classification, we use pre-trained **Transformers** for fine-tuning.
Transformers are deep learning models designed to handle sequential data, making them well-suited for natural language processing tasks.
Models like **BERT** (*Bidirectional Encoder Representations from Transformers*) or **GPT** (*Generative Pre-trained Transformer*) are pre-trained on massive datasets in a self-supervised fashion to learn general language patterns.
This allows them to learn good text representations ("embeddings").

During fine-tuning, we adapt the pre-trained model's parameters to perform the new task.
The idea behind fine-tuning is to transfer the knowledge gained during pre-training to a specific task.
This is called *transfer learning*.
Transfer learning has two related benefits:

1. We can leverage the pre-trained model's language representation capabilities to perform our classification task.
2. If the model generates useful text representations, we can train a well-performing classifier with relatively little labeled data (compared to training a model from scratch).


## This notebook

In this notebook, we will use

1. the [PimPo](https://manifesto-project.wzb.eu/information/documents/pimpo) dataset contributed by Pola Lehmann and Malisa Zobel ([2018](https://doi.org/10.1111/1475-6765.12266)), which records party manifesto sentences annotated for stance on the issues of immigration and integration, and
2. the multilingual [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) transformer model (Conneau *et al.*, [2019](http://arxiv.org/abs/1911.02116)), which we fine-tune on this dataset.

In [None]:
MODEL_NAME = 'xlm-roberta-base'

## Setup

If you run this notebook on Google Colab or you have not yet installed the `transformers` and `datasets` python libraries, you need to do so first:

In [None]:
# check if on colab
try:
    import google.colab
    COLAB = True
except ImportError:
    COLAB = False
print('on colab:', COLAB)

on colab: True


In [None]:
%%capture
if COLAB:
    !pip install accelerate==0.24.1 sentencepiece==0.1.99 tokenizers==0.14.1 datasets==2.14.6 transformers==4.35.0 easyNMT==2.0.0 iso639==0.1.4

In [None]:
import os
import shutil

import pandas as pd
import numpy as np

# for train/dev/test splitting
from sklearn.model_selection import train_test_split

# dataset loading
from datasets import Dataset, DatasetDict

# fine-tuning
from transformers import (
    # used to tokenize text
    AutoTokenizer,
    # used to load the pre-trained model
    AutoModelForSequenceClassification,
    # used to finetune the pre-trained model
    Trainer, TrainingArguments, DataCollatorWithPadding,
    # for reproducibility
    set_seed
)

# evaluation
from sklearn.metrics import precision_recall_fscore_support

### Using the GPU or M1 (if available)

To make training as fast as possible, you want to utilize GPU computing.
When you run notebooks on Colab, you can enable GPU computing by

1. clicking on "Runtime" in the menu,
2. selecting "Change runtime type", and
3. choosing "GPU" in the "Hardware accelerator" section of the pop-up.

If you are running this notebook elsewhere, you need to determine what kind of device you have access to:

- with a GPU &rarr; "cuda"
- with an Apple silicon chip (M1/M2) on macOS &rarr; "mps"
- else "cpu"

We do so like this:

In [None]:
import torch
# check if GPU or MPS is available, else use CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
device = torch.device(device)
device

device(type='cuda')

### Reproducibility

In programming, randomness is governed by *Random Number Generator* (RNG) algorithms.
You can control randomness by setting a so-called ["seed"](https://towardsdatascience.com/random-seeds-and-reproducibility-933da79446e3) that determines an RNG's initial state.
By setting a seed at the beginning of your script, each random value you generate (e.g., like [this](https://realpython.com/numpy-random-number-generator/)) will be the same at each run of your script, as long as you run all the code in the script in the same order (e.g., cell by cell, from top to bottom).

**_Important:_** If you set the seed but program interactively, the order in which you call individual code chunks will vary from session to session.
So setting a seed at the beginning of the script makes its execution reproducible only for runs from top to bottom (without user interaction in between).
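
To make the point about execution order concrete, here is a minimal sketch that uses only Python's built-in `random` module. The two "cell" functions and the `run` helper are hypothetical stand-ins for notebook cells that draw from a shared generator; they are not part of this notebook's pipeline.

```python
import random

def cell_a(rng):
    # hypothetical notebook cell that draws one random float
    return rng.random()

def cell_b(rng):
    # hypothetical notebook cell that draws one random integer
    return rng.randint(0, 1_000_000)

def run(order, seed=1234):
    """Seed a fresh generator, then execute the 'cells' in the given order."""
    rng = random.Random(seed)
    return {cell.__name__: cell(rng) for cell in order}

first = run([cell_a, cell_b])
second = run([cell_a, cell_b])   # same seed, same order
swapped = run([cell_b, cell_a])  # same seed, different order

print(first == second)  # True: same seed and same order reproduce every value
print(first, swapped)   # compare the values drawn when the order is swapped
```

With the same seed and the same order, every value is reproduced. After swapping, each cell consumes a different part of the generator's stream, so its values will generally change.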

The packages you will use for randomizing computations have notes on the topic of reproducibility you should read:

- `random`: https://docs.python.org/3/library/random.html#notes-on-reproducibility
- `numpy`: https://numpy.org/doc/stable/reference/random/generator.html (but read also [here](https://builtin.com/data-science/numpy-random-seed), [here](https://albertcthomas.github.io/good-practices-random-number-generators/), and [here](https://stackoverflow.com/a/5837352))
- `pandas`: random number generation inside `pandas` code uses `numpy` under the hood.
- `torch` and all packages using it (e.g., `transformers`): https://pytorch.org/docs/stable/notes/randomness.html
- `transformers`: introduced [here](https://github.com/huggingface/transformers/pull/16907)

In [None]:
SEED = 1234
set_seed(SEED) # <== sets random, numpy, and torch seeds

**_Important:_** When fine-tuning with the `Trainer` class from `transformers` (as we do below), you also need to set the following arguments in your call to `TrainingArguments()`:

```python
TrainingArguments(
    ...
    # ensure reproducibility
    full_determinism = True,
    seed = SEED,
    data_seed = SEED,
    ...
)
```

## Looking under the hood of multilingual Transformers

Let's take sentences in four different languages that have the same meaning and that all contain a word for "government":

In [None]:
# sentences in different languages with the same meanings and content
sentences = [
    "The government should care more about voters' concerns.",
    "Die Regierung sollte sich mehr um die Anliegen der Wähler kümmern.",
    "La gouvernement devrait s'occuper davantage des préoccupations des électeurs.",
    "Los gobiernos deberían preocuparse más por las preocupaciones de los votantes.",
]

Now, let's tokenize them:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
toks = tokenizer(sentences, padding=True, return_tensors='pt')
for input_ids in toks['input_ids']:
    print([(i, tokenizer.decode(tok)) for i, tok in enumerate(input_ids)])

[(0, '<s>'), (1, 'The'), (2, 'government'), (3, 'should'), (4, 'care'), (5, 'more'), (6, 'about'), (7, 'voter'), (8, 's'), (9, "'"), (10, 'concern'), (11, 's'), (12, '.'), (13, '</s>'), (14, '<pad>'), (15, '<pad>'), (16, '<pad>'), (17, '<pad>'), (18, '<pad>')]
[(0, '<s>'), (1, 'Die'), (2, 'Regierung'), (3, 'sollte'), (4, 'sich'), (5, 'mehr'), (6, 'um'), (7, 'die'), (8, 'Anliegen'), (9, 'der'), (10, 'Wähle'), (11, 'r'), (12, 'kümmern'), (13, '.'), (14, '</s>'), (15, '<pad>'), (16, '<pad>'), (17, '<pad>'), (18, '<pad>')]
[(0, '<s>'), (1, 'La'), (2, 'gouvernement'), (3, 'devrait'), (4, 's'), (5, "'"), (6, 'occuper'), (7, 'davantage'), (8, 'des'), (9, 'pré'), (10, 'occupation'), (11, 's'), (12, 'des'), (13, 'é'), (14, 'lec'), (15, 'teurs'), (16, '.'), (17, '</s>'), (18, '<pad>')]
[(0, '<s>'), (1, 'Los'), (2, 'gobierno'), (3, 's'), (4, 'debería'), (5, 'n'), (6, 'preocupar'), (7, 'se'), (8, 'más'), (9, 'por'), (10, 'las'), (11, 'preocupa'), (12, 'ciones'), (13, 'de'), (14, 'los'), (15, 'vot'

The term referring to "government" is always the third token in each sequence (index 2).

Let's feed the sequences through the pre-trained transformer and see the magic:

In [None]:
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained(MODEL_NAME)

with torch.no_grad():
    output = model(**toks, output_hidden_states=True)

We will see how the cross-lingual similarity of the embedding representation of the government tokens changes as we pass from the embedding layer to the final output layer of the model.

In [None]:
# we'll use the cosine similarity measure
from sklearn.metrics.pairwise import cosine_similarity
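
For intuition about what this measure captures, here is a minimal NumPy sketch of pairwise cosine similarity on toy vectors. The helper `cosine_similarity_matrix` and the toy data are illustrative assumptions, not scikit-learn's implementation; for inputs without all-zero rows it should agree with `cosine_similarity` up to floating-point error.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X (assumes no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    # scale every row to unit length, then take all pairwise dot products
    unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T

toy_vectors = np.array([
    [1.0, 0.0],   # reference direction
    [2.0, 0.0],   # same direction, different length
    [0.0, 3.0],   # orthogonal direction
    [1.0, 1.0],   # 45 degrees away from the reference
])
sims = cosine_similarity_matrix(toy_vectors)
print(sims.round(3))
```

The measure depends only on direction: the first two vectors get a similarity of 1 despite their different lengths, while orthogonal vectors get 0.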

As shown in the cell below, the embedding representations of the "government" tokens are still quite different across languages:

In [None]:
# similarity of input embeddings
cosine_similarity(output.hidden_states[0][:,2,:]).round(3)

array([[1.   , 0.385, 0.516, 0.5  ],
       [0.385, 1.   , 0.418, 0.369],
       [0.516, 0.418, 1.   , 0.532],
       [0.5  , 0.369, 0.532, 1.   ]], dtype=float32)

But this gets better as the inputs are fed deeper into the model:

In [None]:
# similarity of hidden states after layer 5 (index 0 holds the input embeddings)
cosine_similarity(output.hidden_states[5][:,2,:]).round(3)

array([[1.   , 0.864, 0.902, 0.877],
       [0.864, 1.   , 0.882, 0.856],
       [0.902, 0.882, 1.   , 0.895],
       [0.877, 0.856, 0.895, 1.   ]], dtype=float32)

In [None]:
# similarity of hidden states after layer 9 (index 0 holds the input embeddings)
cosine_similarity(output.hidden_states[9][:,2,:]).round(3)

array([[1.   , 0.929, 0.94 , 0.927],
       [0.929, 1.   , 0.934, 0.916],
       [0.94 , 0.934, 1.   , 0.923],
       [0.927, 0.916, 0.923, 1.   ]], dtype=float32)

And at the final layer, whose output serves as input to, for example, a classifier, the tokens' representations are almost identical:

In [None]:
# similarity of last hidden states
cosine_similarity(output.last_hidden_state[:, 2, :]).round(3)

array([[1.   , 0.986, 0.991, 0.991],
       [0.986, 1.   , 0.987, 0.984],
       [0.991, 0.987, 1.   , 0.989],
       [0.991, 0.984, 0.989, 1.   ]], dtype=float32)

So by feeding the input sequences through successive layers, the model is able to "recognize" that, despite different symbols, the tokens "government," "Regierung," "gouvernement," and "gobierno" all refer to a semantically equivalent concept.
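
The trend across layers can be summarized in one number per layer. The sketch below averages the six off-diagonal similarities (one per language pair) copied from the matrices printed above; the dictionary keys are just descriptive labels chosen for this illustration.

```python
# upper-triangle entries (one per language pair) copied from the matrices above
pairwise = {
    "input embeddings (hidden_states[0])": [0.385, 0.516, 0.5, 0.418, 0.369, 0.532],
    "hidden_states[5]": [0.864, 0.902, 0.877, 0.882, 0.856, 0.895],
    "hidden_states[9]": [0.929, 0.94, 0.927, 0.934, 0.916, 0.923],
    "last hidden state": [0.986, 0.991, 0.991, 0.987, 0.984, 0.989],
}

# average cross-lingual similarity of the "government" token at each depth
mean_similarity = {name: sum(vals) / len(vals) for name, vals in pairwise.items()}
for name, value in mean_similarity.items():
    print(f"{name}: {value:.3f}")
```

The average rises from roughly 0.45 at the input embeddings to roughly 0.99 at the last hidden state.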

## Preparing the data

### Load the dataset

In [None]:
fp = 'https://raw.githubusercontent.com/fabiennelind/Going-Cross-Lingual_Course/main/data/lehmann%2Bzobel_2018_pimpo_positions.tsv'
df = pd.read_csv(fp, sep='\t')

**_Note:_** There is also a 'lehmann+zobel_2018_pimpo_positions**_translated**.tsv' data file on GitHub that contains machine translations in addition to the source texts.

### Inspect and subset the data

Let's cross-tabulate the two relevant indicator columns in this dataset.

In [None]:
df[['position', 'issue']].value_counts(dropna=False).unstack()

| position | immigration | integration |
|:-------- | -----------:| -----------:|
| neutral | 578 | 390 |
| sceptical | 1360 | 776 |
| supportive | 1955 | 3886 |


**_Note:_** In the original data, 'position' is only coded for sentences with 'issue' "immigration" or "integration".
We have discarded all sentences where 'issue' was coded as "other".

A problem with the 'position' indicator is that the "neutral" category is comparatively rare: as the next cell shows, it accounts for only 968 of the 8,945 coded sentences (about 11%).

In [None]:
df.value_counts('position', dropna=False)

position
supportive    5841
sceptical     2136
neutral        968
dtype: int64

We'll thus discard "neutral" sentences to make the classification task simpler:

In [None]:
df = df[df.position.isin(['supportive', 'sceptical'])]

Now we can tabulate the languages covered by the remaining data:

In [None]:
df[['country_iso3c', 'lang']].value_counts()

country_iso3c  lang
DEU            deu     1814
NOR            nor     1273
NLD            nld     1228
AUT            deu      861
ESP            spa      814
DNK            dan      482
CHE            deu      351
SWE            swe      336
FIN            fin      261
IRL            eng      188
AUS            eng      139
USA            eng      139
CAN            fra       73
NZL            eng       10
FIN            swe        8
dtype: int64

## Prepare training

### Prepare the train, dev, and test splits

Because machine learning models need numbers as inputs, we need to convert our position labels into numeric indicators.

To this end, we first create a "mapping" `id2label` from label number to label name (and *vice versa*):

In [None]:
id2label = {0: 'sceptical', 1: 'supportive'}
# alternatively: `id2label = dict(enumerate(df.position.unique()))`
label2id = {v: k for k, v in id2label.items()}
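
As a quick illustration of how the two mappings work together, the following sketch encodes a few label names as numbers and decodes them again. The `positions` list is toy data made up for this example.

```python
id2label = {0: 'sceptical', 1: 'supportive'}
label2id = {name: idx for idx, name in id2label.items()}

positions = ['supportive', 'sceptical', 'supportive']  # toy annotations
encoded = [label2id[p] for p in positions]   # label names -> numeric indicators
decoded = [id2label[i] for i in encoded]     # numeric indicators -> label names

print(encoded)               # [1, 0, 1]
print(decoded == positions)  # True: the round trip recovers the original names
```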

We then use this mapping to create a numeric label indicator column:

In [None]:
# encode label indicator
# (work on an explicit copy: `df` is a filtered slice, and assigning a new
#  column to it directly triggers pandas' SettingWithCopyWarning)
df = df.copy()
df['label'] = df['position'].map(label2id)
df[['label', 'position']].value_counts()


label  position  
1      supportive    5841
0      sceptical     2136
dtype: int64

Now we can subset the data frame to only the columns we'll need for finetuning and evaluation:

In [None]:
cols = ['text', 'label', 'lang']
df = df[cols].reset_index(drop=True)

Next, we need to split the dataset into three **splits**:

- a training set ("train") used for finetuning the pre-trained model's parameters with our data and classification task
- a development set ("dev") used to monitor our progress during finetuning
- a test set ("test") we use to estimate the final classifier's out-of-sample performance (completely held-out during finetuning!)

We set aside 10% of sentences for the development and 15% of sentences for the test set. The rest is used for training.

In [None]:
# use 15% of the data for the test set and 10% for the dev set
n = len(df)

test_size = 0.15
n_test = int(n*test_size)

dev_size = 0.10
n_dev = int(n*dev_size)

tmp, test_idxs = train_test_split(range(n), test_size=n_test, random_state=SEED, stratify=df.label)
train_idxs, dev_idxs = train_test_split(tmp, test_size=n_dev, random_state=SEED, stratify=df.label.iloc[tmp])
del tmp

# show the number of examples in each split
len(train_idxs), len(dev_idxs), len(test_idxs)

(5984, 797, 1196)
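
To illustrate what stratification does, here is a minimal standard-library sketch that splits indices class by class so that both parts keep roughly the same label shares. The helper `stratified_split` is hypothetical and is not how `train_test_split` is implemented; it only uses the class counts reported above (2136 "sceptical", 5841 "supportive") as toy input.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac, seed):
    """Hypothetical helper: split indices so each class keeps about the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for y, idxs in sorted(by_class.items()):
        rng.shuffle(idxs)                 # shuffle within the class
        k = round(len(idxs) * test_frac)  # number of test examples for this class
        test.extend(idxs[:k])
        train.extend(idxs[k:])
    return sorted(train), sorted(test)

labels = [0] * 2136 + [1] * 5841  # class counts taken from the tabulation above
train_idxs, test_idxs = stratified_split(labels, test_frac=0.15, seed=1234)

def share_positive(idxs):
    return sum(labels[i] for i in idxs) / len(idxs)

print(len(train_idxs), len(test_idxs))
print(round(share_positive(train_idxs), 3), round(share_positive(test_idxs), 3))
```

Because the sampling happens within each class, the share of "supportive" examples is nearly identical in both parts, which is the property the `stratify` argument is meant to provide.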

Now we are ready to create our dataset:

In [None]:
dataset = DatasetDict({
    "train": Dataset.from_pandas(df.iloc[train_idxs], preserve_index=False),
    "dev": Dataset.from_pandas(df.iloc[dev_idxs], preserve_index=False),
    "test": Dataset.from_pandas(df.iloc[test_idxs], preserve_index=False),
})

The object 'dataset' is an instance of the `datasets` `DatasetDict` class.

In [None]:
type(dataset)

datasets.dataset_dict.DatasetDict

This class is there to gather several pre-defined splits of a dataset.

Among these splits, one is usually named "train" and another one "test" (see next cell).

**_Note:_** It'll become clearer further below why we need these splits.

In [None]:
dataset.keys()

dict_keys(['train', 'dev', 'test'])

In [None]:
len(dataset['train']), len(dataset['dev']), len(dataset['test'])

(5984, 797, 1196)

Here is how you can access one "example" (i.e., observation) in the "test" split:

In [None]:
type(dataset["test"])

datasets.arrow_dataset.Dataset

In [None]:
dataset["test"][1], id2label

({'text': 'Auslöser für Angst und Intoleranz ist Unwissenheit.',
  'label': 1,
  'lang': 'deu'},
 {0: 'sceptical', 1: 'supportive'})

This shows that, in the test split, there are three fields for *each* example:

- `text`: the text of the manifesto sentence.
- `label`: a value that is either `1` for a supportive stance or `0` for a sceptical stance.
- `lang`: a language code indicating the sentence's language.

**_Important:_** check that the splits have about equal label class distributions:

In [None]:
print('% "pos" in train:', np.mean([ex['label'] for ex in dataset["train"]]))
print('% "pos" in dev:', np.mean([ex['label'] for ex in dataset["dev"]]))
print('% "pos" in test:', np.mean([ex['label'] for ex in dataset["test"]]))

% "pos" in train: 0.7321189839572193
% "pos" in dev: 0.7327478042659975
% "pos" in test: 0.7324414715719063


### Preprocessing texts

The next step is to load the XLM-RoBERTa tokenizer to preprocess the `text` field:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# read about "byte-pair encoding" and "sentence-piece" algorithms if interested in how tokenizers work

The tokenizer is a so-called *callable* and can thus be used like a function:
If you input a text string, it returns a dictionary with the tokenized text and additional information.

In [None]:
toks = tokenizer("Hello, this one sentence! <s> And this is another one.")
print(toks.keys())

dict_keys(['input_ids', 'attention_mask'])


- The field 'input_ids' indicates the numbers used to represent the tokens in the example sentence.
- The 'attention_mask' tells the model which tokens in a batch of sentences it should pay attention to and which it can ignore (e.g., padding tokens).

In [None]:
toks['input_ids']
print(toks['input_ids'])
tokenizer.convert_ids_to_tokens(toks['input_ids'])

[0, 35378, 4, 903, 1632, 149357, 38, 0, 3493, 903, 83, 15700, 1632, 5, 2]


['<s>',
 '▁Hello',
 ',',
 '▁this',
 '▁one',
 '▁sentence',
 '!',
 '<s>',
 '▁And',
 '▁this',
 '▁is',
 '▁another',
 '▁one',
 '.',
 '</s>']

In [None]:
print(toks['attention_mask'])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Let's create a helper function that tokenizes the `text` value of its input (called `examples`).
This will allow us to iterate over the examples in our dataset splits (e.g., `dataset["test"]`) and pre-process them.

**_Note:_** By setting `truncation=True`, we ensure that none of the text sequences we'll use for fine-tuning is too long for XLM-RoBERTa to handle.

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
# need to do this because you want to add the input IDs and
#  attention mask values to each example in each of the data splits
dataset = dataset.map(preprocess_function, batched=True)

In [None]:
dataset['train'][0].keys()

dict_keys(['text', 'label', 'lang', 'input_ids', 'attention_mask'])

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding).

Our `data_collator` instance of this class will handle preprocessing and a thing called "padding" when sampling batches of examples during finetuning to iteratively update our classifier's parameters.

*Padding* means that you make all text sequences in a set of sequences the same length.
To do this, we just append the `<PAD>` special token to shorter text sequences in the set.
For example, the (tokenized) sequences in the following set

```json
[
    ['Hello', 'world', '!'               ],
    ['Have',  'a',     'nice', 'day', '!']
]
```

will be "padded" to

```json
[
    ['Hello', 'world', '!',    '<PAD>', '<PAD>'],
    ['Have',  'a',     'nice', 'day',   '!'    ]
]
```

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
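
What padding does to a batch of token IDs, and how it relates to the attention mask, can be sketched in plain Python. The helper `pad_batch`, the toy IDs, and the padding ID of 1 are assumptions made for illustration; this is not the implementation of `DataCollatorWithPadding`, and the real padding ID should be read from `tokenizer.pad_token_id`.

```python
def pad_batch(batch_input_ids, pad_token_id):
    """Hypothetical helper: right-pad ID sequences to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

PAD_ID = 1  # assumption for illustration; use `tokenizer.pad_token_id` in practice
batch = pad_batch([[0, 11, 12, 2], [0, 21, 22, 23, 24, 2]], pad_token_id=PAD_ID)

print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than across the whole dataset keeps sequences only as long as the longest example in each batch, which saves computation.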

### How to evaluate model performance

To see how well our model is doing on the classification task, we compute *evaluation metrics*.
Broadly speaking, they summarize how often our model predicts the correct labels on held-out data.

Applying an evaluation metric to held-out data during training is helpful for assessing how your model's performance improves during training.

Once your model is trained, evaluation metrics are important to quantify your model's final performance.

We compute the following metrics:

- precision: the share of examples the classifier assigns to a class that truly belong to that class
- recall: the share of a class's true examples that the classifier labels correctly
- F1: a measure combining recall and precision (their harmonic mean)
- balanced accuracy: an accuracy metric adjusting for class imbalance (not included in the `compute_metrics` function below)

<p><a href="https://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" alt="Precisionrecall.svg" height="800" width="440"></a><br>By <a href="https://commons.wikimedia.org/wiki/User:Walber" title="User:Walber">Walber</a> - Own work, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=36926283">Link</a></p>

In [None]:
p, r, f1, _ = precision_recall_fscore_support(y_true=[0, 1, 1], y_pred=[0, 0, 1], average='binary', zero_division=0)
p, r, f1

(1.0, 0.5, 0.6666666666666666)

In [None]:
2*(p*r / (p+r))

0.6666666666666666
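
The same numbers can be derived by hand from the four confusion counts, which also makes it easy to add balanced accuracy. This is a minimal sketch for the toy labels used above; it has no guard against division by zero, which the scikit-learn function handles through `zero_division`.

```python
y_true = [0, 1, 1]
y_pred = [0, 0, 1]
pairs = list(zip(y_true, y_pred))

# confusion counts, treating 1 as the positive class
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)  # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(precision, recall, round(f1, 4), balanced_accuracy)
```

Balanced accuracy averages the recall of both classes, so a classifier cannot score well on it simply by always predicting the majority class.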

Let's create a function that passes your predictions and labels to calculate some central metrics (explanations below):

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    p, r, f1, _ = precision_recall_fscore_support(y_true=labels, y_pred=predictions, average='binary', zero_division=0)
    metrics = {
        "f1": f1,
        "precision": p,
        "recall": r,
    }
    return metrics

**_Note:_** You could also just load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric).

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

### Load the pre-trained model

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load XLM-RoBERTa with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, # <== the name of the pre-trained model (downloaded from huggingface hub)
    num_labels=len(id2label), # number of label classes (adapt this if you have, e.g., 4 label classes)
    id2label=id2label,
    label2id=label2id
)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we move the model to the device we determined in the setup section (GPU, MPS, or CPU):

In [None]:
# IMPORTANT: put the model to the respective device (e.g., GPU)
if str(device) != 'cpu':
  model.to(device);
print(model.device)

cuda:0


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. With the settings used below, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the model on the dev set (using `compute_metrics`) and save a training checkpoint at the end of each epoch.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

### Preparing the trainer

In [None]:
# create a temporary folder to save training checkpoints and logs
import tempfile
model_path = tempfile.mkdtemp()

In [None]:
training_args = TrainingArguments(
    output_dir=model_path,
    # leave the following unchanged ;)
    optim='adamw_torch',
    learning_rate=2e-5,
    per_device_train_batch_size=16, # <== reduce only if you get a "CUDA out of memory" error
    gradient_accumulation_steps=2,  # <== increase if you reduce `per_device_train_batch_size`
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    # increase this value to train for longer
    num_train_epochs=3,
    evaluation_strategy="epoch",
    report_to="none",
    # how to save and determine ("best") model
    save_strategy="epoch",
    metric_for_best_model="f1", # <== needs to match one of the names of the dictionary returned by `compute_metrics()` function
    load_best_model_at_end=True,
    save_total_limit=2,
)
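
How these settings translate into the number of parameter updates can be worked out with a little arithmetic. The sketch below assumes a single device and only approximates the `Trainer`'s internal bookkeeping; the variable names are chosen for this illustration.

```python
import math

n_train = 5984         # size of the train split reported above
per_device_batch = 16  # per_device_train_batch_size
grad_accum = 2         # gradient_accumulation_steps
epochs = 3             # num_train_epochs
n_devices = 1          # assumption: a single GPU (or a single CPU/MPS device)

# examples that contribute to one parameter update
effective_batch = per_device_batch * grad_accum * n_devices
# mini-batches per epoch (the last one may be smaller)
batches_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
# parameter updates per epoch and in total
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```

Under these assumptions the effective batch size is 32 and the total is 561 updates, which matches the `global_step=561` in the training output further below. Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` would keep the effective batch size unchanged.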

In [None]:
import gc
def clean_memory(device):
  """Free GPU memory"""
  # `str(model.device)` yields e.g. 'cuda:0', so match on the device-type prefix
  if device.startswith('cuda'):
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  elif device.startswith('mps'):
    torch.mps.empty_cache()
  gc.collect()

In [None]:
clean_memory(str(model.device))

trainer = Trainer(
    model=model, # the model instance you loaded above
    args=training_args, # the training args you created two cells above
    train_dataset=dataset["train"], # the training data split
    eval_dataset=dataset["dev"], # the development data split
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

## Train

Now we can finetune the model!

**_Warning:_** This will take long if you are using only your CPU 🥹

In [None]:
trainer.train()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


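| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|:----- |:------------- |:--------------- |:-- |:--------- |:------ |
| 1 | No log | 0.452009 | 0.849292 | 0.82658 | 0.873288 |
| 2 | No log | 0.439485 | 0.856204 | 0.823065 | 0.892123 |
| 3 | 0.449400 | 0.439729 | 0.862969 | 0.823988 | 0.905822 |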


TrainOutput(global_step=561, training_loss=0.44080802251098417, metrics={'train_runtime': 348.6321, 'train_samples_per_second': 51.493, 'train_steps_per_second': 1.609, 'total_flos': 554531215832640.0, 'train_loss': 0.44080802251098417, 'epoch': 3.0})

**_Note:_** The metrics reported above are computed on the development (`dev`) set!
Once trained, we still need to compute performance on the test set. 👇

Clean up the checkpoints saved during training:

In [None]:
import shutil

shutil.rmtree(model_path)

## Evaluation

In [None]:
# evaluate the final model on the held-out test set
trainer.evaluate(dataset["test"])

{'eval_loss': 0.4528695344924927,
 'eval_f1': 0.8749317312943746,
 'eval_precision': 0.8387434554973822,
 'eval_recall': 0.9143835616438356,
 'eval_runtime': 4.5601,
 'eval_samples_per_second': 262.277,
 'eval_steps_per_second': 8.333,
 'epoch': 3.0}

**Interpretation**

- The **precision** tells us how often the classifier is correct when it says a text expresses a "supportive" stance (our positive label class). A score of ~0.84 means that the classifier is correct in roughly 17 out of 20 such cases.
- The **recall** tells us how many of the "supportive"-stance sentences the classifier classifies correctly. A score of ~0.91 means that, on average, about 9 out of 10 "supportive"-stance instances are correctly classified by our classifier.
- The **F1 score** summarizes these two values in one score (their harmonic mean).
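
As a sanity check on these numbers, the reported F1 score can be recomputed from the reported precision and recall. The sketch below only uses the values printed by `trainer.evaluate()` above.

```python
precision = 0.8387434554973822  # 'eval_precision' reported above
recall = 0.9143835616438356     # 'eval_recall' reported above

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))

# rough "out of N" readings used in the interpretation
print(round(precision * 20), "of 20 'supportive' predictions are correct")
print(round(recall * 10), "of 10 'supportive' sentences are found")
```

The result agrees with the reported `eval_f1` of about 0.875 up to floating-point error.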

Overall our classifier performs very well! 🥳

### Save the model and tokenizer for re-use

In [None]:
model_path = os.path.join('..', 'models', 'pimpo_stance_classifier')
os.makedirs(model_path, exist_ok=True)
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

('../models/pimpo_stance_classifier/tokenizer_config.json',
 '../models/pimpo_stance_classifier/special_tokens_map.json',
 '../models/pimpo_stance_classifier/sentencepiece.bpe.model',
 '../models/pimpo_stance_classifier/added_tokens.json',
 '../models/pimpo_stance_classifier/tokenizer.json')

In [None]:
os.listdir(model_path)

['training_args.bin',
 'tokenizer.json',
 'sentencepiece.bpe.model',
 'special_tokens_map.json',
 'config.json',
 'model.safetensors',
 'tokenizer_config.json']

### Detailed look at the classifier's output

Let's create predictions for ten randomly sampled examples from the test set:

In [None]:
rng = np.random.default_rng(SEED)
idxs = rng.choice(len(dataset["test"]), 10, replace=False)
idxs
dataset["test"][idxs]['text']

['Besonders häufig sind Migrantinnen und Migranten gezwungen, zu skandalösen Löhnen zu arbeiten.',
 'It is clear that the current bureaucratic process is ineffective and that the Department has no real capacity to competently assess the legitimacy of the claims by asylum seekers.',
 'Nous allons également présenter de nouveau une mesure législative visant à protéger les travailleurs étrangers vulnérables - par exemple, les femmes qui entrent au Canada pour travailler comme danseuses exotiques.',
 'Dies darf in Zukunft nicht mehr möglich sein.',
 'Setzt sich d ieser Trend längerfristig fort, übertreffen die negativen Folgen allfällige positive Effekte deutlich.',
 '-\xa0\xa0 Wir wollen ein Einwanderungsrecht, das Fachkräften eine Chance am Arbeitsmarkt gibt.',
 'Daneben muss die Erwachsenenbildung bessere Angebote machen für Menschen aus anderen Kulturkreisen, die sich in unsere Gesellschaft einleben wollen.',
 'Sosial dumping er en trussel mot lønns– og arbeidsvilkår  i Norge',
 '. Det

In [None]:
preds = trainer.predict(dataset["test"].select(idxs))

In [None]:
type(preds)

transformers.trainer_utils.PredictionOutput

In [None]:
type(preds.predictions)

numpy.ndarray

In [None]:
preds.predictions.shape

(10, 2)

The prediction array has two dimensions:

- the first axis ('rows') corresponds to the *number of examples* for which we generated predictions
- the second axis ('columns') corresponds to the *number of label classes* we generate probability-like scores for when predicting

Let's look at the scores for the second example:

In [None]:
preds.predictions[1]

array([ 0.664218 , -0.5859217], dtype=float32)

The first score is bigger than the second one.
This means that the given example is more similar to examples from the first label class: documents with a "sceptical" stance.

In [None]:
id2label[0]

'sceptical'

To convert those scores into something probability-like, we apply the so-called softmax transformation, which rescales the values such that each lies between 0 and 1 and they sum to 1:

In [None]:
from scipy.special import softmax

softmax(preds.predictions[1])

array([0.7773241 , 0.22267595], dtype=float32)
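To make the transformation less of a black box, here is a minimal NumPy sketch of what the softmax does, applied to the two scores shown above. The function name `softmax_by_hand` is just an illustrative choice and not part of any library. Subtracting the maximum before exponentiating is a common trick for numerical stability and should not change the result.

```python
import numpy as np

def softmax_by_hand(scores: np.ndarray) -> np.ndarray:
    """Rescale raw scores so that they lie between 0 and 1 and sum to 1."""
    # subtract the maximum for numerical stability (this does not change the result)
    shifted = scores - scores.max()
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# the two raw scores of the second example (copied from the output above)
scores = np.array([0.664218, -0.5859217])
probs = softmax_by_hand(scores)
print(probs)
```

Up to rounding, this should reproduce the values returned by `scipy.special.softmax`.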

We can also call this function on all examples' prediction scores in our current batch:

In [None]:
pred_probs = softmax(preds.predictions, axis=1)
pred_probs

array([[0.21041147, 0.7895886 ],
       [0.7773241 , 0.22267595],
       [0.02119703, 0.978803  ],
       [0.34057665, 0.6594234 ],
       [0.7822727 , 0.21772735],
       [0.02137618, 0.97862387],
       [0.01546516, 0.98453486],
       [0.08381011, 0.9161899 ],
       [0.43684134, 0.56315863],
       [0.64307564, 0.35692436]], dtype=float32)

Now if you want to know for each row in which cell the value is the largest, you can call the `argmax()` method on the numpy array:

In [None]:
pred_probs.argmax(axis=1)

array([1, 0, 1, 1, 0, 1, 1, 1, 1, 0])

This turns prediction scores into predicted labels:

In [None]:
[id2label[pp] for pp in preds.predictions.argmax(axis=1)]

['supportive',
 'sceptical',
 'supportive',
 'supportive',
 'sceptical',
 'supportive',
 'supportive',
 'supportive',
 'supportive',
 'sceptical']

And to make it clearer, let's put predictions, "true" labels, and the text side by side:

In [None]:
pd.DataFrame(
    {
        'text': dataset["test"][idxs]['text'],
        'label': [id2label[l] for l in dataset["test"][idxs]['label']],
        'prediction': [id2label[pp] for pp in preds.predictions.argmax(axis=1)],
    }
  )

| | text | label | prediction |
|:--- |:--- |:--- |:--- |
| 0 | Besonders häufig sind Migrantinnen und Migrant... | supportive | supportive |
| 1 | It is clear that the current bureaucratic proc... | sceptical | sceptical |
| 2 | Nous allons également présenter de nouveau une... | supportive | supportive |
| 3 | Dies darf in Zukunft nicht mehr möglich sein. | sceptical | supportive |
| 4 | Setzt sich d ieser Trend längerfristig fort, ü... | sceptical | sceptical |
| 5 | - Wir wollen ein Einwanderungsrecht, das Fac... | supportive | supportive |
| 6 | Daneben muss die Erwachsenenbildung bessere An... | supportive | supportive |
| 7 | Sosial dumping er en trussel mot lønns– og arb... | sceptical | supportive |
| 8 | . Det er derfor ønskelig at flere land i Europ... | sceptical | supportive |
| 9 | Dit zal regelmatig worden getoetst; | supportive | sceptical |


### Compute language-specific performances

While multilingual Transformers, and embedding models more generally, are great for adopting an "input alignment" approach to multilingual text analysis, they are no silver bullet for obtaining measurement instruments that perform equally well across languages.

To illustrate this point, let's split our test dataset by language and compute our classifier's performance in each subset:


In [None]:
preds = trainer.predict(dataset["test"])
langs = dataset['test']['lang']

performances = {}
y_pred = preds.predictions.argmax(axis=1)
y_true = preds.label_ids
for lang in set(langs):
    idxs = np.array(langs) == lang
    n_ = idxs.sum()
    if n_ == 0:
      continue
    p, r, f1, s = precision_recall_fscore_support(y_true[idxs], y_pred[idxs], average='binary', zero_division=0)
    performances[lang] = {
        "f1": f1,
        "precision": p,
        "recall": r,
        "prevalence": y_true[idxs].mean(),
        "n": int(idxs.sum())
    }

pd.DataFrame(performances).T.sort_values('precision', ascending=False)

| | f1 | precision | recall | prevalence | n |
|:--- |:--- |:--- |:--- |:--- |:--- |
| fra | 0.9 | 0.9 | 0.9 | 0.909091 | 11.0 |
| nor | 0.920152 | 0.896296 | 0.945312 | 0.842105 | 152.0 |
| spa | 0.924528 | 0.890909 | 0.960784 | 0.796875 | 128.0 |
| deu | 0.897368 | 0.863291 | 0.934247 | 0.763598 | 478.0 |
| swe | 0.888889 | 0.852459 | 0.928571 | 0.835821 | 67.0 |
| eng | 0.867925 | 0.821429 | 0.92 | 0.724638 | 69.0 |
| fin | 0.83871 | 0.787879 | 0.896552 | 0.674419 | 43.0 |
| nld | 0.75 | 0.716814 | 0.786408 | 0.559783 | 184.0 |
| dan | 0.72 | 0.642857 | 0.818182 | 0.515625 | 64.0 |


As you can see, our classifier's performance varies across language subsets.
Overall, its recall is higher than its precision, suggesting that it generally is "overshooting" (i.e., assigning too many "supportive" labels).
But this tendency is more pronounced in some languages (Finnish, Danish, Spanish, German, English) than in others (French, Dutch).

Such systematic discrepancies can have implications for downstream analyses because you'll systematically overestimate the prevalence of supportive stances in Finnish, Danish, and the other affected languages.

Note, however, that the language-specific subsets are sometimes relatively small.
So you should take these point estimates with a grain of salt.
(Bootstrapping these metrics is a best practice.)
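
As a rough illustration of what bootstrapping a metric could look like, here is a minimal sketch that resamples examples with replacement and reports a percentile interval for the F1 score. The helper `bootstrap_f1` and the toy labels are made up for this illustration; in practice you would pass the `y_true[idxs]` and `y_pred[idxs]` arrays of one language subset.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(y_true, y_pred, n_boot=1000, alpha=0.05, seed=1234):
    """Percentile bootstrap interval for the (binary) F1 score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        # resample example positions with replacement
        idx = rng.integers(0, n, size=n)
        scores[b] = f1_score(y_true[idx], y_pred[idx], zero_division=0)
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# toy labels (made up for illustration, not taken from the test set)
y_true_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1])

lower, upper = bootstrap_f1(y_true_toy, y_pred_toy)
print(f"95% bootstrap interval for F1: [{lower:.2f}, {upper:.2f}]")
```

With subsets as small as the French one (n = 11), you should expect such intervals to be quite wide.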

## Inference/prediction

Once you have saved your fine-tuned model, you can always re-load it to label texts.
In machine learning this is called "inference", which is unfortunate given the meaning of the term in positive social science methodology.

So let's just call it **prediction**.


In [None]:
from transformers import pipeline

stance_classifier = pipeline(
    task="text-classification",
    model=model_path # <== load the model we saved to disk
)

The simplest way to try out your fine-tuned model for prediction is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). We have instantiated a `pipeline` for text classification with our model above, so we can now pass a text to it:

In [None]:
text = "Refugees welcome!"
print(stance_classifier(text))

[{'label': 'supportive', 'score': 0.9616464972496033}]


In [None]:
text = 'Our immigration policy shoud be more restrictive.'
print(stance_classifier(text))

[{'label': 'sceptical', 'score': 0.6391111016273499}]


### Looking under the hood of `pipeline`

You can also manually replicate the results of the `pipeline` if you'd like.
To do so, tokenize the text and return PyTorch tensors, then pass your inputs to the model and get the `logits`:

In [None]:
# tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Refugees welcome!", return_tensors="pt")
inputs.to(device); # <== important to put in GPU!!

# model = AutoModelForSequenceClassification.from_pretrained(model_path)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id] # <== use the 'id2label' we've added to the model we saved

'supportive'

### Clean up

In [None]:
del model, trainer, stance_classifier
clean_memory(device)

## Measurement validation

In most applications, we are ultimately interested in using lower-level classifications (e.g., at the sentence level) to generate higher-level measurements (e.g., at the document level).

To validate these measures, it is best practice to conduct a convergent validity assessment.
This means comparing the measurement generated by your "instrument" (i.e., the classifier) to an external measurement that is supposed to measure the same construct.

In our application, we will use our classifier to obtain position scores for the sections of party manifestos that deal with immigration/integration, and compare these scores to the judgments of experts recruited by the Chapel Hill Expert Survey (CHES).

### Assessing equivalence

In [None]:
df_test = df.iloc[test_idxs]

In [None]:
langs = set(df_test.lang.unique())

In [None]:
import easynmt
translator = easynmt.EasyNMT('m2m_100_418M', device=device)

In [None]:
clean_memory(device)

translations = {i: [] for i in df_test.index}
for lang, d in df_test.groupby('lang'):
  for l in langs:
    if l == lang:
      continue
    print(f'translating from {lang} to {l}')
    try:
      res = translator.translate_sentences(
          d.text.to_list(),
          source_lang=iso639.to_iso639_1(lang),
          target_lang=iso639.to_iso639_1(l),
          batch_size=16,
          beam_size=5,
          show_progress_bar=True,
        )
    except Exception as e:
      print(f'WARNING: cannot translate "{lang}" to "{l}" with M2M model: {str(e)}')
      continue
    for i, t in zip(d.index, res):
      translations[i].append((l, t))

translating from dan to swe
translating from dan to deu
translating from dan to spa
translating from dan to fra
translating from dan to nor
translating from dan to nld
translating from dan to fin
translating from dan to eng
translating from deu to swe
translating from deu to spa
translating from deu to fra

In [None]:
translations

{457: [],
 7145: [],
 2133: [],
 1431: [],
 1820: [],
 4376: [],
 6959: [],
 7068: [],
 4968: [],
 5951: [],
 1691: [],
 1102: [],
 3042: [],
 7428: [],
 2574: [],
 1107: [],
 7617: [],
 2358: [],
 381: [],
 5833: [],
 3703: [],
 4736: [],
 7039: [],
 1749: [],
 4393: [],
 6896: [],
 447: [],
 7330: [],
 7554: [],
 7383: [],
 6312: [],
 3215: [],
 101: [],
 2126: [],
 4539: [],
 1601: [],
 132: [],
 3834: [],
 2487: [],
 337: [],
 6877: [],
 2416: [],
 2191: [],
 2529: [],
 5698: [],
 6712: [],
 2565: [],
 4249: [],
 5893: [],
 5457: [],
 2694: [],
 2884: [],
 6436: [],
 7803: [],
 6519: [],
 160: [],
 38: [],
 5997: [],
 7701: [],
 3857: [],
 3883: [],
 4776: [],
 5472: [],
 1892: [],
 6416: [],
 5163: [],
 6182: [],
 25: [],
 4330: [],
 6173: [],
 3260: [],
 4522: [],
 7085: [],
 1539: [],
 48: [],
 315: [],
 6366: [],
 5783: [],
 4107: [],
 7854: [],
 5686: [],
 5063: [],
 3945: [],
 4630: [],
 100: [],
 7176: [],
 5280: [],
 6089: [],
 952: [],
 1732: [],
 6576: [],
 7301: [],
 675

In [None]:
d.index[:4]

Int64Index([1609, 1610, 1611, 1612], dtype='int64')

In [None]:
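# align the translations with the rows of df_test (the dictionary is keyed by df_test's index values)
df_test['translations'] = [translations[i] for i in df_test.index]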

### Convergent validation with manifesto data

We get the manifesto data from [OSF](https://osf.io/bj27x/), specifically the replication materials of Ruedin and Morales ([2020](https://doi.org/10.1177/1354068817713122)).

I have already split the texts in the relevant manifesto sections into sentences:

In [None]:
fp = 'https://raw.githubusercontent.com/fabiennelind/Going-Cross-Lingual_Course/main/data/ruedin_and_morales_2019/ruedin_and_morales_2019_manifest_section_sentences.csv'
df = pd.read_csv(fp)

In [None]:
# remove rows with missing text
df = df[~df.text.isna()]
df = df[df.text.apply(len) > 0]

In [None]:
# I have also mapped party names to CHES party IDs
fp = 'https://raw.githubusercontent.com/fabiennelind/Going-Cross-Lingual_Course/main/data/ruedin_and_morales_2019/party_codes.tsv'
parties = pd.read_csv(fp, sep='\t')

In [None]:
parties = parties[~parties.ches.isna()]
parties['ches'] = parties.ches.astype(int)

In [None]:
# just keep cases with matching party ID
df = pd.merge(df, parties, on=['country', 'party'], how='inner')

In [None]:
df.ches.isna().sum()

0

### Machine-label the sentences

We just take the stance classifier we have trained above and apply it to the sentences in the manifesto data:

In [None]:
stance_classifier = pipeline(
    task="text-classification",
    model=model_path, # <== load the model we saved to disk
    device=device,
    batch_size=8
  )

In [None]:
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset
# this code is complicated but faster (see https://stackoverflow.com/a/77442622)
kd = KeyDataset(Dataset.from_pandas(df[['text']]), 'text')
preds = []
for p in tqdm(stance_classifier(kd), total=len(df)):
  preds.append(p)

  0%|          | 0/18780 [00:00<?, ?it/s]

In [None]:
df['label'] = [p['label'] for p in preds]
df['confidence'] = [p['score'] for p in preds]

In [None]:
grps = ['country', 'year', 'party', 'language']
df.groupby(grps).apply(lambda x: (x['label'] == 'supportive').sum()).reset_index(name='count')

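| | country | year | party | language | count |
|:--- |:--- |:--- |:--- |:--- |:--- |
| 0 | at | 1994 | fpö | de | 45 |
| 1 | at | 1994 | green | de | 35 |
| 2 | at | 1994 | lif | de | 10 |
| 3 | at | 1994 | övp | de | 49 |
| 4 | at | 1994 | spö | de | 17 |
| ... | ... | ... | ... | ... | ... |
| 269 | uk | 2019 | green | en | 40 |
| 270 | uk | 2019 | lab | en | 48 |
| 271 | uk | 2019 | libdem | en | 67 |
| 272 | uk | 2019 | plaid | en | 33 |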


### Computing manifesto-section level measurements

In [None]:
grps = ['country', 'ches']
out = df.groupby(grps).agg(
    n_supportive=('label', lambda x: (x == 'supportive').sum()),
    n_sceptical=('label', lambda x: (x == 'sceptical').sum())
).reset_index()

# formula from Proksch et al. (2019): https://onlinelibrary.wiley.com/doi/10.1111/lsq.12218
out['position'] = np.log((out['n_supportive'] + 0.5) / (out['n_sceptical'] + 0.5))

In [None]:
out[out.country == 'uk'].sort_values('position')

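| | country | ches | n_supportive | n_sceptical | position |
|:--- |:--- |:--- |:--- |:--- |:--- |
| 58 | uk | 1109 | 425 | 330 | 0.252658 |
| 57 | uk | 1108 | 307 | 222 | 0.323548 |
| 61 | uk | 1151 | 4 | 2 | 0.587787 |
| 59 | uk | 1110 | 3 | 1 | 0.847298 |
| 51 | uk | 1101 | 268 | 105 | 0.93414 |
| 52 | uk | 1102 | 249 | 88 | 1.036456 |
| 54 | uk | 1105 | 45 | 10 | 1.466337 |
| 53 | uk | 1104 | 162 | 30 | 1.672951 |
| 56 | uk | 1107 | 95 | 12 | 2.033398 |
| 55 | uk | 1106 | 65 | 6 | 2.310248 |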
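
To see how the log-ratio formula behaves, here is a small sketch that recomputes the position score for two of the UK rows shown above. The function name `log_ratio_position` and the `smoothing` argument are illustrative choices; the value 0.5 is the constant used in the code above.

```python
import numpy as np

def log_ratio_position(n_supportive: int, n_sceptical: int, smoothing: float = 0.5) -> float:
    """Log ratio of supportive to sceptical sentence counts.

    Adding `smoothing` to both counts keeps the score finite when one count is 0.
    """
    return float(np.log((n_supportive + smoothing) / (n_sceptical + smoothing)))

# counts copied from the UK rows above
print(round(log_ratio_position(425, 330), 6))  # party 1109
print(round(log_ratio_position(65, 6), 6))     # party 1106

# equal counts give a neutral position of 0.0
print(log_ratio_position(10, 10))

# without any supportive sentences the score is negative but still finite
print(log_ratio_position(0, 4))
```

Positive scores thus indicate that supportive sentences outnumber sceptical ones, and negative scores indicate the opposite.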


### Load the CHES data

In [None]:
# load the file with the aggregate scores
ches = pd.read_csv('https://www.chesdata.eu/s/1999-2019_CHES_dataset_meansv3.csv')

In [None]:
# convert numeric country codes to ISO codes
ches_country_codes = {
  1: 'be',
  5: 'es',
  6: 'fr',
  7: 'ie',
  10: 'nl',
  11: 'uk',
  13: 'at',
}
ches['country_code'] = ches.country.map(ches_country_codes)
ches = ches[~ches['country_code'].isna()]

In [None]:
# keep selected columns
cols = ['country_code', 'year', 'party_id', 'party', 'immigrate_policy']
ches = ches[cols]

# discard cases with missing immigration policy judgments
ches = ches[~ches.immigrate_policy.isna()]
ches

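| | country_code | year | party_id | party | immigrate_policy |
|:--- |:--- |:--- |:--- |:--- |:--- |
| 24 | be | 2006 | 109 | CD&V | 5.88 |
| 25 | be | 2006 | 106 | MR | 5.33 |
| 26 | be | 2006 | 105 | Groen | 1.13 |
| 27 | be | 2006 | 110 | NVA | 7.38 |
| 28 | be | 2006 | 102 | PS | 2.50 |
| ... | ... | ... | ... | ... | ... |
| 638 | at | 2019 | 1303 | FPO | 9.80 |
| 639 | at | 2019 | 1301 | SPO | 4.70 |
| 640 | at | 2019 | 1302 | OVP | 8.60 |
| 641 | at | 2019 | 1306 | NEOS | 4.30 |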


The CHES indicator `immigrate_policy` ranges from 0 ("[party] strongly favors a liberal policy on immigration") to 10 ("[party] strongly favors a restrictive policy on immigration").

In [None]:
ches_agg = ches.groupby(['country_code', 'party_id', 'party']).agg(immigrate_policy=('immigrate_policy', 'mean')).reset_index()

In [None]:
ches_agg.rename(columns={'country_code': 'country', 'party_id': 'ches'}, inplace=True)

### Conduct convergent validity assessment

In [None]:
res = pd.merge(out, ches_agg, on=['country', 'ches'], how='inner')

In [None]:
from scipy.stats import pearsonr
r, p = pearsonr(res.position.values, res.immigrate_policy.values)
print(f'Overall correlation: {r:+.03f} (p≤{p:.03f})')

Overall correlation: -0.548 (p≤0.000)


Note that on the CHES indicator, a low score means a supportive/permissive stance.
On our classifier-based scores, low (negative) scores indicate a sceptical stance.
So the negative correlation is good ;)
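
If you prefer both measures to point in the same direction, you can reverse one of the scales before correlating. The sketch below illustrates this with made-up scores for five hypothetical parties (they are not taken from the data above). Reversing the 0 to 10 scale only flips the sign of the correlation, not its strength.

```python
import numpy as np
from scipy.stats import pearsonr

# made-up scores for five hypothetical parties (for illustration only)
position = np.array([2.1, 1.4, 0.9, 0.3, -0.5])  # higher = more supportive
ches_like = np.array([1.5, 3.0, 4.2, 6.8, 9.1])  # higher = more restrictive

r_original, _ = pearsonr(position, ches_like)

# reverse the 0-10 scale so that higher values mean "more permissive"
r_reversed, _ = pearsonr(position, 10 - ches_like)

print(f"original: {r_original:+.3f}, reversed: {r_reversed:+.3f}")
```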

In [None]:
for c, d in res.groupby('country'):
  r, p = pearsonr(d.position.values, d.immigrate_policy.values)
  print(f'correlation for subset "{c.upper()}": {r:+.03f} (p≤{p:.03f})')

correlation for subset "AT": -0.838 (p≤0.009)
correlation for subset "BE": -0.573 (p≤0.013)
correlation for subset "ES": -0.330 (p≤0.352)
correlation for subset "FR": -0.819 (p≤0.001)
correlation for subset "IE": -0.656 (p≤0.039)
correlation for subset "NL": -0.642 (p≤0.062)
correlation for subset "UK": -0.925 (p≤0.000)


## Appendix

### Dataset splitting

To split your dataset into train, dev, and test sets, you should

- rely on scikit-learn's pre-defined functions,
- and always set the seed.

This helps you avoid accidentally having the same examples in different sets and makes your data splitting reproducible.

In [None]:
# example data (see https://chat.openai.com/share/2d5af33c-acb9-4b0f-a260-7c38651fa7b0)
labeled_data = [
    {"text": "I'm over the moon with joy!", "label": "happy"},
    {"text": "Happiness radiates from every fiber of my being.", "label": "happy"},
    {"text": "I can't stop smiling because life is beautiful.", "label": "happy"},
    {"text": "Tears stream down my face; I'm so heartbroken.", "label": "sad"},
    {"text": "I feel so lonely and despondent right now.", "label": "sad"},
    {"text": "It's a gloomy day, and my spirits are low.", "label": "sad"},
    {"text": "Laughter fills the air, and my heart is light.", "label": "happy"},
    {"text": "I'm ecstatic about the news! Pure bliss!", "label": "happy"},
    {"text": "The weight of sadness bears down on me like a ton of bricks.", "label": "sad"},
    {"text": "Every moment without them feels like an eternity of sorrow.", "label": "sad"}
]


We usually specify the set sizes in percentages:

In [None]:
test_size = 0.20
dev_size = 0.20

# compute Ns
n = len(labeled_data)
n_test = int(n*test_size)
n_dev = int(n*dev_size)
n_train = n-n_test-n_dev

#### Simple random splitting

The simplest splitting strategy is to assign examples randomly to the three sets.

In [None]:
# use train_test_split from sklearn
from sklearn.model_selection import train_test_split
SEED = 1234

tmp, test_idxs = train_test_split(range(n), test_size=n_test, random_state=SEED)
train_idxs, dev_idxs = train_test_split(tmp, test_size=n_dev, random_state=SEED)
del tmp

print(len(train_idxs), len(dev_idxs), len(test_idxs)) # should be approx 60%, 20%, 20%

print(len(set(train_idxs).intersection(set(dev_idxs)))) # should be 0
print(len(set(train_idxs).intersection(set(test_idxs)))) # should be 0
print(len(set(dev_idxs).intersection(set(test_idxs)))) # should be 0

train_data = [labeled_data[i] for i in train_idxs]
dev_data = [labeled_data[i] for i in dev_idxs]
test_data = [labeled_data[i] for i in test_idxs]

6 2 2
0
0
0


With random sampling-based splitting, you might end up with different label proportions, though.

In [None]:
print(np.mean([d['label'] == 'happy' for d in train_data]))
print(np.mean([d['label'] == 'happy' for d in dev_data]))
print(np.mean([d['label'] == 'happy' for d in test_data]))

0.16666666666666666
1.0
1.0


**_Note:_** In a larger dataset, the differences wouldn't be as dramatic though. So the problem is not too bad.
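
To get a feeling for this, the following sketch compares the gap in label proportions between the train and test sets for a tiny and a larger toy dataset. The helper `happy_share_gap` and the perfectly balanced toy labels are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def happy_share_gap(n_examples: int, seed: int = 1234) -> float:
    """Absolute difference in the share of 'happy' labels between train and test."""
    # perfectly balanced toy labels: 1 = 'happy', 0 = 'sad'
    toy_labels = np.arange(n_examples) % 2
    train_labels, test_labels = train_test_split(
        toy_labels, test_size=0.2, random_state=seed
    )
    return float(abs(train_labels.mean() - test_labels.mean()))

gap_small = happy_share_gap(10)
gap_large = happy_share_gap(10_000)
print(f"n=10: {gap_small:.3f}, n=10000: {gap_large:.3f}")
```

With ten thousand examples, the gap should typically be in the range of a few percentage points at most, whereas with ten examples it can be very large.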

#### Stratify by label class

If you want to ensure that the label proportions in your train, dev, and test splits are identical,
you want to stratify by examples' true labels:

In [None]:
# use train_test_split from sklearn
import numpy as np
from sklearn.model_selection import train_test_split
SEED = 1234

labels = np.array([d['label'] for d in labeled_data])

tmp, test_idxs = train_test_split(range(n), test_size=n_test, random_state=SEED, stratify=labels)
train_idxs, dev_idxs = train_test_split(tmp, test_size=n_dev, random_state=SEED, stratify=labels[tmp])
del tmp

print(len(train_idxs), len(dev_idxs), len(test_idxs)) # should be approx 60%, 20%, 20%

print(len(set(train_idxs).intersection(set(dev_idxs)))) # should be 0
print(len(set(train_idxs).intersection(set(test_idxs)))) # should be 0
print(len(set(dev_idxs).intersection(set(test_idxs)))) # should be 0

train_data = [labeled_data[i] for i in train_idxs]
dev_data = [labeled_data[i] for i in dev_idxs]
test_data = [labeled_data[i] for i in test_idxs]

# the label proportions are now (approx.) the same in all splits
print(np.mean([d['label'] == 'happy' for d in train_data]))
print(np.mean([d['label'] == 'happy' for d in dev_data]))
print(np.mean([d['label'] == 'happy' for d in test_data]))

6 2 2
0
0
0
0.5
0.5
0.5


**_Note:_**
You could also stratify by other indicators, such as document authors' IDs.
In this case, you'd have similar proportions of each author's documents in the different splits.

#### Grouped sampling

Sometimes you want to develop a classifier that is able to predict text from held-out documents.
For example, if you have collected annotations for sentences sampled from parties' election manifestos, you might not have sampled sentences from some manifestos.
In this case, at prediction time (i.e., when applying your final model to the entire corpus of election manifestos), you'd need to make "out-of-document" classifications.

Achieving good "out-of-document" classification performance requires **generalization**.
But it can be difficult because language use within documents tends to be more similar than across documents.

To assess the ability of your classifier to predict reliably "out-of-document", you want to mirror this setup in your splitting strategy.
This is done by assigning sentences to the train, dev, and test splits based on their document membership.

So for example, below we have a simple illustration of how we assign all sentences from document 1 to the test set, and all sentences from documents 2 and 3 to the train set:

| sentence ID | doc ID | set |
|:----------- |:------ |:--- |
| 1 | 1 | &rarr; 'test' |
| 2 | 1 | &rarr; 'test' |
| 1 | 2 | &rarr; 'train' |
| 2 | 2 | &rarr; 'train' |
| 3 | 2 | &rarr; 'train' |
| 1 | 3 | &rarr; 'train' |
| 2 | 3 | &rarr; 'train' |
| 3 | 3 | &rarr; 'train' |
| ... | ... | ... |





In [None]:
labeled_data = [
    {"doc_id": 0, "text": "I'm over the moon with joy!", "label": "happy"},
    {"doc_id": 0,"text": "Happiness radiates from every fiber of my being.", "label": "happy"},
    {"doc_id": 1,"text": "I can't stop smiling because life is beautiful.", "label": "happy"},
    {"doc_id": 1,"text": "Tears stream down my face; I'm so heartbroken.", "label": "sad"},
    {"doc_id": 2,"text": "I feel so lonely and despondent right now.", "label": "sad"},
    {"doc_id": 2,"text": "It's a gloomy day, and my spirits are low.", "label": "sad"},
    {"doc_id": 3,"text": "Laughter fills the air, and my heart is light.", "label": "happy"},
    {"doc_id": 3,"text": "I'm ecstatic about the news! Pure bliss!", "label": "happy"},
    {"doc_id": 4,"text": "The weight of sadness bears down on me like a ton of bricks.", "label": "sad"},
    {"doc_id": 4,"text": "Every moment without them feels like an eternity of sorrow.", "label": "sad"}
]
# note: in reality, the number of sentences per document might vary.
#       But this is not a problem for running the code below!

In [None]:
# use the GroupShuffleSplit strategy from sklearn
from sklearn.model_selection import GroupShuffleSplit

doc_ids = np.array([d["doc_id"] for d in labeled_data])

gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=SEED)
tmp, test_idxs = next(gss.split(range(n), groups=doc_ids))
# the second split returns positions *within* `tmp`, so we map them back to the original indices
train_pos, dev_pos = next(gss.split(tmp, groups=doc_ids[tmp]))
train_idxs, dev_idxs = tmp[train_pos], tmp[dev_pos]
del tmp

print(len(train_idxs), len(dev_idxs), len(test_idxs)) # should be approx 60%, 20%, 20%

print(len(set(train_idxs).intersection(set(dev_idxs)))) # should be 0
print(len(set(train_idxs).intersection(set(test_idxs)))) # should be 0
print(len(set(dev_idxs).intersection(set(test_idxs)))) # should be 0

train_data = [labeled_data[i] for i in train_idxs]
dev_data = [labeled_data[i] for i in dev_idxs]
test_data = [labeled_data[i] for i in test_idxs]

train_doc_ids = [d["doc_id"] for d in train_data]
dev_doc_ids = [d["doc_id"] for d in dev_data]
test_doc_ids = [d["doc_id"] for d in test_data]

# no document ends up in more than one split
print(len(set(train_doc_ids).intersection(set(dev_doc_ids)))) # should be 0
print(len(set(train_doc_ids).intersection(set(test_doc_ids)))) # should be 0
print(len(set(dev_doc_ids).intersection(set(test_doc_ids)))) # should be 0


6 2 2
0
0
0
0
0
0


### Multi-class classification

In many use cases, you have more than two label classes.
If each document ("sample") has only one label (e.g., 'positive'), then you want to perform **multi-class classification**.

#### Model setup

`transformers`' `AutoModelForSequenceClassification` can handle this well.
You just need to

1. adapt your `id2label` and `label2id` dictionaries accordingly. So if you have three label classes "positive", "neutral", and "negative":

```python
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label2id = {"negative": 0, "neutral": 1, "positive": 2}
```

2. adapt the `num_labels` argument accordingly:

```python
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    id2label=id2label,
    label2id=label2id
)
```
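
If you have many label classes, typing both dictionaries by hand gets error-prone. One possible approach is to derive them, and the value for `num_labels`, from a single list of class names. The list `label_classes` below is an illustrative assumption.

```python
# the order of this list determines the label IDs
label_classes = ["negative", "neutral", "positive"]

id2label = dict(enumerate(label_classes))
label2id = {label: i for i, label in id2label.items()}
num_labels = len(label_classes)

print(id2label)
print(label2id)
print(num_labels)
```

This way the two mappings cannot get out of sync when you add or rename a class.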

#### Evaluation metrics

If you have more than two classes, you will need to adapt your evaluation metrics.
This is because the *precision* and *recall* metrics, for example, only differentiate between correct and false classifications ("hits" and "misses") of "positive" vs. "negative" examples.
So for each label class, you will convert ("dichotomize") your multi-class labels and predicted classifications into so-called "one vs. rest" indicators.

**_Example:_** If you are interested in how well your model classifies "neutral" samples and the other two label classes are "positive" and "negative", you will redefine the label categories as follows:

- "positive" &rarr; 0
- "neutral" &rarr; 1
- "negative" &rarr; 0

In this way, you can compute **_"neutral"-specific_ recall, precision, and F1 scores**.
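
The dichotomization described above can be sketched as follows. The toy labels and the variable names `true_binary` and `pred_binary` are made up for illustration, and the label IDs follow the mapping 0 = "negative", 1 = "neutral", 2 = "positive".

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# toy multi-class labels (0 = "negative", 1 = "neutral", 2 = "positive")
true_labels = np.array([0, 0, 1, 1, 2, 2])
pred_labels = np.array([0, 1, 0, 1, 2, 1])

# "one vs. rest": 1 if the label is "neutral", 0 otherwise
target_id = 1
true_binary = (true_labels == target_id).astype(int)
pred_binary = (pred_labels == target_id).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    true_binary, pred_binary, average="binary", zero_division=0
)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Repeating this for every label class gives you one set of scores per class.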

Evaluation functions in the `sklearn.metrics` module like `f1_score()` support multi-class classification:

In [None]:
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 0, 1, 2, 1]

from sklearn.metrics import f1_score

# get one F1 score per label class 0, 1, and 2 (in ascending order)
f1_score(y_true=true_labels, y_pred=pred_labels, average=None)

array([0.5       , 0.4       , 0.66666667])

But for finding the "best" model, you'll still need a single performance score.

So what we do is **average** class-specific scores into one performance estimate.
The most common averaging strategy is the so-called **_macro_ average**.
It is simply the unweighted mean of the class-specific scores.

So given the example above, the "macro F1 score" is `(0.5+0.4+0.666)/3 = 0.5222`

In [None]:
f1_score(y_true=true_labels, y_pred=pred_labels, average='macro')

0.5222222222222223

An alternative strategy is the **_micro_ average**.
In this strategy, we simply count which labels we got right and which we got wrong across all classes. So when each example has exactly one label, the "micro F1 score" is just the accuracy:

In [None]:
f1_score(y_true=true_labels, y_pred=pred_labels, average='micro')

0.5
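
If you want to convince yourself of this equivalence, a quick check along the following lines should do. It uses the same toy labels as above and assumes that each example has exactly one label.

```python
from sklearn.metrics import accuracy_score, f1_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 0, 1, 2, 1]

micro_f1 = f1_score(y_true=true_labels, y_pred=pred_labels, average="micro")
accuracy = accuracy_score(y_true=true_labels, y_pred=pred_labels)

# three of the six predictions are correct, so both values are 0.5
print(micro_f1, accuracy)
```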

There is also a function that summarizes everything:

In [None]:
from sklearn.metrics import classification_report

id2label = {0: "negative", 1: "neutral", 2: "positive"}

rep = classification_report(
    y_true=true_labels,
    y_pred=pred_labels,
    labels=list(id2label.keys()),
    target_names=list(id2label.values()),
)
print(rep)

              precision    recall  f1-score   support

    negative       0.50      0.50      0.50         2
     neutral       0.33      0.50      0.40         2
    positive       1.00      0.50      0.67         2

    accuracy                           0.50         6
   macro avg       0.61      0.50      0.52         6
weighted avg       0.61      0.50      0.52         6



Let's wrap this in a function we can pass to `Trainer`'s `compute_metrics` argument:

In [None]:

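from typing import Union

import numpy as np
from sklearn.metrics import balanced_accuracy_score, classification_report
from transformers.trainer_utils import EvalPrediction, PredictionOutput

def classification_metrics_multiclass(
    p: Union[EvalPrediction, PredictionOutput]
  ):
  # the model output can be a tuple whose first element holds the logits
  logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  y_pred = logits.argmax(-1)  # predicted label IDs
  y_true = p.label_ids        # "true" label IDs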
  res = classification_report(
    y_true=y_true,
    y_pred=y_pred,
    labels=list(label2id.values()), # assuming you have a `label2id` object defined
    target_names=list(label2id.keys()),
    output_dict=True,
    zero_division=0.0
  )

  metrics = {}

  # aggregate metrics
  metrics['f1_macro'] = res['macro avg']['f1-score']
  metrics['f1_micro'] = res['accuracy']
  metrics['balanced_accuracy'] = balanced_accuracy_score(y_true, y_pred)

  # class-wise metrics
  metrics.update(
    {f"{l}_{m.replace('-score', '')}": res[l][m] if l in res else np.nan
      for l in label2id.keys()
      for m in ['f1-score', 'precision', 'recall']
    }
  )

  return metrics