<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NER" data-toc-modified-id="NER-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>NER</a></span></li><li><span><a href="#Multilingual-Transformers-with-XLM-R" data-toc-modified-id="Multilingual-Transformers-with-XLM-R-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Multilingual Transformers with XLM-R</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Tokenization</a></span><ul class="toc-item"><li><span><a href="#Sentence-Piece-vs-WordPiece" data-toc-modified-id="Sentence-Piece-vs-WordPiece-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Sentence Piece vs WordPiece</a></span></li><li><span><a href="#Tokenizer-Pipeline" data-toc-modified-id="Tokenizer-Pipeline-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Tokenizer Pipeline</a></span></li><li><span><a href="#Sentencepiece-tokenizer" data-toc-modified-id="Sentencepiece-tokenizer-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Sentencepiece tokenizer</a></span></li></ul></li><li><span><a href="#Transformers-for-NER-and-the-inner-working-of-HuggingFace-Transformers" data-toc-modified-id="Transformers-for-NER-and-the-inner-working-of-HuggingFace-Transformers-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Transformers for NER and the inner working of HuggingFace Transformers</a></span><ul class="toc-item"><li><span><a href="#Body-and-Head" data-toc-modified-id="Body-and-Head-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Body and Head</a></span></li><li><span><a href="#Create-custom-model" data-toc-modified-id="Create-custom-model-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Create custom model</a></span></li><li><span><a href="#Adjust-config-file" data-toc-modified-id="Adjust-config-file-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Adjust config file</a></span></li><li><span><a href="#Load-the-custom-model" 
data-toc-modified-id="Load-the-custom-model-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Load the custom model</a></span></li></ul></li><li><span><a href="#Tokenization-for-NER" data-toc-modified-id="Tokenization-for-NER-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Tokenization for NER</a></span></li><li><span><a href="#Define-metrics" data-toc-modified-id="Define-metrics-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Define metrics</a></span></li><li><span><a href="#Finetuning-XLM-R" data-toc-modified-id="Finetuning-XLM-R-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Finetuning XLM-R</a></span></li></ul></div>

Multilingual transformers: like BERT, these models use **masked language modeling** as a pretraining objective, but they are trained **jointly on texts in over one hundred languages**, using huge corpora that span many languages.

=> a model that is fine-tuned on one language can be applied to others without any further training

=> these multilingual transformers enable zero-shot cross-lingual transfer


A single transformer model called XLM-RoBERTa (introduced in Chapter 3) can be fine-tuned to perform named entity recognition (NER) across several languages.


Note: Zero-shot transfer or zero-shot learning usually refers to the task of training a model on one set of labels and then evaluating it on a different set of labels. In the context of transformers, zero-shot learning may also refer to situations where a language model like GPT-3 is evaluated on a downstream task that it wasn’t even fine-tuned on.


For this chapter let’s assume that we want to perform NER for a customer based in Switzerland, where there are four national languages (with English often serving as a bridge between them). Let’s start by getting a suitable multilingual corpus for this problem.

# NER

Each article is annotated with LOC (location), PER (person), and ORG (organization) tags in the “inside-outside-beginning” (IOB2) format:
- A B- prefix indicates the beginning of an entity.
- Consecutive tokens belonging to the same entity are given an I- prefix.
- An O tag indicates that the token does not belong to any entity.
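To make the IOB2 convention concrete, here is a minimal sketch that groups a tag sequence into entity spans. The helper name `iob2_to_spans` and the choice to treat a stray `I-` tag (one with no matching `B-` before it) as `O` are assumptions made for illustration; they are not part of the dataset tooling.

```python
def iob2_to_spans(tags):
    """Group IOB2 tags into (entity_type, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # the current entity continues
        if start is not None:
            # Any other tag closes the entity that was open
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: a stray I- tag without a matching B- is treated like O
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht", "in",
          "der", "polnischen", "Woiwodschaft", "Pommern", "."]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O",
        "O", "B-LOC", "B-LOC", "I-LOC", "O"]

entities = [(label, " ".join(tokens[s:e])) for label, s, e in iob2_to_spans(tags)]
print(entities)
```

Note that two adjacent `B-LOC` tags start two separate entities, which is exactly what the `B-` prefix is for.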

We will be using a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN or PAN-X.

In [1]:
from datasets import load_dataset
from datasets import get_dataset_config_names


In [2]:

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


In [3]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets

['PAN-X.af',
 'PAN-X.ar',
 'PAN-X.bg',
 'PAN-X.bn',
 'PAN-X.de',
 'PAN-X.el',
 'PAN-X.en',
 'PAN-X.es',
 'PAN-X.et',
 'PAN-X.eu',
 'PAN-X.fa',
 'PAN-X.fi',
 'PAN-X.fr',
 'PAN-X.he',
 'PAN-X.hi',
 'PAN-X.hu',
 'PAN-X.id',
 'PAN-X.it',
 'PAN-X.ja',
 'PAN-X.jv',
 'PAN-X.ka',
 'PAN-X.kk',
 'PAN-X.ko',
 'PAN-X.ml',
 'PAN-X.mr',
 'PAN-X.ms',
 'PAN-X.my',
 'PAN-X.nl',
 'PAN-X.pt',
 'PAN-X.ru',
 'PAN-X.sw',
 'PAN-X.ta',
 'PAN-X.te',
 'PAN-X.th',
 'PAN-X.tl',
 'PAN-X.tr',
 'PAN-X.ur',
 'PAN-X.vi',
 'PAN-X.yo',
 'PAN-X.zh']

To make a realistic Swiss corpus, we’ll sample the German (de), French (fr), Italian (it), and English (en) corpora from PAN-X according to the proportions in which they are spoken. This will create a language imbalance that is very common in real-world datasets.

In [4]:
tmp = load_dataset("xtreme", name=f"PAN-X.vi")


In [5]:
tmp

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
})

In [6]:
tmp['train']['tokens'][:10]

[['Đồng', 'bằng', 'sông', 'Cửu', 'Long'],
 ['Trần', 'Trọng', 'Kim', ',', "''Đường", 'thi', "''", '.'],
 ['Gaushorn', '(', '213', ')'],
 ['10.17', '-', "'", "''", 'Wonder', 'Girls15th', '-', 'Nobody3rd', "''", "'"],
 ['đổi', 'CJ', 'E', '&', 'M', 'Pictures'],
 ['Đây',
  'là',
  'loài',
  'bản',
  'địa',
  'của',
  'Argentina',
  ',',
  'Brasil',
  ',',
  'và',
  'Paraguay',
  '.'],
 ['Sân',
  'vận',
  'động',
  'chính',
  'Khu',
  'liên',
  'hợp',
  'thể',
  'thao',
  'quốc',
  'gia',
  'Lào'],
 ['Kia]]', ',', 'Mazda', ',', 'BMW'],
 ["''", "'", "''", "'", '-', 'Gustavo', 'Gianetti'],
 ["''", 'Senecio', 'nevadensis', "''", 'Boiss', '.']]

In [7]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))



In [8]:
panx_ch.keys(),panx_ch['de']

(dict_keys(['de', 'fr', 'it', 'en']),
 DatasetDict({
     train: Dataset({
         features: ['tokens', 'ner_tags', 'langs'],
         num_rows: 12580
     })
     validation: Dataset({
         features: ['tokens', 'ner_tags', 'langs'],
         num_rows: 6290
     })
     test: Dataset({
         features: ['tokens', 'ner_tags', 'langs'],
         num_rows: 6290
     })
 }))

In [9]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |


We have more examples in German than in all other languages combined, so we’ll use it as a starting point from which to perform zero-shot cross-lingual transfer to French, Italian, and English.
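The training-set sizes above follow directly from the sampling fractions. The sketch below redoes that arithmetic, assuming each monolingual PAN-X training split has 20,000 rows before downsampling (an assumption inferred from the split sizes shown, not read from the dataset here).

```python
# Sampling fractions taken from the notebook cell above
fracs = {"de": 0.629, "fr": 0.229, "it": 0.084, "en": 0.059}
train_rows = 20_000  # assumption: rows per language before downsampling

# Same truncation as int(frac * ds[split].num_rows) in the sampling loop
sizes = {lang: int(frac * train_rows) for lang, frac in fracs.items()}
others = sum(n for lang, n in sizes.items() if lang != "de")

print(sizes)
print(f"German: {sizes['de']}, all other languages combined: {others}")
```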

In [10]:
panx_ch["de"]["train"][0]

{'tokens': ['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'langs': ['de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de']}

In [11]:
panx_ch["de"]["train"].features

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None),
 'langs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

In [12]:
# extract ner_tags
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)


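Note: a function passed to `map` with `batched=True` should have this signature:

`function(examples: Dict[str, List]) -> Dict[str, List]`

Without `batched=True` (as in the cells that follow), the function receives a single example, i.e. a dict holding the values of one row, and returns a dict of new or updated columns.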
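The difference between the per-example and the batched calling convention can be sketched with plain dictionaries. The names `TAG_NAMES`, `create_tag_names_single`, and `create_tag_names_batched` are hypothetical and only mimic the shapes that `map` passes in; no Datasets objects are involved.

```python
# Hypothetical stand-in for tags.names
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def create_tag_names_single(example):
    # Per-example mode: example["ner_tags"] is one list of ints
    return {"ner_tags_str": [TAG_NAMES[i] for i in example["ner_tags"]]}


def create_tag_names_batched(batch):
    # Batched mode: batch["ner_tags"] is a list of lists of ints
    return {"ner_tags_str": [[TAG_NAMES[i] for i in seq]
                             for seq in batch["ner_tags"]]}


example = {"ner_tags": [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]}
batch = {"ner_tags": [[0, 5, 6], [1, 2, 0]]}

single_out = create_tag_names_single(example)
batched_out = create_tag_names_batched(batch)
print(single_out)
print(batched_out)
```

The batched form does the same work on many rows per call, which is usually faster for large datasets.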

In [13]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

In [14]:
# convert from ner_tags to string
panx_ch['de']['train'][0],create_tag_names(panx_ch['de']['train'][0])

({'tokens': ['2.000',
   'Einwohnern',
   'an',
   'der',
   'Danziger',
   'Bucht',
   'in',
   'der',
   'polnischen',
   'Woiwodschaft',
   'Pommern',
   '.'],
  'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
  'langs': ['de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de']},
 {'ner_tags_str': ['O',
   'O',
   'O',
   'O',
   'B-LOC',
   'I-LOC',
   'O',
   'O',
   'B-LOC',
   'B-LOC',
   'I-LOC',
   'O']})

In [15]:
# go through the entire de
panx_de = panx_ch["de"].map(create_tag_names)



In [16]:
panx_de

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 12580
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 6290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 6290
    })
})

In [17]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],
['Tokens', 'Tags'])

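| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |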


As a quick check that we don’t have any unusual imbalance in the tags, let’s calculate the frequencies of each entity across each split:

In [18]:
from collections import Counter

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

| | LOC | ORG | PER |
|---|---|---|---|
| train | 6186 | 5366 | 5810 |
| validation | 3172 | 2683 | 2893 |
| test | 3180 | 2573 | 3071 |


# Multilingual Transformers with XLM-R

A remarkable feature of this approach is that **despite receiving no explicit information to differentiate among the languages, the resulting linguistic representations are able to generalize well across languages** for a variety of downstream tasks

Multilingual transformer models are usually evaluated in three different ways:

- en
    - Fine-tune on the English training data and then evaluate on each language’s test set.

- each
    - Fine-tune and evaluate on monolingual test data to measure per-language performance.

- all
    - Fine-tune on all the training data and then evaluate on each language’s test set.

XLM-R uses only MLM as a pretraining objective for 100 languages, but is distinguished by the huge size of its pretraining corpus compared to its predecessors: Wikipedia dumps for each language and 2.5 terabytes of Common Crawl data from the web. 

RoBERTa (the 'R' in XLM-R) improved on several aspects of BERT, in particular by **removing the next sentence prediction task altogether**.

XLM-R also drops the language embeddings used in XLM and uses **SentencePiece** to tokenize the raw texts directly.

A notable difference between XLM-R and RoBERTa is the **size of the respective vocabularies**: about 250,000 tokens versus about 50,000.

# Tokenization

## Sentence Piece vs WordPiece

In [19]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [20]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

In [21]:
bert_tokens,xlmr_tokens


(['[CLS]', 'Jack', 'Spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]'],
 ['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>'])

XLM-R uses `<s>` and `</s>` to denote the start and end of a sequence, where BERT uses `[CLS]` and `[SEP]`.

## Tokenizer Pipeline

Tokenization is often described as a single operation that transforms strings into integers, but this is not entirely accurate: it is really a pipeline of several steps.

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098136789/files/assets/nlpt_0401.png)

Jack Sparrow loves New York!


1. Normalization
    - set of operations you apply to a raw string to **make it “cleaner”**, e.g. stripping whitespace, removing accented characters, lowercasing, Unicode normalization (unifying the various ways to write the same character)

=> jack sparrow loves new york!

2. Pretokenization
    - splits a text into smaller objects (can be words) that **give an upper bound to what your tokens will be at the end of training; your final tokens will be parts of these smaller objects**
    - Splitting into 'words' is not always trivial (e.g. Chinese, Japanese, Korean). In this case, it might be best to skip the generic pretokenizer and instead use a language-specific library for pretokenization.

=> ["jack", "sparrow", "loves", "new", "york", "!"]

3. Tokenizer model
    - the tokenizer applies a **subword splitting model** to the words. This is the part of the pipeline that **needs to be trained on your corpus (or that has been trained if you are using a pretrained tokenizer)**
    - the goal is to split the words into subwords, which reduces both the size of the vocabulary and the number of out-of-vocabulary tokens
    - Several subword tokenization algorithms exist, including BPE, Unigram, and WordPiece
    
=> [jack, spa, rrow, loves, new, york, !]

NOTE: at this point we no longer have a list of strings but a list of integers (input IDs)

4. Postprocessing
    - some additional transformations can be applied to the list of tokens
    - e.g. adding special tokens at the beginning or the end
    - This is the last step; the resulting sequence of integers can then be fed to the model

=> a BERT-style tokenizer would add classification and separator tokens: [CLS, jack, spa, rrow, loves, new, york, !, SEP]
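The four steps above can be sketched as a toy pipeline in plain Python. Everything here is an illustrative assumption: `TOY_VOCAB` is a hand-written vocabulary (real tokenizers learn theirs from a corpus), the splitter is a greedy longest-match routine in the spirit of WordPiece, and the conversion of tokens to integer IDs is left out to keep the output readable.

```python
import re
import unicodedata

# Hand-written toy vocabulary; "##" marks a piece that continues a word
TOY_VOCAB = {"jack", "spa", "##rrow", "loves", "new", "york", "!"}


def normalize(text):
    # Strip whitespace, lowercase, and drop accents via NFD decomposition
    text = unicodedata.normalize("NFD", text.strip().lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def pretokenize(text):
    # Split into runs of word characters and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


def split_subwords(word, vocab):
    # Greedy longest-match-first splitting of one pretokenized word
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def postprocess(tokens):
    # BERT-style special tokens at both ends
    return ["[CLS]"] + tokens + ["[SEP]"]


def toy_tokenize(text):
    words = pretokenize(normalize(text))
    tokens = [p for w in words for p in split_subwords(w, TOY_VOCAB)]
    return postprocess(tokens)


print(toy_tokenize("Jack Sparrow loves New York!"))
```

Keeping the steps as separate functions mirrors how the stages can be swapped independently, for example a different normalizer with the same subword model.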
    

## Sentencepiece tokenizer

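- Based on a Unigram subword segmentation model
- Encodes each input text as a sequence of **Unicode characters** => agnostic to accents and punctuation
- Whitespace is represented by the Unicode symbol '▁' (U+2581, which looks like an underscore but is a different character) => no need to rely on language-specific pretokenizers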

In [22]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

bert_tokens,xlmr_tokens


(['[CLS]', 'Jack', 'Spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]'],
 ['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>'])

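There is no whitespace between "York" and "!", but the WordPiece tokens do not record this, so the original string cannot be recovered unambiguously. In contrast, SentencePiece keeps the whitespace information inside the tokens, so the text can be converted back without ambiguity: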

In [23]:
"".join(xlmr_tokens).replace(u"\u2581", " ")


'<s> Jack Sparrow loves New York!</s>'
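The contrast can be made explicit with two small helpers operating on the token lists printed above. Both `detokenize_sentencepiece` and `naive_wordpiece_join` are hypothetical names for this sketch; in particular, the WordPiece join is a deliberately naive rule (space between tokens, glue `##` pieces), not the decoder that ships with the library.

```python
bert_tokens = ["[CLS]", "Jack", "Spa", "##rrow", "loves", "New", "York", "!", "[SEP]"]
xlmr_tokens = ["<s>", "\u2581Jack", "\u2581Spar", "row", "\u2581love", "s",
               "\u2581New", "\u2581York", "!", "</s>"]


def detokenize_sentencepiece(tokens, special=("<s>", "</s>")):
    # Whitespace is stored in the tokens as U+2581, so joining is lossless
    pieces = [t for t in tokens if t not in special]
    return "".join(pieces).replace("\u2581", " ").strip()


def naive_wordpiece_join(tokens, special=("[CLS]", "[SEP]")):
    # WordPiece tokens carry no whitespace information, so we have to guess:
    # put a space between tokens and glue continuation pieces back on
    pieces = [t for t in tokens if t not in special]
    return " ".join(pieces).replace(" ##", "")


print(detokenize_sentencepiece(xlmr_tokens))
print(naive_wordpiece_join(bert_tokens))
```

The naive WordPiece join inserts a space before "!", which was not in the original text, while the SentencePiece tokens round-trip exactly.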

# Transformers for NER and the inner working of HuggingFace Transformers

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098136789/files/assets/nlpt_0403.png)

In the BERT paper, the authors assigned this label to the first subword (“Chr” in our example) and ignored the following subword (“##ista”). This is the convention we’ll adopt here, and we’ll indicate the ignored subwords with IGN.

## Body and Head

Transformers is organized around dedicated classes for each architecture and task. The model classes associated with different tasks are named according to a `<ModelName>For<Task>` convention, or `AutoModelFor<Task>` when using the AutoModel classes.
    

This structure is reflected in the HuggingFace Transformers code as well: the **body of a model is implemented in a class such as `BertModel` or `GPT2Model` that returns the hidden states of the last layer**. **Task-specific models** such as `BertForMaskedLM` or `BertForSequenceClassification` use the base model and **add the necessary head on top** of the hidden states.

## Create custom model

Here we build a custom token classification head for XLM-R.

(This is an exercise: HuggingFace Transformers already provides an XLM-R model for token classification called `XLMRobertaForTokenClassification`.)

In [24]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel # body only

#inherit this to load pretrained weight
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel


In [25]:
class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    # make sure standard XLM-R are used
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Load model body
        # Set add_pooling_layer to False
        #    to ensure all hidden states are returned
        #    and not only the one associated with the [CLS] token.
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Load and initialize weights from RobertaPretrainedModel
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                labels=None, **kwargs):
        # Use model body to get encoder representations
        # the only ones we need for now are input_ids and attention_mask
        outputs = self.roberta(input_ids, attention_mask=attention_mask,
                               token_type_ids=token_type_ids, **kwargs)
        # Apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output) # (bs,seq_len,num_labels)
        # Calculate losses
        loss = None
        if labels is not None: # labels size: (bs,seq_len)
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # Return model output object
        # Use TokenClassifierOutput for familiar named tuple
        return TokenClassifierOutput(loss=loss, logits=logits,
                                     hidden_states=outputs.hidden_states,
                                     attentions=outputs.attentions)

In [26]:
type(tags)

datasets.features.features.ClassLabel

In [27]:
tags.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

In [28]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

## Adjust config file

In [29]:
# adjust config file
from transformers import AutoConfig

In [30]:
xlmr_model_name

'xlm-roberta-base'

In [31]:
# adjust config file

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,
                                         num_labels=tags.num_classes,
                                         id2label=index2tag, label2id=tag2index)

In [32]:
xlmr_config

XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 5,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

## Load the custom model

In [33]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification
              .from_pretrained(xlmr_model_name, config=xlmr_config)
              .to(device))

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['roberta.embeddings.position_

In [34]:
text

'Jack Sparrow loves New York!'

In [35]:
# xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [36]:
# check model
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])

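| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Jack | ▁Spar | row | ▁love | s | ▁New | ▁York | ! | `</s>` |
| Input IDs | 0 | 21763 | 37456 | 15555 | 5161 | 7 | 2356 | 5753 | 38 | 2 |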


In [37]:
# or you can get input_ids with this
xlmr_tokenizer(text)

{'input_ids': [0, 21763, 37456, 15555, 5161, 7, 2356, 5753, 38, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [38]:
input_ids, input_ids.shape

(tensor([[    0, 21763, 37456, 15555,  5161,     7,  2356,  5753,    38,     2]]),
 torch.Size([1, 10]))

Put this into the model

In [39]:
outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")

Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])


In [40]:
tmp = xlmr_model(input_ids.to(device))

In [41]:
tmp

TokenClassifierOutput(loss=None, logits=tensor([[[ 0.2395, -0.2022,  0.1623, -0.1746,  0.0917,  0.0861,  0.2825],
         [ 0.3496, -0.4554,  0.1253, -0.3041,  0.4720,  0.3052,  0.3916],
         [ 0.2996, -0.3328,  0.1024, -0.0769,  0.4870,  0.2902,  0.4587],
         [ 0.3527, -0.4782,  0.1966, -0.2326,  0.4849,  0.4054,  0.4094],
         [ 0.3171, -0.4485,  0.1332, -0.1963,  0.4135,  0.3383,  0.4583],
         [ 0.3616, -0.4737,  0.0716, -0.3572,  0.5030,  0.3396,  0.4111],
         [ 0.3084, -0.5401, -0.0018, -0.2754,  0.5698,  0.4278,  0.4860],
         [ 0.3197, -0.5057,  0.0216, -0.2937,  0.4844,  0.3045,  0.5179],
         [ 0.2101, -0.4296,  0.1701, -0.4232,  0.5533,  0.4048,  0.3666],
         [ 0.2475, -0.2291,  0.1405, -0.1075,  0.0328,  0.1174,  0.2405]]],
       device='cuda:0', grad_fn=<AddBackward0>), hidden_states=None, attentions=None)

In [42]:
predictions

tensor([[6, 4, 4, 4, 6, 4, 4, 6, 4, 0]], device='cuda:0')

In [43]:
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])

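| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Jack | ▁Spar | row | ▁love | s | ▁New | ▁York | ! | `</s>` |
| Tags | I-LOC | I-ORG | I-ORG | I-ORG | I-LOC | I-ORG | I-ORG | I-LOC | I-ORG | O |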


Lots of wrong predictions. This makes sense, as the token classification head is randomly initialized and has not been fine-tuned yet.
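To see how the logits turn into these tags, here is a plain-Python sketch of the argmax step applied to three rows of the logits printed above (the rows for `<s>`, `▁Jack`, and `</s>`). The helpers `argmax` and `softmax` are written by hand for illustration and stand in for `torch.argmax` and a softmax over the last dimension.

```python
import math

TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Three rows copied from the logits tensor shown earlier
logits = [
    [0.2395, -0.2022, 0.1623, -0.1746, 0.0917, 0.0861, 0.2825],  # <s>
    [0.3496, -0.4554, 0.1253, -0.3041, 0.4720, 0.3052, 0.3916],  # ▁Jack
    [0.2475, -0.2291, 0.1405, -0.1075, 0.0328, 0.1174, 0.2405],  # </s>
]


def argmax(row):
    # Index of the largest value in the row
    return max(range(len(row)), key=row.__getitem__)


def softmax(row):
    # Numerically stable softmax over one row of logits
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


pred_ids = [argmax(row) for row in logits]
pred_tags = [TAG_NAMES[i] for i in pred_ids]
max_probs = [max(softmax(row)) for row in logits]

print(pred_tags)
print([round(p, 3) for p in max_probs])
```

The largest probability in each row stays close to the uniform value of 1/7, which is what one would expect from a head that has not been trained: the "predictions" are essentially arbitrary.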

Create helper function to output NER prediction from our model

In [44]:
def tag_text(text, tags, model, tokenizer):
    # Get tokens, including the special tokens (<s>, </s>)
    tokens = tokenizer(text).tokens()
    # Encode the sequence into IDs with the tokenizer passed in
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get logits over the 7 possible classes for each token
    outputs = model(input_ids)[0]
    # Take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])

# Tokenization for NER

In [45]:
de_example = panx_de["train"][0]
de_example

{'tokens': ['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'langs': ['de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de'],
 'ner_tags_str': ['O',
  'O',
  'O',
  'O',
  'B-LOC',
  'I-LOC',
  'O',
  'O',
  'B-LOC',
  'B-LOC',
  'I-LOC',
  'O']}

In [46]:
words, labels = de_example["tokens"], de_example["ner_tags"]
words, labels

(['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.'],
 [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0])

We cannot use `labels` directly: after tokenization, words are broken into subwords (e.g. Einwohnern => ▁Einwohner + n). Only the first subword, ▁Einwohner, keeps the word's tag ('O' here); the following subword, n, gets the 'ignored' label. We need to write some code for this.

In [47]:
# xlmr_tokenizer(words) 
# the tokenizer thought each word is a sentence => not good

In [48]:
tokenized_input = xlmr_tokenizer(de_example["tokens"], 
                                 is_split_into_words=True)
tokenized_input

{'input_ids': [0, 70101, 176581, 19, 142, 122, 2290, 708, 1505, 18363, 18, 23, 122, 127474, 15439, 13787, 14, 15263, 18917, 663, 6947, 19, 6, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [49]:
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame([tokens], index=["Tokens"])

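| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁2.000 | ▁Einwohner | n | ▁an | ▁der | ▁Dan | zi | ger | ▁Buch | ... | ▁Wo | i | wod | schaft | ▁Po | mmer | n | ▁ | . | `</s>` |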


`word_ids` maps each subword to the index of its word in the `words` sequence: the first subword, “▁2.000”, is assigned index 0, while “▁Einwohner” and “n” are both assigned index 1 (since “Einwohnern” is the second word in `words`).

In [50]:
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])


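| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁2.000 | ▁Einwohner | n | ▁an | ▁der | ▁Dan | zi | ger | ▁Buch | ... | ▁Wo | i | wod | schaft | ▁Po | mmer | n | ▁ | . | `</s>` |
| Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 4 | 5 | ... | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 | None |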
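One way to read the word IDs is as a recipe for gluing subwords back into words. The sketch below does this for the first ten columns of the table, with the tokens and IDs copied by hand; `group_subwords` is a hypothetical helper and not part of the tokenizer API.

```python
# First ten columns of the table above, copied by hand
tokens = ["<s>", "\u25812.000", "\u2581Einwohner", "n", "\u2581an",
          "\u2581der", "\u2581Dan", "zi", "ger", "\u2581Buch"]
word_ids = [None, 0, 1, 1, 2, 3, 4, 4, 4, 5]


def group_subwords(tokens, word_ids):
    # Concatenate all subwords that share a word ID; skip special tokens
    words = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:
            continue  # special tokens such as <s> have no word
        words[word_id] = words.get(word_id, "") + token.replace("\u2581", "")
    return [words[i] for i in sorted(words)]


print(group_subwords(tokens, word_ids))
```

The last entry is only "Buch" because the table is truncated after column 9; the remaining subword of "Bucht" sits in the elided columns.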


We can also see that special tokens like `<s>` and `</s>` are mapped to None. Let’s set -100 as the label for these special tokens and for the subwords we wish to mask during training.

Note: why -100? Because in PyTorch the **cross-entropy loss class `torch.nn.CrossEntropyLoss` has an attribute called `ignore_index` whose default value is -100**. Targets with this value are ignored when the loss is computed, so we can use it to ignore the tokens associated with consecutive subwords.
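What "ignored" means for the averaged loss can be sketched in plain Python. `masked_cross_entropy` below is a hypothetical helper written for this note; it mimics mean-reduced cross-entropy that skips positions labeled with the ignore index, and it is not PyTorch's implementation.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]  # negative log-probability of the label
        count += 1
    return total / count if count else float("nan")


# The second position is ignored, so its logits cannot affect the result
loss_a = masked_cross_entropy([[2.0, 0.0], [0.0, 2.0]], [0, IGNORE_INDEX])
loss_b = masked_cross_entropy([[2.0, 0.0], [9.0, -9.0]], [0, IGNORE_INDEX])
print(loss_a, loss_b)
```

Because ignored positions are left out of both the sum and the count, special tokens and trailing subwords neither add to the loss nor dilute the average.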

In [51]:
previous_word_idx = None
label_ids = []

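for word_idx in word_ids:
    # Mask special tokens (word_idx is None) and non-initial subwords of a word
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    # The first subword of each word keeps the word-level label
    else:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx

# Map label IDs back to tag names for display; note this overwrites `labels`
labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]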


In [52]:
print(labels)

['IGN', 'O', 'O', 'IGN', 'O', 'O', 'B-LOC', 'IGN', 'IGN', 'I-LOC', 'IGN', 'O', 'O', 'B-LOC', 'IGN', 'B-LOC', 'IGN', 'IGN', 'IGN', 'I-LOC', 'IGN', 'IGN', 'O', 'IGN', 'IGN']


In [53]:
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)

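| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁2.000 | ▁Einwohner | n | ▁an | ▁der | ▁Dan | zi | ger | ▁Buch | ... | ▁Wo | i | wod | schaft | ▁Po | mmer | n | ▁ | . | `</s>` |
| Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 4 | 5 | ... | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 | None |
| Label IDs | -100 | 0 | 0 | -100 | 0 | 0 | 5 | -100 | -100 | 6 | ... | 5 | -100 | -100 | -100 | 6 | -100 | -100 | 0 | -100 | -100 |
| Labels | IGN | O | O | IGN | O | O | B-LOC | IGN | IGN | I-LOC | ... | B-LOC | IGN | IGN | IGN | I-LOC | IGN | IGN | O | IGN | IGN |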
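A quick sanity check on the alignment rule used above: after masking, the number of unmasked label IDs should equal the number of words, since each word keeps exactly one label. The sketch below tests this on the first ten columns of the table, with the word IDs and labels hardcoded so that no tokenizer is needed. `align_labels` is a hypothetical standalone name for the same loop.

```python
def align_labels(word_ids, word_labels, ignore_id=-100):
    """Give each word's label to its first subword and mask everything else."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            label_ids.append(ignore_id)  # special token or continuation subword
        else:
            label_ids.append(word_labels[word_idx])  # first subword of a new word
        previous_word_idx = word_idx
    return label_ids


# First ten columns of the table above, and the labels of the first six words
word_ids = [None, 0, 1, 1, 2, 3, 4, 4, 4, 5]
word_labels = [0, 0, 0, 0, 5, 6]

aligned = align_labels(word_ids, word_labels)
print(aligned)

# Invariant: exactly one unmasked label per word
kept = [label for label in aligned if label != -100]
print(kept == word_labels)
```

If this invariant ever fails on real data, a likely cause is truncation cutting off the end of a sentence, in which case fewer labels than words survive.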


Define a function that wraps all of this up:

In [54]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True,
                                      is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Here the input `examples` is a batch from a Hugging Face Datasets `Dataset`, such as `panx_de['train'][:10]`.

In [55]:
panx_de['train'][:2]

{'tokens': [['2.000',
   'Einwohnern',
   'an',
   'der',
   'Danziger',
   'Bucht',
   'in',
   'der',
   'polnischen',
   'Woiwodschaft',
   'Pommern',
   '.'],
  ['Sie',
   'geht',
   'hinter',
   'Walluf',
   'nahtlos',
   'in',
   'die',
   'Bundesautobahn',
   '66',
   'über',
   '.']],
 'ner_tags': [[0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
  [0, 0, 0, 3, 0, 0, 0, 3, 4, 0, 0]],
 'langs': [['de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de',
   'de'],
  ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']],
 'ner_tags_str': [['O',
   'O',
   'O',
   'O',
   'B-LOC',
   'I-LOC',
   'O',
   'O',
   'B-LOC',
   'B-LOC',
   'I-LOC',
   'O'],
  ['O', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O']]}

Next, write a function that wraps `map` and accepts a `DatasetDict`:

In [56]:
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True,
                      remove_columns=['langs', 'ner_tags', 'tokens'])
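As a rough mental model of what a batched `map` with `remove_columns` does, the plain-Python emulation below passes slices of a dict of columns to a function, merges the returned columns into the result, and drops the listed columns afterwards. This is a simplified sketch, not the Datasets implementation: `batched_map` and `count_tokens` are hypothetical names, and the toy data is the two-sentence batch shown above.

```python
def batched_map(columns, fn, batch_size=2, remove_columns=()):
    """Apply fn to slices of a dict of columns, roughly like map(batched=True)."""
    num_rows = len(next(iter(columns.values())))
    out = {name: [] for name in columns if name not in remove_columns}
    for start in range(0, num_rows, batch_size):
        # fn sees every original column, each as a list of up to batch_size items
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        result = fn(batch)
        # Keep original columns that are neither removed nor replaced by fn
        for name, values in batch.items():
            if name in out and name not in result:
                out[name].extend(values)
        # Add (or replace) the columns returned by fn
        for name, values in result.items():
            out.setdefault(name, []).extend(values)
    return out


def count_tokens(batch):
    """Toy stand-in for tokenize_and_align_labels: one new value per example."""
    return {"n_tokens": [len(tokens) for tokens in batch["tokens"]]}


toy = {
    "tokens": [
        ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht", "in", "der",
         "polnischen", "Woiwodschaft", "Pommern", "."],
        ["Sie", "geht", "hinter", "Walluf", "nahtlos", "in", "die",
         "Bundesautobahn", "66", "über", "."],
    ],
    "ner_tags": [[0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
                 [0, 0, 0, 3, 0, 0, 0, 3, 4, 0, 0]],
}

encoded = batched_map(toy, count_tokens, batch_size=1, remove_columns=["tokens"])
print(sorted(encoded), encoded["n_tokens"])
```

The function still reads `tokens` inside each batch even though that column is dropped from the result, which is why `tokenize_and_align_labels` can consume `tokens` and `ner_tags` while the encoded dataset keeps only the model inputs and `labels`.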

In [57]:
panx_de_encoded = encode_panx_dataset(panx_ch["de"])

Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/xtreme/PAN-X.de/1.0.0/349258adc25bb45e47de193222f95e68a44f7a7ab53c4283b3f007208a11bf7e/cache-12ac6485871107b9.arrow


  0%|          | 0/7 [00:00<?, ?ba/s]

Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/xtreme/PAN-X.de/1.0.0/349258adc25bb45e47de193222f95e68a44f7a7ab53c4283b3f007208a11bf7e/cache-c2045239f83cd18a.arrow


In [58]:
panx_de_encoded['train']

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 12580
})

In [59]:
print(panx_de_encoded['train'][0])

{'input_ids': [0, 70101, 176581, 19, 142, 122, 2290, 708, 1505, 18363, 18, 23, 122, 127474, 15439, 13787, 14, 15263, 18917, 663, 6947, 19, 6, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 0, 0, -100, 0, 0, 5, -100, -100, 6, -100, 0, 0, 5, -100, 5, -100, -100, -100, 6, -100, -100, 0, -100, -100]}


# Define metrics

For NER it is common to report precision, recall, and F1-score. The only subtlety is that all words of an entity need to be predicted correctly for a prediction to count as correct.

We will use the seqeval library for this.

```
seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
```

In [60]:
from seqeval.metrics import classification_report

In [62]:
y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

        MISC       1.00      1.00      1.00         1
         PER       1.00      1.00      1.00         1

   micro avg       1.00      1.00      1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



In [61]:
y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

        MISC       0.00      0.00      0.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2
weighted avg       0.50      0.50      0.50         2
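The 0.00 score for MISC in the second report follows from exact-span matching: the predicted MISC entity starts one token too early, so its span differs from the true one and earns no partial credit. The sketch below reproduces the micro-averaged numbers with a simplified IOB2 span extractor. It is not seqeval's implementation, and `extract_entities` and `entity_prf` are hypothetical names.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a list of IOB2 tags."""
    entities = set()
    start = etype = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if not continues:
            if etype is not None:
                entities.add((etype, start, i))  # close the open entity
            if prefix in ("B", "I"):
                start, etype = i, label  # open a new entity
            else:
                start = etype = None
    return entities


def entity_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact entity spans."""
    true_spans = {(i, *e) for i, tags in enumerate(y_true) for e in extract_entities(tags)}
    pred_spans = {(i, *e) for i, tags in enumerate(y_pred) for e in extract_entities(tags)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]

scores = entity_prf(y_true, y_pred)
print(scores)
```

Only the PER span matches exactly, so one of two predicted entities is correct and one of two true entities is found, which gives the 0.50 micro average in the report.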



We need a function that takes the outputs of the model and converts them into the lists that seqeval expects, ignoring positions whose label ID is -100.

In [64]:
import numpy as np

In [65]:
def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])

        labels_list.append(example_labels)
        preds_list.append(example_preds)

    # Note the return order: predictions first, then labels
    return preds_list, labels_list
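To check the expected shapes and the masking, the same filtering logic can be run on plain nested lists. The sketch below is a list-based re-expression of `align_predictions` under stated assumptions: `align_predictions_lists` is a hypothetical name, `toy_index2tag` is a three-label toy mapping rather than the notebook's `index2tag`, and the logits are made up.

```python
def align_predictions_lists(predictions, label_ids, index2tag):
    """List-based variant: predictions is [batch][seq][num_labels] logits."""
    preds_list, labels_list = [], []
    for example_logits, example_label_ids in zip(predictions, label_ids):
        example_preds, example_labels = [], []
        for logits, label_id in zip(example_logits, example_label_ids):
            if label_id == -100:
                continue  # skip special tokens and continuation subwords
            pred_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
            example_preds.append(index2tag[pred_id])
            example_labels.append(index2tag[label_id])
        preds_list.append(example_preds)
        labels_list.append(example_labels)
    return preds_list, labels_list


toy_index2tag = {0: "O", 1: "B-LOC", 2: "I-LOC"}  # toy mapping for this sketch only

# One example with five token positions and three label scores per position
predictions = [[[0.1, 0.2, 0.7],   # masked position, its prediction is dropped
                [2.0, 0.1, 0.3],   # argmax 0
                [0.2, 3.0, 0.1],   # argmax 1
                [0.5, 0.4, 0.1],   # masked position
                [0.1, 0.2, 1.5]]]  # argmax 2
label_ids = [[-100, 0, 1, -100, 2]]

preds, labels = align_predictions_lists(predictions, label_ids, toy_index2tag)
print(preds, labels)
```

Both returned lists have three entries for this example, one per unmasked position, so they can be compared tag by tag.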

# Finetuning XLM-R