# 4 - Multilingual named entity recognition

So far in this book we have applied transformers to solve NLP tasks on English corpora - but what do you do when your documents are written in Greek, Swahili, or Klingon? One approach is to search the Hugging Face Hub for a suitable pretrained language model and fine-tune it on the task at hand. However, these pretrained models tend to exist only for "high-resource" languages like German, Russian, or Mandarin, where plenty of webtext is available for pretraining. Another common challenge arises when your corpus is multilingual: maintaining multiple monolingual models in production will not be any fun for you or your engineering team.

Fortunately, there is a class of multilingual transformers that come to the rescue. Like BERT, these models use masked language modeling as a pretraining objective, but they are trained jointly on texts in over one hundred languages. By pretraining on huge corpora across many languages, these multilingual transformers enable **zero-shot cross-lingual transfer**. This means that a model that is fine-tuned on one language can be applied to others without any further training! This also makes these models well suited for "code-switching", where a speaker alternates between two or more languages or dialects in the context of a single conversation.

In this chapter, we will focus on the encoder-only model XLM-RoBERTa ([Conneau et al., 2019](https://arxiv.org/abs/1911.02116)), which can be fine-tuned to perform named entity recognition (NER) across several languages. NER is a common NLP task that identifies entities like people, organizations, or locations in text. These entities can be used for various applications such as gaining insights from company documents, augmenting the quality of search engines, or simply building a structured database from a corpus.

For this chapter, let's assume that we want to perform NER for a customer based in Switzerland, where there are four national languages (with English often serving as a bridge between them). Let's start by getting a suitable multilingual corpus for this problem.

## 4.1 - The dataset

We are going to use a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many languages, including the four most commonly spoken languages in Switzerland: German (62.9%), French (22.9%), Italian (8.4%), and English (5.9%). Each article is annotated with <code>LOC</code> (location), <code>PER</code> (person), and <code>ORG</code> (organization) tags in the ["inside-outside-beginning (IOB2) format"](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). 

In the IOB2 format, a <code>B-</code> prefix indicates the beginning of an entity, and consecutive tokens belonging to the same entity are given an <code>I-</code> prefix. An <code>O</code> tag indicates that the token does not belong to any entity. For example, the figure below shows how the following sentence is labeled in IOB2 format:

`Jeff Dean is a computer scientist at Google in California`

<img src="images/ner_example.png" title="" alt="" width="500" data-align="center">
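
To make the IOB2 rules concrete, here is a minimal sketch in plain Python that groups tagged tokens back into entities. The tags assigned to the example sentence are our own reading of the rules above, and `iob2_to_spans()` is a hypothetical helper written for illustration; it is not part of any library used in this chapter.

```python
from typing import List, Tuple

tokens = ["Jeff", "Dean", "is", "a", "computer", "scientist", "at", "Google", "in", "California"]
# Assumed IOB2 labels for the sentence, derived from the rules described above
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # An O tag (or a stray I- tag) closes any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    # Flush an entity that runs to the end of the sentence
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


entities = iob2_to_spans(tokens, tags)
print(entities)  # [('Jeff Dean', 'PER'), ('Google', 'ORG'), ('California', 'LOC')]
```

Because every entity starts with its own <code>B-</code> tag, two adjacent entities of the same type remain distinguishable, which is the main advantage of IOB2 over a scheme that only marks entity types.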

To load one of the <code>PAN-X</code> subsets in <code>XTREME</code>, we'll need to know which dataset configuration to pass to the `load_dataset()` function. Whenever you are dealing with a dataset that has multiple domains, you can use the <code>get_dataset_config_names()</code> function to find out which subsets are available:

In [None]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

Whoa, that is a lot of configurations! Let's narrow the search by just looking for the configurations that start with "PAN":

In [None]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
print(len(panx_subsets))
panx_subsets[:3]

It appears that there are 40 different configurations for the `PAN-X` subset data, where each one has a two-letter suffix that appears to be an [ISO 639-1 language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). This means that to load the German corpus, we would have to do the following:

In [None]:
from datasets import load_dataset

ger_data = load_dataset("xtreme", name="PAN-X.de")

To make a realistic Swiss corpus, we'll sample the German (`de`), French (`fr`), Italian (`it`), and English (`en`) corpora according to their spoken proportions. This will create a language imbalance that is very common in real-world datasets, where acquiring labeled examples in a minority language can be expensive due to the lack of domain experts who are fluent in that language. This imbalanced dataset will simulate a common situation when working on multilingual applications, and we'll see how we can build a model that works on all languages.

To keep track of each language, let's create a Python `defaultdict` that stores the language code as the key and a `PAN-X` corpus of type [DatasetDict](https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.DatasetDict) as the value.

---

<mark><b>Note:</b></mark> `defaultdict` is a subclass of the built-in `dict` class. The two behave almost identically, [except that a <code>defaultdict</code> created with a default factory never raises a <code>KeyError</code>](https://www.geeksforgeeks.org/defaultdict-in-python/) on lookup: when a key does not exist, it calls the factory to create a default value, stores it under that key, and returns it.

---
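
The difference described in the note can be seen in a short sketch. Here we use the built-in `dict` as a stand-in default factory (an assumption made to keep the example free of downloads); in the chapter's code the factory is `DatasetDict`, but the lookup behavior is the same.

```python
from collections import defaultdict

# A plain dict raises KeyError when we index into a key that was never set
plain = {}
try:
    plain["de"]["train"] = "German training split"
except KeyError as err:
    missing_key = err.args[0]
    print(f"plain dict raised KeyError for key {missing_key!r}")

# defaultdict(dict) calls dict() to create the missing inner mapping on first access
nested = defaultdict(dict)
nested["de"]["train"] = "German training split"
nested["de"]["test"] = "German test split"

# Merely looking up a missing key also inserts it with the default value
had_fr_before = "fr" in nested
_ = nested["fr"]
has_fr_after = "fr" in nested
print(had_fr_before, has_fr_after)  # False True
```

The last three lines highlight a side effect worth remembering: reading a missing key from a `defaultdict` creates it, so use `in` or `.get()` when you only want to test for membership.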

In [None]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows)))
        )

Here we used the `shuffle()` method to make sure we don't accidentally bias our dataset splits, while `select()` allows us to downsample each corpus according to the values in `fracs`. Let's have a look at how many examples we have per language in the training sets by accessing the `Dataset.num_rows` attribute:

In [None]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])
# panx_ch
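
The shuffle-then-select pattern used for the downsampling can also be illustrated without downloading any data. The sketch below is a plain-Python stand-in, not how the `datasets` library implements `shuffle()` or `select()`; the `downsample()` helper and the toy `corpus` are hypothetical names introduced only for this example.

```python
import random


def downsample(rows, frac, seed=0):
    """Shuffle a copy of `rows` with a fixed seed and keep the first int(frac * n) items."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible, like shuffle(seed=0)
    random.Random(seed).shuffle(shuffled)
    # Taking a prefix of the shuffled list mirrors select(range(k))
    return shuffled[: int(frac * len(shuffled))]


# Toy corpus of 1,000 "sentences" standing in for one monolingual split
corpus = [f"sentence-{i}" for i in range(1000)]
fracs = {"de": 0.629, "fr": 0.229, "it": 0.084, "en": 0.059}

sizes = {lang: len(downsample(corpus, frac)) for lang, frac in fracs.items()}
print(sizes)
```

Shuffling before taking the prefix matters: without it, we would always keep the first rows of each split, which could over-represent whatever ordering the original corpus happens to have.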

If we look at `panx_ch` we can see that each language is split into `train`, `validation`, and `test` sets. We can also see that we have more examples in German than in all other languages combined, so we'll use it as a starting point from which to perform zero-shot cross-lingual transfer to French, Italian, and English. Let's inspect one of the examples in the German corpus:

In [None]:
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")

The keys of our example correspond to the column names of an Arrow table, while the values denote the entries in each column. The `ner_tags` column corresponds to the mapping of each entity to a class ID. However, this is a bit cryptic to the human eye, so let's transform those IDs into our familiar `LOC`, `PER`, and `ORG` tags. To do this, we can take advantage of the `features` attribute that specifies the underlying data types associated with each column:

In [None]:
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

The `Sequence` class specifies that the field contains a list of features, which in the case of `ner_tags` corresponds to a list of `ClassLabel` features. Let’s pick out this feature from the training set as follows:

In [None]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

We can use the `ClassLabel.int2str()` method to create a new column in our training set with class names for each tag. We'll use the `map()` method to return a `dict` with the key corresponding to the new column name and the value as a `list` of class names:

In [22]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)

Now that we have our tags in human-readable format, let's see how the tokens and tags align for the first example in the training set:

In [23]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],['Tokens', 'Tags'])

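|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |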


The presence of the `LOC` tags makes sense, since the sentence "2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern" means "2,000 inhabitants at the Gdansk Bay in the Polish voivodeship of Pomerania" in English. Gdansk Bay is a bay in the Baltic Sea, while "voivodeship" corresponds to a state in Poland.
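
To see what the ID-to-name conversion does under the hood, here is a minimal sketch that mimics `ClassLabel.int2str()` with a plain list lookup. The label order in `label_names` and the integer IDs in `example` are assumptions chosen to be consistent with the tags shown above; check them against the output of `print(tags)` before relying on them.

```python
# Assumed label order for the PAN-X ClassLabel feature (verify with print(tags))
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def int2str(idx: int) -> str:
    """Stand-in for ClassLabel.int2str(): map a class ID to its name."""
    return label_names[idx]


def create_tag_names(example: dict) -> dict:
    """Return the new column as a dict, in the shape that map() expects."""
    return {"ner_tags_str": [int2str(idx) for idx in example["ner_tags"]]}


# The German example from above, with IDs that are hypothetical under the assumed order
example = {
    "tokens": ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht",
               "in", "der", "polnischen", "Woiwodschaft", "Pommern", "."],
    "ner_tags": [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
}

result = create_tag_names(example)
for token, tag in zip(example["tokens"], result["ner_tags_str"]):
    print(f"{token:<14}{tag}")
```

Since `map()` is called here without `batched=True`, the function receives one example at a time, so `example["ner_tags"]` is a single list of IDs rather than a list of lists.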

As a quick check that we don’t have any unusual imbalance in the tags, let’s calculate the frequencies of each entity across each split:

In [24]:
from collections import Counter

split2freqs = defaultdict(Counter)

for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
                
pd.DataFrame.from_dict(split2freqs, orient="index")

|  | LOC | ORG | PER |
| --- | --- | --- | --- |
| train | 6186 | 5366 | 5810 |
| validation | 3172 | 2683 | 2893 |
| test | 3180 | 2573 | 3071 |


This looks good: the distributions of the `PER`, `LOC`, and `ORG` frequencies are roughly the same for each split, so the validation and test sets should provide a good measure of our NER tagger's ability to generalize. Next, let’s look at a few popular multilingual transformers and how they can be adapted to tackle our NER task.