# Multilingual Named Entity Recognition

We'll use the [xtreme](https://github.com/google-research/xtreme) dataset from the Hugging Face Hub, which contains [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))-tagged tokens. This task is adapted from the O'Reilly book *Natural Language Processing with Transformers*.
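To make the IOB2 scheme concrete, here is a minimal sketch that groups tagged tokens into entity spans. The helper name `iob2_to_spans` and the example sentence are illustrative assumptions and are not taken from the dataset. A `B-` tag opens an entity, `I-` tags of the same type extend it, and `O` closes it.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence, for illustration only
example_tokens = ["Jeff", "Dean", "works", "at", "Google", "in", "California"]
example_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
spans = iob2_to_spans(example_tokens, example_tags)
print(spans)
```

Stray `I-` tags are dropped here for simplicity; a stricter decoder might raise an error instead.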

In [1]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


Look for the configurations whose names start with "PAN". These are the PAN-X subsets, one per language, each suffixed with an `ISO 639-1` language code, which lets us run NER on different languages.

In [2]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']
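As a small sketch of how the language code can be read off a configuration name, the snippet below splits on the first dot. The variable names `example_subsets` and `lang_codes` are assumptions introduced for illustration; the three names are the ones printed above.

```python
# The three configuration names shown in the output above
example_subsets = ["PAN-X.af", "PAN-X.ar", "PAN-X.bg"]

# Everything after the first "." is the language suffix
lang_codes = [name.split(".", 1)[1] for name in example_subsets]
print(lang_codes)  # ['af', 'ar', 'bg']
```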

## Create a Swiss-like Multilingual Dataset

To emulate a Swiss corpus, downsample the German, French, Italian, and English subsets in roughly the proportions in which those languages are spoken in Switzerland.

In [3]:
from collections import defaultdict
from datasets import DatasetDict, load_dataset

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))
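The downsampling above keeps `int(frac * num_rows)` examples per split. The sketch below works through that arithmetic for the training split, under the assumption (not verified here for every language) that each PAN-X training split holds 20,000 rows. The names `assumed_rows` and `expected_train_rows` are hypothetical.

```python
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]

# Assumption: every monolingual training split has 20,000 rows
assumed_rows = 20_000

# Same truncating formula as the select(range(...)) call above
expected_train_rows = {
    lang: int(frac * assumed_rows) for lang, frac in zip(langs, fracs)
}
for lang, n in expected_train_rows.items():
    print(f"{lang}: {n} training examples")
```

With the real data loaded, the actual counts can be read from `panx_ch[lang]["train"].num_rows` and compared against these estimates.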

In [15]:
# show the features of the training set for German
panx_ch["de"]["train"].features

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None),
 'langs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
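The `ner_tags` column stores integer class ids, and the `ClassLabel` names listed above give their meaning. As a minimal sketch, the snippet below maps the ids of one German sentence from this dataset back to tag strings with a plain list lookup. The variable names are assumptions, and the `datasets` library's `ClassLabel.int2str` method offers the same mapping on the real feature object.

```python
# Label names in the order reported by the ClassLabel feature above
tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# One German training sentence and its integer tags
sentence_tokens = ["als", "Teil", "der", "Savoyer", "Voralpen", "im", "Osten", "."]
sentence_tag_ids = [0, 0, 0, 5, 6, 0, 0, 0]

# Each id indexes into the label list
sentence_tags = [tag_names[i] for i in sentence_tag_ids]
for token, tag in zip(sentence_tokens, sentence_tags):
    print(f"{token:<10}{tag}")
```

Here "Savoyer Voralpen" comes out as a single location entity (`B-LOC` followed by `I-LOC`).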

In [17]:
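# Show the German training data as a pandas DataFrame.
# Note: `.data` is the underlying Arrow table, so it ignores the
# shuffle/select above and lists every row of the original split.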
panx_ch["de"]["train"].data.to_pandas()

| | tokens | ner_tags | langs |
|---|---|---|---|
| 0 | `[als, Teil, der, Savoyer, Voralpen, im, Osten, .]` | `[0, 0, 0, 5, 6, 0, 0, 0]` | `[de, de, de, de, de, de, de, de]` |
| 1 | `[WEITERLEITUNG, Antonina, Wladimirowna, Kriwos...` | `[0, 1, 2, 2]` | `[de, de, de, de]` |
| 2 | `[**, '', Lou, Salomé, '', .]` | `[0, 0, 1, 2, 0, 0]` | `[de, de, de, de, de, de]` |
| 3 | `[Spieler, vom, SKA, Sankt, Petersburg, ausgewä...` | `[0, 0, 3, 4, 4, 0, 0, 0, 0, 3, 4, 4, 0, 0, 0, ...` | `[de, de, de, de, de, de, de, de, de, de, de, d...` |
| 4 | `[Jaan, Kirsipuu, 74, P, .]` | `[1, 2, 0, 0, 0]` | `[de, de, de, de, de]` |
| ... | ... | ... | ... |
| 19995 | `[**, ', '', Grafschaft, Edessa, '', ']` | `[0, 0, 0, 3, 4, 0, 0]` | `[de, de, de, de, de, de, de]` |
| 19996 | `[1358, hielt, er, sich, in, Padua, auf, und, m...` | `[0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]` | `[de, de, de, de, de, de, de, de, de, de, de, d...` |
| 19997 | `[***, ', '', Hochstift, Chiemsee, '', ']` | `[0, 0, 0, 5, 6, 0, 0]` | `[de, de, de, de, de, de, de]` |
| 19998 | `[Peter, Falk, –, Columbo]` | `[1, 2, 0, 3]` | `[de, de, de, de]` |
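To get a feel for the label distribution, one option is to count how many entities of each type start in a set of rows. The sketch below does this for four of the rows shown above by counting `B-` tags. It covers only that small hand-copied sample, so the counts say nothing about the full corpus, and the names `sample_tag_ids` and `entity_counts` are assumptions.

```python
from collections import Counter

tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# ner_tags of rows 0, 2, 4 and 19998 from the table above
sample_tag_ids = [
    [0, 0, 0, 5, 6, 0, 0, 0],
    [0, 0, 1, 2, 0, 0],
    [1, 2, 0, 0, 0],
    [1, 2, 0, 3],
]

# Every B- tag marks the start of exactly one entity
entity_counts = Counter(
    tag_names[i][2:]
    for row in sample_tag_ids
    for i in row
    if tag_names[i].startswith("B-")
)
print(entity_counts)
```

Counting only `B-` tags avoids double-counting multi-token entities such as "Savoyer Voralpen".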
