# Language Detection with xlm-roberta-base-language-detection
* Notebook by Adam Lang
* Date: 8/28/2024

# Overview
* In this notebook we use a well-known Hugging Face model to perform language detection.
* The model is an XLM-RoBERTa transformer with a classification head on top (i.e. a linear layer on top of the pooled output).
* According to the model card, it supports 20 languages.
* Model card: https://huggingface.co/papluca/xlm-roberta-base-language-detection

# Import Libraries
* We need:
1. transformers
2. datasets (from huggingface)
3. plotly-express

In [1]:
## pip install
!pip install transformers datasets plotly-express

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting plotly-express
  Downloading plotly_express-0.4.1-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading plotly_express-0.4.1-py2.py3-none-any.whl (2.9 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [2]:
## imports
from datasets import load_dataset ##huggingface
from transformers import pipeline ##huggingface
import pandas as pd
import plotly.express as px

## other imports
import torch
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Multilingual Dataset
* We will test the model on the test split of the Language Identification dataset, which is the dataset the model was fine-tuned on.
* The dataset has 90,000 text passages in total (70,000 train, 10,000 validation, 10,000 test).
* Dataset link: https://huggingface.co/datasets/papluca/language-identification

In [3]:
## get dataset from huggingface
dataset = load_dataset('papluca/language-identification', split="test")

Downloading readme:   0%|          | 0.00/4.99k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.71M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.69M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/70000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [4]:
## view dataset
print(dataset)

Dataset({
    features: ['labels', 'text'],
    num_rows: 10000
})


In [5]:
## convert dataset to pandas df
df = pd.DataFrame(dataset).drop('labels', axis=1)

## keep only first 100
df_sub = df[:100]
df_sub.head()

| | text |
|---|---|
| 0 | Een man zingt en speelt gitaar. |
| 1 | De technologisch geplaatste Nasdaq Composite I... |
| 2 | Es muy resistente la parte trasera rígida y lo... |
| 3 | "In tanti modi diversi, l'abilità artistica de... |
| 4 | منحدر يواجه العديد من النقاشات المتجهه إزاء ال... |


In [6]:
## random sample of text
df_sub.sample(2)

| | text |
|---|---|
| 67 | Капитан Блъд свали шапката си и се поклони тих... |
| 63 | Profaili ya muda wa Kenneth Starr inamwonyesha... |


# Language Detection Model
* We will get the model from the Hugging Face Hub. Passing `device=0` to `pipeline` runs inference on the first GPU; the call below omits it, so the pipeline's default device is used.
* We use the Hugging Face `pipeline` API for immediate inference. It works out of the box, without loading the model and the tokenizer separately: https://huggingface.co/docs/transformers/main_classes/pipelines

In [8]:
## get model from hf hub
model = pipeline(
    'text-classification',
    model="papluca/xlm-roberta-base-language-detection",
)
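
Besides tokenizing the text and running the model, the pipeline turns the classification head's raw scores into a label and a confidence. The sketch below illustrates that last step only, assuming a softmax over the scores followed by picking the highest probability. The three-entry label mapping and the logits are made up for illustration; the real model has 20 output classes and its own mapping.

```python
import math

# Hypothetical label mapping and logits, for illustration only.
ID2LABEL = {0: "en", 1: "fr", 2: "nl"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, id2label=ID2LABEL):
    """Return the top label and its probability, shaped like a pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


result = classify([0.2, 4.1, 1.3])
print(result["label"], round(result["score"], 3))
```

The returned dict mirrors the `label` / `score` keys that the text-classification pipeline emits for each input.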

Use model to detect language of each text passage.

In [11]:
from tqdm import tqdm
## detect language of each passage
all_text = df_sub['text'].values.tolist()

## apply model pipeline to data
all_lang = model(all_text)

## work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df_sub = df_sub.copy()

## get language detection label
df_sub['language_label'] = [d['label'] for d in tqdm(all_lang)]

## print head
df_sub.head()

100%|██████████| 100/100 [00:00<00:00, 472864.04it/s]


| | text | language_label |
|---|---|---|
| 0 | Een man zingt en speelt gitaar. | nl |
| 1 | De technologisch geplaatste Nasdaq Composite I... | nl |
| 2 | Es muy resistente la parte trasera rígida y lo... | es |
| 3 | "In tanti modi diversi, l'abilità artistica de... | it |
| 4 | منحدر يواجه العديد من النقاشات المتجهه إزاء ال... | ar |
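
Each pipeline result is a dict with a `label` and a `score`, and the cell above keeps only the label. A minimal sketch of also keeping the score and flagging low-confidence predictions for manual review follows. The sample results and the 0.9 threshold are hypothetical; real values would come from `model(all_text)`.

```python
# Hypothetical pipeline-style results, for illustration only.
sample_results = [
    {"label": "nl", "score": 0.998},
    {"label": "es", "score": 0.997},
    {"label": "sw", "score": 0.612},
]


def split_by_confidence(results, threshold=0.9):
    """Return labels, scores, and the positions of results scoring below threshold."""
    labels = [r["label"] for r in results]
    scores = [r["score"] for r in results]
    uncertain = [i for i, s in enumerate(scores) if s < threshold]
    return labels, scores, uncertain


labels, scores, uncertain = split_by_confidence(sample_results)
print(labels)
print(uncertain)
```

In the notebook, the `labels` and `scores` lists could be stored as two DataFrame columns, so that uncertain rows can be filtered later.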


In [16]:
## we can see the value_counts
df_sub['language_label'].value_counts()

| language_label | count |
|---|---|
| ru | 9 |
| tr | 7 |
| fr | 7 |
| el | 7 |
| nl | 6 |
| it | 6 |
| ar | 6 |
| ur | 6 |
| es | 6 |
| bg | 5 |


In [13]:
## show languages detected
plot_title = "Languages Detected"
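## px `labels` maps column names (not axis names) to display names
labels = {
    "language_label": "Language"
}
fig = px.histogram(df_sub, x="language_label", template="plotly_dark",
                   title=plot_title,
                   labels=labels)
## the y-axis of a histogram is a computed count, so set its title on the layout
fig.update_layout(yaxis_title="Count of text")

fig.show()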

## Seeing if the model worked
* Let's spot-check the predictions for a few specific languages.

In [14]:
## show french
text_french = df_sub[df_sub['language_label'] == 'fr']

## print french (at most the first 10 rows)
print('French Text identified:')
for _, row in text_french.head(10).iterrows():
  print(f"- {row['text']}")

French Text identified:
- Bonjour, Le produit est conforme à la description, reste a voir la durée de vie des cartouches Voila pour les commentaires
- Petite qualité les véhicules se sont décollés après quelques heures dans les mains des enfants les enfants que je garde ont moins de trois ans j’ai donc retiré les véhicules de leurs jeux
- Expédition rapide. Produit bien protégé dans une boite carton. Le rapport qualité prix est excellent. Le produit est visuellement de qualité. A voir maintenant dans son utilisation sur du long terme.
- Cartes plastifiées idéales pour jouer près ou dans la piscine. Attention le jeux n a pas les memes règles que l original. Un bon complément au jungle speed original
- Fonctionne très bien
- Très bonne coque rentre parfaitement sur le téléphone coque bien épaisse elle est de très bonne qualité je conseille cette coque
- pour une question de prix et de ne pas avoir a me deplacer


In [15]:
## show russian text
text_russian = df_sub[df_sub['language_label'] == 'ru']

## print russian (at most the first 10 rows)
print('Russian Text:')
for _, row in text_russian.head(10).iterrows():
  print(f"- {row['text']}")

Russian Text:
- Через каждые сто градусов пятна краски меняют свой цвет, она может быть красной и изменить цвет на синий.
- Если когда-нибудь я буду писать автобиографию, это будут словарные названия мест и имена людей, определенные в контексте личной приоритетности.
- Мы, возможно, не имеем все, что нам нужно, или все, что мы видим у других людей, однако она заверила, что у нас есть все необходимое, что нам нужно.
- Поэтапные закупки позволяют сократить риски благодаря своевременному выявлению проблем, что облегчает внесение изменений или исправлений.
- В конце концов, руководитель контролирует передачу функций в области информационных технологий и управления директору по ИТ, департаменту ИТ и иным подразделениям.
- Ч.П. Сноу писал о двух культурах, точных и гуманитарных науках, никогда их не смешивая.
- Но я торопился высадить тебя.
- Похоже никто не знает, играются ли эти виды спорта на корте с сеткой, у стены или и то и другое.
- Временная приостановка юридического представительств
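
Reading the output by eye only goes so far. Because the dataset ships a `labels` column (dropped earlier in this notebook), the predictions can also be scored against the ground truth. The helpers below are a minimal sketch, assuming that column holds the same language codes the model emits; the function names and the toy inputs are hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def mismatches(predicted, expected):
    """List (position, predicted, expected) for every wrong prediction."""
    return [
        (i, p, e)
        for i, (p, e) in enumerate(zip(predicted, expected))
        if p != e
    ]


# Toy data for illustration only.
toy_predicted = ["nl", "es", "it", "ar"]
toy_expected = ["nl", "es", "fr", "ar"]
print(accuracy(toy_predicted, toy_expected))
print(mismatches(toy_predicted, toy_expected))
```

In this notebook, `expected` could be `dataset['labels'][:100]` and `predicted` could be `df_sub['language_label'].tolist()`, under the assumption stated above.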

# Summary
* Using a neural transformer model for language detection works quite well.
* XLM-RoBERTa (XLM-R) models have been shown to outperform multilingual BERT (mBERT), as seen in the paper: https://arxiv.org/pdf/1911.02116

* The task of language detection seems rather easy: isn't it enough to count how many words of a passage occur in each language-specific dictionary and then return the language with the highest count?
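
The dictionary-counting idea can be sketched in a few lines. The word lists below are tiny and hypothetical, chosen only to illustrate the approach; a real dictionary would hold many thousands of words per language.

```python
# Tiny hypothetical word lists, for illustration only.
DICTIONARIES = {
    "en": {"the", "and", "is", "a", "man", "plays"},
    "nl": {"de", "een", "en", "is", "man", "speelt", "zingt"},
    "fr": {"le", "la", "et", "est", "un", "très"},
}


def detect_by_dictionary(text):
    """Count dictionary hits per language and return (best language, counts)."""
    words = [w.strip(".,!?;:\"'") for w in text.lower().split()]
    counts = {
        lang: sum(w in vocab for w in words)
        for lang, vocab in DICTIONARIES.items()
    }
    return max(counts, key=counts.get), counts


best, counts = detect_by_dictionary("Een man zingt en speelt gitaar.")
print(best, counts)

# Short passages built from shared words produce ties, which is one weakness
# of pure counting.
_, tie_counts = detect_by_dictionary("man is")
print(tie_counts)
```

Words such as "man" and "is" appear in more than one dictionary here, so short inputs can tie and the result then depends on dictionary order.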

* The model card of `papluca/xlm-roberta-base-language-detection` reports an accuracy of 99.6% on the test set of the Language Identification dataset.
* Another common library for language detection, `langid`, works by looking for a subset of these dictionary matches and assigning them individual weights learned with a Naive Bayes model.
   * The `langid` library achieves 98.5% accuracy on the same test set.
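
To illustrate how learned weights differ from plain counting, here is a toy Naive Bayes scorer: each feature contributes its log-probability under a language, and the language with the highest total wins. This is a sketch only. The word features, probabilities, and smoothing value are all made up, and `langid`'s actual features and trained weights are different.

```python
import math

# Hypothetical P(word | language) values, for illustration only.
WORD_PROBS = {
    "en": {"the": 0.06, "is": 0.02, "man": 0.005},
    "nl": {"de": 0.05, "een": 0.03, "is": 0.015, "man": 0.004},
}
UNSEEN = 1e-6  # smoothing probability for words missing from a language's table


def naive_bayes_scores(text):
    """Sum log-likelihoods per language, assuming a uniform prior over languages."""
    words = text.lower().split()
    return {
        lang: sum(math.log(probs.get(w, UNSEEN)) for w in words)
        for lang, probs in WORD_PROBS.items()
    }


nb_scores = naive_bayes_scores("een man is")
nb_best = max(nb_scores, key=nb_scores.get)
print(nb_best)
```

Unlike raw counts, a shared word such as "is" still contributes evidence, but a distinctive word such as "een" shifts the total far more.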

* The neural model is about 1.1 percentage points more accurate than `langid` (99.6% vs. 98.5%), with the downside that it is slower.

* Deciding whether or not to use a neural network depends on whether that roughly 1-point improvement in accuracy matters for your specific use case and dataset.
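
To make the trade-off concrete, the reported accuracies can be converted into approximate error counts on the 10,000-passage test split. This is simple arithmetic on the figures quoted above, not a re-evaluation of either system, and the dictionary keys are just labels chosen for this sketch.

```python
TEST_SIZE = 10_000  # rows in the test split, as printed earlier in this notebook
REPORTED_ACCURACY = {"xlm-roberta": 0.996, "langid": 0.985}

# Approximate number of misclassified passages implied by each accuracy.
errors = {
    name: round(TEST_SIZE * (1 - acc))
    for name, acc in REPORTED_ACCURACY.items()
}
gap = errors["langid"] - errors["xlm-roberta"]

print(errors)
print(gap)
```

That works out to roughly 40 versus 150 misclassified passages, a difference of about 110 out of 10,000.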

# Appendix
* The model currently supports the following 20 languages:

1. arabic (ar)
2. bulgarian (bg)
3. german (de)
4. modern greek (el)
5. english (en)
6. spanish (es)
7. french (fr)
8. hindi (hi)
9. italian (it)
10. japanese (ja)
11. dutch (nl)
12. polish (pl)
13. portuguese (pt)
14. russian (ru)
15. swahili (sw)
16. thai (th)
17. turkish (tr)
18. urdu (ur)
19. vietnamese (vi)
20. chinese (zh)