Model for identifying the most probable language(s) of a text, inspired by Facebook's fastText and using the same methodology.
Model
Text is tokenized into a bag of word 1- and 2-grams and character 1- through 5-grams. The collection of n-grams is embedded into a 128-dimensional space, then averaged. The resulting features are fed into a linear classifier with a hierarchical softmax output to compute (approximate) language probabilities for 140 ISO 639-1 languages.
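The pipeline described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model's actual implementation: the whitespace tokenization, the hashed embedding table (`NUM_BUCKETS`), the random weights, and the three-language label set are all hypothetical, and a plain softmax stands in for the hierarchical softmax to keep the example short.

```python
import math
import random
import zlib

EMBED_DIM = 128             # embedding size, from the description above
NUM_BUCKETS = 2048          # hypothetical size of the hashed embedding table
LANGS = ["en", "fr", "de"]  # toy label set; the real model covers 140 languages


def extract_ngrams(text, word_ns=(1, 2), char_ns=(1, 2, 3, 4, 5)):
    """Return a bag of word 1-2 grams and character 1-5 grams."""
    words = text.split()  # assumption: simple whitespace tokenization
    grams = []
    for n in word_ns:
        for i in range(len(words) - n + 1):
            grams.append("w:" + " ".join(words[i:i + n]))
    for n in char_ns:
        # assumption: character n-grams run over the raw text, spaces included
        for i in range(len(text) - n + 1):
            grams.append("c:" + text[i:i + n])
    return grams


# Randomly initialized parameters; a trained model would learn these.
_rng = random.Random(0)
embeddings = [[_rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
              for _ in range(NUM_BUCKETS)]
weights = [[_rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
           for _ in LANGS]


def featurize(text):
    """Embed every n-gram via a hashed lookup, then average the vectors."""
    grams = extract_ngrams(text)
    if not grams:
        return [0.0] * EMBED_DIM
    total = [0.0] * EMBED_DIM
    for gram in grams:
        row = embeddings[zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS]
        for j, value in enumerate(row):
            total[j] += value
    return [value / len(grams) for value in total]


def predict_proba(text):
    """Linear layer followed by a plain (not hierarchical) softmax."""
    feats = featurize(text)
    logits = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    norm = sum(exps)
    return {lang: e / norm for lang, e in zip(LANGS, exps)}


if __name__ == "__main__":
    print(predict_proba("hello world"))
```

Because the weights are random, the printed probabilities are meaningless; the point is only the shape of the computation: n-grams in, one averaged 128-dimensional vector, one probability per language out.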
Dataset
The model was trained on a randomized, stratified subset of ~2.9M texts drawn from several sources:
- WiLI: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is "encyclopedic". Source: https://zenodo.org/record/841984
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- DSLCC: Two collections of short excerpts of journalistic texts, organized into a handful of groups of highly similar languages. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
- TED 2020: A crawl of nearly 4000 TED and TEDx transcripts from 2020, translated by a global community of volunteers into more than 100 languages. Style is conversational, covering a broad range of subjects. Source: https://opus.nlpl.eu/TED2020.php
- SETimes: A corpus of news articles in Balkan languages, originally extracted from http://www.setimes.com and compiled by Nikola Ljubešić. Source: https://opus.nlpl.eu/SETIMES.php
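The text above says only that the training set is a "randomized, stratified subset" of these sources; it does not give the exact procedure. One plausible reading, sketched below, is to group texts by language, draw a random sample within each group up to a cap, and shuffle the result. The function name `stratified_sample`, the `max_per_lang` cap, and the `(text, lang)` record shape are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, max_per_lang, seed=42):
    """Draw up to `max_per_lang` random texts per language, then shuffle.

    `records` is an iterable of (text, lang) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in records:
        by_lang[lang].append(text)

    sample = []
    for lang in sorted(by_lang):  # sorted for reproducibility
        texts = by_lang[lang]
        k = min(max_per_lang, len(texts))
        sample.extend((text, lang) for text in rng.sample(texts, k))

    rng.shuffle(sample)  # mix languages so training order is randomized
    return sample


if __name__ == "__main__":
    toy = [(f"en text {i}", "en") for i in range(10)]
    toy += [(f"fr text {i}", "fr") for i in range(3)]
    subset = stratified_sample(toy, max_per_lang=5)
    print(len(subset), "texts sampled")
```

Capping each language keeps high-resource languages from swamping low-resource ones, which would be consistent with the widely varying support counts in the performance report, though the actual sampling rule may differ.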
Performance
The trained model achieved a support-weighted average F1 of 0.97 (macro-averaged F1 = 0.96) across all languages.
A few languages perform noticeably worse: most notably the Norwegian codes ("nb", "nn", and "no"), as well as Bosnian ("bs"), Croatian ("hr"), and Serbian ("sr"), which are extremely similar to each other.
language precision recall f1-score support
af 0.96 0.97 0.96 948
am 1.00 1.00 1.00 220
an 0.93 0.80 0.86 101
ar 1.00 0.80 0.89 7953
as 0.96 0.96 0.96 159
av 0.89 0.77 0.83 101
ay 0.93 0.92 0.93 106
az 0.99 0.97 0.98 1644
ba 0.94 0.98 0.96 116
be 1.00 0.99 0.99 4600
bg 0.99 0.99 0.99 7475
bn 1.00 0.99 1.00 1516
bo 1.00 0.99 1.00 200
br 0.98 0.99 0.99 483
bs 0.63 0.66 0.65 4457
ca 0.98 0.99 0.98 6863
ce 0.99 1.00 1.00 101
co 0.95 0.93 0.94 106
cs 0.99 0.98 0.99 7947
cu 1.00 1.00 1.00 404
cv 0.99 0.95 0.97 188
cy 0.99 0.98 0.99 502
da 0.96 0.95 0.95 5178
de 0.99 0.99 0.99 7975
dv 1.00 1.00 1.00 107
el 1.00 1.00 1.00 6982
en 0.97 0.97 0.97 9944
eo 0.99 0.99 0.99 2920
es 0.98 0.98 0.98 9078
et 0.99 0.99 0.99 6338
eu 0.99 0.99 0.99 2655
fa 1.00 1.00 1.00 7395
fi 0.99 0.99 0.99 7950
fo 0.94 0.96 0.95 432
fr 0.82 0.99 0.90 9080
fy 0.94 0.87 0.91 132
ga 0.99 0.99 0.99 1204
gd 0.98 0.99 0.99 744
gl 0.96 0.96 0.96 4239
gn 0.99 0.97 0.98 278
gu 1.00 1.00 1.00 1601
gv 0.95 0.99 0.97 214
ha 0.99 0.99 0.99 1813
he 1.00 1.00 1.00 5895
hi 1.00 1.00 1.00 5314
hr 0.82 0.79 0.80 7748
ht 0.99 0.96 0.97 160
hu 1.00 0.99 1.00 4846
hy 1.00 1.00 1.00 3804
ia 0.95 0.96 0.96 1795
id 0.95 0.96 0.95 6735
ie 0.91 0.91 0.91 439
ig 0.96 0.87 0.91 126
io 0.95 0.92 0.94 639
is 0.99 0.99 0.99 4795
it 0.99 0.99 0.99 7964
ja 1.00 1.00 1.00 7892
jv 0.96 0.90 0.93 177
ka 1.00 1.00 1.00 3115
kk 1.00 0.99 0.99 1543
km 0.99 0.97 0.98 229
kn 1.00 1.00 1.00 329
ko 1.00 1.00 1.00 4951
ku 1.00 1.00 1.00 2809
kv 0.96 0.95 0.95 100
kw 0.99 0.95 0.97 210
ky 0.97 0.95 0.96 196
la 0.99 0.99 0.99 5276
lb 0.92 0.93 0.93 157
lg 0.95 0.98 0.97 105
li 0.99 0.96 0.97 100
ln 0.96 0.97 0.96 553
lo 0.97 0.94 0.95 157
lt 1.00 1.00 1.00 5119
lv 0.99 1.00 1.00 5119
mg 0.97 0.97 0.97 148
mi 0.98 0.94 0.96 135
mk 0.99 0.99 0.99 6485
ml 1.00 1.00 1.00 731
mn 1.00 1.00 1.00 2993
mr 1.00 1.00 1.00 3276
ms 0.79 0.73 0.76 1349
mt 0.97 0.98 0.98 437
my 0.93 0.96 0.95 3937
nb 0.85 0.89 0.87 3910
ne 0.99 0.98 0.99 497
nl 0.99 0.99 0.99 6730
nn 0.55 0.49 0.52 343
no 0.87 0.87 0.87 3466
nv 1.00 0.98 0.99 113
oc 0.87 0.88 0.87 520
om 0.94 0.97 0.96 106
or 1.00 0.96 0.98 103
os 0.98 1.00 0.99 454
pa 1.00 1.00 1.00 178
pl 1.00 1.00 1.00 7960
ps 0.99 0.97 0.98 213
pt 0.98 0.99 0.98 9082
qu 0.95 0.93 0.94 137
rm 0.94 0.94 0.94 144
rn 0.96 0.90 0.93 223
ro 1.00 0.99 0.99 9976
ru 0.99 0.99 0.99 7962
rw 0.87 0.87 0.87 108
sa 0.99 0.99 0.99 356
sc 0.85 0.93 0.89 107
sd 0.99 0.98 0.98 100
se 0.93 0.96 0.94 112
si 0.99 0.97 0.98 212
sk 0.98 0.97 0.97 4292
sl 0.98 0.98 0.98 4999
sn 0.93 0.89 0.91 110
so 0.98 0.96 0.97 313
sq 0.99 0.99 0.99 4962
sr 0.85 0.86 0.86 8340
su 0.95 0.97 0.96 108
sv 0.99 0.99 0.99 6060
sw 0.94 0.95 0.95 106
ta 1.00 1.00 1.00 1321
te 1.00 1.00 1.00 660
tg 0.99 0.98 0.98 165
th 1.00 1.00 1.00 3092
tk 0.98 0.97 0.98 638
tl 0.99 0.99 0.99 1933
tn 0.95 0.98 0.96 109
to 0.99 1.00 1.00 107
tr 0.99 1.00 0.99 9965
tt 0.99 0.99 0.99 1236
ug 1.00 1.00 1.00 1094
uk 0.99 0.99 0.99 5420
ur 1.00 1.00 1.00 2540
uz 0.98 0.98 0.98 856
vi 1.00 1.00 1.00 4771
vo 0.98 0.96 0.97 298
wa 0.98 0.93 0.95 108
wo 0.97 0.97 0.97 349
xh 0.94 0.93 0.94 120
yi 1.00 1.00 1.00 799
yo 0.89 0.93 0.91 150
zh 1.00 1.00 1.00 3351
accuracy 0.97 361821
macro avg 0.96 0.96 0.96 361821
weighted avg 0.97 0.97 0.97 361821
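For readers unfamiliar with the summary rows, the sketch below recomputes F1 from precision and recall for four rows copied from the report above, then contrasts a macro average (every language counts equally) with a support-weighted average. The averages are taken over these four rows only, so they illustrate the calculation and will not match the full-report figures.

```python
# (precision, recall, support) for four rows taken from the report above
rows = {
    "ar": (1.00, 0.80, 7953),
    "fr": (0.82, 0.99, 9080),
    "hr": (0.82, 0.79, 7748),
    "nn": (0.55, 0.49, 343),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = {lang: f1_score(p, r) for lang, (p, r, _) in rows.items()}

# Macro average: unweighted mean over languages.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each language weighted by its number of test texts.
total_support = sum(support for _, _, support in rows.values())
weighted_f1 = sum(f1[lang] * rows[lang][2] for lang in rows) / total_support

if __name__ == "__main__":
    for lang, score in f1.items():
        print(f"{lang}: F1 = {score:.2f}")
    print(f"macro F1 (4 rows):    {macro_f1:.2f}")
    print(f"weighted F1 (4 rows): {weighted_f1:.2f}")
```

A low-scoring language with little support, such as "nn", pulls the macro average down more than the weighted average, which is why the two summary rows in the report can differ.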