Language identification model v2.0
Model for identifying the most probable language(s) of a text, inspired by Google's Compact Language Detector v3 and implemented with thinc v8.0.
Model
Character unigrams, bigrams, and trigrams are extracted separately from the first 1000 characters of the lower-cased input text. Each collection of n-grams is hash-embedded into a 100-dimensional space, then averaged. The three resulting feature vectors are concatenated into a single embedding, which is passed to a dense layer with ReLU activation and finally to a softmax output layer. The model outputs a probability for each of ~140 languages, identified by ISO 639-1 code.
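The pipeline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual thinc implementation: the bucket count (`NUM_BUCKETS`), the hidden width (`HIDDEN_DIM`), the exact label count (`NUM_LANGS`), the CRC32 hash, and the randomly initialized weights are all placeholders chosen for the example. Only the 1000-character cutoff, the 100-dimensional embeddings, the n-gram sizes, and the order of the layers come from the description.

```python
import math
import random
import zlib

MAX_CHARS = 1000    # from the description: only the first 1000 characters are used
EMBED_DIM = 100     # from the description: 100-dimensional hash embeddings
NUM_BUCKETS = 1000  # assumption: the embedding table size is not stated
HIDDEN_DIM = 64     # assumption: the dense layer width is not stated
NUM_LANGS = 140     # assumption: the text only says "~140" languages


def char_ngrams(text, n):
    """Return the character n-grams of the lower-cased, truncated text."""
    text = text.lower()[:MAX_CHARS]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def mean_hash_embedding(ngrams, table):
    """Hash each n-gram to a table row and average the selected rows."""
    dim = len(table[0])
    if not ngrams:
        return [0.0] * dim
    total = [0.0] * dim
    for gram in ngrams:
        # Assumption: a single CRC32 hash stands in for the real hashing scheme.
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        for j, value in enumerate(row):
            total[j] += value
    return [value / len(ngrams) for value in total]


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, value) for value in x]


def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in x]
    norm = sum(exps)
    return [value / norm for value in exps]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_params(seed=0):
    """Untrained placeholder weights; a real model would learn these."""
    rng = random.Random(seed)
    return {
        "tables": [random_matrix(rng, NUM_BUCKETS, EMBED_DIM) for _ in range(3)],
        "W1": random_matrix(rng, HIDDEN_DIM, 3 * EMBED_DIM),
        "b1": [0.0] * HIDDEN_DIM,
        "W2": random_matrix(rng, NUM_LANGS, HIDDEN_DIM),
        "b2": [0.0] * NUM_LANGS,
    }


def predict_proba(text, params):
    """Embed unigrams, bigrams and trigrams, concatenate, then classify."""
    features = []
    for n, table in zip((1, 2, 3), params["tables"]):
        features.extend(mean_hash_embedding(char_ngrams(text, n), table))
    hidden = relu(dense(features, params["W1"], params["b1"]))
    return softmax(dense(hidden, params["W2"], params["b2"]))


params = build_params()
probs = predict_proba("Dette er en kort tekst.", params)
# Indices of the three most probable labels (meaningless here: weights are random).
top3 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:3]
```

Because each n-gram order has its own embedding table and is averaged separately, the concatenated feature vector always has 3 × 100 = 300 dimensions regardless of text length.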
Dataset
The model was trained on a randomized, stratified subset of ~375k texts drawn from several sources:
- WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is "encyclopedic". Source: https://zenodo.org/record/841984
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
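How the "randomized, stratified subset" was drawn is not spelled out here. One plausible reading is that texts from all sources were grouped by language and the same fraction was sampled at random from each group. The sketch below illustrates that reading only; the function name, the `fraction` parameter, and the `(text, lang)` pair format are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_subset(examples, fraction, seed=0):
    """Sample the same fraction of texts from every language group.

    `examples` is a list of (text, lang) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in examples:
        by_lang[lang].append((text, lang))

    subset = []
    for lang in sorted(by_lang):  # sorted for reproducibility
        group = by_lang[lang]
        # Keep at least one text per language so no label disappears.
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))

    rng.shuffle(subset)  # randomize the final order
    return subset


examples = [("text %d" % i, "en") for i in range(10)]
examples += [("tekst %d" % i, "hr") for i in range(4)]
subset = stratified_subset(examples, fraction=0.5, seed=1)
counts = Counter(lang for _, lang in subset)
```

Sampling within each language keeps the label proportions of the subset close to those of the pooled sources, which a plain random sample does not guarantee for rare languages.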
Performance
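The trained model achieved a macro-averaged F1 of 0.97 over all languages; weighted-average F1 and overall accuracy were both 0.96 over the 94,140 evaluated texts.
A few languages perform noticeably worse because they are extremely similar to one another: for example, Norwegian Bokmål ("nb", F1 = 0.85) and Norwegian ("no", 0.89), as well as Bosnian ("bs", 0.64), Croatian ("hr", 0.76), and Serbian ("sr", 0.82).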
language    precision  recall  f1-score  support
af               0.98    0.98      0.98     1096
am               1.00    1.00      1.00      267
an               0.97    0.96      0.96      202
ar               0.96    1.00      0.98     1096
as               1.00    0.97      0.98      248
av               0.94    0.93      0.93      200
ay               0.98    0.95      0.96      212
az               0.99    0.97      0.98      501
ba               0.99    0.98      0.98      230
be               0.99    0.99      0.99     1096
bg               0.98    0.98      0.98     1096
bm               1.00    0.98      0.99      137
bn               0.98    0.99      0.98      303
bo               1.00    1.00      1.00      214
br               0.99    0.99      0.99      614
bs               0.63    0.65      0.64     1376
ca               0.96    0.97      0.96     1096
ce               1.00    0.99      0.99      201
co               0.99    0.95      0.97      213
cs               0.98    0.96      0.97     1096
cu               1.00    1.00      1.00      606
cv               0.99    0.98      0.98      367
cy               1.00    0.99      1.00      764
da               0.94    0.94      0.94     1096
de               0.96    0.99      0.97     1108
dv               1.00    1.00      1.00      212
el               1.00    1.00      1.00     1107
en               0.94    0.97      0.96     3096
eo               0.97    0.96      0.97      490
es               0.96    0.97      0.97     2207
et               0.99    0.98      0.99     1096
eu               0.99    1.00      0.99     1096
fa               1.00    1.00      1.00     1940
fi               0.99    0.99      0.99     1096
fo               0.99    0.98      0.98      857
fr               0.96    0.98      0.97     2207
fy               0.98    0.96      0.97      239
ga               0.99    0.99      0.99     1059
gd               0.99    0.99      0.99      955
gl               0.96    0.94      0.95     1096
gn               1.00    0.99      0.99      488
gu               0.98    0.96      0.97      216
gv               0.99    0.99      0.99      285
ha               0.97    0.98      0.98      239
he               1.00    1.00      1.00     1095
hi               1.00    0.99      0.99     1096
hr               0.78    0.75      0.76     2207
ht               1.00    0.98      0.99      228
hu               0.99    0.99      0.99     1096
hy               1.00    1.00      1.00      969
ia               0.93    0.95      0.94      490
id               0.93    0.92      0.93     2207
ie               0.94    0.94      0.94      478
ig               0.96    0.91      0.93      214
io               0.95    0.94      0.95      489
is               0.99    0.99      0.99     1096
it               0.98    0.98      0.98     1096
ja               1.00    1.00      1.00     1095
jv               0.97    0.93      0.95      277
ka               0.99    1.00      0.99      490
kk               0.99    0.99      0.99      652
km               0.97    0.95      0.96      246
kn               1.00    1.00      1.00      224
ko               1.00    1.00      1.00      957
ku               1.00    0.99      0.99      212
kv               0.94    0.96      0.95      200
kw               0.99    0.99      0.99      419
ky               0.99    0.97      0.98      235
la               0.97    0.98      0.97     1108
lb               0.97    0.96      0.97      280
lg               0.99    0.99      0.99      210
li               0.99    0.99      0.99      200
ln               0.95    0.92      0.93      231
lo               0.95    0.93      0.94      227
lt               0.99    0.99      0.99     1096
lv               1.00    0.99      0.99      818
mg               1.00    0.99      1.00      215
mi               1.00    1.00      1.00      269
mk               0.94    0.97      0.96      490
ml               1.00    0.98      0.99      288
mn               1.00    0.99      0.99      491
mr               0.99    0.99      0.99      533
ms               0.74    0.65      0.69      200
mt               0.99    0.99      0.99      836
my               0.89    0.92      0.91     1340
nb               0.81    0.89      0.85      491
ne               0.98    0.98      0.98      211
nl               0.98    0.97      0.97     1096
nn               0.88    0.88      0.88      397
no               0.92    0.86      0.89      606
nv               1.00    1.00      1.00      226
oc               0.96    0.92      0.94      561
om               0.98    0.98      0.98      212
or               0.99    0.97      0.98      204
os               1.00    0.98      0.99      230
pa               1.00    0.99      1.00      218
pl               0.99    1.00      0.99     1096
ps               0.97    0.95      0.96      219
pt               0.98    0.98      0.98     2219
qu               0.97    0.96      0.96      274
rm               0.97    0.98      0.98      289
rn               0.96    0.97      0.97      290
ro               0.99    0.99      0.99     1120
ru               0.96    0.97      0.97     1096
rw               0.95    0.95      0.95      215
sa               1.00    0.99      1.00      713
sc               0.97    0.98      0.97      213
sd               0.99    0.99      0.99      200
se               0.98    0.99      0.98      223
si               0.97    0.96      0.96      213
sk               0.97    0.97      0.97     1096
sl               0.96    0.97      0.96      929
sn               0.96    0.95      0.96      220
so               0.98    0.98      0.98      221
sq               1.00    0.98      0.99      492
sr               0.81    0.83      0.82     2219
su               0.99    0.91      0.95      216
sv               0.98    0.97      0.98     1096
sw               0.96    0.97      0.96      212
ta               1.00    1.00      1.00      476
te               0.99    0.97      0.98      312
tg               0.98    0.96      0.97      220
th               0.99    0.99      0.99      682
tk               0.99    0.99      0.99      502
tl               0.98    0.98      0.98      513
tn               1.00    0.98      0.99      217
to               1.00    1.00      1.00      213
tr               0.98    0.99      0.99     1096
tt               0.98    0.98      0.98      490
ug               1.00    1.00      1.00     1108
uk               0.99    0.99      0.99     1096
ur               1.00    1.00      1.00     1080
uz               0.98    0.96      0.97      313
vi               1.00    0.99      0.99     1028
vo               0.98    0.99      0.98      478
wa               1.00    0.97      0.99      217
wo               0.99    0.99      0.99      694
xh               0.96    0.91      0.94      240
yi               0.99    1.00      0.99      490
yo               0.93    0.94      0.93      301
zh               1.00    1.00      1.00      825
accuracy                           0.96    94140
macro avg        0.97    0.97      0.97    94140
weighted avg     0.96    0.96      0.96    94140
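The per-language rows and the three summary rows of the report can be recomputed from lists of true and predicted labels. The sketch below shows the standard definitions in plain Python; the function names and the toy labels are made up for illustration and are not the evaluation code used for this model.

```python
from collections import Counter


def per_language_scores(y_true, y_pred):
    """Return {language: (precision, recall, f1, support)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this language, but it was wrong
            fn[truth] += 1  # missed a text written in this language

    scores = {}
    for lang in sorted(set(y_true) | set(y_pred)):
        predicted = tp[lang] + fp[lang]
        support = tp[lang] + fn[lang]  # number of true texts in this language
        precision = tp[lang] / predicted if predicted else 0.0
        recall = tp[lang] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[lang] = (precision, recall, f1, support)
    return scores


def summarize(y_true, y_pred):
    """Accuracy, macro-averaged F1 and support-weighted F1."""
    scores = per_language_scores(y_true, y_pred)
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    # Macro average: every language counts equally, however few texts it has.
    macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
    # Weighted average: each language counts in proportion to its support.
    weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy labels: English is always right, Croatian and Serbian are confused half the time.
y_true = ["en"] * 4 + ["hr"] * 2 + ["sr"] * 2
y_pred = ["en"] * 4 + ["hr", "sr", "sr", "hr"]
scores = per_language_scores(y_true, y_pred)
summary = summarize(y_true, y_pred)
```

On this toy input, accuracy and weighted F1 are both 0.75 while macro F1 is about 0.67, because the two confusable languages pull the unweighted mean down more than their share of the texts would.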