Language identification model v3.0

@bdewilde bdewilde released this 02 Apr 20:49

Model for identifying the most probable language(s) of a text, inspired by -- and using the same methodology as -- Facebook's fastText.

Model

Text is tokenized into a bag of word 1- and 2-grams and character 1- through 5-grams. The collection of n-grams is embedded into a 128-dimensional space, then averaged. The resulting features are fed into a linear classifier with a hierarchical softmax output to compute (approximate) language probabilities for 140 ISO 639-1 languages.
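The feature pipeline described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the released model's code: the whitespace tokenizer, the `("w", ...)` / `("c", ...)` key prefixes, and the dictionary-based embedding `table` are all assumptions made for the example, and the classifier with its hierarchical softmax is omitted.

```python
def word_ngrams(text, max_n=2):
    """Word 1- through max_n-grams, using a naive whitespace tokenizer (assumption)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, max_n=5):
    """Character 1- through max_n-grams over the raw string."""
    return [
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def all_ngrams(text):
    """Combine both kinds of n-gram; the prefixes keep word 'a' and char 'a' distinct."""
    return [("w", g) for g in word_ngrams(text)] + [("c", g) for g in char_ngrams(text)]


def mean_embedding(ngrams, table, dim=128):
    """Average the vectors of all n-grams found in `table` (a hypothetical lookup dict)."""
    vectors = [table[g] for g in ngrams if g in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

In a real model the embedding table is learned during training; here it is simply whatever dictionary the caller supplies, and n-grams missing from it are skipped.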

Dataset

The model was trained on a randomized, stratified subset of ~2.9M texts drawn from several sources:

  • WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is "encyclopedic". Source: https://zenodo.org/record/841984
  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
  • TED 2020: A crawl of nearly 4000 TED and TEDx transcripts from 2020, translated by a global community of volunteers into more than 100 languages. Style is conversational, covering a broad range of subjects. Source: https://opus.nlpl.eu/TED2020.php
  • SETimes: A corpus of news articles in Balkan languages, originally extracted from http://www.setimes.com and compiled by Nikola Ljubešić. Source: https://opus.nlpl.eu/SETIMES.php
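One common way to draw a randomized, stratified subset from pooled sources like these is to group texts by language, shuffle each group, and cap how many texts any one language contributes. The sketch below illustrates that idea only; the release does not state the actual sampling procedure, and `stratified_sample`, `max_per_lang`, and the `(text, lang)` record shape are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, max_per_lang, seed=0):
    """Sample at most `max_per_lang` texts per language from (text, lang) pairs."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group texts by their language label.
    by_lang = defaultdict(list)
    for text, lang in records:
        by_lang[lang].append(text)

    sample = []
    for lang in sorted(by_lang):
        texts = by_lang[lang]
        rng.shuffle(texts)  # randomize within each language
        sample.extend((text, lang) for text in texts[:max_per_lang])

    rng.shuffle(sample)  # mix languages together in the final subset
    return sample
```

Capping per-language counts keeps high-resource languages from dominating, while low-resource languages simply contribute everything they have.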

Performance

The trained model achieved an overall accuracy and support-weighted average F1 of 0.97; the unweighted (macro) average F1 over all languages was 0.96.

A few languages perform noticeably worse: most notably Norwegian Nynorsk ("nn", F1 = 0.52), and to a lesser degree Norwegian Bokmål ("nb") and the generic Norwegian code ("no"), as well as Bosnian ("bs"), Croatian ("hr"), and Serbian ("sr"), which are extremely similar to each other.

              precision    recall  f1-score   support

          af       0.96      0.97      0.96       948
          am       1.00      1.00      1.00       220
          an       0.93      0.80      0.86       101
          ar       1.00      0.80      0.89      7953
          as       0.96      0.96      0.96       159
          av       0.89      0.77      0.83       101
          ay       0.93      0.92      0.93       106
          az       0.99      0.97      0.98      1644
          ba       0.94      0.98      0.96       116
          be       1.00      0.99      0.99      4600
          bg       0.99      0.99      0.99      7475
          bn       1.00      0.99      1.00      1516
          bo       1.00      0.99      1.00       200
          br       0.98      0.99      0.99       483
          bs       0.63      0.66      0.65      4457
          ca       0.98      0.99      0.98      6863
          ce       0.99      1.00      1.00       101
          co       0.95      0.93      0.94       106
          cs       0.99      0.98      0.99      7947
          cu       1.00      1.00      1.00       404
          cv       0.99      0.95      0.97       188
          cy       0.99      0.98      0.99       502
          da       0.96      0.95      0.95      5178
          de       0.99      0.99      0.99      7975
          dv       1.00      1.00      1.00       107
          el       1.00      1.00      1.00      6982
          en       0.97      0.97      0.97      9944
          eo       0.99      0.99      0.99      2920
          es       0.98      0.98      0.98      9078
          et       0.99      0.99      0.99      6338
          eu       0.99      0.99      0.99      2655
          fa       1.00      1.00      1.00      7395
          fi       0.99      0.99      0.99      7950
          fo       0.94      0.96      0.95       432
          fr       0.82      0.99      0.90      9080
          fy       0.94      0.87      0.91       132
          ga       0.99      0.99      0.99      1204
          gd       0.98      0.99      0.99       744
          gl       0.96      0.96      0.96      4239
          gn       0.99      0.97      0.98       278
          gu       1.00      1.00      1.00      1601
          gv       0.95      0.99      0.97       214
          ha       0.99      0.99      0.99      1813
          he       1.00      1.00      1.00      5895
          hi       1.00      1.00      1.00      5314
          hr       0.82      0.79      0.80      7748
          ht       0.99      0.96      0.97       160
          hu       1.00      0.99      1.00      4846
          hy       1.00      1.00      1.00      3804
          ia       0.95      0.96      0.96      1795
          id       0.95      0.96      0.95      6735
          ie       0.91      0.91      0.91       439
          ig       0.96      0.87      0.91       126
          io       0.95      0.92      0.94       639
          is       0.99      0.99      0.99      4795
          it       0.99      0.99      0.99      7964
          ja       1.00      1.00      1.00      7892
          jv       0.96      0.90      0.93       177
          ka       1.00      1.00      1.00      3115
          kk       1.00      0.99      0.99      1543
          km       0.99      0.97      0.98       229
          kn       1.00      1.00      1.00       329
          ko       1.00      1.00      1.00      4951
          ku       1.00      1.00      1.00      2809
          kv       0.96      0.95      0.95       100
          kw       0.99      0.95      0.97       210
          ky       0.97      0.95      0.96       196
          la       0.99      0.99      0.99      5276
          lb       0.92      0.93      0.93       157
          lg       0.95      0.98      0.97       105
          li       0.99      0.96      0.97       100
          ln       0.96      0.97      0.96       553
          lo       0.97      0.94      0.95       157
          lt       1.00      1.00      1.00      5119
          lv       0.99      1.00      1.00      5119
          mg       0.97      0.97      0.97       148
          mi       0.98      0.94      0.96       135
          mk       0.99      0.99      0.99      6485
          ml       1.00      1.00      1.00       731
          mn       1.00      1.00      1.00      2993
          mr       1.00      1.00      1.00      3276
          ms       0.79      0.73      0.76      1349
          mt       0.97      0.98      0.98       437
          my       0.93      0.96      0.95      3937
          nb       0.85      0.89      0.87      3910
          ne       0.99      0.98      0.99       497
          nl       0.99      0.99      0.99      6730
          nn       0.55      0.49      0.52       343
          no       0.87      0.87      0.87      3466
          nv       1.00      0.98      0.99       113
          oc       0.87      0.88      0.87       520
          om       0.94      0.97      0.96       106
          or       1.00      0.96      0.98       103
          os       0.98      1.00      0.99       454
          pa       1.00      1.00      1.00       178
          pl       1.00      1.00      1.00      7960
          ps       0.99      0.97      0.98       213
          pt       0.98      0.99      0.98      9082
          qu       0.95      0.93      0.94       137
          rm       0.94      0.94      0.94       144
          rn       0.96      0.90      0.93       223
          ro       1.00      0.99      0.99      9976
          ru       0.99      0.99      0.99      7962
          rw       0.87      0.87      0.87       108
          sa       0.99      0.99      0.99       356
          sc       0.85      0.93      0.89       107
          sd       0.99      0.98      0.98       100
          se       0.93      0.96      0.94       112
          si       0.99      0.97      0.98       212
          sk       0.98      0.97      0.97      4292
          sl       0.98      0.98      0.98      4999
          sn       0.93      0.89      0.91       110
          so       0.98      0.96      0.97       313
          sq       0.99      0.99      0.99      4962
          sr       0.85      0.86      0.86      8340
          su       0.95      0.97      0.96       108
          sv       0.99      0.99      0.99      6060
          sw       0.94      0.95      0.95       106
          ta       1.00      1.00      1.00      1321
          te       1.00      1.00      1.00       660
          tg       0.99      0.98      0.98       165
          th       1.00      1.00      1.00      3092
          tk       0.98      0.97      0.98       638
          tl       0.99      0.99      0.99      1933
          tn       0.95      0.98      0.96       109
          to       0.99      1.00      1.00       107
          tr       0.99      1.00      0.99      9965
          tt       0.99      0.99      0.99      1236
          ug       1.00      1.00      1.00      1094
          uk       0.99      0.99      0.99      5420
          ur       1.00      1.00      1.00      2540
          uz       0.98      0.98      0.98       856
          vi       1.00      1.00      1.00      4771
          vo       0.98      0.96      0.97       298
          wa       0.98      0.93      0.95       108
          wo       0.97      0.97      0.97       349
          xh       0.94      0.93      0.94       120
          yi       1.00      1.00      1.00       799
          yo       0.89      0.93      0.91       150
          zh       1.00      1.00      1.00      3351

    accuracy                           0.97    361821
   macro avg       0.96      0.96      0.96    361821
weighted avg       0.97      0.97      0.97    361821
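The "macro avg" and "weighted avg" rows differ only in how per-language scores are combined: the macro average treats every language equally, while the weighted average scales each language by its support. The sketch below shows the arithmetic on three rows copied from the table (bs, en, nn); the helper names are made up for illustration, and the results apply to this three-row subset only, not to the full report.

```python
def macro_f1(rows):
    """Unweighted mean of per-class F1 scores; rows are (f1, support) pairs."""
    return sum(f1 for f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of per-class F1 scores, weighted by each class's support."""
    total_support = sum(support for _, support in rows)
    return sum(f1 * support for f1, support in rows) / total_support


# (f1-score, support) for "bs", "en", and "nn", taken from the table above.
rows = [(0.65, 4457), (0.97, 9944), (0.52, 343)]

print(f"macro avg:    {macro_f1(rows):.2f}")
print(f"weighted avg: {weighted_f1(rows):.2f}")
```

Because low-support languages such as "nn" carry little weight, a weak score there pulls the macro average down much more than the weighted one.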