Language identification model v2.0

@bdewilde bdewilde released this 02 Apr 03:30

Model for identifying the most probable language(s) of a text, inspired by Google's Compact Language Detector v3 and implemented with thinc v8.0.

Model

Character unigrams, bigrams, and trigrams are extracted separately from the first 1000 characters of the lower-cased input text. Each collection of n-grams is hash-embedded into a 100-dimensional space, then averaged. The three resulting feature vectors are concatenated into a single embedding, which is passed to a dense layer with ReLU activation and finally to a softmax output layer. The model's predictions give the probability that a text is written in each of ~140 languages, identified by ISO 639-1 code.
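
The forward pass described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the thinc implementation: the hash-table size (N_ROWS), the dense-layer width (N_HIDDEN), the use of CRC32 as the hash function, the omission of bias terms, and the random, untrained weights are all hypothetical. Only the 1000-character limit, the 100-dimensional embeddings, and the overall layer order come from the description.

```python
import math
import random
import zlib

MAX_CHARS = 1000   # stated in the model description
EMBED_DIM = 100    # stated in the model description
N_ROWS = 1000      # assumption: the hash-table size is not stated
N_HIDDEN = 32      # assumption: the dense-layer width is not stated
N_CLASSES = 140    # approximate ("~140 languages")

rng = random.Random(0)


def random_matrix(n_rows, n_cols):
    """Untrained stand-in weights, for illustration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


# One embedding table per n-gram order: unigrams, bigrams, trigrams.
TABLES = {n: random_matrix(N_ROWS, EMBED_DIM) for n in (1, 2, 3)}
W_HIDDEN = random_matrix(3 * EMBED_DIM, N_HIDDEN)
W_OUT = random_matrix(N_HIDDEN, N_CLASSES)


def char_ngrams(text, n):
    """Character n-grams from the first MAX_CHARS characters of lower-cased text."""
    chars = text.lower()[:MAX_CHARS]
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def mean_hash_embedding(ngrams, table):
    """Map each n-gram to a table row by hashing, then average the rows."""
    total = [0.0] * EMBED_DIM
    for ngram in ngrams:
        row = table[zlib.crc32(ngram.encode("utf-8")) % len(table)]
        for j, value in enumerate(row):
            total[j] += value
    count = max(len(ngrams), 1)  # avoid dividing by zero for very short texts
    return [value / count for value in total]


def featurize(text):
    """Concatenate the averaged unigram, bigram, and trigram embeddings."""
    features = []
    for n in (1, 2, 3):
        features.extend(mean_hash_embedding(char_ngrams(text, n), TABLES[n]))
    return features


def matvec(vector, matrix):
    """Multiply a vector of length R by an R x C matrix."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]


def predict_proba(text):
    """Dense layer with ReLU, then a numerically stable softmax."""
    hidden = [max(0.0, h) for h in matvec(featurize(text), W_HIDDEN)]
    logits = matvec(hidden, W_OUT)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = predict_proba("This is a short English sentence.")
print(len(probs), round(sum(probs), 6))
```

Because the weights here are random, the probabilities are meaningless; the sketch only shows how the feature vector and output distribution take shape.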

Dataset

The model was trained on a randomized, stratified subset of ~375k texts drawn from several sources:

  • WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is "encyclopedic". Source: https://zenodo.org/record/841984
  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
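
The exact sampling procedure is not described here. As one plausible reading of "randomized, stratified subset", the sketch below groups (text, language) pairs by language, shuffles each group, and keeps at most a fixed number of texts per language. The function name, the max_per_lang parameter, and the capping strategy are assumptions made for illustration, not the author's pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(records, max_per_lang, seed=0):
    """Return a shuffled sample with at most max_per_lang texts per language.

    records: iterable of (text, lang) pairs (hypothetical input format).
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in records:
        by_lang[lang].append(text)

    sample = []
    for lang in sorted(by_lang):  # sorted for reproducibility
        texts = by_lang[lang]
        rng.shuffle(texts)
        sample.extend((text, lang) for text in texts[:max_per_lang])

    rng.shuffle(sample)  # mix languages together
    return sample


# Toy data: ten English texts and three French texts, capped at five each.
records = [(f"en text {i}", "en") for i in range(10)]
records += [(f"fr text {i}", "fr") for i in range(3)]
sample = stratified_sample(records, max_per_lang=5)
```

Capping per-language counts keeps high-resource languages from dominating the sample while leaving low-resource languages untouched.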

Performance

The trained model achieved a macro-averaged F1 of 0.97 across all languages, with an overall accuracy of 0.96 on 94,140 evaluation texts.

A few languages perform worse. Examples are Norwegian ("no", F1 = 0.89) and Norwegian Bokmål ("nb", 0.85), as well as Bosnian ("bs", 0.64), Croatian ("hr", 0.76), and Serbian ("sr", 0.82), which are extremely similar to each other.

              precision    recall  f1-score   support

          af       0.98      0.98      0.98      1096
          am       1.00      1.00      1.00       267
          an       0.97      0.96      0.96       202
          ar       0.96      1.00      0.98      1096
          as       1.00      0.97      0.98       248
          av       0.94      0.93      0.93       200
          ay       0.98      0.95      0.96       212
          az       0.99      0.97      0.98       501
          ba       0.99      0.98      0.98       230
          be       0.99      0.99      0.99      1096
          bg       0.98      0.98      0.98      1096
          bm       1.00      0.98      0.99       137
          bn       0.98      0.99      0.98       303
          bo       1.00      1.00      1.00       214
          br       0.99      0.99      0.99       614
          bs       0.63      0.65      0.64      1376
          ca       0.96      0.97      0.96      1096
          ce       1.00      0.99      0.99       201
          co       0.99      0.95      0.97       213
          cs       0.98      0.96      0.97      1096
          cu       1.00      1.00      1.00       606
          cv       0.99      0.98      0.98       367
          cy       1.00      0.99      1.00       764
          da       0.94      0.94      0.94      1096
          de       0.96      0.99      0.97      1108
          dv       1.00      1.00      1.00       212
          el       1.00      1.00      1.00      1107
          en       0.94      0.97      0.96      3096
          eo       0.97      0.96      0.97       490
          es       0.96      0.97      0.97      2207
          et       0.99      0.98      0.99      1096
          eu       0.99      1.00      0.99      1096
          fa       1.00      1.00      1.00      1940
          fi       0.99      0.99      0.99      1096
          fo       0.99      0.98      0.98       857
          fr       0.96      0.98      0.97      2207
          fy       0.98      0.96      0.97       239
          ga       0.99      0.99      0.99      1059
          gd       0.99      0.99      0.99       955
          gl       0.96      0.94      0.95      1096
          gn       1.00      0.99      0.99       488
          gu       0.98      0.96      0.97       216
          gv       0.99      0.99      0.99       285
          ha       0.97      0.98      0.98       239
          he       1.00      1.00      1.00      1095
          hi       1.00      0.99      0.99      1096
          hr       0.78      0.75      0.76      2207
          ht       1.00      0.98      0.99       228
          hu       0.99      0.99      0.99      1096
          hy       1.00      1.00      1.00       969
          ia       0.93      0.95      0.94       490
          id       0.93      0.92      0.93      2207
          ie       0.94      0.94      0.94       478
          ig       0.96      0.91      0.93       214
          io       0.95      0.94      0.95       489
          is       0.99      0.99      0.99      1096
          it       0.98      0.98      0.98      1096
          ja       1.00      1.00      1.00      1095
          jv       0.97      0.93      0.95       277
          ka       0.99      1.00      0.99       490
          kk       0.99      0.99      0.99       652
          km       0.97      0.95      0.96       246
          kn       1.00      1.00      1.00       224
          ko       1.00      1.00      1.00       957
          ku       1.00      0.99      0.99       212
          kv       0.94      0.96      0.95       200
          kw       0.99      0.99      0.99       419
          ky       0.99      0.97      0.98       235
          la       0.97      0.98      0.97      1108
          lb       0.97      0.96      0.97       280
          lg       0.99      0.99      0.99       210
          li       0.99      0.99      0.99       200
          ln       0.95      0.92      0.93       231
          lo       0.95      0.93      0.94       227
          lt       0.99      0.99      0.99      1096
          lv       1.00      0.99      0.99       818
          mg       1.00      0.99      1.00       215
          mi       1.00      1.00      1.00       269
          mk       0.94      0.97      0.96       490
          ml       1.00      0.98      0.99       288
          mn       1.00      0.99      0.99       491
          mr       0.99      0.99      0.99       533
          ms       0.74      0.65      0.69       200
          mt       0.99      0.99      0.99       836
          my       0.89      0.92      0.91      1340
          nb       0.81      0.89      0.85       491
          ne       0.98      0.98      0.98       211
          nl       0.98      0.97      0.97      1096
          nn       0.88      0.88      0.88       397
          no       0.92      0.86      0.89       606
          nv       1.00      1.00      1.00       226
          oc       0.96      0.92      0.94       561
          om       0.98      0.98      0.98       212
          or       0.99      0.97      0.98       204
          os       1.00      0.98      0.99       230
          pa       1.00      0.99      1.00       218
          pl       0.99      1.00      0.99      1096
          ps       0.97      0.95      0.96       219
          pt       0.98      0.98      0.98      2219
          qu       0.97      0.96      0.96       274
          rm       0.97      0.98      0.98       289
          rn       0.96      0.97      0.97       290
          ro       0.99      0.99      0.99      1120
          ru       0.96      0.97      0.97      1096
          rw       0.95      0.95      0.95       215
          sa       1.00      0.99      1.00       713
          sc       0.97      0.98      0.97       213
          sd       0.99      0.99      0.99       200
          se       0.98      0.99      0.98       223
          si       0.97      0.96      0.96       213
          sk       0.97      0.97      0.97      1096
          sl       0.96      0.97      0.96       929
          sn       0.96      0.95      0.96       220
          so       0.98      0.98      0.98       221
          sq       1.00      0.98      0.99       492
          sr       0.81      0.83      0.82      2219
          su       0.99      0.91      0.95       216
          sv       0.98      0.97      0.98      1096
          sw       0.96      0.97      0.96       212
          ta       1.00      1.00      1.00       476
          te       0.99      0.97      0.98       312
          tg       0.98      0.96      0.97       220
          th       0.99      0.99      0.99       682
          tk       0.99      0.99      0.99       502
          tl       0.98      0.98      0.98       513
          tn       1.00      0.98      0.99       217
          to       1.00      1.00      1.00       213
          tr       0.98      0.99      0.99      1096
          tt       0.98      0.98      0.98       490
          ug       1.00      1.00      1.00      1108
          uk       0.99      0.99      0.99      1096
          ur       1.00      1.00      1.00      1080
          uz       0.98      0.96      0.97       313
          vi       1.00      0.99      0.99      1028
          vo       0.98      0.99      0.98       478
          wa       1.00      0.97      0.99       217
          wo       0.99      0.99      0.99       694
          xh       0.96      0.91      0.94       240
          yi       0.99      1.00      0.99       490
          yo       0.93      0.94      0.93       301
          zh       1.00      1.00      1.00       825

    accuracy                           0.96     94140
   macro avg       0.97      0.97      0.97     94140
weighted avg       0.96      0.96      0.96     94140
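
To make the summary rows easier to read, the sketch below recomputes per-language F1 from precision and recall for four rows copied from the report, then contrasts a macro average (every language counts equally) with a support-weighted average. It covers only these four rows, so its averages are not the report's. Because the inputs are already rounded to two decimals, recomputed values can differ from the report in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) copied from four rows of the report above
rows = {
    "bs": (0.63, 0.65, 1376),
    "hr": (0.78, 0.75, 2207),
    "sr": (0.81, 0.83, 2219),
    "el": (1.00, 1.00, 1107),
}

f1_by_lang = {lang: f1_score(p, r) for lang, (p, r, _) in rows.items()}

# Macro average: unweighted mean over languages.
macro_f1 = sum(f1_by_lang.values()) / len(f1_by_lang)

# Weighted average: each language weighted by its number of test texts.
total_support = sum(support for _, _, support in rows.values())
weighted_f1 = sum(
    f1_by_lang[lang] * support for lang, (_, _, support) in rows.items()
) / total_support

for lang, score in f1_by_lang.items():
    print(f"{lang}: F1 = {score:.2f}")
print(f"macro avg (these rows only):    {macro_f1:.2f}")
print(f"weighted avg (these rows only): {weighted_f1:.2f}")
```

In this subset the weakest languages also have large support, so the weighted average falls below the macro average, the same direction as the gap between 0.96 and 0.97 in the full report.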