Releases: bdewilde/textacy-data

Supreme Court dataset (for Python 3)

@bdewilde bdewilde released this 28 Nov 20:33

A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 — the "modern" era.

Records include the following fields:

  • text: full text of the Court's decision
  • case_name: name of the court case, in all caps
  • argument_date: date on which the case was argued before the Court, as a string with format 'YYYY-MM-DD'
  • decision_date: date on which the Court's decision was announced, as a string with format 'YYYY-MM-DD'
  • decision_direction: ideological direction of the majority decision; either 'conservative', 'liberal', or 'unspecifiable'
  • maj_opinion_author: name of the majority opinion's author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes
  • n_maj_votes: number of justices voting in the majority
  • n_min_votes: number of justices voting in the minority
  • issue: subject matter of the case's core disagreement (e.g. affirmative action) rather than its legal basis (e.g. the equal protection clause), as a string code whose mapping is given in SupremeCourt.issue_codes
  • issue_area: higher-level categorization of the issue (e.g. Civil Rights), as an integer code whose mapping is given in SupremeCourt.issue_area_codes
  • us_cite_id: citation identifier for each case according to the official United States Reports. Note that ~300 cases share an id with at least one other case; it's not clear whether that's "correct" or a data quality problem
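
As a rough illustration of working with these fields, the sketch below parses the 'YYYY-MM-DD' date strings and lists the us_cite_id values that occur more than once. The records, their placeholder ids, and the helper names are hypothetical and only stand in for real dataset records.

```python
from collections import Counter
from datetime import date

# Hypothetical records with placeholder values; real records carry every field listed above.
records = [
    {"case_name": "EXAMPLE V. SAMPLE", "decision_date": "1950-01-09", "us_cite_id": "PLACEHOLDER-1"},
    {"case_name": "SAMPLE V. EXAMPLE", "decision_date": "1950-01-09", "us_cite_id": "PLACEHOLDER-1"},
    {"case_name": "ANOTHER V. CASE", "decision_date": "1962-06-25", "us_cite_id": "PLACEHOLDER-2"},
]


def parse_date(value):
    """Parse a 'YYYY-MM-DD' string into a date; return None if the value is missing."""
    return date.fromisoformat(value) if value else None


def duplicate_cite_ids(records):
    """Map each us_cite_id that appears more than once to its number of occurrences."""
    counts = Counter(r["us_cite_id"] for r in records if r.get("us_cite_id"))
    return {cite_id: n for cite_id, n in counts.items() if n > 1}


duplicates = duplicate_cite_ids(records)
first_decided = min(parse_date(r["decision_date"]) for r in records)
```

Cases that share an id can then be inspected by hand before deciding whether to keep, merge, or drop them.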

The text in this dataset was derived from FindLaw's searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court

The metadata was extracted without modification from the Supreme Court Database:
Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org.
Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/

This corpus' creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time

NOTE: The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model's duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)
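
One common reading of "precision had twice as much weight as recall" is the F-beta score with beta = 0.5; that reading is an assumption here, not something the note states. The sketch below picks the threshold that maximizes that score from a list of (threshold, precision, recall) triples. The function names and the numbers are made up for illustration and are not the values measured for this dataset.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(candidates, beta=0.5):
    """Return the threshold whose (precision, recall) pair maximizes F-beta."""
    return max(candidates, key=lambda c: f_beta(c[1], c[2], beta))[0]


# Illustrative (threshold, precision, recall) triples, not measured values.
candidates = [(0.3, 0.70, 0.95), (0.5, 0.90, 0.80), (0.7, 0.97, 0.40)]
chosen = best_threshold(candidates)
```

With beta below 1, a threshold that gives up some recall for higher precision scores better, which fits the stated goal of avoiding texts matched to the wrong metadata.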

Supreme Court dataset (for Python 2)

@bdewilde bdewilde released this 28 Nov 20:35

A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 — the "modern" era.

Records include the following fields:

  • text: full text of the Court's decision
  • case_name: name of the court case, in all caps
  • argument_date: date on which the case was argued before the Court, as a string with format 'YYYY-MM-DD'
  • decision_date: date on which the Court's decision was announced, as a string with format 'YYYY-MM-DD'
  • decision_direction: ideological direction of the majority decision; either 'conservative', 'liberal', or 'unspecifiable'
  • maj_opinion_author: name of the majority opinion's author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes
  • n_maj_votes: number of justices voting in the majority
  • n_min_votes: number of justices voting in the minority
  • issue: subject matter of the case's core disagreement (e.g. affirmative action) rather than its legal basis (e.g. the equal protection clause), as a string code whose mapping is given in SupremeCourt.issue_codes
  • issue_area: higher-level categorization of the issue (e.g. Civil Rights), as an integer code whose mapping is given in SupremeCourt.issue_area_codes
  • us_cite_id: citation identifier for each case according to the official United States Reports. Note that ~300 cases share an id with at least one other case; it's not clear whether that's "correct" or a data quality problem

The text in this dataset was derived from FindLaw's searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court

The metadata was extracted without modification from the Supreme Court Database:
Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org.
Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/

This corpus' creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time

NOTE: The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model's duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)

Language identification pipeline v1.1 (sklearn v0.23)

Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.23.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.
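
To make the featurization step concrete, here is a minimal pure-Python sketch of character n-gram hashing followed by L2 normalization. It uses zlib.crc32 as a stand-in hash (scikit-learn's HashingVectorizer uses MurmurHash3), and the function names are hypothetical, so the resulting vectors will not match those of the released pipeline.

```python
import math
import zlib

N_FEATURES = 4096  # dimensionality stated in the model description


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min through n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hashed_features(text, n_features=N_FEATURES):
    """Count n-grams into hashed buckets, then scale the vector to unit L2 norm."""
    vec = [0.0] * n_features
    for gram in char_ngrams(text):
        # crc32 is an illustrative stand-in for the hash function.
        idx = zlib.crc32(gram.encode("utf-8")) % n_features
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    if norm > 0:
        vec = [v / norm for v in vec]
    return vec


features = hashed_features("Hello, world!")
```

Hashing keeps the feature space at a fixed size regardless of how many distinct n-grams appear, at the cost of occasional collisions between unrelated n-grams.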

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
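
A two-step pipeline along these lines could be assembled as sketched below. The step names, the alternate_sign, max_iter, and random_state settings, and the toy training texts are assumptions for illustration; the released pipeline's actual hyperparameters are not listed here.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline


def make_lang_pipeline():
    """Build a char n-gram hashing vectorizer followed by a one-hidden-layer MLP."""
    return Pipeline([
        # Step names are hypothetical; alternate_sign=False keeps hashed counts non-negative.
        ("vectorizer", HashingVectorizer(
            analyzer="char", ngram_range=(1, 3), n_features=4096,
            norm="l2", alternate_sign=False,
        )),
        ("classifier", MLPClassifier(
            hidden_layer_sizes=(512,), activation="relu",
            max_iter=200, random_state=0,
        )),
    ])


# Toy data for illustration only; far too small to train a usable model.
texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "le chat est sur la table", "un chien court dans le parc",
]
labels = ["en", "en", "fr", "fr"]

pipeline = make_lang_pipeline()
pipeline.fit(texts, labels)
probs = pipeline.predict_proba(["the dog"])
```

Each row of predict_proba holds one probability per language, in the order given by the classifier's classes_ attribute.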

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
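
The description does not say how the stratified subset was drawn. One plausible approach, assuming stratification by language label and a cap on texts per language, is sketched below; the function name, the cap, and the toy items are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_group, seed=0):
    """Draw up to per_group items at random from each group, then shuffle the result."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Groups smaller than the cap contribute everything they have.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    rng.shuffle(sample)
    return sample


# Toy (language, text id) pairs for illustration only.
items = [("en", i) for i in range(10)] + [("fr", i) for i in range(3)]
subset = stratified_sample(items, key=lambda item: item[0], per_group=5)
```

Capping each group stops high-resource languages from dominating the training set while still keeping all of the data available for low-resource ones.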

Performance

              precision    recall  f1-score   support

          af       0.98      0.99      0.98      1382
          am       1.00      0.99      1.00      1157
          an       0.95      0.95      0.95      1016
          ar       0.99      0.99      0.99      1907
          as       1.00      0.99      1.00      1021
          av       0.89      0.83      0.86       179
          ay       0.94      0.95      0.94       206
          az       0.99      0.98      0.99      1338
          ba       0.83      0.75      0.79      1045
          be       0.99      1.00      0.99      1623
          bg       0.98      0.97      0.98      1767
          bn       1.00      0.99      0.99      1178
          bo       0.99      1.00      0.99       262
          br       0.99      0.99      0.99      1471
          bs       0.59      0.63      0.61      1495
          ca       0.96      0.97      0.96      1837
          ce       1.00      1.00      1.00       997
          co       0.98      0.99      0.98      1016
          cs       0.96      0.96      0.96      1758
          cv       1.00      0.98      0.99      1135
          cy       0.99      0.98      0.99      1383
          da       0.90      0.94      0.92      1627
          de       0.96      0.98      0.97      1890
          dv       1.00      1.00      1.00      1180
          el       1.00      1.00      1.00      1868
          en       0.92      0.97      0.95      3512
          eo       0.99      0.98      0.99      1593
          es       0.97      0.95      0.96      2385
          et       0.96      0.95      0.96      1468
          eu       0.99      0.99      0.99      1733
          fa       0.95      0.97      0.96      1720
          fi       0.99      0.99      0.99      1833
          fo       0.98      0.97      0.98      1031
          fr       0.95      0.97      0.96      2312
          fy       0.99      0.97      0.98      1041
          ga       1.00      0.99      0.99      1182
          gd       0.98      0.98      0.98       326
          gl       0.94      0.96      0.95      1586
          gn       1.00      0.99      0.99      1085
          gu       0.99      1.00      0.99      1235
          gv       1.00      1.00      1.00      1075
          ha       0.98      1.00      0.99       217
          he       0.99      1.00      1.00      1699
          hi       0.95      0.99      0.97      1480
          hr       0.73      0.62      0.67      1914
          ht       0.99      0.95      0.97      1165
          hu       1.00      0.99      0.99      1829
          hy       1.00      1.00      1.00      1376
          ia       0.97      0.97      0.97      1616
          id       0.86      0.90      0.88      2024
          ie       0.92      0.94      0.93       514
          ig       1.00      0.90      0.95       251
          io       0.97      0.98      0.97      1489
          is       0.99      0.98      0.99      1729
          it       0.96      0.97      0.96      1814
          ja       1.00      0.99      0.99      1942
          jv       0.97      0.95      0.96       234
          ka       1.00      1.00      1.00      1241
          kk       1.00      1.00      1.00      1385
          kl       1.00      0.99      0.99       811
          km       0.98      0.95      0.97       329
          kn       1.00      1.00      1.00      1120
          ko       1.00      0.99      0.99      1171
          ku       0.99      1.00      1.00      1072
          kv       0.99      0.98      0.99      1025
          kw       0.99      0.98      0.99       264
          ky       1.00      0.99      0.99      1011
          la       0.96      0.98      0.97      1607
          lb       0.99      0.98      0.98      1110
          lg       1.00      0.99      1.00      1025
          li       0.97      0.98      0.98      1002
          ln       0.97      0.93      0.95       207
          lo       0.97      0.96      0.96       316
          lt       0.99      0.99      0.99      1686
          lv       1.00      0.98      0.99      1130
          mg       1.00      1.00      1.00       997
          mi       1.00      1.00      1.00       230
          mk       0.96      0.99      0.98      1602
          ml       1.00      0.99      1.00      1193
          mn       1.00      1.00      1.00      1072
          mr       0.99      0.99      0.99      1602
          ms       0.70      0.69      0.70      1041
          mt       1.00      1.00      1.00      1057
          my       0.77      0.70      0.73       753
          nb       0.67      0.81      0.73      1638
          ne       0.99      0.98      0.99      1212
          nl       0.97      0.96      0.97      1832
          nn       0.90      0.89      0.90      1149
          no       0.61      0.42      0.49      1052
          nv       1.00      1.00      1.00       211
          oc       0.97      0.94      0.95      1665
          om       0.99      0.96      0.97       212
          or       1.00      0.99      1.00      1006
          os       1.00      0.99      1.00      1021
          pa       1.00      1.00      1.00      1154
          pl       0.98      0.99      0.98      1778
          ps       0.96      0.91      0.93      1254
          pt       0.97      0.96      0.96      2285
          qu       0.98      0.98      0.98      1088
          rm       0.98      0.98      0.98      1087
          rn       0.96      0.90      0.93        87
          ro       0.98      0.98      0.98      1796
          ru       0.96      0.97      0.96      1910
          rw       0.93      0.92      0.93       196
          sa       0.99      0.99      0.99      1063
          sc       0.97      0.98      0.97      1019
          sd       0.99      0.99      0.99      1216
          se       0.99      0.97      0.98       194
          si       1.00      0.99      0.99      1133
          sk       0.95      0.96      0.96      1279
          sl       0.96      0.96      0.96      1324
          sn       1.00      0.96      0.98       217
          so       0.99      0.99      0.99      1034
          sq       0.99      0.99      0.99      1134
          sr       0.81      0.89      0.85      2135
          su       0.96      0.96      0.96      1070
          sv       0.98      0.98      0.98      1932
          sw       0.99      0.98      0.98      1079
          ta       1.00      1.00      1.00      1170
          te       1.00      0.99      1.00      1166
          tg       0.99      1.00      0.99      1056
          th       1.00      0.99      0.99      1331
          tk       1.00      0.99      0.99      1659
          tl       0.98      0.96      0.97      1803
          tn       1.00      0.98      0.99       223
          to       1.00      0.99      0.99       207
          tr       0.97      0.99      0.98      1892
          tt       0.85      0.90      0.88      1717
          ug       1.00      1.00      1.00      1646
          uk       0.99      0.99      0.99      1677
          ur       0.99      0.99      0.99      1353
          uz       1.00      0.99      0.99      1147
          vi       0.99      0.99      0.99      1819
          v...

Language identification pipeline v1.1 (sklearn v0.22)

Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.22.
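
Because a separate release is published per scikit-learn version, it may be worth checking the installed version before loading a pipeline. The reason assumed here is that serialized scikit-learn models are not guaranteed to load across versions, which the release notes do not state. The helper below is a hypothetical sketch that compares major.minor components only.

```python
def same_minor_version(installed, required):
    """Return True if two version strings agree on their major.minor components."""
    return installed.split(".")[:2] == required.split(".")[:2]


# In practice `installed` could come from sklearn.__version__.
ok = same_minor_version("0.22.2", "0.22")
mismatch = same_minor_version("0.23.1", "0.22")
```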

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/

Performance

              precision    recall  f1-score   support

          af       0.99      0.98      0.99      1363
          am       1.00      1.00      1.00      1098
          an       0.94      0.96      0.95      1005
          ar       0.99      0.99      0.99      1902
          as       1.00      0.99      1.00       959
          av       0.98      0.81      0.88       186
          ay       0.99      0.94      0.96       224
          az       0.98      0.99      0.99      1348
          ba       0.84      0.76      0.80      1037
          be       1.00      1.00      1.00      1559
          bg       0.97      0.99      0.98      1808
          bn       1.00      0.99      0.99      1175
          bo       0.99      0.99      0.99       281
          br       0.98      0.99      0.99      1469
          bs       0.67      0.48      0.56      1474
          ca       0.97      0.96      0.97      1740
          ce       1.00      0.99      1.00      1030
          co       0.99      0.97      0.98       986
          cs       0.96      0.96      0.96      1830
          cv       1.00      0.99      0.99      1145
          cy       0.99      0.99      0.99      1370
          da       0.92      0.93      0.92      1731
          de       0.97      0.97      0.97      1891
          dv       1.00      1.00      1.00      1138
          el       1.00      1.00      1.00      1882
          en       0.91      0.98      0.94      3589
          eo       0.98      0.99      0.98      1616
          es       0.94      0.96      0.95      2343
          et       0.98      0.95      0.97      1466
          eu       0.98      0.99      0.99      1743
          fa       0.96      0.97      0.96      1693
          fi       0.99      0.98      0.99      1785
          fo       0.99      0.95      0.97      1079
          fr       0.95      0.98      0.96      2302
          fy       0.98      0.98      0.98      1053
          ga       1.00      0.99      0.99      1198
          gd       0.99      0.97      0.98       276
          gl       0.96      0.95      0.95      1539
          gn       0.99      0.99      0.99      1110
          gu       1.00      0.99      1.00      1219
          gv       0.98      0.99      0.99      1031
          ha       0.97      0.99      0.98       230
          he       1.00      1.00      1.00      1566
          hi       0.98      0.97      0.97      1435
          hr       0.69      0.75      0.72      1968
          ht       0.99      0.96      0.97      1163
          hu       0.99      0.99      0.99      1794
          hy       1.00      0.99      1.00      1322
          ia       0.97      0.98      0.97      1602
          id       0.82      0.93      0.87      2107
          ie       0.96      0.91      0.94       513
          ig       0.97      0.93      0.95       230
          io       0.99      0.97      0.98      1522
          is       0.98      0.99      0.98      1607
          it       0.95      0.98      0.96      1937
          ja       1.00      0.99      1.00      1930
          jv       0.97      0.96      0.97       239
          ka       1.00      1.00      1.00      1243
          kk       1.00      1.00      1.00      1348
          kl       1.00      1.00      1.00       809
          km       0.99      0.93      0.96       347
          kn       1.00      1.00      1.00      1188
          ko       1.00      1.00      1.00      1180
          ku       1.00      1.00      1.00      1049
          kv       0.99      0.98      0.99       987
          kw       0.99      0.98      0.99       249
          ky       0.99      0.99      0.99      1074
          la       0.96      0.98      0.97      1605
          lb       0.99      0.97      0.98      1104
          lg       1.00      0.99      1.00      1019
          li       0.98      0.98      0.98      1081
          ln       0.99      0.92      0.95       220
          lo       0.99      0.94      0.96       331
          lt       0.99      0.99      0.99      1645
          lv       0.99      0.98      0.99      1183
          mg       1.00      1.00      1.00      1049
          mi       1.00      1.00      1.00       273
          mk       0.98      0.98      0.98      1643
          ml       1.00      1.00      1.00      1225
          mn       0.99      1.00      1.00      1141
          mr       0.99      0.99      0.99      1682
          ms       0.67      0.61      0.64      1030
          mt       1.00      0.99      1.00      1022
          my       0.80      0.63      0.71       851
          nb       0.66      0.83      0.74      1643
          ne       0.99      0.99      0.99      1180
          nl       0.97      0.97      0.97      1866
          nn       0.91      0.88      0.90      1114
          no       0.62      0.39      0.48      1019
          nv       1.00      1.00      1.00       212
          oc       0.96      0.95      0.96      1621
          om       0.99      0.97      0.98       219
          or       1.00      0.98      0.99      1062
          os       1.00      1.00      1.00      1036
          pa       1.00      1.00      1.00      1085
          pl       0.99      0.99      0.99      1804
          ps       0.95      0.91      0.93      1151
          pt       0.96      0.97      0.97      2335
          qu       0.99      0.97      0.98      1098
          rm       0.99      0.98      0.98      1105
          rn       0.94      0.83      0.88        96
          ro       1.00      0.98      0.99      1814
          ru       0.96      0.98      0.97      1870
          rw       0.93      0.96      0.94       205
          sa       0.99      1.00      0.99      1019
          sc       0.98      0.98      0.98      1041
          sd       0.99      0.99      0.99      1274
          se       0.98      0.98      0.98       187
          si       1.00      1.00      1.00      1189
          sk       0.96      0.95      0.95      1281
          sl       0.95      0.96      0.96      1306
          sn       0.98      0.95      0.96       208
          so       1.00      0.98      0.99      1036
          sq       0.99      0.99      0.99      1148
          sr       0.81      0.90      0.85      2153
          su       0.99      0.95      0.97      1000
          sv       0.98      0.98      0.98      1817
          sw       0.99      0.98      0.98      1042
          ta       1.00      1.00      1.00      1196
          te       1.00      0.98      0.99      1124
          tg       1.00      0.99      0.99      1012
          th       0.99      0.99      0.99      1273
          tk       0.99      1.00      1.00      1595
          tl       0.96      0.98      0.97      1843
          tn       1.00      1.00      1.00       207
          to       1.00      0.98      0.99       212
          tr       0.99      0.97      0.98      1881
          tt       0.86      0.91      0.88      1690
          ug       1.00      1.00      1.00      1773
          uk       0.99      0.99      0.99      1771
          ur       0.99      0.99      0.99      1307
          uz       0.98      0.99      0.99      1063
          vi       1.00      0.99      0.99      1849
          vo ...

Language identification pipeline v1.1 (sklearn v0.21)

Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.21.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/

Performance

              precision    recall  f1-score   support

          af       0.98      0.98      0.98      1335
          am       1.00      0.99      1.00      1098
          an       0.96      0.96      0.96      1008
          ar       0.99      0.99      0.99      1889
          as       1.00      0.99      0.99      1034
          av       0.92      0.90      0.91       205
          ay       0.99      0.94      0.97       200
          az       0.99      0.99      0.99      1311
          ba       0.89      0.71      0.79      1064
          be       0.99      1.00      0.99      1606
          bg       0.98      0.97      0.98      1856
          bn       1.00      0.99      0.99      1183
          bo       0.99      1.00      0.99       292
          br       1.00      0.99      0.99      1441
          bs       0.65      0.52      0.58      1570
          ca       0.96      0.96      0.96      1776
          ce       1.00      1.00      1.00      1023
          co       0.99      0.97      0.98      1074
          cs       0.98      0.94      0.96      1752
          cv       1.00      0.99      0.99      1101
          cy       1.00      0.99      0.99      1363
          da       0.92      0.93      0.92      1744
          de       0.96      0.97      0.97      1893
          dv       1.00      1.00      1.00      1102
          el       1.00      1.00      1.00      1857
          en       0.92      0.97      0.95      3545
          eo       0.99      0.99      0.99      1635
          es       0.94      0.97      0.95      2307
          et       0.95      0.96      0.95      1417
          eu       0.99      0.99      0.99      1737
          fa       0.94      0.99      0.96      1651
          fi       0.99      0.99      0.99      1736
          fo       0.98      0.98      0.98      1110
          fr       0.95      0.98      0.96      2351
          fy       0.98      0.98      0.98       997
          ga       1.00      0.99      0.99      1183
          gd       0.96      0.98      0.97       305
          gl       0.95      0.94      0.95      1435
          gn       1.00      0.99      0.99      1072
          gu       1.00      0.99      0.99      1247
          gv       0.99      0.99      0.99      1050
          ha       0.98      0.99      0.99       224
          he       0.99      1.00      1.00      1639
          hi       0.98      0.96      0.97      1426
          hr       0.65      0.76      0.70      1867
          ht       0.98      0.97      0.98      1226
          hu       1.00      0.99      0.99      1768
          hy       1.00      1.00      1.00      1333
          ia       0.96      0.98      0.97      1710
          id       0.84      0.91      0.88      2073
          ie       0.95      0.94      0.95       530
          ig       0.96      0.89      0.93       209
          io       0.98      0.98      0.98      1493
          is       0.99      0.99      0.99      1812
          it       0.95      0.97      0.96      1849
          ja       1.00      0.99      1.00      1817
          jv       0.98      0.93      0.96       275
          ka       1.00      1.00      1.00      1216
          kk       1.00      1.00      1.00      1403
          kl       1.00      1.00      1.00       851
          km       1.00      0.96      0.98       360
          kn       1.00      1.00      1.00      1161
          ko       1.00      0.99      1.00      1148
          ku       1.00      0.99      0.99      1119
          kv       0.99      0.98      0.99       989
          kw       1.00      0.99      0.99       287
          ky       0.99      1.00      0.99      1050
          la       0.96      0.98      0.97      1632
          lb       0.98      0.98      0.98      1090
          lg       1.00      1.00      1.00      1024
          li       0.99      0.97      0.98      1043
          ln       0.94      0.94      0.94       237
          lo       0.99      0.94      0.96       317
          lt       0.99      0.99      0.99      1674
          lv       0.99      0.98      0.99      1171
          mg       1.00      1.00      1.00      1011
          mi       1.00      1.00      1.00       250
          mk       0.97      0.98      0.98      1666
          ml       1.00      1.00      1.00      1211
          mn       1.00      0.99      1.00      1139
          mr       0.99      0.99      0.99      1698
          ms       0.65      0.64      0.64      1044
          mt       1.00      0.99      0.99      1017
          my       0.78      0.64      0.70       795
          nb       0.68      0.79      0.73      1601
          ne       0.98      0.99      0.99      1257
          nl       0.96      0.98      0.97      1860
          nn       0.90      0.92      0.91      1174
          no       0.58      0.42      0.49       965
          nv       1.00      1.00      1.00       215
          oc       0.98      0.94      0.96      1641
          om       0.97      0.96      0.97       221
          or       1.00      0.99      0.99      1097
          os       1.00      0.99      1.00      1052
          pa       1.00      1.00      1.00      1110
          pl       0.99      0.99      0.99      1839
          ps       0.97      0.90      0.93      1163
          pt       0.98      0.95      0.97      2392
          qu       0.98      0.98      0.98      1049
          rm       0.99      0.98      0.99      1091
          rn       0.97      0.85      0.90        71
          ro       0.98      0.99      0.98      1764
          ru       0.97      0.97      0.97      1860
          rw       0.90      0.93      0.92       213
          sa       1.00      0.99      0.99      1083
          sc       0.98      0.97      0.98      1053
          sd       1.00      0.98      0.99      1248
          se       0.98      0.97      0.98       191
          si       1.00      0.99      1.00      1149
          sk       0.94      0.97      0.96      1239
          sl       0.94      0.96      0.95      1233
          sn       1.00      0.94      0.97       226
          so       1.00      0.98      0.99       994
          sq       0.99      0.99      0.99      1131
          sr       0.83      0.87      0.85      2186
          su       0.96      0.96      0.96      1020
          sv       0.98      0.98      0.98      1865
          sw       0.98      0.98      0.98      1048
          ta       1.00      1.00      1.00      1236
          te       1.00      0.99      0.99      1094
          tg       0.99      0.99      0.99      1054
          th       1.00      0.99      0.99      1337
          tk       1.00      1.00      1.00      1701
          tl       0.97      0.97      0.97      1792
          tn       1.00      0.99      0.99       175
          to       1.00      0.99      1.00       204
          tr       0.98      0.98      0.98      1909
          tt       0.83      0.94      0.88      1641
          ug       1.00      1.00      1.00      1648
          uk       0.99      0.98      0.99      1735
          ur       0.99      0.99      0.99      1339
          uz       1.00      0.98      0.99      1081
          vi       1.00      0.98      0.99      1873
          v...

Language identification pipeline v1.1 (sklearn v0.20)


Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.20.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
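
As a rough illustration of the two-step pipeline described above, here is a minimal sketch. The analyzer, n-gram range, feature count, normalization, and hidden-layer size follow the description; `max_iter`, `random_state`, the step names, and the toy training data are assumptions made for this example and are not taken from the release.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline


def make_lang_id_pipeline(n_features=4096, hidden_units=512):
    """Build an untrained pipeline shaped like the one described above."""
    vectorizer = HashingVectorizer(
        analyzer="char",        # character n-grams rather than word tokens
        ngram_range=(1, 3),     # unigrams, bigrams, and trigrams
        n_features=n_features,  # size of the hashed feature vector
        norm="l2",              # L2-normalize each text's hashed counts
    )
    classifier = MLPClassifier(
        hidden_layer_sizes=(hidden_units,),  # a single hidden layer
        activation="relu",                   # rectified linear units
        max_iter=200,                        # assumed; not stated in the release
        random_state=0,                      # assumed; for reproducibility only
    )
    return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])


# Toy data, far too small to train a useful model; for illustration only.
texts = [
    "the cat sat on the mat", "where is the train station",
    "le chat est sur le tapis", "où est la gare",
    "die Katze sitzt auf der Matte", "wo ist der Bahnhof",
]
langs = ["en", "en", "fr", "fr", "de", "de"]

pipeline = make_lang_id_pipeline(hidden_units=16)
pipeline.fit(texts, langs)

# One probability per known language code, in the order of classes_.
probs = pipeline.predict_proba(["the dog sat on the mat"])[0]
```

With more than two classes, MLPClassifier applies a softmax to its outputs, so the values in `probs` sum to one.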

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
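
The release does not say how the "randomized, stratified subset" was drawn. One common reading is that each language is sampled separately at the same rate, so the subset keeps the language mix of the full collection. The sketch below shows that reading only; the function name, the `fraction` parameter, and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, fraction, seed=0):
    """Randomly keep about `fraction` of the texts for each language label."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in records:
        by_lang[lang].append(text)

    sample = []
    for lang in sorted(by_lang):
        texts = by_lang[lang]
        # Keep at least one text so that no language disappears entirely.
        n_keep = max(1, round(len(texts) * fraction))
        sample.extend((text, lang) for text in rng.sample(texts, n_keep))

    rng.shuffle(sample)  # mix the languages back together
    return sample


# Hypothetical (text, language) records standing in for the real sources.
records = [("text %d" % i, "en") for i in range(10)]
records += [("texte %d" % i, "fr") for i in range(4)]
subset = stratified_sample(records, fraction=0.5)
```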

Performance

              precision    recall  f1-score   support

          af       0.99      0.98      0.98      1372
          am       1.00      1.00      1.00      1063
          an       0.95      0.97      0.96      1017
          ar       1.00      0.99      0.99      1944
          as       1.00      1.00      1.00      1029
          av       0.95      0.86      0.90       190
          ay       0.98      0.90      0.94       220
          az       0.99      0.99      0.99      1351
          ba       0.89      0.72      0.80      1024
          be       0.99      1.00      0.99      1614
          bg       0.98      0.98      0.98      1804
          bn       1.00      0.99      1.00      1204
          bo       0.99      0.99      0.99       278
          br       0.99      0.99      0.99      1476
          bs       0.60      0.66      0.62      1526
          ca       0.97      0.96      0.96      1805
          ce       1.00      1.00      1.00      1032
          co       0.98      0.99      0.99       960
          cs       0.97      0.94      0.96      1860
          cv       1.00      0.99      0.99      1129
          cy       0.98      0.99      0.99      1304
          da       0.93      0.92      0.92      1793
          de       0.96      0.97      0.97      1936
          dv       1.00      1.00      1.00      1139
          el       1.00      1.00      1.00      1909
          en       0.93      0.98      0.95      3643
          eo       0.97      0.99      0.98      1573
          es       0.95      0.95      0.95      2310
          et       0.96      0.94      0.95      1422
          eu       0.99      0.99      0.99      1739
          fa       0.95      0.97      0.96      1685
          fi       0.99      0.99      0.99      1788
          fo       0.98      0.98      0.98      1050
          fr       0.96      0.97      0.96      2323
          fy       0.98      0.97      0.98      1006
          ga       1.00      0.99      0.99      1198
          gd       1.00      0.98      0.99       285
          gl       0.94      0.96      0.95      1469
          gn       1.00      0.99      0.99      1059
          gu       1.00      0.99      1.00      1237
          gv       0.99      1.00      0.99      1014
          ha       0.99      0.98      0.98       201
          he       0.99      1.00      1.00      1678
          hi       0.98      0.97      0.97      1409
          hr       0.71      0.64      0.67      1914
          ht       0.94      0.97      0.96      1196
          hu       1.00      0.99      0.99      1712
          hy       1.00      1.00      1.00      1363
          ia       0.95      0.98      0.97      1603
          id       0.88      0.85      0.87      2143
          ie       0.93      0.93      0.93       504
          ig       0.94      0.93      0.94       193
          io       0.98      0.98      0.98      1474
          is       0.99      0.99      0.99      1750
          it       0.96      0.97      0.97      1892
          ja       1.00      1.00      1.00      1887
          jv       0.98      0.94      0.96       234
          ka       1.00      1.00      1.00      1273
          kk       1.00      1.00      1.00      1381
          kl       1.00      1.00      1.00       848
          km       1.00      0.94      0.97       343
          kn       1.00      1.00      1.00      1136
          ko       1.00      0.99      0.99      1184
          ku       1.00      0.99      0.99      1083
          kv       0.98      0.99      0.98      1057
          kw       1.00      0.98      0.99       251
          ky       1.00      0.99      0.99      1088
          la       0.98      0.97      0.97      1647
          lb       0.99      0.98      0.98      1054
          lg       1.00      0.99      0.99      1059
          li       0.99      0.98      0.98       999
          ln       0.87      0.96      0.91       219
          lo       0.98      0.93      0.96       302
          lt       0.99      0.99      0.99      1675
          lv       0.99      0.98      0.99      1151
          mg       1.00      1.00      1.00      1041
          mi       1.00      0.98      0.99       263
          mk       0.97      0.99      0.98      1645
          ml       1.00      1.00      1.00      1159
          mn       1.00      1.00      1.00      1041
          mr       0.99      0.99      0.99      1665
          ms       0.59      0.75      0.66      1059
          mt       1.00      0.99      0.99      1009
          my       0.80      0.60      0.69       805
          nb       0.75      0.68      0.71      1682
          ne       0.99      0.99      0.99      1225
          nl       0.96      0.97      0.97      1791
          nn       0.89      0.92      0.91      1143
          no       0.53      0.61      0.57       980
          nv       1.00      1.00      1.00       232
          oc       0.96      0.96      0.96      1682
          om       1.00      0.95      0.97       217
          or       1.00      0.99      0.99      1016
          os       1.00      1.00      1.00      1073
          pa       1.00      1.00      1.00      1154
          pl       0.99      0.99      0.99      1801
          ps       0.96      0.91      0.93      1208
          pt       0.97      0.96      0.97      2453
          qu       0.97      0.98      0.98      1075
          rm       0.99      0.98      0.99      1113
          rn       0.95      0.91      0.93        82
          ro       0.99      0.98      0.99      1761
          ru       0.96      0.97      0.96      1838
          rw       0.96      0.93      0.94       219
          sa       1.00      0.99      0.99      1016
          sc       0.99      0.97      0.98      1053
          sd       0.98      0.99      0.99      1230
          se       0.98      0.97      0.98       198
          si       0.99      1.00      1.00      1113
          sk       0.93      0.96      0.95      1279
          sl       0.95      0.97      0.96      1356
          sn       0.99      0.96      0.97       219
          so       1.00      0.98      0.99      1048
          sq       1.00      0.98      0.99      1125
          sr       0.85      0.86      0.86      2196
          su       0.98      0.96      0.97      1031
          sv       0.98      0.97      0.97      1761
          sw       0.99      0.98      0.98       991
          ta       1.00      0.99      1.00      1185
          te       1.00      0.99      1.00      1146
          tg       0.99      0.99      0.99       987
          th       0.99      0.99      0.99      1315
          tk       1.00      0.99      1.00      1676
          tl       0.94      0.98      0.96      1892
          tn       1.00      0.98      0.99       194
          to       1.00      1.00      1.00       210
          tr       0.98      0.98      0.98      1849
          tt       0.84      0.94      0.89      1643
          ug       1.00      1.00      1.00      1666
          uk       0.99      0.98      0.99      1764
          ur       0.99      0.99      0.99      1329
          uz       0.99      0.99      0.99      1102
          vi       1.00      0.99      0.99      1804
          v...

Language identification pipeline v1.1 (sklearn v0.19)


Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.19.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
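
To make the feature-extraction step concrete, the sketch below counts character n-grams by hand, hashes them into a fixed number of buckets, and L2-normalizes the result. It illustrates the idea rather than reproducing HashingVectorizer: `zlib.crc32` is a stand-in hash chosen for this example (scikit-learn uses MurmurHash3 and, by default, alternates the sign of hashed values), and the function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def char_ngrams(text, min_n=1, max_n=3):
    """Yield every character n-gram of length min_n..max_n in `text`."""
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hashed_features(text, n_features=4096):
    """Count n-grams, hash them into n_features buckets, then L2-normalize."""
    counts = Counter(char_ngrams(text.lower()))
    vector = [0.0] * n_features
    for ngram, count in counts.items():
        # Stand-in hash for this sketch; distinct n-grams may share a bucket.
        bucket = zlib.crc32(ngram.encode("utf-8")) % n_features
        vector[bucket] += count
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


features = hashed_features("Hello, world!")
```

Because the vector length is fixed in advance, no vocabulary has to be stored, at the cost of occasional collisions between n-grams.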

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/

Performance

             precision    recall  f1-score   support

         af       0.99      0.98      0.98      1294
         am       1.00      0.99      1.00      1115
         an       0.97      0.95      0.96      1052
         ar       0.99      0.99      0.99      1927
         as       1.00      0.99      0.99       989
         av       0.96      0.79      0.87       177
         ay       0.99      0.92      0.96       212
         az       1.00      0.98      0.99      1318
         ba       0.88      0.74      0.81      1034
         be       1.00      0.99      1.00      1646
         bg       0.97      0.98      0.98      1801
         bn       0.99      0.99      0.99      1177
         bo       1.00      0.99      0.99       317
         br       0.99      0.99      0.99      1503
         bs       0.63      0.51      0.56      1497
         ca       0.97      0.95      0.96      1819
         ce       1.00      0.99      1.00      1053
         co       0.98      0.97      0.98      1026
         cs       0.96      0.96      0.96      1766
         cv       1.00      0.99      0.99      1116
         cy       0.99      0.99      0.99      1338
         da       0.89      0.95      0.92      1705
         de       0.97      0.96      0.97      1818
         dv       1.00      1.00      1.00      1112
         el       1.00      1.00      1.00      1904
         en       0.89      0.99      0.94      3607
         eo       0.98      0.98      0.98      1628
         es       0.95      0.97      0.96      2331
         et       0.92      0.96      0.94      1423
         eu       0.99      0.99      0.99      1725
         fa       0.93      0.98      0.96      1590
         fi       0.99      0.98      0.99      1814
         fo       0.98      0.98      0.98      1022
         fr       0.94      0.97      0.96      2322
         fy       1.00      0.98      0.99      1073
         ga       1.00      0.99      0.99      1185
         gd       0.95      0.97      0.96       308
         gl       0.96      0.95      0.95      1503
         gn       0.99      0.99      0.99      1096
         gu       1.00      0.99      1.00      1230
         gv       1.00      0.99      0.99       992
         ha       0.97      0.97      0.97       236
         he       1.00      1.00      1.00      1623
         hi       0.94      0.98      0.96      1396
         hr       0.65      0.78      0.71      1968
         ht       0.97      0.97      0.97      1190
         hu       1.00      0.99      0.99      1814
         hy       1.00      1.00      1.00      1315
         ia       0.98      0.96      0.97      1559
         id       0.89      0.84      0.86      2148
         ie       0.93      0.93      0.93       538
         ig       0.99      0.86      0.92       198
         io       0.98      0.98      0.98      1476
         is       0.99      0.99      0.99      1730
         it       0.97      0.97      0.97      1866
         ja       1.00      0.99      0.99      1892
         jv       0.96      0.96      0.96       250
         ka       1.00      0.99      1.00      1275
         kk       1.00      1.00      1.00      1406
         kl       1.00      1.00      1.00       861
         km       1.00      0.93      0.96       345
         kn       1.00      1.00      1.00      1160
         ko       1.00      0.99      0.99      1182
         ku       1.00      0.99      0.99      1060
         kv       0.99      0.97      0.98       951
         kw       1.00      0.99      0.99       269
         ky       1.00      0.99      0.99      1047
         la       0.96      0.98      0.97      1603
         lb       1.00      0.97      0.98      1052
         lg       1.00      0.99      0.99      1032
         li       0.97      0.98      0.98      1005
         ln       0.98      0.94      0.96       232
         lo       0.99      0.93      0.96       295
         lt       1.00      0.99      0.99      1643
         lv       0.99      0.98      0.99      1157
         mg       1.00      1.00      1.00      1039
         mi       1.00      0.99      1.00       244
         mk       0.98      0.98      0.98      1625
         ml       1.00      1.00      1.00      1186
         mn       1.00      0.99      1.00      1140
         mr       0.99      0.99      0.99      1670
         ms       0.58      0.75      0.65       991
         mt       1.00      0.99      0.99      1030
         my       0.82      0.62      0.71       796
         nb       0.69      0.79      0.74      1605
         ne       1.00      0.98      0.99      1215
         nl       0.97      0.97      0.97      1827
         nn       0.89      0.94      0.92      1157
         no       0.66      0.39      0.49      1055
         nv       1.00      1.00      1.00       229
         oc       0.96      0.95      0.95      1639
         om       1.00      0.97      0.98       214
         or       1.00      0.99      1.00      1006
         os       1.00      1.00      1.00      1024
         pa       1.00      1.00      1.00      1137
         pl       0.99      0.99      0.99      1817
         ps       0.97      0.90      0.94      1302
         pt       0.95      0.98      0.96      2351
         qu       1.00      0.97      0.98      1078
         rm       0.98      0.99      0.99      1102
         rn       0.94      0.90      0.92        90
         ro       0.99      0.98      0.98      1777
         ru       0.96      0.97      0.97      1929
         rw       0.95      0.94      0.95       211
         sa       0.99      0.99      0.99      1049
         sc       0.96      0.99      0.97       991
         sd       0.99      0.98      0.99      1271
         se       0.98      0.98      0.98       230
         si       1.00      0.99      1.00      1120
         sk       0.96      0.96      0.96      1263
         sl       0.96      0.96      0.96      1314
         sn       0.99      0.89      0.94       223
         so       1.00      0.99      0.99      1047
         sq       1.00      0.99      0.99      1159
         sr       0.87      0.82      0.85      2230
         su       0.97      0.96      0.96       985
         sv       0.98      0.98      0.98      1860
         sw       0.99      0.97      0.98      1064
         ta       1.00      1.00      1.00      1258
         te       1.00      0.99      0.99      1157
         tg       1.00      0.99      0.99      1069
         th       0.99      0.99      0.99      1382
         tk       1.00      0.99      1.00      1630
         tl       0.98      0.96      0.97      1875
         tn       1.00      0.98      0.99       187
         to       1.00      0.99      0.99       210
         tr       0.97      0.99      0.98      1814
         tt       0.85      0.93      0.89      1609
         ug       1.00      1.00      1.00      1731
         uk       0.99      0.99      0.99      1766
         ur       0.99      0.99      0.99      1313
         uz       0.99      0.99      0.99      1073
         vi       0.99      0.99      0.99      1790
         vo       1.00      1.00      1.00      1160
         wa       0.99      0.99      0.99      1027
         wo       0.95      0.96     ...

Language Identification model (PY3)


@bdewilde bdewilde released this 04 Jun 00:35

Functionality for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~130 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
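
The classifier's forward pass (one ReLU hidden layer followed by a softmax) can be sketched in a few lines of NumPy. The weights below are random placeholders rather than the trained parameters, and `n_langs = 130` only approximates the "~130" stated above; the function and variable names are hypothetical.

```python
import numpy as np


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Map one feature vector to a probability per language."""
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)  # ReLU hidden layer
    logits = hidden @ w_out + b_out
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()          # softmax


rng = np.random.default_rng(0)
n_features, n_hidden, n_langs = 4096, 512, 130  # n_langs is approximate

# Placeholder input: a random L2-normalized vector standing in for hashed n-gram counts.
x = rng.random(n_features)
x /= np.linalg.norm(x)

probs = mlp_forward(
    x,
    rng.normal(scale=0.01, size=(n_features, n_hidden)),  # placeholder weights
    np.zeros(n_hidden),
    rng.normal(scale=0.01, size=(n_hidden, n_langs)),      # placeholder weights
    np.zeros(n_langs),
)
```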

Dataset

The pipeline was trained on a randomized, stratified subset of ~1.5M texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of ~5M sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • Leipzig Corpora: A collection of corpora for many languages in the same format and pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes. Only the most recently updated version was used, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of ~1.5k tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html

Language Identification model (PY2)


@bdewilde bdewilde released this 04 Jun 00:36

Functionality for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~130 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.

Dataset

The pipeline was trained on a randomized, stratified subset of ~1.5M texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of ~5M sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • Leipzig Corpora: A collection of corpora for many languages in the same format and pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes. Only the most recently updated version was used, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of ~1.5k tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html

Language identification model v3.0


@bdewilde bdewilde released this 02 Apr 20:49

Model for identifying the most probable language(s) of a text, inspired by -- and using the same methodology as -- Facebook's fastText.

Model

Text is tokenized into a bag of word 1- and 2-grams and character 1- through 5-grams. The collection of n-grams is embedded into a 128-dimensional space, then averaged. The resulting features are fed into a linear classifier with a hierarchical softmax output to compute (approximate) language probabilities for 140 ISO 639-1 languages.
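
The featurization described above can be sketched as follows: collect word and character n-grams, look each one up in an embedding table, and average. This is a simplified illustration, not fastText's actual tokenization or hashing. The bucket count, the `zlib.crc32` hash, the random embedding table, and the function names are all assumptions for this example, and the linear classifier with hierarchical softmax is left out.

```python
import zlib

import numpy as np


def text_ngrams(text):
    """Return word 1- and 2-grams plus character 1- through 5-grams."""
    words = text.split()
    grams = list(words)                                          # word unigrams
    grams += [" ".join(pair) for pair in zip(words, words[1:])]  # word bigrams
    for n in range(1, 6):                                        # char 1..5-grams
        grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams


def embed(text, table):
    """Average the embedding rows selected by hashing each n-gram."""
    grams = text_ngrams(text)
    if not grams:
        return np.zeros(table.shape[1])
    rows = [zlib.crc32(gram.encode("utf-8")) % len(table) for gram in grams]
    return table[rows].mean(axis=0)


rng = np.random.default_rng(0)
n_buckets, dim = 10_000, 128  # n_buckets is hypothetical; dim follows the text
table = rng.normal(size=(n_buckets, dim))  # placeholder for learned embeddings
features = embed("language identification", table)
```

Averaging makes the feature vector independent of text length, which is why a single linear layer can then score every language.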

Dataset

The model was trained on a randomized, stratified subset of ~2.9M texts drawn from several sources:

  • WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is "encyclopedic". Source: https://zenodo.org/record/841984
  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
  • TED 2020: A crawl of nearly 4000 TED and TEDx transcripts from 2020, translated by a global community of volunteers into more than 100 languages. Style is conversational; subject matter covers a broad range of topics. Source: https://opus.nlpl.eu/TED2020.php
  • SETimes: A corpus of news articles in Balkan languages, originally extracted from http://www.setimes.com and compiled by Nikola Ljubešić. Source: https://opus.nlpl.eu/SETIMES.php

Performance

The trained model achieved F1 = 0.97 when averaged over all languages.

A few languages perform worse, most notably the two Norwegian variants ("nb" and "no") and the trio of Bosnian ("bs"), Serbian ("sr"), and Croatian ("hr"); the languages within each group are extremely similar to each other.
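
For reference, per-language F1 is the harmonic mean of precision and recall, and an average over languages can be either unweighted (macro) or weighted by support. The release does not say which kind of average gave 0.97, so the sketch below shows both. It uses three rows copied from this release's per-language table purely as sample input; the result is not the overall score. Because the table values are rounded, a recomputed F1 can differ from the printed one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(rows):
    """Unweighted mean of per-language F1 scores."""
    scores = [f1_score(p, r) for p, r, _ in rows.values()]
    return sum(scores) / len(scores)


def weighted_f1(rows):
    """Mean of per-language F1 scores, weighted by support."""
    total = sum(support for _, _, support in rows.values())
    return sum(f1_score(p, r) * s for p, r, s in rows.values()) / total


# (precision, recall, support) for three rows of the v3.0 table.
rows = {
    "bs": (0.63, 0.66, 4457),
    "fr": (0.82, 0.99, 9080),
    "hr": (0.82, 0.79, 7748),
}
macro = macro_f1(rows)
weighted = weighted_f1(rows)
```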

              precision    recall  f1-score   support

          af       0.96      0.97      0.96       948
          am       1.00      1.00      1.00       220
          an       0.93      0.80      0.86       101
          ar       1.00      0.80      0.89      7953
          as       0.96      0.96      0.96       159
          av       0.89      0.77      0.83       101
          ay       0.93      0.92      0.93       106
          az       0.99      0.97      0.98      1644
          ba       0.94      0.98      0.96       116
          be       1.00      0.99      0.99      4600
          bg       0.99      0.99      0.99      7475
          bn       1.00      0.99      1.00      1516
          bo       1.00      0.99      1.00       200
          br       0.98      0.99      0.99       483
          bs       0.63      0.66      0.65      4457
          ca       0.98      0.99      0.98      6863
          ce       0.99      1.00      1.00       101
          co       0.95      0.93      0.94       106
          cs       0.99      0.98      0.99      7947
          cu       1.00      1.00      1.00       404
          cv       0.99      0.95      0.97       188
          cy       0.99      0.98      0.99       502
          da       0.96      0.95      0.95      5178
          de       0.99      0.99      0.99      7975
          dv       1.00      1.00      1.00       107
          el       1.00      1.00      1.00      6982
          en       0.97      0.97      0.97      9944
          eo       0.99      0.99      0.99      2920
          es       0.98      0.98      0.98      9078
          et       0.99      0.99      0.99      6338
          eu       0.99      0.99      0.99      2655
          fa       1.00      1.00      1.00      7395
          fi       0.99      0.99      0.99      7950
          fo       0.94      0.96      0.95       432
          fr       0.82      0.99      0.90      9080
          fy       0.94      0.87      0.91       132
          ga       0.99      0.99      0.99      1204
          gd       0.98      0.99      0.99       744
          gl       0.96      0.96      0.96      4239
          gn       0.99      0.97      0.98       278
          gu       1.00      1.00      1.00      1601
          gv       0.95      0.99      0.97       214
          ha       0.99      0.99      0.99      1813
          he       1.00      1.00      1.00      5895
          hi       1.00      1.00      1.00      5314
          hr       0.82      0.79      0.80      7748
          ht       0.99      0.96      0.97       160
          hu       1.00      0.99      1.00      4846
          hy       1.00      1.00      1.00      3804
          ia       0.95      0.96      0.96      1795
          id       0.95      0.96      0.95      6735
          ie       0.91      0.91      0.91       439
          ig       0.96      0.87      0.91       126
          io       0.95      0.92      0.94       639
          is       0.99      0.99      0.99      4795
          it       0.99      0.99      0.99      7964
          ja       1.00      1.00      1.00      7892
          jv       0.96      0.90      0.93       177
          ka       1.00      1.00      1.00      3115
          kk       1.00      0.99      0.99      1543
          km       0.99      0.97      0.98       229
          kn       1.00      1.00      1.00       329
          ko       1.00      1.00      1.00      4951
          ku       1.00      1.00      1.00      2809
          kv       0.96      0.95      0.95       100
          kw       0.99      0.95      0.97       210
          ky       0.97      0.95      0.96       196
          la       0.99      0.99      0.99      5276
          lb       0.92      0.93      0.93       157
          lg       0.95      0.98      0.97       105
          li       0.99      0.96      0.97       100
          ln       0.96      0.97      0.96       553
          lo       0.97      0.94      0.95       157
          lt       1.00      1.00      1.00      5119
          lv       0.99      1.00      1.00      5119
          mg       0.97      0.97      0.97       148
          mi       0.98      0.94      0.96       135
          mk       0.99      0.99      0.99      6485
          ml       1.00      1.00      1.00       731
          mn       1.00      1.00      1.00      2993
          mr       1.00      1.00      1.00      3276
          ms       0.79      0.73      0.76      1349
          mt       0.97      0.98      0.98       437
          my       0.93      0.96      0.95      3937
          nb       0.85      0.89      0.87      3910
          ne       0.99      0.98      0.99       497
          nl       0.99      0.99      0.99      6730
          nn       0.55      0.49      0.52       343
          no       0.87      0.87      0.87      3466
          nv       1.00      0.98      0.99       113
          oc       0.87      0.88      0.87       520
          om       0.94      0.97      0.96       106
          or       1.00      0.96      0.98       103
          os       0.98      1.00      0.99       454
          pa       1.00      1.00      1.00       178
          pl       1.00      1.00      1.00      7960
          ps       0.99      0.97      0.98       213
          pt       0.98      0.99      0.98      9082
          qu       0.95      0.93      0.94       137
          rm       0.94      0.94      0.94       144
          rn       0.96      0.90      0.93       223
          ro       1.00      0.99      0.99      9976
          ru       0.99      0.99      0.99      7962
          rw       0.87      0.87      0.87       108
          sa       0.99      0.99      0.99       356
          sc       0.85      0.93      0.89       107
          sd       0.99      0.98      0.98       100
          se       0.93      0.96      0.94       112
          si       0.99      0.97      0.98       212
          sk       0.98      0.97      0.97      4292
          sl       0.98      0.98      0.98      4999
          sn       0.93      0.89      0.91       110
          so       0.98      0.96      0.97       313
          sq       0.99      0.99      0.99      4962
          sr       0.85      0.86      0.86      8340
          su       0.95      0.97      0.96       108
          sv       0.99      0.99      0.99      6060
          sw       0.94      0.95      0.95       106
          ta       1.00      1.00      1.00      1321
          te       1.00      1.00      1.00       660
          tg       0.99      0.98      0.98       165
          th       1.00      1.00      1.00      3092
          tk       0.98      0.97      0.98       638
          tl       0.99      0.99      0.99      1933
          tn       0.95      0.98      0.96       109
          to       0.99      1.00      1.00       107
          tr       0.99      1.00      0.99      9965
          tt       0.99      0.99      0.99      1236
          ug       1.00      1.00      1.00      1094
          uk       0.99      0.99      0.99      5420
          ur       1.00      1.00      1.00      2540
          uz       0.98      0.98      0.98       856
          vi       1.00      1.00      1.00      4771
          vo       0.98      0.96      0.97       298
          wa       0.98      0.93      0.95       108
          wo       0.97      0.97      0.97       349
          xh       0.94      0.93      0.94       120
          yi       1.00      1.00      1.00  ...
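As a reading aid for the table above, here is a minimal sketch that assumes the four numeric columns are precision, recall, F1 score, and support (the layout of a scikit-learn style classification report; the header row is not shown in this excerpt, so this is an assumption). Under that assumption, the third column should be approximately the harmonic mean of the first two. The helper name `f1_score` is hypothetical and not part of the release.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table above: (label, precision, recall, reported F1).
# Column meanings are assumed, as noted in the lead-in.
rows = [
    ("ar", 1.00, 0.80, 0.89),
    ("fr", 0.82, 0.99, 0.90),
    ("nn", 0.55, 0.49, 0.52),
]

for label, p, r, reported in rows:
    print(f"{label}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

Because the table shows values rounded to two decimals, recomputing from the rounded precision and recall can differ from the reported value by 0.01 for some rows (for example `bs`).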