Releases: bdewilde/textacy-data
Supreme Court dataset (for Python 3)
A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 — the "modern" era.
Records include the following fields:
- text: full text of the Court's decision
- case_name: name of the court case, in all caps
- argument_date: date on which the case was argued before the Court, as a string with format 'YYYY-MM-DD'
- decision_date: date on which the Court's decision was announced, as a string with format 'YYYY-MM-DD'
- decision_direction: ideological direction of the majority decision; either 'conservative', 'liberal', or 'unspecifiable'
- maj_opinion_author: name of the majority opinion's author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes
- n_maj_votes: number of justices voting in the majority
- n_min_votes: number of justices voting in the minority
- issue: subject matter of the case's core disagreement (e.g. affirmative action) rather than its legal basis (e.g. the equal protection clause), as a string code whose mapping is given in SupremeCourt.issue_codes
- issue_area: higher-level categorization of the issue (e.g. Civil Rights), as an integer code whose mapping is given in SupremeCourt.issue_area_codes
- us_cite_id: citation identifier for each case according to the official United States Reports. Note: There are ~300 cases with duplicate ids, and it's not clear if that's "correct" or a data quality problem
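As a rough illustration of working with the fields listed above, the sketch below parses the 'YYYY-MM-DD' date strings and filters records by decision direction. The two records are made-up placeholders that only mimic the field layout; they are not entries from the dataset, and the helper names are hypothetical.

```python
from datetime import date, datetime

# Hypothetical records mimicking the documented field layout.
# The values are placeholders, not real entries from the dataset.
records = [
    {
        "case_name": "EXAMPLE CO. v. SAMPLE",
        "argument_date": "1950-01-10",
        "decision_date": "1950-03-27",
        "decision_direction": "liberal",
        "n_maj_votes": 6,
        "n_min_votes": 3,
    },
    {
        "case_name": "PLACEHOLDER v. UNITED STATES",
        "argument_date": "1987-11-02",
        "decision_date": "1988-02-24",
        "decision_direction": "conservative",
        "n_maj_votes": 5,
        "n_min_votes": 4,
    },
]


def parse_date(value: str) -> date:
    """Parse a 'YYYY-MM-DD' string into a datetime.date."""
    return datetime.strptime(value, "%Y-%m-%d").date()


def days_to_decision(record: dict) -> int:
    """Days elapsed between oral argument and the announced decision."""
    argued = parse_date(record["argument_date"])
    decided = parse_date(record["decision_date"])
    return (decided - argued).days


def filter_by_direction(recs: list, direction: str) -> list:
    """Keep only records whose decision_direction equals `direction`."""
    return [r for r in recs if r["decision_direction"] == direction]


for rec in filter_by_direction(records, "liberal"):
    print(rec["case_name"], days_to_decision(rec))
```

Keeping the dates as strings until they are needed matches how the fields are stored; parsing on demand avoids surprises if a record lacks an argument date.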
The text in this dataset was derived from FindLaw's searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court
The metadata was extracted without modification from the Supreme Court Database:
Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org.
Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/
This corpus' creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time
NOTE: The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model's duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)
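The note does not spell out the exact formula, but an F-score in which precision counts twice as much as recall is conventionally the F-beta score with beta = 0.5. The sketch below shows that calculation and how a threshold could be picked to maximize it. The candidate thresholds and their precision/recall values are invented for illustration, and the function names are hypothetical; this is not the dedupe package's own code.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


def best_threshold(candidates, beta: float = 0.5) -> float:
    """Return the threshold whose (precision, recall) pair maximizes F-beta.

    `candidates` is an iterable of (threshold, precision, recall) tuples.
    """
    return max(candidates, key=lambda c: f_beta(c[1], c[2], beta))[0]


# Invented numbers, for illustration only.
candidates = [
    (0.3, 0.80, 0.95),
    (0.5, 0.90, 0.85),
    (0.7, 0.97, 0.60),
]
print(best_threshold(candidates))
```

With beta = 0.5, a drop in precision costs more than an equal drop in recall, which fits the stated goal of avoiding texts matched to the wrong metadata.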
Supreme Court dataset (for Python 2)
A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 — the "modern" era.
Records include the following fields:
- text: full text of the Court's decision
- case_name: name of the court case, in all caps
- argument_date: date on which the case was argued before the Court, as a string with format 'YYYY-MM-DD'
- decision_date: date on which the Court's decision was announced, as a string with format 'YYYY-MM-DD'
- decision_direction: ideological direction of the majority decision; either 'conservative', 'liberal', or 'unspecifiable'
- maj_opinion_author: name of the majority opinion's author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes
- n_maj_votes: number of justices voting in the majority
- n_min_votes: number of justices voting in the minority
- issue: subject matter of the case's core disagreement (e.g. affirmative action) rather than its legal basis (e.g. the equal protection clause), as a string code whose mapping is given in SupremeCourt.issue_codes
- issue_area: higher-level categorization of the issue (e.g. Civil Rights), as an integer code whose mapping is given in SupremeCourt.issue_area_codes
- us_cite_id: citation identifier for each case according to the official United States Reports. Note: There are ~300 cases with duplicate ids, and it's not clear if that's "correct" or a data quality problem
The text in this dataset was derived from FindLaw's searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court
The metadata was extracted without modification from the Supreme Court Database:
Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org.
Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/
This corpus' creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time
NOTE: The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model's duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)
Language identification pipeline v1.1 (sklearn v0.23)
Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.23.
Model
Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages, identified by ISO 639-1 language codes.
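To make the feature extraction step concrete, here is a minimal standard-library sketch of hashing character 1- to 3-grams into a 4096-dimensional, L2-normalized vector. It is an illustration only: zlib.crc32 stands in for the hash function (scikit-learn's HashingVectorizer uses MurmurHash3), and the function names are hypothetical.

```python
import math
import zlib

N_FEATURES = 4096


def char_ngrams(text: str, min_n: int = 1, max_n: int = 3):
    """Yield every character n-gram of length min_n..max_n, shortest first."""
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            yield text[start:start + n]


def hashed_features(text: str, n_features: int = N_FEATURES) -> list:
    """Count n-grams into hashed buckets, then scale the vector to unit L2 norm."""
    vector = [0.0] * n_features
    for gram in char_ngrams(text):
        # crc32 is a stand-in hash chosen for this sketch, not the real one.
        bucket = zlib.crc32(gram.encode("utf-8")) % n_features
        vector[bucket] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    if norm > 0:
        vector = [value / norm for value in vector]
    return vector


features = hashed_features("language identification")
print(len(features), sum(1 for value in features if value > 0))
```

Hashing keeps the feature space at a fixed size no matter how many distinct n-grams appear across languages, at the cost of occasional collisions between unrelated n-grams.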
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
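A two-step pipeline of this shape could be assembled roughly as follows. The parameter values mirror the description above (character 1- to 3-grams, 4096 features, L2 normalization, one hidden layer of 512 ReLU units), but the remaining settings (alternate_sign, max_iter, random_state), the step names, and the toy training texts are assumptions made for this sketch rather than the released configuration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline


def make_lang_id_pipeline(random_state: int = 0) -> Pipeline:
    """Build a hashing-vectorizer + MLP pipeline resembling the description."""
    return Pipeline([
        (
            "vectorizer",
            HashingVectorizer(
                analyzer="char",        # character n-grams, not word tokens
                ngram_range=(1, 3),     # unigrams, bigrams, and trigrams
                n_features=4096,
                norm="l2",
                alternate_sign=False,   # assumption: keep counts non-negative
            ),
        ),
        (
            "classifier",
            MLPClassifier(
                hidden_layer_sizes=(512,),  # single hidden layer of 512 units
                activation="relu",
                max_iter=200,               # assumption for this toy example
                random_state=random_state,
            ),
        ),
    ])


# Toy training data, for illustration only; the real pipeline used ~750k texts.
texts = [
    "good morning, how are you?",
    "the weather is nice today",
    "where is the train station?",
    "buenos días, ¿cómo estás?",
    "hoy hace muy buen tiempo",
    "¿dónde está la estación de tren?",
]
labels = ["en", "en", "en", "es", "es", "es"]

pipeline = make_lang_id_pipeline()
pipeline.fit(texts, labels)
probabilities = pipeline.predict_proba(["good evening"])
print(dict(zip(pipeline.named_steps["classifier"].classes_, probabilities[0])))
```

For multi-class targets, MLPClassifier applies a softmax output layer on its own, so predict_proba returns one probability per language code.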
Dataset
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance
precision recall f1-score support
af 0.98 0.99 0.98 1382
am 1.00 0.99 1.00 1157
an 0.95 0.95 0.95 1016
ar 0.99 0.99 0.99 1907
as 1.00 0.99 1.00 1021
av 0.89 0.83 0.86 179
ay 0.94 0.95 0.94 206
az 0.99 0.98 0.99 1338
ba 0.83 0.75 0.79 1045
be 0.99 1.00 0.99 1623
bg 0.98 0.97 0.98 1767
bn 1.00 0.99 0.99 1178
bo 0.99 1.00 0.99 262
br 0.99 0.99 0.99 1471
bs 0.59 0.63 0.61 1495
ca 0.96 0.97 0.96 1837
ce 1.00 1.00 1.00 997
co 0.98 0.99 0.98 1016
cs 0.96 0.96 0.96 1758
cv 1.00 0.98 0.99 1135
cy 0.99 0.98 0.99 1383
da 0.90 0.94 0.92 1627
de 0.96 0.98 0.97 1890
dv 1.00 1.00 1.00 1180
el 1.00 1.00 1.00 1868
en 0.92 0.97 0.95 3512
eo 0.99 0.98 0.99 1593
es 0.97 0.95 0.96 2385
et 0.96 0.95 0.96 1468
eu 0.99 0.99 0.99 1733
fa 0.95 0.97 0.96 1720
fi 0.99 0.99 0.99 1833
fo 0.98 0.97 0.98 1031
fr 0.95 0.97 0.96 2312
fy 0.99 0.97 0.98 1041
ga 1.00 0.99 0.99 1182
gd 0.98 0.98 0.98 326
gl 0.94 0.96 0.95 1586
gn 1.00 0.99 0.99 1085
gu 0.99 1.00 0.99 1235
gv 1.00 1.00 1.00 1075
ha 0.98 1.00 0.99 217
he 0.99 1.00 1.00 1699
hi 0.95 0.99 0.97 1480
hr 0.73 0.62 0.67 1914
ht 0.99 0.95 0.97 1165
hu 1.00 0.99 0.99 1829
hy 1.00 1.00 1.00 1376
ia 0.97 0.97 0.97 1616
id 0.86 0.90 0.88 2024
ie 0.92 0.94 0.93 514
ig 1.00 0.90 0.95 251
io 0.97 0.98 0.97 1489
is 0.99 0.98 0.99 1729
it 0.96 0.97 0.96 1814
ja 1.00 0.99 0.99 1942
jv 0.97 0.95 0.96 234
ka 1.00 1.00 1.00 1241
kk 1.00 1.00 1.00 1385
kl 1.00 0.99 0.99 811
km 0.98 0.95 0.97 329
kn 1.00 1.00 1.00 1120
ko 1.00 0.99 0.99 1171
ku 0.99 1.00 1.00 1072
kv 0.99 0.98 0.99 1025
kw 0.99 0.98 0.99 264
ky 1.00 0.99 0.99 1011
la 0.96 0.98 0.97 1607
lb 0.99 0.98 0.98 1110
lg 1.00 0.99 1.00 1025
li 0.97 0.98 0.98 1002
ln 0.97 0.93 0.95 207
lo 0.97 0.96 0.96 316
lt 0.99 0.99 0.99 1686
lv 1.00 0.98 0.99 1130
mg 1.00 1.00 1.00 997
mi 1.00 1.00 1.00 230
mk 0.96 0.99 0.98 1602
ml 1.00 0.99 1.00 1193
mn 1.00 1.00 1.00 1072
mr 0.99 0.99 0.99 1602
ms 0.70 0.69 0.70 1041
mt 1.00 1.00 1.00 1057
my 0.77 0.70 0.73 753
nb 0.67 0.81 0.73 1638
ne 0.99 0.98 0.99 1212
nl 0.97 0.96 0.97 1832
nn 0.90 0.89 0.90 1149
no 0.61 0.42 0.49 1052
nv 1.00 1.00 1.00 211
oc 0.97 0.94 0.95 1665
om 0.99 0.96 0.97 212
or 1.00 0.99 1.00 1006
os 1.00 0.99 1.00 1021
pa 1.00 1.00 1.00 1154
pl 0.98 0.99 0.98 1778
ps 0.96 0.91 0.93 1254
pt 0.97 0.96 0.96 2285
qu 0.98 0.98 0.98 1088
rm 0.98 0.98 0.98 1087
rn 0.96 0.90 0.93 87
ro 0.98 0.98 0.98 1796
ru 0.96 0.97 0.96 1910
rw 0.93 0.92 0.93 196
sa 0.99 0.99 0.99 1063
sc 0.97 0.98 0.97 1019
sd 0.99 0.99 0.99 1216
se 0.99 0.97 0.98 194
si 1.00 0.99 0.99 1133
sk 0.95 0.96 0.96 1279
sl 0.96 0.96 0.96 1324
sn 1.00 0.96 0.98 217
so 0.99 0.99 0.99 1034
sq 0.99 0.99 0.99 1134
sr 0.81 0.89 0.85 2135
su 0.96 0.96 0.96 1070
sv 0.98 0.98 0.98 1932
sw 0.99 0.98 0.98 1079
ta 1.00 1.00 1.00 1170
te 1.00 0.99 1.00 1166
tg 0.99 1.00 0.99 1056
th 1.00 0.99 0.99 1331
tk 1.00 0.99 0.99 1659
tl 0.98 0.96 0.97 1803
tn 1.00 0.98 0.99 223
to 1.00 0.99 0.99 207
tr 0.97 0.99 0.98 1892
tt 0.85 0.90 0.88 1717
ug 1.00 1.00 1.00 1646
uk 0.99 0.99 0.99 1677
ur 0.99 0.99 0.99 1353
uz 1.00 0.99 0.99 1147
vi 0.99 0.99 0.99 1819
v...
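Each row of the report above lists a language code followed by precision, recall, f1-score, and support. The sketch below parses a handful of rows copied from that table and ranks them by f1-score; the helper names are hypothetical. Because the published values are rounded to two decimals, an f1 recomputed from them can differ from the reported one by about 0.01.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    lang: str
    precision: float
    recall: float
    f1: float
    support: int


def parse_row(line: str) -> ReportRow:
    """Parse one whitespace-separated row: lang precision recall f1 support."""
    lang, precision, recall, f1, support = line.split()
    return ReportRow(lang, float(precision), float(recall), float(f1), int(support))


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# A few rows copied from the sklearn v0.23 report above.
raw_rows = """\
bs 0.59 0.63 0.61 1495
en 0.92 0.97 0.95 3512
hr 0.73 0.62 0.67 1914
nb 0.67 0.81 0.73 1638
no 0.61 0.42 0.49 1052
"""

rows = [parse_row(line) for line in raw_rows.splitlines()]
weakest = sorted(rows, key=lambda row: row.f1)[:3]
for row in weakest:
    print(row.lang, row.f1, round(f1_from(row.precision, row.recall), 2))
```

Among the rows sampled here, the lowest scores belong to no, bs, and hr.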
Language identification pipeline v1.1 (sklearn v0.22)
Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.22.
Model
Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages, identified by ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
Dataset
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance
precision recall f1-score support
af 0.99 0.98 0.99 1363
am 1.00 1.00 1.00 1098
an 0.94 0.96 0.95 1005
ar 0.99 0.99 0.99 1902
as 1.00 0.99 1.00 959
av 0.98 0.81 0.88 186
ay 0.99 0.94 0.96 224
az 0.98 0.99 0.99 1348
ba 0.84 0.76 0.80 1037
be 1.00 1.00 1.00 1559
bg 0.97 0.99 0.98 1808
bn 1.00 0.99 0.99 1175
bo 0.99 0.99 0.99 281
br 0.98 0.99 0.99 1469
bs 0.67 0.48 0.56 1474
ca 0.97 0.96 0.97 1740
ce 1.00 0.99 1.00 1030
co 0.99 0.97 0.98 986
cs 0.96 0.96 0.96 1830
cv 1.00 0.99 0.99 1145
cy 0.99 0.99 0.99 1370
da 0.92 0.93 0.92 1731
de 0.97 0.97 0.97 1891
dv 1.00 1.00 1.00 1138
el 1.00 1.00 1.00 1882
en 0.91 0.98 0.94 3589
eo 0.98 0.99 0.98 1616
es 0.94 0.96 0.95 2343
et 0.98 0.95 0.97 1466
eu 0.98 0.99 0.99 1743
fa 0.96 0.97 0.96 1693
fi 0.99 0.98 0.99 1785
fo 0.99 0.95 0.97 1079
fr 0.95 0.98 0.96 2302
fy 0.98 0.98 0.98 1053
ga 1.00 0.99 0.99 1198
gd 0.99 0.97 0.98 276
gl 0.96 0.95 0.95 1539
gn 0.99 0.99 0.99 1110
gu 1.00 0.99 1.00 1219
gv 0.98 0.99 0.99 1031
ha 0.97 0.99 0.98 230
he 1.00 1.00 1.00 1566
hi 0.98 0.97 0.97 1435
hr 0.69 0.75 0.72 1968
ht 0.99 0.96 0.97 1163
hu 0.99 0.99 0.99 1794
hy 1.00 0.99 1.00 1322
ia 0.97 0.98 0.97 1602
id 0.82 0.93 0.87 2107
ie 0.96 0.91 0.94 513
ig 0.97 0.93 0.95 230
io 0.99 0.97 0.98 1522
is 0.98 0.99 0.98 1607
it 0.95 0.98 0.96 1937
ja 1.00 0.99 1.00 1930
jv 0.97 0.96 0.97 239
ka 1.00 1.00 1.00 1243
kk 1.00 1.00 1.00 1348
kl 1.00 1.00 1.00 809
km 0.99 0.93 0.96 347
kn 1.00 1.00 1.00 1188
ko 1.00 1.00 1.00 1180
ku 1.00 1.00 1.00 1049
kv 0.99 0.98 0.99 987
kw 0.99 0.98 0.99 249
ky 0.99 0.99 0.99 1074
la 0.96 0.98 0.97 1605
lb 0.99 0.97 0.98 1104
lg 1.00 0.99 1.00 1019
li 0.98 0.98 0.98 1081
ln 0.99 0.92 0.95 220
lo 0.99 0.94 0.96 331
lt 0.99 0.99 0.99 1645
lv 0.99 0.98 0.99 1183
mg 1.00 1.00 1.00 1049
mi 1.00 1.00 1.00 273
mk 0.98 0.98 0.98 1643
ml 1.00 1.00 1.00 1225
mn 0.99 1.00 1.00 1141
mr 0.99 0.99 0.99 1682
ms 0.67 0.61 0.64 1030
mt 1.00 0.99 1.00 1022
my 0.80 0.63 0.71 851
nb 0.66 0.83 0.74 1643
ne 0.99 0.99 0.99 1180
nl 0.97 0.97 0.97 1866
nn 0.91 0.88 0.90 1114
no 0.62 0.39 0.48 1019
nv 1.00 1.00 1.00 212
oc 0.96 0.95 0.96 1621
om 0.99 0.97 0.98 219
or 1.00 0.98 0.99 1062
os 1.00 1.00 1.00 1036
pa 1.00 1.00 1.00 1085
pl 0.99 0.99 0.99 1804
ps 0.95 0.91 0.93 1151
pt 0.96 0.97 0.97 2335
qu 0.99 0.97 0.98 1098
rm 0.99 0.98 0.98 1105
rn 0.94 0.83 0.88 96
ro 1.00 0.98 0.99 1814
ru 0.96 0.98 0.97 1870
rw 0.93 0.96 0.94 205
sa 0.99 1.00 0.99 1019
sc 0.98 0.98 0.98 1041
sd 0.99 0.99 0.99 1274
se 0.98 0.98 0.98 187
si 1.00 1.00 1.00 1189
sk 0.96 0.95 0.95 1281
sl 0.95 0.96 0.96 1306
sn 0.98 0.95 0.96 208
so 1.00 0.98 0.99 1036
sq 0.99 0.99 0.99 1148
sr 0.81 0.90 0.85 2153
su 0.99 0.95 0.97 1000
sv 0.98 0.98 0.98 1817
sw 0.99 0.98 0.98 1042
ta 1.00 1.00 1.00 1196
te 1.00 0.98 0.99 1124
tg 1.00 0.99 0.99 1012
th 0.99 0.99 0.99 1273
tk 0.99 1.00 1.00 1595
tl 0.96 0.98 0.97 1843
tn 1.00 1.00 1.00 207
to 1.00 0.98 0.99 212
tr 0.99 0.97 0.98 1881
tt 0.86 0.91 0.88 1690
ug 1.00 1.00 1.00 1773
uk 0.99 0.99 0.99 1771
ur 0.99 0.99 0.99 1307
uz 0.98 0.99 0.99 1063
vi 1.00 0.99 0.99 1849
vo ...
Language identification pipeline v1.1 (sklearn v0.21)
Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.21.
Model
Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages, identified by ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
Dataset
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance
precision recall f1-score support
af 0.98 0.98 0.98 1335
am 1.00 0.99 1.00 1098
an 0.96 0.96 0.96 1008
ar 0.99 0.99 0.99 1889
as 1.00 0.99 0.99 1034
av 0.92 0.90 0.91 205
ay 0.99 0.94 0.97 200
az 0.99 0.99 0.99 1311
ba 0.89 0.71 0.79 1064
be 0.99 1.00 0.99 1606
bg 0.98 0.97 0.98 1856
bn 1.00 0.99 0.99 1183
bo 0.99 1.00 0.99 292
br 1.00 0.99 0.99 1441
bs 0.65 0.52 0.58 1570
ca 0.96 0.96 0.96 1776
ce 1.00 1.00 1.00 1023
co 0.99 0.97 0.98 1074
cs 0.98 0.94 0.96 1752
cv 1.00 0.99 0.99 1101
cy 1.00 0.99 0.99 1363
da 0.92 0.93 0.92 1744
de 0.96 0.97 0.97 1893
dv 1.00 1.00 1.00 1102
el 1.00 1.00 1.00 1857
en 0.92 0.97 0.95 3545
eo 0.99 0.99 0.99 1635
es 0.94 0.97 0.95 2307
et 0.95 0.96 0.95 1417
eu 0.99 0.99 0.99 1737
fa 0.94 0.99 0.96 1651
fi 0.99 0.99 0.99 1736
fo 0.98 0.98 0.98 1110
fr 0.95 0.98 0.96 2351
fy 0.98 0.98 0.98 997
ga 1.00 0.99 0.99 1183
gd 0.96 0.98 0.97 305
gl 0.95 0.94 0.95 1435
gn 1.00 0.99 0.99 1072
gu 1.00 0.99 0.99 1247
gv 0.99 0.99 0.99 1050
ha 0.98 0.99 0.99 224
he 0.99 1.00 1.00 1639
hi 0.98 0.96 0.97 1426
hr 0.65 0.76 0.70 1867
ht 0.98 0.97 0.98 1226
hu 1.00 0.99 0.99 1768
hy 1.00 1.00 1.00 1333
ia 0.96 0.98 0.97 1710
id 0.84 0.91 0.88 2073
ie 0.95 0.94 0.95 530
ig 0.96 0.89 0.93 209
io 0.98 0.98 0.98 1493
is 0.99 0.99 0.99 1812
it 0.95 0.97 0.96 1849
ja 1.00 0.99 1.00 1817
jv 0.98 0.93 0.96 275
ka 1.00 1.00 1.00 1216
kk 1.00 1.00 1.00 1403
kl 1.00 1.00 1.00 851
km 1.00 0.96 0.98 360
kn 1.00 1.00 1.00 1161
ko 1.00 0.99 1.00 1148
ku 1.00 0.99 0.99 1119
kv 0.99 0.98 0.99 989
kw 1.00 0.99 0.99 287
ky 0.99 1.00 0.99 1050
la 0.96 0.98 0.97 1632
lb 0.98 0.98 0.98 1090
lg 1.00 1.00 1.00 1024
li 0.99 0.97 0.98 1043
ln 0.94 0.94 0.94 237
lo 0.99 0.94 0.96 317
lt 0.99 0.99 0.99 1674
lv 0.99 0.98 0.99 1171
mg 1.00 1.00 1.00 1011
mi 1.00 1.00 1.00 250
mk 0.97 0.98 0.98 1666
ml 1.00 1.00 1.00 1211
mn 1.00 0.99 1.00 1139
mr 0.99 0.99 0.99 1698
ms 0.65 0.64 0.64 1044
mt 1.00 0.99 0.99 1017
my 0.78 0.64 0.70 795
nb 0.68 0.79 0.73 1601
ne 0.98 0.99 0.99 1257
nl 0.96 0.98 0.97 1860
nn 0.90 0.92 0.91 1174
no 0.58 0.42 0.49 965
nv 1.00 1.00 1.00 215
oc 0.98 0.94 0.96 1641
om 0.97 0.96 0.97 221
or 1.00 0.99 0.99 1097
os 1.00 0.99 1.00 1052
pa 1.00 1.00 1.00 1110
pl 0.99 0.99 0.99 1839
ps 0.97 0.90 0.93 1163
pt 0.98 0.95 0.97 2392
qu 0.98 0.98 0.98 1049
rm 0.99 0.98 0.99 1091
rn 0.97 0.85 0.90 71
ro 0.98 0.99 0.98 1764
ru 0.97 0.97 0.97 1860
rw 0.90 0.93 0.92 213
sa 1.00 0.99 0.99 1083
sc 0.98 0.97 0.98 1053
sd 1.00 0.98 0.99 1248
se 0.98 0.97 0.98 191
si 1.00 0.99 1.00 1149
sk 0.94 0.97 0.96 1239
sl 0.94 0.96 0.95 1233
sn 1.00 0.94 0.97 226
so 1.00 0.98 0.99 994
sq 0.99 0.99 0.99 1131
sr 0.83 0.87 0.85 2186
su 0.96 0.96 0.96 1020
sv 0.98 0.98 0.98 1865
sw 0.98 0.98 0.98 1048
ta 1.00 1.00 1.00 1236
te 1.00 0.99 0.99 1094
tg 0.99 0.99 0.99 1054
th 1.00 0.99 0.99 1337
tk 1.00 1.00 1.00 1701
tl 0.97 0.97 0.97 1792
tn 1.00 0.99 0.99 175
to 1.00 0.99 1.00 204
tr 0.98 0.98 0.98 1909
tt 0.83 0.94 0.88 1641
ug 1.00 1.00 1.00 1648
uk 0.99 0.98 0.99 1735
ur 0.99 0.99 0.99 1339
uz 1.00 0.98 0.99 1081
vi 1.00 0.98 0.99 1873
v...
Language identification pipeline v1.1 (sklearn v0.20)
Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.20.
Model
Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages, identified by ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
Dataset
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance
precision recall f1-score support
af 0.99 0.98 0.98 1372
am 1.00 1.00 1.00 1063
an 0.95 0.97 0.96 1017
ar 1.00 0.99 0.99 1944
as 1.00 1.00 1.00 1029
av 0.95 0.86 0.90 190
ay 0.98 0.90 0.94 220
az 0.99 0.99 0.99 1351
ba 0.89 0.72 0.80 1024
be 0.99 1.00 0.99 1614
bg 0.98 0.98 0.98 1804
bn 1.00 0.99 1.00 1204
bo 0.99 0.99 0.99 278
br 0.99 0.99 0.99 1476
bs 0.60 0.66 0.62 1526
ca 0.97 0.96 0.96 1805
ce 1.00 1.00 1.00 1032
co 0.98 0.99 0.99 960
cs 0.97 0.94 0.96 1860
cv 1.00 0.99 0.99 1129
cy 0.98 0.99 0.99 1304
da 0.93 0.92 0.92 1793
de 0.96 0.97 0.97 1936
dv 1.00 1.00 1.00 1139
el 1.00 1.00 1.00 1909
en 0.93 0.98 0.95 3643
eo 0.97 0.99 0.98 1573
es 0.95 0.95 0.95 2310
et 0.96 0.94 0.95 1422
eu 0.99 0.99 0.99 1739
fa 0.95 0.97 0.96 1685
fi 0.99 0.99 0.99 1788
fo 0.98 0.98 0.98 1050
fr 0.96 0.97 0.96 2323
fy 0.98 0.97 0.98 1006
ga 1.00 0.99 0.99 1198
gd 1.00 0.98 0.99 285
gl 0.94 0.96 0.95 1469
gn 1.00 0.99 0.99 1059
gu 1.00 0.99 1.00 1237
gv 0.99 1.00 0.99 1014
ha 0.99 0.98 0.98 201
he 0.99 1.00 1.00 1678
hi 0.98 0.97 0.97 1409
hr 0.71 0.64 0.67 1914
ht 0.94 0.97 0.96 1196
hu 1.00 0.99 0.99 1712
hy 1.00 1.00 1.00 1363
ia 0.95 0.98 0.97 1603
id 0.88 0.85 0.87 2143
ie 0.93 0.93 0.93 504
ig 0.94 0.93 0.94 193
io 0.98 0.98 0.98 1474
is 0.99 0.99 0.99 1750
it 0.96 0.97 0.97 1892
ja 1.00 1.00 1.00 1887
jv 0.98 0.94 0.96 234
ka 1.00 1.00 1.00 1273
kk 1.00 1.00 1.00 1381
kl 1.00 1.00 1.00 848
km 1.00 0.94 0.97 343
kn 1.00 1.00 1.00 1136
ko 1.00 0.99 0.99 1184
ku 1.00 0.99 0.99 1083
kv 0.98 0.99 0.98 1057
kw 1.00 0.98 0.99 251
ky 1.00 0.99 0.99 1088
la 0.98 0.97 0.97 1647
lb 0.99 0.98 0.98 1054
lg 1.00 0.99 0.99 1059
li 0.99 0.98 0.98 999
ln 0.87 0.96 0.91 219
lo 0.98 0.93 0.96 302
lt 0.99 0.99 0.99 1675
lv 0.99 0.98 0.99 1151
mg 1.00 1.00 1.00 1041
mi 1.00 0.98 0.99 263
mk 0.97 0.99 0.98 1645
ml 1.00 1.00 1.00 1159
mn 1.00 1.00 1.00 1041
mr 0.99 0.99 0.99 1665
ms 0.59 0.75 0.66 1059
mt 1.00 0.99 0.99 1009
my 0.80 0.60 0.69 805
nb 0.75 0.68 0.71 1682
ne 0.99 0.99 0.99 1225
nl 0.96 0.97 0.97 1791
nn 0.89 0.92 0.91 1143
no 0.53 0.61 0.57 980
nv 1.00 1.00 1.00 232
oc 0.96 0.96 0.96 1682
om 1.00 0.95 0.97 217
or 1.00 0.99 0.99 1016
os 1.00 1.00 1.00 1073
pa 1.00 1.00 1.00 1154
pl 0.99 0.99 0.99 1801
ps 0.96 0.91 0.93 1208
pt 0.97 0.96 0.97 2453
qu 0.97 0.98 0.98 1075
rm 0.99 0.98 0.99 1113
rn 0.95 0.91 0.93 82
ro 0.99 0.98 0.99 1761
ru 0.96 0.97 0.96 1838
rw 0.96 0.93 0.94 219
sa 1.00 0.99 0.99 1016
sc 0.99 0.97 0.98 1053
sd 0.98 0.99 0.99 1230
se 0.98 0.97 0.98 198
si 0.99 1.00 1.00 1113
sk 0.93 0.96 0.95 1279
sl 0.95 0.97 0.96 1356
sn 0.99 0.96 0.97 219
so 1.00 0.98 0.99 1048
sq 1.00 0.98 0.99 1125
sr 0.85 0.86 0.86 2196
su 0.98 0.96 0.97 1031
sv 0.98 0.97 0.97 1761
sw 0.99 0.98 0.98 991
ta 1.00 0.99 1.00 1185
te 1.00 0.99 1.00 1146
tg 0.99 0.99 0.99 987
th 0.99 0.99 0.99 1315
tk 1.00 0.99 1.00 1676
tl 0.94 0.98 0.96 1892
tn 1.00 0.98 0.99 194
to 1.00 1.00 1.00 210
tr 0.98 0.98 0.98 1849
tt 0.84 0.94 0.89 1643
ug 1.00 1.00 1.00 1666
uk 0.99 0.98 0.99 1764
ur 0.99 0.99 0.99 1329
uz 0.99 0.99 0.99 1102
vi 1.00 0.99 0.99 1804
v...
Language identification pipeline v1.1 (sklearn v0.19)
Pipeline for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn==0.19.
Model
Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages, identified by ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
Dataset
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance
precision recall f1-score support
af 0.99 0.98 0.98 1294
am 1.00 0.99 1.00 1115
an 0.97 0.95 0.96 1052
ar 0.99 0.99 0.99 1927
as 1.00 0.99 0.99 989
av 0.96 0.79 0.87 177
ay 0.99 0.92 0.96 212
az 1.00 0.98 0.99 1318
ba 0.88 0.74 0.81 1034
be 1.00 0.99 1.00 1646
bg 0.97 0.98 0.98 1801
bn 0.99 0.99 0.99 1177
bo 1.00 0.99 0.99 317
br 0.99 0.99 0.99 1503
bs 0.63 0.51 0.56 1497
ca 0.97 0.95 0.96 1819
ce 1.00 0.99 1.00 1053
co 0.98 0.97 0.98 1026
cs 0.96 0.96 0.96 1766
cv 1.00 0.99 0.99 1116
cy 0.99 0.99 0.99 1338
da 0.89 0.95 0.92 1705
de 0.97 0.96 0.97 1818
dv 1.00 1.00 1.00 1112
el 1.00 1.00 1.00 1904
en 0.89 0.99 0.94 3607
eo 0.98 0.98 0.98 1628
es 0.95 0.97 0.96 2331
et 0.92 0.96 0.94 1423
eu 0.99 0.99 0.99 1725
fa 0.93 0.98 0.96 1590
fi 0.99 0.98 0.99 1814
fo 0.98 0.98 0.98 1022
fr 0.94 0.97 0.96 2322
fy 1.00 0.98 0.99 1073
ga 1.00 0.99 0.99 1185
gd 0.95 0.97 0.96 308
gl 0.96 0.95 0.95 1503
gn 0.99 0.99 0.99 1096
gu 1.00 0.99 1.00 1230
gv 1.00 0.99 0.99 992
ha 0.97 0.97 0.97 236
he 1.00 1.00 1.00 1623
hi 0.94 0.98 0.96 1396
hr 0.65 0.78 0.71 1968
ht 0.97 0.97 0.97 1190
hu 1.00 0.99 0.99 1814
hy 1.00 1.00 1.00 1315
ia 0.98 0.96 0.97 1559
id 0.89 0.84 0.86 2148
ie 0.93 0.93 0.93 538
ig 0.99 0.86 0.92 198
io 0.98 0.98 0.98 1476
is 0.99 0.99 0.99 1730
it 0.97 0.97 0.97 1866
ja 1.00 0.99 0.99 1892
jv 0.96 0.96 0.96 250
ka 1.00 0.99 1.00 1275
kk 1.00 1.00 1.00 1406
kl 1.00 1.00 1.00 861
km 1.00 0.93 0.96 345
kn 1.00 1.00 1.00 1160
ko 1.00 0.99 0.99 1182
ku 1.00 0.99 0.99 1060
kv 0.99 0.97 0.98 951
kw 1.00 0.99 0.99 269
ky 1.00 0.99 0.99 1047
la 0.96 0.98 0.97 1603
lb 1.00 0.97 0.98 1052
lg 1.00 0.99 0.99 1032
li 0.97 0.98 0.98 1005
ln 0.98 0.94 0.96 232
lo 0.99 0.93 0.96 295
lt 1.00 0.99 0.99 1643
lv 0.99 0.98 0.99 1157
mg 1.00 1.00 1.00 1039
mi 1.00 0.99 1.00 244
mk 0.98 0.98 0.98 1625
ml 1.00 1.00 1.00 1186
mn 1.00 0.99 1.00 1140
mr 0.99 0.99 0.99 1670
ms 0.58 0.75 0.65 991
mt 1.00 0.99 0.99 1030
my 0.82 0.62 0.71 796
nb 0.69 0.79 0.74 1605
ne 1.00 0.98 0.99 1215
nl 0.97 0.97 0.97 1827
nn 0.89 0.94 0.92 1157
no 0.66 0.39 0.49 1055
nv 1.00 1.00 1.00 229
oc 0.96 0.95 0.95 1639
om 1.00 0.97 0.98 214
or 1.00 0.99 1.00 1006
os 1.00 1.00 1.00 1024
pa 1.00 1.00 1.00 1137
pl 0.99 0.99 0.99 1817
ps 0.97 0.90 0.94 1302
pt 0.95 0.98 0.96 2351
qu 1.00 0.97 0.98 1078
rm 0.98 0.99 0.99 1102
rn 0.94 0.90 0.92 90
ro 0.99 0.98 0.98 1777
ru 0.96 0.97 0.97 1929
rw 0.95 0.94 0.95 211
sa 0.99 0.99 0.99 1049
sc 0.96 0.99 0.97 991
sd 0.99 0.98 0.99 1271
se 0.98 0.98 0.98 230
si 1.00 0.99 1.00 1120
sk 0.96 0.96 0.96 1263
sl 0.96 0.96 0.96 1314
sn 0.99 0.89 0.94 223
so 1.00 0.99 0.99 1047
sq 1.00 0.99 0.99 1159
sr 0.87 0.82 0.85 2230
su 0.97 0.96 0.96 985
sv 0.98 0.98 0.98 1860
sw 0.99 0.97 0.98 1064
ta 1.00 1.00 1.00 1258
te 1.00 0.99 0.99 1157
tg 1.00 0.99 0.99 1069
th 0.99 0.99 0.99 1382
tk 1.00 0.99 1.00 1630
tl 0.98 0.96 0.97 1875
tn 1.00 0.98 0.99 187
to 1.00 0.99 0.99 210
tr 0.97 0.99 0.98 1814
tt 0.85 0.93 0.89 1609
ug 1.00 1.00 1.00 1731
uk 0.99 0.99 0.99 1766
ur 0.99 0.99 0.99 1313
uz 0.99 0.99 0.99 1073
vi 0.99 0.99 0.99 1790
vo 1.00 1.00 1.00 1160
wa 0.99 0.99 0.99 1027
wo 0.95 0.96 ...
Language Identification model (PY3)
Functionality for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn.
Model
Character unigrams, bigrams, and trigrams are extracted from input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~130 different languages as ISO 639-1 language codes.
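The feature extraction described above can be sketched in plain Python. This is only an illustration of the idea, not the released model's code: the helper names `char_ngrams` and `hashed_features` are hypothetical, and CRC32 is used as a stand-in hash, so bucket indices will not match those produced by scikit-learn's `HashingVectorizer`.

```python
import math
import zlib


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def hashed_features(text, n_features=4096):
    """Hash n-gram counts into a fixed-size vector, then L2-normalize it."""
    counts = [0.0] * n_features
    for ngram in char_ngrams(text):
        # Assumption: CRC32 stands in for the hash function of the real vectorizer.
        index = zlib.crc32(ngram.encode("utf-8")) % n_features
        counts[index] += 1.0
    norm = math.sqrt(sum(value * value for value in counts))
    if norm == 0.0:
        return counts  # empty text: leave the all-zero vector untouched
    return [value / norm for value in counts]


features = hashed_features("hello world")
```

Because the vector length is fixed by `n_features`, texts of any length map to the same 4096 dimensions, at the cost of occasional collisions between different n-grams.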
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
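A minimal sketch of such a two-step pipeline, assuming default settings wherever the description above is silent. The step names, `max_iter`, `random_state`, `alternate_sign=False`, and the three-language toy training set are all assumptions made for illustration; the released model was trained on far more data and ~130 languages.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline


def build_lang_id_pipeline(random_state=0):
    """Build an untrained vectorizer + MLP pipeline (hypothetical helper)."""
    vectorizer = HashingVectorizer(
        analyzer="char",       # character n-grams rather than word tokens
        ngram_range=(1, 3),    # unigrams, bigrams, and trigrams
        n_features=4096,       # size of the hashed feature vector
        norm="l2",             # L2-normalize each row of counts
        alternate_sign=False,  # assumption: keep plain non-negative counts
    )
    classifier = MLPClassifier(
        hidden_layer_sizes=(512,),  # one hidden layer of 512 units
        activation="relu",
        max_iter=200,               # assumption: not stated in the description
        random_state=random_state,
    )
    return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])


# Toy data for illustration only.
texts = [
    "The weather is nice today.", "Where is the train station?",
    "Das Wetter ist heute schön.", "Wo ist der Bahnhof?",
    "Il fait beau aujourd'hui.", "Où est la gare ?",
]
labels = ["en", "en", "de", "de", "fr", "fr"]

pipeline = build_lang_id_pipeline()
pipeline.fit(texts, labels)
probs = pipeline.predict_proba(["Where is the station?"])
```

Each row of `probs` holds one probability per language, ordered as in `pipeline.named_steps["classifier"].classes_`.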
Dataset
The pipeline was trained on a randomized, stratified subset of ~1.5M texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of ~5M sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages in the same format and pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes. Only the most recently updated version was used, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of ~1.5k tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
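How the subset was drawn is not specified beyond "randomized, stratified". One plausible reading, sketched below, is to sample the same fraction of texts from every language so that the subset keeps the language proportions of the pooled sources. The function name `stratified_subset`, the `fraction` parameter, and the toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_subset(records, fraction, seed=0):
    """Sample the same fraction of (text, lang) records from each language."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in records:
        by_lang[lang].append((text, lang))
    subset = []
    for lang in sorted(by_lang):
        group = by_lang[lang]
        # Keep at least one text per language so rare languages survive.
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    rng.shuffle(subset)  # randomize the order across languages
    return subset


# Toy records for illustration only.
records = [(f"en text {i}", "en") for i in range(100)]
records += [(f"de text {i}", "de") for i in range(20)]

subset = stratified_subset(records, fraction=0.5)
lang_counts = Counter(lang for _, lang in subset)
```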
Language Identification model (PY2)
Functionality for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn.
Model
Character unigrams, bigrams, and trigrams are extracted from input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~130 different languages as ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
Dataset
The pipeline was trained on a randomized, stratified subset of ~1.5M texts drawn from several sources:
- Tatoeba: A crowd-sourced collection of ~5M sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- Leipzig Corpora: A collection of corpora for many languages in the same format and pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes. Only the most recently updated version was used, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- Twitter: A collection of ~1.5k tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
Language identification model v3.0
Model for identifying the most probable language(s) of a text, inspired by -- and using the same methodology as -- Facebook's fastText.
Model
Text is tokenized into a bag of word 1- and 2-grams and character 1- through 5-grams. The collection of n-grams is embedded into a 128-dimensional space, then averaged. The resulting features are fed into a linear classifier with a hierarchical softmax output to compute (approximate) language probabilities for 140 ISO 639-1 languages.
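The forward pass described above can be sketched without any dependencies. This is an untrained toy with random weights, not fastText itself, and it makes several assumptions: character n-grams are taken over the whole string, features are hashed into buckets with CRC32, and a full softmax replaces the hierarchical softmax (which approximates it for speed). The class and function names are hypothetical.

```python
import math
import random
import zlib


def ngram_features(text):
    """Word 1- and 2-grams plus character 1- through 5-grams of the text."""
    words = text.split()
    feats = list(words)
    feats += [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    for n in range(1, 6):
        feats += [text[i:i + n] for i in range(len(text) - n + 1)]
    return feats


class AveragedEmbeddingClassifier:
    """Random-weight sketch: embed n-grams, average, apply a linear layer."""

    def __init__(self, labels, dim=128, n_buckets=2048, seed=0):
        rng = random.Random(seed)
        self.labels = list(labels)
        self.dim = dim
        self.n_buckets = n_buckets
        self.embeddings = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_buckets)
        ]
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in self.labels
        ]

    def embed(self, text):
        """Average the embeddings of all n-gram features of the text."""
        feats = ngram_features(text)
        total = [0.0] * self.dim
        if not feats:
            return total
        for feat in feats:
            row = self.embeddings[zlib.crc32(feat.encode("utf-8")) % self.n_buckets]
            for j, value in enumerate(row):
                total[j] += value
        return [value / len(feats) for value in total]

    def predict_proba(self, text):
        """Linear scores followed by a full, numerically stable softmax."""
        hidden = self.embed(text)
        scores = [sum(w * h for w, h in zip(row, hidden)) for row in self.weights]
        top = max(scores)
        exps = [math.exp(score - top) for score in scores]
        total = sum(exps)
        return {label: e / total for label, e in zip(self.labels, exps)}


model = AveragedEmbeddingClassifier(["en", "de", "fr"])
probs = model.predict_proba("ab cd")
```

Averaging makes the text representation independent of text length, which is why the same 128-dimensional classifier input works for a short sentence and a long paragraph alike.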
Dataset
The model was trained on a randomized, stratified subset of ~2.9M texts drawn from several sources:
- WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is "encyclopedic". Source: https://zenodo.org/record/841984
- Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
- UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
- DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
- TED 2020: A crawl of nearly 4000 TED and TED-X transcripts from 2020, translated by a global community of volunteers into more than 100 languages. Style is conversational, covering a broad range of subjects. Source: https://opus.nlpl.eu/TED2020.php
- SETimes: A corpus of news articles in Balkan languages, originally extracted from http://www.setimes.com and compiled by Nikola Ljubešić. Source: https://opus.nlpl.eu/SETIMES.php
Performance
The trained model achieved F1 = 0.97 when averaged over all languages.
A few languages perform noticeably worse: the Norwegian codes ("nb", "nn", and "no"), as well as Bosnian ("bs"), Serbian ("sr"), and Croatian ("hr"), which are extremely similar to each other. In the table below, "nn" scores F1 = 0.52 and "bs" scores F1 = 0.65.
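For reference, each per-language F1 score is the harmonic mean of that language's precision and recall. The sketch below recomputes it for two rows of the table that follows ("nn" and "fr"). The `macro_f1` helper assumes that "averaged over all languages" means an unweighted mean, which the release notes do not state explicitly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(rows):
    """Unweighted mean of per-language F1 over (precision, recall) pairs."""
    scores = [f1_score(precision, recall) for precision, recall in rows]
    return sum(scores) / len(scores)


# (precision, recall) as reported in the table for two languages.
nn_f1 = f1_score(0.55, 0.49)
fr_f1 = f1_score(0.82, 0.99)
```

Because the reported precision and recall are already rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit for some rows.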
language precision recall f1-score support
af 0.96 0.97 0.96 948
am 1.00 1.00 1.00 220
an 0.93 0.80 0.86 101
ar 1.00 0.80 0.89 7953
as 0.96 0.96 0.96 159
av 0.89 0.77 0.83 101
ay 0.93 0.92 0.93 106
az 0.99 0.97 0.98 1644
ba 0.94 0.98 0.96 116
be 1.00 0.99 0.99 4600
bg 0.99 0.99 0.99 7475
bn 1.00 0.99 1.00 1516
bo 1.00 0.99 1.00 200
br 0.98 0.99 0.99 483
bs 0.63 0.66 0.65 4457
ca 0.98 0.99 0.98 6863
ce 0.99 1.00 1.00 101
co 0.95 0.93 0.94 106
cs 0.99 0.98 0.99 7947
cu 1.00 1.00 1.00 404
cv 0.99 0.95 0.97 188
cy 0.99 0.98 0.99 502
da 0.96 0.95 0.95 5178
de 0.99 0.99 0.99 7975
dv 1.00 1.00 1.00 107
el 1.00 1.00 1.00 6982
en 0.97 0.97 0.97 9944
eo 0.99 0.99 0.99 2920
es 0.98 0.98 0.98 9078
et 0.99 0.99 0.99 6338
eu 0.99 0.99 0.99 2655
fa 1.00 1.00 1.00 7395
fi 0.99 0.99 0.99 7950
fo 0.94 0.96 0.95 432
fr 0.82 0.99 0.90 9080
fy 0.94 0.87 0.91 132
ga 0.99 0.99 0.99 1204
gd 0.98 0.99 0.99 744
gl 0.96 0.96 0.96 4239
gn 0.99 0.97 0.98 278
gu 1.00 1.00 1.00 1601
gv 0.95 0.99 0.97 214
ha 0.99 0.99 0.99 1813
he 1.00 1.00 1.00 5895
hi 1.00 1.00 1.00 5314
hr 0.82 0.79 0.80 7748
ht 0.99 0.96 0.97 160
hu 1.00 0.99 1.00 4846
hy 1.00 1.00 1.00 3804
ia 0.95 0.96 0.96 1795
id 0.95 0.96 0.95 6735
ie 0.91 0.91 0.91 439
ig 0.96 0.87 0.91 126
io 0.95 0.92 0.94 639
is 0.99 0.99 0.99 4795
it 0.99 0.99 0.99 7964
ja 1.00 1.00 1.00 7892
jv 0.96 0.90 0.93 177
ka 1.00 1.00 1.00 3115
kk 1.00 0.99 0.99 1543
km 0.99 0.97 0.98 229
kn 1.00 1.00 1.00 329
ko 1.00 1.00 1.00 4951
ku 1.00 1.00 1.00 2809
kv 0.96 0.95 0.95 100
kw 0.99 0.95 0.97 210
ky 0.97 0.95 0.96 196
la 0.99 0.99 0.99 5276
lb 0.92 0.93 0.93 157
lg 0.95 0.98 0.97 105
li 0.99 0.96 0.97 100
ln 0.96 0.97 0.96 553
lo 0.97 0.94 0.95 157
lt 1.00 1.00 1.00 5119
lv 0.99 1.00 1.00 5119
mg 0.97 0.97 0.97 148
mi 0.98 0.94 0.96 135
mk 0.99 0.99 0.99 6485
ml 1.00 1.00 1.00 731
mn 1.00 1.00 1.00 2993
mr 1.00 1.00 1.00 3276
ms 0.79 0.73 0.76 1349
mt 0.97 0.98 0.98 437
my 0.93 0.96 0.95 3937
nb 0.85 0.89 0.87 3910
ne 0.99 0.98 0.99 497
nl 0.99 0.99 0.99 6730
nn 0.55 0.49 0.52 343
no 0.87 0.87 0.87 3466
nv 1.00 0.98 0.99 113
oc 0.87 0.88 0.87 520
om 0.94 0.97 0.96 106
or 1.00 0.96 0.98 103
os 0.98 1.00 0.99 454
pa 1.00 1.00 1.00 178
pl 1.00 1.00 1.00 7960
ps 0.99 0.97 0.98 213
pt 0.98 0.99 0.98 9082
qu 0.95 0.93 0.94 137
rm 0.94 0.94 0.94 144
rn 0.96 0.90 0.93 223
ro 1.00 0.99 0.99 9976
ru 0.99 0.99 0.99 7962
rw 0.87 0.87 0.87 108
sa 0.99 0.99 0.99 356
sc 0.85 0.93 0.89 107
sd 0.99 0.98 0.98 100
se 0.93 0.96 0.94 112
si 0.99 0.97 0.98 212
sk 0.98 0.97 0.97 4292
sl 0.98 0.98 0.98 4999
sn 0.93 0.89 0.91 110
so 0.98 0.96 0.97 313
sq 0.99 0.99 0.99 4962
sr 0.85 0.86 0.86 8340
su 0.95 0.97 0.96 108
sv 0.99 0.99 0.99 6060
sw 0.94 0.95 0.95 106
ta 1.00 1.00 1.00 1321
te 1.00 1.00 1.00 660
tg 0.99 0.98 0.98 165
th 1.00 1.00 1.00 3092
tk 0.98 0.97 0.98 638
tl 0.99 0.99 0.99 1933
tn 0.95 0.98 0.96 109
to 0.99 1.00 1.00 107
tr 0.99 1.00 0.99 9965
tt 0.99 0.99 0.99 1236
ug 1.00 1.00 1.00 1094
uk 0.99 0.99 0.99 5420
ur 1.00 1.00 1.00 2540
uz 0.98 0.98 0.98 856
vi 1.00 1.00 1.00 4771
vo 0.98 0.96 0.97 298
wa 0.98 0.93 0.95 108
wo 0.97 0.97 0.97 349
xh 0.94 0.93 0.94 120
yi 1.00 1.00 1.00 ...