Release Language Identification model (PY2) · bdewilde/textacy-data

Functionality for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~130 different languages, identified by ISO 639-1 language codes.
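
To make the feature extraction step concrete, here is a minimal standard-library sketch of counting character 1- to 3-grams, hashing them into 4096 buckets, and L2-normalizing the result. The function names are hypothetical, and `zlib.crc32` is only a stand-in hash chosen to keep the example dependency-free; scikit-learn's `HashingVectorizer` uses a signed MurmurHash3, so the bucket indices and signs here will not match the released model.

```python
import math
import zlib
from collections import Counter

N_FEATURES = 4096  # dimensionality stated in the model description


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max, shortest first."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hashed_features(text, n_features=N_FEATURES):
    """Hash n-gram counts into a fixed-size vector, then L2-normalize it."""
    counts = Counter(char_ngrams(text))
    vec = [0.0] * n_features
    for gram, count in counts.items():
        # Assumption: crc32 stands in for the hash scikit-learn actually uses.
        idx = zlib.crc32(gram.encode("utf-8")) % n_features
        vec[idx] += count
    norm = math.sqrt(sum(v * v for v in vec))
    if norm > 0:
        vec = [v / norm for v in vec]
    return vec
```

Because the vector size is fixed up front, no vocabulary has to be stored with the model; the trade-off is that distinct n-grams can collide in the same bucket.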

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
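
A pipeline of this shape might be assembled roughly as follows. This is a sketch, not the released training code: only the two step classes, the 1- to 3-gram range, the 4096 features, the L2 norm, and the single 512-unit ReLU hidden layer come from the description above. The `analyzer="char"` setting, `max_iter`, `random_state`, the step names, the `build_pipeline` helper, and the toy training texts are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline


def build_pipeline(random_state=0):
    """Two-step pipeline: hashed character n-grams, then an MLP classifier."""
    return Pipeline([
        (
            "vectorizer",
            HashingVectorizer(
                analyzer="char",      # assumption: plain character n-grams
                ngram_range=(1, 3),   # unigrams, bigrams, trigrams
                n_features=4096,
                norm="l2",
            ),
        ),
        (
            "classifier",
            MLPClassifier(
                hidden_layer_sizes=(512,),
                activation="relu",
                max_iter=200,               # assumption, not from the source
                random_state=random_state,  # assumption, for repeatability
            ),
        ),
    ])


# Toy data for illustration only; the real model saw ~1.5M texts.
texts = [
    "The weather is nice today.",
    "I would like a cup of coffee.",
    "Das Wetter ist heute schön.",
    "Ich hätte gern eine Tasse Kaffee.",
    "El tiempo es agradable hoy.",
    "Me gustaría una taza de café.",
]
langs = ["en", "en", "de", "de", "es", "es"]

pipeline = build_pipeline()
pipeline.fit(texts, langs)

# One row per input text, one probability per known language code.
probs = pipeline.predict_proba(["Where is the train station?"])
```

With more than two classes, `MLPClassifier` applies a softmax at the output, so each row of `probs` sums to 1 and its columns follow the order of the classifier's `classes_` attribute.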

Dataset

The pipeline was trained on a randomized, stratified subset of ~1.5M texts drawn from several sources:

Tatoeba: A crowd-sourced collection of ~5M sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
Leipzig Corpora: A collection of corpora for many languages in the same format and pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes. Only the most recently updated version was used, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
Twitter: A collection of ~1.5k tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
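
The release notes do not say how the randomized, stratified subset was drawn. One common reading is sketched below, assuming the combined sources are available as `(text, lang)` pairs and that each language is sampled at the same fraction. The `stratified_subset` function and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(records, fraction, seed=0):
    """Sample the same fraction of texts from each language, then shuffle.

    records: iterable of (text, lang) pairs.
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in records:
        by_lang[lang].append(text)

    subset = []
    for lang in sorted(by_lang):
        group = by_lang[lang]
        # Keep at least one text so that rare languages are not dropped.
        k = max(1, round(len(group) * fraction))
        subset.extend((text, lang) for text in rng.sample(group, k))

    rng.shuffle(subset)  # randomize order across languages
    return subset
```

Sampling within each language keeps the language proportions of the subset close to those of the full collection, which a plain random sample would only do on average.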
