Language Identification model (PY2)

@bdewilde bdewilde released this 04 Jun 00:36

Functionality for identifying the language of a text, using a model inspired by Google's Compact Language Detector v3 and implemented with scikit-learn.

Model

Character unigrams, bigrams, and trigrams are extracted from the input text, and their frequencies of occurrence within the text are counted. The full set of n-grams is then hashed into a 4096-dimensional feature vector whose values are the L2-normalized counts. These features are passed into a multi-layer perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~130 different languages, identified by their ISO 639-1 language codes.
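The feature extraction step can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the helper names (char_ngrams, hashed_features) are hypothetical, and zlib.crc32 is used as a stand-in hash function, whereas scikit-learn's HashingVectorizer uses a signed MurmurHash3, so the actual bucket indices and signs will differ.

```python
import math
import zlib
from collections import Counter

N_FEATURES = 4096  # dimensionality stated in the model description


def char_ngrams(text, min_n=1, max_n=3):
    """Yield every character n-gram of length min_n..max_n, in order of n."""
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hashed_features(text, n_features=N_FEATURES):
    """Hash n-gram counts into a fixed-size vector, then L2-normalize it."""
    counts = Counter(char_ngrams(text))
    vec = [0.0] * n_features
    for gram, count in counts.items():
        # Stand-in hash (assumption): crc32 instead of MurmurHash3.
        idx = zlib.crc32(gram.encode("utf-8")) % n_features
        vec[idx] += count  # colliding n-grams share a bucket
    norm = math.sqrt(sum(v * v for v in vec))
    if norm > 0:
        vec = [v / norm for v in vec]
    return vec


features = hashed_features("Hello, world!")
```

The vector length stays fixed regardless of text length or vocabulary, which is the main reason to hash rather than build an explicit n-gram vocabulary across ~130 languages.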

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
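A pipeline of this shape might be assembled roughly as follows. This is a sketch under stated assumptions, not the released training code: only the n-gram range, feature count, L2 normalization, hidden-layer size, and activation come from the description above, while analyzer="char", max_iter, random_state, and the toy training texts are illustrative choices.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # Character 1- to 3-grams hashed into 4096 L2-normalized features.
    # analyzer="char" is an assumption; "char_wb" would also fit the text.
    ("vectorizer", HashingVectorizer(
        analyzer="char", ngram_range=(1, 3), n_features=4096, norm="l2",
    )),
    # One hidden layer of 512 ReLU units; for more than two classes,
    # MLPClassifier applies a softmax over the output layer.
    ("classifier", MLPClassifier(
        hidden_layer_sizes=(512,), activation="relu",
        max_iter=200, random_state=0,  # illustrative settings
    )),
])

# Toy data for illustration only; the real model saw ~1.5M texts.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Where is the nearest train station?",
    "Le renard brun saute par-dessus le chien paresseux.",
    "Où se trouve la gare la plus proche ?",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Wo ist der nächste Bahnhof?",
]
labels = ["en", "en", "fr", "fr", "de", "de"]

pipeline.fit(texts, labels)

# One row per input text, one column per language code.
probs = pipeline.predict_proba(["Ich habe einen Hund."])
classes = list(pipeline.named_steps["classifier"].classes_)
```

Because HashingVectorizer is stateless, only the classifier step learns anything during fit, so the fitted pipeline can be pickled without carrying a vocabulary.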

Dataset

The pipeline was trained on a randomized, stratified subset of ~1.5M texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of ~5M sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads
  • Leipzig Corpora: A collection of corpora for many languages in the same format and pulling from comparable sources -- specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes. Only the most recently updated version was used, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
  • UDHR: The UN's Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
  • Twitter: A collection of ~1.5k tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then, who could say. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html