Language Identifier

A Python implementation of language identification using Cavnar and Trenkle’s n-gram approach and the WiLI-2018 (Wikipedia Language Identification) dataset.

The WiLI-2018 – Wikipedia Language Identification dataset has been designed as a benchmark dataset and contains 235,000 paragraphs in 235 languages. The dataset is balanced and a train-test split is provided.

Instead of working with all 235 languages, I restrict the analysis to a subset of 20 languages to keep it manageable.

I follow Cavnar and Trenkle’s seminal 1994 paper: for each language I compute character n-grams, with n varying from 1 to 5, and keep the 300 most frequent n-grams, ordered by frequency, as a signature or profile of that language. So, to start, I compute a set of language profiles, e.g. P = {pe, ps}, where pe is the profile vector for English and ps is the profile vector for Spanish.
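As a rough illustration, profile construction could look like the sketch below. The names `char_ngrams` and `build_profile` are hypothetical and are not taken from the notebooks. Lowercasing the text, splitting on whitespace, and padding each token with underscores are assumptions made for this example; the notebooks may tokenize differently.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Yield character n-grams of length n_min..n_max from each token.

    Assumption: tokens are whitespace-separated, lowercased, and padded
    with one underscore on each side to mark word boundaries.
    """
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, size=300):
    """Return the `size` most frequent n-grams, most frequent first."""
    counts = Counter(char_ngrams(text))
    # Sort by descending count; break ties alphabetically so the
    # resulting ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


# Example: a tiny profile built from a toy English sentence.
profile = build_profile("the cat and the hat", size=10)
print(profile)
```

A language profile is built the same way, with `text` being all training paragraphs for that language joined together.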

Given a test document t whose language is not known, the algorithm creates a profile pt for it using the same approach, i.e., by computing n-grams and keeping the 300 most common n-grams in t, ordered by frequency. It then computes the distance between pt and each language profile (here, pe and ps) using Cavnar and Trenkle’s out-of-place measure, which compares the rank of each n-gram in the two profiles. The document t is assumed to be in the language whose profile is nearest.
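The out-of-place comparison can be sketched as follows. This is a minimal version under stated assumptions: profiles are plain lists of n-grams ordered from most to least frequent, an n-gram missing from the language profile is charged a penalty equal to the length of that profile, and the names `out_of_place` and `identify` are hypothetical rather than taken from the notebooks.

```python
def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile.

    Both profiles are lists of n-grams, most frequent first.
    Assumption: an n-gram absent from lang_profile costs len(lang_profile).
    """
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_rank:
            total += abs(rank - lang_rank[gram])
        else:
            total += max_penalty
    return total


def identify(doc_profile, language_profiles):
    """Return the language whose profile has the smallest distance."""
    return min(
        language_profiles,
        key=lambda lang: out_of_place(doc_profile, language_profiles[lang]),
    )


# Toy profiles (far shorter than the 300 entries used in practice).
profiles = {
    "english": ["_", "e", "t", "th", "he"],
    "spanish": ["_", "a", "e", "de", "la"],
}
pt = ["_", "t", "e", "th"]
print(identify(pt, profiles))
```

With these toy profiles, the distance from `pt` to the English profile is 2 and to the Spanish profile is 10, so the document is assigned to English.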

Finally, the accuracy of language identification for the 20-language dataset is measured.
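Accuracy here means the fraction of test paragraphs whose predicted language matches the true label. A minimal sketch, with a hypothetical helper name and made-up labels used only for illustration (these are not results from the notebooks):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Illustrative labels only.
y_true = ["eng", "spa", "eng", "fra"]
y_pred = ["eng", "spa", "spa", "fra"]
print(accuracy(y_true, y_pred))
```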

alianoroozi/Language-Identification
