lplangid: Reciprocal Rank Classifier for Language Identification

This package is a Python implementation of the classifier described in the paper "Language Identification with a Reciprocal Rank Classifier".

For more detailed package documentation, see the project wiki.

Installation and Usage in Classification

You can install the package from the distribution at https://pypi.org/project/lplangid/ by running:

$ pip install lplangid

Alternatively, clone this repository and run the following in this directory:

$ pip install -e .

Basic usage example for language classification:

>>> from lplangid.language_classifier import RRCLanguageClassifier
>>> my_classifier = RRCLanguageClassifier.default_instance()
>>> my_classifier.get_winner("C'est un test")
'fr'

The default instance supports 24 common languages. To classify many more languages, use RRCLanguageClassifier.many_language_bible_instance(), which supports 103 languages.
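To illustrate choosing between the two instances, here is a minimal sketch that tallies the winning language over several texts. It relies only on the default_instance(), many_language_bible_instance() and get_winner() calls shown above; the helper names load_classifier and tally_winners are hypothetical and are not part of the package.

```python
from collections import Counter


def load_classifier(many_languages=False):
    """Return one of the two classifier instances described above.

    Hypothetical helper. The import is done lazily so that this module
    can be loaded even when lplangid is not installed.
    """
    from lplangid.language_classifier import RRCLanguageClassifier

    if many_languages:
        return RRCLanguageClassifier.many_language_bible_instance()
    return RRCLanguageClassifier.default_instance()


def tally_winners(classifier, texts):
    """Count how often each language code wins across the given texts.

    Hypothetical helper. `classifier` can be any object that exposes
    get_winner(text), as RRCLanguageClassifier does.
    """
    return Counter(classifier.get_winner(text) for text in texts)


# Example (requires lplangid to be installed):
# counts = tally_winners(load_classifier(), ["C'est un test", "This is a test"])
```

The default instance is probably the better starting point when the input is expected to be in one of the 24 common languages; the larger instance trades that focus for coverage.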

A single 'correct' language is not always the most appropriate output. For more informative options, see RecommendedUsagePatterns.

Data Preparation and Distribution

Throughout this package, languages are identified by 2-letter ISO 639-1 codes: for example, en for English, es (from Español) for Spanish, and zh (from 中文, Zhōngwén) for Chinese. These codes are used for directory names, as keys in dictionary tables, and when reporting classifier results.

The classifier uses the data files checked in to the ./lplangid/freq_data directory, which total just a few megabytes. It would be relatively easy to distribute these files separately from the code, but bundling them makes the package very easy for clients to use.

The frequency tables in ./lplangid/freq_data are built from Wikipedia data (single shards), tokenized on whitespace. The xx_char_freq.csv files contain characters and their sample frequencies. The xx_term_rank.csv files contain only the terms (words); only the rank (line number) of each word in these files matters. In addition, a few conversational words from ./training/data_overrides.py have been added at the top of the term rank files. Unlike most classifier models, these files can be edited directly. For example, the word "bye" and other conversational terms that are rare in Wikipedia have already been added to the top of the en_term_rank.csv file.
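The role of line numbers as ranks can be sketched as follows. This is a simplified illustration, not the package's actual scoring: it assumes a score that is just the sum of 1/rank over the whitespace tokens found in each language's term list, and it ignores the character frequency tables entirely. The function names and the toy term lists are hypothetical.

```python
def load_ranks(lines):
    """Map each term to its 1-based line number; the first occurrence wins."""
    ranks = {}
    for line_number, line in enumerate(lines, start=1):
        term = line.strip()
        if term and term not in ranks:
            ranks[term] = line_number
    return ranks


def reciprocal_rank_scores(text, ranks_by_language):
    """Sum 1/rank over the tokens of `text` for every language (assumed scoring)."""
    tokens = text.lower().split()
    return {
        lang: sum(1.0 / ranks[token] for token in tokens if token in ranks)
        for lang, ranks in ranks_by_language.items()
    }


def prepend_terms(lines, new_terms):
    """Mimic a manual edit that puts `new_terms` at the top of a term rank file."""
    return list(new_terms) + list(lines)


# Toy term lists standing in for en_term_rank.csv and fr_term_rank.csv.
en_lines = ["the", "of", "and", "bye"]
fr_lines = ["de", "la", "le", "et"]

ranks = {"en": load_ranks(en_lines), "fr": load_ranks(fr_lines)}
before = reciprocal_rank_scores("the bye", ranks)

# Moving "bye" to the top raises its contribution to the English score.
ranks["en"] = load_ranks(prepend_terms(en_lines, ["bye"]))
after = reciprocal_rank_scores("the bye", ranks)
```

In this toy setup, promoting "bye" to the first line increases the English score for text containing it, which is the effect that hand-editing a term rank file is meant to have.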

See ./training/README.md for data preparation instructions and tools for adding new languages to the classifier.
