Language detection library for Crystal

Cadmium::Lang

Lang is a language detector for Crystal. It was trained with a Bayes classifier and is extremely accurate (test results coming soon), even for closely related languages such as Russian and Ukrainian, provided the sample is large enough.
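To give a rough idea of how a Bayes classifier can tell languages apart, here is a minimal sketch of a multinomial naive Bayes model. It is not the code used by Cadmium::Lang: the TinyBayes class, its method names, the choice of character trigrams as features, the add-one smoothing and the uniform prior are all assumptions made for illustration.

```crystal
# Illustrative only: TinyBayes and its methods are hypothetical names,
# not part of the Cadmium::Lang API.
class TinyBayes
  def initialize
    # label => (trigram => number of times it was seen for that label)
    @counts = Hash(String, Hash(String, Int32)).new
    # label => total number of trigrams seen for that label
    @totals = Hash(String, Int32).new(0)
    # every distinct trigram seen during training
    @vocab = Set(String).new
  end

  # Split text into overlapping three-character windows.
  def trigrams(text : String) : Array(String)
    chars = text.downcase.chars
    return [] of String if chars.size < 3
    chars.each_cons(3).map(&.join).to_a
  end

  # Count the trigrams of a sample under the given language label.
  def train(label : String, text : String)
    bucket = (@counts[label] ||= Hash(String, Int32).new(0))
    trigrams(text).each do |gram|
      bucket[gram] += 1
      @totals[label] += 1
      @vocab << gram
    end
  end

  # Return the label with the highest log-likelihood, or nil if nothing
  # has been trained. Assumes a uniform prior over labels and uses
  # add-one (Laplace) smoothing so unseen trigrams do not zero out a label.
  def classify(text : String) : String?
    grams = trigrams(text)
    @counts.keys.max_by? do |label|
      grams.sum(0.0) do |gram|
        Math.log((@counts[label][gram] + 1) / (@totals[label] + @vocab.size))
      end
    end
  end
end

bayes = TinyBayes.new
bayes.train("en", "the quick brown fox jumps over the lazy dog")
bayes.train("es", "el veloz zorro marrón salta sobre el perro perezoso")
puts bayes.classify("the dog jumps")
```

With so little training text the sketch only separates very different inputs. Closely related languages share many trigrams, which is one way to see why a large enough sample matters.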

Supported languages (116)

Abkhazian (abk), Afrikaans (af), Albanian (alb), English, Old (ca.450-1100) (ang), Arabic (ar), Official Aramaic (700-300 BCE) (arc), Assamese (asm), Asturian (ast), Avaric (ava), Azerbaijani (az), Bashkir (ba), Belarusian (bel), Bengali (bn), Bhojpuri (bho), Bosnian (bs), Burmese (my), Chechen (ce), Cornish (kw), Corsican (co), Crimean Tatar (crh), Kashubian (csb), Lower Sorbian (dsb), French (fr), Northern Frisian (frr), Friulian (fur), Georgian (ka), Gaelic (gd), Galician (gl), Manx (gv), Swiss German (gsw), Hawaiian (haw), Hebrew (he), Hindi (hi), Croatian (hr), Upper Sorbian (hsb), Igbo (ig), Iloko (ilo), Javanese (jv), Lojban (jbo), Japanese (ja), Kara-Kalpak (kaa), Kabyle (kab), Kalaallisut (kl), Kannada (kn), Kabardian (kbd), Central Khmer (khm), Kinyarwanda (rw), Komi (kv), Korean (ko), Karachay-Balkar (krc), Ladino (lad), Lao (lo), Lezghian (lez), Limburgan (li), Lingala (ln), Luxembourgish (lb), Lushai (lus), Macedonian (mk), Malayalam (ml), Maori (mi), Marathi (mr), Moksha (mdf), Minangkabau (min), Malagasy (mg), Maltese (mt), Mongolian (mn), Mirandese (mwl), Erzya (myv), Neapolitan (nap), Navajo (nav), Low German (nds), Nepal Bhasa (new), Norwegian Nynorsk (nn), Norwegian (no), Oriya (ori), Ossetian (os), Turkish, Ottoman (1500-1928) (ota), Pangasinan (pag), Pampanga (pam), Papiamento (pap), Pali (pi), Portuguese (pt), Romansh (roh), Russian (ru), Yakut (sah), Sicilian (scn), Scots (sco), Shan (shn), Northern Sami (se), Shona (sna), Sindhi (snd), Somali (so), Spanish (es), Sardinian (sc), Sranan Tongo (srn), Serbian (sr), Swahili (sw), Tamil (ta), Tatar (tat), Telugu (te), Tetum (tet), Thai (th), Tibetan (bo), Tok Pisin (tpi), Turkmen (tk), Turkish (tr), Udmurt (udm), Ukrainian (uk), Urdu (ur), Vietnamese (vi), Volapük (vo), Walloon (wa), Wolof (wo), Kalmyk (xal), Yiddish (yi), Yoruba (yo)

Installation

  1. Add the dependency to your shard.yml:

    dependencies:
      cadmium_lang:
        github: cadmiumcr/lang
  2. Run shards install

Usage

require "cadmium_lang"

# By default, language data is loaded from the data/languages directory.
# If you ship this library inside a compiled binary, you may have to
# set the path manually.
lang = Cadmium::Lang.new

puts lang.detect("Название страны происходит от этнонима ") # => ru
puts lang.detect("Якщо сторінка була тут створена нещодавно") # => uk

Development

Adding languages to the data set is easy. Put each language sample in its own file named [iso-code].txt (e.g. en.txt for English) and place all of the sample files in one folder. Then paste the following code into a Crystal file and modify the constants.

require "cadmium_lang"
# or, when working inside this repository:
# require "./src/cadmium_lang"

DATA_PATH = "path/to/data"
OUTPUT_FOLDER = "path/to/output"
LOAD_DATA = false

classifier = Cadmium::Lang::Classifier.new

# If true, previously saved data is loaded from OUTPUT_FOLDER first.
# Use true when retraining and false for a first-time run.
if LOAD_DATA
  classifier.load(OUTPUT_FOLDER)
end

# Train the classifier on a directory full of data samples
classifier.train_on(DATA_PATH)

# Save the results
classifier.save(OUTPUT_FOLDER)

For the standard language set that ships with this library, the values should be as follows (assuming your Crystal file is located at the root of the project):

  • DATA_PATH: Any path you like. You need to create the folder yourself.
  • OUTPUT_FOLDER: "data/languages"
  • LOAD_DATA: true
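
The sample folder described at the start of this section could be prepared with a few lines of standard-library Crystal. This is only a sketch: the folder path and the two sample sentences are placeholders, and real training data would need far larger samples per language.

```crystal
# Hypothetical folder and sample texts, used for illustration only.
DATA_PATH = "path/to/data"

SAMPLES = {
  "en" => "The quick brown fox jumps over the lazy dog.",
  "fr" => "Portez ce vieux whisky au juge blond qui fume.",
}

# Create the folder (and any missing parent folders) if needed.
Dir.mkdir_p(DATA_PATH)

# Write one file per language, named after its ISO code.
SAMPLES.each do |code, text|
  File.write(File.join(DATA_PATH, "#{code}.txt"), text)
end

# List what is now in the folder.
puts Dir.children(DATA_PATH).sort.join(", ")
```

The resulting folder is what DATA_PATH in the training script above should point to.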

Contributing

  1. Fork it (https://github.com/cadmiumcr/cadmium_lang/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors
