alphabet-detector

A library to detect which alphabet a string is written in. Works on Python 2.7+ and 3.3+.

Author

Eli Finkelshteyn (founder of constructor.io)

Installation

pip install alphabet-detector

Usage

To instantiate an AlphabetDetector (the object is used for speed optimization):

from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
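The note about speed suggests that the object remembers earlier lookups. As a rough illustration only, one way such caching could work is sketched below with the standard library. This is an assumption, not the library's actual code: the class name `CachingDetectorSketch` and its internals are hypothetical.

```python
import unicodedata
from collections import defaultdict


class CachingDetectorSketch(object):
    """Hypothetical sketch: memoize per-character alphabet lookups."""

    def __init__(self):
        # Maps alphabet name -> {character: bool}, filled lazily.
        self._cache = defaultdict(dict)

    def is_in_alphabet(self, ch, alphabet):
        known = self._cache[alphabet]
        if ch not in known:
            # Assumption: the alphabet name appears in the character's
            # Unicode name, e.g. "LATIN SMALL LETTER A".
            known[ch] = alphabet in unicodedata.name(ch, "")
        return known[ch]


detector = CachingDetectorSketch()
detector.is_in_alphabet(u"a", "LATIN")  # computed once, then served from the cache
```

Reusing one instance across many calls is what would make a cache like this pay off, which is consistent with creating `ad` once and calling its methods repeatedly.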

In general, you can just use the only_alphabet_chars(unicode_str, alphabet) method and expect a boolean response:

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC') #True
ad.only_alphabet_chars(u'שלום', 'HEBREW') #True
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True
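For intuition, a check of this kind can be approximated with the standard library alone. The sketch below is not the library's implementation. It assumes that an alphabet name such as "LATIN" appears in the Unicode name of every letter in that alphabet, and `only_alphabet_chars_sketch` is a hypothetical name.

```python
import unicodedata


def only_alphabet_chars_sketch(text, alphabet):
    """Return True if every alphabetic character belongs to `alphabet`.

    Non-alphabetic characters (digits, spaces, punctuation) are skipped.
    """
    return all(
        alphabet in unicodedata.name(ch, "")
        for ch in text
        if ch.isalpha()
    )


only_alphabet_chars_sketch(u"hôtel lœwe 67", "LATIN")         # True
only_alphabet_chars_sketch(u"ελληνικά means greek", "LATIN")  # False
```

Skipping non-alphabetic characters is why a string containing digits, such as "hôtel lœwe 67", can still count as all Latin.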

You can also request free-style detection of any unicode string:

ad.detect_alphabet(u'Cyrillic and кириллический') #{'CYRILLIC', 'LATIN'}
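A similar result can be sketched without the library by taking the first word of each letter's Unicode name. This is an illustrative approximation under that assumption, not the library's code, and `detect_alphabet_sketch` is a hypothetical name.

```python
import unicodedata


def detect_alphabet_sketch(text):
    """Collect the first word of the Unicode name of each alphabetic character."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore spaces, digits, punctuation
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split(" ")[0])
    return found


detect_alphabet_sketch(u"Cyrillic and кириллический")  # {'CYRILLIC', 'LATIN'}
```

Because the result is a set, the order of the alphabets is not meaningful, and a mixed-script string can be spotted by checking whether the set has more than one element.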

Convenience methods are also provided for some major alphabets:

ad.is_cyrillic(u"Привет") #True
ad.is_latin(u"howdy") #True
# NOTE: is_cjk only detects Chinese script characters (Hanzi/Kanji/Hanja).
# It does not detect other CJK script characters such as Hangul or Katakana.
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
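The distinction in the note above is visible in the Unicode character names themselves: Chinese characters are named "CJK UNIFIED IDEOGRAPH-...", while Hangul and Katakana have their own prefixes. The snippet below uses only the standard library to show this. `first_name_word` is a hypothetical helper, and treating the name prefix as the alphabet is an assumption about how such a check might work.

```python
import unicodedata


def first_name_word(ch):
    """Return the first word of a character's Unicode name, or '' if unnamed."""
    return unicodedata.name(ch, "").split(" ")[0]


first_name_word(u"汉")  # 'CJK'      (CJK UNIFIED IDEOGRAPH-6C49)
first_name_word(u"カ")  # 'KATAKANA' (KATAKANA LETTER KA)
first_name_word(u"한")  # 'HANGUL'   (HANGUL SYLLABLE HAN)
```

For Japanese or Korean text, then, a check for "KATAKANA", "HIRAGANA", or "HANGUL" via a general method such as `only_alphabet_chars` may be closer to what is wanted than `is_cjk`.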

NOTE: all strings are expected to be unicode to keep things consistent. Conversion is never done for you, and errors are thrown when a string is not unicode.
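Since conversion is left to the caller, byte strings need to be decoded before they are passed in. A minimal sketch of one way to do that follows. `to_text` is a hypothetical helper, not part of the library, and UTF-8 is an assumed default encoding that should be replaced with whatever encoding the data actually uses.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass unicode text through."""
    if isinstance(value, bytes):
        # On Python 2, plain str literals are bytes and get decoded here too.
        return value.decode(encoding)
    return value


raw = b"frapp\xc3\xa9"  # UTF-8 encoded bytes, e.g. read from a file or socket
text = to_text(raw)     # u"frappé", now safe to pass to the detector
```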