Skip to content

fazalmajid/ctranslitcodec

Repository files navigation

-*- coding: utf-8 -*-

Unicode to 8-bit charset transliteration codec.

This package contains codecs for transliterating ISO 10646 texts
into best-effort representations using smaller coded character sets
(ASCII, ISO 8859, etc.). It is a C reimplementation of the
[translitcodec][1] module by Jason Kirtland, as well as a
drop-in replacement.

The translation tables used by the codecs are from the [transtab][2]
collection by [Prof. Markus Kuhn][3], with some extensions by
[Fazal Majid][4].

Three types of transliterating codecs are provided:

  "long", using as many characters as needed to make a natural
   replacement.  For example, \u00e4 LATIN SMALL LETTER A WITH
   DIAERESIS ``ä`` will be replaced with ``ae``.

  "short", using the minimum number of characters to make a
  replacement.  For example, \u00e4 LATIN SMALL LETTER A WITH
  DIAERESIS ``ä`` will be replaced with ``a``.

  "one", only performing single character replacements.  Characters
  that can not be transliterated with a single character are passed
  through unchanged. For example, \u2639 WHITE FROWNING FACE ``☹``
  will be passed through unchanged.

Using the codecs is simple::

  >>> import ctranslitcodec
  >>> import codecs
  >>> codecs.encode(u'fácil € ☺', 'ctranslit/long')
  u'facil EUR :-)'
  >>> codecs.encode(u'fácil € ☺', 'ctranslit/short')
  u'facil E :-)'

[1]: https://pypi.org/project/translitcodec/
[2]: https://www.cl.cam.ac.uk/~mgk25/unicode.html#libs
[3]: https://www.cl.cam.ac.uk/~mgk25/
[4]: https://majid.info/

About

A C version of the Python translitcodec module

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published