This is a Pidgin plugin for:
- De-transliteration of Russian messages written in ISO-9 translit.
- Providing a keymap inside Pidgin conversations.
Building and installing
To build the plugin, you need:
- Pidgin libraries
- GTK libraries and GLib-2.0
- Headers for pango, cairo, atk, gdk-pixbuf
After that run
make and copy
the next start of pidgin, in plugins section you should see a plugin called
Translit tools; enable it, and read help for
How does it work?
The plugin uses the ISO-9 coding scheme to de-transliterate Russian messages. For every conversation with a user included in the de-transliteration list, the message is decoded using the ISO-9 table and a set of custom exceptions. The plugin tries to leave HTML tags, &xxxx; entities and URLs as they are.
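One way to picture the "leave markup as is" behaviour is to split the message on protected spans and decode only the text in between. The sketch below is an illustration in Python, not the plugin's code; the regular expression and the function name decode_preserving_markup are assumptions made for this example.

```python
import re

# Assumed patterns for spans that must pass through untouched:
# HTML tags, character entities such as &amp; or &#1234;, and http(s) URLs.
PROTECTED = re.compile(r"(<[^>]+>|&#?\w+;|https?://[^\s<]+)")


def decode_preserving_markup(message, decode):
    """Apply `decode` only to the text between protected spans."""
    parts = PROTECTED.split(message)
    # With a single capturing group, re.split returns plain text at even
    # indexes and the captured protected spans at odd indexes.
    return "".join(
        part if index % 2 else decode(part)
        for index, part in enumerate(parts)
    )
```

Here decode stands for any function that maps transliterated text to Cyrillic; passing str.upper instead makes it easy to see which parts of a message are touched.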
The main feature of the plugin is a set of custom exceptions, which were
derived by analyzing the Russian hunspell dictionary. Consider the Russian word
ушла in transliterated form,
ushla: it can be decoded by
simple per-letter decoding; so far, so good. Now look at
shodit'; any naive decoder would decode it as
шодить, which of
course is wrong (the intended word is сходить). To overcome this, one has to apply a
set of rules to recognize such patterns. These rules can be found in the
INPUT (a, b) means that while
a is going to be replaced by
b. The longest-match
principle is used while searching for the replacement candidate.
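The effect of the exceptions combined with longest-match can be sketched in a few lines of Python. The tables below are toy tables inferred from the two examples above, not the contents of the real .def files, and the rule mapping shod to сход is a hypothetical exception chosen only to make the example work.

```python
# Toy per-letter table (an assumption; the real table is much larger).
ISO9 = {
    "a": "а", "d": "д", "h": "х", "i": "и", "l": "л", "o": "о",
    "s": "с", "t": "т", "u": "у", "sh": "ш", "'": "ь",
}

# Hypothetical exception rule: "shod" is really с + х + о + д.
EXCEPTIONS = {"shod": "сход"}


def detransliterate(text, table):
    """Replace the longest matching key at each position; copy the rest."""
    max_len = max(map(len, table))
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No rule applies: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With the plain table, detransliterate("shodit'", ISO9) gives шодить; with the exception merged in, detransliterate("shodit'", {**ISO9, **EXCEPTIONS}) gives сходить, because the four-character rule wins over the two-character sh.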
Two more files are involved in the decoding process:
- ru-replacement.def, which is just the ISO-9 table
- ru-capital-letters.def, which is a table for replacing lowercase Russian letters with capitals.
As the table can be considerably large, a trie data structure is used for fast matching. It is quick in practice: 4 MB of text can be de-transliterated in 0.2 seconds on a Core i5.
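To show why a trie suits longest-match lookup, here is a minimal Python sketch. It is not the plugin's implementation; the class and function names are made up for this example. Walking the trie from a given position visits every key that is a prefix of the remaining text in a single pass, so the longest one is found without trying each candidate length separately.

```python
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.value = None    # replacement if a key ends at this node


def build_trie(table):
    """Build a trie from a dict of {transliterated: replacement}."""
    root = TrieNode()
    for key, value in table.items():
        node = root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value
    return root


def longest_match(root, text, start):
    """Return (replacement, length) of the longest key at text[start:],
    or (None, 0) if no key matches there."""
    node = root
    best = (None, 0)
    for offset in range(start, len(text)):
        node = node.children.get(text[offset])
        if node is None:
            break
        if node.value is not None:
            best = (node.value, offset - start + 1)
    return best


def detransliterate_with_trie(text, root):
    """Decode text, copying characters that match no key."""
    out = []
    i = 0
    while i < len(text):
        replacement, length = longest_match(root, text, i)
        if replacement is None:
            out.append(text[i])
            i += 1
        else:
            out.append(replacement)
            i += length
    return "".join(out)
```

For example, with a trie built from {"s": "с", "sh": "ш", "shod": "сход"}, a lookup at the start of "shoda" walks s, h, o, d and returns the four-character rule, while a lookup at the start of "sha" stops after sh.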
De-transliteration also works outside the plugin context. One can compile the
detrans-input binary by running
make detrans-input; it reads a message from
stdin and outputs the decoded version on
stdout. The detrans-file binary, which can be built with
make detrans-binary, accepts
a file where each line is a tab-separated pair of a correct Russian word and
its transliterated version, and prints out those pairs
for which de-transliteration does not match the original. As an example of such
a file see
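The check performed on such a file can be sketched in Python as follows. This is only an illustration of the idea, not the detrans-file source; find_mismatches and the decode argument are hypothetical names.

```python
def find_mismatches(lines, decode):
    """Yield (expected, translit, actual) for every line whose decoded
    transliteration differs from the expected Russian word.

    Each line is expected to look like "<russian word>\t<transliteration>".
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        expected, translit = line.split("\t", 1)
        actual = decode(translit)
        if actual != expected:
            yield expected, translit, actual
```

Passing an open file object as lines and a decoder function as decode lists every pair that the decoder gets wrong.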
It would be nice to come up with a minimal set of exceptions, to decrease the size of the binary.
One could put some effort into making the tool aware of encodings, which may make it work a wee bit faster.
The only word that currently fails is
pasha. It is being decoded as
паша. It seems to be a really non-trivial task to resolve.
I haven't had a chance to test it on Windows.
More testing :)
As always, patches and suggestions are highly appreciated.