Added Ara-Lat transducer and modified Ara-Cyr transducer #8

connormayer · 2020-02-18T23:50:47Z

I added an Arabic-Latin transducer and modified the Arabic-Cyrillic transducer to fix a few issues. I may have done some things in suboptimal ways out of lack of experience with hfst, so please correct me!

I think things basically work fine when they plug into the morphological transducer, but there are a few little issues when the transducers are inverted:

The hamza rules for both Ara-Lat and Ara-Cyr are intended to enforce mandatory hamza insertion when going from Lat/Cyr -> Ara, e.g., always insert a hamza word-initially before a vowel. In practice, the inverted transducer spits out forms both with and without the hamza:

echo "Уйғурчә" | hfst-proc ara-cyr-rev.hfst
^Уйғурчә/ئۇيغۇرچە/ۇيغۇرچە$

In the next example, the apostrophe between n and g should be removed, but this removal is again treated as optional, so in some of the outputs it is replaced by a hamza instead. (Uyghur Latin script uses apostrophes both to represent (some) hamzas and to disambiguate sequences of n and g from the single sound ng.)

echo "in'gliz" | hfst-proc ara-lat-rev.hfst
^in'gliz/ئىنئگلىز/ئىنگلىز/ىنئگلىز/ىنگلىز$

This is basically fine because one of them is the correct form (and the hamzas are removed by the morphological transducer anyways), but it would be nice to be able to rule out the hamza-less forms completely. Or is this even worth worrying about?
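If ruling these out inside the transducer turns out to be awkward, one stopgap would be to post-filter the hfst-proc output. The following is only a minimal sketch, not part of this PR: it assumes a single `^surface/analysis/...$` unit per line with no escaped slashes, and the vowel set and function names are my own assumptions. It only drops candidates that start with a bare vowel letter; it does not address the optional apostrophe removal.

```python
# Hypothetical post-filter for hfst-proc output; names and vowel set are assumptions.
HAMZA = "\u0626"  # ARABIC LETTER YEH WITH HAMZA ABOVE

# Uyghur vowel letters in Arabic script (assumed inventory):
# alef, ae, waw, u, oe, yu, e, alef maksura (used for i)
UYGHUR_VOWELS = set("\u0627\u06d5\u0648\u06c7\u06c6\u06c8\u06d0\u0649")


def parse_hfst_proc(line):
    """Split one '^surface/a1/a2$' unit into (surface, [a1, a2, ...])."""
    line = line.strip()
    if not (line.startswith("^") and line.endswith("$")):
        raise ValueError("expected a single ^...$ unit: %r" % line)
    surface, *analyses = line[1:-1].split("/")
    return surface, analyses


def drop_bare_initial_vowel(candidates):
    """Keep only candidates that do not begin with a hamza-less vowel letter."""
    return [c for c in candidates if c and c[0] not in UYGHUR_VOWELS]
```

Applied to the two outputs above, this would keep only the forms that begin with a hamza, which still leaves two candidates for in'gliz because the apostrophe handling is a separate issue.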

Both Arabic and Latin use bigrams that correspond to unigrams in Cyrillic and Arabic, respectively. Running hfst-proc on the transducers produces the warning:

!! Warning: Transducer contains one or more multi-character symbols made up of
ASCII characters which are also available as single-character symbols. The
input stream will always be tokenised using the longest symbols available.
Use the -t option to view the tokenisation. The problematic symbol(s):
يا يۇ

This isn't really a problem (the longest tokenization is what we want), but it's sort of obnoxious. Any way to set things up differently?
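For reference, the behaviour the warning describes can be illustrated with a greedy longest-match tokeniser. This is only a sketch of the idea, not hfst-proc's actual implementation, and the function name is made up for illustration.

```python
def tokenize_longest(text, multichar_symbols):
    """Greedy left-to-right tokenisation that prefers the longest known symbol."""
    # Try longer symbols first so a bigram wins over its first character.
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multi-character symbol starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens
```

With the two symbols from the warning registered as multi-character symbols, a yeh followed by alef is consumed as one token rather than as two separate letters, which is the tokenisation we want here.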

jonorthwash · 2020-02-19T02:13:33Z

Reversing a given transliterator will in theory work if it's implemented extremely precisely, but as you found out, there are limitations. The way I would do it is to just write a fresh transliterator for the other direction.
It's best to treat everything as an individual symbol, even if they're technically bigrams. This can mean multiple rules to deal with one thing, but that's okay.
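To make the idea concrete, here is a rough Python sketch (not HFST code, and not what is in this repository) of a dedicated Latin -> Arabic direction that walks over single characters and uses several small context rules for the n/g/apostrophe case. The mapping table covers only the letters needed for the in'gliz example, and every name in it is hypothetical.

```python
HAMZA = "\u0626"  # yeh with hamza above
NG = "\u06ad"     # Arabic-script letter for the single sound ng

# Deliberately tiny, illustrative mapping: i, n, g, l, z only.
LAT_TO_ARA = {
    "i": "\u0649",
    "n": "\u0646",
    "g": "\u06af",
    "l": "\u0644",
    "z": "\u0632",
}
LATIN_VOWELS = set("aeiou\u00e9\u00f6\u00fc")  # assumed Latin-script vowel set


def lat_to_ara(word):
    """Transliterate one lowercase word; raises KeyError on unmapped letters."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "'":
            # Rule 1: between n and g the apostrophe is only a separator.
            # Rule 2: any other apostrophe is treated as a hamza (simplification).
            if not (prev == "n" and nxt == "g"):
                out.append(HAMZA)
            i += 1
        elif ch == "n" and nxt == "g":
            # Rule 3: n directly followed by g is the single sound ng.
            out.append(NG)
            i += 2
        else:
            # Rule 4: mandatory hamza before a word-initial vowel.
            if i == 0 and ch in LATIN_VOWELS:
                out.append(HAMZA)
            out.append(LAT_TO_ARA[ch])
            i += 1
    return "".join(out)
```

Because each rule is mandatory in this direction, in'gliz comes out as a single form with the initial hamza and without a hamza between n and g.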

connormayer added 2 commits February 18, 2020 15:31

Added Arabic-Latin transducer and modified Arabic-Cyrillic transducer.

451c155

Merge remote-tracking branch 'upstream/master'

34cf824

jonorthwash merged commit 7b492b7 into apertium:master Feb 19, 2020
