Added Ara-Lat transducer and modified Ara-Cyr transducer #8

Merged
2 commits merged on Feb 19, 2020

connormayer
Contributor

@connormayer connormayer commented Feb 18, 2020

I added an Arabic-Latin transducer and modified the Arabic-Cyrillic transducer to fix a few issues. I may have done some things in suboptimal ways out of lack of experience with hfst, so please correct me!

I think things basically work fine when they plug into the morphological transducer, but there are a few little issues when the transducers are inverted:

  • The hamza rules for both Ara-Lat and Ara-Cyr are intended to enforce mandatory hamza insertion when going from Lat/Cyr -> Ara, e.g., always inserting a hamza word-initially before a vowel. In practice, the inverted transducer outputs forms both with and without the hamza:
echo "Уйғурчә" | hfst-proc ara-cyr-rev.hfst
^Уйғурчә/ئۇيغۇرچە/ۇيغۇرچە$

In the following example, the apostrophe between n and g should be removed, but this removal is again treated as optional, so in some outputs the apostrophe is instead replaced by a hamza. (Uyghur Latin script uses apostrophes both to represent (some) hamzas and to disambiguate sequences of n and g from the single sound ng.)

echo "in'gliz" | hfst-proc ara-lat-rev.hfst
^in'gliz/ئىنئگلىز/ئىنگلىز/ىنئگلىز/ىنگلىز$

This is basically fine because one of them is the correct form (and the hamzas are removed by the morphological transducer anyways), but it would be nice to be able to rule out the hamza-less forms completely. Or is this even worth worrying about?
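To make the difference between the intended and the observed behaviour concrete, here is a minimal Python sketch. It is not how the hfst transducers are written; the function names, the vowel inventory, and the use of Python lists for the set of outputs are all assumptions made for illustration. It only contrasts an obligatory word-initial hamza insertion with an optional one.

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# reflect the actual rule files in this pull request.

HAMZA = "\u0626"  # ARABIC LETTER YEH WITH HAMZA ABOVE, the Uyghur hamza carrier

# Assumed inventory of Uyghur vowel letters in Arabic script:
# alef, ae, e, alef maksura, waw, u, oe, yu
VOWELS = frozenset("\u0627\u06d5\u06d0\u0649\u0648\u06c7\u06c6\u06c8")


def obligatory_hamza(word: str) -> list[str]:
    """Intended behaviour: a vowel-initial word always gets a hamza."""
    if word and word[0] in VOWELS:
        return [HAMZA + word]
    return [word]


def optional_hamza(word: str) -> list[str]:
    """Observed behaviour: both the hamza form and the bare form survive."""
    if word and word[0] in VOWELS:
        return [HAMZA + word, word]
    return [word]
```

Under these assumptions, the output above for Уйғурчә matches what `optional_hamza` would return for the bare Arabic-script form, while `obligatory_hamza` would keep only the first analysis.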

  • Both Arabic and Latin use bigrams that correspond to unigrams in Cyrillic and Arabic, respectively. Running hfst-proc on the transducers produces the warning:
!! Warning: Transducer contains one or more multi-character symbols made up of
ASCII characters which are also available as single-character symbols. The
input stream will always be tokenised using the longest symbols available.
Use the -t option to view the tokenisation. The problematic symbol(s):
يا يۇ

This isn't really a problem (the longest tokenization is what we want), but it's sort of obnoxious. Any way to set things up differently?
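For readers unfamiliar with what "the longest symbols available" means in the warning, the following Python sketch shows greedy longest-match tokenisation in general. It is not hfst-proc's implementation; the function name and the idea that `ng` is a multi-character symbol in the Latin-script alphabet are assumptions for the sake of the example.

```python
# Illustrative sketch of greedy longest-match tokenisation.
# tokenize_longest is a hypothetical helper, not part of hfst.

def tokenize_longest(text: str, multichar_symbols: set[str]) -> list[str]:
    """Split text into symbols, preferring the longest known symbol at each position."""
    # Try longer symbols first so that a bigram wins over its first character.
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multi-character symbol matches here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens
```

With `{"ng"}` as the assumed multi-character alphabet, `tokenize_longest("sing", {"ng"})` yields `["s", "i", "ng"]`, whereas in `in'gliz` the apostrophe keeps `n` and `g` apart, so each is tokenised separately.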

@jonorthwash
Member

  1. Reversing a given transliterator will in theory work if it's implemented extremely precisely, but as you found out, there are limitations. The way I would do it is to just write a fresh transliterator for the other direction.
  2. It's best to treat everything as an individual symbol, even if they're technically bigrams. This can mean multiple rules to deal with one thing, but that's okay.
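One way to read the second suggestion is sketched below in Python, for the Latin -> Arabic direction only. Instead of declaring `ng` as a single symbol, each character gets its own context-dependent rule. The function name, the three rules, and the restriction to `n`, `g`, and the apostrophe are assumptions made for illustration; the real transliterator would need rules for the whole alphabet and would be written in hfst, not Python.

```python
# Illustrative sketch: handling the Latin bigram "ng" with per-character rules.
# translit_n_g is a hypothetical helper covering only n, g and the apostrophe.

NG = "\u06ad"  # ARABIC LETTER NG
N = "\u0646"   # ARABIC LETTER NOON
G = "\u06af"   # ARABIC LETTER GAF


def translit_n_g(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "n":
            # Rule 1: n directly before g is realised as the single letter ng.
            out.append(NG if nxt == "g" else N)
        elif ch == "g":
            # Rule 2: g directly after n was already absorbed by rule 1.
            if prev != "n":
                out.append(G)
        elif ch == "'" and prev == "n" and nxt == "g":
            # Rule 3: an apostrophe separating n and g is dropped.
            continue
        else:
            # Everything else is passed through untouched in this sketch.
            out.append(ch)
    return "".join(out)
```

Here `ng` comes out as one Arabic letter and `n'g` as two, without any multi-character symbol in the alphabet, at the cost of three rules for one correspondence.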

@jonorthwash jonorthwash merged commit 7b492b7 into apertium:master Feb 19, 2020

2 participants