Added Ara-Lat transducer and modified Ara-Cyr transducer #8
I added an Arabic-Latin transducer and modified the Arabic-Cyrillic transducer to fix a few issues. I may have done some things in suboptimal ways out of lack of experience with `hfst`, so please correct me!

I think things basically work fine when they plug into the morphological transducer, but there are a few little issues when the transducers are inverted:
Here the apostrophe between `n` and `g` should be removed, but this removal is again treated as optional and it's subsequently replaced by a hamza (Uyghur Latin script uses apostrophes both to represent (some) hamzas and to disambiguate sequences of `n` and `g` from the single sound `ng`).

This is basically fine because one of them is the correct form (and the hamzas are removed by the morphological transducer anyways), but it would be nice to be able to rule out the hamza-less forms completely. Or is this even worth worrying about?
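
To make the ambiguity concrete, here is a rough Python sketch (not the `hfst` rules from this PR) of how an optional versus an obligatory apostrophe-removal rule changes the set of outputs. The symbol table, the function names `tokenize` and `readings`, and the `optional` flag are all assumptions made up for illustration; only `n`, `g`, `ng` and the apostrophe are modelled.

```python
from itertools import product

# Toy symbol table (assumed for illustration, not taken from the PR):
N = "\u0646"      # Arabic-script letter for n
G = "\u06af"      # Arabic-script letter for g
NG = "\u06ad"     # Arabic-script letter for the single sound ng
HAMZA = "\u0626"  # hamza carrier


def tokenize(latin):
    """Split a Latin-script string into 'ng' digraphs and single characters."""
    tokens, i = [], 0
    while i < len(latin):
        if latin.startswith("ng", i):
            tokens.append("ng")
            i += 2
        else:
            tokens.append(latin[i])
            i += 1
    return tokens


def readings(latin, optional=True):
    """Return every Arabic-script candidate for a Latin-script fragment.

    An apostrophe between n and g is a separator and maps to nothing.
    If the removal is only optional, the apostrophe can also survive
    and be turned into a hamza, which yields a second candidate.
    """
    tokens = tokenize(latin)
    choices = []
    for k, tok in enumerate(tokens):
        if tok == "ng":
            choices.append([NG])
        elif tok == "n":
            choices.append([N])
        elif tok == "g":
            choices.append([G])
        elif tok == "'":
            is_separator = (
                0 < k < len(tokens) - 1
                and tokens[k - 1] == "n"
                and tokens[k + 1] == "g"
            )
            if is_separator:
                choices.append(["", HAMZA] if optional else [""])
            else:
                choices.append([HAMZA])
        else:
            choices.append([tok])  # pass unknown symbols through unchanged
    return {"".join(combo) for combo in product(*choices)}
```

With `optional=True`, `readings("n'g")` contains two candidates (with and without the hamza), which mirrors the behaviour described above; with `optional=False` only one candidate remains. Which of the two candidates should survive in the real transducer is a separate decision that this sketch does not settle.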
This isn't really a problem (the longest tokenization is what we want), but it's sort of obnoxious. Any way to set things up differently?