vidyut-lipi needs to handle the colon separator in ISO-15919 #103
Thanks. It seems that Aksharamukha supports this and also supports a plain |
SaulabhyaJS also supports it, if you want to take a look. https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js and search for |
@deepestblue requesting review of this basic spec:
1 and 3 sound right to me. On item 2: given that Sanskrit doesn't traditionally use Latin punctuation, and that even in modern Sanskrit people generally use only the comma, the question mark, and the exclamation mark (I guess because the colon is rare even in English), I'd maybe propose instead erroring out if Latin input contains the colon other than in these specified contexts?
Thanks, will proceed. On erroring out: I'm undecided on the right error-handling policy for this library, since I expect that a lot of library input will be noisy in various ways (mixed content, large content that hasn't been proofread, etc.) I am considering returning a
Edit: to be specific, I like that this struct returns a best-effort output while also annotating problematic regions of the input text.
Hmm, my 2 cents is that I'd expect a transliterator like this to be very conservative in input handling; otherwise round-tripping gets messy, behaviour becomes fuzzy, etc. I'd propose that un-proofread content isn't a valid scenario. As for mixed content, my thought is that the content could be marked up appropriately outside of invoking this library. Say, in HTML, the markup can contain the
Thanks for your 2c! I agree that conservatism is important and that it's important to flag errors clearly rather than muddling along and producing garbage (or worse, clean data with a few hidden dollops of garbage). Ideally, any transliterator output is lossless across a round trip. At the same time, I also want to balance this principle with ergonomics. For example, I've encountered scenarios like the following either personally or through friends:
As a user, I prefer that a transliterator return some useful result, especially if I want to spend at most a few seconds on the task. This is why I'm drawn to the approach I sketched above. I think your mixed content approach will work well for structured documents like HTML, but if (for example) I'm copying a paragraph from a PDF, that structure won't be easily available. Other potential approaches:
(Responding also to #33 (comment) ) I suggest having options for what the transliterator should do with unexpected text. (This is one of the things I'd hope for from a Rust transliterator…) Like {PASS_THROUGH, BEST_EFFORT, ERROR}, say. And/or correspondingly the result from the transliterator can be a sequence of chunks, each of them saying whether it's a "proper" transliterated result, or just a best-guess muddling through, or what.
Possible examples of the options I mean:
Even if we expect very few people to use the transliterator "core" function directly, it would be a way of writing down explicitly all the choices that have been made in the convenience wrapper.
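A sketch of what those pre-specified options might look like as a Rust type. All names here are hypothetical, not vidyut-lipi's actual API, and the best-effort branch uses `'?'` as a stand-in for a real closest-match substitution:

```rust
/// How the transliterator should treat text it doesn't understand.
#[derive(Clone, Copy, Debug, PartialEq)]
enum UnknownTextPolicy {
    /// Copy unexpected input to the output unchanged.
    PassThrough,
    /// Muddle through with the closest guess the engine can make.
    BestEffort,
    /// Fail fast on the first unexpected character.
    Error,
}

/// Illustrative handler for one unexpected span of input.
fn handle_unknown(policy: UnknownTextPolicy, raw: &str) -> Result<String, String> {
    match policy {
        UnknownTextPolicy::PassThrough => Ok(raw.to_string()),
        // A real engine would pick a closest match; '?' is a stand-in.
        UnknownTextPolicy::BestEffort => Ok("?".repeat(raw.chars().count())),
        UnknownTextPolicy::Error => Err(format!("unexpected input: {raw:?}")),
    }
}

fn main() {
    println!("{:?}", handle_unknown(UnknownTextPolicy::PassThrough, "🙂"));
}
```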
Ha, I missed that this discussion was about treating the colon as a separator, which is relevant to two of my examples above :) Also, more concretely responding to comment #103 (comment) above: rather than

```rust
struct Result {
    text: String,
    errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
    // Byte offsets in the input string. `usize` = platform-specific unsigned int, e.g. u64
    start: usize,
    end: usize,
    error: ...
}
```

where the consumer has to manually match up the best-effort text with byte offsets, one of the things I'm proposing is something like (may not be working code, treat as pseudocode):

```rust
// result: Vec<ResultChunk>
struct ResultChunk {
    text: String,
    kind: ResultKind,
}

enum ResultKind {
    Fine(String),            // perfectly fine and unproblematic input for the source and destination scripts: well understood and will round-trip cleanly
    UnknownPassedThrough(String), // emoji, punctuation, etc.: not part of the source and destination scripts, but just passed through
    LikelyInputErrorSilentlyCorrected(String), // e.g. "s" in Devanagari corrected to avagraha
    Separator,               // goes with empty text, for input like कइ क्ह to avoid कै ख
    Numeric(String, String), // e.g. ("1234", "१२३४"), so that the user can choose whether to transliterate digits or not
    UnrepresentableClosestMatch(String), // turning some of the different Tamil `L`s into ल and/or ळ
    Dropped(String),         // accents, chandrabindu, etc. that we know what they are but don't know how to represent in the target script
    // ...
}
```

or whatever, and the default convenience wrapper would just concatenate all the result chunks' text.

(Having these in the result may be even better than having to pre-specify some options, e.g. whether to transliterate digits or not. A higher-level UI could say: "I transliterated your text for you, but note the following that I couldn't do anything with, or which you may want to change in your input…")

(Doing all this may make it slower, but despite the temptation of "it's in Rust, it must be fast", I believe hardly any applications are bottlenecked by transliteration speed in practice; the appeal of Rust here for me is more in the types being able to represent all this.)
Transliterating from a script to itself (Devanagari to Devanagari, or IAST to IAST) would then be a way of finding all the problematic stuff in it :-) Anyway, I'll stop the flood of comments here; I'm aware that what I'm proposing is likely overengineering :-) The broader point is just a desire for a really conservative/pedantic/lossless transliterator that will never silently corrupt text, no matter what the input is or how many rounds of transliteration in whatever directions it undergoes using the library.
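A minimal runnable sketch of the chunked-result idea, with only a few of the kinds above and hypothetical names throughout (in particular, here a `Separator` chunk carries its own `:` text so that the convenience wrapper can stay a plain concatenation):

```rust
#[derive(Debug, PartialEq)]
enum ResultKind {
    Fine,
    UnknownPassedThrough,
    Separator, // the ISO-15919 disambiguating ':' between two Fine chunks
}

#[derive(Debug)]
struct ResultChunk {
    text: String,
    kind: ResultKind,
}

/// The default convenience wrapper: concatenate every chunk's text.
fn flatten(chunks: &[ResultChunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}

fn main() {
    let result = vec![
        ResultChunk { text: "arśa".into(), kind: ResultKind::Fine },
        ResultChunk { text: ":".into(), kind: ResultKind::Separator },
        ResultChunk { text: "ityādayaḥ".into(), kind: ResultKind::Fine },
    ];
    println!("{}", flatten(&result));
}
```

A higher-level UI could walk the same `Vec<ResultChunk>` a second time to report every non-`Fine` chunk to the user.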
Thank you for the wonderful discussion! I think error handling is a large enough topic that it deserves its own issue, so I've created #105. Let's continue there so that this issue can remain focused on ISO 15919. |
A couple of sorta related issues
@deepestblue Thanks for the bug report! I was hoping to transliterate in as few input passes as possible, but I guess a basic (Edit: fixed by calling |
Returning to the main issue (mainly taking notes for myself) -- I tried to hack around this behavior by enumerating all cases specifically and adding them to the token map. The blocker there was in how to support

Stepping back, the core logic seems to be something like:

```rust
if from.is_abugida() && to.is_alphabet() && to.has_separator() {
    if prev.is_consonant() && cur.is_independent_vowel() {
        // for a:i, a:u
        output += separator;
    } else if TODO {
        output += separator;
    }
}
```

Maybe we can combine these by hard-coding

Tentative test cases:
Yep, this seems similar to the code in saulabhyaJS near https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js#L352 |
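The branch structure sketched above can also be written as a standalone predicate over the Latin-side output. This is a simplified sketch, not vidyut-lipi code: raw character checks stand in for real phoneme-level checks, the vowel set is abbreviated, and the digraph consonant list is illustrative:

```rust
/// Vowel letters of the (abbreviated) ISO-15919 alphabet.
const VOWELS: &[char] = &['a', 'ā', 'i', 'ī', 'u', 'ū', 'e', 'ē', 'o', 'ō'];

/// Does ISO-15919 need a ':' between these two adjacent output chunks?
fn needs_separator(prev: &str, cur: &str) -> bool {
    let (Some(last), Some(first)) = (prev.chars().last(), cur.chars().next()) else {
        return false;
    };
    // a + i would otherwise read back as the diphthong "ai", etc.
    let vowel_cluster = VOWELS.contains(&last) && VOWELS.contains(&first);
    // g + h would otherwise read back as the aspirate "gh", etc.
    let digraph = first == 'h'
        && matches!(last, 'k' | 'g' | 'c' | 'j' | 'ṭ' | 'ḍ' | 't' | 'd' | 'p' | 'b');
    vowel_cluster || digraph
}

fn main() {
    // अर्शइत्यादयः → "arśa" + "ityādayaḥ" should become "arśa:ityādayaḥ".
    println!("{}", needs_separator("arśa", "ityādayaḥ"));
}
```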
One of the many corner cases ISO-15919 supports is using a `:` to disambiguate Latin letter clusters. Here are a couple of examples that need support in vidyut-lipi.

```
./lipi -f devanagari -t iso15919 "अर्शइत्यादयः"
```

Expected: `arśa:ityādayaḥ`
Actual: `arśaityādayaḥ`

```
./lipi -t devanagari -f iso15919 "arśa:ityādayaḥ"
```

Expected: `अर्शइत्यादयः`
Actual: `अर्श:इत्यादयः`

```
./lipi -f devanagari -t iso15919 "वाग्हरि"
```

Expected: `vāg:hari`
Actual: `vāghari`

```
./lipi -t devanagari -f iso15919 "vāg:hari"
```

Expected: `वाग्हरि`
Actual: `वाग्:हरि`