vidyut-lipi needs to handle the colon separator in ISO-15919 #103

Open
deepestblue opened this issue Jan 2, 2024 · 16 comments
Labels: bug, lipi

deepestblue commented Jan 2, 2024

One of the many corner cases ISO-15919 supports is using a : to disambiguate Latin letter clusters. Here are a couple of examples that need support in vidyut-lipi.

./lipi -f devanagari -t iso15919 "अर्शइत्यादयः"
Expected: arśa:ityādayaḥ
Actual: arśaityādayaḥ

./lipi -t devanagari -f iso15919 "arśa:ityādayaḥ"
Expected: अर्शइत्यादयः
Actual: अर्श:इत्यादयः

./lipi -f devanagari -t iso15919 "वाग्हरि"
Expected: vāg:hari
Actual: vāghari

./lipi -t devanagari -f iso15919 "vāg:hari"
Expected: वाग्हरि
Actual: वाग्:हरि

akprasad added the bug and lipi labels on Jan 2, 2024

akprasad (Contributor) commented Jan 2, 2024

Thanks. It seems that Aksharamukha supports this, and it also passes a plain : through when it could not be a disambiguator in the current context. indic_transliteration has no support for this, which is not surprising: it is based on a port of Sanscript, and Sanscript's core algorithm is quite simple.

deepestblue (Author) commented:

SaulabhyaJS also supports it, if you want to take a look: https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js (search for "separator").

akprasad (Contributor) commented Jan 23, 2024

@deepestblue requesting review of this basic spec:

  • treat the colon as a disambiguating separator for a:i, a:u, k:h, g:h, c:h, j:h, ṭ:h, ḍ:h, t:h, d:h, p:h, and b:h (for Sanskrit; other languages might need support for additional clusters) -- see the sketch below.
  • treat the colon as an ordinary punctuation mark in all other cases.
  • when transliterating from Devanagari, insert a colon only where needed for the cases mentioned above.
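
For concreteness, here's a minimal sketch of the check in the first item. This is a hypothetical helper, not vidyut-lipi's actual API; it assumes NFC input and covers only the Sanskrit cluster list above:

/// Hypothetical helper: returns true if `prev` + `cur` would be ambiguous
/// in ISO 15919 without a separating colon (Sanskrit clusters only).
fn needs_separator(prev: &str, cur: &str) -> bool {
    // a:i, a:u -- an "a" (implicit or explicit) followed by an independent
    // i/u would otherwise read as the diphthong ai/au.
    let vowel_clash =
        prev.ends_with('a') && (cur.starts_with('i') || cur.starts_with('u'));
    // k:h etc. -- an unaspirated stop followed by "h" would otherwise read
    // as the corresponding aspirate (kh, gh, ...).
    const STOPS: [char; 10] = ['k', 'g', 'c', 'j', 'ṭ', 'ḍ', 't', 'd', 'p', 'b'];
    let aspirate_clash = prev
        .chars()
        .last()
        .map_or(false, |c| STOPS.contains(&c))
        && cur.starts_with('h');
    vowel_clash || aspirate_clash
}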

deepestblue (Author) commented:

Items 1 and 3 sound right to me. On item 2: Sanskrit doesn't traditionally use Latin punctuation, and even in modern Sanskrit people generally use only the comma, the question mark, and the exclamation mark (I guess because the colon is rare even in English). So I'd instead propose erroring out if Latin input contains a colon outside the specified contexts.

akprasad (Contributor) commented Jan 25, 2024

Thanks, will proceed.

On erroring out: I'm undecided on the right error-handling policy for this library, since I expect that a lot of library input will be noisy in various ways (mixed content, large content that hasn't been proofread, etc.)

I am considering returning a Result struct in this format, which should be readable to you even though it uses some Rust constructs:

struct Result {
    text: String,
    errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
    // Byte offsets in the input string.
    // `usize` = platform-specific unsigned int, e.g. u64.
    start: usize,
    end: usize,
    error: ...,
}

Edit: to be specific, I like that this struct returns a best-effort output while also annotating problematic regions of the input text.
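
For illustration, a caller might consume such a result like this. This is a sketch against the structs above; `report` is a hypothetical name, not a real vidyut-lipi API:

// Sketch: print the best-effort output, then flag each problem span.
fn report(result: &Result, input: &str) {
    println!("{}", result.text); // best-effort transliteration
    for span in &result.errors {
        // `start`/`end` are byte offsets into the original input.
        eprintln!(
            "problem at bytes {}..{}: {:?}",
            span.start,
            span.end,
            &input[span.start..span.end]
        );
    }
}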

deepestblue (Author) commented:

Hmm, my 2 cents is that I'd expect a transliterator like this to be very conservative on input-handling; otherwise round-tripping gets messy, behaviour becomes fuzzy, etc.

I'd propose that un-proofread content isn't a valid scenario.

As for mixed content, my thought is that the content could be marked up appropriately outside of invoking this library. In HTML, say, the markup can carry the lang attribute, and the JS that invokes vidyut would do so only on the appropriately marked-up nodes.

akprasad (Contributor) commented:

Thanks for your 2c! I agree that conservatism is important and that it's important to flag errors clearly rather than muddling along and producing garbage (or worse, clean data with a few hidden dollops of garbage). Ideally, any transliterator output is lossless across a round trip.

At the same time, I also want to balance this principle with ergonomics. For example, I've encountered scenarios like the following either personally or through friends:

  • a user sees a Kannada web document they can't read (full site, forum comment, etc.) and wants to transliterate it to Devanagari.

  • a user has the raw data for a text from sanskritdocuments.org, GRETIL, etc. and wants to convert it to Telugu.

  • a user has a very long text file produced by Devanagari OCR and wants to convert it to ISO 15919 for easier proofreading.

As a user, I prefer that a transliterator return some useful result, especially if I want to spend at most a few seconds on the task. This is why I'm drawn to the approach I sketched above.

I think your mixed content approach will work well for structured documents like HTML, but if (for example) I'm copying a paragraph from a PDF, that structure won't be easily available.

Other potential approaches:

  • a new transliterate_strict function that errors out early
  • a transliteration option that lets users select a strictness policy (Strict, Permissive) -- sketched below
  • return a Result<String> (see std::result) and include the best-effort text in the error value.
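
As a sketch of the second option (all names here are hypothetical stand-ins, not vidyut-lipi's actual API; the body is elided):

// Hypothetical strictness option, purely to make the shape concrete.
pub enum Strictness {
    /// Error out on the first input that cannot round-trip cleanly.
    Strict,
    /// Produce best-effort output and annotate problem spans.
    Permissive,
}

/// Stand-in error type carrying the offending byte range.
pub struct TransliterateError {
    pub start: usize,
    pub end: usize,
}

pub fn transliterate_with(
    input: &str,
    strictness: Strictness,
) -> Result<String, TransliterateError> {
    // The core loop would consult `strictness` at each unexpected byte:
    // bail out under Strict, muddle through (recording a span) under
    // Permissive. Body elided.
    let _ = (input, strictness);
    todo!()
}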

shreevatsa (Contributor) commented:

(Responding also to #33 (comment) )

I suggest having options for what the transliterator should do with unexpected text. (This is one of the things I'd hope for from a Rust transliterator…) Like {PASS_THROUGH, BEST_EFFORT, ERROR}, say (sketched in Rust below). And/or, correspondingly, the result from the transliterator can be a sequence of chunks, each of them saying whether it's a "proper" transliterated result, just a best-guess muddling-through, or something else.

  • There can be a "core" transliterator function that is very strict/conservative/pedantic and makes no choices / has no opinions of its own, all of them exposed through options that must be set.

  • Then there can be convenience wrapper functions for different use-cases (like the "I just want to get something useful" ones mentioned above, and the other use-case that @deepestblue and I are advocating for, of “If I run my text through this transliterator, I'd want to be very sure that if it cannot round-trip back I'd know right away; I don't want to lose any information silently and find out days later”).
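
A minimal Rust rendering of that option set (purely illustrative, not a proposed API):

/// Sketch of the policy options named above.
pub enum OnUnexpectedText {
    /// Copy unrecognized input to the output untouched.
    PassThrough,
    /// Emit the closest representable output, marked as a guess.
    BestEffort,
    /// Stop and report the offending span.
    Error,
}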

shreevatsa (Contributor) commented:

Possible examples of the options I mean:

  • for the case from the other bug, an option for whether "rR" should be transliterated into [U+0930 DEVANAGARI LETTER RA, U+094D DEVANAGARI SIGN VIRAMA, U+090B DEVANAGARI LETTER VOCALIC R] or into [U+0930 DEVANAGARI LETTER RA, U+0943 DEVANAGARI VOWEL SIGN VOCALIC R]. (Rendered the same in some fonts?)
  • What to do with "कइ" (transliterate as "kai", "ka{}i", "käi", return two separate chunks for the wrapper to deal with, …).
  • Whether to treat a colon as a visarga, pass it through as a colon, or throw an error. (A colon_strategy field of the options struct parameter?)
  • How to deal with short e/o when encountered in non-Devanagari input (transliterate to the Devanagari short e/o, which are strictly correct but which many people, and possibly some fonts, don't recognize, or to the regular long ones?), or in Devanagari input (some people seem to be using text input methods that produce these characters even when they clearly intend the regular long ones).
  • Many more (see the Aksharamukha UI for a few: replace anusvāra with the corresponding anunāsika or vice versa, etc.)…

Even if we expect very few people to use the transliterator "core" function directly, it would be a way of writing down explicitly all the choices that have been made in the convenience wrapper.

shreevatsa (Contributor) commented Jan 25, 2024

Ha, I missed that this discussion was about treating colon as a separator, which is relevant to two of my examples above :)

Also more concretely responding to comment #103 (comment) above, rather than

struct Result {
    text: String,
    errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
    // Byte offsets in the input string.
    // `usize` = platform-specific unsigned int, e.g. u64.
    start: usize,
    end: usize,
    error: ...,
}

where the consumer has to manually match up the best-effort text with byte offsets, one of the things I'm proposing is something like (may not be working code, treat as pseudocode):

// result: Vec<ResultChunk>

struct ResultChunk {
    text: String,
    kind: ResultKind,
}

enum ResultKind {
    Fine(String), // perfectly fine and unproblematic input for the source and destination scripts: well-understood and will round-trip cleanly
    UnknownPassedThrough(String), // emoji, punctuation, etc: not part of the source and destination scripts, but just passed through
    LikelyInputErrorSilentlyCorrected(String), // e.g. "s" in Devanagari corrected to avagraha
    Separator, // goes with empty text, for input like कइ क्ह to avoid कै ख
    Numeric(String, String), // e.g. ('1234', '१२३४'), so that the user can choose whether to transliterate digits or not.
    UnrepresentableClosestMatch(String), // turning some of the different Tamil `L`s into ल and/or ळ
    Dropped(String), // Accents and chandrabindu or whatever that we know what they are but don't know how to represent in the target script
    // ...
}

or whatever, and the default convenience wrapper would just concatenate all the result chunks' text while the "serious" user could assemble their own different result by looking into the ResultKinds.
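
(For the record, that default wrapper could be as small as the following sketch, assuming the ResultChunk above:)

// Sketch: the permissive convenience wrapper just concatenates chunk text;
// a stricter caller would inspect each chunk's ResultKind instead.
fn concat_chunks(chunks: &[ResultChunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}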

(Having these in the result may be even better than having to pre-specify some options e.g. whether to transliterate digits or not. A higher-level UI could say: “I transliterated your text for you, but note the following that I couldn't do anything with, or which you may want to change in your input…”)

(Doing all this may make it slower, but despite the temptation of "it's in Rust, it must be fast", I believe hardly any applications are bottlenecked by transliteration speed in practice; the appeal of Rust here, for me, is more that the types can represent all of this.)

shreevatsa (Contributor) commented:

Transliterating from a script to itself (Devanagari to Devanagari, or IAST to IAST) would then be a way of finding all problematic stuff in it :-)

Anyway, I'll stop the flood of comments here; I'm aware that what I'm proposing is likely overengineering :-) The broader point is just a desire for a really conservative/pedantic/lossless transliterator that will never silently corrupt text, no matter what the input is or how many rounds of transliteration in whatever directions it undergoes using the library.

akprasad (Contributor) commented:

Thank you for the wonderful discussion!

I think error handling is a large enough topic that it deserves its own issue, so I've created #105. Let's continue there so that this issue can remain focused on ISO 15919.

deepestblue (Author) commented:

A couple of sort-of related issues:

aū should transliterate to अऊ

agḥ should transliterate to अग्ः (I'm not sure there's a use-case for this specific example)

akprasad (Contributor) commented Feb 10, 2024

@deepestblue Thanks for the bug report! I was hoping to transliterate in as few input passes as possible, but I guess a basic convert-to-NFC pass is worth it to avoid headaches elsewhere.

(Edit: fixed by calling to_nfc first.)
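
For reference, a normalization pass like that is available via the unicode-normalization crate; a minimal sketch (the actual fix in vidyut-lipi may differ):

use unicode_normalization::UnicodeNormalization;

/// NFC-normalize input before transliteration, so that e.g. "ū" typed as
/// "u" + U+0304 COMBINING MACRON becomes the single code point U+016B and
/// matches the token map.
fn to_nfc(s: &str) -> String {
    s.nfc().collect()
}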

akprasad (Contributor) commented Feb 11, 2024

Returning to the main issue (mainly taking notes for myself) --

I tried to hack around this behavior by enumerating all cases specifically and adding them to the token map. The blocker there was how to support a:i, since the a is an implicit vowel on the preceding consonant. We could get around this by explicitly storing all of the mappings कइ, खइ, etc., but that feels gross and unprincipled.

Stepping back, the core logic seems to be something like:

if from.is_abugida() && to.is_alphabet() && to.has_separator() {
    if prev.is_consonant() && cur.is_independent_vowel() {
        // for a:i, a:u
        output += separator;
    } else if TODO {
        // for consonant clusters like k:h
        output += separator;
    }
}

Maybe we can combine these by hard-coding k:h etc., then using custom code for the vowel-based separator.

Tentative test cases:

// positive
"a:i a:u"
"ka:i ka:u"
"k:ha g:ha c:ha j:ha ṭ:ha ḍ:ha t:ha d:ha p:ha b:ha"
"ḷ:ha"

// negative -- colon should be ignored
"a:"
"ka:"
"k:"
"a:A"
"k:ta"
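
Those cases suggest round-trip tests along these lines; `transliterate` and `Scheme` here are stand-ins to keep the sketch self-contained, not the verified vidyut-lipi API:

// Stand-ins so the sketch type-checks.
#[derive(Clone, Copy)]
enum Scheme { Devanagari, Iso15919 }

fn transliterate(input: &str, from: Scheme, to: Scheme) -> String {
    let _ = (input, from, to);
    unimplemented!("stand-in for the real transliterator")
}

// Round-trip property for the positive cases above: a lossless separator
// implementation must restore the colon exactly.
#[test]
fn colon_separator_round_trips() {
    for &text in &["a:i a:u", "ka:i ka:u", "k:ha g:ha ṭ:ha", "ḷ:ha"] {
        let deva = transliterate(text, Scheme::Iso15919, Scheme::Devanagari);
        let back = transliterate(&deva, Scheme::Devanagari, Scheme::Iso15919);
        assert_eq!(text, back);
    }
}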

deepestblue (Author) commented:

Yep, this seems similar to the code in saulabhyaJS near https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js#L352
