Text direction needs to be taken into account #4

r12a · 2020-11-17T13:38:58Z

Not only will the recogniser need to take the language into account, it will also be unable to decipher the text unless it knows whether the glyphs it recognises proceed right-to-left, left-to-right, or vertically top-to-bottom with lines stacked LTR or RTL.

This includes orthographies that are generally written in one direction, but that have embedded text that runs in the opposite direction, and sometimes embedded text within that.

To some extent the recogniser will be able to apply the Unicode bidi algorithm to reverse-engineer the logical character sequence, but in other bidirectional cases this will not be sufficient. It would also probably be beneficial to tell the recogniser the overall scanning direction of the text being entered, for which a directional label may be useful, in a similar way to how one does this for language.
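To illustrate why an explicit directional label could help, here is a minimal Python sketch of a "first strong character" guess (loosely modelled on rules P2 and P3 of the Unicode bidi algorithm) that an explicit label overrides. The names `guess_base_direction` and `resolve_direction` are hypothetical and not part of any proposed API, and the guess assumes the characters are already in logical order, which a recogniser working from ink does not have.

```python
import unicodedata
from typing import Optional


def guess_base_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' from the first strong character, else None.

    Simplified: isolates and embeddings are ignored.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None  # only digits, punctuation, or whitespace were seen


def resolve_direction(text: str, label: Optional[str] = None) -> str:
    """Prefer the explicit label (hypothetical hint); fall back to the guess."""
    if label in ("ltr", "rtl"):
        return label
    return guess_base_direction(text) or "ltr"
```

With this sketch, a Hebrew sentence that happens to begin with a Latin word such as "W3C" is guessed as `ltr`, which is the kind of case where a label supplied by the author would be needed.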

wacky6 · 2020-11-26T06:36:20Z

Can you give some examples of how these mixed direction texts are written? Here I mean the actual process of how they are written (e.g. which character, which strokes are written first).

We didn't expect this to be a problem though. The assumption we made is that handwriting follows the natural flow of speech. In other words, we didn't expect the characters to be written in reverse (relative to their speech / interpretation direction). For example, we didn't expect "hello" to be written in "elloh" order.

r12a · 2020-11-27T11:49:18Z

I grabbed some examples from Wikipedia home pages.

First example. Unidirectional text, but the recogniser has to scan from right to left.

Second example. Numbers and Latin text run LTR within the overall RTL flow. People writing the text tend to leave a gap and write the LTR text from left to right. They don't write the numbers or the Latin text backwards.

Note, btw, that in the example just above, the parenthesis on the left is U+0029 RIGHT PARENTHESIS, and the one on the right is U+0028 LEFT PARENTHESIS. These are mirrored characters, whose glyph in typed text is established only when the directional context is known. The recogniser will also need to assign the glyph to a code point depending on the current base direction.
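The mirroring point can be sketched as follows in Python. This is only an illustration under stated assumptions: `MIRROR_PAIRS` is a small hand-written subset of mirrored pairs and `code_point_for_glyph` is a hypothetical helper; only `unicodedata.mirrored` comes from the standard library.

```python
import unicodedata

# Hand-written subset of mirrored pairs, for illustration only.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def code_point_for_glyph(glyph_shape: str, base_direction: str) -> str:
    """Map an observed ink shape to the character to store.

    glyph_shape is the character whose LTR rendering matches the ink.
    In an RTL context a mirrored character is drawn flipped, so the
    stored code point is the partner of the observed shape.
    """
    if base_direction == "rtl" and unicodedata.mirrored(glyph_shape):
        return MIRROR_PAIRS.get(glyph_shape, glyph_shape)
    return glyph_shape
```

For instance, `code_point_for_glyph("(", "rtl")` yields `")"`, i.e. U+0029 RIGHT PARENTHESIS for the parenthesis drawn on the left, while the same shape under `"ltr"` stays U+0028.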

Third example. Overall LTR sentence has RTL text with embedded LTR text in it. I expect that 'W3C' would probably be the last 3 code points written and stored in memory once the text has been recognised.

To be honest, it can be difficult to know where the boundaries are for the changes in base direction here, though in this example the quote marks help. I don't know how this is done in practice, I'm just flagging up that it will be necessary.

When it comes to speech, there is no flip-flopping of direction involved, and in fact in memory all code points are also arranged in one logical, unidirectional sequence. The changes in direction are only a feature of the written text. Unfortunately for you, that's what you're starting from.

wacky6 · 2021-09-01T04:53:23Z

Sorry about the delay. I forgot to mention you in w3ctag/design-reviews#591 (comment)

@r12a

Let's continue the discussion here.

WDYT about a direction hint to disambiguate the main direction here? This will help tell "82: Score" apart from "Score: 28" or "Score: 82" (especially for rule-based recognizers).

For distinguishing between "Score: 28" and "Score: 82" (especially with rule-based recognizers), I imagine the recognizer can determine the script of each word and use that script's direction (LTR or RTL) to decide. In the above case, "Score:" is Hebrew, and "82" is Latin. With the direction hint present, the in-memory string starts with "Score:", followed by "82".
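A rule-based approach along these lines might look like the following Python sketch. It is an assumption-laden simplification, not how any existing recognizer works: `visual_to_logical` and `_reverse_runs` are hypothetical names, only one level of embedding is handled, and neutral characters (spaces, punctuation) always follow the base direction.

```python
import unicodedata

LTR_CLASSES = {"L", "EN", "AN"}  # letters and digits that run left to right
RTL_CLASSES = {"R", "AL"}        # Hebrew, Arabic and similar letters


def _reverse_runs(chars, classes):
    """Reverse each maximal run of characters whose bidi class is in classes."""
    out, run = [], []
    for ch in chars:
        if unicodedata.bidirectional(ch) in classes:
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return out


def visual_to_logical(visual: str, base_direction: str) -> str:
    """Convert characters listed left to right as seen on the page
    into in-memory (logical) order, using the direction hint."""
    chars = list(visual)
    if base_direction == "rtl":
        chars.reverse()                           # read the line right to left
        chars = _reverse_runs(chars, LTR_CLASSES)  # restore digits and Latin
    else:
        chars = _reverse_runs(chars, RTL_CLASSES)  # flip embedded RTL words
    return "".join(chars)
```

Given the page-order input `"82 :" + "\u05D3\u05D5\u05E7\u05D9\u05E0"` (the Hebrew word for "score" as it appears from left to right), the `"rtl"` hint produces the word followed by `": 82"`, while the `"ltr"` hint produces `"82 :"` followed by the word. The same ink gives two different strings, which is the ambiguity the hint is meant to resolve.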

For machine-learning-based recognizers (the ones we currently have), handwritten "Score: 82" is part of their training dataset. The Hebrew recognizer will learn from the dataset and output characters in the correct in-memory order (i.e. a direction hint is unnecessary). As for how it knows the right order, we don't know (hence why it's ML-based).

r12a mentioned this issue Jan 26, 2021: Handwriting Recognition API w3ctag/design-reviews#591 (Closed)

r12a added the i18n-tracker label ("Group bringing to attention of Internationalization, or tracked by i18n but not needing response.") Feb 10, 2021

w3cbot mentioned this issue Feb 10, 2021: Text direction needs to be taken into account w3c/i18n-activity#1030 (Open)
