Text direction needs to be taken into account #4

Open
r12a opened this issue Nov 17, 2020 · 3 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@r12a

r12a commented Nov 17, 2020

Not only will the recogniser need to take the language into account, it will also be unable to decipher the text unless it knows whether the glyphs it recognises proceed right-to-left, left-to-right, or vertically top-to-bottom with lines stacked LTR or RTL.

This includes orthographies that are generally written in one direction, but that have embedded text that runs in the opposite direction, and sometimes embedded text within that.

To some extent the recogniser will be able to apply the Unicode bidi algorithm to reverse engineer the logical character sequence, but in other bidirectional cases this will not be sufficient. Also it would probably be beneficial to indicate for the recogniser the overall scanning direction for the text being entered, for which it may be useful to apply a directional label, in a similar way to how one does this for language.
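As a rough illustration of why a directional label could help, here is a minimal Python sketch that looks at the Unicode bidi class of each character and guesses a base direction from the first strong character, loosely following the first-strong idea in the bidi algorithm. The function names are hypothetical, the sample Hebrew word is arbitrary, and the sketch ignores isolates and embeddings, so it is not a full implementation of the algorithm. Text that begins with digits or punctuation has no strong character to go on, which is where an explicit label would be needed.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidi class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


def first_strong_direction(text):
    """Guess a base direction from the first strong character.

    Returns "ltr", "rtl", or None when the text has no strong character
    (for example, only digits and punctuation).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


# An arbitrary Hebrew word, written with escapes to keep the source unambiguous.
hebrew_word = "\u05e9\u05dc\u05d5\u05dd"

print(bidi_classes(hebrew_word + " 82"))
print(first_strong_direction(hebrew_word + " 82"))
print(first_strong_direction("82: "))
```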

@wacky6
Member

wacky6 commented Nov 26, 2020

Can you give some examples of how these mixed direction texts are written? Here I mean the actual process of how they are written (e.g. which character, which strokes are written first).

We didn't expect this to be a problem though. The assumption we made is that handwriting follows the natural flow of speech. In other words, we didn't expect the characters to be written in reverse (relative to their speech / interpretation direction). For example, we didn't expect "hello" to be written in "elloh" order.

@r12a
Author

r12a commented Nov 27, 2020

I grabbed some examples from Wikipedia home pages.

First example. Unidirectional text, but the recogniser has to scan from right to left.

Screenshot 2020-11-27 at 11 25 56

Second example. Numbers and Latin text run LTR within the overall RTL flow. People writing the text tend to leave a gap and write the LTR text from left to right. They don't write the numbers or the Latin text backwards.

Screenshot 2020-11-27 at 11 26 38

Note, btw, that in the example just above, the parenthesis on the left is U+0029 RIGHT PARENTHESIS, and the one on the right is U+0028 LEFT PARENTHESIS. These are mirrored characters, whose glyph in typed text is established only when the directional context is known. The recogniser will also need to assign the glyph to a code point depending on the current base direction.
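The mirroring point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `code_point_for_paren` is a hypothetical helper, the drawn shape is represented by the character it visually resembles, and the direction passed in stands for whatever direction the recogniser has resolved for the surrounding text. `unicodedata.mirrored` is the standard-library call that reports whether a character has the Unicode mirrored property.

```python
import unicodedata

# Pairs of mirrored parentheses; a real recogniser would need a fuller table.
MIRROR_PAIR = {"(": ")", ")": "("}


def code_point_for_paren(visual_shape, direction):
    """Map a drawn parenthesis shape to the code point to store.

    visual_shape: "(" or ")" as the glyph appears on the page.
    direction: "ltr" or "rtl", the direction of the surrounding text
               (assumed to be known to the caller).
    """
    # In LTR text the shape and the stored code point coincide.
    if direction == "rtl" and unicodedata.mirrored(visual_shape):
        # In RTL text a mirrored character is displayed flipped, so the
        # shape on the page corresponds to the opposite code point.
        return MIRROR_PAIR[visual_shape]
    return visual_shape


for direction in ("ltr", "rtl"):
    stored = code_point_for_paren("(", direction)
    print(direction, "U+%04X" % ord(stored), unicodedata.name(stored))
```

With this mapping, a glyph shaped like "(" on the left edge of an RTL parenthetical is stored as U+0029 RIGHT PARENTHESIS, matching the example above.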

Third example. Overall LTR sentence has RTL text with embedded LTR text in it. I expect that 'W3C' would probably be the last 3 code points written and stored in memory once the text has been recognised.

Screenshot 2020-11-27 at 11 35 52

To be honest, it can be difficult to know where the boundaries are for the changes in base direction here, though in this example the quote marks help. I don't know how this is done in practice, I'm just flagging up that it will be necessary.

When it comes to speech, there is no flip-flopping of direction involved, and in fact in memory all code points are also arranged in one logical, unidirectional sequence. The changes in direction are only a feature of the written text. Unfortunately for you, that's what you're starting from.

@r12a r12a added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Feb 10, 2021
@wacky6
Member

wacky6 commented Sep 1, 2021

Sorry about the delay. I forgot to mention you in w3ctag/design-reviews#591 (comment)

@r12a

Let's continue the discussion here.

image

WDYT about a direction hint to disambiguate the main direction here? This would help tell "82: Score" apart from "Score: 28" or "Score: 82" (especially for rule-based recognizers).

To distinguish between "Score: 28" and "Score: 82" (especially in rule-based recognizers), I imagine the recognizer can determine the script of each word and use that script's direction (LTR or RTL) to decide. In the above case, "Score:" is Hebrew, and "82" is Latin. With a direction hint present, the in-memory string starts with "Score:", followed by "82".
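A rule-based approach along these lines might look like the following Python sketch. Everything here is an assumption made for illustration: glyphs are given as hypothetical `(x, character)` pairs already grouped into words, a word is treated as RTL if it contains any character with a right-to-left bidi class, and the Hebrew word is just sample data. The direction hint decides the order of the words, while each word's own script decides the order of its characters.

```python
import unicodedata


def word_direction(glyphs):
    """Treat a word as RTL if any of its characters is strongly RTL."""
    for _, ch in glyphs:
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            return "rtl"
    return "ltr"


def word_text(glyphs):
    """Put a word's characters in logical order using the word's direction."""
    rtl = word_direction(glyphs) == "rtl"
    ordered = sorted(glyphs, key=lambda glyph: glyph[0], reverse=rtl)
    return "".join(ch for _, ch in ordered)


def logical_string(words, direction_hint):
    """Order whole words by the hint, then join their logical text."""
    rtl = direction_hint == "rtl"
    ordered = sorted(
        words,
        key=lambda glyphs: min(x for x, _ in glyphs),  # left edge of the word
        reverse=rtl,
    )
    return " ".join(word_text(glyphs) for glyphs in ordered)


# Visual layout, left to right: the digits "8" "2", a gap, then a colon and
# a Hebrew word whose first letter sits at the far right.
number = [(0, "8"), (1, "2")]
label = [(3, ":"), (4, "\u05d3"), (5, "\u05d5"), (6, "\u05e7"),
         (7, "\u05d9"), (8, "\u05e0")]

print(logical_string([number, label], "rtl"))  # label first, then the number
print(logical_string([number, label], "ltr"))  # number first, then the label
```

In both cases the digits stay in the order "82", because the number is ordered by its own LTR direction; only the position of the number relative to the label depends on the hint.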

For machine-learning-based recognizers (the ones we currently have), handwritten "Score: 82" is part of their training dataset. The Hebrew recognizer will learn from the dataset and output characters in the correct in-memory order (i.e. a direction hint is unnecessary). As for how it knows the right order, we don't know (hence why it's ML based).
