Support for Arabic text #18

Open
benjamingeer opened this issue Feb 5, 2019 · 27 comments

@benjamingeer

This is really an impressive standoff editor, and I'm looking forward to exploring it more thoroughly. Just as an initial test, I tried typing the Arabic word اختبار ("test") in the demo, and it looks like this:

[Image: screenshot 2019-02-05 at 18 40 39]

RTL text would need to be represented using the HTML dir attribute (see Structural markup and right-to-left text in HTML).

Also, Arabic letters have to be joined together. Putting each character in a separate <span> seems to break the joining of letters, but I've read that this can be solved by adding a zero-width joiner (ZWJ, U+200D) character.

How easy do you think it would be to do this? I would be glad to try to help, with some guidance.

@argimenes
Owner

Hi, glad you're finding the editor useful so far.

The main challenges that I can see to supporting Arabic script lie in the design decision behind SPEEDy to enclose characters in single SPANs. This was done to leverage the browser's built-in NodeList structure which resembles a linked list. While the text direction can be set on the DIV that contains the character SPANs, this would apply to the entire text and would not allow portions of text to be in Arabic script. Additionally, it also means that SPEEDy currently does not support text blocks, but essentially treats the whole editor space as one text block. The only way around this that I can see is a fundamental rewrite of the engine to allow both DIVs and SPANs to be stored, but aside from the problem of mapping the caret to a character object (which is non-trivial in itself), it also poses questions about how the text will be serialised: that is, what constitutes the raw text stream output when non-character elements (like DIVs, TABLEs, etc.) are present in the editor?

I am very open to any ideas on how to tackle this problem. For the time being I've decided to work within the limitations as in most other respects the editor works for standoff annotation.

If you have any thoughts on a solution to this design issue, I would be happy to take them on board ...

@benjamingeer
Author

benjamingeer commented Feb 6, 2019

Thanks for your reply. I think it would be OK just to be able to set the text direction for the entire text, as long as the character joining problem could be fixed. I think the current design with one span per character is necessary so that markup can begin and end in the middle of a word (if I understand correctly how it works), and that’s something important for us. So I think it would be worth trying to fix the joining problem by adding the invisible zero-width joiner characters around each Arabic letter, as described in this Stack Overflow answer, to make the browser join the letters across the span tags. These extra invisible characters would just be used in the HTML, and would not be included in the output of the editor. Do you think it would be difficult to do that?

@argimenes
Owner

If the trick mentioned in the Stack Overflow answer works I think it would be possible to extend SPEEDy to support Arabic script. It will require some careful rewriting around character insertion and deletion to cater for the ligature directive symbols, but it seems like it should be possible.

Some means of identifying a text as Arabic script would be necessary when it is being loaded, so perhaps a meta-data standoff property (i.e., one without a start or end index) could serve this purpose...

@argimenes
Owner

I reprogrammed the editor to insert a zero-width joining character span between all regular character spans and unfortunately this had no effect on the ligaturing of the Arabic characters. It may be that this only works if the ZWJ is inside the same span as the text character, rather than wrapped in its own span. The example on Stack Overflow is a little ambiguous as it shows a text node adjacent to a span rather than two spans.

I will see if I can append the ZWJ to the text content of the span and see what happens.
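
To make the difference between the two attempts concrete, here is a rough sketch that builds the markup as a string. It is not SPEEDy's actual code: the helper names `zwjInOwnSpans` and `zwjInsideSpans` are made up for illustration, and the second variant naively assumes that every character needs a ZWJ on each side where it has a neighbour.

```javascript
// U+200D ZERO WIDTH JOINER
const ZWJ = "\u200D";

// Variant 1 (the attempt that had no effect): each ZWJ sits in its own span
// between the regular character spans.
function zwjInOwnSpans(text) {
    return Array.from(text)
        .map(ch => `<span>${ch}</span>`)
        .join(`<span>${ZWJ}</span>`);
}

// Variant 2 (the idea to try next): the ZWJ is part of the text content of
// the same span as the character it should join.
function zwjInsideSpans(text) {
    const chars = Array.from(text); // splits by code point, not by UTF-16 unit
    return chars
        .map((ch, i) => {
            const before = i > 0 ? ZWJ : "";
            const after = i < chars.length - 1 ? ZWJ : "";
            return `<span>${before}${ch}${after}</span>`;
        })
        .join("");
}
```

Both functions return the same visible characters; the only difference is which element owns the invisible joiner.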

@argimenes
Owner

I have got the ZWJ characters to work inside the editor, but the RTL orientation is mixing up the order of the words. Not sure if this is a result of using the ZWJ characters, or if I am misinterpreting it ...

@benjamingeer
Author

I have got the ZWJ characters to work inside the editor

That’s great news, thank you!

the RTL orientation is mixing up the order of the words

The first word you type should start at the right margin, then word 2 should be to the left of word 1, word 3 should be to the left of word 2, etc., like this:

3 2 1

Could you make a branch that I could try?

@argimenes
Owner

Hi,

I've now checked in some code that appears to fix the Arabic ligature rendering issue, and hopefully implements RTL correctly. I noticed that the text seemed misaligned in RTL for Western texts, and my concern is that the CSS annotation might be interfering in some way.

In any case, the best way to test the Arabic script feature out is to select 'arabic.json' from the File drop down list and click 'Load'. This not only loads a sample text but also reloads the editor with the required RTL and character interpolation settings. Currently the editor needs to be reloaded to carry this out, as character interpolation (i.e., ZWJ) needs to be configured up front. I should be able to refactor this soon to allow more dynamic switching, but for now this works.

I also created a branch called 'arabic-script' for you to play in.

@benjamingeer
Author

benjamingeer commented Feb 20, 2019

Hi, I've finally had a chance to try this out. It's definitely progress, thank you! Looking at the result in the editor, I now realise that the use of ZWJ is a little more complex than I thought: it's needed sometimes before the character, sometimes after the character, and sometimes both, but this depends on the character and on its position in the word. To make this easier, I'm writing a JavaScript function to determine where to add ZWJ.

About dir="rtl", there are basically two common use cases:

  1. A text that's mostly in Arabic, with a few words in the Latin alphabet. The direction of the whole document can be RTL.
  2. A text that's mostly in the Latin alphabet, with a few words in Arabic. The direction of the whole document can be LTR.

I think it would be fine to have an RTL button to change the document direction. Joining has to work in both cases, though. I think the easiest way to do this would be to apply the ZWJ logic to any character with a Unicode code point in an Arabic range. A function like this should do it (using the ranges from https://en.wikipedia.org/wiki/Arabic_script_in_Unicode):

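function isArabicChar(char) {
    // codePointAt(0) reads a whole code point, so the ranges above U+FFFF work
    // as long as `char` contains the full character (both surrogates).
    var codePoint = char.codePointAt(0);

    return (codePoint >= 0x0600 && codePoint <= 0x06FF) ||      // Arabic
        (codePoint >= 0x0750 && codePoint <= 0x077F) ||         // Arabic Supplement
        (codePoint >= 0x08A0 && codePoint <= 0x08FF) ||         // Arabic Extended-A
        (codePoint >= 0xFB50 && codePoint <= 0xFDFF) ||         // Arabic Presentation Forms-A
        (codePoint >= 0xFE70 && codePoint <= 0xFEFF) ||         // Arabic Presentation Forms-B
        (codePoint >= 0x10E60 && codePoint <= 0x10E7F) ||       // Rumi Numeral Symbols
        (codePoint >= 0x1EC70 && codePoint <= 0x1ECBF) ||       // Indic Siyaq Numbers
        (codePoint >= 0x1EE00 && codePoint <= 0x1EEFF);         // Arabic Mathematical Alphabetic Symbols
}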

I'm thinking that maybe I should write a little library with these functions, which could then be used in SPEEDy.
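
As a rough illustration of how such a check could be applied to mixed-direction text, here is a table-driven variant of the same ranges together with a loop over a string. The names `ARABIC_RANGES` and `findArabicChars` are invented for this sketch. It relies on `for...of` iterating a string by code point, so characters from the last three ranges (outside the BMP) arrive as one unit.

```javascript
// Same ranges as the function above, as [first, last] pairs.
const ARABIC_RANGES = [
    [0x0600, 0x06FF], [0x0750, 0x077F], [0x08A0, 0x08FF],
    [0xFB50, 0xFDFF], [0xFE70, 0xFEFF],
    [0x10E60, 0x10E7F], [0x1EC70, 0x1ECBF], [0x1EE00, 0x1EEFF]
];

function isArabicChar(char) {
    const codePoint = char.codePointAt(0);
    return ARABIC_RANGES.some(([first, last]) => codePoint >= first && codePoint <= last);
}

// Collect the characters of `text` that fall in an Arabic range.
function findArabicChars(text) {
    const found = [];
    for (const ch of text) { // iterates by code point
        if (isArabicChar(ch)) {
            found.push(ch);
        }
    }
    return found;
}

// In a mostly-Latin text, only the Arabic word would get the ZWJ treatment.
console.log(findArabicChars("test \u0627\u062E\u062A\u0628\u0627\u0631").length);
```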

@benjamingeer
Author

Also, Arabic has diacritics as separate Unicode characters. For example, here's a string with two characters: the letter U+0644 followed by the diacritic U+064F, which appears above the letter:

[Image: screenshot 2019-02-20 at 18 39 18]

A diacritic (called a 'nonspacing mark' in Unicode) has to be rendered above or below the letter. As far as I can tell, the only way to make this work is to put the letter and its diacritics in the same <span> element.

This means that it won't be possible to annotate just the diacritic with standoff. I think that's OK: the annotation can be attached to the letter. But it does mean that there needs to be a way for a <span> to contain a letter plus one or more diacritics.
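
A minimal sketch of that grouping, assuming a deliberately narrow definition of "diacritic": only U+064B..U+065F (the Arabic combining marks) and U+0670 (superscript alef). The function names `isDiacriticApprox`, `splitIntoGroups` and `toSpans` are hypothetical, and a real implementation would probably need a more complete list of marks.

```javascript
// Approximation (assumption): treat only these code points as Arabic diacritics.
function isDiacriticApprox(ch) {
    const cp = ch.codePointAt(0);
    return (cp >= 0x064B && cp <= 0x065F) || cp === 0x0670;
}

// Split text into groups: one base character followed by its diacritics.
function splitIntoGroups(text) {
    const groups = [];
    for (const ch of text) {
        if (isDiacriticApprox(ch) && groups.length > 0) {
            // Attach the diacritic to the group of the preceding character.
            groups[groups.length - 1] += ch;
        } else {
            groups.push(ch);
        }
    }
    return groups;
}

// Wrap each group, not each code point, in its own span.
function toSpans(text) {
    return splitIntoGroups(text).map(group => `<span>${group}</span>`).join("");
}

// U+0644 followed by U+064F ends up in a single span.
console.log(splitIntoGroups("\u0644\u064F").length);
```

With this shape, a standoff annotation can still start or end at any group boundary, but never between a letter and its diacritic.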

@benjamingeer
Author

I made a little library that does most of the work:

https://github.com/dhlab-basel/arabic-shaping

It provides functions for splitting a string into an array of character groups, with each group containing at most one letter and its diacritics, and for adding any ZWJ characters that are needed to each group. Then you can wrap each group in a <span> element.

Please let me know if you can use it or if it needs anything else.

@argimenes
Owner

argimenes commented Mar 7, 2019 via email

@benjamingeer
Author

OK, great, thanks for letting me know. Please also let me know if there’s anything I can do to help.

@benjamingeer
Author

benjamingeer commented Mar 19, 2019

I just wanted to check with you whether your script will work with single characters at a time

Yes. My idea is that you can construct each <span> so that it contains a "character group" consisting of at most one Arabic non-diacritic character (e.g. a letter) followed by zero or more Arabic diacritics. (This seems to be the only way to get the diacritics to appear above or below the letters.)

For each character entered, you can call:

  • isArabicChar to find out if it's an Arabic character
  • isArabicNonDiacritic to find out if it's an Arabic non-diacritic character
  • isArabicDiacritic to find out if it's an Arabic diacritic

For each character group, you can call addZwj to add any necessary ZWJ at the beginning and/or end of the group (this function first removes any existing ZWJ from the group).

Maybe it's clearer to illustrate this step by step.

Suppose we start with an empty text. The user starts by typing an Arabic letter, which we can represent in this illustration as L. We call isArabicNonDiacritic, which returns true. So we make our first character group:

<span>L</span>

Next the user types a diacritic, which we can represent here as D. isArabicNonDiacritic returns false, and isArabicDiacritic returns true, so we add the character to the same group:

<span>LD</span>

Now the user types another Arabic letter. isArabicNonDiacritic returns true, so we have to start a new group:

<span>LD</span><span>L</span>

Now that we have two groups, we must add any necessary ZWJ characters to make them join together correctly. We call addZwj(charGroup, previousCharGroup, nextCharGroup) for each group:

  1. addZwj("LD", null, "L"). This might return LDZ (representing the ZWJ character as Z).
  2. addZwj("L", "LD", null). This might return ZL.

So now our two groups look like this:

<span>LDZ</span><span>ZL</span>

These two groups should now be joined correctly. Now the user types a third letter, so we start group 3. To join groups 2 and 3, we have to redo the ZWJ in group 2, as well as adding ZWJ to group 3. We call addZwj for groups 2 and 3. The result might be:

<span>LDZ</span><span>ZLZ</span><span>ZL</span>

In short:

  • If we're adding characters to the end of the text, every time we start a new character group, we have to calculate ZWJ for the new group, and recalculate ZWJ for the preceding group.
  • If an existing group is changed, we have to redo its ZWJ, as well as the ZWJ of the preceding group and the following group (if they exist).
  • If a new group is added between two existing groups, we have to add its ZWJ, and redo the ZWJ of the preceding group and the following group.
  • If a non-diacritic is deleted, the whole group should be deleted. Then the ZWJ of the preceding group and the following group need to be recalculated.
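
The typing-at-the-end case can be sketched roughly as follows, using the same placeholder symbols as the illustration (L, D, and Z for the ZWJ). `addZwjStandIn` is only a stand-in that assumes every group joins to every neighbour, which is not true of real Arabic; in practice the library's addZwj would be passed in its place. `recalcAround` and `appendGroup` are hypothetical helper names, and stripping the placeholder from the neighbours before passing them is an assumption.

```javascript
const Z = "Z"; // visible placeholder for U+200D, as in the illustration

function stripZ(group) {
    return group.split(Z).join("");
}

// Stand-in with the signature addZwj(charGroup, previousCharGroup, nextCharGroup).
// Assumption: join to every neighbour that exists.
function addZwjStandIn(group, prev, next) {
    return (prev !== null ? Z : "") + stripZ(group) + (next !== null ? Z : "");
}

// Recalculate ZWJ for the group at `index` and for its immediate neighbours.
function recalcAround(groups, index, addZwj) {
    for (let i = index - 1; i <= index + 1; i++) {
        if (i < 0 || i >= groups.length) continue;
        const prev = i > 0 ? stripZ(groups[i - 1]) : null;
        const next = i < groups.length - 1 ? stripZ(groups[i + 1]) : null;
        groups[i] = addZwj(groups[i], prev, next);
    }
}

// Typing at the end: start a new group, then fix it up along with its predecessor.
function appendGroup(groups, group, addZwj) {
    groups.push(group);
    recalcAround(groups, groups.length - 1, addZwj);
}

const groups = [];
appendGroup(groups, "LD", addZwjStandIn); // letter plus diacritic
appendGroup(groups, "L", addZwjStandIn);  // second letter
appendGroup(groups, "L", addZwjStandIn);  // third letter
console.log(groups.join(" ")); // → LDZ ZLZ ZL
```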

Does this seem workable?

@benjamingeer
Author

(Edited above comment after simplifying functions a bit.)

@argimenes
Owner

argimenes commented Mar 20, 2019

Hi,

This is a great explanation, but I have a few more questions, I'm afraid. The situation with the editor is that the user's cursor can be anywhere in a text; they could be typing the first letter of the text, or at the end of the text, or inserting a letter somewhere between. Alternatively, they could be deleting a character, or a whole range of characters at once. I am trying to determine what input exactly I would need to pass to your functions in these various circumstances, and the best way of getting that input. For example, as soon as someone inserts a character I am able to output the SPAN that wraps that character, along with the previous and next siblings (in some cases these will be NULL). Is that sufficient material for your shaping code to work with, do you think, or would you recommend some other parameters? And if it is sufficient, can you suggest the procedure I should follow to generate the char-groups from those inputs?

Thanks,
Iian

@benjamingeer
Author

benjamingeer commented Mar 20, 2019

Hi Iian,

For example, as soon as someone inserts a character I am able to output the SPAN that wraps that character, along with the previous and next siblings (in some cases these will be NULL). Is that sufficient material for your shaping code to work with

Yes, that's fine. Whenever a character is inserted/changed, you have to update:

  • the ZWJ of the <span> containing the inserted/changed character
  • the ZWJ of the previous <span> (if there is one)
  • the ZWJ of the next <span> (if there is one)

If a <span> is deleted, you need to update:

  • the ZWJ of the previous <span> (if there is one)
  • the ZWJ of the next <span> (if there is one)

To update the ZWJ of a <span>, call addZwj(charGroup, previousCharGroup, nextCharGroup), and replace the contents of the <span> with the return value of that function. It's OK if previousCharGroup or nextCharGroup is null.

For example, suppose we have these groups:

<span>a</span>
<span>b</span>
<span>c</span>

The user inserts group x after b. The text is now:

<span>a</span>
<span>b</span>
<span>x</span>
<span>c</span>

We call:

  • addZwj(b, a, x): returns b with ZWJ
  • addZwj(x, b, c): returns x with ZWJ
  • addZwj(c, x, null): returns c with ZWJ

Then the user deletes the x. The text is now back to:

<span>a</span>
<span>b</span>
<span>c</span>

We call:

  • addZwj(b, a, c): returns b with ZWJ
  • addZwj(c, b, null): returns c with ZWJ
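
The insert and delete cases above can be sketched like this. The sketch does not implement any shaping: it passes in a recording function in place of addZwj so that the sequence of calls is visible. `refresh`, `insertGroup`, `deleteGroup` and `stripZwj` are hypothetical names, and stripping ZWJ from the neighbouring groups before passing them is an assumption.

```javascript
const ZWJ = "\u200D";
const stripZwj = group => group.split(ZWJ).join("");

// Call addZwj for each listed index that exists, and store the result.
function refresh(groups, indices, addZwj) {
    for (const i of indices) {
        if (i < 0 || i >= groups.length) continue;
        const prev = i > 0 ? stripZwj(groups[i - 1]) : null;
        const next = i + 1 < groups.length ? stripZwj(groups[i + 1]) : null;
        groups[i] = addZwj(stripZwj(groups[i]), prev, next);
    }
}

// Insert: update the previous group, the new group, and the next group.
function insertGroup(groups, index, group, addZwj) {
    groups.splice(index, 0, group);
    refresh(groups, [index - 1, index, index + 1], addZwj);
}

// Delete: after removal, index - 1 is the old predecessor and index the old successor.
function deleteGroup(groups, index, addZwj) {
    groups.splice(index, 1);
    refresh(groups, [index - 1, index], addZwj);
}

// Recording stand-in: remembers its arguments and returns the group unchanged.
const calls = [];
const recordingAddZwj = (group, prev, next) => {
    calls.push([group, prev, next]);
    return group;
};

const spans = ["a", "b", "c"];
insertGroup(spans, 2, "x", recordingAddZwj); // a b x c
deleteGroup(spans, 2, recordingAddZwj);      // back to a b c
console.log(JSON.stringify(calls));
```

The recorded calls come out in the same order as the two bullet lists above: three calls for the insert, then two for the delete.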

can you suggest the procedure I should follow to generate the char-groups from those inputs?

I think the following should work. For each character typed, if isArabicChar returns true:

  • if (isArabicDiacritic(char)) and the cursor is in (or at the end of) an existing group:
    • removeZwj(existingGroup)
    • append the character to the existing group
  • else
    • Make a new group

Then update ZWJ as described above.
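
A rough sketch of that procedure, leaving out the ZWJ update step. Everything here is illustrative: `looksLikeArabicDiacritic` is a narrow approximation (U+064B..U+065F and U+0670 only) rather than the library's isArabicDiacritic, `stripZwj` stands in for removeZwj, and the cursor is modelled as the index of the group it is in or directly after (-1 at the very start of the text). The sketch also assumes that non-Arabic characters simply get a group of their own.

```javascript
const ZWJ = "\u200D";

// Approximation (assumption): a small subset of Arabic combining marks.
function looksLikeArabicDiacritic(ch) {
    const cp = ch.codePointAt(0);
    return (cp >= 0x064B && cp <= 0x065F) || cp === 0x0670;
}

function stripZwj(group) {
    return group.split(ZWJ).join("");
}

// Place one typed character and return the new cursor position.
function placeTypedChar(groups, cursor, ch) {
    if (looksLikeArabicDiacritic(ch) && cursor >= 0) {
        // Diacritic with an existing group at the cursor: drop the old ZWJ
        // and append the diacritic to that group.
        groups[cursor] = stripZwj(groups[cursor]) + ch;
        return cursor;
    }
    // Anything else starts a new group right after the cursor.
    groups.splice(cursor + 1, 0, ch);
    return cursor + 1;
}

const typedGroups = [];
let cursor = -1;
cursor = placeTypedChar(typedGroups, cursor, "\u0644"); // letter: new group
cursor = placeTypedChar(typedGroups, cursor, "\u064F"); // diacritic: same group
cursor = placeTypedChar(typedGroups, cursor, "\u0628"); // letter: new group
console.log(typedGroups.length, cursor);
```

After each call, the ZWJ of the affected group and its neighbours would still have to be recalculated.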

Does that make sense?

Thanks for your patience,
Ben

@argimenes
Owner

argimenes commented Mar 21, 2019

Hi Ben,

Thanks for the clarification, it has really helped. I've now added a 'onCharacterAdded' handler to SPEEDy to allow the client code to access the text stream, and I've attempted to implement the add part of your algorithm above.

Would you mind loading the demo page when you can and try pasting in some Arabic text. From my end some portions of the text look correct, while others are off. There's probably something simple I'm missing with my implementation ...

Best regards,
Iian

PS. Keep in mind that my hookup is very basic, even around the assumptions it makes about the next and previous elements. But I wanted to start with the simple case first.

@argimenes
Owner

argimenes commented Mar 21, 2019 via email

@benjamingeer
Author

Would you mind loading the demo page when you can and try pasting in some Arabic text.

Thanks so much, I’m looking forward to trying this later today.

would your Arabic shaping solution be reworkable to other non-Latin alphabets, and Unicode rendering in general?

I think the basic principle should be the same. I believe the WebKit shaping bug affects Indic scripts as well. I thought about trying to implement a general-purpose solution for all affected scripts, but Arabic is the only one of the affected languages that I actually know, so I can see whether it’s rendered correctly. Once this works for Arabic, I’d be glad to refactor it to make it work for more scripts, with the help of someone who knows another affected language.

@argimenes
Owner

argimenes commented Mar 22, 2019 via email

@argimenes
Owner

argimenes commented Mar 22, 2019 via email

@benjamingeer
Author

Hi Iian,

I've just tried this. We're getting closer! 🙂 But I see two problems. First, you need to pull the latest version of my shaping code from GitHub, because the procedures I described above won't work with the slightly older version you have. I think this is why the text in arabic.json isn't rendered correctly when you load it from a file.

I've also just now changed my example text to the one in arabic.json, so you can compare these correct <span> elements with the ones produced by the editor:

https://github.com/dhlab-basel/arabic-shaping/blob/master/correct-example.html

The second problem is that when I type Arabic text (one character at a time) into an empty editor window, the characters are inserted backwards; the order of the <span> elements is the reverse of what it should be. For example, the first word of your sample text is بدأ ("began"). If I type that word into the editor (I type first ب, then د, then أ), the editor inserts each new character before the previous one, generating this HTML:

<span style="position: relative;">أ</span>
<span style="position: relative;">د</span>
<span style="position: relative;">ب</span>

But that's the word أدب ("literature"). 🙂 Are you trying to insert the <span> elements in reverse order for RTL? If so, that's not how it works: the "logical ordering" of the characters is the same for LTR and RTL. The browser handles the "visual ordering". Here's an explanation:

https://www.w3.org/International/questions/qa-visual-vs-logical
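
To illustrate logical ordering, here is a small sketch that renders typed characters as markup. It is not the editor's code; `renderEditor` is a made-up helper that only builds a string. The point is that the spans are emitted in the order the characters were typed for both directions, and only the dir attribute on the container differs.

```javascript
// Emit spans in logical (typing) order; never reverse them for RTL.
function renderEditor(typedChars, direction) {
    const spans = typedChars
        .map(ch => `<span style="position: relative;">${ch}</span>`)
        .join("\n");
    return `<div dir="${direction}">\n${spans}\n</div>`;
}

// Typed in this order: U+0628 (ب), then U+062F (د), then U+0623 (أ).
const typed = ["\u0628", "\u062F", "\u0623"];
const html = renderEditor(typed, "rtl");
console.log(html);
```

With dir="rtl" the browser lays the first span out at the right margin and each following span to its left, so the DOM order stays the same as for LTR text.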

Have a good weekend,
Ben

@argimenes
Owner

argimenes commented Mar 22, 2019 via email

@benjamingeer
Author

No problem, and don’t worry, I know it’s difficult to do this without knowing the language, but we’ll get there. I’ll try this again on Monday.

@benjamingeer
Author

benjamingeer commented Mar 26, 2019

OK, the good news is that when I load arabic.json in the editor, the sample text is shaped and joined correctly, and the diacritics are in the right places. Yay!

The only problem I see with the rendering of the sample text is that sometimes a line is broken in the middle of a word, like this:

[Image: Screen Shot 2019-03-26 at 14 23 30]

When I start with an empty editor and start typing in Arabic, the characters are still added in reverse order. To type the first word of the sample text, first I type the letter ب, and I see:

[Image: Screen Shot 2019-03-26 at 14 28 13]

Notice that the cursor is positioned to the left of the letter, as it should be. But now I type the letter د, and I see:

[Image: Screen Shot 2019-03-26 at 14 28 29]

The د should have appeared to the left of the ب, but has instead appeared to the right of it. Again, the cursor appears to the left of the د, as it should. Now I type the letter أ, and again it's added on the right instead of on the left:

[Image: Screen Shot 2019-03-26 at 14 28 52]

Also, now the cursor is still to the left of the د where it was before.

If I use the Chrome developer tools to look at the generated HTML, I see:

[Image: Screen Shot 2019-03-26 at 14 33 09]

You can see that the characters have been inserted in reverse order. Since I typed ب, then د, then أ, I should have got this instead:

<div data-role="editor" spellcheck="false" contenteditable="true" class="editor">
<span style="position: relative;">ب</span>
<span style="position: relative;">د</span>
<span style="position: relative;">أ</span>
</div>

Maybe for testing, you can try an Arabic keyboard layout and just type those three letters yourself. On macOS (using the "Arabic - PC" layout), Linux, or Windows, you can find them here:

[Image: Arabic keyboard layout, 800px-KB_Arabic svg]

On a QWERTY keyboard:

  1. ب is the F key.
  2. د is the right square bracket key (]).
  3. أ is Shift-H.

If you type them in that order, the resulting <span> elements should match the first three <span> elements that you get when you load arabic.json:

[Image: Screen Shot 2019-03-26 at 14 52 18]

@benjamingeer
Author

Hi again, is there any chance you might have some time for this?

@argimenes
Owner

argimenes commented Jul 15, 2019 via email
