Multi-byte characters support #3754

Closed
Reinmar opened this issue Jun 13, 2016 · 22 comments · Fixed by ckeditor/ckeditor5-engine#550
Labels: package:engine, type:feature

@Reinmar
Member

Reinmar commented Jun 13, 2016

At the moment, neither the engine (except some fragments) nor the features support multi-byte characters in the content (you can read about them in https://mathiasbynens.be/notes/javascript-unicode).

This means that if we insert e.g. '\u{1F4A9}' into the model and press backspace after it, only half of that character will be deleted.

The model features CharacterProxy class but other pieces of code don't expect the character to be of length 2. The same knowledge will need to be spread in the view.
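To make the problem concrete, here is a minimal sketch on plain strings (not the engine's model). `removeLastCodePoint` is a hypothetical helper used only for illustration.

```javascript
// U+1F4A9 lies outside the Basic Multilingual Plane, so JavaScript stores it
// as two UTF-16 code units (a surrogate pair).
const text = 'a\u{1F4A9}';

console.log( text.length );        // number of code units
console.log( [ ...text ].length ); // number of code points

// Naive backspace: drop the last code unit. This leaves a lone high surrogate behind.
const naive = text.slice( 0, -1 );

// Code-point-aware backspace (hypothetical helper): drop the last code point instead.
function removeLastCodePoint( str ) {
	return [ ...str ].slice( 0, -1 ).join( '' );
}

const fixed = removeLastCodePoint( text );
```

String iteration (`[ ...str ]`, `for...of`) walks code points, which is why the second variant removes both halves of the pair.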

@Reinmar
Member Author

Reinmar commented Jul 1, 2016

In iteration 2 I would check what kind of changes this feature would require and eventually move it to iteration 3 (which is about stability of the code and missing features).

@scofalik
Contributor

scofalik commented Jul 4, 2016

Edit: Solutions described in this post are out-dated and proven to be flawed: https://github.com/ckeditor/ckeditor5-engine/issues/478#issuecomment-235916089.

Looking at this problem, I can see four groups/types of operations:

  • operations done directly on the string,
  • operations using engine.model.Position and engine.view.Position,
  • mapping engine.model.Position to engine.view.Position,
  • converting selection from DOM to model (after user manually changes it).

We cannot do anything about the first one, because we can't force developers to write correct code. So it's up to a developer to know about the existence of Unicode and the problems with it. Unless, of course, we introduce our own String class, but I think that might be going too far and it will probably never be perfect.

As far as the second point is concerned, we need to ensure a couple of things:

  • engine.model.Text and engine.model.TextProxy have to have correct "length". (Most probably it will be represented just as Text#startOffset and Text#endOffset but additional Text#length property might be introduced. What is important is that Text with #data equal to 💩💩💩 should have "length" equal to 3).
  • engine.model.Position offset should be correctly placed/translated for unicode #data. So for 💩💩💩, if offset is 1 it should be right after the first 💩. What it really means is that any methods operating on offsets should take this into consideration. This mostly includes inserting and removing nodes at a given position, walking through the model and iterating on ranges (correctly creating TextProxy), etc. Maybe it will be just a matter of the Text and TextProxy classes, but I can't guarantee it without deeper research. One of the problems is with the deleteContents method, because it moves the selection by only one offset and it might end up "between" parts of a unicode character. But if we guarantee that Position is aware of unicode, that will not be a problem. Almost all operations on the model require you to have some position or transform a position, so it will fix most of the problems.
  • similar set of changes will be required for engine.view.Position, engine.view.Text and engine.view.TextProxy.
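Purely to illustrate the offset semantics described in the list above (the post itself is marked as out-dated), here is a rough sketch on plain strings. `codePointLength` and `codePointOffsetToIndex` are hypothetical names, not engine API, and they only account for surrogate pairs, not for combining marks.

```javascript
// Hypothetical helper: "length" counted in code points instead of UTF-16 code units.
function codePointLength( data ) {
	let length = 0;

	for ( const symbol of data ) {
		length++;
	}

	return length;
}

// Hypothetical helper: translate a code-point offset into an index in the raw string.
function codePointOffsetToIndex( data, offset ) {
	let index = 0;
	let seen = 0;

	for ( const symbol of data ) {
		if ( seen === offset ) {
			return index;
		}

		// A symbol is 1 or 2 code units long.
		index += symbol.length;
		seen++;
	}

	if ( seen === offset ) {
		return index;
	}

	throw new RangeError( 'Offset is outside of the text.' );
}
```

With these, offset 1 in 💩💩💩 maps to string index 2, i.e. right after the first 💩.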

Mapping between model and view should not be that difficult as long as entities in model and view act similarly. After all, mapping is mostly about mapping Elements and counting text lengths.

Then we have "from DOM" conversion, but I can't really say much about it as I haven't been doing much in this part of code.

@scofalik
Contributor

scofalik commented Jul 4, 2016

Another subject is what we count as "one" character. This is, in fact, a difficult matter. It's easy to say that a single unicode character should be treated as one, atomic character in our model. So, 💩 is one character. Setting a caret after it and clicking backspace should remove 💩 and not leave any artifacts. In a string I ^💘 CKEditor, if caret is at ^, clicking right arrow key should jump over the heart symbol. This is fine and easy, as those symbols are single unicode characters.

What about "signs" created from "basic" characters and "combining marks", like: q̣̇? Chrome treats them differently, depending on context:

  • when caret is before a character, clicking right arrow key moves caret after the sign,
  • when caret is after a character, clicking backspace key removes one of "combining marks",
  • when caret is before a character, clicking delete key removes whole "sign" together with "combining marks".

It makes perfect sense - you can't have "combining marks" without a "basic" character, so putting selection between them or removing just the "basic" character doesn't make any sense. But if you composed this character somehow, using backspace to "go back" in this composition makes perfect sense.

If we implement unicode handling as I described above, it will be impossible to remove a single "combining mark" after clicking backspace. This "sign" will be treated as one by model. Unless we make a special case in deleteContents.

The article @Reinmar linked provides a regexp to filter out combining marks. We could use it to prepare a kind of map in Text, that will map "basic" characters and "multibyte" characters to index in string so that high level methods like insertChildren or setting position works correctly. Check out function countSymbolsIgnoringCombiningMarks in linked article.

Let's research a bit further. [...'q̣̇'] will result in an array containing 3 characters (as opposed to [...'💩'] which will return an array with just one character). Which makes perfect sense. But what about நிலைக்கு? These are 8 unicode characters that look like 5 signs; in fact this is a composition of 4 basic characters and 4 add-on characters - or at least this is how I understand it:

for ( let l of 'நிலைக்கு' ) console.log( l );
ந
 ி
 ல
 ை
 க
 ்
 க
 ு

But browser treats it like 3 characters (try selection or putting position inside this string). It's difficult for us because we don't know the language so we have difficulty saying what are correct positions. But in the article @Reinmar linked there is that:

Accounting for other types of grapheme clusters

The above algorithm is still an oversimplification — it fails for grapheme clusters such as நி (U+0BA8 TAMIL LETTER NA + U+0BBF TAMIL VOWEL SIGN I), or Hangul made of conjoining Jamo such as 깍 (ᄁ + ᅡ + ᆨ), or other similar symbols.

Unicode Standard Annex #29 describes an algorithm for determining grapheme cluster boundaries. For a completely accurate solution that works for all Unicode scripts, implement this algorithm in JavaScript, and then count each grapheme cluster as a single symbol.

Linking to: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

I guess if we want a perfect unicode handling we will have to look at the spec. Unfortunately, regexp that I mentioned earlier does not filter out ி ை ் ு so the நிலைக்கு string is seen as 8 characters.
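A rough sketch of that limitation, in the spirit of the article's countSymbolsIgnoringCombiningMarks but not a copy of it. The combining-mark ranges are the ones quoted elsewhere in this thread; treat them as an assumption rather than a complete list.

```javascript
// Assumed combining-mark ranges (as quoted elsewhere in this thread).
const combiningMarks = /[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g;

// Strip the combining marks, then count what is left in code points.
function countSymbolsIgnoringCombiningMarks( str ) {
	const stripped = str.replace( combiningMarks, '' );

	return [ ...stripped ].length;
}

// "q" + U+0323 + U+0307: both marks are inside the ranges, so this counts as one symbol.
const latin = countSymbolsIgnoringCombiningMarks( 'q\u0323\u0307' );

// The Tamil string from above: its vowel signs and virama are NOT inside these ranges,
// so every code point is still counted separately.
const tamil = countSymbolsIgnoringCombiningMarks( '\u0BA8\u0BBF\u0BB2\u0BC8\u0B95\u0BCD\u0B95\u0BC1' );
```

So the regexp approach is enough for Latin-style combining marks, but it says nothing about grapheme clusters in other scripts.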

So, we either implement an algorithm according to the spec (the person doing it should get a week of holidays afterwards), or try to find a lib/tool/script that already handles this and can be suited to our needs.

@scofalik
Contributor

scofalik commented Jul 4, 2016

After some quick research I can see that there are some libraries which do that, albeit heavy ones, mostly because they hardcode unicode ranges of different character groups. Using such a library should not create efficiency problems since we would use it mostly when creating Text objects and then just keep the generated index-to-grapheme map and operate on it.
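As an illustration of what such an index-to-grapheme map could look like, here is a sketch that uses the built-in Intl.Segmenter instead of a library. This is a swapped-in technique: it assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js), which was not an option when this comment was written, and `graphemeBoundaries` is a hypothetical helper.

```javascript
// Assumption: Intl.Segmenter is available in the runtime.
const segmenter = new Intl.Segmenter( undefined, { granularity: 'grapheme' } );

// Hypothetical helper: returns the code-unit indexes at which graphemes start,
// followed by the end of the string. It could be computed once per text node and cached.
function graphemeBoundaries( data ) {
	const boundaries = [];

	for ( const { index } of segmenter.segment( data ) ) {
		boundaries.push( index );
	}

	boundaries.push( data.length );

	return boundaries;
}

const boundaries = graphemeBoundaries( 'a\u{1F4A9}b' );
```

Where the boundaries fall for complex scripts depends on the Unicode data bundled with the runtime, so results for strings like the Tamil one may differ between environments.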

@scofalik scofalik self-assigned this Jul 21, 2016
@scofalik
Contributor

scofalik commented Jul 21, 2016

Some research on how browsers handle this.

String अनुच्छेद.
It's made of अ + न + ु + च + ् + छ + े + द.
It's 5 base characters + 3 combining marks.
For me personally it looks like 4 characters. Looks like the 3rd and 4th characters are merged together.

Chrome/Opera: sees it as 4 graphemes अ नु च्छे द
Firefox: sees it as 5 graphemes अ नु च् छे द
Edge/Word2013: sees it as 4 graphemes

String நிலைக்கு.
It's made of ந + ி + ல + ை + க + ் + க + ு.
It's 4 base characters + 4 combining marks.
For me it looks like 5 characters, because one of the combining marks is very big.

Chrome/Opera: sees it as 3 graphemes நி லை க்கு
Firefox: sees it as 4 graphemes நி லை க் கு
Edge/Word2013: sees it as 4 graphemes

From this quick research you can see that Chrome groups characters "tightly", while Firefox goes for base character + combining marks. Edge is inconsistent in this manner.

Unfortunately there is no way for me to guess which is correct. I.e. spec could change and some browsers may be behind.

From what I was able to read, graphemes are not just about base character + combining marks. Still, the library that I've found acts like Firefox. I wonder if it implements spec correctly or is outdated/incorrect.

PS. Bonus round. Test what happens when you remove/add a space between those two: च् छे (you can check in console). These are the 3rd and 4th characters from the first string, where the inconsistency between browsers happens. Check in Chrome and Firefox.

PPS. It seems that the spec was updated on 2016-06-20. The lib's last commit is 1 year old and it is based on a different, 2-year-old lib. The latter one implements Unicode 7.0 while the current version is 9.0 (see spec).

@Reinmar
Member Author

Reinmar commented Jul 21, 2016

I'll ask a question without reading all that, so I'm sorry if it doesn't make any sense :D.

Do we have to care about all the details? Can we, somehow, base on browser implementations. Like, keep in one of our characters only the piece that the browser inserted on one user action, so when the user undoes those actions, they are reverted in the same manner?

I'm afraid I know the answer, which is: we could, but then we still need to know between which characters you can place the caret and which pieces make up one character, so OT and related algorithms know how to process all that.

Second topic – I think that we don't have to have a full support for all that from the day one. I guess that most apps get it wrong (if even browsers behave inconsistently). We don't even know if CKEditor 4 supported it all correctly, so there could be no regression. What's important now is to have the most important engine changes that will be required to have full(er) support in the future.

@scofalik
Contributor

scofalik commented Jul 21, 2016

> Second topic – I think that we don't have to have a full support for all that from the day one.

Looking at how browsers behave also made me feel that we don't have to be perfect here.

> Do we have to care about all the details? Can we, somehow, base on browser implementations.

I will write off the top of my head, so don't take those things for granted. I am more assuming than being sure.

The most important thing is that Positions would have to behave correctly. I am worried that trying to use the browser would be more complicated than finding a good algorithm/lib/tool and using it.

I mean, the problem is that I don't know how those characters are composed. If we could be sure that whenever the user makes an action (presses a key) it creates one grapheme, we could use some browser events, try to "save" what was inserted into DOM, etc. etc. But we know it does not work like this. Heck, I don't even know what languages these are, I found those strings as examples on websites related to the problem.

But I suppose you compose those like with Chinese or Korean IMEs, so actually every keydown changes the composition. A composition may probably span more graphemes. And what will happen after someone pastes in a big block of text?

EDIT: I updated my research with the Word2013 interpretation. It behaves like Edge, who would have guessed?
EDIT2: Google Docs on each browser works like in Chrome.

@fredck
Contributor

fredck commented Jul 22, 2016

// Not sure if this is an issue, but leaving it here just in case.

Mind the differences on browsers. We may have the content shared among users with different browsers so all positioning and OT tasks should work on independent, normalized and consistent data.

@scofalik
Contributor

That's another good point.

@Reinmar
Member Author

Reinmar commented Jul 27, 2016

Comparison of deleting characters and graphemes (backspace vs forward delete):

[Image: Comparison of delete and forward delete behaviour with graphemes]

@Reinmar
Member Author

Reinmar commented Jul 27, 2016

The biggest question that we need to answer is how many algorithms need to know about graphemes. The minimum that we know now consists of:

  • Forward delete which needs to delete whole graphemes.
  • Backspace which needs to delete single characters or whole graphemes (e.g. 💩 needs to be deleted as one grapheme, but as you can see above some characters are deleted separately).
  • Selection which should not be placed inside graphemes. However, most of the time it's the browser that tells us where to place the selection in the DOM, and it's hard to imagine features which will want to place selection in some arbitrary position within text (usually you get the selection that the browser created and work on it – that selection should be fine).
  • Algorithms which try to count characters, cut some text off, replace some text, etc. should usually operate on whole graphemes (but it's really hard to understand which features are that).
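The asymmetry between the first two points can be sketched on plain strings. This is only an approximation: "grapheme" is reduced here to a base code point plus the combining marks from an assumed set of ranges, the caret is assumed to sit on a code point boundary, and `backspace` / `forwardDelete` are hypothetical helpers, not engine API.

```javascript
// Assumed combining-mark ranges; not a full grapheme cluster implementation.
const combiningMark = /[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/;

// Backspace: remove one code point before the caret. That is a single combining mark,
// or both halves of a surrogate pair (e.g. an emoji).
function backspace( text, caret ) {
	const before = [ ...text.slice( 0, caret ) ];

	before.pop();

	return before.join( '' ) + text.slice( caret );
}

// Forward delete: remove the code point after the caret together with
// all combining marks that follow it.
function forwardDelete( text, caret ) {
	const after = [ ...text.slice( caret ) ];

	after.shift();

	while ( after.length && combiningMark.test( after[ 0 ] ) ) {
		after.shift();
	}

	return text.slice( 0, caret ) + after.join( '' );
}
```

For 'q\u0323\u0307' backspace at the end removes only the last mark, while forward delete at the start removes the whole sign.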

The second thing is that we miss experience with some languages which really need to use multi-byte characters. Most (if not all) languages fit into one-byte characters (but we need to remember about normalising characters because some graphemes can be represented as two chars or one). The only real use case we know is emojis (💩) which is actually pretty important nowadays.

I feel that languages which use multi-byte chars are super rare (edit: I've been so wrong) and a significant amount of software doesn't support them correctly and their users are used to that. I'm afraid that even if we'll get something wrong, we won't have bug reports... So I'd still do some research to find whether there are any languages we need to worry about and maybe the community will know more.

Regardless of what we'll find about those special languages we need to still worry about:

  1. Normalising input so input data which consists of unnormalised characters is normalised.
  2. Handling special unicode chars like 💩.

Based on that we need to decide whether we want to make the text node API an interface over normal strings that exposes graphemes instead of raw data. Given how few bugs we could find so far (basically only deleting "💩" didn't work well), we may keep the raw API. But if we expect that more algorithms and developers need to know about graphemes, then we should create a good interface which hides the raw data.

@Reinmar
Member Author

Reinmar commented Jul 27, 2016

According to http://info.lionbridge.com/rs/lionbridge/images/Lionbridge%20FAQ_encoding_2013.pdf

Multi-Byte The Asian languages — Chinese, Japanese, and Korean (CJK) — are
intrinsically different. Their character sets (meaning all the symbols needed to express the language) contain a subset that is less complex, including ASCII characters and punctuation marks. The subset requires one byte only. However, Asian languages also have a larger set of ideographic characters of Chinese origin — literally thousands of them. We need two or more bytes for representing such a great number of these complex characters. The term for mixing single-byte characters alongside two-or-more-byte characters is “multi-byte.”

That's interesting, because I think we checked some characters, that after normalization they fit within one byte. So we checked wrong ones :D.

@scofalik
Contributor

http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use

Here is a similar discussion.

Anyway, the problem is that some astral symbols are made of base characters and combining marks, while others are made of two characters forming a surrogate pair.

@scofalik
Contributor

scofalik commented Jul 27, 2016

To get some insight into how characters are built in Tamil, on Windows, you can download the Tamil language/keyboard. Then, some keys enter base characters, like வ ர or க while others enter combining marks like ி ு or ோ. b+a on this keyboard = வ ோ -> வோ.

These aren't normalized to one-character symbols.
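A quick way to check this in the console is String.prototype.normalize. The sketch below contrasts a Latin sequence that has a precomposed form with the Tamil pair mentioned above, which does not.

```javascript
// "e" + U+0301 COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
// so NFC normalization collapses it into a single code point.
const decomposed = 'e\u0301';
const composed = decomposed.normalize( 'NFC' );

// Tamil: U+0BB5 (base letter) + U+0BCB (vowel sign). There is no precomposed
// code point for this pair, so it stays two code points after normalization.
const tamil = '\u0BB5\u0BCB';
const tamilNfc = tamil.normalize( 'NFC' );

console.log( composed.length, tamilNfc.length );
```

So normalizing the input helps for cases like é, but it cannot replace grapheme-aware handling for scripts like Tamil.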

Maybe some own names in Hangul (Korean alphabet) also create symbols that are not normalized to one character?

@scofalik
Contributor

scofalik commented Jul 28, 2016

After making all changes mentioned in: https://github.com/ckeditor/ckeditor5-engine/issues/478#issuecomment-230246136 and writing a lot of tests, it appeared that the solution is flawed.

The solution assumed that Positions inside "graphemes" (i.e. between base character and combining mark) are incorrect. So I decided to make a nice, transparent API so developers using it would never know that they are dealing with complex stuff. Positions would always be between full symbols, offsets/offset sizes would match them, etc.

Unfortunately, we need Positions and Ranges between base characters and combining marks. They are needed for conversion when the user types combining marks one by one using e.g. a keyboard (or pastes them -- the bottom line is that the combining mark is inserted apart from the base character). For example, the user might write வ and then the combining mark ா.

After the user types the combining mark, the mutation is handled, a weakInsertDelta is created, ா is inserted into the model and it merges with வ into வா. The problem is that then, the text node in the model still has offsetSize equal to 1 and any correct Range has to include the base character. In other words we can't create a Range that has only the combining mark.

Unfortunately, to convert inserting from model to view, we need to pass a model.Range, spanning over nodes that have been inserted. But as was said, such Range cannot be created, meaning that we are unable to convert such insertion from model to view.

Similar problems will happen when removing characters using backspace -- if we want to handle native behavior, which is removing single combining marks not whole grapheme, we will need Ranges that contain only those combining marks.

Of course we can make hacks. However I propose to be safe with hacking -- last time we tried to make some magic with texts, we ended up in a two-week-long refactor. Hacking in this case would be difficult. modelWriter would have to recognize that the insertion has not changed Text#offsetSize and inform InsertOperation about it. In this case InsertOperation would have to inform Document#applyOperation that such a situation happened (and probably return more/different values). Then, Document#applyOperation would have to send two events instead of one: the first would have to be a "remove base character" event and the second an "insert base character and combining mark" event. For deleting, we would have to make an if for backward delete, and the typing feature would have to remove the whole grapheme and add it back without the combining mark.

Another solution is to abandon what was already written and tackle the problem from a different side. Let's keep model and view as they are and not try to fix too much. This way developers might run into some problems but as we discussed with @Reinmar and @pjasiun, such cases are very narrow and involve setting model.Position "by hand" or in a weird, arbitrary way. Most of the time the current selection is used or a position before/after a given node.

Instead of fixing how texts are handled in model and view, we will improve Selection API and modifySelection algorithm. That should be enough for a lot of cases, because by far most problems that we anticipate are connected with incorrect Selection placement. We will make it so the Document#selection cannot be placed in incorrect places (will throw).

For now, I will leave my work on a branch if we would have to get back to the original idea at some point.

Edit: to cheer up a bit, at least most of the time was put into theoretical research and some overall tests. Those tests can be ported and used in a different solution. Most (all?) of them will be green, if we don't do any magic in model and view, but it will be nice to have them anyway.

@Reinmar
Member Author

Reinmar commented Jul 28, 2016

If we wouldn't try we wouldn't know what's the best solution :). Now it's clear.

OFC, if we knew all that from the beginning we'd implement some things differently globally and I guess we'd manage to implement text nodes based on graphemes, but fortunately it's not a big difference for most developers. This case mostly concerns us, as it's mostly the base tools like modifySelection() what needs to precisely understand multi-byte chars.

Anyway, I think that with the warnings logged from the document selection if it was placed in incorrect places we'll secure the case significantly. After all, if someone operates on the model in an incorrect way sooner or later they may create an invalid selection as well.

@Reinmar
Member Author

Reinmar commented Jul 29, 2016

I've just stumbled upon w3c/editing#133 :D. May be worth scanning for some additional info and leaving some feedback if we can help.

@pjasiun

pjasiun commented Aug 2, 2016

We talked with @scofalik and there is one more problem with this issue. We need information about types of characters. That information is not stored in JavaScript, and it is quite a lot of data: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt We can use the external lib @scofalik found (https://github.com/orling/grapheme-splitter) but it will be a lot of data anyway (about 100 kB) which will not compress very well. Also, the benefit a Latin-language user gets from this code is not very big. It is not as if we did not support multi-byte characters at all without this code. We support typing anyway; we just, for instance, do not support deleting as well as we could with this additional 100 kB.

This is why I believe multi-byte character support should be a separate plugin which users can add if they want.

@scofalik
Contributor

scofalik commented Aug 2, 2016

It's important to note that characters being halves of surrogate pairs are easy to distinguish, as they are all stored in one range. So selecting/removing a whole pile of poos or other emojis will work. Same for characters that have their normalized code point in unicode, like é.
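A minimal sketch of that check. The surrogate ranges (U+D800..U+DBFF for high halves, U+DC00..U+DFFF for low halves) come from the Unicode standard; the function names are hypothetical and the code works on plain strings.

```javascript
// High (leading) surrogates occupy U+D800..U+DBFF.
function isHighSurrogate( codeUnit ) {
	return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

// Low (trailing) surrogates occupy U+DC00..U+DFFF.
function isLowSurrogate( codeUnit ) {
	return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

// Hypothetical helper: true if `offset` (in code units) falls between
// the two halves of a surrogate pair.
function isInsideSurrogatePair( text, offset ) {
	return offset > 0 &&
		offset < text.length &&
		isHighSurrogate( text.charCodeAt( offset - 1 ) ) &&
		isLowSurrogate( text.charCodeAt( offset ) );
}
```

No Unicode data tables are needed for this part, which is why it can live in the engine without the extra weight of a grapheme library.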

What we are losing is checking whether we are inside grapheme or symbol combined of base character and combining mark. Those will be removed separately without the plugin/library. In this case, forward deletion might have unexpected results, and it will be possible to put document selection in "weird" positions. Still, as @pjasiun said, most people will not expect such things working properly, so, at least for now, we should introduce this as optional plugin.

@scofalik
Contributor

scofalik commented Aug 2, 2016

There is one issue with this approach, though.

modifySelection expects a data object with a unit property, specifying how the selection should be expanded. In the "all-in-engine" approach, this was easy. I improved modifySelection and added a new flag, so there is a character or codePoint unit. character is the default and most commonly used unit. This is also the one that takes care of not placing the selection inside a grapheme. What's more, very core features (Delete, in this case) use both the character and codePoint flags: one when the del key is pressed, the other when backspace is pressed.

If I extract the special behavior for character unit, weird thing happens. Why? I still need to keep description in docs, that there are two flags available. Then, I have to explain what is the difference. The problem is that the default flag does not work as described without the external plugin (holding the library and support for graphemes)...

So, I don't really know what would be an elegant way to describe all this in docs.

@pjasiun

pjasiun commented Aug 2, 2016

There is nothing wrong in creating a flag in one plugin and handling it in another. The flag is part of the API, it says what type of data/action it is. It does not mean that there is a plugin which will handle this type of data/action. Think about pasting: one plugin will tell what is a data type, but it does not mean that there is a plugin which will handle such data type.

@scofalik
Contributor

scofalik commented Aug 2, 2016

The solution for this issue would be to introduce partial support in the engine for the character unit. How would it work? It would recognize combining marks but not graphemes. This can be done with a regexp only, as combining marks (as of now) are several unicode ranges:

/[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/

We achieve two nice things:

  • We would have even better "out-of-the-box" support, which means that less people would need external plugin,
  • There would be a difference between character and codePoint and it would make sense to document it. Still, in documentation, we could mention that there is a plugin that enhances this behavior.
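A minimal sketch of how the two units could differ when moving forward through a plain string. `nextOffset` is a hypothetical function, not the real modifySelection, and the regexp is the one given above.

```javascript
const combiningMark = /[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/;

// Hypothetical helper: returns the offset (in code units) one step forward.
// unit 'codePoint' - step over one code point (never splits a surrogate pair).
// unit 'character' - additionally step over the combining marks that follow.
function nextOffset( text, offset, unit = 'character' ) {
	if ( offset >= text.length ) {
		return text.length;
	}

	// Code points above U+FFFF take two code units.
	let next = offset + ( text.codePointAt( offset ) > 0xFFFF ? 2 : 1 );

	if ( unit === 'character' ) {
		while ( next < text.length && combiningMark.test( text.charAt( next ) ) ) {
			next++;
		}
	}

	return next;
}
```

For 'q\u0323\u0307x' starting at 0, the codePoint unit stops after "q" while the character unit stops after both marks. Grapheme clusters that are not covered by the regexp (e.g. Tamil or Devanagari) would still be split, which is where the optional plugin would come in.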

@mlewand mlewand transferred this issue from ckeditor/ckeditor5-engine Oct 9, 2019
@mlewand mlewand added this to the iteration 2 milestone Oct 9, 2019
@mlewand mlewand added status:confirmed type:feature This issue reports a feature request (an idea for a new functionality or a missing option). package:engine labels Oct 9, 2019