Multi-byte characters support #3754
In iteration 2 I would check what kind of changes this feature would require and possibly move it to iteration 3 (which is about code stability and missing features).
Edit: The solutions described in this post are outdated and proven to be flawed: https://github.com/ckeditor/ckeditor5-engine/issues/478#issuecomment-235916089.

Looking at this problem, I can see four groups/types of operations:
We cannot do anything about the first one, because we can't force developers to write correct code. So it's up to the developer to know about the existence of Unicode and the problems that come with it. Unless, of course, we introduce our own...

As far as the second point is concerned, we need to ensure a couple of things:
Mapping between model and view should not be that difficult as long as entities in the model and the view act similarly. After all, mapping is mostly about mapping...

Then we have "from DOM" conversion, but I can't really say much about it, as I haven't been doing much in this part of the code.
Another subject is what we count as "one" character. This is, in fact, a difficult matter. It's easy to say that a single Unicode character should be treated as one atomic character in our model.

What about "signs" created from "basic" characters and "combining marks", like:
It makes perfect sense: you can't have "combining marks" without a "basic" character, so putting the selection between them or removing just the "basic" character doesn't make any sense. But if you composed this character somehow, using backspace to "go back" in that composition makes perfect sense. If we implement Unicode handling as I described above, it will be impossible to remove a single "combining mark" after pressing backspace. This "sign" will be treated as one by the model, unless we make a special case in...

The article @Reinmar linked provides a regexp to filter out combining marks. We could use it to prepare a kind of map in...

Let's research a bit further.
But the browser treats it like 3 characters (try selecting or putting the position inside this string). It's difficult for us because we don't know the language, so we have trouble saying what the correct positions are. But in the article @Reinmar linked there is this:
Linking to: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

I guess if we want perfect Unicode handling, we will have to look at the spec. Unfortunately, the regexp that I mentioned earlier does not filter out such characters. So, we either implement an algorithm according to the spec (the person doing it should get a week of holidays afterwards), or try to find a lib/tool/script that already handles this and can be suited to our needs.
After some quick research I can see that there are some libraries which do that, albeit heavy ones, mostly because they hardcode the Unicode ranges of different character groups. Using such a library should not create efficiency problems, since we would use it mostly when creating...
Some research on how browsers handle this.

- String 1: Chrome/Opera sees it as 4 graphemes.
- String 2: Chrome/Opera sees it as 3 graphemes.

From this quick research you can see that Chrome groups characters "tightly", while Firefox goes for base character + combining marks. Edge is inconsistent in this manner. Unfortunately, there is no way for me to guess which is correct, i.e. the spec could change and some browsers may be behind. From what I was able to read, graphemes are not just about base character + combining marks. Still, the library that I've found acts like Firefox. I wonder if it implements the spec correctly or is outdated/incorrect.

PS. Bonus round: tests of what happens when you remove/add a space between those two:

PPS. It seems that the spec was updated on 2016-06-20. The lib's last commit is a year old and it is based on a different, two-year-old lib. The latter one implements...
I'll ask a question without reading all that, so I'm sorry if it doesn't make any sense :D. Do we have to care about all the details? Can we somehow rely on browser implementations? Like, keep in one of our characters only the piece that the browser inserted on one user action, so when the user undoes those actions, they are reverted in the same manner? I'm afraid I know the answer, which is: we could, but then we still need to know between which characters you can place the caret and which pieces make up one character, so that OT and related algorithms know how to process all that.

Second topic: I think that we don't have to have full support for all of that from day one. I guess that most apps get it wrong (if even browsers behave inconsistently). We don't even know if CKEditor 4 supported it all correctly, so there could be no regression. What's important now is to make the most important engine changes that will be required to have full(er) support in the future.
Looking at how browsers behave also made me feel that we don't have to be perfect here.
I will write off the top of my head, so don't take these things for granted. I am more assuming than being sure. The most important thing is that... I mean, the problem is that I don't know how those characters are composed. If we could be sure that whenever a user makes an action (presses a key) it creates one grapheme, we could use some browser events, try to "save" what was inserted into the DOM, etc. But we know it does not work like this. Heck, I don't even know what languages these are; I found those strings as examples on websites related to the problem. But I suppose you compose them like with Chinese or Korean IMEs, so actually every keydown changes the composition. A composition may probably span more than one grapheme. And what will happen when someone pastes in a big block of text?

EDIT: I updated my research with the Word 2013 interpretation. It behaves like Edge, who would have guessed?
Not sure if this is an issue, but leaving it here just in case. Mind the differences between browsers. We may have content shared among users with different browsers, so all positioning and OT tasks should work on independent, normalized and consistent data.
That's another good point.
The biggest question that we need to answer is how many algorithms need to know about graphemes. The minimum that we know now consists of:
The second thing is that we miss experience with some languages which really need to use multi-byte characters. Most (if not all) languages fit into one-byte characters (but we need to remember about normalising characters because some graphemes can be represented as two chars or one). The only real use case we know is emojis (💩) which is actually pretty important nowadays.
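The normalization point above can be demonstrated in a few lines. This is a minimal sketch using the standard `String#normalize()`; the "é" example is mine, not from the thread:

```javascript
// "é" can be stored as one precomposed code point (U+00E9) or as
// "e" + U+0301 COMBINING ACUTE ACCENT -- two code points that render identically.
const precomposed = '\u00E9'; // é
const decomposed = 'e\u0301'; // e + combining acute

console.log( precomposed.length ); // 1
console.log( decomposed.length );  // 2

// The two strings are not equal as raw data...
console.log( precomposed === decomposed ); // false

// ...but normalizing to NFC (canonical composition) makes them equal.
console.log( precomposed === decomposed.normalize( 'NFC' ) ); // true
```

This is why character comparison and position handling need a normalization step before anything else.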
Regardless of what we'll find about those special languages we need to still worry about:
Based on that, we need to decide whether we want to make the text node API an interface over normal strings that exposes graphemes instead of raw data. Given how few bugs we have found so far (basically only deleting "💩" didn't work well), we may keep the raw API. But if we expect that more algorithms and developers will need to know about graphemes, then we should create a good interface which hides the raw data.
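To make the trade-off concrete, here is a rough sketch of what such a grapheme-exposing interface could look like. `GraphemeText` and its methods are hypothetical names for illustration, not actual engine API, and the segmentation here is a naive per-code-point one rather than real UAX #29 segmentation:

```javascript
// Hypothetical sketch -- not actual CKEditor 5 API. A wrapper that stores raw
// UTF-16 data but exposes offsets only at valid boundaries.
class GraphemeText {
	constructor( data ) {
		this._data = data;
		// Naive segmentation by code point; a real implementation would apply
		// the UAX #29 grapheme cluster rules (or a library) instead of Array.from().
		this._graphemes = Array.from( data );
	}

	// Length in user-perceived units, not in UTF-16 code units.
	get length() {
		return this._graphemes.length;
	}

	graphemeAt( index ) {
		return this._graphemes[ index ];
	}

	// Maps a grapheme index to the raw string offset, so positions
	// can never end up between the halves of a surrogate pair.
	rawOffsetAt( index ) {
		return this._graphemes.slice( 0, index ).join( '' ).length;
	}
}

const text = new GraphemeText( 'a\u{1F4A9}b' );
console.log( text.length );           // 3 (String#length would say 4)
console.log( text.rawOffsetAt( 2 ) ); // 3 -- the emoji occupies raw offsets 1-2
```

The cost is that every algorithm touching text has to go through this mapping layer, which is exactly the "good interface which hides the raw data" question above.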
According to http://info.lionbridge.com/rs/lionbridge/images/Lionbridge%20FAQ_encoding_2013.pdf
That's interesting, because I think the characters we checked fit within one byte after normalization. So we checked the wrong ones :D.
Here is a similar discussion. Anyway, the problem is that some astral symbols are made of base characters and combining marks, while others are made of two code units forming a surrogate pair.
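The surrogate-pair case is the easy one to detect in code, since the two halves always sit in fixed, well-known ranges. A minimal sketch (`isHighSurrogate`/`isLowSurrogate` are illustrative helper names):

```javascript
// Halves of surrogate pairs always fall in fixed UTF-16 ranges,
// so they can be detected without any character database.
function isHighSurrogate( code ) {
	return code >= 0xD800 && code <= 0xDBFF;
}

function isLowSurrogate( code ) {
	return code >= 0xDC00 && code <= 0xDFFF;
}

const poo = '\u{1F4A9}'; // U+1F4A9 PILE OF POO

console.log( poo.length ); // 2 -- one astral symbol, two UTF-16 code units
console.log( isHighSurrogate( poo.charCodeAt( 0 ) ) ); // true
console.log( isLowSurrogate( poo.charCodeAt( 1 ) ) );  // true
```

Base character + combining mark sequences are the hard case, because nothing in the code units themselves says they belong together.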
To get some insight into how characters are built in Tamil, on Windows you can download the Tamil language/keyboard. Then some keys enter base characters, like வ ர or க, while others enter combining marks like ி ு or ோ. b+a on this keyboard = வ + ோ -> வோ. These aren't normalized to one-character symbols. Maybe some proper names in Hangul (the Korean alphabet) also create symbols that are not normalized to one character?
After making all the changes mentioned in https://github.com/ckeditor/ckeditor5-engine/issues/478#issuecomment-230246136 and writing a lot of tests, it appeared that the solution is flawed. The solution assumed that... Unfortunately, we need... After the user types a combining mark, the mutation is handled... Unfortunately, to convert inserting...

Similar problems will happen when removing characters using backspace: if we want to handle the native behavior, which is removing single combining marks rather than whole graphemes, we will need...

Of course, we can make hacks. However, I propose that we be careful with hacking: the last time we tried to do some magic with texts, we ended up in a two-week-long refactor. Hacking in this case would be difficult.

Another solution is to abandon what was already written and tackle the problem from a different side. Let's keep the model and view as they are and not try to fix too much. This way developers might run into some problems, but as we discussed with @Reinmar and @pjasiun, such cases are very narrow and involve setting...

Instead of fixing how texts are handled in the model and view, we will improve...

For now, I will leave my work on a branch in case we have to get back to the original idea at some point.

Edit: to cheer up a bit, at least most of the time was put into theoretical research and some overall tests. Those tests can be ported and used in a different solution. Most (all?) of them will be green if we don't do any magic in the model and view, but it will be nice to have them anyway.
If we didn't try, we wouldn't know what the best solution is :). Now it's clear. OFC, if we had known all that from the beginning, we'd have implemented some things differently globally and I guess we'd have managed to implement text nodes based on graphemes, but fortunately it's not a big difference for most developers. This case mostly concerns us, as it's mostly the base tools like...

Anyway, I think that with warnings logged from the document selection when it is placed in incorrect positions, we'll secure the case significantly. After all, if someone operates on the model in an incorrect way, sooner or later they may create an invalid selection as well.
I've just stumbled upon w3c/editing#133 :D. May be worth scanning for some additional info and leaving some feedback if we can help. |
We talked with @scofalik and there is one more problem with this issue. We need information about the types of characters. That information is not available in JavaScript, and it is quite a lot of data: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt

We can use the external lib @scofalik found (https://github.com/orling/grapheme-splitter), but it will be a lot of data anyway (about 100 kB) which will not compress very well. Also, the benefit a Latin-script user gets from this code is not very big. It's not that we don't support multi-byte characters at all without this code: typing works anyway; we just, for instance, don't support deleting as well as we could with this additional 100 kB. This is why I believe multi-byte character support should be a separate plugin which the user can add if they want.
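As a side note from today's perspective (this option did not exist when the thread was written): modern JavaScript engines now ship the grapheme break data themselves via `Intl.Segmenter`, which removes the ~100 kB payload argument. A sketch:

```javascript
// Intl.Segmenter (available in modern engines, e.g. Node.js 16+) performs
// UAX #29 grapheme cluster segmentation using data bundled with the engine.
const segmenter = new Intl.Segmenter( 'en', { granularity: 'grapheme' } );

function countGraphemes( str ) {
	return [ ...segmenter.segment( str ) ].length;
}

console.log( 'abc'.length, countGraphemes( 'abc' ) );             // 3 3
console.log( '\u{1F4A9}'.length, countGraphemes( '\u{1F4A9}' ) ); // 2 1
console.log( 'e\u0301'.length, countGraphemes( 'e\u0301' ) );     // 2 1
```

At the time of the discussion, a library like grapheme-splitter (or an optional plugin wrapping it) was the only way to get this data.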
It's important to note that characters which are halves of surrogate pairs are easy to distinguish, as they are all stored in one range. So selecting/removing a whole pile of poos or other emojis will work. The same goes for characters that have a normalized code point in Unicode, like é. What we are losing is checking whether we are inside a grapheme or a symbol composed of a base character and a combining mark. Those will be removed separately without the plugin/library. In this case, forward deletion might have unexpected results, and it will be possible to put the document selection in "weird" positions. Still, as @pjasiun said, most people will not expect such things to work properly, so, at least for now, we should introduce this as an optional plugin.
There is one issue with this approach, though.
If I extract the special behavior for... So, I don't really know what would be an elegant way to describe all this in the docs.
There is nothing wrong with creating a flag in one plugin and handling it in another. The flag is part of the API; it says what type of data/action it is. It does not mean that there is a plugin which will handle this type of data/action. Think about pasting: one plugin tells what the data type is, but it does not mean that there is a plugin which will handle that data type.
The solution for this issue would be to introduce partial support in the engine for combining marks, matched by:

/[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/

We achieve two nice things:
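For illustration, the combining-mark ranges from the proposal above can serve as a boundary check. This is a sketch only; `isValidCaretOffset` is a hypothetical helper name, not engine API:

```javascript
// The combining-mark ranges from the proposal, as a reusable check.
const combiningMark = /[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/;

// A position right before a combining mark is not a valid caret position:
// the mark belongs to the preceding base character.
function isValidCaretOffset( str, offset ) {
	return !combiningMark.test( str.charAt( offset ) );
}

const sample = 'e\u0301x'; // e + combining acute + x

console.log( isValidCaretOffset( sample, 0 ) ); // true  -- before "e"
console.log( isValidCaretOffset( sample, 1 ) ); // false -- inside the grapheme
console.log( isValidCaretOffset( sample, 2 ) ); // true  -- before "x"
```

As noted earlier in the thread, these ranges do not cover everything UAX #29 considers a grapheme cluster, which is exactly why this is only partial support.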
At the moment, neither the engine (except some fragments) nor the features support multi-byte characters in the content (you can read about them in https://mathiasbynens.be/notes/javascript-unicode).
This means that if we insert e.g. '\u{1F4A9}' into the model and press backspace after it, only half of that character will be deleted. The model features the CharacterProxy class, but other pieces of code don't expect a character to be of length 2. The same knowledge will need to be spread in the view.
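A minimal demonstration of the reported bug, assuming backspace naively removes one UTF-16 code unit:

```javascript
// Naive backspace handling on an astral symbol: removing the last UTF-16
// code unit leaves a lone (invalid) surrogate in the data.
const data = 'ab\u{1F4A9}';

console.log( data.length ); // 4, although the user sees 3 characters

const afterNaiveBackspace = data.slice( 0, -1 );

console.log( afterNaiveBackspace.length ); // 3
// The trailing code unit is now an unpaired high surrogate:
console.log( afterNaiveBackspace.charCodeAt( 2 ).toString( 16 ) ); // "d83d"
```

Any code that assumes one character == one code unit will exhibit this "half a character deleted" behavior.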