[Unicode] Unicode combining characters not kept with base character #31
Comments
There is! The Unicode tables, specifically the Canonical_Combining_Class field of UnicodeData.txt, have this information, available via the Unicode R package:
> install.packages("Unicode")
> library(Unicode)
> u_char_properties(utf8ToInt("Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"), "Canonical_Combining_Class")
Canonical_Combining_Class
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 230
21 0
22 0
23 220
24 0
25 220
>
A value of 0 marks ordinary characters (letters, spaces, punctuation); the nonzero values mark combining marks. Details here: https://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values Alternatively, one could check the general category, as u_char_properties(utf8ToInt("Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa #"), "General_Category"). I'm unsure whether there is another, more high-level R package, but this is the base data it would use too.
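For readers without the R package at hand, the same lookup can be sketched with Python's standard-library `unicodedata` module. This is only an illustration of the property check described above, not the code used in the thread; the helper name `combining_info` is made up for this example.

```python
import unicodedata

# Same test string as in the R session above.
text = "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"

def combining_info(s):
    """Return (1-based index, codepoint, combining class, general category) per character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.combining(ch), unicodedata.category(ch))
        for i, ch in enumerate(s, start=1)
    ]

# Keep only the characters with a nonzero Canonical_Combining_Class.
marks = [row for row in combining_info(text) if row[2] != 0]
for row in marks:
    print(row)
```

The rows that remain are positions 20, 23 and 25, matching the 230 and 220 entries in the R output; all three have General_Category `Mn` (nonspacing mark).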
Oh, and for reference: my screenshot was taken on X11, and the plain text looks like ragg/the correct rendering.
One possible, cheap solution that I could find is to put zero-width glyphs together with the preceding glyph. I don't know whether that has any drawbacks at the moment. Because a space precedes the first
Yes, the extra U+05AA was an error; that was how I discovered the RTL issues :-) That approach may have some drawbacks. This table (ignore the tailored clusters) has some nice examples, particularly Hangul (각): https://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters There are some libraries for this algorithm, and for R integration I am currently investigating utf8proc or maybe something in the style of https://github.com/foliojs/grapheme-breaker for a pure R implement
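To make the drawback concrete, here is a rough Python sketch of the "attach marks to the preceding glyph" idea. It is an approximation: it uses General_Category `M*` as a stand-in for "zero-width glyph", it is not the UAX #29 algorithm, and `naive_clusters` is a hypothetical helper, not part of any library named in this thread.

```python
import unicodedata

def naive_clusters(s):
    """Attach every combining mark (General_Category M*) to the preceding character.

    Deliberately simplistic: real grapheme cluster segmentation (UAX #29)
    has more rules, for example for Hangul jamo sequences.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # glue the mark onto the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# The decomposed "e" + U+0302 stays together.
print(naive_clusters("e\u0302"))

# A leading mark attaches to whatever precedes it, here a space.
print(naive_clusters(" \u05aa\u05d0\u05aa"))

# Decomposed Hangul jamo are letters (Lo), not marks, so they are not grouped.
hangul = "\u1100\u1161\u11a8"
print(len(naive_clusters(hangul)))
```

The last call reports three clusters, whereas the linked UAX #29 table treats such a jamo sequence as one grapheme cluster. That gap is the kind of case a full segmentation library would cover.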
This should now also be fixed, so I'll be closing this.
Code
Expected
Text should show "Composed: ê, DeC: ê, ֪א֪" on a line
Actual
Explanation
The first "exciting" character is the precomposed form U+00EA, which works just fine and appears as I'd expect. The next one, after DeC[omposed], is the standard "e" followed by the combining circumflex U+0302. The hat is not rotated and not placed correctly; instead it appears inside the e. For this specific case the problem could be solved with NFC or NFKC, but that only sidesteps it, as there is no single-codepoint form of, for example, p̂. The Hebrew Alef U+05D0 is combined with the Hebrew accent U+05AA, which should sit below the letter, but the accents are instead shown in "naked" form because they are rendered separately.
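The point about normalization only sidestepping the problem can be checked with a short Python sketch, using the standard-library `unicodedata` module. The helper `nfc_length` is made up for this illustration.

```python
import unicodedata

def nfc_length(s):
    """Number of codepoints left after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))

# "e" + U+0302 has a precomposed form, so NFC collapses it to U+00EA.
composed = unicodedata.normalize("NFC", "e\u0302")
print(composed == "\u00ea", nfc_length("e\u0302"))

# "p" + U+0302 has no precomposed form, so the combining mark survives NFC.
print(nfc_length("p\u0302"))
```

So a renderer that normalizes its input first would still have to position U+0302 over the "p" itself.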