[Unicode] Unicode combining characters not kept with base character #31

Closed
byteit101 opened this issue Dec 8, 2021 · 6 comments

@byteit101

Code

library(ggplot2)
library(geomtextpath)  # provides geom_textpath()
ggplot(iris, aes(x = Sepal.Length)) +
  geom_textpath(aes(label = "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"), stat = "density", vjust = -0.2, size = 8, fontface = 1) +
  ylim(0.1, 0.5)

Expected

Text should show "Composed: ê, DeC: ê, ֪א֪" on a line

Actual

combiningbug

Explanation

The first "exciting" character is the precomposed form U+00EA, which works just fine, and is how I'd expect it to appear. The next one, after DeC[omposed], is the standard "e" followed by the combining circumflex U+0302. The circumflex is not rotated and not positioned correctly; it appears inside the e instead. For this specific case, this could be solved with NFC or NFKC normalization, but that only sidesteps the problem, since there is no single-codepoint form of, for example, p̂. The Hebrew Alef U+05D0 is combined with the Hebrew accent U+05AA, which should appear below the letter, but the accents are instead shown in "naked" form because they are rendered separately.
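
To make the normalization point concrete, here is a minimal base-R sketch that counts code points. The helper name `n_codepoints` is made up for illustration, and the optional `stringi::stri_trans_nfc()` call is an assumption on my part (it is not used anywhere in this report) that only runs if stringi is installed:

```r
composed   <- "\u00ea"    # precomposed e with circumflex: one code point
decomposed <- "e\u0302"   # "e" followed by U+0302 COMBINING CIRCUMFLEX ACCENT
p_hat      <- "p\u0302"   # "p" + U+0302: Unicode has no precomposed form for this

# Hypothetical helper: number of code points in a single UTF-8 string
n_codepoints <- function(x) length(utf8ToInt(x))

print(n_codepoints(composed))
print(n_codepoints(decomposed))
print(n_codepoints(p_hat))

# Optional: NFC can fold "e" + U+0302 into U+00EA, but it cannot help p + U+0302
if (requireNamespace("stringi", quietly = TRUE)) {
  print(n_codepoints(stringi::stri_trans_nfc(decomposed)))
  print(n_codepoints(stringi::stri_trans_nfc(p_hat)))
}
```

Any approach that places one glyph per code point will therefore still split p̂ after normalization.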

@teunbrand
Collaborator

Hello there,

Thanks for bringing this to our attention. Indeed this does look unexpected.
I'm getting varying results with this label in different graphics devices, with some refusing to render some accents at all. I highly doubt that aspect of the issue can be fixed here. See the examples below (I added plain text to your example for reference):

ragg:
image
Cairo PNG:
image
Windows device:
image

As far as I can tell, the ragg device does the correct thing in the plain text. I think the issue might stem from a decoupling of the accents from the glyph to which they should be applied. Because they are registered as having zero width, the angles are incorrectly inferred by projecting the xmin and xmax positions, which is why they remain unrotated. I'm by no means an expert on unicode text representation, so I'm left wondering whether there is a reliable way of knowing glyph-accent relationships.
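
A toy illustration of why a zero-width glyph would end up unrotated under that kind of projection. This is a sketch under stated assumptions, not the package's actual code: `path_point()` and `glyph_angle()` are hypothetical names, and the "path" is just a straight line rising at 45 degrees.

```r
# Hypothetical path: position along a straight 45-degree line at distance d
path_point <- function(d) c(x = d, y = d)

# Hypothetical angle inference: project the glyph's left and right edges
# onto the path and take the angle of the segment between them (degrees)
glyph_angle <- function(xmin, xmax) {
  p0 <- path_point(xmin)
  p1 <- path_point(xmax)
  atan2(p1[["y"]] - p0[["y"]], p1[["x"]] - p0[["x"]]) * 180 / pi
}

wide_glyph <- glyph_angle(1, 2)  # edges differ, so the path slope is recovered
zero_width <- glyph_angle(2, 2)  # edges coincide, so atan2(0, 0) gives 0
print(c(wide_glyph = wide_glyph, zero_width = zero_width))
```

With coinciding edges there is no segment to measure, so the angle collapses to zero regardless of the path's direction.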

@byteit101
Author

I'm by no means an expert on unicode text representation, so I'm left wondering whether there is a reliable way of knowing glyph-accent relationships.

There is! The Unicode Character Database, specifically the Canonical_Combining_Class field of UnicodeData.txt, has this information.

via the Unicode R package:

> install.packages("Unicode")
> library(Unicode)
> u_char_properties(utf8ToInt("Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"), "Canonical_Combining_Class")
   Canonical_Combining_Class
1                          0
2                          0
3                          0
4                          0
5                          0
6                          0
7                          0
8                          0
9                          0
10                         0
11                         0
12                         0
13                         0
14                         0
15                         0
16                         0
17                         0
18                         0
19                         0
20                       230
21                         0
22                         0
23                       220
24                         0
25                       220
> 

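The 0's are non-combining characters (normal letters, spaces, and punctuation); any non-zero value marks a combining character. Details here: https://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values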

Alternatively, one could check the General_Category, as all characters other than those of General_Category Mn or Mc are guaranteed to have Canonical_Combining_Class=0:

u_char_properties(utf8ToInt("Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa #"), "General_Category")

I'm unsure if there is another R package that is more high level, but that is the base data that it would use too.
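
For comparison, a rough base-R sketch of the same General_Category check using PCRE's Unicode property classes instead of the table lookup. This assumes R's PCRE build has Unicode property support (`pcre_config()` reports whether it does); it is an illustration, not something tested against the package:

```r
label <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"

# Split into individual characters (code points)
chars <- strsplit(label, "")[[1]]

# TRUE where the character has General_Category Mn (nonspacing mark)
# or Mc (spacing combining mark)
is_mark <- grepl("^[\\p{Mn}\\p{Mc}]$", chars, perl = TRUE)

# Positions of the marks within the label
print(which(is_mark))
```

The reported positions can be compared against the rows with a non-zero Canonical_Combining_Class in the table above.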

@byteit101
Author

Oh and for reference my screenshot was taken on X11, and the plain text looks like ragg/the correct rendering

@teunbrand
Collaborator

One possible, cheap solution that I could find is to put zero-width glyphs together with the preceding glyph. I don't know whether that has any drawbacks at the moment.

Because a space precedes the first \u05aa character, it gets rendered awkwardly. This would still happen if we did a more fancy thing with Unicode::u_char_properties() though.

image
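
The idea can be sketched roughly as follows. `merge_zero_width()` is a hypothetical helper written for this illustration (not the code in the package), and the widths are hard-coded stand-ins for whatever the text-shaping step reports:

```r
# Hypothetical helper: glue each zero-width glyph onto the glyph before it
merge_zero_width <- function(glyphs, widths) {
  # Every glyph with positive width starts a new group;
  # zero-width glyphs inherit the group of the preceding glyph
  group <- cumsum(widths > 0)
  unname(vapply(split(glyphs, group), paste0, character(1), collapse = ""))
}

glyphs <- c("e", "\u0302", ",", " ", "\u05aa", "\u05d0", "\u05aa")
widths <- c(1, 0, 1, 1, 0, 1, 0)  # illustrative values, not measured

merged <- merge_zero_width(glyphs, widths)
print(length(merged))
```

Note that the first U+05AA ends up grouped with the space in front of it, which is the awkward case described above.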

@byteit101
Author

Yes, the extra U+05AA was an error; that was how I discovered the RTL issues :-)

That approach may have some drawbacks. This table (ignore the tailored clusters) has some nice examples, particularly Hangul (각): https://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters

There are some libraries for this algorithm, and for R integration I am currently investigating utf8proc, or maybe something in the style of https://github.com/foliojs/grapheme-breaker for a pure R implementation.
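
As a stopgap while evaluating those, base R may already get close: PCRE's `\X` matches one extended grapheme cluster. The sketch below swaps that in for a dedicated library, assumes a reasonably recent PCRE build behind R's `perl = TRUE` regexes, and uses a made-up helper name, `grapheme_clusters()`:

```r
# Hypothetical helper: split one string into extended grapheme clusters via \X
grapheme_clusters <- function(x) {
  regmatches(x, gregexpr("\\X", x, perl = TRUE))[[1]]
}

# e + U+0302, p + U+0302, then Hangul jamo L + V + T (decomposed form of U+AC01)
x <- "e\u0302p\u0302\u1100\u1161\u11a8"
clusters <- grapheme_clusters(x)

# Number of clusters, and how many code points each one contains
cluster_sizes <- vapply(clusters, function(cl) length(utf8ToInt(cl)),
                        integer(1), USE.NAMES = FALSE)
print(length(clusters))
print(cluster_sizes)
```

Unlike the zero-width heuristic, this keeps the Hangul jamo sequence together even though none of its code points is a combining mark.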

@teunbrand
Collaborator

This should now also be fixed, so I'll be closing this.
