[Unicode] Unicode combining characters not kept with base character #31

byteit101 · 2021-12-08T02:40:59Z

Code

library(ggplot2)
library(geomtextpath)

ggplot(iris, aes(x = Sepal.Length)) +
  geom_textpath(aes(label = "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"), stat = "density", vjust = -0.2, size = 8, fontface = 1) +
  ylim(0.1, 0.5)

Expected

Text should show "Composed: ê, DeC: ê, ֪א֪" on a line

Actual

Explanation

The first "exciting" character is the precomposed form U+00EA, which works just fine and is how I'd expect it to appear. The next one, after DeC[omposed], is a plain "e" followed by the combining circumflex U+0302. The circumflex is neither rotated nor positioned correctly; it appears inside the e instead. For this specific case, normalizing to NFC or NFKC would help, but that only sidesteps the problem, since many combinations (for example p̂) have no single-codepoint precomposed form. The Hebrew Alef U+05D0 is combined with Hebrew accents U+05AA, which should sit below the letter, but they are shown in "naked" form because they are rendered separately.
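
The difference between the two spellings of ê is visible at the code point level. A minimal base R sketch follows; the variable names are illustrative and not part of the original report.

```r
# Precomposed form: a single code point, U+00EA.
composed <- "\u00ea"
# Decomposed form: "e" followed by the combining circumflex U+0302.
decomposed <- "e\u0302"

cp_composed <- utf8ToInt(composed)      # 234
cp_decomposed <- utf8ToInt(decomposed)  # 101 770

# p + U+0302 has no precomposed equivalent, so it is always two code points.
cp_p_hat <- utf8ToInt("p\u0302")        # 112 770
```

Anything that lays out text one code point at a time will therefore treat the decomposed accent as a separate glyph.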

teunbrand · 2021-12-08T08:14:12Z

Hello there,

Thanks for bringing this to our attention. Indeed this does look unexpected.
I'm getting varying results with this label in different graphics devices, with some refusing to render some accents at all. I highly doubt that aspect of the issue can be fixed here. See the examples below (I added plain text to your example for reference).

ragg:

Cairo PNG:

Windows device:

As far as I can tell, the ragg device does the correct thing in the plain text. I think the issue might stem from a decoupling of the accents from the glyph to which they should be applied. Because they are registered as having zero width, the angles are incorrectly inferred by projecting the xmin and xmax positions, which is why they remain unrotated. I'm by no means an expert on Unicode text representation, so I'm left wondering whether there is a reliable way of knowing glyph-accent relationships.

byteit101 · 2021-12-08T09:02:17Z

> I'm by no means an expert on unicode text representation, so I'm left wondering whether there is a reliable way of knowing glyph-accent relationships.

There is! The Unicode tables, specifically the Canonical_Combining_Class field in UnicodeData.txt, have this information.

Via the Unicode R package:

> install.packages("Unicode")
> library(Unicode)
> u_char_properties(utf8ToInt("Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"), "Canonical_Combining_Class")
   Canonical_Combining_Class
1                          0
2                          0
3                          0
4                          0
5                          0
6                          0
7                          0
8                          0
9                          0
10                         0
11                         0
12                         0
13                         0
14                         0
15                         0
16                         0
17                         0
18                         0
19                         0
20                       230
21                         0
22                         0
23                       220
24                         0
25                       220
>

The 0s are ordinary characters (strictly, class 0 also covers some marks that are never reordered); the non-zero values are combining marks. Details here: https://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values

Alternatively, one could check the General_Category, since "All characters other than those of General_Category Mn or Mc are guaranteed to have Canonical_Combining_Class=0":

u_char_properties(utf8ToInt("Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa #"), "General_Category")

I'm unsure if there is another R package that is more high level, but that is the base data that it would use too.
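
For a quick check without an extra package, base R's Perl-compatible regular expressions can test the mark categories directly. This is only a sketch: it assumes R's PCRE build supports Unicode properties, and the names `label`, `chars` and `is_mark` are illustrative.

```r
label <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"

# Split the label into one string per code point.
chars <- intToUtf8(utf8ToInt(label), multiple = TRUE)

# \p{M} matches any General_Category M* character (Mn, Mc or Me).
is_mark <- grepl("^\\p{M}$", chars, perl = TRUE)

# Positions of the combining marks within the label.
mark_positions <- which(is_mark)
```

For this label, `mark_positions` should be 20, 23 and 25, the same rows that carry non-zero combining classes in the table above.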

byteit101 · 2021-12-08T09:09:42Z

Oh and for reference my screenshot was taken on X11, and the plain text looks like ragg/the correct rendering

teunbrand · 2021-12-08T13:20:53Z

One possible, cheap solution that I could find is to put zero-width glyphs together with the preceding glyph. I don't know whether that has any drawbacks at the moment.

Because a space precedes the first \u05aa character, it gets rendered awkwardly. This would still happen if we did something fancier with Unicode::u_char_properties() though.
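
A rough illustration of that idea in base R follows. It uses the Unicode mark category as a stand-in for "zero width", which is an assumption; the actual fix may rely on measured glyph widths instead, and all names here are hypothetical.

```r
label <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"
chars <- intToUtf8(utf8ToInt(label), multiple = TRUE)

# Assumption: combining marks (General_Category M*) are the zero-width glyphs.
is_mark <- grepl("^\\p{M}$", chars, perl = TRUE)

# Every non-mark starts a new group; each mark joins the group before it.
group <- cumsum(!is_mark)
clusters <- unname(
  vapply(split(chars, group), paste, character(1), collapse = "")
)
```

Here the 25 code points collapse into 22 clusters. The cluster holding the first \u05aa also contains the preceding space, which is the awkward case described above.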

byteit101 · 2021-12-08T20:06:12Z

Yes, the extra U+05AA was an error; that was how I discovered the RTL issues :-)

That approach may have some drawbacks. This table (ignore the tailored clusters) has some nice examples, particularly Hangul (각): https://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters

There are some libraries for this algorithm, and for R integration I am currently investigating utf8proc or maybe something in the style of https://github.com/foliojs/grapheme-breaker for a pure R implementation.
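
The Hangul case shows why attaching marks to the preceding glyph is not enough. The sketch below assumes that the PCRE engine behind `perl = TRUE` implements `\X` as an extended grapheme cluster match; it is neither utf8proc nor a port of grapheme-breaker, and the variable names are illustrative.

```r
# Decomposed Hangul syllable: three conjoining jamo, none of them marks.
hangul <- "\u1100\u1161\u11a8"
chars <- intToUtf8(utf8ToInt(hangul), multiple = TRUE)

# A mark-based check finds nothing to attach, so it would leave three pieces.
n_marks <- sum(grepl("^\\p{M}$", chars, perl = TRUE))

# \X consumes one whole grapheme cluster per match.
graphemes <- regmatches(hangul, gregexpr("\\X", hangul, perl = TRUE))[[1]]
n_graphemes <- length(graphemes)
```

If the assumption holds, `n_marks` is 0 while `n_graphemes` is 1: the three jamo form one user-perceived character even though no combining mark is involved.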

teunbrand · 2021-12-09T21:08:52Z

This should now also be fixed, so I'll be closing this.

teunbrand mentioned this issue Dec 8, 2021

[Unicode] RTL scripts are rendered backwards #32

Closed

teunbrand added a commit to teunbrand/geomtextpath that referenced this issue Dec 8, 2021

Hack composite glyphs (AllanCameron#31)

a54e001

teunbrand mentioned this issue Dec 8, 2021

Switch to textshaping #33

Merged

teunbrand closed this as completed Dec 9, 2021
